**********************************
phinde - generic web search engine
**********************************

Self-hosted search engine you can use for your static blog or about
any other website you want search functionality for.

My live instance is at http://search.cweiske.de/ and indexes my
website, blog and all linked URLs.

========
Features
========

- Crawler and indexer with the ability to run many in parallel
- Shows and highlights text that contains search words
- Boolean search queries:

  - ``foo bar`` searches for ``foo AND bar``
  - ``foo OR bar``
  - ``title:foo`` searches for ``foo`` only in the page title

- Facets for tag, domain, language and type
- Date search:

  - ``before:2016-08-30`` - modified before that day
  - ``after:2016-08-30`` - modified after that day
  - ``date:2016-08-30`` - exact modification day match

- Site search

  - Query: ``foo bar site:example.org/dir/``
  - or use the ``site`` GET parameter:
    ``/?q=foo&site=example.org/dir``

- OpenSearch support with HTML and Atom result lists
- Instant indexing with WebSub (formerly PubSubHubbub)

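
The query operators and the ``site`` GET parameter listed above can be
combined into a search URL from a script.
The following is a minimal sketch and not part of phinde: the helper names
``urlencode`` and ``search_url`` and the ``PHINDE_URL`` variable are made up
for illustration, the host is only a placeholder, and the encoder assumes
ASCII input.

```shell
# Base URL of the phinde instance (placeholder host, adjust to your setup).
PHINDE_URL="${PHINDE_URL:-http://phinde.example.org}"

# Percent-encode a string for use as a query value.
# Assumption: ASCII input only; "/" is left as-is for readability.
urlencode() {
    s=$1
    out=
    while [ -n "$s" ]; do
        rest=${s#?}          # everything after the first character
        c=${s%"$rest"}       # the first character itself
        case "$c" in
            [A-Za-z0-9._~/-]) out="$out$c" ;;
            *) out="$out$(printf '%%%02X' "'$c")" ;;
        esac
        s=$rest
    done
    printf '%s' "$out"
}

# Build a search URL: $1 is the query, $2 an optional site restriction.
search_url() {
    url="$PHINDE_URL/?q=$(urlencode "$1")"
    if [ -n "${2:-}" ]; then
        url="$url&site=$(urlencode "$2")"
    fi
    printf '%s\n' "$url"
}
```

For example, ``search_url 'title:foo' 'example.org/dir/'`` prints
``http://phinde.example.org/?q=title%3Afoo&site=example.org/dir/``.
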
============
Dependencies
============

- PHP 8.x
- Elasticsearch 2.0
- MySQL or MariaDB for WebSub subscriptions
- Gearman (Debian 9: ``gearman-job-server``, not ``gearman-server``)
- ``gearadmin`` command line tool (``gearman-tools`` package)
- PHP Gearman extension
- Some PHP libraries that get installed with composer

=====
Setup
=====

#. Install and run Elasticsearch and Gearman
#. Install ``php-gearman`` and ``gearman-tools``
#. Get a local copy of the code::

     $ git clone https://git.cweiske.de/phinde.git phinde

#. Install dependencies via composer::

     $ composer install --no-dev

#. Point your webserver's document root to phinde's ``www`` directory
#. Copy ``data/config.php.dist`` to ``data/config.php`` and adjust it.
   Make sure you add your domain to the crawl whitelist.
#. Create a MySQL database and import the schema from ``data/schema.sql``
#. Run ``bin/setup.php``, which sets up the Elasticsearch schema
#. Put your homepage into the queue::

     $ ./bin/process.php http://example.org/

#. Start at least one worker to process the crawl+index queue::

     $ ./bin/phinde-worker.php

#. Check phinde's status page in your browser.
   Both the number of open tasks and the number of workers should be
   greater than 0.

Re-index when your site changes
===============================

When your site changes, the search engine needs to re-crawl and re-index
the pages.

Simply tell phinde that something changed by running::

    $ ./bin/process.php http://example.org/foo.htm

phinde supports HTML pages and Atom feeds, so if your blog has a feed
it's enough to let phinde reindex that one.
It will find all linked pages automatically.

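
If several pages changed at once, the command above can be wrapped in a loop.
This is a minimal sketch, assuming a text file with one URL per line;
the function name ``queue_urls`` and the ``PHINDE_PROCESS`` variable are
illustrative and not part of phinde.

```shell
# Path to phinde's queueing script, relative to the phinde checkout.
PHINDE_PROCESS="${PHINDE_PROCESS:-./bin/process.php}"

# Read URLs from standard input, one per line, and queue each of them.
queue_urls() {
    while IFS= read -r url; do
        # Skip blank lines and lines starting with "#".
        case "$url" in
            ''|'#'*) continue ;;
        esac
        # Detach stdin so the command cannot consume the remaining URLs.
        "$PHINDE_PROCESS" "$url" </dev/null \
            || echo "failed to queue: $url" >&2
    done
}
```

Run it from the phinde directory as ``queue_urls < changed-urls.txt``,
where ``changed-urls.txt`` is a hypothetical file name.
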
Website integration
===================

Adding a simple search form to your website is easy.
It needs two things:

- A ``<form>`` tag with an action that points to the phinde instance
- A search text field named ``q``

Example::

    <form method="get" action="http://phinde.example.org">
      <input type="text" name="q" placeholder="Search text"/>
      <button type="submit">Search</button>
    </form>

System service
==============

When using systemd, you can let it run multiple worker instances when
the system boots up:

#. Copy files ``data/systemd/phinde*.service`` into ``/etc/systemd/system/``
#. Adjust user and group names, and the work directories
#. Enable three worker processes::

     $ systemctl daemon-reload
     $ systemctl enable phinde@1
     $ systemctl enable phinde@2
     $ systemctl enable phinde@3
     $ systemctl enable phinde
     $ systemctl start phinde

#. Now three workers are running. Restarting the ``phinde`` service also
   restarts the workers.

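
To run a different number of workers, the per-instance ``enable`` calls can
be generated in a loop.
A minimal sketch: the function name ``enable_workers`` and the ``SYSTEMCTL``
variable are illustrative and not part of phinde. It only covers the
``phinde@N`` instances, so ``daemon-reload`` and enabling and starting the
``phinde`` service itself are still needed as shown in the steps above.

```shell
# Command used to talk to systemd; overridable for a dry run.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Enable worker instances phinde@1 .. phinde@N, where N is $1.
enable_workers() {
    count=$1
    i=1
    while [ "$i" -le "$count" ]; do
        "$SYSTEMCTL" enable "phinde@$i" || return 1
        i=$((i + 1))
    done
}
```

Setting ``SYSTEMCTL=echo`` before calling ``enable_workers 5`` prints the
commands instead of running them.
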
Cron job
========

Run ``bin/renew-subscriptions.php`` once a day with cron to renew the
WebSub subscriptions.

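
A crontab entry for the daily run could look like the line printed by this
sketch.
The install path ``/var/www/phinde``, the time of day (03:17) and the names
``PHINDE_DIR`` and ``cron_line`` are assumptions for illustration; adjust
them to your installation.

```shell
# Directory of the phinde checkout (assumed path, adjust as needed).
PHINDE_DIR="${PHINDE_DIR:-/var/www/phinde}"

# Print a crontab line that runs the renewal script daily at 03:17.
cron_line() {
    printf '17 3 * * * php %s/bin/renew-subscriptions.php\n' "$PHINDE_DIR"
}
```

The printed line can be added with ``crontab -e`` for the user that runs
phinde.
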
=====
Howto
=====

Delete index data from one domain::

    $ curl -iv -XDELETE -H 'Content-Type: application/json' -d '{"query":{"term":{"domain":"example.org"}}}' http://127.0.0.1:9200/phinde/_query

This uses the delete-by-query plugin of Elasticsearch 2.0, see
https://www.elastic.co/guide/en/elasticsearch/plugins/2.0/delete-by-query-usage.html

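
To reuse the delete command for other domains, the JSON body can be built
from a parameter.
A minimal sketch, assuming the index name ``phinde`` and the Elasticsearch
address from the command above; the names ``ES_URL``, ``delete_query_body``
and ``delete_domain`` are illustrative, and the domain is inserted into the
JSON without escaping, so it must not contain quotes or backslashes.

```shell
# Elasticsearch base URL, taken from the example above.
ES_URL="${ES_URL:-http://127.0.0.1:9200}"

# Print the delete-by-query body that matches one domain ($1).
delete_query_body() {
    printf '{"query":{"term":{"domain":"%s"}}}' "$1"
}

# Send the delete-by-query request for one domain ($1).
delete_domain() {
    curl -sS -XDELETE -H 'Content-Type: application/json' \
        -d "$(delete_query_body "$1")" "$ES_URL/phinde/_query"
}
```

Calling ``delete_domain example.org`` sends the same request as the
``curl`` command above, without the verbose output.
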
Subscribe to a website/feed
===========================

Phinde supports WebSub__ to subscribe to changes of a website.
When phinde gets notified by the website's hub about changes,
it will immediately crawl and index the changed pages.

Subscribe to a website's feed::

    $ php bin/subscribe.php http://example.org/feed.atom

Phinde will determine the website's hub and send a registration request to it.
The status page shows the number of working subscriptions and the number
of open ones.

Unsubscribing also happens on the command line::

    $ php bin/unsubscribe.php http://example.org/feed.atom

__ https://www.w3.org/TR/websub/

============
About phinde
============
Source code
===========

phinde's source code is available from http://git.cweiske.de/phinde.git
or the `mirror on GitHub`__.

__ https://github.com/cweiske/phinde

License
=======

phinde is licensed under the `AGPL v3 or later`__.

__ http://www.gnu.org/licenses/agpl.html

Author
======

phinde was written by `Christian Weiske`__.

__ http://cweiske.de/