X-Git-Url: https://git.cweiske.de/phinde.git/blobdiff_plain/59f931647a2b4a13be20ba8f2baa4ec93e334ee5..d35cf6a284f57392ef33703ded46174cc48b6bf5:/README.rst

diff --git a/README.rst b/README.rst
index e34cd15..e52581d 100644
--- a/README.rst
+++ b/README.rst
@@ -30,18 +30,141 @@ Features
   - or use the ``site`` GET parameter:
     ``/?q=foo&site=example.org/dir``
 - OpenSearch support with HTML and Atom result lists
+- Instant indexing with WebSub (formerly PubSubHubbub)
 
 
 ============
 Dependencies
 ============
 - PHP 5.5+
-- elasticsearch 2.0
-- gearman
+- Elasticsearch 2.0
+- MySQL or MariaDB for WebSub subscriptions
+- Gearman (Debian 9: ``gearman-job-server``, not ``gearman-server``)
+- PHP Gearman extension
 - Console_CommandLine
 - Net_URL2
+- Twig 1.x
 
 
+=====
+Setup
+=====
+#. Install and run Elasticsearch and Gearman
+#. Install ``php-gearman``
+#. Get a local copy of the code::
+
+     $ git clone https://git.cweiske.de/phinde.git phinde
+
+#. Install dependencies via composer::
+
+     $ composer install
+
+#. Point your webserver's document root to phinde's ``www`` directory
+#. Copy ``data/config.php.dist`` to ``data/config.php`` and adjust it.
+   Make sure you add your domain to the crawl whitelist.
+#. Create a MySQL database and import the schema from ``data/schema.sql``
+#. Run ``bin/setup.php`` which sets up the Elasticsearch schema
+#. Put your homepage into the queue::
+
+     $ ./bin/process.php http://example.org/
+
+#. Start at least one worker to process the crawl+index queue::
+
+     $ ./bin/phinde-worker.php
+
+#. Check phinde's status page in your browser.
+   The number of open tasks should be > 0, the number of workers also.
+
+
+Re-index when your site changes
+===============================
+When your site changes, the search engine needs to re-crawl and re-index
+the pages.
+
+Simply tell phinde that something changed by running::
+
+    $ ./bin/process.php http://example.org/foo.htm
+
+phinde supports HTML pages and Atom feeds, so if your blog has a feed
+it's enough to let phinde reindex that one.
+It will find all linked pages automatically.
+
+
+Website integration
+===================
+Adding a simple search form to your website is easy.
+It needs two things:
+
+- ``<form>`` tag with an action that points to the phinde instance
+- Search text field with name of ``q``.
+
+Example::
+
+
+
+
+
+
+
+System service
+==============
+When using systemd, you can let it run multiple worker instances when
+the system boots up:
+
+#. Copy files ``data/systemd/phinde*.service`` into ``/etc/systemd/system/``
+#. Adjust user and group names, and the work directories
+#. Enable three worker processes::
+
+     $ systemctl daemon-reload
+     $ systemctl enable phinde@1
+     $ systemctl enable phinde@2
+     $ systemctl enable phinde@3
+     $ systemctl enable phinde
+     $ systemctl start phinde
+#. Now three workers are running. Restarting the ``phinde`` service also
+   restarts the workers.
+
+
+
+Cron job
+========
+Run ``bin/renew-subscriptions.php`` once a day with cron.
+It will renew the WebSub subscriptions.
+
+
+=====
+Howto
+=====
+
+Delete index data from one domain::
+
+    $ curl -iv -XDELETE -H 'Content-Type: application/json' -d '{"query":{"term":{"domain":"example.org"}}}' http://127.0.0.1:9200/phinde/_query
+
+That's delete-by-query 2.0, see
+https://www.elastic.co/guide/en/elasticsearch/plugins/2.0/delete-by-query-usage.html
+
+
+Subscribe to a website/feed
+===========================
+Phinde supports WebSub__ to subscribe to changes of a website.
+When phinde gets notified by the website's hub about changes,
+it will immediately crawl and index the changed pages.
+
+Subscribe to a website's feed::
+
+    $ php bin/subscribe.php http://example.org/feed.atom
+
+Phinde will determine the website's hub and send a registration request to it.
+
+The status page will show the number of working, and the number of open
+subscriptions.
+
+Unsubscribing also happens on the command line::
+
+    $ php bin/unsubscribe.php http://example.org/feed.atom
+
+__ https://www.w3.org/TR/websub/
+
 ============
 About phinde
 ============