HOWTO Run Your Invenio Installation

Overview

This HOWTO guide gives you ideas on how to run your CDS Invenio installation and how to take care of its normal day-to-day operation.

Setting up periodical daemon tasks

Many tasks that manipulate the bibliographic record database can be set to run periodically. For example, we want the indexing engine to scan periodically for newly arrived documents so that they are indexed as soon as they enter the system. It is the role of the BibSched system to take care of task scheduling and task execution.

Periodical tasks (such as regular metadata indexing) as well as one-time tasks (such as a batch upload of a newly acquired metadata file) are not executed straight away but are stored in the BibSched task queue. The BibSched daemon periodically looks at the queue and launches tasks according to their order or their scheduled runtime. You can consider BibSched to be a kind of cron daemon for bibliographic tasks.

This means that after having installed Invenio you will want to have the BibSched daemon running permanently. To launch BibSched daemon, do:

$ bibsched start

To set up the indexing, ranking, sorting, formatting, and collection cache updating daemons to run periodically with a sleeping period of, say, 5 minutes, launch:

$ bibindex -f50000 -s5m
$ bibreformat -oHB -s5m
$ webcoll -v0 -s5m
$ bibrank -f50000 -s5m
$ bibsort -s5m

Note that if you are using the virtual index facility, for example for the global index, then you should schedule those indexes separately:

$ bibindex -w global -f50000 -s5m

It is imperative to have the above tasks run permanently in your BibSched queue so that newly submitted documents will be processed automatically.
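If you prefer to keep these submissions in a site-specific bootstrap script, a minimal sketch could look like the following. The script itself is an illustration, not part of Invenio; the flags are exactly those shown above, and the hypothetical DRY_RUN switch lets you review the schedule before actually submitting the tasks:

```shell
#!/bin/sh
# Sketch of a one-time bootstrap script submitting the periodical
# BibSched tasks listed above.  With DRY_RUN=1 (the default here)
# the commands are only printed, not executed.

DRY_RUN=${DRY_RUN:-1}
SLEEP="5m"   # common sleeping period for the daemons

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

run bibindex -f50000 -s$SLEEP
run bibindex -w global -f50000 -s$SLEEP
run bibreformat -oHB -s$SLEEP
run webcoll -v0 -s$SLEEP
run bibrank -f50000 -s$SLEEP
run bibsort -s$SLEEP
```

Running it once with DRY_RUN=0 on a configured installation would submit the periodical tasks to the BibSched queue.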

You may also want to set up some periodical housekeeping tasks:

$ bibrank -f50000 -R -wwrd -s14d -LSunday
$ bibsort -R -s7d -L 'Sunday 01:00-05:00'
$ inveniogc -a -s7d -L 'Sunday 01:00-05:00'
$ dbdump -s20h -L '22:00-06:00' -o/opt2/mysql-backups -n10

Please consult the sections below for more details about these housekeeping tasks.

You can also set up the batch uploader daemon to run periodically, looking for new documents or metadata files to upload:

$ batchuploader --documents -s20m
$ batchuploader --metadata -s20m

Additionally you might want to automatically generate sitemap.xml files for your installation. For this just schedule:

$ bibexport -w sitemap -s1d

You will then need to add a line such as:

Sitemap: https://cds.cern.ch/sitemap-index.gz

to your robots.txt file.

If you are using the WebLinkback module, you may want to run some of the following tasklets:

sudo -u www-data /opt/invenio/bin/bibtasklet \
     -N weblinkbackupdaterdeleteurlsonblacklist \
     -T bst_weblinkback_updater \
     -a "mode=1" \
     -u admin -s1d -L '22:00-05:00'

sudo -u www-data /opt/invenio/bin/bibtasklet \
     -N weblinkbackupdaternewpages \
     -T bst_weblinkback_updater \
     -a "mode=2" \
     -u admin -s1d -L '22:00-05:00'

sudo -u www-data /opt/invenio/bin/bibtasklet \
     -N weblinkbackupdateroldpages \
     -T bst_weblinkback_updater \
     -a "mode=3" \
     -u admin -s7d -L '22:00-05:00'

sudo -u www-data /opt/invenio/bin/bibtasklet \
     -N weblinkbackupdatermanuallysettitles \
     -T bst_weblinkback_updater \
     -a "mode=4" \
     -u admin -s7d -L '22:00-05:00'

sudo -u www-data /opt/invenio/bin/bibtasklet \
     -N weblinkbackupdaterdetectbrokenlinkbacks \
     -T bst_weblinkback_updater \
     -a "mode=5" \
     -u admin -s7d -L 'Sunday 01:00-05:00'

sudo -u www-data /opt/invenio/bin/bibtasklet \
     -N weblinkbacknotifications \
     -T bst_weblinkback_updater \
     -a "mode=6" \
     -u admin -s1d

Monitoring periodical daemon tasks

Note that the BibSched automated daemon stops as soon as one of its tasks ends with an error. You will be informed by email about such incidents. Nevertheless, it is a good idea to inspect the BibSched queue from time to time anyway, say several times per day, to see what is going on. This can be done by running the BibSched command-line monitor:

$ bibsched

The monitor lets you stop and start the automated mode, delete wrongly submitted tasks, run some of the tasks manually, and so on. Note also that the BibSched daemon writes log and error files about its own operation, as well as about the operation of its tasks, to the /opt/invenio/var/log directory.
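Besides inspecting the monitor interactively, you may want a small helper that flags non-empty error files under the log directory, for example from a cron job. The following is a sketch; the helper name and the idea of scanning *.err files are assumptions to adapt to your setup:

```shell
#!/bin/sh
# Sketch: report every non-empty *.err file in the BibSched log
# directory.  LOGDIR defaults to the standard location mentioned
# above; override it for non-standard installations or testing.

LOGDIR=${LOGDIR:-/opt/invenio/var/log}

report_errors() {
    # Print the name of each non-empty *.err file in directory $1.
    for f in "$1"/*.err; do
        [ -s "$f" ] && echo "non-empty error log: $f"
    done
    return 0
}

report_errors "$LOGDIR"
```

Piping its output into a mail command from cron would give you a daily summary of failed tasks.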

Running alert engine

Invenio users may set up automatic email notification alerts so that they are alerted about documents of interest either daily, weekly, or monthly. It is the job of the alert engine to do this. The alert engine has to be run every day:

$ alertengine

You may want to set up an external cron job to call alertengine each day, for example:

$ crontab -l
# invenio: periodically restart Apache:
59 23 * * * /usr/sbin/apachectl restart
# invenio: run alert engine:
30 14 * * * /usr/bin/sudo -u apache /opt/invenio/bin/alertengine

Housekeeping task details

Housekeeping: recalculating ranking weights

When you are adding new records to the system, the word frequency ranking weights of old records are not recalculated by default, in order to speed up the insertion of new records. This may slightly affect the precision of word similarity searches. It is therefore advised to run bibrank in recalculation mode expressly once in a while, during a relatively quiet site operation day:

$ bibrank -R -w wrd -s 14d -L Sunday

You may want to do this either (i) periodically, say once per month (see the previous section), or (ii) depending on the frequency of new additions to the record database, say whenever its size grows by 2-3 percent.

Housekeeping: recalculating sorting weights

It is advised to rebalance the sorting buckets from time to time. To speed up the insertion of new records, the sorting buckets are not recalculated on the fly; instead, new records are appended to the end of the corresponding bucket. This may create differences in bucket sizes, which can have a small impact on sorting speed.

$ bibsort -R -s 7d -L 'Sunday 01:00-05:00'

The rebalancing might be run weekly or even daily.

Housekeeping: cleaning up the garbage

The tool inveniogc provides a garbage collector for the database, temporary files, and the like.

If you choose to differentiate between guest users (see CFG_WEBSESSION_DIFFERENTIATE_BETWEEN_GUESTS in invenio.conf), then guest users can create many entries in the Invenio tables related to their web sessions, search history, personal baskets, etc. This data has to be garbage-collected periodically. You can run this, say every Sunday between 01:00 and 05:00, via:

$ inveniogc -s 7d -L 'Sunday 01:00-05:00'

Various temporary log and error files are created in the /opt/invenio/var/log and /opt/invenio/var/tmp directories; it is good to clean these up from time to time. The previous command can be used to clean those files too, via:

$ inveniogc -s 7d -d -L 'Sunday 01:00-05:00'

The inveniogc tool can run other cleaning actions; please refer to its help (inveniogc --help) for more details.

Note that in the section "Setting up periodical daemon tasks" above, we set up inveniogc with the argument -a, meaning that it will run all possible cleaning actions. Please modify this if it is not what you want.

Housekeeping: backing up the database

You can launch a BibSched task called dbdump in order to take regular snapshots of your database content into SQL dump files. For example, to back up the database content into the /opt2/mysql-backups directory every night, keeping at most the 10 latest copies of the backup file, you would launch:

$ dbdump -s 20h -L '22:00-06:00' -o /opt2/mysql-backups -n 10

This will create files named like invenio-dbdump-2009-03-10_22:10:28.sql in that directory.
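The -n flag already rotates old copies for you; should you ever need to prune a backup directory by hand, a sketch along the following lines would keep only the N newest dump files. The prune_dumps helper is hypothetical, not part of Invenio, and relies only on standard tools:

```shell
#!/bin/sh
# Sketch: keep only the N newest invenio-dbdump-*.sql files in a
# directory and delete the older ones, similar in spirit to dbdump -n.

prune_dumps() {
    dir=$1
    keep=$2
    # List dump files newest-first, skip the first $keep, remove the rest.
    ls -1t "$dir"/invenio-dbdump-*.sql 2>/dev/null \
        | tail -n +$((keep + 1)) \
        | while read -r f; do rm -- "$f"; done
    return 0
}
```

For example, `prune_dumps /opt2/mysql-backups 10` would mirror the retention policy of the dbdump invocation above.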

Note that you may use Invenio-independent MySQL backup tools such as mysqldump, but these generally lock all tables during the backup for the sake of consistency, so your site may become inaccessible during the backup because the user session table is locked as well. The dbdump tool does not lock all tables, therefore the site remains accessible to users while the dump files are being created. The dump files are nevertheless consistent with respect to the data, since dbdump runs via BibSched, which does not allow any other important bibliographic task to run during the backup.

To load a dump file produced by dbdump into a running Invenio instance later, you can use:

$ bibsched stop
$ cat /opt2/mysql-backups/invenio-dbdump-2009-03-10_22\:10\:28.sql | dbexec
$ bibsched start