WARNING: BIBINDEX ADMIN GUIDE IS UNDER DEVELOPMENT
The BibIndex Admin Guide is not yet complete. Most admin-level functionality for BibIndex exists only in command-line mode. We are developing both this guide and the web admin interface. If you would like to see specific features implemented with high priority, please contact us at cds.support@cern.ch. Thanks for your interest!
To define a new index, you must first give the index an internal name. An empty index is then created by preparing the database tables.
Before the index can be used for searching, the fields that should be included in the index must be selected.
To fill the index based on the selected fields, schedule the update by running bibindex -w indexname together with any other desired parameters.
How the words are broken up defines what is added to the index. Should only "director-general" be added, or should "director", "general" and "director-general" be added? Depending on the word-breaking policy, the index can vary between 300 000 and 3 000 000 terms.
Word breaking can be configured by changing CFG_BIBINDEX_CHARS_ALPHANUMERIC_SEPARATORS and CFG_BIBINDEX_CHARS_PUNCTUATION in the general config file.
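The effect of the word-breaking policy can be illustrated with a minimal sketch. It is not BibIndex's actual tokenizer: the SEPARATORS pattern and the break_words function are hypothetical names chosen for illustration, and the pattern only stands in for the kind of value the separator settings might hold.

```python
import re

# Hypothetical separator pattern, standing in for the kind of value
# CFG_BIBINDEX_CHARS_ALPHANUMERIC_SEPARATORS might hold.
SEPARATORS = re.compile(r"[-/_]")


def break_words(text, keep_compound=True):
    """Return the sorted set of index terms for a whitespace-separated text."""
    terms = set()
    for token in text.lower().split():
        parts = [p for p in SEPARATORS.split(token) if p]
        if keep_compound or len(parts) == 1:
            terms.add(token)       # e.g. "director-general"
        if len(parts) > 1:
            terms.update(parts)    # e.g. "director", "general"
    return sorted(terms)


print(break_words("Director-General"))
print(break_words("Director-General", keep_compound=False))
```

With keep_compound enabled, one input token yields three terms instead of one, which is the kind of growth behind the wide range of index sizes mentioned above.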
BibIndex supports stopword removal: words that exist in a given stopword list are not added to the index. Stopword removal makes the index smaller by leaving out frequently used words.
The stopword list to use can be configured in the general config file by changing the value of the variable CFG_BIBINDEX_PATH_TO_STOPWORDS_FILE. If no stopword list should be used, set the value to 0.
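As a rough sketch of what stopword removal amounts to, the following assumes a stopword list with one word per line. The file format and the function names load_stopwords and remove_stopwords are assumptions made for illustration, not BibIndex's own API.

```python
def load_stopwords(lines):
    """Build a stopword set from an iterable of lines, one word per line."""
    return {line.strip().lower() for line in lines if line.strip()}


def remove_stopwords(words, stopwords):
    """Keep only the words that are not in the stopword set."""
    return [w for w in words if w.lower() not in stopwords]


# Hypothetical list contents; a real list would be read from the file
# named by CFG_BIBINDEX_PATH_TO_STOPWORDS_FILE.
stopwords = load_stopwords(["the", "of", "and", ""])
print(remove_stopwords(["The", "physics", "of", "neutrinos"], stopwords))
```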
The BibIndex indexer supports stemming, that is, removing word endings, which creates a smaller index. For example, in English the word "information" is stemmed to "inform", and "looking", "looks", and "looked" are all stemmed to "look", so a search for any of these forms gives more hits.
Currently you can configure the stemming language on a per-index basis. All searches referring to a stemmed index will also be stemmed using the same language.
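The idea that both indexed words and search terms are reduced to the same stem can be sketched with a toy suffix stripper. This is deliberately not a real stemming algorithm: the three-suffix rule set and the names toy_stem and search_terms_match are invented for illustration, and the toy rules do not handle cases such as "information".

```python
SUFFIXES = ("ing", "ed", "s")  # toy rule set, checked in this order


def toy_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def search_terms_match(query, indexed_word):
    """A query hits an indexed word when both reduce to the same stem."""
    return toy_stem(query) == toy_stem(indexed_word)


for w in ("looking", "looks", "looked"):
    print(w, "->", toy_stem(w))
```

Because the query is stemmed with the same rules as the index, a search for "looked" also matches records that only contain "looking".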
If CFG_BIBINDEX_MIN_WORD_LENGTH in the general config file is set to a value higher than 0, only words with more characters than this value are added to the index.
If you set the Remove HTML Markup parameter in the admin interface to 'Yes', the indexer will try to remove all HTML code from documents before indexing and index only the remaining text. (HTML code is defined as everything between '<' and '>' in a text.)
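The definition of HTML code given above translates directly into a regular expression. The sketch below is an illustration of that definition only; the function name remove_html_markup and the choice to replace each tag with a space are assumptions, not necessarily what the indexer does internally.

```python
import re

# Everything from '<' up to the next '>' is treated as markup.
HTML_TAG = re.compile(r"<[^>]*>")


def remove_html_markup(text):
    """Replace each tag with a space, then collapse runs of whitespace."""
    return " ".join(HTML_TAG.sub(" ", text).split())


print(remove_html_markup("<p>Higgs <b>boson</b> search</p>"))
```

Replacing tags with a space rather than an empty string keeps words on either side of a tag from being glued together, at the cost of splitting a word that has markup in the middle of it.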
If you set the Remove LATEX Markup parameter in the admin interface to 'Yes', the indexer will try to remove all LaTeX code from documents before indexing and index only the remaining text. (LaTeX code is defined as everything between '\command{' and '}' in a text, or between '{\command ' and '}'.)
Metadata tags are usually indexed by their content. There are special cases, however, such as fulltext indexing. In this case the tag contains a URL to the fulltext material, and we would like to fetch this material and index the words found in it rather than those in the metadata itself. This is possible via a special tag assignment in the tagToWordsFunctions variable.
In the default setup, when the indexer sees that it has to index tag 8564_u, it switches into the fulltext indexing mode described above. It can index locally stored files or even fetch them from external URLs, depending on the value of the CFG_BIBINDEX_FULLTEXT_INDEX_LOCAL_FILES_ONLY configuration variable. When fetching files from remote URLs, if the indexer lands on a splash page (an intermediate page before the fulltext file itself), it can find and follow any further links to fulltext files.
The default setup also differentiates between metadata and fulltext indexing, so that the any field index processes only metadata, not fulltext. If you want to have the fulltext indexed together with the metadata, so that both are searched by default, you can go to the BibIndex Admin interface and, under Manage Logical Fields, explicitly add the tag 8564_u to the any field field.
Solr can be used to index fulltext and to serve fulltext queries. To use it, the following steps are necessary:
First, Solr is installed:
$ cd <invenio source tree>
$ sudo make install-solrutils
Second, invenio-local.conf is amended:
CFG_SOLR_URL = http://localhost:8983/solr
Third, Solr is set to index fulltext:
UPDATE idxINDEX SET indexer='SOLR' WHERE name='fulltext'
Fourth, Solr is started:
<invenio installation>/lib/apache-solr-3.1.0/example$ sudo -u www-data java -jar start.jar
An index can be virtual, in which case it is composed of other normal indexes of the kind described in the preceding section. For example, the global index is by default virtual and is composed of several normal indexes such as title, author, abstract, etc.
When indexing a virtual index, the indexer does not tokenise terms anew as it does for a normal index. Rather, the indexer simply collects terms that were generated previously when the normal indexes it depends on were run. When a normal index such as title runs, it creates a list of terms from the title fields, and it not only updates its own title index but also submits its terms to the global index's processing queue. When the global index process runs, it simply processes its incoming queue and updates its term list, without any record lookups.
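The queue-based relationship between normal and virtual indexes can be sketched as follows. This is a simplified in-memory model under stated assumptions, not the actual BibIndex data structures: the classes NormalIndex and VirtualIndex, their attributes, and the whitespace tokenisation are all invented for illustration.

```python
from collections import defaultdict, deque


class VirtualIndex:
    def __init__(self, name):
        self.name = name
        self.queue = deque()            # (term, record id) pairs awaiting processing
        self.terms = defaultdict(set)   # term -> record ids

    def run(self):
        """Drain the queue; no record lookup or tokenisation happens here."""
        while self.queue:
            term, recid = self.queue.popleft()
            self.terms[term].add(recid)


class NormalIndex:
    def __init__(self, name, virtual_indexes=()):
        self.name = name
        self.virtual_indexes = list(virtual_indexes)
        self.terms = defaultdict(set)

    def run(self, records):
        """Tokenise each record once, update own terms, and feed virtual indexes."""
        for recid, text in records.items():
            for term in text.lower().split():
                self.terms[term].add(recid)
                for virtual in self.virtual_indexes:
                    virtual.queue.append((term, recid))


global_index = VirtualIndex("global")
title = NormalIndex("title", [global_index])
author = NormalIndex("author", [global_index])

title.run({1: "Neutrino oscillations"})
author.run({1: "Ellis"})
global_index.run()
print(sorted(global_index.terms))
```

In this model the title and author runs are independent of each other, and the global run touches only its queue, which is what makes running them as separate parallel processes possible.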
In this manner, a site can run several parallel indexing processes for independent normal indexes, and several parallel indexing processes for various virtual indexes, taking full advantage of multi-core multi-node architecture.
To index your newly created or modified documents, bibindex must be run periodically via bibsched. This is achieved by the sleep option (-s) to bibindex. For more information please see HOWTO Run admin guide.
Upon each indexing run, bibindex checks and reports any inconsistencies in the indexes. You can also check for index corruption manually by using the check (-k) option of bibindex.
If a problem is found during the check, bibindex suggests running a repair (-r). During the repair, bibindex tries to correct the problems automatically. Usually it succeeds.
When the automatic repair does not succeed, manual intervention is required. The easiest way to get the indexes back into shape is to run commands like the following (assuming the problem is with index ID 1):
$ echo "DELETE FROM idxWORD01R WHERE type='TEMPORARY' or type='FUTURE';" | \
  /opt/invenio/bin/dbexec
to leave only the 'CURRENT' reverse index. After that you can rerun the index checking procedure (-k) and, if successful, continue with normal web site operation. However, a full reindexing should be scheduled for the forthcoming night or weekend.
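For an index other than ID 1, the same statement can be generated rather than typed by hand. The sketch below assumes that reverse tables are named with a zero-padded two-digit index ID, which is inferred from the single idxWORD01R example above; the helper names reverse_table_name and cleanup_statement are hypothetical.

```python
def reverse_table_name(index_id):
    """Name of the reverse word table, e.g. idxWORD01R for index ID 1."""
    if not 0 < index_id < 100:
        raise ValueError("expected a two-digit index ID")
    # Assumption: zero-padded two-digit naming, as in the idxWORD01R example.
    return "idxWORD%02dR" % index_id


def cleanup_statement(index_id):
    """SQL that leaves only the 'CURRENT' rows in the reverse index table."""
    return ("DELETE FROM %s WHERE type='TEMPORARY' OR type='FUTURE';"
            % reverse_table_name(index_id))


print(cleanup_statement(1))
```

The printed statement can then be piped into dbexec in the same way as the command shown above.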
Reindexing takes place in the real indexes that are also used for searching, so end users will immediately notice any change in the indexes. If you need to reindex your records from scratch, the best procedure is the following: reindex the collection index only (fast operation), recreate the collection cache, and only after that reindex all the other indexes (slow operation). This ensures that the records in your system are at least browsable while the indexes are being rebuilt. The steps to perform are:
First we reindex the collection index:
$ bibindex --reindex -f50000 -wcollection # reindex the collection index (fast)
$ echo "UPDATE collection SET reclist=NULL;" | \
  /opt/invenio/bin/dbexec # clean collection cache
$ webcoll -f # recreate the collection cache
$ bibsched # run the two above-submitted tasks
$ sudo apachectl restart
Then we launch (slower) reindexing of the remaining indexes:
$ bibindex --reindex -f50000 # reindex other indexes (slow)
$ webcoll -f
$ bibsched # run the two above-submitted tasks, and put the queue back in auto mode
$ sudo apachectl restart
You may optionally want to reindex the word ranking tables:
$ bibsched # wait for all active tasks to finish, and put the queue into manual mode
$ cd invenio-0.92.1 # source dir
$ grep rnkWORD ./modules/miscutil/sql/tabbibclean.sql | \
  /opt/invenio/bin/dbexec # truncate rank indexes
$ echo "UPDATE rnkMETHOD SET last_updated='0000-00-00 00:00:00';" | \
  /opt/invenio/bin/dbexec # rewind the last ranking time
Secondly, if you have been using custom ranking methods using new rnkWORD* tables (most probably you have not), you would have to truncate them too:
# find out which custom ranking indexes were added:
$ echo "SELECT id FROM rnkMETHOD" | /opt/invenio/bin/dbexec
id
66
67
[...]
# for every ranking index id, truncate corresponding ranking tables:
$ echo "TRUNCATE rnkWORD66F" | /opt/invenio/bin/dbexec
$ echo "TRUNCATE rnkWORD66R" | /opt/invenio/bin/dbexec
$ echo "TRUNCATE rnkWORD67F" | /opt/invenio/bin/dbexec
$ echo "TRUNCATE rnkWORD67R" | /opt/invenio/bin/dbexec
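When there are many custom ranking methods, the TRUNCATE statements can be generated from the list of ids. The following is a minimal sketch: the helper names parse_ids and truncate_statements are hypothetical, the parsing assumes the output consists of an "id" header followed by numeric ids, and zero-padding of ids below 10 is an assumption that the examples above do not confirm.

```python
def parse_ids(dbexec_output):
    """Extract numeric ids from the output, skipping the 'id' header."""
    return [int(item) for item in dbexec_output.split() if item.isdigit()]


def truncate_statements(method_ids):
    """Yield TRUNCATE statements for the F and R table of each method id."""
    for method_id in method_ids:
        for suffix in ("F", "R"):
            # Assumption: ids are zero-padded to two digits in table names.
            yield "TRUNCATE rnkWORD%02d%s" % (int(method_id), suffix)


ids = parse_ids("id\n66\n67\n")
for statement in truncate_statements(ids):
    print(statement)
```

Each printed statement would still be passed to dbexec individually, as in the commands above.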
Finally, we launch reindexing of the ranking indexes:
$ bibrank -f50000
$ bibsched # run the three above-submitted tasks, and put the queue back in auto mode
$ sudo apachectl restart
and we are done.
In the future Invenio should ideally run indexing into invisible tables that would be switched against the production ones once the indexing process is successfully over. For the time being, if reindexing takes several hours in your installation (e.g. if you have 1,000,000 records), you may want to mysqlhotcopy your tables and run reindexing on those copies yourself.
If Solr is used for both fulltext and ranking, only the BibRank daemon shall run. Since Solr documents can only be overridden and not updated, the BibRank daemon also indexes fulltext.