BibSort Admin Guide

Contents

1. Overview
2. Configuring BibSort
3. Running BibSort
       3.1. Rebalancing
       3.2. Inserting/Updating/Deleting records in BibSort
4. Impact on the sorted search results

1. Overview

BibSort's main goal is to make the sorting of search results faster. It does this by creating several sorting buckets (which hold recids) that are then loaded and cached by the search_engine.
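
The idea of sorting through buckets can be illustrated with a minimal Python sketch. It is not the actual search_engine code: the function name sort_with_buckets, the weights dictionary and the sample data are assumptions made for illustration only.

```python
def sort_with_buckets(result_recids, buckets, weights):
    """Return the recids of a search result in sorted order.

    buckets: list of sets of recids, in sorting order (bucket 0 comes first).
    weights: hypothetical dict mapping a recid to its position in the
             full sorted list; used to order the hits inside one bucket.
    """
    ordered = []
    for bucket in buckets:
        # Only the hits that fall into this bucket need to be sorted.
        hits = bucket & result_recids
        ordered.extend(sorted(hits, key=weights.__getitem__))
    return ordered


# Illustrative data: six records spread over three buckets.
buckets = [{3, 8}, {1, 5}, {2, 9}]
weights = {3: 0, 8: 1, 1: 2, 5: 3, 2: 4, 9: 5}

print(sort_with_buckets({9, 1, 8, 5}, buckets, weights))  # [8, 1, 5, 9]
```

Each bucket is intersected with the search result, so only small groups of recids are ever sorted, instead of the whole result set at once.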

The BibSort module is active when the search_engine uses the sorting buckets to sort the search results quickly. It can be deactivated by setting CFG_BIBSORT_BUCKETS=0 in the invenio.conf file. The module is also inactive if the bsrMETHOD table does not contain any data. The search engine looks into the BibSort data structures to see whether the method requested for sorting the search results exists. If it does not, the old-style sorting function (using the bibxxx tables) is used.
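
The activation rules above can be summarised in a short Python sketch. The function choose_sorter and its arguments are hypothetical; they only mirror the conditions described in this section, not the real search_engine implementation.

```python
def choose_sorter(requested_method, bucket_count, known_methods):
    """Decide which sorting path would serve a request.

    bucket_count stands for the value of CFG_BIBSORT_BUCKETS and
    known_methods for the method names stored in bsrMETHOD
    (both passed in explicitly here, which is an assumption).
    """
    if bucket_count == 0:
        return "bibxxx"   # BibSort deactivated in invenio.conf
    if not known_methods:
        return "bibxxx"   # bsrMETHOD is empty, BibSort is not active
    if requested_method not in known_methods:
        return "bibxxx"   # unknown method, fall back to old-style sorting
    return "bibsort"      # use the cached sorting buckets


print(choose_sorter("title", 1, {"title", "year"}))   # bibsort
print(choose_sorter("author", 1, {"title", "year"}))  # bibxxx
```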

2. Configuring BibSort

Currently there is no web interface for configuring this module. All the configuration is done via a configuration file. The location of this file is:

CFG_ETCDIR/bibsort/bibsort.cfg

Each sorting method has a section in this config file that looks like this:

[sort_field_1]
name = title
washer = sort_alphanumerically_remove_leading_articles
definition = FIELD: title

Each section of the file corresponds to a method.
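
To check a section before loading it into the database, the file can be read with Python's standard configparser module. This is a sketch under the assumption that bibsort.cfg is plain INI syntax, as the sample above suggests. It is not the loader that bibsort itself uses, and the SAMPLE string stands in for the real file.

```python
import configparser

# Stand-in for the contents of bibsort.cfg (same section as above).
SAMPLE = """
[sort_field_1]
name = title
washer = sort_alphanumerically_remove_leading_articles
definition = FIELD: title
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE)

# Index the sections by method name for easy inspection.
methods = {parser[s]["name"]: dict(parser[s]) for s in parser.sections()}

for name, options in methods.items():
    print(name, "->", options["washer"], "|", options["definition"])
```

Only the first delimiter on a line separates key and value, so the colon inside "FIELD: title" stays part of the value.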

To add a new sorting method, add a new section to the bibsort.cfg file. Once this is done, the config file needs to be loaded into the database:

$ ./bibsort --load-config

Similarly, to delete a method, remove the corresponding section from the bibsort.cfg file and load the config into the database.

To dump the configuration from the database into a file:

$ ./bibsort --dump-config

3. Running BibSort

There are several command line instructions that can be used in order to update the BibSort data. For each instruction, one can define the methods and the records that the command should run on, like this:

$ ./bibsort --methods=method1,method2 --recids=4,7-17,23,1

If these options are left empty, the bibsort operations will run on all the defined methods, and either on all the records existing in the database or on all the updated records (depending on the operation, see 3.1 and 3.2).
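
The --recids value mixes single recids and inclusive ranges. The following Python sketch shows one way to read such a value. The helper parse_recids is hypothetical and is not taken from the bibsort source.

```python
def parse_recids(spec):
    """Expand a string such as '4,7-17,23,1' into a set of recids."""
    recids = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            # 'low-high' is treated as an inclusive range
            low, high = (int(x) for x in part.split("-", 1))
            recids.update(range(low, high + 1))
        else:
            recids.add(int(part))
    return recids


print(sorted(parse_recids("4,7-17,23,1")))
# [1, 4, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 23]
```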

3.1. Rebalancing

Rebalancing redoes the sorting from scratch and recreates the sorting buckets. It should be performed once at the beginning and then perhaps once per day, to be sure that the BibSort data structures are in complete sync with the database and that the buckets are balanced. For example, imagine a big upload of new records that all have the same publication year. All these records will be added to the same bucket for the 'publication date' method, making it much bigger than the others and slower for any computation, including intersecting with the search engine output. If you have a clear idea of how the data changes during one day, you can set up rebalancing only for the methods whose data is frequently updated.

$ ./bibsort -R [--methods=method1,method2]
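
What "balanced" means can be shown with a small Python sketch that cuts an already sorted list of recids into buckets of nearly equal size. The function rebalance is a simplified illustration with hypothetical names. The real operation also recomputes the sorting itself.

```python
def rebalance(sorted_recids, bucket_count):
    """Split a sorted list of recids into buckets of nearly equal size."""
    size, extra = divmod(len(sorted_recids), bucket_count)
    buckets, start = [], 0
    for i in range(bucket_count):
        # The first 'extra' buckets take one additional recid each.
        end = start + size + (1 if i < extra else 0)
        buckets.append(set(sorted_recids[start:end]))
        start = end
    return buckets


buckets = rebalance(list(range(1, 11)), 3)
print([len(b) for b in buckets])  # [4, 3, 3]
```

After a large upload that lands in a single bucket, the sizes drift away from this even split, which is what a rebalancing run restores.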

3.2. Inserting/Updating/Deleting records in BibSort

Inserting, updating and deleting records in BibSort is done via the update-sorting operation. Ideally, this operation should run at short intervals, and for the benefit of the user it is best run after BibIndex, so that the updates become visible as soon as possible. If no methods are defined, it will run for all the methods defined in bibsort.cfg. However, if you have a good overview of how the data changes over a period of time, update-sorting can run more frequently for some methods (like sort by year or sort by title) and less frequently for others (like sort by most cited, since the citation dictionaries are not updated so frequently). Defining the recids makes update-sorting run only on those records. If no records are defined, bibsort will pick up all the records modified since its last run. Since for ranking methods it will grab all the data anyway, update-sorting for a ranking method is basically a rebalancing.

$ ./bibsort -S [--methods=method1,method2] [--recids=4,7-17,23,1]
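
The way update-sorting selects its records, as described above, can be sketched in Python. The function plan_update and all of its arguments are hypothetical and only restate the rules of this section.

```python
def plan_update(is_ranking_method, explicit_recids, modified_recids, all_recids):
    """Return (operation, recids) for one method during update-sorting."""
    if is_ranking_method:
        # Ranking methods read all the data anyway, so the update
        # behaves like a rebalancing.
        return "rebalance", set(all_recids)
    if explicit_recids:
        # Recids given on the command line restrict the run.
        return "update", set(explicit_recids)
    # Otherwise take everything modified since the last run.
    return "update", set(modified_recids)


print(plan_update(False, [], [12, 15], range(1, 21)))  # ('update', {12, 15})
```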

4. Impact on the sorted search results

Using the BibSort functionality will have the following impact on the 'Sort by' functionality of Invenio: