The code behind BibClassify: the extraction algorithm

Contents

1. Overview
2. Taxonomy handling
3. Fulltext management
4. Single Keyword processing
5. Composite keyword processing
6. Post-processing
7. Author keywords

1. Overview

This section provides a detailed account of the phrase matching techniques used by BibClassify to automatically extract significant terms from fulltext documents. While reading this guide, you are advised to refer to the original BibClassify code.

BibClassify extracts two types of keywords from a fulltext document: single (or main) keywords and composite keywords. Single keywords are composed of one or more words ("scalar" or "field theory"). Composite keywords are composed of several single keywords ("field theory: scalar") and are recognized as such only if the single keywords are found combined in the fulltext. All keywords are stored in an RDF/SKOS taxonomy or in a simple keyword file. When using the keyword file, it is only possible to extract single keywords.

The bulk of the extraction mechanism takes place inside the functions get_single_keywords and get_composite_keywords in bibclassify_keyword_analyzer.py.

2. Taxonomy handling

This section explains the code in bibclassify_ontology_handler.py.

BibClassify handles the taxonomy differently depending on whether it is running in standalone file mode (from the sources) or as an Invenio module. In both cases, the taxonomy is specified through the -k, --taxonomy option. In standalone file mode, the argument must be a file path. In module mode, the argument refers to the ontology short name found in the clsMETHOD table (e.g. "HEP" for the taxonomy "HEP.rdf"). However, the ontology long name ("HEP.rdf") and even its reference URL also work. The reference URL is stored in the "location" column of the clsMETHOD table.

In standalone file mode, we just compare the date of modification of the taxonomy file and the date of creation of the cache file. If the cache is older than the ontology, we regenerate it.
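The freshness check described above can be sketched as follows. This is a minimal illustration, not the actual BibClassify code: the function name cache_is_stale is hypothetical, and it assumes that the modification time of the cache file is an acceptable stand-in for its creation date.

```python
import os


def cache_is_stale(taxonomy_path, cache_path):
    """Return True when the cache has to be (re)generated."""
    # No cache yet: it must be built.
    if not os.path.exists(cache_path):
        return True
    # Assumption: compare modification times of both files.
    # The cache is stale when it is older than the taxonomy.
    return os.path.getmtime(cache_path) < os.path.getmtime(taxonomy_path)
```

A caller would regenerate the cache whenever cache_is_stale returns True and otherwise load the existing cache file.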

In module mode, we first check the modification date of the reference ontology by performing an HTTP HEAD request. We compare this date with the date of the locally stored ontology. If the reference ontology is newer, we download it. This ensures that BibClassify always uses the latest ontology available. The cache management is similar to the standalone mode.
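A possible shape for this check, using only the Python standard library, is shown below. The function names are hypothetical, and the sketch assumes that the server reports the modification date in a Last-Modified header; BibClassify itself may obtain and compare the dates differently.

```python
import os
import urllib.request
from email.utils import parsedate_to_datetime


def remote_last_modified(url, timeout=10):
    """Return the Last-Modified time of url as a POSIX timestamp, or None."""
    # A HEAD request fetches the headers only, not the ontology itself.
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        header = response.headers.get("Last-Modified")
    return parsedate_to_datetime(header).timestamp() if header else None


def needs_download(remote_timestamp, local_path):
    """Decide whether the local copy of the ontology should be refreshed."""
    if not os.path.exists(local_path):
        return True
    if remote_timestamp is None:
        # Assumption: keep the local copy when the server gives no date.
        return False
    return remote_timestamp > os.path.getmtime(local_path)
```

Keeping the comparison in its own function makes it possible to exercise the decision logic without any network access.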

To generate the cache file, the taxonomy is loaded into memory and parsed using RDFLib.

The cache consists of dictionaries of SingleKeyword and CompositeKeyword objects. These objects contain a meaningful description of the keywords, together with compiled regular expressions that are used to find the keywords in the fulltext. These regular expressions are described in section 4.
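To make the structure of the cache more concrete, here is a simplified sketch. Only the class names SingleKeyword and CompositeKeyword come from BibClassify; the attributes, the build_cache helper, and the whole-word, case-insensitive patterns are assumptions made for illustration, since the real patterns are built according to the configuration file.

```python
import re


def compile_label(label):
    # Assumption: whole-word, case-insensitive matching of the label.
    return re.compile(r"\b" + re.escape(label) + r"\b", re.IGNORECASE)


class SingleKeyword:
    """Simplified stand-in: a concept and one compiled regex per label."""

    def __init__(self, concept, labels):
        self.concept = concept
        self.regexes = [compile_label(label) for label in labels]


class CompositeKeyword:
    """Simplified stand-in: a concept, its components and its own regexes."""

    def __init__(self, concept, components, labels=()):
        self.concept = concept
        self.components = list(components)  # concepts of the single keywords
        self.regexes = [compile_label(label) for label in labels]


def build_cache(single_defs, composite_defs):
    """Build the two dictionaries of keyword objects, keyed by concept."""
    single = {c: SingleKeyword(c, labels) for c, labels in single_defs.items()}
    composite = {
        c: CompositeKeyword(c, components, labels)
        for c, (components, labels) in composite_defs.items()
    }
    return {"single": single, "composite": composite}


cache = build_cache(
    {"scalar": ["scalar"], "field theory": ["field theory", "field theories"]},
    {"field theory: scalar": (["field theory", "scalar"], ["scalar field theory"])},
)
```

Compiling the patterns once, when the cache is generated, avoids recompiling them for every document that is analyzed.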

3. Fulltext management

This section discusses the way BibClassify manages the fulltext of records. The source code discussed can be found in bibclassify_text_extractor.py and bibclassify_text_normalizer.py.

The code of bibclassify_text_extractor.py will soon be updated and therefore the documentation for this module is pending.

The extraction of PDF documents in the field of HEP can lead to some inconsistencies in the document because of mathematical formulas and Greek letters. bibclassify_text_normalizer.py takes care of these problems by running a set of correcting regular expressions on the extracted fulltext. These regular expressions can be found in the configuration file of BibClassify.
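The normalization step can be pictured as a list of (pattern, replacement) pairs applied in order. The rules below are hypothetical examples of typical PDF extraction repairs; they are not the rules shipped in the BibClassify configuration file.

```python
import re

# Hypothetical correction rules; the real ones live in the configuration file.
CORRECTIONS = [
    # Rejoin words split by a hyphen at a line break. Note that this also
    # joins genuine hyphenated compounds that happen to break at that point.
    (re.compile(r"(\w)-\n(\w)"), r"\1\2"),
    # Expand the "fi" and "fl" ligatures that PDF extraction often leaves in.
    (re.compile("\ufb01"), "fi"),
    (re.compile("\ufb02"), "fl"),
    # Collapse runs of spaces and tabs into a single space.
    (re.compile(r"[ \t]+"), " "),
]


def normalize_fulltext(text):
    """Apply every correction rule, in order, to the extracted text."""
    for pattern, replacement in CORRECTIONS:
        text = pattern.sub(replacement, text)
    return text
```

The order of the rules matters: a later rule sees the output of the earlier ones.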

4. Single Keyword processing

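For each single and composite keyword, the taxonomy contains different labels. In SKOS, these are the preferred label (skos:prefLabel), the alternative labels (skos:altLabel) and the hidden labels (skos:hiddenLabel).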

For each of these labels, we compile and cache regular expressions. The way the regular expressions are built is described in the configuration file of BibClassify.

When searching for single keywords in a fulltext, we run the corresponding set of regular expressions on the text and store the number of matches and the position of the keywords in the text.
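The search itself can be sketched as follows. The function name find_single_keyword and the sample regular expression are assumptions; the sketch only shows how the number of matches and their positions can be collected with the standard re module.

```python
import re


def find_single_keyword(regexes, fulltext):
    """Return the sorted (start, end) spans of all matches of any regex."""
    spans = set()  # a set avoids counting the same span twice
    for regex in regexes:
        for match in regex.finditer(fulltext):
            spans.add(match.span())
    return sorted(spans)


# Hypothetical pattern covering the singular and plural forms of one keyword.
regexes = [re.compile(r"\bfield theor(?:y|ies)\b", re.IGNORECASE)]
text = "Scalar field theory and other field theories."
spans = find_single_keyword(regexes, text)
occurrences = len(spans)
```

The number of occurrences is simply the length of the list of spans, and the spans are what the composite keyword search needs later on.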

5. Composite keyword processing

For each composite keyword, we first run the regular expressions corresponding to alternative and hidden labels. This is similar to the search for single keywords.

Then, for each composite keyword, we check whether all of its components were found in the fulltext. If this is the case, we check the positions of the single keywords in the text. If the single keywords are adjacent, we have found a composite keyword. If not, we check whether the words placed between the single keywords are all valid separators (defined in the configuration file of BibClassify); if they are, the composite keyword is also counted.
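The position check can be illustrated for a composite keyword with two components. The separator list and the function name are hypothetical, and the sketch pairs every occurrence of one component with every occurrence of the other, which may differ from how BibClassify counts occurrences.

```python
# Hypothetical separators; the real list is set in the configuration file.
VALID_SEPARATORS = {"of", "the", "in", "and"}


def count_composite(fulltext, spans_a, spans_b, separators=VALID_SEPARATORS):
    """Count pairs of component spans that form a composite keyword."""
    count = 0
    for span_a in spans_a:
        for span_b in spans_b:
            # Put the two spans in reading order.
            (_, first_end), (second_start, _) = sorted([span_a, span_b])
            if second_start < first_end:
                continue  # overlapping spans cannot form a composite
            gap_words = fulltext[first_end:second_start].split()
            # An empty gap means the components are adjacent; otherwise
            # every word in between has to be a valid separator.
            if all(word.lower() in separators for word in gap_words):
                count += 1
    return count
```

Because all() returns True for an empty sequence, the adjacent case and the separator case are handled by the same test.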

The result of this operation is a list of composite keywords with the total number of occurrences. Occurrences for all concerned single keywords are also attached to this list.

6. Post-processing

Before presenting the results to the user, some extra filtering occurs, primarily to refine the output keywords. The main post-processing actions performed on the results are:

The final results presented to the user consist of the 20 best single keywords and the 20 best composite keywords (the limit is configurable). The results may be presented in different formats (text output or MARCXML). Sample text output can be found in the BibClassify Admin Guide.
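Selecting the best keywords can be sketched as a sort followed by a cut. This example assumes that "best" means "most frequent" and breaks ties alphabetically; the actual ranking criteria used by BibClassify may be more elaborate, and the function name is hypothetical.

```python
def best_keywords(counts, limit=20):
    """Return the limit most frequent keywords as (keyword, count) pairs."""
    # Sort by decreasing count, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]
```

The same function can be applied separately to the single keywords and to the composite keywords.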

7. Author keywords

BibClassify also extracts author keywords when the option '--detect-author-keywords' is set. BibClassify searches the fulltext for the string of keywords provided by the author, then splits it into individual keywords and outputs them.
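One way to picture this detection is a regular expression that finds a "Keywords:" line, followed by a split on the usual delimiters. The pattern, the delimiters and the function name below are assumptions for illustration, not the expressions used by BibClassify.

```python
import re

# Hypothetical pattern: a line starting with "Keywords", "Key words" or
# "Keyword", followed by a colon or a dash and the list of keywords.
AUTHOR_KEYWORDS_RE = re.compile(
    r"^\s*key\s?words?\s*[:\-]\s*(.+)$", re.IGNORECASE | re.MULTILINE
)


def detect_author_keywords(fulltext):
    """Return the author keywords found in the fulltext, or an empty list."""
    match = AUTHOR_KEYWORDS_RE.search(fulltext)
    if match is None:
        return []
    # Assumption: keywords are separated by semicolons or commas.
    parts = (part.strip(" \t.") for part in re.split(r"[;,]", match.group(1)))
    return [part for part in parts if part]
```

This sketch only handles keyword lists that fit on a single line; lists that wrap over several lines would need a more tolerant pattern.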