The HEP taxonomy: rationale and extensions

The DESY Library has long maintained a thesaurus of high energy physics (HEP) terms, which HEP institutes worldwide currently use as a controlled subject vocabulary. The conversion of the plain-text HEP thesaurus into a taxonomy with richer structure and semantics (in /opt/invenio/etc/bibclassify/HEP.rdf) was driven mainly by the needs of the BibClassify system. The current taxonomy of high energy physics is expressed in SKOS, an RDF vocabulary intended specifically for representing knowledge organization systems such as thesauri, taxonomies and basic ontologies.

NB. SKOS was adopted instead of other similar knowledge organization formats, notably OWL, because it is simple yet complete. The SKOS language contains all the basic properties that were needed to express the taxonomy, and it allows straightforward conversion from text to RDF. If you are interested in a more detailed discussion on this matter, please check the CERN-DESY email correspondence.

In order to satisfy the needs of typical HEP classification schemes and practices, the SKOS language had to be extended to include additional properties. HEP keywords are often expressed as a combination - a pair - of keywords. An example of this is:

Born-Infeld model: monopole

Here both Born-Infeld model and monopole are standard HEP keywords (single keywords). However, they also combine to express a collective concept. We call this a composite keyword and have extended the SKOS language to accommodate it. The two property extensions created for this purpose are composite and compositeOf.

These two extensions are probably best described by an example. The single keyword Born-Infeld model is expressed in the HEP taxonomy as:

<Concept rdf:about="http://cern.ch/thesauri/HEP.rdf#Born-Infeldmodel">
  <prefLabel xml:lang="en">Born-Infeld model</prefLabel>
  <hiddenLabel xml:lang="en">Born-Infeld</hiddenLabel>
  <altLabel xml:lang="en">DBI</altLabel>
  <broader rdf:resource="http://cern.ch/thesauri/HEP.rdf#fieldtheoreticalmodel"/>
  <composite rdf:resource="http://cern.ch/thesauri/HEP.rdf#Composite.Born-Infeldmodelrelativistic"/>
  <composite rdf:resource="http://cern.ch/thesauri/HEP.rdf#Composite.Born-Infeldmodelnonlinear"/>
  <composite rdf:resource="http://cern.ch/thesauri/HEP.rdf#Composite.Born-Infeldmodelnonabelian"/>
  <composite rdf:resource="http://cern.ch/thesauri/HEP.rdf#Composite.Born-Infeldmodelmonopole"/>
  <composite rdf:resource="http://cern.ch/thesauri/HEP.rdf#Composite.Born-Infeldmodelchiral"/>
</Concept>
The concept contains all the usual SKOS tags that express the relations and denominations of a concept (prefLabel, broader, etc.). In addition, it contains five composite tags, each linking to a different combination of this keyword with another single keyword. For example, one of these points to the composite keyword Born-Infeld model: monopole, whose entry is:

<Concept rdf:about="http://cern.ch/thesauri/HEP.rdf#Composite.Born-Infeldmodelmonopole">
  <prefLabel xml:lang="en">Born-Infeld model: monopole</prefLabel>
  <compositeOf rdf:resource="http://cern.ch/thesauri/HEP.rdf#Born-Infeldmodel"/>
  <compositeOf rdf:resource="http://cern.ch/thesauri/HEP.rdf#monopole"/>
</Concept>

The structure of single and composite keywords, and the associations expressed by the composite and compositeOf properties, should be clear from this example. This model lets us extract keyword pairs from full text efficiently, as explained in the BibClassify extraction guide.
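To make the composite/compositeOf relationship concrete, the following is a minimal Python sketch, not BibClassify's actual implementation. It parses a small SKOS fragment and reports which composite keywords have all of their components among a set of single keywords already found in a text. Several things here are assumptions: the enclosing rdf:RDF element and its namespace declarations (the excerpts above omit them), the bare monopole entry, and the helper names load_concepts and matched_composites. The real extraction procedure is described in the BibClassify extraction guide and may apply further criteria.

```python
import xml.etree.ElementTree as ET

# Assumed namespaces: the excerpts in this page do not show the enclosing
# <rdf:RDF> element, so treating SKOS as the default namespace is an assumption.
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Reduced sample built from the entries above; the "monopole" entry is illustrative.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://www.w3.org/2004/02/skos/core#">
  <Concept rdf:about="http://cern.ch/thesauri/HEP.rdf#Born-Infeldmodel">
    <prefLabel xml:lang="en">Born-Infeld model</prefLabel>
    <composite rdf:resource="http://cern.ch/thesauri/HEP.rdf#Composite.Born-Infeldmodelmonopole"/>
  </Concept>
  <Concept rdf:about="http://cern.ch/thesauri/HEP.rdf#monopole">
    <prefLabel xml:lang="en">monopole</prefLabel>
  </Concept>
  <Concept rdf:about="http://cern.ch/thesauri/HEP.rdf#Composite.Born-Infeldmodelmonopole">
    <prefLabel xml:lang="en">Born-Infeld model: monopole</prefLabel>
    <compositeOf rdf:resource="http://cern.ch/thesauri/HEP.rdf#Born-Infeldmodel"/>
    <compositeOf rdf:resource="http://cern.ch/thesauri/HEP.rdf#monopole"/>
  </Concept>
</rdf:RDF>"""


def load_concepts(rdf_text):
    """Return (labels, components): URI -> prefLabel, composite URI -> component URIs."""
    root = ET.fromstring(rdf_text)
    labels, components = {}, {}
    for concept in root.iter(f"{{{SKOS}}}Concept"):
        uri = concept.get(f"{{{RDF}}}about")
        label = concept.find(f"{{{SKOS}}}prefLabel")
        if label is not None:
            labels[uri] = label.text
        # Only composite keywords carry compositeOf children.
        parts = [el.get(f"{{{RDF}}}resource")
                 for el in concept.findall(f"{{{SKOS}}}compositeOf")]
        if parts:
            components[uri] = parts
    return labels, components


def matched_composites(found_labels, labels, components):
    """List composite labels whose components were all found as single keywords."""
    found = set(found_labels)
    return sorted(
        labels[uri]
        for uri, parts in components.items()
        if all(labels.get(part) in found for part in parts)
    )


labels, components = load_concepts(SAMPLE)
print(matched_composites({"Born-Infeld model", "monopole"}, labels, components))
# prints ['Born-Infeld model: monopole']
```

Indexing by compositeOf alone is enough for this lookup; the composite links on the single keywords carry the same association in the opposite direction and would allow starting the search from a single keyword instead.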

Finally, it is worth pointing out a couple of other syntax practices that might be specific only to the HEP taxonomy: