===================
 Search Engine API
===================

There are three kinds of API you can use: XML API, JSON API, and
Python API.

1. XML API
==========

About:

   Invenio has had a stable search API since its inception.  You can
   use the regular search interface to refine your query until you
   find what you are looking for, and then amend a few URL parameters
   to turn the query into an XML API one.

Syntax:

   GET /search?p=...&of=...&ot=...&jrec=...&rg=...

Parameters:

   p = pattern (i.e. your query)
   of = output format (e.g. `xm` for MARCXML)
   ot = output tags (e.g. `` to get all fields, `245` to get titles only)
   jrec = jump to record, counted from 1 in the result list (e.g. 1 for first hit)
   rg = records-in-groups-of (e.g. 10 hits per page)

   You can use other parameters as well; the list above mentions the
   most useful ones.  For full documentation on these and the other
   `/search` URL parameters, please see Python API section 3.1 below.
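
Sketch: (building a query URL from these parameters)

   A minimal client-side sketch, not part of Invenio, that assembles a
   `/search` URL from the parameters listed above using the Python 3
   standard library.  The host `https://invenio.example.org` and the
   helper name `build_search_url` are hypothetical.

```python
from urllib.parse import urlencode


def build_search_url(base_url, p, of="xm", ot=None, jrec=None, rg=None):
    """Return a /search URL; optional parameters are omitted when None."""
    params = [("p", p), ("of", of)]
    if ot:
        # `ot` is sent as a comma-separated list of tags, e.g. "100,245".
        params.append(("ot", ",".join(ot)))
    if jrec is not None:
        params.append(("jrec", jrec))
    if rg is not None:
        params.append(("rg", rg))
    # Keep "," and ":" readable; everything else is percent-encoded.
    return base_url.rstrip("/") + "/search?" + urlencode(params, safe=",:")


# Hypothetical host, used for illustration only.
print(build_search_url("https://invenio.example.org", "ellis",
                       ot=["100", "245"], jrec=11, rg=10))
```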

Pros:

   Easy web search -> API search context switch.  Uses the same
   parameters as the visible UI.

Cons:

   The XML API output covers only MARC metadata.

Notes:

   The master format of Invenio records is usually MARC.  Hence
   chances are you would like to use `of=xm` output format parameter
   in your XML API queries in order to get the richest data.

   Set `jrec` and `rg` appropriately to paginate.  For example:

        /search?p=ellis&of=xm&jrec=1&rg=10
        /search?p=ellis&of=xm&jrec=11&rg=10
        /search?p=ellis&of=xm&jrec=21&rg=10
        [...]

   Do not set `rg` too high; there is a server-wide safety limit on
   it.  (CFG_WEBSEARCH_MAX_RECORDS_IN_GROUPS)
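
Sketch: (computing `jrec` for a given page)

   Since `jrec` is the 1-based position of the first hit on a page,
   page N with `rg` hits per page starts at (N - 1) * rg + 1.  The
   helper name `jrec_for_page` is hypothetical; this is an
   illustration of the arithmetic only.

```python
def jrec_for_page(page, rg):
    """Return the 1-based `jrec` offset of the given 1-based page."""
    if page < 1 or rg < 1:
        raise ValueError("page and rg must be positive")
    return (page - 1) * rg + 1


# First three pages of ten hits each.
for page in (1, 2, 3):
    print("/search?p=ellis&of=xm&jrec=%d&rg=10" % jrec_for_page(page, 10))
```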

Example: (returning full XML output)

   GET /search?p=ellis&of=xm

   <!-- Search-Engine-Total-Number-Of-Results: 12 -->
   <collection>
     <record>
       <controlfield tag="001">47</controlfield>
       <controlfield tag="005">20140908173007.0</controlfield>
       <datafield tag="037" ind1=" " ind2=" ">
         <subfield code="a">hep-ph/0204132</subfield>
       </datafield>
       <datafield tag="041" ind1=" " ind2=" ">
         <subfield code="a">eng</subfield>
       </datafield>
    ...

Example: (returning XML output, first author (100) and title (245) fields only)

   GET /search?p=ellis&of=xm&ot=100,245

   <!-- Search-Engine-Total-Number-Of-Results: 12 -->
   <collection>
     <record>
       <controlfield tag="001">47</controlfield>
       <controlfield tag="005">20140908173007.0</controlfield>
       <datafield tag="100" ind1=" " ind2=" ">
         <subfield code="a">Shovkovy, I A</subfield>
         <subfield code="u">Minnesota Univ.</subfield>
       </datafield>
       <datafield tag="245" ind1=" " ind2=" ">
         <subfield code="a">Thermal conductivity of dense quark matter and cooling of stars</subfield>
       </datafield>
     </record>
    ...
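
Sketch: (reading the XML output on the client side)

   A minimal sketch of how a client might read such a response with
   the Python standard library.  The sample string is a completed
   copy of the output shown above; the function names are
   hypothetical.  Real responses may declare an XML namespace on
   `<collection>`, so element names are compared by local name only.

```python
import re
import xml.etree.ElementTree as ET

SAMPLE = """<!-- Search-Engine-Total-Number-Of-Results: 12 -->
<collection>
  <record>
    <controlfield tag="001">47</controlfield>
    <datafield tag="100" ind1=" " ind2=" ">
      <subfield code="a">Shovkovy, I A</subfield>
      <subfield code="u">Minnesota Univ.</subfield>
    </datafield>
    <datafield tag="245" ind1=" " ind2=" ">
      <subfield code="a">Thermal conductivity of dense quark matter and cooling of stars</subfield>
    </datafield>
  </record>
</collection>
"""


def local_name(tag):
    """Strip an optional '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def total_hits(xml_text):
    """Read the hit count from the leading comment, or None if absent."""
    match = re.search(r"Search-Engine-Total-Number-Of-Results:\s*(\d+)",
                      xml_text)
    return int(match.group(1)) if match else None


def parse_records(xml_text):
    """Return one dict per <record>: recid plus datafields by tag."""
    records = []
    for rec in ET.fromstring(xml_text).iter():
        if local_name(rec.tag) != "record":
            continue
        parsed = {"recid": None, "fields": {}}
        for child in rec:
            name, tag = local_name(child.tag), child.get("tag")
            if name == "controlfield" and tag == "001":
                parsed["recid"] = int(child.text)
            elif name == "datafield":
                # Subfield codes may repeat, so keep (code, value) pairs.
                subfields = [(sf.get("code"), sf.text) for sf in child]
                parsed["fields"].setdefault(tag, []).append(subfields)
        records.append(parsed)
    return records


print(total_hits(SAMPLE))
for record in parse_records(SAMPLE):
    print(record["recid"], record["fields"]["245"][0][0][1])
```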

Example: returning 250th page of a query, with 50 records per page:

   GET /search?p=cern&of=xm&ot=100,245&jrec=12451&rg=50


2. JSON API
===========

About:

   Internally, Invenio records are represented in JSON.  You can ask
   for JSON output format (`of=recjson`) to obtain it.  Otherwise use
   the same parameters as in XML API, see section 1.

Pros:

   The JSON API covers field abstraction (support for virtual fields,
   e.g. number of citations or book circulation counts) as well as
   master format abstraction (e.g. UNIMARC, EAD).

Cons:

   May be unusably slow if `recjson` is not cached on the server.
   (See `CFG_BIBUPLOAD_SERIALIZE_RECORD_STRUCTURE`.)

   Not yet REST-ified; just an evolution of the HTTP XML API
   described above.

Example: (who cites me?)

   GET /search?p=refersto:author:maldacena&of=recjson&ot=recid,creation_date,authors[0],number_of_authors,system_control_number

   [{
       recid: 1290100,
       creation_date: "2014-04-14T04:44:13",
       authors: [{
         first_name: "A.",
         last_name: "Bernui",
         full_name: "Bernui, A."
       }],
       number_of_authors: 3,
       system_control_number: [
         {
           institute: "arXiv",
           value: "oai:arXiv.org:1404.2936"
         }
       ]
    },
    ...]
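
Sketch: (reading the JSON output on the client side)

   A minimal sketch of post-processing such a response in Python.  It
   assumes that the server returns strict JSON with quoted keys, which
   the abbreviated listing above does not show; the sample body and
   the helper name `summarize` are hypothetical.

```python
import json

# Hypothetical response body, modelled on the listing above but with
# quoted keys so that it is strict JSON.
SAMPLE = """[{
  "recid": 1290100,
  "creation_date": "2014-04-14T04:44:13",
  "authors": [{"first_name": "A.", "last_name": "Bernui",
               "full_name": "Bernui, A."}],
  "number_of_authors": 3,
  "system_control_number": [
    {"institute": "arXiv", "value": "oai:arXiv.org:1404.2936"}
  ]
}]"""


def summarize(record):
    """Return (recid, first author, arXiv identifiers) for one record."""
    authors = record.get("authors") or []
    first_author = authors[0]["full_name"] if authors else None
    arxiv_ids = [scn["value"]
                 for scn in record.get("system_control_number") or []
                 if scn.get("institute") == "arXiv"]
    return record["recid"], first_author, arxiv_ids


for rec in json.loads(SAMPLE):
    print(summarize(rec))
```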


3. Python API
=============

Invenio Search Engine can be called from within your Python programs
via high-level, mid-level, and low-level APIs.

3.1. High-level API
-------------------

   Description:

      The high-level access to the search engine is provided by
      exactly the same function as called from the web interface when
      users submit their queries.  This should guarantee exactly the
      same behaviour, and means that you can pass to the high-level
      API all the arguments as you see them in the URL.

      There are two things to note: (i) the function does not check
      whether a collection is restricted, so restricted collections
      will be searched without asking for a password; (ii) to use it
      as an API, the output format argument (``of'') should be set to
      ``id'' (the default value), in which case the function returns
      the list of recIDs found.
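
   Sketch: (paging through the returned recIDs)

      The argument descriptions below note that `rg' and `jrec' are
      ignored in case of `of=id', so any paging over the returned
      list has to happen in the caller.  A minimal sketch follows;
      the helper name `page_of_recids' is hypothetical and a plain
      list stands in for the function's return value.

```python
def page_of_recids(recids, jrec=1, rg=10):
    """Return the slice of `recids` starting at 1-based position `jrec`."""
    start = max(jrec, 1) - 1
    return recids[start:start + rg]


# Stand-in for the list returned by perform_request_search(p="ellis").
recids = list(range(101, 126))
print(page_of_recids(recids, jrec=11, rg=10))
```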

   Signature:

      def perform_request_search(req=None, cc=CFG_SITE_NAME, c=None, p="", f="", rg=None, sf="", so="a", sp="", rm="", of="id", ot="", aas=0,
                              p1="", f1="", m1="", op1="", p2="", f2="", m2="", op2="", p3="", f3="", m3="", sc=0, jrec=0,
                              recid=-1, recidb=-1, sysno="", id=-1, idb=-1, sysnb="", action="", d1="",
                              d1y=0, d1m=0, d1d=0, d2="", d2y=0, d2m=0, d2d=0, dt="", verbose=0, ap=0, ln=CFG_SITE_LANG, ec=None, tab="",
                              wl=0, em=""):
          """Perform search or browse request, without checking for
             authentication.  Return list of recIDs found, if of=id.
             Otherwise create web page.

             The arguments are as follows:

               req - mod_python Request class instance.

                cc - current collection (e.g. "ATLAS").  The collection the
                     user started to search/browse from.

                 c - collection list (e.g. ["Theses", "Books"]).  The
                     collections user may have selected/deselected when
                     starting to search from 'cc'.

                 p - pattern to search for (e.g. "ellis and muon or kaon").

                 f - field to search within (e.g. "author").

                rg - records in groups of (e.g. "10").  Defines how many hits
                     per collection in the search results page are
                     displayed.  (Note that `rg' is ignored in case of `of=id'.)

                sf - sort field (e.g. "title").

                so - sort order ("a"=ascending, "d"=descending).

                sp - sort pattern (e.g. "CERN-") -- in case there are more
                     values in a sort field, this argument tells which one
                     to prefer

                rm - ranking method (e.g. "jif").  Defines whether results
                     should be ranked by some known ranking method.

                of - output format (e.g. "hb").  Usually starting "h" means
                     HTML output (and "hb" for HTML brief, "hd" for HTML
                     detailed), "x" means XML output, "t" means plain text
                     output, "id" means no output at all but to return list
                     of recIDs found, "intbitset" means to return an intbitset
                     representation of the recIDs found (no sorting or ranking
                     will be performed).  (Suitable for high-level API.)

                ot - output only these MARC tags (e.g. "100,700,909C0b").
                     Useful if only some fields are to be shown in the
                     output, e.g. for library to control some fields.

                em - output only part of the page.

               aas - advanced search ("0" means no, "1" means yes).  Whether
                     search was called from within the advanced search
                     interface.

                p1 - first pattern to search for in the advanced search
                     interface.  Much like 'p'.

                f1 - first field to search within in the advanced search
                     interface.  Much like 'f'.

                m1 - first matching type in the advanced search interface.
                     ("a" all of the words, "o" any of the words, "e" exact
                     phrase, "p" partial phrase, "r" regular expression).

               op1 - first operator, to join the first and the second unit
                     in the advanced search interface.  ("a" and, "o" or,
                     "n" not).

                p2 - second pattern to search for in the advanced search
                     interface.  Much like 'p'.

                f2 - second field to search within in the advanced search
                     interface.  Much like 'f'.

                m2 - second matching type in the advanced search interface.
                     ("a" all of the words, "o" any of the words, "e" exact
                     phrase, "p" partial phrase, "r" regular expression).

               op2 - second operator, to join the second and the third unit
                     in the advanced search interface.  ("a" and, "o" or,
                     "n" not).

                p3 - third pattern to search for in the advanced search
                     interface.  Much like 'p'.

                f3 - third field to search within in the advanced search
                     interface.  Much like 'f'.

                m3 - third matching type in the advanced search interface.
                     ("a" all of the words, "o" any of the words, "e" exact
                     phrase, "p" partial phrase, "r" regular expression).

                sc - split by collection ("0" no, "1" yes).  Governs whether
                     we want to present the results in a single huge list,
                     or split by collection.

              jrec - jump to record (e.g. "234").  Used for navigation
                     inside the search results.  (Note that `jrec' is ignored
                     in case of `of=id'.)

             recid - display record ID (e.g. "20000").  Do not
                     search/browse but go straight away to the Detailed
                     record page for the given recID.

            recidb - display record ID bis (e.g. "20010").  If greater than
                     'recid', then display records from recid to recidb.
                     Useful for example for dumping records from the
                     database for reformatting.

             sysno - display old system SYS number (e.g. "").  If you
                     migrate to Invenio from another system, and store your
                     old SYS call numbers, you can use them instead of recid
                     if you wish so.

                id - the same as recid, in case recid is not set.  For
                     backwards compatibility.

               idb - the same as recidb, in case recidb is not set.  For
                     backwards compatibility.

             sysnb - the same as sysno, in case sysno is not set.  For
                     backwards compatibility.

            action - action to do.  "SEARCH" for searching, "Browse" for
                     browsing.  Default is to search.

                d1 - first datetime in full YYYY-mm-dd HH:MM:SS format
                     (e.g. "1998-08-23 12:34:56"). Useful for search limits
                     on creation/modification date (see 'dt' argument
                     below).  Note that 'd1' takes precedence over d1y, d1m,
                     d1d if these are defined.

               d1y - first date's year (e.g. "1998").  Useful for search
                     limits on creation/modification date.

               d1m - first date's month (e.g. "08").  Useful for search
                     limits on creation/modification date.

               d1d - first date's day (e.g. "23").  Useful for search
                     limits on creation/modification date.

                d2 - second datetime in full YYYY-mm-dd HH:MM:SS format
                     (e.g. "1998-09-02 12:34:56"). Useful for search limits
                     on creation/modification date (see 'dt' argument
                     below).  Note that 'd2' takes precedence over d2y, d2m,
                     d2d if these are defined.

               d2y - second date's year (e.g. "1998").  Useful for search
                     limits on creation/modification date.

               d2m - second date's month (e.g. "09").  Useful for search
                     limits on creation/modification date.

               d2d - second date's day (e.g. "02").  Useful for search
                     limits on creation/modification date.

                dt - first and second date's type (e.g. "c").  Specifies
                     whether to search in creation dates ("c") or in
                     modification dates ("m").  When dt is not set and d1*
                     and d2* are set, the default is "c".

           verbose - verbose level (0=min, 9=max).  Useful to print some
                     internal information on the searching process in case
                     something goes wrong.

                ap - alternative patterns (0=no, 1=yes).  In case no exact
                     match is found, the search engine can try alternative
                     patterns e.g. to replace non-alphanumeric characters by
                     a boolean query.  ap defines if this is wanted.

                ln - language of the search interface (e.g. "en").  Useful
                     for internationalization.

                ec - list of external search engines to search as well
                     (e.g. "SPIRES HEP").

                wl - wildcard limit (e.g. 100).  Wildcard queries will be
                     limited to that many results.
          """

   Examples: (retrieving record IDs)

      >>> # import the function:
      >>> from invenio.search_engine import perform_request_search
      >>> # get all hits in a collection:
      >>> perform_request_search(cc="ATLAS Communications")
      >>> # search for the word `of' in Theses and Books:
      >>> perform_request_search(p="of", c=["Theses","Books"])
      >>> # search for `muon or kaon' within title:
      >>> perform_request_search(p="muon or kaon", f="title")
      >>> # phrase search (note the quotes):
      >>> perform_request_search(p='"Ellis, J"', f="author")
      >>> # regexp search for a report number:
      >>> perform_request_search(p1="^CERN.*2003-001$", f1="reportnumber", m1="r")
      >>> # moi inside Standards gives no hits...
      >>> perform_request_search(p="moi", cc="Standards")
      >>> # but it does if we use alternative patterns:
      >>> perform_request_search(p="moi", cc="Standards", ap=1)

   Example: (retrieving MARCXML)

      >>> import cStringIO
      >>> tmp = cStringIO.StringIO()
      >>> perform_request_search(req=tmp, p='ellis', of='xm')
      >>> out = tmp.getvalue()
      >>> tmp.close()
      >>> # `out' now contains MARCXML of 12 records found

   Example: (retrieving Text MARC, certain tags only)

      >>> import cStringIO
      >>> tmp = cStringIO.StringIO()
      >>> perform_request_search(req=tmp, p='higgs', of='tm', ot=['100', '700'])
      >>> out = tmp.getvalue()
      >>> tmp.close()
      >>> print out
      000000085 100__ $$aGirardello, L$$uINFN$$uUniversita di Milano-Bicocca
      000000085 700__ $$aPorrati, Massimo
      000000085 700__ $$aZaffaroni, A
      000000001 100__ $$aPhotolab
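
   Sketch: (parsing Text MARC lines)

      A minimal sketch of splitting output like the above into record
      ID, tag, indicators and subfields.  It assumes the line layout
      shown in this example (record ID, five-character tag plus
      indicators, then `$$'-prefixed subfields) and handles data
      fields only; the helper name `parse_textmarc_line' is
      hypothetical.

```python
def parse_textmarc_line(line):
    """Split one Text MARC data-field line into its components."""
    recid, field, rest = line.strip().split(" ", 2)
    tag, ind1, ind2 = field[:3], field[3:4], field[4:5]
    # Each subfield is "$$" + one-character code + value.
    subfields = [(chunk[0], chunk[1:])
                 for chunk in rest.split("$$") if chunk]
    return {"recid": int(recid), "tag": tag, "ind1": ind1, "ind2": ind2,
            "subfields": subfields}


SAMPLE = """\
000000085 100__ $$aGirardello, L$$uINFN$$uUniversita di Milano-Bicocca
000000085 700__ $$aPorrati, Massimo
000000085 700__ $$aZaffaroni, A
000000001 100__ $$aPhotolab"""

# Collect the names in subfield "a" per record ID.
authors = {}
for line in SAMPLE.splitlines():
    parsed = parse_textmarc_line(line)
    names = [value for code, value in parsed["subfields"] if code == "a"]
    authors.setdefault(parsed["recid"], []).extend(names)
print(authors)
```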

3.2. Mid-level API
------------------

   Description:

      The mid-level API is provided by the search_pattern() function,
      which only searches for the given pattern in the given field
      according to the given matching type.  This function does not
      know anything about collections.  It does not wash its
      arguments; it expects them to be `clean' already.  The pattern
      is split into `basic search units' for which a boolean query is
      launched.  The function returns an instance of the intbitset class.
      Note that if you want to obtain the list of recIDs (as with the
      high-level API), you can invoke the ``tolist()'' method on a
      hitset.

   Signature:

      def search_pattern(req=None, p=None, f=None, m=None, ap=0, of="id", verbose=0, ln=CFG_SITE_LANG, display_nearest_terms_box=True, wl=0):
          """Search for complex pattern 'p' within field 'f' according to
             matching type 'm'.  Return hitset of recIDs.

             The function uses a multi-stage searching algorithm when no
             exact match is found.  See the Search Internals document for
             detailed description.

             The 'ap' argument governs whether alternative patterns are to
             be used in case there is no direct hit for (p,f,m).  For
             example, whether to replace non-alphanumeric characters by
             spaces if it would give some hits.  See the Search Internals
             document for detailed description.  (ap=0 forbids the
             alternative pattern usage, ap=1 permits it.)
             'ap' is also internally used for allowing hidden tag search
             (for requests coming from webcoll, for example).  In this
             case ap=-9.

             The 'of' argument governs whether or not to print some
             information to the user when no match is found.  (Usually it
             prints the information in case of HTML formats, otherwise it's
             silent).

             The 'verbose' argument controls the level of debugging information
             to be printed (0=least, 9=most).

             All the parameters are assumed to have been previously washed.

             This function is suitable as a mid-level API.
          """

   Examples:

      >>> # import the function:
      >>> from invenio.search_engine import search_pattern
      >>> # search for muon or kaon in any field:
      >>> search_pattern(p="muon or kaon").tolist()
      >>> # the following finds nothing by default...
      >>> search_pattern(p="cern-moi").tolist()
      >>> # ...but it does find something if we allow alternative patterns:
      >>> search_pattern(p="cern-moi", ap=1).tolist()
      >>> # wildcard search for a report number:
      >>> search_pattern(p="CERN-LHC-PROJECT-REPORT-40*", f="reportnumber").tolist()
      >>> # regexp search for a report number with possible trailing subjects:
      >>> search_pattern(p="^CERN-LHC-PROJECT-REPORT-40(-|$)", f="reportnumber", m="r").tolist()

3.3. Low-level API
------------------

   Description:

      The low-level API is provided by the search_unit() function,
      which assumes its arguments to be basic search units already.
      Therefore it does not know anything about boolean queries, etc.
      The function returns an instance of the intbitset class.  Note that
      if you want to obtain the list of recIDs (as with the high-level
      API), you can invoke the ``tolist()'' method on a hitset.
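
   Sketch: (basic search units versus boolean patterns)

      The toy model below illustrates the difference between the two
      levels; it is not Invenio's implementation.  A small in-memory
      index and plain Python sets stand in for the real indexes and
      for intbitset, and all names (`TOY_INDEX', `toy_search_unit',
      `toy_search_pattern') are hypothetical.  The unit lookup treats
      its argument as a single term, while the pattern function
      splits on and/or/not and combines the unit hitsets.

```python
# Toy inverted index: term -> set of record IDs (made-up data).
TOY_INDEX = {
    "muon": {1, 2, 5},
    "kaon": {2, 7},
    "moi": {9},
}


def toy_search_unit(term):
    """Look up one basic search unit; no boolean parsing happens here."""
    return set(TOY_INDEX.get(term.lower(), set()))


def toy_search_pattern(pattern):
    """Evaluate 'a and b or c not d' left to right over unit hitsets."""
    hits = None
    op = "and"  # adjacent words without an operator are ANDed
    for token in pattern.split():
        word = token.lower()
        if word in ("and", "or", "not"):
            op = word
            continue
        unit_hits = toy_search_unit(word)
        if hits is None:
            # First term; an operator before it is ignored in this toy.
            hits = unit_hits
        elif op == "and":
            hits &= unit_hits
        elif op == "or":
            hits |= unit_hits
        else:
            hits -= unit_hits
        op = "and"
    return hits if hits is not None else set()


print(sorted(toy_search_pattern("muon or kaon")))  # boolean query is parsed
print(sorted(toy_search_unit("muon or kaon")))     # looked up as one term
```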

   Signature:

      def search_unit(p, f=None, m=None, wl=0, ignore_synonyms=None):
          """Search for basic search unit defined by pattern 'p' and field
             'f' and matching type 'm'.  Return hitset of recIDs.

             All the parameters are assumed to have been previously washed.
             'p' is assumed to be already a ``basic search unit'' so that it
             is searched as such and is not broken up in any way.  Only
             wildcard and span queries are being detected inside 'p'.

             If CFG_WEBSEARCH_SYNONYM_KBRS is set and we are searching in
             one of the indexes that has defined runtime synonym knowledge
             base, then look up there and automatically enrich search
             results with results for synonyms.

             In case the wildcard limit (wl) is greater than 0 and this limit
             is reached an InvenioWebSearchWildcardLimitError will be raised.
             In case you want to call this function with no limit for the
             wildcard queries, wl should be 0.

             Parameter 'ignore_synonyms' is a list of terms for which we
             should not try to further find a synonym.

             This function is suitable as a low-level API.
          """

   Examples:

      >>> # import the function:
      >>> from invenio.search_engine import search_unit
      >>> # search moi in any field:
      >>> search_unit(p="moi").tolist()
      >>> # this one will not match:
      >>> search_unit(p="muon or kaon").tolist()
      >>> # regexp search for a report number with possible trailing subjects:
      >>> search_unit(p="^CERN-PS-99-037(-|$)", f="reportnumber", m="r").tolist()