BibRank Record Sorter API

Invenio BibRank Record Sorter can be called from within your Python
programs via a high-level API and a mid-level API.

1. High-level API

   Description:

      The high-level access to the BibRank Record Sorter is provided
      by exactly the same function as called from the web interface
      when users submit their queries, if a rank method has been
      selected. This should guarantee exactly the same behaviour if
      the same parameters are given.

      There are three things to note: (i) When a search is done, the
      search engine sends an intbitset containing all the records that
      match the query. Since only records in the intbitset are ranked,
      you must create an intbitset containing the records you want to
      rank and pass it as a parameter to the function. (ii) Some rank
      methods, such as the "Similar Records" function, may ignore the
      intbitset. (iii) If an error occurs while ranking the records,
      the returned data is different (see "output if error" in the
      signature).
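
      As a minimal sketch of point (i), the snippet below builds a
      hitset from a list of record IDs. The helper name build_hitset
      is hypothetical, and the import path invenio.intbitset is an
      assumption about the installation; when it is not importable, a
      plain Python set is used as a stand-in so the sketch still runs,
      although the real ranking functions expect an intbitset.

```python
# Sketch: build the hitset of record IDs that the ranking function may rank.
try:
    # Assumption: intbitset is importable from the Invenio installation.
    from invenio.intbitset import intbitset
except ImportError:
    # Stand-in so the sketch runs without Invenio. A plain set is only
    # an illustration here and is not what the ranking functions expect.
    intbitset = set


def build_hitset(recids):
    """Return a hitset holding exactly the given record IDs (hypothetical helper)."""
    return intbitset(list(recids))


# Only these four records would be considered for ranking.
a_hitset = build_hitset([12, 123, 321, 12451])
print(len(a_hitset))
print(123 in a_hitset)
```

      The resulting a_hitset is what would be passed as the
      hitset_global parameter.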

   Signature:

      def rank_records(rank_method_code, rank_limit_relevance,
                       hitset_global, pattern, verbose=0):
       """
       rank_method_code - 'jif', 'wrd' or other methods
       rank_limit_relevance - a number defining the relevance threshold
       used when ranking records; may be ignored by the rank method.
       hitset_global - search engine hits; if all records should be
       ranked, fill the intbitset with ones.
       pattern - search engine query or record ID; must be a list,
       e.g. ['CERN', 'fermilab'] or ['recid:12345']
       verbose - verbose level, 0-9; defines how much debug
       information should be shown

       output if successful:
       list of records - [123, 321, 12451, 123, 12, 4]
       list of rank values - ascending, same length as the list of
       records [0, 10, 20, 30, 40, 100]
       prefix - text to show before the rank value, '<--' hides rank
       value, defined in config file.
       postfix - text to show after the rank value, '-!>' hides rank
       value, defined in config file.
       verbose_output - the debug text depending on the verbose level.

       output if error:
       list of records - is None
       list of rank values - is None
       prefix - Contains headline of error
       postfix - Error message or error box if exception.
       verbose_output - the debug text depending on the verbose level.
       """


   Examples:

      >>> # import the function:
      >>> from invenio.bibrank_record_sorter import rank_records
      >>> # rank the records in a_hitset (an intbitset) for the words 'higgs' and 'boson' using the method 'wrd'
      >>> rank_records('wrd', 0, a_hitset, ['higgs', 'boson'], 0)
      >>> # find records similar to record 12345; the hitset is ignored here because 'similar records' is used
      >>> rank_records('wrd', 0, a_hitset, ['recid:12345'], 0)
      >>> # rank all records based on jif value
      >>> rank_records('jif', 0, a_hitset, [], 0)
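
      Because the return value differs between the success and error
      cases, callers usually need to check for None before using the
      record list. The following is a minimal sketch of such a check.
      The helper summarize_ranking is hypothetical, and the two sample
      tuples (including the "(" and ")" prefix and postfix) are
      made-up stand-ins shaped like the documented output, so no
      Invenio call is made.

```python
def summarize_ranking(result):
    """Turn the five-element return value into a small dict (hypothetical helper)."""
    recids, values, prefix, postfix, verbose_output = result
    if recids is None:
        # Error case: prefix carries the headline, postfix the message.
        return {"ok": False,
                "error": "%s: %s" % (prefix, postfix),
                "debug": verbose_output}
    # Success case: records and rank values are parallel lists.
    return {"ok": True,
            "ranked": list(zip(recids, values)),
            "debug": verbose_output}


# Stand-in tuples shaped like the documented output.
success = ([123, 321, 12451], [0, 10, 100], "(", ")", "")
failure = (None, None, "Error", "unknown rank method", "")

print(summarize_ranking(success)["ranked"])  # [(123, 0), (321, 10), (12451, 100)]
print(summarize_ranking(failure)["error"])   # Error: unknown rank method
```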

2. Mid-level API

   Description:

      Using the mid-level API, you can call the various ranking
      functions directly. These functions do not return data in the
      form the search engine expects. They also do not check whether
      the function being called is the right one for the given rank
      method code, but they return an error if the wrong
      code/function combination is used.

   Signatures:
      def combine_method(rank_method_code, pattern, hitset, rank_limit_relevance, verbose):
      -This method calls each method mentioned in the config file and adds the results together.

      def find_similar(rank_method_code, recID, hitset, rank_limit_relevance, verbose):
      -This method finds records similar to the one given in the recID field. The recID field
       must be an integer value. hitset is ignored. rank_method_code has to be 'wrd'.

      def word_similarity(rank_method_code, lwords, hitset, rank_limit_relevance, verbose):
      -This method ranks records based on the list of words in the lwords field. rank_method_code has to be
       'wrd'. Only records in the hitset are ranked.

      def rank_by_method(rank_method_code, lwords, hitset, rank_limit_relevance, verbose):
      -All other rank methods use this function together with data from the rnkMETHODDATA table
       (a dictionary of {recid: (text, value)}) to rank the data. Only records in the hitset are ranked.

      These mid-level API functions require that the function create_rnkmethod_cache() has been
      called first, since it loads the config options they need.
      The rank methods all return data in the same form:
      ([[recid, value],[recid, value]], prefix, postfix, verbose_output)
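
      The mid-level return shape pairs each record with its value,
      whereas the high-level API returns two parallel lists. As a
      hedged sketch of how a caller might convert one into the other,
      the hypothetical helper below splits the pairs and orders them
      by ascending rank value. The sorting step is an assumption for
      illustration (the text does not say whether the pairs arrive
      sorted), and this is not necessarily how rank_records() does it
      internally. The sample tuple is made up.

```python
def split_mid_level_result(result):
    """Split ([[recid, value], ...], prefix, postfix, verbose_output)
    into parallel lists ordered by ascending rank value (hypothetical helper)."""
    pairs, prefix, postfix, verbose_output = result
    # Order the [recid, value] pairs by their rank value.
    ordered = sorted(pairs, key=lambda pair: pair[1])
    recids = [recid for recid, _ in ordered]
    values = [value for _, value in ordered]
    return recids, values, prefix, postfix, verbose_output


# Stand-in data shaped like the documented mid-level return value.
sample = ([[321, 40], [123, 10], [12451, 100]], "(", ")", "")
print(split_mid_level_result(sample)[:2])  # ([123, 321, 12451], [10, 40, 100])
```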

   Examples:

      >>> # import the function:
      >>> from invenio.bibrank_record_sorter import find_similar
      >>> # find records similar to 12345, hitset must be full
      >>> find_similar('wrd', 12345, hitset, 0, 0)
      >>> # rank records according to a method called jif, using the single_tag...based method.
      >>> # the list of words is here ignored, only the records in the hitset are used.
      >>> rank_by_method('jif',['higgs'], hitset, 0, 0)
      >>> # rank records containing ['higgs', 'boson'] using word similarity ('wrd')
      >>> word_similarity('wrd',['higgs', 'boson'], hitset, 0, 0)
      >>> # rank records using various methods, which methods to use is read from the config file.
      >>> combine_method('cmb', ['higgs','boson'], hitset, 0, 0)