CERN Accelerating science

Search Processing Stages

What the search engine does after it receives a user query?  Let us
explain the steps it takes for a fictive complex query example in
which a user typed

   author:ellis reportnumber:TH-2003-114 -muon +energ* title:"kaon decay"

and selected "Theses" and "Books" collections and the "arrival date"
between the 10th and the 20th of March 2003.

The search engine proceeds as follows.

1. Firstly we classify and break apart our search arguments into a)
   basic search units and b+c) additional search options:

    a) (p_1, f_1), (p_2, f_2), ..., (p_m, f_m) --- the list of basic
          (pattern, field) searching units.  The user-typed boolean
          query is split into the (p_i, f_i) units according to the
          non-significant whitespace.  (A whitespace is considered to
          be significant when it occurs within quoted expressions for
          phrase searching, see Search Tips and user-level
          documentation.)  In our example, the list of basic search
          units is: (author, ellis), (reportnumber, TH-2003-114),
          (anyfield, muon), (anyfield, energ*), (title, "kaon decay").

    b) c_1, c_2, ..., c_n --- the list of collections the user wanted
          to search in.  In our example: Theses, Books.

    c) l_1, l_2, ..., l_o --- the list of additional limiting search
          criteria, such as limit to certain arrival date, certain
          language, or certain subject category.  The user usually
          selects such limits from within Advanced Search interface
          and its selection boxes.  In our example, the limit is on
          the arrival date: (arrivaldate, 20030310->20030320).

   The basic search units and additional search arguments are then
   dealt with subsequently (from a to c with decreasing priority) in
   the following searching stages.

2. For each (p_i, f_i), we verify that at least some hits can be found
   regardless of c_j and l_k.  In other words, we make sure that p_i
   is a known indexed term in f_i.  Note that p_i may contain
   asterisks and may start and end by a single/double quotes, with the
   following special meaning:

      foo*bar -- asterisk is a wildcard character, meaning to match
                 any sequence of characters

      "foo and bar" -- double quotes to denote exact phrase matching

      'foo and bar' -- single quotes to denote partial phrase matching

   The p_i (word or phrase) is then tested for existence in the f_i
   (word or phrase) index.

   2-1. If p_i was found in the f_i index, we retain the (p_i, f_i)
        search unit.  We also retain it in the case when p_i wasn't
        found in the f_i index but when (p_i, f_i) is joined to the
        previous unit or to the next unit by boolean operator OR.

   2-2. If p_i wasn't found inside f_i, we then look whether p_i
        contains some non-alphanumeric characters, such as a dash or a
        slash.  If this is so, we try to replace them with a boolean
        AND query.  In our example, the search unit (reportnumber,
        TH-2003-114) would be replaced by a new boolean query for
        (reportnumber,TH) and (reportnumber,2003) and
        (reportnumber,114).  If this new query succeeds, we retain
        this new boolean query in place of the old search unit.

   2-3. If the preceding step failed, we propose to the end user a
        list of nearest indexed terms (words, phrases) around p_i
        within f_i index and let the user choose one known indexed
        term out of this list.  In our example, the phrase "kaon
        decay" cannot be found in the title index, so we'll propose
        closest titles around "kaon decay ...".  Let's suppose that
        the user will choose "Kaon decays and the flavour problem" out
        of this list.

   After all the basic search units (p_i, f_i) have been treated we
   may proceed to the following stage 3.

3. At this stage, all search units (p_i, f_i) are known to yield at
   least some results.  We now continue by trying boolean query as
   specified by the user.  The execution priority goes from left to
   right, the known boolean operators are:

      +  for set intersection
      -  for set difference
      |  for set union

   In our example, we proceed by doing the set intersections of
   (author, ellis), (reportnumber,TH), (reportnumber,2003),
   (reportnumber,114), followed by set differentiation with (anyfield,
   muon), followed by set intersection with (anyfield, energ*) and
   (title, "Kaon decays and the flavour problem").

   3-1. If this gives some hits, we proceed to stage 4.

   3-2. If this does not give any hit, we display the number of hits
        found for each search unit, advise the user to combine his
        search terms differently, and we stop.

4. At this stage, the boolean query (p_1, f_1), (p_2, f_2), ... ,
   (p_n, f_n) is known to yield some results.  We now continue by
   checking whether these results fall into collections c_j that the
   user has chosen.  This is done by performing a set intersection of
   the results obtained so far with the collection universe for c_j.

   4-1. If this gives some hits, we proceed to stage 5.

   4-2. If this does not give any hit, we first try to look for the
        query in any public collection.

        4-2-1. If this gives some hits, we warn the user that no match
               could have been found in his c_j choice but that there
               are hits in other public collections.  We propose a
               link to get them and we stop.

        4-2-2. If this does not give any hit, then there must have
               been some hits in some of the restricted collections.
               We display a warning that the restricted collections
               must be explicitly selected before searching and we
               stop.

5. At this stage, the boolean query (p_i, f_i) within c_j is known to
   yield some results.  We now proceed by checking additional search
   limits l_k imposed by the user.  This is done by subsequent set
   intersections of the results obtained so far with the universe of
   records matching limiting criteria (l_1, l_2, ..., l_o).

   5-1. If this gives some hits, we proceed to stage 6.

   5-2. If this does not give any hit for a certain l_k, we warn the
        user that no match could have been found for his l_o choices
        and proceed to stage 6 with the results obtained so far.

6. We are done and may display the results.