WebSearch Admin Guide

WARNING: THIS ADMIN GUIDE IS NOT FULLY COMPLETED
This Admin Guide is not yet completed. Moreover, some admin-level functionality for this module exists only in the form of manual recipes. We are in the process of developing both the guide and the web admin interface. If you are interested in seeing some specific things implemented with high priority, please contact us at cds.support@cern.ch. Thanks for your interest!

Contents

1. Overview
2. Edit Collection Tree
       2.1 Add new collection
       2.2 Add collection to tree
       2.3 Modify existing tree
3. Edit Collection Parameters
       3.1 Modify collection query
       3.2 Modify access restrictions
       3.3 Modify translations
       3.4 Delete collection
       3.5 Modify portalboxes
       3.6 Modify search fields
       3.7 Modify search options
       3.8 Modify sort options
       3.9 Modify rank options
       3.10 Modify output formats
       3.11 Configuration of related external collections
       3.12 Detailed record page options
4. External and Hosted Collections
5. Webcoll Status
6. Collections Status
7. Check External Collections
8. Edit Search Engine Parameters
9. Search Engine Cache
10. Search Services
11. Additional Information

1. Overview

The WebSearch Admin interface helps you configure the search collections that end users see. Its functionality falls into four parts: (i) organizing collections into the collection tree; (ii) defining and editing collection parameters; (iii) updating the collection cache via the webcoll daemon; and (iv) influencing the search engine behaviour by setting various search engine parameters. The rest of this guide describes each part in turn.

2. Edit Collection Tree

The metadata corpus in Invenio is organized into collections, and the collections are organized in a tree. The collection tree is what end users see when they start navigating the CERN Document Server. It is similar to what other sites call Web Directories, which organize the Web into topical categories, such as Google Directory.

Note that Invenio permits every collection in the tree to have either "regular" or "virtual" sons. In other words, every node in the collection tree may have regular or virtual branches growing out of it. This makes it possible to create a tree with very complex, multi-level, nested structures of regular and virtual branches, if needed, with the aim of easing navigation for end users from one branch to another. The difference between a regular and a virtual branch is explained in detail in section 2.2.

2.1 Add new collection

To add a new collection, enter its default name in the default language of the installation and click on the ADD button to add it. There are two important actions that you have to perform after adding a collection:

After you edit these two things, the collection is fully usable for the search interface. It will appear in the search interface after the next run of the WebColl Daemon.

However, you will probably want to customize further things, such as collection name translations in various languages, collection web page portalboxes, search options, etc., as explained in section 3, Edit Collection Parameters.

2.2 Add collection to tree

To attach a collection to the tree, first choose which collection you want to attach, then choose the father collection to attach it to, and then choose the fathership relation type between them (regular or virtual).

The difference between the regular and the virtual relationship goes as follows:

The example presented above would then give us the following picture:
        M u l t i m e d i a

        Narrow by Collection:        Focus on:
        --------------------         ---------
        [ ] Photos                   University Multimedia Service
        [ ] Videos                   BBC Pictures and Videos

It is important to note that if a collection A is composed of B and C as its regular sons, and offers X and Y as its virtual sons, then every document belonging to A must also belong to either B or C. This requirement does not apply to X and Y, because X and Y offer only a "focus-on" orthogonal view on a (possibly small) part of the document corpus of A. If end users search the collection A, then they are actually searching inside B and C, not X and Y. If they want to search inside X or Y, they have to click on X or Y first. One can consider virtual branches as a sort of non-essential searching aid that is activated only when end users are interested in a particular "focus-on" view of A.
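The relationship described above can be illustrated with a minimal Python sketch. It is not Invenio code: the Collection class, the searchable_records function and the record IDs are hypothetical and serve only to show that searching a collection covers its regular sons, while virtual sons are searched only when selected directly.

```python
from dataclasses import dataclass, field


@dataclass
class Collection:
    """Hypothetical in-memory model of one node of the collection tree."""
    name: str
    records: set = field(default_factory=set)        # record IDs of a leaf collection
    regular_sons: list = field(default_factory=list)
    virtual_sons: list = field(default_factory=list)


def searchable_records(coll):
    """Record IDs covered when an end user searches `coll`.

    Only regular sons contribute; virtual sons are a "focus-on" view
    that is searched only when the user clicks on them.
    """
    if not coll.regular_sons:
        return set(coll.records)
    covered = set()
    for son in coll.regular_sons:
        covered |= searchable_records(son)
    return covered


# Illustrative data mirroring the Multimedia example (record IDs are made up).
photos = Collection("Photos", records={1, 2, 3})
videos = Collection("Videos", records={4, 5})
ums = Collection("University Multimedia Service", records={2, 4})
bbc = Collection("BBC Pictures and Videos", records={3, 5})
multimedia = Collection("Multimedia",
                        regular_sons=[photos, videos],
                        virtual_sons=[ums, bbc])

print(sorted(searchable_records(multimedia)))  # records of Photos and Videos
print(sorted(searchable_records(ums)))         # only after clicking on the virtual son
```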

2.3 Modify existing tree

To modify the existing tree via the WebSearch Admin interface, click on the icons displayed next to the collections. The meaning of the icons is as follows:

  • Remove the chosen collection with its subcollections from the collection tree, but do not delete the collection itself. (For full deletion of a collection, see section 3.4.)
  • Move the chosen collection up or down among its brothers and sisters, i.e. change the order of collections inside the same level of the tree.
  • Move the chosen collection among branches of the tree. Press the first icon to choose a collection to move, and the second icon to select the new father collection that the chosen collection should be attached to.

3. Edit Collection Parameters

To finalize the setup of a collection, you should edit a number of its parameters: define the list of records belonging to the collection, define search fields, define search interface page portalboxes, etc. This section describes each of these possibilities in the order in which they are presented in the Edit Collection pages of the WebSearch Admin interface.

3.1 Modify collection query

The collection query defines which documents belong to the given collection. It is equal to the search term that retrieves all documents belonging to the given collection, exactly as you would have typed it into the search interface. For example, to define a collection of all papers written by Ellis, you could set up your collection query to be author:Ellis.

Usually, the collection query is chosen on the basis of the collection identifier that we store in MARC tag 980. This tag is indexed in a logical field called collection so that a collection of Theses could be defined via collection:THESIS, supposing that every thesis metadata record has the text THESIS in MARC tag 980. (Nitpick: we use the term `collection' in two contexts here: once as a collection of metadata documents, and once as a logical field name. We should probably have called the latter collectionidentifier or some such instead, but we hope the difference is clear from the context.)

If a collection does not have any collection query defined, then its content is defined by means of the content of its descendants (subcollections). This is the case for composed collections. For example, the composed collection Articles & Preprints (no query defined) will be defined as a father of Articles (query: collection:ARTICLE) and Preprints (query: collection:PREPRINT). In this case the collection query for Articles & Preprints can stay empty.

Note that you should avoid defining a non-empty collection query when the collection has descendants, since the query will prevail and the descendants may not be taken into account. In the same way, if a collection has neither a query nor any descendants defined, then its contents will be empty.
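The rules above can be summarized in a short Python sketch. It is only an illustration under stated assumptions, not the actual webcoll implementation: the function collection_records, the dictionaries and the toy index are hypothetical, and the sketch assumes that a non-empty query always prevails over the descendants.

```python
def collection_records(name, queries, sons, run_query):
    """Return the set of record IDs belonging to collection `name`.

    queries:   dict mapping collection name -> query string ('' if none)
    sons:      dict mapping collection name -> list of subcollection names
    run_query: callable turning a query string into a set of record IDs
    """
    query = queries.get(name, "")
    if query:
        # Assumption: a non-empty query prevails and descendants are ignored.
        return run_query(query)
    records = set()
    for son in sons.get(name, []):
        records |= collection_records(son, queries, sons, run_query)
    return records  # empty when there is neither a query nor descendants


# Toy index standing in for the search engine (record IDs are made up).
toy_index = {"collection:ARTICLE": {1, 2}, "collection:PREPRINT": {3}}
queries = {"Articles": "collection:ARTICLE",
           "Preprints": "collection:PREPRINT",
           "Articles & Preprints": "",
           "Orphan": ""}
sons = {"Articles & Preprints": ["Articles", "Preprints"]}


def run_query(query):
    return set(toy_index.get(query, set()))


print(sorted(collection_records("Articles & Preprints", queries, sons, run_query)))
print(sorted(collection_records("Orphan", queries, sons, run_query)))
```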

To define an external hosted collection, set up the query to begin with hostedcollection: (for more detailed information see section 4).

To remove the collection query, set the parameter empty.

3.2 Modify access restrictions

Until Invenio-0.92.1 there was the possibility to directly restrict a collection by specifying an Apache group. Users who had an Apache user and password belonging to the given group would have been able to access the restricted collection.

Collection restriction management is now integrated with the wider Role Based Access Control facility of Invenio.

In order to restrict access to a collection, you just have to create at least one authorization for the action viewrestrcoll, specifying the name of the collection as the parameter.

If you have just upgraded your installation from CDS Invenio-0.92.1, you have probably run the collection_restrictions_migration_kit.py tool in order to migrate to the new framework. For every Apache group with access to a restricted collection a role will be created, with proper authorization to access the restricted collections. Each role will have a FireRole definition that allows the given Apache group. Through the WebAccess admin interface you will then be able to change these definitions in order to migrate your restrictions gradually to whatever suits your needs.

3.3 Modify translations

You may define translations of collection names into the languages of your Invenio installation. Moreover, a collection name may be different in different contexts (e.g. long name, short name, etc), so that prior to modifying translations you will be asked to select which name type you want to change.

The interface also lets you customize the labelling (and translations) of the default collection boxes: "Focus on:", "Narrow by:" and "Latest additions:".

The translations aren't mandatory to define. If a translation does not exist in a language chosen by the end user, the end user will be shown the collection name in the default language of this installation.
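The fallback behaviour can be pictured with a minimal Python sketch. The function display_name, the translations dictionary and the choice of "en" as default language are assumptions made for illustration only; they are not part of the Invenio API.

```python
def display_name(translations, lang, default_lang="en"):
    """Return the collection name in `lang`, or in the default language
    of the installation when no translation exists for `lang`."""
    return translations.get(lang, translations[default_lang])


# Hypothetical translations of one collection name.
theses_names = {"en": "Theses", "fr": "Thèses"}

print(display_name(theses_names, "fr"))  # a translation exists
print(display_name(theses_names, "de"))  # falls back to the default language
```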

Note also that the list of available languages depends on the compile-time configuration (see the general invenio.conf file).

3.4 Delete collection

The collection to be deleted must be first removed from the collection tree. Any metametadata associated with the collection (such as association to portalboxes, association to records belonging to this collection, etc) will be lost, but the metadata itself will be preserved (such as portalboxes themselves, records themselves, etc). In total, association to records, output formats, translations, search options, sort options, search fields, ranking method, and access restriction will be lost. Use with care!

It may be a good idea only to remove the collection from the end users' interface, but to keep it "hidden" in a corner they don't see and can't search when they search from Home. To achieve this, do not delete the collection but simply remove it from the collection tree so that it is not attached to any father collection. In this case the search interface page for this collection will stay updated, but the collection will be neither shown in the tree nor searchable from the Home page. It will only be accessible via a bookmarked URL, for example.

3.5 Modify portalboxes

The search interface HTML page for a given collection may be customized by what we call portalboxes. Portalboxes are used to show various kinds of information to the end user, such as a text box with some inline help information about the given collection, an illustrative picture, etc.

To create a new portalbox, a title and a body must be given, where the body can contain HTML if necessary.

To add a portalbox to the collection, you must choose an existing portalbox, the language for which the portalbox should be shown, the position of the portalbox on the screen, and the ordering score of portalboxes.

3.6 Modify search fields

The search field is a logical field (such as author, title, etc) that will be proposed to the end users in Simple and Advanced Search interface pages. If you do not set any search fields for a collection, then a default list (author, title, year, etc) will be shown.

Note that if you want to add a new logical field or modify existing physical MARC tags for a logical field, you have to use the BibIndex Admin interface.

3.7 Modify search options

A search option is like a search field in that it lets the end user narrow down a search to some logical field such as "subject". Unlike with a search field, however, the user is not required to type the query in free-text form; rather, the search interface proposes several predefined values, prepared by the administrators, that the end user may choose from. For example, "author search" is a good case for a search field: there are plenty of author names to be matched, so end users would usually type the name they wish to find in free-text form. "Subject search" is a good case for a search option: there is usually a limited number of subjects in the system, given by the local subject classification scheme, which end users do not necessarily know and can conveniently choose from a list. As a rule of thumb, the search field concept covers the case of an unlimited number of distinct values to be matched in a given field (e.g. author, title, keyword), while the search option concept covers the case of only a handful or so of distinct values to be matched in a given field (e.g. subject, division, year).

Search options are shown in the "Advanced Search" interfaces only, while search fields are shown both in "Simple Search" and "Advanced Search" interface. (Although if you want to add a search option to the "Simple Search" interface, you can achieve it by creating appropriate HTML code in a portalbox.) The search options order, as well as the order of search option values, may be defined by means of 'move' arrows in the WebSearch Admin interface.

To add a new search option, a field name must first be chosen (for example "subject") and then a list of possible field values must be entered (for example "Mathematics", "Physics", "Chemistry", "Biology", etc). Note that if you want to add a new logical field or modify existing physical MARC tags for a logical field, you have to use the BibIndex Admin interface.

3.8 Modify sort options

You may define a list of logical fields that the end users will be able to choose for the sorting purposes. For example, "first author" or "year". If you don't select anything, a default list (author, title, year, etc) will be shown.

Note that if you want to add a new logical field or modify existing physical MARC tags for a logical field, you have to use the BibIndex Admin interface.

3.9 Modify rank options

To enable a certain rank method for a collection, select the method from the "enable rank method" box and add it. The documents in this collection will then be included in the ranking sets the next time the BibRank daemon runs. To disable a method the process is the same, but select the method from the "disable rank method" box.

Note that if you want to add a new ranking method or modify an existing one, you have to use the BibRank Admin interface.

3.10 Modify output formats

Each collection may have several output formats defined. The end users will be able to choose a format they want to see their search results list in. Most formats like HTML brief or XML Dublin Core are interesting for each collection, but some formats like HTML portfolio are only interesting for Photographs collection, not for Articles collection. The interface will permit you to choose the formats appropriate for a given collection. The order of formats can be changed using the 'move' arrows.

Note that if you want to add a new output format ('behaviour') or modify an existing one, you have to use the BibFormat Admin interface.

3.11 Configuration of related external collections

You can customize each collection to provide your users with an additional source of information external to your repository: in a book collection you might for example want to provide a link to Amazon items corresponding to the user's query. Furthermore, for some external services only, you can set the collection to display the results directly in the Invenio search results page.

The following settings are available:

Disabled
The external collection is not shown to the user.
See also
A link to the external collection listing the items corresponding to user's query is displayed (only once a query has been performed).
External search
User can ask to perform a search in parallel on your repository and on the external collection. Results are shown in the Invenio search results page. Not available for all external collections.
External search checked
Same as above, but the external collection is searched by default. Not available for all external collections.

You can also apply the settings to sub-collections, by checking the "Apply also to daughter collections" checkboxes when you apply your modifications.

Note that if you have defined an external hosted collection and are configuring its related external collections, there is no restriction on setting even the collection itself as "See also", "External search" or "External search checked", either directly or recursively via the "Apply also to daughter collections" option. It is entirely up to the admin to keep a clean and consistent installation (for more detailed information see section 4).

3.12 Detailed record page options

These settings let you define how the detailed view (such as https://cds.cern.ch/record/1) of records in this collection will look.
More details are available in the WebStyle admin guide.

Please note that since a record might belong to several collections, conflicts between collection settings might occur. This is especially true in the case of virtual collections. It is therefore the settings of the primary collection of the record which are applied.

4. External and Hosted Collections

External and hosted collections are a way to provide your users with additional sources of information. The simplest option is the "See also" one: it provides a link to the external collection listing the items corresponding to the user's query. Another option is to set up the external collection as an "External search [checked]" one. This option requires that a parser has been implemented for that external collection, and allows the user to perform a parallel search on your repository and on that of the external collection. Read more on how to set up the above options in section 3.11. Please also note that some external resources might be under copyright restrictions.

Another, more advanced option is the external hosted collection. The purpose of these collections is to behave just as if they were local ones. That means the admin should set them up as local collections and attach them to the tree. These collections, however, are not meant to store their records locally but rather to produce them on the fly when asked to. Once attached to the tree, an external hosted collection appears in the search home page along with its number of records and a small graphic (an arrow in this case) to indicate that it is external.

The admin should define a new external collection (any of the above options) starting with the websearch_external_collections_config.py file, which consists basically of a python dictionary. Let us go through the process of defining a new external collection, starting from the dictionary:

  • add a new key:value pair to the dictionary. The key is the name of the external collection (eg. Amazon Books). The value is another python dictionary with the parameters of the external collection. Let's go through these parameters in key:value pairs:

    • 'engine':the_name_of_engine
      The name of the search engine (no spaces or special characters allowed) and of its implemented python class (eg. for the 'AmazonBooks' engine the corresponding class should be named AmazonBooksSearchEngine). If not defined, the default ExternalSearchEngine class will be used.
    • 'base_url':the_base_url_of_the_external_collection
      The base url of the external collection, used to create actual hyper references to the external collection (eg. 'http://books.amazon.com/' , 'http://www.amazon.com/books/').
    • 'search_url':the_search_url_of_the_external_collection
      The search url of the external collection, to which the search terms will be later appended and therefore looked up (eg. 'http://books.amazon.com/search.php?title=' , 'http://www.amazon.com/books/lookup.asp?book=').
    • 'parser_params':dictionary_of_the_parameters_of_the_parser
      The parameters to be passed to the parser. This way a parser can be dynamically reused for different external collections upon defining different settings. Let's go through the various parameters:

      • 'host':the_host_of_the_external_collection
        The host of the external collection is used to correct the urls when printing out its results (eg. 'books.amazon.com', 'www.amazon.com').
      • 'path':the_path_on_the_host_of_the_external_collection
        The path, along with the host of the external collection, is used to correct the urls when printing out its results (eg. '', 'books/').
      • 'parser':the_actual_parser_class
        The actual parser class to be used by the external collection engine. It should be imported at the beginning of this configuration file (eg. AmazonBooksExternalCollectionResultsParser, AmazonExternalCollectionResultsParser).
      • 'fetch_format':the_format_to_be_used_to_fetch_data
        Usually an abbreviated string that defines the format in which the data should be fetched. The parser must be able to parse this format (eg. 'hb', 'xm').
      • 'num_results_regex_str':the_regular_expression_for_the_number_of_results
        The regular expression used to calculate the returned number of results when the external collection is queried (eg. r'([0-9,]+?) records found'). Should preferably be a python raw string.
      • 'nbrecs_regex_str':the_regular_expression_for_the_total_number_of_records
        The regular expression used to calculate the total number of records of an external collection (eg. r'Searching ([0-9,]+?) records in total'). This is to be used by external hosted collections that present their total number of records in the search home page. Should preferably be a python raw string.
      • 'nbrecs_url':the_url_that_provides_the_total_number_of_records
        The url that provides information on the total number of records of an external collection (eg. 'http://books.amazon.com/search.php?show_all=yes'). The regular expression defined above will be used on the contents of this url. Again, this is to be used by external hosted collections that present their total number of records in the search home page.
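To make the parameter list above more concrete, here is a hedged Python sketch of what one entry could look like. Everything in it is illustrative: the dictionary name external_collections, the "Example Books" collection, the books.example.org URLs and the placeholder parser class are assumptions, not a real service or the actual contents of websearch_external_collections_config.py. The parameters that only concern external hosted collections (total number of records) are left out for brevity.

```python
import re
from urllib.parse import quote_plus


class ExampleBooksExternalCollectionResultsParser:
    """Placeholder standing in for a real results parser class (hypothetical)."""


# Hypothetical dictionary with a single external collection entry.
external_collections = {
    "Example Books": {
        # Would correspond to a class named ExampleBooksSearchEngine.
        "engine": "ExampleBooks",
        "base_url": "http://books.example.org/",
        "search_url": "http://books.example.org/search?title=",
        "parser_params": {
            "host": "books.example.org",
            "path": "",
            "parser": ExampleBooksExternalCollectionResultsParser,
            "fetch_format": "hb",
            "num_results_regex_str": r"([0-9,]+?) records found",
        },
    },
}

entry = external_collections["Example Books"]

# The search terms are appended to the search url.
full_search_url = entry["search_url"] + quote_plus("quantum mechanics")
print(full_search_url)

# The regular expression extracts the number of results from a results page.
sample_page = "1,234 records found"
match = re.search(entry["parser_params"]["num_results_regex_str"], sample_page)
num_results = int(match.group(1).replace(",", ""))
print(num_results)
```

Using a raw string for the regular expression, as the guide recommends, avoids having to double any backslashes that a more elaborate pattern might need.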

Once the dictionary key:value pair has been added for the new external collection, the admin should implement (or simply use, if already implemented) the search engine python class defined for this external collection. For the "See also" option the above steps are sufficient. If the admin wants to enable the "External search [checked]" option as well, a parser must be (or have been) implemented. Finally, to set up an external hosted collection, the admin also has to create a new local collection named exactly as the key of the external hosted collection's key:value pair in the python dictionary. The new local collection's query has to begin with hostedcollection: (under the current configuration it is sufficient for the query of any external hosted collection to be defined just as hostedcollection:) and the collection itself has to be attached to the tree to be visible in the search home page. Note that, due to the nature of external hosted collections, their corresponding local collections cannot have any other collections as sons; in other words, they shouldn't have any other branches growing from them.

5. Webcoll Status

WebColl is the daemon that normally runs periodically via BibSched and updates the collection cache with the collection parameters configured in the previous sections. As an alternative to running webcoll via BibSched, you can also run it at any time from the command line, either for all collections or for a selected collection only. See the --help option.

The WebSearch Admin interface has got a WebColl Status menu that shows when the collection cache was last updated and when the next update is scheduled. It warns in case something suspicious was discovered.

6. Collections Status

The Collection Status menu of the WebSearch Admin interface shows the list of all collections and checks if there is anything wrong regarding configuration of collections, together with the languages the collection name has been translated into, etc. Here is the detailed explanation of the functionality:

ID
ID of the collection.
Name
Name of the collection.
Query
The collection definition query. Note that it should be empty if a collection has subcollections. If not, then a query is needed.
Subcollections
The subcollections that the collection is composed of. Note that a collection which is defined by a query should not have any subcollections.
Restricted
A restricted collection can only be accessed by users belonging to the Apache groups mentioned in this column.
Hosted
A hosted collection is practically an external one behaving just as if it were local.
I18N
Show which languages the collection name has been translated into.
Status
If no error was found, OK is displayed for the collection. If an error was found, then an error number and a short message are shown. The meaning of the error messages is the following: 1:Conflict means that the collection was defined both via a query and via subcollections; 2:Empty means that the collection was defined neither via a query nor via subcollections.
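The Status column logic described above can be sketched in a few lines of Python. This is only an illustration of the two error conditions; the function collection_status is hypothetical and is not the code used by the WebSearch Admin interface.

```python
def collection_status(query, subcollections):
    """Return the status string for a collection definition.

    query:          the collection query ('' or None if not defined)
    subcollections: list of subcollection names (possibly empty)
    """
    if query and subcollections:
        return "1:Conflict"   # defined via a query and via subcollections
    if not query and not subcollections:
        return "2:Empty"      # defined neither via a query nor via subcollections
    return "OK"


print(collection_status("collection:THESIS", []))
print(collection_status("", ["Articles", "Preprints"]))
print(collection_status("collection:ARTICLE", ["Preprints"]))
print(collection_status("", []))
```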

7. Check External Collections

The Check External Collections menu of the WebSearch Admin interface is a simple tool to check and control the consistency of the external collections the admin has defined. External collections exist both in their own database table and in a user-defined configuration file. This tool checks the consistency between the two and reports back, giving the option to fix any inconsistencies found.
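Conceptually, such a consistency check amounts to comparing two sets of names. The following Python sketch illustrates the idea under the assumption that external collections are identified by name in both places; the function compare_external_collections and the sample names are hypothetical and do not reflect the actual implementation.

```python
def compare_external_collections(db_names, config_names):
    """Report external collections present in only one of the two places."""
    db, cfg = set(db_names), set(config_names)
    return {
        "only_in_database": sorted(db - cfg),
        "only_in_config": sorted(cfg - db),
    }


# Made-up names standing in for the database table and the configuration file.
report = compare_external_collections(
    db_names=["Amazon Books", "Old Service"],
    config_names=["Amazon Books", "New Service"],
)
print(report)
```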

8. Edit Search Engine Parameters

9. Search Engine Cache

10. Search Services

Search services are meant to display information contextual to a search query in a very specialized way, in the sense that they can search, retrieve and display data beyond the traditional concept of records. Typical search services could for example include:

  • Spell-check user queries by calling an external spellchecking library, and offering "Did you mean ...?" options.
  • Parse user input and display an author profile when searching for a well-defined author.
  • Search for submission names matching the user input.
  • Retrieve phone number from the institutional LDAP.
  • Etc.

Search services are displayed (in addition) just before the results returned by the standard Invenio search engine. They can be enabled by dropping a plug-in file at /opt/invenio/lib/python/invenio/search_services/. A few sample search services are included by default in Invenio. You can see which are installed on the "8. Search services" page of the WebSearch Admin interface. Additional plug-ins (not installed by default) might be available in the Invenio package at /modules/websearch/lib/search_services/.

More information about search services might be found in each plug-in file. An advanced Search Services hacking guide is also available.

11. Additional Information

WebSearch Internals