WebSearch Admin Guide
WARNING: THIS ADMIN GUIDE IS NOT FULLY COMPLETED
This Admin Guide is not yet complete. Moreover, some admin-level functionality for this module exists only in the form of manual recipes. We are in the process of developing both the guide and the web admin interface. If you would like to see specific things implemented with high priority, please contact us at cds.support@cern.ch. Thanks for your interest!
The WebSearch Admin interface helps you configure the search collections that end users see. Its functionality can be divided into four parts: (i) how to organize collections into a collection tree; (ii) how to define and edit collection parameters; (iii) how to update the collection cache via the webcoll daemon; and (iv) how to influence the search engine behaviour and set various search engine parameters. These topics are described in turn in the rest of this guide.
The metadata corpus in Invenio is organized into collections, and the collections are organized in a tree. The collection tree is what end users see when they start navigating the CERN Document Server. It is similar to what other sites call Web directories, which organize the Web into topical categories, such as Google Directory.
Note that Invenio permits every collection in the tree to have either "regular" or "virtual" sons. In other words, every node in the collection tree may see either regular or virtual branches growing out of it. This makes it possible to create a tree with very complex, multi-level, nested structures of regular and virtual branches, if needed, with the aim of easing end-user navigation from one branch to another. The difference between a regular and a virtual branch is explained in detail below, in section 2.2.
To add a new collection, enter its default name in the default language of the installation and click on the ADD button to add it. There are two important actions that you have to perform after adding a collection:
After you edit these two things, the collection is fully usable for the search interface. It will appear in the search interface after the next run of the WebColl Daemon.
However, you will probably want to customize further things, such as defining collection name translations in various languages, collection web page portalboxes, search options, etc., as explained in this guide under the section Edit Collection Parameters.
To attach a collection to the tree, first choose which collection you want to attach, then choose the father collection to attach it to, and then choose the type of fathership relation between them (regular or virtual).
The difference between the regular and the virtual relationship goes as follows:
                M u l t i m e d i a
Narrow by Collection:     Focus on:
--------------------      ---------
[ ] Photos                University Multimedia Service
[ ] Videos                BBC Pictures and Videos
It is important to note that if a collection A is composed of B and C as its regular sons, and offers X and Y as its virtual sons, then every document belonging to A must also belong to either B or C. This requirement does not apply to X and Y, because X and Y offer only an orthogonal "focus-on" view of a (possibly small) part of the document corpus of A. When end users search collection A, they are actually searching inside B and C, not X and Y. If they want to search inside X or Y, they have to click on X or Y first. One can consider virtual branches a non-essential searching aid that comes into play only when end users are interested in a particular "focus-on" view of A.
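The regular/virtual distinction can be illustrated with a small model. This is only a minimal sketch and not Invenio code: the Collection class, its attribute names and the integer record IDs are all hypothetical, chosen to mirror the Multimedia example above.

```python
from dataclasses import dataclass, field


@dataclass
class Collection:
    """Hypothetical model of a node in the collection tree (not Invenio's API)."""
    name: str
    records: set = field(default_factory=set)          # record IDs held directly
    regular_sons: list = field(default_factory=list)   # "Narrow by collection"
    virtual_sons: list = field(default_factory=list)   # "Focus on"

    def searchable_records(self):
        """Records searched when an end user searches this collection.

        Only regular sons contribute; virtual sons are ignored until the
        user clicks on one of them explicitly.
        """
        if not self.regular_sons:
            return set(self.records)
        found = set()
        for son in self.regular_sons:
            found |= son.searchable_records()
        return found


photos = Collection("Photos", records={1, 2, 3})
videos = Collection("Videos", records={4, 5})
bbc = Collection("BBC Pictures and Videos", records={2, 5})
multimedia = Collection(
    "Multimedia", regular_sons=[photos, videos], virtual_sons=[bbc]
)

print(sorted(multimedia.searchable_records()))  # [1, 2, 3, 4, 5]
print(sorted(bbc.searchable_records()))         # [2, 5]
```

In this toy setup the virtual son only offers another view on records that the regular sons already cover, which is why removing it would not change what a search in Multimedia returns.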
To modify the existing tree via the WebSearch Admin interface, click on the icons displayed next to the collections. The meaning of the icons is as follows:
- Remove the chosen collection with its subcollections from the collection tree, but do not delete the collection itself. (For full deletion of a collection, see section 3.4.)
- Move the chosen collection up or down among its brothers and sisters, i.e. change the order of collections inside the same level of the tree.
- Move the chosen collection among branches of the tree. Press the first icon to choose a collection to move, and the second icon to select a new father collection that the chosen collection should be attached to.
To finalize the setup of a collection, you can and should edit many parameters, such as the list of records belonging to the collection, the search fields, the search interface page portalboxes, etc. This section describes in turn all the possibilities as they are presented in the Edit Collection pages of the WebSearch Admin interface.
The collection query defines which documents belong to the given collection. It is equal to the search term that retrieves all documents belonging to the given collection, exactly as you would have typed it into the search interface. For example, to define a collection of all papers written by Ellis, you could set up your collection query to be author:Ellis.
Usually, the collection query is chosen on the basis of the collection identifier that we store in MARC tag 980. This tag is indexed in a logical field called collection, so that a collection of Theses could be defined via collection:THESIS, supposing that every thesis metadata record has the text THESIS in MARC tag 980.
(Nitpick: we use the term `collection' in two contexts here: once as a collection of metadata documents, and once as a logical field name. We should probably have called the latter collectionidentifier or some such instead, but we hope the difference is clear from the context.)
If a collection does not have any collection query defined, then its content is defined by means of the content of its descendants (subcollections). This is the case for composed collections. For example, the composed collection Articles & Preprints (no query defined) will be defined as a father of Articles (query: collection:ARTICLE) and Preprints (query: collection:PREPRINT). In this case the collection query for Articles & Preprints can stay empty.
Note that you should avoid defining a non-empty collection query when the collection has descendants, since the query will prevail and the descendants may not be taken into account. In the same way, if a collection has neither a query nor any descendants defined, then its contents will be empty.
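The rules above can be summarized in a few lines of code. The following is a minimal sketch under stated assumptions, not the webcoll implementation: the toy corpus, the run_query stand-in (which understands only collection:VALUE) and the collection_records helper are all hypothetical. The sketch also assumes that a non-empty query always overrides the descendants, whereas the guide only says that the descendants "may not" be taken into account.

```python
# Toy corpus: record ID -> collection identifier (as stored in MARC tag 980).
CORPUS = {1: "ARTICLE", 2: "ARTICLE", 3: "PREPRINT", 4: "THESIS"}


def run_query(query):
    """Toy stand-in for the search engine: supports only 'collection:VALUE'."""
    field, _, value = query.partition(":")
    if field != "collection":
        raise ValueError("this sketch only understands collection:VALUE")
    return {recid for recid, coll in CORPUS.items() if coll == value}


def collection_records(name, queries, sons):
    """Resolve the records of a collection following the rules above."""
    query = queries.get(name, "")
    if query:
        # Assumption: a non-empty query prevails and descendants are skipped.
        return run_query(query)
    records = set()
    for son in sons.get(name, []):
        records |= collection_records(son, queries, sons)
    return records  # empty when there is neither a query nor descendants


QUERIES = {"Articles": "collection:ARTICLE", "Preprints": "collection:PREPRINT"}
SONS = {"Articles & Preprints": ["Articles", "Preprints"]}

print(sorted(collection_records("Articles & Preprints", QUERIES, SONS)))  # [1, 2, 3]
print(collection_records("Orphan", QUERIES, SONS))                        # set()
```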
To define an external hosted collection, set up the query to begin with hostedcollection: (for more detailed information see section 4).
To remove the collection query, set the parameter empty.
Until Invenio-0.92.1 it was possible to restrict a collection directly by specifying an Apache group. Users who had an Apache user name and password belonging to the given group were able to access the restricted collection.
Collection restriction management is now integrated with the wider Role Based Access Control facility of Invenio.
To restrict access to a collection, you just have to create at least one authorization for the action viewrestrcoll, specifying the name of the collection as the parameter.
If you have just upgraded your installation from CDS Invenio-0.92.1, you have probably run the collection_restrictions_migration_kit.py tool in order to migrate to the new framework. For every Apache group with access to a restricted collection, a role will be created with the proper authorization to access the restricted collections. Each role will have a FireRole definition that allows the given Apache group. Through the WebAccess admin interface you will then be able to change these definitions in order to gradually migrate your restrictions to whatever suits your needs.
You may define translations of collection names into the languages of your Invenio installation. Moreover, a collection name may be different in different contexts (e.g. long name, short name, etc), so that prior to modifying translations you will be asked to select which name type you want to change.
The interface also lets you customize the labelling (and translations) of the default collection boxes: "Focus on:", "Narrow by:" and "Latest additions:".
Defining translations is not mandatory. If a translation does not exist in the language chosen by the end user, the end user will be shown the collection name in the default language of the installation.
Note also that the list of available languages depends on the compile-time configuration (see the general invenio.conf file).
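The fallback to the default language can be pictured as a simple lookup. This is a minimal sketch, not the Invenio implementation: the translated_name helper, the "long" and "short" name type labels, the choice of "en" as the default language and the sample translations are all made up for illustration.

```python
def translated_name(names, lang, name_type="long", default_lang="en"):
    """Look up a collection name, falling back to the default language.

    `names` maps (name_type, language) pairs to display names; a name in
    the default language is assumed to exist for every name type.
    """
    return names.get((name_type, lang), names[(name_type, default_lang)])


names = {
    ("long", "en"): "Articles & Preprints",
    ("long", "fr"): "Articles et prépublications",
    ("short", "en"): "Articles",
}

print(translated_name(names, "fr"))           # Articles et prépublications
print(translated_name(names, "de"))           # Articles & Preprints
print(translated_name(names, "fr", "short"))  # Articles
```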
The collection to be deleted must be first removed from the collection tree. Any metametadata associated with the collection (such as association to portalboxes, association to records belonging to this collection, etc) will be lost, but the metadata itself will be preserved (such as portalboxes themselves, records themselves, etc). In total, association to records, output formats, translations, search options, sort options, search fields, ranking method, and access restriction will be lost. Use with care!
It may be a good idea only to remove the collection from the end users' interface, keeping it "hidden" in a corner that they don't see and can't search when they search from Home. To achieve this, do not delete the collection but simply remove it from the collection tree so that it is not attached to any father collection. In this case the search interface page for this collection will stay updated, but the collection will be neither shown in the tree nor searchable from the Home page. It will only be accessible via a bookmarked URL, for example.
The search interface HTML page for a given collection may be customized by what we call portalboxes. Portalboxes are used to show various kinds of information to the end user, such as a text box with some inline help information about the given collection, an illustrative picture, etc.
To create a new portalbox, a title and a body must be given, where the body can contain HTML if necessary.
To add a portalbox to the collection, you must choose an existing portalbox, the language for which the portalbox should be shown, the position of the portalbox on the screen, and the ordering score of portalboxes.
The search field is a logical field (such as author, title, etc) that will be proposed to the end users in Simple and Advanced Search interface pages. If you do not set any search fields for a collection, then a default list (author, title, year, etc) will be shown.
Note that if you want to add a new logical field or modify existing physical MARC tags for a logical field, you have to use the BibIndex Admin interface.
The search option is like the search field in that it permits end users to narrow down their search to some logical field such as "subject". Unlike with the search field, however, users are not required to type their query in free text form; rather, the search interface proposes several predefined values, prepared by the administrators, that the end users may choose from. For example, an "author search" is a good example of search field usage: there are plenty of author names to be matched, so end users would usually type the name they wish to find in free text form. A "subject search" is a good example of search option usage: there is usually a limited number of subjects in the system, given by the local subject classification scheme, which the end users do not necessarily know about and which they can choose from a list. As a rule of thumb, the search field concept covers the case of an unlimited number of distinct values to be matched in a given field (e.g. author, title, keyword), while the search option concept covers the case of only a handful or so of distinct values to be matched in a given field (e.g. subject, division, year).
Search options are shown in the "Advanced Search" interfaces only, while search fields are shown both in "Simple Search" and "Advanced Search" interface. (Although if you want to add a search option to the "Simple Search" interface, you can achieve it by creating appropriate HTML code in a portalbox.) The search options order, as well as the order of search option values, may be defined by means of 'move' arrows in the WebSearch Admin interface.
To add a new search option, a field name must first be chosen (for example "subject") and then a list of possible field values must be entered (for example "Mathematics", "Physics", "Chemistry", "Biology", etc). Note that if you want to add a new logical field or modify existing physical MARC tags for a logical field, you have to use the BibIndex Admin interface.
You may define a list of logical fields that the end users will be able to choose for the sorting purposes. For example, "first author" or "year". If you don't select anything, a default list (author, title, year, etc) will be shown.
Note that if you want to add a new logical field or modify existing physical MARC tags for a logical field, you have to use the BibIndex Admin interface.
To enable a certain rank method for a collection, select the method from the "enable rank method" box and add it. The documents in this collection will then be included in the ranking sets the next time the BibRank daemon runs. To disable a method the process is the same, but select the method from the "disable rank method" box.
Note that if you want to add a new ranking method or modify an existing one, you have to use the BibRank Admin interface.
Each collection may have several output formats defined. The end users will be able to choose the format in which they want to see their search results list. Most formats, like HTML brief or XML Dublin Core, are of interest for every collection, but some formats, like HTML portfolio, are of interest only for a Photographs collection and not for an Articles collection. The interface lets you choose the formats appropriate for a given collection. The order of formats can be changed using the 'move' arrows.
Note that if you want to add a new output format ('behaviour') or modify an existing one, you have to use the BibFormat Admin interface.
You can customize each collection to provide your users with an additional source of information external to your repository: in a book collection, for example, you might want to provide a link to the Amazon items corresponding to the user's query. Furthermore, for some external services only, you can set the collection to display the results directly in the Invenio search results page.
The following settings are available:
You can also apply the settings to sub-collections, by checking the "Apply also to daughter collections" checkboxes when you apply your modifications.
Note that if you have defined an external hosted collection and are configuring its related external collections, nothing prevents you from setting the collection itself as "See also", "External search" or "External search checked", either directly or recursively via the "Apply also to daughter collections" option. It is entirely up to the admin to keep the installation clean and consistent (for more detailed information see section 4).
These settings let you define what the detailed view (such as https://cds.cern.ch/record/1) of records in this collection will look like.
More details are available in the WebStyle admin guide.
Please note that since a record might belong to several collections, conflicts between collection settings might occur. This is especially true in the case of virtual collections. It is therefore the settings of the primary collection of the record which are applied.
External and hosted collections are a way to provide your users with additional sources of information. The simplest option is "See also": it provides a link to the external collection listing the items corresponding to the user's query. Another option is to set up the external collection as an "External search [checked]". This option requires a parser implemented for that external collection and allows the user to perform a parallel search on your repository and on that of the external collection. Read more on how to set up the above options in section 3.11. Please also note that some external resources might be under copyright restrictions.
Another, more advanced option is external hosted collections. The purpose of these collections is to behave just as if they were local ones. That means the admin should set them up as local collections and attach them to the tree. These collections, however, are not meant to store their records locally but rather to produce them on the fly when asked to. Once attached to the tree, an external hosted collection appears in the search home page along with its number of records and a small graphic (an arrow in this case) to indicate that it is external.
The admin should define a new external collection (any of the above options) starting with the websearch_external_collections_config.py file, which consists basically of a Python dictionary. Let us go through the process of defining a new external collection, starting from the dictionary:
- Add a new key:value pair to the dictionary. The key is the name of the external collection (e.g. Amazon Books). The value is another Python dictionary with the parameters of the external collection. Let's go through these parameters in key:value pairs:
  - 'engine': the_name_of_engine
  - 'base_url': the_base_url_of_the_external_collection
  - 'search_url': the_search_url_of_the_external_collection
  - 'parser_params': dictionary_of_the_parameters_of_the_parser
  - 'host': the_host_of_the_external_collection
  - 'path': the_path_on_the_host_of_the_external_collection
  - 'parser': the_actual_parser_class
  - 'fetch_format': the_format_to_be_used_to_fetch_data
  - 'num_results_regex_str': the_regular_expression_for_the_number_of_results
  - 'num_results_regex_str': the_regular_expression_for_the_total_number_of_records
  - 'nbrecs_url': the_url_that_provides_the_total_number_of_records
Once the dictionary key:value pair has been added for the new external collection, the admin should implement (or simply use, if already implemented) the search engine Python class defined for this external collection. For the "See also" option the above steps are sufficient. If the admin wants to enable the "External search [checked]" option as well, a parser must be (or have been) implemented. Finally, to set up an external hosted collection, the admin also has to create a new local collection named exactly as the key of the external hosted collection's key:value pair in the Python dictionary. The new local collection's query has to begin with hostedcollection: (under the current configuration it is sufficient for the query of any external hosted collection to just be defined as hostedcollection:) and the collection itself has to be attached to the tree to be visible in the search home page. Note that due to the nature of external hosted collections, their corresponding local collections cannot have any other collections as sons; in other words, they shouldn't have any other branches growing from them.
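To make the shape of such an entry more concrete, here is a minimal sketch. Only the key names come from the parameter list above; the variable name external_collections, the "Example Library" collection, the ExampleResultsParser placeholder class, every URL and value, and the two helper functions are hypothetical. Nesting the parser-related keys under 'parser_params' is an assumption suggested by that parameter's description, and because the list above gives the same key name for both regular expressions, the sketch keeps a single 'num_results_regex_str' entry.

```python
import re


class ExampleResultsParser:
    """Placeholder standing in for a real parser class (hypothetical)."""


# Hypothetical entry: the collection name and every value are made up.
external_collections = {
    "Example Library": {
        "engine": "ExampleEngine",
        "base_url": "https://library.example.org/",
        "search_url": "https://library.example.org/search?p=",
        "parser_params": {
            "host": "library.example.org",
            "path": "",
            "parser": ExampleResultsParser,
            "fetch_format": "xm",
            "num_results_regex_str": r"Total results: ([0-9,]+)",
            "nbrecs_url": "https://library.example.org/search?p=",
        },
    },
}


def is_hosted_query(query):
    """True if a local collection query marks the collection as hosted."""
    return query.startswith("hostedcollection:")


def parse_result_count(page, parser_params):
    """Extract the number of results from a page using the configured regex."""
    match = re.search(parser_params["num_results_regex_str"], page)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


params = external_collections["Example Library"]["parser_params"]
print(is_hosted_query("hostedcollection:"))                # True
print(parse_result_count("Total results: 1,234", params))  # 1234
```

How Invenio itself consumes these parameters is not shown here; the helpers only illustrate what a regular expression for the number of results and a hostedcollection: query prefix might be used for.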
WebColl is the daemon that normally runs periodically via BibSched and updates the collection cache with the collection parameters configured in the previous section. As an alternative to running webcoll via BibSched, you can also run it at any time from the command line, either for all collections or for selected collections only. See the --help option.
The WebSearch Admin interface has got a WebColl Status menu that shows when the collection cache was last updated and when the next update is scheduled. It warns in case something suspicious was discovered.
The Collection Status menu of the WebSearch Admin interface shows the list of all collections and checks if there is anything wrong regarding configuration of collections, together with the languages the collection name has been translated into, etc. Here is the detailed explanation of the functionality:
- ID
- ID of the collection.
- Name
- Name of the collection.
- Query
- The collection definition query. Note that it should be empty if a collection has subcollections. If it has none, then a query is needed.
- Subcollections
- The subcollections that the collection is composed of. Note that a collection defined by a query should not have any subcollections.
- Restricted
- A restricted collection can only be accessed by users belonging to the Apache groups mentioned in this column.
- Hosted
- A hosted collection is practically an external one behaving just as if it were local.
- I18N
- Show which languages the collection name has been translated into.
- Status
- If no errors were found, OK is displayed for each collection. If an error was found, then an error number and a short message are shown. The meaning of the error messages is the following: 1:Conflict means that the collection was defined both via a query and via subcollections; 2:Empty means that the collection was defined neither via a query nor via subcollections.
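The two error conditions in the Status column amount to a simple check on the Query and Subcollections columns. The following is a minimal sketch of that check, not the code used by the admin interface; the collection_status function and its arguments are hypothetical, while the returned strings follow the messages described above.

```python
def collection_status(query, subcollections):
    """Return the status string for one collection, per the rules above."""
    has_query = bool(query.strip())
    has_sons = bool(subcollections)
    if has_query and has_sons:
        return "1:Conflict"  # defined both via a query and via subcollections
    if not has_query and not has_sons:
        return "2:Empty"     # defined neither via a query nor via subcollections
    return "OK"


print(collection_status("collection:THESIS", []))              # OK
print(collection_status("", ["Articles", "Preprints"]))        # OK
print(collection_status("collection:ARTICLE", ["Preprints"]))  # 1:Conflict
print(collection_status("", []))                               # 2:Empty
```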
The Check External Collections menu of the WebSearch Admin interface is a simple tool to check and control the consistency of the external collections the user has defined. External collections exist both in their own database table and in a user-defined configuration file. This tool checks the consistency between the two and reports back to the user, giving them the option to fix any inconsistencies.
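Conceptually, such a consistency check compares two sets. The sketch below assumes that external collections are matched by name, which the guide does not state; the compare_external_collections function and the sample names "Old Catalogue" and "Example Library" are hypothetical and do not reflect how the tool is actually implemented.

```python
def compare_external_collections(database_names, config_names):
    """Report names present on one side but not on the other."""
    in_db, in_cfg = set(database_names), set(config_names)
    return {
        "only_in_config": sorted(in_cfg - in_db),
        "only_in_database": sorted(in_db - in_cfg),
    }


report = compare_external_collections(
    database_names=["Amazon Books", "Old Catalogue"],
    config_names=["Amazon Books", "Example Library"],
)
print(report)
# {'only_in_config': ['Example Library'], 'only_in_database': ['Old Catalogue']}
```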
Search services are meant to display information contextual to a search query in a very specialized way, in the sense that they can search, retrieve and display data beyond the traditional concept of records. Typical search services could for example include:
Search services are displayed in addition to, and just before, the results returned by the standard Invenio search engine. They can be enabled by dropping a plug-in file at /opt/invenio/lib/python/invenio/search_services/. A few sample search services are included by default in Invenio. You can see which are installed in the WebSearch Admin interface, on the "8. Search services" page. Additional plug-ins (not installed by default) might be available in the Invenio package at /modules/websearch/lib/search_services/.
More information about search services might be found in each plug-in file. An advanced Search Services hacking guide is also available.