CERN Accelerating science

OAIHarvest Admin Guide

Contents

1. Overview
2. OAI Data Harvesting
       2.1 OAIHarvest Admin Interface
       2.2 oaiharvest commmand-line tool

1. Overview

The OAI Harvest module handles metadata gathering between OAI-PMH v.2.0 compliant repositories. Metadata exchange is performed on top of the OAI-PMH, the Open Archives Initiative's Protocol for Metadata Harvesting. The OAI Harvest Admin Interface can be used to set up a database of OAI sources and open your repository for OAI harvesting.

2. OAI Data Harvesting

In order to periodically harvest metadata from one or several repositories, it is possible to organize OAI sources through the OAI Harvest Admin Interface. The interface allows the administrator to add new repositories as well as edit and delete existing ones. Once defined, run the oaiharvest command-line tool to start periodical harvesting of these repositories based on your settings.

(Note that you can alternatively use the oaiharvest command-line tool to perform a manual harvesting of a repository independently of the settings in OAI Harvest admin interface, and save the result into a file. This lets you specify the OAI arguments for the harvesting (verb, dates, sets, metadata format, etc.). You would then need to run bibconvert and bibupload to integrate the harvested records into your repository. This method can be useful to play with repositories, or for writing custom harvesting scripts, but it is not recommended for periodical harvesting. Run oaiharvest -h for additional information.)

2.1 OAI Harvest Admin Interface

Add OAI sources

The first step requires the administrator to enter the baseURL of the OAI repository. This is done for validation purposes - i.e. to check that the baseURL actually points to an OAI-compliant repository. (Note: the validation simply performs an 'Identify' query to the baseURL and parses the reply with crucial tags such as OAI-PMH and Identify).
Once the baseURL is validated, the administrator is required to fill into the following fields:

Edit and delete OAI sources

Once a source has been added to the database, it will be visible in the overview page, as shown below

name baseURL metadataprefix frequency bibconvertfile postprocess actions
cds https://cds.cern.ch/oai2d marcxml daily NULL h-u edit / delete / test / history / harvest

At this point it will be possible to edit the definition of this source by clicking on the appropriate action button. All the fields described in 2.2.1 can be modified (except for the Starting date). There is no more validation at this stage, hence, please take extra care when editing important fields such as baseurl and metadataprefix.
OAI repositories can be removed from the database by clicking on the appropriate action button in the overview page.

Testing OAI sources

This interface provides the possibility to test OAI harvesting settings. In order to see the harvesting and postprocessing results, administrator has to provide record identifier (insige the harvested OAI source) and click "test" button. After doing this, two new frames will appear. The upper contatins original OAI XML harvested from the source. The second contains the result of all the postprocessing activities or an error message.

Viewing the harvesting history

This page allows the administrator to see which records have been recently harvested. All the data is shown in database insertion order. If there was more than 10 inserted records during a day, only first 10 will be displayed. In order to do manipulations connected with the rest, the administrator has to proceed to day details page which is available by "View next entries..." link.

The same page provides the possibility of reharvesting the records present in the past. All have to be done in order to achieve this is selecting appropriate records by checking the checkboxes on the right side of the entry and clicking "reharvest selected records" button.

Harvesting particular records

This page provides the possibility of harvesting records manually. The administrator has to provide internal OAI source identifier. After this, record will be harvested, converted and filtered according to the source settings and scheduled to be uploaded into the database.

2.2 oaiharvest commmand-line tool

Once administrators have set up their desired OAI repositories in the database through the Admin Interface they can invoke oaiharvest to start up periodical harvesting.

Oaiharvest usage

oaiharvest [options]

Manual single-shot harvesting mode:
  -o, --output         specify output file
  -v, --verb           OAI verb to be executed
  -m, --method         http method (default POST)
  -p, --metadataPrefix metadata format
  -i, --identifier     OAI identifier
  -s, --set            OAI set(s). Whitespace-separated list
  -r, --resuptionToken Resume previous harvest
  -f, --from           from date (datestamp)
  -u, --until          until date (datestamp)
  -c, --certificate    path to public certificate (in case of certificate-based harvesting)
  -k, --key            path to private key (in case of certificate-based harvesting)
  -l, --user           username (in case of password-protected harvesting)
  -w, --password       password (in case of password-protected harvesting)

Automatic periodical harvesting mode:
  -r, --repository="repo A"[,"repo B"] 	 which repositories to harvest (default=all)
  -d, --dates=yyyy-mm-dd:yyyy-mm-dd 	 reharvest given dates only

Scheduling options:
  -u, --user=USER	User name under which to submit this task.
  -t, --runtime=TIME	Time to execute the task. [default=now]
			Examples: +15s, 5m, 3h, 2002-10-27 13:57:26.
  -s, --sleeptime=SLEEP	Sleeping frequency after which to repeat the task.
			Examples: 30m, 2h, 1d. [default=no]
  -L  --limit=LIMIT	Time limit when it is allowed to execute the task.
			Examples: 22:00-03:00, Sunday 01:00-05:00.
			Syntax: [Wee[kday]] [hh[:mm][-hh[:mm]]].
  -P, --priority=PRI	Task priority (0=default, 1=higher, etc).
  -N, --name=NAME	Task specific name (advanced option).

General options:
  -h, --help		Print this help.
  -V, --version		Print version information.
  -v, --verbose=LEVEL	Verbose level (0=min, 1=default, 9=max).
      --profile=STATS	Print profile information. STATS is a comma-separated
			list of desired output stats (calls, cumulative,
			file, line, module, name, nfl, pcalls, stdname, time).

oaiharvest performs a number of operations on the repositories listed in the database. By default oaiharvest considers all repositories, one by one (this gets overridden when --repository argument is passed).
For each repository that is considered, oaiharvest behaves according to the arguments passed at the command line:
Oaiharvest usage examples

In most cases, administrators will want oaiharvest to run in the background, i.e. run in sleep mode and wake up periodically (e.g. every 24 hours) to check whether updates are needed:

$ oaiharvest -s 24h

In other cases, administrators may want to perform periodical harvesting only on specific sources:

$ oaiharvest -r cds -s 12h

Another option is that administrators may want to harvest from certain repositories within two specific dates. This will be regarded as a one-off operation and will not affect the last update value of the source:

$oaiharvest -r cds -d 2005-05-05:2005-05-30