

BibSort Internals

BibSort's main goal is to make the sorting of search results faster. It does this by creating several buckets that are then loaded by the search_engine and cached. The following happens for each sorting method (which can be either a ranking method or a sorting field / list of MARC tags):

I. Rebalancing the buckets (reconstructing the buckets):
- (1) the data for all the records is loaded into a dictionary
- (2) based on the washer method (if any is defined in the cfg file) the data is adjusted (dates normalized, leading articles removed, ...)
- (3) the data dictionary (containing recid:value) is then sorted by value
- (4) after the sorting, each record receives a weight (it can be the index in the sorted list or a multiple of the index)
- (5) based on the sorted list of recids, the buckets are created by splitting this list into CFG_BIBSORT_BUCKETS parts
- (6) the maximum weight per bucket is also stored in the database, along with each bucket
- (7) everything is stored in the database
(A sketch of this rebalancing step is given after the outlines below.)

II. Updating the buckets (can run every 5 minutes, to update the buckets with the latest changes; it is probably best to run it after BibIndex, so that users see the new changes as soon as they are indexed):
- (1) detect which methods should be updated: either ranking (from the rnkMETHOD tables) or usual sorting (by field or MARC tags)
- (1i) if it is a RNK method -> run rebalancing, since all the data for all the records has to be read anyway (also check whether the method needs to be deleted rather than updated) - END
- (1ii) if it is sorting by field -> detect which recids have been modified since the last run of BibSort
- (2) based on the washer method (if any is defined in the cfg file) the data is adjusted (dates normalized, leading articles removed, ...)
- (3) compute the records that need to be inserted, modified and deleted (with the help of the data dictionary (see I.2) that is loaded from the db)
- (3i) insert: new recids with their values are inserted into the data dictionary; binary search is used to compute the index of the new record in the sorted list of records; a new weight is given to this record based on this index and on the weights of its neighbours (e.g. if recid3 has to be inserted between recid1 with weight=4 and recid2 with weight=8, then recid3 gets weight=6); initially, the weights are assigned so that there are free weights between any two recids; when these gaps are filled, the weights are recomputed to recreate the gaps (see the second sketch below)
- (3ii) delete: the recids are deleted from all the data structures except the buckets; there is no actual need to remove them there, since the intersection with the search_engine output guarantees that the record will not be displayed
- (3iii) modify: a combination of insert and delete
- (4) update the buckets to reflect the latest changes; here we take advantage of the maximum weight per bucket = the highest weight value in the bucket (I.6); since the buckets are not ordered (they are intbitsets, to take advantage of fast intersections), the new records are added based on their weight (the main idea is to compare the weight of each modified record with the per-bucket maximum weights and decide to which bucket it should be added or from which it should be deleted)
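
The following is a minimal sketch of the rebalancing procedure (I.1-I.7), assuming illustrative names only: rebalance, values and wash are stand-ins, not the actual BibSort code, and plain Python sets stand in for the intbitset buckets that BibSort stores in the database:

    CFG_BIBSORT_BUCKETS = 4  # number of buckets, normally taken from the cfg file

    def rebalance(values, wash=None, buckets_count=CFG_BIBSORT_BUCKETS):
        # (2) adjust the raw values with the washer method, if one is configured
        if wash is not None:
            values = dict((recid, wash(value)) for recid, value in values.items())
        # (3) sort the recids by their (washed) value
        sorted_recids = sorted(values, key=values.get)
        # (4) assign weights as multiples of the index, leaving free weights
        #     between neighbours so later insertions do not force a reweighting
        weights = dict((recid, (idx + 1) * 10)
                       for idx, recid in enumerate(sorted_recids))
        # (5)/(6) split the sorted list into buckets and remember the maximum
        #         weight of each bucket, which the incremental update relies on
        bucket_size = max(1, -(-len(sorted_recids) // buckets_count))  # ceiling division
        buckets, max_weights = [], []
        for start in range(0, len(sorted_recids), bucket_size):
            chunk = sorted_recids[start:start + bucket_size]
            buckets.append(set(chunk))          # BibSort uses intbitset here
            max_weights.append(weights[chunk[-1]])
        # (7) BibSort would now store the weights, buckets and maxima in the database
        return sorted_recids, weights, buckets, max_weights

For example, with values = {1: 'b', 2: 'a', 3: 'c'} and two buckets, record 2 gets weight 10, record 1 gets 20 and record 3 gets 30, and the buckets become {2, 1} (max weight 20) and {3} (max weight 30).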
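
The incremental insert of II.(3i) together with the bucket selection of II.(4) could look roughly like the sketch below, built on the structures returned by rebalance() above. Again, insert_record and all parameter names are hypothetical, not the real BibSort API; sorted_values is assumed to be the list of washed values kept in the same order as sorted_recids (e.g. [values[r] for r in sorted_recids]):

    import bisect

    def insert_record(recid, value, sorted_values, sorted_recids,
                      weights, buckets, max_weights):
        # (3i) binary search for the position of the new record in the sorted order
        pos = bisect.bisect_left(sorted_values, value)
        # derive the new weight from the weights of the two neighbours,
        # e.g. between weight=4 and weight=8 the new record gets weight=6
        left_w = weights[sorted_recids[pos - 1]] if pos > 0 else 0
        right_w = (weights[sorted_recids[pos]] if pos < len(sorted_recids)
                   else left_w + 20)
        new_w = (left_w + right_w) // 2
        if new_w in (left_w, right_w):
            # no free weight left between the neighbours:
            # the weights have to be recomputed (a full rebalancing)
            raise RuntimeError('weights exhausted, rebalancing needed')
        sorted_values.insert(pos, value)
        sorted_recids.insert(pos, recid)
        weights[recid] = new_w
        # (4) compare the new weight with the per-bucket maxima to pick the bucket
        for i, max_w in enumerate(max_weights):
            if new_w <= max_w:
                buckets[i].add(recid)
                break
        else:
            # heavier than every stored maximum: it goes into the last bucket
            buckets[-1].add(recid)
            max_weights[-1] = new_w

Leaving gaps of 10 between consecutive weights is only an illustration of the "free weights" idea from II.(3i); the actual spacing and rebalancing policy in BibSort may differ.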