After the recent incident, we have restored access to the website from outside the CERN network, however access from certain worldwide locations is still being blocked.

CERN Accelerating science

BibAuthorID Internals

BibAuthorID
-----------

DEFINITIONS
    Inspire: Publication system for high-energy physics
    Hepnames: A collection of known authors including their scientific history in the field
    Document: A document is, in a broader sense, a scientific artifact inside Inspire
        (a publication, a pre-print, a picture, a data set, etc.)
    Virtual Author (VA): The author as it appears on a document
    Real Author (RA): The final entity which ideally is a real individual researcher
    Feature: A feature of a VA or RA is a metric, the owner can be compared to other entities


THE ALGORITHM IN SHORT
    Before any algorithm can run, one step of preparation is to be done:
    - Read all names from the Inspire database and store them in new table

    The algorithm itself is divided into several steps:
    1. Clustering--finding potentially related authors
    2. Matching--create RA entities by pairwise comparison of VAs within a cluster
    3. Post-matching comparison--identifying identical RA entities through cross-cluster comparison


STEP I: CLUSTERING
    [CODE]
    FOR every last name in the database:
        FOR every name with that last name:
            FOR every documents associated with the name:
                create new VA (record):
                    update cluster id of VA and create new cluster if necessary
                    mark as updated
                    mark as not connected
    [/CODE]


STEP II: MATCHING
    The matching algorithm effectively runs several times on different
    conditions.

    Run 1: run matching algorithm and let it find only updated VAs (the ones
        that have the update flag set) that have a full name (i.e. at least
        one first name).
    Run 2: run matching algorithm and let it find the rest of the updated and
        not-connected VAs.
    Run 3: run matching algorithm once or more and let it find only orphaned
        VAs (i.e. VAs that are neither updated nor connected)


    [CODE]
    IF mode == updated_fullname:
        queue := find all updated VAs that have a full name
           and sort them by the number of their features
    ELSEIF mode == updated:
        queue := find all updated VAs and sort them by their features
    ELSEIF mode == orphaned:
        queue := find all disconnected VAs and sort them by their features

    FOR every qVA in queue:
        cluster := find all VAs in the same cluster
        other_RAs := find all RAs that have VAs attached from the same cluster

        IF no other_RAs exist:
            create new RA and copy all features from qVA
        ELSE:
            FOR every other_RA in other_RAs:
                probabilities.add(compare qVA features with other_RA features)

            IF all probabilities < adding threshold:
                create new RA and copy all features from qVA
            ELSEIF (number of probabilities > adding threshold) == 1:
                add qVA to RA with highest probability
                    and copy features from qVA to that RA
            ELSE:
                mark qVA as not connected
                mark qVA as not updated
                continue with next qVA in queue

        mark qVA as connected
        mark qVA as not updated
    [/CODE]

STEP II: MODULES
    Up to this point, the algorithm is build in the fashion of a framework. The
    framework provides all the methods needed to access features of RAs or VAs
    and is extensible through the means of modules. A module's purpose is to
    provide several functions to be able to compare features of a VA to the
    features of a RA. Currently, four modules are implemented to determine the
    correct attribution of a VA to a RA. In particular, these are created to
    compare names, affiliations, paper-equality and co-authorship. The following
    snippets shall show the overall functionality of each of the modules.

    MODULE 1: NAME COMPARISON
    [CODE]
        clean names by removing special chars (.-_/\[]{}())
        split names in last name, initials and names
        build name combinations

        FOR all possible name combinations on both sides:
            initials_p := compare initials:
                attribute weight to position of initial: pos/((1+n/2)*n)
                add up weights of matching initials

            names_p := compare names:
                attribute weight to position of name: pos/((1+n/2)*n)
                add up weights of matching names

            IF names_p > 0.6:
                initials_p_weight := 0.3
                names_p_weight := 0.7
            ELSEIF initials_p_weight == 1.0 and names_p_weight <= 0:
                initials_p_weight := 0
                names_p_weight := 0
            ELSE:
                initials_p_weight := 0.5
                names_p_weight := 0.5

            probabilities.add(names_p_weight * names_p +
                              initials_p_weight * initials_p)

        RETURN MAX(probabilities)
    [/CODE]

    MODULE 2: AFFILIATION COMPARISON
    [CODE]
        IF no affiliation in common:
            RETURN 0.0

        FOR every common affiliation:
            IF common affiliation == "Unknown":
                common_affiliations.add(0)
            ELSE:
                common_affiliations.add(1)

            date_difference := find date difference in month

            IF date_difference > 600 (50 years):
                date_probabilities.add(0)
            ELSE:
                date_probabilities.add(e^(-0.05 * date_difference ^ 0.7))

        affiliation_p := AVERAGE(common_affiliations)
        date_p = AVERAGE(date_probabilities)

        RETURN (affiliation_p + date_p) / 2
    [/CODE]

    MODULE 3: COAUTHORSHIP COMPARISON
    [CODE]
        IF no coauthors on both sides:
            RETURN 0.0

        IF number of VA coauthors > 50:
            create hash of sorted coauthor list.
            IF RA holds same hash:
                RETURN 1.0
            ELSE:
                RETURN 0.0

        parity = find intersection of RA and VA coauthors

        FOR every coauthor in parity:
            if coauthor is a collaboration:
                RETURN 1.0

        IF number of VA coauthors > 0:
            RETURN 1 - e^(-0.8 * len(parity)^0.7))
    [/CODE]

    MODULE 4: PAPER-EQUALITY TEST
    [CODE]
        IF qVA is on a paper that is already part of the RA:
            RETURN impossible match
        ELSE:
            RETURN possible match
    [/CODE]


STEP III: POST-MATCHING COMPARISON
    [CODE]
        FOR every RA:
            compare features of RA with all features of all other RAs

            IF compatablility with other RA > 0.75:
                merge the two RAs into one RA

    [/CODE]