CERN Accelerating science

HOWTO Manage Fulltext Files

How to manage your fulltext files through BibDocFile

Usage: /opt/invenio/bin/bibdocfile [options]

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -D, --debug
  -H, --human-readable  print sizes in human readable format (e.g., 1KB 234MB
                        2GB)

  Query options:
    -r RECIDS, --recids=RECIDS
                        matches records by recids, e.g.: --recids=1-3,5-7
    -d DOCIDS, --docids=DOCIDS
                        matches documents by docids, e.g.: --docids=1-3,5-7
    -a, --all           Select all the records
    --with-deleted-recs=yes/no/only
                        'Yes' to also match deleted records, 'no' to exclude
                        them, 'only' to match only deleted ones
    --with-deleted-docs=yes/no/only
                        'Yes' to also match deleted documents, 'no' to exclude
                        them, 'only' to match only deleted ones (e.g. for
                        undeletion)
    --with-empty-recs=yes/no/only
                        'Yes' to also match records without attached
                        documents, 'no' to exclude them, 'only' to consider
                        only such records (e.g. for statistics)
    --with-empty-docs=yes/no/only
                        'Yes' to also match documents without attached files,
                        'no' to exclude them, 'only' to consider only such
                        documents (e.g. for sanity checking)
    --with-record-modification-date=date1,date2
                        matches records modified date1 and date2; dates can be
                        expressed relatively, e.g.:"-5m,2030-2-23 04:40" #
                        matches records modified since 5 minutes ago until the
                        2030...
    --with-record-creation-date=date1,date2
                        matches records created between date1 and date2; dates
                        can be expressed relatively
    --with-document-modification-date=date1,date2
                        matches documents modified between date1 and date2;
                        dates can be expressed relatively
    --with-document-creation-date=date1,date2
                        matches documents created between date1 and date2;
                        dates can be expressed relatively
    --url=URL           matches the document referred by the URL, e.g. "http:/
                        /pcsk.cern.ch/record/1/files/foobar.pdf?version=2"
    --path=PATH         matches the document referred by the internal
                        filesystem path, e.g. /opt/cds-
                        dev/var/data/files/g0/1/foobar.pdf\;1
    --with-docname=DOCNAME
                        matches documents with the given docname (accept
                        wildcards)
    --with-doctype=DOCTYPE
                        matches documents with the given doctype
    -p PATTERN, --pattern=PATTERN
                        matches records by pattern
    -c COLLECTION, --collection=COLLECTION
                        matches records by collection
    --force             force an action even when it's not necessary e.g.
                        textify on an already textified bibdoc.

  Actions for getting information:
    --get-info          print all the informations about the matched
                        record/documents
    --get-disk-usage    print disk usage statistics of the matched documents
    --get-history       print the matched documents history

  Actions for setting information:
    --set-doctype=doctype
                        specify the new doctype
    --set-description=description
                        specify a description
    --set-comment=comment
                        specify a comment
    --set-restriction=restriction
                        specify a restriction tag
    --set-docname=docname
                        specifies a new docname for renaming
    --unset-comment     remove any comment
    --unset-descriptions
                        remove any description
    --unset-restrictions
                        remove any restriction
    --hide              hides matched documents and revisions
    --unhide            hides matched documents and revisions

  Action for revising content:
    --append=PATH/URL   specify the URL/path of the file that will appended to
                        the bibdoc
    --revise=PATH/URL   specify the URL/path of the file that will revise the
                        bibdoc
    --revert            reverts a document to the specified version
    --delete            soft-delete the matched documents (applies to all
                        revisions and formats)
    --hard-delete       hard-delete the matched documents (applies to matched
                        revisions and formats)
    --undelete          undelete previosuly soft-deleted documents (applies to
                        all revisions and formats)
    --purge             purge (i.e. hard-delete previous versions) the matched
                        documents
    --expunge           expunge (i.e. hard-delete any version and formats) the
                        matched documents
    --with-versions=VERSION
                        specifies the version(s) to be used with hard-delete,
                        hide, revert, e.g.: 1-2,3 or all
    --with-format=FORMAT
                        to specify a format when
                        appending/revising/deleting/reverting a document, e.g.
                        "pdf"
    --with-hide-previous
                        when revising, hides previous versions

  Actions for housekeeping:
    --check-md5         check md5 checksum validity of files
    --check-format      check if any format-related inconsistences exists
    --check-duplicate-docnames
                        check for duplicate docnames associated with the same
                        record
    --update-md5        update md5 checksum of files
    --fix-all           fix inconsistences in filesystem vs database vs MARC
    --fix-marc          synchronize MARC after filesystem/database
    --fix-format        fix format related inconsistences
    --fix-duplicate-docnames
                        fix duplicate docnames associated with the same record

  Experimental options (do not expect to find them in the next release):
    --set-icon=URL/PATH
                        attache the specified icon to the matched documents
    --unset-icon        remove any icon on the matched documents
    --textify           extract text from matched documents and store it for
                        later indexing
    --with-ocr=yes/no/always
                        when used with --textify, wether to perform OCR (yes
                        will perform it only if necessary, based on an
                        heuristic)


Examples:
    $ bibdocfile --append foo.tar.gz --recid=1
    $ bibdocfile --revise http://foo.com?search=123 --with-docname='sam'
            --format=pdf --recid=3 --set-docname='pippo' # revise for record 3
    $ bibdocfile --delete *sam --all # delete all documents starting ending
    $ bibdocfile --undelete -c "Test Collection" # undelete documents for
    $ bibdocfile --get-info --recids=1-4,6-8 # obtain informations
    $ bibdocfile -r 1 --with-docname=foo --set-docname=bar # Rename a document

How to convert your fulltext files among different formats

Usage: python /opt/invenio/lib/python/invenio/websubmit_file_converter.py [options]

Options:
  -h, --help            show this help message and exit
  -c FILE, --convert=FILE
                        convert the specified FILE
  -d, --debug           Enable debug information
  --special-pdf2hocr2pdf=FILE
                        convert the given scanned PDF into a PDF with OCRed
                        text
  -f FORMAT, --format=FORMAT
                        the desired output format
  -o OUTPUT_NAME, --output=OUTPUT_NAME
                        the desired output FILE (if not specified a new file
                        will be generated with the desired output format)
  --without-pdfa        don't force creation of PDF/A  PDFs
  --without-pdfopt      don't force optimization of PDFs files
  --without-ocr         don't force OCR
  --can-convert=FORMAT  display all the possible format that is possible to
                        generate from the given format
  --is-ocr-needed=FILE  check if OCR is needed for the FILE specified
  -t TITLE, --title=TITLE
                        specify the title (used when creating PDFs)
  -l LN, --language=LN  specify the language (used when performing OCR, e.g.
                        en, it, fr...)

  Examples:
    python /opt/invenio/lib/python/invenio/websubmit_file_converter.py \
              --convert=foo.docx -f pdf

    python /opt/invenio/lib/python/invenio/websubmit_file_converter.py \
              --special-pdf2hocr2pdf=scanned-foo.pdf --output=ocred-foo.pdf

This module is used internally by Invenio to convert from one format to another

It can also be invoked manually by a system administrator to obtain the same level of conversion quality offered by Invenio.

How to create icons

Usage:
  python /opt/invenio/lib/python/invenio/websubmit_icon_creator.py \
                           [options] input-file.jpg

  websubmit_icon_creator.py is used to create an icon for an image.

  Options:
   -h, --help                      Print this help.
   -V, --version                   Print version information.
   -v, --verbose=LEVEL             Verbose level (0=min, 1=default, 9=max).
                                    [NOT IMPLEMENTED]
   -s, --icon-scale
                                   Scaling information for the icon that is to
                                   be created. Must be an integer. Defaults to
                                   180.
   -m, --multipage-icon
                                   A flag to indicate that the icon should
                                   consist of multiple pages. Will only be
                                   respected if the requested icon type is GIF
                                   and the input file is a PS or PDF consisting
                                   of several pages.
   -d, --multipage-icon-delay=VAL
                                   If the icon consists of several pages and is
                                   an animated GIF, a delay between frames can
                                   be specified. Must be an integer. Defaults
                                   to 100.
   -f, --icon-file-format=FORMAT
                                   The file format of the icon to be created.
                                   Must be one of:
                                       [pdf, gif, jpg, jpeg, ps, png, bmp]
                                   Defaults to gif.
   -o, --icon-name=XYZ
                                   The optional name to be given to the created
                                   icon file. If this is omitted, the icon file
                                   will be given the same name as the input
                                   file, but will be prefixed by "icon-";

  Examples:
    python /opt/invenio/lib/python/invenio/websubmit_icon_creator.py \
              --icon-scale=200 \
              --icon-name=test-icon \
              --icon-file-format=jpg \
              test-image.jpg

    python /opt/invenio/lib/python/invenio/websubmit_icon_creator.py \
              --icon-scale=200 \
              --icon-name=test-icon2 \
              --icon-file-format=gif \
              --multipage-icon \
              --multipage-icon-delay=50 \
              test-image2.pdf

How to stamp PDFs

Usage:
  python /opt/invenio/lib/python/invenio/websubmit_file_stamper.py \
                           [options] input-file.pdf

  websubmit_file_stamper.py  is used to add a "stamp" to a PDF file.
  A LaTeX template is used to create the stamp and this stamp is then
  concatenated with the original PDF file.
  The stamp can take the form of either a separate "cover page" that is
  appended to the document; or a "mark" that is applied somewhere either
  on the document's first page or on all of its pages.

  Options:
   -h, --help                Print this help.
   -V, --version             Print version information.
   -v, --verbose=LEVEL       Verbose level (0=min, 1=default, 9=max).
                              [NOT IMPLEMENTED]
   -t, --latex-template=PATH
                             Path to the LaTeX template file that should be used
                             for the creation of the PDF stamp. (Note, if it's
                             just a basename, it will be sought first in the
                             current working directory, and then in the invenio
                             file-stamper templates directory; If there is a
                             qualifying path to the template name, it will be
                             sought only in that location);
   -c, --latex-template-var='VARNAME=VALUE'
                             A variable that should be replaced in the LaTeX
                             template file with its corresponding value. Of the
                             following format:
                                 VARNAME=VALUE
                             This option is repeatable - one for each template
                             variable;
   -s, --stamp=STAMP-TYPE
                             The type of stamp to be applied to the subject
                             file. Must be one of 3 values:
                              + "first" - stamp only the first page;
                              + "all"   - stamp all pages;
                              + "coverpage" - add a cover page to the
                                document;
                             The default value is "first";
   -o, --output-file=XYZ
                             The optional name to be given to the finished
                             (stamped) file. If this is omitted, the stamped
                             file will be given the same name as the input
                             file, but will be prefixed by "stamped-";

  Example:
    python /opt/invenio/lib/python/invenio/websubmit_file_stamper.py \
              --latex-template=demo-stamp-left.tex \
              --latex-template-var='REPORTNUMBER=TEST-THESIS-2008-019' \
              --latex-template-var='DATE=27/02/2008' \
              --stamp='first' \
              --output-file=testfile_stamped.pdf \
              testfile.pdf