No frills OAI-PMH harvesting for the command line.
Note: oaimi is deprecated. For a better experience, please take a look at
metha - it supports incremental harvesting, compresses results, and has an
overall simpler interface and internals.
The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a low-barrier mechanism for repository interoperability. https://www.openarchives.org/pmh/
No frills OAI harvesting. It acts as a cache and will take care of incrementally retrieving new records.
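Under the hood, OAI-PMH is nothing more than HTTP GET plus XML, which is what oaimi wraps for you. A minimal sketch in plain Go (standard library only, not oaimi code) that issues an Identify request directly:

package main

import (
    "fmt"
    "io"
    "log"
    "net/http"
)

func main() {
    // An OAI-PMH request is an ordinary GET with a verb parameter.
    resp, err := http.Get("http://digital.ub.uni-duesseldorf.de/oai?verb=Identify")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    // The response is an XML document describing the repository.
    b, err := io.ReadAll(resp.Body)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(string(b))
}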
$ go get github.com/miku/oaimi/cmd/oaimi
There are deb and rpm packages as well.
Show repository information:
$ oaimi -id http://digital.ub.uni-duesseldorf.de/oai
{
  "formats": [
    {
      "prefix": "oai_dc",
      "schema": "http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
    },
    ...
    {
      "prefix": "epicur",
      "schema": "http://www.persistent-identifier.de/xepicur/version1.0/xepicur.xsd"
    }
  ],
  "identify": {
    "name": "Visual Library Server der Universitäts- und Landesbibliothek Düsseldorf",
    "url": "http://digital.ub.uni-duesseldorf.de/oai/",
    "version": "2.0",
    "email": "docserv@uni-duesseldorf.de",
    "earliest": "2008-04-18T07:54:14Z",
    "delete": "no",
    "granularity": "YYYY-MM-DDThh:mm:ssZ"
  },
  "sets": [
    {
      "spec": "ulbdvester",
      "name": "Sammlung Vester (DFG)"
    },
    ...
    {
      "spec": "ulbd_rsh",
      "name": "RSH"
    }
  ]
}
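The fields above correspond to the standard OAI-PMH Identify response. As a sketch of how such a response can be decoded with the standard library (hypothetical struct and field names, not oaimi's internal types):

package main

import (
    "encoding/xml"
    "fmt"
    "log"
    "net/http"
)

// identify mirrors the parts of the Identify response shown above.
type identify struct {
    Name        string `xml:"Identify>repositoryName"`
    URL         string `xml:"Identify>baseURL"`
    Version     string `xml:"Identify>protocolVersion"`
    Email       string `xml:"Identify>adminEmail"`
    Earliest    string `xml:"Identify>earliestDatestamp"`
    Delete      string `xml:"Identify>deletedRecord"`
    Granularity string `xml:"Identify>granularity"`
}

func main() {
    resp, err := http.Get("http://digital.ub.uni-duesseldorf.de/oai?verb=Identify")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    var id identify
    if err := xml.NewDecoder(resp.Body).Decode(&id); err != nil {
        log.Fatal(err)
    }
    fmt.Printf("%+v\n", id)
}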
Harvest the complete repository into a single file (default format is oai_dc, might take a few minutes on first run):
$ oaimi -verbose http://digital.ub.uni-duesseldorf.de/oai > metadata.xml
Harvest only a slice (e.g. set ulbdvester in format epicur for 2010 only):
$ oaimi -set ulbdvester -prefix epicur -from 2010-01-01 \
-until 2010-12-31 http://digital.ub.uni-duesseldorf.de/oai > slice.xml
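Under the hood such a slice is a selective ListRecords request that is paged via resumption tokens. A rough sketch of that loop with the standard library (illustrative only, not the oaimi implementation):

package main

import (
    "encoding/xml"
    "fmt"
    "log"
    "net/http"
    "net/url"
)

// page captures just enough of a ListRecords response to follow the paging.
type page struct {
    Records []struct {
        Metadata string `xml:",innerxml"`
    } `xml:"ListRecords>record"`
    Token string `xml:"ListRecords>resumptionToken"`
}

func main() {
    base := "http://digital.ub.uni-duesseldorf.de/oai"
    params := url.Values{
        "verb":           {"ListRecords"},
        "metadataPrefix": {"epicur"},
        "set":            {"ulbdvester"},
        "from":           {"2010-01-01"},
        "until":          {"2010-12-31"},
    }
    for {
        resp, err := http.Get(base + "?" + params.Encode())
        if err != nil {
            log.Fatal(err)
        }
        var p page
        err = xml.NewDecoder(resp.Body).Decode(&p)
        resp.Body.Close()
        if err != nil {
            log.Fatal(err)
        }
        fmt.Printf("fetched %d records\n", len(p.Records))
        // An empty resumptionToken means the final chunk has been delivered.
        if p.Token == "" {
            break
        }
        // Follow-up requests carry only the verb and the token.
        params = url.Values{
            "verb":            {"ListRecords"},
            "resumptionToken": {p.Token},
        }
    }
}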
Harvest and add an artificial root element, so the result is well-formed XML:
$ oaimi -root records http://digital.ub.uni-duesseldorf.de/oai > withroot.xml
To list the harvested files, run:
$ ls $(oaimi -dirname http://digital.ub.uni-duesseldorf.de/oai)
Add any of the harvest parameters to see the cache directory used for that specific request:
$ ls $(oaimi -dirname -set ulbdvester -prefix epicur -from 2010-01-01 \
-until 2010-12-31 http://digital.ub.uni-duesseldorf.de/oai)
To remove all cached files:
$ rm -rf $(oaimi -dirname http://digital.ub.uni-duesseldorf.de/oai)
Plays well with other tools:
$ oaimi http://acceda.ulpgc.es/oai/request | \
xmlcutty -path /Response/ListRecords/record/metadata -root collection | \
xmllint --format -
<?xml version="1.0"?>
<collection>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="ht...... dc.xsd">
      <dc:title>Elementos míticos y paralelos estructurales en la ...</dc:title>
      ...
Options:
$ oaimi -h
Usage of oaimi:
  -cache string
        oaimi cache dir (default "/Users/tir/.oaimicache")
  -dirname
        show shard directory for request
  -from string
        OAI from
  -id
        show repository info
  -prefix string
        OAI metadataPrefix (default "oai_dc")
  -root string
        name of artificial root element tag to use
  -set string
        OAI set
  -until string
        OAI until (default "2015-11-30")
  -v    prints current program version
  -verbose
        more output
Experimental oaimi-id and oaimi-sync for identifying or harvesting in parallel:
$ oaimi-id -h
Usage of oaimi-id:
  -timeout duration
        deadline for requests (default 30m0s)
  -v    prints current program version
  -verbose
        be verbose
  -w int
        requests in parallel (default 8)
$ oaimi-sync -h
Usage of oaimi-sync:
  -cache string
        where to cache responses (default "/Users/tir/.oaimicache")
  -v    prints current program version
  -verbose
        be verbose
  -w int
        requests in parallel (default 8)
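The -w flag in both tools is a plain worker pool over HTTP requests. A hedged sketch of that pattern (not the actual oaimi-id code) which checks a couple of endpoints concurrently:

package main

import (
    "fmt"
    "net/http"
    "sync"
    "time"
)

func main() {
    endpoints := []string{
        "http://digital.ub.uni-duesseldorf.de/oai",
        "http://acceda.ulpgc.es/oai/request",
    }
    jobs := make(chan string)
    var wg sync.WaitGroup

    // Start a fixed number of workers, comparable to the -w flag.
    for i := 0; i < 8; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            client := &http.Client{Timeout: 30 * time.Second}
            for u := range jobs {
                resp, err := client.Get(u + "?verb=Identify")
                if err != nil {
                    fmt.Println(u, "error:", err)
                    continue
                }
                resp.Body.Close()
                fmt.Println(u, resp.Status)
            }
        }()
    }
    for _, u := range endpoints {
        jobs <- u
    }
    close(jobs)
    wg.Wait()
}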
Harvesting is performed in chunks (currently weekly). The raw data is
downloaded and appended to a single temporary file per source, set, prefix and
month. Once a month has been harvested successfully, the temporary file is
moved into the cache directory. In short: the cache directory never contains partial files.
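That no-partial-files guarantee boils down to the usual write-to-a-temporary-file-then-rename pattern. A simplified sketch of the idea (hypothetical helper, not the actual oaimi code):

package main

import (
    "log"
    "os"
    "path/filepath"
)

// writeComplete stores the data for one fully harvested chunk. The file is
// written inside the cache directory under a temporary name and only renamed
// into place once the write succeeded, so readers never see partial files.
func writeComplete(cacheDir, name string, data []byte) error {
    tmp, err := os.CreateTemp(cacheDir, "incomplete-")
    if err != nil {
        return err
    }
    if _, err := tmp.Write(data); err != nil {
        tmp.Close()
        os.Remove(tmp.Name())
        return err
    }
    if err := tmp.Close(); err != nil {
        return err
    }
    // Rename is the atomic "commit" into the cache.
    return os.Rename(tmp.Name(), filepath.Join(cacheDir, name))
}

func main() {
    if err := os.MkdirAll("cache", 0755); err != nil {
        log.Fatal(err)
    }
    if err := writeComplete("cache", "2010-01.xml", []byte("<Records/>")); err != nil {
        log.Fatal(err)
    }
}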
If you request data for a given source, oaimi will reuse the cache and only
harvest data that is not yet cached. The output file is the concatenated
content for the requested date range. The output is not valid XML on its own,
because a root element is missing; you can add a custom root element with the
-root flag.
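If you prefer to add the wrapper in a later step instead of using -root, a minimal sketch (hypothetical wrap.go, not part of oaimi) that wraps stdin in an artificial root element:

package main

import (
    "fmt"
    "io"
    "log"
    "os"
)

// Read concatenated OAI responses from stdin and wrap them in a single
// artificial root element, similar in spirit to the -root flag.
func main() {
    fmt.Println("<records>")
    if _, err := io.Copy(os.Stdout, os.Stdin); err != nil {
        log.Fatal(err)
    }
    fmt.Println("</records>")
}

$ oaimi http://digital.ub.uni-duesseldorf.de/oai | go run wrap.go > withroot.xml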
The value proposition of oaimi is that you get a single file containing the
raw data for a specific source with a single command, and that incremental
updates are relatively cheap - at most the last 7 days need to be fetched.
For the moment, any further processing (like handling deletions) must happen
in the client.
More Docs: http://godoc.org/github.com/miku/oaimi
Over 2038 repositories.