No frills OAI-PMH harvesting for the command line.
Note: oaimi is deprecated. For a better experience, please take a look at
metha - it supports incremental harvesting, compresses results, and has an
overall simpler interface and internals.
The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a low-barrier mechanism for repository interoperability. https://www.openarchives.org/pmh/
No frills OAI harvesting. It acts as a cache and will take care of incrementally retrieving new records.
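Under the hood, OAI-PMH is nothing more than HTTP GET plus XML, which is what oaimi wraps for you. A minimal sketch in plain Go (standard library only, not oaimi code) that issues an Identify request directly:

package main

import (
    "fmt"
    "io"
    "log"
    "net/http"
)

func main() {
    // An OAI-PMH request is an ordinary GET with a verb parameter.
    resp, err := http.Get("http://digital.ub.uni-duesseldorf.de/oai?verb=Identify")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    // The response is an XML document describing the repository.
    b, err := io.ReadAll(resp.Body)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(string(b))
}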
$ go get github.com/miku/oaimi/cmd/oaimi
There are deb and rpm packages as well.
Show repository information:
$ oaimi -id http://digital.ub.uni-duesseldorf.de/oai
{
  "formats": [
    {
      "prefix": "oai_dc",
      "schema": "http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
    },
    ...
    {
      "prefix": "epicur",
      "schema": "http://www.persistent-identifier.de/xepicur/version1.0/xepicur.xsd"
    }
  ],
  "identify": {
    "name": "Visual Library Server der Universitäts- und Landesbibliothek Düsseldorf",
    "url": "http://digital.ub.uni-duesseldorf.de/oai/",
    "version": "2.0",
    "email": "docserv@uni-duesseldorf.de",
    "earliest": "2008-04-18T07:54:14Z",
    "delete": "no",
    "granularity": "YYYY-MM-DDThh:mm:ssZ"
  },
  "sets": [
    {
      "spec": "ulbdvester",
      "name": "Sammlung Vester (DFG)"
    },
    ...
    {
      "spec": "ulbd_rsh",
      "name": "RSH"
    }
  ]
}
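The fields above correspond to the standard OAI-PMH Identify response. As a sketch of how such a response can be decoded with the standard library (hypothetical struct and field names, not oaimi's internal types):

package main

import (
    "encoding/xml"
    "fmt"
    "log"
    "net/http"
)

// identify mirrors the parts of the Identify response shown above.
type identify struct {
    Name        string `xml:"Identify>repositoryName"`
    URL         string `xml:"Identify>baseURL"`
    Version     string `xml:"Identify>protocolVersion"`
    Email       string `xml:"Identify>adminEmail"`
    Earliest    string `xml:"Identify>earliestDatestamp"`
    Delete      string `xml:"Identify>deletedRecord"`
    Granularity string `xml:"Identify>granularity"`
}

func main() {
    resp, err := http.Get("http://digital.ub.uni-duesseldorf.de/oai?verb=Identify")
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()

    var id identify
    if err := xml.NewDecoder(resp.Body).Decode(&id); err != nil {
        log.Fatal(err)
    }
    fmt.Printf("%+v\n", id)
}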
Harvest the complete repository into a single file (default format is oai_dc, might take a few minutes on first run):
$ oaimi -verbose http://digital.ub.uni-duesseldorf.de/oai > metadata.xml
Harvest only a slice (e.g. set ulbdvester in format epicur for 2010 only):
$ oaimi -set ulbdvester -prefix epicur -from 2010-01-01 \
-until 2010-12-31 http://digital.ub.uni-duesseldorf.de/oai > slice.xml
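Under the hood such a slice is a selective ListRecords request that is paged via resumption tokens. A rough sketch of that loop with the standard library (illustrative only, not the oaimi implementation):

package main

import (
    "encoding/xml"
    "fmt"
    "log"
    "net/http"
    "net/url"
)

// page captures just enough of a ListRecords response to follow the paging.
type page struct {
    Records []struct {
        Metadata string `xml:",innerxml"`
    } `xml:"ListRecords>record"`
    Token string `xml:"ListRecords>resumptionToken"`
}

func main() {
    base := "http://digital.ub.uni-duesseldorf.de/oai"
    params := url.Values{
        "verb":           {"ListRecords"},
        "metadataPrefix": {"epicur"},
        "set":            {"ulbdvester"},
        "from":           {"2010-01-01"},
        "until":          {"2010-12-31"},
    }
    for {
        resp, err := http.Get(base + "?" + params.Encode())
        if err != nil {
            log.Fatal(err)
        }
        var p page
        err = xml.NewDecoder(resp.Body).Decode(&p)
        resp.Body.Close()
        if err != nil {
            log.Fatal(err)
        }
        fmt.Printf("fetched %d records\n", len(p.Records))
        // An empty resumptionToken means the final chunk has been delivered.
        if p.Token == "" {
            break
        }
        // Follow-up requests carry only the verb and the token.
        params = url.Values{
            "verb":            {"ListRecords"},
            "resumptionToken": {p.Token},
        }
    }
}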
Harvest and add an artificial root element, so the result is well-formed XML:
$ oaimi -root records http://digital.ub.uni-duesseldorf.de/oai > withroot.xml
To list the harvested files, run:
$ ls $(oaimi -dirname http://digital.ub.uni-duesseldorf.de/oai)
Add any of the harvest parameters to see the cache directory used for that specific request:
$ ls $(oaimi -dirname -set ulbdvester -prefix epicur -from 2010-01-01 \
-until 2010-12-31 http://digital.ub.uni-duesseldorf.de/oai)
To remove all cached files:
$ rm -rf $(oaimi -dirname http://digital.ub.uni-duesseldorf.de/oai)
Plays well with other tools:
$ oaimi http://acceda.ulpgc.es/oai/request | \
xmlcutty -path /Response/ListRecords/record/metadata -root collection | \
xmllint --format -
<?xml version="1.0"?>
<collection>
  <metadata>
    <oai_dc:dc xmlns:oai_dc="ht...... dc.xsd">
      <dc:title>Elementos míticos y paralelos estructurales en la ...</dc:title>
      ...
Options:
$ oaimi -h
Usage of oaimi:
  -cache string
        oaimi cache dir (default "/Users/tir/.oaimicache")
  -dirname
        show shard directory for request
  -from string
        OAI from
  -id
        show repository info
  -prefix string
        OAI metadataPrefix (default "oai_dc")
  -root string
        name of artificial root element tag to use
  -set string
        OAI set
  -until string
        OAI until (default "2015-11-30")
  -v    prints current program version
  -verbose
        more output
Experimental oaimi-id and oaimi-sync for identifying or harvesting in parallel:
$ oaimi-id -h
Usage of oaimi-id:
  -timeout duration
        deadline for requests (default 30m0s)
  -v    prints current program version
  -verbose
        be verbose
  -w int
        requests in parallel (default 8)
$ oaimi-sync -h
Usage of oaimi-sync:
  -cache string
        where to cache responses (default "/Users/tir/.oaimicache")
  -v    prints current program version
  -verbose
        be verbose
  -w int
        requests in parallel (default 8)
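The -w flag in both tools is a plain worker pool over HTTP requests. A hedged sketch of that pattern (not the actual oaimi-id code) which checks a couple of endpoints concurrently:

package main

import (
    "fmt"
    "net/http"
    "sync"
    "time"
)

func main() {
    endpoints := []string{
        "http://digital.ub.uni-duesseldorf.de/oai",
        "http://acceda.ulpgc.es/oai/request",
    }
    jobs := make(chan string)
    var wg sync.WaitGroup

    // Start a fixed number of workers, comparable to the -w flag.
    for i := 0; i < 8; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            client := &http.Client{Timeout: 30 * time.Second}
            for u := range jobs {
                resp, err := client.Get(u + "?verb=Identify")
                if err != nil {
                    fmt.Println(u, "error:", err)
                    continue
                }
                resp.Body.Close()
                fmt.Println(u, resp.Status)
            }
        }()
    }
    for _, u := range endpoints {
        jobs <- u
    }
    close(jobs)
    wg.Wait()
}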
Harvesting is performed in chunks (currently weekly). The raw data is
downloaded and appended to a single temporary file per source, set, prefix and
month. Once a month has been harvested successfully, the temporary file is
moved into the cache directory. In short: the cache directory never contains partial files.
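That no-partial-files guarantee boils down to the usual write-to-a-temporary-file-then-rename pattern. A simplified sketch of the idea (hypothetical helper, not the actual oaimi code):

package main

import (
    "log"
    "os"
    "path/filepath"
)

// writeComplete stores the data for one fully harvested chunk. The file is
// written inside the cache directory under a temporary name and only renamed
// into place once the write succeeded, so readers never see partial files.
func writeComplete(cacheDir, name string, data []byte) error {
    tmp, err := os.CreateTemp(cacheDir, "incomplete-")
    if err != nil {
        return err
    }
    if _, err := tmp.Write(data); err != nil {
        tmp.Close()
        os.Remove(tmp.Name())
        return err
    }
    if err := tmp.Close(); err != nil {
        return err
    }
    // Rename is the atomic "commit" into the cache.
    return os.Rename(tmp.Name(), filepath.Join(cacheDir, name))
}

func main() {
    if err := os.MkdirAll("cache", 0755); err != nil {
        log.Fatal(err)
    }
    if err := writeComplete("cache", "2010-01.xml", []byte("<Records/>")); err != nil {
        log.Fatal(err)
    }
}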
If you request data for a given source, oaimi will reuse the cache and only
harvest data that is not yet cached. The output file is the concatenated
content for the requested date range. The output is not valid XML on its own,
because a root element is missing; you can add a custom root element with the
-root flag.
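If you prefer to add the wrapper in a later step instead of using -root, a minimal sketch (hypothetical wrap.go, not part of oaimi) that wraps stdin in an artificial root element:

package main

import (
    "fmt"
    "io"
    "log"
    "os"
)

// Read concatenated OAI responses from stdin and wrap them in a single
// artificial root element, similar in spirit to the -root flag.
func main() {
    fmt.Println("<records>")
    if _, err := io.Copy(os.Stdout, os.Stdin); err != nil {
        log.Fatal(err)
    }
    fmt.Println("</records>")
}

$ oaimi http://digital.ub.uni-duesseldorf.de/oai | go run wrap.go > withroot.xml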
The value proposition of oaimi is that you get a single file containing the
raw data for a specific source with a single command, and that incremental
updates are relatively cheap - at most the last 7 days need to be fetched.
For the moment, any further processing (like handling deletions) must happen
in the client.
More Docs: http://godoc.org/github.com/miku/oaimi
Over 2038 repositories.