Project author: miku

Project description:
No frills OAI PMH harvesting for the command line.
Language: Go
Repository: git://github.com/miku/oaimi.git
Created: 2015-09-07T13:02:26Z
Community: https://github.com/miku/oaimi

License: GNU General Public License v3.0

Note: oaimi is deprecated. For a better experience, please take a look at
metha, which supports incremental harvesting, compresses results, and has a
simpler interface and internals overall.


README

The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) is a low-barrier mechanism for repository interoperability. https://www.openarchives.org/pmh/

No-frills OAI-PMH harvesting: oaimi acts as a cache and takes care of incrementally retrieving new records.
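Under the hood, OAI-PMH is plain HTTP plus XML: every request is the endpoint URL with a verb parameter, and every response is an XML document. As a rough illustration (a sketch of the protocol, not oaimi's actual code), a single Identify request in Go:

  package main

  import (
      "fmt"
      "io"
      "net/http"
      "net/url"
  )

  func main() {
      // An OAI-PMH request is just the endpoint plus query
      // parameters, here ?verb=Identify.
      endpoint := "http://digital.ub.uni-duesseldorf.de/oai"
      params := url.Values{"verb": {"Identify"}}

      resp, err := http.Get(endpoint + "?" + params.Encode())
      if err != nil {
          panic(err)
      }
      defer resp.Body.Close()

      b, err := io.ReadAll(resp.Body)
      if err != nil {
          panic(err)
      }
      fmt.Println(string(b)) // raw XML, e.g. <Identify>...</Identify>
  }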


Installation

  $ go get github.com/miku/oaimi/cmd/oaimi

There are deb and rpm packages as well.

Usage

Show repository information:

  $ oaimi -id http://digital.ub.uni-duesseldorf.de/oai
  {
    "formats": [
      {
        "prefix": "oai_dc",
        "schema": "http://www.openarchives.org/OAI/2.0/oai_dc.xsd"
      },
      ...
      {
        "prefix": "epicur",
        "schema": "http://www.persistent-identifier.de/xepicur/version1.0/xepicur.xsd"
      }
    ],
    "identify": {
      "name": "Visual Library Server der Universitäts- und Landesbibliothek Düsseldorf",
      "url": "http://digital.ub.uni-duesseldorf.de/oai/",
      "version": "2.0",
      "email": "docserv@uni-duesseldorf.de",
      "earliest": "2008-04-18T07:54:14Z",
      "delete": "no",
      "granularity": "YYYY-MM-DDThh:mm:ssZ"
    },
    "sets": [
      {
        "spec": "ulbdvester",
        "name": "Sammlung Vester (DFG)"
      },
      ...
      {
        "spec": "ulbd_rsh",
        "name": "RSH"
      }
    ]
  }

Harvest the complete repository into a single file (the default format is oai_dc; the first run might take a few minutes):

  $ oaimi -verbose http://digital.ub.uni-duesseldorf.de/oai > metadata.xml

Harvest only a slice (e.g. set ulbdvester in format epicur for 2010 only):

  $ oaimi -set ulbdvester -prefix epicur -from 2010-01-01 \
      -until 2010-12-31 http://digital.ub.uni-duesseldorf.de/oai > slice.xml

Harvest and add an artificial root element, so the result becomes well-formed XML:

  $ oaimi -root records http://digital.ub.uni-duesseldorf.de/oai > withroot.xml

To list the harvested files, run:

  $ ls $(oaimi -dirname http://digital.ub.uni-duesseldorf.de/oai)

Add any parameters to see the corresponding cache directory:

  $ ls $(oaimi -dirname -set ulbdvester -prefix epicur -from 2010-01-01 \
      -until 2010-12-31 http://digital.ub.uni-duesseldorf.de/oai)

To remove all cached files:

  $ rm -rf $(oaimi -dirname http://digital.ub.uni-duesseldorf.de/oai)

Plays well with others:

  $ oaimi http://acceda.ulpgc.es/oai/request | \
      xmlcutty -path /Response/ListRecords/record/metadata -root collection | \
      xmllint --format -
  <?xml version="1.0"?>
  <collection>
    <metadata>
      <oai_dc:dc xmlns:oai_dc="ht...... dc.xsd">
        <dc:title>Elementos míticos y paralelos estructurales en la ...</dc:title>
  ...

Options:

  $ oaimi -h
  Usage of oaimi:
    -cache string
          oaimi cache dir (default "/Users/tir/.oaimicache")
    -dirname
          show shard directory for request
    -from string
          OAI from
    -id
          show repository info
    -prefix string
          OAI metadataPrefix (default "oai_dc")
    -root string
          name of artificial root element tag to use
    -set string
          OAI set
    -until string
          OAI until (default "2015-11-30")
    -v    prints current program version
    -verbose
          more output

There are experimental oaimi-id and oaimi-sync tools for identifying repositories or harvesting in parallel:

  $ oaimi-id -h
  Usage of oaimi-id:
    -timeout duration
          deadline for requests (default 30m0s)
    -v    prints current program version
    -verbose
          be verbose
    -w int
          requests in parallel (default 8)

  $ oaimi-sync
  Usage of oaimi-sync:
    -cache string
          where to cache responses (default "/Users/tir/.oaimicache")
    -v    prints current program version
    -verbose
          be verbose
    -w int
          requests in parallel (default 8)

How it works

The harvesting is performed in chunks (weekly at the moment). The raw data is
downloaded and appended to a single temporary file per source, set, prefix and
month. Once a month has been harvested successfully, the temporary file is
moved into the cache directory. In short: the cache directory will never
contain partial files.
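A rough sketch of that scheme (hypothetical layout and names, not oaimi's actual code): append weekly chunks to a temporary file, and only rename it into the cache once the whole month is complete:

  package main

  import (
      "fmt"
      "os"
      "path/filepath"
      "time"
  )

  // cachePath sketches a per-source/set/prefix/month layout.
  func cachePath(dir, source, set, prefix string, month time.Time) string {
      return filepath.Join(dir, source, set, prefix, month.Format("2006-01")+".xml")
  }

  func main() {
      month := time.Date(2010, 1, 1, 0, 0, 0, 0, time.UTC)
      dst := cachePath("/tmp/oaimicache", "example.org", "ulbdvester", "epicur", month)

      // Append weekly chunks to a temporary file first ...
      tmp, err := os.CreateTemp("", "harvest-")
      if err != nil {
          panic(err)
      }
      for from := month; from.Month() == month.Month(); from = from.AddDate(0, 0, 7) {
          fmt.Fprintf(tmp, "<!-- records from %s -->\n", from.Format("2006-01-02"))
      }
      tmp.Close()

      // ... then move it into the cache in one step, so the cache
      // directory never contains partial files.
      if err := os.MkdirAll(filepath.Dir(dst), 0755); err != nil {
          panic(err)
      }
      if err := os.Rename(tmp.Name(), dst); err != nil {
          panic(err)
      }
      fmt.Println("cached:", dst)
  }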

If you request the data for a given data source, oaimi will try to reuse the
cache and only harvest not yet cached data. The output file is the
concatenated content for the requested date range. The output is no valid XML
because a root element is missing. You can add a custom root element with the
-root flag.
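Conceptually, the -root flag does little more than the following (a simplified sketch, not the actual implementation):

  package main

  import (
      "fmt"
      "strings"
  )

  // wrapRoot wraps concatenated XML fragments in an artificial root
  // element, turning the concatenation into well-formed XML.
  func wrapRoot(fragments, root string) string {
      var b strings.Builder
      fmt.Fprintf(&b, "<%s>\n", root)
      b.WriteString(fragments)
      fmt.Fprintf(&b, "</%s>\n", root)
      return b.String()
  }

  func main() {
      fragments := "<record>a</record>\n<record>b</record>\n"
      fmt.Print(wrapRoot(fragments, "records"))
  }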

The value proposition of oaimi is that you get a single file containing the
raw data for a specific source with a single command, and that incremental
updates are relatively cheap: at most the last seven days need to be refetched.

For the moment, any further processing, such as handling deletions, must
happen in the client.
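For example, deletions could be spotted client-side with a small filter; OAI-PMH marks deleted records with a status="deleted" attribute on the record header. A minimal sketch that scans harvested output from stdin:

  package main

  import (
      "encoding/xml"
      "fmt"
      "os"
  )

  // header mirrors the OAI-PMH record header; deleted records carry
  // status="deleted" on the header element.
  type header struct {
      Status     string `xml:"status,attr"`
      Identifier string `xml:"identifier"`
  }

  func main() {
      dec := xml.NewDecoder(os.Stdin)
      for {
          tok, err := dec.Token()
          if err != nil {
              break // io.EOF, or malformed input
          }
          if se, ok := tok.(xml.StartElement); ok && se.Name.Local == "header" {
              var h header
              if err := dec.DecodeElement(&h, &se); err != nil {
                  continue
              }
              if h.Status == "deleted" {
                  fmt.Println("deleted:", h.Identifier)
              }
          }
      }
  }

Piping output from oaimi (e.g. harvested with -root) through such a filter would list the identifiers of records the repository reports as deleted.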

More Docs: http://godoc.org/github.com/miku/oaimi

Similar projects

More sites

Distributions

Over 2038 repositories.

Miscellaneous

License

  • GPLv3
  • This project uses ioutil2, Copyright 2012, Google Inc. All rights reserved.
    Use of this source code is governed by a BSD-style license.