Deduplication of records from csv using SRU service of swissbib or GVI
Perl Script for data deduplication with SRU service of swissbib or GVI.
The deduplication process is specifically adapted to data from the IFF institute of the University of St. Gallen.
There are several versions of this script.
The recommended version is v4_combined/dedup.pl. This Readme is for the recommended version.
Older versions can be found in the directory old_versions; there is a separate readme for versions 1 and 2.
You need to have Perl and libxml2 installed to run this script.
Developed with Strawberry Perl v5.28.1 (LibXML is included).
For Strawberry Perl (Windows):
Include the path to .\iff in @INC.
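One way to do this without changing the Perl installation is to extend @INC from inside the script. The following is a minimal sketch, assuming the modules live in a subdirectory named iff next to the script (this layout is inferred from the path above and is not confirmed by the script itself):

```perl
use strict;
use warnings;
use FindBin qw($Bin);    # $Bin holds the directory of the running script
use lib "$Bin/iff";      # add <script dir>/iff to @INC at compile time

# Confirm that the directory is now part of the module search path.
print "iff directory is on \@INC\n" if grep { $_ eq "$Bin/iff" } @INC;
```

The same effect can be had for a single run with Perl's -I switch, for example `perl -I.\iff dedup.pl ...`.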
An Ubuntu VM image with all necessary modules installed can be downloaded from switch drive.
Please read the installation notes.
The script calls an SRU service for each document in the input file, so you will need an active internet connection.
Performance depends on your internet connection as well as on the availability of the SRU service.
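To illustrate what one such call looks like, here is a minimal sketch that builds an SRU searchRetrieve URL using only core Perl. The base URL, the CQL index dc.title, the SRU version and the helper names percent_encode and build_sru_url are assumptions made for illustration; the actual endpoints and queries are defined in dedup.pl and its configuration.

```perl
use strict;
use warnings;

# Percent-encode everything except RFC 3986 unreserved characters.
sub percent_encode {
    my ($text) = @_;
    utf8::encode($text);    # operate on UTF-8 bytes
    $text =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;
    return $text;
}

# Assemble a searchRetrieve request from the standard SRU parameters.
sub build_sru_url {
    my ($base_url, $cql_query, $max_records) = @_;
    my @params = (
        [ operation      => 'searchRetrieve' ],
        [ version        => '1.1' ],            # assumed version
        [ query          => $cql_query ],
        [ maximumRecords => $max_records ],
    );
    my $query_string = join '&',
        map { $_->[0] . '=' . percent_encode($_->[1]) } @params;
    return $base_url . '?' . $query_string;
}

# Placeholder endpoint and query, not the real swissbib or GVI address.
my $url = build_sru_url('https://sru.example.org/search',
                        'dc.title = "Steuerrecht"', 10);
print "$url\n";
```

Because one request of this kind is sent per input row, the total runtime grows roughly linearly with the size of the input file.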
Call the script like this:
perl dedup.pl -c [swissbib|gvi] -f [filename]
For more information about the script, see the POD documentation for dedup.pl.
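The call above can be parsed with the core module Getopt::Long. This is a sketch of how the two options might be read and validated, not the code used in dedup.pl; the helper name parse_args and the demo file name are hypothetical.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Parse -c and -f from a list of arguments; returns (catalog, filename).
sub parse_args {
    my @args = @_;
    my ($catalog, $file);
    GetOptionsFromArray(\@args, 'c=s' => \$catalog, 'f=s' => \$file)
        or die "Usage: perl dedup.pl -c [swissbib|gvi] -f [filename]\n";
    die "Option -c must be 'swissbib' or 'gvi'\n"
        unless defined $catalog && $catalog =~ /\A(?:swissbib|gvi)\z/;
    die "Option -f (input file) is required\n" unless defined $file;
    return ($catalog, $file);
}

# Fall back to demo values (hypothetical file name) when run without arguments.
my @demo = ('-c', 'swissbib', '-f', 'data/example.csv');
my ($catalog, $file) = parse_args(@ARGV ? @ARGV : @demo);
print "catalog=$catalog file=$file\n";
```

Validating -c up front means a typo in the catalog name fails immediately instead of after the first SRU request.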
You can choose between swissbib or GVI SRU interface.
The dedup.pl script would also need to be adapted slightly when getting the options from the command line.
In the configuration file, you can parametrize several values:
Each section in the configuration file needs a header, each value needs its own line.
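The layout described here (a header per section, one value per line) resembles an INI file. The sketch below parses such a layout with core Perl only. It assumes INI-style syntax with "[section]" headers and "key=value" lines; the section and key names are invented for illustration and do not come from the real configuration file, and the script itself may use a dedicated module instead.

```perl
use strict;
use warnings;

# Parse INI-style text: "[section]" headers, one "key=value" pair per line.
sub parse_config {
    my ($text) = @_;
    my (%config, $section);
    for my $line (split /\r?\n/, $text) {
        next if $line =~ /^\s*(?:[#;].*)?$/;       # skip blank lines and comments
        if ($line =~ /^\s*\[([^\]]+)\]\s*$/) {      # section header
            $section = $1;
            next;
        }
        die "Value outside of a section: $line\n" unless defined $section;
        my ($key, $value) = $line =~ /^\s*([^=]+?)\s*=\s*(.*?)\s*$/
            or die "Cannot parse line: $line\n";
        $config{$section}{$key} = $value;
    }
    return \%config;
}

# Hypothetical sections and keys, for illustration only.
my $config = parse_config(<<'END');
[sru]
max_records=10

[matching]
title_weight=5
END

print "max_records = $config->{sru}{max_records}\n";
```

A value that appears before any section header is rejected, which mirrors the rule that every section needs a header.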
Full documentation on the configuration file (how to edit or add entries, how to call it in the script) can be found here:
Config:
You can feed this script with an input file of your choice. It needs to be in csv format.
The data needs to be arranged in rows like the example files in subdirectory ./data, otherwise this script will not work.
Data needs to be in the following rows (rows may be empty unless stated otherwise):
Console output will show a progress bar and give you the logfile name at the end.
The script creates the following output:
It contains all documents (equal to input file)
Additional mapping info can be found in the following columns:
w: what to do with the documents. Cases:
x: docnr. of document to be replaced (swissbib only: system number)
MARCXML-Export for cases reimport and replace. Docnr. can be found in controlfield 001 and corresponds to the export file .
Contains debugging info (quite chatty) for each document, its result set and matching values.
Logfile with statistics