Convert images to PDF, extract PDF text and execute commands depending on the content
Convert images to PDFs with Tesseract, extract PDF text with
pdftotext and execute commands depending on the content. The purpose is to
scan paperwork, generate a PDF with text overlay and apply a set of rules to
rename and move the PDF.
The pdfmatch
command is used like this:
pdfmatch [options] source.{pdf,jpeg,tif,...} [target.pdf]
Options:
--config Use the given config file
--delete Remove source file if match was found and command executed
--debug Don't create the PDF or execute commands, but print the text
-l Use the given language(s), overrides the configured "lang"
If no config file is specified, pdfmatch
will look for a file namedpdfmatch.json
in the current directory.
If the source file is a PDF, the text is extracted with pdftotext
and the
configured rules are applied.
If the source file is not a PDF, it is expected to be an image and is converted
to a PDF with searchable text using tesseract
. If no target file is given,
the base name of the image is used for the PDF. In a second step, the text is
extracted with pdftotext
and the configured rules are applied.
The configuration file can specify the language(s) to use with tesseract
and
a set of rules to apply. After the first match, the associated command is
executed and processing is stopped. If no match was found the no-match
command is executed.
Here is an example:
{
"rules": [{
"matches": [{
"invoiceDate": "Invoive Date: ${DATE}"
}, {
"invoiceDate": "Ausstellungsdatum: ${DATE}"
}],
"command": "mv ${file} ${invoiceDate.format('YYYY-MM-DD')}\\ invoice.pdf"
}],
"no-match": "mv ${file} ${now.format('YYYY-MM-DD_HHmmss')}.pdf"
}
The configuration properties are:
lang
: The language(s) to pass to Tesseractrules
: An array of rules to run, where each rule is an object with thesematch
: A single match object or an array of match objects, passed tocommand
: The command to execute, after substituting any JavaScriptno-match
: A default command to execute if no matching rule was foundThe command
can contain variables in the form ${...}
where ...
is a
JavaScript expression with access to the matched properties. After successful
substitution, the command is written to the console and executed usingchild_process.execSync(command)
.
These special properties can be accessed in commands:
file
: The PDF filenow
: The current date as a moment objectInstalling Tesseract with brew:
$ brew install tesseract --with-all-languages
Installing pdftotext
(or download from
http://www.foolabs.com/xpdf/download.html):
$ brew install Caskroom/cask/pdftotext
Installing this tool:
$ npm install pdfmatch -g
My working setup is a ~/Documents/Scans
folder containing only mypdfmatch.json
configuration. The commands in the rules move the matched files
one level up:
{
"lang": "deu+eng",
"rules": [{
"match": {
"company": "npm, Inc",
"invoiceDate": "${DATE}"
},
"command": "mv ${file} ../${invoiceDate.format('YYYY-MM-DD')}\\ npm.pdf"
}],
"no-match": "mv ${file} ../${now.format('YYYY-MM-DD_HHmmss')}.pdf"
}
If you’re following the above example setup, there is an AppleScript folder
action in ./scripts
which allows you to save or drop files in a special
folder and have pdfmatch
invoked automatically. Follow the instructions in
the header comments on how to use it.
This module exposes an API if require
d as a node module:
processText(pdf_file, config, callback)
: Extract text from a PDF file andprocessImage(image_file, pdf_file, config, callback)
: Converts an image toprocessText
with the result.MIT