Project author: lucab85

Project description:
Export text in PDF files to CSV using pattern matching.
Language: Java
Project URL: git://github.com/lucab85/PDF-processor.git
Created: 2017-07-06T08:34:05Z
Project community: https://github.com/lucab85/PDF-processor

License: MIT License


PDF processor

Java desktop application to extract a string matching a pattern from a
PDF file.
Every input PDF file is converted to text (Apache PDFBox) and
then, using pattern matching, you can search for anything you want.
The output is a CSV file (Apache Commons CSV) with patterns in columns
and the data of each file in rows.
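
The matching step can be sketched as follows. This is a minimal illustration and not the application's actual code: it starts from text that has already been extracted, uses only java.util.regex, and writes the CSV line by hand so that it runs without PDFBox or Commons CSV on the classpath. The class name PatternRowSketch, its methods, and the sample text and patterns are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: one CSV row per file, one cell per pattern.
public class PatternRowSketch {

    // Returns the first match of the regex in the text, or "" when nothing matches.
    public static String firstMatch(String text, String regex) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group() : "";
    }

    // Builds one row: the file name first, then one cell per pattern column.
    public static List<String> buildRow(String filename, String text, Map<String, String> patterns) {
        List<String> row = new ArrayList<>();
        row.add(filename);
        for (String regex : patterns.values()) {
            row.add(firstMatch(text, regex));
        }
        return row;
    }

    // Minimal CSV quoting: wrap cells that contain commas, quotes or line breaks.
    public static String toCsvLine(List<String> cells) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < cells.size(); i++) {
            if (i > 0) {
                sb.append(',');
            }
            String c = cells.get(i);
            if (c.contains(",") || c.contains("\"") || c.contains("\n") || c.contains("\r")) {
                c = "\"" + c.replace("\"", "\"\"") + "\"";
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the text extracted from a PDF (assumed sample data).
        String text = "Invoice INV-2017-001 issued on 2017-07-06, total 120.50 EUR";
        Map<String, String> patterns = new LinkedHashMap<>();
        patterns.put("1", "INV-\\d{4}-\\d{3}");
        patterns.put("2", "\\d{4}-\\d{2}-\\d{2}");
        // prints: invoice.pdf,INV-2017-001,2017-07-06
        System.out.println(toCsvLine(buildRow("invoice.pdf", text, patterns)));
    }
}
```

A LinkedHashMap keeps the columns in the order the patterns were added, so the cells line up with the CSV header.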

Usage

  1. Set up the required libraries
  2. Prepare the .property file
  3. Launch the application

Description of property file fields

  • debug=false - [true/false] enable/disable debug print messages
  • rotation_degree=0 - [0-360] rotate the input PDF file by the specified number of degrees before converting it to text
  • TXT_enabled=false - [true/false] enable/disable creation of a TXT file with the extracted text (same filename as the source)
  • TXT_encoding=UTF-8 - encoding of the TXT file
  • TXT_append=false - [true/false] append to the TXT file instead of overwriting it (default: overwrite)
  • ETL_from=\r\n|\r|\n - [regex] pattern to replace in the extracted text
  • ETL_to=\ - [text] replacement text for each ETL_from match
  • filename_entry=filename - CSV column with filename
  • CSV_filename=output.csv - output filename
  • patterns_prefix=pattern. - prefix of the following patterns
  • pattern.1=[text[A-z]*, text2[A-z]*] - list of regexes to match, in order, for column “1”
  • copyPDF=true - [true/false] enable/disable copyPDF feature
  • PDFformat=[1, 2] - [list] copyPDF: filename format (uses fields 1 and 2)
  • copyPDFsep=\ - [text] copyPDF: filename field separator (default space)
  • copyPDFETL_from=/ - [regex] copyPDF: regex to replace in the filename (default slash, which is not allowed in filenames)
  • copyPDFETL_to=. - [text] copyPDF: replace destination string (default dot)
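
To make the ETL and copyPDF fields more concrete, here is a minimal sketch of how such settings could be read and applied with java.util.Properties. It is an illustration under assumptions, not the application's code: the class PropertySketch and its methods are hypothetical, the replacement is assumed to run on the joined filename, and the ".pdf" suffix is assumed. Note that Properties.load interprets backslash escapes, so \r and \n in ETL_from become real line-break characters and "\ " becomes a single space.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import java.util.regex.Matcher;

// Hypothetical sketch of reading the property file and applying two of its settings.
public class PropertySketch {

    // Parses text written in Java .properties syntax.
    public static Properties load(String content) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // ETL step: replace every match of ETL_from with the literal text in ETL_to.
    public static String applyEtl(String text, Properties props) {
        String from = props.getProperty("ETL_from", "\\r\\n|\\r|\\n");
        String to = props.getProperty("ETL_to", " ");
        return text.replaceAll(from, Matcher.quoteReplacement(to));
    }

    // copyPDF naming (assumed behaviour): join the selected fields with the
    // separator, then replace characters that are not allowed in a filename.
    public static String copyFileName(List<String> fields, String sep, String etlFrom, String etlTo) {
        String joined = String.join(sep, fields);
        return joined.replaceAll(etlFrom, Matcher.quoteReplacement(etlTo)) + ".pdf";
    }

    public static void main(String[] args) {
        // Same content as two lines of the property file: ETL_from=\r\n|\r|\n and ETL_to=\ (escaped space).
        Properties props = load("ETL_from=\\r\\n|\\r|\\n\nETL_to=\\ \n");
        // prints: line one line two
        System.out.println(applyEtl("line one\r\nline two", props));
        // prints: ACME 06.07.2017.pdf
        System.out.println(copyFileName(Arrays.asList("ACME", "06/07/2017"), " ", "/", "."));
    }
}
```

Matcher.quoteReplacement is used so that a replacement containing $ or a backslash is inserted literally rather than being read as a group reference.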

Dependency

Main library components: Apache PDFBox and Apache Commons CSV.

Complete list of “libs/*.jar”:

  1. libs/commons-csv-1.4.jar
  2. libs/commons-io-2.5.jar
  3. libs/commons-logging-1.2.jar
  4. libs/fontbox-2.0.6.jar
  5. libs/pdfbox-2.0.6.jar
  6. libs/pdfbox-tools-2.0.6.jar