Project author: annacprice

Project description:
PDF parser using pdfminer and pytesseract for OCR support
Language: Python
Repository: git://github.com/annacprice/pdf-scraper.git
Created: 2019-05-22T13:30:19Z
Project page: https://github.com/annacprice/pdf-scraper


PDFscraper

PDFscraper uses PDFMiner and Python Tesseract (pytesseract) to extract text from PDF files, with OCR support.
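
One way the two libraries can be combined is sketched below. This is an illustration only and not necessarily how pdfscraper.py is written: the function names `needs_ocr` and `extract_pdf_text` and the `min_chars` threshold are hypothetical. The sketch reads the embedded text layer with PDFMiner first and falls back to Tesseract OCR when almost no text comes back, which is typical of scanned documents.

```python
def needs_ocr(text, min_chars=20):
    """Return True when the text layer looks empty, e.g. for a scanned PDF.

    min_chars is an arbitrary, hypothetical threshold.
    """
    return len(text.strip()) < min_chars


def extract_pdf_text(pdf_path, min_chars=20):
    """Extract text with PDFMiner, falling back to OCR (assumed strategy)."""
    # Imports are local so needs_ocr() can be used without the OCR stack.
    from pdfminer.high_level import extract_text

    text = extract_text(pdf_path)
    if not needs_ocr(text, min_chars):
        return text

    # No usable text layer: render each page to an image and run OCR on it.
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_path)
    return "\n".join(pytesseract.image_to_string(page) for page in pages)
```

Trying the text layer first avoids the cost of OCR for PDFs that already contain selectable text.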

Requirements

PDFscraper requires Python 3.x.

The following Python packages are prerequisites:

  • pdfminer.six
  • pytesseract
  • chardet
  • Python Imaging Library (PIL) or Pillow
  • pdf2image

Other requirements:
Google Tesseract OCR and Poppler must be installed on the system.
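
The requirements above can be checked before a first run with a short script such as the following. It is a sketch under stated assumptions: the function name `missing_requirements` is hypothetical, the module names are the usual import names of the listed packages (`pdfminer` for pdfminer.six, `PIL` for Pillow), and `pdftoppm` is used as a stand-in for a working Poppler install.

```python
import importlib.util
import shutil

# Import names of the Python packages listed above.
PYTHON_MODULES = ["pdfminer", "pytesseract", "chardet", "PIL", "pdf2image"]
# Executables expected on PATH: Tesseract OCR and one of the Poppler utilities.
SYSTEM_TOOLS = ["tesseract", "pdftoppm"]


def missing_requirements(modules=PYTHON_MODULES, tools=SYSTEM_TOOLS):
    """Return a list describing every requirement that cannot be found."""
    missing = [
        f"python module: {name}"
        for name in modules
        if importlib.util.find_spec(name) is None
    ]
    missing += [
        f"system tool: {name}"
        for name in tools
        if shutil.which(name) is None
    ]
    return missing


if __name__ == "__main__":
    problems = missing_requirements()
    print("\n".join(problems) if problems else "All requirements found.")
```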

Usage

    usage: pdfscraper.py [-h] -i INPDF -o OUTTXT [-t]

    optional arguments:
      -h, --help            show this help message and exit
      -i INPDF, --input-dir INPDF
                            Path to the input pdf files
      -o OUTTXT, --output-dir OUTTXT
                            Path for the output txt files
      -t, --token-gen       Use flag to generate tokenized output
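
For readers who want to see how an interface like this is declared, the help text above corresponds roughly to the following `argparse` setup. This is a reconstruction from the usage message, not the project's actual source; the function name `build_parser` and the attribute name `token_gen` are assumptions.

```python
import argparse


def build_parser():
    """Build a parser matching the usage message (reconstructed, hypothetical)."""
    parser = argparse.ArgumentParser(prog="pdfscraper.py")
    parser.add_argument(
        "-i", "--input-dir", dest="inpdf", metavar="INPDF", required=True,
        help="Path to the input pdf files",
    )
    parser.add_argument(
        "-o", "--output-dir", dest="outtxt", metavar="OUTTXT", required=True,
        help="Path for the output txt files",
    )
    parser.add_argument(
        "-t", "--token-gen", dest="token_gen", action="store_true",
        help="Use flag to generate tokenized output",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.inpdf, args.outtxt, args.token_gen)
```

Both `-i` and `-o` are marked as required because the usage line shows them without square brackets.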

For example, to run the scraper:

    python pdfscraper.py -i /path/to/input/pdfs -o /path/to/output/directory

PDFscraper also has an optional flag, -t, which produces tokenized text for use in Natural Language Processing (NLP) tasks. For example, to produce tokenized output:

    python pdfscraper.py -i /path/to/input/pdfs -o /path/to/output/directory -t
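
The README does not say which tokenizer the -t flag uses, so the snippet below only illustrates what tokenized output generally means. The `tokenize` function is hypothetical: it lowercases the text and keeps runs of ASCII letters and digits, dropping punctuation and whitespace.

```python
import re


def tokenize(text):
    """Split text into lowercase alphanumeric tokens (illustrative only)."""
    return re.findall(r"[a-z0-9]+", text.lower())


print(tokenize("Scanned PDFs, page 2!"))
# → ['scanned', 'pdfs', 'page', '2']
```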

Docker

Alternatively, the accompanying Dockerfile can be used to run the program in a Docker container.

For example, to run an image tagged pdfscraper:latest with the input directory mounted at /data:

    docker run -v "/path/to/input/pdfs:/data" --rm pdfscraper:latest -i /data -o /data