Project author: sidmishraw

Project description:
PDF parsing and extraction utility using Apache Tika
Language: Java
Repository: git://github.com/sidmishraw/autobot.git
Created: 2017-08-25T08:30:02Z
Project page: https://github.com/sidmishraw/autobot


autobot - PDF parsing and extraction utility using Apache Tika

Autobot parses IEEE Xplore PDF files using Apache Tika and extracts the title, authorString, and content of each document.

Please download the utility jar from the link below:
https://github.com/sidmishraw/autobot/blob/master/build/libs/autobot-1.0.0.jar

Description:

It requires 2 inputs:

1> Absolute file path of a file named "conf.txt"

  This file lists the file paths of the input PDF documents, one per line.

For example:

    path-to-pdfs\04403110.pdf
    path-to-pdfs\04403128.pdf
    path-to-pdfs\04403127.pdf

2> Absolute file path of the output directory.
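
To make the conf.txt format concrete, here is a minimal sketch of how such a file could be read into a list of PDF paths. It is not taken from autobot's source: the class name ConfReader and its methods are hypothetical, and skipping blank lines and trimming whitespace are assumptions about reasonable behavior, not documented behavior of the tool.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of autobot: reads a conf.txt-style file.
class ConfReader {

    // Returns the non-blank lines, trimmed, in their original order.
    // Assumption: blank lines and surrounding whitespace carry no meaning.
    static List<String> parseLines(List<String> lines) {
        List<String> paths = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                paths.add(trimmed);
            }
        }
        return paths;
    }

    // Reads the file as UTF-8 and returns one PDF path per entry.
    static List<String> readConf(Path confFile) throws IOException {
        return parseLines(Files.readAllLines(confFile, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ConfReader.java <path-to-conf.txt>");
            return;
        }
        // Print each listed PDF path on its own line.
        for (String pdfPath : readConf(Paths.get(args[0]))) {
            System.out.println(pdfPath);
        }
    }
}
```

Keeping the line filtering in parseLines, separate from the file access, makes the format rule easy to check without touching the disk.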

Usage:

    java -jar autobot-1.0.0.jar "path-to-conf.txt" "path-to-output-directory"

For example:

    java -jar autobot-1.0.0.jar "/Users/sidmishraw/Downloads/conf.txt" "/Users/sidmishraw/Downloads/outpdfs"

Caveats:

• It cannot extract the exact author names. Instead, it extracts the text of the author-name area and groups it into a field named "authorString".

    {
      "title": "Incompleteness Errors in Ontology",
      "authorString": [
        "1 Muhammad Abdul Qadir, 2Muhammad Fahad, 3Syed Adnan Hussain Shah Muhammad Ali Jinnah University, Islamabad, Pakistan",
        "1aqadir@jinnah.edu.pk, 2mhd.fahad@gmail.com, 3syedadnan@gmail.com"
      ],
      "content": "Abstract\nOntology ev…"
    }

As the example shows, when numbered markers precede the names or email addresses (for instance "2Muhammad Fahad"), they are still difficult to remove.
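
One way a consumer of the output might handle these markers is a regex-based post-processing step. The sketch below is only a heuristic and is not part of autobot: the class AuthorStringCleaner and its method are hypothetical. It assumes a marker is a one- or two-digit number at the start of the string or right after a comma, directly before a letter. It would wrongly strip a genuine leading digit (for example, an email address that really starts with a number), which is part of why the problem is hard in general.

```java
import java.util.regex.Pattern;

// Hypothetical post-processing helper; not part of autobot.
class AuthorStringCleaner {

    // A 1-2 digit marker at the start of the string or right after a comma,
    // optionally followed by spaces, and immediately before a letter.
    private static final Pattern MARKER =
            Pattern.compile("(^|,\\s*)\\d{1,2}\\s*(?=\\p{L})");

    // Removes the markers; "$1" keeps the comma and its spacing, while the
    // digits and any spaces right after them are dropped.
    static String stripMarkers(String authorString) {
        return MARKER.matcher(authorString).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(stripMarkers("1 Muhammad Abdul Qadir, 2Muhammad Fahad"));
        System.out.println(stripMarkers("1aqadir@jinnah.edu.pk, 2mhd.fahad@gmail.com"));
    }
}
```

Limiting the marker to two digits keeps longer numbers such as postal codes untouched.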

Some PDF documents turn out well:

    {
      "title": "Privacy Preserving Collaborative Filtering using Data Obfuscation",
      "authorString": [
        "Rupa Parameswaran Georgia Institute of Technology",
        "School of Electrical and Computer Engineering Atlanta, GA",
        "rupa@ece.gatech.edu",
        "Douglas M Blough Georgia Institute of Technology",
        "School of Electrical and Computer Engineering Atlanta, GA",
        "doug.blough@ece.gatech.edu"
      ],
      "content": "Abstract\n…"
    }