Project author: aledbetter

Project description:
Yet Another Document 2 Text for pdf/doc/html/rtf/etc - Extract text - or - convert to simplified HTML to retain layout information
Language: Java
Repository: git://github.com/aledbetter/yadoc2text.git
Created: 2018-03-12T14:03:37Z
Project page: https://github.com/aledbetter/yadoc2text

License: BSD 2-Clause "Simplified" License


Yet Another Document 2 Text

Extract Text or simplified HTML

This utility extracts text, or text plus some structural information, from documents so that the information can be processed. The general use case is NLP / NLU, where the document's structural information is needed to add semantic context to the content. The text output is the same as the HTML output, just without the HTML tags. OCR is not supported; this project does not (currently) work with images.
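Since the text output is described as the HTML output without its tags, the relationship between the two can be sketched in a few lines of Java. This is only an illustration under stated assumptions, not the project's own code: the class name `TagStripper` and the method `stripTags` are hypothetical, inline tags (`<b>`, `<u>`, `<i>`) are assumed to disappear without a break, every other tag is assumed to become a line break, and HTML entities are not decoded.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of yadoc2text: turns simplified HTML into plain text.
final class TagStripper {
    // Inline formatting tags are dropped without adding whitespace.
    private static final Pattern INLINE = Pattern.compile("</?(b|u|i)>", Pattern.CASE_INSENSITIVE);
    // Any remaining tag is treated as a structural boundary.
    private static final Pattern OTHER = Pattern.compile("<[^>]+>");
    // Runs of whitespace that contain a line break collapse to a single line break.
    private static final Pattern BLANKS = Pattern.compile("\\s*\\n\\s*");

    static String stripTags(String html) {
        String text = INLINE.matcher(html).replaceAll("");
        text = OTHER.matcher(text).replaceAll("\n");
        text = BLANKS.matcher(text).replaceAll("\n");
        return text.trim();
    }
}
```

For example, `TagStripper.stripTags("<h1>Moby Dick</h1><p>Call me <b>Ishmael</b>.</p>")` yields two lines: `Moby Dick` and `Call me Ishmael.`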

build and run locally

  1. Go to the base directory of the branch; the steps below build the package with everything in it.
  2. Install Maven if needed, for example with Homebrew: brew install maven
  3. mvn clean
  4. mvn install
  5. cd web
  6. mvn jetty:run
     To choose the port: mvn jetty:run -Djetty.port=8099
     old: mvn jetty:run -Dhttp.port=8099
     old: mvn jetty:run -Djetty.http.port=8099
  7. The index page is a test page for conversion (it could use some additions).

Supported Document Types

  1. Word: .doc, .docx, .dot
  2. PDF: .pdf
  3. html: .html, .htm, .mht
  4. text: .text, .txt
  5. richtext: .rtf
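A caller that wants to check whether a file is worth sending for conversion could mirror the list above with a small lookup table. The sketch below is an assumption-laden illustration rather than the project's API: `DocTypes` and `docTypeFor` are hypothetical names, and the type labels other than `html` (which appears in the `doc-type` meta example) are made up for this example.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical lookup, not part of yadoc2text: maps a file extension to a document type label.
final class DocTypes {
    // Extensions come from the supported-types list; the labels are illustrative.
    private static final Map<String, String> BY_EXTENSION = Map.ofEntries(
        Map.entry("doc", "word"),
        Map.entry("docx", "word"),
        Map.entry("dot", "word"),
        Map.entry("pdf", "pdf"),
        Map.entry("html", "html"),
        Map.entry("htm", "html"),
        Map.entry("mht", "html"),
        Map.entry("text", "text"),
        Map.entry("txt", "text"),
        Map.entry("rtf", "richtext"));

    // Returns the type label, or empty when the extension is missing or unsupported.
    static Optional<String> docTypeFor(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return Optional.empty();
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return Optional.ofNullable(BY_EXTENSION.get(ext));
    }
}
```

The lookup is case-insensitive on the extension, so `Report.DOCX` and `report.docx` resolve to the same label, while an image such as `scan.png` returns an empty `Optional`, in line with the note that OCR is not supported.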

Converted file html tags

  1. Title: <title>
  2. Headings: <h1>, <h2>, <h3>...<hn>
  3. Text: <b>, <u>, <i>
  4. Structure: <p>, <header>, <footer>
  5. Lists: <ol>, <ul>, <li>
  6. Sections: <section>, <article> TODO
  7. Tables: TBD

Converted meta info

  1. Document type: <meta name="doc-type" content="html">
  2. Original document name: <meta name="doc-name" content="test.html">
  3. Created time: <meta name="doc-created" content="xxxxx">
  4. Modified time: <meta name="doc-modified" content="xxx">
  5. Author: <meta name="doc-author" content="Bober Simthsonsons">
  6. Language: <meta name="doc-language" content="en">
  7. URL: <meta name="doc-url" content="http://www.sample.com/moby-dick.html">
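Downstream code that consumes the converted HTML may want these `doc-*` values as a map. The following is a minimal sketch of one way to read them, assuming the tags appear exactly in the `name="..." content="..."` order shown above; the class `DocMeta` and its `read` method are hypothetical, and a real HTML parser would be more robust than a regular expression.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical reader, not part of yadoc2text: collects doc-* meta tags from converted HTML.
final class DocMeta {
    // Assumes the attribute order name, then content, as in the examples in this README.
    private static final Pattern META = Pattern.compile(
        "<meta\\s+name=\"(doc-[a-z]+)\"\\s+content=\"([^\"]*)\"\\s*/?>");

    // Returns the meta values in document order, keyed by meta name (for example "doc-type").
    static Map<String, String> read(String html) {
        Map<String, String> meta = new LinkedHashMap<>();
        Matcher m = META.matcher(html);
        while (m.find()) {
            meta.put(m.group(1), m.group(2));
        }
        return meta;
    }
}
```

With this sketch, `DocMeta.read(html).get("doc-language")` would return `en` for the sample tags listed above, and a name that is absent from the document simply has no entry in the map.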