Project author: ghostdogpr

Project description:
Scala library to extract relevant content from an article HTML
Language: Scala
Repository: git://github.com/ghostdogpr/readability4s.git
Created: 2017-10-06T06:47:47Z
Project community: https://github.com/ghostdogpr/readability4s

License: Apache License 2.0

readability4s

A Scala library to extract content from an article HTML: title, full text, favicon, image, etc.

This project is a Scala port of Mozilla’s Readability.js with a few tweaks and improvements.
It is built for Scala 2.12.

Usage

Import the project with Maven as follows:

    <dependency>
      <groupId>com.github.ghostdogpr</groupId>
      <artifactId>readability4s</artifactId>
      <version>1.0.9</version>
    </dependency>
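
For sbt users, the same coordinates can be sketched as a single build.sbt line. This is an assumption derived from the Maven snippet: since the artifactId carries no Scala version suffix, a single `%` is used rather than `%%`.

```scala
// build.sbt (sketch): same group, artifact and version as the Maven snippet.
// Assumes the artifact is published without a Scala version suffix, hence "%" and not "%%".
libraryDependencies += "com.github.ghostdogpr" % "readability4s" % "1.0.9"
```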

To parse a document, you must create a new Readability object from a URI string and an HTML string, and then call parse(). Here’s an example:

    val article = Readability(url, htmlString).parse()

parse() returns an Option[Article].
The result is None when the article could not be processed, or Some(Article) with the following properties:

  • uri: original URI string that was passed to the constructor
  • title: article title
  • byline: author metadata
  • content: HTML string of processed article content
  • textContent: text of processed article content
  • length: length of article, in characters
  • excerpt: article description, or short excerpt from content
  • faviconUrl: URL of the favicon image
  • imageUrl: URL of an image representing the article
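
Because parse() returns an Option, callers need to handle both outcomes. The sketch below shows one way to do that with pattern matching. To stay self-contained it defines a hypothetical stand-in `Article` case class that mirrors the properties listed above; the field types (String everywhere, Int for length) are assumptions, and `ArticleSummary` and `summarize` are illustrative names that are not part of the library.

```scala
// Hypothetical stand-in mirroring the documented properties.
// The field types are assumptions; the real class comes from readability4s.
final case class Article(
  uri: String,
  title: String,
  byline: String,
  content: String,
  textContent: String,
  length: Int,
  excerpt: String,
  faviconUrl: String,
  imageUrl: String
)

object ArticleSummary {
  // Turn the Option[Article] returned by parse() into a one-line summary.
  def summarize(result: Option[Article]): String = result match {
    case Some(a) => s"${a.title} (${a.length} characters) from ${a.uri}"
    case None    => "article could not be processed"
  }
}
```

With the library on the classpath, the stand-in class would be dropped and the call would look like `ArticleSummary.summarize(Readability(url, htmlString).parse())`.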