Automatically extract the main text content (and more) from an HTML document
An automatic web page content extractor for Kotlin and Java.
Given an HTML document, essence automatically extracts the main text content (and much more).
Try out the demo - a simple webapp to demonstrate essence.
This library is inspired by node-unfluff and its lineage.
Java
import io.github.cdimascio.essence.Essence;
EssenceResult data = Essence.extract(html);
System.out.println(data.getText());
Kotlin
import io.github.cdimascio.essence.Essence
val data = Essence.extract(html)
println(data.text)
See Extracted data elements for additional extracted metadata.
Maven
<dependency>
<groupId>io.github.cdimascio</groupId>
<artifactId>essence</artifactId>
<version>0.13.0</version>
<type>pom</type>
</dependency>
Gradle
implementation 'io.github.cdimascio:essence:0.13.0'
Essence web is a simple web page that fetches content at a given url and passes the HTML to this essence library.
The essence web project lives here
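The flow described above (fetch the page at a URL, then hand the HTML to essence) can be sketched with the JDK alone. This is a minimal illustration, not the essence web implementation: PageFetcher and fetchHtml are hypothetical names, and the sketch assumes the page is UTF-8 encoded.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of essence or essence web.
final class PageFetcher {
    private PageFetcher() {}

    // Downloads the resource at the given absolute URL and decodes it as UTF-8.
    // Assumption: the page is UTF-8; a more careful client would honor the
    // charset declared by the server or by the document itself.
    static String fetchHtml(String url) throws IOException {
        try (InputStream in = URI.create(url).toURL().openStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}
```

With essence on the classpath, the fetched string can then be passed to the extractor shown earlier, for example Essence.extract(PageFetcher.fetchHtml(url)).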
essence attempts to extract the following content:
- title - The document’s title
- softTitle - A version of title with less truncation
- date - The document’s publication date
- copyright - The document’s copyright line, if present
- author - The document’s author
- publisher - The document’s publisher (website name)
- text - The main text of the document with all the junk thrown away
- image - The main image for the document (what’s used by Facebook, etc.)
- videos - An array of videos that were embedded in the article. Each video has src, width and height.
- tags - Any tags or keywords that could be found in the document
- canonicalLink - The canonical URL of the document, if given
- lang - The language of the document, either detected or supplied by you
- description - The description of the document, from meta tags
- favicon - The URL of the document’s favicon
- links - An array of links embedded within the article text (text and href for each)
Thanks go to these wonderful people (emoji key):
Clément P. 💻
This project follows the all-contributors specification. Contributions of any kind welcome!