essence

An automatic web page content extractor for Kotlin and Java.

Given an HTML document, essence automatically extracts the main text content (and much more).

Try out the demo - a simple webapp to demonstrate essence.

This library is inspired by node-unfluff and its lineage.

Usage

Java

import io.github.cdimascio.essence.Essence;

// html is a String holding the page's HTML source
EssenceResult data = Essence.extract(html);
System.out.println(data.getText());

Kotlin

import io.github.cdimascio.essence.Essence

// html is a String holding the page's HTML source
val data = Essence.extract(html)
println(data.text)

See Extracted data elements for additional extracted metadata.

Maven

<dependency>
  <groupId>io.github.cdimascio</groupId>
  <artifactId>essence</artifactId>
  <version>0.13.0</version>
</dependency>

Gradle

implementation 'io.github.cdimascio:essence:0.13.0'

Essence web is a simple web page that fetches the content at a given URL and passes the HTML to the essence library.

The essence web project lives here.
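The fetch-then-extract flow described above can be sketched in plain Java. This is a minimal illustration, not the essence web implementation: the class FetchAndExtract and its methods fetchHtml and extractText are hypothetical names, and the extractor is passed in as a function so the sketch compiles with only the JDK (11 or later). The tag-stripping placeholder in main is a stand-in and does not reflect how essence extracts content.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.function.Function;

// Hypothetical helper class; not part of essence or essence web.
class FetchAndExtract {

    // Downloads the raw HTML at the given URL with the JDK's built-in HTTP client.
    public static String fetchHtml(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode() + " for " + url);
        }
        return response.body();
    }

    // Passes already-fetched HTML to the supplied extractor and returns its result.
    public static String extractText(String html, Function<String, String> extractor) {
        if (html == null || html.isBlank()) {
            throw new IllegalArgumentException("html must not be empty");
        }
        return extractor.apply(html);
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: FetchAndExtract <url>");
            return;
        }
        String html = fetchHtml(args[0]);
        // Assumption: with essence on the classpath, the extractor would be
        //   h -> Essence.extract(h).getText()
        // The placeholder below only strips tags so the sketch has no dependencies.
        Function<String, String> extractor = h -> h.replaceAll("<[^>]*>", " ").trim();
        System.out.println(extractText(html, extractor));
    }
}
```

Keeping the download step separate from the extraction step means the extractor can be swapped for the real library call, or for a stub in tests, without touching the HTTP code.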

essence attempts to extract the following content:

title - The document’s title
softTitle - A version of title with less truncation
date - The document’s publication date
copyright - The document’s copyright line, if present
author - The document’s author
publisher - The document’s publisher (website name)
text - The main text of the document with all the junk thrown away
image - The main image for the document (what’s used by Facebook, etc.)
videos (coming soon) - An array of videos that were embedded in the article. Each video has src, width and height.
tags - Any tags or keywords that could be found by checking rel tags or by looking at href URLs.
canonicalLink - The canonical URL of the document, if given.
lang - The language of the document, either detected or supplied by you.
description - The description of the document, from <meta> tags
favicon - The URL of the document’s favicon.
links - An array of links embedded within the article text, each with text and href.

Thanks go to everyone who has contributed to this project.

This project follows the all-contributors specification. Contributions of any kind are welcome!