Project author: SbstnErhrdt

Project description:
Simple Node.js server to extract relevant content from website source code using Mozilla's Readability.js
Language: JavaScript
Repository: git://github.com/SbstnErhrdt/node-readability.git
Created: 2021-01-03T20:06:20Z
Project community: https://github.com/SbstnErhrdt/node-readability

License: Apache License 2.0


Readability Service

This is a small Node.js server that processes HTML content
with Readability, the library Mozilla developed for Firefox.

See: https://github.com/mozilla/readability/

The project provides an HTTP endpoint that uses the Readability library
to extract the most relevant content from a rendered web page.

Docker

Run the Docker container:

    docker run -p 8080:8080 ese7en/node-readability

Request

The request body must be a JSON object with the following property:

  • data: the HTML source code as an escaped string

    HTTP PUT /
    HTTP HEADER: Content-Type: application/json
    {
      "data": "...HTML SOURCE CODE AS STRING..."
    }
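
Escaping the HTML by hand is error-prone. As a minimal sketch, assuming the client is written in JavaScript, `JSON.stringify` can produce the escaped string, since it escapes quotes, backslashes, and line breaks. The helper name `buildRequestBody` is hypothetical and not part of this project.

```javascript
// Hypothetical helper (not part of this project): wraps raw HTML in the
// request object described above and serializes it to JSON.
function buildRequestBody(html) {
  if (typeof html !== "string") {
    throw new TypeError("html must be a string");
  }
  // JSON.stringify escapes quotes, backslashes, and control characters.
  return JSON.stringify({ data: html });
}

// The attribute quotes and the line break are escaped in the output.
const requestBody = buildRequestBody('<p class="intro">Hello\nWorld</p>');
console.log(requestBody);
```

Parsing the result with `JSON.parse` returns the original HTML unchanged, so no manual escaping step is needed on the client side.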

Response

The response object contains the following properties:

  • title: article title
  • content: HTML string of processed article content
  • textContent: text content of the article (all HTML removed)
  • length: length of an article, in characters
  • excerpt: article description, or short excerpt from the content
  • byline: author metadata
  • dir: content direction
  • siteName: name of the site
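
As a sketch of how a client might consume these properties, the snippet below parses a response body and derives a word count from `textContent`. The function name `summarizeArticle` and the `wordCount` field are illustrative assumptions; the sample values are taken from the end-to-end example in this document.

```javascript
// Hypothetical consumer of the response object; not part of this project.
function summarizeArticle(responseText) {
  const article = JSON.parse(responseText);
  // textContent has all HTML removed, so it can be split on whitespace.
  const words = (article.textContent || "").trim().split(/\s+/).filter(Boolean);
  return {
    title: article.title,
    excerpt: article.excerpt,
    // byline can be null when the page carries no author metadata.
    byline: article.byline ?? "unknown",
    wordCount: words.length,
  };
}

const sampleResponse = JSON.stringify({
  title: "Hello World",
  byline: null,
  dir: null,
  textContent: "\n This is a website\n With some text\n \n",
  excerpt: "With some text",
});
console.log(summarizeArticle(sampleResponse));
```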

Environment Variables

  • PORT: the port on which the server listens
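
A minimal sketch of how a Node.js server might read this variable. The fallback of 8080 is an assumption chosen to match the Docker example above, not a documented default, and `resolvePort` is a hypothetical helper rather than this project's code.

```javascript
// Hypothetical helper; the 8080 fallback is an assumption, not a documented default.
function resolvePort(env = process.env, fallback = 8080) {
  const port = Number.parseInt(env.PORT ?? "", 10);
  // Accept only valid TCP port numbers; otherwise use the fallback.
  return Number.isInteger(port) && port > 0 && port < 65536 ? port : fallback;
}

console.log(resolvePort({ PORT: "3000" }));
console.log(resolvePort({}));
```

With Docker, the variable can be passed with the `-e` flag, for example `docker run -e PORT=3000 -p 3000:3000 ese7en/node-readability`, assuming the image hands the variable through to the server.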

End-to-end example

Website

    <html>
      <head>
        <title>Hello World</title>
      </head>
      <body>
        <h1>This is a website</h1>
        <p>With some text</p>
      </body>
    </html>

HTTP PUT Request to http://localhost:8080

    {
      "data": "<html>\r\n <head>\r\n <title>Hello World<\/title>\r\n <\/head>\r\n <body>\r\n <h1>This is a website<\/h1>\r\n <p>With some text<\/p>\r\n <\/body>\r\n<\/html>"
    }

With curl:

    curl --request POST \
      --url http://localhost:8080/ \
      --header 'Content-Type: application/json' \
      --data '{
      "data": "<html>\r\n <head>\r\n <title>Hello World<\/title>\r\n <\/head>\r\n <body>\r\n <h1>This is a website<\/h1>\r\n <p>With some text<\/p>\r\n <\/body>\r\n<\/html>"
    }'
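
The same request can be sent from JavaScript. This is a sketch that assumes Node.js 18 or later (for the global `fetch`) and a server listening on http://localhost:8080. The names `buildFetchArgs` and `extractArticle` are hypothetical. The method defaults to POST to mirror the curl command and can be overridden through the options.

```javascript
// Hypothetical client helpers; not part of this project.
function buildFetchArgs(html, { url = "http://localhost:8080/", method = "POST" } = {}) {
  return [
    url,
    {
      method,
      headers: { "Content-Type": "application/json" },
      // JSON.stringify takes care of escaping the HTML source.
      body: JSON.stringify({ data: html }),
    },
  ];
}

async function extractArticle(html, options) {
  const [url, init] = buildFetchArgs(html, options);
  const response = await fetch(url, init);
  if (!response.ok) {
    throw new Error(`Readability service responded with HTTP ${response.status}`);
  }
  // Resolves to the parsed response object (title, content, textContent, ...).
  return response.json();
}

// Usage (requires a running server):
// extractArticle("<html>...</html>").then((article) => console.log(article.title));
```

Keeping the request construction in its own function makes it possible to inspect or test the request without a running server.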

HTTP Response

    {
      "title": "Hello World",
      "byline": null,
      "dir": null,
      "content": "<div id=\"readability-page-1\" class=\"page\">\n <h2>This is a website</h2>\n <p>With some text</p>\n \n</div>",
      "textContent": "\n This is a website\n With some text\n \n",
      "length": 55,
      "excerpt": "With some text",
      "siteName": null
    }