Project author: oAGoulart

Project description:
A small Python package to extract content from web pages.
Language: Python
Repository: git://github.com/oAGoulart/markout.git
Created: 2019-11-30T06:35:55Z
Project community: https://github.com/oAGoulart/markout

License: MIT License



Markout

A small Python package I made to extract HTML content from web pages. It is highly customizable, and I built it to fit my own needs: converting the code of multiple pages to Markdown while keeping only the HTML tags I needed. Because its purpose is to convert specific HTML tags into a desired format, this script does not generate any standard output. Instead, it uses custom tokens specified in a configuration file, so the output can be formatted into anything.

Usage

Importing into your code

To use this package you’ll need to install it using pip:

    pip install markout-html

Then just import it into your code:

    from markout_html import extract_url, extract_html

After that you can use the extract_url and extract_html functions:

    result = extract_url(
        # HTML page link
        'http://example.page.com/blog/some_post.html',
        # Tokens to format each HTML tag's contents (you can extract only the ones you want)
        {
            'p': "\n** {} **"
        },
        # Only extract contents inside this tag
        'article'
    )

    result = extract_html(
        # HTML code string
        '<html>some html code</html>',
        # Tokens to format each HTML tag's contents (you can extract only the ones you want)
        {
            'p': "\n** {} **"
        },
        # Only extract contents inside this tag
        'article'
    )

Using the CLI command

If you don’t want to write a Python script, the examples below describe how to use the package’s command instead.

If you just want to run an extraction on a string given in the terminal, use markout_html --extract [string].

Run markout_html with the --help flag for more information.

Configuration

All settings live in a single file: .markoutrc.json (you can specify another name in the terminal with the --config flag). If you don’t load a configuration file, the script uses its default values. There is an example configuration file in the repository root.

To specify a different configuration file use:

    markout_html --config [filename]
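
The fallback to default values described above can be pictured with the following minimal sketch. It is not the package’s own loader: the function name load_config and the contents of DEFAULTS are assumptions made for illustration, and the real defaults may differ.

```python
import json
from pathlib import Path

# Hypothetical defaults, for illustration only; the package's real defaults may differ.
DEFAULTS = {"links": {}, "only_on": "body", "tokens": {"p": "\n{}"}}


def load_config(path=".markoutrc.json"):
    """Return the settings stored in `path`, falling back to DEFAULTS."""
    config = dict(DEFAULTS)
    file = Path(path)
    if file.is_file():
        with file.open(encoding="utf-8") as handle:
            # Keys present in the file override the defaults.
            config.update(json.load(handle))
    return config
```

Calling load_config() with no argument reads .markoutrc.json from the current directory, while load_config("other.json") mirrors what the --config flag does on the command line.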

The configuration file values

links - object that maps each link to be extracted to its destination (the output file).
Example:

    {
        "links": {
            "http://example.page.com/blog/some_post.html": "out/post.md",
            "http://example.page.com/blog/some_other_post.html": "out/other_post.md"
        }
    }

The example above fetches the HTML from http://example.page.com/blog/some_post.html and writes the extracted result to out/post.md. The second link is handled the same way and written to out/other_post.md.
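
As a rough sketch of how a links object like this could be processed from Python, the helper below loops over the mapping and writes each result to its destination. The name extract_links is hypothetical, and the extractor is passed in as a parameter; the sketch assumes that extractor takes (url, tokens, only_on) and returns a string, which this document does not confirm for extract_url.

```python
from pathlib import Path


def extract_links(links, extract, tokens, only_on):
    """Run `extract` on every URL and write each result to its destination file."""
    written = []
    for url, destination in links.items():
        # Assumption: the extractor returns the formatted text as a string.
        text = extract(url, tokens, only_on)
        target = Path(destination)
        # Create the output directory (e.g. out/) if it does not exist yet.
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

With the package installed, extract_url could be passed as the extract argument, provided it does return a string as assumed here.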

only_on - string that specifies which HTML tag to extract the contents from (e.g. html, body, main).
Example:

    {
        "only_on": "article"
    }

tokens - object in which the contents of each specified HTML tag are extracted into a formatted string and then placed in the output file.
Example:

    {
        "tokens": {
            "header": "# {}",
            "h1": "\n# {}",
            "h2": "\n# {}",
            "b": "\n## {}",
            "li": "+ {}",
            "i": "** {} **",
            "p": "\n{}",
            "span": "{}"
        }
    }

In the example above, the contents of the HTML tag <header> are extracted into the # {} string. For example, given <header>Some text here!</header>, the result would be # Some text here! (which formats the text as Markdown).
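
To make the token idea concrete, here is a small standalone sketch built on Python’s standard html.parser module. It is not the package’s implementation: the names TokenExtractor and extract_tokens are hypothetical, and it assumes that a token’s {} placeholder is filled with str.format and that only text directly inside a listed tag is kept.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TokenExtractor(HTMLParser):
    """Collect text from the chosen tags, formatted through per-tag tokens."""

    def __init__(self, tokens, only_on):
        super().__init__()
        self.tokens = tokens
        self.only_on = only_on
        self.depth = 0    # greater than 0 while inside the `only_on` tag
        self.stack = []   # currently open tags, innermost last
        self.output = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag == self.only_on:
            self.depth += 1
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag == self.only_on and self.depth:
            self.depth -= 1
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if self.depth and text and self.stack and self.stack[-1] in self.tokens:
            # Assumption: the token's {} placeholder is filled via str.format.
            self.output.append(self.tokens[self.stack[-1]].format(text))


def extract_tokens(html, tokens, only_on):
    parser = TokenExtractor(tokens, only_on)
    parser.feed(html)
    parser.close()
    return "".join(parser.output)
```

For instance, extract_tokens('<article><header>Some text here!</header></article>', {"header": "# {}"}, "article") yields # Some text here!, while text in tags outside <article> or in tags without a token is skipped.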

Contributions

Contributions are welcome, and I would really appreciate them!
Also, if you have any questions or trouble using this package, just contact me or open an issue.