Author: yuanxu-li
Description: Extract data from HTML tables
Language: Python
Repository: git://github.com/yuanxu-li/html-table-extractor.git
Created: 2017-04-10T22:04:42Z
Project page: https://github.com/yuanxu-li/html-table-extractor
License: MIT License


HTML Table Extractor

HTML Table Extractor is a Python library that uses Beautiful Soup to extract data from complicated and messy HTML tables.

Installation

    pip install 'beautifulsoup4==4.5.3'
    pip install html-table-extractor

Usage

Example 1 - Simple

    from html_table_extractor.extractor import Extractor
    table_doc = """
    <table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
    """
    extractor = Extractor(table_doc)
    extractor.parse()
    extractor.return_list()

In an interactive session, it will print out (the u prefix appears only under Python 2):

    [[u'1', u'2'], [u'3', u'4']]

Example 2 - Transformer

    from html_table_extractor.extractor import Extractor
    table_doc = """
    <table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
    """
    extractor = Extractor(table_doc, transformer=int)
    extractor.parse()
    extractor.return_list()

It will print out:

    [[1, 2], [3, 4]]
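The example above suggests that the transformer is called on the text of each cell. With transformer=int, an empty or non-numeric cell would make int raise ValueError, so a more forgiving callable may be useful for messy tables. The following is a minimal sketch; safe_int is a hypothetical helper, not part of the library, and it is exercised here on plain strings rather than on a real Extractor.

```python
def safe_int(text, default=None):
    """Convert cell text to int, returning `default` when it is not an integer."""
    try:
        return int(text.strip())
    except ValueError:
        return default


# Stand-in for the cell texts an extractor would hand to the transformer.
cells = ["1", " 2 ", "", "n/a"]
converted = [safe_int(cell) for cell in cells]
print(converted)
```

Assuming the library accepts any callable as transformer, as the int example suggests, this helper could be passed as Extractor(table_doc, transformer=safe_int).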

Example 3 - Pass BS4 Tag

    from html_table_extractor.extractor import Extractor
    from bs4 import BeautifulSoup
    table_doc = """
    <html><table id='wanted'><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table><table id='unwanted'><tr><td>not wanted</td></tr></table></html>
    """
    soup = BeautifulSoup(table_doc, 'html.parser')
    extractor = Extractor(soup, id_='wanted')
    extractor.parse()
    extractor.return_list()

It will print out:

    [[u'1', u'2'], [u'3', u'4']]

Example 4 - Complex

    from html_table_extractor.extractor import Extractor
    table_doc = """
    <table>
    <tr>
    <td rowspan=2>1</td>
    <td>2</td>
    <td>3</td>
    </tr>
    <tr>
    <td colspan=2>4</td>
    </tr>
    <tr>
    <td colspan=3>5</td>
    </tr>
    </table>
    """
    extractor = Extractor(table_doc)
    extractor.parse()
    extractor.return_list()

It will print out:

    [[u'1', u'2', u'3'], [u'1', u'4', u'4'], [u'5', u'5', u'5']]

Example 5 - Conflicted

    from html_table_extractor.extractor import Extractor
    table_doc = """
    <table>
    <tr>
    <td rowspan=2>1</td>
    <td>2</td>
    <td rowspan=3>3</td>
    </tr>
    <tr>
    <td colspan=2>4</td>
    </tr>
    <tr>
    <td colspan=2>5</td>
    </tr>
    </table>
    """
    extractor = Extractor(table_doc)
    extractor.parse()
    extractor.return_list()

It will print out (in the second row, the slot claimed by both cell 3 and cell 4 keeps the earlier value, 3):

    [[u'1', u'2', u'3'], [u'1', u'4', u'3'], [u'5', u'5', u'3']]
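To make the outputs of Examples 4 and 5 easier to follow, here is a minimal sketch of one way to expand rowspan and colspan into a grid using only the standard library. It is not the library's implementation: CellCollector and expand_spans are hypothetical names, nested tables are not handled, and the rule that an already-filled slot is never overwritten is an assumption chosen to reproduce the output shown above.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect [text, rowspan, colspan] for every td/th, grouped by tr."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of cells per <tr>
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            attrs = dict(attrs)
            rowspan = int(attrs.get("rowspan") or 1)
            colspan = int(attrs.get("colspan") or 1)
            self.rows[-1].append(["", rowspan, colspan])
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1][0] += data


def expand_spans(html):
    """Return a list of rows in which spanned cells are duplicated."""
    collector = CellCollector()
    collector.feed(html)
    grid = {}  # (row, col) -> text; a filled slot is never overwritten
    for r, cells in enumerate(collector.rows):
        c = 0
        for text, rowspan, colspan in cells:
            while (r, c) in grid:   # skip slots claimed by an earlier rowspan
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid.setdefault((r + dr, c + dc), text.strip())
            c += colspan
    if not grid:
        return []
    n_rows = max(row for row, _ in grid) + 1
    n_cols = max(col for _, col in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


# The table from Example 5: cell 3 (rowspan=3) and cell 4 (colspan=2) overlap.
conflicted = """
<table>
<tr><td rowspan=2>1</td><td>2</td><td rowspan=3>3</td></tr>
<tr><td colspan=2>4</td></tr>
<tr><td colspan=2>5</td></tr>
</table>
"""
result = expand_spans(conflicted)
print(result)
```

Using a dictionary keyed by (row, col) keeps the conflict rule to a single setdefault call: whichever cell reaches a slot first keeps it, which is why the 3 survives in the second row.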

Example 6 - Write to file

    from html_table_extractor.extractor import Extractor
    table_doc = """
    <table><tr><td>1</td><td>2</td></tr><tr><td>3</td><td>4</td></tr></table>
    """
    extractor = Extractor(table_doc).parse()
    extractor.write_to_csv(path='.')

This writes a new CSV file called output.csv to the given path:

    1,2
    3,4
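If a different file name or delimiter is needed, the list returned by return_list() can also be written with Python's standard csv module. The following is a minimal sketch under that assumption; write_rows, its parameters, and the table.tsv file name are hypothetical and not part of the library, and a literal list stands in for the extractor output.

```python
import csv
import os
import tempfile


def write_rows(rows, path, filename="table.tsv", delimiter="\t"):
    """Write a list of rows to path/filename and return the full file path."""
    full_path = os.path.join(path, filename)
    # newline="" lets the csv module control line endings (Python 3).
    with open(full_path, "w", newline="") as handle:
        csv.writer(handle, delimiter=delimiter).writerows(rows)
    return full_path


# Stand-in for the value returned by extractor.return_list().
rows = [["1", "2"], ["3", "4"]]
out_dir = tempfile.mkdtemp()
written = write_rows(rows, out_dir)

# Read the file back to show what was written.
with open(written, newline="") as handle:
    content = handle.read()
print(repr(content))
```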

Errors / Bugs

If something is not working correctly, or if you have any suggestions for improvement, please report it on the project's GitHub page.

Copyright (c) 2017 Justin Li. Released under the MIT License.

Third-party copyright in this distribution is noted where applicable.

Misc

How to upload the package to PyPI (for the owner's reference)

  • python setup.py bdist_wheel --universal
  • twine upload dist/* --verbose