Project author: vladserkoff

Project description: Load HTMLs from Common Crawl
Programming language: Python
Repository: git://github.com/vladserkoff/common-crawler.git
Created: 2019-07-02T15:49:54Z
Project homepage: https://github.com/vladserkoff/common-crawler


Common Crawler

An app that lets you find and download web page contents from Common Crawl.

Installation

pip install git+https://github.com/vladserkoff/common-crawler.git

Usage

(Optional) Deploy a local Common Crawl index server

Best practice is to deploy your own index server so as not to overload the public server hosted by Common Crawl.

  # deploy local common crawl index
  git clone https://github.com/commoncrawl/cc-index-server.git
  cd cc-index-server
  # edit install-collections.sh to only include recent indexes, otherwise it will load gigabytes of data
  docker build -t cc-index-server .
  docker run -d -p 8080:8080 cc-index-server
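Once the container is up, you can sanity-check the server before pointing the client at it. The snippet below is a minimal sketch, assuming the server was published on port 8080 as in the docker run command above; any HTTP client would do.

  from urllib.request import urlopen

  # The index server should answer on the port published by `docker run -d -p 8080:8080`.
  with urlopen('http://localhost:8080/', timeout=10) as resp:
      print(resp.getcode())  # 200 means the server is reachable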

Find available URLs for a domain, then load an HTML page with additional metadata

  In [1]: from common_crawler import CommonCrawler
  In [2]: cc = CommonCrawler('http://localhost:8080')  # or leave it blank to use Common Crawl's server
  In [3]: urls = cc.find_domain_urls('http://example.com')
  In [4]: len(urls)
  Out[4]: 2958
  In [5]: dat = cc.load_page_data(urls[0])
  In [6]: dat.keys()
  Out[6]: dict_keys(['filename', 'length', 'offset', 'status', 'timestamp', 'index', 'warc_header', 'http_header', 'html'])
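
Putting the two calls together, a small batch download might look like the sketch below. It only uses the find_domain_urls and load_page_data methods shown above, and assumes find_domain_urls returns a list (as the indexing in the session suggests) and that dat['html'] is a decoded string; both assumptions may need adjusting.

  from pathlib import Path

  from common_crawler import CommonCrawler

  cc = CommonCrawler('http://localhost:8080')  # or CommonCrawler() for the public server
  out_dir = Path('pages')
  out_dir.mkdir(exist_ok=True)

  # Fetch the first few captures for a domain and save the HTML to disk.
  for i, url in enumerate(cc.find_domain_urls('http://example.com')[:5]):
      dat = cc.load_page_data(url)
      # Assumes dat['html'] is a str, as the dict_keys output above suggests;
      # switch to write_bytes if the library returns raw bytes.
      (out_dir / f'page_{i}.html').write_text(dat['html'])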