Project author: cannadayr

Project description: A minimal web spider and indexer

Primary language: Shell
Repository: git://github.com/cannadayr/sql-crawl.git
Created: 2018-11-25T03:13:41Z
Project community: https://github.com/cannadayr/sql-crawl

License: MIT License

Overview

A minimal web spider and indexer

USE AT YOUR OWN RISK - possibly unsafe

Currently implemented:

  • robots.txt disallow rules (untested)
  • pagerank (needs tuning; see the note after this list)
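
For context on tuning, PageRank models a random surfer who follows a link with probability alpha and jumps to a random page otherwise; the 'alpha' parameter and iteration count mentioned under TODO control this process. A standard formulation (not necessarily the exact query used here) is:

  PR(p) = (1 - alpha)/N + alpha * sum over q->p of PR(q)/outlinks(q)

where N is the total number of pages and the sum runs over every page q that links to p.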

Dependencies

  • lynx
  • sqlite3
  • libsqlite3-dev
  • sqlitepipe
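
On Debian or Ubuntu, the first three are typically available as system packages (an assumption; package names vary across distributions), while sqlitepipe is compiled from the bundled source during Setup:

  sudo apt-get install lynx sqlite3 libsqlite3-dev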

Setup

  • fill out seed_data.sql (see sample_seed_data.sql and the sketch after this list)
  • compile sqlitepipe extension
    1. cd sqlitepipe/ && make
  • initialize schema and seed_data
    1. sqlite3 pages.db < schema.sql && sqlite3 pages.db < seed_data.sql
  • initialize robots.txt for a domain (no trailing slash!)
    1. ./robo-parse.sh https://example.com | sqlite3 pages.db
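
As a rough sketch of seeding the crawl frontier (hypothetical: the real table and column names are defined in schema.sql, and sample_seed_data.sql shows the intended format):

  # Hypothetical seed_data.sql contents; check schema.sql for the actual layout.
  echo "INSERT INTO pages (url) VALUES ('https://example.com');" > seed_data.sql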

Usage

  1. ./wrapper.sh
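
A crawl can run for a long time; one way to keep it going unattended and watch progress (a sketch; wrapper.sh's logging and exit behavior are assumptions):

  nohup ./wrapper.sh > crawl.log 2>&1 &
  tail -f crawl.log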

TODO

  • more thorough robots.txt testing
    • add in ‘allow’ logic
  • respect HTTP 429 rate-limit responses (see the sketch after this list)
  • respect ‘crawl-delay’ rules
  • consolidate the whitelist query in wrapper.sh with a CTE in crawl.sql
  • add full text search
  • add bayesian spam filtering
  • tuning of PageRank’s ‘alpha’ parameter & iteration count
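
A minimal sketch of backing off on 429 responses, assuming pages are fetched with lynx as elsewhere in this project ($url and the 60-second delay are placeholders):

  # Probe the response status with a HEAD request before fetching the page.
  status=$(lynx -head -dump "$url" 2>/dev/null | awk 'NR==1 {print $2}')
  if [ "$status" = "429" ]; then
      # Back off; parsing a Retry-After header would give the exact delay.
      sleep 60
  fi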

PageRank Attribution

The PageRank implementation is taken from the Stack Overflow network.

Original question

Answer provided by: Geng Liang

Attribution details