Project author: cannadayr

Project description: A minimal web spider and indexer

Primary language: Shell
Repository: git://github.com/cannadayr/sql-crawl.git
Created: 2018-11-25T03:13:41Z
Project community: https://github.com/cannadayr/sql-crawl

License: MIT License

Overview

A minimal web spider and indexer

USE AT YOUR OWN RISK - possibly unsafe

Currently implemented:

  • robots.txt disallow rules (untested)
  • pagerank (needs tuning; see the note after this list)
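
For context on tuning, PageRank models a random surfer who follows a link with probability alpha and jumps to a random page otherwise; the 'alpha' parameter and iteration count mentioned under TODO control this process. A standard formulation (not necessarily the exact query used here) is:

  PR(p) = (1 - alpha)/N + alpha * sum over q->p of PR(q)/outlinks(q)

where N is the total number of pages and the sum runs over every page q that links to p.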

Dependencies

  • lynx
  • sqlite3
  • libsqlite3-dev
  • sqlitepipe
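
On Debian or Ubuntu, the first three are typically available as system packages (an assumption; package names vary across distributions), while sqlitepipe is compiled from the bundled source during Setup:

  sudo apt-get install lynx sqlite3 libsqlite3-dev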

Setup

  • fill out seed_data.sql (see sample_seed_data.sql and the sketch after this list)
  • compile sqlitepipe extension
    1. cd sqlitepipe/ && make
  • initialize schema and seed_data
    1. sqlite3 pages.db < schema.sql && sqlite3 pages.db < seed_data.sql
  • initialize robots.txt for a domain (no trailing slash!)
    1. ./robo-parse.sh https://example.com | sqlite3 pages.db
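
As a rough sketch of seeding the crawl frontier (hypothetical: the real table and column names are defined in schema.sql, and sample_seed_data.sql shows the intended format):

  # Hypothetical seed_data.sql contents; check schema.sql for the actual layout.
  echo "INSERT INTO pages (url) VALUES ('https://example.com');" > seed_data.sql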

Usage

  1. ./wrapper.sh
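
A crawl can run for a long time; one way to keep it going unattended and watch progress (a sketch; wrapper.sh's logging and exit behavior are assumptions):

  nohup ./wrapper.sh > crawl.log 2>&1 &
  tail -f crawl.log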

TODO

  • more thorough robots.txt testing
    • add in ‘allow’ logic
  • respect HTTP 429 rate-limit responses (see the sketch after this list)
  • respect ‘crawl-delay’ rules
  • consolidate the whitelist query in wrapper.sh with a CTE in crawl.sql
  • add full text search
  • add bayesian spam filtering
  • tuning of PageRank’s ‘alpha’ parameter & iteration count
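
A minimal sketch of backing off on 429 responses, assuming pages are fetched with lynx as elsewhere in this project ($url and the 60-second delay are placeholders):

  # Probe the response status with a HEAD request before fetching the page.
  status=$(lynx -head -dump "$url" 2>/dev/null | awk 'NR==1 {print $2}')
  if [ "$status" = "429" ]; then
      # Back off; parsing a Retry-After header would give the exact delay.
      sleep 60
  fi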

PageRank Attribution

The PageRank implementation is taken from the Stack Overflow network.

Original question

Answer provided by: Geng Liang

Attribution details