Author: abel0b

Description:
Crawler and web search engine
Language: Rust
Repository: git://github.com/abel0b/gotoint.git
Created: 2020-02-24T17:02:40Z
Project page: https://github.com/abel0b/gotoint


gotoint

Web search engine.

Tasks

  • Crawler
    • HTML parser
    • Content extraction
    • Database
    • Visited pages bloom filter
    • Multithreading
    • Message queue
    • Priority queue
    • Politeness
    • Re-crawling
    • Handling crawler traps and overly long URLs
    • Distributed
    • Language detection
    • Duplicate detection
    • DNS cache
  • Index
  • Query
    • Webapp
  • Project name
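
To make the "Visited pages bloom filter" task concrete, here is a minimal sketch of how a crawler might remember which URLs it has already seen. It is not taken from the gotoint source: the `VisitedFilter` type, its methods, and the sizing parameters are hypothetical, and it uses only the standard library (double hashing on top of `DefaultHasher`) rather than any particular crate.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical visited-URL set backed by a Bloom filter.
/// It may report false positives (an unseen URL reported as seen),
/// but never false negatives.
pub struct VisitedFilter {
    bits: Vec<u64>,  // bit array packed into 64-bit words
    num_bits: u64,   // total number of bits in the filter
    num_hashes: u64, // number of bit positions set per URL
}

impl VisitedFilter {
    pub fn new(num_bits: u64, num_hashes: u64) -> Self {
        assert!(num_bits > 0 && num_hashes > 0);
        let words = ((num_bits + 63) / 64) as usize;
        VisitedFilter { bits: vec![0; words], num_bits, num_hashes }
    }

    /// Derives two base hashes; the i-th position is h1 + i * h2.
    fn hash_pair(url: &str) -> (u64, u64) {
        let mut first = DefaultHasher::new();
        url.hash(&mut first);
        let mut second = DefaultHasher::new();
        (url, 0x9e37_79b9_7f4a_7c15u64).hash(&mut second);
        // Force the step to be odd so it is never zero.
        (first.finish(), second.finish() | 1)
    }

    fn index(&self, h1: u64, h2: u64, i: u64) -> u64 {
        h1.wrapping_add(i.wrapping_mul(h2)) % self.num_bits
    }

    /// Marks the URL as visited. Returns true if at least one bit was
    /// newly set, i.e. the URL was definitely not seen before.
    pub fn insert(&mut self, url: &str) -> bool {
        let (h1, h2) = Self::hash_pair(url);
        let mut newly_set = false;
        for i in 0..self.num_hashes {
            let bit = self.index(h1, h2, i);
            let word = (bit / 64) as usize;
            let mask = 1u64 << (bit % 64);
            if self.bits[word] & mask == 0 {
                self.bits[word] |= mask;
                newly_set = true;
            }
        }
        newly_set
    }

    /// Returns true if the URL was probably visited, false if it
    /// definitely was not.
    pub fn contains(&self, url: &str) -> bool {
        let (h1, h2) = Self::hash_pair(url);
        (0..self.num_hashes).all(|i| {
            let bit = self.index(h1, h2, i);
            self.bits[(bit / 64) as usize] & (1u64 << (bit % 64)) != 0
        })
    }
}
```

A crawler loop could call `insert` on each normalized URL and fetch the page only when it returns true. The trade-off is that a false positive makes the crawler skip a page it never fetched, so the bit count and hash count would need tuning to the expected number of URLs.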

Check out

Deploy for development

Crawl pages.

  1. docker-compose -f deploy/crawler.dev.yml up

Build inverted index.

  1. docker-compose -f deploy/index.dev.yml up
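
For readers unfamiliar with the term, the sketch below shows what an inverted index is in its simplest form: a map from each term to the set of documents that contain it. It is an illustration only, not the structure built by `deploy/index.dev.yml`; the `InvertedIndex` type, its tokenizer, and the AND-only query semantics are assumptions made for the example.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical in-memory inverted index: term -> sorted set of document ids.
#[derive(Default)]
pub struct InvertedIndex {
    postings: BTreeMap<String, BTreeSet<u32>>,
}

impl InvertedIndex {
    /// Splits text on non-alphanumeric characters and lowercases each term.
    fn tokenize(text: &str) -> impl Iterator<Item = String> + '_ {
        text.split(|c: char| !c.is_alphanumeric())
            .filter(|t| !t.is_empty())
            .map(|t| t.to_lowercase())
    }

    /// Adds every term of the document to the postings lists.
    pub fn add_document(&mut self, doc_id: u32, text: &str) {
        for term in Self::tokenize(text) {
            self.postings.entry(term).or_default().insert(doc_id);
        }
    }

    /// Returns the ids of documents containing all query terms (AND query),
    /// in ascending order. An empty query returns no documents.
    pub fn search(&self, query: &str) -> Vec<u32> {
        let mut result: Option<BTreeSet<u32>> = None;
        for term in Self::tokenize(query) {
            let docs = match self.postings.get(&term) {
                Some(docs) => docs,
                None => return Vec::new(), // a missing term empties the result
            };
            result = Some(match result {
                None => docs.clone(),
                Some(mut acc) => {
                    // Keep only documents that also contain this term.
                    acc.retain(|id| docs.contains(id));
                    acc
                }
            });
        }
        result.map(|set| set.into_iter().collect()).unwrap_or_default()
    }
}
```

A real index would also store term frequencies or positions for ranking and would persist to disk; this version keeps only document ids to show the core lookup.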

Start web server.

  1. docker-compose -f deploy/dev.yml up

References

[1]
Web Crawling
http://infolab.stanford.edu/~olston/publications/crawling_survey.pdf

[2]
Introduction to Information Retrieval
https://nlp.stanford.edu/IR-book/