Project author: alexgrigoras

Project description: Web Crawler developed in Python

Language: Python
Project URL: git://github.com/alexgrigoras/web_crawler.git
Created: 2020-02-27T07:42:34Z
Project community: https://github.com/alexgrigoras/web_crawler

License: MIT License

Map-Reduce Web Crawler

Description

A MapReduce web crawler application developed for the Big Data Techniques course.

Architecture

The main components of the application are:

  1. Information Gathering - web crawler

    1. Web Crawler (crawler.py) -> takes a queue of URLs as input and crawls the websites, storing the HTML resources and parsing all the links found in the pages
    2. Robot Parser (robot_parser.py) -> checks whether the robots exclusion protocol (robots.txt) allows crawling a given page
  2. MapReduce - parallel application using MPI

    1. Master -> sends the links to the workers to be processed in two phases: map and reduce
    2. Worker -> processes the links and stores the data to the file system
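The crawling side described above (link parsing in crawler.py plus the robots.txt check in robot_parser.py) can be sketched with the standard library alone. This is a minimal, hypothetical re-implementation for illustration; the helper names `extract_links` and `allowed` are assumptions, not functions from the repository:

```python
from html.parser import HTMLParser
from urllib import robotparser

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag, mirroring the
    link-parsing step described for crawler.py."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html):
    # Parse an HTML page and return all links found in it
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links

def allowed(robots_lines, url, agent="*"):
    # robots.txt check, the role robot_parser.py plays
    rp = robotparser.RobotFileParser()
    rp.parse(robots_lines)
    return rp.can_fetch(agent, url)

html = '<a href="/page1">one</a> <a href="/private/x">two</a>'
rules = ["User-agent: *", "Disallow: /private/"]
links = extract_links(html)
print(links)  # ['/page1', '/private/x']
print([l for l in links if allowed(rules, "https://example.com" + l)])  # ['/page1']
```

Disallowed links are filtered out before they would be enqueued for crawling.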

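The master/worker pattern above can be sketched as a word-count over crawled pages. For a self-contained illustration this uses Python's `multiprocessing.Pool` in place of MPI, and the function names and sample pages are assumptions, not code from master_worker.py:

```python
from collections import Counter
from multiprocessing import Pool

def map_phase(page):
    # Map: count the words appearing on one crawled page
    url, text = page
    return Counter(text.split())

def reduce_phase(counters):
    # Reduce: merge the partial counts produced by the workers
    total = Counter()
    for c in counters:
        total.update(c)
    return total

if __name__ == "__main__":
    pages = [
        ("http://a", "big data techniques"),
        ("http://b", "web crawler big data"),
    ]
    # The master distributes pages to the worker pool (map),
    # then merges the partial results (reduce)
    with Pool(2) as pool:
        partials = pool.map(map_phase, pages)
    totals = reduce_phase(partials)
    print(totals["data"])  # 2
```

With real MPI (e.g. mpi4py), the master rank would scatter links to worker ranks and gather their partial results instead of using a process pool.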
Application structure

  map-reduce-crawler
  ├── application
  │   ├── files
  │   ├── output
  │   ├── modules
  │   │   ├── __init__.py
  │   │   ├── crawler.py
  │   │   ├── map_reduce.py
  │   │   ├── master_worker.py
  │   │   └── robot_parser.py
  │   ├── __init__.py
  │   └── __main__.py
  ├── README.md
  ├── requirements.txt
  └── setup.py

Execution

It is done in the following steps:

  1. Cloning the repository: git clone https://github.com/alexgrigoras/web_crawler.git
  2. Selecting the application folder: cd web_crawler/
  3. Creating a virtual environment: virtualenv ENVIRONMENT_NAME
  4. Activating the virtual environment: source ENVIRONMENT_NAME/bin/activate
  5. Installing: python setup.py install
  6. Running:
    1. Crawler + MapReduce: python -m application
    2. (Optional) MapReduce only: mpiexec -np NUMBER_OF_PROCESSES python application/modules/master_worker.py

License

The application is licensed under the MIT License.