Project author: alexgrigoras

Project description: Web Crawler developed in Python

Language: Python
Project URL: git://github.com/alexgrigoras/web_crawler.git
Created: 2020-02-27T07:42:34Z
Project community: https://github.com/alexgrigoras/web_crawler

License: MIT License

Map-Reduce Web Crawler

Description

A MapReduce web crawler application developed for the Big Data Techniques course.

Architecture

The main components of the application are:

  1. Information Gathering - web crawler

    1. Web Crawler (crawler.py) -> takes a queue of URLs as input and crawls the websites, storing the HTML resources and parsing all the links found in the pages
    2. Robot Parser (robot_parser.py) -> checks whether the robots exclusion protocol (robots.txt) allows crawling a given page
  2. MapReduce - parallel application using MPI

    1. Master -> sends the links to the workers to be processed in two phases: map and reduce
    2. Worker -> processes the links and stores the data to the file system
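The crawling side described above (link parsing in crawler.py plus the robots.txt check in robot_parser.py) can be sketched with the standard library alone. This is a minimal, hypothetical re-implementation for illustration; the helper names `extract_links` and `allowed` are assumptions, not functions from the repository:

```python
from html.parser import HTMLParser
from urllib import robotparser

class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag, mirroring the
    link-parsing step described for crawler.py."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def extract_links(html):
    # Parse an HTML page and return all links found in it
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links

def allowed(robots_lines, url, agent="*"):
    # robots.txt check, the role robot_parser.py plays
    rp = robotparser.RobotFileParser()
    rp.parse(robots_lines)
    return rp.can_fetch(agent, url)

html = '<a href="/page1">one</a> <a href="/private/x">two</a>'
rules = ["User-agent: *", "Disallow: /private/"]
links = extract_links(html)
print(links)  # ['/page1', '/private/x']
print([l for l in links if allowed(rules, "https://example.com" + l)])  # ['/page1']
```

Disallowed links are filtered out before they would be enqueued for crawling.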

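The master/worker pattern above can be sketched as a word-count over crawled pages. For a self-contained illustration this uses Python's `multiprocessing.Pool` in place of MPI, and the function names and sample pages are assumptions, not code from master_worker.py:

```python
from collections import Counter
from multiprocessing import Pool

def map_phase(page):
    # Map: count the words appearing on one crawled page
    url, text = page
    return Counter(text.split())

def reduce_phase(counters):
    # Reduce: merge the partial counts produced by the workers
    total = Counter()
    for c in counters:
        total.update(c)
    return total

if __name__ == "__main__":
    pages = [
        ("http://a", "big data techniques"),
        ("http://b", "web crawler big data"),
    ]
    # The master distributes pages to the worker pool (map),
    # then merges the partial results (reduce)
    with Pool(2) as pool:
        partials = pool.map(map_phase, pages)
    totals = reduce_phase(partials)
    print(totals["data"])  # 2
```

With real MPI (e.g. mpi4py), the master rank would scatter links to worker ranks and gather their partial results instead of using a process pool.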
Application structure

  map-reduce-crawler
  ├── application
  │   ├── files
  │   ├── output
  │   ├── modules
  │   │   ├── __init__.py
  │   │   ├── crawler.py
  │   │   ├── map_reduce.py
  │   │   ├── master_worker.py
  │   │   └── robot_parser.py
  │   ├── __init__.py
  │   └── __main__.py
  ├── README.md
  ├── requirements.txt
  └── setup.py

Execution

It is done in the following steps:

  1. Cloning the repository: git clone https://github.com/alexgrigoras/web_crawler.git
  2. Selecting the application folder: cd web_crawler/
  3. Creating a virtual environment: virtualenv ENVIRONMENT_NAME
  4. Activating the virtual environment: source ENVIRONMENT_NAME/bin/activate
  5. Installing: python setup.py install
  6. Running:
    1. Crawler + MapReduce: python -m application
    2. (Optional) MapReduce only: mpiexec -np NUMBER_OF_PROCESSES python application/modules/master_worker.py

License

The application is licensed under the MIT License.