Web Crawler developed in Python
MapReduce web crawler application developed for the Big Data Techniques course.
The main components of the application are:

- Information gathering: the web crawler (an illustrative sketch follows this list)
- MapReduce: a parallel application using MPI (sketched after the project structure below)
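The crawler itself lives in `application/modules/crawler.py` and `robot_parser.py` and is not reproduced here. The following is only a minimal sketch of the idea; the `requests` and `beautifulsoup4` packages, the function names, and the robots.txt handling are assumptions made for illustration, not the modules' actual interface:

```python
# Illustrative sketch only -- not the code from crawler.py / robot_parser.py.
# Assumes the third-party packages requests and beautifulsoup4 are installed.
import urllib.robotparser
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup


def allowed_by_robots(url, user_agent="*"):
    # Check the site's robots.txt before fetching, in the spirit of robot_parser.py.
    parser = urllib.robotparser.RobotFileParser()
    parser.set_url(urljoin(url, "/robots.txt"))
    parser.read()
    return parser.can_fetch(user_agent, url)


def crawl_page(url):
    # Fetch one page, return its visible text and the outgoing links.
    if not allowed_by_robots(url):
        return "", []
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")
    text = soup.get_text(separator=" ")
    links = [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]
    return text, links
```

The gathered text would then end up under `application/files/` for the MapReduce phase to consume; the directory names come from the project structure below, while the exact data flow is an assumption.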
The project is structured as follows:

```
map-reduce-crawler
├── application
│   ├── files
│   ├── output
│   ├── modules
│   │   ├── __init__.py
│   │   ├── crawler.py
│   │   ├── map_reduce.py
│   │   ├── master_worker.py
│   │   └── robot_parser.py
│   ├── __init__.py
│   └── __main__.py
├── README.md
├── requirements.txt
└── setup.py
```
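The MapReduce phase (`map_reduce.py` and `master_worker.py`) follows an MPI master/worker pattern, judging by the run command below. The sketch here only illustrates that pattern, assuming `mpi4py` as the MPI binding and a simple word count as the map/reduce job; the file paths, message tags, and counting logic are placeholders, not the project's actual implementation:

```python
# Illustrative master/worker sketch, assuming mpi4py; not the code from master_worker.py.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

if rank == 0:
    # Master: hand one (hypothetical) crawled file to each worker.
    chunks = ["application/files/page_{}.txt".format(i) for i in range(1, size)]
    for worker, path in zip(range(1, size), chunks):
        comm.send(path, dest=worker, tag=0)

    # Reduce: merge the partial word counts returned by the workers.
    totals = {}
    for worker in range(1, size):
        partial = comm.recv(source=worker, tag=1)
        for word, count in partial.items():
            totals[word] = totals.get(word, 0) + count
    print(sorted(totals.items(), key=lambda item: -item[1])[:10])
else:
    # Map: each worker counts the words in the file it was given.
    path = comm.recv(source=0, tag=0)
    counts = {}
    with open(path, encoding="utf-8") as handle:
        for word in handle.read().split():
            counts[word] = counts.get(word, 0) + 1
    comm.send(counts, dest=0, tag=1)
```

The word count is only the textbook example of a map/reduce job; whatever the project's real map and reduce functions compute is defined in `map_reduce.py`.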
Installation:

```
git clone https://github.com/grigoras.alexandru/web-crawler.git
cd web-crawler/
virtualenv ENVIRONMENT_NAME
source ENVIRONMENT_NAME/bin/activate
python setup.py install
```

Running the application is done in two phases.

Phase 1, information gathering - run the crawler:

```
python -m application
```

Phase 2, MapReduce - run the parallel MPI application:

```
mpiexec -np NUMBER_OF_PROCESSES python application/modules/master_worker.py
```
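For example, `mpiexec -np 4 python application/modules/master_worker.py` starts the MapReduce phase with four MPI processes; under the master/worker split sketched above that would mean one master and three workers, though the exact division of roles depends on `master_worker.py`.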
The application is licensed under the MIT License.