A minimal example of crawling any kind of web page, including a dynamic one-page app.
You’ll need to have Python 3 installed on your system. Be sure to check “Add Python environment variables” if asked during installation.
Now either download this project or clone it with git:
git clone https://github.com/gittyeric/one-page-app-web-crawler.git
Download the Selenium web driver for Chrome (you can use Firefox too, but you’ll have to change the code a bit). The driver lets you control your browser from Python code.
Drop the downloaded web driver file into the root of this project.
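If you want to confirm the driver landed in the right place, here is a small optional sketch. It is not part of this project; the `find_driver` helper is hypothetical, and it assumes the file kept its usual name of `chromedriver` (macOS/Linux) or `chromedriver.exe` (Windows). Run it from the project root.

```python
from pathlib import Path

# Assumed file names: the Chrome driver normally ships as "chromedriver"
# (macOS/Linux) or "chromedriver.exe" (Windows). Adjust if yours differs.
DRIVER_NAMES = ("chromedriver", "chromedriver.exe")


def find_driver(project_root="."):
    """Return the path of the first driver file found in project_root, or None."""
    root = Path(project_root)
    for name in DRIVER_NAMES:
        candidate = root / name
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    driver = find_driver()
    if driver:
        print(f"Found driver: {driver}")
    else:
        print("No driver file found in this folder.")
```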
From the command line, install the Selenium library for Python:
pip install selenium
Now run the Python script from the command line:
cd path/to/project
python clark_crawl.py
This is a simple example written for the Clark County PD. It runs through an inmate database, pulls out all the records page by page, and prints them at the end. You can change the print statement in clark_crawl.py to save the result however you’d like, and you can follow the code in crawler.py to see how the crawl is done.
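As a rough illustration of the page-by-page idea and of saving instead of printing, here is a minimal sketch. It is not the code in crawler.py or clark_crawl.py: the names `collect_all_pages` and `save_records_csv` are hypothetical, and it assumes each record is a plain dict of field name to value, which may not match what the real script produces.

```python
import csv


def collect_all_pages(read_page, go_to_next_page, max_pages=1000):
    """Gather records page by page until there is no next page.

    read_page() returns the list of records on the current page.
    go_to_next_page() advances and returns True, or returns False on the last page.
    """
    records = []
    for _ in range(max_pages):  # hard cap so a broken "next" step cannot loop forever
        records.extend(read_page())
        if not go_to_next_page():
            break
    return records


def save_records_csv(records, path):
    """Write a list of dicts to a CSV file, one column per distinct key."""
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(records)  # missing keys are written as empty cells


if __name__ == "__main__":
    # Stand-in data for two pages; a real crawl would read these from the browser.
    pages = [[{"name": "A", "id": "1"}], [{"name": "B", "id": "2"}]]
    position = {"index": 0}

    def read_page():
        return pages[position["index"]]

    def go_to_next_page():
        if position["index"] + 1 < len(pages):
            position["index"] += 1
            return True
        return False

    save_records_csv(collect_all_pages(read_page, go_to_next_page), "records.csv")
```

In a real crawl, the two callbacks would wrap browser calls made through Selenium, such as reading the rows of the current results table and clicking a "next" control. Keeping them as plain functions means the loop and the CSV step can be tried without a browser.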