Python web-scraper for online novels
This repo scrapes fiction from certain websites and compiles it into EPUB files, using Selenium for scraping and Calibre for EPUB packing.
Selenium
sudo -H pip install selenium
Gecko Driver
Firefox
wget https://github.com/mozilla/geckodriver/releases/download/v0.19.0/geckodriver-v0.19.0-linux64.tar.gz
tar -xzvf geckodriver-v0.19.0-linux64.tar.gz
rm -rf geckodriver-v0.19.0-linux64.tar.gz
sudo ln -sf "$(pwd)/geckodriver" /usr/bin/geckodriver
PhantomJS Driver
sudo apt-get update
sudo apt-get install build-essential chrpath libssl-dev libxft-dev
sudo apt-get install libfreetype6 libfreetype6-dev
sudo apt-get install libfontconfig1 libfontconfig1-dev
export PHANTOM_JS="phantomjs-2.5.0-beta-linux-ubuntu-xenial-x86_64"
wget https://bitbucket.org/ariya/phantomjs/downloads/$PHANTOM_JS.tar.gz
sudo tar xvzf $PHANTOM_JS.tar.gz
rm -rf $PHANTOM_JS.tar.gz
sudo mv $PHANTOM_JS /usr/local/share
sudo ln -sf /usr/local/share/$PHANTOM_JS/bin/phantomjs /usr/local/bin
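After installing the drivers, it can help to confirm that Selenium will be able to find them. The following is a minimal sketch, not code from this repo: `missing_drivers` and `fetch_page_source` are hypothetical helper names, and the sketch assumes the driver binaries are expected on `PATH` under the names `geckodriver` and `phantomjs`.

```python
import shutil


def missing_drivers(names=("geckodriver", "phantomjs")):
    """Return the driver binaries from `names` that are not on PATH."""
    return [name for name in names if shutil.which(name) is None]


def fetch_page_source(url):
    """Load `url` in Firefox through geckodriver and return the rendered HTML.

    Selenium is imported lazily so that the PATH check above still works
    on a machine where Selenium is not installed yet.
    """
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always close the browser, even if the page load fails.
        driver.quit()
```

Calling `missing_drivers()` before a long scrape gives an early, readable failure instead of a Selenium stack trace partway through.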
Calibre
sudo -v && wget -nv -O- https://download.calibre-ebook.com/linux-installer.py | sudo python -c "import sys; main=lambda:sys.stderr.write('Download failed\n'); exec(sys.stdin.read()); main()"
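Once Calibre is installed, its `ebook-convert` command can turn an HTML file into an EPUB. The sketch below shows one way the packing step might look; it is an illustration under assumptions, not this repo's actual code. The function names, the `(chapter_title, text)` tuple layout, and the one-HTML-file-per-book approach are all hypothetical.

```python
import html


def build_book_html(title, chapters):
    """Combine (chapter_title, text) pairs into a single HTML document.

    Paragraphs in `text` are assumed to be separated by blank lines.
    """
    parts = [f"<html><head><title>{html.escape(title)}</title></head><body>"]
    for chapter_title, text in chapters:
        parts.append(f"<h1>{html.escape(chapter_title)}</h1>")
        for paragraph in text.split("\n\n"):
            paragraph = paragraph.strip()
            if paragraph:
                parts.append(f"<p>{html.escape(paragraph)}</p>")
    parts.append("</body></html>")
    return "\n".join(parts)


def build_convert_command(html_path, epub_path, title):
    """Build the argument list for Calibre's ebook-convert CLI."""
    return ["ebook-convert", str(html_path), str(epub_path), "--title", title]
```

To run the conversion, write the HTML to disk and pass the command list to `subprocess.run(cmd, check=True)`.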
Spark
sudo apt-get update
sudo apt-get install default-jdk
sudo -H pip install pyspark
Regular Version
./driver.py urlOfFirstChapter fictionName
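The argument name `urlOfFirstChapter` suggests that the regular version starts at the first chapter and walks forward. The internals of `driver.py` are not shown here, so the following is only a sketch of such a loop under that assumption. `crawl_chapters` and the `fetch` callback (which returns the chapter text and the URL of the next chapter, or `None` at the end) are hypothetical names.

```python
def crawl_chapters(first_url, fetch, max_chapters=5000):
    """Follow "next chapter" links starting at `first_url`.

    `fetch(url)` must return a `(text, next_url)` pair, with `next_url`
    set to None on the last chapter. Returns a list of `(url, text)` pairs.
    """
    chapters = []
    seen = set()
    url = first_url
    # Stop on the last chapter, on a link loop, or at the safety limit.
    while url and url not in seen and len(chapters) < max_chapters:
        seen.add(url)
        text, next_url = fetch(url)
        chapters.append((url, text))
        url = next_url
    return chapters
```

Passing `fetch` in as a parameter keeps the loop independent of the browser driver, so it can be exercised with a dictionary of fake pages before pointing it at a real site.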
Spark Version
spark-submit spark_driver.py urlOfLatestChapter fictionName
On a 4-core computer, the Spark version took about 40% less time than the regular version on a 50-chapter test and about 60% less time on 1300 chapters. Increase or decrease the number of partitions to match the number of cores available.
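One way to pick a partition count is to scale it with the number of cores. The sketch below is an illustration, not the logic in `spark_driver.py`: the function names are hypothetical, and the default of two partitions per core is an assumed rule of thumb that should be tuned per machine.

```python
import os


def suggest_partitions(num_chapters, cores=None, partitions_per_core=2):
    """Suggest a partition count for `num_chapters` chapter URLs.

    Never suggests more partitions than chapters, and always at least one.
    """
    if cores is None:
        cores = os.cpu_count() or 1
    target = cores * partitions_per_core
    return max(1, min(num_chapters, target))


def split_into_partitions(items, num_partitions):
    """Distribute `items` round-robin across `num_partitions` lists."""
    return [items[i::num_partitions] for i in range(num_partitions)]
```

With PySpark, the suggested value could be passed as the `numSlices` argument of `SparkContext.parallelize`; `split_into_partitions` only illustrates how the work would be divided.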
Splitting epubs into books based on urls
Handling chapter titles
Take out calibre and spark