Project author: xRxGx

Project description:
Python web-scraper for online novels
Language: Python
Project URL: git://github.com/xRxGx/novels.git
Created: 2017-10-15T14:42:30Z
Project community: https://github.com/xRxGx/novels

License: MIT License


novels

This repo scrapes fiction from certain websites and compiles it into EPUB format, using Selenium for scraping and Calibre for EPUB packing.
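The packing half of that pipeline can be sketched as a thin wrapper around Calibre's `ebook-convert` command-line tool. This is a minimal illustration only, not the repo's actual code: the function names `build_ebook_convert_cmd` and `pack_epub` are hypothetical, and it assumes the scraped chapters have already been written to a single HTML file.

```python
import subprocess


def build_ebook_convert_cmd(html_path, epub_path, title):
    """Return the argument list for Calibre's ebook-convert CLI.

    The output format is inferred by ebook-convert from the .epub extension.
    """
    return ["ebook-convert", html_path, epub_path, "--title", title]


def pack_epub(html_path, epub_path, title):
    """Run ebook-convert; raises CalledProcessError if the conversion fails.

    Assumes Calibre is installed and ebook-convert is on the PATH.
    """
    subprocess.run(build_ebook_convert_cmd(html_path, epub_path, title), check=True)
```

Keeping the command construction separate from the call that runs it makes the command easy to inspect or log before anything is executed.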

Requirements

  1. Selenium

    1. sudo -H pip install selenium
  2. Gecko Driver
    Firefox

    1. wget https://github.com/mozilla/geckodriver/releases/download/v0.19.0/geckodriver-v0.19.0-linux64.tar.gz
    2. tar -xzvf geckodriver-v0.19.0-linux64.tar.gz
    3. rm -rf geckodriver-v0.19.0-linux64.tar.gz
    4. sudo ln -sf "$PWD/geckodriver" /usr/bin/
  3. PhantomJS Driver

    1. sudo apt-get update
    2. sudo apt-get install build-essential chrpath libssl-dev libxft-dev
    3. sudo apt-get install libfreetype6 libfreetype6-dev
    4. sudo apt-get install libfontconfig1 libfontconfig1-dev
    5. export PHANTOM_JS="phantomjs-2.5.0-beta-linux-ubuntu-xenial-x86_64"
    6. wget https://bitbucket.org/ariya/phantomjs/downloads/$PHANTOM_JS.tar.gz
    7. sudo tar xvzf $PHANTOM_JS.tar.gz
    8. rm -rf $PHANTOM_JS.tar.gz
    9. sudo mv $PHANTOM_JS /usr/local/share
    10. sudo ln -sf /usr/local/share/$PHANTOM_JS/bin/phantomjs /usr/local/bin
  4. Calibre

    1. sudo -v && wget -nv -O- https://download.calibre-ebook.com/linux-installer.py | sudo python -c "import sys; main=lambda:sys.stderr.write('Download failed\n'); exec(sys.stdin.read()); main()"
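After the installation steps above, a quick check that the executables are reachable can save a failed run. The sketch below uses only the standard library. The tool names are the binaries those steps are expected to put on the PATH, and `find_missing_tools` is a hypothetical helper, not part of the repo.

```python
import shutil

# Executables the install steps are expected to place on the PATH (assumption).
REQUIRED_TOOLS = ("geckodriver", "phantomjs", "ebook-convert")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of the tools that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    print("missing:", ", ".join(missing) if missing else "none")
```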

Requirements for Spark

  1. Java

    1. sudo apt-get update
    2. sudo apt-get install default-jdk
  2. Spark

    1. sudo -H pip install pyspark

How to run

Regular Version

  1. ./driver.py urlOfFirstChapter fictionName

Spark Version

  1. spark-submit spark_driver.py urlOfLatestChapter fictionName

Spark vs Regular

On a 4-core computer, the Spark version took 40% less time than the regular version in a 50-chapter test and 60% less time over 1300 chapters. Note that the number of partitions should be increased or decreased to match the number of cores available.
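One way to tie the partition count to the core count is sketched below. It is an assumption-laden illustration rather than the logic in `spark_driver.py`: `suggested_partitions` and `scrape_with_spark` are hypothetical names, `scrape_chapter` stands for any function that takes a chapter URL, and the default of two partitions per core is only a common rule of thumb.

```python
import os


def suggested_partitions(num_chapters, cores=None, per_core=2):
    """Pick a partition count: a few per core, never more than the chapter count."""
    cores = cores or os.cpu_count() or 1
    return max(1, min(num_chapters, cores * per_core))


def scrape_with_spark(chapter_urls, scrape_chapter):
    """Spread chapter_urls over partitions and apply scrape_chapter to each URL."""
    # Imported lazily so suggested_partitions works without pyspark installed.
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    n = suggested_partitions(len(chapter_urls))
    return sc.parallelize(chapter_urls, numSlices=n).map(scrape_chapter).collect()
```

With 4 cores and the default `per_core`, a 50-chapter run would use 8 partitions, while a 3-chapter run would use only 3.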

Desired Features

  1. Splitting epubs into books based on urls

  2. Handling chapter titles

  3. Remove the Calibre and Spark dependencies

Supported Sites

  • www.wuxiaworld.com
  • www.royalroadl.com
  • www.gravitytales.com
  • www.lightnovelbastion.com
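A chapter URL can be checked against the list above before any browser is started. The snippet is a hypothetical sketch: `is_supported` is not a function from the repo, and it matches on the exact hostnames listed, so other subdomains would be rejected.

```python
from urllib.parse import urlparse

# Hostnames copied from the supported-sites list above.
SUPPORTED_SITES = {
    "www.wuxiaworld.com",
    "www.royalroadl.com",
    "www.gravitytales.com",
    "www.lightnovelbastion.com",
}


def is_supported(url):
    """Return True if the URL's hostname is one of the supported sites."""
    host = urlparse(url).hostname or ""
    return host in SUPPORTED_SITES
```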

Planned Sites

  • www.bluesilvertranslations.com