Project author: louisguitton

Project description:
Crawl DISQUS comments from a blog into a local MongoDB database
Primary language: Python
Repository URL: git://github.com/louisguitton/disqus-crawler.git
Created: 2016-01-29T17:40:49Z
Project community: https://github.com/louisguitton/disqus-crawler

License: MIT License


disqus-crawler

Crawl DISQUS comments from a blog into a local MongoDB database

Installation

  • Clone the GitHub repository, cd into it, then create a virtual environment and install the dependencies
      git clone git@github.com:louisguitton/disqus-crawler.git
      cd disqus-crawler
      python3 -m venv venv
      source venv/bin/activate
      pip install --upgrade -r requirements.txt

Usage example

  • Open main.sh and change the URL to the blog page you want to crawl
  • Make sure a mongod instance is running on your computer (see the MongoDB installation instructions)
      mongod --config /usr/local/etc/mongod.conf
  • Make sure a Splash instance is running in a local Docker container
      $ docker run -p 8050:8050 scrapinghub/splash
      2019-10-10 12:03:39.116598 [-] Server listening on http://0.0.0.0:8050
  • Run the main.sh script
      $ sh main.sh
      CRAWLING ... http://www.purseblog.com/louis-vuitton/louis-vuitton-spring-2016-bag-ad-campaign/
      2019-10-10 14:07:28 [scrapy.utils.log] INFO: Scrapy 1.7.3 started (bot: purseblog)
      ...
  • Inspect the crawled comments in the mongo shell
      mongo
      use disqus
      db.comments.count()
      db.comments.find().pretty().limit(2)
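
The same inspection can also be done from Python with pymongo. The following is a minimal sketch, not code from the repository: the helper names `preview_comments` and `main` are hypothetical, and the connection URI assumes a mongod listening on the default localhost:27017. The database name `disqus` and the collection name `comments` are taken from the shell session above.

```python
def preview_comments(collection, limit=2):
    """Return (total count, first `limit` documents) of a comments collection."""
    # count_documents({}) counts every document, like db.comments.count()
    total = collection.count_documents({})
    # find().limit(n) mirrors db.comments.find().limit(n) in the mongo shell
    sample = list(collection.find().limit(limit))
    return total, sample


def main():
    """Connect to the local MongoDB and print a short summary (hypothetical helper)."""
    import pymongo  # third-party client; imported lazily, needs a running mongod

    # Assumption: mongod uses the default host and port.
    client = pymongo.MongoClient("mongodb://localhost:27017")
    try:
        total, sample = preview_comments(client["disqus"]["comments"])
    finally:
        client.close()
    print(f"{total} comments stored")
    for doc in sample:
        print(doc)
```

Calling `main()` after a crawl, with mongod still running, should print the number of stored comments followed by two of them.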

Meta

I wrote this project in 2016 for my master's thesis on Paid/Owned/Earned Media and on measuring brands on social channels and blogs.

For the crawling, this project uses Scrapy.
It stores the comments in a MongoDB database, using the pymongo client.
A good tutorial to follow is this one.
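
As an illustration of how Scrapy and pymongo typically fit together, here is a minimal sketch of a Scrapy item pipeline that writes each scraped item to MongoDB. It is not the repository's actual code: the class name `MongoCommentsPipeline` and the default URI are assumptions, while the `disqus` database and `comments` collection match the names used in the usage example.

```python
class MongoCommentsPipeline:
    """Scrapy item pipeline that stores every scraped comment in MongoDB."""

    def __init__(self, mongo_uri="mongodb://localhost:27017",
                 db_name="disqus", collection_name="comments"):
        # Assumption: a local mongod on the default port.
        self.mongo_uri = mongo_uri
        self.db_name = db_name
        self.collection_name = collection_name
        self.client = None
        self.collection = None

    def open_spider(self, spider):
        # Scrapy calls this once when the spider starts.
        import pymongo  # imported lazily so the class loads without a database
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.collection = self.client[self.db_name][self.collection_name]

    def close_spider(self, spider):
        # Scrapy calls this once when the spider finishes.
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        # Scrapy calls this for every item; store a dict copy, then pass it on.
        self.collection.insert_one(dict(item))
        return item
```

A pipeline like this is enabled through the `ITEM_PIPELINES` setting in the Scrapy project's settings.py, where the dotted path to the class depends on how the project is laid out.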

When scraping the web, two kinds of problems arise:

  • the target page is slow to render because it relies heavily on JavaScript
  • the target page renders quickly, but the content you are interested in disappears once the page has finished rendering

To overcome these situations, one can run a lightweight web browser on the local machine
that renders pages on demand.
This project uses Splash, running in a local Docker container.
A good tutorial to follow is this one.
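
To make the role of Splash concrete, the sketch below asks a local Splash instance for the rendered HTML of a page through its `render.html` HTTP endpoint, using only the Python standard library. It is an illustration under stated assumptions, not necessarily how this project talks to Splash: the function names are hypothetical, the base URL assumes the `-p 8050:8050` port mapping from the usage example, and the `wait` value is an arbitrary choice that gives JavaScript time to run.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: Splash is reachable on the port published by `docker run -p 8050:8050`.
SPLASH_BASE = "http://localhost:8050"


def splash_render_url(page_url, wait=2.0, base=SPLASH_BASE):
    """Build the render.html URL that makes Splash render `page_url`."""
    # `url` is the page to render; `wait` is the number of seconds Splash
    # waits after the page loads before returning the HTML.
    query = urlencode({"url": page_url, "wait": wait})
    return f"{base}/render.html?{query}"


def fetch_rendered_html(page_url, wait=2.0, timeout=60):
    """Return the HTML of `page_url` after Splash has executed its JavaScript."""
    with urlopen(splash_render_url(page_url, wait), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

With the Splash container running, `fetch_rendered_html(...)` returns the page as a browser would see it after the wait, which is the HTML a crawler can then parse for comments.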