# ptt-crawler with scrapy-redis framework in Python
In this project we collect data from the PTT website. We adopt the Scrapy framework, written in Python, and use MongoDB as our storage. However, a plain Scrapy crawler only runs on a single machine. To crawl more efficiently, scrapy-redis provides a distributed mechanism that lets us run spiders on several client machines. For deployment, we use scrapyd.
Full dependency installation on Ubuntu 16.04
In `settings.py`, we should define the MongoDB settings:
```python
## in settings.py
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'ptt-sandbox'
```
We also define the Redis settings there:

```python
## in settings.py
REDIS_HOST = 'localhost'
REDIS_PARAMS = {
    'password': 'yourpassword'
}
REDIS_PORT = 6379
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
```
The crawler is started with a date range:

```
scrapy crawl ptt -a start={m/d} -a end={m/d}
```

`-a` passes an argument to the spider. `{m/d}` means month/day, so 3/5 represents March 5th:

```
scrapy crawl ptt -a start=3/5 -a end=3/8
```
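For reference, Scrapy passes `-a` options to the spider's constructor as keyword arguments. The sketch below shows one way `start` and `end` could be received and parsed; the class name, helper logic, and the fixed year are illustrative assumptions, not the project's actual code.

```python
from datetime import datetime

import scrapy


class DateRangeSpiderSketch(scrapy.Spider):
    """Illustrative only: shows how -a start/end reach the spider."""
    name = 'date_range_sketch'

    def __init__(self, start=None, end=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # '-a start=3/5' arrives here as the string '3/5' (month/day);
        # the year is pinned purely for illustration.
        self.start_date = datetime.strptime(start, '%m/%d').replace(year=2020)
        self.end_date = datetime.strptime(end, '%m/%d').replace(year=2020)

    def parse(self, response):
        pass  # real parsing lives in the project's ptt.py
```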
Because the spider reads its start URLs from redis, open `redis-cli` and authenticate:

```
redis-cli
auth yourpassword
```

The password is defined in `settings.py` and can be modified directly. Then push a board's index page:

```
lpush ptt:start_urls https://www.ptt.cc/{board}/index.html
```

`{board}` can be Soft_Job, Gossiping, etc. After the URLs are pushed through `redis-cli`, run the spider:

```
scrapy crawl ptt -a start={date} -a end={date}
```
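The same push can be done from Python with the redis-py client instead of `redis-cli`; a small sketch, assuming the host, port, and password match the values in `settings.py`:

```python
import redis

# Connect with the credentials configured in settings.py.
r = redis.Redis(host='localhost', port=6379, password='yourpassword')

# Queue a board's index page for the spider to pick up.
board = 'Soft_Job'
r.lpush('ptt:start_urls', 'https://www.ptt.cc/{}/index.html'.format(board))
```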
There are three collections in MongoDB:

**Post**

| schema | Description |
|---|---|
| *canonicalUrl | URL of the crawled page |
| authorId | who posted the article |
| title | title of the article |
| content | content of the article |
| publishedTime | the date the post was created |
| updateTime | the date the post was updated |
| board | the PTT board the post belongs to |
**Author**

| schema | Description |
|---|---|
| *authorId | who posted the article |
| authorName | the author's nickname |
**Comment**

| schema | Description |
|---|---|
| commentId | who posted the comment |
| commentTime | when the comment was posted |
| commentContent | the content of the comment |
| board | the PTT board the comment belongs to |
Note: a * prefix in the schema column marks the primary key.
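The Scrapy items presumably mirror these schemas; `PostItem` and `AuthorItem` appear in the pipeline code below, while `CommentItem` is assumed here. A sketch of what `items.py` might contain, with field names taken from the tables above:

```python
import scrapy


class PostItem(scrapy.Item):
    canonicalUrl = scrapy.Field()   # *primary key
    authorId = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
    publishedTime = scrapy.Field()
    updateTime = scrapy.Field()
    board = scrapy.Field()


class AuthorItem(scrapy.Item):
    authorId = scrapy.Field()       # *primary key
    authorName = scrapy.Field()


class CommentItem(scrapy.Item):     # class name assumed for illustration
    commentId = scrapy.Field()
    commentTime = scrapy.Field()
    commentContent = scrapy.Field()
    board = scrapy.Field()
```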
Run the crawler:

```
scrapy crawl pttCrawl
```

The spider waits until start URLs appear in redis, so we open `redis-cli` and use `lpush` to attain this goal. The redis key is `pttCrawl:start_urls`; we push URLs to redis:
```
lpush pttCrawl:start_urls {ptt url}
```
Once the key holds URLs, every machine running `scrapy crawl pttCrawl` with the same redis configuration in `settings.py` consumes URLs from the shared queue.
In `settings.py`, we add a line that prevents duplicate requests from being crawled repeatedly:
```python
## in settings.py
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
```
In `settings.py`, we also add lines that keep track of the crawler's progress. Because pending requests remain in the redis queue after the crawling process stops, it is convenient to resume the crawl later:
```python
## in settings.py
# Enable scheduling storing requests queue in redis
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
# Start from the last endpoint
SCHEDULER_PERSIST = True
```
scrapyd provides a daemon for crawling. Like an HTTP server, we start it with the following command:
```
scrapyd
```
For deployment, we install the package `scrapyd-client` and run:
```
scrapyd-deploy pttCrawler
```
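Once deployed, a crawl can also be scheduled through scrapyd's JSON API. A sketch using `requests`, assuming scrapyd's default port 6800 and that the project and spider are named `pttCrawler` and `ptt` as above:

```python
import requests

# Ask the scrapyd daemon to schedule a crawl job.
resp = requests.post(
    'http://localhost:6800/schedule.json',
    data={'project': 'pttCrawler', 'spider': 'ptt'},
)
print(resp.json())  # e.g. {'status': 'ok', 'jobid': '...'}
```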
### DuplicatesPipeline
> In case of duplicates in the database, we filter the data here.
```python
def process_item(self, item, spider):
    if isinstance(item, PostItem):
        logging.debug("filter duplicated post items.")
        if item['canonicalUrl'] in self.post_set:
            raise DropItem("Duplicate post found:%s" % item)
        self.post_set.add(item['canonicalUrl'])
    elif isinstance(item, AuthorItem):
        logging.debug("filter duplicated author items.")
        if item['authorId'] in self.author_set:
            raise DropItem("Duplicate author found:%s" % item)
        self.author_set.add(item['authorId'])
    return item
```
### MongoPipeline
> Save the data in MongoDB.
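A minimal sketch of such a pipeline, wired to the `MONGO_URI` and `MONGO_DATABASE` settings shown earlier and assuming pymongo; the project's actual implementation may differ:

```python
import pymongo


class MongoPipeline(object):

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection settings defined in settings.py.
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Store each item in a collection named after its item class.
        self.db[type(item).__name__].insert_one(dict(item))
        return item
```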
### JsonPipeline
> Generate a JSON file.
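A sketch of a JSON-writing pipeline using Scrapy's built-in `JsonItemExporter`; the real pipeline may split the output differently (for example per board or per item type), and the output filename here is an assumption:

```python
from scrapy.exporters import JsonItemExporter


class JsonPipeline(object):

    def open_spider(self, spider):
        # 'items.json' is an illustrative filename.
        self.file = open('items.json', 'wb')
        self.exporter = JsonItemExporter(self.file)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
```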
## Security Methodology
To avoid getting banned, we adopt some tricks while we are crawling web pages.
1. **Download delays**
> We set `DOWNLOAD_DELAY` in `settings.py` to limit the download behavior.
```python
## in settings.py
DOWNLOAD_DELAY = 2
```
2. **Distributed downloader**
> scrapy-redis has already taken care of this for us.
3. **User Agent Pool**
> Randomly choose one user agent via a downloader middleware.
```python
## in middlewares.py
class RandomUserAgentMiddleware(object):

    def process_request(self, request, spider):
        agent = random.choice(list(UserAgentList))
        request.headers['User-Agent'] = agent
```
```python
## in settings.py
UserAgentList = [
    'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1500.55 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1309.0 Safari/537.17'
]

DOWNLOADER_MIDDLEWARES = {
    'pttCrawler.middlewares.RandomUserAgentMiddleware': 543,
}
```
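Since `UserAgentList` is defined in `settings.py`, the middleware needs a way to read it. One option (a sketch, not necessarily how the project wires it) is to pull the list from the crawler settings via `from_crawler` instead of importing it:

```python
## in middlewares.py
import random


class RandomUserAgentMiddleware(object):

    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # Read the pool defined as UserAgentList in settings.py.
        return cls(crawler.settings.get('UserAgentList', []))

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.user_agents)
```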
Note: we cannot disable cookies because we have to pass the ‘over18’ message to some ptt boards.
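For reference, the 'over18' confirmation can be sent as a cookie on the request itself; a minimal sketch (the board URL and spider name are illustrative, and the actual handling in `ptt.py` may differ):

```python
import scrapy


class Over18Sketch(scrapy.Spider):
    name = 'over18_sketch'

    def start_requests(self):
        # PTT's age gate is satisfied by carrying the over18 cookie.
        yield scrapy.Request(
            'https://www.ptt.cc/bbs/Gossiping/index.html',
            cookies={'over18': '1'},
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info('fetched %s', response.url)
```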
First, we need to install scrapydweb:

```
pip install scrapydweb
```

Then run it with the following command:

```
scrapydweb
```

We can then visit `localhost:5000` to monitor and track our crawler.
> Scrapyd: Scrapy comes with a built-in service, called "Scrapyd", which allows you to deploy your projects and control their spiders using a JSON web service.

> ScrapydWeb: A full-featured web UI for Scrapyd cluster management, with Scrapy log analysis & visualization supported.
Before deploying to Docker, we need to modify a few parts of `settings.py`:
```python
# local
# MONGO_URI = 'mongodb://localhost:27017'
# docker
MONGO_URI = 'mongodb://mongodb:27017'

# local
# REDIS_HOST = 'localhost'
# docker
REDIS_HOST = 'redis'
```
Since Docker resolves hostnames by the service names defined in the `.yml` file, we replace `localhost` here.
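Instead of commenting lines in and out, the same switch can be driven by environment variables; a sketch (the variable names are illustrative, not part of the project):

```python
## in settings.py
import os

# Fall back to the local services when the Docker hostnames are not provided.
MONGO_URI = os.environ.get('MONGO_URI', 'mongodb://localhost:27017')
REDIS_HOST = os.environ.get('REDIS_HOST', 'localhost')
```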
In the main spider script `ptt.py`, for the sake of convenience we restrict crawled dates to the year 2020. We also set `maximum_missing_count` to 500, which bounds how far the spider explores articles: if no further page can be visited or the missing count reaches this limit, we stop crawling to avoid wasting resources.
```python
## in ptt.py
import logging

from scrapy.utils.log import configure_logging
from scrapy_redis.spiders import RedisSpider


class PTTspider(RedisSpider):
    configure_logging(install_root_handler=False)
    logging.basicConfig(
        filename='logging.txt',
        format='%(levelname)s: %(message)s',
        level=logging.INFO)
    name = 'ptt'
    redis_key = 'ptt:start_urls'
    board = None
    ## restrictions mentioned above
    year = 2020
    maximum_missing_count = 500
```
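The stopping rule itself lives in the parsing logic of `ptt.py`. Conceptually it behaves like the sketch below, which assumes a per-spider `missing_count` counter and uses Scrapy's `CloseSpider` exception; the actual code may differ:

```python
from scrapy.exceptions import CloseSpider


def check_missing_limit(spider):
    """Illustrative helper: stop once too many articles are missing."""
    if spider.missing_count > spider.maximum_missing_count:
        # CloseSpider asks Scrapy to shut the spider down gracefully.
        raise CloseSpider('reached maximum_missing_count')
```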