# ptt-crawler with scrapy-redis framework in Python
In this project we collect data from the PTT website. We adopt the Scrapy framework, written in Python, and use MongoDB as our storage. However, a plain Scrapy crawler only runs on a single machine. To crawl more efficiently, scrapy-redis provides a distributed mechanism that lets us run spiders on several client machines. For deployment, we use scrapyd.
Full dependency installation on Ubuntu 16.04
In `settings.py`, we should define the MongoDB settings:
```python
## in settings.py
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'ptt-sandbox'
```
We also define the Redis settings there:

```python
## in settings.py
REDIS_HOST = 'localhost'
REDIS_PARAMS = {
    'password': 'yourpassword'
}
REDIS_PORT = 6379
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
```
The crawler is started with a date range:

```
scrapy crawl ptt -a start={m/d} -a end={m/d}
```

`-a` passes an argument to the spider. `{m/d}` means month/day, so 3/5 represents March 5th:

```
scrapy crawl ptt -a start=3/5 -a end=3/8
```
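For reference, Scrapy passes `-a` options to the spider's constructor as keyword arguments. The sketch below shows one way `start` and `end` could be received and parsed; the class name, helper logic, and the fixed year are illustrative assumptions, not the project's actual code.

```python
from datetime import datetime

import scrapy


class DateRangeSpiderSketch(scrapy.Spider):
    """Illustrative only: shows how -a start/end reach the spider."""
    name = 'date_range_sketch'

    def __init__(self, start=None, end=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # '-a start=3/5' arrives here as the string '3/5' (month/day);
        # the year is pinned purely for illustration.
        self.start_date = datetime.strptime(start, '%m/%d').replace(year=2020)
        self.end_date = datetime.strptime(end, '%m/%d').replace(year=2020)

    def parse(self, response):
        pass  # real parsing lives in the project's ptt.py
```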
Because the spider reads its start URLs from redis, open `redis-cli` and authenticate:

```
redis-cli
auth yourpassword
```

The password is defined in `settings.py` and can be modified directly. Then push a board's index page:

```
lpush ptt:start_urls https://www.ptt.cc/{board}/index.html
```

`{board}` can be Soft_Job, Gossiping, etc. After the URLs are pushed through `redis-cli`, run the spider:

```
scrapy crawl ptt -a start={date} -a end={date}
```
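The same push can be done from Python with the redis-py client instead of `redis-cli`; a small sketch, assuming the host, port, and password match the values in `settings.py`:

```python
import redis

# Connect with the credentials configured in settings.py.
r = redis.Redis(host='localhost', port=6379, password='yourpassword')

# Queue a board's index page for the spider to pick up.
board = 'Soft_Job'
r.lpush('ptt:start_urls', 'https://www.ptt.cc/{}/index.html'.format(board))
```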
There are three collections in MongoDB:

**Post**

| schema | Description |
|---|---|
| *canonicalUrl | URL of the crawled page |
| authorId | who posted the article |
| title | title of the article |
| content | content of the article |
| publishedTime | the date the post was created |
| updateTime | the date the post was updated |
| board | the PTT board the post belongs to |
**Author**

| schema | Description |
|---|---|
| *authorId | who posted the article |
| authorName | the author's nickname |
**Comment**

| schema | Description |
|---|---|
| commentId | who posted the comment |
| commentTime | when the comment was posted |
| commentContent | the content of the comment |
| board | the PTT board the comment belongs to |
Note: a * prefix in the schema column marks the primary key.
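The Scrapy items presumably mirror these schemas; `PostItem` and `AuthorItem` appear in the pipeline code below, while `CommentItem` is assumed here. A sketch of what `items.py` might contain, with field names taken from the tables above:

```python
import scrapy


class PostItem(scrapy.Item):
    canonicalUrl = scrapy.Field()   # *primary key
    authorId = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
    publishedTime = scrapy.Field()
    updateTime = scrapy.Field()
    board = scrapy.Field()


class AuthorItem(scrapy.Item):
    authorId = scrapy.Field()       # *primary key
    authorName = scrapy.Field()


class CommentItem(scrapy.Item):     # class name assumed for illustration
    commentId = scrapy.Field()
    commentTime = scrapy.Field()
    commentContent = scrapy.Field()
    board = scrapy.Field()
```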
Run the crawler:

```
scrapy crawl pttCrawl
```

The spider waits until start URLs appear in redis, so we open `redis-cli` and use `lpush` to attain this goal. The redis key is `pttCrawl:start_urls`; we push URLs to redis:
```
lpush pttCrawl:start_urls {ptt url}
```
Once the key holds URLs, every machine running `scrapy crawl pttCrawl` with the same redis configuration in `settings.py` consumes URLs from the shared queue.
In `settings.py`, we add a line that prevents duplicate requests from being crawled repeatedly:
```python
## in settings.py
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
```
In `settings.py`, we also add lines that keep track of the crawler's progress. Because pending requests remain in the redis queue after the crawling process stops, it is convenient to resume the crawl later:
```python
## in settings.py
# Enable scheduling storing requests queue in redis
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
# Start from the last endpoint
SCHEDULER_PERSIST = True
```
scrapyd provides a daemon for crawling. Like an HTTP server, we start it with the following command:
```
scrapyd
```
For deployment, we install the package `scrapyd-client` and run:
```
scrapyd-deploy pttCrawler
```
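Once deployed, a crawl can also be scheduled through scrapyd's JSON API. A sketch using `requests`, assuming scrapyd's default port 6800 and that the project and spider are named `pttCrawler` and `ptt` as above:

```python
import requests

# Ask the scrapyd daemon to schedule a crawl job.
resp = requests.post(
    'http://localhost:6800/schedule.json',
    data={'project': 'pttCrawler', 'spider': 'ptt'},
)
print(resp.json())  # e.g. {'status': 'ok', 'jobid': '...'}
```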
### DuplicatesPipeline
> In case of duplicates in the database, we filter the data here.
```python
def process_item(self, item, spider):
    if isinstance(item, PostItem):
        logging.debug("filter duplicated post items.")
        if item['canonicalUrl'] in self.post_set:
            raise DropItem("Duplicate post found:%s" % item)
        self.post_set.add(item['canonicalUrl'])
    elif isinstance(item, AuthorItem):
        logging.debug("filter duplicated author items.")
        if item['authorId'] in self.author_set:
            raise DropItem("Duplicate author found:%s" % item)
        self.author_set.add(item['authorId'])
    return item
```
### MongoPipeline
> Save the data in MongoDB.
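A minimal sketch of such a pipeline, wired to the `MONGO_URI` and `MONGO_DATABASE` settings shown earlier and assuming pymongo; the project's actual implementation may differ:

```python
import pymongo


class MongoPipeline(object):

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Read the connection settings defined in settings.py.
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Store each item in a collection named after its item class.
        self.db[type(item).__name__].insert_one(dict(item))
        return item
```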
### JsonPipeline
> Generate a JSON file.
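A sketch of a JSON-writing pipeline using Scrapy's built-in `JsonItemExporter`; the real pipeline may split the output differently (for example per board or per item type), and the output filename here is an assumption:

```python
from scrapy.exporters import JsonItemExporter


class JsonPipeline(object):

    def open_spider(self, spider):
        # 'items.json' is an illustrative filename.
        self.file = open('items.json', 'wb')
        self.exporter = JsonItemExporter(self.file)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item
```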
## Security Methodology
To avoid getting banned, we adopt some tricks while we are crawling web pages.
1. **Download delays**
> We set `DOWNLOAD_DELAY` in `settings.py` to limit the download behavior.
```python
## in settings.py
DOWNLOAD_DELAY = 2
```
2. **Distributed downloader**
> scrapy-redis has already taken care of this for us.
3. **User Agent Pool**
> Randomly choose one user agent via a downloader middleware.
```python
## in middlewares.py
class RandomUserAgentMiddleware(object):

    def process_request(self, request, spider):
        agent = random.choice(list(UserAgentList))
        request.headers['User-Agent'] = agent
```
```python
## in settings.py
UserAgentList = [
    'Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/27.0.1500.55 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_2) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1309.0 Safari/537.17'
]

DOWNLOADER_MIDDLEWARES = {
    'pttCrawler.middlewares.RandomUserAgentMiddleware': 543,
}
```
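Since `UserAgentList` is defined in `settings.py`, the middleware needs a way to read it. One option (a sketch, not necessarily how the project wires it) is to pull the list from the crawler settings via `from_crawler` instead of importing it:

```python
## in middlewares.py
import random


class RandomUserAgentMiddleware(object):

    def __init__(self, user_agents):
        self.user_agents = user_agents

    @classmethod
    def from_crawler(cls, crawler):
        # Read the pool defined as UserAgentList in settings.py.
        return cls(crawler.settings.get('UserAgentList', []))

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.user_agents)
```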
Note: we cannot disable cookies because we have to pass the ‘over18’ message to some ptt boards.
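For reference, the 'over18' confirmation can be sent as a cookie on the request itself; a minimal sketch (the board URL and spider name are illustrative, and the actual handling in `ptt.py` may differ):

```python
import scrapy


class Over18Sketch(scrapy.Spider):
    name = 'over18_sketch'

    def start_requests(self):
        # PTT's age gate is satisfied by carrying the over18 cookie.
        yield scrapy.Request(
            'https://www.ptt.cc/bbs/Gossiping/index.html',
            cookies={'over18': '1'},
            callback=self.parse,
        )

    def parse(self, response):
        self.logger.info('fetched %s', response.url)
```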
First, we need to install scrapydweb:

```
pip install scrapydweb
```

Then run it with the following command:

```
scrapydweb
```

We can then visit `localhost:5000` to monitor and track our crawler.
> Scrapyd: Scrapy comes with a built-in service, called "Scrapyd", which allows you to deploy your projects and control their spiders using a JSON web service.

> ScrapydWeb: A full-featured web UI for Scrapyd cluster management, with Scrapy log analysis & visualization supported.
Before deploying to Docker, we need to modify a few parts of `settings.py`:
```python
# local
# MONGO_URI = 'mongodb://localhost:27017'
# docker
MONGO_URI = 'mongodb://mongodb:27017'

# local
# REDIS_HOST = 'localhost'
# docker
REDIS_HOST = 'redis'
```
Since Docker resolves hostnames by the service names defined in the `.yml` file, we replace `localhost` here.
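Instead of commenting lines in and out, the same switch can be driven by environment variables; a sketch (the variable names are illustrative, not part of the project):

```python
## in settings.py
import os

# Fall back to the local services when the Docker hostnames are not provided.
MONGO_URI = os.environ.get('MONGO_URI', 'mongodb://localhost:27017')
REDIS_HOST = os.environ.get('REDIS_HOST', 'localhost')
```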
In the main spider script `ptt.py`, for the sake of convenience we restrict crawled dates to the year 2020. We also set `maximum_missing_count` to 500, which bounds how far the spider explores articles: if no further page can be visited or the missing count reaches this limit, we stop crawling to avoid wasting resources.
```python
## in ptt.py
import logging

from scrapy.utils.log import configure_logging
from scrapy_redis.spiders import RedisSpider


class PTTspider(RedisSpider):
    configure_logging(install_root_handler=False)
    logging.basicConfig(
        filename='logging.txt',
        format='%(levelname)s: %(message)s',
        level=logging.INFO)
    name = 'ptt'
    redis_key = 'ptt:start_urls'
    board = None
    ## restrictions mentioned above
    year = 2020
    maximum_missing_count = 500
```
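The stopping rule itself lives in the parsing logic of `ptt.py`. Conceptually it behaves like the sketch below, which assumes a per-spider `missing_count` counter and uses Scrapy's `CloseSpider` exception; the actual code may differ:

```python
from scrapy.exceptions import CloseSpider


def check_missing_limit(spider):
    """Illustrative helper: stop once too many articles are missing."""
    if spider.missing_count > spider.maximum_missing_count:
        # CloseSpider asks Scrapy to shut the spider down gracefully.
        raise CloseSpider('reached maximum_missing_count')
```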