Project author: helloworld0909
Project description:
Crawl board pages, post pages and user pages from ordinary forums such as 1point3acres.com/bbs
Language: Python
Project URL: git://github.com/helloworld0909/ForumCrawler.git
ForumCrawler
ForumCrawler crawls study-abroad forums.
It crawls board pages, posts and user info from ordinary Discuz-based forums
and stores the data in a MySQL database. It can also crawl specific information, such as offer information, from study-abroad forums.
It can be modified to work with other conventional forum sites.
python run.py <spider name>
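As a rough illustration of how a launcher like run.py might give each spider its own log file and job directory (as the v0.5 change log entry describes), here is a minimal sketch. It is not the project's actual run.py: the helper names spider_settings and build_command and the log/ and jobs/ directory layout are assumptions. LOG_FILE and JOBDIR are standard Scrapy settings, and -s is the standard `scrapy crawl` option for overriding a setting.

```python
import os


def spider_settings(spider_name, base_dir="."):
    """Return per-spider overrides for Scrapy's LOG_FILE and JOBDIR settings.

    The log/ and jobs/ sub-directories are a hypothetical layout.
    """
    return {
        "LOG_FILE": os.path.join(base_dir, "log", spider_name + ".log"),
        "JOBDIR": os.path.join(base_dir, "jobs", spider_name),
    }


def build_command(spider_name, base_dir="."):
    """Build an equivalent `scrapy crawl` command line as a list of arguments."""
    cmd = ["scrapy", "crawl", spider_name]
    overrides = spider_settings(spider_name, base_dir)
    # Sort the keys so the generated command is deterministic.
    for key in sorted(overrides):
        cmd += ["-s", "%s=%s" % (key, overrides[key])]
    return cmd
```

Keeping JOBDIR separate per spider matters because Scrapy stores pause/resume state there, so two spiders sharing one directory could overwrite each other's state.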
Available spiders:
Dependencies:
- Python 2.7
- scrapy
- bs4 (BeautifulSoup4)
- MySQLdb
- pywin32 (for Windows users)
Change Log
v0.5
Changes:
- Finish offer_spider, which can crawl offer info from bbs.gter.net
- Improve run.py to choose a different LOG_FILE and JOBDIR for each spider
- Automatically ignore empty offer items
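One way the "ignore empty offer items" behaviour could work is a small emptiness check applied before an item is stored. The sketch below is only an assumption about the approach, not the project's code: the function name is_empty_item, the ignore_keys parameter and the field names in the usage note are all hypothetical. In a Scrapy project such a check would typically live in an item pipeline that raises scrapy.exceptions.DropItem for empty items.

```python
def is_empty_item(item, ignore_keys=("url",)):
    """Return True if every field outside `ignore_keys` is missing or blank."""
    for key, value in dict(item).items():
        if key in ignore_keys:
            continue  # bookkeeping fields do not count as content
        if value is None:
            continue
        if hasattr(value, "strip") and not value.strip():
            continue  # empty or whitespace-only string
        if isinstance(value, (list, tuple, dict)) and len(value) == 0:
            continue  # empty container
        return False  # found at least one field with real content
    return True
```

For example, is_empty_item({"url": "http://example.com", "school": "", "major": None}) is true, while any item with a non-blank school field is kept.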
v0.41
Changes:
- Divide settings into 2 parts:
  - General settings in /
  - Custom spider settings in /custom
- Modify other components to fit this change
v0.4
Add some utils
Changes:
- Add log_parser
- Add cookies util
- gter.net spider (in development)
v0.31
Parse post context (admission info, user background, etc.)
Changes:
- Parse post context()
- Parse admission board correctly
- Fix trivial bugs
v0.3
Finish User page parsing
Changes:
- User page and profile parsing
- from __future__ import unicode_literals
- Fix names of attributes
- Parse board_url and board_name of each post
- Log filename is derived from time_local()
v0.23
Finish login
Changes:
- Add class variable ‘cookies’ and pass it on to every request
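The login approach described here (cookies held in a class variable and attached to every request, with v0.2 reading them from JSON) might look roughly like the sketch below. It is a stand-in, not the project's spider: load_cookies, CookieHolder and request_kwargs are hypothetical names, and the accepted JSON shapes are assumptions. The url and cookies keyword arguments are real parameters of scrapy.Request.

```python
import json


def load_cookies(path):
    """Load cookies from a JSON file into a plain {name: value} dict.

    Accepts either a JSON object ({"name": "value"}) or a list of
    {"name": ..., "value": ...} records, as many browser exporters produce.
    """
    with open(path) as handle:
        data = json.load(handle)
    if isinstance(data, dict):
        return dict(data)
    return {entry["name"]: entry["value"] for entry in data}


class CookieHolder(object):
    """Stand-in for a spider class that shares cookies via a class variable."""

    cookies = {}

    def request_kwargs(self, url):
        # These keyword arguments could be passed to scrapy.Request(**kwargs).
        return {"url": url, "cookies": self.cookies}
```

Setting CookieHolder.cookies = load_cookies("cookies.json") once at start-up would then make the same session cookies available to every request the class builds.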
v0.22
Finish forum parser and post parser
Changes:
- Finish parse_post(), PostItem
- Change the name of the project
- MySQL tables use MyISAM engine
v0.21
Use Rule to crawl forums and add forum info to MySQL
Changes (only the forum part is finished):
- Scrape forum info
- Add separate rules with respect to forum, post and user
- Add separate items
- Manage the process of creating tables in settings.py (TABLE_INFO)
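To illustrate how table creation could be driven from a TABLE_INFO setting, here is a minimal sketch. The real structure of TABLE_INFO in settings.py is not shown in this README, so the layout below (table name mapped to a list of column/type pairs), the column names and the helper functions are all assumptions. The MyISAM default follows the v0.22 entry.

```python
# Hypothetical layout: table name -> ordered list of (column, SQL type) pairs.
TABLE_INFO = {
    "forum": [("fid", "INT PRIMARY KEY"), ("name", "VARCHAR(255)")],
    "post": [("tid", "INT PRIMARY KEY"), ("fid", "INT"), ("title", "TEXT")],
}


def create_table_sql(table, columns, engine="MyISAM"):
    """Render one CREATE TABLE statement from a list of (name, type) pairs."""
    cols = ", ".join("`%s` %s" % (name, sql_type) for name, sql_type in columns)
    return "CREATE TABLE IF NOT EXISTS `%s` (%s) ENGINE=%s" % (table, cols, engine)


def all_create_statements(table_info):
    """Render statements for every table, in a deterministic order."""
    return [create_table_sql(name, table_info[name]) for name in sorted(table_info)]
```

Each resulting string could then be passed to a MySQLdb cursor's execute() when the crawler starts, so the schema lives in one place next to the other settings.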
v0.2
Only crawl URLs of board, thread and user pages
Changes:
- Replace BeautifulSoup with XPath
- Read cookies from json
- Add Rules in ForumSpider
- Add run.py
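The idea behind separate rules for board, thread and user pages can be sketched with plain regular expressions. The patterns below assume the default (non-rewritten) Discuz URL shapes and are not taken from the project; URL_PATTERNS and classify_url are hypothetical names. In a Scrapy CrawlSpider, patterns like these would usually be passed to Rule(LinkExtractor(allow=...), callback=...) instead of being matched by hand.

```python
import re

# Assumed Discuz URL shapes; real forums may use rewritten paths instead.
URL_PATTERNS = [
    ("board", re.compile(r"forum\.php\?mod=forumdisplay&fid=(\d+)")),
    ("thread", re.compile(r"forum\.php\?mod=viewthread&tid=(\d+)")),
    ("user", re.compile(r"home\.php\?mod=space&uid=(\d+)")),
]


def classify_url(url):
    """Return (kind, numeric id) for a recognised URL, or None otherwise."""
    for kind, pattern in URL_PATTERNS:
        match = pattern.search(url)
        if match:
            return kind, int(match.group(1))
    return None
```

Routing each kind of page to its own callback keeps the board, post and user parsers independent of one another.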
v0.1
Crawl all links under the domain.