Project author: ZNClub-PA-ML-AI

Project description:
Web Crawling using Scrapy
Language: Python
Repository: git://github.com/ZNClub-PA-ML-AI/Scrapy-Spiders.git
Created: 2016-08-23T05:33:17Z
Project community: https://github.com/ZNClub-PA-ML-AI/Scrapy-Spiders

License:


Scrapy-Spiders

Scrapy module - Web Crawling


Introduction

Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.

Even though Scrapy was originally designed for web scraping, it can also be used to extract data using APIs (such as Amazon Associates Web Services) or as a general purpose web crawler.
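
For readers new to the framework, here is a minimal sketch of what a spider looks like. The spider name, target site, and CSS selectors are illustrative assumptions, not code from this repository.

    import scrapy

    class QuotesSpider(scrapy.Spider):
        # hypothetical spider: crawls a demo site and yields structured items
        name = "quotes"
        start_urls = ["https://quotes.toscrape.com/"]

        def parse(self, response):
            # extract one dict per quote block on the page
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }
            # follow the pagination link, if any
            next_page = response.css("li.next a::attr(href)").get()
            if next_page:
                yield response.follow(next_page, callback=self.parse)

Running scrapy crawl quotes from inside a project containing this spider would visit the listed pages and emit the yielded items.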

Architecture

(Scrapy architecture overview diagram)

See the official Scrapy architecture documentation for more details.

Project Structure

    tutorial/
        scrapy.cfg            # deploy configuration file
        tutorial/             # project's Python module, you'll import your code from here
            __init__.py
            items.py          # project items file
            pipelines.py      # project pipelines file
            settings.py       # project settings file
            spiders/          # a directory where you'll later put your spiders
                __init__.py
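
To make the roles of items.py and pipelines.py concrete, here is a hedged sketch; the QuoteItem fields and the pipeline class are hypothetical examples, not files taken from this project.

    # items.py: declare the fields an item may carry
    import scrapy

    class QuoteItem(scrapy.Item):
        text = scrapy.Field()
        author = scrapy.Field()

    # pipelines.py: drop items that are missing required data
    from scrapy.exceptions import DropItem

    class RequireTextPipeline:
        def process_item(self, item, spider):
            if not item.get("text"):
                raise DropItem("missing text field")
            return item

A pipeline like this only runs if it is enabled in settings.py via the ITEM_PIPELINES setting.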

Features

  • Built-in support for selecting and extracting data from HTML/XML sources using extended CSS selectors and XPath expressions, with helper methods to extract using regular expressions.
  • An interactive shell console (IPython aware) for trying out the CSS and XPath expressions to scrape data, very useful when writing or debugging your spiders.
  • Built-in support for generating feed exports in multiple formats (JSON, CSV, XML) and storing them in multiple backends (FTP, S3, local filesystem).
  • Robust encoding support and auto-detection, for dealing with foreign, non-standard and broken encoding declarations.
  • Strong extensibility support, allowing you to plug in your own functionality using signals and a well-defined API (middlewares, extensions, and pipelines).
  • Wide range of built-in extensions and middlewares for handling:

    • cookies and session handling
    • HTTP features like compression, authentication, caching
    • user-agent spoofing
    • robots.txt
    • crawl depth restriction
  • A Telnet console for hooking into a Python console running inside your Scrapy process, to introspect and debug your crawler.

Plus other goodies like reusable spiders to crawl sites from Sitemaps and XML/CSV feeds, a media pipeline for automatically downloading images (or any other media) associated with the scraped items, a caching DNS resolver, and much more.
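
As a quick illustration of the selector support listed above, the snippet below extracts the same data with CSS, XPath, and the regular-expression helper; the HTML fragment is a made-up stand-in for a downloaded response body.

    from scrapy.selector import Selector

    html = ('<ul><li class="item"><a href="/a">First</a></li>'
            '<li class="item"><a href="/b">Second</a></li></ul>')
    sel = Selector(text=html)

    # equivalent CSS and XPath queries
    titles_css = sel.css("li.item a::text").getall()            # ['First', 'Second']
    titles_xpath = sel.xpath('//li[@class="item"]/a/text()').getall()

    # regex helper applied on top of a selector
    hrefs = sel.css("li.item a::attr(href)").re(r"/(\w+)")      # ['a', 'b']

The same expressions can be tried interactively in the shell started by scrapy shell "url".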

Getting Started with Scrapy

Commands

    scrapy startproject project_name
    scrapy crawl spider_name
    scrapy crawl spider_name -o file.csv -t csv
    scrapy crawl spider_name -o file.json -t json
    scrapy shell "url"
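
The same crawls can also be started from Python rather than the command line. The sketch below is only an assumption-laden example: the spider, target URL, and output file are placeholders, and the FEEDS setting used here requires a reasonably recent Scrapy release.

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class TitleSpider(scrapy.Spider):
        # hypothetical spider that scrapes page titles
        name = "titles"
        start_urls = ["https://example.com/"]

        def parse(self, response):
            yield {"title": response.css("title::text").get()}

    # roughly equivalent to: scrapy crawl titles -o file.json
    process = CrawlerProcess(settings={"FEEDS": {"file.json": {"format": "json"}}})
    process.crawl(TitleSpider)
    process.start()  # blocks until the crawl finishes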

Environment

Using conda environments

    conda env list                    # list all environments
    conda activate Scrapy             # if Scrapy is listed
    conda deactivate                  # deactivate current environment
    conda create --name Scrapy        # if Scrapy is NOT listed
    pip install -r requirements.txt   # install dependencies
    conda list                        # all packages in env

Resources