Project author: dchrostowski

Project description: Public proxy farm that automatically records and queues suitable proxy servers for web crawling
Language: Python
Repository: git://github.com/dchrostowski/autoproxy.git
Created: 2019-10-31T11:27:37Z
Project community: https://github.com/dchrostowski/autoproxy

License: MIT License

autoproxy

About

This is a rewrite of my public proxy farm. It uses redis to record and store reliability statistics for publicly available proxy servers.

After recording sufficient data, it is able to create a database of proxy servers and choose the most reliable proxy to use for crawling a given website.

Pre-requisites

  • docker
  • docker-compose

Overview

When web crawling, proxies are essential for maintaining anonymity and circumventing bot detection. There are many free public proxy servers spread across the Internet; however, their performance is inconsistent. This project uses a redis store to temporarily cache proxy server information for use by a scrapy middleware. The cache is periodically synced to a Postgres database, which serves as a more permanent and practical storage medium for proxy statistics.
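
As a rough illustration of the middleware side of this design, here is a minimal sketch of a Scrapy downloader middleware that assigns a cached proxy to outgoing requests and feeds a simple success/failure signal back into redis. The key name (proxies:by_score), the scoring scheme, and the REDIS_URL setting are assumptions made for the example, not this project's actual schema.

```python
# Illustrative sketch only: the redis key layout and scoring scheme below
# are assumptions, not the schema used by autoproxy.
import random

import redis


class RedisProxyMiddleware:
    """Scrapy downloader middleware that attaches a cached proxy to each request."""

    def __init__(self, redis_url):
        self.redis = redis.Redis.from_url(redis_url)

    @classmethod
    def from_crawler(cls, crawler):
        # Read the redis URL from Scrapy settings, falling back to a local default.
        return cls(crawler.settings.get("REDIS_URL", "redis://localhost:6379/0"))

    def process_request(self, request, spider):
        # Hypothetical sorted set mapping proxy addresses to a reliability score;
        # pick randomly among the ten highest-scoring entries.
        candidates = self.redis.zrevrange("proxies:by_score", 0, 9)
        if candidates:
            request.meta["proxy"] = random.choice(candidates).decode()

    def process_response(self, request, response, spider):
        # Feed a crude success/failure signal back into the cache.
        proxy = request.meta.get("proxy")
        if proxy:
            delta = 1 if response.status == 200 else -1
            self.redis.zincrby("proxies:by_score", delta, proxy)
        return response
```

A middleware like this would be enabled through Scrapy's DOWNLOADER_MIDDLEWARES setting; the middleware shipped with this project may work quite differently.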

Example Usage

I’m still working on this, but here’s how to run it (a quick check of the running services is sketched after the steps):

  1. git clone https://github.com/dchrostowski/autoproxy.git
  2. cd autoproxy
  3. docker-compose build scrapyd
  4. docker-compose build spider_scheduler
  5. docker-compose up scrapyd spider_scheduler
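
Once the containers are up, the scrapyd service exposes its JSON API (by default on port 6800). Assuming that port is published to the host by the docker-compose configuration, a quick sanity check looks like this:

```python
# check_scrapyd.py -- quick sanity check against the scrapyd JSON API.
# Assumes scrapyd's default port 6800 is published to localhost; the actual
# mapping depends on the docker-compose configuration.
import requests

SCRAPYD = "http://localhost:6800"

# Confirm the daemon is running.
print(requests.get(f"{SCRAPYD}/daemonstatus.json").json())

# List deployed projects and their spiders.
for project in requests.get(f"{SCRAPYD}/listprojects.json").json().get("projects", []):
    spiders = requests.get(
        f"{SCRAPYD}/listspiders.json", params={"project": project}
    ).json().get("spiders", [])
    print(project, spiders)
```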

Getting proxies

There are a few spiders (see autoproxy/autoproxy/spiders) that are scheduled to crawl several sites, constantly pulling in more proxies and then testing those proxies against the sites they’ve scraped.
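
If you want to trigger a crawl by hand instead of waiting for the scheduler, scrapyd's schedule.json endpoint can queue a spider run. The project and spider names below are placeholders; use listprojects.json / listspiders.json (or look in autoproxy/autoproxy/spiders) to find the real ones.

```python
# schedule_spider.py -- manually queue a crawl through scrapyd.
# "autoproxy" and "example_proxy_spider" are placeholder names; check the
# scrapyd API or the spiders directory for the real project/spider names.
import requests

SCRAPYD = "http://localhost:6800"  # assumes the default port is published

response = requests.post(
    f"{SCRAPYD}/schedule.json",
    data={"project": "autoproxy", "spider": "example_proxy_spider"},
)
print(response.json())  # e.g. {"status": "ok", "jobid": "..."}
```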

To access the Postgres database, you can run the following:

  1. docker exec -it autoproxy_db psql -U postgres proxies

The default password is somepassword.
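
The same database can also be reached from the host with any Postgres client. The sketch below assumes the container's port 5432 is published to localhost; rather than guessing at table names, it simply lists whatever tables the sync process has created.

```python
# query_proxies.py -- connect to the proxies database from the host.
# Assumes Postgres port 5432 is published to localhost; adjust host/port
# to match the docker-compose configuration.
import psycopg2

conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="proxies",
    user="postgres",
    password="somepassword",  # default password noted above
)

with conn, conn.cursor() as cur:
    # List the tables the sync process has created.
    cur.execute(
        "SELECT table_name FROM information_schema.tables "
        "WHERE table_schema = 'public'"
    )
    for (table_name,) in cur.fetchall():
        print(table_name)

conn.close()
```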

Future plans

I’m planning to eventually publish the contents of autoproxy_package/ as a standalone module/package.