Project author: krowpu

Project description:
A driver for a remote headless scraping cluster for Capybara (aka remote Capybara Webkit)
Language: Ruby
Project URL: git://github.com/krowpu/scrapod.git
Created: 2017-03-18T09:12:18Z
Project community: https://github.com/krowpu/scrapod

License: MIT License



Scrapod

A driver for a remote headless scraping cluster for Capybara (aka remote Capybara Webkit).

Introduction

There are many browser automation tools, most of them built on top of
PhantomJS. In my opinion, Capybara is still the best. Unfortunately, most
Capybara drivers are not well suited to web scraping, for the following
reasons:

  • They run on the same server as your worker, which can be inefficient
  • They do not take care of terminating the headless browser process, which can cause memory leaks
  • They start the headless browser process on demand, which can be slow
  • They can consume all available RAM and freeze the server

This happens because Capybara is intended primarily for testing, not for
web scraping. Its authors do not want to support such use cases, so you, as
the developer of the final product, have to solve these problems yourself.
This leads to primitive, makeshift solutions, which work fine until you have
to run more than a few dozen tasks per hour.

Scrapod tries to solve all or most of the problems listed above.

Architecture

Scrapod consists of two parts: a client and a server.

Client

The client is a driver for Capybara. It connects to the server when you create
a session, sends Capybara API calls over the connection, and converts the
responses to Ruby data structures. This is what you use in your application.
This document covers the client in full.
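
To make the request/response flow concrete, here is a toy sketch of the general pattern described above: open a connection, send a call, and turn the reply into a Ruby data structure. The line-delimited JSON format, the `ToyClient` class, the `start_toy_server` helper, and the `title` call name are all invented for illustration. Scrapod's real wire protocol is not described in this document and may look nothing like this.

```ruby
require 'json'
require 'socket'

# Hypothetical toy client. This is NOT Scrapod's actual protocol.
class ToyClient
  def initialize(host, port)
    @socket = TCPSocket.new(host, port)
  end

  # Send one call as a JSON line and parse the JSON reply into a Hash.
  def call(method_name, *args)
    @socket.puts JSON.generate('method' => method_name, 'args' => args)
    JSON.parse(@socket.gets)
  end

  def close
    @socket.close
  end
end

# Hypothetical toy server: serves a single connection and echoes back
# the name of each call it receives.
def start_toy_server
  server = TCPServer.new('127.0.0.1', 0) # port 0 = let the OS pick a free port
  thread = Thread.new do
    conn = server.accept
    while (line = conn.gets)
      request = JSON.parse(line)
      conn.puts JSON.generate('ok' => true, 'echo' => request['method'])
    end
    conn.close
  end
  [server, thread]
end
```

A client would then call `ToyClient.new('127.0.0.1', port).call('title')` and receive a plain Ruby Hash back.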

Server

The server is a process that can run on the same machine as the client or on
a different one. Server configuration can be complex, but it is not difficult.
It is described in the server repository.
For testing purposes it is enough to install the gem and run
scrapod-server --debug. It will start listening on local port 20885.
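
Before pointing a client at the server, it can help to confirm that something is actually listening on the port. The following is a minimal sketch using only Ruby's standard library; `port_open?` is a hypothetical helper, not part of Scrapod, and it only checks that a TCP connection can be opened, not that the listener is really a Scrapod server.

```ruby
require 'socket'

# Hypothetical helper (not part of Scrapod): returns true if a TCP
# connection to host:port can be opened within `timeout` seconds.
def port_open?(host, port, timeout: 1)
  # The block form closes the socket automatically after the block returns.
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue SystemCallError, SocketError, IOError
  # Connection refused, host unreachable, name resolution failure, or timeout.
  false
end
```

For the local test setup described above, the check would be `port_open?('127.0.0.1', 20885)`.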

Installation

Add the gem to your Gemfile (with a git source, because I do not push new
experimental gems to RubyGems):

  gem 'scrapod', git: 'https://github.com/krowpu/scrapod.git'

This registers a Capybara driver named :scrapod, which connects
to local port 20885. To connect to a remote host, register the driver
yourself. Assuming you use Sidekiq with Ruby on Rails, create the file
config/initializers/scrapod.rb with the following content:

  Capybara.register_driver :scrapod do |app|
    Scrapod::Driver.new app, Scrapod::Configuration::DEFAULT.merge(
      host: ENV['SCRAPOD_HOST'] || '127.0.0.1',
      port: ENV['SCRAPOD_PORT']&.to_i || 20885,
    )
  end
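
Note that `String#to_i` returns 0 for a non-numeric value, so a mistyped SCRAPOD_PORT would silently become port 0 in the initializer above. If you prefer to fail fast, a stricter variant could look like the sketch below. `scrapod_endpoint` is a hypothetical helper, not part of Scrapod; it only assumes the `host` and `port` option names and the environment variable names shown above.

```ruby
# Hypothetical helper (not part of Scrapod): builds the host/port options
# from an ENV-like object, raising ArgumentError on a malformed port.
def scrapod_endpoint(env)
  host = env.fetch('SCRAPOD_HOST', '127.0.0.1')
  # Integer() raises ArgumentError for input such as "abc", unlike String#to_i.
  port = Integer(env.fetch('SCRAPOD_PORT', '20885'), 10)
  raise ArgumentError, "port out of range: #{port}" unless (1..65_535).cover?(port)

  { host: host, port: port }
end
```

Inside the initializer, the result could then be merged in place of the two literal options: `Scrapod::Configuration::DEFAULT.merge(scrapod_endpoint(ENV))`.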

Usage

Just create a Capybara session with the :scrapod driver and use it as usual:

  session = Capybara::Session.new :scrapod
  session.visit 'https://google.com'
  session.title #=> "Google"
  session.fill_in 'q', with: 'Capybara'
  session.all('input')[1].trigger 'click'
  session.title #=> "Capybara - Google Search"
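
In a long-running worker you may want every task to clean up its session even when scraping raises an error. Below is a minimal sketch of one way to do that. `with_session` is a hypothetical helper, not part of Scrapod or Capybara; it relies only on `Capybara::Session#reset!`, and whether resetting a session releases any resources on the Scrapod server is an assumption this document does not confirm.

```ruby
# Hypothetical helper (not part of Scrapod or Capybara): yields a fresh
# session and always resets it afterwards, even if the block raises.
# `factory` is any callable that returns an object responding to `reset!`,
# for example: -> { Capybara::Session.new(:scrapod) }
def with_session(factory)
  session = factory.call
  yield session
ensure
  # `session` is nil if the factory itself raised, hence the safe navigation.
  session&.reset!
end
```

A worker could then wrap each task as `with_session(-> { Capybara::Session.new(:scrapod) }) { |s| s.visit(url); s.title }`, where `url` is whatever page the task scrapes.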