A driver for a remote headless scraping cluster for Capybara (aka remote Capybara Webkit).
There are many browser automation tools, mostly built on top of PhantomJS.
In my opinion, Capybara is still the best. Unfortunately, most Capybara
drivers are not well suited for web scraping, for several reasons.
This happens because Capybara is intended primarily for testing, not for
web scraping, and its authors do not want to support such use cases. So you,
as a product developer, have to solve these problems yourself. This spawns
primitive, makeshift solutions which work until you have to run more than
a few tens of tasks per hour.
Scrapod tries to solve all or most of these problems.
Scrapod consists of two parts: a client and a server.
The client is a Capybara driver. It connects to the server when you create
a session, sends Capybara API calls over the connection, and converts the
responses into Ruby data structures. This is what you use in a final product
application. This document describes the client completely.
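The wire protocol itself is not specified in this document, so the sketch below is only an illustration of the general idea of such a call/response round trip. A JSON-based exchange and all field names here are assumptions, not Scrapod's actual format:

```ruby
require 'json'

# Hypothetical encoding of a single Capybara call. Scrapod's real wire
# format may differ entirely; this only illustrates the general shape.
request = JSON.generate(method: 'visit', args: ['https://google.com'])

# The request is sent over the connection; suppose the server answers:
raw_response = '{"status":"ok","value":null}'

# The client then converts the response into Ruby data structures.
response = JSON.parse(raw_response, symbolize_names: true)
response[:status] #=> "ok"
```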
The server is a process which can run on the same machine as the client or
on another one. Server configuration can be complex, but it is not difficult;
it is described in the server repository.
For testing purposes it is enough to install the gem and run scrapod-server --debug.
It will start listening on local port 20885.
Add the gem to your Gemfile (with a git source, because I do not push new
experimental gems to RubyGems):
gem 'scrapod', git: 'https://github.com/krowpu/scrapod.git'
This will register a Capybara driver named :scrapod
which connects to local port 20885.
To connect to a remote host, register the driver yourself.
Assuming you use Sidekiq with Ruby on Rails, create the file config/initializers/scrapod.rb
with the following content:
Capybara.register_driver :scrapod do |app|
  Scrapod::Driver.new app, Scrapod::Configuration::DEFAULT.merge(
    host: ENV['SCRAPOD_HOST'] || '127.0.0.1',
    port: ENV['SCRAPOD_PORT']&.to_i || 20885,
  )
end
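The `&.to_i` / `||` chain above is what makes the environment variables optional. As a minimal sketch of that fallback behavior (assuming Scrapod::Configuration::DEFAULT is a plain Hash of options; the names below are illustrative, not Scrapod's API):

```ruby
# Illustrative stand-in for Scrapod::Configuration::DEFAULT (assumed
# to be a plain Hash of default options).
DEFAULT = { host: '127.0.0.1', port: 20885 }.freeze

def scrapod_config(env = ENV)
  DEFAULT.merge(
    host: env['SCRAPOD_HOST'] || DEFAULT[:host],
    # `&.to_i` yields nil when the variable is unset, so `||` keeps the default
    port: env['SCRAPOD_PORT']&.to_i || DEFAULT[:port],
  )
end

scrapod_config({})[:port]                        #=> 20885
scrapod_config('SCRAPOD_PORT' => '30000')[:port] #=> 30000
```

With no variables set you get the defaults unchanged, so the same initializer works both locally and against a remote host.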
Just create a Capybara session with the :scrapod
driver and use it as usual:
session = Capybara::Session.new :scrapod
session.visit 'https://google.com'
session.title #=> "Google"
session.fill_in 'q', with: 'Capybara'
session.all('input')[1].trigger 'click'
session.title #=> "Capybara - Google Search"