Project author: SimonRichardson

Project description:
Crawl all the things!

Language: Go
Repository: git://github.com/SimonRichardson/crwlr.git
Created: 2017-05-20T12:49:15Z
Project community: https://github.com/SimonRichardson/crwlr

License: MIT License

crwlr

Command Crawler

Getting started

If you would like to build the project, the crwlr command expects the following
tools to be pre-installed via go get:

  • go get github.com/Masterminds/glide
  • go get github.com/mjibson/esc

A quick guide to getting started; this assumes you have $GOPATH set up
correctly and that the gopath bin folder is in your $PATH:

  glide install
  make clean all
  cd dist
  ./crwlr crawl -addr="http://google.com"

Introduction

The crwlr CLI is split into two distinct commands, static and crawl. The
static command is only an aid for manually testing the crawl command, along
with various benchmarking/integration tests.

Static

The static command serves a series of pages that the crawl command can walk
without hitting an external host. To help integration with crawl, the static
command can be used in combination with a pipe to send the current address,
which allows quick, iterative testing.

The following command launches the CLI:

  crwlr static

In combination with the crawl command, an extra argument is required.

  crwlr static -output.addr=true | crwlr crawl

A fairly descriptive -help section is also available to better understand what
the static command can do:

  crwlr static -help

  USAGE
    static [flags]

  FLAGS
    -api tcp://0.0.0.0:7650  listen address for static APIs
    -debug false             debug logging
    -output.addr false       Output address writes the address to stdout
    -output.prefix -addr=    Output prefix prefixes the flag to the output.addr
    -ui.local true           Use local files straight from the file system

Crawl

The crawl command walks a host for new urls that it can, in turn, also
traverse. The command can be configured (on by default) to check the host's
robots.txt and follow its rules for crawling.
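
For a sense of what that robots.txt check involves, the sketch below fetches
/robots.txt and applies the Disallow prefixes declared for the * user agent,
using only the standard library. It is a minimal illustration, not code from
the project, and the helper names (fetchDisallowed, allowed) are made up.

  package main

  import (
      "bufio"
      "fmt"
      "net/http"
      "strings"
  )

  // fetchDisallowed is an illustrative helper (not part of crwlr): it fetches
  // robots.txt and collects the Disallow prefixes declared for the "*" agent.
  func fetchDisallowed(host string) ([]string, error) {
      resp, err := http.Get(host + "/robots.txt")
      if err != nil {
          return nil, err
      }
      defer resp.Body.Close()

      var disallowed []string
      applies := false
      scanner := bufio.NewScanner(resp.Body)
      for scanner.Scan() {
          line := strings.TrimSpace(scanner.Text())
          switch {
          case strings.HasPrefix(line, "User-agent:"):
              applies = strings.TrimSpace(strings.TrimPrefix(line, "User-agent:")) == "*"
          case applies && strings.HasPrefix(line, "Disallow:"):
              disallowed = append(disallowed, strings.TrimSpace(strings.TrimPrefix(line, "Disallow:")))
          }
      }
      return disallowed, scanner.Err()
  }

  // allowed reports whether a path avoids every Disallow prefix.
  func allowed(path string, disallowed []string) bool {
      for _, prefix := range disallowed {
          if prefix != "" && strings.HasPrefix(path, prefix) {
              return false
          }
      }
      return true
  }

  func main() {
      rules, err := fetchDisallowed("http://0.0.0.0:7650") // e.g. the static command's address
      if err != nil {
          fmt.Println("robots.txt fetch failed:", err)
          return
      }
      fmt.Println("/page1 allowed:", allowed("/page1", rules))
  }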

The command uses aggressive caching to improve performance and to be more
efficient when crawling a host.
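
The README doesn't describe the cache's internals; conceptually it only needs
to remember which urls have already been seen so a page isn't fetched twice.
A minimal sketch of that idea (the urlCache type is hypothetical, not taken
from the project):

  package main

  import (
      "fmt"
      "sync"
  )

  // urlCache is an illustrative de-duplication cache: Seen returns true the
  // second time a URL is offered, so the crawler can skip (filter) it.
  type urlCache struct {
      mu   sync.Mutex
      seen map[string]struct{}
  }

  func newURLCache() *urlCache {
      return &urlCache{seen: make(map[string]struct{})}
  }

  func (c *urlCache) Seen(url string) bool {
      c.mu.Lock()
      defer c.mu.Unlock()
      if _, ok := c.seen[url]; ok {
          return true
      }
      c.seen[url] = struct{}{}
      return false
  }

  func main() {
      cache := newURLCache()
      fmt.Println(cache.Seen("http://0.0.0.0:7650/index")) // false: first visit
      fmt.Println(cache.Seen("http://0.0.0.0:7650/index")) // true: would be filtered
  }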

As part of the command it's also possible to output a report (on by default)
of what was crawled and to expose some metrics about what went on. These
include metrics such as requested vs. received, filtered, and errored (a
sketch of such counters follows the list below).

  • Requested is when a request is sent to the host; it's not known whether that
    request was actually successful.
  • Received is the acknowledgement of the request succeeding.
  • Filtered describes whether the host was already cached.
  • Errorred states that the request failed for some reason.
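
As a rough picture of what backing counters for those four outcomes could look
like, here is a minimal sketch; the crawlMetrics type and its method names are
assumptions for illustration, not crwlr's actual types.

  package main

  import (
      "fmt"
      "sync/atomic"
  )

  // crawlMetrics holds one counter per report column; the real accounting in
  // crwlr may differ, this only mirrors the definitions above.
  type crawlMetrics struct {
      requested, received, filtered, errored uint64
  }

  func (m *crawlMetrics) Requested() { atomic.AddUint64(&m.requested, 1) } // request sent, outcome unknown
  func (m *crawlMetrics) Received()  { atomic.AddUint64(&m.received, 1) }  // request acknowledged as successful
  func (m *crawlMetrics) Filtered()  { atomic.AddUint64(&m.filtered, 1) }  // URL skipped because it was already cached
  func (m *crawlMetrics) Errored()   { atomic.AddUint64(&m.errored, 1) }   // request failed for some reason

  func main() {
      var m crawlMetrics
      m.Requested()
      m.Received()
      fmt.Printf("requested=%d received=%d filtered=%d errored=%d\n",
          m.requested, m.received, m.filtered, m.errored)
  }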

The following command launches the CLI:

  crwlr crawl -addr="http://yourhosthere.com"

Also available is a comprehensive -help section:

  crwlr crawl -help

  USAGE
    crawl [flags]

  FLAGS
    -addr 0.0.0.0:0            addr to start crawling
    -debug false               debug logging
    -filter.same-domain true   filter other domains that aren't the same
    -follow-redirects true     should the crawler follow redirects
    -report.metrics false      report the metric outcomes of the crawl
    -report.sitemap true       report the sitemap of the crawl
    -robots.crawl-delay false  use the robots.txt crawl delay when crawling
    -robots.request true       request the robots.txt when crawling
    -useragent.full Mozilla/5.0 (compatible; crwlr/0.1; +http://crwlr.com)  full user agent the crawler should use
    -useragent.robot Googlebot (crwlr/0.1)  robot user agent the crawler should use

Reports

The reporting part of the command outputs two different types of information:
sitemap reporting and metric reporting. Both reports can be turned off via a
series of flags.

Sitemap Reports

When the command is done, the sitemap report can be output (on by default); it
explains what was linked to what and also includes a list of the static assets
that were linked from each page.
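
One way to picture the data behind such a report is a map from each crawled
URL to the links and assets found on it, roughly as in the sketch below; the
pageEntry and sitemap names are hypothetical, not the project's own types.

  package main

  import "fmt"

  // pageEntry records what one crawled URL referenced.
  type pageEntry struct {
      Links  []string // pages linked from this URL
      Assets []string // static assets (css, images, ...) referenced by this URL
  }

  // sitemap mirrors the columns of the sitemap report: URL, Ref Links, Ref Assets.
  type sitemap map[string]pageEntry

  func main() {
      sm := sitemap{
          "http://0.0.0.0:7650/index": {
              Assets: []string{
                  "http://0.0.0.0:7650/index.css",
                  "http://google.com/bootstrap.css",
              },
          },
      }
      for url, entry := range sm {
          fmt.Println(url, "links:", entry.Links, "assets:", entry.Assets)
      }
  }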

A possible output is as follows:

  dist/crwlr crawl
  URL | Ref Links | Ref Assets |
  http://0.0.0.0:7650/robots.txt | | |
  http://0.0.0.0:7650 | | |
  | http://0.0.0.0:7650/index | http://0.0.0.0:7650/index.css |
  | http://0.0.0.0:7650/page1 | http://google.com/bootstrap.css |
  | http://0.0.0.0:7650/bad | http://0.0.0.0:7650/image.jpg |
  | | http://google.com/image.jpg |
  http://0.0.0.0:7650/index | | |
  | | http://0.0.0.0:7650/index.css |
  | | http://google.com/bootstrap.css |
  | | http://0.0.0.0:7650/image.jpg |
  | | http://google.com/image.jpg |
  http://0.0.0.0:7650/page1 | | |
  | http://0.0.0.0:7650/page2 | http://0.0.0.0:7650/index1.css |
  | | http://google.com/bootstrap.css |
  | | http://0.0.0.0:7650/image2.jpg |
  | | http://google.com/image.jpg |
  http://0.0.0.0:7650/bad | | |
  http://0.0.0.0:7650/page2 | | |
  | http://0.0.0.0:7650/page | |
  | http://0.0.0.0:7650/page3 | |
  http://0.0.0.0:7650/page | | |
  http://0.0.0.0:7650/page3 | | |

Metric Reports

When the command is done, a metrics report can be output (off by default),
which can help explain, for example, what the crawl actually requested vs.
what it filtered.

An example report run against the static command is as follows:

  dist/crwlr crawl -report.metrics=true
  URL | Avg Duration (ms) | Requested | Received | Filtered | Errorred |
  http://0.0.0.0:7650/page | 0 | 1 | 0 | 0 | 1 |
  http://0.0.0.0:7650/page3 | 0 | 1 | 0 | 1 | 0 |
  http://0.0.0.0:7650/robots.txt | 5 | 1 | 1 | 0 | 0 |
  http://0.0.0.0:7650 | 1 | 1 | 1 | 0 | 0 |
  http://0.0.0.0:7650/index | 0 | 1 | 1 | 3 | 0 |
  http://0.0.0.0:7650/page1 | 1 | 1 | 1 | 2 | 0 |
  http://0.0.0.0:7650/bad | 0 | 1 | 0 | 1 | 1 |
  http://0.0.0.0:7650/page2 | 0 | 1 | 1 | 0 | 0 |
  Totals | Duration (ms) |
  | 9560 |

Tests

Tests can be run using the following command, which also includes a series of
benchmark tests:

  go test -v -bench=. $(glide nv)
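
The -bench=. flag picks up any function of the standard Benchmark* form in the
packages that glide nv lists. As a reminder of that shape only (this is not
one of the project's benchmarks), such a test looks like:

  package crawler_test

  import "testing"

  // BenchmarkSeen is an illustrative benchmark; go test -bench=. runs any
  // function named Benchmark* that takes *testing.B.
  func BenchmarkSeen(b *testing.B) {
      seen := make(map[string]struct{})
      for i := 0; i < b.N; i++ {
          seen["http://0.0.0.0:7650/index"] = struct{}{}
      }
  }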

Improvements

Possible improvements:

  • Store the urls in a KVS so that a crawler can truly work in a distributed
    fashion, especially if the host is large or if it's allowed to crawl beyond
    the host (see the interface sketch after this list).
  • Potentially better strategies to walk assets at a later date to backfill the
    metrics.
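
To make the first idea concrete, the seen-URL bookkeeping could sit behind a
small interface that an in-memory implementation and a KVS-backed one (Redis,
etcd, and so on) both satisfy, so several crawler instances could share the
same state. This is purely a sketch of the suggestion above; none of these
names exist in the project.

  package main

  import "fmt"

  // store is a stand-in for whatever KVS is chosen (Redis, etcd, ...); the
  // crawler only needs an atomic "set if absent" to de-duplicate urls.
  type store interface {
      SetIfAbsent(key string) (added bool, err error)
  }

  // kvsSeen tracks which urls have been crawled in a shared store, which is
  // what would let several crawler instances cooperate on one crawl.
  type kvsSeen struct {
      kv store
  }

  // MarkSeen reports whether the URL had already been recorded by any instance.
  func (s *kvsSeen) MarkSeen(url string) (bool, error) {
      added, err := s.kv.SetIfAbsent("crwlr:seen:" + url)
      return !added, err
  }

  // mapStore is a stand-in store for the demo; a real deployment would back
  // this with the chosen KVS instead.
  type mapStore map[string]struct{}

  func (s mapStore) SetIfAbsent(key string) (bool, error) {
      if _, ok := s[key]; ok {
          return false, nil
      }
      s[key] = struct{}{}
      return true, nil
  }

  func main() {
      seen := &kvsSeen{kv: mapStore{}}
      first, _ := seen.MarkSeen("http://0.0.0.0:7650/page1")
      second, _ := seen.MarkSeen("http://0.0.0.0:7650/page1")
      fmt.Println(first, second) // false true
  }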