Project author: welaika

Project description: Crawling since 1957
Language: Elixir
Repository: git://github.com/welaika/sputnik.git
Created: 2017-12-08T04:06:15Z
Project page: https://github.com/welaika/sputnik

License: MIT License


Sputnik

by weLaika

Sputnik is a website crawler written in Elixir.

It crawls a website following all internal links and makes a report of all pages’ status codes.
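
As a rough illustration of what "following all internal links" can mean, the sketch below resolves each href against the page it was found on and keeps only http(s) links on the same host. The module InternalLinks and its function are hypothetical and are not taken from Sputnik's source; only Elixir's standard URI module is used.

```elixir
defmodule InternalLinks do
  @moduledoc "Hypothetical helper, not part of Sputnik: keeps only same-host links."

  # Resolve every href against the page URL, drop fragments, and keep
  # only http(s) URLs whose host matches the host of the page itself.
  def internal(page_url, hrefs) do
    base = URI.parse(page_url)

    hrefs
    |> Enum.map(&URI.merge(base, &1))
    |> Enum.map(fn uri -> %{uri | fragment: nil} end)
    |> Enum.filter(fn uri -> uri.scheme in ["http", "https"] and uri.host == base.host end)
    |> Enum.map(&URI.to_string/1)
    |> Enum.uniq()
  end
end
```

For instance, InternalLinks.internal("http://example.com/docs/", ["/about", "https://other.org/x"]) keeps only the first link, resolved to http://example.com/about. Treating "internal" as "same host" is an assumption; a real crawler may also normalise subdomains or trailing slashes.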

With the --query flag you can pass one or more CSS selectors, and Sputnik reports how many elements match each selector across the crawled pages.

Build

Sputnik can be built with:

    mix deps.get
    mix escript.build

Usage

Sputnik takes the URL to crawl and optional queries to run against the crawled pages:

    sputnik [--query <Q> --query <Q1> ...] [--connections <N>] <url>

Options

  • query: a valid CSS selector (or several, separated by commas) to count across the whole website; the flag can be repeated
  • connections: maximum number of concurrent HTTP connections (default is 10)
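
The usage line and options above could be parsed along these lines with Elixir's standard OptionParser. This is a minimal sketch, not Sputnik's actual code: the module CliSketch and the shape of its return value are assumptions made for illustration.

```elixir
defmodule CliSketch do
  @moduledoc "Hypothetical argument parser mirroring the usage line; not Sputnik's code."

  # Default taken from the Options list in this README.
  @default_connections 10

  # Returns {:ok, url, queries, connections} or {:error, usage_message}.
  def parse(argv) do
    # :keep collects every --query occurrence instead of keeping only the last one.
    case OptionParser.parse(argv, strict: [query: :keep, connections: :integer]) do
      {opts, [url], []} ->
        queries = Keyword.get_values(opts, :query)
        connections = Keyword.get(opts, :connections, @default_connections)
        {:ok, url, queries, connections}

      {_opts, _args, _invalid} ->
        {:error, "usage: sputnik [--query <Q> ...] [--connections <N>] <url>"}
    end
  end
end
```

Using strict switches means unknown flags or a non-integer --connections value end up in the third element of the tuple, so they fall through to the usage message rather than being silently ignored.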

Examples

Running

    ./sputnik "http://spawnfest.github.io" --query "div" --query "a" --query "h1,h2,h3,h4,h5,h6" --connections 10

produces the following output

    #################### Pages ####################
    Pages found: 19
    status_code 200: 12
    status_code 301: 7
    #################### Queries ####################
    ## query `a` ##
    327 result(s)
    Min 18 result(s) per page
    Max 57 result(s) per page
    ## query `div` ##
    347 result(s)
    Min 13 result(s) per page
    Max 53 result(s) per page
    ## query `h1,h2,h3,h4,h5,h6` ##
    95 result(s)
    Min 0 result(s) per page
    Max 31 result(s) per page

It also opens a page with the results in the browser.
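
The figures in the text report are simple aggregates, which the sketch below reproduces: a tally of pages per status code, and the total, minimum and maximum number of matches per page for one selector. The module ReportSketch and its input shapes are hypothetical; Sputnik's own report code may be organised differently.

```elixir
defmodule ReportSketch do
  @moduledoc "Hypothetical aggregation helpers; the real report code may differ."

  # Counts how many pages returned each status code.
  # Enum.frequencies/1 requires Elixir 1.10 or later.
  def status_counts(status_codes), do: Enum.frequencies(status_codes)

  # Takes the number of matches found on each page for one selector
  # and returns the three figures printed in the Queries section.
  def summarize(per_page_counts) when per_page_counts != [] do
    %{
      total: Enum.sum(per_page_counts),
      min: Enum.min(per_page_counts),
      max: Enum.max(per_page_counts)
    }
  end
end
```

For example, ReportSketch.summarize([2, 0, 4]) returns %{total: 6, min: 0, max: 4}, which corresponds to the "result(s)", "Min" and "Max" lines for a single query.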

Requirements

Documentation can be generated with ExDoc
and published on HexDocs. Once published, the docs can
be found at https://hexdocs.pm/sputnik.

Testing

To run tests:

    $ mix test --cover

To run Credo:

    $ mix credo

Documentation

To generate the documentation:

    $ mix docs && open doc/index.html

Releasing

Bump the version in mix.exs, commit and push, then run mix hex.publish.
See https://hex.pm/docs/publish for help.