A crawler for automated functional testing of a web application
Crawling a server-side-rendered web application is a low-cost way to get low-quality test coverage of your JavaScript-light web application.
If you have only partial test coverage of your routes, but still want to protect against silly mistakes, then this is for you.
Features:
Works with the test clients for Flask (inc Flask-WebTest), Django and WebTest.
Here’s an example: Flaskr, the Flask tutorial application, has 166 lines of test code to achieve 100% test coverage.
Using Python Testing Crawler in a similar way to the Usage example below, we can hit 73% with very little effort. Disclaimer: Of course! It’s not the same quality or utility of testing! But it is better than no tests, a complement to hand-written unit or functional tests and a useful stopgap.
$ pip install python-testing-crawler
Create a crawler using your framework’s existing test client, tell it where to start and what rules to obey, then set it off:
from python_testing_crawler import Crawler
from python_testing_crawler import Rule, Request, Ignore, Allow
def test_crawl_all():
    client = ...  # your framework's existing testing client
    # ... any setup ...
    crawler = Crawler(
        client=client,
        initial_paths=['/'],
        rules=[
            Rule("a", '/.*', "GET", Request()),
        ]
    )
    crawler.crawl()
This will crawl all anchor links to relative addresses beginning “/”. Any exceptions encountered will be collected and presented at the end of the crawl. For more power see the Rules section below.
If you need to authorise the client’s session, e.g. login, then you should do that before creating the Crawler.
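To make the behaviour described above concrete, here is a minimal standard-library sketch of a crawl loop. It is not the library's implementation: the in-memory PAGES dict, fake_get and crawl are hypothetical stand-ins for a real test client and for Crawler. It follows only anchor hrefs that begin with "/" and collects exceptions instead of stopping at the first one.

```python
import re
from html.parser import HTMLParser

# Hypothetical in-memory "site": path -> HTML body (stands in for a test client).
PAGES = {
    "/": '<a href="/about">About</a> <a href="https://example.com/">External</a>',
    "/about": '<a href="/">Home</a> <a href="/broken">Broken</a>',
}


class AnchorCollector(HTMLParser):
    """Collect href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for k, v in attrs if k == "href" and v)


def fake_get(path):
    """Stand-in for client.get(); raises for unknown paths."""
    if path not in PAGES:
        raise LookupError(f"no such page: {path}")
    return PAGES[path]


def crawl(initial_paths, target_regex="/.*"):
    """Visit each matching path once; return (visited paths, captured errors)."""
    pending, seen, errors = list(initial_paths), set(), []
    while pending:
        path = pending.pop(0)
        if path in seen:
            continue
        seen.add(path)
        try:
            html = fake_get(path)
        except Exception as exc:  # keep going, report at the end
            errors.append((path, exc))
            continue
        parser = AnchorCollector()
        parser.feed(html)
        # Only follow targets matching the rule's regex (relative paths here).
        pending.extend(h for h in parser.hrefs if re.fullmatch(target_regex, h))
    return seen, errors


visited, errors = crawl(["/"])
```

In this sketch the external link is never requested because it does not match '/.*', and the broken link ends up in errors rather than aborting the crawl.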
It is also a good idea to create enough data, via fixtures or otherwise, to expose enough endpoints.
It depends on your framework:
| Param | Description |
|---|---|
| initial_paths | list of paths/URLs to start from |
| rules | list of Rules to control the crawler; see below |
| path_attrs | list of attribute names to extract paths/URLs from; defaults to “href”; include “src” if you want to check e.g. <link>, <script> or even <img> |
| ignore_css_selectors | any elements matching this list of CSS selectors will be ignored when extracting links |
| ignore_form_fields | list of form input names to ignore when determining the identity/uniqueness of a form. Include CSRF token field names here. |
| max_requests | Crawler will raise an exception if this limit is exceeded |
| capture_exceptions | upon encountering an exception, keep going and fail at the end of the crawl instead of during (default True) |
| output_summary | print summary statistics and any captured exceptions and tracebacks at the end of the crawl (default True) |
| should_process_handlers | list of “should process” handlers; see Handlers section |
| check_response_handlers | list of “check response” handlers; see Handlers section |
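As an illustration of how these options might be combined, the sketch below gathers a few of them into a dict of keyword arguments. The option names come from the table above; the values (the CSS selector, the CSRF field name and the request limit) are hypothetical and would need adapting to your application.

```python
# Keyword arguments for Crawler, using option names from the table above.
# All values are illustrative assumptions, not recommended defaults.
crawler_options = {
    "initial_paths": ["/"],
    "path_attrs": ("HREF", "SRC"),            # also extract src attributes
    "ignore_css_selectors": ["a.logout"],     # hypothetical selector
    "ignore_form_fields": ["csrf_token"],     # hypothetical CSRF field name
    "max_requests": 500,                      # arbitrary safety limit
    "capture_exceptions": True,
    "output_summary": True,
}

# With the library installed, a test client and a rule list in scope,
# these would be passed as:
#     crawler = Crawler(client=client, rules=rules, **crawler_options)
```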
The crawler has to be told what URLs to follow, what forms to post and what to ignore, using Rules.
Rules are made of four parameters:
Rule(<source element regex>, <target URL/path regex>, <HTTP method>, <action to take>)
These are matched against every HTML element that the crawler encounters, with the last matching rule winning.
Actions must be one of the following objects:
- Request(only=False, params=None): follow a link or submit a form. only=True will retrieve a page/resource but not spider its links; params allows you to specify overrides for a form’s default values.
- Ignore(): do nothing / skip.
- Allow(status_codes): allow an HTTP status in the supplied list, i.e. do not consider it an error.
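The "last matching rule wins" behaviour can be sketched with plain regular expressions. This is only an illustration under stated assumptions: the Rule namedtuple and resolve function below are stand-ins, not the library's classes, the actions are represented as strings, and full-string regex matching is assumed for the element and target patterns.

```python
import re
from collections import namedtuple

# Stand-in mirroring the four Rule parameters; not the library's Rule class.
Rule = namedtuple("Rule", "source target method action")


def resolve(rules, element, target, method):
    """Return the action of the last rule matching all three fields, or None."""
    chosen = None
    for rule in rules:
        if (
            re.fullmatch(rule.source, element)
            and re.fullmatch(rule.target, target)
            and rule.method == method
        ):
            chosen = rule.action  # later matches override earlier ones
    return chosen


rules = [
    Rule("a", "/.*", "GET", "request"),     # follow all relative anchor links...
    Rule("a", "/logout", "GET", "ignore"),  # ...except the logout link
]

about_action = resolve(rules, "a", "/about", "GET")
logout_action = resolve(rules, "a", "/logout", "GET")
form_action = resolve(rules, "form", "/about", "GET")
```

Because ordering decides the outcome, general rules are best listed first and specific exceptions last.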
HYPERLINKS_ONLY_RULE_SET = [
    Rule('a', '/.*', 'GET', Request()),
    Rule('area', '/.*', 'GET', Request()),
]
REQUEST_ONLY_EXTERNAL_RULE_SET = [
    Rule('a', '.*', 'GET', Request(only=True)),
    Rule('area', '.*', 'GET', Request(only=True)),
]
This is useful for finding broken links. You can also check <link> tags from the <head> if you include the following rule plus set the Crawler’s path_attrs to ("HREF", "SRC").
Rule('link', '.*', 'GET', Request())
SUBMIT_GET_FORMS_RULE_SET = [
    Rule('form', '.*', 'GET', Request())
]
SUBMIT_POST_FORMS_RULE_SET = [
    Rule('form', '.*', 'POST', Request())
]
Forms are submitted with their default values, unless overridden using Request(params={...}) for a specific form target or excluded globally using the ignore_form_fields parameter to Crawler (necessary for e.g. CSRF token fields).
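The two ideas above, overriding default values and ignoring fields when deciding whether a form has been seen before, can be sketched as follows. How the library actually computes form identity is not stated here, so treat submitted_values and form_identity as hypothetical helpers showing one plausible interpretation.

```python
def submitted_values(defaults, overrides=None):
    """Form defaults with any per-form overrides applied on top."""
    return {**defaults, **(overrides or {})}


def form_identity(action, method, values, ignore_form_fields=()):
    """A hashable key for a form that disregards the ignored field names."""
    kept = tuple(
        sorted((k, v) for k, v in values.items() if k not in ignore_form_fields)
    )
    return (method.upper(), action, kept)


# The same form rendered twice, each time with a fresh CSRF token.
first = {"q": "", "csrf_token": "abc123"}
second = {"q": "", "csrf_token": "xyz789"}

same_when_ignored = (
    form_identity("/search", "get", first, ["csrf_token"])
    == form_identity("/search", "get", second, ["csrf_token"])
)
same_when_not_ignored = (
    form_identity("/search", "get", first)
    == form_identity("/search", "get", second)
)
with_override = submitted_values(first, {"q": "crawler"})
```

Without ignoring the token field, every render of the form would look new and could be submitted again on each page where it appears.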
PERMISSIVE_RULE_SET = [
    Rule('.*', '.*', 'GET', Allow([*range(400, 600)])),
    Rule('.*', '.*', 'POST', Allow([*range(400, 600)]))
]
If any HTTP error (400-599) is encountered for any request, allow it; do not error.
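For reference, the expression passed to Allow above is ordinary Python: range excludes its end value, so [*range(400, 600)] expands to the codes 400 through 599. The narrower lists below are illustrative alternatives, not rule sets shipped with the library.

```python
ALL_HTTP_ERRORS = [*range(400, 600)]     # 400..599 inclusive
CLIENT_ERRORS_ONLY = [*range(400, 500)]  # 400..499; 5xx would still be errors
NOT_FOUND_ONLY = [404]

# With the library imported, a narrower rule could then look like:
#     Rule('.*', '.*', 'GET', Allow(CLIENT_ERRORS_ONLY))
```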
The crawler builds up a graph of your web application. It can be interrogated via crawler.graph when the crawl is finished.
See the graph module for the definition of Node objects.
Two hook points are provided. These operate on Node objects (see above).
Using should_process_handlers, you can register functions that take a Node and return a bool of whether the Crawler should “process” it (follow a link or submit a form) or not.
Using check_response_handlers, you can register functions that take a Node and a response object (specific to your test client) and return a bool of whether the response should constitute an error.
If your function returns True, the Crawler will throw an exception.
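A pair of handlers might look like the sketch below. It assumes that a Node exposes a path attribute and that the response exposes status_code (as Flask and Django test responses do); check the graph module and your test client before relying on either. SimpleNamespace objects stand in for real nodes and responses so the functions can be exercised on their own.

```python
from types import SimpleNamespace


def skip_logout(node):
    """'Should process' handler: return False to leave logout links alone."""
    # Assumption: the node has a `path` attribute holding the target path.
    return not node.path.startswith("/logout")


def flag_server_errors(node, response):
    """'Check response' handler: return True to treat the response as an error."""
    # Assumption: the response has a `status_code` attribute.
    return response.status_code >= 500


# Stand-ins for a real Node and test-client responses.
logout_node = SimpleNamespace(path="/logout")
about_node = SimpleNamespace(path="/about")
ok_response = SimpleNamespace(status_code=200)
failed_response = SimpleNamespace(status_code=503)

# Registration with the library would then be along the lines of:
#     Crawler(..., should_process_handlers=[skip_logout],
#             check_response_handlers=[flag_server_errors])
```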
There are currently Flask and Django examples in the tests.
See https://github.com/python-testing-crawler/flaskr for an example of integrating into an existing application, using Flaskr, the Flask tutorial application.