Scrape and publish Telemetry probe data from Firefox
Scrape Telemetry probe data from Firefox repositories.
This extracts per-version Telemetry probe data for Firefox and other Mozilla products from registry files like Histograms.json and Scalars.yaml.
The data allows answering questions like “which Firefox versions is this Telemetry probe in anyway?”.
Also, probes outside of Histograms.json - like the CSS use counters - are included in the output data.
The data is pulled from two different sources:
for Firefox data.
Probe Scraper outputs JSON to
Effectively, this creates a REST API which can be used by downstream tools like
and various data dictionary type applications (see below).
An OpenAPI reference to this API is available:
A web tool to explore the Firefox-related data is available at
A similar view for Glean-based data is under development in the Glean Dictionary.
Deprecation is an important step in an application lifecycle. Because of the backwards-compatible nature of our pipeline, we do not
remove Glean apps or variants from the repositories.yaml
file - instead, we mark them as deprecated.
When an app variant is marked as deprecated (see this example from Fenix), the following happens:
in the Glean Dictionary, in the Access
section (see e.g. Fenix’s client_id metric).
When an app is marked as deprecated (see this example of Firefox for Fire TV), the following happens:
Show deprecated applications
checkbox)
To scrape a git repository for probe definitions, an entry needs to be added in repositories.yaml.
The exact format of the entry depends on whether you are adding an application or a library. See below for details.
For a given application, Glean metrics are emitted by the application itself, by any libraries it uses
that also use Glean, and by the Glean library proper. Probe scraper therefore needs a way to
find all of an application's dependencies in order to determine all of the metrics emitted by
that application.
To that end, each application should specify a dependencies
parameter, which is a list of the Glean-using libraries used by the application. Each entry should be a library name as specified by the library’s library_names parameter.
For Android applications, if you’re not sure what the dependencies of the application are, you can run the following command at the root of the project folder:
$ ./gradlew :app:dependencies
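The tree printed by that command can be long. As a rough aid, the sketch below pulls the distinct `group:artifact` names out of it so they can be compared against library names by hand. It assumes the usual Gradle tree prefixes (`+---` and `\---`) and that library names take the `group:artifact` form; the function name and the sample coordinates are hypothetical.

```python
from __future__ import annotations

import re

# Matches tree lines such as "+--- group:artifact:1.2.3" (assumed Gradle output shape).
_DEP_LINE = re.compile(r"[+\\]--- ([\w.\-]+):([\w.\-]+)(?::|\s|$)")


def extract_library_names(gradle_output: str) -> set[str]:
    """Collect the distinct "group:artifact" names from Gradle dependency output."""
    names = set()
    for line in gradle_output.splitlines():
        match = _DEP_LINE.search(line)
        if match:
            names.add(f"{match.group(1)}:{match.group(2)}")
    return names


# Hypothetical sample output; real projects will list many more entries.
sample = "\n".join(
    [
        r"+--- org.example:telemetry-lib:1.0.0",
        r"|    \--- org.example:core:2.1.0 -> 2.2.0",
        r"\--- project :shared",
    ]
)
found = extract_library_names(sample)
```

Lines that refer to local projects (such as `project :shared`) are skipped because they carry no `group:artifact` coordinate.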
See the full application schema documentation
for descriptions of all the available parameters.
Probe scraper also needs a way to map dependencies back to an entry in the repositories.yaml
file. Therefore, any libraries defined should also include
their build-system-specific library names in the library_names parameter.
See the full library schema documentation
for descriptions of all the available parameters.
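The relationship between an application's `dependencies` and a library's `library_names` can be sketched as a simple lookup. This is only an illustration of the mapping described above, not probe scraper's actual implementation; the dictionary layout, the function names, and every repository and library name in the example are hypothetical.

```python
from __future__ import annotations


def index_libraries(libraries: dict[str, dict]) -> dict[str, str]:
    """Map each build-system library name to the repository entry declaring it."""
    index = {}
    for repo_name, entry in libraries.items():
        for library_name in entry.get("library_names", []):
            index[library_name] = repo_name
    return index


def resolve_dependencies(app: dict, libraries: dict[str, dict]) -> list[str]:
    """Return the sorted repository entries that an application depends on."""
    index = index_libraries(libraries)
    dependencies = app.get("dependencies", [])
    missing = [dep for dep in dependencies if dep not in index]
    if missing:
        # A dependency that no library declares cannot be mapped back to an entry.
        raise KeyError(f"dependencies not declared by any library: {missing}")
    return sorted({index[dep] for dep in dependencies})


# Hypothetical entries, shaped loosely after the parameters named above.
libraries = {
    "example-core": {"library_names": ["org.example:core"]},
    "example-android": {"library_names": ["org.example:android"]},
}
app = {"dependencies": ["org.example:core", "org.example:android"]}
resolved = resolve_dependencies(app, libraries)
```

A dependency that is missing from every library's `library_names` raises an error here, which mirrors why both parameters need to be kept in sync.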
You can choose to develop using the container, or locally. Using the container will be slower, since changes will trigger a rebuild of the container.
But using the container method will ensure that your PR passes CircleCI build/test phases.
Rather than installing these requirements in your global Python environment,
you may wish to start by creating and activating a
Python virtual environment.
The .gitignore
expects it to be called ENV
or venv:
python -m venv venv
. venv/bin/activate
Install the requirements:
pip install -r requirements.txt
pip install -r test_requirements.txt
python setup.py develop
Run the tests. By default, this skips tests that require a web connection:
pytest tests/
To run all tests, including those that require a web connection:
pytest tests/ --run-web-tests
To test whether the code conforms to the style rules, you can run:
python -m black --check probe_scraper tests ./*.py
flake8 --max-line-length 100 probe_scraper tests ./*.py
yamllint repositories.yaml .circleci
python -m isort --profile black --check-only probe_scraper tests ./*.py
To render API documentation locally to index.html:
make apidoc
Run the tests in the container. This skips tests that require a web connection:
export COMMAND='pytest tests/'
make run
To run all tests, including those that require a web connection:
make test
To test whether the code conforms to the style rules, you can run:
make lint
Any tests that require a web connection to run should be marked with @pytest.mark.web_dependency
These will not run by default, but will run on CI.
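For readers unfamiliar with how such a marker is usually wired up, here is a minimal sketch of a `conftest.py` using the standard pytest hooks. The option name `--run-web-tests` and the marker name `web_dependency` come from this document; the rest is a common pytest pattern and is not necessarily how this repository implements it.

```python
import pytest


def pytest_addoption(parser):
    # Register the opt-in flag for tests that need network access.
    parser.addoption(
        "--run-web-tests",
        action="store_true",
        default=False,
        help="also run tests that require a web connection",
    )


def pytest_configure(config):
    # Declare the marker so pytest does not warn about an unknown mark.
    config.addinivalue_line(
        "markers", "web_dependency: test requires a web connection"
    )


def pytest_collection_modifyitems(config, items):
    if config.getoption("--run-web-tests"):
        return  # Flag given: leave every collected test runnable.
    skip_web = pytest.mark.skip(reason="needs --run-web-tests to run")
    for item in items:
        if "web_dependency" in item.keywords:
            item.add_marker(skip_web)
```

With this in place, a test decorated with `@pytest.mark.web_dependency` is reported as skipped unless `--run-web-tests` is passed.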
Before opening a PR, it’s good to test the code you wrote on the production data. You can run against a specific Firefox
version by passing --firefox-version, either in the container:
export COMMAND='python -m probe_scraper.runner --firefox-version 65 --dry-run'
make run
or locally via:
python -m probe_scraper.runner --firefox-version 65 --dry-run
Including --dry-run
means emails will not be sent.
Additionally, you can test just on Glean repositories:
export COMMAND='python -m probe_scraper.runner --glean --dry-run'
make run
By default, that tests against every Glean repository, which might take a while. To test against only a few (e.g. a new repository you’re adding), pass the --glean-repo
argument once for each repository you care about:
export COMMAND='python -m probe_scraper.runner --glean --glean-repo glean-core --glean-repo glean-android --glean-repo burnham --dry-run'
make run
Replace burnham in the example above with your repository and its dependencies.
You can also do the dry-run locally:
python -m probe_scraper.runner --glean --glean-repo glean-core --glean-repo glean-android --glean-repo burnham --dry-run
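Because `--glean-repo` is repeated once per repository, assembling the command by hand gets error-prone as the dependency list grows. The helper below is a small convenience sketch that builds the same command line shown above from a list of repository names; the function name is hypothetical, and only the flags already used in this document appear in it.

```python
from __future__ import annotations

import shlex


def build_glean_command(repos: list[str], dry_run: bool = True) -> list[str]:
    """Build the runner invocation for the given Glean repositories."""
    command = ["python", "-m", "probe_scraper.runner", "--glean"]
    for repo in repos:
        # The flag is repeated once for each repository.
        command += ["--glean-repo", repo]
    if dry_run:
        command.append("--dry-run")
    return command


args = build_glean_command(["glean-core", "glean-android", "burnham"])
# A shell-safe string, suitable for the COMMAND variable used with "make run".
command_string = shlex.join(args)
```

The list form can be passed to `subprocess.run(args, check=True)` for a local run, while `command_string` matches what the container workflow expects in `COMMAND`.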
The module is built around the following data flow:
The code layout consists mainly of:
- the central script, ties the other pieces together
- scrapers
  - pull build info from the BuildHub
  - loads probe registry files for multiple versions from
  - loads probe registry files from a git repository (no version or channel support yet, just per-commit)
- parsers/ - extract probe data from the registry files
- third_party - these are imported parser scripts from mozilla-central
- transform_*.py - transform the extracted raw data into output formats
- tests/ - the unit tests

The processed probe data is serialized to the disk in a directory hierarchy starting from the provided output directory. The directory layout resembles a REST-friendly structure.
|-- product
    |-- general
    |-- revisions
    |-- channel (or "all")
        |-- ping type
            |-- probe type (or "all_probes")
For example, all the JSON probe data in the main ping for the Firefox Nightly channel can be accessed with the following path: firefox/nightly/main/all_probes.
The probe data for all the channels (same product and ping) can be accessed instead using firefox/all/main/all_probes.
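The layout above translates directly into a path of four components. The sketch below builds such a path and joins it to a base URL; the function names are hypothetical, and the base URL in the example is a placeholder rather than the real output location.

```python
def probe_path(
    product: str,
    channel: str = "all",
    ping: str = "main",
    probe_type: str = "all_probes",
) -> str:
    """Build a product/channel/ping/probe-type path following the layout above."""
    parts = [product, channel, ping, probe_type]
    for part in parts:
        if not part or "/" in part:
            raise ValueError(f"invalid path component: {part!r}")
    return "/".join(parts)


def probe_url(base_url: str, path: str) -> str:
    """Join a base URL and a probe path with exactly one slash between them."""
    return base_url.rstrip("/") + "/" + path.lstrip("/")


nightly = probe_path("firefox", channel="nightly")
all_channels = probe_path("firefox")
# "https://example.invalid" is a placeholder, not the real output root.
url = probe_url("https://example.invalid/", nightly)
```

The defaults pick the "all" channel and "all_probes" type, so `probe_path("firefox")` yields the second example path from the text.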
The root directory for the output generated from the scheduled job can be found at
All the probe data for Firefox coming from the main ping can be found at
metrics data
Glean data is generally laid out as follows:
| -- glean
    | -- repositories
    | -- general
    | -- repository-name
        | -- general
        | -- metrics
For example, the data for a repository called fenix
would be found at /glean/fenix/metrics.
The time the data was last updated for that project can be found at /glean/fenix/general.
A list of available repositories is at /glean/repositories.
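When working with a local run's output directory, the same layout can be read from disk. The following is a minimal sketch that assumes the files are plain JSON stored under the names shown in the tree (`metrics`, `general`) with no file extension; the function names and the sample metric are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def glean_file(output_dir, repository, kind="metrics"):
    """Return the path of a per-repository Glean file inside an output directory."""
    if kind not in ("metrics", "general"):
        raise ValueError(f"unknown kind: {kind!r}")
    return Path(output_dir) / "glean" / repository / kind


def load_glean_json(output_dir, repository, kind="metrics"):
    """Load one per-repository Glean file, assuming it contains JSON."""
    with glean_file(output_dir, repository, kind).open(encoding="utf-8") as handle:
        return json.load(handle)


# Demonstration against a temporary directory with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    target = glean_file(tmp, "fenix")
    target.parent.mkdir(parents=True)
    target.write_text(json.dumps({"hypothetical.metric": {}}), encoding="utf-8")
    loaded = load_glean_json(tmp, "fenix")
```

Pointing `output_dir` at the directory passed to the runner would, under the assumptions above, give access to the data produced by a dry run.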