The monitoring module for ALICE O2
The Monitoring module injects user-defined metrics and monitors the process. It supports multiple backends, protocols and data formats.
If aliBuild is not installed yet, install it first.
Compile Monitoring and its dependencies via aliBuild
aliBuild build Monitoring --defaults o2-dataflow
Load the environment for Monitoring (in the alice directory)
alienv load Monitoring/latest
Get an instance from MonitoringFactory by passing the backend's URI(s) as a parameter (comma separated if more than one).
The factory is accessible from the o2::monitoring namespace.
#include <MonitoringFactory.h>
using namespace o2::monitoring;
std::unique_ptr<Monitoring> monitoring = MonitoringFactory::Get("backend[-protocol]://host:port[/verbosity][?query]");
See the table below to find URIs for supported backends:
Format | Transport | URI backend[-protocol] | URI query | Default verbosity |
---|---|---|---|---|
- | - | no-op | - | - |
InfluxDB | UDP | influxdb-udp | - | info |
InfluxDB | Unix socket | influxdb-unix | - | info |
InfluxDB | StdOut | influxdb-stdout | - | info |
InfluxDB | Kafka | influxdb-kafka | Kafka topic | info |
InfluxDB | WebSocket | influxdb-ws | token=TOKEN | info |
InfluxDB 2.x | HTTP | influxdbv2 | org=ORG&bucket=BUCKET&token=TOKEN | info |
ApMon | UDP | apmon | - | info |
StdOut | - | stdout, infologger | [Prefix] | debug |
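As a rough illustration of how the URI pattern maps onto the columns of the table, the sketch below splits a URI of the form backend[-protocol]://host:port[/verbosity][?query] into its parts. BackendUri and parseBackendUri are hypothetical names introduced here, not part of the Monitoring API, and the library's own parsing may differ. The sketch only covers the host:port form; it does not special-case the no-op backend or Unix socket paths.

```cpp
#include <cassert>
#include <string>

// Hypothetical holder for the parts of a backend URI; not a type from the library.
struct BackendUri {
  std::string backend;   // e.g. "influxdb"
  std::string protocol;  // e.g. "udp"; empty when there is no "-protocol" part
  std::string host;
  std::string port;      // kept as text; empty if absent
  std::string verbosity; // e.g. "debug"; empty means "use the backend default"
  std::string query;     // raw text after '?'; empty if absent
};

// Splits "backend[-protocol]://host:port[/verbosity][?query]" into its parts.
// Returns false when the "://" separator is missing.
inline bool parseBackendUri(const std::string& uri, BackendUri& out)
{
  const auto schemeEnd = uri.find("://");
  if (schemeEnd == std::string::npos) {
    return false;
  }
  const std::string scheme = uri.substr(0, schemeEnd);
  std::string rest = uri.substr(schemeEnd + 3);

  // Scheme: backend[-protocol]
  const auto dash = scheme.find('-');
  out.backend = scheme.substr(0, dash);
  out.protocol = (dash == std::string::npos) ? std::string{} : scheme.substr(dash + 1);

  // Query: everything after the first '?'
  const auto question = rest.find('?');
  out.query = (question == std::string::npos) ? std::string{} : rest.substr(question + 1);
  rest = rest.substr(0, question);

  // Verbosity: everything after the first '/'
  const auto slash = rest.find('/');
  out.verbosity = (slash == std::string::npos) ? std::string{} : rest.substr(slash + 1);
  rest = rest.substr(0, slash);

  // Authority: host[:port]
  const auto colon = rest.find(':');
  out.host = rest.substr(0, colon);
  out.port = (colon == std::string::npos) ? std::string{} : rest.substr(colon + 1);
  return true;
}
```

For a placeholder URI such as influxdb-udp://localhost:8086/debug, the sketch yields backend influxdb, protocol udp, host localhost, port 8086 and verbosity debug.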
A metric consists of 5 parameters:
Parameter name | Type | Required | Default |
---|---|---|---|
name | string | yes | - |
values | map | no/1 | - |
timestamp | time_point | no | current time |
verbosity | Enum (Debug/Info/Prod) | no | Verbosity::Info |
tags | map | no | host and process names |
A metric can be constructed from a value and a metric name (the value name defaults to value):
Metric{10, "name"}
By default, a metric can be created with zero or one value (in the latter case the value name is set to value). Additional values may be added using the .addValue method, so the following two metrics are identical:
Metric{10, "name"}
Metric{"name"}.addValue(10, "value")
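The equivalence of the two forms can be modelled with a small stand-in class. SketchMetric is a hypothetical type written for this illustration only; it supports integer values and mirrors just the behaviour described above (default value name value, chained addValue calls), not the library's real Metric class.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for the library's Metric, limited to integer values.
class SketchMetric
{
 public:
  // Metric with no values yet.
  explicit SketchMetric(std::string name) : mName(std::move(name)) {}

  // Metric with one value stored under the default value name "value".
  SketchMetric(int value, std::string name) : mName(std::move(name))
  {
    mValues["value"] = value;
  }

  // Adds a named value and returns the metric so that calls can be chained.
  SketchMetric&& addValue(int value, const std::string& valueName)
  {
    mValues[valueName] = value;
    return std::move(*this);
  }

  const std::string& name() const { return mName; }
  const std::map<std::string, int>& values() const { return mValues; }

  // Two metrics are equal when their names and named values match.
  bool operator==(const SketchMetric& other) const
  {
    return mName == other.mName && mValues == other.mValues;
  }

 private:
  std::string mName;
  std::map<std::string, int> mValues;
};
```

With this stand-in, SketchMetric{10, "name"} compares equal to SketchMetric{"name"}.addValue(10, "value").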
Tags can be attached to a metric using the addTag(tags::Key, tags::Value) or addTag(tags::Key, unsigned short) methods. The latter allows assigning a numeric value to a tag.
Metric{10, "name"}.addTag(tags::Key::Subsystem, tags::Value::QC)
See the example: examples/2-TaggedMetrics.cxx.
The hostname tag is added by default by the library. You can add your own global tag by calling addGlobalTag(std::string_view key, std::string_view value) or addGlobalTag(tags::Key, tags::Value) on the Monitoring object.
The run number can be set using setRunNumber(uint32_t). The value 0 is special and means that no run number is set.
Pass a metric object to the send method as an r-value reference:
send({10, "name"})
send(Metric{20, "name"}.addTag(tags::Key::CRU, 123))
send(Metric{"throughput"}.addValue(100, "tx").addValue(200, "rx"))
See how it works in the example: examples/1-Basic.cxx.
There are 3 verbosity levels (the same as for backends): Debug, Info, Prod. The default is Verbosity::Info. The default value can be overridden using Metric::setDefaultVerbosity(verbosity).
To override verbosity on a per-metric basis, use the third, optional parameter of the metric constructor:
Metric{10, "name", Verbosity::Prod}
Metrics need to match the backend's verbosity in order to be sent, e.g. a backend with /info verbosity will accept Info and Prod metrics only.
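The matching rule can be read as a simple ordering check. The sketch below assumes the levels are ordered Debug < Info < Prod and that a backend accepts every metric at or above its own level; this is consistent with the /info example, but the enum values and the backendAccepts helper are illustrative and not taken from the library.

```cpp
#include <cassert>

// Stand-in for the library's verbosity levels, ordered from most to least verbose.
// The numeric values are an assumption made for this sketch.
enum class Verbosity { Debug = 0,
                       Info = 1,
                       Prod = 2 };

// A backend accepts a metric when the metric's level is at least the backend's level.
inline bool backendAccepts(Verbosity backend, Verbosity metric)
{
  return static_cast<int>(metric) >= static_cast<int>(backend);
}
```

Under these assumptions a /debug backend would accept all metrics, while a /prod backend would accept Prod metrics only.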
To avoid sending each metric separately, metrics can be temporarily stored in a buffer and flushed at the most convenient moment.
This feature is controlled with the following two methods:
monitoring->enableBuffering(const std::size_t maxSize)
...
monitoring->flushBuffer();
enableBuffering takes the maximum buffer size as its parameter. When the buffer gets full, all values are flushed automatically.
See how it works in the example: examples/10-Buffering.cxx.
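The flush-when-full behaviour can be pictured with a small bounded buffer. MetricBuffer below is a hypothetical class written for this illustration; it stores metrics as plain strings and hands them to a callback in one batch, which is only an approximation of what the library does internally.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a buffered sender; not the library's implementation.
class MetricBuffer
{
 public:
  using Sink = std::function<void(const std::vector<std::string>&)>;

  MetricBuffer(std::size_t maxSize, Sink sink) : mMaxSize(maxSize), mSink(std::move(sink)) {}

  // Stores one metric; flushes automatically once maxSize metrics are held.
  void push(std::string metric)
  {
    mItems.push_back(std::move(metric));
    if (mItems.size() >= mMaxSize) {
      flush();
    }
  }

  // Hands all stored metrics to the sink in one batch and empties the buffer.
  void flush()
  {
    if (mItems.empty()) {
      return;
    }
    mSink(mItems);
    mItems.clear();
  }

  std::size_t size() const { return mItems.size(); }

 private:
  std::size_t mMaxSize;
  Sink mSink;
  std::vector<std::string> mItems;
};
```

A buffer created with a maximum size of 3 passes the first three pushed metrics to the sink as one batch; anything pushed afterwards stays buffered until the next automatic or explicit flush.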
This feature can calculate derived values. To do so, use the optional DerivedMetricMode mode parameter of the send method:
send(Metric&& metric, [DerivedMetricMode mode])
Three modes are available:
DerivedMetricMode::RATE - rate between two consecutive values,
DerivedMetricMode::INCREMENT - sum of all passed values,
DerivedMetricMode::SUPPRESS - suppresses subsequent metrics with the same value until a timeout is reached (configurable using DerivedMetrics::mSuppressTimeout).
The derived value is generated only from the first value of the metric and it is added to the same metric with the value name suffixed with _rate or _increment accordingly.
See how it works in the example: examples/4-RateDerivedMetric.cxx.
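To make the RATE and INCREMENT modes concrete, here is a minimal sketch of both calculations. RateCalculator and IncrementCalculator are hypothetical helpers, and dividing the difference by the elapsed time in seconds is an assumption about how the rate is normalised; the library's DerivedMetrics class may compute it differently.

```cpp
#include <cassert>

// Hypothetical helper: derives a rate from two consecutive samples of one metric.
class RateCalculator
{
 public:
  // Feeds one sample (value, time in seconds). Returns true and sets `rate` to
  // (value - previous value) / elapsed seconds when a previous sample exists
  // and time has advanced; returns false otherwise.
  bool update(double value, double timeSeconds, double& rate)
  {
    bool hasRate = false;
    if (mHasPrevious && timeSeconds > mPreviousTime) {
      rate = (value - mPreviousValue) / (timeSeconds - mPreviousTime);
      hasRate = true;
    }
    mPreviousValue = value;
    mPreviousTime = timeSeconds;
    mHasPrevious = true;
    return hasRate;
  }

 private:
  bool mHasPrevious = false;
  double mPreviousValue = 0.0;
  double mPreviousTime = 0.0;
};

// Hypothetical helper: keeps the running sum of all values passed so far.
class IncrementCalculator
{
 public:
  double update(double value)
  {
    mSum += value;
    return mSum;
  }

 private:
  double mSum = 0.0;
};
```

In this sketch the first sample never produces a rate, because there is no earlier value to compare against.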
This feature provides the basic performance status of the process. Note that it runs in a separate thread.
enableProcessMonitoring([interval in seconds, {PmMeasurement list}]);
List of valid measurements:
PmMeasurement::Cpu
PmMeasurement::Mem
PmMeasurement::Smaps - Beware: enabling this will trigger the kernel to run smaps_account periodically.
The following metrics are generated every time interval:
PmMeasurement::Cpu:
PmMeasurement::Mem: (Linux only)
PmMeasurement::Smaps: (Linux only)
Additional metrics are generated at the end of process execution:
CPU measurements:
Memory measurements: (Linux only)
[METRIC] <name>,<type> <values> <timestamp> <tags>
The prefix ([METRIC]) can be changed using the URI query component.
Override metric verbosity using a regular expression:
Metric::setVerbosityPolicy(Verbosity verbosity, const std::regex& regex)
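A regex-based policy of this kind can be sketched as follows. The VerbosityPolicy struct and effectiveVerbosity function are hypothetical, and matching the whole metric name with std::regex_match (rather than searching for a partial match) is an assumption; the library's actual matching semantics are not described here.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Stand-in for the library's verbosity levels.
enum class Verbosity { Debug,
                       Info,
                       Prod };

// Hypothetical policy: metrics whose names match `pattern` get `verbosity`.
struct VerbosityPolicy {
  std::regex pattern;
  Verbosity verbosity;
};

// Returns the policy's verbosity when the whole metric name matches the
// pattern, otherwise the verbosity the metric was declared with.
inline Verbosity effectiveVerbosity(const std::string& metricName, Verbosity declared,
                                    const VerbosityPolicy& policy)
{
  return std::regex_match(metricName, policy.pattern) ? policy.verbosity : declared;
}
```

For example, a policy built from the pattern cpu.* and Verbosity::Prod would promote a metric named cpu_load (a made-up name) to Prod and leave other metrics unchanged.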
This guide explains manual installation. For ansible deployment see the AliceO2Group/system-configuration gitlab repo.
devtoolset-9
librdkafka
git clone -b v2.3.0 https://github.com/edenhill/librdkafka && cd librdkafka
cmake -H. -B./_cmake_build -DENABLE_LZ4_EXT=OFF -DCMAKE_INSTALL_LIBDIR=lib -DRDKAFKA_BUILD_TESTS=OFF -DRDKAFKA_BUILD_EXAMPLES=OFF -DCMAKE_INSTALL_PREFIX=~/librdkafka_install
cmake --build ./_cmake_build --target install -j
Build Monitoring with RdKafka_DIR defined and pointing to the librdkafka CMake config directory:
git clone https://github.com/AliceO2Group/Monitoring && cd Monitoring
cmake -H. -B./_cmake_build -DRdKafka_DIR=~/librdkafka_install/lib/cmake/RdKafka/ -DCMAKE_INSTALL_PREFIX=~/Monitoring_install
cmake --build ./_cmake_build --target install -j
In monitoring.sh, add - librdkafka to "requires".
aliBuild build Monitoring --defaults o2-dataflow --always-prefer-system
Monitoring as a dependency of your project
As librdkafka is an optional dependency of Monitoring, it is not handled by CMakeConfig; therefore you need:
find_package(RdKafka CONFIG REQUIRED)
find_package(Monitoring CONFIG REQUIRED)
Then link against the AliceO2::Monitoring target.
#include "Monitoring/MonitoringFactory.h"
#include <chrono>
#include <string>
#include <thread>
#include <vector>
...
std::vector<std::string> topics = {"<topic-to-subscribe>"};
auto client = MonitoringFactory::GetPullClient("<kafka-server:9092>", topics, "<client-id>");
for (;;) {
  auto metrics = client->pull();
  if (!metrics.empty()) {
    /// metric.first => topic name; metric.second => metric itself
  } else {
    // wait a bit if no data available
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
  }
}
Run-time parameters:
<topic-to-subscribe> - list of topics to subscribe to
<kafka-server:9092> - Kafka broker (staging or production)
<client-id> - unique, self-explanatory string describing the client, e.g. dcs-link-status or its-link-status.
Metrics are returned in batches of at most 100 for each pull() call.
The native data format is Influx Line Protocol, but metrics can be converted into any format listed here: https://docs.influxdata.com/telegraf/latest/data_formats/output/
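Since pulled metrics arrive as Influx Line Protocol text, a consumer typically needs to split each line into measurement, tags, fields and timestamp. The sketch below is a deliberately simplified parser: LineProtocolPoint, splitPairs and parseLine are hypothetical names, field values are kept as raw text, and escaped characters or quoted strings containing spaces or commas are not handled.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Hypothetical container for one parsed line; values are kept as raw text.
struct LineProtocolPoint {
  std::string measurement;
  std::map<std::string, std::string> tags;
  std::map<std::string, std::string> fields;
  std::string timestamp; // empty when the line carries no timestamp
};

// Splits "k1=v1,k2=v2" into a key/value map; items without '=' are skipped.
inline std::map<std::string, std::string> splitPairs(const std::string& text)
{
  std::map<std::string, std::string> pairs;
  std::istringstream stream(text);
  std::string item;
  while (std::getline(stream, item, ',')) {
    const auto eq = item.find('=');
    if (eq != std::string::npos) {
      pairs[item.substr(0, eq)] = item.substr(eq + 1);
    }
  }
  return pairs;
}

// Parses "measurement[,tag=value...] field=value[,field=value...] [timestamp]".
// Returns false when the line has no field set.
inline bool parseLine(const std::string& line, LineProtocolPoint& out)
{
  out = LineProtocolPoint{};
  std::istringstream stream(line);
  std::string head;
  std::string fields;
  if (!(stream >> head >> fields)) {
    return false;
  }
  stream >> out.timestamp; // optional third token

  const auto comma = head.find(',');
  out.measurement = head.substr(0, comma);
  if (comma != std::string::npos) {
    out.tags = splitPairs(head.substr(comma + 1));
  }
  out.fields = splitPairs(fields);
  return !out.fields.empty();
}
```

Given a made-up line such as throughput,hostname=node1 tx=100i,rx=200i 1700000000000000000, the sketch reports the measurement throughput, one tag, two fields and the trailing timestamp.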