Define metrics aggregation rules
The aggregations service provides a way for you to aggregate metrics into lower cardinality versions of themselves. Users can define and apply their own aggregation rules, or apply the rules recommended by the recommendations service.
Aggregation rule format
The aggregations service expects the following format:
Field name | Data type | Description |
---|---|---|
metric | string | The metric name or metric name matcher to which the aggregation rule applies. |
match_type | string (optional) | The type of matching to apply to the value of the metric field. For valid values, see Substring matchers. If you do not specify match_type, the value is exact. |
drop | bool (optional) | If set to true, the entire metric is dropped instead of aggregated. If you set this to true, you cannot use the drop_labels and aggregations fields. If you do not specify drop, the value is false. |
drop_labels | string array | The list of labels to aggregate away; each of these labels that is present in the original series has its value set to <aggregated>. You can specify either drop_labels or keep_labels, but you can’t use both fields within the same rule. |
keep_labels | string array | The list of labels to retain. The value of every label not present in this list is replaced by <aggregated>. You can specify either keep_labels or drop_labels, but you can’t use both fields within the same rule. |
aggregations | string array | The list of aggregation functions to apply to the metric or metrics that are matched by this rule. For valid values, see Supported aggregation types. |
aggregation_interval | string duration (optional) | The interval of samples that are included in a single emitted aggregated sample. See Configure the aggregation interval and the DPM of the aggregated metric for valid values. If you set aggregation_interval, you must also specify the aggregation_delay field. |
aggregation_delay | string duration (optional) | The delay after which the aggregated samples are emitted. See Configure the aggregation delay for valid values. If you set aggregation_delay, you must also specify the aggregation_interval field. |
The following example shows an aggregation rule for the metric proxy_sql_queries_total:
{
"metric": "proxy_sql_queries_total",
"drop_labels": ["container", "instance", "namespace", "pod"],
"aggregations": ["sum:counter"]
}
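The field constraints in the table above can be checked before a rule is submitted. The following is a minimal sketch in Python, not part of the aggregations service: the function name validate_rule and the wording of the messages are hypothetical, and only the constraints stated in the table are checked.

```python
def validate_rule(rule: dict) -> list[str]:
    """Return a list of problems found in one aggregation rule (empty if none)."""
    problems = []
    # "metric" is the only field the table does not mark as optional or conditional.
    if not isinstance(rule.get("metric"), str) or not rule["metric"]:
        problems.append("metric must be a non-empty string")
    # match_type defaults to "exact" when it is omitted.
    if rule.get("match_type", "exact") not in ("exact", "prefix", "suffix"):
        problems.append("match_type must be exact, prefix, or suffix")
    # A rule that drops the whole metric cannot also aggregate it.
    if rule.get("drop", False):
        if "drop_labels" in rule or "aggregations" in rule:
            problems.append("drop cannot be combined with drop_labels or aggregations")
    # drop_labels and keep_labels are mutually exclusive within one rule.
    if "drop_labels" in rule and "keep_labels" in rule:
        problems.append("use either drop_labels or keep_labels, not both")
    # The interval and the delay must be specified together.
    if ("aggregation_interval" in rule) != ("aggregation_delay" in rule):
        problems.append("aggregation_interval and aggregation_delay must be set together")
    return problems
```

For the proxy_sql_queries_total rule shown above, this sketch returns an empty list.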
Supported aggregation types
The following values are supported for the aggregations field of an aggregation rule:
Aggregation function | Definition |
---|---|
sum:counter | The running sum of all increases of raw series values. Applicable to counter type metrics, and correctly accounts for counter resets. A counter type metric is conceptually similar to elevation gain. For example, if a cyclist counts their elevation gain by peak, they can sum several peaks’ worth of elevation gain to understand how much they’ve climbed in total. The elevation gain for each peak over time is a raw series. If you specify the sum:counter aggregation with "drop_labels": ["peak"] for this metric, the per-peak raw series would be aggregated into one series that would tell the cyclist the total amount they climbed over time. From this aggregated data, they can no longer tell how much they have climbed in total for a given peak. |
sum | The sum of all values across the aggregated series at a given time stamp. The sum aggregation is not useful for counter type metrics; for counter type metrics, use sum:counter instead. |
min | The minimum of all values across all the aggregated series at a given time stamp. |
max | The maximum of all values across all the aggregated series at a given time stamp. |
count | The number of raw series that feed into the aggregated series at a given time stamp. |
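To make the difference between sum and sum:counter concrete, here is a minimal Python sketch of the definitions in the table. It is an illustration, not the service's implementation: the function names are hypothetical, and treating any decrease in a raw series as a counter reset (so the post-reset value counts as the increase) is an assumption borrowed from the usual Prometheus convention.

```python
def aggregate_at_timestamp(values: list[float]) -> dict[str, float]:
    """Apply sum, min, max, and count to the raw series values at one time stamp."""
    return {
        "sum": sum(values),
        "min": min(values),
        "max": max(values),
        "count": len(values),
    }


def counter_increases(samples: list[float]) -> float:
    """Total increase of one raw counter series, treating any decrease as a reset."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        # Assumption: after a reset the counter restarted from zero,
        # so the new value itself is the increase.
        total += cur - prev if cur >= prev else cur
    return total


def sum_counter(raw_series: list[list[float]]) -> float:
    """Sum of all increases across the raw series, as sum:counter is described."""
    return sum(counter_increases(series) for series in raw_series)
```

With two hypothetical per-peak series, [0, 100, 250] and [10, 40, 5, 20], sum_counter returns 300.0: the first series climbs by 250, and the second climbs by 30, resets, and then climbs by 5 and 15. A plain sum of the latest values (250 + 20) would lose the increase that happened before the reset.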
Substring matchers
By default, a rule is applied to the metric name specified in the rule’s metric field.
In addition, Adaptive Metrics allows you to write rules that apply to all metrics whose names match a given prefix or suffix.
To apply rules to all such metrics, use the optional field match_type in your rule and set it to prefix or suffix.
The match_type field supports the following values:
exact: Apply the rule to the metric whose name is specified in the rule’s metric field. Because metric names are unique, the rule will only apply to one metric.
prefix: Apply the rule to all metrics whose names start with the string in the rule’s metric field.
suffix: Apply the rule to all metrics whose names end with the string in the rule’s metric field.
An example rule that matches all metrics beginning with http_requests_total_, and that aggregates away their instance label using the sum:counter function, looks as follows:
{
"metric": "http_requests_total_",
"match_type": "prefix",
"drop_labels": ["instance"],
"aggregations": ["sum:counter"]
}
An exact match takes precedence over a substring match. In the following scenario, two rules potentially apply to the metric http_requests_total_abc. Because the exact match has precedence over the prefix match, the second rule applies, and both the instance and pod labels are aggregated away for http_requests_total_abc:
[
{
"metric": "http_requests_total_",
"match_type": "prefix",
"drop_labels": ["instance"],
"aggregations": ["sum:counter"]
},
{
"metric": "http_requests_total_abc",
"drop_labels": ["instance", "pod"],
"aggregations": ["sum:counter"]
}
]
If multiple substring matchers match a metric, the first match always wins. Consider a rule file with the following two rules:
[
{
"metric": "http_requests_total_",
"match_type": "prefix",
"drop_labels": ["instance"],
"aggregations": ["sum:counter"]
},
{
"metric": "_abc",
"match_type": "suffix",
"drop_labels": ["pod"],
"aggregations": ["sum:counter"]
}
]
In this scenario, the metric http_requests_total_abc is matched by both rules.
Because neither rule is an exact match, the first rule in the list takes precedence.
This means that the instance label, not the pod label, is aggregated away for http_requests_total_abc.
Configure an aggregation
As an illustration, think of a power grid that monitors the energy consumption of houses on different city streets. An example metric that expresses building consumption could be electrical_throughput_total with labels street_name and building_number. Given that you only care about the total energy consumption per street and the average consumption per building on a street, you could configure two aggregations where one sums the consumption of all buildings on a street and the other counts the buildings on the street.
Because the metric electrical_throughput_total is a counter, you need to use the sum:counter aggregation (instead of the sum aggregation) to handle counter resets correctly:
{
"metric": "electrical_throughput_total",
"drop_labels": ["building_number"],
"aggregations": ["sum:counter", "count"]
}
Based on the preceding configuration, the aggregation service would discard the label building_number from the aggregated metric electrical_throughput_total.
In its place, it would compute and store aggregated values per street for this metric.
The sum:counter aggregation function computes the total electrical throughput of every street in the street_name label set. The count aggregation function computes the count of buildings per street. These two values can be used to compute an average consumption per building for each street.
However, because the building_number label has been discarded, it is no longer possible to understand how much power a specific building consumes.
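As a rough illustration of how the per-street sum and count combine into an average, the following Python sketch groups raw series by the labels that remain after building_number is dropped. The function name, the input shape, and the sample numbers are hypothetical; the values stand for each building's increase in consumption over some window, and counter resets are not modeled here.

```python
from collections import defaultdict


def aggregate_by_kept_labels(series, drop_labels):
    """Group raw series by the labels that remain after removing drop_labels.

    series is a list of (labels dict, value) pairs. The result maps each
    remaining label set to its sum and its count of contributing raw series.
    """
    groups = defaultdict(lambda: {"sum": 0.0, "count": 0})
    for labels, value in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k not in drop_labels))
        groups[key]["sum"] += value
        groups[key]["count"] += 1
    return dict(groups)


# Hypothetical per-building increases in consumption.
raw = [
    ({"street_name": "Main St", "building_number": "1"}, 12.0),
    ({"street_name": "Main St", "building_number": "2"}, 8.0),
    ({"street_name": "Oak Ave", "building_number": "1"}, 5.0),
]
per_street = aggregate_by_kept_labels(raw, ["building_number"])
main_st = per_street[(("street_name", "Main St"),)]
average_per_building = main_st["sum"] / main_st["count"]
```

Here the Main St group has a sum of 20.0 and a count of 2, so the average per building is 10.0, while the individual values 12.0 and 8.0 are no longer recoverable from the aggregated result.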
Examples of sum(), sum by(), count(), count by(), and avg() functions are as follows:
Sum the rate of electrical throughput per street:
sum by (street_name) (rate(electrical_throughput_total[5m]))
Sum the rate of electrical throughput for buildings on <EXAMPLE-STREET>:
sum(rate(electrical_throughput_total{street_name="<EXAMPLE-STREET>"}[5m]))
Count the number of buildings per street that are producing electrical throughput:
count by (street_name) (electrical_throughput_total)
Count the total number of buildings that are producing electrical throughput:
count(electrical_throughput_total)
Get the average rate of electrical throughput:
avg(rate(electrical_throughput_total[5m]))
Limits on the aggregation service
The Adaptive Metrics feature has limits, which are necessary to guarantee a highly reliable service. These limits are designed to adjust automatically to your usage of the service. This means that as long as your usage of Adaptive Metrics increases gradually, you should not expect to hit limits under normal circumstances. However, if your usage increases substantially over a short period of time, you might experience rate limiting. In this case, limits adapt to the changed usage pattern automatically after some time (usually within 24 hours). If you are experiencing sustained rate limiting beyond this time frame, contact Grafana Labs Support.
Number of aggregated series
The Adaptive Metrics aggregation service enforces limits on the number of series that can be aggregated. If these limits are exceeded, the aggregation service begins to discard incoming samples.
When this happens, you will see an increase in aggregator-too-many-aggregated-series or aggregator-too-many-raw-series errors in the Discarded Metrics Samples panel of your billing dashboard.
Rate of samples to aggregate
There is also a limit on the rate at which samples can get forwarded to the Adaptive Metrics aggregation service.
If this limit is exceeded, the API returns a 429 status code and you will see an increase in aggregations-max-ingestion-rate-exceeded errors in the Discarded Metrics Samples panel of your billing dashboard.
Drop a metric
You can also configure an aggregation rule that causes the entire metric to be dropped. If you don’t want to persist any time series at all for electrical_throughput_total, from the example in Configure an aggregation, you would configure a rule as follows:
{
"metric": "electrical_throughput_total",
"drop": true
}
This might be useful in cases where a metric originates in many different locations and it would be hard to configure every site of origin to drop the metric on the client side.
Note
Generally, aggregation is more favorable than dropping a metric entirely. By aggregating a metric, you can usually reduce its cardinality by 80-90%, and in the database keep some reference to it, such as a lower-fidelity version of it. This can be useful during the investigation of an incident. If you drop a metric, you reduce costs a bit more, but you eliminate all traces of the metric. This means that you do not see this metric when looking in the metric-name browser in Grafana Explore.
If you drop a metric, it shows up on the Discarded Metrics Samples panel with a label that provides context about why it was dropped.
Most of these labels are self-explanatory. The requested-by-configuration label means that the samples were dropped intentionally, by aggregation rules that the user configured and the aggregation service applies.
Drop a label
You can drop a label before you ingest data.
Dropping a label is useful in cases where granularity is not needed. For example, if you want to measure energy consumption in locations where the temperatures are typically high in the summer, you can drop labels for locations whose temperatures are low if you do not need to monitor those.
Dropping a label reduces the cardinality for that metric name, and thereby decreases the total number of billable active series.
Configure the aggregation interval and the DPM of the aggregated metric
The number of data points per minute (DPM) that are stored for the aggregated metric depends directly on the aggregation interval of the metric, which is the interval at which the aggregated samples are emitted.
The default aggregation_interval value matches the included DPM per series of your organization.
For organizations with the default resolution of 1 DPM, this means a default interval setting of 60s.
The valid values for aggregation_interval are: 6s, 10s, 15s, 20s, 30s, and 60s, corresponding to 10 DPM, 6 DPM, 4 DPM, 3 DPM, 2 DPM, and 1 DPM respectively.
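The mapping between interval and DPM is simply 60 seconds divided by the interval. A small Python sketch, with hypothetical names, that reproduces the pairs listed above:

```python
# Interval values (in seconds) listed as valid for aggregation_interval.
VALID_INTERVALS_SECONDS = (6, 10, 15, 20, 30, 60)


def dpm_for_interval(interval_seconds: int) -> int:
    """Return the data points per minute produced by a given aggregation interval."""
    if interval_seconds not in VALID_INTERVALS_SECONDS:
        raise ValueError(f"unsupported aggregation_interval: {interval_seconds}s")
    return 60 // interval_seconds


dpm_by_interval = {f"{s}s": dpm_for_interval(s) for s in VALID_INTERVALS_SECONDS}
```

For example, an interval of 15s yields 4 DPM, and the default 60s yields 1 DPM.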
Note
Changing the value of the aggregation_interval setting causes a small gap in the data for the affected aggregated metrics while the aggregation is being initialized with the new parameters.
If you want to increase the DPM of the aggregated metric, decrease the aggregation_interval to one of the supported values.
Note
You can set the aggregation_interval individually for each aggregation rule. You can also ask Grafana Cloud support to set a global value for aggregation_interval as the default for all aggregation rules. Open a support ticket in the Cloud Portal to request this.
Caution
By increasing the DPM of the aggregated metric you may incur additional costs.
Configure the aggregation delay
The aggregation_delay is the delay after which the aggregated samples are emitted. The default value is 90s. The valid values for the aggregation_delay are: 15s, 30s, 60s, 1m30s, 2m, 2m30s, and 3m.
Note
Changing the value of the aggregation_delay setting causes a small gap in the data for the affected aggregated metrics while the aggregation is being initialized with the new parameters.
Increase the aggregation_delay to emit the aggregated samples later and reduce the risk of excluding samples that are received late (because of a lagging remote write client, for example).
The total delay between the time of the raw sample arriving at Grafana Cloud and the time that the aggregated sample becomes queryable is the sum of the aggregation_interval and the aggregation_delay.
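As a worked example of that sum, the following Python sketch converts the two settings to seconds and adds them. The helper names are hypothetical, and the parser only handles the minute and second forms that appear in the lists of valid values (such as 60s, 2m, and 1m30s).

```python
import re


def duration_seconds(text: str) -> int:
    """Parse durations such as 60s, 2m, or 1m30s into seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(?:(\d+)s)?", text)
    if not match or not text:
        raise ValueError(f"unsupported duration: {text!r}")
    minutes, seconds = match.groups()
    return int(minutes or 0) * 60 + int(seconds or 0)


def total_delay_seconds(aggregation_interval: str, aggregation_delay: str) -> int:
    """Time from a raw sample's arrival until the aggregated sample is queryable."""
    return duration_seconds(aggregation_interval) + duration_seconds(aggregation_delay)


default_total = total_delay_seconds("60s", "1m30s")
```

With an interval of 60s and a delay of 1m30s, the total is 150 seconds. Shortening the interval to 15s and the delay to 30s brings it down to 45 seconds, at the cost of a higher DPM and a greater risk of missing late samples.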
Note
You can set the aggregation_delay individually for each aggregation rule. You can also ask Grafana Cloud support to set a global value for aggregation_delay as the default for all aggregation rules. Open a support ticket in the Cloud Portal to request this.