Best practices for Grafana SLOs
Because SLOs are still a relatively new practice, it can feel overwhelming to create them for the first time. To help simplify things, this page provides some best practices for SLOs and SLO queries.
What is a good SLO?
A Service Level Objective (SLO) defines specific, measurable targets that represent the quality of service a provider delivers to its users. The best place to start is with the level of service your customers expect. Sometimes these expectations are written into formal service level agreements (SLAs) with customers, and sometimes they are implicit in customers' expectations for a service.
Good SLOs are simple. Don’t use every metric you can track as an SLI; choose the ones that really matter to the consumers of your service. If you choose too many, it becomes hard to pay attention to the ones that matter.
A good SLO is attainable, not aspirational
Start with a realistic target. Unrealistic goals create unnecessary frustration, which can eclipse useful feedback from the SLO. Remember, an SLO is meant to be achievable and to reflect the user experience. An SLO is not an OKR.
It’s also important to make your SLO simple and understandable. The most effective SLOs are the ones that are readable for all stakeholders.
Target services with good traffic
Too little traffic makes it difficult to monitor trends and can cause noisy alerts, because irregularities are reflected disproportionately in low-traffic environments. Conversely, too much traffic can mask customer-specific issues.
Team alignment
Teams should be the ones to create SLOs and SLIs, not managers. Your SLOs should give you feedback about your services and your customers' experience with them, so the team should work together to create them.
Embed SLO review in team rituals
As you work with SLOs, the information they provide can help guide decision-making because they add context and correlate patterns. This can help when there’s a need to balance reliability and feature velocity. Early on, it’s good practice for teams to review SLOs at regular intervals.
Iterate and adjust
Once SLO review is part of your team rituals, iterate on the information you gather so you can make increasingly informed decisions.
As you learn more from your SLOs, you may learn your assumptions don’t reflect practical reality. In the early period of SLO implementation, you may find there are a number of factors you hadn’t previously considered. If you have a lot of error budget left over, you can adjust your objectives accordingly.
Alerts and labels
SLO alerts are different from typical data source alerts. Because alerts for SLOs let you know there is a trend in your burn rate that needs attention, it’s important to understand how to set up and balance fast-burn and slow-burn alerts to keep you informed without inducing alerting fatigue.
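For intuition, a fast-burn alert typically fires only when both a short window and a long window show the error budget being consumed at a high multiple of the allowed rate, while a slow-burn alert uses longer windows and a lower multiple. The sketch below is not the rule Grafana generates for you; it assumes a 99.9% availability target, the `requests_total` metric used later on this page, and the multiwindow burn-rate multipliers described in Google's SRE Workbook.

```
# Illustrative fast-burn condition for a 99.9% availability SLO:
# both the 5m and 1h error ratios exceed 14.4x the error budget.
# (Grafana SLO generates its own alert rules; this is only a sketch.)
(
  sum(rate(requests_total{code=~"5.."}[5m])) / sum(rate(requests_total[5m]))
    > 14.4 * (1 - 0.999)
)
and
(
  sum(rate(requests_total{code=~"5.."}[1h])) / sum(rate(requests_total[1h]))
    > 14.4 * (1 - 0.999)
)
```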
Prioritize your alerts
Route your alerts first to designated individuals who can validate your SLI. Send notifications to designated engineers through OnCall or your main escalation channel when fast-burn alerts fire, so the appropriate people can quickly respond to potentially pressing issues. Send group notifications for slow-burn alerts so the team can analyze and respond during normal working hours.
Use labels
Set up good label practices. Keep labels limited so they stay navigable and consumable for triage.
Grafana SLOs use two label types: SLO labels and Alert labels. SLO labels are for grouping and filtering SLOs. Alert labels are added to slow and fast burn alerts and are used to route notifications and add metadata to alerts.
Query tips and pitfalls
There are many ways to configure your SLO queries, and the right approach depends on your needs. Just remember: if you don’t have metrics that represent your users’ experience, you need new metrics.
Keep queries simple
The best SLIs are based on Prometheus counter metrics (that is, monotonically increasing series) and use labels to encode the counted event as either a success or a failure (for example: `requests_total{code="200"}`). If your metrics don’t look like this, it’s probably better to reinstrument your service with well-suited metrics than to try to work around the issue with complex SLI query definitions.
Availability and latency are the most common SLOs to start with for request-driven services. For example:
- Availability (non-5xx responses): `requests_total{code!~"5.."} / requests_total`
- Latency (less than 1 second): `requests_duration_seconds_bucket{code!~"5..", le="1.0"} / requests_duration_seconds_count{code!~"5.."}`
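In practice, because these metrics are counters, each side of the ratio is evaluated as a `rate()` over a time window rather than as a raw counter value. A minimal sketch of the availability SLI, assuming the same `requests_total` counter:

```
# Ratio of non-5xx request rate to total request rate.
sum(rate(requests_total{code!~"5.."}[$__rate_interval]))
/
sum(rate(requests_total[$__rate_interval]))
```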
Freshness is a common SLO for message queues or batch processes where you want to ensure that each item (perhaps after several retries) gets completed before the work request grows too stale.
- Freshness (work spent less than 120 sec in queue): `completed_duration_seconds_bucket{le="120"} / completed_duration_seconds_count`
Advanced SLIs
Define advanced SLIs as a “success/total” ratio to get the best dashboards. The “Ratio” SLO type enforces this success/total style, but you’ll get more dashboard features if you follow the same approach with your advanced SLOs.
- Do: `<success rate> / <total rate>`
- Avoid: `1 - (<failure rate> / <total rate>)`
If you can’t reinstrument your metrics to encode success/failure with labels and you must work with `failure_total` and `all_total` counters, you can do `(total - fail) / total`. For example:
```
(
    sum by (…) (rate(all_total[$__rate_interval]))
  - sum by (…) (rate(failure_total[$__rate_interval]))
)
/
sum by (…) (rate(all_total[$__rate_interval]))
```
Know your SLIs
There are many SLI types. A brief explanation of multidimensional and rollup SLIs follows.
Multidimensional SLI
A multidimensional SLI reports a ratio for each value of a given label. For example: `sum by (cluster) (rate(<success>[5m])) / sum by (cluster) (rate(<total>[5m]))`.
When you specify “group by” labels on the Ratio SLO type, the SLI becomes multidimensional. A common use is to specify `cluster` and/or `namespace` in the grouping.
Multidimensional SLIs enable per-cluster alerting and support more flexible dashboards where you can include or exclude values for the chosen dimension labels (see Rollup SLI below).
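As a concrete sketch, grouping the earlier availability ratio by `cluster` (assuming `requests_total` carries a `cluster` label) looks like this:

```
# One availability ratio per cluster value.
sum by (cluster) (rate(requests_total{code!~"5.."}[$__rate_interval]))
/
sum by (cluster) (rate(requests_total[$__rate_interval]))
```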
Rollup SLI
A rollup SLI (or aggregated SLI) is a calculation over a multidimensional SLI where the numerator and denominator are further aggregated before the final ratio calculation.
When you select `cluster=all` on the dashboard of a multidimensional SLO that defines `cluster` as a group label, the dashboard calculates the aggregate ratio of the sum of all successes over the sum of all requests. This gives you alerting on each cluster and reporting on the overall rollup results.
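Conceptually, the rollup sums the per-cluster numerators and denominators before dividing, which is equivalent to dropping the `cluster` grouping. A sketch using the same assumed metric:

```
# Aggregate the per-cluster success and total rates across all clusters,
# then take the ratio.
sum(sum by (cluster) (rate(requests_total{code!~"5.."}[$__rate_interval])))
/
sum(sum by (cluster) (rate(requests_total[$__rate_interval])))
```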
Additional reference materials
Google provides very clear documentation on SLOs in their [SRE Book](https://sre.google/sre-book/service-level-objectives/). They also provide useful guides on SLO implementation and alerting on SLOs.