
metrics_config

The metrics_config block is used to define a collection of metrics instances. Each instance defines a collection of Prometheus-compatible scrape_configs and remote_write rules. Most users will only need to define one instance.

yaml
# Configures the optional scraping service to cluster agents.
[scraping_service: <scraping_service_config>]

# Configures the gRPC client used for agents to connect to other
# clustered agents.
[scraping_service_client: <scraping_service_client_config>]

# Configure values for all Prometheus instances.
[global: <global_config>]

# Configure the directory used by instances to store their WAL.
#
# The Grafana Agent assumes that all folders within wal_directory are managed by
# the agent itself. This means if you are using a PVC, you must point
# wal_directory to a subdirectory of the PVC mount.
[wal_directory: <string> | default = "data-agent/"]

# Configures how long an abandoned WAL (one not associated with an instance)
# must go without being written to before it is eligible for deletion.
[wal_cleanup_age: <duration> | default = "12h"]

# Configures how often to check for abandoned WALs that should be deleted.
# A value of 0 disables periodic cleanup of abandoned WALs.
[wal_cleanup_period: <duration> | default = "30m"]

# Disables HTTP Keep-Alive when scraping; the Agent will use each outgoing
# connection for only a single request.
[http_disable_keepalives: <boolean> | default = false]

# Configures the maximum amount of time a Keep-Alive connection can remain
# idle before closing itself. Zero means no limit.
# The setting is ignored when `http_disable_keepalives` is enabled.
[http_idle_conn_timeout: <duration> | default = "5m"]

# The list of Prometheus instances to launch with the agent.
configs:
  [- <metrics_instance_config>]

# If an instance crashes abnormally, how long to wait before trying to
# restart it. 0s disables the backoff period and restarts the agent
# immediately.
[instance_restart_backoff: <duration> | default = "5s"]

# How to spawn instances based on instance configs. Supported values: shared,
# distinct.
[instance_mode: <string> | default = "shared"]
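
As a concrete illustration, a minimal metrics_config with a single instance might look like the sketch below. It assumes the block sits under the top-level metrics key of a Static mode configuration file; the WAL path, scrape target, and remote_write URL are placeholders.

yaml
# Sketch only; adjust paths, targets, and URLs for your environment.
metrics:
  wal_directory: /tmp/grafana-agent-wal
  global:
    scrape_interval: 1m
  configs:
    # A single shared instance named "default".
    - name: default
      scrape_configs:
        - job_name: agent
          static_configs:
            # Placeholder target; replace with your Agent's HTTP listen address.
            - targets: ['127.0.0.1:12345']
      remote_write:
        # Placeholder endpoint.
        - url: https://prometheus.example.com/api/v1/write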

scraping_service_config

The scraping_service block configures the scraping service, an operational mode where configurations are stored centrally in a KV store and a cluster of agents distributes discovery and scrape load between nodes.

yaml
# Whether to enable scraping service mode. When enabled, local configs
# cannot be used.
[enabled: <boolean> | default = false]

# Note: the next three configuration options can be confusing. Due to backwards
# compatibility, the naming is less than ideal.

# How often the agent should manually refresh the configuration. Useful if KV
# change events are not sent by an agent.
[reshard_interval: <duration> | default = "1m"]

# The timeout for configuration refreshes. This can occur on cluster events or
# on the reshard interval. A timeout of 0 indicates no timeout.
[reshard_timeout: <duration> | default = "30s"]

# The timeout for cluster reshard events. A timeout of 0 indicates no timeout.
[cluster_reshard_event_timeout: <duration> | default = "30s"]

# Configuration for the KV store to store configurations.
kvstore: <kvstore_config>

# When set, allows configs pushed to the KV store to specify configuration
# fields that can read secrets from files.
#
# This is disabled by default. When enabled, a malicious user can craft an
# instance config that reads arbitrary files on the machine the Agent runs
# on and sends its contents to a specifically crafted remote_write endpoint.
#
# If enabled, ensure that no untrusted users have access to the Agent API.
[dangerous_allow_reading_files: <boolean>]

# Configuration for how agents will cluster together.
lifecycler: <lifecycler_config>
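
For illustration, a sketch of scraping service mode backed by etcd might look like the following; the etcd endpoint is a placeholder.

yaml
scraping_service:
  enabled: true
  kvstore:
    store: etcd
    prefix: configurations/
    etcd:
      # Placeholder endpoint; point this at your own etcd cluster.
      endpoints:
        - etcd.example.local:2379
  lifecycler:
    ring:
      kvstore:
        store: etcd
        etcd:
          endpoints:
            - etcd.example.local:2379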

kvstore_config

The kvstore_config block configures the KV store used as storage for configurations in the scraping service mode.

yaml
# Which underlying KV store to use. Can be either consul or etcd
[store: <string> | default = ""]

# Key prefix to store all configurations with. Must end in /.
[prefix: <string> | default = "configurations/"]

# Configuration for a Consul client. Only applies if store
# is "consul"
consul:
  # The hostname and port of Consul.
  [host: <string> | default = "localhost:8500"]

  # The ACL Token used to interact with Consul.
  [acltoken: <string>]

  # The HTTP timeout when communicating with Consul
  [httpclienttimeout: <duration> | default = 20s]

  # Whether or not consistent reads to Consul are enabled.
  [consistentreads: <boolean> | default = true]

# Configuration for an ETCD v3 client. Only applies if
# store is "etcd"
etcd:
  # The ETCD endpoints to connect to.
  endpoints:
    - <string>

  # The Dial timeout for the ETCD connection.
  [dial_timeout: <duration> | default = 10s]

  # The maximum number of retries to do for failed ops to ETCD.
  [max_retries: <int> | default = 10]
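
For comparison, a Consul-backed kvstore_config might look like the sketch below; the host and ACL token are placeholders.

yaml
kvstore:
  store: consul
  prefix: configurations/
  consul:
    # Placeholder host and token; substitute your own Consul details.
    host: consul.example.local:8500
    acltoken: example-acl-token
    httpclienttimeout: 20s
    consistentreads: true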

lifecycler_config

The lifecycler_config block configures the lifecycler, the component that agents use to cluster together.

yaml
# Configures the distributed hash ring storage.
ring:
  # KV store for getting and sending distributed hash ring updates.
  kvstore: <kvstore_config>

  # Considers other agents in the cluster unhealthy if they haven't sent a
  # heartbeat within this duration.
  [heartbeat_timeout: <duration> | default = "1m"]

# Number of tokens to generate for the distributed hash ring.
[num_tokens: <int> | default = 128]

# How often agents should send a heartbeat to the distributed hash
# ring.
[heartbeat_period: <duration> | default = "5s"]

# How long to wait for tokens from other agents after generating
# a new set to resolve collisions. Useful only when using a gossip
# KV store.
[observe_period: <duration> | default = "0s"]

# Period to wait before joining the ring. 0s means to join immediately.
[join_after: <duration> | default = "0s"]

# Minimum duration to wait before marking the agent as ready to receive
# traffic. Used to work around race conditions for multiple agents exiting
# the distributed hash ring at the same time.
[min_ready_duration: <duration> | default = "1m"]

# Network interfaces used to resolve addresses defined by other agents
# registered in the distributed hash ring.
[interface_names: <string array> | default = ["eth0", "en0"]]

# Duration to sleep before exiting. Ensures that metrics get scraped
# before the process quits.
[final_sleep: <duration> | default = "30s"]

# File path to store tokens. If empty, tokens will not be stored during
# shutdown and will not be restored at startup.
[tokens_file_path: <string> | default = ""]

# Availability zone of the host the agent is running on. Default is an
# empty string which disables zone awareness for writes.
[availability_zone: <string> | default = ""]
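
As a sketch, a lifecycler block that persists tokens across restarts and enables zone-aware writes might look like this; the etcd endpoint, file path, and zone name are placeholders.

yaml
lifecycler:
  ring:
    kvstore:
      store: etcd
      etcd:
        endpoints:
          - etcd.example.local:2379
    heartbeat_timeout: 1m
  num_tokens: 128
  heartbeat_period: 5s
  # Placeholder path; tokens stored here are restored after a restart.
  tokens_file_path: /var/lib/grafana-agent/ring-tokens
  # Placeholder zone name; a non-empty value enables zone awareness for writes.
  availability_zone: zone-a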

scraping_service_client_config

The scraping_service_client_config block configures how clustered Agents will generate gRPC clients to connect to each other.

yaml
grpc_client_config:
  # Maximum size in bytes the gRPC client will accept from the connected server.
  [max_recv_msg_size: <int> | default = 104857600]

  # Maximum size in bytes the gRPC client will send to the connected server.
  [max_send_msg_size: <int> | default = 16777216]

  # Whether messages should be gzipped.
  [use_gzip_compression: <boolean> | default = false]

  # The rate limit for gRPC clients; 0 means no rate limit.
  [rate_limit: <float64> | default = 0]

  # gRPC burst allowed for rate limits.
  [rate_limit_burst: <int> | default = 0]

  # Controls whether the client should retry the request when a rate limit
  # is hit.
  [backoff_on_ratelimits: <boolean> | default = false]

  # Configures the retry backoff when backoff_on_ratelimits is
  # true.
  backoff_config:
    # The minimum delay when backing off.
    [min_period: <duration> | default = "100ms"]

    # The maximum delay when backing off.
    [max_period: <duration> | default = "10s"]

    # The number of times to back off and retry before failing.
    [max_retries: <int> | default = 10]
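
For example, a client configuration that enables gzip compression and retries when rate limited might look like the following sketch, nested under the metrics_config block shown earlier (values are illustrative only).

yaml
scraping_service_client:
  grpc_client_config:
    max_recv_msg_size: 104857600
    max_send_msg_size: 16777216
    use_gzip_compression: true
    # Illustrative limits: allow 5 requests per second with bursts of 10.
    rate_limit: 5
    rate_limit_burst: 10
    backoff_on_ratelimits: true
    backoff_config:
      min_period: 100ms
      max_period: 10s
      max_retries: 10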

global_config

The global_config block configures global values for all launched Prometheus instances.

yaml
# How frequently should Prometheus instances scrape.
[scrape_interval: <duration> | default = "1m"]

# How long to wait before timing out a scrape from a target.
[scrape_timeout: <duration> | default = "10s"]

# A list of static labels to add to all metrics.
external_labels:
  { <string>: <string> }

# Default set of remote_write endpoints. If an instance doesn't define any
# remote_writes, it will use this list.
remote_write:
  - [<remote_write>]

Note: For more information on remote_write, refer to the Prometheus documentation.

The following default values set by Grafana Agent Static Mode are different from the defaults set by Prometheus:

  • remote_write: send_exemplars default value is true
  • remote_write: queue_config: retry_on_http_429 default value is true
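
Putting this together, a global block that attaches an external label and a default remote_write endpoint to every instance might look like the sketch below (the label value and URL are placeholders).

yaml
global:
  scrape_interval: 1m
  scrape_timeout: 10s
  external_labels:
    # Placeholder label; added to every metric from every instance.
    cluster: example-cluster
  remote_write:
    # Placeholder endpoint; used by instances that define no remote_write of their own.
    - url: https://prometheus.example.com/api/v1/write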

metrics_instance_config

The metrics_instance_config block configures an individual metrics instance, which acts as its own mini Prometheus-compatible agent, though without support for the TSDB.

yaml
# Name of the instance. Must be present. Will be added as a label to agent
# metrics.
name: <string>

# Whether this agent instance should only scrape from targets running on the
# same machine as the agent process.
[host_filter: <boolean> | default = false]

# Relabel configs to apply against discovered targets. The relabeling is
# temporary and just used for filtering targets.
host_filter_relabel_configs:
  [ - <relabel_config> ... ]

# How frequently the WAL truncation process should run. Every iteration of
# the truncation will checkpoint old series and remove old samples. If data
# has not been sent within this window, some of it may be lost.
#
# The size of the WAL will increase with less frequent truncations. Making
# truncations more frequent reduces the size of the WAL but increases the
# chances of data loss when remote_write is failing for longer than the
# specified frequency.
[wal_truncate_frequency: <duration> | default = "60m"]

# The minimum amount of time that series and samples should exist in the WAL
# before being considered for deletion. The consumed disk space of the WAL will
# increase by making this value larger.
#
# Setting this value to 0s is valid, but may delete series before all
# remote_write shards have been able to write all data, and may cause errors on
# slower machines.
[min_wal_time: <duration> | default = "5m"]

# The maximum amount of time that series and samples may exist within the WAL
# before being considered for deletion. Series that have not received writes
# since this period will be removed, and all samples older than this period will
# be removed.
#
# This value is useful in long-running network outages, preventing the WAL from
# growing forever.
#
# Must be larger than min_wal_time.
[max_wal_time: <duration> | default = "4h"]

# Deadline for flushing data when a Prometheus instance shuts down
# before giving up and letting the shutdown proceed.
[remote_flush_deadline: <duration> | default = "1m"]

# When true, writes staleness markers to all active series to
# remote_write.
[write_stale_on_shutdown: <boolean> | default = false]

# A list of scrape configuration rules.
scrape_configs:
  - [<scrape_config>]

# A list of remote_write targets.
remote_write:
  - [<remote_write>]
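
As an illustration, an instance that scrapes the Agent's own metrics endpoint and forwards them to a single remote_write target might look like this sketch (the name, target, and URL are placeholders).

yaml
name: primary
host_filter: false
wal_truncate_frequency: 60m
min_wal_time: 5m
max_wal_time: 4h
scrape_configs:
  - job_name: agent
    static_configs:
      # Placeholder target; replace with your Agent's HTTP listen address.
      - targets: ['127.0.0.1:12345']
remote_write:
  # Placeholder endpoint.
  - url: https://prometheus.example.com/api/v1/write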

Note: More information on the <relabel_config>, <scrape_config>, and <remote_write> types can be found on the Prometheus website.

Data retention

The prometheus.remote_write component uses a Write Ahead Log (WAL) to prevent data loss during network outages. The component buffers the received metrics in a WAL for each configured endpoint. The queue shards can use the WAL after the network outage is resolved and flush the buffered metrics to the endpoints.

The WAL records metrics in 128 MB files called segments. To avoid having a WAL that grows on-disk indefinitely, the component truncates its segments on a set interval.

On each truncation, the WAL deletes references to series that are no longer present and also checkpoints roughly the oldest two thirds of the segments (rounded down to the nearest integer) written to it since the last truncation period. A checkpoint means that the WAL only keeps track of the unique identifier for each existing metrics series, and can no longer use the samples for remote writing. If that data has not yet been pushed to the remote endpoint, it is lost.

This behavior dictates the data retention for the prometheus.remote_write component. It also means that it’s impossible to directly correlate data retention to the age of the data itself, as the truncation logic works on segments, not on the samples themselves. This makes data retention less predictable when the component receives an inconsistent rate of data.

The WAL block in Flow mode and the metrics config in Static mode contain configurable parameters that control the tradeoff between memory usage, disk usage, and data retention.

The truncate_frequency or wal_truncate_frequency parameter configures the interval at which truncations happen. A lower value leads to reduced memory usage, but also provides less resiliency to long outages.

When a WAL clean-up starts, the most recently successfully sent timestamp is used to determine how much data is safe to remove from the WAL. The min_keepalive_time or min_wal_time controls the minimum age of samples considered for removal. No samples more recent than min_keepalive_time are removed. The max_keepalive_time or max_wal_time controls the maximum age of samples that can be kept in the WAL. Samples older than max_keepalive_time are forcibly removed.
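
For example, in Static mode these parameters are set per metrics instance; a sketch that trades extra disk and memory for more resiliency to long outages might look like the following (values are illustrative only).

yaml
configs:
  - name: default
    # Truncate less often and keep samples longer than the defaults,
    # at the cost of a larger WAL on disk.
    wal_truncate_frequency: 2h
    min_wal_time: 5m
    max_wal_time: 8h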

Extended remote_write outages

When the remote write endpoint is unreachable over a period of time, the most recent successfully sent timestamp is not updated. The min_keepalive_time and max_keepalive_time arguments control the age range of data kept in the WAL.

If the remote write outage is longer than the max_keepalive_time parameter, then the WAL is truncated, and the oldest data is lost.

Intermittent remote_write outages

If the remote write endpoint is intermittently reachable, the most recent successfully sent timestamp is updated whenever the connection is successful. A successful connection updates the series’ comparison with min_keepalive_time and triggers a truncation on the next truncate_frequency interval which checkpoints two thirds of the segments (rounded down to the nearest integer) written since the previous truncation.

Falling behind

If the queue shards cannot flush data quickly enough to keep up-to-date with the most recent data buffered in the WAL, we say that the component is ‘falling behind’. It’s not unusual for the component to temporarily fall behind 2 or 3 scrape intervals. If the component falls behind more than one third of the data written since the last truncate interval, it is possible for the truncate loop to checkpoint data before being pushed to the remote_write endpoint.

WAL corruption

WAL corruption can occur when Grafana Agent unexpectedly stops while the latest WAL segments are still being written to disk. For example, the host computer has a general disk failure and crashes before you can stop Grafana Agent and other running services. When you restart Grafana Agent, it verifies the WAL, removing any corrupt segments it finds. Sometimes, this repair is unsuccessful, and you must manually delete the corrupted WAL to continue.

If the WAL becomes corrupted, Grafana Agent writes error messages such as err="failed to find segment for index" to the log file.

Note

Deleting a WAL segment or a WAL file permanently deletes the stored WAL data.

To delete the corrupted WAL:

  1. Stop Grafana Agent.

  2. Find and delete the contents of the wal directory.

    By default, the wal directory is a subdirectory of the data-agent directory located in the Grafana Agent working directory. The WAL data directory may differ from the default depending on the wal_directory setting in your Static configuration file or the path specified by the Flow command line flag --storage-path.

    Note

    There is one wal directory per:

    • Metrics instance running in Static mode
    • prometheus.remote_write component running in Flow mode
  3. Start Grafana Agent and verify that the WAL is working correctly.