Unsupported: Tagging Metrics

Tagged metrics—such as those used by Datadog and Telegraf—are explicitly outside the scope of this library. Alternatives exist and are recommended. This document lays out the reasons to avoid support for tags.

Aggregating and Disaggregating Metrics

Given a simple metric, like a counter or Timers, the very first operation StatsD will perform is an aggregation over time. For example, over a 30-second window, calculate the total number of events (a counter) or several aggregations like average, median, 90th percentile (a timer).

A very common next step is for users to want to perform additional aggregations. For example, if we’re timing a /widgets API endpoint for both GET and POST requests, we might want to know the median time across both HTTP methods.

Without tags, we must start with the most disaggregated metrics, e.g.:

statsd.timing('api.widgets.GET', response_time)
statsd.timing('api.widgets.POST', response_time)

We can then aggregate these metrics with wildcards (e.g. in Graphite):

weightedAverage(api.widgets.*.mean, api.widgets.*.count)

However, with tags, we have an alternative approach: to use a single, aggregated metric name, and disaggregate via tags, e.g.:

statsd.timing('api.widgets', response_time, {'method': 'GET'})
statsd.timing('api.widgets', response_time, {'method': 'POST'})

By default, queries for the api.widgets timer will include all requests, but may be filtered to specific subsets with tags (e.g. in Datadog):

api.widgets.mean{method:GET}

Naming Metrics

The examples above demonstrate that there is a fundamental change in how metrics must be named, particularly in the absence of tags, to avoid data loss. If tags are not supported, there is no way to disaggregate api.widgets into its GET and POST subsets.

Thus, it is incredibly important that an application be written with specific metrics capabilities in mind. If using a metrics system that does not support tags, like StatsD or StatsDaemon, metric names must be disaggregated by default. If using a system that does support tags, like Datadog or Telegraf, metric names may be aggregated by default.

If an application is expecting tags to work but they are not supported by the underlying metrics system, the best case scenario is a loss of data resolution. The worst case scenario is a complete loss of data, if the metrics system is incapable of correctly parsing the extended metric data.

Explicit Opt-in

Given that the best case scenario for a mismatch of application and metrics system is a form of data loss, the choice to use metrics with tags must be incredibly explicit.

Technically, this library is capable of sending metrics to Datadog and Telegraf, as well as StatsD. However, to take advantage of these, you’ll need to change your strategy for naming—and tagging—metrics.

To avoid silently failing, this library forces you to make an explicit change to how you send metrics to these systems. At a minimum, you must touch every file that has import statsd, but that’s not really enough: you need to touch every metrics call.