
Monitoring with Prometheus 2.0


January 17, 2018

This article was contributed by Antoine Beaupré

Prometheus is a monitoring tool built from scratch by SoundCloud in 2012. It works by pulling metrics from monitored services and storing them in a time series database (TSDB). It has a powerful query language to inspect that database, create alerts, and plot basic graphs. Those graphs can then be used to detect anomalies or trends for (possibly automated) resource provisioning. Prometheus also has extensive service discovery features and supports high availability configurations. That's what the brochure says, anyway; let's see how it works in the hands of an old grumpy system administrator. I'll be drawing comparisons with Munin and Nagios frequently because those are the tools I have used for over a decade in monitoring Unix clusters.

Monitoring with Prometheus and Grafana

What distinguishes Prometheus from other solutions is the relative simplicity of its design: for one, metrics are exposed over HTTP using a special URL (/metrics) and a simple text format. Here are, as an example, some network metrics from a test machine:

    $ curl -s http://curie:9100/metrics | grep node_network_.*_bytes
    # HELP node_network_receive_bytes Network device statistic receive_bytes.
    # TYPE node_network_receive_bytes gauge
    node_network_receive_bytes{device="eth0"} 2.720630123e+09
    # HELP node_network_transmit_bytes Network device statistic transmit_bytes.
    # TYPE node_network_transmit_bytes gauge
    node_network_transmit_bytes{device="eth0"} 4.03286677e+08

In the above example, the metrics are named node_network_receive_bytes and node_network_transmit_bytes. They have a single label/value pair (device="eth0") attached to them, along with the value of the metric itself. These are just two of the couple of hundred metrics (CPU, memory, and disk usage, temperature, and so on) exposed by the "node exporter", a basic stats collector running on monitored hosts. Metrics can be counters (e.g. per-interface packet counts), gauges (e.g. temperature or fan sensors), or histograms. The latter allow, for example, 95th percentile analysis, something that has been missing from Munin forever and is essential for billing networking customers. Another popular use for histograms is maintaining an Apdex score, to make sure that N requests are answered in X time. The various metric types are carefully analyzed before being stored to correctly handle conditions like overflows (which occur surprisingly often on gigabit network interfaces) or resets (when a device restarts).
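
To give an idea of what a histogram looks like on the wire, here is a sketch of the exposition format for a hypothetical HTTP latency metric (the metric name and bucket boundaries are made up for illustration); each bucket counts observations below its le ("less than or equal") threshold, and the 95th percentile can then be estimated with PromQL's histogram_quantile() function:

    # HELP http_request_duration_seconds HTTP request latency.
    # TYPE http_request_duration_seconds histogram
    http_request_duration_seconds_bucket{le="0.1"} 2423
    http_request_duration_seconds_bucket{le="0.5"} 2743
    http_request_duration_seconds_bucket{le="1"} 2751
    http_request_duration_seconds_bucket{le="+Inf"} 2757
    http_request_duration_seconds_sum 912.3
    http_request_duration_seconds_count 2757

    # estimated 95th percentile latency over the last five minutes
    histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))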

Those metrics are fetched from "targets", which are simply HTTP endpoints added to the Prometheus configuration file. Targets can also be added automatically through various discovery mechanisms: DNS, for example, allows a single A or SRV record to list all the hosts to monitor, and Kubernetes or cloud-provider APIs can list all the containers or virtual machines to monitor. Discovery works in real time, so it will correctly pick up changes in DNS, for example. It can also attach metadata (e.g. the IP address found or the server state), which is useful in dynamic environments such as Kubernetes or container orchestration in general.
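
To make this concrete, here is a minimal sketch of a scrape configuration (the hostnames are illustrative) that mixes a statically defined target with DNS-based service discovery, where a single SRV record expands into the list of hosts to monitor:

    global:
      scrape_interval: 15s

    scrape_configs:
      - job_name: 'node'
        static_configs:
          - targets: ['curie:9100', 'pluto:9100']
      - job_name: 'node-dns'
        dns_sd_configs:
          - names: ['_node._tcp.example.com']
            type: 'SRV'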

Once collected, metrics can be queried through the web interface, using a custom language called PromQL. For example, a query showing the average incoming bandwidth, in bytes per second, over the last minute for the eth0 interface would look like:

    rate(node_network_receive_bytes{device="eth0"}[1m])

Notice the "device" label, which we use to restrict the search to a single interface. This query can also be plotted into a simple graph on the web interface:

[My first Prometheus graph]

What is interesting here is not really the node exporter metrics themselves, as those are fairly standard in any monitoring solution. The difference is that, in Prometheus, any (web) application can easily expose its own internal metrics to the monitoring server through regular HTTP, whereas other systems would require special plugins on both the monitoring server and the application side. Note that Munin follows a similar pattern, but uses its own text protocol on top of TCP, which makes it harder to implement for web applications and to diagnose with a web browser.
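
To illustrate how low that barrier is, here is a minimal sketch using the official Python client library (the metric names and port are arbitrary, not part of any standard); it exposes a request counter and a latency histogram on /metrics, ready to be scraped:

    # minimal instrumentation sketch with the prometheus_client library;
    # metric names are illustrative only
    import random
    import time

    from prometheus_client import start_http_server, Counter, Histogram

    REQUESTS = Counter('myapp_requests_total', 'Total requests handled')
    LATENCY = Histogram('myapp_request_latency_seconds', 'Request latency')

    @LATENCY.time()              # observe how long each call takes
    def handle_request():
        REQUESTS.inc()           # count every request
        time.sleep(random.random() / 10)

    if __name__ == '__main__':
        start_http_server(8000)  # serve /metrics on port 8000
        while True:
            handle_request()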

However, coming from the world of Munin, where all sorts of graphics just magically appear out of the box, this first experience can be a bit of a disappointment: everything is built by hand and ephemeral. While there are ways to add custom graphs to the Prometheus web interface using Go-based console templates, most Prometheus deployments generally use Grafana to render the results using custom-built dashboards. This gives much better results, and allows graphing multiple machines separately, using the Node Exporter Server Metrics dashboard:

[Grafana dashboard]

All this work took roughly an hour of configuration, which is pretty good for a first try. Things get tougher when extending those basic metrics: because of the system's modularity, it is difficult to add new metrics to existing dashboards. For example, web or mail servers are not monitored by the node exporter. So monitoring a web server involves installing an Apache-specific exporter that needs to be added to the Prometheus configuration. But it won't show up automatically in the above dashboard, because that's a "node exporter" dashboard, not an Apache dashboard. So you need a separate dashboard for that. This is all work that's done automatically in Munin without any hand-holding.

Even then, Apache is a relatively easy case; monitoring some arbitrary server without a ready-made exporter will require installing a program like mtail, which parses the server's logfiles to expose metrics to Prometheus. There doesn't seem to be a way to write quick "run this command to count files" plugins that would allow administrators to write quick hacks. The main option is to write a new exporter using the client libraries, which seems to be a rather large undertaking for non-programmers. You can also use the node exporter's textfile collector, which reads arbitrary metrics from plain text files in a directory. It's not as direct as running a shell command, but may be good enough for some use cases. Besides, there is a large number of exporters already available, including ones that can tap into existing Nagios and Munin servers to allow for a smooth transition.
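
For the record, such a quick hack is still possible with the textfile collector, just more indirect: a cron job writes a metric into the directory the node exporter watches (set with --collector.textfile.directory; the directory, queue path, and metric name below are only an example), using a rename so that half-written files are never exposed:

    # count deferred messages in the Postfix queue and expose the result
    # as a metric; paths and metric name are illustrative
    TEXTFILE_DIR=/var/lib/prometheus/node-exporter
    echo "postfix_deferred_messages $(find /var/spool/postfix/deferred -type f | wc -l)" \
        > "$TEXTFILE_DIR/postfix_queue.prom.$$" \
        && mv "$TEXTFILE_DIR/postfix_queue.prom.$$" "$TEXTFILE_DIR/postfix_queue.prom"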

Unfortunately, those exporters will only give you metrics, not graphs. To graph metrics from a third-party Postfix exporter, a graph must be created by hand in Grafana, with a magic PromQL formula. This may involve too much clicking around in a web browser for grumpy old administrators. There are tools like Grafanalib to programmatically create dashboards, but those also involve a lot of boilerplate. When building a custom application, however, creating graphs may actually be a fun and distracting task that some may enjoy. The Grafana/Prometheus design is certainly enticing and enables powerful abstractions that are not readily available with other monitoring systems.

Alerting and high availability

So far, we have worked with only a single server and have done only graphing. But Prometheus also supports sending alarms when things go bad. After working for over a decade as a system administrator, I have mixed feelings about "paging" or "alerting", as it is called in Prometheus. Regardless of how well the system is tweaked, I have come to believe it is basically impossible to design an alerting system that will respect workers and not torture on-call personnel through sleep deprivation. It seems to be a feature people want regardless, especially in the enterprise, so let's look at how it works here.

In Prometheus, you design alerting rules using PromQL. For example, to warn operators when a network interface is close to saturation, we could set the following rule:

    alert: HighBandwidthUsage
    expr: rate(node_network_transmit_bytes{device="eth0"}[1m]) > 0.95*1e+09
    for: 5m
    labels:
      severity: critical
    annotations:
      description: 'Unusually high bandwidth on interface {{ $labels.device }}'
      summary: 'High bandwidth on {{ $labels.instance }}'

Those rules are checked regularly; when one matches, an alert is fired to an Alertmanager daemon, which can receive alerts from multiple Prometheus servers. The Alertmanager then deduplicates multiple alerts, groups them (so that a single notification is sent even if multiple alerts are received), and sends the actual notifications through various services like email, PagerDuty, Slack, or an arbitrary webhook.
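
The grouping and routing logic lives in the Alertmanager's own configuration file. A minimal sketch (the addresses and timers are arbitrary) that batches alerts by name and instance and sends them by email might look like this:

    global:
      smtp_smarthost: 'localhost:25'
      smtp_from: 'alertmanager@example.com'
    route:
      receiver: 'team-email'
      group_by: ['alertname', 'instance']
      group_wait: 30s       # how long to wait to batch alerts for a group
      group_interval: 5m    # minimum delay between batches for a group
      repeat_interval: 4h   # re-notify if the alert is still firing
    receivers:
      - name: 'team-email'
        email_configs:
          - to: 'oncall@example.com'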

The Alertmanager has a "gossip protocol" to enable multiple instances to coordinate notifications. This design allows you to run multiple Prometheus servers in a federation model, all simultaneously collecting metrics, and sending alerts to redundant Alertmanager instances to create a highly available monitoring system. Those who have struggled with such setups in Nagios will surely appreciate the simplicity of this design.

The downside is that Prometheus doesn't ship with a set of default alerts, and exporters do not define default alerting thresholds that could be used to create rules automatically. The Prometheus documentation also lacks examples that the community could use, so alerting is harder to deploy than in classic monitoring systems.

Issues and limitations

Prometheus is already well-established: Cloudflare, Canonical, and (of course) SoundCloud are all (still) using it in production. It is a common monitoring tool in Kubernetes deployments because of its discovery features. Prometheus is, however, not a silver bullet and may not be the best tool for all workloads.

In particular, Prometheus is not designed for long-term storage. By default, it keeps samples for only two weeks, which seems rather short to old system administrators used to RRDtool databases that efficiently store samples for years. As a comparison, my test Prometheus instance is taking up as much space for five days of samples as Munin does for a full year. Of course, Munin only collects metrics every five minutes while Prometheus samples all targets every 15 seconds by default. Even so, this difference in size shows that Prometheus's disk requirements are much larger than those of traditional RRDtool implementations, because it lacks native down-sampling facilities. Therefore, retaining samples for more than a year (which is a Munin limitation I was hoping to overcome) will be difficult without either some serious hacking to selectively purge samples or extra disk space.
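
The retention period can, of course, be raised; a sketch of the relevant Prometheus 2.x command-line flags (the paths are illustrative) would look like the following, with disk usage growing more or less proportionally:

    # keep one year of full-resolution samples instead of the default;
    # flag names are those of the 2.x series
    prometheus --config.file=/etc/prometheus/prometheus.yml \
        --storage.tsdb.path=/var/lib/prometheus/metrics2 \
        --storage.tsdb.retention=365d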

The project documentation recognizes this and suggests using alternatives:

Prometheus's local storage is limited in its scalability and durability. Instead of trying to solve long-term storage in Prometheus itself, Prometheus has a set of interfaces that allow integrating with remote long-term storage systems.

Prometheus in itself delivers good performance: a single instance can support over 100,000 samples per second. When a single server is not enough, servers can federate to cover different parts of the infrastructure; and when that is not enough, sharding is possible. In general, performance depends on avoiding variable data in labels, which keeps the cardinality of the dataset under control, but the dataset size will grow with time regardless. So long-term storage is not Prometheus's strongest suit. But starting with 2.0, Prometheus can finally write to (and read from) external storage engines that can be more efficient than Prometheus itself. InfluxDB, for example, can be used as a backend and supports time-based down-sampling that makes long-term storage manageable. This deployment, however, is not for the faint of heart.
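
On the Prometheus side, such a deployment boils down to declaring remote read and write endpoints in the configuration file; the URLs below are a sketch assuming InfluxDB's native Prometheus endpoints and a database named "prometheus":

    remote_write:
      - url: "http://influxdb.example.com:8086/api/v1/prom/write?db=prometheus"
    remote_read:
      - url: "http://influxdb.example.com:8086/api/v1/prom/read?db=prometheus"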

Also, security freaks can't help but notice that all this is happening over a clear-text HTTP protocol. That is by design: "Prometheus and its components do not provide any server-side authentication, authorisation, or encryption. If you require this, it is recommended to use a reverse proxy." The issue is punted to a layer above, which is fine for the web interface: it is, after all, just a few Prometheus instances that need to be protected. But for monitoring endpoints, this is potentially hundreds of services that are available publicly without any protection. It would be nice to have at least IP-level blocking in the node exporter, although this could also be accomplished through a simple firewall rule.
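
For what it's worth, that reverse proxy is not a huge amount of work either; a minimal nginx sketch (certificate paths, port, and names are arbitrary) that adds TLS and basic authentication in front of a node exporter bound to localhost might look like this:

    server {
        listen 9101 ssl;
        ssl_certificate     /etc/ssl/certs/node.example.com.pem;
        ssl_certificate_key /etc/ssl/private/node.example.com.key;

        location /metrics {
            auth_basic           "metrics";
            auth_basic_user_file /etc/nginx/metrics.htpasswd;
            proxy_pass           http://127.0.0.1:9100/metrics;
        }
    }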

There is also a large gap when it comes to ready-made Prometheus dashboards and alert templates. Whereas tools like Munin or Nagios have had years to come up with lots of plugins and alerts, and to converge on best practices like "70% disk usage is a warning but 90% is critical", those things all need to be configured manually in Prometheus. Prometheus should aim at shipping standard sets of dashboards and alerts for built-in metrics, but the project currently lacks the time to implement those.

The Grafana list of Prometheus dashboards shows one aspect of the problem: there are many different dashboards, sometimes multiple ones for the same task, and it's unclear which one is the best. There is therefore space for a curated list of dashboards and a definite need for expanding those to feature more extensive coverage.

As a replacement for traditional monitoring tools, Prometheus may not be quite there yet, but it will get there, and I would certainly advise administrators to keep an eye on the project. Besides, Munin and Nagios feature-parity is just a requirement from an old grumpy system administrator. For hip young application developers smoking weird stuff in containers, Prometheus is the bomb. Just take, for example, how GitLab started integrating Prometheus, not only to monitor GitLab.com itself, but also to monitor the continuous-integration and deployment workflow. By integrating monitoring into development workflows, developers are immediately made aware of the performance impact of proposed changes, so performance regressions can be identified quickly, which is a powerful tool for any application.

Whereas system administrators may want to wait a bit before converting existing monitoring systems to Prometheus, application developers should certainly consider deploying Prometheus to instrument their applications; it will serve them well.




Monitoring with Prometheus 2.0

Posted Jan 17, 2018 19:02 UTC (Wed) by bitfehler (subscriber, #109516) [Link]

"But for monitoring endpoints, this is potentially hundreds of services that are available publicly without any protection."

That part is a misunderstanding. For scraping, Prometheus supports all kinds of security, including regular TLS, client certificates (https://prometheus.io/docs/prometheus/latest/configuratio...) as well as HTTP basic auth (https://prometheus.io/docs/prometheus/latest/configuratio...).

Besides that, nice overview. The criticism is valid, however in my experience the benefits start to outweigh the downsides at a certain scale, e.g. at some point the flexibility and interoperability with other components becomes a major feature (e.g. "having to" use Grafana is nice because we show data from other sources than just Prometheus, etc). I am sure more "out-of-the-box" solutions will show up eventually.

Disclaimer: I work at SoundCloud ;)

Monitoring with Prometheus 2.0

Posted Jan 17, 2018 20:44 UTC (Wed) by anarcat (subscriber, #66354) [Link]

"But for monitoring endpoints, this is potentially hundreds of services that are available publicly without any protection." That part is a misunderstanding. For scraping, Prometheus supports all kinds of security, including regular TLS, client certificates (https://prometheus.io/docs/prometheus/latest/configuratio...) as well as HTTP basic auth (https://prometheus.io/docs/prometheus/latest/configuratio...).
Sure: prom supports scraping HTTPS targets. But by default, the node_exporter (and in fact most exporters as well) do not export their metrics through HTTPS. Users are told to install a TLS proxy in front to enable end-to-end security.

And even then: this doesn't authenticate the collecting server against the metrics target. For that you need yet another authentication layer. Furthermore, many container deployments do not use HTTPS internally: it's all plain text, and then HTTPS is added on the edges, which means a lot of this traffic goes in the clear. So I think it's a fairly accurate description. It doesn't mean it's catastrophic: many organizations have been running Munin exactly that way forever. But it's something to keep in mind when deploying Prometheus: it's not magic.

The security guide is great, in that regard: honest, and to the point. Thank you for that.

Besides that, nice overview. The criticism is valid, however in my experience the benefits start to outweigh the downsides at a certain scale, e.g. at some point the flexibility and interoperability with other components becomes a major feature (e.g. "having to" use Grafana is nice because we show data from other sources than just Prometheus, etc). I am sure more "out-of-the-box" solutions will show up eventually.
Yep. Note that in the last paragraph, i suggest sysadmins should wait before converting existing infrastructures, but I would probably use prometheus to monitor any new infrastructure I would setup in the future. My only concern is disk space and downsampling, but I will be touching on that subject more in the next article, which should come out next week. Stay tuned! :)

Monitoring with Prometheus 2.0

Posted Jan 17, 2018 23:25 UTC (Wed) by anarcat (subscriber, #66354) [Link]

Bugs filed or worked on while writing on this series of articles:

Most of those are actually filed against the Debian project's packaging, because I had good interactions with the package maintainer there. Thanks again to tincho for all the help in setting up Prometheus and technical reviews of the article.

Monitoring with Prometheus 2.0

Posted Jan 18, 2018 16:22 UTC (Thu) by spaetz (guest, #32870) [Link]

Thanks for showing the list of filed bugs. I find that a very nice feature.

Monitoring with Prometheus 2.0

Posted Feb 6, 2018 16:50 UTC (Tue) by anarcat (subscriber, #66354) [Link]

And for what it's worth, Gnocchi 4.2 was just released with support for remote Prometheus writes, which means you could, in theory, use Prometheus only for discovery, collection, and alerts, and store long-term trending into Gnocchi, which can then be used by Grafana for graphing.

Monitoring with Prometheus 2.0

Posted Jan 18, 2018 10:12 UTC (Thu) by aowi (subscriber, #112529) [Link]

> Therefore, retaining samples for more than a year (which is a Munin limitation I was hoping to overcome)

You can have munin use arbitrary data retention with the "graph_data_size custom" setting though it doesn't seem to be well documented.

https://github.com/munin-monitoring/munin/blob/ce9e01172a...

This is the default retention we use, which keeps five minute samples for two days (5m*576), 30 minute samples (5m*6) for nine days (30m*432) et cetera up to a 1d (5m*288) sampling for ten years (1d*3660):

graph_data_size custom 576, 6 432, 24 540, 288 3660

Monitoring with Prometheus 2.0

Posted Jan 18, 2018 16:03 UTC (Thu) by anarcat (subscriber, #66354) [Link]

sure, there's a way to hack munin to keep more results. but how do you instrument graphs on top of that? it quickly becomes a mess, unfortunately.

but yeah, 10 years is a timespan i'd like to see...

Monitoring with Prometheus 2.0

Posted Jan 18, 2018 17:27 UTC (Thu) by ken (subscriber, #625) [Link]

I gave up trying to get munin to show more than 1 year.

I was adding support to read out the temperature from multiple temperature sensors using a Tellstick Duo, and it would have been nice to see multiple years, but in the end I could not figure out how to do it.

Have not had time to research what to use instead, but Prometheus does not look to be the proper solution.

Monitoring with Prometheus 2.0

Posted Jan 25, 2018 9:22 UTC (Thu) by aowi (subscriber, #112529) [Link]

munin-cgi-graph will graph any time-period you tell it to. Just click on the statically generated graphs to get to it. It'll let you zoom in and out as much as you'd like. The interface is crude, but perfectly workable.

We're not auto-generating the ten-year graphs, but then again, we're also not auto-generating the three-month graphs, or the 'what happened between 18:00 and 18:30 last Tuesday'-graphs either. But the data is there, and the graphs are three mouse-clicks away when needed. As a tool for exploration, for planning and for the occasional reporting it's quite serviceable.

If you do need to auto-generate the graphs for your purposes, then it'd require a bit more work, yes.

Monitoring with Prometheus 2.0

Posted Jan 18, 2018 10:51 UTC (Thu) by nim-nim (subscriber, #34454) [Link]

If you're monitoring over untrusted networks, it's much better to add a trusted apache (or whatever) layer to perform public auth and crypto in front of a loopback HTTP service such as prometheus, rather than trust every component to talk https and get the crypto aspects right.

Sure, it's a bit longer to set up than if it were built-in, but do you trust someone to actually audit all the built-in https stacks out there? Especially given how fast https security moves nowadays?

Monitoring with Prometheus 2.0

Posted Jan 18, 2018 17:07 UTC (Thu) by fwiesweg (subscriber, #116364) [Link]

We found this to be a valid argument, too. Our applications are running behind an nginx reverse proxy anyway so setting up forwarding the exporters via https (with client authentication for the scraper, too) was a straightforward and simple step.

Monitoring with Prometheus 2.0

Posted Jan 18, 2018 16:42 UTC (Thu) by jcpunk (subscriber, #95796) [Link]

Any thoughts on this vs Performance CoPilot (http://pcp.io/)?

Monitoring with Prometheus 2.0

Posted Jan 19, 2018 8:55 UTC (Fri) by barryascott (subscriber, #80640) [Link]

Once you have created a couple of Grafana dashboards, it then becomes a short task to craft custom dashboards for any purpose. The hard bit is deciding which metrics we need to show to help understand the behaviour we are curious about.

That RRD reduces the detail of the metrics it stores over time is a problem. Often I want to compare a metric from this week with the one from a few weeks ago. That's often not possible with RRD once the detail has gone.

We are hoping to be able to keep full metrics for long enough with Prometheus; the trade-off is the storage needs.

Barry

Monitoring with Prometheus 2.0

Posted Jan 19, 2018 9:53 UTC (Fri) by bangert (subscriber, #28342) [Link]

RRD only aggregates data if you tell it to - and yes, it is annoying that you have to specify that up front.

A big issue for most (all?) other TSDBs is that they use much more storage per saved data byte compared to RRD. This is not only a question of the amount of storage, but also of I/O performance.

Monitoring with Prometheus 2.0

Posted Feb 2, 2018 7:53 UTC (Fri) by faxm0dem (guest, #92265) [Link]

Long-term storage is very important to us, but RRDTool doesn't scale unless you DIY. The solution we came up with is to pre-aggregate the data into multiple resolutions using configurable consolidation functions, just like RRDTool does, but on top of a modern storage. We wrote a riemann plugin that does the aggregation in realtime, and then indexes the results into Elasticsearch. It also handles the aliases so that the access to the data is transparent to the user (highest resolution available gets higher priority) and curates old data automatically so that storage usage doesn't increase in time.


Copyright © 2018, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds