I'm now running Prometheus for monitoring my IT resources

December 24, 2024 — Nico Cartron

Context

I run a few IT resources, either at home or on dedicated servers, that need to be monitored so that I get alerted when something goes wrong.

I used to run Nagios for that, but when I moved away from running virtual machines on VMware ESXi and migrated as much as possible from Linux to FreeBSD, I did not reinstall Nagios.

When I finally spent time on it, I figured it would be good to opt for a "modern" monitoring solution, so instead of Nagios I went for Prometheus.

Prometheus

Prometheus is an open source monitoring solution that is very flexible and integrates nicely with pretty much any system out there.
Besides, I know that the alerting part is also nice, and allows integrating with email, instant messaging (Telegram, Slack, ...) or things like PagerDuty.

Also, Blackbox Exporter allows probing endpoints over a variety of protocols.

Steps

This article is by no means a walk-through of installing and configuring Prometheus - rather, I'll just highlight what I'm monitoring, and how.

What I am monitoring

Basic services: since I only want to check that they're alive, I'm using the tcp_connect module.
- example: Mosquitto
HTTP services: I'm using the http module, which allows checking for a specific HTTP return code (e.g., 200) and can also match a regex against the response - either to declare a service healthy or to declare it unhealthy.
- example: my webserver. I also use the regex part to check MariaDB health: if a specific string is returned, it means MariaDB is not running.
TLS certificates: I'm checking for certificate expiry.
DNS: I'm checking that my authoritative DNS servers are not only up, but also reply with correct data.
System resources: e.g. hard disk capacity - what's more annoying than a system that dies because / is full? :D
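
To make the probe types above more concrete, here is a minimal sketch of what Blackbox Exporter modules for them could look like. This is not my actual configuration: the module names, the error string and the example.com records are assumptions chosen for illustration.

```yaml
# blackbox.yml - illustrative sketch only; module names and values are hypothetical
modules:
  # Plain "is the port open?" check, e.g. for an MQTT broker such as Mosquitto
  tcp_connect:
    prober: tcp
    timeout: 5s

  # HTTP check: expect a 200, and fail if the body contains a known error string
  http_200_no_db_error:
    prober: http
    timeout: 5s
    http:
      valid_status_codes: [200]
      fail_if_body_matches_regexp:
        - 'Error establishing a database connection'  # assumed error string

  # DNS check: query a record and validate the answer that comes back
  dns_www_example:
    prober: dns
    timeout: 5s
    dns:
      query_name: "www.example.com"   # placeholder record
      query_type: "A"
      valid_rcodes:
        - NOERROR
      validate_answer_rrs:
        fail_if_not_matches_regexp:
          # placeholder answer; the TTL is matched loosely with .*
          - 'www\.example\.com\.\t.*\tIN\tA\t192\.0\.2\.10'
```

The DNS server being tested is not part of the module: it is the target that Prometheus passes to the probe, so the same module can be reused for each authoritative server.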

How I am monitoring / alerting

As I hinted above, I'm using a combination of Blackbox Exporter, Node Exporter and Alertmanager:

Blackbox Exporter allows me to probe my resources, using e.g. HTTP or DNS,
Node Exporter exposes, well, metrics from a system, which I then use to monitor it and possibly send alerts - e.g. if only 10% of space is left on the root partition,
Alertmanager takes care of sending alerts, in my case by email.
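
As a rough idea of how these pieces turn into alerts, here is a sketch of a Prometheus rules file. The alert names, durations and thresholds are assumptions made up for this example; the metric names are the ones exposed by Blackbox Exporter (probe_success, probe_ssl_earliest_cert_expiry) and Node Exporter (node_filesystem_*).

```yaml
# rules.yml - illustrative sketch only; alert names and thresholds are hypothetical
groups:
  - name: basic-alerts
    rules:
      # A Blackbox Exporter probe has been failing for 5 minutes
      - alert: ProbeFailed
        expr: probe_success == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Probe failed for {{ $labels.instance }}"

      # The TLS certificate seen by the probe expires in less than 14 days
      - alert: CertificateExpiringSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 14 * 24 * 3600
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Certificate for {{ $labels.instance }} expires soon"

      # Less than 10% of space left on the root partition
      - alert: RootFilesystemAlmostFull
        expr: |
          node_filesystem_avail_bytes{mountpoint="/"}
            / node_filesystem_size_bytes{mountpoint="/"} * 100 < 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Root partition almost full on {{ $labels.instance }}"
```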

I spent a bit of time on one "circular dependency": if my local DNS resolver is down, then all my tests will fail, triggering a lot of alerts - and of course, I wanted to avoid that.
Alertmanager has a very elegant solution, called inhibit_rules, where you define a source alert (matched by severity, for example) that, while it is firing, prevents notifications from being sent for a set of other alerts.
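
A minimal sketch of such a setup could look like the fragment below. The alert names (DNSResolverDown, ProbeFailed, CertificateExpiringSoon), the addresses and the SMTP host are all hypothetical; a real email setup will most likely also need authentication settings.

```yaml
# alertmanager.yml - illustrative sketch only; names and addresses are hypothetical
route:
  receiver: email

receivers:
  - name: email
    email_configs:
      - to: "admin@example.com"
        from: "alertmanager@example.com"
        smarthost: "smtp.example.com:587"

inhibit_rules:
  # While the (hypothetical) resolver alert is firing,
  # do not notify about the probe alerts that depend on it
  - source_matchers:
      - 'alertname = "DNSResolverDown"'
    target_matchers:
      - 'alertname =~ "ProbeFailed|CertificateExpiringSoon"'
```

With this in place, a resolver outage should result in a single email about the resolver rather than one per dependent check.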

Wrap Up

I'm really pleased with how elegant the whole setup is:

adding new services is super easy,
the whole configuration is done through YAML config files,
I can monitor services either very granularly, or just with basic tcp_connect checks.
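
To illustrate the first point, here is a sketch of a Prometheus scrape job for Blackbox Exporter, following the relabelling pattern from the exporter's documentation. The job name, the targets and the exporter address (127.0.0.1:9115, its default port) are assumptions for this example.

```yaml
# prometheus.yml - illustrative sketch only; targets and addresses are hypothetical
scrape_configs:
  - job_name: blackbox_tcp
    metrics_path: /probe
    params:
      module: [tcp_connect]          # module defined in the Blackbox Exporter config
    static_configs:
      - targets:
          - mqtt.example.com:1883
          - mail.example.com:25      # adding a service is one more line here
    relabel_configs:
      # Pass the original target to the exporter as the ?target= parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed endpoint as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Scrape the exporter itself rather than the target
      - target_label: __address__
        replacement: 127.0.0.1:9115
```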

Next steps

So far, I'm sending alerts by email, but I will probably define some text/Telegram alerts for the most urgent ones.

I may also look into adding this to my Grafana setup, because who doesn't like a nice-looking Grafana dashboard?

Tags: IT
