I'm now running Prometheus for monitoring my IT resources
Context
I have a few IT resources, running either at home or on dedicated servers, that need to be monitored so that I get alerted when something goes wrong.
I used to run Nagios for that, but when I moved away from running virtual machines on VMware ESXi and migrated as much as possible from Linux to FreeBSD, I never reinstalled Nagios.
When I finally spent time on it, I decided it would be good to opt for a "modern" monitoring solution, and instead of using Nagios, I went for Prometheus.
Prometheus
Prometheus is an open-source monitoring solution that is very flexible and integrates nicely with pretty much any system out there.
Besides, I know that the Alerting part is also nice, and allows integrating
with email, instant messaging (Telegram, Slack, ...) or things like PagerDuty.
Also, Blackbox Exporter allows probing endpoints over a variety of protocols.
Steps
This article is by no means a walk-through for installing and configuring Prometheus - rather, I'll just highlight what I'm monitoring, and how.
What I am monitoring
- Basic services: since I only want to check they're alive, I'm using the tcp
connect module
- example: Mosquitto
- HTTP services: I'm using the http module, which allows checking for a specific
HTTP return code (e.g., 200) and can also match a specific regex - either one
can be used to declare a service healthy or unhealthy.
- example: my webserver. I also use the regex matching to check MariaDB health: if a specific string is returned, it means MariaDB is not running.
- TLS certificates: I'm checking for certificates expiry.
- DNS: I'm checking that my authoritative DNS servers are not only up, but also reply with correct data.
- System resources: e.g. hard disk capacity - what's more annoying than a
system that dies because / is full? :D
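To make the list above a bit more concrete, here is a minimal sketch of what matching Blackbox Exporter modules could look like. This is not my actual configuration: the module names, the domain `example.org`, the address `192.0.2.10` and the error string are all placeholders picked for illustration.

```yaml
# blackbox.yml (sketch): module names are arbitrary labels, chosen for this example
modules:
  # Basic liveness check: succeeds if a TCP connection can be opened
  tcp_connect:
    prober: tcp
    timeout: 5s

  # HTTP check: expects status 200 and fails if the body contains an error string
  http_200:
    prober: http
    timeout: 5s
    http:
      valid_status_codes: [200]
      fail_if_body_matches_regexp:
        - 'database connection failed'   # hypothetical string shown when MariaDB is down

  # DNS check: queries the probed server and validates the answer records
  dns_authoritative:
    prober: dns
    timeout: 5s
    dns:
      query_name: "example.org"          # placeholder zone
      query_type: "A"
      valid_rcodes:
        - NOERROR
      validate_answer_rrs:
        # Probe fails if an answer record does not match this regex
        fail_if_not_matches_regexp:
          - '.*IN\s+A\s+192\.0\.2\.10'   # placeholder expected address
```

For HTTPS targets, the http prober also exposes the `probe_ssl_earliest_cert_expiry` metric, so certificate expiry can be alerted on without a dedicated module.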
How I am monitoring / alerting
As I hinted above, I'm using a combination of Blackbox Exporter, Node Exporter and Alertmanager:
- Blackbox Exporter allows me to probe my resources, using e.g. HTTP or DNS
- Node Exporter exports metrics from a system, which I can use to monitor it and possibly send alerts - e.g. if only 10% of space is left on the root partition
- Alertmanager takes care of sending alerts, in my case by email
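As a rough illustration of how these pieces fit together, a Prometheus scrape job for Blackbox Exporter typically rewrites each target into a `/probe` request against the exporter. The sketch below assumes the exporter listens on `127.0.0.1:9115`, that a module named `http_200` exists, and uses placeholder hostnames; none of these come from my real setup.

```yaml
# prometheus.yml (sketch)
scrape_configs:
  - job_name: blackbox_http
    metrics_path: /probe
    params:
      module: [http_200]                 # assumed module name
    static_configs:
      - targets:
          - https://www.example.org      # placeholder URL to probe
    relabel_configs:
      # Pass the original target as the ?target= parameter of the probe
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual scrape to the Blackbox Exporter itself
      - target_label: __address__
        replacement: 127.0.0.1:9115      # assumed exporter address

  - job_name: node
    static_configs:
      - targets: ['host.example.org:9100']   # placeholder Node Exporter target
```

Alerting rules can then be written against the resulting metrics. Again, the alert names, thresholds and durations below are illustrative choices, not the ones I actually use:

```yaml
# rules.yml (sketch)
groups:
  - name: example
    rules:
      - alert: ProbeFailed
        expr: probe_success == 0
        for: 5m
        labels:
          severity: warning
      - alert: TLSCertExpiringSoon
        # Less than 14 days until the earliest certificate in the chain expires
        expr: probe_ssl_earliest_cert_expiry - time() < 14 * 86400
        labels:
          severity: warning
      - alert: RootFilesystemAlmostFull
        # Less than 10% of space available on /
        expr: node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} < 0.10
        for: 10m
        labels:
          severity: warning
```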
I spent a bit of time on one "circular dependency": if my local DNS resolver is
down, then all my tests will fail, triggering a lot of alerts - and of
course, I wanted to avoid that.
Alertmanager has a very elegant solution, called inhibit_rules, where you
define a severity that, if triggered, suppresses the alerts that depend on it.
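A minimal sketch of such a rule, assuming a hypothetical alert named `LocalResolverDown` labelled `severity: critical`, and assuming the dependent probe alerts are labelled `severity: warning`:

```yaml
# alertmanager.yml (sketch)
inhibit_rules:
  # While the source alert is firing...
  - source_matchers:
      - alertname = "LocalResolverDown"   # hypothetical alert name
      - severity = "critical"
    # ...notifications for alerts matching these labels are muted
    target_matchers:
      - severity = "warning"
```

The `source_matchers` and `target_matchers` keys are the syntax used by recent Alertmanager releases; older ones use `source_match` and `target_match` instead. An `equal` list can also be added to restrict the inhibition to alerts that share given label values with the source alert.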
Wrap Up
I'm really pleased by how elegant the whole setup is:
- adding new services is super easy,
- the whole configuration is done through YAML config files,
- I can monitor services either very granularly, or just with basic
tcp_connect checks.
Next steps
So far, I'm sending alerts by email; I will probably define some text/Telegram alerts for the most urgent ones.
I may also look into adding this to my Grafana setup, because who doesn't like a nice-looking Grafana dashboard?
Tags: IT