🕑 Estimated reading time: 11 min
ebriand/kubernetes-security-workshop is the GitHub companion repository of a workshop held by Eric Briand, CTO @ Zenika. Check it out if you wish to learn more about security practices within a Kubernetes cluster.
Observability is commonly described as the combination of metrics, logs and traces: metrics can be aggregated, logs capture your application events, and traces follow requests through your systems.
Prometheus was created in 2012 at SoundCloud and adopted by the CNCF in 2016. Right after Kubernetes, it was one of the first projects to cross the chasm and reach Graduated status, which attests to its robustness and stability, so much so that it is becoming a de facto solution.
In 2017 came OpenMetrics, a standardization effort that makes it easy to write and read metrics at every level, be it system, middleware (network, kubelet, control plane, database), application or business metrics, often by design! Exposing new metrics is simply a matter of following the format.
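For example, exposing a new business metric is just a matter of serving lines in the Prometheus / OpenMetrics text format on an HTTP endpoint (typically /metrics); the metric below is a made-up illustration:

```
# HELP shop_orders_total Total number of orders placed (hypothetical business metric).
# TYPE shop_orders_total counter
shop_orders_total{payment="card",country="fr"} 1027
```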
[Diagram: Prometheus architecture. Short-lived jobs push their metrics to the Push Gateway; the Prometheus server (Collector, Time Series Database, HTTP Server) discovers targets through Service Discovery and scrapes jobs, services and exporters; alerts are sent to the Alert Manager, which notifies Email or PagerDuty; the Web UI, Grafana and API clients query the HTTP Server.]
Of all these components, many can be hosted on Kubernetes: the server (Collector, Time Series Database, HTTP Server), Grafana, the exporters and the Alert Manager. The Collector scrapes metrics at a defined pace; however, some applications, such as short-lived jobs, may no longer be alive at the moment metrics are collected. To avoid losing their metrics, short-lived jobs can push them to a Push Gateway, which stores and exposes them to the Collector. The Push Gateway can be seen as a proxy component that works around the inherently short life expectancy of job metrics.
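As a minimal sketch of that workflow, a Kubernetes CronJob could push a metric to the Push Gateway just before exiting. The job name, metric name and Push Gateway address below are assumptions for illustration:

```yaml
# Hypothetical CronJob pushing one metric to a Push Gateway before it exits.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: nightly-backup
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: backup
              image: curlimages/curl:8.7.1
              command:
                - sh
                - -c
                # Push the metric in the text format to /metrics/job/<job name>.
                - |
                  echo "backup_last_success_timestamp_seconds $(date +%s)" \
                    | curl --data-binary @- \
                      http://pushgateway.monitoring.svc:9091/metrics/job/nightly-backup
```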
Applications that consume data from the HTTP Server use a dedicated query language called PromQL.
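As an illustration, here are two common PromQL expressions; the metric names assume an application instrumented with an HTTP request counter and a latency histogram:

```
# Per-second request rate over the last 5 minutes, broken down by status code.
sum by (code) (rate(http_requests_total[5m]))

# 95th percentile latency computed from a histogram metric.
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```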
Kubernetes natively exposes Prometheus-compatible metrics, and tools are available to help administrators install the whole stack, namely the Prometheus operator and kube-prometheus GitHub repositories.
How do Kubernetes operators work? They follow the reconciliation loop pattern: Observe, Analyze, Act.
These GitHub repositories install the following Kubernetes CRDs: alertmanagers.monitoring.coreos.com, podmonitors.monitoring.coreos.com, prometheuses.monitoring.coreos.com, prometheusrules.monitoring.coreos.com and servicemonitors.monitoring.coreos.com. These CRDs can be versioned, sent and processed by Kubernetes. In addition, the installation starts an Alert Manager cluster, a Grafana dashboard, kube-state-metrics aggregators, node_exporter modules, the Prometheus adapter, Prometheus itself and the Prometheus operator.
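Because alerting rules are themselves a CRD (prometheusrules.monitoring.coreos.com), they can be versioned alongside the application. A minimal sketch, assuming a hypothetical demo-app service and a ruleSelector based on the release: prometheus label:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: demo-app-rules
  namespace: monitoring
  labels:
    release: prometheus   # label expected by the Prometheus resource's ruleSelector (assumption)
spec:
  groups:
    - name: demo-app.alerts
      rules:
        - alert: DemoAppHighErrorRate
          # Fire when more than 5% of requests return a 5xx over the last 10 minutes.
          expr: |
            sum(rate(http_requests_total{job="demo-app",code=~"5.."}[10m]))
              / sum(rate(http_requests_total{job="demo-app"}[10m])) > 0.05
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "demo-app error rate is above 5%"
```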
Prometheus offers client libraries / SDKs that can be integrated into many technologies. On Prometheus's side, you can declare the new endpoint using the ServiceMonitor CRD, by referencing the new service through its labels. The operator then looks the service up and reconfigures Prometheus to retrieve and process the new metrics.
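A minimal sketch of such a ServiceMonitor, assuming a hypothetical Service labeled app: demo-app that exposes a port named metrics:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: demo-app
  namespace: monitoring
  labels:
    release: prometheus   # must match the Prometheus resource's serviceMonitorSelector (assumption)
spec:
  selector:
    matchLabels:
      app: demo-app       # labels carried by the Service to scrape
  namespaceSelector:
    matchNames:
      - demo              # namespace where the Service lives
  endpoints:
    - port: metrics       # name of the Service port exposing /metrics
      interval: 30s
      path: /metrics
```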
Note that Kubernetes annotations would work as well here; nevertheless, they are less precise than a ServiceMonitor or than the latest Kubernetes features in development, namely XMonitor, PodMonitor and NodeMonitor. Monitors are not without issues, though: sidecars like Istio, which deploy proxy containers next to services, will not get their metrics retrieved, hence the emergence of these new Monitors in Kubernetes.
Beware: by default, Prometheus is deployed in a single, defined namespace. Do not forget to set up RBAC to allow it to retrieve metrics from services located in other namespaces, as sketched below.
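A sketch of the kind of cluster-wide RBAC that lets Prometheus discover and scrape targets everywhere; the ServiceAccount name and namespace are assumptions:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  - apiGroups: [""]
    resources: ["nodes", "services", "endpoints", "pods"]
    verbs: ["get", "list", "watch"]
  - nonResourceURLs: ["/metrics"]
    verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus-k8s   # ServiceAccount used by the Prometheus pods (assumption)
    namespace: monitoring
```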
Monitoring is always good, all the more so when adding application and business metrics is easy. Following up on metrics is not: your alerts and dashboards will need maintenance to avoid alert fatigue, i.e. becoming desensitized by alerts and missing important ones, or not capturing any event at all. On the subject of following up, defining a monitoring strategy can be essential: you can assign a cluster to each team, run a global one, do both, spread over multiple datacenters, etc.
Prometheus and its ecosystem, including its query language PromQL, are powerful, but by default the retention of a cluster is very short because of its architecture: a Prometheus cluster will rarely keep more than two weeks' worth of metrics. Fortunately, there are strategies to work around that limitation, such as Thanos.
Prometheus stores its metrics as blocks on local storage and rotates blocks every two hours. It can store millions of samples and query across multiple time series at once. DigitalOcean runs Prometheus at scale and ran into a storage issue; they had no choice but to shard.
In Prometheus, performance depends on the nature of the stored metrics and on the storage hardware. Scaling means increasing the amount of available storage, which is costly, not a long-lasting or sustainable fix, and querying metrics that are a few weeks or months old remains complex. Storing the data elsewhere is not recommended, and not supported anyway.
Two good practices: always run at least one server per failure domain, and never put sources that you may want to correlate on different servers. A failure domain is basically a blast zone, such as a cluster or a datacenter, and correlating sources spread across different servers is very impractical.
Prometheus metrics can be aggregated right from Grafana by adding multiple Prometheus servers as sources of the same dashboard. Unfortunately, that multiplies the dashboard construction work and the number of widgets by the number of Prometheus sources, but data loss on one server can be covered by the information from the other servers.
A root Prometheus could be the solution, then: it collects metrics from leaf Prometheus servers. This is the recommended solution, but beware of the load! You can opt for a hybrid approach to limit the risk: federating only the time series that need to be correlated and leaving the other metrics on their origin servers.
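A sketch of the federation scrape job on the root server, pulling only pre-aggregated series from two hypothetical leaf servers:

```yaml
# Hypothetical prometheus.yml fragment for the root ("global") Prometheus server.
scrape_configs:
  - job_name: federate
    scrape_interval: 60s
    honor_labels: true              # keep the labels as exposed by the leaf servers
    metrics_path: /federate
    params:
      "match[]":
        - '{__name__=~"job:.*"}'    # federate only pre-aggregated recording rules (assumption)
    static_configs:
      - targets:
          - prometheus-eu.example.internal:9090   # leaf server, datacenter 1 (assumption)
          - prometheus-us.example.internal:9090   # leaf server, datacenter 2 (assumption)
```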
In any case, having only one Prometheus server creates a Single Point of Failure. However, the recommended solution of multiplying servers is not ideal either. Not only would you use twice the resources, you would also get two strictly identical data sources. Fronting multiple servers with an L7 load balancer is less than ideal too: if one of your servers stops working, you end up with a hole in one of your time series and a bump in the other. Unfortunately, you will have to go with the recommended solution here: doubling the Prometheus servers and sending the same metrics to both.
Thanos was released in 2017 by Improbable and is still under development. It joined the CNCF as a sandbox project in August 2019. Thanos is fully compatible with Prometheus and its ecosystem: it can potentially offer infinite data retention, handle correlation across multiple Prometheus servers and replicas, and provides better data compaction and an integrated downsampling feature.
Thanos works as a sidecar to Prometheus. The sidecar exports historical data to permanent storage, usually Amazon S3; its Store Gateway handles long-term reads; and its Querier serves as a proxy between Grafana and the sidecars. This means that Grafana talks to Thanos, which queries the Prometheus servers and the long-term storage transparently while preserving full compatibility with Prometheus's API and query language! Even the interface is very similar to Prometheus's.
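A sketch of what the sidecar container and its object storage configuration could look like; the image tag, bucket name and credentials are placeholders:

```yaml
# Fragment of the Prometheus pod spec: the Thanos sidecar shares the TSDB volume.
containers:
  - name: thanos-sidecar
    image: quay.io/thanos/thanos:v0.8.1             # image and tag are assumptions
    args:
      - sidecar
      - --tsdb.path=/prometheus                      # Prometheus local TSDB volume
      - --prometheus.url=http://localhost:9090       # Prometheus running in the same pod
      - --objstore.config-file=/etc/thanos/objstore.yml
    volumeMounts:
      - name: prometheus-data
        mountPath: /prometheus
      - name: thanos-objstore
        mountPath: /etc/thanos
---
# objstore.yml: where the sidecar ships historical blocks.
type: S3
config:
  bucket: metrics-long-term                          # bucket name is an assumption
  endpoint: s3.eu-west-1.amazonaws.com
  access_key: <ACCESS_KEY>
  secret_key: <SECRET_KEY>
```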
If you wish to use Thanos, update Prometheus: Prometheus 2.13.0 and Thanos 0.8.1 together bring noticeable improvements in CPU and RAM usage as well as in latency.
At its core, downsampling reduces the number of stored samples by grouping them and pre-computing count, sum, min, max and counter (rate and rate change) aggregates, which greatly speeds up queries over very long periods.
raw --(compaction after 48h)--> 5mn resolution --(compaction after 8d)--> 1h resolution
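Downsampling and compaction are performed by the Thanos compactor against the object storage bucket; a sketch of its arguments, with illustrative retention durations per resolution:

```yaml
# Hypothetical arguments for the Thanos compactor container.
args:
  - compact
  - --wait                                   # run continuously instead of one-shot
  - --data-dir=/var/thanos/compact
  - --objstore.config-file=/etc/thanos/objstore.yml
  - --retention.resolution-raw=30d           # keep raw samples 30 days (assumption)
  - --retention.resolution-5m=180d           # keep 5-minute samples 6 months (assumption)
  - --retention.resolution-1h=2y             # keep 1-hour samples 2 years (assumption)
```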
It is even possible to get faster by adding a cache layer on top of Thanos called Trickster. The project is still in development but already shows promising results, at least when used on top of Prometheus.
If your systems generate a lot of metrics, you might end up paying a lot for metric retention, even with downsampling enabled and fine-tuned to your needs. To reduce costs, it is recommended to also fine-tune the S3 storage class that you use: choose the cheapest one that matches your needs. S3 Standard-IA is often a good choice, unless you query your data very often.
Finally, for a presentation of Prometheus at scale by Improbable themselves: Thanos: Global, durable Prometheus monitoring.