Kubernetes Observability – Part VI: Prometheus in Kubernetes guide

This article discusses the different options you have to install Prometheus in Kubernetes, and then explains the installation of the Prometheus operator in detail, using the kube-prometheus-stack Helm chart. I conclude with how to upgrade your Prometheus stack as well as your own alerting rules and Grafana dashboards.

Kubernetes observability article series

This article is part of a series of blog posts about Kubernetes observability. You can find all other articles of the series here!

Introduction

The Prometheus stack is a popular set of tools used to achieve observability of your system. As I explained in the learning guide outlined in part I of this Observability article series, the Prometheus stack can be used on individual servers (e.g. with Docker), but is designed to monitor large-scale distributed systems. The Prometheus stack shines when you use it in Kubernetes, for several reasons:

  • Prometheus comes with dynamic service discovery (SD) for Kubernetes: in the scrape_configs section of the Prometheus configuration file you can add a kubernetes_sd_configs entry, which instructs Prometheus to contact the Kubernetes API server to discover available nodes, pods, services, etc., and to scrape metrics from them.
  • Many components of Kubernetes (e.g. the API server) already export metrics in the Prometheus metrics format.
  • A large ecosystem (kube-prometheus) has formed to further improve the integration of Kubernetes and Prometheus (and related components, such as Alertmanager, Grafana, etc.)
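
To illustrate the first point, here is a minimal sketch of a Prometheus-native scrape configuration that uses Kubernetes service discovery. The job name is made up, and filtering on a prometheus.io/scrape pod annotation is only a common convention that I assume here, not something Prometheus requires:

```yaml
# prometheus.yml (fragment) - illustrative sketch only
scrape_configs:
  - job_name: kubernetes-pods        # hypothetical job name
    kubernetes_sd_configs:
      - role: pod                    # other roles include node, service, endpoints, ingress
    relabel_configs:
      # Keep only pods that carry the annotation prometheus.io/scrape: "true"
      # (assumed convention; adapt to your own cluster)
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```

When you use the Prometheus operator (discussed below), you usually do not write this section by hand, because the operator generates it from ServiceMonitor and PodMonitor CRs.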

Required knowledge

To make the most of this article, you need to be familiar with Prometheus and Kubernetes:

Regarding Prometheus: you first need to understand Prometheus well, including its architecture, concepts, how to configure it, and the PromQL language, for which you can consult my other articles (see above).

Regarding Kubernetes: you should be familiar with basic concepts, including Helm charts and Kubernetes operators.

In this article I’ll discuss the different installation options, explain the installation of the Prometheus operator in detail, and conclude with how to upgrade your Prometheus stack, alerting rules and Grafana dashboards.

Prometheus installation options for Kubernetes

To get Prometheus up and running in Kubernetes, you have three options:

  1. Install only Prometheus (and a few related components, such as Alertmanager, but not Grafana) using the Prometheus Helm chart. Configuration is done using the Prometheus-native files (such as prometheus.yml, or alert-rules.yml).
  2. Install the Prometheus Operator using its Helm chart, without any CRs: this approach is not recommended, because this way of installing the operator is now deprecated. Use option 3 instead. The Prometheus operator (which, by itself, is not at all deprecated) is a Kubernetes operator that lets you provision and operate various components (such as Prometheus and Alertmanager) using CRDs (documented here). In other words: configuration is done using Kubernetes-native objects, instead of Prometheus-native files. The Prometheus operator converts them to the Alertmanager/Prometheus-native configuration format on the fly.
  3. Install the Prometheus Operator along with Grafana and many default alerting rules and dashboards, using the kube-prometheus package. kube-prometheus is a GitHub project that contains a huge library of best practices and instructions for how to install the Prometheus operator, including many sensible default configurations.

Option 3 is the recommended approach. The next section presents more details.

In summary (showing only options 1 & 3):

                             | Prometheus Helm chart | kube-prometheus-based Prometheus operator
Configuration type           | Prometheus-native | Kubernetes-native
Installed default components | Prometheus, Alertmanager, Node exporter, kube-state-metrics | See option 1, plus Grafana. Comes with a pre-installed library of alerting rules and dashboards

If you are unclear about the projects kube-state-metrics, prometheus-adapter and metrics-server and how they relate to Prometheus, read the following background explanation.

Kubernetes has its own approach to dealing with metrics, which is independent of Prometheus, but also much less powerful. One of the Kubernetes SIGs (Special Interest Groups) has defined metric-related APIs (see here), such as the Resource-metrics API (which exposes CPU & RAM usage of nodes, pods and containers) and the Custom-metrics API (which exposes arbitrary metrics that you can define). APIs require implementations that actually serve them (see list of implementations). A commonly used implementation is the metrics-server, which is often pre-installed in a Kubernetes cluster, but can also be installed using a Helm chart (run kubectl get pods -A | grep metrics-server to determine whether it is installed). Once it is installed, you can use commands such as kubectl top, or use the horizontal pod autoscaler (HPA), which requires a running metrics-server to do its work.

However, the metrics-server / metrics API is not intended for observability. It collects only a few metrics (such as CPU and memory), keeps only the recent history of samples in volatile memory (it is not persistent), and does not offer alerting. This is why you should install Prometheus in addition to metrics-server, using one of the above recommended installation options. Prometheus persistently stores many different kinds of metrics, and does offer alerting.

There are two more projects often used in practice:

  • The kube-state-metrics project is a separate service that watches Kubernetes API server events, and generates all kinds of metrics for the state of custom workloads installed in the cluster, e.g. the state of pods, deployments, etc., exposing them on a /metrics endpoint in Prometheus format, for Prometheus to scrape them. You need kube-state-metrics, because the metrics that Kubernetes components (such as the API server) expose by default are about the state of the services in general (e.g. number of served requests), not about the state of the objects/workloads that they store inside.
  • prometheus-adapter: if you use the HPA, and the resource-metrics (CPU / RAM) data is not suitable to determine whether the replica count of your service should be increased/decreased, you can also instruct the HPA to use custom metrics instead (using the Custom-metrics API mentioned above). prometheus-adapter is a bridge component that can serve these custom metrics, from a running Prometheus instance. It also can (but does not necessarily have to) replace the metrics-server on clusters that do not already have the metrics-server installed.
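
To make the prometheus-adapter use case more concrete, here is a sketch of an HPA that scales on a custom per-pod metric instead of CPU / RAM. The Deployment name my-app and the metric name http_requests_per_second are hypothetical, and the sketch assumes that prometheus-adapter is installed and configured to serve that metric via the Custom-metrics API:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app                         # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app                       # hypothetical Deployment to scale
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods                       # per-pod custom metric, served via the Custom-metrics API
      pods:
        metric:
          name: http_requests_per_second   # hypothetical metric exposed by prometheus-adapter
        target:
          type: AverageValue
          averageValue: "100"          # scale out when the average per pod exceeds 100
```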

Detailed look at the Prometheus operator

Using the Prometheus operator (from the kube-prometheus package) is the recommended approach to install and operate Prometheus in Kubernetes. It will require you to learn about new concepts that are specific to the operator, but the advantage is that the operator simplifies many tasks. As the official GitHub repo states:

  • Simplified Deployment Configuration: Configure the fundamentals of Prometheus like versions, persistence, retention policies, and replicas from a native Kubernetes resource.
  • Prometheus Target Configuration: Automatically generate monitoring target configurations based on familiar Kubernetes label queries; no need to learn a Prometheus specific configuration language.

The last sentence, no need to learn a Prometheus specific configuration language, is just marketing BS. You do need to be familiar with most concepts of Prometheus and Alertmanager, as you would otherwise not be able to debug problems. Also, some aspects of the configuration require you to literally write YAML sections that would normally be part of a prometheus.yaml, alertmanager.yaml or alert-rules.yaml file.

In the following subsections we will look at central concepts of the Prometheus operator, how to install it into your cluster, how to configure monitoring of additional services, and how to get custom dashboards into Grafana.

Central concepts of the Prometheus operator

Like any typical Kubernetes operator, the Prometheus operator scans your cluster for new or changed CRs (Custom Resources), and then deploys additional pods or adapts configurations in accordance with these CRs. The most relevant CRDs are:

  • Prometheus: defines a Prometheus deployment (which the operator will deploy as StatefulSet). It provides options to configure replication (HA mode), persistent storage, and Alertmanagers to which the deployed Prometheus instances send alerts.
  • Alertmanager: defines an Alertmanager deployment (which the operator will deploy as StatefulSet). It provides options to configure replication (HA mode) and persistent storage.
  • ServiceMonitor: specifies Kubernetes services that should be scraped by Prometheus. The Prometheus operator automatically generates scrape configurations for them and configures the Prometheus instances to use them.
  • PodMonitor: like ServiceMonitor, but for Kubernetes pods.
  • PrometheusRule: defines Prometheus alerting and/or recording rules. The Operator generates a rule file, and configures the Prometheus instances to use it.
  • AlertmanagerConfig: specifies subsections of the Alertmanager configuration, allowing routing of alerts to custom receivers, and setting inhibit rules.

There are a few more CRDs you only need in certain instances. You can find the exhaustive, detailed description of the CRDs in the operator’s design docs (definitely a recommended reading).
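
To give you an idea of what such a CR looks like, here is a minimal sketch of a PrometheusRule. The names, the namespace and the alert itself are made up for illustration. The release label is an assumption on my part: with the kube-prometheus-stack chart's defaults, rules typically need a label matching the Helm release name to be picked up, so check the rule selector of your own Prometheus CR:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: example-rules                  # hypothetical name
  namespace: monitoring                # hypothetical namespace
  labels:
    release: my-release                # assumption: must match your Prometheus CR's rule selector
spec:
  groups:
    - name: example.rules
      rules:
        - alert: TargetDown            # hypothetical alert
          expr: up == 0                # fires for every target whose last scrape failed
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Target {{ $labels.instance }} is down"
```

The content below spec.groups has the same structure as a Prometheus-native rule file, which is one of the places where your existing Prometheus knowledge carries over directly.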

Installation of the operator

Installing the kube-prometheus package can be done in two ways: using jsonnet (see official instructions), or using a Helm chart called kube-prometheus-stack. I will discuss the Helm chart option in detail, because most people are familiar with Helm already. While the jsonnet approach may be more flexible than the Helm chart, it requires that you take the time to learn jsonnet and the surrounding tools. I’ve found the jsonnet syntax to be as incomprehensible as Objective-C, so that was a natural impediment for me to learn it.

To install kube-prometheus-stack, first add the Helm chart’s repository, as instructed here. Create a custom-values.yaml file in which you customize many different details (see below). Finally, assuming that your kube config is set up correctly (using the right context for the right cluster), run helm upgrade --install -f custom-values.yaml [RELEASE_NAME] -n monitoring --create-namespace prometheus-community/kube-prometheus-stack to install (or upgrade) it.

As with any Helm chart, you will need to study how it works by looking at the Chart.yaml and values.yaml files, so that you override the options that make sense for you in your custom-values.yaml file. Here are a few pointers I found helpful:

  • In the kube-prometheus-stack values.yaml file, the configuration options below the keys “grafana”, “kubeStateMetrics” and “nodeExporter” are not well-documented, because these are sub-charts defined in the Chart.yaml file. The Chart.yaml file points to the GitHub repos where these respective charts are found, so refer to them to learn how these options work.
  • The chart creates a large number of alerting rules (see defaultRules), but does not create a default Alertmanager configuration (that dictates which receivers there are and how alerts are routed to receivers). You still need to provide your own Alertmanager configuration, by either creating a yaml file that contains an AlertmanagerConfig CR (see the respective section in the operator’s design docs), or by providing the configuration in the custom-values.yaml file in section alertmanager.config.
  • In the values.yaml file, search for configuration options starting with “additional…” to learn which custom options you can set in the Helm chart. Interesting findings (imo) are:
    • additionalPrometheusRulesMap
    • prometheus.prometheusSpec.additionalScrapeConfigs
    • prometheus.prometheusSpec.additionalAlertManagerConfigs
    • From the names of these options (and the inline documentation) it should be quite obvious how to use them.
  • You should tweak the persistent storage used by Prometheus and Alertmanager. While a few GBs of storage will be sufficient for Alertmanager (which does not store much data), things are more complex for Prometheus, because its storage requirements depend on the retention period and the number of samples scraped per unit of time, which is different for every system. See here for pointers. Regarding Grafana: the kube-prometheus-stack Helm chart deliberately treats Grafana as a stateless component, even though it is usually operated statefully, e.g. to store users or dashboards. The reason is that all configuration data should be stored in (plain-text) files. This is already the default for Prometheus and Alertmanager, and the Helm chart applies the same principle to Grafana by leaving grafana.persistence.enabled at its default value of false (see here).
    • For Alertmanager, tweak the options alertmanager.alertmanagerSpec.retention and alertmanager.alertmanagerSpec.storage
    • For Prometheus, tweak the options prometheus.prometheusSpec.retention and prometheus.prometheusSpec.storageSpec
  • Limiting the resource requests and limits is always a good idea, especially for Kubernetes clusters with pre-defined resource quotas. Set prometheus.prometheusSpec.resources, alertmanager.alertmanagerSpec.resources, grafana.resources, grafana.sidecar.resources, nodeExporter.resources, and kubeStateMetrics.resources.
  • Fully automatic high availability of Prometheus is still on the roadmap for the Prometheus operator team (see here for details). The key concepts for scaling Prometheus are sharding and federation. For now, you can either do sharding yourself (limiting each Prometheus CR to scrape only certain targets, e.g. using the serviceMonitorSelector option), or (additionally) use the shards feature of the kube-prometheus-stack Helm chart, but be aware of the complexities that this involves.
  • Set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues to false. Otherwise, the Prometheus operator would only detect those ServiceMonitor CRs which were installed as part of the kube-prometheus-stack Helm chart (having the “release: [RELEASE NAME]” label), and would not detect other ServiceMonitor CRs from other sources, having different labels. Presumably, the default value (true) was chosen to limit the resource usage of the Prometheus installation.
  • Set grafana.sidecar.dashboards.searchNamespace to ALL, so that Grafana picks up ConfigMaps containing dashboards in every namespace. Otherwise (the default behavior), Grafana only detects ConfigMaps located in the same namespace as Grafana itself. See below for more details about Grafana dashboards.
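
Several of the pointers above can be combined into a custom-values.yaml along the following lines. Treat this as a sketch rather than a recommendation: all sizes, retention periods and resource figures are placeholder values I made up, and you should verify each key against the values.yaml of the chart version you actually install:

```yaml
# custom-values.yaml (sketch) - all numbers are placeholders, not recommendations
prometheus:
  prometheusSpec:
    retention: 15d
    # Also pick up ServiceMonitor CRs that do not carry the Helm release label
    serviceMonitorSelectorNilUsesHelmValues: false
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 50Gi
    resources:
      requests:
        cpu: 500m
        memory: 2Gi
      limits:
        memory: 4Gi

alertmanager:
  alertmanagerSpec:
    retention: 120h
    storage:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 2Gi
```

If your cluster has no default StorageClass, you would additionally have to set storageClassName inside each volumeClaimTemplate spec.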

Configuring ServiceMonitors

As explained above, ServiceMonitor CRs tell Prometheus to scrape specific Kubernetes services, based on label selectors. Apart from the design docs, the API definition is a valuable resource to learn more about the ServiceMonitor CRD.

By default, a ServiceMonitor CR must be in the same Kubernetes namespace as the Endpoints of the Service it refers to, unless you define .spec.namespaceSelector.any = true in the CR.
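
As a sketch of the cross-namespace case, the following ServiceMonitor lives in one namespace but selects matching Services in all namespaces. The names, namespace and label values are hypothetical:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-monitor                 # hypothetical name
  namespace: monitoring                # hypothetical namespace, differs from the Service's
spec:
  namespaceSelector:
    any: true                          # look for matching Services in every namespace
    # Alternative: restrict to specific namespaces instead of any: true
    # matchNames:
    #   - my-app-namespace
  selector:
    matchLabels:
      app: my-app                      # hypothetical label on the Service
  endpoints:
    - port: metrics                    # name of the Service port that serves /metrics
      interval: 30s
```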

It generally makes sense to examine which pre-existing services there are in your cluster, which are not directly related to your application, and create ServiceMonitor CRs for them. Typical examples include the Ingress controller or a GitOps controller, such as ArgoCD.

Here is a ServiceMonitor example for the Traefik Ingress controller:


apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: traefik-monitor
  labels:
    app: traefik
spec:
  selector:
    matchLabels:
      app: traefik
  endpoints:
    - port: metrics

Setting up Grafana dashboards

To get custom dashboards into Grafana, the kube-prometheus-stack uses ConfigMaps. The mechanism behind this is a sidecar container for Grafana. In values.yaml, grafana.sidecar is enabled by default. This causes the installation of the k8s-sidecar container in the Grafana Pod, which continuously looks for ConfigMaps with a label such as grafana_dashboard: "1". These ConfigMaps need to contain the dashboard JSON as data value. Upon detection, k8s-sidecar writes that JSON as a file into a volume shared with the Grafana container, causing Grafana’s provisioning mechanism to pick it up.

To get a dashboard into the cluster, follow these steps:

  • First you need a dashboard in json form (e.g. this one). Often, you get this dashboard by interactively building it in Grafana’s web UI, and then exporting it as file (docs).
  • Convert the JSON file to a ConfigMap, e.g. via
    kubectl create configmap config-map-name --from-file=dashboard.json -o yaml --dry-run=client > config-map.yaml
  • Edit config-map.yaml with a text editor: add the label grafana_dashboard with value "1"
  • Apply the config map to your cluster (kubectl apply -f config-map.yaml -n <namespace>)
  • If Grafana did not pick up the dashboard automatically, you may have to restart Grafana.
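
The result of these steps should look roughly like the following ConfigMap. This is a sketch: the ConfigMap name is hypothetical, and the dashboard JSON is reduced to a stub, whereas a real export from Grafana is much longer:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-dashboard                   # hypothetical name
  labels:
    grafana_dashboard: "1"             # label that the Grafana sidecar looks for
data:
  # The key becomes the file name inside the Grafana container
  dashboard.json: |
    {
      "title": "Example dashboard",
      "panels": []
    }
```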

Installing and scraping additional exporters

Getting a service (e.g. a database, message broker, etc.) monitored in Kubernetes always involves these steps:

  • Choose and deploy a suitable exporter (using Kubernetes mechanisms, such as a Deployment), unless that service already comes with a built-in /metrics endpoint,
  • Write a ServiceMonitor that instructs Prometheus to scrape that exporter,
  • Write alerting rules for that exporter.

Fortunately, many Helm charts of common services already have options that do all of these steps for you. An example is the Bitnami Postgres Helm chart, where setting metrics.enabled, metrics.serviceMonitor.enabled and metrics.prometheusRule.enabled to true will take care of everything.

For many other common exporters (such as Blackbox exporter, redis, rabbitmq, or Pushgateway) you can find supported Helm charts here.

About Prometheus target labels

In this article I explained the concept of instrumentation vs. target labels. Target labels are attached by Prometheus at the time of scraping. When using Prometheus outside of Kubernetes, Prometheus automatically adds the target labels job="..." and instance="...".

However, when running Prometheus in Kubernetes, several other labels are added, as illustrated by the screenshot below, which I took of the Prometheus web UI (Status -> Targets):

[Figure: auto-generated target labels when running Prometheus in Kubernetes, e.g. container and namespace]

If you want to add more (static) target labels to all metrics scraped from a Service for which you set up a ServiceMonitor, you have two options:

  • Update both the Service and ServiceMonitor: you add your static label (somekey: somevalue) under the metadata.labels of your Service. In your ServiceMonitor, add the label key to .spec.targetLabels, which is an array. See here for an example.
  • Update only the ServiceMonitor, using the Prometheus relabeling feature, e.g. as demonstrated here. Notes:
    • The original post is looking for help, claiming that their relabelings example does not work. It did work just fine in my experiments.
    • Prometheus does not actually replace the source label. In the linked example, the source labels, [__name__], will not be deleted. Instead, Prometheus actually just adds your targetLabel. I am not sure what Prometheus folks were thinking when making up the naming of their YAML syntax…
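
Both options can be sketched in a single ServiceMonitor. The label names team and environment, the value production and the app name are hypothetical. For the first option, I assume that the Service itself carries a team label under its metadata.labels:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-monitor                 # hypothetical name
spec:
  selector:
    matchLabels:
      app: my-app                      # hypothetical Service label
  # Option 1: copy the value of the Service's "team" label onto all scraped metrics
  targetLabels:
    - team
  endpoints:
    - port: metrics
      # Option 2: add a static target label via relabeling
      relabelings:
        - targetLabel: environment     # label to add
          replacement: production      # static value; the default action is "replace"
```

In practice you would pick one of the two options. They are combined here only to keep the example short.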

Upgrading your monitoring stack

Most tutorials forget (?) to mention that just running the helm install command once (from your developer laptop) is a bad idea. You should update your monitoring stack from time to time, in an automated fashion. Tools such as dependabot or Renovate bot are a great help, in combination with configuration management as code tools, such as Helmfile or Helmsman. You can find a detailed introduction to Renovate bot in my previous article here. Use these bots to regularly check for new kube-prometheus-stack Helm chart releases, and have them automatically apply minor version updates of the Helm chart to your cluster / Git repository.

Whenever there is a new major release of the chart, however, you should avoid fully-automated updates. Instead, have the bot create a Merge/Pull Request (to notify you), followed by manually checking the Helm chart upgrade section for potential caveats first.
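
As a sketch of what such a configuration-as-code setup could look like with Helmfile, consider the following helmfile.yaml. The release name is hypothetical and the version is a placeholder that you must replace with the chart release you actually run. The pinned version line is what a bot such as Renovate would bump in a Merge/Pull Request:

```yaml
# helmfile.yaml (sketch)
repositories:
  - name: prometheus-community
    url: https://prometheus-community.github.io/helm-charts

releases:
  - name: monitoring                   # hypothetical release name
    namespace: monitoring
    chart: prometheus-community/kube-prometheus-stack
    version: "<pinned chart version>"  # placeholder: pin an explicit version here
    values:
      - custom-values.yaml
```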

Updating alerting rules and dashboards

Whenever you tweak the files that contain your alerting rules (or Grafana dashboards), you need a mechanism that automatically deploys them to your cluster:

  • Updating alerting rules:
    • Option 1 – push (with Helm): if you provided all your alerting rules in the custom-values.yaml file (in additionalPrometheusRulesMap), then commit them to Git and build a job in your CI/CD pipeline which re-runs the helm upgrade command for the kube-prometheus-stack Helm chart (this is the “CI Ops” approach).
    • Option 2 – pull (with GitOps): as described in my previous article, you can run a GitOps operator (ArgoCD, etc.) which regularly pulls changes from your Git repository. Put your alerting rules into yaml file(s), as PrometheusRule CRs (with file contents similar to what the kube-prometheus-stack Helm chart would create, see here) and store them in Git.
  • Updating dashboards: this works exactly like updating alerting rules. The only difference is that you cannot specify additional dashboards in the custom-values.yaml file directly. You have to either push dashboard config yaml files in a CI/CD job (using kubectl apply -f ... commands), or use the GitOps approach as described above in Option 2.
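
For Option 1, the alerting rules in custom-values.yaml could look like the following sketch. The map key, group name, metric and alert are all hypothetical. Each key below additionalPrometheusRulesMap results in one PrometheusRule CR, and the content below it follows the structure of a Prometheus rule file:

```yaml
# custom-values.yaml (fragment, sketch)
additionalPrometheusRulesMap:
  my-app-rules:                        # hypothetical key, one PrometheusRule CR per key
    groups:
      - name: my-app.rules
        rules:
          - alert: MyAppHighErrorRate  # hypothetical alert
            # http_requests_total is an assumed metric name exposed by your application
            expr: sum(rate(http_requests_total{job="my-app", code=~"5.."}[5m])) > 1
            for: 10m
            labels:
              severity: critical
            annotations:
              summary: "my-app returns more than 1 server error per second"
```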

Conclusion

Getting an entire Prometheus stack up and running in a Kubernetes cluster does take some time at first. You have to get used to the Prometheus operator concepts instead of manually writing configuration files. On the bright side, the Prometheus operator, in combination with the templates offered by the kube-prometheus project, takes a lot of work off your shoulders.

In case you run into problems, I recommend two things:

  1. Look at the Prometheus operator documentation files on GitHub,
  2. Understand how the kube-prometheus-stack Helm chart has configured the default values, by looking at the CRs (Prometheus, Alertmanager, etc.) that it created in your cluster.
