In this article I define the term observability, explain why you need it, and outline the basic steps necessary to achieve it. I then present a generic learning guide for implementing observability for Kubernetes with the Prometheus stack.
Kubernetes observability article series
This is the first part of a series of blog posts about Kubernetes observability. You can find all other articles of the series here!
Introduction to Observability
It is always important to quickly determine that your system is malfunctioning, and to reduce the time needed to diagnose and fix the problem. For smaller, non-distributed systems which run on a single machine, black-box monitoring (e.g. Nagios-style checks) is enough to determine whether something is wrong. Once a check failed, alarms would fire, and you’d look at the syslog to determine what’s wrong.
However, applications with a large user base need to scale with the request load. Platforms like Kubernetes implement horizontally-scaling architectures, where the components of your system are distributed over many compute nodes, which are dynamically started and stopped. Unfortunately, such systems are much harder to monitor, because you now have to gather, aggregate and analyze different kinds of data (such as logs) from all your nodes.
In recent years, the term “observability” (for distributed systems) has emerged to address this problem. I highly recommend the excellent primer “Distributed Systems Observability” by Cindy Sridharan to learn more about observability (free e-book available here).
What is Observability?
According to Cindy Sridharan, observability is a system property that lets you quickly understand what is going wrong inside your system (and enables you to fix it quickly), by observing its outputs. However, observability is more than what many practitioners think: just collecting logs, metrics and distributed traces, and doing alerting on the metrics, is not enough. Instead, you’ll need to change your way of thinking entirely and put additional work into multiple phases of your software development life cycle (SDLC). You really need to “bake” observability into your system. To be more concrete, let’s look at possible modifications of the SDLC phases:
- Development: while designing and implementing your system, you need to both modify your own application’s code and get a good understanding of the third-party systems you use:
- Regarding your own application there are many things you can do, such as:
- Add extensive logging (using structured logs to simplify parsing)
- Add (distributed) tracing code (in case you have multiple microservices that call each other)
- Expose metrics (also referred to as “instrumenting your app”) that are specific to your application, and use APM tools that expose additional, generic (non-application-specific) metrics (e.g. garbage collector cycles)
- Implement a /health HTTP endpoint used by recovery mechanisms of platforms such as Kubernetes (see the code sketch after this list)
- Design useful and application-specific alerts that are indicative of the problem – these don’t necessarily have to be technical but could also be business KPIs. For instance, you could expose login/logout event metrics, such that an alert can perform a trend analysis and notify you once something is wrong, e.g. if your monthly logins start declining
- Regarding third-party services that you use as part of your system (e.g. databases, load balancers, etc.), you need to understand which metrics they (or their exporters) expose, and only scrape those that are relevant for your system
- Testing: in addition to writing your usual “pre-production phase” tests (which run in some separate (integration) environment, e.g. in a CI pipeline), you also write tests that simulate realistic (load) conditions in your production environment, or an exact clone of it. This includes writing tests that deliberately trigger your alarms – otherwise, how would you know that your configured alarms really work? – and applying the chaos monkey pattern.
- Operations: you configure and run monitoring software (such as the Prometheus stack), which continuously monitors metrics and generates alerts, and offers dashboards which you can use on-demand to diagnose problems (e.g. analyzing logs and traces). You also monitor the deployment process itself, e.g. by collecting event data created as part of your CI/CD pipeline.
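To make the development-phase items above more concrete, here is a minimal sketch of a small web service that writes structured (JSON) logs, exposes application-specific Prometheus metrics and offers a /health endpoint. It assumes Flask and the official prometheus_client library; the metric names, port and endpoints (e.g. myapp_logins_total) are illustrative choices for this sketch, not prescriptions.

```python
import json
import logging
import time

from flask import Flask, Response, jsonify
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest

# Structured logging: one JSON object per line, which is easy to parse later on.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "msg": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.basicConfig(level=logging.INFO, handlers=[handler])
log = logging.getLogger("myapp")

app = Flask(__name__)

# Application-specific metrics ("instrumenting your app"); the names are hypothetical.
LOGINS = Counter("myapp_logins_total", "Number of successful logins")
LATENCY = Histogram("myapp_request_duration_seconds", "Request latency in seconds")

@app.route("/login", methods=["POST"])
def login():
    start = time.time()
    # ... authenticate the user here ...
    LOGINS.inc()
    LATENCY.observe(time.time() - start)
    log.info("user logged in")
    return jsonify(status="ok")

@app.route("/health")
def health():
    # Used by recovery mechanisms (e.g. Kubernetes liveness/readiness probes);
    # return a non-200 status code here if a critical dependency is down.
    return jsonify(status="ok"), 200

@app.route("/metrics")
def metrics():
    # Endpoint scraped by Prometheus.
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

In Kubernetes, /health would typically be wired up as a liveness or readiness probe, while /metrics would be scraped by Prometheus.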
Observability vs. monitoring
Observability is a very loaded term. There is no clear definition of what separates observability from monitoring, only that observability is more than monitoring. The above definition is Cindy’s, and others may have different views, tool vendors in particular, who need to sell you something. See this excellent article for a discussion.
This article is the first part of a series on how to do observability in Kubernetes with the Prometheus stack. I quickly found out one way how not to do it: just following some tutorial. That did not work, because I was missing a lot of fundamental knowledge. I didn’t understand how the Kubernetes-specific tooling (e.g. the Prometheus operator) worked – I did not even understand how standalone Prometheus works, nor did I have a proper understanding of basic observability patterns and best practices. However, you need to understand all these aspects to be able to monitor and observe the right parts of your system, and to debug problems caused by the observability tools themselves (e.g. an alert not firing even though it should).
Why Prometheus, Loki and Grafana?
There are many tools in the observability space, from Open Source Software like Netdata or Zabbix, to commercial (SaaS) products such as Datadog, Splunk, LightStep, Sumo Logic, or Instana. In this article series I focus on the “Prometheus stack” (with the accompanying tools Alertmanager, Loki and Grafana), which is OSS and has become very popular, in the Kubernetes context and otherwise. That is, you can also operate Prometheus etc. on bare metal, or using Docker (Compose).
If you’re also interested in distributed tracing, I suggest you look at Grafana’s Tempo.
Learning guide
When I started learning about this field, I had no clear idea where to start. Over time, I figured out which blocks of knowledge build upon each other. I identified the three tracks depicted below in a kind of flow chart. You can (and should) do the Technical track and the High-level observability track in parallel. Think of them like the “theory and practice” tracks you go through when getting your driver’s license. Only once you have finished both of these tracks does it make sense to start working with Prometheus in Kubernetes.
For the sake of simplicity, this learning guide is limited to Prometheus, Alertmanager and Grafana. Log management with Loki is not part of it, because it is an “easy” addition, once you are familiar with all other concepts. I’ll go into more details about Loki in a future article.
Let’s look at the goal of each individual step in more detail:
- Technical track:
- Understand high-level architecture and data flow: learn about the different components / (Linux) processes of the Prometheus stack, what each of them does, and what data flows to which other component at what time.
- Learn the concepts and terminology: Prometheus introduces a lot of concepts, such as metric, time series, instrumentation, recording/alerting rules, target label, etc., all of which you need to grasp well.
- Learn where and how to configure and connect the different processes: understand which of the components are configured with (stateless) configuration files, and which ones store their configuration in a mutable / stateful database. The tools in the Prometheus stack are all configured in different places and with different approaches (a minimal configuration sketch follows after this list).
- Learn PromQL: get acquainted with Prometheus’ query language, which you need to build alerts and Grafana dashboards.
- Learn how to use existing dashboards, modify them & create your own: not only do you need to understand how to iteratively build dashboards (starting from existing ones) to save time, but you also need to gain some experience to avoid building overloaded dashboards.
- Learn how to build & select alerts for third-party services: in addition to building alerts for your own components (see the third task in the High-level track), you also need to determine which Prometheus exporter(s) to use for third-party services (e.g. a database), and for which of their many metrics you should create alarms.
- High-level track:
- Have a general understanding of “observability” (or define your own): in your team, make sure everyone has the same understanding of what observability means (for you). You can use an existing definition, or establish your own, narrower one that works well for your particular system and team.
- Establish an observability strategy, detailing your goal and a roadmap: this includes defining the overall goal that observability should help you achieve (using “hard” numbers, e.g. in the form of SLOs), how you want to start out (adding observability gradually), the (man-hour) budget allocated to building observability features, whether (and how) to implement “on call”, and how you track your incidents.
- Learn high-level alert design principles: this is essentially a list of principles that practitioners in this field have gathered over time. For instance, you should start out building alerts from your users’ perspective (using business KPIs) rather than technical alerts (that target servers or services) – see the example alerting rule after this list.
- Gather monitoring tips: for each type of component of your system (such as frontend, server node, or service) there is already a lot of empirical evidence and tips regarding how to do instrumentation and what alerts to write for which metrics. For instance, the most important factor for frontends is typically response time.
- Kubernetes track:
- Understand available tooling and the surrounding ecosystem: there are many different ways to deploy and customize the Prometheus stack in Kubernetes (such as the Prometheus operator), and you need to choose one of them. There is also an ecosystem built by the community, consisting of additional useful resources and tools that simplify the process, such as kube-prometheus.
- Re-learn configuration […] using native K8s objects: when using the Prometheus operator for Kubernetes, you will no longer write “traditional” YAML configuration files for Prometheus etc., but configure the stack with Kubernetes-native objects (custom resources, e.g. a ServiceMonitor), which the operator translates to traditional configuration files at run-time (see the example after this list).
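To make some of these steps more tangible, here are a few minimal sketches. First, a bare-bones configuration file for a standalone Prometheus, showing where scrape targets, rule files and the Alertmanager connection are defined. All target names and file paths are placeholders:

```yaml
# prometheus.yml – minimal sketch with placeholder targets and paths
global:
  scrape_interval: 30s

rule_files:
  - /etc/prometheus/rules/*.yml   # alerting/recording rules live in separate files

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]   # where Prometheus sends fired alerts

scrape_configs:
  - job_name: myapp
    static_configs:
      - targets: ["myapp:8080"]   # the /metrics endpoint of the example app above
```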
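Next, an example of an alerting rule written from the user’s perspective (a business KPI rather than a server metric), as recommended by the alert design principles above. The metric name, threshold and durations are hypothetical and assume a login counter like the one in the instrumentation sketch:

```yaml
# rules/business-kpis.yml – hypothetical business-KPI alert
groups:
  - name: business-kpis
    rules:
      - alert: MonthlyLoginsDeclining
        # PromQL expression: fire if fewer than 1000 logins happened in the last 30 days
        expr: sum(increase(myapp_logins_total[30d])) < 1000
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Logins over the last 30 days are below the expected baseline"
```

The same PromQL skills carry over to Grafana dashboards; a latency panel could, for example, use a query such as histogram_quantile(0.95, sum(rate(myapp_request_duration_seconds_bucket[5m])) by (le)).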
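Finally, a sketch of a ServiceMonitor object for the Prometheus operator, which replaces the scrape_configs section shown above. The label selectors and port name are assumptions that must match your Service and your operator installation:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp
  labels:
    release: prometheus   # must match the operator's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: myapp          # selects the Kubernetes Service of the example app
  endpoints:
    - port: metrics       # name of the Service port that exposes /metrics
      interval: 30s
```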
Conclusion
Getting Prometheus up and running in Kubernetes is no easy task, because you need to have a lot of background knowledge. With the learning guide presented in this article, I’ve only drawn a rough map, outlining the journey to a Prometheus-based observability stack for Kubernetes. However, you still need concrete learning resources for each step of the learning guide.
For the steps covered by the High-level track, I recommend that you take a look at the book Practical Monitoring – Effective Strategies for the Real World (2017), which provides an excellent overview. For the remaining steps, check out the other articles of this series.