GitOps introduction – what it is and how it works

GitOps is a methodology for Continuous Deployment, where you use Git to store (and track changes of) a declarative definition of your deployment configuration, and have a GitOps tool apply this configuration to your (production, etc.) environment. In this GitOps introduction you will learn how GitOps is defined. By looking at several examples, you will better understand who does what, when and where. I also discuss the advantages and disadvantages of GitOps, and give pointers on where to find out more about available implementations.

GitOps introduction

A key component of building and running a software system is deploying your code. With the advent of cloud technology came the desire to automate and track as many tasks as possible. In a traditional deployment, someone from the operations team is told (e.g. via a ticket filed by the development team) to deploy an updated version of the software, so they fire up their terminal, log into the production systems via SSH, and type a bunch of commands to update the application. Evidently, this is not automated at all, and it is hard to track which version is running where.

One solution to this problem is GitOps. A longer, more descriptive name would be “Git-powered Ops”. The basic idea is that you either encode the deployment process (e.g. as shell scripts) or describe the desired deployed state (e.g. as YAML files), version these files in Git, and have automated tooling apply the changes. The main advantage is that you can now trace who changed which deployment, when and where, by looking at the Git log, and you can undo bad deployments by reverting a Git commit. The overall goal is to reduce friction in the deployment process, making it easier and faster for your organization to deploy new changes, and thus move your product forward faster.

In the remainder of this article, I first take a look at the predominant definition of GitOps, and then discuss several concrete implementation variations, by example. I then go into the advantages and disadvantages of GitOps, and briefly hint at the available implementations.

Assumed prior knowledge

To fully understand the remainder of this article, you should have a solid, basic knowledge of Kubernetes (including terms like Kubernetes Operator or Custom Resource [Definition]), as well as Helm.

GitOps definition

The term “GitOps” was originally coined by Weaveworks in 2017. Since then, different definitions have spread (depending on who you ask), with subtle differences. To me, it looks as if the differences often result from the fact that some tool manufacturer wants to earn money with the tool they are selling, so they bend the definition of GitOps slightly so that they can claim that their tool is “fully GitOps-capable”…

Since it would serve no purpose to have a detailed list of different definitions and how they vary, I’d like to focus on the definition you get when looking at one of the oldest definitions by Weaveworks, and combining it with the vendor-independent GitOps working group definition (which is still in progress) and the gitops.tech site. All these definitions have the following in common:

  1. Developers store declarative descriptions of an application’s environment in Git. Declarative descriptions describe the desired state of the environment (but not how to get there). Side note: the different GitOps definitions don’t agree on what exactly this “environment” is, e.g. whether it includes infrastructure aspects (e.g. spinning up a new Kubernetes cluster), or only describes which software (+configuration) & version should run in already-provisioned infrastructure.
  2. Some deployment system (sometimes called agent) continuously compares the actual environment’s state with the currently desired one (retrieved from Git), and (somehow) manipulates the actual state so that it matches the desired one. Side notes: this definition considers the exact mechanism that turns the environment’s current state into the desired state to be an implementation detail. The tools you find in practice are most likely Kubernetes operators.

This definition explicitly dismisses push-based approaches (also referred to as the “CI Ops anti-pattern” by Weaveworks) as being not really GitOps-like. Push-based solutions simply add a few shell commands that deal with deployment to the end of your (already existing) CI pipeline – or they are placed in a separate CD pipeline which is triggered once the CI pipeline has finished. These deployment commands could be something like kubectl apply, or SSH-ing into the environment (remote server) and running a few commands there that deploy your code. We can see an example of the flow in the diagram below:

CI Ops structure and flow
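To make the push-based (CI Ops) approach more tangible, here is a minimal sketch of a deployment job at the end of a GitLab CI pipeline. All names, the stages and the kubectl image are illustrative assumptions, not part of any specific setup:

```yaml
# .gitlab-ci.yml (sketch) – push-based “CI Ops” deployment at the end of the pipeline
stages:
  - build
  - test
  - deploy

deploy-to-production:
  stage: deploy
  image: bitnami/kubectl:latest   # assumed image that ships kubectl
  script:
    # The pipeline pushes the manifests into the cluster once, but nothing
    # reconciles the cluster state afterwards.
    - kubectl apply -f k8s/       # k8s/ is an assumed directory of manifests
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'
```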

The CI Ops approach does not meet the expectations of the GitOps paradigm (as defined above), because:

  • a) CI Ops uses imperative shell commands, which mutate the target environment’s state, rather than declaring a desired state. [Note: this argument only applies if your deployment commands use SSH or something similar, but not if they use e.g. kubectl apply]
  • b) CI Ops does not continuously ensure that the environment’s state matches the declared state stored in Git. Instead, the CI system (pipeline runner) is only in touch with the target environment for a few seconds, while the deployment commands are running. The CI pipeline then goes to sleep until it is triggered again. Thus, a developer or admin could manipulate the deployment manually (e.g. by manually SSHing into the production environment and changing the deployment), and this would only be fixed by the CI Ops pipeline on its next run. Consequently, from just looking at the Git log (specifically those commits that deal with your environment), you cannot tell with absolute certainty whether those changes really are deployed all the time, due to the lack of continuous reconciliation.

Instead, the implementations of GitOps tools presented below are pull-based. A typical form of implementation is a Kubernetes operator, which regularly checks your Git repositories (and/or Docker/OCI image registries) for changes, and applies them to your Kubernetes cluster. The operator also continuously monitors the cluster’s state for unexpected external changes made by someone else, and reverts them.

GitOps variations

The above GitOps definition sounds simple, but it is very vague. When I learned about GitOps for the first time, I was puzzled by how things work in detail. I had questions like:

  1. If the GitOps operator is separate from the CI system and only monitors Git commits, how do I make sure the operator does not (attempt to) deploy software for commits whose CI pipeline has not finished (yet), and thus the (Docker) images are (still) missing?
  2. Different GitOps articles mention the use of multiple Git repositories (application repository vs. environment / config repository). Should I follow this advice, and what are the benefits? And how should I structure each repository – e.g. which files belong where?
  3. What is the exact workflow to get a change deployed? Who (e.g. developer vs. operations team member) commits what, in which order, and where?
  4. How can I deploy my application(s) in multiple environments (e.g. staging vs. testing vs. production)? Do I repeat the GitOps setup process (as if I were to deploy 3 different applications), or can I have dynamically-created environments?
  5. What do I do if the GitOps operator itself has a bug, causing mayhem in my environments? Do I have to uninstall it to get my environment back under control?

The short answer is: there are no definitive answers (yet)! Each and every GitOps tool (see below for a list) has different opinions about how to solve these issues – consequently, articles that explain GitOps in general terms tend not to answer these questions.

Understanding GitOps by example

I don’t want to leave you hanging, so let’s take a look at a few example approaches, so that you understand who does what, when and where:

Prerequisites

Suppose you are building a web application (e.g. an API backend), packaged as a Docker image, with a dependency on a NoSQL database. Let’s also assume that you have already created a Helm chart that lets you (manually) install your application into a Kubernetes cluster with Helm, and that the chart includes the database dependency as a sub-chart. Of course, alternatives such as a Kustomize directory structure would also work.

Your application is stored in a single Git repository (the application repo) that contains your app and files describing how to deploy it into your environment. For instance, the /app directory contains the application’s sources and the Dockerfile to build it, and /helm contains your self-made Helm chart. The chart also specifies the concrete Docker image version/tag that should be deployed.
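For orientation, the chart skeleton in /helm might look roughly like this (a sketch; the chart name, the MongoDB sub-chart and the repository URL are purely illustrative assumptions):

```yaml
# /helm/Chart.yaml (sketch)
apiVersion: v2
name: my-api-backend            # assumed chart name
version: 0.1.0
dependencies:
  - name: mongodb               # assumed NoSQL database sub-chart
    version: 13.x.x
    repository: https://charts.bitnami.com/bitnami
```

```yaml
# /helm/values.yaml (sketch) – the image tag referenced here is what the
# GitOps workflow below keeps up to date
image:
  repository: registry.example.com/team/my-api-backend
  tag: "1234"                   # commit SHA of the build to deploy
```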

For the sake of simplicity, suppose that there is only a single production environment, and that changes committed to the main branch shall automatically be deployed to it by a GitOps operator.

Let’s also assume that your application repository is configured such that the main branch cannot be modified by direct commits, but only via merge/pull requests from feature branches, and only if the feature branch’s CI pipeline has succeeded. I’ll use the term “MR” (Merge Request) from now on.

Setup

Let’s now implement a simple GitOps workflow: initially, someone from the operations team has to set up the K8s cluster and install a GitOps tool (such as Flux v2) to observe your Git app repo, specifically the Helm chart in the /helm directory of the main branch (a sketch of such a Flux configuration follows after the steps below). If the Git repo is not publicly accessible, the operations team must also create a read-only access token (e.g. a GitLab deploy token) for the app repo, and make it known to the GitOps operator (via K8s secrets). The following image illustrates how the deployment of a new version works:

  1. Developer creates a feature branch called feature, implements changes, and pushes them to origin.
  2. The CI pipeline builds and tests the code, and pushes the Docker image to the OCI registry, tagging the image with the commit SHA of the latest commit. Suppose this SHA is 1234.
  3. If the CI pipeline finished successfully, the developer creates and pushes a new commit in which they update the referenced Docker image tag to 1234 in the Helm chart.
  4. Developer creates an MR to merge feature into main (and has someone approve it).
  5. The GitOps operator detects the change in the Helm chart in the main branch and deploys the updated version of the application into the cluster.
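The Flux v2 configuration mentioned in the setup step could look roughly like this. It is only a minimal sketch: the repository URL, the names and the exact apiVersion values are assumptions and depend on your setup and Flux version:

```yaml
# GitRepository: tells Flux where to pull the desired state from
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: my-app                    # assumed name
  namespace: flux-system
spec:
  interval: 1m
  url: https://gitlab.example.com/team/my-app.git   # assumed app repo URL
  ref:
    branch: main
  secretRef:
    name: my-app-deploy-token     # read-only deploy token (only needed for private repos)
---
# HelmRelease: tells Flux to install/upgrade the chart found in /helm
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: my-app
  namespace: flux-system
spec:
  interval: 5m
  chart:
    spec:
      chart: ./helm               # path of the chart inside the Git repository
      sourceRef:
        kind: GitRepository
        name: my-app
```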

In the above workflow, a caveat is that there is always a manual step involved for the developer: they have to wait for the CI pipeline (from step 2) to succeed before they can create and push yet another commit in step 3. You can get around this caveat by automating the image-update step. For instance, you can create a CI job (run only for merge commits to the main branch) that automatically creates and pushes a new commit to main. In this commit, the CI job updates the Docker image tag in the Helm chart to match the latest commit SHA (1234 in our example) of the merge commit’s source branch. If you are using GitLab CI, see here and here for pointers.
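A minimal sketch of such an image-update job in GitLab CI could look like this. The sed expression, the bot identity and the token variable are assumptions; a real setup also needs a guard against endlessly re-triggering the pipeline, which the [skip ci] commit message flag provides here:

```yaml
update-image-tag:
  stage: deploy
  image: alpine/git:latest        # assumed image that ships git
  rules:
    - if: '$CI_COMMIT_BRANCH == "main"'
  script:
    # For a merge commit, HEAD^2 is the tip of the merged feature branch,
    # i.e. the commit whose SHA was used to tag the Docker image in step 2.
    - SOURCE_SHA=$(git rev-parse HEAD^2)
    - 'sed -i "s/^  tag:.*/  tag: \"$SOURCE_SHA\"/" helm/values.yaml'
    - git config user.email "ci-bot@example.com"
    - git config user.name "ci-bot"
    - git commit -am "Update image tag to $SOURCE_SHA [skip ci]"
    # PROJECT_ACCESS_TOKEN is an assumed CI variable with write access to the repo
    - git push "https://ci-bot:${PROJECT_ACCESS_TOKEN}@${CI_SERVER_HOST}/${CI_PROJECT_PATH}.git" HEAD:main
```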

Extending to multiple environments

This example could easily be extended from a single production environment to (a static set of) multiple environments: e.g. staging and production. Instead of using the main branch, you would e.g. have one branch per environment. The operations team configures the GitOps operator to not only observe one branch (main), but to observe all environment branches, separately. Promoting changes from one environment to the next (e.g. staging to production) is done by merge requests.

Prerequisites

Suppose your system consists of multiple microservices, each of which is stored in its own app repository. Similar to the Simple scenario presented in the other tab, you have built a Helm chart (or Kustomize configuration) for each app, which lets you install that particular app into a K8s cluster. To control the deployments, we now introduce a centralized environment repository.

In this example we use ArgoCD. In a nutshell, the ArgoCD GitOps operator is configured via an ArgoCD-specific Application CRD (Custom Resource Definition). In the corresponding CR object, you specify things like: the source from which changes should be pulled (such as the Git repo URL, branch/tag name, and path within the repo), the concrete source file format (e.g. that ArgoCD should expect a Helm chart, or Kustomize), as well as configuration parameters (e.g. concrete Helm values.yaml overrides). While creating such an Application CR file is typically the job of the operations team (who do this e.g. by manually creating the Application CR file and pushing it into the cluster via kubectl apply, or by using the argocd CLI tool), you can also use the App of Apps pattern. Here you create a very simple Helm chart (such as this one) which does not contain the typical workload objects (like Ingress, Service, …), but only Application CR YAML files. The ArgoCD GitOps operator will then recursively monitor those Application CRs for you.
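For illustration, an Application CR for one of the microservices could look roughly like this (a sketch; the repo URL, names and namespaces are assumptions):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-api-backend                  # assumed application name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://gitlab.example.com/team/my-api-backend.git   # assumed app repo
    targetRevision: staging             # branch to track
    path: helm                          # the app's Helm chart inside its repo
    helm:
      values: |
        image:
          tag: "1234"                   # the tag the CI pipeline keeps updating
  destination:
    server: https://kubernetes.default.svc
    namespace: staging
  syncPolicy:
    automated:
      prune: true                       # delete objects that disappeared from Git
      selfHeal: true                    # revert manual changes made in the cluster
```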

Setup

You create an environment repository with 3 branches (production, staging and testing). All branches contain only an App-of-Apps Helm chart, with slightly different values in each branch (variations include the K8s namespace, the number of Deployment replicas, or the Docker image tags). The operations team sets up the Kubernetes cluster, installs ArgoCD, and configures it 3 times, where each configuration monitors one of the branches of the environment repository for changes in the App-of-Apps Helm chart. The operations team also sets up Git repo access tokens, so that the CI pipeline of the application repos can push to the environment repo, and the ArgoCD operator can read from the environment repo. The following image illustrates how the deployment of a new app version works:

  1. In one of the app repos, the developer creates a feature branch called feature, implements changes, and pushes them to origin.
  2. The app repo’s CI pipeline builds and tests the code, and pushes the Docker image to the OCI registry, tagging the image with the commit SHA. Suppose this SHA is 1234.
  3. If that CI pipeline finished successfully, the developer creates an MR to merge the feature branch into one of the app repo’s environment branches (e.g. staging).
  4. Once the MR is merged, another CI pipeline of the app repo (configured to run only on the environment branches) may check again that everything is in order (build, run tests, push the image to the OCI registry). It then clones the respective branch of the environment repository (here: staging), updates the corresponding app’s Application CR file (which is part of the App-of-Apps Helm chart) with the new image tag (SHA 1234), and pushes the change to the environment repository (a sketch of such a job follows after the steps).
  5. The ArgoCD operator detects the change in the Application CR file in the staging branch of the environment repo and deploys the updated version of the application to the corresponding environment (staging).
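The environment-repo update in step 4 could be sketched as a GitLab CI job like this. The repository URL, the token variable and the sed expression are assumptions, and the sketch assumes the image pushed by this pipeline is tagged with $CI_COMMIT_SHA:

```yaml
update-environment-repo:
  stage: deploy
  image: alpine/git:latest
  rules:
    - if: '$CI_COMMIT_BRANCH == "staging" || $CI_COMMIT_BRANCH == "production"'
  script:
    # ENV_REPO_TOKEN is an assumed CI variable with write access to the environment repo
    - git clone -b "$CI_COMMIT_BRANCH" "https://ci-bot:${ENV_REPO_TOKEN}@gitlab.example.com/team/environment.git"
    - cd environment
    # Point this app's Application CR at the freshly built image
    - 'sed -i "s/tag: \".*\"/tag: \"$CI_COMMIT_SHA\"/" apps/my-api-backend.yml'
    - git config user.email "ci-bot@example.com"
    - git config user.name "ci-bot"
    - git commit -am "Deploy my-api-backend $CI_COMMIT_SHA to $CI_COMMIT_BRANCH"
    - git push origin "$CI_COMMIT_BRANCH"
```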

Deployment approvals in production

So far, our examples actually do Continuous Deployment, that is, software changes are fully automatically pushed to your environments, even to production. If you want an additional approval step that applies only to production, you can adapt step 4. Change it so that it does not automatically push changes to the environment repo, but only creates a MR there, which needs to be approved manually, e.g. by a member of the operations team.

The following example is based on this article, where a clever configuration of the CI pipeline and GitOps operator will cause the dynamic creation (or destruction) of a new environment (in the form of a Kubernetes namespace) for each new merge request in the application repository.

Prerequisites

Let’s assume you already have a setup as described by the Multi-repo example, but with only one application repo. Your environment repo has one (or more) branches for your statically configured environments. The idea is to create a single additional branch in the environment repository, e.g. named dynamic-envs, which will store one ArgoCD Application CR per dynamic environment. This CR file has to reference e.g. a Helm chart (or a set of other Application CRs) that completely describes your entire system, i.e., all microservices you want to deploy in that new, dynamic environment.

Setup

The operations team sets up the ArgoCD GitOps operator to monitor the directory /apps in the dynamic-envs branch of the environment repo. In the /templates directory of the dynamic-envs branch you create a templated Application CR file (see here for an example), where variables like the K8s namespace name or Docker image tags need to be replaced with the MR’s source branch name or the most recent commit SHA, e.g. by a tool like kyml. The operations team also sets up Git repo access tokens, so that the CI pipeline of the application repo can push to the environment repo, and the ArgoCD operator can read from the environment repo.
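Such a templated Application CR could look roughly like this. The sketch uses shell-style ${...} placeholders that a CI job could fill with a tool like envsubst (kyml achieves the same with its own template syntax); all names are assumptions:

```yaml
# /templates/application.yml (sketch) – one dynamic environment per merge request
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ${BRANCH_NAME}                  # e.g. "f" for the feature branch f
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://gitlab.example.com/team/my-app.git   # assumed app repo
    targetRevision: ${BRANCH_NAME}
    path: helm
    helm:
      values: |
        image:
          tag: "${COMMIT_SHA}"
  destination:
    server: https://kubernetes.default.svc
    namespace: ${BRANCH_NAME}           # the app's Helm chart must create this namespace
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

The CI pipeline described below would fill in the placeholders and save the result as /apps/f.yml before pushing it to the environment repo.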

The creation of a dynamic environment works as follows:

  1. In the app repo, the developer creates a feature branch, e.g. called f, implements changes, and pushes them to origin.
  2. The developer creates a MR to merge the branch f into the main branch.
  3. The main branch of the app repo has a CI pipeline configured that only runs when a new MR is created. It builds and tests the code, and pushes the Docker image to the OCI registry, tagged with the commit SHA. Then the pipeline pulls the dynamic-envs branch of the environment repo, fills the Application CR file from the /templates directory with concrete values (like the K8s namespace or the Docker image tag), saves the filled CR file as /apps/f.yml (because the feature branch’s name is f), creates a new commit (which adds this file) and pushes it to the environment repo.
  4. The ArgoCD operator detects the new /apps/f.yml CR file in the dynamic-envs branch of the environment repo and deploys the application. Note: you must design the Helm chart in the app repo in such a way that it creates a new namespace for its other workload objects.

The destruction of a dynamic environment works as follows:

  1. The MR is either rejected or successfully merged in the app repo, e.g. by another developer.
  2. Either outcome triggers a “destruction” CI pipeline in the app repo, which clones the environment repo, deletes the filled CR file (here: /apps/f.yml) and commits and pushes this change.
  3. The ArgoCD operator detects that the CR file in the dynamic-envs branch of the environment repo has disappeared. Consequently, it deletes the corresponding namespace and all its included workload objects.

Discussion of the examples

In the simple scenario, we only needed a single repository, which reduces the complexity. However, considering question #1 that I asked above, we have to implement custom measures to ensure that the GitOps operator only deploys changes that have already been successfully tested by the CI pipeline.

Transitioning from a single repo to multiple repos (with the introduction of an environment/config repository), as done in the Multi-repo example, is the next logical step. This gets you advantages such as:

  • The environment repository’s Git log is cleaner, as it no longer contains development-related commits. This also makes auditing easier.
  • You can tie together multiple microservices (each stored in a different Git application repo) into one deployment: the environment repo’s files define which concrete microservices to deploy, and the corresponding versions.
  • Separation of access: your developers no longer have direct access to the production environment, and thus can no longer (accidentally) mess with it.
  • You avoid accidentally triggering infinite loops in your CI pipeline, or flooding the CI pipeline. Imagine the single-repo scenario where you define that successfully-passed CI jobs trigger a CD job (e.g. on every code change). The CD job updates deployment manifest files (e.g. values.yaml of your Helm chart), makes a new commit, and pushes it to the same repository. This push would cause the CI pipeline to run again (unless you very carefully fine-tune the triggers of the CI jobs), which is not only inefficient, but could also start an infinite loop.

The drawback of having one (or more) environment repos is increased complexity: you need to decide what the exact layout of the environment repo should be (e.g. one environment repo per environment, or to have a single environment repo with one branch per environment, or one directory per environment, or one ArgoCD Application CR file per environment, or … etc.). Similarly, there is complexity when it comes to your environments themselves. For instance, should you have different environments/namespaces in a single K8s cluster, or build one cluster per environment or team? Do you create a new, dynamic environment for each build/MR, or do you choose a static number of environments? The more complex and powerful your solution, the more initial (and continuous maintenance) work you should expect for setting up the GitOps workflow. This article has more details.

Oh, if you are wondering about question #5 from above: typically, you can (temporarily or permanently) disable the automatic reconciliation mechanism of the GitOps operator, without having to uninstall it completely. See e.g. here (Flux v2) or here (ArgoCD, disable the automatic sync).

Advantages of GitOps

Let’s take a look at why you should go to the trouble of adopting GitOps. Note that some of these advantages also apply to CI Ops:

  • Workflow: use the same simple and established developer workflow (create a feature branch, open a merge request, have it approved) for operations as well. Everything stays in one system, the SCM, rather than developers sending email tickets to ops, who then maintain the deployment status in a separate system (or don’t maintain it at all).
  • More regular releases: assuming that all tests passed after a push, the deployment is fully automated – this results in more frequent releases and thus better interaction with your end-users.
  • Better reproducibility: because GitOps only declares the desired state, we leave it to specialized tools (such as the K8s operators and scheduler) to make the declared state match the actual one. This achieves better reproducibility than implementing our own brittle, maintenance-intensive scripts that mutate the environment’s state via SSH.
  • Auditing: the Git log collects all changes made to our environment(s). The GitOps operator continuously ensures that whatever is declared in Git is also what is currently deployed, which gives the Git log actual validity.
  • Improved security: granting HTTP(S) access to a Git server (e.g. regarding firewall policies, and using a read-only access token) is easier than giving a CI pipeline access to the K8s API server.
  • Improved security: it is no longer necessary to give developers access to the production K8s cluster.

Challenges

At the present, early stage, GitOps still has many rough edges. In the example discussion section I already mentioned the problem of complexity and decision making, e.g. having to decide how to structure your repositories (app vs. environment). GitOps lacks best practices in this regard (and many others). The GitOps working group is working on improving this, but it will take time. There is also a disconnect between CI and CD, now that CD is done by a separate system (the GitOps operator): if a deployment fails, the developers need to know about it, but they will no longer find this information in their CI system (as they would when using CI Ops). You need to configure your monitoring and alerting system such that your developers are also notified when a deployment fails (not just the operations team). There are many other challenges, summarized well elsewhere; see this article or this one.

Implementations

Finding the best-suited implementation is tricky. If you like vendor lock-in, you can use cloud-vendor-specific solutions, such as AWS CodeDeploy, GCP Cloud Build or Azure DevOps. But you may want to refrain from tying yourself to one cloud provider…

You will most likely be better off using a solution from the vendor-independent OSS landscape. Unfortunately, there are many different tools on the market, each with a slightly different focus and meant for different use cases. There are (incomplete) lists such as this one, which are of little help, because they don’t keep up with the market (and thus become outdated), and they don’t explain the suitable use cases for each tool.

My general recommendation is to go for maturity first, because new (immature) solutions are risky. Use existing “awesome GitOps” lists, complemented by Internet search results for “gitops tools”, as a basis, then immediately filter out those tools that don’t yet have thousands of GitHub stars, or that are very new (e.g. less than one year old). At the time of writing, this article by cloudogu is an excellent resource to get started, as it also features these maturity metrics.

Next, you will need to study the manuals of the remaining solutions to identify which ones best serve your needs, and then implement proof-of-concept prototypes to see whether they really do ;). The devil is always in the details!

Conclusion

Although GitOps is still at an early stage, its benefits can outweigh the costs (of adoption) in some situations, and it may be worthwhile especially for larger organizations. The GitOps working group will hopefully define best practices soon, which should help drive and simplify GitOps adoption. However, the fact that CI Ops is considered an “anti-pattern” by a “true” GitOps tool provider (Weaveworks) is partly a marketing stunt (remember, they want to sell their own tool). If you are a small software shop or start-up, the simplicity of CI Ops (with kubectl apply at the end of a pipeline) may suit you better. You can always migrate to GitOps later.
