Auto-scaling Azure Pipelines Agents in Kubernetes

This article presents my Kubernetes operator that provisions Azure Pipelines agents as Kubernetes Pods, and includes an in-depth tutorial for using it. I also examine the alternative options for running Azure Pipelines agents, and explain why a Kubernetes-based approach beats them.

Introduction

CI/CD pipelines are a cornerstone of efficient software development teams. A pipeline automates all steps that happen between a developer pushing new code and the end user trying the changes in some deployed environment. See this blog post for more details about CI/CD pipelines.

To keep the development team productive, a CI/CD pipeline should complete as quickly as possible. One possible cause for slow pipelines is the build agent infrastructure, by which I mean the deployment approach you use to run your pipeline agents. These agents do the actual computational work defined in your pipeline, e.g. building your Docker images or deploying them. While you could simply set up hundreds of very fast machines to reduce the pipeline duration, this would be very expensive, and not particularly “green”. Consequently, you need elastic scaling for your build agents, where some scaling mechanism automatically provisions (and unprovisions) agents, depending on the number of pending CI/CD jobs.

Azure Pipelines (AZP) is one of many CI/CD platforms. It is especially popular in enterprises that already use the Azure cloud anyway, because AZP integrates into Azure much better than other CI/CD platforms (say, GitLab).

Unfortunately, at the time of writing, AZP does not offer a truly good elastic scaling approach for AZP agents. In this article, I discuss and compare the existing AZP agent deployment choices, and then present my own Kubernetes operator that solves the issues of the other approaches.

Options for Azure Pipelines build agent infrastructure

The following table lists and compares the officially-supported options to run AZP agents:

| | Microsoft-hosted VMs | Self-hosted VM / server | Azure VM Scale Set | ACI Terraform Module | KEDA |
|---|---|---|---|---|---|
| Customizability of used hardware | Fixed: 2 vCPUs, 8 GB RAM | Full (you choose the server/VM) | Via the chosen VM size | Limited to choosing vCPU count and GBs of memory | Via the chosen K8s node VM sizes |
| Customizability of pre-installed tools | None | Via customized disk image | Via customized disk image | Via customized Docker image | Via customized Docker image |
| Supported operating systems | Win, Linux, macOS | Win, Linux, macOS | Win, Linux | Win, Linux | Win, Linux |
| Elastic scaling | Yes | No (static set of servers) | Yes | No (static number of containers) | Yes |
| Provisioning speed | A few seconds | n/a (agents always running) | Up to 20 minutes | n/a (agents always running) | From a few seconds up to 1-2 minutes |
| Resource usage efficiency | Poor (only one agent per VM) | Poor (only one agent per VM) | Poor (only one agent per VM) | Good | Good |
| Pricing | 37€/month per agent | Depends on chosen CPU and RAM | Depends on the number of VMs and VM size | Depends on chosen CPU and RAM | Depends on chosen K8s node VM sizes |
| Technical issues | unknown | unknown | See their huge FAQ | unknown | See section “A closer look at KEDA” below |

Here are a few notes about each approach:

  • Microsoft-hosted VMs: using them avoids any maintenance work, and all pre-made tasks work out-of-the-box. However, because you have no influence over the pre-baked VM disk images, your pipeline might break whenever Microsoft decides to change some of the pre-installed tool versions.
  • Self-hosted VM / server: you install the AZP agent on a static set of physical or virtual servers, either directly on the host or inside a Docker container.
  • Azure VM Scale Set: you first create an Azure VM Scale Set (VMSS) where you choose the VM size and disk image, and then grant AZP the permissions to manage the VMSS. As described here, upscaling happens only every 5 minutes, you should “… allow 20 minutes for machines to be created”, and “it can take an hour or more for Azure Pipelines to scale out or scale in”. Yay! You should definitely use that 😀 (you get that I’m being sarcastic… right?)
  • ACI Terraform Module: provisions a static number of AZP agent containers, running on Azure Container Instances (ACI).
  • KEDA: KEDA is a general-purpose Kubernetes operator that can scale Jobs, or Deployments/StatefulSets, based on data emitted by some event source. Here, this event source is AZP’s (undocumented) jobs API, which lists only pending and running jobs. The provisioning time depends on whether a Pod fits onto an already-running K8s node or not – if not, the cluster autoscaler needs to provision a node first, which may take 1-2 minutes. A minimal ScaledObject sketch follows this list.
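
To make the KEDA option more concrete, here is a minimal sketch of a ScaledObject using KEDA’s azure-pipelines scaler. The Deployment name and environment variable names are placeholders I made up for illustration; consult the KEDA documentation for the full set of trigger parameters:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: azp-agent-scaler
spec:
  scaleTargetRef:
    name: azp-agent            # Deployment running the AZP agent container (placeholder name)
  minReplicaCount: 1           # scaling to zero needs extra tricks, see "A closer look at KEDA"
  maxReplicaCount: 5
  triggers:
    - type: azure-pipelines
      metadata:
        poolName: "operator-pool"
        organizationURLFromEnv: "AZP_URL"        # env var of the agent container holding the org URL
        personalAccessTokenFromEnv: "AZP_TOKEN"  # env var of the agent container holding the PAT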

One more option, not listed above, is apparently about to be announced by Microsoft: they promised it for Q3, moved it to Q4, and we’ll see whether Microsoft delivers.

There are also numerous third-party Kubernetes operators, such as this or this one, but they have all been discontinued, so I did not analyze or try them.

A closer look at KEDA

Once you examine the above comparison table, your conclusion should be to use either Microsoft’s hosted agents or KEDA. KEDA offers better customizability (of hardware and software), so that is what my company chose, at first. But over time, we experienced the following uncomfortable issues (which is why I implemented my own Kubernetes operator):

  • It is not easily possible to run agents with (sidecar) containers dynamically defined in your pipeline YAML file. Example: job #1 builds and pushes a Docker image (with a version tag that depends on an Azure Pipelines variable, e.g. Build.BuildId) that you want to run with a KEDA-based agent in job #2 (job #2 starts after job #1). The only solution is to start a dynamic container as an ephemeral container (in an already-running agent Pod). But this has many other drawbacks: for instance, an ephemeral container cannot be protected from termination via a preStop lifecycle hook, it is invisible in most tools, and its resource usage is not accounted for via requests/limits.
  • Using “scale to zero” is more difficult with KEDA: you either have to manually register a fake/dummy agent for each pool/demand, or set the minReplicaCount > 0 in your ScaledObject. Otherwise, your jobs would not even start (I discuss this limitation of the AZP platform in more detail below).
  • If you use long-running agent pods (i.e., you do not provide the --once flag to the Azure Pipelines agent container), KEDA may prematurely kill your agent pods, resulting in aborted pipelines and many ‘offline’ agents in your agent pool. Why? Because KEDA scales your Deployments/Jobs only based on the number of pending jobs. Suppose two jobs are pending, and KEDA schedules a Deployment with 2 replicas. One job finishes (successfully) quickly, the other one takes a bit longer. The pending job count reported by the AZP job API drops from 2 to 1, and KEDA down-scales the Deployment by changing its replica count to 1. Now, Kubernetes’ Deployment and ReplicaSet controllers arbitrarily terminate one of the Pods. Murphy makes sure it’s the one that still runs the active job.
    • One solution for this problem is to use short-lived Kubernetes Jobs, as done in https://github.com/clemlesne/azure-pipelines-agent. Unfortunately, Jobs lack support for cache volumes: Kubernetes has no mechanism to ensure that a cache volume is only used by one Job at a time – the ReadWriteOnce accessMode only restricts a volume to a single node, not to a single Pod or Job!

Kubernetes operator to scale Azure Pipelines agents

I built a Kubernetes operator called azure-pipelines-k8s-agent-scaler which solves all problems we had with KEDA.

What is a Kubernetes operator?

In a nutshell, a Kubernetes operator consists of a CustomResourceDefinition and a controller application (which is deployed as a container in a Deployment). The controller essentially translates whatever you define in CustomResource (CR) objects to “normal” Kubernetes workloads (such as Pods, ConfigMaps, etc.), and ensures that all divergences are continuously reconciled.

Like the KEDA operator, my operator also queries the AZP jobs API, which announces pending and ongoing jobs, and then creates corresponding Pods. This query is repeated every couple of seconds. In a CustomResource that you deploy, you define the AZP credentials, AZP pool name, and the different Pod templates (for the different AZP capabilities/demands you want to support).
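
To give you a rough idea, here is a heavily simplified sketch of such a CR. The field names mirror the demo-agent Helm chart values used in the tutorial below; the apiVersion, the PAT field and the exact schema are placeholders of mine, so consult the sample CR in the operator’s repository for the authoritative definition:

apiVersion: azurepipelines.k8s.scaler.io/v1   # placeholder, check the CRD installed with the operator
kind: AutoScaledAgent
metadata:
  name: azp-agents
  namespace: azp-agents
spec:
  poolName: operator-pool
  organizationUrl: https://dev.azure.com/REPLACEME
  pat: azp-pat-secret            # placeholder: reference to the Secret holding the PAT
  maxTerminatedPodsToKeep: 1
  podsWithCapabilities:          # one entry per Pod template / set of AZP capabilities
    - capabilities: {}           # this template has no special AZP capabilities/demands
      minCount: 0
      maxCount: 5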

The key features of the operator are:

  • Scale to zero: scaling to zero saves CPU & memory resources and therefore reduces infrastructure costs. My operator creates (and destroys) Kubernetes Pods (one Pod per AZP job). The agent Pods are ephemeral (using the agent software’s --once flag, see here for details), so agent Pods terminate automatically after finishing an AZP job.
    • You may wonder: why does my operator manage Pods directly, while other solutions (such as KEDA) manage higher-level workloads, such as Deployment/StatefulSet objects, and only update their replicas count? By managing Pods directly, the operator has full control over which superfluous Pods to terminate. Suppose the user starts an AZP pipeline, the operator schedules a Pod, but then the user cancels that pipeline run again. The idling agent Pod is no longer needed, and the operator should terminate it. However, operators often act with some delay, or base their decisions on a slightly outdated view of the Pods, so the seemingly idle agent Pod might actually have picked up another job by the time the operator decides to terminate it – in which case it must not be killed. In essence, my operator does the equivalent of a “kubectl exec” into the agent Pod to determine whether it is running an active AZP job, and only if this is not the case does it terminate the agent Pod.
    • Because ephemeral Pods lack persistent storage, my operator supports defining and mounting persistent, reusable cache volumes. An example scenario where this is useful: building Docker images with BuildKit, which benefits from a persistent local cache, as discussed in this article.
    • My operator automatically takes care of the registration (and deletion) of fake offline agents in the AZP agent pool, which the AZP platform requires before it even announces jobs on the AZP jobs API.
  • Dynamic sidecar containers: sometimes, a job needs other sidecar containers, whose images contain binaries that are missing from the AZP container’s image (e.g. Terraform, Cypress, Java SDK, etc.). These containers should run in the same Pod as the AZP agent container, to allow for efficient file sharing via an emptyDir volume. My operator allows you to define dynamic sidecar containers directly in the AZP pipeline YAML file: you define a demand called ExtraAgentContainers whose value specifies the sidecars (more details in the tutorial below). When the operator starts an agent Pod, it parses this demand and creates the defined sidecar containers as regular containers in the Pod. They show up as normal containers in Kubernetes Dashboard (or other tools), and you can execute into them, if needed. To run commands in these sidecars in an AZP job, you can use the execContainer.sh script, which handles waiting until the container is up (which could take longer if pulling the container’s image takes a very long time).

Tutorial for using the Azure Pipelines agent scaler operator

Let’s see my operator in action. The tutorial is divided into four sections, with each section progressively showcasing more features of the operator.

1. Basic setup

1.1 Installation of the operator + preparing the AZP pool

First, you need to install the operator into your cluster. Follow the instructions here to do so. In the azp-operator namespace, you should now see a Deployment and Pod for the controller of the operator, and the AutoScaledAgent CRD should also be installed.

Next, open the AZP web interface, navigate to your project’s settings page, and create a new self-hosted agent pool. For this tutorial, I’ll use “operator-pool” as the name; if you choose a different name, replace it accordingly in the steps below.

1.2 Creating the CR

Next, we need to create an AutoScaledAgent CR in your cluster, which configures the operator. I recommend using a dedicated Kubernetes namespace for this, because the operator creates Pods or PVCs in the same namespace as the CR. For this tutorial, I use “azp-agents” as the namespace.

Because the AutoScaledAgent CR references a Secret that contains your Personal Access Token (PAT), we need to create that PAT first. See the docs for details. In the AZP web interface, in the PAT creation dialog, click the “Show all scopes” link at the bottom, then the “Agent pools” category shows up, where you need to select the “Read & manage” checkbox. Define a PAT name and an expiration date, and click “Create”. Make sure you save the PAT value somewhere safe (e.g. a password manager), because the AZP web interface will only show it once.

Now we can create the AutoScaledAgent CR. There are two options to do so:

  1. Craft the contents of the AutoScaledAgent CR yourself, e.g. based on the sample
  2. Make a copy of the demo-agent Helm chart and customize it

In the tutorial we use the second approach, because it is easier. The demo-agent Helm chart essentially generates an AutoScaledAgent CR and applies it to the cluster, together with a few other helper files (such as a Secret that stores your AZP PAT).

Make a copy of the demo-agent chart folder, and change its values.yaml file as follows:

poolName: "operator-pool"
organizationUrl: "https://dev.azure.com/REPLACEME"
maxTerminatedPodsToKeep: 1
pat: ""  # Override this via --set
# Overwrite default values, if necessary:
dummyAgentGarbageCollectionInterval: "30m"
dummyAgentDeletionMinAge: "2h"
normalOfflineAgentDeletionMinAge: "5h"

azpAgentContainer:
  image:
    registry: ghcr.io
    repository: mshekow/azp-agent
    tag: "2023.09.04"
    pullPolicy: Always

  resources:
    limits:
      memory: 512Mi
    requests:
      memory: 512Mi

reusableCacheVolumes: []
buildkitConfig:
  debug: false
  gc:
    keepCacheMountsGi: 30
    keepTotalGi: 60

podsWithCapabilities:
  - capabilities: { }
    minCount: 1
    maxCount: 5
    containers: [ ]

imagePullSecrets: [ ]
terminationGracePeriodSeconds: 1200
nameOverride: ""
fullnameOverride: ""
podLabels: { }
nodeSelector: { }
tolerations: [ ]
affinity: { }

Some remarks about the values.yaml file:

  • At the top, change the value for organizationUrl to contain your organization name instead
  • The most important part is the podsWithCapabilities setting. It defines a single Pod template that has no particular AZP capabilities/demands. We want at least one Pod (and at most 5 of them) running at any given time. Also, that Pod should have no other sidecar containers (so the only active container is the one for the AZP agent, which we don’t need to explicitly define in the values.yaml file – the Helm chart template already does this for us).
  • The agent image (mshekow/azp-agent) is based on this Dockerfile. In a production setting of a real project, you may want to build your own Docker image, so that you control which other tools are installed into the image, and to have control over the installed AZP agent version.

Now, run Helm to install your modified chart, also creating the namespace:

helm upgrade --install --namespace azp-agents --create-namespace --set pat=PASTE-YOUR-PAT-HERE demo-agent-release demo-agent

The last argument is the relative path to the directory in which you store your Helm chart copy, so modify it if necessary.

Via “kubectl get pod -n azp-agents” you should now see a Pod with an AZP agent, because of the minCount: 1 in the above values.yaml file. When you look at the Agent pools in your AZP project settings, you should also see this agent as “online”.

Now change the minCount value to 0 (so that we have the “scale to zero” approach) and run the above “helm upgrade …” command again. The AZP agent Pod should now be terminated. The agent Pod still exists (e.g. it is listed by “kubectl get pod“), but all its containers have stopped. The maxTerminatedPodsToKeep setting in the above values.yaml file controls how many of the most recently terminated Pods the operator keeps. Only terminated Pods exceeding this limit are completely removed by the operator.

1.3 Run a simple hello-world pipeline

In your Git repo (e.g. a Git repo stored in Azure Repos), create an azure-pipeline.yaml file with the following content:

trigger:
  batch: true
  branches:
    include:
      - "*"

pool:
  name: operator-pool

jobs:
  - job: hello_world
    steps:
      - script: echo "hello world"
        displayName: hello world

Create an AZP pipeline for that YAML file and run it. The job will be pending for a few seconds. You should observe that the operator spins up a Pod, the agent container in that Pod runs the job, and then terminates again.

2. Static sidecar containers

If you want to use tools in your pipeline that are not included in the AZP agent’s Docker image, you can make use of static or dynamic sidecar containers. This section discusses static sidecars, which you define “statically” in the values.yaml.

Suppose we want the ability to build Docker images in the pipeline. To achieve this, we add a second Pod template with a “buildkit” capability and define a BuildKit sidecar container that runs in rootless mode, configured via a ConfigMap named “buildkit-config” that the demo-agent Helm chart already prepares for us. Because BuildKit’s rootless mode requires some Pod-wide annotations and securityContext tweaks, we provide them as well. Add the following Pod template to podsWithCapabilities:

  - capabilities:
      buildkit: "1"
    minCount: 1
    maxCount: 5
    securityContext:
      fsGroup: 1000
      fsGroupChangePolicy: "OnRootMismatch"
    annotations:  # See https://github.com/moby/buildkit/issues/2441#issuecomment-1253683784
      container.apparmor.security.beta.kubernetes.io/buildkit: unconfined  # last segment must match the container name
    volumes:  # defines extra/custom volumes
      - name: buildkit-config
        configMap:
          name: buildkit-config
    containers:  # defines extra sidecar containers that run alongside the AZP agent container
      - name: buildkit
        image:
          registry: docker.io
          repository: moby/buildkit
          tag: master-rootless
          pullPolicy: Always
        command: [ ]  # optional, overwrites the image's ENTRYPOINT
        args: # optional, overwrites the image's CMD
          - --oci-worker-no-process-sandbox
        resources:
          limits:
            memory: 1Gi
          requests:
            memory: 1Gi
        securityContext:  # optional, may be necessary for some cluster configurations or images
          seccompProfile:
            type: Unconfined
          runAsUser: 1000
          runAsGroup: 1000
        readinessProbe:
          exec:
            command:
              - "buildctl"
              - "debug"
              - "workers"
        volumeMounts: # defines extra volume mounts, if necessary
          - name: buildkit-config
            mountPath: /home/user/.config/buildkit

Next, run the above “helm upgrade …” command again, to apply the updated configuration. Your AZP agent pool now shows an agent that has the “buildkit” capability.

Commit a Dockerfile to your repository, e.g. with dummy content such as this:

FROM alpine:latest
RUN echo 1234

Change your azure-pipeline.yaml file as follows to run a Docker image build:

trigger:
  batch: true
  branches:
    include:
      - "*"

pool:
  name: operator-pool
  demands:
    - buildkit

jobs:
  - job: build_image
    steps:
      - script: ./execContainer.sh -n buildkit -c 'cd $(System.DefaultWorkingDirectory) && buildctl build --frontend dockerfile.v0 --local context=. --local dockerfile=.'
        displayName: Build Docker image
        workingDirectory: $(Agent.WorkFolder)

Your pipeline should now successfully build an image (without pushing it). The execContainer.sh script is part of the mshekow/azp-agent image. It essentially forwards the command provided via -c to the BuildKit container via “kubectl exec”, but first waits for the BuildKit container to be up and running. This waiting procedure is necessary because the AZP agent container might start quickly while pulling the BuildKit image takes a long time → by the time the script step runs, the BuildKit container might not be ready yet.

The operator makes sure that all sidecar containers share an emptyDir volume with the AZP agent container. When the AZP agent clones the repository’s code to $(System.DefaultWorkingDirectory) (which is located inside the emptyDir volume), it is also available to each sidecar container under that same path. In the script step, the “cd $(System.DefaultWorkingDirectory)” statement is necessary, because the working directory of the BuildKit container is arbitrary (each Docker image defines some working directory), so we change it to $(System.DefaultWorkingDirectory) first. If you are not familiar with “buildctl build” (but you know “docker build”), see this blog post to learn about the differences between these two client CLIs.
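
If you also want to push the image, you could extend the build command, for example as in the following sketch. The registry name is a placeholder, and this assumes the BuildKit container has credentials for that registry (e.g. via a mounted Docker config file):

jobs:
  - job: build_and_push_image
    steps:
      - script: >
          ./execContainer.sh -n buildkit -c 'cd $(System.DefaultWorkingDirectory) &&
          buildctl build --frontend dockerfile.v0 --local context=. --local dockerfile=.
          --output type=image,name=registry.example.com/my-app:$(Build.BuildId),push=true'
        displayName: Build and push Docker image
        workingDirectory: $(Agent.WorkFolder)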

If you take a closer look at the Pods in the azp-agents namespace, you will notice the following pattern:

  • Once the AZP agent has finished the job, its container terminates, but the BuildKit container (and its buildkitd daemon process) is still running
  • A few seconds later, the operator detects this situation and terminates the BuildKit sidecar container, so that the entire Pod is now in a terminated state. The operator uses a simple hack: it changes the container’s image tag to a non-existing tag, so the container runtime attempts to restart the container (stopping it first), but then fails to restart it. The downside of this approach is that some Kubernetes client tools (e.g. Lens) may show the Pod’s state as “failed”. This is expected behavior.

3. Reusable cache volumes

A major downside of the above BuildKit example is that BuildKit’s local cache is not persistent (yet). In rootless mode, BuildKit stores its data in /home/user/.local/share/buildkit, and the content of that folder is lost after the termination of the AZP agent Pod.

The operator supports creating and assigning persistent cache volumes to the agent Pods it creates. The operator creates as many PersistentVolumeClaims (PVCs) as needed, and makes sure that a specific PVC is only bound to one Pod at a time. Once that Pod terminates, the operator does not delete the corresponding PVC, but keeps it, so that another Pod (that spawns some time in the future) can reuse this PVC again.
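
Conceptually, the operator manages ordinary PersistentVolumeClaims for this. The sketch below shows roughly what such a PVC looks like – the name is my own illustration (the operator picks its own names and bookkeeping labels), and the storage class/size come from the reusableCacheVolumes configuration shown further below:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: buildkit-cache-0        # illustrative name; the operator chooses its own names/labels
  namespace: azp-agents
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: hostpath    # from reusableCacheVolumes (see below)
  resources:
    requests:
      storage: 70Gi             # from requestedStorage (see below)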

To add such reusable cache volumes, change the values.yaml file so that the BuildKit’s Pod-template in the podsWithCapabilities array looks as follows:

  - capabilities:
      buildkit: "1"
    minCount: 0
    maxCount: 5
    securityContext:
      fsGroup: 1000
      fsGroupChangePolicy: "OnRootMismatch"
    annotations:  # See https://github.com/moby/buildkit/issues/2441#issuecomment-1253683784
      container.apparmor.security.beta.kubernetes.io/buildkit: unconfined  # last segment must match the container name
    volumes:  # defines extra/custom volumes
      - name: buildkit-config
        configMap:
          name: buildkit-config
    containers:  # defines extra sidecar containers that run alongside the AZP agent container
      - name: buildkit
        image:
          registry: docker.io
          repository: moby/buildkit
          tag: master-rootless
          pullPolicy: Always
        command: [ ]  # optional, overwrites the image's ENTRYPOINT
        args: # optional, overwrites the image's CMD
          - --oci-worker-no-process-sandbox
        resources:
          limits:
            memory: 1Gi
          requests:
            memory: 1Gi
        securityContext:  # optional, may be necessary for some cluster configurations or images
          seccompProfile:
            type: Unconfined
          runAsUser: 1000
          runAsGroup: 1000
        readinessProbe:
          exec:
            command:
              - "buildctl"
              - "debug"
              - "workers"
        mountedReusableCacheVolumes:
          - name: buildkit-cache
            mountPath: /home/user/.local/share/buildkit
        volumeMounts: # defines extra volume mounts, if necessary
          - name: buildkit-config
            mountPath: /home/user/.config/buildkit

Also, add the following block at the top of your values.yaml file:

reusableCacheVolumes:
  - name: buildkit-cache
    storageClassName: hostpath
    requestedStorage: 70Gi

Depending on your Kubernetes cluster provider, you may want to change the storageClassName to something other than hostpath, or change the requestedStorage value to a different size. If you do change requestedStorage, you should also adapt the gc (=Garbage Collection) values for keepCacheMountsGi and keepTotalGi accordingly, as explained in the comments in the example values.yaml file.
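
For example, on an AKS cluster you might pick one of the built-in Azure Disk CSI storage classes instead (assuming your cluster offers the default AKS storage classes):

reusableCacheVolumes:
  - name: buildkit-cache
    storageClassName: managed-csi   # built-in Azure Disk CSI storage class on AKS
    requestedStorage: 70Gi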

Note that the name of the cache (here: “buildkit-cache”) defined in reusableCacheVolumes needs to match the name in the container’s mountedReusableCacheVolumes list. When the operator creates the Pod, it automatically creates (or reuses) a PVC and makes a corresponding entry in the Pod’s volumes array.

Now, run the above “helm upgrade …” command again, to apply the updated configuration.

Because we changed the minCount to 0, you can observe that the existing BuildKit agent Pod is being terminated.

Now, run the AZP pipeline again. You will observe that the operator creates a PVC and attaches it to the BuildKit Pod. When you run the AZP pipeline yet another time, this PVC is reused, and the image build should complete much faster, because all layers of the image have been cached. However, make sure you allow 5-10 seconds between the end of pipeline run #1 and the start of pipeline run #2, to give the operator enough time to update its housekeeping meta-data (Kubernetes labels); otherwise the operator might create a second PVC.

Cloud storage caveat when using multiple zones

If you use a cloud-hosted Kubernetes cluster (e.g. Azure AKS), you can choose to run nodes (or place volumes) in multiple AZs (availability zones) of a region. In other words, you are using zonal nodes. Or you might be forced to use zonal nodes because your chosen reusable-cache-volume storage (such as Azure’s Premium SSD v2) can only be attached to zonal nodes. In any case, most cloud providers have the limitation that a volume can only be attached to nodes in the same zone in which the volume was created. My operator does not account for that. The easiest solution, IMO, is to create a dedicated Kubernetes node pool that is zonal but only uses one specific zone. Otherwise, the Kubernetes scheduler will ensure that the AZP agent Pod is scheduled on a node in the same zone as the PVC chosen by my operator, which may have adverse effects on your cluster auto-scaling.
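
Assuming the demo-agent chart applies nodeSelector to the generated agent Pod templates, pinning the agents to a single zone could look like this in values.yaml (the zone label value is just an example, use one of your cluster’s actual zones):

nodeSelector:
  topology.kubernetes.io/zone: westeurope-1   # example zone; adapt to your node pool's zone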

4. Dynamic sidecar containers

Until now, we defined the sidecar containers in our Pod templates, in the values.yaml file of the operator’s configuration Helm chart. I would call this static configuration, and it would typically be managed by an administrator of your Kubernetes cluster.

However, my Kubernetes operator also supports dynamic sidecar containers, which means that your pipeline YAML sets AZP variables whose values are used in a special demand called ExtraAgentContainers. Let’s see this feature in action.

Suppose your pipeline should be able to build either a Node.js or a Java application, with the help of the respective Node/Java Docker images. You want the flexibility to choose the concrete Node/Java versions right in your pipeline, because the versions might change often, and you don’t want to keep updating the values.yaml file of your static sidecar containers all the time.

First, add an agent-registrator.yaml file to your repository with the following content:

# This task registers a fake/dummy AZP agent whose AZP capabilities match dynamically-generated demands that
# contain ExtraAgentContainers definitions.
# This is a workaround for AZP limitations.

parameters:
  - name: extraAgentContainersCapability
    type: string
  - name: shortName
    type: string
    displayName: short name shown for the job display in the AZP web UI
  - name: stepName
    type: string
    displayName: name of the job, needed for follow-up jobs to reference it when reading output variables
  - name: registratorCliVersion
    type: string
    default: "1.3.1"
  - name: localBinaryName
    type: string
    default: agent-registrator

steps:
  # Download the agent-registrator binary (unless it already exists) and run it
  - script: |
      set -euo pipefail
      set -o errexit
      set -o nounset
      if [ ! -e "${{ parameters.localBinaryName }}" ]; then
        curl -L https://github.com/MShekow/azure-pipelines-agent-registrator/releases/download/v${{ parameters.registratorCliVersion }}/azure-pipelines-agent-registrator_${{ parameters.registratorCliVersion }}_linux_amd64 -o ${{ parameters.localBinaryName }}
        chmod +x ${{ parameters.localBinaryName }}
      fi

      echo "Registering ExtraAgentContainers capability value: ${{ parameters.extraAgentContainersCapability }}"

      ./${{ parameters.localBinaryName }} -organization-url $(System.CollectionUri) -pool-name $AZP_POOL -pat $(azureDevOpsPat) \
        -agent-name-prefix dummy-agent -capabilities 'ExtraAgentContainers=${{ parameters.extraAgentContainersCapability }}'

      echo "##vso[task.setvariable variable=out;isoutput=true]${{ parameters.extraAgentContainersCapability }}"
    name: ${{ parameters.stepName }}
    displayName: register ${{ parameters.shortName }} EAC capability
    workingDirectory: $(Agent.WorkFolder)

Next, update your azure-pipeline.yaml file as follows:

trigger:
  batch: true
  branches:
    include:
      - "*"

parameters:
  - name: applicationType
    displayName: Application type
    type: string
    default: java
    values:
      - java
      - node

pool:
  name: operator-pool

variables:
  ${{ if eq(parameters.applicationType, 'node') }}:
    image: "docker.io/library/node:20.1.0"
    containerName: "node"
    extraAgentContainers: "name=$(containerName),image=$(image),cpu=750m,memory=1Gi"
  ${{ if eq(parameters.applicationType, 'java') }}:
    image: "docker.io/library/ibm-semeru-runtimes:open-17.0.8.1_1-jdk-jammy"
    containerName: "jdk"
    extraAgentContainers: "name=$(containerName),image=$(image),cpu=750m,memory=1Gi"

jobs:
  - job: register_agent
    steps:
      - template: agent-registrator.yaml
        parameters:
          extraAgentContainersCapability: $(extraAgentContainers)
          shortName: app-type
          stepName: appCapability
  - job: use_dynamic_sidecar_container
    dependsOn: register_agent
    variables:
      redefinedEAC: $[ dependencies.register_agent.outputs['appCapability.out'] ]
    pool:
      name: operator-pool
      demands:
        - ExtraAgentContainers -equals $(redefinedEAC)
    steps:
      - script: ./execContainer.sh -n $(containerName) -c '(which node || which java) && sleep 30'
        displayName: Dummy build-app-job
        workingDirectory: $(Agent.WorkFolder)

What is happening above is the following:

  • The first job, register_agent, uses the agent registrator CLI to register a dummy/offline AZP agent in the operator-pool pool, with the capability $(extraAgentContainers).
    • We need to do this because of a limitation of the AZP platform: if a job (like use_dynamic_sidecar_container) has a demand for which the pool does not yet have a registered agent with a matching capability, AZP immediately aborts the job. The AZP platform would not even advertise the job via the AZP job API that is regularly polled by our operator. Thus, the operator would not get the chance to start Pods for such jobs. I created a feature request asking to change the AZP platform behavior (see ticket), but it is unlikely that Microsoft will react any time soon.
    • The registrator CLI expects the AZP PAT to be exposed as AZP variable $(azureDevOpsPat) (see agent-registrator.yaml, line 34). To expose this variable, create a secret variable for your pipeline as documented here, named azureDevOpsPat using the PAT that you created in step 1 of this tutorial.
  • In the second job, use_dynamic_sidecar_container, we verify that the sidecar container works as expected. Because I did not provide any real application code in this tutorial, we are not really building an application. Instead we just verify whether the node or java binary is available. Like in part 2 and 3 of this tutorial, we need to use the execContainer.sh script to forward the command to the sidecar container.
    • Under the hood, when the operator sees a pending job with the ExtraAgentContainers demand, it parses the demand-string and correspondingly creates additional sidecar containers in the Pod spec. For instance, if the ExtraAgentContainers demand is set to “name=jdk,image=docker.io/library/ibm-semeru-runtimes:open-17.0.8.1_1-jdk-jammy,cpu=750m,memory=1Gi”, the operator uses “jdk” for the name of the container, sets the provided image (.../ibm-semeru-runtimes...) and assigns both limits and requests to the provided cpu/memory values (which are optional). The ENTRYPOINT and CMD of the sidecar container are overwritten with “/bin/sh -c 'trap : TERM INT; sleep 9999999999d & wait'”.
    • You probably wonder why the demand uses $(redefinedEAC) instead of $(extraAgentContainers). The reason for this is, yet again, a limitation of the AZP platform. extraAgentContainers is a nested variable, which is not supported by “demands” (for unknown reasons). As you can see here, a workaround is to use an output variable (docs) that is redefined as a job-local variable. The agent-registrator.yaml creates this output variable.

You can now run the pipeline, choosing “java” or “node” in the Run pipeline dialog. The last line of the script step’s output of the use_dynamic_sidecar_container job should be the absolute path to the node or java binary. When the Pod runs on a Kubernetes node for the first time, the execContainer.sh script may emit many log lines of the sort “Pod not running yet, waiting for 5s to query again”, because the container runtime needs time to pull the Java or Node Docker image.

Running multiple sidecar containers

The above example added one sidecar container. To add multiple ones, use || as separator; that is, you need to set ExtraAgentContainers to something like “name=c1,image=img:tag,cpu=250m,memory=64Mi||name=c2,image=img2:tag2,cpu=500m,memory=128Mi”.
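
Applied to the pipeline from above, the variable definition could look like this (image names and resource values are arbitrary examples):

variables:
  extraAgentContainers: "name=node,image=docker.io/library/node:20.1.0,cpu=750m,memory=1Gi||name=jdk,image=docker.io/library/ibm-semeru-runtimes:open-17.0.8.1_1-jdk-jammy,cpu=500m,memory=1Gi"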

Automatic clean-up of agents

To prevent the list of registered dummy agents from growing indefinitely, the Kubernetes operator regularly unregisters all agents named dummy-agent-<random-postfix> that are offline and were created several hours ago. The clean-up interval and the minimum age of cleaned-up agents can be configured in the operator’s configuration chart via dummyAgentGarbageCollectionInterval and dummyAgentDeletionMinAge. Make sure you only use h/m/s units in these values (e.g. “24h” for 24 hours); d (days) is not supported!

Once the operator has unregistered a dummy agent with a unique capability X (being unique because it contains something like Build.BuildId), any job that demands X can no longer be restarted. This might be an issue if you have flaky jobs which fail on the first run, but may succeed on a repeated run (by clicking on the Rerun failed jobs button). Make sure you provide a large enough value for dummyAgentDeletionMinAge to account for such scenarios.

Conclusion

Elastic scalability of CI/CD pipeline agents is very important, both from an economic perspective and to keep your development teams productive (because scaling prevents jobs from queuing up). Using a mature platform like Kubernetes as the foundation makes sense, especially if you already use Kubernetes anyway, e.g. for hosting your services.

For Azure Pipelines, KEDA had been the only choice for running AZP agents on Kubernetes – until now. We replaced KEDA with my Kubernetes operator in our projects several months ago and observed much better reliability (far fewer flaky jobs), along with improved cost efficiency. Let me know if you have tried it out!
