This article explores my Kubernetes operator that provisions Azure Pipelines agents as Kubernetes Pods, and comes with an in-depth tutorial for how to use it. I also explore the alternative options for running Azure Pipelines agents, and explain why a Kubernetes-based option beats them.
Introduction
CI/CD pipelines are a cornerstone of efficient software development teams. A pipeline automates all steps that happen between a developer pushing new code and the end user trying the changes in some deployed environment. See this blog post for more details about CI/CD pipelines.
To keep the development team productive, a CI/CD pipeline should complete as quickly as possible. One possible cause for slow pipelines is the build agent infrastructure, by which I mean the deployment approach you use to run your pipeline agents. These agents do the actual computational work defined in your pipeline, e.g. building your Docker images or deploying them. While you could simply set up hundreds of very fast machines to reduce the pipeline duration, this would be very expensive, and not particularly “green”. Consequently, you need elastic scaling for your build agents, where some scaling mechanism automatically provisions (and deprovisions) agents, depending on the number of pending CI/CD jobs.
Azure Pipelines (AZP) is one of many CI/CD platform alternatives. It is especially popular in enterprises that already use the Azure cloud anyway, because AZP integrates into Azure much better than other CI/CD platforms (say, GitLab).
Unfortunately, at the time of writing, AZP does not offer an actually good elastic scaling approach for the AZP agents. In this article, I discuss and compare the available existing AZP agent deployment choices, and then present my own Kubernetes operator that solves the issues of the other approaches.
Options for Azure Pipelines build agent infrastructure
The following table lists and compares the officially-supported options to run AZP agents:
| | Microsoft-hosted VMs | Self-hosted VM / server | Azure VM Scale Set | ACI Terraform Module | KEDA |
| --- | --- | --- | --- | --- | --- |
| Customizability of used hardware | ❌ 2 vCPUs, 8 GB RAM | ✅ | ✅ | ✅ Limited to choosing vCPU count and GBs of memory | ✅ |
| Customizability of pre-installed tools | ❌ | ✅ Via customized disk image | ✅ Via customized disk image | ✅ Via customized Docker image | ✅ Via customized Docker image |
| Supported operating systems | Win, Linux, macOS | Win, Linux, macOS | Win, Linux | Win, Linux | Win, Linux |
| Elastic scaling | ✅ | ❌ | ✅ | ❌ | ✅ |
| Provisioning speed | A few seconds | — | Up to 20 minutes | — | From a few seconds up to 1-2 minutes |
| Resource usage efficiency | Poor (only one agent per VM) | Poor (only one agent per VM) | Poor (only one agent per VM) | Good | Good |
| Pricing | 37€/month per agent | Depends on chosen CPU and RAM | Depends on the number of VMs and VM size | Depends on chosen CPU and RAM | Depends on chosen K8s node VM sizes |
| Technical issues | Unknown | Unknown | See their huge FAQ | Unknown | See section “A closer look at KEDA” below |
Here are a few notes about each approach:
- Microsoft-hosted VMs: using them means you avoid any maintenance work, and all pre-made tasks work out-of-the-box. However, because you do not have any influence over the pre-baked VM disk images, your pipeline might break whenever Microsoft decides to change some of the pre-installed tool versions.
- Self-hosted VM / server: on a static set of physical or virtual servers, you install the AZP agent yourself, either directly on the host or in a Docker container.
- Azure VM Scale Set: you first create an Azure VM Scale Set (VMSS) where you choose the VM size and disk image, and then grant AZP the permissions to manage the VMSS. As described here, upscaling happens only every 5 minutes, and you should “… allow 20 minutes for machines to be created”, and “it can take an hour or more for Azure Pipelines to scale out or scale in”. Yay! You should definitely use that 😀 (you get that I’m being sarcastic…. right?)
- ACI Terraform Module: provisions a static number of AZP agent containers, running on Azure Container Instances (ACI)
- KEDA: KEDA is a general-purpose Kubernetes operator that can scale `Jobs` or `Deployments`/`StatefulSets`, based on data emitted by some event source. Here, this event source is AZP’s (undocumented) jobs API, which lists only pending and running jobs. The provisioning time depends on whether a Pod can fit on an already-running K8s node or not – if not, the cluster autoscaler needs to provision a node first, which may take 1-2 minutes.
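For reference, a KEDA-based setup is configured via a `ScaledObject`. The following is a hedged sketch based on KEDA's documented `azure-pipelines` scaler; the target `Deployment` name, pool name, and environment variable names are placeholders, not values from any project discussed here:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: azp-agent-scaler
spec:
  scaleTargetRef:
    name: azp-agent-deployment   # placeholder: your agent Deployment
  minReplicaCount: 1             # scale-to-zero requires extra tricks, see below
  maxReplicaCount: 5
  triggers:
    - type: azure-pipelines
      metadata:
        poolName: "my-pool"                        # placeholder pool name
        organizationURLFromEnv: "AZP_URL"          # env var on the target container
        personalAccessTokenFromEnv: "AZP_TOKEN"    # env var holding the PAT
```

KEDA then polls the AZP jobs API on your behalf and adjusts the `replicas` count of the referenced `Deployment`.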
One option not listed above is apparently to be announced by Microsoft soon. They have promised it for Q3, moved it to Q4, and we’ll see whether Microsoft will deliver.
There are also numerous third-party Kubernetes operators, such as this or this one, all of which have been discontinued, so I did not analyze or try them.
A closer look at KEDA
Once you examine the above comparison table, your conclusion should be to either use Microsoft’s hosted agents, or to use KEDA. KEDA offers better customizability (of hardware and software), so that’s what my company chose, at first. But over time, we experienced the following uncomfortable issues (which is why I implemented my own Kubernetes operator):
- It is not easily possible to run agents with (sidecar) containers dynamically defined in your pipeline YAML file. Example: job #1 builds and pushes a Docker image (with a version tag that depends on an Azure Pipelines variable, e.g. `Build.BuildId`) that you want to run with a KEDA-based agent in job #2 (job #2 starts after job #1). The only solution is to start a dynamic container as an ephemeral container (in an already-running agent `Pod`). But this has many other drawbacks: for instance, an ephemeral container cannot be protected from termination via a `preStop` lifecycle hook, it is invisible in most tools, and its resource usage is not accounted for via `requests`/`limits`.
- Using “scale to zero” is more difficult with KEDA: you either have to manually register a fake/dummy agent for each pool/demand, or set `minReplicaCount > 0` in your `ScaledObject`. Otherwise, your jobs would not even start (I discuss this limitation of the AZP platform in more detail below).
- If you use long-running agent pods (i.e., you do not provide the `--once` flag to the Azure Pipelines agent container), KEDA may prematurely kill your agent pods, resulting in aborted pipelines and many “offline” agents in your agent pool. Why? Because KEDA scales your `Deployments`/`Jobs` only based on the number of pending jobs. Suppose two jobs are pending, and a `Deployment` with 2 `replicas` is scheduled by KEDA. One job finishes (successfully) quickly, the other one takes a bit longer. The pending job count reported by the AZP job API drops from 2 to 1, and KEDA down-scales the `Deployment` by changing its `replicas` count to 1. Now, Kubernetes’ `Deployment` and `ReplicaSet` controllers arbitrarily terminate one of the `Pods`. Murphy makes sure it’s the one that still runs the active job.
  - One solution for this problem is to use short-lived Kubernetes `Jobs`, as done in https://github.com/clemlesne/azure-pipelines-agent. Unfortunately, they lack support for cache volumes: Kubernetes has no mechanism to ensure that a cache volume is concurrently used by only one `Job` – the `ReadWriteOnce` `accessMode` does not mean that only one `Job` can access a volume!
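To make the `ReadWriteOnce` caveat concrete: that access mode only restricts a volume to a single *node*, not a single `Pod` or `Job`. A PVC like the following (an illustrative example, not part of the linked project) can still be mounted by two `Jobs` concurrently if they happen to be scheduled on the same node:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: build-cache   # illustrative name
spec:
  accessModes:
    - ReadWriteOnce   # node-level exclusivity only, not Pod-level!
  resources:
    requests:
      storage: 10Gi
```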
Kubernetes operator to scale Azure Pipelines agents
I built a Kubernetes operator called azure-pipelines-k8s-agent-scaler which solves all problems we had with KEDA.
What is a Kubernetes operator?
In a nutshell, a Kubernetes operator consists of a `CustomResourceDefinition` and a controller application (which is deployed as a container in a `Deployment`). The controller essentially translates whatever you define in `CustomResource` (CR) objects into “normal” Kubernetes workloads (such as `Pods`, `ConfigMaps`, etc.), and ensures that all divergences are continuously reconciled.
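For illustration, here is what a simplified, generic `CustomResourceDefinition` looks like. All names (`Widget`, `example.com`) are made up for this sketch and are unrelated to my operator's actual schema:

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: widgets.example.com   # must be <plural>.<group>
spec:
  group: example.com
  scope: Namespaced
  names:
    kind: Widget
    plural: widgets
    singular: widget
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:   # validation schema for Widget objects
          type: object
          properties:
            spec:
              type: object
              properties:
                replicas:
                  type: integer
```

Once such a CRD is installed, users can create `Widget` objects, and the controller watches them and reconciles the cluster state accordingly.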
Like the KEDA operator, my operator also queries the AZP jobs API, which announces pending and ongoing jobs, and then creates corresponding `Pods`. This query is repeated every couple of seconds. In a `CustomResource` that you deploy, you define the AZP credentials, the AZP pool name, and the different `Pod` templates (for the different AZP capabilities/demands you want to support).
The key features of the operator are:
- Scale to zero: scale to zero saves CPU & memory resources, and therefore reduces infrastructure costs. My operator creates (and destroys) Kubernetes `Pods` (one `Pod` per AZP job). The agent `Pods` are ephemeral (using the agent software’s `--once` flag, see here for details), such that agent pods terminate automatically after finishing an AZP job.
  - You may wonder: why does my operator manage `Pods` directly, while other solutions (such as KEDA) instead manage higher-level workloads, such as `Deployment`/`StatefulSet` objects, updating their `replicas` count? By managing `Pods` directly, the operator has full control over which superfluous `Pods` to terminate. Suppose the user starts an AZP pipeline, the operator schedules a `Pod`, but then the user cancels that pipeline run again. The idling agent `Pod` is no longer needed, and the operator should terminate it. But Kubernetes operators often have some delay, or they base their decisions on a slightly outdated view of the `Pods`, so the idle agent `Pod` might actually no longer be idle by the time the operator decides to terminate it, in which case that `Pod` should not be killed. In essence, my operator does the equivalent of a “`kubectl exec`” into the agent `Pod` to determine whether it is running an active AZP job, and only if this is not the case does the operator terminate the agent `Pod`.
  - My operator automatically takes care of the registration (and deletion) of fake offline agents in the AZP agent pool, which is required by the AZP platform so that it even announces jobs on the AZP jobs API.
- Reusable cache volumes: because ephemeral pods lack storage persistence, my operator supports defining and mounting persistent, reusable cache volumes. An example scenario where this is useful: building Docker images with BuildKit, which benefits from a persistent local cache, as discussed in this article.
- Dynamic sidecar containers: sometimes, a job needs other sidecar containers whose images contain binaries that are missing from the AZP container’s image (e.g. Terraform, Cypress, Java SDK, etc.). These containers should run in the same `Pod` as the AZP agent container, to allow for efficient file sharing via an `emptyDir` volume. My operator allows you to define dynamic sidecar containers directly in the AZP pipeline YAML file: you define a demand called `ExtraAgentContainers` whose value specifies the sidecars (more details in the tutorial below). When the operator starts an agent `Pod`, it parses this demand and creates the defined sidecar containers as regular containers in the `Pod`. They show up as normal containers in Kubernetes Dashboard (or other tools), and you can exec into them if needed. To run commands in these sidecars in an AZP job, you can use the `execContainer.sh` script, which handles waiting until the container is up (which could take longer if pulling the container’s image takes a very long time).
Tutorial for using the Azure Pipelines agent scaler operator
Let’s see my operator in action. The tutorial is divided into four sections, with each section progressively showcasing more features of the operator.
1. Basic setup
1.1 Installation of the operator + preparing the AZP pool
First, you need to install the operator into your cluster. Follow the instructions here to do so. In the `azp-operator` namespace, you should now see a `Deployment` and `Pod` for the controller of the operator, and the `AutoScaledAgent` CRD should also be installed.
Next, open the AZP web interface, navigate to your project’s settings page, and create a new self-hosted agent pool. For this tutorial, I’ll use “operator-pool” as the name; if you choose a different name, replace it accordingly in the snippets below.
1.2 Creating the CR
Next, we need to create an `AutoScaledAgent` CR in your cluster, which configures the operator. I recommend using a dedicated Kubernetes namespace for this, because the operator creates `Pods` and `PVCs` in the same namespace as the CR. For this tutorial, I choose “azp-agents” as namespace.

Because the `AutoScaledAgent` CR references a `Secret` that contains your Personal Access Token (PAT), we need to create that PAT first. See the docs for details. In the AZP web interface, in the popup dialog that opens, you need to click on the “Show all scopes” link at the bottom; the “Agent pools” category will then show up, where you need to select the “Read & manage” checkbox. Define a PAT name and an expiration date, and click “Create”. Make sure you save the PAT value somewhere safe (e.g. a password manager), because the AZP web interface will only show it once.
Now we can create the `AutoScaledAgent` CR. There are two options to do so:

- Craft the contents of the `AutoScaledAgent` CR yourself, e.g. based on the sample
- Make a copy of the demo-agent Helm chart and customize it

In this tutorial we use the second approach, because it is easier. The demo-agent Helm chart essentially generates an `AutoScaledAgent` CR and applies it to the cluster, together with a few other helper files (such as a `Secret` that stores your AZP PAT).
Make a copy of the demo-agent chart folder, and change its `values.yaml` file as follows:
```yaml
poolName: "operator-pool"
organizationUrl: "https://dev.azure.com/REPLACEME"
maxTerminatedPodsToKeep: 1
pat: ""  # Override this via --set

# Overwrite default values, if necessary:
dummyAgentGarbageCollectionInterval: "30m"
dummyAgentDeletionMinAge: "2h"
normalOfflineAgentDeletionMinAge: "5h"

azpAgentContainer:
  image:
    registry: ghcr.io
    repository: mshekow/azp-agent
    tag: "2023.09.04"
    pullPolicy: Always
  resources:
    limits:
      memory: 512Mi
    requests:
      memory: 512Mi

reusableCacheVolumes: []

buildkitConfig:
  debug: false
  gc:
    keepCacheMountsGi: 30
    keepTotalGi: 60

podsWithCapabilities:
  - capabilities: { }
    minCount: 1
    maxCount: 5
    containers: [ ]

imagePullSecrets: [ ]
terminationGracePeriodSeconds: 1200
nameOverride: ""
fullnameOverride: ""
podLabels: { }
nodeSelector: { }
tolerations: [ ]
affinity: { }
```
Some remarks about the `values.yaml` file:

- At the top, change the value of `organizationUrl` to contain your organization name instead.
- The most important part is the `podsWithCapabilities` setting. It defines a single `Pod` template that has no particular AZP capabilities/demands. We want at least one `Pod` (and at most 5 of them) running at any given time. Also, that `Pod` should have no other sidecar containers (so the only active container is the one for the AZP agent, which we don’t need to explicitly define in the `values.yaml` file – the Helm chart template already does this for us).
- The agent image (`mshekow/azp-agent`) is based on this Dockerfile. In a production setting of a real project, you may want to build your own Docker image, so that you control which other tools are installed into the image, and to have control over the installed AZP agent version.
Now, run Helm to install your modified chart, also creating the namespace:

```shell
helm upgrade --install --namespace azp-agents --create-namespace --set pat=PASTE-YOUR-PAT-HERE demo-agent-release demo-agent
```

The last argument is the relative path to the directory in which you store your Helm chart copy, so modify it if necessary.
Via “`kubectl get pod -n azp-agents`” you should now see a `Pod` with an AZP agent, because of the `minCount: 1` in the above `values.yaml` file. When you look at the agent pools in your AZP project settings, you should also see this agent as “online”.

Now change the `minCount` value to `0` (so that we have the “scale to zero” approach) and run the above “`helm upgrade …`” command again. The AZP agent `Pod` should now be terminated. The agent `Pod` still exists (e.g. it is listed by “`kubectl get pod`”), but all its containers have stopped. The `maxTerminatedPodsToKeep` setting in the above `values.yaml` file controls how many of the most recently terminated `Pods` you want the operator to keep. Only those terminated `Pods` exceeding this limit are completely removed by the operator.
1.3 Run a simple hello-world pipeline
In your Git repo (e.g. a Git repo stored in Azure Repos), create an `azure-pipeline.yaml` file with the following content:
```yaml
trigger:
  batch: true
  branches:
    include:
      - "*"

pool:
  name: operator-pool

jobs:
  - job: hello_world
    steps:
      - script: echo "hello world"
        displayName: hello world
```
Create an AZP pipeline for that YAML file and run it. The first job will be pending for a few seconds. You should observe that the operator spins up a `Pod`, the agent container in that `Pod` runs the job, and then terminates again.
2. Static sidecar containers
If you want to use tools in your pipeline that are not included in the AZP agent’s Docker image, you can make use of static or dynamic sidecar containers. This section discusses static sidecars, which you define “statically” in the `values.yaml` file.
Suppose we want the ability to build new Docker images in the pipeline. To achieve this, we add a second `Pod` template with a “buildkit” capability, and we define a BuildKit sidecar container which we run in rootless mode and configure with a `ConfigMap` named “`buildkit-config`” that the demo-agent Helm chart already prepares for us. Because the rootless mode of BuildKit requires some Pod-wide `annotations` and `securityContext` tweaks, we also provide them. Add the following `Pod` template to `podsWithCapabilities`:
```yaml
- capabilities:
    buildkit: "1"
  minCount: 1
  maxCount: 5
  securityContext:
    fsGroup: 1000
    fsGroupChangePolicy: "OnRootMismatch"
  annotations:  # See https://github.com/moby/buildkit/issues/2441#issuecomment-1253683784
    container.apparmor.security.beta.kubernetes.io/buildkit: unconfined  # last segment must match the container name
  volumes:  # defines extra/custom volumes
    - name: buildkit-config
      configMap:
        name: buildkit-config
  containers:  # defines extra sidecar containers that run alongside the AZP agent container
    - name: buildkit
      image:
        registry: docker.io
        repository: moby/buildkit
        tag: master-rootless
        pullPolicy: Always
      command: [ ]  # optional, overwrites the image's ENTRYPOINT
      args:  # optional, overwrites the image's CMD
        - --oci-worker-no-process-sandbox
      resources:
        limits:
          memory: 1Gi
        requests:
          memory: 1Gi
      securityContext:  # optional, may be necessary for some cluster configurations or images
        seccompProfile:
          type: Unconfined
        runAsUser: 1000
        runAsGroup: 1000
      readinessProbe:
        exec:
          command:
            - "buildctl"
            - "debug"
            - "workers"
      volumeMounts:  # defines extra volume mounts, if necessary
        - name: buildkit-config
          mountPath: /home/user/.config/buildkit
```
Next, run the above “`helm upgrade …`” command again, to apply the updated configuration. Your AZP agent pool now shows an agent that has the “buildkit” capability.
Commit a `Dockerfile` to your repository, e.g. with dummy content such as this:

```dockerfile
FROM alpine:latest
RUN echo 1234
```
Change your `azure-pipeline.yaml` file as follows to run a Docker image build:
```yaml
trigger:
  batch: true
  branches:
    include:
      - "*"

pool:
  name: operator-pool
  demands:
    - buildkit

jobs:
  - job: build_image
    steps:
      - script: ./execContainer.sh -n buildkit -c 'cd $(System.DefaultWorkingDirectory) && buildctl build --frontend dockerfile.v0 --local context=. --local dockerfile=.'
        displayName: Build Docker image
        workingDirectory: $(Agent.WorkFolder)
```
Your pipeline should now successfully build an image (without pushing it). The `execContainer.sh` script is part of the `mshekow/azp-agent` image. It essentially forwards the command provided in `-c` to the BuildKit container via “`kubectl exec`”, but first waits for the BuildKit container to be up and running. This waiting procedure is necessary because it can happen that the AZP agent container starts quickly, while pulling the BuildKit image takes a long time → by the time the `script` step runs, the BuildKit container might not be ready yet.

The operator makes sure that all sidecar containers share an `emptyDir` volume with the AZP agent container. When the AZP agent clones the repository’s code to `$(System.DefaultWorkingDirectory)` (which is located inside the `emptyDir` volume), it is also available to each sidecar container under that same path. In the `script` step, the “`cd $(System.DefaultWorkingDirectory)`” statement is necessary because the working directory of the BuildKit container is arbitrary (each Docker image defines some working directory), so we change it to `$(System.DefaultWorkingDirectory)` first. If you are not familiar with “`buildctl build`” (but you know “`docker build`”), see this blog post to learn about the differences between these two client CLIs.
If you take a closer look at the `Pods` in the azp-agents namespace, you will notice the following pattern:

- Once the AZP agent has finished the job, its container terminates, but the BuildKit container (and its `buildkitd` daemon process) is still running.
- A few seconds later, the operator detects this situation and terminates the BuildKit sidecar container, so that the entire `Pod` is now in a terminated state. The operator uses a simple hack: it changes the image version tag to a non-existing tag, so the container runtime attempts to restart the container, stopping it first, but then fails to restart it. The consequential downside of this approach is that some Kubernetes client tools (e.g. Lens) may show the `Pod`’s state as “failed”. This is expected behavior.
3. Reusable cache volumes
A major downside of the above BuildKit example is that BuildKit’s local cache is not persistent (yet). In rootless mode, BuildKit stores its data in `/home/user/.local/share/buildkit`, and the content of that folder is lost after the termination of the AZP agent `Pod`.
The operator supports creating and assigning persistent cache volumes to the agent `Pods` it creates. The operator creates as many `PersistentVolumeClaims` (PVCs) as needed, and makes sure that a specific PVC is only bound to one `Pod` at a time. Once that `Pod` terminates, the operator does not delete the corresponding PVC, but keeps it, so that another `Pod` (that spawns some time in the future) can reuse this PVC again.
To add such reusable cache volumes, change the `values.yaml` file so that BuildKit’s `Pod` template in the `podsWithCapabilities` array looks as follows:
```yaml
- capabilities:
    buildkit: "1"
  minCount: 0
  maxCount: 5
  securityContext:
    fsGroup: 1000
    fsGroupChangePolicy: "OnRootMismatch"
  annotations:  # See https://github.com/moby/buildkit/issues/2441#issuecomment-1253683784
    container.apparmor.security.beta.kubernetes.io/buildkit: unconfined  # last segment must match the container name
  volumes:  # defines extra/custom volumes
    - name: buildkit-config
      configMap:
        name: buildkit-config
  containers:  # defines extra sidecar containers that run alongside the AZP agent container
    - name: buildkit
      image:
        registry: docker.io
        repository: moby/buildkit
        tag: master-rootless
        pullPolicy: Always
      command: [ ]  # optional, overwrites the image's ENTRYPOINT
      args:  # optional, overwrites the image's CMD
        - --oci-worker-no-process-sandbox
      resources:
        limits:
          memory: 1Gi
        requests:
          memory: 1Gi
      securityContext:  # optional, may be necessary for some cluster configurations or images
        seccompProfile:
          type: Unconfined
        runAsUser: 1000
        runAsGroup: 1000
      readinessProbe:
        exec:
          command:
            - "buildctl"
            - "debug"
            - "workers"
      mountedReusableCacheVolumes:
        - name: buildkit-cache
          mountPath: /home/user/.local/share/buildkit
      volumeMounts:  # defines extra volume mounts, if necessary
        - name: buildkit-config
          mountPath: /home/user/.config/buildkit
```
Also, add the following block at the top of your `values.yaml` file:
```yaml
reusableCacheVolumes:
  - name: buildkit-cache
    storageClassName: hostpath
    requestedStorage: 70Gi
```
Depending on your Kubernetes cluster provider, you may want to change the `storageClassName` to something other than `hostpath`, or change the `requestedStorage` value to a different size. If you do change `requestedStorage`, you should also adapt the `gc` (= garbage collection) values for `keepCacheMountsGi` and `keepTotalGi` accordingly, as explained in the comments in the example values.yaml file.
Note that the `name` of the cache (here: “buildkit-cache”) defined in `reusableCacheVolumes` needs to match the `name` in `mountedReusableCacheVolumes`. When the operator creates the `Pod`, it automatically creates (or reuses) a PVC, and makes a corresponding entry in the `Pod`’s `volumes` array.
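Conceptually, the resulting `Pod` spec then contains entries along these lines (the PVC name shown is made up; the operator generates its own names):

```yaml
volumes:
  - name: buildkit-cache
    persistentVolumeClaim:
      claimName: buildkit-cache-1   # hypothetical operator-generated PVC name
containers:
  - name: buildkit
    volumeMounts:
      - name: buildkit-cache
        mountPath: /home/user/.local/share/buildkit
```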
Now, run the above “`helm upgrade …`” command again, to apply the updated configuration.
Because we changed `minCount` to `0`, you can observe that the existing BuildKit agent `Pod` is being terminated.
Now, run the AZP pipeline again. You will observe that the operator creates a PVC and attaches it to the BuildKit `Pod`. When you run the AZP pipeline yet another time, this PVC is reused, and the image build should complete much quicker, because all layers of the image have been cached. However, make sure you allow 5-10 seconds to pass between the end of pipeline run #1 and the start of pipeline run #2, to give the operator enough time to update the housekeeping metadata (Kubernetes labels); otherwise the operator might create a second PVC.
Cloud storage caveat when using multiple zones
If you use a cloud-hosted Kubernetes cluster (e.g. Azure AKS), you can choose to run nodes (or place volumes) in multiple AZs (availability zones) of a region. In other words: you are using zonal nodes. Or you might be forced to use zonal nodes, because your chosen reusable-cache-volume storage (such as Azure’s Premium SSD v2) can only be attached to zonal nodes. In any case, most cloud providers have the limitation that the zone in which a volume was created must be the same zone as the node to which you want to attach the volume. My operator does not account for that. The easiest solution, IMO, is to create a dedicated Kubernetes node pool that is zonal but only uses one specific zone. Otherwise, the Kubernetes scheduler will automatically ensure that the AZP agent `Pod` is scheduled on a node in the same zone as the PVC chosen by my operator, which may have adverse effects on your cluster auto-scaling.
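If you go the single-zone route, you can additionally pin the agent `Pods` to that zone via the `nodeSelector` field that the `values.yaml` already offers, using Kubernetes' well-known topology label (the zone value below is just an example):

```yaml
nodeSelector:
  topology.kubernetes.io/zone: westeurope-1   # example zone, adapt to your cloud/region
```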
4. Dynamic sidecar containers
Until now, we defined the sidecar containers in our `Pod` templates, in the `values.yaml` file of the operator’s configuration Helm chart. I would call this static configuration, and it would typically be managed by an administrator of your Kubernetes cluster.
However, my Kubernetes operator also supports dynamic sidecar containers, which means that your pipeline YAML sets AZP variables whose values are used in a special demand called `ExtraAgentContainers`. Let’s see this feature in action.
Suppose your pipeline should be able to build a Node.js application or a Java application, with the help of the respective Java/Node Docker images. You want the flexibility to choose the concrete Node/Java versions right in your pipeline, because the versions might change very often, and you don’t want to keep updating the `values.yaml` file of your static sidecar containers all the time.
First, add an `agent-registrator.yaml` file to your repository with the following content:
```yaml
# This template registers a fake/dummy AZP agent whose AZP capabilities match dynamically-generated demands that
# contain ExtraAgentContainers definitions.
# This is a workaround for AZP limitations.
parameters:
  - name: extraAgentContainersCapability
    type: string
  - name: shortName
    type: string
    displayName: short name shown for the job display in the AZP web UI
  - name: stepName
    type: string
    displayName: name of the step, needed for follow-up jobs to reference it when reading output variables
  - name: registratorCliVersion
    type: string
    default: "1.3.1"
  - name: localBinaryName
    type: string
    default: agent-registrator

steps:
  # Download the agent-registrator binary (unless it already exists) and run it
  - script: |
      set -euo pipefail
      if [ ! -e "${{ parameters.localBinaryName }}" ]; then
        curl -L https://github.com/MShekow/azure-pipelines-agent-registrator/releases/download/v${{ parameters.registratorCliVersion }}/azure-pipelines-agent-registrator_${{ parameters.registratorCliVersion }}_linux_amd64 -o ${{ parameters.localBinaryName }}
        chmod +x ${{ parameters.localBinaryName }}
      fi
      echo "Registering ExtraAgentContainers capability value: ${{ parameters.extraAgentContainersCapability }}"
      ./${{ parameters.localBinaryName }} -organization-url $(System.CollectionUri) -pool-name $AZP_POOL -pat $(azureDevOpsPat) \
        -agent-name-prefix dummy-agent -capabilities 'ExtraAgentContainers=${{ parameters.extraAgentContainersCapability }}'
      echo "##vso[task.setvariable variable=out;isoutput=true]${{ parameters.extraAgentContainersCapability }}"
    name: ${{ parameters.stepName }}
    displayName: register ${{ parameters.shortName }} EAC capability
    workingDirectory: $(Agent.WorkFolder)
```
Next, update your `azure-pipeline.yaml` file as follows:
```yaml
trigger:
  batch: true
  branches:
    include:
      - "*"

parameters:
  - name: applicationType
    displayName: Application type
    type: string
    default: java
    values:
      - java
      - node

pool:
  name: operator-pool

variables:
  ${{ if eq(parameters.applicationType, 'node') }}:
    image: "docker.io/library/node:20.1.0"
    containerName: "node"
    extraAgentContainers: "name=$(containerName),image=$(image),cpu=750m,memory=1Gi"
  ${{ if eq(parameters.applicationType, 'java') }}:
    image: "docker.io/library/ibm-semeru-runtimes:open-17.0.8.1_1-jdk-jammy"
    containerName: "jdk"
    extraAgentContainers: "name=$(containerName),image=$(image),cpu=750m,memory=1Gi"

jobs:
  - job: register_agent
    steps:
      - template: agent-registrator.yaml
        parameters:
          extraAgentContainersCapability: $(extraAgentContainers)
          shortName: app-type
          stepName: appCapability

  - job: use_dynamic_sidecar_container
    dependsOn: register_agent
    variables:
      redefinedEAC: $[ dependencies.register_agent.outputs['appCapability.out'] ]
    pool:
      name: operator-pool
      demands:
        - ExtraAgentContainers -equals $(redefinedEAC)
    steps:
      - script: ./execContainer.sh -n $(containerName) -c '(which node || which java) && sleep 30'
        displayName: Dummy build-app-job
        workingDirectory: $(Agent.WorkFolder)
```
What is happening above is the following:
- The first job,
register_agent
, uses the agent registrator CLI to register a dummy/offline AZP agent in theoperator-pool
pool, with the capability$(extraAgentContainers)
.- We need to do this because of a limitation of the AZP platform: if a job (like
use_dynamic_sidecar_container
) has a demand for which the pool has no registered agent yet with the matching capability, AZP immediately aborts the job. The AZP platform would not even advertise the job via the AZP job API that is regularly polled by our operator. Thus, the operator would not get the chance to startPods
for such jobs. I created a feature request asking to change the AZP platform behavior (see ticket), but it is unlikely that Microsoft will react any time soon. - The registrator CLI expects the AZP PAT to be exposed as AZP variable
$(azureDevOpsPat)
(seeagent-registrator.yaml
, line 34). To expose this variable, create a secret variable for your pipeline as documented here, namedazureDevOpsPat
using the PAT that you created in step 1 of this tutorial.
- We need to do this because of a limitation of the AZP platform: if a job (like
- In the second job, `use_dynamic_sidecar_container`, we verify that the sidecar container works as expected. Because I did not provide any real application code in this tutorial, we are not really building an application; instead, we just verify whether the `node` or `java` binary is available. As in parts 2 and 3 of this tutorial, we need to use the `execContainer.sh` script to forward the command to the sidecar container.
  - Under the hood, when the operator sees a pending job with the `ExtraAgentContainers` demand, it parses the demand string and creates corresponding additional sidecar containers in the `Pod` spec. For instance, if the `ExtraAgentContainers` demand is set to `name=jdk,image=docker.io/library/ibm-semeru-runtimes:open-17.0.8.1_1-jdk-jammy,cpu=750m,memory=1Gi`, the operator uses "jdk" as the name of the container, sets the provided `image` (`.../ibm-semeru-runtimes...`), and assigns both `limits` and `requests` to the provided `cpu`/`memory` values (which are optional). The `ENTRYPOINT` and `CMD` of the sidecar container are overwritten with `/bin/sh -c 'trap : TERM INT; sleep 9999999999d & wait'`.
  - You might wonder why the demand uses `$(redefinedEAC)` instead of `$(extraAgentContainers)`. The reason for this is, yet again, a limitation of the AZP platform: `extraAgentContainers` is a nested variable, which is not supported by demands (for unknown reasons). As you can see here, a workaround is to use an output variable (docs) that is redefined as a job-local variable. The `agent-registrator.yaml` template creates this output variable.
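To make this concrete, a sidecar container generated from the demand above could look roughly like the following `Pod` spec excerpt. This is a sketch for illustration only; the exact spec the operator generates (e.g. additional labels, environment variables, or volume mounts) may differ.

```yaml
# Hypothetical excerpt of the agent Pod spec generated for the demand
# "name=jdk,image=docker.io/library/ibm-semeru-runtimes:open-17.0.8.1_1-jdk-jammy,cpu=750m,memory=1Gi"
containers:
  - name: jdk
    image: docker.io/library/ibm-semeru-runtimes:open-17.0.8.1_1-jdk-jammy
    command: ["/bin/sh", "-c"]                           # overrides the image's ENTRYPOINT
    args: ["trap : TERM INT; sleep 9999999999d & wait"]  # overrides the image's CMD
    resources:
      requests:
        cpu: 750m
        memory: 1Gi
      limits:
        cpu: 750m
        memory: 1Gi
```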
You can now run the pipeline, choosing "java" or "node" in the Run pipeline dialog. The last line of the `script` step's output of the `use_dynamic_sidecar_container` job should be the absolute path to the `node` or `java` binary. When the `Pod` runs on a Kubernetes node for the first time, the `execContainer.sh` script may emit many log lines of the sort "Pod not running yet, waiting for 5s to query again", because the container runtime needs time to pull the Java or Node Docker image.
Running multiple sidecar containers
The above example added a single sidecar container. To add multiple ones, use `||` as separator, that is, set `ExtraAgentContainers` to something like `name=c1,image=img:tag,cpu=250m,memory=64Mi||name=c2,image=img2:tag2,cpu=500m,memory=128Mi`.
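The operator implements this demand parsing internally in its own code base. Purely as an illustration of the format, here is a minimal Python sketch (the function name `parse_extra_agent_containers` is made up for this example) that splits such a demand string into per-container settings:

```python
def parse_extra_agent_containers(demand: str) -> list[dict]:
    """Split an ExtraAgentContainers demand string into one dict per
    sidecar container. Containers are separated by '||', and the settings
    of one container (name, image, cpu, memory) are separated by ','."""
    containers = []
    for container_spec in demand.split("||"):
        settings = {}
        for key_value in container_spec.split(","):
            # partition (not split) keeps values that contain '=' intact,
            # and values containing ':' (e.g. image tags) are unaffected
            key, _, value = key_value.partition("=")
            settings[key.strip()] = value.strip()
        containers.append(settings)
    return containers

demand = ("name=c1,image=img:tag,cpu=250m,memory=64Mi"
          "||name=c2,image=img2:tag2,cpu=500m,memory=128Mi")
parsed = parse_extra_agent_containers(demand)
print(parsed[0]["name"], parsed[1]["image"])  # c1 img2:tag2
```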
Automatic clean-up of agents
To prevent the list of registered dummy agents from growing indefinitely, the Kubernetes operator regularly unregisters all agents named `dummy-agent-<random-postfix>` that are offline and were created more than several hours ago. The clean-up interval and the minimum age of cleaned-up agents can be configured in the operator's configuration chart via `dummyAgentGarbageCollectionInterval` and `dummyAgentDeletionMinAge`. Make sure you only use `h`/`m`/`s` units in the values (e.g. `24h` for 24 hours); `d` (days) is not supported!
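For instance, assuming the chart accepts these settings as top-level values (check the chart's `values.yaml` for the actual key structure), the configuration could look like this:

```yaml
# Hypothetical excerpt of the operator chart's configuration values.
# Only h/m/s duration units are accepted; "1d" would be rejected.
dummyAgentGarbageCollectionInterval: 1h  # how often offline dummy agents are cleaned up
dummyAgentDeletionMinAge: 24h            # minimum agent age before unregistering
```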
Once the operator has unregistered a dummy agent with a unique capability X (unique because it contains something like `Build.BuildId`), any job that demands X can no longer be restarted. This might be an issue if you have flaky jobs which fail on the first run but may succeed on a repeated run (by clicking the Rerun failed jobs button). Make sure you provide a large enough value for `dummyAgentDeletionMinAge` to account for such scenarios.
Conclusion
Elastic scalability of CI/CD pipeline agents is very important, both from an economic perspective and to keep your development teams productive (because scaling prevents jobs from queuing up). Using a mature platform like Kubernetes as a foundation makes sense, especially if you already use Kubernetes anyway, e.g. for hosting your services.
For Azure Pipelines, KEDA has been the only choice for running AZP agents on Kubernetes – until now. We replaced KEDA with my Kubernetes operator in our projects several months ago and have observed much better reliability (far fewer flaky jobs), along with improved cost efficiency. Let me know if you try it out!