Docker build cache: debug techniques

The Docker build cache avoids rebuilding those parts of a Docker image that were already built. Unfortunately, cache misses are hard to debug. In this article I explain four frequent yet unexpected reasons for cache misses, with solutions. One of them is that layers for COPY or ADD statements are rebuilt because files have changed. To diagnose the exact files, I present a new CLI tool called directory-checksum, with a small tutorial that illustrates its use.

Table Of Contents

Introduction
Frequent reasons for cache misses
A look at existing tools
Time to build a better tool
Mini tutorial
Conclusion

Introduction

Optimizing the execution time of CI/CD pipelines is very important, as I’ve already explained in a previous blog post. To reduce the build time of Docker images (which is part of CI), Docker layer caching plays a crucial role. You can find out more about Docker’s image-layer cache invalidation logic here.

Unfortunately, Docker layer caching is sometimes “broken”: Docker (or whatever tool you use to build images, e.g. buildah or kaniko) keeps rebuilding certain image layers, even with an already optimized .dockerignore file. There does not seem to be a good way to debug the invalidation of the image layer cache, and I’m not the first person who finds this problematic. There are various forum threads (see e.g. here, here or here) and even a GitHub issue for Docker/BuildKit, with a promise by the Docker CTO that they would look into it.

Frequent reasons for cache misses

While a cache miss often happens when you expect it (e.g. when you change a statement in your Dockerfile, or when you change files), some cases are less obvious. Let’s look at a few reasons for image layer cache misses that I have observed in practice, and how to avoid them:

If the problem happens in a CI pipeline, and if you use multiple build agent machines (or “ephemeral” machines), it might happen that build job #1 was executed on agent #A, but build job #2 was executed on agent #B, which has a different local cache than agent #A. Consequently, when you look at your CI pipeline output, be sure to check on which agent the jobs are executed. To prevent these kinds of cache misses, you can use a remote cache, storing caching information in a remote image registry. There are two implementation approaches: inline caching (where the image builder embeds caching meta-data into the image it builds), or a separate registry cache (where a separate image is pushed that contains only cache blobs). The usage details of remote caching depend on your image builder tool. For instance, at the time of writing, docker build supports only inline caching (see here), docker buildx (or BuildKit used directly) supports both approaches (see here), and Buildah and kaniko support only the registry cache (see here for Buildah).
If you use ARG in your Dockerfile, it is easy to accidentally cause cache misses. Whenever the value of some ARG is different between two docker build executions, the second execution won’t be able to reuse the previously cached layer for a RUN or ENV command that uses the ARG’s value. This then also invalidates all follow-up layers. See here for background information. If you use multi-stage builds, and if you run docker build several times (for different targets), make sure you always provide the same ARG values to all docker build calls!
Sometimes the entire image is rebuilt whenever a new version of the base image (that you reference in a FROM statement) has been released. This particularly happens if you use docker build --pull. You need to look closely at the builder’s output for the first layer, which includes the SHA-256 checksum of the base image. If it keeps changing frequently, there is no real “fix”. Your image should be rebuilt, to include the most recent security fixes of the base image. However, if the base image is rebuilt very often (e.g. multiple times per day), you may want to stop using the --pull flag, and instead use a different approach that runs docker pull <base image> (or deletes the base image) less often, e.g. once per day.
Layers for COPY or ADD statements are rebuilt “unexpectedly” whenever files change that you did not have on your radar. Example #1: files that should be excluded but are not in your .dockerignore file yet, e.g. the “.git” folder, or files created during building/testing (e.g. unit test report files, or log files). This typically happens when running “COPY . .”, because then your entire project directory is copied from the build context into the build container, which increases the chance that you missed excluding some superfluous files (that do not belong in the container anyway) via .dockerignore. Example #2: cache misses that happen in a Pull Request pipeline which does not run on the exact code of the PR’s branch (which has not changed), but on a (virtual) merge commit of the PR to the target branch (which might have changed in the meantime). To simplify discovering such kinds of cache misses, I developed a small CLI tool called Directory Checksum, presented below.
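To make the first reason above (different build agents) more concrete, here is a minimal sketch of both remote caching approaches. The function names, the DRY_RUN variable and the image references you pass in are hypothetical and only serve illustration; the docker flags themselves are the documented ones. Depending on your setup, exporting a registry cache may require a builder that uses the docker-container driver (created e.g. via docker buildx create --use).

```shell
# run CMD...: execute the command, or only print it when DRY_RUN=1
# (DRY_RUN is a hypothetical helper variable, handy for checking the flags).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Inline cache: the cache meta-data is embedded into the pushed image itself.
# Requires BuildKit to be enabled.
build_with_inline_cache() {
  image=$1
  run docker build \
    --build-arg BUILDKIT_INLINE_CACHE=1 \
    --cache-from "$image" \
    -t "$image" .
  run docker push "$image"
}

# Registry cache: cache blobs are pushed to a separate image reference.
build_with_registry_cache() {
  image=$1
  cache_ref=$2
  run docker buildx build \
    --cache-from "type=registry,ref=$cache_ref" \
    --cache-to "type=registry,ref=$cache_ref,mode=max" \
    -t "$image" --push .
}
```

In a CI job you would call one of the two functions with your own image names, e.g. build_with_registry_cache registry.example.com/app:latest registry.example.com/app:buildcache (both references are placeholders). Because every agent pulls the cache from the registry, it no longer matters which agent ran the previous job.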

A look at existing tools

To solve cache misses that happen on image layers with an ADD or COPY command, the basic approach is to compare the contents of the source directory that you are copying into the build container, between two docker build runs. Since printing the actual contents of the directories to the console would be extremely verbose (and hard to compare), it is much more practical to instead print the directory listings (including checksums of files and folders). Assuming that you already have a .dockerignore file, comparing the directories should happen inside the build container, not on your host, so that the pre-filtering of files (covered by the .dockerignore file) is already applied.

When I looked at existing tools that supposedly solve this problem, I found several caveats:

dtreetrawl was hard to get working on arbitrary Linux distributions, because it requires glibc to be installed. Another problem is that the checksums printed for directories incorporate meta-data, which is ignored by the cache invalidation logic of image builder tools.
md5deep is unsuitable because it ignores empty directories (md5deep only considers files). Image builders, however, do account for empty directories! md5deep is also hard to deploy, because you need to compile it first.
Some forums (see e.g. here) suggest chaining the outputs of basic UNIX tools (such as find and md5sum), but this typically also ignores empty directories.
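The empty-directory pitfall is easy to reproduce. The following sketch (with hypothetical function names, assuming a GNU or BusyBox userland that provides sha1sum) contrasts a typical files-only pipeline with a variant that also hashes directory paths. It is only an illustration of the caveat, not a replacement for a proper tool.

```shell
# Digest over file paths and contents only: empty directories are invisible here.
files_only_digest() {
  (
    cd "$1" &&
      find . -type f -exec sha1sum {} + | LC_ALL=C sort | sha1sum | cut -d ' ' -f 1
  )
}

# Same idea, but the paths of all directories (including empty ones) are hashed too.
files_and_dirs_digest() {
  (
    cd "$1" && {
      find . -type d | LC_ALL=C sort
      find . -type f -exec sha1sum {} + | LC_ALL=C sort
    } | sha1sum | cut -d ' ' -f 1
  )
}
```

If you create an empty directory somewhere below the scanned directory, files_only_digest still prints the same value as before, while files_and_dirs_digest prints a different one.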

Time to build a better tool

To properly debug cache misses for ADD or COPY statements, I decided to build a new CLI tool with the following requirements:

Compute the checksum of a directory in the same way that image building tools do, only considering the names of files / directories, the binary content of files, and the (simplified) listings of directories, but ignoring any meta-data (e.g. creation timestamp or owner) and file identities (e.g. inodes).
Decouple the checksum computation from printing them: a properly implemented tool needs to scan and compute the checksums for all directory levels that exist (going “infinitely deep”), but the user should be able to limit the levels that are printed, to avoid spamming the console and losing the overview. Many tools do not allow decoupling these two aspects.
Deployment aspects:
- Tool must be a static binary without any dependencies, so that it works in any Linux distro
- The static binary should be readily available (precompiled) and be small in size: otherwise it would be annoying having to build the tool first, or having to install an interpreter first (e.g. for the checksumdir Python package), or having to wait a long time to download a large precompiled binary.
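The first two requirements can be illustrated with a rough shell sketch. It is not how directory-checksum is implemented (the function names and the output format are made up for this example); it only shows the idea that a directory's checksum always covers its entire subtree, while the printed depth is limited separately. It assumes a find that supports -maxdepth (GNU, BSD, BusyBox) and sha1sum, and it does not handle symlinks or file names containing newlines.

```shell
# tree_checksum DIR: digest of all relative paths and file contents below DIR.
# Timestamps, owners and permissions never enter the hash.
tree_checksum() {
  (
    cd "$1" && {
      find . -type d | LC_ALL=C sort
      find . -type f -exec sha1sum {} + | LC_ALL=C sort
    } | sha1sum | cut -d ' ' -f 1
  )
}

# print_checksums DIR MAX_DEPTH: print one line per entry, at most MAX_DEPTH
# levels deep. Directory checksums still cover everything below them.
print_checksums() {
  root=$1
  max_depth=$2
  find "$root" -maxdepth "$max_depth" | LC_ALL=C sort | while IFS= read -r path; do
    if [ -d "$path" ]; then
      printf '%s D %s\n' "$(tree_checksum "$path")" "$path"
    elif [ -f "$path" ]; then
      printf '%s F %s\n' "$(sha1sum < "$path" | cut -d ' ' -f 1)" "$path"
    fi
  done
}
```

Calling print_checksums . 1 prints only the top-level entries, yet a change to a deeply nested file still alters the checksum of every directory above it. Running touch on a file changes nothing, because only names and contents are hashed.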

Since I wanted to learn the Go programming language anyway, I decided to implement the tool with it. Go can produce static, self-contained, small binaries, and comes with all batteries included to solve this kind of problem. In another article, I take a closer look at my (learning) experience with Go.

Mini tutorial

Let’s see the Directory Checksum tool in action, solving the mystery of a cache miss affecting a COPY statement.

Set up the project

Let’s build the Directory Checksum tool itself, in Docker.

First, check out the project (git clone https://github.com/MShekow/directory-checksum.git).

In the project root, create a .dockerignore file with the following content:

# Ignore all files and dirs starting with a dot, e.g. ".git", ".idea", etc.
.*

Next, create the following Dockerfile:

FROM golang:1.19-alpine

WORKDIR /app

COPY go.mod ./
COPY go.sum ./
RUN go mod download

COPY . .

RUN go build -o directory-checksum

ENTRYPOINT [ "/app/directory-checksum" ]
CMD [ "." ]

Build the tool with docker build -t directory-checksum .

Discover the problem

To see the problem in action, add some explanatory comments to the Dockerfile (e.g. # Install module dependencies right above the first COPY statement), then repeat the build: contrary to our expectations, the “COPY . .” layer (line 9 of the original Dockerfile) and all subsequent layers are rebuilt, even though our code has not changed, nor has the Dockerfile really changed (after all, comments are no-ops).

Debug the problem with directory-checksum

Following the instructions in the README, you add several statements to the Dockerfile, so that it now looks as follows:

# syntax=docker/dockerfile:1
FROM golang:1.19-alpine

ADD --chmod=755 https://github.com/MShekow/directory-checksum/releases/download/v1.4/directory-checksum_1.4_linux_amd64 /usr/local/bin/directory-checksum

WORKDIR /app

# Install module dependencies
COPY go.mod ./
COPY go.sum ./
RUN go mod download

COPY . .

RUN directory-checksum --max-depth 2 .

RUN go build -o directory-checksum

ENTRYPOINT [ "/app/directory-checksum" ]
CMD [ "." ]


You may have to replace amd64 with arm64 if you use an ARM-based CPU (e.g. M1 Macs).

Now run our build again, adding another parameter so that the output of directory-checksum is not truncated (which is BuildKit’s default behavior): docker build -t directory-checksum --progress=plain .

Note: if you follow this tutorial on Linux with a Docker Engine version older than 23.0 (where BuildKit is not yet the default builder), you first have to set the environment variable DOCKER_BUILDKIT to 1, e.g. by running export DOCKER_BUILDKIT=1 in the shell, prior to running docker build.

The output of the directory-checksum tool looks similar to the following:

#19 0.464 f2bf79a90acf39cf0355ca5348eba4e367bbdb99 D .
#19 0.464 3554651c071b29ebcbbf1938c9a9e174c4e97752 D directory_checksum
#19 0.464 d3535b82d40d8ae1db287116cfd9f84dd96ddbcc F directory_checksum/checksum_utils.go
#19 0.464 48201039d8ee9bfdcb27f1671c6d22ff64df763a F directory_checksum/checksum_utils_test.go
#19 0.464 f5c804a673af22628d7e8c001448c85ea3a27d0c F directory_checksum/dict_utils.go
#19 0.464 bbbc5d6ca6b038b3c935f520b819ac5930eebb8c F directory_checksum/dict_utils_test.go
#19 0.464 04582c0bec7ca654081c524c4a7525d41780b277 F directory_checksum/directory.go
#19 0.464 6cab31684c5652e27f582e274ccf3f1cf3ee7c1c F directory_checksum/fs_scanner.go
#19 0.464 c3cf698288784fd8247475378b8fe27c9debe7fc F directory_checksum/fs_scanner_test.go
#19 0.464 83096e6f8a2d99eef1b380836106b4865d60eddd F directory_checksum/utils_test.go
#19 0.464 14522b988fd897dd72a56a29881eda710daf6bca F Dockerfile
#19 0.464 4001605f39928bbe5a73273b2eef45388f88df66 F LICENSE
#19 0.464 4ecea9de8a419500c245b571268c5ab378c0060f F README.md
#19 0.464 ec1e91de37e43d0900a7f2d9804ea01a853bf781 F go.mod
#19 0.464 2e28201df61e9c5fcb010c5b7835bcb568b5a79d F go.sum
#19 0.464 0b4450cd1798283423d829cec367ad4ab796dbed F main.go

If you were to repeat the above command (and not change anything at all), the build should complete instantly, using only cached layers.

But let’s see what happens if you add more documentation (as comments) to the Dockerfile and repeat the docker build command: the “COPY . .” layer is rebuilt again, and the directory-checksum tool now produces the following output:

#17 0.459 546f59c038298896cbd4c085e70a2a6646cc8ee7 D .
#17 0.459 3554651c071b29ebcbbf1938c9a9e174c4e97752 D directory_checksum
#17 0.459 d3535b82d40d8ae1db287116cfd9f84dd96ddbcc F directory_checksum/checksum_utils.go
#17 0.459 48201039d8ee9bfdcb27f1671c6d22ff64df763a F directory_checksum/checksum_utils_test.go
#17 0.459 f5c804a673af22628d7e8c001448c85ea3a27d0c F directory_checksum/dict_utils.go
#17 0.459 bbbc5d6ca6b038b3c935f520b819ac5930eebb8c F directory_checksum/dict_utils_test.go
#17 0.459 04582c0bec7ca654081c524c4a7525d41780b277 F directory_checksum/directory.go
#17 0.459 6cab31684c5652e27f582e274ccf3f1cf3ee7c1c F directory_checksum/fs_scanner.go
#17 0.459 c3cf698288784fd8247475378b8fe27c9debe7fc F directory_checksum/fs_scanner_test.go
#17 0.459 83096e6f8a2d99eef1b380836106b4865d60eddd F directory_checksum/utils_test.go
#17 0.459 88ea3fc50919a4eac93f8d1c5a828eb655246176 F Dockerfile
#17 0.459 4001605f39928bbe5a73273b2eef45388f88df66 F LICENSE
#17 0.459 4ecea9de8a419500c245b571268c5ab378c0060f F README.md
#17 0.459 ec1e91de37e43d0900a7f2d9804ea01a853bf781 F go.mod
#17 0.459 2e28201df61e9c5fcb010c5b7835bcb568b5a79d F go.sum
#17 0.459 0b4450cd1798283423d829cec367ad4ab796dbed F main.go


Comparing the two outputs, you can see that only the checksum of the Dockerfile has changed, and with it the checksum of the root directory (“.”), i.e. the entire WORKDIR. At this point you realize: the “COPY . .” statement also copies the Dockerfile into the container, which is a bad idea anyway.
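With longer listings, spotting the changed lines by eye gets tedious. Here is a minimal sketch of how the comparison could be automated, assuming you saved the directory-checksum lines of each build into two text files. The function names are hypothetical, and the sed pattern assumes the "#<step> <seconds> " prefix shown in the outputs above.

```shell
# strip_buildkit_prefix: remove the "#<step> <seconds> " prefix that BuildKit
# puts in front of each output line when using --progress=plain.
strip_buildkit_prefix() {
  sed -E 's/^#[0-9]+ [0-9]+\.[0-9]+ //'
}

# changed_entries OLD_LOG NEW_LOG: print the checksum lines that occur only in
# NEW_LOG, i.e. the entries whose checksum differs from the previous build.
changed_entries() {
  old_sorted=$(mktemp)
  new_sorted=$(mktemp)
  strip_buildkit_prefix < "$1" | LC_ALL=C sort > "$old_sorted"
  strip_buildkit_prefix < "$2" | LC_ALL=C sort > "$new_sorted"
  LC_ALL=C comm -13 "$old_sorted" "$new_sorted"
  rm -f "$old_sorted" "$new_sorted"
}
```

Applied to the two listings of this tutorial, changed_entries prints exactly two lines: the one for “.” (checksum 546f59c0…) and the one for the Dockerfile (checksum 88ea3fc5…).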

The solution: add the Dockerfile to your .dockerignore file. From now on, repeated builds should no longer have a cache miss, and you can remove the directory-checksum related lines from the Dockerfile again. I suggest you just comment these lines out, to be able to quickly comment them in again, should you run into the problem in the future.
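If you prefer to script this fix, the following small sketch appends the entry only when it is missing. The function name is hypothetical, and the file defaults to .dockerignore in the current directory.

```shell
# ignore_dockerfile [FILE]: make sure FILE (default: .dockerignore) contains
# a line that reads exactly "Dockerfile".
ignore_dockerfile() {
  file=${1:-.dockerignore}
  # Nothing to do if the exact line is already present.
  if grep -qxF 'Dockerfile' "$file" 2>/dev/null; then
    return 0
  fi
  # If the last line lacks a trailing newline, add one first, so that the new
  # entry is not glued onto the previous pattern.
  if [ -s "$file" ] && [ -n "$(tail -c 1 "$file")" ]; then
    printf '\n' >> "$file"
  fi
  printf 'Dockerfile\n' >> "$file"
}
```

Excluding the Dockerfile this way does not break the build: the builder still receives the Dockerfile, because it needs it to do its job, but ADD and COPY statements no longer copy it into the image.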

Conclusion

Fast Docker image builds are a cornerstone of fast Continuous Integration cycles. I’ve discussed how to optimize the Docker image build speed before (see here). In this piece, I looked at less obvious causes for misses in the image layer cache, and how to address them, including the use of my directory-checksum tool.

You can download and use directory-checksum on every common operating system and CPU architecture. Just download the automatically-built binary from here.

Let me know in the comments if you have any further tips for debugging the Docker build cache. If you have issues with directory-checksum, feel free to create an issue on GitHub.