Tips for operating Docker – the best tools and commands

This article presents useful tips for operating Docker engine on a Linux server. I explain how you can improve the server’s security via automatic updates and scanning the images of your running containers, how to reduce the maintenance efforts by making sure your disks don’t run out of space, and how to set up a Prometheus-based monitoring system that lets you quickly diagnose performance issues. Finally, I present useful Docker management tools, such as Portainer.

Introduction

When it comes to operating (running) software packaged as Docker / OCI images, you have several options:

  • You can use a scalable solution, such as Kubernetes, which is useful if you expect your app to be under strong load, making it necessary to orchestrate many containers on multiple nodes.
  • You are not expecting much load for your app, and a single server can handle it; in this case, you can use solutions such as Docker engine or podman.

In this article, I’ll focus on the latter case, using the Docker engine on a single Linux server. I will explain best practices and useful commands and tools, which give you the following advantages:

  • Improve server security, e.g. via automatic updates, and using tools that detect insecure configurations / containers
  • Reduce maintenance efforts via automatic updates, and a sane default configuration that won’t trash your disk
  • Ability to quickly diagnose performance issues, in case they do happen, by using Prometheus-based monitoring

Some of these tips also apply to a local Docker Desktop setup.

Configuring the Docker daemon

One very common problem with operating a Docker server (long-term) is running out of disk space. The reason is that Docker does not do any automatic clean-ups. Most notably, container logs and unused images and volumes are never cleaned. We can solve this by running cron jobs that do regular cleaning, but also by configuring the Docker daemon.

Daemon configuration

When the Docker daemon starts, it reads the configuration file located at /etc/docker/daemon.json. You may need to create this file if it does not exist. From experience, I have found it useful to tweak the following settings:

  • Location of the data partition: by default, Docker stores all persistent data (e.g. image layers, container file systems, volumes, etc.) in sub-directories of /var/lib/docker. If your server has multiple file systems, and the partition that stores /var/lib/docker is rather small (e.g. just a few GB), but there are other, much bigger data partitions, then you should change the data-root setting to the absolute path of the data partition you want to use. In general, it is not recommended to operate a Docker server with just a few GBs of storage, as it will fill up so quickly that you might not be able to delete outdated stuff fast enough.
  • Log driver: as documented here, Docker’s default logging driver is json-file, which does not limit the log file size. An easy fix is to switch to the more efficient local log driver, by adding "log-driver": "local" to your daemon.json file. The local log driver comes with sane defaults, creating up to 3 (rotated) log files of up to 20 MB each, but you could also overwrite the defaults, as documented here.
  • Pull-through cache (a.k.a registry mirror): whenever you use Docker images whose tags do not have a registry prefix (e.g. postgres:12 vs. registry.myserver.com/postgres:12), the Docker daemon attempts to pull the images from Docker Hub. However, Docker Inc. added IP-based rate limiting to image downloads from Docker Hub in November 2020. You may be impacted by this rate limit, e.g. if all your Docker servers (and engineers using Docker) share a single IP (e.g. of your company’s Internet gateway), and consequently, all requests, combined, exceed Docker Hub’s rate limit. Fortunately, you can tell your Docker daemon to use one (or more) registry mirrors, from which it attempts to pull images first, falling back to Docker Hub in case of errors. You can use public registries, such as Google’s mirror.gcr.io, or run your own private mirrors, e.g. using Harbor. Just set the registry-mirrors option to an array of URLs.

The following daemon.json example summarizes the tips above:


{
  "data-root": "/data",
  "log-driver": "local",
  "registry-mirrors": ["https://mirror.gcr.io"]
}

Warning

Changing the content in daemon.json only takes effect after restarting Docker (e.g. via sudo systemctl restart docker). Also, some settings (such as log-driver) apply only to new containers – existing containers will not automatically use such changed settings. Therefore, you may want to stop and remove your running containers and re-create them, e.g. via
docker-compose stop && docker-compose rm -f && docker-compose up -d

If you just learnt about these tips and want to learn how much disk space the Docker logs already consume, run
find /var/lib/docker/containers/ -type f -name "*.log" | xargs du -sh

To delete these log files, run find /var/lib/docker/containers/ -type f -name "*.log" -delete

Customizing logs per container

It is possible to fine-tune the log driver (and its settings) per container. This is useful if you have containers with very verbose logging (that you cannot configure otherwise), and you want to grant such a container a larger upper limit than other containers.

For instance, the docker run command supports the arguments --log-driver and --log-opt (which must be placed before the image name), such that you could run something like
docker run --rm --log-driver local --log-opt max-size=200m \
--log-opt max-file=5 some-verbose-container-image

The docker-compose.yml file also supports corresponding settings, see here.
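
For instance, a minimal docker-compose.yml sketch (service and image names are placeholders) could look like this:

services:
    my_service:
        image: some-verbose-container-image
        logging:
            driver: local
            options:
                max-size: "200m"
                max-file: "5"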

Cleaning images etc. via cron

To get an overview of the disk space currently consumed by volumes, images and containers, run docker system df. You can also add the -v flag to this command to see a more verbose list of objects.

Cleaning up these items can be done by a very simple cron job that you run at any frequency you like, e.g. once per week or month. It is, however, challenging to determine how aggressive you want pruning to be. Here are a few options:

  • Delete dangling images (docker image prune -f): this is a very non-aggressive clean-up (and is generally recommended, as there are no negative side effects). Dangling images are images that are not used by any (running or stopped) container, and which don’t even have a tag anymore. This could happen, for instance, if you run a container with image “postgres:latest“, and a few weeks later, you stop the container and restart it, pulling a new “postgres:latest” image. The older postgres image becomes a dangling image.
  • Delete unused images (docker image prune -af): unused images are also not used by any (running or stopped) container. Compared to dangling images, unused images still have a valid reference tag. You can generally clean these up too. The only downside is that Docker has to re-download these images if you re-create a container based on them.
  • Delete unused volumes (docker volume prune -f): unused volumes are defined as volumes not mounted into any (running or stopped) container. You should be very careful with automatically cleaning up volumes. You may need the data stored in them after all. For instance, in development, volumes auto-created in a Docker-compose stack may contain important data for your development or testing scenarios. The volumes may only appear to be unused, because you previously ran “docker-compose rm -f” to stop these containers. A good “middle ground” can be to use a clean-up filter that specifies that only those unused volumes should be deleted that do not have a special “deletion-protection” label (more details below).
  • Delete stopped containers (docker container prune -f): while doing so is generally a good idea, you should be careful not to overlook a stopped container which you actually still need, because it contains important data in its writable file system layers. Also, once this command has deleted stopped containers, their images and volumes become unused and a follow-up clean-up call for these volumes/images may have unintended side effects.
  • Delete the local build cache (docker builder prune -f): whenever you build new images from a Dockerfile, the builder component of Docker creates a new layer for each statement in the Dockerfile. These layers are kept in the (local) build cache indefinitely. The docker builder prune -f command gets rid of them.
    • “Build cache” is a rather generic term. It not only stores the locally-created image layers, but also the files of BuildKit’s cache mount feature (which I already discussed here).
    • For the builder’s prune command to work properly, you should first prune dangling/unused images. The reason is that Docker internally maintains the dependency between cached layers, and the corresponding base images. Pruning just one of them won’t be enough. You can try it yourself by building a new image, then run docker image prune -af, then rebuild the image right away: the build command will complete very quickly, because the cached layers “restore” the image. Likewise, if you build the image, then run docker builder prune -f and immediately rebuild the image, Docker can repopulate the cache layers, because the final image still exists (including its intermediate layers). You will also notice that the byte-count in the “RECLAIMABLE” column of the Build Cache row of docker system df will increase significantly, after you pruned the images.

Note: the -f flag makes the commands “unattended”, i.e., gets rid of the confirmation prompt.

There is also docker system prune -f which deletes all stopped containers, unused networks, dangling images, and build caches (but keeps volumes and unused images). Adding the -a flag (docker system prune -af) applies the -a semantic for images, i.e., it deletes both dangling and unused images. Adding --volumes at the end of the command additionally deletes unused volumes.

A useful clean-up option is filtering (see docs), which supports labels and timestamps:

  • Filtering by labels: e.g. docker system prune -af --filter 'label!=keep_when_pruning' combined with creating objects (such as volumes, containers, or networks) with that label where applicable, e.g. docker volume create --label keep_when_pruning your_volume_name
  • Filtering by timestamps: e.g. docker container prune -f --filter "until=4w" where the valid time units are “s” (seconds), “m” (minutes), “h” (hours), “d” (days), “w” (weeks), “y” (years). The example command would delete all stopped containers, except those created during the last 4 weeks. Note that the creation time may be much older than the last-started time.

Make sure you are aware of which objects the filter applies to. This is a no-brainer if you run object-specific commands, such as docker (volume|container|image) prune, but it is less obvious when you run docker system prune. Carefully read the filtering docs of each command, and actually test them, before putting them into a cron job, to avoid unfortunate surprises. For instance, docker volume prune does not support filtering by timestamps, but only by labels!

To run these commands regularly, consult your Linux distribution’s manual for how to set up cron jobs. I also found crontab.guru helpful to build cron time strings.
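
To illustrate, here is a minimal sketch of a clean-up script you could call from such a cron job. It combines the prune commands discussed above; adjust the aggressiveness, the filters, and the ordering (see the caveats about containers and volumes above) to your needs before running it unattended:

#!/bin/bash
# Remove stopped containers that were created more than 4 weeks ago
docker container prune -f --filter "until=4w"
# Remove dangling and unused images (they can be re-downloaded if needed)
docker image prune -af
# Remove the build cache (do this after pruning images, see above)
docker builder prune -f
# Remove unused volumes, except those carrying the "deletion-protection" label
docker volume prune -f --filter "label!=keep_when_pruning"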

Auto-update the Docker engine

When you install Docker engine on your Linux distribution following the official Docker docs, it may very well happen that your distro’s auto-update mechanism does not update the Docker engine by default. For instance, on Debian (which uses unattended-upgrades to install updates from package repositories every night), Docker is not covered, because it comes from a separate repository. To fix this for Debian, edit the file /etc/apt/apt.conf.d/50unattended-upgrades and add the line
"origin=Docker";
into the Unattended-Upgrade::Origins-Pattern section.
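
The relevant section of that file could then look as follows (the Debian-Security entry is just an example of a line that is typically already present):

Unattended-Upgrade::Origins-Pattern {
        "origin=Debian,codename=${distro_codename},label=Debian-Security";
        "origin=Docker";
};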

If you use a different Linux distro, you need to figure out for yourself whether your distro has this caveat, too.

Another problem is that the auto-update mechanism does not necessarily restart any services after updating their files. Consequently, you either need to find ways to achieve this (e.g. see here for Fedora/CentOS) or simply have a cron job that regularly reboots the server. The latter also ensures that kernel updates are properly applied.
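
For instance, the following root crontab entry (a sketch; choose a time that matches your maintenance window) reboots the server every Sunday at 4 a.m.:

0 4 * * 0 /sbin/shutdown -r now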

Monitoring for security and performance issues

It is important to explicitly configure automated monitoring of your Docker-related server activity. Without such monitoring, you are basically asking your users to do the monitoring for you, where users send you messages whenever a service goes down. Automated monitoring improves on this, having the following advantages:

  • Security:
    • Using security scanning tools, you can detect misconfigurations or insecure images of third party components that are run in containers on your Docker server, without requiring any expertise on your end. This is particularly useful if you are sharing the Docker server with multiple developers or admins, who might not be well-versed when it comes to security.
    • You can prevent security-related incidents, and their fallout (such as a loss of reputation for your company).
  • Performance:
    • If problems occur (e.g. a service going down), you can get to the root cause more easily, with the help of historically-collected metrics data.
    • You can prevent problems in the first place, using appropriately-tuned alerting rules, which may even warn you before something becomes a problem (e.g. a disk that is about to fill up).

There are many tools on the market to achieve this. The stack I have had good experiences with so far is the following:

  • Security:
    • I regularly run the docker-bench-security tool, which performs many checks from the CIS Docker benchmark. It finds all kinds of issues, both in the images you are running, and in the configuration of the host. I recommend that you git clone the shell-script version of docker-bench-security. Don’t use its Docker image, which may be outdated, as described here.
    • Additionally, you can scan all images that are currently in use for security vulnerabilities, e.g. using Trivy. Using a simple shell script, you can first get the images of all (running or stopped) containers, build a unique list of images from that (removing duplicates, e.g. via docker container ls -a --format '{{.Image}}' | sort -u), and then feed each image to the Trivy CLI, parsing the resulting scan result (see the sketch after this list).
  • Performance: the Prometheus monitoring stack is a very powerful set of tools to monitor your server performance. I have written about the Prometheus stack extensively, so if you are unaware of what it is and how it works, or need a refresher, I recommend parts 2-5 of my Kubernetes observability series (parts 2-5 explain Prometheus in general, independent of Kubernetes). In practical terms, I recommend Stefan Prodan’s dockprom repository, which gets you started very quickly. It preconfigures the necessary services (Prometheus, node exporter, cAdvisor for Docker, Grafana) in a docker-compose setup, and also preconfigures several dashboards and alerting rules for the host and containers. You still need to make customizations for your needs. For the alerting rules, this database has a great list of rules to get you started. The changes I found most useful are:
    • Configuration of the Alertmanager (including a receiver, and a routing tree – both are missing entirely in the dockprom template).
    • Building Grafana dashboards that are specific to your operated (Dockerized) services.
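
The image-scanning approach from the Security bullet above could look like the following minimal sketch, assuming the trivy CLI is installed on the host (tune the severity filter to your needs):

#!/bin/bash
# Build a de-duplicated list of the images of all (running or stopped) containers
for image in $(docker container ls -a --format '{{.Image}}' | sort -u); do
    # Scan each image, reporting only HIGH and CRITICAL vulnerabilities
    trivy image --severity HIGH,CRITICAL "$image"
done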

It is possible to automate security monitoring (vs. running these jobs manually): simply run the security tools (which scan your host and the used images) regularly as a cron job, and plug the output of these tools into your Prometheus stack via the Node Exporter’s textfile collector. For instance:

  • docker-bench-security outputs a machine-readable file containing the scan results (named docker-bench.log.json) which is very easy to process, e.g. with a Python or Go script. You can invent your own metrics format, where you e.g. use the failed test identifier as label. A concrete example for a metric sample that your cron job writes to the file parsed by the textfile collector: docker_cis_benchmark_failures{test="1.1.1"} 1
  • Trivy supports machine-readable output formats, too, such as json, see official docs. Similar to docker-bench-security, you could invent new metrics, such as trivy_vulnerabilities{image="postgres:9.6",pkgName="somelib",severity="HIGH",vulnerabilityId="CVE-2014-1234"} 1
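
As an illustration, the following sketch converts docker-bench-security’s JSON output into the metric format shown above, using jq. The jq filter assumes a tests[].results[] structure with id and result fields, and the target directory is whatever you passed to the Node Exporter’s --collector.textfile.directory flag (verify both against your setup):

#!/bin/bash
# Extract failed checks ("WARN") from docker-bench's JSON output and write them
# as Prometheus metrics for the textfile collector. In production, write to a
# temporary file first and then mv it, so the collector never reads partial data.
jq -r '.tests[].results[]
    | select(.result == "WARN")
    | "docker_cis_benchmark_failures{test=\"" + .id + "\"} 1"' \
    docker-bench.log.json > /var/lib/node_exporter/textfile/docker_bench.prom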

Limiting per-service resources

In case you run many services on a single machine, it is useful to limit the CPU and memory resources for each service. You can read up on the gory details in Docker’s official docs. The following subsections present “in a nutshell” summaries:

Limiting memory

If you limit memory, you should usually also disable the use of swap memory.

If you use docker run commands to manage your containers, use something like --memory="2g" --memory-swap="2g". Both values should be the same (to disable swapping), and you can use the suffixes b, k, m, g, to indicate bytes, kilobytes, megabytes, or gigabytes.

The docker-compose.yml equivalent (for the non-Swarm mode) looks as follows:

services:
    my_service:
        mem_limit: 2g
        memswap_limit: 2g

Limiting CPU

Docker lets you:

  • Limit the number of (virtual) CPU cores, with the --cpus setting, which you set to a number of virtual cores, e.g. 0.5 or 2
  • Pin the CPU cores via --cpuset-cpus, as comma-separated list of 0-indexed cores, e.g. 1,3,5 or as a range of cores, e.g. 0-2

You can also do both, e.g. pinning to 3 specific cores, but at the same time limit the overall use to 1 virtual core, which means that the container is only allowed to use about one third of each of these 3 specific cores.
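
For example, the following sketch (the image name is a placeholder) pins a container to cores 1, 3 and 5, while limiting it to one virtual core overall:

docker run --cpus 1 --cpuset-cpus 1,3,5 some-container-image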

Docker also offers a completely core-independent approach with --cpu-shares, which you set to an integer (relative to 1024, which is the default value). Here, the usable CPUs are not limited, but Docker assigns priorities when multiple containers (combined) want to use more CPUs than are available. It is similar to the “niceness” of Linux processes. A higher value for --cpu-shares means that the container gets a higher priority.

In docker-compose.yml, the settings are called cpus, cpuset and cpu_shares respectively.
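
A docker-compose.yml sketch (non-Swarm mode; service name and values are placeholders) combining these settings could look like this:

services:
    my_service:
        cpus: 1
        cpuset: "1,3,5"
        cpu_shares: 512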

Note: you can verify the effective memory limit and the current CPU usage via docker stats.

Useful management tools

Most people manage their Docker server with the docker CLI, using commands such as docker ps, docker run, etc.

However, there are several more graphically-oriented tools that can make your life easier:

  • Portainer: Portainer is a web interface and management platform for Kubernetes and Docker (supporting single host or swarm mode). Portainer is a container by itself, so installing/deploying it is easy (see the example install command after this list). The UI lets you create new containers and manage existing ones, networks, etc., and it comes with integrated user management and an elaborate permissions system. Consequently, you can give outsiders permission to do certain things, without having to provide them with SSH access to the Docker host. If you have several Docker hosts, you can also decide to configure just a single Portainer instance which manages all Docker hosts, which you add as (remote) endpoints. Portainer has an “edge computing” feature which lets you deploy Docker(-compose) services on multiple endpoints in parallel, or run one-time jobs on the endpoints.
  • lazydocker: lazydocker is a simple terminal UI for both Docker and docker-compose, which also supports mouse input. It shows the running containers (or compose stacks), container metrics as ASCII graph, logs, lets you start/stop containers, prune images, and more.
  • Nginx Proxy Manager: this tool provides a simple-to-use web UI that lets you manage an Nginx reverse proxy, which would usually expose a selected set of Dockerized services to other hosts (or even the Internet). It includes the ability to manage several domains (virtual hosts), automatic Let’s Encrypt SSL certificates, and lets you configure access permissions, such as Basic Auth. Use Nginx Proxy Manager if you want to avoid the hassle of learning the underlying basics (of Nginx, certbot, etc.) and how to set these up yourself, or if CLI-based solutions such as nginx-proxy are too much “working on the shell” for you.
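
As an illustration of how simple the Portainer deployment is, here is a sketch of a typical single-host install of Portainer CE (image tag and ports follow Portainer’s documentation at the time of writing; verify against the current docs):

docker volume create portainer_data
docker run -d -p 9443:9443 --name portainer --restart=always \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -v portainer_data:/data portainer/portainer-ce:latest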

Conclusion

In many projects, a single server can easily handle the load caused by your users. In such cases, using Docker (as opposed to Kubernetes) is a good idea. Docker’s advantages are the huge ecosystem of images, and the isolation it provides between containers, which eliminates the kinds of conflicts that would arise if you installed everything directly via the distro’s package manager.

Naturally, you don’t want to spend too much effort maintaining a server. The tips offered in this article should vastly reduce your maintenance efforts. Granted, it takes a bit of time to set up a working monitoring system, but once established, it is easy to replicate to other servers. The management tools I presented should simplify your life even more.
