Benchmark series part 1: Benchmarking virtual cloud hardware using 6 great tools

This article discusses virtual cloud hardware benchmarks, which help you choose the best performing virtual cloud-based CPU/disk/memory hardware for your needs. I go into the cost-benefit ratio of benchmarking, and provide many good practices for hardware benchmarking, such as ensuring proper reproducibility. Finally, I explore 6 off-the-shelf benchmark tools that measure CPU, memory or disk performance: CoreMark, OpenSSL, 7-zip, Sysbench, FIO and Geekbench.

Introduction

Cloud providers offer a plethora of PaaS and IaaS services that run on some kind of virtualized hardware. This hardware is often described only in nebulous terms (e.g. “approximately equivalent to Intel Xeon XYZ”), and sometimes is not described at all (e.g. in the case of most FaaS offerings). Sometimes you can choose from a catalog of hardware options (e.g. VM sizes), sometimes you can’t (and your only choice is to use a different cloud provider entirely). The problem we want to solve in this article is to determine which cloud provider (or which VM size) you should choose to run your software on. The goal typically is to get the highest performance at the lowest possible cost, where “performance” could refer to, for example, CPU computational power or disk / network speed.

Hardware and software benchmarks

To determine the performance of the different virtual hardware components, you need to run benchmarks. There are several benchmark variants:

  • Software benchmarks answer the question: “given some specific hardware, how well does software #A perform, or should I choose software #B over #A, or how could I tune #A to maximize its performance?”
  • Hardware benchmarks flip the variables and the constants. They answer the question: “Given a specific software problem that is executed for a limited time period, how well does the software perform on hardware #A, or should I choose hardware #B over #A?”.

Hardware benchmarks produce numbers that you can compare between different hardware configurations. For instance:

  • For CPU:
    • Duration (shorter is better), for problems that have a fixed size (e.g. compressing a specific video)
    • Computations per second (higher is better), for benchmarks where you can configure a fixed duration
  • For SSD/HDD disks (or network bandwidth):
    • Transfer speed / bandwidth (typically measured in MB/sec), and IOPS – higher is better
    • Transfer time (shorter is better), for benchmarks where files of a fixed size are transferred

Benchmarking with (Docker) containers

Benchmark software has existed for a long time, in many shapes and forms. Many of them are installed natively on the machine (e.g. with “apt-get install” on Ubuntu), or are even GUI-based (designed for end-users).

In this article, we look at options to run benchmark tools in (Docker) containers: this allows for testing hardware that powers container-based environments, such as Kubernetes or FaaS. But containers also simplify the setup and reproducibility in bare VMs, where you can simply install a container engine and run an image.

Having your benchmark tools in a container image gives you perfect reproducibility: assuming you always use the same image, you can be sure that the benchmark measures the same thing on Linux distribution #A or #B, today or one year from now, because the image contains the exact same versions of the benchmark tools, compiled with the same compiler/linker flags, against the same base library (e.g. glibc).

Good hardware benchmark practices

Here are a few important considerations when benchmarking hardware performance:

  • Avoid noisy neighbors: the hardware under test should only run the benchmark software. No other software applications should be running on that hardware, because their load on the system would distort the results
  • Ensure full reproducibility of the software under test: when benchmarking on two machines, use the same tool version, compiler version / compilation flags (if you compile the tool from source), configuration parameters, etc., to ensure that the kind of load that the software puts on the hardware is exactly the same
  • Ensure the statistical reliability of your benchmark results:
    • Repeat tests at least 3 times and compare different hardware based on the average value
    • Check for too large standard deviations in the results, because these could be indicators for “noisy neighbors”
    • Choose a benchmark duration that is long enough: the higher you expect the result’s standard deviation to be, the longer you should run the benchmark. For instance, 10 seconds are likely enough for a purely CPU-based test, but when benchmarking storage systems, you should aim towards 30 seconds or more. If possible, also allow for a warm-up period during which the hardware is stressed, but the results are not recorded yet.
  • When you benchmark CPUs, beware of single- vs. multi-core performance! First, you need to figure out to which extent your workloads profit from multiple cores. You could check your code, read the manual (of off-the-shelf apps), or conduct some tests. If you find that your workloads do not profit (much) from multiple cores, make sure to use benchmark tools that can measure single-core performance! Because you would want to choose hardware that has the strongest single core performance, even if that hardware’s multi-core scaling factor is poor.
  • Beware of burstable hardware: sometimes, virtual hardware is designed for “bursty” workloads (which are idle most of the time, e.g. database servers with few users). For instance, cloud providers offer “burstable” VM types that accumulate CPU credits over time, which are then spent when the VM does computational work. When you benchmark, make sure that you have enough of these CPU credits during the benchmark (alternatively, completely exhaust the credits first, before starting the benchmark, to test the “baseline” performance). Besides CPUs, there are also storage systems and network links which can be burstable.
  • If you use off-the-shelf benchmark software:
    • Make sure you properly understand what it is doing and how its configuration options work. Read the benchmark tool documentation, and do sanity checks to identify possible bugs in the software.
    • Ensure the statistical validity of your benchmark results: Run multiple different benchmark tools that measure similar things. Doing so helps put the results in perspective. For instance, if you run two different CPU single-core benchmark tools #X and #Y, and both #X and #Y show an ~20% performance improvement on hardware #B compared to hardware #A, this is good. But if the results of #X improve by 20% and those of #Y decrease by 10% or more, something is off, and you need to look more closely at what these tools really measure, or whether there might be a bug (in either software).
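
The statistical checks from the list above (repeat runs, compare averages, watch the standard deviation) can be sketched as a small shell helper. This is a minimal sketch, assuming you have already collected one numeric score per line; the function name summarize_runs is hypothetical and not part of any benchmark tool.

```shell
# summarize_runs (hypothetical helper): reads one benchmark score per line
# from stdin and prints the number of runs, the mean, the sample standard
# deviation, and the coefficient of variation (stddev / mean, in percent).
summarize_runs() {
  awk '
    # Welford update: numerically stable running mean and squared deviations
    NF { n++; d = $1 - mean; mean += d / n; m2 += d * ($1 - mean) }
    END {
      if (n < 2) { print "need at least 2 runs" > "/dev/stderr"; exit 1 }
      sd = sqrt(m2 / (n - 1))
      printf "runs=%d mean=%.2f stddev=%.2f cv=%.2f%%\n", n, mean, sd, 100 * sd / mean
    }'
}
```

For example, piping the three scores 100, 110 and 90 into summarize_runs prints "runs=3 mean=100.00 stddev=10.00 cv=10.00%". Which coefficient of variation is still acceptable is a judgment call; a value that is clearly larger than what you see on comparable machines hints at noisy neighbors.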

Cost-benefit ratio of benchmarking

The ultimate goal of hardware benchmarking is to save money in the long run, by switching to the “optimal” hardware that you determined with your benchmarks, and thus reducing hardware costs. It’s essentially a form of “rightsizing” of your hardware. However, benchmarking itself is not free: it takes time to investigate existing benchmarks, or run existing benchmark software, or even build your own benchmark software or scripts. The more time you invest, the more precise the results will become, but it’s very difficult to find the “cut-off point” at which you should stop digging deeper.

What you can do, though, is:

  • Evaluate (at the beginning!) how large the savings could potentially be, e.g. if you replaced the existing hardware with the cheapest option you would consider to buy/rent,
  • Extrapolate the savings, e.g. over the time span of one year,
  • Compare this with your salary: how many working hours could you spend on the problem until you would have eaten up the savings?
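
The three steps above boil down to a single division. Here is a minimal sketch of it; the function name break_even_hours and all numbers in the usage example are made up for illustration.

```shell
# break_even_hours MONTHLY_SAVINGS MONTHS HOURLY_COST
# (hypothetical helper) Prints how many working hours you could spend on
# benchmarking before the extrapolated savings are used up.
break_even_hours() {
  awk -v s="$1" -v m="$2" -v h="$3" 'BEGIN {
    if (h <= 0) { print "hourly cost must be positive" > "/dev/stderr"; exit 1 }
    # extrapolated savings divided by the cost of one working hour
    printf "%.1f\n", (s * m) / h
  }'
}
```

With savings of 200 per month, extrapolated over 12 months, and an hourly cost of 80, "break_even_hours 200 12 80" prints 30.0: after about 30 working hours, the benchmarking effort would have consumed the expected savings.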

Of course, the calculation I just described is overly simplistic, because the savings might not just result from saving hardware costs, but from efficiency gains of your (end) users (or your development team) who profit from faster software. For instance, suppose you switched your infrastructure from Intel x86 (64 bit) to ARM64 processors. The hardware costs stay roughly the same, but your software performs 60% faster, increasing user satisfaction and therefore sales of your software. Unfortunately, these kinds of savings are very difficult to estimate in advance.

Investment levels of hardware benchmarking

I’d argue that there are three levels of (time) investment, ranging from low to high effort:

  • Level 1: consuming existing benchmarks
    • Basic idea: You simply research existing benchmarks published by third parties.
    • Advantage: Requires the least amount of effort
    • Disadvantages:
      • You cannot influence the benchmark software, and thus the benchmark workloads might not match the kinds of workloads that you run
      • The benchmarks might be already out of date (e.g. when a cloud provider recently released a new generation of CPU models, but the benchmark still covers the older generation)
      • The benchmarks might not have been executed correctly
  • Level 2: running a suite of representative third party benchmarks
    • Basic idea: You design your own suite of third-party (off-the-shelf) benchmark tools which run workloads that are representative for your actual workloads. You then run this suite and collect the results, rendering charts to simplify the comparison. The executed tests could be purely synthetic tests (e.g. computing prime numbers to measure CPU speed), or be more like real-world scenarios that measure a mixture of CPU, disk access and memory access (e.g. multi-threaded compilation of a specific Linux kernel, or compressing a specific video with some codec).
    • Advantage: You have full control over the workloads and can ensure that the tools are executed correctly
    • Disadvantages:
      • Researching the right tools (and their correct usage) is quite a lot of work (hopefully, this article reduces this effort considerably)
      • If you run many tests (on a wide variety of hardware), you need to build tooling to simplify the collection of the benchmark results (and possibly graphical comparison). In parts 2 and 3 of this series, I provide hints how to simplify this.
  • Level 3: building your own benchmarks
    • Basic idea: You build and run a test suite that runs your own workloads in an isolated fashion, on dedicated test hardware. This avoids the noisy neighbors problem. If applicable, you also build tooling that collects the test results.
    • Advantage: Highest applicability to your circumstances (the benchmark results directly translate to your workloads)
    • Disadvantages:
      • Results might not be as “generalizable”, compared to level 2 benchmarks: once you introduce an entirely new kind of workload, your existing level 3 benchmarks won’t help you decide whether you should switch to different hardware, so you will have to build (and run) another customized benchmark

Which level should you choose? It depends on your budget. Let the cost-benefit ratio (presented above) be your guide.

The remainder of this article examines the levels 1 and 2 in more detail.

Level 1: Consuming existing benchmarks

The basic idea is to ask any search engine of your choice for something like “<cloud provider name> [disk|network|cpu] benchmark”. The results often help you make a quick decision with minimal effort. At the time of writing, I found the following websites helpful:

  • If you are interested in CPU benchmarks (single- and multi-core), check out this great benchmark done by Dimitrios Kechagias, who compares various VM types of many public cloud providers
  • Many cloud providers do not offer official benchmarks, but some do, e.g. Microsoft Azure (see Linux and Windows benchmarks) and Google Cloud (see here). Google and Microsoft report the multi-core performance of CoreMark, which I present in more detail below. On https://azureprice.net/performance you can find a graph that computes “monthly price divided by CoreMark performance” for Azure VMs, which may also help you during your research.
  • vpsbenchmarks.com also offers benchmarks of a (smaller) set of VMs (and other services, e.g. storage) for various cloud providers. It even allows comparing the results interactively.
  • benchANT offers a ranking of database benchmarks (DBaaS)

Level 2: Running representative 3rd-party benchmarks

Let’s see which popular third party off-the-shelf benchmarking tools there are. For each tool I demonstrate:

  • How to install and use it (including compiling from source, if necessary)
  • Useful configuration options, if they exist
  • What the results look like
  • Possible caveats I (or others) found

In this article, I focus on CLI tools (because they are easier to automate than graphical tools), running them in a Linux-based Docker container. Here is an overview of the tools:

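| Tool      | Types of measurements | License / Source   | Typical execution time per run       |
|-----------|-----------------------|--------------------|--------------------------------------|
| CoreMark  | CPU only              | Open Source (repo) | 10-20 seconds                        |
| OpenSSL   | CPU only              | Open Source (repo) | Configurable                         |
| 7-zip     | CPU + memory          | Open Source (repo) | ca. 30 seconds                       |
| Sysbench  | CPU, memory, disk     | Open Source (repo) | 10 seconds by default (configurable) |
| FIO       | Disk                  | Open Source (repo) | Configurable                         |
| Geekbench | CPU, GPU, memory      | Proprietary        | 8-10 minutes                         |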

CoreMark focuses exclusively on CPU performance. You need to download the source and compile it. The generated binary does not have any meaningful CLI arguments, so you need to set those at compilation time. The most relevant parameter is the number of parallel threads.

CoreMark is the tool used by the Azure and Google Cloud benchmarks mentioned above (in level 1).

Example for compiling the program:


git clone https://github.com/eembc/coremark.git
cd coremark
# recommendation: checkout a specific commit, for reproducibility reasons (omitted)

# To build the single-threaded version:
make compile

# To build the multi-threaded version (here: 4 threads, change it if necessary):
make XCFLAGS="-DMULTITHREAD=4 -DUSE_FORK=1" compile

To run the benchmark, call ./coremark.exe

Example result (of running the multi-threaded version):

2K performance run parameters for coremark.
CoreMark Size    : 666
Total ticks      : 33380
Total time (secs): 33.380000
Iterations/Sec   : 47932.893948
Iterations       : 1600000
Compiler version : GCC11.4.0
Compiler flags   : -O2 -DMULTITHREAD=4 -DUSE_FORK=1  -lrt
Parallel Fork : 4
Memory location  : Please put data memory location here
                        (e.g. code in flash, data on heap etc)
seedcrc          : 0xe9f5
[0]crclist       : 0xe714
...
[0]crcmatrix     : 0x1fd7
...
[0]crcstate      : 0x8e3a
...
[0]crcfinal      : 0x65c5
...
Correct operation validated. See README.md for run and reporting rules.
CoreMark 1.0 : 47932.893948 / GCC11.4.0 -O2 -DMULTITHREAD=4 -DUSE_FORK=1  -lrt / Heap / 4:Fork

As you would expect, higher results are better.
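
If you automate your runs, you will want to pull the score out of this output. Here is a minimal sketch, assuming the output format shown above; the function name coremark_score is hypothetical. It only reports a score when CoreMark also printed its “Correct operation validated” line.

```shell
# coremark_score (hypothetical helper): reads CoreMark output on stdin and
# prints the "Iterations/Sec" value, but only if the run was validated.
coremark_score() {
  awk -F':' '
    /^Iterations\/Sec/             { gsub(/[ \t]/, "", $2); score = $2 }
    /^Correct operation validated/ { valid = 1 }
    END {
      # fail if the score is missing or the run was not validated
      if (score == "" || !valid) exit 1
      print score
    }'
}
```

A possible usage is “./coremark.exe | coremark_score”, which for the example result above would print 47932.893948.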

Test reproducibility

Keep in mind that the reproducibility of your result is only given if you used the same compiler version and compilation flags. Thanks to Docker, you can compile the binary in the Dockerfile when building the image, and thus ensure that the binaries are equal (by using the same image). This remark applies to all open source tools that you compile yourself.
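
To illustrate the idea of compiling inside the image, here is a minimal sketch of a shell function that prints a Dockerfile for CoreMark. The function name print_coremark_dockerfile, the base image (ubuntu:22.04) and the package list are my assumptions; the commit to check out is a parameter that you have to choose yourself. Note that an image tag is not immutable, so for stricter reproducibility you may want to reference the base image by its digest.

```shell
# print_coremark_dockerfile COMMIT [THREADS]
# (hypothetical helper) Prints a Dockerfile that builds CoreMark at a fixed
# commit, so that compiler, flags and sources are baked into the image.
print_coremark_dockerfile() {
  local commit="$1" threads="${2:-1}" xcflags=""
  if [ -z "$commit" ]; then
    echo "usage: print_coremark_dockerfile COMMIT [THREADS]" >&2
    return 1
  fi
  # Same multi-threading flags as in the compile example above
  if [ "$threads" -gt 1 ]; then
    xcflags="-DMULTITHREAD=${threads} -DUSE_FORK=1"
  fi
  cat <<EOF
FROM ubuntu:22.04
RUN apt-get update \\
 && apt-get install -y --no-install-recommends build-essential git ca-certificates \\
 && rm -rf /var/lib/apt/lists/*
RUN git clone https://github.com/eembc/coremark.git /coremark
WORKDIR /coremark
RUN git checkout ${commit}
RUN make XCFLAGS="${xcflags}" compile
ENTRYPOINT ["./coremark.exe"]
EOF
}
```

You could redirect the output into a file named Dockerfile, build the image once, and then run that very same image on every machine you want to compare.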

CoreMark Pro

The CoreMark README mentions this: “For a more compute-intensive version of CoreMark that uses larger datasets and execution loops taken from common applications, please check out EEMBC’s CoreMark-PRO benchmark, also on GitHub.”

I tested it, and it works without problems. Although the tool has “Pro” in its name, you do not need a license to use it. Just follow the official readme for usage instructions. The CoreMark Pro Readme mentions the term “contexts”, by which it means the number of parallel threads. Like with CoreMark, higher results are better.

In my tests, the execution time of a CoreMark Pro run is about 1 minute.

The “openssl speed” command (docs) is a benchmark tool included with OpenSSL. It determines the hash speed of various hashing algorithms (e.g. MD5, or SHA256). We can use it for hardware benchmarks by fixing the set of algorithms to a small set (e.g. just one algorithm, because we don’t care about the speed difference of two or more hashing algorithms, but we want to know the speed difference of different CPU models for a particular hashing algorithm).

To use it, just install the OpenSSL library using something like “apt install openssl”, or download and compile the source code.

I highly discourage you from running “openssl speed” without any arguments. This would run a benchmark of all (dozens of) algorithms, and for each algorithm, 6 different buffer sizes are tested. Also, each test runs for 3 or 10 seconds, resulting in at least 20 minutes of total execution time!

Instead, I suggest you run the following command, adjusting the arguments as necessary:
openssl speed -seconds 10 -bytes 1024 -multi 2 sha256

  • -bytes overrides the buffer size – just use a multiple of 2, 1024 worked fine for me
  • -seconds controls the execution time per algorithm and buffer size combination (here we just have one combination, so -seconds actually sets the total benchmark duration)
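  • -multi controls the number of parallel benchmark instances (by default, one is used). As the “Forked child” lines in the example output show, OpenSSL forks a separate process per instance rather than starting threads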
  • as last argument (here: sha256) you can specify a list of one or more space-separated hashing algorithms

When specifying only one hashing algorithm, the total execution time equals the seconds you provided. Here is an example output:

Forked child 0
+DT:sha256:10:1024
Forked child 1
+DT:sha256:10:1024
+R:3780776:sha256:10.000000
+R:3803184:sha256:10.000000
Got: +H:1024 from 0
Got: +F:6:sha256:387151462.40 from 0
Got: +H:1024 from 1
Got: +F:6:sha256:389446041.60 from 1
version: 3.0.2
built on: Wed May 24 17:12:55 2023 UTC
options: bn(64,64)
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -ffile-prefix-map=/build/openssl-Z1YLmC/openssl-3.0.2=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_TLS_SECURITY_LEVEL=2 -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_BUILDING_OPENSSL -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
CPUINFO: OPENSSL_ia32cap=0xfeda32035f8bffff:0x27a9
sha256          776597.50k

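The relevant value is shown in the last line. It is the hashing throughput in thousands of bytes processed per second (hence the “k” suffix), summed over all parallel processes (here: 387151462.40 + 389446041.60 bytes per second). Higher is better.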
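
To collect this value in a script, you can filter the summary line. Below is a minimal sketch, assuming the output format shown above; the function name openssl_speed_kbytes is hypothetical.

```shell
# openssl_speed_kbytes ALGO (hypothetical helper): reads "openssl speed"
# output on stdin and prints the value of the summary line for ALGO,
# with the trailing "k" removed.
openssl_speed_kbytes() {
  awk -v algo="$1" '
    # summary line: algorithm name first, value with "k" suffix last
    $1 == algo && $NF ~ /^[0-9.]+k$/ { v = $NF; sub(/k$/, "", v); result = v }
    END { if (result == "") exit 1; print result }'
}
```

A possible usage is “openssl speed -seconds 10 -bytes 1024 -multi 2 sha256 2>&1 | openssl_speed_kbytes sha256”, which for the example output above would print 776597.50.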

7-zip comes with a benchmark mode. You can run it with the “7z b” command. As the docs explain, this only runs an LZMA-based compression+decompression test, which runs for about 30 seconds.

I recommend that you download its source code and compile it yourself (because the Linux distro repository might contain a very outdated version), as follows:

git clone https://github.com/mcmilk/7-Zip
cd 7-Zip
# recommendation: checkout a specific commit, for reproducibility reasons (omitted)
cd CPP/7zip/Bundles/Alone2
CFLAGS="-O3 -march=native -Wno-error" make -j 2 -f makefile.gcc
# binary is now available in CPP/7zip/Bundles/Alone2/_o/7zz

Alternatively, others have also built Docker images, e.g. this one.

Here is an example result:

7-Zip (z) 23.01 (x64) : Copyright (c) 1999-2023 Igor Pavlov : 2023-06-20
 64-bit locale=C.UTF-8 Threads:2 OPEN_MAX:1048576

Compiler: 11.4.0 GCC 11.4.0: SSE2
Linux : 5.4.0-104-generic : #118-Ubuntu SMP Wed Mar 2 19:02:41 UTC 2022 : x86_64 : Microsoft Hv : Hv#1 : 10.0.19041.3.0.3693
PageSize:4KB THP:madvise hwcap:2
Intel(R) Xeon(R) CPU E3-1230 v3 @ 3.30GHz (306C3)

1T CPU Freq (MHz):  3244  3259  3310  3532  3594  3558  3535
1T CPU Freq (MHz): 101% 3567    99% 3562

RAM size:    1918 MB,  # CPU hardware threads:   2
RAM usage:    444 MB,  # Benchmark threads:      2

                       Compressing  |                  Decompressing
Dict     Speed Usage    R/U Rating  |      Speed Usage    R/U Rating
         KiB/s     %   MIPS   MIPS  |      KiB/s     %   MIPS   MIPS

22:      10803   177   5922  10509  |      88480   199   3789   7554
23:      10772   185   5924  10976  |      86621   200   3757   7498
24:      10288   188   5895  11062  |      86257   200   3795   7573
25:      10222   191   6115  11672  |      79112   198   3555   7041
----------------------------------  | ------------------------------
Avr:     10521   185   5964  11055  |      85118   199   3724   7417
Tot:             192   4844   9236

The docs explain many details regarding what the benchmark does. In essence:

  • You can either note down the compression and decompression speed separately, by choosing the value of column 4 (here: 11055 for compression and 7417 for decompression)
  • Or you pick the value of the last column of the last line, which is simply computed as “(<compression-result> + <decompression-result>) / 2” (here: (11055 + 7417) / 2 = 9236).
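
Both variants can be extracted from the output with a few lines of awk. This is a minimal sketch, assuming the column layout of the default “b” benchmark shown above; the function name sevenzip_ratings is hypothetical.

```shell
# sevenzip_ratings (hypothetical helper): reads the output of the default
# 7-zip benchmark on stdin and prints the average compression rating,
# the average decompression rating and the total rating (all in MIPS).
sevenzip_ratings() {
  awk '
    # "Avr:" line: field 5 = compression rating, field 10 = decompression rating
    $1 == "Avr:" { comp = $5; decomp = $10 }
    # "Tot:" line: field 4 = total rating
    $1 == "Tot:" { total = $4 }
    END {
      if (comp == "" || decomp == "" || total == "") exit 1
      printf "compress=%s decompress=%s total=%s\n", comp, decomp, total
    }'
}
```

For the example result above, this prints “compress=11055 decompress=7417 total=9236”.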

By default, the benchmark uses all available CPU cores. If you want to limit the benchmark to a single core, add the “-mmt1” argument. The execution time does not change significantly when running only on a single core.

Alternatively, you can also run all tests (via “-mm=*“). This takes about 2 minutes and produces a result like this:

7-Zip (z) 23.01 (x64) : Copyright (c) 1999-2023 Igor Pavlov : 2023-06-20
 64-bit locale=C.UTF-8 Threads:2 OPEN_MAX:1048576

 m=*
Compiler: 11.4.0 GCC 11.4.0: SSE2
Linux : 5.4.0-104-generic : #118-Ubuntu SMP Wed Mar 2 19:02:41 UTC 2022 : x86_64 : Microsoft Hv : Hv#1 : 10.0.19041.3.0.3693
PageSize:4KB THP:madvise hwcap:2
Intel(R) Xeon(R) CPU E3-1230 v3 @ 3.30GHz (306C3)

1T CPU Freq (MHz):  3451  3432  3387  3460  3510  3556  3559
1T CPU Freq (MHz): 101% 3549    99% 3447

RAM size:    1918 MB,  # CPU hardware threads:   2
RAM usage:    455 MB,  # Benchmark threads:      2


Method           Speed Usage    R/U Rating   E/U Effec
                 KiB/s     %   MIPS   MIPS     %     %

CPU                      200   3419   6829
CPU                      200   3481   6962
CPU                      200   3499   6987   100   200

LZMA:x1          38762   199   7190  14289   206   409
                 81452   200   3263   6514    93   186
LZMA:x3          15193   199   4682   9335   134   267
                 86013   200   3613   7216   103   207
LZMA:x5:mt1       7871   199   4944   9834   142   281
                 85285   200   3600   7191   103   206
LZMA:x5:mt2      10133   199   6366  12659   182   362
                 83556   199   3535   7045   101   202
Deflate:x1       76502   198   4895   9714   140   278
                293171   200   4565   9108   131   261
Deflate:x5       26575   199   5139  10232   147   293
                302179   200   4695   9380   134   268
Deflate:x7       11094   199   6174  12292   177   352
                311997   200   4842   9681   139   277
Deflate64:x5     22978   198   5004   9930   143   284
                305222   200   4775   9543   137   273
BZip2:x1         12450   200   3764   7522   108   215
                 87463   200   4749   9480   136   271
BZip2:x5         10873   199   4558   9074   130   260
                 64629   199   6367  12683   182   363
BZip2:x5:mt2      9907   197   4188   8269   120   237
                 60147   199   5943  11803   170   338
BZip2:x7          3402   200   4415   8814   126   252
                 61458   198   6096  12050   174   345
PPMD:x1          10804   198   5639  11175   161   320
                  8288   199   4909   9760   141   279
PPMD:x5           7221   199   6143  12238   176   350
                  5944   199   5587  11140   160   319
Swap4         58814220   200   1886   3764    54   108
              59638149   200   1909   3817    55   109
Delta:4        4407016   199   6790  13538   194   388
               3105539   200   6375  12720   182   364
BCJ            5481816   199   5633  11227   161   321
               4324010   199   4439   8856   127   253
ARM64          7472124   199   3839   7651   110   219
               7980940   200   4095   8172   117   234
AES256CBC:1     295710   198   3671   7267   105   208
                317353   200   3900   7799   112   223
AES256CBC:2    1072742   199   4422   8788   127   252
               7375528   200   3776   7553   108   216
AES256CBC:3

CRC32:8        3648741   200   2479   4948    71   142
CRC32:32
CRC32:64
CRC64          2144323   200   2198   4392    63   126
SHA256:1        451450   199   4624   9210   132   264
SHA256:2
SHA1:1         1140771   200   5352  10678   153   306
SHA1:2
BLAKE2sp        522395   198   5793  11493   166   329

CPU                      200   3464   6917
------------------------------------------------------
Tot:                     199   4682   9325   134   267

I did not find that benchmark mode particularly useful.

Sysbench is an open-source suite of benchmark tools for measuring CPU, RAM or disk performance. There is a maintained third-party Docker image you can use in your container-based environment.

All tests run for 10 seconds by default, which can be changed via “--time=<seconds>“.

Let’s take a look at each of the benchmark modes, and some of their caveats:

1. CPU

Command example: sysbench cpu --cpu-max-prime=20000 --threads=2 run

Example output:

sysbench 1.0.20 (using bundled LuaJIT 2.1.0-beta2)

Running the test with following options:
Number of threads: 2
Initializing random number generator from current time


Prime numbers limit: 20000

Initializing worker threads...

Threads started!

CPU speed:
    events per second:   677.70

General statistics:
    total time:                          10.0014s
    total number of events:              6779

Latency (ms):
         min:                                    2.70
         avg:                                    2.95
         max:                                   12.97
         95th percentile:                        3.25
         sum:                                19997.56

Threads fairness:
    events (avg/stddev):           3389.5000/3.50
    execution time (avg/stddev):   9.9988/0.00

The CPU benchmark computes prime numbers up to a certain limit, which is configured via --cpu-max-prime (20000 in the example above). The default is 10000 in version 1.0.20, but it might change over time, so it is safer to set the value explicitly. Computing a prime number just takes a few milliseconds. As explained here, each finished prime number computation is counted as an “event”, and thus the events per second are an indication of computational performance. Higher is better. Keep in mind that benchmark results that use a different --cpu-max-prime value are not comparable!

--threads configures how many parallel threads should be active (1 by default). Increasing it should also increase your events per second score.
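
To see how well the score actually scales with --threads, you can extract the events per second of a single-threaded and a multi-threaded run and relate them. Here is a minimal sketch, assuming the output format shown above; both function names are hypothetical, and the numbers in the usage example are made up.

```shell
# sysbench_events_per_sec (hypothetical helper): reads "sysbench cpu" output
# on stdin and prints the "events per second" value.
sysbench_events_per_sec() {
  awk '/events per second:/ { v = $NF } END { if (v == "") exit 1; print v }'
}

# scaling_efficiency SINGLE_SCORE MULTI_SCORE THREADS (hypothetical helper):
# prints the multi-threaded score as a percentage of perfect linear scaling.
scaling_efficiency() {
  awk -v s="$1" -v m="$2" -v t="$3" 'BEGIN {
    if (s <= 0 || t <= 0) exit 1
    # perfect scaling would give m = s * t, i.e. 100 percent
    printf "%.1f\n", 100 * m / (s * t)
  }'
}
```

For instance, “scaling_efficiency 100 180 2” prints 90.0: two threads deliver 90% of what perfect linear scaling would give.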

Lack of comparability

Unfortunately, the results of the CPU benchmark are sometimes not comparable between different CPU models, as demonstrated in this ticket, where an ARM-based Mac appears to be 18000x faster than an Intel i7 4770HQ (which, naturally, cannot possibly be true).

2. Disk speed

Command example: “sysbench fileio --file-test-mode=rndrw prepare”, followed by the same command, but replacing “prepare” with “run” (keeping all other arguments equal). The “prepare” command creates a number of test files (128 by default) in your current working directory.

Example output:

sysbench 1.0.20 (using bundled LuaJIT 2.1.0-beta2)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time


Extra file open flags: (none)
128 files, 16MiB each
2GiB total file size
Block size 16KiB
Periodic FSYNC enabled, calling fsync() each 100 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing sequential write (creation) test
Initializing worker threads...

Threads started!


File operations:
    reads/s:                      0.00
    writes/s:                     3315.00
    fsyncs/s:                     4257.04

Throughput:
    read, MiB/s:                  0.00
    written, MiB/s:               51.80

General statistics:
    total time:                          10.0438s
    total number of events:              75807

Latency (ms):
         min:                                    0.01
         avg:                                    0.26
         max:                                   25.22
         95th percentile:                        0.36
         sum:                                19963.31

Threads fairness:
    events (avg/stddev):           37903.5000/487.50
    execution time (avg/stddev):   9.9817/0.00

If you run “sysbench fileio help”, you see the many configuration options (including the default values in square brackets):

fileio options:
  --file-num=N                  number of files to create [128]
  --file-block-size=N           block size to use in all IO operations [16384]
  --file-total-size=SIZE        total size of files to create [2G]
  --file-test-mode=STRING       test mode {seqwr, seqrewr, seqrd, rndrd, rndwr, rndrw}
  --file-io-mode=STRING         file operations mode {sync,async,mmap} [sync]
  --file-async-backlog=N        number of asynchronous operatons to queue per thread [128]
  --file-extra-flags=[LIST,...] list of additional flags to use to open files {sync,dsync,direct} []
  --file-fsync-freq=N           do fsync() after this number of requests (0 - don't use fsync()) [100]
  --file-fsync-all[=on|off]     do fsync() after each write operation [off]
  --file-fsync-end[=on|off]     do fsync() at the end of test [on]
  --file-fsync-mode=STRING      which method to use for synchronization {fsync, fdatasync} [fsync]
  --file-merged-requests=N      merge at most this number of IO requests if possible (0 - don't merge) [0]
  --file-rw-ratio=N             reads/writes ratio for combined test [1.5]

The --file-test-mode option is the most important one, because it controls which I/O pattern is executed: “seq…” stands for sequentially reading/writing to files, whereas “rnd…” stands for randomly reading/writing to files. rndrw combines both randomly reading and writing. Note: seqrewr does not stand for “sequential read+write”, but for “sequential rewrite” (presumably writing to already-existing files)!

The results will be vastly different if you change any of the arguments. For instance, on the same test machine, I got these results:

  • “sysbench fileio --file-test-mode=seqwr --file-num=1 run” yields 183 MB/s throughput
  • “sysbench fileio --file-test-mode=seqwr --file-num=128 run” yields 30 MB/s throughput

I think the difference can be explained by the varying number of fsync operations (more files require more fsync ops, and fsync ops are slow). In any case: be wary of the CLI arguments that you use: only compare results that use the exact same values.

There is also a “cleanup” command you can run after the “run” command, which cleans up the test files created in prepare/run. However, you might not need the “cleanup” command when running Sysbench in a container, which you discard after a benchmark anyway. “cleanup” might still be worth using if you benchmark the performance of remote storage.
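
Since prepare, run and cleanup must all receive the same arguments, it helps to wrap them in one function. This is a minimal sketch; the function name sysbench_fileio_cycle and the SYSBENCH environment variable (which lets you point to a different binary) are my own inventions, not Sysbench features.

```shell
# sysbench_fileio_cycle ARGS... (hypothetical helper): runs the prepare, run
# and cleanup commands of "sysbench fileio" with exactly the same arguments
# and prints only the output of the "run" command.
sysbench_fileio_cycle() {
  local bin="${SYSBENCH:-sysbench}" status
  "$bin" fileio "$@" prepare >/dev/null || return 1
  "$bin" fileio "$@" run
  status=$?
  # always clean up the test files, even if the run failed
  "$bin" fileio "$@" cleanup >/dev/null
  return "$status"
}
```

A possible usage is “sysbench_fileio_cycle --file-test-mode=seqwr --file-num=1”, executed in a directory on the storage you want to test.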

3. Memory / RAM

Using “sysbench memory --memory-oper=write --memory-access-mode=seq run” you can benchmark the raw memory access speed.

Example output:

sysbench 1.0.20 (using bundled LuaJIT 2.1.0-beta2)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time


Running memory speed test with the following options:
  block size: 1KiB
  total size: 102400MiB
  operation: write
  scope: global

Initializing worker threads...

Threads started!

Total operations: 45658628 (4565125.60 per second)

44588.50 MiB transferred (4458.13 MiB/sec)


General statistics:
    total time:                          10.0001s
    total number of events:              45658628

Latency (ms):
         min:                                    0.00
         avg:                                    0.00
         max:                                    0.42
         95th percentile:                        0.00
         sum:                                 5247.42

Threads fairness:
    events (avg/stddev):           45658628.0000/0.00
    execution time (avg/stddev):   5.2474/0.00

“sysbench memory help” shows the notable configuration options and their default values:

memory options:
  --memory-block-size=SIZE    size of memory block for test [1K]
  --memory-total-size=SIZE    total size of data to transfer [100G]
  --memory-scope=STRING       memory access scope {global,local} [global]
  --memory-hugetlb[=on|off]   allocate memory from HugeTLB pool [off]
  --memory-oper=STRING        type of memory operations {read, write, none} [write]
  --memory-access-mode=STRING memory access mode {seq,rnd} [seq]

Most notably, you can configure whether you want to benchmark reading or writing, and switch between sequential (seq) and random (rnd) memory access mode.
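If you want to cover all four read/write and seq/rnd combinations, a small script can generate the commands and extract the throughput from the output. The following Python snippet is a sketch under assumptions: the helper names memory_cmd and parse_mib_per_sec are made up for illustration, and the regular expression assumes the “MiB transferred (… MiB/sec)” line format shown in the example output above (Sysbench 1.0.20).

```python
import itertools
import re


def memory_cmd(oper: str, mode: str) -> list:
    """Build the argument list for one 'sysbench memory' run."""
    return [
        "sysbench",
        "memory",
        f"--memory-oper={oper}",
        f"--memory-access-mode={mode}",
        "run",
    ]


def parse_mib_per_sec(output: str) -> float:
    """Extract the throughput from the 'MiB transferred (... MiB/sec)' line."""
    match = re.search(r"\(([\d.]+) MiB/sec\)", output)
    if match is None:
        raise ValueError("no 'MiB/sec' value found in the sysbench output")
    return float(match.group(1))


# Line copied from the example output above.
sample = "44588.50 MiB transferred (4458.13 MiB/sec)"
print(parse_mib_per_sec(sample))

# All four combinations of operation and access mode.
for oper, mode in itertools.product(["read", "write"], ["seq", "rnd"]):
    print(" ".join(memory_cmd(oper, mode)))
```

To actually run a command, you could pass the list to subprocess.run(cmd, capture_output=True, text=True, check=True) and feed its stdout into parse_mib_per_sec.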

fio is an advanced disk benchmark tool. It is available in the repositories of many Linux distributions, but you can also download its source code and compile it, as documented in the project readme. I recommend that you compile the code yourself, because the fio version available in the Linux repositories is likely out of date.

Fio is configured via a “job file” that uses an INI-style format. Here is an example:

[global]
rw=randread
ioengine=libaio
iodepth=64
size=1g
direct=1
buffered=0
startdelay=20

ramp_time=5
runtime=60
group_reporting=1
numjobs=1
time_based
disk_util=0
clat_percentiles=0
disable_lat=1
disable_clat=1
disable_slat=1
filename=fiofile

[test]
name=test
bs=64k
stonewall

The example job file above configures a random read test: fio delays the start of the job by 20 seconds (startdelay), warms up for 5 seconds without recording results (ramp_time), and then measures for 60 seconds (runtime), using 1 job (numjobs), an I/O queue depth of 64 and a block size of 64 KiB. The docs explain what the individual parameters mean (including alternative read/write patterns other than randread).
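To see how the timing options of the job file add up, you can read it with Python's configparser module. This is only a sketch: it assumes that the job file sticks to plain INI syntax that configparser understands (allow_no_value=True is needed for flag-style options such as time_based and stonewall), and that all time values are plain seconds without unit suffixes. The JOB_FILE string is a shortened copy of the example above.

```python
import configparser

# Shortened copy of the example job file (only the options relevant here).
JOB_FILE = """\
[global]
rw=randread
iodepth=64
startdelay=20

ramp_time=5
runtime=60
numjobs=1
time_based

[test]
bs=64k
stonewall
"""

# allow_no_value=True accepts options without "=value", e.g. time_based.
parser = configparser.ConfigParser(allow_no_value=True)
parser.read_string(JOB_FILE)

# Options from [global] apply to every job; the job section overrides them.
job = {**parser["global"], **parser["test"]}

# Assumption: all three values are plain seconds.
start_delay = int(job["startdelay"])
ramp_time = int(job["ramp_time"])
runtime = int(job["runtime"])

total_seconds = start_delay + ramp_time + runtime
print(f"measured for {runtime}s, total wall-clock time about {total_seconds}s")
```

With the values from the example, the job occupies roughly 85 seconds of wall-clock time, of which only the final 60 seconds contribute to the reported numbers.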

The resulting output of running “fio <path-to-jobfile>” looks like this:

test: (g=0): rw=randread, bs=(R) 64.0KiB-64.0KiB, (W) 64.0KiB-64.0KiB, (T) 64.0KiB-64.0KiB, ioengine=libaio, iodepth=64
fio-3.35
Starting 1 process
test: Laying out IO file (1 file / 1024MiB)

test: (groupid=0, jobs=1): err= 0: pid=3601: Wed Nov 29 10:38:22 2023
  read: IOPS=2959, BW=185MiB/s (194MB/s)(10.8GiB/60001msec)
   bw (  KiB/s): min=165376, max=209955, per=99.99%, avg=189472.50, stdev=6106.75, samples=120
   iops        : min= 2584, max= 3280, avg=2960.44, stdev=95.43, samples=120
  cpu          : usr=3.06%, sys=11.09%, ctx=177668, majf=0, minf=36
  IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=177584,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=185MiB/s (194MB/s), 185MiB/s-185MiB/s (194MB/s-194MB/s), io=10.8GiB (11.6GB), run=60001-60001msec

The results indicate an average throughput (bw=bandwidth) of 189472.50 KiB/s and 2960.44 IOPS (higher is better).
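The two averages are consistent with each other: multiplying the average IOPS by the block size of 64 KiB should give approximately the average bandwidth. The following Python snippet is a sketch that extracts both averages from the human-readable output and checks this relation. The helper parse_avg is a made-up name, and the regular expressions assume the line format of fio 3.35 as shown above; other fio versions may format these lines differently.

```python
import re

# The two statistics lines copied from the example output above.
FIO_OUTPUT = """\
   bw (  KiB/s): min=165376, max=209955, per=99.99%, avg=189472.50, stdev=6106.75, samples=120
   iops        : min= 2584, max= 3280, avg=2960.44, stdev=95.43, samples=120
"""

BLOCK_SIZE_KIB = 64  # bs=64k in the job file


def parse_avg(output: str, label_pattern: str) -> float:
    """Return the 'avg=' value from the line that matches label_pattern."""
    match = re.search(label_pattern + r".*?avg=\s*([\d.]+)", output)
    if match is None:
        raise ValueError(f"no line matching {label_pattern!r} found")
    return float(match.group(1))


avg_bw_kib = parse_avg(FIO_OUTPUT, r"bw \(\s*KiB/s\)")
avg_iops = parse_avg(FIO_OUTPUT, r"iops\s*:")

# Bandwidth derived from IOPS and block size, for comparison.
derived_bw_kib = avg_iops * BLOCK_SIZE_KIB

print(f"average bandwidth: {avg_bw_kib / 1024:.1f} MiB/s")
print(f"IOPS x block size: {derived_bw_kib / 1024:.1f} MiB/s")
```

Both values come out at about 185 MiB/s, which matches the “BW=185MiB/s” summary line. For automation, fio can also write machine-readable results via its --output-format=json option, which avoids regular expressions on the text output.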

Geekbench is a proprietary benchmark suite geared towards end-users (e.g. gamers), who want to compare their machine’s performance with other end-users. Geekbench can benchmark CPU, GPU and several real-world scenarios that stress the CPU, GPU, memory and disk to varying extents.

The CLI version runs the following tests (every test is run twice, in a single-core and a multi-core variant):

  • File Compression
  • Navigation
  • HTML5 Browser
  • PDF Renderer
  • Photo Library
  • Clang
  • Text Processing
  • Asset Compression
  • Object Detection
  • Background Blur
  • Horizon Detection
  • Object Remover
  • HDR
  • Photo Filter
  • Ray Tracer
  • Structure from Motion

The tests are explained in more detail here.

Because Geekbench is proprietary (the source code is not available), you have to download binaries for the correct CPU platform. By default, Intel/AMD 32/64 bit are supported, and there is a preview version for other platforms like ARM64, but it is known to be unstable, sometimes computing incorrect benchmark scores.

There is a free version (as well as a free unofficial AMD64-only Docker Image) and a commercial version:

  • The free version requires internet access, because the benchmark results are not printed to the console, but they are only uploaded to https://browser.geekbench.com/ where you can look at the results (example result). The console output of the Geekbench CLI only prints the URL at which your uploaded results are made available, or an error if uploading failed. Some automation can be done to “extract” the scores from the result web page, though, which is e.g. done by “yabs”, see here.
  • The commercial version can additionally publish the result to various local file formats, which makes it suitable for automation.

In summary, while Geekbench runs many useful tests, it has two main caveats:

  1. unless you are willing to pay $99 for a commercial license, integrating Geekbench into your own test suite (that includes other benchmark tools) will require custom result parsing code of the Geekbench HTML report
  2. Geekbench’s ARM64 variant is not yet stable, so comparing the results of ARM-based CPUs with Intel/AMD-based ones might yield incorrect results

Honorable mentions of tools I did not test myself:

Conclusion

Before you engage with benchmarking, always consider the cost-benefit ratio first! Only spend as much time on benchmarks as the estimated (long-term) payoff of your efforts justifies.

If you think it is worth the effort to do the hardware benchmarks yourself (→ level 2 or 3), consider the tips from the “Good hardware benchmark practices” section. While you could build a separate Docker image for each tool (or use existing images), I instead recommend building a single customized Docker image that contains all tools. Part 2 will go into detail on how to achieve this with the Phoronix test suite.
