This article discusses virtual cloud hardware benchmarks, which help you choose the best performing virtual cloud-based CPU/disk/memory hardware for your needs. I go into the cost-benefit ratio of benchmarking, and provide many good practices for hardware benchmarking, such as ensuring proper reproducibility. Finally, I explore 6 off-the-shelf benchmark tools that measure CPU, memory or disk performance: CoreMark, OpenSSL, 7-zip, Sysbench, FIO and Geekbench.
Benchmark series
This article is part of a multi-part series about benchmarking virtual cloud hardware:
- Part 1: Benchmarking virtual cloud hardware using 6 great tools (this article)
- Part 2: Hardware benchmarks with Phoronix Test Suite
- Part 3: Automated hardware benchmarks in Kubernetes
You might also find the related article Performance benchmark of over 30 Azure VMs in AKS interesting.
Introduction
Cloud providers offer a plethora of PaaS and IaaS services that run on some kind of virtualized hardware. This hardware is often described only in nebulous terms (e.g. “approximately equivalent to Intel Xeon XYZ”), and sometimes it is not described at all (e.g. in the case of most FaaS offerings). Sometimes you can choose from a catalog of hardware options (e.g. VM sizes), sometimes you can’t (and your only choice is to use a different cloud provider entirely). The problem we want to solve in this article is to determine which cloud provider (or which VM size) you should choose to run your software on. The goal is typically to get the highest performance at the lowest possible cost, where “performance” could refer to, e.g., CPU computational power or disk/network speed.
Hardware and software benchmarks
To determine the performance of the different virtual hardware components, you need to run benchmarks. There are several benchmark variants:
- Software benchmarks answer the question: “given some specific hardware, how well does software #A perform, or should I choose software #B over #A, or how could I tune #A to maximize its performance?”
- Hardware benchmarks flip the variables and the constants. They answer the question: “Given a specific software problem that is executed for a limited time period, how well does the software perform on hardware #A, or should I choose hardware #B over #A?”.
Hardware benchmarks produce numbers that you can compare between different hardware configurations. For instance:
- For CPU:
- Duration (shorter is better), for problems that have a fixed size (e.g. compressing a specific video)
- Computations per second (higher is better), for benchmarks where you can configure a fixed duration
- For SSD/HDD disks (or network bandwidth):
- Transfer speed / bandwidth (typically measured in MB/sec), and IOPS – higher is better
- Transfer time (shorter is better), for benchmarks where files of a fixed size are transferred
Benchmarking with (Docker) containers
Benchmark software has existed for a long time, in many shapes and forms. Many of them are installed natively on the machine (e.g. with “apt-get install” on Ubuntu), or are even GUI-based (designed for end-users).
In this article, we look at options to run benchmark tools in (Docker) containers: this allows you to test the hardware that powers container-based environments, such as Kubernetes or FaaS. Containers also simplify setup and reproducibility on bare VMs, where you can simply install a container engine and run an image.
Having your benchmark tools in a container image gives you perfect reproducibility: assuming you always use the same image, you can be sure that the benchmark measures the same thing on Linux distribution #A or #B, today or one year from now, because the image contains the exact same versions of the benchmark tools, compiled with the same compiler/linker flags, against the same base library (e.g. glibc).
Good hardware benchmark practices
Here are a few important considerations when benchmarking hardware performance:
- Avoid noisy neighbors: the hardware under test should only run the benchmark software. No other software applications should be running on that hardware, because their load on the system would distort the results
- Ensure full reproducibility of the software under test: when benchmarking on two machines, use the same tool version, compiler version / compilation flags (if you compile the tool from source), configuration parameters, etc., to ensure that the kind of load that the software puts on the hardware is exactly the same
- Ensure the statistical reliability of your benchmark results (see the sketch after this list for a simple way to automate this):
- Repeat tests at least 3 times and compare different hardware based on the average value
- Check for unusually large standard deviations in the results, because these could be indicators of “noisy neighbors”
- Choose a benchmark duration that is long enough: the higher you expect the results’ standard deviation to be, the longer you should run the benchmark. For instance, 10 seconds is likely enough for a purely CPU-based test, but when benchmarking storage systems, you should aim for 30 seconds or more. If possible, also allow for a warm-up period during which the hardware is stressed, but the results are not recorded yet.
- When you benchmark CPUs, beware of single- vs. multi-core performance! First, figure out to what extent your workloads benefit from multiple cores. You could check your code, read the manual (for off-the-shelf apps), or conduct some tests. If you find that your workloads do not benefit (much) from multiple cores, make sure to use benchmark tools that can measure single-core performance, because you will want to choose the hardware with the strongest single-core performance, even if that hardware’s multi-core scaling factor is poor.
- Beware of burstable hardware: sometimes, virtual hardware is designed for “bursty” workloads (which are idle most of the time, e.g. database servers with few users). For instance, cloud providers offer “burstable” VM types that accumulate CPU credits over time, which are then spent when the VM does computational work. When you benchmark, make sure that you have enough of these CPU credits during the benchmark (alternatively, completely exhaust the credits first, before starting the benchmark, to test the “baseline” performance). Besides CPUs, there are also storage systems and network links which can be burstable.
- If you use off-the-shelf benchmark software:
- Make sure you properly understand what it is doing and how its configuration options work. Read the benchmark tool documentation, and do sanity checks to identify possible bugs in the software.
- Ensure the statistical validity of your benchmark results: run multiple different benchmark tools that measure similar things. Doing so helps put the results in perspective. For instance, if you run two different CPU single-core benchmark tools #X and #Y, and both #X and #Y show an ~20% performance improvement on hardware #B compared to hardware #A, this is a consistent signal. But if the results of #X improve by 20% and those of #Y decrease by 10% or more, something is off, and you need to look more closely at what these tools really measure, or whether there might be a bug (in either tool).
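As a minimal sketch of the repetition advice above (assumptions: sysbench is installed, its “events per second” line is the value you care about, and awk is available; adapt the command and pattern to your own tool):

runs=3
results=()
for i in $(seq 1 "$runs"); do
  # Run the benchmark and extract the single number we want to aggregate
  value=$(sysbench cpu --cpu-max-prime=20000 run | awk '/events per second/ {print $4}')
  results+=("$value")
  echo "Run $i: $value events/sec"
done
# Compute mean and (population) standard deviation across the runs
printf '%s\n' "${results[@]}" | awk '{sum += $1; sumsq += $1 * $1}
  END {mean = sum / NR; print "mean:", mean, "stddev:", sqrt(sumsq / NR - mean * mean)}'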
Cost-benefit ratio of benchmarking
The ultimate goal of hardware benchmarking is to save money in the long run, by switching to the “optimal” hardware that you determined with your benchmarks, and thus reducing hardware costs. It’s essentially a form of “rightsizing” of your hardware. However, benchmarking itself is not free: it takes time to investigate existing benchmarks, or run existing benchmark software, or even build your own benchmark software or scripts. The more time you invest, the more precise the results will become, but it’s very difficult to find the “cut-off point” at which you should stop digging deeper.
What you can do, though, is:
- Evaluate (at the beginning!) how large the savings could potentially be, e.g. if you replaced the existing hardware with the cheapest option you would consider buying/renting,
- Extrapolate the savings, e.g. over the time span of one year,
- Compare this with your salary: how many working hours could you spend on the problem until you would have eaten up the savings?
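To give a purely illustrative example (the numbers are made up): if the cheapest acceptable hardware would save you $300 per month, that is $3,600 per year; at a fully loaded engineering cost of $90 per hour, the savings are eaten up after 40 hours (roughly one working week) of benchmarking effort.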
Of course, the calculation I just described is overly simplistic, because the savings might not just result from saving hardware costs, but from efficiency gains of your (end) users (or your development team) who profit from faster software. For instance, suppose you switched your infrastructure from Intel x86 (64 bit) to ARM64 processors. The hardware costs stay roughly the same, but your software performs 60% faster, increasing user satisfaction and therefore sales of your software. Unfortunately, these kinds of savings are very difficult to estimate in advance.
Investment levels of hardware benchmarking
I’d argue that there are three levels of (time) investment, ranging from low to high effort:
| | Level 1: consuming existing benchmarks | Level 2: running a suite of representative third-party benchmarks | Level 3: building your own benchmarks |
|---|---|---|---|
| Basic idea | You simply research existing benchmarks published by third parties. | You design your own suite of third-party (off-the-shelf) benchmark tools which run workloads that are representative of your actual workloads. You then run this suite and collect the results, rendering charts to simplify the comparison. The executed tests could be purely synthetic (e.g. computing prime numbers to measure CPU speed), or be more like real-world scenarios that measure a mixture of CPU, disk access and memory access (e.g. multi-threaded compilation of a specific Linux kernel, or compressing a specific video with some codec). | You build and run a test suite that runs your own workloads in an isolated fashion, on dedicated test hardware. This avoids the noisy neighbors problem. If applicable, you also build tooling that collects the test results. |
| Advantages | Requires the least amount of effort | You have full control over the workloads and can ensure that the tools are executed correctly | Highest applicability to your circumstances (the benchmark results directly translate to your workloads) |
| Disadvantages | – You cannot influence the benchmark software, and thus the benchmark workloads might not match the kinds of workloads that you run – The benchmarks might already be out of date (e.g. when a cloud provider recently released a new generation of CPU models, but the benchmark still covers the older generation) – The benchmarks might not have been executed correctly | – Researching the right tools (and their correct usage) is quite a lot of work (hopefully, this article reduces this effort considerably) – If you run many tests (on a wide variety of hardware), you need to build tooling to simplify the collection of the benchmark results (and possibly graphical comparison). In parts 2 and 3 of this series, I provide hints on how to simplify this. | – Results might not be as “generalizable” as level 2 benchmarks: once you introduce an entirely new kind of workload, your existing level 3 benchmarks won’t help you decide whether you should switch to different hardware, so you will have to build (and run) another customized benchmark |
Which level should you choose? It depends on your budget. Let the cost-benefit ratio (presented above) be your guide.
The remainder of this article examines the levels 1 and 2 in more detail.
Level 1: Consuming existing benchmarks
The basic idea is to ask any search engine of your choice for something like “<cloud provider name> [disk|network|cpu] benchmark”. The results often help you make a quick decision with minimal effort. At the time of writing, I found the following websites helpful:
- If you are interested in CPU benchmarks (single- and multi-core), check out this great benchmark done by Dimitrios Kechagias, who compares various VM types of many public cloud providers
- Many cloud providers do not offer official benchmarks, but some do, e.g. Microsoft Azure (see Linux and Windows benchmarks) and Google Cloud (see here). Google and Microsoft report the multi-core performance of CoreMark, which I present in more detail below. On https://azureprice.net/performance you can find a graph that computes “monthly price divided by CoreMark performance” for Azure VMs, which may also help you during your research.
- vpsbenchmarks.com also offers benchmarks of a (smaller) set of VMs (and other services, e.g. storage) for various cloud providers. It even allows comparing the results interactively.
- benchANT offers a ranking of database benchmarks (DBaaS)
Level 2: Running representative 3rd-party benchmarks
Let’s look at the popular third-party, off-the-shelf benchmarking tools that are available. For each tool, I demonstrate:
- How to install and use it (including compiling from source, if necessary)
- Useful configuration options, if they exist
- What the results look like
- Possible caveats I (or others) found
In this article, I focus on CLI tools (because they are easier to automate than graphical tools), running them in a Linux-based Docker container. Here is an overview of the tools:
| | Types of measurements | License / Source | Typical execution time per run |
|---|---|---|---|
| CoreMark | CPU only | Open Source (repo) | 10-20 seconds |
| OpenSSL | CPU only | Open Source (repo) | Configurable |
| 7-zip | CPU + memory | Open Source (repo) | ca. 30 seconds |
| Sysbench | CPU, memory, disk | Open Source (repo) | 10 seconds by default (configurable) |
| FIO | Disk | Open Source (repo) | Configurable |
| Geekbench | CPU, GPU, memory | Proprietary | 8-10 minutes |
CoreMark
CoreMark focuses exclusively on CPU performance. You need to download the source and compile it. The generated binary does not have any meaningful CLI arguments; you set those at compile time. The most relevant parameter is the number of parallel threads.
CoreMark is the tool used by the Azure and Google Cloud benchmarks mentioned above (in level 1).
Example for compiling the program:
git clone https://github.com/eembc/coremark.git
cd coremark
# recommendation: checkout a specific commit, for reproducibility reasons (omitted)
# To build the single-threaded version:
make compile
# To build the multi-threaded version (here: 4 threads, change it if necessary):
make XCFLAGS="-DMULTITHREAD=4 -DUSE_FORK=1" compile
To run the benchmark, call ./coremark.exe
Example result (of running the multi-threaded version):
2K performance run parameters for coremark.
CoreMark Size : 666
Total ticks : 33380
Total time (secs): 33.380000
Iterations/Sec : 47932.893948
Iterations : 1600000
Compiler version : GCC11.4.0
Compiler flags : -O2 -DMULTITHREAD=4 -DUSE_FORK=1 -lrt
Parallel Fork : 4
Memory location : Please put data memory location here
(e.g. code in flash, data on heap etc)
seedcrc : 0xe9f5
[0]crclist : 0xe714
...
[0]crcmatrix : 0x1fd7
...
[0]crcstate : 0x8e3a
...
[0]crcfinal : 0x65c5
...
Correct operation validated. See README.md for run and reporting rules.
CoreMark 1.0 : 47932.893948 / GCC11.4.0 -O2 -DMULTITHREAD=4 -DUSE_FORK=1 -lrt / Heap / 4:Fork
As you would expect, higher results are better.
Test reproducibility
Keep in mind that your results are only reproducible if you use the same compiler version and compilation flags. Thanks to Docker, you can compile the binary in the Dockerfile when building the image, and thus ensure that the binaries are identical (by using the same image). This remark applies to all open-source tools that you compile yourself.
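For illustration, here is a minimal sketch of such a setup (assumptions: the base image, the commit placeholder and the thread count are mine; adjust them to your needs):

# Build a container image with a pinned CoreMark build baked in
cat > Dockerfile <<'EOF'
FROM ubuntu:22.04
RUN apt-get update && apt-get install -y --no-install-recommends git build-essential ca-certificates
RUN git clone https://github.com/eembc/coremark.git /coremark
WORKDIR /coremark
# RUN git checkout <commit-hash>   # pin a specific commit for reproducibility
RUN make XCFLAGS="-DMULTITHREAD=4 -DUSE_FORK=1" compile
CMD ["./coremark.exe"]
EOF
docker build -t coremark-bench .
docker run --rm coremark-bench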
CoreMark Pro
The CoreMark README mentions this: “For a more compute-intensive version of CoreMark that uses larger datasets and execution loops taken from common applications, please check out EEMBC’s CoreMark-PRO benchmark, also on GitHub.”
I tested it, and it works without problems. Although the tool has “Pro” in its name, you do not need a license to use it. Just follow the official readme for usage instructions. The CoreMark Pro Readme mentions the term “contexts”, by which it means the number of parallel threads. Like with CoreMark, higher results are better.
In my tests, the execution time of a CoreMark Pro run is about 1 minute.
OpenSSL
The “openssl speed” command (docs) is a benchmark tool included with OpenSSL. It determines the hashing speed of various algorithms (e.g. MD5 or SHA256). We can use it for hardware benchmarks by fixing the set of algorithms to a small set (e.g. just one algorithm), because we don’t care about the speed difference between two or more hashing algorithms; we want to know the speed difference between different CPU models for a particular hashing algorithm.
To use it, just install the OpenSSL library using something like “apt install openssl”, or download and compile the source code.
I highly discourage you from running “openssl speed” without any arguments. This would benchmark all (dozens of) algorithms, and for each algorithm, 6 different buffer sizes are tested. Also, each test runs for 3 or 10 seconds, resulting in at least 20 minutes of total execution time!
Instead, I suggest you run the following command, adjusting the arguments as necessary:
openssl speed -seconds 10 -bytes 1024 -multi 2 sha256
- -bytes overrides the buffer size – just use a multiple of 2, 1024 worked fine for me
- -seconds controls the execution time per algorithm and buffer size combination (here we just have one combination, so -seconds actually sets the total benchmark duration)
- -multi controls the number of parallel threads (by default, one thread is used)
- as the last argument (here: sha256) you can specify a list of one or more space-separated hashing algorithms
When specifying only one hashing algorithm, the total execution time equals the seconds you provided. Here is an example output:
Forked child 0
+DT:sha256:10:1024
Forked child 1
+DT:sha256:10:1024
+R:3780776:sha256:10.000000
+R:3803184:sha256:10.000000
Got: +H:1024 from 0
Got: +F:6:sha256:387151462.40 from 0
Got: +H:1024 from 1
Got: +F:6:sha256:389446041.60 from 1
version: 3.0.2
built on: Wed May 24 17:12:55 2023 UTC
options: bn(64,64)
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -ffile-prefix-map=/build/openssl-Z1YLmC/openssl-3.0.2=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_TLS_SECURITY_LEVEL=2 -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_BUILDING_OPENSSL -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
CPUINFO: OPENSSL_ia32cap=0xfeda32035f8bffff:0x27a9
sha256 776597.50k
The relevant value is shown in the last line. It represents the hashing throughput, thus higher is better.
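If you want to automate collecting that number, a minimal sketch (assuming the single-algorithm, single-buffer-size invocation shown above, where the summary line starts with the algorithm name):

# Print only the final throughput value of the sha256 summary line
openssl speed -seconds 10 -bytes 1024 -multi 2 sha256 2>/dev/null | awk '$1 == "sha256" {print $NF}'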
7-zip
7-zip comes with a benchmark mode. You can run it with the “7z b” command. As the docs explain, this only runs an LZMA-based compression + decompression test, which runs for about 30 seconds.
I recommend that you download its source code and compile it yourself (because the Linux distro repository might contain a very outdated version), as follows:
git clone https://github.com/mcmilk/7-Zip
cd 7-Zip
# recommendation: checkout a specific commit, for reproducibility reasons (omitted)
cd CPP/7zip/Bundles/Alone2
CFLAGS="-O3 -march=native -Wno-error" make -j 2 -f makefile.gcc
# binary is now available in CPP/7zip/Bundles/Alone2/_o/7zz
Alternatively, others have also built Docker images, e.g. this one.
Here is an example result:
7-Zip (z) 23.01 (x64) : Copyright (c) 1999-2023 Igor Pavlov : 2023-06-20
64-bit locale=C.UTF-8 Threads:2 OPEN_MAX:1048576
Compiler: 11.4.0 GCC 11.4.0: SSE2
Linux : 5.4.0-104-generic : #118-Ubuntu SMP Wed Mar 2 19:02:41 UTC 2022 : x86_64 : Microsoft Hv : Hv#1 : 10.0.19041.3.0.3693
PageSize:4KB THP:madvise hwcap:2
Intel(R) Xeon(R) CPU E3-1230 v3 @ 3.30GHz (306C3)
1T CPU Freq (MHz): 3244 3259 3310 3532 3594 3558 3535
1T CPU Freq (MHz): 101% 3567 99% 3562
RAM size: 1918 MB, # CPU hardware threads: 2
RAM usage: 444 MB, # Benchmark threads: 2
Compressing | Decompressing
Dict Speed Usage R/U Rating | Speed Usage R/U Rating
KiB/s % MIPS MIPS | KiB/s % MIPS MIPS
22: 10803 177 5922 10509 | 88480 199 3789 7554
23: 10772 185 5924 10976 | 86621 200 3757 7498
24: 10288 188 5895 11062 | 86257 200 3795 7573
25: 10222 191 6115 11672 | 79112 198 3555 7041
---------------------------------- | ------------------------------
Avr: 10521 185 5964 11055 | 85118 199 3724 7417
Tot: 192 4844 9236
The docs explain many details regarding what the benchmark does. In essence:
- You can either note down the compression and decompression speed separately, by choosing the value of the “Rating” column of the “Avr:” line (here: 11055 for compression and 7417 for decompression)
- Or you pick the value of the last column of the last line, which is simply computed as “(<compression-result> + <decompression-result>) / 2” (here: 9236).
By default, the benchmark uses all available CPU cores. If you want to limit the benchmark to a single core, add the “-mmt1” argument. The execution time does not change significantly when running only on a single core.
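For automation, a minimal sketch for extracting the combined rating (assumptions: 7zz is the binary you compiled above, located in the current directory, and the “Tot:” summary line looks like in the example output):

./7zz b | awk '$1 == "Tot:" {print $NF}'          # combined rating (MIPS), all cores
./7zz b -mmt1 | awk '$1 == "Tot:" {print $NF}'    # combined rating, single core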
Alternatively, you can also run all tests (via “-mm=*”). This takes about 2 minutes and produces a result like this:
7-Zip (z) 23.01 (x64) : Copyright (c) 1999-2023 Igor Pavlov : 2023-06-20
64-bit locale=C.UTF-8 Threads:2 OPEN_MAX:1048576
m=*
Compiler: 11.4.0 GCC 11.4.0: SSE2
Linux : 5.4.0-104-generic : #118-Ubuntu SMP Wed Mar 2 19:02:41 UTC 2022 : x86_64 : Microsoft Hv : Hv#1 : 10.0.19041.3.0.3693
PageSize:4KB THP:madvise hwcap:2
Intel(R) Xeon(R) CPU E3-1230 v3 @ 3.30GHz (306C3)
1T CPU Freq (MHz): 3451 3432 3387 3460 3510 3556 3559
1T CPU Freq (MHz): 101% 3549 99% 3447
RAM size: 1918 MB, # CPU hardware threads: 2
RAM usage: 455 MB, # Benchmark threads: 2
Method Speed Usage R/U Rating E/U Effec
KiB/s % MIPS MIPS % %
CPU 200 3419 6829
CPU 200 3481 6962
CPU 200 3499 6987 100 200
LZMA:x1 38762 199 7190 14289 206 409
81452 200 3263 6514 93 186
LZMA:x3 15193 199 4682 9335 134 267
86013 200 3613 7216 103 207
LZMA:x5:mt1 7871 199 4944 9834 142 281
85285 200 3600 7191 103 206
LZMA:x5:mt2 10133 199 6366 12659 182 362
83556 199 3535 7045 101 202
Deflate:x1 76502 198 4895 9714 140 278
293171 200 4565 9108 131 261
Deflate:x5 26575 199 5139 10232 147 293
302179 200 4695 9380 134 268
Deflate:x7 11094 199 6174 12292 177 352
311997 200 4842 9681 139 277
Deflate64:x5 22978 198 5004 9930 143 284
305222 200 4775 9543 137 273
BZip2:x1 12450 200 3764 7522 108 215
87463 200 4749 9480 136 271
BZip2:x5 10873 199 4558 9074 130 260
64629 199 6367 12683 182 363
BZip2:x5:mt2 9907 197 4188 8269 120 237
60147 199 5943 11803 170 338
BZip2:x7 3402 200 4415 8814 126 252
61458 198 6096 12050 174 345
PPMD:x1 10804 198 5639 11175 161 320
8288 199 4909 9760 141 279
PPMD:x5 7221 199 6143 12238 176 350
5944 199 5587 11140 160 319
Swap4 58814220 200 1886 3764 54 108
59638149 200 1909 3817 55 109
Delta:4 4407016 199 6790 13538 194 388
3105539 200 6375 12720 182 364
BCJ 5481816 199 5633 11227 161 321
4324010 199 4439 8856 127 253
ARM64 7472124 199 3839 7651 110 219
7980940 200 4095 8172 117 234
AES256CBC:1 295710 198 3671 7267 105 208
317353 200 3900 7799 112 223
AES256CBC:2 1072742 199 4422 8788 127 252
7375528 200 3776 7553 108 216
AES256CBC:3
CRC32:8 3648741 200 2479 4948 71 142
CRC32:32
CRC32:64
CRC64 2144323 200 2198 4392 63 126
SHA256:1 451450 199 4624 9210 132 264
SHA256:2
SHA1:1 1140771 200 5352 10678 153 306
SHA1:2
BLAKE2sp 522395 198 5793 11493 166 329
CPU 200 3464 6917
------------------------------------------------------
Tot: 199 4682 9325 134 267
I did not find that benchmark mode particularly useful.
Sysbench
Sysbench is an open-source suite of benchmark tools for measuring CPU, RAM or disk performance. There is a maintained third-party Docker image you can use in your container-based environment.
All tests run for 10 seconds by default, which can be changed via “--time=<seconds>”.
Let’s take a look at each of the benchmark modes, and some of their caveats:
1. CPU
Command example: sysbench cpu --cpu-max-prime=20000 --threads=2 run
Example output:
sysbench 1.0.20 (using bundled LuaJIT 2.1.0-beta2)
Running the test with following options:
Number of threads: 2
Initializing random number generator from current time
Prime numbers limit: 20000
Initializing worker threads...
Threads started!
CPU speed:
events per second: 677.70
General statistics:
total time: 10.0014s
total number of events: 6779
Latency (ms):
min: 2.70
avg: 2.95
max: 12.97
95th percentile: 3.25
sum: 19997.56
Threads fairness:
events (avg/stddev): 3389.5000/3.50
execution time (avg/stddev): 9.9988/0.00
The CPU benchmark computes prime numbers up to a certain number (configured via --cpu-max-prime=20000, which is 10000 by default in version 1.0.20, but this might change over time, so you should set the value explicitly). Computing a prime number takes just a few milliseconds. As explained here, each finished prime number computation is counted as an “event”, and thus the events per second are an indication of computational performance. Higher is better. Keep in mind that benchmark results that use a different --cpu-max-prime value are not comparable!
--threads configures how many parallel threads should be active (1 by default); increasing it should also increase your events-per-second score.
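For example, a minimal sketch for comparing single- and multi-threaded throughput (the thread counts are assumptions; adjust them to the core count of the machine under test):

# Extract the "events per second" value for different thread counts
for t in 1 4; do
  eps=$(sysbench cpu --cpu-max-prime=20000 --threads="$t" run | awk '/events per second/ {print $4}')
  echo "threads=$t: $eps events/sec"
done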
Lack of comparability
Unfortunately, the results of the CPU benchmark are sometimes not comparable between different CPU models, as demonstrated in this ticket, where an ARM-based Mac appears to be 18000 times faster than an Intel i7-4770HQ (which, naturally, cannot possibly be true).
2. Disk speed
Command example: “sysbench fileio --file-test-mode=rndrw prepare”, followed by the same command with “prepare” replaced by “run” (keeping all other arguments equal). The “prepare” command prepares a number of test files (128 by default) in your current working directory.
Example output:
sysbench 1.0.20 (using bundled LuaJIT 2.1.0-beta2)
Running the test with following options:
Number of threads: 1
Initializing random number generator from current time
Extra file open flags: (none)
128 files, 16MiB each
2GiB total file size
Block size 16KiB
Periodic FSYNC enabled, calling fsync() each 100 requests.
Calling fsync() at the end of test, Enabled.
Using synchronous I/O mode
Doing sequential write (creation) test
Initializing worker threads...
Threads started!
File operations:
reads/s: 0.00
writes/s: 3315.00
fsyncs/s: 4257.04
Throughput:
read, MiB/s: 0.00
written, MiB/s: 51.80
General statistics:
total time: 10.0438s
total number of events: 75807
Latency (ms):
min: 0.01
avg: 0.26
max: 25.22
95th percentile: 0.36
sum: 19963.31
Threads fairness:
events (avg/stddev): 37903.5000/487.50
execution time (avg/stddev): 9.9817/0.00
If you run “sysbench fileio help”, you see the many configuration options (including the default values in square brackets):
fileio options:
--file-num=N number of files to create [128]
--file-block-size=N block size to use in all IO operations [16384]
--file-total-size=SIZE total size of files to create [2G]
--file-test-mode=STRING test mode {seqwr, seqrewr, seqrd, rndrd, rndwr, rndrw}
--file-io-mode=STRING file operations mode {sync,async,mmap} [sync]
--file-async-backlog=N number of asynchronous operatons to queue per thread [128]
--file-extra-flags=[LIST,...] list of additional flags to use to open files {sync,dsync,direct} []
--file-fsync-freq=N do fsync() after this number of requests (0 - don't use fsync()) [100]
--file-fsync-all[=on|off] do fsync() after each write operation [off]
--file-fsync-end[=on|off] do fsync() at the end of test [on]
--file-fsync-mode=STRING which method to use for synchronization {fsync, fdatasync} [fsync]
--file-merged-requests=N merge at most this number of IO requests if possible (0 - don't merge) [0]
--file-rw-ratio=N reads/writes ratio for combined test [1.5]
The --file-test-mode option is the most important one: it controls which I/O pattern is executed: “seq…” stands for sequentially reading/writing to files, whereas “rnd…” stands for randomly reading/writing to files. rndrw combines both random reads and writes. Note: seqrewr does not stand for “sequential read+write”, but for “sequential rewrite” (presumably writing to already-existing files)!
The results will be vastly different if you change any of the arguments. For instance, on the same test machine, I got these results:
- “sysbench fileio --file-test-mode=seqwr --file-num=1 run” yields 183 MB/s throughput
- “sysbench fileio --file-test-mode=seqwr --file-num=128 run” yields 30 MB/s throughput
I think the difference can be explained by the varying number of fsync operations (more files require more fsync ops, and fsync ops are slow). In any case: be wary of the CLI arguments that you use: only compare results that use the exact same values.
There is also a “cleanup” command you can run after the “run” command, which cleans up the test files created in prepare/run. However, you might not need the “cleanup” command when running Sysbench in a container, which you discard after a benchmark anyway. “cleanup” might still be worth using if you benchmark the performance of a remote storage.
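Putting it together, a minimal sketch of a complete fileio benchmark run (the test mode, file count and duration are assumptions; the important part is keeping them identical across all machines you compare):

# Use the exact same arguments for prepare, run and cleanup
ARGS="--file-test-mode=rndrw --file-num=128 --file-total-size=2G --time=30"
sysbench fileio $ARGS prepare
sysbench fileio $ARGS run
sysbench fileio $ARGS cleanup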
3. Memory / RAM
Using “sysbench memory --memory-oper=write --memory-access-mode=seq run” you can benchmark the raw memory access speed.
Example output:
sysbench 1.0.20 (using bundled LuaJIT 2.1.0-beta2)
Running the test with following options:
Number of threads: 1
Initializing random number generator from current time
Running memory speed test with the following options:
block size: 1KiB
total size: 102400MiB
operation: write
scope: global
Initializing worker threads...
Threads started!
Total operations: 45658628 (4565125.60 per second)
44588.50 MiB transferred (4458.13 MiB/sec)
General statistics:
total time: 10.0001s
total number of events: 45658628
Latency (ms):
min: 0.00
avg: 0.00
max: 0.42
95th percentile: 0.00
sum: 5247.42
Threads fairness:
events (avg/stddev): 45658628.0000/0.00
execution time (avg/stddev): 5.2474/0.00
“sysbench memory help” shows the notable configuration options and their default values:
memory options:
--memory-block-size=SIZE size of memory block for test [1K]
--memory-total-size=SIZE total size of data to transfer [100G]
--memory-scope=STRING memory access scope {global,local} [global]
--memory-hugetlb[=on|off] allocate memory from HugeTLB pool [off]
--memory-oper=STRING type of memory operations {read, write, none} [write]
--memory-access-mode=STRING memory access mode {seq,rnd} [seq]
Most notably, you can configure whether you want to benchmark reading or writing, and switch between sequential (seq) and random (rnd) memory access mode.
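For instance, a minimal sketch comparing sequential and random write throughput (all other options left at their defaults):

# Print only the throughput line for sequential vs. random writes
sysbench memory --memory-oper=write --memory-access-mode=seq run | grep "MiB transferred"
sysbench memory --memory-oper=write --memory-access-mode=rnd run | grep "MiB transferred"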
fio
fio is an advanced disk benchmark tool. It is available in the repositories of many Linux distributions, but you can also download its source code and compile it, as documented in the project readme. I recommend compiling the code yourself, because the fio version available in the Linux repositories is likely out of date.
Fio is configured via a “job file” that uses an INI-style format. Here is an example:
[global]
rw=randread
ioengine=libaio
iodepth=64
size=1g
direct=1
buffered=0
startdelay=20
ramp_time=5
runtime=60
group_reporting=1
numjobs=1
time_based
disk_util=0
clat_percentiles=0
disable_lat=1
disable_clat=1
disable_slat=1
filename=fiofile
[test]
name=test
bs=64k
stonewall
The above example job file configures a random read test with a start delay of 20 seconds and a warm-up (ramp) time of 5 seconds, which then runs for 60 seconds, with 1 parallel thread, using a block size of 64 KB. The docs explain what the individual parameters mean (including alternative read/write patterns other than randread).
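Alternatively, fio accepts the same parameters directly on the command line, which is handy for scripted runs. A sketch equivalent to the job file above, with the reporting tweaks omitted for brevity (assuming a reasonably recent fio version):

fio --name=test --rw=randread --ioengine=libaio --iodepth=64 --size=1g --direct=1 --buffered=0 \
    --startdelay=20 --ramp_time=5 --runtime=60 --time_based --numjobs=1 --group_reporting \
    --bs=64k --filename=fiofile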
The resulting output of running “fio <path-to-jobfile>” looks like this:
test: (g=0): rw=randread, bs=(R) 64.0KiB-64.0KiB, (W) 64.0KiB-64.0KiB, (T) 64.0KiB-64.0KiB, ioengine=libaio, iodepth=64
fio-3.35
Starting 1 process
test: Laying out IO file (1 file / 1024MiB)
test: (groupid=0, jobs=1): err= 0: pid=3601: Wed Nov 29 10:38:22 2023
read: IOPS=2959, BW=185MiB/s (194MB/s)(10.8GiB/60001msec)
bw ( KiB/s): min=165376, max=209955, per=99.99%, avg=189472.50, stdev=6106.75, samples=120
iops : min= 2584, max= 3280, avg=2960.44, stdev=95.43, samples=120
cpu : usr=3.06%, sys=11.09%, ctx=177668, majf=0, minf=36
IO depths : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
issued rwts: total=177584,0,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=64
Run status group 0 (all jobs):
READ: bw=185MiB/s (194MB/s), 185MiB/s-185MiB/s (194MB/s-194MB/s), io=10.8GiB (11.6GB), run=60001-60001msec
The results indicate an average throughput (bw = bandwidth) of 189472.50 KiB/s and 2960.44 IOPS (higher is better).
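If you want to post-process the results programmatically, fio can also emit JSON. A minimal sketch (assumptions: the jq tool is installed, and the field names match your fio version; check the JSON once before relying on it):

fio --output-format=json <path-to-jobfile> > result.json
# Extract read bandwidth and IOPS of the first job
jq '.jobs[0].read.bw, .jobs[0].read.iops' result.json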
Geekbench
Geekbench is a proprietary benchmark suite geared towards end-users (e.g. gamers) who want to compare their machine’s performance with that of other end-users. Geekbench can benchmark the CPU, the GPU, and several real-world scenarios that stress the CPU, GPU, memory and disk to varying extents.
The CLI version runs the following tests (every test is run twice, in a single-core and a multi-core variant):
- File Compression
- Navigation
- HTML5 Browser
- PDF Renderer
- Photo Library
- Clang
- Text Processing
- Asset Compression
- Object Detection
- Background Blur
- Horizon Detection
- Object Remover
- HDR
- Photo Filter
- Ray Tracer
- Structure from Motion
The tests are explained in more detail here.
Because Geekbench is proprietary (the source code is not available), you have to download binaries for the correct CPU platform. By default, Intel/AMD 32/64 bit are supported, and there is a preview version for other platforms like ARM64, but it is known to be unstable, sometimes computing incorrect benchmark scores.
There is a free version (as well as a free unofficial AMD64-only Docker Image) and a commercial version:
- The free version requires internet access, because the benchmark results are not printed to the console, but they are only uploaded to https://browser.geekbench.com/ where you can look at the results (example result). The console output of the Geekbench CLI only prints the URL at which your uploaded results are made available, or an error if uploading failed. Some automation can be done to “extract” the scores from the result web page, though, which is e.g. done by “yabs”, see here.
- The commercial version can additionally publish the result to various local file formats, which makes it suitable for automation.
In summary, while Geekbench runs many useful tests, it has two main caveats:
- unless you are willing to pay $99 for a commercial license, integrating Geekbench into your own test suite (that includes other benchmark tools) will require custom result parsing code of the Geekbench HTML report
- Geekbench’s ARM64 variant is not yet stable, so comparing the results of ARM-based CPUs with Intel/AMD-based ones might yield incorrect results
Honorable mentions of tools I did not test myself:
- For database benchmarks:
- Transaction Processing Performance Council (TPC), e.g. TPC-C or TPC-E (for OLTP), TPCx-HS / TPCx-BB (for Big Data), or TPC-H (for analytics).
- Yahoo Cloud Serving Benchmark (for NoSQL)
- Time Series Benchmark Suite (TSBS) for time series DBs
- Graph Database Benchmark (GDB) for graph databases
- For message broker benchmarks: OpenMessaging Benchmark Framework
Conclusion
Before you engage in benchmarking, always consider the cost-benefit ratio first! Only spend as much time on benchmarks as the estimated (long-term) savings can amortize.
If you think it is worth the effort to do the hardware benchmarks yourself (→ level 2 or 3), consider the tips from the “Good hardware benchmark practices” section. While you could build a separate Docker image for each tool (or use existing images), I instead recommend building a single customized Docker image that contains all tools. Part 2 goes into detail on how to achieve this with the Phoronix Test Suite.