In this book recommendation, I review the book Designing Data-Intensive Applications by Martin Kleppmann, released in 2017. I also offer my 46 page summary of the book as free download.
What are data-intensive applications?
This book targets software engineers and architects interested in building distributed systems. In contrast to compute-intensive applications, where CPU cycles are the primary concern, the term data-intensive applications refers to systems where data is the primary concern. This could mean that you need to store large amounts of data (including high throughput), or that the data has high complexity, or that the data changes at fast speeds. These problems generally imply that data needs to be stored (and processed) in a distributed fashion.
Goal of the book Designing Data-Intensive Applications
The goal of the book is to help the target audience understand the different basic principles of storing, transmitting, processing and querying data. It only briefly mentions concrete implementations (of, say, a database or message broker), because the tech changes rapidly, and no book could ever keep up. Instead, the book teaches you the broad categories of tools and technologies, their trade-offs, and how to combine them to a complete system. Provided with this high-level knowledge you can then choose the right implementation.
Key topics covered by the book
The book starts by discussing different (non-) functional requirements that data-intensive applications typically have, such as reliability, scalability and maintainability. It explains the two primary data models (relational and different sub-variants of non-relational), how they can be queried, and when you should favor one data model over the other.
A chapter about databases explains how they store data under the hood, including search indices. The book continues discussing OLTP and OLAP databases and how they are different. For me, it offered the first proper introduction to OLTP and OLAP databases and how they differ. I’ve read numerous articles on that subject, and they all pale in comparison to the book’s explanation.
Next, the author looks at different ways of (de-)serializing data (e.g. as text or binary), so that you can store or transmit it to other nodes. Schema evolution (forward/backward-compatibility) is also discussed, which is an important topic in practice, since applications and the data schema keeps evolving, but application code is rolled out incrementally in distributed systems.
A large part of the book covers three cornerstones of distributed databases in depth: replication, partitioning, and transactions. For each, it discusses various approaches and how they compare to each other. There are also nice “by the way” info blocks scattered throughout the book. For instance, Kleppmann explains why the C in ACID does not belong there. It was likely put there to make the acronym work (although I personally would have dug the “AID” acronym, too).
The book goes on to discuss different issues that any distributed system suffers from, such as partial failures (e.g. of a node’s hardware), unreliable networks, unreliable clocks, and the problematic consequence that a node can often not distinguish the different problems from one another.
Next, the book sheds light on the terms consistency and consensus. It explains the problems of the strongest consistency guarantee (linearizability), how they can be achieved (e.g. with 2-Phase-Commit, Raft, or (Multi-) Paxos), and why you might want to settle for weaker consistency levels.
The last chapters explain how (and why) you should integrate different kinds of data systems (e.g. caches, search indices, or data stores, which are each optimized for different use cases) to form a distributed system that offers the best possible performance. The book takes a detailed look at batch processing systems (e.g. MapReduce, dataflow engines like Spark or Flink, or graph processing engines) and stream processing systems (such as log-based message brokers).
The book closes with a lengthy section about responsibility we as developers have, touching on the ethics and rights of the end-users using our system.
The following is a list of the most noteworthy insights I had when reading the book:
- It is often the case that whenever the current load increases by one order of magnitude (10x), you need to completely (or at least partially) rewrite the architecture of your system.
- More and more products blur the lines when it comes to the feature set, e.g. databases supporting both SQL and NoSQL, or products like Redis that are both a cache, a DB and a message broker.
- I now have a clear idea of what caveats to consider when designing (and evolving) schemas – irrespective of whether this applies to a relational database schema, or other ways of storing and transmitting data.
- When working with databases, be careful regarding the wording of the configuration settings. They might not mean what you think they mean. Instead, read the documentation closely. For instance, Oracle’s “serializable” transaction isolation level is not what you think it is. Or when you configure a active/passive replication to synchronously propagate updates, it does not mean that it really propagates updates to all other nodes synchronously, just to some of them.
- In practice, the ACID implementation of database implementation #1 does not equal the ACID implementation of database implementation #2. The term “ACID compliance” has become an under-defined marketing term, because some subtle details are not clarified in the ACID definitions, so the DB vendors can choose their own interpretation.
- For applications using a relational DB, choosing the appropriate transaction isolation level is difficult. You need to find the right balance between “serializable” (which has a big performance penalty) and one of the lower levels. You also need to understand how the isolation levels defined by the DB vendor are different to those levels defined by experts of the field.
- If you need a data store that has “strong consistency”, where conflicts can never occur and where transactions that can complete timely are supported, then a distributed version of such a database cannot be much faster than its single-node version. This has been mathematically proven. Such systems require consensus and there is a lot of communication/messaging overhead, which makes the whole setup rather inefficient. There are many data store vendors who ignore this fact and lie through their teeth, claiming superior performance. In reality, the only way to get higher performance (throughput) is to either allow for a (low) probability of conflicts to arise (e.g. DBs with eventual consistency), or to be fine with a strong level of (conflict-free) consistency that is not timely anymore (making the user wait due to asynchronous processing).
- When choosing a distributed component (e.g. a DB or message broker), carefully examine how they scale. Typically, partitioning is part of the scaling approach. Learn how they the product does partitioning (e.g. by key, or hash of the key), which has different pros and cons (such as the inability to do efficient range queries). Also, find out how indices are partitioned (in addition to how the data is partitioned). Also, learn how rebalancing of partitioning happens, as it has a big impact on the operatinal performance. Also, learn how requests are routed to the correct partition, because it affects the availability of your system.
- When it comes to consistency levels (where the highest level is strong consistency, a.k.a. linearizability), think carefully whether your application really needs linearizability. The costs of linearizability are poor availability (any such system loses availability during a network partition) and poor performance (throughput, in transactions per second). During the analysis (requirements collection) phase of your system, investigate whether you can loosen the desired consistency level. It is usually easier to allow for inconsistencies to happen and to compensate at a later point of time, e.g. by giving a gift or voucher to the end-user for their trouble due to a cancellation. This is particularly true in those cases where the business has “apology-mechanisms” in place anyway. Exemplary, an online shop may have miscounted their inventory, having sold you goods it doesn’t even have. Thus, if you already have such an apology-mechanism in place anyway (on the business level), it makes little sense to require stronger consistency on the technological level than on the business level.
- Rather than building applications that store mutable state in a data store, you can instead apply the Event sourcing approach where you store user actions as events in an event store or a log-based message broker. The disadvantages are higher storage costs, and the increased complexity of (usually) having to maintain an additional state that reflects the accumulated events. But there is a plethora of advantages: auditability, easier to debug code (because you can replay events), easier to recover from errors (e.g. in a production system), higher degree of information (e.g. if two actions would cancel each other out), or the ability to implement additional features with additional (state-based) views, in parallel to the production system (not negatively affecting it, e.g. not having to do schema migrations). Also, database schema design becomes much more flexible, because you can always rebuild your DB (with a different schema) from the event logs.
- You should expect bugs even in “battle-tested” DBs. A good mitigation strategy is to regularly and continuously audit the integrity of your data. This means to write code that reads the data and checks that it makes sense. It also means to verify that backups are consistent, and that you can successfully restore them. Doing data integrity auditing does cost some money, but it saves much more money you would otherwise lose in a production incident that takes you a long time to fix, because you are having problems with the data restore process. Doing auditing continuously also helps you discover (and fix) bugs in your code more quickly, e.g. if the bug introduces integrity problems.
Get my book summary
If you want to learn more, I’m offering you my summary of the book (46 pages) for free. As the book has 544 pages, the compression factor is ~12x.
Creating summaries for good books has proven very beneficial for me: I regularly revisit them (e.g. once every 1-2 years), to refresh my memory, or to check where I have to refresh my skills by practicing.
In the summary, I inserted many references to the page numbers of the book, where you can find more details.
Designing Data-Intensive Applications is a masterful summary of many topics of the distributed systems literature. Never have I written a book summary that long, which is due to the very high density of facts, making it hard to omit much. Although the topics are difficult, Kleppmann always manages to present useful, practical examples that help the reader understand the content. Kleppmann is both a researcher and practitioner. He is actually knowledgable in the field, and for the curious reader he provides references to many other resources for further reading (including many academic papers) throughout the book.
Having read his work, I now have a much better understanding of the general techniques and approaches for distributed data storage and processing. I know how to cut through many of the marketing lingo of distributed storage products. I also have an idea which questions to ask the sales representative of a software vendor, regarding the product’s behavior under adverse conditions, or the supported consistency levels.
If you liked my review, please buy the book. It contains many further details. Since there have been many (product) developments since the book’s release in 2017, I hope that Kleppmann releases a second edition at some point.