In this book recommendation, I review the second edition of Release It! Design and Deploy Production-Ready Software by Michael Nygard, released in 2018. I also offer my 27-page summary of the book as a free download.
A common misconception is that the majority of the costs in a software-based project are staffing costs, e.g. for your development teams, management, operations, etc. However, software delivers its value only in production, where it earns revenue. Once you have a production incident, the loss of revenue may become so large that it dwarfs your staffing costs. When it comes to internet-based services, customers today have a lot of alternatives to go to, and not a lot of patience. If your production system stops working, they might just go elsewhere. And if you keep messing up repeatedly, you lose your reputation: (regular) customers will never come back, even after your system has finally been rearchitected and is now working really well.
The goal of “Release It!” is to keep this from happening to you. The book is filled with tips for writing production-ready software that is as stable as possible. It mainly targets software engineers and architects who build larger-scale systems.
Key topics covered by Release It!
In chapters 1-3, the book defines (on a high level) what instability is, explaining terms such as fault, error, failure and failure mode. The high-level objective is to make sure that any damage that occurs in production is contained to a small part of the system. To achieve this, you need to be aware of the different faults that could occur, how they would spread through the system, and how you can control this spread. The book distinguishes two forms of negative forces that the environment may apply to your system: short-term impulses (e.g. a DoS attack), and long-term stresses (e.g. a component that is slowly nearing its capacity, or external systems that keep responding very slowly). The goal is to build a stable system that can keep processing transactions even when impulses or stresses are negatively affecting it, or when one or more components fail.
Chapter 4 bundles the (uncountably many) concrete faults (and their effects) into a small set of about ten stability antipatterns. One example is Dogpile, where a bunch of components (the “dogs”) cause a temporary load surge that puts too much load on another component. For instance: you restart all your application servers at once, and they all need to warm their caches at the same time, thus overloading the database.
For many of the antipatterns, chapter 4 already provides hints for how to mitigate them. However, chapter 5 is dedicated to mitigation, presenting a list of about ten stability patterns, such as using timeouts, the circuit breaker pattern, or failing fast. Each stability pattern addresses one or more of the antipatterns.
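To make the circuit breaker idea more tangible, here is a minimal sketch in Python. It is my own illustration, not code from the book: the class name CircuitBreaker, the parameters failure_threshold and reset_timeout, and the exception CircuitOpenError are all made up for this example. The breaker counts consecutive failures, and once the threshold is reached it rejects calls immediately (failing fast) until the timeout has passed, after which it lets a single trial call through.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Hypothetical, single-threaded circuit breaker sketch."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open
        self.clock = clock                          # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None                       # None means the breaker is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast instead of hitting the broken dependency.
                raise CircuitOpenError("circuit open, failing fast")
            # Timeout elapsed: half-open, let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the breaker.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the breaker and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

You would wrap each call to a remote dependency, e.g. breaker.call(fetch_profile, user_id), where fetch_profile is again a hypothetical function. A production-grade version would also need thread safety and a timeout on the wrapped call itself.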
Chapter 6 tells a story about a production incident, and there are also “Case study” stories like this in chapters 12 and 15. An interesting lesson from the story in chapter 6: if you collect metrics for the duration of HTTP requests, the metrics aggregation system (e.g. Prometheus) can only collect durations of those requests that actually completed (with some HTTP status code). Requests that failed with a timeout, e.g. because the client stopped accepting response data, won’t be collected!
Chapters 7-10 teach many basics of networking (e.g. service discovery, load balancing or multi-homing), deployment options (on physical hosts, VMs, or containers), deploying & configuring (e.g. using version control, CI/CD, immutable infrastructure, 12-factor app principles), and observability.
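One way to avoid this blind spot is to measure on the calling side and to record the duration in a finally block, so that timeouts and other failures are counted as well. The following Python sketch is my own illustration, not taken from the book; the names timed_call and durations are made up, and the plain dictionary is only a stand-in for a real metrics client.

```python
import time
from collections import defaultdict

# Outcome label -> list of observed durations in seconds.
# Stand-in for a real histogram metric (assumption for this sketch).
durations = defaultdict(list)


def timed_call(func, clock=time.monotonic):
    """Run func() and record its duration, whatever the outcome."""
    start = clock()
    outcome = "error"  # default for any exception other than a timeout
    try:
        result = func()
        outcome = "ok"
        return result
    except TimeoutError:
        outcome = "timeout"
        raise
    finally:
        # Runs on success, timeout and error alike, so no request goes missing.
        durations[outcome].append(clock() - start)
```

With this in place, a sudden growth of the "timeout" bucket becomes visible even though none of those requests ever produced an HTTP status code.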
Chapter 11 covers Security. Apart from some general truths (the gist being “bake security into your SDLC” and “continuously work on security”), the book rehashes the OWASP Top 10 from 2017. Today, it makes more sense to look up the most recent version directly on their website.
Chapter 13 discusses the deployment process in detail. The development team should treat (zero-downtime) deployment aspects as a feature of the software, not as something that only the ops team is concerned with. That means that the dev team needs to take part in creating the CI/CD pipeline and all its tasks and tools. The chapter discusses the different sequential stages of a (rolling) deployment, and what caveats there are with database schema evolutions, presenting multiple ways to solve them.
Chapter 14 goes into API versioning. It explains how to identify and handle breaking vs. non-breaking changes, in your own and in third party APIs. It also discusses how to handle differences between implementation behavior and API specification.
Chapter 15 is another case study, telling a story about a failed product launch, where servers crashed because of too many concurrent sessions. It contains many real-world lessons and tips about load testing.
In chapter 16, the author discusses how to adapt your software and process to ever-changing business needs. He points out important caveats: for instance, having applied CI/CD and thus being able to roll out quickly is of little value if you are slow at collecting feedback. The book also discusses platform teams, and has tips for building an adaptable system architecture or information architecture, although I’d argue that there are much better books about these topics.
Chapter 17 concludes by presenting an introduction to chaos engineering, which is about injecting different kinds of chaos into the production system, to verify that the system (as a whole) stays operational. And should the system start failing, you at least learn how and why the system breaks.
Get my book summary
If you want to learn more, I’m offering you my summary of the book (27 pages) for free. As the book has ~330 pages, the compression factor is ~12x.
Creating summaries of books has proven very beneficial for me: I regularly revisit them (e.g. once every 1-2 years), to refresh my memory, or to check where I have to refresh my skills by practicing.
In the summary, I inserted many references to the page numbers of the book, where you can find more details.
I can generally recommend this book to anyone starting out with deploying software, because it contains many helpful tips. I loved chapters 3 (fault taxonomy), 4 (stability antipatterns) and 5 (stability patterns), which are super-useful for developers, helping them make their code more resilient. I also learnt a lot in chapters 13 (zero-downtime deployments) and 14 (API versioning).
However, there were some “weaker” chapters, too, from my point of view. For instance, many parts of chapters 7-9 seem to be stuck in the “old ways” of running servers, where you (manually) apply configuration, e.g. for networking. I would have expected the second edition of the book to be more up to date, covering more cloud-based tech, but maybe it is good to also have a perspective on how to run software without any workload schedulers, such as Kubernetes. For a few other topics, such as Security or Chaos Engineering, the respective book chapters are basic introductions. Other, more dedicated resources will get you much further in these areas. This is not really a shortcoming, though. A book like this cannot possibly cover every topic in depth.
If you liked my review, please buy the book. If you have read it, feel free to share your findings or opinions in the comments.