What is the definition of a distributed system being operational?. In a single system, identifying the issue is easy because a program is either operational or it has encountered a crash, with the distinction typically evident from a stack trace. However, distributed systems are not as straightforward.
Each server may indicate healthiness when users encounter errors. The entire system could function correctly but remain trapped in a non-recoverable state. It can provide incorrect data without anyone suspecting anything, even with all dashboards showing positive results.
These issues might not be due to conventional bugs. These persistent error patterns have been appearing in various systems for many years, each with their own labels, processes, and established techniques for prevention. This article will explore the most crucial failure mode patterns in distributed systems and the typical methods used to address them.
