First, every component in your workload must behave in a way that does not negatively impact other components. Second, every component in your workload must be able to withstand the failure of one or more other components. Now, how can you achieve this?
The following best practices will help you limit the impact of failures on your workload.
Implementing Graceful Degradation to Transform Applicable Hard Dependencies into Soft Dependencies
Many companies running infrastructure on AWS have implemented this principle successfully. Netflix is a concrete example. Netflix has spoken publicly on numerous occasions about its approach to resiliency, so it is well known that its user interface (UI), for instance, is designed to withstand the failure of the components that supply the information it relies on. When you log in to the platform and land on your home page, you see a list of features (for instance, Continue watching, Recommended for you, and Watch again), each displaying a set of items (such as films, series, and documentaries).
When the UI cannot reach one of its dependencies for any reason (component failure, network issue, or anything else), it is missing a piece of information, such as all the items in a given category. In that case, displaying an error page to the end user simply because one of the many features is temporarily unavailable would be disproportionate and would make for a very poor user experience. Instead, the UI component is designed to keep working, in degraded mode. In this example, the UI would simply hide the feature that depends on the unavailable service, or serve cached data instead if that makes sense for the impacted feature.
For end users, this is a much better experience than landing on an error page, since they can still access the overall service even though some of its features are temporarily unavailable. In some cases, they may not even notice that the UI is operating in degraded mode.
The general idea is that when a component’s dependencies become unhealthy, the component itself can keep working, although in degraded mode. What degraded mode means depends entirely on the use case at hand. It will likely mean different things for a video-on-demand UI and a mobile banking application. Yet the same graceful degradation approach can still be applied in both cases.
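To make the idea more concrete, here is a minimal Python sketch of a home page assembled from several dependencies. It is only an illustration under stated assumptions: the function build_home_page, the fetcher callables, and the in-memory cache are all hypothetical names, not part of any real service or of Netflix's implementation. Each feature is fetched independently; when a dependency fails, the page falls back to cached items if any exist and otherwise hides that feature.

```python
from typing import Callable, Dict, List

# Hypothetical type: a fetcher returns the items for one home-page feature.
Fetcher = Callable[[], List[str]]


def build_home_page(
    fetchers: Dict[str, Fetcher], cache: Dict[str, List[str]]
) -> Dict[str, List[str]]:
    """Assemble the page feature by feature, degrading instead of failing."""
    page: Dict[str, List[str]] = {}
    for name, fetch in fetchers.items():
        try:
            items = fetch()
            cache[name] = items  # remember the last good result
        except Exception:
            # Soft dependency: fall back to cached items when available.
            cached = cache.get(name)
            if cached is None:
                continue  # nothing to show, so hide this feature
            items = cached
        page[name] = items
    return page


def recommendations_down() -> List[str]:
    # Simulates an unreachable dependency (assumption for the demo).
    raise ConnectionError("recommendation service unreachable")


def watch_again_down() -> List[str]:
    # Simulates a dependency that times out (assumption for the demo).
    raise TimeoutError("watch-again service timed out")


cache: Dict[str, List[str]] = {"Watch again": ["Film A"]}
page = build_home_page(
    {
        "Continue watching": lambda: ["Series B"],
        "Recommended for you": recommendations_down,
        "Watch again": watch_again_down,
    },
    cache,
)
print(page)
```

In this run, the healthy feature is displayed normally, the feature with cached data is served from the cache, and the feature with neither a healthy dependency nor cached data is simply left out of the page. Catching every exception is a simplification; a real component would likely distinguish between error types and log the failure.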
A popular design pattern that makes use of this principle is the circuit breaker. Consider two components, a client and a server, decoupled through a circuit breaker component. The circuit breaker has two main states: closed and open. When the server is healthy and available, the circuit breaker is closed and traffic flows normally between the client and the server through the circuit breaker. If the server becomes unhealthy, the circuit breaker opens and the flow is interrupted, sparing the client a call to the server that would inevitably fail. Instead, the circuit breaker sends back a response on behalf of the server (depending on the use case, it could serve content from a cache or predefined data, for instance). The circuit breaker also typically implements a retry strategy, often modeled as a third, half-open state in which a limited number of trial requests are let through, so that it returns to the closed state once the server is available again.
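The pattern can be sketched in a few lines of Python. This is a simplified, single-threaded illustration, not a production implementation: the class name CircuitBreaker, the failure_threshold and reset_timeout parameters, and the injectable clock are all assumptions made for the example. The retry strategy shown here (let one trial call through once a timeout has elapsed) is one common choice among several.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Hypothetical minimal circuit breaker (not thread-safe)."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_timeout: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means closed

    @property
    def is_open(self) -> bool:
        return self._opened_at is not None

    def call(self, operation: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Open: answer on behalf of the server without calling it.
                return fallback()
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            failed_trial = self._opened_at is not None
            if failed_trial or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            return fallback()
        # Success: close the circuit and reset the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A client would wrap each call to the server, for example breaker.call(fetch_recommendations, serve_cached_recommendations), where both callables are hypothetical. Injecting the clock keeps the breaker easy to test without waiting for real time to pass.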