Communication between components is one of the aspects that most influences a distributed system. Quite a number of things can already go wrong in a simple client-server scenario; that number becomes mind-boggling in a large microservices architecture. For a detailed look at why this is the case, read the Challenges with Distributed Systems paper from the Amazon Builders’ Library, which is referenced in the Further Reading section of this chapter. In a nutshell, in a distributed system, you have to assume that every single line of code that triggers network communication may not work as expected. A simple client-server system can still be managed without too much trouble, but if your system is split into tens, hundreds, or more components distributed all over the place, a small glitch in one of these lines of code could cause a complete failure of the entire system. So, the more distributed your system is, the more complex failure handling becomes, which also explains why monolithic architectures are still around (ever wondered why some companies still have mainframes?).
So, how can you design your workload to keep it from being strongly impacted by network issues causing increased latency or even data loss?
Here are some best practices that you should follow.
One of the hardest things to handle with distributed systems is time constraints. If you’ve read the paper from the Amazon Builders’ Library that was pointed out earlier, you know that distributed systems with hard real-time constraints have extremely strict reliability requirements.
The following questions are therefore important to consider. Does your workload have hard real-time constraints? Does it require rapid, synchronous responses within seconds? Or does it have soft real-time constraints, for which responses are allowed within a broader time window, for instance, a few minutes or more? Or, on the other end of the spectrum, does it mostly work as an offline system that can handle responses through batch processing?
In any case, having a clear understanding of which type of distributed system you have to build is paramount.
In a distributed system, what you want to avoid at all costs is a component’s failure rippling through the other components until the entire system fails. So, rule #1 for the solutions architect when designing distributed systems is to loosely couple components that depend on each other.
Loose coupling brings more resiliency since a failure in a given component does not immediately impact other components that depend on it. It also brings more agility, as a positive side effect. When components are isolated through loose coupling, their life cycles become independent of each other. You can modify the code of one component, to bring improvements or fix issues, without touching the others. Then, if you respect the contract that has been defined through the interface of a component, changes in its implementation are less likely to affect other components. You may wonder why this is phrased as "less likely" rather than stated as a guarantee. Well, changing a component’s implementation may, for instance, make it perform faster or slower, and even though you didn’t change its interface, it could still impact other components depending on it. For the same reasons, loose coupling also extends the flexibility and scalability of the system. In a system, not all components perform equally well or need identical resources for processing. But, when components are loosely coupled, you have the opportunity to scale them independently of each other, based on their individual needs.
Figure 6.1: Tight coupling
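The idea of a contract defined through an interface can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `InventoryService`, `InMemoryInventory`, `CachedInventory`, and `can_fulfil` are hypothetical names invented for this example and do not correspond to any AWS API.

```python
from typing import Dict, Protocol


class InventoryService(Protocol):
    """The contract: the only thing callers are allowed to rely on."""

    def stock_level(self, sku: str) -> int: ...


class InMemoryInventory:
    """One hypothetical implementation, backed by a dictionary."""

    def __init__(self, stock: Dict[str, int]) -> None:
        self._stock = dict(stock)

    def stock_level(self, sku: str) -> int:
        return self._stock.get(sku, 0)


class CachedInventory:
    """A second hypothetical implementation that wraps another one and caches lookups."""

    def __init__(self, backend: InventoryService) -> None:
        self._backend = backend
        self._cache: Dict[str, int] = {}

    def stock_level(self, sku: str) -> int:
        if sku not in self._cache:
            self._cache[sku] = self._backend.stock_level(sku)
        return self._cache[sku]


def can_fulfil(inventory: InventoryService, sku: str, quantity: int) -> bool:
    # The caller depends on the contract, not on a concrete class.
    return inventory.stock_level(sku) >= quantity


plain = InMemoryInventory({"widget": 5})
cached = CachedInventory(plain)
results = [
    can_fulfil(plain, "widget", 3),
    can_fulfil(cached, "widget", 3),
    can_fulfil(cached, "gadget", 1),
]
```

Both implementations satisfy the same interface, so `can_fulfil` does not need to change when one is swapped for the other. Still, the cached variant answers repeated lookups faster and may return stale values, which is the kind of behavioral difference that can affect dependent components even when the interface stays the same.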
Now, you could think of implementing loose coupling in either synchronous or asynchronous mode. It may be a surprise to some, but loose coupling does not necessarily imply asynchronous communication, so don’t confuse the two.
Loosely coupling two components means that you add an intermediate processing step, acting as a decoupling mechanism between the two. But the end-to-end communication could still take place synchronously. As illustrated in the following diagram, think, for instance, of a load balancer, such as one provided by Elastic Load Balancing (ELB), between a client and a group of servers. A client making a request is loosely coupled to each server behind the load balancer, yet the communication happens synchronously.
Figure 6.2: Loose coupling (synchronous communication)
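To make the synchronous case concrete, here is a minimal in-process sketch. It assumes a simple round-robin policy and models servers as plain Python callables; `RoundRobinBalancer` and `make_server` are hypothetical names used for illustration only, not how ELB is implemented or configured.

```python
import itertools
from typing import Callable, Iterable

# A "server" is modeled as any callable that maps a request to a response.
Server = Callable[[str], str]


def make_server(name: str) -> Server:
    """Build a hypothetical backend that tags each response with its name."""
    def handle(request: str) -> str:
        return f"{name} handled {request}"
    return handle


class RoundRobinBalancer:
    """Hypothetical stand-in for a load balancer: the only endpoint clients know."""

    def __init__(self, servers: Iterable[Server]) -> None:
        self._servers = itertools.cycle(list(servers))

    def handle(self, request: str) -> str:
        # The call blocks until the chosen backend answers: still synchronous.
        server = next(self._servers)
        return server(request)


balancer = RoundRobinBalancer([make_server("server-a"), make_server("server-b")])
responses = [balancer.handle(f"request-{i}") for i in range(3)]
```

The client only ever calls `balancer.handle`, so backends can be added, removed, or replaced without the client noticing, yet every call still waits for a response before returning.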
That said, to further improve resiliency, it is advised, whenever possible, to use asynchronous communication between loosely coupled components. That’s the best approach whenever an immediate response, other than the acknowledgment of the request, is not required. Think, for instance, of a queueing mechanism, such as Amazon SQS, buffering the requests between two components, as illustrated in the following diagram. Similarly, you could use some other mechanism, such as a notification system (for instance, Amazon SNS), a workflow engine (such as AWS Step Functions), or a streaming system (such as Amazon Kinesis), to handle asynchronous communication between loosely coupled components.
Figure 6.3: Loose coupling (asynchronous communication)
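The buffering behavior of a queue can be sketched with Python's standard library. This is a simplified, in-process illustration: `queue.Queue` stands in for a managed queue such as Amazon SQS, and `submit`, `worker`, and the order identifiers are hypothetical names chosen for the example.

```python
import queue
import threading

# Hypothetical in-process buffer standing in for a managed queue.
request_queue = queue.Queue()
processed = []


def submit(order_id: str) -> str:
    """Producer: enqueue the request and return an acknowledgment right away."""
    request_queue.put(order_id)
    return f"accepted {order_id}"


def worker() -> None:
    """Consumer: drain the queue at its own pace until it sees the stop marker."""
    while True:
        order_id = request_queue.get()
        if order_id is None:  # None is used here as a stop marker
            break
        processed.append(f"processed {order_id}")


# The producer can submit requests even though no consumer is running yet.
acks = [submit(f"order-{i}") for i in range(3)]

# The consumer starts later and works through the buffered requests.
consumer = threading.Thread(target=worker)
consumer.start()
request_queue.put(None)
consumer.join()
```

Note that the producer receives its acknowledgments before the consumer has even started. That is the resiliency benefit in miniature: a slow or temporarily unavailable consumer delays processing but does not cause the producer to fail.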
In a similar fashion, you could also think of using an event-driven architecture, where event producers and event consumers are loosely coupled, and an event engine (such as Amazon EventBridge) plays the role of an intermediate system.
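As a rough sketch of that idea, the following in-process event bus routes events from producers to consumers by event type. `EventBus` and the event names are hypothetical, and the handlers here run synchronously inside `publish` for simplicity, whereas a managed event engine would typically deliver events asynchronously.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Event = Dict[str, str]
Handler = Callable[[Event], None]


class EventBus:
    """Hypothetical in-process intermediary between producers and consumers."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, event: Event) -> None:
        # The producer never references consumers; the bus does the routing.
        for handler in list(self._handlers.get(event_type, [])):
            handler(event)


bus = EventBus()
emails: List[str] = []
audit_log: List[str] = []

# Two independent consumers register interest in the same event type.
bus.subscribe("order.created", lambda e: emails.append(f"email for {e['order_id']}"))
bus.subscribe("order.created", lambda e: audit_log.append(f"audit {e['order_id']}"))

bus.publish("order.created", {"order_id": "42"})
bus.publish("order.cancelled", {"order_id": "43"})  # no subscribers: nothing happens
```

The producer only knows the event type it publishes. Consumers can be added or removed by changing subscriptions, without touching the producer's code.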