Nowadays, reactive is a buzzword—so exciting but so confusing. However, should we still care about reactivity even if it takes an honorable place in conferences around the world? If we google the word reactive, we will see that the most popular association is programming,
in which it defines the meaning of a programming model. However, that is not the only meaning for reactivity. Behind that word, there are hidden fundamental design principles aimed at building a robust system. To understand the value of reactivity as an essential design principle, let's imagine that we are developing a small business.
Suppose our small business is a web store with a few cutting-edge products at an attractive price. As is the case with the majority of projects in this sector, we will hire software engineers to solve any problems that we encounter. We opted for the traditional approaches to development, and, during a few development interactions, we created our store.
Usually, our service is visited by about one thousand users per hour. To serve the usual demand, we bought a modern computer and ran the Tomcat web server as well as configuring Tomcat's thread pool with 500 allocated threads. The average response time for the majority of user requests is about 250 milliseconds. By doing a naive calculation of the capacity for that configuration, we can be sure that the system can handle about 2,000 user requests per second. According to statistics, the number of users previously mentioned produced around 1,000 requests per second on average. Consequently, the current system's capacity will be enough for the average load.
To summarize, we configured our application with the margin regarding capacity. Moreover, our web store had been working stably until the last Friday in November, which is Black Friday.
Black Friday is a valuable day for both customers and retailers. For the customer, it is a chance to buy goods at discounted prices. And for retailers, it is a way to earn money and popularize products. However, this day is characterized by an unusual influx of clients, and that may be a significant cause of failure in production.
And, of course, we failed! At some point in time, the load exceeded all expectations. There were no vacant threads in the thread pool to process user requests. In turn, the backup server was not able to handle such an unpredictable invasion, and, in the end, this caused a rise in the response time and periodic service outage. At this point, we started losing some user requests, and, finally, our clients became dissatisfied and preferred dealing with competitors.
In the end, a lot of potential customers and money were lost, and the store's rating decreased. This was all a result of the fact that we couldn't stay responsive under the increased workload.
But, don't worry, this is nothing new. At one point in time, giants such as Amazon and Walmart also faced this problem and have since found a solution. Nevertheless, we will follow the same roads as our predecessors, gaining an understanding of the central principles of designing robust systems and then providing a general definition for them.
- Amazon.com hit with outages (https://www.cnet.com/news/amazon-com-hit-with-outages/)
- Amazon.com Goes Down, Loses $66,240 Per Minute (https://www.forbes.com/sites/kellyclay/2013/08/19/amazon-com-goes-down-loses-66240-per-minute/#3fd8db37495c)
- Walmart's Black Friday Disaster: Website Crippled, Violence In Stores (https://techcrunch.com/2011/11/25/walmart-black-friday/)
Now, the central question that should remain in our minds is—How should we be responsive? As we might now understand from the example given previously, an application should react to changes. This should include changes in demand (load) and changes in the availability of external services. In other words, it should be reactive to any changes that may affect the system's ability to respond to user requests.
One of the first ways to achieve the primary goal is through elasticity. This describes the ability to stay responsive under a varying workload, meaning that the throughput of the system should increase automatically when more users start using it and it should decrease automatically when the demand goes down. From the application perspective, this feature enables system responsiveness because at any point in time the system can be expanded without affecting the average latency.
For example, by providing additional computation resources or additional instances, the throughput of our system might be increased. The responsiveness will then increase as a consequence. On the other hand, if demand is low, the system should shrink in terms of resource consumption, thereby reducing business expenses. We may achieve elasticity by employing scalability, which might either be horizontal or vertical. However, achieving scalability of the distributed system is a challenge that is typically limited by the introduction of bottlenecks or synchronization points within the system. From the theoretical and practical perspectives, such problems are explained by Amdahl's Law and Gunther's Universal Scalability Model. We will discuss these in Chapter 6, WebFlux Async Non-Blocking Communication.
However, building a scalable distributed system without the ability to stay responsive regardless of failures is a challenge. Let's think about a situation in which one part of our system is unavailable. Here, an external payment service goes down, and all user attempts to pay for the goods will fail. This is something that breaks the responsiveness of the system, which may be unacceptable in some cases. For example, if users cannot proceed with their purchases easily, they will probably go to a competitor's web store. To deliver a high-quality user experience, we must care about the system's responsiveness. The acceptance criteria for the system are the ability to stay responsive under failures, or, in other words, to be resilient. This may be achieved by applying isolation between functional components of the system, thereby isolating all internal failures and enabling independence. Let's switch back to the Amazon web store. Amazon has many different functional components such as the order list, payment service, advertising service, comment service, and many others. For example, in the case of a payment service outage, we may accept user orders and then schedule a request auto-retry, thereby protecting the user from undesired failures. Another example might be isolation from the comments service. If the comments service goes down, the purchasing and orders list services should not be affected and should work without any problems.
Another point to emphasize is that elasticity and resilience are tightly coupled, and we achieve a truly responsive system only by enabling both. With scalability, we can have multiple replicas of the component so that, if one fails, we can detect this, minimize its impact on the rest of the system, and switch to another replica.