An Internet-enabled world
We live in interesting times; the last decade has transformed the way we experience the Internet and the ecosystem around it. In this chapter, we will focus on some of the reasons behind this progress and discuss the developments happening in the world of data storage.
The following figure is a rough sketch of the evolution that has taken place in cyberspace, based on data collected from the Internet, and gives an idea of the growth experienced by Internet-based services:
The preceding chart indicates that the hardware industry saw a paradigm shift around the middle of the first decade of the 2000s. Instead of new processors coming out with ever-higher clock speeds, the newer generations of processors came with multiple cores, and the core count increased with each subsequent release. Gone were the days when a big machine with lots of memory and a powerful processor could solve any problem or, in other words, when an Enterprise depended on vertical scaling to solve its performance issues. What this signaled, in a way, was that parallel computing was the future, and that it would be deployed on commodity machines.
With the hardware industry signaling the arrival of parallel computing, the newer generation of solutions had to be distributed and parallel in nature. This meant that they needed to have logic executed in parallel and data stored in distributed datastores; in other words, horizontal scaling was the way to go. Moreover, with Web 2.0, there was an emergence of social media, online gaming, online shopping, collaborative computing, cloud computing, and so on. The Internet was becoming a ubiquitous platform.
The popularity of the Internet and the number of people using it were increasing by the day, as was the amount of time spent on it. Another important aspect was that users across geographies were coming together in this Internet-enabled world. There are many reasons for this; for one, websites were becoming intelligent and, in a way, were engaging end users far more effectively than their predecessors. Another factor that made Internet adoption faster and easier was the arrival of innovative handheld devices, such as smartphones and tablets. Nowadays, the compute power of these handheld devices rivals that of desktop computers. In this dynamically changing world, Internet-based software solutions and services are expanding the horizons of social media, which brings people together on a common platform. This created a new business domain, social-Enterprise media, where social media bridges with the Enterprise. This was bound to have an impact on traditional Enterprise solutions.
The Internet effect made Enterprise solutions undergo a metamorphic shift. Enterprise architecture moved from the nuanced set of requirements typically expected of Enterprise solutions to newer requirements that had been the bastion of social media solutions. Nowadays, Enterprise solutions integrate with social media sites to learn what their customers are talking about; they have themselves started creating platforms and forums where customers can contribute their impressions of products and services. All this data exchange happens in real time and needs a highly concurrent and scalable ecosystem. To sum it up, Enterprise solutions want to adopt the features of social media solutions, and this has a direct and proportional bearing on the nonfunctional requirements of their architectures. Features such as fault management, real-time big data crunching, eventual consistency, high volumes of reads and writes, responsiveness, horizontal scalability, manageability, maintainability, agility, and so on, and their impact on Enterprise architecture, are being looked at with renewed interest. Techniques, paradigms, frameworks, and patterns that were used in social media architecture are being studied and reapplied in Enterprise architecture.
One of the key layers in any solution (social media or Enterprise) is the data layer. Data, the way it is arranged and managed, and the choice of datastore form the data layer. From a designer's perspective, data handling in any datastore is governed by three perspectives: consistency, availability, and partition tolerance, better known together as Eric Brewer's CAP theorem. While it is desirable to have all three, in reality, any data layer can guarantee only two of them at a time. What this means is that the data in a solution can combine these perspectives in different ways: availability-partition tolerance (this combination has to forego consistency in data handling), availability-consistency (this combination has to forego partition tolerance, which limits the amount of data the data layer can handle), and consistency-partition tolerance (this combination has to forego availability).
The CAP theorem has a direct bearing on the behavior of the system, read/write speeds, concurrency, maintainability, clustering patterns, fault tolerance, data loads, and so on.
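To make this trade-off concrete, many distributed datastores expose it as tunable read and write quorums. The following is a minimal sketch in Python (the function and parameter names are illustrative, not tied to any particular product): with N replicas, a read touching R of them and a write touching W of them are guaranteed to overlap, and hence to observe the latest value, only when R + W > N; relaxing that condition trades consistency for availability.

```python
def is_strongly_consistent(n_replicas: int, read_quorum: int, write_quorum: int) -> bool:
    """A read is guaranteed to see the latest write only when the read set
    and write set of replicas must overlap, that is, when R + W > N."""
    return read_quorum + write_quorum > n_replicas

# Consistency-leaning configuration: every read overlaps every write.
print(is_strongly_consistent(n_replicas=3, read_quorum=2, write_quorum=2))  # True

# Availability-leaning configuration: reads and writes each touch a single
# replica, so a read may miss the latest write (eventual consistency).
print(is_strongly_consistent(n_replicas=3, read_quorum=1, write_quorum=1))  # False
```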
The most common approach when designing a data model is to arrange it in a relational, normalized way. This works well when the data is transactional, needs consistency, and is structured, that is, it has a fixed schema. Normalizing the data appears over-engineered when the data is semistructured, has a tree-like structure, or is schema-less, and when consistency can be relaxed. The end result of making semistructured data fit into a structured data model is an explosion of tables and a complicated data model just to store simple data.
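As a rough illustration, consider a single semistructured product record. Stored natively, it is one document; forced into a normalized relational model, it splits across several tables. The following Python sketch shows this with hypothetical field and table names:

```python
# One semistructured product record, stored naturally as a single document.
product = {
    "id": 101,
    "name": "Espresso machine",
    "specs": {"pressure_bar": 15, "capacity_l": 1.8},  # attributes vary per product
    "tags": ["kitchen", "coffee"],
    "reviews": [
        {"user": "alice", "rating": 5},
        {"user": "bob", "rating": 4},
    ],
}

# The same record normalized into a fixed relational schema needs one table
# per nested collection, tied together by foreign keys (names illustrative):
products        = [(101, "Espresso machine")]
product_specs   = [(101, "pressure_bar", "15"), (101, "capacity_l", "1.8")]
product_tags    = [(101, "kitchen"), (101, "coffee")]
product_reviews = [(101, "alice", 5), (101, "bob", 4)]
```

One document has become four tables, and every read of the full product now requires joins across all of them.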
Due to the lack of alternatives, solutions have been overly reliant on the RDBMS to address concerns regarding data handling. The problem with this approach is that the RDBMS, which was primarily designed to address the consistency and availability perspectives of data handling, also started storing data whose primary concern was partition tolerance. The end result was a bloated RDBMS with a very complex data model. This started to negatively impact the nonfunctional requirements of a solution in the areas of fault management, performance, scalability, manageability, maintainability, and agility.
Another area of concern was data interpretation, which is very important while designing the data layer. In a solution, the same data is viewed and interpreted differently by different concerned groups. To give a better idea, let's say that we have an e-commerce website that sells products. Three basic functional domains come into play in the design of its data layer: inventory management, account management, and customer management. From a core business standpoint, all the domains need atomicity, consistency, isolation, and durability (ACID) properties in their data management, and from the CAP theorem point of view, they need consistency and availability. However, if the website needs to understand its customers in real time, an analytics team needs to analyze data from the inventory management, account management, and customer management domains, apart from other data it might collect separately in real time. The way the analytics team views the same data is totally different from the way the other teams view it; for them, consistency is less of a concern, as they are more interested in overall statistics, and a little inconsistent data will have no impact on the overall report. If all the data required for analytics from these domains is kept in the same data model as that for the core business, analytics will run into difficulty, because it now has to work with data that is highly normalized and optimized for business operations. The analytics team would also like to have its data denormalized for faster analysis.
Now, running real-time analytics on this normalized data in an RDBMS will require heavy compute resources, which will impact the performance of the core business during business hours. So, it is better for the overall business if separate data models are created for these domains, one for business and one for analytics, each maintained separately as they have separate concerns. We will see in subsequent topics why the RDBMS is not the right fit for analytics and some other use cases, and how NoSQL solves this problem of data explosion.
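As a minimal sketch of this separation, assuming hypothetical in-memory rows standing in for the business domains, the business side keeps its normalized rows while a periodic job joins and flattens them into wide, denormalized records for a separate analytics store:

```python
# Hypothetical normalized rows from the business data model.
customers = {1: {"name": "alice", "region": "EU"}}
accounts = {1: {"status": "active"}}  # keyed by customer_id for simplicity
orders = [{"customer_id": 1, "product": "espresso machine", "amount": 120.0}]

def build_analytics_records():
    """Join and flatten the normalized business rows into wide, denormalized
    records that can be kept and scanned in a separate analytics store."""
    records = []
    for order in orders:
        cid = order["customer_id"]
        records.append({
            "customer_name": customers[cid]["name"],
            "region": customers[cid]["region"],
            "account_status": accounts[cid]["status"],
            "product": order["product"],
            "amount": order["amount"],
        })
    return records

print(build_analytics_records())
```

Because the analytics store is refreshed asynchronously, it can tolerate slightly stale data, which is exactly the relaxed-consistency posture described earlier, and the heavy scans it serves never touch the transactional schema.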