Understanding the nature of big data
In the 2021 Enterprise Trends in Machine Learning survey of 403 business leaders conducted by Algorithmia, 76% of enterprises prioritized AI and ML over other IT initiatives. The COVID-19 pandemic pushed some of those companies to accelerate their AI and ML development, as their chief information officers (CIOs) recounted, and 83% of the surveyed organizations increased their AI and ML budgets year-over-year (YoY), with a quarter of them doing so by over 50%. Improving the customer experience and automating processes, whether through increased revenue or reduced costs, were the main drivers of the change. Other studies, including KPMG’s latest report, Thriving in an AI World, essentially tell the same story.
The ongoing spree of AI and ML development, epitomized by deep learning (DL), was made possible by the advent of big data in the last decade. With Apache’s open source frameworks Hadoop and Spark, as well as cloud computing services such as Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure, organizations in both the private and public sectors can solve problems by handling massive amounts of data in ways that were previously unthinkable. Companies and government bureaus no longer have to be overcautious about developing data analytics models and designing data warehouses upfront so that relevant data is stored in appropriate formats. Instead, they can simply pour available raw data into their data lake, expecting that their data scientists will identify valuable variables down the line by checking their correlations with one another.
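To make that exploratory workflow concrete, here is a minimal sketch using pandas and NumPy. The file path and the idea of ranking variable pairs by absolute correlation are illustrative assumptions, not a prescribed pipeline:

```python
import numpy as np
import pandas as pd

# Raw, loosely structured export pulled from a data lake (hypothetical path)
df = pd.read_parquet("s3://example-data-lake/raw/events.parquet")

# Keep only the numeric columns and compute pairwise Pearson correlations
numeric = df.select_dtypes(include="number")
corr = numeric.corr()

# Rank variable pairs by absolute correlation to surface candidates worth modeling
upper = np.triu(np.ones(corr.shape, dtype=bool))   # upper triangle incl. diagonal
candidates = corr.mask(upper).abs().unstack().dropna().sort_values(ascending=False)
print(candidates.head(10))
```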
Big data might seem to be the ultimate solution to a wide range of problems, but as we will see in the following sections, it has several inherent issues. To clearly understand what the issues with big data could be, let’s examine what exactly big data is first.
Definition of big data
Big data refers to vast sets of information that are now growing at an exponential rate. With humans producing some two quintillion bytes of data daily, it has become quite difficult to process big data efficiently for ML purposes with traditional data management tools. Three Vs are commonly used to define the characteristics of big data, as presented here:
- Volume: Data from various sources such as business transactions, Internet of Things (IoT) devices, social media, industrial equipment, videos, and so on contributes to the sheer amount of data.
- Velocity: Data speed is also an essential characteristic of big data. Often, data is needed in real time or near real time.
- Variety: Data comes in all formats, such as numeric data, text documents, images, videos, emails, audio, financial transactions, and so on.
The following diagram depicts the intersection of the three Vs as big data:
Figure 1.1 – Big data’s three Vs
In 1880, the United States (US) Census Bureau gathered so much data from the census that it estimated processing it would take 8 years. The following year, a man named Herman Hollerith invented the Hollerith tabulating machine, which greatly reduced the work needed to process the data. The first data center was built in 1965 to store fingerprint data and tax information.
Big data now
The introduction of data lakes as a concept played a key role in ushering in the massive scales we see when working with data today. Data lakes give companies total freedom to store arbitrary types of data observed during operation, removing a restriction that would otherwise have prevented them from collecting data that turns out to be necessary in the future. While this freedom allows data lakes to preserve the maximum potential of the data a company generates, it can also lead to a key problem: complacency in understanding the collected data. The ease of storing different types of data in an unstructured manner encourages a store now, sort out later mentality. The true difficulty of working with unstructured data lies in processing it; the delayed-processing mentality can therefore lead to data lakes that, through unrestricted growth, become highly cumbersome to sift through and work with.
Raw data is only as valuable as the models and insights that can be derived from it. The central data lake approach leads to cases where derivation from the data is limited by a lack of structure, causing issues ranging from storage inefficiency to outright intelligence inefficiency due to the difficulty of extraction. On the other hand, approaches preceding data lakes suffered from a simple lack of access to the amount of data potentially available. The fact that federated learning (FL) allows both classes of problems to be avoided is the key argument for FL as the vehicle that will advance big data into the collective intelligence era.
This claim is substantiated by the fact that FL flips the big data flow from collect → derive intelligence to derive intelligence → collect. For humans, intelligence can be thought of as the condensed form of large swaths of experience. In a similar way, deriving intelligence at the source of the generated data, by training a model on the data at the source location, succinctly summarizes the data in a format that maximizes accessibility for practical applications. The late collection step of FL leads to the creation of the desired global intelligence with maximal data access and data storage efficiency. Even cases where only part of the generated data sources is used can still benefit greatly from the joint storage of intelligence and data, since it sharply reduces the number of data formats entering the residual data lake.
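The following toy sketch illustrates the reversed flow on synthetic data, assuming a FedAvg-style, sample-size-weighted average of locally fitted linear models as the aggregation rule; it is a simplified illustration of the idea, not an implementation presented elsewhere in this book:

```python
import numpy as np

rng = np.random.default_rng(0)

def local_fit(X, y):
    """Derive intelligence at the source: fit a linear model on local data only."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w, len(y)

# Three hypothetical data sources; the raw rows never leave their origin
true_w = np.array([2.0, -1.0, 0.5])
sources = []
for _ in range(3):
    X = rng.normal(size=(int(rng.integers(200, 500)), 3))
    y = X @ true_w + rng.normal(scale=0.1, size=len(X))
    sources.append((X, y))

# "Derive intelligence" locally, then "collect" only the models and sample counts
local_models = [local_fit(X, y) for X, y in sources]

# FedAvg-style aggregation: sample-size-weighted average of the local weights
counts = np.array([n for _, n in local_models])
global_w = np.average([w for w, _ in local_models], axis=0, weights=counts)
print("aggregated model:", global_w)
```

Only the three weight vectors and their sample counts cross the network; the raw observations stay where they were generated.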
Triple-A mindset for big data
While many definitions have been proposed with emphasis on different aspects, Oxford professor Viktor Mayer-Schönberger and The Economist senior editor Kenneth Cukier brilliantly elucidated the nature of big data in their 2013 international bestseller, Big Data: A Revolution That Will Transform How We Live, Work, and Think. It is not about how big the data in a server is; big data is about three major shifts in mindset that are interlinked and hence reinforce one another. Their argument boils down to what we can summarize and call the Triple-A mindset for big data, which consists of an abundance of observations, acceptance of messiness, and ambivalence of causality. Let’s take a look at them one by one.
Abundance of observations
Big data doesn’t have to be big in terms of columns and rows or file size. Big data has a number of observations, commonly denoted as n, close or equal to the size of the population of interest. In traditional statistics, collecting data from the entire population (for example, people interested in fitness in New York) was rarely feasible, and researchers would have to randomly select a sample from the population (for example, 1,000 people interested in fitness in New York). Random sampling is often difficult to perform, and so is justifying a narrow focus on particular subgroups: surveying people around gyms would miss those who run in parks or practice yoga at home, and why pick gym goers rather than runners and yoga fans? Thanks to the development and sophistication of Information and Communications Technology (ICT) systems, however, researchers today can access data on approximately all of the population through multiple sources, such as records of Google searches about fitness. This paradigm of abundance, or n = all, is advantageous since what the data says can be interpreted as a true statement about the population, whereas the old method could only infer such a truth at a certain level of confidence, expressed through a p-value typically required to be under 0.05. Small data provides statistics; big data proves states.
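The contrast can be seen in a toy simulation with made-up numbers: under n = all, the quantity of interest is simply computed, whereas a 1,000-person sample only supports an interval estimate:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical population: True marks a person interested in fitness (true share: 32%)
population = rng.random(1_000_000) < 0.32

# n = all: the share of interest is computed directly, not inferred
print("population share:", population.mean())

# Traditional sampling: a 1,000-person random sample gives an estimate with uncertainty
sample = rng.choice(population, size=1_000, replace=False)
p_hat = sample.mean()
se = np.sqrt(p_hat * (1 - p_hat) / len(sample))   # standard error of a proportion
print(f"sample estimate: {p_hat:.3f} +/- {1.96 * se:.3f} (95% confidence interval)")
```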
Acceptance of messiness
Big data tends to be messy. If we use Google search data as a proxy for someone’s interests, for example, we could mistakenly attribute to that person some of the searches made by their family or friends on their device, and the estimated interest will be inaccurate in proportion to the share of such non-owner searches. On some devices, a significant number of searches may be made by multiple users, such as shared computers at an office or a smartphone belonging to a child whose younger siblings do not yet own one. People may also search for words that pop up in a conversation with someone else rather than out of their own curiosity, which does not necessarily reflect their interests. In studies using traditional methods, researchers would have to make sure that such devices are not included in their sample data, because with a small number of observations the mess can significantly affect the quality of inference. This is not the case in big data studies. Researchers are willing to accept the mess because its relative effect diminishes as the number of observations grows toward n = all. On most devices, Google searches are made by the owner most of the time, and the impact of searches made in other contexts hardly matters.
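A small simulation with hypothetical rates illustrates the point: even when a tenth of the recorded searches come from other people, the estimate obtained from a large number of observations settles close to the owner’s true interest level, while a small sample swings widely:

```python
import numpy as np

rng = np.random.default_rng(7)

def estimated_interest(n, mess_rate=0.10):
    """Share of fitness-related searches when a fraction of searches is 'mess'
    (made by other people on the owner's device). All rates are hypothetical."""
    owner = rng.random(n) < 0.40          # the owner searches fitness 40% of the time
    others = rng.random(n) < 0.05         # other users almost never do
    is_mess = rng.random(n) < mess_rate   # 10% of searches are not the owner's
    observed = np.where(is_mess, others, owner)
    return observed.mean()

for n in (50, 1_000, 100_000):
    print(f"n = {n:>7}: estimated interest = {estimated_interest(n):.3f}")
# With small n, the estimate swings widely; with large n, it settles near the
# owner's true rate (0.40) with only a small, predictable distortion.
```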
Ambivalence of causality
Big data is often used to study correlation rather than causation; in other words, it usually tells us not why but only what. For many practical questions, correlation alone can provide the answer. Mayer-Schönberger and Cukier give several examples in Big Data: A Revolution That Will Transform How We Live, Work, and Think, among which is Fair Isaac Corporation’s Medication Adherence Score, established in 2011. In an era where people’s behavioral patterns are datafied, collecting n = all observations for the variables of interest is possible, and the correlation found among them is powerful enough to direct our decision-making. There is no need to know the psychological scores of consistency or conformity that cause people’s adherence to medical prescriptions; by looking at how they behave in other aspects of life, we can predict whether or not they will follow a prescription.
By embracing the triple mindset of abundance, acceptance, and ambivalence, enterprises and governments have generated intelligence across tasks ranging from pricing services and recommending products to optimizing transportation routes and identifying crime suspects. Nevertheless, that mindset has been challenged in recent years, as shown in the following sections. First, let’s take a glimpse at how the abundance of observations, so often taken for granted, is currently under pressure.