This book introduces the reader to a broad spectrum of topics related to big data as used in the enterprise. Big data is a vast area that encompasses elements of technology, statistics, visualization, business intelligence, and many other related disciplines. To get true value from data that often remains inaccessible, whether because of its volume or because of technical limitations, companies must use the proper tools at both the software and the hardware level.
To that end, the book covers not only the theoretical and practical aspects of big data, but also higher-level topics such as the use of big data in the enterprise, big data and data science initiatives, and key considerations such as resources and the hardware/software stack. These discussions should be useful for IT departments in organizations that plan to implement or upgrade their big data and/or data science platform.
The book focuses on three primary areas:
1. Data mining on large-scale datasets
Big data is ubiquitous today, just as the term data warehouse was omnipresent not too long ago. There are a myriad of solutions in the industry. In particular, Hadoop and products in the Hadoop ecosystem have become both popular and increasingly common in the enterprise. More recent innovations such as Apache Spark have also found a permanent presence in the enterprise: Hadoop users, realizing that they may not need the complexity of the Hadoop framework, have shifted to Spark in large numbers. Finally, NoSQL solutions such as MongoDB, Redis, and Cassandra, and commercial solutions such as Teradata, Vertica, and kdb+, have in many cases taken the place of more conventional database systems.
This book covers these areas in a fair degree of depth. It covers Hadoop and related technologies such as Hive, HBase, and Pig Latin, among others. It also covers Spark and explains key Spark concepts such as actions and transformations. NoSQL solutions such as MongoDB and kdb+ are covered to a fair extent as well, with hands-on tutorials provided.
2. Machine learning and predictive analytics
The second topic covered is machine learning, which is also known by other names such as predictive analytics and statistical learning. The book provides detailed explanations along with corresponding machine learning code written in R using R's machine learning packages. It discusses algorithms such as random forest, support vector machines, neural networks, stochastic gradient boosting, and decision trees. It also covers key machine learning concepts such as bias and variance, regularization, feature selection, and data pre-processing.
3. Data mining in the enterprise
In general, books that cover theoretical topics seldom discuss the higher-level aspects of big data, such as the key requirements for a successful big data initiative. This book includes survey results from IT executives and highlights the needs that are shared across the industry. It also includes a step-by-step guide to selecting the right use cases, whether for big data or for machine learning, based on lessons learned from deploying production solutions in large IT departments.
We believe that with a strong foundational knowledge of these three areas, any practitioner can deliver successful big data and/or data science projects. That is the primary intention behind the overall structure and content of the book.