Data is the new silicon of our age, and machine learning, coupled with biologically inspired cognitive systems, serves as the core foundation that not only enables but also accelerates the birth of the fourth industrial revolution. This book is dedicated to our parents, who, through extreme hardship and sacrifice, made our education possible and taught us to always practice kindness.
The Apache Spark 2.x Machine Learning Cookbook is crafted by four friends with diverse backgrounds, who bring vast experience across multiple industries and academic disciplines. The team has immense experience in the subject matter at hand. The book is as much about friendship as it is about the science underpinning Spark and machine learning. We wanted to put our thoughts together and write a book for the community that not only combines Spark’s ML code with real-world data sets but also provides context-relevant explanations, references, and readings that deepen understanding and promote further research. This book is a reflection of what our team wished we had when we got started with Apache Spark.
My own interest in machine learning and artificial intelligence started in the mid-eighties, when I had the opportunity to read two significant artifacts that happened to be listed back to back in Artificial Intelligence, An International Journal, Volume 28, Number 1, February 1986. While it has been a long journey for engineers and scientists of my generation, the advancements in resilient distributed computing, cloud computing, GPUs, cognitive computing, optimization, and advanced machine learning have fortunately made a decades-long dream come true. All these advancements are now accessible to the current generation of ML enthusiasts and data scientists alike.
We live in one of the rarest periods in history: a time when multiple technological and sociological trends have converged. The elasticity of cloud computing, with built-in access to ML and deep learning nets, will provide a whole new set of opportunities to create and capture new markets. The emergence of Apache Spark as the lingua franca, or common language, of near real-time resilient distributed computing and data virtualization has given smart companies the opportunity to employ ML techniques at scale without a heavy investment in specialized data centers or hardware.
The Apache Spark 2.x Machine Learning Cookbook is one of the most comprehensive treatments of the Apache Spark machine learning API, together with selected subcomponents of Spark, and it gives you the foundation you need before you pursue a high-end career in machine learning and Apache Spark. The book is written with the goal of providing clarity and accessibility, and it reflects our own experience (including reading the source code) and learning curve with Apache Spark, which started with Spark 1.0.
The Apache Spark 2.x Machine Learning Cookbook lives at the intersection of Apache Spark, machine learning, and Scala. It is written for developers and data scientists, through the lens of a practitioner who has to understand not only the code but also the details, theory, and inner workings of a given Spark ML algorithm or API to establish a successful career in the new economy.
The book takes the cookbook format to a whole new level by blending downloadable, ready-to-run Apache Spark ML code recipes with background, actionable theory, references, research, and real-life data sets to help the reader understand the what, the how, and the why behind the extensive facilities that Spark offers in its machine learning library. The book starts by laying the foundations needed to succeed and then quickly moves on to cover all the meaningful ML algorithms available in Apache Spark.