Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Arrow up icon
GO TO TOP
Practical Automated Machine Learning Using H2O.ai

You're reading from   Practical Automated Machine Learning Using H2O.ai Discover the power of automated machine learning, from experimentation through to deployment to production

Arrow left icon
Product type Paperback
Published in Sep 2022
Publisher Packt
ISBN-13 9781801074520
Length 396 pages
Edition 1st Edition
Tools
Arrow right icon
Author (1):
Arrow left icon
Salil Ajgaonkar Salil Ajgaonkar
Author Profile Icon Salil Ajgaonkar
Salil Ajgaonkar
Arrow right icon
View More author details
Toc

Table of Contents (19) Chapters Close

Preface 1. Part 1 H2O AutoML Basics
2. Chapter 1: Understanding H2O AutoML Basics FREE CHAPTER 3. Chapter 2: Working with H2O Flow (H2O’s Web UI) 4. Part 2 H2O AutoML Deep Dive
5. Chapter 3: Understanding Data Processing 6. Chapter 4: Understanding H2O AutoML Architecture and Training 7. Chapter 5: Understanding AutoML Algorithms 8. Chapter 6: Understanding H2O AutoML Leaderboard and Other Performance Metrics 9. Chapter 7: Working with Model Explainability 10. Part 3 H2O AutoML Advanced Implementation and Productization
11. Chapter 8: Exploring Optional Parameters for H2O AutoML 12. Chapter 9: Exploring Miscellaneous Features in H2O AutoML 13. Chapter 10: Working with Plain Old Java Objects (POJOs) 14. Chapter 11: Working with Model Object, Optimized (MOJO) 15. Chapter 12: Working with H2O AutoML and Apache Spark 16. Chapter 13: Using H2O AutoML with Other Technologies 17. Index 18. Other Books You May Enjoy

Understanding AutoML and H2O AutoML

Before we begin our journey with H2O AutoML, it is important to understand what exactly AutoML is and what part it plays in the entire ML pipeline. In this section, we will try to understand the various steps involved in the ML pipeline and where AutoML fits into it. Then, we will explore what makes H2O’s AutoML so unique among the various AutoML technologies.

Let’s start by learning a bit about AutoML in general.

AutoML

AutoML is the process of automating the various steps that are performed while developing a viable ML system for predictions. A typical ML pipeline consists of the following steps:

  1. Data Collection: This is the very first step in an ML pipeline. Data is collected from various sources. The sources can generate different types of data, such as categorical, numeric, textual, time series, or even visual and auditory data. All these types of data are aggregated together based on the requirements and are merged into a common structure. This could be a comma-separated value file, a parquet file, or even a table from a database.
  2. Data Exploration: Once data has been collected, it is explored using basic analytical techniques to identify what it contains, the completeness and correctness of the data, and if the data shows potential patterns that can build a model.
  3. Data Preparation: Missing values, duplicates, and noisy data can all affect the quality of the model as they introduce incorrect learning. Hence, the raw data that is collected and explored needs to be pre-processed to remove all anomalies using specific data processing methods.
  4. Data Transformation: A lot of ML models work with different types of data. Some can work with categorical data, while some can only work with numeric data. That is why you may need to convert certain types of data from one form into the other. This allows the dataset to be fed properly during model training.
  5. Model Selection: Once the dataset is ready, an ML model is selected to be trained. The model is chosen based on what type of data the dataset contains, what information is to be extracted from the dataset, as well as which model suits the data.
  6. Model Training: This is where the model is trained. The ML system will learn from the processed dataset and create a model. This training can be influenced by several factors, such as data attribute weighting, learning rate, and other hyperparameters.
  7. Hyperparameter Tuning: Apart from model training, another factor that needs to be considered is the model’s architecture. The model’s architecture depends on the type of algorithm used, such as the number of trees in a random forest or neurons in a neural network. We don’t immediately know which architecture is optimal for a given model, so experimentation is needed. The parameters that define the architecture of a model are called hyperparameters; finding the best combination of hyperparameter values is known as hyperparameter tuning.
  8. Prediction: The final step of the ML pipeline is prediction. Based on the patterns in the dataset that were learned by the model during training, the model can now make a generalized prediction on unseen data.

For non-experts, all these steps and their complexities can be overwhelming. Every step in the ML pipeline process has been developed over years of research and there are vast topics within themselves. AutoML is the process that automates the majority of these steps, from data exploration to hyperparameter tuning, and provides the best possible models to make predictions on. This helps companies focus on solving real-world problems with results rather than ML processes and workflows.

Now that you understand the different steps in an ML pipeline and how the steps are automated by AutoML, let’s see why H2O’s AutoML technology is one of the leading technologies in the industry.

H2O AutoML

H2O AutoML is an AutoML software technology developed by H2O.ai that simplifies how ML systems are developed by providing user-friendly interfaces that help non-experts experiment with ML. It is an in-memory, distributed, fast, and scalable ML and analytics platform that works on big data and can be used for enterprise needs.

It is written in Java and uses key-value storage to access data, models, and other ML objects that are involved. It runs on a cluster system and uses the multi-threaded MapReduce framework to parallelize data operations. It is also easy to communicate with it as it uses simple REST APIs. Finally, it has a web interface that provides a detailed graphical view of data and model details.

Not only does H2O AutoML automate the majority of the sophisticated steps involved in the ML life cycle, but it also provides a lot of flexibility for even expert data scientists to implement specialized model training processes. H2O AutoML provides a simple wrapper function that encapsulates several of the model training tasks that would otherwise be complicated to orchestrate. It also has extensive explainability functions that can describe the various details of the model training life cycle. This provides easy-to-export details of the models that users can use to explain the performance and justifications of the models that have been trained.

The best part about H2O AutoML is that it is entirely open source. You can find H2O’s source code at https://github.com/h2oai. It is actively maintained by a community of developers serving in both open as well as closed sources companies. At the time of writing, it is on its third major version, which indicates that it is quite a mature technology and is feature-intensive – that is, it supports several major companies in the world. It also supports several programming languages, including R, Scala, Python, and Java, that can run on several operating systems and provides support for a wide variety of data sources that are involved in the ML life cycle, such as Hadoop Distributed File System, Hive, Amazon S3, and even Java Database Connectivity (JDBC).

Now that you understand the basics of AutoML and how powerful H2O AutoML is, let’s see what the minimum requirements are for a system to run H2O AutoML without any performance issues.

You have been reading a chapter from
Practical Automated Machine Learning Using H2O.ai
Published in: Sep 2022
Publisher: Packt
ISBN-13: 9781801074520
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $19.99/month. Cancel anytime
Banner background image