Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Practical Automated Machine Learning Using H2O.ai
Practical Automated Machine Learning Using H2O.ai

Practical Automated Machine Learning Using H2O.ai: Discover the power of automated machine learning, from experimentation through to deployment to production

Arrow left icon
Profile Icon Salil Ajgaonkar
Arrow right icon
$19.99 per month
Full star icon Full star icon Full star icon Full star icon Half star icon 4.6 (5 Ratings)
Paperback Sep 2022 396 pages 1st Edition
eBook
$22.99 $33.99
Paperback
$41.99
Subscription
Free Trial
Renews at $19.99p/m
Arrow left icon
Profile Icon Salil Ajgaonkar
Arrow right icon
$19.99 per month
Full star icon Full star icon Full star icon Full star icon Half star icon 4.6 (5 Ratings)
Paperback Sep 2022 396 pages 1st Edition
eBook
$22.99 $33.99
Paperback
$41.99
Subscription
Free Trial
Renews at $19.99p/m
eBook
$22.99 $33.99
Paperback
$41.99
Subscription
Free Trial
Renews at $19.99p/m

What do you get with a Packt Subscription?

Free for first 7 days. $19.99 p/m after that. Cancel any time!
Product feature icon Unlimited ad-free access to the largest independent learning library in tech. Access this title and thousands more!
Product feature icon 50+ new titles added per month, including many first-to-market concepts and exclusive early access to books as they are being written.
Product feature icon Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.
Product feature icon Thousands of reference materials covering every tech concept you need to stay up to date.
Subscribe now
View plans & pricing
Table of content icon View table of contents Preview book icon Preview Book

Practical Automated Machine Learning Using H2O.ai

Understanding H2O AutoML Basics

Machine Learning (ML) is the process of building analytical or statistical models using computer systems that learn from historical data and identify patterns in them. These systems then use these patterns and try to make predictive decisions that can provide value to businesses and research alike. However, the sophisticated mathematical knowledge required to implement an ML system that can provide any concrete value has discouraged several people from experimenting with it, leaving tons of undiscovered potential that they could have benefited from.

Automated Machine Learning (AutoML) is one of the latest ML technologies that has accelerated the adoption of ML by organizations of all sizes. It is the process of automating all these complex tasks involved in the ML life cycle. AutoML hides away all these complexities and automates them behind the scenes. This allows anyone to easily implement ML without any hassle and focus more on the results.

In this chapter, we will learn about one such AutoML technology by H2O.ai (https://www.h2o.ai/), which is simply named H2O AutoML. We will provide a brief history of AutoML in general and what problems it solves, as well as a bit about H2O.ai and its H2O AutoML technology. Then, we will code a simple ML implementation using H2O’s AutoML technology and build our first ML model.

By the end of this chapter, you will understand what exactly AutoML is, the company H2O.ai, and its technology H2O AutoML. You will also understand what minimum requirements are needed to use H2O AutoML, as well as how easy it is to train your very first ML model using H2O AutoML without having to understand any complex mathematical rocket science.

In this chapter, we are going to cover the following topics:

  • Understanding AutoML and H2O AutoML
  • Minimum system requirements to use H2O AutoML
  • Installing Java
  • Basic implementation of H2O using Python
  • Basic implementation of H2O using R
  • Training your first ML model using H2O AutoML

Technical requirements

For this chapter, you will need the following:

  • A decent web browser (Chrome, Firefox, or Edge), the latest version of your preferred web browser.
  • An Integrated Development Environment (IDE) of your choice
  • Jupyter Notebook by Project Jupyter (https://jupyter.org/) (optional)

All the code examples for this chapter can be found on GitHub at https://github.com/PacktPublishing/Practical-Automated-Machine-Learning-on-H2O/tree/main/Chapter%201.

Understanding AutoML and H2O AutoML

Before we begin our journey with H2O AutoML, it is important to understand what exactly AutoML is and what part it plays in the entire ML pipeline. In this section, we will try to understand the various steps involved in the ML pipeline and where AutoML fits into it. Then, we will explore what makes H2O’s AutoML so unique among the various AutoML technologies.

Let’s start by learning a bit about AutoML in general.

AutoML

AutoML is the process of automating the various steps that are performed while developing a viable ML system for predictions. A typical ML pipeline consists of the following steps:

  1. Data Collection: This is the very first step in an ML pipeline. Data is collected from various sources. The sources can generate different types of data, such as categorical, numeric, textual, time series, or even visual and auditory data. All these types of data are aggregated together based on the requirements and are merged into a common structure. This could be a comma-separated value file, a parquet file, or even a table from a database.
  2. Data Exploration: Once data has been collected, it is explored using basic analytical techniques to identify what it contains, the completeness and correctness of the data, and if the data shows potential patterns that can build a model.
  3. Data Preparation: Missing values, duplicates, and noisy data can all affect the quality of the model as they introduce incorrect learning. Hence, the raw data that is collected and explored needs to be pre-processed to remove all anomalies using specific data processing methods.
  4. Data Transformation: A lot of ML models work with different types of data. Some can work with categorical data, while some can only work with numeric data. That is why you may need to convert certain types of data from one form into the other. This allows the dataset to be fed properly during model training.
  5. Model Selection: Once the dataset is ready, an ML model is selected to be trained. The model is chosen based on what type of data the dataset contains, what information is to be extracted from the dataset, as well as which model suits the data.
  6. Model Training: This is where the model is trained. The ML system will learn from the processed dataset and create a model. This training can be influenced by several factors, such as data attribute weighting, learning rate, and other hyperparameters.
  7. Hyperparameter Tuning: Apart from model training, another factor that needs to be considered is the model’s architecture. The model’s architecture depends on the type of algorithm used, such as the number of trees in a random forest or neurons in a neural network. We don’t immediately know which architecture is optimal for a given model, so experimentation is needed. The parameters that define the architecture of a model are called hyperparameters; finding the best combination of hyperparameter values is known as hyperparameter tuning.
  8. Prediction: The final step of the ML pipeline is prediction. Based on the patterns in the dataset that were learned by the model during training, the model can now make a generalized prediction on unseen data.

For non-experts, all these steps and their complexities can be overwhelming. Every step in the ML pipeline process has been developed over years of research and there are vast topics within themselves. AutoML is the process that automates the majority of these steps, from data exploration to hyperparameter tuning, and provides the best possible models to make predictions on. This helps companies focus on solving real-world problems with results rather than ML processes and workflows.

Now that you understand the different steps in an ML pipeline and how the steps are automated by AutoML, let’s see why H2O’s AutoML technology is one of the leading technologies in the industry.

H2O AutoML

H2O AutoML is an AutoML software technology developed by H2O.ai that simplifies how ML systems are developed by providing user-friendly interfaces that help non-experts experiment with ML. It is an in-memory, distributed, fast, and scalable ML and analytics platform that works on big data and can be used for enterprise needs.

It is written in Java and uses key-value storage to access data, models, and other ML objects that are involved. It runs on a cluster system and uses the multi-threaded MapReduce framework to parallelize data operations. It is also easy to communicate with it as it uses simple REST APIs. Finally, it has a web interface that provides a detailed graphical view of data and model details.

Not only does H2O AutoML automate the majority of the sophisticated steps involved in the ML life cycle, but it also provides a lot of flexibility for even expert data scientists to implement specialized model training processes. H2O AutoML provides a simple wrapper function that encapsulates several of the model training tasks that would otherwise be complicated to orchestrate. It also has extensive explainability functions that can describe the various details of the model training life cycle. This provides easy-to-export details of the models that users can use to explain the performance and justifications of the models that have been trained.

The best part about H2O AutoML is that it is entirely open source. You can find H2O’s source code at https://github.com/h2oai. It is actively maintained by a community of developers serving in both open as well as closed sources companies. At the time of writing, it is on its third major version, which indicates that it is quite a mature technology and is feature-intensive – that is, it supports several major companies in the world. It also supports several programming languages, including R, Scala, Python, and Java, that can run on several operating systems and provides support for a wide variety of data sources that are involved in the ML life cycle, such as Hadoop Distributed File System, Hive, Amazon S3, and even Java Database Connectivity (JDBC).

Now that you understand the basics of AutoML and how powerful H2O AutoML is, let’s see what the minimum requirements are for a system to run H2O AutoML without any performance issues.

Minimum system requirements to use H2O AutoML

H2O is very easy to install, but certain minimum standard requirements need to be met for it to run smoothly and efficiently. The following are some of the minimum requirements needed by H2O in terms of hardware capabilities, along with other software support:

  • The minimum hardware required by H2O is as follows:
    • Memory: H2O runs on an in-memory architecture, so it is limited by the physical memory of the system that uses it. Thus, to be able to process huge chunks of data, the more memory the system, has the better.
    • Central Processing Unit (CPU): By default, H2O will use the maximum available CPUs of the system. However, at a minimum, it will need 4 CPUs.
    • Graphical Processing Unit (GPU): GPU support is only available for XGBoost models in AutoML if the GPUs are NVIDIA GPUs (GPU Cloud, DGX Station, DGX-1, or DGX-2) or if it is a CUDA 8 GPU.
  • The operating systems that support H2O are as follows:
    • Ubuntu 12.04
    • OS X 10.9 or later
    • Windows 7 or later
    • CentOS 6 or later
  • The programming languages that support H2O are as follows:
    • Java: Java is mandatory for H2O. H2O requires a 64-bit JDK to build H2O and a 64-bit JRE to run its binary:
      • Java versions supported: Java SE 15, 14, 13, 12, 11, 10, 9, and 8
    • Other Languages: The following languages are only required if H2O is being run in those environments:
      • Python 2.7.x, 3.5.x, or 3.6.x
      • Scala 2.10 or later
      • R version 3 or later
  • Additional requirements: The following requirements are only needed if H2O is being run in these environments:
    • Hadoop: Cloudera CDH 5.4 or later, Hortonworks HDP 2.2 or later, MapR 4.0 or later, or IBM Open Platform 4.2
    • Conda: 2.7, 3.5, or 3.6
    • Spark: Version 2.1, 2.2, or 2.3

Once we have a system that meets the minimum requirements, we need to focus on H2O’s functional dependencies on other software. H2O has only one dependency and that is Java. Let’s see why Java is important for H2O and how we can download and install the correct supported Java version.

Installing Java

H2O’s core code is written in Java. It needs Java Runtime Environment (JRE) installed in your system to spin up an H2O server cluster. H2O also trains all the ML algorithms in a multi-threaded manner, which uses the Java Fork/Join framework on top of its MapReduce framework. Hence, having the latest Java version that is compatible with H2O to run H2O smoothly is highly recommended.

You can install the latest stable version of Java from https://www.oracle.com/java/technologies/downloads/.

When installing Java, it is important to be aware of which bit version your system runs on. If it is a 64-bit version, then make sure you are installing the 64-bit Java version for your operating system. If it is 32-bit, then go for the 32-bit version.

Now that we have installed the correct Java version, we can download and install H2O. Let’s look at a simple example of how we can do that using Python.

Basic implementation of H2O using Python

Python is one of the most popular languages in the ML field of computer programming. It is widely used in all industries and has tons of actively maintained ML libraries that provide a lot of support in creating ML pipelines.

We will start by installing the Python programming language and then installing H2O using Python.

Installing Python

Installing Python is very straightforward. It does not matter whether it is Python 2.7 or Python 3 and above as H2O works completely fine with both versions of the language. However, if you are using anything older than Python 2.7, then you will need to upgrade your version.

It is best to go with Python 3 as it is the current standard and Python 2.7 is outdated. Along with Python, you will also need pip, Python’s package manager. Now, let’s learn how to install Python on various operating systems:

  • On Linux (Ubuntu, Mint, Debian):
    • For Python 2.7, run the following command in the system Terminal:
      sudo apt-get python-pip 
    • For Python 3, run the following command in the system Terminal:
      sudo apt-get python3-pip
  • On macOS: macOS version 10.8 comes with Python 2.7 pre-installed. If you want to install Python 3, then go to https://python.org, go to the Downloads section, and download the latest version of Python 3 for macOS.
  • On Windows: Unlike macOS, Windows does not come with any pre-installed Python language support. You will need to download a Python installer for Windows from https://python.org. The installer will depend on your Windows operating system – that is, if it is 64-bit or 32-bit.

Now that you know how to install the correct version of Python, let’s download and install the H2O Python module using Python.

Installing H2O using Python

H2O has a Python module available in the Python package index. To install the h2o Python module, all you need to do is to execute the following command in your Terminal:

pip install h2o

And that’s pretty much it.

To test if it has been successfully downloaded and installed, follow these steps:

  1. Open your Python Terminal.
  2. Import the h2o module by running the following command:
    import h2o
  3. Initialize H2O to spin up a local h2o server by running the following command:
    h2o.init()

The following screenshot shows the results you should get after initializing h2o:

Figure 1.1 – H2O execution using Python

Figure 1.1 – H2O execution using Python

Let’s have a quick look at the output we got. First, it ran successfully, so mission accomplished.

After executing h2o.init() by reading the output logs, you will see that H2O checked if there is already an H2O server instance running on localhost with port 54321. In this scenario, there wasn’t any H2O server instance running previously, so H2O attempted to start a local server on the same port. If it had found an already existing local H2O instance on the port, then it would have reused the same instance for any further H2O command executions.

Then, it used Java version 16 to start the H2O instance. You may see a different Java version, depending on which version you have installed in your system.

Next, you will see the location of the h2o jar file that the server was started from, followed by the location of the Java Virtual Machine (JVM) logs.

Once the server is up and running, it shows the URL of the H2O server locally hosted on your system and the status of the H2O Python library’s connection to the server.

Lastly, you will see some basic metadata regarding the server’s configuration. This metadata may be slightly different from what you see in your execution as it depends a lot on the specifications of your system. For example, by default, H2O will use all the cores available on your system for processing. So, if you have an 8-core system, then the H2O_cluster_allowed_cores property value will be 8. Alternatively, if you decide to use only four cores, then you can execute h2o.init(nthreads=4) to use only four cores, reflecting the same in the server configuration output.

Now that you know how to implement H2O using Python, let’s learn how to do the same in the R programming language.

Basic implementation of H2O using R

The R programming language is a very popular language in the field of ML and data science because of its extensive support for statistical and data manipulation operations. It is widely used by data scientists and data miners for developing analytical software.

We will start by installing the R programming language and then installing H2O using R.

Installing R

An international team of developers maintains the R programming language. They have a dedicated web page for the R programming language called The Comprehensive R Archive Network (CRAN): https://cran.r-project.org/. There are different ways of installing R, depending on what operating system you use:

  • On Linux (Ubuntu, Mint, Debian):

Execute the following command in the system Terminal:

sudo apt-get install r-base
  • On macOS: To install R, go to https://cran.r-project.org/, go to the Download R for macOS hyperlink, and download the latest release of R for macOS.
  • On Windows: Similar to how you install R on macOS, you can download the .exe file from https://cran.r-project.org/, go to the Download R for Windows hyperlink, and download the latest release of R for Windows.

Another great way of installing R on macOS and Windows is through RStudio. RStudio simplifies the installation of R-supported software and is also a very good IDE for R programming in general. You can download R studio from https://www.rstudio.com/.

Now that you know how to install the correct version of R, let’s download and install the H2O R package using the R programming language.

Installing H2O using R

Similar to Python, H2O provide support for the R programming language as well.

To install the R packages, follow these steps:

  1. First, we need to download the H2O R package dependencies. For this, execute the following command in your R Terminal:
    install.packages(c("RCurl", "jsonlite"))
  2. Then, to install the actual h2o package, execute the following command in your R Terminal:
    install.packages("h2o")

And you are done.

  1. To test if it has been successfully downloaded and installed, open your R Terminal, import the h2o library, and execute the h2o.init() command. This will spin up a local H2O server.

The results can be seen in the following screenshot:

Figure 1.2 – H2O execution using R

Figure 1.2 – H2O execution using R

Let’s have a quick look at the output we got.

After executing h2o.init(), the H2O client will check if there is an H2O server instance already running on the system. The H2O server is usually run locally on port 54321 by default. If it had found an already existing local H2O instance on the port, then it would have reused the same instance. However, in this scenario, there wasn’t any H2O server instance running on port 54321, which is why H2O attempted to start a local server on the same port.

Next, you will see the location of the JVM logs. Once the server is up and running, the H2O client tries to connect to it and the status of the connection to the server is displayed. Lastly, you will see some basic metadata regarding the server’s configuration. This metadata may be slightly different from what you see in your execution as it depends a lot on the specifications of your system. For example, by default, H2O will use all the cores available on your system for processing. So, if you have an 8-core system, then the H2O_cluster_allowed_cores property value will be 8. Alternatively, if you decide to use only four cores, then you can execute the h2o.init(nthreads=4) command to use only four cores, thus reflecting the same in the server configuration output.

Now that you know how to implement H2O using Python and R, let’s create our very first ML model and make predictions on it using H2O AutoML.

Training your first ML model using H2O AutoML

All ML pipelines, whether they’re automated or not, eventually follow the same steps that were discussed in the Understanding AutoML and H2O AutoML section in this chapter.

For this implementation, we will be using the Iris flower dataset. This dataset can be found at https://archive.ics.uci.edu/ml/datasets/iris.

Understanding the Iris flower dataset

The Iris flower dataset, also known as Fisher’s Iris dataset, is one of the most popular multivariate datasets – that is, a dataset in which there are two or more variables that are analyzed per observation during model training. The dataset consists of 50 samples of three different varieties of the Iris flower. The features in the dataset include the length and width of the petals and sepals in centimeters. The dataset is often used for studying various classification techniques in ML because of its simplicity. The classification is performed by using the length and width of the petals and sepals as features that determine the class of the Iris flower.

The following screenshot shows a small sample of the dataset:

Figure 1.3 – Iris dataset

Figure 1.3 – Iris dataset

The columns in the dataset represent the following:

  • C1: Sepal length in cm
  • C2: Sepal width in cm
  • C3: Petal length in cm
  • C4: Petal width in cm
  • C5: Class:
    • Iris-setosa
    • Iris-versicolour
    • Iris-virginica

In this scenario, C1, C2, C3, and C4 represent the features that are used to determine C5, the class of the Iris flower.

Now that you understand the contents of the dataset that we are going to be working with, let’s implement our model training code.

Model training

Model training is the process of finding the best combination of biases and weights for a given ML algorithm so that it minimizes a loss function. A loss function is a way of measuring how far the predicted value is from the actual value. So, minimizing it indicates that the model is getting closer to making accurate predictions – in other words, it’s learning. The ML algorithm builds a mathematical representation of the relationship between the various features in the dataset and the target label. Then, we use this mathematical representation to predict the potential value of the target label for certain feature values. The accuracy of the predicted values depends a lot on the quality of the dataset, as well as the combination of weights and biases against features used during model training. However, all of this is entirely automated by AutoML and, as such, is not a concern for us.

With that in mind, let’s learn how to quickly and easily create an ML model using H2O in Python.

Model training and prediction in Python

The H2O Python module makes it easy to use H2O in a Python program. The inbuilt functions in the H2O Python module are straightforward to use and hide away a lot of the complexities of using H2O.

Follow these steps to train your very first model in Python using H2O AutoML:

  1. Import the H2O module:
    import h2o
  2. Initialize H2O to spin up a local H2O server:
    h2o.init()

The h2o.init() command starts up or reuses an H2O server instance running locally on port 54321.

  1. Now, you can import the dataset by using the h2o.import_file() command while passing the location of the dataset into your system.
  2. Next, import the dataset by passing the location where you downloaded the dataset:
    data = h2o.import_file("Dataset/iris.data")
  3. Now, you need to identify which columns of the DataFrame are the features and which are the labels. A label is something that we want to predict, while features are attributes of the label that help identify the label. We train models on these features and then predict the value of the label, given a specific set of feature values. Referring to the dataset in the Understanding the Iris flower dataset section, let’s set all the column names – C1, C2, C3, C4, and C5 – as a list of features:
    features = data.columns
  4. Based on our DataFrame, the C5 column, which denotes the class of the Iris flower, is the column that we want to eventually predict once the model has been trained. Hence, we denote C5 as the label and remove it from the remaining set of column names, which we will note as features. Set the target label and remove it from the list of features:
    label = "C5"
    features.remove(label)
  5. Split the DataFrame into training and testing DataFrames:
    train_dataframe, test_dataframe = data.split_frame([0.8])

The data.split_frame([0.8]) command splits the DataFrame into two – a training DataFrame and another for testing. The training DataFrame contains 80% of the data, while the testing DataFrame contains the remaining 20%. We will use the training DataFrame to train the model and the testing DataFrame to run predictions on the model once it has been trained to test how the model performs.

Tip

If you are curious as to how H2O splits the dataset based on ratios and how it randomizes the data between different splits, feel free to explore and experiment with the split_frame function. You can find more details at https://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/_modules/h2o/frame.html#H2OFrame.split_frame.

  1. Initialize the H2O AutoML object. Here, we have set the max_model parameter to 10 to limit the number of models that will be trained by H2O, set AutoML to 10, and set the random seed generator to 1:
    aml=h2o.automl.H2OAutoML(max_models=10, seed = 1)
  2. Now, trigger the AutoML training by passing in the feature columns – that is C1, C2, C3, and C4 – in (x), the label column C5 in (y), and the train_dataframe DataFrame using the aml.train() command. This is when H2O starts its automated model training.
  3. Train the model using the H2O AutoML object:
    aml.train(x = features, y = label, training_frame = train_dataframe)

During the training, H2O will analyze the type of the label column. For numerical labels, H2O treats the ML problem as a regression problem. If the label is categorical, then it treats the problem as a classification problem. For the Iris flower dataset, the C5 column is a categorical value containing class values. H2O will analyze this column and correctly identify that it is a classification problem and train classification models.

H2O AutoML trains several models behind the scenes using different types of ML algorithms. All the models that have been trained are evaluated on the test dataset and their performance is measured. H2O also provides detailed information about all the models, which users can use to further experiment on the data or compare different ML algorithms and understand which ones are more suitable to solve their ML problem. H2O can end up training 20-30 models, which can take a while. However, since we have passed the max_models parameter as 10, we are limiting the number of models that will be trained so that we can see the output of the training process quickly. More on ensemble learning will be discussed in Chapter 5, Understanding AutoML Algorithms.

  1. Once the training has finished, AutoML creates a Leaderboard of all the models it has trained, ranking them from the best performing to the worst. This ranking is achieved by comparing all the models’ error metrics. Error metrics are values that measure how many errors the model makes when making predictions on a sample test dataset with the actual label values. Lower error metrics indicate that the model makes fewer errors during prediction, which indicates that it is a better model compared to one with a higher error metric. Extract the AutoML Leaderboard:
    model_leaderboard = aml.leaderboard
  2. Display the AutoML Leaderboard:
    model_leaderboard.head(rows=model_leaderboard.nrows)

The Leaderboard will look as follows:

Figure 1.4 – H2O AutoML Leaderboard (Python)

Figure 1.4 – H2O AutoML Leaderboard (Python)

The Leaderboard includes the following details:

  • model_id: This represents the ID of the model.
  • mean_per_class_error: This metric is used to measure the average of the errors of each class in your multi-class dataset.
  • logloss: This metric is used to measure the negative average of the log of corrected predicted probabilities for each instance.
  • Root Mean Squared Error (RMSE): This metric is used to measure the standard deviation of prediction errors.
  • Mean Squared Error (MSE): This metric is used to measure the average of the squares of the errors.

The Leaderboard sorts the models based on certain default metrics, depending on the type of ML problem, unless specifically mentioned during AutoML training. The Leaderboard sorts the models based on the AUC metric for binary classification, mean_per_class_error for multinomial classification, and deviance for regression.

The metrics are different measures of error in the model’s performance. So, the smaller the error value, the better the model is for making accurate predictions. We will explore the different model performance metrics in Chapter 6, Understanding H2O AutoML Leaderboard and Other Performance Metrics.

In this case, GLM_1_AutoML_1_20211221_224844 is the best model according to H2O AutoML since it is a multinomial classification problem and this model has the lowest mean_per_class_error.

You may notice that despite passing the max_model value as 10, when triggering AutoML for training, we see more than 10 models in the Leaderboard. This is because only 10 models have been trained; the remaining models are Stacked Ensemble models. Stacked Ensemble models are models that are created from what other models have learned and are not technically trained in the normal sense. We will learn more about Stacked Ensemble models in Chapter 5, Understanding AutoML Algorithms, and more about the Leaderboard in Chapter 6, Understanding H2O AutoML Leaderboard and Other Performance Metrics.

Congratulations! You have officially trained your very first ML model using H2O AutoML and it is now ready to be used to make predictions.

Making predictions is very straightforward: we will use the test_dataframe DataFrame that was created from the data.split_frame([0.8]) command.

Execute the following command in Python:

aml.predict(test_dataframe)

That’s it – everything is wrapped inside the predict function of the model object.

After executing the prediction, you will see the following results:

Figure 1.5 – H2O AutoML model prediction (Python)

Figure 1.5 – H2O AutoML model prediction (Python)

The prediction result shows a table where every row is a representation of predictions for the rows present in the test DataFrame. The predict column indicates what Iris class it is for that row, while the remaining columns are the calculated probabilities of the Iris classes, as denoted in the column’s name, by the model after reading the feature values of that row. In short, the model predicts that for row 1, there is a 99.6763% chance that it is Iris-setosa.

Congratulations! You have now made an accurate prediction using your newly trained model using AutoML.

Now that we’ve seen how easy it is to use H2O AutoML in Python, let’s learn how to do the same in the R programming language.

Model training and prediction in R

Similar to Python, training and making predictions using H2O AutoML in the R programming language is also very easy. H2O has a lot of support for the R programming language and, as such, has encapsulated much of the sophistication of ML behind ready-to-use functions.

Let’s look at a model training example that uses H2O AutoML in the R programming language on the Iris flower dataset.

You will notice that training models in R is similar to how we do it in Python, with the only difference being the slight change in syntax.

Follow these steps:

  1. Import the H2O library:
    library(h2o)
  2. Initialize H2O to spin up a local H2O server:
    h2o.init()

h2o.init() will start up an H2O server instance that’s running locally on port 54321 and connect to it. If an H2O server already exists on the same port, then it will reuse it.

  1. Import the dataset using h2o.importFile(“Dataset/iris.data”) while passing the location of the dataset in your system as a parameter. Import the dataset:
    data <- h2o.importFile("Dataset/iris.data")
  2. Now, you need to set which columns of the dataframe are the features and which column is the label. Set the C5 column as the target label and the remaining column names as the list of features:
    label <- "C5"
    features <- setdiff(names(data), label)
  3. Split the DataFrame into two parts:
    parts <- h2o.splitFrame(data, 0.8)

One DataFrame will be used for training, while the other will be used for testing/validating the model being trained. parts <- h2o.splitFrame(data, 0.8) splits the DataFrame into two parts. One DataFrame contains 80% of the data, while the other contains the remaining 20%. Now, assign the DataFrame that contains 80% of the data as the training DataFrame and the other as the testing or validation DataFrame.

  1. Assign the first part as the training DataFrame:
    train_dataframe <- parts[[1]]
  2. Assign the second part as the testing DataFrame:
    test_dataframe <- parts[[2]]
  3. Now that the dataset has been imported and its features and labels have been identified, let’s pass them to H2O’s AutoML to train models. This means that you can implement the AutoML model training function in R using h2o.automl(). Train the model using H2O AutoML:
    aml <- h2o.automl(x = features, y = label, training_frame = train_dataframe, max_models=10, seed = 1)
  4. Extract the AutoML Leaderboard:
    model_leaderboard <- aml@leaderboard
  5. Display the AutoML Leaderboard:
    print(model_leaderboard, n = nrow(model_leaderboard))

Once the training has finished, AutoML will create a Leaderboard of all the models it has trained, ranking them from the best performing to the worst.

The Leaderboard will display the results as follows:

Figure 1.6 – H2O AutoML Leaderboard (R)

Figure 1.6 – H2O AutoML Leaderboard (R)

The Leaderboard includes the same details as we saw in the Leaderboard we got when training models in Python.

However, you may notice that the best model that’s suggested in this Leaderboard is different from the one we got in our previous experiment.

In this case, GBM_3_AutoML_8_20211222_02555 is the best model according to H2O AutoML, while in the previous experiment, it was GLM_1_AutoML_1_20211221_224844. This may be due to several factors, such as a different random number being generated for the seed value during model training or different data values being split across the training and testing DataFrames between the two experiments. This is what makes ML tricky – every step that you perform in a model training pipeline can greatly affect the overall performance of your trained model. At the end of the day, ML is a best-effort approach to making the most accurate predictions.

Congratulations – you have officially trained your ML model using H2O AutoML in R. Now, let’s learn how to make predictions on it. We will use the testing DataFrame we created after the split function to make predictions on the model we trained.

Execute the following command in R to make predictions:

predictions <- h2o.predict(aml, test_dataframe)

The predict function of the h2o object accepts two parameters. One is the model object, which in our case is the aml object, while the other is the DataFrame to make predictions on. By default, the aml object will use the best model in the Leaderboard to make predictions.

After executing the prediction, you will see the following results:

Figure 1.7 – H2O AutoML model prediction (R)

Figure 1.7 – H2O AutoML model prediction (R)

The results show a table with similar details that we saw in our previous experiment with Python. Every row is a representation of predictions for the rows present in the test DataFrame. The predict column indicates what Iris class it is for that row, while the remaining columns are the calculated probabilities of the Iris classes.

Congratulations – you have made an accurate prediction using your newly trained model using AutoML in R. Now, let’s summarize this chapter.

Summary

In this chapter, we understood the various steps in an ML pipeline and how AutoML plays a part in automating some of those steps. Then, we prepared our system to use H2O AutoML by installing the basic requirements. Once our system was ready, we implemented a simple application in Python and R that uses H2O AutoML to train a model on the Iris flower dataset. Finally, we understood the Leaderboard results and made successful predictions on the ML model that we just trained. All of this helped us test the waters of H2O AutoML, thus opening doors to more advanced concepts of H2O AutoML.

In the next chapter, we will explore H2O’s web User Interface (UI) so that we can understand and observe various ML details using an interactive visual interface.

Left arrow icon Right arrow icon
Download code icon Download Code

Key benefits

  • Learn how to train the best models with a single click using H2O AutoML
  • Get a simple explanation of model performance using H2O Explainability
  • Easily deploy your trained models to production using H2O MOJO and POJO

Description

With the huge amount of data being generated over the internet and the benefits that Machine Learning (ML) predictions bring to businesses, ML implementation has become a low-hanging fruit that everyone is striving for. The complex mathematics behind it, however, can be discouraging for a lot of users. This is where H2O comes in – it automates various repetitive steps, and this encapsulation helps developers focus on results rather than handling complexities. You’ll begin by understanding how H2O’s AutoML simplifies the implementation of ML by providing a simple, easy-to-use interface to train and use ML models. Next, you’ll see how AutoML automates the entire process of training multiple models, optimizing their hyperparameters, as well as explaining their performance. As you advance, you’ll find out how to leverage a Plain Old Java Object (POJO) and Model Object, Optimized (MOJO) to deploy your models to production. Throughout this book, you’ll take a hands-on approach to implementation using H2O that’ll enable you to set up your ML systems in no time. By the end of this H2O book, you’ll be able to train and use your ML models using H2O AutoML, right from experimentation all the way to production without a single need to understand complex statistics or data science.

Who is this book for?

This book is for engineers and data scientists who want to quickly adopt machine learning into their products without worrying about the internal intricacies of training ML models. If you're someone who wants to incorporate machine learning into your software system but don’t know where to start or don’t have much expertise in the domain of ML, then you’ll find this book useful. Basic knowledge of statistics and programming is beneficial. Some understanding of ML and Python will be helpful.

What you will learn

  • Get to grips with H2O AutoML and learn how to use it
  • Explore the H2O Flow Web UI
  • Understand how H2O AutoML trains the best models and automates hyperparameter optimization
  • Find out how H2O Explainability helps understand model performance
  • Explore H2O integration with scikit-learn, the Spring Framework, and Apache Storm
  • Discover how to use H2O with Spark using H2O Sparkling Water

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : Sep 26, 2022
Length: 396 pages
Edition : 1st
Language : English
ISBN-13 : 9781801074520
Category :
Tools :

What do you get with a Packt Subscription?

Free for first 7 days. $19.99 p/m after that. Cancel any time!
Product feature icon Unlimited ad-free access to the largest independent learning library in tech. Access this title and thousands more!
Product feature icon 50+ new titles added per month, including many first-to-market concepts and exclusive early access to books as they are being written.
Product feature icon Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.
Product feature icon Thousands of reference materials covering every tech concept you need to stay up to date.
Subscribe now
View plans & pricing

Product Details

Publication date : Sep 26, 2022
Length: 396 pages
Edition : 1st
Language : English
ISBN-13 : 9781801074520
Category :
Tools :

Packt Subscriptions

See our plans and pricing
Modal Close icon
$19.99 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
$199.99 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts
$279.99 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just $5 each
Feature tick icon Exclusive print discounts

Frequently bought together


Stars icon
Total $ 125.97
Data Cleaning and Exploration with Machine Learning
$41.99
Practical Automated Machine Learning Using H2O.ai
$41.99
Machine Learning at Scale with H2O
$41.99
Total $ 125.97 Stars icon
Banner background image

Table of Contents

18 Chapters
Part 1 H2O AutoML Basics Chevron down icon Chevron up icon
Chapter 1: Understanding H2O AutoML Basics Chevron down icon Chevron up icon
Chapter 2: Working with H2O Flow (H2O’s Web UI) Chevron down icon Chevron up icon
Part 2 H2O AutoML Deep Dive Chevron down icon Chevron up icon
Chapter 3: Understanding Data Processing Chevron down icon Chevron up icon
Chapter 4: Understanding H2O AutoML Architecture and Training Chevron down icon Chevron up icon
Chapter 5: Understanding AutoML Algorithms Chevron down icon Chevron up icon
Chapter 6: Understanding H2O AutoML Leaderboard and Other Performance Metrics Chevron down icon Chevron up icon
Chapter 7: Working with Model Explainability Chevron down icon Chevron up icon
Part 3 H2O AutoML Advanced Implementation and Productization Chevron down icon Chevron up icon
Chapter 8: Exploring Optional Parameters for H2O AutoML Chevron down icon Chevron up icon
Chapter 9: Exploring Miscellaneous Features in H2O AutoML Chevron down icon Chevron up icon
Chapter 10: Working with Plain Old Java Objects (POJOs) Chevron down icon Chevron up icon
Chapter 11: Working with Model Object, Optimized (MOJO) Chevron down icon Chevron up icon
Chapter 12: Working with H2O AutoML and Apache Spark Chevron down icon Chevron up icon
Chapter 13: Using H2O AutoML with Other Technologies Chevron down icon Chevron up icon
Index Chevron down icon Chevron up icon
Other Books You May Enjoy Chevron down icon Chevron up icon

Customer reviews

Rating distribution
Full star icon Full star icon Full star icon Full star icon Half star icon 4.6
(5 Ratings)
5 star 80%
4 star 0%
3 star 20%
2 star 0%
1 star 0%
Ajinkya Redkar Nov 06, 2022
Full star icon Full star icon Full star icon Full star icon Full star icon 5
Good and informative product
Amazon Verified review Amazon
Daniel Armstrong May 21, 2023
Full star icon Full star icon Full star icon Full star icon Full star icon 5
I really enjoyed this book It covers both the R and Python implementation (I lived seeing the differences).H2O and autoML in general is something I have most stayed away from, but I don't know why since the first DS tool I learned was spss modeler. After reading this book H2O is something I will definitely try in the future. I am not sure why more people don't use it, especially when starting out in DS. I can definitely see how the user interface could help people set up there project, it has database connections(SQL, hdfs, etc), nice feature visualizations, and data manipulations tools.The author did a nice job covering the wide array of models the H2O provides, as well as a detailed explanation of different performance metrics, and model explainability and feature importance, which includes some really nice visualizations.I am very intrigued by what I read in the productionization chapter. I think that productionization is far too often overlooked, and it is one of the best reasons to use H2O over other tools, especially if you use spark at your organization.I think beginners will get the most out of this book, but I think it can help a wider audience. I am not going to trade my pandas dfs for H2O dataframes but I think there is room for them to work together. I hope the author add a chapter or two to the work that H2O is doing in the NLP with LLMsOverall it was a great book and I am happy to have it on my shelfBest,Daniel Armstrong
Amazon Verified review Amazon
Alexander Souza Oct 27, 2022
Full star icon Full star icon Full star icon Full star icon Full star icon 5
An incredible book for anyone starting in the AI world. The content is very clear and with several examples that make it much easier to understand.100% recommend
Amazon Verified review Amazon
Gemmy Dec 25, 2022
Full star icon Full star icon Full star icon Full star icon Full star icon 5
Good content. Well thought-out examples. Provides a great way to understand and deploy your AI model without the mathematical complexities that goes along with it.
Amazon Verified review Amazon
Dietmar Baur Nov 07, 2022
Full star icon Full star icon Full star icon Empty star icon Empty star icon 3
I already had access to the electronic version via Safari’s onlinebookshelf, but liked the content so much that I wanted to have a printed copy, for increased readability. The book finally arrived (a couple of days late) and I was disappointed by the tiny font size used. I‘m 48, wearing glasses - it’s really hard to read. If the reading experience of printed books is so bad, no wonder eBooks will replace them. Make reading tech books a real experience again, with bigger fonts and an overall better feel, please!!
Amazon Verified review Amazon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

What is included in a Packt subscription? Chevron down icon Chevron up icon

A subscription provides you with full access to view all Packt and licnesed content online, this includes exclusive access to Early Access titles. Depending on the tier chosen you can also earn credits and discounts to use for owning content

How can I cancel my subscription? Chevron down icon Chevron up icon

To cancel your subscription with us simply go to the account page - found in the top right of the page or at https://subscription.packtpub.com/my-account/subscription - From here you will see the ‘cancel subscription’ button in the grey box with your subscription information in.

What are credits? Chevron down icon Chevron up icon

Credits can be earned from reading 40 section of any title within the payment cycle - a month starting from the day of subscription payment. You also earn a Credit every month if you subscribe to our annual or 18 month plans. Credits can be used to buy books DRM free, the same way that you would pay for a book. Your credits can be found in the subscription homepage - subscription.packtpub.com - clicking on ‘the my’ library dropdown and selecting ‘credits’.

What happens if an Early Access Course is cancelled? Chevron down icon Chevron up icon

Projects are rarely cancelled, but sometimes it's unavoidable. If an Early Access course is cancelled or excessively delayed, you can exchange your purchase for another course. For further details, please contact us here.

Where can I send feedback about an Early Access title? Chevron down icon Chevron up icon

If you have any feedback about the product you're reading, or Early Access in general, then please fill out a contact form here and we'll make sure the feedback gets to the right team. 

Can I download the code files for Early Access titles? Chevron down icon Chevron up icon

We try to ensure that all books in Early Access have code available to use, download, and fork on GitHub. This helps us be more agile in the development of the book, and helps keep the often changing code base of new versions and new technologies as up to date as possible. Unfortunately, however, there will be rare cases when it is not possible for us to have downloadable code samples available until publication.

When we publish the book, the code files will also be available to download from the Packt website.

How accurate is the publication date? Chevron down icon Chevron up icon

The publication date is as accurate as we can be at any point in the project. Unfortunately, delays can happen. Often those delays are out of our control, such as changes to the technology code base or delays in the tech release. We do our best to give you an accurate estimate of the publication date at any given time, and as more chapters are delivered, the more accurate the delivery date will become.

How will I know when new chapters are ready? Chevron down icon Chevron up icon

We'll let you know every time there has been an update to a course that you've bought in Early Access. You'll get an email to let you know there has been a new chapter, or a change to a previous chapter. The new chapters are automatically added to your account, so you can also check back there any time you're ready and download or read them online.

I am a Packt subscriber, do I get Early Access? Chevron down icon Chevron up icon

Yes, all Early Access content is fully available through your subscription. You will need to have a paid for or active trial subscription in order to access all titles.

How is Early Access delivered? Chevron down icon Chevron up icon

Early Access is currently only available as a PDF or through our online reader. As we make changes or add new chapters, the files in your Packt account will be updated so you can download them again or view them online immediately.

How do I buy Early Access content? Chevron down icon Chevron up icon

Early Access is a way of us getting our content to you quicker, but the method of buying the Early Access course is still the same. Just find the course you want to buy, go through the check-out steps, and you’ll get a confirmation email from us with information and a link to the relevant Early Access courses.

What is Early Access? Chevron down icon Chevron up icon

Keeping up to date with the latest technology is difficult; new versions, new frameworks, new techniques. This feature gives you a head-start to our content, as it's being created. With Early Access you'll receive each chapter as it's written, and get regular updates throughout the product's development, as well as the final course as soon as it's ready.We created Early Access as a means of giving you the information you need, as soon as it's available. As we go through the process of developing a course, 99% of it can be ready but we can't publish until that last 1% falls in to place. Early Access helps to unlock the potential of our content early, to help you start your learning when you need it most. You not only get access to every chapter as it's delivered, edited, and updated, but you'll also get the finalized, DRM-free product to download in any format you want when it's published. As a member of Packt, you'll also be eligible for our exclusive offers, including a free course every day, and discounts on new and popular titles.