The Regularization Cookbook: Explore practical recipes to improve the functionality of your ML models

Vincent Vandenbussche
Paperback Jul 2023 424 pages 1st Edition

Machine Learning Refresher

Machine learning (ML) is much more than just models. It is about following a certain process and best practices. This chapter will provide a refresher on these: from loading data and model evaluation to model training and optimization, the main steps and methods will be explained here.

In this chapter, we are going to cover the following main topics:

  • Loading data
  • Splitting data
  • Preparing quantitative data
  • Preparing qualitative data
  • Training a model
  • Evaluating a model
  • Performing hyperparameter optimization

Even though the recipes in this chapter are independent from a methodological standpoint, they build upon each other and are meant to be executed sequentially.

Technical requirements

In this chapter, you will need to be able to run code to load datasets, prepare data, and train, optimize, and evaluate ML models. To do so, you will need the following libraries:

  • numpy
  • pandas
  • scikit-learn

They can be installed using pip with the following command line:

pip install numpy pandas scikit-learn

Note

In this book, some best practices such as using virtual environments won’t be explicitly mentioned. However, it is highly recommended that you use a virtual environment before installing any library using pip or any other package manager.

Loading data

The primary focus of this recipe is loading data from a CSV file. Since loading data is usually the first step in any ML project, this recipe is also a good opportunity to give a quick recap of the ML workflow, as well as the different types of data.

Getting ready

Before loading the data, we should keep in mind that working with an ML model is a two-step process:

  1. Train a model on a given dataset.
  2. Reuse the previously trained model to infer predictions on new data.

These two steps are summarized in the following figure:

Figure 2.1 – A simple view of the two-step ML process

Of course, in most cases, this is a rather simplistic view. A more detailed view can be seen in Figure 2.2:

Figure 2.2 – A more complete view of the ML process

Let’s take a closer look at the training part of the ML process shown in Figure 2.2:

  1. First, training data is queried from a data source (this can be a database, a data lake, an open dataset, and so on).
  2. The data is preprocessed, such as via feature engineering, rescaling, and so on.
  3. A model is trained and stored (on a data lake, locally, on the edge, and so on).
  4. Optionally, the output of this model is post-processed – for example, via formatting, heuristics, business rules, and more.
  5. Optionally again, the output of this model (with or without post-processing) is stored in a database for later reference or evaluation if needed.

Now, let’s take a look at the inference part of the ML process:

  1. The data is queried from a data source (a database, an API query, and so on).
  2. The data goes through the same preprocessing step as the training data.
  3. The trained model is fetched if it doesn’t already exist locally.
  4. The model is used to infer output.
  5. Optionally, the output of the model is post-processed via the same post-processing step that was used during training.
  6. Optionally, the output is stored in a database for monitoring and later reference.

Even in this schema, many steps were not mentioned: splitting data for training purposes, using evaluation metrics, cross-validation, hyperparameter optimization, and others. This chapter will dive into the more training-specific steps and apply them to the very common but practical Titanic dataset, a binary classification problem. But first, we need to load the data.

To do so, you must download the Titanic dataset training set locally. This can be performed with the following command line:

wget https://raw.githubusercontent.com/PacktPublishing/The-Regularization-Cookbook/main/chapter_02/train.csv

How to do it…

This recipe is about loading a CSV file and displaying its first few rows so that we can have a first glance at what the data looks like:

  1. The first step is to import the required libraries. Here, the only library we need is pandas:
    import pandas as pd
  2. Now, we can load the data using the read_csv function provided by pandas. The first argument is the path to the file. Assuming the file is named train.csv and located in the current folder, we only have to provide train.csv as an argument:
    # Load the data as a DataFrame
    df = pd.read_csv('train.csv')

The returned object is a DataFrame, which provides many useful methods for data processing.

  3. Now, we can display the first five rows of the loaded file using the .head() method:
    # Display the first 5 rows of the dataset
    df.head()

This code will output the following:

   PassengerId  Survived  Pclass  \
0            1         0       3
1            2         1       1
2            3         1       3
3            4         1       1
4            5         0       3

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1
2                             Heikkinen, Miss. Laina  female  26.0      0
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1
4                           Allen, Mr. William Henry    male  35.0      0

   Parch            Ticket     Fare Cabin Embarked
0      0         A/5 21171   7.2500   NaN        S
1      0          PC 17599  71.2833   C85        C
2      0  STON/O2. 3101282   7.9250   NaN        S
3      0            113803  53.1000  C123        S
4      0            373450   8.0500   NaN        S

Here is a description of the data types in each column:

  • PassengerId (qualitative): A unique, arbitrary ID for each passenger.
  • Survived (qualitative): 1 for yes, 0 for no. This is our label, so this is a binary classification problem.
  • Pclass (quantitative, discrete): The class, which is arguably quantitative. Is class 1 better than class 2? Most likely yes.
  • Name (unstructured): The name and title of the passenger.
  • Sex (qualitative): The registered sex of the passenger, either male or female.
  • Age (quantitative, discrete): The age of the passenger.
  • SibSp (quantitative, discrete): The number of siblings and spouses on board.
  • Parch (quantitative, discrete): The number of parents and children on board.
  • Ticket (unstructured): The ticket reference.
  • Fare (quantitative, continuous): The ticket price.
  • Cabin (unstructured): The cabin number, which is arguably unstructured. It can be seen as a qualitative feature with high cardinality.
  • Embarked (qualitative): The embarked city, either Southampton (S), Cherbourg (C), or Queenstown (Q).

There’s more…

Let’s talk about the different types of data that are available. Data is a very generic word and can describe many things. We are surrounded by data all the time. One way to specify data is using opposites.

Data can be structured or unstructured:

  • Structured data comes in the form of tables, databases, Excel files, CSV files, and JSON files.
  • Unstructured data does not fit in a table: it can be text, sound, images, videos, and so on. Even if we tend to give it a tabular representation, this kind of data does not naturally fit in an Excel table.

Data can be quantitative or qualitative.

Quantitative data is ordered. Here are some examples:

  • €100 is greater than €10
  • 1.8 meters is taller than 1.6 meters
  • 18 years old is younger than 80 years old

Qualitative data has no intrinsic order, as shown here:

  • Blue is not intrinsically better than red
  • A dog is not intrinsically greater than a cat
  • A kitchen is not intrinsically more useful than a bathroom

These are not mutually exclusive. An object can have both quantitative and qualitative features, as can be seen in the case of the car in the following figure:

Figure 2.3 – A single object depicted by both quantitative (left) and qualitative (right) features

Figure 2.3 – A single object depicted by both quantitative (left) and qualitative (right) features

Finally, data can be continuous or discrete.

Some data is continuous, as follows:

  • A weight
  • A volume
  • A price

On the other hand, some data is discrete:

  • A color
  • A football score
  • A nationality

Note

Discrete != qualitative.

For example, a football score is discrete, but there is an intrinsic order: 3 points is more than 2.

See also

The pandas read_csv function has a lot of flexibility as it can use other separators, handle headers, and much more. This is described in the official documentation: https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html.
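
As a small illustration of that flexibility, the sketch below reads a semicolon-separated snippet that has no header row. The inline data and the column names are made up for this example; sep, header, and names are the read_csv parameters being demonstrated.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated data with no header row
raw = io.StringIO("1;male;22.0\n2;female;38.0\n3;female;26.0\n")

# sep sets the delimiter, header=None says there is no header row,
# and names supplies the column names to use instead
toy_df = pd.read_csv(raw, sep=';', header=None,
                     names=['PassengerId', 'Sex', 'Age'])
print(toy_df.head())
```

The same call works with a file path in place of the in-memory buffer.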

The pandas library allows I/O operations that have different types of inputs. For more information, have a look at the official documentation: https://pandas.pydata.org/docs/reference/io.html.

Splitting data

After loading data, splitting it is a crucial step. This recipe will explain why we need to split data, as well as how to do it.

Getting ready

Why do we need to split data? An ML model is quite like a student.

You provide a student with many lectures and exercises, with or without the answers. But more often than not, students are evaluated on a completely new problem. To do well, they cannot simply memorize the exercises and their solutions; they must understand the underlying concepts.

An ML model is no different: you train the model on training data and then evaluate it on test data. This way, you make sure the model fully understands the task and generalizes well to new, unseen data.

So, the dataset is usually split into train and test sets:

  • The train set must be as large as possible to give as many samples as possible to the model
  • The test set must be large enough to be statistically significant in evaluating the model

Typical train/test splits range from 80%/20% for rather small datasets (for example, hundreds of samples) to 99%/1% for very large datasets (for example, millions of samples or more).

For this recipe and the others in this chapter, it is assumed that the code has been executed in the same notebook as the previous recipe since each recipe reuses the code from the previous ones.

How to do it…

Here are the steps to try out this recipe:

  1. You can split the data rather easily with scikit-learn and the train_test_split() function:
    # Import the train_test_split function
    from sklearn.model_selection import train_test_split
    # Split the data
    X_train, X_test, y_train, y_test = train_test_split(
        df.drop(columns=['Survived']), df['Survived'],
        test_size=0.2, stratify=df['Survived'],
        random_state=0)

This function uses the following parameters as input:

  • X: All columns except the 'Survived' label
  • y: The 'Survived' label column
  • test_size: Set to 0.2, so 20% of the samples go to the test set and 80% to the train set
  • stratify: Set to the 'Survived' column, so that both splits keep the same label balance
  • random_state: Set to 0 here; any fixed integer makes the split reproducible

It returns the following outputs:

  • X_train: The train split of X
  • X_test: The test split of X
  • y_train: The training split of y, associated with X_train
  • y_test: The test split of y, associated with X_test

Note

The stratify option is not mandatory, but it can be critical with imbalanced data. It can be used to preserve the balance of any qualitative variable across the splits, not just the labels.

This split should be done as early as possible when performing data processing so that you avoid any potential data leakage. From now on, all the preprocessing will be computed on the train set, and only then applied to the test set, in agreement with Figure 2.2.
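
To see what stratification does in practice, here is a minimal sketch on synthetic data. The arrays X_demo and y_demo are made up for illustration (80 negatives, 20 positives) and are not part of the Titanic dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: 80 negatives and 20 positives (made up)
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.array([0] * 80 + [1] * 20)

# Stratified 80/20 split, mirroring the call used in this recipe
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)

# Both splits keep the original 20% share of positive labels
print('train positive rate:', y_tr.mean())
print('test positive rate:', y_te.mean())
```

Without stratify, the share of positive labels in a small test set could drift away from 20% purely by chance.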

See also

See the official documentation for the train_test_split function: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html.

Preparing quantitative data

Depending on the type of data, how the features must be prepared may differ. In this recipe, we’ll cover how to prepare quantitative data, including missing data imputation and rescaling.

Getting ready

In the Titanic dataset, as in any other dataset, there may be missing data. There are several ways to deal with it: for example, you can drop a column or a row, or impute a value. Imputation techniques range from simple to sophisticated, and scikit-learn supplies several imputers, such as SimpleImputer and KNNImputer.

As we will see in this recipe, using SimpleImputer, we can impute the missing quantitative data with the mean value.

Once the missing data has been handled, we can prepare the quantitative data by rescaling it so that all the data is at the same scale.

Several rescaling strategies exist, such as min-max scaling, robust scaling, standard scaling, and others.

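In this recipe, we will use standard scaling. So, for each feature, we will subtract the mean value of this feature, and then divide by the standard deviation of that feature:

$$x_{\text{scaled}} = \frac{x - \mu}{\sigma}$$

Here, μ is the mean and σ is the standard deviation of the feature, both computed on the train set.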

Fortunately, scikit-learn provides a fully working implementation via StandardScaler.
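
As a quick sanity check of what standard scaling computes, the following sketch compares a manual computation with StandardScaler on a small made-up matrix. X_toy is an assumption for illustration only. Note that StandardScaler uses the population standard deviation, which is also the NumPy default (ddof=0).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy feature matrix: two features on very different scales (made-up values)
X_toy = np.array([[1.0, 100.0],
                  [2.0, 300.0],
                  [3.0, 500.0]])

# Manual standard scaling: subtract the per-feature mean, then divide by
# the per-feature standard deviation (population version, ddof=0)
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

# Same computation through scikit-learn
X_sklearn = StandardScaler().fit_transform(X_toy)

# Both approaches give the same result
print(np.allclose(X_manual, X_sklearn))
```

After scaling, each feature has a mean of 0 and a standard deviation of 1, regardless of its original scale.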

How to do it…

We will sequentially handle missing values and rescale the data in this recipe:

  1. Import the required classes – SimpleImputer for missing data imputation and StandardScaler for rescaling:
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler
  2. Select the quantitative features we want to keep. Here, we will keep 'Pclass', 'Age', 'Fare', 'SibSp', and 'Parch' and store these features in new variables for both the train and test sets:
    quanti_columns = ['Pclass', 'Age', 'Fare', 'SibSp', 'Parch']
    # Get the quantitative columns
    X_train_quanti = X_train[quanti_columns]
    X_test_quanti = X_test[quanti_columns]
  3. Instantiate the simple imputer with a mean strategy. Here, the missing value of a feature will be replaced with the mean value of that feature:
    # Impute missing quantitative values with mean feature value
    quanti_imputer = SimpleImputer(strategy='mean')
  4. Fit the imputer on the train set and apply it to the test set so that it avoids leakage in the imputation:
    # Fit and impute the training set
    X_train_quanti = quanti_imputer.fit_transform(X_train_quanti)
    # Just impute the test set
    X_test_quanti = quanti_imputer.transform(X_test_quanti)
  5. Now that imputation has been performed, instantiate the scaler object:
    # Instantiate the standard scaler
    scaler = StandardScaler()
  6. Finally, fit and apply the standard scaler to the train set, and then apply it to the test set:
    # Fit and transform the training set
    X_train_quanti = scaler.fit_transform(X_train_quanti)
    # Just transform the test set
    X_test_quanti = scaler.transform(X_test_quanti)

We now have quantitative data with no missing values, fully rescaled, with no data leakage.
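
The "no data leakage" part can be checked on a tiny example. In the sketch below, train_toy and test_toy are made-up arrays; the point is only that the value used to fill the test set comes from the train set.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up one-feature train and test sets, each with a missing value
train_toy = np.array([[10.0], [20.0], [np.nan], [30.0]])
test_toy = np.array([[np.nan], [100.0]])

imputer_toy = SimpleImputer(strategy='mean')
# The mean (20.0) is learned from the training set only
train_imputed = imputer_toy.fit_transform(train_toy)
# The test set is filled with the training mean, not with its own mean
test_imputed = imputer_toy.transform(test_toy)

print(imputer_toy.statistics_)
print(test_imputed.ravel())
```

Calling fit_transform() on the test set instead would have filled the gap with 100.0, a statistic the model could never know at training time.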

There’s more…

In this recipe, we used the simple imputer, assuming there was missing data. In practice, it is highly recommended that you look at the data first to check whether there are missing values, as well as how many. It is possible to look at the number of missing values per column with the following code snippet:

# Display the number of missing values for each column
X_train[quanti_columns].isna().sum()

This will output the following:

Pclass      0
Age       146
Fare        0
SibSp       0
Parch       0

Thanks to this, we know that the Age feature has 146 missing values, while the other features have no missing data.

See also

A few imputers are available in scikit-learn. The list is available here: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.impute.

There are many ways to scale data, and you can find the methods that are available in scikit-learn here: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing.

You might be interested in looking at this comparison of several scalers on some given data: https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-all-scaling-py.

Preparing qualitative data

In this recipe, we will prepare qualitative data, including missing value imputation and encoding.

Getting ready

Qualitative data requires different treatment from quantitative data. Imputing missing values with the mean value of a feature would make no sense (and would not work with non-numeric data): it makes more sense, for example, to use the most frequent value or the mode of a feature. The SimpleImputer class allows us to do such things.

The same goes for rescaling: it would make no sense to rescale qualitative data. Instead, it is more common to encode it. One of the most typical techniques is called one-hot encoding.

The idea is to transform each category, out of N possible categories, into a vector holding a single 1 and N-1 zeros. In our example, the Embarked feature’s one-hot encoding would be as follows:

  • ‘C’ = [1, 0, 0]
  • ‘Q’ = [0, 1, 0]
  • ‘S’ = [0, 0, 1]

Note

Having N columns for N categories is not necessarily optimal. What happens if, in the preceding example, we remove the first column? The encoding becomes ‘Q’ = [1, 0] and ‘S’ = [0, 1], and a value that is neither of these must be ‘C’ = [0, 0]. No additional column is needed to carry all the information. More generally, N categories only require N-1 columns, which is why one-hot encoding functions usually allow you to drop a column.

The scikit-learn OneHotEncoder class allows us to do this. It also offers several strategies for dealing with unknown categories that may appear in the test set (or in the production environment): raising an error, ignoring them, or mapping them to an infrequent class. Finally, it allows us to drop the first column after encoding.
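
The encodings described here can be reproduced with a few lines. This is a minimal sketch on the three Embarked values only, not on the actual dataset; the variable names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# The three possible values of the Embarked feature
embarked = np.array([['C'], ['Q'], ['S']])

# Full one-hot encoding: one column per category
full = OneHotEncoder().fit_transform(embarked).toarray()
print(full)

# Dropping the first column: 'C' becomes the all-zeros row
reduced = OneHotEncoder(drop='first').fit_transform(embarked).toarray()
print(reduced)
```

Categories are sorted before encoding, so the first column corresponds to 'C', which is the one removed by drop='first'.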

How to do it…

Just like in the preceding recipe, we will handle any missing data and the features will be one-hot encoded:

  1. Import the necessary classes – SimpleImputer for missing data imputation (already imported in the previous recipe) and OneHotEncoder for encoding. We also need to import numpy so that we can concatenate the qualitative and quantitative data that’s been prepared at the end of this recipe:
    import numpy as np
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder
  2. Select the qualitative features we want to keep: 'Sex' and 'Embarked'. Then, store these features in new variables for both the train and test sets:
    quali_columns = ['Sex', 'Embarked']
    # Get the qualitative columns
    X_train_quali = X_train[quali_columns]
    X_test_quali = X_test[quali_columns]
  3. Instantiate SimpleImputer with the most_frequent strategy. Any missing value will be replaced by the most frequent value of its feature:
    # Impute missing qualitative values with most frequent feature value
    quali_imputer = SimpleImputer(strategy='most_frequent')
  4. Fit and transform the imputer on the train set, and then transform the test set:
    # Fit and impute the training set
    X_train_quali = quali_imputer.fit_transform(X_train_quali)
    # Just impute the test set
    X_test_quali = quali_imputer.transform(X_test_quali)
  5. Instantiate the encoder. Here, we will specify the following parameters:
    • drop='first': This drops the first column of each feature's encoding
    • handle_unknown='ignore': If a new value appears in the test set (or in production), it will be encoded as zeros:
      # Instantiate the encoder
      encoder = OneHotEncoder(drop='first', handle_unknown='ignore')
  6. Fit and transform the encoder on the training set, and then transform the test set using this encoder:
    # Fit and transform the training set
    X_train_quali = encoder.fit_transform(X_train_quali).toarray()
    # Just encode the test set
    X_test_quali = encoder.transform(X_test_quali).toarray()

Note

We need to call .toarray() on the encoder output because it is a sparse matrix by default and cannot be concatenated in that form with the other features.

  7. With that, all the data has been prepared – both quantitative and qualitative (considering this recipe and the previous one). It is now possible to concatenate this data before training a model:
    # Concatenate the data back together
    X_train = np.concatenate([X_train_quanti,
        X_train_quali], axis=1)
    X_test = np.concatenate([X_test_quanti, X_test_quali], axis=1)

There’s more…

It is possible to save the data as a pickle file, either to share it or save it and avoid having to prepare it again. The following code will allow us to do this:

import pickle
# Use a context manager so that the file is closed once the data is written
with open('prepared_titanic.pkl', 'wb') as f:
    pickle.dump((X_train, X_test, y_train, y_test), f)

We now have fully prepared data that can be used to train ML models.
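
Loading the data back is symmetric. The following sketch shows a full round trip on stand-in arrays, written to a temporary folder so that it leaves no file behind; the arrays and the prepared_demo.pkl filename are made up for this example.

```python
import os
import pickle
import tempfile

import numpy as np

# Stand-in for the prepared splits (made-up arrays, not the Titanic data)
prepared = (np.ones((4, 2)), np.zeros((2, 2)),
            np.array([0, 1, 1, 0]), np.array([1, 0]))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'prepared_demo.pkl')
    # Save the tuple of arrays
    with open(path, 'wb') as f:
        pickle.dump(prepared, f)
    # Loading returns the objects in the order they were saved
    with open(path, 'rb') as f:
        X_tr_l, X_te_l, y_tr_l, y_te_l = pickle.load(f)

print(X_tr_l.shape, y_te_l)
```

Keep in mind that pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.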

Note

Several steps have been omitted or simplified here for more clarity. Data may need more preparation, such as more thorough missing value imputation, outlier and duplicate detection (and perhaps removal), feature engineering, and so on. It is assumed that you already have some sense of these aspects; if not, you are encouraged to read other materials about this topic.

See also

This more general documentation about missing data imputation is worth looking at: https://scikit-learn.org/stable/modules/impute.html.

Finally, this more general documentation about data preprocessing can be very useful: https://scikit-learn.org/stable/modules/preprocessing.html.

Training a model

Once data has been fully cleaned and prepared, it is fairly easy to train a model thanks to scikit-learn. In this recipe, before training a logistic regression model on the Titanic dataset, we will quickly recap the ML paradigm and the different types of ML we can use.

Getting ready

If you were asked how to differentiate a car from a truck, you may be tempted to provide a list of rules, such as the number of wheels, size, weight, and so on. By doing so, you would be able to provide a set of explicit rules that would allow anyone to identify a car and a truck as different types of vehicles.

Traditional programming is not so different. While developing algorithms, programmers often build explicit rules, which allow them to map from data input (for example, a vehicle) to answers (for example, a car). We can summarize this paradigm as data + rules = answers.

If we were to train an ML model to discriminate cars from trucks, we would use another strategy: we would feed an ML algorithm with many pieces of data and their associated answers, expecting the model to learn the correct rules by itself. This is a different approach that can be summarized as data + answers = rules. This paradigm difference is summarized in Figure 2.4. As small as it might look to ML practitioners, it changes everything in terms of regularization:

Figure 2.4 – Comparing traditional programming with ML algorithms
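
The two paradigms can be contrasted in a few lines of code. This is only a sketch: the vehicles, the thresholds in rule_based(), and the choice of a decision tree as the learning algorithm are all assumptions made for illustration.

```python
from sklearn.tree import DecisionTreeClassifier

# Made-up vehicles described by [number of wheels, weight in tonnes]
vehicles = [[4, 1.2], [4, 1.5], [4, 2.0], [6, 12.0], [10, 18.0], [8, 15.0]]
labels = ['car', 'car', 'car', 'truck', 'truck', 'truck']


# Traditional programming: data + rules = answers
def rule_based(vehicle):
    wheels, weight = vehicle
    # Hand-written, explicit rule (hypothetical thresholds)
    return 'truck' if wheels > 4 or weight > 3.5 else 'car'


# Machine learning: data + answers = rules
learned = DecisionTreeClassifier(random_state=0).fit(vehicles, labels)

new_vehicle = [6, 10.0]
print(rule_based(new_vehicle))
print(learned.predict([new_vehicle])[0])
```

In the first case, we can read and edit the rule directly. In the second, the rule lives inside the fitted model, so changing its behavior means changing the data, the model, or its hyperparameters.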

Regularizing traditional algorithms is conceptually straightforward. For example, what if the rules for defining a truck overlap with the bus definition? If so, we can add the fact that buses have lots of windows.

Regularization in ML is intrinsically implicit. What if the model in this case does not discriminate between buses and trucks?

  • Should we add more data?
  • Is the model complex enough to capture such a difference?
  • Is it underfitting or overfitting?

This fundamental property of ML makes regularization complex.

ML can be applied to many tasks. Anyone who uses ML knows there is not just one type of ML model.

Arguably, most ML models fall into three main categories:

  • Supervised learning
  • Unsupervised learning
  • Reinforcement learning

As is usually the case for categories, the landscape is more complex, with sub-categories and methods overlapping several categories. But this is beyond the scope of this book.

This book will focus on regularization for supervised learning. In supervised learning, the problem is usually quite easy to specify: we have input features, X (for example, apartment surface), and labels, y (for example, apartment price). The goal is to train a model so that it’s robust enough to predict y, given X.

The two major types of supervised learning are classification and regression:

  • Classification: The labels are made of qualitative data. For example, the task is predicting between two or more classes such as car, bus, and truck.
  • Regression: The labels are made of quantitative data. For example, the task is predicting an actual value, such as an apartment price.

Again, the line can be blurry: some tasks can be solved with classification even though the labels are quantitative data, while other tasks can be framed as both classification and regression. See Figure 2.5:

Figure 2.5 – Regression versus classification

How to do it…

Assuming we want to train a logistic regression model (which will be explained properly in the next chapter), the scikit-learn library provides the LogisticRegression class, along with the fit() and predict() methods. Let’s learn how to use it:

  1. Import the LogisticRegression class:
    from sklearn.linear_model import LogisticRegression
  2. Instantiate a LogisticRegression object:
    # Instantiate the model
    lr = LogisticRegression()
  3. Fit the model on the train set:
    # Fit on the training data
    lr.fit(X_train, y_train)
  4. Optionally, compute predictions by using that model on the test set:
    # Compute and store predictions on the test data
    y_pred = lr.predict(X_test)
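
If you want to try these steps without the prepared Titanic data at hand, the following self-contained sketch runs them on synthetic data generated with make_classification. It also shows predict_proba(), which returns class probabilities instead of hard labels. All variable names here are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for prepared features
X_syn, y_syn = make_classification(n_samples=200, n_features=5,
                                   random_state=0)
X_syn_train, X_syn_test, y_syn_train, y_syn_test = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=0)

# Instantiate and fit the model on the training split
clf = LogisticRegression()
clf.fit(X_syn_train, y_syn_train)

# predict() returns hard class labels (0 or 1)
labels_pred = clf.predict(X_syn_test)
# predict_proba() returns one probability per class; each row sums to 1
probas_pred = clf.predict_proba(X_syn_test)
print(labels_pred[:5])
print(probas_pred[:5].round(3))
```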

See also

Even though more details will be provided in the next chapter, you might be interested in looking at the documentation of the LogisticRegression class: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html.

Evaluating a model

Once the model has been trained, it is important to evaluate it. In this recipe, we will provide a few insights about a few typical metrics for both classification and regression, before evaluating our model on the test set.

Getting ready

Many evaluation metrics exist. If we take a step back and think about the predictions of a binary classifier, there are only four possible cases:

  • False positive (FP): Positive prediction, negative ground truth
  • True positive (TP): Positive prediction, positive ground truth
  • True negative (TN): Negative prediction, negative ground truth
  • False negative (FN): Negative prediction, positive ground truth

Figure 2.6 – Representation of false positive, true positive, true negative, and false negative

Based on this, we can define a wide range of evaluation metrics.

One of the most common metrics is accuracy, which is the proportion of correct predictions. The definition of accuracy is as follows:

$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Note

Although very common, the accuracy may be misleading, especially for imbalanced labels. For example, let’s assume an extreme case where 99% of Titanic passengers survived, and we have a model that predicts that every passenger survived. Our model would have a 99% accuracy but would be wrong for 100% of passengers who did not survive.

There are several other very common metrics, such as precision, recall, and the F1 score.

Precision is most suited when you’re trying to maximize the true positives and minimize the false positives – for example, making sure you detect only surviving passengers:

$$\text{precision} = \frac{TP}{TP + FP}$$

Recall is most suited when you’re trying to maximize the true positives and minimize the false negatives – for example, making sure you don’t miss any surviving passengers:

$$\text{recall} = \frac{TP}{TP + FN}$$

The F1 score is just a combination of the precision and recall metrics as a harmonic mean:

$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

Another useful classification evaluation metric is the Receiver Operating Characteristic Area Under Curve (ROC AUC) score.

All these metrics behave similarly: they take values between 0 and 1, and the higher the value, the better the model. Some are also more robust to imbalanced labels, especially the F1 score and ROC AUC.
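
To make these metrics concrete, the following sketch computes accuracy, precision, recall, and the F1 score by hand from the four counts (TP, TN, FP, and FN), using their standard definitions, and compares them with the scikit-learn functions. The labels and predictions are made up for eight hypothetical passengers.

```python
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Made-up ground truth and predictions for eight passengers (1 = survived)
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 1, 0, 0, 0]

# Count the four possible outcomes
pairs = list(zip(y_true_toy, y_pred_toy))
tp = sum(t == 1 and p == 1 for t, p in pairs)
tn = sum(t == 0 and p == 0 for t, p in pairs)
fp = sum(t == 0 and p == 1 for t, p in pairs)
fn = sum(t == 1 and p == 0 for t, p in pairs)

# Standard definitions of the metrics
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

# Hand-computed values next to the scikit-learn implementations
print(accuracy, accuracy_score(y_true_toy, y_pred_toy))
print(precision, precision_score(y_true_toy, y_pred_toy))
print(recall, recall_score(y_true_toy, y_pred_toy))
print(f1, f1_score(y_true_toy, y_pred_toy))
```

Note that the scikit-learn metric functions expect the ground truth as the first argument and the predictions as the second.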

For regression tasks, the most used metrics are the mean squared error (MSE) and the R2 score.

The MSE is the average squared difference between the predictions and the ground truth:

$$\text{MSE} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)^2$$

Here, m is the number of samples, ŷ is the predictions, and y is the ground truth:

Figure 2.7 – Visualization of the errors for a regression task

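In terms of the R2 score, it is a metric that can be negative and is defined as follows:

$$R^2 = 1 - \frac{\sum_{i=1}^{m} (\hat{y}_i - y_i)^2}{\sum_{i=1}^{m} (y_i - \bar{y})^2}$$

Here, ȳ is the mean of the ground truth values.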

Note

While the R2 score is a typical evaluation metric (the closer to 1, the better), the MSE is more typical of a loss function (the closer to 0, the better).
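
Both regression metrics are easy to compute by hand. The sketch below does so with NumPy on made-up values, using the standard definitions, and compares the results with mean_squared_error() and r2_score() from scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up ground truth and predictions for a regression task
y_true_reg = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_reg = np.array([2.0, 5.0, 8.0, 9.0])

# MSE: average of the squared differences
mse = np.mean((y_pred_reg - y_true_reg) ** 2)

# R2: 1 minus the ratio of the residual sum of squares
# to the total sum of squares around the mean
ss_res = np.sum((y_pred_reg - y_true_reg) ** 2)
ss_tot = np.sum((y_true_reg - y_true_reg.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(mse, mean_squared_error(y_true_reg, y_pred_reg))
print(r2, r2_score(y_true_reg, y_pred_reg))
```

A model that always predicts the mean of the ground truth gets an R2 score of 0, and a model that does worse than that gets a negative score.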

How to do it…

Assuming our chosen evaluation metric here is accuracy, a very simple way to evaluate our model is to use the accuracy_score() function:

from sklearn.metrics import accuracy_score
# Compute the accuracy of our model on the test set
# (ground truth first, then predictions)
print('accuracy on test set:', accuracy_score(y_test, y_pred))

This outputs the following:

accuracy on test set: 0.7877094972067039

Here, the accuracy_score() function provides an accuracy of 78.77%, meaning about 79% of our model’s predictions are right.

See also

Here is a list of the available metrics in scikit-learn: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics.

Performing hyperparameter optimization

In this recipe, we will explain what hyperparameter optimization is and some related concepts: the definition of a hyperparameter, cross-validation, and various hyperparameter optimization methods. We will then perform a grid search to optimize the hyperparameters of the logistic regression model on the Titanic dataset.

Getting ready

Most of the time, in ML, we do not simply train a model on the training set and evaluate it against the test set.

This is because, like most other algorithms, ML algorithms can be fine-tuned. This fine-tuning process allows us to optimize hyperparameters to achieve the best possible results. Hyperparameters also often act as a lever for regularizing a model.

Note

In ML, hyperparameters are chosen before training and can be tuned by humans, unlike parameters, which are learned through the model training process and thus are not set directly.

To properly optimize hyperparameters, a third split has to be introduced: the validation set.

This means there are now three splits:

  • The training set: Where the model is trained
  • The validation set: Where the hyperparameters are optimized
  • The test set: Where the model is evaluated

You could create such a set by splitting X_train into X_train and X_valid with the train_test_split() function from scikit-learn.
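A minimal sketch of such a split is shown here. It uses a synthetic dataset from `make_classification()` as a stand-in for the training data, and the 20% validation size and the `random_state` values are arbitrary assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training data (not the Titanic dataset)
X_train, y_train = make_classification(
    n_samples=500, n_features=8, random_state=0
)

# Hold out 20% of the training set as a validation set,
# keeping the class proportions similar in both splits
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train, random_state=0
)

print("train:", X_train.shape, "validation:", X_valid.shape)
```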

But in practice, most people just use cross-validation and do not bother creating this validation set. The k-fold cross-validation method divides the training set into k folds: each fold is used in turn as the validation set while the remaining folds are used for training, as presented in Figure 2.8:

Figure 2.8 – Typical split between training, validation, and test sets, without cross-validation (top) and with cross-validation (bottom)

In doing so, not just one model is trained, but k, for a given set of hyperparameters. The performances are averaged over those k models, based on a chosen metric (for example, accuracy, MSE, and so on).

Several sets of hyperparameters can then be tested, and the one that shows the best performance is selected. After selecting the best hyperparameter set, the model is trained one more time on the entire training set to make the most of the available training data.
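The procedure described above can be sketched by hand as follows. This is only an illustration under stated assumptions: the data is synthetic, the helper `cv_accuracy()` is hypothetical, and stratified folds with shuffling are one possible choice among others.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in for the training set
X_train, y_train = make_classification(
    n_samples=500, n_features=8, random_state=0
)


def cv_accuracy(C, k=5):
    """Train k models for one value of C and average their validation accuracy."""
    folds = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    scores = []
    for train_idx, valid_idx in folds.split(X_train, y_train):
        model = LogisticRegression(C=C, max_iter=1000)
        model.fit(X_train[train_idx], y_train[train_idx])
        scores.append(model.score(X_train[valid_idx], y_train[valid_idx]))
    return float(np.mean(scores))


# Test several hyperparameter values and keep the best one
candidates = [0.01, 0.03, 0.1]
mean_scores = {C: cv_accuracy(C) for C in candidates}
best_C = max(mean_scores, key=mean_scores.get)

# Retrain once on the entire training set with the selected value
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)

print("mean validation accuracy per C:", mean_scores)
print("selected C:", best_C)
```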

Finally, you can implement several strategies to optimize the hyperparameters, as follows:

  • Grid search: Test all combinations of the provided values of hyperparameters
  • Random search: Randomly search combinations of hyperparameters
  • Bayesian search: Perform Bayesian optimization on the hyperparameters
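To illustrate the difference between the first two strategies, the sketch below only enumerates candidate combinations, without training any model. The second hyperparameter, `penalty`, is added here purely as an assumption to make the grid two-dimensional. Bayesian search is not shown, as it typically relies on a third-party library.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Hypothetical search space with two hyperparameters
param_space = {"C": [0.01, 0.03, 0.1], "penalty": ["l1", "l2"]}

# Grid search: every combination of the provided values
grid_candidates = list(ParameterGrid(param_space))

# Random search: a fixed number of randomly drawn combinations
random_candidates = list(
    ParameterSampler(param_space, n_iter=4, random_state=0)
)

print("grid search candidates:", len(grid_candidates))
print("random search candidates:", len(random_candidates))
for candidate in random_candidates:
    print(candidate)
```

With larger search spaces, the number of grid candidates grows multiplicatively, while random search keeps the number of candidates under direct control through `n_iter`.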

How to do it…

While being rather complicated to explain conceptually, hyperparameter optimization with cross-validation is super easy to implement. In this recipe, we’ll assume that we want to optimize a logistic regression model to predict whether a passenger would have survived:

  1. First, we need to import the GridSearchCV class from sklearn.model_selection:
    from sklearn.model_selection import GridSearchCV
  2. We would like to test the following hyperparameter values for C: [0.01, 0.03, 0.1]. We must define a parameter grid with the hyperparameter as the key and the list of values to test as the value.

The C hyperparameter is the inverse of the penalization strength: the higher C is, the lower the regularization. See the next chapter for more details:

# Define the hyperparameters we want to test
param_grid = { 'C': [0.01, 0.03, 0.1] }
  3. Next, let’s assume we want to optimize our model on accuracy, with five cross-validation folds. To do this, we will instantiate the GridSearchCV object and provide the following arguments:
    • The model to optimize, which is a LogisticRegression instance
    • The parameter grid, param_grid, which we defined previously
    • The scoring on which to optimize, that is, accuracy
    • The number of cross-validation folds, which has been set to 5 here
  4. We must also set return_train_score to True to get some useful information we can use later:
    # Instantiate the grid search object
    grid = GridSearchCV(
        LogisticRegression(),
        param_grid,
        scoring='accuracy',
        cv=5,
        return_train_score=True
    )
  5. Then, all we have to do is fit this object on the training set. This automatically runs all the computations and stores the results:
    # Fit and wait
    grid.fit(X_train, y_train)
    This outputs the following:
    GridSearchCV(cv=5, estimator=LogisticRegression(),
        param_grid={'C': [0.01, 0.03, 0.1]},
        return_train_score=True, scoring='accuracy')
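The effect of the C hyperparameter mentioned earlier (the higher C is, the lower the regularization) can be observed on the learned coefficients. The following is a rough sketch on synthetic data, assuming the default L2 penalty of LogisticRegression; the values of C are arbitrary:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data, used only to illustrate the effect of C
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

coef_norms = {}
for C in [0.01, 0.1, 1.0, 100.0]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    # A smaller norm means the coefficients are shrunk more strongly
    coef_norms[C] = float(np.linalg.norm(model.coef_))
    print(f"C={C}: coefficient norm = {coef_norms[C]:.3f}")
```

A small C penalizes large coefficients more strongly, so the coefficient norm shrinks as C decreases.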

Note

Depending on the input dataset and the number of tested hyperparameters, the fit may take some time.

Once the fit has been completed, you can get a lot of useful information, such as the following:

  • The best hyperparameter set via the .best_params_ attribute
  • The best mean cross-validated accuracy via the .best_score_ attribute
  • The cross-validation results via the .cv_results_ attribute
  6. Finally, you can run inference with the model that was trained with the optimized hyperparameters using the .predict() method:
    y_pred = grid.predict(X_test)
  7. Optionally, you can evaluate the chosen model with the accuracy score:
    print('Hyperparameter optimized accuracy:',
        accuracy_score(y_test, y_pred))

This provides the following output:

Hyperparameter optimized accuracy: 0.781229050279329
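For reference, the whole recipe can be put together in one self-contained script. This sketch runs on synthetic data rather than the Titanic dataset, so the scores will differ from the ones above; `max_iter=1000` and the `random_state` values are assumptions added for reproducibility. Note that the attributes of a fitted scikit-learn estimator end with an underscore (`best_params_`, `best_score_`, `cv_results_`).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the prepared Titanic data
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Define the hyperparameters we want to test
param_grid = {"C": [0.01, 0.03, 0.1]}

# Instantiate and fit the grid search object
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    scoring="accuracy",
    cv=5,
    return_train_score=True,
)
grid.fit(X_train, y_train)

print("best params:", grid.best_params_)
print("best mean CV accuracy:", grid.best_score_)

# Mean train and validation accuracy for each tested value of C
results = grid.cv_results_
for C, train_score, valid_score in zip(
    results["param_C"],
    results["mean_train_score"],
    results["mean_test_score"],
):
    print(f"C={C}: train={train_score:.3f}, validation={valid_score:.3f}")

# Evaluate the refitted best model on the test set
y_pred = grid.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
print("Hyperparameter optimized accuracy:", test_accuracy)
```

Comparing the mean train and validation scores in `cv_results_` is what `return_train_score=True` makes possible: a large gap between the two is a hint of overfitting.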

Thanks to the tools provided by scikit-learn, it is fairly easy to have a well-optimized model and evaluate it against several metrics. In the next recipe, we’ll learn how to diagnose bias and variance based on such an evaluation.

See also

The documentation for GridSearchCV can be found at https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html.


Key benefits

  • Learn to diagnose the need for regularization in any machine learning model
  • Regularize different ML models using a variety of techniques and methods
  • Enhance the functionality of your models using state-of-the-art computer vision and NLP techniques

Description

Regularization is an infallible way to produce accurate results with unseen data; however, applying regularization is challenging, as it is available in multiple forms and applying the appropriate technique to every model is a must. The Regularization Cookbook provides you with the appropriate tools and methods to handle any case, with ready-to-use working code as well as theoretical explanations. After an introduction to regularization and methods to diagnose when to use it, you’ll start implementing regularization techniques on linear models, such as linear and logistic regression, and tree-based models, such as random forest and gradient boosting. You’ll then be introduced to specific regularization methods based on data, high cardinality features, and imbalanced datasets. In the last five chapters, you’ll discover regularization for deep learning models. After reviewing general methods that apply to any type of neural network, you’ll dive into more NLP-specific methods for RNNs and transformers, as well as using BERT or GPT-3. Finally, you’ll explore regularization for computer vision, covering CNN specifics, along with the use of generative models such as Stable Diffusion and DALL-E. By the end of this book, you’ll be armed with different regularization techniques to apply to your ML and DL models.

Who is this book for?

This book is for data scientists, machine learning engineers, and machine learning enthusiasts looking to gain hands-on knowledge to improve the performance of their models. Basic knowledge of Python is a prerequisite.

What you will learn

  • Diagnose overfitting and the need for regularization
  • Regularize common linear models such as logistic regression
  • Understand regularizing tree-based models such as XGBoost
  • Uncover the secrets of structured data to regularize ML models
  • Explore general techniques to regularize deep learning models
  • Discover specific regularization techniques for NLP problems using transformers
  • Understand the regularization in computer vision models and CNN architectures
  • Apply cutting-edge computer vision regularization with generative models

Product Details

Publication date: Jul 31, 2023
Length: 424 pages
Edition: 1st
Language: English
ISBN-13: 9781837634088




Table of Contents

13 Chapters
Chapter 1: An Overview of Regularization
Chapter 2: Machine Learning Refresher
Chapter 3: Regularization with Linear Models
Chapter 4: Regularization with Tree-Based Models
Chapter 5: Regularization with Data
Chapter 6: Deep Learning Reminders
Chapter 7: Deep Learning Regularization
Chapter 8: Regularization with Recurrent Neural Networks
Chapter 9: Advanced Regularization in Natural Language Processing
Chapter 10: Regularization in Computer Vision
Chapter 11: Regularization in Computer Vision – Synthetic Image Generation
Index
Other Books You May Enjoy

Customer reviews

Rating distribution
4.3 out of 5 (7 Ratings)
5 star 57.1%
4 star 28.6%
3 star 0%
2 star 14.3%
1 star 0%
S.Kundu Sep 19, 2023
Rating: 5 out of 5
The Regularization Cookbook will been an awesome read if you want to explore different options to improve the functionality of your ML models. I am sharing my views regarding the same. The book will explain you what Regularization concept actually is along with giving you basic refresher of Machine Learning and Deep Learning so that it become easy for you to follow rest of the book. The book will teach you the different concepts of Regularization such as: Regularization with Linear models such as Ridge Regression, Lasso Regression and Logistic Regression. Regularization with Tree based models including Regularization of Decision Tree, Random Forest and XGBoost. Regularization with L2 Regularization, network architecture and dropout. Regularization with Recurrent Neural Networks with dropout and maximum length sequence. It will also discuss about training an RNN and GRU. Advanced Regularization in Natural Language Processing using word2vec, BERT and will discuss on data augmentation using word2vec and GPT-3. Regularization in Computer Vision along with synthetic image generation. It will discuss on Regularizing a CNN with vanilla NN methods and transfer learning. It will also cover topics on Spatial and pixel level augmentation.
Amazon Verified review Amazon
Yiqiao Yin Sep 07, 2023
Rating: 5 out of 5
In "The Regularization Cookbook," readers are provided an in-depth exploration into the world of regularization, offering a detailed roadmap from foundational concepts to advanced methodologies. The introduction establishes a firm grounding on the very essence of regularization, setting the stage for the chapters that follow. Upon establishing the basics, the book delves into practical applications, addressing the complexities of regularizing linear models such as logistic regression. It further extends its ambit to elucidate the intricacies involved in tree-based models, particularly the increasingly popular XGBoost. Such detailed treatment is both commendable and crucial for readers at various stages of their data science journey. A standout feature of this work is its comprehensive coverage of deep learning. With the burgeoning significance of Natural Language Processing (NLP) and computer vision in contemporary machine learning, the in-depth treatment of regularization methods tailored for Recurrent Neural Networks, transformers, and seminal models like BERT and GPT-3 is of paramount importance. These chapters promise to be both enlightening and essential for professionals aiming to harness the full power of these technologies. Furthermore, the segments on computer vision offer an expansive overview. Not only do they unravel the layers of Convolutional Neural Networks, but they also venture into the compelling domain of generative models, showcasing models like Dall-E. In summation, "The Regularization Cookbook" is a masterful compilation, adeptly spanning the breadth and depth of regularization in machine learning and deep learning. Whether a novice seeking foundational insights or an expert aiming for nuanced understanding, this book is poised to be an invaluable addition to one's scholarly repertoire.
Amazon Verified review Amazon
Ratan Aug 28, 2023
Rating: 5 out of 5
This books presents a fair and an organized way of regularization techniques for a lot of algorithms. This book is more on applied side rather than having mathematical rigorous proofs. It gives enough mathematical background to understand algorithms and how regularization can be applied on them. One good thing is - author connects how theoretical parameters are actually represented in scikit learn framework which is helpful if you aren’t a pro with Sklearn. I particularly liked the details about how regularization can be applied to language models which is interesting. If you are not looking for mathematical proofs and are on more applied ML side, this book is a good read overall. Author has presented topics in a very organized manner with good hands-on snippets.
Amazon Verified review Amazon
Om S Aug 13, 2023
Rating: 5 out of 5
Starting with the basics, the book unveils the importance of regularization and offers expert insights into diagnosing overfitting. From there, it delves into an array of techniques applicable to various machine learning models, including linear and tree-based models. The authors showcase how to apply these techniques effectively, ensuring a solid grasp of the concepts. A standout feature of the book is its dedication to real-world scenarios. It addresses challenges associated with high cardinality features and imbalanced datasets, offering tailored regularization methods. The deep learning sections are equally compelling, guiding readers through strategies for both general neural networks and NLP-specific applications like transformers, BERT, and GPT-3. What sets "The Regularization Cookbook" apart is its accessibility. While catering to experienced practitioners, it also accommodates those new to the field. The book provides Python codes and revisits fundamental concepts, ensuring a smooth learning curve for all readers. In a landscape where model optimization is paramount, "The Regularization Cookbook" emerges as an essential tool. It empowers readers to elevate their understanding of regularization, ultimately enhancing the performance, robustness, and reliability of their machine learning and deep learning models. Whether you're an enthusiast, data scientist, or machine learning professional, this book is your guide to becoming a regularization virtuoso.
Amazon Verified review Amazon
Amazon Customer Aug 24, 2023
Rating: 4 out of 5
"The Regularization Cookbook" is an invaluable resource for understanding regularization techniques in machine learning. It strikes a balance between theory and practical implementation, making it accessible to beginners and experienced practitioners alike. The book covers a wide range of regularization methods with clear explanations and code examples. It also presents the pros and cons of each technique, helping readers make informed decisions. While some advanced topics may not be explored in depth, the book still provides a solid foundation for further exploration. Overall, it is an indispensable resource for anyone interested in regularization techniques in machine learning.
Amazon Verified review Amazon
