AWS Certified Machine Learning - Specialty (MLS-C01) Certification Guide

You're reading from AWS Certified Machine Learning - Specialty (MLS-C01) Certification Guide: The ultimate guide to passing the MLS-C01 exam on your first attempt

Product type Paperback
Published in Feb 2024
Publisher Packt
ISBN-13 9781835082201
Length 342 pages
Edition 2nd Edition
Authors (2): Somanath Nanda, Weslley Moura


Data splitting

Training and evaluating ML models are key tasks of the modeling pipeline. ML algorithms need data to find relationships among features in order to make inferences, but those inferences need to be validated before they are moved to production environments.

The dataset used to train ML models is commonly called the training set. This training data must be able to represent the real environment where the model will be used; it will be useless if that requirement is not met.

Coming back to the fraud example presented in Table 1.1, based on the training data, you found that e-commerce transactions with a value greater than $5,000 and processed at night are potentially fraudulent cases. With that in mind, after applying the model in a production environment, the model is supposed to flag similar cases, as learned during the training process.

Therefore, if those cases exist only in the training set, the model will flag false positives in production environments. The opposite scenario is also true: if a particular fraud pattern appears in production data but is not reflected in the training data, the model will produce a lot of false negatives. False positive and false negative rates are just two of many quality metrics that you can use for model validation. These metrics will be covered in much more detail later on.

By this point, you should have a clear understanding of the importance of having a good training set. Now, supposing you do have a valid training set, how could you have some level of confidence that this model will perform well in production environments? The answer is by using testing and validation sets:

Figure 1.4 – Data splitting


Figure 1.4 shows the different types of data splitting that you can have during training and inference pipelines. The training data is used to create the model; the testing data is used to extract the final model quality metrics. The testing data must not take part in the training process; its only purpose is to extract model metrics.

The reason to avoid using the testing data during training is simple: you cannot let the model learn on top of the data that will be used to validate it. This technique of holding one piece of the data for testing is often called hold-out validation.

The box on the right side of Figure 1.4 represents the production data. Production data usually comes in continuously and you have to execute the inference pipeline in order to extract model results from it. No training, nor any other type of recalculation, is performed on top of production data; you just have to pass it through the inference pipeline as it is.

From a technical perspective, most ML libraries implement training steps with the .fit method, while inference steps are implemented by the .transform or .predict method. Again, this is just a common pattern used by most ML libraries, but be aware that you might find different naming conventions across ML libraries.
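The hold-out idea and the fit/predict pattern can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: ThresholdClassifier and hold_out_split are hypothetical helpers written for this example (they do not come from any library), the transaction values are synthetic, and the 80/20 split is an arbitrary choice.

```python
import random


class ThresholdClassifier:
    """Hypothetical toy estimator that follows the fit/predict convention."""

    def fit(self, values, labels):
        # Training: learn one cut-off, the smallest value among fraud examples.
        fraud_values = [v for v, label in zip(values, labels) if label == 1]
        self.threshold_ = min(fraud_values) if fraud_values else float("inf")
        return self

    def predict(self, values):
        # Inference: apply the learned cut-off; nothing is re-learned here.
        return [1 if v >= self.threshold_ else 0 for v in values]


def hold_out_split(values, labels, test_fraction=0.2, seed=42):
    """Shuffle the row indices, then reserve the first chunk for testing."""
    indices = list(range(len(values)))
    random.Random(seed).shuffle(indices)
    n_test = int(len(values) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(rows, idx):
        return [rows[i] for i in idx]

    return (pick(values, train_idx), pick(values, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


# Synthetic transaction values; label 1 (fraud) when the value exceeds $5,000.
values = list(range(100, 10100, 100))
labels = [1 if v > 5000 else 0 for v in values]

X_train, X_test, y_train, y_test = hold_out_split(values, labels)

model = ThresholdClassifier().fit(X_train, y_train)  # uses training data only
predictions = model.predict(X_test)                  # uses testing data only
test_accuracy = sum(p == t for p, t in zip(predictions, y_test)) / len(y_test)
print(f"hold-out accuracy: {test_accuracy:.2f}")
```

Note that the testing rows never reach fit; they are only passed to predict to compute the final metric.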

Still looking at Figure 1.4, there is another box, close to the training data, named Validation data. This is a subset of the training set often used to support the creation of the best model, before moving on to the testing phase. You will learn about validation sets in much more detail, but first, you should understand why you need them.

Overfitting and underfitting

ML models might suffer from two types of fitting issues: overfitting and underfitting. Overfitting means that your model performs very well on the training data but cannot be generalized to other datasets, such as testing and, even worse, production data. In other words, if you have an overfitted model, it only works on your training data.

When you are building ML models, you want to create solutions that are able to generalize what they have learned and infer decisions on other datasets that follow the same data distribution. A model that only works on the data that it was trained on is useless. Overfitting usually happens when the model has too many features or when the hyperparameters of the algorithm have not been properly tuned.

On the other hand, underfitted models cannot fit the data during the training phase. As a result, they are so generic that they can’t perform well on the training, testing, or production data. Underfitting usually happens due to a lack of good features or observations, or due to insufficient training time (some algorithms need more iterations to properly fit the model).

Both overfitting and underfitting need to be avoided. There are many modeling techniques to work around them. For instance, you will learn about the commonly used cross-validation technique and its relationship with the validation data box shown in Figure 1.4.

Applying cross-validation and measuring overfitting

Cross-validation is a technique where you split the training set into training and validation sets. The model is then trained on the training set and tested on the validation set. The most common cross-validation strategy is known as k-fold cross-validation, where k is the number of splits of the training set.

Using k-fold cross-validation and assuming the value of k equals 10, you are splitting the training set into 10 folds. The model will be trained and tested 10 times. On each iteration, it uses 9 splits for training and leaves one split for testing. After 10 executions, the evaluation metrics extracted from each iteration are averaged and will represent the final model performance during the training phase, as shown in Figure 1.5:

Figure 1.5 – Cross-validation in action


Another common cross-validation technique is known as leave-one-out cross-validation (LOOCV). In this approach, the model is executed many times and, within each iteration, one observation is separated for testing and all the others are used for training.
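The fold mechanics described above can be sketched as follows. This is a minimal illustration, not a library implementation: k_fold_indices and cross_validate are hypothetical helpers, score_fn stands for whatever "train on these rows, evaluate on those rows" routine you use, and the folds are contiguous, so the sketch assumes the data has already been shuffled.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def cross_validate(score_fn, n_samples, k):
    """Average score_fn(train_idx, val_idx) over the k folds."""
    scores = [score_fn(train_idx, val_idx)
              for train_idx, val_idx in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


# 10-fold cross-validation on 100 rows: 90 rows train, 10 rows validate, 10 times.
folds = list(k_fold_indices(n_samples=100, k=10))

# LOOCV is the special case where k equals the number of observations.
loocv_folds = list(k_fold_indices(n_samples=5, k=5))
```

Each observation lands in exactly one validation fold, which is what "without replacement" means in this context.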

There are many advantages of using cross-validation during training:

  • You mitigate overfitting in the training data since the model is always trained on a particular chunk of data and tested on another chunk that hasn’t been used for training.
  • You avoid overfitting in the test data since there is no need to keep using the testing data to optimize the model.
  • You expose the presence of overfitting or underfitting. If the model performance in the training/validation data is very different from the performance observed in the testing data, something is wrong.

It might be worth diving into the third item on that list since it is widely covered in the AWS Machine Learning Specialty exam. For instance, assume you are creating a binary classification model, using cross-validation during training, and using a testing set to extract final metrics (hold-out validation). If you get 80% accuracy in the cross-validation results and 50% accuracy in the testing set, it means that the model was overfitted to the training set, and so cannot be generalized to the testing set.

On the other hand, if you get 50% accuracy in the training set and 80% accuracy in the testing set, there is a systemic issue in the data. It is very likely that the training and testing sets do not follow the same distribution.
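The reasoning in the two scenarios above can be condensed into a small helper. This is a sketch only: diagnose_fit is a hypothetical function, and the 0.05 tolerance is an arbitrary assumption chosen for illustration, not a rule taken from the exam or from any library.

```python
def diagnose_fit(cv_accuracy, test_accuracy, tolerance=0.05):
    """Compare cross-validation accuracy with hold-out (testing) accuracy."""
    gap = cv_accuracy - test_accuracy
    if gap > tolerance:
        # Much better during training than on unseen data.
        return "possible overfitting"
    if gap < -tolerance:
        # Better on the testing set than during training: inspect the data.
        return "possible distribution mismatch"
    return "consistent"


print(diagnose_fit(0.80, 0.50))  # possible overfitting
print(diagnose_fit(0.50, 0.80))  # possible distribution mismatch
```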

Important note

Accuracy is a model evaluation metric commonly used on classification models. It measures how often the model made a correct decision during its inference process. That metric was selected just for the sake of demonstration, but be aware that there are many other evaluation metrics applicable for each type of model (which will be covered at the appropriate time).

Bootstrapping methods

Cross-validation is a good strategy to validate ML models, and you should try it in your daily activities as a data scientist. However, you should also know about other resampling techniques available out there. Bootstrapping is one of them.

While cross-validation works with no replacement, a bootstrapping approach works with replacement. With replacement means that, while you are drawing multiple random samples from a population dataset, the same observation might be duplicated across samples.

Usually, bootstrapping is not used to validate models as you do in the traditional cross-validation approach. The reason is simple: since it works with replacement, the same observation used for training could potentially be used for testing, too. This would result in inflated model performance metrics since the estimator is likely to be correct when predicting an observation that was already seen in the training set.
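Sampling with replacement is easy to see with row indices. The sketch below uses only the Python standard library; bootstrap_sample is a hypothetical helper, and the sample size and seed are arbitrary. Rows that are never drawn are commonly called out-of-bag observations, a term that does not appear in the text above.

```python
import random


def bootstrap_sample(n_samples, seed=0):
    """Draw n_samples row indices with replacement."""
    rng = random.Random(seed)
    sampled = [rng.randrange(n_samples) for _ in range(n_samples)]
    # Indices that were never drawn in this bootstrap sample.
    out_of_bag = sorted(set(range(n_samples)) - set(sampled))
    return sampled, out_of_bag


sampled, out_of_bag = bootstrap_sample(n_samples=10)

# Every repeated draw takes the place of a row that was left out.
n_duplicates = len(sampled) - len(set(sampled))
print(f"drawn: {sampled}")
print(f"repeated draws: {n_duplicates}, never drawn: {out_of_bag}")
```

Because the sample has the same size as the dataset, each repeated draw means some other row was left out entirely.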

Bootstrapping is often embedded in ML algorithms that need resampling capabilities to process the data. In this context, bootstrapping is not used to validate the model but to create the model. Random forest, which will be covered in Chapter 6, Applying Machine Learning Algorithms, is one of those algorithms that use bootstrapping internally for model building.

Designing a good data splitting/sampling strategy is crucial to the success of the model or the algorithm. You should come up with different approaches to split your data, check how the model is performing on each split, and make sure those splits represent the real scenario where the model will be used.

The variance versus bias trade-off

Every ML model is expected to contain errors. There are three types of errors that you can find in models: bias errors, variance errors, and unexplained (irreducible) errors. The last one, as the name suggests, cannot be explained. It is often related to the context of the problem and the relationships between the variables (you can’t control it).

The other two types of errors can be controlled during modeling. You can say that there is a trade-off between bias and variance errors because one will influence the other. In this case, increasing bias will decrease variance and vice versa.

Bias errors relate to assumptions taken by the model to learn the target function, the one that you want to solve. Some types of algorithms, such as linear algorithms, usually carry over that type of error because they make a lot of assumptions during model training. For example, linear models assume that the relationship present in the data is linear. Linear regression and logistic regression are types of algorithms that, in general, contain high bias. Decision trees, on the other hand, are types of algorithms that make fewer assumptions about the data and contain less bias.

Variance relates to the difference in estimations that the model performs on different training data. Models with high variance usually overfit the training set. Decision trees are examples of algorithms with high variance (they usually rely a lot on specifics of the training set, failing to generalize), and linear and logistic regression are examples of algorithms with low variance. It does not mean that decision trees are bad estimators; it just means that you need to prune (optimize) them during training.

That being said, the goal of any model is to minimize both bias and variance. However, as already mentioned, each one will impact the other in the opposite direction. For the sake of demonstration, consider a decision tree to understand how this trade-off works.

Decision trees are nonlinear algorithms and often contain low bias and high variance. In order to decrease variance, you can prune the tree, for example by setting the max_depth hyperparameter (the maximum allowed depth of the tree) to 10. That will force a more generic model, reducing variance. However, that change will also force the model to make more assumptions (since it is now more generic) and increase bias.
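The effect of limiting tree depth can be observed with a short experiment. This sketch assumes scikit-learn is installed; the synthetic dataset, the 10% label noise, and the depth of 3 (rather than the 10 mentioned above, because the dataset is small) are illustrative assumptions, and the exact accuracy values will vary with them.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data with some label noise (flip_y).
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

results = {}
for max_depth in (None, 3):  # None lets the tree grow until leaves are pure
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    tree.fit(X_train, y_train)
    results[max_depth] = {
        "depth": tree.get_depth(),
        "train_accuracy": tree.score(X_train, y_train),
        "test_accuracy": tree.score(X_test, y_test),
    }

for max_depth, metrics in results.items():
    print(max_depth, metrics)
```

The unrestricted tree memorizes the training set (low bias, high variance), while the shallow tree gives up some training accuracy in exchange for a simpler, more generic model.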

Shuffling your training set

Now that you know what variance and data splitting are, you can go a little deeper into the training dataset requirements. You are very likely to find questions around data shuffling in the exam. This process consists of randomizing your training dataset before you start using it to fit an algorithm.

Data shuffling will help the algorithm to reduce variance by creating a more generalizable model. For example, let’s say your training set represents a binary classification problem and it is sorted by the target variable (all cases belonging to class “0” appear first, then all the cases belonging to class “1”).

When you fit an algorithm on this sorted data (especially some algorithms that rely on batch processing), it will make strong assumptions about the pattern of one of the classes, since it is very likely that it won’t be able to create random batches of data with a good representation of both classes. Once the algorithm builds strong assumptions about the training data, it might be difficult for it to change them.
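The batching problem with sorted data, and the fix, can be sketched with the standard library alone. The dataset below is hypothetical (100 rows sorted by class), and the batch size of 20 and the seed are arbitrary choices for illustration.

```python
import random

# Hypothetical training set sorted by target: 50 rows of class 0, then 50 of class 1.
features = list(range(100))
labels = [0] * 50 + [1] * 50


def batches(items, batch_size):
    """Split a list into consecutive chunks of batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Without shuffling, most batches contain a single class.
sorted_batches = batches(labels, batch_size=20)
single_class_batches = sum(1 for b in sorted_batches if len(set(b)) == 1)
print(f"single-class batches before shuffling: {single_class_batches} of 5")

# Shuffle features and labels with the same permutation so rows stay aligned.
order = list(range(len(labels)))
random.Random(7).shuffle(order)
shuffled_features = [features[i] for i in order]
shuffled_labels = [labels[i] for i in order]
shuffled_batches = batches(shuffled_labels, batch_size=20)
```

Applying one permutation to both lists matters: shuffling features and labels independently would break the link between each row and its target.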

Important note

Some algorithms are able to execute the training process by fitting the data in chunks, also known as batches. This approach lets the model learn more frequently since it will make partial assumptions after processing each batch of data (instead of making decisions only after processing the entire dataset).

On the other hand, there is no need to shuffle the testing set, since it will be used only by the inference process to check model performance.
