You're reading from Supervised Machine Learning with Python Develop rich Python coding practices while exploring supervised machine learning

Product type Paperback

Published in May 2019

Publisher Packt

ISBN-13 9781838825669

Length 162 pages

Edition 1st Edition

Languages

Python

Tools

Scikit-learn

Concepts

Machine Learning

Author (1):

Taylor Smith

View More author details

Model evaluation and data splitting

In this chapter, we will define what it means to evaluate a model, best practices for gauging the advocacy of a model, how to split your data, and several considerations that you'll have to make when preparing your split.

It is important to understand some core best practices of machine learning. One of our primary tasks as ML practitioners is to create a model that is effective for making predictions on new data. But how do we know that a model is good? If you recall from the previous section, we defined supervised learning as simply a task that learns a function from labelled data such that we can approximate the target of the new data. Therefore, we can test our model's effectiveness. We can determine how it performs on data that is never seen—just like it's taking a test.

Out-of-sample versus in-sample evaluation

Let's say we are training a small machine which is a simple classification task. Here's some nomenclature you'll need: the in-sample data is the data the model learns from and the out-of-sample data is the data the model has never seen before. One of the pitfalls many new data scientists make is that they measure their model's effectiveness on the same data that the model learned from. What this ends up doing is rewarding the model's ability to memorize, rather than its ability to generalize, which is a huge difference.

If you take a look at the two examples here, the first presents a sample that the model learned from, and we can be reasonably confident that it's going to predict one, which would be correct. The second example presents a new sample, which appears to resemble more of the zero class. Of course, the model doesn't know that. But a good model should be able to recognize and generalize this pattern, shown as follows:

So, now the question is how we can ensure both in-sample and out-of-sample data for the model to prove its worth. Even more precisely, our out-of-sample data needs to be labeled. New or unlabeled data won't suffice because we have to know the actual answer in order to determine how correct the model is. So, one of the ways we can handle this in machine learning is to split our data into two parts: a training set and a testing set. The training set is what our model will learn on; the testing set is what our model will be evaluated on. How much data you have matters a lot. In fact, in the next sections, when we discuss the bias-variance trade-off, you'll see how some models require much more data to learn than others do.

Another thing to keep in mind is that if some of the distributions of your variables are highly skewed, or you have rare categorical levels embedded throughout, or even class imbalance in your y vector, you may end up getting a bad split. As an example, let's say you have a binary feature in your X matrix that indicates the presence of a very rare sensor for some event that occurs every 10,000 occurrences. If you randomly split your data and all of the positive sensor events are in your test set, then your model will learn from the training data that the sensor is never tripped and may deem that as an unimportant variable when, in reality, it could be hugely important, and hugely predictive. So, you can control these types of issues with stratification.

Splitting made easy

Here, we have a simple snippet that demonstrates how we can use the scikit-learn library to split our data into training and test sets. We're loading the data in from the datasets module and passing both X and y into the split function. We should be familiar with loading the data up. We have the train_test_split function from the model_selection submodule in sklearn. This is going to take any number of arrays. So, 20% is going to be test_size, and the remaining 80% of that data will be training. We define random_state, so that our split can be reproducible if we ever have to prove exactly how we got this split. There's also the stratify keyword, which we're not using here, which can be used to stratify a split for rare features or an imbalanced y vector:

from sklearn.datasets import load_boston

from sklearn.model_selection import train_test_split



boston_housing = load_boston() # load data

X, y = boston_housing.data, boston_housing.target # get X, y

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,

                                                                                                              random_state=42)

# show num samples (there are no duplicates in either set!)
print("Num train samples: %i" % X_train.shape[0])

print("Num test samples: %i" % X_test.shape[0])

The output of the preceding code is as follows:

Num train samples: 404
Num test samples: 102

You're reading from Supervised Machine Learning with Python Develop rich Python coding practices while exploring supervised machine learning

Table of Contents (6) Chapters

Model evaluation and data splitting

Out-of-sample versus in-sample evaluation

Splitting made easy

Authors (1)

Other recommended products

Personalised recommendations for you