Preparing data for modeling
To start modeling the data, you need to split it into a training set and a testing set. You can do this with the train_test_split function in scikit-learn. Do the split at the start of the data preparation process. That way, you won’t accidentally leak information between the training set and the testing set, which would inflate the results of accuracy tests. In other words, nothing the model sees during training should carry information about the test data. For example, if you calculate the mean of a column over the full dataset before you do the train/test split and store that value in a column in both datasets, the training set now contains information derived from the testing set, and vice versa. You want your model to perform as well as possible, but you don’t want its measured accuracy to be artificially increased.
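The leakage example above can be sketched as follows. This is a minimal illustration, not the dataset used in this text: the random feature matrix X, the target y, the 80/20 split, and the random_state value are all assumptions made for the demo. The point is only the ordering: split first, then compute the mean from the training rows and reuse that same value on the test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, one numeric feature, a binary target.
rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=2.0, size=(100, 1))
y = rng.integers(0, 2, size=100)

# Split before computing any statistics on the data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Compute the mean from the training rows only.
train_mean = X_train.mean(axis=0)

# Apply the training mean to both sets, so no test-set
# information influences what the model is trained on.
X_train_centered = X_train - train_mean
X_test_centered = X_test - train_mean
```

Had the mean been computed on X before the split, every centered training value would depend partly on the test rows, which is the kind of leak described above.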
To use the train_test_split function...