Data-cleaning methods
Now that you have examined the dataset for problems, you need to correct the problems you’ve found. This is commonly the most time-consuming part of a project. When you change values, keep the existing data and write the cleaned values to a new column. If you drop rows, create a new version of the data table. This lets you return to the original data easily if needed and gives you options when doing feature engineering, which you will learn about in Chapter 7.
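The following is a minimal sketch of that habit, assuming you work in Python with pandas. The table, the column names (city, price, city_clean), and the variable names are hypothetical and are used only for illustration.

```python
import pandas as pd

# Hypothetical raw data: column names and values are illustrative only.
raw = pd.DataFrame({
    "city": [" Boston", "boston ", "Chicago", None],
    "price": [100.0, 250.0, None, 80.0],
})

# Changing values: write the cleaned result to a new column
# and leave the original "city" column untouched.
cleaned = raw.copy()
cleaned["city_clean"] = cleaned["city"].str.strip().str.title()

# Dropping rows: create a new version of the table
# instead of modifying the existing one in place.
no_missing = cleaned.dropna(subset=["city", "price"]).reset_index(drop=True)
```

Because raw and cleaned are never overwritten, you can compare any cleaned value with its original or rebuild a later table from an earlier one.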
Split your dataset before you clean the data to avoid inflating model accuracy
Before you do any data cleaning, split your data into training and test sets. This prevents information from the test set from leaking into the training set, which would inflate the results of accuracy tests. For example, if you calculate a mean of the full dataset before you do the train/test split and have that data in a column in both datasets, you’ve provided the...