Up until now, we have covered how to identify the types of data, the ways data can be missing, and the ways we can fill in missing data. Now, let's talk about how we can manipulate our data (and our features) to enhance our machine learning pipelines further. So far, we have tried four different ways of manipulating our dataset, and the best cross-validated accuracy we have achieved with a KNN model is .745. If we look back at some of the EDA we did earlier, we will notice something about our features:
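# fill in missing values with each column's mean so that we can see all 9 columns
# (Imputer was removed in scikit-learn 0.22; on newer versions,
# sklearn.impute.SimpleImputer(strategy='mean') is the replacement)
impute = Imputer(strategy='mean')
pima_imputed_mean = pd.DataFrame(impute.fit_transform(pima), columns=pima_column_names)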
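To make the effect of the mean strategy concrete, here is a minimal sketch on a tiny, made-up DataFrame. The column names and values are hypothetical stand-ins rather than the actual pima data, and it uses plain pandas (`fillna` with the column means) instead of the imputer object above. For numeric columns this should give the same filled-in values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the pima DataFrame; NaN marks a missing value
toy = pd.DataFrame({
    "glucose": [148.0, np.nan, 183.0],
    "bmi": [33.6, 26.6, np.nan],
})

# Mean imputation: replace each NaN with the mean of the observed values in its column
# (toy.mean() skips NaN by default, so only observed values contribute)
toy_imputed_mean = toy.fillna(toy.mean())

print(toy_imputed_mean)
```

Each missing cell ends up holding the average of the observed values in its own column, while the observed values themselves are left untouched.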
Now, let's use a standard histogram to see the distribution across all nine columns, as follows, specifying a figure...