Real-life data is never clean
So far, for the examples in Chapters 2 and 4, you have used two datasets from scikit-learn. Both are well-known, well-studied datasets that may have been pre-cleaned. Datasets from Kaggle or from classes are often similarly clean. Data for real-life problems, by contrast, is never clean. You will find missing data, values that make no sense, inconsistencies in naming, and other problems. Typically, most of the time on a data science project goes to cleaning and preparing the data for modeling. This includes exploring the data, talking with domain experts to understand it, and doing the cleaning itself. As you learned in Chapters 2 and 4, taking time early in the process to visualize the data with graphs saves time later, during modeling.
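To make the three kinds of problems named above concrete, here is a minimal sketch using pandas. The table, its column names (`city`, `age`, `income`), and the plausible age range of 0 to 120 are all invented for illustration; they are assumptions, not data from this book. Real checks depend on the dataset and on what domain experts tell you is plausible.

```python
import numpy as np
import pandas as pd

# Hypothetical toy records: every name and value here is made up for illustration.
raw = pd.DataFrame({
    "city": ["Boston", "boston ", "BOSTON", "Chicago", None],
    "age": [34, -1, 29, 210, 41],
    "income": [52000.0, np.nan, 61000.0, 58000.0, np.nan],
})

# 1. Missing data: count the missing entries in each column.
missing_counts = raw.isna().sum()

# 2. Values that make no sense: flag ages outside an assumed plausible range.
implausible_age = ~raw["age"].between(0, 120)

# 3. Inconsistent naming: strip whitespace and unify case so that
#    variants of the same label collapse into one.
raw["city_clean"] = raw["city"].str.strip().str.title()

print(missing_counts)
print(raw[implausible_age])
print(raw["city_clean"].value_counts(dropna=False))
```

Checks like these only surface candidate problems. Deciding whether an age of -1 is a typo, a placeholder for "unknown", or something else usually takes a conversation with someone who knows how the data was collected.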