What to look for when exploring data
As mentioned in Chapters 2 and 4, we like to graph data to understand what a dataset contains and what potential problems it might have. The tools and methods used to graph the data depend on the size of the dataset and the format the data is in. For example, if a smaller dataset is in Excel or CSV format, it may be easier and faster to explore it in Excel. Larger datasets are more easily explored in Python with pandas, especially if they exceed the size Excel can load fully (a worksheet holds at most 1,048,576 rows). Generally, you are checking for problems in the data such as missing values, values that don’t make sense, misspellings in text data, and so on. You also want to look for relationships among the input parameters, and between the input parameters and the parameter you want the model to predict. When you are comfortable with how the data looks and you have addressed problems with the data by cleaning it, you...
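
The checks described above (missing values, values that don’t make sense, misspellings, and relationships between parameters) can be sketched in pandas roughly as follows. This is a minimal illustration, not a complete workflow: the toy dataset, its column names (age, city, income), and the valid age range of 0 to 120 are all assumptions made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; column names and values are invented for illustration.
df = pd.DataFrame({
    "age": [34, 51, np.nan, 29, -5, 42],
    "city": ["Boston", "Bostn", "Denver", "Denver", "Boston", None],
    "income": [72000, 95000, 58000, 61000, 70000, 88000],
})

# 1. Missing values: count the NaN/None entries in each column.
missing_counts = df.isna().sum()

# 2. Values that don't make sense: assume a valid age lies between 0 and 120.
bad_age_rows = df[(df["age"] < 0) | (df["age"] > 120)]

# 3. Misspellings in text data: categories that appear only once are
#    worth a second look, since they are often typos of a common value.
city_counts = df["city"].value_counts()
rare_cities = city_counts[city_counts == 1].index.tolist()

# 4. Relationships between numeric columns: pairwise Pearson correlation.
correlations = df[["age", "income"]].corr()

print(missing_counts)
print(bad_age_rows)
print(rare_cities)
print(correlations)
```

On a real dataset, the DataFrame would typically come from `pd.read_csv` or `pd.read_excel` rather than being typed in by hand. The count-of-one threshold for rare categories is only a starting heuristic; in a larger dataset a legitimate category may also appear once, so flagged values still need a manual look.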