3.6 Cleaning raw data with generator functions
One task that arises in exploratory data analysis is cleaning up raw source data. This is often done as a composite operation that applies several scalar functions to each piece of input data to create a usable dataset.
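As a rough illustration of this idea, the following sketch applies a sequence of scalar functions to each item of an iterable and yields the cleaned values one at a time. The names `clean()`, `remove_commas()`, and the sample strings are hypothetical, chosen only for this example.

```python
from collections.abc import Callable, Iterable, Iterator
from typing import Any


def remove_commas(text: str) -> str:
    """Hypothetical scalar function: drop thousands separators."""
    return text.replace(",", "")


def clean(items: Iterable[Any], *steps: Callable[[Any], Any]) -> Iterator[Any]:
    """Apply each scalar function in steps, in order, to every item."""
    for item in items:
        for step in steps:
            item = step(item)
        yield item


# Made-up raw values, not taken from any real file.
raw = ["1,004.5 ", " 8.04", "13.0"]
cleaned = list(clean(raw, str.strip, remove_commas, float))
```

Because `clean()` is a generator function, nothing is computed until the result is consumed, so the same composite operation can be applied to a large file without building intermediate lists.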
Let’s look at a simplified set of data that is commonly used to demonstrate techniques in exploratory data analysis. It’s called Anscombe’s quartet, and it comes from the article "Graphs in Statistical Analysis" by F. J. Anscombe, which appeared in The American Statistician in 1973. The following are the first few rows of a downloaded file with this dataset:
Anscombe’s quartet
I II III IV
x y x y x y x y
10.0 8.04 10.0 9.14 10.0 7.46 8.0 6.58
8.0 6.95 8.0 8.14 8.0 6.77 8.0 5.76
13.0 7.58 13.0 8.74 13.0 12.74 8.0 7.71
Since the data is properly tab-delimited, we can use the csv.reader() function to iterate through the various rows. Sadly, we can’t trivially process...
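A minimal sketch of reading these rows with csv.reader() follows. It assumes the file is tab-delimited exactly as the sample rows suggest; the in-memory `SAMPLE` string and the `row_iter()` name are stand-ins for this example, since the actual file name is not given here.

```python
import csv
import io
from collections.abc import Iterator
from typing import TextIO

# Assumed stand-in for the downloaded file: the rows shown above,
# with columns separated by single tab characters.
SAMPLE = (
    "Anscombe’s quartet\n"
    "I\tII\tIII\tIV\n"
    "x\ty\tx\ty\tx\ty\tx\ty\n"
    "10.0\t8.04\t10.0\t9.14\t10.0\t7.46\t8.0\t6.58\n"
    "8.0\t6.95\t8.0\t8.14\t8.0\t6.77\t8.0\t5.76\n"
    "13.0\t7.58\t13.0\t8.74\t13.0\t12.74\t8.0\t7.71\n"
)


def row_iter(source: TextIO) -> Iterator[list[str]]:
    """Iterate over the tab-delimited rows of an open text file."""
    return csv.reader(source, delimiter="\t")


rows = list(row_iter(io.StringIO(SAMPLE)))
```

With this sample, each element of `rows` is a list of strings. The first three rows hold the title and column headings, and the numeric values in the remaining rows are still text rather than numbers.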