Cleaning missing values and invalid data
By default, the pandas read_csv() function will read a variable as non-numeric (string) if it contains at least one string (text). So, what’s the difference between the nan instances in the Petal_width column and those in the Sepal_width column? Python will convert empty cells into nan values but will keep the numeric nature of the variable, as is the case for the Petal_length variable.
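The difference between a text entry and an empty cell can be checked directly. The following is a minimal sketch, not the dataset used in this chapter: the three rows are made up, and the word missing is an assumed marker. It was chosen because read_csv() already treats a few spellings, such as NA and NaN, as missing by default.

```python
import io

import pandas as pd

# Hypothetical three-row sample; the column names follow the text,
# the values and the "missing" marker are invented for illustration.
csv_text = """Sepal_width,Petal_width
3.5,0.2
missing,
3.2,0.4
"""

df = pd.read_csv(io.StringIO(csv_text))

# Sepal_width contains one text entry, so the whole column is read as text.
sepal_is_numeric = pd.api.types.is_numeric_dtype(df["Sepal_width"])

# Petal_width only has an empty cell: it becomes NaN and the column stays numeric.
petal_is_numeric = pd.api.types.is_numeric_dtype(df["Petal_width"])
petal_missing_count = int(df["Petal_width"].isna().sum())

print("Sepal_width numeric?", sepal_is_numeric)
print("Petal_width numeric?", petal_is_numeric)
print("Missing values in Petal_width:", petal_missing_count)
```

Checking the types with is_numeric_dtype() rather than comparing against a specific dtype name keeps the sketch independent of how a given pandas version labels text columns.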
In biostatistics, experimenters might use different words to mark a missing value, such as Nan or NA (short for not applicable), or even whole words such as missing or not applicable. Remember that Nan and NA are still strings, so if a cell contains one of them, Python will read it as a string and coerce the whole variable into a string variable. This wasn’t the case for Petal_width since Python read empty cells as nan and didn’t coerce the variable into a string, instead keeping it numeric. In this case, Python read nan as the valid...