Data imputation
Missing data is ubiquitous, and data imputation techniques help us alleviate its influence.
In this section, we are going to use the heart disease data to examine the pros and cons of basic data imputation. I recommend you read the dataset description beforehand to understand the meaning of each column.
Preparing the dataset for imputation
The heart disease dataset is the same one we used earlier in the Collecting data from various data sources section. The fact that it arrives with missing values is a red flag: you shouldn't take data integrity for granted. The following screenshot shows missing data denoted by question marks:
First, let's do an info() call that lists column data type information:
df.info()
Note
df.info() is a very helpful function that provides you with pointers for your next move. It should be the first function call you make when given an unknown dataset.
The following screenshot shows the output obtained from the preceding function:
If pandas can't infer the data type of a column, it falls back to the object type. For example, the chol (cholesterol) column contains missing data: the missing entries are question marks treated as strings, while the rest of the data is of the float type, so the column as a whole is stored as object.
Python's type tolerance
As Python is pretty error-tolerant, it is good practice to introduce explicit type checks. For example, if a column mixes numerical values and strings, don't rely on truthiness to tell them apart; explicitly check the type and write two branches. It is also advisable to avoid blind type conversion on columns whose dtype is object. Remember to make your code deterministic and future-proof.
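For instance, a minimal sketch of such an explicit check might look like this (the function name to_float_or_nan is purely illustrative and not part of the dataset or this chapter's code):

def to_float_or_nan(val):
    # Branch explicitly on the type instead of relying on truthiness.
    if isinstance(val, str):
        # Strings are either the "?" placeholder or a number written as text.
        return float("nan") if val.strip() == "?" else float(val)
    elif isinstance(val, (int, float)):
        # Already numeric: pass through as a float.
        return float(val)
    else:
        # Surface unexpected types instead of silently coercing them.
        raise TypeError(f"Unexpected type: {type(val)}")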
Now, let's replace the question marks with NaN values. The following code snippet declares a function that handles three different cases and treats each appropriately. The three cases are listed here:
- The record value is "?".
- The record value is of the integer type. This case is treated independently because columns such as num should be binary; converting them to floating-point numbers would defeat the purpose of the 0-1 encoding.
- The rest: valid strings that can be converted to float numbers, and values that are already floats.
The code snippet will be as follows:
import numpy as np

def replace_question_mark(val):
    if val == "?":
        return np.nan
    elif type(val) == int:
        return val      # keep integer-coded columns (such as num) as integers
    else:
        return float(val)

df2 = df.copy()
# Note: DataFrame.iteritems() was removed in pandas 2.0; use items() there instead.
for (columnName, _) in df2.iteritems():
    df2[columnName] = df2[columnName].apply(replace_question_mark)
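As a side note, if you assume that every entry other than "?" can be parsed as a number, a more vectorized alternative is possible; here is a sketch (not the approach used in the rest of this section):

import pandas as pd
import numpy as np

# Replace the "?" placeholder with NaN, then coerce every column to a numeric dtype.
# errors="coerce" turns anything unparseable into NaN instead of raising an error.
df2_alt = df.replace("?", np.nan).apply(pd.to_numeric, errors="coerce")

Either way, any column containing NaN will end up with a float dtype, because NaN is itself a floating-point value.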
Now we call the info() function and the head() function, as shown here:
df2.info()
You should expect that all fields are now either floats or integers, as shown in the following output:
Now you can check the number of non-null entries for each column; different columns have different levels of completeness. age and sex don't contain missing values, but ca contains almost no valid data. This should guide your choice of data imputation. For example, strictly dropping all rows with missing values, which is also considered a form of data imputation, would remove almost the entire dataset. Let's check the shape of the DataFrame after the default missing-value drop. You'll see that there is only one row left, which is not what we want:
df2.dropna().shape
A screenshot of the output is as follows:
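Before choosing a strategy, it can also help to quantify the completeness of each column explicitly; here is a quick sketch (the exact counts depend on your copy of the dataset):

# Missing values per column, as counts and as a percentage of all rows.
missing_counts = df2.isna().sum()
missing_percent = df2.isna().mean() * 100
print(missing_counts)
print(missing_percent.round(1))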
Before moving on to other, more mainstream imputation methods, let's perform a quick review of our processed DataFrame.
Check the head of the new DataFrame. You should see that all question marks have been replaced by NaN values. NaN values are treated as legitimate numerical values, so native NumPy functions can be used on them:
df2.head()
The output should look as follows:
Now, let's call the describe() function, which generates a table of summary statistics. It is a very helpful and handy function for a quick peek at the common statistics of our dataset:
df2.describe()
Here is a screenshot of the output:
Understanding the describe() limitation
Note that the describe() function only considers valid values. In this sample, the average age value is more trustworthy than the average thal value. Also pay attention to the metadata: a numerical value doesn't necessarily have a numerical meaning. For example, a thal value is encoded as an integer with an assigned meaning.
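Since thal is an encoded categorical field, a frequency table is often more informative than its mean; a minimal sketch:

# Frequency of each encoded thal category; dropna=False also counts missing entries.
print(df2["thal"].value_counts(dropna=False))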
Now, let's examine the two most common ways of imputation.
Imputation with mean or median values
Imputation with mean or median values only works on numerical datasets. Categorical variables have no ordering structure, such as one label being larger than another, so the concepts of mean and median don't apply to them.
There are several advantages associated with mean/median imputation:
- It is easy to implement.
- Mean/median imputation doesn't introduce extreme values.
- It is fast, adding little computational overhead.
However, there are some statistical consequences of mean/median imputation. The statistics of the dataset will change. For example, the histogram for cholesterol prior to imputation is provided here:
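If you want to reproduce that histogram yourself, a minimal sketch is as follows (the bin edges match those used in the imputation snippet below):

import matplotlib.pyplot as plt

# Histogram of the original cholesterol values, with the missing entries dropped.
plt.hist(df2["chol"].dropna(), bins=range(0, 630, 30))
plt.xlabel("cholesterol")
plt.ylabel("count")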
The following code snippet performs the imputation with the mean. After imputation, the histogram shifts to the right a little bit:
chol = df2["chol"] plt.hist(chol.apply(lambda x: np.mean(chol) if np.isnan(x) else x), bins=range(0,630,30)) plt.xlabel("cholesterol imputation") plt.ylabel("count")
Imputation with the median will shift the peak to the left instead, because the median is smaller than the mean. However, the difference won't be obvious if you enlarge the bin size, as the median and the mean will then likely fall into the same bin:
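A sketch of the corresponding median imputation, mirroring the mean version above, might look like this:

# Same idea as mean imputation, but filling missing values with the median instead.
chol_median = chol.median()
plt.hist(chol.apply(lambda x: chol_median if np.isnan(x) else x), bins=range(0, 630, 30))
plt.xlabel("cholesterol median imputation")
plt.ylabel("count")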
The good news is that the shape of the distribution looks rather similar. The bad news is that we have probably made the distribution a little more concentrated around its center. We will cover such statistics in Chapter 3, Visualization with Statistical Graphs.
Note
In other cases, where the distribution is not centered or a substantial proportion of the data is missing, such imputation can be disastrous. For example, if the waiting time in a restaurant follows an exponential distribution, imputation with the mean will likely destroy the characteristic shape of the distribution.
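To make this concrete, here is a small self-contained sketch using synthetic data (not the heart disease dataset): it simulates exponential waiting times, removes 30% of them, and compares the original distribution with the mean-imputed one.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
wait = rng.exponential(scale=10.0, size=1000)    # synthetic waiting times
missing = rng.random(1000) < 0.3                 # 30% of the entries go missing
observed = np.where(missing, np.nan, wait)

# Mean imputation piles a spike of values at the observed mean,
# far away from the mode of an exponential distribution (near zero).
imputed = np.where(np.isnan(observed), np.nanmean(observed), observed)

plt.hist(wait, bins=30, alpha=0.5, label="original")
plt.hist(imputed, bins=30, alpha=0.5, label="mean-imputed")
plt.xlabel("waiting time")
plt.ylabel("count")
plt.legend()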
Imputation with the mode/most frequent value
The advantage of using the most frequent value is that it works with categorical features; the drawback is that, without a doubt, it introduces bias as well. The slope field is categorical in nature, although it looks numerical: it represents the three possible statuses of a slope value, namely positive, flat, or negative.
The following code snippet illustrates this observation:
plt.hist(df2["slope"],bins = 5) plt.xlabel("slope") plt.ylabel("count");
Here is the output:
Without a doubt, the mode is 2. Following imputation with the mode, we obtain the following new distribution:
plt.hist(df2["slope"].apply(lambda x: 2 if np.isnan(x) else x),bins=5) plt.xlabel("slope mode imputation") plt.ylabel("count");
In the following graph, pay attention to the scale on the y axis:
Replacing missing values with the mode in this case is disastrous. If positive and negative slope values have medical implications, performing prediction tasks on this preprocessed dataset will depress their weights and significance.
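As a side note, instead of hardcoding the value 2, you can compute the mode directly from the data; a brief sketch:

# Compute the mode from the data itself rather than hardcoding it.
slope_mode = df2["slope"].mode()[0]
plt.hist(df2["slope"].fillna(slope_mode), bins=5)
plt.xlabel("slope mode imputation")
plt.ylabel("count");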
Different imputation methods have their own pros and cons. The prerequisite for choosing one is to fully understand your business goals and downstream tasks. If key statistics are important, you should try to avoid distorting them. Also, remember that collecting more data is always an option.
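If you prefer a library-based workflow, scikit-learn's SimpleImputer wraps the mean, median, and most-frequent strategies discussed in this section; here is a minimal sketch, assuming scikit-learn is installed:

from sklearn.impute import SimpleImputer
import pandas as pd

# strategy can be "mean", "median", or "most_frequent".
imputer = SimpleImputer(strategy="mean")
df_imputed = pd.DataFrame(imputer.fit_transform(df2), columns=df2.columns)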