Data Preprocessing
For the computer to be able to understand the data properly, it is necessary not only to feed the data in a standardized way but also to make sure that the data does not contain outliers, noise, or missing entries. This is important because failing to do so might result in the system making assumptions that are not true to the data, which causes the model to train at a slower pace and to be less accurate due to misleading interpretations of the data.
Moreover, data preprocessing does not end there. Models do not work the same way, and each one makes different assumptions. This means that we need to preprocess in terms of the model that is going to be used. For example, some models accept only numerical data, whereas others work with nominal and numerical data.
To achieve better results during data preprocessing, a good practice is to transform (preprocess) the data in different ways, and then test the different transformations in different models. That way, you will be able to select the right transformation for the right model.
Messy Data
Data that is missing information or that contains outliers or noise is considered to be messy data. Failing to perform any preprocessing to transform the data can lead to poorly created models of the data, due to the introduction of bias and information loss. Some of the issues with data that should be avoided will be explained here.
Missing Values
Features where a few instances have values, as well as instances where there are no values for any feature, are considered missing data. As you can see from the following image, the vertical red rectangle represents a feature with only 3 values out of 10, and the horizontal rectangle represents an instance with no values at all:
Figure 1.6: An image that displays an instance with no values for any of the features, which makes it useless, and a feature with 7 missing values out of the 10 instances
Conventionally, a feature missing more than 5 to 10% of its values is considered to be missing data, and so needs to be dealt with. On the other hand, all instances that have missing values for all features should be eliminated as they do not provide any information to the model, and, on the contrary, may end up introducing bias.
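As a rough sketch of these two checks, the following snippet drops empty instances and measures the absence rate of each feature. The toy DataFrame, its column names, and the 10% cutoff are assumptions made for illustration and do not come from the titanic dataset:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: the third row is empty for every feature,
# and "height" has additional gaps.
df = pd.DataFrame({
    "age":    [22.0, 38.0, np.nan, 35.0, 54.0],
    "height": [1.70, np.nan, np.nan, np.nan, 1.82],
})

# Drop instances that have no value for any feature.
df = df.dropna(how="all")

# Fraction of missing values per feature, computed on the remaining rows.
missing_ratio = df.isnull().mean()

# Features above the chosen cutoff need to be handled (dropped or imputed).
threshold = 0.10  # assumed cutoff, the upper end of the 5 to 10% guideline
needs_handling = missing_ratio[missing_ratio > threshold].index.tolist()

print(missing_ratio)
print(needs_handling)
```

Here, only height exceeds the cutoff once the empty instance has been removed.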
When dealing with a feature with a high absence rate, it is recommended to either eliminate it or fill it with values. The most popular ways to replace the missing values are as follows:
- Mean imputation: Replacing missing values with the mean or median of the feature's available values
- Regression imputation: Replacing missing values with the predicted values obtained from a regression function
While mean imputation is a simpler approach to implement, it may introduce bias, as it assigns the same value to every instance with a missing entry. On the other hand, even though the regression approach matches the data to its predicted value, it may end up overfitting the model, as all of the introduced values follow a function.
Lastly, when the missing values are found in a text feature such as gender, the best course of action would be to either eliminate them or replace them with a class labeled uncategorized or something similar. This is mainly because it is not possible to apply either mean or regression imputation over text.
Labeling missing values with a new category (uncategorized) is mostly done when eliminating them removes an important part of the dataset, and hence is not an appropriate course of action. In this case, even though the new label may have an effect on the model depending on the rationale used to label the missing values, leaving them empty is an even worse alternative as it causes the model to make assumptions on its own.
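The three replacement strategies discussed above can be sketched as follows. This is a minimal illustration on made-up data; the column names are hypothetical, and the regression is a simple straight-line fit with NumPy's polyfit, chosen here only for brevity:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps in a numeric and a text feature.
df = pd.DataFrame({
    "years_experience": [1.0, 2.0, 3.0, 4.0, 5.0],
    "salary": [30.0, 40.0, 50.0, 60.0, np.nan],
    "gender": ["female", None, "male", "female", None],
})

# Mean imputation: every gap receives the same value.
salary_mean = df["salary"].fillna(df["salary"].mean())

# Regression imputation: fit salary against years_experience on the
# complete rows, then predict the missing salaries from the fitted line.
known = df["salary"].notnull()
slope, intercept = np.polyfit(
    df.loc[known, "years_experience"], df.loc[known, "salary"], deg=1
)
predicted = slope * df["years_experience"] + intercept
salary_reg = df["salary"].fillna(predicted)

# Text feature: label the gaps with a new category instead.
gender_filled = df["gender"].fillna("uncategorized")

print(salary_mean.iloc[4], round(salary_reg.iloc[4], 2))
print(gender_filled.tolist())
```

In this sample, mean imputation fills the gap with 45, while the regression follows the trend of the other rows and fills it with roughly 70, which illustrates how differently the two approaches can behave.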
Note
To learn more on how to detect and handle missing values, feel free to visit the following page: https://towardsdatascience.com/how-to-handle-missing-data-8646b18db0d4.
Outliers
Outliers are values that are far from the mean. This means that if the values from an attribute follow a Gaussian distribution, the outliers are located at the tails.
Outliers can be global or local. The former group represents those values that are far from the entire set of values of a feature. For example, when analyzing data from all members of a neighborhood, a global outlier would be a person who is 180 years old (as shown in the following diagram (A)). The latter, on the other hand, represents values that are far from a subgroup of values of that feature. For the same example that we saw previously, a local outlier would be a college student who is 70 years old (B), which would normally differ from other college students in that neighborhood:
Figure 1.7: An image depicting global and local outliers in a dataset
Considering both examples that have been given, the definition of an outlier does not take into account whether the value is possible. While a person aged 180 years is not plausible, a 70-year-old college student might be a possibility, yet both are categorized as outliers as they can both affect the performance of the model.
A straightforward approach to detect outliers consists of visualizing the data to determine whether it follows a Gaussian distribution, and if it does, classifying those values that fall more than a chosen number of standard deviations (typically between three and six) away from the mean as outliers. Nevertheless, there is not an exact rule to determine an outlier, and the decision to select the number of standard deviations is subjective and will vary from problem to problem.
For example, if the dataset is reduced by 40% by setting three standard deviations as the parameter to rule out values, it would be appropriate to change the number of standard deviations to four.
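To compare several choices before settling on one, a small helper can count how many values each number of standard deviations would rule out. The following is only a sketch: the function name count_outliers and the sample of ages are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

def count_outliers(values: pd.Series, k: float) -> int:
    """Count values lying more than k standard deviations from the mean."""
    lower = values.mean() - k * values.std()
    upper = values.mean() + k * values.std()
    return int(((values < lower) | (values > upper)).sum())

# Hypothetical sample: ages 18 to 67 plus one implausible value.
ages = pd.Series(np.append(np.arange(18, 68, dtype=float), 180.0))

# Compare how much data each choice of k would rule out.
for k in (3, 4, 5, 6):
    n = count_outliers(ages, k)
    print(f"k={k}: {n} outliers ({n / len(ages):.1%} of the data)")
```

With this sample, the value 180 is flagged for three, four, and five standard deviations but not for six, because the outlier itself inflates the standard deviation. This is one reason why the choice remains a judgment call.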
On the other hand, when dealing with text features, detecting outliers becomes even trickier as there are no standard deviations to use. In this case, counting the occurrences of each class value would help to determine whether a certain class is indispensable or not. For instance, in clothing sizes, having a size XXS that represents less than 5% of the entire dataset might not be necessary.
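Counting the share of each class takes only a couple of lines in Pandas. The sizes and the 5% cutoff below are made-up values used only to illustrate the idea:

```python
import pandas as pd

# Hypothetical clothing-size feature with one rarely used class.
sizes = pd.Series(["S"] * 30 + ["M"] * 45 + ["L"] * 22 + ["XXS"] * 3)

# Share of each class in the whole feature.
shares = sizes.value_counts(normalize=True)

# Classes below the assumed 5% cutoff are candidates for review.
rare_classes = shares[shares < 0.05].index.tolist()

print(shares)
print(rare_classes)
```

Whether a rare class is then removed, merged into another class, or kept still depends on what it means for the problem at hand.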
Once the outliers are detected, there are three common ways to handle them:
- Delete the outlier: For outliers that are true values, it is best to completely delete them to avoid skewing the analysis. This may also be a good idea for outliers that are mistakes, if the number of outliers is too large to perform further analysis to assign a new value.
- Define a top: Defining a top might also be useful for true values. For instance, if you realize that all values above a certain threshold behave the same way, you can consider topping that value with the threshold.
- Assign a new value: If the outlier is clearly a mistake, you can assign a new value using one of the techniques that we discussed for missing values (mean or regression imputation).
The decision to use each of the preceding approaches depends on the outlier type and number. Most of the time, if the number of outliers represents a small proportion of the total size of the dataset, there is no point in treating the outlier in any way other than deleting it.
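The three approaches can be sketched on a Pandas Series as follows. The ages and the threshold of 70 are assumptions chosen for illustration, and the replacement value is simply the mean of the remaining values:

```python
import pandas as pd

# Hypothetical ages; 180 stands in for an outlier.
ages = pd.Series([22.0, 35.0, 41.0, 29.0, 180.0])
threshold = 70.0  # assumed upper threshold
is_outlier = ages > threshold

# 1. Delete the outlier.
deleted = ages[~is_outlier]

# 2. Define a top: cap every value at the threshold.
capped = ages.clip(upper=threshold)

# 3. Assign a new value: here, the mean of the non-outlying values.
replaced = ages.mask(is_outlier, ages[~is_outlier].mean())

print(deleted.tolist())
print(capped.tolist())
print(replaced.tolist())
```

Note that deleting changes the length of the Series, whereas capping and replacing keep every instance in place.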
Note
Noisy data corresponds to values that are not correct or possible. This includes numerical (outliers that are mistakes) and nominal values (for example, a person's gender misspelled as "fimale"). Like outliers, noisy data can be treated by deleting the values completely or by assigning them a new value.
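For nominal values, a common first pass is to normalize the text and map known typos to the valid class. The sketch below assumes a hand-written dictionary of corrections and a known set of valid classes, both of which are hypothetical:

```python
import pandas as pd

# Hypothetical nominal feature containing typos and inconsistent formatting.
gender = pd.Series(["male", "fimale", "female", "Male ", "female"])

# Normalize case and whitespace, then map known typos to the valid class.
corrections = {"fimale": "female"}  # assumed list of known typos
cleaned = gender.str.strip().str.lower().replace(corrections)

# Anything still outside the valid set remains noise to delete or review.
valid = ["male", "female"]
still_noisy = cleaned[~cleaned.isin(valid)]

print(cleaned.tolist())
print(len(still_noisy))
```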
Exercise 2: Dealing with Messy Data
In this exercise, we will be using the titanic dataset as an example to demonstrate how to deal with messy data:
- Open a Jupyter Notebook to implement this exercise.
- Load the titanic dataset and store it in a variable called titanic. Use the following code:
    import seaborn as sns
    titanic = sns.load_dataset('titanic')
- Next, create a variable called age to store the values of that feature from the dataset. Print out the top 10 values of the age variable:
    age = titanic['age']
    age.head(10)
The output will appear as follows:
Figure 1.8: A screenshot showing the first 10 instances of the age variable
As you can see, the feature has NaN (Not a Number) values, which represent missing values.
- Check the shape of the age variable. Then, count the number of NaN values to determine how to handle them. Use the isnull() function to find the NaN values, and use the sum() function to sum them all:
    age.shape
    (891,)
    age.isnull().sum()
    177
- The participation of the NaN values in the total size of the variable is 19.87% (177 out of 891). Although this is not high enough to consider removing the entire feature, there is a need to handle the missing values.
- Let's choose the mean imputation methodology to replace the missing values. To do so, compute the mean of the available values. Use the following code:
    mean = age.mean()
    mean = mean.round()
    mean
The mean comes to be 30.
Note
The value was rounded to its nearest integer since we are dealing with age.
- Replace all missing values with the mean. Use the fillna() function. To check that the values have been replaced, print the first ten values again:
    age.fillna(mean, inplace=True)
    age.head(10)
Note
Set inplace to True to replace the values in the places where the NaN values are.
The printed output is shown below:
Figure 1.9: A screenshot depicting the first 10 instances of the age variable
As you can see in the preceding screenshot, the age of the instance with index 5 has changed from NaN to 30, which is the mean that was calculated previously. The same procedure occurs for all 177 NaN values.
- Import Matplotlib and graph a histogram of the age variable. Use Matplotlib's hist() function. To do so, type in the following code:
    import matplotlib.pyplot as plt
    plt.hist(age)
    plt.show()
The histogram should look like it does in the following diagram, and as we can see, its distribution is Gaussian-like:
Figure 1.10: A screenshot depicting the histogram of the age variable
- Discover the outliers in the data. Let's use three standard deviations as the measure to calculate the min and max values.
As discussed previously, the min value is determined by calculating the mean of all of the values and subtracting three standard deviations from it. Use the following code to set the min value and store it in a variable named min_val:
    min_val = age.mean() - (3 * age.std())
    min_val
The min value comes to be around −9.248. According to the min value, there are no outliers at the left tail of the Gaussian distribution. This makes sense, given that the distribution is tilted slightly to the left.
For the max value, by contrast, the standard deviations are added to the mean to calculate the upper threshold. Calculate the max value, as shown in the following code, and store it in a variable named max_val:
    max_val = age.mean() + (3 * age.std())
    max_val
The max value, which comes to around 68.766, determines that instances with ages above 68.76 years represent outliers. As you can see in the preceding diagram, this also makes sense as there are few instances over that threshold and they are in fact far away from the bell of the Gaussian distribution.
- Count the number of instances that are above the max value to decide how to handle them.
First, using indexing, call the values in age that are above the max value, and store them in a variable called outliers. Then, count the outliers using count():
    outliers = age[age > max_val]
    outliers.count()
The output shows us that there are seven outliers. Print out the outliers by typing in outliers and check that the correct values were stored:
Figure 1.11: A screenshot depicting the outliers
As the number of outliers is small, and they correspond to true outliers, they can be deleted.
Note
For this exercise, we will be deleting the instances from the age variable to understand the complete procedure of dealing with outliers. However, later, the deletion of outliers will be handled in consideration of all features, in order to delete the entire instance, and not just the age values.
- Redefine the value stored in age by using indexing to include only values below the max threshold. Then, print the shape of age:
    age = age[age <= max_val]
    age.shape
    (884,)
As you can see, the shape of age has been reduced by seven, which was the number of outliers.
Congratulations! You have successfully cleaned out a Pandas Series. This process serves as a guide for cleaning a dataset later on.
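When working with a whole dataset rather than a single Series, the same threshold can be used to build a Boolean mask that removes the entire instance from both the features and the target. The following sketch uses a small made-up DataFrame whose column names mirror the titanic dataset, together with the approximate upper threshold found in the exercise:

```python
import pandas as pd

# Hypothetical features matrix and target.
X = pd.DataFrame({
    "age":  [22.0, 38.0, 71.0, 35.0],
    "fare": [7.25, 71.28, 34.65, 53.10],
})
Y = pd.Series([0, 1, 0, 1], name="survived")

max_val = 68.766  # approximate upper threshold found in the exercise

# Keep only the instances whose age is within the threshold.
# Rows with a missing age would also be dropped by this comparison,
# so missing values should be handled first.
keep = X["age"] <= max_val
X_clean = X[keep]
Y_clean = Y[keep]

print(X_clean.shape, Y_clean.shape)
```

Applying the same mask to both objects keeps the features and the target aligned.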
To summarize, we have discussed the importance of preprocessing data, as failing to do so may introduce bias in the model, which affects the training time of the model and its performance. Some of the main forms of messy data are missing values, outliers, and noise.
Missing values, as their name suggests, are those values that are left empty or null. When dealing with many missing values, it is important to handle them by deletion or by assigning new values. Two ways to assign new values were also discussed: mean imputation and regression imputation.
Outliers are values that fall far from the mean of all the values of a feature. One way to detect outliers is by selecting all the values that fall further than three to six standard deviations from the mean. Outliers may be mistakes (values that are not possible) or true values, and they should be handled differently. While true outliers may be deleted or topped, mistakes should be replaced with other values when possible.
Finally, noisy data corresponds to values that are, regardless of their proximity to the mean, mistakes or typos in the data. They can be of numeric, ordinal, or nominal types.
Note
Please remember that numeric data is always represented by numbers that can be measured, nominal data refers to text data that does not follow a rank, and ordinal data refers to text data that follows a rank or order.
Dealing with Categorical Features
Categorical features are those that comprise discrete values typically belonging to a finite set of categories. Categorical data can be nominal or ordinal. Nominal refers to categories that do not follow a specific order, such as music genre or city names, whereas ordinal refers to categories with a sense of order, such as clothing sizes or level of education.
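The distinction matters when categories are turned into numbers: for an ordinal feature, the numbers should follow the rank of the categories. One possible way of stating that rank explicitly, sketched here with Pandas on a made-up list of sizes, is an ordered categorical:

```python
import pandas as pd

# Hypothetical ordinal feature.
sizes = pd.Series(["M", "S", "L", "S", "XL"])

# State the rank explicitly so that the numeric codes follow it.
order = ["S", "M", "L", "XL"]
ordered = pd.Categorical(sizes, categories=order, ordered=True)
codes = pd.Series(ordered.codes, index=sizes.index)

print(codes.tolist())
```

Numbering these categories alphabetically instead would place L before M and S, which would not reflect the actual order of the sizes.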
Feature Engineering
Even though improvements in many machine learning algorithms have enabled the algorithms to understand categorical data types such as text, the process of transforming them into numeric values facilitates the training process of the model, which results in faster running times and better performance. This is mainly due to the elimination of semantics available in each category, as well as the fact that the conversion into numeric values allows you to scale all of the features of the dataset equally, as explained previously.
How does it work? Feature engineering generates a label encoding that assigns a numeric value to each category; this value will then replace the category in the dataset. For example, a variable called genre with the classes pop, rock, and country can be converted as follows:
Figure 1.12: An image illustrating how feature engineering works
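Done by hand, such an encoding is just a lookup table from category to number. The sketch below numbers the categories in alphabetical order, which is an assumption made for illustration; the numbers shown in the figure may differ:

```python
import pandas as pd

# Hypothetical values of the genre variable.
genre = pd.Series(["pop", "rock", "country", "rock", "pop"])

# One integer per category, numbered in alphabetical order.
mapping = {category: code for code, category in enumerate(sorted(genre.unique()))}
encoded = genre.map(mapping)

print(mapping)
print(encoded.tolist())
```

Keeping the mapping dictionary around makes it possible to translate the numbers back into their categories later.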
Exercise 3: Applying Feature Engineering over Text Data
In this exercise, we will be converting the text data within the embark_town feature of the titanic dataset into numerical data. Follow these steps:
- Use the same Jupyter Notebook that you created for the last exercise.
- Import scikit-learn's LabelEncoder() class, as well as the Pandas library. Use the following code:
    from sklearn.preprocessing import LabelEncoder
    import pandas as pd
- Create a variable called em_town and store the information of that feature from the titanic dataset that was imported in the previous exercise. Print the top 10 values from the new variable:
    em_town = titanic['embark_town']
    em_town.head(10)
The output looks as follows:
Figure 1.13: A screenshot depicting the first 10 instances of the em_town variable
As you can see, the variable contains text data.
- Convert the text data into numeric values. Use the class that was imported previously (LabelEncoder):
    enc = LabelEncoder()
    new_label = pd.Series(enc.fit_transform(em_town.astype('str')))
First of all, initialize the class by typing in the first line of code. Second, create a new variable called new_label and use the built-in method fit_transform() from the class, which will assign a numeric value to each category and output the result. We use the pd.Series() function to convert the output from the label encoder into a Pandas Series. Print out the top 10 values of the new variable:
    new_label.head(10)
Figure 1.14: A screenshot depicting the first 10 instances of the new_label variable
As you can see, the text categories of the variable have been converted into numeric values.
Congratulations! You have successfully converted text data into numeric values.
While improvements in machine learning have made dealing with text features easier for some algorithms, it is best to convert them into numeric values. This is mainly important as it eliminates the complexity of dealing with semantics, not to mention that it gives the flexibility to change from model to model, without any limitations.
Text data conversion is done via feature engineering, where every text category is assigned a numeric value that replaces it. Furthermore, even though this can be done manually, there are powerful built-in classes and methods that facilitate this process. One example of this is the use of scikit-learn's LabelEncoder class.
Rescaling Data
Why is it important to rescale data? Because even though the data may be fed to a model using different scales for each feature, the lack of homogeneity can cause the algorithm to lose its ability to discover patterns from the data due to the assumptions it has to make to understand it, thereby slowing down the training process and negatively affecting the model's performance.
Data rescaling helps the model run faster, as it no longer has to learn to compensate for the differences in scale present in the dataset. Moreover, a model trained over equally scaled data starts out giving all features the same importance, which allows the algorithm to generalize to all features and not just to those with higher values, irrespective of their meaning.
An example of a dataset with different scales is one that contains different features, one measured in kilograms, another measuring temperature, and another counting the number of children. Even though the values of each attribute are true, the scale of each one of them highly differs from that of the other. For example, while the values in kilograms can go higher than 100, the children count will typically not go further than 10.
Two of the most popular ways to rescale data are data normalization and data standardization. There is no rule on selecting the methodology to transform data to scale it, as all datasets behave differently. The best practice is to transform the data using two or three rescaling methodologies and test the algorithms in each one of them in order to choose the one that best fits the data based on the performance.
Rescaling methodologies are to be used individually. When testing different rescaling methodologies, the transformation of data should be done independently. Each transformation can be tested over a model, and the best suited one should be chosen for further steps.
Normalization: Data normalization in machine learning consists of rescaling the values of all features such that they lie in a range between 0 and 1, so that the range of every feature has a length of one. This serves the purpose of equating attributes of different scales.
The following equation allows you to normalize the values of a feature:
Figure 1.15: The normalization equation
Here, zi corresponds to the ith normalized value and x represents all values.
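The equation itself is not reproduced here. Judging from the code used in the exercise that follows, it presumably corresponds to min-max scaling, which can be written as follows, where x_i is the ith original value:

```latex
z_i = \frac{x_i - \min(x)}{\max(x) - \min(x)}
```

For example, for a feature whose values range from 20 to 60, the value 30 is normalized to (30 - 20) / (60 - 20) = 0.25.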
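Standardization: This is a rescaling technique that transforms the data so that it has a mean equal to 0 and a standard deviation equal to 1. Note that it shifts and scales the values but does not change the shape of their distribution, so the result is Gaussian only if the original data was Gaussian-like.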
One simple way of standardizing a feature is shown in the following equation:
Figure 1.16: The standardization equation
Here, zi corresponds to the ith standardized value, and x represents all values.
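The equation itself is not reproduced here. Judging from the code used in the exercise that follows, it presumably corresponds to the following expression, where x_i is the ith original value, and mean(x) and std(x) are the mean and standard deviation of the feature:

```latex
z_i = \frac{x_i - \operatorname{mean}(x)}{\operatorname{std}(x)}
```

For example, for a feature with a mean of 30 and a standard deviation of 10, the value 45 is standardized to (45 - 30) / 10 = 1.5.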
Exercise 4: Normalizing and Standardizing Data
This section covers the normalization and standardization of data, using the titanic dataset as an example. Use the same Jupyter Notebook that you created for the last exercise:
- Using the age variable that was created in the first exercise of this notebook, normalize the data using the preceding formula and store it in a new variable called age_normalized. Print out the top 10 values:
    age_normalized = (age - age.min())/(age.max()-age.min())
    age_normalized.head(10)
Figure 1.17: A screenshot displaying the first 10 instances of the age_normalized variable
As you can see in the preceding screenshot, all of the values have been converted to their equivalents in a range between 0 and 1. By performing the normalization for all of the features, the model will be trained on features of the same scale.
- Again, using the age variable, standardize the data using the formula for standardization, and store it in a variable called age_standardized. Print out the top 10 values:
    age_standardized = (age - age.mean())/age.std()
    age_standardized.head(10)
Figure 1.18: A screenshot displaying the first 10 instances of the age_standardized variable
Unlike in normalization, in standardization the values are distributed around zero.
- Print out the mean and standard deviation of the age_standardized variable to confirm its mean of 0 and standard deviation of 1:
    print("Mean: " + str(age_standardized.mean()))
    print("Standard Deviation: " + str(age_standardized.std()))
    Mean: 9.645376503530772e-17
    Standard Deviation: 1.0
As you can see, the mean approximates to 0, and the standard deviation is equal to 1, which means that the standardization of the data was successful.
Congratulations! You have successfully applied rescaling methods to your data.
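The same two formulas can be applied to every numeric column of a DataFrame at once, since Pandas performs the arithmetic column by column. The helper names normalize and standardize, as well as the sample data, are assumptions made for this sketch:

```python
import pandas as pd

def normalize(df: pd.DataFrame) -> pd.DataFrame:
    """Min-max scale every column to the range between 0 and 1."""
    return (df - df.min()) / (df.max() - df.min())

def standardize(df: pd.DataFrame) -> pd.DataFrame:
    """Shift every column to a mean of 0 and scale it to a standard deviation of 1."""
    return (df - df.mean()) / df.std()

# Hypothetical features measured on very different scales.
features = pd.DataFrame({
    "weight_kg": [55.0, 72.0, 110.0, 64.0],
    "children":  [0.0, 2.0, 1.0, 4.0],
})

features_normalized = normalize(features)
features_standardized = standardize(features)

print(features_normalized)
print(features_standardized)
```

Both helpers assume that each column contains at least two different values; a constant column would lead to a division by zero.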
In conclusion, we have covered the final step in data preprocessing, which consists of rescaling data. This process was done in a dataset with features of different scales, with the objective of homogenizing the way data is represented to facilitate the comprehension of the data by the model.
Failing to rescale data will cause the model to train at a slower pace and might negatively affect the performance of the model.
Two methodologies for data rescaling were explained in this topic: normalization and standardization. On one hand, normalization rescales the data to a range of length one (from 0 to 1). On the other hand, standardization rescales the data so that it has a mean of 0 and a standard deviation of 1.
Given that there is no rule for selecting the appropriate rescaling methodology, the recommended course of action is to transform the data using two or three rescaling methodologies independently, and then train the model with each transformation to evaluate the methodology that behaves best.
Activity 2: Preprocessing an Entire Dataset
You continue to work for the safety department at a cruise company. As you did great work selecting the ideal target feature to develop the study, the department has decided to commission you to preprocess the dataset as well. For this purpose, you need to use all the techniques you have learned about previously to preprocess the dataset and get it ready for model training. The following steps serve to guide you in that direction:
- Load the dataset and create the features and target matrices by typing in the following code:
    import seaborn as sns
    titanic = sns.load_dataset('titanic')
    X = titanic[['sex','age','fare','class','embark_town','alone']]
    Y = titanic['survived']
Note
For this activity, the features matrix has been created using only six features, as some of the other features were redundant for the study. For example, there is no need to keep both class and pclass.
- Check for missing values and outliers in all the features of the features matrix (X). Choose a methodology to handle them.
Note
The following functions might come in handy:
notnull(): To detect non-missing values. For instance, X[X["age"].notnull()] will retrieve all the rows in X, except those that are missing values under the column age.
value_counts(): To count the occurrence of unique values of an array. For example, X["sex"].value_counts() will count the number of times the classes male and female are present.
- Convert all text features into their numeric representation.
Note
Use the LabelEncoder class from scikit-learn. Don't forget to initialize the class before calling any of its methods.
- Rescale your data, either by normalizing or standardizing.
Note
The solution for this activity can be found on page 179.
Results may vary depending on the choices you made. However, you must be left with a dataset with no missing values, outliers, or text features, and with data rescaled.