Data Transformation
Previously, we saw how we can combine data from different sources into a unified dataframe. Now, we have a lot of columns that hold different types of data. Our goal is to transform the data into a machine-learning-digestible format. Because machine learning algorithms operate on numbers, we need to convert all the columns into a numerical format. Before that, let's look at the different types of data we have.
Taking a broader perspective, data is classified into numerical and categorical data:
- Numerical: As the name suggests, this is numeric data that is quantifiable.
- Categorical: This is string or other non-numeric data that is qualitative in nature.
Numerical data is further divided into the following:
- Discrete: To explain in simple terms, any numerical data that is countable is called discrete, for example, the number of people in a family or the number of students in a class. Discrete data can only take certain values (such as 1, 2, 3, 4, etc).
- Continuous: Any numerical data that is measurable is called continuous, for example, the height of a person or the time taken to reach a destination. Continuous data can take virtually any value (for example, 1.25, 3.8888, and 77.1276).
Categorical data is further divided into the following:
- Ordered: Any categorical data that has some order associated with it is called ordered categorical data, for example, movie ratings (excellent, good, bad, worst) and feedback (happy, not bad, bad). You can think of ordered data as being something you could mark on a scale.
- Nominal: Any categorical data that has no order is called nominal categorical data. Examples include gender and country.
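To make the four types concrete, here is a minimal sketch on a made-up dataframe. The column names and values are invented for illustration and are not part of the datasets used in this chapter. It shows how pandas can be told that a categorical column is ordered:

```python
import pandas as pd

# Toy data invented for illustration; the column names are hypothetical.
df = pd.DataFrame({
    "family_size": [3, 4, 2],                  # numerical, discrete
    "height_cm": [162.5, 175.1, 180.0],        # numerical, continuous
    "country": ["India", "Brazil", "Japan"],   # categorical, nominal
    "rating": ["good", "excellent", "bad"],    # categorical, ordered
})

# Declare the order explicitly so pandas knows bad < good < excellent.
df["rating"] = pd.Categorical(
    df["rating"], categories=["bad", "good", "excellent"], ordered=True
)

print(df.dtypes)
# Integer position of each value in the declared order.
rating_codes = df["rating"].cat.codes.tolist()
print(rating_codes)
# Comparisons such as min() only make sense for ordered categories.
lowest_rating = df["rating"].min()
print(lowest_rating)
```

The nominal column has no such order, so asking for its minimum or maximum would not be meaningful.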
From these different types of data, we will focus on categorical data. In the next section, we'll discuss how to handle categorical data.
Handling Categorical Data
Some algorithms, such as decision trees, can work well with categorical data. But most machine learning algorithms cannot operate directly on categorical data. These algorithms require both the input and the output to be in numerical form. If the output to be predicted is categorical, then after prediction we convert the numerical predictions back to their categorical labels. Let's discuss some key challenges that we face while dealing with categorical data:
- High cardinality: Cardinality is the number of distinct values in a column. A high-cardinality column has a lot of different values. A good example is User ID – in a table of 500 different users, the User ID column would have 500 unique values.
- Rare occurrences: A column might contain categories that occur very rarely and therefore would not be significant enough to have an impact on the model.
- Frequent occurrences: A column might contain a category that occurs in almost every row, so the column has very low variance and would fail to make an impact on the model.
- Won't fit: Categorical data, left unprocessed, cannot be passed to a model that expects numerical input.
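The first two challenges can be checked quickly with pandas. The following is a minimal sketch on invented data. The column names, the 15% threshold, and the "Other" bucket are assumptions made for illustration, not a rule from this chapter:

```python
import pandas as pd

# Invented data: user_id is unique per row, city has one rare value.
df = pd.DataFrame({
    "user_id": [f"u{i}" for i in range(10)],
    "city": ["Pune"] * 6 + ["Lima"] * 3 + ["Oslo"],
})

# Number of distinct values per column (the cardinality).
cardinality = df.nunique()
print(cardinality)

# Share of rows per category; anything below the threshold counts as rare.
RARE_THRESHOLD = 0.15  # arbitrary cut-off chosen for this sketch
shares = df["city"].value_counts(normalize=True)
rare = shares[shares < RARE_THRESHOLD].index.tolist()
print(rare)

# One possible treatment: lump the rare categories into a single bucket.
df["city_grouped"] = df["city"].where(~df["city"].isin(rare), "Other")
print(df["city_grouped"].value_counts())
```

A column such as user_id, where every row has its own value, usually carries no information a model can generalize from and is a candidate for being dropped rather than encoded.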
Encoding
To address the problems associated with categorical data, we can use encoding. This is the process by which we convert a categorical variable into a numerical form. Here, we will look at three simple methods of encoding categorical data.
Replacing
This is a technique in which we replace the categorical data with a number. This is a simple replacement and does not involve much logical processing. Let's look at an exercise to get a better idea of this.
Exercise 6: Simple Replacement of Categorical Data with a Number
In this exercise, we will use the student dataset that we saw earlier. We will load the data into a pandas dataframe and simply replace all the categorical data with numbers. Follow these steps to complete this exercise:
Note
The student dataset can be found at this location: https://github.com/TrainingByPackt/Data-Science-with-Python/blob/master/Chapter01/Data/student.csv.
- Open a Jupyter notebook and add a new cell. Write the following code to import pandas and then load the dataset into the pandas dataframe:
import pandas as pd
import numpy as np
# Use the raw file URL; the /blob/ page URL returns an HTML page, not the CSV
dataset = "https://raw.githubusercontent.com/TrainingByPackt/Data-Science-with-Python/master/Chapter01/Data/student.csv"
df = pd.read_csv(dataset, header = 0)
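- Find the categorical columns and separate them out into a different dataframe. To do so, use the select_dtypes() function from pandas (the copy() call makes the new dataframe independent of df, so it can be modified safely later):
df_categorical = df.select_dtypes(exclude=[np.number]).copy()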
df_categorical
The preceding code generates the following output:
Figure 1.28: Categorical columns of the dataframe
- Find the distinct unique values in the Grade column. To do so, use the unique() function from pandas with the column name:
df_categorical['Grade'].unique()
The preceding code generates the following output:
Figure 1.29: Unique values in the Grade column
- Find the frequency distribution of each categorical column. To do so, use the value_counts() function on each column. This function returns the counts of unique values in an object:
df_categorical.Grade.value_counts()
The output of this step is as follows:
Figure 1.30: Total count of each unique value in the Grade column
- For the Gender column, write the following code:
df_categorical.Gender.value_counts()
The output of this code is as follows:
Figure 1.31: Total count of each unique value in the Gender column
- Similarly, for the Employed column, write the following code:
df_categorical.Employed.value_counts()
The output of this code is as follows:
Figure 1.32: Total count of each unique value in the Employed column
- Replace the entries in the Grade column. Replace 1st Class with 1, 2nd Class with 2, and 3rd Class with 3. To do so, use the replace() function and assign the result back to the column:
df_categorical["Grade"] = df_categorical["Grade"].replace({"1st Class": 1, "2nd Class": 2, "3rd Class": 3})
- Replace the entries in the Gender column. Replace Male with 0 and Female with 1. To do so, use the replace() function:
df_categorical["Gender"] = df_categorical["Gender"].replace({"Male": 0, "Female": 1})
- Replace the entries in the Employed column. Replace no with 0 and yes with 1. To do so, use the replace() function:
df_categorical["Employed"] = df_categorical["Employed"].replace({"yes": 1, "no": 0})
- Once all the replacements for three columns are done, we need to print the dataframe. Add the following code:
df_categorical.head()
Figure 1.33: Numerical data after replacement
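The same idea can be tried without downloading the dataset. The following is a minimal sketch on three made-up rows that mimic the Grade, Gender, and Employed columns. It uses map() rather than replace(), which is an alternative chosen here because it makes unmapped values visible:

```python
import pandas as pd

# Made-up rows that mimic the Grade, Gender and Employed columns.
toy = pd.DataFrame({
    "Grade": ["1st Class", "3rd Class", "2nd Class"],
    "Gender": ["Male", "Female", "Female"],
    "Employed": ["yes", "no", "yes"],
})

# One mapping per column, kept together so the encoding is documented in one place.
mappings = {
    "Grade": {"1st Class": 1, "2nd Class": 2, "3rd Class": 3},
    "Gender": {"Male": 0, "Female": 1},
    "Employed": {"no": 0, "yes": 1},
}

encoded = toy.copy()
for column, mapping in mappings.items():
    # map() turns any value missing from the mapping into NaN,
    # which makes typos or unexpected categories easy to spot.
    encoded[column] = toy[column].map(mapping)

print(encoded)
# Count of values that had no entry in the mapping (0 means all were mapped).
print(encoded.isna().sum().sum())
```

With replace(), a value that is missing from the dictionary is left unchanged, so a misspelled category would silently stay as a string.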
You have successfully converted the categorical data to numerical data using a simple manual replacement method. We will now move on to look at another method of encoding categorical data.
Label Encoding
This is a technique in which we replace each value in a categorical column with a number from 0 to N-1, where N is the number of distinct values in the column. For example, say we've got a list of employee names in a column. After performing label encoding, each distinct employee name will be assigned a numeric label. This might not be suitable for all cases, because a model might treat the numeric labels as magnitudes or weights. Label encoding is therefore best suited to ordinal data, provided that the assigned numbers follow the real order of the categories. The scikit-learn library provides LabelEncoder(), which assigns labels according to the sorted order of the values. Let's look at an exercise in the next section.
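Before the exercise, here is a minimal sketch of what LabelEncoder() does on a short, invented list of names. It also shows inverse_transform(), which converts numeric labels back to the original categories:

```python
from sklearn.preprocessing import LabelEncoder

names = ["Priya", "Alex", "Chen", "Alex"]  # invented values

encoder = LabelEncoder()
codes = encoder.fit_transform(names)

# classes_ holds the distinct values in sorted order;
# each value's label is its position in this array.
classes = list(encoder.classes_)
print(classes)
print(codes.tolist())

# Convert the numeric labels back to the original strings.
restored = encoder.inverse_transform(codes).tolist()
print(restored)
```

Because the labels follow sorted (here, alphabetical) order, they carry no meaning beyond identity unless the sorted order happens to match a real ordering of the categories.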
Exercise 7: Converting Categorical Data to Numerical Data Using Label Encoding
In this exercise, we will load the Banking_Marketing.csv dataset into a pandas dataframe and convert categorical data to numeric data using label encoding. Follow these steps to complete this exercise:
Note
The Banking_Marketing.csv dataset can be found here: https://github.com/TrainingByPackt/Master-Data-Science-with-Python/blob/master/Chapter%201/Data/Banking_Marketing.csv.
- Open a Jupyter notebook and add a new cell. Write the code to import pandas and load the dataset into the pandas dataframe:
import pandas as pd
import numpy as np
# Use the raw file URL; the /blob/ page URL returns an HTML page, not the CSV
dataset = 'https://raw.githubusercontent.com/TrainingByPackt/Master-Data-Science-with-Python/master/Chapter%201/Data/Banking_Marketing.csv'
df = pd.read_csv(dataset, header=0)
- Before doing the encoding, remove all the missing data. To do so, use the dropna() function:
df = df.dropna()
- Select all the columns that are not numeric using the following code:
data_column_category = df.select_dtypes(exclude=[np.number]).columns
data_column_category
To understand how the selection looks, refer to the following screenshot:
Figure 1.34: Non-numeric columns of the dataframe
- Print the first five rows of the new dataframe. Add the following code to do this:
df[data_column_category].head()
The preceding code generates the following output:
Figure 1.35: Non-numeric values for the columns
- Iterate through these category columns and convert them to numeric data using LabelEncoder(). To do so, import the LabelEncoder class from the sklearn.preprocessing package and use it to transform the data:
# Import the LabelEncoder class
from sklearn.preprocessing import LabelEncoder
# Create the encoder instance
label_encoder = LabelEncoder()
# Refit the encoder on each categorical column and overwrite the column with its labels
for i in data_column_category:
    df[i] = label_encoder.fit_transform(df[i])
print("Label Encoded Data: ")
df.head()
The preceding code generates the following output:
Figure 1.36: Values of non-numeric columns converted into numeric form
In the preceding screenshot, we can see that all the values have been converted from categorical to numerical. Here, the original values have been transformed and replaced with the newly encoded data.
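One side effect of reusing a single encoder in the loop is that only the mapping for the last column is remembered. If the labels need to be converted back later, one option is to keep a separate encoder per column. The following is a minimal sketch on invented data. The column names and the encoders dictionary are assumptions for illustration, not part of the exercise:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Invented rows; the column names are hypothetical.
df_toy = pd.DataFrame({
    "job": ["admin", "technician", "admin"],
    "marital": ["single", "married", "married"],
    "age": [31, 45, 28],
})

categorical_columns = df_toy.select_dtypes(exclude="number").columns

# Keep one fitted encoder per column so every mapping stays available.
encoders = {}
encoded_toy = df_toy.copy()
for column in categorical_columns:
    encoders[column] = LabelEncoder()
    encoded_toy[column] = encoders[column].fit_transform(df_toy[column])

print(encoded_toy)

# Recover the original strings for one column from its own encoder.
restored_job = encoders["job"].inverse_transform(encoded_toy["job"]).tolist()
print(restored_job)
```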
You have successfully converted categorical data to numerical data using the LabelEncoder method. In the next section, we'll explore another type of encoding: one-hot encoding.
One-Hot Encoding
In label encoding, categorical data is converted to numerical data, and the values are assigned labels (such as 0, 1, and 2). Predictive models that use this numerical data for analysis might sometimes mistake these labels for some kind of order (for example, a model might think that a label of 2 is "better" than a label of 0, which is incorrect). To avoid this confusion, we can use one-hot encoding. Here, each label-encoded column is split into n new columns, where n denotes the total number of unique labels generated for that column during label encoding. Each new column contains 1 for the rows that belong to its category and 0 for all other rows. For example, say that three labels are generated through label encoding. Then, while performing one-hot encoding, the column is split into three columns, so the value of n is 3. Let's look at an exercise to get further clarification.
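As a quick illustration before the exercise, here is a minimal sketch on an invented column with three distinct values, using the get_dummies() function from pandas. The column name color is hypothetical:

```python
import pandas as pd

# Invented column with three distinct categories, so n = 3.
df_colors = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# One new 0/1 column per category; dtype=int gives integers rather than booleans.
one_hot = pd.get_dummies(df_colors["color"], prefix="color", dtype=int)
print(one_hot)

# Every row has exactly one "hot" column.
row_sums = one_hot.sum(axis=1).tolist()
print(row_sums)
```

No column is numerically larger than another here, so a model has no reason to read an order into the categories.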
Exercise 8: Converting Categorical Data to Numerical Data Using One-Hot Encoding
In this exercise, we will load the Banking_Marketing.csv dataset into a pandas dataframe and convert the categorical data into numeric data using one-hot encoding. Follow these steps to complete this exercise:
Note
The Banking_Marketing dataset can be found here: https://github.com/TrainingByPackt/Data-Science-with-Python/blob/master/Chapter01/Data/Banking_Marketing.csv.
- Open a Jupyter notebook and add a new cell. Write the code to import pandas and load the dataset into a pandas dataframe:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
# Use the raw file URL; the /blob/ page URL returns an HTML page, not the CSV
dataset = 'https://raw.githubusercontent.com/TrainingByPackt/Master-Data-Science-with-Python/master/Chapter%201/Data/Banking_Marketing.csv'
# Read the data into the dataframe df
df = pd.read_csv(dataset, header=0)
- Before doing the encoding, remove all the missing data. To do so, use the dropna() function:
df = df.dropna()
- Select all the columns that are not numeric using the following code:
data_column_category = df.select_dtypes(exclude=[np.number]).columns
data_column_category
The preceding code generates the following output:
Figure 1.37: Non-numeric columns of the dataframe
- Print the first five rows of the new dataframe. Add the following code to do this:
df[data_column_category].head()
The preceding code generates the following output:
Figure 1.38: Non-numeric values for the columns
- Iterate through these category columns and convert them to numeric data using OneHotEncoder. To do so, use the OneHotEncoder() class from the sklearn.preprocessing package. In this exercise, we perform label encoding first, before one-hot encoding:
# Perform label encoding
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
# Refit the encoder on each categorical column and overwrite the column with its labels
for i in data_column_category:
    df[i] = label_encoder.fit_transform(df[i])
print("Label Encoded Data: ")
df.head()
The preceding code generates the following output:
Figure 1.39: Values of non-numeric columns converted into numeric data
- Once we have performed label encoding, we execute one-hot encoding. Add the following code to implement this:
# Perform one-hot encoding and return a dense array
onehot_encoder = OneHotEncoder(sparse_output=False)  # on scikit-learn older than 1.2, use sparse=False
onehot_encoded = onehot_encoder.fit_transform(df[data_column_category])
- Now we create a new dataframe with the encoded data and print the first five rows. Add the following code to do this:
# Create a dataframe from the encoded data, with the generated column names
onehot_encoded_frame = pd.DataFrame(onehot_encoded, columns=onehot_encoder.get_feature_names_out(data_column_category))  # on scikit-learn older than 1.0, use get_feature_names
onehot_encoded_frame.head()
The preceding code generates the following output:
Figure 1.40: Columns with one-hot encoded values
- Due to one-hot encoding, the number of columns in the new dataframe has increased. In order to view and print all the columns created, use the columns attribute:
onehot_encoded_frame.columns
The preceding code generates the following output:
Figure 1.41: List of new columns generated after one-hot encoding
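- For every level or category, a new column is created. As an alternative way to create the one-hot encoding, with each new column name prefixed by the name of its source column, you can use the get_dummies() function from pandas. The columns argument is passed explicitly because the category columns were label-encoded into integers earlier, and get_dummies() only encodes string or categorical columns by default. The remaining columns are collected in data_column_number, defined here as every column that is not in data_column_category:
data_column_number = [c for c in df.columns if c not in data_column_category]
df_onehot_getdummies = pd.get_dummies(df[data_column_category], columns=list(data_column_category), prefix=list(data_column_category))
data_onehot_encoded_data = pd.concat([df_onehot_getdummies, df[data_column_number]], axis=1)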
data_onehot_encoded_data.columns
The preceding code generates the following output:
Figure 1.42: List of new columns containing the categories
You have successfully converted categorical data to numerical data using the OneHotEncoder method.
We will now move on to another data preprocessing step – how to deal with a range of magnitudes in your data.