Data preparation and data engineering
Data preparation and data engineering are two essential steps in the machine learning process, specifically for supervised learning. We will cover each in turn in Chapters 2 and 4. For now, we’ll provide an overview. Data preparation and data engineering involve collecting, storing, and managing data so that it is accessible and useful for machine learning, as well as cleaning, transforming, and formatting data so that it can be used to train and evaluate machine learning models. Let’s explore the following topics:
- Collecting data: Here, we gather data from a variety of sources, such as databases, sensors, or the internet.
- Storing data: Here, we store data in an efficient and accessible manner, for example, in SQL or NoSQL databases, file systems, or other storage systems.
- Formatting data: Here, we ensure that data is consistently stored in the required format, for example, as tables in an SQL database, or in JSON, Excel, CSV, or plain-text format.
- Splitting data: To verify that your model is not overfitting, you need to test it on part of the dataset. For this test to be effective, the model should not “know” what the testing data looks like. This is why you divide the data into a training set and a testing set using a technique called a train-test split, whose purpose is to evaluate the performance of a machine learning model on unseen data (we’ll sketch this in code shortly). The split should be done before moving on to complicated data cleaning and feature engineering, because many such techniques learn parameters from the data, and it is critical to learn these parameters only from the training set. Otherwise, you risk data leakage, which occurs when a data preparation step passes information about the test set into the training set, for example, if you offset all data points by the mean of the entire dataset.
The training set is used to train the model by feeding it input data and the corresponding output labels. The model learns patterns and relationships in the training data, which it uses to make predictions.
The testing set, however, is used to evaluate the performance of the trained model. It serves as a proxy for new, unseen data. The model makes predictions on the testing set, and the predictions are compared against the known ground truth labels. This evaluation helps assess how well the model generalizes to new data and provides an estimate of its performance.
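To make this concrete, the following is a minimal sketch of a train-test split using scikit-learn. The file name, the label column, and the 80/20 split ratio are illustrative assumptions, not requirements:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset: feature columns plus a "label" column (names assumed)
df = pd.read_csv("data.csv")
X = df.drop(columns=["label"])
y = df["label"]

# Hold out 20% of the rows for testing. random_state makes the split
# reproducible; stratify keeps class proportions similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Any statistics used later for cleaning or feature engineering (means,
# scalers, encoders) should be learned from the training set only and
# then applied, unchanged, to the testing set.
```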
Data cleaning
Here, we identify and handle issues in the dataset that can affect the performance and reliability of machine learning models. Some of the tasks performed during data cleaning are listed here, followed by two short code sketches:
- Handling missing data: Identifying and dealing with missing values by imputing them (replacing missing values with estimated values) or removing instances or features with a significant number of missing values.
- Handling duplicate data: Removing duplicate data from the dataset is important to help the model avoid overfitting. Duplicates can be removed in a variety of ways, such as performing a database query that selects only unique rows, dropping duplicate rows with Python's pandas library, or using a statistical package such as R. Alternatively, we can keep the duplicates but flag them by adding a new column with a 0 or 1 indicating whether a row is a duplicate; the machine learning model can then use this column to avoid overfitting.
- Handling outliers: We must identify and address outliers, which are extreme values that deviate from the typical pattern in the data. We can either remove them or transform them to minimize their impact on the machine learning model. Domain knowledge is important in determining how best to recognize and handle outliers in the data.
- Handling inconsistent data: Addressing inconsistent data, such as incorrect, conflicting, or flawed values, by standardizing formats, resolving discrepancies, or using domain knowledge to correct errors.
- Handling imbalanced data: If there is an imbalance in the data, for example, if there are many more samples of one class than of the others, we can use techniques such as oversampling (replicating minority-class samples) or undersampling (removing majority-class samples).
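To make the first four tasks concrete, here is a minimal pandas sketch. The file name, the column names (age, income, and country), and the thresholds are illustrative assumptions:

```python
import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical file and columns

# Missing data: impute a numeric column with its median, and drop rows
# that are less than 80% complete (thresh = minimum non-missing values).
df["age"] = df["age"].fillna(df["age"].median())
df = df.dropna(thresh=int(0.8 * df.shape[1]))

# Duplicate data: either drop exact duplicates outright...
df = df.drop_duplicates()
# ...or keep them but flag them in a new 0/1 column instead:
# df["is_duplicate"] = df.duplicated().astype(int)

# Outliers: clip a numeric column to its 1st-99th percentile range
# (a simple transformation; domain knowledge may suggest removal instead).
low, high = df["income"].quantile([0.01, 0.99])
df["income"] = df["income"].clip(lower=low, upper=high)

# Inconsistent data: standardize formats, for example, country codes.
df["country"] = df["country"].str.strip().str.upper()
```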
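For imbalanced data, simple random oversampling and undersampling can be sketched with plain pandas, as shown next. The binary label column and its 0/1 class values are assumptions, and resampling should be applied to the training set only; dedicated libraries such as imbalanced-learn offer more sophisticated techniques, such as SMOTE:

```python
import pandas as pd

df = pd.read_csv("train.csv")  # hypothetical training data with a "label" column
majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Oversampling: replicate minority-class rows (sampling with replacement)
# until both classes have the same number of samples.
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
balanced_over = pd.concat([majority, minority_up])

# Undersampling: randomly drop majority-class rows instead.
majority_down = majority.sample(n=len(minority), random_state=42)
balanced_under = pd.concat([majority_down, minority])
```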
Feature engineering
This involves creating new features or transforming existing features into ones that are more informative and relevant to the problem, to enhance the performance of machine learning algorithms. Many techniques can be used for feature engineering; the right choice depends on the specifics of the dataset and the machine learning algorithms used. The following are some common feature engineering techniques, followed by a few short code sketches:
- Feature selection: This involves selecting the most relevant features for the machine learning algorithm. There are two main types of feature selection methods:
- Filter method: With this method, we can select features based on their individual characteristics, such as variance or correlation with the target variable.
- Wrapper method: With this method, we can select features by iteratively building and evaluating models on different subsets of features.
- Feature extraction: This is the process of transforming raw data into features that capture relevant and meaningful information. The following lists some examples:
- Applying statistical transformations, such as normalization or standardization, and dimensionality reduction techniques, such as principal component analysis (PCA), which transforms high-dimensional data into a lower-dimensional space while capturing as much of the variation in the data as possible.
- Converting categorical data into binary values using techniques such as one-hot encoding.
- Converting text data into numerical representations, such as bag-of-words or text embeddings.
- Extracting image features using techniques such as convolutional neural networks (CNNs).
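To illustrate the two feature selection approaches, here is a sketch of a filter method that ranks features by their absolute correlation with the target and a wrapper method using scikit-learn's recursive feature elimination (RFE). The file names, the number of features to keep, and the choice of estimator are assumptions:

```python
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X_train = pd.read_csv("X_train.csv")            # hypothetical numeric features
y_train = pd.read_csv("y_train.csv")["label"]   # hypothetical binary target

# Filter method: score each feature independently by its absolute
# correlation with the target and keep the top five.
correlations = X_train.corrwith(y_train).abs().sort_values(ascending=False)
filter_selected = correlations.head(5).index.tolist()

# Wrapper method: repeatedly fit a model and eliminate the weakest
# feature until only five features remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X_train, y_train)
wrapper_selected = X_train.columns[rfe.support_].tolist()
```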
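The following sketch illustrates three of the extraction techniques listed above: standardization, PCA, and one-hot encoding. The column names and the number of components are assumptions; in practice, the scaler and PCA should be fit on the training set only:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("train.csv")  # hypothetical file and columns
numeric = df[["age", "income", "tenure"]]

# Standardization: rescale each numeric feature to zero mean, unit variance.
scaler = StandardScaler()
numeric_scaled = scaler.fit_transform(numeric)

# PCA: project the scaled features onto the two directions that capture
# the most variation in the data.
pca = PCA(n_components=2)
components = pca.fit_transform(numeric_scaled)

# One-hot encoding: expand each category into its own 0/1 column.
encoded = pd.get_dummies(df[["city"]], columns=["city"])
```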
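Finally, here is a short sketch of the bag-of-words representation for text, using scikit-learn's CountVectorizer; the example sentences are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the model fits the training data",
    "the model predicts on the test data",
]

# Bag-of-words: each document becomes a vector of word counts over the
# shared vocabulary; word order is discarded.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(counts.toarray())                    # one count vector per document
```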
Let’s summarize what we’ve covered in this chapter.