Data preparation and data engineering
Data preparation and data engineering are two essential steps in the machine learning process, specifically for supervised learning. We will cover each in turn in Chapters 2 and 4. For now, we’ll provide an overview. Data preparation and data engineering involve collecting, storing, and managing data so that it is accessible and useful for machine learning, as well as cleaning, transforming, and formatting data so that it can be used to train and evaluate machine learning models. Let’s explore the following topics:
- Collecting data: Here, we gather data from a variety of sources, such as databases, sensors, or the internet.
- Storing data: Here, we store data in an efficient and accessible manner, for example, in SQL or NoSQL databases, file systems, or other storage systems.
- Formatting data: Here, we ensure that data is consistently stored in the required format, for example, as tables in an SQL database, or in JSON, Excel, CSV, or plain-text format.
- Splitting data: To verify that your model is not overfitting, you need to test it on part of the dataset. For this test to be effective, the model should not “know” what the testing data looks like. This is why you divide the data into a training set and a testing set using a technique called a train-test split, whose purpose is to evaluate the performance of a machine learning model on unseen data (we’ll sketch this in code shortly). The split should be done before moving on to complicated data cleaning and feature engineering, because many such techniques learn parameters from the data, and it is critical to learn these parameters only from the training set. Otherwise, you risk data leakage, which occurs when a data preparation step passes information about the test set into the training set, for example, if you offset all data points by the mean of the entire dataset.
The training set is used to train the model by feeding it input data and the corresponding output labels. The model learns patterns and relationships in the training data, which it uses to make predictions.
The testing set, however, is used to evaluate the performance of the trained model. It serves as a proxy for new, unseen data. The model makes predictions on the testing set, and the predictions are compared against the known ground truth labels. This evaluation helps assess how well the model generalizes to new data and provides an estimate of its performance.
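To make this concrete, the following is a minimal sketch of a train-test split using scikit-learn. The file name, the label column, and the 80/20 split ratio are illustrative assumptions, not requirements:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset: feature columns plus a "label" column (names assumed)
df = pd.read_csv("data.csv")
X = df.drop(columns=["label"])
y = df["label"]

# Hold out 20% of the rows for testing. random_state makes the split
# reproducible; stratify keeps class proportions similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Any statistics used later for cleaning or feature engineering (means,
# scalers, encoders) should be learned from the training set only and
# then applied, unchanged, to the testing set.
```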
Data cleaning
Here, we identify and handle issues in the dataset that can affect the performance and reliability of machine learning models. Some of the tasks performed during data cleaning are listed here, followed by two short code sketches:
- Handling missing data: Identifying and dealing with missing values by imputing them (replacing missing values with estimated values) or removing instances or features with a significant number of missing values.
- Handling duplicate data: Removing duplicate data from the dataset is important to help the model avoid overfitting. Duplicates can be removed in a variety of ways, such as performing a database query that selects only unique rows, dropping duplicate rows with Python's pandas library, or using a statistical package such as R. Alternatively, we can keep the duplicates but flag them by adding a new column with a 0 or 1 indicating whether a row is a duplicate; the machine learning model can then use this column to avoid overfitting.
- Handling outliers: We must identify and address outliers, which are extreme values that deviate from the typical pattern in the data. We can either remove them or transform them to minimize their impact on the machine learning model. Domain knowledge is important in determining how best to recognize and handle outliers in the data.
- Handling inconsistent data: Addressing inconsistent data, such as incorrect, conflicting, or flawed values, by standardizing formats, resolving discrepancies, or using domain knowledge to correct errors.
- Handling imbalanced data: If there is an imbalance in the data, for example, if there are many more samples of one class than of the others, we can use techniques such as oversampling (replicating minority-class samples) or undersampling (removing majority-class samples).
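To make the first four tasks concrete, here is a minimal pandas sketch. The file name, the column names (age, income, and country), and the thresholds are illustrative assumptions:

```python
import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical file and columns

# Missing data: impute a numeric column with its median, and drop rows
# that are less than 80% complete (thresh = minimum non-missing values).
df["age"] = df["age"].fillna(df["age"].median())
df = df.dropna(thresh=int(0.8 * df.shape[1]))

# Duplicate data: either drop exact duplicates outright...
df = df.drop_duplicates()
# ...or keep them but flag them in a new 0/1 column instead:
# df["is_duplicate"] = df.duplicated().astype(int)

# Outliers: clip a numeric column to its 1st-99th percentile range
# (a simple transformation; domain knowledge may suggest removal instead).
low, high = df["income"].quantile([0.01, 0.99])
df["income"] = df["income"].clip(lower=low, upper=high)

# Inconsistent data: standardize formats, for example, country codes.
df["country"] = df["country"].str.strip().str.upper()
```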
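For imbalanced data, simple random oversampling and undersampling can be sketched with plain pandas, as shown next. The binary label column and its 0/1 class values are assumptions, and resampling should be applied to the training set only; dedicated libraries such as imbalanced-learn offer more sophisticated techniques, such as SMOTE:

```python
import pandas as pd

df = pd.read_csv("train.csv")  # hypothetical training data with a "label" column
majority = df[df["label"] == 0]
minority = df[df["label"] == 1]

# Oversampling: replicate minority-class rows (sampling with replacement)
# until both classes have the same number of samples.
minority_up = minority.sample(n=len(majority), replace=True, random_state=42)
balanced_over = pd.concat([majority, minority_up])

# Undersampling: randomly drop majority-class rows instead.
majority_down = majority.sample(n=len(minority), random_state=42)
balanced_under = pd.concat([majority_down, minority])
```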
Feature engineering
This involves creating new features or transforming existing features into ones that are more informative and relevant to the problem, to enhance the performance of machine learning algorithms. Many techniques can be used for feature engineering; the right choice depends on the specifics of the dataset and the machine learning algorithms used. The following are some common feature engineering techniques, followed by a few short code sketches:
- Feature selection: This involves selecting the most relevant features for the machine learning algorithm. There are two main types of feature selection methods:
- Filter method: With this method, we can select features based on their individual characteristics, such as variance or correlation with the target variable.
- Wrapper method: With this method, we can select features by iteratively building and evaluating models on different subsets of features.
- Feature extraction: This is the process of transforming raw data into features that capture relevant and meaningful information. The following lists some examples:
- Applying statistical transformations, such as normalization or standardization, and dimensionality reduction techniques, such as principal component analysis (PCA), which transforms high-dimensional data into a lower-dimensional space while capturing as much of the variation in the data as possible.
- Converting categorical data into binary values using techniques such as one-hot encoding.
- Converting text data into numerical representations, such as bag-of-words or text embeddings.
- Extracting image features using techniques such as convolutional neural networks (CNNs).
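To illustrate the two feature selection approaches, here is a sketch of a filter method that ranks features by their absolute correlation with the target and a wrapper method using scikit-learn's recursive feature elimination (RFE). The file names, the number of features to keep, and the choice of estimator are assumptions:

```python
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X_train = pd.read_csv("X_train.csv")            # hypothetical numeric features
y_train = pd.read_csv("y_train.csv")["label"]   # hypothetical binary target

# Filter method: score each feature independently by its absolute
# correlation with the target and keep the top five.
correlations = X_train.corrwith(y_train).abs().sort_values(ascending=False)
filter_selected = correlations.head(5).index.tolist()

# Wrapper method: repeatedly fit a model and eliminate the weakest
# feature until only five features remain.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X_train, y_train)
wrapper_selected = X_train.columns[rfe.support_].tolist()
```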
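The following sketch illustrates three of the extraction techniques listed above: standardization, PCA, and one-hot encoding. The column names and the number of components are assumptions; in practice, the scaler and PCA should be fit on the training set only:

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("train.csv")  # hypothetical file and columns
numeric = df[["age", "income", "tenure"]]

# Standardization: rescale each numeric feature to zero mean, unit variance.
scaler = StandardScaler()
numeric_scaled = scaler.fit_transform(numeric)

# PCA: project the scaled features onto the two directions that capture
# the most variation in the data.
pca = PCA(n_components=2)
components = pca.fit_transform(numeric_scaled)

# One-hot encoding: expand each category into its own 0/1 column.
encoded = pd.get_dummies(df[["city"]], columns=["city"])
```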
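Finally, here is a short sketch of the bag-of-words representation for text, using scikit-learn's CountVectorizer; the example sentences are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the model fits the training data",
    "the model predicts on the test data",
]

# Bag-of-words: each document becomes a vector of word counts over the
# shared vocabulary; word order is discarded.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(counts.toarray())                    # one count vector per document
```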
Let’s summarize what we’ve covered in this chapter.