XGBoost for Regression Predictive Modeling and Time Series Analysis: Learn how to build, evaluate, and deploy predictive models with expert guidance

By Partha Pritam Deka and Joyce Weiner

Paperback | Publishing in Dec 2024 | 308 pages | 1st Edition

An Overview of Machine Learning, Classification, and Regression

In this chapter, we will present an overview of fundamental machine learning concepts. You will learn about supervised and unsupervised learning techniques, then move on to classification and regression trees, and finally discuss ensemble models. You will also learn about data preparation and data engineering.

In this chapter, we will cover the following topics:

  • Fundamentals of machine learning
  • Supervised and unsupervised learning
  • Classification and regression tree models
  • Ensemble models – bagging versus boosting
  • Data preparation and data engineering

Fundamentals of machine learning

Machine learning is a type of artificial intelligence (AI) that allows software applications to become more accurate at predicting outcomes without being explicitly programmed to do so. Machine learning algorithms use historical data as input to predict new output values. In essence, it is the science of finding patterns in data and making predictions by learning from large amounts of data rather than by following explicitly programmed rules. There are many different algorithms, but they fall primarily into two types: supervised and unsupervised.

Supervised and unsupervised learning

In supervised learning, an algorithm learns to map the relationship between inputs and outputs based on a labeled dataset. A labeled dataset includes the input data (also known as features) and the corresponding output labels (also known as targets). The aim of supervised learning is to build a mapping function that can accurately predict the output for new data. Examples of supervised learning include classification and regression. Classification focuses on predicting a discrete label, while regression focuses on predicting a continuous quantity.
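To make this concrete, here is a minimal sketch of both supervised tasks using scikit-learn; the tiny feature arrays and labels are invented purely for illustration:

```python
# A minimal sketch of supervised learning with scikit-learn.
# The tiny feature arrays and labels are invented for illustration.
from sklearn.linear_model import LinearRegression, LogisticRegression

# Classification: predict a discrete label (will this customer churn?).
X_clf = [[5, 1], [40, 0], [3, 1], [60, 0]]  # features: tenure (months), complaints
y_clf = [1, 0, 1, 0]                        # targets: 1 = churned, 0 = stayed
clf = LogisticRegression().fit(X_clf, y_clf)
print(clf.predict([[4, 1]]))                # -> a discrete class, e.g., [1]

# Regression: predict a continuous quantity (a house price).
X_reg = [[50], [80], [120], [200]]          # feature: size in square meters
y_reg = [150_000, 220_000, 310_000, 500_000]
reg = LinearRegression().fit(X_reg, y_reg)
print(reg.predict([[100]]))                 # -> a continuous value
```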

Unsupervised learning, by contrast, trains an algorithm to identify patterns and structures in data without any prior knowledge of the correct labels or outputs. The algorithm finds patterns, groupings, or clusters within the data on its own. Common examples of unsupervised learning include clustering, dimensionality reduction, and anomaly detection.
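As a minimal illustration, the following sketch (again with made-up points) lets k-means discover two groups without ever seeing a label:

```python
# A minimal sketch of unsupervised learning: clustering with k-means.
# No labels are given; the algorithm discovers the two groups on its own.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 1.1], [1.2, 0.9], [8.0, 8.2], [7.9, 8.1]])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)
print(kmeans.labels_)  # e.g., [0 0 1 1]: two discovered clusters
```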

In summary, supervised learning requires labeled data with known outputs, whereas unsupervised learning requires unlabeled data without any known outputs. Supervised learning is more commonly used for prediction, classification, or regression tasks, while unsupervised learning is more commonly used for exploratory data analysis and discovering hidden patterns or insights in data.

Classification and regression tree models

Classification and regression trees (CART) are a type of supervised learning algorithm that can be used both for classification and regression problems.

In a classification problem, the goal is to predict the class, label, or category of a data point or object. One example of a classification problem is predicting, based on historical data, whether a customer will churn or whether a customer will purchase a product.

In a regression problem, the goal is to predict a continuous numerical value. For example, a regression CART model could be used to predict the price of a house based on input features such as its size, location, and other relevant attributes.

CART models are built by recursively splitting the data into subsets based on the value of the feature that best separates the data. The algorithm chooses the feature that maximizes the separation of the classes or minimizes the variance of the target variable. The splitting process is repeated until the data can no longer be split any further.

This process creates a tree-like structure where each internal node represents a feature or attribute, and each leaf node represents a predicted class label or a predicted continuous value. The tree can then be used to predict the class label or continuous value for new data points by following the path down the tree based on their features.

Figure 1.1 – A sample classification and regression tree
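To see this in code, here is a hedged sketch that fits a small regression tree with scikit-learn; the house sizes, location tiers, and prices are invented for illustration:

```python
# A sketch of a regression tree (CART) with scikit-learn.
# The house sizes, location tiers, and prices are invented.
from sklearn.tree import DecisionTreeRegressor, export_text

X = [[60, 1], [90, 1], [120, 2], [200, 3]]  # size (m^2), location tier
y = [150_000, 230_000, 320_000, 520_000]    # price

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

# Each internal node tests one feature; each leaf holds a predicted value.
print(export_text(tree, feature_names=["size", "location"]))

# Predicting a new house follows a path from the root down to a leaf.
print(tree.predict([[100, 2]]))
```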

CART models are easy to explain and can handle both categorical and numerical features. However, they can be prone to overfitting. Overfitting is a phenomenon in machine learning where a model performs extremely well on the training data but fails to generalize to unseen data. Regularization techniques such as pruning can be used to prevent overfitting. In the context of tree models, pruning refers to selectively removing branches that contribute little predictive power, which improves the model's efficiency, reduces its complexity, and helps prevent overfitting. The following table summarizes the advantages and disadvantages of CART models:

Advantages of CART models:

  • Easy to understand and interpret
  • Relatively fast to train
  • Can be used for both classification and regression problems

Disadvantages of CART models:

  • Prone to overfitting
  • Sensitive to noise in the data
  • Can be computationally expensive to train, especially for large datasets, because the algorithm needs to search through all possible splits in the data to find the optimal tree structure
Table 1.1 – Advantages and disadvantages of CART models

As the preceding table shows, CART models are a powerful supervised learning tool that can be used for a variety of machine learning tasks. However, they have limitations, and we must take steps to prevent overfitting.
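As an illustration of one common pruning technique, here is a hedged sketch of cost-complexity pruning in scikit-learn on a built-in toy dataset; the mid-range alpha chosen below is arbitrary and would be tuned in practice:

```python
# A sketch of cost-complexity pruning, one common way to regularize a CART model.
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# An unpruned tree memorizes the training data but generalizes poorly.
full = DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr)

# Pick a pruning strength (alpha) from the tree's own pruning path;
# the mid-range choice here is arbitrary and would be tuned in practice.
path = full.cost_complexity_pruning_path(X_tr, y_tr)
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
pruned = DecisionTreeRegressor(ccp_alpha=alpha, random_state=42).fit(X_tr, y_tr)

print("unpruned R^2:", full.score(X_tr, y_tr), full.score(X_te, y_te))
print("pruned R^2:  ", pruned.score(X_tr, y_tr), pruned.score(X_te, y_te))
```

Comparing the train and test scores of the two trees shows the trade-off: the pruned tree gives up some training accuracy in exchange for better generalization.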

Ensemble models: bagging versus boosting

Ensemble modeling is a machine learning technique that combines multiple models to create a more accurate and robust model. The individual models in an ensemble are called base models. The ensemble model learns from the base models and makes predictions by combining their predictions.

Bagging and boosting are two popular ensemble learning methods used in machine learning to create more accurate models by combining individual models. However, they differ in their approach and the way they combine models.

Bagging (bootstrap aggregation) creates multiple models by repeatedly sampling the original dataset with replacement, which means some data points may appear in several models' training subsets, while others may not appear in any of them. Each model is trained on its own subset, and the final prediction is obtained by averaging the individual predictions in the case of regression, or by voting in the case of classification. Because it uses a resampling technique, bagging reduces variance, that is, the impact that training on a different dataset would have on the model.
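Here is a hedged sketch of bagging using scikit-learn's BaggingRegressor (whose default base model is a decision tree) on a built-in toy dataset:

```python
# A sketch of bagging: many trees, each fit on a bootstrap resample of the
# training data, with their predictions averaged at the end.
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor

X, y = load_diabetes(return_X_y=True)
bag = BaggingRegressor(
    n_estimators=100,   # independently trained base models (decision trees by default)
    bootstrap=True,     # sample the training set with replacement
    random_state=42,
).fit(X, y)
print(bag.predict(X[:3]))  # each value is the average of 100 trees' predictions
```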

Boosting is an iterative technique that improves models sequentially, with each model being trained to correct the mistakes of the previous ones. To begin with, a base model is trained on the entire training dataset. Subsequent models are then trained with adjusted weights that give more importance to the instances the previous models misclassified. The final prediction is obtained by combining the predictions of all individual models using a weighted sum, where the weights are assigned based on the performance of each model. Boosting reduces the bias in the model. In this context, bias means the assumptions being made about the form of the model function. For example, if you use a linear model, you are assuming that the equation that predicts the data is linear – the model is biased toward linearity. As you might expect, decision tree models tend to be less biased than linear regression or logistic regression models. Boosting iterates on the model and further reduces the bias.
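And a matching sketch of boosting, here with XGBoost's scikit-learn-style API; the hyperparameter values are illustrative, not recommendations:

```python
# A sketch of boosting with XGBoost: trees are added one after another,
# each new tree fit to the errors left by the ensemble built so far.
from sklearn.datasets import load_diabetes
from xgboost import XGBRegressor

X, y = load_diabetes(return_X_y=True)
booster = XGBRegressor(
    n_estimators=200,    # number of sequential boosting rounds
    learning_rate=0.05,  # how strongly each new tree corrects the previous ones
    max_depth=3,         # shallow trees keep each base model weak
).fit(X, y)
print(booster.predict(X[:3]))
```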

The following table summarizes the key differences between bagging and boosting:

Bagging:

  • Models are trained individually, independently, and in parallel
  • Each model has equal weight in the final prediction
  • Variance is reduced and overfitting is mitigated
  • Example ensemble model: Random Forest

Boosting:

  • Models are trained sequentially, with each model trying to correct the mistakes of the previous one
  • Each model's weight in the final prediction depends on its performance
  • Bias is reduced, but overfitting may occur
  • Example ensemble models: AdaBoost, Gradient Boosting, and XGBoost
Table 1.2 – Key differences between bagging and boosting

The following diagram depicts the conceptual difference between bagging and boosting:

Figure 1.2 – Bagging versus boosting

Next, let’s explore the two key steps in any machine learning process: data preparation and data engineering.

Data preparation and data engineering

Data preparation and data engineering are two essential steps in the machine learning process, specifically for supervised learning. We will cover each in turn in Chapters 2 and 4; for now, we'll provide an overview. Data preparation and data engineering involve collecting, storing, and managing data so that it is accessible and useful for machine learning, as well as cleaning, transforming, and formatting data so that it can be used to train and evaluate machine learning models. Let's explore the following topics:

  1. Collecting data: Here, we gather data from a variety of sources, such as databases, sensors, or the internet.
  2. Storing data: Here, we store data in an efficient and accessible manner, for example, in SQL or NoSQL databases or in file systems.
  3. Formatting data: Here, we ensure that data is consistently stored in the required format, for example, as tables in an SQL database or as JSON, Excel, CSV, or text files.
  4. Splitting data: To verify that your model is not overfitting, you need to test it on part of the dataset. For this test to be effective, the model should not “know” what the testing data looks like. This is why you divide the data into a training set and a testing set, using a technique called a train-test split, whose purpose is to evaluate the performance of a machine learning model on unseen data. The split should be done before any complicated data cleaning or feature engineering: data leakage occurs when a data preparation step passes information about the test set into the training set, for example, if you offset all data points by the mean of the entire dataset. Feature engineering techniques learn parameters from the data, and it is critical to learn these parameters from the training set only.

The training set is used to train the model by feeding it with input data and the corresponding output labels. The model learns patterns and relationships in the training data, which it uses to make predictions.

The testing set, however, is used to evaluate the performance of the trained model. It serves as a proxy for new, unseen data. The model makes predictions on the testing set, and the predictions are compared against the known ground truth labels. This evaluation helps assess how well the model generalizes to new data and provides an estimate of its performance.
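A minimal sketch of a train-test split with scikit-learn, done before any model fitting or feature engineering:

```python
# A minimal sketch of a train-test split, done before any fitting or
# feature engineering so that no test-set information can leak into training.
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42  # hold out 20% as unseen data
)
print(X_train.shape, X_test.shape)
```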

Data cleaning

Here, we identify and handle issues in the dataset that can affect the performance and reliability of machine learning models. Some of the tasks performed during data cleaning are as follows (a short pandas sketch follows the list):

  • Handling missing data: Identifying and dealing with missing values by imputing them (replacing missing values with estimated values) or removing instances or features with a significant number of missing values.
  • Handling duplicate data: Removing duplicate data from the dataset is important for the model to avoid overfitting. Duplicate values can be removed in a variety of ways, such as performing a database query to select unique rows, using Python's pandas library to drop duplicate rows, or using a statistical package such as R to remove duplicate rows. We can also handle duplicate data by keeping the duplicates but marking them as such by adding a new column with a 0 or 1 to indicate duplicates. This new column can be used by the machine learning model to avoid overfitting.
  • Handling outliers: We must identify and address outliers, which are extreme values that deviate from the typical pattern in the data. We can either remove them or transform them to minimize the impact on the machine learning model. Domain knowledge is important in determining how best to recognize and handle outliers in the data.
  • Handling inconsistent data: Addressing inconsistent data, such as incorrect, conflicting, or flawed values, by standardizing formats, resolving discrepancies, or using domain knowledge to correct errors.
  • Handling imbalanced data: If there is an imbalance in the data, for example, if there are many more of one category than the others, we can use techniques such as oversampling (replicating minority class samples) or undersampling (removing majority class samples).
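Here is the promised sketch of a few of these cleaning steps in pandas; the tiny DataFrame and the 99th-percentile outlier cutoff are invented for illustration, and in practice imputation statistics should be learned from the training set only:

```python
# A sketch of a few common cleaning steps with pandas; the tiny DataFrame
# and the 99th-percentile outlier cutoff are invented for illustration.
import pandas as pd

df = pd.DataFrame({
    "size": [60.0, 90.0, 90.0, None, 5000.0],  # a missing value and an outlier
    "price": [150_000, 230_000, 230_000, 310_000, 320_000],
})

df = df.drop_duplicates()                             # handle duplicate rows
# Impute missing values; in practice, learn the median from the training set only.
df["size"] = df["size"].fillna(df["size"].median())
df = df[df["size"] < df["size"].quantile(0.99)]       # crude outlier filter
print(df)
```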

Feature engineering

This involves creating new features or transforming existing ones into features that are more informative and relevant to the problem, to enhance the performance of machine learning algorithms. Many techniques can be used for feature engineering; the choice depends on the specifics of the dataset and the machine learning algorithms used. The following are some common feature engineering techniques (a short sketch follows the list):

  • Feature selection: This involves selecting the most relevant features for the machine learning algorithm. There are two main types of feature selection methods:
    • Filter method: With this method, we can select features based on their individual characteristics, such as variance or correlation with the target variable.
    • Wrapper method: With this method, we can select features by iteratively building and evaluating models on different subsets of features.
  • Feature extraction: This is the process of transforming raw data into features that capture relevant and meaningful information. The following are some examples:
    • Applying scaling transformations, such as normalization or standardization, and dimensionality reduction techniques, such as principal component analysis (PCA), which transforms high-dimensional data into a lower-dimensional space while capturing as much of the variation in the data as possible.
    • Converting categorical data into binary values using techniques such as one-hot encoding.
    • Converting text data into numerical representations using techniques such as bag-of-words and text embeddings.
    • Extracting image features using techniques such as convolutional neural networks (CNNs).
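Here is the sketch promised above, combining a filter-style selection step with a few common extraction steps in pandas and scikit-learn; the columns and values are invented:

```python
# A sketch combining filter-style feature selection with common feature
# extraction steps; the columns and values below are invented.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "size": [60.0, 90.0, 120.0, 200.0],
    "rooms": [2, 3, 4, 6],
    "city": ["paris", "lyon", "paris", "nice"],
})

# One-hot encode the categorical column into binary indicator columns.
df = pd.get_dummies(df, columns=["city"])

# Standardize numerical features to zero mean and unit variance.
X = StandardScaler().fit_transform(df)

# Filter method: drop features whose variance is below a threshold.
X = VarianceThreshold(threshold=0.0).fit_transform(X)

# PCA: project onto the directions that capture the most variance.
X_2d = PCA(n_components=2).fit_transform(X)
print(X_2d.shape)  # (4, 2)
```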

Let’s summarize what we’ve covered in this chapter.

Summary

In this chapter, you were introduced to the fundamentals of machine learning, got an overview of CART models, and learned about bagging and boosting, two ensemble methods that improve the performance of a CART model. You were also introduced to data preparation and data engineering. These topics are the fundamentals you need to start with machine learning, and you have only touched the tip of the iceberg; we will cover all of them in more depth in the following chapters.

In the next chapter, we'll go through a quick-start introduction with a worked example so you can apply the concepts you've learned here.


Key benefits

  • Get up and running with this quick-start guide to building a classifier using XGBoost
  • Get an easy-to-follow, in-depth explanation of the XGBoost technical paper
  • Leverage XGBoost for time series forecasting by using moving average, frequency, and window methods
  • Purchase of the print or Kindle book includes a free PDF eBook

Description

XGBoost offers a powerful solution for regression and time series analysis, enabling you to build accurate and efficient predictive models. In this book, the authors draw on their combined experience of 40+ years in the semiconductor industry to help you harness the full potential of XGBoost, from understanding its core concepts to implementing real-world applications.

As you progress, you'll get to grips with the XGBoost algorithm, including its mathematical underpinnings and its advantages over other ensemble methods. You'll learn when to choose XGBoost over other predictive modeling techniques, and get hands-on guidance on implementing it using both the Python API and the scikit-learn API. You'll also master essential techniques for time series data, including feature engineering, handling lag features, encoding techniques, and evaluating model performance.

A unique aspect of this book is the chapter on model interpretability, where you'll use tools such as SHAP, LIME, ELI5, and Partial Dependence Plots (PDP) to understand your XGBoost models. Throughout the book, you'll work through several hands-on exercises and real-world datasets. By the end of this book, you'll not only be building accurate models but will also be able to deploy and maintain them effectively, ensuring your solutions deliver real-world impact.

Who is this book for?

This book is for data scientists, machine learning practitioners, analysts, and professionals interested in predictive modeling and time series analysis. Basic coding knowledge and familiarity with Python, GitHub, and other DevOps tools are required.

What you will learn

  • Build a strong, intuitive understanding of the XGBoost algorithm and its benefits
  • Implement XGBoost using the Python API for practical applications
  • Evaluate model performance using appropriate metrics
  • Deploy XGBoost models into production environments
  • Handle complex datasets and extract valuable insights
  • Gain practical experience in feature engineering, feature selection, and categorical encoding

Product Details

Publication date: Dec 13, 2024
Length: 308 pages
Edition: 1st
Language: English
ISBN-13: 9781805123057


Table of Contents

Part 1: Introduction to Machine Learning and XGBoost with Case Studies
Chapter 1: An Overview of Machine Learning, Classification, and Regression
Chapter 2: XGBoost Quick Start Guide with an Iris Data Case Study
Chapter 3: Demystifying the XGBoost Paper
Chapter 4: Adding on to the Quick Start – Switching out the Dataset with a Housing Data Case Study
Part 2: Practical Applications – Data, Features, and Hyperparameters
Chapter 5: Classification and Regression Trees, Ensembles, and Deep Learning Models – What’s Best for Your Data?
Chapter 6: Data Cleaning, Imbalanced Data, and Other Data Problems
Chapter 7: Feature Engineering
Chapter 8: Encoding Techniques for Categorical Features
Chapter 9: Using XGBoost for Time Series Forecasting
Chapter 10: Model Interpretability, Explainability, and Feature Importance with XGBoost
Part 3: Model Evaluation Metrics and Putting Your Model into Production
Chapter 11: Metrics for Model Evaluations and Comparisons
Chapter 12: Managing a Feature Engineering Pipeline in Training and Inference
Chapter 13: Deploying Your XGBoost Model
Index
Other Books You May Enjoy