Ensemble models: bagging versus boosting
Ensemble modeling is a machine learning technique that combines multiple models to produce a more accurate and robust model. The individual models in an ensemble are called base models, and the ensemble makes its predictions by combining theirs.
Bagging and boosting are two popular ensemble learning methods. Both build a stronger model by combining individual models, but they differ in how the base models are trained and how their predictions are combined.
Bagging (bootstrap aggregation) creates multiple models by repeatedly sampling the original dataset with replacement, so some data points may appear in several of the resulting subsets while others may not appear in any. Each model is trained on its own subset, and the final prediction is obtained by averaging the individual predictions in the case of regression, or by majority vote in the case of classification. Because it uses this resampling technique, bagging reduces variance, that is, the impact that training on a different dataset would have on the model.
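As a minimal sketch of this idea, the following example uses scikit-learn's BaggingClassifier with decision trees as the base models; the synthetic dataset and the parameter values are purely illustrative assumptions, not a prescription:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative synthetic dataset; in practice you would use your own data
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each of the 100 trees is trained on a bootstrap sample of the training set
# (note: scikit-learn versions before 1.2 call this parameter base_estimator)
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    bootstrap=True,  # sample with replacement
    random_state=42,
)
bagging.fit(X_train, y_train)

# The final prediction is a majority vote across the individual trees
print("Bagging accuracy:", bagging.score(X_test, y_test))
```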
Boosting is an iterative technique that improves the models sequentially, with each model trained to correct the mistakes of the previous ones. To begin with, a base model is trained on the entire training dataset. Subsequent models are then trained with adjusted weights that give more importance to the instances misclassified by the previous models. The final prediction is obtained by combining the predictions of all the individual models in a weighted sum, where each weight is assigned based on the performance of that model. Boosting reduces the bias in the model. In this context, bias means the assumptions made about the form of the model function. For example, if you use a linear model, you are assuming that the equation that predicts the data is linear; the model is biased towards linearity. As you might expect, decision tree models tend to be less biased than linear regression or logistic regression models. Boosting iterates on the model and further reduces the bias.
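A comparable sketch for boosting uses scikit-learn's AdaBoostClassifier, which re-weights misclassified instances at each iteration and combines the base models with a performance-based weighted vote; again, the data and hyperparameters here are only illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Illustrative synthetic dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Each new weak learner focuses on the instances the previous ones misclassified;
# the final prediction is a weighted vote based on each learner's performance
boosting = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=42)
boosting.fit(X_train, y_train)

print("Boosting accuracy:", boosting.score(X_test, y_test))
```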
The following table summarizes the key differences between bagging and boosting:
| Bagging | Boosting |
| --- | --- |
| Models are trained independently and in parallel | Models are trained sequentially, with each model trying to correct the mistakes of the previous models |
| Each model has equal weight in the final prediction | Each model's weight in the final prediction depends on its performance |
| Variance is reduced and overfitting is mitigated | Bias is reduced, but overfitting may occur |
| Example algorithms: Random Forest | Example algorithms: AdaBoost, Gradient Boosting, and XGBoost |
Table 1.2 – Table summarizing the differences between bagging and boosting
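To see the two approaches side by side, the following sketch trains the bagging-based Random Forest and the boosting-based Gradient Boosting models mentioned in the table on the same synthetic data; the dataset and hyperparameters are assumptions for illustration only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Illustrative synthetic dataset shared by both models
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    "Random Forest (bagging)": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting (boosting)": GradientBoostingClassifier(n_estimators=100, random_state=42),
}

# Train each ensemble and report its accuracy on the held-out test set
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: {model.score(X_test, y_test):.3f}")
```

Which approach performs better depends on the dataset and the tuning of each model, so results from a sketch like this should be read as illustrative rather than as a general ranking.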
The following diagram depicts the conceptual difference between bagging and boosting:
Figure 1.2 – Bagging versus boosting
Next, let’s explore the two key steps in any machine learning process: data preparation and data engineering.