Handling imbalanced data
Imbalanced data is a common challenge for classification problems. Because machine learning (ML) models such as XGBoost learn from historical data, a model cannot learn a pattern for which it has no examples. If your data contains only 3 samples of a particular category, the model can't learn to predict members of that category as effectively as if it had 3,000 samples. Similarly, if you have two categories (a binary classifier) and one category has many more members than the other in the training data, you essentially train the model to predict just that category.

Think about it this way: imagine you have been asked to predict the color of a ball pulled out of a bag. Every ball you have observed being pulled out so far has been red. What color would you guess next? Red, of course. You would be surprised to see a ball of a different color, but not shocked, especially if told in advance that there are two possibilities...
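The majority-class effect described above can be sketched numerically. The counts below (97 majority samples, 3 minority samples) are made up for illustration. A degenerate "model" that always guesses the majority class scores 97% accuracy while never once predicting the minority class; the negative-to-positive ratio computed at the end is the value commonly suggested for XGBoost's `scale_pos_weight` parameter as one way to counteract such imbalance.

```python
from collections import Counter

# Hypothetical imbalanced training labels: 97 majority (0), 3 minority (1)
labels = [0] * 97 + [1] * 3
counts = Counter(labels)

# A degenerate "model" that always predicts the most common class
majority_class = counts.most_common(1)[0][0]
predictions = [majority_class] * len(labels)

# Accuracy looks impressive even though class 1 is never predicted
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(f"majority-class accuracy: {accuracy:.2f}")  # 0.97

# One common mitigation in XGBoost: weight the minority (positive) class
# by the negative-to-positive ratio via the scale_pos_weight parameter
scale_pos_weight = counts[0] / counts[1]
print(f"suggested scale_pos_weight: {scale_pos_weight:.1f}")
```

This is why accuracy alone is a misleading metric on imbalanced data: a model can score highly while being useless for the rare category you care about.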