Experimenting with parameters that support imbalanced classes
A common problem in ML is classifying rare events. Consider large earthquakes: those of magnitude 7 and higher occur about once every year. Suppose you had a dataset recording the Earth’s tectonic activity for each day of the last decade, with a response column indicating whether or not a large earthquake occurred. You would have approximately 3,650 rows of data, one for each day in the decade, of which only around 8-12 rows would show a large earthquake. That is a chance of roughly 0.3% that the event occurs on any given day; about 99.7% of the time, there will be no large earthquake. A dataset like this, where one class (here, large earthquake events) makes up such a small share of the rows, is called an imbalanced dataset.
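The arithmetic behind these proportions can be checked with a short sketch. It is only an illustration: the labels are synthetic, the count of 10 large-earthquake days is an assumed value picked from the 8-12 range above, and the helper name class_balance is hypothetical.

```python
# Illustrative only: synthetic labels, not real seismic data.
DAYS_IN_DECADE = 3650   # approximation used in the text (ignores leap days)
LARGE_QUAKE_DAYS = 10   # assumed value inside the 8-12 range from the text


def class_balance(labels):
    """Return (positives, negatives, positive_rate) for a list of 0/1 labels."""
    positives = sum(labels)
    negatives = len(labels) - positives
    return positives, negatives, positives / len(labels)


# 1 = a large earthquake occurred that day, 0 = no large earthquake.
labels = [1] * LARGE_QUAKE_DAYS + [0] * (DAYS_IN_DECADE - LARGE_QUAKE_DAYS)
positives, negatives, positive_rate = class_balance(labels)

print(f"positive days: {positives}")
print(f"negative days: {negatives}")
print(f"positive rate: {positive_rate:.2%}")
print(f"negative-to-positive ratio: {negatives / positives:.0f}:1")
```

With these assumed counts, the rare class is outnumbered by several hundred to one, which is the situation the rest of this section deals with.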
The problem with an imbalanced dataset is that even if you write a simple if-else function that marks all tectonic events as not earthquakes and call this a model, it will still show the...