Swim preference – analysis involving a random forest
We will use the example from Chapter 3, Decision Trees, concerning swim preferences. We have the same data table, as follows:
| Swimming suit | Water temperature | Swim preference |
|---------------|-------------------|-----------------|
| None          | Cold              | No              |
| None          | Warm              | No               |
| Small         | Cold              | No              |
| Small         | Warm              | No              |
| Good          | Cold              | No              |
| Good          | Warm              | Yes             |
We would like to construct a random forest from this data and use it to classify the item (Good, Cold, ?).
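
For reference, here is a minimal Python encoding of this data and of the item to classify; the variable names are ours, not the chapter's:

```python
# The table above, encoded in Python; each row pairs the two input
# variables with the swim-preference class label.
data = [
    # ((swimming suit, water temperature), swim preference)
    (("None",  "Cold"), "No"),
    (("None",  "Warm"), "No"),
    (("Small", "Cold"), "No"),
    (("Small", "Warm"), "No"),
    (("Good",  "Cold"), "No"),
    (("Good",  "Warm"), "Yes"),
]

# The item whose swim preference the forest should predict:
item_to_classify = ("Good", "Cold")
```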
Analysis
We are given M=3 variables according to which a data sample can be classified. In a random forest algorithm, we usually do not use all three variables to form tree branches at each node; we use only a subset of m variables out of the M available, choosing m so that it is less than or equal to M. The greater m is, the stronger the classifier in each constructed tree. However, as mentioned earlier, using more variables also introduces more bias. But, because we use multiple trees (each with a lower m), even if every constructed tree is a weak classifier, their combined classification accuracy is strong. As we want to reduce bias in a random forest, we consider a value of m somewhat smaller than M.
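
To see this trade-off in running code, here is a minimal sketch using scikit-learn as a stand-in for the chapter's own forest construction; the parameter choices are our assumptions, and it reuses the `data` and `item_to_classify` names defined above:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OrdinalEncoder

X_raw = [row for row, _ in data]    # the six (suit, temperature) pairs
y = [label for _, label in data]    # the six swim-preference labels

encoder = OrdinalEncoder()          # map categorical values to integers
X = encoder.fit_transform(X_raw)

# max_features plays the role of m: only one of the two encoded input
# columns is considered at each split, so every tree is a weak classifier,
# but the majority vote over many such trees is strong.
forest = RandomForestClassifier(n_estimators=10, max_features=1, random_state=0)
forest.fit(X, y)

query = encoder.transform([item_to_classify])
print(forest.predict(query))        # likely ['No'], matching the table row
```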