Entropy and information gain
Before we explain how to create a Decision Tree, we need to introduce two important concepts—entropy and information gain.
Entropy measures the homogeneity of a dataset. Imagine a dataset of 10 observations with a single attribute, where the value of this attribute is A for all 10 observations, as shown in the following diagram. This dataset is completely homogeneous, so it is easy to predict the value of the next observation, which will probably be A:
The entropy of a completely homogeneous dataset is zero. Now, imagine a similar dataset in which each observation has a different value, as shown in the following diagram:
Now the dataset is completely heterogeneous, and it is hard to predict the next observation; the entropy of this dataset is higher. The formula to calculate the entropy is $E = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)$, where $p(x_i)$ is the probability of the value $x_i$. For the first, homogeneous dataset, $p(A) = 1$, so the entropy is $-1 \cdot \log_2(1) = 0$, as expected.
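To make the formula concrete, here is a minimal Python sketch of this calculation (the function name entropy and the use of collections.Counter are our own choices for illustration, not from the text):

from collections import Counter
from math import log2

def entropy(values):
    # Shannon entropy (base 2) of a list of attribute values.
    total = len(values)
    counts = Counter(values)
    # Sum -p(x) * log2(p(x)) over each distinct value x.
    return sum(-(count / total) * log2(count / total)
               for count in counts.values())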
Try to calculate the entropy for the following datasets:
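If you want to check your answers, you can apply the sketch above to the two datasets described earlier (the lists below are our re-creations of the diagrams, not data from the text):

homogeneous = ['A'] * 10            # every observation has the value A
heterogeneous = list('ABCDEFGHIJ')  # every observation has a different value

print(entropy(homogeneous))    # 0.0 -> completely predictable
print(entropy(heterogeneous))  # ~3.32, that is, log2(10) -> very unpredictable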
Now we understand how entropy helps us measure how predictable a dataset is...