Count or frequency encoding
Count or frequency encoding replaces categorical values with their occurrence counts or relative frequencies within the dataset. This technique allows us to represent categories numerically without expanding the feature space, which is useful when dealing with high-cardinality categorical features.
For instance, consider a dataset containing information about different neighborhoods and house characteristics. If a particular neighborhood (e.g., OldTown
) appears 10 times in the dataset, count encoding would replace OldTown
with 10
, whereas frequency encoding would replace it with the fraction of times it appears (e.g., 10 out of 100 observations would be 0.1
). This method effectively captures the representation of each label but can lose valuable information if different categories have the same count or frequency. For example, if both blue
and red
appear 10 times, they will be encoded identically.
Let’s apply count and frequency encoding using...