One-hot encoding
One-hot encoding is a simple yet powerful method for transforming categorical variables into binary vectors. For each category, this technique creates a binary column indicating the presence (1) or absence (0) of that category. The challenge with this method is that, as mentioned previously, if you have hundreds or thousands of unique categories, this method creates hundreds or thousands of new columns. This can make the model inefficient and can also lead to overfitting. One-hot encoding works best for cases where the number of categories is on the order of 10 unique values. For example, a Gender variable with categories female and male can be one-hot encoded into two columns: female (1 if female, 0 otherwise) and male (1 if male, 0 otherwise). Similarly, a Color variable with categories red, blue, and green would be encoded into three binary columns.
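The Color example can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production encoder: the helper name one_hot_encode and the sample list colors are hypothetical, and sorting the categories alphabetically is an assumption made only to keep the column order deterministic.

```python
def one_hot_encode(values):
    """Return (categories, rows) with one binary column per distinct category.

    Hypothetical helper for illustration; not taken from any library.
    """
    # Sort the distinct categories so the column order is deterministic.
    categories = sorted(set(values))
    # Each row holds a 1 in the column matching its category and 0 elsewhere.
    rows = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, rows


# Made-up sample data for a Color variable with three categories.
colors = ["red", "blue", "green", "blue"]
categories, encoded = one_hot_encode(colors)

print(categories)
for color, row in zip(colors, encoded):
    print(color, row)
```

Every row contains exactly one 1, and the number of columns equals the number of distinct categories, which is why the column count grows quickly for high-cardinality variables. In practice, library routines such as pandas.get_dummies or scikit-learn's OneHotEncoder are the more usual choice.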
To reduce redundancy, (k-1) columns are often used for a variable with k categories. This approach is particularly...