Extracting features from categorical variables
Many problems have explanatory variables that are categorical or nominal. A categorical variable can take one of a fixed set of values. For example, an application that predicts the salary for a job might use categorical variables such as the city in which the position is located. Categorical variables are commonly encoded using one-of-k encoding, or one-hot encoding, in which the explanatory variable is represented using one binary feature for each of its possible values.
For example, let's assume our model has a city variable that can take one of three values: New York, San Francisco, or Chapel Hill. One-hot encoding represents the variable using one binary feature for each of the three possible cities. scikit-learn's DictVectorizer class is a transformer that can be used to one-hot encode categorical features:
# In[1]:
from sklearn.feature_extraction import DictVectorizer

onehot_encoder = DictVectorizer()
X = [
    {'city': 'New York'},
    {'city': 'San Francisco'},
    {'city': 'Chapel Hill'}
]
# Fit the encoder on the data and transform it to a binary feature matrix.
print(onehot_encoder.fit_transform(X).toarray())
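The fitted DictVectorizer also exposes the mapping it learned, which is useful for interpreting the columns of the encoded matrix. The following snippet is an illustrative aside rather than part of the original example; it assumes the onehot_encoder and X objects defined in the previous block.

# In[2]:
# Illustrative follow-up (assumes onehot_encoder and X from the block above).
# DictVectorizer sorts feature names alphabetically, so 'city=Chapel Hill'
# maps to the first column, 'city=New York' to the second, and
# 'city=San Francisco' to the third.
print(onehot_encoder.feature_names_)
# Encode a new observation using the mapping learned during fitting.
print(onehot_encoder.transform([{'city': 'San Francisco'}]).toarray())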