Feature extraction
As mentioned earlier in this chapter, the NLP system does not understand string values. They need numerical input to build models, sometimes they are also called numerical features. Feature extraction in NLP is converting a set of text information into a set of numerical features. Any machine learning algorithm that you are going to train would need features in numerical vector forms as it does not understand the string. There are many ways text can be represented as numerical vectors. Some such ways are One hot encoding, TF-IDF, Word2Vec, and CountVectorizer.
One hot encoding
One hot encoding is the binary sparse vector representation of text. In this encoding, the resulting binary vector is all zero-value except at the position or index of the token where it is one. Let's look at it with an example. Suppose there are two sentences: This is Big Data AI Book. This is book explains AI algorithms on Big Data
.
Unique tokens (nouns) for earlier sentences would be {data,AI,book...