Feature extraction from text
When using text in machine learning, we need to convert it into a list of features that a machine learning algorithm can understand, which means converting text to numbers. Optimus supports two approaches for this:
- Bag of words
- TF-IDF
Let's see how you can use these methods in Optimus.
Bag of words
In the bag of words approach, we take all the words and then count the number of occurrences of each word.
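To make the counting step concrete, here is a minimal sketch in plain Python. It does not use Optimus and is not how Optimus implements the method. The function name bag_of_words, the lowercase whitespace tokenization, and the sample sentences are all assumptions made for illustration.

```python
from collections import Counter


def bag_of_words(documents):
    """Return (vocabulary, rows); each row holds the per-word counts for one document."""
    # Naive tokenization (an assumption of this sketch): lowercase, split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    # A sorted vocabulary gives a stable column order.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing words count as 0
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows


# Hypothetical sample texts, not taken from any dataset.
docs = ["the cat sat", "the cat saw the dog"]
vocabulary, rows = bag_of_words(docs)
```

Each entry in vocabulary plays the role of a column name, and each list in rows is the count vector for one text. A real pipeline would normally also handle punctuation and stop words, which this sketch ignores.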
Because a corpus can contain millions of words, after counting the occurrences of each word it can be useful to keep only the most frequent words in the text, as shown in the following figure:
To apply bag of words in Optimus, you can use the following code:
_df = df.cols.bag_of_words("text")
This returns a large dataframe with one column per word and, in every row, the number of times that word occurs. Because...