Chapter 3. Text Mining
Much of the data available today is text: unstructured, massive in volume, and tremendously varied. In this chapter, we look at the tools available in R for extracting useful information from text.
This chapter describes different ways of mining text. We will cover the following topics:
- Examining the text in various ways
- Converting text to lowercase
- Removing punctuation
- Removing numbers
- Removing URLs
- Removing stop words
- Using word stems rather than individual word forms
- Building a document-term matrix delineating word usage
- XML processing, both orthogonal and of varying degrees
- Examples
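As a preview of the cleaning steps listed above, here is a minimal sketch in base R. The sample sentences, the tiny stop-word list, and the helper name clean_text are all assumptions made up for illustration, not material from this chapter, and stemming is left out because base R has no built-in stemmer. Treat it as one possible arrangement of the steps rather than the definitive approach.

```r
# Hypothetical sample documents (not taken from the chapter)
docs <- c("Visit https://example.com for 3 FREE samples!",
          "The miners were mining text in 2015.")

# A tiny illustrative stop-word list (an assumption; real lists are much longer)
stop_words <- c("the", "for", "in", "were")

# clean_text is a hypothetical helper that applies the cleaning steps in order
clean_text <- function(x, stop_words) {
  x <- tolower(x)                               # convert to lowercase
  x <- gsub("https?://[^[:space:]]+", " ", x)   # remove URLs (before punctuation)
  x <- gsub("[[:punct:]]+", " ", x)             # remove punctuation
  x <- gsub("[[:digit:]]+", " ", x)             # remove numbers
  words <- strsplit(trimws(x), "[[:space:]]+")  # split each document into words
  # drop stop words and any empty strings
  lapply(words, function(w) w[!(w %in% stop_words) & nzchar(w)])
}

tokens <- clean_text(docs, stop_words)

# Build a simple matrix of term counts:
# one row per document, one column per term
terms <- sort(unique(unlist(tokens)))
dtm <- t(sapply(tokens, function(w) table(factor(w, levels = terms))))
print(dtm)
```

URLs are removed before punctuation here because stripping punctuation first would break a URL into fragments that the URL pattern could no longer match.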