Chapter 9. Text Analysis
Text analysis is a broad topic and is typically referred to as Natural Language Processing (NLP). It is used for many different tasks, including text searching, language translation, sentiment analysis, speech recognition, and classification, to mention a few. The process of analyzing natural language can be difficult due to the particularities and ambiguity found in natural languages. However, there has been a considerable amount of work in this area, and there are several Java APIs supporting this effort.
We will start with an introduction to the basic concepts and tasks used in NLP. These include the following:
- Tokenization: The process of splitting text into individual tokens or words.
- Stop words: These are words that are common and may not be necessary for processing. They include such words as the, a, and to.
- Named Entity Recognition (NER): This is the process of identifying elements of text such as people's names, locations, or things.
- Parts of Speech (POS): This identifies the grammatical parts of a sentence such as noun, verb, adjective, and so on.
- Relationships: Here, we are concerned with identifying how parts of text are related to each other, such as the subject and object of a sentence.
The concepts of words, sentences, and paragraphs are well known. However, extracting and analyzing these components is not always that straightforward. The term corpus frequently refers to a collection of text.
As with most data science problems, it is important to preprocess text. Frequently, this involves handling such tasks as these:
- Handling Unicode
- Converting text to uppercase or lowercase
- Removing stop words
We examined several techniques for tokenization and removing stop words in Chapter 3, Data Cleaning. In this chapter, we will focus on POS, NER, extracting relationships from sentences, text classification, and sentiment analysis.
There are several NLP APIs available, including these:
- OpenNLP (https://opennlp.apache.org/): An open source Apache project
- StanfordNLP (http://nlp.stanford.edu/software/): Another open source library
- UIMA (https://uima.apache.org/): An Apache project supporting pipelines
- LingPipe (http://alias-i.com/lingpipe/): A library that uses pipelines extensively
- DL4J (http://deeplearning4j.org/): The Deep Learning for Java library supports various classes for deep learning neural networks including support for NLP
We will use OpenNLP and DL4J to demonstrate text analysis in this chapter. We chose these because they are both well-known and have good published resources for additional support.
We will use Google's Word2Vec and Doc2Vec neural networks to perform text classification. This includes building feature vectors from surrounding words as well as using labeled information to classify documents. Finally, we will discuss sentiment analysis. This type of analysis seeks to determine the attitude expressed in text and also uses the Word2Vec network.
We start our discussion with NER.
Implementing named entity recognition
This is sometimes referred to as finding people and things. Given a text segment, we may want to identify all the names of people present. However, this is not always easy because a name such as Rob may also be used as a verb.
In this section, we will demonstrate how to use OpenNLP's TokenNameFinderModel class to find names and locations in text. While there are other entities we may want to find, this example will demonstrate the basics of the technique. We begin with names.
Most names occur within a single line. We do not want to use multiple lines because an entity such as a state might inadvertently be identified incorrectly. Consider the following sentences:
Jim headed north. Dakota headed south.
If we ignored the period, then the state of North Dakota might be identified as a location, when in fact it is not present.
Using OpenNLP to perform NER
We start our example with a try-catch block to handle exceptions. OpenNLP uses models that have been trained on different sets of data. In this example, the en-token.bin and en-ner-person.bin files contain the models for the tokenization of English text and for English name elements, respectively. These files can be downloaded from http://opennlp.sourceforge.net/models-1.5/. However, the IO stream used here is standard Java:
try (InputStream tokenStream = new FileInputStream(new File("en-token.bin"));
     InputStream personModelStream = new FileInputStream(new File("en-ner-person.bin"))) {
    ...
} catch (Exception ex) {
    // Handle exceptions
}
An instance of the TokenizerModel class is initialized using the token stream. This instance is then used to create the actual TokenizerME tokenizer. We will use this instance to tokenize our sentence:
TokenizerModel tm = new TokenizerModel(tokenStream); TokenizerME tokenizer = new TokenizerME(tm);
The TokenNameFinderModel class is used to hold a model for name entities. It is initialized using the person model stream. An instance of the NameFinderME class is created using this model since we are looking for names:
TokenNameFinderModel tnfm = new TokenNameFinderModel(personModelStream); NameFinderME nf = new NameFinderME(tnfm);
To demonstrate the process, we will use the following sentence. We then convert it to a series of tokens using the tokenizer's tokenize method:
String sentence = "Mrs. Wilson went to Mary's house for dinner."; String[] tokens = tokenizer.tokenize(sentence);
The Span class holds information regarding the positions of entities. The find method will return the position information, as shown here:
Span[] spans = nf.find(tokens);
This array holds information about person entities found in the sentence. We then display this information as shown here:
for (int i = 0; i < spans.length; i++) {
    out.println(spans[i] + " - " + tokens[spans[i].getStart()]);
}
The output for this sequence is as follows. Notice that it identifies the last name of Mrs. Wilson but not the "Mrs.":
[1..2) person - Wilson
[4..5) person - Mary
Once these entities have been extracted, we can use them for specialized analysis.
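Entities can span more than one token, so it is usually safer to read the full token range from each Span rather than only its first token. The following is a minimal sketch that continues the previous example and assumes the spans and tokens arrays are still in scope; the entities variable name is our own. Note that getEnd returns an exclusive index:

// Collect the covered text for each entity span (getEnd() is exclusive)
List<String> entities = new ArrayList<>();
for (Span span : spans) {
    StringBuilder entity = new StringBuilder();
    for (int i = span.getStart(); i < span.getEnd(); i++) {
        entity.append(tokens[i]);
        if (i < span.getEnd() - 1) {
            entity.append(" ");
        }
    }
    entities.add(entity.toString());
}
out.println(entities);

The Span class also provides a static spansToStrings method that performs a similar conversion in a single call.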
Identifying location entities
We can also find other types of entities such as dates and locations. In the following example, we find locations in a sentence. It is very similar to the previous person example, except that an en-ner-location.bin file is used for the model:
try (InputStream tokenStream = new FileInputStream("en-token.bin");
     InputStream locationModelStream = new FileInputStream(new File("en-ner-location.bin"))) {
    TokenizerModel tm = new TokenizerModel(tokenStream);
    TokenizerME tokenizer = new TokenizerME(tm);
    TokenNameFinderModel tnfm = new TokenNameFinderModel(locationModelStream);
    NameFinderME nf = new NameFinderME(tnfm);
    sentence = "Enid is located north of Oklahoma City.";
    String tokens[] = tokenizer.tokenize(sentence);
    Span spans[] = nf.find(tokens);
    for (int i = 0; i < spans.length; i++) {
        out.println(spans[i] + " - " + tokens[spans[i].getStart()]);
    }
} catch (Exception ex) {
    // Handle exceptions
}
With the sentence defined previously, the model was only able to find the second city, as shown here. This is likely due to the confusion that arises with the name Enid, which is both the name of a city and a person's name:
[5..7) location - Oklahoma
Suppose we use the following sentence:
sentence = "Pond Creek is located north of Oklahoma City.";
Then we get this output:
[1..2) location - Creek
[6..8) location - Oklahoma
Unfortunately, it has missed the town of Pond Creek. NER is a useful tool for many applications, but like many techniques, it is not always foolproof. The accuracy of the NER approach presented here, and of many of the other NLP examples, will vary depending on factors such as the accuracy of the model, the language being used, and the type of entity.
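One way to gauge how much to trust each extracted entity is to look at the model's own confidence. NameFinderME keeps the probabilities for the spans it found in its last find call and exposes them through its probs method. The sketch below continues the previous example; the 0.8 cutoff is an arbitrary value chosen for illustration, not a recommended threshold:

// Probabilities correspond one-to-one with the spans from the last find call
double[] probabilities = nf.probs(spans);
for (int i = 0; i < spans.length; i++) {
    if (probabilities[i] < 0.8) {
        out.println("Low-confidence entity: " + tokens[spans[i].getStart()]
            + " (" + probabilities[i] + ")");
    }
}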
We may also be interested in how text can be classified. We will examine one approach in the next section.
Classifying text
Classifying text is an important part of machine learning and data science. We have to be able to classify text for a variety of applications, including document retrieval and web searches. It is often important to assign specific labels to the data before we can determine its usefulness for a particular application or search result.
In this chapter, we are going to demonstrate a technique involving the use of paragraph vectors and labeled data with DL4J classes. This example allows us to read in documents and, based on the text inside of the document, assign a label (or classification) to the document. We are also going to show an example of classifying text by similarity. This means we will match phrases and words that have similar structure. This example will also use DL4J.
Word2Vec and Doc2Vec
We will be using Word2Vec and Doc2Vec in a few examples in this chapter. Word2Vec is a neural network with two layers used for text processing. Given a body of text, the network will provide feature vectors for the words contained in the text. These vectors are simply mathematical representations of the word features and can be numerically compared to other vectors. This comparison is often referred to as the distance between two words.
Word2Vec operates with the understanding that words can be classified by determining the probability that two words will occur together. Because of this methodology, Word2Vec can be used for more than classification of sentences. Any object or data that can be represented by text labels can be classified with this network.
Doc2Vec is an extension of Word2Vec. Rather than building vectors representing the features of individual words compared to other words, as Word2Vec does, this network compares words to given labels. The vectors are set up to represent the theme or overall meaning of a document. Our next example shows how these feature vectors are then associated with specific documents.
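Before moving on to that example, the following is a minimal sketch of what a bare Word2Vec model looks like in DL4J. It trains on the raw_sentences.txt file used later in this chapter and then asks for words whose vectors lie close to "day". The hyperparameter values and the query words are arbitrary choices for illustration, and the class and package names should be checked against the DL4J version you are using:

import java.io.File;
import java.util.Collection;
import org.deeplearning4j.models.word2vec.Word2Vec;
import org.deeplearning4j.text.sentenceiterator.BasicLineIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
import org.deeplearning4j.text.tokenization.tokenizer.preprocessor.CommonPreprocessor;
import org.deeplearning4j.text.tokenization.tokenizerfactory.DefaultTokenizerFactory;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

public class Word2VecSketch {
    public static void main(String[] args) throws Exception {
        // Read the corpus one sentence per line and normalize tokens
        SentenceIterator iter = new BasicLineIterator(new File("raw_sentences.txt"));
        TokenizerFactory tFact = new DefaultTokenizerFactory();
        tFact.setTokenPreProcessor(new CommonPreprocessor());

        Word2Vec vec = new Word2Vec.Builder()
                .minWordFrequency(5)   // ignore very rare words
                .layerSize(100)        // size of each word vector
                .windowSize(5)
                .iterate(iter)
                .tokenizerFactory(tFact)
                .build();
        vec.fit();

        // Words whose vectors are closest to "day", and a pairwise similarity score
        Collection<String> nearest = vec.wordsNearest("day", 10);
        System.out.println(nearest);
        System.out.println(vec.similarity("day", "night"));
    }
}

The distance reported by similarity is a cosine similarity between the two word vectors, which is the same measure used by the ParagraphVectors examples that follow.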
Classifying text by labels
In our first example using Doc2Vec, we will associate our documents with three labels: health, finance, and science. But before we can associate the data with labels, we have to define those labels and train our model to recognize the labels. Each label represents the meaning or classification of a particular piece of text.
In this example we will use sample documents, each pre-labelled with our categories: health, finance, or science. We will use these paragraphs to train our model and then, as in previous examples, use a set of test data to test our model. We will be using the files found at https://github.com/deeplearning4j/dl4j-examples/tree/master/dl4j-examples/src/main/resources/paravec. We have based this example upon sample code written for DL4J, which can be found at https://github.com/deeplearning4j/dl4j-examples/blob/master/dl4j-examples/src/main/java/org/deeplearning4j/examples/nlp/paragraphvectors/ParagraphVectorsClassifierExample.java.
First, we need to set up some instance variables to use later in our code. We will be using a ParagraphVectors object to create our vectors, a LabelAwareIterator object to iterate through our data, and a TokenizerFactory object to tokenize our data:
ParagraphVectors pVect; LabelAwareIterator iter; TokenizerFactory tFact;
Then we will set up our ClassPathResource. This specifies the directory within our project that contains the data files to be classified. The first resource contains our labeled data used for training purposes. We then direct our iterator and tokenizer to use the resources specified by the ClassPathResource. We also specify that we will use the CommonPreprocessor to preprocess our data:
ClassPathResource resource = new ClassPathResource("paravec/labeled");
iter = new FileLabelAwareIterator.Builder()
        .addSourceFolder(resource.getFile())
        .build();
tFact = new DefaultTokenizerFactory();
tFact.setTokenPreProcessor(new CommonPreprocessor());
Next, we build our ParagraphVectors. This is where we specify the learning rate, batch size, and number of training epochs. We include our iterator and tokenizer in the setup process as well. Once we've built our ParagraphVectors, we call the fit method to train our model using the training data in the paravec/labeled directory:
pVect = new ParagraphVectors.Builder()
        .learningRate(0.025)
        .minLearningRate(0.001)
        .batchSize(1000)
        .epochs(20)
        .iterate(iter)
        .trainWordVectors(true)
        .tokenizerFactory(tFact)
        .build();
pVect.fit();
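Training can take a while, so it is often worth persisting the fitted model rather than refitting it on every run. DL4J's WordVectorSerializer can write and read ParagraphVectors; the sketch below is an assumption-laden illustration (the file name is our own choice, and the exact serializer method names should be verified against your DL4J version):

// Save the trained vectors so later runs can skip the fit() step
File modelFile = new File("paravec-labeled-model.zip");
WordVectorSerializer.writeParagraphVectors(pVect, modelFile);

// Reload them later; the tokenizer factory must be set again before use
ParagraphVectors restored = WordVectorSerializer.readParagraphVectors(modelFile);
restored.setTokenizerFactory(tFact);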
Now that we have trained our model, we can use our unlabeled data to test it. We create a new ClassPathResource for our unlabeled data and create a new FileLabelAwareIterator as well:
ClassPathResource unlabeledText = new ClassPathResource("paravec/unlabeled");
FileLabelAwareIterator unlabeledIter = new FileLabelAwareIterator.Builder()
        .addSourceFolder(unlabeledText.getFile())
        .build();
The next step involves iterating through our unlabeled data and identifying the correct label for each document. We can generally expect that each document will fall into multiple labels but have a different weight, or percent match, for each. So, while one article may be mostly classified as a health article, it likely has enough information to be also classified, to a lesser degree, as a science article.
Next, we set up a MeansBuilder and a LabelSeeker object. These classes access tables containing the relationships between words and labels, which we will use in our ParagraphVectors. The InMemoryLookupTable class provides access to a default table for word lookup:
MeansBuilder mBuilder = new MeansBuilder(
        (InMemoryLookupTable<VocabWord>) pVect.getLookupTable(), tFact);
LabelSeeker lSeeker = new LabelSeeker(iter.getLabelsSource().getLabels(),
        (InMemoryLookupTable<VocabWord>) pVect.getLookupTable());
Finally, we iterate through our unlabeled documents. For each document, we convert the document into a vector and use our LabelSeeker to get its scores. We then print out the scores with the appropriate labels:
while (unlabeledIter.hasNextDocument()) {
    LabelledDocument doc = unlabeledIter.nextDocument();
    INDArray docCentroid = mBuilder.documentAsVector(doc);
    List<Pair<String, Double>> scores = lSeeker.getScores(docCentroid);
    out.println("Document '" + doc.getLabel() + "' falls into the following categories: ");
    for (Pair<String, Double> score : scores) {
        out.println(" " + score.getFirst() + ": " + score.getSecond());
    }
}
The output from our preceding print statements is as follows:
Document 'finance' falls into the following categories: 
 finance: 0.2889593541622162
 health: 0.11753179132938385
 science: 0.021202782168984413
Document 'health' falls into the following categories: 
 finance: 0.059537000954151154
 health: 0.27373185753822327
 science: 0.07699354737997055
In each instance, our documents were classified properly, as demonstrated by the higher percentage assigned to the correct label category. This classification can be used in conjunction with other data analysis techniques to draw additional conclusions about the data contained in the files. Often text classification is an initial or early step in a data analysis process as documents are classified into groups for further analysis.
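If only the single best category is needed, the score list can be reduced to the label with the highest value. The following helper is a small sketch of our own; it relies only on the Pair accessors already used in the loop above and would be added as a method of the same class:

// Return the label whose score is highest, or null for an empty score list
private static String bestLabel(List<Pair<String, Double>> scores) {
    String best = null;
    double bestScore = Double.NEGATIVE_INFINITY;
    for (Pair<String, Double> score : scores) {
        if (score.getSecond() > bestScore) {
            bestScore = score.getSecond();
            best = score.getFirst();
        }
    }
    return best;
}

Calling bestLabel(scores) inside the loop would return finance and health for the two documents shown above.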
Classifying text by similarity
In this next example, we will match different text samples based on their structure and similarity. We will still be using the ParagraphVectors class we used in the previous example. To begin, download the raw_sentences.txt file from GitHub (https://github.com/deeplearning4j/dl4j-examples/tree/master/dl4j-examples/src/main/resources) and add it to your project. This file contains a list of sentences which we will read in, label, and then compare.
First, we set up our ClassPathResource and assign an iterator to handle our file data. We have used a SentenceIterator for this example:
ClassPathResource srcFile = new ClassPathResource("/raw_sentences.txt"); File file = srcFile.getFile(); SentenceIterator iter = new BasicLineIterator(file);
Next, we will again use TokenizerFactory to tokenize our data. We also want to create a new LabelsSource object. This allows us to define the format of our sentence labels. We have chosen to prefix each line with LINE_:
TokenizerFactory tFact = new DefaultTokenizerFactory(); tFact.setTokenPreProcessor(new CommonPreprocessor()); LabelsSource labelFormat = new LabelsSource("LINE_");
Now we are ready to build our ParagraphVectors. Our setup process includes these methods: minWordFrequency, which specifies the minimum word frequency to use in the training corpus, and iterations, which specifies the number of iterations for each mini batch. We also set the number of epochs, the layer size, and the learning rate. Additionally, we include our LabelsSource, defined before, and our iterator and tokenizer. The trainWordVectors method specifies whether word and document representations should be built together. Finally, sampling determines whether subsampling should occur or not. We then call our build and fit methods:
ParagraphVectors vec = new ParagraphVectors.Builder()
        .minWordFrequency(1)
        .iterations(5)
        .epochs(1)
        .layerSize(100)
        .learningRate(0.025)
        .labelsSource(labelFormat)
        .windowSize(5)
        .iterate(iter)
        .trainWordVectors(false)
        .tokenizerFactory(tFact)
        .sampling(0)
        .build();
vec.fit();
Next, we will include some statements to evaluate the accuracy of our classifications. It is important to note that while line numbering in the document itself starts at 1, the label indexing begins at 0. So, for example, line 9836 in the document will be associated with the label LINE_9835. We will first compare three sentences that should be classified as somewhat similar, and then two examples comparing dissimilar sentences. The similarity method takes two labels and returns the relative distance between them as a double:
double similar1 = vec.similarity("LINE_9835", "LINE_12492");
out.println("Comparing lines 9836 & 12493 ('This is my house .'/'This is my world .') Similarity = " + similar1);
double similar2 = vec.similarity("LINE_3720", "LINE_16392");
out.println("Comparing lines 3721 & 16393 ('This is my way .'/'This is my work .') Similarity = " + similar2);
double similar3 = vec.similarity("LINE_6347", "LINE_3720");
out.println("Comparing lines 6348 & 3721 ('This is my case .'/'This is my way .') Similarity = " + similar3);
double dissimilar1 = vec.similarity("LINE_3720", "LINE_9852");
out.println("Comparing lines 3721 & 9853 ('This is my way .'/'We now have one .') Similarity = " + dissimilar1);
double dissimilar2 = vec.similarity("LINE_3720", "LINE_3719");
out.println("Comparing lines 3721 & 3720 ('This is my way .'/'At first he says no .') Similarity = " + dissimilar2);
The output of our print statements is shown as follows. Compare the result of the similarity method for the three similar sentences and the two dissimilar sentences. Of particular note, the similarity method result for the last example, two very dissimilar sentences, returned a negative number. This implies a more significant disparity:
16:56:15.423 [main] INFO o.d.m.s.SequenceVectors - Epoch: [1]; Words vectorized so far: [3171540]; Lines vectorized so far: [485810]; learningRate: [1.0E-4]
Comparing lines 9836 & 12493 ('This is my house .'/'This is my world .') Similarity = 0.7641470432281494
Comparing lines 3721 & 16393 ('This is my way .'/'This is my work .') Similarity = 0.7246013879776001
Comparing lines 6348 & 3721 ('This is my case .'/'This is my way .') Similarity = 0.8988922834396362
Comparing lines 3721 & 9853 ('This is my way .'/'We now have one .') Similarity = 0.5840312242507935
Comparing lines 3721 & 3720 ('This is my way .'/'At first he says no .') Similarity = -0.6491150259971619
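Raw similarity scores like these can be reduced to a simple same-topic decision by applying a threshold. The helper below is a sketch of our own; the 0.7 cutoff is an arbitrary value chosen for illustration and would need tuning against real data:

// Decide whether two labeled sentences are "similar enough", given a cutoff
private static boolean sameTopic(ParagraphVectors vec, String labelA,
        String labelB, double cutoff) {
    return vec.similarity(labelA, labelB) >= cutoff;
}

With a cutoff of 0.7, sameTopic(vec, "LINE_9835", "LINE_12492", 0.7) returns true for the first pair above, while the negative-scoring last pair returns false.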
Although this example uses ParagraphVectors like our first classification example, it demonstrates flexibility in our approach. We can use these DL4J libraries to classify data in more than one manner.
Understanding tagging and POS
POS is concerned with identifying the types of components found in a sentence. For example, this sentence has several elements, including the verb "has", several nouns such as "example" and "elements", and adjectives such as "several". Tagging, or more specifically POS tagging, is the process of associating element types to words.
POS tagging is useful as it adds more information about the sentence. We can ascertain the relationship between words and often their relative importance. The results of tagging are often used in later processing steps.
This task can be difficult as we are unable to rely upon a simple dictionary of words to determine their type. For example, the word lead can be used as both a noun and as a verb. We might use it in either of the following two sentences:
He took the lead in the play. Lead the way!
POS tagging will attempt to associate the proper label to each word of a sentence.
Using OpenNLP to identify POS
To illustrate this process, we will be using OpenNLP (https://opennlp.apache.org/). This is an open source Apache project which supports many other NLP processing tasks.
We will be using the POSModel class, which can be trained to recognize POS elements. In this example, we will use it with a previously trained model based on the Penn TreeBank tag-set (http://www.comp.leeds.ac.uk/ccalas/tagsets/upenn.html). Various pretrained models are found at http://opennlp.sourceforge.net/models-1.5/. We will be using the en-pos-maxent.bin model. This has been trained on English text using what is called maximum entropy.
Maximum entropy refers to the amount of uncertainty in the model, which the training process maximizes. For a given problem, there is a set of probabilities describing what is known about the data set. These probabilities are used to build a model. For example, we may know that there is a 23 percent chance that one specific event may follow a certain condition. We do not want to make any assumptions about unknown probabilities, so we avoid adding unjustified information. A maximum entropy approach attempts to preserve as much uncertainty as possible; hence it attempts to maximize entropy.
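In standard notation, this amounts to choosing, among all distributions consistent with the observed feature counts, the one with the largest entropy. The following is the usual textbook formulation of the objective, not anything specific to OpenNLP's implementation:

\[
\max_{p} \; H(p) = -\sum_{x} p(x)\,\log p(x)
\quad \text{subject to} \quad
\sum_{x} p(x)\, f_i(x) = \tilde{E}[f_i] \;\; \text{for each feature } f_i,
\qquad \sum_{x} p(x) = 1
\]

Here, \(\tilde{E}[f_i]\) is the empirical expectation of feature \(f_i\) observed in the training data.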
We will also use the POSTaggerME class, which is a maximum entropy tagger. This is the class that will make tag predictions. With any sentence, there may be more than one way of classifying, or tagging, its components.
We start with code to acquire the previously trained English tagger model and a simple sentence to be tagged:
try (InputStream input = new FileInputStream(new File("en-pos-maxent.bin"))) {
    String sentence = "Let's parse this sentence.";
    ...
} catch (IOException ex) {
    // Handle exceptions
}
The tagger uses an array of strings, where each string is a word. The following sequence takes the previous sentence and creates an array called words. The first part uses the Scanner class to parse the sentence string. We could have used other code to read the data from a file if needed. After that, the List class's toArray method is used to create the array of strings:
List<String> list = new ArrayList<>();
Scanner scanner = new Scanner(sentence);
while (scanner.hasNext()) {
    list.add(scanner.next());
}
String[] words = new String[1];
words = list.toArray(words);
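Because Scanner splits only on whitespace, the final period stays attached to its word, which is why sentence. appears as a single token in the output later in this section. OpenNLP also ships its own rule-based tokenizers; the following one-line sketch could replace the Scanner code above. The whitespace tokenizer behaves like Scanner, whereas SimpleTokenizer.INSTANCE or the model-based TokenizerME used in the NER examples would split the punctuation off:

// Alternative to the Scanner-based code above; no model file is required
String[] words = WhitespaceTokenizer.INSTANCE.tokenize(sentence);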
The model is then built using the file containing the model:
POSModel posModel = new POSModel(input);
The tagger is then created based on the model:
POSTaggerME posTagger = new POSTaggerME(posModel);
The tag method does the actual work. It is passed an array of words and returns an array of tags. The words and tags are then displayed:
String[] posTags = posTagger.tag(words);
for (int i = 0; i < posTags.length; i++) {
    out.println(words[i] + " - " + posTags[i]);
}
The output for this example follows:
Let's - NNP
parse - NN
this - DT
sentence. - NN
The analysis has determined that the word let's is a singular proper noun, while the words parse and sentence are singular nouns. The word this is a determiner, that is, a word that modifies another and helps identify a phrase as general or specific. A list of tags is provided in the next section.
Understanding POS tags
The POS elements are returned as abbreviations. A list of Penn TreeBank POS tags can be found at https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html. The following is a shortened version of this list:
| Tag | Description | Tag | Description |
| --- | --- | --- | --- |
| DT | Determiner | RB | Adverb |
| JJ | Adjective | RBR | Adverb, comparative |
| JJR | Adjective, comparative | RBS | Adverb, superlative |
| JJS | Adjective, superlative | RP | Particle |
| NN | Noun, singular or mass | SYM | Symbol |
| NNS | Noun, plural | TOP | Top of the parse tree |
| NNP | Proper noun, singular | VB | Verb, base form |
| NNPS | Proper noun, plural | VBD | Verb, past tense |
| POS | Possessive ending | VBG | Verb, gerund or present participle |
| PRP | Personal pronoun | VBN | Verb, past participle |
| PRP$ | Possessive pronoun | VBP | Verb, non-3rd person singular present |
| S | Simple declarative clause | VBZ | Verb, 3rd person singular present |
As mentioned earlier, there may be more than one possible set of POS assignments for a sentence. The topKSequences method, as shown next, will return various assignment possibilities along with a score. The method returns an array of Sequence objects whose toString method returns the score and POS list:
Sequence sequences[] = posTagger.topKSequences(words);
for (Sequence sequence : sequences) {
    out.println(sequence);
}
The output for the previous sentence follows, where the first sequence, which has the highest score, is considered to be the most probable alternative:
-2.3264880694837213 [NNP, NN, DT, NN]
-2.6610271245387853 [NNP, VBD, DT, NN]
-2.6630142638557217 [NNP, VB, DT, NN]
Each line of output assigns possible tags to each word of the sentence. We can see that only the second word, parse, is determined to have other possible tags.
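If the alternatives need to be processed rather than just printed, the Sequence class also exposes its contents directly through its getScore and getOutcomes accessors. The following sketch keeps any alternative whose score is within 0.5 of the best one; the margin is an arbitrary value for illustration, and it assumes the sequences are returned best-first, as the scores above suggest:

// Each Sequence carries its score and the list of predicted tags
double bestScore = sequences[0].getScore();
for (Sequence sequence : sequences) {
    if (sequence.getScore() >= bestScore - 0.5) {
        List<String> tags = sequence.getOutcomes();
        out.println(tags + " (score " + sequence.getScore() + ")");
    }
}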
Next, we will demonstrate how to extract relationships from text.
Extracting relationships from sentences
Knowing the relationship between elements of a sentence is important in many analysis tasks. It is useful for assessing the important content of a sentence and providing insight into the meaning of a sentence. This type of analysis has been used for tasks ranging from grammar checking to speech recognition to language translations.
In the previous section, we demonstrated one approach used to extract the parts of speech. Using this technique, we were able to identify the sentence element types present in a sentence. However, the relationships between these elements are missing. We need to parse the sentence to extract these relationships.
Using OpenNLP to extract relationships
There are several techniques and APIs that can be used to extract this type of information. In this section we will use OpenNLP to demonstrate one way of extracting the structure of a sentence. The demonstration is centered around the ParserTool
class, which uses a previously trained model. The parsing process returns probabilities that indicate how likely it is that the extracted sentence elements are correct. As with many NLP tasks, there are often multiple possible answers.
We start with a try-with-resources block to open an input stream for the model. The en-parser-chunking.bin
file contains a model that parses text into its POS. In this case, it is trained for English:
try (InputStream modelInputStream = new FileInputStream(
        new File("en-parser-chunking.bin"))) {
    ...
} catch (Exception ex) {
    // Handle exceptions
}
Within the try block an instance of the ParserModel
class is created using the input stream. The actual parser is created next using the ParserFactory
class's create
method:
ParserModel parserModel = new ParserModel(modelInputStream);
Parser parser = ParserFactory.create(parserModel);
We will use the following sentence to test the parser. The ParserTool
class's parseLine
method does the actual parsing and returns an array of Parse
objects. Each of these objects holds one parsing alternative. The last argument of the parseLine
method specifies how many alternatives to return:
String sentence = "Let's parse this sentence.";
Parse[] parseTrees = ParserTool.parseLine(sentence, parser, 3);
The next sequence displays each of the possibilities:
for (Parse tree : parseTrees) {
    tree.show();
}
The output of the show method for this example follows. The tags were defined previously, in the Understanding POS tags section:
(TOP (NP (NP (NNP Let's) (NN parse)) (NP (DT this) (NN sentence.))))
(TOP (S (NP (NNP Let's)) (VP (VB parse) (NP (DT this) (NN sentence.)))))
(TOP (S (NP (NNP Let's)) (VP (VBD parse) (NP (DT this) (NN sentence.)))))
The following example reformats the last two outputs to better show the relationships. They differ in how they classify the verb parse:
(TOP
  (S
    (NP (NNP Let's))
    (VP (VB parse)
      (NP (DT this) (NN sentence.))
    )
  )
)
(TOP
  (S
    (NP (NNP Let's))
    (VP (VBD parse)
      (NP (DT this) (NN sentence.))
    )
  )
)
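If we want to work with these relationships programmatically rather than simply display them, we can walk the parse tree ourselves. The following minimal sketch, which is not part of the original example, uses the Parse class's getChildren, getType, and getCoveredText methods to print each constituent and the text it covers, indented to reflect the nesting of the tree:
private static void printConstituents(Parse parse, int depth) {
    // Indent to reflect the depth of the constituent in the parse tree
    for (int i = 0; i < depth; i++) {
        out.print("  ");
    }
    out.println(parse.getType() + " -> " + parse.getCoveredText());
    // Recursively process each child constituent
    for (Parse child : parse.getChildren()) {
        printConstituents(child, depth + 1);
    }
}
Calling printConstituents(parseTrees[0], 0) prints the highest-scoring parse one constituent per line, which makes relationships such as the noun phrases attached to the verb phrase easy to pick out.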
When there are multiple parse alternatives, the Parse
class's getProb
method returns a probability that reflects the model's confidence in each alternative. The following sequence demonstrates this method:
for (Parse tree : parseTrees) {
    out.println("Probability: " + tree.getProb());
}
The output follows:
Probability: -3.6810244423259078
Probability: -3.742475884515823
Probability: -4.16148634555491
Another interesting NLP task is sentiment analysis, which we will demonstrate next.
Sentiment analysis
Sentiment analysis involves the evaluation and classification of words based on their context, meaning, and emotional implications. Typically, if we look up a word in a dictionary we will find a meaning or definition for the word but, taken out of the context of a sentence, we may not be able to ascribe a detailed and precise meaning to it.
For example, the word toast could be defined as simply a slice of heated and browned bread. But in the context of the sentence He's toast!, the meaning changes completely. Sentiment analysis seeks to derive meanings of words based on their context and usage.
It is important to note that advanced sentiment analysis will expand beyond simple positive or negative classification and ascribe detailed emotional meaning to words. It is far simpler to classify words as positive or negative but far more useful to classify them as happy, furious, indifferent, or anxious.
This type of analysis falls into the category of affective computing, a branch of computing concerned with the emotional implications and uses of technological tools. Affective computing is especially significant given the growing amount of emotionally influenced data readily available for analysis on social media sites today.
Being able to determine the emotional content of text enables a more targeted and appropriate response. For example, being able to judge the emotional tone of a chat session between a customer and a technical representative can allow the representative to do a better job. This can be especially important when there is a cultural or language gap between them.
This type of analysis can also be applied to visual images. It could be used to gauge someone's response to a new product, such as when conducting a taste test, or to judge how people react to scenes of a movie or commercial.
As part of our example we will be using a bag-of-words model. Bag-of-words models simplify word representation for natural language processing by reducing text to a collection, known as the bag, of its words, irrespective of grammar or word order. The words have features used for classification, most importantly the frequency of each word. Because some words such as the, a, or and will naturally have a higher frequency in any text, the words are given a weight as well. Common words with less contextual significance receive a smaller weight and factor less into the text analysis.
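To make the idea concrete, the following minimal sketch, which is illustrative only and not taken from the DL4J example, builds a weighted bag of words from a short review. It assumes the usual java.util imports and the statically imported out used throughout this chapter's examples. Common words receive a smaller weight so they factor less into the analysis:
Map<String, Double> bag = new HashMap<>();
Set<String> commonWords = new HashSet<>(
    Arrays.asList("the", "a", "and", "was", "to"));
String review = "The acting was great and the plot was great";
for (String word : review.toLowerCase().split("\\s+")) {
    // Common words receive a smaller weight than content words
    double weight = commonWords.contains(word) ? 0.1 : 1.0;
    bag.merge(word, weight, Double::sum);
}
bag.forEach((word, score) -> out.println(word + " : " + score));
In this sketch, great accumulates a score of 2.0 while the stop words stay close to zero, illustrating how frequency and weighting combine before any classification takes place.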
Downloading and extracting the Word2Vec model
To demonstrate sentiment analysis, we will use Google's Word2Vec models in conjunction with DL4J to simply classify movie reviews as either positive or negative based upon the words used in the review. This example is adapted from work done by Alex Black (https://github.com/deeplearning4j/dl4j-examples/blob/master/dl4j-examples/src/main/java/org/deeplearning4j/examples/recurrent/word2vecsentiment/Word2VecSentimentRNN.java). As discussed previously in this chapter, Word2Vec consists of two-layer neural networks trained to build meaning from the context of words. We will also be using a large set of movie reviews from http://ai.stanford.edu/~amaas/data/sentiment/.
Before we begin, you will need to download the Word2Vec data from https://code.google.com/p/word2vec/. The basic process includes:
- Downloading and extracting the movie reviews
- Loading the Word2Vec Google News vectors
- Loading each movie review
The words within the reviews are then broken into vectors and used to train the network. We will train the network across five epochs and evaluate the network's performance after each epoch.
To begin, we first declare three final variables. The first is the URL to retrieve the training data, the second is the location to store our extracted data, and the third is the location of the Google News vectors on the local machine. Modify this third variable to reflect the location on your local machine:
public static final String TRAINING_DATA_URL =
    "http://ai.stanford.edu/~amaas/"
    + "data/sentiment/aclImdb_v1.tar.gz";
public static final String EXTRACT_DATA_PATH =
    FilenameUtils.concat(
        System.getProperty("java.io.tmpdir"),
        "dl4j_w2vSentiment/");
public static final String GNEWS_VECTORS_PATH =
    "C:/YOUR_PATH/GoogleNews-vectors-negative300.bin"
    + "/GoogleNews-vectors-negative300.bin";
Next we download and extract our model data. The next two methods are modelled after the code found in the DL4J example. We first create a new method, getModelData
. The method is shown next in its entirety.
First we create a new File
using the EXTRACT_DATA_PATH
we defined previously. If the file does not already exist, we create a new directory. Next, we create two more File
objects, one for the path to the archived TAR file and one for the path to the extracted data. Before we attempt to extract the data, we check whether these two files exist. If the archive path does not exist, we download the data from the TRAINING_DATA_URL
and then extract the data. If the extracted file does not exist, we then extract the data:
private static void getModelData() throws Exception {
    File modelDir = new File(EXTRACT_DATA_PATH);
    if (!modelDir.exists()) {
        modelDir.mkdir();
    }
    String archivePath = EXTRACT_DATA_PATH + "aclImdb_v1.tar.gz";
    File archiveName = new File(archivePath);
    String extractPath = EXTRACT_DATA_PATH + "aclImdb";
    File extractName = new File(extractPath);
    if (!archiveName.exists()) {
        FileUtils.copyURLToFile(new URL(TRAINING_DATA_URL), archiveName);
        extractTar(archivePath, EXTRACT_DATA_PATH);
    } else if (!extractName.exists()) {
        extractTar(archivePath, EXTRACT_DATA_PATH);
    }
}
To extract our data, we will create another method called extractTar
. We will provide two inputs to the method, the archivePath
and the EXTRACT_DATA_PATH
defined before. We also need to define our buffer size to use in the extraction process:
private static final int BUFFER_SIZE = 4096;
We first create a new TarArchiveInputStream
. We use the GzipCompressorInputStream
because it provides support for extracting .gz
files. We also use the BufferedInputStream
to improve performance in our extraction process. The compressed file is very large and may take some time to download and extract.
Next we create a TarArchiveEntry
and begin reading in data using the TarArchiveInputStream
getNextEntry
method. As we process each entry in the compressed file, we first check whether the entry is a directory. If it is, we create a new directory in our extraction location. Finally, we create a new FileOutputStream
and BufferedOutputStream
and use the write
method to write our data in the extracted location:
private static void extractTar(String dataIn, String dataOut)
        throws IOException {
    try (TarArchiveInputStream inStream = new TarArchiveInputStream(
            new GzipCompressorInputStream(
                new BufferedInputStream(
                    new FileInputStream(dataIn))))) {
        TarArchiveEntry tarFile;
        while ((tarFile = (TarArchiveEntry) inStream.getNextEntry()) != null) {
            if (tarFile.isDirectory()) {
                new File(dataOut + tarFile.getName()).mkdirs();
            } else {
                int count;
                byte data[] = new byte[BUFFER_SIZE];
                FileOutputStream fileInStream =
                    new FileOutputStream(dataOut + tarFile.getName());
                BufferedOutputStream outStream =
                    new BufferedOutputStream(fileInStream, BUFFER_SIZE);
                while ((count = inStream.read(data, 0, BUFFER_SIZE)) != -1) {
                    outStream.write(data, 0, count);
                }
                // Close the stream so the buffered data is flushed to disk
                outStream.close();
            }
        }
    }
}
Building our model and classifying text
Now that we have created methods to download and extract our data, we need to declare and initialize variables used to control the execution of our model. Our batchSize
specifies the number of reviews we process in each training minibatch, in this case 50
. Our vectorSize
determines the size of the vectors. The Google News model has word vectors of size 300
. nEpochs
refers to the number of times we attempt to run through our training data. Finally, truncateReviewsToLength
specifies the length at which, for memory utilization purposes, we truncate the movie reviews. We have chosen to truncate reviews longer than 300
words:
int batchSize = 50;
int vectorSize = 300;
int nEpochs = 5;
int truncateReviewsToLength = 300;
Now we can set up our neural network. We will use a MultiLayerConfiguration
network, as discussed in Chapter 8, Deep Learning
. In fact, our example here is very similar to the model built in the Configuring and building a model section of that chapter, with a few differences. In particular, in this model we will use a faster learning rate and a GravesLSTM
recurrent network in layer 0. We will have the same number of input neurons as we have words in our vector, in this case, 300
. We also use gradientNormalization
, a technique used to help our algorithm find the optimal solution. Notice we are using the softmax
activation function, which was discussed in Chapter 8, Deep Learning
. This function uses regression and is especially suited for classification algorithms:
MultiLayerConfiguration sentimentNN =
    new NeuralNetConfiguration.Builder()
        .optimizationAlgo(OptimizationAlgorithm
            .STOCHASTIC_GRADIENT_DESCENT).iterations(1)
        .updater(Updater.RMSPROP)
        .regularization(true).l2(1e-5)
        .weightInit(WeightInit.XAVIER)
        .gradientNormalization(GradientNormalization
            .ClipElementWiseAbsoluteValue)
        .gradientNormalizationThreshold(1.0)
        .learningRate(0.0018)
        .list()
        .layer(0, new GravesLSTM.Builder()
            .nIn(vectorSize).nOut(200)
            .activation("softsign").build())
        .layer(1, new RnnOutputLayer.Builder()
            .activation("softmax")
            .lossFunction(LossFunctions.LossFunction.MCXENT)
            .nIn(200).nOut(2).build())
        .pretrain(false).backprop(true).build();
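As a reminder of what the output layer computes, the following small sketch, which is not part of the example code, implements the softmax function: it converts the raw outputs of the two output neurons into probabilities that sum to one, which is why it is well suited to classifying a review as negative or positive:
private static double[] softmax(double[] rawOutputs) {
    // Subtract the maximum value for numerical stability
    double max = Double.NEGATIVE_INFINITY;
    for (double value : rawOutputs) {
        max = Math.max(max, value);
    }
    double sum = 0.0;
    double[] probabilities = new double[rawOutputs.length];
    for (int i = 0; i < rawOutputs.length; i++) {
        probabilities[i] = Math.exp(rawOutputs[i] - max);
        sum += probabilities[i];
    }
    // Normalize so the probabilities sum to one
    for (int i = 0; i < probabilities.length; i++) {
        probabilities[i] /= sum;
    }
    return probabilities;
}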
We can then create our MultiLayerNetwork
, initialize the network, and set listeners.
MultiLayerNetwork net = new MultiLayerNetwork(sentimentNN);
net.init();
net.setListeners(new ScoreIterationListener(1));
Next we create a WordVectors
object to load our Google data. We use a DataSetIterator
to test and train our data. The AsyncDataSetIterator
allows us to load our data in a separate thread, to improve performance. This process requires a large amount of memory and so improvements such as this are essential for optimal performance:
// Load the Google News model vectors
WordVectors wordVectors = WordVectorSerializer
    .loadGoogleModel(new File(GNEWS_VECTORS_PATH), true, false);
DataSetIterator trainData = new AsyncDataSetIterator(
    new SentimentExampleIterator(EXTRACT_DATA_PATH, wordVectors,
        batchSize, truncateReviewsToLength, true), 1);
DataSetIterator testData = new AsyncDataSetIterator(
    new SentimentExampleIterator(EXTRACT_DATA_PATH, wordVectors,
        100, truncateReviewsToLength, false), 1);
Finally, we are ready to train and evaluate our data. We run through our data nEpochs
times; in this case, we have five iterations. Each iteration executes the fit
method against our training data and then creates a new Evaluation
object to evaluate our model using testData
. The evaluation is based on around 25,000 movie reviews and can take a significant amount of time to run. As we evaluate the data, we create INDArray
objects to store information, including the feature matrix and labels from our data. This data is used later in the evalTimeSeries
method for evaluation. Finally, we print out our evaluation statistics:
for (int i = 0; i < nEpochs; i++) {
    net.fit(trainData);
    trainData.reset();
    Evaluation evaluation = new Evaluation();
    while (testData.hasNext()) {
        DataSet t = testData.next();
        INDArray dataFeatures = t.getFeatureMatrix();
        INDArray dataLabels = t.getLabels();
        INDArray inMask = t.getFeaturesMaskArray();
        INDArray outMask = t.getLabelsMaskArray();
        INDArray predicted = net.output(dataFeatures, false,
            inMask, outMask);
        evaluation.evalTimeSeries(dataLabels, predicted, outMask);
    }
    testData.reset();
    out.println(evaluation.stats());
}
The output from the final iteration is shown next. Our examples classified as 0
are considered negative reviews and the ones classified as 1
are considered positive reviews:
Epoch 4 complete. Starting evaluation:
Examples labeled as 0 classified by model as 0: 11122 times
Examples labeled as 0 classified by model as 1: 1378 times
Examples labeled as 1 classified by model as 0: 3193 times
Examples labeled as 1 classified by model as 1: 9307 times
==========================Scores===================================
Accuracy: 0.8172
Precision: 0.824
Recall: 0.8172
F1 Score: 0.8206
===================================================================
Compared with previous iterations, you should notice the score and accuracy improving with each evaluation. With each epoch, our model improves its accuracy in classifying movie reviews as either negative or positive.
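Once training is complete, the network can also classify a review it has not seen before. The following hedged sketch assumes that the SentimentExampleIterator class adapted from the DL4J example exposes a loadFeaturesFromString helper that converts a review into a time series of Word2Vec feature vectors; if your version of the class does not provide it, you would build that feature matrix from wordVectors yourself:
String review = "This movie was a complete waste of time.";
SentimentExampleIterator iterator = new SentimentExampleIterator(
    EXTRACT_DATA_PATH, wordVectors, batchSize,
    truncateReviewsToLength, false);
// Convert the review's words into a time series of Word2Vec vectors
// (loadFeaturesFromString is assumed from the adapted example class)
INDArray features = iterator.loadFeaturesFromString(
    review, truncateReviewsToLength);
INDArray output = net.output(features);
// The final time step holds the prediction: column 0 is the negative
// probability and column 1 the positive probability
int lastStep = (int) output.size(2) - 1;
out.println("p(negative): " + output.getDouble(0, 0, lastStep));
out.println("p(positive): " + output.getDouble(0, 1, lastStep));
For a strongly negative review such as this one, we would expect the probability in column 0 to be close to 1.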
Summary
In this chapter, we introduced a number of NLP tasks and showed how they are supported. In particular, we used OpenNLP and DL4J to illustrate how they are performed. While there are a number of other libraries available, these examples provide a good introduction to the techniques.
We started with an introduction to basic NLP terms and concepts such as named entity recognition, POS, and relationships between elements of a sentence. Named entity recognition is concerned with finding and labeling the parts of a sentence such as people, locations, and things. POS associates labels with elements of a sentence. For example, NN
refers to a noun and VB
to a verb.
We then included a discussion of the Word2Vec and Doc2Vec neural networks. These were used to classify text, both with labels and by similarity with other words. We demonstrated the use of DL4J resources to create feature vectors for document association with labels.
While the identification of these associations is interesting, a more useful analysis is performed when relationships are extracted from a sentence. We demonstrated how relationships are found using OpenNLP. The POS are associated with each word and the relationships between the words are shown using a set of tags and parentheses. This type of analysis can be used for more sophisticated analyses such as language translation and grammar checking.
Finally, we discussed and showed examples of sentiment analysis. This process involves classifying text based on its tone or contextual meaning. We examined a process for classifying movie reviews as positive or negative.
In this chapter, we demonstrated various techniques for text analysis and classification. In our next chapter, we will examine techniques designed for video and audio analysis.