Implementing data parallel algorithms
Several R packages allow code to be executed in parallel. The parallel package that comes with R provides the foundation for most parallel computing capabilities in other packages. Let's see how it works with an example.
This example involves finding documents that match a regular expression. Regular expression matching can be computationally expensive, depending on the complexity of the regular expression. The corpus, or set of documents, for this example is a sample of the Reuters-21578 dataset for the topic corporate acquisitions (acq) from the tm package. Because this dataset contains only 50 documents, they are replicated 100,000 times to form a corpus of 5 million documents so that parallelizing the code will lead to meaningful savings in execution time.
library(tm)
data("acq")
textdata <- rep(sapply(content(acq), content), 1e5)
The task is to find documents that match the regular expression \d+(,\d+)? mln dlrs, which represents monetary...
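To make the matching task concrete, here is a minimal sketch on a tiny stand-in corpus. The four sentences in docs are invented for illustration and are not taken from the Reuters data, and the names docs, pattern, matches_serial, and matches_par are hypothetical. The sketch assumes a two-worker socket cluster created with makeCluster() from the parallel package; it is one possible way to spread the work, not necessarily the approach taken in the rest of this example.

```r
library(parallel)  # ships with base R

# Toy stand-in for the corpus (made-up sentences, not Reuters documents)
docs <- c(
  "The company paid 12 mln dlrs for the unit.",
  "Shares rose sharply on the news.",
  "Total consideration was 1,250 mln dlrs in cash.",
  "No financial terms were disclosed."
)

# The pattern from the text; backslashes are doubled inside an R string literal
pattern <- "\\d+(,\\d+)? mln dlrs"

# Serial baseline: grepl() is vectorized over the documents
matches_serial <- grepl(pattern, docs)

# Data-parallel version: parLapply() divides the documents among the workers,
# each worker applies the same matching function to its share, and the
# results come back in the original order
cl <- makeCluster(2)  # assumption: two local worker processes
matches_par <- tryCatch(
  unlist(parLapply(cl, docs, function(d, pat) grepl(pat, d), pat = pattern)),
  finally = stopCluster(cl)  # always release the workers
)

print(matches_serial)
print(identical(matches_par, matches_serial))
```

On a corpus this small, the cost of starting the workers and sending them data outweighs any gain; the replication to 5 million documents described above is what makes parallel execution worthwhile.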