Spark text file analysis
In this example, we will look through a news article to extract some basic information from it.
We will run the following script against the 2600raid news article (from http://newsitem.com/):
import pyspark

if 'sc' not in globals():
    sc = pyspark.SparkContext()

# Read the article, rejoin each partition's lines into a single string,
# and split the text into sentences on periods
sentences = sc.textFile('2600raid.txt') \
    .glom() \
    .map(lambda x: " ".join(x)) \
    .flatMap(lambda x: x.split("."))
print(sentences.count(), "sentences")

# Split each sentence into words and emit every adjacent word pair
# as a ((word1, word2), 1) tuple
bigrams = sentences.map(lambda x: x.split()) \
    .flatMap(lambda x: [((x[i], x[i+1]), 1) for i in range(0, len(x)-1)])
print(bigrams.count(), "bigrams")

# Sum the counts per bigram, swap each record to (count, bigram),
# and sort in descending order of frequency
frequent_bigrams = bigrams.reduceByKey(lambda x, y: x+y) \
    .map(lambda x: (x[1], x[0])) \
    .sortByKey(False)
frequent_bigrams.take(10)
The code reads in the article and splits it into sentences, treating a period as the sentence boundary. From there, it maps out the bigrams present; a bigram is a pair of words that appear next to each other. We then total the occurrences of each bigram, sort the list in descending order of frequency, and print out the ten most common bigrams.
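To make the bigram step concrete without needing a Spark cluster, here is a minimal plain-Python sketch of the same pair-building and counting logic; the sample sentence is invented for illustration and is not taken from the article:

from collections import Counter

# Invented sample sentence standing in for one sentence from the article
sentence = "the store manager called the store security"
words = sentence.split()

# Same pairing logic as the flatMap above: every adjacent pair of words
bigrams = [(words[i], words[i + 1]) for i in range(len(words) - 1)]

# Counter sums the occurrences per pair, mirroring reduceByKey(lambda x, y: x + y)
counts = Counter(bigrams)

# most_common() mirrors the swap-and-sortByKey(False) step
for pair, n in counts.most_common(3):
    print(n, pair)

Here ('the', 'store') appears twice, so it sorts ahead of the pairs that appear only once, just as the most frequent bigrams come first in the Spark output.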