First Spark script
Our first script reads in a text file and computes the total of its line lengths, as shown next. Note that we are reading in the Notebook file we are running; the Notebook is named Spark File Lengths, and is stored in the Spark File Lengths.ipynb file:
import pyspark

# Initialize Spark only if a context does not already exist
if not 'sc' in globals():
    sc = pyspark.SparkContext()

lines = sc.textFile("Spark File Lengths.ipynb")
lineLengths = lines.map(lambda s: len(s))
totalLengths = lineLengths.reduce(lambda a, b: a + b)
print(totalLengths)
In the script, we first initialize Spark, but only if we have not done so already. Spark will complain if you try to initialize it more than once, so all Spark scripts should be prefixed with this if statement.
The script reads in a text file (the source of this script), computes the length of every line, and then adds all the lengths together.
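Before running this under Spark, it can help to see the same map and reduce steps in plain Python. The following sketch uses a couple of hard-coded sample lines in place of the file contents:

```python
from functools import reduce

# Plain-Python sketch of the same pipeline, without Spark.
# These sample lines stand in for the contents of the Notebook file.
lines = ["import pyspark", "print(totalLengths)"]

# map step: compute the length of every line
lineLengths = [len(s) for s in lines]

# reduce step: add all the lengths together
totalLengths = reduce(lambda a, b: a + b, lineLengths)
print(totalLengths)  # prints 33 (14 + 19)
```

Spark's map and reduce behave the same way, except that the data is partitioned across the cluster and nothing actually executes until the reduce step forces evaluation.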
A lambda function is an anonymous (unnamed) function that takes arguments and returns a value. In the first case, given a line of text s, the lambda returns the length of that line, len(s).