Applying basic transformations to data with Apache Spark
In this recipe, we will discuss the basics of Apache Spark. We will use Python as our primary programming language and the PySpark API to perform basic transformations on a dataset of Nobel Prize winners.
How to do it...
- Import the libraries: Import the required libraries and create a `SparkSession` object:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import transform, col, concat, lit

# Create (or reuse) a SparkSession attached to the standalone cluster master
spark = (SparkSession.builder
    .appName("basic-transformations")
    .master("spark://spark-master:7077")
    .config("spark.executor.memory", "512m")
    .getOrCreate())

# Show only error-level log messages to keep the console output readable
spark.sparkContext.setLogLevel("ERROR")
```
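The `SparkSession` is the entry point for all DataFrame operations in the steps that follow. As a quick sanity check (not part of the original recipe), you can confirm the session is up by printing the Spark version it is running on:

```python
# Optional sanity check: confirms the session was created successfully
print(spark.version)
```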
- Read the file: Read the `nobel_prizes.json` file using the `read` method of `SparkSession`:

```python
# multiLine is required because each JSON record spans multiple lines;
# the file path is assumed to be relative to the working directory
df = (spark.read.format("json")
    .option("multiLine", "true")
    .load("nobel_prizes.json"))
```
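The excerpt ends here, but the functions imported in step 1 (`transform`, `col`, `concat`, and `lit`) hint at the kind of transformation applied next. The following is only a sketch of a basic transformation on the `df` DataFrame from the previous step, assuming each record carries a `laureates` array of structs with `firstname` and `surname` fields, as in the public Nobel Prize JSON; those column names are an assumption, not confirmed by the excerpt:

```python
from pyspark.sql.functions import transform, col, concat, lit

# Sketch only: assumes a `laureates` array column of structs with
# `firstname` and `surname` fields (requires Spark 3.1+ for transform)
df_named = df.withColumn(
    "laureate_names",
    transform(
        col("laureates"),
        lambda person: concat(person["firstname"], lit(" "), person["surname"])
    )
)
df_named.select("year", "category", "laureate_names").show(5, truncate=False)
```

Here, `transform` applies the lambda to every element of the `laureates` array in place, so the DataFrame keeps one row per prize rather than being exploded into one row per laureate.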