Cleaning and preparing your data
Data processing has a long history and has seen many innovations, yet the basic problem remains: when you open data in Excel or any other data processing tool, you often find issues that need to be fixed. Data issues are extremely common, even with robust data practices. We will now go through several fundamental techniques for cleansing and wrangling your data with Apache Spark.
Duplicate values
Here, we set up our example DataFrame:
data_frame = spark.createDataFrame(data = [("Brian", "Engr", 1),
    ("Nechama", "Engr", 2),
    ("Naava", "Engr", 3),
    ("Miri", "Engr", 4),
    ("Brian", "Engr", 1),
    ("Miri", "Engr", 3),
  ], schema = ["name", "div", "ID"])
The first method...