Summary
In this chapter, we learned how to detect and handle missing values in PySpark DataFrames. We looked at correlation and at a metric that quantifies it, the Pearson correlation coefficient. We then computed Pearson correlation coefficients for different pairs of numerical variables and learned how to compute the correlation matrix for all the variables in a PySpark DataFrame.
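As a compact recap of these two ideas, the following is a minimal plain-Python sketch, not the PySpark code used in the chapter. The helper names `count_missing` and `pearson` are hypothetical, and dropping any pair that contains a missing value is an assumption made here for illustration. It may not match the handling strategy you chose for your own data.

```python
import math


def count_missing(values):
    """Count the entries that are None (treated here as missing)."""
    return sum(1 for v in values if v is None)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumption: pairs in which either value is None are dropped
    before the coefficient is computed.
    """
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")

    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n

    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)

    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")

    return cov / math.sqrt(var_x * var_y)
```

The coefficient always lies between -1 and 1. Values near either end indicate a strong linear relationship, and values near 0 indicate a weak one.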
In the next chapter, we will learn what problem definition is and how to generate KPIs. We will also use the data aggregation and data merge operations covered in previous chapters and analyze data using graphs.