How joins work in Spark
In this recipe, you will learn how the Spark optimizer executes query joins using different join strategies, such as SortMerge and BroadcastHash joins. You will learn how to identify which algorithm has been used by looking at the DAG that Spark generates, and how to use query hints to influence the optimizer to choose a specific join algorithm.
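As a quick orientation before the recipe itself, the following is a minimal PySpark sketch (not taken from the 3-5.Joins notebook) that contrasts the two strategies on small synthetic DataFrames; the DataFrame and column names are illustrative assumptions, and explain() is the standard way to print the physical plan:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()

# Disable automatic broadcasting so the default plan falls back to SortMergeJoin.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

# Illustrative DataFrames (the recipe uses the Customer and Orders CSV data instead).
customers = spark.range(1000).withColumnRenamed("id", "customerId")
orders = spark.range(100000).withColumnRenamed("id", "customerId")

# Default equi-join: the physical plan shows a SortMergeJoin node.
orders.join(customers, "customerId").explain()

# The broadcast hint asks the optimizer to use BroadcastHashJoin for the smaller side.
orders.join(broadcast(customers), "customerId").explain()
```

In the explain() output, look for SortMergeJoin versus BroadcastHashJoin operators; the same operators appear in the DAG shown on the SQL tab of the Spark UI.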
Getting ready
To follow along with this recipe, run the cells in the 3-5.Joins notebook, which you can find in the Chapter03 folder of your local cloned repository (https://github.com/PacktPublishing/Azure-Databricks-Cookbook/tree/main/Chapter03).
Upload the csvFiles folders, which can be found in the Common/Customer and Common/Orders folders of your local cloned repository, to the rawdata filesystem in the ADLS Gen-2 account. You will need to create two folders called Customer and Orders in the rawdata filesystem:
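One possible way to create those folders and to verify the upload from a notebook is sketched below. It assumes the rawdata filesystem is already mounted at /mnt/rawdata and that the CSV files have header rows; both the mount point and the exact csvFiles paths are assumptions of this sketch, not steps specified in the recipe:

```python
# Assumption: the ADLS Gen-2 rawdata filesystem is mounted at /mnt/rawdata.
# dbutils and spark are the objects that Databricks provides in every notebook.
dbutils.fs.mkdirs("/mnt/rawdata/Customer")
dbutils.fs.mkdirs("/mnt/rawdata/Orders")

# After uploading the csvFiles folders, confirm that Spark can read them.
customer_df = (spark.read
               .option("header", "true")
               .option("inferSchema", "true")
               .csv("/mnt/rawdata/Customer/csvFiles"))

orders_df = (spark.read
             .option("header", "true")
             .option("inferSchema", "true")
             .csv("/mnt/rawdata/Orders/csvFiles"))

print(customer_df.count(), orders_df.count())
```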