Using SQL in Spark
In Chapter 2, we talked about Spark Core and how it’s shared across different components of Spark. DataFrames and Spark SQL can also be used interchangeably. We can also use data stored in DataFrames with Spark SQL queries.
The following code illustrates how we can make use of this feature:
salary_data_with_id.createOrReplaceTempView("SalaryTable") spark.sql("SELECT count(*) from SalaryTable").show()
The resulting DataFrame looks like this:
+--------+ |count(1)| +--------+ |Â Â Â Â Â Â Â 8| +--------+
The createOrReplaceTempView
function is used to convert a DataFrame into a table named SalaryTable
. Once this conversion is made, we can run regular SQL queries on top of this table. We are running a count *
query to count the total number of elements in a table.
In the next section, we will see what a UDF is and how we use that in Spark.