Exercise – Creating and running jobs on a Dataproc cluster
In this exercise, we will try three different ways of submitting and running a Dataproc job: on a permanent Dataproc cluster, on an ephemeral cluster, and on Dataproc Serverless.
In the previous exercise, we used the Spark shell to run our Spark code interactively, which is common when practicing but rare in real development. Usually, the Spark shell is only used for quick checks or simple experiments. In this exercise, we will write our Spark code in an editor and submit it to Dataproc as a job.
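To give a feel for what such a job file looks like, here is a minimal PySpark skeleton. The bucket and path names in it are placeholders for illustration, not the actual data we will use in this exercise:

```python
# etl_job.py -- a minimal PySpark job skeleton (illustrative paths only).
from pyspark.sql import SparkSession

if __name__ == "__main__":
    # Unlike the Spark shell, a job file has to create its own SparkSession.
    spark = SparkSession.builder.appName("example-etl-job").getOrCreate()

    # Extract: read raw text logs from a hypothetical GCS location.
    logs = spark.read.text("gs://my-bucket/data/logs/*.txt")

    # Transform: a trivial step that keeps only non-empty lines.
    cleaned = logs.filter(logs.value != "")

    # Load: write the result to a hypothetical output location.
    cleaned.write.mode("overwrite").text("gs://my-bucket/output/cleaned_logs")

    spark.stop()
```

Once saved as a .py file, a job like this can be submitted to a running cluster with a command along the lines of `gcloud dataproc jobs submit pyspark etl_job.py --cluster=<cluster-name> --region=<region>`; the ephemeral cluster and Dataproc Serverless options use different commands, which we will see as we go through the exercise.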
Here are the scenarios that we want to try:
- Preparing log data in GCS and HDFS
- Developing a Spark ETL job from HDFS to HDFS
- Developing a Spark ETL job from GCS to GCS
- Developing a Spark ETL job from GCS to BigQuery
Let’s look at each of these scenarios in detail.
Preparing log data in GCS and HDFS
The log data is in our GitHub repository, located here: https://github.com/PacktPublishing...
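One possible way to stage this data, sketched below under assumed names, is to upload it to a GCS bucket with the Cloud Storage client library and then copy it into HDFS by calling the hdfs command-line tool. The bucket, directory, and file names are hypothetical, the log file is assumed to have already been downloaded from the repository, and the HDFS part assumes the script runs on the Dataproc master node where the hdfs CLI is available:

```python
# stage_logs.py -- illustrative staging of a local log file into GCS and HDFS.
# All bucket, directory, and file names below are placeholders.
import subprocess
from google.cloud import storage

LOCAL_FILE = "logs_example.log"           # log file downloaded from the repository
BUCKET_NAME = "my-dataproc-bucket"        # hypothetical GCS bucket
GCS_OBJECT = "from-git/logs_example.log"  # destination object name in the bucket
HDFS_DIR = "/data/logs"                   # destination directory in HDFS

# Upload the file to GCS using the Cloud Storage client library.
client = storage.Client()
client.bucket(BUCKET_NAME).blob(GCS_OBJECT).upload_from_filename(LOCAL_FILE)

# Copy the same file into HDFS; this assumes the script runs on the Dataproc
# master node, where the hdfs command-line tool is available.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", HDFS_DIR], check=True)
subprocess.run(["hdfs", "dfs", "-put", "-f", LOCAL_FILE, HDFS_DIR], check=True)
```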