Building an ephemeral cluster using Dataproc and Cloud Composer
Another option for managing ephemeral clusters is Cloud Composer. We used Airflow in the previous chapter to orchestrate BigQuery data loading, and as we've already learned, Airflow has many operators, one of which is, of course, Dataproc.
You should prefer this approach over a workflow template when your jobs are complex, for example, when a pipeline contains many branches, backfilling logic, or dependencies on other services, since workflow templates can't handle these complexities.
In this section, we will use Airflow to create a Dataproc cluster, submit a PySpark job, and delete the cluster when the job finishes.
Check the full code in the GitHub repository:
Link to be updated
To use the Dataproc operators in Airflow, we first need to import them, like this:
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitJobOperator,
)
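To illustrate how these operators fit together, here is a minimal sketch of the DAG described in this section: create the cluster, submit the PySpark job, and delete the cluster at the end. The project ID, region, cluster name, machine types, and the GCS path to the PySpark file are placeholder values you should replace with your own; the exact operator parameters may also vary slightly depending on your google provider version.

from datetime import datetime

from airflow import models
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitJobOperator,
)
from airflow.utils.trigger_rule import TriggerRule

# Placeholder values; replace these with your own project, region, and bucket.
PROJECT_ID = "your-project-id"
REGION = "us-central1"
CLUSTER_NAME = "ephemeral-spark-cluster"

CLUSTER_CONFIG = {
    "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-2"},
    "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-2"},
}

PYSPARK_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {"main_python_file_uri": "gs://your-bucket/pyspark_job.py"},
}

with models.DAG(
    "dataproc_ephemeral_cluster",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Create the ephemeral cluster at the start of the DAG run.
    create_cluster = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        cluster_config=CLUSTER_CONFIG,
    )

    # Submit the PySpark job to the newly created cluster.
    submit_pyspark = DataprocSubmitJobOperator(
        task_id="submit_pyspark_job",
        project_id=PROJECT_ID,
        region=REGION,
        job=PYSPARK_JOB,
    )

    # Delete the cluster even if the job fails, so no idle VMs are left running.
    delete_cluster = DataprocDeleteClusterOperator(
        task_id="delete_cluster",
        project_id=PROJECT_ID,
        region=REGION,
        cluster_name=CLUSTER_NAME,
        trigger_rule=TriggerRule.ALL_DONE,
    )

    create_cluster >> submit_pyspark >> delete_cluster

Note that the delete task uses the ALL_DONE trigger rule, so the cluster is removed whether the PySpark job succeeds or fails; this is what makes the cluster truly ephemeral.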