Google Cloud: Creating a Dataproc Cluster and Running a PySpark Job


Description

This hands-on lab shows how to use Google Cloud Storage (GCS) as the primary input and output location for Dataproc cluster jobs. Using GCS instead of the Hadoop Distributed File System (HDFS) lets us treat clusters as ephemeral: we can delete clusters that are no longer in use while still preserving our data.

In this lab, you will create a single-node Dataproc cluster and a GCS bucket for your PySpark job output. Because storage is kept separate from compute, we can delete the cluster when we are done and still keep the results.

Prepare Our Environment

  1. First, enable the Dataproc API:
gcloud services enable dataproc.googleapis.com
  2. Then create a Cloud Storage bucket:
gsutil mb -l us-central1 gs://$DEVSHELL_PROJECT_ID-data
  3. Now create the Dataproc cluster:
gcloud dataproc clusters create wordcount --region=us-central1 --zone=us-central1-f --single-node --master-machine-type=n1-standard-2
  4. Validate that the Dataproc cluster has been created:

In the web console, go to BIGDATA > Dataproc > Clusters. You should see the wordcount cluster up and running.

  5. Finally, download the wordcount.py file that will be used for the PySpark job:
gsutil cp -r gs://acg-gcp-labs-resources/data-engineer/dataproc/* .

wordcount.py
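
The lab's copy of wordcount.py is not reproduced here. As a rough guide, a minimal PySpark word count that takes an input URI and an output URI might look like the sketch below. The helper name split_words, the tokenizing regex, and the application name are assumptions made for illustration, not the lab's actual code.

```python
import re
import sys


def split_words(line):
    """Lowercase a line and return the words found in it (hypothetical helper)."""
    return re.findall(r"[a-z']+", line.lower())


def main(input_uri, output_uri):
    # Imported lazily so split_words can be tried without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext(appName="wordcount-sketch")  # app name is an assumption
    counts = (
        sc.textFile(input_uri)            # read lines, e.g. from a gs:// URI
        .flatMap(split_words)             # one record per word
        .map(lambda word: (word, 1))      # pair each word with a count of 1
        .reduceByKey(lambda a, b: a + b)  # sum the counts per word
    )
    counts.saveAsTextFile(output_uri)     # writes part-* files under output_uri
    sc.stop()


if __name__ == "__main__" and len(sys.argv) == 3:
    # Expects two arguments: the input URI and the output URI.
    main(sys.argv[1], sys.argv[2])
```

On a Dataproc cluster the gs:// URIs can be read and written directly, which is what lets the job use Cloud Storage instead of HDFS.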

Submit the PySpark Job to the Dataproc Cluster

In Cloud Shell, type:

gcloud dataproc jobs submit pyspark wordcount.py --cluster=wordcount --region=us-central1 -- \
gs://acg-gcp-labs-resources/data-engineer/dataproc/romeoandjuliet.txt \
gs://$DEVSHELL_PROJECT_ID-data/output/

Review the PySpark Output

  1. In Cloud Shell, download output files from the GCS output location:
gsutil cp -r gs://$DEVSHELL_PROJECT_ID-data/output/* .
  • Note: Alternatively, we could download them to our local machine via the web console.
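
Spark typically writes its output as several part-* files, so it can help to merge them before reading the results. The sketch below assumes each line in those files is a Python-style tuple such as ('romeo', 5); the real format depends on how wordcount.py writes its output, so adjust the parsing if yours differs. The function name load_counts is hypothetical.

```python
import ast
import glob
from collections import Counter


def load_counts(pattern="part-*"):
    """Merge (word, count) lines from every file matching the pattern."""
    totals = Counter()
    for path in sorted(glob.glob(pattern)):
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue
                # Assumed line format: ('word', 3)
                word, count = ast.literal_eval(line)
                totals[word] += count
    return totals


if __name__ == "__main__":
    # Print the ten most frequent words found in the current directory.
    for word, count in load_counts().most_common(10):
        print(f"{word}\t{count}")
```

Run it from the directory where the output files were downloaded.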

Delete the Dataproc Cluster

  1. We don’t need our cluster any longer, so let’s delete it. In the web console, go to the top-left menu and into BIGDATA > Dataproc.
  2. Select the wordcount cluster, click DELETE, then click OK to confirm.

Our job output remains in Cloud Storage, so we can delete Dataproc clusters when they are no longer in use to save costs while preserving the input and output data.

Wait until the cluster is deleted.
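
If you would rather not watch the console, one option is to poll the gcloud CLI until the cluster no longer appears in the list. This is a minimal sketch under stated assumptions: the function names, polling interval, and timeout are hypothetical, and it assumes gcloud is installed and authenticated, as it is in Cloud Shell.

```python
import subprocess
import time


def list_clusters(region, run=subprocess.run):
    """Return the names of Dataproc clusters in a region via the gcloud CLI."""
    result = run(
        ["gcloud", "dataproc", "clusters", "list",
         f"--region={region}", "--format=value(clusterName)"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.split()


def wait_until_deleted(name, region, poll_seconds=10, timeout_seconds=600,
                       run=subprocess.run, sleep=time.sleep):
    """Poll until the cluster is gone; return False if the timeout is reached."""
    waited = 0
    while name in list_clusters(region, run=run):
        if waited >= timeout_seconds:
            return False
        sleep(poll_seconds)
        waited += poll_seconds
    return True
```

For this lab the call would be wait_until_deleted("wordcount", "us-central1"). The run and sleep parameters exist only so the function can be exercised without calling gcloud.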

Happy learning!

