You're reading from Journey to Become a Google Cloud Machine Learning Engineer Build the mind and hand of a Google Certified ML professional

Product type Paperback

Published in Sep 2022

Publisher Packt

ISBN-13 9781803233727

Length 330 pages

Edition 1st Edition

Languages

Python

Tools

BigQuery

Concepts

Machine Learning

Author (1):

Dr. Logan Song

GCP big data and analytics services

Unlike the storage and database services, the big data and analytics services focus on the big data processing pipeline. From data ingestion, storage, and processing through to visualization, they help you create a complete cloud-based big data infrastructure:

Figure 1.6 – GCP big data and analytics services

As shown in the preceding diagram, the GCP big data and analytics services include Cloud Dataproc, Cloud Dataflow, BigQuery, and Cloud Pub/Sub.

Let’s examine each of them briefly.

Google Cloud Dataproc

Based on the MapReduce concept and the architecture of Hadoop systems, Google Cloud Dataproc is a managed GCP service for processing large datasets. It gives organizations the flexibility to provision and configure data processing clusters of varying sizes on demand. Dataproc can operate directly on Cloud Storage files or use Bigtable to analyze data, and it integrates with Vertex AI, BigQuery, Dataplex, and other GCP services.

Dataproc helps users process, transform, and understand vast quantities of data. You can use Dataproc to run Apache Spark, Apache Flink, Presto, and more than 30 other open source tools and frameworks. You can also use it for data lake modernization, ETL processes, and more.
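To make the MapReduce idea behind Dataproc more concrete, the following is a minimal plain-Python sketch of a word count split into map, shuffle, and reduce phases. It runs on a single machine and does not use Dataproc, Hadoop, or Spark; the function names and the sample input are assumptions chosen purely for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines):
    """Run the three phases in order over an iterable of text lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return reduce_phase(shuffle_phase(mapped))


# Hypothetical sample input; a real cluster would read from files instead.
sample_lines = ["big data on GCP", "big clusters process big data"]
counts = word_count(sample_lines)
print(counts)
```

On a real cluster, the map and reduce phases would run in parallel on many worker nodes, and the shuffle would move data between them over the network. That distribution is the part a managed service such as Dataproc takes care of.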

Google Cloud Dataflow

Cloud Dataflow is a managed, serverless GCP service for developing and executing a wide variety of data processing patterns, including extract, transform, load (ETL), batch, and streaming jobs. It runs jobs written with the Apache Beam libraries. Each job consists of a pipeline: a sequence of steps that reads data, transforms it into different formats, and writes it out. The steps are connected by pipes, with data moving from one component to the next through a pipe. When a job is executed on Cloud Dataflow, the service spins up a cluster of VMs, distributes the job's tasks across the VMs, and dynamically scales the cluster based on the job load and performance.
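The read, transform, and write structure of a pipeline can be sketched in plain Python with generators, where each step consumes the output of the previous one. This is not the Apache Beam API and it does not run on Cloud Dataflow; the step names, the record layout, and the threshold are assumptions made up for this illustration.

```python
import json


def read_records(raw_lines):
    """Read step: parse each 'user,amount' line into a dict."""
    for line in raw_lines:
        user, amount = line.strip().split(",")
        yield {"user": user, "amount": float(amount)}


def keep_large_orders(records, threshold):
    """Transform step: drop records whose amount is below the threshold."""
    for record in records:
        if record["amount"] >= threshold:
            yield record


def to_json(records):
    """Transform step: convert each record into a JSON string."""
    for record in records:
        yield json.dumps(record, sort_keys=True)


def write_records(json_lines, sink):
    """Write step: append every output line to the sink list."""
    for line in json_lines:
        sink.append(line)
    return sink


# Hypothetical input; a real pipeline would read from a file or a stream.
raw_input = ["alice,120.0", "bob,15.5", "carol,300.0"]

# Chain the steps so that data flows from one component to the next.
filtered = keep_large_orders(read_records(raw_input), threshold=100.0)
output = write_records(to_json(filtered), sink=[])
print(output)
```

Because every step is a generator, records flow through the chain one at a time instead of being fully materialized between steps. This is loosely similar to how a pipeline can serve both batch and streaming inputs.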

Google Cloud BigQuery

BigQuery is Google's fully managed enterprise data warehouse service. It is highly scalable, fast, and optimized for data analytics, and it has the following features:

BigQuery supports ANSI-standard SQL queries, including joins, nested and repeated fields, analytic and aggregation functions, scripting, and a variety of spatial functions via geospatial analytics.
With BigQuery, you do not physically manage the infrastructure assets. BigQuery’s serverless architecture lets you use SQL queries to answer big business questions with zero infrastructure overhead. With BigQuery’s scalable, distributed analysis engine, you can query petabytes of data in minutes.
BigQuery integrates seamlessly with other GCP data services. You can query data stored in BigQuery, or use external tables or federated queries to query data where it lives, in sources such as GCS, Bigtable, Spanner, or Google Sheets stored in Google Drive.
BigQuery helps you manage and analyze your data with built-in features such as ML, geospatial analysis, and business intelligence. We will discuss BigQuery ML later in this book.

Google BigQuery is used in many business cases because it is SQL-friendly, has a serverless architecture, and offers built-in integration with other GCP services.
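As a rough illustration of the kind of SQL mentioned above (a join combined with aggregation functions), the following sketch runs a query against an in-memory SQLite database. SQLite is only a local stand-in so that the example can run without a GCP project; it is not BigQuery, and the table names, column names, and data are all hypothetical. In BigQuery, a query of the same style would reference tables in a dataset instead.

```python
import sqlite3

# In-memory database used purely as a local stand-in for a data warehouse.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'EMEA'), (2, 'APAC'), (3, 'EMEA');
    INSERT INTO orders VALUES (1, 100.0), (2, 250.0), (3, 50.0), (2, 25.0);
    """
)

# Join orders to customers, then aggregate the order amounts per region.
query = """
    SELECT c.region,
           COUNT(*) AS order_count,
           SUM(o.amount) AS total_amount
    FROM orders AS o
    JOIN customers AS c ON o.customer_id = c.id
    GROUP BY c.region
    ORDER BY total_amount DESC
"""
rows = conn.execute(query).fetchall()
conn.close()

for region, order_count, total_amount in rows:
    print(region, order_count, total_amount)
```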

Google Cloud Pub/Sub

Cloud Pub/Sub is a widely used GCP service for decoupling other GCP services: it implements an event/message queue that integrates services and parallelizes tasks. With the Pub/Sub service, you can create event producers, called publishers, and event consumers, called subscribers. Publishers communicate with subscribers asynchronously by broadcasting events. A publisher can have multiple subscribers, and a subscriber can subscribe to multiple publishers:

Figure 1.7 – Google Cloud Pub/Sub services

The preceding diagram shows the example we discussed in the GCP Cloud Functions section: after an object is uploaded to a GCS bucket, a message can be generated and sent to Cloud Pub/Sub, which can trigger an email notification and a cloud function to process the object. When a very large number of objects are uploaded in parallel, Cloud Pub/Sub buffers (queues) the messages and decouples the GCS service from other cloud services such as Cloud Functions.
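The buffering and fan-out behavior described above can be sketched with a toy in-memory broker. This is not the Cloud Pub/Sub client library, and it leaves out acknowledgements, retries, and persistence; the class name, the topic name, and the two handlers (standing in for the email notification and the cloud function) are assumptions made for illustration.

```python
from collections import defaultdict, deque


class InMemoryBroker:
    """Toy broker: each topic fans messages out to per-subscriber queues."""

    def __init__(self):
        # topic name -> list of (handler, pending-message queue) pairs
        self._subscriptions = defaultdict(list)

    def subscribe(self, topic, handler):
        """Register a handler with its own queue for the given topic."""
        self._subscriptions[topic].append((handler, deque()))

    def publish(self, topic, message):
        """Queue the message for every subscriber and return immediately."""
        for _handler, pending in self._subscriptions[topic]:
            pending.append(message)

    def deliver_pending(self):
        """Drain every queue, calling each handler with its messages."""
        for subscriptions in self._subscriptions.values():
            for handler, pending in subscriptions:
                while pending:
                    handler(pending.popleft())


emails_sent = []
objects_processed = []

broker = InMemoryBroker()
# Two hypothetical subscribers on the same topic.
broker.subscribe("gcs-uploads", lambda msg: emails_sent.append(f"notify: {msg}"))
broker.subscribe("gcs-uploads", lambda msg: objects_processed.append(msg.upper()))

# The publisher only enqueues messages; no subscriber has run yet.
for object_name in ["a.csv", "b.csv", "c.csv"]:
    broker.publish("gcs-uploads", object_name)
handled_before_delivery = len(emails_sent) + len(objects_processed)

# Delivery happens later, independently of the publisher.
broker.deliver_pending()
print(emails_sent, objects_processed)
```

The publisher never calls the subscribers directly, so a burst of uploads only makes the queues longer. This is the decoupling effect the paragraph describes.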

So far, we have covered various GCP services, including compute, storage, databases, and data analytics (big data). Now, let’s take a look at various GCP artificial intelligence (AI) services.