Search icon CANCEL
Subscription
0
Cart icon
Your Cart (0 item)
Close icon
You have no products in your basket yet
Save more on your purchases! discount-offer-chevron-icon
Savings automatically calculated. No voucher code required.
Arrow left icon
Explore Products
Best Sellers
New Releases
Books
Videos
Audiobooks
Learning Hub
Free Learning
Arrow right icon
Data Engineering with Google Cloud Platform
Data Engineering with Google Cloud Platform

Data Engineering with Google Cloud Platform: A practical guide to operationalizing scalable data analytics systems on GCP

eBook
Can$12.99 Can$70.99
Paperback
Can$88.99
Subscription
Free Trial

What do you get with a Packt Subscription?

Free for first 7 days. $19.99 p/m after that. Cancel any time!
Product feature icon Unlimited ad-free access to the largest independent learning library in tech. Access this title and thousands more!
Product feature icon 50+ new titles added per month, including many first-to-market concepts and exclusive early access to books as they are being written.
Product feature icon Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.
Product feature icon Thousands of reference materials covering every tech concept you need to stay up to date.
Subscribe now
View plans & pricing
Table of content icon View table of contents Preview book icon Preview Book

Data Engineering with Google Cloud Platform

Chapter 1: Fundamentals of Data Engineering

Years ago, when I first entered the data science world, I used to think data was clean. Clean in terms of readiness, available in one place, and ready for fun data science purposes. I was so excited to experiment with machine learning models, finding unusual patterns in data and playing around with clean data. But after years of experience working with data, I realized that data science in big organizations isn't straightforward. 

Eighty percent of the effort goes into collecting, cleaning, and transforming the data. If you have had any experience in working with data, I am sure you've noticed something similar. But the good news is, we know that almost all processes can be automated using proper planning, designing, and engineering skills. That was the point where I realized that data engineering will be the most critical role from that day to the future of the data science world. 

To develop a successful data ecosystem in any organization, the most crucial part is how they design the data architecture. If the organization fails to make the best decision on the data architecture, the future process will be painful. Here are some common examples: the system is not scalable, querying data is slow, business users don't trust your data, the infrastructure cost is very high, and data is leaked. There is so much more that can go wrong without proper data engineering practice. 

In this chapter, we are going to learn the fundamental knowledge behind data engineering. The goal is to introduce you to common terms that are often used in this field and will be mentioned often in the later chapters. 

In particular, we will be covering the following topics:

  • Understanding the data life cycle
  • Know the roles of a data engineer before starting
  • Foundational concepts for data engineering

Understanding the data life cycle

The first principle to learn to become a data engineer is understanding the data life cycle. If you've worked with data, you must know that data doesn't stay in one place; it moves from one storage to another, from one database to other databases. Understanding the data life cycle means you need to be able to answer these sorts of questions if you want to display information to your end user

  • Who will consume the data? 
  • What data sources should I use? 
  • Where should I store the data? 
  • When should the data arrive? 
  • Why does the data need to be stored in this place? 
  • How should the data be processed?

To answer all those questions, we'll start by looking back a little bit at the history of data technologies.

Understanding the need for a data warehouse

Data warehouse is not a new concept; I believe you've at least heard of it. In fact, the terminology is no longer appealing. In my experience, no one gets excited when talking about data warehouses in the 2020s. Especially when compared to terminologies such as big data, cloud computing, and artificial intelligence.

So, why do we need to know about data warehouses? The answer to that is because almost every single data engineering challenge from the old times to these days is conceptually the same. The challenges are always about moving data from the data source to other environments so the business can use it to get information. The difference from time to time is only about the how and newer technologies. If we understand why people needed data warehouses in historical times, we will have a better foundation to understand the data engineering space and, more specifically, the data life cycle.

Data warehouses were first developed in the 1980s to transform data from operational systems to decision-making support systems. The key principle of a data warehouse is combining data from many different sources to a single location and then transforming it into a format the data warehouse can process and store. 

For example, in the financial industry, say a bank wants to know how many credit card customers also have mortgages. It is a simple enough question, yet it's not that easy to answer. Why? 

Most traditional banks that I have worked with had different operating systems for each of their products, including a specific system for credit cards and specific systems for mortgages, saving products, websites, customer service, and many other systems. So, in order to answer the question, data from multiple systems needs to be stored in one place first.

See the following diagram on how each department is independent:

Figure 1.1 – Data silos

Figure 1.1 – Data silos

Often, independence not only applies to the organization structure but also to the data. When data is located in different places, it's called data silos. This is very common in large organizations where each department has different goals, responsibilities, and priorities. 

In summary, what we need to understand from the data warehouse concept is the following:

  • Data silos have always occurred in large organizations, even back in the 1980s.
  • Data comes from many operating systems.
  • In order to process the data, we need to store the data in one place.

What does a typical data warehouse stack look like?

This diagram represents the four logical building blocks in a data warehouse, which are Storage, Compute, Schema, and SQL Interface:

Figure 1.2 – Data warehouse main components

Figure 1.2 – Data warehouse main components

Data warehouse products are mostly able to store and process data seamlessly and the user can use the SQL language to access the data in tables with a structured schema format. It is basic knowledge, but an important point to be aware of is that the four logical building blocks in the data warehouse are designed as one monolithic software that evolved over the later years and was the start of the data lake.

Getting familiar with the differences between a data warehouse and a data lake

Fast forward to 2008, when an open source data technology named Hadoop was first published, and people started to use the data lake terminology. If you try to find the definition of data lake on the internet, it will mostly be described as a centralized repository that allows you to store all your structured and unstructured data

So, what is the difference between a data lake and a data warehouse? Both have the same idea to store data in centralized storage. Is it simply that a data lake stores unstructured data and a data warehouse doesn't? 

What if I say some data warehouse products can now store and process unstructured data? Does the data warehouse become a data lake? The answer is no.

One of the key differences from a technical perspective is that data lake technologies separate most of the building blocks, in particular, the storage and computation, but also the other blocks, such as schema, stream, SQL interface, and machine learning. This evolves the concept of a monolithic platform into a modern and modular platform consisting of separated components, as illustrated in the following diagram:

Figure 1.3 – Data warehouse versus data lake components

Figure 1.3 – Data warehouse versus data lake components

For example, in a data warehouse, you insert data by calling SQL statements and query the data through SQL tables, and there is nothing you can do as a user to change that pattern. 

In a data lake, you can access the underlying storage directly, for example, by storing a text file, choosing your own computation engine, and choosing not to have a schema. There are many impacts of this concept, but I'll summarize it into three differences:

Figure 1.4 – Table comparing data lakes and data warehouses

Figure 1.4 – Table comparing data lakes and data warehouses

Large organizations start to store any data in the data lake system for two reasons, high scalability and cheap storage. In modern data architecture, both data lakes and data warehouses complete each other, rather than replacing each other. 

We will dive deeper and carry out some practical examples throughout the book, such as trying to build a sample in Chapter 3, Building a Data Warehouse in BigQuery, and Chapter 5, Building a Data Lake Using Dataproc.

The data life cycle

Based on our understanding of the history of the data warehouse, now we know that data does not stay in one place. As an analogy, data is very similar to water; it flows from upstream to downstream. Later in this book, both of these terms, upstream and downstream, will be used often since they are common terminologies in data engineering. 

When you think about water flowing upstream and downstream, one example that you can think of is a waterfall; the water falls freely without any obstacles.

Another example in different water life cycle circumstances is a water pipeline; upstream is the water reservoir and downstream is your kitchen sink. In this case, you can imagine the different pipes, filters, branches, and knobs in the middle of the process.  

Data is very much like water. There are scenarios where you just need to copy data from one storage to another storage, or in more complex scenarios, you may need to filter, join, and split multiple steps downstream before the data can be consumed by the end users. 

As illustrated in the following diagram, the data life cycle mostly starts from frontend applications, and flows up to the end for data users as information in the dashboard or ad hoc queries:

Figure 1.5 – Data life cycle diagram

Figure 1.5 – Data life cycle diagram

Let's now look at the elements of the data life cycle in detail:

  1. Apps and databases: The application is the interface from the human to the machine. The frontend application in most cases acts as the first data upstream. Data at this level is designed to serve application transactions as fast as possible. 
  2. Data lake: Data from multiple databases needs to be stored in one place. The data lake concept and technologies suit the needs. The data lake stores data in a file format such as a CSV file, Avro, or Parquet. 

The advantage of storing data in a file format is that it can accept any data sources; for example, MySQL Database can export the data to CSV files, image data can be stored as JPEG files, and IoT device data can be stored as JSON files. Another advantage of storing data in a data lake is it doesn't require a schema at this stage.

  1. Data warehouse: When you know any data in a data lake is valuable, you need to start thinking about the following: 
    1. What is the schema? 
    2. How do you query the data? 
    3. What is the best data model for the data? 

Data in a data warehouse is usually modeled based on business requirements. With this, one of the key requirements to build the data warehouse is that you need to know the relevance of the data to your business and the expected information that you want to generate from the data.

  1. Data mart: A data mart is an area for storing data that serves specific user groups. At this stage, you need to start thinking about the final downstream of data and who the end user is. Each data mart is usually under the control of each department within an organization. For example, a data mart for a finance team will consist of finance-related tables, while a data mart for data scientists might consist of tables with machine learning features.
  2. Data end consumer: The last stage of data will be back to humans as information. The end user of data can have various ways to use the data but at a very high level, these are the three most common usages:
    1. Reporting and dashboard
    2. Ad hoc query
    3. Machine learning

Are all data life cycles like this? No. Similar to the analogy of water flowing upstream and downstream, in different circumstances, it will require different data life cycles, and that's where data engineers need to be able to design the data pipeline architecture. But the preceding data life cycle is a very common pattern. In the past 10 years as a data consultant, I have had the opportunity to work with more than 30 companies from many industries, including financial, government, telecommunication, and e-commerce. Most of the companies that I worked with followed this pattern or were at least going in that direction.

As an overall summary of this section, we've learned that since historical times, data is mostly in silos, and it drives the needs of the data warehouse and data lake. The data will move from one system to others as specific needs have specific technologies and, in this section, we've learned about a very common pattern in data engineering. In the next section, let's try to understand the role of a data engineer, who should be responsible for this.

Knowing the roles of a data engineer before starting

In the later chapters, we will spend much of our time doing practical exercises to understand the data engineering concepts. But before that, let's quickly take a look at the data engineer role. 

The job role is getting more and more popular now, but the terminology itself is relatively new compared to other job roles, such as accountant, lawyer, doctor, and many other well-established job roles. The impact is that sometimes there is still a debate of what a data engineer should and shouldn't do. 

For example, if you came to a hospital and met a doctor, you know for sure that the doctor would do the following:

  1. Examine your condition.
  2. Make a diagnosis of your health issues.
  3. Prescribe medicine. 

The doctor wouldn't do the following:

  1. Clean the hospital.
  2. Make the medicine.
  3. Manage hospital administration.

It's clear, and it applies to most well-established job roles. But how about data engineers?

This is just a very short list of examples of what data engineers should or shouldn't be responsible for:

  • Handle all big data infrastructures and software installation.
  • Handle application databases.
  • Design the data warehouse data model.
  • Analyze big data to transform raw data into meaningful information.
  • Create a data pipeline for machine learning.

The unclear condition is unavoidable since it's a new role and I believe it will be more and more established following the maturity of data science. In this section, let's try to understand what a data engineer is and despite many combinations of responsibilities, what you should focus on as a data engineer.

Data engineer versus data scientist

A data engineer is someone who designs and builds data pipelines. 

The definition is that simple, but I found out that the question about the different between a data engineer versus a data scientist is still one of the most frequently asked questions when someone wants to start their data career. The hype of data scientists on the internet is one of the drivers; for example, up until today people still like to quote the following:

"Data scientist: the sexiest job of the 21st Century"

– Harvard Business Review

The data scientist role was originally invented to refer to groups of people who are highly curious and able to utilize big data technologies for business purposes back in 2008. But since the technologies are maturing and becoming more complex, people start to realize that it's too much. It's very rare for a company to hire someone who knows how to do all of the following: 

  • How to handle big data infrastructure
  • Properly design and build ETL pipelines
  • Train machine learning models 
  • Understand deeply about the company's business 

Not that it's impossible, some people do have this knowledge, but from a company's point of view, it's not practical.

These days, for better focus and scalability, the data scientist role can be split into many different roles, for example, data analyst, machine learning engineer, and business analyst. But one of the most popular and realized to be very important roles is data engineer.

The focus of data engineers

Let's map the data engineer role to our data life cycle diagram Figure 1.5 from the previous section. 

In the diagram, I added two underlying components:

  • Job Orchestrator: Design and build a job dependency and scheduler that runs data movement from upstream to downstream.
  • Infrastructure: Provision the required data infrastructure to run the data pipelines.

And on each step, I added numbers from 1 to 3. The numbers will help you to identify which components are the data engineer's main responsibility. This diagram works together with Figure 1.7, a data engineer-focused diagram to map the numbering. First, let's check this data life cycle diagram that we discussed before with the numbering on it:

Figure 1.6 – Data life cycle flows with focus numbering

Figure 1.6 – Data life cycle flows with focus numbering

After seeing the numbering on the data life cycle, check this diagram that illustrates the focus points of a data engineer:

Figure 1.7 – Data engineer-focused diagram

Figure 1.7 – Data engineer-focused diagram

The diagram shows the distribution of the knowledge area from the end-to-end data life cycle. At the center of the diagram (number 3) are the jobs that are the key focus of data engineers, and I will call it the core.  

Those numbered 2 are the good to have area. For example, it's still common in small organizations that data engineers need to build a data mart for business users. 

Important Note

Designing and building a data mart is not as simple as creating tables in a database. Someone who builds a data mart needs to be able to talk to business people and gather requirements to serve tables from a business perspective, which is one of the reasons it's not part of the core.

While how to collect data to a data lake is part of the data engineer's responsibility, exporting data from operational application databases is often done by the application development team, for example, dumping MySQL tables as CSV in staging storage.

Those numbered 1 are the good to know area. For example, it's rare that a data engineer needs to be responsible for building application databases, developing machine learning models, maintaining infrastructure, and creating dashboards. It is possible, but less likely. The discipline needs knowledge that is a little bit too far from the core.

After learning about the three focus areas, now let's retrospect our understanding and vision about data engineers. Study the diagram carefully and answer these questions.

  • What are your current focus areas as an individual?
  • What are your current job's role focus areas (or if you are a student, your study areas)?
  • What is your future goal in the data science world?

Depending on your individual answers, check with the diagram – do you have all the necessary skills at the core? Does your current job give you experience in the core? Are you excited if you could master all subjects at the core in the near future?

From my experience, what is important to data engineers is the core. Even though there are a variety of data engineers' expectations, responsibilities, and job descriptions in the market, if you are new to the role, then the most important thing is to understand what the core of a data engineer is. 

The diagram gives you guidance on what type of data engineers you are or will be. The closer you are to the core, the more of a data engineer you are. You are on the right track and in the right environment to be a good data engineer. 

In scenarios where you are at the core, plus other areas beside it, then you are closer to a full-stack data expert; as long as you have a strong core, if you are able to expand your expertise to the good to have and good to know areas, you will have a good advantage in your data engineering career. But if you focus on other non-core areas, I suggest you find a way to master the core first. 

In this section, we learned about the role of a data engineer. If you are not familiar with the cores, the next section will be your guidance to the fundamental concepts in data engineering. 

Foundational concepts for data engineering

Even though there are many data engineering concepts that we will learn throughout the book by using Google Cloud Platform (GCP), there are some concepts that are basic and you need to know as data engineers. In my experience interviewing in data companies, I found out that these foundational concepts are often asked to test how much you know about data engineering. Take the following examples: 

  • What is Extract-Transform-Load (ETL)?
  • What's the difference between ETL and Extract-Load-Transform (ELT)?
  • What is big data?
  • How do you handle large volumes of data?

These questions are very common, yet very important to deeply understand the concepts since it may affect our decisions on architecting our data life cycles.

ETL concept in data engineering 

ETL is the key foundation of data engineering. All things in the data life cycle are ETL; any part that happens from upstream to downstream is ETL. Let's take a look at the upstream to downstream flows that has an ETL process in between here:

Figure 1.8 – ETL illustration

Figure 1.8 – ETL illustration

ETL consists of three actual steps that you need in order to move your data:

  • What is extract? This is the step to get the data from the upstream system. For example, if the upstream system is an RDBMS, then the extract step will be dumping or exporting data from the RDBMS. 
  • What is transform? This is the step to apply any transformation to the extracted data. For example, the file from the RDBMS needs to be joined with a static CSV file, then the transform step will process the extracted data, load the CSV file, and finally, join both information together in an intermediary system.
  • What is load? This is the step to put the transformed data to the downstream system. For example, if the downstream system is BigQuery, then the load step will call BigQuery load job to store the data into BigQuery's table.

Back in Figure 1.5, Data life cycle diagrams, each of the individual steps may have a different ETL process. For example, at the application database to data lake step, the upstream is the application database and the data lake is the downstream. But at the data lake to data warehouse step, the data lake becomes the upstream and the data warehouse as its downstream. So, you need to think about how you want to do the ETL process in every data life cycle step. 

The difference between ETL and ELT

ETL is extract, transform, load and ELT is extract, load, transform. From the acronym itself, the difference between ETL and ELT is only the ordering of the letters T and L. Should you transform first and then load the data to the downstream or load the data to the downstream first and then transform the data inside the downstream system? 

Figure 1.9 – Extract load transform

Figure 1.9 – Extract load transform

Easy! What's the big deal? 

Even though it's a very simple difference in the acronym, deciding on the method can really affect your choice of technology products, system performance, scalability, and cost. For example, not all downstream systems are powerful enough to transform large volumes of data; in this case, ETL is preferred since using the ELT pattern will introduce issues in your downstream system. 

In other cases, the downstream system is a lot more powerful compared to any intermediary system, so you want to choose the ELT pattern. This mostly happens after the data lake era where the downstream are products such as Hadoop, BigQuery, or other scalable data processing products. But this is not the absolute answer; depending on your available choice of technology, you may change your ETL versus ELT strategy. 

You will understand this better after running through the content of this book with a lot of ETL and ELT examples, but at this point, the important thing to keep in mind is, as a data engineer, you have two options of where to transform your data: in an intermediary system or in the target system.

What is NOT big data?

After learning about ETL and ELT, the other most common terminology is big data. Since big data is still one of the highly correlated concepts close to data engineering, it is important how you interpret the terminology as a data engineer. Note that the word big data itself refers to two different subjects:

  • The data itself is big.
  • The big data technology.

With so much hype in the media about the words, both in the context of data is getting bigger and big data technology, I don't think I need to tell you the definition of the word big data. Instead, I will focus on eliminating the non-relevant definitions of big data for data engineers. Here are some definitions in media or from people that I have met personally:

  • All people already use social media, the data in social media is huge, and we can use the social media data for our organization. That's big data. 
  • My company doesn't have data. Big data is a way to use data from the public internet to start my data journey. That's big data.
  • The five Vs of data: volume, variety, velocity, veracity, and value. That's big data.

All the preceding definitions are correct but not really helpful to us as data engineers. So instead of seeing big data as general use cases, we need to focus on the how questions; think about what actually works under the hood. Take the following examples:

  • How do you store 1 PB of data in storage, while the size of common hard drives is in TBs? 
  • How do you average a list of numbers, when the data is stored in multiple computers? 
  • How can you continuously extract data from the upstream system and do aggregation as a streaming process? 

These kinds of questions are what are important for data engineers. Data engineers need to know when a condition (the data itself is big) should be handled using big data or non-big data technology.

A quick look at how big data technologies store data

Knowing that answering the how question is what is important to understanding big data, the first question we need to answer is how does it actually store the data? What makes it different from non-big data storage?

The word big in big data is relative. For example, say you analyze Twitter data and then download the data as JSON files with a size of 5 GB, and your laptop storage is 1 TB with 16 GB memory.

I don't think that's big data. But if the Twitter data is 5 PB, then it becomes big data because you need a special way to store it and a special way to process it. So, the key is not about whether it is social media data or not, or unstructured or not, which sometimes many people still get confused by. It's more about the size of the data relative to your system.

Big data technology needs to be able to distribute the data in multiple servers. The common terminology for multiple servers working together is a cluster. I'll give an illustration to show you how a very large file can be distributed into multiple chunks of file parts on multiple machines:

Figure 1.10 – Distributed filesystem

Figure 1.10 – Distributed filesystem

In a distributed filesystem, a large file will be split into multiple small parts. In the preceding example, it is split into nine parts, and each file is a small 128 MB file. Then, the multiple file parts are distributed into three machines randomly. On top of the file parts, there will be metadata to store information about how the file parts formed the original file, for example, a large file is a combination of file part 1 located in machine 1, file part 2 located in machine 2, and more.

The distributed parts can be stored in any format that isn't necessarily a file format; for example, it can be in the form of data blocks, byte arrays in memory, or some other data format. But for simplicity, what you need to be aware of is that in a big data system, data can be stored in multiple machines and in order to optimize performance, sometimes you need to think about how you want to distribute the parts. 

After we know data can be split into small parts on different machines, it leads to further questions:

  • How do I process the files?
  • What if I want to aggregate some numbers from the files?
  • How does each part know the records value from other parts while it is stored in different machines?

There are many approaches to answer these three questions. But one of the most famous concepts is MapReduce. 

A quick look at how to process multiple files using MapReduce

Historically speaking, MapReduce is a framework that was published as a white paper by Google and is widely used in the Hadoop ecosystem. There is an actual open source project called MapReduce mainly written in Java that still has a large user base, but slowly people have started to change to other distributed processing engine alternatives, such as Spark, Tez, and Dataflow. But MapReduce as a concept itself is still relevant regardless of the technology. 

In a short summary, the word MapReduce can refer to two definitions: 

  • MapReduce as a technology
  • MapReduce as a concept

What is important for us to understand is MapReduce as a concept. MapReduce is a combination of two words: map and reduce. 

Let's take a look at an example, if you have a file that's divided into two file parts:

Figure 1.11 – File parts

Figure 1.11 – File parts

Each of the parts contains one or more words, which in this example are fruit. The file parts are stored on different machines. So, each machine will have these three file parts:

  • File Part 1 contains two words: Banana and Apple.
  • File Part 2 contains three words: Melon, Apple, and Banana.
  • File Part 3 contains one word: Apple.

How can you write a program to calculate a word count that produces these results?

  • Apple = 3 
  • Banana = 2
  • Melon = 1

Since the file parts are separated in different machines, we cannot just count the words directly. We need MapReduce. Let's take a look at the following diagram, where file parts are mapped, shuffled, and lastly reduced to get the final result:

Figure 1.12 – MapReduce step diagram

Figure 1.12 – MapReduce step diagram

There are four main steps in the diagram:

  1. Map: Add to each individual record a static value of 1. This will transform the word into a key-value pair when the value is always 1.
  2. Shuffle: At this point, we need to move the fruit words between machines. We want to group each word and store it in the same machine for each group.
  3. Reduce: Because each fruit group is already in the same machine, we can count them together. The Reduce step will sum up the static value 1 to produce the count results.
  4. Result: Store the final results back in the single machine. 

The key idea here is to process any possible process in a distributed manner. Looking back at the diagram, you can imagine each box on each step is a different machine. 

Each step, Map, Shuffle, and Reduce, always maintains three parallel boxes. What does this mean? It means that the processes happened in parallel on three machines. This paradigm is different from calculating all processes in a single machine. For example, we can simply download all the file parts into a pandas DataFrame in Python and do a count using the pandas DataFrame. In this case, the process will happen in one machine.

MapReduce is a complex concept. The concept is explained in a 13-page-long document by Google. You can find the document easily on the public internet. In this book, I haven't added much deeper explanation about MapReduce. In most cases, you don't need to really think about it; for example, if in a later chapter you use BigQuery to process 1 PB of data, you will only need to run a SQL query and BigQuery will process it in a distributed manner in the background.

As a matter of fact, all technologies in GCP that we will use in this book are highly scalable and without question able to handle big data out of the box. But understanding the underlying concepts helps you as a data engineer in many ways, for example, choosing the right technologies, designing data pipeline architecture, troubleshooting, and improving performance. 

Summary

As a summary of the first chapter, we've learned the fundamental knowledge we need as data engineers. Here are some key takeaways from this chapter. First, data doesn't stay in one place. Data moves from one place to another, called the data life cycle. We also understand that data in a big organization is mostly in silos, and we can solve these data silos using the concepts of a data warehouse and data lake.

As someone who has started to look into data engineer roles, you may be a little bit lost. The role of data engineers may vary. The key takeaway is not to be confused about the broad expectation in the market. First, you should focus on the core and then expand as you get more and more experience from the core. In this chapter, we've learned what the core for a data engineer is. At the end of the chapter, we learned some of the key concepts. There are three key concepts as a data engineer that you need to be familiar with. These concepts are ETL, big data, and distributed systems

In the next chapter, we will visit GCP, a cloud platform provided by Google that has a lot of services to help us as data engineers. We want to understand its preposition and what the services are that are relevant to big data, and lastly, we will start using the GCP console.

Now let's put the knowledge from this chapter into practice.

Exercise

You are a data engineer at a book publishing company and your product manager has asked you to build a dashboard to show the total revenue and customer satisfaction index in a single dashboard. 

Your company doesn't have any data infrastructure yet, but you know that your company has these three applications that contain TBs of data:

  • The company website
  • A book sales application using MongoDB to store sales transactions, including transactions, book ID, and author ID
  • An author portal application using MySQL Database to store authors' personal information, including age

Do the following:

  1. List important follow-up questions for your manager.
  2. List your technical thinking process of how to do it at a high level. 
  3. Draw a data pipeline architecture.

There is no right or wrong answer to this practice. The important thing is that you can imagine how the data flows from upstream to downstream, how it should be processed at each step, and finally, how you want to serve the information to end users. 

See also

Left arrow icon Right arrow icon
Download code icon Download Code

Key benefits

  • Understand data engineering concepts, the role of a data engineer, and the benefits of using GCP for building your solution
  • Learn how to use the various GCP products to ingest, consume, and transform data and orchestrate pipelines
  • Discover tips to prepare for and pass the Professional Data Engineer exam

Description

With this book, you'll understand how the highly scalable Google Cloud Platform (GCP) enables data engineers to create end-to-end data pipelines right from storing and processing data and workflow orchestration to presenting data through visualization dashboards. Starting with a quick overview of the fundamental concepts of data engineering, you'll learn the various responsibilities of a data engineer and how GCP plays a vital role in fulfilling those responsibilities. As you progress through the chapters, you'll be able to leverage GCP products to build a sample data warehouse using Cloud Storage and BigQuery and a data lake using Dataproc. The book gradually takes you through operations such as data ingestion, data cleansing, transformation, and integrating data with other sources. You'll learn how to design IAM for data governance, deploy ML pipelines with the Vertex AI, leverage pre-built GCP models as a service, and visualize data with Google Data Studio to build compelling reports. Finally, you'll find tips on how to boost your career as a data engineer, take the Professional Data Engineer certification exam, and get ready to become an expert in data engineering with GCP. By the end of this data engineering book, you'll have developed the skills to perform core data engineering tasks and build efficient ETL data pipelines with GCP.

Who is this book for?

This book is for data engineers, data analysts, and anyone looking to design and manage data processing pipelines using GCP. You'll find this book useful if you are preparing to take Google's Professional Data Engineer exam. Beginner-level understanding of data science, the Python programming language, and Linux commands is necessary. A basic understanding of data processing and cloud computing, in general, will help you make the most out of this book.

What you will learn

  • Load data into BigQuery and materialize its output for downstream consumption
  • Build data pipeline orchestration using Cloud Composer
  • Develop Airflow jobs to orchestrate and automate a data warehouse
  • Build a Hadoop data lake, create ephemeral clusters, and run jobs on the Dataproc cluster
  • Leverage Pub/Sub for messaging and ingestion for event-driven systems
  • Use Dataflow to perform ETL on streaming data
  • Unlock the power of your data with Data Studio
  • Calculate the GCP cost estimation for your end-to-end data solutions

Product Details

Country selected
Publication date, Length, Edition, Language, ISBN-13
Publication date : Mar 31, 2022
Length: 440 pages
Edition : 1st
Language : English
ISBN-13 : 9781800561328
Category :
Languages :
Concepts :

What do you get with a Packt Subscription?

Free for first 7 days. $19.99 p/m after that. Cancel any time!
Product feature icon Unlimited ad-free access to the largest independent learning library in tech. Access this title and thousands more!
Product feature icon 50+ new titles added per month, including many first-to-market concepts and exclusive early access to books as they are being written.
Product feature icon Innovative learning tools, including AI book assistants, code context explainers, and text-to-speech.
Product feature icon Thousands of reference materials covering every tech concept you need to stay up to date.
Subscribe now
View plans & pricing

Product Details

Publication date : Mar 31, 2022
Length: 440 pages
Edition : 1st
Language : English
ISBN-13 : 9781800561328
Category :
Languages :
Concepts :

Packt Subscriptions

See our plans and pricing
Modal Close icon
$19.99 billed monthly
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Simple pricing, no contract
$199.99 billed annually
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just Can$6 each
Feature tick icon Exclusive print discounts
$279.99 billed in 18 months
Feature tick icon Unlimited access to Packt's library of 7,000+ practical books and videos
Feature tick icon Constantly refreshed with 50+ new titles a month
Feature tick icon Exclusive Early access to books as they're written
Feature tick icon Solve problems while you work with advanced search and reference features
Feature tick icon Offline reading on the mobile app
Feature tick icon Choose a DRM-free eBook or Video every month to keep
Feature tick icon PLUS own as many other DRM-free eBooks or Videos as you like for just Can$6 each
Feature tick icon Exclusive print discounts

Frequently bought together


Stars icon
Total Can$ 233.97
Data Engineering with Google Cloud Platform
Can$88.99
Data Engineering with AWS
Can$82.99
Google Cloud Certified Professional Cloud Network Engineer Guide
Can$61.99
Total Can$ 233.97 Stars icon
Banner background image

Table of Contents

16 Chapters
Section 1: Getting Started with Data Engineering with GCP Chevron down icon Chevron up icon
Chapter 1: Fundamentals of Data Engineering Chevron down icon Chevron up icon
Chapter 2: Big Data Capabilities on GCP Chevron down icon Chevron up icon
Section 2: Building Solutions with GCP Components Chevron down icon Chevron up icon
Chapter 3: Building a Data Warehouse in BigQuery Chevron down icon Chevron up icon
Chapter 4: Building Orchestration for Batch Data Loading Using Cloud Composer Chevron down icon Chevron up icon
Chapter 5: Building a Data Lake Using Dataproc Chevron down icon Chevron up icon
Chapter 6: Processing Streaming Data with Pub/Sub and Dataflow Chevron down icon Chevron up icon
Chapter 7: Visualizing Data for Making Data-Driven Decisions with Data Studio Chevron down icon Chevron up icon
Chapter 8: Building Machine Learning Solutions on Google Cloud Platform Chevron down icon Chevron up icon
Section 3: Key Strategies for Architecting Top-Notch Data Pipelines Chevron down icon Chevron up icon
Chapter 9: User and Project Management in GCP Chevron down icon Chevron up icon
Chapter 10: Cost Strategy in GCP Chevron down icon Chevron up icon
Chapter 11: CI/CD on Google Cloud Platform for Data Engineers Chevron down icon Chevron up icon
Chapter 12: Boosting Your Confidence as a Data Engineer Chevron down icon Chevron up icon
Other Books You May Enjoy Chevron down icon Chevron up icon

Customer reviews

Top Reviews
Rating distribution
Full star icon Full star icon Full star icon Full star icon Half star icon 4.7
(12 Ratings)
5 star 66.7%
4 star 33.3%
3 star 0%
2 star 0%
1 star 0%
Filter icon Filter
Top Reviews

Filter reviews by




Kyle Malone May 29, 2022
Full star icon Full star icon Full star icon Full star icon Full star icon 5
As a data analyst looking to expand my data engineering skills, I have been looking for a book on this topic for a while now. It's exactly what I needed to better understand how to write ETL jobs using GCP.The code in Github has a few errors, but if you know basic Python you should be able to spot and correct the mistakes fairly easily. Overall, this books has helped me a ton. I would definitely purchase it again.
Amazon Verified review Amazon
Behnam Jun 19, 2022
Full star icon Full star icon Full star icon Full star icon Full star icon 5
This book helped me to get started working with GCP in an organized way, now I have the confidence to do some real projects on GCP.
Amazon Verified review Amazon
cloud-learner Jun 28, 2024
Full star icon Full star icon Full star icon Full star icon Full star icon 5
well written and easy to understand!
Amazon Verified review Amazon
Om S May 31, 2022
Full star icon Full star icon Full star icon Full star icon Full star icon 5
Google Cloud Platform had always been pioneered when it comes to big data and AI. This platform is amazingly simple for developers and the author has done a great job putting everything together step by step to facilitate readers' and users' experience.The book is divided into 12 solid chapters; the last chapter is related to those who are interested in getting certification and want to know more about it.[Chapter 1 ] - Explains fundamentals of data engineering, ETL concept in data engineering and data life cycle, the difference between ETL and ELT, etc. It also explains ETL, big data, and distributed systems in short.[Chapter 2] - This is more related to big data products and their capability, GCP console, cloud shell, and cloud editor are needed and explained in detail with other associated services. The author has explained serverless services or fully managed services. Most of the cloud companies want clients/users to be on and opt serverless side of the story![Chapter 3] - Cover Google Cloud Storage and BigQuery console, Different kind of scenarios the author has presented give a different aspect of the data modeling and understanding of the BigQuery operations. You will learn design data modeling for BigQuery. Diagram and code to cover your practice that is your key.[Chapter 4] - In this chapter automating tasks, jobs, and how to handle their dependencies has been explained. Cloud Composer, the introduction of the open-source tool “Airflow” and data pipeline for BigQuery data warehouse. [Chapter 5] - If you know and have an idea about Spark and PySpark you will be enjoying this chapter! Developing Spark ETL from GCS to BigQuery is a fun part. Brief Intro to Dataproc which is Data Lake, building a data lake on a Dataproc cluster, creating and running jobs on a Dataproc cluster, the concept of the ephemeral cluster, using Dataproc and Cloud Composer is explained. My suggestion is to learn HDFS and Spark a bit in detail, this is an important chapter.[Chapter 6] – You will learn about streaming data and how to handle incoming data as soon as data is created using the Pub/Sub publisher client.[Chapter 7] - Data Studio for visualizing data with connectivity to BigQuery. Data studio explorer gives a lot of options to make charts/aggregations and visualize your data.[Chapter 8] - No one understands better than Google Cloud about Machine Learning and Artificial Intelligence. In this chapter, author has a provided bird eye view of ML, MLOps, some pre-built GCP models as a service, and deploying ML pipelines with Vertex AI are highlights. If you like ML you will be enjoying this chapter. Different Exercises make this a more fun chapter. [Chapter 9 ] IAM, project structure, and BigQuery ACLs, controlling user access/ infrastructure as code. You will also learn the power of Terraform but my intake is practicing “Terraform” a bit outside of this book is more fun and you will see new challenges. [Chapter 10] - This chapter cover cost strategy/saving money and also highlights how to estimate the overall data solution using GCP. End-to-end data solution cost with tools and tips for optimizing services. Happy clients bring good business![Chapter 11] - Continuous integration and continuous deployment (CI/CD) on Google Cloud Platform for Data Engineers, explains the concept of CI/CD and its relevance to data engineers. Without CI/CD cloud is like a handicap! If you have used that before you will understand its importance.[Chapter 12 ] – As I explained in the beginning it's all about Boosting Your Confidence as a Data Engineer, and preparing you for the GCP certification. Finally, I would conclude it as a great book. Nice references/URL and summary for revision.For a new version of the book, I would be expecting more questions and case studies for readers.
Amazon Verified review Amazon
Yashar Mansouri Jun 26, 2022
Full star icon Full star icon Full star icon Full star icon Full star icon 5
Data Engineering with GCP by Adi Wijaya is a hands-on-book that covers processes such as Data Warehousing , ETL/ELT, and workflow management through the use of Google Cloud Platform's tech stack.Services such as IAM & Admin, BigQuery, CloudSQL, Cloud Storage, Composer, Dataproc, Pub/Sub, and Dataflow are covered in multiple chapters of the book with hands-on or coding examples through Cloud Console, Cloud Shell scripting, and even Python code. Most examples are presented by trying to answer business requirements for real world scenarios.What I specifically liked is that the author also covers the concepts around data engineering processes such as comparing the Inmon and Kimball Data Warehousing methods and when to use each as well as user and project management and the optimal cost strategy when using the tools provided by GCP.
Amazon Verified review Amazon
Get free access to Packt library with over 7500+ books and video courses for 7 days!
Start Free Trial

FAQs

What is included in a Packt subscription? Chevron down icon Chevron up icon

A subscription provides you with full access to view all Packt and licnesed content online, this includes exclusive access to Early Access titles. Depending on the tier chosen you can also earn credits and discounts to use for owning content

How can I cancel my subscription? Chevron down icon Chevron up icon

To cancel your subscription with us simply go to the account page - found in the top right of the page or at https://subscription.packtpub.com/my-account/subscription - From here you will see the ‘cancel subscription’ button in the grey box with your subscription information in.

What are credits? Chevron down icon Chevron up icon

Credits can be earned from reading 40 section of any title within the payment cycle - a month starting from the day of subscription payment. You also earn a Credit every month if you subscribe to our annual or 18 month plans. Credits can be used to buy books DRM free, the same way that you would pay for a book. Your credits can be found in the subscription homepage - subscription.packtpub.com - clicking on ‘the my’ library dropdown and selecting ‘credits’.

What happens if an Early Access Course is cancelled? Chevron down icon Chevron up icon

Projects are rarely cancelled, but sometimes it's unavoidable. If an Early Access course is cancelled or excessively delayed, you can exchange your purchase for another course. For further details, please contact us here.

Where can I send feedback about an Early Access title? Chevron down icon Chevron up icon

If you have any feedback about the product you're reading, or Early Access in general, then please fill out a contact form here and we'll make sure the feedback gets to the right team. 

Can I download the code files for Early Access titles? Chevron down icon Chevron up icon

We try to ensure that all books in Early Access have code available to use, download, and fork on GitHub. This helps us be more agile in the development of the book, and helps keep the often changing code base of new versions and new technologies as up to date as possible. Unfortunately, however, there will be rare cases when it is not possible for us to have downloadable code samples available until publication.

When we publish the book, the code files will also be available to download from the Packt website.

How accurate is the publication date? Chevron down icon Chevron up icon

The publication date is as accurate as we can be at any point in the project. Unfortunately, delays can happen. Often those delays are out of our control, such as changes to the technology code base or delays in the tech release. We do our best to give you an accurate estimate of the publication date at any given time, and as more chapters are delivered, the more accurate the delivery date will become.

How will I know when new chapters are ready? Chevron down icon Chevron up icon

We'll let you know every time there has been an update to a course that you've bought in Early Access. You'll get an email to let you know there has been a new chapter, or a change to a previous chapter. The new chapters are automatically added to your account, so you can also check back there any time you're ready and download or read them online.

I am a Packt subscriber, do I get Early Access? Chevron down icon Chevron up icon

Yes, all Early Access content is fully available through your subscription. You will need to have a paid for or active trial subscription in order to access all titles.

How is Early Access delivered? Chevron down icon Chevron up icon

Early Access is currently only available as a PDF or through our online reader. As we make changes or add new chapters, the files in your Packt account will be updated so you can download them again or view them online immediately.

How do I buy Early Access content? Chevron down icon Chevron up icon

Early Access is a way of us getting our content to you quicker, but the method of buying the Early Access course is still the same. Just find the course you want to buy, go through the check-out steps, and you’ll get a confirmation email from us with information and a link to the relevant Early Access courses.

What is Early Access? Chevron down icon Chevron up icon

Keeping up to date with the latest technology is difficult; new versions, new frameworks, new techniques. This feature gives you a head-start to our content, as it's being created. With Early Access you'll receive each chapter as it's written, and get regular updates throughout the product's development, as well as the final course as soon as it's ready.We created Early Access as a means of giving you the information you need, as soon as it's available. As we go through the process of developing a course, 99% of it can be ready but we can't publish until that last 1% falls in to place. Early Access helps to unlock the potential of our content early, to help you start your learning when you need it most. You not only get access to every chapter as it's delivered, edited, and updated, but you'll also get the finalized, DRM-free product to download in any format you want when it's published. As a member of Packt, you'll also be eligible for our exclusive offers, including a free course every day, and discounts on new and popular titles.