The Definitive Guide to Google Vertex AI

You're reading from The Definitive Guide to Google Vertex AI: Accelerate your machine learning journey with Google Cloud Vertex AI and MLOps best practices

Product type Paperback
Published in Dec 2023
Publisher Packt
ISBN-13 9781801815260
Length 422 pages
Edition 1st Edition
Authors (2): Kartik Chaudhary, Jasmeet Bhatia

Common challenges in developing real-world ML solutions

A real-world ML project almost always runs into unexpected challenges at different stages. The main reason is that neither real-world data nor ML algorithms are perfect. Although these challenges hamper the performance of the overall ML setup, they don’t prevent us from creating a valuable ML application. In a new ML project, it is difficult to know the challenges up front; they usually surface as the project progresses. Some of them are not obvious and require skilled or experienced ML practitioners (or data scientists) to identify them and apply countermeasures to reduce their effect.

In this section, we will look at some of the common challenges encountered during the development of a typical ML solution. The following list shows the challenges we will discuss in more detail:

  • Data collection and security
  • Non-representative training data
  • Poor quality of data
  • Underfitting the training dataset
  • Overfitting the training dataset
  • Infrastructure requirements

Now, let’s learn about each of these common challenges in detail.

Data collection and security

One of the most common challenges that organizations face is data availability. ML algorithms require a large amount of good-quality data to provide quality results, so the availability of raw data is critical for a business that wants to implement ML. Even when the raw data is available, gathering it is not the only concern; we often need to transform or process the data into a form that our ML algorithm supports.

Data security is another important challenge that ML developers face very frequently. When we get data from a company, it is essential to differentiate between sensitive and non-sensitive information to implement ML correctly and efficiently. The sensitive part of the data needs to be stored on fully secured servers (storage systems) and should always be kept encrypted. For security purposes, the use of sensitive data should be avoided wherever possible, and access to even the less-sensitive data should be given only to trusted team members working on the project. If the data contains Personally Identifiable Information (PII), it can still be used after anonymizing it properly.
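As a minimal sketch of the idea, the following Python snippet replaces PII fields with keyed hashes before the data reaches the ML team. The field names, the sample record, and the secret key are hypothetical and chosen only for illustration. Note that keyed hashing is strictly pseudonymization, which is a weaker guarantee than full anonymization, so it should be treated as one possible building block rather than a complete solution.

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it would be loaded from a secrets manager.
SECRET_KEY = b"replace-with-a-real-secret"

# Hypothetical field names that this sketch treats as PII.
PII_FIELDS = {"name", "email"}


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed SHA-256 digest so that the raw value is never stored."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def anonymize_record(record: dict) -> dict:
    """Replace PII fields with pseudonyms and leave all other fields untouched."""
    return {
        field: pseudonymize(str(value)) if field in PII_FIELDS else value
        for field, value in record.items()
    }


record = {"name": "Jane Doe", "email": "jane@example.com", "purchases": 7}
safe_record = anonymize_record(record)
print(safe_record)
```

Because the same input and key always produce the same digest, records belonging to one person can still be grouped together for training without exposing the original values.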

Non-representative training data

A good ML model is one that performs equally well on unseen data and training data. This is only possible when the training data is representative of most possible business scenarios. Sometimes, when the dataset is small, it may not truly represent the underlying distribution, and the resulting model may make inaccurate predictions on unseen data despite producing high-quality results on the training dataset. Such non-representative data is the result of either sampling bias or the unavailability of data. An ML model trained on a non-representative dataset may therefore have less value when it is deployed in production.

If it is impossible to get a truly representative training dataset for a business problem, then it’s better to limit the scope of the problem to only the scenarios for which we have a sufficient number of training samples. In this way, the unseen dataset will contain only known scenarios, and the model should provide quality predictions. Sometimes, the data related to a business problem keeps changing with time, and it may not be possible to develop a single static model that works well; in such cases, continuous retraining of the model on the latest data becomes essential.
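One simple way to spot a non-representative training set is to compare how often each scenario appears in the training data with how often it appears in a recent sample of real traffic. The sketch below does this for a single categorical feature. The feature (a region), the sample values, and the 10% tolerance are all assumptions made for illustration; a real project would pick features and thresholds that suit the business problem.

```python
from collections import Counter


def category_shares(values):
    """Return the fraction of samples that fall into each category."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def underrepresented(train_values, reference_values, tolerance=0.10):
    """List categories whose training share is more than `tolerance` below the reference share."""
    train = category_shares(train_values)
    reference = category_shares(reference_values)
    return sorted(
        category
        for category, share in reference.items()
        if share - train.get(category, 0.0) > tolerance
    )


# Hypothetical data: the training set has never seen the "west" region.
train_regions = ["north"] * 80 + ["south"] * 20
recent_regions = ["north"] * 50 + ["south"] * 25 + ["west"] * 25

flagged = underrepresented(train_regions, recent_regions)
print(flagged)  # ['west']
```

A flagged category is a candidate either for collecting more data or for being scoped out of the problem. Running the same check periodically on fresh data is one possible trigger for retraining.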

Poor quality of data

The performance of ML algorithms is very sensitive to the quality of the training samples. A small number of outliers, missing values, or abnormal scenarios can significantly affect the quality of the model. So, it is important to handle such cases carefully while analyzing the data, before training any ML algorithm. There are multiple methods for identifying and treating outliers; the best method depends upon the nature of the problem and the data itself. Similarly, there are multiple ways of treating missing values. For example, filling them in with the mean, median, or mode of the corresponding feature is a frequently used approach. If the training dataset is sufficiently large, dropping a small number of rows with missing values is also a good option.
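The following sketch illustrates both steps on a small, made-up list of ages. Missing entries are filled with the median, and outliers are then flagged with the interquartile range (IQR) rule, which is one common choice among the many outlier detection methods and is used here only as an example.

```python
import statistics


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the first or third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical feature column with two missing values and one implausible entry.
ages = [23, 25, None, 27, 24, 26, 120, None, 25]

filled = impute_median(ages)
outliers = iqr_outliers(filled)
print(filled)    # [23, 25, 25, 27, 24, 26, 120, 25, 25]
print(outliers)  # [120]
```

The median is used here instead of the mean because a single extreme value such as 120 would pull the mean upward and distort the imputed values.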

As discussed, the quality of the training dataset is important if we want our ML system to learn accurately and provide quality results on the unseen dataset. It means that the data pre-processing part of the ML life cycle should be taken very seriously.

Underfitting the training dataset

Underfitting an ML model means that the model is too simple to learn the inherent information or structure of the training dataset. It may occur when we try to fit a non-linear distribution using a linear ML algorithm such as linear regression. Underfitting may also occur when we train the model on only a minimal set of features that may not carry much information about the target distribution. An underfitted model learns too little from the training data and, thus, performs poorly on the training data itself as well as on unseen or test datasets.

There are multiple ways to tackle the problem of underfitting. Here is a list of some common methods:

  • Feature engineering – add more features that represent the target distribution
  • Non-linear algorithms – switch to a non-linear algorithm if the target distribution is not linear
  • Noise removal – remove noise from the data
  • More powerful models – increase the number of trainable parameters, or increase the depth or number of trees in tree-based ensembles
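To make underfitting and the feature engineering remedy concrete, here is a small sketch on synthetic data. The target follows y = x², which is an assumption chosen to create a clearly non-linear relationship. A straight line fitted on the raw feature x leaves a large error even on the training data, while the same linear algorithm fitted on the engineered feature x² fits it almost perfectly.

```python
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x with a single feature."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def mse(xs, ys, intercept, slope):
    """Mean squared error of the fitted line on (xs, ys)."""
    return statistics.fmean([(y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)])


# Synthetic, noise-free training data with a non-linear shape: y = x^2.
xs = [x / 2 for x in range(-10, 11)]
ys = [x ** 2 for x in xs]

# Model A: raw feature x. A straight line cannot follow the curve.
a0, a1 = fit_line(xs, ys)
raw_error = mse(xs, ys, a0, a1)

# Model B: engineered feature x^2, fitted with the same linear algorithm.
squared = [x ** 2 for x in xs]
b0, b1 = fit_line(squared, ys)
engineered_error = mse(squared, ys, b0, b1)

print(f"raw feature MSE: {raw_error:.3f}")
print(f"engineered feature MSE: {engineered_error:.3f}")
```

The high training error of Model A is the typical signature of underfitting: the model is too simple for the data, so more training samples alone would not help.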

Just like underfitting the model on training data, overfitting is also a big issue. Let’s take a closer look at it.

Overfitting the training dataset

The overfitting problem is the opposite of the underfitting problem. Overfitting occurs when the ML model learns too much unnecessary information (such as noise) from the training data and fails to generalize to a test or unseen dataset. In this case, the model performs extremely well on the training dataset, but the metric value (such as accuracy) is much lower on the test set. Overfitting usually occurs when we apply a very complex algorithm to simple datasets.

Some common methods to address the problem of overfitting are as follows:

  • Increase training data size – ML models often overfit on small datasets
  • Use simpler models – When problems are simple or linear in nature, choose simple ML algorithms
  • Regularization – There are multiple regularization methods that prevent complex models from overfitting on the training dataset
  • Reduce model complexity – Use a smaller number of trainable parameters, train for a smaller number of epochs, and reduce the depth of tree-based models
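The gap between training and test error can be seen in a small, fully synthetic sketch. It uses k-nearest-neighbor regression, which is our choice for illustration and is not prescribed by the text: with k = 1 the model memorizes every training point, including its noise, while a larger k averages over neighbors and behaves like a simpler model. The alternating +1 / -1 "noise" is an artificial assumption that keeps the example deterministic.

```python
import statistics


def knn_predict(train_x, train_y, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return statistics.fmean([train_y[i] for i in nearest])


def knn_mse(train_x, train_y, xs, ys, k):
    """Mean squared error of the k-NN predictions on (xs, ys)."""
    return statistics.fmean(
        [(y - knn_predict(train_x, train_y, x, k)) ** 2 for x, y in zip(xs, ys)]
    )


# Training set: the true relationship is y = x, corrupted by alternating +1 / -1 noise.
train_x = [float(i) for i in range(20)]
train_y = [x + (1.0 if i % 2 == 0 else -1.0) for i, x in enumerate(train_x)]

# Test set: unseen inputs with noise-free targets.
test_x = [i + 0.4 for i in range(19)]
test_y = list(test_x)

errors = {}
for k in (1, 3):
    train_error = knn_mse(train_x, train_y, train_x, train_y, k)
    test_error = knn_mse(train_x, train_y, test_x, test_y, k)
    errors[k] = (train_error, test_error)
    print(f"k={k}: train MSE={train_error:.2f}, test MSE={test_error:.2f}")
```

With k = 1 the training error is exactly zero while the test error is much higher, which is the overfitting pattern described above. With k = 3 the training error rises but the test error drops, showing how reducing model complexity can improve generalization on this toy dataset.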

Overfitting and underfitting are common challenges and should be addressed carefully, as discussed earlier. Now, let’s discuss some infrastructure-related challenges.

Infrastructure requirements

ML is expensive. A typical ML project often involves crunching large datasets with millions or billions of samples. Slicing and dicing such datasets requires a lot of memory and high-end multi-core processors. Additionally, once the development of the project is complete, dedicated servers are required to deploy the models and scale with the number of consumers. Thus, business organizations willing to practice ML need some dedicated infrastructure to implement and consume ML efficiently. This requirement increases further when working with large deep learning models such as transformers, large language models (LLMs), and so on. Such models usually require a set of accelerators, such as graphics processing units (GPUs) or tensor processing units (TPUs), for training, fine-tuning, and deployment.
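A rough back-of-the-envelope calculation shows why large models need accelerators with a lot of memory. The sketch below estimates the memory needed just to hold a model's weights, and the memory needed during training when gradients and optimizer states are also kept. The 7-billion-parameter model size and the assumption of two optimizer states per parameter (as in Adam-style optimizers) are illustrative, and the estimate deliberately ignores activations, batch data, and framework overhead, so real requirements are higher.

```python
# Bytes used to store one value in each numeric format.
BYTES_PER_VALUE = {"float32": 4, "float16": 2}


def weights_gb(num_parameters, dtype="float32"):
    """Memory (in GB, 1 GB = 1e9 bytes) needed to hold the model weights alone."""
    return num_parameters * BYTES_PER_VALUE[dtype] / 1e9


def training_gb(num_parameters, dtype="float32", optimizer_states=2):
    """Rough training footprint: weights + gradients + optimizer states.

    Assumes every copy uses the same dtype; activations are not included.
    """
    copies = 1 + 1 + optimizer_states
    return copies * weights_gb(num_parameters, dtype)


params = 7_000_000_000  # hypothetical 7-billion-parameter model

serving_estimate = weights_gb(params, "float16")
training_estimate = training_gb(params, "float32")
print(f"weights only (float16): {serving_estimate:.0f} GB")    # 14 GB
print(f"training (float32):     {training_estimate:.0f} GB")   # 112 GB
```

Even under these simplified assumptions, the training footprint exceeds the memory of a single typical accelerator, which is why such workloads are usually spread across several devices.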

As we have discussed, infrastructure is critical for practicing ML. Companies that lack such infrastructure can consult with other firms or adopt cloud-based offerings to start developing ML-based applications.

Now that we understand the common challenges faced during the development of an ML project, we should be able to make more informed decisions about them. Next, let’s learn about some of the limitations of ML.
