You're reading from The Kaggle Workbook: Self-learning exercises and valuable insights for Kaggle data science competitions, by Konrad Banachewicz and Luca Massaron (Packt, February 2023, ISBN-13 9781804611210).

Learning from top solutions

In this section, we gather aspects of the top solutions that could allow us to rise above the level of the baseline solution. Keep in mind that the leaderboards (both public and private) in this competition were quite tight, due to a combination of two factors:

  • Noisy data: it was easy to reach 0.89 accuracy by correctly classifying a large part of the data, after which each additional correct prediction yielded only a tiny gain on the leaderboard
  • Limited size of the data

Pretraining

The first and most obvious remedy to the issue of limited data size was pretraining: using more data. Pretraining a deep learning model on more data can be beneficial because it can help the model learn better representations of the data, which can in turn improve the performance of the model on downstream tasks. When a deep learning model is trained on a large dataset, it can learn to extract useful features from the data that are relevant to the task at hand. This can provide a strong foundation for the model, allowing it to learn more effectively when it is fine-tuned on a smaller, specific dataset.

Additionally, pretraining on a large dataset can help the model to generalize better to new, unseen data. Because the model has seen a wide range of examples during pretraining, it can better adapt to new data that may be different from the training data in some way. This can be especially important when working with deep learning models, which can have a large number of parameters and can be difficult to train effectively from scratch.

An earlier edition of the Cassava competition had been held a year before: https://www.kaggle.com/competitions/cassava-disease/overview.

With minimal adjustments, the data from the 2019 edition could be leveraged in the context of the current one, and several competitors addressed the topic.
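To make the idea concrete, here is a minimal sketch of the pretrain-then-fine-tune approach in PyTorch with the timm library; the backbone choice, data loaders, epoch counts, and learning rates are illustrative assumptions rather than any team's exact pipeline.

```python
# A minimal sketch of pretraining on the 2019 Cassava data and fine-tuning on
# the 2021 data. The data loaders (loader_2019, loader_2021) are assumed to
# exist; hyperparameters are purely illustrative.
import timm
import torch
from torch import nn, optim

def train_stage(model, loader, epochs, lr, device="cuda"):
    """A bare-bones training loop shared by the pretraining and fine-tuning stages."""
    model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
    return model

# The 2019 and 2021 editions cover the same five classes, so one head suffices.
model = timm.create_model("resnext50_32x4d", pretrained=True, num_classes=5)

# Stage 1: pretrain on the 2019 Cassava data.
# model = train_stage(model, loader_2019, epochs=5, lr=1e-4)

# Stage 2: fine-tune on the 2021 competition data with a lower learning rate.
# model = train_stage(model, loader_2021, epochs=10, lr=1e-5)
```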

Test time augmentation

The idea behind Test Time Augmentation (TTA) is to apply different transformations to the test image: rotations, flipping, and translations. This creates a few different versions of the test image, and we generate a prediction for each of them. The resulting class probabilities are then averaged to get a more confident answer. An excellent demonstration of this technique is given in a notebook by Andrew Khael: https://www.kaggle.com/code/andrewkh/test-time-augmentation-tta-worth-it.
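As an illustration, a simple TTA routine in PyTorch could look like the sketch below; the particular augmentations are assumptions and not the exact set used by the notebook or the top teams.

```python
# A minimal TTA sketch: run the model on several transformed views of a test
# image and average the resulting class probabilities.
import torch
import torchvision.transforms.functional as TF

def predict_with_tta(model, image, device="cuda"):
    """image: a normalized 3xHxW tensor, prepared exactly as during training."""
    model.eval()
    views = [
        image,                          # original view
        TF.hflip(image),                # horizontal flip
        TF.vflip(image),                # vertical flip
        torch.rot90(image, 1, [1, 2]),  # 90-degree rotation
    ]
    with torch.no_grad():
        batch = torch.stack(views).to(device)
        probs = torch.softmax(model(batch), dim=1)
    # Average the class probabilities across the augmented views.
    return probs.mean(dim=0)
```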

TTA was used extensively by the top solutions in the Cassava competition, an excellent example being the top three private leaderboard results: https://www.kaggle.com/competitions/cassava-leaf-disease-classification/discussion/221150.

Transformers

While more widely known architectures like ResNeXt and EfficientNet were used a lot in the course of the competition, it was the addition of more novel ones that provided the extra edge to many competitors yearning for progress in a tightly packed leaderboard. Transformers emerged in 2017 as a revolutionary architecture for NLP (if somehow you missed the paper that started it all, here it is: https://arxiv.org/abs/1706.03762) and were such a spectacular success that, inevitably, many people started wondering if they could be applied to other modalities as well – vision being an obvious candidate. The aptly named Vision Transformer (ViT) made one of its first appearances in a Kaggle competition in the Cassava contest.

An excellent tutorial for ViT has been made public: https://www.kaggle.com/code/abhinand05/vision-transformer-vit-tutorial-baseline.
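For instance, a ViT backbone can be slotted into the same training pipeline as the convolutional models; the snippet below is a minimal sketch assuming the timm library, with the variant and input resolution chosen for illustration.

```python
# Create a pretrained Vision Transformer with a 5-class head for Cassava.
import timm

vit = timm.create_model("vit_base_patch16_384", pretrained=True, num_classes=5)
# From here on, vit can be fine-tuned with the same loop used for the CNN
# backbones (e.g., the train_stage sketch in the Pretraining section).
```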

Ensembling

Ensembling is very popular on Kaggle (see Chapter 9 of The Kaggle Book for a more elaborate description) and the Cassava competition was no exception. As it turned out, combining diverse architectures by averaging their class probabilities was very beneficial: EfficientNet, ResNeXt, and ViT are sufficiently different from one another that their predictions complement each other. When building a machine learning ensemble, it is useful to combine models that differ from one another, because this can help improve the overall performance of the ensemble.

Ensembling is the process of combining the predictions of multiple models to create a more accurate prediction. By combining models that have different strengths and weaknesses, the ensemble can take advantage of the strengths of each individual model to make more accurate predictions.

For example, if the individual models in an ensemble are all based on the same type of algorithm, they may all make similar errors on certain types of data. By combining models that use different algorithms, the ensemble can potentially correct for the errors made by each individual model, leading to better overall performance. Additionally, by combining models that have been trained on different data or using different parameters, the ensemble can potentially capture more of the underlying variation in the data, leading to more accurate predictions.
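As a concrete illustration of the probability averaging mentioned above, the following sketch blends the outputs of several models; the variable names and weights are assumptions.

```python
# A minimal probability-averaging ensemble, assuming each model has already
# produced an (n_samples, n_classes) array of class probabilities.
import numpy as np

def blend_probabilities(prob_list, weights=None):
    """Return (predicted labels, blended probabilities) from a list of probability arrays."""
    probs = np.stack(prob_list)                     # (n_models, n_samples, n_classes)
    if weights is None:
        weights = np.ones(len(prob_list))
    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()
    blended = np.tensordot(weights, probs, axes=1)  # weighted average over models
    return blended.argmax(axis=1), blended

# Example usage with the three architecture families mentioned above:
# labels, blended = blend_probabilities(
#     [probs_efficientnet, probs_resnext, probs_vit], weights=[1, 1, 1]
# )
```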

Another important approach was stacking, i.e., using models in two stages. First, we construct predictions from multiple diverse models; these are subsequently used as input for a second-level model: https://www.kaggle.com/competitions/cassava-leaf-disease-classification/discussion/220751.
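A minimal sketch of this two-stage idea, assuming scikit-learn and out-of-fold predictions from the first-level models, could look as follows; the choice of logistic regression as the meta-model and the variable names are assumptions.

```python
# Stacking sketch: out-of-fold class probabilities from the first-level models
# become the features of a second-level (meta) model.
import numpy as np
from sklearn.linear_model import LogisticRegression

def stack(oof_preds, test_preds, y_train):
    """oof_preds/test_preds: lists of (n_samples, n_classes) probability arrays."""
    X_train = np.hstack(oof_preds)    # first-level predictions as features
    X_test = np.hstack(test_preds)
    meta = LogisticRegression(max_iter=1000)
    meta.fit(X_train, y_train)        # fit the second-level model
    return meta.predict_proba(X_test)

# final_probs = stack([oof_effnet, oof_resnext, oof_vit],
#                     [test_effnet, test_resnext, test_vit], y_train)
```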

The winning solution involved a different approach (with fewer models in the final blend), but relied on the same core logic: https://www.kaggle.com/competitions/cassava-leaf-disease-classification/discussion/221957.
