Deep Learning with fastai Cookbook
Deep Learning with fastai Cookbook: Leverage the easy-to-use fastai framework to unlock the power of deep learning

Chapter 2: Exploring and Cleaning Up Data with fastai

In the previous chapter, we got started with the fastai framework by setting up its coding environment, working through a concrete application example (MNIST), and investigating two frameworks with different relationships to fastai: PyTorch and Keras. In this chapter, we are going to dive deeper into an important aspect of fastai: ingesting, exploring, and cleaning up data. In particular, we are going to explore a selection of the datasets that are curated by fastai.

By the end of this chapter, you will be able to describe the complete set of curated datasets that fastai supports, use the facilities of fastai to examine these datasets, and clean up a dataset to eliminate missing and non-numeric values.

Here are the recipes that will be covered in this chapter:

  • Getting the complete set of oven-ready fastai datasets
  • Examining tabular datasets with fastai
  • Examining text datasets with fastai
  • Examining image datasets with fastai
  • Cleaning up raw datasets with fastai

Technical requirements

Ensure that you have completed the setup sections in Chapter 1, Getting Started with fastai, and that you have a working Gradient instance or Colab setup. Ensure that you have cloned the repository for this book (https://github.com/PacktPublishing/Deep-Learning-with-fastai-Cookbook) and have access to the ch2 folder. This folder contains the code samples that will be described in this chapter.

Getting the complete set of oven-ready fastai datasets

In Chapter 1, Getting Started with fastai, you encountered the MNIST dataset and saw how easy it was to make this dataset available to train a fastai deep learning model. You were able to train the model without needing to worry about the location of the dataset or its structure (apart from the names of the folders containing the training and validation datasets). You were able to examine elements of the dataset conveniently.

In this section, we'll take a closer look at the complete set of datasets that fastai curates and explain how you can get additional information about these datasets.

Getting ready

Ensure you have followed the steps in Chapter 1, Getting Started with fastai, so that you have a fastai environment set up. Confirm that you can open the fastai_dataset_walkthrough.ipynb notebook in the ch2 directory of your cloned repository.

How to do it…

In this section, you will be running through the fastai_dataset_walkthrough.ipynb notebook, as well as the fastai dataset documentation, so that you understand the datasets that fastai curates. Once you have the notebook open in your fastai environment, complete the following steps:

  1. Run the first three cells of the notebook to load the required libraries, set up the notebook for fastai, and define the MNIST dataset:
    Figure 2.1 – Cells to load the libraries, set up the notebook, and define the MNIST dataset

  2. Consider the argument to untar_data: URLs.MNIST. What is this? Let's use the ?? shortcut to examine the source code for the URLs object:
    Figure 2.2 – Source for URLs

  3. By looking at the image classification datasets section of the source code for URLs, we can find the definition of URLs.MNIST:
    MNIST           = f'{S3_IMAGE}mnist_png.tgz'
  4. Working backward through the source code for the URLs class, we can get the whole URL for MNIST:
    S3_IMAGE     = f'{S3}imageclas/'
    S3  = 'https://s3.amazonaws.com/fast-ai-'
  5. Putting it all together, we get the URL for URLs.MNIST:
    https://s3.amazonaws.com/fast-ai-imageclas/mnist_png.tgz
  6. You can download this file for yourself and untar it. You will see that the directory structure of the untarred package looks like this:
    mnist_png
    ├── testing
    │   ├── 0
    │   ├── 1
    │   ├── 2
    │   ├── 3
    │   ├── 4
    │   ├── 5
    │   ├── 6
    │   ├── 7
    │   ├── 8
    │   └── 9
    └── training
         ├── 0
         ├── 1
         ├── 2
         ├── 3
         ├── 4
         ├── 5
         ├── 6
         ├── 7
         ├── 8
         └── 9
  7. In the untarred directory structure, the testing and training directories each contain one subdirectory per digit. Each digit directory holds the image files for that digit. This means that the label of the dataset (the value that we want the model to predict) is encoded in the name of the directory that the image file resides in.
  8. Is there a way to get the directory structure of one of the curated datasets without having to determine its URL from the definition of URLs, download the dataset, and unpack it? There is – using path.ls():
    Figure 2.3 – Using path.ls() to get the dataset's directory structure

  9. This tells us that there are two subdirectories in the dataset: training and testing. You can call ls() to get the structure of the training subdirectory:
    Figure 2.4 – The structure of the training subdirectory

  10. Now that we have learned how to get the directory structure of the MNIST dataset using the ls() function, what else can we learn from the output of ??URLs?
  11. Let's go through the other datasets listed in the output of ??URLs, group by group. We start with the datasets listed under main datasets. This list includes tabular datasets (ADULT_SAMPLE), text datasets (IMDB_SAMPLE), recommender system datasets (ML_SAMPLE), and a variety of image datasets (CIFAR, IMAGENETTE, COCO_SAMPLE):
         ADULT_SAMPLE        = f'{URL}adult_sample.tgz'
         BIWI_SAMPLE         = f'{URL}biwi_sample.tgz'
         CIFAR               = f'{URL}cifar10.tgz'
         COCO_SAMPLE         = f'{S3_COCO}coco_sample.tgz'
         COCO_TINY           = f'{S3_COCO}coco_tiny.tgz'
         HUMAN_NUMBERS       = f'{URL}human_numbers.tgz'
         IMDB                = f'{S3_NLP}imdb.tgz'
         IMDB_SAMPLE         = f'{URL}imdb_sample.tgz'
         ML_SAMPLE           = f'{URL}movie_lens_sample.tgz'
         ML_100k             = 'http://files.grouplens.org/datasets/movielens/ml-100k.zip'
         MNIST_SAMPLE        = f'{URL}mnist_sample.tgz'
         MNIST_TINY          = f'{URL}mnist_tiny.tgz'
         MNIST_VAR_SIZE_TINY = f'{S3_IMAGE}mnist_var_size_tiny.tgz'
         PLANET_SAMPLE       = f'{URL}planet_sample.tgz'
         PLANET_TINY         = f'{URL}planet_tiny.tgz'
         IMAGENETTE          = f'{S3_IMAGE}imagenette2.tgz'
         IMAGENETTE_160      = f'{S3_IMAGE}imagenette2-160.tgz'
         IMAGENETTE_320      = f'{S3_IMAGE}imagenette2-320.tgz'
         IMAGEWOOF           = f'{S3_IMAGE}imagewoof2.tgz'
         IMAGEWOOF_160       = f'{S3_IMAGE}imagewoof2-160.tgz'
         IMAGEWOOF_320       = f'{S3_IMAGE}imagewoof2-320.tgz'
         IMAGEWANG           = f'{S3_IMAGE}imagewang.tgz'
         IMAGEWANG_160       = f'{S3_IMAGE}imagewang-160.tgz'
         IMAGEWANG_320       = f'{S3_IMAGE}imagewang-320.tgz'
  12. Next, let's look at the datasets in the other categories: image classification datasets, NLP datasets, image localization datasets, audio classification datasets, and medical image classification datasets. Note that the list of curated datasets includes datasets that aren't directly associated with any of the four main application areas supported by fastai. The audio datasets, for example, apply to a use case outside the four main application areas:
         # image classification datasets
         CALTECH_101  = f'{S3_IMAGE}caltech_101.tgz'
         CARS         = f'{S3_IMAGE}stanford-cars.tgz'
         CIFAR_100    = f'{S3_IMAGE}cifar100.tgz'
         CUB_200_2011 = f'{S3_IMAGE}CUB_200_2011.tgz'
         FLOWERS      = f'{S3_IMAGE}oxford-102-flowers.tgz'
         FOOD         = f'{S3_IMAGE}food-101.tgz'
         MNIST        = f'{S3_IMAGE}mnist_png.tgz'
         PETS         = f'{S3_IMAGE}oxford-iiit-pet.tgz'
         # NLP datasets
         AG_NEWS                 = f'{S3_NLP}ag_news_csv.tgz'
         AMAZON_REVIEWS          = f'{S3_NLP}amazon_review_full_csv.tgz'
         AMAZON_REVIEWS_POLARITY = f'{S3_NLP}amazon_review_polarity_csv.tgz'
         DBPEDIA                 = f'{S3_NLP}dbpedia_csv.tgz'
         MT_ENG_FRA              = f'{S3_NLP}giga-fren.tgz'
         SOGOU_NEWS              = f'{S3_NLP}sogou_news_csv.tgz'
         WIKITEXT                = f'{S3_NLP}wikitext-103.tgz'
         WIKITEXT_TINY           = f'{S3_NLP}wikitext-2.tgz'
         YAHOO_ANSWERS           = f'{S3_NLP}yahoo_answers_csv.tgz'
         YELP_REVIEWS            = f'{S3_NLP}yelp_review_full_csv.tgz'
         YELP_REVIEWS_POLARITY   = f'{S3_NLP}yelp_review_polarity_csv.tgz'
         # Image localization datasets
         BIWI_HEAD_POSE = f"{S3_IMAGELOC}biwi_head_pose.tgz"
         CAMVID         = f'{S3_IMAGELOC}camvid.tgz'
         CAMVID_TINY    = f'{URL}camvid_tiny.tgz'
         LSUN_BEDROOMS  = f'{S3_IMAGE}bedroom.tgz'
         PASCAL_2007    = f'{S3_IMAGELOC}pascal_2007.tgz'
         PASCAL_2012    = f'{S3_IMAGELOC}pascal_2012.tgz'
         # Audio classification datasets
         MACAQUES    = 'https://storage.googleapis.com/ml-animal-sounds-datasets/macaques.zip'
         ZEBRA_FINCH = 'https://storage.googleapis.com/ml-animal-sounds-datasets/zebra_finch.zip'
         # Medical Imaging datasets
         SIIM_SMALL = f'{S3_IMAGELOC}siim_small.tgz'
  13. Now that we have listed all the datasets defined in URLs, how can we find out more information about them?

    a) The fastai documentation (https://course.fast.ai/datasets) documents some of the datasets listed in URLs. Note that this documentation is not consistent with what's listed in the source of URLs. For example, the naming of the datasets is not consistent and the documentation page does not cover all the datasets. When in doubt, treat the source of URLs as your single source of truth about fastai curated datasets.

    b) Use the path.ls() function to examine the directory structure, as shown in the following example, which lists the directories under the training subdirectory of the MNIST dataset:

    Figure 2.5 – Structure of the training subdirectory

    c) Check out the file structure that gets installed when you run untar_data. For example, in Gradient, the datasets get installed in storage/data, so you can go into that directory in Gradient to inspect the directories for the curated dataset you're interested in.

    d) For example, let's say untar_data is run with URLs.PETS as the argument:

    path = untar_data(URLs.PETS)

    e) Here, you can find the dataset in storage/data/oxford-iiit-pet, and you can see the directory's structure:

    oxford-iiit-pet
    ├── annotations
    │   ├── trimaps
    │   └── xmls
    └── images
  14. If you want to see the definition of a function in a notebook, you can run a cell with ??, followed by the name of the function. For example, to see the definition of the ls() function, you can use ??Path.ls:
    Figure 2.6 – Source for Path.ls()

  15. To see the documentation for any function, you can use the doc() function. For example, the output of doc(Path.ls) shows the signature of the function, along with links to the source code (https://github.com/fastai/fastcore/blob/master/fastcore/xtras.py#L111) and the documentation (https://fastcore.fast.ai/xtras#Path.ls) for this function:
Figure 2.7 – Output of doc(Path.ls)

You have now explored the list of oven-ready datasets curated by fastai. You have also learned how to get the directory structure of these datasets, as well as how to examine the source and documentation of a function from within a notebook.
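To make the idea of a label encoded in a directory name concrete, here is a minimal sketch that uses only the Python standard library. It builds a tiny stand-in for the mnist_png layout in a temporary directory and reads each label from the name of the folder that holds the file. The file names and the label_from_parent helper are made up for illustration; this is not code from the notebook.

```python
import tempfile
from pathlib import Path

def label_from_parent(image_path: Path) -> str:
    """Return the name of the directory that directly contains the file."""
    return image_path.parent.name

# Build a tiny stand-in for the mnist_png layout (file names are made up).
root = Path(tempfile.mkdtemp()) / "mnist_png"
for split, digit, name in [("training", "3", "a.png"),
                           ("training", "7", "b.png"),
                           ("testing", "3", "c.png")]:
    folder = root / split / digit
    folder.mkdir(parents=True, exist_ok=True)
    (folder / name).touch()

# Map every file under training/ to the label encoded in its folder name.
labels = {p.name: label_from_parent(p)
          for p in sorted((root / "training").rglob("*.png"))}
print(labels)
```

Because the label lives in the path, no separate label file is needed for a dataset laid out this way.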

How it works…

As you saw in this section, fastai defines URLs for each of the curated datasets in the URLs class. When you call untar_data with one of the curated datasets as the argument, if the files for the dataset have not already been copied, these files get downloaded to your filesystem (storage/data in a Gradient instance). The object you get back from untar_data allows you to examine the directory structure of the dataset, and then pass it along to the next stage in the process of creating a fastai deep learning model. By wrapping a large sampling of interesting datasets in such a convenient way, fastai makes it easy for you to create deep learning models with these datasets, and also lets you focus your efforts on creating and improving the deep learning model rather than fiddling with the details of ingesting the datasets.
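The way the URL for a curated dataset is assembled can be reproduced in a few lines of plain Python. The three constants below are copied from the source for URLs shown in this recipe. The archive_stem helper is hypothetical and is only there to show that, for the two datasets unpacked in this recipe (MNIST and PETS), the name of the unpacked directory matches the archive file name without its extension. Treat that as an observation about these examples rather than a documented rule.

```python
# Constants copied from the URLs source shown in this recipe.
S3 = 'https://s3.amazonaws.com/fast-ai-'
S3_IMAGE = f'{S3}imageclas/'
MNIST = f'{S3_IMAGE}mnist_png.tgz'

def archive_stem(url: str) -> str:
    """Hypothetical helper: archive file name without its extension."""
    file_name = url.rsplit('/', 1)[-1]
    return file_name.split('.', 1)[0]

print(MNIST)                # full download URL for the MNIST archive
print(archive_stem(MNIST))  # name of the directory seen after unpacking
```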

There's more…

You might be asking yourself why we went to the trouble of examining the source code for the URLs class to get details about the curated datasets. After all, these datasets are documented at https://course.fast.ai/datasets. The problem is that this documentation page doesn't give a complete list of the curated datasets, and it doesn't clearly explain what you need to know to make the correct untar_data call for a particular curated dataset. The incomplete documentation for the curated datasets demonstrates one of the weaknesses of fastai: inconsistent documentation. Sometimes the documentation is complete, but sometimes it lacks details, so you will need to look at the source code directly to figure out what's going on, as we had to do in this section for the curated datasets. This problem is compounded by Google searches returning hits for the documentation of earlier versions of fastai. If you are searching for details about fastai, avoid hits for fastai version 1 (https://fastai1.fast.ai/) and keep to the documentation for the current version of fastai: https://docs.fast.ai/.

Examining tabular datasets with fastai

In the previous section, we looked at the whole set of datasets curated by fastai. In this section, we are going to dig into a tabular dataset from the curated list. We will ingest the dataset, look at some example records, and then explore characteristics of the dataset, including the number of records and the number of unique values in each column.

Getting ready

Ensure you have followed the steps in Chapter 1, Getting Started with fastai, to get a fastai environment set up. Confirm that you can open the examining_tabular_datasets.ipynb notebook in the ch2 directory of your repository.

I am grateful for the opportunity to include the ADULT_SAMPLE dataset featured in this section.

Dataset citation

Ron Kohavi. (1996) Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid (http://robotics.stanford.edu/~ronnyk/nbtree.pdf).

How to do it…

In this section, you will be running through the examining_tabular_datasets.ipynb notebook to examine the ADULT_SAMPLE dataset.

Once you have the notebook open in your fastai environment, complete the following steps:

  1. Run the first two cells to import the necessary libraries and set up the notebook for fastai.
  2. Run the following cell to copy the dataset into your filesystem (if it's not already there) and to define the path for the dataset:
    path = untar_data(URLs.ADULT_SAMPLE)
  3. Run the following cell to get the output of path.ls() so that you can examine the directory structure of the dataset:
    Figure 2.8 – Output of path.ls()

  4. The dataset is in the adult.csv file. Run the following cell to ingest this CSV file into a pandas DataFrame:
    df = pd.read_csv(path/'adult.csv')
  5. Run the head() command to get a sample of records from the beginning of the dataset:
    Figure 2.9 – Sample of records from the beginning of the dataset

  6. Run the following command to get the number of records (rows) and fields (columns) in the dataset:
    df.shape
  7. Run the following command to get the number of unique values in each column of the dataset. Can you tell from the output which columns are categorical?
    df.nunique()
  8. Run the following command to get the count of missing values in each column of the dataset. Which columns have missing values?
    df.isnull().sum()
  9. Run the following command to display some sample records from the subset of the dataset for people whose age is less than or equal to 40:
    df_young = df[df.age <= 40]
    df_young.head()

Congratulations! You have ingested a tabular dataset curated by fastai and done a basic examination of the dataset.
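If you want to try the same examination commands without downloading anything, the following sketch applies them to a small, made-up DataFrame. The column names only imitate ADULT_SAMPLE and the values are invented, so the numbers it prints say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical miniature table; the column names only imitate ADULT_SAMPLE.
df = pd.DataFrame({
    "age": [25, 38, 52, 41, 29],
    "workclass": ["Private", "Private", "State-gov", None, "Private"],
    "education-num": [13.0, None, 9.0, 10.0, 13.0],
    "salary": ["<50k", ">=50k", ">=50k", "<50k", "<50k"],
})

n_rows, n_cols = df.shape            # number of records and fields
unique_counts = df.nunique()         # distinct non-missing values per column
missing_counts = df.isnull().sum()   # missing values per column
df_young = df[df.age <= 40]          # same filter as in the recipe

print(n_rows, n_cols)
print(unique_counts.to_dict())
print(missing_counts.to_dict())
print(len(df_young))
```

Columns with only a handful of unique values relative to the number of rows, such as workclass and salary here, are good candidates for categorical columns.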

How it works…

The dataset that you explored in this section, ADULT_SAMPLE, is one of the datasets you would have seen in the source for URLs in the previous section. Note that while the source for URLs identifies which datasets are related to image or NLP (text) applications, it does not explicitly identify the tabular or recommender system datasets. ADULT_SAMPLE is one of the datasets listed under main datasets:

Figure 2.10 – Main datasets from the source for URLs

How did I determine that ADULT_SAMPLE was a tabular dataset? First, the paper by Howard and Gugger (https://arxiv.org/pdf/2002.04688.pdf) identifies ADULT_SAMPLE as a tabular dataset. Second, I just had to ingest it and try it out to confirm it could be ingested into a pandas DataFrame.

There's more…

What about the other curated datasets that aren't explicitly categorized in the source for URLs? Here's a summary of the datasets listed in the source for URLs under main datasets:

  • Tabular:

    a) ADULT_SAMPLE

  • NLP (text):

    a) HUMAN_NUMBERS  

    b) IMDB

    c) IMDB_SAMPLE      

  • Collaborative filtering:

    a) ML_SAMPLE               

    b) ML_100k                              

  • Image data:  

    a) All of the other datasets listed in URLs under main datasets.

Examining text datasets with fastai

In the previous section, we looked at how a curated tabular dataset could be ingested. In this section, we are going to dig into a text dataset from the curated list.

Getting ready

Ensure you have followed the steps in Chapter 1, Getting Started with fastai, to get a fastai environment set up. Confirm that you can open the examining_text_datasets.ipynb notebook in the ch2 directory of your repository.

I am grateful for the opportunity to use the WIKITEXT_TINY dataset (https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/) featured in this section.

Dataset citation

Stephen Merity, Caiming Xiong, James Bradbury, Richard Socher. (2016). Pointer Sentinel Mixture Models (https://arxiv.org/pdf/1609.07843.pdf).

How to do it…

In this section, you will be running through the examining_text_datasets.ipynb notebook to examine the WIKITEXT_TINY dataset. As its name suggests, this is a small set of text that's been gleaned from good and featured Wikipedia articles.

Once you have the notebook open in your fastai environment, complete the following steps:

  1. Run the first two cells to import the necessary libraries and set up the notebook for fastai.
  2. Run the following cell to copy the dataset into your filesystem (if it's not already there) and to define the path for the dataset:
    path = untar_data(URLs.WIKITEXT_TINY)
  3. Run the following cell to get the output of path.ls() so that you can examine the directory structure of the dataset:
    Figure 2.11 – Output of path.ls()

  4. There are two CSV files that make up this dataset. Let's ingest each of them into a pandas DataFrame, starting with train.csv:
    df_train = pd.read_csv(path/'train.csv')
  5. When you use head() to check the DataFrame, you'll notice that something's wrong – the CSV file has no header with column names, but by default, read_csv assumes the first row is the header, so the first row gets misinterpreted as a header. As shown in the following screenshot, the first row of output is in bold, which indicates that the first row is being interpreted as a header, even though it contains a regular data row:
    Figure 2.12 – First record in df_train

  6. To fix this problem, rerun the read_csv function, but this time with the header=None parameter, to specify that the CSV file doesn't have a header:
    df_train = pd.read_csv(path/'train.csv',header=None)
  7. Check head() again to confirm that the problem has been resolved:
    Figure 2.13 – Revising the first record in df_train

  8. Ingest test.csv into a DataFrame using the header=None parameter:
    df_test = pd.read_csv(path/'test.csv',header=None)
  9. We want to tokenize the dataset and transform it into a list of words. Since we want a common set of tokens for the entire dataset, we will begin by combining the test and train DataFrames:
    df_combined = pd.concat([df_train,df_test])
  10. Confirm the shape of the train, test, and combined DataFrames. The number of rows in the combined DataFrame should be the sum of the number of rows in the train and test DataFrames:
    print("df_train: ",df_train.shape)
    print("df_test: ",df_test.shape)
    print("df_combined: ",df_combined.shape)
  11. Now, we're ready to tokenize the DataFrame. The tokenize_df() function takes the list of columns containing the text we want to tokenize as a parameter. Since the columns of the DataFrame are not labeled, we need to refer to the column we want to tokenize using its position rather than its name:
    df_tok, count = tokenize_df(df_combined,[df_combined.columns[0]])
  12. Check the contents of the first few records of df_tok, which is the new DataFrame containing the tokenized contents of the combined DataFrame:
    Figure 2.14 – The first few records of df_tok

  13. Check the count for a few sample words to ensure they are roughly what you expected. Pick a very common word, a moderately common word, and a rare word:
    print("very common word (count['the']):", count['the'])
    print("moderately common word (count['prepared']):", count['prepared'])
    print("rare word (count['gaga']):", count['gaga'])

Congratulations! You have successfully ingested, explored, and tokenized a curated text dataset.
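The header problem and the word counts from this recipe can be reproduced on a toy example. The sketch below reads two made-up, header-less CSV strings, shows how the default read_csv call swallows the first row, combines the corrected DataFrames, and counts words. The counting step uses a naive lowercase whitespace split with collections.Counter as a stand-in. It is not the tokenizer that tokenize_df applies, so real token counts will differ.

```python
import io
from collections import Counter

import pandas as pd

# Two made-up, header-less CSV files standing in for train.csv and test.csv.
train_csv = "the cat sat on the mat\nthe dog barked\n"
test_csv = "a bird sang\n"

# Default behaviour: the first data row is consumed as the header.
df_wrong = pd.read_csv(io.StringIO(train_csv))
# header=None keeps every row as data and labels the columns 0, 1, ...
df_train = pd.read_csv(io.StringIO(train_csv), header=None)
df_test = pd.read_csv(io.StringIO(test_csv), header=None)

df_combined = pd.concat([df_train, df_test])

# Naive whitespace tokenizer; NOT the tokenizer that tokenize_df applies.
count = Counter()
for text in df_combined[df_combined.columns[0]]:
    count.update(text.lower().split())

print(len(df_wrong), len(df_train), len(df_combined))
print(count["the"], count["dog"], count["gaga"])
```

Looking up a word that never occurs, such as gaga in this toy data, returns a count of zero rather than raising an error, which makes spot checks like the ones in the last step convenient.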

How it works…

The dataset that you explored in this section, WIKITEXT_TINY, is one of the datasets you would have seen in the source for URLs in the Getting the complete set of oven-ready fastai datasets section. Here, you can see that WIKITEXT_TINY is in the NLP datasets section of the source for URLs:

Figure 2.15 – WIKITEXT_TINY in the NLP datasets list in the source for URLs

Examining image datasets with fastai

In the past two sections, we examined tabular and text datasets and got a taste of the facilities that fastai provides for accessing and exploring these datasets. In this section, we are going to look at image data. We are going to look at two datasets: the FLOWERS image classification dataset and the BIWI_HEAD_POSE image localization dataset.

Getting ready

Ensure you have followed the steps in Chapter 1, Getting Started with fastai, to get a fastai environment set up. Confirm that you can open the examining_image_datasets.ipynb notebook in the ch2 directory of your repository.

I am grateful for the opportunity to use the FLOWERS dataset featured in this section.

Dataset citation

Maria-Elena Nilsback, Andrew Zisserman. (2008). Automated flower classification over a large number of classes (https://www.robots.ox.ac.uk/~vgg/publications/papers/nilsback08.pdf).

I am grateful for the opportunity to use the BIWI_HEAD_POSE dataset featured in this section.

Dataset citation

Gabriele Fanelli, Thibaut Weise, Juergen Gall, Luc Van Gool. (2011). Real Time Head Pose Estimation from Consumer Depth Cameras (https://link.springer.com/chapter/10.1007/978-3-642-23123-0_11). Lecture Notes in Computer Science, vol 6835. Springer, Berlin, Heidelberg https://doi.org/10.1007/978-3-642-23123-0_11.

How to do it…

In this section, you will be running through the examining_image_datasets.ipynb notebook to examine the FLOWERS and BIWI_HEAD_POSE datasets.

Once you have the notebook open in your fastai environment, complete the following steps:

  1. Run the first two cells to import the necessary libraries and set up the notebook for fastai.
  2. Run the following cell to copy the FLOWERS dataset into your filesystem (if it's not already there) and to define the path for the dataset:
    path = untar_data(URLs.FLOWERS)
  3. Run the following cell to get the output of path.ls() so that you can examine the directory structure of the dataset:
    Figure 2.16 – Output of path.ls()

  4. Look at the contents of the valid.txt file. This indicates that train.txt, valid.txt, and test.txt contain lists of the image files that belong to each of these datasets:
    Figure 2.17 – The first few records of valid.txt

  5. Examine the jpg subdirectory:
    (path/'jpg').ls()
  6. Take a look at one of the image files. Note that the get_image_files() function doesn't need to be pointed to a particular subdirectory – it recursively collects all the image files in a directory and its subdirectories:
    img_files = get_image_files(path)
    img = PILImage.create(img_files[100])
    img
  7. You should have noticed that the image displayed in the previous step was the native size of the image, which makes it rather big for the notebook. To get the image at a more appropriate size, apply the to_thumb function with the image dimension specified as an argument. Note that you might see a different image when you run this cell:
    Figure 2.18 – Applying to_thumb to an image

  8. Now, ingest the BIWI_HEAD_POSE dataset:
    path = untar_data(URLs.BIWI_HEAD_POSE)
  9. Examine the path for this dataset:
    path.ls()
  10. Examine the 05 subdirectory:
    (path/"05").ls()
  11. Examine one of the images. Note that you may see a different image:
    Figure 2.19 – One of the images in the BIWI_HEAD_POSE dataset

  12. In addition to the image files, this dataset also includes text files that encode the pose depicted in the image. Ingest one of these text files into a pandas DataFrame and display it:
Figure 2.20 – The first few records of one of the position text files

In this section, you learned how to ingest two different kinds of image datasets, explore their directory structure, and examine images from the datasets.

How it works…

You used the same untar_data() function to ingest the curated tabular, text, and image datasets, and the same ls() function to examine the directory structures for all the different kinds of datasets. On top of these common facilities, fastai provides additional convenience functions for examining image data: get_image_files() to collect all the image files in a directory tree starting at a given directory, and to_thumb() to render the image at a size that is suitable for a notebook.
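To get a feel for what recursively collecting image files involves, here is a rough standard-library sketch. It is a stand-in written for illustration, not the implementation of get_image_files(): the collect_image_files name, the set of suffixes, and the file names are all assumptions.

```python
import tempfile
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}  # assumed subset of image types

def collect_image_files(root: Path) -> list:
    """Recursively gather files under root whose suffix looks like an image."""
    return sorted(p for p in root.rglob("*")
                  if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES)

# Made-up directory tree that mixes image files with a text file.
root = Path(tempfile.mkdtemp())
(root / "jpg").mkdir()
(root / "jpg" / "image_0001.jpg").touch()
(root / "jpg" / "image_0002.JPG").touch()
(root / "valid.txt").touch()

files = collect_image_files(root)
print([p.name for p in files])
```

As in the recipe, the starting directory is the dataset root rather than the jpg subdirectory, and the text file at the top level is skipped.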

There's more…

In addition to image classification datasets (where the goal of the trained model is to predict the category of what's displayed in the image) and image localization datasets (where the goal is to predict the location in the image of a given feature), the fastai curated datasets also include image segmentation datasets, where the goal is to identify the subsets of an image that contain a particular object. The CAMVID and CAMVID_TINY datasets are examples of image segmentation datasets.

Cleaning up raw datasets with fastai

Now that we have explored a variety of datasets that are curated by fastai, there is one more topic left to cover in this chapter: how to clean up datasets with fastai. Cleaning up datasets includes dealing with missing values and converting categorical values into numeric identifiers. We need to apply these cleanup steps to datasets because deep learning models can only be trained with numeric data. If we try to train the model with datasets that contain non-numeric data, including missing values and alphanumeric identifiers in categorical columns, the training process will fail. In this section, we are going to review the facilities provided by fastai to make it easy to clean up datasets, and thus make the datasets ready to train deep learning models.
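Before turning to the fastai facilities, it can help to see the two cleanup steps done by hand. The following sketch uses plain pandas on a made-up table: it fills a missing number with the column median and swaps category names for integer codes. It illustrates the idea only. It is not how fastai implements these steps, and the choice of the median, the #missing# placeholder, and the specific code values are assumptions of this sketch.

```python
import pandas as pd

# Hypothetical raw table with one missing number and one missing category.
df = pd.DataFrame({
    "hours": [40.0, None, 20.0],
    "workclass": ["Private", None, "State-gov"],
})

clean = df.copy()
# Continuous column: replace missing entries with the column median.
clean["hours"] = clean["hours"].fillna(clean["hours"].median())
# Categorical column: give missing entries their own category, then
# swap every category for an integer code.
clean["workclass"] = clean["workclass"].fillna("#missing#").astype("category")
codes = dict(enumerate(clean["workclass"].cat.categories))
clean["workclass"] = clean["workclass"].cat.codes

print(clean)
print(codes)  # mapping from integer code back to the original value
```

Keeping the codes mapping around is what makes it possible to translate the numeric identifiers back into readable values later.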

Getting ready

Ensure you have followed the steps in Chapter 1, Getting Started with fastai, to get a fastai environment set up. Confirm that you can open the cleaning_up_datasets.ipynb notebook in the ch2 directory of your repository.

How to do it…

In this section, you will be running through the cleaning_up_datasets.ipynb notebook to address missing values in the ADULT_SAMPLE dataset and replace categorical values with numeric identifiers.

Once you have the notebook open in your fastai environment, complete the following steps:

  1. Run the first two cells to import the necessary libraries and set up the notebook for fastai.
  2. Recall the Examining tabular datasets with fastai section of this chapter. When you checked which columns in the ADULT_SAMPLE dataset had missing values, you found that some columns did indeed have them. We are going to identify the columns in ADULT_SAMPLE that have missing values, use the facilities of fastai to apply transformations that deal with the missing values in those columns, and then replace the values in the categorical columns with numeric identifiers.
  3. First, let's ingest the ADULT_SAMPLE curated dataset again:
    path = untar_data(URLs.ADULT_SAMPLE)
  4. Now, create a pandas DataFrame for the dataset and check for the number of missing values in each column. Note which columns have missing values:
    df = pd.read_csv(path/'adult.csv')
    df.isnull().sum()
  5. To deal with these missing values (and prepare categorical columns), we will use the fastai TabularPandas class (https://docs.fast.ai/tabular.core.html#TabularPandas). To use this class, we need to prepare the following parameters:

    a) procs is the list of transformations that will be applied to the data held in the TabularPandas object. Here, we will specify that we want missing values to be filled (FillMissing) and that we want the values in categorical columns to be replaced with numeric identifiers (Categorify).

    b) dep_var specifies which column is the dependent variable; that is, the target that we want to ultimately predict with the model. In the case of ADULT_SAMPLE, the dependent variable is salary.

    c) cont and cat are the lists of continuous and categorical columns in the dataset, respectively. Continuous columns contain numeric values, such as integers or floating-point values. Categorical columns contain category identifiers, such as names of US states, days of the week, or colors. We use the cont_cat_split() (https://docs.fast.ai/tabular.core.html#cont_cat_split) function to automatically identify the continuous and categorical columns. Its second argument is the max_card threshold: an integer column is treated as continuous only if it has more distinct values than this threshold, so passing 1 treats every integer column with more than one distinct value as continuous:

    procs = [FillMissing, Categorify]
    dep_var = 'salary'
    cont, cat = cont_cat_split(df, max_card=1, dep_var=dep_var)
  6. Now, create a TabularPandas object called df_no_missing using these parameters. This object will contain the dataset with missing values replaced and the values in the categorical columns replaced with numeric identifiers:
    df_no_missing = TabularPandas(df, procs, cat_names=cat, cont_names=cont, y_names=dep_var)
  7. Apply the show API to df_no_missing to display samples of its contents. Note that show() displays the original values in the categorical columns rather than numeric identifiers. The numeric identifiers are there, and we'll see them in the next step:
    Figure 2.21 – The first few records of df_no_missing

  8. Now, display some sample contents of df_no_missing using the items.head() API. This time, the categorical columns contain the numeric identifiers rather than the original values. This illustrates a benefit of fastai: it keeps track of the mapping between the original categorical values and the numeric identifiers for you. The show() API transforms the numeric identifiers in categorical columns back into their original values for display, while the items.head() API shows the numeric identifiers that are actually stored:
    Figure 2.22 – The first few records of df_no_missing with numeric identifiers in categorical columns

  9. Finally, let's confirm that the missing values were handled correctly. As you can see, the two columns that originally had missing values no longer have missing values in df_no_missing:
Figure 2.23 – Missing values in df_no_missing
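To build some intuition for what the two transformations in procs do, here is a rough pandas-only approximation. It is not fastai's implementation: the helper names fill_missing and categorify, the toy data, and the column names are all hypothetical. It assumes the default FillMissing behavior (fill with the median and add a flag column ending in _na) and the Categorify convention of reserving identifier 0 for missing categories.

```python
import numpy as np
import pandas as pd

# Toy data; column names and values are made up for illustration.
toy = pd.DataFrame({
    "education-num": [13.0, np.nan, 9.0, 10.0],
    "occupation": ["Sales", None, "Tech-support", "Sales"],
})

def fill_missing(df, col):
    """Median-fill a continuous column and record where values were missing."""
    out = df.copy()
    out[col + "_na"] = out[col].isna()              # flag column
    out[col] = out[col].fillna(out[col].median())   # median ignores NaN
    return out

def categorify(df, col):
    """Replace category labels with integer codes, keeping 0 for missing."""
    out = df.copy()
    cat = pd.Categorical(out[col])
    out[col] = cat.codes + 1                        # pandas uses -1 for missing
    return out, ["#na#"] + list(cat.categories)

cleaned = fill_missing(toy, "education-num")
cleaned, vocab = categorify(cleaned, "occupation")
print(cleaned)
print(vocab)   # ['#na#', 'Sales', 'Tech-support']
```

After both helpers run, the toy frame contains no missing values and no text, which is the state that df_no_missing.items is in at the end of the recipe.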

By following these steps, you have seen how fastai makes it easy to prepare a dataset to train a deep learning model. It does this by replacing missing values and converting the values in the categorical columns into numeric identifiers.
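The difference between the show() and items.head() views in steps 7 and 8 comes down to a lookup from numeric identifier to original label. The following sketch illustrates that idea with plain pandas. The vocabulary and the codes are hypothetical, and fastai's internal bookkeeping is more involved than this.

```python
import pandas as pd

# Hypothetical vocabulary for one categorical column; index 0 stands for "missing".
vocab = ["#na#", "Private", "Self-emp", "State-gov"]

# The numeric identifiers that would be stored for four rows.
codes = pd.Series([1, 3, 0, 2], name="workclass")

# Map each identifier back to its label for display.
labels = codes.map(lambda code: vocab[code])
print(labels.tolist())   # ['Private', 'State-gov', '#na#', 'Self-emp']
```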

How it works…

In this section, you saw several ways that fastai simplifies common data preparation steps. The TabularPandas class applies the transformations needed to prepare a tabular dataset, including replacing missing values and converting categorical columns into numeric identifiers. The cont_cat_split() function automatically identifies the continuous and categorical columns in your dataset. Together, they make the cleanup process easier and less error-prone than it would be if you had to hand-code every step.
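The rule that cont_cat_split() applies can be approximated in a few lines of pandas. The function below, split_cont_cat, is a hypothetical re-implementation for illustration only, and it handles a single dependent variable name. It assumes the rule that float columns are always continuous, integer columns are continuous only when they have more than max_card distinct values, and everything else is categorical.

```python
import pandas as pd

def split_cont_cat(df, max_card=20, dep_var=None):
    """Approximate rule: floats, and integers with more than max_card
    distinct values, are continuous; every other column is categorical."""
    cont_names, cat_names = [], []
    for name in df.columns:
        if name == dep_var:
            continue  # the target column belongs to neither list
        col = df[name]
        is_float = pd.api.types.is_float_dtype(col.dtype)
        is_wide_int = (pd.api.types.is_integer_dtype(col.dtype)
                       and col.nunique() > max_card)
        if is_float or is_wide_int:
            cont_names.append(name)
        else:
            cat_names.append(name)
    return cont_names, cat_names

# Toy data with made-up values.
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "hours": [40.0, 13.0, 40.0],
    "children": [2, 2, 2],
    "workclass": ["Private", "State-gov", "Private"],
    "salary": ["<50k", ">=50k", "<50k"],
})

cont, cat = split_cont_cat(toy, max_card=1, dep_var="salary")
print(cont)   # ['age', 'hours']
print(cat)    # ['children', 'workclass']
```

With max_card=1, as in the recipe, the integer column age counts as continuous because it has more than one distinct value. With a larger threshold such as 20, the same column would be treated as categorical in this toy example.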


Key benefits

  • Discover how to apply state-of-the-art deep learning techniques to real-world problems
  • Build and train neural networks using the power and flexibility of the fastai framework
  • Use deep learning to tackle problems such as image classification and text classification

Description

fastai is an easy-to-use deep learning framework built on top of PyTorch that lets you rapidly create complete deep learning solutions with as few as 10 lines of code. Both predominant low-level deep learning frameworks, TensorFlow and PyTorch, require a lot of code, even for straightforward applications. In contrast, fastai handles the messy details for you and lets you focus on applying deep learning to actually solve problems. The book begins by summarizing the value of fastai and showing you how to create a simple 'hello world' deep learning application with fastai. You'll then learn how to use fastai for all four application areas that the framework explicitly supports: tabular data, text data (NLP), recommender systems, and vision data. As you advance, you'll work through a series of practical examples that illustrate how to create real-world applications of each type. Next, you'll learn how to deploy fastai models, including creating a simple web application that predicts what object is depicted in an image. The book wraps up with an overview of the advanced features of fastai. By the end of this fastai book, you'll be able to create your own deep learning applications using fastai. You'll also have learned how to use fastai to prepare raw datasets, explore datasets, train deep learning models, and deploy trained models.

Who is this book for?

This book is for data scientists, machine learning developers, and deep learning enthusiasts looking to explore the fastai framework using a recipe-based approach. Working knowledge of the Python programming language and machine learning basics is strongly recommended to get the most out of this deep learning book.

What you will learn

  • Prepare real-world raw datasets to train fastai deep learning models
  • Train fastai deep learning models using text and tabular data
  • Create recommender systems with fastai
  • Find out how to assess whether fastai is a good fit for a given problem
  • Deploy fastai deep learning models in web applications
  • Train fastai deep learning models for image classification

Product Details

Publication date : Sep 24, 2021
Length: 340 pages
Edition : 1st
Language : English
ISBN-13 : 9781800208100

Table of Contents

9 Chapters
Chapter 1: Getting Started with fastai
Chapter 2: Exploring and Cleaning Up Data with fastai
Chapter 3: Training Models with Tabular Data
Chapter 4: Training Models with Text Data
Chapter 5: Training Recommender Systems
Chapter 6: Training Models with Visual Data
Chapter 7: Deployment and Model Maintenance
Chapter 8: Extended fastai and Deployment Features
Other Books You May Enjoy

Customer reviews

Rating: 4.5 out of 5 (15 ratings)
5 star 60%
4 star 33.3%
3 star 6.7%
2 star 0%
1 star 0%

Richa Sethi Oct 29, 2021
Rating: 5 out of 5
As the name of the book describes, this book is actually a cookbook where it starts from the very basic ingredients (setting up the environment) needed for fastai, and eventually builds up the pace by getting deep into data cleaning and exploration. It then delves into training tabular data, text data and recommender systems, and finally gets into model deployment and maintenance. Great for beginners as well as individuals like me who have used fastai for certain applications before. This book should allow anyone apply fastai to create end-to-end deep learning models to make predictions on a wide variety of datasets.
Amazon verified review

Daniel Armstrong Oct 30, 2021
Rating: 5 out of 5
This book is filled with great examples of how you can apply fast.ai to a wide range of deep learning projects. I may be a little biased because I am a huge fan of fast.ai, but I really enjoyed this book. I am glad to have it as part of my library. It not only covers the major deep learning topics like CV, NLP, and Tabular data, but it also covers other topics like using callbacks, memory management, and model deployment. The only part that I was a little disappointed about was the section on object detection. The author did a great job covering the subject, but the fast.ai library doesn't offer an easy way to do object detection out of the box. The good news is there is a library called Ice Vision, built and maintained by a group of former fast.ai students/alumni on top of fast.ai, that does a great job making object detection possible and as easy as most things you can do in fast.ai.
Amazon verified review

Guangping zhang Sep 24, 2021
Rating: 5 out of 5
Finally I read a book for FastAI, "Deep Learning with fastai Cookbook". First this book discusses how to setting up a fastai environment in Google Colab, then introduce the applications: tables, text, recommender systems, and images. After discussing data cleaning, which is a necessary step using fastai, this book discuss the four types of application of Fastai chapter by chapter, all the application share same training and predict api, fit_one_cycle and predict, for transfer learning, you can use fine_tune() replace fit_one_cycle. Several steps for model web deployment and advanced deployment, including export(), load_learner() and web_flask_deploy and/or web_flask_deploy_image_mode() were discussed using two chapter in this book. This book uses clean logic and easy understanding language to introduce fastai to readers, I suggest you read this book if you plan to use fastai.
Amazon verified review

Erfan Chowdhury Nov 30, 2021
Rating: 5 out of 5
This book is perfect for people who wants to utilize the highly optimized Fastai library. Jeremy Howard, the creator of Fastai, has a free of charge online series where he demonstrates and teaches how to use the library. He even published a book "Deep Learning for Coders with Fastai and PyTorch: AI Applications Without a PhD" to follow along with the course. However, many topics were just brushed over in these videos and the book and the task was left for the user to go and tinker around. This is time consuming unfortunately. But here is the good part! This is the book that basically spoon feeds the Fastai library to you. This book contains in-depth explanations to topics that were not deeply covered in the series or the book. All the chapters in the book comes with their own datasets to test out what you have learnt. Starting from cleaning the data to solving all different kinds of structured and unstructured problems that you might face in the field of deep learning, this book teaches you how to effectively use the Fastai toolkit. This book also demonstrates how you can deploy the models that you trained. The "Test Your Knowledge" section at the end of every chapter is like a brain teaser that strengthens your conceptualization of that particular chapter. The book is well written and well organized. I thoroughly enjoyed the book. Highly Recommended!
Amazon verified review

Carlo Nov 29, 2021
Rating: 5 out of 5
This book provides an excellent starting point to anyone interested in building deep learning applications. It effectively shows how to use fastai, a powerful and intuitive deep learning framework. The book provides detailed instructions and tutorials to quickly dive into deep learning examples. A very nice aspect is the step-by-step guidance in setting up the right working environment to run the examples. It is a perfect book for practitioners and machine learning beginners that want to get started as quickly as possible. However, it doesn't provide much theoretical background on machine learning and how models work, so should definitely be accompanied by other resources for practitioners with no background on machine learning who want to understand the inner workings of ML models. Overall it's a useful book especially recommended to anyone who wants to use the fast.ai framework to build powerful deep learning models quickly.
Amazon verified review