Clustering

Clustering is the process of finding groups of similar data points within your dataset, which can be extremely valuable when you are trying to uncover its underlying structure. Suppose you are a store owner who wants to understand which customers are the most valuable, but you have no fixed definition of what "valuable" means. You may have a few high-level ideas of what makes a customer valuable, yet in the face of a mountain of available data you aren't entirely sure. Clustering is a great place to start finding patterns: it surfaces commonalities within groups of similar records. For example, looking more closely at one cluster of similar customers, you may learn that everyone in that group spends longer on your website than the rest. This both shows you where the value lies and provides a clean sample for future supervised learning experiments.
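
To make this concrete, here is a minimal sketch (not taken from the book) of clustering hypothetical customer data with k-means; the feature names and values are invented purely for illustration:

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical features for six customers:
    # [minutes on site per visit, purchases per month]
    customers = np.array([
        [2.5, 0], [3.0, 1], [2.0, 0],      # short visits, few purchases
        [25.0, 4], [30.0, 5], [28.0, 6],   # long visits, frequent purchases
    ])

    # Group the customers into two clusters by similarity
    model = KMeans(n_clusters=2, random_state=0, n_init=10)
    labels = model.fit_predict(customers)
    print(labels)  # e.g. [0 0 0 1 1 1]: one group looks like higher-value customers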

Identifying Clusters

The following image shows two scatterplots:

Figure 1.2: Two distinct scatterplots

The following image separates the two scatterplots into two distinct clusters:

Figure 1.3: Scatterplots clearly showing clusters that exist in a provided dataset

Figure 1.2 and Figure 1.3 display randomly generated pairs of numbers (x and y coordinates) drawn from two distinct Gaussian distributions centered at different locations. Simply by glancing at the first image, you can see where the clusters lie in the data; in real life, it will never be this easy. Now that you know the data can be cleanly separated into two clusters, you can start to explore what differences exist between the two groups.
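
For reference, data like this can be produced with a few lines of NumPy; the following is an assumed sketch (not the book's exact code) that draws two Gaussian blobs centered at different locations and plots them:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    cluster_a = rng.normal(loc=[0, 0], scale=1.0, size=(100, 2))    # blob centered at (0, 0)
    cluster_b = rng.normal(loc=[10, 10], scale=1.0, size=(100, 2))  # blob centered at (10, 10)

    plt.scatter(cluster_a[:, 0], cluster_a[:, 1], label="group 1")
    plt.scatter(cluster_b[:, 0], cluster_b[:, 1], label="group 2")
    plt.legend()
    plt.show()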

Stepping back from where unsupervised learning fits into the larger machine learning landscape, let's begin with the building blocks of clustering. At its most basic, a cluster is simply a grouping of similar data points, a subset of a larger dataset. As an example, imagine a room with 10 people in it, each of whom works either in finance or as a scientist. If you told all the financial workers to stand together and all the scientists to do the same, you would have effectively formed two clusters based on job type. Finding clusters is immensely valuable for identifying items that are more similar to one another and, at the other end of the scale, items that are quite different from one another.

Two-Dimensional Data

To understand this, imagine that you were given a simple 1,000-row dataset by your employer that had two columns of numerical data, as follows:

Figure 1.4: Two-dimensional raw data in an array

At first glance, this dataset reveals no real structure or insight.

A dimension of a dataset is simply another way of counting the number of available features. In most organized data tables, you can read the number of features off the number of columns. So, for the 1,000-row example dataset of size (1,000 x 2), you have 1,000 observations across two dimensions. Note that the dimensions of a dataset in this sense should not be confused with the number of dimensions (axes) of an array.
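
As a quick illustrative check (using made-up data), the number of dimensions in this sense is the number of feature columns, which is separate from the array's own dimensionality:

    import numpy as np

    data = np.random.rand(1000, 2)   # stand-in for a 1,000 x 2 dataset
    print(data.shape)                # (1000, 2): 1,000 observations, 2 features
    print(data.shape[1])             # 2 -> the dataset has two dimensions (features)
    print(data.ndim)                 # 2 -> the array's number of axes, a separate concept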

You begin by plotting the first column against the second column to get a better idea of how the data is structured. There will be plenty of times when the cause of the differences between groups proves underwhelming; however, the cases where the differences are ones you can act on are extremely rewarding.
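
A minimal plotting sketch (again assumed, not the book's exact code) might look like this, scattering the first column against the second:

    import numpy as np
    import matplotlib.pyplot as plt

    data = np.random.rand(1000, 2)   # stand-in for the employer's 1,000 x 2 dataset
    plt.scatter(data[:, 0], data[:, 1], s=10)
    plt.xlabel("Feature 1")
    plt.ylabel("Feature 2")
    plt.show()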

Exercise 1.01: Identifying Clusters in Data

You are given two-dimensional plots of data that you suspect contain clusters of similar points. Look at the two-dimensional graphs provided in this exercise and identify the groups of data points by eye; doing this manually helps drive home why machine learning is important. Without using any algorithmic approach, identify where these clusters exist in the data.

This exercise will help you start to build an intuition for how we identify clusters with our own eyes and thought processes. As you complete it, think about the rationale for why one group of data points should be considered a cluster while another should not. Follow these steps to complete this exercise:

  1. Identify the clusters in the following scatterplot:
    Figure 1.5: Two-dimensional scatterplot

    The clusters are as follows:

    Figure 1.6: Clusters in the scatterplot

  2. Identify the clusters in the following scatterplot:
    Figure 1.7: Two-dimensional scatterplot

    The clusters are as follows:

    Figure 1.8: Clusters in the scatterplot

  3. Identify the clusters in the following scatterplot:
    Figure 1.9: Two-dimensional scatterplot

    The clusters are as follows:

    Figure 1.10: Clusters in the scatterplot

Most of these examples were likely quite easy for you to understand, and that's the point. The human brain and eyes are incredible at finding patterns in the real world. Within milliseconds of viewing each plot, you could tell what fitted together and what didn't. While it is easy for you, a computer does not have the ability to see and process plots in the same manner that we do.

However, this is not always a bad thing. Look back at the preceding scatterplot. Were you able to find the six discrete clusters in the data just by looking at the plot? You probably found only three to four clusters in this scatterplot, while a computer would be able to see all six. The human brain is magnificent, but it also lacks the nuances that come with a strictly logic-based approach. Through algorithmic clustering, you will learn how to build a model that works even better than a human at these tasks.
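
As a rough sketch of what such an algorithmic approach might look like (not the book's code), k-means applied to synthetic data with six groups can recover all six clusters even when they are hard to separate by eye:

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    # Synthetic two-dimensional data with six true groups
    X, _ = make_blobs(n_samples=600, centers=6, cluster_std=1.5, random_state=0)

    # k-means assigns each point to one of six clusters
    labels = KMeans(n_clusters=6, random_state=0, n_init=10).fit_predict(X)

    plt.scatter(X[:, 0], X[:, 1], c=labels, s=10)
    plt.title("Six clusters recovered by k-means")
    plt.show()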

We'll look at clustering algorithms in the next section.
