Applied Unsupervised Learning with Python


Chapter 9: Hotspot Analysis


Activity 21: Estimating Density in One Dimension

Solution:

  1. Open a new notebook and import all the necessary libraries.

    %matplotlib inline
    
    import matplotlib.pyplot as plt
    import numpy
    import pandas
    import seaborn
    import sklearn.datasets
    import sklearn.model_selection
    import sklearn.neighbors
    
    seaborn.set()
  2. Sample 1,000 data points from the standard normal distribution, then add 3.5 to each of the last 625 values (that is, indices 375 through 999). To do this, seed numpy.random.RandomState with 100 to guarantee reproducible samples, and then generate the data points with the randn(1000) call:

    rand = numpy.random.RandomState(100)
    vals = rand.randn(1000)  # standard normal
    vals[375:] += 3.5        # shift the last 625 points to create a second mode
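
    As a quick sanity check (an addition, not part of the original activity), you can confirm that the two halves of the sample sit where we expect:

    # The first 375 points should average near 0, the shifted points near 3.5
    print("First component mean:  {:.3f}".format(vals[:375].mean()))
    print("Second component mean: {:.3f}".format(vals[375:].mean()))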
  3. Plot the 1,000-point sample data as a histogram and add a scatterplot below it:

    fig, ax = plt.subplots(figsize=(14, 10))
    ax.hist(vals, bins=50, density=True, label='Sampled Values')
    # Jittered rug of the raw points, drawn just below the x-axis
    ax.plot(vals, -0.005 - 0.01 * numpy.random.random(len(vals)), '+k', label='Individual Points')
    ax.legend(loc='upper right')

    The output is as follows:

    Figure 9.29: A histogram of the random sample with a scatterplot underneath

  4. Define a grid of bandwidth values. Then, define and fit a grid search cross-validation algorithm:

    # 100 candidate bandwidths, log-spaced between 0.1 and 10
    bandwidths = 10 ** numpy.linspace(-1, 1, 100)
    
    grid = sklearn.model_selection.GridSearchCV(
        estimator=sklearn.neighbors.KernelDensity(kernel="gaussian"),
        param_grid={"bandwidth": bandwidths},
        cv=10
    )
    # KernelDensity expects a 2-D array, so reshape vals to (n_samples, 1)
    grid.fit(vals[:, None])
  5. Extract the optimal bandwidth value:

    best_bandwidth = grid.best_params_["bandwidth"]
    
    print(
        "Best Bandwidth Value: {}"
        .format(best_bandwidth)
    )
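
    For intuition, here is a minimal hand-rolled sketch of what the grid search is doing (illustrative only; GridSearchCV remains the recommended approach). For each candidate bandwidth, KernelDensity.score returns the total log-likelihood of the held-out fold, and the bandwidth with the highest mean held-out log-likelihood wins:

    def cv_log_likelihood(bandwidth, data, n_folds=10):
        """Mean held-out log-likelihood for a single bandwidth."""
        scores = []
        for train_idx, test_idx in sklearn.model_selection.KFold(n_splits=n_folds).split(data):
            kde = sklearn.neighbors.KernelDensity(
                kernel="gaussian", bandwidth=bandwidth
            ).fit(data[train_idx])
            scores.append(kde.score(data[test_idx]))  # total log-likelihood
        return numpy.mean(scores)

    manual_best = max(bandwidths, key=lambda b: cv_log_likelihood(b, vals[:, None]))
    print("Manual best bandwidth: {}".format(manual_best))

    This should recover the same (or a very similar) value as grid.best_params_.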
  6. Replot the histogram from Step 3 and overlay the estimated density:

    fig, ax = plt.subplots(figsize=(14, 10))
    
    ax.hist(vals, bins=50, density=True, alpha=0.75, label='Sampled Values')
    
    x_vec = numpy.linspace(-4, 8, 10000)[:, numpy.newaxis]
    # score_samples returns the log-density; exponentiate to recover the density itself
    density = numpy.exp(grid.best_estimator_.score_samples(x_vec))
    ax.plot(
        x_vec[:, 0], density,
        '-', linewidth=4, label='Kernel = Gaussian'
    )
    
    ax.legend(loc='upper right')

    The output is as follows:

    Figure 9.30: A histogram of the random sample with the optimal estimated density overlaid
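
    As a worked illustration (an addition, not part of the original activity), the Gaussian kernel density estimate has a simple closed form, f(x) = (1/nh) * sum_i K((x - x_i) / h), which you can write out directly in numpy and check against the fitted sklearn model:

    def gaussian_kde(x_query, samples, h):
        """Evaluate a Gaussian KDE with bandwidth h at each point of x_query."""
        u = (x_query[:, None] - samples[None, :]) / h   # shape (n_query, n_samples)
        kernel = numpy.exp(-0.5 * u ** 2) / numpy.sqrt(2 * numpy.pi)
        return kernel.mean(axis=1) / h

    check_points = numpy.linspace(-4, 8, 500)
    manual = gaussian_kde(check_points, vals, best_bandwidth)
    from_sklearn = numpy.exp(
        grid.best_estimator_.score_samples(check_points[:, None])
    )
    print(numpy.allclose(manual, from_sklearn))  # should print True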

Activity 22: Analyzing Crime in London

Solution:

  1. Load the crime data. Using the path where you saved the downloaded directory, create a list of the year-month tags, load the individual files iteratively with the read_csv function, and then concatenate them into one DataFrame:

    base_path = (
        "~/Documents/packt/unsupervised-learning-python/"
        "lesson-9-hotspot-models/metro-jul18-dec18/"
        "{yr_mon}/{yr_mon}-metropolitan-street.csv"
    )
    
    print(base_path)
    
    # Build the six year-month tags: 2018-07 through 2018-12
    yearmon_list = ["2018-{:02d}".format(i) for i in range(7, 13)]
    
    print(yearmon_list)
    
    data_yearmon_list = []
    
    for idx, yearmon in enumerate(yearmon_list):
        df = pandas.read_csv(
            base_path.format(yr_mon=yearmon), 
            header=0
        )
        
        data_yearmon_list.append(df)
        
        # Print diagnostics for the first file only
        if idx == 0:
            print("Month: {}".format(yearmon))
            print("Dimensions: {}".format(df.shape))
            print("Head:\n{}\n".format(df.head(2)))
    
    london = pandas.concat(data_yearmon_list)

    The output is as follows:

    Figure 9.31: An example of one of the individual crime files

    This printed information is for the first of the loaded files only: the crime data reported by the Metropolitan Police Service for July 2018. This single file has nearly 100,000 entries. There is a great deal of interesting information in this dataset, but we will focus on Longitude, Latitude, Month, and Crime type.

  2. Print diagnostics of the complete, concatenated (six-month) dataset:

    print(
        "Dimensions - Full Data:\n{}\n"
        .format(london.shape)
    )
    print(
        "Unique Months - Full Data:\n{}\n"
        .format(london["Month"].unique())
    )
    print(
        "Number of Unique Crime Types - Full Data:\n{}\n"
        .format(london["Crime type"].nunique())
    )
    print(
        "Unique Crime Types - Full Data:\n{}\n"
        .format(london["Crime type"].unique())
    )
    print(
        "Count Occurrences Of Each Unique Crime Type - Full Type:\n{}\n"
        .format(london["Crime type"].value_counts())
    )

    The output is as follows:

    Figure 9.32: Descriptors of the full crime dataset

  3. Subset the DataFrame down to four variables (Longitude, Latitude, Month, and Crime type):

    london_subset = london[["Month", "Longitude", "Latitude", "Crime type"]]
    london_subset.head(5)

    The output is as follows:

    Figure 9.33: Crime data in DataFrame form subset down to the Longitude, Latitude, Month, and Crime type columns
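
    One caveat worth checking before fitting any densities (this cleanup step is my own addition, not part of the original solution): the police.uk extracts can contain rows with missing coordinates, which the KDE cannot use. Dropping them keeps the fits clean:

    # Hypothetical cleanup step: discard rows without usable coordinates
    print("Rows before: {}".format(london_subset.shape[0]))
    london_subset = london_subset.dropna(subset=["Longitude", "Latitude"])
    print("Rows after:  {}".format(london_subset.shape[0]))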

  4. Using the jointplot function from seaborn, fit and visualize three kernel density estimation models for bicycle theft in July, September, and December 2018:

    crime_bicycle_jul = london_subset[
        (london_subset["Crime type"] == "Bicycle theft") & 
        (london_subset["Month"] == "2018-07")
    ]
    
    seaborn.jointplot(x="Longitude", y="Latitude", data=crime_bicycle_jul, kind="kde")

    The output is as follows:

    Figure 9.34: The estimated joint and marginal densities for bicycle thefts in July 2018

    crime_bicycle_sept = london_subset[
        (london_subset["Crime type"] == "Bicycle theft") & 
        (london_subset["Month"] == "2018-09")
    ]
    
    seaborn.jointplot(x="Longitude", y="Latitude", data=crime_bicycle_sept, kind="kde")

    The output is as follows:

    Figure 9.35: The estimated joint and marginal densities for bicycle thefts in September 2018

    crime_bicycle_dec = london_subset[
        (london_subset["Crime type"] == "Bicycle theft") & 
        (london_subset["Month"] == "2018-12")
    ]
    
    seaborn.jointplot(x="Longitude", y="Latitude", data=crime_bicycle_dec, kind="kde")

    The output is as follows:

    Figure 9.36: The estimated joint and marginal densities for bicycle thefts in December 2018

    From month to month, the density of bicycle thefts stays quite constant. There are slight differences between the densities, which is to be expected given that the data underlying these estimates consists of three separate one-month samples. Given these results, police or criminologists can be reasonably confident in predicting where future bicycle thefts are most likely to occur.
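
    If you want a queryable model rather than a plot, the same sklearn.neighbors.KernelDensity estimator from Activity 21 extends to spatial data. A common choice for latitude/longitude data (a sketch under my own assumptions, not part of the original solution) is the haversine metric on coordinates converted to radians:

    # Sketch: a reusable hotspot model for July's bicycle thefts.
    # With metric="haversine", KernelDensity expects [latitude, longitude]
    # in radians, and the bandwidth is an angle in radians as well.
    coords = numpy.radians(
        crime_bicycle_jul[["Latitude", "Longitude"]].dropna().values
    )
    kde_geo = sklearn.neighbors.KernelDensity(
        kernel="gaussian", bandwidth=0.0005,  # roughly 3 km; chosen by eye
        metric="haversine"
    ).fit(coords)

    # Log-density at a query point (here, the sample's centroid)
    print(kde_geo.score_samples(coords.mean(axis=0, keepdims=True)))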

  5. Repeat Step 4; this time, use shoplifting crimes for the months of August, October, and November 2018:

    crime_shoplift_aug = london_subset[
        (london_subset["Crime type"] == "Shoplifting") & 
        (london_subset["Month"] == "2018-08")
    ]
    
    seaborn.jointplot(x="Longitude", y="Latitude", data=crime_shoplift_aug, kind="kde")

    The output is as follows:

    Figure 9.37: The estimated joint and marginal densities for shoplifting incidents in August 2018

    crime_shoplift_oct = london_subset[
        (london_subset["Crime type"] == "Shoplifting") & 
        (london_subset["Month"] == "2018-10")
    ]
    
    seaborn.jointplot(x="Longitude", y="Latitude", data=crime_shoplift_oct, kind="kde")

    The output is as follows:

    Figure 9.38: The estimated joint and marginal densities for shoplifting incidents in October 2018

    crime_shoplift_nov = london_subset[
        (london_subset["Crime type"] == "Shoplifting") & 
        (london_subset["Month"] == "2018-11")
    ]
    
    seaborn.jointplot(x="Longitude", y="Latitude", data=crime_shoplift_nov, kind="kde")

    The output is as follows:

    Figure 9.39: The estimated joint and marginal densities for shoplifting incidents in November 2018

    Like the bicycle theft results, the shoplifting densities are quite stable across the months. The August 2018 density looks different from the other two; however, if you inspect the longitude and latitude values, you will notice that the density itself is very similar and the plot has simply been shifted and rescaled. The likely cause is a handful of outliers that forced a much larger plotting region; a quick way to confirm this is sketched below.
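
    To confirm that suspicion (a check of my own, not in the original text), you can trim the outermost coordinates before fitting, so that all three months are drawn over a comparable region:

    # Hypothetical check: drop the outermost 1% of coordinates so the
    # August plot covers roughly the same region as October and November
    lon_lo, lon_hi = crime_shoplift_aug["Longitude"].quantile([0.005, 0.995])
    lat_lo, lat_hi = crime_shoplift_aug["Latitude"].quantile([0.005, 0.995])

    trimmed = crime_shoplift_aug[
        crime_shoplift_aug["Longitude"].between(lon_lo, lon_hi) &
        crime_shoplift_aug["Latitude"].between(lat_lo, lat_hi)
    ]
    seaborn.jointplot(x="Longitude", y="Latitude", data=trimmed, kind="kde")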

  6. Repeat Step 5; this time use burglary crimes for the months of July, October, and December 2018:

    crime_burglary_jul = london_subset[
        (london_subset["Crime type"] == "Burglary") & 
        (london_subset["Month"] == "2018-07")
    ]
    
    seaborn.jointplot(x="Longitude", y="Latitude", data=crime_burglary_jul, kind="kde")

    The output is as follows:

    Figure 9.40: The estimated joint and marginal densities for burglaries in July 2018

    crime_burglary_oct = london_subset[
        (london_subset["Crime type"] == "Burglary") & 
        (london_subset["Month"] == "2018-10")
    ]
    
    seaborn.jointplot(x="Longitude", y="Latitude", data=crime_burglary_oct, kind="kde")

    The output is as follows:

    Figure 9.41: The estimated joint and marginal densities for burglaries in October 2018

    crime_burglary_dec = london_subset[
        (london_subset["Crime type"] == "Burglary") & 
        (london_subset["Month"] == "2018-12")
    ]
    
    seaborn.jointplot(x="Longitude", y="Latitude", data=crime_burglary_dec, kind="kde")

    The output is as follows:

    Figure 9.42: The estimated joint and marginal densities for burglaries in December 2018

    Once again, the distributions are quite similar across the months, although the densities do seem to widen or spread from July to December. As always, the noise and the inherent lack of information in the sample data cause small shifts in the estimated densities; the sketch below makes this comparison concrete.
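
    To make the "spreading" observation concrete (an illustrative sketch, not part of the original solution), you can fit a KernelDensity model to two of the months, evaluate both over a common grid, and map where the density changed the most:

    def fit_kde(df, bandwidth=0.01):
        """Fit a 2-D Gaussian KDE on (Longitude, Latitude); the bandwidth is a guess."""
        return sklearn.neighbors.KernelDensity(bandwidth=bandwidth).fit(
            df[["Longitude", "Latitude"]].dropna().values
        )

    kde_jul = fit_kde(crime_burglary_jul)
    kde_dec = fit_kde(crime_burglary_dec)

    # Common evaluation grid over (approximately) Greater London
    lon_g, lat_g = numpy.meshgrid(
        numpy.linspace(-0.5, 0.3, 100), numpy.linspace(51.3, 51.7, 100)
    )
    grid_pts = numpy.column_stack([lon_g.ravel(), lat_g.ravel()])

    diff = (numpy.exp(kde_dec.score_samples(grid_pts)) -
            numpy.exp(kde_jul.score_samples(grid_pts)))

    plt.figure(figsize=(10, 8))
    plt.pcolormesh(lon_g, lat_g, diff.reshape(lon_g.shape), cmap="RdBu_r")
    plt.colorbar(label="December minus July density")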
