Data Science for Marketing Analytics
Achieve your marketing goals with the data analytics power of Python

Product type: Paperback
Published: Mar 2019
ISBN-13: 9781789959413
Length: 420 pages
Edition: 1st Edition

Authors (3): Tommy Blanchard, Debasish Behera, Pranshu Bhatnagar
Table of Contents (12)

Data Science for Marketing Analytics
Preface
1. Data Preparation and Cleaning (Free Chapter)
2. Data Exploration and Visualization
3. Unsupervised Learning: Customer Segmentation
4. Choosing the Best Segmentation Approach
5. Predicting Customer Revenue Using Linear Regression
6. Other Regression Techniques and Tools for Evaluation
7. Supervised Learning: Predicting Customer Churn
8. Fine-Tuning Classification Algorithms
9. Modeling Customer Choice
Appendix

Chapter 9: Modeling Customer Choice


Activity 18: Performing Multiclass Classification and Evaluating Performance

  1. Import pandas, numpy, RandomForestClassifier, train_test_split, classification_report, confusion_matrix, accuracy_score, metrics, seaborn, matplotlib, and precision_recall_fscore_support:

    import pandas as pd
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
    from sklearn import metrics
    from sklearn.metrics import precision_recall_fscore_support
    import matplotlib.pyplot as plt
    import seaborn as sns
  2. Load the marketing data using pandas:

    data = pd.read_csv(r'MarketingData.csv')
    data.head(5)
  3. Check the shape and missing values of the data, and display its summary report:

    data.shape

    The shape should be (20000, 7). Check for missing values:

    data.isnull().values.any()

    This will return False as there are no null values in the data. See the summary report of the data using the describe function:

    data.describe()
  4. Check the number of transactions for each channel in the target variable, Channel:

    data['Channel'].value_counts()
  5. Split the data into training and testing sets (a quick check of what stratify=y does appears after this activity):

    target = 'Channel'
    X = data.drop(['Channel'], axis=1)
    y = data[target]
    X_train, X_test, y_train, y_test = train_test_split(X.values, y, test_size=0.20, random_state=123, stratify=y)
  6. Fit a random forest classifier and store the model in a clf_random variable:

    clf_random = RandomForestClassifier(n_estimators=20, max_depth=None,
                                        min_samples_split=7, random_state=0)
    clf_random.fit(X_train,y_train)
  7. Predict on the test data and store the predictions in y_pred:

    y_pred = clf_random.predict(X_test)
  8. Compute the micro- and macro-averaged precision, recall, and F1 scores:

    precision_recall_fscore_support(y_test, y_pred, average='macro')
    precision_recall_fscore_support(y_test, y_pred, average='micro')

    You will get approximately the following values as output for the macro- and micro-averages respectively: 0.891, 0.891, 0.891, None and 0.891, 0.891, 0.891, None. (A short sketch after these steps shows how the two averages are derived from the per-class scores.)

  9. Print the classification report:

    target_names = ["Retail","RoadShow","SocialMedia","Televison"]
    print(classification_report(y_test, y_pred,target_names=target_names))
  10. Plot the confusion matrix:

    cm = confusion_matrix(y_test, y_pred)
    cm_df = pd.DataFrame(cm,
                         index = target_names, 
                         columns = target_names)
    plt.figure(figsize=(8,6))
    sns.heatmap(cm_df, annot=True, fmt='g', cmap='Blues')
    plt.title('Random Forest \nAccuracy:{0:.3f}'.format(accuracy_score(y_test, y_pred)))
    plt.ylabel('True Values')
    plt.xlabel('Predicted Values')
    plt.show()
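
    As promised in step 8, here is a minimal sketch (reusing y_test, y_pred, and the imports from step 1) of where those averages come from. The macro-average is the unweighted mean of the per-class scores, while the micro-average pools the true/false positive counts across classes and, for a single-label multiclass problem like this one, equals the overall accuracy:

    # Per-class scores: average=None returns one value per class
    p, r, f, support = precision_recall_fscore_support(y_test, y_pred, average=None)

    # Macro-average: unweighted mean of the per-class scores
    print(p.mean(), r.mean(), f.mean())

    # Micro-average: pooled counts across classes; equals accuracy here
    print(accuracy_score(y_test, y_pred))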

From this activity, we can conclude that our random forest model was able to predict the most effective marketing channel from customers' annual spend data with an accuracy of around 89%.
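
A side note on step 5: stratify=y makes train_test_split preserve the class proportions of Channel in both splits, which keeps the per-class metrics above comparable. A quick check, reusing the variables from that step and Python's collections.Counter:

    from collections import Counter

    print(sorted(Counter(y_train).items()))  # class counts in the training split
    print(sorted(Counter(y_test).items()))   # same proportions in the test split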

Activity 19: Dealing with Imbalanced Data

  1. Import all the necessary libraries:

    from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from imblearn.over_sampling import SMOTE
    from sklearn.preprocessing import StandardScaler
    from collections import Counter
  2. Read the dataset into a pandas DataFrame named bank and look at the first few rows of the data:

    bank = pd.read_csv('bank.csv', sep = ';')
    bank.head()
  3. Rename the y column as Target:

    bank = bank.rename(columns={
                            'y': 'Target'
                            })
  4. Replace the no value with 0 and yes with 1:

    bank['Target']=bank['Target'].replace({'no': 0, 'yes': 1})
  5. Check the shape and missing values in the data:

    bank.shape
    bank.isnull().values.any()
  6. Use the describe function to summarize the continuous columns, then the categorical (object) columns:

    bank.describe()
    bank.describe(include=['O'])
  7. Check the count of the class labels present in the target variable:

    bank['Target'].value_counts()
  8. Use the cat.codes attribute to encode the job, marital, default, housing, loan, contact, and poutcome columns (a toy illustration of this encoding appears after this activity's steps):

    bank["job"] = bank["job"].astype('category').cat.codes
    bank["marital"] = bank["marital"].astype('category').cat.codes
    bank["default"] = bank["job"].astype('category').cat.codes
    bank["housing"] = bank["marital"].astype('category').cat.codes
    bank["loan"] = bank["loan"].astype('category').cat.codes
    bank["contact"] = bank["contact"].astype('category').cat.codes
    bank["poutcome"] = bank["poutcome"].astype('category').cat.codes

    Since education and month are ordinal columns, convert them as follows:

    bank['education'] = bank['education'].replace({'primary': 0, 'secondary': 1, 'tertiary': 2})
    bank['month'].replace(['jan', 'feb', 'mar', 'apr', 'may', 'jun', 'jul', 'aug', 'sep', 'oct', 'nov', 'dec'], [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12], inplace=True)
  9. Check the bank data after conversion:

    bank.head()
  10. Split the data into training and testing sets using train_test_split, as follows:

    target = 'Target'
    X = bank.drop(['Target'], axis=1)
    y = bank[target]

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=123, stratify=y)
  11. Check the number of classes in y_train and y_test:

    print(sorted(Counter(y_train).items()))
    print(sorted(Counter(y_test).items()))
  12. Instantiate a StandardScaler (stored in the standard_scalar variable), fit it on X_train, and transform both X_train and X_test, assigning the results to the X_train_sc and X_test_sc variables. Fitting only on the training data avoids leaking test-set information into the scaling:

    standard_scalar = StandardScaler()
    X_train_sc = standard_scalar.fit_transform(X_train)
    X_test_sc = standard_scalar.transform(X_test)
  13. Call the random forest classifier with parameters n_estimators=20, max_depth=None, min_samples_split=7, and random_state=0:

    clf_random = RandomForestClassifier(n_estimators=20, max_depth=None,
                                        min_samples_split=7, random_state=0)
  14. Fit the random forest model:

    clf_random.fit(X_train_sc,y_train)
  15. Predict on the test data using the random forest model:

    y_pred = clf_random.predict(X_test_sc)
  16. Get the classification report:

    target_names = ['No', 'Yes']
    print(classification_report(y_test, y_pred, target_names=target_names))
  17. Compute and plot the confusion matrix:

    cm = confusion_matrix(y_test, y_pred)
    cm_df = pd.DataFrame(cm,
                         index = ['No', 'Yes'],
                         columns = ['No', 'Yes'])
    plt.figure(figsize=(8,6))
    sns.heatmap(cm_df, annot=True, fmt='g', cmap='Blues')
    plt.title('Random Forest \nAccuracy:{0:.3f}'.format(accuracy_score(y_test, y_pred)))
    plt.ylabel('True Values')
    plt.xlabel('Predicted Values')
    plt.show()
  18. Use SMOTE's fit_resample method on X_train and y_train, assigning the results to the X_resampled and y_resampled variables (a look at the resampled class counts appears after this activity):

    X_resampled, y_resampled = SMOTE().fit_resample(X_train,y_train)
  19. Fit a new StandardScaler on X_resampled and use it to transform both X_resampled and X_test, assigning the results to the X_train_sc_resampled and X_test_sc variables:

    standard_scalar = StandardScaler()
    X_train_sc_resampled = standard_scalar.fit_transform(X_resampled)
    X_test_sc = standard_scalar.transform(X_test)
  20. Fit the random forest classifier on X_train_sc_resampled and y_resampled:

    clf_random.fit(X_train_sc_resampled,y_resampled)
  21. Predict on X_test_sc:

    y_pred = clf_random.predict(X_test_sc)
  22. Generate the classification report:

    target_names = ['No', 'Yes']
    print(classification_report(y_test, y_pred, target_names=target_names))
  23. Plot the confusion matrix:

    cm = confusion_matrix(y_test, y_pred) 
    
    cm_df = pd.DataFrame(cm,
                         index = ['No', 'Yes'], 
                         columns = ['No', 'Yes'])
    plt.figure(figsize=(8,6))
    sns.heatmap(cm_df, annot=True, fmt='g', cmap='Blues')
    plt.title('Random Forest \nAccuracy:{0:.3f}'.format(accuracy_score(y_test, y_pred)))
    plt.ylabel('True Values')
    plt.xlabel('Predicted Values')
    plt.show()
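
    As mentioned in step 8, here is a toy illustration of what .astype('category').cat.codes does (hypothetical values, not taken from bank.csv): each distinct string is assigned an integer code, ordered alphabetically by default, which is exactly why the ordinal columns education and month are mapped by hand instead:

    s = pd.Series(['single', 'married', 'divorced', 'married'])
    print(dict(zip(s, s.astype('category').cat.codes)))
    # {'single': 2, 'married': 1, 'divorced': 0} -- alphabetical, not ordinal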

In this activity, our bank marketing data was highly imbalanced. We observed that, without any resampling, the model's accuracy is around 90%, but its recall for the Yes (term deposit) class is only 32% and its macro-average score is 65%. This implies that the model does not generalize well: most of the time, it misses potential customers who would subscribe to the term deposit.

On the other hand, when we used SMOTE, accuracy dropped slightly to around 87%, but recall for the Yes class rose to 61% and the macro-average score to 76%. This implies that the model generalizes better: more than 60% of the time, it detects potential customers who would subscribe to the term deposit.
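
To see what SMOTE actually changed in step 18, compare the class counts before and after resampling (reusing Counter from step 1's imports). SMOTE synthesizes new minority-class rows until both classes are balanced, and only the training data is resampled:

    print(sorted(Counter(y_train).items()))      # imbalanced: far more 0s than 1s
    print(sorted(Counter(y_resampled).items()))  # balanced after SMOTE
    print(sorted(Counter(y_test).items()))       # untouched; never resample test data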
