How much do you spend on coffee? Predicting Starbucks offer success based on demographics and offer type
Introduction
This project is part of the Udacity Capstone Challenge.
This dataset consists of three files:
1. profile.json: Rewards program users (17000 users x 5 fields)
2. portfolio.json: Offers sent during 30-day test period (10 offers x 6 fields)
3. transcript.json: Event log (306648 events x 4 fields)
Preparing the Data
Data cleaning:
1. Portfolio Data
- Renamed the ‘id’ column to ‘offerid’.
- Removed underscores from column names.
- One-hot encoded the ‘offertype’ and ‘channels’ columns.
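The portfolio cleaning steps above might be sketched as follows; the sample rows and exact column values are illustrative assumptions, though the column names follow the portfolio.json schema:

```python
import pandas as pd

# Toy stand-in for portfolio.json (id, offer_type, channels, ...)
portfolio = pd.DataFrame({
    "id": ["ae264e", "4d5c57"],
    "offer_type": ["bogo", "discount"],
    "channels": [["email", "web"], ["email", "mobile"]],
    "difficulty": [10, 10],
})

# Rename 'id' to 'offerid' and drop underscores from column names
portfolio = portfolio.rename(columns={"id": "offerid"})
portfolio.columns = portfolio.columns.str.replace("_", "")

# One-hot encode the offer type
portfolio = pd.concat(
    [portfolio, pd.get_dummies(portfolio["offertype"], prefix="offertype")],
    axis=1,
)

# One-hot encode the list-valued 'channels' column
channel_dummies = portfolio["channels"].str.join("|").str.get_dummies()
portfolio = pd.concat([portfolio, channel_dummies], axis=1)
```

The `str.join("|").str.get_dummies()` idiom is one convenient way to expand a column of lists into indicator columns.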
2. Profile Data
- Removed customers with missing income data and customer profiles where the gender attribute is missing.
- Renamed the ‘id’ column to ‘customerid’.
- Converted the ‘became_member_on’ column to a datetime object and the gender column from a character to a number.
- One-hot encoded each customer’s membership start year, start month, and age range.
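A minimal sketch of the profile cleaning, assuming the profile.json schema (id, gender, age, income, became_member_on); the gender codes and age-range bins here are assumptions, not the project's exact choices:

```python
import pandas as pd

# Toy stand-in for profile.json
profile = pd.DataFrame({
    "id": ["a", "b", "c"],
    "gender": ["F", None, "M"],
    "age": [55, 118, 40],
    "income": [72000.0, None, 57000.0],
    "became_member_on": [20170715, 20180101, 20160512],
})

# Drop rows with missing income or gender, rename 'id' to 'customerid'
profile = profile.dropna(subset=["income", "gender"])
profile = profile.rename(columns={"id": "customerid"})

# Parse the membership date and map gender to a number
profile["became_member_on"] = pd.to_datetime(
    profile["became_member_on"], format="%Y%m%d")
profile["gender"] = profile["gender"].map({"M": 0, "F": 1, "O": 2})

# One-hot encode membership start year, start month, and an age range
profile["startyear"] = profile["became_member_on"].dt.year
profile["startmonth"] = profile["became_member_on"].dt.month
profile["agerange"] = pd.cut(profile["age"], bins=[17, 30, 45, 60, 120],
                             labels=["18-30", "31-45", "46-60", "60+"])
profile = pd.get_dummies(profile,
                         columns=["startyear", "startmonth", "agerange"])
```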
3. Transcript Data
- Renamed the ‘person’ column to ‘customerid’ and the ‘time’ column to ‘timedays’, removed customer ids that are not in the customer profile DataFrame, and converted the time variable from hours to days.
- Created two DataFrames, one describing offer events and one describing customer transaction events.
- One-hot encoded offer events.
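The transcript cleaning can be sketched like this, with made-up rows standing in for transcript.json (the event names match the real log):

```python
import pandas as pd

# Toy stand-in for transcript.json
transcript = pd.DataFrame({
    "person": ["a", "a", "c", "z"],
    "event": ["offer received", "transaction", "offer viewed", "transaction"],
    "time": [0, 132, 168, 6],
})
valid_customers = {"a", "c"}  # customer ids kept in the cleaned profile

# Rename columns and drop customers missing from the profile DataFrame
transcript = transcript.rename(columns={"person": "customerid",
                                        "time": "timedays"})
transcript = transcript[transcript["customerid"].isin(valid_customers)].copy()
transcript["timedays"] = transcript["timedays"] / 24.0  # hours -> days

# Split into offer events and customer transactions
offers = transcript[transcript["event"] != "transaction"].copy()
transactions = transcript[transcript["event"] == "transaction"].copy()

# One-hot encode the offer events
offers = pd.concat([offers, pd.get_dummies(offers["event"])], axis=1)
```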
Exploratory Data Analysis
Profile General Distribution
Income (after encoding), Age, Gender, Membership start year, Membership start month
Box Plot for Income, Age
Sum Income by Gender
Income by gender for each Month
Income by gender for each Year
Timeline for Sum of income by Year
Income for Age Range
Modelling
Steps:
- Transforming Skewed Continuous Features (the membershipstartyear and membershipstartmonth columns)
- Normalising Numerical Features
- Shuffle and Split Data
- Evaluating Model Performance
- Naive Predictor Performance
- Creating a Training and Predicting Pipeline
- Initial Model Evaluation
- Improving Results (grid search)
Transforming Skewed Continuous Features
Skewed distributions of feature values can cause an algorithm to underperform if the range is not normalised.
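A common way to reduce skew is a log transform. A minimal sketch, with made-up income values (the column name is an assumption):

```python
import numpy as np
import pandas as pd

# A small right-skewed sample: one large outlier dominates the range
data = pd.DataFrame({"income": [30000.0, 45000.0, 52000.0, 250000.0]})

# log(1 + x) compresses large values and safely handles zeros
data["income_log"] = np.log1p(data["income"])
```

After the transform, the skewness of the column drops noticeably, which makes the feature easier for many learners to use.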
Normalising Numerical Features
It is recommended to perform some type of scaling on numerical features. This scaling does not change the shape of each feature's distribution, but it ensures that each feature is treated equally when supervised models are applied.
Shuffle and Split Data
Once all categorical variables are transformed and all numerical features normalised, we need to split our data into training and test sets. We'll use 80% of the data for training and 20% for testing.
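The 80/20 split is a one-liner with scikit-learn; `X` and `y` here are placeholders for the engineered feature matrix and the offer-success label:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the cleaned features and labels
X = np.arange(200).reshape(100, 2)
y = np.random.randint(0, 2, size=100)

# Shuffle and split: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```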
Evaluating Model Performance
In this section, we will study three supervised learning algorithms and determine which is best at modeling and predicting our data. A naive predictor will serve as a baseline of performance for the three supervised learners.
- Accuracy: Percentage of total items classified correctly
Accuracy = (TP+TN)/(N+P)
- F-beta score, a metric that considers both precision and recall:
F-beta = (1 + β²) · (precision · recall) / (β² · precision + recall)
When β = 0.5, the emphasis is on precision. I chose β = 0.5 so that offers are targeted at customers whose income is above average, rather than sent indiscriminately.
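Both metrics are available in scikit-learn. A tiny illustration with made-up labels:

```python
from sklearn.metrics import accuracy_score, fbeta_score

# Made-up ground truth and predictions
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

acc = accuracy_score(y_true, y_pred)            # (TP + TN) / (P + N) = 4/6
f05 = fbeta_score(y_true, y_pred, beta=0.5)     # beta < 1 favours precision
```

Here precision and recall are both 0.75, so the F0.5 score is also 0.75, while accuracy is 4/6 ≈ 0.667.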
1. Naive Predictor
A naive predictor assumes every offer sent is successful, irrespective of any features.
Accuracy: 0.464
F-Score: 0.520
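These baseline numbers follow directly from the class balance. A sketch with an assumed ~46% positive rate (the project's reported scores were 0.464 / 0.520; the exact balance here is made up for illustration):

```python
import numpy as np

# Assumed class balance: ~46% of offers successful
y = np.array([1] * 46 + [0] * 54)
preds = np.ones_like(y)  # naive predictor: everything is "successful"

tp = np.sum((preds == 1) & (y == 1))
fp = np.sum((preds == 1) & (y == 0))
fn = np.sum((preds == 0) & (y == 1))

accuracy = (y == preds).mean()                 # equals the positive rate
precision = tp / (tp + fp)                     # also the positive rate
recall = tp / (tp + fn)                        # 1.0: no positives missed
beta = 0.5
fscore = ((1 + beta**2) * precision * recall
          / (beta**2 * precision + recall))
```

With recall fixed at 1.0, both accuracy and the F0.5 score are functions of the positive rate alone, which is why the naive baseline is so easy to beat.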
2. Supervised Learners
Three supervised learning algorithms are then compared against this baseline:
Creating a Training and Predicting Pipeline
To properly evaluate the performance of each model, we will create a training and predicting pipeline.
In the code block below:
- Import fbeta_score and accuracy_score from sklearn.metrics.
- Fit the learner to the sampled training data and record the training time.
- Perform predictions on the test data X_test, and also on the first 300 training points X_train[:300].
- Record the total prediction time.
- Calculate the accuracy score for both the training subset and the testing set.
- Calculate the F-score for both the training subset and the testing set.
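The steps above can be sketched as a single helper function; the variable and key names mirror the text, but this is an illustrative reconstruction rather than the project's exact code:

```python
from time import time

from sklearn.metrics import accuracy_score, fbeta_score


def train_predict(learner, X_train, y_train, X_test, y_test):
    """Fit a learner and report timing, accuracy, and F0.5 scores."""
    results = {}

    # Fit the learner and record the training time
    start = time()
    learner.fit(X_train, y_train)
    results["train_time"] = time() - start

    # Predict on the test set and the first 300 training points
    start = time()
    predictions_test = learner.predict(X_test)
    predictions_train = learner.predict(X_train[:300])
    results["pred_time"] = time() - start

    # Accuracy and F0.5 on the training subset and the test set
    results["acc_train"] = accuracy_score(y_train[:300], predictions_train)
    results["acc_test"] = accuracy_score(y_test, predictions_test)
    results["f_train"] = fbeta_score(y_train[:300], predictions_train,
                                     beta=0.5)
    results["f_test"] = fbeta_score(y_test, predictions_test, beta=0.5)
    return results
```

Each candidate model is then passed through this pipeline so that the timing and score columns are directly comparable.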
Initial Model Evaluation
The Results:
Improving Results
Finally, in this section we will choose the best model for our data and then perform a grid search optimization.
Model Tuning
n_estimators, max_features, max_depth, min_samples_split, min_samples_leaf are tuned to fit the training data using RandomizedSearchCV.
- max_features (‘auto’): The number of features to consider when looking for the best split.
- max_depth: The maximum depth of the tree. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
- n_estimators: The number of trees in the forest.
- min_samples_split: The minimum number of samples required to split an internal node.
- min_samples_leaf: The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
The model is built as below:
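A sketch of the tuning step with RandomizedSearchCV, scored with the same F0.5 metric; the parameter-grid values here are assumptions, not the project's exact grid (recent scikit-learn versions also replace the old 'auto' option for max_features with 'sqrt'/'log2'):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV

# Assumed candidate values for the five tuned hyperparameters
param_dist = {
    "n_estimators": [50, 100, 200],
    "max_features": ["sqrt", "log2"],
    "max_depth": [None, 10, 20, 40],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# Score candidates with F0.5 to keep the emphasis on precision
scorer = make_scorer(fbeta_score, beta=0.5)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=5, scoring=scorer, cv=3, random_state=42)

# search.fit(X_train, y_train) then use search.best_estimator_
```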
Feature Importance
Extracting Feature Importance
An important task when performing supervised learning on a dataset like this one is determining which features provide the most predictive power. By focusing on the relationship between only a few crucial features and the target label, we simplify our understanding of the phenomenon, which is almost always a useful thing to do.
The feature importances are extracted as below:
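A sketch of pulling the top five importances from a fitted random forest; the feature names and toy data are placeholders for the project's engineered features:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy feature matrix with placeholder column names
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.rand(200, 6),
                 columns=["income", "age", "startyear", "startmonth",
                          "reward", "difficulty"])
y = (X["income"] + 0.5 * X["age"] > 1).astype(int)

model = RandomForestClassifier(random_state=0).fit(X, y)

# feature_importances_ is already normalized to sum to 1
importances = pd.Series(model.feature_importances_, index=X.columns)
top5 = importances.sort_values(ascending=False).head(5)
```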
Normalized Weights for First Five Most Predictive Features:
Feature Selection
Results: