How much do you spend on coffee? Predicting Starbucks offer success based on demographics and offer type
Introduction
This project is part of the Udacity Capstone Challenge.
This dataset consists of three files:
1. profile.json: Rewards program users (17000 users x 5 fields)
2. portfolio.json: Offers sent during 30-day test period (10 offers x 6 fields)
3. transcript.json: Event log (306648 events x 4 fields)
Preparing the Data
Data cleaning:
1. Portfolio Data
- Renamed the ‘id’ column to ‘offerid’.
- Removed underscores from column names.
- One-hot encoded the ‘offertype’ and ‘channels’ columns.
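The portfolio cleaning steps above might be sketched as follows; the sample rows and exact column values are illustrative assumptions, though the column names follow the portfolio.json schema:

```python
import pandas as pd

# Toy stand-in for portfolio.json (id, offer_type, channels, ...)
portfolio = pd.DataFrame({
    "id": ["ae264e", "4d5c57"],
    "offer_type": ["bogo", "discount"],
    "channels": [["email", "web"], ["email", "mobile"]],
    "difficulty": [10, 10],
})

# Rename 'id' to 'offerid' and drop underscores from column names
portfolio = portfolio.rename(columns={"id": "offerid"})
portfolio.columns = portfolio.columns.str.replace("_", "")

# One-hot encode the offer type
portfolio = pd.concat(
    [portfolio, pd.get_dummies(portfolio["offertype"], prefix="offertype")],
    axis=1,
)

# One-hot encode the list-valued 'channels' column
channel_dummies = portfolio["channels"].str.join("|").str.get_dummies()
portfolio = pd.concat([portfolio, channel_dummies], axis=1)
```

The `str.join("|").str.get_dummies()` idiom is one convenient way to expand a column of lists into indicator columns.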
2. Profile Data
- Removed customers with missing income data and customer profiles where the gender attribute is missing.
- Renamed the ‘id’ column to ‘customerid’.
- Converted the ‘became_member_on’ column to a datetime object and the gender column from a character to a number.
- One-hot encoded each customer’s membership start year, start month, and age range.
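A minimal sketch of the profile cleaning, assuming the profile.json schema (id, gender, age, income, became_member_on); the gender codes and age-range bins here are assumptions, not the project's exact choices:

```python
import pandas as pd

# Toy stand-in for profile.json
profile = pd.DataFrame({
    "id": ["a", "b", "c"],
    "gender": ["F", None, "M"],
    "age": [55, 118, 40],
    "income": [72000.0, None, 57000.0],
    "became_member_on": [20170715, 20180101, 20160512],
})

# Drop rows with missing income or gender, rename 'id' to 'customerid'
profile = profile.dropna(subset=["income", "gender"])
profile = profile.rename(columns={"id": "customerid"})

# Parse the membership date and map gender to a number
profile["became_member_on"] = pd.to_datetime(
    profile["became_member_on"], format="%Y%m%d")
profile["gender"] = profile["gender"].map({"M": 0, "F": 1, "O": 2})

# One-hot encode membership start year, start month, and an age range
profile["startyear"] = profile["became_member_on"].dt.year
profile["startmonth"] = profile["became_member_on"].dt.month
profile["agerange"] = pd.cut(profile["age"], bins=[17, 30, 45, 60, 120],
                             labels=["18-30", "31-45", "46-60", "60+"])
profile = pd.get_dummies(profile,
                         columns=["startyear", "startmonth", "agerange"])
```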
3. Transcript Data
- Renamed the ‘person’ column to ‘customerid’ and the ‘time’ column to ‘timedays’, removed customer ids that are not in the customer profile DataFrame, and converted the time variable from hours to days.
- Created two DataFrames, one describing offer events and one describing customer transaction events.
- One-hot encoded offer events.
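The transcript cleaning can be sketched like this, with made-up rows standing in for transcript.json (the event names match the real log):

```python
import pandas as pd

# Toy stand-in for transcript.json
transcript = pd.DataFrame({
    "person": ["a", "a", "c", "z"],
    "event": ["offer received", "transaction", "offer viewed", "transaction"],
    "time": [0, 132, 168, 6],
})
valid_customers = {"a", "c"}  # customer ids kept in the cleaned profile

# Rename columns and drop customers missing from the profile DataFrame
transcript = transcript.rename(columns={"person": "customerid",
                                        "time": "timedays"})
transcript = transcript[transcript["customerid"].isin(valid_customers)].copy()
transcript["timedays"] = transcript["timedays"] / 24.0  # hours -> days

# Split into offer events and customer transactions
offers = transcript[transcript["event"] != "transaction"].copy()
transactions = transcript[transcript["event"] == "transaction"].copy()

# One-hot encode the offer events
offers = pd.concat([offers, pd.get_dummies(offers["event"])], axis=1)
```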
Exploratory Data Analysis
Profile General Distribution
Income (after encoding), Age, Gender, Membership start year, Membership start month
Box Plot for Income, Age
Sum Income by Gender
Income by gender for each Month
Income by gender for each Year
Timeline for Sum of income by Year
Income for Age Range
Modelling
Steps:
- Transforming Skewed Continuous Features (the membershipstartyear and membershipstartmonth columns)
- Normalising Numerical Features
- Shuffle and Split Data
- Evaluating Model Performance
- Naive Predictor Performance
- Creating a Training and Predicting Pipeline
- Initial Model Evaluation
- Improving Results (grid search)
Transforming Skewed Continuous Features
Skewed distributions of feature values can cause an algorithm to underperform if the range is not normalised.
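A common way to reduce skew is a log transform. A minimal sketch, with made-up income values (the column name is an assumption):

```python
import numpy as np
import pandas as pd

# A small right-skewed sample: one large outlier dominates the range
data = pd.DataFrame({"income": [30000.0, 45000.0, 52000.0, 250000.0]})

# log(1 + x) compresses large values and safely handles zeros
data["income_log"] = np.log1p(data["income"])
```

After the transform, the skewness of the column drops noticeably, which makes the feature easier for many learners to use.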
Normalising Numerical Features
It is recommended to perform some type of scaling on numerical features. This scaling does not change the shape of each feature's distribution, but it ensures that each feature is treated equally when supervised models are applied.
Shuffle and Split Data
Once all categorical variables are transformed and all numerical features normalised, we need to split our data into training and test sets. We'll use 80% of the data for training and 20% for testing.
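The 80/20 split is a one-liner with scikit-learn; `X` and `y` here are placeholders for the engineered feature matrix and the offer-success label:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data standing in for the cleaned features and labels
X = np.arange(200).reshape(100, 2)
y = np.random.randint(0, 2, size=100)

# Shuffle and split: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```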
Evaluating Model Performance
In this section, we will study three supervised learning algorithms and determine which is best at modeling and predicting our data. A naive predictor will serve as a baseline of performance for the three supervised learners.
- Accuracy: Percentage of total items classified correctly
Accuracy = (TP+TN)/(N+P)
- F-beta score, a metric that considers both precision and recall:
F-beta = (1 + β²) · (precision · recall) / (β² · precision + recall)
When β = 0.5, the emphasis is on precision. I chose β = 0.5 so that offers are targeted at customers whose income is above average, rather than sent indiscriminately.
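Both metrics are available in scikit-learn. A tiny illustration with made-up labels:

```python
from sklearn.metrics import accuracy_score, fbeta_score

# Made-up ground truth and predictions
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

acc = accuracy_score(y_true, y_pred)            # (TP + TN) / (P + N) = 4/6
f05 = fbeta_score(y_true, y_pred, beta=0.5)     # beta < 1 favours precision
```

Here precision and recall are both 0.75, so the F0.5 score is also 0.75, while accuracy is 4/6 ≈ 0.667.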
1. Naive Predictor
A naive predictor assumes every offer sent is successful, irrespective of any features.
Accuracy: 0.464
F-Score: 0.520
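These baseline numbers follow directly from the class balance. A sketch with an assumed ~46% positive rate (the project's reported scores were 0.464 / 0.520; the exact balance here is made up for illustration):

```python
import numpy as np

# Assumed class balance: ~46% of offers successful
y = np.array([1] * 46 + [0] * 54)
preds = np.ones_like(y)  # naive predictor: everything is "successful"

tp = np.sum((preds == 1) & (y == 1))
fp = np.sum((preds == 1) & (y == 0))
fn = np.sum((preds == 0) & (y == 1))

accuracy = (y == preds).mean()                 # equals the positive rate
precision = tp / (tp + fp)                     # also the positive rate
recall = tp / (tp + fn)                        # 1.0: no positives missed
beta = 0.5
fscore = ((1 + beta**2) * precision * recall
          / (beta**2 * precision + recall))
```

With recall fixed at 1.0, both accuracy and the F0.5 score are functions of the positive rate alone, which is why the naive baseline is so easy to beat.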
2. Supervised Learners
Three supervised learning algorithms are then compared against this baseline:
Creating a Training and Predicting Pipeline
To properly evaluate the performance of each model, we will create a training and predicting pipeline.
In the code block below:
- Import fbeta_score and accuracy_score from sklearn.metrics.
- Fit the learner to the sampled training data and record the training time.
- Perform predictions on the test data X_test, and also on the first 300 training points X_train[:300].
- Record the total prediction time.
- Calculate the accuracy score for both the training subset and the testing set.
- Calculate the F-score for both the training subset and the testing set.
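The steps above can be sketched as a single helper function; the variable and key names mirror the text, but this is an illustrative reconstruction rather than the project's exact code:

```python
from time import time

from sklearn.metrics import accuracy_score, fbeta_score


def train_predict(learner, X_train, y_train, X_test, y_test):
    """Fit a learner and report timing, accuracy, and F0.5 scores."""
    results = {}

    # Fit the learner and record the training time
    start = time()
    learner.fit(X_train, y_train)
    results["train_time"] = time() - start

    # Predict on the test set and the first 300 training points
    start = time()
    predictions_test = learner.predict(X_test)
    predictions_train = learner.predict(X_train[:300])
    results["pred_time"] = time() - start

    # Accuracy and F0.5 on the training subset and the test set
    results["acc_train"] = accuracy_score(y_train[:300], predictions_train)
    results["acc_test"] = accuracy_score(y_test, predictions_test)
    results["f_train"] = fbeta_score(y_train[:300], predictions_train,
                                     beta=0.5)
    results["f_test"] = fbeta_score(y_test, predictions_test, beta=0.5)
    return results
```

Each candidate model is then passed through this pipeline so that the timing and score columns are directly comparable.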
Initial Model Evaluation
The Results:
Improving Results
Finally, in this section we will choose the best model for our data and then perform a grid search optimization.
Model Tuning
n_estimators, max_features, max_depth, min_samples_split, min_samples_leaf are tuned to fit the training data using RandomizedSearchCV.
- max_features (‘auto’): The number of features to consider when looking for the best split.
- max_depth: The maximum depth of the tree. If None, nodes are expanded until all leaves are pure or until all leaves contain fewer than min_samples_split samples.
- n_estimators: The number of trees in the forest.
- min_samples_split: The minimum number of samples required to split an internal node.
- min_samples_leaf: The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
The model is built as below:
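A sketch of the tuning step with RandomizedSearchCV, scored with the same F0.5 metric; the parameter-grid values here are assumptions, not the project's exact grid (recent scikit-learn versions also replace the old 'auto' option for max_features with 'sqrt'/'log2'):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV

# Assumed candidate values for the five tuned hyperparameters
param_dist = {
    "n_estimators": [50, 100, 200],
    "max_features": ["sqrt", "log2"],
    "max_depth": [None, 10, 20, 40],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# Score candidates with F0.5 to keep the emphasis on precision
scorer = make_scorer(fbeta_score, beta=0.5)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=5, scoring=scorer, cv=3, random_state=42)

# search.fit(X_train, y_train) then use search.best_estimator_
```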
Feature Importance
Extracting Feature Importance
An important task when performing supervised learning on a dataset like this one is determining which features provide the most predictive power. By focusing on the relationship between only a few crucial features and the target label, we simplify our understanding of the phenomenon, which is almost always a useful thing to do.
The feature importances are extracted as below:
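A sketch of pulling the top five importances from a fitted random forest; the feature names and toy data are placeholders for the project's engineered features:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy feature matrix with placeholder column names
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.rand(200, 6),
                 columns=["income", "age", "startyear", "startmonth",
                          "reward", "difficulty"])
y = (X["income"] + 0.5 * X["age"] > 1).astype(int)

model = RandomForestClassifier(random_state=0).fit(X, y)

# feature_importances_ is already normalized to sum to 1
importances = pd.Series(model.feature_importances_, index=X.columns)
top5 = importances.sort_values(ascending=False).head(5)
```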
Normalized Weights for First Five Most Predictive Features:
Feature Selection
Results: