In this article, we’ll dive into Starbucks data to analyze which demographic groups respond best to offers and how Starbucks could distribute offers to optimize sales. This project is part of the Udacity Data Scientist Nanodegree program.
Provided transaction, demographic and offer data contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offer during certain weeks.
In this project, we’d like to find out which demographic groups respond best to the offers and which features matter most if we want to optimize the fulfillment of offers.
To answer this question, we’ll perform some visual analyses and calculate some basic metrics (such as what percentage of received offers is completed in each demographic group). In addition, multiple machine learning models will be used to predict whether a customer will complete an offer. The most suitable model is the one with both high accuracy and a high F1 score. Once it is identified, we’ll look at its feature importances to determine which features play an important role in whether a customer completes an offer.
First, let’s get a closer look at the data.
The offer dataset (10 rows, 6 columns) includes 10 different offers. Each is a BOGO, discount or informational offer, described by its reward, the minimum spend required to complete it, how long it stays open and the channels through which it is distributed.
Demographic data (17000 rows, 5 columns) contains customer info, such as gender, age, income and when the customer became a member.
Transcript data (306534 rows, 4 columns) contains transaction related info, event (transaction, offer received, offer viewed or offer completed), and so on.
A detailed schema and explanation of each variable in the files:
Offer data:
* id (string): offer id
* offer_type (string): type of offer, i.e. BOGO, discount, informational
* difficulty (int): minimum required spend to complete an offer
* reward (int): reward given for completing an offer
* duration (int): time for offer to be open, in days
* channels (list of strings)
Demographic data:
* age (int): age of the customer
* became_member_on (int): date when customer created an app account
* gender (str): gender of the customer (note some entries contain ‘O’ for other rather than M or F)
* id (str): customer id
* income (float): customer’s income
Transcript data:
* event (str): record description (i.e. transaction, offer received, offer viewed, etc.)
* person (str): customer id
* time (int): time in hours since start of test. The data begins at time t=0
* value (dict of strings): either an offer id or transaction amount depending on the record
Data Cleaning and Preprocessing
To analyze the data and generate insights, the first step is to clean and prepare the data and merge all three datasets. (The full code is in the Github repository linked at the end of this article.)
In the demographic data, a number of customers are recorded as 118 years old. On closer inspection, those customers are not really 118 years old; 118 is the value assigned when a customer’s age is missing. The customers with age 118 also have missing gender and income.
In addition, the boxplot shows that customers older than 80 make up only a small share of the customer base. Therefore, customers older than 80 are excluded from the following analysis.
For customers with missing age and income, the missing values are imputed with the mean. Missing gender is imputed with the mode, which is male in this project. The date a customer became a member is converted into the length of their membership in years.
Once age and income are imputed and have no missing values, they are grouped so that they become categorical data. Structuring the data this way helps us identify which demographic group responds best to offers in the following steps.
To sum up, the preprocessing steps on demographic data include:
- customers older than 80 are excluded from the analysis
- age = 118 is not a real age and is imputed using the mean
- missing gender is imputed using the mode
- missing income is imputed using the mean
- ‘became_member_on’ is converted into date format and a new variable records how many years a user has been a member
- income and age are grouped into categorical data
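The steps above could be sketched in pandas roughly as follows. This is a minimal illustration on toy data, not the code from the repository: the reference date used to compute membership length, the bin edges and the new column names (`member_years`, `age_group`, `income_group`) are assumptions, with the group labels taken from the tables later in this article.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the demographic data; column names follow the schema above.
profile = pd.DataFrame({
    "id": ["a", "b", "c", "d", "e"],
    "gender": ["M", None, "F", "M", "M"],
    "age": [35, 118, 62, 85, 19],
    "income": [52000.0, np.nan, 98000.0, 70000.0, 40000.0],
    "became_member_on": [20170412, 20180101, 20151130, 20160704, 20180520],
})

# 1. Treat the placeholder age 118 as missing.
profile["age"] = profile["age"].mask(profile["age"] == 118)

# 2. Drop customers older than 80 (rows with a missing age are kept for imputation).
profile = profile[profile["age"].isna() | (profile["age"] <= 80)].copy()

# 3. Impute age and income with the mean, gender with the mode.
profile["age"] = profile["age"].fillna(profile["age"].mean())
profile["income"] = profile["income"].fillna(profile["income"].mean())
profile["gender"] = profile["gender"].fillna(profile["gender"].mode()[0])

# 4. Convert the integer date and derive membership length in years.
profile["became_member_on"] = pd.to_datetime(
    profile["became_member_on"].astype(str), format="%Y%m%d"
)
reference_date = pd.Timestamp("2019-01-01")  # hypothetical reference date
profile["member_years"] = (
    (reference_date - profile["became_member_on"]).dt.days / 365.25
)

# 5. Group age and income into categories (bin edges are assumptions).
profile["age_group"] = pd.cut(
    profile["age"],
    bins=[0, 20, 40, 60, 80],
    labels=["Under 20", "20-40", "40-60", "60-80"],
)
profile["income_group"] = pd.cut(
    profile["income"],
    bins=[30000, 45000, 65000, 85000, 105000, 120000],
    labels=["30000-45000", "45000-65000", "65000-85000",
            "85000-105000", "105000-120000"],
    include_lowest=True,
)

print(profile[["id", "gender", "age_group", "income_group", "member_years"]])
```

Note the order of the first two steps: the placeholder 118 has to be turned into a missing value before the "older than 80" filter, otherwise those customers would simply be dropped instead of imputed.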
The preprocessing on offer data is mainly focused on:
- rename the columns to make merging easier
- split up the distribution channel information to individual columns and one hot encode them
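One way to do these two steps in pandas is sketched below. The offer ids and channel names are toy values chosen for illustration, and renaming `id` to `offer_id` is an assumption about what "rename for ease of merge" means here.

```python
import pandas as pd

# Toy stand-in for the offer data; ids and channel names are illustrative.
portfolio = pd.DataFrame({
    "id": ["offer_1", "offer_2", "offer_3"],
    "offer_type": ["bogo", "discount", "informational"],
    "channels": [["email", "mobile", "social"], ["web", "email"], ["email", "mobile"]],
})

# Rename the key so it does not clash with the customer `id` when merging.
offers = portfolio.rename(columns={"id": "offer_id"})

# One 0/1 column per channel: explode the lists into rows, one-hot encode,
# then collapse back to one row per offer.
channel_dummies = (
    pd.get_dummies(offers["channels"].explode())
    .groupby(level=0)
    .max()
    .astype(int)
)
offers = pd.concat([offers.drop(columns="channels"), channel_dummies], axis=1)

print(offers)
```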
The data cleaning and preparation on transcript data focuses on events. Since the purpose of the analysis is to find out which demographic group responds best to offers, two events are of particular interest: offer received and offer completed. Transaction and offer viewed events will not be analyzed. Detailed steps include:
- split up value dictionary to separate columns
- combine values of ‘offer_id’ and ‘offer id’ together into one column since they mean the same thing
- drop unused columns
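A rough pandas sketch of these steps on toy data is shown below. The `amount` key for transactions is an assumption about how the transaction amount is stored; the two offer-id keys are the ones named above.

```python
import pandas as pd

# Toy stand-in for the transcript data.
transcript = pd.DataFrame({
    "person": ["a", "a", "a", "b"],
    "event": ["offer received", "transaction", "offer completed", "offer viewed"],
    "time": [0, 6, 12, 18],
    "value": [
        {"offer id": "offer_1"},
        {"amount": 9.5},            # key name is an assumption
        {"offer_id": "offer_1"},
        {"offer id": "offer_2"},
    ],
})

# Expand each dictionary into its own columns (missing keys become NaN).
values = pd.DataFrame(transcript["value"].tolist(), index=transcript.index)

# 'offer id' and 'offer_id' mean the same thing, so merge them into one column.
values["offer_id"] = values["offer_id"].fillna(values["offer id"])

# Drop the unused columns, keeping only the combined offer id.
events = pd.concat([transcript.drop(columns="value"), values[["offer_id"]]], axis=1)

# Keep only the two events the analysis uses.
events = events[events["event"].isin(["offer received", "offer completed"])]

print(events)
```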
Exploratory Data Analysis
Now that the data is tidy and clean, let’s look at some graphs to get a better idea of it.
Most customers have an income ranging from 45000 to 65000, while relatively few customers have an income over 105000.
The number of male customers is almost twice the number of female customers. One thing to note is that customers with missing gender were imputed as male, which accounts for part of the gap, but male customers still clearly outnumber female customers.
From the above plot, customers in their 40s and 50s make up the main customer base of Starbucks.
To get more specific, let’s look at the customer age distribution by gender.
The plot shows that Starbucks has more male customers in their 40s and 50s, although part of this peak may come from the imputation. Overall, male customers aged 20–80 make up a large share of the customer base, as do female customers aged 40–80. Customers younger than 20 make up a very small share of the customer base.
The transcript data shows that only about half of the received offers are completed by customers. One thing to note here is that it’s impossible to record whether an informational offer is completed or not. Therefore, the completed offers here are all BOGO and discount offers.
Looking closer, among the completed offers, more discount offers are completed than BOGO offers.
Data Analysis Questions
Next, let’s get some simple metrics out of the data.
   income_group   completed_count  received_count  percentage
0  45000-65000              10228           30002    0.340911
1  65000-85000               9266           17338    0.534433
2  85000-105000              5527           12318    0.448693
3  30000-45000               4251            9015    0.471547
4  105000-120000             1972            3190    0.618182
Among the income groups, the table above shows that customers with an income of 105000–120000 are the most likely to complete an offer once they receive one: about 62% of the offers this group received were completed.
   age_group  completed_count  received_count  percentage
0  40-60                14266           35958    0.396741
1  60-80                11263           21917    0.513893
2  20-40                 5395           13067    0.412872
3  Under 20               320             921    0.347448
Customers aged 60–80 complete about 51% of the offers they receive, the highest rate of any age group, while customers younger than 20 complete the smallest share (about 35%).
   gender  completed_count  received_count  percentage
0  M                 17208           46885    0.367026
1  F                 14036           24978    0.561935
Female customers complete a larger share of the offers they receive (about 56%) than male customers (about 37%).
   offer_type  completed_count  received_count  percentage
0  discount              16696           28806    0.579601
1  bogo                  14548           28763    0.505789
Customers complete discount offers more often than BOGO offers.
To sum up, in order to increase the completion of offers, Starbucks may want to send more offers to high-income female customers aged 60–80, preferably discount offers.
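The completion percentages in the tables above are simply completed offers divided by received offers within each group. A small sketch of that calculation, on toy data and with a hypothetical helper name `completion_rate`, might look like this:

```python
import pandas as pd

# Toy merged data: one row per event, with the customer's gender attached.
events = pd.DataFrame({
    "gender": ["F"] * 5 + ["M"] * 5,
    "event": ["offer received"] * 3 + ["offer completed"] * 2
             + ["offer received"] * 4 + ["offer completed"] * 1,
})

def completion_rate(df, group_col):
    """Count received and completed offers per group and compute their ratio."""
    counts = df.groupby([group_col, "event"]).size().unstack(fill_value=0)
    out = pd.DataFrame({
        "completed_count": counts["offer completed"],
        "received_count": counts["offer received"],
    })
    out["percentage"] = out["completed_count"] / out["received_count"]
    return (
        out.sort_values("completed_count", ascending=False)
        .rename_axis(group_col)
        .reset_index()
    )

result = completion_rate(events, "gender")
print(result)
```

The same helper can be called with an income group, age group or offer type column to produce the other tables.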
In this part, five models are used to predict whether a customer will complete an offer after receiving one. The model with the best performance is then selected and used to evaluate which features play a role in whether a customer will fulfill an offer.
The five models are: naive predictor, logistic regression, random forest, gradient boosting, and decision tree models.
For all five models, the default parameter settings are used, except that a random state is assigned. A function is developed to make printing the metrics easier.
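A setup along these lines could look like the sketch below. It uses synthetic data from `make_classification` in place of the merged Starbucks features, a `DummyClassifier` that always predicts the positive class as the naive predictor (an assumption), and a hypothetical helper named `evaluate`; the numbers it prints are therefore not the ones reported in this article.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the feature matrix and the "offer completed" label.
X, y = make_classification(n_samples=600, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

def evaluate(name, model):
    """Fit the model, then print and return train/test accuracy and F1."""
    model.fit(X_train, y_train)
    scores = {}
    for metric_name, metric in [("accuracy", accuracy_score), ("f1", f1_score)]:
        train = metric(y_train, model.predict(X_train))
        test = metric(y_test, model.predict(X_test))
        scores[metric_name] = (train, test)
        print(f"{name} {metric_name} score, train: {train:.3f}, test: {test:.3f}")
    return scores

# Default settings everywhere, apart from the fixed random state.
models = {
    "Naive predictor": DummyClassifier(strategy="constant", constant=1),
    "Logistic regression model": LogisticRegression(random_state=42),
    "Random forest model": RandomForestClassifier(random_state=42),
    "Gradient boosting model": GradientBoostingClassifier(random_state=42),
    "Decision tree model": DecisionTreeClassifier(random_state=42),
}

results = {name: evaluate(name, model) for name, model in models.items()}
```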
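Among the five models, gradient boosting appears to be the most suitable, with a test F1 score of 0.779 and a test accuracy of 0.742. Logistic regression, random forest and decision tree all beat the benchmark model on test accuracy, although the decision tree’s test F1 score (0.719) is slightly below the benchmark’s (0.721). Random forest and decision tree also score about 0.99 on the training set but only about 0.70 on the test set, which points to overfitting.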
Naive predictor accuracy score, train: 0.5657859301594901, test: 0.5641974446694128
Naive predictor f1 score, train: 0.7226861849523191, test: 0.7213890376718443
Logistic regression model accuracy score, train: 0.6890690924324486, test: 0.6867974586329679
Logistic regression model f1 score, train: 0.733543605918404, test: 0.7312162971839424
Random forest model accuracy score, train: 0.9917113019539783, test: 0.6982475738322977
Random forest model f1 score, train: 0.9926597238784217, test: 0.7374240583232078
Gradient boosting model accuracy score, train: 0.7491546724916963, test: 0.7424422257906863
Gradient boosting model f1 score, train: 0.7841935899086112, test: 0.7786112944847865
Decision tree model accuracy score, train: 0.9918309943445346, test: 0.6839349298331355
Decision tree model f1 score, train: 0.9927325968321576, test: 0.7185226636821487
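It may look surprising that the naive predictor reaches an F1 score of about 0.72 with an accuracy of only about 0.56. The figures are consistent with a benchmark that predicts "offer completed" for every row (an assumption, since the benchmark is not spelled out here): in that case recall is 1, precision equals the share of positives, and so does accuracy. The short check below works through that arithmetic.

```python
def naive_f1(positive_rate):
    """F1 of a predictor that labels every row positive."""
    precision = positive_rate  # every prediction is positive
    recall = 1.0               # no positive is ever missed
    return 2 * precision * recall / (precision + recall)

# Accuracy of the always-positive predictor equals the positive rate.
train_accuracy = 0.5657859301594901
test_accuracy = 0.5641974446694128

print(round(naive_f1(train_accuracy), 4))  # 0.7227
print(round(naive_f1(test_accuracy), 4))   # 0.7214
```

This is why accuracy and F1 are read together when comparing models against the benchmark: a high F1 score alone can come from always predicting the majority positive class.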
Therefore, the gradient boosting model can be used to predict whether a customer will complete an offer once it is received. Next, let’s try to refine the gradient boosting model by tuning some of its parameters with GridSearchCV.
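As an illustration of the tuning step, here is a minimal GridSearchCV sketch on synthetic data. The parameter grid, the F1 scoring choice and the 3-fold setting are hypothetical, since the grid actually searched is not listed in this article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the Starbucks features and labels.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Hypothetical grid: these values are for illustration only.
param_grid = {
    "n_estimators": [50, 100],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    scoring="f1",  # assumption: optimize F1; accuracy would also be reasonable
    cv=3,
)
search.fit(X_train, y_train)

print(search.best_params_)
print(f"test f1 of the refit best model: {search.score(X_test, y_test):.3f}")
```

After fitting, `search.best_estimator_` holds the model refit on the full training set with the best parameter combination.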
The best parameters for a tuned gradient boosting model are:
Accuracy and f1 score for the tuned gradient boosting model are:
Refined gradient boosting model accuracy score, train: 0.7536431371375565, test: 0.7434894924247714
Refined gradient boosting model f1 score, train: 0.7881750585329457, test: 0.7794717887154863
Model Evaluation and Validation
To validate the gradient boosting model with the best parameters, 5-fold cross-validation is used. The accuracy scores for the five folds, each computed on a different split, are:
array([0.74929312, 0.74248612, 0.74780059, 0.7395266 , 0.73282363])
cross validation scores have 0.742386012164979 accuracy with a standard deviation of 0.005943924418752841
The validation performance is stable and doesn’t fluctuate much, which suggests that the model is robust enough.
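The summary line above can be reproduced from the five fold scores, and the same pattern can be run with `cross_val_score`. The sketch below does both; the second half uses synthetic data and default model settings as stand-ins for the tuned model, so its scores will differ from the reported ones.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Fold scores reported above. NumPy's default std is the population
# standard deviation (ddof=0), which matches the reported figure.
reported = np.array([0.74929312, 0.74248612, 0.74780059, 0.7395266, 0.73282363])
print(f"{reported.mean():.4f} accuracy with a standard deviation of {reported.std():.4f}")
# prints: 0.7424 accuracy with a standard deviation of 0.0059

# The same procedure on synthetic stand-in data.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
model = GradientBoostingClassifier(random_state=42)
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores, scores.mean(), scores.std())
```

A standard deviation of about 0.006 against a mean of about 0.742 means the folds differ by less than one percentage point on average, which is what the stability claim rests on.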
Let’s then look at the importance of each feature to determine which features play a role in whether a customer will fulfill an offer.
The above plot shows that a customer’s income, the duration of their membership, the size of the reward the offer gives and the customer’s gender all play an important role in determining whether a customer is likely to fulfill a received offer. In addition, based on the previous metrics, Starbucks may want to send more offers to high-income female customers aged 60–80, preferably discount offers.
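For reference, feature importances of a fitted gradient boosting model are available through its `feature_importances_` attribute. The sketch below shows how a ranking like the one in the plot could be produced; the data is synthetic and the feature names are hypothetical labels, so the resulting order says nothing about the Starbucks data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical feature names attached to synthetic data.
feature_names = ["income", "member_years", "reward", "gender_F", "age", "difficulty"]
X, y = make_classification(
    n_samples=300, n_features=len(feature_names), n_informative=3, random_state=42
)

model = GradientBoostingClassifier(random_state=42).fit(X, y)

# Impurity-based importances; they are normalized to sum to 1.
importances = (
    pd.Series(model.feature_importances_, index=feature_names)
    .sort_values(ascending=False)
)
print(importances)
```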
With the best parameters, the gradient boosting model’s test accuracy increases from 0.7424 to 0.7435, and its test F1 score increases from 0.7786 to 0.7795. The improvement is minimal, but these metrics could potentially be increased further if more parameters are tuned, which can be explored as a next step. For now, the current model is sufficient, as the insights generated from the gradient boosting model are in line with the simple metrics calculated from the data.
In this project, Starbucks data is analyzed in order to determine which demographic groups best respond to offers. To solve this problem, the analysis went through several steps, including basic understanding of data, data cleaning and preprocessing, EDA, data modeling and evaluation. The finalized model is adequate for the analysis but refinements in the future can also be considered.
For the future work on the analysis, we can explore other machine learning tools to see their performance on predicting whether a customer will complete a received offer; or we can focus on the gradient boosting model and tune parameters to increase its performance.
If the suggestions above about which demographic groups Starbucks should send offers to are implemented, an A/B test can also be run to measure metrics (such as how many offers are completed before and after the change) and determine whether the suggestions make a difference.
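One simple way to evaluate such an A/B test is a two-proportion z-test on the completion rates of the control and treatment groups. The sketch below is only an illustration: the function name is hypothetical, the counts are made up, and a real experiment would also need a planned sample size and a proper randomization of customers.

```python
import math

def two_proportion_z_test(completed_a, received_a, completed_b, received_b):
    """Two-sided z-test for a difference in completion rate between two groups."""
    p_a = completed_a / received_a
    p_b = completed_b / received_b
    # Pooled completion rate under the null hypothesis of no difference.
    pooled = (completed_a + completed_b) / (received_a + received_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / received_a + 1 / received_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    normal_cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - normal_cdf)
    return z, p_value

# Hypothetical counts: group A keeps the current targeting, group B uses the new one.
z, p_value = two_proportion_z_test(
    completed_a=500, received_a=1000,
    completed_b=560, received_b=1000,
)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value would suggest that the change in targeting, rather than chance, explains the difference in completion rate.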
A Github repository with detailed code for this project can be found here.