This analysis uses the US Consumer Expenditure Survey dataset for
1996-2000, containing 12,000 rows and 220 columns. The objective was to propose one way of using the data with one of the following methods: regression, classification or clustering. This presentation shows my approach and methodology. I also share the insights from the model and how they could be presented to a senior-level manager, with or without the technical details.
2. Objective
• Propose one way of using the data employing one of the following
methods: regression, classification or clustering. Execute your
proposal and discuss your methodology, justify your algorithm/
feature selection and share insights from the model.
• Dataset: Consumer Expenditure Survey for 1996-2000 (12k rows, 220
columns)
3. A typical American family
This infographic summarizes the consumer demographics in the expenditure data. It provides a good macro overview of the
dataset and what can be expected from it.
About Chart
2.3 vehicles per family
77% own a home
2.8 members per family
1.5 earning members per family
4. How much do they earn?
Income
$53,147
Expenses
$40,679
Taxes
$9,962
Description
For every dollar earned by the family members, about 78 cents are
used to pay various expenses to support and maintain the family. About 20
cents are used to pay various taxes, including social security.
Maintenance
About 60% of the expenses go
towards non-discretionary items
like rent and food.
Entertainment
The remaining 40% goes to
discretionary items like alcohol,
entertainment and travel.
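As a quick arithmetic check, the three headline averages on this slide can be turned into "cents per dollar of income". This is only a sketch: the helper name `cents_per_dollar` is made up for illustration, and the slide does not say how its own 78-cent and 20-cent figures were derived, so a small difference from the ratios of the printed means is to be expected.

```python
# Mean annual figures as printed on the slide (US dollars).
income = 53_147
expenses = 40_679
taxes = 9_962


def cents_per_dollar(amount, base):
    """Return how many cents of each dollar of `base` go to `amount`."""
    return round(100 * amount / base, 1)


expense_cents = cents_per_dollar(expenses, income)
tax_cents = cents_per_dollar(taxes, income)
# Whatever is left after expenses and taxes.
remainder_cents = round(100 - expense_cents - tax_cents, 1)

print(expense_cents, tax_cents, remainder_cents)
```

Using the printed means, expenses come to roughly 76.5 cents and taxes to roughly 18.7 cents per dollar of income, close to the rounded figures quoted in the description.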
5. Where does the money go?
[Bar chart: spend by category. Category labels: Rent, Alcohol, Tobacco, Entertainment, Clothes, Utilities, Transport, Food. Bar values: $765, $1,489, $1,956, $2,806, $3,921, $5,821, $10,687.]
7. Potential questions data can answer?
Who are these people?
Who are these people? What are their demographics? Should
we customize the product for the diversity?
Targeting specific groups
Why should we target certain demographics? Why would they
buy the product from you?
Potential reach
Where should they grow the business next?
Is this necessary for them to get your product? If so, how
frequently?
What motivates them to buy?
How much elasticity do they have in purchasing the product?
Would they be OK with price increases, or would this product be
a battle over prices?
1. Other macroeconomic indicators can also be calculated using this data.
But since our focus is on a CE goods company, we will exclude them.
8. Steps for the analysis
Step 04
Step 03
Step 02
Step 01
Initial Analysis
After eliminating lag variables, a pairwise correlation
analysis was performed to identify key variables.
Calculations
Calculated savings using residual & net worth methods
to identify elasticity of each demographic.
Understanding the data
K-Means to identify clusters within the groups. Decision
trees & ridge regression to understand the expenses.
Validation
Tried to understand the clusters and the data
patterns to get additional insights.
Presentation
Preparation of the results in a simple form to be
presented to the consumer goods executive team.
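The "Initial Analysis" step above (drop lag variables, then look at pairwise correlations) could be sketched roughly as follows. This is an illustration under assumptions, not the code used for the slides: the column names, the `_lag` suffix convention, the 0.8 threshold and the toy values are all hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / sqrt(var_x * var_y)


def strong_pairs(columns, threshold=0.8, lag_suffix="_lag"):
    """Drop lag columns (assumed naming), then list strongly correlated pairs."""
    kept = {name: vals for name, vals in columns.items()
            if not name.endswith(lag_suffix)}
    pairs = []
    for a, b in combinations(sorted(kept), 2):
        r = pearson(kept[a], kept[b])
        if abs(r) >= threshold:
            pairs.append((a, b, round(r, 3)))
    return pairs


# Toy data with hypothetical column names, for illustration only.
toy = {
    "income": [20, 35, 50, 65, 80],
    "clothes": [2, 3, 5, 6, 8],
    "income_lag": [18, 33, 49, 60, 79],
    "family_size": [3, 1, 4, 2, 3],
}
pairs = strong_pairs(toy)
print(pairs)
```

In this toy example only the `clothes` and `income` pair passes the threshold, and the lag column is never compared.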
9. Demographics (using clustering)
Rich / Super Rich
3.7%
Single earner
25.7%
Singles
25.0%
Working spouse
33.1%
Widows
12.5%
59 years old
Mostly female
1 member
High school
46 years old
Mostly female
1 to 2 members
Some degree
45 years old
Mostly male
3 to 6 members
College, no degree
55 years old
Mostly male
2 to 4 members
College educated
47 years old
Fe(Male)
3 to 5 members
Bachelors degree
These clusters were derived using the K-Means clustering algorithm. The cluster names are based on the key separating features of each cluster. I also included the
calculated residual savings and net worth savings as clustering features. The outliers were kept in a separate cluster, which is named the super rich
or the 0.01 percenters. Additional cluster-level information can be found in the slide notes for this page.
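For readers unfamiliar with K-Means, the core loop can be sketched in a few lines. This is a minimal standard-library illustration of the technique, not the code behind the slide: the `kmeans` helper, the choice of the first k points as starting centroids and the toy two-feature points are assumptions made for the example. On real survey data the features would normally be scaled first.

```python
from math import dist  # Euclidean distance, Python 3.8+


def kmeans(points, k, iterations=20):
    """Plain K-Means (Lloyd's iteration) with a naive deterministic start."""
    # Assumption for the sketch: use the first k points as initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids


# Toy, already-scaled points forming two obvious groups (illustrative only).
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (8.8, 9.1), (0.9, 1.1), (9.2, 8.9)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The sketch does not reproduce the separate handling of outliers described above; that step would need its own rule for what counts as an outlier.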
10. Elasticity (expense / income) [4]
                      Widows   Singles   Working Spouse   Single Earner
Income [5]            $19      $42       $70              $42
Clothes [2]           38%      22%       17%              25%
Alcohol / Tobacco     10%      7%        7%               9%
Entertainment         2%       1%        2%               1%
Residual Savings [1]  $18      $35       $49              $35
Net worth savings     $0       $5        $42              $13
1. The residual savings are somewhat inflated by outlier data points that fall on the cluster boundary. I did not have time to clean these up.
2. For food, I should have included food away from home and working expenses. A potential link to elasticity could have helped further.
3. The (super) rich spend about 7 to 11% on clothes, 2 to 4% on alcohol/tobacco and 1% on entertainment.
4. I would also carry out the elasticity analysis over the lag variables to determine the sensitivity to price (this data was not used).
5. All income values are in 10,000s.
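The slide uses "elasticity" in the sense of category spend divided by income. A small sketch of that ratio, per cluster, is shown below. The function name `expense_shares` and the dollar amounts are hypothetical and chosen only so the arithmetic is easy to follow; they are not taken from the dataset.

```python
def expense_shares(income, spend_by_category):
    """Return each category's spend as a whole-number percent of income."""
    if income <= 0:
        raise ValueError("income must be positive")
    return {category: round(100 * amount / income)
            for category, amount in spend_by_category.items()}


# Hypothetical cluster averages, for illustration only.
example_cluster = {
    "income": 40_000,
    "spend": {"clothes": 4_000, "alcohol_tobacco": 800, "entertainment": 2_800},
}
shares = expense_shares(example_cluster["income"], example_cluster["spend"])
print(shares)
```

Note that this ratio is a spend share, not a price elasticity; the lag-variable analysis mentioned in note 4 would be needed to say anything about sensitivity to price.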
14. Clothing spend (decision trees)
Gradient Boosted
Tried this approach to see whether
building multiple decision trees
changes the variable importance for
clothing spend.
Simple decision tree
A quick look at the variable
importance from building a single
decision tree. These line up with
the variables found via correlation
analysis.
Income: 17%
Residual savings: 14%
Education: 5%
Vehicles: 5%
Hours worked: 4%

Income: 68%
Renter: 9%
Residual Savings: 6%
West US: 5%
Education: 4%
1. Explained variance is 0.35 for decision trees vs 0.48 for gradient boosted trees.
2. RMSE is 5323 for decision trees vs 4765 for gradient boosted trees.
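The two metrics quoted in these notes can be computed as sketched below. The slides do not say which tool or exact definition produced the 0.35/0.48 and 5323/4765 figures, so this uses the common definitions (explained variance as one minus the variance of the residuals over the variance of the target, and RMSE as the root of the mean squared error) and made-up numbers purely for illustration.

```python
from math import sqrt
from statistics import pvariance


def rmse(y_true, y_pred):
    """Root mean squared error between actual and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared) / len(squared))


def explained_variance(y_true, y_pred):
    """1 - Var(residuals) / Var(y_true), using population variance."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1 - pvariance(residuals) / pvariance(y_true)


# Made-up clothing-spend values and predictions, for illustration only.
actual = [1000, 2000, 3000, 4000]
predicted = [1500, 1500, 3500, 3500]

error = rmse(actual, predicted)
score = explained_variance(actual, predicted)
print(error, score)
```

Comparing two models on the same held-out rows with both metrics, as the notes do, guards against a model that looks good on one measure only.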
These clusters were derived using the K-Means clustering algorithm. The cluster names are based on the key separating features of each cluster. I also included the calculated residual savings and net worth savings as clustering features. The outliers were kept in a separate cluster, which is named the super rich or the 0.01 percenters.
Non-Working Widows:
Observations:
• 40.88% of the cluster has 2 for marital (against 7.66% globally)
• 83.82% of the cluster has \N for emptype (against 24.41% globally)
• 83.82% of the cluster has \N for empstat (against 24.50% globally)
Rich:
Observations:
• wages_calc is on average 245% greater: mean of 190k against 55002 globally
• expenses is on average 246% greater: mean of 180k against 53148 globally
• residual_savings is on average 172% greater: mean of 110k against 40679 globally
Singles:
Observations:
• 33.53% of the cluster has 5 for marital (against 10.52% globally)
• 46.29% of the cluster has 3 for marital (against 15.34% globally)
• 97.24% of the cluster has 0 for married (against 36.23% globally)
Working Spouses:
Observations:
• 36.28% of the cluster has 1 for working_part_spouse (against 15.40% globally)
• 45.41% of the cluster has 40 for hrswkd_spouse (against 19.98% globally)
• 98.90% of the cluster has 1 for working_spouse (against 44.15% globally)
Single Earner:
Observations:
• 67.08% of the cluster has 0 for wkswkd_spouse (against 17.98% globally)
• 67.46% of the cluster has 0 for hrswkd_spouse (against 18.09% globally)
• 52.79% of the cluster has \N for empstat (against 24.50% globally)
Super Rich:
Observations:
• net_worth_savings is on average 1531% greater: mean of 400k against 24731 globally
• expenses is on average 564% greater: mean of 350k against 53148 globally
• wages_calc is on average 559% greater: mean of 360k against 55002 globally
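Observations of the form "X% of the cluster has value V for feature F (against Y% globally)" can be produced with a simple frequency comparison. The sketch below is one possible way to do it, not the tool that generated these notes: the `overrepresented` helper, the `min_lift` cutoff and the toy records are assumptions for illustration.

```python
from collections import Counter


def overrepresented(rows, labels, cluster, feature, min_lift=1.5):
    """List values of `feature` that are much more common in `cluster`
    than in the data overall, as (value, cluster %, global %)."""
    global_counts = Counter(row[feature] for row in rows)
    members = [row for row, lab in zip(rows, labels) if lab == cluster]
    cluster_counts = Counter(row[feature] for row in members)
    found = []
    for value, count in cluster_counts.items():
        cluster_pct = 100 * count / len(members)
        global_pct = 100 * global_counts[value] / len(rows)
        if cluster_pct >= min_lift * global_pct:
            found.append((value, round(cluster_pct, 2), round(global_pct, 2)))
    # Most dominant value first.
    return sorted(found, key=lambda item: -item[1])


# Toy records and cluster labels (hypothetical, for illustration only).
rows = [{"married": m} for m in [0, 0, 0, 1, 1, 1, 1, 1]]
labels = ["singles"] * 4 + ["other"] * 4

observations = overrepresented(rows, labels, "singles", "married")
for value, cluster_pct, global_pct in observations:
    print(f"{cluster_pct}% of the cluster has {value} for married "
          f"(against {global_pct}% globally)")
```

For numeric features such as wages_calc, the analogous comparison would be the cluster mean against the global mean rather than value frequencies.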