Analysis on the US Consumer Expenditure

Consumer Expenditure
Setu Chokshi
14th July 2017

Objective
• Propose one way of using the data employing one of the following
methods: regression, classification or clustering. Execute your
proposal and discuss your methodology, justify your algorithm/
feature selection and share insights from the model.
• Dataset: Consumer Expenditure Survey for 1996-2000 (12k rows, 220
columns)

A typical American family
This infographic summarizes the consumer demographics in the expenditure data. It gives a good macro overview of the
dataset and of what can be expected from it.
About Chart
2.3 vehicles per family
77% own a home
2.8 members per family
1.5 earning members per family

How much do they earn?
Description
For every dollar earned by the family members, about 78 cents are
used to pay various expenses to support and maintain the family. 20
cents are used to pay various taxes including social security.
Income
$53,147
Expenses
$40,679
Taxes
$9,962
Maintenance
About 60% of the expenses go towards non-discretionary items
such as rent and food.
Entertainment
The remaining 40% is spent on discretionary items such as alcohol,
entertainment and travel.
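The per-dollar shares in the description can be checked directly from the three averages on this slide. The sketch below only redoes that arithmetic; the variable names are illustrative, and because the slide quotes rounded figures, the exact ratios differ slightly from "78 cents" and "20 cents".

```python
# Per-family averages quoted on the slide (US dollars).
income = 53_147
expenses = 40_679
taxes = 9_962

# Share of each earned dollar that goes to expenses and to taxes.
expense_share = expenses / income
tax_share = taxes / income

# What is left once both expenses and taxes are paid.
remainder = income - expenses - taxes

print(f"expenses per dollar earned: {expense_share:.3f}")
print(f"taxes per dollar earned:    {tax_share:.3f}")
print(f"left after both:            ${remainder:,}")
```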

Where does the money go?
[Chart: spending by category. Categories: Rent, Alcohol, Tobacco, Entertainment, Clothes, Utilities, Transport, Food. Values shown: $765, $1,489, $1,956, $2,806, $3,921, $5,821, $10,687.]

Potential questions data can answer?
Who are these people?
What are their demographics? Should we customize the product
for this diversity?
Targeting specific groups
Why should we target certain demographics? Why would they
buy the product from you?
Potential reach
Where should they grow the business next?
Is your product a necessity for them? If so, how
frequently do they need it?
What motivates them to buy?
How much elasticity do they have in purchasing the product?
Would they accept price increases, or would this product become
a battle over prices?
1. Other macroeconomic indicators can also be calculated from this data,
but since our focus is on a CE goods company, we exclude them.

Steps for the analysis
Initial Analysis
After eliminating lag variables, a pairwise correlation
analysis was performed to identify key variables.
Calculations
Calculated savings using the residual and net worth methods
to identify the elasticity of each demographic.
Understanding the data
K-Means to identify clusters within the groups. Decision
trees and ridge regression to understand the expenses.
Validation
Examined the clusters and the data patterns to get
additional insights.
Presentation
Prepared the results in a simple form to be
presented to the consumer goods executive team.
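The deck does not define the residual and net worth methods, so the sketch below rests on assumed definitions: residual savings as income minus taxes minus expenses, and net worth savings as the change in net worth over the period. The function names and the sample household are hypothetical.

```python
def residual_savings(income, taxes, expenses):
    """Assumed definition: income left after taxes and expenses."""
    return income - taxes - expenses


def net_worth_savings(net_worth_start, net_worth_end):
    """Assumed definition: change in net worth over the period."""
    return net_worth_end - net_worth_start


# Hypothetical household, all values in US dollars.
household = {
    "income": 50_000,
    "taxes": 9_000,
    "expenses": 38_000,
    "net_worth_start": 120_000,
    "net_worth_end": 122_500,
}

resid = residual_savings(
    household["income"], household["taxes"], household["expenses"]
)
net = net_worth_savings(
    household["net_worth_start"], household["net_worth_end"]
)
print(resid, net)
```

The two methods can disagree for the same household, for example when asset prices move, which is one reason to compute both.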

Demographics (using clustering)
Cluster shares:
Rich / Super Rich: 3.7%
Single earner: 25.7%
Singles: 25.0%
Working spouse: 33.1%
Widows: 12.5%
Cluster profiles (in chart order):
59 years old, mostly female, 1 member, high school
46 years old, mostly female, 1 to 2 members, some degree
45 years old, mostly male, 3 to 6 members, college, no degree
55 years old, mostly male, 2 to 4 members, college educated
47 years old, Fe(Male), 3 to 5 members, Bachelors degree
These clusters were derived using the K-Means clustering algorithm. Each cluster is named after the features that most clearly separate it from the others. The calculated
residual savings and net worth savings were included as clustering features as well. The outliers were kept in a separate cluster, named the super rich
or the 0.01 percenters. Additional cluster-level information can be found in the slide notes for this page.
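A minimal sketch of this kind of clustering with scikit-learn follows. It uses synthetic data rather than the survey, and the three features, the two-cluster setting and the scaling step are assumptions made for illustration, not the configuration used in the deck.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for household features: [age, family_size, income].
group_a = rng.normal([60, 1, 20_000], [5, 0.3, 4_000], size=(100, 3))
group_b = rng.normal([45, 4, 70_000], [5, 0.8, 8_000], size=(100, 3))
X = np.vstack([group_a, group_b])

# Scale first so that income does not dominate the distance metric.
X_scaled = StandardScaler().fit_transform(X)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_scaled)
labels = kmeans.labels_

# Describe each cluster by its share and its mean feature values,
# which is how a descriptive name could then be chosen.
for k in range(2):
    members = X[labels == k]
    share = len(members) / len(X)
    print(k, f"{share:.1%}", members.mean(axis=0).round(1))
```

Reading the per-cluster means in the original units, rather than the scaled ones, makes it easier to attach labels such as "Singles" or "Working spouse".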

Elasticity (expense / income) [4]

                       Widows   Singles   Working Spouse   Single Earner
Income [5]             $19      $42       $70              $42
Clothes [2]            38%      22%       17%              25%
Alcohol / Tobacco      10%      7%        7%               9%
Entertainment          2%       1%        2%               1%
Residual Savings [1]   $18      $35       $49              $35
Net worth savings      $0       $5        $42              $13

1. The residual savings are somewhat inflated by outlier data points that fall on the cluster boundary. These were not cleaned up for lack of time.
2. For food, I should have included food away from home and working expenses. A potential link to elasticity could have helped further.
3. The (super) rich spend about 7 to 11% on clothes, 2 to 4% on alcohol/tobacco and 1% on entertainment.
4. I would also carry out the elasticity analysis over the lag variables to determine the sensitivity towards price (data not used).
5. All income values in 10,000’s
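The ratio in the slide title, expense divided by income, can be computed per cluster as sketched below. The household records are hypothetical, and the code computes only this spending share; it is not a price elasticity in the economic sense.

```python
from collections import defaultdict

# Hypothetical households: (cluster name, income, clothes spend).
households = [
    ("Singles", 40_000, 2_000),
    ("Singles", 44_000, 2_200),
    ("Widows", 20_000, 1_000),
]

# Accumulate [total spend, total income] per cluster.
totals = defaultdict(lambda: [0.0, 0.0])
for cluster, income, spend in households:
    totals[cluster][0] += spend
    totals[cluster][1] += income

# Expense / income ratio per cluster.
ratios = {c: spend / income for c, (spend, income) in totals.items()}
for cluster, ratio in ratios.items():
    print(f"{cluster}: {ratio:.1%}")
```

Dividing the totals, rather than averaging per-household ratios, keeps a few very low-income households from dominating the result.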

Pairwise Correlation Analysis (sklearn)
Unsorted Sorted
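One way to produce the sorted view is to rank variables by the absolute value of their correlation with a target column. The sketch below uses NumPy's `corrcoef` on synthetic data, which is a swapped-in choice; the slide title mentions sklearn, and the column names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic columns: clothes spend depends on income, noise does not.
income = rng.normal(50_000, 15_000, n)
clothes = 0.05 * income + rng.normal(0, 500, n)
noise = rng.normal(0, 1, n)

names = ["income", "clothes", "noise"]
data = np.column_stack([income, clothes, noise])

# Pairwise Pearson correlation matrix (columns are variables).
corr = np.corrcoef(data, rowvar=False)

# Rank the other variables by |correlation| with the target column.
target = names.index("clothes")
order = np.argsort(-np.abs(corr[target]))
ranked = [
    (names[i], round(float(corr[target, i]), 2)) for i in order if i != target
]
print(ranked)
```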

t-SNE for cluster analysis (sklearn)
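A minimal sketch of a t-SNE projection with scikit-learn, assuming standardized numeric features. The data is synthetic, and the perplexity and random seed are illustrative choices rather than the settings behind the chart on this slide.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two synthetic groups of 60 points in 5 dimensions.
X = np.vstack([rng.normal(0, 1, (60, 5)), rng.normal(6, 1, (60, 5))])
X_scaled = StandardScaler().fit_transform(X)

# Project to 2-D for plotting; perplexity must be below the sample count.
embedding = TSNE(
    n_components=2, perplexity=30, random_state=0
).fit_transform(X_scaled)
print(embedding.shape)
```

Coloring the 2-D points by their K-Means label is a common way to check visually whether the clusters are well separated.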

Clothing spend (decision trees)
Gradient Boosted
Tried this approach to see whether
building multiple decision trees
changes the variable importance for
the clothing spend.
Simple decision tree
A quick look at the variable
importance from a single
decision tree. The results line up with
the variables found via the correlation
analysis.
Variable importance (first chart):
Income 17%
Residual savings 14%
Education 5%
Vehicles 5%
Hours worked 4%
Variable importance (second chart):
Income 68%
Renter 9%
Residual Savings 6%
West US 5%
Education 4%
1. Explained variance is 0.35 for the decision tree vs 0.48 for the gradient boosted trees
2. RMSE is 5323 for the decision tree vs 4765 for the gradient boosted trees
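The comparison above can be sketched with scikit-learn as follows. The data is synthetic, and the feature names, tree depth and train/test split are assumptions; the sketch shows how variable importance, explained variance and RMSE are obtained, not the deck's actual models or numbers.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import explained_variance_score, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n = 1000

# Synthetic features; the target depends mostly on the first one.
names = ["income", "residual_savings", "education"]  # hypothetical names
X = rng.normal(size=(n, 3))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(0, 1.0, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "decision tree": DecisionTreeRegressor(max_depth=4, random_state=0),
    "gradient boosted": GradientBoostingRegressor(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
    ev = float(explained_variance_score(y_test, pred))
    results[name] = (ev, rmse, model.feature_importances_)
    print(name, round(ev, 2), round(rmse, 2))
    print(dict(zip(names, model.feature_importances_.round(2))))
```

Both metrics are computed on held-out data, so they reflect how well each model generalizes rather than how well it fits the training set.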

Food_Away Analysis using Ridge Regression
See the reference Excel sheet.
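Since the details live in the spreadsheet, the following is only a generic sketch of ridge regression with scikit-learn on synthetic data. The predictors, the `alpha` value and the scaling step are assumptions, and `food_away` here is a made-up target standing in for the Food_Away variable.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 400

# Synthetic predictors: [income, family_size, unrelated noise].
X = np.column_stack([
    rng.normal(50_000, 15_000, n),
    rng.integers(1, 7, n),
    rng.normal(0, 1, n),
])
# Made-up target: spend on food away from home.
food_away = 0.03 * X[:, 0] + 150 * X[:, 1] + rng.normal(0, 200, n)

# Standardize so the L2 penalty treats all predictors on the same scale.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X, food_away)
coefs = model.named_steps["ridge"].coef_
print(coefs.round(1))
```

With standardized inputs, the coefficient magnitudes can be compared directly to see which predictors carry the most weight.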
