Predicting the future is hard and it requires a lot of assumptions, also known as beliefs, also known as faith. In “Assumptions: Check yo self, before you wreck yo self” we explore the consequences of beliefs when constructing predictive models. We’ll walk through the process of developing a demand forecast for Evo, a Seattle-based outdoor recreation retailer, and discuss how assumptions influence the behavior of your application and ultimately the decisions you make.
15. Do the easiest thing
• Subset the data and focus on one category of product.
• e.g. Alpine ski bindings.
• Prototype & validate in R.
$\text{Units Sold}_i = \alpha + \beta_1 \, \text{price}_i + \varepsilon_i$
16. Do the easiest thing
$\text{Units Sold}_i = \alpha + \beta_1 \, \text{price}_i + \varepsilon_i$
Residual: $\varepsilon_i$
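A minimal sketch of this first model in R, assuming a data frame with `price` and `units_sold` columns. The data is simulated purely for illustration (the Evo sales data is not shown in the slides), and `fit` is a placeholder name.

```r
# Simulated stand-in for one product category (not the real Evo data).
set.seed(2014)
slr_data <- data.frame(price = runif(200, min = 100, max = 500))
slr_data$units_sold <- 2000 - 3 * slr_data$price + rnorm(200, sd = 150)

# Units Sold_i = alpha + beta_1 * price_i + epsilon_i
fit <- lm(units_sold ~ price, data = slr_data)

coef(fit)              # estimates of alpha (intercept) and beta_1 (price)
head(residuals(fit))   # estimated epsilon_i, the part price does not explain
```

The residuals are what the assumption checks later in the deck operate on.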
17. Assumptions of SLR
• We assume that the residuals:
1. Are normally distributed, with mean zero.
2. Are not autocorrelated.
3. Are unrelated to the predictors.
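Each of the three assumptions has a numeric counterpart in base R. The sketch below is one way to compute them, on simulated data; the choice of Shapiro-Wilk, Ljung-Box at lag 1, and a plain correlation is an assumption for illustration, and the name `checks` is hypothetical.

```r
# Simulated stand-in data (not the real Evo data).
set.seed(2014)
slr_data <- data.frame(price = runif(200, min = 100, max = 500))
slr_data$units_sold <- 2000 - 3 * slr_data$price + rnorm(200, sd = 150)

fit <- lm(units_sold ~ price, data = slr_data)
errors <- residuals(fit)

checks <- list(
  # 1. Normality: Shapiro-Wilk, null hypothesis = residuals are normal
  normality_p = shapiro.test(errors)$p.value,
  # 2. Autocorrelation: Ljung-Box at lag 1, null = no autocorrelation
  #    (only meaningful when the rows are in time order)
  autocorrelation_p = Box.test(errors, lag = 1, type = "Ljung-Box")$p.value,
  # 3. Relationship to the predictor: correlation should be near zero
  predictor_cor = cor(errors, slr_data$price)
)
str(checks)
```

Note that check 3 is zero by construction for the regressor the model was fitted on, so it is most informative against variables or transformations that are not already in the model.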
18. Checking assumptions is hard
• …and boring!
• For statistical methods, assumption testing traditionally relies on visually inspecting plots (and let’s be real, most people don’t even do that).
23. Psych.
> test_file("./tests/test_slr.R")
Check assumptions of SLR : [1] "units_sold ~ price"
1..

1. Failure(@test_slr.R#12): The residuals are normally distributed ------------------------
shapiro.test(model_object$residuals)$p.value not more than 0.05. Difference: 0.05
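The contents of `test_slr.R` are not shown in the slides. The failure messages suggest expectations roughly like the ones below. This is a guess at the file's shape, not the author's code: it uses testthat's `test_that()`, `expect_gt()` and `expect_equal()`, and a hypothetical `model_object` built from deterministic stand-in data so that it runs on its own.

```r
library(testthat)

# Hypothetical stand-in for the model under test (not the Evo data).
# Noise is a fixed shuffle of normal quantiles so the example is repeatable.
n <- 101
price <- seq(100, 500, length.out = n)
noise <- qnorm(ppoints(n))[(seq_len(n) * 37) %% n + 1]
units_sold <- 2000 - 3 * price + 50 * noise
model_object <- lm(units_sold ~ price)

test_that("The residuals are normally distributed", {
  # Shapiro-Wilk: a p-value above 0.05 means normality is not rejected
  expect_gt(shapiro.test(model_object$residuals)$p.value, 0.05)
})

test_that("The residuals are unrelated to the predictor", {
  # expect_equal() compares with a small numeric tolerance
  expect_equal(cor(model_object$residuals, price), 0)
})
```

In the deck's workflow the objects under test would presumably be loaded from the saved environment rather than built inside the test file.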
24. Linear? Eh.
• We assumed the functional form was linear, but there are several common forms that might better fit the data.
[Figure: Units Sold (axis from 0 to 2500) plotted against Price ($) (axis from 100 to 500).]
25. [Figure: four panels, each plotting Units Sold against Price ($): Linear, Log-log, Linear-log, Log-linear.]
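Written out, the four forms are conventionally specified as follows. This is a sketch in standard notation that reuses the symbols of the linear model; the slides show only the plots.

```latex
\begin{align*}
\text{Linear:}     \quad & \text{Units Sold}_i = \alpha + \beta_1\,\text{price}_i + \varepsilon_i \\
\text{Log-log:}    \quad & \log(\text{Units Sold}_i) = \alpha + \beta_1 \log(\text{price}_i) + \varepsilon_i \\
\text{Linear-log:} \quad & \text{Units Sold}_i = \alpha + \beta_1 \log(\text{price}_i) + \varepsilon_i \\
\text{Log-linear:} \quad & \log(\text{Units Sold}_i) = \alpha + \beta_1\,\text{price}_i + \varepsilon_i
\end{align*}
```

In the log-log form, $\beta_1$ can be read as the price elasticity of demand: the approximate percentage change in units sold for a one percent change in price.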
26. [Figure: four panels, each plotting Units Sold against Price ($), captioned in order:]
• Linear response to change in price.
• Much more sensitive to change in price.
• More gradual response to changes in price.
• Sensitive initially, then gradual.
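To see how differently the forms respond to price, one option is to fit all four to the same data and compare predicted units at a few prices. The sketch below does that on simulated constant-elasticity data. The data, the chosen prices, and the names `forms`, `fits` and `predicted_units` are all assumptions for illustration.

```r
# Simulated stand-in data with constant price elasticity (not the Evo data).
set.seed(2014)
slr_data <- data.frame(price = runif(200, min = 100, max = 500))
slr_data$units_sold <- round(exp(13 - 1.2 * log(slr_data$price) + rnorm(200, sd = 0.2)))

forms <- c(linear    = "units_sold ~ price",
           loglog    = "log(units_sold + 1) ~ log(price + 1)",
           linearlog = "units_sold ~ log(price + 1)",
           loglinear = "log(units_sold + 1) ~ price")

# One lm() fit per functional form, keeping the names
fits <- lapply(forms, function(f) lm(as.formula(f), data = slr_data))

new_prices <- data.frame(price = c(100, 200, 400))
predicted_units <- sapply(names(fits), function(name) {
  pred <- predict(fits[[name]], newdata = new_prices)
  # Naive back-transform of log(units_sold + 1) for the logged responses
  if (name %in% c("loglog", "loglinear")) exp(pred) - 1 else pred
})
rownames(predicted_units) <- paste0("$", new_prices$price)
round(predicted_units)
```

Each column is one functional form and each row one price, so reading across a row shows how much the choice of form alone moves the forecast.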
30.
# Automagically explore SLR with common functional forms
library(testthat)  # provides test_file()

# The + 1 offsets keep log() defined when units_sold or price is zero
candidate_models = list(linear    = 'units_sold ~ price',
                        loglog    = 'log(units_sold + 1) ~ log(price + 1)',
                        linearlog = 'units_sold ~ log(price + 1)',
                        loglinear = 'log(units_sold + 1) ~ price')

# generate_forecast(), optimizer() and plot_price() are helpers not shown on this slide
run = function(candidate_models, input_data) {
  forecasts = list()
  test_input = data.frame(price = 0:1000)

  # Forecast
  for (model in candidate_models) {
    test_environment = new.env()

    # Generate the forecast
    forecasts[[model]] = generate_forecast(model, input_data)

    # Save off current value of things for testing
    assign("model", forecasts[[model]], envir = test_environment)
    assign("errors", forecasts[[model]]$residuals, envir = test_environment)
    assign("covariate", input_data$price, envir = test_environment)
    assign("label", model, envir = test_environment)

    save(test_environment, file = 'env_to_test.Rda')

    # Run assumption tests
    test_file("./tests/test_slr.R")

    #### OPTIMIZE PRICE!!! ####
    opt_results = optimizer(forecasts[[model]], test_input)

    # Multiply the predicted demand by the price for expected revenue;
    # test_input holds the candidate prices passed to optimizer()
    opt_results$expected_revenue = test_input$price * opt_results$predicted_units_sold

    # Plot to a PDF named after the model, then close the device
    pdf(paste(model, ".pdf", sep = ""))
    plot_price(opt_results)
    dev.off()
  }

  return(forecasts)
}
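The slide does not show `generate_forecast()` or `optimizer()`. The sketch below is one guess at what such helpers could look like, assuming `generate_forecast()` fits the formula string with `lm()` and `optimizer()` returns predicted units for each candidate price. Both bodies are hypothetical, as is the simulated data.

```r
# Hypothetical helper: fit one candidate formula (given as a string) with lm()
generate_forecast <- function(model, input_data) {
  lm(as.formula(model), data = input_data)
}

# Hypothetical helper: predict demand at each candidate price
optimizer <- function(fit, test_input) {
  pred <- predict(fit, newdata = test_input)
  # If the response was log(units_sold + 1), map predictions back to units
  logged_response <- grepl("^log\\(", deparse(formula(fit)[[2]]))
  if (logged_response) pred <- exp(pred) - 1
  # Demand cannot be negative, so floor predictions at zero
  data.frame(price = test_input$price, predicted_units_sold = pmax(pred, 0))
}

# Simulated stand-in data (not the Evo data)
set.seed(2014)
slr_data <- data.frame(price = runif(200, min = 100, max = 500))
slr_data$units_sold <- 2000 - 3 * slr_data$price + rnorm(200, sd = 20)

fit <- generate_forecast("units_sold ~ price", slr_data)
opt_results <- optimizer(fit, data.frame(price = 0:1000))
opt_results$expected_revenue <- opt_results$price * opt_results$predicted_units_sold

# Candidate price with the highest expected revenue
best <- opt_results[which.max(opt_results$expected_revenue), ]
best
```

One thing to watch with a grid like `0:1000`: any candidate price outside the range seen in the data is an extrapolation, which is itself an assumption about the demand curve.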
31. rut roh…
> run(candidate_models, slr_data)
Check assumptions of SLR : [1] "units_sold ~ price"
1..

1. Failure(@test_slr.R#12): The residuals are normally distributed ---------------------------------
shapiro.test(linear$residuals)$p.value not more than 0.05. Difference: 0.05

Check assumptions of SLR : [1] "log(units_sold + 1) ~ log(price + 1)"
1.2

1. Failure(@test_slr.R#12): The residuals are normally distributed ---------------------------------
shapiro.test(linear$residuals)$p.value not more than 0.05. Difference: 0.05

2. Failure(@test_slr.R#24): The residuals are unrelated to the predictor ---------------------------
cor(test_environment$errors, test_environment$covariate) not equal to 0
Mean absolute difference: 0.05545615

Check assumptions of SLR : [1] "units_sold ~ log(price + 1)"
1.2

1. Failure(@test_slr.R#12): The residuals are normally distributed ---------------------------------
shapiro.test(linear$residuals)$p.value not more than 0.05. Difference: 0.05

2. Failure(@test_slr.R#24): The residuals are unrelated to the predictor ---------------------------
cor(test_environment$errors, test_environment$covariate) not equal to 0
Mean absolute difference: 0.04201906

Check assumptions of SLR : [1] "log(units_sold + 1) ~ price"
1..

1. Failure(@test_slr.R#12): The residuals are normally distributed ---------------------------------
shapiro.test(linear$residuals)$p.value not more than 0.05. Difference: 0.05
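One possible reading of this pattern (an interpretation, not something the slides state): the correlation test passes for the two models whose regressor is raw `price` and fails for the two whose regressor is `log(price + 1)`. Least-squares residuals are uncorrelated by construction with the regressor actually in the model, but not necessarily with other transformations of it, and the saved `covariate` is `input_data$price`. The sketch below reproduces that behaviour on simulated data; all names in it are placeholders.

```r
# Simulated stand-in data (not the Evo data)
set.seed(2014)
slr_data <- data.frame(price = runif(200, min = 100, max = 500))
slr_data$units_sold <- 2000 - 3 * slr_data$price + rnorm(200, sd = 50)

raw_fit <- lm(units_sold ~ price, data = slr_data)
log_fit <- lm(units_sold ~ log(price + 1), data = slr_data)

# Regressor in the model is price: correlation is zero up to rounding error
cor_raw <- cor(residuals(raw_fit), slr_data$price)

# Regressor in the model is log(price + 1): correlation with raw price
# need not vanish...
cor_log_vs_price <- cor(residuals(log_fit), slr_data$price)
# ...but it does vanish against the transformed regressor
cor_log_vs_logprice <- cor(residuals(log_fit), log(slr_data$price + 1))

c(raw = cor_raw,
  log_vs_price = cor_log_vs_price,
  log_vs_logprice = cor_log_vs_logprice)
```

If this reading is right, the nonzero correlation says the log-price models leave some price-related structure unexplained, which may well be the point of the test.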
42. Yeah, but who cares?
• Do we need to throw everything out just because some assumptions are invalidated?
• What is our goal?
• Is it still better than what we did previously?
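The last question can be made concrete by scoring the model and the previous approach on the same held-out data. The slides do not say what was done previously, so the baseline below (always predicting average units sold) is a hypothetical stand-in, as are the simulated data and the `rmse` helper.

```r
# Simulated stand-in data (not the Evo data)
set.seed(2014)
slr_data <- data.frame(price = runif(200, min = 100, max = 500))
slr_data$units_sold <- 2000 - 3 * slr_data$price + rnorm(200, sd = 150)

# Hold out the last 50 rows to score forecasts on data the model has not seen
train <- slr_data[1:150, ]
holdout <- slr_data[151:200, ]

# Root mean squared error of a set of predictions
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Hypothetical "previous approach": always predict the average units sold
baseline_rmse <- rmse(holdout$units_sold, mean(train$units_sold))

# Candidate: the simple linear regression on price
fit <- lm(units_sold ~ price, data = train)
slr_rmse <- rmse(holdout$units_sold, predict(fit, newdata = holdout))

c(baseline = baseline_rmse, slr = slr_rmse)
```

A model that fails a normality test can still forecast better than the benchmark, which is why the goal matters when deciding what to do about a failed assumption.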
43. Wrap it up.
1. Do the easiest thing first, and do it well. It’s how you’re going to learn the domain, and it’s your benchmark for improvement.
2. Test your assumptions, and invest time in building the tools needed to do that effectively.
3. Be cool, stay in school.
44. Thanks bros!!
Nathan Decker, Brian Pratt & the Evo crew
Jason Gowans & Bryan Mayer
Elissa “Downtown” Brown, forecasting genius
John Foreman, MailChimp
#nordstromdatalab
45. Click-bait!
1. Data Carpentry: http://mimno.infosci.cornell.edu/b/articles/carpentry/
2. Getting started with testthat: http://journal.r-project.org/archive/2011-1/RJournal_2011-1_Wickham.pdf
3. Clean Code: http://www.amazon.com/Clean-Code-Handbook-Software-Craftsmanship/dp/0132350882/
4. Quality Code: http://www.amazon.com/Quality-Code-Software-Principles-Practices/dp/0321832981
5. Revenue Management: http://www.amazon.com/Practice-Management-International-Operations-Research/dp/0387243763/
6. Pricing and Revenue Optimization: http://www.amazon.com/Pricing-Revenue-Optimization-Robert-Phillips-ebook/dp/B005JTDOVE/
7. Original G, Rob Hyndman: https://www.otexts.org/fpp and http://robjhyndman.com/hyndsight/