4. Outline
1 Where fools fear to tread
2 Working with inadequate tools
3 When you can’t lose
4 Getting dirty with data
5 Going to extremes
6 Final thoughts
Man vs Wild Data Where fools fear to tread 2
5. My story
Olympic video poker slots
Beware of smelly clients
Threats and slander
Nerves in court
Three university consulting services
Reviewing my own work
Six times an expert witness
Hundreds of clients
13. Outline
2 Working with inadequate tools
17. Disposable tableware company
Problem: Want forecasts of each of hundreds of items. Series can be stationary, trended or seasonal. They currently have a large forecasting program written in-house but it doesn’t seem to produce sensible forecasts. They want me to tell them what is wrong and fix it.
Additional information
Program written in COBOL making numerical calculations limited. It is not possible to do any optimisation.
Their programmer has little experience in numerical computing.
They employ no statisticians and want the program to produce forecasts automatically.
18. Disposable tableware company
Methods currently used
A 12 month average
C 6 month average
E straight line regression over last 12 months
G straight line regression over last 6 months
H average slope between last year’s and this year’s values. (Equivalent to differencing at lag 12 and taking mean.)
I Same as H except over 6 months.
K I couldn’t understand the explanation.
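Method H, as described, amounts to averaging the lag-12 differences. A minimal Python sketch, assuming monthly data in a plain list and that the average is taken over the most recent 12 differences (the function name and the `window` argument are illustrative and not taken from the client's program):

```python
def mean_lagged_difference(y, lag=12, window=12):
    """Average of y[t] - y[t - lag] over the last `window` observations."""
    if len(y) < lag + window:
        raise ValueError("need at least lag + window observations")
    # Difference each of the most recent `window` values against the
    # value observed `lag` periods earlier, then take the mean.
    diffs = [y[t] - y[t - lag] for t in range(len(y) - window, len(y))]
    return sum(diffs) / len(diffs)
```

For a series that rises by one unit per month, every lag-12 difference is 12, so the function returns 12.0, which is one year's growth.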
19. Disposable tableware company
My solution
Use first differencing to deal with trend, or seasonal differencing to deal with seasonality.
Use simple exponential smoothing on (differenced) data with the parameter selected from {0.1, 0.3, 0.5, 0.7, 0.9}.
For each series, try 15 models: three differencing options (none, first, seasonal), each combined with SES at the 5 parameter values.
Model selected based on smallest MSE. (Only one parameter for each model, so no need to penalize for model size.)
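The 15-model search can be sketched in a few lines of Python. This is an illustration only, not the program delivered to the client: the function names, the default 12-month seasonal lag, and the choice to start the smoothed level at the first observation are all assumptions made for the example.

```python
ALPHAS = (0.1, 0.3, 0.5, 0.7, 0.9)

def difference(y, lag):
    """Return y differenced at the given lag (lag 0 means no differencing)."""
    if lag == 0:
        return list(y)
    return [y[t] - y[t - lag] for t in range(lag, len(y))]

def ses_mse(x, alpha):
    """Mean squared one-step-ahead error of simple exponential smoothing."""
    level = x[0]  # assumption: initialise the level at the first observation
    squared_errors = []
    for value in x[1:]:
        squared_errors.append((value - level) ** 2)
        level = alpha * value + (1 - alpha) * level
    return sum(squared_errors) / len(squared_errors)

def candidate_models(y, season=12):
    """Score all 3 x 5 = 15 combinations of differencing and alpha.

    The series must be at least season + 2 observations long.
    """
    results = []
    for name, lag in (("none", 0), ("first", 1), ("seasonal", season)):
        x = difference(y, lag)
        for alpha in ALPHAS:
            results.append((ses_mse(x, alpha), name, alpha))
    return results

def select_model(y, season=12):
    """Return (mse, differencing, alpha) for the smallest MSE on the grid."""
    return min(candidate_models(y, season), key=lambda r: r[0])
```

Because a one-step forecast of the differenced series implies a one-step forecast of the original series with exactly the same error, the MSE values are on a common scale across the three differencing options. The search is a fixed grid, so it needs no numerical optimiser. For a purely periodic series such as `[10, 20, 30, 40] * 10` with `season=4`, `select_model` picks seasonal differencing.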
23. Disposable tableware company
Some lessons
Be pragmatic.
Understand your tools well enough to be able to adapt them.
A successful consulting job often uses very simple methods.
24. Outline
3 When you can’t lose
26. Forecasting the PBS
The Pharmaceutical Benefits Scheme (PBS) is the Australian government drugs subsidy scheme.
Many drugs bought from pharmacies are subsidised to allow more equitable access to modern drugs.
The cost to government is determined by the number and types of drugs purchased. Currently nearly 1% of GDP.
The total cost is budgeted based on forecasts of drug usage.
31. Forecasting the PBS
In 2001: $4.5 billion budget, under-forecasted by $800 million.
Thousands of products. Seasonal demand.
Subject to covert marketing, volatile products, uncontrollable expenditure.
Although monthly data available for 10 years, data are aggregated to annual values, and only the first three years are used in estimating the forecasts.
All forecasts being done with the FORECAST function in MS-Excel!
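Excel's FORECAST(x, known_y's, known_x's) fits a least-squares straight line to the known points and evaluates it at x. The Python sketch below reproduces that calculation to show how little structure such a forecast can capture when it is fed a handful of annual totals. The function name is hypothetical, and any numbers passed to it here are illustrative, not PBS data.

```python
def linear_forecast(x_new, known_ys, known_xs):
    """Least-squares straight line through the known points, evaluated at x_new."""
    n = len(known_xs)
    mean_x = sum(known_xs) / n
    mean_y = sum(known_ys) / n
    # Slope = S_xy / S_xx, intercept chosen so the line passes through the means.
    sxx = sum((x - mean_x) ** 2 for x in known_xs)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(known_xs, known_ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * x_new
```

With three annual values of 10, 20 and 30 in years 1 to 3, the forecast for year 4 is 40: a pure straight-line extrapolation with no way to represent seasonality or a change in trend.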
36. ATC drug classification
A Alimentary tract and metabolism
B Blood and blood forming organs
C Cardiovascular system
D Dermatologicals
G Genito-urinary system and sex hormones
H Systemic hormonal preparations, excluding sex hormones and insulins
J Anti-infectives for systemic use
L Antineoplastic and immunomodulating agents
M Musculo-skeletal system
N Nervous system
P Antiparasitic products, insecticides and repellents
R Respiratory system
S Sensory organs
V Various
37. ATC drug classification
14 classes   A         Alimentary tract and metabolism
84 classes   A10       Drugs used in diabetes
             A10B      Blood glucose lowering drugs
             A10BA     Biguanides
             A10BA02   Metformin
38. Forecasting the PBS
Monthly data on thousands of drug groups and 4 concession types available from 1991.
Method needs to be automated and implemented within MS-Excel.
Exponential smoothing seems appropriate (monthly data with changing trends and seasonal patterns), but in 2001, automated exponential smoothing was not well-developed, and not available in MS-Excel.
As part of this project, we developed an automatic forecasting algorithm for exponential smoothing state space models based on the AIC.
Forecast MAPE reduced from 15–20% to about 0.6%.
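MAPE (mean absolute percentage error) is the average of |actual − forecast| / |actual|, expressed as a percentage. A small Python sketch of the calculation follows; the function name is illustrative and the slide does not say how the PBS figures were computed in detail.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent. Actual values must be non-zero."""
    ratios = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100 * sum(ratios) / len(ratios)
```

For example, forecasts of 90 and 220 against actual values of 100 and 200 are each off by 10%, so the MAPE is 10%.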
43. Forecasting the PBS
[Figure: Total cost, A03 concession safety net group. Vertical axis: $ thousands (0 to 1200); horizontal axis: 1995 to 2010.]
44. Forecasting the PBS
[Figure: Total cost, A05 general copayments group. Vertical axis: $ thousands (0 to 250); horizontal axis: 1995 to 2010.]
45. Forecasting the PBS
[Figure: Total cost, D01 general copayments group. Vertical axis: $ thousands (0 to 700); horizontal axis: 1995 to 2010.]
46. Forecasting the PBS
[Figure: Total cost, S01 general copayments group. Vertical axis: $ thousands (0 to 6000); horizontal axis: 1995 to 2010.]
47. Forecasting the PBS
[Figure: Total cost, R03 general copayments group. Vertical axis: $ thousands (1000 to 7000); horizontal axis: 1995 to 2010.]
48. Forecasting the PBS
Some lessons
Often what people do is very bad, and it is easy to make a big difference.
Sometimes you have to invent new methods, and that can lead to publications.
You have to implement solutions in the client’s software environment.
Be aware of the politics.
49. Outline
4 Getting dirty with data
51. Airline passenger traffic
[Figure: three panels for the Melbourne−Sydney route, 1988 to 1993, by year: First class passengers, Business class passengers, Economy class passengers.]
52. Airline passenger traffic
[Figure: First class, Business class and Economy class passengers, Melbourne−Sydney, 1988 to 1993.]
Not the real data! Or is it?
53. Airline passenger traffic
[Figure: Economy Class Passengers, Melbourne−Sydney. Vertical axis: Passengers (thousands), 0 to 35; horizontal axis: 1988 to 1993.]
56. Possible model
$$Y_t = Y_t^* + Z_t$$
$$Y_t^* = \beta_0 + \sum_j \beta_j x_{t,j} + N_t$$
$Y_t$ = observed data for one passenger class.
$Y_t^*$ = reconstructed data.
$Z_t$ = latent process (usually equal to zero).
$x_{t,j}$ are covariates and dummy variables.
$N_t$ = seasonal ARIMA process of period 52.
57. Possible model
Some lessons
Real data is often very messy. Be aware of the causes.
Get an answer even if it isn’t pretty.
What to do with the non-integer seasonality? (average 52.19)
How to deal with the correlations between classes and between routes?
You often think of better approaches long after the project is finished.
58. Outline
5 Going to extremes
60. The problem
We want to forecast the peak electricity demand in a half-hour period in ten years’ time.
We have twelve years of half-hourly electricity data, temperature data and some economic and demographic data.
The location is South Australia: home to the most volatile electricity demand in the world.
Sounds impossible?
67. South Australian demand data
[Figure: South Australia state wide demand (summer 10/11). Vertical axis: demand (GW), 1.5 to 3.5; horizontal axis: Oct 10 to Mar 11.]
68. South Australian demand data
[Figure: South Australia state wide demand (January 2011). Vertical axis: demand (GW), 1.5 to 3.5; horizontal axis: date in January.]
70. Temperature data (Sth Aust)
[Figure: Demand (GW) against temperature (deg C) at 12 midnight, with workdays and non-workdays distinguished.]
71. Monash Electricity Forecasting Model
$$\log(y_t) = h_p(t) + f_p(w_{1,t}, w_{2,t}) + \sum_{j=1}^{J} c_j z_{j,t} + n_t$$
$y_t$ denotes per capita demand at time $t$ (measured in half-hourly intervals) and $p$ denotes the time of day, $p = 1, \dots, 48$;
$h_p(t)$ models all calendar effects;
$f_p(w_{1,t}, w_{2,t})$ models all temperature effects, where $w_{1,t}$ is a vector of recent temperatures at location 1 and $w_{2,t}$ is a vector of recent temperatures at location 2;
$z_{j,t}$ is a demographic or economic variable at time $t$;
$n_t$ denotes the model error at time $t$.
76. Monash Electricity Forecasting Model
$$\log(y_t) = h_p(t) + f_p(w_{1,t}, w_{2,t}) + \sum_{j=1}^{J} c_j z_{j,t} + n_t$$
$h_p(t)$ handles annual, weekly and daily seasonal patterns as well as public holidays:
$$h_p(t) = \ell_p(t) + \alpha_{t,p} + \beta_{t,p} + \gamma_{t,p} + \delta_{t,p}$$
$\ell_p(t)$ is “time of summer” effect (a regression spline);
$\alpha_{t,p}$ is day of week effect;
$\beta_{t,p}$ is “holiday” effect;
$\gamma_{t,p}$ is New Year’s Eve effect;
$\delta_{t,p}$ is millennium effect.
82. Fitted results (Summer 3pm)
[Figure (Time: 3:00 pm): three panels showing the effect on demand (−0.4 to 0.4) by day of summer (0 to 150), by day of week (Mon to Sun), and by holiday status (Normal, Day before, Holiday, Day after).]
83. Monash Electricity Forecasting Model
$$\log(y_t) = h_p(t) + f_p(w_{1,t}, w_{2,t}) + \sum_{j=1}^{J} c_j z_{j,t} + n_t$$
$$f_p(w_{1,t}, w_{2,t}) = \sum_{k=0}^{6} \left[ f_{k,p}(x_{t-k}) + g_{k,p}(d_{t-k}) \right] + q_p(x_t^+) + r_p(x_t^-) + s_p(\bar{x}_t) + \sum_{j=1}^{6} \left[ F_{j,p}(x_{t-48j}) + G_{j,p}(d_{t-48j}) \right]$$
$x_t$ is the average temperature across two sites (Kent Town and Adelaide Airport) at time $t$;
$d_t$ is the temperature difference between the two sites at time $t$;
$x_t^+$ is the maximum of the $x_t$ values in the past 24 hours;
$x_t^-$ is the minimum of the $x_t$ values in the past 24 hours;
$\bar{x}_t$ is the average temperature in the past seven days.
Each function is smooth and estimated using regression splines.
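The temperature inputs to these smooth functions are simple transformations of the two half-hourly temperature series. The Python sketch below builds them for a single time index. It is an assumption-laden illustration, not the authors' code: the function and key names are invented, the sign convention for the site difference (site 1 minus site 2) is a guess, and the 24-hour and seven-day windows are taken to end at, and include, time t.

```python
def temperature_features(site1, site2, t):
    """Temperature predictors at half-hourly index t (48 observations per day)."""
    if t < 7 * 48 - 1:
        raise ValueError("need seven full days of history before t")
    x = [(a + b) / 2 for a, b in zip(site1, site2)]  # average of the two sites
    d = [a - b for a, b in zip(site1, site2)]        # difference (assumed sign)
    last_day = x[t - 47:t + 1]     # 48 half-hours ending at t
    last_week = x[t - 335:t + 1]   # 336 half-hours ending at t
    return {
        "x_lags": [x[t - k] for k in range(7)],              # k = 0..6 half-hours
        "d_lags": [d[t - k] for k in range(7)],
        "x_max_24h": max(last_day),
        "x_min_24h": min(last_day),
        "x_mean_7d": sum(last_week) / len(last_week),
        "x_day_lags": [x[t - 48 * j] for j in range(1, 7)],  # j = 1..6 days
        "d_day_lags": [d[t - 48 * j] for j in range(1, 7)],
    }
```

Each entry corresponds to one argument of a spline term in the equation for $f_p$: the lag of $48j$ half-hours is the same time of day $j$ days earlier.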
90. Fitted results (Summer 3pm)
[Figure (Time: 3:00 pm): eight panels showing the effect on demand (−0.4 to 0.4) against Temperature, Lag 1 temperature, Lag 2 temperature, Lag 3 temperature, Lag 1 day temperature, Last week average temp, Previous max temp, and Previous min temp.]
91. Monash Electricity Forecasting Model
$$\log(y_t) = h_p(t) + f_p(w_{1,t}, w_{2,t}) + \sum_{j=1}^{J} c_j z_{j,t} + n_t$$
Same predictors used for all 48 models.
Predictors chosen by cross-validation on summer of 2007/2008 and 2009/2010.
Each model is fitted to the data twice, first excluding the summer of 2009/2010 and then excluding the summer of 2010/2011. The average out-of-sample MSE is calculated from the omitted data for the time periods 12 noon–8:30 pm.
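The fit-twice, score-on-the-omitted-summer procedure can be sketched as follows. This is only an illustration under stated assumptions: `fit_line` is a hypothetical straight-line stand-in for the real spline model, the row fields `summer`, `x` and `y` are invented for the example, and the restriction to afternoon periods is left out.

```python
def fit_line(rows):
    """Least squares fit of y = a + b*x (a stand-in for the real model)."""
    n = len(rows)
    mx = sum(r["x"] for r in rows) / n
    my = sum(r["y"] for r in rows) / n
    sxx = sum((r["x"] - mx) ** 2 for r in rows)
    sxy = sum((r["x"] - mx) * (r["y"] - my) for r in rows)
    b = sxy / sxx
    return my - b * mx, b


def holdout_mse(rows, holdout_summers, fit=fit_line):
    """Average out-of-sample MSE over the omitted summers.

    For each summer in holdout_summers the model is refitted without that
    summer, then scored on the omitted rows only.
    """
    fold_mses = []
    for summer in holdout_summers:
        train = [r for r in rows if r["summer"] != summer]
        test = [r for r in rows if r["summer"] == summer]
        a, b = fit(train)
        squared_errors = [(r["y"] - (a + b * r["x"])) ** 2 for r in test]
        fold_mses.append(sum(squared_errors) / len(squared_errors))
    return sum(fold_mses) / len(fold_mses)
```

Comparing `holdout_mse` across candidate predictor sets, and keeping the set with the smallest value, is the selection step the slide describes.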
95. Half-hourly models
[Figure: R-squared (%) of each half-hourly model plotted against time of day, from 12 midnight through to the following 12 midnight; the vertical axis is labelled from 60 to 90.]
96. Half-hourly models
[Figure: South Australian demand (January 2011). Actual and fitted demand in GW (vertical axis 1.0 to 4.0) against date in January (1 to 31).]
99. Adjusted model
Original model
$$\log(y_t) = h_p(t) + f_p(w_{1,t}, w_{2,t}) + \sum_{j=1}^{J} c_j z_{j,t} + n_t$$
Model allowing saturated usage
$$q_t = h_p(t) + f_p(w_{1,t}, w_{2,t}) + \sum_{j=1}^{J} c_j z_{j,t} + n_t$$
$$\log(y_t) = \begin{cases} q_t & \text{if } q_t \le \tau; \\ \tau + k(q_t - \tau) & \text{if } q_t > \tau. \end{cases}$$
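The saturation adjustment is a piecewise-linear map from q_t to log(y_t). A minimal sketch, with hypothetical function names and with τ and k passed in as plain numbers (how they are chosen is not covered here):

```python
import math


def saturated_log_demand(q, tau, k):
    """log(y_t) as a function of q_t: unchanged up to tau, slope k above it."""
    return q if q <= tau else tau + k * (q - tau)


def saturated_demand(q, tau, k):
    """Demand y_t on the original scale."""
    return math.exp(saturated_log_demand(q, tau, k))
```

The two branches agree at q = τ, so the map is continuous. Presumably 0 < k < 1, so that demand grows more slowly once the threshold is passed, which is what "saturated usage" suggests.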
101. Peak demand forecasting
$$q_{t,p} = h_p(t) + f_p(w_{1,t}, w_{2,t}) + \sum_{j=1}^{J} c_j z_{j,t} + n_t$$
Multiple alternative futures created:
$h_p(t)$ known;
simulate future temperatures using double seasonal block bootstrap with variable blocks (with adjustment for climate change);
use assumed values for GSP, population and price;
resample residuals using double seasonal block bootstrap with variable blocks.
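The resampling idea can be illustrated with a simplified seasonal block bootstrap. This sketch is an assumption-laden stand-in, not the method used in the model: it copies blocks of whole days (keeping the within-day pattern) from the same day positions of a randomly chosen year (keeping the time-of-year pattern), with random block lengths. It omits any shifting of block start positions and the climate change adjustment, and the function name and block-length defaults are invented.

```python
import random


def seasonal_block_bootstrap(years, min_block=3, max_block=10, rng=None):
    """Build one synthetic season from several observed seasons.

    years: list of seasons; each season is a list of days; each day is a
    list of half-hourly values. All seasons must have the same length.
    """
    rng = rng or random.Random()
    n_days = len(years[0])
    if any(len(season) != n_days for season in years):
        raise ValueError("all seasons must have the same number of days")
    synthetic = []
    day = 0
    while day < n_days:
        # Variable block length, truncated at the end of the season.
        length = min(rng.randint(min_block, max_block), n_days - day)
        # Take the block from the same calendar position in a random year.
        source = rng.choice(years)
        synthetic.extend(source[day : day + length])
        day += length
    return synthetic
```

Calling the function repeatedly gives many alternative temperature (or residual) seasons, each of which feeds one simulated demand path.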
102. Peak demand backcasting
$$q_{t,p} = h_p(t) + f_p(w_{1,t}, w_{2,t}) + \sum_{j=1}^{J} c_j z_{j,t} + n_t$$
Multiple alternative pasts created:
$h_p(t)$ known;
simulate past temperatures using double seasonal block bootstrap with variable blocks;
use actual values for GSP, population and price;
resample residuals using double seasonal block bootstrap with variable blocks.
103. Peak demand backcasting
[Figure: PoE (annual interpretation). PoE demand levels at 10 %, 50 % and 90 %, plotted with point markers, against year from 98/99 to 10/11; vertical axis "PoE Demand" from 2.0 to 4.0.]
104. Peak demand forecasting
[Figure: panels by year, 1990 to 2020, with High, Base and Low scenarios: South Australia GSP (billion dollars, 08/09 dollars); South Australia population (million); Average electricity prices (c/kWh). A further panel is titled Major industrial offset demand.]
105. Peak demand distribution
[Figure: Annual POE levels. 1 %, 5 %, 10 %, 50 % and 90 % POE demand levels, with points for the actual annual maximum, against year from 98/99 to 20/21; vertical axis "PoE Demand" from 2 to 6.]
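A PoE (probability of exceedance) level is a percentile read from the other end: the 10 % PoE level is the demand exceeded in 10 % of simulated years, that is, the 90th percentile of the simulated annual maxima. A minimal sketch, assuming the simulated maxima are already available as a list and using linear interpolation between order statistics (one of several common percentile conventions); the function name is hypothetical.

```python
def poe_level(simulated_maxima, poe_percent):
    """Demand level exceeded in poe_percent % of the simulated years."""
    if not 0 < poe_percent < 100:
        raise ValueError("poe_percent must be strictly between 0 and 100")
    ordered = sorted(simulated_maxima)
    # A PoE of p % corresponds to the (100 - p)th percentile.
    position = (1 - poe_percent / 100) * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])
```

Low PoE values sit in the upper tail, so the 1 % PoE level is always at least as high as the 10 % PoE level for the same simulations.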
106. Results
We have successfully forecast the extreme upper tail in
ten years' time using only twelve years of data!
This method has now been adopted for the official
long-term peak electricity demand forecasts for all states
except WA.
Some lessons
Cross-validation is very useful in prediction
problems.
Statistical modelling is an iterative process.
Getting client understanding of percentiles is
extremely difficult.
Beware of clients who think they know more
than you!
110. Outline
1 Where fools fear to tread
2 Working with inadequate tools
3 When you can’t lose
4 Getting dirty with data
5 Going to extremes
6 Final thoughts
Man vs Wild Data Final thoughts 42
111. Crazy clients
The client who wouldn’t tell me the
problem.
The client who wanted all meetings
held at random locations for security
reasons.
The client who didn’t like the answer.
Expert witnessing on the color purple
(and now yellow).
115. Go forth and consult
A good statistician is not smarter than
everyone else, he merely has his ignorance
better organised.
(Anonymous)
116. Go forth and consult
All models are wrong, some are useful.
(George E P Box)
118. Go forth and consult
It is better to solve the right problem the
wrong way than the wrong problem the
right way.
(John W Tukey)
Slides available from robjhyndman.com