Young Statisticians Conference
7 February 2013
Outline

1 Where fools fear to tread

2 Working with inadequate tools

3 When you can’t lose

4 Getting dirty with data

5 Going to extremes

6 Final thoughts


      Man vs Wild Data      Where fools fear to tread   2
My story
Olympic video poker slots
Beware of smelly clients
Threats and slander
Nerves in court
Three university
consulting services
Reviewing my own
work
Six times an expert
witness
Hundreds of clients

2 Working with inadequate tools

Disposable tableware company
Problem: The client wants forecasts for each of hundreds of items.
Series can be stationary, trended or seasonal. They currently have a
large forecasting program written in-house, but it doesn’t seem to
produce sensible forecasts. They want me to tell them what is wrong
and fix it.

Additional information
    The program is written in COBOL, which limits numerical
    calculations. It is not possible to do any optimisation.
    Their programmer has little experience in numerical computing.
    They employ no statisticians and want the program to produce
    forecasts automatically.

Disposable tableware company
Methods currently used
    A  12 month average
    C  6 month average
    E  straight line regression over last 12 months
    G  straight line regression over last 6 months
    H  average slope between last year’s and this year’s values
       (equivalent to differencing at lag 12 and taking the mean)
    I  same as H except over 6 months
    K  I couldn’t understand the explanation.

Disposable tableware company
My solution
   Use first differencing to deal with trend, or seasonal
   differencing to deal with seasonality.
   Use simple exponential smoothing on (differenced)
   data with the parameter selected from
   {0.1, 0.3, 0.5, 0.7, 0.9}.
   For each series, try 15 models: no differencing, first
   differencing, and seasonal differencing, plus SES
   with 5 parameter values.
   Model selected based on smallest MSE. (Only one
   parameter for each model, so no need to penalize
   for model size.)
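
The selection procedure above can be sketched in a few lines. This is only an illustration in Python, not the COBOL program delivered to the client. The function names, the choice to initialise the level at the first observation, and the seasonal lag of 12 (monthly data) are assumptions made for the example. Because a one-step error on the differenced series equals the one-step error on the original series, the MSE values of the 15 candidates are on the same scale.

```python
ALPHAS = (0.1, 0.3, 0.5, 0.7, 0.9)
LAGS = (None, 1, 12)  # no differencing, first differencing, seasonal differencing (monthly data assumed)


def difference(y, lag):
    """Return y differenced at the given lag, or a copy of y if lag is None."""
    if lag is None:
        return list(y)
    return [y[t] - y[t - lag] for t in range(lag, len(y))]


def ses_mse(x, alpha):
    """One-step-ahead MSE of simple exponential smoothing (level starts at x[0])."""
    level = x[0]
    sq_err = 0.0
    for obs in x[1:]:
        err = obs - level
        sq_err += err * err
        level += alpha * err
    return sq_err / (len(x) - 1)


def select_model(y):
    """Try all 3 x 5 = 15 combinations and return (mse, lag, alpha) with the smallest MSE."""
    best = None
    for lag in LAGS:
        x = difference(y, lag)
        if len(x) < 2:
            continue  # series too short for this differencing
        for alpha in ALPHAS:
            mse = ses_mse(x, alpha)
            if best is None or mse < best[0]:
                best = (mse, lag, alpha)
    return best


def forecast_next(y, lag, alpha):
    """One-step forecast: smooth the differenced series, then undo the differencing."""
    x = difference(y, lag)
    level = x[0]
    for obs in x[1:]:
        level += alpha * (obs - level)
    if lag is None:
        return level
    return y[len(y) - lag] + level


# Made-up example: a pure linear trend is fitted exactly after first differencing.
trend = [2 * t for t in range(36)]
mse, lag, alpha = select_model(trend)
print(lag, alpha, forecast_next(trend, lag, alpha))
```

Nothing here needs an optimiser: each candidate is a single pass over the data, which is the kind of calculation the constraints above allow.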

Disposable tableware company
Some lessons
    Be pragmatic.
    Understand your tools well enough to be able to adapt them.
    A successful consulting job often uses very simple methods.


3 When you can’t lose

Forecasting the PBS

The Pharmaceutical Benefits Scheme (PBS) is
the Australian government drugs subsidy scheme.

    Many drugs bought from pharmacies are
    subsidised to allow more equitable access to
    modern drugs.
    The cost to government is determined by the
    number and types of drugs purchased.
    Currently nearly 1% of GDP.
    The total cost is budgeted based on forecasts
    of drug usage.

Forecasting the PBS
 In 2001: $4.5 billion budget, under-forecasted
 by $800 million.
 Thousands of products. Seasonal demand.
 Subject to covert marketing, volatile products,
 uncontrollable expenditure.
 Although monthly data available for 10 years,
 data are aggregated to annual values, and only
 the first three years are used in estimating the
 forecasts.
 All forecasts being done with the FORECAST
 function in MS-Excel!
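
To see what this approach amounts to: Excel's FORECAST(x, known_y's, known_x's) returns the value at x of the least-squares straight line through the known points. The sketch below reproduces that arithmetic in Python on made-up numbers. The helper names and the data are assumptions for illustration only, not the PBS spreadsheet. Aggregating to annual totals first discards the seasonal pattern, and three points leave very little to fit a line to.

```python
def annual_totals(monthly):
    """Sum complete blocks of 12 monthly values into annual totals."""
    usable = len(monthly) - len(monthly) % 12
    return [sum(monthly[i:i + 12]) for i in range(0, usable, 12)]


def linear_forecast(x_new, known_ys, known_xs):
    """Least-squares straight line through (known_xs, known_ys), evaluated at x_new."""
    n = len(known_xs)
    x_bar = sum(known_xs) / n
    y_bar = sum(known_ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(known_xs, known_ys))
    sxx = sum((x - x_bar) ** 2 for x in known_xs)
    slope = sxy / sxx
    return y_bar + slope * (x_new - x_bar)


# Made-up monthly costs for three years: 10, then 11, then 12 per month.
monthly = [10] * 12 + [11] * 12 + [12] * 12
totals = annual_totals(monthly)                  # [120, 132, 144]
print(linear_forecast(4, totals, [1, 2, 3]))     # straight-line forecast for year 4
```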
ATC drug classification
A   Alimentary tract and metabolism
B   Blood and blood forming organs
C   Cardiovascular system
D   Dermatologicals
G   Genito-urinary system and sex hormones
H   Systemic hormonal preparations, excluding sex hormones
    and insulins
J   Anti-infectives for systemic use
L   Antineoplastic and immunomodulating agents
M   Musculo-skeletal system
N   Nervous system
P   Antiparasitic products, insecticides and repellents
R   Respiratory system
S   Sensory organs
V   Various
ATC drug classification

14 classes    A          Alimentary tract and metabolism
84 classes    A10        Drugs used in diabetes
              A10B       Blood glucose lowering drugs
              A10BA      Biguanides
              A10BA02    Metformin

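
The hierarchy is encoded in the code itself: the five ATC levels correspond to prefixes of length 1, 3, 4, 5 and 7. A small Python sketch of how a code can be split into its levels, and how costs might be aggregated to one level, is given here. The function names and the cost figures are made up for illustration.

```python
from collections import defaultdict

ATC_LEVEL_LENGTHS = (1, 3, 4, 5, 7)  # prefix length of ATC levels 1 to 5


def atc_levels(code):
    """Return the prefixes of an ATC code, from level 1 down to the code's own level."""
    code = code.strip().upper()
    if len(code) not in ATC_LEVEL_LENGTHS:
        raise ValueError(f"not a valid ATC code length: {code!r}")
    return [code[:n] for n in ATC_LEVEL_LENGTHS if n <= len(code)]


def total_by_level(costs, level):
    """Sum a {code: cost} mapping by ATC prefix at the given level (1 to 5)."""
    width = ATC_LEVEL_LENGTHS[level - 1]
    totals = defaultdict(float)
    for code, cost in costs.items():
        totals[code[:width]] += cost
    return dict(totals)


print(atc_levels("A10BA02"))
# Made-up costs, aggregated to the second level of the hierarchy.
print(total_by_level({"A10BA02": 3.0, "A10BB01": 2.0, "C01AA05": 1.0}, 2))
```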
Forecasting the PBS
 Monthly data on thousands of drug groups and 4
 concession types available from 1991.
 Method needs to be automated and implemented
 within MS-Excel.
 Exponential smoothing seems appropriate (monthly
 data with changing trends and seasonal patterns),
 but in 2001, automated exponential smoothing was
 not well-developed, and not available in MS-Excel.
 As part of this project, we developed an automatic
 forecasting algorithm for exponential smoothing
 state space models based on the AIC.
 Forecast MAPE reduced from 15–20% to about 0.6%.
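
The idea of choosing among exponential smoothing models by AIC can be illustrated with a much reduced sketch. It is not the algorithm developed in the project: it compares only two candidates (simple exponential smoothing and Holt's linear trend method), fits them by a coarse grid search rather than proper likelihood maximisation, fixes the initial states heuristically, and uses the Gaussian AIC up to a constant, n log(SSE/n) + 2k. The parameter counts (2 and 4, counting initial states), the grid and the test series are assumptions made for the example.

```python
import math

GRID = [i / 10 for i in range(1, 10)]  # candidate smoothing parameters 0.1 ... 0.9


def sse_ses(y, alpha):
    """Sum of squared one-step errors of simple exponential smoothing over y[2:]."""
    level = y[0] + alpha * (y[1] - y[0])  # level after seeing y[0] and y[1]
    sse = 0.0
    for obs in y[2:]:
        err = obs - level
        sse += err * err
        level += alpha * err
    return sse


def sse_holt(y, alpha, beta):
    """Sum of squared one-step errors of Holt's linear method (error-correction form) over y[2:]."""
    level, trend = y[1], y[1] - y[0]
    sse = 0.0
    for obs in y[2:]:
        err = obs - (level + trend)
        sse += err * err
        level = level + trend + alpha * err
        trend = trend + alpha * beta * err
    return sse


def aic(sse, n, k):
    """Gaussian AIC up to an additive constant; the max() guards against log(0)."""
    return n * math.log(max(sse, 1e-12) / n) + 2 * k


def select_by_aic(y):
    """Return the name of the model with the smaller AIC, plus both AIC values."""
    n = len(y) - 2  # both models are scored on the same n errors
    ses = min(sse_ses(y, a) for a in GRID)
    holt = min(sse_holt(y, a, b) for a in GRID for b in GRID)
    candidates = {"ses": aic(ses, n, 2), "holt": aic(holt, n, 4)}
    return min(candidates, key=candidates.get), candidates


# Made-up series: linear trend plus a small alternating wiggle.
trended = [10 + 2 * t + (0.5 if t % 2 else -0.5) for t in range(40)]
print(select_by_aic(trended)[0])
```

Unlike the MSE comparison used for the tableware client, the candidates here have different numbers of parameters, so the penalty term matters.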

Forecasting the PBS
[Figures: total cost ($ thousands), 1995 to 2010, for five groups: A03 concession safety net, A05 general copayments, D01 general copayments, S01 general copayments and R03 general copayments.]

Forecasting the PBS
Some lessons
    Often what people do is very bad, and it is easy to make a big
    difference.
    Sometimes you have to invent new methods, and that can lead to
    publications.
    You have to implement solutions in the client’s software
    environment.
    Be aware of the politics.


4 Getting dirty with data

Airline passenger traffic
[Figure: three panels of passenger numbers by year, 1988 to 1993, Melbourne−Sydney: first class, business class and economy class.]
Not the real data! Or is it?

Airline passenger traffic
[Figure: Economy class passengers (thousands), Melbourne−Sydney, 1988 to 1993.]

Possible model

$$Y_t = Y_t^* + Z_t$$
$$Y_t^* = \beta_0 + \sum_j \beta_j x_{t,j} + N_t$$

 $Y_t$ = observed data for one passenger class.
 $Y_t^*$ = reconstructed data.
 $Z_t$ = latent process (usually equal to zero).
 $x_{t,j}$ are covariates and dummy variables.
 $N_t$ = seasonal ARIMA process of period 52.

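
A toy simulation may help to fix the roles of the terms in this decomposition. It is not the analysis from the project. The assumptions are loud ones: the disrupted weeks are taken as known, the seasonal ARIMA term N_t is replaced by white noise, there is a single hypothetical holiday dummy, and all numbers are made up. The regression is fitted to the undisrupted weeks only. With one dummy variable, the least-squares fit reduces to two group means.

```python
import random


def simulate(n_weeks=156, seed=1):
    """Simulate Y_t = Y*_t + Z_t with a holiday dummy and a block of disrupted weeks."""
    rng = random.Random(seed)
    holiday = [1 if t % 52 in (0, 51) else 0 for t in range(n_weeks)]  # hypothetical dummy x_t
    disrupted = [80 <= t < 100 for t in range(n_weeks)]                # weeks where Z_t != 0
    y_star = [20.0 + 5.0 * h + rng.gauss(0.0, 0.5) for h in holiday]   # beta0 = 20, beta1 = 5
    z = [-15.0 if d else 0.0 for d in disrupted]
    y = [ys + zt for ys, zt in zip(y_star, z)]
    return y, holiday, disrupted


def reconstruct(y, holiday, disrupted):
    """Estimate beta0 and beta1 from undisrupted weeks, then rebuild Y* and Z."""
    base = [v for v, h, d in zip(y, holiday, disrupted) if not d and h == 0]
    hol = [v for v, h, d in zip(y, holiday, disrupted) if not d and h == 1]
    beta0 = sum(base) / len(base)
    beta1 = sum(hol) / len(hol) - beta0
    # Undisrupted weeks keep their observed value; disrupted weeks get the fitted value.
    y_star_hat = [beta0 + beta1 * h if d else v for v, h, d in zip(y, holiday, disrupted)]
    z_hat = [v - ys for v, ys in zip(y, y_star_hat)]
    return y_star_hat, z_hat, (beta0, beta1)


y, holiday, disrupted = simulate()
y_star_hat, z_hat, betas = reconstruct(y, holiday, disrupted)
print(betas)
```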
Possible model
Some lessons
    Real data is often very messy. Be aware of the causes.
    Get an answer even if it isn’t pretty.
    What to do with the non-integer seasonality? (average 52.19)
    How to deal with the correlations between classes and between
    routes?
    You often think of better approaches long after the project is
    finished.


5 Going to extremes

Extreme electricity demand




The problem

 We want to forecast the peak electricity
 demand in a half-hour period in ten years’ time.
 We have twelve years of half-hourly electricity
 data, temperature data and some economic
 and demographic data.
 The location is South Australia: home to the
 most volatile electricity demand in the world.


                       Sounds impossible?


South Australian demand data
[Figure not recovered; one version is annotated "Black Saturday →".]
[Figure: South Australia state wide demand (GW), summer 10/11, October 2010 to March 2011.]
[Figure: South Australia state wide demand (GW), January 2011, by date in January.]

Demand boxplots (Sth Aust)
[Figure: boxplots of demand (GW) by day of week, Monday to Sunday. Time: 12 midnight.]

Temperature data (Sth Aust)
[Figure: demand (GW) against temperature (deg C) at 12 midnight, with workdays and non-workdays marked separately]
Monash Electricity Forecasting Model

\[ \log(y_t) = h_p(t) + f_p(w_{1,t}, w_{2,t}) + \sum_{j=1}^{J} c_j z_{j,t} + n_t \]

 $y_t$ denotes per capita demand at time $t$ (measured in half-hourly intervals) and $p$ denotes the time of day, $p = 1, \dots, 48$;
 $h_p(t)$ models all calendar effects;
 $f_p(w_{1,t}, w_{2,t})$ models all temperature effects, where $w_{1,t}$ is a vector of recent temperatures at location 1 and $w_{2,t}$ is a vector of recent temperatures at location 2;
 $z_{j,t}$ is a demographic or economic variable at time $t$;
 $n_t$ denotes the model error at time $t$.
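To make the indexing concrete, here is a minimal sketch of how half-hourly observations could be assigned to the 48 time-of-day models. The function names and the convention that p = 1 is the half hour starting at 00:00 are assumptions for illustration; the slides do not state the indexing convention.

```python
import math
from collections import defaultdict
from datetime import datetime


def half_hour_period(ts: datetime) -> int:
    """Return p in 1..48 (assumed convention: p = 1 starts at 00:00)."""
    return ts.hour * 2 + (1 if ts.minute >= 30 else 0) + 1


def split_by_period(records):
    """Group (timestamp, per_capita_demand) pairs by period p.

    Demand is stored on the log scale, matching the left-hand side log(y_t).
    """
    groups = defaultdict(list)
    for ts, demand in records:
        groups[half_hour_period(ts)].append((ts, math.log(demand)))
    return groups
```

Each of the 48 groups would then be passed to its own model fit.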
Monash Electricity Forecasting Model

\[ \log(y_t) = h_p(t) + f_p(w_{1,t}, w_{2,t}) + \sum_{j=1}^{J} c_j z_{j,t} + n_t \]

$h_p(t)$ handles annual, weekly and daily seasonal patterns as well as public holidays:

\[ h_p(t) = \ell_p(t) + \alpha_{t,p} + \beta_{t,p} + \gamma_{t,p} + \delta_{t,p} \]

      $\ell_p(t)$ is the “time of summer” effect (a regression spline);
      $\alpha_{t,p}$ is the day of week effect;
      $\beta_{t,p}$ is the “holiday” effect;
      $\gamma_{t,p}$ is the New Year’s Eve effect;
      $\delta_{t,p}$ is the millennium effect.
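As an illustration of the calendar factors, the sketch below codes the day of week and a four-level holiday factor (Normal, Day before, Holiday, Day after, the levels shown in the fitted holiday effect plot). The function names, the `holidays` set, and the precedence when a day matches more than one level are assumptions, not taken from the slides.

```python
from datetime import date, timedelta

DAY_NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def day_of_week(day: date) -> str:
    """Day-of-week level; date.weekday() returns 0 for Monday."""
    return DAY_NAMES[day.weekday()]


def holiday_status(day: date, holidays: set) -> str:
    """Four-level holiday factor.

    Assumed precedence: Holiday, then Day before, then Day after.
    """
    if day in holidays:
        return "Holiday"
    if day + timedelta(days=1) in holidays:
        return "Day before"
    if day - timedelta(days=1) in holidays:
        return "Day after"
    return "Normal"
```

Both factors would enter each half-hourly model as categorical terms.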
Fitted results (Summer 3pm)
[Figure: fitted effects on demand at 3:00 pm, in three panels: day of summer (0 to 150), day of week (Mon to Sun), and holiday (Normal, Day before, Holiday, Day after)]
Monash Electricity Forecasting Model

\[ \log(y_t) = h_p(t) + f_p(w_{1,t}, w_{2,t}) + \sum_{j=1}^{J} c_j z_{j,t} + n_t \]

\[ f_p(w_{1,t}, w_{2,t}) = \sum_{k=0}^{6} \left[ f_{k,p}(x_{t-k}) + g_{k,p}(d_{t-k}) \right] + q_p(x_t^{+}) + r_p(x_t^{-}) + s_p(\bar{x}_t) + \sum_{j=1}^{6} \left[ F_{j,p}(x_{t-48j}) + G_{j,p}(d_{t-48j}) \right] \]

         $x_t$ is ave temp across two sites (Kent Town and Adelaide Airport) at time $t$;
         $d_t$ is the temp difference between two sites at time $t$;
         $x_t^{+}$ is max of $x_t$ values in past 24 hours;
         $x_t^{-}$ is min of $x_t$ values in past 24 hours;
         $\bar{x}_t$ is ave temp in past seven days.
   Each function is smooth & estimated using regression splines.
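The temperature inputs above can be assembled from the two site series roughly as follows. This is a sketch under stated assumptions: the feature names are hypothetical, the difference is taken as site 1 minus site 2, and the 24-hour and seven-day windows are assumed to include the current half hour. The slides do not specify these details.

```python
def temperature_features(site1, site2, t):
    """Build the temperature inputs for time index t.

    site1, site2: half-hourly temperature series (lists of floats).
    t: index of the current half hour; needs seven days of history.
    """
    if t < 7 * 48 - 1:
        raise ValueError("need at least seven days (336 half hours) of history")
    x = [(a + b) / 2 for a, b in zip(site1, site2)]  # x_t: average of the two sites
    d = [a - b for a, b in zip(site1, site2)]        # d_t: difference (sign assumed)

    feats = {}
    for k in range(0, 7):        # lags of 0..6 half hours
        feats[f"x_lag{k}"] = x[t - k]
        feats[f"d_lag{k}"] = d[t - k]
    for j in range(1, 7):        # same time of day, 1..6 days earlier
        feats[f"x_lag{48 * j}"] = x[t - 48 * j]
        feats[f"d_lag{48 * j}"] = d[t - 48 * j]

    last_day = x[t - 47 : t + 1]             # 48 half hours ending at t (assumed window)
    last_week = x[t - 7 * 48 + 1 : t + 1]    # 336 half hours ending at t (assumed window)
    feats["x_max"] = max(last_day)
    feats["x_min"] = min(last_day)
    feats["x_bar"] = sum(last_week) / len(last_week)
    return feats
```

Each entry would then be passed through its own smooth function in the additive model.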
Fitted results (Summer 3pm)
[Figure: fitted effects on demand at 3:00 pm, in eight panels: temperature, lag 1 temperature, lag 2 temperature, lag 3 temperature, lag 1 day temperature, last week average temp, previous max temp, previous min temp]
Monash Electricity Forecasting Model

\[ \log(y_t) = h_p(t) + f_p(w_{1,t}, w_{2,t}) + \sum_{j=1}^{J} c_j z_{j,t} + n_t \]

 Same predictors used for all 48 models.
 Predictors chosen by cross-validation on summer of 2007/2008 and 2009/2010.
 Each model is fitted to the data twice, first excluding the summer of 2009/2010 and then excluding the summer of 2010/2011. The average out-of-sample MSE is calculated from the omitted data for the time periods 12noon–8.30pm.
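The hold-out procedure described above can be sketched as a leave-one-summer-out loop. Everything named here is hypothetical: the row layout, the `fit` and `predict` callables, and the `period_ok` filter (which periods count as 12noon–8.30pm depends on an indexing convention the slides do not give). Errors are computed on the log scale as an assumption, and a trivial mean model stands in for the real additive model.

```python
def leave_one_summer_out_mse(rows, held_out_summers, fit, predict, period_ok):
    """Average out-of-sample MSE over the held-out summers.

    rows: list of dicts with keys 'summer', 'period', 'features', 'log_demand'.
    """
    fold_mses = []
    for summer in held_out_summers:
        train = [r for r in rows if r["summer"] != summer]
        test = [r for r in rows
                if r["summer"] == summer and period_ok(r["period"])]
        model = fit(train)
        sq_errors = [(r["log_demand"] - predict(model, r["features"])) ** 2
                     for r in test]
        fold_mses.append(sum(sq_errors) / len(sq_errors))
    return sum(fold_mses) / len(fold_mses)


# Stand-in model for demonstration only: predicts the training mean.
def fit_mean(train):
    return sum(r["log_demand"] for r in train) / len(train)


def predict_mean(model, features):
    return model
```

Comparing this score across candidate predictor sets gives the kind of MSE column shown in the model comparison table.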
Half-hourly models
Candidate predictors (32): x, x1 to x6, x48, x96, x144, x192, x240, x288; d, d1 to d6, d48, d96, d144, d192, d240, d288; x+, x−, x̄; dow, hol, dos.
Each row lists the predictors left out of that model; all others are included. A range such as d96..d288 follows the order of the list above.

Model  Omitted predictors            MSE
  1    (none)                        1.037
  2    d288                          1.034
  3    d240..d288                    1.031
  4    d192..d288                    1.027
  5    d144..d288                    1.025
  6    d96..d288                     1.020
  7    d48..d288                     1.025
  8    d6, d96..d288                 1.026
  9    d5..d6, d96..d288             1.035
 10    d4..d6, d96..d288             1.044
 11    d3..d6, d96..d288             1.057
 12    d2..d6, d96..d288             1.076
 13    d1..d6, d96..d288             1.102
 14    x288, d96..d288               1.018
 15    x240..x288, d96..d288         1.021
 16    x192..x288, d96..d288         1.037
 17    x144..x288, d96..d288         1.074
 18    x96..x288, d96..d288          1.152
 19    x48..x288, d96..d288          1.180
 20    x6, x288, d96..d288           1.021
 21    x5..x6, x288, d96..d288       1.027
 22    x4..x6, x288, d96..d288       1.038
 23    x3..x6, x288, d96..d288       1.056
 24    x2..x6, x288, d96..d288       1.086
 25    x1..x6, x288, d96..d288       1.135
 26    x+, x288, d96..d288           1.009
 27    x−, x288, d96..d288           1.063
 28    x̄, x288, d96..d288            1.028
 29    dow, x288, d96..d288          3.523
 30    hol, x288, d96..d288          2.143
 31    dos, x288, d96..d288          1.523
Half-hourly models
[Figure: R-squared (%) of the half-hourly models against time of day, 12 midnight to 12 midnight]
Half-hourly models
[Figure: South Australian demand (GW), January 2011, actual and fitted, against date in January]
Adjusted model

Original model

\[ \log(y_t) = h_p(t) + f_p(w_{1,t}, w_{2,t}) + \sum_{j=1}^{J} c_j z_{j,t} + n_t \]

Model allowing saturated usage

\[ q_t = h_p(t) + f_p(w_{1,t}, w_{2,t}) + \sum_{j=1}^{J} c_j z_{j,t} + n_t \]

\[ \log(y_t) = \begin{cases} q_t & \text{if } q_t \le \tau; \\ \tau + k(q_t - \tau) & \text{if } q_t > \tau. \end{cases} \]
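The saturation adjustment is a piecewise-linear map from $q_t$ to $\log(y_t)$, sketched below. The function names are hypothetical, and the reading that $0 < k < 1$ (so demand grows more slowly above the threshold) is an assumption suggested by the phrase "saturated usage"; the slide does not give a range for $k$.

```python
import math


def saturated_log_demand(q, tau, k):
    """Map q_t to log(y_t): identity up to tau, slope k above tau."""
    if q <= tau:
        return q
    return tau + k * (q - tau)


def demand(q, tau, k):
    """Back-transform to the demand scale, y_t = exp(log(y_t))."""
    return math.exp(saturated_log_demand(q, tau, k))
```

For example, with `tau = 2.0` and `k = 0.5`, a value `q = 3.0` maps to `2.5` on the log scale, while any `q` at or below `2.0` is unchanged.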
Peak demand forecasting

$$q_{t,p} = h_p(t) + f_p(w_{1,t}, w_{2,t}) + \sum_{j=1}^{J} c_j z_{j,t} + n_t$$

Multiple alternative futures created:
   $h_p(t)$ known;
   simulate future temperatures using a double seasonal block bootstrap with variable blocks (with adjustment for climate change);
   use assumed values for GSP, population and price;
   resample residuals using a double seasonal block bootstrap with variable blocks.
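The temperature simulation step can be sketched roughly as follows. This is a simplified reading of a seasonal block bootstrap with variable block lengths, not the exact procedure used in the talk: whole days are kept intact so the within-day pattern survives, each block stays at its original position in the season so the within-year pattern survives, and block lengths are drawn at random. The function name, the array layout `(n_years, n_days, n_periods)` and the block-length range are assumptions made for illustration, and the climate change adjustment is omitted.

```python
import numpy as np

def seasonal_block_bootstrap(temps, min_block=5, max_block=15, rng=None):
    """Build one simulated season from historical seasons.

    temps: array of shape (n_years, n_days, n_periods).
    Blocks of consecutive days are copied from randomly chosen years,
    always from the same day positions, with random block lengths.
    """
    rng = np.random.default_rng(rng)
    n_years, n_days, _ = temps.shape
    out = np.empty_like(temps[0])
    start = 0
    while start < n_days:
        length = int(rng.integers(min_block, max_block + 1))  # variable block length
        end = min(start + length, n_days)
        year = int(rng.integers(n_years))                     # source year for this block
        out[start:end] = temps[year, start:end]
        start = end
    return out

# Synthetic data whose values encode (year, day, period), for illustration only.
n_years, n_days, n_periods = 12, 90, 48
years = np.arange(n_years)[:, None, None]
days = np.arange(n_days)[None, :, None]
periods = np.arange(n_periods)[None, None, :]
temps = years * 10000 + days * 100 + periods
sim = seasonal_block_bootstrap(temps, rng=0)
```

Calling the function repeatedly with different seeds gives the "multiple alternative futures"; the same resampling idea can be applied to the model residuals.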
Peak demand backcasting

$$q_{t,p} = h_p(t) + f_p(w_{1,t}, w_{2,t}) + \sum_{j=1}^{J} c_j z_{j,t} + n_t$$

Multiple alternative pasts created:
   $h_p(t)$ known;
   simulate past temperatures using a double seasonal block bootstrap with variable blocks;
   use actual values for GSP, population and price;
   resample residuals using a double seasonal block bootstrap with variable blocks.
[Figure: PoE (annual interpretation). Backcast 10 %, 50 % and 90 % PoE demand levels by year, 98/99 to 10/11, with plotted points.]
Peak demand forecasting
[Figure: High, Base and Low scenarios, 1990 to 2020, for South Australia GSP (billion dollars, 08/09 dollars), South Australia population (million) and average electricity prices (c/kWh), with a further panel titled "Major industrial offset demand".]
Peak demand distribution
[Figure: Annual POE levels. 1 %, 5 %, 10 %, 50 % and 90 % POE demand levels by year, 98/99 to 20/21, with actual annual maxima shown as points.]
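A probability of exceedance (PoE) level is an upper quantile of the annual maximum distribution: the 10 % PoE level is the value exceeded with probability 0.10, that is, the 90th percentile. A minimal Python sketch of reading PoE levels off a set of simulated annual maxima follows. The function name `poe_levels` and the simulated values are assumptions for illustration; the numbers are made up and are not the forecasts in the talk.

```python
import numpy as np

def poe_levels(annual_maxima, poe_percent=(1, 5, 10, 50, 90)):
    """Return {PoE %: demand level exceeded with that probability}."""
    sims = np.asarray(annual_maxima, dtype=float)
    return {p: float(np.quantile(sims, 1 - p / 100)) for p in poe_percent}

# Made-up simulated annual maxima (GW) standing in for the model's simulations.
rng = np.random.default_rng(1)
simulated_maxima = rng.normal(loc=3.2, scale=0.3, size=5000)
levels = poe_levels(simulated_maxima)
```

Under this convention a smaller PoE percentage corresponds to a higher demand level, so the 1 % PoE curve sits above the 90 % PoE curve.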
Results
    We have successfully forecast the extreme upper tail in
    ten years' time using only twelve years of data!
    This method has now been adopted for the official
    long-term peak electricity demand forecasts for all states
    except WA.
Some lessons
   Cross-validation is very useful in prediction
   problems.
    Statistical modelling is an iterative process.
    Getting client understanding of percentiles is
    extremely difficult.
    Beware of clients who think they know more
    than you!
Crazy clients

 The client who wouldn’t tell me the
 problem.
 The client who wanted all meetings
 held at random locations for security
 reasons.
 The client who didn’t like the answer.
 Expert witnessing on the color purple
 (and now yellow).

Go forth and consult
A good statistician is not smarter than everyone else, he merely has his ignorance better organised.
                               (Anonymous)
All models are wrong, some are useful.
                               (George E P Box)
It is better to solve the right problem the wrong way than the wrong problem the right way.
                               (John W Tukey)

Slides available from robjhyndman.com

Ysc2013

  • 4. Outline 1 Where fools fear to tread 2 Working with inadequate tools 3 When you can’t lose 4 Getting dirty with data 5 Going to extremes 6 Final thoughts Man vs Wild Data Where fools fear to tread 2
  • 5. My story: Olympic video poker slots. Beware of smelly clients. Threats and slander. Nerves in court. Three university consulting services. Reviewing my own work. Six times an expert witness. Hundreds of clients.
  • 13. Outline, section 2: Working with inadequate tools
  • 17. Disposable tableware company. Problem: Want forecasts of each of hundreds of items. Series can be stationary, trended or seasonal. They currently have a large forecasting program written in-house, but it doesn’t seem to produce sensible forecasts. They want me to tell them what is wrong and fix it. Additional information: The program is written in COBOL, which limits numerical calculations; it is not possible to do any optimisation. Their programmer has little experience in numerical computing. They employ no statisticians and want the program to produce forecasts automatically.
  • 18. Disposable tableware company. Methods currently used: (A) 12-month average; (C) 6-month average; (E) straight-line regression over the last 12 months; (G) straight-line regression over the last 6 months; (H) average slope between last year’s and this year’s values (equivalent to differencing at lag 12 and taking the mean); (I) same as H except over 6 months; (K) I couldn’t understand the explanation.
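The simpler methods in this list can be sketched in a few lines. The Python below is an illustration only, not the company's COBOL program: the function names are hypothetical, and the slide does not say how method H turned its average lag-12 difference into a forecast, so only that average is computed.

```python
from statistics import mean

def moving_average_forecast(series, window=12):
    """Methods A and C: forecast the next value as the mean of the last `window` values."""
    return mean(series[-window:])

def straight_line_forecast(series, window=12):
    """Methods E and G: least-squares line through the last `window` values,
    extrapolated one step ahead."""
    y = series[-window:]
    n = len(y)
    x_bar = (n - 1) / 2          # mean of the time index 0..n-1
    y_bar = mean(y)
    sxx = sum((i - x_bar) ** 2 for i in range(n))
    sxy = sum((i - x_bar) * (v - y_bar) for i, v in enumerate(y))
    slope = sxy / sxx
    return y_bar + slope * (n - x_bar)   # value of the fitted line at index n

def lag12_mean_difference(series, lag=12):
    """Method H as described: difference at lag 12, then take the mean."""
    return mean(series[t] - series[t - lag] for t in range(lag, len(series)))
```

Passing `window=6` or `lag=6` gives the six-month variants (C, G and I).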
  • 19. Disposable tableware company. My solution: Use first differencing to deal with trend, or seasonal differencing to deal with seasonality. Use simple exponential smoothing (SES) on the (differenced) data, with the smoothing parameter selected from {0.1, 0.3, 0.5, 0.7, 0.9}. For each series, try 15 models: three differencing options (none, first, seasonal), each combined with SES at the 5 parameter values. The model with the smallest MSE is selected. (Only one parameter for each model, so no need to penalize for model size.)
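A minimal sketch of this 15-model search, written in Python for readability rather than in the client's COBOL. Several details are assumptions not given on the slide: the level is initialised at the first observation, the seasonal period defaults to 12, and the MSE is computed over a common evaluation window so that the three differencing options are compared on the same time points. Because the forecast of the original series is the lagged value plus the forecast of the differenced series, the one-step errors are the same on both scales.

```python
ALPHAS = (0.1, 0.3, 0.5, 0.7, 0.9)

def difference(y, lag):
    """Return y[t] - y[t - lag] for every t where both values exist."""
    return [y[t] - y[t - lag] for t in range(lag, len(y))]

def ses_errors(z, alpha):
    """One-step SES forecast errors, with the level initialised at z[0] (an assumption)."""
    level, errors = z[0], []
    for obs in z[1:]:
        errors.append(obs - level)
        level = alpha * obs + (1 - alpha) * level
    return errors

def select_model(y, season=12):
    """Try 3 differencing options x 5 smoothing parameters; keep the smallest MSE."""
    transforms = {
        "none": y,
        "first": difference(y, 1),
        "seasonal": difference(y, season),
    }
    n_eval = len(y) - season - 1   # errors available to all 15 models
    if n_eval < 1:
        raise ValueError("series too short for seasonal differencing")
    best = None
    for name, z in transforms.items():
        for alpha in ALPHAS:
            errs = ses_errors(z, alpha)[-n_eval:]
            mse = sum(e * e for e in errs) / n_eval
            if best is None or mse < best[0]:
                best = (mse, name, alpha)
    return best   # (mse, differencing option, smoothing parameter)
```

No optimiser is needed: the search is two nested loops over fixed values, which is the point of the design given the constraints described earlier.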
  • 23. Disposable tableware company. Some lessons: Be pragmatic. Understand your tools well enough to be able to adapt them. A successful consulting job often uses very simple methods.
  • 24. Outline, section 3: When you can’t lose
  • 25. Forecasting the PBS
  • 26. Forecasting the PBS. The Pharmaceutical Benefits Scheme (PBS) is the Australian government drugs subsidy scheme. Many drugs bought from pharmacies are subsidised to allow more equitable access to modern drugs. The cost to government is determined by the number and types of drugs purchased. Currently nearly 1% of GDP. The total cost is budgeted based on forecasts of drug usage.
  • 30. Forecasting the PBS
  • 31. Forecasting the PBS. In 2001: $4.5 billion budget, under-forecasted by $800 million. Thousands of products. Seasonal demand. Subject to covert marketing, volatile products, uncontrollable expenditure. Although monthly data are available for 10 years, the data are aggregated to annual values, and only the first three years are used in estimating the forecasts. All forecasts are being done with the FORECAST function in MS-Excel!
  • 36. ATC drug classification: A Alimentary tract and metabolism; B Blood and blood forming organs; C Cardiovascular system; D Dermatologicals; G Genito-urinary system and sex hormones; H Systemic hormonal preparations, excluding sex hormones and insulins; J Anti-infectives for systemic use; L Antineoplastic and immunomodulating agents; M Musculo-skeletal system; N Nervous system; P Antiparasitic products, insecticides and repellents; R Respiratory system; S Sensory organs; V Various.
  • 37. ATC drug classification. 14 classes: A Alimentary tract and metabolism. 84 classes: A10 Drugs used in diabetes. Finer levels: A10B Blood glucose lowering drugs; A10BA Biguanides; A10BA02 Metformin.
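The nesting of ATC codes shown in the metformin example can be made explicit by cutting a full code at fixed character positions. The sketch below assumes a seven-character code of the form shown on the slide; the function name `atc_levels` is hypothetical.

```python
def atc_levels(code):
    """Split a full seven-character ATC code into its five nested levels."""
    if len(code) != 7:
        raise ValueError("expected a seven-character ATC code such as 'A10BA02'")
    cuts = (1, 3, 4, 5, 7)   # prefix lengths of the five levels
    return [code[:c] for c in cuts]

# The metformin example from the slide:
levels = atc_levels("A10BA02")
# levels == ["A", "A10", "A10B", "A10BA", "A10BA02"]
```

Prefixes like these are a convenient key for aggregating product-level series up to the 84 or 14 class levels.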
  • 38. Forecasting the PBS. Monthly data on thousands of drug groups and 4 concession types are available from 1991. The method needs to be automated and implemented within MS-Excel. Exponential smoothing seems appropriate (monthly data with changing trends and seasonal patterns), but in 2001 automated exponential smoothing was not well developed and not available in MS-Excel. As part of this project, we developed an automatic forecasting algorithm for exponential smoothing state space models based on the AIC. Forecast MAPE reduced from 15–20% to about 0.6%.
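To illustrate the idea of choosing an exponential smoothing model by AIC, here is a deliberately small Python sketch. It is not the algorithm developed for the PBS project: it compares only two non-seasonal models (simple exponential smoothing and an additive-trend model), searches a coarse parameter grid instead of optimising, fixes the initial states from the first observations, and uses the Gaussian AIC up to a constant, n log(SSE/n) + 2k. All function names and the parameter counts k are assumptions. A MAPE helper is included because that is the accuracy measure quoted above.

```python
import math
from itertools import product

GRID = (0.1, 0.3, 0.5, 0.7, 0.9)   # assumed coarse grid, not an optimiser

def ses_sse(y, alpha):
    """Sum of squared one-step errors for simple exponential smoothing."""
    level, sse = y[0], 0.0
    for obs in y[1:]:
        error = obs - level
        sse += error ** 2
        level += alpha * error
    return sse

def holt_sse(y, alpha, beta):
    """Sum of squared one-step errors for additive-trend smoothing."""
    level, trend, sse = y[0], y[1] - y[0], 0.0
    for obs in y[1:]:
        error = obs - (level + trend)
        sse += error ** 2
        level += trend + alpha * error   # uses the trend from the previous step
        trend += beta * error
    return sse

def aic(sse, n, k):
    """Gaussian AIC up to an additive constant; the floor avoids log(0)."""
    return n * math.log(max(sse, 1e-12) / n) + 2 * k

def select_by_aic(y):
    """Return (model name, parameters, AIC) for the candidate with the lowest AIC."""
    n = len(y) - 1   # number of one-step errors
    candidates = [("ses", (a,), aic(ses_sse(y, a), n, 2)) for a in GRID]
    candidates += [("holt", (a, b), aic(holt_sse(y, a, b), n, 4))
                   for a, b in product(GRID, GRID)]
    return min(candidates, key=lambda c: c[2])

def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    return 100 * sum(abs(a - f) / abs(a) for a, f in zip(actual, forecast)) / len(actual)
```

The penalty term matters here, unlike in the tableware example, because the candidate models have different numbers of parameters.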
  • 43. Forecasting the PBS. Figure: total cost, A03 concession safety net group ($ thousands, 1995–2010).
  • 44. Forecasting the PBS. Figure: total cost, A05 general copayments group ($ thousands, 1995–2010).
  • 45. Forecasting the PBS. Figure: total cost, D01 general copayments group ($ thousands, 1995–2010).
  • 46. Forecasting the PBS. Figure: total cost, S01 general copayments group ($ thousands, 1995–2010).
  • 47. Forecasting the PBS. Figure: total cost, R03 general copayments group ($ thousands, 1995–2010).
  • 48. Forecasting the PBS. Some lessons: Often what people do is very bad, and it is easy to make a big difference. Sometimes you have to invent new methods, and that can lead to publications. You have to implement solutions in the client’s software environment. Be aware of the politics.
  • 49. Outline, section 4: Getting dirty with data
  • 50. Airline passenger traffic
  • 51. Airline passenger traffic. Figure: first class, business class and economy class passengers, Melbourne–Sydney, by year (1988–1993).
  • 52. Airline passenger traffic. Not the real data! Or is it?
  • 53. Airline passenger traffic. Figure: economy class passengers (thousands), Melbourne–Sydney, 1988–1993.
  • 56. Possible model: $Y_t = Y_t^* + Z_t$ and $Y_t^* = \beta_0 + \sum_j \beta_j x_{t,j} + N_t$, where $Y_t$ = observed data for one passenger class; $Y_t^*$ = reconstructed data; $Z_t$ = latent process (usually equal to zero); $x_{t,j}$ are covariates and dummy variables; $N_t$ = seasonal ARIMA process of period 52.
  • 57. Possible model. Some lessons: Real data is often very messy. Be aware of the causes. Get an answer even if it isn’t pretty. What to do with the non-integer (average 52.19) seasonality? How to deal with the correlations between classes and between routes? You often think of better approaches long after the project is finished.
  • 58. Outline, section 5: Going to extremes
  • 59. Extreme electricity demand
  • 60. The problem: We want to forecast the peak electricity demand in a half-hour period in ten years’ time. We have twelve years of half-hourly electricity data, temperature data and some economic and demographic data. The location is South Australia: home to the most volatile electricity demand in the world. Sounds impossible?
  • 65. South Australian demand data
  • 66. South Australian demand data. Figure annotation: Black Saturday →
  • 67. South Australian demand data. Figure: South Australia state wide demand (GW), summer 10/11 (Oct 10 to Mar 11).
  • 68. South Australian demand data. Figure: South Australia state wide demand (GW) by date in January 2011.
  • 69. Demand boxplots (Sth Aust). Figure: demand (GW) by day of week, time: 12 midnight.
  • 70. Temperature data (Sth Aust). Figure: demand (GW) against temperature (deg C) for workdays and non-workdays, time: 12 midnight.
  • 71. Monash Electricity Forecasting Model: $\log(y_t) = h_p(t) + f_p(w_{1,t}, w_{2,t}) + \sum_{j=1}^{J} c_j z_{j,t} + n_t$, where $y_t$ denotes per capita demand at time $t$ (measured in half-hourly intervals) and $p$ denotes the time of day, $p = 1, \dots, 48$; $h_p(t)$ models all calendar effects; $f_p(w_{1,t}, w_{2,t})$ models all temperature effects, where $w_{1,t}$ is a vector of recent temperatures at location 1 and $w_{2,t}$ is a vector of recent temperatures at location 2; $z_{j,t}$ is a demographic or economic variable at time $t$; $n_t$ denotes the model error at time $t$.
  • 76. Monash Electricity Forecasting Model: $\log(y_t) = h_p(t) + f_p(w_{1,t}, w_{2,t}) + \sum_{j=1}^{J} c_j z_{j,t} + n_t$. $h_p(t)$ includes annual, weekly and daily seasonal patterns as well as public holidays: $h_p(t) = \ell_p(t) + \alpha_{t,p} + \beta_{t,p} + \gamma_{t,p} + \delta_{t,p}$, where $\ell_p(t)$ is the “time of summer” effect (a regression spline); $\alpha_{t,p}$ is the day of week effect; $\beta_{t,p}$ is the “holiday” effect; $\gamma_{t,p}$ is the New Year’s Eve effect; $\delta_{t,p}$ is the millennium effect.
  • 82. Fitted results (Summer 3pm). Figure: effect on demand at 3:00 pm against day of summer, day of week, and holiday status (Normal, Day before, Holiday, Day after).
  • 83. Monash Electricity Forecasting Model: $\log(y_t) = h_p(t) + f_p(w_{1,t}, w_{2,t}) + \sum_{j=1}^{J} c_j z_{j,t} + n_t$, with temperature effects $f_p(w_{1,t}, w_{2,t}) = \sum_{k=0}^{6} \left[ f_{k,p}(x_{t-k}) + g_{k,p}(d_{t-k}) \right] + q_p(x_t^+) + r_p(x_t^-) + s_p(\bar{x}_t) + \sum_{j=1}^{6} \left[ F_{j,p}(x_{t-48j}) + G_{j,p}(d_{t-48j}) \right]$, where $x_t$ is the average temperature across two sites (Kent Town and Adelaide Airport) at time $t$; $d_t$ is the temperature difference between the two sites at time $t$; $x_t^+$ is the maximum of the $x_t$ values in the past 24 hours; $x_t^-$ is the minimum of the $x_t$ values in the past 24 hours; $\bar{x}_t$ is the average temperature in the past seven days. Each function is smooth and estimated using regression splines.
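The temperature predictors defined on this slide can be built directly from the two half-hourly temperature series. The Python sketch below is an illustration under stated assumptions: the function and key names are hypothetical, the "past 24 hours" and "past seven days" windows are taken to include time t, and the sign of the site difference (site 1 minus site 2) is a guess. The spline fitting itself is not shown.

```python
from statistics import mean

PERIODS_PER_DAY = 48   # half-hourly data

def temperature_features(site1, site2, t):
    """Derived temperature predictors at half-hour index t."""
    x = [(a + b) / 2 for a, b in zip(site1, site2)]   # average across the two sites
    d = [a - b for a, b in zip(site1, site2)]         # difference (sign is assumed)
    week = 7 * PERIODS_PER_DAY
    if t < week - 1 or t >= len(x):
        raise ValueError("need a full seven days of history up to index t")
    return {
        "x_lags": [x[t - k] for k in range(7)],                      # x_t .. x_{t-6}
        "d_lags": [d[t - k] for k in range(7)],                      # d_t .. d_{t-6}
        "x_max_24h": max(x[t - PERIODS_PER_DAY + 1 : t + 1]),        # x_t^+
        "x_min_24h": min(x[t - PERIODS_PER_DAY + 1 : t + 1]),        # x_t^-
        "x_mean_7d": mean(x[t - week + 1 : t + 1]),                  # x-bar_t
        "x_day_lags": [x[t - PERIODS_PER_DAY * j] for j in range(1, 7)],  # x_{t-48j}
        "d_day_lags": [d[t - PERIODS_PER_DAY * j] for j in range(1, 7)],  # d_{t-48j}
    }
```

Each entry corresponds to the argument of one smooth function in the equation, so a row of these values per half-hour is what a spline-based additive model would be fitted to.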
  • 90. Fitted results (Summer 3pm). [Figure, time 3:00 pm: eight panels showing the effect on demand (vertical axis, −0.4 to 0.4) against temperature, lag 1 temperature, lag 2 temperature, lag 3 temperature, lag 1 day temperature, last week average temp, previous max temp, and previous min temp.]
  • 91. Monash Electricity Forecasting Model
      $$\log(y_t) = h_p(t) + f_p(w_{1,t}, w_{2,t}) + \sum_{j=1}^{J} c_j z_{j,t} + n_t$$
      Same predictors used for all 48 models. Predictors chosen by cross-validation on summer of 2007/2008 and 2009/2010. Each model is fitted to the data twice, first excluding the summer of 2009/2010 and then excluding the summer of 2010/2011. The average out-of-sample MSE is calculated from the omitted data for the time periods 12 noon–8.30pm.
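      The leave-one-summer-out procedure described on this slide can be sketched as follows. This is an illustration only: it swaps in ordinary least squares for the regression-spline fits, and the function name, the `summer` label array, and the design matrix `X` are hypothetical.

```python
import numpy as np

def leave_one_summer_out_mse(X, y, summer, held_out):
    """Average out-of-sample MSE when each listed summer is omitted in turn.

    X        : (n, p) design matrix for one candidate predictor set
    y        : (n,) response, e.g. log demand
    summer   : (n,) label of the summer each observation belongs to
    held_out : summers to omit one at a time
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    summer = np.asarray(summer)
    mses = []
    for s in held_out:
        test = summer == s
        train = ~test
        # Fit on everything except the held-out summer (linear stand-in model).
        coef, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        resid = y[test] - X[test] @ coef
        mses.append(np.mean(resid ** 2))
    return float(np.mean(mses))
```

      Calling this once per candidate predictor set and comparing the returned values mirrors the way a candidate model is preferred when its average out-of-sample MSE is lower.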
  • 94. Half-hourly models. Candidate predictors: x, x1–x6, x48, x96, x144, x192, x240, x288; d, d1–d6, d48, d96, d144, d192, d240, d288; x+, x−, x̄; dow, hol, dos. MSE for each of the 31 candidate models (the markers showing which predictors each model includes are not recoverable as aligned columns):
      Model   MSE
      1       1.037
      2       1.034
      3       1.031
      4       1.027
      5       1.025
      6       1.020
      7       1.025
      8       1.026
      9       1.035
      10      1.044
      11      1.057
      12      1.076
      13      1.102
      14      1.018
      15      1.021
      16      1.037
      17      1.074
      18      1.152
      19      1.180
      20      1.021
      21      1.027
      22      1.038
      23      1.056
      24      1.086
      25      1.135
      26      1.009
      27      1.063
      28      1.028
      29      3.523
      30      2.143
      31      1.523
  • 95. Half-hourly models. [Figure: R-squared (%) of each half-hourly model against time of day, 12 midnight to 12 midnight; vertical axis from 60 to 90.]
  • 96. Half-hourly models. [Figure: South Australian demand (GW), January 2011, actual and fitted, by date in January.]
  • 99. Adjusted model
      Original model:
      $$\log(y_t) = h_p(t) + f_p(w_{1,t}, w_{2,t}) + \sum_{j=1}^{J} c_j z_{j,t} + n_t$$
      Model allowing saturated usage:
      $$q_t = h_p(t) + f_p(w_{1,t}, w_{2,t}) + \sum_{j=1}^{J} c_j z_{j,t} + n_t$$
      $$\log(y_t) = \begin{cases} q_t & \text{if } q_t \le \tau; \\ \tau + k(q_t - \tau) & \text{if } q_t > \tau. \end{cases}$$
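      The saturation adjustment is a piecewise-linear map from $q_t$ to $\log(y_t)$. A minimal sketch follows; the function name and the example values of $\tau$ and $k$ are hypothetical, since the slides do not give fitted values.

```python
import numpy as np

def saturated_log_demand(q, tau, k):
    """Map q_t to log(y_t): identity up to tau, slope k above tau."""
    q = np.asarray(q, dtype=float)
    return np.where(q <= tau, q, tau + k * (q - tau))

# Hypothetical values: with tau = 1.0 and k = 0.5, growth above the
# threshold is halved, so q = 2.0 maps to 1.0 + 0.5 * (2.0 - 1.0) = 1.5.
example = saturated_log_demand([0.5, 1.0, 2.0], tau=1.0, k=0.5)
```

      With $k = 1$ the adjusted model reduces to the original model; $k < 1$ damps predicted demand once $q_t$ exceeds the threshold.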
  • 101. Peak demand forecasting
      $$q_{t,p} = h_p(t) + f_p(w_{1,t}, w_{2,t}) + \sum_{j=1}^{J} c_j z_{j,t} + n_t$$
      Multiple alternative futures created: $h_p(t)$ known; simulate future temperatures using double seasonal block bootstrap with variable blocks (with adjustment for climate change); use assumed values for GSP, population and price; resample residuals using double seasonal block bootstrap with variable blocks.
  • 102. Peak demand backcasting
      $$q_{t,p} = h_p(t) + f_p(w_{1,t}, w_{2,t}) + \sum_{j=1}^{J} c_j z_{j,t} + n_t$$
      Multiple alternative pasts created: $h_p(t)$ known; simulate past temperatures using double seasonal block bootstrap with variable blocks; use actual values for GSP, population and price; resample residuals using double seasonal block bootstrap with variable blocks.
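      The idea behind resampling temperatures or residuals in seasonal blocks can be sketched in a simplified form. This is not the double seasonal block bootstrap with variable blocks used in the talk: it uses fixed-length blocks of whole days, and the function name and the array layout (years × days × half-hours) are assumptions for illustration.

```python
import numpy as np

def seasonal_block_bootstrap(data, block_days, rng):
    """Simulate one season by stitching together blocks of whole days.

    data       : array of shape (n_years, n_days, periods_per_day)
    block_days : fixed block length in days (a simplification)
    rng        : numpy random Generator
    Each block keeps its position within the season but is taken from a
    randomly chosen year, so seasonal and daily patterns are preserved.
    """
    data = np.asarray(data, dtype=float)
    n_years, n_days, _ = data.shape
    pieces = []
    for start in range(0, n_days, block_days):
        year = rng.integers(n_years)  # pick a source year for this block
        pieces.append(data[year, start:start + block_days])
    return np.concatenate(pieces, axis=0)
```

      Repeating the call many times gives a set of alternative seasons, each of which can be pushed through the fitted model to obtain a simulated demand path.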
  • 103. Peak demand backcasting. [Figure: PoE (annual interpretation); 10%, 50% and 90% PoE demand levels with actual values marked as points, by year from 98/99 to 10/11.]
  • 104. Peak demand forecasting. [Figure: high, base and low scenarios, 1990 to 2020, for South Australia GSP (billion dollars, 08/09 dollars), South Australia population (million), and average electricity prices (c/kWh); a further panel is titled "Major industrial offset demand".]
  • 105. Peak demand distribution. [Figure: annual POE levels; 1%, 5%, 10%, 50% and 90% POE demand with actual annual maximum marked as points, by year from 98/99 to 20/21.]
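      A probability-of-exceedance (PoE) level is an upper quantile of the simulated annual maxima: the 10% PoE level is the demand exceeded in 10% of simulated years, which is the 90th percentile. A minimal sketch, assuming the simulated annual maxima are already available as an array (the function name and inputs are hypothetical):

```python
import numpy as np

def poe_levels(simulated_annual_maxima, poe_percent=(10, 50, 90)):
    """Return the demand level for each probability of exceedance (in %).

    A p% PoE level is the (100 - p)th percentile of the simulated maxima.
    """
    maxima = np.asarray(simulated_annual_maxima, dtype=float)
    return {p: float(np.percentile(maxima, 100 - p)) for p in poe_percent}
```

      Because a lower PoE corresponds to a higher percentile, the 10% PoE level is never below the 50% level, which is never below the 90% level.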
  • 106. Results: We have successfully forecast the extreme upper tail in ten years' time using only twelve years of data! This method has now been adopted for the official long-term peak electricity demand forecasts for all states except WA. Some lessons: Cross-validation is very useful in prediction problems. Statistical modelling is an iterative process. Getting clients to understand percentiles is extremely difficult. Beware of clients who think they know more than you!
  • 110. Outline: 1 Where fools fear to tread; 2 Working with inadequate tools; 3 When you can’t lose; 4 Getting dirty with data; 5 Going to extremes; 6 Final thoughts
  • 111. Crazy clients: The client who wouldn’t tell me the problem. The client who wanted all meetings held at random locations for security reasons. The client who didn’t like the answer. Expert witnessing on the color purple (and now yellow).
  • 115. Go forth and consult: "A good statistician is not smarter than everyone else, he merely has his ignorance better organised." (Anonymous)
  • 116. Go forth and consult: "All models are wrong, some are useful." (George E P Box)
  • 118. Go forth and consult: "It is better to solve the right problem the wrong way than the wrong problem the right way." (John W Tukey) Slides available from robjhyndman.com