An explanation of how Can I Solar? (www.canisolar.com) works. Discussion of linear regression and time series forecasting. Verification of linear regression assumptions; discussion of an automatic model selection pipeline for linear, ARIMA, and exponential smoothing time series models.
2. MOTIVATION
• Residential solar sector grew 51% from 2013 to 2014
• Projected market value of $3.7 billion in 2015
• Complex decision with many variables
• Homeowners want to know:
  • How much money can I save?
  • When will I break even?
3. CAN I SOLAR?
A DATA-DRIVEN WEB APPLICATION
http://www.canisolar.com
4. MODELING INSTALLATION COSTS
• Data on 400,000 installs obtained from National Renewable Energy Laboratory
• Cost of solar installations varies by:
  • size of the array
  • year of installation
  • location of installation
• Multiple linear regression provides good fit and is easily interpretable
• Also tried multilevel modeling and random forest regression
5. MODELING FUTURE ELECTRICITY PRICES
• 15 years of monthly historical electricity prices by state obtained from Energy Information Administration
• Prices and trends vary significantly by state, so no one model works best for all states
• Developed a pipeline to automatically test, validate, and select an appropriate time-series model for each state, e.g.:
  • linear
  • ARIMA
  • exponential smoothing
9. GABRIEL J. MICHAEL
• Ph.D., Political Science, George Washington University
  • Used survival regression to model countries' adoption of intellectual property laws
• Postdoc, Yale Law School
  • Used NLP with SVMs to classify tweets and regulatory comments on political topics
• Urban explorer, electronics hobbyist
[Photo: Exploring the since-demolished PEPCO Benning Generating Station, Washington, DC]
[Figure: Visualization of Twitter users' connections and sentiment about net neutrality]
10. MODELS OF INSTALLATION COSTS
• Simple linear regression
  • Model form: log(cost) ~ log(size_kw)
  • R² or pseudo-R²: 0.81
  • 10-fold CV MSE: 0.089
• Multiple linear regression
  • Model form: log(cost) ~ log(size_kw) + state + year
  • R² or pseudo-R²: 0.89
  • 10-fold CV MSE: 0.053
• Multilevel model
  • Model form: log(cost) ~ log(size_kw) + (log(size_kw) | state/year_installed)
  • R² or pseudo-R²: 0.89
  • 10-fold CV MSE: 0.050
• Random forest regression
  • Model form: log(cost) ~ log(size_kw)
  • R² or pseudo-R²: 0.93
  • 10-fold CV MSE: 0.050
• Notes
  • Linear regression: easy to interpret and explain
  • Multilevel model: confidence and prediction intervals for multilevel models are difficult to interpret
  • Random forest: scikit-learn's random forest regressor doesn't support factors, and the R packages are too slow
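As a rough illustration of the simplest model form above and of what the 10-fold CV MSE measures, the sketch below fits log(cost) ~ log(size_kw) by ordinary least squares and scores it with 10-fold cross-validation. The data are synthetic: the generating coefficients (8.5 and 0.9) and the noise level are arbitrary assumptions, not estimates from the NREL data. The helper names are hypothetical, and the original modeling was done in R rather than in plain Python.

```python
import math
import random

def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def kfold_cv_mse(x, y, k=10, seed=0):
    """Squared prediction error averaged over every held-out observation."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds covering all rows
    sq_errors = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        a, b = fit_simple_ols([x[i] for i in train], [y[i] for i in train])
        sq_errors.extend((y[i] - (a + b * x[i])) ** 2 for i in fold)
    return sum(sq_errors) / len(sq_errors)

# Synthetic installs (assumed values, for illustration only).
rng = random.Random(42)
size_kw = [rng.uniform(2.0, 12.0) for _ in range(500)]
log_size = [math.log(s) for s in size_kw]
log_cost = [8.5 + 0.9 * ls + rng.gauss(0.0, 0.3) for ls in log_size]

intercept, slope = fit_simple_ols(log_size, log_cost)
cv_mse = kfold_cv_mse(log_size, log_cost)
print(f"intercept={intercept:.3f} slope={slope:.3f} cv_mse={cv_mse:.3f}")
```

Because both sides are logged, the slope reads as an elasticity: a 1% increase in array size goes with roughly a slope-percent increase in cost.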
11. Per-capita electricity consumption has flattened and even declined in recent years
[Figure: United States, kWh per capita (y-axis 0 to 16,000), 1960 to 2011]
12. PHOTOVOLTAIC PERFORMANCE DECLINE OVER TIME
• Industry standard warranties offer guaranteed 90% output at 10 years, 80% output at 25 years
• I use a simple exponential decay curve to calculate performance in month 0 to month 360 (30 years)
Performance = e^(-0.005322 - 0.008935 * Years)
[Figure: Performance (0.0 to 1.0) versus Year (0 to 30)]
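The decay curve can be evaluated month by month with a few lines of Python. This is a minimal sketch that uses the two coefficients printed on the slide. The function and variable names are hypothetical, and converting months to fractional years by dividing by 12 is an assumption about how the monthly values were derived.

```python
import math

# Coefficients as printed on the slide (time measured in years).
INTERCEPT = -0.005322
RATE_PER_YEAR = -0.008935

def performance(years):
    """Fraction of original output remaining after `years` of operation."""
    return math.exp(INTERCEPT + RATE_PER_YEAR * years)

# Month 0 through month 360 inclusive, i.e. a 30-year horizon.
monthly_performance = [performance(month / 12) for month in range(361)]

for year in (0, 10, 25, 30):
    print(year, round(monthly_performance[year * 12], 3))
```

The curve gives about 0.91 at year 10 and about 0.80 at year 25, which lines up with the 90% and 80% warranty levels quoted above.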
15. BACKEND
• Python 3 + pandas for core classes and program logic
• R for modeling + rpy2 Python interface to R
• MySQL for storage of electricity consumption and price data, and solar installation cost/size data
• MongoDB for storage and retrieval of geolocated insolation data
• Code on GitHub: https://github.com/langelgjm/canisolar
18. ASSUMPTIONS OF LINEAR REGRESSION
• Homoskedasticity (constant variance of errors)
  • Some evidence of heteroskedasticity
  • Could use robust standard errors for intervals, although the confidence intervals are not much wider
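To make the robust-standard-errors point concrete, the sketch below compares the classical standard error of a regression slope with the HC0 (White) heteroskedasticity-robust version for a one-predictor model. HC0 is chosen here only as the simplest robust estimator; the slides do not say which variant was considered. The data are synthetic, with error spread growing in x by assumption, and the function name is hypothetical.

```python
import math
import random

def ols_with_standard_errors(x, y):
    """Fit y = a + b*x; return slope, classical SE, and HC0 robust SE of the slope."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    dev = [xi - mean_x for xi in x]
    sxx = sum(d * d for d in dev)
    slope = sum(d * (yi - mean_y) for d, yi in zip(dev, y)) / sxx
    intercept = mean_y - slope * mean_x
    resid = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Classical SE pools all residuals into one common error variance.
    sigma2 = sum(e * e for e in resid) / (n - 2)
    classical_se = math.sqrt(sigma2 / sxx)
    # HC0 keeps each observation's own squared residual instead of pooling.
    robust_se = math.sqrt(sum((d * e) ** 2 for d, e in zip(dev, resid))) / sxx
    return slope, classical_se, robust_se

# Synthetic heteroskedastic data: noise standard deviation proportional to x.
rng = random.Random(7)
x = [rng.uniform(1.0, 10.0) for _ in range(1000)]
y = [2.0 + 0.5 * xi + rng.gauss(0.0, 0.1 * xi) for xi in x]

slope, classical_se, robust_se = ols_with_standard_errors(x, y)
print(f"slope={slope:.4f} classical_se={classical_se:.5f} robust_se={robust_se:.5f}")
```

A confidence interval built from either standard error is centered on the same slope; only the width changes.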
19. ASSUMPTIONS OF LINEAR REGRESSION
• Normality of residuals
  • Evidence of non-normal (heavy tailed) error distribution
  • This assumption only necessary for confidence intervals/p-values, not best linear unbiased estimates
  • Could use robust regression with t-distribution
20. ASSUMPTIONS OF LINEAR REGRESSION
• True linear relationship
  • True with simple regression of cost ~ size
• No significant multicollinearity
  • Variance inflation factors relatively low
21. TIME SERIES MODELING
• No other predictors (time is the only variable)
• Strong a priori reason to believe most states will have an increasing, roughly linear trend in future electricity prices, often with seasonality
22. TIME SERIES MODELING
• States vary significantly from one another in historical prices, trends, and seasonality
• We cannot expect the same model to perform well for all states!
24. LONG-TERM FORECASTING: A SOLUTION
1. Create a handcrafted list of 7 possible models (1 linear, 4 ARIMA, and 2 exponential smoothing)
Model                  Parameters  Seasonal Parameters  Note
Linear                 n/a         n/a
ARIMA                  (1,0,0)     None                 include drift
ARIMA                  (1,1,0)     None                 include drift
ARIMA                  (1,0,0)     (1,0,0)
ARIMA                  (1,0,0)     (1,1,0)
Exponential Smoothing  M           M                    no damping
Exponential Smoothing  A           A                    no damping
25. LONG-TERM FORECASTING: A SOLUTION
2. Train each model on 1/3, 1/2, & 2/3 of historical data; test on the respective remaining proportion of historical data (2 models shown)
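The three train/test splits can be sketched as a small generator. The function name is hypothetical, rounding the cut point to the nearest whole month is an assumption, and the series values are placeholders standing in for 15 years (180 months) of prices.

```python
def train_test_splits(series, fractions=(1/3, 1/2, 2/3)):
    """Yield (train, test): the first `fraction` of the series, then the remainder."""
    n = len(series)
    for fraction in fractions:
        cut = round(n * fraction)
        yield series[:cut], series[cut:]

# Placeholder values for 180 monthly observations (not real prices).
monthly_prices = list(range(180))
for train, test in train_test_splits(monthly_prices):
    print(len(train), len(test))
```

Each split keeps time order intact, so a model is always tested on months that come after everything it was trained on.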
26. LONG-TERM FORECASTING: A SOLUTION
3. Select the model with the lowest MSE across all tests
4. Repeat for every U.S. state + DC
5. Sanity check the resulting models
[Figures: forecasts through 2040 from ARIMA(1,0,0)(1,0,0)[12] with non-zero mean and from ETS(A,A,A); panels labeled NH and MS]
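The selection loop can be sketched end to end in plain Python. To keep the example self-contained, two deliberately simple forecasters (a least-squares trend line and a seasonal-naive repeat of the last year) stand in for the ARIMA and exponential smoothing candidates, which were fitted in R. Scoring by the mean test MSE over the three splits is one reading of "lowest MSE across all tests" and is an assumption, as are all names and the synthetic series.

```python
import math
import random

def linear_trend(train, horizon):
    """Extrapolate a least-squares line fitted to the training values."""
    n = len(train)
    mean_t = (n - 1) / 2
    mean_y = sum(train) / n
    sxx = sum((t - mean_t) ** 2 for t in range(n))
    slope = sum((t - mean_t) * (y - mean_y) for t, y in enumerate(train)) / sxx
    intercept = mean_y - slope * mean_t
    return [intercept + slope * (n + h) for h in range(horizon)]

def seasonal_naive(train, horizon, period=12):
    """Repeat the last observed year; a crude stand-in for a seasonal model."""
    last_year = train[-period:]
    return [last_year[h % period] for h in range(horizon)]

CANDIDATES = {"linear": linear_trend, "seasonal_naive": seasonal_naive}

def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def select_model(series, fractions=(1/3, 1/2, 2/3)):
    """Return (best_name, scores), scoring each candidate by mean test MSE."""
    scores = {}
    for name, forecast in CANDIDATES.items():
        errors = []
        for fraction in fractions:
            cut = round(len(series) * fraction)
            train, test = series[:cut], series[cut:]
            errors.append(mse(test, forecast(train, len(test))))
        scores[name] = sum(errors) / len(errors)
    return min(scores, key=scores.get), scores

# Synthetic 15-year monthly series for two made-up regions (not EIA data).
rng = random.Random(1)
series_by_region = {
    "trending": [8 + 0.03 * t + rng.gauss(0, 0.2) for t in range(180)],
    "seasonal": [12 + 2 * math.sin(2 * math.pi * t / 12) + rng.gauss(0, 0.2)
                 for t in range(180)],
}

# Repeat the selection for every region, mirroring the per-state loop.
selected = {region: select_model(s)[0] for region, s in series_by_region.items()}
print(selected)
```

On these synthetic series the trending region should end up with the linear model and the seasonal region with the seasonal-naive stand-in, which is the per-state behavior the pipeline is meant to capture. The sanity check in step 5 remains a manual look at each selected forecast.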