Austin, Siddiqui
Analyzing NFL Drive Success
Austin Grosel, Ovais Siddiqui
Table of Contents
1. ABSTRACT
2. INTRODUCTION
3. METHODOLOGY
4. ANALYSIS
5. FUTURE WORK
6. APPENDIX
7. SAS CODE
8. R CODE
9. REFERENCES
ABSTRACT
Sports have seen an analytical revolution in recent years. Many
franchises in the sports world are turning to data analysis for decision making. The
goal of this project is to look at American Football data from the NFL and try to
determine the characteristics of an efficient team offense. The research data had
every team’s drive statistics for the past sixteen years, with the dependent variable
as total points per drive (PPD). Using exploratory data analysis along with multiple
linear regression, two models were developed predicting PPD, one with interaction
terms and one without them. The model without the interaction terms seemed to
make more sense from a business perspective, and significant predictors of PPD
turned out to be related to the passing plays on offense. By understanding the
important factors in this PPD model, NFL front offices and coaches may want to
invest more capital in developing a premier passing offense.
INTRODUCTION
Since sports have been around, franchises have tried to find ways to gain a competitive
advantage. Around the start of the 2000s, a few professional baseball teams began using
mathematics, statistics, and data analysis alongside traditional scouting techniques. This shift
was chronicled in Michael Lewis's best-selling book Moneyball, which follows the Oakland
Athletics' 2002 season as they transitioned to using numbers over film. Because of the success
Oakland and other MLB teams found in adopting statistics, other teams followed, and soon,
other sports did too. Of the top three American sports to adopt analytics, American Football,
specifically the NFL, has seen the slowest transition. American Football differs from other
sports: there are 22 total players on the field, coaches have a much stronger influence on the
game than in baseball or basketball, and there are so many styles of play that it is extremely hard to quantify how
important some “box score” statistics actually are. However, recently there have been
statisticians, fans, and writers who have tried to make sense of this unique game.
An advanced analytics site called Football Outsiders has put together NFL drive
statistics (Football Outsiders), summarizing each team's drives throughout the season. A drive
in the NFL is a series of plays by an offense with the objective of scoring points. Each drive
has a starting field position. A drive ends when the offense scores (six points for a touchdown,
three for a field goal), turns the ball over to the other team, or the game clock expires.
There are two types of plays an offense can run on a drive: a rushing play,
which usually happens when the quarterback hands the ball to a teammate or runs it himself, or a
passing play, which happens when a player (usually the quarterback) throws the ball to another
player. Conventional wisdom holds that rushing plays gain fewer yards on average but are more
conservative play calls than passing plays. Teams have generally passed a little more than 50% of
the time (NFL Team Rankings), but this figure may be inflated because the clock stops when a
pass is incomplete (that is, thrown but not caught), so passing plays tend to be used when a
team needs to move down the field in a short amount of time. Data has also
shown that passing the ball well is more important to winning games than rushing the ball well
(Burke).
The data used in this project has every team's drive statistics for the past sixteen years. Our
goal is to develop a multiple linear regression model to predict how many points per drive (PPD)
a team should have, based on other drive statistics such as passing stats, rushing stats, starting
field position, time of drive, and plays per drive. Based on our research, we hypothesize that
passing statistics will be much more important to the model than rushing statistics, that starting
field position will be a significant factor (the further down the field a drive starts, the better
the chance of scoring points), and that more time on a drive results in more points. We feel that
our results can help teams invest their resources effectively, whether in the passing game or
the rushing game.
METHODOLOGY
We were able to obtain the data by subscribing to the Armchair Analysis NFL Database
(Armchair). This database had a zip file that contained CSV files for everything NFL related:
team and player data, schedules, history, etc. We observed the Drive.csv file which contained
data of every single drive for the past sixteen years. The different columns of this dataset were
the starting field position, the total time spent on the drive in seconds, the number of passing
plays and yards, the number of rushing plays and yards, the number of rushing and passing first
downs on that drive, and the number of plays on that drive. Along with these numbers, it had the
result of the drive.
For the pre-processing step, we first went into the dataset and created a new column
labeled “Points”. If the result was labeled a TD (touchdown), we’d put 6 points. If it was labeled
FG (field goal), we’d put 3 points. If we saw that the result was labeled ENDQ (end of quarter),
we decided to remove these observations from the dataset. This is because end of quarter drives
usually happen when a team decides to run the clock out, and thus their objective was never to
score in the first place. The rest of the results were given a 0 because there were no points scored
on that drive. We then created new fields for passing efficiency and rushing efficiency,
calculated by dividing passing yards by passing attempts and rushing yards by rushing
attempts.
To further reduce bias, we also removed all fourth-quarter drives. If a team is trailing in
the fourth quarter, it will pass far more than run because it wants to score quickly. Fourth-quarter
numbers would therefore skew toward more passing attempts than a typical drive, so those
drives were removed entirely.
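The report's actual pre-processing was done in R (see SetUp.R). As an illustration only, the same row-level steps can be sketched in Python; the field names below (result, qtr, pass_yds, and so on) are hypothetical stand-ins, not the actual Armchair Analysis column names.

```python
# Illustrative sketch of the pre-processing described above. Field names
# (result, qtr, pass_yds, ...) are hypothetical stand-ins.
POINTS = {"TD": 6, "FG": 3}  # every other drive result scores 0

def preprocess(drives):
    """Map results to points, drop ENDQ and fourth-quarter drives,
    and add per-drive passing/rushing efficiency (yards per attempt)."""
    cleaned = []
    for d in drives:
        if d["result"] == "ENDQ":   # clock ran out; team wasn't trying to score
            continue
        if d["qtr"] == 4:           # trailing teams skew pass-heavy late
            continue
        d = dict(d)
        d["points"] = POINTS.get(d["result"], 0)
        # guard against drives with zero attempts of a play type
        d["pass_eff"] = d["pass_yds"] / d["pass_att"] if d["pass_att"] else 0.0
        d["rush_eff"] = d["rush_yds"] / d["rush_att"] if d["rush_att"] else 0.0
        cleaned.append(d)
    return cleaned
```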
The next step was to aggregate these drives together based on each team and season.
These operations are found in the SetUp.R code in the R language. Now we had 510 seasonal
drive statistics for every team over the past 16 years. We chose points per drive (PPD) as our
dependent variable; PPD gives good insight into how efficient a particular team's offense was in
a particular season.
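The team-season aggregation itself is done in SetUp.R; a minimal standard-library Python sketch of the same idea (assuming each cleaned drive row carries tname, seas, and points fields):

```python
# Sketch of aggregating cleaned drives to one row per team-season,
# mirroring the aggregate() calls in SetUp.R. PPD is the mean of
# points over a team's drives in a season.
from collections import defaultdict

def ppd_by_team_season(drives):
    """Return {(team, season): points per drive}."""
    totals = defaultdict(lambda: [0.0, 0])   # key -> [total points, drive count]
    for d in drives:
        key = (d["tname"], d["seas"])
        totals[key][0] += d["points"]
        totals[key][1] += 1
    return {k: pts / n for k, (pts, n) in totals.items()}
```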
Our approach started with an exploratory analysis, which included plotting descriptive
statistics and examining correlation values. After checking for multicollinearity, we divided
the data into training and test sets. We then used the training set to build our first model via
the adjusted r-squared selection method, performed a residual analysis on it, and checked
whether interaction terms would increase the adjusted r-squared value. Afterwards, we used the
test data to see which model was more successful at prediction. Finally, after weighing all
options (model accuracy, simplicity from a business perspective, etc.), we settled on the model
we felt was best.
ANALYSIS
After the pre-processing phase, our dataset consists of nine predictors which are given as:
time = Total time of the drive (in seconds)
start_pos = Starting field position
pass_eff = Passing efficiency (pass yards/pass attempts)
rush_eff = Rushing efficiency (rush yards/rush attempts)
pass_att = Passing attempts per drive
rush_att = Rushing attempts per drive
pass_fd = First downs from passing plays per drive
rush_fd = First downs from rushing plays per drive
total_plays = Plays per drive
In order to check the distribution of our dependent variable, we created a histogram with a
normal density curve plotted over it (see Appendix A). The graph shows a normal distribution
with most values concentrated around the mean of µ = 1.668, with a standard deviation of
σ = 0.4035. This indicates that roughly 68% of teams score between 1.2 and 2 points per drive
on average, and only about 5% of teams average more than 2.5 points per drive.
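The one-standard-deviation band quoted above follows from the normal distribution's empirical rule; a quick check with Python's standard library (using the report's µ and σ):

```python
# Checking the quoted one-sigma band for PPD (mu = 1.668, sigma = 0.4035)
# against the empirical rule for a normal distribution.
from statistics import NormalDist

mu, sigma = 1.668, 0.4035
lo, hi = mu - sigma, mu + sigma                       # band holding ~68% of teams
within = NormalDist().cdf(1) - NormalDist().cdf(-1)   # mass within 1 sigma
print(round(lo, 2), round(hi, 2), round(within, 3))   # ~1.26, ~2.07, ~0.683
```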
Then we considered whether any transformation of the dataset might be needed. We
used the Pearson correlation matrix and the scatterplot matrix to look for linearity between the
response and the independent variables. Both suggest a linear relationship between the Y and
X variables (see Appendix B). Passing efficiency (pass_eff) and passing first downs (pass_fd)
had strong correlations with the Y variable, at 0.80 and 0.79 respectively, followed by total
number of plays (total_plays) at 0.67. We also noted a potential multicollinearity issue: the pair
of passing efficiency (pass_eff) and passing first downs (pass_fd) had a high correlation of 0.87.
We deferred this issue until after the selection of the final predictors significant to the model.
One important thing to note here is that we tried several interaction terms to see whether
our model could be improved. As discussed during the presentation, even though the interaction
model has the same number of predictors as our selected model and nearly the same adjusted
R², we chose the model that is simpler and supports our hypothesis. The interaction model
combined all nine original predictors, whereas the simpler model has only six single predictors
and gives the same result (see Appendix H). Hence we chose the simpler model without any
interaction terms. It is worth mentioning the interaction of starting position (start_pos) and
pass attempts (pass_att). In the main-effects model, starting position gives an advantage and is
positively associated with points per drive (PPD): the closer you are to the opposition's end
zone, the more chances there are to score. However, the negative association of the interaction
term with PPD suggests that as the offense gets closer to the opposition's end zone, it should
pass less and rush more to score. Beyond this, we found nothing else of interest related to our
hypothesis, so we carried on with our analysis of the simpler model without interaction terms.
At this point, we could either divide the data into training and test sets for model
validation or carry on with the data analysis to reach the final model. We adopted the latter
approach, dividing the data into train/test sets only once we had settled on the model's final
predictors. We used three model selection methods, namely stepwise, adjusted R-square, and
Cp (see Appendix C). All three methods had six predictors in common: time, starting position
(start_pos), passing efficiency (pass_eff), rushing efficiency (rush_eff), passing first downs
(pass_fd), and rushing first downs (rush_fd). Stepwise suggested including the additional
variable total_plays. However, we went with the adjusted R-square method, in which the
six-predictor model had an adjusted R² of 0.873, whereas the complete model (nine predictors)
had 0.8749. Including more than six predictors would therefore overfit the model for negligible
gain.
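The adjusted R² statistic penalizes raw R² for each added predictor; a small sketch of the formula, where the raw R² value used is hypothetical, chosen only to show how a six-predictor fit on n = 510 observations lands near the reported 0.873:

```python
# Adjusted R-squared: penalizes R-squared by predictor count p relative
# to sample size n. The raw R2 passed in below is hypothetical.
def adj_r2(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(round(adj_r2(0.8745, 510, 6), 3))   # near the reported 0.873
```

Because the penalty grows with p, a larger model must raise raw R² by more than the penalty to win on adjusted R², which is why the nine-predictor model barely beats the six-predictor one here.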
Next, we performed a residual analysis. Our selected model satisfied all of the
assumptions. The residuals vs. predicted plot (see Appendix D) showed a random pattern with
values centered around the mean, indicating constant variance and independence, with only a
couple of points as outliers. Similarly, the normal probability plot showed a 45-degree line
pattern, satisfying our normality assumption.
Since we saw a couple of outliers during the residual analysis, we further checked for
outliers and influential points that might affect the model, using studentized residuals and
Cook's D distance. Our cutoffs were |studentized residual| >= 3 and Cook's D >= 4/n. One point
in particular, observation 449, exceeded both criteria (see Appendix E), so we checked whether
it had any significant effect on the model. We removed the observation and reran the complete
model, but the result did not change: the same predictors proved significant, and the adjusted
R² for the six significant predictors increased only from 0.873 to 0.876. We therefore agreed
that the increase was not worth discarding a value from the dataset. As for multicollinearity,
we used the VIF statistic with a cutoff of 10; all values were well below it.
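The screening rule above is easy to restate in code. A hedged sketch that flags observations by the two cutoffs; the diagnostic arrays are assumed to come from the fitted model's output (e.g. SAS PROC REG with the influence and r options):

```python
# Flag observations with |studentized residual| >= 3 AND Cook's D >= 4/n,
# the joint criterion the report applied to observation 449. The
# diagnostic values are assumed to come from the regression output.
def flag_influential(student_resid, cooks_d):
    n = len(student_resid)
    cutoff = 4 / n
    return [i for i, (r, d) in enumerate(zip(student_resid, cooks_d))
            if abs(r) >= 3 and d >= cutoff]
```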
At this point, we were satisfied with our model and selected predictors. We calculated
standardized estimates to determine which predictor has the greatest influence on points per
drive (see Appendix F). Passing first downs (pass_fd) had the strongest influence on PPD with a
value of 0.606, followed by the other passing statistic, passing efficiency (pass_eff). The
rushing statistics, rush_eff and rush_fd, had values of 0.14 and 0.27. This outcome also supports
our hypothesis that passing statistics are much more valuable than rushing statistics when it
comes to scoring points.
It was then time to evaluate the predictive performance of our model by dividing the data
into training and test sets. We used a 75/25 split via the PROC SURVEYSELECT command.
The comparison between the training and test sets can be seen in Appendix G. The RMSE
values of the training and test sets proved very close, at 0.143 and 0.144, and the R² values
were 0.87 and 0.92 respectively. To further check the reliability of our model, we used the
cross-validated R² method, computing |model R² - R²CV|. In our case, we assumed that a good
model will have a value less than or equal to 0.3.
R² (train, per model output) = 0.8786
R² (test) = 0.9297² = 0.8643
|model R² - R²CV| = |0.8786 - 0.8643| = 0.0143
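This check is simple arithmetic; restated in Python (the test-set R² is the squared correlation between observed ppd and predicted yhat from the held-out data):

```python
# Cross-validated R-squared check: compare training R2 with the squared
# test-set correlation between observed and predicted PPD.
r2_train = 0.8786
r2_cv = 0.9297 ** 2          # test-set corr(ppd, yhat), squared
gap = abs(r2_train - r2_cv)  # should be <= 0.3 for a reliable model
print(round(gap, 4))         # ~0.0143, matching the report
```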
The value of 0.0143 substantiates our model's strong predictive performance on unseen
data. Our final model equation is:
Points per Drive = -1.4568 - 0.00337 * time + 0.03602 * start_pos
+ 0.11615 * pass_eff + 0.08610 * rush_eff + 1.087 * pass_fd + 0.877 * rush_fd + e
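The fitted equation can be wrapped in a small function for scoring new team-seasons. The coefficients are the ones reported above; the error term e is dropped for point predictions, and the example inputs in the usage note are hypothetical per-drive averages, not values from the dataset.

```python
# Point prediction from the report's final model equation.
# Inputs are the per-drive averages defined in the Analysis section;
# the error term is omitted for a point prediction.
def predict_ppd(time, start_pos, pass_eff, rush_eff, pass_fd, rush_fd):
    return (-1.4568
            - 0.00337 * time
            + 0.03602 * start_pos
            + 0.11615 * pass_eff
            + 0.08610 * rush_eff
            + 1.087   * pass_fd
            + 0.877   * rush_fd)
```

For instance, a hypothetical stat line of 160 seconds per drive, starting at the 28, 6.5 yards per pass attempt, 4.2 per rush attempt, 1.0 passing first downs and 0.6 rushing first downs per drive predicts about 1.74 points per drive, close to the sample mean of 1.668.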
FUTURE WORK
We were very satisfied with the findings of our regression analysis; however, there are
some things we could look at given more time. One task would be analyzing what predicts first
downs. It stands to reason that an offense with more first downs will score more points, because
it is hitting its "checkpoints" frequently and therefore moving down the field. It would be
interesting to build separate models predicting rushing first downs and passing first downs.
Another idea is comparing different seasons within the data. Sports are always evolving,
and data from the year 2000 may not reflect how American Football is played in 2015. Studies
show the importance of the running back position in the NFL (the player who tends to get the
most rushing attempts) has decreased dramatically over the years. Our hypothesis would be that
recent data would give even more weight to the passing game when predicting PPD.
APPENDIX
APPENDIX A: NORMAL DENSITY CURVE PLOTTED ON TOP OF HISTOGRAM FOR
POINTS PER DRIVE
APPENDIX B: CORRELATION TABLE AND SCATTERPLOT MATRIX
APPENDIX C: MODEL SELECTION
STEPWISE OUTPUT
ADJUSTED R-SQUARE OUTPUT
CP OUTPUT:
APPENDIX D: RESIDUAL ANALYSIS
PREDICTED VS STUDENTIZED
NORMAL PROBABILITY PLOT
APPENDIX E: INFLUENTIAL POINTS AND OUTLIERS
APPENDIX F: STANDARDIZED ESTIMATES
APPENDIX G: DIVIDING THE DATA INTO TRAINING AND TESTING SET TO CHECK
THE PREDICTIVE PERFORMANCE
APPENDIX H: OUTPUT OF MODEL WITH INTERACTION TERMS
SAS CODE
*----- GET DATA FROM EXTERNAL FILE USING "INFILE METHOD" ----;
DATA NFLproject;
INFILE "NFLproject.csv" DELIMITER = ',' FIRSTOBS=2 MISSOVER;
INPUT id $ tname $ seas ppd time start_pos pass_eff rush_eff pass_att
rush_att pass_fd rush_fd total_plays;
RUN;
PROC PRINT;
RUN;
TITLE "Descriptive Statistics";
proc means mean std stderr clm p25 p50 p75;
var ppd time start_pos pass_eff rush_eff pass_att rush_att pass_fd rush_fd
total_plays;
run;
/* creates histogram with normal density plotted on top of histogram;*/
TITLE "Histogram - PPD";
proc univariate normal;
var ppd;
histogram / normal(mu=est sigma=est);
run;
/* Proc correlation */
TITLE "RELATIONSHIP BETWEEN VARIABLES";
proc corr;
var ppd time start_pos pass_eff rush_eff pass_att rush_att pass_fd rush_fd
total_plays;
run;
*Creating scatter plot matrix;
TITLE "RELATIONSHIP BETWEEN VARIABLES";
PROC SGSCATTER;
matrix ppd time start_pos pass_eff rush_eff pass_att rush_att pass_fd rush_fd
total_plays;
run;
*Selecting a model using STEPWISE selection Method;
TITLE "RUNNING A SELECTION METHOD";
PROC REG data = NFLproject;
model ppd = time start_pos pass_eff rush_eff pass_att rush_att pass_fd
rush_fd total_plays / selection = stepwise;
run;
*Selecting a model using ADJRSQ selection Method;
PROC REG data = NFLproject;
model ppd = time start_pos pass_eff rush_eff pass_att rush_att pass_fd
rush_fd total_plays / selection = ADJRSQ;
run;
*Selecting a model using CP selection Method;
PROC REG data = NFLproject;
model ppd = time start_pos pass_eff rush_eff pass_att rush_att pass_fd
rush_fd total_plays / selection = CP;
run;
*Checking for Model Assumptions;
TITLE "RESIDUAL ANALYSIS";
PROC REG data = NFLproject;
model ppd = time start_pos pass_eff rush_eff pass_fd rush_fd;
* studentized residuals vs predicted values;
plot student.*predicted.;
* studentized residuals against each x-var;
plot student.*(time start_pos pass_eff rush_eff pass_fd rush_fd);
* normal probability plot of studentized residuals;
plot npp.*student.;
run;
*Testing for outliers and influential points;
TITLE "Checking for Outliers/Influential Points";
PROC REG;
model ppd = time start_pos pass_eff rush_eff pass_fd rush_fd/ influence r
vif;
run;
*TO REMOVE OBSERVATION 449;
TITLE "Model after removing the influential point";
data NFLproject_new;
* write to a different dataset;
set NFLproject;
*remove 449th observation;
if _n_=449 then delete;
run;
*Checking the complete model with the new dataset w/o influential point;
*Using ADJRSQ selection Method;
PROC REG data = NFLproject_new;
model ppd = time start_pos pass_eff rush_eff pass_att rush_att pass_fd
rush_fd total_plays / selection = ADJRSQ;
run;
*the ADJRSQ only increased by 0.003 with six predictors, which is not worth
discarding any value from the dataset;
*Running the proc reg of the old dataset with the six significant predictors;
TITLE "Checking for the Most Influential Predictor";
Proc reg data = NFLproject;
model ppd = time start_pos pass_eff rush_eff pass_fd rush_fd / stb;
run;
* Get training and testing data;
title "Test and Train Sets for PPD";
proc surveyselect data=NFLproject out=val_NFLproject seed=123456
samprate=0.75 outall;
* outall - show all the data selected (1) and not selected (0) for training;
run;
title "Train Sets for PPD";
data train_NFLproject (where= (Selected = 1));
set val_NFLproject;
run;
proc print data=train_NFLproject;
run;
title "Test Sets for PPD";
data test_NFLproject (where= (Selected = 0));
set val_NFLproject;
run;
proc print data=test_NFLproject;
run;
TITLE "Creating new variable";
data val_NFLproject;
set val_NFLproject;
if selected then new_y=ppd;
run;
proc print data=val_NFLproject;
run;
* Building a model with Train Data;
title "Validation - Train Set";
proc reg data = train_NFLproject;
model ppd = time start_pos pass_eff rush_eff pass_fd rush_fd;
run;
* Compare the two dataset outputs and predictions;
TITLE "COMPARING TRAINING AND TESTING DATASET RESULTS";
proc reg data = val_NFLproject;
model ppd=time start_pos pass_eff rush_eff pass_fd rush_fd;
output out=outm1(where=(new_y=.)) p=yhat;
run;
proc print data=outm1;
run;
title "Difference between Observed and Predicted Test Set";
data outm1_sum;
set outm1;
diff = ppd - yhat;
abs_diff = abs(diff);
run;
proc summary data = outm1_sum;
var diff abs_diff;
output out=outm1_stats std(diff)=rmse mean(abs_diff)=mae;
run;
proc print data=outm1_stats;
title "Validation Stats for Model 1";
run;
proc corr data=outm1;
var ppd yhat;
run;
R CODE
File - SetUp.R
drive = read.csv("~/Downloads/Drive.csv")
df = aggregate(points~tname+seas, data = drive, FUN = "mean")
df$id = paste0(tolower(df$tname), df$seas)
df_yfog = aggregate(yfog~tname+seas, data = drive, FUN = "mean")
df_yfog$id = paste0(tolower(df_yfog$tname), df_yfog$seas)
df_time = aggregate(time~tname+seas, data = drive, FUN = "mean")
df_time$id = paste0(tolower(df_time$tname), df_time$seas)
df_pfd = aggregate(pfd~tname+seas, data = drive, FUN = "mean")
df_pfd$id = paste0(tolower(df_pfd$tname), df_pfd$seas)
df_rfd = aggregate(rfd~tname+seas, data = drive, FUN = "mean")
df_rfd$id = paste0(tolower(df_rfd$tname), df_rfd$seas)
df_pa = aggregate(pa~tname+seas, data = drive, FUN = "mean")
df_pa$id = paste0(tolower(df_pa$tname), df_pa$seas)
df_ra = aggregate(ra~tname+seas, data = drive, FUN = "mean")
df_ra$id = paste0(tolower(df_ra$tname), df_ra$seas)
df_passeff = aggregate(passeff~tname+seas, data = drive, FUN = "mean")
df_passeff$id = paste0(tolower(df_passeff$tname), df_passeff$seas)
df_rusheff = aggregate(rusheff~tname+seas, data = drive, FUN = "mean")
df_rusheff$id = paste0(tolower(df_rusheff$tname), df_rusheff$seas)
df_plays = aggregate(plays~tname+seas, data = drive, FUN = "mean")
df_plays$id = paste0(tolower(df_plays$tname), df_plays$seas)
REFERENCES
"Football Outsiders." Football Outsiders Everything. N.p., n.d. Web. 23 Nov. 2016.
"NFL Team Passing Play Percentage." NFL Football Stats - NFL Team Passing Play Percentage on
TeamRankings.com. N.p., n.d. Web. 23 Nov. 2016.
Burke, Brian. "Why Passing Is More Important Than Running in the N.F.L." The New York Times.
The New York Times, 31 Aug. 2010. Web. 23 Nov. 2016.
"Armchair Analysis.com." NFL Play Data. 697,180 Plays. Daily Updates. Armchair Analysis.com.
N.p., n.d. Web. 23 Nov. 2016.

Más contenido relacionado

Similar a TechnicalReport_NFLProject_Austin&Ovais

10.1.1.735.795.pdf
10.1.1.735.795.pdf10.1.1.735.795.pdf
10.1.1.735.795.pdfresearchict
 
Analysis_of_the_Impact_of_Weather_on_Runs_Scored_in_Baseball_Games_at_Fenway_...
Analysis_of_the_Impact_of_Weather_on_Runs_Scored_in_Baseball_Games_at_Fenway_...Analysis_of_the_Impact_of_Weather_on_Runs_Scored_in_Baseball_Games_at_Fenway_...
Analysis_of_the_Impact_of_Weather_on_Runs_Scored_in_Baseball_Games_at_Fenway_...Steve Cultrera
 
NBA playoff prediction Model.pptx
NBA playoff prediction Model.pptxNBA playoff prediction Model.pptx
NBA playoff prediction Model.pptxrishikeshravi30
 
Effects of Travel Distance on Away Team Win Percentage in the NFL
Effects of Travel Distance on Away Team Win Percentage in the NFLEffects of Travel Distance on Away Team Win Percentage in the NFL
Effects of Travel Distance on Away Team Win Percentage in the NFLKyle Waters
 
Staging - CD Page - Super Content
Staging - CD Page - Super ContentStaging - CD Page - Super Content
Staging - CD Page - Super Contentnikhilawareness
 
http://qa.us/aaaaG9 is a link Multi channel content from new page
http://qa.us/aaaaG9 is a link Multi channel content from new pagehttp://qa.us/aaaaG9 is a link Multi channel content from new page
http://qa.us/aaaaG9 is a link Multi channel content from new pagenikhilawareness
 
Harry Potter 7-2 3D tonight!!! http://4rd.ca/aaaj6w
Harry Potter 7-2 3D tonight!!! http://4rd.ca/aaaj6wHarry Potter 7-2 3D tonight!!! http://4rd.ca/aaaj6w
Harry Potter 7-2 3D tonight!!! http://4rd.ca/aaaj6wnikhilawareness
 
Go to all channels so that I may test your stats tom
Go to all channels so that I may test your stats tomGo to all channels so that I may test your stats tom
Go to all channels so that I may test your stats tomnikhilawareness
 
This is going everywhere
This is going everywhereThis is going everywhere
This is going everywherenikhilawareness
 
All channels minus Awareness channel
All channels minus Awareness channelAll channels minus Awareness channel
All channels minus Awareness channelnikhilawareness
 
INCREASED PREDICTION ACCURACY IN THE GAME OF CRICKETUSING MACHINE LEARNING
INCREASED PREDICTION ACCURACY IN THE GAME OF CRICKETUSING MACHINE LEARNINGINCREASED PREDICTION ACCURACY IN THE GAME OF CRICKETUSING MACHINE LEARNING
INCREASED PREDICTION ACCURACY IN THE GAME OF CRICKETUSING MACHINE LEARNINGIJDKP
 
Joe Kruger Report. OPTIMA
Joe Kruger Report. OPTIMAJoe Kruger Report. OPTIMA
Joe Kruger Report. OPTIMAJoe Kruger
 
Pressure Index in Cricket
Pressure Index in CricketPressure Index in Cricket
Pressure Index in CricketIOSR Journals
 
Supervised sequential pattern mining for identifying important patterns of pl...
Supervised sequential pattern mining for identifying important patterns of pl...Supervised sequential pattern mining for identifying important patterns of pl...
Supervised sequential pattern mining for identifying important patterns of pl...Rory Bunker
 
CLanctot_DSlavin_JMiron_Stats415_Project
CLanctot_DSlavin_JMiron_Stats415_ProjectCLanctot_DSlavin_JMiron_Stats415_Project
CLanctot_DSlavin_JMiron_Stats415_ProjectDimitry Slavin
 

Similar a TechnicalReport_NFLProject_Austin&Ovais (20)

10.1.1.735.795.pdf
10.1.1.735.795.pdf10.1.1.735.795.pdf
10.1.1.735.795.pdf
 
Analysis_of_the_Impact_of_Weather_on_Runs_Scored_in_Baseball_Games_at_Fenway_...
Analysis_of_the_Impact_of_Weather_on_Runs_Scored_in_Baseball_Games_at_Fenway_...Analysis_of_the_Impact_of_Weather_on_Runs_Scored_in_Baseball_Games_at_Fenway_...
Analysis_of_the_Impact_of_Weather_on_Runs_Scored_in_Baseball_Games_at_Fenway_...
 
NBA playoff prediction Model.pptx
NBA playoff prediction Model.pptxNBA playoff prediction Model.pptx
NBA playoff prediction Model.pptx
 
Effects of Travel Distance on Away Team Win Percentage in the NFL
Effects of Travel Distance on Away Team Win Percentage in the NFLEffects of Travel Distance on Away Team Win Percentage in the NFL
Effects of Travel Distance on Away Team Win Percentage in the NFL
 
Content Everywhere
Content EverywhereContent Everywhere
Content Everywhere
 
Staging - CD Page - Super Content
Staging - CD Page - Super ContentStaging - CD Page - Super Content
Staging - CD Page - Super Content
 
http://qa.us/aaaaG9 is a link Multi channel content from new page
http://qa.us/aaaaG9 is a link Multi channel content from new pagehttp://qa.us/aaaaG9 is a link Multi channel content from new page
http://qa.us/aaaaG9 is a link Multi channel content from new page
 
Harry Potter 7-2 3D tonight!!! http://4rd.ca/aaaj6w
Harry Potter 7-2 3D tonight!!! http://4rd.ca/aaaj6wHarry Potter 7-2 3D tonight!!! http://4rd.ca/aaaj6w
Harry Potter 7-2 3D tonight!!! http://4rd.ca/aaaj6w
 
Go to all channels so that I may test your stats tom
Go to all channels so that I may test your stats tomGo to all channels so that I may test your stats tom
Go to all channels so that I may test your stats tom
 
This is going everywhere
This is going everywhereThis is going everywhere
This is going everywhere
 
WC 2011 starts tom
WC 2011 starts tomWC 2011 starts tom
WC 2011 starts tom
 
All channels minus Awareness channel
All channels minus Awareness channelAll channels minus Awareness channel
All channels minus Awareness channel
 
I am omnipresent
I am omnipresentI am omnipresent
I am omnipresent
 
INCREASED PREDICTION ACCURACY IN THE GAME OF CRICKETUSING MACHINE LEARNING
INCREASED PREDICTION ACCURACY IN THE GAME OF CRICKETUSING MACHINE LEARNINGINCREASED PREDICTION ACCURACY IN THE GAME OF CRICKETUSING MACHINE LEARNING
INCREASED PREDICTION ACCURACY IN THE GAME OF CRICKETUSING MACHINE LEARNING
 
Joe Kruger Report
Joe Kruger ReportJoe Kruger Report
Joe Kruger Report
 
Joe Kruger Report. OPTIMA
Joe Kruger Report. OPTIMAJoe Kruger Report. OPTIMA
Joe Kruger Report. OPTIMA
 
Pressure Index in Cricket
Pressure Index in CricketPressure Index in Cricket
Pressure Index in Cricket
 
Supervised sequential pattern mining for identifying important patterns of pl...
Supervised sequential pattern mining for identifying important patterns of pl...Supervised sequential pattern mining for identifying important patterns of pl...
Supervised sequential pattern mining for identifying important patterns of pl...
 
IRJET-V8I11270.pdf
IRJET-V8I11270.pdfIRJET-V8I11270.pdf
IRJET-V8I11270.pdf
 
CLanctot_DSlavin_JMiron_Stats415_Project
CLanctot_DSlavin_JMiron_Stats415_ProjectCLanctot_DSlavin_JMiron_Stats415_Project
CLanctot_DSlavin_JMiron_Stats415_Project
 

TechnicalReport_NFLProject_Austin&Ovais

  • 1. Austin, Siddiqui Analyzing NFL Drive Success Austin Grosel, Ovais Siddiqui Table of Contents 1. ABSTRACT ---------------------------------------------------------------------------------------------- pg 2 2. INTRODUCTION ----------------------------------------------------------------------------------------pg 2 3. METHODOLOGY --------------------------------------------------------------------------------------pg 3 4. ANALYSIS ----------------------------------------------------------------------------------------------- pg 4 5. FUTURE WORK---------------------------------------------------------------------------------------- pg 7 6. APPENDIX------------------------------------------------------------------------------------------------ pg 8 7. SAS CODE----------------------------------------------------------------------------------------------pg 15 8. R CODE--------------------------------------------------------------------------------------------------pg 18 9. REFERENCES-----------------------------------------------------------------------------------------pg 19
  • 2. Austin, Siddiqui ABSTRACT Sports has seen an analytical revolution in the past couple of years. Many franchises in the sports world are turning to data analysis for decision making. The goal of this project is to look at American Football data from the NFL and try to determine the characteristics of an efficient team offense. The research data had every team’s drive statistics for the past sixteen years, with the dependent variable as total points per drive (PPD). Using exploratory data analysis along with multiple linear regression, two models were developed predicting PPD, one with interaction terms and one without them. The model without the interaction terms seemed to make more sense from a business perspective, and significant predictors of PPD turned out to be related to the passing plays on offense. By understanding the important factors in this PPD model, NFL front offices and coaches may want to invest more capital in developing a premier passing offense. INTRODUCTION Since sports have been around, franchises have tried to find ways to gain a competitive advantage. Around the start of the 2000’s, a few professional baseball teams started using mathematics, statistics, and data analysis over the traditional scouting techniques. This sparked a best-selling novel Moneyball by Michael Lewis to follow the Oakland Athletics 2002 season as they transitioned to using numbers over film. Because of Oakland and other MLB team’s success in their adoption of statistics, other teams followed, and soon, other sports followed. Out of the top 3 American sports to adopt analytics, American Football, specifically the NFL, has seen the slowest transition to analytics. American Football is different than other sports: there are 22 total players on the field, coaches have a much stronger influence on the game than baseball or basketball, and there are so many styles of play that it’s extremely hard to quantify how important some “box score” statistics actually are. 
However, statisticians, fans, and writers have recently tried to make sense of this unique game. The advanced analytics site Football Outsiders has put together NFL drive statistics (Football Outsiders), which examine each team's drives throughout the season. A drive in the NFL is a series of plays run by an offense with the objective of scoring points, and each drive has a starting field position. A drive can end in a score (six points for a touchdown, three for a field goal), in the ball being turned over to the other team, or with the game clock expiring. An offense can run two types of plays on a drive: a rushing play, which usually happens when the quarterback hands the ball to a teammate or runs it himself, or a
passing play, which happens when a player (usually the quarterback) throws the ball to another player. Conventional wisdom holds that rushing plays gain fewer yards on average but are a more conservative call than passing plays. Teams have generally passed a little more than 50% of the time (NFL Team Rankings), though this may be influenced by the fact that the clock stops on an incomplete pass (when a pass is thrown but not caught), so passing plays are favored when a team needs to move down the field in a short amount of time. Data has also shown that passing the ball well matters more to winning games than rushing the ball well (Burke).

The data used in this project contains every team's drive statistics for the past sixteen years. Our goal is to develop a multiple linear regression model that predicts a team's points per drive (PPD) from other drive statistics such as passing stats, rushing stats, starting field position, time of drive, and plays per drive. Our hypothesis, based on this research, is that passing statistics will be much more important to the model than rushing statistics, that starting field position will be a significant factor (the further down the field a drive starts, the better the chance of scoring), and that more time on a drive results in more points. We feel these results can help teams invest their resources effectively, whether in the passing game or the rushing game.

METHODOLOGY

We obtained the data by subscribing to the Armchair Analysis NFL Database (Armchair). The database is distributed as a zip file of CSV files covering everything NFL related: team and player data, schedules, history, and more. We used the Drive.csv file, which contains data on every single drive for the past sixteen years.
The columns of this dataset were the starting field position, the total time spent on the drive in seconds, the number of passing plays and passing yards, the number of rushing plays and rushing yards, the number of rushing and passing first downs on the drive, the number of plays on the drive, and the result of the drive.

In the pre-processing step, we first created a new column labeled "Points". If the result was TD (touchdown), we recorded 6 points; if FG (field goal), 3 points. Drives whose result was ENDQ (end of quarter) were removed from the dataset, because end-of-quarter drives usually occur when a team decides to run out the clock, so the objective was never to score in the first place. All other results were assigned 0 points. We then created new fields for passing efficiency and rushing efficiency, calculated by dividing passing yards by passing attempts and rushing yards by rushing attempts.
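The pre-processing steps above (TD → 6, FG → 3, ENDQ dropped, everything else → 0, plus the two efficiency ratios) can be sketched in pandas. This is an illustrative re-expression, not the paper's actual code, and the column names (result, pass_yds, pass_att, etc.) are assumptions rather than the Armchair Analysis schema:

```python
import pandas as pd

def preprocess_drives(drives: pd.DataFrame) -> pd.DataFrame:
    """Translate drive results into points and derive efficiency columns."""
    # Drop end-of-quarter drives: the offense was running out the clock,
    # so its objective was never to score.
    out = drives[drives["result"] != "ENDQ"].copy()
    # TD -> 6 points, FG -> 3 points, any other result -> 0 points.
    out["points"] = out["result"].map({"TD": 6, "FG": 3}).fillna(0).astype(int)
    # Efficiency = yards per attempt; leave NaN when there were no attempts.
    out["pass_eff"] = out["pass_yds"] / out["pass_att"].where(out["pass_att"] > 0)
    out["rush_eff"] = out["rush_yds"] / out["rush_att"].where(out["rush_att"] > 0)
    return out
```

The `.where(... > 0)` guard avoids dividing by zero on drives with no attempts of one type.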
To further remove bias, we also removed all fourth-quarter drives. A team trailing in the fourth quarter will pass much more than run because it wants to score quickly, so fourth-quarter drives would skew toward more passing attempts than a typical drive; those drives were removed entirely. The next step was to aggregate the remaining drives by team and season; these operations are in the SetUp.R script, written in R. This left 510 team-season drive records covering every team over the past sixteen years. The variable we chose as our dependent variable was points per drive, or PPD, which gives good insight into how efficient a team's offense was in a particular season.

Our approach started with an exploratory analysis, which included plotting descriptive statistics and examining correlations. When no serious multicollinearity was found, we divided the data into training and test sets. We used the training set to build our first model via the adjusted R-squared selection method, performed a residual analysis on the model, and checked whether interaction terms would increase the adjusted R-squared. Afterwards, we used the test set to see which model was more successful at prediction. Finally, after weighing all options (model accuracy, simplicity from a business perspective, etc.), we settled on the model we felt was best.
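The team-season aggregation step can be expressed with a single groupby. This is a pandas sketch of the same averaging the SetUp.R script performs with repeated aggregate() calls; the tname/seas column names follow the R code, while the rest are illustrative:

```python
import pandas as pd

def team_season_stats(drives: pd.DataFrame) -> pd.DataFrame:
    """Average every drive-level column within each team-season,
    producing one row per (tname, seas) pair with ppd as the response."""
    cols = ["points", "time", "start_pos", "pass_eff", "rush_eff",
            "pass_att", "rush_att", "pass_fd", "rush_fd", "total_plays"]
    per_season = drives.groupby(["tname", "seas"], as_index=False)[cols].mean()
    # Average points per drive within a team-season is exactly PPD.
    return per_season.rename(columns={"points": "ppd"})
```

Because PPD is the mean of per-drive points within a team-season, averaging the points column directly yields the dependent variable.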
ANALYSIS

After the pre-processing phase, our dataset consists of nine predictors:

time = total time of the drive (in seconds)
start_pos = starting field position
pass_eff = passing efficiency (pass yards / pass attempts)
rush_eff = rushing efficiency (rush yards / rush attempts)
pass_att = passing attempts per drive
rush_att = rushing attempts per drive
pass_fd = first downs from passing plays per drive
rush_fd = first downs from rushing plays per drive
total_plays = plays per drive

To check the distribution of our dependent variable, we created a histogram with a normal density curve plotted over it (see Appendix A). The graph shows a normal distribution with most values concentrated around the mean of µ = 1.668, with a standard deviation of σ = 0.4035. This indicates that about 68% of teams score between roughly 1.26 and 2.07
points per drive on average, and only about 2% of teams average more than 2.5 points per drive.

We then considered whether any transformation of the dataset was needed. We used the Pearson correlation matrix and the scatterplot matrix to look for linearity between the response and the independent variables; both suggest a linear relationship between the Y and X variables (see Appendix B). Passing efficiency (pass_eff) and passing first downs (pass_fd) had the strongest correlations with the response, 0.80 and 0.79 respectively, followed by total plays (total_plays) at 0.67. We also noted a potential multicollinearity issue: passing efficiency (pass_eff) and passing first downs (pass_fd) had a high correlation of 0.87. We deferred this issue until after selecting the final predictors for the model.

One thing worth noting is that we tried several interaction terms to see whether the model could be improved. As discussed during the presentation, even though the interaction model had the same number of predictors as our selected model and an almost identical adjusted R-squared, we chose the model that is simpler and supports our hypothesis. The interaction model combined all nine original predictors, whereas the simpler model uses only six single predictors and gives the same result (see Appendix H). Hence we chose the simpler model without interaction terms. The interaction of starting position (start_pos) and passing attempts (pass_att) is worth mentioning. In the main-effects model, starting position gives an advantage and is positively associated with points per drive (PPD): the closer you are to the opposition's end zone, the more chances there are to score.
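The distribution figures quoted above (the share of team-seasons within one standard deviation of µ = 1.668, and the share above 2.5 PPD) follow directly from the fitted normal curve. A quick standard-library check, using only the µ and σ reported in the paper:

```python
from math import erf, sqrt

def normal_cdf(x: float, mu: float, sigma: float) -> float:
    """Normal CDF evaluated via the error function."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

# Sample mean and standard deviation of PPD reported in the paper.
mu, sigma = 1.668, 0.4035

# Share of team-seasons within one SD of the mean (the "68%" figure).
within_1sd = normal_cdf(mu + sigma, mu, sigma) - normal_cdf(mu - sigma, mu, sigma)

# Share of team-seasons above 2.5 PPD under the fitted normal (about 2%).
above_2_5 = 1.0 - normal_cdf(2.5, mu, sigma)
```

Under the fitted curve, 2.5 PPD sits a little over two standard deviations above the mean, so only about 2% of team-seasons exceed it.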
However, the negative association of this interaction term with PPD suggests that as the offensive team gets closer to the opposition's end zone, it should pass less and rush more to score. Beyond this, we did not find anything else bearing on our hypothesis, so we carried on with our analysis using the simpler model without interaction terms.

At this point, we could either divide the data into training and testing sets for model validation or continue the analysis to reach a final model. We chose the latter approach, deferring the train/test split until we had settled on the final predictors. We used three model selection methods: stepwise, adjusted R-squared, and Mallows' Cp (see Appendix C). All three methods shared six predictors: time, starting position (start_pos), passing efficiency (pass_eff), rushing efficiency (rush_eff), passing first downs (pass_fd), and rushing first downs (rush_fd). Stepwise suggested additionally including total_plays. However, we went with the adjusted R-squared method, in which the model with six predictors had Adj R2 =
0.873, whereas the complete nine-predictor model had Adj R2 = 0.8749; adding predictors beyond six therefore adds complexity without a meaningful gain.

Next, we performed a residual analysis. Our selected model satisfied all of the assumptions. The residuals vs. predicted plot (see Appendix D) showed a random pattern with values centered around the mean, indicating constant variance and independence, with only a couple of outlying points. Similarly, the normal probability plot followed a 45-degree line, satisfying the normality assumption.

Since we saw a couple of outliers in the residual analysis, we further checked for outliers and influential points that might affect the model. We used studentized residuals and Cook's D distance, with cutoffs of |studentized residual| >= 3 and Cook's D >= 4/n. One point, observation 449, exceeded both criteria (see Appendix E). We therefore checked whether it had a significant effect on the model by removing it and refitting the complete model: the same predictors proved significant, and the Adj R2 for the six significant predictors increased from 0.873 to 0.876. We agreed that this small increase was not worth discarding an observation from the dataset.

As for multicollinearity, we used the VIF statistic with a cutoff of 10; all values were well below it. Satisfied with the model and its predictors, we then calculated standardized estimates to analyze which predictor has the greatest influence on points per drive (see Appendix F).
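The VIF check above can be reproduced outside SAS. This is a NumPy sketch of the standard definition, VIF_j = 1 / (1 − R²_j), where R²_j comes from regressing predictor j on the remaining predictors (with an intercept); it is an illustration of the statistic, not the paper's code:

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """Variance inflation factor for each column of an (n x p) matrix X."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the other columns plus an intercept.
        A = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        ss_res = resid @ resid
        ss_tot = ((y - y.mean()) ** 2).sum()
        r2 = 1.0 - ss_res / ss_tot
        out[j] = 1.0 / (1.0 - r2)
    return out
```

Independent predictors give VIFs near 1; a VIF above the paper's cutoff of 10 signals that a predictor is largely explained by the others.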
Passing first downs (pass_fd) had the strongest influence on PPD, with a standardized estimate of 0.606, followed by the other passing statistic, passing efficiency (pass_eff). The rushing statistics rush_eff and rush_fd had values of 0.14 and 0.27. This outcome also supports our hypothesis that passing statistics are much more valuable than rushing statistics when it comes to scoring points.

We then assessed the predictive performance of the model by dividing the data into training and testing sets with a 75/25 split using PROC SURVEYSELECT. The comparison between the two sets appears in Appendix G. The RMSE values of the training and testing sets were very close, 0.143 and 0.144, and the R2 values were 0.87 and 0.92 respectively. To further check the reliability of the model, we used the cross-validated R2 criterion, computing |model R2 − R2_CV|; we assumed a good model would have a value of at most 0.3.
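The held-out comparison boils down to two summary statistics per set. A minimal sketch of that computation (note the SAS listing uses std(diff) as its "rmse", which matches root-mean-square error only when the mean residual is near zero):

```python
import numpy as np

def validation_stats(y_true, y_pred):
    """Root-mean-square error and mean absolute error on a held-out set."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    rmse = float(np.sqrt(np.mean(diff ** 2)))   # penalizes large misses
    mae = float(np.mean(np.abs(diff)))          # average size of a miss
    return rmse, mae
```

Comparable training and testing RMSE, as reported here (0.143 vs. 0.144), is the signal that the model is not overfitting.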
R2 (train, per model output) = 0.8786
R2 (test) = 0.9297^2 = 0.8643
|model R2 - R2_CV| = 0.0143

The value of 0.0143 substantiates the model's strong predictive performance on unseen data. So our final model equation is:

Points per Drive = -1.4568 - 0.00337 * time + 0.03602 * start_pos + 0.11615 * pass_eff + 0.08610 * rush_eff + 1.087 * pass_fd + 0.877 * rush_fd + e

FUTURE WORK

We were very satisfied with the findings of our regression analysis; however, there are some things we would look at given more time. One task would be analyzing what predicts first downs. It stands to reason that an offense with more first downs scores more points, because it is hitting its "checkpoints" abundantly and therefore moving down the field. It would be interesting to build separate models predicting rushing first downs and passing first downs.

Another idea is comparing the different seasons in the data. Sports are always evolving, and data from the year 2000 may not reflect how American Football was played in 2015. Studies show the importance of the running back position in the NFL (the player who tends to get the most rushing attempts) has decreased dramatically over the years. Our hypothesis is that recent data would give even more weight to the passing game when predicting PPD.
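To make the final equation from the Analysis section concrete, it can be wrapped in a small scoring function (coefficients copied from the fitted model; the function name is ours). The same snippet verifies the |model R2 − R2_CV| arithmetic reported there:

```python
def predict_ppd(time, start_pos, pass_eff, rush_eff, pass_fd, rush_fd):
    """Expected points per drive from the paper's six-predictor model."""
    return (-1.4568 - 0.00337 * time + 0.03602 * start_pos
            + 0.11615 * pass_eff + 0.08610 * rush_eff
            + 1.087 * pass_fd + 0.877 * rush_fd)

# Cross-validation check: R2_CV is the squared correlation between
# observed PPD and predictions on the test set.
r2_train = 0.8786
r_test = 0.9297
gap = abs(r2_train - r_test ** 2)  # should be about 0.0143
```

Plugging in plausible team-season averages yields a PPD near the league mean of 1.67, which is a quick sanity check on the coefficients.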
APPENDIX

APPENDIX A: NORMAL DENSITY CURVE PLOTTED ON TOP OF HISTOGRAM FOR POINTS PER DRIVE

APPENDIX B: CORRELATION TABLE AND SCATTERPLOT MATRIX
APPENDIX C: MODEL SELECTION STEPWISE OUTPUT
APPENDIX D: RESIDUAL ANALYSIS (PREDICTED VS STUDENTIZED RESIDUALS; NORMAL PROBABILITY PLOT)
APPENDIX E: INFLUENTIAL POINTS AND OUTLIERS

APPENDIX F: STANDARDIZED ESTIMATES
APPENDIX G: DIVIDING THE DATA INTO TRAINING AND TESTING SETS TO CHECK PREDICTIVE PERFORMANCE
APPENDIX H: OUTPUT OF MODEL WITH INTERACTION TERMS
SAS CODE

*----- GET DATA FROM EXTERNAL FILE USING "INFILE" METHOD -----;
DATA NFLproject;
INFILE "NFLproject.csv" DELIMITER = ',' FIRSTOBS=2 MISSOVER;
INPUT id $ tname $ seas ppd time start_pos pass_eff rush_eff pass_att rush_att pass_fd rush_fd total_plays;
RUN;

PROC PRINT;
RUN;

TITLE "Descriptive Statistics";
proc means mean std stderr clm p25 p50 p75;
var ppd time start_pos pass_eff rush_eff pass_att rush_att pass_fd rush_fd total_plays;
run;

/* Histogram with normal density plotted on top */
TITLE "Histogram - PPD";
proc univariate normal;
var ppd;
histogram / normal(mu=est sigma=est);
run;

/* Correlations */
TITLE "RELATIONSHIP BETWEEN VARIABLES";
proc corr;
var ppd time start_pos pass_eff rush_eff pass_att rush_att pass_fd rush_fd total_plays;
run;

* Scatterplot matrix;
TITLE "RELATIONSHIP BETWEEN VARIABLES";
PROC SGSCATTER;
matrix ppd time start_pos pass_eff rush_eff pass_att rush_att pass_fd rush_fd total_plays;
run;

* Model selection using the STEPWISE method;
TITLE "RUNNING A SELECTION METHOD";
PROC REG data = NFLproject;
model ppd = time start_pos pass_eff rush_eff pass_att rush_att pass_fd rush_fd total_plays / selection = stepwise;
run;

* Model selection using the ADJRSQ method;
PROC REG data = NFLproject;
model ppd = time start_pos pass_eff rush_eff pass_att rush_att pass_fd rush_fd total_plays / selection = ADJRSQ;
run;
* Model selection using the CP method;
PROC REG data = NFLproject;
model ppd = time start_pos pass_eff rush_eff pass_att rush_att pass_fd rush_fd total_plays / selection = CP;
run;

* Checking model assumptions with residual plots;
TITLE "RESIDUAL ANALYSIS";
PROC REG data = NFLproject;
model ppd = time start_pos pass_eff rush_eff pass_fd rush_fd;
* studentized residuals vs predicted values;
plot student.*predicted.;
* studentized residuals against every x-variable;
plot student.*(time start_pos pass_eff rush_eff pass_fd rush_fd);
* normal probability plot of studentized residuals;
plot npp.*student.;
run;

* Testing for outliers and influential points;
TITLE "Checking for Outliers/Influential Points";
PROC REG;
model ppd = time start_pos pass_eff rush_eff pass_fd rush_fd / influence r vif;
run;

* Remove observation 449;
TITLE "Model after removing the influential point";
data NFLproject_new; * write to a different dataset;
set NFLproject;
if _n_ = 449 then delete; * drop the 449th observation;
run;

* Re-check the complete model on the dataset without the influential point (ADJRSQ selection);
PROC REG data = NFLproject_new;
model ppd = time start_pos pass_eff rush_eff pass_att rush_att pass_fd rush_fd total_plays / selection = ADJRSQ;
run;
* The ADJRSQ only increased by 0.003 with six predictors, which is not worth discarding an observation;

* Refit on the full dataset with the six significant predictors, requesting standardized estimates;
TITLE "Checking for the Most Influential Predictor";
Proc reg data = NFLproject;
model ppd = time start_pos pass_eff rush_eff pass_fd rush_fd / stb;
run;

* Get training and testing data;
title "Test and Train Sets for PPD";
proc surveyselect data=NFLproject out=val_NFLproject seed=123456 samprate=0.75 outall;
* outall - keep all rows, flagged selected (1) for training and not selected (0) for testing;
  • 17. Austin, Siddiqui run; title "Train Sets for PPD"; data train_NFLproject (where= (Selected = 1)); set val_NFLproject; run; proc print data=train_NFLproject; run; title "Test Sets for PPD"; data test_NFLproject (where= (Selected = 0)); set val_NFLproject; run; proc print data=test_NFLproject; run; TITLE "Creating new variable"; data val_NFLproject; set val_NFLproject; if selected then new_y=ppd; run; proc print data=val_NFLproject; run; * Building a model with Train Data; title "Validation - Train Set"; proc reg data = train_NFLproject; model ppd = time start_pos pass_eff rush_eff pass_fd rush_fd; run; * Compare the two dataset outputs and predictions; TITLE "COMPARING TRAINING AND TESTING DATASET RESULTS"; proc reg data = val_drives; model ppd=time start_pos pass_eff rush_eff pass_fd rush_fd; output out=outm1(where=(new_y=.)) p=yhat; run; proc print data=outm1; run; title "Difference between Observed and Predicted Test Set"; data outm1_sum; set outm1; diff = ppd - yhat; abs_diff = abs(diff); run; proc summary data = outm1_sum; var diff abs_diff; output out=outm1_stats std(diff)=rmse mean(abs_diff)=mae; run; proc print data=outm1_stats; title "Validation Stats for Model 1"; run; proc corr data=outm1; var ppd yhat; run;
R CODE

File - SetUp.R

drive = read.csv("~/Downloads/Drive.csv")

# One row per team-season, averaging each drive-level column;
# the id column (team name + season) is used to join the pieces later
df = aggregate(points ~ tname + seas, data = drive, FUN = "mean")
df$id = paste0(tolower(df$tname), df$seas)

df_yfog = aggregate(yfog ~ tname + seas, data = drive, FUN = "mean")
df_yfog$id = paste0(tolower(df_yfog$tname), df_yfog$seas)

df_time = aggregate(time ~ tname + seas, data = drive, FUN = "mean")
df_time$id = paste0(tolower(df_time$tname), df_time$seas)

df_pfd = aggregate(pfd ~ tname + seas, data = drive, FUN = "mean")
df_pfd$id = paste0(tolower(df_pfd$tname), df_pfd$seas)

df_rfd = aggregate(rfd ~ tname + seas, data = drive, FUN = "mean")
df_rfd$id = paste0(tolower(df_rfd$tname), df_rfd$seas)

df_pa = aggregate(pa ~ tname + seas, data = drive, FUN = "mean")
df_pa$id = paste0(tolower(df_pa$tname), df_pa$seas)

df_ra = aggregate(ra ~ tname + seas, data = drive, FUN = "mean")
df_ra$id = paste0(tolower(df_ra$tname), df_ra$seas)

df_passeff = aggregate(passeff ~ tname + seas, data = drive, FUN = "mean")
df_passeff$id = paste0(tolower(df_passeff$tname), df_passeff$seas)

df_rusheff = aggregate(rusheff ~ tname + seas, data = drive, FUN = "mean")
df_rusheff$id = paste0(tolower(df_rusheff$tname), df_rusheff$seas)

df_plays = aggregate(plays ~ tname + seas, data = drive, FUN = "mean")
df_plays$id = paste0(tolower(df_plays$tname), df_plays$seas)
REFERENCES

"Football Outsiders." Football Outsiders. Web. 23 Nov. 2016.

"NFL Team Passing Play Percentage." TeamRankings.com. Web. 23 Nov. 2016.

Burke, Brian. "Why Passing Is More Important Than Running in the N.F.L." The New York Times, 31 Aug. 2010. Web. 23 Nov. 2016.

"Armchair Analysis." NFL Play Data. ArmchairAnalysis.com. Web. 23 Nov. 2016.