This presentation analyzes play-by-play data from NFL games from 2002 to 2012, released by Advanced NFL Stats. The dataset contains over 470,000 individual play entries. The analysis uses tools such as MapReduce and Hive and examines how factors such as weather, stadiums, and player arrests affect games. Example analyses include the percentage of each play type in varying game situations and weather conditions.
Doing Data Science on the NFL Play by Play Dataset
1. 1
Doing Data Science on the NFL Play by Play Dataset
Jesse Anderson | Curriculum Developer and Instructor
July 2013 v2
9. From the Data: Fourth Downs
15% of 4th down plays weren't kicks
10. Play by Play Pieces
(2:48) C.Kaepernick pass short right to M.Crabtree to SF 25 for 1 yard (C.Tillman). Caught at SF 25. 0-yds YAC
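One way to break a description like this into queryable pieces is a regular expression. The sketch below is an assumption for illustration, not the parser from the talk's repository: the pattern and the field names (`passer`, `receiver`, `yac`, and so on) are hypothetical and are tailored only to the single example on this slide.

```python
import re

# Hypothetical pattern for one completed-pass description.
# It is tailored to the slide's example, not to the full dataset.
PASS_PLAY = re.compile(
    r"\((?P<clock>\d{1,2}:\d{2})\)\s+"            # game clock, e.g. (2:48)
    r"(?P<passer>[A-Z]\.[A-Za-z'-]+)\s+pass\s+"    # passer, e.g. C.Kaepernick
    r"(?P<depth>short|deep)\s+(?P<direction>left|middle|right)\s+to\s+"
    r"(?P<receiver>[A-Z]\.[A-Za-z'-]+)\s+to\s+"    # receiver, e.g. M.Crabtree
    r"(?P<end_team>[A-Z]{2,3})\s+(?P<end_yardline>\d{1,2})\s+"
    r"for\s+(?P<yards>-?\d+)\s+yards?\s+"
    r"\((?P<tackler>[^)]+)\)\."                    # tackler in parentheses
    r"(?:.*?(?P<yac>-?\d+)-yds\s+YAC)?"            # optional yards after catch
)


def parse_pass_play(description):
    """Return a dict of fields for a pass-play description, or None if no match."""
    match = PASS_PLAY.search(description)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the clock to seconds remaining in the quarter.
    minutes, seconds = fields.pop("clock").split(":")
    fields["seconds_left"] = int(minutes) * 60 + int(seconds)
    # Numeric fields arrive as strings; yac may be absent (None).
    for key in ("end_yardline", "yards", "yac"):
        if fields[key] is not None:
            fields[key] = int(fields[key])
    return fields


play = parse_pass_play(
    "(2:48) C.Kaepernick pass short right to M.Crabtree to SF 25 "
    "for 1 yard (C.Tillman). Caught at SF 25. 0-yds YAC"
)
```

Real descriptions vary far more than this (penalties, fumbles, laterals), so a production parser would likely need many patterns or a different approach.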
11. From the Data: Sacks
QB sacks and scrambles double on 3rd downs
12. Hive
• Abstraction on top of MapReduce
• Allows queries using a SQL-like language
13. Hive Query
Give me every run by New Orleans in the 2010 season:
SELECT * FROM playbyplay
WHERE playtype = "RUN"
  AND year = 2010
  AND game LIKE "%NO%";
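To see what the filter returns, the same conditions can be tried on a tiny in-memory table. The sketch below uses Python's `sqlite3` as a stand-in for Hive, with single-quoted string literals. The game identifier format, the `offense` column, and the sample rows are assumptions for illustration, not the talk's actual schema or data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified schema; the real table has many more columns.
conn.execute(
    "CREATE TABLE playbyplay (game TEXT, year INTEGER, offense TEXT, playtype TEXT)"
)
# Illustrative rows only, not taken from the dataset.
conn.executemany(
    "INSERT INTO playbyplay VALUES (?, ?, ?, ?)",
    [
        ("20100909_MIN@NO", 2010, "NO", "RUN"),
        ("20100909_MIN@NO", 2010, "MIN", "RUN"),
        ("20100909_MIN@NO", 2010, "NO", "PASS"),
        ("20090913_DET@NO", 2009, "NO", "RUN"),
        ("20100912_GB@PHI", 2010, "GB", "RUN"),
    ],
)

# Same conditions as the query on the slide.
slide_query = (
    "SELECT game, offense FROM playbyplay "
    "WHERE playtype = 'RUN' AND year = 2010 AND game LIKE '%NO%'"
)
runs_in_no_games = sorted(conn.execute(slide_query).fetchall())

# Adding the assumed offense column keeps only New Orleans' own runs.
runs_by_no = conn.execute(slide_query + " AND offense = 'NO'").fetchall()
```

If the game identifier names both teams, as assumed here, the `LIKE` pattern alone also matches runs by the opponent in New Orleans games. A filter on the team with the ball would narrow the result.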
14. From the Data: Yards to Go
With 1 yard to go, 65% of plays are runs
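A figure like this comes from counting play types within each yards-to-go group and dividing by the group total. The sketch below shows that arithmetic in plain Python. The function name and the sample pairs are hypothetical, and the talk itself used Hive and MapReduce rather than this code.

```python
from collections import Counter, defaultdict


def play_type_shares(plays):
    """Map yards-to-go -> {play type: share of plays} from (togo, playtype) pairs."""
    counts = defaultdict(Counter)
    for togo, playtype in plays:
        counts[togo][playtype] += 1
    shares = {}
    for togo, by_type in counts.items():
        total = sum(by_type.values())
        shares[togo] = {ptype: n / total for ptype, n in by_type.items()}
    return shares


# Made-up sample, only to exercise the function.
sample = [(1, "RUN"), (1, "RUN"), (1, "PASS"), (1, "RUN"), (10, "PASS"), (10, "RUN")]
shares = play_type_shares(sample)
```

The same grouping works for any situation column, such as the down or a wind-speed bucket, by swapping the first element of each pair.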
20. From the Data: Stadium Attendance
Stadiums with the smallest capacities average the best scores, 20.55-17.79
21. Stadium Data
Stadium: The capacity of the stadium
Expanded Capacity: The expanded capacity of the stadium
Location: The location of the stadium
Playing Surface: The type of grass, etc. that the stadium has
Is Artificial: Is the playing surface artificial
Team: The name of the team that plays at the stadium
Roof Type: The type of roof in the stadium (None, Retractable, Dome)
Elevation: The elevation of the stadium
22. From the Data: Stadium Elevation
There is a 1% increase in passes at Mile High versus sea-level stadiums
24. From the Data: Fumble
Games with weather have a fumble 93% of the time, compared to 56% without
25. Weather Data
STATION: Station identifier
STATION NAME: Station location name
READING DATE: Date of reading
PRCP: Precipitation
AWND: Average daily wind speed
WV20: Fog, ice fog, or freezing fog (may include heavy fog)
TMAX: Maximum temperature
TMIN: Minimum temperature
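The editor's notes mention that there is no direct key between a stadium and a weather station, so some bridge is needed before a play can be matched to a reading. One possible approach is a hand-built lookup from home team to station, then a join on the reading date. Everything in the sketch below is an assumption: the station identifiers, the reading values, and the function name are placeholders, and the talk may have linked the data differently.

```python
from datetime import date

# Hypothetical hand-built bridge: home team -> weather station identifier.
STATION_FOR_TEAM = {"BAL": "STATION_A", "MIA": "STATION_B"}

# Placeholder daily readings keyed by (STATION, READING DATE).
readings = {
    ("STATION_A", date(2010, 12, 5)): {
        "PRCP": 12.0,
        "AWND": 6.5,
        "TMAX": 4.0,
        "TMIN": -1.0,
    },
}


def weather_for_game(home_team, game_date):
    """Return the reading for a game's home stadium and date, or None if missing."""
    station = STATION_FOR_TEAM.get(home_team)
    if station is None:
        return None  # no station mapped for this team
    return readings.get((station, game_date))  # None when no reading exists


found = weather_for_game("BAL", date(2010, 12, 5))
missing_reading = weather_for_game("MIA", date(2010, 12, 5))
missing_station = weather_for_game("SF", date(2010, 12, 5))
```

Returning `None` rather than raising keeps plays without weather data in the result, where they can be counted and reported.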
26. From the Data: Home Field Advantage
Baltimore has the biggest weather advantage, 22-14
28. Arrest Data
Season: Season the player was arrested in (February to February)
Team: Team the player played on
Player: Name of the player arrested
Player Arrested: Whether a player in the play was arrested that season
Offense Player Arrested: Whether the offense had a player arrested that season
Defense Player Arrested: Whether the defense had a player arrested that season
Home Team Player Arrested: Whether the home team had a player arrested that season
Away Team Player Arrested: Whether the away team had a player arrested that season
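The team-level columns above can be derived from a list of arrests once each arrest is assigned to a season. The sketch below assumes that "February to February" means February 1 through January 31 of the next year, and it uses placeholder team codes. Both function names are hypothetical and are not taken from the talk's code.

```python
from datetime import date


def arrest_season(arrest_date):
    """Season label for an arrest, assuming seasons run Feb 1 through Jan 31."""
    if arrest_date.month >= 2:
        return arrest_date.year
    return arrest_date.year - 1  # January belongs to the previous season


def arrest_flags(arrests, season, offense, defense, home, away):
    """Build the per-play team flags from (team, arrest_date) pairs."""
    teams = {team for team, when in arrests if arrest_season(when) == season}
    return {
        "Offense Player Arrested": offense in teams,
        "Defense Player Arrested": defense in teams,
        "Home Team Player Arrested": home in teams,
        "Away Team Player Arrested": away in teams,
    }


# Placeholder team codes and dates, not real arrest records.
arrests = [("AAA", date(2011, 1, 15)), ("BBB", date(2011, 3, 2))]
flags = arrest_flags(
    arrests, 2010, offense="AAA", defense="BBB", home="BBB", away="AAA"
)
```

The per-play "Player Arrested" column would need player names matched against the play description, which this sketch leaves out.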
29. Arrest = Win?
Whenever there are arrests either in the home team, away team or both, the home team WINS
From 2002 to 2012, each team had many arrests. From a low in 2002 of 56% to a high of ...
32. The Low Downs
• /me - http://www.jesse-anderson.com
• @jessetanderson
• Code - https://github.com/eljefe6a/nfldata
*I am not in any way affiliated with the NFL or any Team
34. From the Data: Weather
Wind had the most effect on games
At calm winds, 41% pass and 37% run
At >30 MPH, 34% pass and 46% run
35. From the Data: Field Goals
Weather only increases misses by 1%
14% of field goals are missed
21% of field goals are missed in 30-39 MPH average winds
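A breakdown like the 30-39 MPH figure can be produced by binning each kick's average wind speed and dividing misses by attempts within each bin. The sketch below is one way to do that. The 10 MPH bucket width, the function names, and the sample kicks are assumptions, and wind speed is assumed to be in MPH already.

```python
from collections import defaultdict


def wind_bucket(avg_wind_mph):
    """Label a 10 MPH-wide bucket, e.g. 34.0 -> '30-39'."""
    low = int(avg_wind_mph // 10) * 10
    return f"{low}-{low + 9}"


def miss_rate_by_wind(kicks):
    """kicks: iterable of (avg_wind_mph, made) pairs; returns bucket -> miss rate."""
    attempts = defaultdict(int)
    misses = defaultdict(int)
    for wind, made in kicks:
        bucket = wind_bucket(wind)
        attempts[bucket] += 1
        if not made:
            misses[bucket] += 1
    return {bucket: misses[bucket] / attempts[bucket] for bucket in attempts}


# Made-up kicks, only to exercise the functions.
sample_kicks = [(5, True), (5, False), (34, False), (34, True), (34, True), (34, True)]
rates = miss_rate_by_wind(sample_kicks)
```

Buckets with few attempts produce noisy rates, so the attempt count is worth reporting alongside each percentage.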
Editor's notes
Extract value and insight. http://www.flickr.com/photos/billlublin/3972999678/sizes/o/
Unstructured data. Human generated. http://www.flickr.com/photos/nathaninsandiego/5159833527/sizes/o/
Incomplete passes to a receiver, averaged over all seasons together: A.Luck to R.Wayne; G.Ferotte to C.Chambers; J.Freeman to V.Jackson; T.Brady to R.Moss; A.Luck to D.Avery
This break up creates 96 different queryable columns. http://www.flickr.com/photos/modenadude/6150263821/sizes/o/in/photostream/
1st downs are 52% runs and 42% pass. 2nd downs are 45% runs and 49% pass. 3rd downs are 26% runs and 66% pass. http://www.flickr.com/photos/crackerbunny/3215652008/sizes/l/
Easy for humans to parse data, hard for computers. Natural language processing. While breaking down the data, we need to know what questions we want to answer. Look back at my commits to see what I've added. http://www.flickr.com/photos/nathaninsandiego/5159833527/sizes/o/
This break up creates 96 different queryable columns. Limited to data about plays. http://www.flickr.com/photos/modenadude/6150263821/sizes/o/in/photostream/
1 yard is 65% run. X and 24 has the highest chance of a sack at 4.6%. X and 21 has the highest chance of a QB scramble at 1.7%. X and 10 is about even between pass and run, in the high 40s. http://www.flickr.com/photos/crackerbunny/3215652008/sizes/l/
6% of plays lack weather data. Hours spent diagnosing missing or bad data. Hours spent downloading data. http://www.flickr.com/photos/37611179@N00/2295452969/in/photolist-4uQNck-5SRuWS-5WYBDL-677pYM-7cscT7-7vyC7G-7XRk46-84U1Ft-ayVaRS-7ReJrS-dpXi1U-8cTwQ1-7Pq9iE-bEo82F-98LeR5-9Ue2aF-b3vtrz-7YWv62
No direct key between stadium and weather station. The average score with weather is 21-18 and without weather is 21-19.
Miami has the worst, 14-18. Pittsburgh has the biggest non-weather advantage, 24-14. http://www.flickr.com/photos/37611179@N00/2295452969/in/photolist-4uQNck-5SRuWS-5WYBDL-677pYM-7cscT7-7vyC7G-7XRk46-84U1Ft-ayVaRS-7ReJrS-dpXi1U-8cTwQ1-7Pq9iE-bEo82F-98LeR5-9Ue2aF-b3vtrz-7YWv62
Used by permission of Lego Police Force https://www.facebook.com/LegoPD
2008 was the peak with 29 of 32 teams with an arrest. Commissioner Goodell implemented a personal conduct policy in 2007 for the 2008 season. http://www.thebiglead.com/index.php/2013/07/01/nfl-offseason-arrests-are-up-61-since-roger-goodell-implemented-personal-conduct-policy-in-2007/
Weather is not as big an issue. Arrests are not a big issue. We need to use data to make decisions.