Introduction to Data Science
Lecture 6
Exploratory Data Analysis
CS 194 Spring 2014
Michael Franklin
Dan Bruckner, Evan Sparks,
Shivaram Venkataraman
Outline for this Evening
• Class Lecture
• Exploratory Data Analysis
• Hypothesis Testing
• Exercise – EDA and HT in Python
(Evan: Tutorial and Lab)
next week: we’ll play with “R”
• Review of exercise
• Time for Project Group Discussions
Topics Today and Next Time
• Exploratory Data Analysis
• Data Diagnosis
• Graphical/Visual Methods
• Data Transformation
• Confirmatory Data Analysis
• Statistical Hypothesis Testing
• Graphical Inference
Descriptive vs. Inferential
• Descriptive: e.g., mean; describes the data you
have but can't be generalized beyond them
• We’ll talk about Exploratory Data Analysis
• Inferential: e.g., t-test; enables inferences
about the population beyond our data
• These are the techniques we’ll leverage for
Machine Learning and Prediction
Examples of Business Questions
• Simple (descriptive) Stats
• “Who are the most profitable customers?”
• Hypothesis Testing
• “Is there a difference in value to the company of these
customers?”
• Segmentation/Classification
• “What are the common characteristics of these
customers?”
• Prediction
• “Will this new customer become a profitable
customer? If so, how profitable?”
adapted from Provost and Fawcett, “Data Science for Business”
Applying techniques
• What models/techniques to use depends on
the problem context, data and underlying
assumptions.
• e.g., Classification problem with binary
outcome? -> logistic regression, Naïve Bayes,
…
• e.g., Classification problem but no labels?
• -> Perhaps use K-means clustering
Exploratory Data Analysis
• John Tukey's book of the same name, published 1977
• Based on insights developed at Bell Labs
in the 1960s
• Techniques for visualizing and
summarizing data
• What can the data tell us? (in contrast to
“confirmatory” data analysis)
• Introduced many basic techniques:
• 5-number summary, box plots, stem
and leaf diagrams,…
• 5-number summary:
• extremes (min and max)
• median & quartiles
• More robust to skewed & long-tailed
distributions than the mean and standard deviation
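A minimal sketch of the 5-number summary, using only the Python standard library. The income values are made up for illustration, and the "inclusive" quartile method is one convention among several, so other tools may report slightly different quartiles.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a sequence of at least two numbers."""
    data = sorted(values)
    # statistics.quantiles with n=4 returns the three quartile cut points.
    # method="inclusive" is one convention; others give slightly different Q1/Q3.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]

# Hypothetical incomes with one extreme value (a long right tail).
incomes = [21, 23, 25, 27, 30, 32, 35, 40, 400]
summary = five_number_summary(incomes)
mean = statistics.mean(incomes)

print("five-number summary:", summary)
print("mean:", round(mean, 1))
```

In this made-up sample the single extreme value pulls the mean above the third quartile, while the median and quartiles stay with the bulk of the data.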
The Trouble with Summary Stats
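The figure for this slide did not survive extraction. As an assumption about the kind of example meant, the sketch below uses the first two datasets of Anscombe's quartet (Anscombe, 1973), which share nearly identical means, variances, and correlations with x even though one is roughly linear and the other is a clear curve. The helper `pearson_r` is written here for illustration.

```python
import statistics

# First two datasets of Anscombe's quartet; both use the same x values.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

def pearson_r(a, b):
    """Pearson correlation coefficient computed from its definition."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    ss_a = sum((ai - mean_a) ** 2 for ai in a)
    ss_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / (ss_a * ss_b) ** 0.5

# Print mean, sample variance, and correlation with x for each dataset.
for name, y in (("y1", y1), ("y2", y2)):
    print(name,
          round(statistics.mean(y), 2),
          round(statistics.variance(y), 2),
          round(pearson_r(x, y), 2))
```

The summary numbers agree to two decimals, so only a plot reveals that the two datasets have different shapes.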
Looking at Data
Data Presentation
• Dashboard
Data Presentation
• Data Art
Chart types
• Single variable
• Dot plot
• Jitter plot
• Box plot
• Histogram
• Kernel density estimate
• Cumulative distribution function
(note: examples use the qplot function from R's ggplot2 library)
Chart examples from Jeff Hammerbacher’s 2012 CS194 class
Chart types
• Dot plot
Chart types
• Jitter plot
Chart types
• Box plot
Chart types
• Box plot
Chart types
• Histogram
Chart types
• Kernel density estimate
Chart types
• Histogram and Kernel Density Estimates
• Histogram
• Proper selection of bin width is important
• Outliers should be excluded from the plot (or shown separately) so they do not stretch the bins
• KDE
• Kernel function
• Box, Epanechnikov, Gaussian
• Kernel bandwidth
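The effect of bin width and kernel bandwidth can be sketched in plain Python. Everything here is illustrative: the sample is simulated, `histogram` and `make_gaussian_kde` are helper names invented for this sketch, and the bandwidth uses a normal-reference rule of thumb (1.06 · s · n^(-1/5)) as one common default rather than a recommendation from the lecture.

```python
import math
import random
import statistics

random.seed(0)
# Hypothetical sample: 200 draws from a standard normal distribution.
sample = [random.gauss(0.0, 1.0) for _ in range(200)]

def histogram(values, bin_width):
    """Count values per bin; bin k covers [lo + k*w, lo + (k+1)*w)."""
    lo = min(values)
    counts = {}
    for v in values:
        k = int((v - lo) // bin_width)
        counts[k] = counts.get(k, 0) + 1
    return counts

def make_gaussian_kde(values, bandwidth):
    """Return a function that estimates the density with Gaussian kernels."""
    n = len(values)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2)
                          for v in values)
    return density

# Normal-reference rule of thumb for the bandwidth of a Gaussian kernel.
bandwidth = 1.06 * statistics.stdev(sample) * len(sample) ** (-1 / 5)
kde = make_gaussian_kde(sample, bandwidth)

coarse = histogram(sample, bin_width=2.0)   # few wide bins: shape is blurred
fine = histogram(sample, bin_width=0.05)    # many narrow bins: noisy counts
print(len(coarse), "bins vs", len(fine), "bins")
print("estimated density at 0:", round(kde(0.0), 3))
```

The same trade-off applies to both plots: a width that is too large hides structure, and one that is too small mostly shows sampling noise.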
Chart types
• Cumulative distribution function
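An empirical cumulative distribution function needs no binning choice at all. The sketch below is a minimal version; the function name `ecdf` and the response times are made up for illustration.

```python
import bisect

def ecdf(values):
    """Return F where F(x) is the fraction of values less than or equal to x."""
    data = sorted(values)
    n = len(data)
    def F(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect.bisect_right(data, x) / n
    return F

# Hypothetical response times in milliseconds.
times = [12, 15, 15, 18, 22, 30, 45, 120]
F = ecdf(times)
print(F(15))    # share of requests at or below 15 ms
print(F(100))   # share of requests at or below 100 ms
```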
Chart types
• Two variables
• Scatter plot
• Line plot
• Log-log plot
• Cut-and-stack plot
• Pairs plot
Chart types
• Scatter plot
Chart types
• Line plot
Chart types
• Log-log plot
Chart types
• Coxcomb plot
Chart types
• Treemap
Chart types
• Heatmap
Chart types
• Gapminder
The Need for Models
“All models are wrong, but some models are useful.” George Box
• Data represents the traces of real-world processes.
• Two sources of randomness and uncertainty:
1) those underlying the processes themselves
2) those associated with the data collection methods
• To simplify the traces into something more
comprehensible you need:
• mathematical models or functions of the data -> Statistical
estimators
More on Models
• N is size of population
• n is sample size (subset of the population)
• Getting the subset (i.e. sampling) can
introduce "bias" leading to incorrect
conclusions
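A small simulation can make the sampling-bias point concrete. The population, the sample size, and the "top half of spenders" selection rule are all invented for this sketch.

```python
import random
import statistics

random.seed(42)
# Hypothetical population of N = 10,000 customer spend values (mean about 50).
population = [random.expovariate(1 / 50) for _ in range(10_000)]
N = len(population)
n = 200  # sample size

# Unbiased: every member of the population is equally likely to be selected.
random_sample = random.sample(population, n)

# Biased: only the top half of spenders can be selected,
# e.g. a survey that reaches only loyalty-card holders.
top_half = sorted(population)[N // 2:]
biased_sample = random.sample(top_half, n)

pop_mean = statistics.mean(population)
random_mean = statistics.mean(random_sample)
biased_mean = statistics.mean(biased_sample)
print("population mean:   ", round(pop_mean, 1))
print("random sample mean:", round(random_mean, 1))
print("biased sample mean:", round(biased_mean, 1))
```

The random sample lands near the population mean, while the biased sample overstates it no matter how large n becomes.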
Probability Distributions
• Natural processes tend to generate
measurements whose empirical shape can
be approximated by mathematical functions
with a few parameters that can be
estimated from the data.
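As an illustration, the sketch below fits a normal distribution, which has two parameters, to simulated measurements and then checks the fit against the data. The heights and the choice of a normal model are assumptions made for the example.

```python
import random
import statistics

random.seed(1)
# Hypothetical measurements: heights in cm generated from Normal(170, 8).
heights = [random.gauss(170.0, 8.0) for _ in range(5_000)]

# A normal distribution has two parameters; estimate both from the data.
mu_hat = statistics.mean(heights)
sigma_hat = statistics.stdev(heights)
fitted = statistics.NormalDist(mu_hat, sigma_hat)

# Compare model and data: share of values within one standard deviation
# of the mean (about 68% for a normal distribution).
within = sum(abs(h - mu_hat) <= sigma_hat for h in heights) / len(heights)
model_share = fitted.cdf(mu_hat + sigma_hat) - fitted.cdf(mu_hat - sigma_hat)

print("estimated mean and sd:", round(mu_hat, 1), round(sigma_hat, 1))
print("share within 1 sd (data, model):", round(within, 3), round(model_share, 3))
```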
Note on ML Algos vs. Stat Models
• Techniques and underlying concepts in common
• Difference in goals/use:
• ML Algos – goal: predict or classify with high
accuracy.
• basis of many data products
• Models – get at the underlying generative process
• “Black box” vs. “White box”
• Dealing with uncertainty (at the heart of stats)
• Distributions vs. non-parametric approaches
More on Hypothesis Testing
• Null Hypothesis is given the benefit of the
doubt (e.g., innocent until proven guilty).
• Alternative Hypothesis directly contradicts the
Null Hypothesis
• "Step 1: State the hypotheses."
• "Step 2: Set the criteria for a decision."
• "Step 3: Compute the test statistic."
• "Step 4: Make a decision."
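The four steps can be walked through with an exact binomial test on a coin. The data (61 heads in 100 tosses) and the choice of test are assumptions made for this sketch, not an example taken from the lecture.

```python
import math

# Step 1: state the hypotheses.
#   H0: the coin is fair (p = 0.5).  H1: the coin is not fair (p != 0.5).
n_tosses = 100
p_null = 0.5

# Step 2: set the criterion for a decision (significance level).
alpha = 0.05

# Step 3: compute the test statistic and its p value.
# Hypothetical observation: 61 heads. The statistic is the distance of the
# observed count from the count expected under H0.
heads = 61
expected = n_tosses * p_null
distance = abs(heads - expected)

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

# Two-tailed p value: probability, under H0, of a count at least as far
# from the expected value as the one observed.
p_value = sum(
    binom_pmf(k, n_tosses, p_null)
    for k in range(n_tosses + 1)
    if abs(k - expected) >= distance
)

# Step 4: make a decision.
reject_null = p_value < alpha
print(round(p_value, 4), reject_null)
```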
p Value
• A p value is the probability of obtaining a
sample outcome at least as extreme as the one
observed, given that the value stated in the
null hypothesis is true.
• In many cases: when the p value is less than
5% (p < .05), we reject the null hypothesis
• Note this means that when the null hypothesis is
true, we incorrectly reject it about 1 out of 20 times
• Do “green jelly beans cause acne?” (see XKCD)
From G.J. Privitera, “Statistics for the Behavioral Sciences”
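The jelly bean joke is about running many tests at the 5% level. A rough sketch of why that matters, assuming 20 independent tests in which every null hypothesis is true (so each p value is uniform on [0, 1]); the number of tests and simulated experiments are arbitrary choices.

```python
import random

random.seed(7)
alpha = 0.05
n_tests = 20           # e.g. 20 jelly bean colors, each tested separately
n_experiments = 20_000

# Assumption: every null hypothesis is true and the tests are independent,
# so each p value is uniformly distributed on [0, 1].
def any_false_positive():
    return any(random.random() < alpha for _ in range(n_tests))

simulated = sum(any_false_positive() for _ in range(n_experiments)) / n_experiments
exact = 1 - (1 - alpha) ** n_tests

print("P(at least one false positive), exact:    ", round(exact, 3))
print("P(at least one false positive), simulated:", round(simulated, 3))
```

Under these assumptions, at least one "significant" result appears in well over half of the experiments, even though nothing real is going on.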
Two-tailed Significance
When the p value is less than 5% (p < .05), we
reject the null hypothesis
From G.J. Privitera, “Statistics for the Behavioral Sciences”
Hypothesis Testing
From G.J. Privitera, “Statistics for the Behavioral Sciences”
Are Two Sets of Data Really Different?
• Null Hypothesis: The differences we see are
due to “chance”
• For small sample sizes: use a t-test
• We’ll do this next in the lab.
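Ahead of the lab, here is a hand-rolled sketch of a two-sample t-test. It assumes equal variances (Student's pooled version), uses made-up measurements, and compares the statistic with a tabulated critical value instead of computing a p value, because the standard library has no t distribution. The lab's own code may well use a statistics package instead.

```python
import math
import statistics

# Hypothetical measurements for two small groups (n = 8 each).
group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3, 5.7, 5.5]
group_b = [4.6, 4.8, 5.0, 4.4, 5.2, 4.7, 4.9, 4.5]

def pooled_t_statistic(a, b):
    """Student's two-sample t statistic and degrees of freedom (equal variances)."""
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * statistics.variance(a)
                  + (n_b - 1) * statistics.variance(b)) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (statistics.mean(a) - statistics.mean(b)) / std_err, df

t_stat, df = pooled_t_statistic(group_a, group_b)

# Two-tailed critical value for alpha = 0.05 and df = 14, from a t table.
T_CRITICAL_DF14 = 2.145
reject_null = abs(t_stat) > T_CRITICAL_DF14
print(round(t_stat, 2), df, reject_null)
```

If the absolute t statistic exceeds the critical value, the difference between the group means is larger than chance alone would plausibly produce at the 5% level.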
Some Notes on the Class
• 3/17 Intro to Supervised Learning
• HW2 coming out tomorrow night
• Due after Spring Break but do it before!
• FINAL PROJECTS
• Group size = 3
• What’s expected – find data, build a COOL Data
Product, integration & viz or good reason why not
• Schedule:
• Groups Formed
• 1-2 page proposal DUE 3/11 Midnight
• Midway review meeting with Prof or GSIs in the following 1-2
weeks
• Final Presentation (Posters and/or Lightning talks)
• Final Report


Editor's Notes

  1. Attributed to Florence Nightingale