The University of Sydney Page 1
Exploratory data analysis
The basics
Presented by Professor Peter Reimann
Centre for Research on Learning and Cognition
EDA is an inquiry cycle
– Generate questions
– Search for answers in the data: visualize, transform, model the data
– Refine questions (and start the cycle again)
EDA is an important component of theory-driven, problem-driven, and curiosity-driven research.
Where do questions come from?
An important source of questions about data is hypotheses derived from theory:
Data ← Hypotheses ← Theory
Problems are another source:
Data ← Questions ← Problem(s)
A third source is the data themselves:
Data ← Questions ← Data
Models of data
EDA plays a role in all three scenarios.
– Theories do not get compared with data as such, but with models of data:
Data → (EDA) → Data model(s) ← Hypotheses ← Theory
And similarly for the other cases:
Data → (EDA) → Data model(s) ← Questions ← Problem(s)
Data → (EDA) → Data model(s) ← Questions ← Data
Data are not “objective”
– Measurements and observations are not theory- or assumption-free;
– There is more than one way to build a (statistical) model of any data set;
– While the data may support a theory, they likely support many other theories as well;
– While a data set may support a theory, it could also contain relations that contradict the theory.
Hence, even if your data are carefully selected and measured, and you think you know them well, it is important to look for the unexpected!
The exploratory perspective
Key assumption: The more one knows about the data, the more effectively the data can be used to
– develop, test and refine theory,
– solve problems, and
– ask interesting questions.
To maximise what is learned from data, one needs to adhere to two principles:
– scepticism, and
– openness.
One should be sceptical, for instance about the assumption that specific
statistical parameters (i.e., summaries of data, such as the mean) reflect data
faithfully, and open to different interpretations of what the data say.
Be sceptical! Be open!
One reason to be sceptical
about statistics in particular
is Anscombe’s Quartet:
– Four datasets with (almost)
identical statistics, but
very different shapes.
By Anscombe https://commons.wikimedia.org/w/index.php?curid=9838454
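The quartet can be checked directly in R. The following is a minimal sketch, assuming only base R, which ships the four datasets as the data frame `anscombe` (columns `x1` to `x4` and `y1` to `y4`); the name `quartet_stats` is chosen here for illustration.

```r
# Summary statistics for each of the four Anscombe datasets.
# `anscombe` is part of base R (package datasets).
quartet_stats <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y),
    var_x = var(x), var_y = var(y),
    cor_xy = cor(x, y))
})
colnames(quartet_stats) <- paste0("set", 1:4)

# Rows are statistics, columns are datasets; the columns are almost identical
print(round(quartet_stats, 2))
```

Plotting each pair, for instance with `plot(anscombe$x1, anscombe$y1)`, is what reveals the very different shapes behind these near-identical numbers.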
(cont.)
– Statistics (= summative accounts of data) can be misleading
– Data analysis is not identical with statistics:
– Visual analysis should precede statistical analysis
Stay open to multiple interpretations!
– The confirmatory, or hypothesis-testing, mode of data analysis can keep one from seeing what other patterns might exist in the data.
In addition to asking:
– Do these data confirm or disconfirm my hypothesis about x?
Ask:
– What can these data tell me about x?
Model and outliers
The basic way of thinking about data:
Data = pattern + deviations
(model + outliers)
(smooth + rough)
Data analysis, including statistical analysis, means partitioning data into patterns/models/smooths and deviations/outliers/roughs.
For any given data, there are in principle many ways to do this partitioning, and there is no logical reason to prefer one over another a priori. Hence the analysis process is incremental, not a single hypothesis-testing step.
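The partition "data = smooth + rough" can be sketched in a few lines of base R. This is only an illustration under assumptions not in the slides: it uses Anscombe's third dataset and picks a least-squares line as one possible smooth, then the median as a second one, to show that the split is a choice rather than a given.

```r
# Data = smooth + rough, using Anscombe's third dataset (base R)
x <- anscombe$x3
y <- anscombe$y3

# One possible "smooth": a least-squares line
fit <- lm(y ~ x)
smooth <- fitted(fit)
rough <- residuals(fit)

# The two parts add back up to the original data
stopifnot(isTRUE(all.equal(unname(smooth + rough), y)))

# The observation with the largest deviation is a candidate outlier
outlier_row <- which.max(abs(rough))
print(anscombe[outlier_row, c("x3", "y3")])

# A different, equally legitimate partition: the median as the smooth
rough_median <- y - median(y)
```

Which partition is more useful depends on the question being asked, which is why the analysis proceeds in several rounds.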
Our tools for EDA
– dplyr: selecting, filtering, summarising data
– ggplot2: visualising data, patterns, trends.
Data selection with dplyr
dplyr is made up of 5 verbs:
(1) select variables
(2) filter on values
(3) arrange: reorder rows
(4) mutate: create new variables
(5) summarize over values
The verbs operate on a data frame of observations (rows) by variables (columns):
              | Variable A | (…) | Variable v
Observation 1 | Value 1A   | (…) | Value 1v
Observation 2 | Value 2A   | (…) | Value 2v
(…)           | (…)        | (…) | (…)
Observation o | Value oA   | (…) | Value ov
“Sentences” in dplyr
General format: verb(data frame, parameters)
– The result is a new data frame: new_frame <- verb(data frame, parameters).
Examples:
– filter(flights, month == 1, day == 1)
– arrange(flights, year, month, day)
– select(flights, year, month, day)
– mutate(flights, gain = arr_delay - dep_delay,
speed = distance / air_time * 60)
– summarize(flights, delay = mean(dep_delay))
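The examples above can be tried on a small data set. The sketch below assumes dplyr is installed and builds a tiny, made-up stand-in for `flights` (the slides presumably use the much larger table from the nycflights13 package, which is an assumption here). It also shows one thing worth being sceptical about: `mean()` returns `NA` as soon as a value is missing, unless `na.rm = TRUE` is given.

```r
library(dplyr)

# Hypothetical stand-in for the `flights` data frame used on the slide
flights <- data.frame(
  year = 2013,
  month = c(1, 1, 2, 1),
  day = c(1, 1, 3, 2),
  dep_delay = c(2, NA, 15, -3),
  arr_delay = c(11, NA, 5, -10)
)

# Each verb returns a new data frame; `flights` itself is not modified
jan1 <- filter(flights, month == 1, day == 1)
by_date <- arrange(flights, year, month, day)
dates <- select(flights, year, month, day)
with_gain <- mutate(flights, gain = arr_delay - dep_delay)

# A missing value makes the plain mean NA; na.rm = TRUE drops it first
delay_naive <- summarize(flights, delay = mean(dep_delay))
delay_clean <- summarize(flights, delay = mean(dep_delay, na.rm = TRUE))
print(delay_naive)
print(delay_clean)
```

Because every verb takes a data frame and returns a data frame, the result of one call can be passed straight into the next.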
Boolean operations are supported for filtering and selecting
! is “not”, | is “or”, & is “and”
These two return the same observations:
filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)
For more on these commands, see for instance
https://www.youtube.com/watch?v=aywFompr1F4
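The equivalence of the two filters is De Morgan's law: "not (A or B)" is the same as "(not A) and (not B)". Here is a minimal check, assuming dplyr is installed; the data frame `delays` is made up for illustration and includes missing values on purpose, because `filter()` keeps only rows where the condition is `TRUE` and drops rows where it is `NA`.

```r
library(dplyr)

# Hypothetical delays in minutes, including missing values
delays <- data.frame(
  id = 1:5,
  arr_delay = c(10, 130, NA, 50, NA),
  dep_delay = c(5, 20, 10, 200, 150)
)

# "not (late arrival or late departure)"
not_or <- filter(delays, !(arr_delay > 120 | dep_delay > 120))

# Comma-separated conditions are combined with "and"
both_and <- filter(delays, arr_delay <= 120, dep_delay <= 120)

# The same condition written with an explicit &
with_amp <- filter(delays, arr_delay <= 120 & dep_delay <= 120)

print(not_or)
```

All three calls keep the same rows in this example, including the cases where one of the delays is missing.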
Workbook
– The rest of this module is mainly in the workbook.


Editor's notes

  1. https://en.wikipedia.org/wiki/Anscombe's_quartet. Part of the reason is that many statistics are very sensitive to outliers; see in particular datasets 3 and 4.