Advanced Data Analytics: Getting Started with R

Jeffrey Stanton
School of Information Studies
Syracuse University

Analytics: Key Steps
• Learn the application domain
• Locate or develop a data source or data set
• Clean and preprocess data: May take 60% of effort!
• Data reduction and transformation
   – Find useful pieces, squeeze out redundancies
• Choose analytical approaches
   – summarize, visualize, organize, describe, explore, find
     patterns, predict, test, infer
• Communicate the results and implications to data users
• Deploy discovered knowledge in a system
• Monitor and evaluate the effectiveness of the system
First Example: Ice Cream Consumption
• We all know the domain: we have all eaten ice cream
• Public data set obtained from supplement to Verbeek’s text:
  http://eu.wiley.com/legacy/wileychi/verbeek2ed/datasets.html
• Let’s read the data into R and summarize it:

ICECREAM <- read.csv("[pathname]/icecream.csv", header = TRUE)
summary(ICECREAM)


• What do these two R commands do? Did you get a mean of
  84.6 for income? What are “Min.,” “1st Qu.,” and all of those
  other things?
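
A minimal sketch of what the two commands do, assuming a small made-up data frame written to a temporary file (the values, the name `demo`, and the temporary path are illustrative and are not the ice cream data). read.csv() parses a comma-separated file into a data frame, and summary() reports the minimum, first quartile, median, mean, third quartile, and maximum of each numeric column.

```r
# Illustrative stand-in data; NOT the real ice cream values.
demo <- data.frame(income = c(10, 20, 30, 40, 50),
                   cons   = c(0.1, 0.2, 0.3, 0.4, 0.5))

# Write it to a temporary CSV file, then read it back the way the slide does.
csv_path <- tempfile(fileext = ".csv")
write.csv(demo, csv_path, row.names = FALSE)
DEMO <- read.csv(csv_path, header = TRUE)

# One block of summary statistics per variable.
print(summary(DEMO))

# The same six statistics for a single column, as a named vector.
income_summary <- summary(DEMO$income)
print(names(income_summary))
```

"1st Qu." and "3rd Qu." are the first and third quartiles: the values below which roughly 25% and 75% of the observations fall.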

Metadata
• There is a text file that goes with the CSV dataset:
  “icecream.txt”
• This describes the meaning of the variables provided in the
  dataset; essential if we are to make sense of these data:
Variable labels:

cons:    consumption of ice cream per head (in pints);
income:  average family income per week (in US Dollars);
price:   price of ice cream (per pint);
temp:    average temperature (in Fahrenheit);
Time:    index from 1 to 30

• We also learn from the metadata that these are time series
  data with 30 four-weekly observations from 18 March 1951
  to 11 July 1953
“Sanity Check” Using Histograms and Boxplots

• Cleaning, screening, and preprocessing are essential to ensure
  that you understand what your data set contains and that it
  does not contain garbage; it is impractical to look at every
  data point, so we use histograms and boxplots to get an
  overview of our data:

hist(ICECREAM$income)
boxplot(ICECREAM$income)

• What is the purpose of the “$” notation in the commands
  above? Is there any other way of referring to these
  variables?
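
As a sketch of the same sanity check done numerically, assuming a made-up vector with one deliberately wrong entry (the values are illustrative, not the ice cream data): boxplot.stats() returns the five numbers a boxplot draws plus any points beyond the whiskers, and hist() with plot = FALSE returns the bin counts without drawing anything.

```r
# Made-up values; the 150 stands in for a data-entry error.
x <- c(12, 15, 14, 13, 16, 150)

box <- boxplot.stats(x)
print(box$stats)  # lower whisker, lower hinge, median, upper hinge, upper whisker
print(box$out)    # points flagged as lying beyond the whiskers

h <- hist(x, plot = FALSE)
print(h$breaks)   # bin edges
print(h$counts)   # number of observations in each bin
```

A value that shows up alone in `box$out`, or alone in a distant histogram bin, is worth checking against the source before any analysis.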
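Interpret These Graphics

[Figures not preserved; presumably the histogram and boxplot of ICECREAM$income produced by the commands on the previous slide]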
Explore
• Perhaps a family with greater income can afford to purchase
  more ice cream:

plot(ICECREAM$income,ICECREAM$cons)


• How do you interpret a
  scatterplot?
• Is there a pattern here?
• Does our intuitive hypothesis
  fit the scatterplot?
• What else could scatterplots
  show?
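
One way to put a number on the pattern seen in a scatterplot is the correlation coefficient. The sketch below uses R's built-in `cars` data set as a stand-in, which is an assumption made only because the path to icecream.csv is elided on the slides; the same call would apply to ICECREAM$income and ICECREAM$cons.

```r
# Stand-in data: 'cars' ships with R (speed and stopping distance).
r <- cor(cars$speed, cars$dist)  # Pearson correlation, always between -1 and 1
print(r)

# A positive value means the points tend to rise from left to right;
# a value near 0 means no clear linear pattern.
```

Pairing the number with plot(cars$speed, cars$dist) is still advisable, since a single coefficient can hide curved or clustered patterns.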
More Tools to Support Exploration
results <- lm(cons ~ temp, data = ICECREAM)
# This is a comment line
# The previous command calculates the line
# that best fits the scatterplot with temp
# on the X axis and cons on the Y axis

plot(ICECREAM$temp, ICECREAM$cons)
abline(results) # Plots the best-fit line

# The new data structure "results" has
# lots of information about the analysis.
# What does this list contain?
results$residuals
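
A short sketch of how to look inside the object that lm() returns, again using the built-in `cars` data set as a stand-in (an assumption; the name `fit` is also illustrative). The fitted model is a list, and the residuals are the observed values minus the values predicted by the best-fit line.

```r
# Stand-in data: 'cars' ships with R (speed and stopping distance).
fit <- lm(dist ~ speed, data = cars)

print(names(fit))            # components stored in the fitted-model list
print(coef(fit))             # intercept and slope of the best-fit line
print(head(residuals(fit)))  # observed minus fitted values, first six rows

# For a least-squares fit with an intercept, the residuals sum to (nearly) zero.
print(sum(residuals(fit)))
```

residuals(fit) and fit$residuals return the same numbers; the accessor function is the more conventional way to ask for them.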

What is the effect of time on these data?
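# Column names are case-sensitive; the metadata spells this variable "Time",
# so check names(ICECREAM) and use ICECREAM$Time if that is what you see.
plot(ICECREAM$time,ICECREAM$temp)
plot(ICECREAM$time,ICECREAM$cons)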

• What do these plots show? Can you explain why these are
  shaped the way they are?
• Based on your answer to the previous question, how does
  the situation affect your strategies for understanding ice
  cream consumption?




Demonstrating Mastery
• Find a small numeric dataset; try starting at the Journal of
  Statistics Education data website:
  http://www.amstat.org/publications/jse/jse_data_archive.htm
• Read the dataset into R
• Summarize the variables in that dataset
• Use histograms and boxplots to check and understand your
  data; use the metadata description that came with the dataset
  to make sure that you know the variables
• Explore the data using plot; look for something interesting
• Put your findings in a slide and communicate them to me or
  someone else


Editor's Notes

  1. The other way is to call attach() on the ICECREAM data structure. Then you can refer to the variable names directly (for example, income instead of ICECREAM$income).
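
A small sketch of the alternatives to the `$` notation, assuming a made-up data frame named `demo` (hypothetical, not the ice cream data). Each form below returns the same column.

```r
# Illustrative stand-in data; NOT the real ice cream values.
demo <- data.frame(income = c(10, 20, 30), cons = c(0.3, 0.4, 0.5))

by_dollar <- demo$income         # $ notation, as on the slides
by_name   <- demo[["income"]]    # double-bracket extraction by column name
by_with   <- with(demo, income)  # evaluate an expression inside the data frame

attach(demo)                     # put the columns on the search path
by_attach <- income              # the bare name now resolves to demo$income
detach(demo)                     # undo attach() to avoid name clashes later

print(by_attach)
```

attach() is convenient for short interactive sessions, but attached names can be masked by other objects with the same name, so with() or an explicit `$` is often the safer habit in scripts.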