1. Advanced Data Analytics:
Getting Started with R
Jeffrey Stanton
School of Information Studies
Syracuse University
2. Analytics: Key Steps
• Learn the application domain
• Locate or develop a data source or data set
• Clean and preprocess data: May take 60% of effort!
• Data reduction and transformation
– Find useful pieces, squeeze out redundancies
• Choose analytical approaches
– summarize, visualize, organize, describe, explore, find
patterns, predict, test, infer
• Communicate the results and implications to data users
• Deploy discovered knowledge in a system
• Monitor and evaluate the effectiveness of the system
3. First Example: Ice Cream Consumption
• We all know the domain: we have all eaten ice cream
• Public data set obtained from supplement to Verbeek’s text:
http://eu.wiley.com/legacy/wileychi/verbeek2ed/datasets.html
• Let’s read the data into R and summarize it:
ICECREAM=read.csv("[pathname]/icecream.csv",header=TRUE)
summary(ICECREAM)
• What do these two R commands do? Did you get a mean of
84.6 for Income? What are “Min,” “1st Qu.” and all of those
other things?
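One way to see what summary() reports is to try it on a tiny made-up data set. This is only a sketch: the values below are invented for illustration and are not the ice cream data, and the name "toy" is hypothetical.

```r
# Toy data: invented values, NOT the ice cream data set
toy <- read.csv(text = "income,price
80,0.27
90,0.28
100,0.26
110,0.29
120,0.30", header = TRUE)

# For each numeric column, summary() reports the minimum, first quartile,
# median, mean, third quartile, and maximum
summary(toy)

# The same numbers can be computed one piece at a time
income_mean <- mean(toy$income)
income_quartiles <- quantile(toy$income)   # 0%, 25%, 50%, 75%, 100%
print(income_mean)
print(income_quartiles)
```

Here "1st Qu." and "3rd Qu." correspond to the 25% and 75% entries returned by quantile().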
4. Metadata
• There is a text file that goes with the CSV dataset:
“icecream.txt”
• This describes the meaning of the variables provided in the
dataset; essential if we are to make sense of these data:
Variable labels:
cons: consumption of ice cream per head (in pints);
income: average family income per week (in US Dollars);
price: price of ice cream (per pint);
temp: average temperature (in Fahrenheit);
time: index from 1 to 30
• We also learn from the metadata that these are time series
data with four-weekly observations from 18 March 1951 to 11
July 1953
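A quick way to compare a data set against its metadata is to check the variable names and the number of rows. The sketch below uses a hypothetical stand-in data frame ("toy") with invented values; only the five variable names come from the metadata above. R is case-sensitive, so the capitalization of each name matters.

```r
# Toy stand-in: the variable names follow the metadata, the values are invented
toy <- data.frame(cons   = c(0.30, 0.35, 0.40),
                  income = c(80, 85, 90),
                  price  = c(0.25, 0.27, 0.29),
                  temp   = c(40, 55, 70),
                  time   = 1:3)

names(toy)   # variable names as R sees them
str(toy)     # type and first values of each variable
nrow(toy)    # number of observations

# Which documented variables are missing from the data frame?
expected <- c("cons", "income", "price", "temp", "time")
missing_vars <- setdiff(expected, names(toy))
print(missing_vars)
```

With the real file, nrow(ICECREAM) should match the 30 observations implied by the time index.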
5. “Sanity Check” Using Histograms and Boxplots
• Cleaning, screening, and preprocessing are essential to ensure
that you understand what your data set contains and that it
does not contain garbage; it is impractical to look at every
data point, so we use histograms and boxplots to get an
overview of our data:
hist(ICECREAM$income)
boxplot(ICECREAM$income)
• What is the purpose of the “$” notation in the commands
above? Is there any other way of referring to these
variables?
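The "$" picks one variable (column) out of a data frame by name. A minimal sketch of some equivalent ways to do the same thing, using a hypothetical data frame "toy" with invented values:

```r
# Toy data: invented values, NOT the ice cream data set
toy <- data.frame(income = c(80, 90, 100, 110, 120))

by_dollar   <- toy$income        # the "$" notation
by_brackets <- toy[["income"]]   # double brackets with the name as a string
by_index    <- toy[, "income"]   # row/column indexing: all rows, one column
by_with     <- with(toy, income) # evaluate an expression inside the data frame

# All four return the same numeric vector
print(identical(by_dollar, by_brackets))
print(identical(by_dollar, by_index))
print(identical(by_dollar, by_with))

# with() also works for plotting commands
with(toy, hist(income))
```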
7. Explore
• Perhaps a family with greater income can afford to purchase
more ice cream:
plot(ICECREAM$income,ICECREAM$cons)
• How do you interpret a
scatterplot?
• Is there a pattern here?
• Does our intuitive hypothesis
fit the scatterplot?
• What else could scatterplots
show?
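A scatterplot can be paired with a single number that summarizes how closely the points follow a straight line. The sketch below assumes the Pearson correlation from cor() is an acceptable summary and uses invented values, not the ice cream data.

```r
# Toy data: invented values, NOT the ice cream data set
income <- c(80, 90, 100, 110, 120)
cons   <- c(0.30, 0.34, 0.33, 0.38, 0.40)

plot(income, cons)      # look at the pattern first

# Correlation ranges from -1 to 1; values near 0 mean no linear pattern
r <- cor(income, cons)
print(r)
```

The number does not replace the plot: a curved pattern or a single outlier can produce a misleading correlation.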
8. More Tools to Support Exploration
results=lm(ICECREAM$cons~ICECREAM$temp)
# This is a comment line
# The previous command calculates a line
# that best fits the scatterplot with temp
# on the X axis and cons on the Y axis
plot(ICECREAM$temp,ICECREAM$cons)
abline(results) # Plots the best fit line
# The new data structure "results" has
# lots of information about the analysis.
# What does this list contain?
results$residuals
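The object returned by lm() can be inspected piece by piece. A minimal sketch on invented values (the data frame "toy" and the name "fit" are hypothetical, not part of the ice cream example):

```r
# Toy data: invented values, NOT the ice cream data set
toy <- data.frame(temp = c(30, 40, 50, 60, 70),
                  cons = c(0.28, 0.31, 0.36, 0.38, 0.43))

fit <- lm(cons ~ temp, data = toy)   # same model form, using the data argument

names(fit)       # the components stored in the fitted model
coef(fit)        # intercept and slope of the best fit line
residuals(fit)   # observed value minus fitted value, one per data point
summary(fit)     # coefficients, standard errors, R-squared

# With an intercept in the model, least-squares residuals sum to zero
# (up to rounding error)
res_sum <- sum(residuals(fit))
print(res_sum)
```

The residuals are the vertical distances from each point to the line drawn by abline().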
9. What is the effect of time on these data?
plot(ICECREAM$time,ICECREAM$temp)
plot(ICECREAM$time,ICECREAM$cons)
• What do these plots show? Can you explain why these are
shaped the way they are?
• Based on your answer to the previous question, how does
the situation affect your strategies for understanding ice
cream consumption?
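When plotting against time, it can help to join the points and stack the panels so that both series share the same time axis. The sketch below uses simulated series that are invented for illustration; they are not the ice cream data, and the seasonal shape is an assumption.

```r
# Simulated series: invented for illustration, NOT the ice cream data
time <- 1:30
temp <- 50 + 20 * sin(2 * pi * time / 13)   # a made-up seasonal cycle
cons <- 0.35 + 0.004 * (temp - 50)          # a made-up series that follows temp

# Stack two panels, one above the other
old_par <- par(mfrow = c(2, 1))
plot(time, temp, type = "b")   # type = "b" draws points joined by lines
plot(time, cons, type = "b")
par(old_par)                   # restore the previous plot layout
```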
10. Demonstrating Mastery
• Find a small numeric dataset; try starting at the Journal of
Statistics Education data website:
http://www.amstat.org/publications/jse/jse_data_archive.htm
• Read the dataset into R
• Summarize the variables in that dataset
• Use histograms and boxplots to check and understand your
data; use the metadata description that came with the dataset
to make sure that you know the variables
• Explore the data using plot; look for something interesting
• Put your findings in a slide and communicate them to me or
someone else
Editor's notes
The other way is to attach() the ICECREAM data structure. Then you can refer to the variable names directly.