R - what do the numbers mean? #RStats This is the presentation for my Demo at Orlando Live60 AILIve. We go through statistics interpretation with examples
1. R and AI: what do the numbers mean?
Level: Intermediate
2. Jen Stirrup
• Boutique consultancy owner of Data Relish
• Postgraduate degrees in Artificial Intelligence and Cognitive Science
• Twenty-year career in industry
• Author
JenStirrup.com
DataRelish.com
3. Get in touch!
• http://bit.ly/JenStirrupRD
• http://bit.ly/JenStirrupLinkedIn
• http://bit.ly/JenStirrupMVP
• http://bit.ly/JenStirrupTwitter
11. Correlation r = 0.96
Year   Bedsheet deaths   Skiing revenue
2000   327               1,551
2001   456               1,635
2002   509               1,801
2003   497               1,827
2004   596               1,956
2005   573               1,989
2006   661               2,178
2007   741               2,257
2008   809               2,476
2009   717               2,438
Bedsheet deaths: number of people who died by becoming tangled in their bedsheets, deaths (US) (CDC).
Skiing revenue: total revenue generated by skiing facilities (US), dollars in millions (US Census).
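As a quick check, the correlation can be recomputed from the table. This is a minimal sketch; the vector names are made up for illustration, and the values are transcribed from the slide. The result should land close to the r quoted in the slide title, even though nobody would claim that bedsheets drive ski revenue.

```r
# Values transcribed from the slide's table, 2000 to 2009.
bedsheet_deaths  <- c(327, 456, 509, 497, 596, 573, 661, 741, 809, 717)
ski_revenue_musd <- c(1551, 1635, 1801, 1827, 1956, 1989, 2178, 2257, 2476, 2438)

# Pearson correlation coefficient between the two series.
r <- cor(bedsheet_deaths, ski_revenue_musd)
print(round(r, 2))
```

A high r only says the two series move together; it says nothing about cause.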
12. Why R?
• Among the most widely used data analysis software – used by 2M+ data scientists, statisticians and analysts
• A powerful statistical programming language
• Flexible, extensible and comprehensive for productivity
• Create beautiful and unique data visualisations – as seen in New York Times, Twitter and Flowing Data
• Thriving open-source community – leading edge of analytics research
• Fills the talent gap – new graduates prefer R.
13. What are we testing?
• We have one or two samples and a
hypothesis, which may be true or false.
• The NULL hypothesis – nothing happened.
• The Alternative hypothesis – something
did happen.
14. Strategy
• We set out to find evidence that something did happen, that is, evidence against the NULL hypothesis.
• We look at the distribution of the data.
• We choose a test statistic.
• We look at the p value.
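The strategy can be sketched end to end in R. The sample below is simulated, not real data: the seed, the sample size and the choice of a one-sample t-test against a mean of 0 are all assumptions made for illustration.

```r
set.seed(42)  # assumption: fixed seed so the sketch is repeatable

# Hypothetical sample: 30 simulated measurements.
x <- rnorm(30, mean = 0.5, sd = 1)

# 1. Look at the distribution of the data.
print(summary(x))

# 2. Choose a test statistic: here a one-sample t statistic,
#    with the NULL hypothesis that the true mean is 0.
result <- t.test(x, mu = 0)
print(result$statistic)

# 3. Look at the p value and compare it with the significance level.
alpha <- 0.05
print(result$p.value)
reject_null <- result$p.value < alpha
print(reject_null)
```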
15. What do I need to install?
• Install R – www.r-project.org
• Install RStudio – www.rstudio.com
• AzureML
• AutoML
16. “Every American should have above average income, and my Administration is going to see they get it.” (Bill Clinton on campaign trail)
“It’s clearly a budget. It’s got lots of numbers in it.” (George W. Bush)
20. What does the t-test give us?
• The t-test helps us to work out whether two sets of data are actually different.
• It takes two sets of data, and calculates the mean, the variance and the standard deviation of each.
21. What does the t-test give us?
• Then it combines those summaries into a test statistic that tells us whether the means of the two populations are likely to be different.
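To make the mechanics concrete, here is a minimal sketch that builds the t statistic by hand from the means and variances and compares it with R's built-in `t.test()`. The two groups are simulated and purely hypothetical. The formula shown is the Welch form, which is what `t.test()` uses by default.

```r
set.seed(1)  # assumption: fixed seed for repeatability

# Two hypothetical groups of simulated measurements.
group_a <- rnorm(20, mean = 10, sd = 2)
group_b <- rnorm(25, mean = 12, sd = 3)

# Step 1: the summaries the slide mentions.
mean_a <- mean(group_a); mean_b <- mean(group_b)
var_a  <- var(group_a);  var_b  <- var(group_b)
n_a    <- length(group_a); n_b  <- length(group_b)

# Step 2: Welch t statistic = difference in means / its standard error.
t_manual <- (mean_a - mean_b) / sqrt(var_a / n_a + var_b / n_b)

# t.test() defaults to the Welch form (var.equal = FALSE).
t_builtin <- unname(t.test(group_a, group_b)$statistic)
print(c(manual = t_manual, builtin = t_builtin))
```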
22. Enter the t-test
• The t-test: a simple way of establishing whether there are significant differences between two groups of data.
• The lower the p value, the stronger the evidence that there is a difference between the two groups.
• By convention, we want the p value to be less than 5% (0.05) before we claim a difference between two groups.
[Chart: sample size, mean and standard deviation for Ireland vs. elsewhere]
23. The Results!
• Using the averages, researchers
concluded that Guinness served in Ireland
is significantly better than pints served
elsewhere.
24. Summary
• The t-test is a valuable tool for testing whether the means of two groups differ.
• It has been used here to identify whether Guinness is better in Ireland or outside of Ireland.
25. Business and Statistics? Why?
• Statistical analysis is used widely in businesses
• Marketing – customer classification, spending patterns
• Management consulting – efficient use of resources
26. Statistically Significant
• If you have a statistically significant result, it means that a result like yours would be unlikely to happen by chance alone.
• If you don’t have statistically significant results, your test data doesn’t show an effect; in other words, you can’t reject the null hypothesis.
27. Numerical Measures – what is interesting?
• Centre of the data
• Spread of the data
29. Measures of Central Tendency
• Mean – this is the average
• Median – the middle value, which splits the sorted data into two halves
• Mode – the most frequent value
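A short sketch of the three measures on a made-up vector. Base R has `mean()` and `median()`, but its `mode()` reports the storage type rather than the statistical mode, so `stat_mode` below is a hypothetical helper written for this example.

```r
values <- c(2, 3, 3, 5, 7, 3, 8, 5)  # made-up data

mean_value   <- mean(values)    # the average
median_value <- median(values)  # the middle of the sorted data

# Hypothetical helper: the most frequent value(s) in a vector.
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}
mode_value <- stat_mode(values)

print(c(mean = mean_value, median = median_value, mode = mode_value))
```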
30. Measures of Dispersion
• Variance – average squared difference between the data points and the mean
• Standard Deviation – square root of the variance; more intuitive because it is in the same units as the data
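The definitions can be worked through by hand on a made-up vector. One detail worth flagging: R's `var()` and `sd()` divide by n - 1 (the sample versions), not by n, so the sketch below computes both.

```r
values <- c(4, 8, 6, 5, 7)  # made-up data
n <- length(values)

# Squared differences between each data point and the mean.
squared_diffs <- (values - mean(values))^2

pop_var    <- sum(squared_diffs) / n        # divide by n
sample_var <- sum(squared_diffs) / (n - 1)  # divide by n - 1, as var() does
sample_sd  <- sqrt(sample_var)              # back in the units of the data

print(c(pop_var = pop_var, sample_var = sample_var, sample_sd = sample_sd))
print(c(var = var(values), sd = sd(values)))
```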
31. Measures of Dispersion
• Percentiles – dataset is divided into 100 equal parts
• Quartiles – dataset is divided into four equal parts
• Interquartile range – spread of the middle 50% of data points (third quartile minus first quartile)
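In R these measures come from `quantile()` and `IQR()`. A minimal sketch on a made-up vector, using R's default quantile method:

```r
values <- 1:9  # made-up data

# Quartiles: the 25th, 50th and 75th percentiles.
quartiles <- quantile(values, probs = c(0.25, 0.5, 0.75))
print(quartiles)

# Any percentile can be requested the same way, e.g. the 90th.
p90 <- quantile(values, probs = 0.9)
print(p90)

# Interquartile range: third quartile minus first quartile.
iqr_value <- IQR(values)
print(iqr_value)
```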
32. Measures of Association
• Covariance – how variables vary together, rise together, fall together
• Correlation – covariance rescaled to lie between -1 and 1
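A small sketch of how the two measures relate, on made-up vectors: correlation is the covariance divided by the product of the two standard deviations.

```r
x <- c(1, 2, 3, 4, 5)  # made-up data
y <- c(2, 4, 5, 4, 5)

cov_xy <- cov(x, y)  # depends on the units of x and y
cor_xy <- cor(x, y)  # unit-free, always between -1 and 1

# Correlation rebuilt from the covariance.
cor_from_cov <- cov_xy / (sd(x) * sd(y))

print(c(cov = cov_xy, cor = cor_xy, cor_from_cov = cor_from_cov))
```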
33. Measuring Uncertainty
• Probability is based on SETS, which we
use in SQL
• We determine the probability of outcomes:
– Addition Rule
– Multiplication Rule
– Complement Rule
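The three rules can be illustrated with R's set functions on a fair six-sided die. The events and the `prob` helper below are made up for this example.

```r
outcomes <- 1:6         # sample space: one roll of a fair die
even <- c(2, 4, 6)      # event A: roll is even
high <- c(5, 6)         # event B: roll is 5 or 6

# Hypothetical helper: probability of an event with equally likely outcomes.
prob <- function(event) length(event) / length(outcomes)

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B).
p_either <- prob(even) + prob(high) - prob(intersect(even, high))

# Multiplication rule: P(A and B) = P(A) * P(B given A).
p_high_given_even <- length(intersect(even, high)) / length(even)
p_both <- prob(even) * p_high_given_even

# Complement rule: P(not A) = 1 - P(A).
p_not_even <- 1 - prob(even)

print(c(either = p_either, both = p_both, not_even = p_not_even))
```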
34. Probability Distributions
• Binomial distribution – number of successes in a fixed number of trials, each with one of two outcomes
• Geometric Distribution – number of failures before the first success
• Poisson Distribution – probability that a number of events will occur within a time frame
• Uniform Distribution – every value in a range is equally likely
• Normal Distribution – bell shaped curve
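R ships density and distribution functions for each of these. The parameter values below are arbitrary and chosen only to keep the arithmetic easy to check.

```r
# Binomial: probability of exactly 3 heads in 10 fair coin flips.
p_binom <- dbinom(3, size = 10, prob = 0.5)

# Geometric: probability of exactly 2 failures before the first success.
p_geom <- dgeom(2, prob = 0.5)

# Poisson: probability of 0 events when 2 are expected per time frame.
p_pois <- dpois(0, lambda = 2)

# Uniform: density is flat across the range 0 to 1.
d_unif <- dunif(0.3, min = 0, max = 1)

# Normal: probability of a standard normal value below 1.96.
p_norm <- pnorm(1.96)

print(c(binom = p_binom, geom = p_geom, pois = p_pois,
        unif = d_unif, norm = p_norm))
```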
36. Linear Regression
• We use sample data to work out the
strength and direction of a relationship
between two variables.
37. Linear Regression
• The formula works out the
• X: predictor variable, also known as the
independent variable
• Y: response variable, also known as the
dependent variable
• lm(y ~ x, data = dataframe)
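A runnable version of the call, as a sketch. It assumes R's built-in `cars` dataset stands in for the data frame, with `speed` as the predictor (x) and `dist`, the stopping distance, as the response (y). Note that the function name is lowercase `lm`, since R is case-sensitive.

```r
# Fit stopping distance (response) against speed (predictor).
fit <- lm(dist ~ speed, data = cars)

# Intercept and slope of the fitted line.
print(coef(fit))

# The sign of the slope gives the direction of the relationship.
slope <- unname(coef(fit)["speed"])
print(slope > 0)

# Full output: coefficients, t-tests, R-squared and the F-test.
print(summary(fit))
```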
39. What tools do we have in R?
• In data wrangling, what are the main tasks?
– Filtering rows
– Selecting columns of data
– Adding new variables
– Sorting
– Aggregating
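Each of these tasks has a base R equivalent. The sketch below uses the built-in `mtcars` dataset as a stand-in; the slides do not say which tools or data were demonstrated, so treat the choices here as assumptions.

```r
cars_df <- mtcars  # built-in dataset used as a stand-in

# Filtering rows: keep only four-cylinder cars.
filtered <- cars_df[cars_df$cyl == 4, ]

# Selecting columns of data.
selected <- filtered[, c("mpg", "wt", "hp")]

# Adding new variables: wt is in units of 1000 lb, so convert to kg.
selected$wt_kg <- selected$wt * 1000 * 0.4536

# Sorting: highest mpg first.
sorted <- selected[order(-selected$mpg), ]

# Aggregating: mean mpg for each cylinder count.
mean_mpg_by_cyl <- aggregate(mpg ~ cyl, data = cars_df, FUN = mean)

print(head(sorted, 3))
print(mean_mpg_by_cyl)
```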
40. What tools do we have in R?
• 80% of your time will be spent preparing
and wrangling data
• The remainder of your time will be spent
complaining about it.
43. Multiple Regression
In simple linear regression, a criterion variable is
predicted from one predictor variable.
In multiple regression, the criterion is predicted by two
or more variables.
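In R the only change from simple to multiple regression is the extra terms in the formula. A minimal sketch, assuming the built-in `mtcars` dataset with `mpg` as the criterion and `wt` and `hp` as predictors:

```r
# Simple linear regression: one predictor.
simple_fit <- lm(mpg ~ wt, data = mtcars)

# Multiple regression: two predictors.
multi_fit <- lm(mpg ~ wt + hp, data = mtcars)

print(coef(multi_fit))

# R-squared never decreases when a predictor is added.
r2_simple <- summary(simple_fit)$r.squared
r2_multi  <- summary(multi_fit)$r.squared
print(c(simple = r2_simple, multiple = r2_multi))
```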
49. P value
• Compare the p-value for the F-test to
your significance level.
– If the p-value is less than the significance
level, your sample data provide sufficient
evidence to conclude that your regression
model fits the data better than the model with
no independent variables.
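`summary()` prints this p-value but does not store it directly, so it has to be rebuilt from the F statistic. The sketch below assumes the built-in `mtcars` dataset and a significance level of 0.05.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Named vector with the F value and its two degrees of freedom.
f <- summary(fit)$fstatistic

# Upper-tail probability of the F distribution gives the p-value.
f_p_value <- unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))
print(f_p_value)

# Compare with the chosen significance level.
alpha <- 0.05
better_than_null <- f_p_value < alpha
print(better_than_null)
```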
50. F-Test
• An F statistic is a
value you get when
you run an ANOVA
test or a regression
analysis to find out if
the means between
two populations are
significantly different.
51. F-Test
• A t-test will tell you if a single variable is statistically significant and an F-test will tell you if a group of variables are jointly significant.
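Both kinds of test can be read off a fitted model in R. This sketch assumes the built-in `mtcars` dataset; the joint test compares the model against one with no predictors at all.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# t-tests: one p-value per individual variable.
coef_table <- summary(fit)$coefficients
t_p_values <- coef_table[c("wt", "hp"), "Pr(>|t|)"]
print(t_p_values)

# F-test: are wt and hp jointly significant?
null_fit   <- lm(mpg ~ 1, data = mtcars)  # intercept-only model
joint_test <- anova(null_fit, fit)
joint_p_value <- joint_test[2, "Pr(>F)"]
print(joint_p_value)
```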
52. The F-Test
• If none of the variables are significant, then the overall F-test is usually not significant either.
– It’s an early check: if it fails, you can throw the model out.
• The F-Test can show if the variables are jointly significant
• The F-test assesses the combined predictive power of all variables
53. RMSE
• RMSE (root mean squared error) measures how accurately the model predicts the response.
• It is the most important criterion for model fit if the main purpose of the model is prediction.
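RMSE is the square root of the mean squared residual. A minimal sketch, assuming the built-in `cars` dataset; it measures error on the same data the model was fitted to, which tends to flatter the model compared with held-out data.

```r
fit <- lm(dist ~ speed, data = cars)

# Root mean squared error, in the units of the response.
rmse <- sqrt(mean(residuals(fit)^2))
print(rmse)

# For comparison: the residual standard error from summary(),
# which divides by the residual degrees of freedom instead of n.
print(summary(fit)$sigma)
```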
54. Model validation - probability
• Most of the model
validation centers around
the residuals (essentially
the distance of the data
points from the fitted
regression line)
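Residuals are easy to pull out of a fitted model. A short sketch, assuming the built-in `cars` dataset:

```r
fit <- lm(dist ~ speed, data = cars)

# Residual = observed response minus fitted value.
res        <- residuals(fit)
manual_res <- cars$dist - fitted(fit)

# With an intercept in the model, the residuals sum to roughly zero.
print(sum(res))
print(summary(res))
```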
55. Model validation – Q-Q
• Quantile-Quantile plots
help evaluate the fit of
sample data to the
normal distribution. Is the
data close to being
normally distributed, or
are there a lot of outliers,
for example?
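A Q-Q plot can be drawn with `qqnorm()` and `qqline()`. The sketch below assumes the built-in `cars` dataset and draws the plot only in an interactive session; the correlation between the two sets of quantiles is added as a rough numeric summary of how straight the plot is, and is not something the slides use.

```r
fit <- lm(dist ~ speed, data = cars)
res <- rstandard(fit)  # standardized residuals

# Theoretical normal quantiles (x) against sample quantiles (y).
qq <- qqnorm(res, plot.it = FALSE)

# Points close to a straight line suggest approximate normality.
qq_cor <- cor(qq$x, qq$y)
print(qq_cor)

if (interactive()) {
  qqnorm(res)
  qqline(res)
}
```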
56. How do you interpret the results?
Scale-Location Plot
• The scale-location plot in the upper right shows the square root of the absolute standardized residuals (sort of a square root of relative error) as a function of the fitted values.
• We hope to see no obvious trend in this plot.
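The quantities behind this plot can be computed directly. A sketch assuming the built-in `cars` dataset; `plot(fit, which = 3)` draws the same plot in an interactive session.

```r
fit <- lm(dist ~ speed, data = cars)

# Vertical axis: square root of the absolute standardized residuals.
scale_loc <- sqrt(abs(rstandard(fit)))

# Horizontal axis: the fitted values.
fitted_values <- fitted(fit)

# A rough numeric check for a trend; values near 0 are reassuring.
trend <- cor(fitted_values, scale_loc)
print(trend)

if (interactive()) plot(fit, which = 3)
```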
57. How do you interpret the results?
Importance of each Point
• Cook’s Distance
– Measure of the importance
of each observation to the
regression
– Distances larger than 1 are
suspicious
– Outlier
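R computes Cook's distance with `cooks.distance()`. A minimal sketch, assuming the built-in `cars` dataset and the threshold of 1 from the slide:

```r
fit <- lm(dist ~ speed, data = cars)

# One distance per observation.
cooks <- cooks.distance(fit)

# Observations above the slide's threshold of 1.
suspicious <- which(cooks > 1)
print(suspicious)

# The single most influential observation, whatever its size.
most_influential <- names(which.max(cooks))
print(most_influential)
print(max(cooks))

if (interactive()) plot(fit, which = 4)
```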