R and AI: what do the numbers mean?
Level: Intermediate
Jen Stirrup
• Boutique consultancy owner of Data Relish
• Postgraduate degrees in Artificial Intelligence and Cognitive Science
• Twenty-year career in industry
• Author
JenStirrup.com
DataRelish.com
DataRelish.com
Get in touch!
• http://bit.ly/JenStirrupRD
• http://bit.ly/JenStirrupLinkedIn
• http://bit.ly/JenStirrupMVP
• http://bit.ly/JenStirrupTwitter
Let your data surprise you!
AutoML
How do you know if your results are correct?
AutoML Demo
What does Anscombe’s
Quartet look like?
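A minimal sketch in R of why the quartet matters, assuming the `anscombe` data frame that ships with base R (columns x1..x4 and y1..y4). The summary numbers for the four sets are nearly identical; only the plots reveal how different the data are. The name `quartet_summary` is illustrative.

```r
# Anscombe's quartet ships with R as the data frame `anscombe`.
data(anscombe)

# Summarise each of the four x/y pairs with the same statistics.
quartet_summary <- sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y), correlation = cor(x, y))
})
colnames(quartet_summary) <- paste("set", 1:4)
print(round(quartet_summary, 2))

# The numbers agree closely; the four scatter plots do not.
old_par <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
}
par(old_par)
```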
Looks good, doesn’t it?
So, is it correct?
Correlation r = 0.96
Bedsheet deaths: number of people who died by becoming tangled in their bedsheets (US, CDC)
Skiing revenue: total revenue generated by skiing facilities, dollars in millions (US, US Census)

Year   Bedsheet deaths   Skiing revenue
2000   327               1,551
2001   456               1,635
2002   509               1,801
2003   497               1,827
2004   596               1,956
2005   573               1,989
2006   661               2,178
2007   741               2,257
2008   809               2,476
2009   717               2,438
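The correlation on this slide can be checked in R. This is a small sketch using the values from the table, with illustrative variable names. The printed value may differ from the slide's 0.96 in the last digit, depending on rounding. A high r says nothing about cause.

```r
# Values copied from the slide's table (2000 to 2009).
bedsheet_deaths  <- c(327, 456, 509, 497, 596, 573, 661, 741, 809, 717)
ski_revenue_musd <- c(1551, 1635, 1801, 1827, 1956, 1989, 2178, 2257, 2476, 2438)

# Pearson correlation coefficient between the two series.
r <- cor(bedsheet_deaths, ski_revenue_musd)
print(round(r, 3))

# Plot before trusting the number.
plot(bedsheet_deaths, ski_revenue_musd,
     xlab = "Deaths by bedsheet tangling (US)",
     ylab = "Skiing facility revenue (US, $ millions)")
```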
Why R?
• Most widely used data analysis software: used by 2M+ data scientists, statisticians and analysts
• Most powerful statistical programming language
• Flexible, extensible and comprehensive for productivity
• Create beautiful and unique data visualisations, as seen in the New York Times, Twitter and Flowing Data
• Thriving open-source community at the leading edge of analytics research
• Fills the talent gap: new graduates prefer R
What are we testing?
• We have one or two samples and a
hypothesis, which may be true or false.
• The NULL hypothesis – nothing happened.
• The Alternative hypothesis – something
did happen.
Strategy
• We set out to find evidence that something did happen, by testing whether we can reject the null hypothesis.
• We look at the distribution of the data.
• We choose a test statistic.
• We look at the p-value.
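The strategy above can be sketched in R as follows. The data are simulated purely for illustration, and the null value of 0 and the 5% significance level are assumptions, not part of the talk.

```r
set.seed(42)  # simulated data, for illustration only

# Hypothetical sample: 30 measurements drawn with a true mean of 0.5.
x <- rnorm(30, mean = 0.5, sd = 1)

# 1. Look at the distribution of the data.
hist(x, main = "Simulated sample")

# 2. Choose a test statistic: a one-sample t-test of H0: mean = 0.
result <- t.test(x, mu = 0)

# 3. Look at the p-value and compare it with the significance level.
alpha <- 0.05
print(result$statistic)
print(result$p.value)
reject_null <- result$p.value < alpha
print(reject_null)
```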
What do I need to install?
• Install R – www.r-project.org
• Install RStudio – www.rstudio.com
• AzureML
• AutoML
“Every American should have above average income, and my Administration is going to see they get it.” (Bill Clinton on campaign trail)
“It’s clearly a budget. It’s got lots of numbers in it.” (George W. Bush)
The Guinness Overall
Enjoyment Score
William Sealy Gosset
What does the t-test give us?
• The t-test helps us to work out whether the means of two sets of data are actually different.
• It takes two sets of data, and calculates the mean, the variance and the standard deviation of each.
• Then it uses those summaries to test whether the means of the two populations are different.
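A sketch of those two steps in R, on simulated scores (not data from the Guinness study). The group names and parameters are hypothetical. It builds Welch's t statistic by hand from the means and variances, then compares it with `t.test()`, which uses the Welch (unequal variance) form by default.

```r
set.seed(1)  # simulated scores, not real study data
group_a <- rnorm(40, mean = 70, sd = 8)   # hypothetical group A
group_b <- rnorm(40, mean = 60, sd = 10)  # hypothetical group B

# Step 1: mean, variance and standard deviation of each group.
stats <- sapply(list(a = group_a, b = group_b),
                function(g) c(mean = mean(g), var = var(g), sd = sd(g)))
print(round(stats, 2))

# Step 2: Welch's t statistic, built from those summaries.
se <- sqrt(var(group_a) / length(group_a) + var(group_b) / length(group_b))
t_by_hand <- (mean(group_a) - mean(group_b)) / se

# The built-in test should report the same statistic, plus a p-value.
fit <- t.test(group_a, group_b)
print(c(by_hand = t_by_hand, t_test = unname(fit$statistic)))
print(fit$p.value)
```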
Enter the t-test
• The t-test: a simple way of establishing whether there are significant differences between two groups of data.
• The lower the p-value, the stronger the evidence that there is a difference between the two groups.
• We want the p-value to be less than 5% (0.05) to show a difference between two groups.
[Chart: sample size, mean and standard deviation, Ireland vs. elsewhere]
The Results!
• Using the averages, researchers
concluded that Guinness served in Ireland
is significantly better than pints served
elsewhere.
Summary
• The t-test is a valuable tool for showing
differences or similarities between groups.
• It has been used here to identify whether
Guinness is better in Ireland or outside of
Ireland.
Business and Statistics?
Why?
• Statistical analysis is used widely in
businesses
• Marketing – customer classification,
spending patterns
• Management consulting – efficient use of
resources
Statistically Significant
• If you have a statistically significant result, it means that your results are unlikely to have happened by chance alone.
• If you don’t have statistically significant results, you throw your test result out (as it doesn’t show anything!); in other words, you can’t reject the null hypothesis.
Numerical Measures – what is
interesting?
• Centre of the data
• Spread of the data
Measures of Central
Tendency
• Mean – this is the average
• Median – splits the data in two halves
• Mode – the most frequently occurring value
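A small sketch of the three measures in R, on made-up numbers. Base R's `mode()` reports an object's storage type rather than the statistical mode, so the helper `stat_mode` below is a hypothetical name written for this example.

```r
x <- c(2, 3, 3, 5, 7, 3, 9, 5)  # made-up values

x_mean   <- mean(x)    # arithmetic average
x_median <- median(x)  # middle of the sorted data

# Statistical mode: the value(s) with the highest count.
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}
x_mode <- stat_mode(x)

print(c(mean = x_mean, median = x_median, mode = x_mode))
```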
Measures of Dispersion
• Variance – average squared difference
between the data points and the mean
• Standard Deviation – square root of the
variance, more intuitive
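A short sketch in R on made-up numbers. One detail worth knowing: `var()` and `sd()` use the sample formula, which divides by n - 1 rather than n, so they differ slightly from the plain average of squared differences.

```r
x <- c(4, 8, 6, 5, 3, 7)  # made-up values
n <- length(x)

# Plain average of squared differences from the mean (divides by n).
pop_var <- mean((x - mean(x))^2)

# R's built-ins use the sample formula (divides by n - 1).
sample_var <- var(x)
sample_sd  <- sd(x)

print(c(pop_var = pop_var, sample_var = sample_var, sample_sd = sample_sd))
```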
Measures of Dispersion
• Percentiles – dataset is divided into 100
equal parts
• Quartiles – dataset is divided into four
equal parts
• Interquartile range – middle 50% of data
points
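These measures map onto `quantile()` and `IQR()` in R. A sketch using the built-in `mtcars` data (miles per gallon), chosen here only as a convenient example:

```r
x <- mtcars$mpg  # built-in example data

# Selected percentiles and the three quartile cut points.
percentiles <- quantile(x, probs = c(0.10, 0.50, 0.90))
quartiles   <- quantile(x, probs = c(0.25, 0.50, 0.75))

# Interquartile range: width of the middle 50% of the data.
iqr_value <- IQR(x)

print(percentiles)
print(quartiles)
print(iqr_value)
```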
Measures of Association
• Covariance – how variables vary together,
rise together, fall together
• Correlation – very similar, shown between
-1 and 1
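A sketch in R of how the two relate, using the built-in `mtcars` data as an assumed example. Correlation is covariance rescaled by the two standard deviations, which is what confines it to the range -1 to 1.

```r
x <- mtcars$wt   # car weight
y <- mtcars$mpg  # miles per gallon

covariance  <- cov(x, y)  # sign shows direction; size depends on units
correlation <- cor(x, y)  # unit-free, always between -1 and 1

# Correlation is covariance divided by the product of the standard deviations.
rescaled <- covariance / (sd(x) * sd(y))

print(c(covariance = covariance, correlation = correlation, rescaled = rescaled))
```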
Measuring Uncertainty
• Probability is based on SETS, which we
use in SQL
• We determine the probability of outcomes:
– Addition Rule
– Multiplication Rule
– Complement Rule
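The three rules can be checked with R's set functions (`union`, `intersect`, `setdiff`). A sketch assuming one roll of a fair six-sided die; the events A and B and the helper `prob` are made up for illustration.

```r
outcomes <- 1:6  # sample space: one roll of a fair die

# Probability of an event = size of the event / size of the sample space.
prob <- function(event) length(event) / length(outcomes)

A <- c(2, 4, 6)  # roll is even
B <- c(5, 6)     # roll is greater than 4

# Addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_union <- prob(A) + prob(B) - prob(intersect(A, B))

# Multiplication rule: P(A and B) = P(A) * P(B given A)
p_b_given_a <- length(intersect(A, B)) / length(A)
p_both <- prob(A) * p_b_given_a

# Complement rule: P(not A) = 1 - P(A)
p_not_a <- 1 - prob(A)

print(c(union = p_union, both = p_both, not_a = p_not_a))
```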
Probability Distributions
• Binomial Distribution – number of successes in a fixed number of trials, each with one of two outcomes
• Geometric Distribution – number of failures before the first success
• Poisson Distribution – probability that a number of events will occur within a time frame
• Uniform Distribution – evenly distributed variables
• Normal Distribution – bell-shaped curve
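Each distribution has matching functions in base R (`d` for density or mass, `p` for cumulative probability). A brief sketch; the parameter values are arbitrary examples.

```r
# Binomial: P(exactly 3 heads in 10 fair coin flips)
p_binom <- dbinom(3, size = 10, prob = 0.5)

# Geometric: P(2 failures before the first success), success chance 0.2
p_geom <- dgeom(2, prob = 0.2)

# Poisson: P(exactly 4 events in a period that averages 2 events)
p_pois <- dpois(4, lambda = 2)

# Uniform on [0, 1]: P(value <= 0.25)
p_unif <- punif(0.25, min = 0, max = 1)

# Standard normal: P(value <= 1.96)
p_norm <- pnorm(1.96)

print(c(binomial = p_binom, geometric = p_geom, poisson = p_pois,
        uniform = p_unif, normal = p_norm))
```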
Statistical Inference
• Process of drawing conclusions about a population from randomly drawn samples
Linear Regression
• We use sample data to work out the
strength and direction of a relationship
between two variables.
Linear Regression
• The formula works out the relationship between X and Y
• X: predictor variable, also known as the independent variable
• Y: response variable, also known as the dependent variable
• lm(y ~ x, data = dataframe)
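A runnable instance of that call, assuming the built-in `cars` data (stopping distance `dist` against `speed`) as a stand-in for your own data frame. Note that R is case-sensitive, so the function is `lm`, not `Lm`.

```r
# Response on the left of ~, predictor on the right.
fit <- lm(dist ~ speed, data = cars)

# Intercept and slope of the fitted line.
print(coef(fit))

# Fuller output: standard errors, t-tests, R-squared and the F-test.
print(summary(fit))

# Check the plot: raw data with the fitted line drawn on top.
plot(dist ~ speed, data = cars)
abline(fit)
```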
First Impressions?
• How do you go about it?
• Check the plot first; how does it look?
What tools do we have in R?
• In data wrangling, what are the main tasks?
– Filtering rows
– Selecting columns of data
– Adding new variables
– Sorting
– Aggregating
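The five tasks can be sketched in base R as below, using the built-in `mtcars` data as an assumed example. Packages such as dplyr offer equivalent verbs, but base R is used here so the example has no dependencies.

```r
# Filtering rows: keep cars that do more than 25 miles per gallon.
efficient <- subset(mtcars, mpg > 25)

# Selecting columns of data.
selected <- mtcars[, c("mpg", "cyl", "wt")]

# Adding new variables: wt is in 1000 lbs, so convert to kilograms.
with_kg <- transform(selected, wt_kg = wt * 1000 * 0.4536)

# Sorting: highest mpg first.
sorted <- with_kg[order(-with_kg$mpg), ]

# Aggregating: mean mpg per number of cylinders.
by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

print(head(sorted, 3))
print(by_cyl)
```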
What tools do we have in R?
• 80% of your time will be spent preparing
and wrangling data
• The remainder of your time will be spent
complaining about it.
Plotted Example Data
Multiple Regression
In simple linear regression, a criterion variable is
predicted from one predictor variable.
In multiple regression, the criterion is predicted by two
or more variables.
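In R the only change is the right-hand side of the formula. A sketch using the built-in `mtcars` data, with `wt` and `hp` picked as example predictors:

```r
# Simple regression: one predictor.
simple <- lm(mpg ~ wt, data = mtcars)

# Multiple regression: two predictors joined with +.
multiple <- lm(mpg ~ wt + hp, data = mtcars)

print(coef(multiple))

# R-squared never decreases when a predictor is added to the model.
print(c(simple = summary(simple)$r.squared,
        multiple = summary(multiple)$r.squared))
```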
Residuals
Interpreting our Results
Evaluate Model
• Receiver Operating Characteristic (ROC) curves
• Precision/Recall curves
• Lift curves
P value
• Compare the p-value for the F-test to
your significance level.
– If the p-value is less than the significance
level, your sample data provide sufficient
evidence to conclude that your regression
model fits the data better than the model with
no independent variables.
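`summary()` prints the p-value for the overall F-test but does not store it directly. One common way to recover it is sketched below, assuming an example model on `mtcars` and a 5% significance level.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)

# fstatistic is a named vector: value, numdf, dendf.
f <- summary(fit)$fstatistic

# Upper-tail probability of the F distribution gives the p-value.
p_value <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

alpha <- 0.05  # assumed significance level
print(unname(p_value))
print(unname(p_value) < alpha)
```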
F-Test
• An F statistic is a
value you get when
you run an ANOVA
test or a regression
analysis to find out if
the means between
two populations are
significantly different.
F-Test
• A t-test will tell you if a single variable is statistically significant, and an F-test will tell you if a group of variables is jointly significant.
The F-Test
• If none of the variables are significant, then the overall F-test is usually not significant either.
– It’s an early test so you can throw the model out.
• The F-test can show if the variables are jointly significant.
• The F-test assesses the combined predictive power of all variables.
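Joint significance can be tested in R by comparing two nested models with `anova()`. A sketch assuming `mtcars`, asking whether `hp` and `qsec` together add anything beyond `wt`. The choice of variables is illustrative only.

```r
reduced <- lm(mpg ~ wt, data = mtcars)
full    <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# F-test: do hp and qsec jointly improve on the reduced model?
comparison <- anova(reduced, full)
print(comparison)
joint_p <- comparison[["Pr(>F)"]][2]

# t-tests: is each variable significant on its own?
t_tests <- coef(summary(full))
print(t_tests)
```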
RMSE
• RMSE measures how accurately the
model predicts the response.
• It is the most important criterion for model
fit if the main purpose of the model is
prediction.
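RMSE is the square root of the mean squared residual. A sketch on the built-in `cars` data; this is the in-sample RMSE, which tends to be optimistic compared with RMSE on held-out data.

```r
fit <- lm(dist ~ speed, data = cars)

# Root mean squared error: sqrt(mean((observed - predicted)^2)).
rmse <- sqrt(mean(residuals(fit)^2))
print(rmse)

# Not the same as the residual standard error, which divides by the
# residual degrees of freedom instead of n.
print(summary(fit)$sigma)
```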
Model validation - probability
• Most of the model
validation centers around
the residuals (essentially
the distance of the data
points from the fitted
regression line)
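Residuals are easy to pull out of a fitted model. A sketch assuming the `cars` example model:

```r
fit <- lm(dist ~ speed, data = cars)

# Residual = observed value minus fitted value.
res <- residuals(fit)

# Residuals against fitted values: look for curves or funnel shapes.
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```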
Model validation – Q-Q
• Quantile-Quantile plots
help evaluate the fit of
sample data to the
normal distribution. Is the
data close to being
normally distributed, or
are there a lot of outliers,
for example?
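A Q-Q plot of the residuals takes two calls in base R. A sketch assuming the `cars` example model; points that stray far from the reference line suggest non-normal residuals or outliers.

```r
fit <- lm(dist ~ speed, data = cars)
std_res <- rstandard(fit)  # standardized residuals

# Sample quantiles against theoretical normal quantiles.
qq <- qqnorm(std_res, main = "Normal Q-Q plot of standardized residuals")

# Reference line through the first and third quartiles.
qqline(std_res)
```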
How do you interpret the results?
Scale-Location Plot
• The scale-location plot in the upper right shows the square root of the absolute standardized residuals (sort of a square root of relative error) as a function of the fitted values.
• We hope not to see an obvious trend in this plot.
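In R, `plot()` on a fitted `lm` object draws this panel when `which = 3`. A sketch assuming the `cars` example model, with the plotted quantity also computed by hand:

```r
fit <- lm(dist ~ speed, data = cars)

# Panel 3 of the standard lm diagnostics is the scale-location plot.
plot(fit, which = 3)

# The vertical axis: square root of the absolute standardized residuals.
scale_loc <- sqrt(abs(rstandard(fit)))
print(head(scale_loc))
```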
How do you interpret the
results?
Importance of each Point
• Cook’s Distance
– Measure of the importance
of each observation to the
regression
– Distances larger than 1 are
suspicious
– Outlier
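Cook's distance is available through `cooks.distance()`. A sketch assuming the `cars` example model, flagging observations above the threshold of 1 mentioned on the slide:

```r
fit <- lm(dist ~ speed, data = cars)

# One distance per observation.
cooks <- cooks.distance(fit)

# Observations above the rule-of-thumb threshold of 1.
suspicious <- which(cooks > 1)
print(suspicious)

# The most influential observations, largest first.
print(head(sort(cooks, decreasing = TRUE), 3))

# Panel 4 of the lm diagnostics plots Cook's distance per observation.
plot(fit, which = 4)
```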
Thank you!
@jenstirrup
References and Thanks
• R and Data Mining: Examples and Case
Studies by Yanchang Zhao
