Distinction between outliers and influential data points
In this section, we learn the distinction between outliers and influential data points. In short:
An outlier is a data point whose response y does not follow the general trend of the rest
of the data.
A data point is influential if it unduly influences any part of a regression analysis, such
as the predicted responses, the estimated slope coefficients, or the hypothesis test results.
Note that — for our purposes — we consider a data point to be an outlier only if it is extreme
with respect to the other y values, not the x values.
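As a rough illustration of this definition, the Python sketch below measures how far a point's response sits from the line fitted to the rest of the data. The function name `gap_from_trend_of_rest` and the data are hypothetical, made up for illustration; they are not the data sets plotted in the examples.

```python
import numpy as np

def gap_from_trend_of_rest(x, y, index):
    """Vertical distance between point `index` and the least-squares line
    fitted to all the other points (positive means above the line)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rest = np.arange(len(x)) != index
    slope, intercept = np.polyfit(x[rest], y[rest], deg=1)
    return y[index] - (intercept + slope * x[index])

# Hypothetical data: ten points lying exactly on y = 2 + 5x.
base_x = np.arange(10.0)
base_y = 2.0 + 5.0 * base_x

# Case 1: an extra point with an ordinary x value but a response far above the trend.
gap_off_trend = gap_from_trend_of_rest(
    np.append(base_x, 4.5), np.append(base_y, 60.0), index=10
)

# Case 2: an extra point with an extreme x value whose response is on the trend.
gap_on_trend = gap_from_trend_of_rest(
    np.append(base_x, 20.0), np.append(base_y, 102.0), index=10
)

print(gap_off_trend, gap_on_trend)
```

Under these assumptions the first gap is large and the second is essentially zero, so only the first point would count as an outlier in the sense used here, even though the second point is far from the others in x.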
One advantage of the case in which we have only one predictor is that we can look at simple
scatter plots in order to identify any outliers and influential data points. Let's take a look at a few
examples that should help to clarify the distinction between the two types of extreme values.

Example #1
Based on the definitions above, do you think the following data set contains any outliers? Or, any
influential data points?

You got it! All of the data points follow the general trend of the rest of the data, so there are no
outliers (that is, in the y direction). And none of the data points appears to unduly
influence the location of the best fitting line.
Example #2
Now, how about this example? Do you think the following data set contains any outliers? Or,
any influential data points? (Wink! Wink!)

Of course! Because the blue data point does not follow the general trend of the rest of the data, it
would be considered an outlier. But, is the blue data point influential? An easy way to determine
if the data point is influential is to find the best fitting line twice: once with the blue data point
included and once with the blue data point excluded. The following plot illustrates the two best
fitting lines:

Wow, it's hard to even tell the two estimated regression equations apart! The dashed line
represents the estimated regression equation with the blue data point included, while the solid
line represents the estimated regression equation with the blue data point excluded. The
slopes of the two lines are very similar: 5.04 and 5.12, respectively.
[Regression output with the blue data point excluded]

In short, the predicted responses, estimated slope coefficients, and hypothesis test results are
barely affected by the inclusion of the blue data point. Therefore, the data point is not deemed
influential. In summary, the blue data point is an outlier, but it is not influential.
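The "fit the line twice" check described in this example can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the helper name `slopes_with_and_without` and the data are hypothetical (they are not the data behind the plot), and the sketch compares slopes only.

```python
import numpy as np

def slopes_with_and_without(x, y, index):
    """Fit the least-squares line twice: once on all points and once with
    point `index` removed. Returns (slope_with, slope_without)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = np.arange(len(x)) != index
    slope_with, _ = np.polyfit(x, y, deg=1)
    slope_without, _ = np.polyfit(x[keep], y[keep], deg=1)
    return slope_with, slope_without

# Hypothetical data: 20 points lying exactly on y = 2 + 5x, plus one extra
# point in the middle of the x range whose response is far above the trend.
line_x = np.linspace(0.0, 10.0, 20)
x = np.append(line_x, 5.0)
y = np.append(2.0 + 5.0 * line_x, 60.0)

slope_with, slope_without = slopes_with_and_without(x, y, index=20)
print(slope_with, slope_without)
```

In this made-up data set the extra point is clearly off the trend, yet the two slopes agree, which mirrors the situation in Example #2: an outlier that does not change the slope of the fitted line.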

Example #3
Now, how about this example? Do you think the following data set contains any outliers? Or,
any influential data points?

In this case, the blue data point does follow the general trend of the rest of the data. Therefore, it
is not deemed an outlier here. But, is the blue data point influential? It certainly appears to be far
removed from the rest of the data — in the x direction. Is that sufficient to make the data point
influential?
The following plot illustrates two best fitting lines — one obtained when the blue data point is
included and one obtained when the blue data point is excluded:

Again, it's hard to even tell the two estimated regression equations apart! The dashed line
represents the estimated regression equation with the blue data point included, while the solid
line represents the estimated regression equation with the blue data point excluded. The
slopes of the two lines are very similar: 4.93 and 5.12, respectively.

[Regression output with the blue data point excluded]

Here, there are hardly any side effects at all of including the blue data point.
In short, the predicted responses, estimated slope coefficients, and hypothesis test results are
barely affected by the inclusion of the blue data point. Therefore, the data point is not deemed
influential. In summary, the blue data point is neither influential nor an outlier.
Example #4
One last example! Do you think the following data set contains any outliers? Or, any influential
data points?

That's right! In this case, the blue data point is most certainly an outlier and influential. The
blue data point does not follow the general trend of the rest of the data. The two best fitting
lines, one obtained when the blue data point is included and one obtained when the blue data
point is excluded, are (not surprisingly) substantially different. The dashed line represents the
estimated regression equation with the blue data point included, while the solid line represents
the estimated regression equation with the blue data point excluded. The existence of the blue
data point substantially reduces the slope of the regression line, dropping it from 5.12 to 3.32.
[Regression output with the blue data point excluded]

Here, the predicted responses and estimated slope coefficients are clearly affected by the
presence of the blue data point. In this case, the blue data point is deemed both influential and an
outlier.
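To put the three comparisons side by side, the short Python sketch below expresses each change in slope as a fraction of the slope obtained without the blue data point. The slopes are the ones reported in Examples #2 to #4; the helper name `relative_change` is our own, and the text itself sets no numerical cutoff for "influential".

```python
# Slopes reported in the text: (with blue point, without blue point).
reported_slopes = {
    "Example #2": (5.04, 5.12),
    "Example #3": (4.93, 5.12),
    "Example #4": (3.32, 5.12),
}

def relative_change(with_point, without_point):
    """Change in slope caused by the point, as a fraction of the slope without it."""
    return (with_point - without_point) / without_point

changes = {name: relative_change(*pair) for name, pair in reported_slopes.items()}
for name, change in changes.items():
    print(f"{name}: {change:+.1%}")
```

The slope drops by roughly 2% and 4% in Examples #2 and #3, but by roughly 35% in Example #4, which is why only the last point is deemed influential.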

Summary
The above examples, through the use of simple plots, have highlighted the distinction
between outliers and influential data points. We have seen an example in which a data point was
an outlier but not influential. That is, not every outlier strongly influences the regression
analysis. It is your job as a regression analyst to always determine whether your regression
analysis is unduly influenced by one or a few data points.
Of course, the easy situation occurs for simple linear regression, when we can rely on simple
scatter plots to elucidate matters. Unfortunately, we don't have that luxury in the case of multiple
linear regression. In that situation, we have to rely on various measures to help us determine
whether a data point is an outlier, influential, or both.
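When there are several predictors and no single scatter plot to look at, one possible starting point is to repeat the fit-twice comparison for every observation in turn. The Python sketch below does that. It is our own illustration, not one of the measures the text alludes to: the function name `leave_one_out_coefficient_shifts` and the data are hypothetical.

```python
from itertools import product

import numpy as np

def leave_one_out_coefficient_shifts(X, y):
    """For each observation, refit without it and return the largest absolute
    change in any coefficient (intercept included)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones(len(y)), X])  # prepend intercept column
    full_fit, *_ = np.linalg.lstsq(design, y, rcond=None)
    shifts = np.empty(len(y))
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        fit_i, *_ = np.linalg.lstsq(design[keep], y[keep], rcond=None)
        shifts[i] = np.max(np.abs(fit_i - full_fit))
    return shifts

# Hypothetical data: two predictors on a 3 x 3 grid, with y = 1 + 2*x1 - 3*x2
# exactly, except that the last observation's response is pushed up by 6.
X = np.array(list(product([-1.0, 0.0, 1.0], repeat=2)))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]
y[-1] += 6.0

shifts = leave_one_out_coefficient_shifts(X, y)
most_influential = int(np.argmax(shifts))
print(most_influential, shifts[most_influential])
```

In this made-up data set the largest shift belongs to the last observation, the one whose response was moved. The sketch only extends the fit-twice comparison to more predictors; it does not say which formal measures the text has in mind.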
