Lecture Series on
  Biostatistics
                     No. Bio-Stat_14
                    Date – 25.12.2008



SIMPLE LINEAR REGRESSION AND
         CORRELATION
                           By
           Dr. Bijaya Bhusan Nanda,
       M. Sc (Gold Medalist) Ph. D. (Stat.)
    Topper Orissa Statistics & Economics Services, 1988
            bijayabnanda@yahoo.com
CONTENTS
   Introduction
   The regression model
   The sample regression equation
   Evaluate the regression equation
   Using the regression equation
   The correlation model
   The correlation coefficient
   Some precautions
Introduction
Bivariate relationships
 Chi-square tests of independence determine whether a statistical
   relationship exists between two variables, but they do not tell us
   what the relationship is.
 Regression and correlation analyses show how to determine both the
   nature and the strength of a relationship between two variables.
 The term “regression” was first used as a statistical concept by Sir
   Francis Galton. He used the word regression as the name of the general
   process of predicting one variable (the height of children) from another
   (the height of the parents). Later, statisticians coined the term
   multiple regression to describe the process by which several variables
   are used to predict another.
 In regression analysis we develop an estimating equation, that is, a
   mathematical formula that relates the known variables to the unknown
   variable.
 Correlation analysis determines the degree to which the variables are
   related and tells us how well the estimating equation actually
   describes the relationship.
Bivariate frequency distribution
 In order to summarize the bivariate data, a suitable number of
   classes is taken for each variable, keeping in view the same
   considerations as in the univariate case.
 If there are ‘K’ classes for X and ‘L’ classes for Y, then the data
   can be summarized by using a two-way frequency table having K × L cells.
 By going through the pairs of values of X and Y, the number of
   individuals in each cell can be found out by using the system of tally
   marks. The whole set of cell frequencies then defines a frequency
   distribution called the bivariate frequency distribution.
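A minimal sketch of building such a table (assuming Python with NumPy and pandas is available; the paired data below are made up purely for illustration):

```python
import numpy as np
import pandas as pd

# Made-up paired observations of X and Y
rng = np.random.default_rng(0)
x = rng.normal(50, 10, 200)           # one measured variable
y = 0.5 * x + rng.normal(0, 5, 200)   # a second, correlated variable

# Bin each variable into a suitable number of classes (K for X, L for Y)
x_classes = pd.cut(x, bins=4)   # K = 4 classes for X
y_classes = pd.cut(y, bins=5)   # L = 5 classes for Y

# The K x L table of cell frequencies is the bivariate frequency distribution
bivariate_table = pd.crosstab(x_classes, y_classes)
print(bivariate_table)
```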
Correlation
 Definition:

  Correlation analysis is a statistical technique of
  measuring the degree, direction, causation and
  significance of co-variation between two or more
  variables.
Correlation analysis consists of three steps
(i) Determining whether a relation exists and, if it does,
    measuring it.
(ii) Testing whether it is significant.
(iii) Establishing the cause & effect relation, if any.
        A significant correlation between increase in
    smoking and increase in lung cancer does not prove
    that smoking causes cancer. The proof of a cause &
    effect relationship can be developed only by means of
    an exhaustive study of the operative elements
    themselves.
Significance of the study of correlation
   Most variables are correlated in some way or other &
    correlation analysis helps to measure the degree &
    direction of correlation in one figure.
    Ex:- Age & Height, Blood pressure and pulse rate
   Given the close correlation between two or more
    variables, the value of one can be estimated for a
    known value of the other variable.
   The effect of correlation is to reduce the range of
    uncertainty. The prediction based on correlation
    analysis is likely to be more reliable and near to
    reality.
Correlation may be due to the following reasons.
i.  It may be due to pure chance, especially in a small
    sample.
ii. Both the correlated variables may be influenced by
    one or more other variables.
    Example: Correlation between yield per acre of rice &
    yield per acre of tea may be due to the fact that both
    are dependent upon the amount of rainfall.
iii. Both the variables may be mutually influencing each
    other so that neither can be designated as the cause
    and the other the effect.
    Example: Correlation between demand and supply,
    price & production. The variables in that case
    mutually interact.
iv. Correlation may be due to the fact that one variable is
    the cause & the other variable the effect.
Types of Correlation
   i. Positive or negative
   ii. Simple, partial & multiple
   iii. Linear and non-linear
i. Positive and negative correlation
 Correlation is said to be positive (direct) if both the
   variables vary in the same direction, i.e. if one is
   increasing (decreasing), the other, on an average, is also
   increasing (decreasing).
 It is said to be negative (inverse) if the variables are
   varying in opposite directions, i.e. if one is increasing,
   the other is decreasing, or vice versa.
 Example: In recent years, physicians have used the so-
   called diving reflex to reduce abnormally rapid
   heartbeats in humans by submerging the patient’s
   face in cold water: the colder the water, the greater the
   reduction in heart rate, an inverse (negative) relationship.
Simple, Partial and Multiple Correlation

 Simple correlation is the study of the relationship
   between two variables.
 When three or more variables are studied, it is a
   problem of either multiple or partial correlation.
   In multiple correlation, three or more variables
   are studied simultaneously.
 In partial correlation, three or more variables are
   recognized, but the correlation between two
   variables is studied keeping the effect of the other
   influencing variable(s) constant.
iii) Linear and Non-Linear (Curvilinear) Correlation
 If the amount of change in one variable tends to
   bear a constant ratio to the amount of change in the
   other variable, then the correlation is said to be
   linear.
 If the amount of change in one variable does not
   bear a constant ratio to the amount of change in the
   other variable, then the correlation is non-linear.
 However, since the method of analyzing non-linear
   correlation is very complicated, we generally
   assume a linear relationship between the variables.
Scatter diagram method of studying correlation
  In this method, the given pairs of observations (xi, yi)
  are plotted on graph paper. We get as many dots on
  the graph as there are pairs of observations. By
  studying the scatter of the dots we can form an
  idea about the direction of the correlation and whether
  it is high or low: if the dots are widely scattered, the
  correlation is weak, and vice versa.
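A small illustrative sketch of such a diagram (assuming matplotlib is available; the data are made up):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)             # hypothetical X values
y = 2.0 * x + rng.normal(0, 2.0, 50)   # Y roughly linear in X, plus noise

plt.scatter(x, y)
plt.xlabel("X")
plt.ylabel("Y")
plt.title("Scatter diagram of the paired observations (xi, yi)")
plt.show()
```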
Karl Pearson’s Co-efficient of Correlation
  This is one of the mathematical methods of finding out the
  degree and direction of correlation and is popularly
  known as Pearson’s co-efficient of correlation. It is
  denoted by the symbol ‘r’ and is computed as

        r = [n Σ xi yi − (Σ xi)(Σ yi)] / √{ [n Σ xi² − (Σ xi)²] [n Σ yi² − (Σ yi)²] }
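A minimal sketch of this raw-sum formula (assuming NumPy; the data are illustrative):

```python
import numpy as np

def pearson_r(x, y):
    """Karl Pearson's co-efficient of correlation from the raw-sum formula."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    num = n * np.sum(x * y) - np.sum(x) * np.sum(y)
    den = np.sqrt((n * np.sum(x ** 2) - np.sum(x) ** 2) *
                  (n * np.sum(y ** 2) - np.sum(y) ** 2))
    return num / den

# Illustrative paired data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(pearson_r(x, y))   # agrees with np.corrcoef(x, y)[0, 1]
```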
Assumptions:
   For each value of X there is a normally distributed
    subpopulation of Y values.
   For each value of Y there is a normally distributed
    subpopulation of X values.
   The joint distribution of X and Y is a normal
    distribution called the bivariate normal
    distribution.
   The subpopulations of Y values have the same
    variance.
    The subpopulations of X values have the same
    variance.
Significance test for correlation:
Hypothesis: H0:ρ=0
              HA:ρ≠0
Test statistic:
  Under the null hypothesis (when the null hypothesis is true)

        t = r √[(n − 2) / (1 − r²)]

  follows a Student’s ‘t’ distribution with n − 2 degrees of freedom.
Decision rule:
  If we let α = 0.05 (here with 24 degrees of freedom, i.e. n = 26), the
  critical value of t is ±2.0639.
  If the computed value of ‘t’ is ≥ 2.0639 or ≤ −2.0639, we reject H0;
  otherwise H0 is not rejected.
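A hedged sketch of this test (assuming SciPy is available; r and n below are illustrative values):

```python
import numpy as np
from scipy import stats

def correlation_t_test(r, n, alpha=0.05):
    """Test H0: rho = 0 vs HA: rho != 0 with t = r * sqrt((n - 2) / (1 - r**2))."""
    t = r * np.sqrt((n - 2) / (1 - r ** 2))
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    p_value = 2 * stats.t.sf(abs(t), df=n - 2)
    return t, t_crit, p_value

# Illustrative: r = 0.6 observed in n = 26 pairs (df = 24, critical value 2.0639)
t, t_crit, p = correlation_t_test(0.6, 26)
print(f"t = {t:.3f}, critical value = ±{t_crit:.4f}, p = {p:.4f}")
```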
Hypothesis: H0: ρ = ρ0, where ρ0 is some value other than 0.
            HA: ρ ≠ ρ0
Test statistic:

        z = (zr − zρ) / (1/√(n − 3))

which follows approximately the standard normal distribution, and the
decision rule of the z test is followed. The sample size should be ≥ 25.

        zr = ½ ln[(1 + r) / (1 − r)]

        zρ = ½ ln[(1 + ρ0) / (1 − ρ0)]

Here 1/√(n − 3) is the estimated standard deviation of zr.
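A minimal sketch of this Fisher z test (SciPy assumed; r, ρ0 and n are illustrative):

```python
import numpy as np
from scipy import stats

def fisher_z_test(r, rho0, n):
    """Test H0: rho = rho0 using the Fisher z-transformation (intended for n >= 25)."""
    z_r = 0.5 * np.log((1 + r) / (1 - r))
    z_rho0 = 0.5 * np.log((1 + rho0) / (1 - rho0))
    z = (z_r - z_rho0) / (1 / np.sqrt(n - 3))
    p_value = 2 * stats.norm.sf(abs(z))
    return z, p_value

# Illustrative: observed r = 0.75 with n = 30, testing H0: rho = 0.5
z, p = fisher_z_test(0.75, 0.5, 30)
print(f"z = {z:.3f}, p = {p:.4f}")
```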
If sample size ≥10 but <25, Hotelling procedure
may be used.
Hotelling procedure:
Test statistic:

        Z* = (z* − zρ*) / (1/√(n − 1))

where

        z*  = zr − (3 zr + r) / (4n)

        zρ* = zρ − (3 zρ + ρ0) / (4n)
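Under the reconstruction above, a sketch of the Hotelling-corrected statistic (NumPy/SciPy assumed; the inputs are illustrative):

```python
import numpy as np
from scipy import stats

def hotelling_z_test(r, rho0, n):
    """Hotelling's modification of the Fisher z test, for roughly 10 <= n < 25."""
    z_r = 0.5 * np.log((1 + r) / (1 - r))
    z_rho0 = 0.5 * np.log((1 + rho0) / (1 - rho0))
    z_star = z_r - (3 * z_r + r) / (4 * n)
    zrho_star = z_rho0 - (3 * z_rho0 + rho0) / (4 * n)
    Z = (z_star - zrho_star) / (1 / np.sqrt(n - 1))
    p_value = 2 * stats.norm.sf(abs(Z))
    return Z, p_value

# Illustrative: r = 0.7 with n = 15, testing H0: rho = 0.4
Z, p = hotelling_z_test(0.7, 0.4, 15)
print(f"Z* = {Z:.3f}, p = {p:.4f}")
```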
When these assumptions cannot be made, Spearman’s rank
correlation co-efficient is used.
Co-efficient of Determination
 The co-efficient of determination = r². It indicates that the
  proportion r² (i.e. 100·r² percent) of the variation in the dependent
  variable has been explained by the independent variable. The co-
  efficient of determination may assume the maximum value 1. It is
  defined as the ratio of the explained variance to the total variance.
 The ratio of unexplained variance to total variance is
  frequently called the co-efficient of non-determination,
  denoted by K² (= 1 − r²), and its square root K is called the
  co-efficient of alienation.
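For example (illustrative figures): if r = 0.9, then r² = 0.81, so about 81% of the variation in the dependent variable is accounted for by the independent variable; the co-efficient of non-determination is K² = 1 − 0.81 = 0.19, and the co-efficient of alienation is K = √0.19 ≈ 0.44.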
Merits and limitations of the Pearson co-efficient
  This is the most widely used method for measuring the degree of
  relationship. It summarizes in one figure both the degree
  and the direction of correlation.
Limitations
 Karl Pearson’s co-efficient of correlation assumes
  linear relationship regardless of the fact whether that
  assumption is correct or not.
 The value of the co-efficient is unduly affected by the
  extreme items.
 Very often there is a risk of misinterpreting the co-
  efficient; hence great care must be exercised while
  interpreting the co-efficient of correlation.
Interpretation of Co-efficient of Correlation
  The general rules for interpreting the value of r are
  given below.
 r = +1 implies perfect positive relationship between
  the variables.
 The closeness of relationship is not proportional to r.
 r = -1 implies perfect negative relationship
  between the variables.
 r = 0 implies no relationship between the
  variables.
 The closer r is to +1 or –1, the closer the
  relationship between the variables; the closer r
  is to 0, the less close the relationship.
The Simple Regression Model
   Regression: Helpful in ascertaining the probable
    form of the relationship between variables.
   The main objective of regression is to predict or
    estimate the value of one variable corresponding to a
    given value of another variable.
   One type of probabilistic model, the simple linear
    regression model, assumes that the mean
    value of y for a given value of x graphs as a straight
    line and that points deviate about this line of means
    by a random amount equal to e, i.e. y = A + B x + e,
    where A and B are unknown parameters of the
    deterministic (nonrandom) portion of the model.
   If we suppose that the points deviate above or below
    the line of means with expected error E(e) = 0,
    then the mean value of y is E(y) = A + B x.
   Therefore, the mean value of y for a given value of x,
    represented by the symbol E(y), graphs as a straight
    line with y-intercept A and slope B.
A SIMPLE LINEAR REGRESSION
 MODEL
 y = A + B x + e,
 where
 y = dependent or response variable
 x = independent or predictor variable
 e = random error
 A = y-intercept of the line
 B = slope of the line
ASSUMPTIONS REQUIRED FOR A LINEAR
  REGRESSION MODEL
 The mean of the probability distribution of the random
  error is 0, i.e. E(e) = 0. That is, the average of the errors over
  an infinitely long series of experiments is 0 for each
  setting of the independent variable x. This assumption
  states that the mean value of y for a given value of
  x is E(y) = A + B x.
 The variance of the random error is equal to a
  constant, say σ², for all values of x.
 The probability distribution of the random error is
  normal.
 The errors associated with any two different
  observations are independent. That is, the error
  associated with one value of y has no effect on the
  errors associated with other values.
Estimating A and B: the method of least squares
    We want to find estimators of A and B of the regression model
     based on sample data.
    Suppose we have a sample of n data points (x1, y1), (x2,
     y2), ..., (xn, yn). The straight-line model for the response
     y in terms of x is y = A + B x + e.
    The line of means is E(y) = A + B x, and the line fitted
     to the sample data is ŷ = a + bx.
    Thus, ŷ is an estimator of the mean value of y and
     a predictor of some future value of y; a and b are
     estimators of A and B, respectively.
    For a given data point, say the point (xi, yi), the
     observed value of y is yi and the predicted value of y
     would be ŷi = a + b xi,
and the deviation of the ith value of y from its predicted
value is yi − (a + b xi). The sum of squared errors is

        SSE = Σ [yi − (a + b xi)]²   (sum over i = 1, ..., n)

The values of a and b that make SSE minimum are
called the least squares estimators of the population
parameters A and B, and the prediction equation is
called the least squares line.
FORMULAS FOR THE LEAST SQUARES ESTIMATORS
Slope: b = SSxy / SSxx        y-intercept: a = ȳ − b x̄
where
        SSxy = Σ (xi − x̄)(yi − ȳ)        SSxx = Σ (xi − x̄)²
        x̄ = (1/n) Σ xi        ȳ = (1/n) Σ yi        n = sample size
(all sums run over i = 1, ..., n)
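A minimal sketch of these formulas (NumPy assumed; the data are the nicotine/CO values used in the worked example near the end of this lecture):

```python
import numpy as np

def least_squares_fit(x, y):
    """Least squares estimators: b = SSxy / SSxx, a = ybar - b * xbar."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    ss_xy = np.sum((x - x_bar) * (y - y_bar))
    ss_xx = np.sum((x - x_bar) ** 2)
    b = ss_xy / ss_xx
    a = y_bar - b * x_bar
    return a, b

# Nicotine content (x, mg) and CO ranking (y, mg) from the example below
x = [0.2, 0.4, 0.6, 0.8, 1.0]
y = [2, 10, 13, 15, 20]
a, b = least_squares_fit(x, y)
print(f"a = {a:.2f}, b = {b:.2f}")   # b = 20.5, matching the example
```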
Estimating σ 2
 In most practical situations, the variance σ 2 of the
  random error e will be unknown and must be
  estimated from the sample data. Since σ 2 measures
  the variation of the y values about the regression line,
  it seems intuitively reasonable to estimate σ 2 by
  dividing the total error SSE by an appropriate
  number.
ESTIMATION OF σ²

        s² = SSE / (degrees of freedom) = SSE / (n − 2)

where

        SSE = Σ (yi − ŷi)²   (sum over i = 1, ..., n)
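A short self-contained sketch of this estimate (NumPy assumed; the same nicotine/CO data as above):

```python
import numpy as np

x = np.array([0.2, 0.4, 0.6, 0.8, 1.0])   # nicotine content, mg
y = np.array([2, 10, 13, 15, 20])         # CO ranking, mg
n = len(x)

# Least squares fit: b = SSxy / SSxx, a = ybar - b * xbar
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

# Residual sum of squares and the estimate of sigma^2
y_hat = a + b * x
sse = np.sum((y - y_hat) ** 2)
s2 = sse / (n - 2)
print(f"SSE = {sse:.2f}, s^2 = {s2:.2f}, s = {np.sqrt(s2):.2f}")   # s is about 1.82
```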
INTERPRETATION OF s, THE ESTIMATED STANDARD
  DEVIATION OF e
 We expect most of the observed y values to lie within
  2s of their respective least squares predicted values ŷ.
Making inferences about the slope, B
 We consider the probabilistic model y = A + B x + e for the
  relationship between two variables x and y,
  where x is the independent variable, y is the dependent
  variable, A and B are unknown parameters, and e is a
  random error.
 Under the assumptions made on the random error e,
  E(y) = A + B x. This is the population regression line.
 If we are given a sample of n data points (xi, yi), i
  = 1, ..., n, then by the least squares method a straight
  line is fitted to these sample data. This line is the sample
  regression line.
 It is an estimate for the population regression line.
 The theoretical background for making inferences about
  the slope B lies in the following properties of the least
  squares estimator b:
PROPERTIES OF THE LEAST SQUARES ESTIMATOR b:
 b possesses a sampling distribution that is normally
  distributed.
 The mean of the least squares estimator b is B, E(b) = B, that
  is, b is an unbiased estimator for B.
 The standard deviation of the sampling distribution of b is

        σb = σ / √SSxx

  where σ is the standard deviation of the random error e, and

        SSxx = Σ (xi − x̄)²   (sum over i = 1, ..., n)

Since σ is usually unknown, we use its estimator s, and
instead of σb we use its estimate

        sb = s / √SSxx
For testing hypotheses about B first we state null and
  alternative hypotheses:
        H0: B = B0

        Ha: B ≠ B0   (or B < B0, or B > B0)
  where B0 is the hypothesized value of B.
  Often, one tests whether B = 0 or not, that is, whether
  x does or does not contribute information for the
  prediction of y. The setup of our test of the utility of the
  model is summarized below.
A TEST OF MODEL UTILITY

One-tailed test:
  H0: B = 0
  Ha: B < 0   (or B > 0)
  Test statistic:  t = b / sb = b / (s / √SSxx)
  Rejection region:  t < −tα   (or t > tα),
  where tα is based on (n − 2) df.

Two-tailed test:
  H0: B = 0
  Ha: B ≠ 0
  Test statistic:  t = b / sb = b / (s / √SSxx)
  Rejection region:  t < −tα/2  or  t > tα/2,
  where tα/2 is based on (n − 2) df.
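A hedged sketch of the two-tailed utility test (NumPy/SciPy assumed; the data are the nicotine/CO values from the example that follows):

```python
import numpy as np
from scipy import stats

def slope_t_test(x, y, alpha=0.05):
    """Two-tailed test of H0: B = 0 using t = b / (s / sqrt(SSxx))."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    ss_xx = np.sum((x - x.mean()) ** 2)
    b = np.sum((x - x.mean()) * (y - y.mean())) / ss_xx
    a = y.mean() - b * x.mean()
    s = np.sqrt(np.sum((y - (a + b * x)) ** 2) / (n - 2))
    t = b / (s / np.sqrt(ss_xx))
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    return t, t_crit

x = [0.2, 0.4, 0.6, 0.8, 1.0]
y = [2, 10, 13, 15, 20]
t, t_crit = slope_t_test(x, y)
print(f"t = {t:.2f}, reject H0 if |t| > {t_crit:.3f}")   # t about 7.1, cutoff 3.182
```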
Example: To model the relationship between the CO
  (carbon monoxide) ranking, y, and the nicotine content,
  x, of an American-made cigarette, the Federal Trade
  Commission tested a random sample of 5 cigarettes. The
  CO ranking and nicotine content values are given in the
  table below.
        Cigarette    Nicotine content, x (mg)    CO ranking, y (mg)
            1                 0.2                        2
            2                 0.4                       10
            3                 0.6                       13
            4                 0.8                       15
            5                 1.0                       20

Solution: Testing the usefulness of the model requires
  testing the hypothesis
                H0: B = 0
                Ha: B ≠ 0
With n = 5 and α = 0.05, the critical value is based on (5 − 2) = 3 df.
        tα/2 = t0.025 = 3.182

Thus, we will reject H0 if t < −3.182 or t > 3.182.
In order to compute the test statistic we need the values
  of b, s and SSxx: b = 20.5, s = 1.82 and SSxx = 0.4. Hence,
  the test statistic is

        t = b / (s / √SSxx) = 20.5 / (1.82 / √0.4) = 7.12

Since the calculated t-value is greater than the critical
  value t0.025 = 3.182, we reject the null hypothesis and
  conclude that the slope B ≠ 0. At the significance level α
  = 0.05, the sample data provide sufficient evidence to
  conclude that nicotine content does contribute useful
  information for the prediction of carbon monoxide
  ranking using the linear model.
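For a cross-check, a compact sketch that reproduces these figures with SciPy’s built-in routine (assuming scipy.stats.linregress is available):

```python
from scipy.stats import linregress

x = [0.2, 0.4, 0.6, 0.8, 1.0]   # nicotine content, mg
y = [2, 10, 13, 15, 20]         # CO ranking, mg

fit = linregress(x, y)
print(f"slope b = {fit.slope:.2f}")              # 20.5, as in the example
print(f"intercept a = {fit.intercept:.2f}")
print(f"two-sided p-value = {fit.pvalue:.4f}")   # well below 0.05, so reject H0: B = 0
```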
THANK YOU
