# 1667390753_Lind Chapter 10-14.pdf

A reference book on statistics and data analysis: Lind, chapters 10 to 14.


### 1667390753_Lind Chapter 10-14.pdf

1. 1. ©The McGraw-Hill Companies, Inc. 2008 McGraw-Hill/Irwin One Sample Tests of Hypothesis Chapter 10
2. 2. 2 GOALS • Define a hypothesis and hypothesis testing. • Describe the five-step hypothesis-testing procedure. • Distinguish between a one-tailed and a two-tailed test of hypothesis. • Conduct a test of hypothesis about a population mean. • Conduct a test of hypothesis about a population proportion. • Define Type I and Type II errors. • Compute the probability of a Type II error.
3. 3. 3 What is a Hypothesis? A Hypothesis is a statement about the value of a population parameter developed for the purpose of testing. Examples of hypotheses made about a population parameter are: – The mean monthly income for systems analysts is \$3,625. – Twenty percent of all customers at Bovine’s Chop House return for another meal within a month.
4. 4. 4 What is Hypothesis Testing? Hypothesis testing is a procedure, based on sample evidence and probability theory, used to determine whether the hypothesis is a reasonable statement and should not be rejected, or is unreasonable and should be rejected.
5. 5. 5 Hypothesis Testing Steps
6. 6. 6 Important Things to Remember about H0 and H1 • H0: null hypothesis and H1: alternate hypothesis • H0 and H1 are mutually exclusive and collectively exhaustive • H0 is always presumed to be true • H1 has the burden of proof • A random sample (n) is used to “reject H0” • If we conclude “do not reject H0”, this does not necessarily mean that the null hypothesis is true; it only suggests that there is not sufficient evidence to reject H0. Rejecting the null hypothesis, then, suggests that the alternative hypothesis may be true. • Equality is always part of H0 (e.g. “=”, “≥”, “≤”). • “≠”, “<” and “>” are always part of H1
7. 7. 7 How to Set Up a Claim as Hypothesis • In actual practice, the status quo is set up as H0 • If the claim is “boastful”, the claim is set up as H1 (we apply the Missouri rule: “show me”). Remember, H1 has the burden of proof • In problem solving, look for key words and convert them into symbols. Some key words include: “improved, better than, as effective as, different from, has changed, etc.”
8. 8. 8 Left-tail or Right-tail Test?

    | Keywords | Inequality symbol | Part of |
    |---|---|---|
    | Larger (or more) than | > | H1 |
    | Smaller (or less) | < | H1 |
    | No more than | ≤ | H0 |
    | At least | ≥ | H0 |
    | Has increased | > | H1 |
    | Is there difference? | ≠ | H1 |
    | Has not changed | = | H0 |
    | Has “improved”, “is better than”, “is more effective” | See notes below | H1 |

    • The direction of the test involving claims that use the words “has improved”, “is better than”, and the like will depend upon the variable being measured.
    • For instance, if the variable involves time for a certain medication to take effect, the words “better”, “improve” or “more effective” are translated as “<” (less than, i.e. faster relief).
    • On the other hand, if the variable refers to a test score, then the words “better”, “improve” or “more effective” are translated as “>” (greater than, i.e. higher test scores).
9. 9. 9
10. 10. 10 Parts of a Distribution in Hypothesis Testing
11. 11. 11 One-tail vs. Two-tail Test
12. 12. 12 Hypothesis Setups for Testing a Mean (µ)
13. 13. 13 Hypothesis Setups for Testing a Proportion (π)
14. 14. 14 Testing for a Population Mean with a Known Population Standard Deviation- Example Jamestown Steel Company manufactures and assembles desks and other office equipment at several plants in western New York State. The weekly production of the Model A325 desk at the Fredonia Plant follows the normal probability distribution with a mean of 200 and a standard deviation of 16. Recently, because of market expansion, new production methods have been introduced and new employees hired. The vice president of manufacturing would like to investigate whether there has been a change in the weekly production of the Model A325 desk.
15. 15. 15 Testing for a Population Mean with a Known Population Standard Deviation - Example Step 1: State the null hypothesis and the alternate hypothesis. H0: µ = 200 H1: µ ≠ 200 (note: keyword in the problem “has changed”) Step 2: Select the level of significance. α = 0.01 as stated in the problem Step 3: Select the test statistic. Use Z-distribution since σ is known
16. 16. 16 Testing for a Population Mean with a Known Population Standard Deviation - Example Step 4: Formulate the decision rule. Reject H0 if $|Z| > Z_{\alpha/2}$, where $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{203.5 - 200}{16/\sqrt{50}} = 1.55$ and $Z_{\alpha/2} = Z_{.01/2} = 2.58$; 1.55 is not > 2.58. Step 5: Make a decision and interpret the result. Because 1.55 does not fall in the rejection region, H0 is not rejected. We conclude that the population mean is not different from 200. So we would report to the vice president of manufacturing that the sample evidence does not show that the production rate at the Fredonia Plant has changed from 200 per week.
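The arithmetic in Steps 4 and 5 can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function name `one_sample_z` is an illustrative choice, not something from the slides, and the critical value is computed from the standard normal distribution rather than read from a table.

```python
from math import sqrt
from statistics import NormalDist


def one_sample_z(sample_mean, mu0, sigma, n):
    """z statistic for a test about a mean when sigma is known."""
    return (sample_mean - mu0) / (sigma / sqrt(n))


# Values taken from the Jamestown Steel example.
z = one_sample_z(sample_mean=203.5, mu0=200, sigma=16, n=50)

alpha = 0.01
# Two-tailed test: alpha is split between the two tails.
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

reject_h0 = abs(z) > z_crit
print(f"z = {z:.2f}, critical value = {z_crit:.2f}, reject H0: {reject_h0}")
```

The decision matches the slide: the statistic stays inside the two critical values, so H0 is not rejected.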
17. 17. 17 Testing for a Population Mean with a Known Population Standard Deviation - Another Example Suppose in the previous problem the vice president wants to know whether there has been an increase in the number of units assembled. To put it another way, can we conclude, because of the improved production methods, that the mean number of desks assembled in the last 50 weeks was more than 200? Recall: σ = 16, n = 50, α = .01
18. 18. 18 Testing for a Population Mean with a Known Population Standard Deviation - Example Step 1: State the null hypothesis and the alternate hypothesis. H0: µ ≤ 200 H1: µ > 200 (note: keyword in the problem “an increase”) Step 2: Select the level of significance. α = 0.01 as stated in the problem Step 3: Select the test statistic. Use Z-distribution since σ is known
19. 19. 19 Testing for a Population Mean with a Known Population Standard Deviation - Example Step 4: Formulate the decision rule. Reject H0 if Z > Zα Step 5: Make a decision and interpret the result. Because 1.55 does not fall in the rejection region, H0 is not rejected. We conclude that the average number of desks assembled in the last 50 weeks is not more than 200
20. 20. 20 Types of Errors in Hypothesis Testing • Type I Error: – Rejecting the null hypothesis when it is actually true. – Its probability is denoted by the Greek letter “α” – This probability is also known as the significance level of a test • Type II Error: – “Accepting” the null hypothesis when it is actually false. – Its probability is denoted by the Greek letter “β”
21. 21. 21 p-Value in Hypothesis Testing • The p-VALUE is the probability of observing a sample value as extreme as, or more extreme than, the value observed, given that the null hypothesis is true. • In testing a hypothesis, we can also compare the p-value with the significance level (α). • If the p-value < significance level, H0 is rejected; otherwise H0 is not rejected.
22. 22. 22 p-Value in Hypothesis Testing - Example Recall the last problem where the hypothesis and decision rules were set up as: H0: µ ≤ 200 H1: µ > 200 Reject H0 if Z > Zα where Z = 1.55 and Zα = 2.33 Reject H0 if p-value < α; 0.0606 is not < 0.01 Conclude: Fail to reject H0
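The p-value quoted above is the area to the right of z = 1.55 under the standard normal curve. A short sketch with Python's standard library reproduces it; the variable names are illustrative.

```python
from statistics import NormalDist

z = 1.55       # computed test statistic from the example
alpha = 0.01   # significance level from the example

# Right-tailed test: the p-value is P(Z > z) under the standard normal.
p_value = 1 - NormalDist().cdf(z)

reject_h0 = p_value < alpha
print(f"p-value = {p_value:.4f}, reject H0: {reject_h0}")
```

For a two-tailed test the same tail area would be doubled before comparing it with α.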
23. 23. 23 What does it mean when p-value < α? If the p-value is less than: (a) .10, we have some evidence that H0 is not true. (b) .05, we have strong evidence that H0 is not true. (c) .01, we have very strong evidence that H0 is not true. (d) .001, we have extremely strong evidence that H0 is not true.
24. 24. 24 Testing for the Population Mean: Population Standard Deviation Unknown • When the population standard deviation (σ) is unknown, the sample standard deviation (s) is used in its place • The test statistic follows the t-distribution with n − 1 degrees of freedom and is computed using the formula: $t = \frac{\bar{X} - \mu}{s/\sqrt{n}}$
25. 25. 25 Testing for the Population Mean: Population Standard Deviation Unknown - Example The McFarland Insurance Company Claims Department reports the mean cost to process a claim is \$60. An industry comparison showed this amount to be larger than most other insurance companies, so the company instituted cost-cutting measures. To evaluate the effect of the cost-cutting measures, the Supervisor of the Claims Department selected a random sample of 26 claims processed last month. The sample information is reported below. At the .01 significance level, is it reasonable to conclude that the mean cost to process a claim is now less than \$60?
26. 26. 26 Testing for the Population Mean: Population Standard Deviation Unknown - Example Step 1: State the null hypothesis and the alternate hypothesis. H0: µ ≥ \$60 H1: µ < \$60 (note: keyword in the problem “now less than”) Step 2: Select the level of significance. α = 0.01 as stated in the problem Step 3: Select the test statistic. Use t-distribution since σ is unknown
27. 27. 27 t-Distribution Table (portion)
28. 28. 28 Testing for the Population Mean: Population Standard Deviation Unknown – Minitab Solution
29. 29. 29 Testing for the Population Mean: Population Standard Deviation Unknown - Example Step 4: Formulate the decision rule. Reject H0 if t < -tα,n-1 Step 5: Make a decision and interpret the result. Because -1.818 does not fall in the rejection region, H0 is not rejected at the .01 significance level. We have not demonstrated that the cost-cutting measures reduced the mean cost per claim to less than \$60. The difference of \$3.58 (\$56.42 - \$60) between the sample mean and the population mean could be due to sampling error.
30. 30. 30 Testing for a Population Mean with an Unknown Population Standard Deviation - Example The current rate for producing 5 amp fuses at Neary Electric Co. is 250 per hour. A new machine has been purchased and installed that, according to the supplier, will increase the production rate. A sample of 10 randomly selected hours from last month revealed the mean hourly production on the new machine was 256 units, with a sample standard deviation of 6 per hour. At the .05 significance level can Neary conclude that the new machine is faster?
31. 31. 31 Testing for a Population Mean with an Unknown Population Standard Deviation - Example continued Step 1: State the null and the alternate hypothesis. H0: µ ≤ 250; H1: µ > 250 Step 2: Select the level of significance. It is .05. Step 3: Find a test statistic. Use the t distribution because the population standard deviation is not known and the sample size is less than 30.
32. 32. 32 Testing for a Population Mean with an Unknown Population Standard Deviation - Example continued Step 4: State the decision rule. There are 10 – 1 = 9 degrees of freedom. The null hypothesis is rejected if t > 1.833. Step 5: Make a decision and interpret the results. $t = \frac{\bar{X} - \mu}{s/\sqrt{n}} = \frac{256 - 250}{6/\sqrt{10}} = 3.162$ Because 3.162 > 1.833, the null hypothesis is rejected. The mean number produced is more than 250 per hour.
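The Neary Electric computation can be reproduced as follows. This is a sketch, not a full t-test routine: the Python standard library has no t distribution, so the critical value 1.833 is taken from the slide rather than computed, and `one_sample_t` is an illustrative name.

```python
from math import sqrt


def one_sample_t(sample_mean, mu0, s, n):
    """t statistic for a test about a mean when sigma is unknown."""
    return (sample_mean - mu0) / (s / sqrt(n))


# Values from the Neary Electric example.
t = one_sample_t(sample_mean=256, mu0=250, s=6, n=10)

# Critical value quoted on the slide: one-tailed, alpha = .05, 9 df.
t_crit = 1.833

reject_h0 = t > t_crit
print(f"t = {t:.3f}, critical value = {t_crit}, reject H0: {reject_h0}")
```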
33. 33. 33 Tests Concerning Proportion • A Proportion is the fraction or percentage that indicates the part of the population or sample having a particular trait of interest. • The sample proportion is denoted by p and is found by x/n • The test statistic is computed as follows:
34. 34. 34 Assumptions in Testing a Population Proportion using the z-Distribution • A random sample is chosen from the population. • It is assumed that the binomial assumptions discussed in Chapter 6 are met: (1) the sample data collected are the result of counts; (2) the outcome of an experiment is classified into one of two mutually exclusive categories, a “success” or a “failure”; (3) the probability of a success is the same for each trial; and (4) the trials are independent • The test we will conduct shortly is appropriate when both nπ and n(1 - π) are at least 5, where π is the hypothesized population proportion. • When the above conditions are met, the normal distribution can be used as an approximation to the binomial distribution
35. 35. 35 Test Statistic for Testing a Single Population Proportion $z = \frac{p - \pi}{\sqrt{\pi(1 - \pi)/n}}$ where p is the sample proportion, π is the hypothesized population proportion, and n is the sample size.
36. 36. 36 Test Statistic for Testing a Single Population Proportion - Example Suppose prior elections in a certain state indicated it is necessary for a candidate for governor to receive at least 80 percent of the vote in the northern section of the state to be elected. The incumbent governor is interested in assessing his chances of returning to office and plans to conduct a survey of 2,000 registered voters in the northern section of the state. Using the hypothesis-testing procedure, assess the governor’s chances of reelection.
37. 37. 37 Test Statistic for Testing a Single Population Proportion - Example Step 1: State the null hypothesis and the alternate hypothesis. H0: π ≥ .80 H1: π < .80 (note: keyword in the problem “at least”) Step 2: Select the level of significance. α = 0.05 Step 3: Select the test statistic. Use Z-distribution since the assumptions are met and nπ and n(1 - π) ≥ 5
38. 38. 38 Testing for a Population Proportion - Example Step 4: Formulate the decision rule. Reject H0 if Z < -Zα Step 5: Make a decision and interpret the result. The computed value of z (-2.80) is in the rejection region, so the null hypothesis is rejected at the .05 level. The difference of 2.5 percentage points between the sample percent (77.5 percent) and the hypothesized population percent (80) is statistically significant. The evidence at this point does not support the claim that the incumbent governor will return to the governor’s mansion for another four years.
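A short sketch shows where the z value in Step 5 comes from, using the sample proportion of 77.5 percent and the sample size of 2,000 given in the example. The function name `proportion_z` is illustrative, and the critical value is computed from the standard normal distribution.

```python
from math import sqrt
from statistics import NormalDist


def proportion_z(p_sample, pi0, n):
    """z statistic for a one-sample test of a proportion."""
    return (p_sample - pi0) / sqrt(pi0 * (1 - pi0) / n)


n = 2000          # survey size from the example
pi0 = 0.80        # hypothesized population proportion
p_sample = 0.775  # sample proportion reported in Step 5

# Normal approximation is appropriate when both counts are at least 5.
approximation_ok = n * pi0 >= 5 and n * (1 - pi0) >= 5

z = proportion_z(p_sample, pi0, n)

alpha = 0.05
z_crit = NormalDist().inv_cdf(alpha)  # left-tail critical value, about -1.645

reject_h0 = z < z_crit
print(f"z = {z:.2f}, critical value = {z_crit:.3f}, reject H0: {reject_h0}")
```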
39. 39. 39 Type II Error • Recall that a Type I Error is rejecting the null hypothesis when it is actually true. Its probability is the level of significance, denoted by the Greek letter “α”. • A Type II Error is “accepting” the null hypothesis when it is actually false. Its probability is denoted by the Greek letter “β”.
40. 40. 40 Type II Error - Example A manufacturer purchases steel bars to make cotter pins. Past experience indicates that the mean tensile strength of all incoming shipments is 10,000 psi and that the standard deviation, σ, is 400 psi. In order to make a decision about incoming shipments of steel bars, the manufacturer set up this rule for the quality-control inspector to follow: “Take a sample of 100 steel bars. At the .05 significance level if the sample mean strength falls between 9,922 psi and 10,078 psi, accept the lot. Otherwise the lot is to be rejected.”
41. 41. 41 Type I and Type II Errors Illustrated
42. 42. 42 Type II Error Computed
43. 43. 43 Type II Errors For Varying Mean Levels
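The Type II error probability for the steel-bar rule can be sketched as the chance that the sample mean lands inside the acceptance region when the true mean differs from 10,000 psi. The acceptance limits, σ, and n come from the example; the alternative means in the code are hypothetical values chosen only for illustration, since the figures on these slides did not survive extraction.

```python
from math import sqrt
from statistics import NormalDist

SIGMA = 400                 # population standard deviation (psi), from the example
N = 100                     # sample size, from the example
LOWER, UPPER = 9922, 10078  # acceptance region for the sample mean (psi)

std_error = SIGMA / sqrt(N)  # standard error of the sample mean: 40 psi


def prob_accept(true_mean):
    """P(LOWER <= sample mean <= UPPER) when the population mean is true_mean."""
    dist = NormalDist(mu=true_mean, sigma=std_error)
    return dist.cdf(UPPER) - dist.cdf(LOWER)


# Type I error: rejecting the lot although the mean really is 10,000 psi.
alpha = 1 - prob_accept(10_000)

# Type II error for several HYPOTHETICAL true means (illustrative values only).
betas = {mu1: prob_accept(mu1) for mu1 in (9_820, 9_900, 9_940)}

print(f"alpha = {alpha:.4f}")
for mu1, beta in betas.items():
    print(f"true mean {mu1}: beta = {beta:.4f}")
```

The further the true mean lies from 10,000 psi, the smaller β becomes, which is the pattern a table of Type II errors for varying mean levels would show.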
44. 44. 44 End of Chapter 10
45. 45. ©The McGraw-Hill Companies, Inc. 2008 McGraw-Hill/Irwin Two-sample Tests of Hypothesis Chapter 11
46. 46. 2 GOALS • Conduct a test of a hypothesis about the difference between two independent population means. • Conduct a test of a hypothesis about the difference between two population proportions. • Conduct a test of a hypothesis about the mean difference between paired or dependent observations. • Understand the difference between dependent and independent samples.
47. 47. 3 Comparing two populations – Some Examples • Is there a difference in the mean value of residential real estate sold by male agents and female agents in south Florida? • Is there a difference in the mean number of defects produced on the day and the afternoon shifts at Kimble Products? • Is there a difference in the mean number of days absent between young workers (under 21 years of age) and older workers (more than 60 years of age) in the fast-food industry? • Is there a difference in the proportion of Ohio State University graduates and University of Cincinnati graduates who pass the state Certified Public Accountant Examination on their first attempt? • Is there an increase in the production rate if music is piped into the production area?
48. 48. 4 Comparing Two Population Means • No assumptions about the shape of the populations are required. • The samples are from independent populations. • The formula for computing the value of z is: $z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$ (use if sample sizes > 30 or if σ1 and σ2 are known); $z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$ (use if sample sizes > 30 and if σ1 and σ2 are unknown)
49. 49. 5 EXAMPLE 1 The U-Scan facility was recently installed at the Byrne Road Food-Town location. The store manager would like to know if the mean checkout time using the standard checkout method is longer than using the U-Scan. She gathered the following sample information. The time is measured from when the customer enters the line until their bags are in the cart. Hence the time includes both waiting in line and checking out.
50. 50. 6 EXAMPLE 1 continued Step 1: State the null and alternate hypotheses. H0: µS ≤ µU H1: µS > µU Step 2: State the level of significance. The .01 significance level is stated in the problem. Step 3: Find the appropriate test statistic. Because both samples are more than 30, we can use z-distribution as the test statistic.
51. 51. 7 Example 1 continued Step 4: State the decision rule. Reject H0 if Z > Zα, that is, if Z > 2.33
52. 52. 8 Example 1 continued Step 5: Compute the value of z and make a decision $z = \frac{\bar{X}_s - \bar{X}_u}{\sqrt{\frac{\sigma_s^2}{n_s} + \frac{\sigma_u^2}{n_u}}} = \frac{5.5 - 5.3}{\sqrt{\frac{0.40^2}{50} + \frac{0.30^2}{100}}} = \frac{0.2}{0.064} = 3.13$ The computed value of 3.13 is larger than the critical value of 2.33. Our decision is to reject the null hypothesis. The difference of .20 minutes between the mean checkout times using the standard method and the U-Scan is too large to have occurred by chance. We conclude the U-Scan method is faster.
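The same calculation in Python, as a sketch with an illustrative function name. Without intermediate rounding the statistic comes out at about 3.12; the slide's 3.13 results from rounding the denominator to 0.064 first. The decision is the same either way.

```python
from math import sqrt


def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """z statistic for the difference between two independent means."""
    return (mean1 - mean2) / sqrt(sd1**2 / n1 + sd2**2 / n2)


# Standard checkout (s) versus U-Scan (u), values from the example.
z = two_sample_z(mean1=5.5, sd1=0.40, n1=50, mean2=5.3, sd2=0.30, n2=100)

z_crit = 2.33  # one-tailed critical value at alpha = .01, from the slide

reject_h0 = z > z_crit
print(f"z = {z:.2f}, critical value = {z_crit}, reject H0: {reject_h0}")
```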
53. 53. 9 Two-Sample Tests about Proportions Here are several examples. • The vice president of human resources wishes to know whether there is a difference in the proportion of hourly employees who miss more than 5 days of work per year at the Atlanta and the Houston plants. • General Motors is considering a new design for the Pontiac Grand Am. The design is shown to a group of potential buyers under 30 years of age and another group over 60 years of age. Pontiac wishes to know whether there is a difference in the proportion of the two groups who like the new design. • A consultant to the airline industry is investigating the fear of flying among adults. Specifically, the company wishes to know whether there is a difference in the proportion of men versus women who are fearful of flying.
54. 54. 10 Two Sample Tests of Proportions • We investigate whether two samples came from populations with an equal proportion of successes. • The two samples are pooled using the following formula.
55. 55. 11 Two Sample Tests of Proportions continued The value of the test statistic is computed from the following formula.
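The two formulas referred to on these slides did not survive extraction. The standard forms for a pooled two-sample test of proportions are given here as an assumption about what the slides showed, with $X_1, X_2$ the numbers of successes, $n_1, n_2$ the sample sizes, and $p_1 = X_1/n_1$, $p_2 = X_2/n_2$ the sample proportions:

$$
p_c = \frac{X_1 + X_2}{n_1 + n_2},
\qquad
z = \frac{p_1 - p_2}{\sqrt{\dfrac{p_c(1 - p_c)}{n_1} + \dfrac{p_c(1 - p_c)}{n_2}}}
$$

Pooling is justified because the null hypothesis states that both populations share one proportion, so both samples estimate the same quantity.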
56. 56. 12 Two Sample Tests of Proportions - Example Manelli Perfume Company recently developed a new fragrance that it plans to market under the name Heavenly. A number of market studies indicate that Heavenly has very good market potential. The Sales Department at Manelli is particularly interested in whether there is a difference in the proportions of younger and older women who would purchase Heavenly if it were marketed. There are two independent populations, a population consisting of the younger women and a population consisting of the older women. Each sampled woman will be asked to smell Heavenly and indicate whether she likes the fragrance well enough to purchase a bottle.
57. 57. 13 Two Sample Tests of Proportions - Example Step 1: State the null and alternate hypotheses. H0: π1 = π2 H1: π1 ≠ π2 Step 2: State the level of significance. The .05 significance level is stated in the problem. Step 3: Find the appropriate test statistic. We will use the z-distribution
58. 58. 14 Two Sample Tests of Proportions - Example Step 4: State the decision rule. Reject H0 if Z > Zα/2 or Z < -Zα/2, that is, if Z > 1.96 or Z < -1.96
59. 59. 15 Two Sample Tests of Proportions - Example Step 5: Compute the value of z and make a decision The computed value of 2.21 is in the area of rejection. Therefore, the null hypothesis is rejected at the .05 significance level. To put it another way, we reject the null hypothesis that the proportion of young women who would purchase Heavenly is equal to the proportion of older women who would purchase Heavenly.
60. 60. 16 Two Sample Tests of Proportions – Example (Minitab Solution)
61. 61. 17 Comparing Population Means with Unknown Population Standard Deviations (the Pooled t-test) The t distribution is used as the test statistic if one or more of the samples have fewer than 30 observations. The required assumptions are: 1. Both populations must follow the normal distribution. 2. The populations must have equal standard deviations. 3. The samples are from independent populations.
62. 62. 18 Small sample test of means continued Finding the value of the test statistic requires two steps. 1. Pool the sample standard deviations: $s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$ 2. Use the pooled standard deviation in the formula: $t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$
63. 63. 19 Comparing Population Means with Unknown Population Standard Deviations (the Pooled t-test) - Example Owens Lawn Care, Inc., manufactures and assembles lawnmowers that are shipped to dealers throughout the United States and Canada. Two different procedures have been proposed for mounting the engine on the frame of the lawnmower. The question is: Is there a difference in the mean time to mount the engines on the frames of the lawnmowers? The first procedure was developed by longtime Owens employee Herb Welles (designated as procedure 1), and the other procedure was developed by Owens Vice President of Engineering William Atkins (designated as procedure 2). To evaluate the two methods, it was decided to conduct a time and motion study. A sample of five employees was timed using the Welles method and six using the Atkins method. The results, in minutes, are shown on the right. Is there a difference in the mean mounting times? Use the .10 significance level.
64. 64. 20 Comparing Population Means with Unknown Population Standard Deviations (the Pooled t-test) - Example Step 1: State the null and alternate hypotheses. H0: µ1 = µ2 H1: µ1 ≠ µ2 Step 2: State the level of significance. The .10 significance level is stated in the problem. Step 3: Find the appropriate test statistic. Because the population standard deviations are not known but are assumed to be equal, we use the pooled t-test.
65. 65. 21 Comparing Population Means with Unknown Population Standard Deviations (the Pooled t-test) - Example Step 4: State the decision rule. Reject H0 if t > tα/2,n1+n2-2 or t < -tα/2,n1+n2-2; here t > t.05,9 or t < -t.05,9, that is, t > 1.833 or t < -1.833
66. 66. 22 Comparing Population Means with Unknown Population Standard Deviations (the Pooled t-test) - Example Step 5: Compute the value of t and make a decision (a) Calculate the sample standard deviations
67. 67. 23 Comparing Population Means with Unknown Population Standard Deviations (the Pooled t-test) - Example Step 5: Compute the value of t and make a decision The computed value of t is -0.662. The decision is not to reject the null hypothesis, because -0.662 falls in the region between -1.833 and 1.833. We conclude that there is no difference in the mean times to mount the engine on the frame using the two methods.
68. 68. 24 Comparing Population Means with Unknown Population Standard Deviations (the Pooled t-test) - Example
69. 69. 25 Comparing Population Means with Unequal Population Standard Deviations If it is not reasonable to assume the population standard deviations are equal, then we compute the t-statistic $t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$. The sample standard deviations s1 and s2 are used in place of the respective population standard deviations. In addition, the degrees of freedom are adjusted downward by a rather complex approximation formula. The effect is to reduce the number of degrees of freedom in the test, which will require a larger value of the test statistic to reject the null hypothesis.
70. 70. 26 Comparing Population Means with Unequal Population Standard Deviations - Example Personnel in a consumer testing laboratory are evaluating the absorbency of paper towels. They wish to compare a set of store brand towels to a similar group of name brand ones. For each brand they dip a ply of the paper into a tub of fluid, allow the paper to drain back into the vat for two minutes, and then evaluate the amount of liquid the paper has taken up from the vat. A random sample of 9 store brand paper towels absorbed the following amounts of liquid in milliliters. 8 8 3 1 9 7 5 5 12 An independent random sample of 12 name brand towels absorbed the following amounts of liquid in milliliters: 12 11 10 6 8 9 9 10 11 9 8 10 Use the .10 significance level and test if there is a difference in the mean amount of liquid absorbed by the two types of paper towels.
71. 71. 27 Comparing Population Means with Unequal Population Standard Deviations - Example The following dot plot provided by MINITAB shows the variances to be unequal.
72. 72. 28 Comparing Population Means with Unequal Population Standard Deviations - Example Step 1: State the null and alternate hypotheses. H0: µ1 = µ2 H1: µ1 ≠ µ2 Step 2: State the level of significance. The .10 significance level is stated in the problem. Step 3: Find the appropriate test statistic. We will use unequal variances t-test
73. 73. 29 Comparing Population Means with Unequal Population Standard Deviations - Example Step 4: State the decision rule. Reject H0 if t > tα/2,d.f. or t < -tα/2,d.f.; here t > t.05,10 or t < -t.05,10, that is, t > 1.812 or t < -1.812 Step 5: Compute the value of t and make a decision The computed value of t is less than the lower critical value, so our decision is to reject the null hypothesis. We conclude that the mean absorption rate for the two towels is not the same.
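Because the paper-towel data are listed in full, the unequal-variance test can be reproduced directly. The sketch below assumes the usual Welch-Satterthwaite approximation for the adjusted degrees of freedom, which the slides describe only as "a rather complex approximation formula"; the function name is illustrative, and the critical value 1.812 is taken from the slide.

```python
from math import sqrt
from statistics import mean, variance

# Absorbency in milliliters, from the example.
store_brand = [8, 8, 3, 1, 9, 7, 5, 5, 12]
name_brand = [12, 11, 10, 6, 8, 9, 9, 10, 11, 9, 8, 10]


def unequal_variance_t(x, y):
    """Return (t, df) for the unequal-variance two-sample t-test."""
    vx = variance(x) / len(x)  # s1^2 / n1
    vy = variance(y) / len(y)  # s2^2 / n2
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    # Welch-Satterthwaite approximation for the degrees of freedom (assumed).
    df = (vx + vy) ** 2 / (vx**2 / (len(x) - 1) + vy**2 / (len(y) - 1))
    return t, df


t, df = unequal_variance_t(store_brand, name_brand)
df_used = int(df)  # round down, the conservative choice

t_crit = 1.812  # two-tailed, alpha = .10, 10 df, from the slide
reject_h0 = abs(t) > t_crit
print(f"t = {t:.3f}, df = {df:.2f} (use {df_used}), reject H0: {reject_h0}")
```

Rounding the degrees of freedom down gives the 10 used in the decision rule.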
74. 74. 30 Minitab
75. 75. 31 Two-Sample Tests of Hypothesis: Dependent Samples Dependent samples are samples that are paired or related in some fashion. For example: – If you wished to buy a car you would look at the same car at two (or more) different dealerships and compare the prices. – If you wished to measure the effectiveness of a new diet you would weigh the dieters at the start and at the finish of the program.
76. 76. 32 Hypothesis Testing Involving Paired Observations Use the following test when the samples are dependent: $t = \frac{\bar{d}}{s_d/\sqrt{n}}$ where $\bar{d}$ is the mean of the differences, $s_d$ is the standard deviation of the differences, and n is the number of pairs (differences)
77. 77. 33 Hypothesis Testing Involving Paired Observations - Example Nickel Savings and Loan wishes to compare the two companies it uses to appraise the value of residential homes. Nickel Savings selected a sample of 10 residential properties and scheduled both firms for an appraisal. The results, reported in \$000, are shown on the table (right). At the .05 significance level, can we conclude there is a difference in the mean appraised values of the homes?
78. 78. 34 Hypothesis Testing Involving Paired Observations - Example Step 1: State the null and alternate hypotheses. H0: µd = 0 H1: µd ≠ 0 Step 2: State the level of significance. The .05 significance level is stated in the problem. Step 3: Find the appropriate test statistic. We will use the t-test
79. 79. 35 Hypothesis Testing Involving Paired Observations - Example Step 4: State the decision rule. Reject H0 if t > tα/2,n-1 or t < -tα/2,n-1; here t > t.025,9 or t < -t.025,9, that is, t > 2.262 or t < -2.262
80. 80. 36 Hypothesis Testing Involving Paired Observations - Example Step 5: Compute the value of t and make a decision The computed value of t is greater than the higher critical value, so our decision is to reject the null hypothesis. We conclude that there is a difference in the mean appraised values of the homes.
81. 81. 37 Hypothesis Testing Involving Paired Observations – Excel Example
82. 82. 38 End of Chapter 11
83. 83. ©The McGraw-Hill Companies, Inc. 2008 McGraw-Hill/Irwin Analysis of Variance Chapter 12
84. 84. 2 GOALS • List the characteristics of the F distribution. • Conduct a test of hypothesis to determine whether the variances of two populations are equal. • Discuss the general idea of analysis of variance. • Organize data into a one-way and a two-way ANOVA table. • Conduct a test of hypothesis among three or more treatment means. • Develop confidence intervals for the difference in treatment means. • Conduct a test of hypothesis among treatment means using a blocking variable. • Conduct a two-way ANOVA with interaction.
85. 85. 3 Characteristics of F-Distribution • There is a “family” of F Distributions. • Each member of the family is determined by two parameters: the numerator degrees of freedom and the denominator degrees of freedom. • F cannot be negative, and it is a continuous distribution. • The F distribution is positively skewed. • Its values range from 0 to ∞ • As F → ∞ the curve approaches the X-axis.
86. 86. 4 Comparing Two Population Variances The F distribution is used to test the hypothesis that the variance of one normal population equals the variance of another normal population. The following examples will show the use of the test: • Two Barth shearing machines are set to produce steel bars of the same length. The bars, therefore, should have the same mean length. We want to ensure that in addition to having the same mean length they also have similar variation. • The mean rate of return on two types of common stock may be the same, but there may be more variation in the rate of return in one than the other. A sample of 10 technology and 10 utility stocks shows the same mean rate of return, but there is likely more variation in the technology stocks. • A study by the marketing department for a large newspaper found that men and women spent about the same amount of time per day reading the paper. However, the same report indicated there was nearly twice as much variation in time spent per day among the men than the women.
87. 87. 5 Test for Equal Variances
88. 88. 6 Test for Equal Variances - Example Lammers Limos offers limousine service from the city hall in Toledo, Ohio, to Metro Airport in Detroit. Sean Lammers, president of the company, is considering two routes. One is via U.S. 25 and the other via I-75. He wants to study the time it takes to drive to the airport using each route and then compare the results. He collected the following sample data, which is reported in minutes. Using the .10 significance level, is there a difference in the variation in the driving times for the two routes?
89. 89. 7 Test for Equal Variances - Example Step 1: The hypotheses are: H0: σ1² = σ2² H1: σ1² ≠ σ2² Step 2: The significance level is .10. Step 3: The test statistic is the F distribution.
90. 90. 8 Test for Equal Variances - Example Step 4: State the decision rule. Reject H0 if F > Fα/2,v1,v2; here F > F.10/2,7-1,8-1, that is, F > F.05,6,7
91. 91. 9 Test for Equal Variances - Example Step 5: Compute the value of F and make a decision The decision is to reject the null hypothesis, because the computed F value (4.23) is larger than the critical value (3.87). We conclude that there is a difference in the variation of the travel times along the two routes.
92. 92. 10 Test for Equal Variances – Excel Example
93. 93. 11 Comparing Means of Two or More Populations l The F distribution is also used for testing whether two or more sample means came from the same or equal populations. l Assumptions: – The sampled populations follow the normal distribution. – The populations have equal standard deviations. – The samples are randomly selected and are independent.
94. 94. 12 l The Null Hypothesis is that the population means are the same. The Alternative Hypothesis is that at least one of the means is different. l The Test Statistic is the F distribution. l The Decision rule is to reject the null hypothesis if F (computed) is greater than F (table) with numerator and denominator degrees of freedom. l Hypothesis Setup and Decision Rule: H0: µ1 = µ2 =…= µk H1: The means are not all equal Reject H0 if F > Fa,k-1,n-k Comparing Means of Two or More Populations
95. 95. 13 Analysis of Variance – F statistic l If there are k populations being sampled, the numerator degrees of freedom is k – 1. l If there are a total of n observations the denominator degrees of freedom is n – k. l The test statistic is computed by: $F = \dfrac{SST/(k-1)}{SSE/(n-k)}$
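The F statistic above can be sketched in a few lines of Python. This is a minimal illustration, not the textbook's worksheet: the function name `one_way_anova_f` and the three small samples are hypothetical, and SST here means the treatment sum of squares, as on the slide.

```python
def one_way_anova_f(groups):
    """Return (SST, SSE, F) for a list of samples, one list per treatment."""
    k = len(groups)                          # number of treatments
    n = sum(len(g) for g in groups)          # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n
    # SST: variation of the treatment means around the grand mean
    sst = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # SSE: variation of each observation around its own treatment mean
    sse = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    f_value = (sst / (k - 1)) / (sse / (n - k))
    return sst, sse, f_value


# Hypothetical data, chosen only so the arithmetic is easy to follow.
sample_groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
sst, sse, f_value = one_way_anova_f(sample_groups)
print(sst, sse, f_value)
```

The computed F would then be compared with the table value for k – 1 and n – k degrees of freedom.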
96. 96. 14 Joyce Kuhlman manages a regional financial center. She wishes to compare the productivity, as measured by the number of customers served, among three employees. Four days are randomly selected and the number of customers served by each employee is recorded. The results are: Comparing Means of Two or More Populations – Illustrative Example
97. 97. 15 Comparing Means of Two or More Populations – Illustrative Example
98. 98. 16 Recently a group of four major carriers joined in hiring Brunner Marketing Research, Inc., to survey recent passengers regarding their level of satisfaction with a recent flight. The survey included questions on ticketing, boarding, in-flight service, baggage handling, pilot communication, and so forth. Twenty-five questions offered a range of possible answers: excellent, good, fair, or poor. A response of excellent was given a score of 4, good a 3, fair a 2, and poor a 1. These responses were then totaled, so the total score was an indication of the satisfaction with the flight. Brunner Marketing Research, Inc., randomly selected and surveyed passengers from the four airlines. Comparing Means of Two or More Populations – Example Is there a difference in the mean satisfaction level among the four airlines? Use the .01 significance level.
99. 99. 17 Step 1: State the null and alternate hypotheses. H0: µE = µA = µT = µO H1: The means are not all equal Reject H0 if $F > F_{\alpha,\,k-1,\,n-k}$ Step 2: State the level of significance. The .01 significance level is stated in the problem. Step 3: Find the appropriate test statistic. Because we are comparing means of more than two groups, use the F statistic. Comparing Means of Two or More Populations – Example
100. 100. 18 Step 4: State the decision rule. Reject H0 if $F > F_{\alpha,\,k-1,\,n-k}$, that is, $F > F_{.01,\,4-1,\,22-4} = F_{.01,\,3,\,18} = 5.09$. Comparing Means of Two or More Populations – Example
101. 101. 19 Step 5: Compute the value of F and make a decision Comparing Means of Two or More Populations – Example
102. 102. 20 Comparing Means of Two or More Populations – Example
103. 103. 21 Computing SS Total and SSE
104. 104. 22 Computing SST The computed value of F is 8.99, which is greater than the critical value of 5.09, so the null hypothesis is rejected. Conclusion: The population means are not all equal. The mean scores are not the same for the four airlines; at this point we can only conclude there is a difference in the treatment means. We cannot determine which treatment groups differ or how many treatment groups differ.
105. 105. 23 Inferences About Treatment Means l When we reject the null hypothesis that the means are equal, we may want to know which treatment means differ. l One of the simplest procedures is through the use of confidence intervals.
106. 106. 24 Confidence Interval for the Difference Between Two Means $(\bar{X}_1 - \bar{X}_2) \pm t\sqrt{MSE\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}$ l where t is obtained from the t table with degrees of freedom (n - k). l MSE = [SSE/(n - k)]
107. 107. 25 From the previous example, develop a 95% confidence interval for the difference in the mean rating for Eastern and Ozark. Can we conclude that there is a difference between the two airlines’ ratings? The 95 percent confidence interval ranges from 10.46 up to 26.04. Both endpoints are positive; hence, we can conclude these treatment means differ significantly. That is, passengers on Eastern rated service significantly different from those on Ozark. Confidence Interval for the Difference Between Two Means - Example
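As a rough sketch of how the interval for the difference between two treatment means is assembled, the helper below takes the t value as an input because it is read from a table. The function name and the numbers passed to it are hypothetical and are not the Eastern and Ozark data.

```python
import math


def diff_confidence_interval(mean1, mean2, t_value, mse, n1, n2):
    """Interval for the difference between two treatment means."""
    margin = t_value * math.sqrt(mse * (1 / n1 + 1 / n2))
    diff = mean1 - mean2
    return diff - margin, diff + margin


# Hypothetical inputs: means 10 and 6, t = 2, MSE = 8, four observations each.
low, high = diff_confidence_interval(10.0, 6.0, 2.0, 8.0, 4, 4)
print(low, high)
# If the interval excludes zero, the two treatment means differ.
means_differ = low > 0 or high < 0
```

An interval whose endpoints have the same sign, as in the slide's example, is the signal that the two means differ.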
108. 108. 26 Minitab
109. 109. 27 Excel
110. 110. 28 Two-Way Analysis of Variance l For the two-factor ANOVA we test whether there is a significant difference in the treatment effect and whether there is a difference in the blocking effect. l Let $B_r$ be the block totals (r for rows). l Let SSB represent the sum of squares for the blocks, where: $SSB = \sum\left[\dfrac{B_r^2}{k}\right] - \dfrac{(\sum X)^2}{n}$
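The block sum of squares can be sketched as follows, assuming the data are laid out with one row per block and one column per treatment. The function name and the small table are hypothetical and are not the WARTA travel times.

```python
def block_sum_of_squares(table):
    """SSB for a table with one row per block and one column per treatment."""
    k = len(table[0])                        # number of treatments (columns)
    n = sum(len(row) for row in table)       # total number of observations
    grand_total = sum(sum(row) for row in table)
    # Sum of (block total squared / k), minus the correction term (sum X)^2 / n
    return sum(sum(row) ** 2 / k for row in table) - grand_total ** 2 / n


# Hypothetical 3 blocks x 3 treatments.
ssb = block_sum_of_squares([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
print(ssb)
```

Here the block totals are 6, 9 and 12, so SSB = (36 + 81 + 144)/3 − 27²/9 = 87 − 81 = 6.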
111. 111. 29 WARTA, the Warren Area Regional Transit Authority, is expanding bus service from the suburb of Starbrick into the central business district of Warren. There are four routes being considered from Starbrick to downtown Warren: (1) via U.S. 6, (2) via the West End, (3) via the Hickory Street Bridge, and (4) via Route 59. WARTA conducted several tests to determine whether there was a difference in the mean travel times along the four routes. Because there will be many different drivers, the test was set up so each driver drove along each of the four routes. Next slide shows the travel time, in minutes, for each driver-route combination. At the .05 significance level, is there a difference in the mean travel time along the four routes? If we remove the effect of the drivers, is there a difference in the mean travel time? Two-Way Analysis of Variance - Example
112. 112. 30 Two-Way Analysis of Variance - Example
113. 113. 31 Step 1: State the null and alternate hypotheses. H0: µu = µw = µh = µr H1: The means are not all equal Reject H0 if $F > F_{\alpha,\,k-1,\,n-k}$ Step 2: State the level of significance. The .05 significance level is stated in the problem. Step 3: Find the appropriate test statistic. Because we are comparing means of more than two groups, use the F statistic. Two-Way Analysis of Variance - Example
114. 114. 32 Step 4: State the decision rule. Reject H0 if $F > F_{\alpha,\,v_1,\,v_2}$, that is, $F > F_{.05,\,k-1,\,n-k} = F_{.05,\,4-1,\,20-4} = F_{.05,\,3,\,16} = 3.24$. Two-Way Analysis of Variance - Example
117. 117. 35 Using Excel to perform the calculations. The computed value of F is 2.482, so our decision is to not reject the null hypothesis. We conclude there is no difference in the mean travel time along the four routes. There is no reason to select one of the routes as faster than the other. Two-Way Analysis of Variance – Excel Example
118. 118. 36 Two-Way ANOVA with Interaction Interaction occurs if the combination of two factors has some effect on the variable under study, in addition to each factor alone. We refer to the variable being studied as the response variable. An everyday illustration of interaction is the effect of diet and exercise on weight. It is generally agreed that a person’s weight (the response variable) can be controlled with two factors, diet and exercise. Research shows that weight is affected by diet alone and that weight is affected by exercise alone. However, the general recommended method to control weight is based on the combined or interaction effect of diet and exercise.
119. 119. 37 Graphical Observation of Mean Times Our graphical observations show us that interaction effects are possible. The next step is to conduct statistical tests of hypothesis to further investigate the possible interaction effects. In summary, our study of travel times has several questions: l Is there really an interaction between routes and drivers? l Are the travel times for the drivers the same? l Are the travel times for the routes the same? Of the three questions, we are most interested in the test for interactions. To put it another way, does a particular route/driver combination result in significantly faster (or slower) driving times? Also, the results of the hypothesis test for interaction affect the way we analyze the route and driver questions.
120. 120. 38 Interaction Effect l We can investigate these questions statistically by extending the two-way ANOVA procedure presented in the previous section. We add another source of variation, namely, the interaction. l In order to estimate the “error” sum of squares, we need at least two measurements for each driver/route combination. l As an example, suppose the experiment presented earlier is repeated by measuring two more travel times for each driver and route combination. That is, we replicate the experiment. Now we have three observations for each driver/route combination. l Using the mean of three travel times for each driver/route combination we get a more reliable measure of the mean travel time.
121. 121. 39 Example – ANOVA with Replication
122. 122. 40 Three Tests in ANOVA with Replication The ANOVA now has three sets of hypotheses to test: 1. H0: There is no interaction between drivers and routes. H1: There is interaction between drivers and routes. 2. H0: The driver means are the same. H1: The driver means are not the same. 3. H0: The route means are the same. H1: The route means are not the same.
123. 123. 41 ANOVA Table
124. 124. 42 Excel Output
126. 126. 44 End of Chapter 12
127. 127. ©The McGraw-Hill Companies, Inc. 2008 McGraw-Hill/Irwin Linear Regression and Correlation Chapter 13
128. 128. 2 GOALS l Understand and interpret the terms dependent and independent variable. l Calculate and interpret the coefficient of correlation, the coefficient of determination, and the standard error of estimate. l Conduct a test of hypothesis to determine whether the coefficient of correlation in the population is zero. l Calculate the least squares regression line. l Construct and interpret confidence and prediction intervals for the dependent variable.
129. 129. 3 Regression Analysis - Introduction l Recall that Chapter 4 introduced the idea of showing the relationship between two variables with a scatter diagram. l In that case we showed that, as the age of the buyer increased, the amount spent for the vehicle also increased. l In this chapter we carry this idea further. Numerical measures to express the strength of the relationship between two variables are developed. l In addition, an equation is used to express the relationship between variables, allowing us to estimate one variable on the basis of another.
130. 130. 4 Regression Analysis - Uses Some examples. l Is there a relationship between the amount Healthtex spends per month on advertising and its sales in the month? l Can we base an estimate of the cost to heat a home in January on the number of square feet in the home? l Is there a relationship between the miles per gallon achieved by large pickup trucks and the size of the engine? l Is there a relationship between the number of hours that students studied for an exam and the score earned?
131. 131. 5 Correlation Analysis l Correlation Analysis is the study of the relationship between variables. It is also defined as a group of techniques to measure the association between two variables. l A Scatter Diagram is a chart that portrays the relationship between the two variables. It is the usual first step in correlation analysis. – The Dependent Variable is the variable being predicted or estimated. – The Independent Variable provides the basis for estimation. It is the predictor variable.
132. 132. 6 Regression Example The sales manager of Copier Sales of America, which has a large sales force throughout the United States and Canada, wants to determine whether there is a relationship between the number of sales calls made in a month and the number of copiers sold that month. The manager selects a random sample of 10 representatives and determines the number of sales calls each representative made last month and the number of copiers sold.
133. 133. 7 Scatter Diagram
134. 134. 8 The Coefficient of Correlation, r The Coefficient of Correlation (r) is a measure of the strength of the relationship between two variables. It requires interval or ratio-scaled data. l It can range from -1.00 to 1.00. l Values of -1.00 or 1.00 indicate perfect and strong correlation. l Values close to 0.0 indicate weak correlation. l Negative values indicate an inverse relationship and positive values indicate a direct relationship.
135. 135. 9 Perfect Correlation
136. 136. 10 Minitab Scatter Plots
137. 137. 11 Correlation Coefficient - Interpretation
138. 138. 12 Correlation Coefficient - Formula
139. 139. 13 Coefficient of Determination The coefficient of determination ($r^2$) is the proportion of the total variation in the dependent variable (Y) that is explained or accounted for by the variation in the independent variable (X). It is the square of the coefficient of correlation. l It ranges from 0 to 1. l It does not give any information on the direction of the relationship between the variables.
140. 140. 14 Using the Copier Sales of America data, for which a scatterplot was developed earlier, compute the correlation coefficient and coefficient of determination. Correlation Coefficient - Example
141. 141. 15 Correlation Coefficient - Example
142. 142. 16 Correlation Coefficient – Excel Example
143. 143. 17 How do we interpret a correlation of 0.759? First, it is positive, so we see there is a direct relationship between the number of sales calls and the number of copiers sold. The value of 0.759 is fairly close to 1.00, so we conclude that the association is strong. However, does this mean that more sales calls cause more sales? No, we have not demonstrated cause and effect here, only that the two variables—sales calls and copiers sold—are related. Correlation Coefficient - Example
144. 144. 18 Coefficient of Determination ($r^2$) - Example •The coefficient of determination, $r^2$, is 0.576, found by $(0.759)^2$. •This is a proportion or a percent; we can say that 57.6 percent of the variation in the number of copiers sold is explained, or accounted for, by the variation in the number of sales calls.
145. 145. 19 Testing the Significance of the Correlation Coefficient H0: ρ = 0 (the correlation in the population is 0) H1: ρ ≠ 0 (the correlation in the population is not 0) Reject H0 if: $t > t_{\alpha/2,\,n-2}$ or $t < -t_{\alpha/2,\,n-2}$
146. 146. 20 Testing the Significance of the Correlation Coefficient - Example H0: ρ = 0 (the correlation in the population is 0) H1: ρ ≠ 0 (the correlation in the population is not 0) Reject H0 if: $t > t_{\alpha/2,\,n-2}$ or $t < -t_{\alpha/2,\,n-2}$, that is, $t > t_{0.025,\,8}$ or $t < -t_{0.025,\,8}$, so $t > 2.306$ or $t < -2.306$
147. 147. 21 Testing the Significance of the Correlation Coefficient - Example The computed t (3.297) is within the rejection region, therefore, we will reject H0. This means the correlation in the population is not zero. From a practical standpoint, it indicates to the sales manager that there is correlation with respect to the number of sales calls made and the number of copiers sold in the population of salespeople.
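The computed t of 3.297 can be reproduced from r = 0.759 and n = 10. The sketch below assumes the usual test statistic for a correlation coefficient, t = r√(n − 2)/√(1 − r²), which the transcript does not spell out; the function name is hypothetical and the critical value 2.306 is the one stated on the slides.

```python
import math


def correlation_t(r, n):
    """t statistic for testing whether the population correlation is zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


t_stat = correlation_t(0.759, 10)    # r and n from the Copier Sales example
critical_t = 2.306                   # table value for 8 degrees of freedom
reject_h0 = abs(t_stat) > critical_t
print(round(t_stat, 3), reject_h0)
```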
148. 148. 22 Minitab
149. 149. 23 Linear Regression Model
150. 150. 24 Computing the Slope of the Line
151. 151. 25 Computing the Y-Intercept
152. 152. 26 Regression Analysis In regression analysis we use the independent variable (X) to estimate the dependent variable (Y). l The relationship between the variables is linear. l Both variables must be at least interval scale. l The least squares criterion is used to determine the equation.
153. 153. 27 Regression Analysis – Least Squares Principle l The least squares principle is used to obtain a and b. l The equations to determine a and b are: $b = \dfrac{n(\sum XY) - (\sum X)(\sum Y)}{n(\sum X^2) - (\sum X)^2}$ and $a = \dfrac{\sum Y}{n} - b\,\dfrac{\sum X}{n}$
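The two equations translate directly into code. The following is a minimal sketch with a hypothetical function name and made-up data; it is not the Copier Sales sample, which appears only as an image in the slides.

```python
def least_squares(xs, ys):
    """Return (a, b) for the line Y-hat = a + bX using the sums in the formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)   # slope
    a = sum_y / n - b * sum_x / n                                  # intercept
    return a, b


# Hypothetical points lying exactly on Y = 1 + 2X.
a, b = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)
```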
154. 154. 28 Illustration of the Least Squares Regression Principle
155. 155. 29 Regression Equation - Example Recall the example involving Copier Sales of America. The sales manager gathered information on the number of sales calls made and the number of copiers sold for a random sample of 10 sales representatives. Use the least squares method to determine a linear equation to express the relationship between the two variables. What is the expected number of copiers sold by a representative who made 20 calls?
156. 156. 30 Finding the Regression Equation - Example The regression equation is: $\hat{Y} = a + bX$, so $\hat{Y} = 18.9476 + 1.1842X$. For 20 calls: $\hat{Y} = 18.9476 + 1.1842(20) = 42.6316$
157. 157. 31 Computing the Estimates of Y Step 1 – Using the regression equation, substitute the value of each X to solve for the estimated sales. Tom Keller: $\hat{Y} = 18.9476 + 1.1842X = 18.9476 + 1.1842(20) = 42.6316$ Soni Jones: $\hat{Y} = 18.9476 + 1.1842X = 18.9476 + 1.1842(30) = 54.4736$
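Substituting values of X into the fitted equation is easy to check by machine. The sketch below uses the intercept and slope from the slides; the function and constant names are hypothetical.

```python
INTERCEPT = 18.9476   # a, from the slides
SLOPE = 1.1842        # b, from the slides


def predict_copiers(calls):
    """Estimated copiers sold for a given number of sales calls."""
    return INTERCEPT + SLOPE * calls


estimates = {calls: round(predict_copiers(calls), 4) for calls in (20, 25, 30)}
print(estimates)
```

The three values correspond to the 20-call, 25-call and 30-call estimates used in the examples.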
158. 158. 32 Plotting the Estimated and the Actual Y’s
159. 159. 33 The Standard Error of Estimate l The standard error of estimate measures the scatter, or dispersion, of the observed values around the line of regression. l The formulas that are used to compute the standard error: $s_{y \cdot x} = \sqrt{\dfrac{\sum (Y - \hat{Y})^2}{n - 2}}$ and $s_{y \cdot x} = \sqrt{\dfrac{\sum Y^2 - a\sum Y - b\sum XY}{n - 2}}$
160. 160. 34 Standard Error of the Estimate - Example Recall the example involving Copier Sales of America. The sales manager determined the least squares regression equation $\hat{Y} = 18.9476 + 1.1842X$. Determine the standard error of estimate as a measure of how well the values fit the regression line. $s_{y \cdot x} = \sqrt{\dfrac{\sum (Y - \hat{Y})^2}{n - 2}} = \sqrt{\dfrac{784.211}{10 - 2}} = 9.901$
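The arithmetic in this example can be verified with a short sketch. The function name is hypothetical; the sum of squared residuals (784.211) and n = 10 are the values shown on the slide.

```python
import math


def standard_error_of_estimate(sum_squared_residuals, n):
    """s_y.x = sqrt( sum (Y - Y-hat)^2 / (n - 2) )."""
    return math.sqrt(sum_squared_residuals / (n - 2))


s_yx = standard_error_of_estimate(784.211, 10)
print(round(s_yx, 3))
```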
161. 161. 35 Graphical Illustration of the Differences between Actual Y and Estimated Y, $(Y - \hat{Y})$
162. 162. 36 Standard Error of the Estimate - Excel
163. 163. 37 Assumptions Underlying Linear Regression l For each value of X, there is a group of Y values, and these Y values are normally distributed. l The means of these normal distributions of Y values all lie on the straight line of regression. l The standard deviations of these normal distributions are equal. l The Y values are statistically independent. This means that in the selection of a sample, the Y values chosen for a particular X value do not depend on the Y values for any other X values.
164. 164. 38 Confidence Interval and Prediction Interval Estimates of Y •A confidence interval reports the mean value of Y for a given X. •A prediction interval reports the range of values of Y for a particular value of X.
165. 165. 39 Confidence Interval Estimate - Example We return to the Copier Sales of America illustration. Determine a 95 percent confidence interval for all sales representatives who make 25 calls.
166. 166. 40 Step 1 – Compute the point estimate of Y. In other words, determine the number of copiers we expect a sales representative to sell if he or she makes 25 calls. The regression equation is: $\hat{Y} = 18.9476 + 1.1842X$, so $\hat{Y} = 18.9476 + 1.1842(25) = 48.5526$ Confidence Interval Estimate - Example
167. 167. 41 Step 2 – Find the value of t l To find the t value, we need to first know the number of degrees of freedom. In this case the degrees of freedom is n - 2 = 10 – 2 = 8. l We set the confidence level at 95 percent. To find the value of t, move down the left-hand column of Appendix B.2 to 8 degrees of freedom, then move across to the column with the 95 percent level of confidence. l The value of t is 2.306. Confidence Interval Estimate - Example
168. 168. 42 Confidence Interval Estimate - Example
169. 169. 43 Confidence Interval Estimate - Example Step 4 – Use the formula above by substituting the numbers computed in previous slides Thus, the 95 percent confidence interval for the average sales of all sales representatives who make 25 calls is from 40.9170 up to 56.1882 copiers.
170. 170. 44 Prediction Interval Estimate - Example We return to the Copier Sales of America illustration. Determine a 95 percent prediction interval for Sheila Baker, a West Coast sales representative who made 25 calls.
171. 171. 45 Step 1 – Compute the point estimate of Y. In other words, determine the number of copiers we expect a sales representative to sell if he or she makes 25 calls. The regression equation is: $\hat{Y} = 18.9476 + 1.1842X$, so $\hat{Y} = 18.9476 + 1.1842(25) = 48.5526$ Prediction Interval Estimate - Example
172. 172. 46 Step 2 – Using the information computed earlier in the confidence interval estimation example, use the formula above. Prediction Interval Estimate - Example If Sheila Baker makes 25 sales calls, the number of copiers she will sell will be between about 24 and 73 copiers.
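The interval formulas themselves appear only as images in the slides, so the sketch below assumes the standard forms: Ŷ ± t·s√(1/n + (X − X̄)²/Σ(X − X̄)²) for the confidence interval, with an extra 1 under the square root for the prediction interval. The point estimate, t and the standard error come from the slides. The mean number of calls (22) and Σ(X − X̄)² (760) are NOT given in this transcript; they are assumed values chosen because they reproduce the stated intervals, and should be checked against the original data.

```python
import math


def interval_for_y(y_hat, t_value, s_yx, n, x, x_mean, sxx, prediction=False):
    """Confidence interval for mean Y, or prediction interval for one Y."""
    core = 1 / n + (x - x_mean) ** 2 / sxx
    if prediction:
        core += 1          # extra term for an individual observation
    margin = t_value * s_yx * math.sqrt(core)
    return y_hat - margin, y_hat + margin


X_MEAN = 22.0   # ASSUMPTION: mean number of calls, not shown in the transcript
SXX = 760.0     # ASSUMPTION: sum of (X - X_MEAN)^2, not shown in the transcript

conf_int = interval_for_y(48.5526, 2.306, 9.901, 10, 25, X_MEAN, SXX)
pred_int = interval_for_y(48.5526, 2.306, 9.901, 10, 25, X_MEAN, SXX,
                          prediction=True)
print(conf_int, pred_int)
```

The prediction interval is wider because it covers one representative's result rather than the mean of all representatives who make 25 calls.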
173. 173. 47 Confidence and Prediction Intervals – Minitab Illustration
174. 174. 48 Transforming Data l The coefficient of correlation describes the strength of the linear relationship between two variables. It could be that two variables are closely related, but their relationship is not linear. l Be cautious when you are interpreting the coefficient of correlation. A value of r may indicate there is no linear relationship, but it could be there is a relationship of some other nonlinear or curvilinear form.
175. 175. 49 Transforming Data - Example On the right is a listing of 22 professional golfers, the number of events in which they participated, the amount of their winnings, and their mean score for the 2004 season. In golf, the objective is to play 18 holes in the least number of strokes. So, we would expect that those golfers with the lower mean scores would have the larger winnings. To put it another way, score and winnings should be inversely related. In 2004 Tiger Woods played in 19 events, earned \$5,365,472, and had a mean score per round of 69.04. Fred Couples played in 16 events, earned \$1,396,109, and had a mean score per round of 70.92. The data for the 22 golfers follows.
176. 176. 50 Scatterplot of Golf Data l The correlation between the variables Winnings and Score is -0.782. This is a fairly strong inverse relationship. l However, when we plot the data on a scatter diagram the relationship does not appear to be linear; it does not seem to follow a straight line.
177. 177. 51 What can we do to explore other (nonlinear) relationships? One possibility is to transform one of the variables. For example, instead of using Y as the dependent variable, we might use its log, reciprocal, square, or square root. Another possibility is to transform the independent variable in the same way. There are other transformations, but these are the most common.
178. 178. 52 In the golf winnings example, changing the scale of the dependent variable is effective. We determine the log of each golfer’s winnings and then find the correlation between the log of winnings and score. That is, we find the log to the base 10 of Tiger Woods’ earnings of \$5,365,472, which is 6.72961. Transforming Data - Example
179. 179. 53 Scatter Plot of Transformed Y
180. 180. 54 Linear Regression Using the Transformed Y
181. 181. 55 Using the Transformed Equation for Estimation Based on the regression equation, a golfer with a mean score of 70 could expect to earn: •The value 6.4372 is the log to the base 10 of winnings. •The antilog of 6.4372 is about 2.736 million. •So a golfer with a mean score of 70 could expect to earn about \$2,736,528.
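The log transformation and its inverse are one-liners. This sketch only checks the two numbers quoted in the slides (the log of Tiger Woods' winnings and the antilog of 6.4372); the variable names are hypothetical.

```python
import math

# Base-10 log of winnings, as used for the transformed dependent variable.
log_winnings = math.log10(5_365_472)

# Antilog: convert a predicted log value back to dollars.
estimated_winnings = 10 ** 6.4372

print(round(log_winnings, 5), round(estimated_winnings))
```

Because 6.4372 is itself a rounded value, the antilog computed here agrees with the slide's dollar figure only approximately.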
182. 182. 56 End of Chapter 13
183. 183. ©The McGraw-Hill Companies, Inc. 2008 McGraw-Hill/Irwin Multiple Linear Regression and Correlation Analysis Chapter 14
184. 184. 2 GOALS l Describe the relationship between several independent variables and a dependent variable using multiple regression analysis. l Set up, interpret, and apply an ANOVA table l Compute and interpret the multiple standard error of estimate, the coefficient of multiple determination, and the adjusted coefficient of multiple determination. l Conduct a test of hypothesis to determine whether regression coefficients differ from zero. l Conduct a test of hypothesis on each of the regression coefficients. l Use residual analysis to evaluate the assumptions of multiple regression analysis. l Evaluate the effects of correlated independent variables. l Use and understand qualitative independent variables. l Understand and interpret the stepwise regression method. l Understand and interpret possible interaction among independent variables.
185. 185. 3 Multiple Regression Analysis The general multiple regression with k independent variables is given by: The least squares criterion is used to develop this equation. Because determining b1, b2, etc. is very tedious, a software package such as Excel or MINITAB is recommended.
186. 186. 4 Multiple Regression Analysis For two independent variables, the general form of the multiple regression equation is: •X1 and X2 are the independent variables. •a is the Y-intercept •b1 is the net change in Y for each unit change in X1 holding X2 constant. It is called a partial regression coefficient, a net regression coefficient, or just a regression coefficient.
187. 187. 5 Regression Plane for a 2-Independent Variable Linear Regression Equation
188. 188. 6 Salsberry Realty sells homes along the east coast of the United States. One of the questions most frequently asked by prospective buyers is: If we purchase this home, how much can we expect to pay to heat it during the winter? The research department at Salsberry has been asked to develop some guidelines regarding heating costs for single-family homes. Three variables are thought to relate to the heating costs: (1) the mean daily outside temperature, (2) the number of inches of insulation in the attic, and (3) the age in years of the furnace. To investigate, Salsberry’s research department selected a random sample of 20 recently sold homes. It determined the cost to heat each home last January, as well Multiple Linear Regression - Example
189. 189. 7 Multiple Linear Regression - Example
190. 190. 8 Multiple Linear Regression – Minitab Example
191. 191. 9 Multiple Linear Regression – Excel Example
192. 192. 10 The Multiple Regression Equation – Interpreting the Regression Coefficients The regression coefficient for mean outside temperature is -4.583. The coefficient is negative and shows an inverse relationship between heating cost and temperature. As the outside temperature increases, the cost to heat the home decreases. The numeric value of the regression coefficient provides more information. If we increase temperature by 1 degree and hold the other two independent variables constant, we can estimate a decrease of \$4.583 in monthly heating cost. So if the mean temperature in Boston is 25 degrees and it is 35 degrees in Philadelphia, all other things being the same (insulation and age of furnace), we expect the heating cost would be \$45.83 less in Philadelphia. The attic insulation variable also shows an inverse relationship: the more insulation in the attic, the less the cost to heat the home. So the negative sign for this coefficient is logical. For each additional inch of insulation, we expect the cost to heat the home to decline \$14.83 per month, regardless of the outside temperature or the age of the furnace. The age of the furnace variable shows a direct relationship. With an older furnace, the cost to heat the home increases. Specifically, for each additional year older the furnace is, we expect the cost to increase \$6.10 per month.
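The "net change" reading of each coefficient can be illustrated without the intercept, since only differences matter. The sketch below uses the three coefficients quoted on the slide, with the signs implied by the text; the dictionary keys and function name are hypothetical.

```python
# Regression coefficients as described on the slide (dollars per unit change).
COEFFICIENTS = {
    "temperature": -4.583,   # per degree of mean outside temperature
    "insulation": -14.83,    # per inch of attic insulation
    "furnace_age": 6.10,     # per year of furnace age
}


def change_in_cost(deltas):
    """Estimated change in monthly heating cost, other variables held constant."""
    return sum(COEFFICIENTS[name] * delta for name, delta in deltas.items())


# Philadelphia is 10 degrees warmer than Boston, everything else the same.
boston_to_philadelphia = change_in_cost({"temperature": 35 - 25})
print(round(boston_to_philadelphia, 2))
```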
193. 193. 11 Applying the Model for Estimation What is the estimated heating cost for a home if the mean outside temperature is 30 degrees, there are 5 inches of insulation in the attic, and the furnace is 10 years old?
194. 194. 12 Multiple Standard Error of Estimate The multiple standard error of estimate is a measure of the effectiveness of the regression equation. l It is measured in the same units as the dependent variable. l It is difficult to determine what is a large value and what is a small value of the standard error. l The formula is:
196. 196. 14 Multiple Regression and Correlation Assumptions l The independent variables and the dependent variable have a linear relationship. The dependent variable must be continuous and at least interval-scale. l The variation in the residuals must be the same for all values of Y. When this is the case, we say the residuals exhibit homoscedasticity. l The residuals should follow the normal distribution with mean 0. l Successive values of the dependent variable must be uncorrelated.
197. 197. 15 The ANOVA Table The ANOVA table reports the variation in the dependent variable. The variation is divided into two components. l The Explained Variation is that accounted for by the set of independent variables. l The Unexplained or Random Variation is not accounted for by the independent variables.
198. 198. 16 Minitab – the ANOVA Table
199. 199. 17 Coefficient of Multiple Determination ($R^2$) Characteristics of the coefficient of multiple determination: 1. It is symbolized by a capital R squared. In other words, it is written as $R^2$ because it behaves like the square of a correlation coefficient. 2. It can range from 0 to 1. A value near 0 indicates little association between the set of independent variables and the dependent variable. A value near 1 means a strong association. 3. It cannot assume negative values. Any number that is squared or raised to the second power cannot be negative. 4. It is easy to interpret. Because $R^2$ is a value between 0 and 1, it is easy to interpret, compare, and understand.
200. 200. 18 Minitab – the ANOVA Table $R^2 = \dfrac{SSR}{SS\ total} = \dfrac{171{,}220}{212{,}916} = 0.804$
201. 201. 19 Adjusted Coefficient of Determination l The number of independent variables in a multiple regression equation makes the coefficient of determination larger. Each new independent variable causes the predictions to be more accurate. l If the number of variables, k, and the sample size, n, are equal, the coefficient of determination is 1.0. In practice, this situation is rare and would also be ethically questionable. l To balance the effect that the number of independent variables has on the coefficient of multiple determination, statistical software packages use an adjusted coefficient of multiple determination.
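The ANOVA-table quantities can be tied together in a short sketch. SSR and SS total are the values from the Minitab slide, and n = 20 homes with k = 3 independent variables comes from the Salsberry example. The formulas for the adjusted coefficient and the multiple standard error are the standard ones and are assumed here, since the slides show them only as images; the variable names are hypothetical.

```python
import math

SSR = 171_220        # regression (explained) sum of squares, from the slide
SS_TOTAL = 212_916   # total sum of squares, from the slide
N, K = 20, 3         # sample size and number of independent variables

sse = SS_TOTAL - SSR                        # unexplained variation
r_squared = SSR / SS_TOTAL
# Adjusted R^2 divides each sum of squares by its degrees of freedom.
adj_r_squared = 1 - (sse / (N - K - 1)) / (SS_TOTAL / (N - 1))
# Multiple standard error of estimate, in the units of the dependent variable.
std_error = math.sqrt(sse / (N - (K + 1)))

print(round(r_squared, 3), round(adj_r_squared, 3), round(std_error, 2))
```

The adjusted value is smaller than the unadjusted one because it penalizes each added independent variable.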
203. 203. 21 Correlation Matrix A correlation matrix is used to show all possible simple correlation coefficients among the variables. l The matrix is useful for locating correlated independent variables. l It shows how strongly each independent variable is correlated with the dependent variable.
204. 204. 22 Global Test: Testing the Multiple Regression Model The global test is used to investigate whether any of the independent variables have significant coefficients. The hypotheses are: $H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$ $H_1$: Not all $\beta$s equal 0
205. 205. 23 Global Test continued l The test statistic is the F distribution with k (number of independent variables) and n-(k+1) degrees of freedom, where n is the sample size. l Decision Rule: Reject H0 if $F > F_{\alpha,\,k,\,n-k-1}$
206. 206. 24 Finding the Critical F
207. 207. 25 Finding the Computed F
208. 208. 26 Interpretation l The computed value of F is 21.90, which is in the rejection region. l The null hypothesis that all the multiple regression coefficients are zero is therefore rejected. l Interpretation: some of the independent variables (amount of insulation, etc.) do have the ability to explain the variation in the dependent variable (heating cost). l Logical question – which ones?
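The computed F of 21.90 follows from the ANOVA sums of squares quoted earlier in the chapter (SSR = 171,220 and SS total = 212,916) with n = 20 and k = 3. The sketch assumes the usual global-test statistic F = (SSR/k) / (SSE/(n − (k + 1))), which the transcript shows only as an image; the function name is hypothetical.

```python
def global_f(ssr, ss_total, n, k):
    """F statistic for the global test of a multiple regression model."""
    sse = ss_total - ssr                   # unexplained variation
    return (ssr / k) / (sse / (n - (k + 1)))


f_computed = global_f(171_220, 212_916, n=20, k=3)
print(round(f_computed, 2))
```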
209. 209. 27 Evaluating Individual Regression Coefficients (βi = 0) l This test is used to determine which independent variables have nonzero regression coefficients. l The variables that have zero regression coefficients are usually dropped from the analysis. l The test statistic is the t distribution with n-(k+1) degrees of freedom. l The hypothesis test is as follows: H0: βi = 0 H1: βi ≠ 0 Reject H0 if $t > t_{\alpha/2,\,n-k-1}$ or $t < -t_{\alpha/2,\,n-k-1}$
210. 210. 28 Critical t-stat for the Slopes -2.120 2.120
211. 211. 29 Computed t-stat for the Slopes
212. 212. 30 Conclusion on Significance of Slopes
213. 213. 31 New Regression Model without Variable “Age” – Minitab
214. 214. 32 New Regression Model without Variable “Age” – Minitab
215. 215. 33 Testing the New Model for Significance
216. 216. 34 Critical t-stat for the New Slopes Reject H0 if $t > t_{\alpha/2,\,n-k-1}$ or $t < -t_{\alpha/2,\,n-k-1}$, where $t = \dfrac{b_i - 0}{s_{b_i}}$. With $\alpha = .05$, $n = 20$ and $k = 2$, the critical values are $\pm t_{.05/2,\,20-2-1} = \pm t_{.025,\,17} = \pm 2.110$, so reject H0 if $\dfrac{b_i - 0}{s_{b_i}} > 2.110$ or $\dfrac{b_i - 0}{s_{b_i}} < -2.110$.
217. 217. 35 Conclusion on Significance of New Slopes
218. 218. 36 Evaluating the Assumptions of Multiple Regression 1. There is a linear relationship. That is, there is a straight-line relationship between the dependent variable and the set of independent variables. 2. The variation in the residuals is the same for both large and small values of the estimated Y. To put it another way, the size of the residuals is unrelated to whether the estimated Y is large or small. 3. The residuals follow the normal probability distribution. 4. The independent variables should not be correlated. That is, we would like to select a set of independent variables that are not themselves correlated. 5. The residuals are independent. This means that successive observations of the dependent variable are not correlated. This assumption is often violated when time is involved with the sampled observations.
219. 219. 37 Analysis of Residuals A residual is the difference between the actual value of Y and the predicted value of Y. Residuals should be approximately normally distributed. Histograms and stem-and-leaf charts are useful in checking this requirement. A plot of the residuals against their corresponding Y’ values is used to show that there are no trends or patterns in the residuals.
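The definition of a residual can be sketched in a few lines. The function name and the data values below are made up for illustration and are not the heating cost data from the slides.

```python
def residuals(actual, predicted):
    """Return the residuals: actual Y minus predicted Y, pair by pair."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return [y - y_hat for y, y_hat in zip(actual, predicted)]


# Made-up actual and predicted values, for illustration only.
actual_y = [250, 360, 165, 43]
predicted_y = [258, 350, 170, 40]

res = residuals(actual_y, predicted_y)
print(res)  # [-8, 10, -5, 3]
```

Plotting these residuals against the predicted values, or drawing a histogram of them, is then a matter of passing `res` to whatever charting tool is at hand.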
220. 220. 38 Scatter Diagram
221. 221. 39 Residual Plot
222. 222. 40 Distribution of Residuals Both MINITAB and Excel offer another graph that helps to evaluate the assumption of normally distributed residuals. It is called a normal probability plot and is shown to the right of the histogram.
223. 223. 41 Multicollinearity • Multicollinearity exists when independent variables (X’s) are correlated. • Correlated independent variables make it difficult to make inferences about the individual regression coefficients (slopes) and their individual effects on the dependent variable (Y). • However, correlated independent variables do not affect a multiple regression equation’s ability to predict the dependent variable (Y).
224. 224. 42 Variance Inflation Factor • A general rule is that if the correlation between two independent variables is between -0.70 and 0.70, there is likely no problem using both of the independent variables. • A more precise test is to use the variance inflation factor (VIF). • The value of VIF is found as follows: $VIF = \dfrac{1}{1 - R_j^2}$ • The term $R_j^2$ refers to the coefficient of determination, where the selected independent variable is used as a dependent variable and the remaining independent variables are used as independent variables. • A VIF greater than 10 is considered unsatisfactory, indicating that the independent variable should be removed from the analysis.
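The usual definition, VIF = 1 / (1 - R²_j), can be sketched as below, together with the "greater than 10" rule from the slide. The function names and the R² values in the usage lines are hypothetical; in practice R²_j would come from regressing the selected independent variable on the remaining ones.

```python
def variance_inflation_factor(r2_j):
    """Return VIF = 1 / (1 - R_j^2).

    r2_j is the coefficient of determination obtained when independent
    variable j is regressed on the remaining independent variables.
    """
    if not 0 <= r2_j < 1:
        raise ValueError("R_j^2 must be in the interval [0, 1)")
    return 1.0 / (1.0 - r2_j)


def is_unsatisfactory(vif, limit=10.0):
    """Apply the rule of thumb: a VIF above the limit is unsatisfactory."""
    return vif > limit


# Hypothetical R_j^2 values, for illustration only.
for r2 in (0.25, 0.95):
    vif = variance_inflation_factor(r2)
    print(f"R2_j = {r2:.2f} -> VIF = {vif:.2f}, unsatisfactory: {is_unsatisfactory(vif)}")
```

A VIF of 1 means the variable is uncorrelated with the others; the factor grows without bound as R²_j approaches 1.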
225. 225. 43 Multicollinearity – Example Refer to the data in the table, which relates the heating cost to the independent variables outside temperature, amount of insulation, and age of furnace. Develop a correlation matrix for all the independent variables. Does it appear there is a problem with multicollinearity? Find and interpret the variance inflation factor for each of the independent variables.
226. 226. 44 Correlation Matrix - Minitab
227. 227. 45 VIF – Minitab Example The VIF value of 1.32 is less than the upper limit of 10. This indicates that the independent variable temperature is not strongly correlated with the other independent variables.
228. 228. 46 Independence Assumption • The fifth assumption about regression and correlation analysis is that successive residuals should be independent. • When successive residuals are correlated, we refer to this condition as autocorrelation. Autocorrelation frequently occurs when the data are collected over a period of time.
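One simple way to look for correlated successive residuals is the lag-1 autocorrelation coefficient, sketched below. This is an illustrative diagnostic chosen for this example, not a procedure prescribed by the slides, and both residual series are made up.

```python
def lag1_autocorrelation(residuals):
    """Correlation between each residual and the one just before it.

    Values near +1 suggest runs of same-signed residuals (positive
    autocorrelation); values near 0 suggest no lag-1 autocorrelation.
    """
    n = len(residuals)
    if n < 2:
        raise ValueError("need at least two residuals")
    mean = sum(residuals) / n
    denom = sum((e - mean) ** 2 for e in residuals)
    if denom == 0:
        raise ValueError("residuals have no variation")
    numer = sum(
        (residuals[t] - mean) * (residuals[t - 1] - mean) for t in range(1, n)
    )
    return numer / denom


# Made-up residuals: a run above the mean followed by a run below it.
run_residuals = [3, 4, 2, 3, -2, -4, -3, -3]
# Made-up residuals that alternate in sign.
alternating_residuals = [2, -2, 2, -2, 2, -2]

print(round(lag1_autocorrelation(run_residuals), 3))
print(round(lag1_autocorrelation(alternating_residuals), 3))
```

The first series, with its long runs, gives a clearly positive coefficient, while the alternating series gives a negative one.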
229. 229. 47 Residual Plot versus Fitted Values • The graph shows the residuals plotted on the vertical axis and the fitted values on the horizontal axis. • Note the run of residuals above the mean of the residuals, followed by a run below the mean. A scatter plot such as this would indicate possible autocorrelation.
230. 230. 48 Qualitative Independent Variables • Frequently we wish to use nominal-scale variables (such as gender, whether the home has a swimming pool, or whether the sports team was the home or the visiting team) in our analysis. These are called qualitative variables. • To use a qualitative variable in regression analysis, we use a scheme of dummy variables in which one of the two possible conditions is coded 0 and the other 1.
231. 231. 49 Qualitative Variable - Example Suppose in the Salsberry Realty example that the independent variable “garage” is added. For those homes without an attached garage, a 0 is used; for homes with an attached garage, a 1 is used. We will refer to this as the “garage” variable. The data from Table 14–2 are entered into the MINITAB system.
232. 232. 50 Qualitative Variable - Minitab
233. 233. 51 Using the Model for Estimation What is the effect of the garage variable? Suppose we have two houses exactly alike next to each other in Buffalo, New York; one has an attached garage, and the other does not. Both homes have 3 inches of insulation, and the mean January temperature in Buffalo is 20 degrees. For the house without an attached garage, a 0 is substituted for the garage variable in the regression equation, giving an estimated heating cost of \$280.90. For the house with an attached garage, a 1 is substituted for the garage variable, giving an estimated heating cost of \$358.30.
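Because the two homes differ only in the garage dummy variable, the gap between the two estimates equals the garage regression coefficient. The sketch below uses only the two estimates quoted on the slide; the constant and function names are made up, and the "without garage" figure stands in for the intercept plus the temperature and insulation terms, which are held fixed.

```python
# Estimates quoted on the slide for two otherwise identical Buffalo homes
# (3 inches of insulation, mean January temperature of 20 degrees).
COST_WITHOUT_GARAGE = 280.90
COST_WITH_GARAGE = 358.30

# The implied coefficient of the garage dummy variable.
garage_coefficient = COST_WITH_GARAGE - COST_WITHOUT_GARAGE


def estimated_cost(garage):
    """Estimated heating cost with temperature and insulation held fixed.

    garage -- dummy variable: 0 = no attached garage, 1 = attached garage
    """
    if garage not in (0, 1):
        raise ValueError("garage must be coded 0 or 1")
    return COST_WITHOUT_GARAGE + garage_coefficient * garage


print(round(garage_coefficient, 2))  # 77.4
```

This is why a dummy variable's coefficient reads directly as the estimated difference between the two groups, all else being equal.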
234. 234. 52 Testing the Model for Significance • We have shown the difference between the two types of homes to be \$77.40, but is the difference significant? • We conduct the following test of hypothesis. H0: βi = 0 H1: βi ≠ 0 Reject H0 if $t > t_{\alpha/2,\,n-k-1}$ or $t < -t_{\alpha/2,\,n-k-1}$
235. 235. 53 Evaluating Individual Regression Coefficients (βi = 0) • This test is used to determine which independent variables have nonzero regression coefficients. • Variables whose regression coefficients are not significantly different from zero are usually dropped from the analysis. • The test statistic follows the t distribution with n-(k+1), that is n-k-1, degrees of freedom. • The hypothesis test is as follows: H0: βi = 0 H1: βi ≠ 0 Reject H0 if $t > t_{\alpha/2,\,n-k-1}$ or $t < -t_{\alpha/2,\,n-k-1}$
236. 236. 54 Reject H0 if $t < -t_{\alpha/2,\,n-k-1}$ or $t > t_{\alpha/2,\,n-k-1}$, where $t = \dfrac{b_i - 0}{s_{b_i}}$. With $\alpha = 0.05$, $n = 20$, and $k = 3$, the critical value is $t_{0.05/2,\,20-3-1} = t_{0.025,\,16} = 2.120$, so reject H0 if $\dfrac{b_i - 0}{s_{b_i}} < -2.120$ or $\dfrac{b_i - 0}{s_{b_i}} > 2.120$. Conclusion: The regression coefficient is not zero. The independent variable garage should be included in the analysis.
237. 237. 55 Stepwise Regression The advantages to the stepwise method are: 1. Only independent variables with significant regression coefficients are entered into the equation. 2. The steps involved in building the regression equation are clear. 3. It is efficient in finding the regression equation with only significant regression coefficients. 4. The changes in the multiple standard error of estimate and the coefficient of determination are shown.
238. 238. 56 Stepwise Regression – Minitab Example The stepwise MINITAB output for the heating cost problem follows. Temperature is selected first. This variable explains more of the variation in heating cost than any of the other three proposed independent variables. Garage is selected next, followed by Insulation.
239. 239. 57 Regression Models with Interaction • In Chapter 12 we discussed interaction among independent variables. To explain, suppose we are studying weight loss and assume, as the current literature suggests, that diet and exercise are related. So the dependent variable is the amount of change in weight and the independent variables are diet (yes or no) and exercise (none, moderate, significant). We are interested in whether there is interaction among the independent variables. That is, if those studied maintain their diet and exercise significantly, will that increase the mean amount of weight lost? Is total weight loss more than the sum of the loss due to the diet effect and the loss due to the exercise effect? • In regression analysis, interaction can be examined as a separate independent variable. An interaction prediction variable can be developed by multiplying the data values in one independent variable by the values in another independent variable, thereby creating a new independent variable. A two-variable model that includes an interaction term is: $Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2$
240. 240. 58 Regression Models with Interaction - Example Refer to the heating cost example. Is there an interaction between the outside temperature and the amount of insulation? If both variables are increased, is the effect on heating cost greater than the sum of the savings from warmer temperature and the savings from increased insulation separately?
241. 241. 59 Regression Models with Interaction - Example Creating the Interaction Variable: Using the information from the table in the previous slide, an interaction variable is created by multiplying the temperature variable by the insulation variable. For the first sampled home, the temperature is 35 degrees and the insulation is 3 inches, so the value of the interaction variable is 35 × 3 = 105. The values of the other interaction products are found in a similar fashion.
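Creating the interaction variable amounts to an element-by-element product of two columns, as in the sketch below. Only the first home (35 degrees, 3 inches) comes from the slide; the other rows and the function name are made up for illustration.

```python
def interaction_term(x1, x2):
    """Return the element-wise product of two independent variables."""
    if len(x1) != len(x2):
        raise ValueError("both variables must have the same number of values")
    return [a * b for a, b in zip(x1, x2)]


# First home is from the slide; the remaining two rows are hypothetical.
temperature = [35, 10, 50]   # degrees
insulation = [3, 6, 8]       # inches

temp_x_insulation = interaction_term(temperature, insulation)
print(temp_x_insulation)  # [105, 60, 400]
```

The resulting column is then entered into the regression as one more independent variable alongside temperature and insulation.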
242. 242. 60 Regression Models with Interaction - Example
243. 243. 61 Regression Models with Interaction - Example Is the interaction variable in the fitted regression equation significant at the 0.05 significance level?
244. 244. 62 There are other situations that can occur when studying interaction among independent variables. 1. It is possible to have a three-way interaction among the independent variables. In the heating example, we might have considered the three-way interaction between temperature, insulation, and age of the furnace. 2. It is possible to have an interaction where one of the independent variables is nominal scale. In our heating cost example, we could have studied the interaction between temperature and garage.
245. 245. 63 End of Chapter 14