Marketing Research Session 10 Hypothesis Testing with Simple Random Samples (Chapter 12)

Remember: $Z_{.05} = 1.645$, $Z_{.01} = 2.33$

We will only cover one-sided hypothesis testing (cases 12.3, 12.4.2, 12.5.2, and 12.6.1).

12.6.1 One Sample, One-sided Test: At a $(1-\alpha)$ level of confidence, test $H_0: \pi \le K$ against $H_a: \pi > K$.

Example 12.1 (adapted)
Suppose we wish to organize a rock concert at a university campus. The campus has 20,000 students. Our costs will be exactly recovered if 4000 students attend the concert; we will make a profit if more than 4000 students attend. In a simple random sample of 100 students, 30 students will attend the concert. At a 95% level of confidence, test the null hypothesis that not more than 4000 students will attend the concert.

Process: Null hypothesis: $20{,}000\pi \le 4000$, that is, $\pi \le \frac{4000}{20{,}000} = 0.2$.

At a 95% level of confidence, test $H_0: \pi \le 0.2$ against $H_a: \pi > 0.2$.

[Figure: sampling distribution of $p$ when $\pi = .2$, with an upper-tail area of .05 to the right of the cutoff $.2 + ?$]

Data: $n = 100$, $p =$ ____
Decision Rule: At a 95% level of confidence, reject $H_0$ if $p >$ ____
Conclusion: ____

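Example 12.1 (continued)
$H_0: \pi \le .2$, $H_a: \pi > .2$; $n = 100$, $p = .3$

[Figure: sampling distribution of $p$ when $\pi = .2$; the P-value is the area to the right of $p = .3$]

P-value $= P(p \ge .3 \mid \pi = .2) = P\left(\frac{p - \pi}{\sigma_p} \ge \text{\_\_\_\_}\right) =$ ____

Note: The P-value is the probability of getting a result as extreme as, or more extreme than, the result from the sample if $H_0$ is true. For any hypothesis test, if P-value $< \alpha$, we can reject $H_0$ at confidence level $(1-\alpha)$.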
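The P-value on this slide can be checked numerically. The sketch below assumes the normal approximation used throughout the session, with $\sigma_p = \sqrt{\pi(1-\pi)/n}$ evaluated at the boundary value $\pi = .2$; the function name `upper_tail_p_value` is made up for illustration.

```python
import math

def upper_tail_p_value(p_hat, pi0, n):
    """Return (z, P-value) for H0: pi <= pi0 vs Ha: pi > pi0 (normal approximation)."""
    sigma_p = math.sqrt(pi0 * (1 - pi0) / n)   # standard deviation of p when pi = pi0
    z = (p_hat - pi0) / sigma_p                # standardized sample proportion
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z) for a standard normal Z
    return z, p_value

z, p_value = upper_tail_p_value(p_hat=0.3, pi0=0.2, n=100)
print(round(z, 2), round(p_value, 4))
```

Here $z$ comes out at 2.5 and the P-value at roughly 0.006, which is below $\alpha = .05$, so this check agrees with rejecting $H_0$ at a 95% level of confidence.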

12.6.1 One Sample, One-sided Test: At a $(1-\alpha)$ level of confidence, test $H_0: \pi \le K$ against $H_a: \pi > K$.

Decision Rule: At a $(1-\alpha)$ level of confidence, reject $H_0$ if
$$p > K + Z_\alpha \sqrt{\frac{K(1-K)}{n}}$$

Example 12.1 (adapted): The campus has 20,000 students and costs are exactly recovered if 4000 students attend. In a simple random sample of 100 students, 30 students will attend the concert. At a 95% level of confidence, test the null hypothesis that not more than 4000 students will attend the concert.

$H_0$: ____
$H_a$: ____
$n =$ ____, $p =$ ____
Decision Rule: At a 95% level of confidence, reject $H_0$ if: ____
Conclusion: ____

12.6.1 One Sample, One-sided Test (continued): At a $(1-\alpha)$ level of confidence, reject $H_0: \pi \le K$ in favor of $H_a: \pi > K$ if $p > K + Z_\alpha \sqrt{\frac{K(1-K)}{n}}$.

Example 12.6: There are 50,000 households in a city, and we have drawn a simple random sample of size 500 from this population. In this sample of 500, 220 households own pets. At a 99% level of confidence, test the null hypothesis that not more than 20,000 households in this city own pets.

Answer: First write $H_0$ and $H_a$ in terms of $\pi$.
$H_0$: ____
$H_a$: ____
$n =$ ____, $p =$ ____
Decision Rule: At a 99% level of confidence, reject $H_0$ if: ____
Conclusion: ____
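As a way to check worksheet answers, the 12.6.1 decision rule can be sketched in a few lines. The helper name `proportion_test` and the dictionary `Z` are illustrative; the critical values are the ones quoted on the slides.

```python
import math

# Critical values quoted on the slides: Z_.05 and Z_.01.
Z = {0.05: 1.645, 0.01: 2.33}

def proportion_test(K, n, p, alpha):
    """One-sided test of H0: pi <= K vs Ha: pi > K; returns (cutoff, reject)."""
    cutoff = K + Z[alpha] * math.sqrt(K * (1 - K) / n)
    return cutoff, p > cutoff

# Example 12.1: K = 4000/20000, n = 100, p = 30/100, 95% confidence.
cutoff_121, reject_121 = proportion_test(4000 / 20000, 100, 30 / 100, 0.05)

# Example 12.6: K = 20000/50000, n = 500, p = 220/500, 99% confidence.
cutoff_126, reject_126 = proportion_test(20000 / 50000, 500, 220 / 500, 0.01)

print(round(cutoff_121, 4), reject_121)
print(round(cutoff_126, 4), reject_126)
```

Under these inputs the cutoff for Example 12.1 is about 0.266 (so $p = .30$ leads to rejecting $H_0$), while the cutoff for Example 12.6 is about 0.451 (so $p = .44$ does not).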

12.3: At a $(1-\alpha)$ level of confidence, test $H_0: \mu \le K$ against $H_a: \mu > K$.

Decision Rule:
If $n \ge 30$: At a $(1-\alpha)$ level of confidence, reject $H_0$ if $\bar{X} > K + Z_\alpha \frac{s}{\sqrt{n}}$.
If $n < 30$: At a $(1-\alpha)$ level of confidence, reject $H_0$ if $\bar{X} > K + t_\alpha \frac{s}{\sqrt{n}}$, where $t$ has $(n-1)$ degrees of freedom.

Example 12.2. A university has 15,000 students. We have drawn a simple random sample of size 400 from the population, and recorded how much money each student spent on cellular telephone service during November 2003. For this sample, the sample mean is $36, and the sample standard deviation is $20. At a 99% level of confidence, test the null hypothesis that these 15,000 students, combined, did not spend more than $500,000 on cellular telephone service during November 2003.

Answer:
$H_0$: ____
$H_a$: ____
$\bar{X} =$ ____, $s =$ ____, $n =$ ____
Decision Rule: At a 99% level of confidence, reject $H_0$ if ____
Conclusion: ____

$H_0: \mu \le K$ against $H_a: \mu > K$

Decision Rule:
If $n \ge 30$: At a $(1-\alpha)$ level of confidence, reject $H_0$ if $\bar{X} > K + Z_\alpha \frac{s}{\sqrt{n}}$.
If $n < 30$: At a $(1-\alpha)$ level of confidence, reject $H_0$ if $\bar{X} > K + t_\alpha \frac{s}{\sqrt{n}}$, where $t$ has $(n-1)$ degrees of freedom.

Example 12.3. Suppose once again we want to test the hypothesis described in Example 12.2, and a simple random sample again gives $\bar{x} = 36$ and $s = 20$. Perform the test if the sample size is 25 instead of 400.

Answer:
$H_0$: ____
$H_a$: ____
$\bar{X} =$ ____, $s =$ ____, $n =$ ____
Decision Rule: At a 99% level of confidence, reject $H_0$ if ____
Conclusion: ____
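Examples 12.2 and 12.3 differ only in which critical value the 12.3 rule uses. The sketch below works both through, taking $K = 500{,}000/15{,}000$ as the per-student mean implied by the null hypothesis. The value `T_01_DF24 = 2.492` is an assumption taken from a standard t-table (one-tailed $\alpha = .01$, 24 degrees of freedom); it does not appear on the slides.

```python
import math

Z_01 = 2.33        # from the slides
T_01_DF24 = 2.492  # assumed t-table value: alpha = .01, df = 25 - 1 = 24

def mean_cutoff(K, s, n, critical):
    """Cutoff for rejecting H0: mu <= K, i.e. K + critical * s / sqrt(n)."""
    return K + critical * s / math.sqrt(n)

K = 500000 / 15000   # total of $500,000 spread over 15,000 students
x_bar, s = 36, 20

cutoff_400 = mean_cutoff(K, s, 400, Z_01)       # Example 12.2: n >= 30, use Z
cutoff_25 = mean_cutoff(K, s, 25, T_01_DF24)    # Example 12.3: n < 30, use t

reject_400 = x_bar > cutoff_400
reject_25 = x_bar > cutoff_25
print(round(cutoff_400, 2), reject_400)
print(round(cutoff_25, 2), reject_25)
```

With $n = 400$ the cutoff is about 35.66, so $\bar{X} = 36$ rejects $H_0$; with $n = 25$ the cutoff rises to about 43.30 and the same sample mean no longer does.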

12.4.2. Comparing Means of Two Independent Samples: At a $(1-\alpha)$ level of confidence, test $H_0: \mu_1 - \mu_2 \le K$ against $H_a: \mu_1 - \mu_2 > K$. Consider only the case where $n_1 \ge 30$ and $n_2 \ge 30$.

Decision Rule: At a confidence level of $(1-\alpha)$, reject $H_0$ if
$$(\bar{X}_1 - \bar{X}_2) > K + Z_\alpha \sigma_{\bar{X}_1 - \bar{X}_2} \approx K + Z_\alpha \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

Note: If the hypothesis is stated in terms of $(\mu_2 - \mu_1)$, state the decision rule in terms of $(\bar{X}_2 - \bar{X}_1)$. $(\bar{X}_1 - \bar{X}_2)$ and $(\bar{X}_2 - \bar{X}_1)$ have the same standard deviation, $\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$.

Example 12.4: We have drawn two independent simple random samples, one from the male student population, and the other from the female student population, at a college campus. For each member of the samples, we recorded how much money that student spent purchasing clothes during the Spring 2003 semester. The results are summarized below:
1. Sample of Males: $n_1 = 50$, $\bar{X}_1 = \$420$, $s_1 = \$150$.
2. Sample of Females: $n_2 = 150$, $\bar{X}_2 = \$450$, $s_2 = \$250$.
At a 99% level of confidence, test the null hypothesis that on the average, a female student did not spend more money purchasing clothes than a male student during the Spring 2003 semester.

Example 12.4 (continued), using the decision rule from 12.4.2: at a confidence level of $(1-\alpha)$, reject $H_0: \mu_1 - \mu_2 \le K$ if $(\bar{X}_1 - \bar{X}_2) > K + Z_\alpha \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$.
Note: If the hypothesis is stated in terms of $(\mu_2 - \mu_1)$, state the decision rule in terms of $(\bar{X}_2 - \bar{X}_1)$.

$H_0$: ____
$H_a$: ____
$n_1 = 50$, $\bar{X}_1 = \$420$, $s_1 = \$150$
$n_2 = 150$, $\bar{X}_2 = \$450$, $s_2 = \$250$
Decision Rule: At a 99% level of confidence, reject $H_0$ if ____
Conclusion: ____
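One way to check Example 12.4 is sketched below. It assumes the null hypothesis is read as $H_0: \mu_2 - \mu_1 \le 0$ (females minus males, so $K = 0$), which is one reading of "a female student did not spend more than a male student"; the variable names are illustrative.

```python
import math

Z_01 = 2.33  # from the slides

n1, x1, s1 = 50, 420, 150    # sample of males
n2, x2, s2 = 150, 450, 250   # sample of females
K = 0                        # assumed reading: H0 is mu2 - mu1 <= 0

# Estimated standard deviation of (X2_bar - X1_bar); identical for (X1_bar - X2_bar).
se = math.sqrt(s1**2 / n1 + s2**2 / n2)
cutoff = K + Z_01 * se
diff = x2 - x1               # stated in terms of (X2_bar - X1_bar), per the note
reject = diff > cutoff

print(round(se, 2), round(cutoff, 2), reject)
```

The observed difference of $30 falls well short of the cutoff of roughly $68.59, so under this reading $H_0$ is not rejected at a 99% level of confidence.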

12.4.2. Comparing Means of Two Independent Samples: At a $(1-\alpha)$ level of confidence, test $H_0: \mu_1 - \mu_2 \le K$ against $H_a: \mu_1 - \mu_2 > K$. Consider only the case where $n_1 \ge 30$ and $n_2 \ge 30$.
Decision Rule: At a confidence level of $(1-\alpha)$, reject $H_0$ if $(\bar{X}_1 - \bar{X}_2) > K + Z_\alpha \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$.
Note: If the hypothesis is stated in terms of $(\mu_2 - \mu_1)$, state the decision rule in terms of $(\bar{X}_2 - \bar{X}_1)$.

Exercise Problem 3 from 12.8. Suppose you want to evaluate a program which helps a student prepare to take the GMAT. You have randomly selected 100 SU undergraduates, and randomly divided them into two groups of 50 each. Students in group 1 attend the training program. Students in group 2 do not attend the training program. Students in both groups subsequently take the GMAT.

Results.
Group 1: 50 students, average score of the sample is 610, sample standard deviation is 20.
Group 2: 50 students, average score of the sample is 580, sample standard deviation is 30.

3.(a) At a 95% level of confidence, test the null hypothesis that the training program does not increase a student's GMAT score by more than 20.
3.(b) At a 95% level of confidence, test the null hypothesis that the population average GMAT score of students who take the training would not exceed 600.

12.3. $H_0: \mu \le K$, $H_a: \mu > K$. At confidence $(1-\alpha)$:
Reject $H_0$ if $\bar{X} > K + Z_\alpha \frac{s}{\sqrt{n}}$ if $n \ge 30$.
Reject $H_0$ if $\bar{X} > K + t_\alpha \frac{s}{\sqrt{n}}$ if $n < 30$ (degrees of freedom of $t$ is $n-1$).

12.4.2. Two independent samples, $H_0: \mu_1 - \mu_2 \le K$, $H_a: \mu_1 - \mu_2 > K$. At confidence $(1-\alpha)$, reject $H_0$ if $\bar{X}_1 - \bar{X}_2 > K + Z_\alpha \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$.

Given: $n_1 = 50$, $\bar{X}_1 = 610$, $s_1 = 20$; $n_2 = 50$, $\bar{X}_2 = 580$, $s_2 = 30$.

3.(a) At a 95% level of confidence, test the null hypothesis that the training program does not increase a student's GMAT score by more than 20.
Process: Treat the two samples as independent samples and use the decision rule from 12.4.2.
$H_0$: ____
$H_a$: ____
Decision Rule: At a 95% level of confidence, reject $H_0$ if ____
Conclusion: ____

Given: $n_1 = 50$, $\bar{X}_1 = 610$, $s_1 = 20$; $n_2 = 50$, $\bar{X}_2 = 580$, $s_2 = 30$.

3.(b) At a 95% level of confidence, test the null hypothesis that the population average GMAT score of students who take the training would not exceed 600.
Process: Focus only on the sub-population of students who took the training and use the decision rule from 12.3: for $n \ge 30$, reject $H_0: \mu \le K$ if $\bar{X} > K + Z_\alpha \frac{s}{\sqrt{n}}$.
$H_0$: ____
$H_a$: ____
$\bar{X}_1 =$ ____, $s_1 =$ ____, $n_1 =$ ____
Decision Rule: At a 95% level of confidence, reject $H_0$ if $\bar{X}_1 >$ ____
Conclusion: ____
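Both parts of Exercise Problem 3 can be checked with the two decision rules quoted on the slides. This is a sketch under the stated processes: part (a) treats the groups as independent samples with $K = 20$, and part (b) uses group 1 alone with $K = 600$.

```python
import math

Z_05 = 1.645  # from the slides

n1, x1, s1 = 50, 610, 20   # group 1: attended the training
n2, x2, s2 = 50, 580, 30   # group 2: did not attend

# 3(a): H0: mu1 - mu2 <= 20, rule from 12.4.2.
cutoff_a = 20 + Z_05 * math.sqrt(s1**2 / n1 + s2**2 / n2)
reject_a = (x1 - x2) > cutoff_a

# 3(b): H0: mu1 <= 600, rule from 12.3 with n1 >= 30.
cutoff_b = 600 + Z_05 * s1 / math.sqrt(n1)
reject_b = x1 > cutoff_b

print(round(cutoff_a, 2), reject_a)
print(round(cutoff_b, 2), reject_b)
```

The cutoffs come out near 28.39 for part (a) and 604.65 for part (b); the observed values (a difference of 30, and a mean of 610) exceed both, so each null hypothesis is rejected at a 95% level of confidence.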

12.5.2 Comparing Means of Two Related Samples: At a $(1-\alpha)$ level of confidence, test $H_0: (\mu_1 - \mu_2) \le K$ against $H_a: (\mu_1 - \mu_2) > K$, where $K$ is a specified number.

Restate Problem: Let $d = X_1 - X_2$. Then $\mu_d = \mu_1 - \mu_2$. Hence, we are doing the following test: at a $(1-\alpha)$ level of confidence, test $H_0: \mu_d \le K$ against $H_a: \mu_d > K$.

Decision Rule: The test is identical to the one-sided hypothesis test using one sample, discussed in Section 12.3, with $x$ replaced by $d$. Proceed as follows:
(1) For each pair $i$, compute $d_i = x_{1i} - x_{2i}$.
(2) Compute $\bar{d} = \frac{\sum d_i}{n}$ and $s_d = \sqrt{\frac{\sum (d_i - \bar{d})^2}{n-1}}$.
(3) Depending on sample size, the decision rule is given as follows:
Case 1. $n \ge 30$: At a confidence level of $(1-\alpha)$, reject $H_0$ if $\bar{d} > K + Z_\alpha \frac{s_d}{\sqrt{n}}$.
Case 2. $n < 30$: At a confidence level of $(1-\alpha)$, reject $H_0$ if $\bar{d} > K + t_\alpha \frac{s_d}{\sqrt{n}}$, where $t$ has $(n-1)$ degrees of freedom.

Example 12.5. Suppose you have selected a simple random sample of 9 Syracuse University undergraduate students and, for each student, recorded how much money (s)he spends in an average week on (1) snacks and (2) alcoholic beverages. Results:

Student #   $ spent on snacks/week   $ spent on alcoholic beverages/week
1           10                       25
2           10                       10
3           20                       0
4           40                       40
5           5                        25
6           25                       35
7           30                       40
8           20                       30
9           15                       15

At a 99% level of confidence, test the null hypothesis that on the average, an SU undergraduate student does not spend more money on alcoholic beverages than on snacks.

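Example 12.5 (continued)

Student #   $ spent on snacks per week   $ spent on alcoholic beverages per week   $d_i = x_{2i} - x_{1i}$   $(d_i - \bar{d})$   $(d_i - \bar{d})^2$
1           10                           25
2           10                           10
3           20                           0
4           40                           40
5           5                            25
6           25                           35
7           30                           40
8           20                           30
9           15                           15

$H_0$: ____
$H_a$: ____
$\sum d_i =$ ____, $\sum (d_i - \bar{d})^2 =$ ____
$\bar{d} =$ ____
$s_d = \sqrt{\frac{\sum (d_i - \bar{d})^2}{n-1}} =$ ____
Decision Rule: At a 99% level of confidence, reject $H_0$ if $\bar{d} >$ ____
Conclusion: ____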
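The worksheet columns above can be filled in programmatically. The sketch below follows the slide's definition $d_i = x_{2i} - x_{1i}$ (alcoholic beverages minus snacks) with $K = 0$. The critical value `T_01_DF8 = 2.896` is an assumption taken from a standard t-table (one-tailed $\alpha = .01$, 8 degrees of freedom); it is not given on the slides.

```python
import math

snacks = [10, 10, 20, 40, 5, 25, 30, 20, 15]    # x_1i
alcohol = [25, 10, 0, 40, 25, 35, 40, 30, 15]   # x_2i
T_01_DF8 = 2.896  # assumed t-table value: alpha = .01, df = 9 - 1 = 8

d = [a - s for a, s in zip(alcohol, snacks)]    # d_i = x_2i - x_1i
n = len(d)
d_bar = sum(d) / n
sum_sq = sum((di - d_bar) ** 2 for di in d)     # sum of (d_i - d_bar)^2
s_d = math.sqrt(sum_sq / (n - 1))

cutoff = 0 + T_01_DF8 * s_d / math.sqrt(n)      # K = 0, Case 2 (n < 30)
reject = d_bar > cutoff

print(d, d_bar, sum_sq, round(s_d, 3), round(cutoff, 2), reject)
```

With these data $\bar{d} = 5$ and $s_d \approx 11.73$, giving a cutoff of about 11.32, so the null hypothesis is not rejected at a 99% level of confidence.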

Marketing Research Session 11 Coverage of Session 11: Chi-Square Analysis with Cross-Tabulations (Chapter 13)

Chi-Square Test with Cross-Tabulation (Chapter 13)
Learning Objectives:
- Meaning of no relationship.
- The chi-square test:
  - Compute expected frequencies.
  - Compute chi-square ($\chi^2$).
  - Compute degrees of freedom.
  - Do the test.
- Check if the test is valid, and combine rows and/or columns as necessary to have a valid test.
- Important application: Test if a population proportion is the same in two or more sub-populations.

Example 13.1
We selected a simple random sample of 150 students from a college campus, and recorded (i) the gender of the student, and (ii) whether the student has attended a basketball game played by the college team during the past year. The results are expressed as the following $2 \times 2$ cross-tabulation:

          Didn't Attend Game   Attended Game
Male      30                   60
Female    42                   18

$H_0$: There is no relationship between gender and attendance.

Intuition of $H_0$: There are two sub-populations: men and women. A member of either sub-population belongs to one of two categories: did not attend, and attended. If $H_0$ is true, then each sub-population (men or women) should have the same percentage breakdown between the two categories of attendance.

Formally, define:
$\pi_{11}$ = Proportion of men who did not attend
$\pi_{12}$ = Proportion of men who attended
$\pi_{21}$ = Proportion of women who did not attend
$\pi_{22}$ = Proportion of women who attended
$H_0$ means: $\pi_{11} = \pi_{21}$, $\pi_{12} = \pi_{22}$.

          Didn't Attend Game   Attended Game
Male      30                   60
Female    42                   18

$H_0$: There is no relationship between gender and attendance.

Expected Frequencies: In the whole sample of 150 students:
Proportion that did not attend $= \frac{\text{Total of Column 1}}{n} = \frac{72}{150}$
Proportion that attended $= \frac{\text{Total of Column 2}}{n} = \frac{78}{150}$

Number of men in sample = Total of Row 1 = 90
Expected number of men that did not attend ($E_{11}$) $= 90 \times \frac{72}{150} = \frac{90 \times 72}{150} = 43.2$
Expected number of men that attended ($E_{12}$) $= 90 \times \frac{78}{150} = \frac{90 \times 78}{150} = 46.8$

Number of women in sample = Total of Row 2 = 60
Expected number of women that did not attend ($E_{21}$) $= 60 \times \frac{72}{150} = \frac{60 \times 72}{150} = 28.8$
Expected number of women that attended ($E_{22}$) $= 60 \times \frac{78}{150} = \frac{60 \times 78}{150} = 31.2$

More Generally: $E_{ij} = \frac{\text{Total of Row } i \times \text{Total of Column } j}{\text{Sample Size } (n)}$

$E_{ij} = \frac{\text{Total of Row } i \times \text{Total of Column } j}{\text{Sample Size } (n)}$

Chi-Square:
$$\chi^2 = \sum_{i=1}^{R} \sum_{j=1}^{C} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
(Compute $\frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}$ in each cell, and then sum over all cells.)
$R$ = number of rows, $C$ = number of columns

Degrees of freedom $= (R-1) \times (C-1)$

Decision Rule: At a confidence level $(1-\alpha)$, reject $H_0$ if $\chi^2 > \chi^2_\alpha$ at degrees of freedom $(R-1) \times (C-1)$.

Return to Example 13.1
We selected a simple random sample of 150 students from a college campus, and recorded (i) the gender of the student, and (ii) whether the student has attended a basketball game played by the college team during the past year. The results are expressed as the following $2 \times 2$ cross-tabulation:

          Didn't Attend Game   Attended Game
Male      30                   60
Female    42                   18

At a 95% level of confidence, test the null hypothesis that there is no relationship between gender and attendance (against the alternate hypothesis that there is some kind of relationship between the two).

Here:
$O_{11} =$ ____, $E_{11} =$ ____; $O_{12} =$ ____, $E_{12} =$ ____
$O_{21} =$ ____, $E_{21} =$ ____; $O_{22} =$ ____, $E_{22} =$ ____
$\chi^2 = \frac{(O_{11} - E_{11})^2}{E_{11}} + \frac{(O_{12} - E_{12})^2}{E_{12}} + \frac{(O_{21} - E_{21})^2}{E_{21}} + \frac{(O_{22} - E_{22})^2}{E_{22}} =$ ____
Degrees of freedom $= (2-1) \times (2-1) =$ ____
Decision Rule: At a 95% level of confidence, reject $H_0$ if $\chi^2 >$ ____
Conclusion: ____
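The expected frequencies, the $\chi^2$ statistic, and the degrees of freedom for Example 13.1 can be computed directly from the formulas above. In this sketch the function name `chi_square` is illustrative, and `CHI2_05_DF1 = 3.841` is an assumed value from a standard chi-square table ($\alpha = .05$, 1 degree of freedom), which the slides do not list.

```python
def chi_square(observed):
    """Return (expected, statistic, df) for a cross-tabulation given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    # E_ij = (total of row i) * (total of column j) / n
    expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return expected, statistic, df

CHI2_05_DF1 = 3.841  # assumed chi-square table value: alpha = .05, df = 1

observed = [[30, 60],   # Male: didn't attend, attended
            [42, 18]]   # Female: didn't attend, attended
expected, statistic, df = chi_square(observed)
reject = statistic > CHI2_05_DF1
print(expected, round(statistic, 2), df, reject)
```

The expected frequencies match the 43.2, 46.8, 28.8, and 31.2 derived earlier, and the statistic of about 19.39 far exceeds the assumed critical value, so $H_0$ is rejected at a 95% level of confidence.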

Validity of Chi-Square Test
Need:
- $E_{ij} > 1$ in all cells.
- $E_{ij} \ge 5$ in 80% or more of the cells.
Note:
- We may combine any rows or columns. If we have scale variables (e.g., 1-7 scales), combine adjacent rows or columns for ease of interpretation.
- If we combine any two rows or any two columns, the observed numbers add up, and the expected numbers also add up.
- If you need to modify a table, any valid modification is acceptable. In Minitab you do that by recoding variables.
- The final modified table must have at least two rows and at least two columns.
- After you modify the table, the degrees of freedom come from the number of rows and columns of the modified table.

Example 13.3
We collected a simple random sample of 60 students from a college campus and asked them to rate how much they like to watch professional sports on TV on a 1-7 scale (strongly dislike to strongly like). We also noted the gender of each respondent. Based on the results, we have constructed the following cross-tabulation:

          Like to watch professional sports on TV
Gender    1    2    3    4    5    6    7
Male      2    0    4    12   6    8    8
Female    4    3    2    6    3    1    1

At a 99% level of confidence, test the null hypothesis that gender is not related to how much one likes to watch professional sports on TV.

Process: First augment the cross-tabulation by row totals and column totals:

                 Like to watch sports on TV
Gender           1    2    3    4    5    6    7    Row Totals
Male             2    0    4    12   6    8    8    40
Female           4    3    2    6    3    1    1    20
Column Totals    6    3    6    18   9    9    9

Then compute the expected frequencies in the original table.
$E_{11} = \frac{40 \times 6}{60} =$ ____, $E_{12} = \frac{40 \times 3}{60} =$ ____, $E_{13} = \frac{40 \times 6}{60} =$ ____, $E_{14} = \frac{40 \times 18}{60} =$ ____
$E_{15} = \frac{40 \times 9}{60} =$ ____, $E_{16} = \frac{40 \times 9}{60} =$ ____, $E_{17} = \frac{40 \times 9}{60} =$ ____
$E_{21} = \frac{20 \times 6}{60} =$ ____, $E_{22} = \frac{20 \times 3}{60} =$ ____, $E_{23} = \frac{20 \times 6}{60} =$ ____, $E_{24} = \frac{20 \times 18}{60} =$ ____
$E_{25} = \frac{20 \times 9}{60} =$ ____, $E_{26} = \frac{20 \times 9}{60} =$ ____, $E_{27} = \frac{20 \times 9}{60} =$ ____

Is it valid to use the chi-square test with the original table?

$E_{11} = \frac{40 \times 6}{60} = 4$, $E_{12} = \frac{40 \times 3}{60} = 2$, $E_{13} = \frac{40 \times 6}{60} = 4$, $E_{14} = \frac{40 \times 18}{60} = 12$
$E_{15} = \frac{40 \times 9}{60} = 6$, $E_{16} = \frac{40 \times 9}{60} = 6$, $E_{17} = \frac{40 \times 9}{60} = 6$
$E_{21} = \frac{20 \times 6}{60} = 2$, $E_{22} = \frac{20 \times 3}{60} = 1$, $E_{23} = \frac{20 \times 6}{60} = 2$, $E_{24} = \frac{20 \times 18}{60} = 6$
$E_{25} = \frac{20 \times 9}{60} = 3$, $E_{26} = \frac{20 \times 9}{60} = 3$, $E_{27} = \frac{20 \times 9}{60} = 3$

Writing compactly, the expected frequencies ($E_{ij}$'s) are:

          Like to watch professional sports on TV
Gender    1    2    3    4    5    6    7
Male      4    2    4    12   6    6    6
Female    2    1    2    6    3    3    3

Items to check:
- Is $E_{ij} > 1$ in all cells?
- If yes, then is $E_{ij} \ge 5$ in 80% or more of the cells?
Note: If either condition fails, you have to combine columns to get a valid test. Since there are only two rows, you cannot combine rows here. If you had more than two rows, you could have combined rows. (Note that the final table must have at least two rows and at least two columns.)
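The two items to check can be expressed as a small helper. This is a sketch of the validity conditions as stated on the slides; the function name `chi_square_valid` is made up for illustration.

```python
def chi_square_valid(expected):
    """Check the two validity conditions on a table of expected frequencies.

    Returns (valid, all_above_one, share_at_least_five).
    """
    cells = [e for row in expected for e in row]
    all_above_one = all(e > 1 for e in cells)                        # E_ij > 1 in all cells
    share_at_least_five = sum(e >= 5 for e in cells) / len(cells)    # fraction with E_ij >= 5
    valid = all_above_one and share_at_least_five >= 0.8
    return valid, all_above_one, share_at_least_five

# Expected frequencies of the original table in Example 13.3.
expected = [[4, 2, 4, 12, 6, 6, 6],
            [2, 1, 2, 6, 3, 3, 3]]
valid, all_above_one, share = chi_square_valid(expected)
print(valid, all_above_one, round(share, 3))
```

For the original table the first condition already fails, because $E_{22} = 1$ is not greater than 1, and only 5 of the 14 cells have $E_{ij} \ge 5$; columns therefore have to be combined.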

Old Table of Observed Frequencies ($O_{ij}$'s):

          Like to watch professional sports on TV
Gender    1    2    3    4    5    6    7
Male      2    0    4    12   6    8    8
Female    4    3    2    6    3    1    1

Old Table of Expected Frequencies ($E_{ij}$'s):

          Like to watch professional sports on TV
Gender    1    2    3    4    5    6    7
Male      4    2    4    12   6    6    6
Female    2    1    2    6    3    3    3

New Table: Observed and Expected Frequencies: ____

Degrees of freedom $= (\text{\_\_} - 1) \times (\text{\_\_} - 1) =$ ____
$\chi^2 =$ ____
Decision Rule: At a 99% level of confidence, reject $H_0$ if $\chi^2 >$ ____
Conclusion: ____
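Since any valid modification is acceptable, the sketch below picks one possibility as an assumption: combining adjacent ratings into three columns (1-3, 4, and 5-7). Other groupings may be equally valid. `CHI2_01_DF2 = 9.210` is an assumed chi-square table value ($\alpha = .01$, 2 degrees of freedom), and the helper names are illustrative.

```python
def merge_columns(table, groups):
    """Combine columns: each group is a tuple of column indices whose counts are added."""
    return [[sum(row[j] for j in group) for group in groups] for row in table]

def chi_square(observed):
    """Return (expected, statistic, df) for a cross-tabulation given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]
    statistic = sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return expected, statistic, df

CHI2_01_DF2 = 9.210  # assumed chi-square table value: alpha = .01, df = 2

observed = [[2, 0, 4, 12, 6, 8, 8],
            [4, 3, 2, 6, 3, 1, 1]]
groups = [(0, 1, 2), (3,), (4, 5, 6)]   # assumed grouping: ratings 1-3, 4, 5-7
merged = merge_columns(observed, groups)
expected, statistic, df = chi_square(merged)
reject = statistic > CHI2_01_DF2
print(merged, expected, round(statistic, 2), df, reject)
```

Under this particular grouping every expected frequency is at least 5, the statistic is about 7.47 with 2 degrees of freedom, and $H_0$ is not rejected at a 99% level of confidence. A different valid grouping would give a different table and possibly a different statistic.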

Problem 6, sample test: Suppose a researcher has selected a simple random sample of 50 Syracuse University students, and asked them to rate how satisfied they are with the Syracuse University Parking Services on a 1-5 scale (1 very dissatisfied, 3 neutral, 5 very satisfied). The researcher also noted whether the respondent has a vehicle. The results are summarized as the following cross-tab:

                        Satisfaction
Vehicle Ownership       1    2    3    4    5
Has vehicle             9    9    6    4    2
Doesn't have vehicle    1    6    9    2    2

Suppose you want to test, at a 95% level of confidence, the null hypothesis that there is no association between vehicle ownership and satisfaction level with the Syracuse University parking services. In the present case, is it valid to use the chi-square test with the original cross-tab? Clearly state yes or no. If no, modify the original cross-tab so that a chi-square test becomes valid. Using the original or modified cross-tab as appropriate, perform a chi-square test of the null hypothesis at a 95% level of confidence.

                        Satisfaction
Vehicle Ownership       1    2    3    4    5    Row Totals
Has vehicle             9    9    6    4    2    30
Doesn't have vehicle    1    6    9    2    2    20
Column Totals           10   15   15   6    4    n = 50

$E_{11} =$ ____, $E_{12} =$ ____, $E_{13} =$ ____, $E_{14} =$ ____, $E_{15} =$ ____
$E_{21} =$ ____, $E_{22} =$ ____, $E_{23} =$ ____, $E_{24} =$ ____, $E_{25} =$ ____

Items to check:
- Is $E_{ij} > 1$ in all cells?
- If yes, then is $E_{ij} \ge 5$ in 80% or more of the cells, that is, in at least 8 out of the 10 cells?
Note: If either condition fails, you have to combine columns to get a valid test. Since there are only two rows, you cannot combine rows here. If you had more than two rows, you could have combined rows. (Note that the final table must have at least two rows and at least two columns.)

Table of observed frequencies ($O_{ij}$'s):

                        Satisfaction
Vehicle Ownership       1    2    3    4    5
Has vehicle             9    9    6    4    2
Doesn't have vehicle    1    6    9    2    2

Table of expected frequencies ($E_{ij}$'s):

                        Satisfaction
Vehicle Ownership       1    2    3    4    5
Has vehicle
Doesn't have vehicle

New Table: Observed and Expected Frequencies: ____

Degrees of freedom $= (\text{\_\_} - 1) \times (\text{\_\_} - 1) =$ ____
$\chi^2 =$ ____
Decision Rule: At a 95% level of confidence, reject $H_0$ if $\chi^2 >$ ____
Conclusion: ____

An Important Application of Chi-Square Test

Example 1: Suppose you have collected two separate simple random samples from the male and female students of a college campus. Results:
(1) Male sample: $n_1 = 100$, 70 watch sports on TV every week.
(2) Female sample: $n_2 = 200$, 80 watch sports on TV every week.
At a 99% level of confidence, test the null hypothesis that an equal proportion of male and female students watch sports on TV.

Approach: Express as a cross-tabulation:

          Do not Watch   Watch
Male      30             70
Female    120            80

Use the chi-square test to test the null hypothesis that there is no relationship between gender and watching sports on TV.

Logic:
(1) Male sample: $n_1 = 100$, 70 watch sports on TV every week.
(2) Female sample: $n_2 = 200$, 80 watch sports on TV every week.
Expressed as a cross-tabulation:

          Do not Watch   Watch
Male      30             70
Female    120            80

Let:
$\pi_{11}$ = fraction of men who do not watch
$\pi_{12}$ = fraction of men who watch
$\pi_{21}$ = fraction of women who do not watch
$\pi_{22}$ = fraction of women who watch

Clearly, $\pi_{11} + \pi_{12} = 1$ and $\pi_{21} + \pi_{22} = 1$, that is, $\pi_{11} = 1 - \pi_{12}$ and $\pi_{21} = 1 - \pi_{22}$. Therefore, if $\pi_{12} = \pi_{22}$, we also have $\pi_{11} = \pi_{21}$.

Hence, writing $\pi_1 \equiv \pi_{12}$ and $\pi_2 \equiv \pi_{22}$, the chi-square test here is equivalent to testing $H_0: \pi_1 = \pi_2$ (same proportion of men and women watch) against $H_a: \pi_1 \ne \pi_2$.

Example 1 (continued):

                 Do not Watch   Watch   Row Totals
Male             30             70
Female           120            80
Column Totals                           n = 300

$E_{11} =$ ____, $E_{12} =$ ____
$E_{21} =$ ____, $E_{22} =$ ____
$\chi^2 = \frac{(O_{11} - E_{11})^2}{E_{11}} + \frac{(O_{12} - E_{12})^2}{E_{12}} + \frac{(O_{21} - E_{21})^2}{E_{21}} + \frac{(O_{22} - E_{22})^2}{E_{22}} =$ ____
Decision Rule: At a 99% level of confidence, reject $H_0$ if $\chi^2 > \chi^2_{.01}$ at df $= (2-1) \times (2-1) = 1$.
Conclusion: ____

Example 2: Suppose you have collected simple random samples from three sub-populations: Business majors, Engineering majors, and other majors. For each respondent, you recorded if (s)he reads the Wall Street Journal every week. Results:
(1) Business sample: $n_1 = 100$, 50 read WSJ every week.
(2) Engineering sample: $n_2 = 50$, 15 read WSJ every week.
(3) Other sample: $n_3 = 150$, 25 read WSJ every week.
At a 99% level of confidence, test the null hypothesis that an equal proportion of business, engineering, and other students read WSJ every week.

Approach: Express as a cross-tabulation:

               Do not Read   Read
Business       50            50
Engineering    35            15
Other          125           25

Use the chi-square test to test the null hypothesis that there is no relationship between major and reading WSJ every week.

Example 2 (continued):

                 Do not Read   Read   Row Totals
Business         50            50
Engineering      35            15
Other            125           25
Column Totals                         Sample Size = 300

$E_{11} =$ ____, $E_{12} =$ ____
$E_{21} =$ ____, $E_{22} =$ ____
$E_{31} =$ ____, $E_{32} =$ ____
$\chi^2 = \frac{(O_{11} - E_{11})^2}{E_{11}} + \frac{(O_{12} - E_{12})^2}{E_{12}} + \frac{(O_{21} - E_{21})^2}{E_{21}} + \frac{(O_{22} - E_{22})^2}{E_{22}} + \frac{(O_{31} - E_{31})^2}{E_{31}} + \frac{(O_{32} - E_{32})^2}{E_{32}} =$ ____
Decision Rule: At a 99% level of confidence, reject $H_0$ if $\chi^2 > \chi^2_{.01}$ at df $= (3-1) \times (2-1) = 2$.
Conclusion: ____

More Generally: Suppose you are testing if an equal proportion of $k$ sub-populations have a property of interest (e.g., read the Wall Street Journal every week), that is, $\pi_1 = \pi_2 = \dots = \pi_k$.
This is equivalent to a chi-square test with a $k \times 2$ cross-tabulation where each row comes from one sub-population, and the two columns are "do not have property" and "have property".
Express the data as a $k \times 2$ cross-tabulation. For any sub-population:
Number who do not have property = Size of the sample from the sub-population − Number from sub-population who have property
Assuming the test is valid, reject $H_0$ if $\chi^2$ exceeds $\chi^2_\alpha$ at degrees of freedom $(k-1) \times (2-1) = k-1$.
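The general recipe can be sketched as follows and applied to Examples 1 and 2. The helper names are illustrative, and the critical values 6.635 ($\alpha = .01$, df = 1) and 9.210 ($\alpha = .01$, df = 2) are assumed from a standard chi-square table rather than taken from the slides.

```python
def proportions_table(samples):
    """Build a k x 2 table [do not have property, have property] from (n_i, count_i) pairs."""
    return [[n - x, x] for n, x in samples]

def chi_square(observed):
    """Return (statistic, df) for a cross-tabulation given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n   # expected frequency E_ij
            statistic += (o - e) ** 2 / e
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df

# Assumed chi-square table values for alpha = .01.
CHI2_01 = {1: 6.635, 2: 9.210}

# Example 1: males (n=100, 70 watch), females (n=200, 80 watch).
table1 = proportions_table([(100, 70), (200, 80)])
stat1, df1 = chi_square(table1)

# Example 2: business (100, 50 read), engineering (50, 15), other (150, 25).
table2 = proportions_table([(100, 50), (50, 15), (150, 25)])
stat2, df2 = chi_square(table2)

print(table1, round(stat1, 2), df1, stat1 > CHI2_01[df1])
print(table2, round(stat2, 2), df2, stat2 > CHI2_01[df2])
```

In both examples all expected frequencies are at least 15, so the validity conditions hold. The statistics are 24 (df = 1) and about 31.75 (df = 2), each above its assumed critical value, so equality of proportions is rejected at a 99% level of confidence in both cases.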

Marketing Research Session 12 Coverage of Session 12: Regression Analysis (Chapter 14)
- Meaning of model
- $R^2$ and $F$-test

Regression Analysis
Basic Model Form: $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_m X_m + \epsilon$
- The $\beta$'s are the same for all cases. These are the regression parameters. We are estimating $(m+1)$ parameters here.
- $Y$ is the dependent variable. It is a quantitative variable. Strictly speaking, $Y$ should have at least interval scale properties.
- The $X$'s are independent variables. We consider three types of independent variables:
  - Quantitative variable with at least an interval scale property.
  - Dummy variable.
  - Product of a dummy variable and an interval scaled variable.
Meaning of Parameters in a Special Case: If all the independent variables are interval scaled variables, then $\beta_i$ is how much $Y$ changes on the average as $X_i$ increases by a unit, keeping all the other $X$'s fixed.

Example of Standard Regression Model
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \epsilon$
$Y$ = per capita sales of brand X in a sales territory (in dollars)
$X_1$ = per capita advertising expenditure by brand X in the territory (in dollars)
$X_2$ = per capita personal selling expenditure by brand X in the territory (in dollars)
$X_3$ = per capita sales promotion expenditure by brand X in the territory (in dollars)
$X_4$ = price/unit of brand X (in dollars)

Then: $E(Y \mid X_1, X_2, X_3, X_4) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4$, with $m = 4$.
$\beta_1$ = on the average, how much $Y$ changes if $X_1$ changes by a unit, but $X_2$, $X_3$, and $X_4$ are held constant.
$\beta_2$ = on the average, how much $Y$ changes if $X_2$ changes by a unit, but $X_1$, $X_3$, and $X_4$ are held constant.
$\beta_3$ = on the average, how much $Y$ changes if $X_3$ changes by a unit, but $X_1$, $X_2$, and $X_4$ are held constant.
$\beta_4$ = on the average, how much $Y$ changes if $X_4$ changes by a unit, but $X_1$, $X_2$, and $X_3$ are held constant.

Examples from Section 14.2
Context:
$Y$ = starting annual salary of a student who got an undergraduate business degree in 2003.
$D$ = 1 if the student graduated from a top 20 school, 0 if not
$X$ = cumulative grade point average

Model 14.2.1. Naive Model: $Y = \beta_0 + \epsilon$
In this model, $E(Y)$ is the same for all students regardless of $D$ or $X$. The estimate of the coefficient $\beta_0$ is $\bar{Y}$, the sample mean.

Context:
$Y$ = starting annual salary of a student who got an undergraduate business degree in 2003.
$D$ = 1 if the student graduated from a top 20 school, 0 if not
$X$ = cumulative grade point average

Model 14.2.2: $Y = \beta_0 + \beta_1 D + \epsilon$
Writing separately for the two categories of $D$:
(1) $D = 0$ (not top 20): $Y = \beta_0 + \epsilon$.
(2) $D = 1$ (top 20): $Y = (\beta_0 + \beta_1) + \epsilon$.
Note:
- For the sub-population of graduates from top 20 schools: Average $Y = \beta_0 + \beta_1$.
- For other students: Average $Y = \beta_0$.
- $\beta_1$ = Average salary of top 20 graduates − Average salary of other graduates.
- For either category of students, a change in $X$ has no marginal effect on salary.

Model 14.2.3 (continued)
$E(Y \mid X) = \beta_0 + \beta_2 X$

[Figure: $E(Y \mid X)$ against $X$; a single line with intercept $\beta_0$ and slope $\beta_2$]

Note:
- Intuitively, we are dividing the population into sub-populations, where all graduates in a given sub-population have the same $X$. For a given sub-population, Average $Y = \beta_0 + \beta_2 X$.
- $\beta_0$ is the intercept, and $\beta_2$ is the slope of the regression line.
- $\beta_2$ is also called the marginal effect of $X$ on $Y$.

Context:
$Y$ = starting annual salary of a student who got an undergraduate business degree in 2003.
$D$ = 1 if the student graduated from a top 20 school, 0 if not
$X$ = cumulative grade point average

Model 14.2.4: $Y = \beta_0 + \beta_1 D + \beta_2 X + \epsilon$ (14.6)
Writing separately for the two categories:
Top 20 Graduate ($D = 1$): $Y = (\beta_0 + \beta_1) + \beta_2 X + \epsilon$
Other Graduate ($D = 0$): $Y = \beta_0 + \beta_2 X + \epsilon$

[Figure: $E(Y \mid X)$ against $X$; the $D = 1$ line has intercept $\beta_0 + \beta_1$ and slope $\beta_2$, the $D = 0$ line has intercept $\beta_0$ and slope $\beta_2$]

14.2.4 (continued):
Top 20 Graduate ($D = 1$): $Y = (\beta_0 + \beta_1) + \beta_2 X + \epsilon$
Other Graduate ($D = 0$): $Y = \beta_0 + \beta_2 X + \epsilon$

Note:
- We are allowing the regression line to be different for the two categories of graduates.
- Both regression lines have the same slope $\beta_2$.
- The intercepts may be different for the two lines.

Context:
$Y$ = starting annual salary of a student who got an undergraduate business degree in 2003.
$D$ = 1 if the student graduated from a top 20 school, 0 if not
$X$ = cumulative grade point average

Model 14.2.5: $Y = \beta_0 + \beta_1 D + \beta_2 X + \beta_3 D X + \epsilon$
Writing separately for the two categories:
Top 20 Graduate ($D = 1$, $DX = X$): $Y = (\beta_0 + \beta_1) + (\beta_2 + \beta_3) X + \epsilon$
Other Graduate ($D = 0$, $DX = 0$): $Y = \beta_0 + \beta_2 X + \epsilon$

[Figure: $E(Y \mid X)$ against $X$; the $D = 1$ line has intercept $\beta_0 + \beta_1$ and slope $\beta_2 + \beta_3$, the $D = 0$ line has intercept $\beta_0$ and slope $\beta_2$]

14.2.5 (continued):
Top 20 Graduate ($D = 1$, $DX = X$): $Y = (\beta_0 + \beta_1) + (\beta_2 + \beta_3) X + \epsilon$
Other Graduate ($D = 0$, $DX = 0$): $Y = \beta_0 + \beta_2 X + \epsilon$

Note:
- We are allowing the regression line to be different for the two categories of graduates.
- The intercepts may be different for the two lines. If $\beta_1 = 0$, then the intercepts are equal.
- The slopes may be different for the two lines. If $\beta_3 = 0$, the slopes are equal.

Meaning of Parameters in More General Cases
Context:
$Y$ = starting annual salary of a student who got an undergraduate business degree in 2006.
Job Types: Accounting, Finance, Marketing, Other
$D_1$ = 1 if Accounting, 0 if Finance, Marketing, or Other
$D_2$ = 1 if Finance, 0 if Accounting, Marketing, or Other
$D_3$ = 1 if Marketing, 0 if Accounting, Finance, or Other
$X$ = GPA on a 1-4 scale

Model 1: $Y = \beta_0 + \beta_1 X + \epsilon$
$E(Y \mid X) = \beta_0 + \beta_1 X$

[Figure: $E(Y \mid X)$ against $X$; a single line with intercept $\beta_0$]

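Model 2: $Y = \beta_0 + \beta_1 D_1 + \beta_2 D_2 + \beta_3 D_3 + \epsilon$

[Figure: $E(Y \mid X)$ against $X$, with intercepts marked at $\beta_0$, $\beta_0 + \beta_1$, $\beta_0 + \beta_2$, and $\beta_0 + \beta_3$]

Key: Write the model down separately for the four job types.
Accounting: ____
Finance: ____
Marketing: ____
Other: ____

What is the meaning of:
(i) $\beta_3 = 0$
(ii) $\beta_1 = \beta_2$
(iii) $\beta_1 = \beta_2 = \beta_3$
(iv) $\beta_1 = \beta_2 = \beta_3 = 0$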

Model 3: $Y = \beta_0 + \beta_1 D_1 + \beta_2 D_2 + \beta_3 D_3 + \beta_4 X + \epsilon$

[Figure: $E(Y \mid X)$ against $X$; four lines with intercepts $\beta_0$, $\beta_0 + \beta_1$, $\beta_0 + \beta_2$, and $\beta_0 + \beta_3$]

Note: The slope is the same ($\beta_4$) for all four lines.
Key: Write the model down separately for the four job types.
Accounting: ____
Finance: ____
Marketing: ____
Other: ____

Questions:
(1) What do the lines become if $\beta_4 = 0$?
(2) What do the lines become if $\beta_1 = \beta_2 = \beta_3$?
(3) What do the lines become if $\beta_1 = \beta_2 = \beta_3 = 0$?

Model 4: $Y = \beta_0 + \beta_1 D_1 + \beta_2 D_2 + \beta_3 D_3 + \beta_4 X + \beta_5 D_1 X + \beta_6 D_2 X + \beta_7 D_3 X + \epsilon$

[Figure: $E(Y \mid X)$ against $X$; four lines with intercepts $\beta_0$, $\beta_0 + \beta_1$, $\beta_0 + \beta_2$, and $\beta_0 + \beta_3$]

Note: The slope may be different for the four lines.
Accounting: ____
Finance: ____
Marketing: ____
Other: ____

Questions:
(1) What do the lines become if $\beta_5 = \beta_6 = \beta_7 = 0$?
(2) What do the lines become if $\beta_1 = \beta_2 = \beta_3 = \beta_5 = \beta_6 = \beta_7 = 0$?
(3) What do the lines become if $\beta_4 = \beta_5 = \beta_6 = \beta_7 = 0$?

Accounting: $Y = (\beta_0 + \beta_1) + (\beta_4 + \beta_5) X + \epsilon$
Finance: $Y = (\beta_0 + \beta_2) + (\beta_4 + \beta_6) X + \epsilon$
Marketing: $Y = (\beta_0 + \beta_3) + (\beta_4 + \beta_7) X + \epsilon$
Other: $Y = \beta_0 + \beta_4 X + \epsilon$

Consider each case in turn:
(1) $\beta_5 = \beta_6 = \beta_7 = 0$
(2) $\beta_1 = \beta_2 = \beta_3 = \beta_5 = \beta_6 = \beta_7 = 0$
(3) $\beta_4 = \beta_5 = \beta_6 = \beta_7 = 0$

$Y = \beta_0 + \beta_1 D_1 + \beta_2 D_2 + \beta_3 D_3 + \beta_4 X + \beta_5 D_1 X + \beta_6 D_2 X + \beta_7 D_3 X + \epsilon$

[Figure: $E(Y \mid X)$ against $X$; four lines with intercepts $\beta_0$, $\beta_0 + \beta_1$, $\beta_0 + \beta_2$, and $\beta_0 + \beta_3$]

Accounting: $Y = (\beta_0 + \beta_1) + (\beta_4 + \beta_5) X + \epsilon$
Finance: $Y = (\beta_0 + \beta_2) + (\beta_4 + \beta_6) X + \epsilon$
Marketing: $Y = (\beta_0 + \beta_3) + (\beta_4 + \beta_7) X + \epsilon$
Other: $Y = \beta_0 + \beta_4 X + \epsilon$

State the following in terms of regression parameters:
(1) Marginal effect of GPA on salary is the same for Accounting and Finance jobs.
(2) GPA has no marginal effect on salary for Marketing jobs.
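To see how the dummy variables and interaction terms produce the four lines, the sketch below computes the intercept and slope of $E(Y \mid X)$ for each job type. The coefficient values are entirely hypothetical, chosen only so that $\beta_5 = \beta_6$ and $\beta_4 + \beta_7 = 0$; they are not estimates from any data in the course.

```python
# Dummy coding (D1, D2, D3) for each job type, as defined in the context slide.
DUMMIES = {
    "Accounting": (1, 0, 0),
    "Finance": (0, 1, 0),
    "Marketing": (0, 0, 1),
    "Other": (0, 0, 0),
}

def group_line(beta, job_type):
    """Return (intercept, slope) of E(Y|X) under Model 4; beta = [b0, b1, ..., b7]."""
    d1, d2, d3 = DUMMIES[job_type]
    intercept = beta[0] + beta[1] * d1 + beta[2] * d2 + beta[3] * d3
    slope = beta[4] + beta[5] * d1 + beta[6] * d2 + beta[7] * d3
    return intercept, slope

# Hypothetical coefficients with beta5 == beta6 and beta4 + beta7 == 0.
beta = [30000, 5000, 7000, 2000, 4000, 1500, 1500, -4000]

for job_type in DUMMIES:
    print(job_type, group_line(beta, job_type))
```

With these made-up values the Accounting and Finance lines share a slope because $\beta_5 = \beta_6$, and the Marketing line is flat because $\beta_4 + \beta_7 = 0$, which illustrates how statements about marginal effects translate into restrictions on the parameters.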

$R^2$ and $F$ Test
$n$ = sample size
$m + 1$ = number of regression parameters ($\beta_0, \beta_1, \dots, \beta_m$)
$$R^2 = 1 - \frac{\sum_j (Y_j - \hat{Y}_j)^2}{\sum_j (Y_j - \bar{Y})^2}$$
Important:
- For any regression model that includes $\beta_0$, $0 \le R^2 \le 1$.
- For the naive model $Y = \beta_0 + \epsilon$, the estimate of $\beta_0$ is $\bar{Y}$ (sample average of $Y$), and $R^2 = 0$.
- If we add another independent variable, $R^2$ cannot decrease.

$F$ Test:
$H_0$: $\beta_1 = \dots = \beta_k = 0$ (can be any $k$ of $\beta_1, \dots, \beta_m$).
$H_a$: At least one of the $\beta$'s listed in $H_0$ is not 0.
Full Regression: $Y$ against $X_1, \dots, X_m$.
Restricted Regression: $Y$ against the remaining variables after dropping the variables whose coefficients are set to 0 in $H_0$.
$$F = \left(\frac{R^2_{full} - R^2_{restricted}}{1 - R^2_{full}}\right) \left(\frac{n - m - 1}{k}\right)$$
At a $(1-\alpha)$ level of confidence, reject $H_0$ if $F > F_\alpha$ at degrees of freedom $(k, n - m - 1)$.
Note: If $H_0$: $\beta_1 = \dots = \beta_m = 0$, then $k = m$ and $R^2_{restricted} = 0$.

Example of Standard Regression Model
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \epsilon$, $n = 45$

Model   Dependent Variable   Independent Variable(s)   R^2
1       Y                    X1, X2, X3, X4            0.60
2       Y                    X1, X2, X3                0.55
3       Y                    X2, X3, X4                0.50
4       Y                    X1, X3, X4                0.54
5       Y                    X1, X2, X4                0.51
6       Y                    X1, X2                    0.48
7       Y                    X3, X4                    0.45

At a 99% level of confidence, test $H_0$: $\beta_3 = \beta_4 = 0$.
$k =$ ____, $m =$ ____, $n =$ ____, $n - m - 1 =$ ____
Full Model: Model ____, $R^2_{full} =$ ____
Restricted Model: Model ____, $R^2_{restricted} =$ ____
$F = \frac{R^2_{full} - R^2_{restricted}}{1 - R^2_{full}} \times \frac{n - m - 1}{k} =$ ____
Decision Rule: At a 99% level of confidence, reject $H_0$ if $F > F_\alpha(k, n - m - 1) =$ ____
Conclusion: ____

Example of Standard Regression Model

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_4 X_4 + \epsilon$, $n = 45$

Model   Dependent Variable   Independent Variable(s)       $R^2$
1       $Y$                  $X_1, X_2, X_3, X_4$          0.60
2       $Y$                  $X_1, X_2, X_3$               0.55
3       $Y$                  $X_2, X_3, X_4$               0.50
4       $Y$                  $X_1, X_3, X_4$               0.54
5       $Y$                  $X_1, X_2, X_4$               0.51
6       $Y$                  $X_1, X_2$                    0.48
7       $Y$                  $X_3, X_4$                    0.45

At a 99% level of confidence, test $H_0$: $\beta_1 = \beta_2 = \beta_3 = \beta_4 = 0$.

$k$ = ____, $m$ = ____, $n$ = ____, $n - m - 1$ = ____
Full model: Model ____, $R^2_{\text{full}}$ = ____
Restricted model: Model ____, $R^2_{\text{restricted}}$ = ____

$F = \dfrac{R^2_{\text{full}} - R^2_{\text{restricted}}}{1 - R^2_{\text{full}}} \cdot \dfrac{n - m - 1}{k}$ = ____

Decision Rule: At a 99% level of confidence, reject $H_0$ if $F > F_\alpha(k, n - m - 1)$ = ____
Conclusion:
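One way to work the two tests on this table ($H_0$: $\beta_3 = \beta_4 = 0$ and $H_0$: $\beta_1 = \cdots = \beta_4 = 0$) is sketched below. It assumes that dropping $X_3, X_4$ leaves Model 6 as the restricted regression, and that the second test uses the naive model with $R^2_{\text{restricted}} = 0$. In both cases $m = 4$, $n = 45$, so $n - m - 1 = 40$.

```latex
% Test of H_0: \beta_3 = \beta_4 = 0
% k = 2, full = Model 1 (R^2 = 0.60), restricted = Model 6 (R^2 = 0.48)
\[
F = \frac{0.60 - 0.48}{1 - 0.60} \cdot \frac{45 - 4 - 1}{2}
  = 0.3 \times 20 = 6
\]

% Test of H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0
% k = m = 4, full = Model 1 (R^2 = 0.60), restricted = naive model (R^2 = 0)
\[
F = \frac{0.60 - 0}{1 - 0.60} \cdot \frac{45 - 4 - 1}{4}
  = 1.5 \times 10 = 15
\]
```

A standard F table gives approximately $F_{.01}(2, 40) = 5.18$ and $F_{.01}(4, 40) = 3.83$ (values worth checking against the table used in the course). Both computed statistics exceed these, so $H_0$ would be rejected at the 99% level in each test.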

Sample Test, Problem 7

Problem Scenario: A Likert summated scale designed to measure how much an adult US citizen likes President Bush was administered to a random sample of US voters. The following regression model was used to analyze the data:

$Y = \beta_0 + \beta_1 D_1 + \beta_2 D_2 + \beta_3 X + \beta_4 D_1 X + \beta_5 D_2 X + \epsilon$,

where:
$Y$ = attitude score on the Likert scale;
$D_1$ = 1 if the citizen is a registered Democrat, and 0 if not;
$D_2$ = 1 if the citizen is a registered Republican, and 0 if not;
$X$ = annual family income of the citizen;
$\epsilon$ is the random error, defined as usual.

7.(b) (2 + 2 = 4 pt) Write each of the following two hypotheses in terms of model parameters.
7.(b)(i) If a citizen is a registered Democrat, then annual family income has no marginal effect on attitude score.
7.(b)(ii) The marginal effect of annual family income on attitude score is the same for registered Republicans and other citizens.

$Y = \beta_0 + \beta_1 D_1 + \beta_2 D_2 + \beta_3 X + \beta_4 D_1 X + \beta_5 D_2 X + \epsilon$, where $Y$ = attitude score on the Likert scale.

Registered Democrat: $D_1 = 1$, $D_2 = 0$, $D_1 X = X$, $D_2 X = 0$
    $Y = (\beta_0 + \beta_1) + (\beta_3 + \beta_4)X + \epsilon$
Registered Republican: $D_1 = 0$, $D_2 = 1$, $D_1 X = 0$, $D_2 X = X$
    $Y = (\beta_0 + \beta_2) + (\beta_3 + \beta_5)X + \epsilon$
Other: $D_1 = 0$, $D_2 = 0$, $D_1 X = 0$, $D_2 X = 0$
    $Y = \beta_0 + \beta_3 X + \epsilon$

7.(b) (2 + 2 = 4 pt) Write each of the following two hypotheses in terms of model parameters.
7.(b)(i) If a citizen is a registered Democrat, then annual family income has no marginal effect on attitude score.
7.(b)(ii) The marginal effect of annual family income on attitude score is the same for registered Republicans and other citizens.
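The substitution of dummy values above can be checked mechanically. The sketch below uses a hypothetical helper `group_line` and made-up coefficient values (not estimates from any data) to recover each group's intercept and income slope.

```python
def group_line(beta, d1, d2):
    """Return (intercept, slope in X) implied by
    Y = b0 + b1*D1 + b2*D2 + b3*X + b4*D1*X + b5*D2*X + error."""
    b0, b1, b2, b3, b4, b5 = beta
    intercept = b0 + b1 * d1 + b2 * d2
    slope = b3 + b4 * d1 + b5 * d2
    return intercept, slope


# Hypothetical coefficients (b0, ..., b5), chosen only for illustration.
beta = (50, -10, 15, 2, -1, 3)

democrat = group_line(beta, d1=1, d2=0)    # (b0 + b1, b3 + b4)
republican = group_line(beta, d1=0, d2=1)  # (b0 + b2, b3 + b5)
other = group_line(beta, d1=0, d2=0)       # (b0, b3)

print(democrat, republican, other)
```

Read this way, the marginal effect of income is $\beta_3 + \beta_4$ for registered Democrats, $\beta_3 + \beta_5$ for registered Republicans, and $\beta_3$ for other citizens. Hypothesis (i) therefore corresponds to $\beta_3 + \beta_4 = 0$, and hypothesis (ii) to $\beta_5 = 0$.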

From Section 14.7: Consider the regression model:

$Y = \beta_0 + \beta_1 D_1 + \beta_2 D_2 + \beta_3 X_1 + \beta_4 X_2 + \beta_5 D_1 X_1 + \beta_6 D_2 X_1 + \beta_7 D_1 X_2 + \beta_8 D_2 X_2 + \epsilon$,

where:
$Y$ = sales of a brand in a sales territory (unit = $100,000) during Fall 2003;
$X_1$ is the number of salespeople in the territory;
$X_2$ is the retail price (in dollars) in the territory;
$D_1$ and $D_2$ are dummy variables for the level of advertising in the territory. The advertising level can be low, medium, or high.
$D_1 = 1$ if the advertising level is medium, and $D_1 = 0$ otherwise;
$D_2 = 1$ if the advertising level is high, and $D_2 = 0$ otherwise.
$\epsilon$ is defined as usual.

Regression model for the three advertising levels:
Low Advertising:
Medium Advertising:
High Advertising:
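The three blanks can be filled by substituting the dummy values into the model, in the same way as for the job-type and party examples earlier. A sketch of that substitution, using only the symbols defined above:

```latex
% Low advertising: D_1 = 0, D_2 = 0
\[
Y = \beta_0 + \beta_3 X_1 + \beta_4 X_2 + \epsilon
\]

% Medium advertising: D_1 = 1, D_2 = 0
\[
Y = (\beta_0 + \beta_1) + (\beta_3 + \beta_5) X_1 + (\beta_4 + \beta_7) X_2 + \epsilon
\]

% High advertising: D_1 = 0, D_2 = 1
\[
Y = (\beta_0 + \beta_2) + (\beta_3 + \beta_6) X_1 + (\beta_4 + \beta_8) X_2 + \epsilon
\]
```

The coefficient on $X_1$ in each line is the marginal effect of an additional salesperson at that advertising level, and the coefficient on $X_2$ is the marginal effect of price.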

Regression model for the three advertising levels:
Low Advertising:
Medium Advertising:
High Advertising:

2. State each of the following null hypotheses in terms of the parameters of the regression model (e.g., $H_0$: $\beta_1 = 0$):
2.(a) The marginal effect of price on sales is the same for medium and high advertising levels.
2.(b) Changes in price do not affect sales if the level of advertising is low.

3. Suppose we have estimated regression models using data from 49 territories and obtained the following results:

Regression   Dependent Variable   Independent Variables                                              $R^2$
1            $Y$                  $D_1, D_2, X_1, X_2, D_1 X_1, D_2 X_1, D_1 X_2, D_2 X_2$           0.75
2            $Y$                  $D_1, D_2$                                                         0.2
3            $Y$                  $X_1, X_2$                                                         0.2
4            $Y$                  $D_1, D_2, X_1, X_2, D_1 X_1, D_2 X_1$                             0.6
5            $Y$                  $D_1, D_2, X_1, D_1 X_1, D_2 X_1$                                  0.55
6            $Y$                  $D_1, D_2, X_2, D_1 X_2, D_2 X_2$                                  0.5
7            $Y$                  $D_1, D_2, D_1 X_1, D_2 X_1, D_1 X_2, D_2 X_2$                     0.3
8            $Y$                  $D_1, D_2, X_1, X_2, D_1 X_2, D_2 X_2$                             0.7

At a 99% level of confidence, test each of the following two null hypotheses:

3(a) The marginal effect of price on sales is the same for all levels of advertising.

$H_0$:
$H_a$:
$k$ = ____, $n$ = ____, $m$ = ____, $n - m - 1$ = ____
$R^2_{\text{full}}$ = ____, $R^2_{\text{restricted}}$ = ____

$F = \left( \dfrac{R^2_{\text{full}} - R^2_{\text{restricted}}}{1 - R^2_{\text{full}}} \right) \left( \dfrac{n - m - 1}{k} \right)$ = ____

Decision Rule: At a 99% level of confidence, reject $H_0$ if $F >$ ____
Conclusion:

Regression   Dependent Variable   Independent Variables                                              $R^2$
1            $Y$                  $D_1, D_2, X_1, X_2, D_1 X_1, D_2 X_1, D_1 X_2, D_2 X_2$           0.75
2            $Y$                  $D_1, D_2$                                                         0.2
3            $Y$                  $X_1, X_2$                                                         0.2
4            $Y$                  $D_1, D_2, X_1, X_2, D_1 X_1, D_2 X_1$                             0.6
5            $Y$                  $D_1, D_2, X_1, D_1 X_1, D_2 X_1$                                  0.55
6            $Y$                  $D_1, D_2, X_2, D_1 X_2, D_2 X_2$                                  0.5
7            $Y$                  $D_1, D_2, D_1 X_1, D_2 X_1, D_1 X_2, D_2 X_2$                     0.3
8            $Y$                  $D_1, D_2, X_1, X_2, D_1 X_2, D_2 X_2$                             0.7

3(b) Having an additional salesperson does not have any effect on sales at any level of advertising.

$H_0$:
$H_a$:
$k$ = ____, $n$ = ____, $m$ = ____, $n - m - 1$ = ____
$R^2_{\text{full}}$ = ____, $R^2_{\text{restricted}}$ = ____

$F = \left( \dfrac{R^2_{\text{full}} - R^2_{\text{restricted}}}{1 - R^2_{\text{full}}} \right) \left( \dfrac{n - m - 1}{k} \right)$ = ____

Decision Rule: At a 99% level of confidence, reject $H_0$ if $F >$ ____
Conclusion:
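Picking the restricted regression from the results table is the step most easily gotten wrong, so a small Python sketch may help. It assumes 3(a) is read as $H_0$: $\beta_7 = \beta_8 = 0$ (drop $D_1 X_2$ and $D_2 X_2$) and 3(b) as $H_0$: $\beta_3 = \beta_5 = \beta_6 = 0$ (drop $X_1$, $D_1 X_1$, $D_2 X_1$). The names `MODELS` and `partial_f` are hypothetical; the $R^2$ values are copied from the table.

```python
# Results table: regression number -> (independent variables, R^2).
MODELS = {
    1: ({"D1", "D2", "X1", "X2", "D1X1", "D2X1", "D1X2", "D2X2"}, 0.75),
    2: ({"D1", "D2"}, 0.2),
    3: ({"X1", "X2"}, 0.2),
    4: ({"D1", "D2", "X1", "X2", "D1X1", "D2X1"}, 0.6),
    5: ({"D1", "D2", "X1", "D1X1", "D2X1"}, 0.55),
    6: ({"D1", "D2", "X2", "D1X2", "D2X2"}, 0.5),
    7: ({"D1", "D2", "D1X1", "D2X1", "D1X2", "D2X2"}, 0.3),
    8: ({"D1", "D2", "X1", "X2", "D1X2", "D2X2"}, 0.7),
}
N = 49  # number of territories


def partial_f(dropped, full_id=1):
    """Return (restricted regression number, k, n - m - 1, F) for the
    hypothesis that the coefficients of the `dropped` variables are all 0."""
    full_vars, r2_full = MODELS[full_id]
    remaining = full_vars - set(dropped)
    # The restricted regression is the one using exactly the remaining variables.
    restricted_id = next(i for i, (v, _) in MODELS.items() if v == remaining)
    r2_restricted = MODELS[restricted_id][1]
    m = len(full_vars)   # regressors in the full model
    k = len(dropped)     # coefficients set to 0 under H0
    df2 = N - m - 1
    f = (r2_full - r2_restricted) / (1 - r2_full) * df2 / k
    return restricted_id, k, df2, f


test_a = partial_f(["D1X2", "D2X2"])        # 3(a): price interaction terms
test_b = partial_f(["X1", "D1X1", "D2X1"])  # 3(b): all salesperson terms

print(test_a)
print(test_b)
```

Under these readings the restricted regressions are 4 and 6, giving F of 12 with (2, 40) degrees of freedom for 3(a) and about 13.33 with (3, 40) for 3(b). A standard F table lists roughly 5.18 and 4.31 as the corresponding 1% critical values (worth confirming against the course table), so both null hypotheses would be rejected.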