Sociology 6Z03
Review II
John Fox
McMaster University
Fall 2016

Outline: Review II

- Probability Part I
- Sampling Distributions
- Probability Part II
- Confidence Intervals
- Hypothesis Tests
- Inference for Means
- Inference for Proportions
- Inference for Contingency Tables
- Inference for Regression Analysis
- One-Way Analysis of Variance
Probability (Part I): Probability Basics

- Experiment
- Outcomes
- Sample space
- Events
- The axioms of probability theory

Probability (Part I): Discrete and Continuous Random Variables

- Discrete random variables
  - Probability distribution
  - Mean, variance, and standard deviation:
    $$\mu = \sum x_i p_i \qquad \sigma^2 = \sum (x_i - \mu)^2 p_i \qquad \sigma = +\sqrt{\sigma^2}$$
- Continuous random variables
  - Density curves
  - Normal distributions
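The formulas for the mean, variance, and standard deviation of a discrete random variable can be checked numerically. The following is a minimal sketch; the function name `discrete_summary` and the coin-toss distribution are illustrative assumptions, not part of the course materials.

```python
from math import sqrt


def discrete_summary(values, probs):
    """Return (mean, variance, sd) of a discrete random variable."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # mu = sum of x_i * p_i
    mu = sum(x * p for x, p in zip(values, probs))
    # sigma^2 = sum of (x_i - mu)^2 * p_i
    var = sum((x - mu) ** 2 * p for x, p in zip(values, probs))
    return mu, var, sqrt(var)


# Made-up example: number of heads in two tosses of a fair coin.
mu, var, sd = discrete_summary([0, 1, 2], [0.25, 0.5, 0.25])
print(mu, var, sd)
```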
Sampling Distributions: Statistical Inference

- Statistical inference: drawing conclusions about populations from random samples
- Characteristics of populations: parameters (Greek letters, e.g., $\mu$, $\sigma$)
- Characteristics of samples: statistics (Roman letters, e.g., $\bar{x}$, $s$)
- Statistics vary from sample to sample

Sampling Distributions

- Sampling distribution of a statistic
  - Repeated sampling
  - Sampling variability
  - Bias, variance, and mean-square error
- The sampling distribution of sample means
  - From a normal population: $\bar{x} \sim N(\mu, \sigma/\sqrt{n})$
  - The central limit theorem, almost any population: approximately $\bar{x} \sim N(\mu, \sigma/\sqrt{n})$
- Using simulation to explore sampling distributions
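One way to explore a sampling distribution by simulation is sketched below. It assumes an exponential population with mean 1 and standard deviation 1 (a deliberately non-normal choice made for this illustration), draws many samples of size $n$, and compares the spread of the sample means with the theoretical value $\sigma/\sqrt{n}$. The function name and settings are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, stdev


def simulate_sample_means(n, reps, seed=0):
    """Draw `reps` samples of size `n` from an exponential(1) population
    and return the list of sample means."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [mean(rng.expovariate(1.0) for _ in range(n)) for _ in range(reps)]


means = simulate_sample_means(n=50, reps=2000)
center = mean(means)        # should be close to the population mean, 1
spread = stdev(means)       # should be close to sigma / sqrt(n)
theory = 1.0 / sqrt(50)
print(center, spread, theory)
```

A histogram of `means` would look roughly normal even though the population is strongly skewed, which is what the central limit theorem leads us to expect.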
Probability (Part II)

- Venn diagrams
- Addition rule for non-disjoint events: $P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$
- Independent and dependent events. For $A$ and $B$ independent: $P(A \text{ and } B) = P(A)P(B)$
- Conditional probability: $P(B \mid A) = P(A \text{ and } B)/P(A)$
- Tree diagrams

Probability (Part II): Binomial Distributions

- Formula for the binomial distribution with $n$ trials and probability of success $p$:
  $$P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$$
- Mean: $E(X) = np$
- Variance: $V(X) = np(1 - p)$
- Normal approximation to the binomial
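The binomial formula and its mean and variance can be verified directly. This sketch uses an arbitrary choice of $n = 10$ and $p = 0.3$; the helper name `binom_pmf` is an assumption for illustration.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for a binomial random variable with n trials, success prob p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


n, p = 10, 0.3  # arbitrary illustrative values
pmf = [binom_pmf(k, n, p) for k in range(n + 1)]

# Mean and variance computed from the distribution itself ...
mean_x = sum(k * pk for k, pk in enumerate(pmf))
var_x = sum((k - mean_x) ** 2 * pk for k, pk in enumerate(pmf))

# ... agree with the shortcut formulas np and np(1 - p).
print(mean_x, n * p)
print(var_x, n * p * (1 - p))
```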
Confidence Intervals

- Point estimation vs. interval estimation
- Confidence intervals: estimate ± margin of error
- Proper interpretation (e.g., of a 95-percent confidence interval): With repeated sampling, 95 percent of confidence intervals constructed by this method will include the true value of the parameter and 5 percent will miss the true value. (The confidence interval for any particular sample either includes the parameter or misses it.)
- Confidence interval for the population mean $\mu$:
  $$\bar{x} \pm z^* \frac{\sigma}{\sqrt{n}}$$

Confidence Intervals: Margin of Error

- The margin of error $z^* \sigma/\sqrt{n}$ gets smaller when
  - the level of confidence $C$ is made smaller
  - the population standard deviation $\sigma$ gets smaller
  - the sample size $n$ gets larger
- Choosing the sample size for a desired margin of error $m$:
  $$n = \left(\frac{z^* \sigma}{m}\right)^2$$
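The interval for a mean with known population standard deviation, and the sample-size formula, might be computed as in the sketch below. The function names and the numbers (sample mean 100, standard deviation 15, sample size 36) are made up for illustration; rounding the sample size up to the next whole number is the usual convention and is an assumption here.

```python
from math import ceil, sqrt
from statistics import NormalDist


def z_star(conf):
    """Critical value with probability (1 - conf)/2 to the right."""
    return NormalDist().inv_cdf(1 - (1 - conf) / 2)


def mean_ci(xbar, sigma, n, conf=0.95):
    """Confidence interval for mu when sigma is known."""
    margin = z_star(conf) * sigma / sqrt(n)
    return xbar - margin, xbar + margin


def sample_size(sigma, m, conf=0.95):
    """Smallest n giving a margin of error of at most m."""
    return ceil((z_star(conf) * sigma / m) ** 2)


print(mean_ci(xbar=100, sigma=15, n=36))   # made-up inputs
print(sample_size(sigma=15, m=3))
```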
Hypothesis Tests: The Null and Alternative Hypotheses

- Null hypothesis $H_0$ and alternative hypothesis $H_a$
- Directional (one-sided) alternative hypothesis:
  $$H_0\colon \mu = \mu_0,\ H_a\colon \mu > \mu_0 \qquad \text{or} \qquad H_0\colon \mu = \mu_0,\ H_a\colon \mu < \mu_0$$
- Nondirectional (two-sided) alternative hypothesis:
  $$H_0\colon \mu = \mu_0,\ H_a\colon \mu \neq \mu_0$$

Hypothesis Tests: Steps in Hypothesis Testing (for Means)

1. State the null and alternative hypotheses.
2. From the data, calculate the test statistic
   $$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$$
3. The null sampling distribution of the test statistic is the standard normal distribution, $z \sim N(0, 1)$.
4. Calculate the P-value: the probability of obtaining a sample result (value of the test statistic) at least as extreme as the one observed if $H_0$ is true.

Statistical significance and the significance level $\alpha$
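The steps above can be sketched in code for the case where the population standard deviation is known. The function name `z_test`, its `alternative` argument, and the input numbers are assumptions chosen for this illustration.

```python
from math import sqrt
from statistics import NormalDist


def z_test(xbar, mu0, sigma, n, alternative="two-sided"):
    """Return (z, P-value) for a test of H0: mu = mu0 with sigma known."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    cdf = NormalDist().cdf  # standard normal CDF
    if alternative == "greater":      # Ha: mu > mu0
        p_value = 1 - cdf(z)
    elif alternative == "less":       # Ha: mu < mu0
        p_value = cdf(z)
    elif alternative == "two-sided":  # Ha: mu != mu0
        p_value = 2 * (1 - cdf(abs(z)))
    else:
        raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")
    return z, p_value


# Made-up inputs: sample mean 104, null value 100, sigma 15, n 36.
z, p_value = z_test(104, 100, 15, 36)
print(z, p_value)
```

With these inputs the two-sided P-value is about 0.11, so the result would not be statistically significant at the $\alpha = .05$ level.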
Hypothesis Tests: Hypothesis Tests and Confidence Intervals

- A two-sided test at the level $\alpha$ (e.g., .05) corresponds to a confidence interval with level $C = 1 - \alpha$ (e.g., .95 or 95 percent)
- Cautions concerning confidence intervals:
  - Data must be a SRS from a large population
  - Beware of outliers and non-normality in small samples
  - The margin of error covers only random sampling errors

Hypothesis Tests: Hypothesis Testing as Decision Making

                  State of nature
Decision          $H_0$ true          $H_0$ false
Reject $H_0$      Type I error        Correct decision
Accept $H_0$      Correct decision    Type II error

- Probability of Type I error = $\alpha$ (the level of the test)
- Power of the test = $1 - P(\text{Type II error})$
- The power of the test goes up as
  - the sample size $n$ grows
  - the true value of $\mu$ gets farther from the null value $\mu_0$
  - the $\alpha$ level is made larger
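The three statements about power can be illustrated for one specific case: a one-sided z-test of $H_0\colon \mu = \mu_0$ against $H_a\colon \mu > \mu_0$ with known $\sigma$. The sketch below is restricted to that case by assumption, and the function name and input values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_one_sided_z(mu, mu0, sigma, n, alpha=0.05):
    """Power of the z-test of H0: mu = mu0 vs. Ha: mu > mu0
    when the true mean is `mu` and sigma is known."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)   # reject H0 when z > z_crit
    shift = (mu - mu0) / (sigma / sqrt(n))   # mean of z under the true mu
    return 1 - std_normal.cdf(z_crit - shift)


base = power_one_sided_z(mu=104, mu0=100, sigma=15, n=36)
print(base)
print(power_one_sided_z(104, 100, 15, n=100))          # larger n
print(power_one_sided_z(108, 100, 15, n=36))           # mu farther from mu0
print(power_one_sided_z(104, 100, 15, 36, alpha=0.10)) # larger alpha
```

Each of the last three calls changes one input in the direction named on the slide, and each returns a power larger than `base`.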
Inference for Means: Single-Sample t-tests and t-intervals

- Assumptions for the single-sample t procedures:
  - Data are a SRS.
  - The population is normal with mean $\mu$ and standard deviation $\sigma$, both of which are unknown.
- The statistic
  $$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$
  follows a t-distribution with $n - 1$ degrees of freedom.
- The standard error of the sample mean is $SE = s/\sqrt{n}$.
- To test the hypothesis $H_0\colon \mu = \mu_0$, calculate the test statistic
  $$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}}$$
- To construct a level $C$ confidence interval for $\mu$, find the critical value $t^*$ from the t-distribution with $n - 1$ degrees of freedom, and with probability $(1 - C)/2$ to the right. Then calculate
  $$\text{estimate} \pm t^* SE = \bar{x} \pm t^* \frac{s}{\sqrt{n}}$$
- For matched-pairs data, the single-sample t-test and t-interval procedures are applied to the differences between the pairs.
- The t procedures are robust with respect to violation of the assumption of normality if the sample size is large.
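A minimal sketch of the single-sample t statistic and t-interval follows. The data are made up, and because the Python standard library has no t-distribution, the critical value $t^*$ is passed in by hand (2.365 is the tabled value for 7 degrees of freedom and 95 percent confidence). Function names are illustrative assumptions.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t(data, mu0):
    """Return (t, df, SE) for a test of H0: mu = mu0."""
    n = len(data)
    se = stdev(data) / sqrt(n)        # SE = s / sqrt(n)
    t = (mean(data) - mu0) / se
    return t, n - 1, se


def t_interval(data, t_star):
    """Confidence interval xbar +/- t* SE; t_star comes from a t table."""
    xbar = mean(data)
    se = stdev(data) / sqrt(len(data))
    return xbar - t_star * se, xbar + t_star * se


data = [2, 4, 4, 4, 5, 5, 7, 9]       # hypothetical sample, n = 8
t, df, se = one_sample_t(data, mu0=4)
print(t, df, se)
print(t_interval(data, t_star=2.365)) # t* for df = 7, C = .95
```

For matched-pairs data, the same functions would be applied to the list of within-pair differences.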
Inference for Means: Two-Sample t-tests and t-intervals

- Assumptions for the two-sample t procedures:
  - We have two independent SRSs from two populations.
  - Both populations are normally distributed, with unknown means $\mu_1$ and $\mu_2$ and standard deviations $\sigma_1$ and $\sigma_2$.
- The statistic
  $$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$
  follows a t-distribution with degrees of freedom approximated by the smaller of $n_1 - 1$ and $n_2 - 1$ (or by a complicated formula).
- The standard error of the difference in sample means $\bar{x}_1 - \bar{x}_2$ is
  $$SE = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$
- To test the hypothesis $H_0\colon \mu_1 - \mu_2 = 0$, calculate the test statistic
  $$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$
- To construct a level $C$ confidence interval for $\mu_1 - \mu_2$, find the critical value $t^*$ from the t-distribution with the smaller of $n_1 - 1$ and $n_2 - 1$ degrees of freedom, and with probability $(1 - C)/2$ to the right. Then calculate
  $$\text{estimate} \pm t^* SE = (\bar{x}_1 - \bar{x}_2) \pm t^* \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$
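The two-sample statistic can be sketched in the same way. The code below uses the conservative degrees of freedom described above (the smaller of $n_1 - 1$ and $n_2 - 1$), not the more complicated formula; the data and the function name are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance


def two_sample_t(x1, x2):
    """Return (t, df, SE) for a test of H0: mu1 - mu2 = 0,
    using the conservative df = min(n1, n2) - 1."""
    n1, n2 = len(x1), len(x2)
    se = sqrt(variance(x1) / n1 + variance(x2) / n2)
    t = (mean(x1) - mean(x2)) / se
    return t, min(n1, n2) - 1, se


# Hypothetical samples from two groups.
group1 = [2, 4, 4, 4, 5, 5, 7, 9]
group2 = [1, 2, 3, 4, 5]
t, df, se = two_sample_t(group1, group2)
print(t, df, se)
```

A confidence interval would then be the difference in sample means plus or minus $t^*$ times `se`, with $t^*$ read from a t table for `df` degrees of freedom.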
Inference for Proportions: Single-Sample Tests and Intervals for a Population Proportion $p$

- The sample proportion is
  $$\hat{p} = \frac{\text{count of successes in the sample}}{n}$$
- Assumptions:
  - The data are a SRS from the population.
  - For a test, $np_0$ and $n(1 - p_0)$ are both at least 10; for a confidence interval, the counts of successes and failures are both at least 15.
- The statistic
  $$z = \frac{\hat{p} - p}{\sqrt{\dfrac{p(1 - p)}{n}}}$$
  follows an approximate standard normal distribution $N(0, 1)$.
- To test the hypothesis $H_0\colon p = p_0$, calculate the test statistic
  $$z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1 - p_0)}{n}}}$$
- To construct a level $C$ confidence interval for $p$, find the critical value $z^*$ from the standard normal distribution with probability $(1 - C)/2$ to the right. Then calculate
  $$\text{estimate} \pm z^* SE = \hat{p} \pm z^* \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$
- To find the sample size for a desired margin of error $m$:
  $$n = \left(\frac{z^*}{m}\right)^2 p^*(1 - p^*)$$
  where $p^*$ is a guessed value for the population proportion (conservatively, .5).
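The single-sample procedures for a proportion might be coded as follows. This is a sketch under the large-sample conditions stated above; the function names, the two-sided P-value, and the counts (60 successes in 100 trials) are assumptions made for the example.

```python
from math import ceil, sqrt
from statistics import NormalDist


def prop_z_test(successes, n, p0):
    """Return (p_hat, z, two-sided P-value) for H0: p = p0."""
    p_hat = successes / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)  # SE uses the null value p0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_hat, z, p_value


def prop_ci(successes, n, conf=0.95):
    """Large-sample confidence interval for p."""
    p_hat = successes / n
    z_star = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    margin = z_star * sqrt(p_hat * (1 - p_hat) / n)  # SE uses p_hat
    return p_hat - margin, p_hat + margin


def prop_sample_size(m, conf=0.95, p_guess=0.5):
    """Sample size for margin of error m; p_guess = .5 is conservative."""
    z_star = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return ceil((z_star / m) ** 2 * p_guess * (1 - p_guess))


print(prop_z_test(60, 100, p0=0.5))   # made-up counts
print(prop_ci(60, 100))
print(prop_sample_size(m=0.03))
```

Note that the test and the interval use different standard errors: the test plugs in the hypothesized $p_0$, while the interval plugs in the sample proportion.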
Inference for Proportions: Two-Sample Tests and Intervals for a Difference in Population Proportions

- The difference in population proportions is $p_1 - p_2$.
- The sample difference in proportions $\hat{p}_1 - \hat{p}_2$ is approximately normally distributed with mean $p_1 - p_2$ and standard deviation
  $$\sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}$$
- To test the hypothesis $H_0\colon p_1 - p_2 = 0$ (i.e., $H_0\colon p_1 = p_2$), when the counts of successes and failures in both samples are all at least 5:
  1. Calculate the pooled sample proportion
     $$\hat{p} = \frac{\text{count of successes in both samples combined}}{n_1 + n_2}$$
  2. Calculate the test statistic
     $$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$
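The pooled two-sample test can be sketched as below. The function name, the two-sided P-value, and the counts are assumptions for illustration only.

```python
from math import sqrt
from statistics import NormalDist


def two_prop_z_test(x1, n1, x2, n2):
    """Return (z, pooled proportion, two-sided P-value) for H0: p1 = p2.
    x1 and x2 are the counts of successes in the two samples."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se0 = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, pooled, p_value


# Made-up counts: 45 of 100 successes vs. 30 of 100 successes.
print(two_prop_z_test(45, 100, 30, 100))
```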
Inference for Proportions: Two-Sample Tests and Intervals for a Difference in Population Proportions

- To construct a level $C$ confidence interval for $p_1 - p_2$ (when the counts of successes and failures in both samples are all at least 10):
  1. Find the critical value $z^*$ from the standard normal distribution with probability $(1 - C)/2$ to the right.
  2. Calculate
     $$SE = \sqrt{\frac{\hat{p}_1(1 - \hat{p}_1)}{n_1} + \frac{\hat{p}_2(1 - \hat{p}_2)}{n_2}}$$
  3. Calculate
     $$\text{estimate} \pm z^* SE = (\hat{p}_1 - \hat{p}_2) \pm z^* SE$$

Inference for Contingency Tables

- The chi-square test for independence is used to test the null hypothesis that two categorical variables are unrelated in the population.
- Expected counts for each cell of the $r \times c$ table under the null hypothesis are calculated as
  $$\text{expected count} = \frac{\text{row total} \times \text{column total}}{n}$$
- The test statistic
  $$X^2 = \sum \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$$
  is approximately distributed as chi-square with $(r - 1)(c - 1)$ degrees of freedom.
- For the chi-square test to be accurate:
  - No more than 20 percent of the expected counts should be less than 5, and all of the expected counts should be 1 or larger.
  - The data should be a SRS from the population.
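The expected counts and the chi-square statistic for a two-way table can be computed as in this sketch. The table of counts is invented, the function name is an assumption, and the P-value is left to a chi-square table or statistical software because the Python standard library does not provide the chi-square distribution.

```python
def chi_square_independence(table):
    """Return (X2, df, expected) for an r x c table of observed counts,
    given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # expected count = row total * column total / n
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    x2 = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return x2, df, expected


# Made-up 2 x 2 table of observed counts.
x2, df, expected = chi_square_independence([[30, 20], [20, 30]])
print(x2, df, expected)
```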
Inference for Regression Analysis: Simple Linear Regression: Statistical Model

- Recall that the least-squares line $\hat{y} = a + bx$ in simple regression is given by
  $$b = r \frac{s_y}{s_x} \qquad a = \bar{y} - b\bar{x}$$
- Inference in simple regression is based on a statistical model, which is assumed to describe the population:
  1. Linearity: The average response in the population is $\mu_y = \alpha + \beta x$.
  2. Constant spread: The population standard deviation $\sigma$ of $y$ is the same for all values of $x$.
  3. Normality: For any fixed value of $x$, the response $y$ follows a normal distribution.
  4. Independence: Observations are sampled independently.

Inference for Regression Analysis: Simple Linear Regression: Standard Error of the Regression

- The standard error about the regression line $s$ estimates $\sigma$:
  $$s = \sqrt{\frac{1}{n - 2} \sum \text{residual}^2}$$
  where the residual is $y - \hat{y}$.
- The degrees of freedom for $s$ are $n - 2$.
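The least-squares coefficients and the standard error of the regression can be computed from the formulas above, as in this minimal sketch. The five data points and the function name `fit_line` are made up for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def fit_line(x, y):
    """Return (a, b, s): intercept, slope, and standard error of the
    regression for the least-squares line y-hat = a + b x."""
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)
    # Sample correlation r
    r = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / ((n - 1) * sx * sy)
    b = r * sy / sx
    a = ybar - b * xbar
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s = sqrt(sum(e ** 2 for e in residuals) / (n - 2))  # n - 2 df
    return a, b, s


# Hypothetical data.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(fit_line(x, y))
```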
Inference for Regression Analysis: Simple Linear Regression: Confidence Intervals and Hypothesis Tests for the Slope

- Confidence intervals and hypothesis tests for the population slope $\beta$ use the t distribution:
  - The standard error of the sample least-squares slope $b$ is
    $$SE_b = \frac{s}{\sqrt{\sum (x - \bar{x})^2}}$$
  - The statistic
    $$t = \frac{b - \beta}{SE_b}$$
    follows a t distribution with $n - 2$ degrees of freedom.
- To test the null hypothesis $H_0\colon \beta = 0$, calculate the test statistic
  $$t = \frac{b}{SE_b}$$
- To construct a confidence interval for $\beta$, find the critical value $t^*$ from the t-distribution with $n - 2$ degrees of freedom. Then calculate
  $$\text{estimate} \pm t^* SE = b \pm t^* SE_b$$

Inference for Regression Analysis: Multiple Linear Regression

- Inference in multiple regression is similar, but based on the model
  $$\mu_y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$$
- The least-squares fit is
  $$\hat{y} = a + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k$$
- The standard deviation $\sigma$ of $y$ around the population regression is estimated by
  $$s = \sqrt{\frac{1}{n - k - 1} \sum \text{residual}^2}$$
- The degrees of freedom for $s$ are $n - k - 1$.
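For simple regression, the standard error of the slope, the t statistic, and the confidence interval might be computed as sketched below. The slope is calculated as the sum of cross-products divided by the sum of squares of $x$, which is algebraically the same as $r s_y/s_x$. The data and the function name are made up, and the critical value is supplied by hand (3.182 is the tabled $t^*$ for 3 degrees of freedom and 95 percent confidence).

```python
from math import sqrt
from statistics import mean


def slope_inference(x, y, t_star):
    """Return b, SE_b, t for H0: beta = 0, df, and the interval b +/- t* SE_b."""
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    # Standard error of the regression, with n - 2 degrees of freedom
    s = sqrt(sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2))
    se_b = s / sqrt(sxx)
    return {
        "b": b,
        "se_b": se_b,
        "t": b / se_b,
        "df": n - 2,
        "ci": (b - t_star * se_b, b + t_star * se_b),
    }


# Hypothetical data; t* = 3.182 for df = 3 and C = .95.
result = slope_inference([1, 2, 3, 4, 5], [2, 4, 5, 4, 5], t_star=3.182)
print(result)
```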
Inference for Regression Analysis: Multiple Linear Regression: Confidence Intervals and Hypothesis Tests for a Slope Coefficient

- The standard errors of individual slope coefficients $b_1, b_2, \ldots, b_k$ are found using a computer program.
- Then, to test, e.g., $H_0\colon \beta_1 = 0$, we calculate
  $$t = \frac{b_1}{SE_{b_1}}$$
  which follows a t distribution with $n - k - 1$ degrees of freedom under $H_0$.
- To construct a confidence interval, e.g., for $\beta_1$, find the critical value $t^*$ from the t-distribution with $n - k - 1$ degrees of freedom. Then calculate
  $$\text{estimate} \pm t^* SE = b_1 \pm t^* SE_{b_1}$$

Inference for Regression Analysis: Checking the Assumptions of the Regression Model

- Linearity: Plot the residuals against each $x$.
- Constant Spread: Plot the residuals against the fitted values $\hat{y}$.
- Normality: Examine a histogram of the residuals.
One-Way Analysis of Variance

- One-way analysis of variance (ANOVA) is used to test the null hypothesis that several population means are equal:
  $$H_0\colon \mu_1 = \mu_2 = \cdots = \mu_I$$
  against the alternative hypothesis
  $$H_a\colon \text{not all of } \mu_1, \mu_2, \ldots, \mu_I \text{ are equal}$$

One-Way Analysis of Variance: Assumptions

1. We have $I$ independent SRSs, one from each population.
2. Each population is normally distributed, with unknown (and potentially different) means $\mu_i$, but the same unknown standard deviation $\sigma$.
3. The ANOVA F-test is approximately correct if the largest of the sample standard deviations is no more than twice as large as the smallest of the sample standard deviations.
One-Way Analysis of Variance: Calculating the F Test Statistic

1. Find the sum of squares for groups
   $$\mathit{SSG} = \sum n_i (\bar{x}_i - \bar{x})^2$$
   where $n_i$ is the size of the $i$th sample; $\bar{x}_i$ is the mean for the $i$th sample; $\bar{x}$ is the mean for all of the data. The degrees of freedom for $\mathit{SSG}$ are $I - 1$.
2. The mean square for groups is
   $$\mathit{MSG} = \frac{\mathit{SSG}}{I - 1}$$
3. Find the sum of squares for error
   $$\mathit{SSE} = \sum (n_i - 1) s_i^2$$
   where $s_i^2$ is the variance in sample $i$. The sum of squares for error has $N - I$ degrees of freedom (careful: not $N - 1$), where $N = \sum n_i$ is the total sample size.
4. The mean square for error is
   $$\mathit{MSE} = \frac{\mathit{SSE}}{N - I}$$
5. The test statistic
   $$F = \frac{\mathit{MSG}}{\mathit{MSE}}$$
   follows the F distribution with $I - 1$ degrees of freedom in the numerator and $N - I$ degrees of freedom in the denominator.
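The five steps above can be sketched as a short function. The three groups of data are invented, the function name is an assumption, and the P-value is left to an F table or statistical software.

```python
from statistics import mean, variance


def one_way_anova(groups):
    """Return (F, numerator df, denominator df) for a list of samples."""
    num_groups = len(groups)                    # I
    total_n = sum(len(g) for g in groups)       # N
    grand_mean = sum(sum(g) for g in groups) / total_n
    # Step 1: sum of squares for groups
    ssg = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Step 2: mean square for groups
    msg = ssg / (num_groups - 1)
    # Step 3: sum of squares for error, from the within-group variances
    sse = sum((len(g) - 1) * variance(g) for g in groups)
    # Step 4: mean square for error (N - I degrees of freedom, not N - 1)
    mse = sse / (total_n - num_groups)
    # Step 5: F statistic
    return msg / mse, num_groups - 1, total_n - num_groups


# Made-up data for three groups.
print(one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]]))
```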
One-Way Analysis of Variance: The ANOVA Table

It is convenient to organize the calculation of the one-way analysis of variance F-test in an ANOVA table:

Source    df         SS                                      MS                                 F
Groups    $I - 1$    $\sum n_i (\bar{x}_i - \bar{x})^2$      $\mathit{SSG}/(I - 1)$             $\mathit{MSG}/\mathit{MSE}$
Error     $N - I$    $\sum (n_i - 1) s_i^2$                  $\mathit{SSE}/(N - I)$
Total     $N - 1$    $\mathit{SSG} + \mathit{SSE}$