Nonparametric tests

If the data do not come from a normal population (and the sample is not large), we cannot use a t-test. One useful approach to constructing test statistics is through the use of rank statistics.

Sign test for one population

The sign test assumes the data come from a continuous distribution with model
$$Y_i = \eta + \epsilon_i, \quad i = 1, \dots, n,$$
where $\eta$ is the population median and the $\epsilon_i$ follow an unknown, continuous distribution. We want to test $H_0: \eta = \eta_0$, where $\eta_0$ is known, versus one of the three common alternatives: $H_a: \eta < \eta_0$, $H_a: \eta \neq \eta_0$, or $H_a: \eta > \eta_0$.
The test statistic is $B = \sum_{i=1}^n I\{y_i > \eta_0\}$, the number of $y_1, \dots, y_n$ larger than $\eta_0$. Under $H_0$, $B \sim \mathrm{bin}(n, 0.5)$. Reject $H_0$ if $B$ is unusual relative to this binomial distribution.

Example: Eye relief data.

Question: How would you form a large-sample test statistic from $B$? You would not need to do that here, but this is common with more complex test statistics that have non-standard distributions.
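The binomial null distribution makes the sign test easy to carry out by hand or in code. Here is a minimal sketch in Python (the course uses R; the function name is my own, and ties $y_i = \eta_0$ are dropped, as is the standard convention):

```python
from math import comb

def sign_test(y, eta0):
    """Two-sided exact sign test of H0: median = eta0.
    Observations equal to eta0 are dropped (standard convention)."""
    y = [v for v in y if v != eta0]
    n = len(y)
    b = sum(v > eta0 for v in y)           # test statistic B
    def binom_cdf(k):                      # P(Bin(n, 0.5) <= k)
        return sum(comb(n, j) for j in range(k + 1)) / 2**n
    p_lower = binom_cdf(b)                 # small B: evidence eta < eta0
    p_upper = 1 - binom_cdf(b - 1)         # large B: evidence eta > eta0
    return b, min(1.0, 2 * min(p_lower, p_upper))
```

For example, with ten observations all above $\eta_0$, $B = 10$ and the two-sided p-value is $2 \cdot (1/2)^{10} \approx 0.002$.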
Wilcoxon signed rank test

Again, we test $H_0: \eta = \eta_0$; however, this method assumes a pdf symmetric about the median $\eta$. The test statistic is built from the ranks of $\{|y_1 - \eta_0|, |y_2 - \eta_0|, \dots, |y_n - \eta_0|\}$, denoted $R_1, \dots, R_n$. The signed rank for observation $i$ is
$$R_i^+ = \begin{cases} R_i & y_i > \eta_0 \\ 0 & y_i \le \eta_0. \end{cases}$$
The signed rank statistic is $W^+ = \sum_{i=1}^n R_i^+$. If $W^+$ is large, this is evidence that $\eta > \eta_0$; if $W^+$ is small, this is evidence that $\eta < \eta_0$.
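Computing $W^+$ is mechanical: rank the absolute deviations, then sum the ranks of the positive deviations. A sketch in Python (the course uses R; average ranks for tied $|y_i - \eta_0|$ is the usual convention):

```python
def signed_rank_statistic(y, eta0):
    """Compute W+ for the Wilcoxon signed rank test of H0: median = eta0.
    Tied |y_i - eta0| receive their average rank; exact zeros are dropped."""
    d = [v - eta0 for v in y if v != eta0]
    absd = sorted(abs(v) for v in d)
    def avg_rank(a):
        # average of the 1-based positions where |d| equals a (handles ties)
        idx = [i + 1 for i, v in enumerate(absd) if v == a]
        return sum(idx) / len(idx)
    return sum(avg_rank(abs(v)) for v in d if v > 0)
```

For instance, with $y = (-1, 2, 3)$ and $\eta_0 = 0$, the absolute deviations $1, 2, 3$ get ranks $1, 2, 3$, and only the last two are positive, so $W^+ = 2 + 3 = 5$.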
Both the sign test and the signed-rank test can be used with paired data (e.g. we could test whether the median difference is zero).

When to use what?
- Use the t-test when the data are approximately normal, or when the sample size is large.
- Use the sign test when the data are highly skewed, multimodal, etc.
- Use the signed rank test when the data are approximately symmetric but non-normal (e.g. heavy- or light-tailed, multimodal yet symmetric, etc.).

Example: Weather station data.

Note: The sign test and signed-rank test are more flexible than the t-test because they require less strict assumptions, but the t-test has more power when the data are approximately normal.
Mann-Whitney-Wilcoxon nonparametric test (pp. 795-796)

The Mann-Whitney test assumes $Y_{11}, \dots, Y_{1n_1} \overset{iid}{\sim} F_1$ independent of $Y_{21}, \dots, Y_{2n_2} \overset{iid}{\sim} F_2$, where $F_1$ is the cdf of data from the first group and $F_2$ is the cdf of data from the second group. The null hypothesis is $H_0: F_1 = F_2$, i.e. that the distributions of data in the two groups are identical. The alternative is $H_1: F_1 \neq F_2$. One-sided tests can also be performed.

Although the test statistic is built assuming $F_1 = F_2$, the alternative is often taken to be that the population medians are unequal. This is a fine way to report the results of the test. Additionally, a CI will give a plausible range for $\Delta$ in the shift model $F_2(x) = F_1(x - \Delta)$; $\Delta$ can be the difference in medians or means.
The Mann-Whitney test is intuitive. The data are $y_{11}, y_{12}, \dots, y_{1n_1}$ and $y_{21}, y_{22}, \dots, y_{2n_2}$. For each observation $j$ in the first group, count the number $c_j$ of observations in the second group that are smaller; ties add 0.5 to the count. If $H_0$ is true, on average half the observations in group 2 would be above $Y_{1j}$ and half would be below, since they come from the same distribution. That is, $E(c_j) = 0.5 n_2$. The sum of these counts is $U = \sum_{j=1}^{n_1} c_j$, which has mean $E(U) = 0.5 n_1 n_2$. The variance is a bit harder to derive, but is $\mathrm{var}(U) = n_1 n_2 (n_1 + n_2 + 1)/12$.
Something akin to the CLT tells us that
$$Z_0 = \frac{U - E(U)}{\sqrt{n_1 n_2 (n_1 + n_2 + 1)/12}} \sim N(0, 1),$$
approximately, when $H_0$ is true. Seeing a $U$ far away from what we expect under the null gives evidence that $H_0$ is false; $U$ is then standardized as usual (subtract off the mean we expect under the null and divide by the standard deviation of $U$). A p-value can be computed as usual, as well as a CI.

Example: Dental measurements.

Note: This test essentially boils down to replacing the observed values with their ranks and carrying out a simple pooled t-test!
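Putting the counting scheme and the normal approximation together, here is a minimal sketch in Python (the course uses R; the function name is my own):

```python
from math import erf, sqrt

def mann_whitney(y1, y2):
    """Mann-Whitney test via the large-sample normal approximation.
    For each observation in group 1, c_j counts how many group-2
    observations are smaller, with ties contributing 0.5."""
    n1, n2 = len(y1), len(y2)
    U = sum(sum(1.0 if b < a else (0.5 if b == a else 0.0) for b in y2)
            for a in y1)
    mean_U = 0.5 * n1 * n2                   # E(U) under H0
    var_U = n1 * n2 * (n1 + n2 + 1) / 12     # var(U) under H0
    z = (U - mean_U) / sqrt(var_U)
    p = 1 + erf(-abs(z) / sqrt(2))           # two-sided p-value: 2*Phi(-|z|)
    return U, z, p
```

With tiny groups like $(1,2,3)$ versus $(4,5,6)$, every $c_j = 0$, so $U = 0$ and $Z_0 = -4.5/\sqrt{5.25} \approx -1.96$; in practice the normal approximation is only trusted for larger samples.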
Regression models

Model: a mathematical approximation of the relationship among real quantities (an equation plus assumptions about its terms). We have seen several models for a single outcome variable from either one or two populations. Now we'll consider models that relate an outcome to one or more continuous predictors.
Section 1.1: relationships between variables

A functional relationship between two variables is deterministic, e.g. $Y = \cos(2.1x) + 4.7$. Although often an approximation to reality (e.g. the solution to a differential equation under simplifying assumptions), the relation itself is perfect. (e.g. page 3)

A statistical or stochastic relationship introduces some error in observing $Y$; it is typically a functional relationship plus noise. (e.g. Figures 1.1, 1.2, and 1.3; pp. 4-5) A statistical relationship is not a perfect line or curve, but a general tendency plus slop.
Example:

x: 1 2 3 4 5
y: 1 1 2 2 4

Let's obtain the plot in R:

x = c(1, 2, 3, 4, 5); y = c(1, 1, 2, 2, 4); plot(x, y)

We must decide on the proper functional form for this relationship, i.e. the trend: Linear? Curved? Piecewise continuous?
Section 1.3: Simple linear regression model

For a sample of pairs $\{(x_i, Y_i)\}_{i=1}^n$, let
$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i, \quad i = 1, \dots, n,$$
where
- $Y_1, \dots, Y_n$ are realizations of the response variable,
- $x_1, \dots, x_n$ are the associated predictor variables,
- $\beta_0$ is the intercept of the regression line,
- $\beta_1$ is the slope of the regression line, and
- $\epsilon_1, \dots, \epsilon_n$ are unobserved, uncorrelated random errors.

This model assumes that $x$ and $Y$ are linearly related, i.e. the mean of $Y$ changes linearly with $x$.
Assumptions about the random errors

We assume that $E(\epsilon_i) = 0$, $\mathrm{var}(\epsilon_i) = \sigma^2$, and $\mathrm{corr}(\epsilon_i, \epsilon_j) = 0$ for $i \neq j$: mean zero, constant variance, uncorrelated.

$\beta_0 + \beta_1 x_i$ is the deterministic part of the model. It is fixed but unknown. $\epsilon_i$ represents the random part of the model. The goal of statistics is often to separate signal from noise; which is which here?
Note that
$$E(Y_i) = E(\beta_0 + \beta_1 x_i + \epsilon_i) = \beta_0 + \beta_1 x_i + E(\epsilon_i) = \beta_0 + \beta_1 x_i,$$
and similarly
$$\mathrm{var}(Y_i) = \mathrm{var}(\beta_0 + \beta_1 x_i + \epsilon_i) = \mathrm{var}(\epsilon_i) = \sigma^2.$$
Also, $\mathrm{corr}(Y_i, Y_j) = 0$ for $i \neq j$. (We will draw a picture...)
Example: Pages 10-11 (see picture). We have $E(Y) = 9.5 + 2.1x$. When $x = 45$, our expected y-value is $9.5 + 2.1(45) = 104$, but we will actually observe a value somewhere around 104, not 104 exactly.

What does 9.5 represent here? Is it sensible/interpretable? How is 2.1 interpreted here? In general, $\beta_1$ represents how the mean response changes when $x$ is increased one unit.
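As a quick numeric check of this example, a sketch in Python (the course uses R; the noise standard deviation below is an invented value, just to illustrate "somewhere around 104"):

```python
import random

def mean_response(x):
    """Mean function from the pp. 10-11 example: E(Y) = 9.5 + 2.1x."""
    return 9.5 + 2.1 * x

expected = mean_response(45)               # 9.5 + 94.5 = 104
random.seed(1)
observed = expected + random.gauss(0, 3)   # sigma = 3 is hypothetical
```

The observed value lands near, but not exactly at, the mean response; that gap is the realized error $\epsilon$.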
Matrix form:

Note the simple linear regression model can be written in matrix terms as
$$\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix} = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix} + \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix},$$
or equivalently $Y = X\beta + \epsilon$, where
$$Y = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}, \quad X = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_0 \\ \beta_1 \end{bmatrix}, \quad \epsilon = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}.$$
This will be useful later on.
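A tiny sketch of the matrix form in Python, using plain lists rather than a matrix library (the x-values and the line 9.5 + 2.1x reuse earlier examples from these notes):

```python
def matvec(X, beta):
    """Multiply an n x 2 design matrix X by the 2-vector beta."""
    return [row[0] * beta[0] + row[1] * beta[1] for row in X]

x = [1, 2, 3, 4, 5]              # predictor values
X = [[1.0, xi] for xi in x]      # design matrix: column of ones, column of x's
beta = [9.5, 2.1]                # (beta0, beta1)
mean_Y = matvec(X, beta)         # E(Y) = X beta, the deterministic part
```

Adding an error vector $\epsilon$ to `mean_Y` would produce one realization of $Y = X\beta + \epsilon$.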
Section 1.6: Estimation of $(\beta_0, \beta_1)$

$\beta_0$ and $\beta_1$ are unknown parameters to be estimated from the data: $(x_1, Y_1), (x_2, Y_2), \dots, (x_n, Y_n)$. They completely determine the unknown mean at each value of $x$: $E(Y) = \beta_0 + \beta_1 x$.

Since we expect the various $Y_i$ to be both above and below $\beta_0 + \beta_1 x_i$ by roughly the same amount ($E(\epsilon_i) = 0$), a good-fitting line $b_0 + b_1 x$ will go through the heart of the data points in a scatterplot. The method of least squares formalizes this idea by minimizing the sum of the squared deviations of the observed $y_i$ from the line $b_0 + b_1 x_i$.
Least squares method for estimating $(\beta_0, \beta_1)$

The sum of squared deviations about the line is
$$Q(\beta_0, \beta_1) = \sum_{i=1}^n [Y_i - (\beta_0 + \beta_1 x_i)]^2.$$
Least squares minimizes $Q(\beta_0, \beta_1)$ over all $(\beta_0, \beta_1)$. Calculus shows that the least squares estimators are
$$b_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^n (x_i - \bar{x})^2}, \qquad b_0 = \bar{Y} - b_1 \bar{x}.$$
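These formulas are easy to code directly. A sketch in Python (the course uses R), checked on the toy data from the earlier scatterplot example (x = 1,...,5 and y = 1,1,2,2,4):

```python
def least_squares(x, y):
    """Least squares estimates: b1 = Sxy/Sxx, b0 = ybar - b1*xbar."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

b0, b1 = least_squares([1, 2, 3, 4, 5], [1, 1, 2, 2, 4])
```

For these data, $\bar{x} = 3$, $\bar{Y} = 2$, $S_{xy} = 7$, and $S_{xx} = 10$, so the fitted line is $\hat{Y} = -0.1 + 0.7x$.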
Proof:
$$\frac{\partial Q}{\partial \beta_0} = \sum_{i=1}^n 2(Y_i - \beta_0 - \beta_1 x_i)(-1) = -2\left[\sum_{i=1}^n Y_i - n\beta_0 - \beta_1 \sum_{i=1}^n x_i\right],$$
$$\frac{\partial Q}{\partial \beta_1} = \sum_{i=1}^n 2(Y_i - \beta_0 - \beta_1 x_i)(-x_i) = -2\left[\sum_{i=1}^n x_i Y_i - \beta_0 \sum_{i=1}^n x_i - \beta_1 \sum_{i=1}^n x_i^2\right].$$
Setting these equal to zero, and dropping indexes on the summations, we have the normal equations
$$\sum x_i Y_i = b_0 \sum x_i + b_1 \sum x_i^2, \qquad \sum Y_i = n b_0 + b_1 \sum x_i.$$
Multiply the first by $n$, multiply the second by $\sum x_i$, and subtract, yielding
$$n \sum x_i Y_i - \sum x_i \sum Y_i = b_1\left[n \sum x_i^2 - \left(\sum x_i\right)^2\right].$$
Solving for $b_1$ we get
$$b_1 = \frac{n \sum x_i Y_i - \sum x_i \sum Y_i}{n \sum x_i^2 - \left(\sum x_i\right)^2} = \frac{\sum x_i Y_i - n\bar{Y}\bar{x}}{\sum x_i^2 - n\bar{x}^2}.$$
The second normal equation immediately gives $b_0 = \bar{Y} - b_1 \bar{x}$. Our solution for $b_1$ is correct but not as aesthetically pleasing as the purported solution
$$b_1 = \frac{\sum_{i=1}^n (x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^n (x_i - \bar{x})^2}.$$
Show:
$$\sum (x_i - \bar{x})(Y_i - \bar{Y}) = \sum x_i Y_i - n\bar{Y}\bar{x}, \qquad \sum (x_i - \bar{x})^2 = \sum x_i^2 - n\bar{x}^2.$$
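Both "Show" identities follow from expanding the products; they can also be spot-checked numerically. A sketch in Python with arbitrary toy numbers (not data from the notes):

```python
x = [1.0, 2.0, 4.0, 7.0]   # arbitrary toy values for the check
y = [2.0, 3.0, 3.0, 8.0]
n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# First identity: sum of cross-deviations vs. sum(x*y) - n*xbar*ybar
lhs1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
rhs1 = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar

# Second identity: sum of squared deviations vs. sum(x^2) - n*xbar^2
lhs2 = sum((xi - xbar) ** 2 for xi in x)
rhs2 = sum(xi ** 2 for xi in x) - n * xbar ** 2
```

Both pairs agree (up to floating-point rounding), as the algebra guarantees.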
The line $\hat{Y} = b_0 + b_1 x$ is called the least squares estimated regression line. Why are the least squares estimates $(b_0, b_1)$ good?

1. They are unbiased: $E(b_0) = \beta_0$ and $E(b_1) = \beta_1$.
2. Among all linear unbiased estimators, they have the smallest variance. They are best linear unbiased estimators, BLUEs.

We will show the first property in the next class. The second property is formally called the Gauss-Markov theorem and is proved in a linear models course. (Look it up on Wikipedia if interested.)
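Unbiasedness can be illustrated by simulation: generate many data sets from a known line and average the slope estimates. A sketch in Python with hypothetical true values $\beta_0 = 2$, $\beta_1 = 0.5$, $\sigma = 1$ (all invented for illustration):

```python
import random

def fit_b1(x, y):
    """Least squares slope b1 = Sxy / Sxx."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
            / sum((a - xbar) ** 2 for a in x))

random.seed(0)
beta0, beta1, sigma = 2.0, 0.5, 1.0       # hypothetical true parameters
x = [float(i) for i in range(1, 11)]
estimates = [fit_b1(x, [beta0 + beta1 * xi + random.gauss(0, sigma) for xi in x])
             for _ in range(2000)]
avg_b1 = sum(estimates) / len(estimates)  # close to beta1, since E(b1) = beta1
```

Individual estimates scatter around 0.5, but their average over many simulated data sets is very close to the true slope, which is exactly what $E(b_1) = \beta_1$ says.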