Stat 502 Design and Analysis of Experiments Normal and Related Distributions (t, χ² and F) Tests, Power, Sample Size & Confidence Intervals


1 Stat 502 Design and Analysis of Experiments Normal and Related Distributions (t, χ² and F) Tests, Power, Sample Size & Confidence Intervals Fritz Scholz, Department of Statistics, University of Washington, October 21, 2013

2 Hypothesis Testing We addressed the question: Does the type of flux affect SIR? Formally we have tested the null hypothesis H0: The type of flux does not affect SIR against the alternative hypothesis H1: The type of flux does affect SIR. While H0 seems fairly specific, H1 is open ended. H1 can be anything but H0. There may be many ways for SIR to be affected by flux differences, e.g., a change in mean, median, or scatter. Different effects for different boards? Such differences may show up in the data vector Z through an appropriate test statistic s(Z). Here Z = (X1,..., X9, Y1,..., Y9).

3 Test Criteria or Test Statistics In the flux analysis we used the absolute difference of sample means, s(Z) = |Ȳ − X̄|, as our test statistic for testing H0. A test statistic is a value calculated from data and other known entities, e.g., assumed (e.g., hypothesized) parameter values. We could have used the absolute difference of sample medians, or the ratio of sample standard deviations and compared that ratio with 1, etc. Different test statistics are sensitive to different deviations from the null hypothesis. A test statistic, when viewed as a function of random input data, is itself a random variable, and has a distribution, its sampling distribution.

4 Sampling Distributions For a test statistic s(Z) to be effective in deciding between H0 and H1, the sampling distributions of s(Z) under H0 and H1 should be separable to some degree. [Figure: the sampling distribution of s(Z) under H0 next to the sampling distribution under H1.]

5 Sampled and Sampling Distributions [Figure: the sampled distributions generating the samples X1,..., Xn under H0 and under H1, together with the resulting sampling distributions of X̄ under H0 and under H1.]

6 When to Reject H0 The previous illustration shows a specific sampling distribution for s(Z) under H1. Typically H1 consists of many possible distributional models, leading to many possible sampling distributions under H1. Under H0 we often have just a single sampling distribution, the null distribution. If under H1 the test statistic s(Z) tends to have mostly higher values than under H0, we would want to reject H0 when s(Z) is large, as on the previous slide. How large is too large? Need a critical value C_crit and reject H0 when s(Z) ≥ C_crit. Choose C_crit such that P(s(Z) ≥ C_crit | H0) = α, a pre-chosen significance level. Typically α = .05 or .01. It is the probability of the type I error. The previous illustration also shows that there may be values s(z) in the overlap of both distributions. Decisions are not clear cut ⇒ type I or type II error.

7 Decision Table

                 Truth
  Decision       H0 is true          H0 is false
  accept H0      correct decision    type II error
  reject H0      type I error        correct decision

Testing hypotheses (like estimation) is a branch of a more general framework, namely decision theory. Decisions are optimized with respect to penalties for wrong decisions, i.e., P(Type I Error) and P(Type II Error), or the mean squared error of an estimate θ̂ of θ, namely E((θ̂ − θ)²).

8 The Null Distribution and Critical Values [Figure: the sampling distribution under H0 with the critical value separating the accept-H0 region from the reject-H0 region; the tail area beyond the critical value is the type I error probability, the significance level α. Below it, the sampling distribution under H1 with the corresponding type II error area to the left of the critical value.]

9 Critical Values and p-values The p-value(s(z)) for the observed test statistic s(z) is P(s(Z) ≥ s(z) | H0). Note that p-value(s(z)) ≤ α is equivalent to rejecting H0 at level α. [Figure: the null distribution with the observed value s(z) and its upper tail area, the p-value, next to the critical value with tail area α; below it, the sampling distribution under H1 with the type II error area.]

10 p-values and Significance Levels We just saw that knowing the p-value allows us to accept or reject H0 at level α. However, the p-value is more informative than saying that we reject at level α. It is the smallest level α at which we would still reject H0. It is also called the observed significance level. Working with a predefined α made it possible to choose the best level α test. Best: having the highest probability of rejecting H0 when H1 is true (Neyman-Pearson theory). This makes for nice and useful mathematical theory, but p-values should be the preferred way of judging and reporting test results for a specific test statistic. For a given z a different test statistic s1(z) may give a smaller p-value than the optimal test statistic s0(z). This complicates finding optimal tests based on p-value behavior.

11 The Power Function The probability of rejecting H0 is denoted by β. It is a function of the distributional model F governing Z, i.e., β = β(F). It is called the power function of the test. When the hypothesis H0 is composite, i.e., when s(Z) has more than one possible distribution under H0, one defines the highest probability of type I error as the significance level of the test. Hence α = maximum{β(F) : F ∈ H0}. α limits the type I error probability. For various F ∈ H1 the power function also gives us the corresponding probabilities of type II error as 1 − β(F). Sometimes the probability of type II error itself is denoted by β = β(F). Make sure you know what is meant by β(F) when you read about it.

12 Samples and Populations We have covered inference based on a randomization test. This relied heavily on our randomized assignment of flux X and flux Y to the 18 circuit boards. Such inference can logically only say something about flux differences in the context of those 18 boards. To generalize any conclusions to other boards would require some assumptions, judgement, and ultimately a step of faith. Would the same conclusion have been reached for another set of 18 boards? What if one of the boards was an outlier board or something else was peculiar? To be representative of the population of all boards we should view these 18 boards and their processing as a random sample from a conceptual population of such processed boards.

13 Conceptual Populations Clearly the 18 boards happened to be available at the time of the experiment. Not a random sample of all boards available at the time. May have been taken sequentially in the order of production. Not a sample from future boards, yet to be produced. The randomized processing steps might give the appearance of random samples, assuming that these steps are mostly responsible for response variations. Thus we could regard the 9+9 SIR values as two random samples from two very large or infinite conceptual populations of SIR values. One sample of 9 boards from all boards/processes treated with flux X and one sample of 9 boards treated with flux Y. A board can only be treated with flux X or Y (not both at the same time) ⇒ further conceptualization.

14 Population Distributions and Densities Infinite populations of Z-values are conveniently described by densities f(z), with f(z) ≥ 0 ∀z and ∫ f(z) dz = 1. The probability of observing a randomly chosen element Z with Z ≤ x is F(x) = P(Z ≤ x) = ∫_{−∞}^{x} f(z) dz = ∫_{−∞}^{x} f(t) dt; z & t are just dummy variables. Avoid using x as dummy integration variable. F(x) as a function of x is also called the cumulative distribution function (CDF) of the random variable Z. F(x) increases from 0 to 1 as x goes from −∞ to ∞. For discrete populations with a finite or countably infinite number of distinct possible values z, we replace f(z) by the probability mass function p(z) = P(Z = z) ≥ 0 and write F(x) = P(Z ≤ x) = Σ_{z ≤ x} p(z) with Σ_z p(z) = 1.

15 Means, Expectations and Variances The mean or expectation of Z or its population is defined by µ = µ_Z = E(Z) = ∫ z f(z) dz or µ = E(Z) = Σ_z z p(z), a probability weighted average of z values. µ is the center of probability mass balance. By extension, the mean or expectation of g(Z) is defined by E(g(Z)) = ∫ g(z) f(z) dz or E(g(Z)) = Σ_z g(z) p(z). Using g(z) = (z − µ)², the variance of Z is defined by σ² = var(Z) = E((Z − µ)²) = ∫ (z − µ)² f(z) dz or σ² = Σ_z (z − µ)² p(z). σ = σ_Z = √var(Z) is called the standard deviation of Z or its population. It is a measure of distribution spread.

16 p-quantiles z_p The p-quantile z_p of a distribution is defined as the lowest point z with F(z) ≥ p, or z_p = inf{z : F(z) ≥ p}. When the distribution of Z is continuous and strictly increasing, z_p is also defined by F(z_p) = p. This is the case typically encountered in this course. If there is a whole interval [z1, z2] over which we have F(z) = p for z ∈ [z1, z2], then any point in that interval could qualify as p-quantile, although the lowest point (z1) is customarily chosen, and by the above definition. For p = .5 one may prefer the midpoint of such an interval. If Z has a discrete distribution, then z_p could coincide with one of the jump points and we could have F(z_p) > p.
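
As a quick illustration (a minimal sketch in R; the fair-die distribution is an arbitrary choice), the inf-definition can be evaluated directly for a discrete distribution and compared with the continuous case:

  # p-quantile of a discrete distribution via z_p = inf{z : F(z) >= p}
  z <- 1:6                    # possible values (fair die)
  Fz <- cumsum(rep(1/6, 6))   # CDF at the jump points
  p <- 0.95
  min(z[Fz >= p])             # 6; here F(z_p) = 1 > p (a jump point)
  # for a continuous, strictly increasing CDF, F(z_p) = p exactly:
  qnorm(0.95)                 # standard normal 0.95-quantile
  pnorm(qnorm(0.95))          # recovers 0.95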

17 p-quantile Illustrations [Figure: left panel shows a continuous, strictly increasing CDF F(z) with the p = .95 quantile read off at F(z_p) = p; right panel shows a discrete CDF with z_0.5 = 4 and z_0.95 = 6, where F can jump past p.]

18 Multivariate Densities or Populations f(z1,..., zn) is a multivariate density if it has the following properties: f(z1,..., zn) ≥ 0 ∀ z1,..., zn and ∫···∫ f(z1,..., zn) dz1 ··· dzn = 1. It describes the behavior of the infinite population of such n-tuples (z1,..., zn). A random element (Z1,..., Zn) drawn from such a population is a random vector. We say that Z1,..., Zn in such a random vector are (statistically) independent when the following property holds: f(z1,..., zn) = f1(z1) ··· fn(zn) ∀ z1,..., zn. Here fi(zi) is the marginal density of Zi. It is obtainable from the multivariate density by integrating out all other variables, e.g., f2(z2) = ∫···∫ f(z1, z2, z3,..., zn) dz1 dz3 ··· dzn.

19 E(g(Z1,..., Zn)) and cov(Z1, Z2) In analogy to the univariate case we define E(g(Z1,..., Zn)) = ∫···∫ g(z1,..., zn) f(z1,..., zn) dz1 ··· dzn. In particular, using g(z1, z2,..., zn) = z1 z2, E(Z1 Z2) = ∫∫ z1 z2 f(z1, z2) dz1 dz2. For independent Z1 and Z2 we have E(Z1 Z2) = ∫∫ z1 z2 f1(z1) f2(z2) dz1 dz2 = E(Z1) E(Z2). Define the covariance of Z1 and Z2 as cov(Z1, Z2) = E[(Z1 − E(Z1))(Z2 − E(Z2))] = E(Z1 Z2) − E(Z1) E(Z2). Note that independent Z1 and Z2 ⇒ cov(Z1, Z2) = 0, but ⇐ does not hold in general. Also note cov(Z, Z) = E(Z²) − [E(Z)]² = var(Z).

20 Random Sample When drawing repeatedly values Z1,..., Zn from a common infinite population with density f(z) we get a multivariate random vector (Z1,..., Zn). If the drawings are physically unrelated or independent, we may consider Z1,..., Zn as statistically independent, i.e., the random vector has density h(z1,..., zn) = f(z1) ··· f(zn); note f1 = ... = fn = f. Z1,..., Zn is then also referred to as a random sample from f. We also express this as Z1,..., Zn ~ i.i.d. f. i.i.d. = independent and identically distributed.

21 Rules of Expectations & Variances (Review) For any set of random variables X1,..., Xn and constants a0, a1,..., an we have E(a0 + a1 X1 + ... + an Xn) = a0 + a1 E(X1) + ... + an E(Xn), provided E(X1),..., E(Xn) exist and are finite. This holds whether X1,..., Xn are independent or not. For any set of independent random variables X1,..., Xn and constants a0, a1,..., an we have var(a0 + a1 X1 + ... + an Xn) = a1² var(X1) + ... + an² var(Xn), provided the variances var(X1),..., var(Xn) exist and are finite. var(a0) = 0. This is also true under the weaker (than independence) condition cov(Xi, Xj) = E(Xi Xj) − E(Xi) E(Xj) = 0 for i ≠ j. In that case X1,..., Xn are uncorrelated.

22 Rules for Averages Whether X1,..., Xn are independent or not, we have E(X̄) = E((1/n) Σ_{i=1}^n Xi) = (1/n) Σ_{i=1}^n E(Xi) = (1/n) Σ_{i=1}^n µi = µ̄. If µ1 = ... = µn = µ then E(X̄) = µ. When X1,..., Xn are independent (or uncorrelated) we have var(X̄) = var((1/n) Σ_{i=1}^n Xi) = (1/n²) Σ_{i=1}^n var(Xi) = (1/n²) Σ_{i=1}^n σi² = σ̄n²/n, where σ̄n² = (1/n) Σ_{i=1}^n σi²; σ̄n² = σ² when σ1² = ... = σn² = σ². σ̄n²/n → 0 as n → ∞, provided σ̄n² stays bounded.
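
A quick simulation (a minimal sketch; µ = 3, σ = 2, n = 25 are arbitrary choices) confirms these rules for an i.i.d. sample:

  # check E(Xbar) = mu and var(Xbar) = sigma^2/n by simulation
  set.seed(1)
  mu <- 3; sigma <- 2; n <- 25
  xbar <- replicate(10000, mean(rnorm(n, mu, sigma)))
  mean(xbar)   # close to mu = 3
  var(xbar)    # close to sigma^2/n = 4/25 = 0.16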

23 A Normal Random Sample X1,..., Xn is called a normal random sample when the common density f of the Xi is a normal density: f(x) = (1/(√(2π) σ)) exp(−(x − µ)²/(2σ²)); we also write Xi ~ N(µ, σ²). This density has mean µ and standard deviation σ. When µ = 0 and σ = 1, it is called the standard normal density ϕ(x) = (1/√(2π)) exp(−x²/2), with CDF Φ(x) = ∫_{−∞}^{x} ϕ(z) dz. X ~ N(µ, σ²) ⇔ Z = (X − µ)/σ ~ N(0, 1).

24 The CLT & the Normal Population Model The normal population model is motivated by the Central Limit Theorem (CLT). This comes about because many physical or natural measured phenomena can be viewed as the addition of several independent source inputs or contributors: Y = X1 + ... + Xk or Y = a0 + a1 X1 + ... + ak Xk for independent r.v.'s X1,..., Xk and constants a0, a1,..., ak. Or, in a 1-term Taylor expansion, Y = f(X1,..., Xk) ≈ f(µ1,..., µk) + Σ_{i=1}^k (Xi − µi) ∂f(µ1,..., µk)/∂µi = a0 + a1 X1 + ... + ak Xk, provided the linearization is sufficiently good, i.e., the deviations Xi − µi are small compared to the curvature of f (small σi).

25 Central Limit Theorem (CLT) I Suppose we randomly and independently draw random variables X1,..., Xn from n possibly different populations with respective means µ1,..., µn and standard deviations σ1,..., σn. Suppose that the following variance ratio property holds (1): max_{i=1,...,n}(σi²)/(σ1² + ... + σn²) → 0 as n → ∞, i.e., none of the variances dominates among all variances, obviously the case when σ1 = ... = σn and n → ∞. Then Y = Yn = X1 + ... + Xn has an approximate normal distribution with mean and variance given by µY = µ1 + ... + µn and σY² = σ1² + ... + σn². (1) The required condition is a bit more technical.

26 Central Limit Theorem (CLT) II The next few slides illustrate how well the CLT performs and when it falls short. In the following CLT illustrations the superimposed normal density corresponds to µ = µ1 + ... + µn and σ² = σ1² + ... + σn². First we show 4 (5) distributions from which the X1,..., X4 are sampled, respectively. These distributions cover a wide spectrum of shapes. We only sample 4, illustrating that n = 4 can be sufficiently close to ∞ in order for the CLT to be reasonably effective. Note the slight deterioration in the normal appearance when we exchange the normal distribution with a skewed distribution, i.e., when generating X2 + ... + X5 instead of X1 + ... + X4 (⇒ slide 30 (VI)).
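
A sketch of how such an illustration can be generated in R (the four populations here, standard normal, uniform on (0,1), log-normal, and Weibull, are my stand-ins for the panels on the next slide; exact parameters may differ from the originals):

  # CLT at work: sum of one draw each from four different populations
  set.seed(1)
  Nsim <- 10000
  y <- rnorm(Nsim) + runif(Nsim) + rlnorm(Nsim) + rweibull(Nsim, shape=2, scale=1)
  hist(y, breaks=60, probability=TRUE, main="Central Limit Theorem at Work")
  # superimpose the normal density with matching mean and variance
  mu <- 0 + 1/2 + exp(1/2) + gamma(1.5)
  sig2 <- 1 + 1/12 + (exp(1)-1)*exp(1) + (gamma(2) - gamma(1.5)^2)
  curve(dnorm(x, mu, sqrt(sig2)), add=TRUE, col="blue")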

27 Central Limit Theorem (CLT) III [Figure: the four sampled population densities: a standard normal population, a uniform population on (0,1), a log-normal population, and a Weibull population.]

28 Central Limit Theorem (CLT) IV [Figure: histogram of x1 + x2 + x3 + x4 with the matching normal density superimposed; the Central Limit Theorem at work.]

29 Central Limit Theorem (CLT) V [Figure: five sampled population densities: a standard normal population, a uniform population on (0,1), a log-normal population, and two Weibull populations.]

30 Central Limit Theorem (CLT) VI [Figure: histograms of x1 + x2 + x3 + x4 and of x2 + x3 + x4 + x5, each with the matching normal density superimposed; the Central Limit Theorem at work.]

31 Central Limit Theorem (CLT) VII [Figure: the four sampled population densities: a standard normal population, a uniform population on (0,1), a log-normal population, and a Weibull population.]

32 Central Limit Theorem (CLT) VIII [Figure: histogram of x1 + x2 + x3 + x4 with the matching normal density superimposed; the Central Limit Theorem at work, but not so good.]

33 Central Limit Theorem (CLT) IX [Figure: the four sampled population densities: a standard normal population, a uniform population on (0,20), a log-normal population, and a Weibull population.]

34 Central Limit Theorem (CLT) X [Figure: histogram of x1 + x2 + x3 + x4 with the matching normal density superimposed; the Central Limit Theorem at work, but not so good.]

35 Central Limit Theorem (CLT) XI The last 4 slides illustrate the result when the condition that none of the variances dominates is violated. First we scaled up the log-normal distribution by a factor of 10, and the resulting distribution of X1 + ... + X4 looks skewed to the right, very much like the dominating log-normal variation source in shape. In the second such example we instead scaled up the uniform distribution by a factor of 20. The resulting distribution of X1 + ... + X4 looks almost like a uniform distribution with somewhat smoothed shoulders. In both cases the distribution with the dominating variability imprints its character on the resulting distribution of X1 + ... + X4.

36 Central Limit Theorem (CLT) XII What would happen if instead we increased the spread in X1 by a factor of 10? The distribution of X1 + X2 + X3 + X4 would look approximately normal, like the dominating X1 distribution. The sum of independent normal random variables is normal, regardless of variances.

37 Distributions Derived from Normal Model Other than when working with randomization reference distributions, we generally assume normal distributions as the sources of our data. Thus it is worthwhile to characterize some sampling distributions that are derived from the normal distribution. The chi-square distribution, the Student t-distribution, and the F-distribution will play a significant role later on. These distributions come about as sampling distributions of certain test statistics based on normal random samples. Much of what is covered in this course could also be dealt with in the context of other sampling population models. We will not get into that.

38 Properties of Normal Random Variables Assume that X1,..., Xn are independent normal random variables with respective means and variances given by µ1,..., µn and σ1²,..., σn². Then Y = X1 + ... + Xn ~ N(µ1 + ... + µn, σ1² + ... + σn²); see Appendix A, slide 136. Here ~ means "distributed as". If X ~ N(µ, σ²) then a + bX ~ N(a + bµ, b²σ²) ⇒ (X − µ)/σ ~ N(0, 1) with b = 1/σ and a = −µ/σ. Caution: Some people write X ~ N(µ, σ) when others (and I) write X ~ N(µ, σ²). For example, in R we have dnorm(x,µ,σ), pnorm(x,µ,σ), qnorm(p,µ,σ), and rnorm(n,µ,σ), which respectively give the density, CDF, quantiles of N(µ, σ²) and random samples from N(µ, σ²).

39 The Chi-Square Distribution When Z1,..., Zf ~ i.i.d. N(0, 1) we say that C_f = Σ_{i=1}^f Zi² ~ χ²_f, the chi-square distribution with f degrees of freedom. Memorize this definition! The density of C_f is h(x) = (1/(2^{f/2} Γ(f/2))) x^{f/2 − 1} exp(−x/2) for x > 0 (not to memorize), with mean = f and variance = 2f, worth memorizing. Density, CDF, quantiles of, and random samples from the chi-square distribution can be obtained in R via: dchisq(x,f), pchisq(x,f), qchisq(p,f), rchisq(n,f). C_{f1} ~ χ²_{f1}, C_{f2} ~ χ²_{f2} independent ⇒ C_{f1} + C_{f2} ~ χ²_{f1+f2}, since Z1² + ... + Z_{f1}² + Z'1² + ... + Z'_{f2}² ~ χ²_{f1+f2} with Zi, Z'j i.i.d. N(0, 1).
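
A short simulation check of the definition against R's chisq functions (a minimal sketch; f = 5 is arbitrary):

  # C_f = sum of f squared i.i.d. N(0,1) variables
  set.seed(1)
  f <- 5; Nsim <- 100000
  C <- colSums(matrix(rnorm(f * Nsim)^2, nrow = f))
  mean(C); var(C)   # close to f = 5 and 2f = 10
  hist(C, breaks = 60, probability = TRUE)
  curve(dchisq(x, f), add = TRUE, col = "blue")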

40 χ² Densities [Figure: χ² densities for df = 1, 2, 5, 10, 20; note how mean and variability increase with df: µ = df, σ = √(2 df).]

41 The Noncentral χ²_f Distribution Suppose X1 ~ N(d1, 1),..., Xf ~ N(df, 1) are independent. Then we say that C = X1² + ... + Xf² ~ χ²_{f,λ} has a noncentral χ² distribution with f degrees of freedom and noncentrality parameter λ = Σ_{i=1}^f di². (memorize definition!) E(C) = f + λ and var(C) = 2f + 4λ. (d1,..., df) = (0,..., 0) ⇒ the previously defined (central) χ²_f. The distribution of C depends on (d1,..., df) only through λ = Σ_{i=1}^f di²! See the geometric explanation on the next slide. What does R give us? For the noncentral χ² we have: dchisq(x, df, ncp=0), pchisq(q, df, ncp=0), qchisq(p, df, ncp=0), rchisq(n, df, ncp=0).
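
A simulation sketch of the claim that the distribution depends on (d1,..., df) only through λ (the two mean vectors below are arbitrary choices with the same λ = 9):

  # two different mean vectors, same lambda = sum(d^2) = 9
  set.seed(1)
  f <- 4; Nsim <- 100000
  d1 <- c(3, 0, 0, 0); d2 <- c(1.5, 1.5, 1.5, 1.5)
  C1 <- colSums(matrix(rnorm(f * Nsim, mean = d1)^2, nrow = f))
  C2 <- colSums(matrix(rnorm(f * Nsim, mean = d2)^2, nrow = f))
  mean(C1); mean(C2)   # both close to f + lambda = 13
  var(C1); var(C2)     # both close to 2f + 4*lambda = 44
  qqplot(C1, rchisq(Nsim, f, ncp = 9)); abline(0, 1)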

42 Dependence on d1² + d2² = λ Only [Figure: contours of the standard bivariate normal density f(x1, x2) = (1/(2π)) exp(−(x1² + x2²)/2) centered at the origin, and of the shifted density f(x1, x2) = (1/(2π)) exp(−((x1 − d1)² + (x2 − d2)²)/2) centered at (d1, d2). The density volume over the blue circle of radius r gives the central χ² probability P(X1² + X2² ≤ r²) in the first case; in the second case this volume does not change as the density center rotates on the red circle around the origin, i.e., as long as d1² + d2² stays fixed.] This works because of the rotational invariance of the density and of the blue circle region. It easily extends to higher dimensions.

43 Noncentral χ² Densities [Figure: noncentral χ²_{5,λ} densities for λ = 0, 1, 5, 10, ...]

44 The Student t-Distribution When Z ~ N(0, 1) is independent of C_f ~ χ²_f we say that t = Z/√(C_f/f) ~ t_f (memorize this definition!) has a Student t-distribution with f degrees of freedom. For −∞ < x < ∞ its density is g_f(x) = (Γ((f + 1)/2)/(√(fπ) Γ(f/2))) [x²/f + 1]^{−(f+1)/2} (not to memorize). It has mean 0 (for f > 1) and variance f/(f − 2) if f > 2. g_f(x) → ϕ(x) (standard normal density) as f → ∞. This follows either directly from the density using (1 + x²/f)^{−f/2} → exp(−x²/2), or from the definition of t because C_f/f → 1, which follows from E(C_f/f) = 1 & var(C_f/f) = 2f/f² → 0. Density, CDF, quantiles of, and random samples from the Student t-distribution can be obtained in R via: dt(x,f), pt(x,f), qt(p,f), and rt(n,f) respectively.
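
A minimal simulation sketch of this definition (f = 5 is arbitrary):

  # build t = Z/sqrt(C_f/f) from its ingredients and compare with rt
  set.seed(1)
  f <- 5; Nsim <- 100000
  t.sim <- rnorm(Nsim) / sqrt(rchisq(Nsim, f) / f)
  qqplot(t.sim, rt(Nsim, f)); abline(0, 1)   # should hug the line
  var(t.sim)   # close to f/(f - 2) = 5/3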

45 Densities of the Student t-Distribution [Figure: Student t densities for df = 1, 2, 5, 10, 20, 30, ...]

46 The Noncentral Student t-Distribution When X ~ N(δ, 1) is independent of C_f ~ χ²_f we say that t = X/√(C_f/f) ~ t_{f,δ} (memorize this definition!) has a noncentral Student t-distribution with f degrees of freedom and noncentrality parameter ncp = δ. R gives us: dt(x,f,ncp), pt(q,f,ncp), qt(p,f,ncp), and rt(n,f,ncp) respectively. As before one can argue that this distribution converges to N(δ, 1) as f → ∞. Mean (for f > 1) and variance (for f > 2) of t are (don't memorize) E(t) = δ √(f/2) Γ((f − 1)/2)/Γ(f/2) & var(t) = f(1 + δ²)/(f − 2) − δ² (f/2) (Γ((f − 1)/2)/Γ(f/2))².

47 Densities of the Noncentral Student t-Distribution [Figure: noncentral Student t densities for df = 6 and ncp = 0, 1, 2, ...] These densities march to the left for negative ncp.

48 Densities of the Noncentral Student t-Distribution [Figure: noncentral Student t densities for ncp = 4 and df = 2, 5, 10, 20, ...]

49 The F-Distribution When C_{f1} ~ χ²_{f1} and C_{f2} ~ χ²_{f2} are independent χ² random variables with f1 and f2 degrees of freedom, we say that F = (C_{f1}/f1)/(C_{f2}/f2) ~ F_{f1,f2} (memorize this definition!) has an F distribution with f1 and f2 degrees of freedom. Its density is g(x) = Γ((f1 + f2)/2) (f1/f2)^{f1/2} x^{(f1/2) − 1} / (Γ(f1/2) Γ(f2/2) [(f1/f2) x + 1]^{(f1+f2)/2}) (not to memorize). Density, CDF, quantiles of, and random samples from the F_{f1,f2}-distribution can be obtained in R via: df(x,f1,f2), pf(x,f1,f2), qf(p,f1,f2), rf(n,f1,f2). t ~ t_f ⇒ t² ~ F_{1,f}. Why? Also, F ~ F_{f1,f2} ⇒ 1/F ~ F_{f2,f1}. Why?
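
Both "Why?" questions can be checked numerically (a minimal sketch; f = 7, f1 = 4, f2 = 9 are arbitrary). The reasons: t² = Z²/(C_f/f) is a ratio of independent χ²'s over their degrees of freedom (with Z² ~ χ²_1), and taking the reciprocal of F just swaps numerator and denominator:

  set.seed(1)
  Nsim <- 100000; f <- 7
  qqplot(rt(Nsim, f)^2, rf(Nsim, 1, f)); abline(0, 1)          # t^2 ~ F(1,f)
  f1 <- 4; f2 <- 9
  qqplot(1/rf(Nsim, f1, f2), rf(Nsim, f2, f1)); abline(0, 1)   # 1/F ~ F(f2,f1)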

50 F Densities [Figure: F densities for (df1, df2) = (1,3), (2,5), (5,5), (10,20), (20,20), (50, ...).]

51 The Noncentral F-Distribution Let C1 ~ χ²_{f1,λ} be a noncentral χ² random variable and let C2 ~ χ²_{f2} be a (central) χ² random variable which is independent of C1; then we say that F = (C1/f1)/(C2/f2) ~ F_{f1,f2,λ} (memorize definition!) has a noncentral F-distribution with f1 and f2 degrees of freedom and with noncentrality parameter λ. For the noncentral F, R gives us: df(x, df1, df2, ncp), pf(q, df1, df2, ncp), qf(p, df1, df2, ncp), rf(n, df1, df2, ncp).

52 Noncentral F Densities [Figure: noncentral F_{5,10,λ} densities for λ = 0, 1, 5, 10, ...]

53 Decomposition of the Sum of Squares (SS) We illustrate here an early example of the SS-decomposition: Σ_{i=1}^n Xi² = Σ_{i=1}^n (Xi − X̄ + X̄)² = Σ_{i=1}^n (Xi − X̄)² + 2X̄ Σ_{i=1}^n (Xi − X̄) + n X̄² = Σ_{i=1}^n (Xi − X̄)² + n X̄², since Σ(Xi − X̄) = Σ Xi − n Σ Xi/n = 0, i.e., the residuals sum to zero. Such decompositions are the intrinsic and recurring theme in the Analysis of Variance (ANOVA); more later.
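
A one-line numeric verification (a minimal sketch with arbitrary data):

  # sum of squares decomposition: sum(x^2) = sum((x - mean)^2) + n*mean^2
  x <- c(2.1, 3.5, 1.8, 4.2, 2.9)
  sum(x^2)
  sum((x - mean(x))^2) + length(x) * mean(x)^2   # same value
  sum(x - mean(x))   # residuals sum to (numerically) zero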

54 Sampling Distribution of X̄ and Σ_{i=1}^n (Xi − X̄)² When X1,..., Xn are a random sample from N(µ, σ²), the joint distribution of X̄ and Σ_{i=1}^n (Xi − X̄)² can be described as follows, as is shown in Appendix B, slide 142: X̄ and Σ_{i=1}^n (Xi − X̄)² are statistically independent; X̄ ~ N(µ, σ²/n) or (X̄ − µ)/(σ/√n) = √n(X̄ − µ)/σ ~ N(0, 1); Σ_{i=1}^n (Xi − X̄)²/σ² ~ χ²_{n−1} or Σ_{i=1}^n (Xi − X̄)² ~ σ² χ²_{n−1}. Only n − 1 of the terms Xi − X̄ can vary "independently," since Σ(Xi − X̄) = 0. The above independence seems perplexing, given that X̄ also appears within the expression Σ_{i=1}^n (Xi − X̄)². It is a peculiar property of normal distribution samples. This independence will not occur for other sampled distributions.

55 X̄ and S² Dependence (500 Samples of Size n = 10) (optional) [Figure: scatterplots of S² against X̄ for 500 samples of size n = 10 each from Normal(0,1), Uniform(0,1), Chi-Square(5), Weibull(2,5), Student t(5), and Poisson(5) populations.]

56 Comments (optional) It appears that for the symmetric sampled distributions (normal, uniform, and t) the least squares line, fitting S² as a linear function of X̄, is roughly horizontal, indicating zero correlation, but not necessarily independence. For uniform samples high values of S² seem to be associated with an X̄ near .5. For t-samples large S² values seem to indicate outliers, which also affect X̄. When the sampled distribution is skewed (Chi-Square, Weibull, Poisson) the least squares line shows a definite slope, i.e., we have correlation and definitely dependence.

57 X̄ and S² Dependence (500 Samples of Size n = 10) (optional) [Figure: the Beta(.5,.5) density on (0,1) and the scatterplot of S² against X̄ for 500 samples of size n = 10 from a Beta(.5,.5) population.]

58 Comments (optional) The sampled beta density over the interval (0, 1) is shown on the left. It gives higher density near 0 or 1 than in the middle. The (X̄, S²) scatter shows a definite wedge pattern of high values of S² being associated with more central values of X̄. Values of X̄ near 0 or 1 are associated with low values of S². These associations make sense. A value of X̄ near 1 can only come about when most observations are on the right side of .5, leading to a smaller S². We clearly see dependence in spite of zero correlation patterns.

59 One-Sample t-test Assume that X = (X1,..., Xn) ~ i.i.d. N(µ, σ²). We want to test the hypothesis H0: µ = µ0 against the alternatives H1: µ ≠ µ0. σ is left unspecified and is unknown. H0 is a composite hypothesis, not a single F. X̄ is a good indicator for µ since its mean is µ and its variance is var(X̄) = σ²/n. Thus a reasonable test statistic may be X̄ − µ0 ~ N(µ − µ0, σ²/n) = N(0, σ²/n) when H0 is true. Unfortunately we do not know σ. √n(X̄ − µ0)/σ = (X̄ − µ0)/(σ/√n) ~ N(0, 1) suggests replacing the unknown σ by a suitable estimate to get a single reference distribution under H0. From a previous slide: s² = Σ_{i=1}^n (Xi − X̄)²/(n − 1) ~ σ² C_{n−1}/(n − 1), and s² is independent of X̄. Note E(s²) = σ², i.e., s² is an unbiased estimator of σ².

60 One-Sample t-Statistic Replacing σ by s in √n(X̄ − µ0)/σ ⇒ the one-sample t-statistic t(X) = √n(X̄ − µ0)/s = [√n(X̄ − µ0)/σ]/(s/σ) = [√n(X̄ − µ0)/σ]/√(s²/σ²) = Z/√(C_{n−1}/(n − 1)) ~ t_{n−1}, since under H0 we have that Z = √n(X̄ − µ0)/σ ~ N(0, 1) and C_{n−1} ~ χ²_{n−1}, both independent of each other. We thus satisfy the definition of the t-distribution under H0. We can use t(X) with the single known reference distribution t_{n−1} under the composite hypothesis H0. The 2-sided level α test has critical value t_crit = t_{n−1,1−α/2} = qt(1−α/2, n−1) = t.crit. We reject H0 when |t(X)| ≥ t_crit. The 2-sided p-value for the observed t_obs(x) = t.obs is P(|t_{n−1}| ≥ |t_obs(x)|) = 2 P(t_{n−1} ≥ |t_obs(x)|) = 2 pt(−abs(t.obs), n − 1).
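
A minimal sketch (simulated data, µ0 = 0) computing the statistic and p-value by hand and checking against t.test:

  set.seed(1)
  x <- rnorm(20) + 0.4; mu0 <- 0; n <- length(x)
  t.obs <- sqrt(n) * (mean(x) - mu0) / sd(x)
  p.val <- 2 * pt(-abs(t.obs), n - 1)
  c(t.obs, p.val)
  t.test(x, mu = mu0)   # agrees with the hand computation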

61 The t.test in R R provides t.test, which performs 1- and 2-sample t-tests. See ?t.test for documentation. We focus here on the 1-sample test.

  > t.test(rnorm(20)+.4)
          One Sample t-test
  data: rnorm(20)
  t = , df = 19, p-value =
  alternative hypothesis: true mean is not equal to 0
  95 percent confidence interval:
  sample estimates: mean of x

62 Calculation of the Power Function of the Two-Sided t-test The power function of this two-sided t-test is given by β(µ, σ) = P(|t| ≥ t_crit) = P(t ≤ −t_crit) + 1 − P(t < t_crit), where t = t(X) = √n(X̄ − µ0)/s = [√n(X̄ − µ)/σ + √n(µ − µ0)/σ]/(s/σ) = (Z + δ)/√(C_{n−1}/(n − 1)) ~ t_{n−1,δ}, the noncentral t-distribution with δ = √n(µ − µ0)/σ = delta. Thus the power function depends on µ and σ only through δ: β(δ) = P(t_{n−1,δ} ≤ −t_crit) + 1 − P(t_{n−1,δ} < t_crit) = pt(−t.crit, n − 1, delta) + 1 − pt(t.crit, n − 1, delta). The power function also depends on n.
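
These two lines translate directly into a small R function (a sketch; the function name is mine):

  # power of the 2-sided one-sample t-test at noncentrality delta
  power.t2 <- function(delta, n, alpha = 0.05){
    t.crit <- qt(1 - alpha/2, n - 1)
    pt(-t.crit, n - 1, delta) + 1 - pt(t.crit, n - 1, delta)
  }
  power.t2(delta = 2.5, n = 10)   # roughly .6, as read off on slide 64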

63 Power Function of Two-Sided t-test [Figure: power function β(δ) for sample size n = 10, at α = 0.05 and α = 0.01, plotted against δ = √n(µ − µ0)/σ.]

64 How to Use the Power Function For the level α = .05 test in the previous plot we can read off β(δ) ≈ .6 for δ = ±2.5, i.e., µ0 − µ ≈ ±2.5 σ/√n. The smaller the natural variability σ, the smaller the difference µ0 − µ that we can detect with probability .6. Similarly, the larger the sample size n, the smaller the difference µ0 − µ we can detect with probability .6. Note however the square root effect in √n. Both of these conclusions are intuitive because σ(X̄) = σ/√n. Given a required detection difference µ − µ0 and with some upper bound knowledge σu ≥ σ, we can plan the appropriate minimum sample size n to achieve the desired power β = .6: √n ≈ 2.5 σ/|µ − µ0| ≤ 2.5 σu/|µ − µ0| ⇒ n.

65 Where is the Flaw in the Previous Argument? We tacitly assumed that the power curve plot would not change with n, i.e., we considered the effect of n only via δ = √n |µ − µ0|/σ on the plot abscissa. But both t_crit = qt(1 − α/2, n − 1) and P(|t_{n−1,δ}| ≥ t_crit) depend on n, as does δ = √n |µ − µ0|/σ. See the next 4 plots. Thus it does not suffice to consider the n in δ alone. However, typically the sample size requirements will ask for large values of n. In that case t_crit ≈ qnorm(1 − α/2) and t_{n−1,δ} ≈ N(δ, 1) stabilize (for fixed δ). Compare n = 100 and n = 1000 in the next few plots. For large n, most of the benefit from increasing n comes via increasing δ = √n |µ − µ0|/σ. We will provide a function that gets us out of this dilemma.

66 Power Function of Two-Sided t-test [Figure: power function β(δ) for sample size n = 3, at α = 0.05 and α = 0.01, plotted against δ = √n(µ − µ0)/σ.]

67 Power Function of Two-Sided t-test [Figure: power function β(δ) for sample size n = 30, at α = 0.05 and α = 0.01, plotted against δ = √n(µ − µ0)/σ.]

68 Power Function of Two-Sided t-test [Figure: power function β(δ) for sample size n = 100, at α = 0.05 and α = 0.01, plotted against δ = √n(µ − µ0)/σ.]

69 Power Function of Two-Sided t-test [Figure: power function β(δ) for sample size n = 1000, at α = 0.05 and α = 0.01, plotted against δ = √n(µ − µ0)/σ.]

70 Some Discussion of the Impact of n The previous slides showed that the impact of n on the power function curve shapes can be substantial for small n (see n = 3 → n = 30). However, for larger n the power functions only change their shape slightly, see n = 30 → n = 100 → n = 1000. In those (larger n) cases the main impact of n on power is through the noncentrality parameter δ = √n(µ − µ0)/σ.

71 Sample Size Function (2-sided)

  sample.size2 <- function(delta0=1, nrange=10:100, alpha=.05){
    power <- NULL
    for(n in nrange){
      tcrit <- qt(1-alpha/2, n-1)
      power <- c(power, 1 - pt(tcrit, n-1, sqrt(n)*delta0) +
                        pt(-tcrit, n-1, sqrt(n)*delta0))
    }
    plot(nrange, power, type="l", xlab="sample size n")
    abline(h=seq(.01,.99,.01), col="grey")
    abline(v=nrange, col="grey")
    title(substitute((mu-mu[0])/sigma[u]==delta0,
                     list(delta0=delta0)))
  }

Here delta0 = (µ − µ0)/σu is the point at which we want to achieve a given power. σu is a known (conservative) upper bound on σ. nrange = vector of sample sizes at which the power function is computed for delta0 and significance level alpha.

72 Power of Two-Sided t-test for Various n [Figure: power at (µ − µ0)/σu = 0.5, α = 0.05, plotted against sample size n.] Want power .9 at µ − µ0 = .5 σu.

73 Power of Two-Sided t-test for Various n (refined view) [Figure: the same power curve on a finer grid.] n = 44 will do for power .9 at µ − µ0 = .5 σu.

74 The Blessings of (µ − µ0)/σ The power depends on µ − µ0 only in relation to σ units. This is a sensible benefit. If σ is large, only very large differences between µ and µ0 would matter. Smaller differences µ − µ0 would get swamped or appear irrelevant when compared with the population variation, i.e., σ.

75 One-Sided One-Sample Testing Problem Again assume X = (X1,..., Xn) ~ i.i.d. N(µ, σ²). In some situations we may want to test H0: µ = µ0 against H1: µ > µ0. More broadly, we may want to test H'0: µ ≤ µ0 against H1: µ > µ0. Large values of X̄ − µ0 speak for H1 and against H0 or H'0. Large negative values of X̄ − µ0 may also speak against H0 but not against H'0. Be very clear as to the hypotheses being tested, and whether it should be a one-sided or two-sided alternative.

76 One-Sided One-Sample t-test Based on the discussion for the 2-sided testing problem it is natural to consider as our test statistic t = t(X) = (X̄ − µ0)/(s/√n). Reject H0 in favor of H1 when t is too large. What is the reference distribution of t under H'0: µ ≤ µ0? Recall that t ~ t_{n−1,δ}, the noncentral t-distribution with noncentrality parameter δ = √n(µ − µ0)/σ. Note that µ ≤ µ0 ⇔ δ = √n(µ − µ0)/σ ≤ 0. Thus we have many different reference distributions under H'0: δ ≤ 0.

77 Power Function of the One-Sided One-Sample t-test Reject H0 when t ≥ t_crit for some chosen critical value t_crit. Then the power function again depends only on δ and n and is given by β(δ) = P(t_{n−1,δ} ≥ t_crit) = 1 − pt(t_crit, n − 1, δ). β(δ) is strictly increasing in δ since P((Z + δ)/√(C_{n−1}/(n − 1)) ≥ t_crit | C_{n−1}) = P(Z ≥ t_crit √(C_{n−1}/(n − 1)) − δ | C_{n−1}) ↑ as δ ↑. Thus max_{δ ≤ 0} β(δ) = β(0). Thus choose t_crit = t.crit such that β(0) = α, i.e., α = β(0) = P(t_{n−1,0} ≥ t_crit) = P(t_{n−1} ≥ t_crit), or t.crit = qt(1 − α, n − 1).
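
As with the two-sided case this is a one-liner in R (a sketch; the function name is mine):

  # power of the 1-sided one-sample t-test at noncentrality delta
  power.t1 <- function(delta, n, alpha = 0.05){
    1 - pt(qt(1 - alpha, n - 1), n - 1, delta)
  }
  power.t1(delta = 0, n = 10)   # equals alpha = 0.05 at delta = 0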

78 Using t.test

  > t.test(rnorm(20)+.4, alternative="greater")
          One Sample t-test
  data: rnorm(20)
  t = , df = 19, p-value =
  alternative hypothesis: true mean is greater than 0
  95 percent confidence interval:
       Inf
  sample estimates: mean of x

79 Power Function of One-Sided t-test [Figure: power function β(δ) at α = 0.05 and α = 0.01, sample size n = ..., plotted against δ = √n(µ − µ0)/σ.]

80 Power Function of One-Sided t-test [Figure: power function β(δ) at α = 0.05 and α = 0.01, sample size n = ..., plotted against δ = √n(µ − µ0)/σ.]

81 Power Function of One-Sided t-test [Figure: power function β(δ) at α = 0.05 and α = 0.01, sample size n = ..., plotted against δ = √n(µ − µ0)/σ.]

82 Power Function of One-Sided t-test [Figure: power function β(δ) at α = 0.05 and α = 0.01, sample size n = ..., plotted against δ = √n(µ − µ0)/σ.]

83 Sample Size Function (1-sided)

  sample.size1 <- function(delta0=1, nrange=10:100, alpha=.05){
    power <- NULL
    for(n in nrange){
      tcrit <- qt(1-alpha, n-1)
      power <- c(power, 1 - pt(tcrit, n-1, sqrt(n)*delta0))
    }
    plot(nrange, power, type="l", xlab="sample size n")
    abline(h=seq(.01,.99,.01), col="grey")
    abline(v=nrange, col="grey")
    title(substitute((mu-mu[0])/sigma[u]==delta0~","~alpha==alpha0,
                     list(delta0=delta0, alpha0=alpha)))
    lines(nrange, power, col="red")
  }

Here delta0 = (µ − µ0)/σu is the point at which we want to achieve a given power. σu is a known (or assumed conservative) upper bound on σ. nrange gives a vector of sample sizes at which the power is computed for delta0 and significance level alpha.

84 Power of One-Sided t-test for Various n [Figure: power at (µ − µ0)/σu = 0.5, α = 0.05, plotted against sample size n.] Want power .9 at µ − µ0 = .5 σu.

85 Power of One-Sided t-test for Various n (refined view) [Figure: the same power curve on a finer grid.] n = 36 will do for power .9 at µ − µ0 = .5 σu.

86 Hypothesis Tests & Confidence Intervals Testing H0: µ = µ0, we accept H0 with the two-sided t-test whenever |X̄ − µ0|/(s/√n) < t_{n−1,1−α/2} ⇔ µ0 ∈ X̄ ± t_{n−1,1−α/2} s/√n. Thus the interval X̄ ± t_{n−1,1−α/2} s/√n consists of all acceptable µ0, i.e., all µ0 for which one would accept H0: µ = µ0 at level α. Under H0 our acceptance probability is 1 − α, thus P_{µ0}(X̄ − t_{n−1,1−α/2} s/√n < µ0 < X̄ + t_{n−1,1−α/2} s/√n) = 1 − α. Here the subscript µ0 on P indicates the assumed true value of the mean µ. Since this holds for any value µ0, we may as well drop the subscript 0 on µ0 and write P_µ(X̄ − t_{n−1,1−α/2} s/√n < µ < X̄ + t_{n−1,1−α/2} s/√n) = 1 − α. < or ≤ is immaterial. The case = has probability zero!
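
A minimal sketch (simulated data) computing this interval from the formula and checking against t.test:

  set.seed(1)
  x <- rnorm(20) + 0.4; n <- length(x); alpha <- 0.05
  half <- qt(1 - alpha/2, n - 1) * sd(x) / sqrt(n)
  c(mean(x) - half, mean(x) + half)   # 95% confidence interval for mu
  t.test(x)$conf.int                  # same interval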

87 The Nature of Confidence Intervals The interval [X̄ − t_{n−1,1−α/2} s/√n, X̄ + t_{n−1,1−α/2} s/√n] is a 100(1 − α)% confidence interval for the unknown µ. It is a random interval with probability 1 − α of covering µ. This is not a statement about µ being random due to being unknown or uncertain. Not a Bayesian view! µ does not fall into that interval with probability 1 − α; µ is fixed but unknown. Without knowing µ we will not know whether the interval covers µ or not. "Statistics means never having to say you're certain." (Myles Hollander) ⇐ "Love means never having to say you're sorry," Love Story by Erich Segal.

88 50 Confidence Intervals [Figure: 95% confidence intervals from samples 1,..., 50 of size n = 10 each, true mean = 0.] Expected missed coverages = 50 × .05 = 2.5.

89 50 Confidence Intervals [Figure: another set of 95% confidence intervals from samples 1,..., 50 of size n = 10 each, true mean = 0.]

90 50 Confidence Intervals [Figure: a further set of 95% confidence intervals from samples 1,..., 50 of size n = 10 each, true mean = 0.]

91 Using Confidence Intervals to Test Hypotheses Confidence intervals provide a more informative way of estimating parameters, as opposed to just stating the interval midpoint X̄ as point estimate for µ. They express some uncertainty range about µ. They can also be used to directly test hypotheses. No surprise: confidence intervals were introduced via acceptable hypotheses. Thus we reject H0: µ = µ0 at level α whenever the 100(1 − α)% confidence interval does not cover µ0. If the coverage probability is at least 1 − α, then the probability of missing the true target (rejecting falsely) is at most α.

92 What about One-Sided Hypotheses? Testing H'0: µ ≤ µ0 against H1: µ > µ0 we reject when √n(X̄ − µ0)/s ≥ t_{n−1,1−α}, or accept whenever √n(X̄ − µ0)/s < t_{n−1,1−α} ⇔ µ0 > X̄ − t_{n−1,1−α} s/√n. X̄ − t_{n−1,1−α} s/√n is a 100(1 − α)% lower confidence bound for µ. It is an open ended interval (X̄ − t_{n−1,1−α} s/√n, ∞). Reject H'0 whenever this interval shows no overlap with the interval (−∞, µ0] as given by H'0: µ ≤ µ0, or no overlap with µ0 when testing H0: µ = µ0. Testing H''0: µ ≥ µ0 vs. H1: µ < µ0, reject when √n(X̄ − µ0)/s ≤ t_{n−1,α}, or when the corresponding 100(1 − α)% upper confidence bound interval (−∞, X̄ − t_{n−1,α} s/√n) does not contain µ0.

93 Return to Two-Sample Tests Here we revisit the 2-sample problem, this time from a population perspective. Assume X1,..., Xm ~ i.i.d. N(µX, σ²) is independent of Y1,..., Yn ~ i.i.d. N(µY, σ²). Note the assumptions of normality and equal variances. The independence assumption is more or less a judgment issue. Test H0: µX = µY vs. H1: µX ≠ µY, with σ unknown. Clearly, a good indicator of whether H0: µY − µX = 0 is true or not is Ȳ − X̄ ~ N(µY − µX, σ²/n + σ²/m), or under H0, (Ȳ − X̄)/(σ √(1/m + 1/n)) ~ N(0, 1). Under H0 the sampling distribution of Ȳ − X̄ ~ N(0, σ²/n + σ²/m) is unknown because of the unknown σ.

94 Estimating σ² We have two unbiased estimates of σ²: sX² = (1/(m−1)) Σ_{i=1}^m (Xi − X̄)² and sY² = (1/(n−1)) Σ_{j=1}^n (Yj − Ȳ)². How to combine or pool them? (sX² + sY²)/2? Or any other λ sX² + (1 − λ) sY²? sX² ~ σ² χ²_{m−1}/(m−1) ⇒ var(sX²) = σ⁴ 2(m−1)/(m−1)² = 2σ⁴/(m−1), and var(sY²) = 2σ⁴/(n−1). Independence of sX² and sY² gives us var(λ sX² + (1 − λ) sY²) = λ² 2σ⁴/(m−1) + (1 − λ)² 2σ⁴/(n−1), a quadratic in λ with minimum at λ = (m−1)/(m+n−2) ⇒ s² = ((m−1) sX² + (n−1) sY²)/(m+n−2) = (Σ_{i=1}^m (Xi − X̄)² + Σ_{j=1}^n (Yj − Ȳ)²)/(m+n−2) as best pooled variance estimate for σ².

95 s² is Independent of Ȳ − X̄ X1,..., Xm i.i.d. N(µX, σ²) ⇒ X̄ and sX² are independent. Y1,..., Yn i.i.d. N(µY, σ²) ⇒ Ȳ and sY² are independent. X1,..., Xm, Y1,..., Yn independent ⇒ X̄, sX², Ȳ, sY² are all independent ⇒ Ȳ − X̄ and s² = ((m−1) sX² + (n−1) sY²)/(m+n−2) are independent. (m+n−2) s²/σ² ~ χ²_{m+n−2}, since (m−1) sX²/σ² ~ χ²_{m−1} and (n−1) sY²/σ² ~ χ²_{n−1}; see slide 39.

96 Two-Sample t-Statistic The 2-sample t-statistic is t(X, Y) = (Ȳ − X̄)/(s √(1/m + 1/n)) = [(Ȳ − X̄)/(σ √(1/m + 1/n))]/(s/σ) = Z/√(C_{m+n−2}/(m+n−2)) ~ t_{m+n−2}. It has a known null distribution under H0. Reject H0: µY − µX = 0 at significance level α when |t(X, Y)| ≥ t_{m+n−2,1−α/2} = t_crit. The 2-sided p-value of the observed t(x, y) is P(|t_{m+n−2}| ≥ |t(x, y)|) = 2 (1 − pt(abs(t(x, y)), m + n − 2)).
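
A sketch (simulated data) of the pooled computation, checked against t.test with var.equal = TRUE:

  set.seed(1)
  x <- rnorm(9); y <- rnorm(9) + 1
  m <- length(x); n <- length(y)
  s2 <- ((m-1)*var(x) + (n-1)*var(y)) / (m + n - 2)   # pooled variance
  t.obs <- (mean(y) - mean(x)) / sqrt(s2 * (1/m + 1/n))
  c(t.obs, 2 * (1 - pt(abs(t.obs), m + n - 2)))
  t.test(y, x, var.equal = TRUE)   # same t and p-value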

97 Other Hypotheses We may also want to test H∆: µY − µX = ∆ for a specified value ∆ ≠ 0. The natural change is to subtract ∆ from Ȳ − X̄ and note that t∆(X, Y) = (Ȳ − X̄ − ∆)/(s √(1/m + 1/n)) = [(Ȳ − X̄ − ∆)/(σ √(1/m + 1/n))]/(s/σ) ~ t_{m+n−2} when H∆ is true. We reject H∆ at significance level α whenever |t∆(X, Y)| ≥ t_{m+n−2,1−α/2}. As in the case of the 1-sample test we can derive (exercise) a 100(1 − α)% confidence interval for ∆ as Ȳ − X̄ ± t_{m+n−2,1−α/2} s √(1/m + 1/n).

98 Two-Sample t.test in R

  > t.test(flux$sir[flux$flux=="Y"],
           flux$sir[flux$flux=="X"], var.equal=T)
          Two Sample t-test
  data: flux$sir[flux$flux == "Y"] and flux$sir[flux$flux == "X"]
  t = , df = 16, p-value =
  alternative hypothesis: true difference in means is not equal to 0
  95 percent confidence interval:
  sample estimates: mean of x mean of y

99 Power Function of the Two-Sample t-test The power function of the 2-sided 2-sample t-test is given by β(µX, µY, σ) = P(|t(X, Y)| ≥ t_crit) = P(t(X, Y) ≤ −t_crit) + 1 − P(t(X, Y) < t_crit), where t(X, Y) = (Ȳ − X̄)/(s √(1/m + 1/n)) = [(Ȳ − X̄ − (µY − µX))/(σ √(1/m + 1/n)) + (µY − µX)/(σ √(1/m + 1/n))]/(s/σ) = (Z + δ)/√(C_{m+n−2}/(m+n−2)) ~ t_{m+n−2,δ} with noncentrality parameter δ = (µY − µX)/(σ √(1/m + 1/n)). Thus β depends on µX, µY, σ only through δ and we have β(δ) = pt(−t_crit, m+n−2, δ) + 1 − pt(t_crit, m+n−2, δ).

100 Allocation of Sample Sizes We previously saw that such power functions are increasing in δ. If we can allocate m and n subject to the restriction that m + n = N is fixed at some even number, we maximize δ for fixed |µY − µX|/σ by minimizing 1/m + 1/n = 1/m + 1/(N − m) over m. Calculus or algebra ⇒ m = n = N/2 gives us the minimum and thus the sample size allocation with highest power potential against any fixed |µY − µX|/σ.

101 Sample Size Planning Tools It is a simple matter to adapt the previous sample size functions sample.size2 and sample.size1 for the 1-sample t-test to the 2-sample situations, i.e., for 2-sided tests and 1-sided tests. Since t_crit = t_{m+n−2,1−α/2} and t_{m+n−2,δ} (aside from δ) depend on m and n only through N = m + n, and since the previous slide made the case for m = n = N/2, we just need to express the corresponding sample size function in terms of N and use δ = (µY − µX)/(σ √(2/N + 2/N)) = √(N/4) (µY − µX)/σ. Of course, we would then also need to take m = n. See Homework!

102 Reflection on Treatments of Two-Sample Problem Randomization test: No population assumptions. Under H0: no flux difference, the SIR results for the 18 boards would be the same under all flux assignments. The flux assignments are then irrelevant! The test is based on the random assignment of fluxes, giving us the randomization reference distribution for calculation of p-values. Using t(X, Y), a good approximation to the null distribution often is t_{m+n−2}. Generalization to other boards only by judgment/assumptions. Normal 2-sample test: Assumes 2 independent samples from normal populations with common variance and possibly different means. We generalize upfront. The t-test makes inferences concerning these two (conceptual) populations.

103 Check σX² = σY² Test the hypothesis H0: σX²/σY² = 1 vs. the alternative H1: σX²/σY² ≠ 1. A good indicator for σX²/σY² is the ratio of sample variances F = sX²/sY². Note that F = sX²/sY² = [(sX²/σX²)/(sY²/σY²)] (σX²/σY²) ~ F_{m−1,n−1} (σX²/σY²) ⇒ F ~ F_{m−1,n−1} under H0. Reject H0 when sX²/sY² ≤ F_{m−1,n−1,α/2} or ≥ F_{m−1,n−1,1−α/2}. We denote by F_{m−1,n−1,p} the p-quantile of F_{m−1,n−1}. Unfortunately this test is very sensitive to deviations from normality (we will return to this later).

104 F-Distribution & Critical Values for α = .05 [Figure: the F density with degrees of freedom f1 = 8, f2 = 11, with its α/2 = .025 critical values marked, and the density of the reciprocal F ratio with degrees of freedom f1 = 11, f2 = 8.]

105 Confidence Interval for σX²/σY² From (sX²/σX²)/(sY²/σY²) = (sX²/sY²)/(σX²/σY²) ~ F_{m−1,n−1} we get 1 − α = P(F_{m−1,n−1,α/2} ≤ (sX²/sY²)/(σX²/σY²) ≤ F_{m−1,n−1,1−α/2}) = P((sX²/sY²)/F_{m−1,n−1,1−α/2} ≤ σX²/σY² ≤ (sX²/sY²)/F_{m−1,n−1,α/2}), i.e., [(sX²/sY²)/F_{m−1,n−1,1−α/2}, (sX²/sY²)/F_{m−1,n−1,α/2}] is a 100(1 − α)% confidence interval for σX²/σY².
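
In R this test and interval are available via var.test; a sketch with simulated data, with the manual computation mirroring the formula above:

  set.seed(1)
  x <- rnorm(10, sd = 1); y <- rnorm(12, sd = 2)
  m <- length(x); n <- length(y); alpha <- 0.05
  Fratio <- var(x)/var(y)
  c(Fratio/qf(1 - alpha/2, m - 1, n - 1),
    Fratio/qf(alpha/2, m - 1, n - 1))   # manual 95% CI for sigx^2/sigy^2
  var.test(x, y)   # same F statistic and confidence interval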

106 t*-Test when σX² ≠ σY²? When σX² ≠ σY² we could emulate (Ȳ − X̄)/√(σX²/m + σY²/n) ~ N(0, 1) by using t*(X, Y) = (Ȳ − X̄)/√(sX²/m + sY²/n) as test statistic. But what is its reference distribution under H0: µX = µY? It is unknown. Known as the Behrens-Fisher problem. Approximate the distribution of sX²/m + sY²/n by that of a χ²_f/f, where a and f are chosen to match mean and variance of approximand and approximation: E(a χ²_f/f) = a and var(a χ²_f/f) = 2a²/f; E(sX²/m + sY²/n) = σX²/m + σY²/n = a and var(sX²/m + sY²/n) = 2σX⁴/(m²(m−1)) + 2σY⁴/(n²(n−1)) = 2a²/f.

107 The Satterthwaite Approximation Solving these moment equations ⇒ f = a²/(σX⁴/(m²(m−1)) + σY⁴/(n²(n−1))) = (σX²/m + σY²/n)²/((σX²/m)²/(m−1) + (σY²/n)²/(n−1)). Replace the unknown σX² and σY² by sX² and sY² and denote the resulting f by f̂, i.e., f̂ = (sX²/m + sY²/n)²/((sX²/m)²/(m−1) + (sY²/n)²/(n−1)), and approximate t*(X, Y) = [(Ȳ − X̄)/√(σX²/m + σY²/n)] / √((sX²/m + sY²/n)/(σX²/m + σY²/n)) by Z/√(χ²_f̂/f̂) ~ t_f̂. Reject H0: µX = µY when |t*(X, Y)| is too large and compute p-values from the t_f̂ distribution.
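
A sketch (simulated data) computing f̂ by hand; R's t.test uses this Welch-Satterthwaite approximation by default (var.equal = FALSE):

  set.seed(1)
  x <- rnorm(10, sd = 1); y <- rnorm(5, sd = 2)
  m <- length(x); n <- length(y)
  v <- var(x)/m + var(y)/n
  f.hat <- v^2 / ((var(x)/m)^2/(m - 1) + (var(y)/n)^2/(n - 1))
  t.star <- (mean(y) - mean(x)) / sqrt(v)
  c(t.star, f.hat)
  t.test(y, x)   # reports this t and df = f.hat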

108 Which Test to Use: t or t*? When m = n one easily sees that t(X, Y) = t*(X, Y), but their null distributions are only the same when sX² = sY², in which case f̂ = 2(m − 1). However, sX² = sY² is an unlikely occurrence. To show this only takes some algebra. What if m = n but σX ≠ σY and we use t(X, Y) anyway? What if m ≠ n and σX ≠ σY and we use t(X, Y) anyway? How is the probability of type I error affected? Such questions can easily be examined using simulation in R. When m = n or σX ≈ σY it seems that using t(X, Y) is relatively safe; otherwise use t*(X, Y) (see following slides).

109 Simulating the Null-Distribution of t(X, Y) Recall t(X, Y) = (Ȳ − X̄)/(s √(1/n + 1/m)) with s² = ((m−1) sX² + (n−1) sY²)/(m+n−2), and under H0: µX = µY with independent Ȳ − X̄ ~ N(0, σY²/n + σX²/m), (m−1) sX² ~ σX² χ²_{m−1} and (n−1) sY² ~ σY² χ²_{n−1} ⇒ the first 3 command lines in the R function t.sig.diff, i.e.:

  Dbar <- rnorm(Nsim, 0, sqrt(sigy^2/n + sigx^2/m))
  s2 <- (rchisq(Nsim, m-1)*sigx^2 + rchisq(Nsim, n-1)*sigy^2)/(m+n-2)
  t.stat <- Dbar/sqrt(s2*(1/m+1/n))

110 R Function Examining P(Type I Error) for the t(X, Y)-Test

  t.sig.diff <- function(m=10, n=5, sigx=1, sigy=1, Nsim=10000){
    Dbar <- rnorm(Nsim, 0, sqrt(sigx^2/m + sigy^2/n))
    s2 <- (rchisq(Nsim, m-1)*sigx^2 +
           rchisq(Nsim, n-1)*sigy^2)/(m+n-2)
    t.stat <- Dbar/sqrt(s2*(1/m+1/n))
    x <- seq(-5, 5, .01)
    y <- dt(x, m+n-2)
    hist(t.stat, nclass=101, probability=T, main="",
         xlab="conventional 2-sample t-statistic")
    title(substitute(sigma[x]==sigx~", "~sigma[y]==sigy~", "~
                     n[x]==nx~", "~n[y]==ny,
                     list(sigx=sigx, sigy=sigy, nx=m, ny=n)))
    lines(x, y, col="blue")
  }

111 Inflated P(Type I Error) [Figure: histogram of the conventional 2-sample t-statistic for σX = 1, σY = 2, nX = 10, nY = 5, with the t_{m+n−2} density superimposed.]

112 Deflated P(Type I Error) [Figure: histogram of the conventional 2-sample t-statistic for σX = 2, σY = 1, nX = 10, nY = 5, with the t_{m+n−2} density superimposed.]

113 Heuristic Explanation of Opposite Effects Recall that s² = ((nX − 1) sX² + (nY − 1) sY²)/(nX + nY − 2) = (9/13) sX² + (4/13) sY² here. Thus when σX = 1 and σY = 2 we have E(s²) = (9/13) σX² + (4/13) σY² = 9/13 + 16/13 = 25/13 ≈ 1.92, and when σX = 2 and σY = 1 we have E(s²) = (9/13) σX² + (4/13) σY² = 36/13 + 4/13 = 40/13 ≈ 3.08, while for σX = σY = 1.5 we have (correct t-distribution) E(s²) = (9/13) σX² + (4/13) σY² = (9/13 + 4/13) · 2.25 = 2.25. s² tends to be too small when σX = 1, σY = 2 and too large when σX = 2, σY = 1. This leads to inflated (deflated) values of t(X, Y).

114 P(Type I Error) Hardly Affected [Figure: histogram of the conventional 2-sample t-statistic for σX = 20, σY = 1, nX = 10, nY = 10, with the t density superimposed.]

115 P(Type I Error) Mildly Affected [Figure: histogram of the conventional 2-sample t-statistic for σX = 1.2, σY = 1, nX = 10, nY = 5, with the t density superimposed.]

116 P(Type I Error) Mildly Affected [Figure: histogram of the conventional 2-sample t-statistic for σX = 1, σY = 1.2, nX = 10, nY = 5, with the t density superimposed.]

117 Checking Normality of a Sample The p-quantile of N(µ, σ²) is x_p = µ + σ z_p, z_p = Φ⁻¹(p). Sort the sample X1,..., Xn in increasing order X_(1) ≤ ... ≤ X_(n), assigning fractional ranks p_i ∈ (0, 1) to these order statistics in one of several ways, for i = 1,..., n: p_i = (i − .5)/n or p_i = i/(n + 1) or p_i = (i − .375)/(n + .25). Plot X_(i) against z_{p_i} = qnorm(p_i) for i = 1,..., n. We would expect X_(i) ≈ x_{p_i} = µ + σ z_{p_i}, i.e., X_(i) should look linear against z_{p_i} with intercept µ and slope σ. Judging approximate linearity takes practice. R uses the third choice for p_i for a given sample vector x. qqline(x) fits a line to the middle half of the data.
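
A minimal sketch of this check with simulated data:

  # normal quantile-quantile plot by hand and via qqnorm/qqline
  set.seed(1)
  x <- rnorm(30, mean = 5, sd = 2)
  n <- length(x)
  p <- (1:n - 0.375)/(n + 0.25)   # third choice of fractional ranks
  plot(qnorm(p), sort(x))          # should look roughly linear
  qqnorm(x); qqline(x)             # R's built-in version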


More information

Stat 427/527: Advanced Data Analysis I

Stat 427/527: Advanced Data Analysis I Stat 427/527: Advanced Data Analysis I Review of Chapters 1-4 Sep, 2017 1 / 18 Concepts you need to know/interpret Numerical summaries: measures of center (mean, median, mode) measures of spread (sample

More information

Lecture 3: Inference in SLR

Lecture 3: Inference in SLR Lecture 3: Inference in SLR STAT 51 Spring 011 Background Reading KNNL:.1.6 3-1 Topic Overview This topic will cover: Review of hypothesis testing Inference about 1 Inference about 0 Confidence Intervals

More information

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis.

401 Review. 6. Power analysis for one/two-sample hypothesis tests and for correlation analysis. 401 Review Major topics of the course 1. Univariate analysis 2. Bivariate analysis 3. Simple linear regression 4. Linear algebra 5. Multiple regression analysis Major analysis methods 1. Graphical analysis

More information

Practice Problems Section Problems

Practice Problems Section Problems Practice Problems Section 4-4-3 4-4 4-5 4-6 4-7 4-8 4-10 Supplemental Problems 4-1 to 4-9 4-13, 14, 15, 17, 19, 0 4-3, 34, 36, 38 4-47, 49, 5, 54, 55 4-59, 60, 63 4-66, 68, 69, 70, 74 4-79, 81, 84 4-85,

More information

Notes on the Multivariate Normal and Related Topics

Notes on the Multivariate Normal and Related Topics Version: July 10, 2013 Notes on the Multivariate Normal and Related Topics Let me refresh your memory about the distinctions between population and sample; parameters and statistics; population distributions

More information

y ˆ i = ˆ " T u i ( i th fitted value or i th fit)

y ˆ i = ˆ  T u i ( i th fitted value or i th fit) 1 2 INFERENCE FOR MULTIPLE LINEAR REGRESSION Recall Terminology: p predictors x 1, x 2,, x p Some might be indicator variables for categorical variables) k-1 non-constant terms u 1, u 2,, u k-1 Each u

More information

Course information: Instructor: Tim Hanson, Leconte 219C, phone Office hours: Tuesday/Thursday 11-12, Wednesday 10-12, and by appointment.

Course information: Instructor: Tim Hanson, Leconte 219C, phone Office hours: Tuesday/Thursday 11-12, Wednesday 10-12, and by appointment. Course information: Instructor: Tim Hanson, Leconte 219C, phone 777-3859. Office hours: Tuesday/Thursday 11-12, Wednesday 10-12, and by appointment. Text: Applied Linear Statistical Models (5th Edition),

More information

Math Review Sheet, Fall 2008

Math Review Sheet, Fall 2008 1 Descriptive Statistics Math 3070-5 Review Sheet, Fall 2008 First we need to know about the relationship among Population Samples Objects The distribution of the population can be given in one of the

More information

STAT 135 Lab 7 Distributions derived from the normal distribution, and comparing independent samples.

STAT 135 Lab 7 Distributions derived from the normal distribution, and comparing independent samples. STAT 135 Lab 7 Distributions derived from the normal distribution, and comparing independent samples. Rebecca Barter March 16, 2015 The χ 2 distribution The χ 2 distribution We have seen several instances

More information

Statistics - Lecture One. Outline. Charlotte Wickham 1. Basic ideas about estimation

Statistics - Lecture One. Outline. Charlotte Wickham  1. Basic ideas about estimation Statistics - Lecture One Charlotte Wickham wickham@stat.berkeley.edu http://www.stat.berkeley.edu/~wickham/ Outline 1. Basic ideas about estimation 2. Method of Moments 3. Maximum Likelihood 4. Confidence

More information

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3 Hypothesis Testing CB: chapter 8; section 0.3 Hypothesis: statement about an unknown population parameter Examples: The average age of males in Sweden is 7. (statement about population mean) The lowest

More information

INTERVAL ESTIMATION AND HYPOTHESES TESTING

INTERVAL ESTIMATION AND HYPOTHESES TESTING INTERVAL ESTIMATION AND HYPOTHESES TESTING 1. IDEA An interval rather than a point estimate is often of interest. Confidence intervals are thus important in empirical work. To construct interval estimates,

More information

Probability Theory and Statistics. Peter Jochumzen

Probability Theory and Statistics. Peter Jochumzen Probability Theory and Statistics Peter Jochumzen April 18, 2016 Contents 1 Probability Theory And Statistics 3 1.1 Experiment, Outcome and Event................................ 3 1.2 Probability............................................

More information

STAT FINAL EXAM

STAT FINAL EXAM STAT101 2013 FINAL EXAM This exam is 2 hours long. It is closed book but you can use an A-4 size cheat sheet. There are 10 questions. Questions are not of equal weight. You may need a calculator for some

More information

STAT 135 Lab 6 Duality of Hypothesis Testing and Confidence Intervals, GLRT, Pearson χ 2 Tests and Q-Q plots. March 8, 2015

STAT 135 Lab 6 Duality of Hypothesis Testing and Confidence Intervals, GLRT, Pearson χ 2 Tests and Q-Q plots. March 8, 2015 STAT 135 Lab 6 Duality of Hypothesis Testing and Confidence Intervals, GLRT, Pearson χ 2 Tests and Q-Q plots March 8, 2015 The duality between CI and hypothesis testing The duality between CI and hypothesis

More information

Confidence Intervals, Testing and ANOVA Summary

Confidence Intervals, Testing and ANOVA Summary Confidence Intervals, Testing and ANOVA Summary 1 One Sample Tests 1.1 One Sample z test: Mean (σ known) Let X 1,, X n a r.s. from N(µ, σ) or n > 30. Let The test statistic is H 0 : µ = µ 0. z = x µ 0

More information

PHP2510: Principles of Biostatistics & Data Analysis. Lecture X: Hypothesis testing. PHP 2510 Lec 10: Hypothesis testing 1

PHP2510: Principles of Biostatistics & Data Analysis. Lecture X: Hypothesis testing. PHP 2510 Lec 10: Hypothesis testing 1 PHP2510: Principles of Biostatistics & Data Analysis Lecture X: Hypothesis testing PHP 2510 Lec 10: Hypothesis testing 1 In previous lectures we have encountered problems of estimating an unknown population

More information

Null Hypothesis Significance Testing p-values, significance level, power, t-tests Spring 2017

Null Hypothesis Significance Testing p-values, significance level, power, t-tests Spring 2017 Null Hypothesis Significance Testing p-values, significance level, power, t-tests 18.05 Spring 2017 Understand this figure f(x H 0 ) x reject H 0 don t reject H 0 reject H 0 x = test statistic f (x H 0

More information

Parameter estimation! and! forecasting! Cristiano Porciani! AIfA, Uni-Bonn!

Parameter estimation! and! forecasting! Cristiano Porciani! AIfA, Uni-Bonn! Parameter estimation! and! forecasting! Cristiano Porciani! AIfA, Uni-Bonn! Questions?! C. Porciani! Estimation & forecasting! 2! Cosmological parameters! A branch of modern cosmological research focuses

More information

STAT 135 Lab 5 Bootstrapping and Hypothesis Testing

STAT 135 Lab 5 Bootstrapping and Hypothesis Testing STAT 135 Lab 5 Bootstrapping and Hypothesis Testing Rebecca Barter March 2, 2015 The Bootstrap Bootstrap Suppose that we are interested in estimating a parameter θ from some population with members x 1,...,

More information

Stat 704 Data Analysis I Probability Review

Stat 704 Data Analysis I Probability Review 1 / 39 Stat 704 Data Analysis I Probability Review Dr. Yen-Yi Ho Department of Statistics, University of South Carolina A.3 Random Variables 2 / 39 def n: A random variable is defined as a function that

More information

Lecture 2: Linear Models. Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011

Lecture 2: Linear Models. Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011 Lecture 2: Linear Models Bruce Walsh lecture notes Seattle SISG -Mixed Model Course version 23 June 2011 1 Quick Review of the Major Points The general linear model can be written as y = X! + e y = vector

More information

Robustness and Distribution Assumptions

Robustness and Distribution Assumptions Chapter 1 Robustness and Distribution Assumptions 1.1 Introduction In statistics, one often works with model assumptions, i.e., one assumes that data follow a certain model. Then one makes use of methodology

More information

A Very Brief Summary of Statistical Inference, and Examples

A Very Brief Summary of Statistical Inference, and Examples A Very Brief Summary of Statistical Inference, and Examples Trinity Term 2009 Prof. Gesine Reinert Our standard situation is that we have data x = x 1, x 2,..., x n, which we view as realisations of random

More information

MFin Econometrics I Session 4: t-distribution, Simple Linear Regression, OLS assumptions and properties of OLS estimators

MFin Econometrics I Session 4: t-distribution, Simple Linear Regression, OLS assumptions and properties of OLS estimators MFin Econometrics I Session 4: t-distribution, Simple Linear Regression, OLS assumptions and properties of OLS estimators Thilo Klein University of Cambridge Judge Business School Session 4: Linear regression,

More information

The University of Hong Kong Department of Statistics and Actuarial Science STAT2802 Statistical Models Tutorial Solutions Solutions to Problems 71-80

The University of Hong Kong Department of Statistics and Actuarial Science STAT2802 Statistical Models Tutorial Solutions Solutions to Problems 71-80 The University of Hong Kong Department of Statistics and Actuarial Science STAT2802 Statistical Models Tutorial Solutions Solutions to Problems 71-80 71. Decide in each case whether the hypothesis is simple

More information

Mathematical Statistics

Mathematical Statistics Mathematical Statistics MAS 713 Chapter 8 Previous lecture: 1 Bayesian Inference 2 Decision theory 3 Bayesian Vs. Frequentist 4 Loss functions 5 Conjugate priors Any questions? Mathematical Statistics

More information

Statistical inference (estimation, hypothesis tests, confidence intervals) Oct 2018

Statistical inference (estimation, hypothesis tests, confidence intervals) Oct 2018 Statistical inference (estimation, hypothesis tests, confidence intervals) Oct 2018 Sampling A trait is measured on each member of a population. f(y) = propn of individuals in the popn with measurement

More information

1 Hypothesis testing for a single mean

1 Hypothesis testing for a single mean This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this

More information

7.3 The Chi-square, F and t-distributions

7.3 The Chi-square, F and t-distributions 7.3 The Chi-square, F and t-distributions Ulrich Hoensch Monday, March 25, 2013 The Chi-square Distribution Recall that a random variable X has a gamma probability distribution (X Gamma(r, λ)) with parameters

More information

HANDBOOK OF APPLICABLE MATHEMATICS

HANDBOOK OF APPLICABLE MATHEMATICS HANDBOOK OF APPLICABLE MATHEMATICS Chief Editor: Walter Ledermann Volume VI: Statistics PART A Edited by Emlyn Lloyd University of Lancaster A Wiley-Interscience Publication JOHN WILEY & SONS Chichester

More information

UQ, Semester 1, 2017, Companion to STAT2201/CIVL2530 Exam Formulae and Tables

UQ, Semester 1, 2017, Companion to STAT2201/CIVL2530 Exam Formulae and Tables UQ, Semester 1, 2017, Companion to STAT2201/CIVL2530 Exam Formulae and Tables To be provided to students with STAT2201 or CIVIL-2530 (Probability and Statistics) Exam Main exam date: Tuesday, 20 June 1

More information

Mathematics for Economics MA course

Mathematics for Economics MA course Mathematics for Economics MA course Simple Linear Regression Dr. Seetha Bandara Simple Regression Simple linear regression is a statistical method that allows us to summarize and study relationships between

More information

ISyE 6644 Fall 2014 Test 3 Solutions

ISyE 6644 Fall 2014 Test 3 Solutions 1 NAME ISyE 6644 Fall 14 Test 3 Solutions revised 8/4/18 You have 1 minutes for this test. You are allowed three cheat sheets. Circle all final answers. Good luck! 1. [4 points] Suppose that the joint

More information

Eco517 Fall 2004 C. Sims MIDTERM EXAM

Eco517 Fall 2004 C. Sims MIDTERM EXAM Eco517 Fall 2004 C. Sims MIDTERM EXAM Answer all four questions. Each is worth 23 points. Do not devote disproportionate time to any one question unless you have answered all the others. (1) We are considering

More information

Introductory Statistics with R: Simple Inferences for continuous data

Introductory Statistics with R: Simple Inferences for continuous data Introductory Statistics with R: Simple Inferences for continuous data Statistical Packages STAT 1301 / 2300, Fall 2014 Sungkyu Jung Department of Statistics University of Pittsburgh E-mail: sungkyu@pitt.edu

More information

Physics 403. Segev BenZvi. Propagation of Uncertainties. Department of Physics and Astronomy University of Rochester

Physics 403. Segev BenZvi. Propagation of Uncertainties. Department of Physics and Astronomy University of Rochester Physics 403 Propagation of Uncertainties Segev BenZvi Department of Physics and Astronomy University of Rochester Table of Contents 1 Maximum Likelihood and Minimum Least Squares Uncertainty Intervals

More information

Mathematical statistics

Mathematical statistics October 4 th, 2018 Lecture 12: Information Where are we? Week 1 Week 2 Week 4 Week 7 Week 10 Week 14 Probability reviews Chapter 6: Statistics and Sampling Distributions Chapter 7: Point Estimation Chapter

More information

8 Laws of large numbers

8 Laws of large numbers 8 Laws of large numbers 8.1 Introduction We first start with the idea of standardizing a random variable. Let X be a random variable with mean µ and variance σ 2. Then Z = (X µ)/σ will be a random variable

More information

Business Statistics 41000: Homework # 5

Business Statistics 41000: Homework # 5 Business Statistics 41000: Homework # 5 Drew Creal Due date: Beginning of class in week # 10 Remarks: These questions cover Lectures #7, 8, and 9. Question # 1. Condence intervals and plug-in predictive

More information

Asymptotic Statistics-VI. Changliang Zou

Asymptotic Statistics-VI. Changliang Zou Asymptotic Statistics-VI Changliang Zou Kolmogorov-Smirnov distance Example (Kolmogorov-Smirnov confidence intervals) We know given α (0, 1), there is a well-defined d = d α,n such that, for any continuous

More information

Linear Regression and Its Applications

Linear Regression and Its Applications Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start

More information

Sampling Distributions

Sampling Distributions Sampling Distributions In statistics, a random sample is a collection of independent and identically distributed (iid) random variables, and a sampling distribution is the distribution of a function of

More information

University of Regina. Lecture Notes. Michael Kozdron

University of Regina. Lecture Notes. Michael Kozdron University of Regina Statistics 252 Mathematical Statistics Lecture Notes Winter 2005 Michael Kozdron kozdron@math.uregina.ca www.math.uregina.ca/ kozdron Contents 1 The Basic Idea of Statistics: Estimating

More information

Class 26: review for final exam 18.05, Spring 2014

Class 26: review for final exam 18.05, Spring 2014 Probability Class 26: review for final eam 8.05, Spring 204 Counting Sets Inclusion-eclusion principle Rule of product (multiplication rule) Permutation and combinations Basics Outcome, sample space, event

More information

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages: Glossary The ISI glossary of statistical terms provides definitions in a number of different languages: http://isi.cbs.nl/glossary/index.htm Adjusted r 2 Adjusted R squared measures the proportion of the

More information

Mathematical Notation Math Introduction to Applied Statistics

Mathematical Notation Math Introduction to Applied Statistics Mathematical Notation Math 113 - Introduction to Applied Statistics Name : Use Word or WordPerfect to recreate the following documents. Each article is worth 10 points and should be emailed to the instructor

More information

Qualifying Exam CS 661: System Simulation Summer 2013 Prof. Marvin K. Nakayama

Qualifying Exam CS 661: System Simulation Summer 2013 Prof. Marvin K. Nakayama Qualifying Exam CS 661: System Simulation Summer 2013 Prof. Marvin K. Nakayama Instructions This exam has 7 pages in total, numbered 1 to 7. Make sure your exam has all the pages. This exam will be 2 hours

More information

7 Random samples and sampling distributions

7 Random samples and sampling distributions 7 Random samples and sampling distributions 7.1 Introduction - random samples We will use the term experiment in a very general way to refer to some process, procedure or natural phenomena that produces

More information

Chapter 12 - Lecture 2 Inferences about regression coefficient

Chapter 12 - Lecture 2 Inferences about regression coefficient Chapter 12 - Lecture 2 Inferences about regression coefficient April 19th, 2010 Facts about slope Test Statistic Confidence interval Hypothesis testing Test using ANOVA Table Facts about slope In previous

More information

Master s Written Examination

Master s Written Examination Master s Written Examination Option: Statistics and Probability Spring 016 Full points may be obtained for correct answers to eight questions. Each numbered question which may have several parts is worth

More information

Probability. Table of contents

Probability. Table of contents Probability Table of contents 1. Important definitions 2. Distributions 3. Discrete distributions 4. Continuous distributions 5. The Normal distribution 6. Multivariate random variables 7. Other continuous

More information

Exam Applied Statistical Regression. Good Luck!

Exam Applied Statistical Regression. Good Luck! Dr. M. Dettling Summer 2011 Exam Applied Statistical Regression Approved: Tables: Note: Any written material, calculator (without communication facility). Attached. All tests have to be done at the 5%-level.

More information

557: MATHEMATICAL STATISTICS II HYPOTHESIS TESTING: EXAMPLES

557: MATHEMATICAL STATISTICS II HYPOTHESIS TESTING: EXAMPLES 557: MATHEMATICAL STATISTICS II HYPOTHESIS TESTING: EXAMPLES Example Suppose that X,..., X n N, ). To test H 0 : 0 H : the most powerful test at level α is based on the statistic λx) f π) X x ) n/ exp

More information

Introduction to bivariate analysis

Introduction to bivariate analysis Introduction to bivariate analysis When one measurement is made on each observation, univariate analysis is applied. If more than one measurement is made on each observation, multivariate analysis is applied.

More information

Week 1 Quantitative Analysis of Financial Markets Distributions A

Week 1 Quantitative Analysis of Financial Markets Distributions A Week 1 Quantitative Analysis of Financial Markets Distributions A Christopher Ting http://www.mysmu.edu/faculty/christophert/ Christopher Ting : christopherting@smu.edu.sg : 6828 0364 : LKCSB 5036 October

More information

Multivariate Distributions

Multivariate Distributions IEOR E4602: Quantitative Risk Management Spring 2016 c 2016 by Martin Haugh Multivariate Distributions We will study multivariate distributions in these notes, focusing 1 in particular on multivariate

More information

TABLES AND FORMULAS FOR MOORE Basic Practice of Statistics

TABLES AND FORMULAS FOR MOORE Basic Practice of Statistics TABLES AND FORMULAS FOR MOORE Basic Practice of Statistics Exploring Data: Distributions Look for overall pattern (shape, center, spread) and deviations (outliers). Mean (use a calculator): x = x 1 + x

More information

1 Statistical inference for a population mean

1 Statistical inference for a population mean 1 Statistical inference for a population mean 1. Inference for a large sample, known variance Suppose X 1,..., X n represents a large random sample of data from a population with unknown mean µ and known

More information

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007)

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007) FROM: PAGANO, R. R. (007) I. INTRODUCTION: DISTINCTION BETWEEN PARAMETRIC AND NON-PARAMETRIC TESTS Statistical inference tests are often classified as to whether they are parametric or nonparametric Parameter

More information

MAS223 Statistical Inference and Modelling Exercises

MAS223 Statistical Inference and Modelling Exercises MAS223 Statistical Inference and Modelling Exercises The exercises are grouped into sections, corresponding to chapters of the lecture notes Within each section exercises are divided into warm-up questions,

More information

Lecture 10: Generalized likelihood ratio test

Lecture 10: Generalized likelihood ratio test Stat 200: Introduction to Statistical Inference Autumn 2018/19 Lecture 10: Generalized likelihood ratio test Lecturer: Art B. Owen October 25 Disclaimer: These notes have not been subjected to the usual

More information