Outline. Unit 3: Inferential Statistics for Continuous Data. Outline. Inferential statistics for continuous data. Inferential statistics Preliminaries

Size: px

Start display at page:

Download "Outline. Unit 3: Inferential Statistics for Continuous Data. Outline. Inferential statistics for continuous data. Inferential statistics Preliminaries"

Jerome Cuthbert Ross
6 years ago
Views:

1 Unit 3: Inferential Statistics for Continuous Data Statistics for Linguists with R A SIGIL Course Designed by Marco Baroni 1 and Stefan Evert 1 Center for Mind/Brain Sciences (CIMeC) University of Trento, Italy Corpus Linguistics Group Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany SIGIL (Baroni & Evert) 3b. Continuous Data: Inference sigil.r-forge.r-project.org 1 / 33 SIGIL (Baroni & Evert) 3b. Continuous Data: Inference sigil.r-forge.r-project.org / 33 SIGIL (Baroni & Evert) 3b. Continuous Data: Inference sigil.r-forge.r-project.org 3 / 33 for continuous data Goal: infer (characteristics of) population distribution from small random sample, or test hypotheses about population problem: overwhelmingly infinite coice of possible distributions can estimate/test characteristics such as mean µ and s.d. but H 0 doesn t determine a unique sampling distribution then parametric model, where the population distribution of a r.v. X is completely determined by a small set of parameters In this session, we assume a Gaussian population distribution estimate/test parameters µ and of this distribution sometimes a scale transformation is necessary (e.g. lognormal) Nonparametric tests need fewer assumptions, but... cannot test hypotheses about µ and (instead: median m, IQR = inter-quartile range, etc.) more complicated and computationally expensive procedures correct interpretation of results often difficult SIGIL (Baroni & Evert) 3b. Continuous Data: Inference sigil.r-forge.r-project.org 4 / 33

2 for continuous data Rationale similar to binomial test for frequency data: measure observed statistic T in sample, which is compared against its expected value E 0 [T ] if difference is large enough, reject H 0 Question 1: What is a suitable statistic? depends on null hypothesis H0 large difference T E 0 [T ] should provide evidence against H 0 e.g. unbiased estimator for population parameter to be tested Question : what is large enough? reject if difference is unlikely to arise by chance need to compute sampling distribution of T under H0 for continuous data Easy if statistic T has a Gaussian distribution T N(µ, ) µ and are determined by null hypothesis H 0 reject H 0 at two-sided significance level α =.05 if T < µ 1.96 or T > µ This suggests a standardized z-score as a measure of extremeness: Z := T µ Central range of sampling variation: Z 1.96 g(t) t µ SIGIL (Baroni & Evert) 3b. Continuous Data: Inference sigil.r-forge.r-project.org 5 / 33 SIGIL (Baroni & Evert) 3b. Continuous Data: Inference sigil.r-forge.r-project.org 6 / 33 Notation for random samples Random sample of n m = Ω items e.g. participants of survey, Wikipedia sample,... recall importance of completely random selection Sample described by observed values of r.v. X, Y, Z,...: x 1,..., x n ; y 1,..., y n ; z 1,..., z n specific items ω 1,..., ω n are irrelevant, we are only interested in their properties x i = X (ω i ), y i = Y (ω i ), etc. Mathematically, x i, y i, z i are realisations of random variables X 1,..., X n ; Y 1,..., Y n ; Z 1,..., Z n X 1,..., X n are independent from each other and each one has the same distribution X i X i.i.d. random variables each random experiment now yields complete sample of size n SIGIL (Baroni & Evert) 3b. Continuous Data: Inference sigil.r-forge.r-project.org 7 / 33 SIGIL (Baroni & Evert) 3b. Continuous Data: Inference sigil.r-forge.r-project.org 8 / 33

3 A simple test for the mean The sampling distribution of X Consider simplest possible H 0 : a point hypothesis H 0 : µ = µ 0, = 0 together with normality assumption, population distribution is completely determined How would you test whether µ = µ 0 is correct? An intuitive test statistic is the sample mean x = 1 n x i with x µ 0 under H 0 Reject H 0 if difference x µ 0 is sufficiently large need to work out sampling distribution of X The sample mean is also a random variable: X = 1 ( ) X1 + + X n n X is a sensible test statistic for µ because it is unbiased: [ ] E[ X 1 ] = E X i = 1 E[X i ] = 1 µ = µ n n n An important property of the Gaussian distribution: if X N(µ 1, 1 ) and Y N(µ, ) are independent, then X + Y N(µ 1 + µ, 1 + ) r X N(rµ 1, r 1) for r R SIGIL (Baroni & Evert) 3b. Continuous Data: Inference sigil.r-forge.r-project.org 9 / 33 SIGIL (Baroni & Evert) 3b. Continuous Data: Inference sigil.r-forge.r-project.org 10 / 33 The sampling distribution of X The z test Since X 1,..., X n are i.i.d. with X i N(µ, ), we have X X n N(nµ, n ) X = 1 ( ) X1 + + X n N(µ, n n ) X has Gaussian distribution with same mean µ but smaller s.d. than the original r.v. X : X = / n explains why normality assumptions are so convenient larger samples allow more reliable hypothesis tests about µ If the sample size n is large enough, X = / n 0 and the sample mean x becomes an accurate estimate of the true population value µ (law of large numbers) SIGIL (Baroni & Evert) 3b. Continuous Data: Inference sigil.r-forge.r-project.org 11 / 33 Now we can quantify the extremeness of the observed value x, given the null hypothesis H 0 : µ = µ 0, = 0 z = x µ 0 X = x µ 0 0 / n Corresponding r.v. Z has a standard normal distribution if H 0 is correct: Z N(0, 1) We can reject H 0 at significance level α if α = z > qnorm(α/) Two problems of this approach: 1. need to make hypothesis about in order to test µ = µ 0. H 0 might be rejected because of 0 even if µ = µ 0 is true SIGIL (Baroni & Evert) 3b. Continuous Data: Inference sigil.r-forge.r-project.org 1 / 33

4 A test for the variance An intuitive test statistic for is the error sum of squares V = (X 1 µ) + + (X n µ) Squared error (X µ) is on average E[V ] = n reject = 0 if V n0 (variance larger than expected) reject = 0 if V n0 (variance smaller than expected) sampling distribution of V shows if difference is large enough Rewrite V in the following way: [ (X1 ) V = µ + + = (Z Z n ) ( ) ] Xn µ with Z i N(0, 1) i.i.d. standard normal variables SIGIL (Baroni & Evert) 3b. Continuous Data: Inference sigil.r-forge.r-project.org 13 / 33 SIGIL (Baroni & Evert) 3b. Continuous Data: Inference sigil.r-forge.r-project.org 14 / 33 A test for the variance A test for the variance Note that the distribution of Z1 + + Z n does not depend on the population parameters µ and (unlike V ) Statisticians have worked out the distribution of n Z i for i.i.d. Z i N(0, 1), known as the chi-squared distribution Z i χ n with n degrees of freedom (df = n) The χ n distribution has expectation E [ i Z i ] = n and variance Var [ i Z i ] = n confirms E[V ] = n Under H 0 : = 0, we have V 0 = Z Z n χ n Appropriate rejection thresholds for the test statistic V /0 can easily be obtained with R χ n distribution is not symmetric, so one-sided tail probabilities are used (with α = α/ for two-sided test) Again, there are two problems: 1. need to make hypothesis about µ in order to test = 0. H 0 easily rejected for µ µ 0, even though = 0 may be true SIGIL (Baroni & Evert) 3b. Continuous Data: Inference sigil.r-forge.r-project.org 15 / 33 SIGIL (Baroni & Evert) 3b. Continuous Data: Inference sigil.r-forge.r-project.org 16 / 33

5 Intermission: Distributions in R Intermission: Distributions in R R can compute density functions and tail probabilities or generate random numbers for a wide range of distributions Systematic naming scheme for such functions: dnorm() pnorm() qnorm() rnorm() density function of Gaussian (normal) distribution tail probability quantile = inverse tail probability generate random numbers Available distributions include Gaussian (norm), chi-squared (chisq), t (t), F (f), binomial (binom), Poisson (pois),... you will encounter many of them later in the course Each function accepts distribution-specific parameters > x <- rnorm(50, mean=100, sd=15) # random sample of 50 IQ scores > hist(x, freq=false, breaks=seq(45,155,10)) # histogram > xg <- seq(45, 155, 1) # theoretical density in steps of 1 IQ point > yg <- dnorm(xg, mean=100, sd=15) > lines(xg, yg, col="blue", lwd=) # What is the probability of an IQ score above 150? # (we need to compute an upper tail probability to answer this question) > pnorm(150, mean=100, sd=15, lower.tail=false) # What does it mean to be among the bottom 5% of the population? > qnorm(.5, mean=100, sd=15) # inverse tail probability SIGIL (Baroni & Evert) 3b. Continuous Data: Inference sigil.r-forge.r-project.org 17 / 33 SIGIL (Baroni & Evert) 3b. Continuous Data: Inference sigil.r-forge.r-project.org 18 / 33 Intermission: Distributions in R The sample variance # Now do the same for a chi-squared distribution with 5 degrees of freedom # (hint: the parameter you re looking for is df=5) > xc <- seq(0, 10,.1) > yc <- dchisq(xc, df=5) > plot(xc, yc, type="l", col="blue", lwd=) Idea: replace true µ by sample value X (which is a r.v.!) V = (X 1 X ) + + (X n X ) But there are two problems: X i X N(0, ) not guaranteed because X µ terms are no longer i.i.d. because X depends on all X i # tail probability for i Z i 10 > pchisq(10, df=5, lower.tail=false) # What is the appropriate rejection criterion for a variance test with α = 0.05? > qchisq(.05, df=5, lower.tail=false) # two-sided: V / 0 > n > qchisq(.05, df=5, lower.tail=true) # two-sided: V / 0 < n SIGIL (Baroni & Evert) 3b. Continuous Data: Inference sigil.r-forge.r-project.org 19 / 33 SIGIL (Baroni & Evert) 3b. Continuous Data: Inference sigil.r-forge.r-project.org 0 / 33

6 The sample variance The sample variance We can easily work out the distribution of V for n = : V = (X 1 X ) + (X X ) = (X 1 X 1+X ) + (X X 1+X ) = ( X 1 X ) + ( X X 1 ) = 1 (X 1 X ) where X 1 X N(0, ) for i.i.d. X 1, X N(µ, ) Can also show that V and X are independent follows from independence of X 1 X and X 1 + X this is only the case for independent Gaussian variables (Geary 1936, p. 178) SIGIL (Baroni & Evert) 3b. Continuous Data: Inference sigil.r-forge.r-project.org 1 / 33 We now have ( ) V = X1 X = Z with Z χ 1 because of X 1 X N(0, ) For n > it can be shown that V = n 1 (X i X ) = j=1 with j Z j χ n 1 independent from X proof based on multivariate Gaussian and vector algebra notice that we lose one degree of freedom because one parameter (µ x) has been estimated from the sample SIGIL (Baroni & Evert) 3b. Continuous Data: Inference sigil.r-forge.r-project.org / 33 Z j Sample variance and the chi-squared test Sample data for this session This motivates the following definition of sample variance S S = 1 n 1 (X i X ) with sampling distribution (n 1)S / χ n 1 S is an unbiased estimator of variance: E[S ] = We can use S to test H 0 : = 0 without making any assumptions about the true mean µ chi-squared test # Let us take a reproducible sample from the population of Ingary > library(sigil) > Census <- simulated.census() > Survey <- Census[1:100, ] # We will be testing hypotheses about the distribution of body heights > x <- Survey$height # sample data: n items > n <- length(x) Remarks sample variance ( 1 n 1 ) vs. population variance ( 1 m ) χ distribution doesn t have parameters etc., so we need to specify the distribution of S in a roundabout way independence of S and X will play an important role later SIGIL (Baroni & Evert) 3b. Continuous Data: Inference sigil.r-forge.r-project.org 3 / 33 SIGIL (Baroni & Evert) 3b. Continuous Data: Inference sigil.r-forge.r-project.org 4 / 33

7 Chi-squared test of variance in R # Chi-squared test for a hypothesis about the s.d. (with unknown mean) # H 0 : = 1 (one-sided test against > 0 ) > sigma0 <- 1 # you can also use the name 0 in a Unicode locale > S <- sum((x - mean(x))^) / (n-1) # unbiased estimator of > S <- var(x) # this should give exactly the same value > X <- (n-1) * S / sigma0^ # has χ distribution under H 0 > pchisq(x, df=n-1, lower.tail=false) # How do you carry out a one-sided test against < 0? # Here s a trick for an approximate two-sided test (try e.g. with 0 = 0) > alt.higher <- S > sigma0^ > * pchisq(x, df=n-1, lower.tail=!alt.higher) SIGIL (Baroni & Evert) 3b. Continuous Data: Inference sigil.r-forge.r-project.org 5 / 33 SIGIL (Baroni & Evert) 3b. Continuous Data: Inference sigil.r-forge.r-project.org 6 / 33 for the mean for the mean Now we have the ingredients for a test of H 0 : µ = µ 0 that does not require knowledge of the true variance In the z-score for X Z = X µ 0 / n replace the unknown true s.d. by the unbiased sample estimate ˆ = S, resulting in a so-called t-score: T = X µ 0 S /n William S. Gosset worked out the precise sampling distriution of T and published it under the pseudonym Student Because X and S are independent, we find that T t n 1 under H 0 : µ = µ 0 Student s t distribution with df = n 1 degrees of freedom In order to carry out a one-sample t test, calculate the statistic t = x µ 0 s /n and reject H 0 : µ = µ 0 if t > C Rejection threshold C depends on df = n 1 and desired significance level α (in R: -qt(α/, n 1)) very close to z-score thresholds for n > 30 SIGIL (Baroni & Evert) 3b. Continuous Data: Inference sigil.r-forge.r-project.org 7 / 33 SIGIL (Baroni & Evert) 3b. Continuous Data: Inference sigil.r-forge.r-project.org 8 / 33

8 The mathematical magic behind Student s t distribution characterizes the quantity Z V /k t k where Z N(0, 1) and V χ k are independent r.v. T t n 1 under H 0 : µ = µ 0 because the unknown population variance cancels out between the independent r.v. X and S T = X X µ µ 0 0 S /n = with Z = X µ 0 / n = S n X µ 0 / n S = N(0, 1) and V = (n 1)S X µ 0 / n (n 1)S /(n 1) χ n 1 One-sample t test in R # we will use the same sample x of size n as in the previous example # Student s t-test for a hypothesis about the mean (with unknown s.d.) # H 0 : µ = 165 cm > mu0 <- 165 > x.bar <- mean(x) # sample mean x > s <- var(x) # sample variance s > t.score <- (x.bar - mu0) / sqrt(s / n) # t statistic > print(t.score) # positive indicates µ > µ 0, negative µ < µ 0 > -qt(0.05/, n-1) # two-sided rejection threshold for t at α =.05 > * pt(abs(t.score), n-1, lower=false) # two-sided p-value # Mini-task: plot density function of t distribution for different d.f. > t.test(x, mu=165) # agrees with our manual t-test # Note that t.test() also provides a confidence interval for the true µ! SIGIL (Baroni & Evert) 3b. Continuous Data: Inference sigil.r-forge.r-project.org 9 / 33 SIGIL (Baroni & Evert) 3b. Continuous Data: Inference sigil.r-forge.r-project.org 30 / 33 If we do not have a specific H 0 to start from, estimate confidence interval for µ or by inverting hypothesis tests in principle same procedure as for binomial confidence intervals implemented in R for t test and chi-squared test Confidence interval has a particularly simple form for the t test Given H 0 : µ = a for some a R, we reject H 0 if x a t = s /n > C with C for α =.05 and n > 30 x C s n µ x + C s n this is the origin of the ± standard deviations rule of thumb SIGIL (Baroni & Evert) 3b. Continuous Data: Inference sigil.r-forge.r-project.org 31 / 33 SIGIL (Baroni & Evert) 3b. Continuous Data: Inference sigil.r-forge.r-project.org 3 / 33

9 Can you work out a similar confidence interval for? Test hypotheses H 0 : = a for different values a > 0 Which H 0 are rejected given the observed sample variance s? If H 0 is true, we have the sampling distribution Z := (n 1)S /a χ n 1 Reject H 0 if Z > C 1 or Z < C (not symmetric) Solve inequalities to obtain confidence interval (n 1)s /C 1 (n 1)s /C SIGIL (Baroni & Evert) 3b. Continuous Data: Inference sigil.r-forge.r-project.org 33 / 33

z and t tests for the mean of a normal distribution Confidence intervals for the mean Binomial tests

z and t tests for the mean of a normal distribution Confidence intervals for the mean Binomial tests Chapters 3.5.1 3.5.2, 3.3.2 Prof. Tesler Math 283 Fall 2018 Prof. Tesler z and t tests for mean Math