Statistical Inference: Uses, Abuses, and Misconceptions


1 Statistical Inference: Uses, Abuses, and Misconceptions
Michael W. Trosset
Indiana Statistical Consulting Center, Department of Statistics
ISCC is part of IU's Department of Statistics, chaired by Stanley Wasserman.
WIM

2 Experiments are performed for the purpose of obtaining information about a population that is imperfectly understood. The population represents all of the experimental outcomes that might occur (the sample space), together with a probability distribution $P$ that describes the behavior of the experiment. The observed outcomes are a random sample of the population.
Notation: $X_1, \ldots, X_n \sim P$, $\vec{x} = \{x_1, \ldots, x_n\}$
Statistical inference is concerned with how one draws conclusions about $P$ from $\vec{x}$.

3 Example
Consider an experiment to investigate the effect of a gasoline additive on miles per gallon (mpg). For car $i = 1, \ldots, n$, let
$B_i$ = mpg before additive,
$A_i$ = mpg after additive,
$X_i = A_i - B_i$ = improvement in mpg.
Car $i$ improves iff $X_i > 0$.
Write the experiment as $X_1, \ldots, X_n \sim P$ and observe $\vec{x} = \{x_1, \ldots, x_n\}$. What might we like to know about $P$? For example...
- What is $P(X_i > 0)$? Is $P(X_i > 0) > 0.5$?
- What is $EX_i$? Is $EX_i > 0$?

4 Two very different concerns are often conflated:
1. What would we like to know about $P$?
2. How can we use $\vec{x}$ to draw inferences about the attributes of $P$ that interest us?

5 1. Describing a Population
Suppose that $P$ is known. Then inference is not necessary and we can focus on deciding what attributes of $P$ are interesting and/or important. We might begin by displaying, examining, or reporting the probability density function (pdf) of $P$, say $f$. For example...

6 [Figure: the pdf $f(x)$ of $P$, plotted against $x$.]

7 Population Attributes
To summarize a probability distribution, statisticians usually begin by computing a measure of centrality (e.g., median, mean) and a measure of dispersion (e.g., interquartile range, standard deviation). These quantities are often useful, but may not reveal important features of the population.
In the preceding example, the quartiles are $q_1 \doteq -0.59$, $q_2 \doteq 0.14$, and $q_3 \doteq 0.97$. The median is $q_2 \doteq 0.14$ and the interquartile range is $q_3 - q_1 \doteq 0.97 - (-0.59) = 1.56$. The mean is $\mu = 0.3$, the variance is $\sigma^2 = 1.714$, and the standard deviation is $\sigma \doteq 1.309$.
If one only computes measures of centrality and dispersion, then one may overlook the shape of this unusual pdf. In fact, $f$ is a bimodal mixture. Ninety percent of the population is Normal(0, 1), but ten percent of the population is Normal(3, 0.04). For the latter subpopulation, the additive provides a definite benefit.
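
The summary values for this mixture can be checked numerically. The following is a minimal sketch, assuming that the second parameter of Normal(3, 0.04) is a variance (standard deviation 0.2), which is the reading consistent with the stated variance of 1.714. The helper names `mixture_cdf` and `mixture_quantile` are illustrative, not from the slides.

```python
from statistics import NormalDist

# Mixture from the slide: 90% Normal(0, 1) and 10% Normal(3, 0.04).
# Assumption: 0.04 is a variance, so the second component has sd 0.2.
WEIGHTS = (0.9, 0.1)
COMPONENTS = (NormalDist(0.0, 1.0), NormalDist(3.0, 0.2))


def mixture_cdf(x):
    """Cumulative distribution function of the two-component mixture."""
    return sum(w * c.cdf(x) for w, c in zip(WEIGHTS, COMPONENTS))


def mixture_quantile(prob, lo=-10.0, hi=10.0, tol=1e-10):
    """Invert the mixture cdf by bisection (the cdf is increasing)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if mixture_cdf(mid) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


q1, q2, q3 = (mixture_quantile(p) for p in (0.25, 0.5, 0.75))
mean = sum(w * c.mean for w, c in zip(WEIGHTS, COMPONENTS))
# Var = E[X^2] - (E[X])^2, with E[X^2] = sum of w * (variance + mean^2).
second_moment = sum(w * (c.variance + c.mean ** 2) for w, c in zip(WEIGHTS, COMPONENTS))
variance = second_moment - mean ** 2

print(f"quartiles: {q1:.2f}, {q2:.2f}, {q3:.2f}; IQR: {q3 - q1:.2f}")
print(f"mean: {mean:.3f}, variance: {variance:.3f}, sd: {variance ** 0.5:.3f}")
```

None of these five numbers reveals that one car in ten belongs to a subpopulation centered at 3 mpg, which is the point of the slide.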

8 Effect Size
It is often recommended that researchers examine effect size. If one construes this advice to mean that one should examine the magnitude of the effect, i.e., the location of $P$ relative to some baseline, by quantifying the salient attributes of $P$, then this is exemplary advice.
Unfortunately, Jacob Cohen's well-intentioned efforts to help researchers design experiments have led many researchers to obsess over specific measures, e.g., $d = \mu/\sigma$.
Cohen was not concerned with reporting salient features of $P$, but with performing specific calculations related to the choice of sample size. These calculations depend on specifying a value of $d$, and Cohen offered rough guidelines for how to do so when $\mu$ and $\sigma$ are not known.

9 Cohen's rough guidelines for choosing a sample size when little is known about $P$ have led many researchers to embrace uncritical statements like "$d = 0.5$ is an effect of medium size." In fact, Cohen anticipated and warned against such abuse:
"The terms 'small,' 'medium,' and 'large' are relative, not only to each other but to... the specific content and research method being employed in any given investigation. In the face of this relativity, there is a certain risk inherent in offering conventional operational definitions for these terms..."
J. Cohen, Statistical Power Analysis for the Behavioral Sciences, 1988, p. 25.
Researchers should be interested in the size of an effect, but they should also think about how to quantify and interpret its magnitude in the context of their discipline. For example...
Instead of reporting whether $\mu < 0$ or $\mu > 0$, report the value of $\mu$ and think critically about its material significance (importance).

10 2. Inference
In practice, $P$ is unknown and we draw $\vec{x}$ to obtain information about $P$. We may construct...
- Graphical displays of $\vec{x}$: stem-and-leaf plots, box plots
- Estimates of $P$: empirical distributions and cdfs, quantile-quantile plots
- Estimates of $f$: histograms, average shifted histograms, kernel density estimates
To illustrate, consider the forearm lengths (in inches) of $n = 140$ adult males...


12 [Figure: three displays of the forearm data ($N = 140$): Box Plot; Normal Q-Q Plot (sample quantiles against theoretical quantiles); PDF Estimate (kernel density estimate).]

13 In practice, most inference is parametric. For example, assume that $P \sim \text{Normal}(\mu, \sigma^2)$ and draw inferences about $\mu$. Or...
Let $n = 100$ and assume that
$$X_1, \ldots, X_n \sim \text{Bernoulli}(p),$$
in which case
$$Y = \sum_{i=1}^{n} X_i \sim \text{Binomial}(n; p).$$
We observe $y = 32$. What might we infer about $p$?
- Point estimation is concerned with guessing the true value of $p$.
- Hypothesis testing is concerned with making a binary decision about $p$, e.g., is $p = 0.5$ or is $p \neq 0.5$?
- Set estimation is concerned with constructing a set of plausible values of $p$.

14 Point Estimation
Plug-In Approach. Estimate $p = EX_i$, the population mean, with the sample mean,
$$\hat{p} = \bar{x} = \frac{y}{n} = \frac{32}{100} = 0.32.$$
Maximum Likelihood Approach. Estimate $p$ with
$$\tilde{p} = \arg\max_a P(Y = 32 \mid p = a).$$
In this case, $\tilde{p} = \hat{p} = 0.32$.
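
The agreement between the two approaches can be checked numerically. This is a minimal sketch that maximizes the binomial likelihood over a grid of candidate values. The grid resolution (0.001) is an arbitrary choice, and a grid search is used only for transparency; the maximizer is also available in closed form as $y/n$.

```python
from math import comb

n, y = 100, 32


def likelihood(a):
    """P(Y = y | p = a) for Y ~ Binomial(n; a)."""
    return comb(n, y) * a ** y * (1 - a) ** (n - y)


# Candidate values a in (0, 1); the step size 0.001 is an arbitrary choice.
grid = [k / 1000 for k in range(1, 1000)]
p_mle = max(grid, key=likelihood)  # maximum likelihood estimate on the grid
p_plugin = y / n                   # plug-in estimate (sample mean)

print(f"maximum likelihood: {p_mle}, plug-in: {p_plugin}")
```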

15 Decision-Theoretic Approaches.
Let $\delta$ denote an estimator and let $L(\delta(y), p) = [\delta(y) - p]^2$ denote the squared error loss when $p$ is true and $y$ is observed. Let $R(p; \delta) = E_p L(\delta(Y), p)$ denote the risk (expected loss) of $\delta$ when $p$ is true. For example, $\delta(y) = y/n$ has risk function $R(p; \delta) = p(1 - p)/n$.
Ideally, we would like to use an estimator that simultaneously minimizes the risk for every possible value of $p$. Unfortunately, this is impossible. We can proceed in one of two ways, either by restricting the class of estimators that we are willing to use, or by minimizing some real-valued attribute of the risk function.
The Bayes Principle tells us to choose $\delta$ to minimize a weighted average of the possible risks:
$$\int_0^1 R(p; \delta)\, w(p)\, dp,$$

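16 where $w$ is a pdf that weights the $p \in (0, 1)$. The optimal estimator, $\delta_w$, is called the Bayes estimator with respect to $w$.
Example: If $w$ is the pdf of Beta($a$, $b$), then
$$\delta_w(y) = \delta_{a,b}(y) = \frac{a + y}{a + b + n} = \left(\frac{a + b}{a + b + n}\right)\frac{a}{a + b} + \left(\frac{n}{a + b + n}\right)\frac{y}{n}.$$
For $n = 100$ and $y = 32$ we obtain...
[Table: columns $a$, $b$, $a/(a+b)$, $\delta_{a,b}(32)$; the entries did not survive extraction.]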
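
The Beta example can be explored with a short sketch. The $(a, b)$ pairs below are hypothetical values chosen for illustration; they are not the entries of the slide's table. Exact fractions are used so that the two algebraic forms of the estimator can be compared without rounding.

```python
from fractions import Fraction

n, y = 100, 32


def bayes_estimate(a, b):
    """Bayes estimate (a + y) / (a + b + n) under a Beta(a, b) weight function."""
    return Fraction(a + y, a + b + n)


def weighted_form(a, b):
    """Same estimate as a weighted average of the prior mean and the sample mean."""
    prior_weight = Fraction(a + b, a + b + n)
    return prior_weight * Fraction(a, a + b) + (1 - prior_weight) * Fraction(y, n)


# Hypothetical (a, b) pairs for illustration only; not the slide's table.
for a, b in [(1, 1), (5, 5), (50, 50)]:
    estimate = bayes_estimate(a, b)
    assert estimate == weighted_form(a, b)  # the two forms agree exactly
    print(f"a={a:>2} b={b:>2} prior mean={a / (a + b):.3f} estimate={float(estimate):.4f}")
```

As $a + b$ grows relative to $n$, the estimate moves from the sample mean $y/n$ toward the prior mean $a/(a+b)$.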

17 [Figure: weight functions $w(p)$ plotted against $p$.]

18 Hypothesis Testing
Hypothesis testing is concerned with binary decisions. Partition the parameter set, here $(0, 1)$, into null and alternative hypotheses, e.g.,
$$H_0: p = 0.5 \text{ vs } H_1: p \neq 0.5 \quad\text{or}\quad H_0: p \geq 0.5 \text{ vs } H_1: p < 0.5.$$
$H_0$ is privileged, i.e., we retain $H_0$ unless we find compelling evidence against it.
A test statistic measures evidence against $H_0$. Under $H_0: p = 0.5$, $EY = np = 50$; hence, $|Y - 50|$ measures evidence against $H_0: p = 0.5$. We observed $|y - 50| = |32 - 50| = 18$.
Under $H_0: p = 0.5$, the significance probability is
$$\mathbf{p} = P_{H_0}(|Y - 50| \geq 18) = P_{H_0}(Y \leq 32) + P_{H_0}(Y \geq 68).$$
We reject $H_0$ iff $\mathbf{p}$ is sufficiently small, i.e., $\mathbf{p} \leq \alpha$ for some predetermined $\alpha$. The computed quantity $\mathbf{p}$ is the significance probability; the fixed quantity $\alpha$ is the significance level.
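
The significance probability for $y = 32$ can be computed exactly from the Binomial(100; 0.5) distribution. This is a minimal sketch; the value $\alpha = 0.05$ is an illustrative choice, since this slide leaves $\alpha$ unspecified.

```python
from math import comb

n, y, p0 = 100, 32, 0.5


def binom_pmf(k, n, p):
    """P(Y = k) for Y ~ Binomial(n; p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


observed = abs(y - n * p0)  # |y - 50| = 18
# Sum the null probabilities of all outcomes at least as far from 50 as y is.
sig_prob = sum(binom_pmf(k, n, p0) for k in range(n + 1) if abs(k - n * p0) >= observed)

alpha = 0.05  # illustrative significance level (not specified on this slide)
print(f"significance probability = {sig_prob:.6f}; reject H0: {sig_prob <= alpha}")
```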

19 Statistical significance is not material significance!
Example: Let $X_i \sim \text{Normal}(\mu, \sigma^2)$ denote car $i$'s improvement in mpg. A government agency will sanction advertising the additive iff $H_0: \mu \leq 1$ is rejected at significance level $\alpha = 0.05$.
1. A large corporation manufactures an additive that increases mileage by an average of $\mu = 1.01$ miles per gallon. The corporation funds a large study of $n = 900$ vehicles in which $\bar{x} = 1.01$ and $s = 0.1$ are observed. This results in a test statistic of
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{1.01 - 1}{0.1/\sqrt{900}} = 3$$
and a significance probability of
$$\mathbf{p} = P(T_{n-1} \geq 3) < 0.05 = \alpha,$$
so $H_0$ is decisively rejected and advertising is authorized.

20 2. An amateur automotive mechanic invents an additive that increases mileage by an average of $\mu = 1.21$ miles per gallon. The mechanic funds a small study of $n = 9$ vehicles in which $\bar{x} = 1.21$ and $s = 0.4$ are observed. This results in a test statistic of
$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} = \frac{1.21 - 1}{0.4/\sqrt{9}} = 1.575$$
and a significance probability of
$$\mathbf{p} = P(T_{n-1} \geq 1.575) > 0.05 = \alpha,$$
so $H_0$ is retained and advertising is not authorized.
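
The two studies can be compared side by side. The sketch below uses only the standard library, so the upper tail of Student's t distribution is approximated by Simpson's rule; this numerical shortcut is an assumption of the sketch, not the method used on the slides, and a statistics library would normally supply the tail probability directly. The function names are illustrative.

```python
from math import exp, lgamma, log, pi, sqrt


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_c - (df + 1) / 2 * log(1 + t * t / df))


def t_upper_tail(t, df, steps=2000):
    """P(T >= t) for t >= 0, via Simpson's rule on [0, t] (steps must be even)."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


def one_sided_t_test(xbar, s, n, mu0=1.0):
    """Test statistic and significance probability for H0: mu <= mu0."""
    t = (xbar - mu0) / (s / sqrt(n))
    return t, t_upper_tail(t, n - 1)


alpha = 0.05
for label, xbar, s, n in [("corporation", 1.01, 0.1, 900), ("mechanic", 1.21, 0.4, 9)]:
    t, sig_prob = one_sided_t_test(xbar, s, n)
    print(f"{label}: t = {t:.3f}, significance probability = {sig_prob:.4f}, "
          f"reject H0: {sig_prob <= alpha}")
```

The additive with the smaller effect (1.01 mpg) is the one that reaches statistical significance, because its study is a hundred times larger.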

21 Significance levels are error rates!
Suppose that $H_0$ is true. A test that rejects $H_0$ iff $\mathbf{p} \leq \alpha$ will falsely reject $H_0$ with probability $\alpha$.
For example, suppose that I give a fair penny to each of 600 students. Each student tosses his/her penny 100 times and tests $H_0: p = 0.5$ at significance level $\alpha = 0.05$. Let $X_i = 1$ if student $i$ rejects $H_0$ and $X_i = 0$ if student $i$ retains $H_0$.
Because $H_0$ is true and $p = 0.5$, $X_i \sim \text{Bernoulli}(\alpha = 0.05)$ and the total number of students who falsely reject $H_0$ is
$$Y = \sum_{i=1}^{600} X_i \sim \text{Binomial}(600; 0.05).$$
It follows that $EY = 30$, $P(Y \geq 20) \doteq 0.98$, etc.
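
Both quantities follow directly from the Binomial(600; 0.05) distribution. A minimal sketch, using exact binomial probabilities from the standard library:

```python
from math import comb

students, alpha = 600, 0.05


def binom_pmf(k, n, p):
    """P(Y = k) for Y ~ Binomial(n; p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


# Expected number of students who falsely reject H0.
expected_false_rejections = students * alpha
# P(Y >= 20) = 1 - P(Y <= 19) for Y ~ Binomial(600; 0.05).
prob_at_least_20 = 1 - sum(binom_pmf(k, students, alpha) for k in range(20))

print(f"EY = {expected_false_rejections:.1f}, P(Y >= 20) = {prob_at_least_20:.3f}")
```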

22 Set Estimation
Imagine testing $H_0: p = p_0$ at significance level $\alpha$ for every $p_0 \in (0, 1)$. The $p_0$ that are retained constitute a set of plausible values of $p$. This is an example of a confidence set.
For $Y \sim \text{Binomial}(100; p)$, $y = 32$, and $\alpha = 0.1$, the confidence set is $[0.245, 0.404]$. Notice that this interval does not contain $p_0 = 0.5$, which we have already rejected as implausible.
Let $C(\vec{x})$ denote the set of plausible values. Because $\vec{x}$ is a random sample, $C(\vec{x})$ may or may not cover $p$. The probability that the random set $C(\vec{X})$ covers $p$ is $1 - \alpha$, the confidence coefficient.
Confidence is not a guarantee! If 600 researchers construct confidence sets with confidence coefficient 0.95, then we expect that 30 confidence sets will fail to cover the parameter of interest. The probability that at least 20 confidence sets will fail to cover is approximately 0.98.
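
Inverting a family of tests can be sketched as a loop over candidate values $p_0$. The slides do not say which test produced $[0.245, 0.404]$; the sketch below assumes an equal-tailed exact binomial test and a grid of resolution 0.001, so its endpoints may differ slightly from the interval quoted above.

```python
from math import comb

n, y, alpha = 100, 32, 0.1


def binom_cdf(k, n, p):
    """P(Y <= k) for Y ~ Binomial(n; p)."""
    return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k + 1))


def retained(p0):
    """Equal-tailed exact test of H0: p = p0 (an assumed choice of test)."""
    lower_tail = binom_cdf(y, n, p0)          # P(Y <= y) under p0
    upper_tail = 1 - binom_cdf(y - 1, n, p0)  # P(Y >= y) under p0
    return min(lower_tail, upper_tail) > alpha / 2


# Candidate values p0 in (0, 1); the 0.001 resolution is an arbitrary choice.
grid = [k / 1000 for k in range(1, 1000)]
plausible = [p0 for p0 in grid if retained(p0)]

print(f"confidence set is approximately [{min(plausible):.3f}, {max(plausible):.3f}]")
```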

23 Bayesian Inference
A confidence coefficient tells us the probability that $C(\vec{X})$ will cover $p$. This fact provides information about how the procedure for constructing $C(\vec{x})$ will behave when used repeatedly; it does not tell us whether or not the observed $C(\vec{x})$ actually covers $p$. In contrast, Bayesian methods endeavor to quantify the evidence in $\vec{x}$.
Suppose that $X_1, \ldots, X_n \sim P_\theta$, $\theta \in \Omega$. Let $p(\vec{x} \mid \theta)$ denote the joint pdf of $\vec{x}$ and let $p(\theta)$ denote a pdf on $\Omega$. Then, with squared error loss, the Bayes estimate of $\theta$ turns out to be
$$\delta_p(\vec{x}) = \frac{\int_\Omega \theta\, p(\vec{x} \mid \theta)\, p(\theta)\, d\theta}{\int_\Omega p(\vec{x} \mid \theta)\, p(\theta)\, d\theta}.$$
Formally, $\delta_p(\vec{x})$ is the mean of the conditional pdf
$$\frac{p(\vec{x} \mid \theta)\, p(\theta)}{\int_\Omega p(\vec{x} \mid \theta)\, p(\theta)\, d\theta} = \frac{p(\vec{x}, \theta)}{p(\vec{x})} = p(\theta \mid \vec{x}).$$

24 Let us conceive of the experiment as occurring in two stages:
1. Nature draws $\theta$ from the marginal pdf $p(\theta)$. At this stage, absent a sample from $p(\vec{x} \mid \theta)$ and knowing only $p(\theta)$, we would estimate $\theta$ to be the mean of $p(\theta)$.
2. We draw $\vec{x}$ from the conditional pdf $p(\vec{x} \mid \theta)$. To incorporate $\vec{x}$ into our estimate of $\theta$, we replace $p(\theta)$ with $p(\theta \mid \vec{x})$. Thus, we modify our initial knowledge of $\theta$ in light of the information about $\theta$ provided by $\vec{x}$.
In this context, $p(\theta)$ is the prior distribution and $p(\theta \mid \vec{x})$ is the posterior distribution.
But what if $\theta$ is not random? What if $\theta$ is a physical constant, e.g., the speed of light?

25 The Basic Tenet of Bayesianity asserts that statisticians should treat known quantities as fixed and unknown quantities probabilistically. The basic tenet implies the following:
- Inference should be conditional on $\vec{x}$.
- Uncertainty about $\theta$ should be modeled by $p(\theta)$. (This requirement may necessitate a subjectivist interpretation of probability.)
Bayesians insist that inference should focus on the posterior distribution, which reveals how the prior distribution is modified by the data. One may display or report the entire posterior or certain of its attributes.
Instead of a confidence set, one may compute a region of highest posterior density (HPD), i.e., the smallest subset of $\Omega$ for which the posterior probability is $1 - \alpha$. L. J. Savage famously quipped that "The only use I know for a confidence interval is to have confidence in it." In contrast, Bayesians can legitimately claim that the (possibly subjective) probability that $\theta$ lies in a region of HPD is $1 - \alpha$.

26 3. More About Hypothesis Testing
True State of Nature:    | H_0 true           | H_1 true
Test retains H_0         | correct decision   | Type II error
Test rejects H_0         | Type I error       | correct decision
Hypothesis testing is not intended for the purpose of choosing the more plausible hypothesis. It is only intended to determine if there is sufficient evidence to reject the null hypothesis.
Legal Analogy: In a criminal trial, the defendant is innocent until proven guilty. The jury must decide if there is sufficient evidence to convict, i.e., if the prosecution's case is beyond a reasonable doubt.

27 Power Function
For simplicity, we restrict attention to a simple example. Assume that $X_1, \ldots, X_n \sim \text{Normal}(\theta, 1)$ and that we want to test $H_0: \theta \leq 0$ versus $H_1: \theta > 0$. Then the power function of a test $\varphi$ is
$$\text{power}_\varphi(\theta) = P(\varphi \text{ rejects } H_0 \mid X_1, \ldots, X_n \sim \text{Normal}(\theta, 1)).$$
For any reasonable test, the power function should look something like this...

28 [Figure: power function, P(reject $H_0$) plotted against the population mean.]

29 Given a significance level, $\alpha$, we require that $\varphi$ satisfy $\text{power}_\varphi(\theta) \leq \alpha$ for $\theta \leq 0$. Among such level $\alpha$ tests, we prefer tests for which $\text{power}_\varphi(\theta)$ is large for $\theta > 0$.

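30 Let $\alpha = 0.05$. The usual test of $H_0: \theta \leq 0$ versus $H_1: \theta > 0$ rejects $H_0$ iff $\bar{x}$ is sufficiently large, specifically, iff
$$\frac{\bar{x} - 0}{1/\sqrt{n}} = \sqrt{n}\,\bar{x} > 1.645.$$
The power function of this test is
$$\text{power}(\theta) = P\left(\sqrt{n}\,\bar{X} > 1.645 \,\middle|\, \bar{X} \sim \text{Normal}\left(\theta, \tfrac{1}{n}\right)\right)$$
$$= P\left(\bar{X} > \frac{1.645}{\sqrt{n}} \,\middle|\, \bar{X} \sim \text{Normal}\left(\theta, \tfrac{1}{n}\right)\right)$$
$$= P\left(\frac{\bar{X} - \theta}{1/\sqrt{n}} > \frac{1.645/\sqrt{n} - \theta}{1/\sqrt{n}} \,\middle|\, \bar{X} \sim \text{Normal}\left(\theta, \tfrac{1}{n}\right)\right)$$
$$= P\left(Z > 1.645 - \theta\sqrt{n} \,\middle|\, Z \sim \text{Normal}(0, 1)\right).$$
It follows that...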

31 $\text{power}(0) = P(Z > 1.645) = 0.05$
Interpretation: If $\theta = 0$, then the probability of rejecting $H_0: \theta \leq 0$, i.e., the probability of committing a Type I error, is 0.05.
$\text{power}(0.5) = P(Z > 1.645 - \sqrt{n}/2)$ increases as $n$ increases.
Interpretation: If $\theta = 0.5$, then the probability of rejecting $H_0: \theta \leq 0$ increases as $n$ increases. Equivalently, the probability of retaining $H_0$, i.e., the probability of committing a Type II error, decreases as $n$ increases.
Notice that the above are properties of the test, not the data. Power is not an observed quantity.
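
The behavior of the power function for this test can be tabulated with a short sketch. The sample sizes are arbitrary illustrative choices, and the critical value is computed as the 0.95 quantile of the standard normal rather than hard-coded as 1.645.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()            # standard normal distribution
CRITICAL = Z.inv_cdf(0.95)  # about 1.645, giving a level 0.05 test


def power(theta, n):
    """P(Z > 1.645 - theta * sqrt(n)): probability that the test rejects H0."""
    return 1 - Z.cdf(CRITICAL - theta * sqrt(n))


# Illustrative sample sizes; power(0) stays at 0.05 while power(0.5) grows.
for n in (10, 25, 50, 100):
    print(f"n = {n:>3}: power(0) = {power(0, n):.3f}, power(0.5) = {power(0.5, n):.3f}")
```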

32 Power Analysis
Typically a power analysis is performed when designing an experiment, to inform the choice of sample size.
The test that rejects $H_0: \theta \leq 0$ iff $\sqrt{n}\,\bar{x} > 1.645$ satisfies $\text{power}(0) = 0.05$ for all $n$. To control
$$\text{power}(0.5) = P\left(Z > 1.645 - \sqrt{n}/2\right),$$
we must adjust the sample size.
To achieve power(0.5) = 0.8:
1.645 - $\sqrt{n}/2$ = qnorm(.2)
qnorm(.95) - qnorm(.2) = $\sqrt{n}/2$
$n$ = 4 [qnorm(.95) - qnorm(.2)]$^2 \doteq 24.7$
To achieve power(0.5) = 0.9:
$n$ = 4 [qnorm(.95) - qnorm(.1)]$^2 \doteq 34.3$
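
The same calculation can be reproduced outside R. In this minimal sketch, `NormalDist().inv_cdf` plays the role of `qnorm`; the function name `required_n` and its default arguments are illustrative.

```python
from math import ceil
from statistics import NormalDist

qnorm = NormalDist().inv_cdf  # standard normal quantile function, like R's qnorm


def required_n(target_power, theta=0.5, alpha=0.05):
    """Solve qnorm(1 - alpha) - theta * sqrt(n) = qnorm(1 - target_power) for n."""
    return ((qnorm(1 - alpha) - qnorm(1 - target_power)) / theta) ** 2


for target in (0.8, 0.9):
    n = required_n(target)
    print(f"power {target}: n = {n:.1f}, so use at least {ceil(n)} observations")
```

Because $n$ must be an integer, the computed value is rounded up.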

33 Observed Power
Suppose that we observe $\bar{x} = 0.5$. Is this sufficient evidence to reject $H_0: \theta \leq 0$ in favor of $H_1: \theta > 0$? The answer depends on the sample size. We need $\sqrt{n}/2 > 1.645$, or $n > (2 \times 1.645)^2 \doteq 10.8$, to reject $H_0$.
Remark: This answer confirms our intuition: the larger the sample, the more compelling the evidence that $\bar{x} = 0.5$ was not drawn under $\theta \leq 0$, i.e., that $H_0$ is false.
Suppose that $n \leq 10$, in which case we retain $H_0$. Can we ascertain if the data provide affirmative evidence that $H_0$ is true, as opposed to merely providing insufficient evidence that $H_0$ is false? (Analogously, the defendant is found not guilty. Is s/he innocent, or was the charge merely not proven?)
Remark: Hypothesis testing was not designed to answer this question. (Nor do criminal trials in Anglo-Saxon law distinguish between innocent and not proven. Scottish law does draw this distinction.)

34 The answer should be obvious. Observing $\bar{x} > 0$ tends to suggest that $H_0$ is false, i.e., observing $\bar{x} = 0.5$ provides more evidence that $H_0$ is false than evidence that $H_0$ is true.
Nevertheless, some authors have suggested that we can quantify the evidence that favors $H_0$ by computing the observed power, viz.,
$$\text{power}(\bar{x}) = \text{power}(0.5) = P\left(Z > 1.645 - \sqrt{n}/2\right).$$
Recall, however, that this quantity increases as $n$ increases. Hence, the more data that we collect (and therefore the more that $\bar{x} = 0.5$ favors $H_1$), the larger $\text{power}(\bar{x})$ becomes. Large values of $\text{power}(\bar{x})$ should NOT be construed as evidence favoring $H_0$.
J.M. Hoenig & D.M. Heisey, The abuse of power: the pervasive fallacy of power calculations for data analysis, The American Statistician, 55(1):19-24, Feb. 2001.
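
The relationship between observed power and the test decision can be tabulated for $\bar{x} = 0.5$. This is a minimal sketch with arbitrary illustrative sample sizes; the function name `observed_power` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist()            # standard normal distribution
CRITICAL = Z.inv_cdf(0.95)  # about 1.645
XBAR = 0.5                  # observed sample mean from the slide


def observed_power(n, xbar=XBAR):
    """Power function evaluated at the observed mean: P(Z > 1.645 - xbar * sqrt(n))."""
    return 1 - Z.cdf(CRITICAL - xbar * sqrt(n))


# Illustrative sample sizes on either side of the rejection threshold n > 10.8.
for n in (4, 9, 10, 11, 25, 100):
    statistic = sqrt(n) * XBAR
    decision = "reject H0" if statistic > CRITICAL else "retain H0"
    print(f"n = {n:>3}: sqrt(n) * xbar = {statistic:.3f}, "
          f"observed power = {observed_power(n):.3f}, {decision}")
```

In this setup the observed power exceeds 0.5 exactly when the test rejects $H_0$, so it restates the test result rather than measuring support for $H_0$.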
