DIFFERENT APPROACHES TO STATISTICAL INFERENCE: HYPOTHESIS TESTING VERSUS BAYESIAN ANALYSIS


THUY ANH NGO

1. Introduction

Statistics are easy to come across in daily life. Statements such as "the average college student's debt after college is $2,700" or "the average person spends 24 years of their life asleep" are results of statistical analysis. Statistical inference is one method of statistical analysis. It is the process of applying statistical methods in order to make probabilistic statements about some characteristics of the observed data. In cognitive science, statistical inference allows scientists to draw conclusions about a psychological phenomenon by statistically analyzing a set of observations. For example, suppose a social psychologist is interested in investigating whether the majority of the population shows change blindness (a phenomenon in which people often miss large changes in their visual field). To study this phenomenon, the scientist could collect data from the entire population of Americans. Such a task, however, is not economically feasible. Conveniently, statistical inference allows her to examine the change blindness phenomenon based on data from a randomly selected pool of participants.

There are different approaches to statistical inference, including hypothesis testing and Bayesian inference. Even though hypothesis testing is traditionally used by most cognitive scientists, Bayesian inference has garnered increased interest in recent years. This paper applies probability theory to explain these approaches and discusses the strengths and weaknesses of each in order to explain the increasing use of Bayesian inference, especially in the field of cognitive science.

2. Two approaches to statistical inference

2.1. Important Statistical Concepts. We first start by defining key concepts.

Definition 2.1 (Population vs. Sample). A population is a set of all observations that can possibly be made. A sample is a set of observations that are randomly selected from a population.

Examples of a population include Trinity University students, people with anxiety disorder, trees in the state of Texas, etc.

Examples of a sample include 70 students at Trinity chosen randomly, 20 clients with anxiety disorder at a hospital, 27 trees randomly selected in Texas, etc.

Definition 2.2 (Parameter). A parameter is a measurable characteristic of a population. For example, when we examine whether a machine is functioning properly, we may define the parameter to be the proportion of defective items in a large manufactured lot.

Definition 2.3 (Random Variable). Let S be the sample space. A random variable is a map X : S → R that assigns a real number X(s) to each possible outcome s ∈ S (DeGroot & Schervish, 2002).

Definition 2.4 (Probability Function). Let X be a discrete random variable (X can take only a finite number k of different values x_1, ..., x_k). The probability function (p.f.) of X is the function f such that for every real number x, f(x) = Pr(X = x) (DeGroot & Schervish, 2002).

Definition 2.5 (Probability Density Function). Let X be a continuous random variable (X can take every value in an interval). The probability density function (p.d.f.) of X is the nonnegative function f, defined on R, such that for every subset A ⊆ R, Pr(X ∈ A) = ∫_A f(x) dx (DeGroot & Schervish, 2002).

Definition 2.6 (Statistical Inference). Statistical inference is the process of applying statistical methods in order to make probabilistic statements about some characteristics of the observed data.
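To make Definitions 2.4 and 2.5 concrete, the following short sketch (assuming Python with scipy is available; the particular distributions are chosen for illustration and are not taken from the paper) evaluates a probability function at a point and a probability obtained by integrating a density.

    from scipy.stats import binom, norm

    # Discrete case: X ~ binomial(n = 40, p = 2/3).
    # The probability function gives f(x) = Pr(X = x).
    print(binom.pmf(30, n=40, p=2/3))

    # Continuous case: X ~ normal(0, 1).
    # Pr(X in A) is the integral of the p.d.f. over A; for A = [-1, 1] this
    # integral equals the difference of the c.d.f. at the two endpoints.
    print(norm.cdf(1) - norm.cdf(-1))   # about 0.68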

2.2. Hypothesis Testing.

2.2.1. The Null and Alternative Hypotheses. In the hypothesis testing approach, a parameter θ is an unknown value that lies in a parameter space Ω. A statistician's task is to decide whether the parameter value falls in one subset of Ω or in its complement. We can then define our hypotheses as follows:

Definition 2.7. Let Ω_0 and Ω_1 be disjoint subsets of the parameter space Ω with Ω_0 ∪ Ω_1 = Ω. The null and the alternative hypotheses are:

    H_0: θ ∈ Ω_0,
    H_A: θ ∈ Ω_1.

The hypothesis H_0 is called the null hypothesis and the hypothesis H_A is called the alternative hypothesis. A statistician has to decide to accept either H_0 or H_A. With only two possible decisions, accepting H_0 is equivalent to rejecting H_A, and vice versa. In most statistical problems, the two hypotheses are treated quite differently. For example, in clinical trials, the null hypothesis almost always states that a particular treatment does not have an effect, while the alternative hypothesis states that the treatment has an effect. Many cognitive scientists frame their research design and analysis in terms of rejecting the null hypothesis, a routine that has recently been debated. We will return to this debate later.

2.2.2. The Critical Region and Test Statistics. Suppose that X = (X_1, ..., X_n) forms a random sample from a population whose distribution involves the unknown parameter θ. Let S denote the set of all possible outcomes of the random sample X. A statistician then follows a test procedure that partitions S into two subsets, one of which contains the values of X for which H_0 will be accepted, while the other contains the values of X for which H_0 will be rejected.

Definition 2.8. The subset for which H_0 will be rejected is called the critical region, C. This region is defined in terms of a test statistic T = f(X), a function of the observed data X.

Example 1.1. Suppose that X = (X_1, ..., X_n) is a random sample from a normal distribution with unknown mean µ and known variance σ². The hypotheses to be tested are:

    H_0: µ = µ_0,
    H_A: µ ≠ µ_0.

H_0 is rejected if the sample mean X̄_n is far from µ_0. So if we define the test statistic to be T = |X̄_n − µ_0|, we then choose a test procedure that rejects H_0 if T ≥ c for some c > 0.

2.2.3. Level of Significance.

Definition 2.9. The level of significance α is determined by Pr(T ∈ C) = α, computed assuming H_0 is true, where C is the critical region. If T ∈ C, we reject H_0 in favor of H_A at level α. Some standard values of α are 0.05 and 0.01.
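As a concrete illustration of Example 1.1 and Definition 2.9, here is a minimal sketch (assuming Python with numpy and scipy; the data and the choices µ_0 = 0, σ = 1, α = 0.05 are made up for illustration) that computes the critical value c so that Pr(T ≥ c) = α under H_0 and then applies the test.

    import numpy as np
    from scipy.stats import norm

    # Hypothetical data; mu_0 and sigma are assumed known, as in Example 1.1.
    x = np.array([0.3, -0.1, 0.8, 0.4, 1.1, -0.2, 0.6, 0.9])
    mu_0, sigma, alpha = 0.0, 1.0, 0.05
    n = len(x)

    # Test statistic T = |X̄_n − µ_0|.
    T = abs(x.mean() - mu_0)

    # Under H_0, X̄_n ~ N(µ_0, σ²/n), so Pr(T ≥ c) = α when
    # c = z_{α/2} · σ / √n, with z_{α/2} the upper α/2 normal quantile.
    c = norm.ppf(1 - alpha / 2) * sigma / np.sqrt(n)

    print("T =", T, " critical value c =", c, " reject H_0:", T >= c)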

Example 1.2. A psychologist wants to explore the change blindness phenomenon introduced above. She sets up an experiment in which an experimenter randomly approaches a pedestrian on a college campus to ask for directions (Simons & Levin, 1998). Shortly afterward, two men carrying a door pass between them and interrupt the conversation. Meanwhile, the first experimenter is replaced by a second experimenter who has a different height, a different voice, and wears different clothes from the first one. After the exchange, the second experimenter continues to talk to the pedestrian. Eventually, the pedestrian is asked whether he or she noticed any changes during the interaction. Out of 40 pedestrians who were randomly selected to participate in the study, only 8 reported noticing the change. The psychologist would like to conclude whether at least two thirds of the population exhibits change blindness.

Suppose that X = (X_1, ..., X_40) is a random sample drawn from the population distribution with parameter p, where p is the probability of exhibiting change blindness. The X_i are defined as follows:

    X_i = 1 if the i-th participant shows change blindness,
    X_i = 0 if the i-th participant does not show change blindness,

for i ∈ {1, 2, ..., 40}. X_i = 1 with probability p and X_i = 0 with probability q = 1 − p. Therefore X_1, ..., X_40 ~ Bernoulli(p), and the p.f. of each X_i can be written as:

    f(x_i | p) = p^{x_i} q^{1 − x_i} for x_i = 0 or 1, and 0 otherwise.

The hypotheses are stated as follows:

    H_0: p ≤ 2/3 (the majority does not exhibit change blindness),
    H_A: p > 2/3 (the majority exhibits change blindness).

The decision to reject or accept the null hypothesis first requires us to find the test statistic T and to specify the critical region C. Let the test statistic be Y = Σ_{i=1}^{40} X_i. Then Y ~ binomial(n = 40, p) and its p.f. is:

    f(y | n, p) = C(n, y) p^y q^{n − y} for y = 0, 1, ..., n, and 0 otherwise.

The larger p is, the larger we expect Y to be. Therefore, we choose to reject H_0 if Y ≥ c for some constant c. In the sample, 40 − 8 = 32 participants showed change blindness, so the observed value of Y is 32. At the level of significance α_0 = 0.05, we reject H_0 if Pr(Y ≥ 32 | p = 2/3) < α_0. We now compute Pr(Y ≥ 32 | p = 2/3):

    Pr(Y ≥ 32 | p = 2/3) = 1 − Pr(Y ≤ 31 | p = 2/3)
                         = 1 − Σ_{k=0}^{31} C(40, k) (2/3)^k (1/3)^{40−k}
                         ≈ 0.048.

Since Pr(Y ≥ 32 | p = 2/3) < 0.05, we reject H_0 and conclude that the majority of people does exhibit change blindness.

Remark. Rejecting the null hypothesis does not mean we have proved that it is false; it only means that the data support a rejection. Also, in our example, we reject the null hypothesis at the 5% level of significance (α_0 = 0.05), but we cannot reject it at the 1% level of significance. In brief, our decision depends on our choice of the significance level, which can be quite arbitrary in nature. This is a typical criticism of hypothesis testing, and we will return to it later.
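The tail probability in Example 1.2 can be computed numerically. The sketch below (assuming Python with scipy; the counts n = 40 and y = 32 come from the example above) evaluates Pr(Y ≥ 32 | p = 2/3) directly from the binomial distribution.

    from scipy.stats import binom

    n, y_obs, p0 = 40, 32, 2/3

    # Pr(Y >= 32 | p = 2/3) = 1 - Pr(Y <= 31 | p = 2/3).
    p_value = binom.sf(y_obs - 1, n, p0)   # sf(k) = Pr(Y > k)
    print(p_value)

    # Compare against the chosen significance levels.
    print("reject at 0.05:", p_value < 0.05)
    print("reject at 0.01:", p_value < 0.01)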

2.3. Bayesian Inference. Bayesian inference relies on the same logic that is used by the famous fictional detective Sherlock Holmes: when you have eliminated the impossible, whatever remains, however improbable, must be the truth. Likewise, Bayesian inference is the method in which we begin with a set of all possibilities and then, based on the collected data, eliminate some possibilities and reallocate credibility among the remaining ones. Consequently, a parameter is a random variable that has a probability distribution over the set of all candidate values. We say that the parameter has a prior distribution, ξ(θ). The prior distribution can be determined by utilizing previous information or knowledge about the parameter. After observing the data, the reallocation of credibility is reflected in the posterior distribution, which is conditioned on the observed data. The posterior distribution is denoted ξ(θ | x), where x = (x_1, ..., x_n) is the vector of observed data. The relationship between the prior and the posterior distribution of a parameter is given by Bayes' theorem.

Theorem 2.1 (Bayes' Theorem). Let X_1, ..., X_n be i.i.d. random variables with p.d.f. f(x_i | θ) and joint p.d.f. f_n(x | θ). Let g_n(x) = ∫_Ω f_n(x | θ) ξ(θ) dθ. Then

    ξ(θ | x) = f_n(x | θ) ξ(θ) / g_n(x).

Example 1.3 (uniform prior). Consider the experiment described in Example 1.2. Let p be the probability of exhibiting change blindness, and let X_i = 1 if the i-th participant shows change blindness and X_i = 0 if not. Given any p, the distribution of each X_i is:

    f(x_i | p) = p^{x_i} (1 − p)^{1 − x_i} for x_i = 0 or 1, and 0 otherwise.

Suppose that we have no prior knowledge about which values of p are more probable than others. We then assume a uniform prior distribution for p, so ξ(p) = 1; its plot is given in Figure 2.1.a. We are interested in the distribution of p after observing the participants in the study. There were 40 pedestrians, and 32 of them showed change blindness. The joint distribution of X_1, ..., X_40 is given by:

    f_n(x | p) = p^{Σ x_i} (1 − p)^{40 − Σ x_i}.

The marginal joint p.d.f. g_n(x) depends on the observed values x = (x_1, ..., x_40) but not on p, and thus may be treated as a constant. Therefore, the equation in Theorem 2.1 can be replaced with the relation

    ξ(p | x) ∝ f_n(x | p) ξ(p) ∝ p^{Σ x_i} (1 − p)^{40 − Σ x_i},

so that ξ(p | x) is a beta distribution with

    α = Σ_{i=1}^{40} x_i + 1 = 33,
    β = 40 − Σ_{i=1}^{40} x_i + 1 = 9.

Figure 2.1.b shows the plot of beta(α = 33, β = 9). So

    Pr(p < 2/3 | α = 33, β = 9) ≈ 0.039,

and thus

    Pr(p ≥ 2/3 | α = 33, β = 9) = 1 − Pr(p < 2/3 | α = 33, β = 9) ≈ 0.961.

Remark. The Bayesian approach tells us the probability that the parameter takes values in a specified subset of the parameter space, rather than merely supporting one decision over the other.
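The posterior quantities in Example 1.3 can be reproduced numerically. The following sketch (assuming Python with scipy; the counts 32 out of 40 come from the example) forms the beta posterior under the uniform prior and evaluates the posterior probability that p exceeds 2/3.

    from scipy.stats import beta

    n, successes = 40, 32

    # A uniform prior on p is beta(1, 1); updating it with 32 successes and
    # 8 failures gives the posterior beta(alpha_post, beta_post) below.
    alpha_post = successes + 1        # 33
    beta_post = n - successes + 1     # 9

    print("Pr(p <  2/3 | data):", beta.cdf(2/3, alpha_post, beta_post))
    print("Pr(p >= 2/3 | data):", beta.sf(2/3, alpha_post, beta_post))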

As discussed earlier, Bayesian inference can take into account prior knowledge about the distribution of the parameter.

Example 1.4 (non-uniform prior). Suppose that there are n previous experiments of a similar nature to the experiment in Example 1.2. These experiments had different sample sizes (N_1, ..., N_n) and obtained different proportions of participants who exhibited change blindness (p_1, ..., p_n, respectively). We want to incorporate this information into the prior distribution. Suppose that the prior distribution of p is a beta distribution with parameters α and β. We may estimate α and β using the mean µ̂ and the variance s² of the observed values p_i, i ∈ {1, 2, ..., n}. Since a larger sample size results in a better estimate of the proportion, we choose to estimate µ using the average of the p_i weighted by the corresponding sample sizes:

    µ̂ = (N_1 p_1 + ... + N_n p_n) / (N_1 + ... + N_n).

Likewise, we may estimate σ² using the sample variance:

    s² = Σ_{k=1}^{n} (p_k − µ̂)² / (n − 1).

Since p ~ beta(α, β), we also have:

    µ = α / (α + β)  and  σ² = αβ / ((α + β)² (α + β + 1)).

Suppose that, with n = 7 previous experiments, we solve the system of equations

    α / (α + β) = (N_1 p_1 + ... + N_7 p_7) / (N_1 + ... + N_7),
    αβ / ((α + β)² (α + β + 1)) = Σ_{k=1}^{7} (p_k − µ̂)² / 6,

and get α = 10 and β = 10. Then the prior distribution of p is beta(α = 10, β = 10); Figure 2.1.c gives its plot.
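The system above is a method-of-moments fit of a beta distribution, and it has a closed-form solution: writing m for the weighted mean and v for the variance, α = m(m(1 − m)/v − 1) and β = (1 − m)(m(1 − m)/v − 1). The sketch below (assuming Python with numpy; the seven sample sizes and proportions are invented for illustration, not taken from the paper) carries out this fit.

    import numpy as np

    # Hypothetical results of seven earlier experiments (sample sizes and
    # observed proportions); these numbers are made up for illustration.
    N = np.array([30, 45, 25, 60, 40, 35, 50])
    p_hat = np.array([0.65, 0.40, 0.55, 0.35, 0.60, 0.50, 0.45])

    # Weighted mean and sample variance, as in Example 1.4.
    m = np.average(p_hat, weights=N)
    v = np.sum((p_hat - m) ** 2) / (len(p_hat) - 1)

    # Closed-form method-of-moments solution for alpha and beta.
    common = m * (1 - m) / v - 1
    alpha_prior = m * common
    beta_prior = (1 - m) * common

    # With these made-up inputs the fitted prior comes out in the vicinity
    # of the beta(10, 10) prior used in the example.
    print("prior: beta(alpha = %.1f, beta = %.1f)" % (alpha_prior, beta_prior))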

Applying Bayes' Theorem, the posterior distribution of p becomes

    ξ(p | x) ∝ p^{Σ x_i} (1 − p)^{n − Σ x_i} · p^{9} (1 − p)^{9} = p^{Σ x_i + 9} (1 − p)^{n − Σ x_i + 9}.

Therefore, the posterior distribution is a beta distribution with α = 42 and β = 18 (see Figure 2.1.d), and the resulting Pr(p ≥ 2/3 | α = 42, β = 18) is noticeably smaller than the value of approximately 0.961 obtained under the uniform prior.

[Figure 2.1: (a) the uniform prior ξ(p) = 1; (b) the posterior beta(α = 33, β = 9); (c) the prior beta(α = 10, β = 10); (d) the posterior beta(α = 42, β = 18).]

Remark. Comparing Pr(p ≥ 2/3 | α = 33, β = 9) and Pr(p ≥ 2/3 | α = 42, β = 18), we notice that incorporating previous knowledge into the prior distribution can have an impact on the posterior distribution. In this case, after taking into account previous information, we can say with less certainty that more than two thirds of the population exhibits change blindness. Moreover, with a uniform prior the conclusion is very similar to that of the hypothesis test, but with another prior it is very different.
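The effect of the informative prior in Example 1.4 can be checked numerically. The sketch below (assuming Python with scipy; the prior beta(10, 10) and the data of 32 successes out of 40 come from the examples above) updates both priors and compares the two posterior probabilities.

    from scipy.stats import beta

    n, successes = 40, 32

    # Posterior under the uniform prior beta(1, 1): beta(33, 9).
    post_uniform = beta(successes + 1, n - successes + 1)

    # Posterior under the informative prior beta(10, 10): beta(42, 18).
    post_informative = beta(successes + 10, n - successes + 10)

    print("uniform prior:     Pr(p >= 2/3 | data) =", post_uniform.sf(2/3))
    print("informative prior: Pr(p >= 2/3 | data) =", post_informative.sf(2/3))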

2.4. Strengths and Weaknesses of Both Approaches. In the field of cognitive science, hypothesis testing has become the institutionalized approach to statistical analysis. Psychologists have been routinely trained to frame their research design and analysis in terms of rejecting null hypotheses. There are, however, several logical shortcomings of this approach.

The first, and most important, problem with hypothesis testing is that nearly all null hypotheses are obviously false on a priori grounds. One example is testing whether the proportion of patients showing no relapse after using a new drug is equal to the proportion of patients showing no relapse after using an old drug. Deming (1975) comments that "we do not perform an experiment to find out if two drugs are equal. We know in advance, without spending a dollar on an experiment, that they are not equal." The rejection therefore hardly gives any meaningful insight.

Another argument is that the level of significance α is arbitrary and without theoretical basis. It merely classifies results into two categories, significant and nonsignificant, that carry little meaning in themselves. More often than not, nonsignificant results are disregarded, even though they can potentially provide more insight than significant ones.

Last but not least, the critical region, according to Kruschke (2010), is not well defined because it depends on the intentions of the experimenter. Different intentions result in different critical values. Figure 2.2 shows the sampling distribution of the test statistic for two groups when the null hypothesis is true, (a) when the intention is to fix N = 6 for both groups regardless of how long data collection takes, and (b) when the intention is to fix the duration of data collection at two weeks, with a mean rate of N = 6 per week. Suppose that we have data from two groups, with six values per group, and we compute the test statistic to find t = 2.5. We can only make the decision about rejection if we know the experimenter's intention. Such information, however, is not always accessible.

While the Bayesian approach avoids the above-mentioned problems of hypothesis testing, it is also subject to some level of subjectivity. First of all, the choice of prior distribution relies on subjective belief. For instance, in Example 1.4, the decision to estimate µ by weighting the average of the p_i by the sample sizes relies on subjective judgment. Finally, as problems become complex, the Bayesian approach may require much heavier computation than the hypothesis testing approach.

Despite these drawbacks, the Bayesian approach is superior in its ability to incorporate previous knowledge into the analysis, which hypothesis testing has no way of doing. Moreover, the logic is different in the two approaches: hypothesis testing focuses on the probability of obtaining data as extreme as the observed data, Pr(data | hypothesis), whereas Bayesian inference is interested in how likely a hypothesis is in light of the data, Pr(hypothesis | data). The logic of Bayesian inference, therefore, could be argued to be more natural.

[Figure 2.2: (a) the sampling distribution of the test statistic when N = 6 is fixed for both groups; (b) the sampling distribution when the duration of data collection is fixed and N is random.]
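To make the dependence of the critical value on the experimenter's intention concrete, here is a rough simulation sketch in the spirit of Figure 2.2 (assuming Python with numpy; the particular sampling scheme, Poisson-distributed group sizes truncated below at 2, is an assumption for illustration, not Kruschke's precise setup). It approximates the null sampling distribution of the two-sample t statistic under the two intentions and compares the resulting critical values.

    import numpy as np

    rng = np.random.default_rng(0)
    n_sims = 20000

    def t_stat(x, y):
        # Two-sample t statistic with pooled variance.
        nx, ny = len(x), len(y)
        sp2 = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
        return (x.mean() - y.mean()) / np.sqrt(sp2 * (1 / nx + 1 / ny))

    # Intention 1: fix N = 6 per group.
    t_fixed = [t_stat(rng.normal(size=6), rng.normal(size=6)) for _ in range(n_sims)]

    # Intention 2: fix the duration of data collection, so N per group is random;
    # here N is drawn as Poisson with mean 6, truncated below at 2 (an assumption).
    def random_n():
        return max(int(rng.poisson(6)), 2)

    t_random = [t_stat(rng.normal(size=random_n()), rng.normal(size=random_n()))
                for _ in range(n_sims)]

    # The 95% critical value of |t| under the null differs between the two
    # intentions, so a borderline result such as t = 2.5 may lead to different
    # decisions depending on the experimenter's intention.
    print("fixed-N critical value: ", np.quantile(np.abs(t_fixed), 0.95))
    print("random-N critical value:", np.quantile(np.abs(t_random), 0.95))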

References

DeGroot, M. H., & Schervish, M. J. (2002). Probability and Statistics. Addison-Wesley.

Kruschke, J. K. (2010). Bayesian data analysis. Wiley Interdisciplinary Reviews: Cognitive Science, 1(5).

Simons, D. J., & Levin, D. T. (1998). Failure to detect changes to people during a real-world interaction. Psychonomic Bulletin & Review, 5(4).
