Parameter Estimation, Sampling Distributions & Hypothesis Testing
Parameter Estimation & Hypothesis Testing

- In doing research, we are usually interested in some feature of a population distribution (which can be described using population parameters)
- Since populations are difficult (or impossible) to collect data on, we estimate population parameters using point estimates based on sample statistics
- Sample statistics vary from sample to sample, making point estimates variable and unreliable
- The distribution of a statistic (estimate) computed across many different samples is called the sampling distribution of that statistic (estimate)
- We can use the sampling distribution to estimate the likelihood associated with a hypothesized population parameter, or the margin of error (or confidence interval) associated with a point estimate
[Figure: a population, characterized by population parameters, and a sample drawn from it, characterized by sample statistics]
Law of Large Numbers

Let $\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i$. Then, for any $\varepsilon > 0$,

$\lim_{n\to\infty} P\left(\left|\bar{X}_n - \mu\right| < \varepsilon\right) = 1$

[Figure: mean sample age vs. sample size n, converging to the population mean]
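The convergence in the figure is easy to reproduce by simulation. This is a minimal sketch; the uniform-age population (ages 18 to 65, so µ = 41.5) is an assumption chosen for illustration, not something specified on the slides:

```python
import random

# Illustrative population (an assumption): ages drawn uniformly from
# 18..65, so the population mean is mu = (18 + 65) / 2 = 41.5.
random.seed(1)
mu = (18 + 65) / 2

# The sample mean drifts toward mu as the sample size n grows.
for n in (10, 1_000, 100_000):
    xbar = sum(random.randint(18, 65) for _ in range(n)) / n
    print(f"n = {n:>7}  mean = {xbar:6.2f}  |error| = {abs(xbar - mu):.2f}")
```

The absolute error shrinks as n grows, which is exactly the statement of the limit above.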
Sampling Distributions

How reliable are sample statistics (as estimators) for a finite sample size?
Central Limit Theorem

Thanks to the central limit theorem, we can compute the sampling distribution of the mean without having to actually draw samples and compute sample means.

Central limit theorem: Given a population with mean µ and standard deviation σ, the sampling distribution of the mean (i.e., the distribution of sample means) will itself have a mean of µ and a standard deviation (standard error) of $\sigma / \sqrt{n}$. Furthermore, whatever the distribution of the parent population, this sampling distribution will approach the normal distribution as the sample size (n) increases.
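The theorem can be checked directly by simulation. This sketch uses an exponential parent population, which is strongly skewed and has µ = σ = 1 (an illustrative assumption): the distribution of sample means should still come out centered at µ with spread σ/√n.

```python
import math
import random
import statistics

random.seed(42)
mu, sigma, n, reps = 1.0, 1.0, 30, 20_000  # Exp(1) parent: mu = sigma = 1

# Draw many samples of size n and record each sample mean.
means = [statistics.fmean(random.expovariate(1.0) for _ in range(n))
         for _ in range(reps)]

print(statistics.fmean(means))   # close to mu = 1.0
print(statistics.stdev(means))   # close to sigma / sqrt(n) ~= 0.183
```

A histogram of `means` would also look approximately normal despite the skewed parent, which is the second half of the theorem.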
Standard Error

Just as the standard deviation (σ) of a population of scores provides a measure of the average distance between an individual score (x) and the population mean (µ), the standard error ($\sigma_{\bar{X}}$) provides a measure of the average distance between the sample mean ($\bar{X}$) and the population mean (µ).

$\sigma_{\bar{X}} = \dfrac{\sigma}{\sqrt{n}}$
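Plugging numbers into the formula shows how slowly the standard error shrinks: quadrupling n only halves it. The value σ = 15 here is an arbitrary illustrative choice (e.g., an IQ-like scale), not from the slides:

```python
import math

sigma = 15.0  # illustrative population standard deviation (an assumption)
for n in (4, 25, 100, 400):
    print(f"n = {n:>3}  standard error = {sigma / math.sqrt(n):5.2f}")
```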
Hypothesis Testing

- Procedure for traditional (NHST) hypothesis testing
- Roots:
  - Significance testing: (Karl) Pearson & Fisher
  - Decision-theoretic hypothesis testing: Neyman & (Egon) Pearson
- Logic of the individual and combined approaches
Traditional (NHST) Hypothesis Testing

1. Begin with a research hypothesis H1 (defined in terms of population parameters)
2. Set up the null hypothesis H0
3. Construct the sampling distribution of a particular statistic under the assumption that the null hypothesis is true
4. Collect some data and use it to compute a sample statistic
5. Compare the sample statistic to the distribution constructed in step (3)
6. Reject or retain H0 depending on the probability, under H0, of obtaining a sample statistic as extreme as the one we observed
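Steps 3 through 6 can be made concrete with a one-sample z test. All numbers below are invented for illustration: under H0: µ = 100 with known σ = 15 and n = 25, the sampling distribution of the mean is normal with mean 100 and standard error 3, so an observed mean of 106 sits two standard errors out.

```python
import math
from statistics import NormalDist

mu0, sigma, n = 100.0, 15.0, 25   # H0 and (assumed known) population sigma
xbar = 106.0                      # sample statistic from step 4 (illustrative)

se = sigma / math.sqrt(n)               # standard error under H0 (step 3)
z = (xbar - mu0) / se                   # where xbar falls in that distribution
p = 2 * (1 - NormalDist().cdf(abs(z)))  # two-tailed p-value (step 5)

print(f"z = {z:.2f}, p = {p:.4f}")      # z = 2.00, p ~= 0.0455
# Step 6: p < 0.05, so H0 would be rejected at the 0.05 level.
```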
Roots: Inferential Significance Testing

- Significance testing, as conceived by Fisher (and Karl Pearson), was a heuristic for building an inductive case for or against a particular model
- Pearson (1900) conceived of p (essentially equivalent to a modern two-tailed p-value) as an index of the validity of a hypothesis. He later (1914) popularized this index by publishing tables of its value for a number of standard distributions
- Fisher (1925) suggested using p = 0.05 (or some smaller value) as a heuristic to determine whether to further consider the results of an experiment
- The ideas of the p-value, of the null hypothesis, and of significance come from this approach
Roots: Decision-Theoretic Hypothesis Testing

- Hypothesis testing was conceived by Jerzy Neyman and Egon Pearson (Karl's son) as an efficient and objective alternative to significance testing
- Neyman & Pearson (1933) wrote an abstract paper investigating an optimal long-run strategy for testing pairs of hypotheses. They suggested comparing the log likelihood ratio of the hypotheses to a criterion computed from a fixed tail probability of incorrectly classifying one of the two hypotheses
- The concepts of Type I and Type II errors, α, β, power, critical regions, and fixed-criterion hypothesis testing all come from this approach
Differences Between the Approaches

Fisher:
- Set up a statistical null hypothesis (must be exact)
- Report the exact level of significance (p)
- If the result is not significant, draw no conclusions
- Only use this procedure to draw provisional conclusions

Neyman-Pearson:
- Set up two statistical hypotheses (H0 & H1), both of which must be exact
- Decide on α, β, and sample size before the experiment; these will define a rejection region
- If the data fall into the rejection region of H0, accept H1; otherwise accept H0
- Always make a decision based on the available information
Hypothesis Testing & The Null Hypothesis

Why do we test the null hypothesis H0?

Philosophical arguments:
- Finite observations cannot prove categorical propositions, only disprove them
- Puts the burden on the researcher: anyone can create an apparent difference between conditions by using very small sample sizes
- Assume no effect (or a standard effect) until given sufficient evidence

Practical argument:
- The null hypothesis is specific and well-defined, making it easy to predict a sampling distribution
Rejection Regions

[Figure: two sampling distributions, p(X̄) vs. X̄, with shaded rejection regions: α = 0.05 for a one-tailed test (test that µ1 > µ0), and α = 0.05 for a two-tailed test (test that µ1 ≠ µ0)]

Why 0.05?
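The cutoffs that bound the shaded rejection regions come straight from the normal inverse CDF. Note how a two-tailed test at the same α splits the probability between both tails, pushing each cutoff further out:

```python
from statistics import NormalDist

alpha = 0.05
z = NormalDist()  # standard normal

one_tailed = z.inv_cdf(1 - alpha)      # reject if z > ~1.645
two_tailed = z.inv_cdf(1 - alpha / 2)  # reject if |z| > ~1.960
print(one_tailed, two_tailed)
```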
Errors in Hypothesis Testing

Because the hypothesis test relies on sample data, and sample data are variable, there is always a risk that the hypothesis test will lead to the wrong conclusion. Two types of errors are possible:
- Type I errors (false positives)
- Type II errors (false negatives)
[Figure: population distributions of raw scores (x) under H0 and H1 with σ0 = σ1 = σ, and the corresponding sampling distributions of the mean (M) for n = 4, with σ_M = σ/√n = σ/2; the regions α and β are marked]
Errors in Hypothesis Testing

             H0 is true                  H0 is false
Reject H0    Type I error (P = α)        Correct decision (P = 1-β)
Retain H0    Correct decision (P = 1-α)  Type II error (P = β)
Power

- The statistical power of a test is simply the probability of correctly rejecting the null hypothesis when it is false
- For our purposes, you can think of this as the probability that the test will classify an actual difference in population means as significant
[Figure: as before, population distributions of raw scores (σ0 = σ1 = σ) and sampling distributions of the mean (X̄) for n = 4 with σ_X̄ = σ/√n = σ/2; the regions α, β, and power = 1 - β are marked]
Factors that Affect the Power of a Test

1. The probability of a Type I error (α), or the level of significance, and the criterion for rejecting H0, which are directly related to each other
2. The true difference between the underlying population means under the alternative hypothesis (µ1 - µ0)
3. The standard error(s) of the mean(s), which is a function of the sample size n and the population variance σ²
4. The particular research design and test used, and whether the test is one- or two-tailed
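The factors above can be explored with a small power function. This is a sketch assuming a one-tailed z test with known σ; the slides' exact figures may come from slightly different settings, though the baseline below does reproduce power ≈ 0.38 for σ = 1.5, n = 4, and a true difference of 1.0.

```python
import math
from statistics import NormalDist

def power_one_tailed(alpha, effect, sigma, n):
    """Power of a one-tailed z test with known sigma: the probability that
    the sample mean lands in the rejection region when the true difference
    between population means is `effect`."""
    se = sigma / math.sqrt(n)                  # standard error of the mean
    z_crit = NormalDist().inv_cdf(1 - alpha)   # rejection criterion under H0
    # Under H1 the z statistic is centered at effect / se, not 0.
    return 1 - NormalDist().cdf(z_crit - effect / se)

# Each factor varied one at a time, relative to a common baseline:
print(power_one_tailed(0.05, 1.0, 1.5, 4))   # baseline: power ~= 0.38
print(power_one_tailed(0.20, 1.0, 1.5, 4))   # larger alpha -> more power
print(power_one_tailed(0.05, 2.0, 1.5, 4))   # larger true difference -> ~0.85
print(power_one_tailed(0.05, 1.0, 1.5, 16))  # larger n -> ~0.85
```

Note that halving σ_X̄, whether by doubling the effect-to-noise ratio or by quadrupling n, produces the same power, which is exactly the pattern in the figures that follow.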
Power as a Function of α

[Figure series: with the true population difference and σ_X̄ held fixed, increasing α shrinks β and raises power:
- α = 0.05: β = 0.73, power = 0.27
- α = 0.10: β = 0.62, power = 0.38
- α = 0.20: β = 0.48, power = 0.52]
Power as a Function of (µ1 - µ0)

[Figure series: with α and σ_X̄ held fixed, a larger true difference between the population means raises power:
- µ1 - µ0 = 0.5: β = 0.84, power = 0.16
- µ1 - µ0 = 1.0: β = 0.62, power = 0.38
- µ1 - µ0 = 2.0: β = 0.16, power = 0.84]
Power as a Function of n and σ

[Figure series: power depends on σ_X̄ = σ/√n, so halving σ or quadrupling n has the same effect:
- σ = 1.5, n = 4: σ_X̄ = 0.75, β = 0.81, power = 0.19
- σ = 0.75, n = 4: σ_X̄ = 0.375, β = 0.15, power = 0.85
- σ = 1.5, n = 16: σ_X̄ = 0.375, β = 0.15, power = 0.85]
Some Pros & Cons of Hypothesis Testing

Pros:
- Objective method for making decisions regarding data
- Simple rules that do not require statistics expertise
- In the absence of auxiliary biases (and in scrupulous hands), guarantees correct decisions in the long run

Cons:
- Rigid, 1-bit decision making
- Absolves scientists from thinking carefully about analysis
- Long-run guarantees rely on replication and unbiased reporting & publication
- p-values & significance levels are not useful for meta-analysis