Hypothesis testing, part 2. With some material from Howard Seltman, Blase Ur, Bilge Mutlu, Vibha Sazawal


CATEGORICAL IV, NUMERIC DV

Independent samples, one IV

# Conditions   Normal/Parametric   Non-parametric
Exactly 2      T-test              Mann-Whitney U, bootstrap
2+             One-way ANOVA       Kruskal-Wallis, bootstrap

Is your data normal?
Skewness: asymmetry
Kurtosis: peakedness relative to normal
Both: within ±2 SE (s/u) is OK
Or use Shapiro-Wilk (null = normal)
Or look at a Q-Q plot

T-test
Already talked about
Assumptions: normality, equal variances, independent samples
Can use Levene's test to check the equal-variance assumption
Post-test: check residuals for assumption fit
  For a t-test this is the same pre or post
  For other tests you check residual vs. fit post

One-way ANOVA
H0: μ1 = μ2 = μ3
H1: at least one doesn't match
  NOT H1: μ1 ≠ μ2 ≠ μ3
Assumptions: normality, common variance, independent errors
Intuition: F statistic = variance between / variance within
  Under the (exact) null, F is expected to be about 1; F >> 1 rejects the null

One-way ANOVA
F = MS_b / MS_w
MS_w = sum[ sum[ (diff from condition mean)^2 ] ] / df_w
  df_w = N - k, where k = number of conditions
  Sum over all conditions; sum per condition
MS_b = sum[ (condition mean - grand mean)^2 ] / df_b
  df_b = k - 1
  Every observation goes in the sum (each contributes its own condition's term)
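The F ratio on this slide can be sketched in a few lines. This is a minimal illustration using only the standard library; the function name and the toy data are made up for the example, and it computes the statistic only (no p-value).

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)
    # Between: every observation contributes (its condition mean - grand mean)^2.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within: squared deviation of each observation from its own condition mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_b, df_w = k - 1, n_total - k
    return (ss_between / df_b) / (ss_within / df_w), df_b, df_w

# Hypothetical data: three conditions, three observations each.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_b, df_w = one_way_anova_f(groups)
print(f_stat, df_b, df_w)
```

Here MS_b = 26 / 2 = 13 and MS_w = 6 / 6 = 1, so F is about 13 with (2, 6) degrees of freedom.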

(example from Vibha Sazawal)

[Figure: F-distribution; null rejected]

Now what? (Contrasts)
So we rejected the null. What did we learn?
What *didn't* we learn?
  At least one is different... Which? All?
This is called an omnibus test
To answer our actual research question, we usually need pairwise contrasts

The trouble with contrasts
Contrasts mess with your Type I bounds
One test: 95% confident
Three tests: 85.7% confident (0.95^3)
5 conditions, all pairs: 4 + 3 + 2 + 1 = 10 tests: 59.9% (0.95^10)
UH OH
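The percentages on this slide come from multiplying the per-test confidence. A small sketch, assuming the tests are independent (a simplification; pairwise contrasts on the same data are not fully independent); the function names are illustrative.

```python
from math import comb

def familywise_confidence(n_tests, alpha=0.05):
    """Chance of making no Type I error across n_tests independent tests."""
    return (1 - alpha) ** n_tests

def all_pairs(k):
    """Number of pairwise comparisons among k conditions."""
    return comb(k, 2)

three_tests = familywise_confidence(3)
five_conditions = familywise_confidence(all_pairs(5))
print(round(three_tests, 3), round(five_conditions, 3))
```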

Planned vs. post hoc
Planned: You have a theory. Really, no cheating
  You get k-1 pairwise comparisons for free (k = number of conditions)
  In theory, should not be control vs. all, but probably OK
  NO COMPARISONS unless omnibus passes
Post-hoc
  Anything unplanned
  More than k-1
  Requires correction!
  Doesn't necessarily require omnibus first

Correction
Adjust {p-values, alpha} to compensate for multiple testing post-hoc
Bonferroni (most conservative)
  Assume all possible pairs: m = k(k-1)/2 (combinations)
  alpha_c = alpha / m
  Once you have looked, the implication is you did all the comparisons implicitly!
Holm-Bonferroni is less conservative
  Stepwise: adjust alpha as you go
Dunnett for specifically all vs. control; others
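A rough sketch of the two alpha adjustments named on this slide. The p-values are hypothetical, and the functions return reject/keep decisions rather than adjusted p-values.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only where p <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def holm(p_values, alpha=0.05):
    """Step down from the smallest p: compare to alpha/m, alpha/(m-1), ..."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject

p_values = [0.010, 0.020, 0.030]  # hypothetical results of three contrasts
bonferroni_result = bonferroni(p_values)
holm_result = holm(p_values)
print(bonferroni_result, holm_result)
```

With these numbers Bonferroni rejects only the first contrast (threshold 0.05 / 3), while Holm rejects all three, which is what "less conservative" means in practice.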


Non-parametrics: MWU and K-W
Good for non-normal data, Likert data (ordinal, not actually numeric)
Assumptions: independent, at least ordinal
Null (MWU): P(X > Y) = P(Y > X), where X, Y are observations from the 2 distributions
If you assume the same distribution shape and continuous data, this can be seen as comparing medians

MWU and K-W continued
Essentially: rank order all data (both conditions)
Total the ranks for condition 1, compare to expected
Various procedures to correct for ties
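The ranking step can be sketched as follows. Ties get the average of the ranks they span, which is one common convention among the "various procedures"; the function names and data are illustrative, and no p-value is computed.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks

def mann_whitney_u(a, b):
    """Return (U for sample a, U expected under the null)."""
    ranks = average_ranks(list(a) + list(b))
    rank_sum_a = sum(ranks[:len(a)])
    u_a = rank_sum_a - len(a) * (len(a) + 1) / 2
    return u_a, len(a) * len(b) / 2

u_a, u_expected = mann_whitney_u([1, 2, 2], [2, 3])  # hypothetical ordinal data
print(u_a, u_expected)
```

U for sample a counts the (a, b) pairs in which the a value is larger, with ties counting one half, so a U far from the expected value suggests one condition tends to outrank the other.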

Bootstrap
Resampling technique(s)
Intuition:
  Create a null distribution by, e.g., subtracting means so μa = μb = 0
  Now you have shifted samples A-hat and B-hat
  Combine these to make a null distribution
  Draw a sample of size N, with replacement
  Do it 1000 (or 10k) times
  Use this to determine the critical value (alpha = 0.05)
  Compare this critical value to your real data for the test
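One way to read the recipe above as code. This is a sketch, not a definitive procedure: the absolute difference in means as the test statistic, the fixed seed, and the sample data are all assumptions made for the example.

```python
import random
from statistics import fmean

def bootstrap_mean_test(a, b, n_boot=1000, alpha=0.05, seed=0):
    """Return (observed |mean diff|, bootstrap critical value, reject?)."""
    rng = random.Random(seed)
    mean_a, mean_b = fmean(a), fmean(b)
    observed = abs(mean_a - mean_b)
    # Shift each sample to mean 0, then pool: this is the null population.
    pool = [x - mean_a for x in a] + [x - mean_b for x in b]
    null_diffs = []
    for _ in range(n_boot):
        resample_a = rng.choices(pool, k=len(a))  # with replacement
        resample_b = rng.choices(pool, k=len(b))
        null_diffs.append(abs(fmean(resample_a) - fmean(resample_b)))
    null_diffs.sort()
    # Critical value: the (1 - alpha) quantile of the null differences.
    critical = null_diffs[min(int((1 - alpha) * n_boot), n_boot - 1)]
    return observed, critical, observed > critical

# Hypothetical measurements for two conditions.
a = [12.1, 9.8, 11.4, 10.3, 10.9, 9.5]
b = [14.2, 15.1, 13.7, 16.0, 14.8, 15.5]
observed, critical, reject = bootstrap_mean_test(a, b)
print(observed, critical, reject)
```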

Paired samples, one IV

# Conditions   Normal/Parametric                       Non-parametric
Exactly 2      Paired T-test                           Wilcoxon signed-rank
2+             2-way ANOVA w/ subject random factor;   Friedman
               mixed models (later)

Paired T-test
Two samples per participant / item
Test subtracts them
Then uses a one-sample T-test with H0: μ = 0 and H1: μ ≠ 0
Regular T-test assumptions, plus: does subtraction make sense here?
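The subtract-then-one-sample idea in code form. A minimal sketch with made-up before/after scores; it returns the t statistic and degrees of freedom, which would then be compared against a t distribution.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(first, second):
    """One-sample t statistic on the per-pair differences (H0: mean diff = 0)."""
    diffs = [s - f for f, s in zip(first, second)]
    n = len(diffs)
    standard_error = stdev(diffs) / sqrt(n)  # stdev uses the n-1 denominator
    return mean(diffs) / standard_error, n - 1

before = [10, 12, 9, 11]   # hypothetical scores, one pair per participant
after = [12, 15, 10, 14]
t_stat, df = paired_t(before, after)
print(t_stat, df)
```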

Wilcoxon S.R. / Friedman
H0: difference between pairs is symmetric around 0
H1: ... or not
Excludes no-change items
Essentially: rank by abs. difference; compare signs * ranks
(Friedman = 3+ generalization)
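A sketch of the signed-rank bookkeeping described above: drop zero differences, rank by absolute difference (average ranks for ties, an assumed convention), then total the ranks by sign. The data are hypothetical and no p-value is computed.

```python
def signed_rank_sums(first, second):
    """Return (sum of ranks of positive diffs, sum of ranks of negative diffs)."""
    diffs = [s - f for f, s in zip(first, second) if s != f]  # drop no-change pairs
    by_abs = sorted(diffs, key=abs)
    w_plus = w_minus = 0.0
    i = 0
    while i < len(by_abs):
        j = i
        while j + 1 < len(by_abs) and abs(by_abs[j + 1]) == abs(by_abs[i]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # tied absolute differences share a rank
        for d in by_abs[i:j + 1]:
            if d > 0:
                w_plus += avg_rank
            else:
                w_minus += avg_rank
        i = j + 1
    return w_plus, w_minus

w_plus, w_minus = signed_rank_sums([5, 7, 6, 8, 4], [6, 7, 9, 7, 6])
print(w_plus, w_minus)
```

Under the null the two sums should be similar; here the positive ranks dominate.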

SIMPLE LINEAR REGRESSION
One numeric IV, numeric DV

Simple linear regression
E(Y | x) = b0 + b1 x
  Looks at populations: the population mean of Y at this value of x
Key H0: b1 = 0
  b0 usually not important for significance (obviously important in model fit)
b1: slope → change in Y per unit X
Best fit: least squares, or maximum likelihood
  LSq: minimize the sum of squares of residuals
  ML: maximize the probability of seeing this data with this model
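For one predictor, the least-squares line has a closed form. A minimal sketch with hypothetical data; it estimates b0 and b1 only and does not test H0.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares estimates (b0, b1) for y = b0 + b1 * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx            # slope: change in y per unit x
    b0 = y_bar - b1 * x_bar   # the line passes through (x_bar, y_bar)
    return b0, b1

xs = [1, 2, 3, 4, 5]                 # hypothetical data
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = fit_line(xs, ys)
print(b0, b1)
```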

Assumptions, caveats
Assumes:
  Linearity in Y ~ X
  Normally distributed error for each x, with constant variance at all x
  Error in measuring X is small compared to var. of Y (fixed X)
  Independent errors! Serial correlation, data that is grouped, etc. (later)
Don't interpret widely outside available x values
Can transform for linearity! log(y), sqrt(y), 1/y, y^2

Assumption/residual checking
Before: use a scatterplot for plausible linearity
After: residual vs. fit
  Residual on the Y axis vs. predicted on the X axis
  Should be relatively evenly distributed around 0 (linear)
  Should have relatively even vertical spread (equal variance)
After: quantile-normal plot of residuals

Model interpretation
Interpret b1, interpret the p-value
CI: if it crosses 0, it's not significant
R^2: fraction of total variation accounted for
  Intuitively: explained variance / total variance
  Explained = var(y) - residual variance
f^2 = R^2 / (1 - R^2); small/medium/large: 0.02, 0.15, 0.35 (Cohen)
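R^2 and Cohen's f^2 follow directly from the observed and fitted values. A sketch under the assumption that the fitted values come from a least-squares model; the numbers below are hypothetical.

```python
from statistics import mean

def r_squared(ys, fitted):
    """1 - (residual sum of squares / total sum of squares)."""
    y_bar = mean(ys)
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    ss_residual = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    return 1 - ss_residual / ss_total

def cohens_f2(r2):
    """Effect size f^2 = R^2 / (1 - R^2)."""
    return r2 / (1 - r2)

ys = [1, 2, 3, 4]                 # hypothetical observations
fitted = [1.5, 1.5, 3.5, 3.5]     # hypothetical model predictions
r2 = r_squared(ys, fitted)
f2 = cohens_f2(r2)
print(r2, f2)
```

Here R^2 is about 0.8, giving f^2 of about 4, far beyond Cohen's "large" threshold of 0.35.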

Robustness
Brittle to linearity, independent errors
Somewhat brittle to fixed-x
Fairly robust to equal variance
Quite robust to normality

CATEGORICAL OUTCOMES

One categorical IV, categorical DV, independent
Contingency tables: how many people are in each combination of categories

Chi-square test of independence
H0: distribution of Var1 is the same at every level of Var2 (and vice versa)
Null distribution approaches chi-square as sample size grows
  Heuristic: no cells < 5
  Can use FET (Fisher's exact test) instead
Intuition:
  Sum over rows/columns: (observed - expected)^2 / expected
  Expected: marginal % * count in other margin
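The "observed vs. expected" sum can be written out directly. A minimal sketch with a hypothetical 2x2 table; it returns the statistic and degrees of freedom, not a p-value.

```python
def chi_square(table):
    """Return (chi-square statistic, df) for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count: row share of the total times the column total.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df

table = [[10, 20], [20, 10]]  # hypothetical counts
stat, df = chi_square(table)
print(stat, df)
```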

Paired 2x2 tables
Use McNemar's test
Contingency table: matches and mismatches for each option
H0: marginals are the same

             Cond1: Yes   Cond1: No
Cond2: Yes   a            b           a + b
Cond2: No    c            d           c + d
             a + c        b + d       N

Essentially a chi-square test on the agreement
Test stat: (b - c)^2 / (b + c)
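The test statistic uses only the two mismatch cells. A tiny sketch with hypothetical counts; the 3.841 cutoff is the 0.95 quantile of a chi-square distribution with 1 degree of freedom, and no continuity correction is applied.

```python
CHI2_CRITICAL_DF1 = 3.841  # chi-square, 1 df, alpha = 0.05

def mcnemar_statistic(b, c):
    """(b - c)^2 / (b + c), where b and c are the discordant cell counts."""
    return (b - c) ** 2 / (b + c)

statistic = mcnemar_statistic(b=15, c=5)  # hypothetical mismatch counts
significant = statistic > CHI2_CRITICAL_DF1
print(statistic, significant)
```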

Paired, continued
Cochran's Q: extended for more than two conditions
Other similar extensions for related tasks

Critiques
Choose a paper that has one (or more) empirical experiments as a central contribution
  Doesn't have to be human subjects, but can be
  Does have to have enough description of the experiment
10-12 minute presentation
  Briefly: research questions, necessary background
  Main: describe and critique methods
    Experimental design, data collection, analysis
    Good, bad, ugly, missing
  Briefly, results?

Logistic regression (logit)
Numeric IV, binary DV (or ordinal)
log( E(Y) / (1 - E(Y)) ) = log( Pr(Y=1) / Pr(Y=0) ) = b0 + b1 x
Log odds of success = linear function
Odds: 0 to inf., 1 is the middle
  e.g., odds = 5 = 5:1, five successes for every one failure
Log odds: -inf to inf with 0 in the middle: good for regression
Modeled as binomial distribution

Interpreting logistic regression
Take exp(coef) to get interpretable odds
For each unit increase in x, odds are multiplied by exp(b1)
  Note that this can make small coefs important!
Use, e.g., the Hosmer-Lemeshow test for goodness of fit
  Null: data fit the model
  But not a lot of power!
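A quick numeric check that a one-unit step in x multiplies the odds by exp(b1). The coefficients are hypothetical, chosen only for illustration.

```python
from math import exp

def probability(b0, b1, x):
    """Pr(Y = 1) when the log odds are b0 + b1 * x."""
    log_odds = b0 + b1 * x
    return 1 / (1 + exp(-log_odds))

def odds(p):
    return p / (1 - p)

b0, b1 = -1.0, 0.7  # hypothetical fitted coefficients
odds_at_2 = odds(probability(b0, b1, 2))
odds_at_3 = odds(probability(b0, b1, 3))
odds_ratio = odds_at_3 / odds_at_2
print(odds_ratio, exp(b1))  # the two values agree up to rounding
```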

MULTIVARIATE

Multiple regression
Linear/logistic regression with more variables!
At least one numeric, 0+ categorical
Still: fixed x, normal errors w/ equal variance, independent errors (linear)
Linear relationship between E(Y) and each x, when other inputs are held constant
  Effects of each x are independent!
Still check quantile-normal plot of residuals, residual vs. fit

Model selection
Which covariates to keep? (more on this in a bit)

Adding categorical vars
Indicator variables (everything is 0 or 1)
Need one fewer indicator than conditions
  One condition is true; or none are true (baseline)
Coefs are *relative to baseline*!
Model selection: keep all or none for one factor
Called ANCOVA when at least one each numeric + categorical
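Indicator coding with a baseline can be sketched like this. The function name, condition labels, and choice of baseline are hypothetical.

```python
def indicator_columns(values, baseline):
    """Return (non-baseline levels, one row of 0/1 indicators per value)."""
    levels = sorted(set(values) - {baseline})
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows

conditions = ["control", "A", "B", "A"]  # hypothetical factor with 3 conditions
levels, rows = indicator_columns(conditions, baseline="control")
print(levels)  # two indicators for three conditions
print(rows)    # the baseline row is all zeros
```

Because the baseline is the all-zeros row, each indicator's coefficient is that condition's difference from the baseline.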

Interaction
What if your covariates *aren't* independent?
E(Y) = b0 + b1 x1 + b2 x2 + b12 x1 x2
Slope for x1 is different for each value of x2
Superadditive: all in same direction, interaction makes effects stronger
Subadditive: interaction is in opposite direction
For indicator vars, all or none
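Regrouping the interaction model by x1 shows why the slope for x1 depends on x2; this is just the slide's equation rearranged.

```latex
\begin{aligned}
E(Y) &= b_0 + b_1 x_1 + b_2 x_2 + b_{12} x_1 x_2 \\
     &= (b_0 + b_2 x_2) + (b_1 + b_{12} x_2)\, x_1
\end{aligned}
```

So at a fixed x2 the intercept is b0 + b2 x2 and the slope for x1 is b1 + b12 x2; with b12 = 0 the slope is b1 everywhere, which is the additive (no-interaction) case.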

Model selection!
Which covariates to keep?
From theory
Keep interaction only if it's significant?
  If you keep the interaction, you should keep the corresponding main effects
Adjusted R^2?
  Regular R^2 is always higher w/ more covariates
BIC and AIC
  Take model likelihood and penalize for more params
  Absolute value not interpretable; lower is better
All combinations? Stepwise?
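The penalized-likelihood idea, using the standard definitions AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L. The log-likelihoods and parameter counts below are hypothetical.

```python
from math import log

def aic(log_likelihood, n_params):
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood, n_params, n_obs):
    return n_params * log(n_obs) - 2 * log_likelihood

# Hypothetical fits: the bigger model fits slightly better but uses 3 more params.
small_aic, big_aic = aic(-120.0, 2), aic(-118.0, 5)
small_bic, big_bic = bic(-120.0, 2, 50), bic(-118.0, 5, 50)
print(small_aic, big_aic, small_bic, big_bic)
```

In this made-up case both criteria are lower for the small model, so the extra covariates do not pay for themselves.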

THINGS WE ARE ONLY GOING TO MENTION BRIEFLY
Know they exist; look them up if relevant

Multi-way ANOVA
>1 categorical IVs, 1 numeric DV
Normality, equal variance, independent errors
With interaction: every combo of factor levels has its own population mean
Without interaction (additive): change in one var is consistent at all fixed values of the others
Works basically like standard ANOVA, etc.

Mixed models regression
Explicitly model correlations in data
Fixed effects: affect outcome for everyone
Random effects: deviations per data item, don't want to model individually
Simplest example: repeated measures
  Y ~ b0 + b1x1 + b2x2 ... + random ID intercept
  Each participant has their own intercept adjustment

POWER ANALYSIS

What is power?
Null distribution: designed so that we'd only see a test statistic this extreme 5% of the time
  This bounds Type I but not Type II
Power = 1 - Type II error rate
Heuristic: 80% is good enough

Alternative scenarios
One null, but infinitely many alternatives!
Alternative distribution: given some n, underlying variance, and underlying difference in population means, what is the distribution of the test statistic?
You know the critical value, so this tells you how often your p will be above 0.05 (a Type II error) when the true scenario is as you model

Calculating power
A priori, to think about sample size and judge the value of an experiment
Inherently requires estimating the alternative scenario! Maybe try a few
Statistic-specific, but in general: sample size, effect size, power, alpha
Consider the smallest effect size that you consider interesting and try to achieve reasonable power for that effect size

Example from Seltman book
F statistic (ANOVA), 3 treatments, 50 people each
Red: sigma = 10, means: 10, 12, 14
Blue: sigma = 10, means: 10, 13, 16
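The two scenarios above can be explored by simulation. This is a Monte Carlo sketch, not the book's calculation: the critical value is estimated from a simulated null rather than taken from the F distribution, and the seed and number of simulations are arbitrary choices, so the estimates are approximate.

```python
import random
from statistics import fmean

def f_statistic(groups):
    """One-way ANOVA F = MS_between / MS_within."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand = fmean([x for g in groups for x in g])
    means = [fmean(g) for g in groups]
    ss_b = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_w = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_b / (k - 1)) / (ss_w / (n_total - k))

def simulate_f(means, sigma, n_per_group, n_sims, rng):
    """F statistics from n_sims simulated experiments with normal errors."""
    return [
        f_statistic([[rng.gauss(m, sigma) for _ in range(n_per_group)]
                     for m in means])
        for _ in range(n_sims)
    ]

def estimate_power(means, sigma=10.0, n_per_group=50, n_sims=1000,
                   alpha=0.05, seed=0):
    rng = random.Random(seed)
    # Null scenario: every treatment has the same population mean.
    null_fs = sorted(simulate_f([means[0]] * len(means), sigma,
                                n_per_group, n_sims, rng))
    critical = null_fs[min(int((1 - alpha) * n_sims), n_sims - 1)]
    # Alternative scenario: how often does F exceed the critical value?
    alt_fs = simulate_f(means, sigma, n_per_group, n_sims, rng)
    return sum(f > critical for f in alt_fs) / n_sims

power_red = estimate_power([10, 12, 14])
power_blue = estimate_power([10, 13, 16])
print(power_red, power_blue)
```

The blue scenario has the larger spread in means at the same sigma and n, so its estimated power should come out higher than red's.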

Promoting power
(Review from earlier)
Raise sample size; reduce variance; aim for bigger effects

Walkthrough: linear regression
u = model df -> number of params
v = F-test error df -> N - u - 1
f^2 = R^2 / (1 - R^2)
R^2 = f^2 / (1 + f^2)

Retrospective power
Somewhat controversial
Calculate observed effect size, then determine what sample size would be needed
  Whole new experiment, not just collect more
Not a good idea: "We didn't find a significant effect, but if we had studied 12 more people..."