Introduction to the Analysis of Tabular Data


Anthropological Sciences 192/292: Data Analysis in the Anthropological Sciences
James Holland Jones & Ian G. Robertson
March 15,

Tabular Data

Is there an association between age at first parturition and breast cancer?
Using a case-control study, we can test for an association.

BC   No BC   Total
< , ,245 13,465

Anthropological Sciences 192/292: Tables

Some Abstraction

              Case     Control   Total
Exposed       a        b         a + b
Not Exposed   c        d         c + d
Total         a + c    b + d     n = a + b + c + d

Risks

Two possibilities exist for thinking about risks.
Define $p_1 = a/(a + b)$ as the probability of developing disease for exposed individuals.
Define $p_2 = c/(c + d)$ as the probability of developing disease for unexposed individuals.
The risk difference is $p_1 - p_2$.
The risk ratio, or relative risk, is $p_1/p_2$:

$$RR = \frac{a/(a + b)}{c/(c + d)}$$

If the entries of our table are large enough, a normal approximation (to the binomial distribution, if you must know) applies, and we can use normal theory to calculate standard errors and place confidence bounds around RR.

It turns out that log RR has a sampling distribution better approximated by a normal, so we work with that:

$$\mathrm{se}[\log RR] \approx \sqrt{\frac{b}{a n_1} + \frac{d}{c n_2}}$$

where $n_1 = a + b$ and $n_2 = c + d$ are the row sums.
The same logic that led to confidence bounds on a sample mean with known standard deviation then leads to the following expressions for a 95% confidence interval on the relative risk:

$$c_1 = \log RR - 1.96 \sqrt{\frac{b}{a n_1} + \frac{d}{c n_2}}$$

$$c_2 = \log RR + 1.96 \sqrt{\frac{b}{a n_1} + \frac{d}{c n_2}}$$

The CI is then $[e^{c_1}, e^{c_2}]$.
We mentioned large samples: for this approximation to be at all valid, we need $n_1 \hat{p}_1 (1 - \hat{p}_1) \geq 5$ and $n_2 \hat{p}_2 (1 - \hat{p}_2) \geq 5$.
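The relative-risk interval above can be sketched in R as follows. This is a minimal illustration, not the authors' code: the function name `rr_ci` and the cell counts passed to it are hypothetical, chosen only so the arithmetic is easy to follow.

```r
# Relative risk with a large-sample confidence interval on the log scale.
# Arguments a, b, c, d follow the 2x2 layout used in these notes
# (rows: exposed / not exposed; columns: case / control).
# Note: the argument named `c` does not break calls to c(...),
# because R looks for a function when a name is used in call position.
rr_ci <- function(a, b, c, d, level = 0.95) {
  n1 <- a + b                       # exposed row total
  n2 <- c + d                       # unexposed row total
  p1 <- a / n1                      # risk among the exposed
  p2 <- c / n2                      # risk among the unexposed
  rr <- p1 / p2
  se_log_rr <- sqrt(b / (a * n1) + d / (c * n2))
  z <- qnorm(1 - (1 - level) / 2)   # about 1.96 for a 95% interval
  list(
    rr = rr,
    risk_difference = p1 - p2,
    ci = exp(log(rr) + c(-1, 1) * z * se_log_rr),
    # rule-of-thumb check for the normal approximation
    large_enough = n1 * p1 * (1 - p1) >= 5 && n2 * p2 * (1 - p2) >= 5
  )
}

# Hypothetical counts, for illustration only
res <- rr_ci(a = 30, b = 70, c = 15, d = 85)
print(res)
```

The `large_enough` flag simply encodes the rule of thumb stated above; when it is `FALSE`, the interval should not be trusted.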

Problems with Relative Risks

The relative risk is a pretty intuitive idea.
The problem is that it is constrained by the denominator: if $p_2 = 0.5$, then the biggest the relative risk could be is 2 (since $p_1 \leq 1$).
Another problem relates to the fact that we often (usually?) don't have a prospective design for our data collection.
Without the prospective design, relative risks don't make much sense, since we shouldn't believe that $a/(a + b)$ is a good estimator of $p_1$.
We can get around this with odds ratios.
For probability of success $p$, define the odds of $p$ as:

$$\mathrm{Odds} = \frac{p}{1 - p}$$

Note that while probabilities are (by definition) bounded by 0 and 1, odds are bounded by 0 and $\infty$. (This becomes important later.)
Look at $p_1$, the probability of disease given an exposure. The odds of $p_1$ are:

$$\frac{p_1}{1 - p_1} = \frac{a/(a + b)}{b/(a + b)} = \frac{a}{b}$$

The odds of $p_2$ (probability of disease without exposure) are:

$$\frac{p_2}{1 - p_2} = \frac{c/(c + d)}{d/(c + d)} = \frac{c}{d}$$

Define the Odds Ratio as

$$OR = \frac{a/b}{c/d} = \frac{ad}{bc}$$

Among the many nice features of the OR: if the probabilities ($p_1$, $p_2$) are low, the odds ratio is approximately equal to the relative risk:

$$\widehat{OR} \approx RR \quad \text{for } p_1, p_2 < 0.1$$
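A quick numerical check of the rare-outcome approximation, written as a small R sketch. The helper names `odds` and `compare` and the probabilities used are hypothetical and serve only to illustrate the point.

```r
# Odds of a probability p
odds <- function(p) p / (1 - p)

# Relative risk and odds ratio for a pair of risks
compare <- function(p1, p2) {
  c(RR = p1 / p2, OR = odds(p1) / odds(p2))
}

rare   <- compare(0.02, 0.01)   # both risks well below 0.1
common <- compare(0.60, 0.30)   # both risks large

# Same relative risk in both rows, but the odds ratio only
# tracks it closely when the outcome is rare.
print(rbind(rare, common))
```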

On Prospective vs. Case-Control Studies

Why can't we estimate relative risks in a case-control study? Take our table again:

              Case     Control   Total
Exposed       a        b         a + b
Not Exposed   c        d         c + d
Total         a + c    b + d     n = a + b + c + d

This table is a sample of a larger population:

              Case     Control   Total
Exposed       A        B         A + B
Not Exposed   C        D         C + D
Total         A + C    B + D     N = A + B + C + D

Assume a random fraction $f_1$ of the diseased population is included in the study.
Assume a random fraction $f_2$ of the non-diseased population is included in the study.

$$RR = \frac{a/(a + b)}{c/(c + d)} = \frac{f_1 A/(f_1 A + f_2 B)}{f_1 C/(f_1 C + f_2 D)} = \frac{A/(f_1 A + f_2 B)}{C/(f_1 C + f_2 D)}$$

The only way that this will equal the population relative risk, $\frac{A/(A + B)}{C/(C + D)}$, is if $f_1 = f_2$, that is, if the sampling fraction is the same for cases and controls.
Chances are, we don't have that!

But we can always calculate an odds ratio! The sampling fractions cancel:

$$\frac{ad}{bc} = \frac{(f_1 A)(f_2 D)}{(f_2 B)(f_1 C)} = \frac{AD}{BC}$$
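The argument can be checked numerically. The sketch below uses an entirely hypothetical population and hypothetical sampling fractions; the variable names are illustrative, and expected cell counts stand in for an actual random sample.

```r
# Hypothetical population 2x2 table (A, B, C, D as in the notes)
A <- 200; B <- 9800     # exposed: diseased, not diseased
C <- 100; D <- 19900    # not exposed: diseased, not diseased

# Hypothetical sampling fractions: most cases, few controls
f1 <- 0.9               # fraction of diseased individuals sampled
f2 <- 0.01              # fraction of non-diseased individuals sampled

# Expected sample cells under case-control sampling
s_a <- f1 * A; s_b <- f2 * B
s_c <- f1 * C; s_d <- f2 * D

rr_pop    <- (A / (A + B)) / (C / (C + D))
rr_sample <- (s_a / (s_a + s_b)) / (s_c / (s_c + s_d))

or_pop    <- (A * D) / (B * C)
or_sample <- (s_a * s_d) / (s_b * s_c)

# The relative risk is distorted by unequal sampling fractions;
# the odds ratio is not.
print(c(rr_pop = rr_pop, rr_sample = rr_sample))
print(c(or_pop = or_pop, or_sample = or_sample))
```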

Confidence Intervals on Odds-Ratios

Again, if the cells are large enough, the normal approximation works.
Again, we work with the logarithm of the measure of association:

$$\mathrm{s.e.}[\log OR] = \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}$$

The 95% confidence interval for the log-odds ratio is thus:

$$c_1 = \log OR - 1.96 \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}$$

$$c_2 = \log OR + 1.96 \sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}$$

Back-transforming, we get the confidence interval on the original scale, $[e^{c_1}, e^{c_2}]$.
Returning to our breast cancer example:

    > a <- 683
    > b <-
    > c <-
    > d <-
    > (a*d)/(b*c) # take a peek
    [1]
    > selo <- sqrt((1/a)+(1/b)+(1/c)+(1/d))
    > lo <- log( (a*d)/(b*c))
    > exp(lo - 1.96*selo)
    [1]
    > exp(lo + 1.96*selo)
    [1]

Late first birth appears to be significantly associated with breast cancer.

What if You Have Small Cell Counts?

Anthropologists frequently have very small sample sizes.
This means that our $n_i \hat{p}_i (1 - \hat{p}_i)$ values are frequently less than 5, and the normal approximations fall apart (utterly).
R.A. Fisher to the rescue.
Consider the hypothesis that a chronic enzootic infection promotes the occurrence of some other disease of interest (I have obscured the actual diseases here, since I am posting these notes to the web and am in the process of submitting them for publication, wink, wink).
Small cell counts, eh?

              Disease   No Disease   Total
Chronic       8         0            8
Not Chronic   3         11           14

    > chronic.table <- matrix(c(8,0,3,11), nr=2, byrow=TRUE)
    > chronic.table
         [,1] [,2]
    [1,]    8    0
    [2,]    3   11
    > fisher.test(chronic.table)

            Fisher's Exact Test for Count Data

    data:  chronic.table
    p-value =
    alternative hypothesis: true odds ratio is not equal to 1
    95 percent confidence interval:
          Inf
    sample estimates:
    odds ratio
           Inf

Fisher's test uses exact probabilities from a hypergeometric distribution, so there are no approximations required.
That said, if your cells are too big, you will choke your computer.
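To make the hypergeometric connection concrete, the two-sided p-value can be reproduced by hand with `dhyper` and compared with `fisher.test`. This is a sketch under the usual convention (sum the probabilities of all tables no more likely than the observed one, with margins fixed); the variable names are illustrative.

```r
tab <- matrix(c(8, 0, 3, 11), nrow = 2, byrow = TRUE)

m <- sum(tab[, 1])    # column 1 total (diseased)
n <- sum(tab[, 2])    # column 2 total (not diseased)
k <- sum(tab[1, ])    # row 1 total (chronic)
x_obs <- tab[1, 1]    # observed count in the top-left cell

# All values the top-left cell can take with the margins fixed
support <- max(0, k - n):min(k, m)
probs <- dhyper(support, m, n, k)
p_obs <- dhyper(x_obs, m, n, k)

# Two-sided p-value: total probability of tables that are
# no more likely than the observed one (small tolerance for ties)
p_manual <- sum(probs[probs <= p_obs * (1 + 1e-7)])
p_builtin <- fisher.test(tab)$p.value

print(c(manual = p_manual, builtin = p_builtin))
```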

A Teaser on Power Calculations

Say we want to collect data to test the hypothesis that $p_1 \neq p_2$.
Say we want the probability of type I error to be $\alpha$ and the probability of type II error to be $\beta$.
What?
Type I Error: falsely rejecting a true null hypothesis (i.e., your test says "significantly different" but reality says "not different").
Type II Error: failing to reject a false null hypothesis (i.e., your test says "not different" but reality is really different).
Type I error is what we usually think about (you know, the 95% thing).
But clearly, we also care about missing out on real effects just because we don't have enough power to detect them.
We call $1 - \beta$ the power.
An integral part of research design should be a power calculation.

Can't emphasize this enough...

Back to the $p_1 \neq p_2$ question.
Assume you will have $k$ times as many subjects in class 2 as in class 1: $n_2 = k n_1$.
The sample size that we need to confidently test the hypothesis $p_1 \neq p_2$ is

$$n_1 = \frac{\left[ \sqrt{\bar{p}\,\bar{q}\left(1 + \frac{1}{k}\right)}\; z_{1-\alpha/2} + \sqrt{p_1 q_1 + \frac{p_2 q_2}{k}}\; z_{1-\beta} \right]^2}{\Delta^2}$$

where
$p_1$, $p_2$ are the projected true probabilities of success in the two groups,
$q_1 = 1 - p_1$ and $q_2 = 1 - p_2$,
$\Delta = |p_2 - p_1|$,
$\bar{p} = \frac{p_1 + k p_2}{1 + k}$,
$\bar{q} = 1 - \bar{p}$.

What does this mean from a practical standpoint?

We want to know how many samples (e.g., respondents) we need to test our hypothesis.
We set our $\alpha$ (typically 0.05) and our $1 - \beta$ (typically 0.8, though higher is better).
We use theory, or previous research (or just a best guess!), to decide what our expected effect size ($\Delta$) is.
We also use this method to make a guess as to what $k$ will be (working on the assumption that your thing of interest is rare; otherwise you can use $k = 1$).
We then solve the big ugly equation above for $n_1$.
Or, should I say, we use R! The package pwr contains routines for doing all sorts of power calculations.
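As a rough sketch, the sample-size formula for two proportions (in its standard textbook form) can be coded directly. The function name `n1_two_props` and the example probabilities are hypothetical. The cross-check uses `power.prop.test` from base R's stats package rather than pwr, on the assumption that pwr may not be installed; for equal group sizes ($k = 1$) the two should agree closely.

```r
# Sample size for group 1 when testing p1 != p2,
# with n2 = k * n1 subjects in group 2.
n1_two_props <- function(p1, p2, k = 1, alpha = 0.05, power = 0.80) {
  q1 <- 1 - p1
  q2 <- 1 - p2
  p_bar <- (p1 + k * p2) / (1 + k)   # pooled probability
  q_bar <- 1 - p_bar
  delta <- abs(p2 - p1)              # effect size
  z_alpha <- qnorm(1 - alpha / 2)
  z_beta  <- qnorm(power)            # power = 1 - beta
  (sqrt(p_bar * q_bar * (1 + 1 / k)) * z_alpha +
     sqrt(p1 * q1 + p2 * q2 / k) * z_beta)^2 / delta^2
}

# Hypothetical planning values
n1 <- n1_two_props(p1 = 0.10, p2 = 0.20)

# Cross-check against base R for the equal-allocation case
ref <- power.prop.test(p1 = 0.10, p2 = 0.20,
                       sig.level = 0.05, power = 0.80)$n

print(c(formula = n1, power.prop.test = ref))
print(ceiling(n1))   # round up to whole subjects per group
```

In practice the result is rounded up, and setting `k` above 1 trades a smaller group 1 for a larger group 2.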

Another Way...

Say we wanted to model the probability $p$ of some event.
Say also that we imagine our probability to be a linear function of some covariates $x_i$:

$$p = \alpha + \beta_1 x_1 + \cdots + \beta_k x_k$$

Here's the rub: $0 \leq p \leq 1$.
What if our linear function gives us something that falls outside that range?
Define the logit transform of a probability $p$:

$$\mathrm{logit}(p) = \log[p/(1 - p)]$$

The logit is also known as the log-odds, for obvious reasons.
Odds range from 0 to $\infty$.
Log odds range from $-\infty$ to $\infty$.

$$\mathrm{logit}(p) = \log\left(\frac{p}{1 - p}\right) = \alpha + \beta_1 x_1 + \cdots + \beta_k x_k$$

We can solve for $p$ to get

$$p = \frac{e^{\alpha + \beta_1 x_1 + \cdots + \beta_k x_k}}{1 + e^{\alpha + \beta_1 x_1 + \cdots + \beta_k x_k}}$$

In general, the anti-logit transform is:

$$L = \log\left(\frac{p}{1 - p}\right)$$

$$p = \frac{e^L}{1 + e^L}$$
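The two transforms are easy to write down and check in R. The function names `logit` and `antilogit` are illustrative; base R provides the same transforms as `qlogis` and `plogis`.

```r
logit     <- function(p) log(p / (1 - p))
antilogit <- function(L) exp(L) / (1 + exp(L))

p <- c(0.1, 0.5, 0.9)
L <- logit(p)

# Round trip: the anti-logit undoes the logit
print(all.equal(antilogit(L), p))

# The built-in equivalents agree with the hand-written versions
print(all.equal(L, qlogis(p)))
print(all.equal(antilogit(L), plogis(L)))

# Any value of the linear predictor maps back into (0, 1)
eta <- c(-3, 0, 4)
print(antilogit(eta))
```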

Interpretation of the Model Parameters

Say that all covariates are the same except for one; call it $x_j$.
Say that $x_j = 1$ for one individual (A) and $x_j = 0$ for another (B).
We then have:

$$\mathrm{logit}(p_A) = \alpha + \beta_1 x_1 + \cdots + \beta_{j-1} x_{j-1} + \beta_j (1) + \beta_{j+1} x_{j+1} + \cdots + \beta_k x_k$$

$$\mathrm{logit}(p_B) = \alpha + \beta_1 x_1 + \cdots + \beta_{j-1} x_{j-1} + \beta_j (0) + \beta_{j+1} x_{j+1} + \cdots + \beta_k x_k$$

Subtract $\mathrm{logit}(p_B)$ from $\mathrm{logit}(p_A)$ to get

$$\mathrm{logit}(p_A) - \mathrm{logit}(p_B) = \beta_j$$

From the definition of a logit, this is

$$\log[p_A/(1 - p_A)] - \log[p_B/(1 - p_B)] = \beta_j$$

which is just:

$$\log\left[\frac{p_A/(1 - p_A)}{p_B/(1 - p_B)}\right] = \beta_j$$

In other words, $e^{\beta_j}$ is the odds ratio in favor of subject A.
Say we have a variable $E$ that defines exposure ($E = 1$ means exposure, $E = 0$ means no exposure). Then

$$\log[p/(1 - p)] = \alpha + \beta E$$

Logistic regression gives you the same answer as cross-multiplying a two-way table.

Logistic Regression and Cross-Multiplication Give the Same Odds-Ratio!

    > FB <- cbind(c(a,c), c(a+b,c+d)-c(a,c))
    > expose <- factor(c("yes", "no"))
    > expose
    [1] yes no
    Levels: no yes
    > fblm <- glm(FB ~ expose, family=binomial)
    > summary(fblm)

    Call:
    glm(formula = FB ~ expose, family = binomial)

    Deviance Residuals:
    [1]  0  0

    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)                               <2e-16 ***
    exposeyes                                 <2e-16 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance:  e+01  on 1  degrees of freedom
    Residual deviance:  e-12  on 0  degrees of freedom
    AIC:

    Number of Fisher Scoring iterations: 2

    > exp( )
    [1]
    >
    > (a*d)/(b*c)
    [1]
    > # pretty durn close...

Logistic Regression Can Be Used with Continuous Independent Variables Too

Consider the contrived case where all covariates are identical between two cases, A and B, except for $x_j$, which differs by an additive factor $\delta$:

$$\mathrm{logit}(p_A) = \alpha + \beta_1 x_1 + \cdots + \beta_{j-1} x_{j-1} + \beta_j (x_j + \delta) + \beta_{j+1} x_{j+1} + \cdots + \beta_k x_k$$

$$\mathrm{logit}(p_B) = \alpha + \beta_1 x_1 + \cdots + \beta_{j-1} x_{j-1} + \beta_j (x_j) + \beta_{j+1} x_{j+1} + \cdots + \beta_k x_k$$

$$\mathrm{logit}(p_A) - \mathrm{logit}(p_B) = \beta_j \delta$$

From the definition of a logit, this is

$$\log[p_A/(1 - p_A)] - \log[p_B/(1 - p_B)] = \beta_j \delta$$

which is just:

$$\log\left[\frac{p_A/(1 - p_A)}{p_B/(1 - p_B)}\right] = \beta_j \delta$$

In other words, $e^{\beta_j}$ is the odds ratio in favor of subject A per unit of increase in covariate $j$.
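A small numerical sketch of the per-unit interpretation. The intercept, slope, and covariate values below are hypothetical and chosen only for illustration.

```r
# Hypothetical logistic model with a single continuous covariate
alpha  <- -2
beta_j <- 0.4

# Odds of the event at covariate value x, via the fitted probability
odds_at <- function(x) {
  p <- plogis(alpha + beta_j * x)
  p / (1 - p)
}

delta <- 1   # one-unit increase in the covariate

# The odds ratio for a shift of delta does not depend on the
# starting value of x, and equals exp(beta_j * delta).
or_low  <- odds_at(0 + delta)  / odds_at(0)
or_high <- odds_at(10 + delta) / odds_at(10)

print(c(or_low = or_low, or_high = or_high,
        exp_beta = exp(beta_j * delta)))
```

Note that the odds ratio is constant across the covariate range even though the change in probability for a one-unit shift is not.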

Multiple Logistic Regression

The tool of quantitative social science research?
Almost certainly true for epidemiology...

    > Mroz <- read.table("/home/jhj1/teaching/a192/mroz.txt", header=TRUE, skip=33)
    > mrozglm <- glm(lfp ~ ., family="binomial", data=Mroz)
    > summary(mrozglm)

    Call:
    glm(formula = lfp ~ ., family = "binomial", data = Mroz)

    Deviance Residuals:
        Min       1Q   Median       3Q      Max

    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)                                 e-07 ***
    k5                                          e-13 ***
    k618
    age                                         e-07 ***
    wcyes                                            ***
    hcyes
    lwg                                         e-05 ***
    inc                                         e-05 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance:       on 752  degrees of freedom
    Residual deviance:       on 745  degrees of freedom
    AIC:

    Number of Fisher Scoring iterations: 4

Having kids under 5 decreases a woman's probability of labor force participation.
Having kids between 6 and 18 decreases a woman's probability of labor force participation.
Being older decreases a woman's probability of labor force participation.
Going to college increases a woman's probability of labor force participation.
Being married to a man who went to college has no significant effect.
Making more money (or having the potential to make more money) increases a woman's probability of labor force participation.
Having more money decreases a woman's probability of labor force participation.
