


Measures of Association

• Summarize strength of associations
• Quantify relative risk

Types of measures:
• odds ratio
• correlation
• Pearson statistic
• prediction
• concordance/discordance

References:

Goodman, L. A. and Kruskal, W. H. (1979). Measures of Association for Cross Classifications, Springer, New York.

Bishop, Fienberg and Holland (1975). Discrete Multivariate Analysis: Theory and Practice, MIT Press, Cambridge, MA (Chapter 11).

Brown, M. B. and Benedetti (1977). Sampling behavior of tests for correlation in two-way contingency tables, Journal of the American Statistical Association, 72.

Cohen, J. (1960). A coefficient of agreement for nominal scales, Educational and Psychological Measurement, 20.

Mantel, N. and Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease, Journal of the National Cancer Institute, 22.

Agresti, A. (1984). Analysis of Ordinal Categorical Data, Wiley, New York (Chapters 2 & 3).

Agresti, A. (2002). Categorical Data Analysis, 2nd edition, Wiley, New York (Chapter 2).

The odds ratio: the most frequently used measure for 2 × 2 tables.

Observed counts:

                        Variable 2
                      j = 1   j = 2
    Variable 1 i = 1   Y11     Y12
               i = 2   Y21     Y22

Means or "expected" counts:

            j = 1   j = 2
    i = 1    m11     m12
    i = 2    m21     m22

True proportions:

            j = 1   j = 2
    i = 1    π11     π12
    i = 2    π21     π22

The odds that a sampled unit is in category 1 for variable 1, given that it is in category j for variable 2:

$$\frac{\Pr\{\text{variable 1 is in category 1} \mid \text{variable 2 is in category } j\}}{\Pr\{\text{variable 1 is in category 2} \mid \text{variable 2 is in category } j\}} = \frac{\pi_{1j}/(\pi_{1j}+\pi_{2j})}{\pi_{2j}/(\pi_{1j}+\pi_{2j})} = \frac{\pi_{1j}}{\pi_{2j}}$$

Odds ratio:

$$\alpha = \frac{\pi_{11}/\pi_{21}}{\pi_{12}/\pi_{22}} = \frac{\pi_{11}\pi_{22}}{\pi_{21}\pi_{12}} = \frac{m_{11}m_{22}}{m_{21}m_{12}}$$

also called the cross-product ratio.

Example: Chinook Salmon

Early run:

[Table of counts not recoverable: Female and Male by Hook & Line and Net.]

Odds of capturing a female:

    hook & line:  .5911 / .4089 = 1.45
    net:          .4496 / .5504 = .82
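The definitions above can be checked numerically. The following is a minimal sketch in Python; the function names and the joint proportions are hypothetical and chosen only for illustration (they are not the salmon data). It computes the odds ratio once as the cross-product ratio and once as a ratio of the column-conditional odds π1j/π2j.

```python
import math


def cross_product_ratio(p11, p12, p21, p22):
    """Odds ratio of a 2x2 table given as proportions (or counts)."""
    return (p11 * p22) / (p21 * p12)


def conditional_odds(p1j, p2j):
    """Odds of being in row 1 rather than row 2, given column j."""
    return p1j / p2j


# Hypothetical joint proportions that sum to 1 (illustration only).
p11, p12, p21, p22 = 0.30, 0.20, 0.10, 0.40

alpha = cross_product_ratio(p11, p12, p21, p22)
alpha_from_odds = conditional_odds(p11, p21) / conditional_odds(p12, p22)

# Both routes give the same value, up to floating-point rounding.
print(alpha, alpha_from_odds)
```

The two results agree because the column totals π1j + π2j cancel in each conditional odds.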

Estimated odds ratio:

$$\hat\alpha = \frac{\text{odds of capturing a female with hook \& line}}{\text{odds of capturing a female with a net}} = \frac{1.45}{.82} = 1.77$$

An approximate 95% confidence interval for α is (1.30, 2.42).

Conclusion: The odds that a captured fish is female are about 30 to 142 percent greater with hook & line than with a net.

Late run:

[Table of counts not recoverable: Female and Male by Hook & Line and Net.]

Estimated odds ratio:

$$\hat\alpha = \frac{.5512/.4488}{\text{odds of capturing a female with a net}} = \frac{1.23}{.63} = 1.96$$

An approximate 95% confidence interval for α is (1.4, 2.6).

    /* Program to analyze the 1 Chinook salmon data.
       This program is stored in the file chinook2.sas */

    data set1;
      infile 'chinook.dat';
      input (year month day biweek run gear age sexa length) ( $1. 4.);
      rage=int(age/10);
      oage=age-(10*rage);
      if(sexa = 'F') then sex=1;
      else sex=2;
    run;

    /* Attach labels to categories */
    proc format;
      value run 1 = 'Early' 2 = 'Late';
      value sex 1 = 'Female' 2 = 'Male';
      value gear 1 = 'Hook' 2 = 'Net';
    run;

    proc sort data=set1; by run; run;

    /* Examine partial association between sex and
       method of capture within each run. */
    proc freq data=set1; by run;
      table gear*sex / chisq Fisher all nopercent nocol expected;
      format sex sex. gear gear. run run.;
    run;

run=Early

The FREQ Procedure

Table of gear by sex

[PROC FREQ output; numeric values not recoverable. The output lists, for the table of gear (Hook, Net) by sex (Female, Male): frequency, expected count and row percent; the Chi-Square, Likelihood Ratio Chi-Square, Continuity Adj. Chi-Square and Mantel-Haenszel Chi-Square tests; the Phi Coefficient, Contingency Coefficient and Cramer's V; Fisher's Exact Test; Gamma, Kendall's Tau-b, Stuart's Tau-c, Somers' D (C|R and R|C), Pearson and Spearman correlations; Lambda (asymmetric and symmetric); and the Uncertainty Coefficients.]

[PROC FREQ output, continued; numeric values not recoverable. Estimates of the Relative Risk (Row1/Row2) for run=Early: Case-Control (Odds Ratio), Cohort (Col1 Risk) and Cohort (Col2 Risk), each with 95% confidence limits.]

run=Late

Table of gear by sex

[PROC FREQ output; numeric values not recoverable. The same table and statistics as for run=Early: frequency, expected count and row percent; chi-square tests; Phi Coefficient, Contingency Coefficient and Cramer's V; Fisher's Exact Test; Gamma, Kendall's Tau-b, Stuart's Tau-c, Somers' D, Pearson and Spearman correlations; Lambda; and the Uncertainty Coefficients.]

[PROC FREQ output, continued; numeric values not recoverable. Estimates of the Relative Risk (Row1/Row2) for run=Late: Case-Control (Odds Ratio), Cohort (Col1 Risk) and Cohort (Col2 Risk), each with 95% confidence limits.]

Properties

(i) The odds ratio is not "margin sensitive". Choose t1, t2, s1, s2 such that t1 + t2 = 1 and s1 + s2 = 1. Then the odds ratio for

    s1 t1 π11    s1 t2 π12
    s2 t1 π21    s2 t2 π22

is

$$\frac{(s_1 t_1 \pi_{11})(s_2 t_2 \pi_{22})}{(s_1 t_2 \pi_{12})(s_2 t_1 \pi_{21})} = \frac{\pi_{11}\pi_{22}}{\pi_{12}\pi_{21}} = \alpha$$

(ii) $0 \le \alpha \le \infty$

(iii) α = 1 corresponds to independence.

(iv) Interchanging the rows of the table produces 1/α.

(v) Interchanging the columns of the table produces 1/α.

(vi) Interchanging both the rows and the columns of the 2 × 2 table produces α.

So, α = 4 indicates the same level of association as α = 1/4 = .25.
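Properties (i), (iv), (v) and (vi) are easy to confirm numerically. The sketch below uses a hypothetical table of proportions and hypothetical row and column multipliers; none of these values come from the examples in these notes.

```python
import math


def odds_ratio(t):
    """Cross-product ratio of a 2x2 table given as nested lists."""
    return (t[0][0] * t[1][1]) / (t[0][1] * t[1][0])


def rescale_margins(t, s, u):
    """Multiply row i by s[i] and column j by u[j]."""
    return [[s[i] * u[j] * t[i][j] for j in range(2)] for i in range(2)]


def swap_rows(t):
    return [t[1], t[0]]


def swap_columns(t):
    return [[row[1], row[0]] for row in t]


# Hypothetical proportions and multipliers (illustration only).
table = [[0.30, 0.20], [0.10, 0.40]]
s = (0.7, 0.3)   # row multipliers, s1 + s2 = 1
u = (0.2, 0.8)   # column multipliers, t1 + t2 = 1

alpha = odds_ratio(table)
print(alpha, odds_ratio(rescale_margins(table, s, u)))      # unchanged
print(odds_ratio(swap_rows(table)), 1 / alpha)               # inverted
print(odds_ratio(swap_columns(swap_rows(table))), alpha)     # restored
```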

Estimation: Substitute $Y_{ij}$ for $m_{ij}$ in α:

$$\hat\alpha = \frac{Y_{11}Y_{22}}{Y_{12}Y_{21}}$$

$\hat\alpha$ has a "large sample" $N(\alpha, \sigma^2_\infty)$ distribution with

$$\sigma^2_\infty = \alpha^2\left(\frac{1}{m_{11}} + \frac{1}{m_{12}} + \frac{1}{m_{21}} + \frac{1}{m_{22}}\right)$$

(when each $m_{ij}$ is large).

Estimation of the large sample variance:

$$\hat\sigma^2_{\infty,\hat\alpha} = \hat\alpha^2\left(\frac{1}{Y_{11}} + \frac{1}{Y_{12}} + \frac{1}{Y_{21}} + \frac{1}{Y_{22}}\right)$$

For small samples use

$$\hat\alpha^* = \frac{(Y_{11} + .5)(Y_{22} + .5)}{(Y_{12} + .5)(Y_{21} + .5)}$$

and

$$\hat\sigma^2_{\infty,\hat\alpha} = (\hat\alpha^*)^2\left(\frac{1}{Y_{11} + .5} + \frac{1}{Y_{12} + .5} + \frac{1}{Y_{21} + .5} + \frac{1}{Y_{22} + .5}\right)$$

Other smooth functions of α, say f(α), are often used as measures of association. The large sample distribution for $f(\hat\alpha)$ is

$$N\left(f(\alpha),\ [f'(\alpha)]^2\,\sigma^2_{\infty,\hat\alpha}\right)$$

The asymptotic variance is obtained from the delta method.

log-odds ratio: log(α)

(i) $-\infty < \log(\alpha) < \infty$

(ii) independence, α = 1, gives log(α) = 0

(iii) log(α) is not "margin sensitive"

(iv) log(α) and −log(α) = log(1/α) imply equal levels of association

(v) $\log(\hat\alpha)$ has a more nearly symmetric distribution than $\hat\alpha$ for smaller sample sizes
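The estimator and its large sample standard error can be sketched as follows. This is an illustrative Python version, assuming the counts are passed as four numbers; the function name, the `cont` argument (the constant added to every cell, 0.5 for the small-sample version) and the example counts are hypothetical.

```python
import math


def odds_ratio_estimate(y11, y12, y21, y22, cont=0.0):
    """Return (alpha_hat, se_alpha, se_log_alpha) for a 2x2 table of counts.

    cont is added to every cell; cont=0.5 gives the small-sample estimator.
    se_alpha follows from the delta method: alpha_hat * se_log_alpha.
    """
    c11, c12, c21, c22 = (y + cont for y in (y11, y12, y21, y22))
    alpha_hat = (c11 * c22) / (c12 * c21)
    se_log_alpha = math.sqrt(1 / c11 + 1 / c12 + 1 / c21 + 1 / c22)
    se_alpha = alpha_hat * se_log_alpha
    return alpha_hat, se_alpha, se_log_alpha


# Hypothetical counts (illustration only).
print(odds_ratio_estimate(10, 20, 30, 40))
print(odds_ratio_estimate(10, 20, 30, 40, cont=0.5))
```

Adding 0.5 to each cell keeps the estimate finite when a cell count is zero, which is the main reason for preferring it in small samples.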

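$$\log(\hat\alpha) \xrightarrow{\text{dist}} N\left(\log(\alpha),\ \frac{1}{m_{11}} + \frac{1}{m_{12}} + \frac{1}{m_{21}} + \frac{1}{m_{22}}\right)$$

as $m_{ij} \to \infty$ for all (i, j).

Approximate 95% confidence intervals:

For log(α):

$$\log(\hat\alpha) \pm (1.96)\sqrt{\frac{1}{Y_{11}} + \frac{1}{Y_{12}} + \frac{1}{Y_{21}} + \frac{1}{Y_{22}}} = [A, B]$$

For α: [exp(A), exp(B)]

Yule's Q: Yule (1900)

$$Q = \frac{\alpha - 1}{\alpha + 1} = \frac{\pi_{11}\pi_{22} - \pi_{12}\pi_{21}}{\pi_{11}\pi_{22} + \pi_{12}\pi_{21}}$$

Properties:

1. $-1 \le Q \le 1$
2. Q = 0 for the independence model
3. Q = 1 when either π12 = 0 or π21 = 0
4. Q = −1 when either π11 = 0 or π22 = 0
5. Q is "symmetric". When the columns (or rows) of a 2 × 2 table are interchanged, then α → 1/α and Q → −Q.
6. Q is a margin free measure.

Q is the Goodman-Kruskal Gamma statistic for 2 × 2 tables.

Estimation:

$$\hat Q = \frac{\hat\alpha - 1}{\hat\alpha + 1} = \frac{Y_{11}Y_{22} - Y_{12}Y_{21}}{Y_{11}Y_{22} + Y_{12}Y_{21}}$$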
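The interval for α and the estimate of Yule's Q can be sketched in a few lines of Python. The function names and the example counts below are hypothetical; the normal quantile comes from the standard library's `statistics.NormalDist`.

```python
import math
from statistics import NormalDist


def odds_ratio_ci(y11, y12, y21, y22, conf=0.95):
    """Approximate confidence interval for alpha, built on the log scale."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)   # 1.96 for conf = 0.95
    log_alpha = math.log((y11 * y22) / (y12 * y21))
    se = math.sqrt(1 / y11 + 1 / y12 + 1 / y21 + 1 / y22)
    a, b = log_alpha - z * se, log_alpha + z * se   # [A, B] for log(alpha)
    return math.exp(a), math.exp(b)                 # [exp(A), exp(B)]


def yules_q(y11, y12, y21, y22):
    """Estimate of Yule's Q, equal to (alpha_hat - 1) / (alpha_hat + 1)."""
    return (y11 * y22 - y12 * y21) / (y11 * y22 + y12 * y21)


# Hypothetical counts (illustration only).
counts = (10, 20, 30, 40)
print(odds_ratio_ci(*counts))
print(yules_q(*counts))
# Interchanging the columns flips the sign of Q.
print(yules_q(20, 10, 40, 30))
```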

Large sample distribution of $\hat Q$: as $m_{ij} \to \infty$ for all i = 1, 2; j = 1, 2,

$$\hat Q \sim N\left(\frac{\alpha - 1}{\alpha + 1},\ \frac{(1 - Q^2)^2}{4}\left(\frac{1}{m_{11}} + \frac{1}{m_{12}} + \frac{1}{m_{21}} + \frac{1}{m_{22}}\right)\right)$$

The variance is

$$[f'(\alpha)]^2\,\alpha^2\left(\frac{1}{m_{11}} + \frac{1}{m_{12}} + \frac{1}{m_{21}} + \frac{1}{m_{22}}\right) \quad \text{where} \quad f(\alpha) = \frac{\alpha - 1}{\alpha + 1}$$

What is a large or substantial association?

What is a large value of α? ln(α)? Q?

1. Use large sample distributions to construct confidence intervals or test hypotheses, e.g. $H_0: \alpha = 1$ versus $H_A: \alpha \ne 1$. Reject $H_0$ if

$$|Z| = \frac{|\ln(\hat\alpha)|}{\hat\sigma_{\ln(\hat\alpha)}} > Z_{\alpha/2}$$

(In the critical value $Z_{\alpha/2}$, α denotes the significance level, not the odds ratio.)

2. Suppose you discover that $\hat\alpha = 2.4$ is "significantly different" from one. Is an odds ratio of 2.4 large enough to have practical importance? There are no absolute guidelines. It depends on the subject matter or field of study.

3. A useful application of measures of association for two-way tables is to assess differences in levels of association across time or locations.

[Example not recoverable: four 2 × 2 tables of sex (male, female) by college graduate (Yes, No), one per year, each with its estimated odds ratio.]
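The test of α = 1 and the standard error of the estimate of Q can be sketched as below. This is an illustrative Python version with hypothetical function names and hypothetical counts; the two-sided p-value is an addition for convenience and is equivalent to comparing |Z| with the normal critical value.

```python
import math
from statistics import NormalDist


def log_odds_z_test(y11, y12, y21, y22):
    """Z statistic and two-sided p-value for H0: alpha = 1."""
    log_alpha = math.log((y11 * y22) / (y12 * y21))
    se = math.sqrt(1 / y11 + 1 / y12 + 1 / y21 + 1 / y22)
    z = log_alpha / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def yules_q_se(y11, y12, y21, y22):
    """Large sample standard error of Q_hat: (1 - Q^2)/2 * se(log alpha_hat)."""
    q = (y11 * y22 - y12 * y21) / (y11 * y22 + y12 * y21)
    se_log = math.sqrt(1 / y11 + 1 / y12 + 1 / y21 + 1 / y22)
    return (1 - q * q) / 2 * se_log


# Hypothetical counts (illustration only).
z, p = log_odds_z_test(10, 20, 30, 40)
print(z, p)                      # reject H0 at the 5% level only if p < 0.05
print(yules_q_se(10, 20, 30, 40))
```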

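Relative Risk

                Heart attack   No heart attack
    Placebo         π11             π12          n1
    Aspirin         π21             π22          n2

Relative risk of a heart attack:

$$\frac{\Pr\{\text{heart attack} \mid \text{placebo}\}}{\Pr\{\text{heart attack} \mid \text{aspirin}\}} = \frac{\pi_{11}}{\pi_{21}} \doteq \left(\frac{\pi_{11}}{\pi_{21}}\right)\left(\frac{\pi_{22}}{\pi_{12}}\right) = \alpha$$

when π11 and π21 are small. In that case, $\pi_{22} = 1 - \pi_{21} \doteq 1$ and $\pi_{12} = 1 - \pi_{11} \doteq 1$.

Data:

                Heart attack   No heart attack
    Placebo         189            10845          n1 = 11034
    Aspirin         104            10933          n2 = 11037

Estimated relative risk of heart attack for those taking the placebo versus those taking the aspirin:

$$\frac{189/11034}{104/11037} = \frac{.0171}{.0094} = 1.82$$

Odds ratio:

$$\hat\alpha = \frac{\text{odds of a heart attack for placebo users}}{\text{odds of a heart attack for aspirin users}} = \frac{(189)(10933)}{(104)(10845)} = 1.83$$

Confidence interval (case-control):

$$\hat\alpha = 1.83205, \qquad \log(\hat\alpha) = .6054, \qquad S^2_{\log(\hat\alpha)} = \sum_i \sum_j \frac{1}{Y_{ij}} = .01509$$

Then

$$\log(\hat\alpha) \pm z_{\alpha/2}\, S_{\log(\hat\alpha)} \ \Rightarrow\ .6054 \pm (1.96)\sqrt{.01509} \ \Rightarrow\ (.36467,\ .84621) \ \Rightarrow\ (e^{.36467},\ e^{.84621}) \ \Rightarrow\ (1.440,\ 2.331)$$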
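The arithmetic of the aspirin example can be reproduced with a short Python sketch. It assumes the counts 189 and 10845 for placebo and 104 and 10933 for aspirin, which are the values consistent with the totals n1 = 11034 and n2 = 11037; the variable names are illustrative.

```python
import math
from statistics import NormalDist

# Assumed counts: rows are placebo, aspirin; columns are heart attack, none.
y11, y12 = 189, 10845
y21, y22 = 104, 10933

alpha_hat = (y11 * y22) / (y12 * y21)
log_alpha = math.log(alpha_hat)
var_log = 1 / y11 + 1 / y12 + 1 / y21 + 1 / y22

z = NormalDist().inv_cdf(0.975)            # approximately 1.96
lower = math.exp(log_alpha - z * math.sqrt(var_log))
upper = math.exp(log_alpha + z * math.sqrt(var_log))

print(round(alpha_hat, 4), round(log_alpha, 4), round(var_log, 5))
print(round(lower, 3), round(upper, 3))
```

Because the interval excludes 1, the data indicate higher odds of a heart attack in the placebo group.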

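Relative risk:

$$RR_{col1} = \frac{189/11034}{104/11037} = 1.8178, \qquad \log(RR_{col1}) = .5976$$

$$S^2_{\log(RR_{col1})} = \frac{1 - \hat\pi_{11}}{n_1\hat\pi_{11}} + \frac{1 - \hat\pi_{21}}{n_2\hat\pi_{21}} = \frac{Y_{12}}{n_1 Y_{11}} + \frac{Y_{22}}{n_2 Y_{21}} = .014725$$

$$\log(RR_{col1}) \pm (1.96)\sqrt{S^2_{\log(RR)}} \ \Rightarrow\ .5976 \pm (1.96)\sqrt{.014725} \ \Rightarrow\ (.35979,\ .835467)$$

An approximate 95% confidence interval is

$$(e^{.35979},\ e^{.835467}) \ \Rightarrow\ (1.433,\ 2.306)$$

Derivation of the variance:

$$\log(RR) = \log\left(\frac{\pi_{11}}{\pi_{21}}\right) = \log(\pi_{11}) - \log(\pi_{21}) \equiv g(\pi_{11}, \pi_{21})$$

Independent binomial experiments:

$$Y_{11} \sim \text{Bin}(n_1, \pi_{11}), \quad \hat\pi_{11} = \frac{Y_{11}}{n_1}; \qquad Y_{21} \sim \text{Bin}(n_2, \pi_{21}), \quad \hat\pi_{21} = \frac{Y_{21}}{n_2}$$

$$V = \text{Var}\begin{pmatrix}\hat\pi_{11} \\ \hat\pi_{21}\end{pmatrix} = \begin{bmatrix} \dfrac{\pi_{11}(1 - \pi_{11})}{n_1} & 0 \\ 0 & \dfrac{\pi_{21}(1 - \pi_{21})}{n_2} \end{bmatrix}$$

Delta method:

$$\text{Var}\left(g(\hat\pi_{11}, \hat\pi_{21})\right) \doteq \begin{bmatrix} \dfrac{1}{\pi_{11}} & -\dfrac{1}{\pi_{21}} \end{bmatrix} V \begin{bmatrix} \dfrac{1}{\pi_{11}} \\ -\dfrac{1}{\pi_{21}} \end{bmatrix} = \frac{1 - \pi_{11}}{n_1\pi_{11}} + \frac{1 - \pi_{21}}{n_2\pi_{21}}$$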
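The relative risk interval can be checked the same way. The sketch below assumes the aspirin counts 189 of 11034 (placebo) and 104 of 11037 (aspirin); the function name is hypothetical, and the variance is the delta-method expression for two independent binomial proportions.

```python
import math
from statistics import NormalDist


def relative_risk_ci(y1, n1, y2, n2, conf=0.95):
    """Relative risk (y1/n1)/(y2/n2) with a delta-method interval on the log scale."""
    p1, p2 = y1 / n1, y2 / n2
    rr = p1 / p2
    # Var(log RR) = (1 - p1)/(n1 p1) + (1 - p2)/(n2 p2)
    var_log_rr = (1 - p1) / (n1 * p1) + (1 - p2) / (n2 * p2)
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    half = z * math.sqrt(var_log_rr)
    return rr, var_log_rr, (math.exp(math.log(rr) - half),
                            math.exp(math.log(rr) + half))


# Assumed counts for the aspirin example.
rr, var_log_rr, (lower, upper) = relative_risk_ci(189, 11034, 104, 11037)
print(round(rr, 4), round(var_log_rr, 6), round(lower, 3), round(upper, 3))
```

The interval is close to the one for the odds ratio because heart attacks are rare in both groups.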

SAS Code

    /* This program computes the odds ratio for a 2x2 table.
       It is stored in the file aspirin.sas */

    DATA SET1;
      INPUT ROW COL COUNT;
      LABEL ROW = 'Treatment' COL = 'Heart attack';
      CARDS;
    1 1 189
    1 2 10845
    2 1 104
    2 2 10933
    run;

    /* Assign labels to values */
    PROC FORMAT;
      VALUE RFMT 1 = 'Placebo' 2 = 'Aspirin';
      VALUE CFMT 1 = 'Yes' 2 = 'No';
    run;

    /* Analyze the table of counts */
    PROC FREQ DATA=SET1;
      TABLES ROW*COL / CHISQ ALL NOPERCENT NOCOL EXPECTED;
      WEIGHT COUNT;
      FORMAT ROW RFMT. COL CFMT.;
    run;

The FREQ Procedure

Table of ROW by COL

[PROC FREQ output; numeric values not recoverable. The output lists, for the table of ROW (Treatment: Placebo, Aspirin) by COL (Heart attack: Yes, No): frequency, expected count and row percent; the Chi-Square, Likelihood Ratio Chi-Square, Continuity Adj. Chi-Square and Mantel-Haenszel Chi-Square tests; the Phi Coefficient, Contingency Coefficient and Cramer's V; and Fisher's Exact Test.]

[PROC FREQ output, continued; numeric values not recoverable. Estimates of the Relative Risk (Row1/Row2): Case-Control (Odds Ratio), Cohort (Col1 Risk) and Cohort (Col2 Risk), each with 95% confidence limits. Sample Size = 22071.]

S-PLUS Code

    # An S-PLUS function to compute
    # an odds ratio and construct an
    # approximate confidence interval
    # This code is posted in the file
    # oddsratio.ssc

    oddsratio <- function(table, conf=.95, cont=0.0)
    {
      level <- 1-(1-conf)/2
      tablec <- table + cont
      alpha <- tablec[1,1]*tablec[2,2]/
               (tablec[1,2]*tablec[2,1])
      la <- log(alpha)
      sla <- sqrt(sum(tablec^(-1)))
      sa <- alpha*sla
      lowera <- round(exp(la-qnorm(level)*sla),4)
      uppera <- round(exp(la+qnorm(level)*sla),4)
      confper <- round(conf*100,1)
      alphar <- round(alpha,4)
      sar <- round(sa,4)
      cat("\n", "odds-ratio = ", alphar)
      cat("\n", "std. error = ", sar)
      cat("\n", confper, "% confidence interval")
      cat("\n", " lower limit upper limit")
      cat("\n", " ", lowera, " ", uppera)
    }

    # To run this function, first create a
    # table of counts
    #
    # aspirin <- matrix(c(189, 10845, 104,
    #                     10933), 2, 2, byrow=T)
    #
    # Then source this function into the
    # command window
    #
    # source("yourdirectory/oddsratio.ssc")
    #
    # Then execute the function
    #
    # oddsratio(aspirin, conf=.95, cont=0.0)
    #

[Output of the oddsratio function; numeric values not recoverable: odds-ratio, std. error, and the lower and upper limits of the confidence interval.]

The heart attack study is an example of a prospective study.

In such studies:

• Patients are randomly assigned to treatment groups. The treatments are administered.
• The proportion that give a certain response is recorded for each treatment group.

    placebo: Y11/n1 = observed proportion that experience a heart attack
    aspirin: Y21/n2 = observed proportion that experience a heart attack

These are direct estimates of the population proportions needed to determine relative risk.

Retrospective studies (case-control):

Examine what has happened in the past.

Example: Take a simple random sample of n1 patient records (cases), e.g. women who have experienced a heart attack.

Classify each woman according to whether or not she ever used oral contraception.

Take an independent simple random sample of n2 controls, and classify each woman in the same way.

    Oral contraceptive use   Heart attack   No heart attack
    Yes                          Y11             Y12
    No                           Y21             Y22
                                 n1              n2

$$\frac{Y_{11}}{n_1} \ \text{estimates} \ \Pr\{\text{used oral contraceptive} \mid \text{experienced a heart attack}\}$$

$$\frac{Y_{12}}{n_2} \ \text{estimates} \ \Pr\{\text{used oral contraceptive} \mid \text{never had a heart attack}\}$$

These do not provide a direct estimate of

$$RR = \frac{\Pr\{\text{heart attack} \mid \text{used oral contraception}\}}{\Pr\{\text{heart attack} \mid \text{do not use oral contraception}\}}$$

Bayes' rule to the rescue?

$$\Pr\{\text{heart attack} \mid \text{use o.c.}\} = \frac{\Pr\{\text{use o.c.} \mid \text{heart attack}\}\,\Pr\{\text{heart attack}\}}{\Pr\{\text{use o.c.}\}}$$

Then

$$RR = \frac{\Pr\{\text{use o.c.} \mid \text{heart attack}\}\,/\,\Pr\{\text{use o.c.}\}}{\Pr\{\text{do not use o.c.} \mid \text{heart attack}\}\,/\,\Pr\{\text{do not use o.c.}\}}$$

Relative risk of heart attack cannot be estimated without additional information on Pr{use o.c.}, the proportion of women in the population who use oral contraceptives.

Approximate the relative risk with an odds ratio:

$$\alpha = \frac{\Pr\{\text{heart attack} \mid \text{use o.c.}\}\,/\,\Pr\{\text{no heart attack} \mid \text{use o.c.}\}}{\Pr\{\text{heart attack} \mid \text{do not use o.c.}\}\,/\,\Pr\{\text{no heart attack} \mid \text{do not use o.c.}\}}$$

Which is equal to

$$\frac{\dfrac{\Pr\{\text{use o.c.} \mid \text{heart attack}\}\Pr\{\text{heart attack}\}/\Pr\{\text{use o.c.}\}}{\Pr\{\text{use o.c.} \mid \text{no heart attack}\}\Pr\{\text{no heart attack}\}/\Pr\{\text{use o.c.}\}}}{\dfrac{\Pr\{\text{do not use o.c.} \mid \text{heart attack}\}\Pr\{\text{heart attack}\}/\Pr\{\text{do not use o.c.}\}}{\Pr\{\text{do not use o.c.} \mid \text{no heart attack}\}\Pr\{\text{no heart attack}\}/\Pr\{\text{do not use o.c.}\}}}$$

Which is equal to

$$\frac{\Pr\{\text{use o.c.} \mid \text{heart attack}\}\,/\,\Pr\{\text{do not use o.c.} \mid \text{heart attack}\}}{\Pr\{\text{use o.c.} \mid \text{no heart attack}\}\,/\,\Pr\{\text{do not use o.c.} \mid \text{no heart attack}\}}$$

An estimate is

$$\frac{(Y_{11}/n_1)\,/\,(Y_{21}/n_1)}{(Y_{12}/n_2)\,/\,(Y_{22}/n_2)} = \frac{Y_{11}Y_{22}}{Y_{12}Y_{21}}$$
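The invariance argument can be illustrated with a small numerical check. The Python sketch below uses a hypothetical population distribution of oral contraceptive use and heart attack; the probabilities are invented for illustration. It compares the odds ratio computed from the prospective conditionals with the one computed from the retrospective conditionals, which are what a case-control sample estimates.

```python
import math

# Hypothetical joint probabilities (illustration only).
p_use_ha, p_use_no = 0.002, 0.298      # uses o.c.: heart attack, none
p_not_ha, p_not_no = 0.003, 0.697      # does not use o.c.

p_use = p_use_ha + p_use_no
p_not = p_not_ha + p_not_no
p_ha = p_use_ha + p_not_ha
p_no = p_use_no + p_not_no

# Prospective quantities: condition on oral contraceptive use.
rr = (p_use_ha / p_use) / (p_not_ha / p_not)
or_prospective = ((p_use_ha / p_use) / (p_use_no / p_use)) / (
    (p_not_ha / p_not) / (p_not_no / p_not))

# Retrospective quantities: condition on heart attack status.
or_retrospective = ((p_use_ha / p_ha) / (p_not_ha / p_ha)) / (
    (p_use_no / p_no) / (p_not_no / p_no))

print(rr, or_prospective, or_retrospective)
```

The two odds ratios coincide, and both are close to the relative risk here because heart attacks are rare in this hypothetical population.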
