WORKSHOP 3
Measuring Association

Concepts

Analysing Categorical Data
o Testing of Proportions
o Contingency Tables & Tests
o Odds Ratios

Linear Association Measures
o Correlation
o Simple Linear Regression Analysis

Workshop 3 ~ Measuring Association    Page 1 of 16
Analysing Categorical Data

A review of methods used to describe the relationship between categorical variables and to compare proportions.

o Contingency Tables & Tests
   - Goodness of Fit
   - Association / Independence
o Odds Ratios
o Testing of Proportions ~ can also calculate C.I.s and apply a z-test to proportion(s). (Less common approach (REF: 1.7))

Contingency Tables & Tests

Two types of test:
o Goodness of fit
o Tests of association and independence

Goodness of Fit Test

Tests whether the distribution of a variable conforms to an expected distribution.
Example: (REF: Chapter 1)

Snapdragon flowers can be coloured red, pink or white. According to the Mendelian genetic model, self-pollinated pink flowers should produce progeny plants that are red, pink or white in the ratio 1:2:1 respectively.

=> H0: Pr(R) = 0.25; Pr(P) = 0.50; Pr(W) = 0.25

A sample of 234 plants produced the following colours:

          Count    Observed proportion
Red         54     Pr(R) = 0.231
Pink       122     Pr(P) = 0.521
White       58     Pr(W) = 0.248

To test H0, use the chi-square (χ²) test:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

where O is the observed frequency and E is the expected frequency.

Calculations:

            O       E      (O-E)²/E
Red         54     58.5      0.35
Pink       122    117.0      0.21
White       58     58.5      0.00
Total      234    234.0      0.56

Compare with χ² with (# of categories - 1) = 2 DF.
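The goodness-of-fit calculation can be sketched in a few lines of Python. This is a minimal illustration, not SPSS output: the counts 54, 122 and 58 are assumed to be the observed values in the example, and the p-value uses a closed form that holds only for 2 degrees of freedom.

```python
import math

# Observed counts: assumed to be 54 red, 122 pink, 58 white (234 plants in total).
observed = {"red": 54, "pink": 122, "white": 58}
# Expected proportions under H0 (Mendelian 1:2:1 ratio).
expected_prop = {"red": 0.25, "pink": 0.50, "white": 0.25}


def chi_square_gof(observed, expected_prop):
    """Return (chi-square statistic, degrees of freedom, expected counts)."""
    n = sum(observed.values())
    expected = {k: n * p for k, p in expected_prop.items()}
    stat = sum((observed[k] - expected[k]) ** 2 / expected[k] for k in observed)
    df = len(observed) - 1
    return stat, df, expected


stat, df, expected = chi_square_gof(observed, expected_prop)

# For exactly 2 degrees of freedom the chi-square upper tail has the closed
# form exp(-x / 2); this shortcut does NOT hold for other df.
p_value = math.exp(-stat / 2)

print(f"chi-square = {stat:.2f} on {df} df, p = {p_value:.2f}")
```

For other degrees of freedom the tail probability would normally come from chi-square tables or a statistics package.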
Pr(χ² > 0.56 from χ² with 2 DF) = 0.76

As the p-value > 0.05 (significance level = 5%), we cannot reject H0.

Note: Critical χ² with 2 DF = 5.99 at the 5% significance level, and the calculated χ² is less than the critical value, so we cannot reject H0.

Tests of Association & Independence

Example: The CF_Genotypes data set contains records of patients who were genotyped for a specific genetic variation, together with whether each patient was infected with Pseudomonas aeruginosa (PA). The expectation was that those with the less common A variant would have more severe disease.

SPSS Analysis (Analyse > Descriptive Statistics > Crosstabs)

PA Infection Present * API Genotype Variant Crosstabulation

OBSERVED Count
                               API Genotype Variant
                                A       G      Total
PA Infection Present    No     12     108       120
                        Yes     4      16        20
Total                          16     124       140
H0: The rate of PA infection is the same in both genotypes.

General formula for expected frequencies:

$$E = \frac{\text{row total} \times \text{column total}}{\text{overall total}}$$

From SPSS:

PA Infection Present * API Genotype Variant Crosstabulation

                                               API Genotype Variant
                                                A        G      Total
PA Infection Present   No    Count             12      108       120
                             Expected Count    13.7    106.3     120.0
                             Residual (O-E)    -1.7      1.7
                       Yes   Count              4       16        20
                             Expected Count     2.3     17.7      20.0
                             Residual (O-E)     1.7     -1.7
Total                        Count             16      124       140
                             Expected Count    16.0    124.0     140.0

Chi-Square Tests (sample output)

                       Value    df    Asymp. Sig.    Exact Sig.    Exact Sig.
                                      (2-sided)      (2-sided)     (1-sided)
Pearson Chi-Square     1.694    1     .193
Fisher's Exact Test                                  .247          .174
N of Valid Cases       140

b 1 cells (25.0%) have expected count less than 5. The minimum expected count is 2.29.
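The expected-frequency formula and the Pearson chi-square statistic can be checked by hand with a short Python sketch. The 2 × 2 counts below are assumed to be those of the crosstabulation (rows: infection No/Yes; columns: genotype A/G), and the p-value uses an identity that is valid only for 1 degree of freedom.

```python
import math

# Observed 2x2 counts, assumed from the crosstab: rows = PA infection (No, Yes),
# columns = genotype (A, G).
observed = [[12, 108],
            [4, 16]]


def expected_counts(table):
    """E = row total * column total / overall total, for every cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def pearson_chi_square(table):
    """Pearson chi-square statistic and its degrees of freedom."""
    exp = expected_counts(table)
    stat = sum((o - e) ** 2 / e
               for o_row, e_row in zip(table, exp)
               for o, e in zip(o_row, e_row))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


expected = expected_counts(observed)
stat, df = pearson_chi_square(observed)

# With 1 df, chi-square is the square of a standard normal, so the upper-tail
# probability is erfc(sqrt(x / 2)). This identity is specific to df = 1.
p_value = math.erfc(math.sqrt(stat / 2))

min_expected = min(min(row) for row in expected)
print(f"chi-square = {stat:.3f}, df = {df}, p = {p_value:.3f}, "
      f"minimum expected count = {min_expected:.2f}")
```

Because one expected count falls below 5, the exact (Fisher) test reported by SPSS is the safer reference; it is not reproduced in this sketch.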
Compare with χ² with 1 DF.
Note: DF = (# rows - 1) × (# cols - 1)

Conclusion: At the 5% significance level we cannot reject H0.
=> There is no statistically significant evidence that PA infection rates are higher in the Genotype A group.

Odds Ratios

The odds of an event E are defined as the ratio of the chance that E occurs vs the chance that E does not occur.

Let Pr(E) be the probability (chance) of E occurring
=> 1 - Pr(E) is the probability of E not occurring

$$\text{Odds of } E = \frac{\Pr(E)}{1 - \Pr(E)}$$

Example
o If the probability of E is ¼, then the odds of E are {¼ / ¾} = 1/3 or 1:3
o If the probability of E is ½, then the odds of E are 1.

The odds ratio, θ, is the ratio of the odds of two events (or conditions).

Example ~ Event 1: Low birth weight in smokers; Event 2: Low birth weight in non-smokers (REF: 1.9)
CF Genotype Data Example

                               API Genotype Variant
                                A             G            Total
PA Infection Present    No     12 (n11)     108 (n12)       120
                        Yes     4 (n21)      16 (n22)        20
Total                          16           124             140

The odds ratio compares:
o Odds of PA infection in the Genotype A group, Odds_A
with
o Odds of PA infection in the Genotype G group, Odds_G

Odds_A = (4/16) / (1 - 4/16) = 0.333
Odds_G = (16/124) / (1 - 16/124) = 0.148

Odds ratio, θ̂ = 0.333 / 0.148 = 2.25

=> We estimate that the odds of contracting a PA infection for patients in the Genotype A group are more than twice those for patients in the Genotype G group.
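The odds and odds-ratio arithmetic can be sketched as follows. The counts (4 infected out of 16 in group A, 16 out of 124 in group G) are assumed to be those of the CF genotype table, and the helper name `odds` is purely illustrative.

```python
def odds(p):
    """Odds of an event with probability p (requires 0 <= p < 1)."""
    return p / (1 - p)


# Counts assumed from the CF genotype table: infected / group size.
infected_a, total_a = 4, 16
infected_g, total_g = 16, 124

odds_a = odds(infected_a / total_a)
odds_g = odds(infected_g / total_g)
odds_ratio = odds_a / odds_g

print(f"Odds_A = {odds_a:.3f}, Odds_G = {odds_g:.3f}, OR = {odds_ratio:.2f}")
```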
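Note:
1. For a 2 × 2 contingency table,
   $$\hat{\theta} = \frac{n_{11} \times n_{22}}{n_{12} \times n_{21}}$$
   Which odds this cross-product compares depends on how the rows and columns are ordered. With "No" as the first row, as in the CF table, it is the odds ratio for no infection (A vs G); the odds ratio for infection is its reciprocal.
2. The odds ratio is not Normally distributed, but the log odds ratio is (approximately). We usually work with the C.I. for the log odds ratio and present results as the exponential of the C.I.
3. If the exponential of the C.I. includes 1, it is possible that the odds of both events are equal.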
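Notes 2 and 3 can be illustrated with a short Python sketch. The notes do not say how the standard error of the log odds ratio is obtained; the sketch assumes the common large-sample (Wald) formula sqrt(1/a + 1/b + 1/c + 1/d), and the cell counts passed in are illustrative assumptions.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio (a*d)/(b*c) with a Wald interval built on the log scale.

    a, b, c, d are the four cell counts of a 2x2 table; z = 1.96 gives an
    approximate 95% interval. All four counts must be non-zero.
    """
    or_hat = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # assumed Wald formula
    log_or = math.log(or_hat)
    lower = math.exp(log_or - z * se_log)
    upper = math.exp(log_or + z * se_log)
    return or_hat, lower, upper


# Illustrative counts (assumed): infected / not infected in group A, then group G.
or_hat, lower, upper = odds_ratio_ci(4, 12, 16, 108)
includes_one = lower <= 1 <= upper

print(f"OR = {or_hat:.2f}, approx. 95% C.I. = ({lower:.2f}, {upper:.2f}), "
      f"includes 1: {includes_one}")
```

With small cell counts such as these the large-sample interval is only a rough guide; exact methods may be preferable.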
Linear Association Measures

A review of methods used to describe a LINEAR relationship between continuous variables.
o Correlation
o Simple linear regression

Correlation

Describes the strength of a linear relationship between continuous variables.

Correlation coefficient range: -1 to 1
o -1 => Perfect Negative Linear Relationship
o 1 => Perfect Positive Linear Relationship
o 0 => No Linear Relationship

[Figure: scatter plot of Y against X]
[Figures: three further scatter plots of Y against X]
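The three reference values of the correlation coefficient (-1, 0 and 1) can be reproduced with a small Python sketch. The data are made up purely for illustration, and the function is a plain implementation of the usual Pearson formula rather than any particular package's routine.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
r_pos = pearson_r(x, [2, 4, 6, 8, 10])   # perfect positive linear relationship
r_neg = pearson_r(x, [10, 8, 6, 4, 2])   # perfect negative linear relationship
r_none = pearson_r(x, [2, 1, 0, 1, 2])   # symmetric V shape: no linear relationship
```

The last case shows that a correlation of 0 rules out a linear relationship only: Y clearly depends on X there, just not linearly.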
Simple Linear Regression (SLR)

Method of estimating the linear relationship between continuous variables.

Terminology:
o Y: Dependent variable, the variable to be predicted
o X: Independent variable, explanatory variable

SLR parameters

The objective is to estimate the straight line that describes the relationship between Y and X.

Regression line: $Y = \alpha + \beta x + \varepsilon$, where the error $\varepsilon \sim N(0, \sigma^2)$

We require a method to estimate α and β: use the method of Least Squares.

Find estimators $\hat{\alpha}$ and $\hat{\beta}$ such that

$$S = \sum_{i=1}^{n} \left( y_i - \hat{\alpha} - \hat{\beta} x_i \right)^2$$

is minimized.

ANOVA for SLR:
o Test H0: β = 0 vs HA: β ≠ 0
o Divide the total variation in the data into:
   - variation due to the regression line
   - residual variation
o Total Variation = Regression + Residual

Source of variation    df     Sum of squares    Mean square         F-ratio            Sig.
Regression             1      Regression SS     MS_reg = SS/df      MS_reg / MS_res    Pr(F > F-ratio)
Residual (Error)       n-2    Residual SS       MS_res = SS/df
Total                  n-1    Total SS

If Sig. < significance level, then reject H0. Conclude β ≠ 0 and that there is evidence of a linear relationship between Y and X.

Note: From the ANOVA table, MS_res provides an unbiased estimate of the random, unexplained variation in the data; i.e. an unbiased estimate of σ².
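The least-squares estimates and the ANOVA decomposition can be sketched in Python as below. The closed-form estimates used (slope = S_xy / S_xx, intercept = mean of y minus slope times mean of x) are the standard minimisers of S; the data and the function name `fit_slr` are made up for illustration.

```python
def fit_slr(x, y):
    """Least-squares estimates and ANOVA quantities for y = alpha + beta*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    beta = s_xy / s_xx                  # slope that minimises S
    alpha = mean_y - beta * mean_x      # intercept that minimises S
    fitted = [alpha + beta * a for a in x]
    total_ss = sum((b - mean_y) ** 2 for b in y)
    residual_ss = sum((b - f) ** 2 for b, f in zip(y, fitted))
    regression_ss = total_ss - residual_ss
    ms_reg = regression_ss / 1          # regression has 1 df
    ms_res = residual_ss / (n - 2)      # unbiased estimate of sigma^2
    return {"alpha": alpha, "beta": beta, "regression_ss": regression_ss,
            "residual_ss": residual_ss, "total_ss": total_ss,
            "ms_res": ms_res, "f_ratio": ms_reg / ms_res}


# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
fit = fit_slr(x, y)
print(fit)
```

The Sig. column is the upper tail of the F distribution with (1, n-2) degrees of freedom; it is not computed here because it needs F tables or a statistics package.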
R²: Coefficient of Determination

The proportion of variation in Y that is attributed to its linear regression on X.

$$R^2 = \frac{\text{Regression Sum of Squares}}{\text{Total Sum of Squares}} = \frac{S_{xy}^2}{S_{xx} S_{yy}}$$

Range: 0 to 1
Closer to 1 => better fit of the regression line to the data

R² = (Correlation coefficient)²

EXAMPLE

Lung Function Data Set

FVC (forced vital capacity) and FEV (forced expiratory volume) measure the volume capacity of the lung and the air volume expired. Both are standard measurements of lung function and are expected to be highly correlated.

Dependent variable: Y ~ FEV
Independent variable: X ~ FVC

SPSS Analysis

Scatter plot of FEV vs FVC
[Figure: scatter plot of Forced Expiratory Volume (y-axis) against Forced Lung Capacity (x-axis)]

Correlation & R² (SPSS: Analyze > Regression > Linear)

Model Summary
Model    R    R Square    Adjusted R Square    Std. Error of the Estimate
1        .    .3          .3                   .1
a Predictors: (Constant), Forced Lung Capacity
b Dependent Variable: Forced Expiratory Volume
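The link between R, R Square and Adjusted R Square in a Model Summary can be sketched as follows. The data are made up (they are not the lung function data), and the adjusted value uses the usual one-predictor formula 1 - (1 - R²)(n - 1)/(n - 2).

```python
def sums_of_squares(x, y):
    """Return (S_xx, S_yy, S_xy) about the sample means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return s_xx, s_yy, s_xy


# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
n = len(x)
s_xx, s_yy, s_xy = sums_of_squares(x, y)

r = s_xy / (s_xx * s_yy) ** 0.5             # correlation coefficient
r_squared = s_xy ** 2 / (s_xx * s_yy)       # coefficient of determination

# The same quantity as Regression SS / Total SS.
beta = s_xy / s_xx
regression_ss = beta * s_xy
r_squared_from_ss = regression_ss / s_yy

# Adjusted R-squared for a model with one predictor.
adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - 2)
```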
ANOVA Table

Testing H0: β = 0

ANOVA
Model           Sum of Squares    df     Mean Square    F      Sig.
1 Regression    319.7             1      319.7          9.5    .
  Residual      9.57              114    71.
  Total         11.93             115
a Predictors: (Constant), Forced Lung Capacity
b Dependent Variable: Forced Expiratory Volume

As Sig. < 0.05 => there is strong evidence of a linear relationship between FEV and FVC.

Regression Estimators

Coefficients
                          Unstandardized Coefficients                      95% Confidence Interval for B
Model                     B        Std. Error        t        Sig.        Lower Bound    Upper Bound
1 (Constant)              33.5     7.7               .753     .           19.            7.75
  Forced Lung Capacity    .51      .9                9.3      .           .51            .7
a Dependent Variable: Forced Expiratory Volume

Regression line: FEV = 33.5 + .51 * FVC

t-test of β is significant => evidence of a linear relationship
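The t-test of the slope and the ANOVA F-test are two views of the same test in simple linear regression: the square of the t statistic equals the F-ratio. A minimal Python sketch on made-up data (not the lung function data) shows this, assuming the usual standard error formula sqrt(MS_res / S_xx) for the slope.

```python
import math


def slope_test(x, y):
    """Return (beta_hat, SE of beta_hat, t statistic for H0: beta = 0, F-ratio)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    beta = s_xy / s_xx
    alpha = mean_y - beta * mean_x
    residual_ss = sum((b - alpha - beta * a) ** 2 for a, b in zip(x, y))
    ms_res = residual_ss / (n - 2)
    se_beta = math.sqrt(ms_res / s_xx)      # standard error of the slope
    t_stat = beta / se_beta
    f_ratio = (beta * s_xy) / ms_res        # Regression SS (1 df) / MS_res
    return beta, se_beta, t_stat, f_ratio


# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
beta, se_beta, t_stat, f_ratio = slope_test(x, y)
print(f"beta = {beta:.3f}, SE = {se_beta:.3f}, t = {t_stat:.3f}, "
      f"t^2 = {t_stat ** 2:.3f}, F = {f_ratio:.3f}")
```

A 95% confidence interval for the slope would be the estimate plus or minus the critical t value on n-2 degrees of freedom times the standard error; the critical value has to come from t tables or a statistics package.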
Error Diagnostics

[Figure: Histogram of the Regression Standardized Residual. Dependent Variable: Forced Expiratory Volume. Y-axis: Frequency. Std. Dev = 1.00, Mean = 0.00, N = 116]

[Figure: Normal P-P Plot of Regression Standardized Residual. Dependent Variable: Forced Expiratory Volume. X-axis: Observed Cum Prob; Y-axis: Expected Cum Prob]
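The standardized residuals behind diagnostic plots of this kind can be sketched in Python. The sketch assumes the simple definition, residual divided by sqrt(MS_res), and uses made-up data rather than the lung function data.

```python
import math


def standardized_residuals(x, y):
    """Residuals from the least-squares line divided by sqrt(MS_res)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    beta = s_xy / s_xx
    alpha = mean_y - beta * mean_x
    residuals = [b - (alpha + beta * a) for a, b in zip(x, y)]
    ms_res = sum(e ** 2 for e in residuals) / (n - 2)
    scale = math.sqrt(ms_res)
    return [e / scale for e in residuals]


# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
z = standardized_residuals(x, y)
print([round(v, 3) for v in z])
```

If the Normal error assumption holds, a histogram of these values should look roughly bell-shaped around 0 and the P-P plot points should lie close to the diagonal.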