Correspondence Analysis

Correspondence Analysis

Q: when independence in a 2-way contingency table is rejected, how do we know where the dependence is coming from?
- The interaction terms in a GLM contain the dependence information; however, interactions can be difficult to interpret.
- Correspondence analysis: a visual residual analysis for a contingency table.

Singular value decomposition
- Let R be an r x c matrix. W.l.o.g., assume r >= c and rank(R) = c. Then R = U D V^T, where
  - U is an r x c column-orthonormal matrix, i.e., U^T U = I_{c x c}; its columns are called the left singular vectors
  - V is a c x c column-orthonormal matrix, i.e., V^T V = I_{c x c}; its columns are called the right singular vectors
  - D = diag(d_1, ..., d_c), where d_1 >= d_2 >= ... >= d_c >= 0, called the singular values
- Some properties:
  - the columns of U_{r x c} are eigenvectors of (R R^T)_{r x r}
  - the columns of V_{c x c} are eigenvectors of (R^T R)_{c x c}
  - {d_1^2, ..., d_c^2} are the eigenvalues of R R^T and of R^T R

Procedure for correspondence analysis on Pearson residuals:
(a) Fit the GLM corresponding to independence on the contingency table and compute its Pearson residuals r_p. (Q: what information is contained in the r_p's?)
(b) Arrange the r_p's in matrix form [R_ij] = R_{r x c}, laid out as in the contingency table.
(c) Perform the singular value decomposition on R: R = U D V^T, i.e., R_ij = sum_{k=1}^{c} U_ik d_k V_jk.
(d) It is not uncommon for the first few singular values of R to be much larger than the rest. Suppose the first two dominate. Then
    R_ij ~ U_i1 d_1 V_j1 + U_i2 d_2 V_j2 = (U_i1 sqrt(d_1))(V_j1 sqrt(d_1)) + (U_i2 sqrt(d_2))(V_j2 sqrt(d_2)) = U*_i1 V*_j1 + U*_i2 V*_j2,
    where U*_ik = U_ik sqrt(d_k) and V*_jk = V_jk sqrt(d_k).
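The SVD and its properties above can be checked numerically. A minimal NumPy sketch, where the matrix R is an arbitrary made-up example standing in for a residual matrix (r >= c):

```python
import numpy as np

# An arbitrary 4x3 matrix standing in for a residual matrix R (r >= c).
R = np.array([[ 2.0, -1.0,  0.5],
              [-0.5,  1.5, -1.0],
              [ 1.0,  0.0, -2.0],
              [-1.5,  0.5,  1.0]])

U, d, Vt = np.linalg.svd(R, full_matrices=False)  # R = U diag(d) V^T

# Eigenvalues of R^T R (and R R^T) are the squared singular values d_k^2.
eigvals_RtR = np.sort(np.linalg.eigvalsh(R.T @ R))[::-1]
print(np.allclose(eigvals_RtR, d**2))           # True
print(np.allclose(U.T @ U, np.eye(3)))          # True: U is column-orthonormal
print(np.allclose(np.sum(R**2), np.sum(d**2)))  # True: trace(R^T R) = sum_k d_k^2
```

The last check is the identity used later to connect the singular values to Pearson's X^2.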

(e) The 2-dimensional correspondence plot displays U*_i2 against U*_i1 for the rows and V*_j2 against V*_j1 for the columns, on the same graph. (Note: because the distances between points are of interest, it is important that the plot is scaled so that visual distance is proportionately correct.)

Some notes:
- In matrix form, R ~ U*_(1) V*_(1)^T + U*_(2) V*_(2)^T, where U*_(k) = (U*_1k, ..., U*_rk)^T and V*_(k) = (V*_1k, ..., V*_ck)^T; the (i, j) entry of the k-th term is U*_ik V*_jk.
- Q: what does a large positive R_ij mean? A large negative R_ij?
- sum_k d_k^2 = Pearson's X^2 (because sum_ij r_p^2 = trace(R^T R) = sum_k d_k^2).
- Q: what should we look for in a correspondence plot?
  - Large values in U*_(k) (and V*_(k)): the profiles of the rows (or columns) of the contingency table corresponding to the large values are atypical. E.g., BLOND hair: the distribution of eye colors within this group is not typical. In contrast, E.g., BROWN hair: the distribution of eye colors within this group is close to the marginal distribution of the columns.
  - A row level and a column level that appear close together and far from the origin: a large positive R_ij is associated with that combination. E.g., BLOND hair and blue eyes: a strong positive association.
  - A row level and a column level situated diametrically apart on either side of the origin: a large negative R_ij is associated with that combination. E.g., BLOND hair and brown eyes: relatively fewer people than independence predicts.
  - Points of two row (or two column) levels close together: the two rows (columns) have a similar pattern of association; one might consider combining the two categories. E.g., hazel eyes and green eyes: similar hair-color distributions.
- Other methods: corresp in the MASS package of R (Venables and Ripley, 2002); Blasius and Greenacre (1998).

Reading: Faraway, 4.2
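Steps (a)-(e) can be sketched end to end. A minimal NumPy sketch, assuming a hypothetical 4x4 hair/eye-style count table (illustration only), with the independence fit done in closed form (fitted values y_{i+} y_{+j} / y_{++}) rather than via a GLM routine:

```python
import numpy as np

# Hypothetical counts (rows = hair color, cols = eye color); illustration only.
y = np.array([[32., 11., 10.,  3.],
              [53., 50., 25., 15.],
              [10., 10.,  7.,  7.],
              [ 3., 30.,  5.,  8.]])

n = y.sum()
m = np.outer(y.sum(axis=1), y.sum(axis=0)) / n   # fitted values under independence
R = (y - m) / np.sqrt(m)                         # Pearson residuals r_p, in table layout

U, d, Vt = np.linalg.svd(R, full_matrices=False)

# sum_k d_k^2 recovers Pearson's X^2 = sum_ij r_p^2
X2 = np.sum(R**2)
print(np.isclose(X2, np.sum(d**2)))              # True

# 2-D correspondence-plot coordinates: scale singular vectors by sqrt(d_k)
row_pts = U[:, :2] * np.sqrt(d[:2])              # (U*_i1, U*_i2) for each row level
col_pts = Vt.T[:, :2] * np.sqrt(d[:2])           # (V*_j1, V*_j2) for each column level

# Rank-2 approximation R_ij ~ U*_i1 V*_j1 + U*_i2 V*_j2
approx = row_pts @ col_pts.T
```

Plotting `row_pts` and `col_pts` on the same (equally scaled) axes gives the correspondence plot of step (e).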

Matched Pairs

Data: observe one categorical measure on two matched objects. E.g., the left- and right-eye performance of a person. In contrast, in the typical 2-way contingency table we observe two (different) categorical measures on one object. Here both X1 and X2 take levels 1, ..., I, giving an I x I table of counts y_ij with margins y_i+, y_+j and total y_++.

Q: what questions might we be interested in for matched-pair data?
- Are X1 and X2 independent, i.e., π_ij = π_i+ π_+j for all i and j?
- Is [π_ij]_{I x I} a symmetric matrix, i.e., π_ij = π_ji?
- Are the row and column marginals homogeneous, i.e., π_i+ = π_+i?
  - Symmetry implies marginal homogeneity (the reverse statement is not necessarily true).
- When the row and column marginal totals are quite different, we might be interested in whether π_ij = π_i+ π_+j γ_ij, where γ_ij = γ_ji. This hypothesis is called quasi-symmetry.
  - Marginal homogeneity + quasi-symmetry <=> symmetry.
- Does π_ij = π_i+ π_+j hold for i ≠ j? This is called quasi-independence.

Tests for these hypotheses are based on GLMs. E.g., for I = 3, write the counts as Y = (y_11, y_21, y_31, y_12, y_22, y_32, y_13, y_23, y_33)^T.
- Test of the symmetry hypothesis:
  - Generate a vector with I^2 components for an (I(I+1)/2)-level nominal factor with the structure symfactor = (l_1, l_2, l_3, l_2, l_4, l_5, l_3, l_5, l_6)^T, so that cells (i, j) and (j, i) share a level.
  - Fit Y ~ symfactor (model S_sym).
  - Use a deviance-based/Pearson X^2 goodness-of-fit test for S_sym.
- Test of the quasi-symmetry hypothesis:
  - log(π_ij) = log(π_i+ π_+j γ_ij) = log(π_i+) + log(π_+j) + log(γ_ij)
  - Fit Y ~ X1 + X2 + symfactor (model S_qsym).
  - Use a deviance-based/Pearson X^2 goodness-of-fit test for S_qsym.
- Test of the marginal homogeneity hypothesis:
  - Use a deviance-based test of H_0: S_sym vs. H_1: S_qsym \ S_sym.
  - The test is only appropriate when S_qsym already holds.
- Test of the quasi-independence hypothesis:
  - Omit the diagonal data, i.e., Y = (y_21, y_31, y_12, y_32, y_13, y_23)^T, and fit Y ~ X1 + X2 (model S_qindep).
  - Use a deviance-based/Pearson X^2 goodness-of-fit test for S_qindep.

Reading: Faraway, 4.3
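For the symmetry model the Poisson-GLM fit has a closed form — the fitted value for cell (i, j) is (y_ij + y_ji)/2, and the diagonal cells are fitted exactly — so the goodness-of-fit X^2 can be sketched without a model-fitting routine. The 3x3 counts below are hypothetical:

```python
import numpy as np

# Hypothetical I x I matched-pairs table (illustration only).
y = np.array([[20.,  8.,  4.],
              [10., 30.,  6.],
              [ 5.,  9., 25.]])
I = y.shape[0]

mu = (y + y.T) / 2.0              # fitted values under S_sym
off = ~np.eye(I, dtype=bool)      # diagonal cells are fitted exactly, contribute 0
X2 = np.sum((y[off] - mu[off])**2 / mu[off])
df = I * (I - 1) // 2             # I^2 cells minus I(I+1)/2 free parameters

# Compare X2 against the chi-square distribution with df degrees of freedom.
print(round(X2, 3), df)
```

The degrees of freedom follow directly from the factor construction: I^2 cells fitted by an I(I+1)/2-level factor.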

Three-Way Contingency Table

The π's and y's are defined in the same manner as in the 2-way table. We use the Poisson GLM approach to investigate how X1, X2, X3 interact.

- Mutual independence (X1, X2, X3 are independent): π_ijk = π_i++ π_+j+ π_++k
  - log(π_ijk) = log(π_i++ π_+j+ π_++k) = log(π_i++) + log(π_+j+) + log(π_++k)
  - Fit Y ~ X1 + X2 + X3 (model S_1).
  - The estimates of the parameters in this model correspond only to the marginal totals y_i++ (1 <= i <= I), y_+j+ (1 <= j <= J), and y_++k (1 <= k <= K).
  - The coding we use determines exactly how the parameters relate to the marginal totals. E.g., let β be a main effect of X1 that codes categories i_1 and i_2 as 0 and 1; then e^β̂ / (1 + e^β̂) = π̂_i2++ / (π̂_i1++ + π̂_i2++) = y_i2++ / (y_i1++ + y_i2++).
  - An insignificant factor, say X1, indicates π_1++ = π_2++ = ... = π_I++.
- Joint independence ({X1, X2} and X3 are independent): π_ijk = π_ij+ π_++k
  - log(π_ijk) = log(π_ij+ π_++k) = log(π_ij+) + log(π_++k)
  - Fit Y ~ X1 + X2 + X1:X2 + X3 (model S_2, which contains S_1).
- Conditional independence (X1 and X2 are independent given X3): π_ijk = π_i+k π_+jk / π_++k
  - log(π_ijk) = log(π_i+k π_+jk / π_++k) = log(π_i+k) + log(π_+jk) - log(π_++k)
  - Fit Y ~ X1 + X2 + X3 + X1:X3 + X2:X3 (model S_3).
  - Note that S_3 contains the joint-independence model in which {X1, X3} and X2 are independent; indeed, independence of {X1, X3} and X2 implies that X1 and X2 are independent given X3.
  - Q: can conditional independence imply (marginal) independence between X1 and X2, i.e., π_ij+ = π_i++ π_+j+? (Hint: singular value decomposition.)
- Uniform association:
  - Consider the model with all two-factor interactions: Y ~ X1 + X2 + X3 + X1:X2 + X1:X3 + X2:X3 (model S_4, which contains S_3).
  - S_4 is not saturated, so some degrees of freedom are left for a goodness-of-fit test.
  - S_4 has no simple interpretation in terms of independence.
  - S_4 asserts that for every level of one variable, say X3, we have the same association between X1 and X2.
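The fitted values for S_1-S_3 have closed forms in terms of the marginal totals, which makes a quick numeric check possible without a GLM routine. A sketch with a hypothetical 2x2x2 table:

```python
import numpy as np

# Hypothetical 2 x 2 x 2 table y[i, j, k] (illustration only).
y = np.array([[[10., 20.], [15.,  5.]],
              [[12.,  8.], [30., 25.]]])
n = y.sum()

# Marginal totals
yi  = y.sum(axis=(1, 2))   # y_{i++}
yj  = y.sum(axis=(0, 2))   # y_{+j+}
yk  = y.sum(axis=(0, 1))   # y_{++k}
yij = y.sum(axis=2)        # y_{ij+}
yik = y.sum(axis=1)        # y_{i+k}
yjk = y.sum(axis=0)        # y_{+jk}

# S1 mutual independence:      mu_ijk = y_{i++} y_{+j+} y_{++k} / n^2
mu1 = np.einsum('i,j,k->ijk', yi, yj, yk) / n**2
# S2 joint independence:       mu_ijk = y_{ij+} y_{++k} / n
mu2 = np.einsum('ij,k->ijk', yij, yk) / n
# S3 conditional independence: mu_ijk = y_{i+k} y_{+jk} / y_{++k}
mu3 = np.einsum('ik,jk->ijk', yik, yjk) / yk

X2 = {name: np.sum((y - mu)**2 / mu)
      for name, mu in [('S1', mu1), ('S2', mu2), ('S3', mu3)]}
print({k: round(v, 2) for k, v in X2.items()})
```

Each Pearson X^2 would then be referred to a chi-square distribution with that model's residual degrees of freedom.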

- For each level k of X3, the reduced models implied by S_4 have different coefficients for the main effects of X1 and X2, but the same coefficients for the X1:X2 interaction.
- E.g., for I = J = 2, the fitted odds ratio between X1 and X2 is the same for each category k of X3. Note that

  fitted odds ratio = (ŷ_11k ŷ_22k) / (ŷ_12k ŷ_21k) = (π̂_11k π̂_22k) / (π̂_12k π̂_21k) = e^β̂_12,

  where β̂_12 is the coefficient of the X1:X2 term (under the 0-1 coding) in the reduced model for level k of X3.
- Q: what does uniform association mean? How do we interpret the association? How does it connect with the interaction terms?
- A saturated model corresponds to a 3-way table with a different association between, say, X1 and X2 across the K levels of X3, whereas Y ~ 1 corresponds to a 3-way table with constant cell probabilities.
- Q: how do we examine whether the X1, X2, X3 in a 3-way table are mutually independent, jointly independent, conditionally independent, or uniformly associated?
- Ans: perform deviance-based/Pearson X^2 goodness-of-fit tests for S_1, S_2, S_3, S_4, respectively.
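The constant-odds-ratio property of S_4 can be demonstrated without a regression routine via iterative proportional fitting, a standard algorithm that fits the all-two-factor-interaction log-linear model by matching the three two-way margins. The counts below are hypothetical:

```python
import numpy as np

# Hypothetical 2 x 2 x 3 table (illustration only).
y = np.array([[[10., 20., 15.], [ 5., 12.,  8.]],
              [[ 8.,  6., 10.], [20., 14., 30.]]])

# Iterative proportional fitting: cycle through the three two-way margins.
# This fits the all-two-factor-interaction model S_4.
mu = np.ones_like(y)
for _ in range(200):
    mu *= (y.sum(axis=2) / mu.sum(axis=2))[:, :, None]   # match y_{ij+}
    mu *= (y.sum(axis=1) / mu.sum(axis=1))[:, None, :]   # match y_{i+k}
    mu *= (y.sum(axis=0) / mu.sum(axis=0))[None, :, :]   # match y_{+jk}

# Under S_4 the fitted X1-X2 odds ratio is the same for every level k of X3.
ors = mu[0, 0, :] * mu[1, 1, :] / (mu[0, 1, :] * mu[1, 0, :])
print(np.allclose(ors, ors[0]))   # True: uniform association
```

The common value `ors[0]` is the fitted e^β̂_12 of the displayed formula, and the discrepancy between `y` and `mu` supplies the goodness-of-fit X^2 for S_4.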

- However, be careful of zero or small y_ijk: there will be some doubt as to the accuracy of the chi-square approximation in the goodness-of-fit test.
  - The chi-square approximation is better for comparing models than for assessing goodness of fit.
  - Analysis strategy: start with a complex Poisson GLM (such as the saturated model) and see how far the model can be reduced, using deviance-based tests to compare models.
- Binomial (multinomial) GLM approach for a 3-way table:
  - When the y_ij+'s are regarded as fixed, we can treat X3 as the response and X1, X2 as covariates.
  - Q1: what information is gone? Q2: what information is still attainable?
  - Ans for Q1: information about π_ij+. Ans for Q2: information about π_k|ij = π_ijk / π_ij+.
  - If K = 2: y_ij1 ~ binomial(y_ij+, π_1|ij).
  - If K > 2: (y_ij1, ..., y_ijK) ~ multinomial(y_ij+, (π_1|ij, ..., π_K|ij)).
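The reduction from the Poisson to the binomial view can be sketched numerically: once y_ij+ is held fixed, only the conditional proportions π_k|ij remain estimable, and any information about π_ij+ itself is gone. Hypothetical counts again:

```python
import numpy as np

# Hypothetical 2 x 2 x 2 table with K = 2 levels of the response X3.
y = np.array([[[30., 10.], [12., 18.]],
              [[ 9., 21.], [25.,  5.]]])

y_ij = y.sum(axis=2)            # binomial denominators y_{ij+}, treated as fixed
p_hat = y[:, :, 0] / y_ij       # saturated estimate of pi_{1|ij}

# Information about pi_{ij+} is lost: a table with the same conditional
# proportions but very different denominators yields the same p_hat.
y2 = y * 5.0
print(np.allclose(y2[:, :, 0] / y2.sum(axis=2), p_hat))  # True
```

Covariate effects of X1 and X2 on these conditional proportions are then modeled with a binomial (or, for K > 2, multinomial) GLM.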