ON THE CONNECTION BETWEEN THE DISTRIBUTION OF EIGENVALUES IN MULTIPLE CORRESPONDENCE ANALYSIS AND LOG-LINEAR MODELS

Authors:
S. Ben Ammou, Department of Quantitative Methods, Faculté de Droit et des Sciences Economiques et Politiques de Sousse, Tunisia (saloua.benammou@fdseps.rnu.tn)
G. Saporta, Chaire de Statistique Appliquée & CEDRIC, Conservatoire National des Arts et Métiers, Paris, France (saporta@cnam.fr)

Received: December 2002. Revised: September 2003. Accepted: October 2003.

Abstract: Multiple Correspondence Analysis (MCA) and log-linear modeling are two techniques for multi-way contingency table analysis having different approaches and fields of application. Log-linear models are interesting when applied to a small number of variables. Multiple Correspondence Analysis is useful in large tables. This efficiency is balanced by the fact that MCA is not able to make explicit the relations between more than two variables, as can be done through log-linear modeling. The two approaches are complementary. We present in this paper the distribution of eigenvalues in MCA when the data fit a known log-linear model, then we construct this model by successive applications of MCA. We also propose an empirical procedure, fitting progressively the log-linear model, where the fitting criterion is based on eigenvalue diagrams. The procedure is validated on several sets of data used in the literature.

Key-Words: Multiple Correspondence Analysis; eigenvalues; log-linear models; graphical models; normal distribution.

AMS Subject Classification: 49A05, 78B26.

We thank Professor M. Bourdeau for his careful reading.


1 INTRODUCTION

Multiple Correspondence Analysis and log-linear modeling are two very different, but mutually beneficial, approaches to analyzing multi-way contingency tables: log-linear models are profitably applied to a small number of variables, while Multiple Correspondence Analysis is useful in large tables. This efficiency is balanced by the fact that MCA is not able to make explicit the relations between more than two variables, as can be done through log-linear modeling. The two approaches are complementary. After a short reminder on MCA and log-linear approaches, we study the distribution of eigenvalues in MCA under modeling hypotheses, especially in the case of independence. At the end we propose an algorithmic approach for fitting log-linear models where the fitting criterion is based on the eigenvalue diagram.

2 A SHORT SURVEY OF MULTIPLE CORRESPONDENCE ANALYSIS AND LOG-LINEAR MODELS

We first introduce MCA and log-linear modelling, then we present some works using both methods.

2.1 Multiple Correspondence Analysis

Correspondence Analysis (CA) has quite a long history as a method for the analysis of categorical data. The starting point of this history is usually set in 1935 [28], and since then CA has been reinvented several times. We can distinguish simple CA (CA of contingency tables) and MCA or Multiple Correspondence Analysis (CA of so-called indicator matrices). MCA traces back to Guttman [23], Burt [8] or Hayashi [25]. In France, in the 1960s, Benzécri [6] proposed other developments of this method. Outside France, MCA has been developed by J. de Leeuw since 1973 [22] under the name of Homogeneity Analysis, and under the name of Dual Scaling by Nishisato [38].

Multiple Correspondence Analysis (MCA) is a multidimensional descriptive technique for categorical data. A theoretical version of Multiple Correspondence Analysis of $p$ variables can be defined as the limit, when the number of statistical units increases, of the CA of a complete disjunctive table.

Let $X$ be a complete disjunctive table of $p$ categorical variables $X_1, X_2, \dots, X_p$, with respectively $m_1, m_2, \dots, m_p$ modalities, observed over a sample of $n$ individuals. CA of this complete disjunctive table is equivalent to the analysis of $B$ [8], where $B = X'X$ is the Burt table associated with $X$. The two analyses have the same factors, but the eigenvalues in MCA are equal to the square roots of the eigenvalues in the CA of the associated Burt table.

MCA of $X$ corresponds to the diagonalization of the matrix

$$\frac{1}{p}\, D^{-1} X'X = \frac{1}{p}\, D^{-1} B, \qquad \text{where } D = \mathrm{Diag}(X'X) = \mathrm{Diag}(B).$$

The structure of the eigenvalue diagram depends on the variable interactions. It is well known that in the case of pairwise independent variables, the $q$ non-trivial eigenvalues are theoretically equal to $\frac{1}{p}$, where

$$q = \sum_{i=1}^{p} m_i - p. \qquad (1)$$

2.2 Log-linear modeling

Log-linear modeling is a well-known method for studying structural relationships between categorical variables in a multiple contingency table when none of the variables has a particular role. Relatively recent and not as well known in France as MCA, log-linear modeling has many classical references. After the first works of Birch [7] in 1963 and Goodman [17], we must mention the basic books of Haberman [24], Bishop, Fienberg & Holland [8], and Fienberg [15]. More recently, Dobson [12], Agresti [1] and Christensen [10] have written syntheses on the subject supplemented with personal contributions. Whittaker [41] devotes a large part of his book to log-linear models before defining associated graphical models.

2.2.1 Log-linear modeling in the binomial case

Let $X = (X_1, X_2, \dots, X_k)$ be a $k$-dimensional random vector, with values in $\{0, 1\}^k$. The expression for the $k$-dimensional probability density of $X$ is:

$$f_k(X) = p(0,0,\dots,0)^{(1-x_1)(1-x_2)\cdots(1-x_k)}\; p(1,0,\dots,0)^{x_1(1-x_2)\cdots(1-x_k)}\; p(0,1,\dots,0)^{(1-x_1)x_2\cdots(1-x_k)} \cdots p(0,0,\dots,1)^{(1-x_1)(1-x_2)\cdots x_k}\; p(1,1,\dots,0)^{x_1 x_2\cdots(1-x_k)} \cdots p(1,1,\dots,1)^{x_1 x_2\cdots x_k}.$$

We can write the density function as a log-linear expansion:

$$\log[f_k(X)] = u_0 + \sum_{i=1}^{k} u_i x_i + \sum_{\substack{i,j=1 \\ i \neq j}}^{k} u_{ij}\, x_i x_j + \sum_{\substack{i,j,l=1 \\ i \neq j \neq l}}^{k} u_{ijl}\, x_i x_j x_l + \dots + u_{123\dots k}\, x_1 x_2 \cdots x_k,$$

where $u_0 = \log[p(0,0,\dots,0)]$, $u_i = \log\left[\dfrac{p(0,\dots,0,1,0,\dots,0)}{p(0,0,\dots,0)}\right]$ (with the 1 in position $i$), and the u-terms $u_{ij}, \dots, u_{123\dots k}$ are log cross-product ratios in the $(k, k)$ probability table. The u-term $u_{ij}$ is set to zero when $X_i$ and $X_j$ are independent variables.
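As a small worked illustration of the binomial expansion, the sketch below computes the u-terms for $k = 2$ from a $2 \times 2$ probability table and evaluates the log-linear expansion at each cell. The probability values and the function names `u_terms_2x2` and `log_density` are illustrative assumptions and do not come from the paper.

```python
import math

def u_terms_2x2(p):
    """Return (u0, u1, u2, u12) for a 2x2 table p[x1][x2] of strictly positive probabilities."""
    u0 = math.log(p[0][0])
    u1 = math.log(p[1][0] / p[0][0])
    u2 = math.log(p[0][1] / p[0][0])
    # u12 is the log cross-product (odds) ratio of the table
    u12 = math.log(p[0][0] * p[1][1] / (p[0][1] * p[1][0]))
    return u0, u1, u2, u12

def log_density(u, x1, x2):
    """Evaluate the log-linear expansion u0 + u1*x1 + u2*x2 + u12*x1*x2."""
    u0, u1, u2, u12 = u
    return u0 + u1 * x1 + u2 * x2 + u12 * x1 * x2

# Hypothetical independent table: p(x1, x2) = a(x1) * b(x2)
a, b = (0.3, 0.7), (0.6, 0.4)
p_indep = [[a[i] * b[j] for j in range(2)] for i in range(2)]
u_indep = u_terms_2x2(p_indep)

# Hypothetical dependent table
p_dep = [[0.4, 0.1], [0.1, 0.4]]
u_dep = u_terms_2x2(p_dep)
```

In the independent table the interaction term `u_indep[3]` vanishes up to rounding, whereas in the dependent table `u_dep[3]` is the log of the odds ratio $0.4 \times 0.4 / (0.1 \times 0.1) = 16$.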

2.2.2 Log-linear modeling in the multinomial case

Let $X = (X_1, X_2, \dots, X_k)$ be a $k$-dimensional random vector, with values in $\{0, 1, \dots, m_1 - 1\} \times \{0, 1, \dots, m_2 - 1\} \times \dots \times \{0, 1, \dots, m_k - 1\}$ instead of in $\{0, 1\}^k$ as in the preceding case. The generalisation to the $k$-dimensional cross-classified multinomial distribution is the log-linear expansion:

$$\log[f_k(X)] = u_0 + \sum_{i=1}^{k} u_i(x) + \sum_{\substack{i,j=1 \\ i \neq j}}^{k} u_{ij}(x) + \sum_{\substack{i,j,l=1 \\ i \neq j \neq l}}^{k} u_{ijl}(x) + \dots + u_{123\dots k}(x).$$

Each u-term is a coordinate projection function with the coordinates indicated by its index, and each u-term is constrained to be zero whenever one of its indicated coordinates is zero. The importance of log-linear expansions rests with the fact that many interesting hypotheses can be generated by setting some u-terms to zero. We are particularly interested in this paper in graphical and hierarchical log-linear models.

2.2.2.1 Graphical log-linear models

Let $G = (K, E)$ be the independence graph of the $k$-dimensional random vector $X$, with $k$ vertices in $K = \{1, 2, \dots, k\}$ and edge set $E$. The graph is defined so that, whenever the pair $(i, j)$ is not in $E$, the variables $X_i$ and $X_j$ are independent conditionally on the other variables.

Given an independence graph $G$, the cross-classified multinomial distribution for the random vector $X$ is a graphical model for $X$ if the distribution of $X$ is arbitrary apart from constraints of the form that, for all pairs of coordinates not in the edge set $E$ of $G$, the u-terms containing the selected coordinates are identically zero.

2.2.2.2 Hierarchical log-linear models

A graphical model satisfies constraints of the form that all u-terms above a fixed point have to be zero to get conditional independence. A larger class of models, the hierarchical models, is obtained by allowing more flexibility in setting the u-terms to zero. A log-linear model is hierarchical if, whenever one particular u-term is constrained to zero, then all higher u-terms containing the same set of subscripts are also set to zero.
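The hierarchy condition can be checked mechanically. The sketch below represents each non-zero u-term by the tuple of its subscripts and uses the contrapositive of the definition: a model is hierarchical when every non-empty subset of the subscripts of a non-zero u-term also indexes a non-zero u-term. The representation and the function name `is_hierarchical` are assumptions made for illustration.

```python
from itertools import combinations

def is_hierarchical(nonzero_terms):
    """nonzero_terms: iterable of tuples of variable indices whose u-terms are non-zero.

    Returns True when every non-empty proper subset of a non-zero term
    is itself listed as a non-zero term.
    """
    terms = {frozenset(t) for t in nonzero_terms}
    for term in terms:
        for size in range(1, len(term)):
            for sub in combinations(sorted(term), size):
                if frozenset(sub) not in terms:
                    return False
    return True

# u1, u2, u3, u12, u23 non-zero: hierarchical (X1 and X3 independent given X2)
model_a = [(1,), (2,), (3,), (1, 2), (2, 3)]
# u123 non-zero while u13 is zero: not hierarchical
model_b = [(1,), (2,), (3,), (1, 2), (2, 3), (1, 2, 3)]
```

Here `is_hierarchical(model_a)` holds, while `model_b` fails because the three-way term is kept although $u_{13}$ has been set to zero.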

We note here that every distribution with a log-linear expansion has an interaction (or independence) graph, and a hierarchical log-linear model is graphical if and only if its maximal u-terms correspond to cliques in the independence graph.

When all the u-terms are non-zero, we have the saturated model. In the case when only the $u_i$ are non-zero, the model is called the mutual independence model:

$$\log[f_k(X)] = u_0 + \sum_{i=1}^{k} u_i(x).$$

When only the $u_i$ and some of the $u_{ij}$ are non-zero, the model is called a conditional independence model:

$$\log[f_k(X)] = u_0 + \sum_{i=1}^{k} u_i(x) + \sum_{i,j} u_{ij}(x).$$

These conditional independence models refer to simple interactions between some variables.

2.2.3 Parameters estimation and related tests

Theoretical frequencies are generally estimated using the maximum-likelihood method. Weighted regression or iterative methods can also be used, since log-linear models are particular cases of the generalized linear model.

Usually the classical $\chi^2$ or the $G^2$ tests (the likelihood ratio) are used to assess log-linear models. The values of the two statistics increase with the number of variables, and decrease with the number of interactions. The closer the statistics are to zero, the better the models.

Model selection becomes difficult when the number of variables grows: e.g. with four variables there are 167 different hierarchical models. To avoid the combinatorial explosion we can use criteria based on the Kullback information, like the Akaike criterion (An Information Criterion):

$$\mathrm{AIC} = -2 \log(\tilde{L}) + 2k,$$

or the Schwarz criterion (Bayesian Information Criterion):

$$\mathrm{BIC} = -2 \log(\tilde{L}) + k \log(n),$$

where $\tilde{L}$ is the maximum of the likelihood function $L$, $k$ is the number of parameters estimated when maximising $L$, and $n$ is the sample size.
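The two criteria are straightforward to compute once the maximised log-likelihood of each candidate model is available. The following sketch compares two candidate models; the log-likelihood values, parameter counts and sample size are hypothetical numbers chosen for illustration, not results from the paper.

```python
import math

def aic(log_lik, k):
    """Akaike criterion: -2 log(L~) + 2k."""
    return -2.0 * log_lik + 2.0 * k

def bic(log_lik, k, n):
    """Schwarz criterion: -2 log(L~) + k log(n)."""
    return -2.0 * log_lik + k * math.log(n)

# Hypothetical maximised log-likelihoods and parameter counts, n = 500 observations
n = 500
candidates = {
    "mutual independence": (-1250.0, 5),
    "one two-way interaction": (-1240.0, 9),
}
scores = {name: (aic(ll, k), bic(ll, k, n)) for name, (ll, k) in candidates.items()}

# The preferred model minimises the criterion
best_by_aic = min(scores, key=lambda name: scores[name][0])
best_by_bic = min(scores, key=lambda name: scores[name][1])
```

With these made-up numbers the two criteria disagree: AIC prefers the model with the interaction, while BIC, which penalises each parameter by $\log(n)$ rather than 2, prefers mutual independence.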

2.3 Multiple Correspondence Analysis and log-linear model as complementary tools of analysis

In this section, we present some works that show how CA (or MCA) and log-linear modeling can be related. This leads to a better understanding of CA, and to a combined use of both methods.

CA is often introduced without any reference to other methods of statistical treatment of categorical data with proven usefulness and flexibility. A major difference between CA and most other techniques for categorical data analysis lies in the use of probability models. In log-linear analysis (LLA), for example, a distribution is assumed under which the data are collected, then a log-linear model for the data is hypothesized and estimations are made under the assumption that this probability model is true, and finally these estimates are compared with the observed frequencies to evaluate the log-linear model. In this way it is possible to make inferences about the population on the basis of the sample data. In CA, it is claimed that no underlying distribution has to be assumed and no model has to be hypothesized, but a decomposition of the data is obtained to study the structure in the data.

A vast literature has been devoted to the assessment of CA (or MCA) and LLA. We briefly report here some of that literature.

Several works compare CA or MCA and LLA. Daudin and Trecourt [11] demonstrate empirically that LLA is better adapted to study global relationships between the variables, in the sense that marginal liaisons are eliminated in the computation of profiles. Goodman [17],[18],[19],[20],[21] defines two models belonging to the same family: the saturated row-column correspondence analysis model as a generalization of MCA, and the row-column association model as a generalization of LLA. He demonstrates, with illustrations by examples, that using these models is better than using the classical methods. Baccini, Mathieu and Mondot [3] use an example to compare the two methods. Jmel [30] and De Falguerolles, Jmel and Whittaker [13],[14] use graphical models compared to MCA.

Other works use CA or MCA and LLA as a combined approach to contingency table analysis: Van der Heijden and de Leeuw [26],[27],[28], Novak and Hoffman [39] and others use CA as a tool for the exploration of the residuals from log-linear models, and give an example of the procedure. Worsley [42] shows that in certain cases CA leads directly to the appropriate log-linear model. Lauro and Decarli [31] used CA as a procedure for the identification of the best log-linear models.

3 EIGENVALUES IN CORRESPONDENCE ANALYSIS

Since MCA is an extension of CA, we first present eigenvalues in CA, and some simple rules for the selection of the number of eigenvalues.

3.1 Asymptotic distribution of eigenvalues in Correspondence Analysis

Let $N$ be a contingency table with $m_1$ rows and $m_2$ columns, and let us assume that $N$ is the realization of a multinomial distribution $M(n, p_{ij})$, which is realistic. In this framework the observed eigenvalues $\lambda_i$ are estimators of the eigenvalues $\mu_i$ of $P$, where $P$ is the table of unknown joint probabilities (throughout this section, $\lambda_i$ denotes an observed CA eigenvalue and $\mu_i$ its theoretical counterpart).

Lebart [32] and O'Neill [34],[35],[36] proved the following result: if $\mu_i = 0$, then $n\lambda_i$ has the same distribution as the corresponding eigenvalue of a Wishart matrix with $(m_1 - 1)(m_2 - 1)$ degrees of freedom, $W_{(m_1-1)(m_2-1)}(r, I)$, where $r = \min(m_1 - 1, m_2 - 1)$. If $\mu_j \neq 0$, then $\lambda_j$ is asymptotically normally distributed, but with parameters depending on the unknown $p_{ij}$.

Since it is difficult to test this hypothesis, some authors have proposed a bootstrap approach, which unfortunately is not valid: since the empirical eigenvalues, on which the replication is based, are generally not null, we cannot observe the distribution based on the Wishart matrix.

3.2 Malinvaud's test

The test is based upon the reconstitution formula, which is a weighted singular value decomposition of $N$:

$$n_{ij} = \frac{n_{i.}\, n_{.j}}{n} \left( 1 + \sum_{k} \frac{a_{ik}\, b_{jk}}{\sqrt{\lambda_k}} \right),$$

where $a_{ik}$, $b_{jk}$ are the factorial components associated to the row and column profiles.

We may use a chi-square test comparing the observed $n_{ij}$ from a sample of size $n$ to the expected frequencies under the null hypothesis $H_k$ of only $k$ non-zero $\mu_i$. The weighted least squares estimates of these expectations are precisely the $\tilde{n}_{ij}$ of the reconstitution formula with its first $k$ terms. We then compute the classical chi-square goodness-of-fit statistic:

$$Q_k = \sum_i \sum_j \frac{(\tilde{n}_{ij} - n_{ij})^2}{\tilde{n}_{ij}}.$$

If $k = 0$ (independence), $Q_0$ is compared to a chi-square with $(m_1 - 1)(m_2 - 1)$ degrees of freedom. Under $H_k$, $Q_k$ is asymptotically distributed like a chi-square with $(m_1 - k - 1)(m_2 - k - 1)$ degrees of freedom.

However $Q_k$ suffers from the following drawback: if $n_{ij}$ is small, $\tilde{n}_{ij}$ can be negative and the test statistic cannot be used. This is not the case for the modification proposed by E. Malinvaud [37], who proposed to use $\frac{n_{i.}\, n_{.j}}{n}$ instead of $\tilde{n}_{ij}$ for the denominator. Furthermore, this leads to a simple relation with the sum of the discarded eigenvalues:

$$Q'_k = \sum_i \sum_j \frac{(\tilde{n}_{ij} - n_{ij})^2}{n_{i.}\, n_{.j} / n} = n\, (\lambda_{k+1} + \lambda_{k+2} + \dots + \lambda_r).$$

$Q'_k$ is also asymptotically distributed like a chi-square with $(m_1 - k - 1)(m_2 - k - 1)$ degrees of freedom.

4 BEHAVIOUR OF EIGENVALUES IN MCA UNDER MODELING HYPOTHESES

Let $X = (X_1\; X_2\; \dots\; X_p)$ be a disjunctive table of $p$ categorical variables $X_i$ (with respectively $m_i$ modalities) observed on a sample of $n$ individuals, and $q$ the number of non-trivial eigenvalues (as defined in 2.1). Multiple Correspondence Analysis is the CA of the disjunctive table $X$.

The rank of $X$ is $\mathrm{rank}(X) = \min(q + 1;\, n)$, so it equals $q + 1$ if $n > q + 1$. We suppose, without loss of generality, that $n$ is large enough, which is the usual case.

CA factors are the eigenvectors of the matrix $\frac{1}{p} D^{-1} B$ (where $B$ and $D$ are defined in 2.1). The diagonal elements of $D^{-1} B$ are all equal to one, so its trace is

$$\mathrm{Tr}(D^{-1} B) = \sum_{i=1}^{p} m_i \quad \text{and} \quad \frac{1}{p}\, \mathrm{Tr}(D^{-1} B) = \frac{1}{p} \sum_{i=1}^{p} m_i.$$

Since $\sum_{i=1}^{q} \mu_i = \frac{1}{p} \sum_{i=1}^{p} m_i - 1$, where the $\mu_i$ now denote the observed non-trivial eigenvalues of the MCA, we can conclude that

$$\frac{1}{q} \sum_{i=1}^{q} \mu_i = \frac{1}{p} \qquad (2)$$

and

$$\sum_{i=1}^{q} (\mu_i)^2 = \frac{1}{p^2} \sum_{i=1}^{p} (m_i - 1) + \frac{1}{p^2} \sum_{i \neq j} \varphi^2_{ij}, \qquad (3)$$

where $\varphi^2_{ij}$ is the observed Pearson's $\varphi^2$ of the table crossing $X_i$ with $X_j$. For a two-way table with cell counts $n_{ij}$ and margin counts $n_{i.}$ and $n_{.j}$,

$$\varphi^2 = \sum_i \sum_j \frac{\left( n_{ij} - \frac{n_{i.}\, n_{.j}}{n} \right)^2}{n_{i.}\, n_{.j}} = \frac{\chi^2}{n}.$$

Although MCA is an extension of CA, the results of Section 3 are not valid and one cannot use Malinvaud's test: the elements of $X$ being 0 or 1 and not frequencies, $Q_k$ and $Q'_k$ do not follow a chi-square distribution. However it is possible to get information about the dispersion of the $q$ eigenvalues in particular cases [5].

4.1 Distribution of eigenvalues in MCA under independence hypothesis

Under the hypothesis of pairwise independence of the variables $X_i$, all $\varphi^2_{ij} = 0$ and equation (3) becomes

$$\sum_{i=1}^{q} (\mu_i)^2 = \frac{1}{p^2} \sum_{i=1}^{p} (m_i - 1).$$

Using (1) we get

$$\sum_{i=1}^{q} (\mu_i)^2 = \frac{1}{p^2}\, q,$$

and finally, by (2):

$$\frac{1}{q} \sum_{i=1}^{q} (\mu_i)^2 = \frac{1}{p^2} = \left[ \frac{1}{q} \sum_{i=1}^{q} \mu_i \right]^2.$$

Since the mean of the squared $\mu_i$ equals their squared mean only if all the terms are equal, we can conclude that all the eigenvalues have the same value, so that:

$$\mu_i = \frac{1}{p}, \quad \forall i.$$

We conclude that the theoretical MCA (i.e. for the population), under the hypothesis of pairwise independence of the variables $X_i$, leads to one $q$-multiple non-trivial non-zero eigenvalue $\lambda = \frac{1}{p}$, and the eigenvalue diagram has the particular shape shown in Figure 1.
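Identities (2) and (3) hold for any observed disjunctive table, which can be checked numerically. The sketch below is a minimal illustration with NumPy on simulated data; the sample size, the numbers of modalities, the random seed and the helper `phi2` are assumptions made for the example. It works with the symmetric matrix $\frac{1}{p} D^{-1/2} B D^{-1/2}$, which has the same eigenvalues as $\frac{1}{p} D^{-1} B$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, levels = 400, [3, 4, 2]           # hypothetical sample size and numbers of modalities m_i
p = len(levels)
data = np.column_stack([rng.integers(0, m, size=n) for m in levels])

# Complete disjunctive (indicator) table X, one block of m_i columns per variable
blocks = [np.eye(m)[data[:, i]] for i, m in enumerate(levels)]
X = np.hstack(blocks)

B = X.T @ X                          # Burt table
d = np.diag(B)                       # category counts, i.e. the diagonal of D
S = B / np.sqrt(np.outer(d, d)) / p  # symmetric, same eigenvalues as (1/p) D^-1 B
eig = np.sort(np.linalg.eigvalsh(S))[::-1]

q = sum(levels) - p                  # equation (1)
mu = eig[1:q + 1]                    # drop the trivial eigenvalue 1; the last p-1 eigenvalues are 0

def phi2(a, b):
    """Pearson's phi^2 (chi-square divided by n) for the table crossing two indicator blocks."""
    table = a.T @ b
    expected = np.outer(table.sum(1), table.sum(0)) / n
    return ((table - expected) ** 2 / expected).sum() / n

sum_phi2 = sum(phi2(blocks[i], blocks[j]) for i in range(p) for j in range(p) if i != j)
lhs2, rhs2 = mu.mean(), 1 / p                            # both sides of equation (2)
lhs3, rhs3 = (mu ** 2).sum(), (q + sum_phi2) / p ** 2    # both sides of equation (3)
```

The sum over $i \neq j$ runs over ordered pairs, so each pair of variables contributes its $\varphi^2$ twice, as in equation (3).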

[Figure 1: Theoretical eigenvalues diagram in the independence case. The bars for $\lambda_1, \lambda_2, \dots, \lambda_q$ all have the same length.]

This result is not true when we have a finite sample, since sampling fluctuations make the observed $\varphi^2_{ij} \neq 0$. Although the trace of $\frac{1}{p}(D^{-1} B)$ and $\bar{\mu}$, the mean of the observed non-trivial eigenvalues, are constants, we observe $q$ different non-trivial eigenvalues $\mu_i \neq \frac{1}{p}$, and the eigenvalue diagram takes the shape shown in Figure 2.

[Figure 2: Observed eigenvalues diagram in the independence case. The bars for $\lambda_1, \lambda_2, \dots, \lambda_q$ decrease slowly and regularly in length.]

4.1.1 Dispersion of eigenvalues

Let $S^2_\mu$ be the measure of the dispersion of the $\mu_i$ around $\frac{1}{p}$ given by:

$$S^2_\mu = \frac{1}{q} \sum_{i=1}^{q} \left( \mu_i - \frac{1}{p} \right)^2 = \frac{1}{q} \sum_{i=1}^{q} (\mu_i)^2 - \frac{1}{p^2},$$

which implies

$$\sum_{i=1}^{q} (\mu_i)^2 = q \left( S^2_\mu + \frac{1}{p^2} \right).$$

Using equations (1) and (3), we have:

$$\sum_{i=1}^{q} (\mu_i)^2 = \frac{q}{p^2} + \frac{1}{p^2} \sum_{i \neq j} \varphi^2_{ij} = \frac{q}{p^2} + \frac{1}{p^2}\, \frac{1}{n} \sum_{i \neq j} \chi^2_{ij}.$$

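Under the hypothesis of pairwise independence of the variables, the $\chi^2_{ij}$ are realizations of $\chi^2_{(m_i - 1)(m_j - 1)}$ variables, so their expected values are $(m_i - 1)(m_j - 1)$. We can then easily compute $E\left( \sum_{i=1}^{q} (\mu_i)^2 \right)$, and get:

$$E\left( \sum_{i=1}^{q} (\mu_i)^2 \right) = \frac{q}{p^2} + \frac{1}{p^2}\, \frac{1}{n} \sum_{i \neq j} (m_i - 1)(m_j - 1).$$

Finally:

$$E(S^2_\mu) = \frac{1}{q}\, E\left( \sum_{i=1}^{q} (\mu_i)^2 \right) - \frac{1}{p^2},$$

and we obtain:

$$E(S^2_\mu) = \frac{1}{p^2}\, \frac{1}{n}\, \frac{1}{q} \sum_{i \neq j} (m_i - 1)(m_j - 1).$$

Now, setting $\sigma^2 = E(S^2_\mu)$, we may assume that the interval $\frac{1}{p} \pm 2\sigma$ contains roughly 95% of the eigenvalues. Moreover, since the kurtosis of the set of eigenvalues is lower than for a normal distribution, this proportion is actually probably larger than 95%.

4.1.2 Estimation of the Burt table

Let $X$ be the disjunctive table associated to $p$ categorical variables $X_i$, with $m_i$ modalities respectively, observed on a sample of $n$ individuals, where $X_i = (X_{i1}, X_{i2}, \dots, X_{im_i})$. $X$ is a matrix made of $p$ blocks $X_i$:

$$X = (X_1\; X_2\; \dots\; X_i\; \dots\; X_p).$$

Let $(X^j_{i1}, X^j_{i2}, \dots, X^j_{im_i})$ be the observed value of $X_i$ on the $j$-th individual. We can write

$$X = \begin{pmatrix}
X^1_{11} & \cdots & X^1_{1m_1} & X^1_{21} & \cdots & X^1_{2m_2} & \cdots & X^1_{p1} & \cdots & X^1_{pm_p} \\
X^2_{11} & \cdots & X^2_{1m_1} & X^2_{21} & \cdots & X^2_{2m_2} & \cdots & X^2_{p1} & \cdots & X^2_{pm_p} \\
\vdots & & \vdots & \vdots & & \vdots & & \vdots & & \vdots \\
X^n_{11} & \cdots & X^n_{1m_1} & X^n_{21} & \cdots & X^n_{2m_2} & \cdots & X^n_{p1} & \cdots & X^n_{pm_p}
\end{pmatrix}.$$

The Burt table of $X$ is then

$$B = \begin{pmatrix}
X_1'X_1 & X_1'X_2 & \cdots & X_1'X_p \\
X_2'X_1 & X_2'X_2 & \cdots & X_2'X_p \\
\vdots & \vdots & & \vdots \\
X_p'X_1 & X_p'X_2 & \cdots & X_p'X_p
\end{pmatrix}
= \begin{pmatrix}
B_{11} & B_{12} & \cdots & B_{1p} \\
B_{21} & B_{22} & \cdots & B_{2p} \\
\vdots & \vdots & & \vdots \\
B_{p1} & B_{p2} & \cdots & B_{pp}
\end{pmatrix}.$$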

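Writing $X^j_{ki}$ for the indicator of the $k$-th modality of the $i$-th variable on the $j$-th individual, the diagonal blocks of the Burt table are

$$B_i = B_{ii} = X_i'X_i = \begin{pmatrix}
\sum_{j=1}^{n} (X^j_{1i})^2 & \sum_{j=1}^{n} X^j_{1i} X^j_{2i} & \cdots & \sum_{j=1}^{n} X^j_{1i} X^j_{m_i i} \\
\sum_{j=1}^{n} X^j_{2i} X^j_{1i} & \sum_{j=1}^{n} (X^j_{2i})^2 & \cdots & \sum_{j=1}^{n} X^j_{2i} X^j_{m_i i} \\
\vdots & \vdots & & \vdots \\
\sum_{j=1}^{n} X^j_{m_i i} X^j_{1i} & \sum_{j=1}^{n} X^j_{m_i i} X^j_{2i} & \cdots & \sum_{j=1}^{n} (X^j_{m_i i})^2
\end{pmatrix},$$

where $X^j_{ki} \in \{0, 1\}$ with $\sum_{k=1}^{m_i} X^j_{ki} = 1$.

Since for each individual there is only one $k$ in $\{1, \dots, m_i\}$ such that $X^j_{ki} = 1$, all others being zero, we obtain:

$$(X^j_{ki})^2 = X^j_{ki} \quad \forall j \in \{1, \dots, n\},\; \forall k \in \{1, \dots, m_i\}$$

and

$$X^j_{ki}\, X^j_{k'i} = 0 \quad \forall k \neq k' \in \{1, \dots, m_i\}.$$

So we can conclude that, for $i = 1, \dots, p$, the diagonal sub-matrices of the Burt table are themselves diagonal matrices:

$$X_i'X_i = B_i = \mathrm{diag}\left( \sum_{j=1}^{n} (X^j_{1i})^2,\; \dots,\; \sum_{j=1}^{n} (X^j_{ki})^2,\; \dots,\; \sum_{j=1}^{n} (X^j_{m_i i})^2 \right).$$

Furthermore, we know that

$$\sum_{k=1}^{m_i} \left( \sum_{j=1}^{n} X^j_{ki} \right) = \sum_{k=1}^{m_i} n_i^k = n,$$

where $n_i^k = \sum_{j=1}^{n} X^j_{ki}$ is the number of individuals that have the $k$-th modality of the $i$-th variable (for $1 \leq i \leq p$ and $1 \leq k \leq m_i$).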

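So the diagonal sub-matrices of the Burt table are:

$$B_i = B_{ii} = \mathrm{diag}\left( n_i^1, \dots, n_i^k, \dots, n_i^{m_i} \right), \quad \text{where } \sum_{k=1}^{m_i} n_i^k = n, \quad i = 1, \dots, p.$$

Consider now two independent variables $X_\alpha$ and $X_\beta$ amongst the $p$ variables, having respectively $m_\alpha$ and $m_\beta$ modalities. Let $B_\alpha$ be the $(m_\alpha, m_\alpha)$ square matrix $B_\alpha = X_\alpha' X_\alpha$, and $B_{\alpha\beta}$ the $(m_\alpha, m_\beta)$ rectangular matrix $B_{\alpha\beta} = X_\alpha' X_\beta$. We have

$$(B_\alpha)_{ii} = \sum_{k=1}^{n} X^k_{i\alpha} = n_\alpha^i \quad \text{and} \quad (B_\alpha)_{ij} = 0 \text{ if } i \neq j,$$

and

$$(B_{\alpha\beta})_{ij} = \sum_{k=1}^{n} X^k_{i\alpha}\, X^k_{j\beta}.$$

Under the hypothesis that $X_\alpha$ and $X_\beta$ are independent,

$$(B_{\alpha\beta})_{ij} = \frac{(B_\alpha)_{ii}\, (B_\beta)_{jj}}{n} = \frac{n_\alpha^i\, n_\beta^j}{n},$$

and, more generally, we can conclude that

$$X_i'X_j = B_{ij} = \frac{1}{n} \begin{pmatrix}
n_i^1 n_j^1 & n_i^1 n_j^2 & \cdots & n_i^1 n_j^{m_j} \\
n_i^2 n_j^1 & n_i^2 n_j^2 & \cdots & n_i^2 n_j^{m_j} \\
\vdots & \vdots & & \vdots \\
n_i^{m_i} n_j^1 & n_i^{m_i} n_j^2 & \cdots & n_i^{m_i} n_j^{m_j}
\end{pmatrix}$$

if the $p$ variables are mutually independent.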
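The block structure of the Burt table derived above can be illustrated on a toy data set. The sketch below uses NumPy and a balanced design in which every combination of categories is observed the same number of times, so that the two variables are exactly independent in the sample; the numbers of modalities and the replication factor are assumptions chosen for the example.

```python
import numpy as np
from itertools import product

levels = [2, 3]                       # hypothetical numbers of modalities m_1, m_2
# Balanced design: each of the 6 category combinations is observed 5 times,
# so X_1 and X_2 are exactly independent in this sample.
data = np.array(list(product(*[range(m) for m in levels])) * 5)
n = len(data)

X1 = np.eye(levels[0])[data[:, 0]]    # indicator block of X_1
X2 = np.eye(levels[1])[data[:, 1]]    # indicator block of X_2
X = np.hstack([X1, X2])               # complete disjunctive table

B = X.T @ X                           # Burt table
B11 = B[:2, :2]                       # diagonal block X_1' X_1
B12 = B[:2, 2:]                       # off-diagonal block X_1' X_2 (contingency table)

n1 = X1.sum(axis=0)                   # category counts of X_1
n2 = X2.sum(axis=0)                   # category counts of X_2
```

In this example `B11` is the diagonal matrix of the counts `n1`, and `B12` coincides with `np.outer(n1, n2) / n`, the product of the margins divided by the sample size.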

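Now consider a sample of $n$ observations of $p$ multinomial random variables $X_i$. Let $p_i^k = p_{ik}$ be the probability that an individual is in the $k$-th category of the $i$-th variable, and $p_{ij}^k$ the probability that the $j$-th individual is in the $k$-th category of the $i$-th variable.

The observed Burt table is:

$$B = X'X = \begin{pmatrix}
X_1'X_1 & X_1'X_2 & \cdots & X_1'X_p \\
X_2'X_1 & X_2'X_2 & \cdots & X_2'X_p \\
\vdots & \vdots & & \vdots \\
X_p'X_1 & X_p'X_2 & \cdots & X_p'X_p
\end{pmatrix},$$

with

$$X_i'X_i = \mathrm{diag}\left( \sum_{j=1}^{n} (X^j_{1i})^2,\; \dots,\; \sum_{j=1}^{n} (X^j_{ki})^2,\; \dots,\; \sum_{j=1}^{n} (X^j_{m_i i})^2 \right) = \mathrm{diag}\{ n_i^1, \dots, n_i^{m_i} \}.$$

But $n_i^k = \sum_{j=1}^{n} (X^j_{ki})^2$, with $E\left[ (X^j_{ki})^2 \right] = p_i^k$ and $\sum_{k=1}^{m_i} p_i^k = 1$, so that the expected count is $n\, p_i^k$ for $i = 1, \dots, p$, and the theoretical diagonal block is

$$N_i = n\, \mathrm{diag}\left( p_i^1, \dots, p_i^k, \dots, p_i^{m_i} \right).$$

Since $X_i$ and $X_j$ are independent variables, the theoretical off-diagonal block $N_{ij}$ of $X_i'X_j$ satisfies $(N_{ij})_{kk'} = n\, p_i^k\, p_j^{k'}$, which implies

$$N_{ij} = n \begin{pmatrix}
p_i^1 p_j^1 & p_i^1 p_j^2 & \cdots & p_i^1 p_j^{m_j} \\
p_i^2 p_j^1 & p_i^2 p_j^2 & \cdots & p_i^2 p_j^{m_j} \\
\vdots & \vdots & & \vdots \\
p_i^{m_i} p_j^1 & p_i^{m_i} p_j^2 & \cdots & p_i^{m_i} p_j^{m_j}
\end{pmatrix}.$$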

The maximum-likelihood estimator of $p_i^k$ is $\hat{p}_i^k = \frac{n_i^k}{n}$, so

$$\hat{N}_i = \mathrm{diag}\left( n_i^1, \dots, n_i^k, \dots, n_i^{m_i} \right) = B_{ii}$$

and, under the independence hypothesis,

$$\hat{N}_{ij} = \frac{1}{n} \begin{pmatrix}
n_i^1 n_j^1 & n_i^1 n_j^2 & \cdots & n_i^1 n_j^{m_j} \\
n_i^2 n_j^1 & n_i^2 n_j^2 & \cdots & n_i^2 n_j^{m_j} \\
\vdots & \vdots & & \vdots \\
n_i^{m_i} n_j^1 & n_i^{m_i} n_j^2 & \cdots & n_i^{m_i} n_j^{m_j}
\end{pmatrix} = B_{ij}.$$

We can conclude that the maximum-likelihood estimator $\hat{B}$ of the theoretical Burt table is $B$, the observed one.

Using the functional invariance property, we can affirm that the maximum-likelihood estimators of the eigenvalues of the theoretical $D^{-1}B$ are the eigenvalues of the observed $D^{-1}B$, so that each $\mu_i$ is the maximum-likelihood estimator of $\lambda_i = \lambda$.

Maximum-likelihood estimators are asymptotically normal, and so, asymptotically, each $\mu_i$ is normally distributed. But due to the fact that the eigenvalues are ordered, they are not independently and identically distributed. However:

$$E(\mu_1) > \frac{1}{p}, \quad E(\mu_q) < \frac{1}{p}, \quad \text{but} \quad E(\mu_1) \to \frac{1}{p} \quad \text{and} \quad E(\mu_q) \to \frac{1}{p}.$$

Furthermore the eigenvalue variances are not the same. From simulations of large samples of observations ($n = 100, \dots, n = 10\,000$), we notice that the convergence of the eigenvalue distribution to a normal one is slow, especially for the extremes ($\mu_1$ and $\mu_q$), even for very large samples [4].

4.2 Distribution of eigenvalues in MCA under non-independence hypotheses

4.2.1 Distribution of the theoretical eigenvalues

Let $\mu$ be an eigenvalue of $\frac{1}{p} D^{-1} X'X$. Since $\mu$ can also be obtained by diagonalization of $\frac{1}{p} X D^{-1} X'$, $\mu$ is a solution of

$$\frac{1}{p}\, X D^{-1} X' z = \mu z,$$

where $z$ is an eigenvector associated to $\mu$.

So

$$\frac{1}{p} \sum_{i=1}^{p} X_i (X_i'X_i)^{-1} X_i'\, z = \mu z, \quad \text{i.e.} \quad \frac{1}{p} \sum_{i=1}^{p} P_i z = \mu z,$$

where $P_i = X_i (X_i'X_i)^{-1} X_i'$ is the orthogonal projector on the space spanned by linear combinations of the indicators of the categories of variable $X_i$.

Let $A_i$ be the centered projector associated to $P_i$:

$$A_i = P_i - \frac{1}{n}\, \mathbf{1}\mathbf{1}', \quad \text{where } \mathbf{1}\mathbf{1}' \text{ is the } n \times n \text{ matrix whose entries are all equal to 1.}$$

And so we get

$$\frac{1}{p} \sum_{i=1}^{p} A_i z = \mu z. \qquad (4)$$

4.2.1.1 The case of two-way interactions

Let us assume that among the $p$ studied variables, there is a two-way interaction between $X_j$ and $X_k$, and that the $(p - 2)$ remaining variables are mutually independent. Multiplying equation (4) by $A_j$ we get:

$$\frac{1}{p} \left( A_j A_1 + A_j A_2 + \dots + A_j A_j + \dots + A_j A_k + \dots + A_j A_p \right) z = \mu A_j z,$$

where $A_j A_i = 0$ for every $i \neq j, k$ and $A_j A_j = A_j$, since all variables are pairwise independent except $X_j$ and $X_k$, and the $A_i$ are orthogonal projectors. Thus:

$$A_j A_k z = (p\mu - 1)\, A_j z. \qquad (5)$$

Similarly, multiplying (4) by $A_k$, we get:

$$A_k A_j z = (p\mu - 1)\, A_k z. \qquad (6)$$

Now let us multiply (5) by $A_k$ to get:

$$A_k A_j A_k z = (p\mu - 1)\, A_k A_j z.$$

Using (6) and writing $z_1 = A_k z$, we obtain

$$A_k A_j z_1 = (p\mu - 1)^2\, z_1.$$

With the notation $\lambda = (p\mu - 1)^2$, we finally write:

$$A_k A_j z_1 = \lambda z_1. \qquad (7)$$

Equation (7) implies that $\lambda$ is an eigenvalue of the product $A_k A_j$ of the centered projectors, associated to the eigenvector $z_1$.

In general: for all $j, k = 1, \dots, p$, if there is an interaction between $X_j$ and $X_k$, the product of orthogonal projectors $A_j A_k$ admits a non-zero eigenvalue $\lambda = (p\mu - 1)^2$.

If $\lambda \neq 0$, then $\mu \neq \frac{1}{p}$ and, the trace of the Burt table being constant, there is at least one other eigenvalue not equal to $\frac{1}{p}$. Let $n_0$ be the number of eigenvalues not equal to $\frac{1}{p}$, so that

$$\sum_{i=1}^{n_0} \mu_i = \frac{n_0}{p}.$$

Theoretically (except for the particular case where $\lambda = 1$, for which we have $\mu = \frac{2}{p}$ and $\mu = 0$), the number of non-trivial eigenvalues greater than $\frac{1}{p}$ is equal to the number of non-trivial eigenvalues smaller than $\frac{1}{p}$. The eigenvalue diagram shape is shown in Figure 3.

[Figure 3: Theoretical eigenvalues diagram in the two-way interaction case. The bars for $\lambda_1, \lambda_2, \dots, \lambda_q$ show a few large eigenvalues, a plateau of equal eigenvalues, and a few small eigenvalues.]

The number $n_0$ depends on the number of categories of $X_j$ and $X_k$, on the number of variables and on the number of dependent variables. Let $n_1$ be the multiplicity of $\frac{1}{p}$; we will show that

$$n_1 = q - 2 \min\left( (m_j - 1);\, (m_k - 1) \right)$$

when $p > 2$ and when there is only one two-way interaction between the variables. This result can be shown as follows.

Let us consider equation (4), and suppose, without loss of generality, that $X_1$ and $X_2$ are dependent. Upon multiplication by $A_3$, $\frac{1}{p} \sum_{i=1}^{p} A_i z = \mu z$ becomes

$$\frac{1}{p} \left( A_3 A_1 + A_3 A_2 + A_3 A_3 + \dots + A_3 A_p \right) z = \mu A_3 z,$$

that is, $\frac{1}{p} A_3 z = \mu A_3 z$, and we get $\mu = \frac{1}{p}$ whenever $A_3 z \neq 0$.

Now multiply equation (4) by $A_1$ and $A_2$ in turn to get:
$$\big(A_1 A_1 + A_1 A_2 + A_1 A_3 + \cdots + A_1 A_p\big) z = p\mu\, A_1 z$$
$$\big(A_2 A_1 + A_2 A_2 + A_2 A_3 + \cdots + A_2 A_p\big) z = p\mu\, A_2 z$$
$$\Longrightarrow \begin{cases} (A_1 + A_1 A_2)\, z = p\mu\, A_1 z \\ (A_2 A_1 + A_2)\, z = p\mu\, A_2 z \end{cases} \Longrightarrow \begin{cases} A_1 A_2\, a = \lambda\, a \\ A_2 A_1\, b = \lambda\, b \end{cases}$$
where $\lambda = (p\mu - 1)^2$, $a = A_1 z$ and $b = A_2 z$.

We recognize here the CA equations, so that the CA of the Burt table, when only two variables are dependent, is equivalent to the CA of the contingency table crossing the two dependent variables. It is well known that the number of non-trivial eigenvalues in CA equals $\min((m_j - 1); (m_k - 1))$, and to each non-trivial $\lambda_i$ there correspond two values $\mu_i$ and $\mu_i'$ such that:
$$\mu_i = \frac{1 + \sqrt{\lambda_i}}{p} \qquad \text{and} \qquad \mu_i' = \frac{1 - \sqrt{\lambda_i}}{p}.$$
Finally, the CA of the Burt table may have $2\min((m_j - 1); (m_k - 1))$ non-trivial eigenvalues not equal to $\frac{1}{p}$, associated to the CA of the contingency table. So the number of supplementary eigenvalues equals $q - 2\min((m_j - 1); (m_k - 1))$. There is, in addition, one multiple eigenvalue $\frac{1}{p}$, whose multiplicity $n_1$ is at least $q - 2\min((m_j - 1); (m_k - 1))$.

4.2.1.2 The case of higher order interactions

Since the Burt table is constructed from pairwise cross products of variables, its observation cannot give us information about multiway interactions. However, the observation of the bi-dimensional theoretical Burt sub-tables, for all pairwise variable combinations, can provide us with all the two-way interactions.

The theoretical Burt table can reveal the existence of higher order interactions in the following case: if $B_{ij} \neq B_{ii}\,\mathbf{1}_{m_j m_j}\,B_{jj}$ and $B_{ik} \neq B_{ii}\,\mathbf{1}_{m_k m_k}\,B_{kk}$, there may be a triple interaction between $X_i$, $X_j$ and $X_k$.

In general, a Burt table gives neither the order of the interactions nor supplementary information on the eigenvalue behavior.
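The correspondence between the CA eigenvalues and the eigenvalues of the Burt table can be illustrated numerically. The following is only a minimal sketch: the function names are illustrative and do not come from the paper, and the input eigenvalue 0.25 is an arbitrary value chosen so that the arithmetic is exact.

```python
import math


def burt_eigenvalues_from_ca(ca_eigenvalues, p):
    """Map each non-trivial CA eigenvalue lam to the pair
    ((1 + sqrt(lam)) / p, (1 - sqrt(lam)) / p)."""
    pairs = []
    for lam in ca_eigenvalues:
        root = math.sqrt(lam)
        pairs.append(((1 + root) / p, (1 - root) / p))
    return pairs


def multiplicity_of_one_over_p(q, m_j, m_k):
    """Value n_1 = q - 2 * min(m_j - 1, m_k - 1) given in the text."""
    return q - 2 * min(m_j - 1, m_k - 1)


# Hypothetical input: a single CA eigenvalue of 0.25 with p = 2 variables.
pairs = burt_eigenvalues_from_ca([0.25], p=2)
print(pairs)  # [(0.75, 0.25)]

# Four three-category variables (q = 8), one interacting pair.
print(multiplicity_of_one_over_p(q=8, m_j=3, m_k=3))  # 4
```

Each pair sums to $2/p$, which reflects the symmetry of the non-trivial eigenvalues around $1/p$ described above.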

4.2.2 Distribution of observed eigenvalues

Exceptionally, with a small number of interactions, we observe the particular shape of the eigenvalue diagram exhibited in Figure 4, where we can distinguish eigenvalues near $\frac{1}{p}$ (theoretically equal to $\frac{1}{p}$), and so we are able to recognize the existence of independent variables in the analysis.

[Eigenvalues diagram: horizontal bars for $\lambda_1, \lambda_2, \dots, \lambda_q$]
Figure 4: Observed eigenvalues diagram in a two-way interaction case

When the number of interactions grows, we cannot distinguish the eigenvalues theoretically equal to $\frac{1}{p}$ from the eigenvalues not equal to $\frac{1}{p}$.

To detect the existence of interactions, we can first check whether the observed variables are mutually independent. In that case, the eigenvalue distribution diagram should have a particular shape (see 4.1), with more than 95% of the eigenvalues within the confidence interval $\frac{1}{p} \pm 2\sigma$ (see 4.1.1). If one or more eigenvalues fall outside the confidence interval, we can then assume the existence of one or more two-way interactions between variables.

5. AN EMPIRICAL PROCEDURE FOR FITTING LOG-LINEAR MODELS BASED ON THE MCA EIGENVALUE DIAGRAM

We propose an empirical procedure for progressively fitting a log-linear model, where the fitting test at each step is based on the MCA eigenvalue diagram.

Let $X_i$, $X_j$ and $X_k$ be three categorical variables, with respectively $m_i$, $m_j$ and $m_k$ modalities, and let $X_{ij}$ be a cross variable with $(m_i m_j)$ modalities. We suppose that $X_{ij}$ and $X_k$ have the same behavior if $m_k = m_i m_j$.
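As an illustration of the cross variable $X_{ij}$, here is a minimal sketch of one possible coding. It assumes that modalities are stored as 0-based integer codes; the function name `cross_variable` and the sample data are hypothetical and are not taken from the paper.

```python
def cross_variable(x_i, x_j, m_i, m_j):
    """Code each pair (x_i, x_j) as one modality of a single variable
    with m_i * m_j modalities (0-based integer codes assumed)."""
    if len(x_i) != len(x_j):
        raise ValueError("both variables must be observed on the same individuals")
    crossed = []
    for a, b in zip(x_i, x_j):
        if not (0 <= a < m_i and 0 <= b < m_j):
            raise ValueError("modality code out of range")
        # Row-major index of the cell (a, b) in the m_i x m_j table.
        crossed.append(a * m_j + b)
    return crossed


# Hypothetical observations of two three-modality variables.
x_i = [0, 1, 2, 2]
x_j = [1, 0, 2, 1]
print(cross_variable(x_i, x_j, 3, 3))  # [1, 3, 8, 7]
```

The crossed variable can then be entered in an MCA like any other categorical variable with $m_i m_j$ modalities.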

Under the hypothesis that two dependent variables $X_i$ and $X_j$ have the same behaviour as a variable $X_k$ with the same characteristics as the cross variable $X_{ij}$, we propose here an empirical procedure for progressively fitting, in $p$ steps, the log-linear model, where the fitting criterion at each step is based on the MCA eigenvalue diagram.

5.1 Description of the procedure steps

The first step of the procedure consists of testing the pairwise independence hypothesis of the variables. To detect the existence of interactions, we must first check whether all variables are mutually independent. For that purpose, we calculate the eigenvalues of the MCA on all the $p$ variables, and construct the related confidence interval: the eigenvalue distribution diagram should have a particular shape (cf. 4.1).

If all the eigenvalues belong to the confidence interval $\frac{1}{p} \pm 2\sigma$ (cf. 4.1.1), we can conclude that the $p$ variables are mutually independent. The log-linear model associated to the variables is a simple additive one:
$$\log[f_p(X)] = u_0(x) + \sum_{i=1}^{p} u_i(x),$$
and the procedure is stopped.

If one or more eigenvalues are not in the confidence interval, we conclude that there is at least one double interaction between variables, and we go to the second step of the procedure.

In the second step, we look at all two-way interaction u-terms. We isolate one variable among the $p$ variables, which we denote $X_p$ without loss of generality, so that we obtain a set of $(p-1)$ variables $X_i$, and we apply the first step to test the pairwise independence of the $(p-1)$ variables.

If the $(p-1)$ variables are independent, we can conclude that the double interactions are between $X_p$ and at least one of the $X_i$, so we construct the corresponding cross variables $X_{ip}$ by using the first step to test independence between the variables $(X_i, X_p)$, where $i = 1, \dots, p-1$. The log-linear model associated to the variables is:
$$\log[f_p(X)] = u_0(x) + \sum_{i=1}^{p} u_i(x) + \sum_{i=1}^{p-1} u_{ip}(x)\,\delta_{ip},$$
and the procedure is stopped (with $\delta_{ip} = 1$ if the interaction between $X_p$ and $X_i$ exists; otherwise it is set to zero).

If the $(p-1)$ variables are not independent, we can conclude that there is a double interaction between $X_i$ and $X_j$, where $i, j = 1, \dots, p-1$, and perhaps between $X_i$ and $X_p$.

We can construct the corresponding cross variables $X_{ip}$ and $X_{ij}$ by using the first step to test the independence of the variables $(X_i, X_p)$ and of the variables $(X_i, X_j)$, where $i, j = 1, \dots, p-1$. The log-linear model associated to the variables is:
$$\log[f_p(X)] = u_0(x) + \sum_{i=1}^{p} u_i(x) + \sum_{i=1}^{p-1} u_{ip}(x)\,\delta_{ip} + \text{terms due to the interaction between three or more variables},$$
and we go to the third step of the procedure.

In the third step, we look at three-way interaction u-terms by testing the dependence of the variables $X_i$ and the cross variables $X_{jk}$, where $i, j, k = 1, \dots, p$ are all different, and we construct the cross variables $X_{ijk}$. The independence test is based on the eigenvalue pattern of the related MCA, as described in the first step.

Continuing this way, in the $k$-th step we look at $k$-way interaction u-terms, and in the last step we look at the $p$-way interaction u-term. This algorithm is summarized in Figure 5.

5.2 An example for a graphical model

For this example we use a data set given by Haberman [24] that was used in Falguerolles et al. [14] to fit a graphical model. The data report attitudes toward non-therapeutic abortions among white subjects, crossed with three categorical variables describing the subjects. The data set is a contingency table observed for 3181 individuals, crossing four three-modality variables $X_1$, $X_2$, $X_3$ and $X_4$, defined in Table 1.

The first step of the procedure consists of testing the pairwise independence hypothesis of the variables. We first transform the contingency table into a complete disjunctive table, then calculate the parameters (defined in 2.1 and 4.1.1) needed for the test (Table 2). MCA on the four variables gives the eigenvalue diagram of Figure 6.

The shape of the eigenvalue diagram clearly points to the existence of dependent variables. Eigenvalues $\lambda_1$, $\lambda_7$ and $\lambda_8$ are not in the interval $I_c$, so there are at least two dependent variables: there is one or more two-way interaction between variables.

Figure 5: Block diagram for the empirical procedure

Table 1: Attitudes toward non-therapeutic abortions among white subjects

Year: X_1   Religion: X_2         Education in years: X_3   Attitude: X_4
                                                            positive   mixed   negative
1972        northern Protestant   ≤ 8                           9        16       41
                                  9-12                         85        52      105
                                  ≥ 13                         77        30       38
            southern Protestant   ≤ 8                           8         8       46
                                  9-12                         35        29       54
                                  ≥ 13                         37        15       22
            Catholic              ≤ 8                          11        14       38
                                  9-12                         47        35      115
                                  ≥ 13                         25        12       42
1973        northern Protestant   ≤ 8                          17        17       42
                                  9-12                        102        38       84
                                  ≥ 13                         88        15       31
            southern Protestant   ≤ 8                          14        11       34
                                  9-12                         61        30       59
                                  ≥ 13                         49        11       19
            Catholic              ≤ 8                           6        16       26
                                  9-12                         60        29      108
                                  ≥ 13                         31        18       50
1974        northern Protestant   ≤ 8                          23        13       32
                                  9-12                        106        50       88
                                  ≥ 13                         79        21       31
            southern Protestant   ≤ 8                           5        15       37
                                  9-12                         38        39       54
                                  ≥ 13                         52        12       32
            Catholic              ≤ 8                           8        10       24
                                  9-12                         65        39       89
                                  ≥ 13                         37        18       43

Table 2: Parameters needed for the test (first step of the example for a graphical model)

n      p   m_1   m_2   m_3   m_4   q   m      σ        I_c
3181   4   3     3     3     3     8   0.25   0.0109   [0.2283, 0.2717]

$\lambda_1 = 0.3221$, $\lambda_2 = 0.2704$, $\lambda_3 = 0.2599$, $\lambda_4 = 0.2531$, $\lambda_5 = 0.2451$, $\lambda_6 = 0.2393$, $\lambda_7 = 0.2277$, $\lambda_8 = 0.1823$

Figure 6: Eigenvalues diagram (first step of the example for a graphical model)
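The first-step test can be sketched in a few lines. This is only an illustrative sketch: the function name `outside_interval` is hypothetical, and the value of $\sigma$ is taken directly from Table 2 rather than recomputed, since its formula is given in Section 4.1.1.

```python
def outside_interval(eigenvalues, p, sigma):
    """Return the 1-based ranks of the eigenvalues lying outside
    the interval 1/p +/- 2*sigma."""
    low = 1 / p - 2 * sigma
    high = 1 / p + 2 * sigma
    return [rank for rank, lam in enumerate(eigenvalues, start=1)
            if not (low <= lam <= high)]


# Eigenvalues of Figure 6 and parameters of Table 2.
eigenvalues = [0.3221, 0.2704, 0.2599, 0.2531, 0.2451, 0.2393, 0.2277, 0.1823]
print(outside_interval(eigenvalues, p=4, sigma=0.0109))  # [1, 7, 8]
```

An empty result would correspond to the mutual independence case, in which the procedure stops at the first step.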

The second step consists of the detection of two-way interactions. First, we apply our first step to only the three variables $X_1$, $X_2$ and $X_3$. With the values of $n$ and $m_i$ (for $i = 1, \dots, 3$) still the same, the other parameters become (Table 3):

Table 3: Parameters for the test (second step of the example for a graphical model)

q   m         σ        I_c
6   0.33333   0.0118   [0.3097, 0.3569]

We get the following eigenvalue diagram (Figure 7):

$\lambda_1 = 0.3606$, $\lambda_2 = 0.3448$, $\lambda_3 = 0.3385$, $\lambda_4 = 0.3305$, $\lambda_5 = 0.3025$

Figure 7: Eigenvalues diagram (second step of the example for a graphical model)

$\lambda_1$ and $\lambda_5$ are not in the interval $I_c$, so there is one or more two-way interaction between $X_1$, $X_2$ and $X_3$, as well as interactions between $X_4$ and the others.

Second, we look at the interactions between $X_4$ and $X_i$ ($i = 1, 2, 3$). For $i = 1$ to $i = 3$ we look at the eigenvalues of the MCA of $X_4$ with $X_i$, and calculate their variances and intervals $I_c$.

Crossing $X_1$ with $X_4$ we get (Table 4):

Table 4: MCA on X_1 and X_4 (parameters and eigenvalues)

q   m     σ        I_c                λ_1      λ_2      λ_3      λ_4
4   0.5   0.0125   [0.4750, 0.5250]   0.5389   0.5156   0.4644   0.4611

Crossing $X_2$ with $X_4$ we get (Table 5):

Table 5: MCA on X_2 and X_4 (parameters and eigenvalues)

q   m     σ        I_c                λ_1      λ_2      λ_3      λ_4
4   0.5   0.0125   [0.4750, 0.5250]   0.5741   0.5076   0.4924   0.4259

Crossing $X_3$ with $X_4$ we get (Table 6):

Table 6: MCA on X_3 and X_4 (parameters and eigenvalues)

q   m     σ        I_c                λ_1      λ_2      λ_3      λ_4
4   0.5   0.0125   [0.4750, 0.5250]   0.6112   0.5041   0.4959   0.3979

In the three cases, $\lambda_1$ and $\lambda_4$ are not in the interval $I_c$, so there is a two-way interaction between $X_1$ and $X_4$, between $X_2$ and $X_4$ and between $X_3$ and $X_4$, and we can construct the cross variables $X_{4i}$ having 9 modalities ($i = 1, 2, 3$).

Now we use the first step with only the two variables $X_1$ and $X_2$; afterwards we look for interactions between $X_3$ and $X_i$ ($i = 1, 2$).

Crossing $X_1$ with $X_2$ we get (Table 7):

Table 7: MCA on X_1 and X_2 (parameters and eigenvalues)

q   m     σ        I_c                λ_1      λ_2      λ_3      λ_4
4   0.5   0.0125   [0.4750, 0.5250]   0.5153   0.5045   0.4955   0.4848

All the eigenvalues are in the confidence interval, so $X_1$ and $X_2$ are independent conditionally on the others, and there is no cross variable $X_{12}$. The corresponding u-term $u_{12}$ equals zero.

Let us now look, for $i = 1$ and $i = 2$, at the eigenvalues of the MCA of $X_3$ with $X_i$, with their variances and intervals $I_c$.

Crossing $X_1$ with $X_3$ we get (Table 8):

Table 8: MCA on X_1 and X_3 (parameters and eigenvalues)

q   m     σ        I_c                λ_1      λ_2      λ_3      λ_4
4   0.5   0.0125   [0.4750, 0.5250]   0.5134   0.5023   0.4978   0.4866

All the eigenvalues are in the confidence interval $I_c$, so $X_1$ and $X_3$ are independent conditionally on the others, and there is no cross variable $X_{13}$: the corresponding u-term $u_{13}$ equals zero.

Crossing now $X_2$ with $X_3$ we get (Table 9):

Table 9: MCA on X_2 and X_3 (parameters and eigenvalues)

q   m     σ        I_c                λ_1      λ_2      λ_3      λ_4
4   0.5   0.0125   [0.4750, 0.5250]   0.5401   0.5128   0.4872   0.4599

Here, $\lambda_1$ and $\lambda_4$ are not in the interval $I_c$, so there is a two-way interaction between $X_2$ and $X_3$, $u_{23}$ is not set to zero, and we can add the cross variable $X_{32}$ (equivalently $X_{23}$), with 9 modalities, to the model.
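The pairwise screening carried out in Tables 4 to 9 can be summarized by a short sketch. The eigenvalues and the interval are copied from those tables; the function names are illustrative only. The second function is an addition that is not part of the procedure as described: it lists the three-way terms that a hierarchical model can still contain, using the standard fact that such a term requires all of its two-way sub-terms to be present.

```python
from itertools import combinations

I_C = (0.4750, 0.5250)  # confidence interval shared by Tables 4 to 9

# Eigenvalues of the six two-variable MCAs, keyed by the pair of variable indices.
pairwise = {
    (1, 4): [0.5389, 0.5156, 0.4644, 0.4611],  # Table 4
    (2, 4): [0.5741, 0.5076, 0.4924, 0.4259],  # Table 5
    (3, 4): [0.6112, 0.5041, 0.4959, 0.3979],  # Table 6
    (1, 2): [0.5153, 0.5045, 0.4955, 0.4848],  # Table 7
    (1, 3): [0.5134, 0.5023, 0.4978, 0.4866],  # Table 8
    (2, 3): [0.5401, 0.5128, 0.4872, 0.4599],  # Table 9
}


def has_interaction(eigenvalues, interval):
    """True if at least one eigenvalue falls outside the interval."""
    low, high = interval
    return any(lam < low or lam > high for lam in eigenvalues)


def admissible_triples(two_way_terms, variables):
    """Three-way terms whose three two-way sub-terms are all present."""
    present = set(two_way_terms)
    return [triple for triple in combinations(variables, 3)
            if all(pair in present for pair in combinations(triple, 2))]


two_way = sorted(pair for pair, eigs in pairwise.items()
                 if has_interaction(eigs, I_C))
print(two_way)                                    # [(1, 4), (2, 3), (2, 4), (3, 4)]
print(admissible_triples(two_way, [1, 2, 3, 4]))  # [(2, 3, 4)]
```

The retained pairs agree with the u-terms kept in the text ($u_{14}$, $u_{24}$, $u_{34}$ and $u_{23}$).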

The third step consists of the detection of triple interactions between variables, that is, of two-way interactions between the variables $X_i$ and the cross variables $X_{jk}$.

We first put the cross variables ($X_{41}$, $X_{42}$, $X_{43}$, $X_{32}$) together with the initial variables that were deemed non-dependent in the second step of the procedure, i.e. $X_1$ and $X_2$, and then we apply the first step of the procedure to the resulting set of variables. We get the following results (Table 10 and Figure 8):

Table 10: MCA on X_1, X_2, X_41, X_42, X_43 and X_32 (parameters, third step of the example for a graphical model)

q    m        σ        I_c
36   0.1667   0.0168   [0.1331, 0.2003]

$\lambda_1 = 0.5201$, $\lambda_2 = 0.5006$, $\lambda_3 = 0.3447$, $\lambda_4 = 0.3347$, $\lambda_5 = 0.3303$, $\lambda_6 = 0.3193$, $\lambda_7 = 0.1810$, $\lambda_8 = 0.1796$, $\lambda_9 = 0.1732$, $\lambda_{10} = 0.1710$, $\lambda_{11} = 0.1664$, $\lambda_{12} = 0.1627$, $\lambda_{13} = 0.1626$, $\lambda_{14} = 0.1578$, $\lambda_{15} = 0.1538$, $\lambda_{16} = 0.1423$

Figure 8: MCA on X_1, X_2, X_41, X_42, X_43 and X_32 (eigenvalues diagram, third step of the example for a graphical model)

The first six eigenvalues are not in $I_c$: there is one or more two-way interaction between the initial variables $X_i$ and the crossed ones $X_{ik}$, so there exists a triple interaction between simple variables.

We drop $X_{32}$ and use the first step with the five other variables to get the following results (Table 11 and Figure 9):

Table 11: MCA on X_1, X_2, X_41, X_42 and X_43 (parameters for the test)

q    m     σ        I_c
28   0.2   0.0162   [0.1671, 0.2324]

$\lambda_1 = 0.6105$, $\lambda_2 = 0.6006$, $\lambda_3 = 0.4143$, $\lambda_4 = 0.4028$, $\lambda_5 = 0.3982$, $\lambda_6 = 0.3831$, $\lambda_7 = 0.2262$, $\lambda_8 = 0.2220$, $\lambda_9 = 0.2162$, $\lambda_{10} = 0.2083$, $\lambda_{11} = 0.2054$, $\lambda_{12} = 0.2017$, $\lambda_{13} = 0.1952$, $\lambda_{14} = 0.1986$, $\lambda_{15} = 0.1952$, $\lambda_{16} = 0.1928$, $\lambda_{17} = 0.1878$, $\lambda_{18} = 0.1837$, $\lambda_{19} = 0.1815$, $\lambda_{20} = 0.1711$

Figure 9: MCA on X_1, X_2, X_41, X_42 and X_43 (eigenvalues diagram, third step of the example for a graphical model)

The first six eigenvalues are not in $I_c$, so there is at least one two-way interaction between the variables. We know that the simple variables $X_1$, $X_2$ and the crossed variables $X_{41}$, $X_{42}$, $X_{43}$ are dependent, so we only have to test the dependence between $X_1$ and $X_{32}$.

Crossing $X_1$ and $X_{32}$ we get the following results (Table 12):

Table 12: MCA on X_1 and X_32 (parameters and eigenvalues)

q    m     σ        I_c
10   0.5   0.0159   [0.4682, 0.5318]

λ_1      λ_2      λ_3      λ_4      λ_5      λ_6      λ_7      λ_8      λ_9      λ_10
0.5287   0.5194   0.5000   0.5000   0.5000   0.5000   0.5000   0.5000   0.4806   0.4713

All the eigenvalues are in the confidence interval $I_c$, so $X_1$ and $X_{32}$ are independent conditionally on the others, and there is no cross variable $X_{132}$. The corresponding u-term $u_{123}$ equals zero.

Now we can drop the cross variable $X_{43}$. The remaining variables $X_1$, $X_2$, $X_{41}$, $X_{42}$ are dependent by construction. We only have to test for dependence between $X_1$ and $X_{43}$.

Crossing $X_1$ with $X_{43}$, we get the same parameters as for the crossing of $X_1$ and $X_{32}$, and the following eigenvalues (Table 13):

Table 13: MCA on X_1 and X_43 (eigenvalues)

λ_1      λ_2      λ_3      λ_4      λ_5      λ_6      λ_7      λ_8      λ_9      λ_10
0.5445   0.5232   0.5000   0.5000   0.5000   0.5000   0.5000   0.5000   0.4768   0.4555

We remark that $\lambda_1$ and $\lambda_{10}$ are not in the interval $I_c$, so $X_1$ and $X_{43}$ seem to be dependent. But we have to fit a graphical model, which is a particular case of hierarchical models (as defined in 2.2.2.2, a log-linear model is hierarchical if, whenever one particular u-term is constrained to zero, all higher u-terms containing the same set of subscripts are also set to zero). Here the u-term $u_{13}$ is set to zero, so the u-term $u_{134}$ is also set to zero.

Crossing $X_2$ with $X_{43}$, we get the same parameters as for the crossing of $X_1$ and $X_{32}$, and the following eigenvalues (Table 14):

Table 14: MCA on X_2 and X_43 (eigenvalues)

λ_1      λ_2      λ_3      λ_4      λ_5      λ_6      λ_7      λ_8      λ_9      λ_10
0.5871   0.5466   0.5000   0.5000   0.5000   0.5000   0.5000   0.5000   0.4534   0.4143

Eigenvalues $\lambda_1$, $\lambda_2$, $\lambda_9$ and $\lambda_{10}$ are not in the interval $I_c$; the u-terms $u_{23}$ and $u_{24}$ are not set to zero, and since $X_2$ and $X_{43}$ are not independent, the u-term $u_{234}$ is not set to zero.

Crossing $X_1$ with $X_{42}$ (or equivalently $X_2$ with $X_{41}$) we get the same parameters as for the crossing of $X_1$ and $X_{32}$, and the following eigenvalues:

Table 15: MCA on X_1 and X_42 (eigenvalues)

λ_1      λ_2      λ_3      λ_4      λ_5      λ_6      λ_7      λ_8      λ_9      λ_10
0.5434   0.5289   0.5000   0.5000   0.5000   0.5000   0.5000   0.5000   0.4711   0.4566

Eigenvalues $\lambda_1$ and $\lambda_{10}$ are not in the interval $I_c$, so $X_1$ and $X_{42}$ appear to be dependent; however, the u-term $u_{12}$ is equal to zero, so in a hierarchical model the u-term $u_{124}$ is set to zero.

Finally, the variables $X_1$ and $X_{41}$ are dependent by construction. The procedure stops here because we cannot have more than triple interactions in a hierarchical model when not all the two-way interactions are present. We obtain the following model (see Figure 10 for the associated graph):

Figure 10: Lattice diagram (example for a graphical model)

$$\log[f_4(X)] = u_0 + u_1 x_1 + u_2 x_2 + u_3 x_3 + u_4 x_4 + u_{32} x_2 x_3 + u_{41} x_4 x_1 + u_{42} x_4 x_2 + u_{43} x_4 x_3 + u_{432} x_4 x_3 x_2$$

5.3 An example for a saturated model

Here we use a data set about shop-lifting habits given by Israëls [29] that was also used by Van der Heijden et al. [28]. Table 16 is a contingency table crossing three variables: sex (2 modalities), age (9 modalities) and type of goods (13 modalities: Clothing (C), Clothing accessories (Ca), Provisions-Tobacco (PT), Writing materials (Wm), Books (B), Records (R), Household goods (Hg), Sweets (S), Toys (T), Jewellery (J), Perfume (P), Hobbies tools (Ht), and Others (O)), observed over 33 101 individuals.

In the first step, we test the pairwise independence of the variables $X_1$, $X_2$ and $X_3$. We first transform the contingency table into a complete disjunctive table, then compute the parameters (defined in 2.2 and 4.1.1) needed for the test (Table 17). An MCA on the three variables gives the eigenvalue diagram of Figure 11.

The eigenvalue diagram shows clearly that the variables are not independent: only 9 eigenvalues ($\lambda_7, \dots, \lambda_{15}$) are in the confidence interval. Using the second step of the procedure, we get the two-way interactions.

Table 16: Multicontingency table for the shop-lifting data

Sex: X_1   Age: X_2   Goods: X_3
                      C      Ca    PT    Wm     B     R     Hg    S     T     J     P     Ht    O
Male       ≤ 11       81     66    150   667    67    24    47    430   743   132   32    197   209
           12-14      138    204   340   1409   259   272   117   637   684   408   57    547   550
           15-17      304    193   229   527    258   368   98    246   116   298   61    402   454
           18-20      384    149   151   84     146   141   61    40    13    71    52    138   252
           21-29      942    297   313   92     251   167   193   30    16    130   111   280   624
           30-39      359    109   136   36     96    67    75    11    16    31    54    200   195
           40-49      178    53    121   36     48    29    50    5     6     14    41    152   88
           50-64      137    68    171   37     56    27    55    17    3     11    50    211   90
           ≥ 65       45     28    145   17     41    7     29    28    8     10    28    111   34
Female     ≤ 11       71     19    59    224    19    7     22    137   113   162   70    15    24
           12-14      241    98    111   463    60    32    29    240   98    138   178   29    58
           15-17      477    114   58    91     50    27    41    80    14    548   141   9     72
           18-20      436    108   76    18     32    12    32    12    10    303   70    14    67
           21-29      1180   207   132   30     61    21    65    16    12    74    104   30    157
           30-39      1009   165   121   27     43    9     74    14    31    100   81    36    107
           40-49      517    102   93    23     31    7     51    10    8     48    46    24    66
           50-64      488    127   214   27     57    13    79    23    17    22    69    35    64
           ≥ 65       173    64    215   13     44    0     39    42    6     12    41    11    55

Table 17: Parameters needed for the test (first step of the example for a saturated model)

n       p   m_1   m_2   m_3   q    m        σ        I_c
33101   3   2     9     13    21   0.3333   0.0061   [0.3211, 0.3455]

$\lambda_1 = 0.5759$, $\lambda_2 = 0.4256$, $\lambda_3 = 0.3966$, $\lambda_4 = 0.3899$, $\lambda_5 = 0.3542$, $\lambda_6 = 0.3494$, $\lambda_7 = 0.3407$, $\lambda_8 = 0.3384$, $\lambda_9 = 0.3344$, $\lambda_{10} = 0.3333$, $\lambda_{11} = 0.3333$, $\lambda_{12} = 0.3333$, $\lambda_{13} = 0.3322$, $\lambda_{14} = 0.3271$, $\lambda_{15} = 0.3260$, $\lambda_{16} = 0.3177$, $\lambda_{17} = 0.3103$, $\lambda_{18} = 0.2802$, $\lambda_{19} = 0.2632$, $\lambda_{20} = 0.1925$, $\lambda_{21} = 0.1423$

Figure 11: MCA on X_1, X_2 and X_3 (eigenvalues diagram, first step of the example for a saturated model)

MCA of $X_1$ and $X_3$ gives the following results (Table 18 and Figure 12):

Table 18: MCA on X_1 and X_3 (parameters)

n       p   q    m     σ         I_c
33101   2   13   0.5   0.00002   [0.5000, 0.5000]

$\lambda_1 = 0.7032$, $\lambda_2 = \lambda_3 = \cdots = \lambda_{12} = 0.5000$, $\lambda_{13} = 0.2968$

Figure 12: MCA on X_1 and X_3 (eigenvalues diagram, second step of the example for a saturated model)

The first and the last eigenvalues are not in the confidence interval, so the u-term $u_{13}$ is not set to zero. We notice here the peculiar form of the eigenvalue diagram, due to the fact that the multiple eigenvalue $\lambda = \frac{1}{2}$, which has multiplicity $11 = m_3 - m_1$, is an artificial one (cf. 4.2.1.1).

MCA of $X_2$ and $X_3$ gives the following results (Table 19 and Figure 13):

Table 19: MCA on X_2 and X_3 (parameters)

n       p   q    m     σ        I_c
33101   2   20   0.5   0.0001   [0.4998, 0.5002]

The first 8 and the last 8 eigenvalues are not in the confidence interval, so the u-term $u_{23}$ is not set to zero.

$\lambda_1 = 0.7852$, $\lambda_2 = 0.6074$, $\lambda_3 = 0.5903$, $\lambda_4 = 0.5346$, $\lambda_5 = 0.5245$, $\lambda_6 = 0.5112$, $\lambda_7 = 0.5109$, $\lambda_8 = 0.5019$, $\lambda_9 = 0.5000$, $\lambda_{10} = 0.5000$, $\lambda_{11} = 0.5000$, $\lambda_{12} = 0.5000$, $\lambda_{13} = 0.4981$, $\lambda_{14} = 0.4891$, $\lambda_{15} = 0.4888$, $\lambda_{16} = 0.4755$, $\lambda_{17} = 0.4654$, $\lambda_{18} = 0.4097$, $\lambda_{19} = 0.3926$, $\lambda_{20} = 0.2148$

Figure 13: MCA on X_2 and X_3 (eigenvalues diagram, second step of the example for a saturated model)

MCA of $X_1$ and $X_2$ gives the following results (Table 20 and Figure 14):

Table 20: MCA on X_1 and X_2 (parameters)

n       p   q   m     σ        I_c
33101   2   9   0.5   0.0037   [0.4926, 0.5074]

$\lambda_1 = 0.6241$, $\lambda_2 = \lambda_3 = \cdots = \lambda_8 = 0.5000$, $\lambda_9 = 0.3759$

Figure 14: MCA on X_1 and X_2 (eigenvalues diagram, second step of the example for a saturated model)

The first and the last eigenvalues are not in the confidence interval, so the u-term $u_{12}$ is not set to zero. At the end of the second step, we obtain all three two-way interactions.

To know whether the model is a saturated one, we can build one of the crossed variables and test its dependence with the remaining simple variable. MCA of $X_{32}$ with $X_1$ gives the following eigenvalues: $\lambda_1 = 0.7285$, $\lambda_2 = \lambda_3 = \cdots = \lambda_{116} = 0.5$, $\lambda_{117} = 0.2715$, and $I_c = [0.4615, 0.5384]$.

The first and the last eigenvalues are not in the confidence interval, so the u-term $u_{123}$ is not set to zero. At the end we get the following saturated model:
$$\log[f_3(X)] = u_0 + u_1 x_1 + u_2 x_2 + u_3 x_3 + u_{12} x_1 x_2 + u_{23} x_2 x_3 + u_{13} x_1 x_3 + u_{123} x_1 x_2 x_3$$

5.4 An example for a mutual independence model

Here we use a data set given by Andersen [2] as a contingency table crossing four variables observed over 299 individuals, corresponding to a retrospective study of ovary cancer, defined in Table 21:

Table 21: Retrospective study of ovary cancer

X_1: stage   X_2: operation   X_3: survival   X_4: X-ray
                                              No    Yes
Early        radical          no              10    17
                              yes             41    64
             limited          no              1     3
                              yes             13    9
Advanced     radical          no              38    64
                              yes             6     11
             limited          no              3     13
                              yes             1     5

In the first step of the procedure, we test for the pairwise independence of the variables $X_1$, $X_2$, $X_3$ and $X_4$. We first transform the contingency table into a complete disjunctive table, then compute the parameters (see 4.1.1) needed for the test.
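The transformation of a contingency table into a complete disjunctive table can be sketched as follows, using the counts of Table 21. This is only an illustrative sketch: the data layout and the function name `disjunctive_rows` are assumptions made for the example, not the authors' implementation.

```python
# Modalities of X_1 (stage), X_2 (operation), X_3 (survival), X_4 (X-ray).
LEVELS = [("early", "advanced"), ("radical", "limited"), ("no", "yes"), ("no", "yes")]

# Counts of Table 21: (stage, operation, survival) -> (X-ray no, X-ray yes).
TABLE_21 = {
    ("early", "radical", "no"): (10, 17),
    ("early", "radical", "yes"): (41, 64),
    ("early", "limited", "no"): (1, 3),
    ("early", "limited", "yes"): (13, 9),
    ("advanced", "radical", "no"): (38, 64),
    ("advanced", "radical", "yes"): (6, 11),
    ("advanced", "limited", "no"): (3, 13),
    ("advanced", "limited", "yes"): (1, 5),
}


def disjunctive_rows(table, levels):
    """Expand the cell counts into one 0/1 indicator row per individual."""
    rows = []
    for (stage, operation, survival), counts in table.items():
        for xray, count in zip(levels[3], counts):
            cell = (stage, operation, survival, xray)
            indicator = tuple(
                1 if value == modality else 0
                for value, modalities in zip(cell, levels)
                for modality in modalities
            )
            rows.extend([indicator] * count)
    return rows


rows = disjunctive_rows(TABLE_21, LEVELS)
print(len(rows))                  # 299 individuals
print(len(rows[0]))               # 8 indicator columns (2 per variable)
print({sum(row) for row in rows}) # {4}: exactly one modality per variable
```

Each row contains exactly one 1 per variable, so every row sums to $p = 4$, and the number of rows recovers the sample size $n = 299$ quoted in the text.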

The MCA on the four variables gives the following results (Table 22 and Figure 15):

Table 22: Parameters needed for the test (first step of the example for a mutual independence model)

n     p   m_1   m_2   m_3   m_4   q   m      σ        I_c
299   4   2     2     2     2     4   0.25   0.0250   [0.2000, 0.3000]

$\lambda_1 = 0.4145$, $\lambda_2 = 0.2512$, $\lambda_3 = 0.2449$, $\lambda_4 = 0.0894$

Figure 15: MCA on X_1, X_2, X_3 and X_4 (eigenvalues diagram, first step of the example for a mutual independence model)

The eigenvalue diagram shows clearly that the variables are not independent: only $\lambda_2$ and $\lambda_3$ are in the confidence interval.

Let us drop $X_4$ and use the second step of the procedure. MCA on the three remaining variables gives the following results (Table 23 and Figure 16):

Table 23: MCA on X_1, X_2 and X_3 (parameters)

n     p   q   m        σ        I_c
299   3   3   0.3333   0.0273   [0.2787, 0.3879]

$\lambda_1 = 0.3639$, $\lambda_2 = 0.3342$, $\lambda_3 = 0.3019$

Figure 16: MCA on X_1, X_2 and X_3 (eigenvalues diagram)

The eigenvalue diagram shows clearly that the variables are independent, since all the eigenvalues are in the confidence interval, so there is surely one or more interaction between $X_4$ and $X_i$, $i = 1, \dots, 3$