ON THE CONNECTION BETWEEN THE DISTRIBUTION OF EIGENVALUES IN MULTIPLE CORRESPONDENCE ANALYSIS AND LOG-LINEAR MODELS


Authors:

S. Ben Ammou
Department of Quantitative Methods, Faculté de Droit et des Sciences Economiques et Politiques de Sousse, Tunisia (saloua.benammou@fdseps.rnu.tn)

G. Saporta
Chaire de Statistique Appliquée & CEDRIC, Conservatoire National des Arts et Métiers, Paris, France (saporta@cnam.fr)

Received: December 2002. Revised: September 2003. Accepted: October 2003.

Abstract: Multiple Correspondence Analysis (MCA) and log-linear modeling are two techniques for multi-way contingency table analysis having different approaches and fields of application. Log-linear models are interesting when applied to a small number of variables. Multiple Correspondence Analysis is useful in large tables. This efficiency is balanced by the fact that MCA is not able to make explicit the relations between more than two variables, as can be done through log-linear modeling. The two approaches are complementary. We present in this paper the distribution of eigenvalues in MCA when the data fit a known log-linear model, then we construct this model by successive applications of MCA. We also propose an empirical procedure, fitting progressively the log-linear model, where the fitting criterion is based on eigenvalue diagrams. The procedure is validated on several sets of data used in the literature.

Key-Words: Multiple Correspondence Analysis; eigenvalues; log-linear models; graphical models; normal distribution.

AMS Subject Classification: 49A05, 78B26.

We thank Professor M. Bourdeau for his careful reading.


1. INTRODUCTION

Multiple Correspondence Analysis and log-linear modeling are two very different, but mutually beneficial, approaches to analyzing multi-way contingency tables: log-linear models are profitably applied to a small number of variables, whereas Multiple Correspondence Analysis is useful in large tables. This efficiency is balanced by the fact that MCA is not able to make explicit the relations between more than two variables, as can be done through log-linear modeling. The two approaches are complementary.

After a short reminder on MCA and log-linear approaches, we study the distribution of eigenvalues in MCA under modeling hypotheses, especially in the case of independence. At the end we propose an algorithmic approach for fitting log-linear models where the fitting criterion is based on the eigenvalue diagram.

2. A SHORT SURVEY OF MULTIPLE CORRESPONDENCE ANALYSIS AND LOG-LINEAR MODELS

We first introduce MCA and log-linear modeling, then we present some works using both methods.

2.1. Multiple Correspondence Analysis

Correspondence Analysis (CA) has quite a long history as a method for the analysis of categorical data. The starting point of this history is usually set in 1935 [28], and since then CA has been reinvented several times. We can distinguish simple CA (CA of contingency tables) and MCA or Multiple Correspondence Analysis (CA of so-called indicator matrices). MCA traces back to Guttman [23], Burt [8] or Hayashi [25]. In France, in the 1960s, Benzécri [6] proposed other developments of this method. Outside France, MCA has been developed by J. de Leeuw since 1973 [22] under the name of Homogeneity Analysis, and under the name of Dual Scaling by Nishisato [38].

Multiple Correspondence Analysis (MCA) is a multidimensional descriptive technique for categorical data. A theoretical version of the Multiple Correspondence Analysis of p variables can be defined as the limit, when the number of statistical units increases, of the CA of a complete disjunctive table.

Let $X$ be a complete disjunctive table of $p$ categorical variables $X_1, X_2, \dots, X_p$, with respectively $m_1, m_2, \dots, m_p$ modalities, observed over a sample of $n$ individuals. CA of this complete disjunctive table is equivalent to the analysis of $B$ [8], where $B = X'X$ is the Burt table associated with $X$. The two analyses have the same factors, but the eigenvalues in MCA are equal to the square roots of the eigenvalues in the CA of the associated Burt table.

MCA of $X$ corresponds to the diagonalization of the matrix

$$\frac{1}{p}\,(D^{-1}X'X) = \frac{1}{p}\,(D^{-1}B), \qquad \text{where } D = \mathrm{Diag}(X'X) = \mathrm{Diag}(B).$$

The structure of the eigenvalue diagram depends on the variable interactions. It is well known that in the case of pairwise independent variables, the $q$ non-trivial eigenvalues are theoretically equal to $\frac{1}{p}$, where

(1) $\quad q = \sum_{i=1}^{p} m_i - p.$

2.2. Log-linear modeling

Log-linear modeling is a well-known method for studying structural relationships between categorical variables in a multiple contingency table when no variable has a particular role. Relatively recent, and not as well known in France as MCA, log-linear modeling has many classical references. After the first works of Birch [7] in 1963 and Goodman [17], we must mention the basic books of Haberman [24], Bishop, Fienberg & Holland [8] and Fienberg [15]. More recently, Dobson [12], Agresti [1] and Christensen [10] have written syntheses on the subject supplemented with personal contributions. Whittaker [41] devotes a large part of his book to log-linear models before defining the associated graphical models.

2.2.1. Log-linear modeling in the binomial case

Let $X = (X_1, X_2, \dots, X_k)$ be a $k$-dimensional random vector, with values in $\{0, 1\}^k$. The expression for the $k$-dimensional probability density of $X$ is:

$$f_k(x) = p(0,0,\dots,0)^{(1-x_1)(1-x_2)\cdots(1-x_k)}\; p(1,0,\dots,0)^{x_1(1-x_2)\cdots(1-x_k)}\; p(0,1,\dots,0)^{(1-x_1)x_2\cdots(1-x_k)} \cdots p(0,0,\dots,1)^{(1-x_1)(1-x_2)\cdots x_k}\; p(1,1,\dots,0)^{x_1x_2\cdots(1-x_k)} \cdots p(1,1,\dots,1)^{x_1x_2\cdots x_k}.$$

We can write the density function as a log-linear expansion:

$$\log[f_k(x)] = u_0 + \sum_{i=1}^{k} u_i x_i + \sum_{i,j=1,\ i\neq j}^{k} u_{ij}\, x_i x_j + \sum_{i,j,l=1,\ i\neq j\neq l}^{k} u_{ijl}\, x_i x_j x_l + \cdots + u_{123\dots k}\, x_1 x_2 \cdots x_k,$$

where $u_0 = \log[p(0,0,\dots,0)]$, $u_i = \log\!\left[\dfrac{p(0,\dots,0,1,0,\dots,0)}{p(0,0,\dots,0)}\right]$ (the 1 being in position $i$), and the u-terms $u_{ij}, \dots, u_{123\dots k}$ are log cross-product ratios in the $(k, k)$ probability table. The u-term $u_{ij}$ is set to zero when $X_i$ and $X_j$ are independent variables.
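As an illustration of the binary expansion for $k = 2$, the following minimal sketch computes the four u-terms from a $2 \times 2$ probability table and checks that $u_{12}$ (the log cross-product ratio) vanishes under independence. The function names and the two example tables are assumptions made for this illustration only; they do not come from the paper.

```python
import math

def u_terms_2x2(p):
    """u-terms of log f(x1, x2) = u0 + u1*x1 + u2*x2 + u12*x1*x2.

    p[(x1, x2)] is the joint probability of cell (x1, x2), with x1, x2 in {0, 1}.
    """
    u0 = math.log(p[(0, 0)])
    u1 = math.log(p[(1, 0)] / p[(0, 0)])
    u2 = math.log(p[(0, 1)] / p[(0, 0)])
    # u12 is the log cross-product (odds) ratio of the 2 x 2 table.
    u12 = math.log(p[(0, 0)] * p[(1, 1)] / (p[(1, 0)] * p[(0, 1)]))
    return u0, u1, u2, u12

def log_density(u, x1, x2):
    """Evaluate the log-linear expansion at the cell (x1, x2)."""
    u0, u1, u2, u12 = u
    return u0 + u1 * x1 + u2 * x2 + u12 * x1 * x2

# Hypothetical independent table: p(x1, x2) = a(x1) * b(x2).
a, b = (0.3, 0.7), (0.6, 0.4)
p_indep = {(i, j): a[i] * b[j] for i in (0, 1) for j in (0, 1)}
u_indep = u_terms_2x2(p_indep)

# Hypothetical dependent table.
p_dep = {(0, 0): 0.4, (1, 0): 0.1, (0, 1): 0.1, (1, 1): 0.4}
u_dep = u_terms_2x2(p_dep)

print("u12 under independence:", u_indep[3])
print("u12 under dependence:  ", u_dep[3])
```

In both cases the expansion reproduces the log-probability of every cell; only the interaction term distinguishes the two tables.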

2.2.2. Log-linear modeling in the multinomial case

Let $X = (X_1, X_2, \dots, X_k)$ be a $k$-dimensional random vector, with values in $\{0, 1, \dots, m_1-1\} \times \{0, 1, \dots, m_2-1\} \times \cdots \times \{0, 1, \dots, m_k-1\}$ instead of $\{0, 1\}^k$ as in the preceding case. The generalisation to the $k$-dimensional cross-classified multinomial distribution is the log-linear expansion:

$$\log[f_k(x)] = u_0 + \sum_{i=1}^{k} u_i(x) + \sum_{i,j=1,\ i\neq j}^{k} u_{ij}(x) + \sum_{i,j,l=1,\ i\neq j\neq l}^{k} u_{ijl}(x) + \cdots + u_{123\dots k}(x).$$

Each u-term is a coordinate projection function with the coordinates indicated by its index, and each u-term is constrained to be zero whenever one of its indicated coordinates is zero. The importance of log-linear expansions rests on the fact that many interesting hypotheses can be generated by setting some u-terms to zero. We are particularly interested in this paper in graphical and hierarchical log-linear models.

2.2.2.1. Graphical log-linear models

Let $G = (K, E)$ be the independence graph of the $k$-dimensional random vector $X$, with $k$ vertices in $K = \{1, 2, \dots, k\}$ and edge set $E$. The edge set $E$ is the set of pairs $(i, j)$ such that whenever $(i, j)$ is not in $E$ the variables $X_i$ and $X_j$ are independent conditionally on the other variables.

Given an independence graph $G$, the cross-classified multinomial distribution for the random vector $X$ is a graphical model for $X$ if the distribution of $X$ is arbitrary apart from constraints of the form that, for all pairs of coordinates not in the edge set $E$ of $G$, the u-terms containing the selected coordinates are identically zero.

2.2.2.2. Hierarchical log-linear models

A graphical model satisfies constraints of the form that all u-terms above a fixed point have to be zero to get conditional independence. A larger class of models, the hierarchical models, is obtained by allowing more flexibility in setting the u-terms to zero. A log-linear model is hierarchical if, whenever one particular u-term is constrained to zero, then all higher u-terms containing the same set of subscripts are also set to zero.
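The two definitions can be checked mechanically on small models. The sketch below is an illustration only: it assumes a model is described by the list of index sets of its non-zero u-terms (the constant $u_0$ being implicit), and the helper names `is_hierarchical` and `is_graphical` are hypothetical. The graphical check takes the edges of the independence graph to be the non-zero two-way terms.

```python
from itertools import combinations

def is_hierarchical(terms):
    """True if every lower-order term contained in a non-zero u-term is non-zero.

    This is the contrapositive of the definition: a zero u-term forces all
    higher u-terms containing its subscripts to be zero.
    """
    included = {frozenset(t) for t in terms}
    for t in included:
        for r in range(1, len(t)):
            for sub in combinations(sorted(t), r):
                if frozenset(sub) not in included:
                    return False
    return True

def is_graphical(terms, vertices):
    """True if the non-zero u-terms are exactly the complete subsets of the graph.

    Edges are read off the non-zero two-way terms; a subset of vertices is
    complete when all of its pairs are edges.
    """
    included = {frozenset(t) for t in terms}
    edges = {t for t in included if len(t) == 2}
    for r in range(1, len(vertices) + 1):
        for sub in combinations(sorted(vertices), r):
            complete = all(frozenset(pair) in edges
                           for pair in combinations(sub, 2))
            if complete != (frozenset(sub) in included):
                return False
    return True

# All two-way interactions but no three-way term: hierarchical, not graphical.
no_three_way = [{1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}]
# One interaction between X1 and X2, X3 independent of both: graphical.
one_edge = [{1}, {2}, {3}, {1, 2}]
# An interaction without its main effects: not hierarchical.
broken = [{3}, {1, 2}]

print(is_hierarchical(no_three_way), is_graphical(no_three_way, {1, 2, 3}))
print(is_hierarchical(one_edge), is_graphical(one_edge, {1, 2, 3}))
print(is_hierarchical(broken))
```

The brute-force enumeration of subsets is only meant for a handful of variables.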

We note here that every distribution with a log-linear expansion has an interaction (or independence) graph, and a hierarchical log-linear model is graphical if and only if its maximal u-terms correspond to cliques in the independence graph.

When all the u-terms are non-zero, we have the saturated model. In the case when only the $u_i$ are non-zero, the model is called the mutual independence model:

$$\log[f_k(x)] = u_0 + \sum_{i=1}^{k} u_i(x).$$

When only the $u_i$ and some of the $u_{ij}$ are non-zero, the model is called a conditional independence model:

$$\log[f_k(x)] = u_0 + \sum_{i=1}^{k} u_i(x) + \sum_{i,j} u_{ij}(x).$$

These conditional independence models refer to simple interactions between some variables.

2.2.3. Parameter estimation and related tests

Theoretical frequencies are generally estimated using the maximum-likelihood method. Weighted regression or iterative methods can also be used, since log-linear models are particular cases of the generalized linear model.

Usually the classical $\chi^2$ or the $G^2$ (likelihood ratio) tests are used to assess log-linear models. The values of the two statistics increase with the number of variables, and decrease with the number of interactions. The closer the statistics are to zero, the better the models.

Model selection becomes difficult when the number of variables grows: e.g. with four variables there are 167 different hierarchical models. To avoid the combinatorial explosion we can use criteria based on the Kullback information, like the Akaike criterion:

$$\mathrm{AIC} = -2\log(\hat{L}) + 2k \quad \text{(An Information Criterion)},$$

or the Schwarz criterion:

$$\mathrm{BIC} = -2\log(\hat{L}) + k\log(n) \quad \text{(Bayesian Information Criterion)},$$

where $\hat{L}$ is the maximum of the likelihood function $L$, $k$ the number of parameters, and $n$ the sample size.
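For the mutual independence model the fitted frequencies have a closed form (products of the one-way margins), so the statistics above can be sketched without an iterative fit. The code below is a minimal illustration under stated assumptions: the table is a nested list of counts for three variables, the log-likelihood is the multinomial kernel (the constant multinomial coefficient is omitted), and the parameter count is taken as the number of free main-effect parameters. The function name and the example tables are hypothetical.

```python
import math
from itertools import product

def fit_mutual_independence(table):
    """Chi-square, G2, AIC and BIC of the mutual independence model.

    table[i][j][k] holds the observed counts of a three-way contingency table.
    """
    I, J, K = len(table), len(table[0]), len(table[0][0])
    cells = list(product(range(I), range(J), range(K)))
    n = sum(table[i][j][k] for i, j, k in cells)
    # One-way margins of the three variables.
    m1 = [sum(table[i][j][k] for j in range(J) for k in range(K)) for i in range(I)]
    m2 = [sum(table[i][j][k] for i in range(I) for k in range(K)) for j in range(J)]
    m3 = [sum(table[i][j][k] for i in range(I) for j in range(J)) for k in range(K)]
    chi2 = g2 = loglik = 0.0
    for i, j, k in cells:
        obs = table[i][j][k]
        fit = m1[i] * m2[j] * m3[k] / n ** 2   # fitted frequency under independence
        chi2 += (obs - fit) ** 2 / fit
        if obs > 0:
            g2 += 2.0 * obs * math.log(obs / fit)
            loglik += obs * math.log(fit / n)  # multinomial kernel only
    n_params = (I - 1) + (J - 1) + (K - 1)
    aic = -2.0 * loglik + 2.0 * n_params
    bic = -2.0 * loglik + n_params * math.log(n)
    return {"n": n, "chi2": chi2, "G2": g2, "AIC": aic, "BIC": bic}

# Hypothetical table whose counts are exact products of margins (perfect fit).
a, b, c = [1, 2], [3, 1], [2, 5]
independent = [[[a[i] * b[j] * c[k] for k in range(2)] for j in range(2)]
               for i in range(2)]
# Hypothetical table with visible association.
dependent = [[[10, 2], [3, 4]], [[1, 8], [6, 5]]]

fit_indep = fit_mutual_independence(independent)
fit_dep = fit_mutual_independence(dependent)
print(fit_indep["chi2"], fit_indep["G2"])
print(fit_dep["chi2"], fit_dep["G2"])
```

Models with interaction terms generally need an iterative fit, which this sketch does not attempt.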

2.3. Multiple Correspondence Analysis and log-linear models as complementary tools of analysis

In this section, we present some works that show how CA (or MCA) and log-linear modeling can be related. This leads to a better understanding of CA, and to a combined use of both methods.

CA is often introduced without any reference to other methods of statistical treatment of categorical data with proven usefulness and flexibility. A major difference between CA and most other techniques for categorical data analysis lies in the use of probability models. In log-linear analysis (LLA), for example, a distribution is assumed under which the data are collected, then a log-linear model for the data is hypothesized and estimations are made under the assumption that this probability model is true, and finally these estimates are compared with the observed frequencies to evaluate the log-linear model. In this way it is possible to make inferences about the population on the basis of the sample data. In CA, it is claimed that no underlying distribution has to be assumed and no model has to be hypothesized, but a decomposition of the data is obtained to study the structure in the data.

A vast literature has been devoted to the assessment of CA (or MCA) and LLA. We briefly report here some of that literature.

Several works compare CA or MCA and LLA. Daudin and Trecourt [11] demonstrate empirically that LLA is better adapted to study global relationships between the variables, in the sense that marginal liaisons are eliminated in the computation of profiles. Goodman [17],[18],[19],[20],[21] defines two models belonging to the same family: the saturated row-column correspondence analysis model as a generalization of MCA, and the row-column association model as a generalization of LLA. He demonstrates, with illustrations by examples, that using these models is better than using the classical methods. Baccini, Mathieu and Mondot [3] use an example to compare the two methods. Jmel [30] and De Falguerolles, Jmel and Whittaker [13],[14] use graphical models compared to MCA.

Other works use CA or MCA and LLA as a combined approach to contingency table analysis: Van der Heijden and de Leeuw [26],[27],[28], Novak and Hoffman [39] and others use CA as a tool for the exploration of the residuals from log-linear models, and give an example of the procedure. Worsley [42] shows that in certain cases CA leads directly to the appropriate log-linear model. Lauro and Decarli [31] used CA as a procedure for the identification of the best log-linear models.

3. EIGENVALUES IN CORRESPONDENCE ANALYSIS

It is well known that MCA is an extension of CA, so we first present eigenvalues in CA, and some simple rules for the selection of the number of eigenvalues.

3.1. Asymptotic distribution of eigenvalues in Correspondence Analysis

Let $N$ be a contingency table with $m_1$ rows and $m_2$ columns, and let us assume that $N$ is the realization of a multinomial distribution $M(n, p_{ij})$, which is realistic. In this framework the observed eigenvalues $\mu_i$ are estimators of the eigenvalues $\lambda_i$ of $P$, where $P$ is the table of unknown joint probabilities.

Lebart [32] and O'Neill [34],[35],[36] proved the following result: if $\mu_i = 0$ then $\lambda_i$ has the same distribution as the corresponding eigenvalues of a Wishart matrix with $(m_1-1)(m_2-1)$ degrees of freedom: $W_{(m_1-1)(m_2-1)}(r, I)$, where $r = \min(m_1-1, m_2-1)$. If $\mu_j \neq 0$ then $\lambda_j$ is asymptotically normally distributed, but with parameters depending on the unknown $p_{ij}$.

Since it is difficult to test this hypothesis, some authors have proposed a bootstrap approach, which unfortunately is not valid: since the empirical eigenvalues, on which the replication is based, are generally not null, we cannot observe the distribution based on the Wishart matrix.

3.2. Malinvaud's test

The test is based upon the reconstitution formula, which is a weighted singular value decomposition of $N$:

$$n_{ij} = \frac{n_{i.}\, n_{.j}}{n}\left(1 + \sum_{k} \frac{a_{ik}\, b_{jk}}{\sqrt{\lambda_k}}\right),$$

where $a_{ik}$, $b_{jk}$ are the factorial components associated to the row and column profiles. We may use a chi-square test comparing the observed $n_{ij}$ from a sample of size $n$ to the expected frequencies under the null hypothesis $H_k$ of only $k$ non-zero $\mu_i$. The weighted least squares estimates of these expectations are precisely the $\tilde{n}_{ij}$ of the reconstitution formula with its first $k$ terms. We then compute the classical chi-square goodness-of-fit statistic:

$$Q_k = \sum_i \sum_j \frac{(n_{ij} - \tilde{n}_{ij})^2}{\tilde{n}_{ij}}.$$

If $k = 0$ (independence), $Q_0$ is compared to a chi-square with $(m_1-1)(m_2-1)$ degrees of freedom. Under $H_k$, $Q_k$ is asymptotically distributed like a chi-square with $(m_1-k-1)(m_2-k-1)$ degrees of freedom. However $Q_k$ suffers from the following drawback: if $n_{ij}$ is small, $\tilde{n}_{ij}$ can be negative and the test statistic cannot be used. This is not the case for the modification proposed by E. Malinvaud [37], who proposed to use $\frac{n_{i.}\, n_{.j}}{n}$ instead of $\tilde{n}_{ij}$ for the denominator. Furthermore, this leads to a simple relation with the sum of the discarded eigenvalues:

$$Q'_k = \sum_i \sum_j \frac{(n_{ij} - \tilde{n}_{ij})^2}{n_{i.}\, n_{.j}/n} = n\,(\lambda_{k+1} + \lambda_{k+2} + \cdots + \lambda_r).$$

$Q'_k$ is also asymptotically distributed like a chi-square with $(m_1-k-1)(m_2-k-1)$ degrees of freedom.

4. BEHAVIOUR OF EIGENVALUES IN MCA UNDER MODELING HYPOTHESES

Let $X = (X_1 | X_2 | \cdots | X_p)$ be a disjunctive table of $p$ categorical variables $X_i$ (with respectively $m_i$ modalities) observed on a sample of $n$ individuals, and $q$ the number of non-trivial eigenvalues (as defined in 2.1).

Multiple Correspondence Analysis is the CA of the disjunctive table $X$. The rank of $X$ is $\mathrm{rank}(X) = \min(q+1;\ n)$, so it equals $q+1$ if $n > q+1$. We suppose, without loss of generality, that $n$ is large enough, which is the usual case.

CA factors are the eigenvectors of the matrix $\frac{1}{p} D^{-1}B$ (where $B$ and $D$ are defined in 2.1). The diagonal elements of $D^{-1}B$ are all equal to 1, so its trace is:

$$\mathrm{Tr}(D^{-1}B) = \sum_{i=1}^{p} m_i \quad \text{and} \quad \frac{1}{p}\mathrm{Tr}(D^{-1}B) = \frac{1}{p}\sum_{i=1}^{p} m_i.$$

Since, once the trivial eigenvalue 1 is removed, $\sum_{i=1}^{q} \mu_i = \frac{1}{p}\sum_{i=1}^{p} m_i - 1 = \frac{q}{p}$, we can conclude that

(2) $\quad \dfrac{1}{q}\sum_{i=1}^{q} \mu_i = \dfrac{1}{p}$
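Malinvaud's statistic of Section 3.2 only needs the CA eigenvalues of the contingency table. The sketch below is an illustration under stated assumptions: it relies on NumPy, obtains the eigenvalues as squared singular values of the matrix of standardized residuals, and uses a small hypothetical $3 \times 3$ table. The function names are not from the paper, and the comparison with a chi-square quantile is left out.

```python
import numpy as np

def ca_eigenvalues(N):
    """Non-trivial CA eigenvalues of a contingency table N (2-D array of counts)."""
    N = np.asarray(N, dtype=float)
    n = N.sum()
    r = N.sum(axis=1)
    c = N.sum(axis=0)
    expected = np.outer(r, c) / n ** 2          # products of the marginal masses
    # Standardized residuals; their squared singular values are the eigenvalues.
    S = (N / n - expected) / np.sqrt(expected)
    sv = np.linalg.svd(S, compute_uv=False)
    rank = min(N.shape) - 1                     # at most min(m1, m2) - 1 eigenvalues
    return sv[:rank] ** 2

def malinvaud_statistic(N, k):
    """Q'_k = n * (sum of the eigenvalues beyond the first k) and its degrees of freedom."""
    N = np.asarray(N, dtype=float)
    lam = ca_eigenvalues(N)
    q_prime = N.sum() * lam[k:].sum()
    dof = (N.shape[0] - k - 1) * (N.shape[1] - k - 1)
    return q_prime, dof

# Hypothetical contingency table.
N_example = np.array([[20, 10, 5],
                      [10, 20, 10],
                      [5, 10, 30]])
for k in range(2):
    q_prime, dof = malinvaud_statistic(N_example, k)
    print(f"k = {k}: Q'_k = {q_prime:.3f} with {dof} degrees of freedom")
```

For $k = 0$ the statistic coincides with the usual Pearson chi-square of independence, which gives a quick sanity check of the eigenvalue computation.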

and

(3) $\quad \sum_{i=1}^{q} \mu_i^2 = \dfrac{1}{p^2}\sum_{i=1}^{p}(m_i - 1) + \dfrac{1}{p^2}\sum_{i \neq j} \varphi_{ij}^2$

where $\varphi_{ij}^2$ is the observed Pearson's $\varphi^2$ crossing $X_i$ with $X_j$, and, for a two-way table with counts $n_{ij}$,

$$\varphi^2 = \frac{1}{n}\sum_i \sum_j \frac{\left(n_{ij} - \frac{n_{i.}\, n_{.j}}{n}\right)^2}{\frac{n_{i.}\, n_{.j}}{n}} = \frac{\chi^2}{n}$$

($n_{i.}$ and $n_{.j}$ are the marginal counts).

Although MCA is an extension of CA, the results of Section 3 are not valid and one cannot use Malinvaud's test: the elements of $X$ being 0 or 1 and not frequencies, $Q_k$ and $Q'_k$ do not follow a chi-square distribution. However it is possible to get information about the dispersion of the $q$ eigenvalues in particular cases [5].

4.1. Distribution of eigenvalues in MCA under the independence hypothesis

Under the hypothesis of pairwise independence of the variables $X_i$, all $\varphi_{ij}^2 = 0$ and equation (3) becomes

$$\sum_{i=1}^{q} \mu_i^2 = \frac{1}{p^2}\sum_{i=1}^{p}(m_i - 1).$$

Using the definition (1) of $q$ we get

$$\sum_{i=1}^{q} \mu_i^2 = \frac{q}{p^2},$$

and finally, by (2):

$$\frac{1}{q}\sum_{i=1}^{q} \mu_i^2 = \frac{1}{p^2} = \left[\frac{1}{q}\sum_{i=1}^{q} \mu_i\right]^2.$$

Since the mean of the squared $\mu_i$ equals their squared mean only if all the terms are equal, we can conclude that all the eigenvalues have the same value, so that:

$$\mu_i = \frac{1}{p}, \quad \forall i.$$

We conclude that the theoretical MCA (i.e. for the population), under the hypothesis of pairwise independence of the variables $X_i$, leads to one $q$-multiple non-trivial non-zero eigenvalue $\lambda = \frac{1}{p}$, and the eigenvalue diagram has the particular shape shown in Figure 1.

[Figure 1: Theoretical eigenvalues diagram in the independence case. Bar chart of $\lambda_1, \lambda_2, \dots, \lambda_q$ with all bars of equal length.]

This result is not true when we have a finite sample, since sampling fluctuations make the observed $\varphi_{ij}^2 \neq 0$. Although the trace of $\frac{1}{p}(D^{-1}B)$ and $\bar{\mu}$, the mean of the observed non-trivial eigenvalues, are constants, we observe $q$ different non-trivial eigenvalues $\mu_i \neq \frac{1}{p}$, and the eigenvalue diagram takes the shape shown in Figure 2.

[Figure 2: Observed eigenvalues diagram in the independence case. Bar chart of $\lambda_1, \lambda_2, \dots, \lambda_q$ with slowly decreasing bar lengths.]

4.1.1. Dispersion of eigenvalues

Let $S_\mu^2$ be the measure of dispersion of the $\mu_i$ around $\frac{1}{p}$, given by:

$$S_\mu^2 = \frac{1}{q}\sum_{i=1}^{q}\left(\mu_i - \frac{1}{p}\right)^2 = \frac{1}{q}\sum_{i=1}^{q}\mu_i^2 - \frac{1}{p^2},$$

which implies

$$\sum_{i=1}^{q}\mu_i^2 = q\left(S_\mu^2 + \frac{1}{p^2}\right).$$

Using equations (1) and (3), we have:

$$\sum_{i=1}^{q}\mu_i^2 = \frac{q}{p^2} + \frac{1}{p^2}\sum_{i \neq j}\varphi_{ij}^2 = \frac{q}{p^2} + \frac{1}{n\,p^2}\sum_{i \neq j}\chi_{ij}^2.$$
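Identities (2) and (3) hold for any sample in which every category is observed, so they can be checked numerically. The following sketch is an illustration only: it assumes NumPy, simulates three hypothetical independent variables, and computes the eigenvalues from the symmetric matrix $\frac{1}{p} D^{-1/2} B D^{-1/2}$, which has the same eigenvalues as $\frac{1}{p} D^{-1}B$. All names are hypothetical helpers, not the authors' code.

```python
import numpy as np

def indicator_matrix(data, n_levels):
    """Complete disjunctive table X for integer-coded data (n rows, p columns)."""
    blocks = [np.eye(m)[data[:, i]] for i, m in enumerate(n_levels)]
    return np.hstack(blocks)

def mca_eigenvalues(X, p, q):
    """The q non-trivial eigenvalues of (1/p) D^{-1} B, with B = X'X and D = Diag(B)."""
    B = X.T @ X
    d = np.sqrt(np.diag(B))
    M = B / np.outer(d, d) / p               # symmetric, same eigenvalues as D^{-1}B / p
    eig = np.sort(np.linalg.eigvalsh(M))[::-1]
    return eig[1:q + 1]                      # drop the trivial eigenvalue 1

def phi2(Xi, Xj, n):
    """Pearson's phi^2 = chi^2 / n for the table crossing two indicator blocks."""
    N = Xi.T @ Xj
    E = np.outer(N.sum(axis=1), N.sum(axis=0)) / n
    return ((N - E) ** 2 / E).sum() / n

rng = np.random.default_rng(0)
n, n_levels = 500, [3, 4, 2]                 # hypothetical sample size and modalities
p, q = len(n_levels), sum(n_levels) - len(n_levels)
data = np.column_stack([rng.integers(0, m, size=n) for m in n_levels])
X = indicator_matrix(data, n_levels)
mu = mca_eigenvalues(X, p, q)

starts = np.concatenate(([0], np.cumsum(n_levels)))
blocks = [X[:, starts[i]:starts[i + 1]] for i in range(p)]
sum_phi2 = sum(phi2(blocks[i], blocks[j], n)
               for i in range(p) for j in range(p) if i != j)

lhs = (mu ** 2).sum()                                      # left-hand side of (3)
rhs = (sum(m - 1 for m in n_levels) + sum_phi2) / p ** 2   # right-hand side of (3)
print("mean eigenvalue:", mu.mean(), " 1/p:", 1 / p)
print("sum of squares: ", lhs, " identity (3):", rhs)
```

Note that the sum over $i \neq j$ runs over ordered pairs, so each pair of variables is counted twice.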

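Under the hypothesis of pairwise independence of the variables, the $\chi_{ij}^2$ are realizations of $\chi^2_{(m_i-1)(m_j-1)}$ variables, so their expected values are $(m_i-1)(m_j-1)$. We can then easily compute $E\left(\sum_{i=1}^{q}\mu_i^2\right)$, and get:

$$E\left(\sum_{i=1}^{q}\mu_i^2\right) = \frac{q}{p^2} + \frac{1}{n\,p^2}\sum_{i \neq j}(m_i-1)(m_j-1).$$

Finally:

$$E(S_\mu^2) = \frac{1}{q}\,E\left(\sum_{i=1}^{q}\mu_i^2\right) - \frac{1}{p^2}$$

and we obtain:

$$E(S_\mu^2) = \frac{1}{n\,p^2\,q}\sum_{i \neq j}(m_i-1)(m_j-1).$$

Now, since $E(S_\mu^2) = \sigma^2$, we may assume that $\frac{1}{p} \pm 2\sigma$ contains roughly 95% of the eigenvalues. Moreover, since the kurtosis of the set of eigenvalues is lower than for a normal distribution, this proportion is actually probably larger than 95%.

4.1.2. Estimation of the Burt table

Let $X$ be the disjunctive table associated to $p$ categorical variables $X_i$, with $m_i$ modalities respectively, observed on a sample of $n$ individuals, where $X_i = (X_{1i}, X_{2i}, \dots, X_{m_i i})$. $X$ is a matrix made of $p$ blocks $X_i$:

$$X = (X_1 | X_2 | \cdots | X_i | \cdots | X_p).$$

Let $(X^j_{1i}, X^j_{2i}, \dots, X^j_{m_i i})$ be the observed value of $X_i$ on the $j$-th individual ($X^j_{ki}$ is the indicator of the $k$-th modality of the $i$-th variable). We can write

$$X = \begin{pmatrix}
X^1_{11} & \cdots & X^1_{m_1 1} & X^1_{12} & \cdots & X^1_{m_2 2} & \cdots & X^1_{1p} & \cdots & X^1_{m_p p} \\
X^2_{11} & \cdots & X^2_{m_1 1} & X^2_{12} & \cdots & X^2_{m_2 2} & \cdots & X^2_{1p} & \cdots & X^2_{m_p p} \\
\vdots & & \vdots & \vdots & & \vdots & & \vdots & & \vdots \\
X^n_{11} & \cdots & X^n_{m_1 1} & X^n_{12} & \cdots & X^n_{m_2 2} & \cdots & X^n_{1p} & \cdots & X^n_{m_p p}
\end{pmatrix}.$$

The Burt table of $X$ is then

$$B = \begin{pmatrix}
X_1'X_1 & X_1'X_2 & \cdots & X_1'X_p \\
X_2'X_1 & X_2'X_2 & \cdots & X_2'X_p \\
\vdots & \vdots & & \vdots \\
X_p'X_1 & X_p'X_2 & \cdots & X_p'X_p
\end{pmatrix}
= \begin{pmatrix}
B_{11} & B_{12} & \cdots & B_{1p} \\
B_{21} & B_{22} & \cdots & B_{2p} \\
\vdots & \vdots & & \vdots \\
B_{p1} & B_{p2} & \cdots & B_{pp}
\end{pmatrix},$$

where

$$B_i = B_{ii} = X_i'X_i = \begin{pmatrix}
\sum_{j=1}^{n}(X^j_{1i})^2 & \sum_{j=1}^{n}X^j_{1i}X^j_{2i} & \cdots & \sum_{j=1}^{n}X^j_{1i}X^j_{m_i i} \\
\sum_{j=1}^{n}X^j_{2i}X^j_{1i} & \sum_{j=1}^{n}(X^j_{2i})^2 & \cdots & \sum_{j=1}^{n}X^j_{2i}X^j_{m_i i} \\
\vdots & \vdots & & \vdots \\
\sum_{j=1}^{n}X^j_{m_i i}X^j_{1i} & \sum_{j=1}^{n}X^j_{m_i i}X^j_{2i} & \cdots & \sum_{j=1}^{n}(X^j_{m_i i})^2
\end{pmatrix}$$

and $X^j_{ki} \in \{0, 1\}$ with $\sum_{k=1}^{m_i} X^j_{ki} = 1$.

Since there is only one $k$ in $\{1, \dots, m_i\}$ such that $X^j_{ki} = 1$, all the others being zero, we obtain:

$$(X^j_{ki})^2 = X^j_{ki} \quad \forall j \in \{1, \dots, n\},\ \forall k \in \{1, \dots, m_i\}$$

and

$$X^j_{ki}\, X^j_{k'i} = 0 \quad \forall k \neq k' \in \{1, \dots, m_i\}.$$

And so we can conclude that, $\forall i = 1, \dots, p$, the diagonal sub-matrices of the Burt table are themselves diagonal matrices:

$$X_i'X_i = B_i = \mathrm{diag}\left(\sum_{j=1}^{n}(X^j_{1i})^2,\ \dots,\ \sum_{j=1}^{n}(X^j_{ki})^2,\ \dots,\ \sum_{j=1}^{n}(X^j_{m_i i})^2\right).$$

Furthermore, we know that

$$\sum_{j=1}^{n} X^j_{ki} = n_i^k \quad \text{and} \quad \sum_{k=1}^{m_i}\sum_{j=1}^{n} X^j_{ki} = \sum_{k=1}^{m_i} n_i^k = n,$$

where $n_i^k$ is the number of individuals that have the $k$-th modality of the $i$-th variable (for $1 \le i \le p$ and $1 \le k \le m_i$).

So the diagonal sub-matrices of the Burt table are:

$$B_i = B_{ii} = \mathrm{diag}\left(n_i^1, \dots, n_i^k, \dots, n_i^{m_i}\right), \quad \text{where } \sum_{k=1}^{m_i} n_i^k = n, \quad \forall i = 1, \dots, p.$$

Consider now two independent variables $X_\alpha$ and $X_\beta$ amongst the $p$ variables, having respectively $m_\alpha$ and $m_\beta$ modalities. Let $B_\alpha$ be the $(m_\alpha, m_\alpha)$ square matrix $B_\alpha = X_\alpha'X_\alpha$, and $B_{\alpha\beta}$ the $(m_\alpha, m_\beta)$ rectangular matrix $B_{\alpha\beta} = X_\alpha'X_\beta$. We have

$$(B_\alpha)_{ii} = \sum_{k=1}^{n} X^k_{i\alpha} = n_\alpha^i \quad \text{and} \quad (B_\alpha)_{ij} = 0 \ \text{ if } i \neq j,$$

and

$$(B_{\alpha\beta})_{ij} = \sum_{k=1}^{n} X^k_{i\alpha}\, X^k_{j\beta}.$$

Under the hypothesis that $X_\alpha$ and $X_\beta$ are independent,

$$(B_{\alpha\beta})_{ij} = \frac{(B_\alpha)_{ii}\,(B_\beta)_{jj}}{n} = \frac{n_\alpha^i\, n_\beta^j}{n},$$

and, more generally, we can conclude that

$$X_i'X_j = B_{ij} = \left(\frac{n_i^k\, n_j^l}{n}\right)_{k = 1, \dots, m_i;\ l = 1, \dots, m_j}$$

if the $p$ variables are mutually independent.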
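The block structure described above can be illustrated on a toy sample. The sketch below is a hypothetical example in plain Python (the two integer-coded columns and the helper names are assumptions): it builds one diagonal block and one off-diagonal block of the Burt table, together with the block $n_i^k n_j^l / n$ used under independence.

```python
def burt_block(col_a, col_b, m_a, m_b):
    """Block X_a' X_b of the Burt table for two integer-coded columns."""
    block = [[0] * m_b for _ in range(m_a)]
    for a, b in zip(col_a, col_b):
        block[a][b] += 1          # individual has modality a of X_a and b of X_b
    return block

def independence_block(col_a, col_b, m_a, m_b):
    """Block with entries n_a^k * n_b^l / n, used when X_a and X_b are independent."""
    n = len(col_a)
    count_a = [col_a.count(k) for k in range(m_a)]
    count_b = [col_b.count(l) for l in range(m_b)]
    return [[count_a[k] * count_b[l] / n for l in range(m_b)]
            for k in range(m_a)]

# Hypothetical sample of n = 8 individuals: X1 has 3 modalities, X2 has 2.
x1 = [0, 0, 1, 2, 1, 0, 2, 1]
x2 = [0, 1, 1, 0, 0, 1, 1, 0]

b11 = burt_block(x1, x1, 3, 3)    # diagonal block: category counts on the diagonal
b12 = burt_block(x1, x2, 3, 2)    # observed contingency table crossing X1 and X2
e12 = independence_block(x1, x2, 3, 2)

print(b11)
print(b12)
print(e12)
```

The observed block and the independence block share the same margins; they differ only through the association between the two variables.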

Now consider a sample of $n$ observations of $p$ multinomial random variables $X_i$. Let $p_i^k = p_{ik}$ be the probability that an individual is in the $k$-th category of the $i$-th variable, and $p_{ij}^k$ be the probability that the $j$-th individual is in the $k$-th category of the $i$-th variable.

The observed Burt table is:

$$B = X'X = \begin{pmatrix}
X_1'X_1 & X_1'X_2 & \cdots & X_1'X_p \\
X_2'X_1 & X_2'X_2 & \cdots & X_2'X_p \\
\vdots & \vdots & & \vdots \\
X_p'X_1 & X_p'X_2 & \cdots & X_p'X_p
\end{pmatrix},$$

with

$$X_i'X_i = N_i = \mathrm{diag}\left(\sum_{j=1}^{n}(X^j_{1i})^2,\ \dots,\ \sum_{j=1}^{n}(X^j_{ki})^2,\ \dots,\ \sum_{j=1}^{n}(X^j_{m_i i})^2\right) = \mathrm{diag}\{n_i^1, \dots, n_i^{m_i}\}.$$

But, in theory, $n_i^k = \sum_{j=1}^{n}(X^j_{ki})^2 = n\,p_i^k$ and $\sum_{k=1}^{m_i} p_i^k = 1$, so that

$$\sum_{k=1}^{m_i} n_i^k = n\sum_{k=1}^{m_i} p_i^k = n, \quad \forall i = 1, \dots, p,$$

and

$$X_i'X_i = n\,\mathrm{diag}\left(p_i^1, \dots, p_i^k, \dots, p_i^{m_i}\right).$$

Since $X_i$ and $X_j$ are independent variables, $X_i'X_j = N_{ij}$ and $(N_{ij})_{kk'} = (X_i'X_j)_{kk'} = n\,p_i^k\,p_j^{k'}$, which implies

$$X_i'X_j = N_{ij} = n\begin{pmatrix}
p_i^1 p_j^1 & p_i^1 p_j^2 & \cdots & p_i^1 p_j^{m_j} \\
p_i^2 p_j^1 & p_i^2 p_j^2 & \cdots & p_i^2 p_j^{m_j} \\
\vdots & \vdots & & \vdots \\
p_i^{m_i} p_j^1 & p_i^{m_i} p_j^2 & \cdots & p_i^{m_i} p_j^{m_j}
\end{pmatrix}.$$

The maximum-likelihood estimator of $p_i^k$ is $\hat{p}_i^k = \dfrac{n_i^k}{n}$, so

$$\hat{N}_i = \mathrm{diag}\left(n_i^1, \dots, n_i^k, \dots, n_i^{m_i}\right) = B_{ii}$$

and

$$\hat{N}_{ij} = \frac{1}{n}\begin{pmatrix}
n_i^1 n_j^1 & n_i^1 n_j^2 & \cdots & n_i^1 n_j^{m_j} \\
n_i^2 n_j^1 & n_i^2 n_j^2 & \cdots & n_i^2 n_j^{m_j} \\
\vdots & \vdots & & \vdots \\
n_i^{m_i} n_j^1 & n_i^{m_i} n_j^2 & \cdots & n_i^{m_i} n_j^{m_j}
\end{pmatrix} = B_{ij}.$$

We can conclude that the maximum-likelihood estimator $\hat{B}$ of the theoretical Burt table is $B$, the observed one. Using the functional invariance property, we can affirm that the maximum-likelihood estimators of the eigenvalues of the theoretical $D^{-1}B$ are the eigenvalues of the observed $D^{-1}B$, so that each $\mu_i$ is the maximum-likelihood estimator of $\lambda_i = \lambda$.

Maximum-likelihood estimators are asymptotically normal, and so, asymptotically, each $\mu_i$ is normally distributed. But due to the fact that eigenvalues are ordered, the eigenvalues are not independently and identically distributed. However:

$$E(\mu_1) > \frac{1}{p}, \quad E(\mu_q) < \frac{1}{p}, \quad \text{but} \quad E(\mu_1) \to \frac{1}{p} \ \text{ and } \ E(\mu_q) \to \frac{1}{p}.$$

Furthermore the eigenvalue variances are not the same. From simulations of large samples of observations ($n = 100, \dots$), we notice that the convergence of the eigenvalue distribution to a normal one is slow, especially for the extremes ($\mu_1$ and $\mu_q$), even for very large samples [4].

4.2. Distribution of eigenvalues in MCA under non-independence hypotheses

4.2.1. Distribution of the theoretical eigenvalues

Let $\mu$ be an eigenvalue of $\frac{1}{p} D^{-1}X'X$. Since $\mu$ can also be obtained by diagonalization of $\frac{1}{p} X D^{-1} X'$, $\mu$ is a solution of

$$\frac{1}{p}\, X D^{-1} X'\, z = \mu\, z,$$

where $z$ is an eigenvector associated to $\mu$.

So

$$\frac{1}{p}\sum_{i=1}^{p} X_i (X_i'X_i)^{-1} X_i'\, z = \mu\, z, \quad \text{that is} \quad \frac{1}{p}\sum_{i=1}^{p} P_i\, z = \mu\, z,$$

where $P_i = X_i (X_i'X_i)^{-1} X_i'$ is the orthogonal projector on the space spanned by linear combinations of the indicators of the categories of $X_i$.

Let $A_i$ be the centered projector associated to $P_i$:

$$A_i = P_i - \frac{1}{n}\,\mathbf{1}\mathbf{1}',$$

where $\mathbf{1}$ is the vector of $n$ ones. And so we get

(4) $\quad \dfrac{1}{p}\sum_{i=1}^{p} A_i\, z = \mu\, z$

4.2.1.1. The case of two-way interactions

Let us assume that among the $p$ studied variables there is a two-way interaction between $X_j$ and $X_k$, and that the $(p-2)$ remaining variables are mutually independent. Multiplying equation (4) by $A_j$ we get:

$$\frac{1}{p}\Big(\underbrace{A_j A_1}_{0} + \cdots + \underbrace{A_j A_j}_{A_j} + \cdots + A_j A_k + \cdots + \underbrace{A_j A_p}_{0}\Big)\, z = \mu\, A_j z,$$

since all variables are pairwise independent except $X_j$ and $X_k$, and the $A_i$ are orthogonal projectors. Thus:

(5) $\quad A_j A_k\, z = (p\mu - 1)\, A_j z$

Similarly, multiplying (4) by $A_k$, we get:

(6) $\quad A_k A_j\, z = (p\mu - 1)\, A_k z$

Now let us multiply (5) by $A_k$ to get:

$$A_k A_j A_k\, z = (p\mu - 1)\, A_k A_j\, z.$$

Using (6) we obtain

$$A_k A_j \underbrace{A_k z}_{z_1} = (p\mu - 1)^2\, \underbrace{A_k z}_{z_1}.$$

With the notation $\lambda = (p\mu - 1)^2$, we finally write:

(7) $\quad A_k A_j\, z_1 = \lambda\, z_1$

Equation (7) implies that $\lambda$ is an eigenvalue of the product of the centered projectors $A_k A_j$, associated to the eigenvector $z_1$.

In general: $\forall j, k = 1, \dots, p$, if there is an interaction between $X_j$ and $X_k$, the product of projectors $A_j A_k$ admits a non-zero eigenvalue $\lambda = (p\mu - 1)^2$.

If $\lambda \neq 0$ then $\mu \neq \frac{1}{p}$; the trace of the Burt table being constant, there is at least another eigenvalue not equal to $\frac{1}{p}$. Let $n_0$ be the number of eigenvalues not equal to $\frac{1}{p}$, so that $\sum_{i=1}^{n_0} \mu_i = \frac{n_0}{p}$.

Theoretically (except for the particular case where $\lambda = 1$, for which we have $\mu = \frac{2}{p}$ and $\mu = 0$), the number of non-trivial eigenvalues greater than $\frac{1}{p}$ is equal to the number of non-trivial eigenvalues smaller than $\frac{1}{p}$. The eigenvalue diagram shape is shown in Figure 3.

[Figure 3: Theoretical eigenvalues diagram in the two-way interaction case. Bar chart of $\lambda_1, \lambda_2, \dots, \lambda_q$: a few longer bars, a group of equal bars, and a few shorter bars.]

The number $n_0$ depends on the number of categories of $X_j$ and $X_k$, on the number of variables and on the number of dependent variables.

Let $n_1$ be the multiplicity of $\frac{1}{p}$. We will show that $n_1 = q - 2\min\big((m_j - 1);\ (m_k - 1)\big)$ when $p > 2$ and when there is only one two-way interaction between the variables. This result can be shown as follows.

Let us consider equation (4), and suppose, without loss of generality, that $X_1$ and $X_2$ are dependent. Upon multiplication by $A_3$, $\frac{1}{p}\sum_{i=1}^{p} A_i z = \mu z$ becomes

$$\frac{1}{p}\,(A_3 A_1 + A_3 A_2 + A_3 A_3 + \cdots + A_3 A_p)\, z = \mu\, A_3 z.$$

Since $A_3 A_i = 0$ for $i \neq 3$ and $A_3 A_3 = A_3$, this reduces to $\frac{1}{p} A_3 z = \mu A_3 z$, and we get $\mu = \frac{1}{p}$.

Now multiply equation (4) by $A_1$ and $A_2$ in turn to get:

$$(A_1 A_1 + A_1 A_2 + A_1 A_3 + \cdots + A_1 A_p)\, z = p\mu\, A_1 z$$
$$(A_2 A_1 + A_2 A_2 + A_2 A_3 + \cdots + A_2 A_p)\, z = p\mu\, A_2 z$$

$$\Longrightarrow \begin{cases} (A_1 + A_1 A_2)\, z = p\mu\, A_1 z \\ (A_2 A_1 + A_2)\, z = p\mu\, A_2 z \end{cases}
\Longrightarrow \begin{cases} A_1 A_2\, b = (p\mu - 1)\, a \\ A_2 A_1\, a = (p\mu - 1)\, b \end{cases}$$

where $\lambda = (p\mu - 1)^2$, $a = A_1 z$ and $b = A_2 z$.

We recognize here the CA equations, so that the CA of the Burt table, when only two variables are dependent, is equivalent to the CA of the contingency table crossing the two dependent variables. It is well known that the number of eigenvalues in this CA equals $\min\big((m_j - 1);\ (m_k - 1)\big)$, and to every non-trivial $\lambda_i$ there correspond the values $\mu_i$ and $\mu'_i$ such that:

$$\mu_i = \frac{1 + \sqrt{\lambda_i}}{p} \quad \text{and} \quad \mu'_i = \frac{1 - \sqrt{\lambda_i}}{p}.$$

Finally, the CA of the Burt table may have $2\min\big((m_j - 1);\ (m_k - 1)\big)$ eigenvalues that are non-trivial and not equal to $\frac{1}{p}$, associated to the CA of the contingency table. So the number of supplementary eigenvalues equals $q - 2\min\big((m_j - 1);\ (m_k - 1)\big)$. There is, in addition, one $n_1$-multiple eigenvalue, where $n_1$ is at least $q - 2\min\big((m_j - 1);\ (m_k - 1)\big)$.

4.2.1.2. The case of higher order interactions

Since the Burt table is constructed with pairwise cross products of variables, its observation cannot give us information about multiway interactions. However, the observation of the bi-dimensional theoretical Burt sub-tables, for all pairwise variable combinations, can provide us with all the two-way interactions.

The theoretical Burt table can reveal the existence of higher order interactions in the following case: if $B_{ij} \neq \frac{1}{n} B_{ii}\, \mathbf{1}_{m_i m_j}\, B_{jj}$ and $B_{ik} \neq \frac{1}{n} B_{ii}\, \mathbf{1}_{m_i m_k}\, B_{kk}$, there may be a triple interaction between $X_i$, $X_j$ and $X_k$.

In general, a Burt table gives neither the order of the interactions nor supplementary information on the eigenvalue behavior.
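The result of Section 4.2.1.1 can be checked numerically on a theoretical Burt table. The sketch below is an illustration under stated assumptions: it uses NumPy, a hypothetical joint distribution for a dependent pair $(X_1, X_2)$ with three modalities each, and a third variable $X_3$ with two modalities that is independent of the other two, so that $p = 3$ and $q = 5$. The Burt table is written up to the factor $n$, which does not change the eigenvalues.

```python
import numpy as np

P12 = np.array([[0.20, 0.05, 0.05],
                [0.05, 0.20, 0.10],
                [0.05, 0.05, 0.25]])      # hypothetical joint law of (X1, X2)
r, c = P12.sum(axis=1), P12.sum(axis=0)   # margins of X1 and X2
t = np.array([0.4, 0.6])                  # margin of X3, independent of X1 and X2
p = 3
m = [len(r), len(c), len(t)]
q = sum(m) - p

# Theoretical Burt table divided by n, block by block.
B = np.block([
    [np.diag(r),     P12,            np.outer(r, t)],
    [P12.T,          np.diag(c),     np.outer(c, t)],
    [np.outer(t, r), np.outer(t, c), np.diag(t)],
])
d = np.sqrt(np.diag(B))
eig = np.sort(np.linalg.eigvalsh(B / np.outer(d, d) / p))[::-1]
mu = eig[1:q + 1]                         # the q non-trivial eigenvalues

# CA of the table crossing X1 and X2: singular values of the standardized table.
sv = np.linalg.svd(P12 / np.sqrt(np.outer(r, c)), compute_uv=False)
root_lam = sv[1:]                         # drop the trivial singular value 1
expected = np.sort(np.concatenate([(1 + root_lam) / p, (1 - root_lam) / p]))

at_one_over_p = np.isclose(mu, 1 / p)
n1 = int(at_one_over_p.sum())             # multiplicity of 1/p
others = np.sort(mu[~at_one_over_p])

print("non-trivial eigenvalues:", mu)
print("multiplicity of 1/p:", n1)
print("(1 +/- sqrt(lambda_i)) / p:", expected)
```

With these dimensions the multiplicity of $\frac{1}{p}$ should equal $q - 2\min\big((m_1 - 1);\ (m_2 - 1)\big) = 1$, and the remaining four eigenvalues should come in pairs symmetric around $\frac{1}{p}$.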

4.2.2. Distribution of observed eigenvalues

Exceptionally, with a small number of interactions, we observe the particular shape of the eigenvalue diagram exhibited in Figure 4, where we can distinguish eigenvalues near $\frac{1}{p}$ (theoretically equal to $\frac{1}{p}$), and so we are able to recognize the existence of independent variables in the analysis.

[Figure 4: Observed eigenvalues diagram in a two-way interaction case. Bar chart of $\lambda_1, \lambda_2, \dots, \lambda_q$: a few longer bars, a group of bars of similar length, and a few shorter bars.]

When the number of interactions grows, we cannot distinguish the eigenvalues theoretically equal to $\frac{1}{p}$ from the eigenvalues not equal to $\frac{1}{p}$.

To detect the existence of interactions, we can first check if the observed variables are mutually independent. In that case, the eigenvalue distribution diagram should have a particular shape (see 4.1), with more than 95% of the eigenvalues within the confidence interval $\frac{1}{p} \pm 2\sigma$ (see 4.1.1). If there are one or more eigenvalues outside the confidence interval, we can then assume the existence of one or more two-way interactions between variables.

5. AN EMPIRICAL PROCEDURE FOR FITTING LOG-LINEAR MODELS BASED ON THE MCA EIGENVALUE DIAGRAM

We propose an empirical procedure for progressively fitting a log-linear model where the fitting test at each step is based on the MCA eigenvalue diagram.

Let $X_i$, $X_j$ and $X_k$ be three categorical variables, with respectively $m_i$, $m_j$ and $m_k$ modalities, and let $X_{ij}$ be a cross variable with $(m_i\, m_j)$ modalities. We suppose that $X_{ij}$ and $X_k$ have the same behavior if $m_k = m_i\, m_j$.

Under the hypothesis that two dependent variables $X_i$ and $X_j$ have the same behaviour as a variable $X_k$ with the same characteristics as the cross variable $X_{ij}$, we propose here an empirical procedure for progressively fitting, in $p$ steps, the log-linear model, where the fitting criterion at each step is based on the MCA eigenvalue diagram.

5.1. Description of the procedure steps

The first step of the procedure consists in testing the pairwise independence hypothesis of the variables. To detect the existence of interactions, we must first check if all variables are mutually independent. For that matter, we calculate the eigenvalues of the MCA on all the $p$ variables, and construct the related confidence interval: the eigenvalue distribution diagram should have a particular shape (cf. 4.1).

If all the eigenvalues belong to the confidence interval $\frac{1}{p} \pm 2\sigma$ (cf. 4.1.1), we can conclude that the $p$ variables are mutually independent. The log-linear model associated to the variables is a simple additive one:

$$\log[f_p(x)] = u_0 + \sum_{i=1}^{p} u_i(x),$$

and the procedure is stopped.

If one or more eigenvalues are not in the confidence interval, we conclude that there is at least one two-way interaction between variables, and we go to the second step of the procedure.

In the second step, we look at all two-way interaction u-terms. We isolate one variable amongst the $p$ variables, which we denote $X_p$ without loss of generality, so we obtain a set of $(p-1)$ variables $X_i$, and apply the first step to test the pairwise independence of the $(p-1)$ variables.

If the $(p-1)$ variables are independent, we can conclude that the two-way interactions are between $X_p$ and at least one of the $X_i$, so we construct the corresponding cross variables $X_{ip}$ by using the first step to test independence between the variables $(X_i, X_p)$, where $i = 1, \dots, p-1$. The log-linear model associated to the variables is:

$$\log[f_p(x)] = u_0 + \sum_{i=1}^{p} u_i(x) + \sum_{i=1}^{p-1} u_{ip}(x)\, \delta_{ip},$$

and the procedure is stopped (with $\delta_{ip} = 1$ if the interaction between $X_p$ and $X_i$ exists; otherwise it is set to zero).

If the $(p-1)$ variables are not independent, we can conclude that there is a two-way interaction between $X_i$ and $X_j$, where $i, j = 1, \dots, p-1$, and perhaps between $X_i$ and $X_p$.

We can construct the corresponding cross variables $X_{ip}$ and $X_{ij}$ by using the first step to test the independence of the variables $(X_i, X_p)$ and of the variables $(X_i, X_j)$, where $i, j = 1, \dots, p-1$. The log-linear model associated to the variables is:

$$\log[f_p(x)] = u_0 + \sum_{i=1}^{p} u_i(x) + \sum_{i=1}^{p-1} u_{ip}(x)\, \delta_{ip} + \text{terms due to the interaction between three or more variables},$$

and we go to the third step of the procedure.

In the third step, we look at three-way interaction u-terms, by testing the dependence of the variables $X_i$ and the cross variables $X_{jk}$, where $i, j, k = 1, \dots, p$ and $i, j, k$ are different, and construct the cross variables $X_{ijk}$. The independence test is based on the eigenvalue pattern of the related MCA, as described in the first step.

Continuing this way, in the $k$-th step we look at $k$-way interaction u-terms, and in the last step we look at the $p$-way interaction u-term. This algorithm is summarized in Figure 5.

5.2. An example for a graphical model

For this example we use a data set given by Haberman [24] that was used in Falguerolles et al. [14] to fit a graphical model. The data report attitudes toward non-therapeutic abortions among white subjects, crossed with three categorical variables describing the subjects. The data set is a contingency table observed for 3181 individuals, crossing four three-modality variables $X_1$, $X_2$, $X_3$ and $X_4$, defined in Table 1.

The first step of the procedure consists in testing the pairwise independence hypothesis of the variables. We first transform the contingency table into a complete disjunctive table, then calculate the parameters (defined in 2.1 and 4.1.1) needed for the test (Table 2).

MCA on the four variables gives the eigenvalue diagram of Figure 6. The shape of the eigenvalue diagram clearly points to the existence of dependent variables. Eigenvalues $\lambda_1$, $\lambda_7$ and $\lambda_8$ are not in the interval $I_c$, so there are at least two dependent variables: there are one or more two-way interactions between variables.

[Figure 5: Block diagram for the empirical procedure]

Table 1: Attitudes toward nontherapeutic abortions among white subjects
Columns: Year (X_1); Religion (X_2); Education in years (X_3); Attitude (X_4): positive, mixed, negative.
Row labels: for each year (the first is 1972), the religion categories are northern Protestant, southern Protestant and Catholic.

Table 2: Parameters needed for the test (first step of the example for a graphical model)
Columns: p, m_1, m_2, m_3, m_4, q, m, σ, I_c. I_c = [0.2283, 0.2717]

λ_1  **************************
λ_2  *********************
λ_3  ********************
λ_4  *******************
λ_5  ******************
λ_6  *****************
λ_7  ****************
λ_8  ***********

Figure 6: Eigenvalues diagram (first step of the example for a graphical model)
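The eigenvalues plotted in such a diagram come from an MCA of the complete disjunctive table. A minimal sketch of that computation is given below, assuming NumPy and integer-coded data; the function name and the simulated data are hypothetical, and the data are random draws, not the abortion-attitude table. The eigenvalues are obtained from (1/p) D^{-1/2} B D^{-1/2}, where B is the Burt table and D the diagonal matrix of category frequencies; the trivial eigenvalue 1 is discarded.

```python
import numpy as np

def mca_eigenvalues(codes, m):
    """Nontrivial MCA eigenvalues of integer-coded data.

    codes: (n, p) integer array, column j taking values 0..m[j]-1;
    every category is assumed to be observed at least once.
    """
    codes = np.asarray(codes)
    p = len(m)
    q = sum(m) - p
    # complete disjunctive (indicator) table: one 0/1 column per category
    X = np.hstack([np.eye(mj)[codes[:, j]] for j, mj in enumerate(m)])
    d = X.sum(axis=0)                                 # category frequencies
    S = (X.T @ X) / (p * np.sqrt(np.outer(d, d)))     # (1/p) D^{-1/2} B D^{-1/2}
    eig = np.sort(np.linalg.eigvalsh(S))[::-1]
    return eig[1:q + 1]                               # drop the trivial eigenvalue 1

# Simulated independent variables (illustration only).
rng = np.random.default_rng(0)
m = [3, 3, 3, 3]
codes = np.column_stack([rng.integers(0, mj, size=2000) for mj in m])
lam = mca_eigenvalues(codes, m)
print(lam.round(4), lam.mean())
```

The mean of the q nontrivial eigenvalues is always 1/p, which is why the interval I_c is centred there; under independence the eigenvalues stay close to that centre.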

The second step consists of the detection of two-way interactions. First, we use our first step with only the three variables X_1, X_2 and X_3. With the values of n and m_i (for i = 1, ..., 3) still the same, the other parameters become (Table 3):

Table 3: Parameters for the test (second step of the example for a graphical model)
Columns: q, m, σ, I_c. I_c = [0.3097, 0.3569]

We get the following eigenvalue diagram (Figure 7):

λ_1  **************************
λ_2  *************************
λ_3  ************************
λ_4  **********************
λ_5  *********************

Figure 7: Eigenvalues diagram (second step of the example for a graphical model)

λ_1 and λ_5 are not in the interval I_c, so there are one or more two-way interactions between X_1, X_2 and X_3, as well as interactions between X_4 and the others.

Next, we look at the interactions between X_4 and X_i (i = 1, 2, 3). For i = 1 to i = 3 we look at the eigenvalues of the MCA of X_4 with X_i, and calculate their variances and intervals I_c.

Crossing X_1 with X_4 we get (Table 4):

Table 4: MCA on X_1 and X_4 (parameters and eigenvalues)
Columns: q, m, σ, I_c, λ_1, λ_2, λ_3, λ_4. I_c = [0.4750, 0.5250]

Crossing X_2 with X_4 we get (Table 5):

Table 5: MCA on X_2 and X_4 (parameters and eigenvalues)
Columns: q, m, σ, I_c, λ_1, λ_2, λ_3, λ_4. I_c = [0.4750, 0.5250]

Crossing X_3 with X_4 we get (Table 6):

Table 6: MCA on X_3 and X_4 (parameters and eigenvalues)
Columns: q, m, σ, I_c, λ_1, λ_2, λ_3, λ_4. I_c = [0.4750, 0.5250]

In the three cases, λ_1 and λ_4 are not in the intervals I_c, so there is a two-way interaction between X_1 and X_4, between X_2 and X_4, and between X_3 and X_4. We can therefore construct the cross variables X_4i, each having 9 modalities (i = 1, 2, 3).

Now we use the first step with only the two variables X_1 and X_2; afterwards we look for interactions between X_3 and X_i (i = 1, 2).

Crossing X_1 with X_2 we get (Table 7):

Table 7: MCA on X_1 and X_2 (parameters and eigenvalues)
Columns: q, m, σ, I_c, λ_1, λ_2, λ_3, λ_4. I_c = [0.4750, 0.5250]

All the eigenvalues are in the confidence interval, so X_1 and X_2 are independent conditionally on the others, and there is no cross variable X_12. The corresponding u-term u_12 equals zero.

Let us now look, for i = 1 and i = 2, at the eigenvalues of the MCA of X_3 with X_i, with their variances and intervals I_c.

Crossing X_1 with X_3 we get (Table 8):

Table 8: MCA on X_1 and X_3 (parameters and eigenvalues)
Columns: q, m, σ, I_c, λ_1, λ_2, λ_3, λ_4. I_c = [0.4750, 0.5250]

All the eigenvalues are in the confidence interval I_c, so X_1 and X_3 are independent conditionally on the others, and there is no cross variable X_13: the corresponding u-term u_13 equals zero.

Crossing now X_2 with X_3 we get (Table 9):

Table 9: MCA on X_2 and X_3 (parameters and eigenvalues)
Columns: q, m, σ, I_c, λ_1, λ_2, λ_3, λ_4. I_c = [0.4750, 0.5250]

Here, λ_1 and λ_4 are not in the interval I_c, so there is a two-way interaction between X_2 and X_3, u_23 is not set to zero, and we can add the cross variable X_32 (as well as X_23) with 9 modalities to the model.
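Building a cross variable such as X_32 amounts to recoding each pair of categories as a single category. A minimal sketch, assuming the categories are coded as integers starting at 0 (the function name and the toy data are hypothetical):

```python
def cross_variable(xj, xk, mk):
    """Codes of the cross variable X_jk from codes xj (0..mj-1) and xk (0..mk-1).

    Each pair (a, b) is mapped to the single code a * mk + b, so the cross
    variable has mj * mk categories.
    """
    return [a * mk + b for a, b in zip(xj, xk)]

# Toy illustration with two three-category variables (not the paper's data).
x3 = [0, 1, 2, 2]
x2 = [2, 0, 1, 2]
x32 = cross_variable(x3, x2, 3)
print(x32)  # [2, 3, 7, 8]
```

With two three-modality variables the codes range over 0..8, which gives the 9 modalities mentioned in the text.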

The third step consists of the detection of triple interactions between variables, that is, of two-way interactions between the variables X_i and the cross variables X_jk. We first put the cross variables (X_41, X_42, X_43, X_32) together with the initial variables that were deemed independent in the second step of the procedure, i.e. X_1 and X_2, and then we use the first step of the procedure with the resulting set of variables. We get the following results (Table 10 and Figure 8):

Table 10: MCA on X_1, X_2, X_41, X_42, X_43 and X_32 (parameters, third step of the example for a graphical model)
Columns: q, m, σ, I_c. I_c = [0.1331, 0.2003]

λ_1   **************************
λ_2   *************************
λ_3   ******************
λ_4   ******************
λ_5   ******************
λ_6   *****************
λ_7   ************
λ_8   ***********
λ_9   ***********
λ_10  ***********
λ_11  ***********
λ_12  ***********
λ_13  ***********
λ_14  ***********
λ_15  **********
λ_16  *********

Figure 8: MCA on X_1, X_2, X_41, X_42, X_43 and X_32 (eigenvalues diagram, third step of the example for a graphical model)

The first six eigenvalues are not in I_c: there are one or more two-way interactions between the initial variables X_i and the crossed ones X_ik, so there exists a triple interaction between simple variables.

We drop X_32 and use the first step with the five other variables to get the following results (Table 11 and Figure 9):

Table 11: MCA on X_1, X_2, X_41, X_42 and X_43 (parameters for the test)
Columns: q, m, σ, I_c. I_c = [0.1671, 0.2324]

λ_1   **************************
λ_2   **************************
λ_3   ****************
λ_4   ****************
λ_5   ****************
λ_6   ***************
λ_7   **********
λ_8   **********
λ_9   **********
λ_10  *********
λ_11  *********
λ_12  *********
λ_13  *********
λ_14  *********
λ_15  *********
λ_16  *********
λ_17  ********
λ_18  ********
λ_19  ********
λ_20  ********

Figure 9: MCA on X_1, X_2, X_41, X_42 and X_43 (eigenvalues diagram, third step of the example for a graphical model)

The first six eigenvalues are not in I_c, so there is at least one two-way interaction between the variables. We know that the simple variables X_1, X_2 and the crossed variables X_41, X_42, X_43 are dependent, so we only have to test the dependence between X_1 and X_32.

Crossing X_1 and X_32 we get the following results (Table 12):

Table 12: MCA on X_1 and X_32 (parameters and eigenvalues)
Columns: q, m, σ, I_c, λ_1 to λ_10. I_c = [0.4682, 0.5318]

All the eigenvalues are in the confidence interval I_c, so X_1 and X_32 are independent conditionally on the others, and there is no cross variable X_132. The corresponding u-term u_123 equals zero.

Now we can drop the cross variable X_43. The remaining variables X_1, X_2, X_41, X_42 are dependent by construction. We only have to test for dependence between X_1 and X_43.

Crossing X_1 with X_43, we get the same parameters as for the crossing of X_1 and X_32, and the following eigenvalues (Table 13):

Table 13: MCA on X_1 and X_43 (eigenvalues)
Columns: λ_1 to λ_10.

We remark that λ_1 and λ_10 are not in the interval I_c, so X_1 and X_43 seem to be dependent. But we have to fit a graphical model, which is a particular case of hierarchical models (as defined in 2.2.2.2, a log-linear model is hierarchical if, whenever one particular u-term is constrained to zero, all higher u-terms containing the same set of subscripts are also set to zero). Here the u-term u_13 is set to zero, so the u-term u_134 is also set to zero.

Crossing X_2 with X_43, we get the same parameters as for the crossing of X_1 and X_32, and the following eigenvalues (Table 14):

Table 14: MCA on X_2 and X_43 (eigenvalues)
Columns: λ_1 to λ_10.

Eigenvalues λ_1, λ_2, λ_9 and λ_10 are not in the interval I_c. The u-terms u_23 and u_24 are not set to zero, and since X_2 and X_43 are dependent, the u-term u_234 is not set to zero.

Crossing X_1 with X_42 (or equivalently X_2 with X_41), we get the same parameters as for the crossing of X_1 and X_32, and the following eigenvalues:

Table 15: MCA on X_1 and X_42 (eigenvalues)
Columns: λ_1 to λ_10.
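The hierarchy principle invoked for u_134 can be expressed as a simple filter: a higher-order u-term is kept only if none of the u-terms already set to zero is contained in it. The sketch below is illustrative only; the function name is hypothetical, and u-terms are represented as tuples of variable indices.

```python
def allowed_terms(candidates, zero_terms):
    """Keep the candidate u-terms that contain no u-term constrained to zero.

    candidates, zero_terms: iterables of tuples of variable indices,
    e.g. (1, 3, 4) stands for u_134.
    """
    zero = [frozenset(z) for z in zero_terms]
    return [t for t in candidates if not any(z <= frozenset(t) for z in zero)]

# In the example, u_12 and u_13 were set to zero in the second step.
three_way = [(1, 2, 3), (1, 2, 4), (1, 3, 4), (2, 3, 4)]
print(allowed_terms(three_way, zero_terms=[(1, 2), (1, 3)]))  # [(2, 3, 4)]
```

Under this rule only u_234 remains a candidate three-way term, whatever the eigenvalue pattern of the other crossings suggests.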

Eigenvalues λ_1 and λ_10 are not in the interval I_c, so X_1 and X_42 are dependent; however, the u-term u_12 is equal to zero, so by the hierarchy principle the u-term u_124 is set to zero. Finally, variables X_1 and X_41 are dependent by construction.

The procedure stops here because we cannot have more than triple interactions in a hierarchical model when not all the two-way interactions are present. We obtain the following model (see Figure 10 for the associated graph):

\[ \log[f_4(X)] = u_0 + u_1(x_1) + u_2(x_2) + u_3(x_3) + u_4(x_4) + u_{32}(x_2, x_3) + u_{41}(x_4, x_1) + u_{42}(x_4, x_2) + u_{43}(x_4, x_3) + u_{432}(x_4, x_3, x_2) \]

[Figure 10: Lattice diagram (example for a graphical model)]

5.3 An example for a saturated model

Here we use a data set about shop-lifting habits given by Israëls [29], which was also used by Van der Heijden et al. [28]. Table 16 is a contingency table crossing three variables: sex (2 modalities), age (9 modalities) and type of goods (13 modalities: Clothing (C), Clothing accessories (Ca), Provision-Tobacco (PT), Writing materials (Wm), Books (B), Records (R), Household goods (Hg), Sweets (S), Toys (T), Jewellery (J), Perfume (P), Hobbies tools (Ht), and Others (O)) observed over individuals.

In the first step, we test the pairwise independence of the variables X_1, X_2 and X_3. We first transform the contingency table into a complete disjunctive table, then compute the parameters (defined in 2.2 & 4.1.1) needed for the test (Table 17). An MCA on the three variables gives the eigenvalue diagram of Figure 11.

The eigenvalue diagram shows clearly that the variables are not independent: only 8 eigenvalues (λ_7, ..., λ_15) are in the confidence interval. Using the second step of the procedure, we get the two-way interactions.

Table 16: Multicontingency table for the shop-lifting data
Columns: Sex (X_1): Male, Female; Age (X_2); Goods (X_3): C, Ca, PT, Wm, B, R, Hg, S, T, J, P, Ht, O.

Table 17: Parameters needed for the test (first step of the example for a saturated model)
Columns: p, m_1, m_2, m_3, q, m, σ, I_c. I_c = [0.3211, 0.3455]

λ_1   ***************************************************
λ_2   ***********************************
λ_3   ********************************
λ_4   *******************************
λ_5   ****************************
λ_6   ****************************
λ_7   ***************************
λ_8   **************************
λ_9   **********************
λ_10  **********************
λ_11  **********************
λ_12  **********************
λ_13  *********************
λ_14  *********************
λ_15  *********************
λ_16  ********************
λ_17  *******************
λ_18  ******************
λ_19  ****************
λ_20  ************
λ_21  *******

Figure 11: MCA on X_1, X_2 and X_3 (eigenvalues diagram, first step of the example for a saturated model)

MCA of X_1 and X_3 gives the following results (Table 18 and Figure 12):

Table 18: MCA on X_1 and X_3 (parameters)
Columns: p, q, m, σ, I_c. I_c = [0.5000, 0.5000]

λ_1   ****************************************
λ_2   *************************
λ_3   *************************
λ_4   *************************
λ_5   *************************
λ_6   *************************
λ_7   *************************
λ_8   *************************
λ_9   **********************
λ_10  **********************
λ_11  **********************
λ_12  **********************
λ_13  **********

Figure 12: MCA on X_1 and X_3 (eigenvalues diagram, second step of the example for a saturated model)

The first and the last eigenvalues are not in the confidence interval, so the u-term u_13 is not set to zero. We notice here the peculiar form of the eigenvalue diagram, due to the fact that the multiple eigenvalue λ = 1/2, which has multiplicity 11 = m_3 - m_1, is an artificial one (cf. 4.2.1.1).

MCA of X_2 and X_3 gives the following results (Table 19 and Figure 13):

Table 19: MCA on X_2 and X_3 (parameters)
Columns: p, q, m, σ, I_c. I_c = [0.4998, 0.5002]

The first 8 and the last 8 eigenvalues are not in the confidence interval, so the u-term u_23 is not set to zero.
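The artificial eigenvalue 1/2 noted for the crossing of X_1 and X_3 follows from a standard property of MCA with p = 2: the eigenvalues are (1 ± ρ_k)/2, where the ρ_k are the nontrivial singular values of the simple correspondence analysis of the two-way table, completed by |m_1 - m_2| eigenvalues equal to 1/2. The sketch below illustrates this with NumPy on a made-up 2 x 5 table; the function name and the counts are hypothetical and are not the shop-lifting data.

```python
import numpy as np

def two_variable_mca_eigenvalues(N):
    """MCA eigenvalues for p = 2 from an m1 x m2 contingency table N.

    Assumes no empty row or column. Returns the q = m1 + m2 - 2
    nontrivial eigenvalues in decreasing order.
    """
    N = np.asarray(N, dtype=float)
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)               # row and column margins
    s = np.linalg.svd(P / np.sqrt(np.outer(r, c)), compute_uv=False)
    rho = s[1:]                                       # drop the trivial singular value 1
    halves = np.full(abs(N.shape[0] - N.shape[1]), 0.5)
    return np.sort(np.concatenate([(1 + rho) / 2, halves, (1 - rho) / 2]))[::-1]

# Made-up 2 x 5 table: 5 - 2 = 3 eigenvalues are exactly 1/2.
N = [[10, 20, 30, 25, 15],
     [30, 20, 10, 15, 25]]
lam = two_variable_mca_eigenvalues(N)
print(lam.round(4))
```

With m_1 = 2 and m_3 = 13 the same argument gives 13 - 2 = 11 eigenvalues equal to 1/2, leaving only the first and the last eigenvalue informative about the dependence.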

λ_1   ****************************************
λ_2   *******************************
λ_3   *****************************
λ_4   **************************
λ_5   *************************
λ_6   ************************
λ_7   ************************
λ_8   ***********************
λ_9   ********************
λ_10  ********************
λ_11  ********************
λ_12  ********************
λ_13  *******************
λ_14  ******************
λ_15  ******************
λ_16  *****************
λ_17  ****************
λ_18  ************
λ_19  ***********
λ_20  ******

Figure 13: MCA on X_2 and X_3 (eigenvalues diagram, second step of the example for a saturated model)

MCA of X_1 and X_2 gives the following results (Table 20 and Figure 14):

Table 20: MCA on X_1 and X_2 (parameters)
Columns: p, q, m, σ, I_c. I_c = [0.4926, 0.5074]

λ_1  ****************************************
λ_2  *************************
λ_3  *************************
λ_4  *************************
λ_5  *************************
λ_6  *************************
λ_7  *************************
λ_8  *************************
λ_9  **********

Figure 14: MCA on X_1 and X_2 (eigenvalues diagram, second step of the example for a saturated model)

The first and the last eigenvalues are not in the confidence interval, so the u-term u_12 is not set to zero. At the end of the second step, we obtain all three

two-way interactions. To know whether the model is a saturated one, we can build one of the crossed variables and test its dependence with the remaining simple variable. MCA of X_32 with X_1 gives the following eigenvalues: λ_1 = 0.7285, λ_2 = λ_3 = ... = λ_116 = 0.5, λ_117 = , and I_c = [0.4615, 0.5384].

The first and the last eigenvalues are not in the confidence interval, so the u-term u_123 is not set to zero. At the end we get the following saturated model:

\[ \log[f_3(X)] = u_0 + u_1(x_1) + u_2(x_2) + u_3(x_3) + u_{12}(x_1, x_2) + u_{23}(x_2, x_3) + u_{13}(x_1, x_3) + u_{123}(x_1, x_2, x_3) \]

5.4 An example for a mutual independence model

Here we use a data set given by Andersen [2], a contingency table crossing four variables observed over 299 individuals and corresponding to a retrospective study of ovary cancer, defined in Table 21:

Table 21: Retrospective study of ovary cancer
Stage (X_1) | Operation (X_2) | Survival (X_3) | X-ray (X_4): No | X-ray (X_4): Yes
Early | radical | no | |
Early | radical | yes | |
Early | limited | no | 1 | 3
Early | limited | yes | 13 | 9
Advanced | radical | no | |
Advanced | radical | yes | 6 | 11
Advanced | limited | no | 3 | 13
Advanced | limited | yes | 1 | 5

In the first step of the procedure, we test for the pairwise independence of the variables X_1, X_2, X_3 and X_4. We first transform the contingency table into a complete disjunctive table, then compute the parameters (see 4.1.1) needed for the test.

The MCA on the four variables gives the following results (Table 22 and Figure 15):

Table 22: Parameters needed for the test (first step of the example for a mutual independence model)
Columns: p, m_1, m_2, m_3, m_4, q, m, σ, I_c. I_c = [0.2000, 0.3000]

λ_1  **********************************
λ_2  ********************
λ_3  *******************
λ_4  *********

Figure 15: MCA on X_1, X_2, X_3 and X_4 (eigenvalues diagram, first step of the example for a mutual independence model)

The eigenvalue diagram shows clearly that the variables are not independent: only λ_2 and λ_3 are in the confidence interval. Let us drop X_4 and use the second step of the procedure. MCA on the three remaining variables gives the following results (Table 23 and Figure 16):

Table 23: MCA on X_1, X_2 and X_3 (parameters)
Columns: p, q, m, σ, I_c. I_c = [0.2787, 0.3879]

λ_1  **********************
λ_2  ********************
λ_3  *******************

Figure 16: MCA on X_1, X_2 and X_3 (eigenvalues diagram)

The eigenvalue diagram shows clearly that the variables are independent, since all the eigenvalues are in the confidence interval, so there is surely one or more interaction between X_4 and X_i, i = 1, ..., 3.
