ON THE CONNECTION BETWEEN THE DISTRIBUTION OF EIGENVALUES IN MULTIPLE CORRESPONDENCE ANALYSIS AND LOG-LINEAR MODELS


Authors: S. Ben Ammou, Department of Quantitative Methods, Faculté de Droit et des Sciences Economiques et Politiques de Sousse, Tunisia (saloua.benammou@fdseps.rnu.tn); G. Saporta, Chaire de Statistique Appliquée & CEDRIC, Conservatoire National des Arts et Métiers, Paris, France (saporta@cnam.fr)

Received: December 2002. Revised: September 2003. Accepted: October 2003.

Abstract: Multiple Correspondence Analysis (MCA) and log-linear modeling are two techniques for multi-way contingency table analysis having different approaches and fields of application. Log-linear models are interesting when applied to a small number of variables. Multiple Correspondence Analysis is useful in large tables. This efficiency is balanced by the fact that MCA cannot make explicit the relations between more than two variables, as can be done through log-linear modeling. The two approaches are complementary. We present in this paper the distribution of eigenvalues in MCA when the data fit a known log-linear model, then we construct this model by successive applications of MCA. We also propose an empirical procedure, fitting progressively the log-linear model, where the fitting criterion is based on eigenvalue diagrams. The procedure is validated on several sets of data used in the literature.

Key-Words: Multiple Correspondence Analysis; eigenvalues; log-linear models; graphical models; normal distribution.

AMS Subject Classification: 49A05, 78B26.

We thank Professor M. Bourdeau for his careful reading.


1. INTRODUCTION

Multiple Correspondence Analysis and log-linear modeling are two very different but mutually beneficial approaches to analyzing multi-way contingency tables: log-linear models are profitably applied to a small number of variables, whereas Multiple Correspondence Analysis is useful in large tables. This efficiency is balanced by the fact that MCA cannot make explicit the relations between more than two variables, as can be done through log-linear modeling. The two approaches are complementary.

After a short reminder on MCA and log-linear approaches, we study the distribution of eigenvalues in MCA under modeling hypotheses, especially in the case of independence. At the end we propose an algorithmic approach for fitting log-linear models where the fitting criterion is based on the eigenvalue diagram.

2. A SHORT SURVEY OF MULTIPLE CORRESPONDENCE ANALYSIS AND LOG-LINEAR MODELS

We first introduce MCA and log-linear modeling, then we present some works using both methods.

2.1. Multiple Correspondence Analysis

Correspondence Analysis (CA) has quite a long history as a method for the analysis of categorical data. The starting point of this history is usually set in 1935 [28], and since then CA has been reinvented several times. We can distinguish simple CA (CA of contingency tables) and MCA or Multiple Correspondence Analysis (CA of so-called indicator matrices). MCA traces back to Guttman [23], Burt [8] or Hayashi [25]. In France, in the 1960s, Benzécri [6] proposed other developments of this method. Outside France, MCA has been developed by J. de Leeuw since 1973 [22] under the name of Homogeneity Analysis, and under the name of Dual Scaling by Nishisato [38].

Multiple Correspondence Analysis (MCA) is a multidimensional descriptive technique for categorical data. A theoretical version of Multiple Correspondence Analysis of $p$ variables can be defined as the limit, when the number of statistical units increases, of the CA of a complete disjunctive table.

Let $X$ be a complete disjunctive table of $p$ categorical variables $X_1, X_2, \dots, X_p$, with respectively $m_1, m_2, \dots, m_p$ modalities, observed over a sample of $n$ individuals. CA of this complete disjunctive table is equivalent to the analysis of $B$ [8], where $B = X'X$ is the Burt table associated with $X$. The two analyses have the same factors, but the eigenvalues in MCA are equal to the square roots of the eigenvalues in the CA of the associated Burt table.

MCA of $X$ corresponds to the diagonalization of the matrix

$$\frac{1}{p}\,(D^{-1}X'X) = \frac{1}{p}\,(D^{-1}B), \qquad \text{where } D = \mathrm{Diag}(X'X) = \mathrm{Diag}(B).$$

The structure of the eigenvalue diagram depends on the variable interactions. It is well known that in the case of pairwise independent variables, the $q$ non-trivial eigenvalues are theoretically equal to $\frac{1}{p}$, where

$$q = \sum_{i=1}^{p} m_i - p. \qquad (1)$$

2.2. Log-linear modeling

Log-linear modeling is a well-known method for studying structural relationships between categorical variables in a multiple contingency table when no variable plays a particular role. Relatively recent and not as well known in France as MCA, log-linear modeling has many classical references. After the first works of Birch [7] in 1963 and Goodman [17], we must mention the basic books of Haberman [24], Bishop, Fienberg & Holland [8], and Fienberg [15]. More recently, Dobson [12], Agresti [1] and Christensen [10] have written syntheses on the subject supplemented with personal contributions. Whittaker [41] devotes a large part of his book to log-linear models before defining associated graphical models.

2.2.1. Log-linear modeling in the binomial case

Let $X = (X_1, X_2, \dots, X_k)$ be a $k$-dimensional random vector, with values in $\{0, 1\}^k$. The expression for the $k$-dimensional probability density of $X$ is:

$$f_k(X) = p(0,0,\dots,0)^{(1-x_1)(1-x_2)\cdots(1-x_k)}\; p(1,0,\dots,0)^{x_1(1-x_2)\cdots(1-x_k)}\; p(0,1,\dots,0)^{(1-x_1)x_2\cdots(1-x_k)} \cdots p(0,0,\dots,1)^{(1-x_1)(1-x_2)\cdots x_k}\; p(1,1,\dots,0)^{x_1x_2\cdots(1-x_k)} \cdots p(1,1,\dots,1)^{x_1x_2\cdots x_k}.$$

We can write the density function as a log-linear expansion:

$$\log[f_k(X)] = u_0 + \sum_{i=1}^{k} u_i x_i + \sum_{i,j=1,\ i\neq j}^{k} u_{ij}\, x_i x_j + \sum_{i,j,l=1,\ i\neq j\neq l}^{k} u_{ijl}\, x_i x_j x_l + \cdots + u_{123\dots k}\, x_1 x_2 \cdots x_k,$$

where $u_0 = \log[p(0,0,\dots,0)]$, $u_i = \log\!\left[\dfrac{p(0,\dots,0,1,0,\dots,0)}{p(0,0,\dots,0)}\right]$, and the u-terms $u_{ij}, \dots, u_{123\dots k}$ are log cross-product ratios in the $(k, k)$ probability table. The u-term $u_{ij}$ is set to zero when $X_i$ and $X_j$ are independent variables.
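As an illustration of the binomial expansion, the following minimal sketch computes the u-terms for the simplest case of two binary variables. It is not taken from the paper: the function names and the probability tables are hypothetical, and the interaction term is written once ($u_{12}x_1x_2$) rather than as a sum over ordered pairs.

```python
import math

def u_terms_2x2(p):
    """u-terms of the log-linear expansion for two binary variables.

    p maps each cell (x1, x2) in {0, 1}^2 to its (strictly positive) probability.
    """
    u0 = math.log(p[(0, 0)])
    u1 = math.log(p[(1, 0)] / p[(0, 0)])
    u2 = math.log(p[(0, 1)] / p[(0, 0)])
    # log cross-product ratio of the 2 x 2 probability table
    u12 = math.log(p[(1, 1)] * p[(0, 0)] / (p[(1, 0)] * p[(0, 1)]))
    return u0, u1, u2, u12

def log_density(u, x1, x2):
    """Evaluate log f(x1, x2) from the u-terms."""
    u0, u1, u2, u12 = u
    return u0 + u1 * x1 + u2 * x2 + u12 * x1 * x2

# Hypothetical independent table: product of marginals (0.7, 0.3) and (0.4, 0.6)
independent = {(a, b): pa * pb
               for a, pa in ((0, 0.7), (1, 0.3))
               for b, pb in ((0, 0.4), (1, 0.6))}
# Hypothetical dependent table
dependent = {(0, 0): 0.4, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.3}
```

Under these assumptions the interaction term of the independent table vanishes up to rounding, while the expansion reproduces the log-probability of every cell of the dependent table.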

2.2.2. Log-linear modeling in the multinomial case

Let $X = (X_1, X_2, \dots, X_k)$ be a $k$-dimensional random vector, with values in $\{0, 1, \dots, m_1 - 1\} \times \{0, 1, \dots, m_2 - 1\} \times \cdots \times \{0, 1, \dots, m_k - 1\}$ instead of in $\{0, 1\}^k$ as in the preceding case. The generalization to the $k$-dimensional cross-classified multinomial distribution is the log-linear expansion:

$$\log[f_k(X)] = u_0 + \sum_{i=1}^{k} u_i(x) + \sum_{i,j=1,\ i\neq j}^{k} u_{ij}(x) + \sum_{i,j,l=1,\ i\neq j\neq l}^{k} u_{ijl}(x) + \cdots + u_{123\dots k}(x).$$

Each u-term is a coordinate projection function with the coordinates indicated by its index, and each u-term is constrained to be zero whenever one of its indicated coordinates is zero. The importance of log-linear expansions rests with the fact that many interesting hypotheses can be generated by setting some u-terms to zero. In this paper we are particularly interested in graphical and hierarchical log-linear models.

2.2.2.1. Graphical log-linear models

Let $G = (K, E)$ be the independence graph of the $k$-dimensional random vector $X$, with $k$ vertices in $K = \{1, 2, \dots, k\}$ and edge set $E$. The edge set $E$ is such that whenever the pair $(i, j)$ is not in $E$, the variables $X_i$ and $X_j$ are independent conditionally on the other variables.

Given an independence graph $G$, the cross-classified multinomial distribution for the random vector $X$ is a graphical model for $X$ if the distribution of $X$ is arbitrary apart from constraints of the form that, for all pairs of coordinates not in the edge set $E$ of $G$, the u-terms containing the selected coordinates are identically zero.

2.2.2.2. Hierarchical log-linear models

A graphical model satisfies constraints of the form that all u-terms above a fixed point have to be zero to get conditional independence. A larger class of models, the hierarchical models, is obtained by allowing more flexibility in setting the u-terms to zero.

A log-linear model is hierarchical if, whenever one particular u-term is constrained to zero, then all higher u-terms containing the same set of subscripts are also set to zero.
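The two definitions can be made concrete with a small sketch. It is an illustration only: a model is represented by the set of its non-zero u-terms (each a set of variable indices), all function names are hypothetical, and the graphical check assumes the usual clique characterization (every complete subset of the interaction graph must carry a non-zero u-term).

```python
from itertools import combinations

def is_hierarchical(terms):
    """True if every non-empty proper subset of each retained u-term is also retained."""
    terms = {frozenset(t) for t in terms}
    return all(frozenset(sub) in terms
               for t in terms
               for r in range(1, len(t))
               for sub in combinations(sorted(t), r))

def interaction_edges(terms):
    """Edges of the interaction graph: pairs of variables sharing a retained u-term."""
    return {frozenset(pair) for t in terms for pair in combinations(sorted(t), 2)}

def is_graphical(terms, k):
    """True if the model is hierarchical and every clique of its graph is a retained u-term."""
    terms = {frozenset(t) for t in terms}
    if not is_hierarchical(terms):
        return False
    edges = interaction_edges(terms)
    for r in range(2, k + 1):
        for subset in combinations(range(1, k + 1), r):
            complete = all(frozenset(pair) in edges for pair in combinations(subset, 2))
            if complete and frozenset(subset) not in terms:
                return False
    return True

# Hypothetical models on k = 3 variables
chain = [{1}, {2}, {3}, {1, 2}, {2, 3}]                  # X1 and X3 independent given X2
no_three_way = [{1}, {2}, {3}, {1, 2}, {1, 3}, {2, 3}]   # all pairs, no u_123
broken = [{1}, {2}, {1, 2, 3}]                           # u_123 kept without its sub-terms
```

In this sketch `chain` is graphical, `no_three_way` is hierarchical but not graphical (its graph is complete, yet $u_{123}$ is zero), and `broken` is not hierarchical.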

We note here that every distribution with a log-linear expansion has an interaction (or independence) graph, and a hierarchical log-linear model is graphical if and only if its maximal u-terms correspond to cliques in the independence graph.

When all the u-terms are non-zero, we have the saturated model. When only the $u_i$ are non-zero, the model is called the mutual independence model:

$$\log[f_k(X)] = u_0 + \sum_{i=1}^{k} u_i(x).$$

When only the $u_i$ and some of the $u_{ij}$ are non-zero, the model is called a conditional independence model:

$$\log[f_k(X)] = u_0 + \sum_{i=1}^{k} u_i(x) + \sum_{i,j} u_{ij}(x).$$

These conditional independence models refer to simple interactions between some variables.

2.2.3. Parameter estimation and related tests

Theoretical frequencies are generally estimated using the maximum-likelihood method. Weighted regression or iterative methods can also be used, since log-linear models are particular cases of the generalized linear model.

Usually the classical $\chi^2$ or $G^2$ (likelihood ratio) tests are used to assess log-linear models. The values of the two statistics increase with the number of variables, and decrease with the number of interactions. The closer the statistics are to zero, the better the models.

Model selection becomes difficult when the number of variables grows: e.g. with four variables there are 167 different hierarchical models. To avoid the combinatorial explosion we can use criteria based on the Kullback information, like the Akaike criterion

$$\mathrm{AIC} = -2\log(\tilde L) + 2k \quad \text{(An Information Criterion)},$$

or the Schwarz criterion

$$\mathrm{BIC} = -2\log(\tilde L) + k\log(n) \quad \text{(Bayesian Information Criterion)},$$

where $\tilde L$ is the maximum of the likelihood function $L$, $k$ the number of parameters estimated when maximising $L$, and $n$ the sample size.
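To make these fit statistics concrete, here is a minimal sketch for the simplest log-linear model, independence in a two-way table. The function name and the tables are hypothetical; the log-likelihood is the multinomial kernel (the constant term is dropped), which only matters when comparing models on the same data.

```python
import math

def independence_fit(table):
    """Pearson chi-square, G2, AIC and BIC of the independence model for a two-way table."""
    n = sum(map(sum, table))
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    # Maximum-likelihood expected counts under independence
    expected = [[ri * cj / n for cj in cols] for ri in rows]
    cells = [(o, e) for r_obs, r_exp in zip(table, expected) for o, e in zip(r_obs, r_exp)]
    chi2 = sum((o - e) ** 2 / e for o, e in cells)
    g2 = 2 * sum(o * math.log(o / e) for o, e in cells if o > 0)
    loglik = sum(o * math.log(e / n) for o, e in cells if o > 0)
    k = (len(rows) - 1) + (len(cols) - 1)   # free parameters of the independence model
    aic = -2 * loglik + 2 * k
    bic = -2 * loglik + k * math.log(n)
    return chi2, g2, aic, bic

# Hypothetical tables: the first is exactly independent, the second is not
chi2_a, g2_a, aic_a, bic_a = independence_fit([[10, 20], [20, 40]])
chi2_b, g2_b, aic_b, bic_b = independence_fit([[30, 10], [10, 30]])
```

Both statistics are zero for the first table and clearly positive for the second, in line with the remark that values close to zero indicate a better model.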

2.3. Multiple Correspondence Analysis and log-linear model as complementary tools of analysis

In this section, we present some works that show how CA (or MCA) and log-linear modeling can be related. This leads to a better understanding of CA, and to a combined use of both methods.

CA is often introduced without any reference to other methods of statistical treatment of categorical data with proven usefulness and flexibility. A major difference between CA and most other techniques for categorical data analysis lies in the use of probability models. In log-linear analysis (LLA), for example, a distribution is assumed under which the data are collected, then a log-linear model for the data is hypothesized and estimates are made under the assumption that this probability model is true, and finally these estimates are compared with the observed frequencies to evaluate the log-linear model. In this way it is possible to make inferences about the population on the basis of the sample data. In CA, it is claimed that no underlying distribution has to be assumed and no model has to be hypothesized, but a decomposition of the data is obtained to study the structure in the data.

A vast literature has been devoted to the assessment of CA (or MCA) and LLA. We briefly report here some of that literature.

Several works compare CA or MCA and LLA. Daudin and Trecourt [11] demonstrate empirically that LLA is better adapted to study global relationships between the variables, in the sense that marginal associations are eliminated in the computation of profiles. Goodman [17],[18],[19],[20],[21] defines two models belonging to the same family: the saturated row-column correspondence analysis model as a generalization of MCA, and the row-column association model as a generalization of LLA. He demonstrates, with illustrative examples, that using these models is better than using the classical methods. Baccini, Mathieu and Mondot [3] use an example to compare the two methods. Jmel [30] and De Falguerolles, Jmel and Whittaker [13],[14] use graphical models compared to MCA.

Other works use CA or MCA and LLA as a combined approach to contingency table analysis: Van der Heijden and de Leeuw [26],[27],[28], Novak and Hoffman [39] and others use CA as a tool for the exploration of the residuals from log-linear models, and give an example of the procedure. Worsley [42] shows that in certain cases CA leads directly to the appropriate log-linear model. Lauro and Decarli [31] used CA as a procedure for the identification of the best log-linear models.

3. EIGENVALUES IN CORRESPONDENCE ANALYSIS

MCA is an extension of CA, so we first present eigenvalues in CA and some simple rules for the selection of the number of eigenvalues.

3.1. Asymptotic distribution of eigenvalues in Correspondence Analysis

Let $N$ be a contingency table with $m_1$ rows and $m_2$ columns, and let us assume that $N$ is the realization of a multinomial distribution $M(n, p_{ij})$, which is realistic. In this framework the observed eigenvalues $\mu_i$ are estimators of the eigenvalues $\lambda_i$ of $P$, where $P$ is the table of unknown joint probabilities.

Lebart [32] and O'Neill [34],[35],[36] proved the following result: if $\lambda_i = 0$, then $n\mu_i$ has the same distribution as the corresponding eigenvalue of a Wishart matrix with $(m_1 - 1)(m_2 - 1)$ degrees of freedom, $W_{(m_1-1)(m_2-1)}(r, I)$, where $r = \min(m_1 - 1, m_2 - 1)$. If $\lambda_j \neq 0$, then $\mu_j$ is asymptotically normally distributed, but with parameters depending on the unknown $p_{ij}$.

Since it is difficult to test this hypothesis, some authors have proposed a bootstrap approach, which unfortunately is not valid: since the empirical eigenvalues, on which the replication is based, are generally not null, we cannot observe the distribution based on the Wishart matrix.

3.2. Malinvaud's test

The test is based upon the reconstitution formula, which is a weighted singular value decomposition of $N$:

$$n_{ij} = \frac{n_{i.}\, n_{.j}}{n}\left(1 + \sum_{k} \frac{a_{ik}\, b_{jk}}{\sqrt{\lambda_k}}\right),$$

where $a_{ik}$, $b_{jk}$ are the factorial components associated to the row and column profiles. We may use a chi-square test comparing the observed $n_{ij}$ from a sample of size $n$ to the expected frequencies under the null hypothesis $H_k$ of only $k$ non-zero eigenvalues. The weighted least squares estimates of these expectations are precisely the $\tilde n_{ij}$ of the reconstitution formula with its first $k$ terms. We then compute the classical chi-square goodness-of-fit statistic:

$$Q_k = \sum_{i}\sum_{j} \frac{(\tilde n_{ij} - n_{ij})^2}{\tilde n_{ij}}.$$

If $k = 0$ (independence), $Q_0$ is compared to a chi-square with $(m_1 - 1)(m_2 - 1)$ degrees of freedom. Under $H_k$, $Q_k$ is asymptotically distributed like a chi-square with $(m_1 - k - 1)(m_2 - k - 1)$ degrees of freedom.

However, $Q_k$ suffers from the following drawback: if $n_{ij}$ is small, $\tilde n_{ij}$ can be negative and the test statistic cannot be used. This is not the case for the modification proposed by E. Malinvaud [37], who proposed to use $\frac{n_{i.} n_{.j}}{n}$ instead of $\tilde n_{ij}$ for the denominator. Furthermore, this leads to a simple relation with the sum of the discarded eigenvalues:

$$Q'_k = \sum_{i}\sum_{j} \frac{(\tilde n_{ij} - n_{ij})^2}{n_{i.}\, n_{.j}/n} = n\,(\lambda_{k+1} + \lambda_{k+2} + \cdots + \lambda_r).$$

$Q'_k$ is also asymptotically distributed like a chi-square with $(m_1 - k - 1)(m_2 - k - 1)$ degrees of freedom.

4. BEHAVIOUR OF EIGENVALUES IN MCA UNDER MODELING HYPOTHESES

Let $X = (X_1 \mid X_2 \mid \cdots \mid X_p)$ be a disjunctive table of $p$ categorical variables $X_i$ (with respectively $m_i$ modalities) observed on a sample of $n$ individuals, and $q$ the number of non-trivial eigenvalues (as defined in 2.1).

Multiple Correspondence Analysis is the CA of the disjunctive table $X$. The rank of $X$ is $\mathrm{rank}(X) = \min(q + 1;\, n)$, so it equals $q + 1$ if $n > q + 1$. We suppose, without loss of generality, that $n$ is large enough, which is the usual case.

CA factors are the eigenvectors of the matrix $\frac{1}{p} D^{-1}B$ (where $B$ and $D$ are defined in 2.1). All the diagonal elements of $D^{-1}B$ are equal to one, so its trace is

$$\mathrm{Tr}(D^{-1}B) = \sum_{i=1}^{p} m_i \qquad \text{and} \qquad \frac{1}{p}\,\mathrm{Tr}(D^{-1}B) = \frac{1}{p}\sum_{i=1}^{p} m_i.$$

Since, after removal of the trivial eigenvalue 1, $\sum_{i=1}^{q} \mu_i = \frac{1}{p}\sum_{i=1}^{p} m_i - 1$, we can conclude that

$$\frac{1}{q}\sum_{i=1}^{q} \mu_i = \frac{1}{p}. \qquad (2)$$
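Returning briefly to Malinvaud's statistic of Section 3.2, the relation between $Q'_k$ and the discarded eigenvalues can be sketched numerically. This is an illustration under stated assumptions, not the authors' code: the function names and the table are hypothetical, and the CA eigenvalues are obtained as squared singular values of the standardised residual matrix.

```python
import numpy as np

def ca_eigenvalues(table):
    """Non-trivial CA eigenvalues of a two-way table of counts, and the sample size."""
    N = np.asarray(table, dtype=float)
    n = N.sum()
    expected = np.outer(N.sum(axis=1), N.sum(axis=0)) / n
    # Standardised residuals; their squared singular values are the CA eigenvalues
    S = (N - expected) / np.sqrt(n * expected)
    sv = np.linalg.svd(S, compute_uv=False)
    r = min(N.shape) - 1
    return sv[:r] ** 2, n

def malinvaud_statistic(table, k):
    """Q'_k = n * (sum of the eigenvalues beyond the k-th) and its degrees of freedom."""
    lam, n = ca_eigenvalues(table)
    dof = (len(table) - k - 1) * (len(table[0]) - k - 1)
    return n * lam[k:].sum(), dof

# Hypothetical 3 x 3 table of counts
table = [[20, 10, 5], [10, 20, 10], [5, 10, 20]]
q0, dof0 = malinvaud_statistic(table, 0)
q1, dof1 = malinvaud_statistic(table, 1)
```

For $k = 0$ the statistic coincides with the usual Pearson chi-square of the table, which gives a quick consistency check of the implementation.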

Similarly, the sum of the squared eigenvalues is

$$\sum_{i=1}^{q} (\mu_i)^2 = \frac{1}{p^2}\sum_{i=1}^{p} (m_i - 1) + \frac{1}{p^2}\sum_{i \neq j} \varphi^2_{ij}, \qquad (3)$$

where $\varphi^2_{ij}$ is the observed Pearson's $\varphi^2$ for the crossing of $X_i$ with $X_j$, and

$$\varphi^2 = \frac{1}{n}\sum_{i}\sum_{j} \frac{\left(n_{ij} - \frac{n_{i.}\, n_{.j}}{n}\right)^2}{\frac{n_{i.}\, n_{.j}}{n}} = \frac{\chi^2}{n}$$

($n_{i.}$ and $n_{.j}$ are the marginal counts).

Although MCA is an extension of CA, the results of Section 3 are not valid and one cannot use Malinvaud's test: the elements of $X$ being 0 or 1 and not frequencies, $Q_k$ and $Q'_k$ do not follow a chi-square distribution. However, it is possible to get information about the dispersion of the $q$ eigenvalues in particular cases [5].

4.1. Distribution of eigenvalues in MCA under independence hypothesis

Under the hypothesis of pairwise independence of the variables $X_i$, all $\varphi^2_{ij} = 0$ and equation (3) becomes

$$\sum_{i=1}^{q} (\mu_i)^2 = \frac{1}{p^2}\sum_{i=1}^{p} (m_i - 1).$$

Using (1) we get

$$\sum_{i=1}^{q} (\mu_i)^2 = \frac{1}{p^2}\, q,$$

and finally, using (2):

$$\frac{1}{q}\sum_{i=1}^{q} (\mu_i)^2 = \frac{1}{p^2} = \left[\frac{1}{q}\sum_{i=1}^{q} \mu_i\right]^2.$$

Since the mean of the squared $\mu_i$ equals the square of their mean only if all the terms are equal, we can conclude that all the eigenvalues have the same value, so that

$$\mu_i = \frac{1}{p}, \quad \forall i.$$

We conclude that the theoretical MCA (i.e. for the population), under the hypothesis of pairwise independence of the variables $X_i$, leads to one $q$-multiple non-trivial non-zero eigenvalue $\lambda = \frac{1}{p}$. The eigenvalue diagram then has the particular shape shown in Figure 1.
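This population result can be checked on a small numerical sketch. The construction below is an assumption made for illustration (hypothetical marginal probabilities and function names): the theoretical Burt table of pairwise independent variables is assembled block by block, and the eigenvalues of $\frac{1}{p}D^{-1}B$ are computed through the symmetric matrix $D^{-1/2}BD^{-1/2}$, which has the same spectrum.

```python
import numpy as np

def theoretical_burt(marginals, n=1.0):
    """Burt table of pairwise independent variables with the given marginal probabilities."""
    probs = [np.asarray(m, dtype=float) for m in marginals]
    blocks = [[n * (np.diag(pi) if i == j else np.outer(pi, pj))
               for j, pj in enumerate(probs)]
              for i, pi in enumerate(probs)]
    return np.block(blocks)

def mca_eigenvalues(burt, p):
    """Eigenvalues of (1/p) D^{-1} B, sorted in decreasing order."""
    d = np.diag(burt)
    # D^{-1/2} B D^{-1/2} is symmetric and has the same eigenvalues as D^{-1} B
    sym = burt / np.sqrt(np.outer(d, d))
    return np.sort(np.linalg.eigvalsh(sym))[::-1] / p

# Hypothetical marginals for p = 3 variables with 3, 2 and 4 modalities
marginals = [[0.2, 0.3, 0.5], [0.6, 0.4], [0.1, 0.2, 0.3, 0.4]]
p = len(marginals)
q = sum(len(m) for m in marginals) - p
eigs = mca_eigenvalues(theoretical_burt(marginals), p)
```

With these inputs the spectrum consists of the trivial eigenvalue 1, then $q$ eigenvalues equal to $\frac{1}{p}$, then $p - 1$ zeros.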

Figure 1: Theoretical eigenvalue diagram in the independence case.

This result is not true when we have a finite sample, since sampling fluctuations make the observed $\varphi^2_{ij} \neq 0$. Although the trace of $\frac{1}{p}(D^{-1}B)$ and $\bar\mu$, the mean of the observed non-trivial eigenvalues, are constants, we observe $q$ different non-trivial eigenvalues $\mu_i \neq \frac{1}{p}$, and the eigenvalue diagram takes the shape shown in Figure 2.

Figure 2: Observed eigenvalue diagram in the independence case.

4.1.1. Dispersion of eigenvalues

Let $S^2_\mu$ be the measure of dispersion of the $\mu_i$ around $\frac{1}{p}$ given by:

$$S^2_\mu = \frac{1}{q}\sum_{i=1}^{q}\left(\mu_i - \frac{1}{p}\right)^2 = \frac{1}{q}\sum_{i=1}^{q} (\mu_i)^2 - \frac{1}{p^2},$$

which implies

$$\sum_{i=1}^{q} (\mu_i)^2 = q\left(S^2_\mu + \frac{1}{p^2}\right).$$

Using equations (1) and (3), we have:

$$\sum_{i=1}^{q} (\mu_i)^2 = \frac{q}{p^2} + \frac{1}{p^2}\sum_{i \neq j}\varphi^2_{ij} = \frac{q}{p^2} + \frac{1}{n\,p^2}\sum_{i \neq j}\chi^2_{ij}.$$
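Identities (2) and (3) hold for any observed data set, which can be checked on simulated data. The sketch below is illustrative only: the data are randomly generated, the helper names are hypothetical, and every modality is assumed to be observed at least once.

```python
import numpy as np

def indicator_matrix(data, levels):
    """Complete disjunctive table X for integer-coded data (n rows, p columns)."""
    return np.hstack([np.eye(m)[data[:, j]] for j, m in enumerate(levels)])

def nontrivial_mca_eigenvalues(X, p, q):
    """The q non-trivial eigenvalues of (1/p) D^{-1} B with B = X'X."""
    B = X.T @ X
    d = np.diag(B)
    sym = B / np.sqrt(np.outer(d, d))
    eig = np.sort(np.linalg.eigvalsh(sym))[::-1] / p
    return eig[1:q + 1]                      # drop the trivial eigenvalue 1

def phi2(a, b, ma, mb):
    """Pearson's phi-squared (chi-square divided by n) for two integer-coded variables."""
    n = len(a)
    table = np.zeros((ma, mb))
    np.add.at(table, (a, b), 1)
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    return ((table - expected) ** 2 / expected).sum() / n

rng = np.random.default_rng(0)
levels = [3, 2, 4]                           # hypothetical numbers of modalities
n, p = 500, len(levels)
q = sum(levels) - p
data = np.column_stack([rng.integers(0, m, size=n) for m in levels])
mu = nontrivial_mca_eigenvalues(indicator_matrix(data, levels), p, q)
phi_sum = sum(phi2(data[:, i], data[:, j], levels[i], levels[j])
              for i in range(p) for j in range(p) if i != j)
s2_mu = (mu ** 2).mean() - 1 / p ** 2        # dispersion around 1/p
```

Here `mu.mean()` can be compared with $\frac{1}{p}$, `(mu ** 2).sum()` with `(q + phi_sum) / p ** 2`, and `s2_mu` with `phi_sum / (q * p ** 2)`. The double sum runs over ordered pairs $i \neq j$, as in equation (3).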

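Under the hypothesis of pairwise independence of the variables, the $\chi^2_{ij}$ are realizations of $\chi^2_{(m_i-1)(m_j-1)}$ variables, so their expected values are $(m_i - 1)(m_j - 1)$. We can then easily compute $E\!\left(\sum_{i=1}^{q} (\mu_i)^2\right)$, and get:

$$E\!\left(\sum_{i=1}^{q} (\mu_i)^2\right) = \frac{q}{p^2} + \frac{1}{n\,p^2}\sum_{i \neq j} (m_i - 1)(m_j - 1).$$

Finally:

$$E(S^2_\mu) = \frac{1}{q}\, E\!\left(\sum_{i=1}^{q} (\mu_i)^2\right) - \frac{1}{p^2},$$

and we obtain:

$$E(S^2_\mu) = \frac{1}{n\,p^2\,q}\sum_{i \neq j} (m_i - 1)(m_j - 1).$$

Now, since $E(S^2_\mu) = \sigma^2$, we may assume that the interval $\frac{1}{p} \pm 2\sigma$ contains roughly 95% of the eigenvalues. Moreover, since the kurtosis of the set of eigenvalues is lower than for a normal distribution, this proportion is actually probably larger than 95%.

4.1.2. Estimation of the Burt table

Let $X$ be the disjunctive table associated to $p$ categorical variables $X_i$, with $m_i$ modalities respectively, observed on a sample of $n$ individuals, where $X_i = (X_{i1}, X_{i2}, \dots, X_{im_i})$. $X$ is a matrix made of $p$ blocks $X_i$:

$$X = (X_1 \mid X_2 \mid \cdots \mid X_i \mid \cdots \mid X_p).$$

Let $(X^j_{i1}, X^j_{i2}, \dots, X^j_{im_i})$ be the observed value of $X_i$ on the $j$-th individual. We can write

$$X = \begin{pmatrix}
X^1_{11} & \cdots & X^1_{1m_1} & X^1_{21} & \cdots & X^1_{2m_2} & \cdots & X^1_{p1} & \cdots & X^1_{pm_p} \\
X^2_{11} & \cdots & X^2_{1m_1} & X^2_{21} & \cdots & X^2_{2m_2} & \cdots & X^2_{p1} & \cdots & X^2_{pm_p} \\
\vdots & & \vdots & \vdots & & \vdots & & \vdots & & \vdots \\
X^n_{11} & \cdots & X^n_{1m_1} & X^n_{21} & \cdots & X^n_{2m_2} & \cdots & X^n_{p1} & \cdots & X^n_{pm_p}
\end{pmatrix}.$$

The Burt table of $X$ is then

$$B = \begin{pmatrix}
X_1'X_1 & X_1'X_2 & \cdots & X_1'X_p \\
X_2'X_1 & X_2'X_2 & \cdots & X_2'X_p \\
\vdots & \vdots & & \vdots \\
X_p'X_1 & X_p'X_2 & \cdots & X_p'X_p
\end{pmatrix}
= \begin{pmatrix}
B_{11} & B_{12} & \cdots & B_{1p} \\
B_{21} & B_{22} & \cdots & B_{2p} \\
\vdots & \vdots & & \vdots \\
B_{p1} & B_{p2} & \cdots & B_{pp}
\end{pmatrix}.$$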

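Writing $X^j_{ki}$ for the indicator of the $k$-th modality of the $i$-th variable on the $j$-th individual, the diagonal blocks are

$$B_i = B_{ii} = X_i'X_i = \begin{pmatrix}
\sum_{j=1}^{n} (X^j_{1i})^2 & \sum_{j=1}^{n} X^j_{1i} X^j_{2i} & \cdots & \sum_{j=1}^{n} X^j_{1i} X^j_{m_i i} \\
\sum_{j=1}^{n} X^j_{2i} X^j_{1i} & \sum_{j=1}^{n} (X^j_{2i})^2 & \cdots & \sum_{j=1}^{n} X^j_{2i} X^j_{m_i i} \\
\vdots & \vdots & & \vdots \\
\sum_{j=1}^{n} X^j_{m_i i} X^j_{1i} & \sum_{j=1}^{n} X^j_{m_i i} X^j_{2i} & \cdots & \sum_{j=1}^{n} (X^j_{m_i i})^2
\end{pmatrix},$$

where $X^j_{ki} \in \{0, 1\}$ with $\sum_{k=1}^{m_i} X^j_{ki} = 1$.

Since there is only one $k$ in $\{1, \dots, m_i\}$ such that $X^j_{ki} = 1$, all others being zero, we obtain:

$$(X^j_{ki})^2 = X^j_{ki} \quad \forall j \in \{1, \dots, n\},\ \forall k \in \{1, \dots, m_i\}, \qquad \text{and} \qquad X^j_{ki}\, X^j_{k'i} = 0 \quad \forall k \neq k' \in \{1, \dots, m_i\}.$$

We can therefore conclude that, for $i = 1, \dots, p$, the diagonal sub-matrices of the Burt table are themselves diagonal matrices:

$$X_i'X_i = B_i = \mathrm{diag}\!\left(\sum_{j=1}^{n} (X^j_{1i})^2,\ \dots,\ \sum_{j=1}^{n} (X^j_{ki})^2,\ \dots,\ \sum_{j=1}^{n} (X^j_{m_i i})^2\right).$$

Furthermore, we know that

$$\sum_{k=1}^{m_i}\sum_{j=1}^{n} X^j_{ki} = \sum_{k=1}^{m_i} n^i_k = n,$$

where $n^i_k = \sum_{j=1}^{n} X^j_{ki}$ is the number of individuals that have the $k$-th modality of the $i$-th variable (for $1 \le i \le p$ and $1 \le k \le m_i$).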

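So the diagonal sub-matrices of the Burt table are:

$$B_i = B_{ii} = \mathrm{diag}\!\left(n^i_1, \dots, n^i_k, \dots, n^i_{m_i}\right), \qquad \text{where } \sum_{k=1}^{m_i} n^i_k = n,\quad i = 1, \dots, p,$$

with $n^i_k$ the number of individuals that have the $k$-th modality of the $i$-th variable.

Consider now two independent variables $X_\alpha$ and $X_\beta$ amongst the $p$ variables, having respectively $m_\alpha$ and $m_\beta$ modalities. Let $B_\alpha$ be the $(m_\alpha, m_\alpha)$ square matrix $B_\alpha = X_\alpha'X_\alpha$, and $B_{\alpha\beta}$ the $(m_\alpha, m_\beta)$ rectangular matrix $B_{\alpha\beta} = X_\alpha'X_\beta$. We have

$$(B_\alpha)_{ii} = \sum_{k=1}^{n} X^k_{i\alpha} = n^\alpha_i \qquad \text{and} \qquad (B_\alpha)_{ij} = 0 \ \text{ if } i \neq j,$$

and

$$(B_{\alpha\beta})_{ij} = \sum_{k=1}^{n} X^k_{i\alpha}\, X^k_{j\beta}.$$

Under the hypothesis that $X_\alpha$ and $X_\beta$ are independent, the expected value of this entry is

$$(B_{\alpha\beta})_{ij} = \frac{(B_\alpha)_{ii}\,(B_\beta)_{jj}}{n} = \frac{n^\alpha_i\, n^\beta_j}{n},$$

and, more generally, we can conclude that

$$X_i'X_j = B_{ij} = \frac{1}{n}\begin{pmatrix}
n^i_1 n^j_1 & n^i_1 n^j_2 & \cdots & n^i_1 n^j_{m_j} \\
n^i_2 n^j_1 & n^i_2 n^j_2 & \cdots & n^i_2 n^j_{m_j} \\
\vdots & \vdots & & \vdots \\
n^i_{m_i} n^j_1 & n^i_{m_i} n^j_2 & \cdots & n^i_{m_i} n^j_{m_j}
\end{pmatrix}$$

if the $p$ variables are mutually independent.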
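The block structure of the Burt table can be seen on a toy data set. The sketch below is purely illustrative: the six observations are made up, and the variable names are hypothetical.

```python
import numpy as np

# Hypothetical data: n = 6 individuals, p = 2 variables with 3 and 2 modalities (coded from 0)
data = np.array([[0, 1], [0, 0], [1, 1], [2, 0], [1, 1], [0, 1]])
levels = [3, 2]
n = len(data)

# Complete disjunctive table and Burt table
X = np.hstack([np.eye(m, dtype=int)[data[:, j]] for j, m in enumerate(levels)])
B = X.T @ X

# Blocks of the Burt table
B11, B12, B22 = B[:3, :3], B[:3, 3:], B[3:, 3:]

# Marginal counts and the off-diagonal block expected under independence
counts1, counts2 = X[:, :3].sum(axis=0), X[:, 3:].sum(axis=0)
expected_B12 = np.outer(counts1, counts2) / n
```

The diagonal blocks `B11` and `B22` are diagonal matrices holding the marginal counts, `B12` is the contingency table crossing the two variables, and `expected_B12` is the block that would be obtained under independence.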

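Now consider a sample of $n$ observations of $p$ multinomial random variables $X_i$. Let $p^i_k$ be the probability that an individual be in the $k$-th category of the $i$-th variable, and $p^i_{kj}$ the probability that the $j$-th individual be in the $k$-th category of the $i$-th variable. The observed Burt table is:

$$B = X'X = \begin{pmatrix}
X_1'X_1 & X_1'X_2 & \cdots & X_1'X_p \\
X_2'X_1 & X_2'X_2 & \cdots & X_2'X_p \\
\vdots & \vdots & & \vdots \\
X_p'X_1 & X_p'X_2 & \cdots & X_p'X_p
\end{pmatrix},$$

with

$$X_i'X_i = N_i = \mathrm{diag}\!\left(\sum_{j=1}^{n} (X^j_{1i})^2,\ \dots,\ \sum_{j=1}^{n} (X^j_{ki})^2,\ \dots,\ \sum_{j=1}^{n} (X^j_{m_i i})^2\right) = \mathrm{diag}\{n^i_1, \dots, n^i_{m_i}\}.$$

For the theoretical table, $n^i_k = \sum_{j=1}^{n} (X^j_{ki})^2 = n\,p^i_k$ and $\sum_{k=1}^{m_i} p^i_k = 1$, so that

$$\sum_{k=1}^{m_i} n^i_k = \sum_{k=1}^{m_i} n\,p^i_k = n, \quad i = 1, \dots, p,$$

and

$$X_i'X_i = n\,\mathrm{diag}\!\left(p^i_1, \dots, p^i_k, \dots, p^i_{m_i}\right).$$

Since $X_i$ and $X_j$ are independent variables, $X_i'X_j = N_{ij}$ with $(N_{ij})_{kk'} = (X_i'X_j)_{kk'} = n\,p^i_k\,p^j_{k'}$, which implies

$$X_i'X_j = N_{ij} = n\begin{pmatrix}
p^i_1 p^j_1 & p^i_1 p^j_2 & \cdots & p^i_1 p^j_{m_j} \\
p^i_2 p^j_1 & p^i_2 p^j_2 & \cdots & p^i_2 p^j_{m_j} \\
\vdots & \vdots & & \vdots \\
p^i_{m_i} p^j_1 & p^i_{m_i} p^j_2 & \cdots & p^i_{m_i} p^j_{m_j}
\end{pmatrix}.$$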

The maximum-likelihood estimator of $p^i_k$ is $\hat p^i_k = \frac{n^i_k}{n}$, where $n^i_k$ is the number of individuals in the $k$-th category of the $i$-th variable, so

$$\hat N_i = \mathrm{diag}\!\left(n^i_1, \dots, n^i_k, \dots, n^i_{m_i}\right) = B_{ii}$$

and

$$\hat N_{ij} = \frac{1}{n}\begin{pmatrix}
n^i_1 n^j_1 & n^i_1 n^j_2 & \cdots & n^i_1 n^j_{m_j} \\
n^i_2 n^j_1 & n^i_2 n^j_2 & \cdots & n^i_2 n^j_{m_j} \\
\vdots & \vdots & & \vdots \\
n^i_{m_i} n^j_1 & n^i_{m_i} n^j_2 & \cdots & n^i_{m_i} n^j_{m_j}
\end{pmatrix} = B_{ij}.$$

We can conclude that the maximum-likelihood estimator $\hat B$ of the theoretical Burt table is $B$, the observed one.

Using the functional invariance property, we can affirm that the maximum-likelihood estimators of the eigenvalues of the theoretical $D^{-1}B$ are the eigenvalues of the observed $D^{-1}B$, so that each $\mu_i$ is the maximum-likelihood estimator of $\lambda_i = \lambda$.

Maximum-likelihood estimators are asymptotically normal, and so, asymptotically, each $\mu_i$ is normally distributed. But because the eigenvalues are ordered, they are not independently and identically distributed. In particular

$$E(\mu_1) > \frac{1}{p}, \qquad E(\mu_q) < \frac{1}{p}, \qquad \text{but} \quad E(\mu_1) \to \frac{1}{p} \quad \text{and} \quad E(\mu_q) \to \frac{1}{p}.$$

Furthermore, the eigenvalue variances are not the same. From simulations of large samples of observations ($n = 100, \dots$), we notice that the convergence of the eigenvalue distribution to a normal one is slow, especially for the extremes ($\mu_1$ and $\mu_q$), even for very large samples [4].

4.2. Distribution of eigenvalues in MCA under non-independence hypotheses

4.2.1. Distribution of the theoretical eigenvalues

Let $\mu$ be an eigenvalue of $\frac{1}{p} D^{-1}X'X$. Since $\mu$ can also be obtained by diagonalization of $\frac{1}{p} X D^{-1} X'$, $\mu$ is a solution of

$$\frac{1}{p}\, X D^{-1} X'\, z = \mu\, z,$$

where $z$ is an eigenvector associated to $\mu$.

So

$$\frac{1}{p}\sum_{i=1}^{p} X_i (X_i'X_i)^{-1} X_i'\, z = \mu\, z, \qquad \text{that is} \qquad \frac{1}{p}\sum_{i=1}^{p} P_i\, z = \mu\, z,$$

where $P_i = X_i (X_i'X_i)^{-1} X_i'$ is the orthogonal projector on the space spanned by linear combinations of the indicators of the categories of variable $X_i$.

Let $A_i$ be the centered projector associated to $P_i$:

$$A_i = P_i - \frac{1}{n}\,\mathbf{1}\mathbf{1}',$$

where $\mathbf{1}$ is the $n$-vector of ones. And so we get

$$\frac{1}{p}\sum_{i=1}^{p} A_i\, z = \mu\, z. \qquad (4)$$

4.2.1.1. The case of two-way interactions

Let us assume that among the $p$ studied variables there is a two-way interaction between $X_j$ and $X_k$, and that the $(p - 2)$ remaining variables are mutually independent. Multiplying equation (4) by $A_j$ we get:

$$\frac{1}{p}\Big(\underbrace{A_j A_1}_{0} + \underbrace{A_j A_2}_{0} + \cdots + \underbrace{A_j A_j}_{A_j} + \cdots + A_j A_k + \cdots + \underbrace{A_j A_p}_{0}\Big)\, z = \mu\, A_j z,$$

since all variables are pairwise independent except $X_j$ and $X_k$, and the $A_i$ are orthogonal projectors. Thus:

$$A_j A_k\, z = (p\mu - 1)\, A_j z. \qquad (5)$$

Similarly, multiplying (4) by $A_k$, we get:

$$A_k A_j\, z = (p\mu - 1)\, A_k z. \qquad (6)$$

Now let us multiply (5) by $A_k$ to get:

$$A_k A_j A_k\, z = (p\mu - 1)\, A_k A_j\, z.$$

Using (6) we obtain

$$A_k A_j \underbrace{A_k z}_{z_1} = (p\mu - 1)^2 \underbrace{A_k z}_{z_1}.$$

With the notation $\lambda = (p\mu - 1)^2$, we finally write:

$$A_k A_j\, z_1 = \lambda\, z_1. \qquad (7)$$

Equation (7) implies that $\lambda$ is an eigenvalue of the product of the centered projectors $A_k A_j$, associated to the eigenvector $z_1$.

In general: $\forall j, k = 1, \dots, p$, if there is an interaction between $X_j$ and $X_k$, the product $A_j A_k$ admits a non-zero eigenvalue $\lambda = (p\mu - 1)^2$. If $\lambda \neq 0$ then $\mu \neq \frac{1}{p}$ and, the trace of the Burt table being constant, there is at least another eigenvalue not equal to $\frac{1}{p}$. Let $n_0$ be the number of eigenvalues not equal to $\frac{1}{p}$, so that

$$\sum_{i=1}^{n_0} \mu_i = \frac{n_0}{p}.$$

Theoretically (except for the particular case $\lambda = 1$, for which we have $\mu = \frac{2}{p}$ and $\mu' = 0$), the number of non-trivial eigenvalues greater than $\frac{1}{p}$ is equal to the number of non-trivial eigenvalues smaller than $\frac{1}{p}$. The eigenvalue diagram shape is shown in Figure 3.

Figure 3: Theoretical eigenvalue diagram in the two-way interaction case.

The number $n_0$ depends on the number of categories of $X_j$ and $X_k$, on the number of variables and on the number of dependent variables. Let $n_1$ be the multiplicity of $\frac{1}{p}$. We will show that $n_1 = q - 2\min((m_j - 1);\ (m_k - 1))$ when $p > 2$ and when there is only one two-way interaction between the variables.

This result can be shown as follows. Let us consider equation (4), and suppose, without loss of generality, that $X_1$ and $X_2$ are dependent. Upon multiplication by $A_3$, $\frac{1}{p}\sum_{i=1}^{p} A_i z = \mu z$ becomes

$$\frac{1}{p}\,(A_3 A_1 + A_3 A_2 + A_3 A_3 + \cdots + A_3 A_p)\, z = \mu\, A_3 z.$$

Since $A_3 A_i = 0$ for $i \neq 3$ and $A_3 A_3 = A_3$, this reduces to $\frac{1}{p} A_3 z = \mu A_3 z$, and we get $\mu = \frac{1}{p}$ whenever $A_3 z \neq 0$.

Now multiply equation (4) by $A_1$ and $A_2$ in turn to get:

$$\begin{cases}
(A_1 A_1 + A_1 A_2 + A_1 A_3 + \cdots + A_1 A_p)\, z = p\mu\, A_1 z \\
(A_2 A_1 + A_2 A_2 + A_2 A_3 + \cdots + A_2 A_p)\, z = p\mu\, A_2 z
\end{cases}
\quad\Longrightarrow\quad
\begin{cases}
(A_1 + A_1 A_2)\, z = p\mu\, A_1 z \\
(A_2 A_1 + A_2)\, z = p\mu\, A_2 z
\end{cases}
\quad\Longrightarrow\quad
\begin{cases}
A_1 A_2\, a = \lambda\, a \\
A_2 A_1\, b = \lambda\, b
\end{cases}$$

where $\lambda = (p\mu - 1)^2$, $a = A_1 z$ and $b = A_2 z$.

We recognize here the CA equations, so that the CA of the Burt table, when only two variables are dependent, is equivalent to the CA of the contingency table crossing the two dependent variables.

It is well known that the number of non-trivial eigenvalues in the CA of this contingency table equals $\min((m_j - 1);\ (m_k - 1))$, and to each non-trivial $\lambda_i$ there correspond the values $\mu_i$ and $\mu'_i$ such that:

$$\mu_i = \frac{1 + \sqrt{\lambda_i}}{p} \qquad \text{and} \qquad \mu'_i = \frac{1 - \sqrt{\lambda_i}}{p}.$$

Finally, the CA of the Burt table may have $2\min((m_j - 1);\ (m_k - 1))$ non-trivial eigenvalues not equal to $\frac{1}{p}$, associated to the CA of the contingency table. So the number of supplementary eigenvalues equals $q - 2\min((m_j - 1);\ (m_k - 1))$: these are all equal to $\frac{1}{p}$, which is thus an $n_1$-multiple eigenvalue, where $n_1$ is at least $q - 2\min((m_j - 1);\ (m_k - 1))$.

4.2.1.2. The case of higher order interactions

Since the Burt table is constructed with pairwise cross products of variables, its observation cannot give us information about multiway interactions. However, the observation of the bi-dimensional theoretical Burt sub-tables, for all pairwise variable combinations, can provide us with all the two-way interactions.

The theoretical Burt table can reveal the existence of higher order interactions in the following case: if $B_{ij} \neq B_{ii}\,\mathbf{1}_{m_i m_j}\,B_{jj}$ and $B_{ik} \neq B_{ii}\,\mathbf{1}_{m_i m_k}\,B_{kk}$, there may be a triple interaction between $X_i$, $X_j$ and $X_k$.

In general, a Burt table gives neither the order of the interactions nor supplementary information on the eigenvalue behavior.
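The two-way interaction result of Section 4.2.1.1 can be illustrated on a hypothetical population with $p = 3$ variables, where only $X_1$ and $X_2$ are dependent. The joint table `P12`, the marginal `p3` and the helper name are assumptions made for this sketch. The Burt table is written on the probability scale, which does not change the eigenvalues of $\frac{1}{p}D^{-1}B$.

```python
import numpy as np

def sorted_mca_eigenvalues(burt, p):
    """Eigenvalues of (1/p) D^{-1} B in decreasing order (via the symmetric form)."""
    d = np.diag(burt)
    return np.sort(np.linalg.eigvalsh(burt / np.sqrt(np.outer(d, d))))[::-1] / p

# Hypothetical population: X1 (3 modalities) and X2 (2 modalities) are dependent
P12 = np.array([[0.30, 0.10],
                [0.05, 0.25],
                [0.05, 0.25]])
p1, p2 = P12.sum(axis=1), P12.sum(axis=0)
p3 = np.array([0.2, 0.3, 0.5])           # X3 is independent of X1 and X2

burt = np.block([
    [np.diag(p1),      P12,              np.outer(p1, p3)],
    [P12.T,            np.diag(p2),      np.outer(p2, p3)],
    [np.outer(p3, p1), np.outer(p3, p2), np.diag(p3)],
])
p, q = 3, (3 + 2 + 3) - 3
mu = sorted_mca_eigenvalues(burt, p)[1:q + 1]     # non-trivial eigenvalues

# CA eigenvalues of the table crossing X1 and X2
S = (P12 - np.outer(p1, p2)) / np.sqrt(np.outer(p1, p2))
r = min(P12.shape) - 1
lam = np.linalg.svd(S, compute_uv=False)[:r] ** 2

predicted = np.sort(np.concatenate([(1 + np.sqrt(lam)) / p,
                                    (1 - np.sqrt(lam)) / p,
                                    np.full(q - 2 * r, 1 / p)]))[::-1]
```

In this example `mu` agrees with `predicted`: one eigenvalue above $\frac{1}{p}$, one below it, and $q - 2\min((m_1 - 1);\ (m_2 - 1)) = 3$ eigenvalues equal to $\frac{1}{p}$.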

4.2.2. Distribution of observed eigenvalues

Exceptionally, with a small number of interactions, we observe the particular shape of the eigenvalue diagram exhibited in Figure 4, where we can distinguish the eigenvalues near $\frac{1}{p}$ (theoretically equal to $\frac{1}{p}$), and so we are able to recognize the existence of independent variables in the analysis.

Figure 4: Observed eigenvalue diagram in a two-way interaction case.

When the number of interactions grows, we cannot distinguish the eigenvalues theoretically equal to $\frac{1}{p}$ from the eigenvalues not equal to $\frac{1}{p}$.

To detect the existence of interactions, we can first check whether the observed variables are mutually independent. In that case, the eigenvalue diagram should have a particular shape (see 4.1), with more than 95% of the eigenvalues within the confidence interval $\frac{1}{p} \pm 2\sigma$ (see 4.1.1). If one or more eigenvalues fall outside the confidence interval, we can then assume the existence of one or more two-way interactions between variables.

5. AN EMPIRICAL PROCEDURE FOR FITTING LOG-LINEAR MODELS BASED ON THE MCA EIGENVALUE DIAGRAM

We propose an empirical procedure for progressively fitting a log-linear model where the fitting test at each step is based on the MCA eigenvalue diagram.

Let $X_i$, $X_j$ and $X_k$ be three categorical variables, with respectively $m_i$, $m_j$ and $m_k$ modalities, and let $X_{ij}$ be a cross variable with $(m_i\, m_j)$ modalities. We suppose that $X_{ij}$ and $X_k$ have the same behavior if $m_k = m_i\, m_j$.

Under the hypothesis that two dependent variables $X_i$ and $X_j$ have the same behaviour as a variable $X_k$ with the same characteristics as the cross variable $X_{ij}$, we propose here an empirical procedure for progressively fitting, in $p$ steps, the log-linear model, where the fitting criterion at each step is based on the MCA eigenvalue diagram.

5.1. Description of the procedure steps

The first step of the procedure consists of testing the pairwise independence hypothesis of the variables. To detect the existence of interactions, we must first check whether all variables are mutually independent. To that end, we calculate the eigenvalues of the MCA on all the $p$ variables, and construct the related confidence interval: the eigenvalue diagram should have a particular shape (cf. 4.1).

If all the eigenvalues belong to the confidence interval $\frac{1}{p} \pm 2\sigma$ (cf. 4.1.1), we can conclude that the $p$ variables are mutually independent. The log-linear model associated to the variables is a simple additive one:

$$\log[f_p(X)] = u_0 + \sum_{i=1}^{p} u_i(x),$$

and the procedure is stopped.

If one or more eigenvalues are not in the confidence interval, we conclude that there is at least one double interaction between variables, and we go to the second step of the procedure.

In the second step, we look at all two-way interaction u-terms. We isolate one variable amongst the $p$ variables, which we denote $X_p$ without loss of generality, so that we obtain a set of $(p - 1)$ variables $X_i$, and we apply the first step to test the pairwise independence of these $(p - 1)$ variables.

If the $(p - 1)$ variables are independent, we can conclude that the double interactions are between $X_p$ and at least one of the $X_i$, so we construct the corresponding cross variables $X_{ip}$ by using the first step to test independence between the variables $(X_i, X_p)$, where $i = 1, \dots, p - 1$. The log-linear model associated to the variables is:

$$\log[f_p(X)] = u_0 + \sum_{i=1}^{p} u_i(x) + \sum_{i=1}^{p-1} u_{ip}(x)\,\delta_{ip},$$

and the procedure is stopped (with $\delta_{ip} = 1$ if the interaction between $X_p$ and $X_i$ exists, otherwise it is set to zero).

If the $(p - 1)$ variables are not independent, we can conclude that there is a double interaction between $X_i$ and $X_j$, where $i, j = 1, \dots, p - 1$, and perhaps between $X_i$ and $X_p$.

We can construct the corresponding cross variables $X_{ip}$ and $X_{ij}$ by using the first step to test independence of the variables $(X_i, X_p)$ and of the variables $(X_i, X_j)$, where $i, j = 1, \dots, p - 1$. The log-linear model associated to the variables is:

$$\log[f_p(X)] = u_0 + \sum_{i=1}^{p} u_i(x) + \sum_{i=1}^{p-1} u_{ip}(x)\,\delta_{ip} + \text{terms due to the interaction between three or more variables},$$

and we go to the third step of the procedure.

In the third step, we look at three-way interaction u-terms, by testing the dependence of the variables $X_i$ and the cross variables $X_{jk}$, where $i, j, k = 1, \dots, p$ and $i, j, k$ are different, and we construct the cross variables $X_{ijk}$. The independence test is based on the eigenvalue pattern of the related MCA as described in the first step.

Continuing this way, in the $k$-th step we look at $k$-way interaction u-terms, and in the last step we look at the $p$-way interaction u-term. This algorithm is summarized in Figure 5.

5.2. An example for a graphical model

For this example we use a data set given by Haberman [24] that was used in Falguerolles et al. [14] to fit a graphical model. The data report attitudes toward non-therapeutic abortions among white subjects, crossed with three categorical variables describing the subjects. The data set is a contingency table observed for 3181 individuals, crossing four three-modality variables $X_1$, $X_2$, $X_3$ and $X_4$, defined in Table 1.

The first step of the procedure consists of testing the pairwise independence hypothesis of the variables. We first transform the contingency table into a complete disjunctive table, then calculate the parameters (defined in 2.1 and 4.1.1) needed for the test (Table 2).

MCA on the four variables gives the eigenvalue diagram of Figure 6. The shape of the eigenvalue diagram clearly points to the existence of dependent variables. Eigenvalues $\lambda_1$, $\lambda_7$ and $\lambda_8$ are not in the interval $I_c$, so there are at least two dependent variables: there are one or more two-way interactions between variables.

Figure 5: Block diagram for the empirical procedure

Table 1: Attitudes toward non-therapeutic abortions among white subjects
Year: X_1 (rows starting at 1972) | Religion: X_2 (northern Protestant, southern Protestant, Catholic) | Education in years: X_3 | Attitude: X_4 (positive, mixed, negative)

Table 2: Parameters needed for the test (first step of the example for a graphical model): p, m_1, m_2, m_3, m_4, q, m, σ, I_c = [0.2283, 0.2717]

λ_1  **************************
λ_2  *********************
λ_3  ********************
λ_4  *******************
λ_5  ******************
λ_6  *****************
λ_7  ****************
λ_8  ***********

Figure 6: Eigenvalue diagram (first step of the example for a graphical model)
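The interval I_c can be recomputed from n, p and the numbers of modalities m_i. The sketch below assumes that I_c is 1/p ± 2σ with σ² = Σ_{i≠j} (m_i − 1)(m_j − 1) / (n p² q) and q = Σ m_i − p. The definitions of Section 4.1.1 are not reproduced in this part of the text, so this form is an assumption; it does reproduce the interval of Table 2 for n = 3181 and four three-modality variables. The function name is illustrative.

```python
from math import sqrt


def independence_interval(n, modalities):
    """Interval 1/p +/- 2*sigma for MCA eigenvalues under pairwise independence.

    Assumed form: sigma^2 = sum_{i != j} (m_i - 1)(m_j - 1) / (n * p^2 * q),
    with q = sum(m_i) - p the number of non-trivial eigenvalues.
    """
    p = len(modalities)
    q = sum(modalities) - p
    cross = sum((mi - 1) * (mj - 1)
                for i, mi in enumerate(modalities)
                for j, mj in enumerate(modalities)
                if i != j)
    sigma = sqrt(cross / (n * p * p * q))
    return 1 / p - 2 * sigma, 1 / p + 2 * sigma


# First step of the graphical-model example: n = 3181, four variables with 3 modalities.
low, high = independence_interval(3181, [3, 3, 3, 3])
print(round(low, 4), round(high, 4))
```

With the same assumed formula, two three-modality variables give an interval close to the [0.4750, 0.5250] used for the pairwise crossings.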

The second step consists of the detection of two-way interactions. First, we apply our first step to only the three variables X_1, X_2 and X_3. With the values of n and m_i (for i = 1, ..., 3) unchanged, the other parameters become (Table 3):

Table 3: Parameters for the test (second step of the example for a graphical model): q, m, σ, I_c = [0.3097, 0.3569]

We get the following eigenvalue diagram (Figure 7):

λ_1  **************************
λ_2  *************************
λ_3  ************************
λ_4  **********************
λ_5  *********************

Figure 7: Eigenvalue diagram (second step of the example for a graphical model)

λ_1 and λ_5 are not in the interval I_c, so there are one or more two-way interactions among X_1, X_2 and X_3, as well as interactions between X_4 and the other variables.

Second, we look at the interactions between X_4 and X_i (i = 1, 2, 3). For i = 1 to i = 3 we look at the eigenvalues of the MCA of X_4 with X_i, and calculate their variances and intervals I_c.

Crossing X_1 with X_4 we get (Table 4):

Table 4: MCA on X_1 and X_4 (parameters and eigenvalues): q, m, σ, I_c = [0.4750, 0.5250], λ_1, λ_2, λ_3, λ_4

Crossing X_2 with X_4 we get (Table 5):

Table 5: MCA on X_2 and X_4 (parameters and eigenvalues): q, m, σ, I_c = [0.4750, 0.5250], λ_1, λ_2, λ_3, λ_4

Crossing X_3 with X_4 we get (Table 6):

Table 6: MCA on X_3 and X_4 (parameters and eigenvalues): q, m, σ, I_c = [0.4750, 0.5250], λ_1, λ_2, λ_3, λ_4

In the three cases, λ_1 and λ_4 are not in the intervals I_c, so there is a two-way interaction between X_1 and X_4, between X_2 and X_4 and between X_3 and X_4, and we can construct the cross variables X_4i with 9 modalities (i = 1, 2, 3).

Now we apply the first step to only the two variables X_1 and X_2, and then look for interactions between X_3 and X_i (i = 1, 2). Crossing X_1 with X_2 we get (Table 7):

Table 7: MCA on X_1 and X_2 (parameters and eigenvalues): q, m, σ, I_c = [0.4750, 0.5250], λ_1, λ_2, λ_3, λ_4

All the eigenvalues are in the confidence interval, so X_1 and X_2 are independent conditionally on the others, and there is no cross variable X_12. The corresponding u-term u_12 equals zero.

Let us now look, for i = 1 and i = 2, at the eigenvalues of the MCA of X_3 with X_i, with their variances and intervals I_c. Crossing X_1 with X_3 we get (Table 8):

Table 8: MCA on X_1 and X_3 (parameters and eigenvalues): q, m, σ, I_c = [0.4750, 0.5250], λ_1, λ_2, λ_3, λ_4

All the eigenvalues are in the confidence interval I_c, so X_1 and X_3 are independent conditionally on the others, and there is no cross variable X_13: the corresponding u-term u_13 equals zero.

Crossing now X_2 with X_3 we get (Table 9):

Table 9: MCA on X_2 and X_3 (parameters and eigenvalues): q, m, σ, I_c = [0.4750, 0.5250], λ_1, λ_2, λ_3, λ_4

Here, λ_1 and λ_4 are not in the interval I_c, so there is a two-way interaction between X_2 and X_3, u_23 is not set to zero, and we can add the cross variable X_32 (or equivalently X_23) with 9 modalities to the model.
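The construction of a cross variable from two categorical variables can be sketched as follows. The 0-based coding and the function name `cross_variable` are assumptions made for illustration, and the observations are made up; the paper does not prescribe a particular coding.

```python
def cross_variable(a, b, m_b):
    """Code each pair (a_k, b_k) of 0-based categories as one category of the cross variable.

    With m_a categories for a and m_b for b, the result takes values in
    0 .. m_a * m_b - 1, i.e. 9 modalities when both variables have 3.
    """
    return [x * m_b + y for x, y in zip(a, b)]


# Hypothetical observations of X_4 and X_2 on four individuals.
x4 = [0, 2, 1, 2]
x2 = [1, 0, 2, 2]
x42 = cross_variable(x4, x2, 3)
```

The cross variable can then be entered in an MCA like any simple variable, which is how the third step tests for three-way interactions.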

The third step consists of the detection of three-way interactions between variables, that is, of two-way interactions between the variables X_i and the cross variables X_jk. We first put the cross variables (X_41, X_42, X_43, X_32) together with the initial variables that were deemed not dependent in the second step of the procedure, i.e. X_1 and X_2, and then apply the first step of the procedure to the resulting set of variables. We get the following results (Table 10 and Figure 8):

Table 10: MCA on X_1, X_2, X_41, X_42, X_43 and X_32 (parameters, third step of the example for a graphical model): q, m, σ, I_c = [0.1331, 0.2003]

λ_1   **************************
λ_2   *************************
λ_3   ******************
λ_4   ******************
λ_5   ******************
λ_6   *****************
λ_7   ************
λ_8   ***********
λ_9   ***********
λ_10  ***********
λ_11  ***********
λ_12  ***********
λ_13  ***********
λ_14  ***********
λ_15  **********
λ_16  *********

Figure 8: MCA on X_1, X_2, X_41, X_42, X_43 and X_32 (eigenvalue diagram, third step of the example for a graphical model)

The first six eigenvalues are not in I_c: there are one or more two-way interactions between the initial variables X_i and the crossed ones X_ik, so there exists a three-way interaction between simple variables.

We drop X_32 and apply the first step to the five other variables to get the following results (Table 11 and Figure 9):

Table 11: MCA on X_1, X_2, X_41, X_42 and X_43 (parameters for the test): q, m, σ, I_c = [0.1671, 0.2324]

λ_1   **************************
λ_2   **************************
λ_3   ****************
λ_4   ****************
λ_5   ****************
λ_6   ***************
λ_7   **********
λ_8   **********
λ_9   **********
λ_10  *********
λ_11  *********
λ_12  *********
λ_13  *********
λ_14  *********
λ_15  *********
λ_16  *********
λ_17  ********
λ_18  ********
λ_19  ********
λ_20  ********

Figure 9: MCA on X_1, X_2, X_41, X_42 and X_43 (eigenvalue diagram, third step of the example for a graphical model)

The first six eigenvalues are not in I_c, so there is at least one two-way interaction between the variables. We know that the simple variables X_1, X_2 and the crossed variables X_41, X_42, X_43 are dependent, so we have to test dependence between X_1 and X_32 only. Crossing X_1 and X_32 we get the following results (Table 12):

Table 12: MCA on X_1 and X_32 (parameters and eigenvalues): q, m, σ, I_c = [0.4682, 0.5318], λ_1, ..., λ_10

All the eigenvalues are in the confidence interval I_c, so X_1 and X_32 are independent conditionally on the others, and there is no cross variable X_132. The corresponding u-term u_123 equals zero.

Now we can drop the cross variable X_43. The remaining variables X_1, X_2, X_41, X_42 are dependent by construction. We only have to test for dependence between X_1 and X_43. Crossing X_1 with X_43, we get the same parameters as for the crossing of X_1 and X_32, and the following eigenvalues (Table 13):

Table 13: MCA on X_1 and X_43 (eigenvalues): λ_1, ..., λ_10

We remark that λ_1 and λ_10 are not in the interval I_c, so X_1 and X_43 seem to be dependent. But we have to fit a graphical model, which is a particular case of hierarchical models (as defined in 2.2.2.2, a log-linear model is hierarchical if, whenever one particular u-term is constrained to zero, all higher u-terms containing the same set of subscripts are also set to zero). Here the u-term u_13 is set to zero, so the u-term u_134 is also set to zero.

Crossing X_2 with X_43, we get the same parameters as for the crossing of X_1 and X_32, and the following eigenvalues (Table 14):

Table 14: MCA on X_2 and X_43 (eigenvalues): λ_1, ..., λ_10

Eigenvalues λ_1, λ_2, λ_9 and λ_10 are not in the interval I_c, the u-terms u_23 and u_24 are not set to zero, and since X_2 and X_43 are not independent the u-term u_234 is not set to zero.

Crossing X_1 with X_42 (or equivalently X_2 with X_41) we get the same parameters as for the crossing of X_1 and X_32, and the following eigenvalues:

Table 15: MCA on X_1 and X_42 (eigenvalues): λ_1, ..., λ_10
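The hierarchy constraint invoked above (a higher-order u-term is set to zero as soon as one of the lower-order u-terms on a subset of its subscripts is zero) can be sketched as a simple filter. The function name `admissible_terms` is illustrative; the zero pairs are the two-way terms u_12 and u_13 found equal to zero in the second step.

```python
from itertools import combinations


def admissible_terms(variables, zero_pairs, order):
    """List the u-terms of a given order that hierarchy does not force to zero.

    A term is kept only if none of the pairs of subscripts it contains
    corresponds to a two-way u-term already set to zero.
    """
    zero = {frozenset(pair) for pair in zero_pairs}
    return [term
            for term in combinations(variables, order)
            if not any(frozenset(pair) in zero
                       for pair in combinations(term, 2))]


# Second-step results of the example: u_12 = 0 and u_13 = 0.
candidates = admissible_terms([1, 2, 3, 4], [(1, 2), (1, 3)], 3)
print(candidates)  # [(2, 3, 4)]
```

Only u_234 remains a candidate three-way term, so the terms u_123, u_124 and u_134 need not be tested on their eigenvalues: hierarchy already sets them to zero.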

For X_1 and X_42, eigenvalues λ_1 and λ_10 are not in the interval I_c, so X_1 and X_42 are dependent; but the u-term u_12 is equal to zero, so the u-term u_124 is set to zero. Finally, variables X_1 and X_41 are dependent by construction. The procedure stops here because we cannot have more than three-way interactions in a hierarchical model when not all the two-way interactions are present. We obtain the following model (see Figure 10 for the associated graph):

$$\log[f_4(X)] = u_0 + u_1 x_1 + u_2 x_2 + u_3 x_3 + u_4 x_4 + u_{32} x_2 x_3 + u_{41} x_4 x_1 + u_{42} x_4 x_2 + u_{43} x_4 x_3 + u_{432} x_4 x_3 x_2$$

Figure 10: Lattice diagram (example for a graphical model)

5.3 An example for a saturated model

Here we use a data set given by Israëls [29] that was also used by Van der Heijden et al. [28] about shop-lifting habits. Table 16 is a contingency table of individuals crossing three variables: sex (2 modalities), age (9 modalities) and type of goods (13 modalities: Clothing (C), Clothing accessories (Ca), Provision-Tobacco (PT), Writing materials (Wm), Books (B), Records (R), Household goods (Hg), Sweets (S), Toys (T), Jewellery (J), Perfume (P), Hobbies tools (Ht), and Others (O)).

In the first step, we test the pairwise independence of variables X_1, X_2 and X_3. We first transform the contingency table into a complete disjunctive table, then compute the parameters (defined in 2.2 & 4.1.1) needed for the test (Table 17). An MCA on the three variables gives the eigenvalue diagram of Figure 11.

The eigenvalue diagram shows clearly that the variables are not independent: only 8 eigenvalues (λ_7, ..., λ_15) are in the confidence interval. Using the second step of the procedure, we get the two-way interactions.

Table 16: Multicontingency table for the shop-lifting data
Sex: X_1 (Male, Female) | Age: X_2 | Goods: X_3 (C, Ca, PT, Wm, B, R, Hg, S, T, J, P, Ht, O)

Table 17: Parameters needed for the test (first step of the example for a saturated model): p, m_1, m_2, m_3, q, m, σ, I_c = [0.3211, 0.3455]

λ_1   ***************************************************
λ_2   ***********************************
λ_3   ********************************
λ_4   *******************************
λ_5   ****************************
λ_6   ****************************
λ_7   ***************************
λ_8   **************************
λ_9   **********************
λ_10  **********************
λ_11  **********************
λ_12  **********************
λ_13  *********************
λ_14  *********************
λ_15  *********************
λ_16  ********************
λ_17  *******************
λ_18  ******************
λ_19  ****************
λ_20  ************
λ_21  *******

Figure 11: MCA on X_1, X_2 and X_3 (eigenvalue diagram, first step of the example for a saturated model)

MCA of X_1 and X_3 gives the following results (Table 18 and Figure 12):

Table 18: MCA on X_1 and X_3 (parameters): p, q, m, σ, I_c = [0.5000, 0.5000]

λ_1   ****************************************
λ_2   *************************
λ_3   *************************
λ_4   *************************
λ_5   *************************
λ_6   *************************
λ_7   *************************
λ_8   *************************
λ_9   **********************
λ_10  **********************
λ_11  **********************
λ_12  **********************
λ_13  **********

Figure 12: MCA on X_1 and X_3 (eigenvalue diagram, second step of the example for a saturated model)

The first and the last eigenvalues are not in the confidence interval, so the u-term u_13 is not set to zero. We notice here the peculiar form of the eigenvalue diagram, due to the fact that the multiple eigenvalue λ = 1/2, which has multiplicity 11 = m_3 − m_1, is an artificial one (cf. 4.2.1.1).

MCA of X_2 and X_3 gives the following results (Table 19 and Figure 13):

Table 19: MCA on X_2 and X_3 (parameters): p, q, m, σ, I_c = [0.4998, 0.5002]

The first 8 and the last 8 eigenvalues are not in the confidence interval, so the u-term u_23 is not set to zero.
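The artificial eigenvalue 1/2 noted above follows from a standard property of the MCA of two variables, assumed here without proof: the q = m_a + m_b − 2 non-trivial eigenvalues consist of min(m_a, m_b) − 1 pairs of the form (1 ± √λ)/2, where λ is an eigenvalue of the simple CA of the two-way table, plus the value 1/2 repeated |m_a − m_b| times. A minimal sketch of this count, with an illustrative function name:

```python
def two_variable_spectrum(m_a, m_b):
    """Structure of the non-trivial MCA spectrum for two variables.

    Returns (q, pairs, halves): the number of non-trivial eigenvalues,
    the number of symmetric pairs (1 +/- sqrt(lambda)) / 2, and the
    multiplicity of the artificial eigenvalue 1/2.
    """
    q = m_a + m_b - 2
    pairs = min(m_a, m_b) - 1
    halves = abs(m_a - m_b)
    return q, pairs, halves


# Sex (2 modalities) with goods (13 modalities): 13 eigenvalues, 11 of them equal to 1/2.
sex_goods = two_variable_spectrum(2, 13)
# Age (9 modalities) with goods (13 modalities): 8 pairs and 4 eigenvalues equal to 1/2.
age_goods = two_variable_spectrum(9, 13)
```

The counts agree with the text: 11 = m_3 − m_1 artificial eigenvalues for X_1 and X_3, and 8 eigenvalues on each side of 1/2 for X_2 and X_3.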

λ_1   ****************************************
λ_2   *******************************
λ_3   *****************************
λ_4   **************************
λ_5   *************************
λ_6   ************************
λ_7   ************************
λ_8   ***********************
λ_9   ********************
λ_10  ********************
λ_11  ********************
λ_12  ********************
λ_13  *******************
λ_14  ******************
λ_15  ******************
λ_16  *****************
λ_17  ****************
λ_18  ************
λ_19  ***********
λ_20  ******

Figure 13: MCA on X_2 and X_3 (eigenvalue diagram, second step of the example for a saturated model)

MCA of X_1 and X_2 gives the following results (Table 20 and Figure 14):

Table 20: MCA on X_1 and X_2 (parameters): p, q, m, σ, I_c = [0.4926, 0.5074]

λ_1  ****************************************
λ_2  *************************
λ_3  *************************
λ_4  *************************
λ_5  *************************
λ_6  *************************
λ_7  *************************
λ_8  *************************
λ_9  **********

Figure 14: MCA on X_1 and X_2 (eigenvalue diagram, second step of the example for a saturated model)

The first and the last eigenvalues are not in the confidence interval, so the u-term u_12 is not set to zero. At the end of the second step, we obtain all three two-way interactions. To know whether the model is a saturated one, we can build one of the crossed variables and test its dependence with the remaining simple variable. MCA of X_32 with X_1 gives the following eigenvalues: λ_1 = 0.7285, λ_2 = λ_3 = ... = λ_116 = 0.5, a last eigenvalue λ_117, and I_c = [0.4615, 0.5384]. The first and the last eigenvalues are not in the confidence interval, so the u-term u_123 is not set to zero. At the end we get the following saturated model:

$$\log[f_3(X)] = u_0 + u_1 x_1 + u_2 x_2 + u_3 x_3 + u_{12} x_1 x_2 + u_{23} x_2 x_3 + u_{13} x_1 x_3 + u_{123} x_1 x_2 x_3$$

5.4 An example for a mutual independence model

Here we use a data set given by Andersen [2] as a contingency table crossing four variables observed over 299 individuals, corresponding to a retrospective study of ovary cancer, defined in Table 21:

Table 21: Retrospective study of ovary cancer

Stage (X_1)   Operation (X_2)   Survival (X_3)   X-ray (X_4): No   Yes
Early         radical           no
Early         radical           yes
Early         limited           no               1                 3
Early         limited           yes              13                9
Advanced      radical           no
Advanced      radical           yes              6                 11
Advanced      limited           no               3                 13
Advanced      limited           yes              1                 5

In the first step of the procedure, we test for the pairwise independence of variables X_1, X_2, X_3 and X_4. We first transform the contingency table into a complete disjunctive table, then compute the parameters (see 4.1.1) needed for the test.
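The comparison made at every step of the procedure, in this example as in the previous ones, amounts to listing the eigenvalues that fall outside I_c. A minimal sketch, with an illustrative function name, applied to the values reported for the MCA of X_32 with X_1 in Section 5.3 (only λ_1 and the repeated value 0.5 are used, since the value of λ_117 is not given here):

```python
def outside_interval(eigenvalues, interval):
    """Return the 1-based ranks of the eigenvalues lying outside the interval I_c."""
    low, high = interval
    return [rank
            for rank, value in enumerate(eigenvalues, start=1)
            if value < low or value > high]


# lambda_1 = 0.7285 and two of the eigenvalues equal to 0.5, with I_c = [0.4615, 0.5384].
flagged = outside_interval([0.7285, 0.5, 0.5], (0.4615, 0.5384))
```

An empty list corresponds to the independence conclusion (the u-term is set to zero); a non-empty list indicates that the u-term under test is retained, subject to the hierarchy constraint.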

The MCA on the four variables gives the following results (Table 22 and Figure 15):

Table 22: Parameters needed for the test (first step of the example for a mutual independence model): p, m_1, m_2, m_3, m_4, q, m, σ, I_c = [0.2000, 0.3000]

λ_1  **********************************
λ_2  ********************
λ_3  *******************
λ_4  *********

Figure 15: MCA on X_1, X_2, X_3 and X_4 (eigenvalue diagram, first step of the example for a mutual independence model)

The eigenvalue diagram shows clearly that the variables are not independent: only λ_2 and λ_3 are in the confidence interval. Let us drop X_4 and use the second step of the procedure. MCA on the three remaining variables gives the following results (Table 23 and Figure 16):

Table 23: MCA on X_1, X_2 and X_3 (parameters): p, q, m, σ, I_c = [0.2787, 0.3879]

λ_1  **********************
λ_2  ********************
λ_3  *******************

Figure 16: MCA on X_1, X_2 and X_3 (eigenvalue diagram)

The eigenvalue diagram shows clearly that the variables are independent, since all the eigenvalues are in the confidence interval, so there is surely one or more interaction between X_4 and X_i, i = 1, ..., 3.


Linear regression. Daniel Hsu (COMS 4771) (y i x T i β)2 2πσ. 2 2σ 2. 1 n. (x T i β y i ) 2. 1 ˆβ arg min. β R n d Liear regressio Daiel Hsu (COMS 477) Maximum likelihood estimatio Oe of the simplest liear regressio models is the followig: (X, Y ),..., (X, Y ), (X, Y ) are iid radom pairs takig values i R d R, ad Y

More information

Double Stage Shrinkage Estimator of Two Parameters. Generalized Exponential Distribution

Double Stage Shrinkage Estimator of Two Parameters. Generalized Exponential Distribution Iteratioal Mathematical Forum, Vol., 3, o. 3, 3-53 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/.9/imf.3.335 Double Stage Shrikage Estimator of Two Parameters Geeralized Expoetial Distributio Alaa M.

More information

Summary and Discussion on Simultaneous Analysis of Lasso and Dantzig Selector

Summary and Discussion on Simultaneous Analysis of Lasso and Dantzig Selector Summary ad Discussio o Simultaeous Aalysis of Lasso ad Datzig Selector STAT732, Sprig 28 Duzhe Wag May 4, 28 Abstract This is a discussio o the work i Bickel, Ritov ad Tsybakov (29). We begi with a short

More information

It should be unbiased, or approximately unbiased. Variance of the variance estimator should be small. That is, the variance estimator is stable.

It should be unbiased, or approximately unbiased. Variance of the variance estimator should be small. That is, the variance estimator is stable. Chapter 10 Variace Estimatio 10.1 Itroductio Variace estimatio is a importat practical problem i survey samplig. Variace estimates are used i two purposes. Oe is the aalytic purpose such as costructig

More information

THE KALMAN FILTER RAUL ROJAS

THE KALMAN FILTER RAUL ROJAS THE KALMAN FILTER RAUL ROJAS Abstract. This paper provides a getle itroductio to the Kalma filter, a umerical method that ca be used for sesor fusio or for calculatio of trajectories. First, we cosider

More information

4 Multidimensional quantitative data

4 Multidimensional quantitative data Chapter 4 Multidimesioal quatitative data 4 Multidimesioal statistics Basic statistics are ow part of the curriculum of most ecologists However, statistical techiques based o such simple distributios as

More information

Chi-squared tests Math 6070, Spring 2014

Chi-squared tests Math 6070, Spring 2014 Chi-squared tests Math 6070, Sprig 204 Davar Khoshevisa Uiversity of Utah March, 204 Cotets MLE for goodess-of fit 2 2 The Multivariate ormal distributio 3 3 Cetral limit theorems 5 4 Applicatio to goodess-of-fit

More information

Chapter 3. Strong convergence. 3.1 Definition of almost sure convergence

Chapter 3. Strong convergence. 3.1 Definition of almost sure convergence Chapter 3 Strog covergece As poited out i the Chapter 2, there are multiple ways to defie the otio of covergece of a sequece of radom variables. That chapter defied covergece i probability, covergece i

More information

Statistical Inference Based on Extremum Estimators

Statistical Inference Based on Extremum Estimators T. Rotheberg Fall, 2007 Statistical Iferece Based o Extremum Estimators Itroductio Suppose 0, the true value of a p-dimesioal parameter, is kow to lie i some subset S R p : Ofte we choose to estimate 0

More information

Random Matrices with Blocks of Intermediate Scale Strongly Correlated Band Matrices

Random Matrices with Blocks of Intermediate Scale Strongly Correlated Band Matrices Radom Matrices with Blocks of Itermediate Scale Strogly Correlated Bad Matrices Jiayi Tog Advisor: Dr. Todd Kemp May 30, 07 Departmet of Mathematics Uiversity of Califoria, Sa Diego Cotets Itroductio Notatio

More information

1 Models for Matched Pairs

1 Models for Matched Pairs 1 Models for Matched Pairs Matched pairs occur whe we aalyse samples such that for each measuremet i oe of the samples there is a measuremet i the other sample that directly relates to the measuremet i

More information

Some Properties of the Exact and Score Methods for Binomial Proportion and Sample Size Calculation

Some Properties of the Exact and Score Methods for Binomial Proportion and Sample Size Calculation Some Properties of the Exact ad Score Methods for Biomial Proportio ad Sample Size Calculatio K. KRISHNAMOORTHY AND JIE PENG Departmet of Mathematics, Uiversity of Louisiaa at Lafayette Lafayette, LA 70504-1010,

More information

Because it tests for differences between multiple pairs of means in one test, it is called an omnibus test.

Because it tests for differences between multiple pairs of means in one test, it is called an omnibus test. Math 308 Sprig 018 Classes 19 ad 0: Aalysis of Variace (ANOVA) Page 1 of 6 Itroductio ANOVA is a statistical procedure for determiig whether three or more sample meas were draw from populatios with equal

More information

Element sampling: Part 2

Element sampling: Part 2 Chapter 4 Elemet samplig: Part 2 4.1 Itroductio We ow cosider uequal probability samplig desigs which is very popular i practice. I the uequal probability samplig, we ca improve the efficiecy of the resultig

More information

Overview. p 2. Chapter 9. Pooled Estimate of. q = 1 p. Notation for Two Proportions. Inferences about Two Proportions. Assumptions

Overview. p 2. Chapter 9. Pooled Estimate of. q = 1 p. Notation for Two Proportions. Inferences about Two Proportions. Assumptions Chapter 9 Slide Ifereces from Two Samples 9- Overview 9- Ifereces about Two Proportios 9- Ifereces about Two Meas: Idepedet Samples 9-4 Ifereces about Matched Pairs 9-5 Comparig Variatio i Two Samples

More information

Chapter 22. Comparing Two Proportions. Copyright 2010 Pearson Education, Inc.

Chapter 22. Comparing Two Proportions. Copyright 2010 Pearson Education, Inc. Chapter 22 Comparig Two Proportios Copyright 2010 Pearso Educatio, Ic. Comparig Two Proportios Comparisos betwee two percetages are much more commo tha questios about isolated percetages. Ad they are more

More information

10. Comparative Tests among Spatial Regression Models. Here we revisit the example in Section 8.1 of estimating the mean of a normal random

10. Comparative Tests among Spatial Regression Models. Here we revisit the example in Section 8.1 of estimating the mean of a normal random Part III. Areal Data Aalysis 0. Comparative Tests amog Spatial Regressio Models While the otio of relative likelihood values for differet models is somewhat difficult to iterpret directly (as metioed above),

More information

1 Review of Probability & Statistics

1 Review of Probability & Statistics 1 Review of Probability & Statistics a. I a group of 000 people, it has bee reported that there are: 61 smokers 670 over 5 960 people who imbibe (drik alcohol) 86 smokers who imbibe 90 imbibers over 5

More information

MATH 320: Probability and Statistics 9. Estimation and Testing of Parameters. Readings: Pruim, Chapter 4

MATH 320: Probability and Statistics 9. Estimation and Testing of Parameters. Readings: Pruim, Chapter 4 MATH 30: Probability ad Statistics 9. Estimatio ad Testig of Parameters Estimatio ad Testig of Parameters We have bee dealig situatios i which we have full kowledge of the distributio of a radom variable.

More information

Math 155 (Lecture 3)

Math 155 (Lecture 3) Math 55 (Lecture 3) September 8, I this lecture, we ll cosider the aswer to oe of the most basic coutig problems i combiatorics Questio How may ways are there to choose a -elemet subset of the set {,,,

More information

Statistical Hypothesis Testing. STAT 536: Genetic Statistics. Statistical Hypothesis Testing - Terminology. Hardy-Weinberg Disequilibrium

Statistical Hypothesis Testing. STAT 536: Genetic Statistics. Statistical Hypothesis Testing - Terminology. Hardy-Weinberg Disequilibrium Statistical Hypothesis Testig STAT 536: Geetic Statistics Kari S. Dorma Departmet of Statistics Iowa State Uiversity September 7, 006 Idetify a hypothesis, a idea you wat to test for its applicability

More information

6. Kalman filter implementation for linear algebraic equations. Karhunen-Loeve decomposition

6. Kalman filter implementation for linear algebraic equations. Karhunen-Loeve decomposition 6. Kalma filter implemetatio for liear algebraic equatios. Karhue-Loeve decompositio 6.1. Solvable liear algebraic systems. Probabilistic iterpretatio. Let A be a quadratic matrix (ot obligatory osigular.

More information

Investigating the Significance of a Correlation Coefficient using Jackknife Estimates

Investigating the Significance of a Correlation Coefficient using Jackknife Estimates Iteratioal Joural of Scieces: Basic ad Applied Research (IJSBAR) ISSN 2307-4531 (Prit & Olie) http://gssrr.org/idex.php?joural=jouralofbasicadapplied ---------------------------------------------------------------------------------------------------------------------------

More information

Statistical Inference (Chapter 10) Statistical inference = learn about a population based on the information provided by a sample.

Statistical Inference (Chapter 10) Statistical inference = learn about a population based on the information provided by a sample. Statistical Iferece (Chapter 10) Statistical iferece = lear about a populatio based o the iformatio provided by a sample. Populatio: The set of all values of a radom variable X of iterest. Characterized

More information

MATH/STAT 352: Lecture 15

MATH/STAT 352: Lecture 15 MATH/STAT 352: Lecture 15 Sectios 5.2 ad 5.3. Large sample CI for a proportio ad small sample CI for a mea. 1 5.2: Cofidece Iterval for a Proportio Estimatig proportio of successes i a biomial experimet

More information

MOST PEOPLE WOULD RATHER LIVE WITH A PROBLEM THEY CAN'T SOLVE, THAN ACCEPT A SOLUTION THEY CAN'T UNDERSTAND.

MOST PEOPLE WOULD RATHER LIVE WITH A PROBLEM THEY CAN'T SOLVE, THAN ACCEPT A SOLUTION THEY CAN'T UNDERSTAND. XI-1 (1074) MOST PEOPLE WOULD RATHER LIVE WITH A PROBLEM THEY CAN'T SOLVE, THAN ACCEPT A SOLUTION THEY CAN'T UNDERSTAND. R. E. D. WOOLSEY AND H. S. SWANSON XI-2 (1075) STATISTICAL DECISION MAKING Advaced

More information

Lecture 7: Properties of Random Samples

Lecture 7: Properties of Random Samples Lecture 7: Properties of Radom Samples 1 Cotiued From Last Class Theorem 1.1. Let X 1, X,...X be a radom sample from a populatio with mea µ ad variace σ

More information

Accuracy Assessment for High-Dimensional Linear Regression

Accuracy Assessment for High-Dimensional Linear Regression Uiversity of Pesylvaia ScholarlyCommos Statistics Papers Wharto Faculty Research -016 Accuracy Assessmet for High-Dimesioal Liear Regressio Toy Cai Uiversity of Pesylvaia Zijia Guo Uiversity of Pesylvaia

More information

Econ 325 Notes on Point Estimator and Confidence Interval 1 By Hiro Kasahara

Econ 325 Notes on Point Estimator and Confidence Interval 1 By Hiro Kasahara Poit Estimator Eco 325 Notes o Poit Estimator ad Cofidece Iterval 1 By Hiro Kasahara Parameter, Estimator, ad Estimate The ormal probability desity fuctio is fully characterized by two costats: populatio

More information

Categorical Data Analysis

Categorical Data Analysis Categorical Data Aalysis Refereces : Ala Agresti, Categorical Data Aalysis, Wiley Itersciece, New Jersey, 2002 Bhattacharya, G.K., Johso, R.A., Statistical Cocepts ad Methods, Wiley,1977 Outlie Categorical

More information

Statistics 3858 : Likelihood Ratio for Multinomial Models

Statistics 3858 : Likelihood Ratio for Multinomial Models Statistics 3858 : Likelihood Ratio for Multiomial Models Suppose X is multiomial o M categories, that is X Multiomial, p), where p p 1, p 2,..., p M ) A, ad the parameter space is A {p : p j 0, p j 1 }

More information

Product measures, Tonelli s and Fubini s theorems For use in MAT3400/4400, autumn 2014 Nadia S. Larsen. Version of 13 October 2014.

Product measures, Tonelli s and Fubini s theorems For use in MAT3400/4400, autumn 2014 Nadia S. Larsen. Version of 13 October 2014. Product measures, Toelli s ad Fubii s theorems For use i MAT3400/4400, autum 2014 Nadia S. Larse Versio of 13 October 2014. 1. Costructio of the product measure The purpose of these otes is to preset the

More information

Matrix Representation of Data in Experiment

Matrix Representation of Data in Experiment Matrix Represetatio of Data i Experimet Cosider a very simple model for resposes y ij : y ij i ij, i 1,; j 1,,..., (ote that for simplicity we are assumig the two () groups are of equal sample size ) Y

More information

Econ 325/327 Notes on Sample Mean, Sample Proportion, Central Limit Theorem, Chi-square Distribution, Student s t distribution 1.

Econ 325/327 Notes on Sample Mean, Sample Proportion, Central Limit Theorem, Chi-square Distribution, Student s t distribution 1. Eco 325/327 Notes o Sample Mea, Sample Proportio, Cetral Limit Theorem, Chi-square Distributio, Studet s t distributio 1 Sample Mea By Hiro Kasahara We cosider a radom sample from a populatio. Defiitio

More information

CHAPTER 4 BIVARIATE DISTRIBUTION EXTENSION

CHAPTER 4 BIVARIATE DISTRIBUTION EXTENSION CHAPTER 4 BIVARIATE DISTRIBUTION EXTENSION 4. Itroductio Numerous bivariate discrete distributios have bee defied ad studied (see Mardia, 97 ad Kocherlakota ad Kocherlakota, 99) based o various methods

More information

Axioms of Measure Theory

Axioms of Measure Theory MATH 532 Axioms of Measure Theory Dr. Neal, WKU I. The Space Throughout the course, we shall let X deote a geeric o-empty set. I geeral, we shall ot assume that ay algebraic structure exists o X so that

More information

Journal of Multivariate Analysis. Superefficient estimation of the marginals by exploiting knowledge on the copula

Journal of Multivariate Analysis. Superefficient estimation of the marginals by exploiting knowledge on the copula Joural of Multivariate Aalysis 102 (2011) 1315 1319 Cotets lists available at ScieceDirect Joural of Multivariate Aalysis joural homepage: www.elsevier.com/locate/jmva Superefficiet estimatio of the margials

More information

Assignment 2 Solutions SOLUTION. ϕ 1 Â = 3 ϕ 1 4i ϕ 2. The other case can be dealt with in a similar way. { ϕ 2 Â} χ = { 4i ϕ 1 3 ϕ 2 } χ.

Assignment 2 Solutions SOLUTION. ϕ 1  = 3 ϕ 1 4i ϕ 2. The other case can be dealt with in a similar way. { ϕ 2 Â} χ = { 4i ϕ 1 3 ϕ 2 } χ. PHYSICS 34 QUANTUM PHYSICS II (25) Assigmet 2 Solutios 1. With respect to a pair of orthoormal vectors ϕ 1 ad ϕ 2 that spa the Hilbert space H of a certai system, the operator  is defied by its actio

More information

STA Learning Objectives. Population Proportions. Module 10 Comparing Two Proportions. Upon completing this module, you should be able to:

STA Learning Objectives. Population Proportions. Module 10 Comparing Two Proportions. Upon completing this module, you should be able to: STA 2023 Module 10 Comparig Two Proportios Learig Objectives Upo completig this module, you should be able to: 1. Perform large-sample ifereces (hypothesis test ad cofidece itervals) to compare two populatio

More information

Important Formulas. Expectation: E (X) = Σ [X P(X)] = n p q σ = n p q. P(X) = n! X1! X 2! X 3! X k! p X. Chapter 6 The Normal Distribution.

Important Formulas. Expectation: E (X) = Σ [X P(X)] = n p q σ = n p q. P(X) = n! X1! X 2! X 3! X k! p X. Chapter 6 The Normal Distribution. Importat Formulas Chapter 3 Data Descriptio Mea for idividual data: X = _ ΣX Mea for grouped data: X= _ Σf X m Stadard deviatio for a sample: _ s = Σ(X _ X ) or s = 1 (Σ X ) (Σ X ) ( 1) Stadard deviatio

More information

Machine Learning for Data Science (CS 4786)

Machine Learning for Data Science (CS 4786) Machie Learig for Data Sciece CS 4786) Lecture & 3: Pricipal Compoet Aalysis The text i black outlies high level ideas. The text i blue provides simple mathematical details to derive or get to the algorithm

More information

Chapter 22. Comparing Two Proportions. Copyright 2010, 2007, 2004 Pearson Education, Inc.

Chapter 22. Comparing Two Proportions. Copyright 2010, 2007, 2004 Pearson Education, Inc. Chapter 22 Comparig Two Proportios Copyright 2010, 2007, 2004 Pearso Educatio, Ic. Comparig Two Proportios Read the first two paragraphs of pg 504. Comparisos betwee two percetages are much more commo

More information

Lecture 7: October 18, 2017

Lecture 7: October 18, 2017 Iformatio ad Codig Theory Autum 207 Lecturer: Madhur Tulsiai Lecture 7: October 8, 207 Biary hypothesis testig I this lecture, we apply the tools developed i the past few lectures to uderstad the problem

More information

Lecture 6 Chi Square Distribution (χ 2 ) and Least Squares Fitting

Lecture 6 Chi Square Distribution (χ 2 ) and Least Squares Fitting Lecture 6 Chi Square Distributio (χ ) ad Least Squares Fittig Chi Square Distributio (χ ) Suppose: We have a set of measuremets {x 1, x, x }. We kow the true value of each x i (x t1, x t, x t ). We would

More information

Infinite Sequences and Series

Infinite Sequences and Series Chapter 6 Ifiite Sequeces ad Series 6.1 Ifiite Sequeces 6.1.1 Elemetary Cocepts Simply speakig, a sequece is a ordered list of umbers writte: {a 1, a 2, a 3,...a, a +1,...} where the elemets a i represet

More information