Correspondence Analysis (CA)
1 Correspondence Analysis (CA). François Husson & Magalie Houée-Bigot, Department of Applied Mathematics, Agrocampus Rennes. husson@agrocampus-ouest.fr
2 Correspondence Analysis (CA): outline. 1 Data; 2 Independence model; 3 Point clouds and what they mean; 4 Percentage of inertia, and inertia in CA; 5 Simultaneous representation of rows and columns; 6 Interpretation aids
3 Section 1: Data
4 Contingency tables. x_ij = number of individuals with category i of variable V1 and category j of variable V2. Examples where a χ² test for independence can be applied: characters in Phèdre (Racine) × words (number of times character i uses word j); perfumes × descriptors (number of times perfume i was described with word j); biodiversity: places × species (abundance of species j in place i).
5 Nobel prize data. Contingency table crossing countries (Canada, France, Germany, Italy, Japan, Russia, UK, USA) with prize categories (Chemistry, Economic sciences, Literature, Medicine, Peace, Physics), plus row and column sums. Is there a link between countries and prize categories? Do some countries specialize in certain prizes?
6 Data. n individuals described by 2 qualitative variables: V1 with I categories and V2 with J categories. The contingency table gives the distribution of the n individuals over the I × J cells.
7 From contingency tables to probability tables. f_ij = x_ij / n; marginal row (marginal probability): f_i. = Σ_j f_ij; marginal column (marginal probability): f_.j = Σ_i f_ij. Link between V1 and V2 = deviation of the observed data from the independence model.
8 Section 2: Independence model
9 Links and independence between qualitative variables. Independence model: for independent events, P(A and B) = P(A) P(B); for independent qualitative variables, for all i, j: f_ij = f_i. f_.j (joint probability = product of marginal probabilities). Another way to write it: f_ij / f_i. = f_.j and f_ij / f_.j = f_i. (conditional probability = marginal probability).
10 Links and independence between qualitative variables. Deviation of the observed data (f_ij) from the independence model (f_i. f_.j). 1 Significance of the link/deviation: χ² test, with χ²_obs = Σ_i Σ_j (observed count - theoretical count)² / theoretical count = Σ_i Σ_j (n f_ij - n f_i. f_.j)² / (n f_i. f_.j) = n Σ_i Σ_j (f_ij - f_i. f_.j)² / (f_i. f_.j) = n Φ². 2 Strength of the link: Φ² = deviation of observed probabilities from theoretical ones. 3 Type of link: connections between certain categories. CA works with the table of probabilities; it says nothing about significance, but visualizes the sorts of links there are between the variables.
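The χ² statistic and Φ² above are straightforward to compute; a minimal sketch in Python with numpy/scipy, on an invented contingency table (not the Nobel data):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table (rows = categories of V1, columns = categories of V2)
X = np.array([[10,  2,  5],
              [ 4,  8,  3],
              [ 6,  1, 12]])
n = X.sum()

# Theoretical counts under independence: n * f_i. * f_.j
expected = np.outer(X.sum(axis=1), X.sum(axis=0)) / n
chi2_manual = ((X - expected) ** 2 / expected).sum()

# Same statistic via scipy (no Yates continuity correction)
chi2, p_value, dof, _ = chi2_contingency(X, correction=False)

# Phi-squared: strength of the link, free of the sample size n
phi2 = chi2 / n
```

As the slide notes, the p-value answers the significance question while Φ² measures strength; CA itself only uses the probability table.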
11 How does CA detect a deviation from independence? Analysis by row: row profile i = (f_ij / f_i.)_j = conditional distribution of V2 given category i of V1. Mean row profile G_I = (f_.j)_j = marginal distribution of V2 = profile of ALL the individuals. CA compares each row profile to the mean profile: deviation from independence using a multidimensional approach.
12 Comparing the row profile to the mean profile. Nobel data: each country's profile over the prize categories (Chemistry, Economics, Literature, Medicine, Peace, Physics) is set against the mean profile. Do Italians win particular categories of Nobel prize?
13 How does CA detect a deviation from independence? Analysis by column: column profile j = (f_ij / f_.j)_i = conditional distribution of V1 given category j of V2. Mean column profile G_J = (f_i.)_i = marginal distribution of V1 = profile of ALL the individuals. CA compares each column profile to the mean profile: deviation from independence using a multidimensional approach.
14 Comparing the column profile to the mean profile. Nobel data: each prize category's profile over the countries is set against the mean profile. Is the country distribution for literature prizes the same as the country distribution for total prizes?
15 Section 3: Point clouds and what they mean
16 The cloud of row profiles. The I row profiles form a cloud N_I in R^J; row i has weight f_i. and the centroid of the cloud is the mean profile G_I. Distance between two profiles: d²_χ²(i, i') = Σ_j (1/f_.j) (f_ij/f_i. - f_i'j/f_i'.)². Distance to the mean profile G_I: d²_χ²(i, G_I) = Σ_j (1/f_.j) (f_ij/f_i. - f_.j)².
17 The cloud of column profiles. The J column profiles form a cloud N_J in R^I; column j has weight f_.j and the centroid of the cloud is the mean profile G_J. Distance between two profiles: d²_χ²(j, j') = Σ_i (1/f_i.) (f_ij/f_.j - f_ij'/f_.j')². Distance to the mean profile G_J: d²_χ²(j, G_J) = Σ_i (1/f_i.) (f_ij/f_.j - f_i.)².
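These χ² distances can be checked numerically; a sketch on a made-up table, which also verifies the identity of slide 19 that the weighted squared distances to the mean profile sum to Φ² = χ²/n:

```python
import numpy as np

# Hypothetical contingency table
X = np.array([[10.0, 2.0, 5.0],
              [ 4.0, 8.0, 3.0],
              [ 6.0, 1.0, 12.0]])
P = X / X.sum()                  # probability table f_ij
fi = P.sum(axis=1)               # row margins f_i.
fj = P.sum(axis=0)               # column margins f_.j
row_profiles = P / fi[:, None]   # row profiles f_ij / f_i.

def chi2_dist2(a, b, margins):
    """Squared chi-square distance: sum_k (a_k - b_k)^2 / margin_k."""
    return np.sum((a - b) ** 2 / margins)

# Distance between two row profiles, and to the mean profile G_I = (f_.j)_j
d2_rows = chi2_dist2(row_profiles[0], row_profiles[1], fj)
d2_to_G = chi2_dist2(row_profiles[0], fj, fj)

# Total inertia of the row cloud: sum_i f_i. * d^2(i, G_I), which equals Phi^2
inertia = sum(fi[i] * chi2_dist2(row_profiles[i], fj, fj) for i in range(len(fi)))
phi2 = np.sum((P - np.outer(fi, fj)) ** 2 / np.outer(fi, fj))
```

The 1/f_.j weighting is what makes rare columns count as much as frequent ones, and is the source of the distributional-equivalence property discussed later.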
18 What happens when there is independence? For all i, f_ij/f_i. = f_.j: all the profiles are the same as the mean profile, and N_I collapses to the single point G_I (the cloud has zero inertia). Same for the columns: for all j, f_ij/f_.j = f_i..
19 Deviation from independence, and inertia. The further the data is from independence, the more the profiles spread from the origin. Inertia(N_I/G_I) = Σ_i Inertia(i/G_I) = Σ_i f_i. d²_χ²(i, G_I) = Σ_i Σ_j (f_ij - f_i. f_.j)² / (f_i. f_.j) = χ²/n = Φ². Φ² measures the strength of the link: studying the inertia of N_I turns out to be the same as studying the deviation from independence. Same for N_J: Inertia(N_J/G_J) = Inertia(N_I/G_I) (duality).
20 Visualizing the row (or column) cloud. Decompose the inertia of N_I using factor analysis: project N_I onto a sequence of orthogonal axes with maximal inertia. With H_i the projection of point M_i onto a subspace P, find P such that Σ_i f_i. (OH_i)² is maximal: u_1 is the axis of maximal inertia, u_2 the axis of maximal inertia such that u_2 ⊥ u_1, and so on. Inertia associated with the s-th axis: Σ_i f_i. (OH_i^s)² = λ_s.
21 How to interpret? Our example: CA plot of the Nobel data (Dim 1: 54.75%, Dim 2: 24.60%). The 1st axis contrasts the sciences with the other categories; the 2nd axis contrasts physics/chemistry with economic sciences.
22 Section 4: Percentage of inertia, and inertia in CA
23 Percentage of inertia. 1 Quality of representation of N_I on the s-th axis: (projected inertia of N_I on u_s) / (total inertia of N_I) = Σ_i f_i. (OH_i^s)² / Σ_i f_i. (OM_i)² = λ_s / Σ_k λ_k. 2 Projected inertia can be summed across axes (because they are orthogonal): Σ_k λ_k = Inertia(N_I) = Φ². 3 How the inertia decreases can help choose the number of axes to keep. Example: CA on a contingency table facing white wines with 30 descriptors; the deviation from independence is well summarized by the first two axes (79%).
24 Inertia (= eigenvalues). In CA, λ_s ≤ 1 (unlike normalized PCA, where the maximum eigenvalue is not bounded by 1). What structure does an eigenvalue of 1 correspond to? A partition of the rows and of the columns into two classes, with exclusive associations between classes: each row class only occurs together with its own column class.
25 Inertia (= eigenvalues). Data: recognizing three flavors (sweet, sour, bitter); for each flavor, we asked 10 people to try to recognize the taste of a sample. Two confusion tables (perceived flavor × true flavor) are compared through their CA eigenvalues: in the first, the two axes carry 72.73% and 27.27% of the inertia; in the second, more clear-cut, table they carry 96.00% and 4.00%.
26 Inertia (= eigenvalues). Nobel data (prize categories including Mathematics): λ_1 is well below 1, i.e., we are far from an exclusive association between certain rows and columns; Φ² is small, i.e., far from a perfect link, far from exclusive association between categories of the two variables.
27 Section 5: Simultaneous representation of rows and columns
28 Simultaneous representation of rows and columns. Transition formulas = barycentric properties: F_s(i) = (1/√λ_s) Σ_j (f_ij / f_i.) G_s(j), where F_s(i) is the coordinate of row i on the s-th axis, f_ij/f_i. is the j-th element of profile i, G_s(j) is the coordinate of column j on the s-th axis, and λ_s is the inertia associated with the s-th axis (in CA, λ_s ≤ 1). Along the s-th axis, each row is placed at the barycenter of the columns, with column j given the weight f_ij / f_i., then expanded by the factor 1/√λ_s: the smaller λ_s, the further the barycenter is pushed from the origin. Symmetrically, G_s(j) = (1/√λ_s) Σ_i (f_ij / f_.j) F_s(i).
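The transition formula can be verified on a small CA computed from scratch via the SVD of standardized residuals, which is a common way to implement CA (the table is invented for illustration):

```python
import numpy as np

# Hypothetical contingency table
X = np.array([[10.0, 2.0, 5.0],
              [ 4.0, 8.0, 3.0],
              [ 6.0, 1.0, 12.0]])
P = X / X.sum()
r = P.sum(axis=1)        # row margins f_i.
c = P.sum(axis=0)        # column margins f_.j

# CA via SVD of the matrix of standardized residuals
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

F = (U * sv) / np.sqrt(r)[:, None]      # row coordinates F_s(i)
G = (Vt.T * sv) / np.sqrt(c)[:, None]   # column coordinates G_s(j)
lam = sv ** 2                           # axis inertias lambda_s (each <= 1)

# Transition formula on axis 1: F_1(i) = (1/sqrt(lambda_1)) * sum_j (f_ij/f_i.) G_1(j)
row_profiles = P / r[:, None]
F1_from_G = row_profiles @ G[:, 0] / np.sqrt(lam[0])
```

`F1_from_G` reproduces the first column of `F` exactly: each row really is the barycenter of the columns (weighted by its profile), stretched by 1/√λ_1.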
29 Simultaneous representation and inertia. G_s(j) = (1/√λ_s) Σ_i (f_ij / f_.j) F_s(i). Illustration on the three-flavors data: the two confusion tables (eigenvalue shares 72.73%/27.27% and 96.00%/4.00%) show how the size of λ_s governs how far each column sits from the barycenter of its associated rows.
30 Simultaneous representation and inertia. Worked example on the flavors data: each column coordinate is obtained as the barycenter of the row coordinates weighted by f_ij/f_.j, then expanded by 1/√λ_s; the closer λ_s is to 1, the closer each column stays to the barycenter of the rows associated with it.
31-33 The barycentric property, illustrated on the Nobel data (Dim 1: 54.75%, Dim 2: 24.60%): each country (e.g. Italy, Japan) is positioned at the barycenter of the prize categories weighted by its row profile, then expanded by 1/√λ_s; a country is pulled towards the prizes it wins more often than the mean profile would predict.
34 The barycentric property: full CA plot of the Nobel data, with all countries and prize categories displayed simultaneously (Dim 1: 54.75%, Dim 2: 24.60%).
35 Section 6: Interpretation aids
36 Interpretation aid: quality of the representation. An indicator of the quality of representation of a point on an axis: (projected inertia of M_i on u_s) / (total inertia of M_i) = f_i. (OH_i^s)² / (f_i. (OM_i)²) = cos²(OM_i, u_s), where H_i^s is the projection of M_i onto the unit vector u_s of axis s. This indicator shows how much of the deviation of a profile from the mean profile is visible on an axis or plane.
37 The quality of representation: example. On the three-flavors data (eigenvalue shares 96.00% and 4.00%), the cos² values on each axis are computed for every row and column. Interpretation of the graph is based on extreme points with good quality of representation.
38 Interpretation aid: contributions. Absolute indicator: projected inertia of M_i on u_s = f_i. (OH_i^s)². Relative indicator: (projected inertia of M_i on u_s) / (inertia of axis s) = f_i. (OH_i^s)² / λ_s. We can sum the contributions of several elements, which shows how much an axis can be considered due to one or several elements. Contributions are a practical compromise between distance to the origin and weight. They are useful in big tables for selecting a subset of elements when starting interpretation (jointly with the quality of representation).
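Both interpretation aids follow directly from the CA coordinates; a sketch continuing the same toy CA, checking that contributions to each axis sum to 1 and that each row's cos² values sum to 1 across axes:

```python
import numpy as np

# Hypothetical contingency table (same toy data as before)
X = np.array([[10.0, 2.0, 5.0],
              [ 4.0, 8.0, 3.0],
              [ 6.0, 1.0, 12.0]])
P = X / X.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S, full_matrices=False)

keep = sv > 1e-10                                    # drop the trivial null axis
F = (U[:, keep] * sv[keep]) / np.sqrt(r)[:, None]    # row coordinates
lam = sv[keep] ** 2

# Contribution of row i to axis s: f_i. * F_s(i)^2 / lambda_s (columns sum to 1)
ctr = (r[:, None] * F ** 2) / lam

# Quality of representation: cos^2_s(i) = F_s(i)^2 / d^2(i, G_I) (rows sum to 1)
cos2 = F ** 2 / (F ** 2).sum(axis=1, keepdims=True)
```

As the slide warns, a point can be extreme on the plot (high cos²) yet contribute little to an axis if its weight f_i. is small, which is why both indicators are read together.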
39 Contributions: example. A small indicator table (rows a, b, c, d; columns X1 to X4) gives a CA plot with Dim 1 carrying 83.50% of the inertia. The extreme points are not necessarily the ones that contribute most to axis construction.
40 Supplementary information. A supplementary column (here, Mathematics) is projected afterwards using the transition formula G_s(j) = (1/√λ_s) Σ_i (f_ij / f_.j) F_s(i). Mathematics is on the French and Russian side, also the side of literature and peace, but opposite the sciences.
41 Distributional equivalence. Distributional equivalence: if rows with the same profile are grouped, the CA results are totally equivalent (same for columns). Application to the analysis of texts: thanks to distributional equivalence, if two (or more) words are used in the same circumstances, their coordinates will be close together, so doing the analysis with both, or with just one, is entirely equivalent. Very useful (to group singular and plural versions of words, verb conjugations, etc.).
42 Maximum number of axes, and Cramér's V. Point cloud of rows: I points in a J-dimensional space. J dimensions but 1 constraint (profiles sum to 1), so S ≤ J - 1; I points span at most I - 1 dimensions, so S ≤ I - 1. Hence S ≤ min(I, J) - 1, and Φ² = Σ_{k=1}^{min(I,J)-1} λ_k ≤ min(I, J) - 1, leading to the idea of a bounded indicator of the link between 2 variables: Cramér's V = √(Φ² / (min(I, J) - 1)) ∈ [0, 1]. Nobel prize: V = √(0.522/5) ≈ 0.32; three tastes: V = √(Φ²/2) for each confusion table.
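Cramér's V is a one-line computation once Φ² is known; a sketch on the same invented toy table (not the Nobel data):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table
X = np.array([[10,  2,  5],
              [ 4,  8,  3],
              [ 6,  1, 12]])
n = X.sum()
chi2 = chi2_contingency(X, correction=False)[0]
phi2 = chi2 / n

# V = sqrt(Phi^2 / (min(I, J) - 1)), bounded in [0, 1] since each lambda_k <= 1
V = np.sqrt(phi2 / (min(X.shape) - 1))
```

The bound holds because Φ² is a sum of at most min(I, J) - 1 eigenvalues, each of which is at most 1.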
43 Conclusions for our example. CA plot of the Nobel data, countries × prize categories (Dim 1: 54.75%, Dim 2: 24.60%). CA gives a good visual summary of the deviation from independence, which helps to understand the data table (and especially big data tables). As for this data: most of the deviation from independence appears in the separation of the sciences vs the rest, and to a lesser extent physics/chemistry vs economics. The position of each country shows its strengths in getting various prizes.
44 Conclusion. To study the link between qualitative variables, we build a contingency table. The link is found in the difference between this table and what it would look like if there were independence. Correspondence analysis: build a point cloud of rows (and of columns) whose total inertia measures the strength of the deviation from independence; break down this total inertia into a sequence of axes of decreasing importance, each representing some feature of the link between the variables; visually represent the rows and columns in such a way that their position on the graph reflects their participation in the deviation from independence.
45 Bibliography. For more information in the same vein, have a look at the video series. Husson F., Lê S. & Pagès J. (2017) Exploratory Multivariate Analysis by Example Using R, 2nd edition, 230 p., Chapman & Hall/CRC.
Previously Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations y = Ax Or A simply represents data Notion of eigenvectors,
More informationReduction of Random Variables in Structural Reliability Analysis
Reduction of Random Variables in Structural Reliability Analysis S. Adhikari and R. S. Langley Department of Engineering University of Cambridge Trumpington Street Cambridge CB2 1PZ (U.K.) February 21,
More informationComments on Problems. 3.1 This problem offers some practice in deriving utility functions from indifference curve specifications.
CHAPTER 3 PREFERENCES AND UTILITY These problems provide some practice in eamining utilit unctions b looking at indierence curve maps and at a ew unctional orms. The primar ocus is on illustrating the
More informationGRAPH SIGNAL PROCESSING: A STATISTICAL VIEWPOINT
GRAPH SIGNAL PROCESSING: A STATISTICAL VIEWPOINT Cha Zhang Joint work with Dinei Florêncio and Philip A. Chou Microsoft Research Outline Gaussian Markov Random Field Graph construction Graph transform
More informationMath 520 Exam 2 Topic Outline Sections 1 3 (Xiao/Dumas/Liaw) Spring 2008
Math 520 Exam 2 Topic Outline Sections 1 3 (Xiao/Dumas/Liaw) Spring 2008 Exam 2 will be held on Tuesday, April 8, 7-8pm in 117 MacMillan What will be covered The exam will cover material from the lectures
More informationLinear Dimensionality Reduction
Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Principal Component Analysis 3 Factor Analysis
More informationWhat is Principal Component Analysis?
What is Principal Component Analysis? Principal component analysis (PCA) Reduce the dimensionality of a data set by finding a new set of variables, smaller than the original set of variables Retains most
More informationSTATISTICAL LEARNING SYSTEMS
STATISTICAL LEARNING SYSTEMS LECTURE 8: UNSUPERVISED LEARNING: FINDING STRUCTURE IN DATA Institute of Computer Science, Polish Academy of Sciences Ph. D. Program 2013/2014 Principal Component Analysis
More informationExponential and Logarithmic. Functions CHAPTER The Algebra of Functions; Composite
CHAPTER 9 Exponential and Logarithmic Functions 9. The Algebra o Functions; Composite Functions 9.2 Inverse Functions 9.3 Exponential Functions 9.4 Exponential Growth and Decay Functions 9.5 Logarithmic
More informationNOTES ON MATRICES OF FULL COLUMN (ROW) RANK. Shayle R. Searle ABSTRACT
NOTES ON MATRICES OF FULL COLUMN (ROW) RANK Shayle R. Searle Biometrics Unit, Cornell University, Ithaca, N.Y. 14853 BU-1361-M August 1996 ABSTRACT A useful left (right) inverse of a full column (row)
More informationChapter 6 - Orthogonality
Chapter 6 - Orthogonality Maggie Myers Robert A. van de Geijn The University of Texas at Austin Orthogonality Fall 2009 http://z.cs.utexas.edu/wiki/pla.wiki/ 1 Orthogonal Vectors and Subspaces http://z.cs.utexas.edu/wiki/pla.wiki/
More informationTypes of Information. Topic 2 - Descriptive Statistics. Examples. Sample and Sample Size. Background Reading. Variables classified as STAT 511
Topic 2 - Descriptive Statistics STAT 511 Professor Bruce Craig Types of Information Variables classified as Categorical (qualitative) - variable classifies individual into one of several groups or categories
More informationInterpolation via Barycentric Coordinates
Interpolation via Barycentric Coordinates Pierre Alliez Inria Some material from D. Anisimov Outline Barycenter Convexity Barycentric coordinates For Simplices For point sets Inverse distance (Shepard)
More informationLecture 9: Vector Algebra
Lecture 9: Vector Algebra Linear combination of vectors Geometric interpretation Interpreting as Matrix-Vector Multiplication Span of a set of vectors Vector Spaces and Subspaces Linearly Independent/Dependent
More informationSHORT COMMUNICATION. Diversity partitions in 3-way sorting: functions, Venn diagram mappings, typical additive series, and examples
COMMUNITY ECOLOGY 7(): 53-57, 006 585-8553/$0.00 Akadémiai Kiadó, Budapest DOI: 0.556/ComEc.7.006.. SHORT COMMUNICATION Diversity partitions in 3-way sorting: unctions, Venn diagram mappings, typical additive
More informationMACHINE LEARNING. Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA
1 MACHINE LEARNING Methods for feature extraction and reduction of dimensionality: Probabilistic PCA and kernel PCA 2 Practicals Next Week Next Week, Practical Session on Computer Takes Place in Room GR
More informationSolution: a) Let us consider the matrix form of the system, focusing on the augmented matrix: 0 k k + 1. k 2 1 = 0. k = 1
Exercise. Given the system of equations 8 < : x + y + z x + y + z k x + k y + z k where k R. a) Study the system in terms of the values of k. b) Find the solutions when k, using the reduced row echelon
More informationFitting functions to data
1 Fitting functions to data 1.1 Exact fitting 1.1.1 Introduction Suppose we have a set of real-number data pairs x i, y i, i = 1, 2,, N. These can be considered to be a set of points in the xy-plane. They
More informationMTH 464: Computational Linear Algebra
MTH 464: Computational Linear Algebra Lecture Outlines Exam 2 Material Prof. M. Beauregard Department of Mathematics & Statistics Stephen F. Austin State University March 2, 2018 Linear Algebra (MTH 464)
More informationAn Introduction to Ordination Connie Clark
An Introduction to Ordination Connie Clark Ordination is a collective term for multivariate techniques that adapt a multidimensional swarm of data points in such a way that when it is projected onto a
More informationMATH 829: Introduction to Data Mining and Analysis Principal component analysis
1/11 MATH 829: Introduction to Data Mining and Analysis Principal component analysis Dominique Guillot Departments of Mathematical Sciences University of Delaware April 4, 2016 Motivation 2/11 High-dimensional
More information2. ETA EVALUATIONS USING WEBER FUNCTIONS. Introduction
. ETA EVALUATIONS USING WEBER FUNCTIONS Introduction So ar we have seen some o the methods or providing eta evaluations that appear in the literature and we have seen some o the interesting properties
More informationLeast-Squares Spectral Analysis Theory Summary
Least-Squares Spectral Analysis Theory Summary Reerence: Mtamakaya, J. D. (2012). Assessment o Atmospheric Pressure Loading on the International GNSS REPRO1 Solutions Periodic Signatures. Ph.D. dissertation,
More informationGaussian random variables inr n
Gaussian vectors Lecture 5 Gaussian random variables inr n One-dimensional case One-dimensional Gaussian density with mean and standard deviation (called N, ): fx x exp. Proposition If X N,, then ax b
More informationMS-E2112 Multivariate Statistical Analysis (5cr) Lecture 5: Bivariate Correspondence Analysis
MS-E2112 Multivariate Statistical (5cr) Lecture 5: Bivariate Contents analysis is a PCA-type method appropriate for analyzing categorical variables. The aim in bivariate correspondence analysis is to
More informationUnit of Study: Physical Geography & Settlement Patterns; Culture & Civilizations; and The Spread of Ideas
6 th Grade Social Studies 1 st Nine Weeks TEKS Unit of Study: Physical Geography & Settlement Patterns; Culture & Civilizations; and The Spread of Ideas 6.1) History. The student understands that historical
More informationSyllabus Objective: 2.9 The student will sketch the graph of a polynomial, radical, or rational function.
Precalculus Notes: Unit Polynomial Functions Syllabus Objective:.9 The student will sketch the graph o a polynomial, radical, or rational unction. Polynomial Function: a unction that can be written in
More informationMultivariate Statistical Analysis
Multivariate Statistical Analysis Fall 2011 C. L. Williams, Ph.D. Lecture 4 for Applied Multivariate Analysis Outline 1 Eigen values and eigen vectors Characteristic equation Some properties of eigendecompositions
More informationSIO 211B, Rudnick, adapted from Davis 1
SIO 211B, Rudnick, adapted from Davis 1 XVII.Empirical orthogonal functions Often in oceanography we collect large data sets that are time series at a group of locations. Moored current meter arrays do
More informationAPPENDIX 1 ERROR ESTIMATION
1 APPENDIX 1 ERROR ESTIMATION Measurements are always subject to some uncertainties no matter how modern and expensive equipment is used or how careully the measurements are perormed These uncertainties
More informationPrinciple Components Analysis (PCA) Relationship Between a Linear Combination of Variables and Axes Rotation for PCA
Principle Components Analysis (PCA) Relationship Between a Linear Combination of Variables and Axes Rotation for PCA Principle Components Analysis: Uses one group of variables (we will call this X) In
More information14 : Theory of Variational Inference: Inner and Outer Approximation
10-708: Probabilistic Graphical Models 10-708, Spring 2014 14 : Theory of Variational Inference: Inner and Outer Approximation Lecturer: Eric P. Xing Scribes: Yu-Hsin Kuo, Amos Ng 1 Introduction Last lecture
More informationConceptual Questions for Review
Conceptual Questions for Review Chapter 1 1.1 Which vectors are linear combinations of v = (3, 1) and w = (4, 3)? 1.2 Compare the dot product of v = (3, 1) and w = (4, 3) to the product of their lengths.
More informationMultivariate Analysis of Ecological Data
Multivariate Analysis of Ecological Data MICHAEL GREENACRE Professor of Statistics at the Pompeu Fabra University in Barcelona, Spain RAUL PRIMICERIO Associate Professor of Ecology, Evolutionary Biology
More information(a) If A is a 3 by 4 matrix, what does this tell us about its nullspace? Solution: dim N(A) 1, since rank(a) 3. Ax =
. (5 points) (a) If A is a 3 by 4 matrix, what does this tell us about its nullspace? dim N(A), since rank(a) 3. (b) If we also know that Ax = has no solution, what do we know about the rank of A? C(A)
More informationStructure in Data. A major objective in data analysis is to identify interesting features or structure in the data.
Structure in Data A major objective in data analysis is to identify interesting features or structure in the data. The graphical methods are very useful in discovering structure. There are basically two
More informationUnit 7 Probability M2 13.1,2,4, 5,6
+ Unit 7 Probability M2 13.1,2,4, 5,6 7.1 Probability n Obj.: I will be able to determine the experimental and theoretical probabilities of an event, or its complement, occurring. n Vocabulary o Outcome
More informationA linear algebraic view of partition regular matrices
A linear algebraic view of partition regular matrices Leslie Hogben Jillian McLeod June 7, 3 4 5 6 7 8 9 Abstract Rado showed that a rational matrix is partition regular over N if and only if it satisfies
More information( x) f = where P and Q are polynomials.
9.8 Graphing Rational Functions Lets begin with a deinition. Deinition: Rational Function A rational unction is a unction o the orm ( ) ( ) ( ) P where P and Q are polynomials. Q An eample o a simple rational
More information