Correspondence Analysis & Related Methods

Correspondence Analysis & Related Methods Michael Greenacre SESSION 9: CA applied to rankings, preferences & paired comparisons Correspondence analysis (CA) can also be applied to other types of data: ratings preferences paired comparisons distances measurement data The art is in the recoding of the data to be suitable for CA Remember that CA analyses profiles, weighted by masses and with inter-profile distances measured by chi-squared distance If the data can be put into a form for which these concepts makes sense, then CA is a valid method for visualizing the data Q SCIENCE AND ENVIRONMENT ISSP 99: Environment How much do you agree or disagree with each of these statements? Qa We believe too often in science, and not enough in feelings and faith Qb Over all, modern science does more harm than good Qc Any change humans cause in nature - no matter how scientific - is likely to make things worse Qd Modern science will solve our environmental problems with little change to our way of life Strongly agree 2 Agree Neither agree nor disagree Disagree 5 Strongly disagree 8 Can't choose, don't know 9 NA, refused Recall the indicator matrix definition of MCA 2 2 2 2 5 2 5 2 2 5 5 2 2 2 2 2 5 5 2 etc Recoded indicator matrix 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 etc (N rows) The resulting graphical display of the column categories has a separate point for each level of the scale Each category can find its own position in the map and the ordering of the scale points is not taken into account There are methods for forcing the ordering of the scale points on, say, the first principal axis: for example, categorical PCA (implemented in SPSS) Alternatively, we can constrain the scale points to lie at equal distances along straight lines in the map, leading to a much simpler graphical display

2 2 2 2 5 2 5 2 2 5 5 2 2 2 2 2 5 5 2 etc Doubling of ratings 0 0 2 2 0 0 0 0 0 0 0 0 2 2 0 2 2 2 2 2 2 0 0 etc (N rows) Each categorical variable is converted into a pair of columns: one is called the positive pole of the rating scale and the other the negative pole Deciding which is positive or negative depends on the context Assuming a rating scale that starts at (eg, the -to-5 Likert scale here), the first column consists of all the ratings minus Since strongly agree is a in this case, this column will measure the strength of disagreement, so we should give call it the negative pole The second column is in this case minus the first one, and measures agreement (positive pole) The two columns sum to 2 2 2 2 5 2 5 2 2 5 5 2 2 2 2 2 5 5 2 etc The rational for this coding is as follows: Doubling of ratings - - - - 0 0 2 2 0 0 0 0 0 0 0 0 2 2 0 2 2 2 2 2 2 0 0 etc (N rows) CA analyses frequency data So when a person gives a response of 2 on the 5-point agreement-disagreement scale, then he/she has scale point below and scale points above : 2 5, ie has a count of towards disagreement and a count of towards agreement Hence the corresponding pair of doubled ratings is [ ] Doubling of ratings CA of doubled ratings 2 2 2 2 5 2 5 2 2 5 5 2 2 2 2 2 5 5 2 etc - - - - 0 0 2 2 0 0 0 0 0 0 0 0 2 2 0 2 2 2 2 2 2 0 0 etc (N rows) *C- *B- *D *C *B *A CA is then applied to the doubled matrix with 2Q columns Each pair of points represents the end-points of the rating scale and the intermediate points are at equal intervals between these two extremes You can think of this analysis as an MCA where the 5 scale points of each variable are forced to lie on a straight line at equal intervals *A- *D- 059 (8%) 0078 (20%)

CA of doubled ratings origin and scale *C- *D Hence the *C *B average rating on the first question is about 25 5 *A- 2 Notice that all the scale vectors pass through the centre of the display The centre cuts the scale vector exactly at the average rating *B- *D- *A 059 (8%) 0078 (20%) INERTIAS AND PERCENTAGES OF INERTIA ----------------------------------- CA of doubled ratings 05890 85% ************************************************** 2 007808 20% **************************** 0065729 202% ************************ 008 88% ****************** 5 0000002 000% 6 000000 000% 7 0000000 000% -------- 0270 COLUMN CONTRIBUTIONS -------------------- ---------------------------------------------- J NAME QLT MAS INR k= COR CTR k=2 COR CTR ---------------------------------------------- A- 662 89 7-50 07 66-99 255 90 2 A 662 6 95 279 07 92 22 255 05 B- 597 7 2-95 590 58 7 B 597 6 80 590 9-5 7 5 C- 580 0 50-59 57 205 58 7 5 6 C 580 7 06 66 57 5-7 7 D- 766 2 99 2 77 8-95 689 297 8 D 766 08-75 77 2 522 689 9 ---------------------------------------------- Further remarks on geometry of doubling total - - - - ========================== 0 6 0 0 6 2 2 0 6 0 6 0 0 6 0 0 0 6 0 2 2 6 0 2 2 6 2 2 6 2 2 0 0 6 etc (N rows) ========================== Each respondent has the same mass (as in MCA) Distances between column points are (unweighted) Euclidean between column profiles Distances between row points have weights which are related to a measure of polarisation Differences between respondents on a highly polarised variable (ie, mean near the end-points) will be weighted more than usual The column masses occur in pairs which are proportional to the doubled means of the ratings, and sum to a constant Doubling of preferences (rankings rankings) Suppose N individuals rank-order p objects in order of preference For example, in a marketing survey about bottled mineral water, 00 respondents rankordered 6 different attributes of this type of product Usually the data are coded as the ranking given to the attributes, where indicates first choice, 2 second, and so on This is just like a rating scale except no scale values can be repeated by a respondent 6 5 2 5 2 6 2 5 6 5 2 6 5 2 6 etc - - - - - - 5 0 0 5 2 2 2 0 5 5 0 2 2 2 0 5 5 0 2 2 5 0 0 5 2 0 5 2 5 0 etc (00 rows) (doubled columns) (doubled rows) - 5 0 2 2-2 0 5-2 0 5-2 5 0 5-2 0 5 0 5 2 2 5 0 2 2 5 0 2 0 5 5 5 2 0

Coding of paired comparisons Once again, we can consider the doubled data as counts: for example Attribute A is in 6th (last) position: 5 attributes are preferred to it, and it is preferred to none This idea can be extended to paired comparisons Suppose that, instead of rank-ordering, respondents consider each pair of attributes and then decide which they prefer Thus each of the N individuals has to make ½ p(p-) decisions Coding of paired comparisons Once again, we can consider the doubled data as counts: for example Again all 5 other attributes have been preferred to A and A has been preferred to none There can be some inconsistencies in the paired comparisons, which means that we can get this type of repetition in the doubled data which we never had for rank-orderings 6 5 2 5 2 6 2 5 6 5 2 6 5 2 6 etc - - - - - - 5 0 0 5 2 2 2 0 5 5 0 2 2 2 0 5 5 0 2 2 5 0 0 5 2 0 5 2 5 0 etc (00 rows) (doubled columns) (doubled rows) - 5 0 2 2-2 0 5-2 0 5-2 5 0 5-2 0 5 0 5 2 2 5 0 2 2 5 0 2 0 5 5 5 2 0 A/B A/C A/D B C D B C D B C A B C D B A A etc to paired comparisons - - - - - - 5 0 0 5 2 2 2 2 0 5 5 0 2 2 2 0 5 5 0 2 2 5 0 2 0 5 2 5 0 etc (00 rows) Doubled preference counts (doubled columns) Doubled preference counts (doubled rows) - 5 0 2 2-2 0 5-2 0 5-2 5 5-2 0 5 0 5 2 2 2 5 0 2 2 5 0 2 0 5 5 2 0

Correspondence Analysis & Related Methods Michael Greenacre SESSION 20: CA of continuous data and distances I (2) Continuous measurement scales Q A B C 2 57 7 5 0 20 2 78 20 2 52 9 2 20 Convert all data to rank-orders within the variable Double the ranks across the cases Nonparametric CA Q A B C 2 8 5 6 7 2 0 0 2Q A B C - - - 8 0 2 9 0 7 7 5 6 8 6 5 0 9 2 9 2 0 0 8 I (2) Continuous measurement scales Q A B C 2 7 5 2 20 2 2 20 52 9 20 2Q A B - 2-026 26 0 08 07 0 06 06 0-056 56 0-058 58 0 Standardise the data by centring with respect to mean and dividing by standard deviation: x x z = s Calculate doubled entries for variable: A = ( z ) / 2 A- = (- z) / 2 This is a good coding when categorical variables, especially binary variables, are present Data set meteo : Annual averages of five meteorological variables in 0 Turkish cities Data all on different scales One way to homogenize the data is to code into categories, either crisply or fuzzily SUN HUM PRE ALT MAX Adana 755 66 67 27 56 Afyon 709 6 0 98 Anamur 8 69 995 5 2 Ankara 79 60 777 89 08 Antakya 75 70 2 00 9 Antalya 828 6 052 5 50 Aydın 72 6 8577 57 6 Balıkesir 656 70 5885 7 7 Bolu 59 7 56 72 9 Bursa 65 69 696 00 8 Çanakkale 7 7 65 6 88 Siirt 7 5 7265 896 60 Sivas 6 6 70 285 00 Tekirdağ 50 76 575 59 68 Trabzon 6 72 88 8 Şanlıurfa 828 9 6 0 82 Van 7 59 80 66 75 Zonguldak 55 72 2202 7 05 Sun Sun2 Sun 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Sun Sun2 Sun 0000 06 066 000 0997 0000 0000 0000 000 0000 0927 007 0000 0959 00 0000 00 0959 0000 070 0260 082 088 0000 05 056 0000 025 077 0000 0000 0829 07 0000 072 0268 0226 077 0000 057 026 0000 0926 007 0000 0000 00 0959 0000 072 0268 0527 07 0000

There are several ways of performing fuzzy coding As an example, we chose the triangular membership function system shown here: Crisp fuzzy (multiple multiple) correspondence analysis 078 05 022 min median max An example is given of a value below the median on the continuous scale which is coded as [022 078 0] Crisp fuzzy (multiple multiple) correspondence analysis Data: Turkish meteorological data The symmetric map is shown Cities that are at identical positions in the crisp coding (having the same 5 catgories) are separated by the fuzzy coding which retains more of the information in the original data Distance matrices Consider the following table from an environmental survey (this is one of the data sets from my book Correspondence Analysis in Practice: Second Edition, 2007 it is given on wwwcarme-norg) The columns refer to sampling sites, the first labelled to are in the vicinity of an oil-platform in the North Sea, while the last two, R and R2, are reference sites far from the oil-platform The rows are 0 different species of benthic (sea-bed) marine life, labelled s to s0 S S2 S S S5 S6 S7 S8 S9 S0 2 5 6 7 8 9 0 R R2 9 79 50 72 02 6 267 27 992 5 2 9 0 57 9 6 58 7 9 68 76 25 8 9 9 8 8 20 22 0 0 55 65 9 26 5 0 5 2 5 6 0 7 7 5 8 0 2 0 8 8 5 8 2 2 2 6 7 0 8 2 2 6 2 0 8 29 2 7 6 2 8 6 6 6 5 2 2 2 2 2 5 7 5 0 0 0

Bray-Curtis dissimilarity The chi-square distance is used implicitly in CA, but ecologists like to use a non-euclidean distance called the Bray-Curtis dissimilarity This is a much simpler dissimilarity function to understand in the context of environmental sampling: S S2 S S S5 S6 S7 S8 S9 S0 d( j, j ) = 00 x i ij xij ( xij xij ) B-C = 0 if identical abundances, = 00 if no common species; B-C index of similarity is 00 minus above dissimilarity 2 5 6 7 8 9 0 R R2 9 79 50 72 02 6 267 27 992 5 2 9 0 57 9 6 58 7 9 68 76 25 8 9 9 8 8 20 22 0 0 55 65 9 26 5 0 5 2 5 6 0 7 7 5 8 0 2 0 8 8 5 8 2 2 2 6 7 0 8 2 2 6 2 0 8 29 2 7 6 2 8 6 6 6 5 2 2 2 2 2 5 7 5 0 0 0 i Bray-Curtis dissimilarity values Since the measure is greatly affected by the variance difference between common and rare species, ecologists often take fourth roots ( ) of the abundances, and then calculate the B-C values 2 R R2 d( j, j ) = 00 ij i ( x ij xij ) Part of the x matrix of dissimilarities is shown here: 2 R R2 000 9 952 26 77 9 000 82 25 86 952 82 000 72 2786 222 72 000 29 772 26 25 2786 29 000 77 86 222 772 000 i x ij x CA of a distance matrix The best way to visualize the Bray-Curtis dissimilarities would be to apply a method of multidimensional scaling (MDS) But CA has been shown to be applicable to dissimilarity matrices if we convert the dissimilarities to similarities using a transformation of the form s = k d where k is a large value, at least as large as the maximum dissimilarity Since the dissimilarities have as maximum of 00, the obvious choice is k = 00 Hence the Bray-Curtis similarities (in fact, the off-diagonal elements are now exactly the Bray-Curtis indices): 2 R R2 2 R R2 0000 8906 908 8886 785 8226 8906 0000 858 9669 787 89 908 858 0000 8258 72 7779 8886 9669 8258 0000 785 8228 785 787 72 785 0000 8867 8226 89 7779 8228 8867 0000 005 0-005 -0 CA map of Bray-Curtis indices 2 (05 %) R 5 8 R2 0 (9 %) 7 9 6-005 0 005 0