Correspondence Analysis & Related Methods


Correspondence Analysis & Related Methods
Michael Greenacre

SESSION 9: CA applied to rankings, preferences & paired comparisons

Correspondence analysis (CA) can also be applied to other types of data:
- ratings
- preferences
- paired comparisons
- distances
- measurement data

The art is in recoding the data into a form suitable for CA. Remember that CA analyses profiles, weighted by masses, with inter-profile distances measured by the chi-squared distance. If the data can be put into a form for which these concepts make sense, then CA is a valid method for visualizing the data.

Q SCIENCE AND ENVIRONMENT (ISSP 1993: Environment)

How much do you agree or disagree with each of these statements?
Qa  We believe too often in science, and not enough in feelings and faith
Qb  Over all, modern science does more harm than good
Qc  Any change humans cause in nature - no matter how scientific - is likely to make things worse
Qd  Modern science will solve our environmental problems with little change to our way of life

Response codes:
1  Strongly agree
2  Agree
3  Neither agree nor disagree
4  Disagree
5  Strongly disagree
8  Can't choose, don't know
9  NA, refused

Recall the indicator matrix definition of MCA

[Table: the first rows of the N x 4 matrix of responses to Qa-Qd, followed by the recoded indicator matrix of 0s and 1s (N rows). The individual values are not recoverable from this transcript.]

The resulting graphical display of the column categories has a separate point for each level of the scale. Each category can find its own position in the map, and the ordering of the scale points is not taken into account.

There are methods for forcing the ordering of the scale points on, say, the first principal axis: for example, categorical PCA (implemented in SPSS).

Alternatively, we can constrain the scale points to lie at equal distances along straight lines in the map, leading to a much simpler graphical display.
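As a minimal sketch of the indicator coding recalled above, the following Python turns each respondent's answers into one block of 0/1 columns per question. The function name `indicator_row`, the two example respondents, and the restriction to the five substantive scale points (codes 8 and 9 are left out) are assumptions made for illustration and do not come from the slides.

```python
def indicator_row(responses, n_levels=5):
    """Code one respondent's answers (each in 1..n_levels) as 0/1 dummies."""
    row = []
    for r in responses:
        # One block of n_levels columns per question, containing a single 1.
        row.extend(1 if level == r else 0 for level in range(1, n_levels + 1))
    return row


# Hypothetical answers of two respondents to the four questions.
ratings = [[2, 3, 1, 5], [4, 4, 2, 1]]
Z = [indicator_row(r) for r in ratings]

for row in Z:
    print(row)
```

Every row of `Z` sums to the number of questions, which is why all respondents receive the same mass when CA is applied to the indicator matrix.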

Doubling of ratings

[Table: the first rows of the matrix of responses beside the corresponding doubled matrix, with a negative-pole and a positive-pole column for each of the four questions (N rows). The individual values are not recoverable from this transcript.]

Each categorical variable is converted into a pair of columns: one is called the positive pole of the rating scale and the other the negative pole. Deciding which is positive or negative depends on the context.

Assuming a rating scale that starts at 1 (e.g., the 1-to-5 Likert scale here), the first column consists of all the ratings minus 1. Since "strongly agree" is a 1 in this case, this column measures the strength of disagreement, so we should call it the negative pole.

The second column is in this case 4 minus the first one, and measures agreement (positive pole). The two columns sum to 4.

The rationale for this coding is as follows. CA analyses frequency data. So when a person gives a response of 2 on the 5-point agreement-disagreement scale, he/she has 1 scale point below and 3 scale points above (1 [2] 3 4 5), i.e. has a count of 1 towards disagreement and a count of 3 towards agreement. Hence the corresponding pair of doubled ratings is [1 3].

CA of doubled ratings

CA is then applied to the doubled matrix with 2Q columns. Each pair of points represents the end-points of the rating scale, and the intermediate points are at equal intervals between these two extremes.

You can think of this analysis as an MCA where the 5 scale points of each variable are forced to lie on a straight line at equal intervals.

[Figure: CA map of the doubled ratings, showing the pole points A, A-, B, B-, C, C-, D and D- on the first two principal axes. The axis inertias and percentages are garbled in this transcript.]
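The doubling rule described above can be sketched in a few lines of Python. The function name `double_ratings`, its `low`/`high` parameters and the example responses are hypothetical; only the rule itself (rating minus the lowest scale point, and highest scale point minus the rating) is taken from the text.

```python
def double_ratings(responses, low=1, high=5):
    """Return [neg, pos, neg, pos, ...] for ratings on a low..high scale."""
    row = []
    for r in responses:
        row.append(r - low)    # scale points below the rating (negative pole here)
        row.append(high - r)   # scale points above the rating (positive pole here)
    return row


# Hypothetical answers of two respondents to the four questions.
ratings = [[2, 3, 1, 5], [4, 4, 2, 1]]
doubled = [double_ratings(r) for r in ratings]

for row in doubled:
    print(row)
```

A response of 2 gives the pair [1, 3], and each pair sums to `high - low`, so every row of the doubled matrix has the same total.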

CA of doubled ratings: origin and scale

Notice that all the scale vectors pass through the centre of the display. The centre cuts each scale vector exactly at the average rating, so the average rating on the first question can be read off where the centre cuts the A scale (the numerical value given on the slide is garbled in this transcript).

[Figure: the same CA map of the doubled ratings, with the 1-to-5 scale marked along the A- to A vector.]

INERTIAS AND PERCENTAGES OF INERTIA

[Table: principal inertias and percentages of inertia for dimensions 1 to 7. The first four dimensions carry essentially all of the inertia and dimensions 5 to 7 are zero to rounding error. The individual values are garbled in this transcript.]

COLUMN CONTRIBUTIONS

[Table: for each of the eight columns A-, A, B-, B, C-, C, D-, D, the quality (QLT), mass (MAS), inertia (INR), and the coordinate, COR and CTR on the first two axes. The two poles of a question share the same quality: 662 for A, 597 for B, 580 for C and 766 for D (in thousandths). The remaining values are garbled in this transcript.]

Further remarks on geometry of doubling

[Table: the doubled matrix with a column of row totals; every row has the same total, 4 for each of the four questions and hence 16 (N rows).]

- Each respondent has the same mass (as in MCA).
- Distances between column points are (unweighted) Euclidean distances between column profiles.
- Distances between row points have weights which are related to a measure of polarisation.
- Differences between respondents on a highly polarised variable (i.e., mean near the end-points) will be weighted more than usual.
- The column masses occur in pairs which are proportional to the doubled means of the ratings, and sum to a constant.

Doubling of preferences (rankings)

Suppose N individuals rank-order p objects in order of preference. For example, in a marketing survey about bottled mineral water, [?]00 respondents rank-ordered 6 different attributes of this type of product (the leading digit of the sample size is lost in this transcript).

Usually the data are coded as the ranking given to the attributes, where 1 indicates first choice, 2 second, and so on. This is just like a rating scale except that no scale values can be repeated by a respondent.

[Table: the first rows of the matrix of rankings of the 6 attributes beside the corresponding doubled matrix, in which each pair of columns sums to 5. The individual values are not recoverable from this transcript.]

[Figures: two displays labelled "(doubled columns)" and "(doubled rows)".]
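Doubling a rank-ordering works like doubling a rating, with the scale running from 1 to p. The sketch below is illustrative only: the function name `double_ranking` and the example ranking are hypothetical, and the order of the two columns in each pair (objects preferred to it first, objects it is preferred to second) is an assumption.

```python
def double_ranking(ranks):
    """ranks[i] is the rank (1 = first choice) that one respondent gives object i."""
    p = len(ranks)
    if sorted(ranks) != list(range(1, p + 1)):
        raise ValueError("ranks must be a permutation of 1..p (no repeated values)")
    row = []
    for r in ranks:
        row.append(r - 1)   # number of objects preferred to this one
        row.append(p - r)   # number of objects this one is preferred to
    return row


# Hypothetical ranking of six attributes A..F by one respondent.
example = double_ranking([6, 1, 3, 5, 2, 4])
print(example)
```

An attribute ranked last out of six gets the pair [5, 0]: five attributes are preferred to it and it is preferred to none.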

Coding of paired comparisons

Once again, we can consider the doubled rankings as counts. For example, attribute A is in 6th (last) position: 5 attributes are preferred to it, and it is preferred to none.

This idea can be extended to paired comparisons. Suppose that, instead of rank-ordering, respondents consider each pair of attributes and then decide which they prefer. Thus each of the N individuals has to make ½ p(p-1) decisions.

[Table: the first rows of the paired-comparison data, with one column per pair (A/B, A/C, A/D, ...) recording the preferred attribute, beside the corresponding doubled matrix. The individual values are not recoverable from this transcript.]

Once again, we can consider the doubled data as counts. For example, again all 5 other attributes have been preferred to A and A has been preferred to none.

There can be some inconsistencies in the paired comparisons, which means that the doubled data can contain repeated values within a row, something that never happens for rank-orderings.

[Figures: two displays labelled "Doubled preference counts (doubled columns)" and "Doubled preference counts (doubled rows)".]
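The counting behind the paired-comparison coding can be sketched as follows. The function `double_paired`, the dictionary layout for a respondent's choices and the three-object example are all hypothetical; the example is deliberately inconsistent (A over B, B over C, C over A) to show the repeated values that rank-orderings cannot produce.

```python
from itertools import combinations


def double_paired(objects, preferred):
    """Doubled counts from one respondent's paired comparisons.

    preferred maps each pair (a, b), in the order produced by
    itertools.combinations(objects, 2), to the object chosen from that pair.
    """
    p = len(objects)
    wins = {o: 0 for o in objects}
    for pair in combinations(objects, 2):
        choice = preferred[pair]
        if choice not in pair:
            raise ValueError(f"{choice!r} is not in pair {pair}")
        wins[choice] += 1
    row = []
    for o in objects:
        row.append(p - 1 - wins[o])   # times another object was preferred to o
        row.append(wins[o])           # times o was preferred
    return row


objects = ["A", "B", "C"]
# Hypothetical, deliberately inconsistent choices.
cyclic = {("A", "B"): "A", ("A", "C"): "C", ("B", "C"): "B"}
# Hypothetical consistent choices, equivalent to the ranking A, B, C.
consistent = {("A", "B"): "A", ("A", "C"): "A", ("B", "C"): "B"}

print(double_paired(objects, cyclic))
print(double_paired(objects, consistent))

# Number of decisions per respondent for p = 6 attributes: p(p-1)/2.
n_decisions = len(list(combinations("ABCDEF", 2)))
print(n_decisions)
```

With consistent choices the doubled counts coincide with those obtained from the equivalent rank-ordering.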

SESSION 20: CA of continuous data and distances

Continuous measurement scales

Nonparametric CA:
- Convert all data to rank-orders within the variable
- Double the ranks across the cases

[Table: a small example with variables A, B and C, showing the original measurements, their ranks within each variable, and the doubled ranks in 2Q columns. The individual values are not recoverable from this transcript.]

Continuous measurement scales

Standardise the data by centring with respect to the mean and dividing by the standard deviation:

$$ z = \frac{x - \bar{x}}{s} $$

Calculate the doubled entries for each variable, for example for variable A:

$$ A = \frac{1 + z}{2}, \qquad A^{-} = \frac{1 - z}{2} $$

This is a good coding when categorical variables, especially binary variables, are present.

[Table: the same small example, showing the original measurements and the doubled standardised values for A, B, ...; each pair of doubled values sums to 1. The individual values are not recoverable from this transcript.]

Data set "meteo"

Annual averages of five meteorological variables (SUN, HUM, PRE, ALT, MAX) in [?]0 Turkish cities (the leading digit of the number of cities is lost in this transcript). The data are all on different scales. One way to homogenize the data is to code them into categories, either crisply or fuzzily.

[Table: SUN, HUM, PRE, ALT and MAX for Adana, Afyon, Anamur, Ankara, Antakya, Antalya, Aydın, Balıkesir, Bolu, Bursa, Çanakkale, ..., Siirt, Sivas, Tekirdağ, Trabzon, Şanlıurfa, Van, Zonguldak. The individual values are garbled in this transcript.]

[Tables: crisp coding of SUN into three 0/1 columns (Sun1, Sun2, Sun3), and fuzzy coding of SUN into three columns of memberships between 0 and 1 that sum to 1 in each row. The individual values are garbled in this transcript.]
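The two doublings of a continuous variable described above can be sketched as follows. The function names and the example measurements are hypothetical, the rank version assumes there are no tied values, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since the slides do not say which divisor is meant.

```python
from statistics import mean, stdev


def double_ranks(values):
    """Rank within the variable (1 = smallest), then double each rank.

    Returns one pair (cases below, cases above) per case.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0] * n
    for position, i in enumerate(order, start=1):
        ranks[i] = position          # assumes no tied values
    return [(r - 1, n - r) for r in ranks]


def double_standardised(values):
    """Standardise, then double: ((1 + z) / 2, (1 - z) / 2) per case."""
    m, s = mean(values), stdev(values)   # sample standard deviation (assumption)
    pairs = []
    for x in values:
        z = (x - m) / s
        pairs.append(((1 + z) / 2, (1 - z) / 2))
    return pairs


x = [12.0, 7.5, 20.1, 9.3]    # hypothetical measurements on one variable
print(double_ranks(x))
print(double_standardised(x))
```

Unlike the doubled ranks, the doubled standardised values can be negative (whenever |z| > 1), but each pair still sums to 1.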

There are several ways of performing fuzzy coding. As an example, we chose the system of triangular membership functions shown here.

[Figure: three triangular membership functions with hinge points at the minimum, median and maximum of the variable.]

An example is given of a value below the median on the continuous scale, which is coded as [0.22 0.78 0].

Crisp and fuzzy (multiple) correspondence analysis

Data: Turkish meteorological data. The symmetric map is shown.

Cities that are at identical positions in the crisp coding (having the same 5 categories) are separated by the fuzzy coding, which retains more of the information in the original data.

[Figures: symmetric maps of the crisp-coded and fuzzy-coded meteorological data.]

Distance matrices

Consider the following table from an environmental survey (this is one of the data sets from my book Correspondence Analysis in Practice: Second Edition, 2007; it is given on www.carme-n.org).

The columns refer to sampling sites: the first, labelled 1 to 11, are in the vicinity of an oil-platform in the North Sea, while the last two, R1 and R2, are reference sites far from the oil-platform.

The rows are 10 different species of benthic (sea-bed) marine life, labelled s1 to s10.

[Table: abundances of species s1 to s10 (rows) at sites 1 to 11, R1 and R2 (columns). The individual values are garbled in this transcript.]
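The triangular fuzzy coding described above, with hinge points at the minimum, median and maximum, can be sketched as follows. The function name `fuzzy_code` and the example values are hypothetical; the values were chosen so that one of them reproduces the coding [0.22 0.78 0] quoted in the text.

```python
from statistics import median


def fuzzy_code(x, lo, mid, hi):
    """Triangular memberships of x in three categories hinged at lo, mid, hi.

    Assumes lo < mid < hi and lo <= x <= hi.
    """
    if not lo <= x <= hi:
        raise ValueError("x must lie between lo and hi")
    if x <= mid:
        upper = (x - lo) / (mid - lo)      # closeness to the middle hinge
        return [1 - upper, upper, 0.0]
    upper = (x - mid) / (hi - mid)         # closeness to the top hinge
    return [0.0, 1 - upper, upper]


values = [0.0, 7.8, 10.0, 16.0, 30.0]      # hypothetical variable
lo, mid, hi = min(values), median(values), max(values)

for v in values:
    print(v, [round(m, 2) for m in fuzzy_code(v, lo, mid, hi)])
```

The memberships of each value sum to 1, so a fuzzy-coded variable occupies the same three columns as its crisp 0/1 coding while keeping track of where the value lies between the hinges.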

Bray-Curtis dissimilarity

The chi-square distance is used implicitly in CA, but ecologists like to use a non-Euclidean distance called the Bray-Curtis dissimilarity. This is a much simpler dissimilarity function to understand in the context of environmental sampling. For two sites j and j', with x_{ij} the abundance of species i at site j:

$$ d(j, j') = 100 \, \frac{\sum_i |x_{ij} - x_{ij'}|}{\sum_i (x_{ij} + x_{ij'})} $$

B-C = 0 if the two sites have identical abundances, and = 100 if they have no species in common. The B-C index of similarity is 100 minus the above dissimilarity.

Bray-Curtis dissimilarity values

Since the measure is greatly affected by the difference in variance between common and rare species, ecologists often take fourth roots ($\sqrt[4]{x}$) of the abundances and then calculate the B-C values:

$$ d(j, j') = 100 \, \frac{\sum_i |\sqrt[4]{x_{ij}} - \sqrt[4]{x_{ij'}}|}{\sum_i (\sqrt[4]{x_{ij}} + \sqrt[4]{x_{ij'}})} $$

Part of the matrix of dissimilarities is shown here:

[Table: 6 x 6 block of the Bray-Curtis dissimilarity matrix, including the reference sites R1 and R2, with zeros on the diagonal. The individual values are garbled in this transcript.]

CA of a distance matrix

The best way to visualize the Bray-Curtis dissimilarities would be to apply a method of multidimensional scaling (MDS). But CA has been shown to be applicable to dissimilarity matrices if we convert the dissimilarities to similarities using a transformation of the form

$$ s = k - d $$

where k is a large value, at least as large as the maximum dissimilarity. Since the dissimilarities have a maximum of 100, the obvious choice is k = 100. Hence the Bray-Curtis similarities (in fact, the off-diagonal elements are now exactly the Bray-Curtis indices):

[Table: the corresponding 6 x 6 block of similarities, with 100 on the diagonal. The individual values are garbled in this transcript.]

[Figure: CA map of the Bray-Curtis indices, showing sites 1 to 11, R1 and R2 on the first two principal axes. The axis percentages are garbled in this transcript.]
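The Bray-Curtis calculation and the conversion s = k - d can be sketched as below. The function names, the `root` parameter and the three small abundance vectors are hypothetical and are not the benthic data from the slides; the sketch only prepares the similarity matrix and does not carry out the CA itself.

```python
def bray_curtis(a, b, root=1):
    """Percentage Bray-Curtis dissimilarity between two abundance vectors.

    root=4 applies the fourth-root transformation before the calculation.
    """
    ta = [x ** (1 / root) for x in a]
    tb = [x ** (1 / root) for x in b]
    total = sum(x + y for x, y in zip(ta, tb))
    if total == 0:
        raise ValueError("both sites are empty")
    return 100 * sum(abs(x - y) for x, y in zip(ta, tb)) / total


def similarities(sites, k=100, root=4):
    """Matrix of s = k - d for every pair of sites."""
    return [[k - bray_curtis(a, b, root) for b in sites] for a in sites]


# Hypothetical abundances of four species at three sites.
sites = [[10, 0, 3, 1], [8, 2, 3, 0], [0, 0, 0, 16]]

print(bray_curtis(sites[0], sites[1]))           # raw abundances
print(bray_curtis(sites[0], sites[1], root=4))   # fourth-root abundances
S = similarities(sites)
for row in S:
    print([round(s, 2) for s in row])
```

With k = 100 the diagonal of `S` is 100 and the off-diagonal entries are the Bray-Curtis similarity indices, giving a symmetric non-negative matrix of the kind the text proposes to submit to CA.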