CORRESPONDENCE ANALYSIS

Henry Blankenship

INTUITIVE THEORETICAL PRESENTATION
BASIC RATIONALE
DATA PREPARATION
INITIAL TRANSFORMATION OF THE INPUT MATRIX INTO PROFILES
DEFINITION OF GEOMETRIC CONCEPTS (MASS, DISTANCE AND CENTROID)
CONSTRUCTION OF THE PRINCIPAL AXES OF INERTIA AND PROJECTION PLOTS
GRAPHICAL INTERPRETATION OF OUTPUTS
SUPPLEMENTARY PROJECTIONS
QUALITATIVE REGRESSION
ARCHETYPAL DISCRIMINATION
CASE STUDIES
APPENDIX A Air Quality in Gdansk
APPENDIX B Reservoir Quality Zones in an Oil Field
APPENDIX C Climatology of Porto urban area
APPENDIX D Risk Assessment of mine tailings dam breakage
APPENDIX E Index of Quality in Natural Stones

BASIC RATIONALE

Correspondence Analysis (CA) is a geometric data analysis methodology whose main goal is to represent tabular data graphically, thereby facilitating the interpretation of the numeric table. The basic idea behind the methodology, developed in the 1960s by the French mathematician Jean-Paul Benzécri, is that any matrix of positive numbers, viewed as some form of concatenation of contingency tables [1], can be summarized in a series of 2D graphs that plot the data with respect to two perpendicular coordinate axes calibrated to a common scale. Avoiding unnecessary a priori assumptions as far as possible, the interpretation of such graphs allows the pattern of relationships between the rows and columns of the input table to be detected and evaluated [2].

When the input table is taken as an $n \times p$ matrix, its geometric representation consists of depicting the elements of the table as points in a certain geometric space. Depending on how the matrix is viewed, it may be converted into a cloud of $n$ rows in the space $\mathbb{R}^p$, or into a cloud of $p$ columns in the space $\mathbb{R}^n$. The very backbone of the CA algorithm ensures that these two representations are equivalent, allowing the joint interpretation of rows and columns in the same plot.

The foremost outputs yielded by CA are standard Cartesian graphs showing the simultaneous projection of the labels that represent the rows and columns of the input matrix [3] onto the axes that convey the maximum inertia [4] associated with the initial cloud of points (each row being considered as a point in the column space, and conversely). These graphs are ranked in descending order of importance (quantified by the fraction of the total cloud inertia conveyed by the axes extracted from the cloud), and only the most significant plots are retained for interpretation [5], which at the same time reduces the dimensionality of the input matrix.

This process of discarding the less important graphs, which produces the reduction of the initial cloud dimensionality, is illustrated by a simple artificial example outlined in Fig. A-1, in which the cloud of rows (individuals) can be seen as an ellipsoid initially given in the 3D space of columns (properties). By minimizing the distance from the cloud of individuals to three orthogonal directions in space, the CA algorithm finds the three principal axes of the ellipsoid (AXIS 1, 2 and 3 in Fig. A-1) and projects the points of the cloud onto the two planes defined by AXIS 1 vs. AXIS 2 and by AXIS 1 vs. AXIS 3 [(1) and (2) in Fig. A-1].

Fig. A-1 Rotation of an ellipsoidal cloud of points to the most favorable position, where the two principal planes [(1) and (2)] of the ellipsoid summarize in 2D the geometry of the cloud in an optimal way

[1] In a contingency table, the intersection of any row and column gives the number of occurrences that share the characteristics common to that row and column.
[2] In simple terms, the interpretation procedure consists of finding out whether there is attraction, repulsion, or indifference between the relevant elements of a given data set. This entails looking for similarities and differences from column to column, from row to row, and between columns and rows.
[3] The rows of the input matrix are usually called individuals, and the columns variables (or attributes, or properties, or observations).
[4] The inertia of a point belonging to a cloud located in space is given by the product of its mass and its squared distance to the centroid of the entire cloud. In statistical language, the inertia is the analogue of the variance, since the mass can be seen as the relative frequency of a distribution of points, the geometric centroid being calculated as the weighted average of the set of coordinates defining the entire cloud.
[5] Such an interpretation becomes easier since the dimensionality of the initial table is reduced when it is converted into a small number of graphs.
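The artificial ellipsoid example of Fig. A-1 can be imitated numerically. The sketch below is only an illustration under simplifying assumptions that are not part of the text: the cloud is simulated, every point has the same mass, and the ordinary Euclidean distance is used instead of the metric adopted by CA. The variable names and the chosen axis lengths are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical cloud: 500 individuals described by 3 properties,
# stretched unequally along three directions (an ellipsoid).
axis_lengths = np.array([5.0, 2.0, 0.3])
cloud = rng.normal(size=(500, 3)) * axis_lengths

# Rotate the ellipsoid so that its principal axes do not coincide
# with the original reference axes (prop. 1, 2 and 3).
rotation, _ = np.linalg.qr(rng.normal(size=(3, 3)))
cloud = cloud @ rotation.T

# Inertia matrix about the centroid (equal masses 1/n assumed).
centered = cloud - cloud.mean(axis=0)
inertia_matrix = centered.T @ centered / len(cloud)

# Principal axes = eigenvectors, importance = eigenvalues.
eigenvalues, eigenvectors = np.linalg.eigh(inertia_matrix)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Fraction of the total inertia conveyed by each axis.
fractions = eigenvalues / eigenvalues.sum()

# Plot (1) of Fig. A-1: projection onto the plane AXIS 1 vs. AXIS 2.
plane_12 = centered @ eigenvectors[:, :2]

print("fraction of inertia per axis:", np.round(fractions, 4))
```

In this simulated cloud the third axis carries only a small fraction of the inertia, so the 2D map `plane_12` retains almost all of the geometry of the 3D cloud.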

The initial ellipsoidal cloud, originally given in the reference frame of the columns (whose axes are denoted prop. 1, 2 and 3 in Fig. A-1) that describe the individuals (rows) of the input matrix, is rotated in an optimal way to a new reference frame defined by the principal axes of the ellipsoid. The geometric meaning of the importance of each axis, assessed by the quantities called eigenvalues and represented in Fig. A-1 by $\lambda_1$, $\lambda_2$, $\lambda_3$, is now clear: each eigenvalue measures the spread of the projections onto the corresponding principal axis, which is directly related to the variability of the cloud of points along that axis. If one decides to disregard Axis 3 (because $\lambda_3$ is considered negligible with respect to $\lambda_1$ and $\lambda_2$), plot (1) of Fig. A-1 is the representation of the initial cloud that achieves the required dimensionality reduction (from 3D to 2D) with a minimum loss of information, quantified by the ratio

$$\frac{\lambda_3}{\lambda_1 + \lambda_2 + \lambda_3}$$

As opposed to classical multivariate statistics, the significance of an axis is not judged on the grounds of any hypothesis testing procedure based on unverifiable assumptions, but depends only on its explanatory power in the context of the problem to be approached by the CA modeling methodology. Hence, CA calls for a joint effort by the data analyst and the expert in the scientific domain to which the data refer, the latter being responsible for the interpretation, according to the rules made available by the former.

The rules offered by the data analyst stem directly from the CA algorithm, as it was put forward by Benzécri and his followers. The first question to be handled before applying the algorithm is to ensure that the input matrix may be considered as a concatenation of contingency tables cross-tabulating two qualitative variables [6]. Then, the graphs produced by the CA program should be scrutinized in order to retain for interpretation a small number of them, able to explain a reasonable fraction of the total cloud inertia. These graphs are interpreted not only in terms of the pattern emerging from the row and column projections, but also by determining the quality of each point's representation with respect to an axis. This quality is measured by the absolute contribution of the point's projection to the inertia associated with the axis, given by the fraction of this inertia assigned to each particular point. Also, the relative contribution, evaluated by the angle between the vector representing the point in the original cloud and its projection onto the axis, is an additional measure of the quality of the point's representation.

Once a first interpretation scheme has been outlined, it is generally necessary to modify the coding of the variables (and sometimes of the individuals) in order to improve the results. Such improvement by interactive coding is considered satisfactory when the model emerging from the CA outputs is accepted by both the data analyst and the data expert. This model ranges from a simple discourse about the meaning of the axes to a complex set of quantified relationships between data sets. In any case, the model should follow the data, not the reverse, according to Benzécri's dictum. Furthermore, in line with the CA paradigm, the model obtained through the methodology is not validated by any statistical test of hypothesis, but by its ability to yield valuable and helpful insights into the issue addressed by the expert in the scientific domain to which the data belong.

[6] In particular, this requirement implies that sums along rows and columns are allowed by the nature of the data, producing values that make sense in the context of the problem to be approached by the methodology.

DATA PREPARATION

In a variety of environmental studies based on empirical observations, it is very common to arrange the results of such observations in data tables of the form symbolically represented in Fig. A-2, in which the values of the observed variables ($j = 1, \ldots, p$) are recorded for each of the physical units ($i = 1, \ldots, n$) where those variables were captured (denoted individuals).

Fig. A-2 Generic table recording the results of p observations in n individuals [$K(i, j)$ is a numeric value providing the result of observation $j$ for the individual $i$]

The table depicted in Fig. A-2 may be viewed as an $n \times p$ matrix, whose $n$ rows are the individuals and whose $p$ columns are the attributes observed in those individuals. Such attributes may be expressed by real numbers, which give the value of some measure of a quantitative continuous variable, or by integers denoting the modality (or category, or class) of a qualitative variable.

Historically, Correspondence Analysis was developed by Jean-Paul Benzécri for the case of two qualitative variables cross-tabulated in a contingency table (Fig. A-3). In subsequent developments, an appropriate concatenation of such contingency tables, each expressing the cross-tabulation of two variables, became the compulsory input for CA. This entails transforming the most common data model of Fig. A-2 into a set of two-way contingency tables, arranged as blocks of a new matrix.

Fig. A-3 Contingency Table

In the two-way contingency table of Fig. A-3, two qualitative variables VAR 1 and VAR 2 are put in correspondence through the absolute frequencies $K(i, j)$ of co-occurrences of modalities $i$ and $j$. The SUM along columns gives the total number $K(i)$ of occurrences of VAR 2 modalities and, along rows, the total number $K(j)$ of occurrences of VAR 1 modalities. $K$ is the global number of occurrences, which is the SUM of the absolute frequencies given in $K(i)$ or in $K(j)$; these can be viewed as histograms (expressed in absolute frequencies) for VAR 2 and VAR 1, respectively.
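The margins of a contingency table such as the one in Fig. A-3 can be checked with a few lines of code. This is a minimal sketch using NumPy; the counts are invented for illustration and the names `K_ij`, `K_i`, `K_j` simply mirror the notation $K(i, j)$, $K(i)$, $K(j)$ of the text.

```python
import numpy as np

# Hypothetical 3 x 4 contingency table of absolute frequencies K(i, j):
# one row per modality i, one column per modality j.
K_ij = np.array([[20,  5,  0, 10],
                 [ 3, 15,  7,  5],
                 [ 1,  4, 25,  5]])

K_i = K_ij.sum(axis=1)   # SUM along columns: one total K(i) per row
K_j = K_ij.sum(axis=0)   # SUM along rows: one total K(j) per column
K = K_ij.sum()           # global number of occurrences

print("K(i) =", K_i)
print("K(j) =", K_j)
print("K    =", K)
```

Both margins add up to the same grand total $K$, which is the property that makes sums along rows and columns meaningful for this kind of table.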

In order to put the model depicted in Fig. A-2 into a format suited to some CA applications, the first step is to transform it into a complete disjunctive matrix, denoted $D$ and shown symbolically in Fig. A-4. This matrix displays the $n$ individuals as rows and the $p$ modalities of a set of $q$ qualitative variables as columns. For each individual, the value 1 is assigned to the modality that occurs in that individual (and 0 to the others), for the entire set of the $q$ qualitative variables.

Fig. A-4 Complete disjunctive matrix D

It is worth noting that the complete disjunctive matrix is already a juxtaposition of contingency tables. In fact, each block referring to a given variable is the contingency table crossing the individuals with the absolute frequency of the modalities into which that variable is deployed (in this particular case, those frequencies take only two values: one, if the modality occurs for a certain individual, and zero otherwise).

Taking into account that $j$ stands for a modality in Fig. A-3 and for a qualitative variable (deployed into several modalities) in Fig. A-4, the representations given in Figures A-3 and A-4 can be related. In fact, $K(i) = q$, because the sum of one row along the columns of Fig. A-4 equals the number of variables (each variable contains the value 1 in exactly one modality, the others being given the value 0). Moreover, $K = nq$, because $n$ is the total absolute frequency of all modalities for each of the $q$ histograms obtained when the columns of Fig. A-4 are summed along the rows.
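As an illustration of this coding, the sketch below builds a complete disjunctive matrix from a small table of modality codes and checks the two identities $K(i) = q$ and $K = nq$. The data and the helper name `complete_disjunctive` are hypothetical, introduced only for this example.

```python
import numpy as np

# Hypothetical coded data: n = 5 individuals, q = 2 qualitative variables.
# Each entry is the index of the modality observed for that individual
# (variable 0 has 3 modalities, variable 1 has 2 modalities).
codes = np.array([[0, 1],
                  [2, 0],
                  [1, 1],
                  [0, 0],
                  [2, 1]])
n_modalities = [3, 2]


def complete_disjunctive(codes, n_modalities):
    """Juxtapose one 0/1 indicator block per qualitative variable."""
    blocks = [np.eye(m, dtype=int)[codes[:, v]]
              for v, m in enumerate(n_modalities)]
    return np.hstack(blocks)


D = complete_disjunctive(codes, n_modalities)
n, q = codes.shape

print(D)
print("row sums K(i):", D.sum(axis=1))   # every row sums to q
print("grand total K:", D.sum())         # equals n * q
```

Each block of columns contains exactly one 1 per row, which is why every row total equals the number of variables.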

Hence, any table following the generic model of Fig. A-2 can be submitted to the CA algorithm, provided that it is first transformed into a complete disjunctive matrix $D$. If such a table contains quantitative variables, these should be split into meaningful classes by inspection of their empirical distributions, and any real number $K(i, j)$ appearing in Fig. A-2 should be replaced by an array containing the value one in the class where $K(i, j)$ falls and zero in all the other classes. As a result, the quantitative variable is converted into a set of categories (APPENDIX A).

Instead of using the complete disjunctive matrix $D$ as input, it is in general advisable to transform $D$ into the Burt matrix $B$ (Fig. A-5) by multiplying the transpose of $D$ (denoted $D^{T}$) by $D$ itself ($B = D^{T} D$).

Fig. A-5 Burt Matrix ($q$ is the number of qualitative variables, whose total number of modalities is $p$, and $A_{jj'}$ stands for the contingency table cross-tabulating variable $j$ by variable $j'$)
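The two operations just described (splitting a quantitative variable into classes and forming the Burt matrix) can be sketched as follows. The measurements, the class limits and the second variable are all invented for the example; in practice the limits would be chosen by inspecting the empirical distribution, as the text recommends.

```python
import numpy as np

# Hypothetical quantitative variable observed on 6 individuals.
values = np.array([0.2, 1.5, 3.7, 2.2, 0.9, 3.1])

# Hypothetical class limits giving 3 classes: below 1, from 1 up to 3, 3 and above.
inner_limits = np.array([1.0, 3.0])
classes = np.digitize(values, inner_limits)   # class index 0, 1 or 2
D1 = np.eye(3, dtype=int)[classes]            # indicator block of the coded variable

# Hypothetical second variable, already qualitative, with 2 modalities.
codes2 = np.array([0, 1, 1, 0, 0, 1])
D2 = np.eye(2, dtype=int)[codes2]

# Complete disjunctive matrix and Burt matrix B = D^T D.
D = np.hstack([D1, D2])
B = D.T @ D

print(B)
```

The diagonal blocks of `B` hold the histogram of each variable on their diagonals, and the off-diagonal block is the contingency table crossing the three classes with the two modalities.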

Furthermore, the Burt matrix [7], apart from being the input for the CA algorithm, contains all the information required by the classical treatment of questionnaires [8] when the aim of the survey performed through such questionnaires is purely descriptive. In fact, its diagonal blocks give the histogram of modalities for each question, and the non-diagonal blocks contain the cross-tabulations of all pairs of questions.

In summary, the basic data preparation required to apply the CA algorithm is to transform the available raw data into an input matrix that can be viewed as a juxtaposition or as a concatenation of contingency tables [9].

[7] This matrix consists of all two-way cross-tabulations of a set of categorical variables, including the cross-tabulation of each variable with itself. It is the analogue, for qualitative variables, of the variance-covariance matrix for factorial methods based on quantitative variables. In fact, the diagonal blocks of the Burt matrix are the analogue of variances, and the non-diagonal ones of covariances.
[8] Conversely, by viewing each qualitative variable of any input table as a question with response categories, it may as well be stated that we end up with a quasi-universal coding format: the questionnaire.
[9] Before running any CA program, the data to be input should be scrutinized to ensure that the resulting format can be considered as a contingency table, or more commonly as a juxtaposition or as a concatenation of such tables (the complete disjunctive matrix and the Burt matrix described above being the most common particular instances of juxtaposition and concatenation, respectively).

INITIAL TRANSFORMATION OF THE INPUT MATRIX INTO PROFILES

Any $n \times p$ matrix to be input to the CA algorithm can be depicted in geometric terms as a cloud of $n$ rows in the column space $\mathbb{R}^p$ or as a cloud of $p$ columns in the row space $\mathbb{R}^n$. Given the symmetric character of contingency tables, these two views are equivalent in semantic terms, since it makes no difference whether the matrix shown in Fig. A-3 or its transpose is used. Hence, the transformations performed on rows have their counterpart for columns, obtained by substituting the index $i$ by $j$, and conversely. Bearing this argument in mind, and for the sake of parsimony, most formulae in this text are expressed in terms of rows (whenever necessary, the corresponding formulae for columns are derived from the former by changing $i$ into $j$).

Each point of the cloud representing the rows of the input matrix illustrated in Fig. A-3 is a vector in the space $\mathbb{R}^p$ whose coordinates define the row profile $f_j^i$, given by:

$$f_j^i = \frac{f_{ij}}{f_i} = \frac{K(i,j)/K}{K(i)/K} = \frac{K(i,j)}{K(i)}$$

where

$$f_{ij} = \frac{K(i,j)}{K}, \qquad f_i = \frac{K(i)}{K}, \qquad K(i) = \sum_{j=1}^{p} K(i,j), \qquad K = \sum_{i=1}^{n}\sum_{j=1}^{p} K(i,j)$$

The profile of row $i$, $f_j^i = f_{ij}/f_i$, expresses in a standard way how the individual $i$ is described in terms of the available set of properties. Since $K(i,j)$ is expressed in absolute frequencies in a contingency table [10], the coordinates of a row $i$ correspond to its relative frequencies, calculated with respect to the row total $K(i)$. In an input table like the one given in Fig. A-3, two individuals assigned to rows $i$ and $i'$ have a similar profile when the values of their original properties $K(i,j)$ and $K(i',j)$ are roughly proportional. Moreover, the profile of a given individual $i$ along the columns $j$ adds up to 1, as shown below:

$$\sum_{j=1}^{p} f_j^i = \sum_{j=1}^{p} \frac{f_{ij}}{f_i} = \sum_{j=1}^{p} \frac{K(i,j)}{K(i)} = \frac{K(i)}{K(i)} = 1$$

Since the coordinates of the individuals satisfy this relationship, the individuals can be represented in a space of dimension $p - 1$. This is a specific advantage of CA with regard to the main objective of data analysis methods that aim to facilitate interpretation through dimensionality reduction. In contrast to other factorial methods (like Principal Components Analysis), this feature of CA stems from the fact that any contingency table contains additional information that is not used in the PCA input table: the total number of occurrences, which may be derived from the $n \times p$ data as

$$K = \sum_{i=1}^{n}\sum_{j=1}^{p} K(i,j)$$

An example of the dimensionality shrinkage caused by the fact that profiles add up to 1 is shown in Fig. A-6: all individuals $i$ described by 3 properties lie ab initio on the plane E, given by $x_3 = 1 - x_2 - x_1$ (just by effect of the initial transformation, the dimensionality of the space is reduced from 3 to 2).

Fig. A-6 Plane containing all individuals i characterized by 3 properties

For a 2D case, when the input matrix contains only two columns $j$ and $j'$, the profiles of the rows $i$ can be represented graphically in a scatterplot, as shown in Fig. A-7.

Fig. A-7 Geometric representation of rows in an n x 2 contingency table ($X = f_j^i = f_{ij}/f_i$ and $Y = f_{j'}^i = f_{ij'}/f_i$)

[10] It is worth noting that the notion of contingency table can be generalized beyond the case of counts that give rise to absolute frequencies. In fact, CA can be properly applied to any table of homogeneous positive numbers for which it makes sense to express them in relative amounts. Hence, the sum along columns and rows should be allowed by the nature of the data (in particular, a common unit should be given to all elements of the matrix), and a reliable meaning should be conferred on that SUM [denoted $K(i)$ and $K(j)$ in Fig. A-3]. Also, each element of a row (or a column) divided by $K(i)$ [or $K(j)$] should generate a meaningful ratio. Moreover, adding $K(i)$ along rows and $K(j)$ along columns, the same meaningful amount $K$ should be obtained.
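The profile transformation and the resulting loss of one dimension can be checked on a small table. The counts below are hypothetical and chosen so that the first two rows are exactly proportional; the array names follow the notation of the text.

```python
import numpy as np

# Hypothetical 3 x 3 table of absolute frequencies K(i, j).
# Row 1 is five times row 2, so both should have the same profile.
K_ij = np.array([[20.0, 10.0, 10.0],
                 [ 4.0,  2.0,  2.0],
                 [ 5.0, 15.0, 30.0]])

K_i = K_ij.sum(axis=1, keepdims=True)   # row totals K(i)
profiles = K_ij / K_i                   # row profiles K(i, j) / K(i)

print(profiles)
print("profile sums:", profiles.sum(axis=1))

# With 3 properties, every profile lies on the plane x3 = 1 - x2 - x1.
x1, x2, x3 = profiles.T
print("on plane E:", np.allclose(x3, 1 - x2 - x1))
```

Proportional rows collapse onto the same point of the plane, which is the sense in which CA compares shapes of rows rather than their raw sizes.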

DEFINITION OF GEOMETRIC CONCEPTS (MASS, DISTANCE AND CENTROID)

Once the cloud of points has been placed in space by its coordinates (the profiles defined above), a mass should be assigned to each point in order to account for its significance in terms of the number of cases reported in the contingency table that the point embodies. For the sake of comparison, it is reasonable to define the mass of each point by $f_i = K(i)/K$, a measure that accounts for its relative magnitude with respect to the bulk of the entire cloud. Hence, ceteris paribus, the bigger the mass of a point, the greater its contribution to attracting a principal axis to its neighborhood.

Another feature needed to delineate a geometric representation problem is a distance that measures how near (or far) the points of the cloud are from one another in space. In a contingency table, the usual Euclidean distance is not appropriate to account for such geometric inter-relations, which are to be mediated through the directions of spread of the cloud representing the input table, i.e., its principal axes. In fact, the Euclidean distance treats all coordinates equally, whereas what is needed is to compensate for the discrepancy between frequencies via a weighting procedure. This procedure ensures that options that occur less frequently contribute more to the inter-profile distance, while those that occur more frequently contribute less. According to Benzécri, the $\chi^2$ distance $d^2(i, i')$, as defined below, is well suited to overcome the drawbacks of the usual Euclidean distance for the case of contingency tables:

$$d^2(i, i') = \sum_{j=1}^{p} \frac{1}{f_j}\left(\frac{f_{ij}}{f_i} - \frac{f_{i'j}}{f_{i'}}\right)^2 = \sum_{j=1}^{p} \frac{1}{f_j}\left(f_j^i - f_j^{i'}\right)^2$$

where $f_j = K(j)/K$ is the relative frequency of column $j$.

A key concept in the geometric representation of the cloud of individuals is its centroid. This is a point in space which is not necessarily located at the geometric centre of the cloud, but which accounts for the mass assigned to each row of the input matrix. The coordinates $g_j$ of the centroid $G_I$ of the $n$ individuals in the space $\mathbb{R}^p$ are defined as:

$$g_j = \sum_{i=1}^{n} \underbrace{f_i}_{\text{mass}}\;\underbrace{f_j^i}_{\text{coordinates}} = \sum_{i=1}^{n} f_i \frac{f_{ij}}{f_i} = \sum_{i=1}^{n} f_{ij} = \sum_{i=1}^{n} \frac{K(i,j)}{K} = \frac{K(j)}{K} = f_j$$

It is worth noting that each coordinate of the row centroid corresponds to the relative frequency of the column to which that coordinate refers. This is a consequence of the complete symmetry between rows and columns, stemming from the specific arrangement of contingency tables (in Fig. A-3, it makes no difference whether the modalities of VAR 1 are organized as rows or as columns, and the same obviously applies to VAR 2).

CONSTRUCTION OF THE PRINCIPAL AXES OF INERTIA AND PROJECTION PLOTS

The basic trait of CA as a factorial technique is that the cloud of points does not stretch equally in every direction [11]. Hence, in order to reduce the dimensionality of the space where the input table is to be interpreted, it is necessary to find the directions of maximum spread of the cloud, i.e., the principal axes of inertia of the set of points representing the matrix rows. These axes are obtained by a procedure that involves the eigenvalue decomposition [12] of the inertia matrix obtained from the input table by calculating its moments and products of inertia. This procedure, analogous to the classical multivariate least-squares orthogonal distance fit, can be viewed in intuitive terms as follows:

1. Take the centroid of the cloud.
2. From this point, move a straight line in all directions, sweeping the entire space where the cloud is positioned.
3. For each direction, calculate the sum of the squared distances from each point of the cloud to the sweeping straight line.
4. Select the direction that minimizes this sum of squares and calculate, for that direction, the sum of the squared projections of every cloud point onto the straight line.
5. Take the vector representing this direction as the first axis of inertia of the cloud, and the sum of squared projections as a measure of the importance of that axis [13] (the former is the first eigenvector of the inertia matrix, according to B-3, and the latter is its first eigenvalue).
6. Take a new straight line, lying in the plane orthogonal to the first one, and repeat the algorithm, finding the second axis and its importance. Once this axis has been obtained, iterate the procedure until $p - 1$ axes are extracted (if $p < n$).

Through this modus operandi, a set of $p - 1$ principal axes, sorted in descending order of importance, is produced. At this stage, the cloud of points representing the rows of the input matrix can be projected onto the axes previously obtained. With $u_{\alpha j}$ denoting coordinate $j$ of axis $\alpha$, the projection $f'_{i\alpha}$ of an individual $i$ onto that axis is given by the scalar product of two vectors, one giving the axis direction and the other the position of the individual in the space $\mathbb{R}^p$ (corrected so as to transform the $\chi^2$ distance into the usual Euclidean distance), as follows:

$$f'_{i\alpha} = \sum_{j=1}^{p} \frac{f_{ij}}{f_i \sqrt{f_j}}\, u_{\alpha j}$$

As a consequence of the symmetry of the input table, the columns of the data matrix can also be projected onto the same axes by applying the transition formulae given below:

$$f'_{j\alpha} = \frac{1}{\sqrt{\lambda_\alpha}} \sum_{i=1}^{n} \frac{f_{ij}}{f_j}\, f'_{i\alpha}$$

$$f'_{i\alpha} = \frac{1}{\sqrt{\lambda_\alpha}} \sum_{j=1}^{p} \frac{f_{ij}}{f_i}\, f'_{j\alpha}$$

where $f'_{j\alpha}$ is the projection of column $j$ onto axis $\alpha$, whose eigenvalue is $\lambda_\alpha$.

Hence, at this stage, two sets of tables are produced, exhibiting the coordinates of the projections of rows and columns onto the same axes. Now, selecting a small [14] number of axes, these can be put into graphical form, displaying the projections of rows and columns onto the principal planes defined by these axes, as illustrated in Fig. A-8 for a particular point.

Fig. A-8 Projection of a point of the cloud onto a principal plane

The principal planes mentioned above, which are maps characterized by a metric leading to the same distance scale in all directions, are constructed by crossing Axis 1 with each of the other selected axes in Cartesian graphs. These maps represent successive sections of the original cloud of points, sorted in descending order of importance. Moreover, the selected sections are optimal, in the sense that they minimize the loss of information when the cloud is replaced by such an array of plots. If they are amenable to interpretation, this set of plots shows the original $\mathbb{R}^p$ and $\mathbb{R}^n$ constellations of points in a useful form, reducing the dimensionality of the problem in the most favorable way.

[11] In the case where the cloud could be assimilated to a hyper-sphere, no direction would differ from the others as far as the spread of points is concerned. This would indicate that there is no affinity between the rows and columns of the input matrix, and consequently CA would yield a number of equivalent axes, whose putative interpretation is pointless.
[12] The eigenvalue decomposition of a symmetric matrix is optimal in terms of least squares.
[13] The importance of an axis is a measure of the conformity of the initial cloud to its projection onto the axis. The more important an axis is, the less deformed the cloud is when it is reduced to its projections onto that axis.
[14] The number of axes to be selected depends on a trade-off between their importance, given by the fraction of inertia conveyed by their eigenvalues, and the context in which the problem at hand is situated, which requires that all selected axes be interpretable.
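The definitions of mass, centroid and $\chi^2$ distance, together with the extraction of the principal axes, can be followed numerically in the sketch below. It is only one possible way of carrying out the computation, under assumptions that are not stated in the text: the table is invented, the inertia matrix is built from the rescaled coordinates $f_{ij}/(f_i\sqrt{f_j})$ without prior centering, and the resulting trivial eigenvalue equal to 1 (whose eigenvector points from the origin to the centroid) is discarded. All variable names are hypothetical.

```python
import numpy as np

# Hypothetical 4 x 4 table of absolute frequencies K(i, j).
K_ij = np.array([[20.0,  5.0,  0.0, 10.0],
                 [ 3.0, 15.0,  7.0,  5.0],
                 [ 1.0,  4.0, 25.0,  5.0],
                 [10.0, 10.0, 10.0, 10.0]])

F = K_ij / K_ij.sum()          # relative frequencies f_ij
f_i = F.sum(axis=1)            # row masses f_i = K(i) / K
f_j = F.sum(axis=0)            # column relative frequencies f_j = K(j) / K
profiles = F / f_i[:, None]    # row profiles f_j^i

# Centroid of the row cloud: mass-weighted average of the profiles.
centroid = f_i @ profiles


def chi2_sq(a, b):
    """Squared chi-square distance between two row profiles."""
    return np.sum((a - b) ** 2 / f_j)


# Total inertia: sum over rows of mass times squared distance to the centroid.
total_inertia = sum(m * chi2_sq(p, centroid) for m, p in zip(f_i, profiles))

# Rescaled coordinates in which the chi-square distance becomes Euclidean.
X = profiles / np.sqrt(f_j)

# Inertia matrix (not centered) and its eigenvalue decomposition.
T = X.T @ (f_i[:, None] * X)
eigenvalues, U = np.linalg.eigh(T)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, U = eigenvalues[order], U[:, order]

# Assumption of this sketch: drop the trivial axis (eigenvalue 1).
trivial = eigenvalues[0]
lam, U = eigenvalues[1:], U[:, 1:]

row_coords = X @ U             # projections of the rows onto the axes
fractions = lam / lam.sum()    # fraction of inertia conveyed by each axis

print("centroid     :", np.round(centroid, 4))
print("eigenvalues  :", np.round(lam, 4))
print("% of inertia :", np.round(100 * fractions, 2))
```

In this sketch the centroid coincides with the column relative frequencies, the retained eigenvalues add up to the total inertia, and the mass-weighted variance of the projections on each axis equals the corresponding eigenvalue.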

Fig. A-9 summarizes the CA algorithm, illustrating how the input table is converted into a graphical output. It is worth noting the crucial role played by the transition formulae, which make it possible to encapsulate the $\mathbb{R}^p$ and $\mathbb{R}^n$ analyses in a single final plot. This final plot is the most economical (and informative) synthetic representation of the input table.

Fig. A-9 Diagram symbolizing the CA algorithm
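The role of the transition formulae can be verified on a small example: the column projections are obtained from the row projections, and applying the reverse formula returns the row projections unchanged. The sketch below assumes an invented table and the same computational route as a non-centered eigenvalue decomposition with the trivial axis removed; these choices and all names are assumptions made for illustration.

```python
import numpy as np

# Hypothetical 4 x 3 table of absolute frequencies K(i, j).
K_ij = np.array([[20.0,  5.0,  2.0],
                 [ 3.0, 15.0,  7.0],
                 [ 1.0,  4.0, 25.0],
                 [10.0, 10.0, 10.0]])

F = K_ij / K_ij.sum()
f_i, f_j = F.sum(axis=1), F.sum(axis=0)

# Row analysis: rescaled profiles, inertia matrix, eigen-decomposition.
X = F / (f_i[:, None] * np.sqrt(f_j))
T = X.T @ (f_i[:, None] * X)
w, U = np.linalg.eigh(T)
order = np.argsort(w)[::-1][1:]        # descending order, trivial axis dropped
lam, U = w[order], U[:, order]
keep = lam > 1e-12                     # guard against numerically null axes
lam, U = lam[keep], U[:, keep]
row_coords = X @ U

# Transition formula: columns from rows.
col_coords = (F.T / f_j[:, None]) @ row_coords / np.sqrt(lam)

# Transition formula: rows recovered from columns.
row_back = (F / f_i[:, None]) @ col_coords / np.sqrt(lam)

print("rows recovered:", np.allclose(row_back, row_coords))
```

Because both sets of coordinates are linked in this way, a single eigenvalue decomposition is enough to place rows and columns on the same axes.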

GRAPHICAL INTERPRETATION OF OUTPUTS

Given a set of graphs where Axis 1 is combined with each of the other pre-selected axes, the interpretation route starts with the most important plot, defined by the first two axes (those that exhibit the largest eigenvalues). Examining this graph (in conjunction with the others), the data analyst (together with the data expert) must give a physical meaning to all axes, producing a discourse that explains their role in terms of similarity (and/or opposition) between properties and/or individuals (columns and/or rows of the input matrix).

Indeed, the whole process of interpretation relies on the meaning of the axes, not on the proximity (or separation) between projections onto the planes (above all if such projections are in the vicinity of the graph's origin). The fact that a certain projection of a row $i$ lies close to a projection of a column $j$ does not imply that $i$ is associated with $j$. No direct row-column distance interpretation is allowed, because of the scaling procedure underlying the method of projecting both individuals and properties onto the same plane. Hence, the joint interpretation of row and column points must be performed with respect to the principal axes of the map.

Therefore, understanding these axes is the first task to be carried out when interpreting the graphical outputs provided by the CA algorithm (bottom of Fig. A-9). To accomplish this task, it is necessary to choose which properties and/or individuals are associated with each axis. This requirement entails the choice of a threshold for a certain measure of association, denoted the Absolute Contribution of individual $i$ to Axis $\alpha$ and given by [15]:

$$Ca_{i\alpha} = \frac{f_i \left(f'_{i\alpha}\right)^2}{\lambda_\alpha}$$

where $f'_{i\alpha}$ is the projection of individual $i$ onto Axis $\alpha$, and $\lambda_\alpha$ is the eigenvalue assigned to Axis $\alpha$.

[15] As noted before, individuals and properties are interchangeable in the CA algorithm (hence, the same formula applies to properties, substituting $i$ by $j$).

This measure of association expresses (in %) the fraction of the total inertia of Axis $\alpha$ that is conveyed by individual $i$. Should the individuals be randomly dispersed around the axis, the Absolute Contribution assigned to each of them would be $100/n$. A natural criterion to spot which individuals can be used to interpret an axis is therefore to impose a threshold on the Absolute Contribution of the set of $n$ individuals, retaining only the subset in which each individual contributes to the axis in a ratio bigger than $100/n$. The retained individuals are such that their distances to the axis, when combined with their masses, are the smallest, which entails that they generally project far from the graph's origin.

Once the individuals (and/or properties) exceeding the threshold $100/n$ (and/or $100/p$) have been selected, their projections onto the graph, which generally lie on the outer parts of the map, close to its edges or periphery, allow the axis to be interpreted in terms of vicinity/separation between individuals/properties [16]. This procedure is repeated for all pre-selected axes [17] until a coherent interpretation of the relevant graphs is reached for the entire set of individuals/properties, as illustrated in Fig. A-10.

The left part of Fig. A-10 shows some usual configurations obtained when a cloud of points is projected onto the plane crossing Axes 1 and 2 provided by the CA algorithm. The right part of Fig. A-10 represents the input matrix after being rearranged in ascending order according to the projections of its rows onto Axis 1 (the first row of the rearranged matrix exhibits the minimum projection onto Axis 1).

[16] At this stage, it must be stressed once more that CA deals with profiles, which means that one does not interpret the raw frequencies given in the input table, but rather their values relative to the SUM of the respective row or column. In effect, comparing individuals or properties always means comparing their profiles.
[17] It is not unusual for the number of pre-selected axes to be above (or below) the minimum number of axes required for interpretation, in which case the interpretation requirement prevails. It may also happen that a certain plane, although important as far as the eigenvalues of its axes are concerned, does not add any relevant improvement to the interpretation process. In this case, such a plane should be disregarded in favor of a less important one that could be more informative for interpretation purposes. An axis should not be ignored merely because it has a relatively small eigenvalue (often such an axis helps to make a strong point about the data). As a general rule based on our experience, it is very exceptional for CA-based case studies to need more than 3 or 4 axes to reach a coherent interpretation of large data tables.
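The Absolute Contribution and the $100/n$ retention rule can be sketched in code as follows. The table is invented, and the axes are computed by a non-centered eigenvalue decomposition with the trivial axis removed, which is an assumption of this example rather than a prescription of the text.

```python
import numpy as np

# Hypothetical 5 x 3 table of absolute frequencies K(i, j).
K_ij = np.array([[30.0,  4.0,  2.0],
                 [ 6.0, 12.0,  8.0],
                 [ 2.0,  5.0, 28.0],
                 [10.0, 11.0,  9.0],
                 [ 9.0, 10.0, 12.0]])
n = K_ij.shape[0]

F = K_ij / K_ij.sum()
f_i, f_j = F.sum(axis=1), F.sum(axis=0)

# Principal axes of the row cloud (trivial axis dropped).
X = F / (f_i[:, None] * np.sqrt(f_j))
T = X.T @ (f_i[:, None] * X)
w, U = np.linalg.eigh(T)
order = np.argsort(w)[::-1][1:]
lam, U = w[order], U[:, order]
row_coords = X @ U

# Absolute Contribution (in %) of each individual to each axis.
Ca = 100.0 * f_i[:, None] * row_coords ** 2 / lam

# Individuals retained to interpret Axis 1: contribution above 100 / n.
threshold = 100.0 / n
retained_axis1 = np.flatnonzero(Ca[:, 0] > threshold)

print("contributions to Axis 1 (%):", np.round(Ca[:, 0], 1))
print("threshold (%):", threshold)
print("retained rows (0-based):", retained_axis1)
```

For each axis the contributions add up to 100 %, so the threshold $100/n$ is simply the average contribution; the same computation applies to columns with $f_j$, the column projections and the threshold $100/p$.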

Fig. A-10 Typical configurations of the projections onto plane 1,2 of a cloud submitted to CA, and corresponding input matrices re-arranged according to the projections of their rows onto Axis 1

Assuming that all elements projected onto the planes given in Fig. A-10 are relevant for Axes 1 and/or 2 (in the previously described sense that the given threshold for their Absolute Contributions is exceeded), Fig. A-10 shows the equivalence between the graphical and the tabular form of the input matrices. This equivalence is an important aid to interpretation, since results can be matched with the original data.

In case (1) of Fig. A-10, Axis 1 clearly separates two areas, both in graphical and in tabular terms. These are denoted A and B in the projections plot, and Axis 1 is interpreted on the grounds of its ability to separate the two groups of elements A and B, each of which exhibits strong inner homogeneity (elements belonging to the same group are similar to one another). In this case, Axis 2 has no relevance for interpretation purposes (this should be checked by inspecting the Absolute Contributions to Axis 2). Regarding the tabular form represented in the right part of Fig. A-10, the input matrix, after being re-arranged according to its projections onto Axis 1, shows two diagonal blocks containing the elements displayed in the graph as clusters A and B. These blocks hold the highest values of the absolute frequencies (for a matrix standing for a two-way contingency table), whereas the non-diagonal blocks contain the smallest values (denoted 0 in Fig. A-10).

In case (2) of Fig. A-10, three groups of individuals and/or properties emerge. Axis 1 separates projections A from C, and Axis 2 differentiates (A+C) from B, assuming that all elements projected in the graph meet the condition that their Absolute Contribution exceeds 100/n and/or 100/p for Axes 1 and 2. When the input matrix is re-arranged in the same way as in case (1), three diagonal blocks are obtained (following the sequence driven by Axis 1). These blocks are bordered by quasi-null elements, as depicted in the right part of Fig. A-10 (2).

Case (3) of Fig. A-10 is very common in CA outputs, especially for ordinal variables (i.e., qualitative variables whose modalities are sequenced). Such a pattern, similar to a parabolic crescent, is known as the Guttman effect, and requires only one coordinate to identify the sequence of individuals along Axis 1. The data matrix, after being sorted according to the previously described procedure, exhibits a diagonal structure, where the central (diagonal) elements present much higher values than the non-diagonal ones.

The structures recognized in Fig. A-10 are a sound basis for the descriptive interpretation of data without calling for any statistical hypothesis. In addition, they are the grounds on which a preliminary modeling design can be performed, providing some explanatory power to the CA methodology per se.
In fact, the groups of individuals obtained in cases (1) and (2) may be the basis for establishing an empirical typology, which is in general more fruitful than those produced by most clustering algorithms, since the groups produced by CA are explained by the properties connected with the axes responsible for the formation of the groups (APPENDIX B). Moreover, in case (3), scaling indices holding a certain significance in terms of properties may be produced. Such indices sort the sequence of individuals quantitatively by means of a single meaningful real number (the Axis 1 coordinate), related to the modalities of the variables that drive the spread along Axis 1 (APPENDIX B).
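The correspondence between the plot and the re-arranged table in case (1) can be reproduced numerically. The following is a minimal sketch in Python with NumPy, not the author's implementation: it assumes the usual formulation of CA as a singular value decomposition of the standardized residuals, and the table `K` and the function name `correspondence_analysis` are purely illustrative.

```python
import numpy as np

def correspondence_analysis(K):
    """Eigenvalues, row projections and column projections of a non-negative table K."""
    K = np.asarray(K, dtype=float)
    P = K / K.sum()                          # relative frequencies
    r, c = P.sum(axis=1), P.sum(axis=0)      # row and column masses
    # Standardized residuals with respect to independence of rows and columns
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sing, Vt = np.linalg.svd(S, full_matrices=False)
    keep = sing > 1e-10                      # discard numerically null axes
    U, sing, Vt = U[:, keep], sing[keep], Vt[keep]
    F = U * sing / np.sqrt(r)[:, None]       # projections of the rows
    G = Vt.T * sing / np.sqrt(c)[:, None]    # projections of the columns
    return sing ** 2, F, G

# Hypothetical two-way table hiding two groups (rows 0, 2 and rows 1, 3)
K = np.array([[9, 0, 8, 1],
              [0, 7, 1, 9],
              [8, 1, 9, 0],
              [1, 9, 0, 8]])
eigvals, F, G = correspondence_analysis(K)

# Re-arrange rows and columns according to their projections onto Axis 1
row_order = np.argsort(F[:, 0])
col_order = np.argsort(G[:, 0])
K_sorted = K[np.ix_(row_order, col_order)]
print(K_sorted)   # the high counts gather in two diagonal blocks
```

Sorting by a single coordinate is enough here because Axis 1 carries almost all of the inertia of this toy table; on real data the Absolute Contributions should be checked first, as described above.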

In most cases, the interpretation process does not take the holistic flavor that is patent in Fig. A-10, due to the nature of the data (which do not allow a global reading of the graphical outputs based only on Axis 1). Generally, several axes (but no more than three or four) are needed, and the interpretation is made on the grounds of such axes, which are related, through the application of the threshold criterion, to the individuals and/or properties they are associated with.

In addition to the context provided by the data expert, the format of the input matrix must be taken into account in the interpretation process. In fact, even though complete disjunctive and Burt matrices give rise to a similar pattern when their graphical outputs are compared, the eigenvalues of their axes differ. Moreover, the total number of axes provided by CA depends on the data matrix format: for a contingency table, it amounts to p-1 if p<n (or to n-1 if n<p); for a juxtaposition of q contingency tables (in particular, for a complete disjunctive matrix), it amounts to p-q (assuming n>p); for a Burt matrix, it amounts to p-q (where p is the dimension of the square matrix containing q blocks).

Furthermore, in contrast to other classical factorial methods like Principal Components Analysis, it should be emphasized that CA captures non-linear relationships. This foremost feature of CA recommends applying the methodology even in some cases holding only purely quantitative variables, after splitting them into classes (see APPENDIX C, which describes a case study involving circular quantitative attributes).

SUPPLEMENTARY PROJECTIONS

In some instances, the input matrix to be submitted to the CA algorithm is heterogeneous, i.e., blocks of a different kind may be acknowledged in the data table, both for individuals and for variables. In this case, it may be fruitful for the sake of interpretation to split the matrix into a series of homogeneous blocks, whose mutual relationships are then to be found.
To this end, CA offers a specific procedure, denoted Supplementary Projection, which makes it possible to interpret, in graphical form, how those blocks relate. The Supplementary Projection procedure consists of selecting a block of homogeneous individuals and/or properties, designated as Principal or Active, and applying the CA algorithm using only this block to produce the axes, which are interpreted as described above in terms of the rows and columns of the principal block. After that, individuals and/or properties belonging to the other blocks (denoted illustrative) are projected as supplementary elements onto the axes derived from the active block, according to the following rationale.

Given the profile $f^{+}_{ij} / f^{+}_{i}$ of a supplementary individual, its projection onto the axes provided by the eigenvalue decomposition of the inertia matrix corresponding to the principal block is written as:

$$f'^{+}_{i\alpha} = \frac{1}{\sqrt{\lambda_\alpha}} \sum_{j=1}^{p} \frac{f^{+}_{ij}}{f^{+}_{i}} \, f'_{j\alpha}$$

where $f'_{j\alpha}$ are the projections of the principal matrix columns onto the axes $\alpha$ of the same matrix (the eigenvalues of which are denoted $\lambda_\alpha$). The same applies, mutatis mutandis, to the projection of supplementary properties:

$$f'^{+}_{j\alpha} = \frac{1}{\sqrt{\lambda_\alpha}} \sum_{i=1}^{n} \frac{f^{+}_{ij}}{f^{+}_{j}} \, f'_{i\alpha}$$

By means of the above formulae for supplementary projections, all blocks of the initial matrix that were not considered as Principal are related to the axes provided by the latter.

The problem now arises of how to judge the strength of the relationship between a given supplementary element and the axes produced by the CA algorithm when applied to the principal matrix. A natural way of achieving this goal is to measure the angle between the location of the supplementary element in the original space and each axis derived from the principal matrix. This measure is denoted Relative Contribution, given by:

$$C^{r}_{\alpha i} = \frac{f'^{2}_{i\alpha}}{\rho^{2}} = \cos^{2}\beta$$

where $\rho^{2} = \sum_{\alpha} f'^{2}_{i\alpha}$ is the squared distance from the supplementary individual i to the centroid of the cloud representing the principal matrix, and $\beta$ is the angle between the vector representing i and axis $\alpha$. The corresponding formula for the Relative Contribution of an axis to a supplementary variable j is

$$C^{r}_{\alpha j} = \frac{f'^{2}_{j\alpha}}{\rho^{2}}, \quad \text{where } \rho^{2} = \sum_{\alpha} f'^{2}_{j\alpha}$$

At this stage, given a set of previously interpreted axes derived from the principal matrix, it is required to identify the axis to which a certain supplementary element relates the most, so as to put that element in correspondence with the individuals and/or properties responsible for the emergence of the axis. The axis sought is the one that contributes the most to the element under scrutiny (should the element lie on the axis, a Relative Contribution of 1 is obtained). Besides allowing the selection of the axis with which a certain element is associated, the value of the Relative Contribution, in the interval [0,1], also accounts for the strength of that association, playing a role analogous to that of the correlation coefficient in classical regression. The greater the Relative Contribution of an axis to an individual (or property), the closer that element is to the axis (in particular, if the Relative Contribution of an axis to a certain element is zero, the element is orthogonal to that axis).

The Supplementary Projection procedure, although not an exclusive feature of the CA algorithm, is undoubtedly its most powerful modeling tool. It makes it possible to outline strategies for coping with problems of enhanced questionnaire handling, diachronic studies, spatial comparisons, and other issues involving relationships between dissimilar blocks of the input matrix.
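The two supplementary-projection quantities can be sketched in a few lines of Python with NumPy. This is a hedged illustration under the same assumption as any SVD-based CA (axes taken from the singular value decomposition of the standardized residuals of the principal block); the tables and the names `supplementary_row` and `relative_contributions` are hypothetical.

```python
import numpy as np

def correspondence_analysis(K):
    """Eigenvalues, row projections and column projections of the principal table K."""
    K = np.asarray(K, dtype=float)
    P = K / K.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sing, Vt = np.linalg.svd(S, full_matrices=False)
    keep = sing > 1e-10                      # discard numerically null axes
    U, sing, Vt = U[:, keep], sing[keep], Vt[keep]
    F = U * sing / np.sqrt(r)[:, None]
    G = Vt.T * sing / np.sqrt(c)[:, None]
    return sing ** 2, F, G

def supplementary_row(row, G, eigvals):
    """Profile-weighted mean of the column projections, rescaled by 1/sqrt(lambda)."""
    row = np.asarray(row, dtype=float)
    profile = row / row.sum()
    return profile @ G / np.sqrt(eigvals)

def relative_contributions(coords):
    """Squared cosine of the angle between an element and each axis."""
    coords = np.asarray(coords, dtype=float)
    return coords ** 2 / np.sum(coords ** 2)

# Hypothetical principal block (4 individuals x 3 properties)
K = np.array([[10, 2, 3],
              [1, 8, 4],
              [2, 3, 9],
              [5, 5, 5]])
eigvals, F, G = correspondence_analysis(K)

# Hypothetical supplementary individual
coords = supplementary_row([6, 1, 2], G, eigvals)
cos2 = relative_contributions(coords)
best_axis = int(np.argmax(cos2)) + 1         # axis the element relates to the most
```

A useful sanity check is that projecting an active row as if it were supplementary returns its own coordinates, which is what the transition formula implies.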

Moreover, new developments in CA applications were put forward by our applied research, on the grounds of supplementary projections. These new developments, namely qualitative regression and archetypal discrimination, are addressed in the subsequent sections.

QUALITATIVE REGRESSION

When searching for a relationship between two sets of qualitative variables observed in an array of individuals, no classical regression may be applied, since the values representing the individuals' attributes are not real numbers, but codes indicating the modalities of the qualitative variables included in the available empirical data. CA is a valuable tool to address this modeling problem, provided that advantage is taken of the supplementary projection of one set of variables onto the other. Given an array of empirical cases contained in a database where the two sets of variables are known for the same individuals, the proposed methodology can be summarized in the following steps.

The first step aims to extract from the database a set of q variables (denoted predictors) that are observable in a new case where the other set of variables (denoted dependent) is to be predicted. The predictors are then arranged in a complete disjunctive matrix A, containing n rows (the individuals) and p columns (the total number of modalities of the q predictors observed in the n individuals).

The second step consists of selecting, from the database, a new set of qualitative variables to be predicted on the grounds of the first set. These dependent variables are arranged under the same format as A, giving rise to a matrix B (n × p') that contains, for the same set of n individuals, the p' modalities of the relevant attributes to be predicted in new cases, where only predictors are recorded.

The third step seeks to establish some sort of relationship between B and A. This cannot be achieved by means of an equation of the type $Y = f(x_1, \ldots, x_q)$ (as is usual in ordinary regression), since all variables are qualitative. But a specific kind of graphical relationship between B- and A-type variables can be obtained if B is projected onto the factorial axes resulting from the eigenvalue decomposition of A. This relationship is mediated by the factorial axes, which play the role of a transfer function between B and A (summarizing A-type variables in quantitative coordinates, which are linked, through the same metric, with the coordinates of the corresponding B-type variables).

The fourth step consists of using the relationship given by the previous procedure to forecast the modalities where B-type attributes fall, for a new matrix C (n' × p) containing only the predictors, codified under the same format as A and B, and referring to the n' selected individuals for which the relevant attributes are to be predicted.

It must be stressed that, as far as prediction is concerned, the third step of the above methodology is the crucial one. This step, which yields a satisfactory relationship between matrices B and A, is addressed by using CA as a qualitative regression tool. It calls for the concepts of supplementary projection and relative contribution: the former places B-type variables onto the factorial axes provided by matrix A, and the latter measures the quality of the relationship (the analogue of the correlation coefficient in ordinary regression).

Adjusting the transition formulae to the case of complete disjunctive matrices, the supplementary projection of modality j' of matrix B onto the axis $\alpha$ provided by CA of matrix A is given by:

$$f^{+}_{j'\alpha} = \frac{1}{n_{j'}} \, \frac{1}{\sqrt{\lambda_\alpha}} \sum_{i=1}^{n} \delta_{ij'} \, f_{i\alpha}$$

where
$n_{j'}$ is the sum of column j' in matrix B, representing the total number of occurrences of that modality of the supplementary variables;
$\lambda_\alpha$ is the eigenvalue of axis $\alpha$ provided by CA of matrix A;
$\delta_{ij'} = 1$ if modality j' occurs in row i, and 0 otherwise;
$f_{i\alpha}$ is the projection of row i onto axis $\alpha$ provided by CA of matrix A.

It is worth noting that, as expected, all terms of the above equation depend only on the eigenvalue decomposition of matrix A, and provide all modalities of the variables to be predicted as a function of the predictors' modalities (summarized in their projections onto the axes $\alpha$ emerging from matrix A).

To choose which axes are relevant to the relationship between B and A, the relative contributions of all axes to the dependent variables are scrutinized. The closer the relative contribution of axis $\alpha$ to a given modality is to 1, the more that modality is associated with axis $\alpha$, which in turn relates to a subset of predictors, interpreted in terms of the CA algorithm. This interpretation is performed on the grounds of the CA algorithm applied only to matrix A, using the inertia criterion: a given axis is explained by the combination of predictors whose contribution exceeds the proportion of the total inertia that would be assigned to these predictors under a hypothetical uniform distribution. By applying a maximization criterion to the relative contributions, the axes that best explain the link between predictors and dependent variables are selected, associating one set of variables with the other. Also, the rows of matrix A representing the individuals in the empirical database can be projected onto the same axes, as usual in CA.

As expected from the specific nature of the problem, no equation relating B- to A-type variables is obtained. However, the projections of each modality of the B-type variables onto the relevant axes are given by the above supplementary projection expression, which provides the values of $f^{+}_{j'\alpha}$. Therefore, since the same axes are related to A-type variables through their coordinates, the qualitative regression is performed in graphical terms, mediated by the axes.
Now, the n' new cases containing only A-type variables, arranged under the complete disjunctive format in the matrix C (n' × p), are projected as supplementary individuals onto the previously obtained axes, according to the following equation:

$$f^{+}_{i'\alpha} = \frac{1}{q} \, \frac{1}{\sqrt{\lambda_\alpha}} \sum_{j=1}^{p} \delta_{i'j} \, f_{j\alpha}$$

where
q is the number of A-type variables;
$\lambda_\alpha$ is the eigenvalue of axis $\alpha$ provided by matrix A;
$\delta_{i'j} = 1$ if modality j occurs in row i', and 0 otherwise;
$f_{j\alpha}$ is the projection of column j onto axis $\alpha$ provided by matrix A.

Fig. A-11 summarizes the entire procedure, emphasizing how supplementary projection is the key ingredient of qualitative regression, both in unveiling the relationship between the two sets of variables and in forecasting dependent variables for new cases. An application of this procedure to the assessment of the risk of mine tailings dam breakage is provided in APPENDIX D, illustrating the proposed modeling methodology.

Fig. A-11 Outline of the procedure to use CA as a qualitative regression tool
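The four steps can be strung together in a short Python/NumPy sketch. It is only an illustration under stated assumptions: the matrices `A`, `B` and `C` are invented, CA is computed by SVD of the standardized residuals, and the final nearest-modality rule is a hypothetical numerical stand-in for the graphical reading described in the text, not part of the original method.

```python
import numpy as np

def correspondence_analysis(K):
    """Eigenvalues, row projections and column projections of table K."""
    K = np.asarray(K, dtype=float)
    P = K / K.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sing, Vt = np.linalg.svd(S, full_matrices=False)
    keep = sing > 1e-10                      # a disjunctive matrix yields p - q axes
    U, sing, Vt = U[:, keep], sing[keep], Vt[keep]
    return sing ** 2, U * sing / np.sqrt(r)[:, None], Vt.T * sing / np.sqrt(c)[:, None]

def project_columns(B, F, eigvals):
    """Supplementary projection of the modalities (columns) of B onto the axes of A."""
    B = np.asarray(B, dtype=float)
    n_j = B.sum(axis=0)                      # occurrences of each modality
    return (B.T @ F) / n_j[:, None] / np.sqrt(eigvals)

def project_rows(C, G, eigvals, q):
    """Supplementary projection of new cases coded like A (q ones per row)."""
    C = np.asarray(C, dtype=float)
    return (C @ G) / q / np.sqrt(eigvals)

# Steps 1 and 2: hypothetical predictors A (q = 2 variables, 2 modalities each)
# and dependent variable B (one variable, 2 modalities) for the same 6 individuals
A = np.array([[1, 0, 1, 0], [1, 0, 1, 0], [1, 0, 0, 1],
              [0, 1, 0, 1], [0, 1, 0, 1], [0, 1, 1, 0]])
B = np.array([[1, 0], [1, 0], [1, 0], [0, 1], [0, 1], [0, 1]])

# Step 3: axes from A alone, then B projected as supplementary columns
eigvals, F, G = correspondence_analysis(A)
GB = project_columns(B, F, eigvals)
cos2_B = GB ** 2 / (GB ** 2).sum(axis=1, keepdims=True)   # relative contributions

# Step 4: new cases C projected as supplementary rows
C = np.array([[1, 0, 1, 0], [0, 1, 0, 1]])
FC = project_rows(C, G, eigvals, q=2)

# Hypothetical decision rule: nearest B modality in the space of the axes
dist = ((FC[:, None, :] - GB[None, :, :]) ** 2).sum(axis=2)
predicted = dist.argmin(axis=1)              # index of the forecast modality of B
```

In practice only the axes with high relative contributions to the dependent modalities (`cos2_B`) would be kept before reading off or computing any forecast.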

ARCHETYPAL DISCRIMINATION

CA can be used as a modeling technique to classify a set of individuals sharing the same qualitative attributes 18 in reference to a scale defined by two extreme poles or archetypes. The procedure to achieve this goal is outlined below.

Given an empirical data set of n individuals on which q attributes were observed, these are arranged in the form of a complete disjunctive matrix of p columns (denoted R), containing the relevant modalities of each variable, whose total number is p. Scrutinizing these modalities, the data expert constructs two abstract vectors: the first corresponds to the GOOD pole (archetype 1) and is obtained by selecting, for each variable of the real data set arranged as R, the most favorable modality with respect to a certain criterion; in contrast, the second, denoted the BAD pole (archetype 0), is obtained by selecting the most unfavorable modalities with respect to the same criterion. These two vectors are put in the form of a 2 x p complete disjunctive matrix A, containing the same modalities as the real data set.

When matrix A is submitted to the CA algorithm, a single Axis is obtained, since the input matrix contains only two rows. This Axis, on which the projections of the modalities are also displayed, can be viewed as a scale whose extremes are the GOOD and BAD poles (archetypes 1 and 0). Then, when the empirical data set R (n x p) is projected in supplementary terms onto the single Axis provided by CA of the abstract archetype matrix A, the real individuals are characterized by a quantitative variable, namely their coordinate on the Axis. This coordinate measures the extent to which each individual resembles the predefined extremes, and can therefore be used as its degree of goodness.
Consequently, the set of all individuals' projections can be sorted accordingly, providing in general two distributions represented by histograms (one corresponding to individuals more similar to archetype 1, and the other to individuals more similar to archetype 0 19).

18 As usual, those attributes may contain some quantitative variables that were previously split into classes.
19 In most cases, the matrix R is divided by a priori knowledge into two different blocks, each of which corresponds to real individuals assigned beforehand to a given archetype. In this instance, it is known a priori that the individuals belong in fact to two different groups, and their projections are contained in distinct histograms.

The proposed methodology is illustrated for a generic case in the diagram shown in Fig. A-12.

Fig. A-12 Archetypal discrimination symbolic description
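The degree of goodness can be sketched numerically as follows. This is a hedged Python/NumPy illustration, not the author's code: the archetype matrix `A` and the data `R` are invented, the function names are hypothetical, and the sketch assumes that every modality belongs to one of the two archetypes (a modality absent from both would have zero mass in A and would need separate handling).

```python
import numpy as np

def archetype_axis(A):
    """CA of the 2 x p archetype matrix A (row 0 = GOOD pole, row 1 = BAD pole)."""
    A = np.asarray(A, dtype=float)
    P = A / A.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sing, Vt = np.linalg.svd(S, full_matrices=False)
    lam = sing[0] ** 2                       # the only non-trivial eigenvalue
    F = U[:, 0] * sing[0] / np.sqrt(r)       # projections of the two archetypes
    G = Vt[0] * sing[0] / np.sqrt(c)         # projections of the modalities
    if F[0] < 0:                             # orient the Axis: GOOD on the positive side
        F, G = -F, -G
    return lam, F, G

def goodness(R, lam, G):
    """Supplementary projection of the real individuals onto the single Axis."""
    R = np.asarray(R, dtype=float)
    profiles = R / R.sum(axis=1, keepdims=True)
    return profiles @ G / np.sqrt(lam)

# Hypothetical case: q = 3 attributes with 2 modalities each (favorable, unfavorable)
A = np.array([[1, 0, 1, 0, 1, 0],           # GOOD pole (archetype 1)
              [0, 1, 0, 1, 0, 1]])          # BAD pole (archetype 0)
R = np.array([[1, 0, 1, 0, 1, 0],           # identical to the GOOD pole
              [0, 1, 0, 1, 0, 1],           # identical to the BAD pole
              [1, 0, 0, 1, 1, 0]])          # two favorable modalities, one unfavorable
lam, F, G = archetype_axis(A)
scores = goodness(R, lam, G)
```

With this orientation the score decreases as an individual accumulates unfavorable modalities, so sorting `scores` gives the ranking used to build the two histograms.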

In order to accomplish all the objectives of a comprehensive Discrimination Analysis, it is required to allocate an anonymous individual 20 to one of the groups related to each archetype. Consequently, the problem of the overlapping zone represented at the bottom of Fig. A-12 must be addressed. To this end, a boundary dividing the Axis provided by CA into two clear-cut zones must be found. The procedure to search for such a boundary by an optimal method consists of simulating different positions for the boundary in the overlapping zone, until the fraction of misclassified cases reaches its minimum. When this optimal position is established, any unknown case (anonymous individual) can be allocated to Group I (individuals similar to archetype 1) or to Group II (individuals similar to archetype 0), depending on its supplementary projection onto Axis 1 in relation to the optimal boundary location, as illustrated in Fig. A-13.

Fig. A-13 Allocation of unknown cases by establishing an optimal boundary

APPENDIX E outlines a case study aimed at establishing an index of quality in natural stones, based on archetypal discrimination.

20 This is a case where the a priori belonging is unknown.
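The boundary search lends itself to a very small sketch in plain Python. The coordinates below are invented, the candidate positions (midpoints between consecutive observed coordinates) are one possible way of "simulating different positions", and the convention that Group I lies on the high side of the Axis is an assumption of this example.

```python
def optimal_boundary(scores_group1, scores_group2):
    """Return (boundary, misclassified fraction) minimizing the error rate.

    Group I (similar to archetype 1) is assumed to lie above the boundary,
    Group II (similar to archetype 0) below it.
    """
    values = sorted(set(scores_group1) | set(scores_group2))
    # Candidate positions: midpoints between consecutive observed coordinates
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    total = len(scores_group1) + len(scores_group2)
    best = None
    for b in candidates:
        errors = sum(s < b for s in scores_group1) + sum(s > b for s in scores_group2)
        rate = errors / total
        if best is None or rate < best[1]:
            best = (b, rate)
    return best

def allocate(score, boundary):
    """Allocate an anonymous individual from its supplementary projection."""
    return "Group I" if score > boundary else "Group II"

# Hypothetical Axis coordinates of individuals whose group is known a priori
group1 = [0.9, 0.7, 0.5, 0.2, -0.1]
group2 = [-0.8, -0.6, -0.3, 0.1, 0.3]
boundary, error_rate = optimal_boundary(group1, group2)
print(allocate(0.6, boundary))
```

When several positions give the same minimum, this sketch keeps the first one found; another tie-breaking rule (for instance the middle of the optimal interval) would be equally defensible.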


APPENDIX A
AIR QUALITY IN GDANSK

The air quality in the city of Gdansk is monitored by measuring the concentration of a set of five pollutants: Nitrogen Dioxide (labeled NO2), Sulfur Dioxide (labeled SO2), Particulate Matter (labeled PM), Carbon Monoxide (labeled CO), and Ozone (labeled Oz). For the year 2010, the average monthly concentrations (expressed in micrograms per cubic meter) are given in Table I.

Table I

The aim of the study is to find the pattern of association between extreme values of the pollutant concentrations (those that exceed the allowed limits provided below in Table II) and the months of the year.

Table II

In order to apply CA to this case study, Table I was put under a complete disjunctive format by handling the information contained in Table II according to the following procedure: if the concentration of a given pollutant is lower than the prescribed limit given in Table II, code 1 is assigned to the column labeled -, and code 0 is assigned to the column labeled +. In the cases for which the observed concentration exceeds the limit, code 1 is assigned to the column labeled +, and code 0 is assigned to the column labeled -. As a result of this procedure, the raw data are transformed into the complete disjunctive matrix given in Table III.

Table III (columns: NO2-, NO2+, SO2-, SO2+, PM-, PM+, CO-, CO+, Oz-, Oz+; rows: Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sept, Oct, Nov, Dec)
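The coding rule can be written down in a few lines of plain Python. The concentrations and limits below are invented placeholders (the actual values of Tables I and II are not reproduced here), the function name is hypothetical, and treating a concentration exactly equal to the limit as not exceeding it is an assumption of this sketch.

```python
def disjunctive_row(concentrations, limits):
    """Code one month as a pair of 0/1 columns ('-' and '+') per pollutant."""
    row = {}
    for pollutant, limit in limits.items():
        exceeded = concentrations[pollutant] > limit
        row[pollutant + "-"] = 0 if exceeded else 1   # at or below the limit
        row[pollutant + "+"] = 1 if exceeded else 0   # above the limit
    return row

# Hypothetical limits and monthly averages, for illustration only
limits = {"NO2": 40.0, "SO2": 20.0}
january = {"NO2": 45.0, "SO2": 10.0}
print(disjunctive_row(january, limits))
```

Each pollutant contributes exactly one 1 per row, which is what makes the resulting matrix complete disjunctive.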

When the CA algorithm is applied to Table III, a set of 5 axes is obtained. Table IV gives the eigenvalues of these axes, as well as the % of inertia conveyed by each of them and the accumulated inertia.

Table IV (columns: EIGENVALUE, % INERTIA, % ACCUM; rows: AXIS 1 to AXIS 5)

By inspection of Table IV, a preliminary conclusion may be offered: it seems that the problem can be approached using only axes 1 and 2, which account for 75% of the total inertia of the cloud. In fact, if these two axes explain all relevant properties (which are the cases where the allowed limits of pollutants are exceeded), there is no need to scrutinize the remaining axes. To make sure that plane 1,2 is sufficient to display the relationships between the relevant variables, it is required to examine Table V, where the absolute contributions of the set of properties to the axes are given (the values that surpass the practical threshold of 100/p = 10 are printed in bold for axes 1 and 2, indicating that a significant connection may be established between a given modality and one of these axes).

Table V ABSOLUTE CONTRIBUTIONS (columns: AXIS 1 to AXIS 5; rows: NO2-, NO2+, SO2-, SO2+, PM-, PM+, CO-, CO+, Oz-, Oz+)
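The quantities reported in Tables IV and V can be computed as sketched below in Python with NumPy. The sketch assumes the standard definition of the absolute contribution of a column to an axis (100 × mass × squared projection / eigenvalue, so that contributions to one axis add up to 100); the small disjunctive table is invented and does not reproduce Table III.

```python
import numpy as np

def column_contributions(K):
    """Eigenvalues and absolute contributions (in %) of the columns of K to each axis."""
    K = np.asarray(K, dtype=float)
    P = K / K.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sing, Vt = np.linalg.svd(S, full_matrices=False)
    keep = sing > 1e-10                      # discard numerically null axes
    sing, Vt = sing[keep], Vt[keep]
    eigvals = sing ** 2
    G = Vt.T * sing / np.sqrt(c)[:, None]    # projections of the columns
    cta = 100 * c[:, None] * G ** 2 / eigvals
    return eigvals, cta

# Hypothetical disjunctive table: 6 rows, 2 pollutants coded as (-, +) pairs
K = np.array([[1, 0, 1, 0], [1, 0, 1, 0], [1, 0, 1, 0],
              [1, 0, 0, 1], [0, 1, 0, 1], [0, 1, 1, 0]])
eigvals, cta = column_contributions(K)

pct_inertia = 100 * eigvals / eigvals.sum()  # % of inertia per axis
pct_accum = np.cumsum(pct_inertia)           # accumulated inertia
threshold = 100 / K.shape[1]                 # practical threshold 100/p
retained = cta > threshold                   # modalities linked to each axis
```

The same computation applied to the rows (with the row masses and row projections) gives the contributions of the individuals, to be compared with 100/n.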

Since all modalities for which the allowed limit in the pollutant concentration is exceeded can be connected to axis 1 (SO2+, PM+, CO+, Oz+) or axis 2 (NO2+), the other axes can be discarded, and the interpretation of all relevant variables can be performed in the plane 1,2, depicted in Fig. 1.

Fig. 1 Projection of variables' modalities onto plane 1,2

The interpretation of Fig. 1 is to be done exclusively on the grounds of the modalities linked to axes 1 and 2 (this linkage is symbolized by arrows pointing to one of the axes). Regarding axis 1, a group of 3 points projects onto its left part (SO2+, PM+, CO+), as opposed to Oz+, which projects onto its right part. This means that the profiles of modalities SO2+, PM+ and CO+ are similar over time, and that this cluster is opposed to modality Oz+. As far as axis 2 is concerned, it separates the two modalities of NO2. Moreover, since the two axes are orthogonal, the pattern disclosed by axis 1 is unrelated to axis 2.

Hence, the conclusion can be drawn that the first two axes are sufficient to reveal the association/opposition pattern of all relevant modalities (those that indicate an excess of pollutant concentration over the allowed limit). Given that the fraction of inertia conveyed by these axes is significant (75%), it was decided to disregard the remaining axes.

It is now required to select, from the set of individuals (months), those whose contribution to axes 1 and 2 exceeds the practical threshold of 100/12 = 8.33. This is performed by examining Table VI, where the link between a given month and one of the axes 1 and 2 is symbolized by the respective absolute contribution printed in bold.

Table VI ABSOLUTE CONTRIBUTIONS (rows: Jan, Feb, Mar, Apr, May, Jun, Jul, Aug, Sept, Oct, Nov, Dec)

The inspection of Table VI indicates that the individuals linked to axis 1 are JAN, FEB, JUN and DEC, and those linked to axis 2 are MAR, JUN and SEPT. These months, retained for interpretation in conjunction with the conclusions drawn from Fig. 1, are highlighted in Fig. 2 by arrows pointing from their projections to the axis they are linked to.

It is worth noting that months APR, AUG and JUL project onto the same point, and hence their absolute contributions to Axis 1 are summed, amounting to 3 times the individual value (a total that exceeds the practical threshold of 100/12 = 8.33, so these months are retained for interpretation). As far as JUNE is concerned, even though its absolute contribution to Axis 2 is greater than to Axis 1, it was decided to assign this individual to Axis 1, since the explanatory power of that axis is greater than that of Axis 2 (55% vs. 20%). All the other months do not intervene in the pattern revealed by Fig. 1, and are disregarded in this analysis.

Fig. 2 Projection of individuals onto plane 1,2

The joint interpretation of Fig. 1 and Fig. 2 leads to the pattern of associations/oppositions represented in the symbolic diagram of Fig. 3.

Fig. 3 Diagram showing associations/oppositions between pollutants and months

Fig. 3 indicates that concentrations of pollutants CO, SO2 and PM that exceed the allowed limits occur mainly in the months of Jan, Feb and Dec. This association correlates negatively with the months of April, June, July and August, which are linked to extreme values of Ozone. A weaker (20 vs. 50%) opposition is disclosed by Axis 2, separating concentrations of NO2 above and below the allowed limit (the former being linked to Sept and the latter to March).

SOURCE: Project for the course Natural Resources Management and Planning ( )

APPENDIX B
RESERVOIR QUALITY ZONES IN AN OIL FIELD

In this example CA is applied to properties captured in a Middle East oil field through a set of exploration wells. The aim of the study is to model the internal architecture of the reservoir in terms of zones that are homogeneous as far as oil quality is concerned. Based on such Reservoir Quality Zones, production planning can be achieved by maximizing recovery.

The available data, consisting of quantitative and qualitative variables that characterize the reservoir, were arranged under a common format by constructing the Complete Disjunctive Matrix given in symbolic form in Fig. D-3.1. The variables referring to the 172 wells through which the oil field was sampled were split into two sets: the first set, denoted principal, contains the leading properties of the reservoir, in terms of their capacity to give rise to contrasting zones inside the geological formation in which the oil was trapped; the second set, denoted supplementary, contains ancillary parameters that facilitate the interpretation of the former, for the purpose of obtaining homogeneous Reservoir Quality Zones. For each variable, meaningful classes or categories were established by the reservoir engineer, who needs to disclose the internal architecture of the oil field for production planning purposes.

Fig. D-3.1 Data model


When the matrix depicted in Fig. D-3.1 is submitted to the CA algorithm, the graph given in Fig. D-3.2 is obtained for the plane defined by Axes 1 and 2 (representing 85% of the inertia of the initial cloud).

Fig. D-3.2 Projection of individuals and variables onto the principal plane produced by CA (labels of variables' categories are given in Fig. D-3.1)

The interpretation of Fig. D-3.2, which depicts a perfect Guttman effect, leads to the creation of 7 groups of wells, sequenced according to their projection onto Axis 1. This sequence displays a degradation in oil quality (from negative to positive coordinates): in the left part of the Axis, the groups have the highest worth (and this worth decreases as the wells' coordinate increases); in the right part of Axis 1, wells reveal a poorer oil quality (the greater a well's coordinate on Axis 1, the smaller the oil worth). In the region of the plot defined by small Axis 1 coordinates (and by large negative Axis 2 coordinates 21), GROUP 4 is projected, associated with intermediate values of Elevation and Water Saturation. This particular group makes the transition between pure oil and clean water [the former referring to the highest negative coordinates on Axis 1 (associated with low values of Elevation and Water Saturation), and the latter projecting near its positive edge, associated with high values of the same variables].

21 Axis 2 does not discriminate oil from water, being interpreted as revealing the opposition between oil and water vs. mixed groups.

Regarding the columns that were projected as supplementary properties, they show a pattern similar to that of the principal parameters (in the case that they display an ordinal character). Regarding nominal variables (such as facies, presence/absence of limestone, clay, and dolomite), their modalities that project onto the negative part of Axis 1 indicate the presence of high quality oil, and the reverse holds for modalities projecting onto its positive part.

As a consequence of the Guttman effect revealed by Fig. D-3.2, each group of wells (G1 to G7) can be identified exclusively by its projection onto Axis 1. Limits on such projections are given in Fig. D-3.3, which is a kind of symbolic U-shaped histogram, where the classes to which groups belong are given by appropriate intervals of the Axis 1 coordinate (derived from Fig. D-3.2). As previously discussed, the oil quality (assessed by the oil saturation of the rock) increases when the coordinate on Axis 1 decreases.

Fig. D-3.3 Symbolic histogram providing the definition of groups of wells according to their coordinate on CA Axis 1

The zones of the reservoir (groups of wells) obtained by the aforementioned procedure are shown in Fig. D-3.4, now in a horizontal cross-section of the geographical space where the oil field is located. By inspection of Fig. D-3.4, a core of high quality oil is spotted in the centre of the reservoir, and the degradation noticed above in the CA plot is now portrayed in physical terms. This can also be seen in the vertical cross-section of Fig. D-3.5, referring this time to the position of the groups with regard to elevation Z (the lower the elevation below surface, the higher the oil content).

Fig. D-3.4 Geographical representation of Groups (labeled 1 to 7) obtained by CA (horizontal cross-section)

Fig. D-3.5 Geographical representation of Groups (labeled G1 to G7) obtained by CA (vertical cross-section)

SOURCE: Pereira, H.G., Silva, A.C., Soares, A., Ribeiro, L., Carvalho, J. (1990) Improving reservoir description by using geostatistical and multivariate data analysis techniques, Math. Geol., Vol. 22, No. 8, pp

APPENDIX C
CLIMATOLOGY OF PORTO URBAN AREA

In a weather station located in Porto, records of a small set of climatologic variables are available for the period 1998/2000. In addition to the time of day when each measurement was taken, the variables that come into play here are temperature and wind direction/speed.

In order to submit this data set to CA, it is required to codify all variables under a common scheme, denoted by Benzécri as a complete disjunctive matrix. To this end, since all variables are quantitative, it is necessary to categorize them into meaningful intervals, whose limits were proposed iteratively by a climatology expert. In this specific case, particular attention was paid to the variables time of day and wind direction, as a consequence of their different character with respect to the remaining parameters. Indeed, while the latter (temperature and wind velocity) are expressed by usual real numbers, the metric associated with the former is by no means linear (and regular arithmetic does not hold). Such a metric, which is an acknowledged property of directional or circular variables, does not permit the calculation of the Euclidean distance needed to apply the most common method of data analysis for quantitative variables: Principal Components Analysis (PCA). Hence, in this case, the use of CA was driven by the data set characteristics, not because the available variables are qualitative, but because their heterogeneity (even within the quantitative realm) could not be managed through a simpler technique like PCA, which assumes linearity.

After a series of trials, the modalities of the available variables were established as shown below.

VARIABLE: time of day
CODE          H1     H2     H3
Limits (h)    0/8    8/16   16/24

VARIABLE: temperature
CODE          T1     T2      T3      T4      T5
Limits (ºC)   0/10   10/15   15/20   20/25   >25

VARIABLE wind direction
CODE        D1    D2      D3       D4       D5
Limits (º)  0/60  60/140  140/270  270/320  320/360

VARIABLE wind speed
CODE          V1       V2       V3       V4       V5
Limits (m/s)  0.0/1.5  1.5/3.0  3.0/5.5  5.5/7.0  >7.0

Fig. D-5.1 Codes and limits for the available variables

Histograms of the available variables, expressed as absolute frequencies of occurrence, are given in Fig. D-5.2 to D-5.5, in accordance with the classes previously established.

Fig. D-5.2 Variable time of day

Fig. D-5.3 Variable temperature
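The coding of a record into classes can be sketched as follows. This is only a minimal illustration, assuming the class limits of Fig. D-5.1; the function names, the invented record and the rule that a value lying exactly on a limit falls in the upper class are assumptions of the sketch, not statements of the original study.

```python
from bisect import bisect_right

# Inner class limits read from Fig. D-5.1. The last class of each variable
# is either open-ended or closed by the natural bound (24 h, 360 degrees).
LIMITS = {
    "H": [8, 16],               # time of day (h): H1..H3
    "T": [10, 15, 20, 25],      # temperature (deg C): T1..T5
    "D": [60, 140, 270, 320],   # wind direction (deg): D1..D5
    "V": [1.5, 3.0, 5.5, 7.0],  # wind speed (m/s): V1..V5
}


def modality(var, value):
    """Return the class code (e.g. 'T3') of a value.

    A value equal to a limit is put in the upper class, which is an
    assumption of this sketch.
    """
    return f"{var}{bisect_right(LIMITS[var], value) + 1}"


def disjunctive_row(record):
    """Code one record as a 0/1 row over all modalities (one 1 per variable)."""
    columns = [
        f"{var}{k}"
        for var, cuts in LIMITS.items()
        for k in range(1, len(cuts) + 2)
    ]
    active = {modality(var, value) for var, value in record.items()}
    return columns, [1 if col in active else 0 for col in columns]


# Invented record: 06:00, 9.2 deg C, wind from 95 degrees at 2.1 m/s.
columns, row = disjunctive_row({"H": 6, "T": 9.2, "D": 95, "V": 2.1})
print([col for col, flag in zip(columns, row) if flag])
```

Stacking one such row per record yields the complete disjunctive matrix; coding wind direction by class is what sidesteps the circular metric, since no arithmetic is performed on the angles themselves.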

Fig. D-5.4 Variable wind direction

Fig. D-5.5 Variable wind speed

The empirical data table was then converted into a complete disjunctive matrix containing rows and 18 columns (the total number of modalities for the available set of variables, codified according to Fig. D-5.1). When this matrix was submitted to the CA algorithm, two principal planes were obtained, explaining 40% of the total inertia of the cloud. Although this fraction of inertia may seem small, all variables can be interpreted on the grounds of their projections onto such planes 22. Thus, the climatologic interpretation is based solely on the projections of the variables' modalities onto the planes defined by Axes 1,2 and 1,3, depicted in the plots of Fig. D-5.6 and D-5.7, respectively.

22 This is a feature that occurs in most cases of applying CA to a complete disjunctive matrix.
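How the fraction of inertia conveyed by each axis is obtained can be illustrated with a small sketch. It assumes the usual formulation of CA as a singular value decomposition of the standardized residuals, uses NumPy, and runs on an invented 6 x 5 disjunctive matrix (two variables coded A1..A3 and B1..B2), not on the Porto data.

```python
import numpy as np


def correspondence_analysis(N):
    """Plain CA of a non-negative table N via the SVD of standardized residuals."""
    P = N / N.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    keep = sv > 1e-12                    # discard null axes
    U, sv, Vt = U[:, keep], sv[keep], Vt[keep]
    rows = (U / np.sqrt(r)[:, None]) * sv      # row principal coordinates
    cols = (Vt.T / np.sqrt(c)[:, None]) * sv   # column principal coordinates
    inertia = sv ** 2                    # principal inertia of each axis
    return rows, cols, inertia


# Invented complete disjunctive matrix: columns A1, A2, A3, B1, B2.
Z = np.array([
    [1, 0, 0, 1, 0],
    [1, 0, 0, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 1, 0, 1],
    [0, 0, 1, 0, 1],
], dtype=float)

rows, cols, inertia = correspondence_analysis(Z)
share = 100 * inertia / inertia.sum()    # percentage of inertia per axis
print(np.round(share, 1))
```

For a complete disjunctive matrix with Q variables and J modalities in total, the total inertia equals J/Q - 1 whatever the data, so the inertia is spread over J - Q axes and the share of the first planes tends to look modest.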

Fig. D-5.6 Projection of variables' modalities onto plane 1,2

Fig. D-5.7 Projection of variables' modalities onto plane 1,3

In order to interpret the above figures, the histogram referring to wind direction (Fig. D-5.4) must be put in geographical terms. This is shown in Fig. D-5.8, where classes D1 to D5 are displayed with respect to the Wind Rose prevailing in the Porto region.


Fig. D-5.8 Wind direction circular histogram for the Porto region

As regards Fig. D-5.6, the interpretation of Axis 1 is straightforward: it shows an increase in temperature, from left to right. A clear connection is perceived between the extreme low category of temperature (T1) and the night measurements of H1 associated with East winds D2. This association is explained by the local-scale phenomenon known as land breeze, occurring at night (when onshore temperatures are lower than offshore ones). Focusing now on the right side of Axis 1, the extreme high categories of temperature (T4 and T5) are related to the intermediate wind speed (denoted V3) and to NW winds D4. This association is explained by a global-scale synoptic phenomenon driven by the interaction between the Azores anticyclone and the Iberian depression, which causes a clockwise wind circulation reaching Porto from NW (D4) and occurring mainly on hot summer days (T4 and T5). As far as Axis 2 is concerned, it opposes Northern wind directions D1+D5 to both extremes of temperature (T1 and T5).

Fig. D-5.7 brings a new insight stemming from the interpretation of Axis 3. This axis opposes wind speeds V4/V5 to wind directions D1+D5, suggesting that strong winds seldom blow from Northern directions.

The conclusions drawn from this analysis are of two kinds: on the one hand, some of them, being trivial and/or expectable, merely serve to authenticate the methodology for the benefit of a skeptical climatologist; on the other, those conclusions that convey any sort of novelty may be useful as conditional evidence to be scrutinized by the (now less skeptical) expert in the field to which the data refer.

SOURCE: Góis, J., Pereira, H.G., Salgueiro, R. (2010) Geostatistics applied to City of Porto urban climatology, in geoenv VII, Atkinson & Loyd (Eds.) Springer, p


APPENDIX D RISK ASSESSMENT OF MINE TAILINGS POND DAM BREAKAGE

In the Mediterranean region, ancient mining operations have generated huge amounts of sludge, dumped into tailing ponds that are in general sustained by precarious dams, constructed from locally obtained fills. An inventory of such tailing ponds was performed in the scope of an EU-funded research project, aiming at developing a decision-support system to prevent environmental disasters following the accidental breakage of this type of dam. Since only modest physical modeling experience is available to address this issue (as opposed to water reservoir dams), it was decided to adopt a stochastic approach to assess the risk of breakage on the grounds of a historical database where 55 cases of disasters were recorded, together with dam characteristics and harmful consequences.

The first step of the proposed approach was to extract from the database an assemblage of attributes (denoted predictors) that characterize the dam conditions prior to the disaster, and another array of attributes that embody the damage resulting from the disaster. Given that the available information on prior conditions and damage contains important qualitative features (described by nominal variables like the nature of the dam, the country where the pond is located, the failure type), no classical regression can be performed, and the conditions are met to apply CA-based qualitative regression to the training set outlined above. Fig. D-6.1 gives the graphical output produced by applying CA to the predictor matrix, arranged in a complete disjunctive format (where quantitative variables were split into classes).

Fig. D-6.1 Projection of predictors' modalities onto Axes 1 and 2 resulting from CA

The interpretation of Fig. D-6.1 rests on the ability of Axis 1 to separate small dams in environmentally regulated countries (negative semi-axis) from big, inactive and ring-type dams located in environmentally unregulated countries (positive semi-axis), showing in addition a clear increasing sequence (from left to right) for the relevant ordinal variables like dam height and storage volume. Now, when the attributes linked to dam breakage conditions are projected as supplementary variables, Fig. D-6.2 is obtained, displaying graphically how the predictors are associated with those variables. It is obvious that the only relevant mediator between the two sets of variables to be put into relationship is Axis 1, even though it conveys only 35% of the inertia of the cloud. Moreover, the strength of such a relationship can be evaluated quantitatively by the Relative Contributions of Axis 1 to each modality of the supplementary attributes, as given in Table D-6.I.

Fig. D-6.2 Supplementary projection of variables linked to the disaster conditions

Table D-6.I Relative contributions of Axis 1 to supplementary variables' modalities

Negative semi-axis:
  Sludge Volume Released <50000m3
  Mix Type of Sequentially Raised Tailing Dam (0.75)
  No fatalities (0.55)
  Failure Type: Hole (0.20)
  Failure Type: Overtopping/Overflow (0.16)
  Downstream Type of Sequentially Raised Tailing Dam (0.12)
  (0.62)

Positive semi-axis:
  Upstream Type of Sequentially Raised Tailing Dam (0.63)
  Sludge Linear Distance Traveled >12000m (0.51)
  Sludge Volume Released >300000m3 (0.47)
  > 10 fatalities (0.40)
  Sludge Linear Distance Traveled between 800 and 12000m (0.15)
  1-10 fatalities (0.11)
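The projection of a supplementary modality and the relative contribution of an axis to it can be sketched as below. This is a hedged illustration, assuming the standard transition formula of CA and taking the relative contribution as the squared cosine between the modality and the axis; the 5 x 4 predictor table and the damage indicator are invented, and the function names are hypothetical.

```python
import numpy as np


def row_axes(N):
    """CA of the active table N: row masses, standard row coordinates, singular values."""
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, _ = np.linalg.svd(S, full_matrices=False)
    keep = sv > 1e-12                          # discard null axes
    return r, U[:, keep] / np.sqrt(r)[:, None], sv[keep]


def project_supplementary_column(counts, r, phi):
    """Principal coordinates and relative contributions (squared cosines)
    of a supplementary column given by its counts over the active rows."""
    profile = counts / counts.sum()
    coords = profile @ phi                     # transition formula
    dist2 = np.sum((profile - r) ** 2 / r)     # chi-square distance to the centroid
    return coords, coords ** 2 / dist2


# Invented predictor table: 5 dams (rows) x 4 predictor modalities (columns),
# in complete disjunctive form (two predictors with two classes each).
N = np.array([
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
], dtype=float)
r, phi, sv = row_axes(N)

# Invented supplementary modality, e.g. a damage class observed at dams 0, 1 and 3.
damage = np.array([1, 1, 0, 1, 0], dtype=float)
coords, cos2 = project_supplementary_column(damage, r, phi)
print("Axis 1 coordinate:", round(float(coords[0]), 3))
print("Relative contribution of Axis 1:", round(float(cos2[0]), 3))
```

A relative contribution close to 1 means that the modality is almost entirely described by the axis; the sum over all axes cannot exceed 1.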


It is clear that Axis 1 resulting from the CA of the predictors matrix can be viewed as a scale of RISK. In fact, the modalities of the attributes that characterize the type of failure leading to low damage are projected onto the negative semi-axis, and the reverse occurs for the positive semi-axis, where the severe damage variable classes are ranked. Hence, by projecting onto Axis 1 the test sites displayed in Fig. D-6.3, whose risk of failure is to be assessed, a framework of prevention priorities (and effort) can be established on the grounds of Fig. D-6.4, where the training set of historical disasters is also projected as reference numbers, the meaning of which is disclosed in Table D-6.II.

Fig. D-6.3 Location of test sites

Fig. D Supplementary projection of test sites and cases drawn from the Historical Data Base onto an empirical scale of risk provided by CA
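Ranking test sites along Axis 1 amounts to projecting them as supplementary rows. The sketch below assumes the standard transition formula (row profile times the standard column coordinates) and orients the axis by a reference modality, since the sign of an axis is arbitrary; the training table, the modality names and the three test sites are invented for illustration only.

```python
import numpy as np


def column_axes(N):
    """Standard column coordinates and singular values of the CA of table N."""
    P = N / N.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    _, sv, Vt = np.linalg.svd(S, full_matrices=False)
    keep = sv > 1e-12                          # discard null axes
    return Vt[keep].T / np.sqrt(c)[:, None], sv[keep]


def risk_scores(test_rows, gamma, reference_column):
    """Axis 1 coordinates of supplementary rows, with the axis oriented so
    that the reference column (a high-risk modality) lies on the positive side."""
    axis1 = gamma[:, 0] * np.sign(gamma[reference_column, 0])
    profiles = test_rows / test_rows.sum(axis=1, keepdims=True)
    return profiles @ axis1


# Invented training set in disjunctive form; hypothetical columns:
# low dam, high dam, small storage volume, large storage volume.
train = np.array([
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 1, 0],
], dtype=float)
gamma, sv = column_axes(train)

# Invented test sites coded with the same modalities.
sites = {"site A": [1, 0, 1, 0], "site B": [0, 1, 0, 1], "site C": [1, 0, 0, 1]}
scores = risk_scores(np.array(list(sites.values()), dtype=float), gamma, reference_column=1)
for name, score in sorted(zip(sites, scores), key=lambda pair: -pair[1]):
    print(f"{name}: {score:+.3f}")
```

Sites are listed from the positive (high-risk) end of the axis downwards, which is the ordering a prevention-priority framework would start from.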

Table D-6.II List of the 55 cases that compose the Historical Data Base (columns, in two side-by-side blocks: Ref. number, Name, Country, Year of the incident).

1 Los Frailes Spain; El Cobre Old Dam Chile; Aitik Sweden; Fort Meade USA; Baia Borsa Romania; Harmony South Africa; Baia Mare Romania; Hokkaido Japan; Sgurigrad Bulgaria; Itabirito Brazil; Maritsa Istok 1 Bulgaria; Jinduicheng China; Stava Italy; La Patagua New Dam Chile; Balka Chuficheva Russia; Los Maquis Chile; Zletovo Macedonia (Yugoslavia); Mike Horse USA; Maggie Pie United Kingdom; Mochikoshi No.1 Japan 1978; 11 Bilbao Spain; Montcoal No.7, Raleigh County USA 1987; 13 Derbyshire United Kingdom; Olinghouse USA; Madjarevo Bulgaria; Omai Guyana; Middle Arm Tasmania Unknown 70; Placer, Surigao del Norte Philippines; Partizansk, Primorski Krai Russia; Riverview USA; Huelva Spain; Sipalay Philippines; Amatista, Nazca Peru 1994 or; Stancil USA; Arcturus Zimbabwe; Sullivan mine Canada; Bafokeng South Africa; Tennessee Consolidated No.1 USA 1988; 26 Bellavista Chile; (unidentified) SW USA; Buffalo Creek USA; Veta de Agua No.1 Chile; Cerro Negro Chile; Barahona, Chile Unknown; 30 Cerro Negro No.4 Chile; Bonsal USA Unknown; 31 Cerro Negro No.3 Chile; Mochikoshi n2 Japan Unknown; 32 Church Rock USA; Phelps-Dodge USA Unknown; 33 Deneen Mica USA; Silver King USA; (unidentified), East Texas USA; Unidentified USA Unknown; 35 El Cobre New Dam Chile

SOURCE: Salgueiro, A.R., Pereira, H.G., Rico, M.T., Benito, G., Díez-Herrero, A. (2008) Application of Correspondence Analysis in the assessment of mine tailings dam breakage risk in the Mediterranean Region, Risk Analysis, Vol. 28, No 1, pp

APPENDIX E INDEX OF QUALITY IN NATURAL STONES

A major problem that arises in natural stone exploitation planning is the lack of an objective criterion for optimizing the economic recovery of the material to be produced, under environmental constraints. In fact, in contrast with mining operations for mineral commodities, where grade is the decisive control variable, in natural stone extraction there is no single parameter driving the demand, which depends decisively on a myriad of factors, ranging from physical to aesthetic features, strongly associated with the specific application foreseen for the material to be extracted. But the current practice, insofar as it focuses on the blind ad hoc supply of the most accessible blocks, does not take demand requirements into account. This situation leads to serious shortcomings, namely the huge deposition of waste blocks in the vicinity of the quarry, with the associated landscape recovery costs, and the production of material which is worthless for a given conjuncture in the building industry, with the associated storage costs.

In order to address this issue under a demand-driven approach that minimizes environmental damage and stocks waiting for a hypothetical removal, a fresh viewpoint is put forward: production planning is organized in such a way that the blocks to be extracted in a certain conjuncture meet the requirements of the downstream industries entailed by that conjuncture, the remaining material being left in situ. Obviously, this procedure must comply with the geotechnical and geological constraints found in the available natural sources, i.e., a compromise should be reached between the characteristics of the quarry and the demand requirements for each application of the blocks to be extracted.

The first step to be undertaken in the procedure outlined above is to identify the set of the natural stone's attributes required for a certain application that can be captured in the faces of the quarry, prior to extraction. These attributes are then encapsulated into a single index of quality according to Archetypal Discrimination, as illustrated below, and production planning is performed by maximizing such an index. The case study reported here refers to a marble quarry located in the Estremoz anticlinorium (see Fig. D-8.1), for which a panel of specialists (architects, builders and geologists) has defined the modalities of the observable attributes that were considered as the


archetypes of the GOOD and BAD material, for a certain application. Table D-8.I gives the matrix containing these two vectors, as far as fractures are concerned. The meaning of the sequence of modalities for each attribute is illustrated in Fig. D-8.2.

Fig. D-8.1 Location of the Quarry

Table D-8.I Archetype definition

Fig. D-8.2 Modalities of the archetypes' attributes

The following steps consisted of capturing empirical data in the faces of the quarry, according to the format driven by the archetypes given above. To this end, the available vertical faces were swept by a moving window, and a photograph was taken for

each support, corresponding to the field where fractures were digitized and typified in terms of their attributes, according to the scheme outlined in Fig. D-8.3.

Fig. D-8.3 Scheme of data capture and their digital characterization

Then, Archetypal Discrimination was applied by projecting the empirical supports, as supplementary individuals, onto the Axis obtained by CA applied to the archetype matrix of Table D-8.I. The coordinates of these supports on that Axis represent the quality index of the material to be extracted, for the use defined a priori. Hence, the blocks of the quarry can now be classified into meaningful classes for the required application, as exemplified in Fig. D-8.4, and the demand-driven production planning can be performed on the grounds of each block's closeness to the GOOD pole.
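The computation of such a quality index can be sketched as follows. Since Table D-8.I is not reproduced here, the two-row archetype table, the four modalities and the three supports are invented; the sketch only assumes that the CA of a two-row table yields a single axis, onto which supports are projected as supplementary rows, and the function names are hypothetical.

```python
import numpy as np


def archetype_axis(archetypes):
    """Single CA axis of a 2-row archetype table (row 0 = GOOD, row 1 = BAD),
    returned as standard column coordinates oriented so that GOOD is positive."""
    P = archetypes / archetypes.sum()
    r, c = P.sum(axis=1), P.sum(axis=0)
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, _, Vt = np.linalg.svd(S, full_matrices=False)
    gamma = Vt[0] / np.sqrt(c)                 # standard column coordinates
    return gamma * np.sign(U[0, 0])            # put the GOOD archetype on the positive side


def quality_index(supports, gamma):
    """Coordinate of each support (row of modality counts) on the archetype axis."""
    profiles = supports / supports.sum(axis=1, keepdims=True)
    return profiles @ gamma


# Invented archetypes over four hypothetical fracture modalities:
# wide spacing, narrow spacing, closed aperture, open aperture.
archetypes = np.array([
    [1, 0, 1, 0],   # GOOD
    [0, 1, 0, 1],   # BAD
], dtype=float)
gamma = archetype_axis(archetypes)

# Invented supports: counts of digitized fractures falling in each modality.
supports = np.array([
    [5, 1, 4, 2],
    [1, 5, 0, 6],
    [3, 3, 3, 3],
], dtype=float)
index = quality_index(supports, gamma)
print(np.round(index, 3))
```

The larger the index, the closer a support lies to the GOOD pole; every modality must occur in at least one archetype, otherwise its column mass is zero and the coordinates are undefined.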

Fig. D-8.4 Classification of blocks in a quarry for a given application

SOURCE: Pereira, H.G., Brito, M.G., Albuquerque, T., Ribeiro, J. (1993) Geostatistical estimation of a summary recovery index for marble quarries, Proceedings Geostatistics Tróia 92, Vol. 2, p
