Chapter 12. The numerical classification of vegetation


by Guy BOUXIN
Rue des Sorbiers 33 à B Erpent; mail: guy.bouxin@skynet.be

Contents

Introduction
The classification of relevés and of variables
Hierarchical or non hierarchical method?
Divisive or agglomerative method?
Monothetic or polythetic classification?
Qualitative or quantitative data?
Similarity measures
Hierarchical classifications
Agglomerative methods
Divisive methods
The stopping rules
Twinspan
Cocktail clustering
Choice of techniques
Examples
A small table
Raw table
Complete table transformed by correspondence analyses
Conclusions
A large floristical table
File with 35 species and 147 relevés
Conclusions
General conclusions
References

Introduction

The classification of vegetation relevés is the arrangement of relevés into classes whose members share a number of characteristics that separate them from the members of other classes. The classification of relevés is a process quite different from factor analysis, since it involves discontinuities in composition, not only between the concrete units in the field but also between the abstract classes within which all vegetation may, theoretically, be placed (GREIG-SMITH, 1964). All classification is, to some extent, an arbitrary process, especially when the discontinuities between possible classes are not very marked. Nevertheless, classification, whether natural or not, is useful from many points of view (GOUNOT, 1969): as a label, as a basis for mapping or as a process of synthesis. The latter author considers that classifying vegetation is difficult because vegetation presents itself as a mosaic of juxtaposed elements, variable in floristic composition and structure, whose meaning depends on the scale at which it is observed.

We do not return here to the concept of the continuum or its opposite, but consider that the choice of a classification mode can be influenced by the way in which vegetation is viewed: hierarchically, as in the phytosociological system (see WESTHOFF & van der MAAREL, in WHITTAKER, 1973), or by other approaches free of any hierarchical structure. We deal here only with numerical classification, in the same spirit as in the preceding chapters.

If one considers, following GOODALL (1973), that vegetation relevés are represented by points in a multidimensional space whose axes correspond to the variables by which they are described, classification consists in dividing this space into subspaces. If the dispersion of the points is interrupted by discontinuities or regions of low density, the division follows the discontinuities. GOODALL (1973) recommends that the classification of the relevés of a vegetation table be preceded by a factor analysis. If the dispersion of the relevés in the space then defined by a small number of dimensions is continuous, an arbitrary number of classes can be defined; if it presents discontinuities, these serve as a reference.

Once again, our aim is not to present all the techniques exhaustively, but rather to illustrate how to treat vegetation data, in the same spirit as in the chapters on the other techniques of multivariate analysis.

The reader who wishes to go deeper into the technical aspects of classification should consult the works of GOODALL (1973), van der MAAREL (1979), GAUCH (1982), JONGMAN et al. (1987), PODANI (2000) and WILDI (2010 & 2013).

The classification of relevés and of variables

Classification usually involves relevés, with a view to defining plant groupings or plant associations (WESTHOFF & van der MAAREL, 1973). The classification can also be built, not on relevés, but on variables, so as to verify whether species and environmental factors can be associated, which amounts to defining ecological or socio-ecological groups in DUVIGNEAUD's sense (in DULIÈRE et al., 1996). The questions most often asked when approaching the classification process are presented by PIELOU (1977):
- Should the classification be hierarchical or networked?
- Should the method be divisive or agglomerative?
- Should classes be separated by monothetic or polythetic criteria?
- Should the data be qualitative or quantitative?
- How should class separation be measured?
We start from these questions to present our point of view on classification.

Hierarchical or non hierarchical method?

In a hierarchical classification, classes at a given level are subclasses of classes of a higher level (PIELOU, 1977). The classification of relevés by any hierarchical method allows the construction of a tree diagram or dendrogram (figure 1), which shows the sequence in which the divisions or fusions of groups are made. The classification of plants into orders, families, genera and species is a well-known example. Hierarchical classification is by far the most widely used and the easiest to understand. It is a practical device; for us, its use does not imply that vegetation has a naturally hierarchical structure.

Several non-hierarchical classification algorithms have been created. GAUCH (1979) starts by randomly selecting a site (or relevé) and grouping all the sites within a certain radius of this site.
The technique repeats the process until all sites are taken into account. In a second phase, the sites of small groups are reassigned to larger groups within a larger radius. JANSSEN (1975) proposes a very similar approach but uses the first site in the data set to initiate the first group. Any site lying further from the first than a fixed radius is used to initiate a new group. The following sites are compared to all previously formed groups. The order in which the sites enter the classification thus clearly affects the result, which is why a subsequent step performs reallocations in order to define the "best" groups. In the algorithm of VAN TONGEREN (1986), a certain number of sites is chosen at random or defined by the user. All other sites are assigned to the closest of these sites. A better grouping is then created by reallocations, repeated until stability is reached.

ROUX (1985) describes the technique of aggregation around mobile centers, of which there are many variants; the process starts by setting a number k of classes and choosing an initial partition, either random or defined by the user. If one has prior knowledge of the vegetation studied (from a factor analysis, for example), a user-defined partition is not necessarily a disadvantage. This algorithm has the advantage of optimizing a simple criterion of dispersion, namely the second-order moment of a partition. The algorithm first calculates the center of gravity of the points in the space defined by the transformed variables. The total second-order moment is obtained by summing, over all individuals and variables, the squares of the differences between the coordinates of the center of gravity and those of the individuals. The centers of gravity of the predefined classes are then computed, as well as the second-order moments of these classes. The sum of these moments constitutes the intra-class second-order moment; the inter-class moment is the difference between the total moment and the intra-class moment. Then the algorithm reassigns individuals to new classes as long as the intra-class moment is in each case smaller than it was before the reassignment; otherwise there is no reassignment.
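The procedure just described can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the program of ROUX (1985): it uses the common batch variant in which every point is reassigned to its nearest class center at each pass, and the function names (`mobile_centers`, `moments`) and the toy coordinates are hypothetical.

```python
import random

def centroid(points):
    """Center of gravity (coordinate-wise mean) of a list of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def moments(points, labels, k):
    """Return the (total, intra-class, inter-class) second-order moments."""
    g = centroid(points)
    total = sum(sq_dist(p, g) for p in points)
    intra = 0.0
    for c in range(k):
        members = [p for p, l in zip(points, labels) if l == c]
        if members:
            gc = centroid(members)
            intra += sum(sq_dist(p, gc) for p in members)
    return total, intra, total - intra

def mobile_centers(points, k, labels=None, seed=0, max_iter=100):
    """Reassign points to the nearest class center until the partition is stable."""
    rng = random.Random(seed)
    if labels is None:
        labels = [rng.randrange(k) for _ in points]  # random initial partition
    labels = list(labels)
    for _ in range(max_iter):
        centers = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            # Assumption of this sketch: an empty class is re-seeded on a random point.
            centers.append(centroid(members) if members else list(rng.choice(points)))
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
    return labels

# Toy data: two obvious groups, with a user-defined (deliberately poor) initial partition.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = mobile_centers(points, 2, labels=[0, 1, 0, 1, 0, 1])
total, intra, inter = moments(points, labels, 2)
print(labels, round(inter / total, 3))
```

The ratio of the inter-class moment to the total moment printed at the end is a convenient way to compare the final partitions obtained from different initial partitions.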
However, there is no certainty of obtaining an absolute optimum, that is to say the best solution. One of the generally recommended ways to improve the results is to run the complete algorithm several times with different initial partitions. We can then retain the final partition that minimizes the intra-class moment or, equivalently, maximizes the inter-class moment. However, as the author points out, a better strategy is certainly the examination of "strong forms". These consist of the subsets of objects that have always been grouped together in the same final class during the various runs with different initial partitions. The various existing techniques for partitioning a set of relevés have been presented by PODANI (2000).

Divisive or agglomerative method?

In a divisive classification, the set of relevés is first divided into subsets, which are themselves divided, and so on down to the ultimate classes. An agglomerative classification starts from the bottom, that is, from the relevés, and groups them, step by step, into larger and larger sets, each set consisting of the union of smaller subsets.

Divisive classification is generally considered advantageous because it is faster and less sensitive to local variations, which are likely to create poor combinations of individual relevés that persist in the subsequent steps of an agglomerative classification. The same defect is sometimes found in divisive algorithms, but it is corrected by reallocation procedures (BERTHET et al., 1976).

Monothetic or polythetic classification?

In a monothetic classification, the two "sibling" groups are distinguished by a single character, present in one and absent from the other. In a polythetic classification, the sibling groups are distinguished by their overall similarity, based on all the criteria used to describe them (most often species). In a vegetation study where abundance data are available, a polythetic technique is required in most cases. A monothetic technique is suited only to qualitative data.

Qualitative or quantitative data?

This is an old debate. Some phytosociologists favor the presence of species, with the aim in particular of defining character species, while other researchers take account of their abundance. This choice sometimes depends on the nature of the vegetation described (very numerous species, for example), the time allotted to the description (always too short), the initial objectives (sometimes forgotten) and many other parameters. The use of data transformed by factor analyses lessens the impact of this choice. In the approach that we follow, starting from a raw relevé table to be classified, we repeat all the steps presented in the chapters on factor analysis techniques. We therefore recommend using tables transformed by preliminary multivariate analyses (principal component analysis, correspondence analysis, multiple factor analysis), as proposed by DAGNELIE (1968), BERTHET et al. (1976) and ROUX (1985), and also used by NOY-MEIR (1973), BOUXIN & LE BOULENGÉ (1983), BAAMAL (1994), BOUXIN (1987a and b, 1991, 1995 and 1999).

Similarity measures

Many classification methods require the calculation of a similarity coefficient (or its complement, a dissimilarity or distance) between the pairs of entities to be classified. Entities are either groups of relevés or single relevés. There is a multitude of similarity coefficients.
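As an illustration, three widely used measures (the Euclidean distance, the chi-square distance and the Jaccard index) can be sketched as follows. This is a minimal sketch, assuming relevés are stored as rows of abundances in a relevés x species table; the function names and the small table are hypothetical, and the chi-square distance is written in its usual correspondence-analysis form (row profiles weighted by the inverse of the column masses).

```python
import math

def euclidean(a, b):
    """Euclidean distance between two relevés (lists of abundances)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def jaccard(a, b):
    """Jaccard index on presence/absence: shared species / species in either relevé."""
    both = sum(1 for x, y in zip(a, b) if x > 0 and y > 0)
    either = sum(1 for x, y in zip(a, b) if x > 0 or y > 0)
    # Assumption of this sketch: two empty relevés get a similarity of 0.
    return both / either if either else 0.0

def chi_square_distance(table, i, j):
    """Chi-square distance between rows i and j of a relevés x species table.

    Assumes every row total is non-zero; empty columns are skipped.
    """
    total = sum(sum(row) for row in table)
    col_tot = [sum(row[s] for row in table) for s in range(len(table[0]))]
    ri, rj = sum(table[i]), sum(table[j])
    return math.sqrt(sum(
        (total / col_tot[s]) * (table[i][s] / ri - table[j][s] / rj) ** 2
        for s in range(len(col_tot)) if col_tot[s] > 0))

# Hypothetical table: 3 relevés (rows) x 3 species (columns).
table = [[2, 0, 1],
         [1, 1, 0],
         [0, 3, 1]]
print(euclidean(table[0], table[1]))
print(jaccard(table[0], table[1]))
print(chi_square_distance(table, 0, 1))
```

Note that the chi-square distance compares relative compositions (profiles): two relevés with proportional abundances are at distance zero, whereas their Euclidean distance is not zero.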

The coefficients most often used are the Euclidean distance, the chi-square distance and the Jaccard index. There are many others. However, when using a table in which the species have been replaced by variables transformed by a multivariate analysis, the Euclidean distance is required. The reader will find useful documentation in the works of ROUX (1985) and PODANI (2000).

Hierarchical classifications

A dendrogram of a hierarchical process has a number of nodes (represented by horizontal lines), each corresponding to an intermediate group that has been divided into two sub-groups (divisive classification) or formed by the union of two sub-groups (agglomerative classification). Each intermediate group has its own status, given by the position of the corresponding node in the dendrogram. Both methods are now explained.

Agglomerative methods

Two ways of building the hierarchy are possible: either by forming groups so that the distance between the groups is maximum, or by favoring the homogeneity within the groups formed.

In an agglomerative classification based on distance measures, there are several aggregation criteria, the main ones being complete linkage, average linkage, single linkage, centroid and flexible strategy. In an agglomerative classification with complete linkage, the distance from a relevé to a group is defined as the distance to the farthest relevé of that group; when two groups of relevés merge, their distance is equal to the greatest distance among all pairs of relevés taken one from each of the two merged groups. Average linkage takes into account the average distance instead of the maximum distance. Single (or minimum, or nearest-neighbor) linkage takes into account the minimum distance. The centroid method calculates the distance between the centers of gravity (in the geometric sense) of the groups.
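The four distance-based criteria just described can be compared on a toy example. The sketch below is a naive implementation written for readability rather than speed, assuming Euclidean distances between relevés given by their coordinates; the names `group_distance` and `agglomerate` are hypothetical, and the flexible strategy is not covered.

```python
import math

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    """Center of gravity of a list of points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def group_distance(g1, g2, points, method):
    """Distance between two groups (tuples of point indices) for a given criterion."""
    pairs = [dist(points[i], points[j]) for i in g1 for j in g2]
    if method == "complete":   # greatest distance among all pairs
        return max(pairs)
    if method == "single":     # smallest distance among all pairs
        return min(pairs)
    if method == "average":    # mean distance over all pairs
        return sum(pairs) / len(pairs)
    if method == "centroid":   # distance between the centers of gravity
        return dist(centroid([points[i] for i in g1]),
                    centroid([points[j] for j in g2]))
    raise ValueError("unknown method: " + method)

def agglomerate(points, method="average"):
    """Merge the two closest groups until one remains; return the list of merges."""
    groups = [(i,) for i in range(len(points))]
    merges = []
    while len(groups) > 1:
        best = None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                d = group_distance(groups[a], groups[b], points, method)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((groups[a], groups[b], d))
        merged = groups[a] + groups[b]
        groups = [g for i, g in enumerate(groups) if i not in (a, b)] + [merged]
    return merges

# Hypothetical coordinates of five relevés on two axes.
points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
for method in ("single", "complete", "average", "centroid"):
    print(method, [(a, b, round(d, 2)) for a, b, d in agglomerate(points, method)])
```

Each merge corresponds to one node of the dendrogram, and the merge distance gives the level of that node. On this example the four criteria produce the same sequence of groups but different node levels.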
In an agglomerative classification based on the search for homogeneity within groups, two groups are merged when the resulting within-group dispersion is smaller than it would be if either of them were merged with any other group. An example is the minimum variance technique (WARD, 1963, ORLÓCI, 1978, ORLÓCI & KENKEL, 1985, GAUCH, 1982, ROUX, 1985). The criterion for deciding on the merger of two classes is based on the increase of the intra-class dispersion. At each step of the algorithm, we merge the two classes whose fusion causes the smallest increase of the intra-class moment.

In the divisive technique of virtual centers (described under "Divisive methods" below), the distances between the points and the virtual centers are modified by the assignment and recentering operations, so the two operations are repeated until the virtual centers are fully stabilized. At this step, two groups are formed. Thereafter, each new virtual center is located in the group that has the largest determinant of the variance-covariance matrix and is placed on the point of that group farthest from its centroid. The preceding operations are repeated until complete stabilization.

The algorithm described by ROUX (1985) is called "hierarchical construction of the moment of second order". The clustering criterion is given by:

$$ Q_{jk} = \frac{n_j n_k}{n_j + n_k} \, d_{jk} $$

where $d_{jk}$ is the squared distance between the centroids of groups $j$ and $k$, and $n_j$ and $n_k$ are the sizes of the groups. The fusions are selected so as to minimize $Q_{jk}$ at each step. The agglomerative algorithm can handle large amounts of data.

Some software integrates reallocation procedures, as in non-hierarchical techniques; an example is given by JANSSEN (197). The procedure starts with a preclassification of the relevés, created by the user (following a multivariate analysis, for example) or even random. At each step of the procedure, the similarity between each relevé and the centroid of each group of relevés is calculated using a similarity ratio. When a relevé shows a greater similarity with another group than with the group to which it is attached, it is moved to that other group. Reallocations are repeated until the groups of relevés are stable. Then the similarities between each pair of groups are calculated and the two most similar groups are merged; each relevé is again compared to the different groups, and reallocations are still possible. The process continues until the desired minimum number of groups is obtained or until the similarity between the two most similar groups is less than a minimum value set by the user. In a second step, the program orders the relevés and species in a table.

Divisive methods

With a divisive classification, a crucial question immediately arises: what rule should be used to stop the classification? In other words, if the relevés are arranged in a line (figure 1), at what level should the dendrogram be cut, parallel to the horizontal axis, in order to obtain an optimal classification?
- The number of classes to be recognized can be fixed.
- The level of heterogeneity allowed in the recognized classes can be defined.
- The dendrogram can be read from top to bottom, terminating each branch as soon as a node is reached that is shorter than a chosen length.

A polythetic divisive technique with reallocations was created by BERTHET et al. (1976) and used by BOUXIN (1978 and 1987b). A preliminary correspondence analysis (or other multivariate analysis) is first performed. We thus have a set of data composed of n individuals characterized by p continuous variables; we want to group these individuals into g classes. If each individual is represented by a point in a p-dimensional space, the Euclidean distance between points i and j is given by:

$$ d_{ij} = \left[ \sum_{k=1}^{p} (x_{ik} - x_{jk})^2 \right]^{1/2} $$

In order to create groups of individuals, virtual centers are successively introduced into the data space and forced to take a position at the center of the groups. The first virtual center is placed at the centroid of the set of points; its coordinates are the averages of the coordinates of all points. The second virtual center is then introduced at the point furthest from the first virtual center. Two operations follow: calculating the distances between the points and the virtual centers, and adjusting the position of the virtual centers.
a. The Euclidean distances between one point and all the virtual centers are calculated and this point is assigned to the nearest virtual center. This process is repeated for all points.
b. Each virtual center is transferred to a new position, which is the centroid of all the points assigned to it.
In order to limit the calculations, the user sets the maximum number of groups.

The stopping rules

The user of a hierarchical technique, whether divisive or agglomerative, wonders how to determine the optimal number of groups. There are many stopping rules. Some are non-inferential, such as the one associated with the divisive technique of BERTHET et al. (1976). The criterion adopted is the evolution of a quantity $V_g$ as a function of the number of groups, where

$$ V_g = \sum_{i=1}^{g} |w_i|^{1/2} $$

$w_i$ is the dispersion matrix of group $i$ ($|w_i|$ being its determinant) and $g$ is the number of groups. When $g$ increases, $V_g$ decreases and reaches a minimum when $g$ is equal to the number of groups actually present. Any subsequent division increases the value of $V_g$. This technique therefore has a stopping rule, but experience shows that $V_g$ very often decreases continuously with the number of groups and often does not reach a minimum (BOUXIN, 1978). The stopping rule works well only if there are clear discontinuities.

Inferential stopping rules take into account only the two classes that are candidates for fusion at each step of the classification. The number of classes defined at a given step is considered to need increasing by one unit (which amounts to descending one notch in the hierarchy) if the hypothesis of equality of the two classes to be merged during this step is rejected.
If so, one continues in the same way for step e-1, and so on until the hypothesis is accepted for the first time. Only a few rules are an exception to this principle. Some rules are based on the assumption of normality of the parent population, others on the assumption of uniformity, others on permutations. Some make no explicit assumption. Several are based on the bootstrap method (BAAMAL, 1994). This author stresses that the application of the majority of inferential rules is relatively complex. Moreover, this complexity is poorly justified insofar as it is not accompanied by an improvement in performance, in practical terms, compared with rules of much simpler design. He also developed a stopping rule, based on Monte Carlo techniques, with a classification quality criterion that depends on the number of rows and columns in the data table. The parameter used is defined, for each partition of a set of individuals into two classes, from the ratio of the inter-class sum of squares to the total sum of squares on the one hand, and from the relative proportion of the variation associated with the first principal axis of the scatter diagram (obtained from a principal component analysis) on the other hand. This rule has been integrated into a classification program (WARD technique). The author found that the numbers of individuals and variables had a very important influence on the performance of the stopping rules. Indeed, the results are, in general, better with high numbers of individuals or variables. He considers it pointless to seek a single stopping rule that would be the best in all situations. The problem of stopping rules therefore still requires much research.

Twinspan

The technique known as TWINSPAN (Two-Way INdicator SPecies ANalysis) is also polythetic and divisive (HILL, 1979, GAUCH, 1982, JONGMAN et al., 1987). The data are first processed by correspondence analysis. The species that characterize the extremities of the axis are detected so as to polarize the relevés; these are divided into two groups by cutting the axis at its centroid. The division of the relevés is refined by a new classification using the species at the ends of the correspondence analysis axis. The division process is then repeated on the two subsets of relevés, and so on until each group of relevés has a number of relevés that does not exceed a fixed minimum.
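The first step of one such dichotomy (ordination on the first correspondence analysis axis, then a cut at the centroid) can be sketched as follows. This is not the TWINSPAN program: pseudospecies, the refinement by indicator species and the recursion are all omitted. The sketch assumes that the first axis is obtained by reciprocal averaging, that every relevé and every species has a non-zero total, and the function names and the small table are hypothetical.

```python
def first_ca_axis(table, n_iter=200):
    """Relevé scores on the first correspondence analysis axis.

    The scores are obtained by reciprocal averaging (a power iteration).
    `table` is a relevés x species list of lists of abundances.
    """
    n_rel, n_sp = len(table), len(table[0])
    row_tot = [sum(row) for row in table]
    col_tot = [sum(table[i][j] for i in range(n_rel)) for j in range(n_sp)]
    grand = sum(row_tot)
    x = [float(i) for i in range(n_rel)]  # arbitrary starting scores
    for _ in range(n_iter):
        # Species scores: weighted averages of the relevé scores.
        u = [sum(table[i][j] * x[i] for i in range(n_rel)) / col_tot[j]
             for j in range(n_sp)]
        # Relevé scores: weighted averages of the species scores.
        x = [sum(table[i][j] * u[j] for j in range(n_sp)) / row_tot[i]
             for i in range(n_rel)]
        # Remove the trivial solution (weighted centering), then rescale.
        mean = sum(r * s for r, s in zip(row_tot, x)) / grand
        x = [s - mean for s in x]
        norm = (sum(r * s * s for r, s in zip(row_tot, x)) / grand) ** 0.5
        if norm == 0:
            break
        x = [s / norm for s in x]
    return x

def split_at_centroid(table):
    """Divide the relevés into two groups by cutting the first axis at its centroid (0)."""
    scores = first_ca_axis(table)
    negative = [i for i, s in enumerate(scores) if s < 0]
    positive = [i for i, s in enumerate(scores) if s >= 0]
    return negative, positive

# Hypothetical table: species 1-2 in relevés 0-2, species 3-4 in relevés 3-5,
# species 5 present everywhere.
table = [[1, 1, 0, 0, 1],
         [1, 1, 0, 0, 1],
         [1, 1, 0, 0, 1],
         [0, 0, 1, 1, 1],
         [0, 0, 1, 1, 1],
         [0, 0, 1, 1, 1]]
print(split_at_centroid(table))
```

Applying `split_at_centroid` again to each of the two subsets would give the next level of the hierarchy.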
A corresponding classification of the species is produced in parallel with that of the relevés, and the hierarchical classification of the species is used to construct a data table rearranged in both its rows and its columns, so as to present the groups of relevés with their indicator species.

Cocktail clustering

Cocktail clustering (BRUELHEIDE, 2016) is a hierarchical agglomerative clustering algorithm for species. It starts with a species x species matrix of a coefficient of association. After fusing the two species with the highest coefficient, the association matrix is recalculated for the new group of species. To calculate the association of a group with other species, or with the nodes formed by groups of species, the observed frequency distribution of co-occurrences of the species in that group is compared to the expected frequency distribution of co-occurrences, derived from the observed number of species occurrences. As a result, for each species group a minimum number of species is obtained that is required to assign a relevé to this species group. The resulting Cocktail species groups are partially nested and, with increasing node hierarchy, show a tendency of decreasing correlation with the last-joining species in that group.

As the clustering algorithm assigns all of the n species in a data set to groups, the result is n - 1 partly nested species groups. These groups correspond to species groups that have been extracted from the same data set using preconceived starting groups. Subsequently, the species groups can be used separately or in logical combinations to classify vegetation relevés, either by expert systems, by Twinspan-like classification algorithms or by redefining existing vegetation units with automatic algorithms. Used in this way, Cocktail clustering is able to form the backbone of a consistent large-scale vegetation classification system.

Choice of techniques

Among all the existing techniques, we recommend the following, for the first file at least:
- the agglomerative hierarchical classification of the second-order moment of the raw table,
- the agglomerative hierarchical classification of the raw table by the flexible link and the Euclidean distance,
- the agglomerative hierarchical classification of the second-order moment of the table transformed by classical correspondence analysis and by non-symmetrical correspondence analysis,
- an agglomerative hierarchical classification based on a matrix of distances between relevés (coordinates of the non-symmetrical correspondence analysis, Euclidean distance, average linkage),
- a divisive classification based on a distance matrix (coordinates of the non-symmetrical correspondence analysis, Euclidean distance),
- a divisive classification by virtual centers (coordinates of the non-symmetrical correspondence analysis, Euclidean distance),
- the divisive classification "TWINSPAN", which can only be applied to raw tables,
- the technique of aggregation around mobile centers of ROUX (1985), applied to the table transformed by classical correspondence analysis and by non-symmetrical correspondence analysis; the two types of preclassification (random and fixed) were used.
The "Cocktail" clustering is mainly intended for phytosociologists and still requires concrete applications. The comparison of the results leads to a final solution which retains some arbitrary character.

Examples

A small table

The classification procedure is illustrated with a small table. As with multivariate analysis techniques, the analysis of the results affects how vegetation is conceptualized.

We take up the Tailfer file (table 1) covered in chapter 7. Two steps are presented: the classification of the raw table and then that of the transformed table.

Species/relevés: relevés (1 to 29)
Angelica sylvestris
Athyrium filix-femina
Callitriche platycarpa
Callitriche stagnalis
Cardamine pratensis
Carex pendula
Carex remota
Chrysosplenium oppositifolium
Cirsium palustre
Deschampsia cespitosa
Filipendula ulmaria
Galium palustre
Glyceria fluitans
Juncus effusus
Lotus uliginosus
Lysimachia nemorum
Mentha arvensis
Myosotis scorpioides
Persicaria hydropiper
Ranunculus repens
Rumex sanguineus
Scrophularia auriculata
Solanum dulcamara
Sphagnum palustre
Stachys sylvatica
Stellaria alsine
Sum

Table 1. Vegetation relevés of tributary 5 of the Tailfer river; table with marginal totals (sums).

The brooklet rises on a plateau. It is at first a simple drain, in which water appears gradually. The stream crosses open and wooded areas before reaching the hilly, completely wooded part. The slope is at first moderate, then steep, with a stream bed made up of pebbles and blocks.

Raw table

First, the hierarchical classification of the second-order moment of the raw table easily distinguishes three groups (figure 1):
- relevés 1, 5 and 7, which correspond to openings in the woodland, with several heliophilic species found only in these relevés,
- relevés 2, 3, 4, 6, 8, 9, 12, 14 and 19, characterized by Carex remota,
- relevés 10, 11, 13, 15, 16, 17, 18, 20, 21, 22, 23, 24, 25, 26, 27, 28 and 29, characterized by Carex pendula and some other infrequent species;
or six groups if one stops further down the hierarchy:
- relevés 1, 5 and 7, already distinguished in the three-group solution,
- relevés 2, 3, 4, 9, 12 and 14,
- relevés 6, 8, 19 and 28,
- relevés 10 and 11,
- relevés 13, 15, 18, 20, 21 and 22,
- relevés 16, 17, 23, 24, 25, 26, 27 and 29.
Only the last group has a character species, namely Carex pendula. Only the first solution is of phytosociological and synthetic interest.

Figure 1. Agglomerative hierarchical classification of the table of 26 species x 29 relevés.

Another hierarchical method, with a link called "flexible beta" and the Euclidean distance, gives results close to those of the second-order moment (figure 2).

Figure 2. Agglomerative hierarchical classification, flexible beta link method, of the table of 26 species x 29 relevés.

Here are the results of the TWINSPAN classification (table 2), with the rearrangement of species and relevés.

Athyrium filix-femina
Carex pendula
Chrysosplenium oppositifolium
Callitriche platycarpa
Galium palustre
Solanum dulcamara
Sphagnum palustre
Carex remota
Filipendula ulmaria
Scrophularia auriculata
Deschampsia cespitosa
Cardamine pratensis
Stachys sylvatica
Angelica sylvestris
Callitriche stagnalis
Lysimachia nemorum
Glyceria fluitans
Juncus effusus
Ranunculus repens
Cirsium palustre
Lotus pedunculatus
Mentha arvensis
Myosotis scorpioides
Persicaria hydropiper
Rumex sanguineus
Stellaria alsine

Table 2. TWINSPAN classification of the raw table with the hierarchy of relevés and species.

This last classification is close to the other two, with the small difference that Deschampsia cespitosa is better discriminated than in the two previous ones.

Complete table transformed by correspondence analyses

In the transformed tables, the number of axes retained was fixed by the permutation technique (BOUXIN, 199). The agglomerative hierarchical classification of the second-order moment is applied to the table transformed by correspondence analysis. The first five significant variables are retained. Three groups are again distinguished (figure 3):
- relevés 1 and 7, which are two of the three open sites,

- relevés 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15 and 19,
- relevés 16, 17, 18, 20, 21, 22, 23, 24, 25, 26, 27, 28 and 29.
The division corresponds much better to the situation of the brooklet, which rises on the wooded plateau with a slow course and then becomes rapid as it descends the hill from relevé 16 onwards. This change is marked by a difference in the general physiognomy of the vegetation and in the floristic composition of the relevés. In this case, the synthetic nature of the classification is markedly improved.

Figure 3. Agglomerative hierarchical classification of the second-order moment of the table transformed by correspondence analysis. Five significant variables x 29 relevés.

The next agglomerative hierarchical classification (figure 4) uses average linkage. This analysis reflects the defects of the correspondence analysis of a complete table, and the division into classes partly reflects the superposition of very different species frequencies. The results are of little synthetic interest.

Figure 4. Agglomerative hierarchical classification by average linkage of the table transformed by correspondence analysis. Five significant variables x 29 relevés.

With the table transformed by non-symmetrical correspondence analysis (figure 5), two entities are separated first: the relevés comprising the species limited to, or observed mainly in, the upper part of the course (mainly characterized by Deschampsia cespitosa) and those comprising the species of the steep lower course (without Deschampsia cespitosa). In the formation of these groups, common species play an important role: first, a set of species present exclusively or preferentially in the two best-lit relevés (1 and 7), followed by a group of relevés characterized mainly by Cardamine pratensis, a group of 10 relevés characterized by Carex remota and finally a group characterized by Carex pendula. The scheme thus proposed is simple and its synthetic character is undeniable. There is also an interesting two-level hierarchical structure.

Figure 5. Agglomerative hierarchical classification of the second-order moment of the table transformed by non-symmetrical correspondence analysis. Two significant variables x 29 relevés.

With the average linkage technique (figure 6), the results are close to those of the preceding analysis, except that relevé 15 is more isolated. This relevé contains only one species (Deschampsia cespitosa) and is therefore strongly influenced by this particularity.

Figure 6. Agglomerative hierarchical classification by average linkage of the table transformed by non-symmetrical correspondence analysis. Two significant variables x 29 relevés.

The results of the divisive technique of virtual centers are now presented, first with the table transformed by correspondence analysis (figures 7 and 8).

Figure 7. Evolution of the total occupied space (vertical axis, series labelled CENVI) as a function of the number of virtual centers (horizontal axis). Analysis conducted with the table transformed by correspondence analysis. Five significant axes.

The quantity Vg (total occupied space) is not very useful in this case, since this parameter decreases constantly and does not yield a simple stopping rule.
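To make the behaviour of Vg concrete, the virtual-center procedure can be sketched for two variables. This is a minimal illustration under several assumptions, not the program of BERTHET et al. (1976): Vg is taken to be the sum over groups of the square root of the determinant of each group's variance-covariance matrix (computed with divisor n), the data are limited to two axes so that the determinant has a closed form, and all function names and coordinates are hypothetical.

```python
import math

def centroid(pts):
    """Centroid of a list of 2-D points."""
    n = len(pts)
    return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

def sq_dist(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def det_dispersion(pts):
    """Determinant of the 2 x 2 variance-covariance matrix of a group."""
    if len(pts) < 2:
        return 0.0
    cx, cy = centroid(pts)
    n = len(pts)
    sxx = sum((p[0] - cx) ** 2 for p in pts) / n
    syy = sum((p[1] - cy) ** 2 for p in pts) / n
    sxy = sum((p[0] - cx) * (p[1] - cy) for p in pts) / n
    return max(sxx * syy - sxy ** 2, 0.0)

def stabilize(points, centers, max_iter=100):
    """Alternate assignment (a) and recentering (b) until the groups are stable."""
    labels = None
    for _ in range(max_iter):
        new = [min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
               for p in points]
        if new == labels:
            break
        labels = new
        for c in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = centroid(members)
    return labels

def virtual_centers(points, max_groups):
    """Return ({g: V_g}, final labels) for g = 1 .. max_groups."""
    centers = [centroid(points)]
    labels = [0] * len(points)
    v = {}
    while True:
        groups = [[p for p, l in zip(points, labels) if l == c]
                  for c in range(len(centers))]
        v[len(centers)] = sum(math.sqrt(det_dispersion(g)) for g in groups)
        if len(centers) >= max_groups:
            return v, labels
        # New center: in the group with the largest determinant,
        # on the point of that group farthest from its centroid.
        worst = max(range(len(groups)),
                    key=lambda c: det_dispersion(groups[c]) if groups[c] else -1.0)
        centers.append(max(groups[worst], key=lambda p: sq_dist(p, centers[worst])))
        labels = stabilize(points, centers)

# Hypothetical coordinates: three compact groups of four relevés each.
points = [(0, 0), (1, 0), (0, 1), (1, 1),
          (10, 0), (11, 0), (10, 1), (11, 1),
          (5, 10), (6, 10), (5, 11), (6, 11)]
v, labels = virtual_centers(points, 4)
print({g: round(val, 3) for g, val in v.items()})
```

On clearly separated data such as this toy example, Vg drops sharply until the number of virtual centers reaches the number of groups actually present; with continuous data the decrease is gradual, which is the situation described above.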

Figure 8. Hierarchical tree of the divisive classification by the technique of virtual centers, conducted with the table transformed by correspondence analysis. Five significant axes. Presentation with seven virtual centers. The arrow indicates a reallocation. The rectangles give the number of relevés or their identifying numbers.

This technique first forms one large group and three very small groups. Subsequent divisions produce only groups consisting of a single relevé. It is only at the sixth division that two important groups are created and synthetic information is obtained.

The technique of virtual centers, applied to a table transformed by non-symmetrical correspondence analysis, brings very different results (figures 9 to 11).

[Figure 9 plot: total occupied space (CENVI) against the number of virtual centers.]

Figure 9. Evolution of the total occupied space as a function of the number of virtual centers. Analysis conducted with the table transformed by non-symmetrical correspondence analysis. Two significant axes.

There is again a constant decrease of Vg.

Figure 10. Hierarchical tree of the divisive classification by the technique of virtual centers, conducted with the table transformed by non-symmetrical correspondence analysis. Two significant axes. Presentation with four virtual centers. There is only one reallocation (relevé 15), indicated by the arrow. The rectangles give either the number of relevés or the relevé numbers.

The results are again strongly influenced by frequent species. Following the groups from left to right in figure 10, the groups are characterized:

1. by Cardamine pratensis and some localized species;
2. by the superposition of Carex remota and Deschampsia cespitosa;
3. by the superposition of Carex pendula and Deschampsia cespitosa;
4. by Carex pendula alone.

The four groups are represented in the plane of the first two axes (the only two significant ones) of the non-symmetrical correspondence analysis (figure 11).

Figure 11. Representation of the four groupings in the plane of the first two axes of the non-symmetrical correspondence analysis.

Based on the knowledge gained about the plant groupings, it is now possible to use the mobile center technique, which starts from a preclassification. We compare the results obtained from four defined classes with those obtained from four classes drawn at random. In the first case, the algorithm reproduces the four initial classes, with an inter-class moment/total moment ratio of 0.854, always higher than the ratio obtained from a random starting partition.

Conclusions

From a phytosociological and synthetic point of view, the classifications built from a table transformed by non-symmetrical correspondence analysis give the best results. Each classification technique has its own sensitivity, but the prior transformation of the tables limits the arbitrariness of the choice of technique. In the later steps, we use only transformed tables. The small number of relevés in the Tailfer5 table does not make it possible to group the relevés into blocks in a demonstrative way.

The use of a stopping rule has not been of great help to us.

A large floristical table

We return to the Crupet file processed in Chapter 7. This is a simplified table: only the species with a significant dispersion pattern (according to Chapter 4) have been retained. In this case, correspondence analysis can be used. Simplified tables are frequently used, but the simplification is generally based on an arbitrarily fixed species frequency (1, 2 or 3 presences, for example). Eliminating rows from a table may be statistically justified, but it is harder to accept for a phytosociologist or ecologist who regards vegetation as an ecosystem in which each species has its function. This is the only example presented.

File with 35 species and 147 relevés

The file was first transformed by non-symmetrical correspondence analysis. Four axes are significant. A table of 4 variables x 147 relevés was therefore subjected to the various classification techniques. Let us begin with WARD's agglomerative hierarchical classification (figure 12), which produces a large number of classes without it being possible here to fix a stopping level.

Figure 12. Agglomerative hierarchical classification with WARD's method of the Crupet table (35 species) transformed by non-symmetrical correspondence analysis, with four significant variables. Seven main clusters are marked from A to G.
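As a rough illustration of what WARD's method does on such a table, the Python sketch below merges, at each step, the two clusters whose fusion raises the within-class second-order moment the least, and stops at seven classes. It is only a sketch under stated assumptions: the coordinates are random placeholders (the Crupet data are not reproduced here), every relevé is given unit weight, and the names `ward_increase` and `ward_classes` are hypothetical.

```python
import random

def ward_increase(a, b):
    """Increase in within-class second-order moment if clusters a and b merge."""
    (ca, na), (cb, nb) = a, b
    d2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return na * nb / (na + nb) * d2

def ward_classes(points, n_classes):
    """Agglomerate points until n_classes remain; return lists of row indices."""
    # Each cluster is stored as (centroid, size); unit weights are assumed.
    clusters = [(tuple(p), 1) for p in points]
    members = [[i] for i in range(len(points))]
    while len(clusters) > n_classes:
        # Pick the pair whose merge raises the within-class moment the least.
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: ward_increase(clusters[ij[0]], clusters[ij[1]]),
        )
        (ci, ni), (cj, nj) = clusters[i], clusters[j]
        merged = tuple((ni * x + nj * y) / (ni + nj) for x, y in zip(ci, cj))
        clusters[i] = (merged, ni + nj)
        members[i] = members[i] + members[j]
        del clusters[j], members[j]  # j > i, so index i is unaffected
    return members

# Hypothetical stand-in for the table of 147 relevés x 4 significant axes.
random.seed(0)
table = [[random.gauss(0, 1) for _ in range(4)] for _ in range(147)]
classes = ward_classes(table, n_classes=7)
print(sorted(len(c) for c in classes))
```

The number of classes is passed in by hand here, which mirrors the difficulty noted in the text: the tree itself does not say where to stop.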

Let us compare immediately with the technique of virtual centers (figures 13 and 14).

[Figure 13 plot: total occupied space (CENVI) against the number of virtual centers.]

Figure 13. Evolution of the total occupied space as a function of the number of virtual centers. Analysis conducted with the table transformed by non-symmetrical correspondence analysis. Four significant axes.

Figure 14. Hierarchical tree of the divisive classification by the technique of virtual centers, starting from the 147 relevés and conducted with the table transformed by non-symmetrical correspondence analysis. Four significant axes. Presentation with six virtual centers. The rectangles give either the number of relevés or the relevé numbers. The arrows and the numbers in bold italics indicate the number of reallocations.

It is easy to see that the two classifications produce very different results, and it is difficult to choose between them a priori. Two interesting observations come first from the technique of virtual centers. The number of classes is determined in a simple way (figure 13): the total occupied space decreases progressively from one to six virtual centers and then increases, before falling significantly from nine centers onward. With nine or more virtual centers, the technique produces only very small groups. In addition, from three to four, then from four to five virtual centers, there are many reallocations.

Let us first introduce the results produced by the technique of virtual centers. The following groupings are recognized:
- a woody group characterized by Alnus glutinosa, Alnus incana and Fraxinus excelsior, by species of shady environments such as Angelica sylvestris, Festuca gigantea and Filipendula ulmaria, and by others of the lower course, which is shadier than the rest, such as Persicaria bistorta, Cardamine amara, Myosotis scorpioides, Fontinalis antipyretica and Petasites hybridus (C1 of table 3; coordinates on axis 1 of the non-symmetrical correspondence analysis < 0);
- a grazed grassland grouping characterized by the combination of Glyceria notata, Persicaria hydropiper and Veronica beccabunga (C4 of table 3; coordinates on axis 1 > 0 and on axis 2 < 0);
- an unshaded or slightly shaded grouping characterized by the combination of Stachys sylvatica, Glyceria notata and Ranunculus repens (C5 of table 3; coordinates on axis 2 < 0 and coordinates on axis 4 < 0 or close to 0);
- a large grouping characterized by the combination of herbaceous species, the most frequent being Agrostis stolonifera, Epilobium hirsutum, Phalaris arundinacea, Rumex conglomeratus and Scrophularia umbrosa (C2);
- two small species-poor groupings, each dominated by one species (Nasturtium officinale for C3 and Filipendula ulmaria for C6).
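The way the number of classes is read from figure 13 (the total occupied space falls up to six virtual centers, then rises) amounts to locating the first local minimum of the curve. A minimal Python sketch follows. The Vg values are made up to mimic the shape described in the text, since the actual values are not given, and the function name `first_local_minimum` is hypothetical.

```python
def first_local_minimum(values, first_count=1):
    """Return the number of centers at which the curve first stops decreasing.

    values[0] is the total occupied space for `first_count` centers.
    Returns None when the curve decreases steadily (no simple stopping rule).
    """
    for k in range(1, len(values)):
        if values[k] > values[k - 1]:
            return first_count + k - 1
    return None

# Hypothetical Vg values for 1 to 10 virtual centers (shape only, not real data).
vg = [0.00016, 0.00014, 0.00013, 0.00012, 0.00011,
      0.00010, 0.00012, 0.00013, 0.00008, 0.00006]
print(first_local_minimum(vg))            # 6
print(first_local_minimum([5, 4, 3, 2]))  # None
```

A steadily decreasing curve, as in figures 7 and 9, returns `None`, which corresponds to the cases where Vg gave no usable stopping level.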
With WARD's technique, one recognizes:
- a woody grouping characterized by the combination of several woody species, mostly Alnus glutinosa, Alnus incana, Fraxinus excelsior and Salix rubens, and the following herbaceous species: Angelica sylvestris, Festuca gigantea, Chrysosplenium oppositifolium, Filipendula ulmaria, Cardamine amara and, in the lower course, Fontinalis antipyretica (E in table 3; coordinates on axis 1 < 0);
- a group characterized by a set of herbaceous species with significant negative coordinates on the second axis, mainly Glyceria notata and Persicaria hydropiper (F in table 3);
- a grouping of 6 poorly diversified relevés characterized by Angelica sylvestris alone (A in table 3; coordinates on axes 3 and 4 negative);
- a grouping of 3 relevés characterized by Nasturtium officinale (G in table 3; coordinates > 0 on axis 4);
- a grouping characterized by the combination of the tree Alnus glutinosa and herbaceous species, mainly Angelica sylvestris, Scrophularia umbrosa, Mentha aquatica, Phalaris arundinacea, Festuca gigantea and Myosotis scorpioides (B in table 3);

- another poorly defined grouping (C in table 3, with coordinates positive or very close to the origin on axis 1);
- a grouping of well-lit sites influenced by a combination of herbaceous species favored by pollution or quite severe eutrophication, such as Agrostis stolonifera, Phalaris arundinacea, Epilobium hirsutum, Rumex conglomeratus, Nasturtium officinale, Epilobium roseum and Veronica beccabunga (D in table 3, with positive coordinates on both axes 1 and 2).

[Table 3 body: classes C1 to C6 set against classes A to G; cell values lost in extraction.]

Table 3. Comparison of both classifications. C1 to 6: hierarchical divisive classification of virtual centers. W1 to 7: agglomerative hierarchical classification of the second-order moment.

Each of the techniques has its own sensitivity. One reveals, for example, a grassland grouping with Glyceria notata and Veronica beccabunga (integrated, in the other technique, into a larger group); the other reveals a brooklet grouping, polluted and eutrophicated by discharges of domestic and agricultural wastewater, which is in turn integrated into a larger group by the first technique. It is therefore impossible to retain the results of a single classification only.

We therefore decided to combine the contributions of the two techniques to construct a preclassification, which is then submitted to the technique of mobile centers. We created seven classes by taking classes E, A, B and C (table 3) as they are. Relevés 53, 64 and 7 have been included in class D because they contain a high biomass of Nasturtium officinale, as do many relevés of this class. The C4 relevés are taken out of class F and form class G (grouping with Glyceria notata and Veronica beccabunga). The result with the highest inter-class moment/total moment ratio (table 4) is used.
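The procedure just described, refining a preclassification with mobile centers and judging the result by the inter-class moment/total moment ratio, can be sketched in Python as follows. This is an illustrative sketch, not the author's program: it assumes unit weights and Euclidean distance, the data are synthetic, and the names `mobile_centers` and `moment_ratio` are hypothetical.

```python
import random

def centroid(rows):
    """Mean point of a list of rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def sqdist(p, q):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(p, q))

def group_by_label(points, labels):
    groups = {}
    for p, lab in zip(points, labels):
        groups.setdefault(lab, []).append(p)
    return groups

def mobile_centers(points, labels, max_iter=100):
    """Refine a starting partition: recompute centers, reassign, repeat."""
    labels = list(labels)
    for _ in range(max_iter):
        # A class that loses all its members simply disappears.
        centers = {lab: centroid(rows)
                   for lab, rows in group_by_label(points, labels).items()}
        new = [min(centers, key=lambda lab: sqdist(p, centers[lab]))
               for p in points]
        if new == labels:  # stable partition reached
            break
        labels = new
    return labels

def moment_ratio(points, labels):
    """Inter-class moment divided by total moment (unit weights assumed)."""
    g = centroid(points)
    total = sum(sqdist(p, g) for p in points)
    inter = sum(len(rows) * sqdist(centroid(rows), g)
                for rows in group_by_label(points, labels).values())
    return inter / total

# Synthetic example: three well-separated groups of 20 points each.
random.seed(1)
points = [[random.gauss(m, 0.3), random.gauss(m, 0.3)]
          for m in (0, 3, 6) for _ in range(20)]
informed = [i // 20 for i in range(60)]            # preclassification
at_random = [random.randrange(3) for _ in range(60)]
for name, start in (("informed", informed), ("random", at_random)):
    final = mobile_centers(points, start)
    print(name, round(moment_ratio(points, final), 3))
```

Running the refinement from several starting partitions and keeping the one with the highest ratio is the selection step referred to in the text.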

[Table 4 body: relevés listed under class 1 to class 7; entries lost in extraction.]

Table 4. Classification of the 147 relevés with the technique of mobile centers.

The following groupings are thus recognized:

- a woody grouping distributed throughout the course, characterized mainly by the combination of Alnus glutinosa, A. incana, Fraxinus excelsior, Angelica sylvestris and Festuca gigantea;
- an open group, also distributed throughout the course and characterized mainly by Angelica sylvestris;
- a grouping distributed over the upper and middle reaches, characterized by Stachys sylvatica and Glyceria notata;
- a grouping of open sites, found only in the upper course in the middle of the pasture, characterized mainly by Glyceria notata and Veronica beccabunga;
- a woody grouping characterized mainly by the combination of Alnus glutinosa, Filipendula ulmaria, Angelica sylvestris, Phalaris arundinacea, Scrophularia umbrosa, Mentha aquatica, Myosotis scorpioides, Festuca gigantea and Stachys sylvatica;
- a grouping of open sites spread over a large part of the course from the village of Assesse, characterized by the combination of herbaceous species favored by the generalized eutrophication of the river: Epilobium hirsutum, Agrostis stolonifera, Rumex conglomeratus, Scrophularia umbrosa, Mentha aquatica, Phalaris arundinacea, Epilobium roseum and Lycopus europaeus;
- a grouping of open sites occupying mainly a quarter of the course after Assesse, in an agricultural environment, characterized by species resistant to pollution such as Nasturtium officinale, Veronica beccabunga, Lycopus europaeus and Cirsium palustre.

As an example, we also applied WARD's technique to the species (figure 15). The table comprises four columns (the four significant axes) and 35 rows (species).

[Figure 15: dendrogram whose leaves are the 35 species codes.]

Figure 15. Agglomerative hierarchical classification of species with WARD's method of the Crupet table (35 species) transformed by non-symmetrical correspondence analysis, with four significant variables.

Six groups of plants (ecological groups) are easily distinguished:
- a group of woody plants and of species that are sciaphilous or more common in wooded areas, with Alnus glutinosa, Fraxinus excelsior, Alnus incana and Salix rubens as large trees, and Chrysosplenium oppositifolium, Festuca gigantea, Cardamine amara, Persicaria bistorta, Fontinalis antipyretica, Myosotis scorpioides and Petasites hybridus as herbaceous species;
- a small group with Filipendula ulmaria and Stachys sylvatica, which tolerate both shade and some light;
- a group of three heliophilic herbaceous species of the upper course: Glyceria notata, Lycopus europaeus and Veronica beccabunga;
- a group of herbaceous species developing mainly in areas receiving wastewater: Cirsium palustre, Epilobium roseum, Nasturtium officinale, Phalaris arundinacea, Epilobium hirsutum, Rumex conglomeratus, Mentha aquatica and Scrophularia umbrosa;
- an isolated species: Angelica sylvestris;
- a group of herbaceous species not falling into the previous categories, less "typed": Agrostis stolonifera, Calystegia sepium, Festuca arundinacea, Juncus inflexus, Solanum dulcamara, Stachys palustris, Glyceria fluitans, Sparganium erectum, Persicaria hydropiper and Ranunculus repens.

Conclusions

At this step, we see how much the results differ depending on which classification technique is used.

General conclusions

These few pages show how difficult it is to construct a classification of a relevé table. The same table can lead to a very large number of solutions.
The results depend on:
- the data used: original data or data transformed by one or another multivariate analysis, complete original table or table simplified after a statistical pattern analysis;
- the algorithms used: hierarchical classification or not;
- the variants used, especially in metrics;
- a possible stopping rule.
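The last item can be made concrete with the rule of thumb proposed in these general conclusions: take twice the number of significant axes of the preliminary factor analysis as an order of magnitude for the number of classes. The Python sketch below computes that count. The second function is only one possible reading of "separating, for each axis, the positive and negative coordinates": it labels a relevé by the axis on which its coordinate is largest in absolute value and by the sign of that coordinate. That assignment, and both function names, are assumptions made for illustration.

```python
def suggested_class_count(n_significant_axes):
    """Order of magnitude for the number of classes: two per significant axis."""
    return 2 * n_significant_axes

def sign_class(coords):
    """Label a relevé by its dominant axis (1-based) and the sign found there.

    Hypothetical reading of the rule; the text itself only gives the count.
    """
    axis = max(range(len(coords)), key=lambda a: abs(coords[a]))
    return (axis + 1, "+" if coords[axis] >= 0 else "-")

print(suggested_class_count(4))             # 8
print(sign_class([0.1, -0.8, 0.2, 0.05]))   # (2, '-')
```

With the four significant axes of the Crupet table the rule gives eight, close to the seven classes retained; with the two significant axes of the smaller table it gives four, the number of groups found there.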

In general, we recommend using only tables previously transformed by multivariate analysis; this reduces the impact of the choice of metric. The Euclidean distance is then a perfectly usable metric. It is easy to combine floristic variables and mesological variables in the same classification, as long as one starts from a table transformed by multiple factor analysis. It is thus possible to provide a statistical basis for socioecological groups.

We also recommend comparing the results produced by several techniques in order to highlight the strong nuclei, that is to say the most robust groups. The final use of the mobile center technique makes it possible to consolidate previous choices or prompts further analysis.

There is no ready-made solution to the problem of stopping rules. A simple solution is proposed here: take the number of significant axes in the preliminary factor analysis and multiply it by two. This amounts to separating, for each axis, the relevés or species with positive coordinates from those with negative coordinates. This rule provides an order of magnitude for the number of classes. It is up to the user to refine the number of classes by observing the results.

The association of factorial analyses and classification techniques opens immense possibilities for the description of plant groupings. Multiple factor analysis techniques, which give a more balanced representation of multilayered vegetation, for example, or which combine floristic and mesological variables, lead to classifications that take into account the components of phytocenoses in their environment. Many possibilities remain to be explored.

The many available techniques are worth nothing if they are applied automatically, without serious reflection on the nature of the data and the way in which they were collected. These techniques will never compensate for gaps in vegetation sampling.

References

BAAMAL, L. (1994). Etudes des règles d'arrêt en classification numérique. Dissertation originale présentée en vue de l'obtention du grade de docteur en sciences agronomiques. Faculté des Sciences agronomiques de Gembloux. 60 pp.

BERTHET, P., FEYTMANS, E., STEVENS, D. & GENETTE, A. (1976). A new divisive method of classification illustrated by its applications to ecological problems. Proc. Ninth Int. Biom. Conf., invited papers. Vol. II:


More information

INSIDE ALGEBRA CORRELATED WITH CALIFORNIA S COMMON CORE STANDARDS HIGH SCHOOL ALGEBRA

INSIDE ALGEBRA CORRELATED WITH CALIFORNIA S COMMON CORE STANDARDS HIGH SCHOOL ALGEBRA We CA Can COMMON Early Learning CORE STANDARDS Curriculum PreK Grades 8 12 INSIDE ALGEBRA CORRELATED WITH CALIFORNIA S COMMON CORE STANDARDS HIGH SCHOOL ALGEBRA May 2011 www.voyagersopris.com/insidealgebra

More information

Learning from Examples

Learning from Examples Learning from Examples Data fitting Decision trees Cross validation Computational learning theory Linear classifiers Neural networks Nonparametric methods: nearest neighbor Support vector machines Ensemble

More information

More on Unsupervised Learning

More on Unsupervised Learning More on Unsupervised Learning Two types of problems are to find association rules for occurrences in common in observations (market basket analysis), and finding the groups of values of observational data

More information

Preprocessing & dimensionality reduction

Preprocessing & dimensionality reduction Introduction to Data Mining Preprocessing & dimensionality reduction CPSC/AMTH 445a/545a Guy Wolf guy.wolf@yale.edu Yale University Fall 2016 CPSC 445 (Guy Wolf) Dimensionality reduction Yale - Fall 2016

More information

Computer Vision Group Prof. Daniel Cremers. 14. Clustering

Computer Vision Group Prof. Daniel Cremers. 14. Clustering Group Prof. Daniel Cremers 14. Clustering Motivation Supervised learning is good for interaction with humans, but labels from a supervisor are hard to obtain Clustering is unsupervised learning, i.e. it

More information

Modern Information Retrieval

Modern Information Retrieval Modern Information Retrieval Chapter 8 Text Classification Introduction A Characterization of Text Classification Unsupervised Algorithms Supervised Algorithms Feature Selection or Dimensionality Reduction

More information

6. APPLICATION TO THE TRAVELING SALESMAN PROBLEM

6. APPLICATION TO THE TRAVELING SALESMAN PROBLEM 6. Application to the Traveling Salesman Problem 92 6. APPLICATION TO THE TRAVELING SALESMAN PROBLEM The properties that have the most significant influence on the maps constructed by Kohonen s algorithm

More information
