Chapter 12. The numerical classification of vegetation


by Guy BOUXIN
rue des Sorbiers 33, B Erpent; e-mail: guy.bouxin@skynet.be

Contents
Introduction
The classification of relevés and of variables
Hierarchical or non hierarchical method?
Divisive or agglomerative method?
Monothetic or polythetic classification?
Qualitative or quantitative data?
Similarity measures
Hierarchical classifications
Agglomerative methods
Divisive methods
The stopping rules
Twinspan
Cocktail clustering
Choice of techniques
Examples
A small table
Raw table
Complete table transformed by correspondence analyses
Conclusions
A large floristical table
File with 35 species and 147 relevés
Conclusions
General conclusions
References

Introduction

The classification of vegetation relevés is the arrangement of relevés into classes whose members share a number of characteristics that separate them from members of other classes. It is a process quite different from factorial analysis, since it involves discontinuities in composition, not only among the concrete units in the field but also among the abstract classes within which all vegetation may, theoretically, be placed (GREIG-SMITH, 1964). All classification is, to some extent, an arbitrary process, especially when the discontinuities between possible classes are not very marked. Nevertheless, classification, whether natural or not, is useful from many points of view (GOUNOT, 1969): as a label, as a basis for mapping or as a synthesis process. GOUNOT adds that classification is made difficult by the fact that vegetation presents itself as a mosaic of juxtaposed elements of variable floristic composition and structure, whose meaning changes with the scale at which it is observed.

We do not return here to the concept of the continuum or its opposite, but consider that the choice of a classification mode can be influenced by the way in which vegetation is considered: hierarchically, as in the phytosociological system (see WESTHOFF & van der MAAREL, in WHITTAKER, 1973), or by other approaches that depart from any hierarchical structure. We deal here only with numerical classification, in the same spirit as in the preceding chapters.

If one considers, following GOODALL (1973), that vegetation relevés are represented by points in a multidimensional space, the axes of which correspond to the variables by which they are described, classification consists in dividing this space into subspaces.
If the dispersion of the points is interrupted by discontinuities or regions of low density, the division follows the discontinuities. GOODALL (1973) recommends that the classification of relevés in a vegetation table be preceded by a factor analysis. If the dispersion of the relevés in a space then defined by a small number of dimensions is continuous, an arbitrary number of classes can be defined; if it presents discontinuities, these will serve as a reference. Once again, our aim is not to present all the techniques exhaustively, but rather to illustrate how to treat vegetation data, in the same spirit developed in the chapters on the use of other techniques of multivariate analysis.

The reader who wishes to go deeper into the technical aspects of classification may consult the works of GOODALL (1973), van der MAAREL (1979), GAUCH (1982), JONGMAN et al. (1987), PODANI (2000) and WILDI (2010 & 2013).

The classification of relevés and of variables

Classification usually involves relevés, with a view to defining plant groupings or plant associations (WESTHOFF & van der MAAREL, 1973). A classification can also be built on variables rather than on relevés, in order to check whether species and environmental factors can be associated; this amounts to defining ecological or socio-ecological groups in the sense of DUVIGNEAUD (in DULIÈRE et al., 1996). The questions most often asked when approaching the classification process are presented by PIELOU (1977):
- Should the classification be hierarchical or networked (non-hierarchical)?
- Should the method be divisive or agglomerative?
- Should classes be separated by monothetic or polythetic criteria?
- Should the data be qualitative or quantitative?
- How should class separation be measured?
We start from these questions to present our point of view on classification.

Hierarchical or non hierarchical method?

In a hierarchical classification, classes at a given level are subclasses of classes of a higher level (PIELOU, 1977). The classification of relevés by any hierarchical method allows the construction of a tree diagram or dendrogram (figure 1), which shows the sequence in which the divisions or fusions are made. The classification of plants into orders, families, genera and species is a well-known example. Hierarchical classification is by far the most widely used and the easiest to understand. It is a practical device; for us, it does not imply that vegetation has a naturally hierarchical structure.

Several non-hierarchical classification algorithms have been created. GAUCH (1979) starts by randomly selecting a site (or relevé) and grouping all the sites within a certain radius of this site.
The technique repeats the process until all sites are taken into account. In a second phase, the sites of small groups are reassigned to larger groups within a larger radius. JANSSEN (1975) proposes a very similar approach but takes the first site in the data to initiate the first group. Any site lying farther than a fixed radius from the first one is used to initiate a new group. The following sites are compared to all previously formed groups. The order in which the sites enter the classification therefore clearly affects the result, which is why a further step performs reallocations in order to define the "best" groups. In the algorithm of VAN TONGEREN (1986), a certain number of sites are chosen at random or defined by the user. All other sites are assigned to the closest of these. A better grouping is then obtained by reallocations, repeated until the groups are stable.

ROUX (1985) describes the technique of aggregation around mobile centers, of which there are many variants; the process starts by setting a number k of classes and choosing an initial partition, either random or defined by the user. If one has previous knowledge of the vegetation studied (from a factor analysis, for example), a user-defined partition is not necessarily a disadvantage. This algorithm has the advantage of optimizing a simple criterion of dispersion, namely the second order moment of a partition. The algorithm first calculates the center of gravity of the points in the space defined by the transformed variables. The total second order moment is obtained by summing, over all individuals and variables, the squares of the differences between the coordinates of the center of gravity and those of the individuals. The centers of gravity of the predefined classes are then computed, as well as the second order moments of these classes about their own centers. The sum of these moments constitutes the intra-class second order moment. Then, the algorithm reassigns individuals to new classes as long as the intra-class moment is, in each case, less than it was before reassignment; otherwise there is no reassignment.
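
The reassignment loop and the moments described above can be sketched as follows. This is a minimal illustration, not the program of ROUX (1985): it uses the common batch variant in which every point is moved to the nearest class centre at each pass, and the function names (`centroid`, `moments`, `mobile_centers`) and the toy coordinates are assumptions introduced for the example.

```python
import random

def centroid(points):
    """Centre of gravity of a list of points (tuples of coordinates)."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(len(points[0])))

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def moments(points, labels):
    """Return the (total, intra-class, inter-class) second order moments."""
    g = centroid(points)
    total = sum(sq_dist(p, g) for p in points)
    intra = inter = 0.0
    for c in set(labels):
        members = [p for p, l in zip(points, labels) if l == c]
        gc = centroid(members)
        intra += sum(sq_dist(p, gc) for p in members)
        inter += len(members) * sq_dist(gc, g)
    return total, intra, inter

def mobile_centers(points, k, labels=None, seed=0, max_iter=100):
    """Move points to the nearest class centre until the partition is stable.

    `labels` is an optional user-defined initial partition; if it is
    omitted, a random initial partition into k classes is drawn.
    """
    rng = random.Random(seed)
    if labels is None:
        labels = [rng.randrange(k) for _ in points]
    labels = list(labels)
    for _ in range(max_iter):
        # Centres of gravity of the current (non-empty) classes.
        centers = {c: centroid([p for p, l in zip(points, labels) if l == c])
                   for c in set(labels)}
        new_labels = [min(sorted(centers), key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:  # no reassignment: partition is stable
            break
        labels = new_labels
    return labels

# Hypothetical coordinates standing in for relevés on two factorial axes.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 0), (10, 1),
          (11, 0), (11, 1), (5, 10), (5, 11), (6, 10), (6, 11)]
final = mobile_centers(points, 3, labels=[0, 1, 2] * 4)
total, intra, inter = moments(points, final)
print(final, inter / total)
```

The ratio `inter / total` printed at the end is one way to compare final partitions: the closer it is to 1, the smaller the share of the total moment left inside the classes.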
However, there is no certainty of reaching an absolute optimum, that is to say the best solution. One of the generally recommended ways to improve the results is to run the complete algorithm several times with different initial partitions. We can then retain the final partition that minimizes the intra-class moment or, equivalently, maximizes the inter-class moment. However, as the author points out, a better strategy is certainly the examination of "strong shapes". These consist of the subsets of objects that were always grouped together in the same final class over all the initial partitions tested. The various existing techniques for partitioning a set of relevés have been presented by PODANI (2000).

Divisive or agglomerative method?

In a divisive classification, the set of relevés is first divided into subsets, which are themselves divided, and so on down to the ultimate classes. An agglomerative classification starts from the bottom, that is, from the relevés, and groups them step by step into larger and larger sets, each set consisting of the union of smaller subsets.

Divisive classification is generally considered advantageous because it is faster and less sensitive to local variations, which are likely to create poor combinations of single relevés that persist in the subsequent steps of an agglomerative classification. The same defect is sometimes found in divisive algorithms, but it is corrected there by reallocation procedures (BERTHET et al., 1976).

Monothetic or polythetic classification?

In a monothetic classification, two "sister" groups are distinguished by a single character, present in one and absent from the other. In a polythetic classification, the sister groups are distinguished by their overall similarity, based on all the criteria used to describe them (most often species). In a vegetation study where abundance data are available, a polythetic technique is required in most cases. A monothetic technique is best suited to purely qualitative data.

Qualitative or quantitative data?

This is an old debate. Some phytosociologists favor the presence of species, in particular with the aim of defining character-species, while other researchers take abundance into account. The choice sometimes depends on the nature of the vegetation described (very numerous species, for example), the time available for the description (always too short), the initial objectives (sometimes forgotten) and many other parameters. The use of data transformed by factor analyses lessens the impact of this choice. In the approach that we follow, we start with a raw relevé table to classify and repeat all the steps presented in the chapters on factor analysis techniques. We therefore recommend using tables transformed by preliminary multivariate analyses (principal component analysis, correspondence analysis, multiple factor analysis), as proposed by DAGNELIE (1968), BERTHET et al. (1976) and ROUX (1985) and also used by NOY-MEIR (1973), BOUXIN & LE BOULENGÉ (1983), BAAMAL (1994) and BOUXIN (1987a and b, 1991, 1995 and 1999).

Similarity measures

Many classification methods require the calculation of a similarity coefficient (or its complement, a dissimilarity or distance) between the pairs of entities to be classified. Entities are either single relevés or groups of relevés. There is a multitude of similarity coefficients.

The coefficients most often used are the Euclidean distance, the chi-square distance and the Jaccard index. There are many others. However, when using a table in which the species have been replaced by variables transformed by a multivariate analysis, the Euclidean distance is needed. The reader will find useful documentation in the works of ROUX (1985) and PODANI (2000).

Hierarchical classifications

The dendrogram of a hierarchical process has a number of nodes (represented by horizontal lines), each corresponding to an intermediate group that has been divided into two sub-groups (divisive classification) or formed by the union of two sub-groups (agglomerative classification). Each intermediate group has its own status, given by the position of the corresponding node in the dendrogram. Both methods are now explained.

Agglomerative methods

Two ways of building the hierarchy are possible: either by forming groups so that the distance between the groups is maximum, or by favoring the homogeneity within the groups formed.

In an agglomerative classification based on distance measures, there are several aggregation criteria, the main ones being complete linkage, average linkage, single linkage, centroid and flexible strategy. With complete linkage, the distance from a relevé to a group is defined as the distance to the farthest relevé of that group; when two groups of relevés merge, their distance is the greatest distance among all pairs of relevés taken one from each group. Average linkage uses the average distance instead of the maximum distance. Single (or minimum, or nearest-neighbor) linkage uses the minimum distance. The centroid method uses the distance between the centers of gravity (in the geometric sense) of the groups.
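
The four distance-based criteria just described differ only in how the pairwise Euclidean distances between two groups are summarized. The short sketch below illustrates this; the function name `group_distance` and the toy coordinates are assumptions made for the example, and the flexible strategy is not covered.

```python
import math

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(group):
    """Centre of gravity of a group of points."""
    n = len(group)
    return tuple(sum(p[k] for p in group) / n for k in range(len(group[0])))

def group_distance(group_a, group_b, method="complete"):
    """Distance between two groups of relevés under a given linkage."""
    pairs = [dist(a, b) for a in group_a for b in group_b]
    if method == "complete":   # farthest pair
        return max(pairs)
    if method == "single":     # nearest pair
        return min(pairs)
    if method == "average":    # mean over all pairs
        return sum(pairs) / len(pairs)
    if method == "centroid":   # distance between centres of gravity
        return dist(centroid(group_a), centroid(group_b))
    raise ValueError("unknown linkage: " + method)

# Two hypothetical groups of relevés described by two variables.
group_a = [(0, 0), (0, 2)]
group_b = [(3, 0), (5, 0)]
for method in ("single", "average", "centroid", "complete"):
    print(method, group_distance(group_a, group_b, method))
```

For any pair of groups, the single-linkage value is never larger than the average-linkage value, which in turn is never larger than the complete-linkage value.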
In an agglomerative classification based on the search for homogeneity within groups, two groups are merged as long as the resulting dispersion within groups is less than it would be if either of them were merged with any other group. An example is the minimum variance technique (WARD, 1963, ORLÓCI, 1978, ORLÓCI & KENKEL, 1985, GAUCH, 1982, ROUX, 1985). The criterion for deciding on the merger of two classes is based on the increase of intra-class dispersion: at each step of the algorithm, we merge the two classes which cause the smallest increase of the intra-class moment. The algorithm described by ROUX (1985) is called "hierarchical construction of the moment of second order". The clustering criterion is given by:

$$Q_{jk} = \frac{n_j n_k}{n_j + n_k}\, d_{jk}$$

where $d_{jk}$ is the squared distance between the centroids of groups j and k, and $n_j$ and $n_k$ are the sizes of the groups. The fusions are selected so as to minimize $Q_{jk}$ at each step. The agglomerative algorithm can handle large amounts of data.

Some software integrates reallocation procedures, as in non-hierarchical techniques; an example is given by JANSSEN (197). The procedure starts with a preclassification of the relevés, created by the user (following a multivariate analysis, for example) or even random. At each step of the procedure, the similarity between each relevé and the centroid of each group of relevés is calculated using a similarity ratio. When a relevé is more similar to another group than to the group to which it belongs, it is moved to that other group. Reallocations are repeated until the groups are stable. Then, the similarities between all pairs of groups are calculated and the two most similar groups are merged; each relevé is again compared to the different groups and reallocations remain possible. The process continues until the desired minimum number of groups is obtained or until the similarity between the two most similar groups is less than a minimum value set by the user. In a second step, the program orders the relevés and species in a table.

Divisive methods

With a divisive classification, a crucial question immediately appears: what rule should be used to stop the classification? In other words, if the relevés are arranged in line (figure 1), at what level should the dendrogram be cut, parallel to the horizontal axis, in order to obtain an optimal classification?
- The number of classes to be recognized can be fixed.
- The level of heterogeneity allowed in the recognized classes can be defined.
- The dendrogram can be read from top to bottom, terminating each branch as soon as a node is reached that is shorter than a chosen length.

A polythetic divisive technique with reallocations was created by BERTHET et al. (1976) and used by BOUXIN (1978 and 1987b). A preliminary correspondence analysis (or other multivariate analysis) is first performed. We thus have a set of data composed of n individuals characterized by p continuous variables; we want to group these individuals into g classes. If each individual is represented by a point in a p-dimensional space, the Euclidean distance between points i and j is given by:

$$d_{ij} = \left[ \sum_{k=1}^{p} (x_{ik} - x_{jk})^2 \right]^{1/2}$$

In order to create groups of individuals, virtual centers are successively introduced into the data space and forced to take a position at the center of the groups. The first virtual center is placed at the centroid of the set of points; its coordinates are the averages of the coordinates of all points. The second virtual center is then introduced at the point furthest from the first virtual center. Two operations follow: calculating the distances between the points and the virtual centers, and adjusting the position of the virtual centers.
a. The Euclidean distances between one point and all the virtual centers are calculated and this point is assigned to the nearest virtual center. This process is repeated for all points.
b. Each virtual center is transferred to a new position, which is the centroid of all the points assigned to it.
Since the distances between the points and the virtual centers are modified by this process, the two operations are repeated until the virtual centers are fully stabilized. At this step, two groups are formed. Thereafter, each new virtual center is located in the group that has the largest determinant of the variance-covariance matrix and is placed on the point of that group farthest from its centroid. The preceding operations are repeated until complete stabilization. In order to limit the calculations, the user sets the maximum number of groups.

The stopping rules

The user of a hierarchical technique, whether divisive or agglomerative, wonders how to determine the optimal number of groups. There are many stopping rules. Some are non-inferential, such as the one attached to the divisive technique of BERTHET et al. (1976). The criterion adopted is the evolution, as a function of the number of groups, of a quantity Vg defined by:

$$V_g = \sum_{i=1}^{g} |W_i|^{1/2}$$

where $W_i$ is the dispersion matrix of group i, $|W_i|$ its determinant and g the number of groups. When g increases, Vg decreases and reaches a minimum when g is equal to the number of groups actually present. Any subsequent division increases the value of Vg. This technique therefore has a stopping rule, but experience shows that Vg very often decreases continuously with the number of groups and often does not reach a minimum (BOUXIN, 1978). The stopping rule works well only if there are clear discontinuities.

Inferential stopping rules take into account only the two classes that are candidates for fusion at each step of the classification. The number of classes defined at a given step is considered to need increasing by one unit (which amounts to descending one notch in the hierarchy) if the hypothesis of equality of the two classes to be merged during this step is rejected.
If so, one continues in the same way for step e-1, and so on until the hypothesis is accepted for the first time. Only a few rules are an exception to this principle. Some rules are based on the assumption of normality of the parent population, others on the assumption of uniformity, others on permutations. Some make no explicit assumption. Several are based on the bootstrap method (BAAMAL, 1994). This author stresses that the application of most inferential rules is relatively complex. Moreover, this complexity is poorly justified, insofar as it is not accompanied by an improvement in practical performance compared with rules of much simpler design. He also developed a stopping rule, based on Monte Carlo techniques, with a classification quality criterion that depends on the number of rows and columns in the data table. The parameter used is defined, for each partition of a set of individuals into two classes, from the ratio of the inter-class sum of squares to the total sum of squares on the one hand, and from the relative proportion of the variation associated with the first principal axis of the scatter of points (obtained from a principal component analysis) on the other hand. This rule has been integrated into a classification program (WARD technique). The author found that the numbers of individuals and variables had a very important influence on the performance of the stopping rules. Indeed, the results are, in general, better with high numbers of individuals or variables. He finds it pointless to look for a stopping rule that would be the best in all situations. The problem of stopping rules therefore still requires much research.

Twinspan

The technique known as TWINSPAN (Two-Way INdicator SPecies ANalysis) is also polythetic divisive (HILL, 1979, GAUCH, 1982, JONGMAN et al., 1987). The data are first processed by correspondence analysis. The species which characterize the extremities of the axis are detected so as to polarize the relevés; these are divided into two groups by cutting the axis at its centroid. The division of the relevés is refined by a new classification using the species at the ends of the correspondence analysis axis. The division process is then repeated on the two subsets of relevés, and so on until each group contains no more than a fixed minimum number of relevés.
A corresponding classification of the species is produced in parallel with that of the relevés, and the hierarchical classification of the species is used to construct a data table rearranged in both its rows and columns, so as to present the groups of relevés with their indicator species.

Cocktail clustering

Cocktail clustering (BRUELHEIDE, 2016) is a hierarchical agglomerative clustering algorithm for species. It starts with a species x species matrix of association coefficients. After fusing the two species with the highest coefficient, the association matrix is recalculated for the new group of species. To calculate the association of a group with other species, or with the nodes formed by other groups of species, the observed frequency distribution of co-occurrences of the species in that group is compared to the expected frequency distribution of co-occurrences, derived from the observed number of species occurrences. As a result, for each species group, a minimum number of species is obtained that is required to assign a relevé to this species group. The resulting Cocktail species groups are partially nested and, with increasing node hierarchy, show a tendency of decreasing correlation with the last-joining species in that group.

As the clustering algorithm assigns all of the n species in a data set to groups, the result is n - 1 partly nested species groups. These groups correspond to species groups that have been extracted from the same data set using preconceived starting groups. Subsequently, the species groups can be used separately or in logical combinations to classify vegetation relevés, either by expert systems, by Twinspan-like classification algorithms or by redefining existing vegetation units with automatic algorithms. Used in this way, Cocktail clustering is able to form the backbone of a consistent large-scale vegetation classification system.

Choice of techniques

Among all the existing techniques, we recommend the following, for the first file at least:
- the agglomerative hierarchical classification of the second order moment of the raw table,
- the agglomerative hierarchical classification of the raw table by the flexible linkage and the Euclidean distance,
- the agglomerative hierarchical classification of the second order moment of the table transformed by classical correspondence analysis and by non-symmetrical correspondence analysis,
- an agglomerative hierarchical classification based on a distance matrix between relevés (coordinates of the non-symmetrical correspondence analysis, Euclidean distance, average linkage),
- a divisive classification based on a distance matrix (coordinates of the non-symmetrical correspondence analysis, Euclidean distance),
- a divisive classification by virtual centers (coordinates of the non-symmetrical correspondence analysis, Euclidean distance),
- the divisive classification "TWINSPAN", which can only be applied to raw tables,
- the technique of aggregation around mobile centers of ROUX (1985), applied to the table transformed by classical and non-symmetrical correspondence analyses; the two types of preclassification (random and fixed) were used.
The "Cocktail" clustering is mainly intended for phytosociologists and still requires concrete applications. The comparison of the results leads to a final solution which retains some arbitrary character.

Examples

A small table

The classification procedure is illustrated with a small table. As with multivariate analysis techniques, the analysis of the results affects the way vegetation is conceptualized.

We take up the Tailfer file (table 1) covered in chapter 7. Two steps are presented: the classification of the raw table and then that of the transformed table.

Species (rows) / relevés 1 to 29 (columns):
Angelica sylvestris
Athyrium filix-femina
Callitriche platycarpa
Callitriche stagnalis
Cardamine pratensis
Carex pendula
Carex remota
Chrysosplenium oppositifolium
Cirsium palustre
Deschampsia cespitosa
Filipendula ulmaria
Galium palustre
Glyceria fluitans
Juncus effusus
Lotus uliginosus
Lysimachia nemorum
Mentha arvensis
Myosotis scorpioides
Persicaria hydropiper
Ranunculus repens
Rumex sanguineus
Scrophularia auriculata
Solanum dulcamara
Sphagnum palustre
Stachys sylvatica
Stellaria alsine

Table 1. Vegetation relevés of the tributary 5 of the Tailfer river; table with marginal totals. Σ = sum.

The brooklet rises on a plateau. It is at first a simple drain, in which water appears gradually. The stream crosses open and wooded areas before reaching the hilly, completely wooded part. The slope is at first moderate, then steep, with a stream bed made up of pebbles and blocks.

Raw table

First, the hierarchical classification of the second order moment of the raw table easily distinguishes three groups (figure 1):
- relevés 1, 5 and 7, which correspond to openings in the woodland, with several heliophilic species found only in these relevés,
- relevés 2, 3, 4, 6, 8, 9, 12, 14 and 19, characterized by Carex remota,
- relevés 10, 11, 13, 15, 16, 17, 18, 20, 21, 22, 23, 24, 25, 26, 27, 28 and 29, characterized by Carex pendula and some other infrequent species;
or six groups if one stops further down the hierarchy:
- relevés 1, 5 and 7, already separated in the three-group solution,
- relevés 2, 3, 4, 9, 12 and 14,
- relevés 6, 8, 19 and 28,
- relevés 10 and 11,
- relevés 13, 15, 18, 20, 21 and 22,
- relevés 16, 17, 23, 24, 25, 26, 27 and 29.
Only the last group has a character-species, namely Carex pendula. Only the first solution is of phytosociological and synthetic interest.
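
Reading one hierarchy at three or at six groups, as above, amounts to stopping the fusions at a chosen number of classes. The sketch below illustrates this with the second order moment criterion $Q_{jk}$ quoted from ROUX (1985). It is a naive illustration rather than the program used for figure 1: the function names (`criterion`, `ward`) and the toy coordinates are assumptions, and all pairs of groups are re-examined at every step, which is only practical for small tables.

```python
def centroid(group):
    """Centre of gravity of a group of points."""
    n = len(group)
    return tuple(sum(p[k] for p in group) / n for k in range(len(group[0])))

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def criterion(group_j, group_k):
    """Q_jk = n_j n_k / (n_j + n_k) * squared distance between centroids."""
    nj, nk = len(group_j), len(group_k)
    return nj * nk / (nj + nk) * sq_dist(centroid(group_j), centroid(group_k))

def ward(points, n_groups):
    """Merge the pair of groups with the smallest Q_jk until n_groups remain."""
    groups = [[p] for p in points]  # start from single relevés
    while len(groups) > n_groups:
        best = None
        for j in range(len(groups)):
            for k in range(j + 1, len(groups)):
                q = criterion(groups[j], groups[k])
                if best is None or q < best[0]:
                    best = (q, j, k)
        _, j, k = best
        groups[j] = groups[j] + groups[k]  # fusion of groups j and k
        del groups[k]
    return groups

# Hypothetical coordinates standing in for relevés on two factorial axes.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 0), (10, 1),
          (11, 0), (11, 1), (5, 10), (5, 11), (6, 10), (6, 11)]
for n in (3, 2):
    print(n, ward(points, n))
```

Because every fusion is a union of two existing groups, the partition obtained for a smaller number of groups is always nested in the one obtained for a larger number, which is what the dendrogram displays.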

Figure 1. Agglomerative hierarchical classification of the table 26 species x 29 relevés.

Another hierarchical method, with a linkage called "flexible beta" and the Euclidean distance, gives results close to those of the second order moment (figure 2).

Figure 2. Agglomerative hierarchical classification, flexible beta linkage, of the table 26 species x 29 relevés.

Here are the results of the TWINSPAN classification (table 2), with the rearrangement of species and relevés.

Species, in the order of the rearranged table:
Athyrium filix-femina
Carex pendula
Chrysosplenium oppositifolium
Callitriche platycarpa
Galium palustre
Solanum dulcamara
Sphagnum palustre
Carex remota
Filipendula ulmaria
Scrophularia auriculata
Deschampsia cespitosa
Cardamine pratensis
Stachys sylvatica
Angelica sylvestris
Callitriche stagnalis
Lysimachia nemorum
Glyceria fluitans
Juncus effusus
Ranunculus repens
Cirsium palustre
Lotus pedunculatus
Mentha arvensis
Myosotis scorpioides
Persicaria hydropiper
Rumex sanguineus
Stellaria alsine

Table 2. TWINSPAN classification of the raw table with the hierarchy of relevés and species.

This last classification is close to the other two, with the small difference that Deschampsia cespitosa is better discriminated than in the two previous ones.

Complete table transformed by correspondence analyses

In the transformed tables, the number of axes retained was fixed by the permutation technique (BOUXIN, 199).

The agglomerative hierarchical classification of the second order moment is applied to the table transformed by correspondence analysis. The first five significant variables are retained. Three groups are again discriminated (figure 3):
- relevés 1 and 7, which are two of the three open sites,
- relevés 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15 and 19,
- relevés 16, 17, 18, 20, 21, 22, 23, 24, 25, 26, 27, 28 and 29.
This division corresponds much better to the situation of the brooklet, which rises on the wooded plateau with a slow course and then becomes rapid as it descends the hill from relevé 16 onwards. This change is marked by a difference in the general physiognomy of the vegetation and in the floristic composition of the relevés. In this case, the synthetic nature of the classification is markedly improved.

Figure 3. Agglomerative hierarchical classification of the second order moment of the table transformed by correspondence analysis. Five significant variables x 29 relevés.

The next agglomerative hierarchical classification (figure 4) uses average linkage. This analysis reflects the shortcomings of the correspondence analysis of a complete table, and the division into classes partly reflects the superposition of very different species frequencies. The results are of little synthetic interest.

Figure 4. Agglomerative hierarchical classification, average linkage, of the table transformed by correspondence analysis. Five significant variables x 29 relevés.

With the table transformed by non-symmetrical correspondence analysis (figure 5), two entities are separated first: the relevés comprising the species limited to, or observed mainly in, the upper part of the course (mainly characterized by Deschampsia cespitosa) and the relevés of the rough lower course (without Deschampsia cespitosa). In the formation of these groups, common species play an important role: first, a set of species present exclusively or preferentially in the two best-lit relevés (1 and 7), followed by a group of relevés characterized mainly by Cardamine pratensis, a group of 10 relevés characterized by Carex remota and finally a group characterized by Carex pendula. The scheme thus proposed is simple and its synthetic character is undeniable. There is also an interesting two-level hierarchical structure.

Figure 5. Agglomerative hierarchical classification of the second order moment of the table transformed by non-symmetrical correspondence analysis. Two significant variables x 29 relevés.

With the average linkage technique (figure 6), the results are close to those of the preceding analysis, except that relevé 15 is more isolated. This relevé contains only one species (Deschampsia cespitosa), and its position is therefore strongly influenced by this particularity.

Figure 6. Agglomerative hierarchical classification, average linkage, of the table transformed by non-symmetrical correspondence analysis. Two significant variables x 29 relevés.

The results of the divisive technique of virtual centers are now presented, first with the table transformed by correspondence analysis (figures 7 and 8).

Figure 7. Evolution of the total occupied space as a function of the number of virtual centers. Analysis conducted with the table transformed by correspondence analysis. Five significant axes. (Plot title: CENVI; horizontal axis: number of virtual centers; vertical axis: total occupied space.)

The quantity Vg (total occupied space) is not very useful in this case, since this parameter decreases constantly and does not lead to a simple stopping rule.
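
To make the procedure behind such a curve concrete, the sketch below follows the steps of the virtual-center technique as described in this chapter and records the total occupied space after each new center. It is a simplified illustration under stated assumptions, not the CENVI program: it is restricted to two variables so that the determinant is a one-line formula, the dispersion matrix is taken as the variance-covariance matrix with divisor n, Vg is assumed to be the sum over groups of the square root of that determinant, and the function names and toy coordinates are introduced only for the example.

```python
import math

def centroid(points):
    """Centre of gravity of a list of 2-D points."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def dist(a, b):
    """Euclidean distance between two 2-D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def det_dispersion(points):
    """Determinant of the 2 x 2 variance-covariance matrix (divisor n)."""
    n = len(points)
    cx, cy = centroid(points)
    sxx = sum((x - cx) ** 2 for x, _ in points) / n
    syy = sum((y - cy) ** 2 for _, y in points) / n
    sxy = sum((x - cx) * (y - cy) for x, y in points) / n
    return max(sxx * syy - sxy ** 2, 0.0)

def stabilise(points, centers, max_iter=100):
    """Repeat operations a (assign) and b (move centers) until stable."""
    centers = list(centers)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centers)), key=lambda c: dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = centroid(members)
    return labels, centers

def virtual_centers(points, max_groups):
    """Return the final labels and the list of (g, Vg) values."""
    centers = [centroid(points)]  # first virtual center
    labels = [0] * len(points)
    history = []
    while True:
        groups = [[p for p, l in zip(points, labels) if l == c]
                  for c in range(len(centers))]
        groups = [g for g in groups if g]
        vg = sum(math.sqrt(det_dispersion(g)) for g in groups)
        history.append((len(centers), vg))
        if len(centers) >= max_groups:
            return labels, history
        # New center: farthest point of the group with the largest determinant.
        widest = max(groups, key=det_dispersion)
        g = centroid(widest)
        centers.append(max(widest, key=lambda p: dist(p, g)))
        labels, centers = stabilise(points, centers)

# Hypothetical coordinates standing in for relevés on two factorial axes.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 0), (10, 1),
          (11, 0), (11, 1), (5, 10), (5, 11), (6, 10), (6, 11)]
labels, history = virtual_centers(points, 3)
print(labels)
print(history)
```

Plotting the second element of each `history` pair against the first gives the kind of curve shown in figure 7; with data lacking clear discontinuities, the values would be expected simply to keep decreasing.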

Figure 8. Hierarchical tree of the divisive classification by the technique of virtual centers, conducted with the table transformed by correspondence analysis. Five significant axes. Presentation with seven virtual centers. The arrow indicates a reallocation. The rectangles give the number of relevés or their identifying numbers.

This technique first forms one large group and three very small groups. Subsequent divisions produce only groups consisting of a single relevé. It is not until the sixth division that two sizeable groups are created and synthetic information is obtained.

The technique of virtual centers, applied to a table transformed by non-symmetrical correspondence analysis, brings very different results (figures 9 to 11).

[Plot: total occupied space against the number of virtual centers; series labelled CENVI.]

Figure 9. Evolution of the total occupied space as a function of the number of virtual centers. Analysis conducted with the table transformed by non-symmetrical correspondence analysis. Two significant axes.

There is again a constant decrease of Vg.

[Tree diagram: successive divisions of the relevés; box contents not recoverable.]

Figure 10. Hierarchical tree of the divisive classification by the technique of virtual centers, conducted with the table transformed by non-symmetrical correspondence analysis. Two significant axes. Presentation with four virtual centers. There is only one reallocation (relevé 15), indicated by the arrow. The rectangles give the number of relevés or their identifying numbers.

The results are again strongly influenced by frequent species. Following the groups from left to right in figure 10, they are characterized:

1. by Cardamine pratensis and some localized species,
2. by the superposition of Carex remota and Deschampsia cespitosa,
3. by the superposition of Carex pendula and Deschampsia cespitosa,
4. by Carex pendula alone.

The four groups are represented in the plane of the first two axes (the only two significant ones) of the non-symmetrical correspondence analysis (figure 11).

[Scatter plot: axis 1 against axis 2.]

Figure 11. Representation of the four groupings in the plane of the first two axes of the non-symmetrical correspondence analysis.

Based on the knowledge gained about the plant groupings, it is now possible to use the mobile center technique, which starts from a preclassification. We compare the results obtained from four defined classes and from four classes drawn at random. In the first case, the algorithm reproduces the same four classes as the initial ones, with an interclass moment/total moment ratio equal to 0.854, always higher than the ratio obtained with a random starting partition.

Conclusions

In conclusion, from a phytosociological and synthetic point of view, the classifications built from a table transformed by non-symmetrical correspondence analysis give the best results. Each of the classification techniques has its own sensitivity, but the prior transformation of the tables limits the arbitrary aspect of the choice of technique. In the later steps, we will only use transformed tables. The small number of relevés in the Tailfer5 table does not make it possible to group the relevés into blocks in a demonstrative way.
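The mobile center comparison reported above (a defined starting partition against an arbitrary one, judged by the interclass moment/total moment ratio) can be sketched as follows. This is a minimal illustration in plain Python, assuming Euclidean distance on the factorial coordinates; the function names and the toy coordinates are hypothetical and are not taken from the author's software or data.

```python
def centroid(rows):
    """Mean point of a non-empty list of coordinate tuples."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def moment_ratio(points, labels):
    """Interclass moment divided by total moment (assumes points are not all identical)."""
    g = centroid(points)
    total = sum(sq_dist(p, g) for p in points)
    between = 0.0
    for c in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == c]
        between += len(members) * sq_dist(centroid(members), g)
    return between / total


def mobile_centers(points, start_labels, max_iter=100):
    """Reallocate each point to its nearest class center until the partition is stable."""
    labels = list(start_labels)
    for _ in range(max_iter):
        classes = sorted(set(labels))
        centers = {c: centroid([p for p, lab in zip(points, labels) if lab == c])
                   for c in classes}
        new_labels = [min(classes, key=lambda c: sq_dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
    return labels


# Hypothetical coordinates of six relevés on two significant axes.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
defined_start = [0, 0, 0, 1, 1, 1]      # preclassification based on prior knowledge
arbitrary_start = [0, 1, 0, 1, 0, 1]    # stands in for a random draw, fixed for reproducibility

final_defined = mobile_centers(toy_points, defined_start)
final_arbitrary = mobile_centers(toy_points, arbitrary_start)
print(moment_ratio(toy_points, arbitrary_start), moment_ratio(toy_points, final_arbitrary))
```

On this well-separated toy example both starts converge to the same partition; with real relevé tables, different starting partitions may converge to different solutions, which is why the ratio is compared across starts.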

The use of a stopping rule has not been of great help to us.

A large floristical table

We return to the Crupet file processed in Chapter 7. This is a simplified table: only the species with a significant dispersion pattern (according to Chapter 4) have been retained. In this case, correspondence analysis can be used. The use of simplified tables is frequent, but the simplification is generally based on species frequency, with an arbitrarily fixed threshold (1, 2 or 3 presences, for example). The elimination of rows from a table may be statistically justified, but it is more difficult to accept for a phytosociologist or ecologist who regards vegetation as an ecosystem in which each species has its function. This is the only example presented.

File with 35 species and 147 relevés

The file has previously been transformed by non-symmetrical correspondence analysis. Four axes are significant. It is therefore a file of 4 variables x 147 relevés that has been subjected to the various classification techniques. Let us begin with WARD's agglomerative hierarchical classification (figure 12), which produces a large number of classes, without it being possible here to fix a stopping level.

Figure 12. Agglomerative hierarchical classification with WARD's method of the Crupet table (35 species) transformed by non-symmetrical correspondence analysis, with four significant variables. Seven main clusters are marked from A to G.
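As an illustration of the kind of agglomeration used here, the sketch below applies the usual formulation of WARD's criterion to a small table of factorial coordinates: at each step, the two classes whose fusion least increases the within-class sum of squares are merged. It is a plain-Python sketch under stated assumptions (Euclidean distance, hypothetical function names, toy coordinates in place of the 4 x 147 table), not the program used for figure 12.

```python
from itertools import combinations


def merge_cost(a, b):
    """Increase in within-class sum of squares if classes a and b are merged."""
    d2 = sum((x - y) ** 2 for x, y in zip(a["centroid"], b["centroid"]))
    return a["size"] * b["size"] / (a["size"] + b["size"]) * d2


def ward(points, n_classes):
    """Agglomerate until n_classes remain (assumes 1 <= n_classes <= len(points)).

    Returns the classes (lists of row indices) and the cost of each merge.
    """
    clusters = [{"members": [i], "size": 1, "centroid": list(p)}
                for i, p in enumerate(points)]
    costs = []
    while len(clusters) > n_classes:
        # Pair of classes with the smallest increase in within-class inertia.
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda pair: merge_cost(clusters[pair[0]], clusters[pair[1]]))
        a, b = clusters[i], clusters[j]
        costs.append(merge_cost(a, b))
        size = a["size"] + b["size"]
        merged = {
            "members": sorted(a["members"] + b["members"]),
            "size": size,
            "centroid": [(a["size"] * x + b["size"] * y) / size
                         for x, y in zip(a["centroid"], b["centroid"])],
        }
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [c["members"] for c in clusters], costs


# Hypothetical coordinates of six relevés on four significant axes.
toy_table = [(0, 0, 0, 0), (1, 0, 0, 0), (0, 1, 0, 0),
             (10, 10, 0, 0), (11, 10, 0, 0), (10, 11, 0, 0)]
classes, costs = ward(toy_table, n_classes=2)
print(sorted(classes))
print(costs)
```

The list of merge costs plays the role of the fusion levels on a dendrogram: a sharp jump between successive costs is the usual visual cue for a stopping level, and its absence corresponds to the situation described in the text.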

Let us compare immediately with the technique of virtual centers (figures 13 and 14).

[Plot: total occupied space against the number of virtual centers; series labelled CENVI.]

Figure 13. Evolution of the total occupied space as a function of the number of virtual centers. Analysis conducted with the table transformed by non-symmetrical correspondence analysis. Four significant axes.

[Tree diagram: successive divisions of the 147 relevés; box contents not recoverable.]

Figure 14. Hierarchical tree of the divisive classification by the technique of virtual centers, conducted with the table transformed by non-symmetrical correspondence analysis. Four significant axes. Presentation with six virtual centers. The rectangles give the number of relevés or their identifying numbers. The arrows and the numbers in bold italics indicate the number of reallocations.

It is easy to see that the two classifications produce very different results, and it is difficult to choose between them a priori. The technique of virtual centers shows two useful properties. The number of classes is determined in a simple way (figure 13): the total occupied space decreases progressively from one to six virtual centers and then increases, before falling significantly from nine centers onward. With nine or more virtual centers, the technique only produces very small groups. From three to four, then from four to five virtual centers, there are many reallocations.

Let us first introduce the results produced by the technique of virtual centers. The following groupings are recognized:

- a woody group characterized by Alnus glutinosa, Alnus incana and Fraxinus excelsior, by species of shady environments such as Angelica sylvestris, Festuca gigantea and Filipendula ulmaria, and by others from the lower course, which is shadier than the rest, such as Persicaria bistorta, Cardamine amara, Myosotis scorpioides, Fontinalis antipyretica and Petasites hybridus (C1 of table 3; coordinates on axis 1 of the non-symmetrical correspondence analysis < 0);
- a grazed grassland grouping characterized by the combination of Glyceria notata, Persicaria hydropiper and Veronica beccabunga (C4 of table 3; coordinates on axis 1 > 0 and on axis 2 < 0);
- an unshaded or slightly shaded grouping characterized by the combination of Stachys sylvatica, Glyceria notata and Ranunculus repens (C5 of table 3; coordinates on axis 2 < 0 and on axis 4 < 0 or close to 0);
- a large grouping characterized by the combination of herbaceous species, the most frequent being Agrostis stolonifera, Epilobium hirsutum, Phalaris arundinacea, Rumex conglomeratus and Scrophularia umbrosa (C2);
- two small species-poor groupings, each dominated by one species (Nasturtium officinale for C3 and Filipendula ulmaria for C6).

With WARD's technique, one recognizes:

- a woody grouping characterized by the combination of several woody species, mostly Alnus glutinosa, Alnus incana, Fraxinus excelsior and Salix rubens, and the following herbaceous species: Angelica sylvestris, Festuca gigantea, Chrysosplenium oppositifolium, Filipendula ulmaria, Cardamine amara and, in the lower course, Fontinalis antipyretica (E in table 3; coordinates on axis 1 < 0);
- a group characterized by a set of herbaceous species with significant negative coordinates on the second axis, mainly Glyceria notata and Persicaria hydropiper (F in table 3);
- a grouping of 6 poorly diversified relevés characterized by Angelica sylvestris alone (A in table 3; negative coordinates on axes 3 and 4);
- a grouping of 3 relevés characterized by Nasturtium officinale (G in table 3; coordinates > 0 on axis 4);
- a grouping characterized by the combination of the tree Alnus glutinosa and herbaceous species, mainly Angelica sylvestris, Scrophularia umbrosa, Mentha aquatica, Phalaris arundinacea, Festuca gigantea and Myosotis scorpioides (B in table 3);

- another poorly defined grouping (C in table 3, with coordinates on axis 1 positive or very close to the origin);
- a grouping of well-lit sites influenced by a combination of herbaceous species favored by pollution or quite severe eutrophication, such as Agrostis stolonifera, Phalaris arundinacea, Epilobium hirsutum, Rumex conglomeratus, Nasturtium officinale, Epilobium roseum and Veronica beccabunga (D in table 3, with positive coordinates on both axes 1 and 2).

[Table 3: body not recoverable.]

Table 3. Comparison of both classifications. C1 to C6: hierarchical divisive classification of virtual centers. W1 to W7: agglomerative hierarchical classification of the second-order moment.

Each of the techniques has its own sensitivity. One reveals, for example, a grassland grouping with Glyceria notata and Veronica beccabunga (integrated, in the other technique, into a larger group); the other reveals a brooklet grouping, polluted and eutrophicated by discharges of domestic and agricultural wastewater, which is in turn integrated into a larger group by the first technique. It is therefore impossible to retain only the results of a single classification.

We therefore decided to combine the contributions of the two techniques to construct a preclassification, which is then submitted to the technique of mobile centers. We have created seven classes by taking classes E, A, B and C (table 3) as they are. Relevés 53, 64 and 7 have been included in class D because they contain a high biomass of Nasturtium officinale, as do many relevés of this class. The C4 relevés are taken out of class F and form class G (grouping with Glyceria notata and Veronica beccabunga). The result with the highest inter-class moment/total moment ratio (table 4) is retained.
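A comparison of two partitions of the same relevés, as summarized in table 3, amounts to a cross-tabulation: each cell counts the relevés that fall in a given class of one classification and a given class of the other. The sketch below is a minimal plain-Python illustration; the function names and the class labels assigned to the eight toy relevés are hypothetical and do not reproduce the contents of table 3.

```python
from collections import Counter, defaultdict


def cross_tabulate(partition_a, partition_b):
    """Count relevés for every (class in A, class in B) combination."""
    return Counter(zip(partition_a, partition_b))


def shared_groups(partition_a, partition_b):
    """Relevés (by index) that stay together in both classifications."""
    groups = defaultdict(list)
    for index, pair in enumerate(zip(partition_a, partition_b)):
        groups[pair].append(index)
    return dict(groups)


# Hypothetical class labels for eight relevés under the two techniques.
virtual_centers = ["C1", "C1", "C2", "C2", "C2", "C4", "C4", "C3"]
ward_classes = ["E", "E", "D", "D", "B", "F", "F", "G"]

table = cross_tabulate(virtual_centers, ward_classes)
for (c, w), count in sorted(table.items()):
    print(c, w, count)

nuclei = shared_groups(virtual_centers, ward_classes)
print(nuclei[("C2", "D")])
```

Cells with large counts point to groups that both techniques agree on; classes of one technique that spread over several classes of the other are the places where a combined preclassification, as built in the text, requires a decision.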

[Table 4: body not recoverable; column headings class 1 to class 7.]

Table 4. Classification of the 147 relevés with the technique of mobile centers.

The following groupings are thus recognized:

- a woody grouping distributed throughout the course, characterized mainly by the combination of Alnus glutinosa, A. incana, Fraxinus excelsior, Festuca gigantea and Angelica sylvestris;
- an open group, also distributed throughout the course, characterized mainly by Angelica sylvestris;
- a grouping distributed over the upper and middle reaches, characterized by Stachys sylvatica and Glyceria notata;
- a grouping of open sites, found only in the upper course, in the middle of the pasture, and characterized mainly by Glyceria notata and Veronica beccabunga;
- a woody grouping characterized mainly by the combination of Alnus glutinosa, Filipendula ulmaria, Angelica sylvestris, Phalaris arundinacea, Scrophularia umbrosa, Mentha aquatica, Myosotis scorpioides, Festuca gigantea and Stachys sylvatica;
- a grouping of open sites spread over a large part of the course from the village of Assesse, characterized by the combination of herbaceous species favored by the generalized eutrophication of the river: Epilobium hirsutum, Agrostis stolonifera, Rumex conglomeratus, Scrophularia umbrosa, Mentha aquatica, Phalaris arundinacea, Epilobium roseum and Lycopus europaeus;
- a grouping of open sites occupying mainly a quarter of the course after Assesse, in an agricultural environment, characterized by species resistant to pollution such as Nasturtium officinale, Veronica beccabunga, Lycopus europaeus and Cirsium palustre.

As an example, we have also applied WARD's technique to the species (figure 15). The table comprises four columns (for the four significant axes) and 35 rows (species).

[Dendrogram leaf labels, in order: AG_IS st, CALYS se, FESTU ar, JUNCU in, SOLAN du, STACH pa, GLYCE fl, SPARG er, PO_NUM h, RANUN re, ANGEL sy, CIRSI pa, EPILO ro, NASTU of, PHALA ar, EPILO hi, RUMEX co, MENTH aq, SCROP um, GLYCE no, LYCO_PUS, VERON be, ALNUS gl, FRAXI ex, ALNUS in, CHR_O op, FESTU gi, CARD_AMI, PO_NUM b, FONTI an, MY_TIS s, PETAS hy, SALIX xr, FILIP ul, STACH sy.]

Figure 15. Agglomerative hierarchical classification of species with WARD's method of the Crupet table (35 species) transformed by non-symmetrical correspondence analysis, with four significant variables.

Six groups of plants (ecological groups) are easily distinguished:

- a group of woody plants and of sciaphilous species, or species more common in wooded areas, with Alnus glutinosa, Fraxinus excelsior, Alnus incana and Salix rubens as large trees, and Chrysosplenium oppositifolium, Festuca gigantea, Cardamine amara, Persicaria bistorta, Fontinalis antipyretica, Myosotis scorpioides and Petasites hybridus as herbaceous species;
- a small group with Filipendula ulmaria and Stachys sylvatica, which tolerate shade as well as some light;
- a group of three heliophilous herbaceous species of the upper course: Glyceria notata, Lycopus europaeus and Veronica beccabunga;
- a group of herbaceous species developing mainly in areas receiving wastewater: Cirsium palustre, Epilobium roseum, Nasturtium officinale, Phalaris arundinacea, Epilobium hirsutum, Rumex conglomeratus, Mentha aquatica and Scrophularia umbrosa;
- an isolated species: Angelica sylvestris;
- a group of herbaceous species not falling into the previous categories, less "typed": Agrostis stolonifera, Calystegia sepium, Festuca arundinacea, Juncus inflexus, Solanum dulcamara, Stachys palustris, Glyceria fluitans, Sparganium erectum, Persicaria hydropiper and Ranunculus repens.

Conclusions

At this step, we see how much the results differ according to whether one classification technique is used or not.

General conclusions

These few pages show how difficult it is to construct a classification of a relevé table. The same table can lead to a very large number of solutions.
The results depend on:

- the data used: original data or data transformed by one or another multivariate analysis; complete original table or table simplified after a statistical pattern analysis;
- the algorithms used: hierarchical classification or not;
- the variants used, especially in metrics;
- a possible stopping rule.
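For the last item, the general recommendations of this chapter propose an order of magnitude rather than a formal test: twice the number of significant axes of the preliminary factor analysis, which amounts to separating positive and negative coordinates on each axis. The sketch below only restates that arithmetic in Python; the function names and the toy coordinates are hypothetical.

```python
def suggested_class_count(n_significant_axes):
    """Order of magnitude for the number of classes: two per significant axis."""
    if n_significant_axes < 1:
        raise ValueError("at least one significant axis is required")
    return 2 * n_significant_axes


def split_by_sign(coordinates, axis):
    """Indices of rows with positive, then non-positive, coordinates on one axis.

    Rows lying exactly at the origin are placed on the non-positive side,
    an arbitrary choice made for this sketch.
    """
    positive = [i for i, row in enumerate(coordinates) if row[axis] > 0]
    non_positive = [i for i, row in enumerate(coordinates) if row[axis] <= 0]
    return positive, non_positive


# Hypothetical coordinates of five relevés on two significant axes.
toy_coordinates = [(0.4, -0.1), (-0.3, 0.2), (0.1, 0.5), (-0.6, -0.2), (0.0, 0.3)]

print(suggested_class_count(2))
print(split_by_sign(toy_coordinates, axis=0))
```

With the two significant axes of the small table the rule suggests four classes, which matches the four virtual centers retained there; with the four significant axes of the Crupet table it suggests eight, close to the six and seven classes actually examined. In both cases the figure is a starting point to be refined by looking at the results.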

In general, we recommend using only tables previously transformed by multivariate analysis; this reduces the impact of the choice of metric. The Euclidean distance is then a perfectly usable metric. It is easy to combine floristic and mesological variables in the same classification, as long as one starts from a table transformed by multiple factor analysis. It is thus possible to provide a statistical basis for socioecological groups.

We also recommend comparing the results produced by several techniques in order to highlight the strong nuclei, that is to say the most robust groups. The final use of the mobile center technique makes it possible to consolidate previous choices or to prompt further analysis.

There is no ready-made solution to the problem of stopping rules. A simple solution is proposed here: take the number of significant axes in the preliminary factor analysis and multiply it by two. This amounts to separating, for each axis, the relevés or species with positive and negative coordinates. This rule provides an order of magnitude for the number of classes; it is up to the user to refine this number by observing the results.

The association of factorial analyses and classification techniques opens immense possibilities for the description of plant groupings. Multiple factor analysis techniques, which give a more balanced representation of multilayered vegetation, for example, or which combine floristic and mesological variables, lead to classifications that take into account the components of phytocenoses in their environment. Many possibilities remain to be explored.

The many available techniques are worth nothing if they are used automatically, without serious underlying reflection on the nature of the data and the way in which they were collected. These techniques will never compensate for gaps in vegetation sampling.

References

BAAMAL, L. (1994). Etudes des règles d'arrêt en classification numérique. Dissertation originale présentée en vue de l'obtention du grade de docteur en sciences agronomiques. Faculté des Sciences agronomiques de Gembloux. 60 pp.

BERTHET, P., FEYTMANS, E., STEVENS, D. & GENETTE, A. (1976). A new divisive method of classification illustrated by its applications to ecological problems. Proc. Ninth Int. Biom. Conf., invited papers. Vol. II:
