Clustering analysis of vegetation data


Valentin Gjorgjioski 1, Sašo Džeroski 1 and Matt White 2

1 Jožef Stefan Institute, Jamova cesta 39, SI-1000 Ljubljana, Slovenia
2 Arthur Rylah Institute for Environmental Research, Dept. of Sustainability & Environment, Heidelberg 3084, VIC, Australia

1 Introduction

Vegetation may be described as the plant life of a region. The study of patterns and processes in vegetation at various scales of space and time is useful in understanding landscapes, ecological processes and environmental history, and in predicting ecosystem attributes such as productivity. Generalized vegetation descriptions, maps and other graphical representations of vegetation types have become fundamental to land use planning and management. They are widely used as biodiversity surrogates in conservation assessments and can provide a useful summary of many non-vegetation landscape elements, such as animal habitats, agricultural suitability, and the location and abundance of timber and other forest resources.

We use clustering or classification of vegetation data to obtain such descriptions, maps and other representations. Clustering vegetation data is a well-known machine learning problem which aims to partition the data set into subsets, so that the data in each subset share some common trait. Summaries of vegetation classification and methods can be found in the numerous texts that focus on this discipline [6, 3].

In our work we deal with vegetation data organized in a relational model. To be able to apply classical machine learning approaches, some data preprocessing is needed. We preprocess the data using simple aggregation techniques and use several approaches to analyze it: predictive clustering trees [1], k-means, and hierarchical agglomerative clustering. These algorithms were applied and satisfactory results were obtained.

The rest of the paper is organized as follows. First we discuss the dataset and the problem in detail. We then give the preprocessing details needed to make the data suitable for classical data mining approaches, and in the following section we describe our data mining setup and experiments. Next, we present the results of the experiments, and we conclude with a discussion and proposals for further work.

2 Dataset and problem description

The problem is to produce a classification and clustering of vegetation properties, which is an easier problem to solve than the classification of the vegetation in general. We aim to solve the easier problem first and later advance to the more general problem.

Mapping such a classification over the whole landscape is also desired, so we will build a predictive clustering model, which can later be mapped onto the landscape.

The data has been collected from across the State of Victoria, Australia, an area of approximately 22,000,000 hectares. The State is relatively varied climatically and geologically and supports some 4,000 indigenous vascular plant species. The landscape is divided into quadrats of 30x30 meters resolution, referred to as sites later in this paper. For this study we have about 30,000 sites and about 4,600 species. Each of these sites has ordinal categories which represent the abundance of each species. Furthermore, for each site we have environmental (climatic, radiometric, topographic) and spectral variables extracted for the same locations from a stack of data themes stored in a GIS. In addition, information is available about the species: their physiognomy (leaf type, plant size and general architecture), phenology (flowering time) and phylogeny (i.e., genus, family). Figure 1 depicts the relationship between site properties and species properties.

Fig. 1. Species and site properties

We thus have relational data with one-to-many relationships. To handle it, we aggregate the data with simple aggregation techniques. We give more details about this in the next section.

3 Preprocessing

First we convert the ordinal categories of abundance to numeric values with the help of the expert. We use the following mapping:

1 (0-5%) as 2.5
2 (5-25%) as 15
3 (25-50%) as 37.5
4 (50-75%) as 62.5
5 (75-100%) as 87.5
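To make the relational organisation and this conversion concrete, the following is a minimal sketch in Python with pandas. It is not the authors' actual schema; the table layout, column names and example values are assumptions for illustration only.

import pandas as pd

# Ordinal cover-abundance class -> numeric cover value (class-interval midpoints).
COVER_CLASS_TO_PERCENT = {1: 2.5, 2: 15.0, 3: 37.5, 4: 62.5, 5: 87.5}

# Site table: one row per 30 m x 30 m site, with environmental/spectral variables
# (column names here are illustrative placeholders).
sites = pd.DataFrame({
    "site_id": [1, 2],
    "annual_rainfall": [907.0, 893.0],
    "elevation": [709.0, 706.0],
})

# Species table: one row per species, with physiognomy, phenology and phylogeny.
species = pd.DataFrame({
    "species_id": [10, 11, 12],
    "lifelook": ["t", "h", "s"],   # general architecture / life form
    "sprflow": [1, 0, 1],          # flowers in spring
    "genus": ["Eucalyptus", "Poa", "Acacia"],
})

# Observation table: one-to-many link between sites and species,
# holding the ordinal cover-abundance class recorded at each site.
observations = pd.DataFrame({
    "site_id": [1, 1, 2],
    "species_id": [10, 11, 12],
    "cover_class": [3, 1, 4],
})

# Convert the ordinal classes to numeric cover values.
observations["cover"] = observations["cover_class"].map(COVER_CLASS_TO_PERCENT)
print(observations)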

Next, we remove measurements of exotic species and species with very low cover (0.5), as suggested by the experts. After cleaning the data, we aggregate the cover abundance of the species at a given site by species properties. For every property of a species we aggregate over each of its values; this is done for every site. In effect, we generate a new feature for every value of a given nominal property. An example of feature generation for the autflow property and value 1 is given in Figure 2; as an algorithm, it is presented in Algorithm 1.

autflow1(S_i) = \frac{\sum_{S_p \in S_i,\ autflow(S_p)=1} cover(S_i, S_p)}{cover(S_i)}, \qquad \text{given that} \quad cover(S_i) = \sum_{S_p \in S_i} cover(S_i, S_p)

Fig. 2. Example of feature generation for the autflow property, given the site S_i and species S_p

Algorithm 1 Function that returns the value of the new feature for each site

function generateFeature(attribute, aValue)
    for each site S do
        sum := 0; sum1 := 0
        for each species in S do
            sum := sum + speciesAbundance(species)
            if getAValue(species, attribute) == aValue then
                sum1 := sum1 + speciesAbundance(species)
        setFeatureValue(S, feature, sum1 / sum)
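As a concrete illustration of this aggregation step, the sketch below is one way to render Algorithm 1 in Python, assuming the illustrative pandas tables from the previous sketch; it is not the authors' implementation, and the column names are assumptions.

import pandas as pd

def generate_features(observations: pd.DataFrame,
                      species: pd.DataFrame,
                      attribute: str) -> pd.DataFrame:
    """For every value of the nominal species `attribute`, compute per site the
    share of total cover contributed by species having that value (Algorithm 1)."""
    obs = observations.merge(species[["species_id", attribute]], on="species_id")
    total = obs.groupby("site_id")["cover"].sum()                    # cover(S_i)
    per_value = obs.groupby(["site_id", attribute])["cover"].sum()   # cover of species with the given value
    fractions = per_value.div(total, level="site_id")
    features = fractions.unstack(fill_value=0.0)
    features.columns = [f"{attribute}{v}" for v in features.columns]  # e.g. autflow1
    return features

# Example: one aggregated feature per value of the 'sprflow' property.
# site_features = generate_features(observations, species, "sprflow")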

4 Methodology

We use three approaches: predictive clustering trees for multi-target prediction (PCTs), k-means clustering, and hierarchical agglomerative clustering (HAC).

4.1 Predictive Clustering Trees

Predictive modeling aims at constructing models that can predict a target property of an object from a description of the object. Predictive models are learned from sets of examples, where each example has the form (D, T), with D being an object description and T a target property value. While a variety of representations ranging from propositional to first-order logic have been used for D, T is almost always considered to consist of a single target attribute called the class, which is either discrete (classification problem) or continuous (regression problem).

Clustering [2], on the other hand, is concerned with grouping objects into subsets of objects (called clusters) that are similar w.r.t. their description D. There is no target property defined in clustering tasks. In conventional clustering, the notion of a distance (or, conversely, similarity) is crucial: examples are considered to be points in a metric space and clusters are constructed such that examples in the same cluster are close according to a particular distance metric. A centroid (or prototypical example) may be used as a representative for a cluster. The centroid is the point with the lowest average (squared) distance to all the examples in the cluster, i.e., the mean or medoid of the examples.

Predictive clustering [1] combines elements from both prediction and clustering. It is implemented in the Clus system, which can be obtained at http://www.cs.kuleuven.be/~dtai/clus.

4.2 K-means Clustering

We describe k-means clustering very briefly in this section. The algorithm starts by partitioning the input points into k initial sets, either at random or using some heuristic. It then calculates the mean point, or centroid, of each set. It constructs a new partition by associating each point with the closest centroid. The centroids are then recalculated for the new clusters, and the algorithm proceeds by alternating these two steps until convergence, which is reached when the points no longer switch clusters (or, alternatively, when the centroids no longer change).
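The two alternating steps just described can be written compactly. The following is a minimal NumPy sketch of plain k-means under these assumptions (random initial partition, Euclidean distance); it is not the exact implementation used in the experiments.

import numpy as np

def kmeans(points: np.ndarray, k: int, max_iter: int = 100, seed: int = 0) -> np.ndarray:
    """Plain k-means: alternate assignment and centroid update until no point moves."""
    rng = np.random.default_rng(seed)
    # Start from a random partition of the input points.
    labels = rng.integers(0, k, size=len(points))
    for _ in range(max_iter):
        # Recompute the centroid of each cluster (re-seed empty clusters at random).
        centroids = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                              else points[rng.integers(len(points))]
                              for j in range(k)])
        # Assign each point to its closest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):   # convergence: no point switched clusters
            break
        labels = new_labels
    return labels

# labels = kmeans(site_feature_matrix, k=6)   # e.g. k = 6, as in the experiments below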

4.3 Hierarchical Agglomerative Clustering

In this section, we briefly discuss hierarchical agglomerative clustering (HAC) (see, e.g., [4]). HAC is one of the most widely used clustering approaches. It produces a nested hierarchy of groups of similar objects, based on a matrix containing the pairwise distances between all objects. HAC repeats the following three steps until all objects are in the same cluster:

1. Search the distance matrix for the two closest objects or clusters.
2. Join the two objects (clusters) to produce a new cluster.
3. Update the distance matrix to include the distances between the new cluster and all other clusters (objects).

There are four well-known HAC algorithms: single-link, complete-link, group-average, and centroid clustering, which differ in the cluster similarity measure they employ. We decided to use single-link HAC because it is usually considered to be the simplest approach and has the smallest time complexity. Furthermore, this approach can do much better clustering than PCTs, and a comparison to some better approach was out of the scope of this work. Single-link HAC computes the distance between two clusters as the distance between the closest pair of objects. The HAC implementation that we use has a computational cost of O(N^2), with N the number of objects, and for efficiency it uses a next-best-merge array [4]. An important drawback of single-link HAC is that it suffers from the chaining effect [4], which in some cases may result in undesirable elongated clusters. Because the merge criterion is strictly local (it only takes the two closest objects into account), a chain of points can be extended over a long distance without regard to the overall shape of the emerging cluster.

5 Experimental setup

In the experimental setup, the attributes obtained by aggregation are the target attributes, while the properties of the sites are used as descriptive attributes in the data mining task. To allow better comparison and analysis of the results, we use several experimental setups. First, we take only a selected set of attributes; using this set, we experiment with PCTs under a size constraint [5] of 6 clusters. Next, experiments were performed using all of the available attributes, constraining PCTs to 6 and 12 clusters and HAC to 6 clusters, and setting k = 6 for k-means clustering.

6 Results

Here we present only the results from PCTs; in the Appendix we give the k-means and PCT results in detail. We would like to emphasize that HAC produced very unbalanced clusters (four clusters of size 1, one cluster of size 2, and one cluster of size 29,673), which are useless in this case. In Figure 3 we give a map colored in six colors according to the tree generated by Clus, which is given in Figure 4. The expert provided excellent feedback about this clustering and its visualization. Maps for the other results have not yet been produced; this is planned as further work.

The first tree, with 6 clusters, was obtained using only the selected aggregated attributes, while the next experiments were performed using all of the available aggregated attributes. The tree given in Figure 5 is the result of the PCT algorithm with the constraint set to 12 leaves, using the aggregation of all available attributes. A description of the clusters in terms of size and the lifelook and sprflow attributes is given in Figure 6. We can conclude that the elements are well distributed across the clusters: we have neither too small nor too large clusters. In terms of the lifelook attribute, the clusters are well separated. We can notice that the two major lifelook types differ between clusters and both represent more than 30% of all elements in a cluster. Considering that there are 27 lifelook types, this percentage is far from small.
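For reference, a single-link HAC run cut into 6 clusters, as in the experimental setup above, can be sketched with standard tooling. The use of SciPy and the random placeholder data below are assumptions on my part, not the implementation or data the authors used; printing the cluster sizes shows how the imbalance reported above would surface.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def single_link_hac(points: np.ndarray, n_clusters: int = 6) -> np.ndarray:
    """Single-link hierarchical agglomerative clustering, cut into n_clusters groups."""
    merges = linkage(points, method="single", metric="euclidean")
    return fcluster(merges, t=n_clusters, criterion="maxclust")

# Random data standing in for the aggregated site features (illustrative only).
points = np.random.default_rng(0).normal(size=(200, 10))
labels = single_link_hac(points, n_clusters=6)
sizes = np.bincount(labels)[1:]          # fcluster labels start at 1
print("cluster sizes:", sizes)           # strongly unbalanced sizes indicate chaining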

Fig. 3. Map

On the other hand, in terms of the sprflow attribute, the clusters are impure, with small exceptions in some clusters (A, B, F). The k-means clustering results are presented in detail in the Appendix. For k-means clustering we calculate four standard statistics (average, standard deviation, minimum and maximum) over the descriptive attributes in each cluster, to give some description of the clusters and to later visualize the clusters on a map.

7 Discussion and further work

This study shows how different dealing with vegetation data, and more generally with environmental data, is from classical data mining problems. In this work we focused mainly on aggregation and data preprocessing, and then applied classical algorithms. The obtained results are very promising. We continue to work on this problem by adapting classical algorithms to use hierarchical information about the species, and also by mining directly over the species without aggregation. In that case we consider the species to be complex/structured data, and we propose developing new (or adapting classical) algorithms to handle these types of problems.

Fig. 4. Clus Tree with 6 leaves

Fig. 5. Clus Tree with 12 leaves

References

1. H. Blockeel, L. De Raedt, and J. Ramon. Top-down induction of clustering trees. In 15th Int'l Conf. on Machine Learning, pages 55-63, 1998.
2. L. Kaufman and P.J. Rousseeuw, editors. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
3. P. Legendre and L. Legendre. Numerical Ecology. Elsevier, Amsterdam, 1998.
4. C.D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, 2007.
5. J. Struyf and S. Džeroski. Constraint based induction of multi-objective regression trees. In 4th Int'l Workshop on Knowledge Discovery in Inductive Databases: Revised Selected and Invited Papers, volume 3933 of LNCS, pages 222-233. Springer, 2006.
6. D. Sun, R.J. Hnatiuk, and V.J. Neldner. Review of vegetation classification and mapping systems undertaken by major forested land management agencies in Australia. Australian Journal of Botany, 45(6):929-948, 1997.

Cluster A: Size 1425; lifelook=ss 19%, lifelook=mt 18%; sprflow=1 76%
Cluster B: Size 4740; lifelook=s 15%, lifelook=mtg 15%; sprflow=1 79%
Cluster C: Size 5668; lifelook=h 36%, lifelook=s 17%; sprflow=1 65%
Cluster D: Size 2250; lifelook=s 20%, lifelook=mtg 19%; sprflow=1 53%
Cluster E: Size 9350; lifelook=t 21%, lifelook=s 15%; sprflow=1 65%
Cluster F: Size 6226; lifelook=t 20%, lifelook=s 17%; sprflow=1 75%

Fig. 6. Description of the clusters in terms of size and the lifelook and sprflow attributes

A Appendix: Extended results

For each cluster we provide the size of the cluster, four statistics on the descriptive attributes, and part of the prototype of the cluster. For each prototype we show just the most important values.
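As an illustration of how such per-cluster summaries could be derived, the sketch below computes the four statistics and a simple nominal prototype. The exact procedure and column names used by the authors are not specified in the paper, so the `numeric_cols`/`nominal_cols` arguments and the mode-share prototype are assumptions for illustration.

import pandas as pd

def describe_clusters(data: pd.DataFrame, labels, numeric_cols, nominal_cols):
    """Per cluster: min/max/mean/std of numeric attributes and the share of the
    most frequent value of each nominal attribute (a simple cluster 'prototype')."""
    df = data.assign(cluster=labels)
    # Four standard statistics over the numeric (descriptive) attributes.
    stats = df.groupby("cluster")[numeric_cols].agg(["min", "max", "mean", "std"])
    # Prototype: for each nominal attribute, the most common value and its share.
    prototypes = {}
    for cluster_id, group in df.groupby("cluster"):
        proto = {}
        for col in nominal_cols:
            counts = group[col].value_counts(normalize=True)
            proto[col] = (counts.index[0], round(100 * counts.iloc[0]))  # e.g. ('ss', 19)
        prototypes[cluster_id] = proto
    return stats, prototypes

# stats, prototypes = describe_clusters(site_table, labels,
#                                       numeric_cols=["annual_rainfall", "elevation"],
#                                       nominal_cols=["lifelook", "sprflow"])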

Cluster A, Size: 1398
Min:     16.0  816.0  634.0  -9999.0  -9999.0  -168.0  321.0  3109.0  3.19  1.0
Max:     117.0  1034.0  801.0  6.0  5.7  0.0  459.0  3225.0  12.28  7.0
Avg:     60.3  906.8  708.6  -102.7  -96.9  -132.3  389.6  3164.1  12.0  1.3
StdDev:  18.6  43.4  17.6  1030.7  995.9  15.9  31.5  32.2  0.9  1.2
Prototype: lifelook=ss 19%, leaftype= 34%, sprflow=1 76%, sumflow=1 58%, autflow= 64%, winflow= 52%, hitecat=1 36%, aquatic= 99%, fleshyf= 91%, fleshyl= 80%

Cluster B, Size: 7205
Min:     9.0  816.0  512.0  -9999.0  -9999.0  -999.0  -412.0  1759.0  3.19  1.0
Max:     1652.0  1065.0  825.0  12.66  12.75  2117.0  839.0  3108.0  16.6  7.0
Avg:     118.48  892.67  706.45  -1071.37  -1069.44  -338.39  426.22  2727.4  12.2  1.16
StdDev:  106.51  50.79  23.06  3099.43  3095.61  295.01  113.07  235.63  0.89  0.81
Prototype: lifelook=mtg 19%, leaftype= 57%, sprflow=1 81%, sumflow=1 73%, autflow= 68%, winflow= 75%, hitecat=2 31%, aquatic= 94%, fleshyf= 95%, fleshyl= 97%

Cluster C, Size: 1461
Min:     -32.0  816.0  411.0  -9999.0  -9999.0  -632.0  341.0  2203.0  3.19  1.0
Max:     8.0  1121.0  849.0  5.88  4.57  393.0  832.0  2635.0  31.12  6.0
Avg:     3.26  923.43  704.34  -2172.46  -2146.75  -357.64  534.17  2497.91  9.59  2.91
StdDev:  3.45  50.72  26.38  4128.19  4108.44  205.62  60.34  73.63  3.93  2.26
Prototype: lifelook=h 19%, leaftype= 48%, sprflow=1 75%, sumflow=1 66%, autflow= 58%, winflow= 78%, hitecat=1 35%, aquatic= 91%, fleshyf= 90%, fleshyl= 78%

Cluster D, Size: 2297
Min:     1261.0  513.0  0.0  2.43  2.81  516.0  -571.0  1648.0  15.49  1.0
Max:     1945.0  814.0  1411.0  4.92  5.94  2594.0  4.0  2367.0  17.44  2.0
Avg:     1547.77  638.37  632.17  3.81  4.03  1675.2  -324.96  1848.13  16.33  1.0
StdDev:  156.09  68.2  241.09  0.43  0.52  426.98  107.0  114.38  0.44  0.04
Prototype: lifelook=s 20%, leaftype= 54%, sprflow=1 53%, sumflow=1 83%, autflow= 80%, winflow= 91%, hitecat=2 33%, aquatic= 99%, fleshyf= 92%, fleshyl= 100%

Cluster E, Size: 10999
Min:     -5.0  500.0  0.0  -9999.0  -9999.0  20.0  -343.0  1932.0  3.19  1.0
Max:     1260.0  815.0  1410.0  6.4  7.1  1548.0  703.0  3022.0  15.65  6.0
Avg:     545.57  661.52  667.96  -449.54  -449.99  468.73  194.1  2450.21  13.51  1.01
StdDev:  338.27  63.96  206.22  2081.78  2081.68  373.71  225.06  156.69  1.01  0.22
Prototype: lifelook=t 21%, leaftype=scle 39%, sprflow=1 65%, sumflow=1 66%, autflow= 74%, winflow= 82%, hitecat=2 21%, aquatic= 99%, fleshyf= 94%, fleshyl= 100%

Cluster F, Size: 6318
Min:     -14.0  -999.0  0.0  -9999.0  -9999.0  -999.0  -303.0  2226.0  3.19  1.0
Max:     991.0  815.0  1279.0  10.13  11.45  19.0  820.0  3207.0  14.79  6.0
Avg:     196.96  728.0  693.16  -845.94  -835.29  -207.96  396.6  2595.02  12.4  1.14
StdDev:  173.5  66.18  136.0  2789.66  2772.97  180.05  164.57  173.6  1.22  0.76
Prototype: lifelook=t 20%, leaftype= 40%, sprflow=1 75%, sumflow=1 71%, autflow= 70%, winflow= 74%, hitecat=2 24%, aquatic= 98%, fleshyf= 95%, fleshyl= 99%

Fig. 7. PCTs results obtained using all attributes

Cluster A, Size: 9526
Min:     -14.00  504.00  0.00  -9999.00  -9999.00  -999.00  -511.00  1690.00  3.19  1.00
Max:     1832.00  1052.00  1411.00  12.66  11.45  2451.00  807.00  3225.00  17.14  7.00
Avg:     326.15  745.13  684.78  -490.16  -481.44  81.36  318.00  2575.04  12.71  1.14
StdDev:  363.46  113.72  157.22  2168.67  2148.75  509.74  223.63  274.69  1.70  0.81
Prototype: lifelook=s 28%, leaftype=scle 54%, sprflow=1 78%, sumflow=1 71%, autflow= 70%, winflow= 64%, hitecat=3 20%, aquatic= 100%, fleshyf= 92%, fleshyl= 98%

Cluster B, Size: 1065
Min:     -8.00  518.00  50.00  -9999.00  -9999.00  -991.00  -520.00  1718.00  3.19  1.00
Max:     1770.00  1043.00  1259.00  7.98  7.32  2329.00  820.00  3222.00  17.00  7.00
Avg:     205.83  809.66  705.02  -1178.99  -1170.29  -103.91  418.85  2554.12  12.33  1.25
StdDev:  317.27  104.38  106.69  3230.88  3219.50  480.67  209.42  248.76  1.68  0.99
Prototype: lifelook=ltg 31%, leaftype= 64%, sprflow=1 92%, sumflow=1 87%, autflow= 86%, winflow= 79%, hitecat=3 54%, aquatic= 86%, fleshyf= 98%, fleshyl= 99%

Cluster C, Size: 3853
Min:     -21.00  520.00  0.00  -9999.00  -9999.00  -999.00  -547.00  1648.00  3.19  1.00
Max:     1917.00  1050.00  1399.00  9.36  10.89  2565.00  816.00  3224.00  17.39  7.00
Avg:     283.31  824.00  699.99  -553.75  -549.31  -91.13  325.03  2697.50  12.50  1.32
StdDev:  402.76  128.56  120.16  2296.16  2285.90  563.98  221.85  360.00  1.95  1.15
Prototype: lifelook=h 28%, leaftype= 62%, sprflow=1 83%, sumflow=1 65%, autflow= 69%, winflow= 66%, hitecat=1 52%, aquatic= 95%, fleshyf= 95%, fleshyl= 89%

Cluster D, Size: 7116
Min:     -32.00  -999.00  0.00  -9999.00  -9999.00  -999.00  -547.00  1664.00  3.19  1.00
Max:     1933.00  1121.00  1365.00  9.49  12.75  2594.00  839.00  3223.00  31.12  7.00
Avg:     416.11  781.09  690.27  -1311.49  -1309.03  120.37  282.20  2498.21  12.93  1.22
StdDev:  522.05  123.49  135.60  3380.95  3377.75  744.98  297.18  325.69  2.10  0.95
Prototype: lifelook=mtg 27%, leaftype= 69%, sprflow=1 80%, sumflow=1 87%, autflow= 62%, winflow= 89%, hitecat=2 44%, aquatic= 95%, fleshyf= 97%, fleshyl= 98%

Cluster E, Size: 2530
Min:     -14.00  515.00  0.00  -9999.00  -9999.00  -998.00  -502.00  1660.00  3.19  1.00
Max:     1901.00  1107.00  1325.00  6.99  7.21  2428.00  809.00  3224.00  17.33  7.00
Avg:     474.33  705.91  675.17  -450.20  -451.05  437.84  232.34  2489.47  13.21  1.12
StdDev:  389.11  107.48  185.48  2083.72  2083.53  574.12  256.40  264.30  1.51  0.74
Prototype: lifelook=t 28%, leaftype= 40%, sprflow= 57%, sumflow= 68%, autflow= 92%, winflow= 85%, hitecat=4 30%, aquatic= 99%, fleshyf= 89%, fleshyl= 99%

Cluster F, Size: 5589
Min:     -10.00  511.00  0.00  -9999.00  -9999.00  -998.00  -571.00  1648.00  3.19  1.00
Max:     1945.00  1044.00  1407.00  8.78  6.74  2577.00  771.00  3223.00  17.44  7.00
Avg:     566.27  699.89  661.53  -498.66  -499.11  405.52  172.94  2432.93  13.47  1.11
StdDev:  507.74  106.96  194.85  2185.88  2185.78  692.12  301.15  299.51  1.79  0.68
Prototype: lifelook=t 17%, leaftype= 34%, sprflow= 52%, sumflow=1 63%, autflow= 74%, winflow= 89%, hitecat=2 28%, aquatic= 99%, fleshyf= 94%, fleshyl= 98%

Fig. 8. K-means results obtained using all attributes and k=6