Data analysis and Geostatistics - lecture XI

Ferdinand Maxwell
6 years ago

1 Data analysis and Geostatistics - lecture XI
Clustering and spatial analysis of data

Cluster analysis
Group samples into clusters based on similarity.
Cluster analysis requires substantial user input (selection of the number of clusters, clustering routine, similarity criteria, etc.) and results can therefore be ambiguous: always give detailed information on how your cluster analysis was performed.

2 Cluster analysis
Group samples into clusters based on similarity.
[Figure: a sample i and its deviations (i-a), (i-b) and (i-c) from cluster means A, B and C]
Whichever deviation between sample and cluster mean is smallest: the sample is assigned to that cluster.

Cluster analysis is again controlled by the sum of squares: SSwithin for each cluster, and SSbetween for each pair of clusters (A-B, B-C, A-C).
- small variance within: tight clusters
- large variance between: good separation
Increasing the number of clusters will decrease the within variance, until all samples are their own cluster. That result is, however, meaningless.

Cluster analysis - sample assignment criteria
A wide range of techniques can be used to determine similarity (see book for details):
- Euclidean distance: r or r^2
- city block or Manhattan distance: this is useful when the two variables are separate characteristics (fossil length and width: the diagonal is not of interest)
- correlation similarity: samples with the same correlation are grouped together; deals with dilution effects
- association values: especially useful when you have only presence/absence data (specialized)
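The distance measures and the sum-of-squares criterion above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample coordinates are made up, and SSbetween is computed here against the grand mean (a common definition), whereas the slide shows it per pair of clusters.

```python
import math

def euclidean(a, b):
    """Straight-line distance r between two samples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City block distance: sum of the absolute differences per variable."""
    return sum(abs(x - y) for x, y in zip(a, b))

def mean_point(points):
    """Mean of a list of samples, variable by variable."""
    return tuple(sum(col) / len(points) for col in zip(*points))

def ss_within(clusters):
    """Sum of squared distances between each sample and its own cluster mean."""
    return sum(euclidean(p, mean_point(pts)) ** 2 for pts in clusters for p in pts)

def ss_between(clusters):
    """Sum of squared distances between cluster means and the grand mean,
    weighted by cluster size (an assumed definition, see the note above)."""
    grand = mean_point([p for pts in clusters for p in pts])
    return sum(len(pts) * euclidean(mean_point(pts), grand) ** 2 for pts in clusters)

# Hypothetical two-variable data: two tight, well-separated clusters.
clusters = [[(1.0, 1.0), (1.0, 3.0)], [(5.0, 1.0), (5.0, 3.0)]]
print(euclidean((0, 0), (3, 4)), manhattan((0, 0), (3, 4)))  # 5.0 7
print(ss_within(clusters), ss_between(clusters))             # 4.0 16.0
```

A small SSwithin relative to SSbetween, as in this toy data set, is what tight and well-separated clusters look like.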

3 Cluster analysis - two types
Two varieties of clustering: hierarchical and partitioning methods.
Hierarchical techniques represent similarity in a tree or dendrogram. The method:
1. all samples are a separate cluster
2. link the two most similar samples
3. link two other samples to form a new cluster, or add a third sample to the first cluster, depending on similarities
4. continue until only one cluster remains
In this technique all intermediate steps and cluster associations are immediately available: it is up to the user to select an appropriate pruning level in the tree. There are many ways to link samples and these do result in different trees (see book for details).

Hierarchical cluster analysis
An example of hierarchical clustering: the composition of a number of lava samples from Kawah Ijen volcano.
[Figure: dendrogram of degree of dissimilarity against sample (KV01, KV20, KV41, KV43, KV08, KV10, KV12, KV14, KV14, KV21, KV21), with clusters labelled basalt, andesite, dacite and duplicate]
- dissimilarity is based upon the nearest neighbour criterion
- the resulting tree can be pruned at any level: up to the user to select
- should test if the difference between groups is significant (which test?)
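The four hierarchical steps, with the nearest neighbour (single linkage) criterion and pruning at a user-selected level, might look as follows in Python. This is a minimal sketch, not the routine used for the Kawah Ijen data: the function names and the one-variable sample values are hypothetical.

```python
import math

def single_linkage(points):
    """Link clusters step by step; return (dissimilarity, members_a, members_b) per link."""
    clusters = [[i] for i in range(len(points))]  # step 1: every sample is a cluster
    merges = []
    while len(clusters) > 1:                      # step 4: until one cluster remains
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # nearest neighbour criterion: smallest distance between any two members
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best                            # steps 2 and 3: link the most similar pair
        merges.append((d, sorted(clusters[a]), sorted(clusters[b])))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

def prune(n_samples, merges, level):
    """Clusters obtained by keeping only links with dissimilarity <= level."""
    label = list(range(n_samples))
    for d, left, right in merges:
        if d <= level:
            for i in right:
                label[i] = label[left[0]]
    groups = {}
    for i, lab in enumerate(label):
        groups.setdefault(lab, []).append(i)
    return sorted(groups.values())

# Hypothetical one-variable samples (e.g. a single oxide content).
points = [(0.0,), (1.0,), (5.0,), (6.5,), (20.0,)]
merges = single_linkage(points)
print(prune(len(points), merges, 2.0))  # [[0, 1], [2, 3], [4]]
print(prune(len(points), merges, 5.0))  # [[0, 1, 2, 3], [4]]
```

Pruning this way relies on single-linkage dissimilarities never decreasing from one link to the next; other linkage types would need their own distance update.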

4 Clustering - partitioning techniques
Two varieties of clustering: hierarchical and partitioning methods.
Partitioning techniques assign samples to a known number of clusters based upon similarity criteria. The method:
1. samples are assigned to the cluster they are most similar to in multi-dimensional space
2. each assignment results in a shift in the characteristics of the cluster centre (means + variance or only variance)
3. samples are re-assigned where necessary and this routine is iterated until the system stabilizes
There are two main approaches: clustering with specified cluster means (i.e. known groups) and clustering where the means are obtained during clustering. Both have their pros and cons.

Partitioning techniques

Specified/fixed cluster means
  Advantages:
  - you always get the same answer during classification
  - groups can relate to real dividing phenomena
  - unknowns are (generally) easily classified
  Disadvantages:
  - boundaries commonly based on consensus (artificial)
  - 2 samples close together can be in different clusters
  - 2 very different samples can be in same cluster

Assigned/sought cluster means
  Advantages:
  - data groups not split up over different clusters
  - boundaries always in regions of low data density
  - easy to apply to data sets with many variables
  Disadvantages:
  - instability issues: more data will result in a shift in cluster means and sample assignment
  - no fixed boundaries, so unsuitable for classification schemes

5 Clustering with hard boundaries
- 2 samples close together can be in different clusters
- 2 very different samples can be in same cluster

6 Cluster means assigned during clustering
- when cluster means are specified: use minimum distance to mean to assign
- if not: randomly assign each sample to a cluster and iterate to a stable solution
- both cluster means and cluster assignment change during the iteration
- the process stops when samples no longer change their assignment
[Figure: samples assigned to cluster A, cluster B and cluster C around their centres]

Cluster analysis - method of assignment
Samples are normally assigned to a cluster in a hard way: samples are unambiguously attributed to a specific cluster (0 or 1 assignment). However, mother nature is rarely so black and white: what counts as the middle age cluster depends very much on person/country/continent.
[Figure: membership (0 to 1) against age for young, middle age and old. Hard approach: if age is between A and B, then middle age. Fuzzy approach: samples have cluster memberships between 0 and 1]
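The iteration for cluster means that are sought during clustering (random initial assignment, update the means, re-assign, stop when nothing changes) can be sketched as below. This is the textbook k-means loop under stated assumptions, not necessarily what a given package does: the function name, the seed handling, the empty-cluster rule and the data are all illustrative.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Hard clustering: returns (labels, means) once assignments are stable."""
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in points]   # random initial assignment
    means = [None] * k
    for _ in range(max_iter):
        # Update each cluster mean from its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                means[c] = tuple(sum(col) / len(members) for col in zip(*members))
            elif means[c] is None:
                means[c] = rng.choice(points)     # assumed rule for an empty cluster
        # Re-assign every sample to the nearest mean (minimum distance to mean).
        new_labels = [min(range(k), key=lambda c: math.dist(p, means[c]))
                      for p in points]
        if new_labels == labels:                  # no sample changed cluster: stop
            break
        labels = new_labels
    return labels, means

# Hypothetical two-variable data with two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, means = kmeans(points, 2)
print(labels, means)
```

Because the start is random, which group ends up as cluster 0 or 1 depends on the seed; with more data the means and the assignments can shift, which is the instability noted for this approach.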


7 Fuzzy clustering
Fuzzy clustering has a number of distinct benefits:
- can deal with intermediate cases: these are not force-assigned
- samples can share multiple clusters, which gives extra information (a sample that is 0.7 young and partly middle age versus one that is 0.5 young and partly middle age)
- ensures that single samples do not overly control individual clusters
- can have a separate outlier assignment
Most flexible and powerful: fuzzy clustering with seeking of cluster means.
[Figure: hard clustering (strained assignment due to outlier and intermediate value) versus fuzzy clustering (outlier not a problem and intermediate shown)]

Clustering in NCSS - the eating habits of Europe
Can we distinguish the Europeans by their eating habits?
The data (missing value = -999) cover the following variables: Real coffee, Nescafe, Tea, Sweetener, Biscuits, Pack. soup, Tinned soup, Frozen fish, Frozen veg., Apples, Tinned fruit, Jam, Garlic, Butter, Margarine, Olive oil, Yoghurt.
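To make cluster memberships between 0 and 1 concrete, the sketch below uses the membership formula of the standard fuzzy c-means algorithm. That choice is an assumption: the slides do not say which fuzzy algorithm is meant, and the cluster centres (ages 20, 45 and 70) and the fuzzifier m = 2 are invented for illustration.

```python
def fuzzy_memberships(value, centres, m=2.0):
    """Membership of one sample in each cluster; the memberships sum to 1.

    Uses the fuzzy c-means rule u_k proportional to d_k ** (-2 / (m - 1)),
    where d_k is the distance to centre k and m > 1 is the fuzzifier.
    """
    dists = [abs(value - c) for c in centres]
    hits = dists.count(0.0)
    if hits:  # the sample sits exactly on a centre
        return [1.0 / hits if d == 0.0 else 0.0 for d in dists]
    inverse = [d ** (-2.0 / (m - 1.0)) for d in dists]
    total = sum(inverse)
    return [w / total for w in inverse]

# Hypothetical centres for the clusters young, middle age and old.
centres = [20.0, 45.0, 70.0]
for age in (22.0, 30.0, 45.0):
    memberships = fuzzy_memberships(age, centres)
    hard = memberships.index(max(memberships))  # what a 0 or 1 assignment would keep
    print(age, [round(u, 2) for u in memberships], "hard cluster:", hard)
```

The hard assignment keeps only the largest membership, so the information that an intermediate sample also belongs partly to a second cluster is lost.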

8 Clustering in NCSS - the eating habits of Europe
Hierarchical clustering of this data set gives clear clustering. Lots of options are available:
- use parametric and nonparametric data, and even mix these (length + color)
- variety of linkage types: nearest neighbour, furthest neighbour, Ward's method
- distance: Euclidean or Manhattan (city block)
See the NCSS hierarchical clustering tutorial for more information.

Clustering in NCSS - the eating habits of Europe
Hard and fuzzy clustering of this data set:
[Table: K-means (hard) assignment and fuzzy membership probabilities (fuzzy-prob 1 to fuzzy-prob 4) for Germany, Italy, France, Netherlands, Belgium, Luxembourg, Britain, Portugal, Austria, Switzerland, Sweden, Denmark, Norway, Finland, Spain, Ireland]

9 Clustering - number of clusters
The main difficulty in cluster analysis is choosing the number of clusters. NCSS and other clustering packages will calculate assignments for a range of cluster numbers. The residual variance will decrease with every additional cluster, so this is not a good indicator of the optimal number of clusters. Instead:
- choose the number of clusters where the variance no longer strongly decreases
- use the averaged silhouette value: a comparison between a value's dissimilarity with its own cluster and its dissimilarity with its nearest neighbour. It ranges from 1 to -1: > 0.75 is a good model, < 0.25 is a poor model
- use the fuzziness of the model (0: completely fuzzy, to 1: hard), given by the Fc(U) and Dc(U) parameters: max Fc(U) + min Dc(U) = best model

Plotting clusters on maps - Massif Central dataset
We will look at an example from the Massif Central in France: a dataset of the chemical composition of stream sediments collected in an area with a diverse geology and old, now abandoned, mining for Sb, As, Pb, Au, Ba & F.
Geology consists of: felsic gneisses, mafic gneisses and schists, (meta-)granite, recent volcanics, and sediment (incl. coal).
[Figure: geological map (scale bar 5 km) with the Cronce and Desges rivers and the localities Ally, Chilhac, Lavoute-Chilhac, St Cirgues, Reilhac, Langeac, Védrines, St Loup, Chastel, Lestival, Chanteuges, Pinols, Marsanges, Barlet, Desges, Chazelles, Pebrac, Charraix and Prades]
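The averaged silhouette value mentioned above can be computed directly from its usual definition: for each sample, a is the mean distance to the other members of its own cluster, b is the mean distance to the nearest other cluster, and the silhouette is (b - a) / max(a, b). The sketch below assumes Euclidean distance and at least two clusters; the function names and the data are hypothetical.

```python
import math

def mean_distance(p, others):
    """Mean Euclidean distance from sample p to a list of samples."""
    return sum(math.dist(p, q) for q in others) / len(others)

def average_silhouette(points, labels):
    """Average of (b - a) / max(a, b) over all samples; needs >= 2 clusters."""
    scores = []
    for i, p in enumerate(points):
        own = [q for j, q in enumerate(points) if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # usual convention for a single-sample cluster
            continue
        a = mean_distance(p, own)
        b = min(
            mean_distance(p, [q for q, lab in zip(points, labels) if lab == other])
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical one-variable data: two tight groups far apart.
points = [(0.0,), (1.0,), (10.0,), (11.0,)]
print(average_silhouette(points, [0, 0, 1, 1]))  # sensible grouping: close to 0.9
print(average_silhouette(points, [0, 1, 0, 1]))  # mixed-up grouping: negative
```

Running this for a range of cluster numbers and keeping the highest average is one way to apply the > 0.75 / < 0.25 rule of thumb.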

10 Clustering - groups in Massif Central dataset
The dataset is best described when split up into six clusters. There is a clear link to the bedrock geology, but not 1 to 1.

Clustering - properties per cluster
When the data have been clustered, you can look at the characteristics of each cluster (mean + stdev) and the correlations within it.
[Figure: per-cluster plots of SiO2, MgO, log (Li), K2O, Li and V]

11 Clustering - groups in Massif Central dataset
Can plot clusters individually to look at spatial distribution and contents.
[Figure: maps of cluster 1, cluster 2, cluster 3 and cluster 4]
- cluster separation isn't perfect
- only cluster 4 is distinct in Li: multi-element separation

Plotting data on maps: bubble plots
Data are plotted at their spatial coordinates with a symbol whose size represents the value of the data point. Can apply exactly the same tools as used on the element map: adjust contrast, isolate features and perform data transformations. Can also overlay these bubbles on another layer, such as a topo map, geol map, stream map, etc.

12 Plotting data on maps: bubble plots
Stream sediments as a reflection of the local geology: beryllium.
[Figure: bubble plot of Be concentrations without processing]
Sometimes it just works!

Plotting data on maps: bubble plots
Silver concentrations: working with a non-normal distribution.
[Figure: bubble plots of Ag on an optimized linear scale and on a square root scale]

13 Plotting data on maps: bubble plots
You don't have to plot all the data in the dataset: applying a cut-off at low values will highlight interesting samples, whereas a high cut-off removes outliers.
[Figure: bubble plot of Zn, only data with > 50 ppm]

Plotting data on maps: bubble plots
Looking for element associations by combining bubble plots.
[Figure: combined bubble plots of Cd, Zn, Sb and W]
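How cut-offs and a square root scale change the bubbles can be sketched without any plotting library by computing the symbol sizes only. Everything here is illustrative: the function name, the maximum symbol size and the Zn values are assumptions, and concentrations are taken to be non-negative.

```python
import math

def bubble_sizes(values, low_cut=None, high_cut=None, scale="linear", max_size=20.0):
    """Map values to symbol sizes; returns {sample index: size} for the kept samples.

    Samples at or below low_cut, or above high_cut, are not plotted.
    The largest kept value always gets max_size.
    """
    kept = [(i, v) for i, v in enumerate(values)
            if (low_cut is None or v > low_cut) and (high_cut is None or v <= high_cut)]
    if not kept:
        return {}
    transform = math.sqrt if scale == "sqrt" else (lambda v: v)
    top = max(transform(v) for _, v in kept)
    return {i: (max_size * transform(v) / top if top > 0 else 0.0) for i, v in kept}

# Hypothetical Zn concentrations in ppm, including one extreme value.
zn = [10.0, 40.0, 90.0, 160.0, 2500.0]
print(bubble_sizes(zn, low_cut=50.0))                                  # outlier dominates
print(bubble_sizes(zn, low_cut=50.0, high_cut=1000.0, scale="sqrt"))   # {2: 15.0, 3: 20.0}
```

With the outlier kept on a linear scale the other bubbles nearly vanish; removing it, or compressing the scale with a square root, brings the remaining samples out.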

14 Plotting data on maps
Combining elements by using multi-coloured bubble plots is useful, but fast becomes confusing and can lead you to miss interesting samples. You can also calculate such associations beforehand and plot them directly:
- Sb + Zn
- Sb / Zn
Or you can apply logical rules to the data before plotting:
- plot Sb if S > 200 ppm
- if SiO2 > 60 wt% then plot K / Zr
Note that such properties are calculated much more easily and quickly in programs designed for such calculations, e.g. Excel or Quattro Pro.

Plotting data on maps
You are not limited to plotting data, but can also plot derived properties such as the mean, median, standard deviation, etc. And not just values, but also other observations: geol code / vegetation / mode in a multi-modal distribution.
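The pre-calculated associations and logical rules listed above translate directly into a few conditional expressions. The sketch below uses a plain dictionary per sample; the field names follow the slide, but the function name and the sample values are made up, and None is used here to mean "do not plot".

```python
def derived_properties(sample):
    """Values to plot for one sample; None means the sample is not plotted."""
    return {
        "Sb + Zn": sample["Sb"] + sample["Zn"],
        "Sb / Zn": sample["Sb"] / sample["Zn"] if sample["Zn"] else None,
        # plot Sb if S > 200 ppm
        "Sb if S > 200 ppm": sample["Sb"] if sample["S"] > 200.0 else None,
        # if SiO2 > 60 wt% then plot K / Zr
        "K / Zr if SiO2 > 60 wt%": (sample["K"] / sample["Zr"]
                                    if sample["SiO2"] > 60.0 and sample["Zr"] else None),
    }

# Hypothetical stream sediment sample (ppm, except SiO2 and K in wt%).
sample = {"Sb": 12.0, "Zn": 48.0, "S": 350.0, "SiO2": 55.0, "K": 2.4, "Zr": 120.0}
for name, value in derived_properties(sample).items():
    print(name, "->", value)
```

For this sample the sulphur rule passes and Sb is plotted, while the SiO2 rule fails and K / Zr is left out.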

15 Plotting data on maps: bubble plots
Plotting processed data - standard deviation: the variability at a sample site.

Plotting data on maps: artefacts in the gold map

16 Spatial data visualization
To be able to calculate contours and surfaces: interpolation. You need to know the concentration at any point in the sampling space to be able to draw smooth contours, so interpolate between values.
[Figure: interpolation of As content on a grid]
Interpolation options:
- nearest neighbour
- radius technique: 1/r
- radius technique: 1/r^2

Spatial data visualization
Results of different interpolation techniques:

17 Spatial data visualization
To be able to calculate contours and surfaces: interpolation.
[Figure: interpolation of As content on a grid]
Interpolation options:
- nearest neighbour
- radius technique: 1/r
- radius technique: 1/r^2
Main issue: which samples should be included in the interpolation, i.e. what should the maximum radius be?

Interpolation radius
Spatial data have a very useful property: adjacent samples should be most similar, whereas samples that are far apart can be distinctly different, or:
- the variance for a small interpolation radius is small, as the variance between adjacent samples is small
- the variance increases as the interpolation radius increases (i.e. as samples further away from the point of interest are included)
- at some radius the variance will no longer increase as we have reached the overall variance, which is called the regional variance
- including values beyond the regional variance radius is pointless, as such samples do not contain any information on the value at the point of interest
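The radius techniques (weights of 1/r or 1/r^2) and the effect of a maximum radius can be sketched as a small inverse-distance interpolator. This is a generic illustration, not the routine behind the maps in the slides: the function name, its parameters and the As values are assumptions.

```python
import math

def inverse_distance(samples, x, y, power=1, max_radius=None):
    """Weighted mean of sample values with weights 1 / r**power.

    samples is a list of (x, y, value). Samples further away than
    max_radius are ignored. Returns None if no sample is in reach.
    """
    numerator = denominator = 0.0
    for sx, sy, value in samples:
        r = math.hypot(sx - x, sy - y)
        if r == 0.0:
            return value                 # exactly on a sample site
        if max_radius is not None and r > max_radius:
            continue
        weight = 1.0 / r ** power
        numerator += weight * value
        denominator += weight
    return numerator / denominator if denominator > 0.0 else None

# Hypothetical As values: two nearby samples and one distant, very different one.
samples = [(0.0, 0.0, 10.0), (2.0, 0.0, 20.0), (10.0, 0.0, 100.0)]
print(inverse_distance(samples, 1.0, 0.0, power=1))                  # pulled up by the far sample
print(inverse_distance(samples, 1.0, 0.0, power=2))                  # far sample counts less
print(inverse_distance(samples, 1.0, 0.0, power=1, max_radius=5.0))  # 15.0
```

Raising the power or capping the radius both reduce the influence of the distant sample; choosing that radius objectively is the question the following slides take up.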

18 Interpolation radius
Interpolation radius in a sedimentary core:
[Figure: concentration along the core, with the interpolation radius marked]
- adjacent samples are most similar: as the interpolation radius increases so does the variance
- when you enter another unit the variance increases significantly: such samples should not be included in your interpolation

Interpolation radius

19 Interpolation radius
[Figure: variance against radius, and semivariance against radius]

Interpolation radius
[Figure: semivariance against radius]

20 Semivariance and semivariograms
This concept is semivariance and is shown in a semivariogram.
Semivariance: the variance between samples a specified interval or distance apart. As the interval increases, the semivariance will approach the total variance of the data set, so it is a spatially controlled partial variance of the data:

$$\gamma_h = \frac{\sum_{i=1}^{n-h} (z_i - z_{i+h})^2}{2(n - h)}$$

with:
γ_h = semivariance for interval h
n = total number of samples
z_i = value at position i
As h increases, the relatedness of the samples decreases and the variance will therefore increase.

Semivariance and semivariograms
Plotting the semivariance against h gives the semivariogram:
[Figure: three pairs of plots, each showing concentration against distance and semivariance against interval: no relation with distance (random); gradual changes in concentration; continuous variation with distance (trend)]
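For samples taken at regular spacing along a line, the semivariance formula above is a one-line sum. The sketch below assumes such a regularly spaced series (the series itself is invented) and shows the steadily rising semivariance that a trend produces.

```python
def semivariance(z, h):
    """Semivariance for interval h of a regularly spaced series z:
    sum of (z[i] - z[i+h])**2 over the n - h pairs, divided by 2 (n - h)."""
    n = len(z)
    if not 0 < h < n:
        raise ValueError("h must be between 1 and n - 1")
    return sum((z[i] - z[i + h]) ** 2 for i in range(n - h)) / (2 * (n - h))

# Hypothetical series with a continuous trend.
z = [1.0, 2.0, 3.0, 4.0, 5.0]
for h in (1, 2, 3):
    print(h, semivariance(z, h))  # 0.5, 2.0 and 4.5: no sill is reached
```

Plotting these values against h gives the semivariogram; for irregularly spaced map data the pairs would instead be grouped into distance classes.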

21 Semivariance and semivariograms
Properties of a semivariogram:
[Figure: semivariance against interval, showing the sill, the range and drift]
The range is the interval within which there is similarity between the samples.

Semivariance and semivariograms
Semivariograms provide our maximum radius criterion: only samples that fall within the range are included in interpolation. Before we continue, a few notes:
- semivariograms have to be determined for each variable as each has its own range: interpolation has to be performed separately as well
- semivariograms are generally different for different spatial directions (N, SW, etc.). Such anisotropy can point to an underlying geological phenomenon such as layering or a fault control on concentration. This can be corrected for either manually, by stretching the coordinate system perpendicular to the main axis, or automatically by kriging software
- most semivariograms have an apparent cut-off at zero distance that has a semivariance greater than 0. This is called the nugget effect and is caused by sample heterogeneity (= field duplicate variance)

22 Nugget effect in semi-variograms
[Figure: semivariance against interval, showing the nugget, the sill and the range]
There is always some uncertainty at a given sample site, which you could quantify by taking field duplicates. This sample site variance is the nugget in a semivariogram (in essence the variance at zero distance). Every element will have such a nugget, but the effect is strongest for elements that are heterogeneously distributed, such as gold present as nuggets in a sediment, because we use mean + var.

Using semivariogram information: kriging
The interpolation technique that employs the range information as obtained from semivariograms is called kriging.
- in kriging, only samples that are within the range are used to determine the value at a given intermediate position, and the weighting for each sample is derived from its associated semivariance:

$$A(x_i, y_i) = wt_1 \cdot A(x_1, y_1) + wt_2 \cdot A(x_2, y_2) + wt_3 \cdot A(x_3, y_3) + \dots$$

- as an added bonus this also gives us the variance associated with each interpolated value (the uncertainty), so we can immediately see where our interpolations are reliable and where they are not
- because weights are based on the semivariance, obvious trends in the data should be removed, as a trend leads to a continuous rise in the semivariance: this can be done by first subtracting a trend surface
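A compact way to see where the kriging weights and the associated variance come from is ordinary kriging with a fitted semivariogram model. The sketch below is one possible illustration under several assumptions: a spherical model with hypothetical nugget, sill and range values, made-up sample data, and a small hand-written linear solver so that no external library is needed.

```python
import math

def spherical(h, nugget, sill, rng):
    """Spherical semivariogram model: 0 at h = 0, reaching the sill at the range."""
    if h == 0.0:
        return 0.0
    if h >= rng:
        return sill
    r = h / rng
    return nugget + (sill - nugget) * (1.5 * r - 0.5 * r ** 3)

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def ordinary_kriging(samples, x0, y0, nugget, sill, rng):
    """Return (estimate, variance, weights) using only samples within the range."""
    near = [s for s in samples if math.hypot(s[0] - x0, s[1] - y0) <= rng]
    n = len(near)
    if n == 0:
        return None, None, []

    def gamma(p, q):
        return spherical(math.hypot(p[0] - q[0], p[1] - q[1]), nugget, sill, rng)

    # Kriging system: semivariances between samples, plus the sum-to-one constraint.
    a = [[gamma(near[i], near[j]) for j in range(n)] + [1.0] for i in range(n)]
    a.append([1.0] * n + [0.0])
    g0 = [gamma(s, (x0, y0)) for s in near]
    solution = solve(a, g0 + [1.0])
    weights, mu = solution[:n], solution[n]
    estimate = sum(w * s[2] for w, s in zip(weights, near))
    variance = sum(w * g for w, g in zip(weights, g0)) + mu
    return estimate, variance, weights

# Hypothetical samples (x, y, value); the third lies beyond the assumed range of 10.
samples = [(0.0, 0.0, 10.0), (2.0, 0.0, 20.0), (100.0, 0.0, 999.0)]
print(ordinary_kriging(samples, 1.0, 0.0, nugget=0.0, sill=1.0, rng=10.0))
```

Halfway between the two nearby samples the weights are equal and the estimate is their mean; at a sample site the estimate reproduces the sample and the variance drops to zero, which is how the uncertainty map arises.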

23 Estimate of uncertainty for each interpolated value
[Figure: source: wikipedia.org]

Uncertainty in block kriging of grades
Kriging is commonly applied to estimate the grade of blocks in open pit mining, using a sample grid or the grade of adjacent blocks (or both). In such cases it is invaluable to know the uncertainty on the grade estimate.

24 Flavours of kriging
There are many flavours of kriging and discussing them all would be a course in its own right. A few terms that you come across commonly:
- Simple/Ordinary kriging: no trend in the data, so there is a constant mean in the dataset and the variance is calculated as the difference from this mean. This mean is either known (Simple) or calculated from the data (Ordinary)
- Universal kriging: there is a spatial trend in the data, so the mean varies with the spatial coordinates. Instead of using universal kriging, you can also remove the trend in pre-processing of the data
- Indicator kriging: rather than estimating a numerical value at a given point, you estimate if it is higher or lower than a set value, and the probability of this
- Co-kriging: a second variable is included in the kriging which is correlated with the first variable. This should improve estimates of the first and main variable
Good kriging resource: Clark & Harper (2000) Practical Geostatistics ISBN , or you can download the 1979 original at

Back to our example
Results of different interpolation techniques:


25 And now using kriging as the interpolation method
Results of kriging on this data set:

Kriging and trends
The effect of a strong spatial trend in the data

26 Kriging and trends
The effect of a strong spatial trend in the data

Some data are not suited to interpolation/kriging
There is a strong tendency to directly start with the most complex or fancy technique, such as kriging. However, kriging is not always appropriate!
[Figure: raw concentrations plotted versus optimized kriging map]

27 Kriging and sample coverage
Kriging works best when you have a high sample density and a more or less uniform distribution of data over the sample area. If not, you get artefacts. Areas without samples need to be blanked out, not just removed afterwards.
