Spectral Clustering of Polarimetric SAR Data With Wishart-Derived Distance Measures


Spectral Clustering of Polarimetric SAR Data With Wishart-Derived Distance Measures STIAN NORMANN ANFINSEN ROBERT JENSSEN TORBJØRN ELTOFT COMPUTATIONAL EARTH OBSERVATION AND MACHINE LEARNING LABORATORY DEPARTMENT OF PHYSICS AND TECHNOLOGY UNIVERSITY OF TROMSØ, NORWAY 1/54

Outline Motivation Introduction to Spectral Clustering Distance Measures for PolSAR Covariance Matrices A New Algorithm Results Conclusions and Future Work 2/54

Motivation
Seeking (near) optimal statistical classification - disregarding covariance matrix structure (decomposition theory) and spatial information, for now.
Improve on the Wishart classifier: Lee et al. (IJRS, 1994), Lee et al. (TGRS, 1999), Pottier & Lee (EUSAR, 2000), ...
Apply modern pattern recognition tools: kernel methods, spectral clustering, information theoretic learning. 3/54

The Wishart Classifier Revisited
Initialisation: segmentation in H/A/α space - the Cloude-Pottier-Wishart (CPW) classifier.
Class mean coherency matrices $V_i$ are calculated from the initial partitioning of the data:
$V_i = \langle T_j \mid \text{pixel } j \in \text{class } i \rangle, \quad i = 1, \ldots, k,$
where the pixel coherency matrix is formed from the Pauli scattering vector,
$T_j = \langle k k^H \rangle, \qquad k = \frac{1}{\sqrt{2}}\,[S_{hh} + S_{vv},\; S_{hh} - S_{vv},\; 2 S_{hv}]^T.$
Iterative classification: minimum-distance classification based on the Wishart distance between the pixel coherency matrix $T_j$ and $V_i$,
$\omega_j = \arg\min_i\, d_W(T_j, V_i), \quad i \in \{1, \ldots, k\},$
followed by iterative reclassification and update of the class means. 5/54
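For concreteness, here is a minimal NumPy sketch of the two ingredients above: forming a coherency matrix from the Pauli scattering vector, and the iterative minimum-distance Wishart classification. It is an illustration under stated assumptions, not the authors' implementation; all function names are ours, multilooking (spatial averaging of $kk^H$) is assumed to have been done elsewhere, and empty classes are not handled.

```python
import numpy as np

def pauli_coherency(S_hh, S_hv, S_vv):
    """Single-look coherency matrix k k^H from the Pauli scattering vector
    k = (1/sqrt(2)) [S_hh + S_vv, S_hh - S_vv, 2 S_hv]^T."""
    k = np.array([S_hh + S_vv, S_hh - S_vv, 2.0 * S_hv]) / np.sqrt(2.0)
    return np.outer(k, k.conj())

def wishart_distance(T, V):
    """d_W(T, V) = ln|V| + tr(V^{-1} T)  (Lee et al., IJRS 1994)."""
    return np.linalg.slogdet(V)[1] + np.trace(np.linalg.solve(V, T)).real

def wishart_classify(T_stack, labels, k, n_iter=10):
    """Iterative reclassification: recompute the class means V_i from the
    current partitioning, then reassign every pixel to the class with the
    minimum Wishart distance."""
    for _ in range(n_iter):
        V = [T_stack[labels == i].mean(axis=0) for i in range(k)]
        d = np.array([[wishart_distance(T, Vi) for Vi in V] for T in T_stack])
        labels = d.argmin(axis=1)   # omega_j = arg min_i d_W(T_j, V_i)
    return labels
```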

The Wishart Classifier Revisited
Delivers consistently good results. Few parameters, easy to use, computationally efficient, and approaches an ML solution - if it converges.
But it has some drawbacks: the initialisation uses a fixed number of classes and is restricted to one class per predetermined zone in H/A/α space, and it inherits the well-known disadvantages of k-means; e.g., convergence is not guaranteed and may be slow.
Conclusion: state-of-the-art algorithms from pattern recognition and machine learning should be tested. 6/54

Clustering by Pairwise Affinities
Based on distances $d_{ij}$ between all pixel pairs (i, j). Propagates similarity from pixel to pixel. Yields flexible discrimination surfaces. Nonlinear mapping to a kernel space, where clustering is done with linear methods. The mapping is found by eigendecomposition.
[Figure: examples of capabilities - data shown in the input space and in the kernel space] 7/54

Spectral Clustering
Pairwise distances $d_{ij}$ are transformed to affinities, e.g.
$a_{ij} = \exp\left\{ -\frac{d_{ij}^2}{2\sigma^2} \right\}.$
The pairwise affinities $a_{ij}$ between the N data points are stored in an affinity matrix
$A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1N} \\ a_{21} & a_{22} & \cdots & a_{2N} \\ \vdots & \vdots & \ddots & \vdots \\ a_{N1} & a_{N2} & \cdots & a_{NN} \end{bmatrix}.$
The optimal data partitioning is derived from the eigendecomposition of A - hence, spectral clustering. There are different ways of using the eigenvalues and eigenvectors of A to obtain an optimal clustering. E.g.: using the u eigenvectors corresponding to the largest eigenvalues gives a new u-dimensional feature space (eigenspace):
$\begin{bmatrix} e_1^T \\ e_2^T \\ \vdots \\ e_u^T \end{bmatrix} = [\,\phi_1\; \phi_2\; \cdots\; \phi_N\,].$ 12/54
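A minimal sketch of this step, assuming a precomputed N × N matrix D of pairwise distances; the Gaussian affinity transform follows the formula above, while the names (spectral_embedding, D, sigma, u) are ours. Many spectral clustering variants first normalise A (e.g. into a graph Laplacian); the slides eigendecompose A directly, and the sketch does the same.

```python
import numpy as np

def spectral_embedding(D, sigma, u):
    """Map N points to a u-dimensional eigenspace.

    D     : (N, N) symmetric matrix of pairwise distances d_ij
    sigma : kernel bandwidth
    u     : number of leading eigenvectors to keep
    """
    A = np.exp(-D**2 / (2.0 * sigma**2))      # affinity matrix, a_ij
    eigval, eigvec = np.linalg.eigh(A)        # A is real and symmetric
    order = np.argsort(eigval)[::-1][:u]      # u largest eigenvalues
    return eigval[order], eigvec[:, order]

# Row i of the returned (N, u) eigenvector matrix is the eigenspace
# feature phi_i of data point i, ready for clustering with a simple
# linear method (e.g. k-means).
```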

Spectral Clustering
We have a mapping from the input feature space to the eigenspace:
$\Phi(T_i) : T_i \mapsto \phi_i.$
The eigenspace feature set can be clustered by simple, linear discrimination methods, e.g. k-means with Euclidean distance. We use an information theoretic method, which partitions the data by implicit maximization of the Cauchy-Schwarz divergence between the cluster pdfs in input space. The pdfs are estimated nonparametrically.
Data points outside the size-N sample can be mapped to the eigenspace using the Nyström approximation:
$\Phi_j(T) \approx \frac{\sqrt{N}}{\lambda_j} \sum_{i=1}^{N} e_{ji}\, d(T, T_i), \quad j = 1, \ldots, u.$ 17/54
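The out-of-sample mapping can be sketched as follows, reusing the eigenvalues and eigenvectors of the sampled affinity matrix from the previous sketch. The slide writes d(T, T_i) inside the sum; here that term is interpreted as the same Gaussian kernel value a(T, T_i) used to build A, which is an assumption on our part - if it is meant literally as the distance, replace the first line of the function body accordingly.

```python
import numpy as np

def nystrom_map(d_new, eigval, eigvec, sigma):
    """Approximate eigenspace coordinates of an out-of-sample pixel.

    d_new  : (N,) distances d(T, T_i) from the new pixel to the N samples
    eigval : (u,) leading eigenvalues of the sampled affinity matrix A
    eigvec : (N, u) corresponding eigenvectors e_1, ..., e_u (as columns)
    """
    a_new = np.exp(-d_new**2 / (2.0 * sigma**2))   # kernel values a(T, T_i)
    N = d_new.shape[0]
    # Phi_j(T) ~ (sqrt(N) / lambda_j) * sum_i e_ji * a(T, T_i)
    return np.sqrt(N) / eigval * (eigvec.T @ a_new)
```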

Relation to Kernel Methods
May be related to Mercer kernel-based algorithms, such as support vector machines, kernel PCA, kernel k-means, etc. The pairwise affinities are inner products in a Mercer kernel space:
$a_{ij} = a(T_i, T_j) = \langle \phi_i, \phi_j \rangle.$
$a_{ij}$ is a Mercer kernel function and A a Mercer kernel matrix iff $a(T_i, T_j)$ is positive semidefinite, symmetric and continuous.
With these restrictions, how do we select the distance measure? 18/54
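As a quick sanity check, the matrix-level consequences of these conditions (symmetry and positive semidefiniteness of the finite affinity matrix A) can be verified numerically. This does not prove that the underlying affinity function is a Mercer kernel; it only checks the sampled matrix. A small sketch with illustrative names:

```python
import numpy as np

def looks_like_mercer_matrix(A, tol=1e-10):
    """Check symmetry and positive semidefiniteness of a finite affinity
    matrix A (necessary conditions for a Mercer kernel matrix)."""
    symmetric = np.allclose(A, A.T, atol=tol)
    min_eig = np.linalg.eigvalsh(0.5 * (A + A.T)).min()
    return symmetric and min_eig >= -tol
```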

Coherency Matrix Distance Measures
Wishart distance (Lee et al., IJRS '94):
$d_W(T_1, T_2) = \ln|T_2| + \mathrm{tr}(T_2^{-1} T_1).$
Can be symmetrized, but $d_W(T_i, T_i)$ depends on $T_i$. Not suitable!
Bartlett distance (Conradsen et al., TGRS '03):
$d_B(T_1, T_2) = \ln\!\left(\frac{|T_1 + T_2|^2}{|T_1|\,|T_2|}\right) - 2p \ln 2.$
Based on the log-likelihood ratio test of equality for two unknown covariance matrices.
Symmetrized normalized log-likelihood (SNLL) distance (proposed here):
$d_{SNLL}(T_1, T_2) = \frac{1}{2}\,\mathrm{tr}\!\left(T_1^{-1} T_2 + T_2^{-1} T_1\right) - p.$
Based on the log-likelihood ratio test of equality for one known and one unknown covariance matrix. A symmetrized version of the revised Wishart distance (Kersten et al., TGRS '05). 24/54
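Since both symmetric distances vanish for equal arguments, they can be plugged directly into the Gaussian affinity above, whereas the Wishart distance (sketched earlier after the classifier slides) cannot, as noted. A minimal NumPy sketch of the Bartlett and SNLL distances as reconstructed above, for p × p Hermitian positive-definite coherency matrices (p = 3 for full-polarimetric data); this is our transcription of the formulas, not the authors' code:

```python
import numpy as np

def _logdet(T):
    """ln|T| for a Hermitian positive-definite matrix."""
    return np.linalg.slogdet(T)[1]

def bartlett_distance(T1, T2):
    """d_B(T1, T2) = ln(|T1 + T2|^2 / (|T1| |T2|)) - 2 p ln 2."""
    p = T1.shape[0]
    return (2.0 * _logdet(T1 + T2) - _logdet(T1) - _logdet(T2)
            - 2.0 * p * np.log(2.0))

def snll_distance(T1, T2):
    """d_SNLL(T1, T2) = (1/2) tr(T1^{-1} T2 + T2^{-1} T1) - p."""
    p = T1.shape[0]
    return 0.5 * (np.trace(np.linalg.solve(T1, T2))
                  + np.trace(np.linalg.solve(T2, T1))).real - p
```

Both functions return 0 when T1 = T2, so the Gaussian affinity of a point with itself is 1, as required for the affinity matrix A.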

The New Algorithm Summary
Replaces the H/A/α space initialisation with spectral clustering. A subset of N pixels, randomly sampled from the image, is clustered. Remaining pixels may be classified in kernel space (eigenspace), using the Nyström approximation. Alternatively, remaining pixels may be classified in input space with the minimum-distance Wishart classifier. The latter solution has a much lower computational cost, and our experience is that the classification results are essentially equal. Hence, only the initialisation of the CPW classifier is changed. 30/54
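Putting the pieces together, a hedged end-to-end sketch of the initialisation just described: sample N pixels, compute pairwise distances (the Bartlett distance is used here as an example), embed with spectral_embedding and cluster in eigenspace, then return class-mean coherency matrices for the subsequent Wishart iterations. It reuses bartlett_distance and spectral_embedding from the earlier sketches; SciPy's k-means stands in for the information-theoretic clustering actually used by the authors, and all names are illustrative.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_initialisation(T_all, n_sample, k, sigma, seed=0):
    """Spectral-clustering initialisation of the Wishart classifier (sketch).

    T_all : (M, 3, 3) stack of pixel coherency matrices
    Returns the k class-mean coherency matrices V_1, ..., V_k.
    """
    rng = np.random.default_rng(seed)
    T_s = T_all[rng.choice(len(T_all), size=n_sample, replace=False)]

    # Pairwise Bartlett distances on the sampled pixels (O(N^2) step)
    D = np.array([[bartlett_distance(Ti, Tj) for Tj in T_s] for Ti in T_s])

    # Eigenspace features (u = k for simplicity) and clustering therein;
    # kmeans2 replaces the Cauchy-Schwarz divergence method of the slides.
    _, E = spectral_embedding(D, sigma, u=k)
    _, labels = kmeans2(E, k, minit='++', seed=seed)

    # Class means from the sample (empty clusters are not handled here)
    return [T_s[labels == i].mean(axis=0) for i in range(k)]
```

The class means returned here replace the H/A/α initialisation; the remaining pixels can then be assigned by minimum Wishart distance to these means and iterated, as in the earlier classifier sketch.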

The New Algorithm Parameters
Number of clusters k: must be manually selected, but the effective number of classes in the classification result, k_eff, is data adaptive.
Sample size N: trade-off with computational cost.
Kernel bandwidth σ: a robust automatic selection rule is under investigation.
Eigenspace dimension u: can be fixed to u = k for simplicity. 34/54

POLinSAR 2007 Frascati Test Data Set: Flevoland, L-band
A 200 × 320 pixel subset of an AIRSAR L-band scene of an agricultural area in Flevoland, The Netherlands, August 1989. Courtesy of NASA/JPL. 35/54

Ground Truth Data 36/54

Evaluation
Qualitative analysis (visual inspection).
Quantitative analysis: we calculate a matching matrix M that relates predicted (P) and actual (A) class labels, and derive classification merits from M (Ferro-Famil et al., TGRS '01):
Descriptivity D_i: the fraction of the dominant predicted class labels within an actual class (quantifies homogeneity).
Compactness C_i: quantifies to what extent the dominant predicted class also dominates other actual classes.
Representivity R_i: quantifies to what extent the dominant predicted class is predicted for other actual classes. 37/54
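Of the three merits, only descriptivity is fully specified by the wording above (the fraction of the dominant predicted label within an actual class); compactness and representivity are defined in Ferro-Famil et al. (TGRS '01) and are not reproduced here. A minimal sketch for D_i and the dominant-cluster bookkeeping, with illustrative names:

```python
import numpy as np

def descriptivity(M):
    """D_i: fraction of the dominant predicted class within actual class i.

    M : (n_actual, n_predicted) matching matrix, rows in percent
        (rows = actual classes A_i, columns = predicted classes P_j).
    """
    return M.max(axis=1)

def dominant_predicted_class(M):
    """Index of the dominant predicted class for each actual class.
    Repeated indices flag actual classes covered by the same cluster
    (e.g. classes 2 and 9 in the CPW result below)."""
    return M.argmax(axis=1)
```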

Qualitative Analysis Cloude-Pottier-Wishart (CPW) Classifier Parameters: k = 16, k_eff = 9, it = 10 (number of iterations in the Wishart classifier). 38/54

Qualitative Analysis Cloude-Pottier-Wishart (CPW) Classifier Observations: Classes 2 and 9 are covered by the same cluster. Classes 4 and 10 are covered by the same cluster. Homogeneous classification in the ground truth areas. 39/54

Qualitative Analysis Bartlett Spectral Wishart (BSW) Classifier Parameters: k = 16, k_eff = 12, σ = 0.42, N = 6400 (10%), it = 10. 40/54

Qualitative Analysis Bartlett Spectral Wishart (BSW) Classifier Observations: Classes 1 and 5 are covered by the same cluster. Some interference from a second cluster in classes 3 and 5. The classification is not as homogeneous as for the CPW classifier, but some areas are better delineated. 41/54

Qualitative Analysis SNLL Spectral Wishart (SSW) Classifier Parameters: k = 16, k_eff = 15, σ = 0.42, N = 6400 (10%), it = 10. 42/54

Qualitative Analysis SNLL Spectral Wishart (SSW) Classifier Observations: A unique dominant cluster for all ground truth areas. Less homogeneous classification than the other methods, largely due to the higher effective number of classes. 43/54

Matching matrix for the CPW classifier (entries in percent; the dominant entry in each row gives the descriptivity D_i)

       P1     P2     P3     P4     P5     P6     P7     P8     P9     P10    C_i
A1    99.8    0.2    0      0      0      0      0      0      0.2    0     99.5
A2     0     99.5    0      0      0.5    0      0      0     99.5    0      0
A3     0      0     99.7    0      0      0.3    0      0      0      0     99.3
A4     0      0      0     92.8    7.2    0      0      0      0     92.8    0
A5     0     20.2    0      0     79.8    0      0      0     20.2    0     39.5
A6     0      0      0      0      0    100      0      0      0      0    100
A7     0      0      0      0      0      0.3   99.7    0      0      0     99.3
A8     0      0      3.0    0      0      1.5    0     95.5    0      0     91.0
A9     0    100      0      0      0      0      0      0    100      0      0
A10    0      0      0    100      0      0      0      0      0    100      0
R_i   99.8    0     96.7    0     72.2   97.8   99.7   95.5    0      7.2
44/54

Matching matrix for the Bartlett distance classifier (entries in percent; the dominant entry in each row gives the descriptivity D_i)

       P1     P2     P3     P4     P5     P6     P7     P8     P9     P10    C_i
A1    99.8    0      0      0      0      0      0      0      0.2    0     99.7
A2     0     98.5    0      0      0.5    0      0      0      1.5    0     97.0
A3     0      0     72.9    0      0      0.1    0     15.3    0      0     57.6
A4     0      0      0     99.2    0.8    0      0      0      0      0.03  98.4
A5     0      5.6    0      1.8   83.2    0      0      0      8.5    0     67.3
A6     0      0      0      0      0     99.8    0      0.02   0      0     99.8
A7     0      0      0      0      0      0.6   98.1    0      0      0     97.5
A8     0      0      8.3    0      0      7.4    0.02  89.0    0      0     77.0
A9     0      3.7    0      0      1.8    0      0      0     94.5    0     89.1
A10    0      0      0      0.4    0.1    0      0      0      0.6   98.6   97.5
R_i   99.8   89.2   64.7   97.0   80.5   91.7   98.1   77.4   83.9   98.6
45/54

Matching matrix for the SNLL distance classifier (entries in percent; the dominant entry in each row gives the descriptivity D_i)

       P1     P2     P3     P4     P5     P6     P7     P8     P9     P10    C_i
A1    99.8    0      0      0      0      0      0      0      0.2    0     99.7
A2     0     98.6    0      0      0.1    0      0      0      0.9    0     97.6
A3     0      0     61.3    0      0      0.05   0      6.0    0      0     55.2
A4     0      0      0     93.0    1.8    0      0      0      0      0     91.2
A5     0      1.0    0      0.1   72.1    0      0      0      3.4    0     67.7
A6     0      0      0      0      0     99.0    0      0      0      0     99.0
A7     0      0      0      0      0.2    0.1   84.1    0      0      0     83.9
A8     0      0      1.2    0      0      3.9    0.1   91.8    0      0     86.6
A9     0      0.2    0      0      6.4    0      0      0     87.1    0     80.6
A10    0      0      0      0.2    0.1    0      0      0      0.4   94.5   93.7
R_i   99.8   97.4   60.1   92.7   63.6   94.9   84.0   85.8   82.3   94.5
46/54

Quantitative Analysis: Descriptivity 47/54

Quantitative Analysis: Compactness 48/54

Quantitative Analysis: Representivity 49/54

Quantitative Analysis: Effective no. classes 50/54

Convergence Speed 51/54

Conclusions and Future Work
We have selected two distance measures suited for calculation of pairwise affinities for PolSAR coherency matrices. We have demonstrated how PolSAR data can be segmented by spectral clustering of coherency matrices. The algorithm improves the classification result of the CPW classifier, while using the same information (derived from the statistics of a single pixel). Performance analysis shows that spectral clustering gives a better initialisation of the Wishart classifier than the H/A/α initialisation, both in terms of classification result and convergence speed. 52/54

Conclusions and Future Work
Further work will concentrate on methods for robust selection of the kernel bandwidth σ, and on studies of the data-adaptive k_eff, in order to develop and verify a fully automatic segmentation algorithm. We will also study how spatial information and information from polarimetric decompositions can be included in the distance measure, to assimilate more prior information in the kernel function. The algorithm will be tested on different data sets. 53/54

Thank you! Stian Normann Anfinsen Computational Earth Observation and Machine Learning Laboratory University of Tromsø URL: http://www.phys.uit.no/ceo-ml/ 54/54