Feature selection and extraction, Error estimation
Maa-57.3210 Data Classification and Modelling in Remote Sensing
Markus Törmä, markus.torma@tkk.fi

Measurements
Preprocessing: remove random and systematic errors
Feature selection and extraction: decrease the volume of information
- Feature selection: select the relevant information
- Feature extraction: compute new features or relations between features
- Spatial domain: search for important targets in the image
- Time domain: time series analysis
- Spectral domain: decrease feature dimensionality
Data structure: way to represent the features
- Image: vector & matrix
Recognition: divide measurements into categories (classes)
- Supervised: classes are known
- Unsupervised: classes are unknown
Important step: quality estimation

Feature selection and extraction in the spectral domain
Try to decrease feature dimensionality. Why?
- Theoretically, if an infinite number of training samples is used, adding more features increases accuracy: as dimensionality increases, separability would increase
- Measurement complexity: k^n, where k is the number of discrete values per channel (set by the number of bits) and n the number of channels
(Figure: classification accuracy vs. measurement complexity for 2 classes, P_c = a priori probability of one class. Landgrebe: Signal Theory Methods in Multispectral Remote Sensing, Wiley, 2003)

Hughes phenomenon
- In reality, only a limited number of samples is available
- As the dimension increases, accuracy increases at first but then starts to decrease
- Increasing the number of training samples increases accuracy
(Figure: 2 classes with equal a priori probabilities, m = number of training samples, with a curve for an infinite number of samples. Landgrebe)
- A finite number of samples N is used to estimate the class statistics; if the number of samples is fixed, increasing the dimensionality results in poorer estimates
- Combined result:
  - there is an optimum value of measurement complexity
  - adding features does not necessarily lead to better results
  (a small simulation sketch is included at the end of this section)

Alternatives
Ad hoc feature extraction & selection, e.g.
- Extraction: vegetation indices like NDVI, Tasselled Cap transformation
- Selection: knowledge about matter-energy interaction
Feature selection
- Select the most important features using some criterion function
Feature extraction
- Make a transformation in order to decrease dimensionality
Estimate parameters more carefully or using more samples
- Estimate the covariance matrix from a small number of samples (Landgrebe ch. 4.8): combine the class covariance matrices and a common covariance matrix
- Statistics enhancement using unlabeled samples (Landgrebe ch. 4.2-4.4): use unlabeled samples as well to estimate class means and covariances
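A minimal simulation sketch of the Hughes phenomenon, added as an illustration and not part of the original course material (the data, classifier, regularization and sample sizes below are assumptions): with a fixed training set, a quadratic Gaussian classifier first gains and then tends to lose test accuracy as noise features are added.

```python
# Illustrative sketch of the Hughes phenomenon (assumed example):
# with a fixed number of training samples, test accuracy first improves and
# then degrades as more (noisy) features are added.
import numpy as np
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_per_class, max_dim = 30, 40

def make_data(n, dim):
    # two Gaussian classes that differ only in the first 5 features;
    # the remaining features are pure noise
    mean_shift = np.zeros(dim)
    mean_shift[:5] = 1.0
    x0 = rng.normal(0.0, 1.0, size=(n, dim))
    x1 = rng.normal(0.0, 1.0, size=(n, dim)) + mean_shift
    return np.vstack([x0, x1]), np.array([0] * n + [1] * n)

X, y = make_data(200, max_dim)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=2 * n_per_class, stratify=y, random_state=0)

for d in (2, 5, 10, 20, 40):                    # increasing dimensionality
    clf = QuadraticDiscriminantAnalysis(reg_param=1e-3)
    clf.fit(X_tr[:, :d], y_tr)
    print(f"d = {d:2d}  test accuracy = {clf.score(X_te[:, :d], y_te):.3f}")
```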

Feature selection
- Choose the best subset X of d features from the set Y of the D original measurements (D > d)
- Maximize some criterion function J: J(X) = max J(Ξ), where Ξ is any d-dimensional subset of Y
- In the ideal case we would maximize the probability of correct classification
- Many combinations: if D = 20 and d = 10, there are C(20,10) = 184756 different subsets (see the exhaustive-search sketch below)

Feature extraction
- Use the whole pattern vector y
- y is transformed to a lower-dimensional x: x = W^T y, where W is the transformation matrix (size D x d)
- Elimination of redundant and insignificant information depends on the transformation matrix
- Usually W is found by optimizing a criterion function
- The transformation is usually linear

Both need a measure of how far the classes are from each other:

Probability of error
- Good features minimize the probability of error
- Theoretically the ideal measure
- In practice difficult to compute; estimation error

Interclass distance
- Compute the mean distance between pattern vectors belonging to classes a and b
- Large distance: classes are separable
- Simple, but does not take the class probability density functions into account

Probabilistic distance
- Use knowledge about the probability structure of the classes: class density functions p(ξ|ω_i), a priori probabilities P_i
- 2 classes, P_1 = P_2:
  - if p(ξ|ω_1) = 0 wherever p(ξ|ω_2) ≠ 0, the classes are separable
  - if p(ξ|ω_1) = p(ξ|ω_2), the classes are non-separable
- Overlap is measured using a distance between p(ξ|ω_1) and p(ξ|ω_2)

Probabilistic dependence
- Measure the dependence between ξ and ω_i as the difference between the class density functions p(ξ|ω_i) and the mixture density function p(ξ)
- If the dependence is large, the class density function differs much from the mixture density function and the classes are separable
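An added illustration of exhaustive feature selection (the toy data and the simple interclass-distance criterion below are assumptions, not from the course material): every d-feature subset is scored with the criterion function and the best one is kept. This is feasible only for small D.

```python
# Exhaustive feature selection sketch (assumed example): evaluate a criterion J
# for every d-feature subset of Y and keep the best; C(20,10) already has
# 184756 subsets, so this does not scale.
from itertools import combinations
from math import comb
import numpy as np

def interclass_distance(X, y, subset):
    """J = mean squared Euclidean distance between the class mean vectors."""
    Xs = X[:, list(subset)]
    means = [Xs[y == c].mean(axis=0) for c in np.unique(y)]
    dists = [np.sum((ma - mb) ** 2)
             for i, ma in enumerate(means) for mb in means[i + 1:]]
    return np.mean(dists)

def exhaustive_selection(X, y, d):
    D = X.shape[1]
    print(f"evaluating C({D},{d}) = {comb(D, d)} subsets")
    return max(combinations(range(D), d),
               key=lambda s: interclass_distance(X, y, s))

# toy data: 40 samples, 8 features, 2 classes
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 8))
y = np.repeat([0, 1], 20)
X[y == 1, :3] += 2.0        # only the first 3 features carry class information
print("best subset:", exhaustive_selection(X, y, 3))
```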

Entropy
- If the a posteriori probabilities P(ω_i|ξ) are uniformly distributed, separability is poor
- Measure the deviation of the P(ω_i|ξ) using a criterion function:
  - P(ω_i|ξ) all the same: class separability is poor
  - P(ω_i|ξ) vary much: better separability
- The variation of the P(ω_i|ξ) can be measured using entropy
- Small entropy: large variation of the P(ω_i|ξ), good separability, correct classification is likely

Search for the optimal feature combination
- Try all possible combinations
  - Problem: computing time, e.g. D = 20, d = 10 gives 184756 different combinations
- Choose the best subset by removing or adding small subsets of features to the original set
  - Bottom-up search: initial set Ξ_0 = empty set, add features until the desired number is reached
  - Top-down search: initial set Ξ_0 = all features, remove features until the desired number is reached

Branch-and-bound
- The only search algorithm that implicitly examines all d-dimensional subsets without a complete search
- Top-down; backtracking makes it possible to examine all combinations implicitly
- Computationally relatively light due to the efficient search process
- Monotonicity of the generated feature sets:
  - generate feature sets Ξ_1 from Ξ_0 by excluding some features, feature sets Ξ_2 from Ξ_1 by excluding some features, and so on; this generates a search tree
  - class separability does not increase as the number of features decreases, so it is not necessary to investigate feature subsets with small J()
  - e.g. D = 6, d = 2 (search-tree figure): if some branches have been searched down to level (D - d) and the best value found there is J(Ξ_{D-d}) = B, then if B is greater than the criterion value at a node A, we do not have to search the subtree under node A

Best features
- The simplest and most unreliable method
- Choose the d independently best features: test each feature independently, order the features according to the value of the criterion function, and take the d best as the final feature set

Sequential forward selection (SFS)
- Bottom-up (X_0 = empty set)
- Add one feature at a time to the feature set; each time, choose the feature that maximizes the criterion function
- E.g. feature set X_k has k features: rank the left-over features and add the best one, which gives the feature set at step k+1
- No mechanism to remove features (a code sketch of SFS is given below)
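An added sketch of sequential forward selection; the interclass-distance criterion used as J() here is an assumption, and any criterion function could be plugged in instead.

```python
# Sequential forward selection sketch (assumed example): greedily add the
# single feature that maximizes the criterion function J.
import numpy as np

def J_interclass(X, y, subset):
    """Criterion: squared Euclidean distance between the two class means."""
    Xs = X[:, list(subset)]
    m0 = Xs[y == 0].mean(axis=0)
    m1 = Xs[y == 1].mean(axis=0)
    return float(np.sum((m0 - m1) ** 2))

def sfs(X, y, d, J):
    selected, remaining = [], set(range(X.shape[1]))
    while len(selected) < d:
        # choose the left-over feature that maximizes J when added
        best = max(remaining, key=lambda f: J(X, y, selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(2)
X = rng.normal(size=(60, 10))
y = np.repeat([0, 1], 30)
X[y == 1, :2] += 1.5                    # informative features: 0 and 1
print("SFS picked:", sfs(X, y, 3, J_interclass))
```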

Generalized sequential forward selection (GSFS)
- Otherwise the same as SFS, but add more than one feature (r features) at a time
- From feature set X_k, form all feature sets obtained by adding an r-feature combination from the set Y - X_k, and choose the feature set X_{k+r} that maximizes the criterion function
- More reliable than SFS, because dependencies between features are taken into account
- Computationally heavier

Sequential backward selection (SBS)
- Top-down version of SFS
- Initial feature set X_0 = Y
- Remove one feature at a time until D - d features have been removed
- Feature set X_k: k features have been removed from Y; rank the candidate sets and the best one becomes the feature set at step k+1

Generalized sequential backward selection (GSBS)
- More than one feature is removed at a time
- From feature set X_k (k features removed), form all feature sets obtained by removing an r-feature combination from X_k, and choose the feature set that maximizes the criterion function

"Add-l, take-r" selection (sketched in code below)
- Uses backtracking:
  1. Add l features using SFS
  2. Remove r features using SBS
  3. If X has d features, stop, else go to 1
- If l > r, the search is bottom-up
- Does not take dependencies between features into account

Generalized "add-l, take-r" selection
- Otherwise the same as the previous, but uses GSFS and GSBS

Interclass distance in feature selection and extraction
- Search for x which maximizes the mean distance between feature vectors belonging to different classes; δ(ξ_ik, ξ_jl) is the distance between two d-dimensional feature vectors, one from class i and the other from class j
- Usually the a priori probabilities must be estimated as P_i = n_i / n, and the criterion function becomes
  J(ξ) = 1/2 Σ_i Σ_j P_i P_j [1/(n_i n_j)] Σ_k Σ_l δ(ξ_ik, ξ_jl)
- The optimal feature vector is the one that maximizes this criterion:
  - Feature selection: J(ξ) is maximized by searching for the best d-dimensional subset of the original D-dimensional feature set
  - Feature extraction: search for the best transformation W, which determines the d-dimensional feature vector ξ
- In the latter, each feature vector affects the result by the same amount
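An added sketch of the "add-l, take-r" strategy with the default l = 2, r = 1 (the toy criterion and data are assumptions): l forward steps followed by r backward steps, repeated until d features remain.

```python
# "Add-l, take-r" selection sketch (assumed example).
import numpy as np

def J(X, y, subset):
    """Squared Euclidean distance between the two class means (toy criterion)."""
    Xs = X[:, list(subset)]
    return float(np.sum((Xs[y == 0].mean(0) - Xs[y == 1].mean(0)) ** 2))

def add_l_take_r(X, y, d, l=2, r=1):
    assert l > r, "l > r gives a bottom-up search"
    selected, remaining = [], set(range(X.shape[1]))
    while len(selected) != d:
        for _ in range(l):                      # 1. add l features (SFS steps)
            f = max(remaining, key=lambda g: J(X, y, selected + [g]))
            selected.append(f)
            remaining.remove(f)
        for _ in range(r):                      # 2. remove r features (SBS steps)
            f = max(selected,
                    key=lambda g: J(X, y, [s for s in selected if s != g]))
            selected.remove(f)
            remaining.add(f)
    return selected

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 10))
y = np.repeat([0, 1], 40)
X[40:, [0, 3, 7]] += 1.0                        # class 1 rows: informative features
print("selected:", sorted(add_l_take_r(X, y, d=3)))
```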

Metrics, distance functions
There are many different ways to measure the distance between two points:
1. s-degree Minkowski metric: δ_s(x, y) = [ Σ_k |x_k - y_k|^s ]^(1/s)
2. City-block distance (s = 1)
3. Euclidean distance (s = 2)
4. Chebyshev distance: δ(x, y) = max_k |x_k - y_k|
5. Quadratic distance: δ_Q(x, y) = (x - y)^T Q (x - y), where Q is a positive definite scaling matrix; if Q = I, this is the same as the squared Euclidean distance
6. Nonlinear distance: δ_N(x, y) = H if δ(x, y) > T, 0 otherwise, where H and T are parameters and δ() is any distance function

Mean vector for class i: m_i = (1/n_i) Σ_k ξ_ik
Mean vector of the mixture density function: m = Σ_i P_i m_i
The mean squared Euclidean distance of the data (J_1) splits into two terms: the first measures the mean distance of the feature vectors from their own class mean, the second the distance of the class means from the means of the other classes
- If the second term is large, m_i is far away from the mean vectors of the other classes: good
- If the first term is large, the mean distances between feature vectors belonging to the same class are large: not good
- We try to maximize the second term and minimize the first term
- J_1() is not a good measure of class separability: opposite information influences the value in the same direction

Between-class scatter matrix: S_b = Σ_i P_i (m_i - m)(m_i - m)^T
Within-class scatter matrix: S_w = Σ_i P_i Σ_i, where Σ_i is the covariance matrix of class i
The criterion function becomes one where we want to minimize tr[S_w] and maximize tr[S_b]
A better criterion function: the within-class deviation and the mean distance between classes affect the criterion value, but the covariances do not, so make a transform U for which U^T S_w U = I, i.e. U = S_w^(-1/2)
Now J_2 becomes J_2 = tr[S_w^(-1) S_b] = Σ_k λ_k, where tr is the trace (sum of the diagonal elements) and λ_k, k = 1,...,d, are the eigenvalues of the matrix S_w^(-1) S_b

We can also use the within-class scatter and the between-class scatter (see the sketch below)
- Scatter represents the volume of a set of vectors, i.e. the same information as the deviation
- Within-class scatter s_w = |S_w|, between-class scatter s_b = |S_b|, total scatter s_T = |Σ|, where Σ = S_w + S_b is the covariance matrix of the mixture density function
- Separability is good if the ratio of s_T to s_w is large
- If we diagonalize Σ and S_w with a matrix U so that U^T Σ U = Λ (and U^T S_w U = I), J_4 becomes J_4 = Π_j λ_j, where Λ is the diagonal matrix formed by the eigenvalues of the product S_w^(-1) Σ and the λ_j are the diagonal elements of Λ
- J_4 is better than J_3: the distances between the classes are more uniformly distributed along the different coordinate axes; when J_3 is used, usually two classes are very separable but the others are not (one eigenvalue dominates)
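An added sketch showing how the scatter matrices and the trace- and determinant-based separability measures above could be computed; the toy data is an assumption.

```python
# Scatter-matrix criteria sketch (assumed example): within-class scatter S_w,
# between-class scatter S_b, and the trace- and determinant-based measures.
import numpy as np

def scatter_criteria(X, y):
    classes, counts = np.unique(y, return_counts=True)
    P = counts / len(y)                         # a priori estimates P_i = n_i / n
    m = X.mean(axis=0)                          # mixture mean
    d = X.shape[1]
    S_w, S_b = np.zeros((d, d)), np.zeros((d, d))
    for c, p in zip(classes, P):
        Xc = X[y == c]
        m_c = Xc.mean(axis=0)
        S_w += p * np.cov(Xc, rowvar=False, bias=True)
        S_b += p * np.outer(m_c - m, m_c - m)
    Sigma = S_w + S_b                           # covariance of the mixture density
    J_trace = np.trace(np.linalg.solve(S_w, S_b))      # tr[S_w^-1 S_b] = sum of eigenvalues
    J_det = np.linalg.det(Sigma) / np.linalg.det(S_w)  # |Sigma| / |S_w| = product of eigenvalues
    return J_trace, J_det

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(2, 1, (50, 3))])
y = np.repeat([0, 1], 50)
print(scatter_criteria(X, y))
```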

Nonlinear distance
(Figure: two separable classes for which J_1 - J_4 give a wrong impression of the class separability: J_1 and J_4 are larger for classes 1 and 2 than for classes 1 and 3, although the classes are clearly separable)
- In some cases, when the density functions of the classes overlap completely, these criterion functions get a higher value than when the density functions do not overlap
- In general, criterion functions based on the Euclidean distance are good only if the covariance matrices of the classes are the same
- If a feature vector of class i is far away from the other classes, the vector is classified correctly; this "safe" distance is T
  - if the distance is greater than T, the error probability is zero, and we can treat the distance to these vectors as a constant H
  - if there is a vector belonging to class j near a vector belonging to class i, the error probability is higher than zero, and the distance between these vectors is set to zero
- Nonlinear distance: δ_N(x, y) = H if δ(x, y) > T, 0 otherwise, where H and T are parameters and δ() is any distance function
- Criterion function: use δ_N in place of δ in the interclass-distance criterion J(ξ)

Probabilistic distance in feature selection
- The error probability is the ideal criterion for feature selection, but it is difficult to implement (estimation problems)
- A distance function is easy to implement, but gives low performance
- A probabilistic distance gives a more realistic view of class separability
- Use knowledge about the probability structure of the classes: class density functions p(ξ|ω_i), a priori probabilities P_i
- Distances for the 2-class case:
  - Chernoff (reduces to the Bhattacharyya distance for s = 0.5)
  - Bhattacharyya, e.g. -ln √(0.1 · 0.9) ≈ 1.2 versus -ln √(0.5 · 0.5) ≈ 0.7
  - Matusita
  - Divergence
  - Patrick-Fisher
  - Lissack-Fu
  - Error probability
  - Kolmogorov (J_K = J_L, if α = 1 in J_L)
- The distances become simpler if parametric distributions, e.g. the normal distribution, are used (see the sketch below)
- If Σ_1 = Σ_2 = Σ: J_M = J_D = 8 J_B = (μ_2 - μ_1)^T Σ^(-1) (μ_2 - μ_1), the Mahalanobis distance
- More classes: these distances cannot be generalized directly; compute the pairwise distances J_ij between classes i and j and sum them all
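An added sketch of the parametric (normal-distribution) form of the Bhattacharyya distance; the class parameters below are assumptions. With equal covariance matrices, 8 J_B reduces to the Mahalanobis distance, as stated above.

```python
# Bhattacharyya distance between two normal distributions (assumed example).
import numpy as np

def bhattacharyya_gaussian(mu1, cov1, mu2, cov2):
    cov = 0.5 * (cov1 + cov2)
    diff = mu2 - mu1
    term_mean = 0.125 * diff @ np.linalg.solve(cov, diff)
    term_cov = 0.5 * np.log(np.linalg.det(cov) /
                            np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return term_mean + term_cov

mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
cov1 = np.eye(2)
cov2 = np.array([[2.0, 0.3], [0.3, 1.0]])
print("J_B =", bhattacharyya_gaussian(mu1, cov1, mu2, cov2))

# With equal covariances, 8 * J_B equals the Mahalanobis distance:
print("8*J_B (equal cov) =", 8 * bhattacharyya_gaussian(mu1, cov1, mu2, cov1))
print("Mahalanobis       =", (mu2 - mu1) @ np.linalg.solve(cov1, mu2 - mu1))
```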

Probabilistic dependence in feature selection
- Measure the dependence between ξ and ω_i as the difference between the class density functions and the mixture density function
- Large dependence: a big difference between the class density function and the mixture density function, so the classes are separable
- If the class distributions are parametric, their mixture density function is NOT parametric, so we do not get simpler forms as with the probabilistic distances
- Dependence measures: Chernoff, Bhattacharyya, Matusita, Joshi (formed from the divergence), Patrick-Fisher, Lissack-Fu, mutual information
- Use these if there are several classes and they are possible to implement

Entropy in feature selection
- Separability can also be measured using the a posteriori probabilities P(ω_i|ξ) and the mixture density function p(ξ)
- Measure the "variation" of the a posteriori probabilities using entropy
- Separability measure based on the Shannon entropy: minimize; called the equivocation
- Separability measure based on the quadratic entropy: maximize; called the Bayes distance

Karhunen-Loève transformation
- Preserve the information in compressed form with as small a compression error as possible: decorrelate the elements of the feature vectors y and remove those coordinate axes whose information content is small
- Represent the D-dimensional pattern vectors y using orthonormal basis vectors u_j: y = Σ_j a_j u_j, where the a_j are the coefficients of the basis vectors u_j
- Approximate y using d terms
- Choose the basis vectors so that the squared error between the pattern vectors and their approximations is minimized
- The approximation error is the sum of the squared coefficients that were not included in the approximation
- The optimal basis vectors (= the axes of the new coordinate system) are the eigenvectors of the matrix Ψ representing the data, and the λ_i are the corresponding eigenvalues
- Minimize the compression error by selecting the eigenvectors with the largest eigenvalues for the transformation matrix
- Feature extraction using the K-L transformation: x = W^T y, where
  - y: D-dimensional original pattern vectors
  - x: d-dimensional transformed pattern vectors
  - W: transformation matrix
- W is obtained by solving the eigenvalue problem A U = U Λ, where
  - A is the matrix representing the data
  - U is the matrix containing the basis vectors of the K-L transformation (= the eigenvectors of A)
  - Λ is the diagonal matrix formed by the eigenvalues of A
- W = [u_1, ..., u_d], the eigenvectors corresponding to the d largest eigenvalues (see the sketch below)
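An added sketch of the Karhunen-Loève transform with A taken as the covariance matrix of the whole data set, i.e. the Principal Component Analysis case discussed below; the toy data is an assumption.

```python
# Karhunen-Loève / PCA sketch (assumed example): project the pattern vectors
# onto the d eigenvectors of the data covariance matrix with the largest
# eigenvalues.
import numpy as np

def kl_transform(Y, d):
    """Y is n x D; returns the n x d transformed vectors and the eigenvalues."""
    Y_centered = Y - Y.mean(axis=0)
    A = np.cov(Y_centered, rowvar=False)        # matrix representing the data
    eigvals, eigvecs = np.linalg.eigh(A)        # A U = U Lambda (A symmetric)
    order = np.argsort(eigvals)[::-1]           # largest eigenvalues first
    W = eigvecs[:, order[:d]]                   # W = [u_1, ..., u_d]
    return Y_centered @ W, eigvals[order]

rng = np.random.default_rng(5)
# correlated 3-D data whose variance lies mostly in one direction
Y = rng.normal(size=(200, 3)) @ np.array([[3, 0, 0], [1, 1, 0], [0.2, 0.1, 0.3]])
X, eigvals = kl_transform(Y, d=2)
print("eigenvalues:", np.round(eigvals, 2))
print("compressed shape:", X.shape)
```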

Karhunen-Loève transformation: how to compute A?
Depends on the desired properties of the transformation:
- Maximize the separability of the class mean vectors: use the between-class scatter matrix S_b; a more optimal matrix can be obtained by making a "whitening" transformation for the between-class scatter matrix
- If the mean vectors of the classes are almost the same but the covariances differ: use the covariance matrices of the classes
- No information about class means and covariances: use the covariance matrix of the whole data set; this is called "Principal Component Analysis"

Estimation of the classification error
E, error rate = mean probability of error
- An analytical solution is VERY rare (estimation errors, assumptions)
- Empirical testing of the classifier:
  - Training set & test set: classify the test set with a classifier built from the training set and count the proportion of errors. Generalization?
  - Relations between E (error rate) and A (acceptance rate), e.g. k-NNR: also based on a training set and a test set
- Probabilistic distance: compute upper and lower limits for the error
  - We can approximate E* by computing upper and lower limits
  - Limit using the Mahalanobis distance: Δ = (μ_1 - μ_2)^T Σ^(-1) (μ_1 - μ_2), Σ = P_1 Σ_1 + P_2 Σ_2
  - Bhattacharyya distance: upper and lower limits
  - Divergence

Error counting (compared numerically in the sketch below)
- Resubstitution estimation: test set = training set; optimistically biased
- Holdout estimation: divide the data 50% to the training set, 50% to the test set; pessimistically biased
- Leave-one-out estimation: one vector as the test set, the others as the training set, repeated until every vector has been in the test set; unbiased, large variance
- Rotation estimation: divide the data into n parts, classify one part as the test set, repeat until every part has been the test set
- Bootstrap estimation: generate "new" pattern vectors from the old ones

Risks of the resubstitution error estimation method (training set = test set)
- Example I: 2 classes, P_1 = P_2, p(x|ω_1) = p(x|ω_2); a linear discriminant function with the perceptron learning algorithm. If n < 2d, it is VERY likely that we can find a hyperplane separating the classes, so the estimated error probability = 0; in reality, the error probability = 0.5
- Example II: 1-NNR; the estimated error probability is always 0 if x_i ≠ x_j for all i ≠ j

Relations between E and R/A
- Use the relation between E (error rate) and R (rejection rate) or A (acceptance rate)
- E.g. the (k,l)-NN classifier: estimate the acceptance rates and use them to estimate the error
- Advantages:
  - smaller variance than the error counting methods
  - the classification of the test set can be unknown
  - less labelled data is needed, and errors of the labelling process are avoided
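An added sketch comparing two of the error-counting estimators above, resubstitution and leave-one-out, on a small data set (the classifier and data are assumptions); resubstitution typically reports the lower, optimistically biased error.

```python
# Error-counting comparison sketch (assumed example): resubstitution
# (test set = training set) versus leave-one-out estimation.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(6)
n, d = 40, 10                                   # few samples, many features
X = np.vstack([rng.normal(0.0, 1.0, (n // 2, d)),
               rng.normal(0.5, 1.0, (n // 2, d))])
y = np.repeat([0, 1], n // 2)

# Resubstitution: optimistically biased
clf = LinearDiscriminantAnalysis().fit(X, y)
resub_error = 1.0 - clf.score(X, y)

# Leave-one-out: (nearly) unbiased, but large variance
loo_acc = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=LeaveOneOut())
loo_error = 1.0 - loo_acc.mean()

print(f"resubstitution error: {resub_error:.3f}")
print(f"leave-one-out error:  {loo_error:.3f}")
```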

Mean risk
- 2 classes, their density functions and a priori probabilities are known
- Conditional probability of error for a pattern vector x: r(x) = min[ P(ω_1|x), P(ω_2|x) ]
- Estimate the mean risk as the average of r(x) over the test vectors (see the sketch below)
- NOTE: the class of x can be unknown

Variance of the estimates
- The variance of the mean risk method is smaller than that of the error counting estimate, because in error counting the error is quantized to the values {0, 1}, whereas here the conditional error lies in [0, 0.5]

References
Heikkilä, Törmä: Tilastollinen hahmotunnistus. TKK, Fotogrammetrian ja kaukokartoituksen laboratorio.
Landgrebe: Signal Theory Methods in Multispectral Remote Sensing. Wiley, 2003.
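An added sketch of the mean risk estimator for two known normal densities (the class parameters and sample sizes are assumptions): the conditional error r(x) = min_i P(ω_i|x) is averaged over test vectors whose classes need not be known.

```python
# Mean risk estimation sketch (assumed example) for two known Gaussian classes.
import numpy as np
from scipy.stats import multivariate_normal

P1, P2 = 0.5, 0.5
pdf1 = multivariate_normal(mean=[0, 0], cov=np.eye(2))
pdf2 = multivariate_normal(mean=[2, 0], cov=np.eye(2))

def conditional_error(x):
    p1 = P1 * pdf1.pdf(x)
    p2 = P2 * pdf2.pdf(x)
    post1 = p1 / (p1 + p2)                      # P(w_1 | x)
    return np.minimum(post1, 1.0 - post1)       # r(x) = min of the posteriors

# unlabeled test vectors drawn from the mixture density
X = np.vstack([pdf1.rvs(100, random_state=1), pdf2.rvs(100, random_state=2)])
R_hat = conditional_error(X).mean()             # estimate of the mean risk
print(f"estimated mean risk: {R_hat:.3f}")
```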