Feature Selection and Extraction; Error Estimation
Maa-57.3210 Data Classification and Modelling in Remote Sensing
Markus Törmä (markus.torma@tkk.fi)

Overview of the classification process
- Measurements
- Preprocessing: remove random and systematic errors
- Feature selection and extraction: decrease the volume of information
  - Feature selection: select the relevant information
  - Feature extraction: compute new features or relations between features
  - Spatial domain: search for important targets in the image
  - Time domain: time series analysis
  - Spectral domain: decrease feature dimensionality
- Data structure: a way to represent the features (image: vector and matrix)
- Recognition: divide the measurements into categories (classes)
  - Supervised: classes are known
  - Unsupervised: classes are unknown
- Important step: quality estimation

Feature selection and extraction in the spectral domain
- Goal: decrease feature dimensionality.
- Why? Theoretically, if an infinite number of training samples is used, adding more features increases accuracy.
- Measurement complexity: k^n, where k is the number of bits and n the number of channels.
- (2 classes; P_c is the a priori probability of one class.)

Hughes phenomenon
- In reality, only a limited number of samples is available.
- As dimensionality increases, accuracy increases at first but then starts to decrease.
- Increasing the number of training samples increases accuracy.
- (2 classes, equal a priori probabilities; m is the number of training samples; the limiting curve corresponds to an infinite number of samples.)
- Why? Theoretically, as dimensionality increases, separability should increase. In reality, a finite number of samples N is used to estimate the class statistics; if the number of samples is fixed, increasing dimensionality results in poorer estimates.
- Combined result:
  - there is an optimum value of measurement complexity
  - adding features does not necessarily lead to better results
- Reference: Landgrebe, Signal Theory Methods in Multispectral Remote Sensing, Wiley, 2003.

Alternatives
- Ad hoc feature extraction and selection, e.g.
  - extraction: vegetation indices such as NDVI, Tasselled Cap transformation
  - selection: based on knowledge about matter-energy interaction
- Feature selection: select the most important features using some criterion function.
- Feature extraction: make a transformation in order to decrease dimensionality.
- Estimate the parameters more carefully or with more samples:
  - estimate the covariance matrix from a small number of samples (Landgrebe ch. 4.8): combine the covariance matrices of the classes with a common covariance matrix
  - statistics enhancement using unlabeled samples (Landgrebe ch. 4.2-4.4): use unlabeled samples as well to estimate the class means and covariances
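As a concrete example of ad hoc feature extraction, the NDVI mentioned above reduces two spectral channels to a single feature. A minimal sketch (the reflectance values in the example are made up for illustration):

```python
import numpy as np

def ndvi(nir, red, eps=1e-12):
    """Normalized Difference Vegetation Index: (NIR - Red) / (NIR + Red).
    eps guards against division by zero on dark pixels."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    return (nir - red) / (nir + red + eps)

# Vegetation reflects strongly in the near-infrared and absorbs red light,
# so vegetated pixels get NDVI values close to +1.
print(ndvi(0.5, 0.1))  # -> 0.666...
```

The same function works element-wise on whole image bands, so two channels collapse into one feature per pixel.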
Feature selection
- Choose the best subset X from the set Y of original measurements, where D > d.
- Maximize some criterion function J(Ξ), where Ξ is any d-dimensional subset of Y.
- In the ideal case we maximize the probability of correct classification.
- There are many combinations: e.g. if D = 20 and d = 10, there are 184756 different subsets.

Feature extraction
- Use the whole pattern vector y.
- y is transformed to a lower-dimensional x = W^T y, where W is a transformation matrix of size D x d.
- The elimination of redundant and insignificant information depends on the transformation matrix.
- Usually W is found using a criterion function; the transformation is usually linear.

Criterion functions
We need to measure how far the classes are from each other.
- Probability of error
  - Good features minimize the probability of error.
  - Theoretically the ideal measure, but in practice difficult to compute and subject to estimation error.
- Interclass distance
  - Compute the mean distance between pattern vectors belonging to classes a and b; a large distance means the classes are separable.
  - Simple, but does not take the probability density functions of the classes into account.
- Probabilistic distance
  - Use knowledge about the probability structure of the classes: the class density functions p(ξ|ω_i) and the a priori probabilities P_i.
  - 2 classes, P_1 = P_2: if p(ξ|ω_1) = 0 wherever p(ξ|ω_2) > 0, the classes are separable; if p(ξ|ω_1) = p(ξ|ω_2), the classes are non-separable.
  - Overlap is measured using the distance between p(ξ|ω_1) and p(ξ|ω_2).
- Probabilistic dependence
  - Measure the dependence between ξ and ω_i as the difference between the class density function p(ξ|ω_i) and the mixture density function p(ξ).
  - If the dependence is large, there is a big difference between the class density function and the mixture density function, and the classes are separable.
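The combinatorial cost quoted above is easy to verify: the number of d-element subsets of D features is the binomial coefficient C(D, d), which is what an exhaustive feature-selection search would have to evaluate.

```python
from math import comb

# Number of d-element subsets of D original features that an
# exhaustive feature-selection search must evaluate.
D, d = 20, 10
print(comb(D, d))  # -> 184756
```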
Entropy
- If the a posteriori probabilities P(ω_i|ξ) are uniformly distributed, separability is poor.
- Measure the deviation of the P(ω_i|ξ) using a criterion function:
  - if the P(ω_i|ξ) are all the same, class separability is poor
  - if the P(ω_i|ξ) vary much, separability is better.
- The variation of the P(ω_i|ξ) can be measured using entropy: small entropy means good separability.
- Large variation of the P(ω_i|ξ): good separability, correct classification likely.

Searching for the optimal feature combination
- Try all possible combinations.
  - Problem: computing time. E.g. D = 20, d = 10 gives 184756 different combinations.
- Choose the best subset by removing or adding small subsets of features to an initial set:
  - bottom-up search: Ξ_0 = empty set; add features until the desired number is reached
  - top-down search: Ξ_0 = all features; remove features until the desired number is reached.

Branch-and-bound
- The only search algorithm that implicitly considers all d-dimensional subsets without a complete search.
- Top-down; backtracking makes it possible to search all combinations implicitly.
- Computationally relatively light due to the efficient search process.
- Generate feature sets Ξ_1 from Ξ_0 by excluding some features, then Ξ_2 from Ξ_1 by excluding some features, and so on; this generates a search tree.
- Monotonicity of the criterion function: class separability does not increase as the number of features decreases, so it is not necessary to investigate feature subsets with a small J().
- E.g. D = 6, d = 2: if some branches have been searched down to level (D - d) and the best value found so far is J(Ξ_{D-d}) = B, then if B > A we do not have to search the subtree under node A.

Best individual features
- The simplest and least reliable method: choose the d individually best features.
- Test each feature independently, rank the features according to the value of the criterion function, and take the d best as the final feature set.

Sequential forward selection (SFS)
- Bottom-up (X_0 = empty set).
- Add one feature at a time; each time choose the feature that maximizes the criterion function.
- E.g. the feature set X_k has k features: rank the remaining features so that the feature set at step k+1 maximizes the criterion.
- There is no mechanism to remove a feature once it has been added.
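The SFS procedure above can be sketched in a few lines. This is a minimal illustration, not the course's reference implementation: the criterion function `J` is assumed to be any callable that scores a feature subset, and the toy additive criterion at the end is invented purely for the demo.

```python
import numpy as np

def sfs(J, D, d):
    """Sequential forward selection: start from the empty set and at each
    step add the single feature that maximizes the criterion J(subset)."""
    selected = []
    remaining = list(range(D))
    while len(selected) < d:
        best = max(remaining, key=lambda f: J(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy criterion (an assumption for illustration): each feature has a fixed
# separability score and the criterion is the sum over the subset. A real
# criterion would measure class separability of the selected features.
scores = np.array([0.1, 0.9, 0.3, 0.7, 0.2])
J = lambda subset: scores[subset].sum()
print(sfs(J, D=5, d=2))  # -> [1, 3]
```

Note that with a purely additive criterion SFS is trivially optimal; the interesting (and risky) cases are criteria where features interact, which is exactly why GSFS and the "add l, take away r" variants below exist.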
Generalized sequential forward selection (GSFS)
- Otherwise the same as SFS, but add more than one feature (r features) at a time.
- Given the feature set X_k, form all feature sets obtained by adding an r-feature combination from the set Y - X_k, and choose the feature set X_{k+r} that maximizes the criterion.
- More reliable than SFS, because dependencies between features are taken into account; computationally heavier.

Sequential backward selection (SBS)
- The top-down version of SFS: the initial feature set is X_0 = Y.
- Remove one feature at a time until D - d features have been removed.
- Given the feature set X_k (k features removed from Y), rank the candidate sets and choose the feature set at step k+1 that maximizes the criterion.

Generalized sequential backward selection (GSBS)
- More than one feature is removed at a time.
- Given the feature set X_k (k features removed), form all feature sets obtained by removing an r-feature combination from X_k, and choose the set that maximizes the criterion.

"Add l, take away r" selection
- Uses backtracking:
  1. Add l features using SFS.
  2. Remove r features using SBS.
  3. If X has d features, stop; otherwise go to step 1.
- If l > r, the search is bottom-up.
- Does not take dependencies between features into account.
- Generalized "add l, take away r" selection: otherwise the same, but uses GSFS and GSBS.

Interclass distance in feature selection and extraction
- Search for x that maximizes the mean distance between feature vectors belonging to different classes.
- δ(ξ_ik, ξ_jl) is the distance between two d-dimensional feature vectors, one from class i and the other from class j.
- Usually the a priori probabilities must be estimated as P_i = n_i / n.
- The optimal feature vector maximizes the resulting criterion function J(ξ).
- Feature selection: J(ξ) is maximized by searching for the best d-dimensional subset of the original D-dimensional feature set.
- Feature extraction: search for the best transformation W, which determines the d-dimensional feature vector ξ. In the latter case every feature vector affects the result by the same amount.
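The interclass-distance criterion above, with priors estimated as P_i = n_i / n, can be sketched as follows. This is a hedged illustration: it uses the Euclidean distance for δ() and averages over all cross-class pairs, and the toy data at the end is synthetic.

```python
import numpy as np

def mean_interclass_distance(classes):
    """Average Euclidean distance between feature vectors drawn from
    different classes, weighted by priors P_i = n_i / n.
    `classes` is a list of (n_i, d) arrays, one array per class."""
    n = sum(len(c) for c in classes)
    J = 0.0
    for i, Xi in enumerate(classes):
        for j, Xj in enumerate(classes):
            if i == j:
                continue
            Pi, Pj = len(Xi) / n, len(Xj) / n
            # All pairwise distances between vectors of class i and class j.
            diff = Xi[:, None, :] - Xj[None, :, :]
            dist = np.sqrt((diff ** 2).sum(axis=2))
            J += 0.5 * Pi * Pj * dist.mean()
    return J

# Well-separated classes should score higher than overlapping ones.
rng = np.random.default_rng(0)
far = [rng.normal(0, 1, (50, 2)), rng.normal(10, 1, (50, 2))]
near = [rng.normal(0, 1, (50, 2)), rng.normal(1, 1, (50, 2))]
print(mean_interclass_distance(far) > mean_interclass_distance(near))  # -> True
```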
Metrics (distance functions)
There are many different ways to measure the distance between two points:
1. Minkowski metric of degree s
2. City-block distance (s = 1)
3. Euclidean distance (s = 2)
4. Chebyshev distance
5. Quadratic distance, where Q is a positive definite scaling matrix; if Q = I, this is the squared Euclidean distance
6. Nonlinear distance, where H and T are parameters and δ() is any distance function.

Mean squared Euclidean distance
- Let m_i be the mean vector of class i and m the mean vector of the mixture density function.
- The mean squared Euclidean distance of the data splits into two terms:
  - if the second term is large, m_i is far away from the mean vectors of the other classes: good
  - if the first term is large, the mean distances between feature vectors belonging to the same class are large: not good.
- We try to maximize the second term and minimize the first term.
- J_1() is not a good measure of class separability: opposite pieces of information influence it in the same direction.

Scatter matrices
- Between-class scatter matrix S_b and within-class scatter matrix S_w.
- We want to minimize tr[S_w] and maximize tr[S_b]; this gives a better criterion function: within-class deviation and the mean distance between classes affect its value, but the covariances do not.
- Make the whitening transform U = S_w^{-1/2}, so that U^T S_w U = I. Now J_2 can be written as J_2 = tr[S_w^{-1} S_b] = Σ_k λ_k, where tr denotes the trace (sum of diagonal elements) and λ_k, k = 1,...,d, are the eigenvalues of the matrix S_w^{-1} S_b.
- We can also use the within-class and between-class scatter. Scatter represents the volume of a set of vectors, i.e. the same thing as deviation:
  - within-class scatter: s_w = S_w
  - between-class scatter: s_b = S_b
  - total scatter: S_T = Σ = S_w + S_b, where Σ is the covariance matrix of the mixture density function.
- Separability is good if the ratio of s_T to s_w is large.
- If we diagonalize Σ and S_w with a matrix U so that U^T Σ U = Λ, J_4 can be written in terms of the diagonal elements λ_j of Λ, the diagonal matrix formed by the eigenvalues of the product S_w^{-1} Σ.
- J_4 is better than J_3: the distances between the classes are more uniformly distributed along the different coordinate axes. When J_3 is used, usually two classes are very separable but the others are not, because one eigenvalue dominates.
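The scatter-matrix criterion above can be sketched numerically. This is a minimal illustration with synthetic data: priors are estimated as P_i = n_i / n, and the criterion computed is tr[S_w^{-1} S_b], the sum of the eigenvalues of S_w^{-1} S_b.

```python
import numpy as np

def scatter_criterion(classes):
    """Within-class scatter S_w, between-class scatter S_b and the
    criterion tr(S_w^{-1} S_b) = sum of eigenvalues of S_w^{-1} S_b.
    `classes` is a list of (n_i, d) sample arrays, one per class."""
    n = sum(len(X) for X in classes)
    priors = [len(X) / n for X in classes]        # P_i = n_i / n
    means = [X.mean(axis=0) for X in classes]
    m0 = sum(P * m for P, m in zip(priors, means))  # mixture mean
    dim = classes[0].shape[1]
    Sw = np.zeros((dim, dim))
    Sb = np.zeros((dim, dim))
    for X, P, m in zip(classes, priors, means):
        Sw += P * np.cov(X, rowvar=False, bias=True)   # within-class part
        Sb += P * np.outer(m - m0, m - m0)             # between-class part
    return np.trace(np.linalg.solve(Sw, Sb))

rng = np.random.default_rng(1)
classes = [rng.normal(0, 1, (100, 3)), rng.normal(3, 1, (100, 3))]
print(scatter_criterion(classes) > 0)  # separated means -> positive value
```

Because Σ = S_w + S_b, the criterion built from the total scatter differs from this one only by a constant shift of each eigenvalue, which is why the two formulations rank feature sets consistently.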
Nonlinear distance
- J_1 and J_4 can be larger for classes 1 and 2 than for classes 1 and 3, although the classes are clearly separable.
- In some cases, when the density functions of the classes overlap completely, these criterion functions get a higher value than when the density functions do not overlap at all: for two separable classes, J_1 - J_4 can give a wrong impression of class separability.
- In general, criterion functions based on the Euclidean distance are good only if the covariance matrices of the classes are the same.
- Idea of the nonlinear distance:
  - If a feature vector of class i is far away from the vectors of the other classes, it is classified correctly; this "safe" distance is T. If the distance is greater than T, the error probability is zero, so we can treat the distance to such vectors as a constant H.
  - If there is a vector belonging to class j near a vector belonging to class i, the error probability is greater than zero; the distance between these vectors is set to zero.
- Nonlinear distance: H and T are parameters and δ() is any distance function; the criterion function is built from this distance.

Probabilistic distance in feature selection
- The error probability is the ideal criterion for feature selection, but it is difficult to implement and suffers from estimation problems.
- Distance functions are easy to implement but perform poorly.
- Probabilistic distances give a more realistic view of class separability.
- Use knowledge about the probability structure of the classes: the class density functions p(ξ|ω_i) and the a priori probabilities P_i.
- Distances for the 2-class case:
  - Chernoff; Bhattacharyya if s = 0.5, e.g. -ln √(0.1 · 0.9) ≈ 1.2 and -ln √(0.5 · 0.5) ≈ 0.7
  - Matusita
  - Divergence
  - Patrick-Fisher
  - Lissack-Fu
  - Error probability
  - Kolmogorov (J_K = J_L if α = 1 in J_L).
- The distances become simpler if parametric distributions, e.g. the normal distribution, are used.
- If Σ_1 = Σ_2 = Σ, then J_M = J_D = 8 J_B = (μ_2 - μ_1)^T Σ^{-1} (μ_2 - μ_1), the Mahalanobis distance.
- More classes: these distances cannot be generalized directly; compute the pairwise distances and sum them, where J_ij is the probabilistic distance between classes i and j.
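For normal densities the Bhattacharyya distance has the well-known closed form with a mean term and a covariance term; a minimal sketch (the means and covariances in the example are made up):

```python
import numpy as np

def bhattacharyya_gaussian(m1, S1, m2, S2):
    """Bhattacharyya distance between two multivariate normal densities:
    (1/8)(m2-m1)^T S^{-1} (m2-m1) + (1/2) ln(det S / sqrt(det S1 det S2)),
    where S = (S1 + S2) / 2."""
    S = (S1 + S2) / 2.0
    dm = m2 - m1
    term1 = dm @ np.linalg.solve(S, dm) / 8.0
    term2 = 0.5 * np.log(np.linalg.det(S) /
                         np.sqrt(np.linalg.det(S1) * np.linalg.det(S2)))
    return term1 + term2

# Equal covariances: 8 * J_B reduces to the Mahalanobis distance.
m1, m2 = np.array([0.0, 0.0]), np.array([2.0, 0.0])
S = np.eye(2)
jb = bhattacharyya_gaussian(m1, S, m2, S)
print(8 * jb)  # -> 4.0, i.e. (m2-m1)^T S^{-1} (m2-m1)
```

With equal covariances the second (covariance) term vanishes, which is exactly the reduction to the Mahalanobis distance stated above.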
Probabilistic dependence in feature selection
- Measure the dependence between ξ and ω_i as the difference between the class density functions and the mixture density function.
- Large dependence: big difference between the class density function and the mixture density function, so the classes are separable.
- If the class distributions are parametric, their mixture density function is NOT parametric, so we do not get simpler forms as we did with the probabilistic distances.
- Measures: Chernoff, Bhattacharyya, Matusita, Joshi (formed from the divergence), Patrick-Fisher, Lissack-Fu, mutual information.
- Use these if there are several classes and they are possible to implement.

Entropy in feature selection
- Separability can also be measured using the a posteriori probabilities P(ω_i|ξ) and the mixture density function p(ξ).
- Measure the "variation" of the a posteriori probabilities using entropy.
- Separability measure based on the Shannon entropy: minimize; called the equivocation.
- Separability measure based on the quadratic entropy: maximize; called the Bayes distance.

Karhunen-Loève transformation
- Preserve the information in compressed form so that the compression error is as small as possible: decorrelate the elements of the feature vectors y and remove the coordinate axes whose information content is small.
- Represent the D-dimensional pattern vectors y using orthonormal basis vectors u_j, where the a_j are the coefficients of the basis vectors; approximate y using d terms.
- Choose the basis vectors so that the squared error between the pattern vectors and their approximations is minimized. The approximation error is the sum of the squared coefficients that were not included in the approximation.
- The optimal basis vectors (the axes of the new coordinate system) are the eigenvectors of the matrix Ψ representing the data, with corresponding eigenvalues λ_i.
- Minimize the compression error by selecting the eigenvectors with the largest eigenvalues for the transformation matrix.
- Feature extraction:
  - y: D-dimensional original pattern vector; x: d-dimensional transformed pattern vector; W: transformation matrix.
  - W is obtained by solving the eigenvalue problem of A, the matrix representing the data: U is the matrix containing the basis vectors of the K-L transformation (the eigenvectors of A) and Λ is the diagonal matrix formed by the eigenvalues of A.
  - W = [u_1, ..., u_d], the eigenvectors corresponding to the d largest eigenvalues.
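The eigenvector recipe above can be sketched directly. This is a minimal illustration, assuming (as in the PCA case discussed next) that the matrix A representing the data is the covariance matrix of the whole dataset; the test data is synthetic.

```python
import numpy as np

def kl_transform(Y, d):
    """Karhunen-Loeve transform with A = data covariance matrix:
    keep the d eigenvectors with the largest eigenvalues and project
    the centred data onto them."""
    mean = Y.mean(axis=0)
    A = np.cov(Y - mean, rowvar=False)       # matrix representing the data
    eigvals, eigvecs = np.linalg.eigh(A)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # largest eigenvalues first
    W = eigvecs[:, order[:d]]                # D x d transformation matrix
    return (Y - mean) @ W, W

rng = np.random.default_rng(2)
# 3-D data that varies mostly along a single direction, plus small noise:
Y = rng.normal(0, 1, (200, 1)) @ np.array([[1.0, 2.0, 3.0]]) \
    + 0.1 * rng.normal(0, 1, (200, 3))
X, W = kl_transform(Y, d=1)
print(X.shape)  # -> (200, 1): one component captures almost all variance
```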
Karhunen-Loève transformation: how to compute A?
- Depends on the desired properties of the transformation.
- To maximize the separability of the class mean vectors: use the between-class scatter matrix S_b. A more optimal matrix can be obtained by making a "whitening" transformation for the between-class scatter matrix.
- If the mean vectors of the classes are almost the same but the covariances differ: use the covariance matrices of the classes.
- If there is no information about the class means and covariances: use the covariance matrix of the whole dataset. This is called principal component analysis.

Estimation of classification error
- E, the error rate = mean probability of error.
- Analytical solution: VERY rare because of estimation errors and assumptions.
- Empirical testing of the classifier:
  - training set and test set: train on the training set, classify the test set and count the proportion of errors; how well does this generalize?
  - relations between E (error rate) and A (acceptance rate), e.g. for the k-NN rule; also based on a training set and a test set.
- Probabilistic distance: compute upper and lower limits for the error.
  - We can approximate the Bayes error E* by computing upper and lower limits.
  - Limit using the Mahalanobis distance: Δ = (μ_1 - μ_2)^T Σ^{-1} (μ_1 - μ_2), where Σ = P_1 Σ_1 + P_2 Σ_2.
  - Bhattacharyya distance: upper and lower limits.
  - Divergence.

Error counting
- Resubstitution estimation: test set = training set; optimistically biased.
- Holdout estimation: divide the data 50% into a training set and 50% into a test set; pessimistically biased.
- Leave-one-out estimation: one vector is the test set and the others form the training set; repeat until every vector has served as the test set. Unbiased, but large variance.
- Rotation estimation: divide the data into n parts, classify one part as the test set, repeat until every part has served as the test set.
- Bootstrap estimation: generate "new" pattern vectors from the old ones.

Risks of the resubstitution error estimation method (training set = test set)
- Example I: 2 classes, P_1 = P_2, p(x|ω_1) = p(x|ω_2); linear discriminant function trained with the perceptron learning algorithm. If n < 2d, it is VERY likely that we can find a hyperplane separating the classes, so the estimated error probability is 0. In reality, the error probability is 0.5.
- Example II: 1-NN rule; the estimated error probability is always 0 if x_i ≠ x_j for all i ≠ j.

Relations between E and R/A
- Use the relation between E (error rate) and R (rejection rate) or A (acceptance rate), e.g. for the k,l-NN classifier: estimate the acceptance rates and use them to estimate the error.
- Advantages:
  - smaller variance than error counting methods
  - the classification of the test set can be unknown
  - less labelled data is needed, avoiding the errors of the labelling process.
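The leave-one-out scheme above can be sketched as follows. This is a hedged illustration, not a course-specified implementation: the nearest-mean classifier and the synthetic two-class data are assumptions made purely to have something to cross-validate.

```python
import numpy as np

def loo_error(X, y, classify):
    """Leave-one-out error estimate: classify each vector with a rule
    trained on all the other vectors, and count the proportion of errors."""
    errors = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i          # leave vector i out
        errors += classify(X[mask], y[mask], X[i]) != y[i]
    return errors / len(X)

def nearest_mean(Xtr, ytr, x):
    """Toy classifier (an assumption): assign x to the class whose
    training mean is closest in Euclidean distance."""
    labels = np.unique(ytr)
    means = np.array([Xtr[ytr == c].mean(axis=0) for c in labels])
    return labels[np.argmin(((means - x) ** 2).sum(axis=1))]

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(5, 1, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
print(loo_error(X, y, nearest_mean))  # well-separated classes -> near 0
```

The same `loo_error` function turns into resubstitution estimation by classifying each vector with a rule trained on the full data, and into rotation (n-fold) estimation by leaving out blocks instead of single vectors.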
Mean risk
- 2 classes whose density functions and a priori probabilities are known.
- Compute the conditional probability of error for each pattern vector x; the estimate of the mean risk is its average over the test vectors.
- NOTE: the class of x can be unknown.
- Variance of the estimate: the variance of the mean risk method is smaller than that of error counting, because in error counting the error is quantized to the values {0, 1}, while here the error lies in the interval [0, 0.5].

References
Heikkilä, Törmä: Tilastollinen hahmontunnistus (Statistical Pattern Recognition), TKK, Laboratory of Photogrammetry and Remote Sensing.
Landgrebe: Signal Theory Methods in Multispectral Remote Sensing, Wiley, 2003.