Feature selection and extraction, error estimation
Data Classification and Modelling in Remote Sensing
Markus Törmä


Overview: processing chain
- Measurements
- Preprocessing: remove random and systematic errors
- Feature selection and extraction: decrease the volume of information
  - Feature selection: select the relevant information
  - Feature extraction: compute new features or relations between features
  - Spatial domain: search for important targets in the image
  - Time domain: time series analysis
  - Spectral domain: decrease feature dimensionality
- Data structure: the way features are represented (image: vector and matrix)
- Recognition: divide measurements into categories (classes)
  - Supervised: classes are known
  - Unsupervised: classes are unknown
- Important step: quality estimation

Feature selection and extraction in the spectral domain
- Goal: decrease feature dimensionality
- Why? Theoretically, if an infinite number of training samples is used, adding more features increases accuracy
- Measurement complexity: k^n, where k is the number of bits per channel and n the number of channels
- Two-class case, P_c: a priori probability of one class

Hughes phenomenon
- In reality only a limited number of training samples is available
- As dimensionality increases, accuracy increases at first but then starts to decrease
- Increasing the number of training samples m increases accuracy; the ideal curve corresponds to an infinite number of samples
- Theoretically, as dimensionality increases, separability should increase; in reality a finite number of samples N is used to estimate the class statistics, and with a fixed number of samples increasing dimensionality leads to poorer estimates
- Combined result: there is an optimum measurement complexity, and adding features does not necessarily lead to better results
- Landgrebe: Signal Theory Methods in Multispectral Remote Sensing, Wiley, 2003

Alternatives
- Ad hoc feature extraction and selection, e.g.
  - Extraction: vegetation indices such as NDVI, Tasselled Cap transformation (a short sketch follows below)
  - Selection: knowledge about matter-energy interaction
- Feature selection: select the most important features using some criterion function
- Feature extraction: make a transformation that decreases dimensionality
- Estimate parameters more carefully or use more samples
  - Covariance estimation with a small number of samples (Landgrebe ch. 4.8): combine the class covariance matrices with a common covariance matrix
  - Statistics enhancement using unlabeled samples (Landgrebe): use unlabeled samples as well to estimate class means and covariances
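As a concrete illustration of ad hoc feature extraction mentioned above, the minimal sketch below computes NDVI from red and near-infrared reflectance bands. The band values are made up for illustration; only the NDVI formula itself comes from the text.

```python
import numpy as np

# Hypothetical red and near-infrared reflectance bands (values made up for illustration).
red = np.array([[0.10, 0.12], [0.30, 0.25]])
nir = np.array([[0.50, 0.55], [0.32, 0.28]])

# NDVI = (NIR - Red) / (NIR + Red): two spectral channels reduced to one feature.
ndvi = (nir - red) / (nir + red + 1e-12)   # small epsilon guards against division by zero
print(ndvi)
```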

Feature selection
- Choose the best subset X of d features from the set Y of D original measurements, D > d
- Maximize some criterion function: X = arg max J(Ξ), where Ξ is any d-dimensional subset of Y
- In the ideal case the criterion is the probability of correct classification
- Many combinations: the number of d-feature subsets is C(D, d) = D! / (d! (D−d)!); e.g. D = 20 gives up to C(20, 10) = 184756 combinations (a brute-force search sketch follows below)

Feature extraction
- Use the whole pattern vector y
- y is transformed to a lower-dimensional x, usually linearly: x = W^T y, where W is a D x d transformation matrix
- Elimination of redundant and insignificant information depends on the transformation matrix
- Usually W is found by optimizing a criterion function

Measuring how far classes are from each other
- Probability of error
  - Good features minimize the probability of error
  - Theoretically the ideal measure; in practice difficult to compute and subject to estimation error
- Interclass distance
  - Compute the mean distance between pattern vectors belonging to classes a and b; a large distance means the classes are separable
  - Simple, but does not take the class probability density functions into account
- Probabilistic distance
  - Uses knowledge about the probability structure of the classes: the class density functions p(ξ|ω_i) and the a priori probabilities P_i
  - Two classes with P_1 = P_2: if p(ξ|ω_1) = 0 wherever p(ξ|ω_2) ≠ 0, the classes are separable; if p(ξ|ω_1) = p(ξ|ω_2), they are non-separable
  - Overlap is measured with a distance between p(ξ|ω_1) and p(ξ|ω_2)
- Probabilistic dependence
  - Measure the dependence between ξ and ω_i as the difference between the class density functions p(ξ|ω_i) and the mixture density function p(ξ)
  - If the dependence is large, the class densities differ strongly from the mixture density and the classes are separable
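The sketch below illustrates feature selection as a subset search: it enumerates every d-feature subset of the D original features and keeps the one that maximizes a criterion J(Ξ). The criterion used here (squared Euclidean distance between class means) is an assumption chosen only to make the loop runnable, not the lecture's own J.

```python
import numpy as np
from itertools import combinations

def select_best_subset(X1, X2, d):
    """Exhaustive search: evaluate J on every d-feature subset and keep the best."""
    D = X1.shape[1]

    def J(idx):
        # Toy criterion (assumed for illustration): squared Euclidean distance
        # between the class means on the chosen features.
        cols = list(idx)
        return float(np.sum((X1[:, cols].mean(0) - X2[:, cols].mean(0)) ** 2))

    return max(combinations(range(D), d), key=J)

rng = np.random.default_rng(0)
X1 = rng.normal(0.0, 1.0, (50, 6))       # class 1 samples, D = 6 features
X2 = rng.normal(0.5, 1.0, (50, 6))       # class 2 samples
print(select_best_subset(X1, X2, d=2))   # C(6, 2) = 15 subsets evaluated
```

The combinatorial growth of C(D, d) is exactly why the search strategies on the following pages avoid enumerating all subsets.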

Entropy
- If the a posteriori probabilities P(ω_i|ξ) are uniformly distributed, separability is poor
- Measure the deviation of the P(ω_i|ξ) with a criterion function: if they are all the same, class separability is poor; if they vary a lot, separability is better
- The variation of the P(ω_i|ξ) can be measured with entropy: small entropy means good separability
- Large variation of the P(ω_i|ξ): good separability, correct classification likely

Searching for the optimal feature combination
- Try all possible combinations: the problem is computing time, e.g. for D = 20 the number of d-feature subsets C(20, d) can reach 184756
- Choose the best subset by removing or adding small subsets of features to the initial set
  - Bottom-up search: Ξ_0 = empty set, add features until the desired number is reached
  - Top-down search: Ξ_0 = all features, remove features until the desired number is reached

Branch-and-bound
- The only search algorithm that implicitly examines all d-dimensional subsets without a complete search
- Top-down; backtracking makes it possible to cover all combinations implicitly; computationally relatively light due to the efficient search process
- Requires monotonicity of the criterion over nested feature sets: generate feature sets Ξ_1 from Ξ_0 by excluding some features, Ξ_2 from Ξ_1, and so on; this generates a search tree
- Class separability does not increase as the number of features decreases, so feature subsets with small J() need not be investigated
- E.g. D = 6, d = 2: if some branches have been searched to level (D−d) and the best value found is J(Ξ_{D−d}) = B, then whenever B > A for the criterion value A at some node, the subtree under that node does not have to be searched, because by monotonicity no subset below it can exceed A

Best features
- The simplest and most unreliable method: choose the d individually best features
- Test each feature independently, rank the features by the value of the criterion function, and take the d best as the final feature set

Sequential forward selection (SFS)
- Bottom-up (X_0 = empty set): add one feature at a time to the feature set
- At each step the feature that maximizes the criterion function is chosen: if the feature set X_k has k features, the remaining features are ranked so that X_{k+1} is X_k plus the best remaining feature
- No mechanism to remove features (a minimal sketch follows below)
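A minimal sequential forward selection sketch. The criterion function J is passed in by the caller; the toy J used in the example (squared distance between class means) is an assumption for illustration only.

```python
import numpy as np

def sfs(D, d, J):
    """Sequential forward selection: start from the empty set and, at each step,
    add the single feature that maximizes the criterion J(feature_index_list)."""
    selected, remaining = [], list(range(D))
    while len(selected) < d:
        best = max(remaining, key=lambda f: J(selected + [f]))
        selected.append(best)
        remaining.remove(best)      # note: no mechanism to remove features later
    return selected

# Toy usage with an assumed criterion: squared distance between class means.
rng = np.random.default_rng(1)
X1 = rng.normal(0.0, 1.0, (40, 5))
X2 = rng.normal(0.8, 1.0, (40, 5))
J = lambda idx: float(np.sum((X1[:, idx].mean(0) - X2[:, idx].mean(0)) ** 2))
print(sfs(D=5, d=3, J=J))
```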

Generalized sequential forward selection (GSFS)
- Same as SFS, but r > 1 features are added at a time
- From the feature set X_k, form all feature sets obtained by adding every r-feature combination from Y − X_k, and choose the set X_{k+r} that maximizes the criterion
- More reliable than SFS, because dependencies between features are taken into account, but computationally heavier

Sequential backward selection (SBS)
- Top-down version of SFS: the initial feature set is X_0 = Y
- Remove one feature at a time until D − d features have been removed
- If k features have been removed from Y to obtain X_k, the candidate sets for step k+1 are ranked and the removal that keeps the criterion highest is chosen

Generalized sequential backward selection (GSBS)
- More than one feature is removed at a time: from X_k (k features already removed), form all feature sets obtained by removing every r-feature combination and choose the set X that maximizes the criterion

"Add-l, take-r" selection
- Uses backtracking:
  1. Add l features using SFS
  2. Remove r features using SBS
  3. If X has d features, stop; otherwise go to step 1
- If l > r the search proceeds bottom-up
- Does not take dependencies between features into account

Generalized "add-l, take-r" selection
- Same as above, but GSFS and GSBS are used instead of SFS and SBS

Interclass distance in feature selection and extraction
- Search for the representation x that maximizes the mean distance between feature vectors belonging to different classes:
  J(ξ) = (1/2) Σ_i P_i Σ_j P_j (1/(n_i n_j)) Σ_k Σ_l δ(ξ_ik, ξ_jl),
  where δ(ξ_ik, ξ_jl) is the distance between two d-dimensional feature vectors, one from class i and one from class j, and the a priori probabilities are usually estimated as P_i = n_i / n
- Feature selection: J(ξ) is maximized by searching for the best d-dimensional subset of the original D-dimensional feature set
- Feature extraction: search for the best transformation W, which determines the d-dimensional feature vector ξ
- In the latter case every feature vector contributes equally to the result (a computational sketch follows below)
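A sketch of the mean interclass distance criterion written out above, with priors estimated as P_i = n_i / n. The choice of the Euclidean distance for δ is an assumption; any distance function from the next page could be substituted.

```python
import numpy as np

def mean_interclass_distance(classes):
    """J = 1/2 * sum_i sum_j P_i P_j * (1/(n_i n_j)) * sum_k sum_l delta(xi_ik, xi_jl),
    with Euclidean delta and priors P_i = n_i / n.  classes: list of (n_i, d) arrays."""
    n = sum(len(c) for c in classes)
    P = [len(c) / n for c in classes]
    J = 0.0
    for i, Xi in enumerate(classes):
        for j, Xj in enumerate(classes):
            diff = Xi[:, None, :] - Xj[None, :, :]        # all pairwise differences
            dists = np.sqrt((diff ** 2).sum(axis=-1))     # Euclidean distances
            J += 0.5 * P[i] * P[j] * dists.mean()         # mean = 1/(n_i n_j) * double sum
    return J

rng = np.random.default_rng(2)
data = [rng.normal(mu, 1.0, (30, 4)) for mu in (0.0, 1.0, 3.0)]
print(mean_interclass_distance(data))
```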

Metrics (distance functions)
- There are many ways to measure the distance between two points:
  1. s-degree Minkowski metric: δ(x, y) = ( Σ_k |x_k − y_k|^s )^(1/s)
  2. City-block distance (s = 1)
  3. Euclidean distance (s = 2)
  4. Chebyshev distance: δ(x, y) = max_k |x_k − y_k|
  5. Quadratic distance: δ(x, y) = (x − y)^T Q (x − y), where Q is a positive definite scaling matrix; if Q = I this is the squared Euclidean distance
  6. Nonlinear distance: δ_N(x, y) = H if δ(x, y) > T, and 0 otherwise, where H and T are parameters and δ() is any distance function

Interclass distance with the squared Euclidean metric
- With the mean squared Euclidean distance of the data, the criterion J_1 splits into two terms: the mean distance of the vectors of each class from their class mean m_i (within-class term) and the distance of the class means m_i from the mean vector of the mixture density (between-class term)
- If the second term is large, m_i is far from the mean vectors of the other classes: good; if the first term is large, the mean distances between feature vectors of the same class are large: not good
- We would like to maximize the second term and minimize the first, but in J_1 they are added together, so opposite information influences the value in the same direction: J_1 is not a good measure of class separability
- Between-class scatter matrix: S_b = Σ_i P_i (m_i − m)(m_i − m)^T
- Within-class scatter matrix: S_w = Σ_i P_i E{ (ξ − m_i)(ξ − m_i)^T | ω_i }
- In terms of these the criterion becomes J_1 = tr[S_w] + tr[S_b]; since we want to minimize tr[S_w] and maximize tr[S_b], a better criterion J_2 is formed from their ratio
- In J_2 the within-class deviation and the mean distance between classes affect the value, but the covariances do not; applying the whitening transform U = S_w^(−1/2), for which U^T S_w U = I, J_2 becomes the sum of the eigenvalues λ_k, k = 1, ..., d, of the matrix S_w^(−1) S_b (tr = trace, the sum of the diagonal elements)

Within-class, between-class and total scatter
- Scatter represents the volume of a set of vectors, i.e. the same thing as deviation: within-class scatter s_w from S_w, between-class scatter s_b from S_b, total scatter s_T from Σ = S_w + S_b, where Σ is the covariance matrix of the mixture density
- Separability is good if the ratio of s_T to s_w is large
- Diagonalizing Σ and S_w with a matrix U such that U^T Σ U = Λ and U^T S_w U = I, the criterion J_4 becomes a function of the diagonal elements λ_j of Λ, i.e. the eigenvalues of the product S_w^(−1) Σ
- J_4 is better than J_3: the distances between classes are more uniformly distributed along the different coordinate axes, whereas with J_3 it often happens that two classes are very separable but the others are not, because one eigenvalue dominates (a scatter-matrix sketch follows below)
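A sketch computing the scatter matrices S_w and S_b and the eigenvalue-based separability value tr(S_w^(−1) S_b), which equals the sum of the eigenvalues λ_k discussed above. Estimating the priors from class sample counts and using the maximum-likelihood covariance are assumptions of this sketch, not requirements of the lecture.

```python
import numpy as np

def scatter_criterion(classes):
    """Compute S_w, S_b and J = tr(S_w^{-1} S_b), i.e. the sum of the eigenvalues
    of S_w^{-1} S_b, with priors P_i estimated from the class sample counts."""
    n = sum(len(c) for c in classes)
    m = np.vstack(classes).mean(axis=0)             # mean of the mixture density
    D = classes[0].shape[1]
    S_w, S_b = np.zeros((D, D)), np.zeros((D, D))
    for Xi in classes:
        P_i = len(Xi) / n
        m_i = Xi.mean(axis=0)
        S_w += P_i * np.cov(Xi, rowvar=False, bias=True)   # within-class scatter
        S_b += P_i * np.outer(m_i - m, m_i - m)            # between-class scatter
    J = float(np.trace(np.linalg.solve(S_w, S_b)))         # = sum of eigenvalues
    return S_w, S_b, J

rng = np.random.default_rng(3)
data = [rng.normal(mu, 1.0, (100, 3)) for mu in (0.0, 2.0)]
print(scatter_criterion(data)[2])
```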

Nonlinear distance
- J_1 and J_4 can be larger for classes 1 and 2 than for classes 1 and 3 even when the latter pair is clearly separable
- In some cases, when the density functions of the classes overlap completely, these criterion functions get a higher value than when the density functions do not overlap: for two separable classes J_1–J_4 can give a wrong impression of class separability
- In general, criterion functions based on the Euclidean distance work well only if the covariance matrices of the classes are the same
- Idea of the nonlinear distance: if a class-i feature vector is far from the vectors of the other classes, it is classified correctly; this "safe" distance is T. If the distance is greater than T, the error probability is zero, so the distance to such vectors can be treated as the constant H. If a class-j vector lies near a class-i vector, the error probability is greater than zero, and the distance between such vectors is counted as zero
- Nonlinear distance: δ_N(ξ_ik, ξ_jl) = H if δ(ξ_ik, ξ_jl) > T, and 0 otherwise, where H and T are parameters and δ() is any distance function; the criterion function is the mean of δ_N over the class pairs, as in the interclass distance criterion above

Probabilistic distance in feature selection
- The error probability is the ideal criterion for feature selection, but it is difficult to implement and suffers from estimation problems; distance functions are easy to implement but perform poorly
- A probabilistic distance gives a more realistic view of class separability: it uses knowledge about the probability structure of the classes, the class density functions p(ξ|ω_i) and the a priori probabilities P_i
- Distances for the two-class case:
  - Chernoff; Bhattacharyya (Chernoff with s = 0.5), e.g. −ln √(0.1 · 0.9) ≈ 1.2 and −ln √(0.5 · 0.5) ≈ 0.7
  - Matusita
  - Divergence
  - Patrick-Fisher
  - Lissack-Fu
  - Error probability
  - Kolmogorov: J_K = J_L if α = 1 in J_L
- The distances become simpler if parametric distributions, e.g. the normal distribution, are used
- If Σ_1 = Σ_2 = Σ: J_M = J_D = 8 J_B = (μ_2 − μ_1)^T Σ^(−1) (μ_2 − μ_1), the Mahalanobis distance (a Gaussian-case sketch follows below)
- With more classes these distances cannot be generalized directly; instead, compute the pairwise distances and sum them, J = Σ_i Σ_{j>i} P_i P_j J_ij, where J_ij is the probabilistic distance between classes i and j
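For two normal classes the Bhattacharyya distance has a well-known closed form, and it gives an upper bound on the Bayes error, P_e ≤ √(P_1 P_2) exp(−J_B). The sketch below evaluates both; the formulas are the standard Gaussian expressions and are assumed here to match the lecture's notation, and the example means and covariances are made up.

```python
import numpy as np

def bhattacharyya_gaussian(mu1, S1, mu2, S2):
    """Closed-form Bhattacharyya distance between N(mu1, S1) and N(mu2, S2)."""
    S = 0.5 * (S1 + S2)
    dm = mu2 - mu1
    term1 = 0.125 * dm @ np.linalg.solve(S, dm)
    term2 = 0.5 * np.log(np.linalg.det(S) /
                         np.sqrt(np.linalg.det(S1) * np.linalg.det(S2)))
    return term1 + term2

mu1, S1 = np.array([0.0, 0.0]), np.eye(2)
mu2, S2 = np.array([2.0, 1.0]), np.array([[1.5, 0.3], [0.3, 0.8]])
JB = bhattacharyya_gaussian(mu1, S1, mu2, S2)
P1 = P2 = 0.5
# Upper bound on the Bayes error: P_e <= sqrt(P1 * P2) * exp(-J_B)
print(JB, np.sqrt(P1 * P2) * np.exp(-JB))
```

With equal covariance matrices the second term vanishes and 8 J_B reduces to the Mahalanobis distance quoted above.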

Probabilistic dependence in feature selection
- The dependence between ξ and ω_i is obtained by measuring the difference between the class density functions and the mixture density function: a large dependence means a big difference, and the classes are separable
- If the class distributions are parametric, their mixture density function is NOT parametric, so we do not get simpler forms as with the probabilistic distances
- Measures: Chernoff, Bhattacharyya, Matusita, Joshi (formed from the divergence), Patrick-Fisher, Lissack-Fu, mutual information
- Use these if there are several classes and they are feasible to implement

Entropy in feature selection
- Separability can also be measured using the a posteriori probabilities P(ω_i|ξ) and the mixture density function p(ξ), by measuring the "variation" of the a posteriori probabilities with an entropy
- The separability measure based on the Shannon entropy is minimized and is called the equivocation
- The separability measure based on the quadratic entropy is maximized and is called the Bayes distance

Karhunen-Loève transformation
- Preserve the information in compressed form with as small a compression error as possible: decorrelate the elements of the feature vectors y and remove the coordinate axes whose information content is small
- Represent the D-dimensional pattern vectors y with orthonormal basis vectors u_j: y = Σ_{j=1}^{D} a_j u_j, where a_j is the coefficient of basis vector u_j
- Approximate y using d terms: ŷ = Σ_{j=1}^{d} a_j u_j
- The basis vectors should be chosen so that the squared error between the pattern vectors and their approximations is minimized; the approximation error is the sum of the (expected squared) coefficients left out of the approximation
- The optimal basis vectors (= the axes of the new coordinate system) are the eigenvectors of a matrix Ψ representing the data, and the λ_i are the corresponding eigenvalues; the compression error is minimized by selecting for the transformation matrix the eigenvectors with the largest eigenvalues
- Feature extraction with the K-L transformation: y is the D-dimensional original pattern vector, x the d-dimensional transformed pattern vector, and W the transformation matrix, x = W^T y
- W is obtained by solving the eigenproblem A U = U Λ, where A is the matrix representing the data, U contains the basis vectors of the K-L transformation (the eigenvectors of A) and Λ is the diagonal matrix of the eigenvalues of A; W = [u_1, ..., u_d], the eigenvectors corresponding to the d largest eigenvalues (a PCA sketch follows below)
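A minimal Karhunen-Loève sketch: eigen-decompose a matrix A representing the data, keep the d eigenvectors with the largest eigenvalues as W, and project x = W^T y. Defaulting A to the sample covariance matrix gives the PCA case named on the next page; subtracting the mean before projection is an added assumption of this sketch (the slides simply write x = W^T y), needed for the compression-error interpretation.

```python
import numpy as np

def kl_transform(Y, d, A=None):
    """Karhunen-Loeve feature extraction.
    Y: (n, D) pattern vectors; A: (D, D) matrix representing the data
    (defaults to the sample covariance matrix, which gives PCA)."""
    if A is None:
        A = np.cov(Y, rowvar=False)
    eigval, eigvec = np.linalg.eigh(A)              # A is symmetric
    order = np.argsort(eigval)[::-1]                # largest eigenvalues first
    W = eigvec[:, order[:d]]                        # W = [u_1, ..., u_d], D x d
    X = (Y - Y.mean(axis=0)) @ W                    # x = W^T (y - mean)
    compression_error = eigval[order[d:]].sum()     # sum of discarded eigenvalues
    return X, W, compression_error

rng = np.random.default_rng(4)
C = np.array([[3.0, 1.0, 0.0], [1.0, 2.0, 0.0], [0.0, 0.0, 0.1]])
Y = rng.multivariate_normal([0.0, 0.0, 0.0], C, size=200)
X, W, err = kl_transform(Y, d=2)
print(X.shape, err)
```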

Karhunen-Loève transformation: how to compute A?
- Depends on the desired properties of the transformation:
  - To maximize the separability of the class mean vectors, use the between-class scatter matrix S_b; a more optimal matrix is obtained by first applying a "whitening" transformation to the between-class scatter matrix
  - If the mean vectors of the classes are almost the same but the covariances differ, use the covariance matrices of the classes
  - If there is no information about class means and covariances, use the covariance matrix of the whole dataset; this is called Principal Component Analysis

Estimation of classification error
- E, the error rate = mean probability of error
- An analytical solution is VERY rare (estimation errors, assumptions needed)
- Empirical testing of the classifier: training set and test set; classify the test set with a classifier trained on the training set and count the proportion of errors; how well does this generalize?
- Relations between E (error rate) and A (acceptance rate), and the k-NN rule, are also based on a training set and a test set
- A probabilistic distance can be used to compute upper and lower limits for the error E*:
  - Limit based on the Mahalanobis distance: Δ = (μ_1 − μ_2)^T Σ^(−1) (μ_1 − μ_2), with Σ = P_1 Σ_1 + P_2 Σ_2
  - Bhattacharyya distance: upper and lower limits
  - Divergence

Error counting
- Resubstitution estimation: test set = training set; optimistically biased
- Holdout estimation: 50% of the data to the training set, 50% to the test set; pessimistically biased
- Leave-one-out estimation: one vector as the test set, the rest as the training set, repeated until every vector has been the test set; unbiased but has a large variance
- Rotation estimation: divide the data into n parts, classify one part as the test set, repeat until every part has been the test set
- Bootstrap estimation: generate "new" pattern vectors from the old ones (a comparison sketch of resubstitution and leave-one-out follows below)

Risks of the resubstitution error estimate (training set = test set)
- Example I: 2 classes with P_1 = P_2 and p(x|ω_1) = p(x|ω_2), linear discriminant function with the perceptron learning algorithm; if n < 2d, it is VERY likely that a hyperplane separating the classes is found, so the estimated error probability is 0 while the true error probability is 0.5
- Example II: the 1-NN rule gives an estimated error probability of 0 whenever x_i ≠ x_j for all i ≠ j

Relations between E and R/A
- Use the relation between E (error rate) and R (rejection rate) or A (acceptance rate): e.g. for the (k,l)-NN classifier, estimate the acceptance rates and use them to estimate the error
- Advantages: smaller variance than error counting methods, the true classes of the test set can be unknown, less labelled data is needed, and errors of the labelling process are avoided
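The sketch below compares two of the error-counting estimators listed above, resubstitution and leave-one-out, using a simple nearest-mean classifier. The classifier is an assumption chosen for brevity; any classifier could be plugged into the same loops.

```python
import numpy as np

def nearest_mean_predict(Xtr, ytr, Xte):
    """Assign each test vector to the class with the nearest training mean
    (class labels are assumed to be 0, 1, ..., K-1)."""
    means = np.array([Xtr[ytr == c].mean(axis=0) for c in np.unique(ytr)])
    d2 = ((Xte[:, None, :] - means[None, :, :]) ** 2).sum(axis=-1)
    return np.argmin(d2, axis=1)

def resubstitution_error(X, y):
    """Test set = training set: optimistically biased."""
    return float(np.mean(nearest_mean_predict(X, y, X) != y))

def leave_one_out_error(X, y):
    """Each vector in turn is the test set: nearly unbiased, larger variance."""
    wrong = 0
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        wrong += int(nearest_mean_predict(X[mask], y[mask], X[i:i + 1])[0] != y[i])
    return wrong / len(X)

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(1.5, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(resubstitution_error(X, y), leave_one_out_error(X, y))
```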

Mean risk
- 2 classes whose density functions and a priori probabilities are known
- The conditional probability of error for a pattern vector x is r(x) = min[ P(ω_1|x), P(ω_2|x) ]
- The mean risk is estimated as the average of r(x_i) over the test vectors; NOTE: the class of x can be unknown
- The variance of the mean risk estimate is smaller than the variance of an error counting estimate, because in error counting the error is quantized to the values {0, 1}, whereas here the error lies in [0, 0.5]

References
- Heikkilä, Törmä: Tilastollinen hahmontunnistus (Statistical Pattern Recognition), TKK, Fotogrammetrian ja kaukokartoituksen laboratorio (Laboratory of Photogrammetry and Remote Sensing)
- Landgrebe: Signal Theory Methods in Multispectral Remote Sensing, Wiley, 2003
