Feature selection and extraction, error estimation
Maa: Data Classification and Modelling in Remote Sensing
Markus Törmä

Overview
- Measurements
- Preprocessing: remove random and systematic errors
- Feature selection and extraction: decrease the volume of information
  - Feature selection: select the relevant information
  - Feature extraction: compute new features or relations between features
  - Spatial domain: search for important targets in the image
  - Time domain: time-series analysis
  - Spectral domain: decrease feature dimensionality
- Data structure: a way to represent features (image: vector & matrix)
- Recognition: divide measurements into categories (classes)
  - Supervised: the classes are known
  - Unsupervised: the classes are unknown
- Important step: quality estimation

Feature selection and extraction in the spectral domain
Try to decrease the feature dimensionality. Why? Theoretically, if an infinite number of training samples is used, adding more features increases accuracy. Measurement complexity: k^n, where k is the number of bits and n the number of channels (2 classes, P_c: a priori probability of one class).

Hughes phenomenon
In reality, a limited number of samples is in use. As the dimensionality increases, accuracy increases at first but then starts to decrease. Increasing the number of training samples increases accuracy (2 classes, equal a priori probabilities; m: number of training samples used; the limiting curve corresponds to an infinite number of samples). [Landgrebe: Signal Theory Methods in Multispectral Remote Sensing, Wiley, 2003]

Why? Theoretically, as dimensionality increases, separability should increase. In reality, a finite number of samples N is used to estimate the class statistics: if the number of samples is fixed, increasing the dimensionality results in poorer estimates. Combined result:
- there is an optimum value of measurement complexity
- adding features does not necessarily lead to better results

Alternatives
Ad hoc feature extraction & selection, e.g.
- Extraction: vegetation indices such as NDVI, Tasselled Cap transformation
- Selection: knowledge about matter-energy interaction
Feature selection: select the most important features using some criterion function.
Feature extraction: make a transformation in order to decrease the dimensionality.
Estimate parameters more carefully or using more samples:
- Estimate the covariance matrix using a small number of samples (Landgrebe ch. 4.8): combine the covariance matrices of the classes and the common covariance matrix.
- Statistics enhancement using unlabeled samples (Landgrebe ch. ): also use unlabeled samples to estimate class means and covariances.
Feature selection
Choose the best subset X from the set Y of original measurements, where D > d. Maximize some criterion function J(Ξ), where Ξ is any d-dimensional subset of Y. In the ideal case we maximize the probability of correct classification. There are many combinations: the number of d-dimensional subsets of D features is D! / (d! (D-d)!), so e.g. D = 20 already gives a very large number of different combinations.

Feature extraction
Use the whole pattern vector y: y is transformed to a lower-dimensional x = W^T y, where W is a transformation matrix of size D x d. The elimination of redundant and insignificant information depends on the transformation matrix; usually W is found using a criterion function, and the transformation is usually linear.

Class separability measures
We need to measure how far the classes are from each other.
- Probability of error: good features minimize the probability of error. Theoretically the ideal measure, but in practice difficult to compute (estimation error).
- Interclass distance: compute the mean distance between pattern vectors belonging to classes a and b; a large distance means the classes are separable. Simple, but does not take the probability density functions of the classes into account.
- Probabilistic distance: use knowledge about the probability structure of the classes, i.e. the class density functions p(ξ|ω_i) and the a priori probabilities P_i. For 2 classes with P_1 = P_2: if p(ξ|ω_1) = 0 wherever p(ξ|ω_2) ≠ 0, the classes are separable; if p(ξ|ω_1) = p(ξ|ω_2), the classes are non-separable. Overlap is measured using the distance between p(ξ|ω_1) and p(ξ|ω_2).
- Probabilistic dependence: measure the dependence between ξ and ω_i as the difference between the class density function p(ξ|ω_i) and the mixture density function p(ξ). If the dependence is large, there is a big difference between the class density function and the mixture density function, and the classes are separable.
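The combinatorial explosion behind the "many combinations" remark can be checked with a few lines of Python. D = 20 matches the lecture's example; d = 10 is a hypothetical value chosen here for illustration, since the original slide's value did not survive transcription:

```python
from math import comb
from itertools import combinations

D = 20          # number of original features (from the lecture's example)
d = 10          # hypothetical subset size, chosen for illustration

# Number of d-dimensional subsets of a D-dimensional feature set:
# D! / (d! * (D - d)!)
print(comb(D, d))                             # 184756 candidate subsets

# Exhaustive enumeration is feasible only for small D, e.g. D = 6, d = 2:
print(len(list(combinations(range(6), 2))))   # 15 subsets
```

This is why exhaustive search is usually replaced by the sequential and branch-and-bound strategies described below.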
Entropy
If the a posteriori probabilities P(ω_i|ξ) are uniformly distributed, separability is poor. Measure the deviation of the P(ω_i|ξ) using a criterion function: if the P(ω_i|ξ) are all the same, class separability is poor; if they vary much, separability is better. The variation of the P(ω_i|ξ) can be measured using entropy: small entropy means large variation of the P(ω_i|ξ), hence good separability, and correct classification is likely.

Searching for the optimal feature combination
- Try all possible combinations. Problem: computing time, e.g. D = 20, d = gives a huge number of different combinations.
- Choose the best subset by removing or adding small subsets of features to the original set:
  - Bottom-up search: initial situation Ξ_0 = empty set; add features until the desired number is reached.
  - Top-down search: initial situation Ξ_0 = all features; remove features until the desired number is reached.

Branch-and-bound
The only search algorithm which implicitly searches all d-dimensional subsets without a complete search. Top-down; backtracking makes it possible to search all combinations implicitly, and it is computationally relatively light due to the efficient search process. It relies on the monotonicity of nested feature sets: generate feature sets Ξ_1 from Ξ_0 by excluding some features, then feature sets Ξ_2 from Ξ_1 by excluding some features, and so on; this generates a search tree. Since class separability does not increase as the number of features decreases, it is not necessary to investigate feature subsets with small J(). E.g. D = 6, d = 2: if some branches have been searched down to level (D-d) and the best value found so far is J(Ξ_{D-d}) = B, then for a node with bound A such that B > A we do not have to search the subtree under that node.

Best features
The simplest and most unreliable method: choose the d independently best features. Test each feature independently, put the features in order according to the value of the criterion function, and the top d form the final feature set.

Sequential forward selection (SFS)
Bottom-up (X_0 = empty set). Add one feature at a time to the feature set; at each step, the feature which maximizes the criterion function is chosen. E.g.
the feature set X_k has k features; rank the left-over features so that the feature set at step k+1 maximizes the criterion function. There is no mechanism to remove features once they have been added.
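A minimal sketch of SFS in Python, assuming an arbitrary criterion function J supplied by the caller; the trace-ratio criterion below is one illustrative choice, not the lecture's specific J:

```python
import numpy as np

def sfs(X, y, criterion, d):
    """Sequential forward selection: start from the empty set and at each
    step add the single feature that maximizes the criterion function."""
    selected = []
    remaining = list(range(X.shape[1]))
    while len(selected) < d:
        scores = {f: criterion(X[:, selected + [f]], y) for f in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)      # note: no mechanism to remove features
    return selected

def trace_ratio(Xs, y):
    """Illustrative criterion: between-class over within-class spread."""
    m = Xs.mean(axis=0)
    sw = sum(np.sum((Xs[y == c] - Xs[y == c].mean(axis=0)) ** 2)
             for c in np.unique(y))
    sb = sum((y == c).sum() * np.sum((Xs[y == c].mean(axis=0) - m) ** 2)
             for c in np.unique(y))
    return sb / (sw + 1e-12)
```

On toy data where only the first feature separates the classes, `sfs(X, y, trace_ratio, 1)` returns `[0]`.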
Generalized sequential forward selection (GSFS)
Otherwise the same as SFS, but add more than one feature (r features) at a time. Given the feature set X_k, form all feature sets obtained by adding each r-feature combination from the set Y - X_k, and choose the feature set X_{k+r} which maximizes the criterion function. More reliable than SFS, because dependencies between features are taken into account, but computationally heavier.

Sequential backward selection (SBS)
The top-down version of SFS. The initial feature set is X_0 = Y; remove one feature at a time until D - d features have been removed. Given the feature set X_k (k features removed from Y), rank the new feature sets and choose the one which maximizes the criterion function as the feature set at step k+1.

Generalized sequential backward selection (GSBS)
More than one feature is removed at a time. Given the feature set X_k (k features removed), form all feature sets obtained by removing each r-feature combination from X_k, and choose the one which maximizes the criterion function.

"Add-l, take-r" selection
Uses backtracking:
1. Add l features using SFS.
2. Remove r features using SBS.
3. If X has d features, stop; otherwise go to step 1.
If l > r, the search is bottom-up. Does not take dependencies between features into account. The generalized "add-l, take-r" selection is otherwise the same, but uses GSFS and GSBS.

Interclass distance in feature selection and extraction
Search for the x which maximizes the mean distance between feature vectors belonging to different classes. Here δ(ξ_ik, ξ_jl) is the distance between two d-dimensional feature vectors, one from class i and the other from class j. Usually the a priori probabilities must be estimated as P_i = n_i / n, and the criterion function then uses these estimates. The optimal feature vector maximizes the criterion function J(ξ):
- Feature selection: J(ξ) is maximized by searching for the best d-dimensional subset of the original D-dimensional feature set.
- Feature extraction: search for the best transformation W, which determines the d-dimensional feature vector ξ. In the latter case each feature vector contributes the same amount to the result.
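SBS can be sketched in the same style as SFS, again with the criterion supplied by the caller; the class-mean-gap criterion below is illustrative only:

```python
import numpy as np

def sbs(X, y, criterion, d):
    """Sequential backward selection: start from the full feature set and
    at each step drop the feature whose removal keeps the criterion highest."""
    selected = list(range(X.shape[1]))
    while len(selected) > d:
        scores = {f: criterion(X[:, [g for g in selected if g != f]], y)
                  for f in selected}
        drop = max(scores, key=scores.get)   # its removal hurts J the least
        selected.remove(drop)
    return selected

def class_mean_gap(Xs, y):
    """Illustrative 2-class criterion: squared distance between class means."""
    c0, c1 = np.unique(y)
    gap = Xs[y == c0].mean(axis=0) - Xs[y == c1].mean(axis=0)
    return float(np.sum(gap ** 2))
```

Starting from the full set and removing features whose removal barely changes J discards the noise features first, keeping the informative ones.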
Metrics (distance functions)
There are many different ways to measure the distance between two points:
1. s-degree Minkowski metric
2. City-block distance (s = 1)
3. Euclidean distance (s = 2)
4. Chebyshev distance
5. Quadratic distance, where Q is a positive definite scaling matrix; if Q = I, it is the same as the squared Euclidean distance
6. Nonlinear distance, where H and T are parameters and δ() is any distance function

Criterion functions based on interclass distance
Let m_i be the mean vector of class i and m the mean vector of the mixture density function. The mean squared Euclidean distance of the data splits into two terms. If the second term is large, m_i is far away from the mean vectors of the other classes (good); if the first term is large, the mean distances between feature vectors belonging to the same class are large (not good). We try to maximize the second term and minimize the first. J_1() is therefore not a good measure of class separability: opposite pieces of information influence it in the same direction.

Define the between-class scatter matrix S_b and the within-class scatter matrix S_w. We want to minimize tr[S_w] and maximize tr[S_b], which gives a better criterion function. Within-class deviation and the mean distance between classes affect the criterion value, but the covariances do not. Make the whitening transform U = S_w^{-1/2}, so that U^T S_w U = I. Now J_2 becomes J_2 = tr[S_w^{-1} S_b] = Σ_k λ_k, where tr is the trace (sum of diagonal elements) and λ_k, k = 1, ..., d, are the eigenvalues of the matrix S_w^{-1} S_b.

We can also use within-class scatter and between-class scatter. Scatter represents the volume of a set of vectors, i.e. it determines the same thing as the deviation. Within-class scatter: s_w = S_w; between-class scatter: s_b = S_b; total scatter: s_T = Σ = S_w + S_b, where Σ is the covariance matrix of the mixture density function. Separability is good if the ratio of s_T to s_w is large. If we diagonalize Σ and S_w with a matrix U so that U^T Σ U = Λ, J_4 becomes a function of the λ_j, the diagonal elements of Λ, where Λ is the diagonal matrix formed by the eigenvalues of the product S_w^{-1} Σ. J_4 is better than J_3: the distances between classes are more uniformly distributed along the different coordinate axes, whereas when J_3 is used, usually two classes are very separable but the others are not, because one eigenvalue dominates.
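The scatter matrices and the eigenvalue form of the trace criterion can be written down directly; a sketch, using the P_i = n_i / n estimate of the a priori probabilities from the earlier slide:

```python
import numpy as np

def scatter_matrices(X, y):
    """Within-class scatter S_w and between-class scatter S_b,
    weighted by the estimated a priori probabilities P_i = n_i / n."""
    n, D = X.shape
    m = X.mean(axis=0)                       # mean of the mixture density
    Sw = np.zeros((D, D))
    Sb = np.zeros((D, D))
    for c in np.unique(y):
        Xc = X[y == c]
        Pc = len(Xc) / n
        mc = Xc.mean(axis=0)
        Sw += Pc * np.cov(Xc, rowvar=False, bias=True)
        Sb += Pc * np.outer(mc - m, mc - m)
    return Sw, Sb

def j2(X, y):
    """J_2 = tr[S_w^{-1} S_b] = sum of the eigenvalues of S_w^{-1} S_b."""
    Sw, Sb = scatter_matrices(X, y)
    return float(np.trace(np.linalg.solve(Sw, Sb)))
```

Well-separated classes give a larger J_2 than overlapping ones, and the trace indeed equals the eigenvalue sum.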
Nonlinear distance
J_1 and J_4 are larger for classes 1 and 2 than for classes 1 and 3, although the latter classes are clearly separable. In some cases, when the density functions of the classes completely overlap, these criterion functions get a higher value than when the density functions do not overlap: for two separable classes, J_1 - J_4 can give a wrong impression of class separability. In general, criterion functions based on the Euclidean distance are good only if the covariance matrices of the classes are the same.

The nonlinear distance addresses this. If a class-i feature vector is far away from the other classes, the vector is classified correctly; this "safe" distance is T. If the distance is greater than T, the error probability is zero, so we can treat the distance to such vectors as a constant H. If there is a vector belonging to class j near a vector belonging to class i, the error probability is higher than zero; the distance between such vectors is taken to be zero. Here H and T are parameters and δ() is any distance function, and the criterion function is built on this nonlinear δ.

Probabilistic distance in feature selection
The error probability is the ideal criterion for feature selection, but it is difficult to implement (estimation problems). Distance functions are easy to implement but have low performance. Probabilistic distance gives a more realistic view of class separability. Use knowledge about the probability structure of the classes: the class density functions p(ξ|ω_i) and the a priori probabilities P_i.

Distances for the 2-class case:
- Chernoff (equal to Bhattacharyya if s = 0.5), e.g. -ln sqrt(0.1 * 0.9) ≈ 1.2 and -ln sqrt(0.5 * 0.5) ≈ 0.7
- Matusita
- Divergence
- Patrick-Fisher
- Lissack-Fu
- Error probability
- Kolmogorov (J_K = J_L if α = 1 in J_L)

The distances become simpler if parametric distributions are used, e.g. the
normal distribution. If Σ_1 = Σ_2 = Σ, then J_M = J_D = 8 J_B = (μ_2 - μ_1)^T Σ^{-1} (μ_2 - μ_1), the Mahalanobis distance. For more than two classes these distances cannot be generalized directly; instead, compute the pairwise distances and sum them all, where J_ij is the probabilistic distance between classes i and j.
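For normal densities the Bhattacharyya distance has a closed form, and the J_M = 8 J_B relation above is easy to verify numerically; a minimal sketch:

```python
import numpy as np

def bhattacharyya_normal(m1, S1, m2, S2):
    """Bhattacharyya distance between two multivariate normal densities:
    (1/8)(m2-m1)^T Sbar^-1 (m2-m1) + (1/2) ln(|Sbar| / sqrt(|S1||S2|)),
    where Sbar = (S1 + S2) / 2."""
    Sbar = (S1 + S2) / 2
    dm = np.asarray(m2, float) - np.asarray(m1, float)
    mean_term = dm @ np.linalg.solve(Sbar, dm) / 8
    cov_term = 0.5 * np.log(np.linalg.det(Sbar)
                            / np.sqrt(np.linalg.det(S1) * np.linalg.det(S2)))
    return mean_term + cov_term
```

With Σ_1 = Σ_2 = I and μ_2 - μ_1 = (3, 0), the covariance term vanishes and 8 J_B = 9, the Mahalanobis distance.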
Probabilistic dependence in feature selection
We get the dependence between ξ and ω_i by measuring the difference between the class density functions and the mixture density function. If the dependence is large, there is a big difference between the class density function and the mixture density function, and the classes are separable. Note that even if the distributions of the classes are parametric, their mixture density function is NOT parametric, so we do not get simpler forms, as we did with the probabilistic distances. Dependence measures: Chernoff, Bhattacharyya, Matusita, Joshi (formed from the Divergence), Patrick-Fisher, Lissack-Fu, and mutual information. Use these if there are several classes and they are possible to implement.

Entropy in feature selection
Separability can also be measured using the a posteriori probabilities P(ω_i|ξ) and the mixture density function p(ξ), by measuring the "variation" of the a posteriori probabilities using entropy:
- A separability measure based on the Shannon entropy, to be minimized, is called the equivocation.
- A separability measure based on the quadratic entropy, to be maximized, is called the Bayes distance.

Karhunen-Loève transformation
Preserve the information in compressed form with as small a compression error as possible: decorrelate the elements of the feature vectors y and remove those coordinate axes whose information content is small. Represent the D-dimensional pattern vectors y using orthonormal basis vectors u_j, where the a_j are the coefficients of the basis vectors u_j, and approximate y using d terms. It is good to choose the basis vectors so that the squared error between the pattern vectors and their approximations is minimized; the approximation error is the sum of the squared coefficients which were not included in the approximation. The optimal basis vectors (the axes of the new coordinate system) correspond to the eigenvectors of the matrix Ψ representing the data, and the λ_i are the corresponding eigenvalues. Minimize the compression error by selecting for the transformation matrix the eigenvectors with the largest eigenvalues.

Feature extraction with the Karhunen-Loève transformation: y is the D-dimensional original pattern vector, x the d-dimensional
transformed pattern vector, and W the transformation matrix. W is obtained by solving the eigenvalue problem of A, where A is the matrix representing the data, U is the matrix containing the basis vectors of the K-L transformation (the eigenvectors of A), and Λ is the diagonal matrix formed by the eigenvalues of A. Then W = [u_1, ..., u_d], the eigenvectors corresponding to the d largest eigenvalues.
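When A is taken to be the covariance matrix of the whole dataset (the PCA case), the eigenvalue problem above can be solved directly; a minimal sketch:

```python
import numpy as np

def kl_transform(Y, d):
    """Karhunen-Loeve feature extraction with A = covariance of the data:
    project centered pattern vectors onto the d eigenvectors of A that
    have the largest eigenvalues."""
    Yc = Y - Y.mean(axis=0)
    A = np.cov(Yc, rowvar=False)
    eigvals, U = np.linalg.eigh(A)          # eigh: A is symmetric
    order = np.argsort(eigvals)[::-1]       # eigenvalues in decreasing order
    W = U[:, order[:d]]                     # W = [u_1, ..., u_d]
    return Yc @ W, W, eigvals[order]
```

The transformed components are decorrelated: their covariance matrix is diagonal, with the retained eigenvalues on the diagonal, and the compression error equals the sum of the discarded eigenvalues.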
Karhunen-Loève transformation: how to compute A?
It depends on the desired properties of the transformation:
- To maximize the separability of the class mean vectors, use the between-class scatter matrix S_b. A more optimal matrix can be obtained by applying a "whitening" transformation to the between-class scatter matrix.
- If the mean vectors of the classes are almost the same but the covariances differ, use the covariance matrices of the classes.
- If there is no information about class means and covariances, use the covariance matrix of the whole dataset. This is called Principal Component Analysis.

Estimation of classification error
E, the error rate, is the mean probability of error. An analytical solution is VERY rare (estimation errors, assumptions). Approaches:
- Empirical testing of the classifier: training set & test set. Classify the test set using a classifier trained on the training set and count the proportion of errors. Generalization? (The k-NNR approach is also based on a training set and a test set.)
- Relations between E (error rate) and A (acceptance rate).
- Probabilistic distance: we can approximate E* by computing upper and lower limits for the error, e.g. a limit using the Mahalanobis distance Δ = (μ_1 - μ_2)^T Σ^{-1} (μ_1 - μ_2) with Σ = P_1 Σ_1 + P_2 Σ_2; the Bhattacharyya distance (upper and lower limit); the Divergence.

Error counting
- Resubstitution estimation: test set = training set. Optimistically biased.
- Holdout estimation: divide the data 50% into the training set and 50% into the test set. Pessimistically biased.
- Leave-one-out estimation: one vector is the test set, the others the training set; repeat until every vector has been the test set. Unbiased, but large variance.
- Rotation estimation: divide the data into n parts, classify one part as the test set, repeat until every part has been the test set.
- Bootstrap estimation: generate "new" pattern vectors from the old ones.

Risks of the resubstitution error estimation method (training set = test set):
- Example I: 2 classes, P_1 = P_2, p(x|ω_1) = p(x|ω_2); a linear discriminant function with the perceptron learning algorithm. If n < 2d, it is VERY likely that we can find a hyperplane separating the classes, so the estimated error probability is 0. In
reality, the error probability is 0.5.
- Example II: 1-NNR. The estimated error probability is always 0 if x_i ≠ x_j for all i ≠ j.

Relations between E and R/A
Use the relation between E (error rate) and R (rejection rate) or A (acceptance rate). E.g. for the k,l-NN classifier: estimate the acceptance rates and use them to estimate the error. Advantages: smaller variance than error counting methods; the classification of the test set can be unknown; less labelled data is needed, avoiding errors of the labelling process.
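Example II (resubstitution with 1-NNR always reporting zero error) is easy to demonstrate, with leave-one-out giving a realistic estimate on the same data; a sketch:

```python
import numpy as np

def one_nn(train_X, train_y, x):
    """1-NNR: classify x according to its nearest training vector."""
    return train_y[np.argmin(np.sum((train_X - x) ** 2, axis=1))]

def resubstitution_error(X, y):
    """Test set = training set: each vector is its own nearest neighbour,
    so the estimate is 0 whenever all vectors are distinct."""
    return np.mean([one_nn(X, y, x) != c for x, c in zip(X, y)])

def leave_one_out_error(X, y):
    """One vector at a time is the test set; the rest are the training set."""
    idx = np.arange(len(X))
    errs = [one_nn(X[idx != i], y[idx != i], X[i]) != y[i] for i in idx]
    return float(np.mean(errs))
```

On two heavily overlapping classes the resubstitution estimate is exactly 0 (the optimistic bias), while leave-one-out reports a positive error rate.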
Mean risk
2 classes whose density functions and a priori probabilities are known. From the conditional probability of error for a pattern vector x we get an estimate for the mean risk. NOTE: the class of x can be unknown. The variance of the mean risk estimate is smaller than the variance of the error counting estimate, because in error counting methods the error is quantized to the values (0, 1), while here the error lies between [0, 0.5].

References
- Heikkilä, Törmä: Tilastollinen hahmotunnistus (Statistical Pattern Recognition), TKK, Laboratory of Photogrammetry and Remote Sensing.
- Landgrebe: Signal Theory Methods in Multispectral Remote Sensing, Wiley, 2003.
More informationBayesian Decision Theory
Bayesian Decision Theory Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Bayesian Decision Theory Bayesian classification for normal distributions Error Probabilities
More informationPattern Recognition and Machine Learning
Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability
More informationData Preprocessing. Cluster Similarity
1 Cluster Similarity Similarity is most often measured with the help of a distance function. The smaller the distance, the more similar the data objects (points). A function d: M M R is a distance on M
More informationExpectation Maximization
Expectation Maximization Machine Learning CSE546 Carlos Guestrin University of Washington November 13, 2014 1 E.M.: The General Case E.M. widely used beyond mixtures of Gaussians The recipe is the same
More informationLINEAR MODELS FOR CLASSIFICATION. J. Elder CSE 6390/PSYC 6225 Computational Modeling of Visual Perception
LINEAR MODELS FOR CLASSIFICATION Classification: Problem Statement 2 In regression, we are modeling the relationship between a continuous input variable x and a continuous target variable t. In classification,
More informationClassifier Evaluation. Learning Curve cleval testc. The Apparent Classification Error. Error Estimation by Test Set. Classifier
Classifier Learning Curve How to estimate classifier performance. Learning curves Feature curves Rejects and ROC curves True classification error ε Bayes error ε* Sub-optimal classifier Bayes consistent
More informationCS6375: Machine Learning Gautam Kunapuli. Decision Trees
Gautam Kunapuli Example: Restaurant Recommendation Example: Develop a model to recommend restaurants to users depending on their past dining experiences. Here, the features are cost (x ) and the user s
More informationDiscriminative Direction for Kernel Classifiers
Discriminative Direction for Kernel Classifiers Polina Golland Artificial Intelligence Lab Massachusetts Institute of Technology Cambridge, MA 02139 polina@ai.mit.edu Abstract In many scientific and engineering
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table
More informationday month year documentname/initials 1
ECE471-571 Pattern Recognition Lecture 13 Decision Tree Hairong Qi, Gonzalez Family Professor Electrical Engineering and Computer Science University of Tennessee, Knoxville http://www.eecs.utk.edu/faculty/qi
More informationMulti-Class Linear Dimension Reduction by. Weighted Pairwise Fisher Criteria
Multi-Class Linear Dimension Reduction by Weighted Pairwise Fisher Criteria M. Loog 1,R.P.W.Duin 2,andR.Haeb-Umbach 3 1 Image Sciences Institute University Medical Center Utrecht P.O. Box 85500 3508 GA
More informationECE521 week 3: 23/26 January 2017
ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear
More informationNeural Networks Learning the network: Backprop , Fall 2018 Lecture 4
Neural Networks Learning the network: Backprop 11-785, Fall 2018 Lecture 4 1 Recap: The MLP can represent any function The MLP can be constructed to represent anything But how do we construct it? 2 Recap:
More informationBayes Decision Theory - I
Bayes Decision Theory - I Nuno Vasconcelos (Ken Kreutz-Delgado) UCSD Statistical Learning from Data Goal: Given a relationship between a feature vector and a vector y, and iid data samples ( i,y i ), find
More informationBayesian Decision and Bayesian Learning
Bayesian Decision and Bayesian Learning Ying Wu Electrical Engineering and Computer Science Northwestern University Evanston, IL 60208 http://www.eecs.northwestern.edu/~yingwu 1 / 30 Bayes Rule p(x ω i
More informationNon-linear Measure Based Process Monitoring and Fault Diagnosis
Non-linear Measure Based Process Monitoring and Fault Diagnosis 12 th Annual AIChE Meeting, Reno, NV [275] Data Driven Approaches to Process Control 4:40 PM, Nov. 6, 2001 Sandeep Rajput Duane D. Bruns
More informationClustering VS Classification
MCQ Clustering VS Classification 1. What is the relation between the distance between clusters and the corresponding class discriminability? a. proportional b. inversely-proportional c. no-relation Ans:
More informationChemometrics: Classification of spectra
Chemometrics: Classification of spectra Vladimir Bochko Jarmo Alander University of Vaasa November 1, 2010 Vladimir Bochko Chemometrics: Classification 1/36 Contents Terminology Introduction Big picture
More informationMachine Learning Lecture 5
Machine Learning Lecture 5 Linear Discriminant Functions 26.10.2017 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de leibe@vision.rwth-aachen.de Course Outline Fundamentals Bayes Decision Theory
More informationDesign and Implementation of Speech Recognition Systems
Design and Implementation of Speech Recognition Systems Spring 2013 Class 7: Templates to HMMs 13 Feb 2013 1 Recap Thus far, we have looked at dynamic programming for string matching, And derived DTW from
More informationParametric Techniques
Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure
More informationMachine Learning Lecture 2
Machine Perceptual Learning and Sensory Summer Augmented 15 Computing Many slides adapted from B. Schiele Machine Learning Lecture 2 Probability Density Estimation 16.04.2015 Bastian Leibe RWTH Aachen
More informationBayes Decision Theory
Bayes Decision Theory Minimum-Error-Rate Classification Classifiers, Discriminant Functions and Decision Surfaces The Normal Density 0 Minimum-Error-Rate Classification Actions are decisions on classes
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear
More informationLecture 7: Con3nuous Latent Variable Models
CSC2515 Fall 2015 Introduc3on to Machine Learning Lecture 7: Con3nuous Latent Variable Models All lecture slides will be available as.pdf on the course website: http://www.cs.toronto.edu/~urtasun/courses/csc2515/
More informationIntroduction to Supervised Learning. Performance Evaluation
Introduction to Supervised Learning Performance Evaluation Marcelo S. Lauretto Escola de Artes, Ciências e Humanidades, Universidade de São Paulo marcelolauretto@usp.br Lima - Peru Performance Evaluation
More informationPerformance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project
Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Devin Cornell & Sushruth Sastry May 2015 1 Abstract In this article, we explore
More informationA Unified Bayesian Framework for Face Recognition
Appears in the IEEE Signal Processing Society International Conference on Image Processing, ICIP, October 4-7,, Chicago, Illinois, USA A Unified Bayesian Framework for Face Recognition Chengjun Liu and
More informationEECS 275 Matrix Computation
EECS 275 Matrix Computation Ming-Hsuan Yang Electrical Engineering and Computer Science University of California at Merced Merced, CA 95344 http://faculty.ucmerced.edu/mhyang Lecture 6 1 / 22 Overview
More informationCMU-Q Lecture 24:
CMU-Q 15-381 Lecture 24: Supervised Learning 2 Teacher: Gianni A. Di Caro SUPERVISED LEARNING Hypotheses space Hypothesis function Labeled Given Errors Performance criteria Given a collection of input
More informationLinear Classifiers as Pattern Detectors
Intelligent Systems: Reasoning and Recognition James L. Crowley ENSIMAG 2 / MoSIG M1 Second Semester 2013/2014 Lesson 18 23 April 2014 Contents Linear Classifiers as Pattern Detectors Notation...2 Linear
More informationUnsupervised machine learning
Chapter 9 Unsupervised machine learning Unsupervised machine learning (a.k.a. cluster analysis) is a set of methods to assign objects into clusters under a predefined distance measure when class labels
More informationIntroduction to Machine Learning. PCA and Spectral Clustering. Introduction to Machine Learning, Slides: Eran Halperin
1 Introduction to Machine Learning PCA and Spectral Clustering Introduction to Machine Learning, 2013-14 Slides: Eran Halperin Singular Value Decomposition (SVD) The singular value decomposition (SVD)
More informationClassification and Pattern Recognition
Classification and Pattern Recognition Léon Bottou NEC Labs America COS 424 2/23/2010 The machine learning mix and match Goals Representation Capacity Control Operational Considerations Computational Considerations
More informationUniversität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen PCA. Tobias Scheffer
Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen PCA Tobias Scheffer Overview Principal Component Analysis (PCA) Kernel-PCA Fisher Linear Discriminant Analysis t-sne 2 PCA: Motivation
More informationMachine Learning Lecture 2
Announcements Machine Learning Lecture 2 Eceptional number of lecture participants this year Current count: 449 participants This is very nice, but it stretches our resources to their limits Probability
More informationLecture 6 Proof for JL Lemma and Linear Dimensionality Reduction
COMS 4995: Unsupervised Learning (Summer 18) June 7, 018 Lecture 6 Proof for JL Lemma and Linear imensionality Reduction Instructor: Nakul Verma Scribes: Ziyuan Zhong, Kirsten Blancato This lecture gives
More informationOutline Learning Machines and Data! Linear and nonlinear methods to compress
Machine Learning Tools for the Visualisation of BigData We all love images! Statutory warning: Has been produced in the same facility which produces maths and may contains traces of maths. Amit Kumar Mishra
More informationIntroduction to Statistical Inference
Structural Health Monitoring Using Statistical Pattern Recognition Introduction to Statistical Inference Presented by Charles R. Farrar, Ph.D., P.E. Outline Introduce statistical decision making for Structural
More informationDimensionality Reduction Using PCA/LDA. Hongyu Li School of Software Engineering TongJi University Fall, 2014
Dimensionality Reduction Using PCA/LDA Hongyu Li School of Software Engineering TongJi University Fall, 2014 Dimensionality Reduction One approach to deal with high dimensional data is by reducing their
More informationSolving Classification Problems By Knowledge Sets
Solving Classification Problems By Knowledge Sets Marcin Orchel a, a Department of Computer Science, AGH University of Science and Technology, Al. A. Mickiewicza 30, 30-059 Kraków, Poland Abstract We propose
More informationClassification: The rest of the story
U NIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN CS598 Machine Learning for Signal Processing Classification: The rest of the story 3 October 2017 Today s lecture Important things we haven t covered yet Fisher
More information