Pattern Classification


All materials in these slides were taken from Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher.

Chapter 5: Linear Discriminant Functions (Sections 5.1-5.3). Introduction. Linear Discriminant Functions and Decision Surfaces. Generalized Linear Discriminant Functions.

Introduction. In Ch. 3, the underlying probability densities were known (or given); the training sample was used to estimate the parameters of these probability densities (ML, MAP estimations). In this chapter, we only know the proper forms for the discriminant functions and use the samples to estimate the values of the parameters of the classifier. Goal: determining the discriminant functions (no knowledge of the underlying probability distributions is required). The discriminant functions are linear in x, or linear in functions of x. They may not be optimal, but they are very simple to use. Finding a linear discriminant function amounts to minimizing a criterion function; sample criterion functions are the sample risk and the training error (a small training error does not guarantee a small test error).

Linear discriminant functions and decision surfaces. A linear discriminant function has the form
g(x) = w^T x + w_0   (1)
where w is the weight vector and w_0 the bias. A two-category classifier with a discriminant function of the form (1) uses the following rule: decide ω_1 if g(x) > 0 and ω_2 if g(x) < 0, i.e. decide ω_1 if w^T x > -w_0 and ω_2 otherwise. If g(x) = 0, x is assigned to either class.
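As an illustration of rule (1), here is a minimal Python sketch of the two-category decision rule; the weight vector and bias below are made-up values, not taken from the slides.

```python
import numpy as np

def classify(x, w, w0):
    """Decide omega_1 if g(x) > 0, omega_2 if g(x) < 0, undefined if g(x) == 0."""
    g = np.dot(w, x) + w0
    if g > 0:
        return "omega_1"
    if g < 0:
        return "omega_2"
    return "either"  # x lies on the decision surface g(x) = 0

w = np.array([1.0, -2.0])   # hypothetical weight vector
w0 = 0.5                    # hypothetical bias
print(classify(np.array([3.0, 1.0]), w, w0))   # g = 3 - 2 + 0.5 = 1.5 > 0 -> omega_1
```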


The equation g(x) = 0 defines the decision surface that separates points assigned to the category ω_1 from points assigned to the category ω_2. When g(x) is linear, the decision surface is a hyperplane. There is an algebraic measure of the distance from x to the hyperplane (interesting result!).


Write x = x_p + r · w/||w||, where x_p is the projection of x onto the hyperplane H (since x - x_p is collinear with w and w/||w|| has unit norm). Since g(x_p) = 0,
g(x) = w^T x + w_0 = r ||w||, therefore r = g(x)/||w||;
in particular d(0, H) = w_0/||w||. In conclusion, a linear discriminant function divides the feature space by a hyperplane decision surface. The orientation of the surface is determined by the normal vector w, and the location of the surface is determined by the bias w_0.
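A short sketch of the distance result r = g(x)/||w||, using a hypothetical hyperplane:

```python
import numpy as np

def signed_distance(x, w, w0):
    """r = g(x) / ||w||: signed algebraic distance from x to the hyperplane g(x) = 0."""
    return (np.dot(w, x) + w0) / np.linalg.norm(w)

w, w0 = np.array([3.0, 4.0]), -5.0                    # hypothetical hyperplane 3*x1 + 4*x2 - 5 = 0
print(signed_distance(np.array([0.0, 0.0]), w, w0))   # d(0, H) = w0/||w|| = -1.0
```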

The multi-category case. We define c linear discriminant functions
g_i(x) = w_i^T x + w_i0, i = 1, ..., c,
and assign x to ω_i if g_i(x) > g_j(x) for all j ≠ i; in case of ties, the classification is undefined. In this case, the classifier is a linear machine. A linear machine divides the feature space into c decision regions, with g_i(x) being the largest discriminant if x is in the region R_i. For two contiguous regions R_i and R_j, the boundary that separates them is a portion of the hyperplane H_ij defined by:
g_i(x) = g_j(x), i.e. (w_i - w_j)^T x + (w_i0 - w_j0) = 0.
w_i - w_j is normal to H_ij, and d(x, H_ij) = (g_i(x) - g_j(x)) / ||w_i - w_j||.
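A minimal sketch of a linear machine: x is assigned to the class whose discriminant g_i(x) = w_i^T x + w_i0 is largest. The weight matrix and bias vector below are illustrative only.

```python
import numpy as np

def linear_machine(x, W, w0):
    """Assign x to omega_i with the largest g_i(x) = w_i^T x + w_i0."""
    g = W @ x + w0            # one discriminant value per class
    return int(np.argmax(g))

W = np.array([[ 1.0,  0.0],   # hypothetical weight vectors, one row per class
              [ 0.0,  1.0],
              [-1.0, -1.0]])
w0 = np.array([0.0, 0.0, 0.5])
print(linear_machine(np.array([2.0, 1.0]), W, w0))  # discriminants (2.0, 1.0, -2.5) -> class 0
```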


It is easy to show that the decision regions for a linear machine are convex; this restriction limits the flexibility and accuracy of the classifier.

Linear Discriminant Functions

Non-linear Discriminant Functions

Higher-Dimensional Space. Find a function Φ(x) to map x to a different space (of constructed features).

Generalized Linear Discriminant Functions. Decision boundaries which separate classes may not always be linear; the complexity of the boundaries may sometimes require the use of highly non-linear surfaces. A popular approach to generalize the concept of linear decision functions is to consider a generalized decision function of the form
g(x) = w_1 f_1(x) + w_2 f_2(x) + ... + w_N f_N(x) + w_{N+1}   (2)
where f_i(x), 1 ≤ i ≤ N, are scalar functions of the pattern x, x ∈ R^n (Euclidean space).

Introducing f_{N+1}(x) = 1, we get:
g(x) = Σ_{i=1}^{N+1} w_i f_i(x) = w^T f(x),
where w = (w_1, w_2, ..., w_N, w_{N+1})^T and f(x) = (f_1(x), f_2(x), ..., f_N(x), f_{N+1}(x))^T.
This latter representation of g(x) implies that any decision function defined by equation (2) can be treated as linear in the (N+1)-dimensional space (N+1 > n). g(x) maintains its non-linearity characteristics in R^n.

The most commonly used generalized decision function is g(x) for which the f_i(x) (1 ≤ i ≤ N) are polynomials:
g(x) = w̃^T f(x), where T denotes the vector transpose and w̃ is a new weight vector, which can be calculated from the original w and the original linear f_i(x), 1 ≤ i ≤ N.
Quadratic decision function for a 2-dimensional feature space:
g(x) = w_1 x_1^2 + w_2 x_1 x_2 + w_3 x_2^2 + w_4 x_1 + w_5 x_2 + w_6,
where w = (w_1, w_2, ..., w_6)^T and f(x) = (x_1^2, x_1 x_2, x_2^2, x_1, x_2, 1)^T.
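To make the 2-dimensional quadratic case concrete, here is a small sketch that builds f(x) = (x_1^2, x_1 x_2, x_2^2, x_1, x_2, 1)^T and evaluates g(x) = w^T f(x); the weights are hypothetical (they encode the circle x_1^2 + x_2^2 = 4).

```python
import numpy as np

def quad_features(x):
    """f(x) = (x1^2, x1*x2, x2^2, x1, x2, 1)^T for a 2-D pattern."""
    x1, x2 = x
    return np.array([x1**2, x1 * x2, x2**2, x1, x2, 1.0])

w = np.array([1.0, 0.0, 1.0, 0.0, 0.0, -4.0])   # hypothetical weights: g(x) = x1^2 + x2^2 - 4
x = np.array([1.0, 1.0])
print(np.dot(w, quad_features(x)))               # g(x) = -2.0: inside the circle of radius 2
```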

For patterns x ∈ R^n, the most general quadratic decision function is given by:
g(x) = Σ_{i=1}^{n} w_ii x_i^2 + Σ_{i=1}^{n-1} Σ_{j=i+1}^{n} w_ij x_i x_j + Σ_{i=1}^{n} w_i x_i + w_{n+1}   (3)
The number of terms on the right-hand side is
l = N + 1 = n + n(n-1)/2 + n + 1 = (n+1)(n+2)/2.
This is the total number of weights, which are the free parameters of the problem. If for example n = 3, the vector f(x) is 10-dimensional; if n = 10, the vector f(x) is 66-dimensional.

In the case of polynomial decision functions of order m, a typical f_i(x) is given by:
f_i(x) = x_{i_1}^{e_1} x_{i_2}^{e_2} ... x_{i_m}^{e_m},
where 1 ≤ i_1, i_2, ..., i_m ≤ n and each e_j is 0 or 1. It is a polynomial with a degree between 0 and m. To avoid repetitions, we require i_1 ≤ i_2 ≤ ... ≤ i_m.
g^m(x) = Σ_{i_1=1}^{n} Σ_{i_2=i_1}^{n} ... Σ_{i_m=i_{m-1}}^{n} w_{i_1 i_2 ... i_m} x_{i_1} x_{i_2} ... x_{i_m} + g^{m-1}(x)
(where g^0(x) = w_{n+1}) is the most general polynomial decision function of order m.

Example 1: let n = 3 and m = 2; then
g^2(x) = w_11 x_1^2 + w_12 x_1 x_2 + w_13 x_1 x_3 + w_22 x_2^2 + w_23 x_2 x_3 + w_33 x_3^2 + w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4.
Example 2: let n = 2 and m = 3; then
g^3(x) = w_111 x_1^3 + w_112 x_1^2 x_2 + w_122 x_1 x_2^2 + w_222 x_2^3 + g^2(x),
where g^2(x) = w_11 x_1^2 + w_12 x_1 x_2 + w_22 x_2^2 + g^1(x) and g^1(x) = w_1 x_1 + w_2 x_2 + w_3.
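The monomials of the order-m polynomial discriminant (with i_1 ≤ i_2 ≤ ... ≤ i_m) can be enumerated mechanically. The sketch below does this for the n = 2, m = 3 case of Example 2; it only illustrates the counting of terms, not any particular weight choice.

```python
import numpy as np
from itertools import combinations_with_replacement

def poly_features(x, m):
    """All monomials x_{i1} x_{i2} ... x_{ik}, i1 <= i2 <= ... <= ik, k = 0..m,
    following the 'no repetitions' ordering convention above."""
    feats = [1.0]                                        # the g^0 (constant) term
    for k in range(1, m + 1):
        for idx in combinations_with_replacement(range(len(x)), k):
            feats.append(np.prod([x[i] for i in idx]))
    return np.array(feats)

x = np.array([2.0, 3.0])
print(poly_features(x, 3))    # n = 2, m = 3: 10 monomials, including the constant
```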

The commonly used quadratic decision function can be represented as the general n-dimensional quadratic surface:
g(x) = x^T A x + x^T b + c,
where the matrix A = (a_ij), the vector b = (b_1, b_2, ..., b_n)^T and c depend on the weights w_ii, w_ij, w_i of equation (3). If A is positive definite, then the decision function is a hyperellipsoid with axes in the directions of the eigenvectors of A. In particular, if A = I_n (the identity), the decision function is simply the n-dimensional hypersphere.
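Since the eigenstructure of A determines the shape of the surface, a quick numerical check of definiteness might look like this (A is an arbitrary illustrative matrix, not from the slides):

```python
import numpy as np

A = np.array([[2.0, 0.5],       # hypothetical symmetric matrix built from the weights w_ij
              [0.5, 1.0]])
eigvals, eigvecs = np.linalg.eigh(A)
print(eigvals)                  # all positive -> positive definite -> hyperellipsoid
print(eigvecs)                  # columns give the directions of the ellipsoid axes
```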

If A is negative definite, the decision function describes a hyperhyperboloid. In conclusion: it is only the matrix A which determines the shape and characteristics of the decision function.

Objective Functions J: Linear Separability; Perceptron; Relaxation Procedures; Convergence; No Linear Separability; Min-Squared Error; Misclassified Samples; All Samples.

Linear Separability

Two-Category Linear Case

Terminology. Weight vector: a ∈ A, where A is the weight space. Solutions are not unique. Normalized: ||a|| = 1. Margin: a^T y ≥ b.

Margin b > 0

Gradient Descent

Newton Descent


Perceptron

As a Neural Network

Relaxation Algorithm


Mean-Squared Error vs. Error-Correcting Procedures. Error-correcting procedures (Perceptron, relaxation) work on separable samples; on nonseparable samples their corrections will NEVER converge, so heuristic modifications are needed. MSE gives acceptable performance on nonseparable samples, but there is no guarantee that the solution is a separating vector.

Min-Squared Error

MSE and the Pseudo-Inverse. Direct relation to Fisher's linear discriminant. If b = 1, MSE gives an approximation to the Bayes discriminant function:
g(x) = a^T y(x) ≈ g_0(x) = P(ω_1|x) - P(ω_2|x).
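A minimal sketch of the MSE solution a = Y^+ b computed by least squares; Y and b here are random placeholders, not data from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=(20, 3))        # hypothetical sample matrix (one row per sample y_i)
b = np.ones(20)                     # margin vector b = 1

a, *_ = np.linalg.lstsq(Y, b, rcond=None)   # a = Y^+ b, the MSE solution
print(a)
print(np.linalg.norm(Y @ a - b))    # residual ||Ya - b||
```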


Widrow-Hoff or LMS. Minimize ||Ya - b||^2 using gradient descent; this avoids problems with a singular pseudo-inverse and with inversion of large matrices.
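A sketch of the Widrow-Hoff/LMS update a ← a + η (b_k − a^T y_k) y_k; the learning rate, epoch count, and data are illustrative assumptions.

```python
import numpy as np

def lms(Y, b, eta=0.01, epochs=100):
    """Widrow-Hoff / LMS: sequential gradient descent on ||Ya - b||^2,
    avoiding the pseudo-inverse.  eta and epochs are illustrative choices."""
    a = np.zeros(Y.shape[1])
    for _ in range(epochs):
        for y_k, b_k in zip(Y, b):
            a += eta * (b_k - a @ y_k) * y_k   # a <- a + eta (b_k - a^T y_k) y_k
    return a

rng = np.random.default_rng(1)
Y = rng.normal(size=(20, 3))
b = np.ones(20)
print(lms(Y, b))
```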

Ho-Kashyap Procedure. If the training samples are linearly separable, there exist â and b̂ with b̂ > 0 such that Yâ = b̂. Gradient descent with respect to both a and b.
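A minimal sketch of the Ho-Kashyap iteration: b is updated through the positive part of the error e = Ya − b, then a is re-solved via the pseudo-inverse. Step size, iteration limit, tolerance, and data are assumptions for illustration.

```python
import numpy as np

def ho_kashyap(Y, eta=0.1, iters=1000, eps=1e-6):
    """Ho-Kashyap: adjust both a and b (with b kept positive).  If the samples are
    linearly separable, the error e = Ya - b is driven to zero; otherwise e stops
    decreasing, which is evidence of nonseparability."""
    n = Y.shape[0]
    b = np.ones(n)
    Y_pinv = np.linalg.pinv(Y)
    a = Y_pinv @ b
    for _ in range(iters):
        e = Y @ a - b
        if np.all(np.abs(e) < eps):
            break                            # separating vector found
        e_plus = 0.5 * (e + np.abs(e))       # keep only the positive error components
        b += 2 * eta * e_plus                # b only grows, so it stays positive
        a = Y_pinv @ b
    return a, b

rng = np.random.default_rng(2)
Y = rng.normal(size=(20, 3))
print(ho_kashyap(Y)[0])
```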

Support Vector Machines. Map x to a higher-dimensional space.

Support Vector Machine (SVM). The decision surface is a hyperplane in feature space. In summary: map the data to a predetermined very high-dimensional space via a kernel function; find the hyperplane that maximizes the margin between the two classes; if the data are not separable, find the hyperplane that maximizes the margin and minimizes (a weighted average of) the misclassifications.
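For a concrete feel of this summary, here is a hedged sketch using scikit-learn's SVC (not part of the slides): the RBF kernel plays the role of the implicit mapping to a high-dimensional space, and the parameter C weights the misclassification term. The data are synthetic.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)   # not linearly separable in R^2

clf = SVC(kernel="rbf", C=1.0)      # kernel + misclassification cost: the two parameters to select
clf.fit(X, y)
print(clf.support_vectors_.shape)   # the samples that determine the margin
```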

Separating Hyperplanes?

Idea: Maximize the Margin. Select the separating hyperplane that maximizes the margin width!

Support Vectors. (Figure: support vectors and margin width.)

Constrained Optimization. The margin is bounded by the hyperplanes w·x + b = k and w·x + b = -k (with the separating hyperplane w·x + b = 0 between them), and its width is 2k/||w||. Optimization problem:
max 2k/||w||
s.t. (w·x_i + b) ≥ k for every x_i of class 1,
     (w·x_i + b) ≤ -k for every x_i of class 2.

Constrained Quadratic Optimization. If class 1 corresponds to +1 and class 2 corresponds to -1, we can rewrite the constraints as
(w·x_i + b) ≥ 1 for x_i with y_i = +1,
(w·x_i + b) ≤ -1 for x_i with y_i = -1,
i.e. y_i (w·x_i + b) ≥ 1 for all x_i.
So the problem becomes:
max 2/||w|| s.t. y_i (w·x_i + b) ≥ 1, or equivalently min ||w||^2 / 2 s.t. y_i (w·x_i + b) ≥ 1.
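The primal problem min ||w||^2/2 s.t. y_i (w·x_i + b) ≥ 1 can be handed to a generic constrained solver on a toy data set; the four points below are made up and assumed linearly separable.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical linearly separable 2-D data, labels y_i in {+1, -1}.
X = np.array([[2.0, 2.0], [2.5, 1.5], [0.0, 0.0], [-0.5, 0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(v):                 # v = (w1, w2, b); minimize ||w||^2 / 2
    return 0.5 * np.dot(v[:2], v[:2])

constraints = [{"type": "ineq",
                "fun": lambda v, x=x_i, t=t_i: t * (np.dot(v[:2], x) + v[2]) - 1.0}
               for x_i, t_i in zip(X, y)]   # y_i (w.x_i + b) - 1 >= 0

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
print(res.x)                      # (w1, w2, b) of the maximal-margin hyperplane
```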

Comparison: SVM vs. NN. SVMs: the kernel maps to a very high-dimensional space; the search space has a unique minimum; training is extremely efficient; classification is extremely efficient; kernel and cost are the two parameters to select; very good accuracy in typical domains; extremely robust. Neural Networks: hidden layers map to lower-dimensional spaces; the search space has multiple local minima; training is expensive; classification is extremely efficient; requires choosing the number of hidden units and layers; very good accuracy in typical domains.

General Summary. Perceptron and relaxation procedures: do not converge on nonseparable data. MSE: works regardless of separability, but there is no guarantee of separation without error. Ho-Kashyap: provides a separating vector, or else evidence of nonseparability (no bound on the number of steps required).