LECTURE 2: Linear and quadratic classifiers


LECTURE 2: Linear and quadratic classifiers

Part 1: Bayesian Decision Theory
- The Likelihood Ratio Test
- Maximum A Posteriori and Maximum Likelihood
- Discriminant functions

Part 2: Quadratic classifiers
- Bayes classifiers for normally distributed classes
- Euclidean and Mahalanobis distance classifiers
- Numerical example

Part 3: Linear classifiers
- Gradient descent
- The perceptron rule
- The pseudo-inverse solution
- Least mean squares

Introduction to Pattern Analysis, Ricardo Gutierrez-Osuna, Texas A&M University

Part 1: Bayesian Decision Theory

The Likelihood Ratio Test (1)

Assume we are to classify an object based on the evidence provided by a measurement (or feature vector) x. Would you agree that a reasonable decision rule would be the following?

"Choose the class that is most probable given the observed feature vector x."

More formally: evaluate the posterior probability of each class, $P(\omega_i|x)$, and choose the class with the largest $P(\omega_i|x)$.

The Likelihood Ratio Test (2)

Let us examine this decision rule for a two-class problem. In this case the decision rule becomes

    if $P(\omega_1|x) > P(\omega_2|x)$ choose $\omega_1$, else choose $\omega_2$

or, in more compact form,

    $P(\omega_1|x) \gtrless^{\omega_1}_{\omega_2} P(\omega_2|x)$

Applying Bayes' theorem,

    $\frac{P(x|\omega_1)P(\omega_1)}{P(x)} \gtrless^{\omega_1}_{\omega_2} \frac{P(x|\omega_2)P(\omega_2)}{P(x)}$

The Likelihood Ratio Test (3)

$P(x)$ does not affect the decision rule, so it can be eliminated. Rearranging the previous expression,

    $\Lambda(x) = \frac{P(x|\omega_1)}{P(x|\omega_2)} \gtrless^{\omega_1}_{\omega_2} \frac{P(\omega_2)}{P(\omega_1)}$

The term $\Lambda(x)$ is called the likelihood ratio, and the decision rule is known as the likelihood ratio test.

Likelihood Ratio Test: an example (1)

Given a classification problem with the following class-conditional densities:

    $P(x|\omega_1) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(x-4)^2}{2}}$    $P(x|\omega_2) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(x-10)^2}{2}}$

[Figure: the two likelihoods $P(x|\omega_1)$ and $P(x|\omega_2)$, centered at x = 4 and x = 10.]

Derive a classification rule based on the Likelihood Ratio Test (assume equal priors).

Likelihood Ratio Test: an example (2)

Solution. Substituting the given likelihoods and priors into the LRT expression:

    $\Lambda(x) = \frac{e^{-\frac{(x-4)^2}{2}}}{e^{-\frac{(x-10)^2}{2}}} \gtrless^{\omega_1}_{\omega_2} 1$

Simplifying, changing signs, and taking logs:

    $(x-4)^2 - (x-10)^2 \lessgtr^{\omega_1}_{\omega_2} 0$

which yields

    $x \lessgtr^{\omega_1}_{\omega_2} 7$

This LRT result makes intuitive sense, since the two likelihoods are identical and differ only in their mean value.

[Figure: the two likelihoods, with decision regions $R_1$ (say $\omega_1$) for x < 7 and $R_2$ (say $\omega_2$) for x > 7.]
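As a quick sanity check, the rule can be evaluated numerically. The sketch below is our own illustration (not part of the original slides; the function name gaussian_pdf is ours): it computes the likelihood ratio on a grid of x values and confirms that the decision flips at x = 7.

    import numpy as np

    def gaussian_pdf(x, mean, var=1.0):
        """Univariate Gaussian density N(mean, var)."""
        return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

    x = np.linspace(0.0, 14.0, 1401)
    lam = gaussian_pdf(x, 4.0) / gaussian_pdf(x, 10.0)  # likelihood ratio
    choose_w1 = lam > 1.0                               # equal priors: threshold 1

    # the decision should flip exactly at the midpoint of the two means
    print(x[np.argmin(choose_w1)])                      # -> 7.0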

The probability of error

The probability of error is the probability of assigning x to the wrong class. For a two-class problem, $P[\text{error}|x]$ is simply

    $P(\text{error}|x) = \begin{cases} P(\omega_1|x) & \text{if we decide } \omega_2 \\ P(\omega_2|x) & \text{if we decide } \omega_1 \end{cases}$

It makes sense that the classification rule be designed to minimize the average probability of error $P[\text{error}]$ across all possible values of x:

    $P(\text{error}) = \int_{-\infty}^{+\infty} P(\text{error}, x)\,dx = \int_{-\infty}^{+\infty} P(\text{error}|x)\,P(x)\,dx$

To minimize $P(\text{error})$ we minimize the integrand $P(\text{error}|x)$ at each x: choose the class with the maximum posterior $P(\omega_i|x)$. This is called the MAXIMUM A POSTERIORI (MAP) rule.

Minimizing the probability of error

We prove the optimality of the MAP rule graphically:
- The right plot shows the posterior for each of the two classes.
- The bottom plots show $P(\text{error})$ for the MAP rule and for an alternative decision rule. Which one has lower $P(\text{error})$ (color-filled area)?

[Figure: the two posteriors $P(\omega|x)$; below, the decision regions of THE MAP RULE (Choose RED / Choose BLUE / Choose RED) and THE OTHER RULE, each with its shaded error area.]

The Bayes Risk (1)

So far we have assumed that the penalty of misclassifying $\omega_1$ as $\omega_2$ is the same as the reciprocal. In general, this is not the case:
- Misclassifications in the fish-sorting problem lead to different costs.
- Medical diagnostic errors are very asymmetric.

We capture this concept with a cost function $C_{ij}$, where $C_{ij}$ represents the cost of choosing class $\omega_i$ when class $\omega_j$ is the true class, and define the Bayes Risk as the expected value of the cost:

    $\mathfrak{R} = E[C] = \sum_{i=1}^{2}\sum_{j=1}^{2} C_{ij}\, P[\text{choose } \omega_i \text{ and } x \in \omega_j] = \sum_{i=1}^{2}\sum_{j=1}^{2} C_{ij}\, P[x \in R_i | \omega_j]\, P[\omega_j]$

The Bayes Risk (2)

What is the decision rule that minimizes the Bayes Risk? It can be shown* that the minimum risk is achieved by the following decision rule:

    $\frac{P(x|\omega_1)}{P(x|\omega_2)} \gtrless^{\omega_1}_{\omega_2} \frac{(C_{12} - C_{22})\, P[\omega_2]}{(C_{21} - C_{11})\, P[\omega_1]}$

(*For an intuitive proof, visit my lecture notes at TAMU.)

Notice any similarities with the LRT?

The Bayes Risk: an example (1)

Consider a classification problem with two classes defined by the following likelihood functions:

    $P(x|\omega_1) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(x-2)^2}{2}}$    $P(x|\omega_2) = \frac{1}{\sqrt{2\pi\cdot 3}}\, e^{-\frac{x^2}{2\cdot 3}}$

[Figure: the two likelihoods plotted over the range -6 < x < 6.]

What is the decision rule that minimizes the Bayes risk? Assume $P[\omega_1] = P[\omega_2] = 0.5$, $C_{11} = C_{22} = 0$, $C_{12} = 1$ and $C_{21} = 3^{-1/2}$.

The Bayes Risk: an example (2)

    $\Lambda(x) = \frac{\frac{1}{\sqrt{2\pi}}\, e^{-\frac{(x-2)^2}{2}}}{\frac{1}{\sqrt{2\pi\cdot 3}}\, e^{-\frac{x^2}{6}}} = \sqrt{3}\, e^{\frac{x^2}{6} - \frac{(x-2)^2}{2}} \gtrless^{\omega_1}_{\omega_2} \frac{(C_{12}-C_{22})\,P[\omega_2]}{(C_{21}-C_{11})\,P[\omega_1]} = \sqrt{3}$

Taking logs and simplifying:

    $\frac{x^2}{6} - \frac{(x-2)^2}{2} \gtrless^{\omega_1}_{\omega_2} 0 \;\Rightarrow\; -2x^2 + 12x - 12 \gtrless^{\omega_1}_{\omega_2} 0 \;\Rightarrow\; x^2 - 6x + 6 \lessgtr^{\omega_1}_{\omega_2} 0$

which yields the boundaries $x = 3 \pm \sqrt{3} = 4.73,\ 1.27$: choose $\omega_1$ for $1.27 < x < 4.73$, and $\omega_2$ otherwise.

[Figure: the two likelihoods with the resulting decision regions $R_2$ (x < 1.27), $R_1$ (1.27 < x < 4.73), $R_2$ (x > 4.73).]
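The boundary values can be checked numerically: at $x = 3 \pm \sqrt{3}$ the likelihood ratio should equal the threshold $\sqrt{3}$. A minimal sketch (our own, not from the slides), assuming the densities and costs reconstructed above:

    import numpy as np

    def pdf(x, mean, var):
        return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

    threshold = np.sqrt(3.0)   # (C12-C22) P[w2] / ((C21-C11) P[w1])
    for x in (3 - np.sqrt(3), 3 + np.sqrt(3)):       # 1.27 and 4.73
        lam = pdf(x, 2.0, 1.0) / pdf(x, 0.0, 3.0)    # likelihood ratio
        print(x, lam, np.isclose(lam, threshold))    # True at both boundaries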

Variations of the LRT (1)

The LRT that minimizes the Bayes Risk is called the Bayes Criterion:

    $\Lambda(x) = \frac{P(x|\omega_1)}{P(x|\omega_2)} \gtrless^{\omega_1}_{\omega_2} \frac{(C_{12}-C_{22})\,P[\omega_2]}{(C_{21}-C_{11})\,P[\omega_1]}$

Maximum A Posteriori Criterion: sometimes we will be interested in minimizing $P[\text{error}]$, which is a special case of the Bayes Criterion if we use a zero-one cost function:

    $C_{ij} = \begin{cases} 0 & i = j \\ 1 & i \neq j \end{cases} \;\Rightarrow\; \Lambda(x) = \frac{P(x|\omega_1)}{P(x|\omega_2)} \gtrless^{\omega_1}_{\omega_2} \frac{P(\omega_2)}{P(\omega_1)} \;\Leftrightarrow\; P(\omega_1|x) \gtrless^{\omega_1}_{\omega_2} P(\omega_2|x)$

Variations of the LRT (2)

Maximum Likelihood: finally, the simplest form of the LRT is obtained for the case of equal priors $P[\omega_i] = 1/2$ and a zero-one cost function:

    $P(\omega_i) = \frac{1}{2},\quad C_{ij} = \begin{cases} 0 & i = j \\ 1 & i \neq j \end{cases} \;\Rightarrow\; \Lambda(x) = \frac{P(x|\omega_1)}{P(x|\omega_2)} \gtrless^{\omega_1}_{\omega_2} 1$

When would you want to use an ML criterion?

Multi-class problems

The previous decision rules were derived for two-class problems, but they generalize gracefully to multiple classes:
- To minimize $P[\text{error}]$, choose the class with the highest posterior $P[\omega_i|x]$: $\omega_i = \arg\max_{i=1..C} P(\omega_i|x)$
- To minimize the Bayes risk, choose the class with the lowest conditional risk $R[\omega_i|x]$: $\omega_i = \arg\min_{i=1..C} R(\omega_i|x) = \arg\min_{i=1..C} \sum_{j=1}^{C} C_{ij}\, P(\omega_j|x)$

Discriminant functions (1)

All these decision rules have the same structure: at each point x in feature space, choose the class $\omega_i$ which maximizes (or minimizes) some measure $g_i(x)$.
- This structure can be formalized with a set of discriminant functions $g_i(x)$, $i = 1..C$, and the decision rule "assign x to class $\omega_i$ if $g_i(x) > g_j(x)\ \forall j \neq i$".
- We can then express the three basic decision rules (Bayes, MAP and ML) in terms of discriminant functions:

    Criterion | Discriminant function
    Bayes     | $g_i(x) = -R(\alpha_i|x)$
    MAP       | $g_i(x) = P(\omega_i|x)$
    ML        | $g_i(x) = P(x|\omega_i)$

Discriminant functions (2)

Therefore, we can visualize the decision rule as a network that computes C discriminant functions and selects the class corresponding to the largest discriminant.

[Figure: a network with features $x_1, x_2, x_3, \ldots, x_d$ at the bottom, discriminant functions $g_1(x), g_2(x), \ldots, g_C(x)$ above them (optionally combined with costs), a "select max" stage, and the class assignment at the top.]

Recapping

- The LRT is a theoretical result that can only be applied if we have complete knowledge of the likelihoods $P[x|\omega_i]$.
- $P[x|\omega_i]$ is generally unknown, but can be estimated from data.
- If the form of the likelihood is known (e.g., Gaussian), the problem is simplified because we only need to estimate the parameters of the model (e.g., mean and covariance). This leads to a classifier known as QUADRATIC, which we cover next.
- If the form of the likelihood is unknown, the problem becomes much harder and requires a technique known as non-parametric density estimation. This technique is covered in Lecture 3.

Part 2: Quadratic classifiers

Bayes classifier for Gaussian classes (1)

For normally distributed classes, the discriminant functions reduce to very simple expressions. The (multivariate) Gaussian density can be defined as

    $p(x) = \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}} \exp\!\left[-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right]$

Using Bayes' rule, the MAP discriminant function can be written as

    $g_i(x) = P(\omega_i|x) = \frac{P(x|\omega_i)\,P(\omega_i)}{P(x)} = \frac{1}{(2\pi)^{n/2}\,|\Sigma_i|^{1/2}} \exp\!\left[-\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i)\right] \frac{P(\omega_i)}{P(x)}$

Bayes classifier for Gaussian classes (2)

Eliminating constant terms:

    $g_i(x) = |\Sigma_i|^{-1/2} \exp\!\left[-\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i)\right] P(\omega_i)$

Taking logs:

    $g_i(x) = -\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) - \frac{1}{2}\log\!\left(|\Sigma_i|\right) + \log\!\left(P(\omega_i)\right)$

This is known as a QUADRATIC discriminant function (because it is a function of the square of x). In the next few slides we will analyze what happens to this expression under different assumptions about the covariance.
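The quadratic discriminant above translates almost line for line into code. Below is a minimal sketch (our own illustration; the function name quadratic_discriminant is not from the slides) that evaluates $g_i(x)$ for one class:

    import numpy as np

    def quadratic_discriminant(x, mu, sigma, prior):
        """g_i(x) = -1/2 (x-mu)' inv(Sigma) (x-mu) - 1/2 log|Sigma| + log P(w_i)."""
        d = x - mu
        return (-0.5 * d @ np.linalg.solve(sigma, d)
                - 0.5 * np.log(np.linalg.det(sigma))
                + np.log(prior))

    # Classification: evaluate g_i for every class and pick the argmax, e.g.
    # label = np.argmax([quadratic_discriminant(x, mu_i, S_i, p_i)
    #                    for mu_i, S_i, p_i in classes])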

Case 1: $\Sigma_i = \sigma^2 I$ (1)

This situation occurs when the features are statistically independent and have the same variance for all classes. In this case, the quadratic discriminant function becomes

    $g_i(x) = -\frac{1}{2}(x-\mu_i)^T (\sigma^2 I)^{-1} (x-\mu_i) - \frac{1}{2}\log\!\left(|\sigma^2 I|\right) + \log\!\left(P(\omega_i)\right) = -\frac{1}{2\sigma^2}(x-\mu_i)^T (x-\mu_i) + \log\!\left(P(\omega_i)\right)$

Assuming equal priors and dropping constant terms:

    $g_i(x) = -(x-\mu_i)^T (x-\mu_i) = -\sum_{k=1}^{DIM} (x_k - \mu_{i,k})^2$

This is called a Euclidean-distance or nearest-mean classifier. From [Schalkoff, 1992]

Case 1: $\Sigma_i = \sigma^2 I$ (2)

This is probably the simplest statistical classifier that you can build: assign an unknown example to the class whose center is the closest, using the Euclidean distance.

[Figure: x is fed to C Euclidean-distance units, one per mean $\mu_1 \ldots \mu_C$; a minimum selector outputs the class.]
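A nearest-mean classifier is only a few lines of code. The sketch below is our own illustration (names are ours): it assigns each sample to the class with the closest mean in Euclidean distance.

    import numpy as np

    def nearest_mean_classify(X, means):
        """X: (N, d) samples; means: (C, d) class centers.
        Returns the index of the closest mean for each sample."""
        # squared Euclidean distance of every sample to every mean: (N, C)
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)

    means = np.array([[0.0, 0.0], [4.0, 4.0]])
    X = np.array([[0.5, 1.0], [3.0, 3.5]])
    print(nearest_mean_classify(X, means))   # -> [0 1]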

Case 1: $\Sigma_i = \sigma^2 I$, example

[Figure: a three-class 2D example with equal spherical covariances $\Sigma_1 = \Sigma_2 = \Sigma_3 \propto I$; the Gaussian contours are circles, and the decision boundaries are straight lines, the perpendicular bisectors of the segments joining the class means $\mu_1$, $\mu_2$, $\mu_3$.]

Case 2: $\Sigma_i = \Sigma$ ($\Sigma$ non-diagonal)

All the classes have the same covariance matrix, but the matrix is not diagonal. In this case, the quadratic discriminant becomes

    $g_i(x) = -\frac{1}{2}(x-\mu_i)^T \Sigma^{-1} (x-\mu_i) - \frac{1}{2}\log\!\left(|\Sigma|\right) + \log\!\left(P(\omega_i)\right)$

Assuming equal priors and eliminating constant terms:

    $g_i(x) = -(x-\mu_i)^T \Sigma^{-1} (x-\mu_i)$

This is known as a Mahalanobis-distance classifier.

[Figure: x is fed to C Mahalanobis-distance units (all sharing $\Sigma$), one per mean $\mu_1 \ldots \mu_C$; a minimum selector outputs the class.]

The Mahalanobis distance

The quadratic term is called the Mahalanobis distance, a very important metric in statistical pattern recognition. The Mahalanobis metric is a vector distance that uses a $\Sigma^{-1}$ norm; $\Sigma^{-1}$ can be thought of as a stretching factor on the space. Note that for an identity covariance matrix ($\Sigma = I$), the Mahalanobis distance becomes the familiar Euclidean distance.

[Figure: a mean $\mu$ with two iso-distance contours: a circle of constant Euclidean distance $\|x-\mu\| = K$ and an ellipse of constant Mahalanobis distance $\|x-\mu\|_{\Sigma^{-1}} = K$.]
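A sketch of the Mahalanobis distance and the corresponding minimum-distance classifier, assuming a shared covariance $\Sigma$ (our own illustration; names and numbers are ours):

    import numpy as np

    def mahalanobis2(x, mu, sigma):
        """Squared Mahalanobis distance (x-mu)' inv(Sigma) (x-mu)."""
        d = x - mu
        return float(d @ np.linalg.solve(sigma, d))

    sigma = np.array([[1.0, 0.7], [0.7, 1.0]])   # shared, non-diagonal covariance
    means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]

    x = np.array([1.0, 2.0])
    dists = [mahalanobis2(x, mu, sigma) for mu in means]
    print(int(np.argmin(dists)))   # class with the smallest Mahalanobis distance

    # sanity check: with Sigma = I it reduces to the squared Euclidean distance
    assert np.isclose(mahalanobis2(x, means[0], np.eye(2)), (x ** 2).sum())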

Case 2: $\Sigma_i = \Sigma$ ($\Sigma$ non-diagonal), example

[Figure: a three-class 2D example with a shared non-diagonal covariance (off-diagonal terms 0.7); the Gaussian contours are equally oriented ellipses, and the decision boundaries are still straight lines.]

Case 3: $\Sigma_i \neq \Sigma_j$, general case, example

[Figure: a three-class 2D example in which each class has a different covariance matrix; the decision boundaries are quadratic. A zoomed-out view shows how the quadratic boundaries behave far from the data.]

Numerical example (1)

Derive a linear discriminant function for the two-class 3D classification problem defined by

    $\mu_1 = [0\ \ 0\ \ 0]^T,\quad \mu_2 = [1\ \ 1\ \ 1]^T,\quad \Sigma_1 = \Sigma_2 = \begin{bmatrix} 1/4 & 0 & 0 \\ 0 & 1/4 & 0 \\ 0 & 0 & 1/4 \end{bmatrix},\quad p(\omega_2) = 2\,p(\omega_1)$

Would anybody dare to sketch the likelihood densities and the decision boundary for this problem?

Numerical example (2)

Solution. Starting from the quadratic discriminant

    $g_i(x) = -\frac{1}{2}(x-\mu_i)^T \Sigma_i^{-1} (x-\mu_i) - \frac{1}{2}\log|\Sigma_i| + \log P(\omega_i)$

and noting that the $-\frac{1}{2}\log|\Sigma_i|$ terms are equal for both classes and cancel:

    $g_1(x) = -\frac{1}{2}\begin{bmatrix} x & y & z \end{bmatrix} \begin{bmatrix} 4 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & 4 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} + \log\frac{1}{3}$

    $g_2(x) = -\frac{1}{2}\begin{bmatrix} x-1 & y-1 & z-1 \end{bmatrix} \begin{bmatrix} 4 & 0 & 0 \\ 0 & 4 & 0 \\ 0 & 0 & 4 \end{bmatrix} \begin{bmatrix} x-1 \\ y-1 \\ z-1 \end{bmatrix} + \log\frac{2}{3}$

Numerical example (3)

Solution (continued):

    $g_1(x) > g_2(x) \;\Leftrightarrow\; -2\left(x^2+y^2+z^2\right) + \log\frac{1}{3} > -2\left[(x-1)^2+(y-1)^2+(z-1)^2\right] + \log\frac{2}{3}$

    $\Leftrightarrow\; x + y + z \lessgtr^{\omega_1}_{\omega_2} \frac{6 - \log 2}{4} \approx 1.33$

Classify the test example $x_u = [0.1\ \ 0.7\ \ 0.8]^T$:

    $0.1 + 0.7 + 0.8 = 1.6 > 1.33 \;\Rightarrow\; x_u \in \omega_2$
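The worked example can also be verified directly from the quadratic discriminant, without the algebraic simplification. A minimal check (our own sketch, using the priors $P(\omega_1) = 1/3$, $P(\omega_2) = 2/3$ reconstructed above):

    import numpy as np

    mu = [np.zeros(3), np.ones(3)]
    sigma = np.eye(3) / 4.0
    priors = [1.0 / 3.0, 2.0 / 3.0]

    def g(x, mu_i, prior):
        d = x - mu_i
        return (-0.5 * d @ np.linalg.solve(sigma, d)
                - 0.5 * np.log(np.linalg.det(sigma)) + np.log(prior))

    x_u = np.array([0.1, 0.7, 0.8])
    print(g(x_u, mu[0], priors[0]), g(x_u, mu[1], priors[1]))  # g2 > g1 -> w2
    print((6 - np.log(2)) / 4)   # threshold on x+y+z, ~1.33 (the sum is 1.6)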

Conclusions

- The Euclidean-distance classifier is Bayes-optimal* for Gaussian classes with equal covariance matrices proportional to the identity matrix and equal priors.
- The Mahalanobis-distance classifier is Bayes-optimal for Gaussian classes with equal covariance matrices and equal priors.

*Bayes-optimal means that the classifier yields the minimum P[error], which is the best ANY classifier can achieve.

Part 3: Linear classifiers

Linear Discriminant Functions (1)

The objective of this section is to present methods for learning linear discriminant functions of the form

    $g(x) = w^T x + w_0; \qquad g(x) > 0 \Rightarrow x \in \omega_1, \quad g(x) < 0 \Rightarrow x \in \omega_2$

where w is the weight vector and $w_0$ is the threshold or bias.

[Figure: a separating hyperplane $w^T x + w_0 = 0$ in feature space, with $w^T x + w_0 > 0$ on one side and $w^T x + w_0 < 0$ on the other; w is normal to the hyperplane.]

Similar discriminant functions were derived in the previous section as a special case of the quadratic classifier. In this chapter, the discriminant functions will be derived in a non-parametric fashion; that is, no assumptions will be made about the underlying densities.

Linear Discriminant Functions (2)

For convenience, we will focus on binary classification. Extension to the multi-category case can be easily achieved by:
- Using $\omega_i$ / not-$\omega_i$ dichotomies
- Using $\omega_i$ / $\omega_j$ dichotomies

Gradient descent (1)

Gradient descent is a general method for function minimization. From basic calculus, we know that the minimum of a function J(x) is defined by the zeros of its gradient:

    $\nabla_x J(x) = 0 \;\Rightarrow\; x^* = \arg\min_x J(x)$

- Only in very special cases does this minimization problem have a closed-form solution.
- In some other cases, a closed-form solution may exist, but it is numerically ill-posed or impractical (e.g., memory requirements).

Gradient descent (2)

Gradient descent finds the minimum in an iterative fashion by moving in the direction of steepest descent:

1. Start with an arbitrary solution x(0).
2. Compute the gradient $\nabla_x J(x(k))$.
3. Move in the direction of steepest descent: $x(k+1) = x(k) - \eta\, \nabla_x J(x(k))$, where $\eta$ is a learning rate.
4. Go to 2 (until convergence).

[Figure: a one-dimensional J with a local and a global minimum; from the initial guess, where the slope is negative the update moves right, where the slope is positive it moves left.]
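A minimal sketch of this procedure (our own illustration; the function and its gradient are a toy example we chose, not from the slides):

    import numpy as np

    def gradient_descent(grad, x0, eta=0.1, n_iters=100):
        """Iterate x(k+1) = x(k) - eta * grad(x(k))."""
        x = np.asarray(x0, dtype=float)
        for _ in range(n_iters):
            x = x - eta * grad(x)
        return x

    # Example: J(x) = ||x - c||^2 has gradient 2(x - c) and minimum at c
    c = np.array([3.0, -1.0])
    x_star = gradient_descent(lambda x: 2 * (x - c), x0=[0.0, 0.0])
    print(x_star)   # -> approximately [3, -1]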

Perceptron learning (1)

Let us now consider the problem of solving a binary classification problem with a linear discriminant. As usual, assume we have a dataset $X = \{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$ containing examples from the two classes. For convenience, we will absorb the intercept $w_0$ by augmenting the feature vector x with an additional constant dimension:

    $w^T x + w_0 = \begin{bmatrix} w_0 & w^T \end{bmatrix} \begin{bmatrix} 1 \\ x \end{bmatrix} = a^T y$

From [Duda, Hart and Stork, 2001]

Perceptron learning (2)

Keep in mind that our objective is to find a vector a such that

    $g(x) = a^T y \begin{cases} > 0 & x \in \omega_1 \\ < 0 & x \in \omega_2 \end{cases}$

To simplify the derivation, we will "normalize" the training set by replacing all examples from class $\omega_2$ by their negatives:

    $y \leftarrow -y \quad \forall y \in \omega_2$

This allows us to ignore class labels and look for a weight vector such that

    $a^T y > 0 \quad \forall y$

From [Duda, Hart and Stork, 2001]

Perceptron learning (3)

To find this solution we must first define an objective function J(a). A good choice is what is known as the Perceptron criterion:

    $J_P(a) = \sum_{y \in Y_M} \left(-a^T y\right)$

where $Y_M$ is the set of examples misclassified by a. Note that $J_P(a)$ is non-negative, since $a^T y < 0$ for misclassified samples.

Perceptron learning (4)

To find the minimum of $J_P(a)$ we use gradient descent. The gradient is defined by

    $\nabla_a J_P(a) = \sum_{y \in Y_M} (-y)$

and the gradient descent update rule becomes

    $a(k+1) = a(k) + \eta \sum_{y \in Y_M} y$

This is known as the perceptron batch update rule. The weight vector may also be updated in an on-line fashion, that is, after the presentation of each individual example:

    $a(k+1) = a(k) + \eta\, y^{(i)}$    [Perceptron rule]

where $y^{(i)}$ is an example that has been misclassified by a(k). From [Duda, Hart and Stork, 2001]
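A compact sketch of the on-line perceptron rule (our own illustration; variable names are ours). It assumes the $\omega_2$ samples have already been negated, so a sample y is misclassified whenever $a^T y \le 0$:

    import numpy as np

    def perceptron(Y, a0, eta=1.0, max_epochs=100):
        """On-line perceptron rule: a <- a + eta * y for each misclassified y.
        Y: (N, D+1) augmented, 'normalized' samples (class-2 rows negated)."""
        a = np.asarray(a0, dtype=float)
        for _ in range(max_epochs):
            errors = 0
            for y in Y:
                if a @ y <= 0:          # misclassified (we require a'y > 0)
                    a = a + eta * y     # perceptron update
                    errors += 1
            if errors == 0:             # converged: every sample satisfies a'y > 0
                break
        return a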

Perceptron learning (5)

If the classes are linearly separable, the perceptron rule is guaranteed to converge to a valid solution. However, if the two classes are not linearly separable, the perceptron rule will not converge: since no weight vector a can correctly classify every sample in a non-separable dataset, the corrections of the perceptron rule will never cease. One ad-hoc solution to this problem is to enforce convergence by using a variable learning rate $\eta(k)$ that approaches zero as k approaches infinity.

Perceptron learning example

Consider the following classification problem:
- Class $\omega_1$ defined by the feature vectors x ∈ {[0, 0]ᵀ, [0, 1]ᵀ}
- Class $\omega_2$ defined by the feature vectors x ∈ {[1, 0]ᵀ, [1, 1]ᵀ}

Apply the perceptron algorithm to build a vector a that separates both classes:
- Use learning rate η = 1 and a(0) = [1, −1, −1]ᵀ.
- Update the vector a on a per-example basis.
- Present the examples in the order in which they were given above.

Draw a scatterplot of the data and the separating line you found with the perceptron rule. (A worked sketch follows below.)
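Following the slide's instructions: augment each x with a leading 1, negate the $\omega_2$ samples, and present them in order. A self-contained sketch under the stated assumptions (η = 1 and a(0) = [1, −1, −1]ᵀ, the values we reconstructed from the garbled slide):

    import numpy as np

    # augmented samples y = [1, x1, x2]; class w2 rows are negated
    Y = np.array([[ 1,  0,  0],    # w1: [0, 0]
                  [ 1,  0,  1],    # w1: [0, 1]
                  [-1, -1,  0],    # w2: [1, 0], negated
                  [-1, -1, -1]])   # w2: [1, 1], negated

    a = np.array([1.0, -1.0, -1.0])
    for _ in range(10):                 # a few passes over the data
        for y in Y:
            if a @ y <= 0:              # misclassified
                a = a + y               # eta = 1
    print(a)   # -> [1, -2, 0]: boundary 1 - 2*x1 = 0, i.e. the line x1 = 0.5

The resulting separating line x1 = 0.5 sits halfway between the two columns of points, as the scatterplot would show.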

Minimum Squared Error solution (1)

The classical Minimum Squared Error (MSE) criterion provides an alternative to the perceptron rule:
- The perceptron rule seeks a weight vector a that satisfies the inequality $a^T y^{(i)} > 0$. The perceptron rule only considers misclassified samples, since these are the only ones that violate the above inequality.
- Instead, the MSE criterion looks for a solution to the equality $a^T y^{(i)} = b^{(i)}$, where the $b^{(i)}$ are some pre-specified target values (e.g., class labels). As a result, the MSE solution uses ALL of the samples in the training set.

From [Duda, Hart and Stork, 2001]

Minimum Squared Error solution (2)

The system of equations solved by MSE is

    $\begin{bmatrix} y_0^{(1)} & y_1^{(1)} & \cdots & y_D^{(1)} \\ y_0^{(2)} & y_1^{(2)} & \cdots & y_D^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ y_0^{(N)} & y_1^{(N)} & \cdots & y_D^{(N)} \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_D \end{bmatrix} = \begin{bmatrix} b^{(1)} \\ b^{(2)} \\ \vdots \\ b^{(N)} \end{bmatrix} \;\Leftrightarrow\; Ya = b$

where a is the weight vector, each row in Y is a training example, and each row in b is the corresponding class label. For consistency, we will continue to assume that examples from class $\omega_2$ have been replaced by their negative vectors, although this is not a requirement for the MSE solution. From [Duda, Hart and Stork, 2001]

Minimum Squared Error solution (3)

An exact solution to Ya = b can sometimes be found:
- If the number of (independent) equations (N) is equal to the number of unknowns (D+1), the exact solution is defined by $a = Y^{-1} b$.
- In practice, however, Y will be singular, so its inverse $Y^{-1}$ does not exist: Y will commonly have more rows (examples) than columns (unknowns), which yields an over-determined system for which an exact solution cannot be found.

Minimum Squared Error solution (4)

The solution in this case is to find a weight vector that minimizes some function of the error between the model ($a^T y$) and the desired output (b). In particular, MSE seeks to Minimize the sum of the Squares of these Errors:

    $J_{MSE}(a) = \sum_{i=1}^{N} \left(a^T y^{(i)} - b^{(i)}\right)^2 = \|Ya - b\|^2$

whose minimum, as usual, can be found by setting the gradient to zero.

The pseudo-inverse solution

The gradient of the objective function is

    $\nabla_a J_{MSE}(a) = 2\sum_{i=1}^{N} \left(a^T y^{(i)} - b^{(i)}\right) y^{(i)} = 2Y^T (Ya - b) = 0$

with zeros defined by $Y^T Y a = Y^T b$. Notice that $Y^T Y$ is now a square matrix! If $Y^T Y$ is nonsingular, the MSE solution becomes

    $a = \left(Y^T Y\right)^{-1} Y^T b = Y^\dagger b$    [Pseudo-inverse solution]

where the matrix $Y^\dagger = (Y^T Y)^{-1} Y^T$ is known as the pseudo-inverse of Y ($Y^\dagger Y = I$). Note that, in general, $Y Y^\dagger \neq I$.
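In code, the pseudo-inverse solution is essentially one line. A sketch reusing the toy data from the perceptron example (our own illustration; np.linalg.pinv computes $Y^\dagger$ via the SVD, which is more robust than forming $(Y^T Y)^{-1}$ explicitly):

    import numpy as np

    # same toy data as in the perceptron example: augmented and normalized
    Y = np.array([[ 1.,  0.,  0.],
                  [ 1.,  0.,  1.],
                  [-1., -1.,  0.],
                  [-1., -1., -1.]])
    b = np.ones(4)               # target margins b(i) = 1

    a = np.linalg.pinv(Y) @ b    # a = (Y'Y)^-1 Y' b when Y'Y is nonsingular
    print(a)
    print(Y @ a)                 # least-squares fit of Ya to b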

Least-mean-squares solution (1)

The objective function $J_{MSE}(a) = \|Ya - b\|^2$ can also be minimized using a gradient descent procedure:
- This avoids the problems that arise when $Y^T Y$ is singular.
- In addition, it also avoids the need for working with large matrices.

Looking at the expression of the gradient, the obvious update rule is

    $a(k+1) = a(k) + \eta(k)\, Y^T \left(b - Ya(k)\right)$

It can be shown that if $\eta(k) = \eta(1)/k$, where $\eta(1)$ is any positive constant, this rule generates a sequence of vectors that converges to a solution of $Y^T(Ya - b) = 0$. From [Duda, Hart and Stork, 2001]

Least-mean-squares solution (2)

The storage requirements of this algorithm can be reduced by considering each sample sequentially:

    $a(k+1) = a(k) + \eta(k)\left(b^{(i)} - y^{(i)T} a(k)\right) y^{(i)}$    [LMS rule]

This is known as the Widrow-Hoff, least-mean-squares (LMS), or delta rule [Mitchell, 1997]. From [Duda, Hart and Stork, 2001]
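And the sequential LMS rule in code, with the decaying learning rate $\eta(k) = \eta(1)/k$ mentioned on the previous slide (a sketch; the function name and the constant $\eta(1) = 0.5$ are our own choices):

    import numpy as np

    def lms(Y, b, eta1=0.5, n_epochs=100):
        """Widrow-Hoff / LMS rule: a <- a + eta(k) * (b_i - y_i'a) * y_i."""
        a = np.zeros(Y.shape[1])
        k = 1
        for _ in range(n_epochs):
            for y_i, b_i in zip(Y, b):
                a = a + (eta1 / k) * (b_i - y_i @ a) * y_i
                k += 1
        return a

    Y = np.array([[ 1.,  0.,  0.],
                  [ 1.,  0.,  1.],
                  [-1., -1.,  0.],
                  [-1., -1., -1.]])
    b = np.ones(4)
    print(lms(Y, b))   # approaches the MSE / pseudo-inverse solution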