Linear Classifiers 1


Outline: Framework; Exact: Minimize Mistakes (Perceptron Training), Matrix inversion; Logistic Regression Model: Max Likelihood Estimation (MLE) of P(y | x), Gradient descent (MSE; MLE); Linear Discriminant Analysis: Max Likelihood Estimation (MLE) of P(y, x), Direct Computation 2

Diagnosing Butterfly-itis Hmmm perhaps Butterfly-it is?? 3

Classifier: Decision Boundaries. A classifier partitions the input space X into decision regions. [Figure: scatter of + and - examples over #antennae (horizontal) vs. #wings (vertical).] A linear threshold unit has a linear decision boundary. Defn: A set of points that can be separated by a linear decision boundary is "linearly separable". 4

Linear Separators. Draw a separating line. [Figure: + and - examples over #antennae vs. #wings, split by the vertical line #antennae = 2; a query point "?" falls on the negative side.] If #antennae ≥ 2, then butterfly-itis; so "?" is Not butterfly-itis. 5

Can be angled. [Figure: + and - examples separated by the slanted line 2.3 #w + 7.5 #a + 1.2 = 0, with a query point "?".] If 2.3 #wings + 7.5 #antennae + 1.2 > 0, then butterfly-itis. 6

Linear Separators, in General. Given data (many features):

F1 (Temp)   F2 (Press)   ...    Fn (Color)   Class (diseaseX?)
35          95           3      Pale         No
22          80           -2     Clear        Yes
:           :            :      :            :
10          50           1.9    Pale         No

Find weights {w1, w2, ..., wn, w0} such that w1 F1 + ... + wn Fn + w0 > 0 means Class = Yes. 7

Linear Separator. [Figure: inputs F1 = 35, F2 = 95, ..., Fn = 3 feed through weights w1, w2, ..., wn into a unit that computes Σi wi Fi and outputs Yes/No.] Just add an extra input F0 fixed at 1, so w0 is included in the sum. 8

Linear Separator. [Figure: the instance F1 = 35, F2 = 95, ..., Fn = 3 with weights 2.3, 7.5, ..., 21 gives Σi wi Fi = 46.8 > 0, so the output is Yes.] Performance: given {wi} and the values for an instance, compute the response. Learning: given labeled data, find the correct {wi}. This is a Linear Threshold Unit, a.k.a. Perceptron. 9

Linear Separators Facts. GOOD NEWS: If the data is linearly separable, then a FAST ALGORITHM finds a correct {wi}! But... [Figure: four points in an XOR-like arrangement: -, +, +, -.] 10

Linear Separators Facts. GOOD NEWS: If the data is linearly separable, then a FAST ALGORITHM finds a correct {wi}! But some data sets are NOT linearly separable! [Figure: the same XOR-like arrangement: -, +, +, -.] Stay tuned! 11

Geometric View. Consider 3 training examples: ([1.0, 1.0], 1), ([0.5, 3.0], 1), ([2.0, 2.0], 0). Want a classifier that looks like... [Figure: the three points in the plane with a separating line.] 12

Linear Equation is Hyperplane. The equation w · x = Σi wi xi defines a plane. y(x) = 1 if w · x > 0, and 0 otherwise. 13

Linear Threshold Unit: Perceptron. Squashing function sgn: ℝ → {-1, +1}: sgn(r) = 1 if r > 0, and 0 otherwise ("heaviside"). Actually the test is w · x > b, but... create an extra input x0 fixed at 1; the corresponding weight w0 corresponds to -b. 14
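A minimal sketch of such a unit in Python (NumPy assumed; the weights and feature values below are illustrative, not from the slides):

    import numpy as np

    def predict(w, x):
        """Linear threshold unit with the bias trick: prepend x0 = 1, then threshold w.x at 0."""
        x_aug = np.concatenate(([1.0], x))   # x0 = 1, so w[0] plays the role of -b
        return 1 if np.dot(w, x_aug) > 0 else 0

    w = np.array([1.2, 2.3, 7.5])            # [w0, w1, w2]
    x = np.array([3.0, 2.0])                 # e.g. [#wings, #antennae]
    print(predict(w, x))                     # 1, i.e. "Yes"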

Learning Perceptrons. Can represent any linearly-separated surface... i.e., any hyper-plane between two half-spaces. Remarkable learning algorithm [Rosenblatt 1960]: if a function f can be represented by a perceptron, then the learning algorithm is guaranteed to quickly converge to f! Hence enormous popularity in the early/mid 60's. But some simple functions cannot be represented (e.g., Boolean XOR) [Minsky/Papert 1969], which killed the field temporarily! 15

Perceptron Learning. Hypothesis space is... Fixed size: O(2^(n²)) distinct perceptrons over n boolean features; Deterministic; Continuous parameters. Learning algorithm: various (local search, direct computation, ...); Eager; Online / Batch. 16

Task. Input: labeled data, transformed to... Output: w ∈ ℝ^(r+1). Goal: want w s.t. for all i, sgn(w · [1, x(i)]) = y(i)... minimize mistakes wrt the data... 17

Error Function. Given data { [x(i), y(i)] } i=1..m, optimize...
1. Classification error (Perceptron Training; Matrix Inversion): err_Class(w) = (1/m) Σi I[ y(i) ≠ o_w(x(i)) ]
2. Mean-squared error (Matrix Inversion; MSE Gradient Descent): err_MSE(w) = (1/m) Σi ½ [ y(i) - o_w(x(i)) ]²
3. (Log) Conditional Probability (LCL Gradient Descent): LCL(w) = (1/m) Σi log P_w( y(i) | x(i) )
4. (Log) Joint Probability (Direct Computation): LL(w) = (1/m) Σi log P_w( y(i), x(i) )
18

#1: Optimal Classification Error. For each labeled instance [x, y]: Err = y - o_w(x), where y = f(x) is the target value and o_w(x) = sgn(w · x) is the perceptron output. Idea: move the weights in the appropriate direction, to push Err → 0. If Err > 0 (error on a POSITIVE example): need to increase sgn(w · x), so need to increase w · x. Input j contributes wj xj to w · x: if xj > 0, increasing wj will increase w · x; if xj < 0, decreasing wj will increase w · x. Either way, the update w ← w + x does the right thing. (If Err < 0, i.e., error on a NEGATIVE example, use the symmetric update w ← w - x.) 19

#1a: Mistake Bound Perceptron Alg. Initialize w = 0. Do until bored: predict "+" iff w · x > 0, else "-"; on a mistake on a positive example, w ← w + x; on a mistake on a negative example, w ← w - x. Trace on three instances:

Weights    Instance  Action
[0 0 0]    #1        +x
[1 0 0]    #2        -x
[0 -1 0]   #3        +x
[1 0 1]    #1        OK
[1 0 1]    #2        -x
[0 -1 1]   #3        +x
[1 0 2]    #1        OK
[1 0 2]    #2        -x
[0 -1 2]   #3        OK
[0 -1 2]   #1        +x
[1 -1 2]   #2        OK
[1 -1 2]   #3        OK
[1 -1 2]   #1        OK

20
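A compact sketch of this mistake-driven loop in Python (NumPy assumed; the augmented instances and labels below are my reading of the trace, so treat them as illustrative):

    import numpy as np

    # Augmented instances [1, x1, x2] and labels consistent with the trace above (illustrative).
    X = np.array([[1., 0., 0.],    # #1, positive
                  [1., 1., 0.],    # #2, negative
                  [1., 1., 1.]])   # #3, positive
    y = np.array([1, -1, 1])

    w = np.zeros(3)
    for _ in range(20):                       # "do until bored"
        mistakes = 0
        for x_i, y_i in zip(X, y):
            pred = 1 if w @ x_i > 0 else -1   # predict + iff w.x > 0
            if pred != y_i:
                w += y_i * x_i                # +x on positive mistakes, -x on negative ones
                mistakes += 1
        if mistakes == 0:
            break
    print(w)                                  # ends at [1. -1. 2.] on this data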

Mistake Bound Theorem. Theorem [Rosenblatt 1960]: if the data is consistent with some linear threshold w, then the number of mistakes is ≤ (1/γ)², where γ measures the wiggle room available: if ‖x‖ = 1, then γ is the max, over all consistent planes, of the minimum distance of an example to that plane. w is ⊥ to the separator, as w · x = 0 at the boundary; so w · x is the projection of x onto w, perpendicular to the boundary line, i.e., the distance from x to that line (once normalized). 21

Proof of Convergence. x is wrong wrt w iff w · x < 0. Let w* be the unit vector representing the target plane, and γ = min_x { w* · x }. Let w be the hypothesis plane. Consider: on each mistake, add x to w, so w = Σ_{x : x·w < 0} x. 22

Proof (con't). If x is a mistake (x · w < 0), then adding x to w increases w · w* by x · w* ≥ γ, while it increases ‖w‖² by 2 x·w + ‖x‖² ≤ 1. So after M mistakes, M γ ≤ w · w* ≤ ‖w‖ ≤ √M, hence M ≤ (1/γ)². 23

#1b: Perceptron Training Rule. For each labeled instance [x, y]: Err([x, y]) = y - o_w(x) ∈ {-1, 0, +1}. If Err([x, y]) = 0: correct, do nothing; Δw = 0 = Err([x, y]) x. If Err([x, y]) = +1: mistake on a positive, increment by +x; Δw = +x = Err([x, y]) x. If Err([x, y]) = -1: mistake on a negative, increment by -x; Δw = -x = Err([x, y]) x. In all cases... Δw(i) = Err([x(i), y(i)]) x(i) = [y(i) - o_w(x(i))] x(i). Batch mode: do ALL updates at once! Δwj = Σi Δwj(i) = Σi x(i)j ( y(i) - o_w(x(i)) ), then wj += η Δwj. 24

0. w ← 0 (or start from the current w). 1. Δw = 0; then for each row i, compute: a. E(i) = y(i) - o_w(x(i)); b. Δw += E(i) x(i) [i.e., Δwj += E(i) x(i)j for each feature j]. 2. Increment w += η Δw. 25
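A sketch of this batch update in Python (NumPy assumed; the data, learning rate, and epoch cap are illustrative):

    import numpy as np

    def batch_perceptron_rule(X, y, eta=0.1, epochs=100):
        """Batch perceptron training rule: accumulate E(i) x(i) over all rows, then take one step."""
        X_aug = np.hstack([np.ones((len(X), 1)), X])    # bias trick: x0 = 1
        w = np.zeros(X_aug.shape[1])
        for _ in range(epochs):
            o = np.where(X_aug @ w > 0, 1, -1)          # o_w(x) = sgn(w.x)
            E = y - o                                   # E(i) = y(i) - o_w(x(i))
            delta_w = E @ X_aug                         # sum_i E(i) x(i)
            if not delta_w.any():                       # no mistakes left
                break
            w += eta * delta_w
        return w

    X = np.array([[0., 0.], [1., 0.], [1., 1.]])        # illustrative data
    y = np.array([1, -1, 1])                            # labels in {-1, +1}
    print(batch_perceptron_rule(X, y))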

Correctness Rule is intuitive: Climbs in correct direction... Thrm: Converges to correct answer, if... training data is linearly separable sufficiently small η Proof: Weight space has EXACTLY 1 minimum! (no non-global minima) with enough examples, finds correct function! Explains early popularity If η too large, may overshoot If η too small, takes too long So often η = η(k) which decays with # of iterations, k 26

#1c: Matrix Version? 27

Issues. 1. Why restrict to only y(i) ∈ {-1, +1}? If from a discrete set y(i) ∈ {0, 1, ..., m}: general (non-binary) classification. If ARBITRARY y(i) ∈ ℝ: regression. 2. What if NO w works? ... X is singular; overconstrained ... Could try to minimize a residual: Σi I[ y(i) ≠ w · x(i) ] (NP-hard!); ‖y - Xw‖₁ = Σi |y(i) - w · x(i)|; or ‖y - Xw‖₂² = Σi ( y(i) - w · x(i) )² (easy!). 28

L2 error vs 0/1-Loss. The 0/1 loss function is not smooth or differentiable; the MSE error is smooth, differentiable, and is an upper bound on the 0/1 loss... 29

Gradient Descent for Perceptron? Why not gradient descent for the THRESHOLDed perceptron? It needs a gradient (derivative), which the hard threshold does not provide. Gradient descent is a general approach. It requires: + a continuously parameterized hypothesis, + an error that is differentiable wrt the parameters. But... it can be slow (many iterations), and it may only find a LOCAL optimum. 30

#1. LMS Version of Classifier. View as regression: find the best linear mapping w from X to Y, w* = argmin_w Err_LMS^(X,Y)(w), where Err_LMS^(X,Y)(w) = Σi ( y(i) - w · x(i) )². Threshold: if w · x > 0.5, return 1; else 0. See Chapter 3. 31
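A sketch of this LMS classifier in Python (NumPy assumed; np.linalg.lstsq performs the argmin, the 0/1 labels and 0.5 threshold follow the slide, and the data is illustrative):

    import numpy as np

    def fit_lms(X, y):
        """Least-squares fit of w (with bias) minimizing sum_i (y(i) - w.x(i))^2."""
        X_aug = np.hstack([np.ones((len(X), 1)), X])
        w, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
        return w

    def classify_lms(w, x):
        """Threshold the regression output at 0.5."""
        return 1 if w @ np.concatenate(([1.0], x)) > 0.5 else 0

    X = np.array([[1.0, 1.0], [0.5, 3.0], [2.0, 2.0]])  # the three examples from the Geometric View slide
    y = np.array([1, 1, 0])
    w = fit_lms(X, y)
    print(classify_lms(w, np.array([0.8, 2.0])))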

Use Linear Regression for Classification? 1. Use regression to find weights w. 2. Classify a new instance x as sgn(w · x). But regression minimizes the sum of squared errors on the target function, which gives strong influence to outliers. [Figure: one fit gives great separation; another, pulled by outliers, gives bad separation.] 32

#3: Logistic Regression. Want to compute P_w(y=1 | x)... based on parameters w. But w · x has range [-∞, ∞], while a probability must be in the range [0, 1]. Need a squashing function [-∞, ∞] → [0, 1]: σ(x) = 1 / (1 + e^(-x)). 33

Alternative Derivation. P(y | x) = P(x | y) P(y) / [ P(x | y) P(y) + P(x | ¬y) P(¬y) ] = 1 / (1 + exp(-a)), where a = ln [ P(x | y) P(y) / ( P(x | ¬y) P(¬y) ) ]. 34

Logistic Regression (con't). Assume 2 classes: P_w(y=1 | x) = σ(w · x) = 1 / (1 + e^(-x·w)), and P_w(y=0 | x) = 1 - σ(w · x) = e^(-x·w) / (1 + e^(-x·w)). Log odds: log [ P_w(y=1 | x) / P_w(y=0 | x) ] = x · w. 35

How to learn parameters w? Depends on the goal. A: Minimize MSE? Σi ( y(i) - o_w(x(i)) )². B: Maximize likelihood? Σi log P_w(y(i) | x(i)). 36

MSE Gradient for Sigmoid Unit. Error: Σj ( y(j) - o_w(x(j)) )² = Σj E(j). For a single training instance: input x(j) = [x(j)1, ..., x(j)k]; computed output o(j) = σ( Σi x(j)i wi ) = σ( z(j) ), where σ(z) = 1 / (1 + e^(-z)) and z(j) = Σi x(j)i wi using the current { wi }; correct output y(j). Stochastic error gradient (ignore the (j) superscript). 37

Derivative of Sigmoid. dσ(a)/da = d/da [ 1 / (1 + e^(-a)) ] = e^(-a) / (1 + e^(-a))² = [ 1 / (1 + e^(-a)) ] · [ e^(-a) / (1 + e^(-a)) ] = σ(a) [ 1 - σ(a) ]. 38
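A quick numerical check of this identity (NumPy assumed; the test point and step size are arbitrary):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    a, h = 0.7, 1e-6
    numeric = (sigmoid(a + h) - sigmoid(a - h)) / (2 * h)   # central finite difference
    analytic = sigmoid(a) * (1 - sigmoid(a))                # sigma(a) * (1 - sigma(a))
    print(numeric, analytic)                                # the two agree to many decimal places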

Updating LR Weights (MSE). Update wi += η Δwi, where (using the sigmoid derivative above) Δwi = Σj ( y(j) - o(j) ) o(j) ( 1 - o(j) ) x(j)i. 39

B: Or... Learn Conditional Probability. When fitting a probability distribution, it is better to return the distribution (parameters w) that is most likely given the training data S: w* = argmax_w P(w | S) = argmax_w P(S | w) P(w) / P(S) (Bayes rule) = argmax_w P(S | w) P(w) (as P(S) does not depend on w) = argmax_w P(S | w) (as P(w) is uniform) = argmax_w log P(S | w) (as log is monotonic). 40

ML Estimation. P(S | w) is the likelihood function; L(w) = log P(S | w); w* = argmax_w L(w) is the maximum likelihood estimator (MLE). 41

Computing the Likelihood. As the training examples [x(i), y(i)] are iid (drawn independently from the same unknown distribution P_w(x, y)): log P(S | w) = log Πi P_w(x(i), y(i)) = Σi log P_w(x(i), y(i)) = Σi log P_w(y(i) | x(i)) + Σi log P_w(x(i)). Here P_w(x(i)) = 1/n is not dependent on w over the empirical sample S, so w* = argmax_w Σi log P_w(y(i) | x(i)). 42

Fit Logistic Regression by Gradient Ascent. Want w* = argmax_w J(w), where J(w) = Σi r(y(i), x(i), w). For y ∈ {0, 1}: r(y, x, w) = log P_w(y | x) = y log P_w(y=1 | x) + (1 - y) log(1 - P_w(y=1 | x)). ∂J(w)/∂wj = Σi ∂ r(y(i), x(i), w) / ∂wj, so climb along this gradient. 43

Gradient Ascent (con't). With p1 = P_w(y=1 | x) = σ(x · w): ∂r(y, x, w)/∂wj = ∂/∂wj [ y log p1 + (1 - y) log(1 - p1) ] = (y / p1) ∂p1/∂wj - ((1 - y) / (1 - p1)) ∂p1/∂wj = [ (y - p1) / (p1 (1 - p1)) ] ∂p1/∂wj, and ∂p1/∂wj = ∂σ(x · w)/∂wj = σ(x · w) [ 1 - σ(x · w) ] xj = p1 (1 - p1) xj. So ∂J(w)/∂wj = Σi ∂r(y(i), x(i), w)/∂wj = Σi ( y(i) - P_w(y=1 | x(i)) ) x(i)j. 44

Gradient Ascent for Logistic Regression (MLE). Repeat: Δwj = Σi ( y(i) - p1(x(i)) ) x(i)j, where p1(x(i)) = P_w(y=1 | x(i)); then w += η Δw. 45
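A sketch of this batch gradient-ascent loop in Python (NumPy assumed; the learning rate, iteration count, and data are illustrative):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit_logistic(X, y, eta=0.1, iters=1000):
        """Batch gradient ascent on the conditional log-likelihood, for labels y in {0, 1}."""
        X_aug = np.hstack([np.ones((len(X), 1)), X])     # bias trick
        w = np.zeros(X_aug.shape[1])
        for _ in range(iters):
            p1 = sigmoid(X_aug @ w)                      # P_w(y=1 | x(i)) for every row
            grad = (y - p1) @ X_aug                      # sum_i (y(i) - p1(x(i))) x(i)
            w += eta * grad                              # climb along the gradient
        return w

    X = np.array([[1.0, 1.0], [0.5, 3.0], [2.0, 2.0]])
    y = np.array([1, 1, 0])
    w = fit_logistic(X, y)
    print(sigmoid(np.hstack([np.ones((len(X), 1)), X]) @ w))   # fitted P(y=1 | x) per row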

Comments on MLE Algorithm This is BATCH; obvious online alg (stochastic gradient ascent) Can use second-order (Newton-Raphson) alg for faster convergence weighted least squares computation; aka Iteratively-Reweighted Least Squares (IRLS) 46

Use Logistic Regression for Classification. Return YES iff P(y=1 | x) ≥ P(y=0 | x), iff P(y=1 | x) / P(y=0 | x) ≥ 1, iff ln [ P(y=1 | x) / P(y=0 | x) ] ≥ 0. Since ln [ (1 / (1 + exp(-w·x))) / (exp(-w·x) / (1 + exp(-w·x))) ] = ln [ 1 / exp(-w·x) ] = w · x, return YES iff w · x ≥ 0. Logistic Regression learns an LTU! 47

Logistic Regression for K > 2 Classes. Note: K-1 different weight vectors wi, each of dimension |x|. 48

Learning LR Weights. P_w(y | x) = 1 / (1 + exp(-w · x)) if y = 1, and exp(-w · x) / (1 + exp(-w · x)) if y = 0; write o(i) = P_w(y=1 | x(i)). LMS update: Δw(i)j = -(o(i) - y(i)) o(i) (1 - o(i)) x(i)j = (y(i) - o(i)) o(i) (1 - o(i)) x(i)j. MaxProb (MLE) update: Δw(i)j = (y(i) - p(1 | x(i))) x(i)j. 49

(LMS) vs. (MaxProb): the same algorithm with two choices of per-row error. 0. w ← 0 (or start from the current w). 1. Δw = 0; then for each row i, compute: a. E(i) = ( y(i) - o_w(x(i)) ) o(i) ( 1 - o(i) ) for LMS, or E(i) = y(i) - p(1 | x(i)) for MaxProb; b. Δw += E(i) x(i) [i.e., Δwj += E(i) x(i)j]. 2. Increment w += η Δw. 50

Logistic Regression Algs for LTUs. Learns the conditional probability distribution P(y | x). Local search: begin with an initial weight vector; iteratively modify it to maximize the objective function, the log likelihood of the data (i.e., seek w s.t. the probability distribution P_w(y | x) is most likely given the data). Eager: the classifier is constructed from the training examples, which can then be discarded. Online or batch. 51

#4: Linear Discriminant Analysis. LDA learns the joint distribution P(y, x). As P(y, x) ≠ P(y | x), optimizing P(y, x) ≠ optimizing P(y | x). Generative model: P(y, x) is a model of how the data is generated. Eg, factor P(y, x) = P(y) P(x | y): P(y) generates a value for y; then P(x | y) generates a value for x given this y. Belief net: Y → X. 52

Linear Discriminant Analysis, con't. P(y, x) = P(y) P(x | y). P(y) is a simple discrete distribution. Eg: P(y=0) = 0.31; P(y=1) = 0.69 (31% negative examples; 69% positive examples). Assume P(x | y) is multivariate normal, with mean μk and covariance Σ. 53

Estimating the LDA Model. Linear discriminant analysis assumes the form P(x, y=k) = pk N(x; μk, Σ): μy is the mean for examples belonging to class y, and the covariance matrix Σ is shared by all classes! Can estimate LDA directly: mk = #training examples in class y=k; estimate of P(y=k): pk = mk / m; μk = (1/mk) Σ_{i: y(i)=k} x(i); Σ = (1/m) Σi ( x(i) - μ_{y(i)} ) ( x(i) - μ_{y(i)} )^T (subtract from each x(i) its corresponding μ_{y(i)} before taking the outer product). 54
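A sketch of these estimates in Python (NumPy assumed; labels are taken to be 0..K-1, and the toy data is illustrative):

    import numpy as np

    def fit_lda(X, y):
        """Estimate class priors p_k, class means mu_k, and the shared covariance Sigma."""
        m, n = X.shape
        classes = np.unique(y)                                         # assumes labels 0..K-1
        priors = np.array([np.mean(y == k) for k in classes])          # p_k = m_k / m
        means = np.array([X[y == k].mean(axis=0) for k in classes])    # mu_k
        centered = X - means[y]                                        # subtract mu_{y(i)} from each x(i)
        sigma = centered.T @ centered / m                              # shared covariance
        return priors, means, sigma

    X = np.array([[1.0, 2.0], [1.5, 1.8], [3.0, 4.0], [3.2, 3.9], [2.8, 4.2]])
    y = np.array([0, 0, 1, 1, 1])
    priors, means, sigma = fit_lda(X, y)
    print(priors, means, sigma, sep="\n")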

Example of Estimation. m = 7 examples; m+ = 3 positive, m- = 4 negative; so p+ = 3/7 and p- = 4/7. 55

Estimation z(7) := 56

Classifying, Using LDA How to classify new instance, given estimates Class for instance x = [5, 14, 6]? 57

LDA learns an LTU. Consider the 2-class case with a 0/1 loss function. Classify ŷ = 1 iff P(y=1 | x) ≥ P(y=0 | x). 58

LDA Learns an LTU (2). (x - μ1)^T Σ⁻¹ (x - μ1) - (x - μ0)^T Σ⁻¹ (x - μ0) = x^T Σ⁻¹ (μ0 - μ1) + (μ0 - μ1)^T Σ⁻¹ x + μ1^T Σ⁻¹ μ1 - μ0^T Σ⁻¹ μ0. As Σ⁻¹ is symmetric, this equals 2 x^T Σ⁻¹ (μ0 - μ1) + μ1^T Σ⁻¹ μ1 - μ0^T Σ⁻¹ μ0. 59

LDA Learns an LTU (3). So let w = Σ⁻¹ (μ1 - μ0) and c = log(p1 / p0) + ½ ( μ0^T Σ⁻¹ μ0 - μ1^T Σ⁻¹ μ1 ). Classify ŷ = 1 iff w · x + c > 0: an LTU!! 60
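A sketch of this rule in Python (NumPy assumed; the priors, means, and shared covariance below are illustrative stand-ins for estimates such as those from the fit_lda snippet above):

    import numpy as np

    def lda_ltu(priors, means, sigma):
        """Turn 2-class LDA estimates into the LTU (w, c) derived above."""
        sigma_inv = np.linalg.inv(sigma)
        mu0, mu1 = means[0], means[1]
        w = sigma_inv @ (mu1 - mu0)
        c = (np.log(priors[1] / priors[0])
             + 0.5 * (mu0 @ sigma_inv @ mu0 - mu1 @ sigma_inv @ mu1))
        return w, c

    priors = np.array([0.4, 0.6])
    means = np.array([[1.25, 1.9], [3.0, 4.0]])
    sigma = np.array([[0.5, 0.1], [0.1, 0.5]])
    w, c = lda_ltu(priors, means, sigma)
    print(1 if w @ np.array([2.0, 3.0]) + c > 0 else 0)   # classify a new instance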

Variants of LDA (covariance matrix; n features, k classes):

Name                          Same Σ for all classes?   Diagonal?   #params
LDA                           +                         -           n²
Naïve Gaussian Classifier     -                         +           k·n
General Gaussian Classifier   -                         -           k·n²

61

Generalizations of LDA. General Gaussian Classifier: allow each class k to have its own Σk; the classifier is a quadratic threshold unit (not an LTU). Naïve Gaussian Classifier: allow each class k to have its own Σk, but require each Σk to be diagonal, so within each class any pair of features xi and xj are independent; the classifier is still a quadratic threshold unit, but with a restricted form. 62

Summary of Linear Discriminant Analysis. Learns the joint probability distribution P(y, x). Direct computation: the ML estimate of P(y, x) is computed directly from the data, without search; but it needs to invert the matrix Σ, which is O(n³). Eager: the classifier is constructed from the training examples, which can then be discarded. Batch: only a batch algorithm; an online LDA alg requires an online alg for incrementally updating Σ⁻¹ [easy if Σ⁻¹ is diagonal...]. 63

Two Geometric Views of LDA. View 1: Mahalanobis distance. The squared Mahalanobis distance between x and μ is D²M(x, μ) = (x - μ)^T Σ⁻¹ (x - μ); Σ⁻¹ is a linear distortion that converts the standard Euclidean distance into the Mahalanobis distance. LDA classifies x as 0 if D²M(x, μ0) < D²M(x, μ1), since log P(x, y=k) = log πk - ½ D²M(x, μk) + const. 64
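A small sketch of this distance and decision in Python (NumPy assumed; the means, covariance, and query point are illustrative):

    import numpy as np

    def mahalanobis_sq(x, mu, sigma):
        """Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu)."""
        d = x - mu
        return d @ np.linalg.solve(sigma, d)    # solve() avoids forming Sigma^{-1} explicitly

    x = np.array([2.0, 3.0])
    mu0 = np.array([1.25, 1.9])
    mu1 = np.array([3.0, 4.0])
    sigma = np.array([[0.5, 0.1], [0.1, 0.5]])
    # classify as 0 if x is closer (in Mahalanobis distance) to mu0 than to mu1
    print(0 if mahalanobis_sq(x, mu0, sigma) < mahalanobis_sq(x, mu1, sigma) else 1)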

View 2: Most Informative Low-Dimensional Projection (Fisher's Linear Discriminant). LDA finds a (K-1)-dimensional hyperplane (K = number of classes), projects x and the { μk } onto that hyperplane, and classifies x as the nearest μk within the hyperplane. Goal: a hyperplane that maximally separates the projections of the x's wrt Σ⁻¹. [Figure: the same data projected onto the vertical axis vs. onto LDA's w.] 65

Fisher Linear Discriminant. Recall any vector w projects ℝⁿ → ℝ. Goal: want a w that separates the classes: each w · x+ far from each w · x-. Perhaps project onto m+ - m-? Still overlap. Why? 66

Fisher Linear Discriminant. Problem with m+ - m-: it doesn't consider the scatter within each class. Goal: want a w that separates the classes: each w · x+ far from each w · x-; the positive projections w · x+ close to each other; the negative projections w · x- close to each other. Scatter of the + and - instances: s+² = Σi y(i) (w · x(i) - m+)², s-² = Σi (1 - y(i)) (w · x(i) - m-)². 67

Fisher Linear Discriminant. Recall any vector w projects ℝⁿ → ℝ. Goal: want a w that separates the classes: the positive projections w · x+ close to each other; the negative projections w · x- close to each other; each w · x+ far from each w · x-. Scatter of the + and - instances: s+² = Σi y(i) (w · x(i) - m+)², s-² = Σi (1 - y(i)) (w · x(i) - m-)². 68

FLD, con't. Separate the means m- and m+: maximize (m- - m+)². Minimize each spread s+², s-²: minimize (s+² + s-²). Objective function: maximize J_S(w) = (m- - m+)² / (s+² + s-²). #1: (m- - m+)² = ( w^T m+ - w^T m- )² = w^T (m+ - m-)(m+ - m-)^T w = w^T S_B w, where S_B = (m+ - m-)(m+ - m-)^T is the between-class scatter. 69

FLD, III. J_S(w) = (m- - m+)² / (s+² + s-²). s+² = Σi y(i) (w · x(i) - m+)² = Σi w^T y(i) (x(i) - m+)(x(i) - m+)^T w = w^T S+ w, where S+ = Σi y(i) (x(i) - m+)(x(i) - m+)^T is the within-class scatter matrix for +; likewise S- = Σi (1 - y(i)) (x(i) - m-)(x(i) - m-)^T is the within-class scatter matrix for -. With S_W = S+ + S-, we get s+² + s-² = w^T S_W w. 70

FLD, IV. J_S(w) = (m- - m+)² / (s+² + s-²) = (w^T S_B w) / (w^T S_W w) = ( w^T (m+ - m-) )² / (w^T S_W w). Solving ∂J_S(w)/∂wj = 0 gives w = α S_W⁻¹ (m+ - m-). 71

FLD, V. Solving ∂J_S(w)/∂wj = 0 for J_S(w) = (m- - m+)² / (s+² + s-²) = (w^T S_B w) / (w^T S_W w) gives w = α S_W⁻¹ (m+ - m-). When P(x | yi) ~ N(μi, Σ), this is the LINEAR DISCRIMINANT: w = Σ⁻¹ (μ+ - μ-). FLD is the optimal classifier if the classes are normally distributed. It can be used even if they are not normal: after projecting from d dimensions down to 1, just use any classification method. Analogous derivation for K > 2 classes. 72
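A sketch of this computation in Python (NumPy assumed; the data is illustrative, and projecting then thresholding is left as a comment):

    import numpy as np

    def fisher_direction(X, y):
        """Fisher direction w proportional to S_W^{-1} (m_plus - m_minus), for labels in {0, 1}."""
        Xp, Xn = X[y == 1], X[y == 0]
        m_plus, m_minus = Xp.mean(axis=0), Xn.mean(axis=0)
        S_plus = (Xp - m_plus).T @ (Xp - m_plus)      # within-class scatter for +
        S_minus = (Xn - m_minus).T @ (Xn - m_minus)   # within-class scatter for -
        S_W = S_plus + S_minus
        return np.linalg.solve(S_W, m_plus - m_minus)

    X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2], [3.0, 4.0], [3.2, 3.9], [2.8, 4.2]])
    y = np.array([0, 0, 0, 1, 1, 1])
    w = fisher_direction(X, y)
    print(w)   # project with w @ x, then classify the resulting 1-D values by any method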

Comparing LMS, Logistic Regression, LDA. Which is best: LMS, LR, or LDA? There is an ongoing debate within the machine learning community about the relative merits of direct classifiers [LMS], conditional models P(y | x) [LR], and generative models P(y, x) [LDA]. Stay tuned... 73

Issues in Debate.
- Statistical efficiency: if the generative model P(y, x) is correct, it usually gives better accuracy, particularly if the training sample is small.
- Computational efficiency: generative models are typically the easiest to compute (LDA is computed directly, without iteration).
- Robustness to changing loss functions: LMS must re-train the classifier when the loss function changes; no retraining is needed for generative and conditional models.
- Robustness to model assumptions: the generative model usually performs poorly when its assumptions are violated. Eg, LDA works poorly if P(x | y) is non-Gaussian. Logistic Regression is more robust, and LMS is even more robust.
- Robustness to missing values and noise: in many applications, some of the features x(i)j may be missing or corrupted for some of the training examples. Generative models typically provide better ways of handling this than non-generative models.
74

Other Algorithms for Learning LTUs. Naive Bayes [discussed later]: for K = 2 classes, produces an LTU. Winnow [?discussed later?]: can handle large numbers of "irrelevant" features (features whose weights should be zero). 75