HTF: Ch4; B: Ch4. Linear Classifiers. R. Greiner, CMPUT 466/551
1 HTF: Ch4; B: Ch4. Linear Classifiers. R. Greiner, CMPUT 466/551
2 Outline. Framework. Exact: Minimize Mistakes (Perceptron Training), Matrix inversion (LMS). Logistic Regression: Max Likelihood Estimation (MLE) of P(y|x); Gradient descent (MSE; MLE); Newton-Raphson. Linear Discriminant Analysis: Max Likelihood Estimation (MLE) of P(y, x); Direct Computation; Fisher's Linear Discriminant.
3 Diagnosing Butterfly-itis
4 Classifier: Decision Boundaries. A classifier partitions the input space X into decision regions (here: #wings, #antennae). A linear threshold unit has a linear decision boundary. Defn: a set of points that can be separated by a linear decision boundary is "linearly separable".
5 Linear Separators: draw a separating line in the (#wings, #antennae) plane; instances on one side have butterfly-itis. So the query instance "?" is Not butterfly-itis.
6 The separating line can be angled: if .3·#wings + 7.5·#antennae + w₀ > 0, then butterfly-itis.
7 Linear Separators, in General. Given data with many features F₁ (Temp), F₂ (Press), F₃ (Color), ..., Fₙ, and a class label (diseaseX? Yes/No), find weights {w₁, w₂, ..., wₙ, w₀} such that Σⱼ wⱼ·xⱼ + w₀ > 0 means Class = Yes.
8 Linear Separator. [diagram of a linear threshold unit: inputs xⱼ, weights wⱼ, summation Σⱼ wⱼ·xⱼ, thresholded output]
9 Linear Separator: o(x) = sgn(Σⱼ wⱼ·xⱼ). Performance: given {wᵢ} and the values of an instance x, compute the response. Learning: given labeled data, find the correct {wᵢ}. Linear Threshold Unit = Perceptron.
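The "performance" step on this slide, computing the response of a linear threshold unit for given weights, can be sketched as follows (a minimal illustration; the weight values, chosen to echo the earlier .3/7.5 butterfly-itis rule, are made up):

```python
# Linear threshold unit (perceptron) response: o(x) = sgn(w . x),
# with an extra input x0 = 1 so that w0 plays the role of -b.

def ltu_response(w, x):
    """Return +1 if w . [1, x] > 0, else -1."""
    s = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if s > 0 else -1

# Hypothetical weights [w0, w_wings, w_antennae]
w = [-4.0, 0.3, 7.5]
print(ltu_response(w, [2, 1]))  # 2 wings, 1 antenna
print(ltu_response(w, [2, 0]))  # 2 wings, 0 antennae
```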
10 Geometric View. Consider 3 training examples: [.0, .0]; [0.5, 3.0]; [.0, .0]. Want a classifier that looks like...
11 A Linear Equation is a Hyperplane. w·x = Σᵢ wᵢ·xᵢ; the set {x : w·x = 0} is a (hyper)plane; output 1 if w·x > 0, and 0 otherwise.
12 Linear Threshold Unit: Perceptron. Squashing function sgn: R → {−1, 1}: sgn(r) = 1 if r > 0, −1 otherwise (Heaviside variant: 1 if r > 0, 0 otherwise). Actually want w·x > b, but... create an extra input x₀ fixed at 1; the corresponding weight w₀ corresponds to −b.
13 Learning Perceptrons. Can represent any linearly-separated surface... any hyper-plane between two half-spaces. Remarkable learning algorithm [Rosenblatt 1960]: if a function f can be represented by a perceptron, then the learning alg is guaranteed to quickly converge to f! Enormous popularity, early/mid 60's. But some simple fns cannot be represented (killed the field, temporarily!).
14 Perceptron Learning. Hypothesis space is... Fixed size: O(2^(n²)) distinct perceptrons over n boolean features. Deterministic. Continuous parameters. Learning algorithm: Various: local search, direct computation, ... Eager. Online / Batch.
15 Task. Input: labeled data {[xⁱ, yⁱ]}, transformed to... Output: w ∈ Rʳ. Goal: want w s.t. ∀i: sgn(w·[1, xⁱ]) = yⁱ ... minimize mistakes wrt the data...
16 Error Function. Given data {[xⁱ, yⁱ]}ᵢ₌₁..m, optimize...
1. Classification error (Perceptron Training; Matrix Inversion): err_Class(w) = (1/m)·Σᵢ I[yⁱ ≠ o_w(xⁱ)]
2. Mean-squared error, LMS (Matrix Inversion; Gradient Descent): err_MSE(w) = (1/m)·Σᵢ [yⁱ − o_w(xⁱ)]²
3. Log Conditional Probability, LR (MSE Gradient Descent; LCL Gradient Descent): LCL(w) = (1/m)·Σᵢ log P_w(yⁱ | xⁱ)
4. Log Joint Probability, LDA/FDA (Direct Computation): LL(w) = (1/m)·Σᵢ log P_w(yⁱ, xⁱ)
17 #1: Optimal Classification Error. For each labeled instance [x, y]: Err = y − o_w(x), where y is the target value and o_w(x) = sgn(w·x) is the perceptron output. Idea: move the weights in the appropriate direction, to push Err → 0. If Err > 0 (error on a POSITIVE example): need to increase sgn(w·x), ie need to increase w·x. Input j contributes wⱼ·xⱼ to w·x: if xⱼ > 0, increasing wⱼ will increase w·x; if xⱼ < 0, decreasing wⱼ will increase w·x; so wⱼ ← wⱼ + xⱼ. If Err < 0 (error on a NEGATIVE example): wⱼ ← wⱼ − xⱼ.
18 Local Search via Gradient Descent. [figure]
19 #1a: Mistake Bound Perceptron Alg. Initialize w = 0. Do until bored: predict "+" iff w·x > 0, else "−"; on a mistake on a positive x: w ← w + x; on a mistake on a negative x: w ← w − x. (Worked trace: starting from weights [0 0 0], the weight vector is updated after each mistake on instances #1, #2, #3 until every instance is classified OK.)
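A minimal sketch of this mistake-driven loop (labels in {−1, +1}, instances already augmented with the constant input; the toy data here is invented, not the slide's instances #1-#3):

```python
def perceptron_train(data, max_epochs=100):
    """Mistake-driven perceptron: on a mistake on (x, y), set w <- w + y*x."""
    w = [0.0] * len(data[0][0])
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in data:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:  # mistake (or on boundary)
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:  # consistent with all of the training data
            return w
    return w

# Linearly separable toy data: ([x0 = 1, x1], label)
data = [([1.0, 2.0], 1), ([1.0, -1.0], -1), ([1.0, 3.0], 1)]
w = perceptron_train(data)
print(w)
```

On linearly separable data the loop stops after an epoch with zero mistakes, matching the convergence guarantee quoted on the following slides.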
20 Mistake Bound Theorem (see SVM). Theorem [Rosenblatt 1960]: if the data is consistent with some linear threshold w, then the number of mistakes is ≤ 1/γ², where γ measures the wiggle room available: if ‖x‖ = 1, then γ is the max, over all consistent planes, of the minimum distance of an example to that plane. w·x is ∝ the distance of x to the separator, as w·x = 0 at the boundary; so (w·x)/‖w‖ is the projection of x onto w, PERPENDICULAR to the boundary line, ie is the distance from x to that line (once normalized).
21 Proof of Convergence. x is wrong wrt w iff y·(w·x) < 0. Let w* be the unit vector representing the target plane, and let γ = min over examples of y·(w*·x). Let w be the hypothesis plane. Consider w·w* and ‖w‖: on each mistake, y·x is added to w, so w = Σ { y·x : y·(w·x) < 0 at the time of the mistake }.
22 Proof con't. If x is a mistake: w·w* increases by y·(x·w*) ≥ γ, while ‖w + y·x‖² = ‖w‖² + 2·y·(w·x) + ‖x‖² ≤ ‖w‖² + 1 (since y·(w·x) < 0 on a mistake). So after M mistakes: M·γ ≤ w·w* ≤ ‖w‖ ≤ √M, giving M ≤ 1/γ².
23 #1b: Perceptron Training Rule. For each labeled instance [x, y]: Err[x, y] = y − o_w(x) ∈ {−2, 0, +2}. If Err[x, y] = 0: correct! Do nothing; Δw = 0 = Err[x, y]·x. If Err[x, y] = +2: mistake on a positive! Increment by +2x; Δw = +2x = Err[x, y]·x. If Err[x, y] = −2: mistake on a negative! Increment by −2x; Δw = −2x = Err[x, y]·x. In all cases... Δwᵢ = Err[x, y]·xᵢ = [y − o_w(x)]·xᵢ. Batch mode: do ALL updates at once! Δwⱼ = Σᵢ [yⁱ − o_w(xⁱ)]·xⱼⁱ; wⱼ ← wⱼ + η·Δwⱼ.
24 [Batch perceptron training, for each feature j:] 0. Fix w; initialize Δw = 0. 1. For each row i, compute: a. Eⁱ = yⁱ − o_w(xⁱ); b. Δw += Eⁱ·xⁱ (ie, Δwⱼ += Eⁱ·xⱼⁱ). 2. Increment w ← w + η·Δw.
25 Correctness. The rule is intuitive: it climbs in the correct direction... Thrm: converges to the correct answer, if... the training data is linearly separable and η is sufficiently small. Proof: weight space has EXACTLY 1 minimum (no non-global minima), so with enough examples it finds the correct function! Explains the early popularity. If η is too large, it may overshoot; if η is too small, it takes too long. So often η = η(k), which decays with the number of iterations k.
26 #1c: Matrix Version? Task: given {xⁱ, yⁱ}ᵢ, where yⁱ ∈ {−1, +1} is the label, find {wᵢ} s.t. the linear equalities X·w = y hold. Solution: w = X⁻¹·y.
27 Issues. 1. Why restrict to only yⁱ ∈ {−1, +1}? If from a discrete set yⁱ ∈ {0, 1, ..., m}: general non-binary classification. If ARBITRARY yⁱ ∈ R: regression. 2. What if NO w works? (...X is singular; overconstrained...) Could try to minimize the residual: Σᵢ I[yⁱ ≠ w·xⁱ] is NP-hard!; Σᵢ (yⁱ − w·xⁱ)², ie ‖X·w − y‖², is easy!
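The "easy" option, minimizing the residual Σᵢ (yⁱ − w·xⁱ)² = ‖Xw − y‖², has a closed-form least-squares solution; a small NumPy sketch on invented, overconstrained data:

```python
import numpy as np

# Overconstrained: 4 examples, 2 weights, so no w satisfies Xw = y exactly.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])  # first column is the constant input x0 = 1
y = np.array([-1.0, -1.0, 1.0, 1.0])

# w minimizing ||Xw - y||^2 (the easy squared loss, not the NP-hard 0/1 loss)
w, *_ = np.linalg.lstsq(X, y, rcond=None)
predictions = np.sign(X @ w)
print(w, predictions)
```

Here the least-squares w still classifies every example correctly after thresholding, even though Xw = y has no exact solution.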
28 L₂ Error vs 0/1-Loss. The 0/1 loss function is not smooth/differentiable; the MSE error is smooth, differentiable, and is an overbound on the 0/1 loss...
29 Gradient Descent for Perceptron? Why not gradient descent for the THRESHOLDed perceptron? It needs a gradient (derivative), and sgn is not differentiable. Gradient descent is a general approach: it requires a continuously parameterized hypothesis, and the error must be differentiable wrt the parameters. But... it can be slow (many iterations) and may only find a LOCAL optimum.
30 Linear Separators Facts. GOOD NEWS: if the data is linearly separable, then a FAST ALGORITHM finds a correct {wᵢ}! But...
31 Linear Separators Facts. GOOD NEWS: if the data is linearly separable, then a FAST ALGORITHM finds a correct {wᵢ}! But some data sets are NOT linearly separable! Stay tuned!
32 #2: LMS Version of Classifier. View as regression: find the best linear mapping w from X to Y: w* = argmin_w Err_LMS(X, Y; w), where Err_LMS(X, Y; w) = Σᵢ (yⁱ − w·xⁱ)². Threshold: if wᵀx > 0.5, return 1; else 0. (See Chapter 3.)
33 General Idea. Use a discriminant function δₖ(x) for each class k; eg, δₖ(x) = P(G = k | X = x). Classification rule: return k = argmaxⱼ δⱼ(x). If each δⱼ is linear, the decision boundaries are piecewise hyperplanes.
34 Linear Classification using Linear Regression. 2-D input space x = (X₁, X₂); K = 3 classes; training sample N = 5. Indicator targets Y = (Y₁, Y₂, Y₃) ∈ { [1,0,0], [0,1,0], [0,0,1] }, stacked into matrices X, Y. Regression output: Ŷ(x) = [Ŷ₁(x), Ŷ₂(x), Ŷ₃(x)] = [xᵀβ̂₁, xᵀβ̂₂, xᵀβ̂₃], with β̂ = (XᵀX)⁻¹XᵀY. Classification rule: Ĝ(x) = argmaxₖ Ŷₖ(x).
35 Use Linear Regression for Classification? But regression minimizes the sum of squared errors on the target function, which gives strong influence to outliers. [figures: great separation vs bad separation]
36 #3: Logistic Regression. Want to compute P_w(y | x)... based on parameters w. But w·x has range (−∞, +∞), and a probability must be in the range [0, 1]. Need a squashing function (−∞, +∞) → [0, 1]: σ(z) = 1 / (1 + e^(−z)).
37 Alternative Derivation. Let a = ln[ P / (1 − P) ] be the log odds (with P = P(y | x)); then e^a = P / (1 − P), so P = e^a / (1 + e^a) = 1 / (1 + e^(−a)) = σ(a).
38 Sigmoid Unit. [figure]
39 Logistic Regression con't. Assume 2 classes: P_w(y = 1 | x) = σ(w·x) = 1 / (1 + e^(−w·x)); P_w(y = 0 | x) = 1 − σ(w·x) = e^(−w·x) / (1 + e^(−w·x)). Log odds: log[ P_w(1 | x) / P_w(0 | x) ] = w·x: linear!
40 How to learn the parameters w? Depends on the goal. A: Minimize MSE: Σᵢ (yⁱ − o_w(xⁱ))². B: Maximize likelihood: Σᵢ log P_w(yⁱ | xⁱ).
41 MSE Gradient for Sigmoid Unit. Error: E = Σⱼ ½(yʲ − o_w(xʲ))². For a single training instance: input xʲ = [x₁ʲ, ..., xₖʲ]; computed output oʲ = σ(Σᵢ xᵢʲ·wᵢ) = σ(zʲ), where zʲ = Σᵢ xᵢʲ·wᵢ using the current {wᵢ}; correct output yʲ. Stochastic error gradient (ignore the j superscript below): σ(z) = 1 / (1 + e^(−z)).
42 Derivative of Sigmoid. dσ(a)/da = d/da [ (1 + e^(−a))⁻¹ ] = e^(−a) / (1 + e^(−a))² = [1/(1 + e^(−a))]·[e^(−a)/(1 + e^(−a))] = σ(a)·(1 − σ(a)).
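The identity σ′(a) = σ(a)·(1 − σ(a)) can be checked numerically against a finite difference (a small sketch; the sample points are arbitrary):

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Compare the analytic derivative sigma(a) * (1 - sigma(a))
# with a central finite difference at a few points.
h = 1e-6
for a in (-2.0, 0.0, 1.5):
    analytic = sigmoid(a) * (1.0 - sigmoid(a))
    numeric = (sigmoid(a + h) - sigmoid(a - h)) / (2 * h)
    print(a, abs(analytic - numeric) < 1e-8)
```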
43 Updating LR Weights (MSE). Note: as we have already computed oʲ = σ(zʲ) to get the answer, it is trivial to compute σ′(zʲ) = σ(zʲ)·(1 − σ(zʲ)). Update: wᵢ ← wᵢ + Δwᵢ, where Δwᵢ = η·Σⱼ (yʲ − oʲ)·oʲ·(1 − oʲ)·xᵢʲ.
44 LMS [batch training of the sigmoid unit, for each feature j:] 0. Fix w; initialize Δw = 0. 1. For each row i, compute: a. Eⁱ = (yⁱ − oⁱ)·oⁱ·(1 − oⁱ); b. Δw += Eⁱ·xⁱ (ie, Δwⱼ += Eⁱ·xⱼⁱ). 2. Increment w ← w + η·Δw.
45 B: Or... Learn the Conditional Probability. As we are fitting a probability distribution, better to return the distribution P_w that is most likely given the training data S: argmax_w P(w | S) = argmax_w P(S | w)·P(w)/P(S) (Bayes' rule) = argmax_w P(S | w)·P(w) (as P(S) does not depend on w) = argmax_w P(S | w) (as P(w) is uniform) = argmax_w log P(S | w) (as log is monotonic).
46 ML Estimation. P(S | w) is the likelihood function; L(w) = log P(S | w). w* = argmax_w L(w) is the maximum likelihood estimator (MLE).
47 Computing the Likelihood. As the training examples [xⁱ, yⁱ] are iid (drawn independently from the same unknown distribution), log P(S | w) = log Πᵢ P_w(xⁱ, yⁱ) = Σᵢ log P_w(xⁱ, yⁱ) = Σᵢ log P_w(yⁱ | xⁱ) + Σᵢ log P_w(xⁱ). Here P_w(xⁱ) = 1/n (not dependent on w, over the empirical sample S), so w* = argmax_w Σᵢ log P_w(yⁱ | xⁱ).
48 Fit Logistic Regression by Gradient Ascent. Want w* = argmax_w J(w), where J(w) = Σᵢ r(xⁱ, yⁱ, w). For y ∈ {0, 1}: r(x, y, w) = log P_w(y | x) = y·log P_w(1 | x) + (1 − y)·log P_w(0 | x). So climb along ∂J/∂wⱼ = Σᵢ ∂r(xⁱ, yⁱ, w)/∂wⱼ.
49 Gradient Descent. Write pⁱ = P_w(y = 1 | xⁱ) = σ(w·xⁱ). Then ∂r/∂wⱼ = ∂/∂wⱼ [ y·log p + (1 − y)·log(1 − p) ] = [ y/p − (1 − y)/(1 − p) ]·∂p/∂wⱼ, and ∂p/∂wⱼ = σ(w·x)·(1 − σ(w·x))·xⱼ = p·(1 − p)·xⱼ, so ∂r/∂wⱼ = (y − p)·xⱼ and ∂J/∂wⱼ = Σᵢ (yⁱ − pⁱ)·xⱼⁱ.
50 Gradient Ascent for Logistic Regression (MLE): Δwⱼ = Σᵢ (yⁱ − pⁱ)·xⱼⁱ; wⱼ ← wⱼ + η·Δwⱼ.
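This update rule can be written out as a short batch gradient-ascent loop (a sketch; the toy data, learning rate, and epoch count are invented):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lr_fit(xs, ys, eta=0.1, epochs=2000):
    """Batch gradient ascent on the log conditional likelihood, labels in {0, 1}."""
    w = [0.0] * len(xs[0])
    for _ in range(epochs):
        delta = [0.0] * len(w)
        for x, y in zip(xs, ys):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))  # p^i = P_w(1 | x^i)
            for j, xj in enumerate(x):
                delta[j] += (y - p) * xj  # delta_wj += (y^i - p^i) * x^i_j
        w = [wi + eta * d for wi, d in zip(w, delta)]  # w_j <- w_j + eta * delta_wj
    return w

xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]  # x0 = 1 is the constant input
ys = [0, 0, 1, 1]
w = lr_fit(xs, ys)
probs = [sigmoid(sum(wi * xi for wi, xi in zip(w, x))) for x in xs]
print([round(p) for p in probs])
```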
51 Comments on MLE Algorithm. This is BATCH; the obvious online alg is stochastic gradient ascent. Can use a second-order Newton-Raphson alg for faster convergence: a weighted least squares computation, aka Iteratively-Reweighted Least Squares (IRLS).
52 Use Logistic Regression for Classification. Return YES iff P(1 | x) > P(0 | x), iff P(1 | x)/P(0 | x) > 1, iff ln[ P(1 | x)/P(0 | x) ] > 0, iff ln[ (1/(1 + e^(−w·x))) / (e^(−w·x)/(1 + e^(−w·x))) ] = ln e^(w·x) = w·x > 0. Logistic regression learns an LTU!
53 Logistic Regression for K > 2 Classes. Note: K − 1 different wᵢ weight vectors, each of the input dimension.
54 Learning LR Weights. P_w(y | x) = exp(w·x)/(1 + exp(w·x)) if y = 1, and 1/(1 + exp(w·x)) if y = 0. Updates: MSE gives Δwⱼ = Σᵢ (yⁱ − oⁱ)·oⁱ·(1 − oⁱ)·xⱼⁱ; MLE gives Δwⱼ = Σᵢ (yⁱ − pⁱ)·xⱼⁱ.
55 MaxProb vs LMS [batch training, for each feature j:] 0. Fix w; initialize Δw = 0. 1. For each row i, compute: a. Eⁱ = yⁱ − pⁱ [MaxProb] vs Eⁱ = (yⁱ − oⁱ)·oⁱ·(1 − oⁱ) [LMS]; b. Δw += Eⁱ·xⁱ. 2. Increment w ← w + η·Δw.
56 Logistic Regression Computation. p + 1 non-linear equations; solve by the Newton-Raphson method: β_new = β_old − [∂²l(β)/∂β∂βᵀ]⁻¹·∂l(β)/∂β. Here l(β) = Σᵢ log Pr(G = yⁱ | X = xⁱ) = Σᵢ [ yⁱ·βᵀxⁱ − log(1 + exp(βᵀxⁱ)) ], and setting ∂l/∂β = Σᵢ xⁱ·(yⁱ − p(xⁱ; β)) = 0 gives the equations to solve.
57 Newton-Raphson Method. A gen'l technique for solving f(x) = 0, even if f is non-linear. Taylor series: f(x_{n+1}) ≈ f(xₙ) + (x_{n+1} − xₙ)·f′(xₙ). When xₙ is near the root, set f(x_{n+1}) = 0: x_{n+1} = xₙ − f(xₙ)/f′(xₙ). Iterate.
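The iteration x_{n+1} = xₙ − f(xₙ)/f′(xₙ) in a few lines (the example equation, x² − 2 = 0, is chosen only for illustration):

```python
def newton(f, fprime, x0, tol=1e-12, max_iter=50):
    """Newton-Raphson: iterate x <- x - f(x)/f'(x) until f(x) is ~0."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            break
        x = x - fx / fprime(x)
    return x

# Solve x^2 - 2 = 0; the root is sqrt(2)
root = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
print(root)
```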
58 Newton-Raphson in Multi-dimensions. To solve the equations f₁(x₁, ..., x_N) = 0, ..., f_N(x₁, ..., x_N) = 0: Taylor series f(xⁿ⁺¹) ≈ f(xⁿ) + J(xⁿ)·(xⁿ⁺¹ − xⁿ), where J is the Jacobian matrix, Jₖⱼ = ∂fₖ/∂xⱼ. N-R step: xⁿ⁺¹ = xⁿ − J(xⁿ)⁻¹·f(xⁿ).
59 Newton-Raphson: Example. [worked 2-D example: a pair of nonlinear equations in x and y involving sin and cos, solved by iterating with the 2×2 Jacobian]
60 Maximum Likelihood Parameter Estimation. Find the unknown parameters (mean µ & standard deviation σ) of a Gaussian pdf p(x; µ, σ) = (1/√(2πσ²))·exp(−(x − µ)²/(2σ²)), given N independent samples {x₁, ..., x_N}. Estimate the parameters that maximize the likelihood function L(µ, σ) = Πᵢ p(xᵢ; µ, σ): (µ̂, σ̂) = argmax_{µ,σ} L(µ, σ).
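For the Gaussian, this maximization has the closed form µ̂ = (1/N)·Σᵢ xᵢ and σ̂² = (1/N)·Σᵢ (xᵢ − µ̂)² (note the 1/N of the MLE, not the 1/(N−1) of the unbiased estimator); a quick sketch with invented samples:

```python
import math

def gaussian_mle(samples):
    """Closed-form MLE of (mu, sigma) from N iid samples of a Gaussian."""
    n = len(samples)
    mu = sum(samples) / n
    var = sum((x - mu) ** 2 for x in samples) / n  # 1/N: the (biased) MLE variance
    return mu, math.sqrt(var)

samples = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mu, sigma = gaussian_mle(samples)
print(mu, sigma)
```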
61 Logistic Regression Algs for LTUs. Learns the conditional probability distribution P(y | x). Local search: begin with an initial weight vector; iteratively modify it to maximize the objective function, the log likelihood of the data (ie, seek w s.t. the probability distribution P_w is most likely given the data). Eager: the classifier is constructed from the training examples, which can then be discarded. Online or batch.
62 Masking of Some Class. Linear regression of the indicator matrix can lead to masking (2-D input space and three classes): along the viewing direction, the middle class's fit Ŷ₂ = xᵀβ̂₂ is everywhere dominated by Ŷ₁ = xᵀβ̂₁ or Ŷ₃ = xᵀβ̂₃, so the middle class is never predicted. LDA can avoid this masking.
63 #4: Linear Discriminant Analysis. LDA learns the joint distribution P(y, x). As P(y, x) = P(y | x)·P(x): optimizing P(y, x) ≠ optimizing P(y | x). Generative model: P(y, x) is a model of how the data is generated. Eg, factor P(y, x) = P(y)·P(x | y): P(y) generates a value for y; then P(x | y) generates a value for x given this y. Belief net: Y → X.
64 Linear Discriminant Analysis, con't. P(y, x) = P(y)·P(x | y). P(y) is a simple discrete distribution; eg P(y = 0) = 0.31, P(y = 1) = 0.69: 31% negative examples, 69% positive examples. Assume P(x | y) is multivariate normal, with mean µₖ and covariance Σ.
65 Estimating the LDA Model. Linear discriminant analysis assumes the form P(x | y) = N(x; µ_y, Σ): µ_y is the mean for examples belonging to class y; the covariance matrix Σ is shared by all classes! Can estimate LDA directly: mₖ = #training examples in class k; estimate of P(y = k): p̂ₖ = mₖ/m; µ̂ₖ = (1/mₖ)·Σ_{i: yⁱ=k} xⁱ; Σ̂ = (1/m)·Σᵢ (xⁱ − µ̂_{yⁱ})·(xⁱ − µ̂_{yⁱ})ᵀ (subtract each µ̂_{yⁱ} from the corresponding xⁱ before taking the outer product).
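These estimates (class frequencies, class means, and a covariance pooled over the class-centered examples) are direct to compute; a NumPy sketch on invented data (not the 7-example problem worked on the next slides):

```python
import numpy as np

def estimate_lda(X, y):
    """MLE of the LDA model: priors p_k, means mu_k, one shared covariance."""
    m = len(y)
    classes = np.unique(y)
    priors = {k: np.mean(y == k) for k in classes}        # p_k = m_k / m
    means = {k: X[y == k].mean(axis=0) for k in classes}  # mu_k
    # Subtract each example's own class mean before the outer products
    centered = np.vstack([X[y == k] - means[k] for k in classes])
    sigma = centered.T @ centered / m                     # shared covariance
    return priors, means, sigma

X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 2.0],
              [6.0, 5.0], [7.0, 7.0], [8.0, 6.0], [7.0, 6.0]])
y = np.array([0, 0, 0, 1, 1, 1, 1])
priors, means, sigma = estimate_lda(X, y)
print(priors[0], means[0], sigma.shape)
```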
66 Example of Estimation. m = 7 examples; m₊ = 3 positive, m₋ = 4 negative. p̂₊ = 3/7, p̂₋ = 4/7. (Note: do NOT pre-pend x₀ = 1!) [worked computation of µ̂₊ and µ̂₋]
67 Estimation, con't. [worked computation of Σ̂: sum the seven centered outer products (xⁱ − µ̂_{yⁱ})(xⁱ − µ̂_{yⁱ})ᵀ and divide by 7]
68 Classifying, Using LDA. How to classify a new instance, given the estimates p̂ₖ, µ̂ₖ, Σ̂? Eg, the class for the instance x = [5, 4, 6]ᵀ: compare p̂₊·N(x; µ̂₊, Σ̂) against p̂₋·N(x; µ̂₋, Σ̂) and return the larger. [worked computation]
69 LDA Learns an LTU. Consider the 2-class case with a 0/1 loss function. Classify as y = 1 iff P(y = 1, x) > P(y = 0, x), ie iff P(y = 1)·N(x; µ₁, Σ) > P(y = 0)·N(x; µ₀, Σ).
70 LDA Learns an LTU (2). Taking logs, the quadratic term xᵀΣ⁻¹x is common to both sides and cancels, leaving the linear comparison xᵀΣ⁻¹µ₁ − ½µ₁ᵀΣ⁻¹µ₁ + log P(y = 1) vs xᵀΣ⁻¹µ₀ − ½µ₀ᵀΣ⁻¹µ₀ + log P(y = 0). (As Σ⁻¹ is symmetric, xᵀΣ⁻¹µ = µᵀΣ⁻¹x.)
71 LDA Learns an LTU (3). So let w = Σ⁻¹·(µ₁ − µ₀) and let c = ½µ₁ᵀΣ⁻¹µ₁ − ½µ₀ᵀΣ⁻¹µ₀ − log[P(y = 1)/P(y = 0)]; classify as y = 1 iff w·x − c > 0: an LTU!!
72 LDA: Example. LDA was able to avoid masking here. [figure]
73 View LDA wrt Mahalanobis Distance. Squared Mahalanobis distance between x and µ: D_M(x, µ)² = (x − µ)ᵀΣ⁻¹(x − µ). Σ⁻¹ is a linear distortion that converts the standard Euclidean distance into the Mahalanobis distance. LDA (with equal priors) classifies x as y = 0 if D_M(x, µ₀) < D_M(x, µ₁); in general, use the discriminant δₖ(x) = log P(y = k) − ½·D_M(x, µₖ)² = log πₖ − ½·D_M(x, µₖ)².
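A sketch of the squared Mahalanobis distance and the equal-priors rule "pick the class with the nearer mean" (the covariance, means, and query point are invented):

```python
import numpy as np

def mahalanobis_sq(x, mu, sigma_inv):
    """Squared Mahalanobis distance (x - mu)^T Sigma^-1 (x - mu)."""
    d = x - mu
    return float(d @ sigma_inv @ d)

sigma = np.array([[2.0, 0.0],
                  [0.0, 0.5]])
sigma_inv = np.linalg.inv(sigma)

mu0 = np.array([0.0, 0.0])
mu1 = np.array([4.0, 0.0])
x = np.array([1.0, 1.0])

d0 = mahalanobis_sq(x, mu0, sigma_inv)
d1 = mahalanobis_sq(x, mu1, sigma_inv)
print(d0, d1)  # with equal priors, classify x as class 0 iff d0 < d1
```

With Σ = I this reduces to squared Euclidean distance; the Σ⁻¹ in the middle is exactly the linear distortion mentioned above.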
74 Generalizations of LDA. General Gaussian Classifier (QDA): allow each class k to have its own Σₖ; the classifier is a quadratic threshold unit, not an LTU. Naïve Gaussian Classifier: allow each class k to have its own Σₖ, but require each Σₖ to be diagonal (within each class, any pair of features xᵢ and xⱼ are independent); the classifier is still a quadratic threshold unit, but of a restricted form. Most-discriminating low-dimensional projection: Fisher's Linear Discriminant.
75 QDA and Masking. Better than linear regression in terms of handling masking; usually computationally more expensive than LDA.
76 Variants of LDA (covariance matrix; n features, k classes):
Name | Same Σ for all classes? | Diagonal? | #params
LDA | yes | no | O(n²)
Naïve Gaussian Classifier | no | yes | k·n
General Gaussian Classifier | no | no | O(k·n²)
77 Versions of ?DA. [figure: decision boundaries for LDA, Quadratic (QDA), Naïve, and SuperSimple discriminant analysis]
78 Summary of Linear Discriminant Analysis. Learns the joint probability distribution P(y, x). Direct computation: the ML estimate of P(y, x) is computed directly from the data, without search. But need to invert a matrix, which is O(n³). Eager: the classifier is constructed from the training examples, which can then be discarded. Batch: only a batch algorithm; an online LDA alg requires an online alg for incrementally updating Σ⁻¹ [easy if Σ⁻¹ is diagonal...].
79 Fisher's Linear Discriminant. LDA finds a (K−1)-dim hyperplane (K = number of classes); project x and the {µₖ} onto that hyperplane; classify as the nearest µₖ within the hyperplane. Better: find the hyperplane that maximally separates the projections of the x's (wrt the within-class scatter): Fisher's Linear Discriminant.
80 Fisher Linear Discriminant. Recall any vector w projects Rⁿ → R (x ↦ wᵀx). Goal: want a w that separates the classes: each positive wᵀx far from each negative wᵀx. Perhaps project onto µ₊ − µ₋? Still overlap. Why?
81 Fisher Linear Discriminant. Problem with projecting onto µ₊ − µ₋: it does not consider the scatter within each class. Goal: want a w that separates the classes: each positive wᵀx far from each negative wᵀx; the positive x's: wᵀx close to each other; the negative x's: wᵀx close to each other. Scatter of the projected instances: s₊² = Σ_{i∈+} (wᵀxⁱ − m₊)², s₋² = Σ_{i∈−} (wᵀxⁱ − m₋)².
82 Fisher Linear Discriminant (con't). Recall any vector w projects Rⁿ → R. Goal: want a w that separates the classes: positive x's with wᵀx close to each other, negative x's with wᵀx close to each other, and each positive wᵀx far from each negative wᵀx. Scatter: s₊² = Σ_{i∈+} (wᵀxⁱ − m₊)², s₋² = Σ_{i∈−} (wᵀxⁱ − m₋)².
83 FLD, con't. Separate the means m₊ and m₋: maximize (m₊ − m₋)². Minimize each spread s₊², s₋²: minimize s₊² + s₋². Objective function: maximize J(w) = (m₊ − m₋)² / (s₊² + s₋²). Numerator: (m₊ − m₋)² = (wᵀµ₊ − wᵀµ₋)² = wᵀ(µ₊ − µ₋)(µ₊ − µ₋)ᵀw = wᵀS_B·w, with between-class scatter S_B = (µ₊ − µ₋)(µ₊ − µ₋)ᵀ.
84 FLD, III. s₊² = Σ_{i∈+} (wᵀxⁱ − m₊)² = wᵀ[ Σ_{i∈+} (xⁱ − µ₊)(xⁱ − µ₊)ᵀ ]w = wᵀS₊·w, where S₊ = Σ_{i∈+} (xⁱ − µ₊)(xⁱ − µ₊)ᵀ is the within-class scatter matrix for class +, and likewise S₋ for class −. S_W = S₊ + S₋, so s₊² + s₋² = wᵀS_W·w.
85 FLD, IV. Maximizing J(w) = wᵀS_B·w / wᵀS_W·w is equivalent to w* = argmax_w wᵀS_B·w s.t. wᵀS_W·w = 1. Lagrange: L(w, λ) = wᵀS_B·w − λ·(wᵀS_W·w − 1); setting ∂L/∂w = 0 gives S_B·w = λ·S_W·w, so w* is an eigenvector of S_W⁻¹S_B.
86 FLD, V. The optimal w* is an eigenvector of S_W⁻¹S_B. When P(x | yᵢ) ~ N(µᵢ, Σ), this gives the LINEAR DISCRIMINANT w = Σ⁻¹(µ₊ − µ₋). FLD is the optimal classifier if the classes are normally distributed. Can use it even if they are not Gaussian: after projecting the d-dim x to the 1-dim wᵀx, just use any classification method.
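The two-class Fisher direction w ∝ S_W⁻¹(µ₊ − µ₋) from this derivation, sketched in NumPy (the toy clusters are invented):

```python
import numpy as np

def fisher_direction(X_pos, X_neg):
    """w = S_W^-1 (m+ - m-), with S_W the sum of the within-class scatters."""
    m_pos, m_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)

    def scatter(X, m):
        return (X - m).T @ (X - m)

    S_w = scatter(X_pos, m_pos) + scatter(X_neg, m_neg)
    return np.linalg.solve(S_w, m_pos - m_neg)

X_pos = np.array([[4.0, 2.0], [5.0, 3.0], [6.0, 2.0], [5.0, 1.0]])
X_neg = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0], [1.0, -1.0]])
w = fisher_direction(X_pos, X_neg)

# The projections w^T x of the two classes should be well separated
print(X_pos @ w, X_neg @ w)
```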
87 Fisher's LD vs LDA. Fisher's LD ≡ LDA when: the prior probabilities are the same, and each class-conditional density is multivariate Gaussian with a common covariance matrix. Fisher's LD does not assume Gaussian densities, and it can be used to reduce dimensions, even in the multiple-class scenario.
88 Comparing LMS, Logistic Regression, LDA, FLD. Which is best: LMS, LR, LDA, or FLD? There is an ongoing debate within the machine learning community about the relative merits of direct classifiers [LMS], conditional models P(y | x) [LR], and generative models P(y, x) [LDA, FLD]. Stay tuned...
89 Issues in Debate. Statistical efficiency: if the generative model P(y, x) is correct, then it usually gives better accuracy, particularly if the training sample is small. Computational efficiency: generative models are typically the easiest to compute (LDA/FLD are computed directly, without iteration). Robustness to changing loss functions: LMS must re-train the classifier when the loss function changes; no retraining is needed for generative and conditional models. Robustness to model assumptions: a generative model usually performs poorly when its assumptions are violated; eg, LDA works poorly if P(x | y) is non-Gaussian. Logistic regression is more robust, and LMS is even more robust. Robustness to missing values and noise: in many applications, some of the features xᵢⱼ may be missing or corrupted for some of the training examples; generative models typically provide better ways of handling this than non-generative models.
90 Other Algorithms for Learning LTUs. Naive Bayes [discussed later]: for K classes, produces an LTU. Winnow [?discussed later?]: can handle large numbers of "irrelevant" features (features whose weights should be zero).
91 Learning Theory. Assume the data is truly linearly separable... Sample complexity: given ε, δ ∈ (0, 1), want an LTU that has error rate on new examples less than ε, with probability > 1 − δ; it suffices to learn from (be consistent with) sufficiently many labeled training examples. Computational complexity: there is a polynomial-time algorithm for finding a consistent LTU (reduction from linear programming). The agnostic case is different.
Woodhouse College 0 Page Introduction... 3 Terminolog... 3 Activit... 4 Wh use numerical methods?... Change of sign... Activit... 6 Interval Bisection... 7 Decimal Search... 8 Coursework Requirements on
More informationMachine Learning for NLP
Machine Learning for NLP Uppsala University Department of Linguistics and Philology Slides borrowed from Ryan McDonald, Google Research Machine Learning for NLP 1(50) Introduction Linear Classifiers Classifiers
More informationECE521 Lecture7. Logistic Regression
ECE521 Lecture7 Logistic Regression Outline Review of decision theory Logistic regression A single neuron Multi-class classification 2 Outline Decision theory is conceptually easy and computationally hard
More informationMachine Learning 4771
Machine Learning 4771 Instructor: Tony Jebara Topic 7 Unsupervised Learning Statistical Perspective Probability Models Discrete & Continuous: Gaussian, Bernoulli, Multinomial Maimum Likelihood Logistic
More informationMulticlass Logistic Regression
Multiclass Logistic Regression Sargur. Srihari University at Buffalo, State University of ew York USA Machine Learning Srihari Topics in Linear Classification using Probabilistic Discriminative Models
More informationIntroduction to Machine Learning
1, DATA11002 Introduction to Machine Learning Lecturer: Antti Ukkonen TAs: Saska Dönges and Janne Leppä-aho Department of Computer Science University of Helsinki (based in part on material by Patrik Hoyer,
More informationMIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October,
MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, 23 2013 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run
More informationApril 9, Depto. de Ing. de Sistemas e Industrial Universidad Nacional de Colombia, Bogotá. Linear Classification Models. Fabio A. González Ph.D.
Depto. de Ing. de Sistemas e Industrial Universidad Nacional de Colombia, Bogotá April 9, 2018 Content 1 2 3 4 Outline 1 2 3 4 problems { C 1, y(x) threshold predict(x) = C 2, y(x) < threshold, with threshold
More informationECS171: Machine Learning
ECS171: Machine Learning Lecture 4: Optimization (LFD 3.3, SGD) Cho-Jui Hsieh UC Davis Jan 22, 2018 Gradient descent Optimization Goal: find the minimizer of a function min f (w) w For now we assume f
More informationMidterm: CS 6375 Spring 2015 Solutions
Midterm: CS 6375 Spring 2015 Solutions The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run out of room for an
More informationL11: Pattern recognition principles
L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction
More informationLecture 3: Pattern Classification
EE E6820: Speech & Audio Processing & Recognition Lecture 3: Pattern Classification 1 2 3 4 5 The problem of classification Linear and nonlinear classifiers Probabilistic classification Gaussians, mixtures
More informationMachine Learning - Waseda University Logistic Regression
Machine Learning - Waseda University Logistic Regression AD June AD ) June / 9 Introduction Assume you are given some training data { x i, y i } i= where xi R d and y i can take C different values. Given
More informationLecture 4: Types of errors. Bayesian regression models. Logistic regression
Lecture 4: Types of errors. Bayesian regression models. Logistic regression A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting more generally COMP-652 and ECSE-68, Lecture
More informationLecture 5: LDA and Logistic Regression
Lecture 5: and Logistic Regression Hao Helen Zhang Hao Helen Zhang Lecture 5: and Logistic Regression 1 / 39 Outline Linear Classification Methods Two Popular Linear Models for Classification Linear Discriminant
More informationMachine Learning: The Perceptron. Lecture 06
Machine Learning: he Perceptron Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu 1 McCulloch-Pitts Neuron Function 0 1 w 0 activation / output function 1 w 1 w w
More informationIntroduction to Machine Learning
1, DATA11002 Introduction to Machine Learning Lecturer: Teemu Roos TAs: Ville Hyvönen and Janne Leppä-aho Department of Computer Science University of Helsinki (based in part on material by Patrik Hoyer
More informationNotes on Discriminant Functions and Optimal Classification
Notes on Discriminant Functions and Optimal Classification Padhraic Smyth, Department of Computer Science University of California, Irvine c 2017 1 Discriminant Functions Consider a classification problem
More informationEngineering Part IIB: Module 4F10 Statistical Pattern Processing Lecture 5: Single Layer Perceptrons & Estimating Linear Classifiers
Engineering Part IIB: Module 4F0 Statistical Pattern Processing Lecture 5: Single Layer Perceptrons & Estimating Linear Classifiers Phil Woodland: pcw@eng.cam.ac.uk Michaelmas 202 Engineering Part IIB:
More informationOverfitting, Bias / Variance Analysis
Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic
More information6.867 Machine learning
6.867 Machine learning Mid-term eam October 8, 6 ( points) Your name and MIT ID: .5.5 y.5 y.5 a).5.5 b).5.5.5.5 y.5 y.5 c).5.5 d).5.5 Figure : Plots of linear regression results with different types of
More informationLogistic Regression. Machine Learning Fall 2018
Logistic Regression Machine Learning Fall 2018 1 Where are e? We have seen the folloing ideas Linear models Learning as loss minimization Bayesian learning criteria (MAP and MLE estimation) The Naïve Bayes
More informationLinear and Logistic Regression. Dr. Xiaowei Huang
Linear and Logistic Regression Dr. Xiaowei Huang https://cgi.csc.liv.ac.uk/~xiaowei/ Up to now, Two Classical Machine Learning Algorithms Decision tree learning K-nearest neighbor Model Evaluation Metrics
More informationMidterm exam CS 189/289, Fall 2015
Midterm exam CS 189/289, Fall 2015 You have 80 minutes for the exam. Total 100 points: 1. True/False: 36 points (18 questions, 2 points each). 2. Multiple-choice questions: 24 points (8 questions, 3 points
More informationStructured Prediction with Perceptron: Theory and Algorithms
Structured Prediction with Perceptron: Theor and Algorithms the man bit the dog the man hit the dog DT NN VBD DT NN 那 人咬了狗 =+1 =-1 Kai Zhao Dept. Computer Science Graduate Center, the Cit Universit of
More informationLinear smoother. ŷ = S y. where s ij = s ij (x) e.g. s ij = diag(l i (x))
Linear smoother ŷ = S y where s ij = s ij (x) e.g. s ij = diag(l i (x)) 2 Online Learning: LMS and Perceptrons Partially adapted from slides by Ryan Gabbard and Mitch Marcus (and lots original slides by
More informationMachine Learning
Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University February 2, 2015 Today: Logistic regression Generative/Discriminative classifiers Readings: (see class website)
More informationLinear classifiers: Overfitting and regularization
Linear classifiers: Overfitting and regularization Emily Fox University of Washington January 25, 2017 Logistic regression recap 1 . Thus far, we focused on decision boundaries Score(x i ) = w 0 h 0 (x
More informationMark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.
CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.
More informationReading Group on Deep Learning Session 1
Reading Group on Deep Learning Session 1 Stephane Lathuiliere & Pablo Mesejo 2 June 2016 1/31 Contents Introduction to Artificial Neural Networks to understand, and to be able to efficiently use, the popular
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More information> DEPARTMENT OF MATHEMATICS AND COMPUTER SCIENCE GRAVIS 2016 BASEL. Logistic Regression. Pattern Recognition 2016 Sandro Schönborn University of Basel
Logistic Regression Pattern Recognition 2016 Sandro Schönborn University of Basel Two Worlds: Probabilistic & Algorithmic We have seen two conceptual approaches to classification: data class density estimation
More informationGenerative v. Discriminative classifiers Intuition
Logistic Regression (Continued) Generative v. Discriminative Decision rees Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University January 31 st, 2007 2005-2007 Carlos Guestrin 1 Generative
More informationIntroduction to Machine Learning
Introduction to Machine Learning Logistic Regression Varun Chandola Computer Science & Engineering State University of New York at Buffalo Buffalo, NY, USA chandola@buffalo.edu Chandola@UB CSE 474/574
More informationCOMP 652: Machine Learning. Lecture 12. COMP Lecture 12 1 / 37
COMP 652: Machine Learning Lecture 12 COMP 652 Lecture 12 1 / 37 Today Perceptrons Definition Perceptron learning rule Convergence (Linear) support vector machines Margin & max margin classifier Formulation
More informationStatistical Machine Learning Hilary Term 2018
Statistical Machine Learning Hilary Term 2018 Pier Francesco Palamara Department of Statistics University of Oxford Slide credits and other course material can be found at: http://www.stats.ox.ac.uk/~palamara/sml18.html
More informationThe Perceptron algorithm
The Perceptron algorithm Tirgul 3 November 2016 Agnostic PAC Learnability A hypothesis class H is agnostic PAC learnable if there exists a function m H : 0,1 2 N and a learning algorithm with the following
More informationMachine Learning Basics Lecture 2: Linear Classification. Princeton University COS 495 Instructor: Yingyu Liang
Machine Learning Basics Lecture 2: Linear Classification Princeton University COS 495 Instructor: Yingyu Liang Review: machine learning basics Math formulation Given training data x i, y i : 1 i n i.i.d.
More informationIntroduction to Machine Learning
Outline Introduction to Machine Learning Bayesian Classification Varun Chandola March 8, 017 1. {circular,large,light,smooth,thick}, malignant. {circular,large,light,irregular,thick}, malignant 3. {oval,large,dark,smooth,thin},
More informationMining Classification Knowledge
Mining Classification Knowledge Remarks on NonSymbolic Methods JERZY STEFANOWSKI Institute of Computing Sciences, Poznań University of Technology COST Doctoral School, Troina 2008 Outline 1. Bayesian classification
More informationMachine Learning. Lecture 3: Logistic Regression. Feng Li.
Machine Learning Lecture 3: Logistic Regression Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2016 Logistic Regression Classification
More informationLogistic Regression. William Cohen
Logistic Regression William Cohen 1 Outline Quick review classi5ication, naïve Bayes, perceptrons new result for naïve Bayes Learning as optimization Logistic regression via gradient ascent Over5itting
More informationMachine Learning
Machine Learning 10-701 Tom M. Mitchell Machine Learning Department Carnegie Mellon University February 1, 2011 Today: Generative discriminative classifiers Linear regression Decomposition of error into
More informationLanguage and Statistics II
Language and Statistics II Lecture 19: EM for Models of Structure Noah Smith Epectation-Maimization E step: i,, q i # p r $ t = p r i % ' $ t i, p r $ t i,' soft assignment or voting M step: r t +1 # argma
More informationMidterm Review CS 6375: Machine Learning. Vibhav Gogate The University of Texas at Dallas
Midterm Review CS 6375: Machine Learning Vibhav Gogate The University of Texas at Dallas Machine Learning Supervised Learning Unsupervised Learning Reinforcement Learning Parametric Y Continuous Non-parametric
More information