
1 Lecture 10: Linear Discriminant Functions & Perceptron. Aykut Erdem, March 2016, Hacettepe University

2 Administrative: Project proposal due today! A half-page description: the problem to be investigated, why it is interesting, what data you will use, etc. 2

3 Last time: Logistic Regression. Assumes the following functional form for P(Y|X): the logistic (sigmoid) function applied to a linear function of the data, P(Y=1|X) = 1 / (1 + exp(-(w_0 + Σ_i w_i X_i))). Features can be discrete or continuous. (slide by Aarti Singh & Barnabás Póczos) 3

4 Naïve Bayes vs. Logistic Regression. Set of Gaussian Naïve Bayes parameters (feature variance independent of class label) vs. set of Logistic Regression parameters. Representation equivalence, but only in a special case (GNB with class-independent variances). But what's the difference? (slide by Aarti Singh & Barnabás Póczos) 4

5 Naïve Bayes vs. Logistic Regression (cont'd). Representation equivalence holds only in that special case, and LR makes no assumption about P(X|Y) in learning! The loss functions differ: the two models optimize different objectives and obtain different solutions. (slide by Aarti Singh & Barnabás Póczos) 5

6 Naïve Bayes vs. Logistic Regression. Consider Y Boolean, X_i continuous, X = <X_1, ..., X_d>. Number of parameters: NB: 4d+1, namely π, (μ_{1,y}, μ_{2,y}, ..., μ_{d,y}) and (σ²_{1,y}, σ²_{2,y}, ..., σ²_{d,y}) for y = 0, 1; LR: d+1, namely w_0, w_1, ..., w_d. Estimation method: NB parameter estimates are uncoupled, LR parameter estimates are coupled. (slide by Aarti Singh & Barnabás Póczos) 6

7 Generative vs. Discriminative [Ng & Jordan, NIPS 2001]. Given infinite data (asymptotically): if the conditional independence assumption holds, discriminative and generative NB perform similarly; if the conditional independence assumption does NOT hold, discriminative outperforms generative NB. (slide by Aarti Singh & Barnabás Póczos) 7

8 Generative vs. Discriminative [Ng & Jordan, NIPS 2001]. Given finite data (n data points, d features): Naïve Bayes (generative) requires n = O(log d) to converge to its asymptotic error, whereas logistic regression (discriminative) requires n = O(d). Why? Independent class-conditional densities mean the parameter estimates are not coupled: each parameter is learnt independently, not jointly, from the training data. (slide by Aarti Singh & Barnabás Póczos) 8

9 Naïve Bayes vs. Logistic Regression: Verdict. Both learn a linear decision boundary. Naïve Bayes makes more restrictive assumptions and has higher asymptotic error, BUT converges faster to its (less accurate) asymptotic error. (slide by Aarti Singh & Barnabás Póczos) 9

10 Experimental Comparison (Ng-Jordan '01). UCI Machine Learning Repository: 15 datasets, 8 with continuous features, 7 with discrete features. (Figure: test error vs. training set size m on pima, adult, boston (predict if > median price), optdigits (0s and 1s; 2s and 3s), ionosphere, all continuous; one curve each for Naïve Bayes and Logistic Regression.) More in the paper. (slide by Aarti Singh & Barnabás Póczos) 10

11 What you should know. LR is a linear classifier: the decision rule is a hyperplane. LR is optimized by maximizing the conditional likelihood; there is no closed-form solution, but the objective is concave, so gradient ascent finds the global optimum. Gaussian Naïve Bayes with class-independent variances is representationally equivalent to LR; the solutions differ because of the objective (loss) function. In general, NB and LR make different assumptions: NB assumes features independent given the class (an assumption on P(X|Y)); LR assumes a functional form of P(Y|X) and makes no assumption on P(X|Y). Convergence rates: GNB (usually) needs less data; LR (usually) gets to better solutions in the limit. (slide by Aarti Singh & Barnabás Póczos) 11

12 Today: Linear Discriminant Functions (Two Classes; Multiple Classes; Fisher's Linear Discriminant); Perceptron. 12

13 Linear Discriminant Functions 13

14 Linear Discriminant Function. A linear discriminant function for a vector x: y(x) = w^T x + w_0, where w is called the weight vector and w_0 is a bias. The classification function is C(x) = sign(w^T x + w_0), where the step function sign(·) is defined as sign(a) = +1 if a > 0, and -1 if a < 0. (slide by Ce Liu) 14
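As a concrete illustration (not from the slides), a minimal numpy sketch of this discriminant and its classifier; the weights and the test point below are made-up values, just to exercise the functions.

```python
import numpy as np

def linear_discriminant(x, w, w0):
    """y(x) = w^T x + w0."""
    return w @ x + w0

def classify(x, w, w0):
    """C(x) = sign(w^T x + w0), with sign in {+1, -1}."""
    return 1 if linear_discriminant(x, w, w0) > 0 else -1

# Hypothetical weights and a test point.
w, w0 = np.array([2.0, -1.0]), 0.5
x = np.array([1.0, 3.0])
print(linear_discriminant(x, w, w0))  # -0.5
print(classify(x, w, w0))             # -1
```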

15 Properties of Linear Discriminant Functions. (Figure: decision boundary y = 0 in R^2, separating region R_1 where y > 0 from region R_2 where y < 0.) The decision surface, shown in red, is perpendicular to w, and its displacement from the origin is controlled by the bias parameter w_0. y(x) gives a signed measure of the perpendicular distance r of the point x from the decision surface: r = y(x)/||w||, and y(x) = 0 for x on the decision surface. The normal distance from the origin to the decision surface is w^T x/||w|| = -w_0/||w||, so w_0 determines the location of the decision surface. (slide by Ce Liu) 15

16 Properties of Linear Discriminant Functions. Let x = x⊥ + r·w/||w||, where x⊥ is the projection of x onto the decision surface. Then w^T x = w^T x⊥ + r·w^T w/||w||, so w^T x + w_0 = (w^T x⊥ + w_0) + r||w||, i.e. y(x) = r||w|| and hence r = y(x)/||w||. Simpler notation: define w̃ = (w_0, w) and x̃ = (1, x), so that y(x) = w̃^T x̃. (slide by Ce Liu) 16
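A short sketch of both facts in numpy (my illustration; the weights are the same made-up values as above): the signed distance formula and the augmented "tilde" notation that folds the bias into the weight vector.

```python
import numpy as np

def signed_distance(x, w, w0):
    """r = y(x) / ||w||: signed perpendicular distance of x from the surface."""
    return (w @ x + w0) / np.linalg.norm(w)

w, w0 = np.array([2.0, -1.0]), 0.5
x = np.array([1.0, 3.0])

# Augmented notation: w_tilde = (w0, w), x_tilde = (1, x).
w_tilde = np.concatenate(([w0], w))
x_tilde = np.concatenate(([1.0], x))
assert np.isclose(w_tilde @ x_tilde, w @ x + w0)
print(signed_distance(x, w, w0))
```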

17 Multiple Classes: Simple Extension. One-versus-the-rest classifier: classify C_k vs. samples not in C_k. One-versus-one classifier: classify every pair of classes. (Figure: both constructions leave ambiguous regions, marked "?", where the individual decisions conflict.) (slide by Ce Liu) 17

18 Multiple Classes: K-Class Discriminant. A single K-class discriminant comprising K linear functions y_k(x) = w_k^T x + w_{k0}. Decision function: C(x) = k if y_k(x) > y_j(x) for all j ≠ k. The decision boundary between class C_k and C_j is given by y_k(x) = y_j(x), i.e. (w_k - w_j)^T x + (w_{k0} - w_{j0}) = 0. (slide by Ce Liu) 18
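A quick sketch of the decision function (my illustration; the three weight vectors and biases are arbitrary examples):

```python
import numpy as np

def k_class_discriminant(x, W, w0):
    """C(x) = argmax_k y_k(x), with y_k(x) = w_k^T x + w_k0.
    W: (K, d) matrix of weight vectors; w0: (K,) vector of biases."""
    return int(np.argmax(W @ x + w0))

# Hypothetical 3-class problem in 2D.
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
w0 = np.array([0.0, 0.0, 0.5])
print(k_class_discriminant(np.array([2.0, 1.0]), W, w0))  # 0
```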

19 Property of the Decision Regions. Theorem: The decision regions of the K-class discriminant y_k(x) = w_k^T x + w_{k0} are singly connected and convex. Proof: Suppose two points x_A and x_B both lie inside decision region R_k. Any point x̂ on the line between x_A and x_B can be expressed as x̂ = λx_A + (1 - λ)x_B with 0 ≤ λ ≤ 1. By linearity, y_k(x̂) = λy_k(x_A) + (1 - λ)y_k(x_B) > λy_j(x_A) + (1 - λ)y_j(x_B) = y_j(x̂) for all j ≠ k. Therefore the region R_k is singly connected and convex. (slide by Ce Liu) 19

21 Property of the Decision Regions. Theorem, illustrated: if two points x_A and x_B both lie inside the same decision region R_k, then any point x̂ that lies on the line connecting these two points must also lie in R_k, and hence the decision region must be singly connected and convex. (Figure: regions R_i, R_k, R_j, with the segment from x_A to x_B contained in R_k.) (slide by Ce Liu) 21

22 Fisher's Linear Discriminant. Pursue the optimal linear projection y = w^T x on which the two classes can be maximally separated. The mean vectors of the two classes are m_1 = (1/N_1) Σ_{n∈C_1} x_n and m_2 = (1/N_2) Σ_{n∈C_2} x_n. A way to view a linear classification model is in terms of dimensionality reduction. (Figure: projecting onto the difference of means vs. onto the Fisher linear discriminant.) (slide by Ce Liu) 22

23 What's a Good Projection? After projection, the two classes should be separated as much as possible, measured by the distance between the projected centers: (w^T(m_1 - m_2))² = w^T(m_1 - m_2)(m_1 - m_2)^T w = w^T S_B w, where S_B = (m_1 - m_2)(m_1 - m_2)^T is called the between-class covariance matrix. After projection, the variances of the two classes should be as small as possible, measured by the within-class covariance w^T S_W w, where S_W = Σ_{n∈C_1}(x_n - m_1)(x_n - m_1)^T + Σ_{n∈C_2}(x_n - m_2)(x_n - m_2)^T. (slide by Ce Liu) 23

24 Fisher's Linear Discriminant. Fisher criterion: maximize w.r.t. w the ratio J(w) = (between-class variance)/(within-class variance) = w^T S_B w / w^T S_W w. Recall the quotient rule: for f(x) = g(x)/h(x), f'(x) = (g'(x)h(x) - g(x)h'(x)) / h²(x). Setting ∇J(w) = 0, we obtain (w^T S_B w) S_W w = (w^T S_W w) S_B w, i.e. (w^T S_B w) S_W w = (w^T S_W w)(m_2 - m_1)(m_2 - m_1)^T w. The terms w^T S_B w, w^T S_W w and (m_2 - m_1)^T w are scalars, and we only care about the direction, so the scalars can be dropped. Therefore w ∝ S_W^{-1}(m_2 - m_1). (slide by Ce Liu) 24
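A numpy sketch of this result (my illustration; the two Gaussian blobs are synthetic data, chosen only so the within-class scatter matters):

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher direction w ∝ S_W^{-1} (m2 - m1) for two classes,
    given as (N_i, d) arrays of samples."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m2 - m1)   # avoids forming the explicit inverse
    return w / np.linalg.norm(w)

# Toy data: two elongated Gaussian blobs.
rng = np.random.default_rng(0)
X1 = rng.normal([0.0, 0.0], [1.0, 3.0], size=(100, 2))
X2 = rng.normal([3.0, 1.0], [1.0, 3.0], size=(100, 2))
print(fisher_direction(X1, X2))
```

Note how the direction tilts away from the plain difference of means, because S_W penalizes projecting onto the high-variance axis.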

25 From Fisher's Linear Discriminant to Classifiers. Fisher's Linear Discriminant is not a classifier; it only decides on an optimal projection that converts a high-dimensional classification problem into a 1D one. A bias (threshold) is needed to form a linear classifier (multiple thresholds lead to nonlinear classifiers). The final classifier has the form y(x) = sign(w^T x + w_0), where the nonlinear activation function sign(·) is a step function: sign(a) = +1 if a > 0, and -1 if a < 0. How to decide the bias w_0? (slide by Ce Liu) 25

26 1D Linear Classifier. Theorem: Given a set of labeled samples {x_n, t_n}, n = 1, ..., N, where x_n ∈ R and t_n ∈ {-1, 1}. Without loss of generality, assume the x_n are sorted, namely x_n ≤ x_{n+1} for all n < N. Define the accumulative function F(x) = Σ_{n=1}^N t_n · 1(x_n ≤ x). The optimal classifier y(x) = sign(x - w_0) (supposing the sign is correct) that minimizes the training error, i.e. maximizes the agreement Σ_{n=1}^N y(x_n) t_n, is decided by w_0 = arg max_x F(x). (slide by Ce Liu) 26
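A brute-force sketch of this construction (my illustration; the function name and the tiny dataset are made up): instead of manipulating F directly, it scores every candidate threshold by its training error, which yields the same optimal w_0.

```python
import numpy as np

def best_threshold_1d(x, t):
    """Pick w0 for y(x) = sign(x - w0) by minimizing training error.
    x: (N,) 1D inputs; t: (N,) labels in {-1, +1}."""
    order = np.argsort(x)
    x, t = x[order], t[order]
    # Candidate thresholds: midpoints between consecutive points, plus the ends.
    candidates = np.concatenate(
        ([x[0] - 1.0], (x[:-1] + x[1:]) / 2, [x[-1] + 1.0]))
    errors = [np.sum(np.where(x > c, 1, -1) != t) for c in candidates]
    return candidates[int(np.argmin(errors))]

x = np.array([0.1, 0.4, 0.35, 0.8, 0.9])
t = np.array([-1, -1, 1, 1, 1])
print(best_threshold_1d(x, t))  # a threshold misclassifying only one point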

27 Perceptron 27

28 Early theories of the brain

29 Basic Idea: Biology and Learning. - Good behavior should be rewarded, bad behavior punished (or not rewarded); this improves system fitness. - Killing a saber-toothed tiger should be rewarded... - Correlated events should be combined: Pavlov's salivating dog. Training mechanisms: - Behavioral modification of individuals (learning): successful behavior is rewarded (e.g. food). - Hard-coded behavior in the genes (instinct): the wrongly coded animal does not reproduce. 29

30 Neurons. Soma (CPU): cell body, combines signals. Dendrite (input bus): combines the inputs from several other nerve cells. Synapse (interface): interface and parameter store between neurons. Axon (cable): may be up to 1 m long and transports the activation signal to neurons at different locations. 30

31 Neurons. (Figure: inputs x_1, x_2, x_3, ..., x_n with synaptic weights w_1, ..., w_n feeding a single output.) f(x) = Σ_i w_i x_i = ⟨w, x⟩. 31

32 Perceptron. Weighted linear combination, nonlinear decision function, linear offset (bias): f(x) = σ(⟨w, x⟩ + b). (Figure: inputs x_1, ..., x_n with synaptic weights w_1, ..., w_n.) Linear separating hyperplanes (spam/ham, novel/typical, click/no click). Learning: estimating the parameters w and b. 32

33 Perceptron. (Figure: a linear separator between Ham and Spam examples.) 33

34 Perceptron: Rosenblatt, Widrow.

35 The Perceptron. Algorithm: initialize w = 0 and b = 0; repeat: if y_i[⟨w, x_i⟩ + b] ≤ 0 then w ← w + y_i x_i and b ← b + y_i; until all points are classified correctly. Nothing happens if a point is classified correctly. The weight vector is a linear combination w = Σ_{i∈I} y_i x_i, so the classifier is a linear combination of inner products: f(x) = Σ_{i∈I} y_i ⟨x_i, x⟩ + b. 35
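A direct numpy transcription of this algorithm (a sketch; the four training points are made up and linearly separable):

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Rosenblatt perceptron: update on every mistake until all correct."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:   # misclassified (or on the boundary)
                w, b = w + yi * xi, b + yi
                mistakes += 1
        if mistakes == 0:                # converged: all classified correctly
            return w, b
    return w, b                          # data may not be separable

X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron(X, y)
print(w, b, np.sign(X @ w + b))          # predictions match y
```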

36 Convergence Theorem. If there exists some (w*, b*), with w* of unit length, and a margin ρ > 0 such that y_i[⟨x_i, w*⟩ + b*] ≥ ρ for all i, then the perceptron converges to a linear separator after a number of steps bounded by ((b*)² + 1)(r² + 1)/ρ², where ‖x_i‖ ≤ r. The bound is dimensionality independent, order independent (i.e. also worst case), and scales with the difficulty of the problem. 36

37 Proof, Starting Point. We start from w_1 = 0 and b_1 = 0. Step 1: bound on the increase of alignment. Denote by w_j the value of w at step j (analogously b_j), and define the alignment ⟨(w_j, b_j), (w*, b*)⟩. For an error on observation (x_i, y_i) we get ⟨(w_{j+1}, b_{j+1}), (w*, b*)⟩ = ⟨[(w_j, b_j) + y_i(x_i, 1)], (w*, b*)⟩ = ⟨(w_j, b_j), (w*, b*)⟩ + y_i⟨(x_i, 1), (w*, b*)⟩ ≥ ⟨(w_j, b_j), (w*, b*)⟩ + ρ ≥ jρ. The alignment increases with the number of errors. 37

38 Proof. Step 2: Cauchy-Schwarz for the dot product: ⟨(w_{j+1}, b_{j+1}), (w*, b*)⟩ ≤ ‖(w_{j+1}, b_{j+1})‖ · ‖(w*, b*)‖ = ‖(w_{j+1}, b_{j+1})‖ · √(1 + (b*)²). Step 3: upper bound on ‖(w_j, b_j)‖. If we make a mistake, ‖(w_{j+1}, b_{j+1})‖² = ‖(w_j, b_j) + y_i(x_i, 1)‖² = ‖(w_j, b_j)‖² + 2y_i⟨(x_i, 1), (w_j, b_j)⟩ + ‖(x_i, 1)‖² ≤ ‖(w_j, b_j)‖² + ‖(x_i, 1)‖² ≤ j(r² + 1). Step 4: combining the first three steps, jρ ≤ √(1 + (b*)²) · ‖(w_{j+1}, b_{j+1})‖ ≤ √(j(r² + 1)((b*)² + 1)). Solving for j proves the theorem. 38

39 Consequences. We only need to store the errors; this gives a compression bound for the perceptron. The perceptron performs stochastic gradient descent on the hinge loss l(x_i, y_i, w, b) = max(0, 1 - y_i[⟨w, x_i⟩ + b]). It fails with noisy data: do NOT train your avatar with perceptrons (Black & White). 39
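To make the SGD connection concrete, a one-step sketch (my illustration, not from the slides) of a subgradient step on the hinge loss above:

```python
import numpy as np

def hinge_sgd_step(w, b, x, y, lr=1.0):
    """One SGD step on l = max(0, 1 - y(<w, x> + b)).
    The subgradient w.r.t. (w, b) is (-y*x, -y) when the margin is
    violated, else 0. With lr = 1 and the threshold at 0 instead of 1,
    this is exactly the perceptron update."""
    if y * (w @ x + b) < 1.0:            # margin violated
        w, b = w + lr * y * x, b + lr * y
    return w, b

w, b = hinge_sgd_step(np.zeros(2), 0.0, np.array([1.0, 2.0]), +1)
print(w, b)  # [1. 2.] 1.0
```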

40 Hardness: margin vs. size. (Figure: a small margin makes the problem hard; a large margin makes it easy.) 40

53 Concepts & version space. Realizable concepts: some function exists that can separate the data and is included in the concept space; for the perceptron, this means the data is linearly separable. Unrealizable concept: the data is not separable, or we don't have a suitable function class (the two cases are often hard to distinguish). 53

54 Minimum error separation. XOR is not linearly separable, yet nonlinear separation is trivial. Caveat (Minsky & Papert): finding the minimum-error linear separator is NP-hard (this killed neural networks in the 70s). 54

55 Nonlinear Features. Regression: we got nonlinear functions by preprocessing. Perceptron: map the data into a feature space x → φ(x) and solve the problem in that space; in the code, replace ⟨x, x'⟩ by ⟨φ(x), φ(x')⟩. Feature Perceptron: the solution lies in the span of the φ(x_i). 55

56 Quadratic Features. The separating surfaces are circles, hyperbolae, parabolae. 56

57 Constructing Features (a very naive OCR system). Construct features manually, e.g. for OCR. 57

58 Feature Engineering for Spam Filtering. (Slide shows a raw email with its full delivery headers: Delivered-To, Received chains, Return-Path, SPF and DKIM authentication results, MIME headers, sender, recipient, subject, etc.) Useful features: bag of words, pairs of words, date & time, recipient path, IP number, sender, encoding, links, ... secret sauce ... 58

59 More feature engineering. Two interlocking spirals: transform the data into a radial and an angular part, (x_1, x_2) = (r sin θ, r cos θ); see the sketch below. Handwritten Japanese character recognition: break the images down into strokes and recognize those; lookup based on stroke order. Medical diagnosis: physician's comments, blood status / ECG / height / weight / temperature..., medical knowledge. Preprocessing: zero mean, unit variance to fix scale issues (e.g. weight vs. income); probability integral transform (inverse CDF) as an alternative. 59
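For the spiral example, a tiny sketch (my illustration; the function name is made up) of recovering the radial and angular parts from (x_1, x_2) = (r sin θ, r cos θ), which is the map you would actually apply to the data:

```python
import numpy as np

def to_polar(x1, x2):
    """Recover (r, theta) from (x1, x2) = (r sin(theta), r cos(theta))."""
    r = np.hypot(x1, x2)
    theta = np.arctan2(x1, x2)   # argument order matches x1 = r sin(theta)
    return r, theta

print(to_polar(0.0, 2.0))  # (2.0, 0.0)
```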

60 The Perceptron on features. Algorithm: initialize w = 0, b = 0; repeat: pick (x_i, y_i) from the data; if y_i(⟨w, φ(x_i)⟩ + b) ≤ 0 then w ← w + y_i φ(x_i) and b ← b + y_i; until y_i(⟨w, φ(x_i)⟩ + b) > 0 for all i. Nothing happens if a point is classified correctly. The weight vector is a linear combination w = Σ_{i∈I} y_i φ(x_i), so the classifier is a linear combination of inner products: f(x) = Σ_{i∈I} y_i ⟨φ(x_i), φ(x)⟩ + b. 60
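A sketch of the same loop with an explicit feature map φ (my illustration; φ here is the XOR map from the next slide, standing in for any feature map):

```python
import numpy as np

def phi(x):
    """Feature map (x1, x2) -> (x1, x2, x1*x2), as used for XOR below."""
    return np.array([x[0], x[1], x[0] * x[1]])

def feature_perceptron(X, y, phi, max_epochs=1000):
    """Perceptron in feature space: the identical loop, run on phi(x)."""
    w, b = np.zeros(phi(X[0]).shape[0]), 0.0
    for _ in range(max_epochs):
        clean_pass = True
        for xi, yi in zip(X, y):
            if yi * (w @ phi(xi) + b) <= 0:   # mistake: update in feature space
                w, b = w + yi * phi(xi), b + yi
                clean_pass = False
        if clean_pass:
            break
    return w, b

# XOR, which the plain perceptron cannot solve.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1, 1, 1, -1])
w, b = feature_perceptron(X, y, phi)
print(np.sign([w @ phi(x) + b for x in X]))   # [-1, 1, 1, -1]
```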

61 Problems. Problems: need a domain expert (e.g. Chinese OCR); often expensive to compute; difficult to transfer the engineering knowledge. Shotgun solution: compute many features, hope that the set contains good ones, and do it efficiently. 61

62 Solving XOR. Map (x_1, x_2) → (x_1, x_2, x_1 x_2). XOR is not linearly separable in the original coordinates; mapping into 3 dimensions makes it easily solvable. 62
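To see the claim concretely (my illustration; the hyperplane below is hand-picked, one of many that work):

```python
import numpy as np

# XOR data: not linearly separable in (x1, x2)...
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1, 1, 1, -1])

# ...but adding the product feature x1*x2 makes it separable in 3D.
Phi = np.column_stack([X, X[:, 0] * X[:, 1]])
w, b = np.array([1.0, 1.0, -2.0]), -0.5   # a hand-picked separating hyperplane
print(np.sign(Phi @ w + b))               # matches y: [-1, 1, 1, -1]
```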
