IAML: Logistic Regression
Charles Sutton and Victor Lavrenko
School of Informatics
Semester

Outline
Logistic function
Logistic regression
Learning logistic regression
Optimization
The power of non-linear basis functions
Least-squares classification
Generative and discriminative models
Relationships to Generative Models
Multiclass classification
Reading: W & F 4.6 (but pairwise classification, perceptron learning rule, Winnow are not required)

Decision Boundaries
In this class we will discuss linear classifiers.
For each class, there is a region of feature space in which the classifier predicts that class.
The decision boundary is the boundary of this region (i.e., where the two classes are tied).
In linear classifiers the decision boundary is a line.

Example Data
Linear Classifiers
In a two-class linear classifier, we learn a function F(x, w) = w^T x + w_0 that represents how aligned the instance x is with y = 1.
w are parameters of the classifier that we learn from data.
To do prediction on an input x: predict y = 1 if F(x, w) > 0.

A Geometric View
[Figure: a linear decision boundary in two dimensions]

Explanation of Geometric View
The decision boundary in the previous case is {x : w^T x + w_0 = 0}.
w is a normal vector to this surface. (Remember how lines can be written in terms of their normal vector.)
Notice that in more than 2 dimensions, this boundary will be a hyperplane.

Two Class Discrimination
For now consider a two class case: y ∈ {0, 1}.
From now on we'll write x = (1, x_1, x_2, ..., x_d) and w = (w_0, w_1, ..., w_d).
We will want a linear, probabilistic model. We could try P(y = 1 | x) = w^T x. But this is a bad idea: w^T x can be any real number, while a probability must lie between 0 and 1.
Instead what we will do is P(y = 1 | x) = f(w^T x).
f must be between 0 and 1. It will squash the real line into [0, 1].
Furthermore, the fact that probabilities sum to one means P(y = 0 | x) = 1 − f(w^T x).
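To make the prediction rule concrete, here is a minimal Python/NumPy sketch; the weight values are hypothetical, chosen only to illustrate the rule "predict y = 1 when w^T x + w_0 > 0".

```python
import numpy as np

def predict(x, w, w0):
    """Two-class linear rule: predict y = 1 when F(x, w) = w^T x + w_0 > 0."""
    return 1 if np.dot(w, x) + w0 > 0 else 0

# Hypothetical weights for a 2-D input; the decision boundary is the line
# w^T x + w_0 = 0, and w is a normal vector to that line.
w = np.array([2.0, -1.0])
w0 = 0.5

print(predict(np.array([1.0, 1.0]), w, w0))   # 2 - 1 + 0.5 = 1.5 > 0  -> 1
print(predict(np.array([-1.0, 1.0]), w, w0))  # -2 - 1 + 0.5 = -2.5    -> 0
```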
The logistic function
We need a function that returns probabilities (i.e. stays between 0 and 1).
The logistic function provides this: f(z) = σ(z) = 1 / (1 + exp(−z)).
As z goes from −∞ to ∞, f goes from 0 to 1: a squashing function.
It has a sigmoid shape (i.e. S-like shape).
[Figure: plot of σ(z) for z between −6 and 6, rising from 0 to 1]

Logistic regression
Linear weights + logistic squashing function == logistic regression.
We model the class probabilities as
p(y = 1 | x) = σ(∑_{j=0}^{D} w_j x_j) = σ(w^T x)
σ(z) = 0.5 when z = 0. Hence the decision boundary is given by w^T x + w_0 = 0.
The decision boundary is an (M − 1)-dimensional hyperplane for an M-dimensional problem.

Logistic regression
For this slide write w = (w_1, w_2, ..., w_d) (i.e., exclude the bias w_0).
The bias parameter w_0 shifts the position of the hyperplane, but does not alter the angle.
The direction of the vector w affects the angle of the hyperplane. The hyperplane is perpendicular to w.
The magnitude of the vector w affects how certain the classifications are.
For small w, most of the probabilities within a region of the decision boundary will be near to 0.5.
For large w, probabilities in the same region will be close to 0 or 1.

Learning Logistic Regression
We want to set the parameters w using training data. As before:
Write out the model and hence the likelihood.
Find the derivatives of the log likelihood w.r.t. the parameters.
Adjust the parameters to maximize the log likelihood.
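Before turning to learning, here is a minimal Python/NumPy sketch of the squashing function and the class probability p(y = 1 | x) = σ(w^T x); the input and weight values are made up, chosen only to show how scaling w pushes probabilities away from 0.5.

```python
import numpy as np

def sigmoid(z):
    """Logistic squashing function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def p_class1(x, w):
    """p(y = 1 | x) = sigma(w^T x), where x_0 = 1 carries the bias w_0."""
    return sigmoid(np.dot(w, x))

x = np.array([1.0, 0.5, -0.2])        # made-up input with a leading 1 for the bias
w_small = np.array([0.0, 1.0, 1.0])   # small magnitude: probability stays near 0.5
w_large = 10.0 * w_small              # same direction, larger magnitude: near 0 or 1

print(p_class1(x, w_small))   # ~0.57
print(p_class1(x, w_large))   # ~0.95
```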
Assume the data are independent and identically distributed.
Call the data set D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}.
The likelihood is
p(D | w) = ∏_{i=1}^{n} p(y = y_i | x_i, w)
         = ∏_{i=1}^{n} p(y = 1 | x_i, w)^{y_i} (1 − p(y = 1 | x_i, w))^{1 − y_i}
Hence the log likelihood L(w) = log p(D | w) is given by
L(w) = ∑_{i=1}^{n} y_i log σ(w^T x_i) + (1 − y_i) log(1 − σ(w^T x_i))

It turns out that the likelihood has a unique optimum (given sufficient training examples). It is convex.
How to maximize? Take the gradient
∂L/∂w_j = ∑_{i=1}^{n} (y_i − σ(w^T x_i)) x_{ij}
(Aside: something similar holds for linear regression,
∂E/∂w_j = ∑_{i=1}^{n} (w^T φ(x_i) − y_i) φ_j(x_i)
where E is the squared error.)
Unfortunately, you cannot maximize L(w) explicitly as for linear regression. You need to use a numerical method (see next lecture).

Geometric Intuition of the Gradient
Let's say there's only one training point, D = {(x, y)}. Then
∂L/∂w_j = (y − σ(w^T x)) x_j
Also assume y = 1. (It will be symmetric for y = 0.)
Note that (y − σ(w^T x)) is then always positive, because 0 < σ(z) < 1 for all z.
There are three cases:
If x is classified as the right answer with high confidence, e.g., σ(w^T x) = 0.99
If x is classified wrong, e.g., σ(w^T x) = 0.2
If x is classified correctly, but just barely, e.g., σ(w^T x) = 0.6

Geometric Intuition of the Gradient
∂L/∂w_j = (y − σ(w^T x)) x_j
Remember: the gradient is the direction of steepest increase. We want to maximize, so let's nudge the parameters in the direction ∇L.
If σ(w^T x) is correct, e.g., 0.99: then (y − σ(w^T x)) is nearly 0, so we don't change w_j.
If σ(w^T x) is wrong, e.g., 0.2: this means w^T x is negative, but it should be positive. The gradient has the same sign as x_j, so if we nudge w_j, it will tend to increase if x_j > 0 or decrease if x_j < 0. Either way w^T x goes up!
If σ(w^T x) is just barely correct, e.g., 0.6: the same thing happens as when we were wrong, just more slowly.
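The log likelihood and its gradient above translate almost directly into code. Below is a minimal Python/NumPy sketch, assuming a toy data matrix X whose first column is all ones (so the first weight plays the role of the bias w_0); plain gradient ascent stands in here for the numerical optimisers the next lecture covers.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(w, X, y):
    """L(w) = sum_i y_i log sigma(w^T x_i) + (1 - y_i) log(1 - sigma(w^T x_i))."""
    p = sigmoid(X @ w)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(w, X, y):
    """dL/dw_j = sum_i (y_i - sigma(w^T x_i)) x_ij, written as one matrix product."""
    return X.T @ (y - sigmoid(X @ w))

# Toy data: the first column of X is all ones, so w[0] plays the role of w_0.
X = np.array([[1.0, 0.5], [1.0, 2.0], [1.0, -1.0], [1.0, -2.5]])
y = np.array([1.0, 1.0, 0.0, 0.0])

# Plain gradient ascent on L(w); a real implementation would hand L and its
# gradient to a numerical optimiser (conjugate gradient, BFGS, ...).
w = np.zeros(X.shape[1])
for _ in range(500):
    w = w + 0.1 * gradient(w, X, y)

print(w, log_likelihood(w, X, y))
```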
Fitting this into the general structure for learning algorithms:
Define the task: classification, discriminative
Decide on the model structure: logistic regression model
Decide on the score function: log likelihood
Decide on the optimization/search method to optimize the score function: numerical optimization routine. Note we have several choices here (stochastic gradient descent, conjugate gradient, BFGS).

XOR and Linear Separability
A problem is linearly separable if we can find weights so that
w^T x + w_0 > 0 for all positive cases (where y = 1), and
w^T x + w_0 ≤ 0 for all negative cases (where y = 0).

XOR, a failure for the perceptron
[Figure: the XOR problem, which no linear boundary separates]
XOR can be solved by a perceptron using a nonlinear transformation φ(x) of the input; can you find one? (One possibility is sketched below.)

The power of non-linear basis functions
As for linear regression, we can transform the input space if we want: x → φ(x).
Using two Gaussian basis functions φ_1(x) and φ_2(x).
[Figure: data that is not linearly separable in x becomes separable in the feature space (φ_1, φ_2). Figure credit: Chris Bishop, PRML]

Generative and Discriminative Models
Notice that we have done something very different here than with naive Bayes.
Naive Bayes: modelled how a class generated the feature vector, p(x | y). Then could classify using p(y | x) ∝ p(x | y) p(y). This is called a generative approach.
Logistic regression: model p(y | x) directly. This is a discriminative approach.
Discriminative advantage: why spend effort modelling p(x)? It seems a waste; we are always given it as input.
Generative advantage: can be good with missing data (remember how naive Bayes handles missing data). Also good for detecting outliers. Or, sometimes you really do want to generate the input.
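Returning to the XOR question above: one nonlinear transformation that works is to append the product x_1 x_2 as an extra feature. The sketch below (Python/NumPy) uses hand-picked, purely illustrative weights rather than learned ones.

```python
import numpy as np

# The four XOR cases: no single line w^T x + w_0 = 0 separates them.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0, 1, 1, 0])

def phi(x):
    """One nonlinear transformation that works: append the product x_1 * x_2."""
    return np.array([x[0], x[1], x[0] * x[1]])

# Hand-picked (not learned) weights: in the 3-D feature space a single
# hyperplane now separates the two classes.
w = np.array([1.0, 1.0, -2.0])
w0 = -0.5

for x, target in zip(X, y):
    pred = 1 if np.dot(w, phi(x)) + w0 > 0 else 0
    print(x, target, pred)   # the predictions match the XOR labels
```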
Generative Classifiers can be Linear Too
Two scenarios where naive Bayes gives you a linear classifier:
1. Gaussian data with equal covariance. If p(x | y = 0) ∼ N(µ_1, Σ) and p(x | y = 1) ∼ N(µ_2, Σ), then p(y = 1 | x) = σ(w^T x + w_0) for some (w_0, w) that depends on µ_1, µ_2, Σ and the class priors.
2. Binary data. Let each component x_j be a Bernoulli variable, i.e. x_j ∈ {0, 1}. Then a Naïve Bayes classifier has the form p(y = 1 | x) = σ(w^T x + w_0).
3. Exercise for keeners: prove these two results.

Multiclass classification
Create a different weight vector w_k for each class.
Then use the softmax function
p(y = k | x) = exp(w_k^T x) / ∑_{j=1}^{C} exp(w_j^T x)
Note that p(y = k | x) ≥ 0 and ∑_{j=1}^{C} p(y = j | x) = 1.
This is the natural generalization of logistic regression to more than 2 classes. (A short sketch of the softmax computation appears at the end of these notes.)

Least-squares classification
Logistic regression is more complicated algorithmically than linear regression. Why not just use linear regression with 0/1 targets?
[Figure: green, logistic regression; magenta, least-squares regression. Figure credit: Chris Bishop, PRML]

Summary
The logistic function, logistic regression
Hyperplane decision boundary
The perceptron, linear separability
We still need to know how to compute the maximum of the log likelihood. Coming soon!
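As promised above, here is a minimal Python/NumPy sketch of the softmax computation for multiclass logistic regression; the weight matrix and input are made-up values used only to show that the outputs are non-negative and sum to 1.

```python
import numpy as np

def softmax_probs(x, W):
    """p(y = k | x) = exp(w_k^T x) / sum_j exp(w_j^T x); row k of W is w_k."""
    scores = W @ x
    scores = scores - scores.max()   # subtract the max for numerical stability
    e = np.exp(scores)
    return e / e.sum()

# Hypothetical weight vectors for a 3-class problem; x has a leading 1 so that
# each class gets its own bias term.
W = np.array([[ 0.1,  1.0, -0.5],
              [ 0.0, -0.3,  0.8],
              [-0.2,  0.4,  0.1]])
x = np.array([1.0, 2.0, -1.0])

p = softmax_probs(x, W)
print(p, p.sum())   # three non-negative probabilities summing to 1
```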