FMA901F: Machine Learning Lecture 4: Linear Models for Classification. Cristian Sminchisescu


Linear Classification. Classification is intrinsically non-linear because of the training constraints that place non-identical inputs in the same class: differences in the input vector sometimes cause no change in the answer. Linear classification means that the adaptive part is linear, y(x) = w^T x + w_0 (adaptive linear parameters w, w_0). The adaptive part is cascaded with a fixed non-linearity for the decision, c = f(y(x)). It may also be preceded by a fixed non-linearity when non-linear basis functions are used.

Approach 1: Discriminant Functions. Use discriminant functions directly, and do not compute probabilities. Convert the input vector into one or more real values so that a simple process (thresholding, or a majority vote) can be applied to assign the input to a class. The real values should be chosen to maximize the useable information about the class label present in the real value. Given discriminant functions y_1(x), ..., y_K(x), classify x as class C_k iff y_k(x) > y_j(x) for all j ≠ k.

Approach 2: Class-conditional Probabilities. Infer conditional class probabilities p(C_k | x). Use the conditional distribution to make optimal decisions, e.g. by minimizing some loss function. Example, 2 classes: p(C_1 | x) = σ(w^T x + w_0), where σ(a) = 1 / (1 + exp(-a)).

Approach 3: Class Generative Model. Compare the probability of the input under separate, class-specific, generative models. Model both the class-conditional densities p(x | C_k) and the prior class probabilities p(C_k). Compute the posterior using Bayes' theorem: p(C_k | x) = p(x | C_k) p(C_k) / p(x), where p(x | C_k) is the class-conditional density, p(C_k) the class prior, and p(C_k | x) the posterior for class k. Example: fit a multivariate Gaussian to the input vectors corresponding to each class, model class prior probabilities by training-data frequency counts, and see which Gaussian makes a test data vector most probable using Bayes' theorem.
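
As a concrete illustration of Approach 3, here is a minimal sketch (not from the slides; all names and the synthetic data are illustrative) that fits one multivariate Gaussian per class, estimates priors from class frequency counts, and computes the posterior with Bayes' theorem:

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian_generative(X, y):
    """Fit one Gaussian per class plus frequency-count priors."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0),            # class mean
                     np.cov(Xc, rowvar=False),   # class covariance
                     len(Xc) / len(X))           # prior p(C_c) from frequency counts
    return params

def posterior(x, params):
    """p(C_k | x) via Bayes' theorem: likelihood * prior, normalized over classes."""
    joint = {c: multivariate_normal.pdf(x, mean=m, cov=S) * p
             for c, (m, S, p) in params.items()}
    Z = sum(joint.values())
    return {c: v / Z for c, v in joint.items()}

# tiny synthetic example
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1, (50, 2)), rng.normal([3, 3], 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(posterior([2.5, 2.5], fit_gaussian_generative(X, y)))
```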

Different Types of Plots in the Course. Weight space: each axis corresponds to a weight; a point is a weight vector; dimensionality = #weights (+1 extra dimension for the loss). Data space: each axis corresponds to an input value; a point is a data vector; a decision surface is a plane; dimensionality = dimensionality of a data vector. Case space (used for the geometric interpretation of least squares, L3): each axis corresponds to a training case; dimensionality = #training cases.

2-class case: the decision surface in data space for the linear discriminant function y(x) = w^T x + w_0 is given by y(x) = 0. w is orthogonal to any vector which lies on the decision surface, and so controls the orientation of the decision surface; w_0 controls its displacement from the origin (the perpendicular distance is -w_0 / ||w||).

Represent Target Values: Binary vs. Multiclass. Two classes (N=2): typically use a single real-valued output that has target values of 1 for the positive class and 0 (or -1) for the negative class. For probabilistic class labels, the target can be the probability of the positive class and the output of the model can be the probability the model assigns to the positive class. For the multiclass case (N>2), we use a vector of N target values containing a single 1 for the correct class and zeros elsewhere. For probabilistic labels we can then use a vector of class probabilities as the target vector.

Discriminant Functions for Multiclass. One possibility is to use binary 1-vs-all discriminants: each function separates one class from the rest. Another possibility is to use binary 1-vs-1 discriminants: each function discriminates between two specific classes, and we have one discriminant for each class pair. Both methods have ambiguities.

Problems with Multi-class Discriminant Functions Constructed from Binary Classifiers (1 vs. all, 1 vs. 1). If we base our decision on binary classifiers, we can encounter ambiguous regions.

Simple Solution. Use discriminant functions y_1(x), ..., y_K(x) and take the max over their responses. Consider linear discriminants y_k(x) = w_k^T x + w_k0. The decision boundary between class k and class j is given by the hyperplane (w_k - w_j)^T x + (w_k0 - w_j0) = 0. In this linear case the decision regions are convex: take x_A, x_B in region R_k and a point x̂ = λ x_A + (1-λ) x_B with 0 ≤ λ ≤ 1. From the linearity of y_k, y_k(x̂) = λ y_k(x_A) + (1-λ) y_k(x_B). But y_k(x_A) > y_j(x_A) and y_k(x_B) > y_j(x_B), hence y_k(x̂) > y_j(x̂), so x̂ also lies inside R_k, and R_k is convex.

Least Squares for Classification. This is not necessarily the right approach in principle, and it does not work as well as more advanced methods, but it is simple. It reduces classification to least squares regression, and we already know how to do regression: we can solve for the optimal weights using the normal equations (L3). We set the target to be the conditional probability of the class given the input. When there are more than two classes, we treat each class as a separate problem. The justification for using least squares is that it approximates the conditional expectation; for the binary coding scheme, this expectation is given by the vector of posterior probabilities. Unfortunately these are approximated rather poorly (e.g. values outside the range (0,1)), due to the limited flexibility of the model.

Least Squares Classification. Assume each class has its own linear model: y_k(x) = w_k^T x + w_k0. Then we can write y(x) = W̃^T x̃, where the kth column of W̃ is the (D+1)-dim vector w̃_k = (w_k0, w_k^T)^T and x̃ = (1, x^T)^T is the augmented input. Given training data {x_n, t_n}, n = 1, ..., N, the nth row of T is t_n^T and the nth row of X̃ is x̃_n^T. The sum-of-squares error function for classification is E_D(W̃) = ½ Tr{ (X̃ W̃ - T)^T (X̃ W̃ - T) }. Setting the derivative to zero gives the closed-form solution W̃ = (X̃^T X̃)^{-1} X̃^T T = X̃† T, where X̃† is the pseudoinverse of X̃. Property: if every target vector in the training set satisfies some linear constraint a^T t_n + b = 0 for some constants a, b, then the model prediction for any value of x satisfies the same constraint, a^T y(x) + b = 0.
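
A minimal NumPy sketch of the closed-form least-squares classifier described above, using one-hot targets and the pseudoinverse of the augmented design matrix; the function and variable names are illustrative, not from the lecture:

```python
import numpy as np

def least_squares_classifier(X, y, num_classes):
    """W_tilde = pinv(X_tilde) @ T, with 1-of-K (one-hot) target matrix T."""
    X_tilde = np.hstack([np.ones((len(X), 1)), X])   # prepend a bias column
    T = np.eye(num_classes)[y]                        # one-hot coding of integer labels
    W_tilde = np.linalg.pinv(X_tilde) @ T             # closed-form least-squares solution
    return W_tilde

def predict(W_tilde, X):
    X_tilde = np.hstack([np.ones((len(X), 1)), X])
    return np.argmax(X_tilde @ W_tilde, axis=1)       # assign to the largest output
```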

Problems with using least squares for classification. [Figure: decision boundaries of logistic regression vs. least squares regression.] Least squares solutions lack robustness to outliers: if the right answer is 1 and the model says 1.5, it loses, so it changes the boundary to avoid being "too correct".

For non-Gaussian targets, least squares regression gives poor decision surfaces. [Figure: least squares vs. logistic regression.] Remember that least squares corresponds to maximum likelihood under a Gaussian conditional distribution; clearly the binary target vectors have a distribution that is far from Gaussian.

Fisher's Linear Discriminant. We can view classification in terms of dimensionality reduction. A simple linear discriminant function is a projection of the D-dimensional data down to 1 dimension. Project: y = w^T x; classify: if y ≥ -w_0 then class C_1, else class C_2. However, projection results in loss of information: classes well separated in the original input space may strongly overlap in 1-d. We will adjust the projection weight vector w to achieve the best separation among classes. But what do we mean by best separation?

Fisher's View of Class Separation I. The simplest measure of class separation, when projected onto w, is the separation of the projected class means. This suggests choosing w so as to maximize m_2 - m_1 = w^T(m_2 - m_1), where m_k is the projected mean of class k. This can be made arbitrarily large by increasing ||w||. We could handle this by imposing a unit-norm constraint using Lagrange multipliers: max_w w^T(m_2 - m_1), s.th. w^T w = 1. However, still, if the main direction of variance in each class is not orthogonal to the direction between the means, we will not get good separation (see next slide).

Advantage of using Fisher's Criterion. When projected onto the line joining the class means, the classes are not well separated. Fisher chooses a direction that makes the projected classes much tighter, even though their projected means are less far apart.

Fisher's View of Class Separation II. Fisher: maximize a function that gives a large separation between the projected class means, while also giving a small variance within each class, thereby minimizing class overlap. Choose the direction maximizing the ratio of between-class variance to within-class variance. This is the direction in which the projected points contain the most information about class membership (under Gaussian assumptions).

Fisher's Linear Discriminant. We seek a linear transformation that is best for discrimination: y = w^T x. The projection onto the vector separating the class means seems right: w ∝ m_2 - m_1. But we also want small variance within each class: s_k² = Σ_{n ∈ C_k} (y_n - m_k)². Fisher's objective function: J(w) = (m_2 - m_1)² / (s_1² + s_2²), i.e. between-class separation over within-class variance.

Fisher's Linear Discriminant: Derivations. Writing the objective in terms of w, J(w) = (m_2 - m_1)² / (s_1² + s_2²) = (w^T S_B w) / (w^T S_W w), where the between-class scatter is S_B = (m_2 - m_1)(m_2 - m_1)^T and the within-class scatter is S_W = Σ_{n ∈ C_1} (x_n - m_1)(x_n - m_1)^T + Σ_{n ∈ C_2} (x_n - m_2)(x_n - m_2)^T. Setting the derivative of J to zero, and noting that w^T S_B w and w^T S_W w are scalars, the optimal solution is w ∝ S_W^{-1}(m_2 - m_1). The above result is known as Fisher's linear discriminant. Strictly it is not a discriminant, but rather a direction of projection that can be used for classification in conjunction with a decision (e.g. thresholding) operation.
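
A short NumPy sketch (illustrative, two-class case) of the closed-form Fisher direction w ∝ S_W^{-1}(m_2 - m_1), together with a simple midpoint threshold for the decision step; the midpoint rule is an assumption, not part of the derivation above:

```python
import numpy as np

def fisher_direction(X1, X2):
    """Two-class Fisher direction: w proportional to S_W^{-1} (m2 - m1)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # within-class scatter: summed (not averaged) scatter of each class
    S_W = ((X1 - m1).T @ (X1 - m1)) + ((X2 - m2).T @ (X2 - m2))
    w = np.linalg.solve(S_W, m2 - m1)
    w /= np.linalg.norm(w)                    # scale is irrelevant for the direction
    threshold = 0.5 * (w @ m1 + w @ m2)       # simple midpoint decision threshold
    return w, threshold
```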

Fisher's Linear Discriminant: Computation. However, the objective is invariant to rescaling of w, so we can choose the denominator to be unity, w^T S_W w = 1. We can then minimize -½ w^T S_B w subject to this constraint. This corresponds to the primal Lagrangian L_P = -½ w^T S_B w + ½ λ (w^T S_W w - 1). From the KKT conditions, S_B w = λ S_W w: a generalized eigenvalue problem, as S_W^{-1} S_B is not symmetric.

Fisher's Linear Discriminant: Computation. Given that S_W is symmetric positive definite, we can write S_W = S_W^{1/2} S_W^{1/2}, where S_W^{1/2} = U D^{1/2} U^T from the eigendecomposition S_W = U D U^T. Defining v = S_W^{1/2} w, we get (S_W^{-1/2} S_B S_W^{-1/2}) v = λ v. We therefore have to solve a regular eigenvalue problem for a symmetric, positive semi-definite matrix S_W^{-1/2} S_B S_W^{-1/2}. We can find solutions λ_k and v_k, and the corresponding w_k = S_W^{-1/2} v_k. Which eigenvector and eigenvalue should we choose? The largest! Why? At a solution, w^T S_B w = λ w^T S_W w = λ, so up to a constant the objective reduces to λ, and we need to maximize over λ.
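
A sketch of the symmetrized route described above: whiten with S_W^{-1/2}, take the leading eigenvector of S_W^{-1/2} S_B S_W^{-1/2}, and map back; purely illustrative, assuming S_W is strictly positive definite:

```python
import numpy as np

def fisher_by_whitening(S_B, S_W):
    """Solve S_B w = lambda S_W w via the symmetric matrix S_W^{-1/2} S_B S_W^{-1/2}."""
    # S_W^{-1/2} from the eigendecomposition of the symmetric positive definite S_W
    d, U = np.linalg.eigh(S_W)
    S_W_inv_sqrt = U @ np.diag(1.0 / np.sqrt(d)) @ U.T
    M = S_W_inv_sqrt @ S_B @ S_W_inv_sqrt      # symmetric, positive semi-definite
    evals, evecs = np.linalg.eigh(M)           # eigenvalues in ascending order
    v = evecs[:, -1]                           # pick the eigenvector of the largest eigenvalue
    w = S_W_inv_sqrt @ v                       # map back: w = S_W^{-1/2} v
    return w / np.linalg.norm(w)
```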

The Perceptron Model (ca. 1962). Linear discriminant model. The input vector x is first mapped using a fixed non-linear transformation to give a feature vector φ(x), then used to construct the linear model y(x) = f(w^T φ(x)), where f(a) = +1 if a ≥ 0, and -1 if a < 0. Typically we use target t = +1 for class C_1, t = -1 for C_2. The feature vector includes a bias component φ_0(x) = 1.

Perceptron Criteria I. The perceptron's algorithm can be motivated by error function minimization. A natural error would be the number of misclassified patterns. However this does not lead to a simple learning algorithm, because the error is a piecewise constant function of w, with discontinuities whenever a change in w causes the decision boundary to move across one of the datapoints. Gradient methods cannot be immediately applied, as the gradient is zero almost everywhere.

Perceptron Criteria II. Patterns in class C_1 will have w^T φ(x_n) > 0; patterns in class C_2 will have w^T φ(x_n) < 0. With the target coding t_n ∈ {+1, -1}, we would hence like all patterns to satisfy w^T φ(x_n) t_n > 0. The perceptron associates zero error to correctly classified patterns, whereas for a misclassified pattern it tries to minimize the quantity -w^T φ(x_n) t_n.

Perceptron Criteria III. The perceptron criterion is given by E_P(w) = -Σ_{n ∈ M} w^T φ(x_n) t_n, where M is the set of misclassified examples. By applying stochastic gradient descent: w^(τ+1) = w^(τ) - η ∇E_P(w) = w^(τ) + η φ(x_n) t_n. Since the perceptron's decision function is invariant to the rescaling of w, we can set η = 1. As w changes, so will the set of misclassified patterns.

Algorithm. We cycle through the training patterns in turn. For each pattern we evaluate the perceptron function output. If the pattern is correctly classified, then the weight vector remains unchanged. If the pattern is incorrectly classified: for class C_1 we add the vector φ(x_n) to the current estimate of the weight vector; for class C_2 we subtract the vector φ(x_n) from the current estimate of the weight vector.
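
A minimal sketch of this update rule (targets coded ±1, learning rate fixed at 1, bias folded into the feature vector); the epoch cap is an illustrative assumption, since the loop only terminates on its own for separable data:

```python
import numpy as np

def perceptron(Phi, t, max_epochs=100):
    """Cycle through patterns; add/subtract phi_n whenever a pattern is misclassified."""
    w = np.zeros(Phi.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for phi_n, t_n in zip(Phi, t):          # t_n in {+1, -1}
            if t_n * (w @ phi_n) <= 0:          # misclassified (or on the boundary)
                w += t_n * phi_n                # +phi for class C1, -phi for class C2
                errors += 1
        if errors == 0:                         # all patterns correct: converged
            break
    return w
```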

Weight and Data Space. Imagine a space in which each axis corresponds to a feature value or to the weight on that feature. A point in this space is a weight vector. Feature vectors are shown in blue, translated away from the origin to reduce clutter. Each training case defines a plane: on one side of the plane the output is wrong. To get all training cases right we need to find a point on the right side of all the planes. This feasible region (if it exists) is a cone with its tip at the origin. [Figure labels: a feature vector with correct answer = 1, good weights; a feature vector with correct answer = 0, bad weights; the origin.] (Slide from Hinton)

Perceptron's Convergence. The contribution to the error function from a misclassified pattern is reduced. However, this does not imply that contributions from other misclassified patterns will have been reduced; the perceptron rule is not guaranteed to reduce the total error function at each stage. Novikoff (1962) proved that the perceptron algorithm converges after a finite number of iterations if the data set is linearly separable. The weight vector is always adjusted by a bounded amount in a direction with which it has a non-positive dot product, and thus ||w|| can be bounded above by √k R, where k is the number of changes to w. But the projection of w onto a feasible direction can also be bounded below by k γ, because if there exists an (unknown) feasible w* with margin γ, then every change makes progress in this direction by a positive amount that depends only on the input vectors. Combining the two bounds shows that the number of updates to the weight vector is bounded by (R/γ)², where R is the maximum norm of an input vector.

Summary: Perceptron's Convergence. Perceptron's convergence theorem: if there exists an exact solution (the data is linearly separable), then the perceptron algorithm is guaranteed to find an exact solution in a finite number of steps. The number of steps could be very large, though. Until convergence we cannot distinguish between a non-separable problem and one that is just slow to converge. Even for linearly separable data, there may be many solutions, depending on the parameter initialization and the order in which datapoints are presented.

Perceptron at Work.

Other Issues with the Perceptron. Does not provide probabilistic outputs. Does not generalize readily to more than 2 classes. Is based on linear combinations of fixed basis functions.

What Perceptrons Cannot Learn. The adaptive part of a perceptron cannot even tell if two single-bit features have the same value! Same: (1,1) → 1; (0,0) → 1. Different: (1,0) → 0; (0,1) → 0. [Figure: data space.] The four feature-output pairs give four inequalities that are impossible to satisfy: w_1 + w_2 ≥ θ and 0 ≥ θ, but w_1 < θ and w_2 < θ. The positive and negative cases cannot be separated by a plane. (Slide from Hinton)

The Logistic Sigmoid (named for its S shape). y = σ(a) = 1 / (1 + e^{-a}), with a = w^T x = Σ_i w_i x_i. This is also called a squashing function because it maps the entire real axis into a finite interval, (0, 1). For classification, the output is then a smooth function of the inputs and the weights. Properties: σ(-a) = 1 - σ(a); dy/da = y(1 - y); the inverse is the logit function a = ln( y / (1 - y) ). [Figure: plot of σ(a), passing through y = 0.5 at a = 0.]

Probabilistic Generative Models. Use a class prior and a separate generative model of the input vectors for each class, and compute which model makes a test input vector most probable. The posterior probability of class C_1 is given by p(C_1 | x) = p(x | C_1) p(C_1) / ( p(x | C_1) p(C_1) + p(x | C_2) p(C_2) ) = 1 / (1 + e^{-a}) = σ(a) (the logistic sigmoid), where a = ln [ p(x | C_1) p(C_1) / ( p(x | C_2) p(C_2) ) ]. a is called the logit and is given by the log odds.

Multiclass Model: Softmax. p(C_k | x) = exp(a_k) / Σ_j exp(a_j), where a_k = ln [ p(x | C_k) p(C_k) ]. This is known as the normalized exponential. It can be viewed as a multiclass generalization of the logistic sigmoid. It is also called a softmax function: it is a smoothed version of `max'; if a_k ≫ a_j for all j ≠ k, then p(C_k | x) ≈ 1 and p(C_j | x) ≈ 0.
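
A tiny sketch of the normalized exponential; the max-subtraction is a standard numerical-stability trick and an implementation detail, not part of the slide:

```python
import numpy as np

def softmax(a):
    """Normalized exponential over the last axis; subtracting the max avoids overflow."""
    a = a - np.max(a, axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

print(softmax(np.array([2.0, 1.0, 0.1])))   # a smoothed version of 'max'
```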

Gaussian Class Conditionals. Assume that the input vectors for each class are from a Gaussian distribution, and all classes have the same covariance matrix. The class conditionals are p(x | C_k) = (1/Z) exp{ -½ (x - μ_k)^T Σ^{-1} (x - μ_k) }, with Σ^{-1} the inverse covariance matrix and Z the normalizer. For two classes C_1 and C_2, the posterior turns out to be a logistic: p(C_1 | x) = σ(w^T x + w_0), with w = Σ^{-1}(μ_1 - μ_2) and w_0 = -½ μ_1^T Σ^{-1} μ_1 + ½ μ_2^T Σ^{-1} μ_2 + ln [ p(C_1) / p(C_2) ]. The quadratic terms cancel due to the common covariance.

Interpretation of Decision Boundaries. The quadratic terms cancel due to the common covariance, so the sigmoid takes a linear function of x as its argument (with w and w_0 as on the previous slide). The decision boundaries correspond to surfaces along which the posteriors are constant, so they will be given by linear functions of x; thus, decision boundaries are linear in input space. The prior probabilities enter only through the bias parameter w_0, so changes in the priors have the effect of making parallel shifts of the decision boundary (more generally, of the parallel contours of constant posterior probability).

A picture of the two Gaussian class-conditional models and the resulting posterior for the red class, p(C_1 | x). The logistic sigmoid in the right-hand plot is coloured using a proportion of red tone given by p(C_1 | x) and a proportion of blue tone given by 1 - p(C_1 | x).

Class posteriors when the covariance matrices are different for different classes. The decision surface is planar when the covariance matrices are the same, and quadratic when they are not.

Effect of using Basis Functions. [Figure: centers of the Gaussian basis functions with green iso-contours; the linear decision boundary (logistic regression) in feature space; and the decision boundary induced in input space.]

Probabilistic Discriminative Models: Logistic Regression. In our discussion of generative approaches, we saw that under general assumptions the class posterior for C_1 can be written as a logistic sigmoid acting on a linear function of the feature vector. In logistic regression, we use the functional form of the generalized linear model explicitly: p(C_1 | φ) = y(φ) = σ(w^T φ), where σ(a) = 1 / (1 + exp(-a)). Fewer adaptive parameters compared to the generative model: for an M-dimensional feature space, the discriminative model has M parameters, while the generative model has 2M parameters for the means + M(M+1)/2 for the shared(!) covariance, a quadratic versus linear number of parameters.

Maximum Likelihood for Logistic Regression. For a dataset {φ_n, t_n}, n = 1, ..., N, with t_n ∈ {0, 1} and y_n = p(C_1 | φ_n) = σ(w^T φ_n), the likelihood is p(t | w) = Π_n y_n^{t_n} (1 - y_n)^{1 - t_n}. The negative log-likelihood gives the cross-entropy error E(w) = -ln p(t | w) = -Σ_n { t_n ln y_n + (1 - t_n) ln(1 - y_n) }, with gradient ∇E(w) = Σ_n (y_n - t_n) φ_n, which has a similar form to the gradient of the sum-of-squares regression model.
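
A short sketch of this objective: the cross-entropy error, its gradient Φ^T(y - t), and plain gradient descent on it; the step size, iteration count, and the small epsilon guard are illustrative assumptions:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy_and_grad(w, Phi, t):
    """E(w) = -sum_n [t_n ln y_n + (1-t_n) ln(1-y_n)]; gradient = Phi^T (y - t)."""
    y = sigmoid(Phi @ w)
    eps = 1e-12                                    # guard against log(0)
    E = -np.sum(t * np.log(y + eps) + (1 - t) * np.log(1 - y + eps))
    grad = Phi.T @ (y - t)
    return E, grad

def fit_logistic_gd(Phi, t, lr=0.1, iters=500):
    """Minimize the cross-entropy error by simple gradient descent."""
    w = np.zeros(Phi.shape[1])
    for _ in range(iters):
        _, g = cross_entropy_and_grad(w, Phi, t)
        w -= lr * g
    return w
```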

Iterative Reweighted Least Squares. The Newton-Raphson update is w^(new) = w^(old) - H^{-1} ∇E(w). For the logistic model, ∇E(w) = Φ^T (y - t) and H = Φ^T R Φ, where Φ is the N × M design matrix with nth row φ_n^T, and R is the diagonal weighting matrix with R_nn = y_n (1 - y_n) ∈ (0, 1). It follows that w^(new) = (Φ^T R Φ)^{-1} Φ^T R z, with z = Φ w^(old) - R^{-1}(y - t): normal equations with a non-constant weighting matrix.
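
A sketch of the Newton-Raphson / IRLS update just described: each iteration forms R and the effective targets z, then solves a weighted least-squares problem; illustrative only, with no safeguards against ill-conditioning when predictions saturate near 0 or 1:

```python
import numpy as np

def irls_logistic(Phi, t, iters=10):
    """w_new = (Phi^T R Phi)^{-1} Phi^T R z, with z = Phi w - R^{-1} (y - t)."""
    w = np.zeros(Phi.shape[1])
    for _ in range(iters):
        y = 1.0 / (1.0 + np.exp(-Phi @ w))
        R = np.diag(y * (1 - y))                      # non-constant diagonal weighting matrix
        z = Phi @ w - np.linalg.solve(R, y - t)       # effective target values
        w = np.linalg.solve(Phi.T @ R @ Phi, Phi.T @ R @ z)
    return w
```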

Logistic Regression: Chain Rule for Error Derivatives. With a_n = w^T x_n and y_n = σ(a_n): p(t | w) = Π_n y_n^{t_n} (1 - y_n)^{1 - t_n}, so E = -Σ_n [ t_n ln y_n + (1 - t_n) ln(1 - y_n) ]. Then ∂E/∂y_n = -t_n / y_n + (1 - t_n) / (1 - y_n) = (y_n - t_n) / ( y_n (1 - y_n) ), dy_n/da_n = y_n (1 - y_n), hence ∂E/∂a_n = (∂E/∂y_n)(dy_n/da_n) = y_n - t_n, and ∂E/∂w = Σ_n (y_n - t_n) x_n.

Facts on IRLS. The weighting matrix R is not constant, but the Hessian is positive definite. This means that we have to iterate to find the solution, but the log-likelihood function is concave in w: we have a unique optimum. The nth component of z can be interpreted as an effective target value obtained by making a local linear approximation to the logistic sigmoid around the current operating point. The elements of the diagonal weighting matrix R can be interpreted as variances. We can interpret IRLS as the solution to a linearized problem in the space of the variable a = w^T φ (the sigmoid argument).

Readings: Bishop, Ch. 4, up to 4.3.4.