INF 5860 Machine learning for image classification
Lecture 3: Image classification and regression, part II
Anne Solberg, January 2018
Today's topics
- Multiclass logistic regression and softmax
- Regularization
- Image classification using a linear classifier
- Link to probabilistic classifiers and SVM
Relevant additional video links: https://www.youtube.com/playlist?list=PL3FW7Lu7i5JvHM8ljYj-zLfQRF3EO8sYv (Lectures 2 and 3). Remark: they do not cover regression.
From last week: Introduction to logistic regression
Let us show how a regression problem can be transformed into a binary 2-class classification problem using a nonlinear loss function. Then we generalize to multiple classes next week.
From last week: What if we fitted a function f that is close to either 0 or 1?
The hypothesis h_\theta(x) is now a non-linear function of x.
Classification: y = 0 or y = 1. Threshold h_\theta(x): if h_\theta(x) > 0.5, set y = 1; otherwise set y = 0.
Desirable to have h_\theta(x) close to 1 for cod and close to 0 for herring.
(Figure: cod (y = 1) vs. herring (y = 0) plotted against fish length.)
Logistic regression model
Want 0 \le h_\theta(x) \le 1 (binary problem). Let h_\theta(x) = g(\theta^T x), where
g(z) = \frac{1}{1 + e^{-z}}
is called the sigmoid function.
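A minimal numpy sketch of the sigmoid hypothesis (the parameter and input values below are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    """The sigmoid function g(z) = 1 / (1 + e^{-z}); maps R into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# h_theta(x) = g(theta^T x)
theta = np.array([0.8, -0.4])   # illustrative parameter values
x = np.array([1.0, 2.0])
print(sigmoid(theta @ x))       # equals 0.5 exactly when theta^T x = 0
```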
Decisions for logistic regression
Decide y = 1 if h_\theta(x) > 0.5, and y = 0 otherwise.
h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}
g(z) > 0.5 if z > 0, i.e. w^T x + b > 0; g(z) < 0.5 if z < 0, i.e. w^T x + b < 0.
Here the compact notation \theta means the vector of parameters [w, b].
Loss function for logistic regression
We have two classes, 1 and 0. Let us use a probabilistic model. Let the parameters be \theta = [w_1, ..., w_{nk}, b] if we have nk features.
P(y = 1 | x, \theta) = h_\theta(x)
P(y = 0 | x, \theta) = 1 - h_\theta(x)
This can be written more compactly as
p(y | x, \theta) = h_\theta(x)^y (1 - h_\theta(x))^{1-y}
Loss function for logistic regression
The likelihood of the parameter values is
L(\theta) = \prod_{i=1}^m p(y_i | x_i, \theta)
It is easier to maximize the log-likelihood
l(\theta) = \sum_{i=1}^m [ y_i \log h_\theta(x_i) + (1 - y_i) \log(1 - h_\theta(x_i)) ]
We will use gradient descent to maximize this, taking a step in the positive gradient direction since we are maximizing, not minimizing.
Computing the gradient of the likelihood function
\frac{\partial l(\theta)}{\partial \theta_j} = \sum_{i=1}^m (y_i - h_\theta(x_i)) x_{ij}
Here, we used the fact that g'(z) = g(z)(1 - g(z)).
Gradient descent of J(\theta) = -l(\theta)
To find \theta: find the \theta that minimizes
J(\theta) = -\frac{1}{m} \sum_{i=1}^m [ y_i \log h_\theta(X(i,:)) + (1 - y_i) \log(1 - h_\theta(X(i,:))) ]
using gradient descent. Repeat:
\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta(X(i,:)) - y_i) X(i,j)
This algorithm looks similar to linear regression, but now h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}.
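As a sketch, the update above in numpy, assuming X already has a leading column of ones so that the bias is part of theta (the step size and iteration count are arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_descent(X, y, alpha=0.1, num_iters=1000):
    """Minimize J(theta) = -l(theta) by batch gradient descent.
    X: (m, n) with a leading column of ones; y: (m,) labels in {0, 1}."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        h = sigmoid(X @ theta)                 # h_theta(X(i,:)) for all samples
        theta -= alpha * (X.T @ (h - y)) / m   # gradient step on J
    return theta
```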
Overfitting and regularization
For any classifier, there is a risk of overfitting to the training data.
Overfitting: high accuracy on the training data, lower accuracy on validation data.
This risk is higher the more parameters the classifier can use.
Example: polynomial regression
If a linear model is not sufficient, we can extend it to allow higher-order terms or cross-terms between the variables by changing our hypothesis:
h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + ...
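For a single input variable, the feature expansion can be sketched as below (np.vander builds the columns 1, x, x^2, ...; the model stays linear in theta):

```python
import numpy as np

def poly_features(x, degree):
    """Map each scalar x to the row [1, x, x^2, ..., x^degree]."""
    return np.vander(x, degree + 1, increasing=True)

x = np.linspace(0.0, 1.0, 20)
X = poly_features(x, 3)   # columns: 1, x, x^2, x^3
# The same linear-regression machinery applies: h_theta(x) = X @ theta
```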
The danger of overfitting
A higher-order model can easily overfit the training data. For the higher-order terms: the larger the values of the coefficients, the more the curve can fluctuate. This does not hold for the first two coefficients (\theta_0 and \theta_1). Restricting only the higher-order terms is difficult in general (e.g. for neural nets), but we can restrict the magnitude of all coefficients except \theta_0.
Overfitting for classification
Overfitting must be avoided for classification also; this is partly why we start with simple linear models.
Regularization - intuition
h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4
Suppose we add a penalty to restrict \theta_3 and \theta_4:
J(\theta) = \frac{1}{2m} \sum_{i=1}^m (h_\theta(X(i,:)) - y_i)^2 + 100\,\theta_3^2 + 100\,\theta_4^2
To minimize J(\theta), \theta_3 and \theta_4 must be small.
Regularized cost function
Simplify the hypothesis by having small values for \theta_1, ..., \theta_n:
J(\theta) = \frac{1}{2m} [ \sum_{i=1}^m (h_\theta(X(i,:)) - y_i)^2 + \lambda \sum_{j=1}^n \theta_j^2 ]
\lambda is the regularization parameter. This is L2-regularization; later we will see Dropout and max-norm.
Remark: we do not regularize the offset b (also called \theta_0).
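A sketch of this cost in numpy, following the slide's convention of leaving theta_0 (the offset) out of the penalty:

```python
import numpy as np

def regularized_cost(theta, X, y, lam):
    """J = 1/(2m) [ sum_i (h(x_i) - y_i)^2 + lam * sum_{j>=1} theta_j^2 ].
    Linear hypothesis h(x) = X @ theta; theta[0] is the unpenalized offset."""
    m = X.shape[0]
    residual = X @ theta - y
    penalty = lam * np.sum(theta[1:] ** 2)   # no penalty on theta_0
    return (residual @ residual + penalty) / (2.0 * m)
```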
What if \lambda is very large? Will we get overfit or underfit?
Gradient descent with regularization: linear regression
To find \theta: find the \theta that minimizes J(\theta) using gradient descent. Repeat:
\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta(X(i,:)) - y_i) X(i,0)
\theta_j := \theta_j - \alpha [ \frac{1}{m} \sum_{i=1}^m (h_\theta(X(i,:)) - y_i) X(i,j) + \frac{\lambda}{m} \theta_j ], j = 1, ..., n
Equivalently: \theta_j := \theta_j (1 - \alpha \frac{\lambda}{m}) - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta(X(i,:)) - y_i) X(i,j)
Note: NO penalty on \theta_0.
Regularized logistic regression: gradient descent
Repeat:
\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^m (h_\theta(X(i,:)) - y_i) X(i,0)
\theta_j := \theta_j - \alpha [ \frac{1}{m} \sum_{i=1}^m (h_\theta(X(i,:)) - y_i) X(i,j) + \frac{\lambda}{m} \theta_j ], j = 1, ..., n
where now h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}.
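The same update in numpy (a sketch; alpha and the iteration count are arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_logistic_gd(X, y, lam, alpha=0.1, num_iters=1000):
    """L2-regularized logistic regression by gradient descent.
    X: (m, n) with a leading column of ones; theta[0] is not penalized."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        h = sigmoid(X @ theta)
        grad = (X.T @ (h - y)) / m
        grad[1:] += (lam / m) * theta[1:]   # penalty on theta_1..theta_n only
        theta -= alpha * grad
    return theta
```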
Introducing: classifying CIFAR images
CIFAR-10 images: 32x32x3 pixels. Stack one image into a vector x of length 32*32*3 = 3072.
Classification will be to find a mapping f(W, x, b) from image space to a set of C classes. For CIFAR-10:
x = [x_1, ..., x_3072]
W = a matrix with one weight per pixel per class, from w(1,1) (weight for pixel 1 for class 1) to w(10,3072) (weight for pixel 3072 for class 10)
b = [b_1, ..., b_10]
Small example: 2 classes
A 2x2 gray-level image is flattened to a vector x of 4 pixel values. W is a 2x4 matrix with one weight w(c,j) per pixel j for each class c, and b holds one bias per class; the score for class c is s_c = W(c,:) x + b_c.
(Figure: worked numeric example of computing the two class scores.)
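A sketch of the score computation with made-up pixel values and weights (the slide's concrete numbers are not reproduced here):

```python
import numpy as np

# A 2x2 gray-level image flattened to 4 pixel values (made-up numbers)
x = np.array([40.0, 36.0, 16.0, 20.0])
W = np.array([[0.2, 0.5, 0.1, 0.3],    # weights w(1, j) for class 1
              [0.5, 1.0, 0.3, 0.1]])   # weights w(2, j) for class 2
b = np.array([1.2, 0.3])               # one bias per class

scores = W @ x + b                     # s_c = W(c,:) x + b_c
print(scores)                          # one score per class
```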
If we have a color image, append the r, g, b bands into one long vector.
Note: no spatial information concerning pixel neighbors is used here. Convolutional nets use spatial information.
All images are standardized to the same size! For CIFAR-10 it is 32x32. If a classifier is trained on CIFAR and we have a new image to classify, resize it to 32x32.
W for multiclass image classification
W is a C x (n+1) matrix (C classes, n pixels in the image, plus 1 for b).
We train one linear model per class, so each class has a different W(c,:)-vector. W(c,:) is a vector of length n+1, holding the weights for pixels 1, ..., n for class c together with the bias for class c.
Let the score for class c be s_c = f(W, x) = W(c,:) x (b is included in W, and a 1 is appended to x).
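A sketch of this bias trick for CIFAR-10 dimensions (random values, just to show the shapes):

```python
import numpy as np

C, n = 10, 3072                  # CIFAR-10: 10 classes, 32*32*3 = 3072 pixels
W = np.random.randn(C, n + 1)    # last column plays the role of b

x = np.random.rand(n)            # a flattened image
x_ext = np.append(x, 1.0)        # append a 1 so W(c,:) x_ext includes b_c

scores = W @ x_ext               # shape (C,), one score per class
predicted_class = np.argmax(scores)
```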
From 2 to C classes, alternative 1
One-vs-all classification: train a logistic classifier h_{\theta,c}(x) for each class c to predict the probability that y = c. Classify a new sample by picking the class c that maximizes max_c h_{\theta,c}(x).
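A sketch of one-vs-all prediction, assuming one trained parameter vector per class stacked as the rows of Theta:

```python
import numpy as np

def one_vs_all_predict(Theta, x):
    """Theta: (C, n+1), row c holds the logistic classifier for class c.
    x: (n+1,) feature vector with the bias term appended.
    Returns the class whose classifier outputs the highest probability."""
    probs = 1.0 / (1.0 + np.exp(-(Theta @ x)))   # h_{theta,c}(x) for every c
    return int(np.argmax(probs))
```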
From 2 to multiple classes: Softmax
The common generalization to multiple classes is the softmax classifier. We want to predict the class label y \in {1, ..., C} for sample X(i,:); y can take one of C discrete values, so it follows a multinomial probability distribution. This is derived from an assumption that the probability/score of class y = k is
p(y = k | x, \theta) = \frac{e^{\theta_k^T x}}{\sum_{c=1}^C e^{\theta_c^T x}}
Softmax prediction/classification
Assign each sample to the class that maximizes the score:
p(y = k | x, \theta) = \frac{e^{\theta_k^T x}}{\sum_{c=1}^C e^{\theta_c^T x}}
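A numerically stable softmax sketch (subtracting the maximum score leaves the probabilities unchanged but avoids overflow in np.exp; the scores are illustrative):

```python
import numpy as np

def softmax(scores):
    """p(y = k | x) = e^{s_k} / sum_c e^{s_c}."""
    shifted = scores - np.max(scores)   # stability: same result, no overflow
    e = np.exp(shifted)
    return e / np.sum(e)

scores = np.array([3.2, 5.1, -1.7])    # illustrative class scores s_c
probs = softmax(scores)
print(probs, np.argmax(probs))         # probabilities sum to 1
```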
Cross-entropy
From information theory, the cross entropy between a true distribution p and an estimated distribution q is:
H(p, q) = -\sum_x p(x) \log q(x)
Softmax minimizes the cross-entropy between the estimated class probabilities and the true distribution (the distribution where all the mass is in the correct class).
Softmax
From a training data set with m samples, we formulate the log-likelihood function that the model fits the data:
l(\theta) = \sum_{i=1}^m \log p(y_i | X(i,:), \theta)
We can now find the \theta that maximizes the likelihood using e.g. gradient ascent of the log-likelihood function, or we can minimize -l(\theta) using gradient descent.
More details on deriving softmax next week (Ole-Johan).
Cross-entropy loss function for softmax
The loss function for softmax, including regularization (let x_i = [1, X(i,:)] be the n pixel values for image i with a 1 prepended, and \theta_k = W(k,:) the row for class k):
J(\theta) = -\frac{1}{m} [ \sum_{i=1}^m \sum_{k=1}^C I(y_i = k) \log \frac{e^{\theta_k^T x_i}}{\sum_{c=1}^C e^{\theta_c^T x_i}} ] + \frac{\lambda}{2} \sum_{k=1}^C \sum_{j=0}^n \theta_{kj}^2
I(y = k) is the indicator function that is 1 if y = k and zero otherwise.
See http://ufldl.stanford.edu/wiki/index.php/Softmax_Regression
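A vectorized sketch of this loss (the exact scaling of the regularization term varies between texts; here lambda/2 times the squared weights is used, matching the formula above):

```python
import numpy as np

def softmax_loss(W, X, y, lam):
    """Regularized cross-entropy loss for softmax.
    W: (C, n+1) weights with the bias folded in, X: (m, n+1),
    y: (m,) integer labels in 0..C-1."""
    m = X.shape[0]
    scores = X @ W.T                              # (m, C)
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(scores)
    probs = e / e.sum(axis=1, keepdims=True)      # p(y = k | x_i)
    data_loss = -np.mean(np.log(probs[np.arange(m), y]))
    reg_loss = 0.5 * lam * np.sum(W * W)
    return data_loss + reg_loss
```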
Softmax prediction example
(Figure: worked example of converting class scores to softmax probabilities and picking the most probable class.)
Gradients of the cross-entropy loss, including regularization
With the same notation as above, the gradient with respect to the parameters for class k is
\nabla_{\theta_k} J(\theta) = -\frac{1}{m} \sum_{i=1}^m x_i ( I(y_i = k) - p(y_i = k | x_i, \theta) ) + \lambda \theta_k
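A sketch of the corresponding gradient computation, using the same conventions as the loss sketch above:

```python
import numpy as np

def softmax_gradient(W, X, y, lam):
    """Gradient of the regularized cross-entropy loss w.r.t. W.
    Row k of the result is -1/m sum_i x_i (I(y_i = k) - p_ik) + lam W(k,:)."""
    m = X.shape[0]
    scores = X @ W.T
    scores -= scores.max(axis=1, keepdims=True)
    e = np.exp(scores)
    probs = e / e.sum(axis=1, keepdims=True)   # (m, C)
    probs[np.arange(m), y] -= 1.0              # p_ik - I(y_i = k)
    return (probs.T @ X) / m + lam * W         # shape (C, n+1)
```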
For those who want calculus...
Computing the derivative of the softmax function: see all details at https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative/
Link to Gaussian classifiers
In INF 4300, we used a traditional Gaussian classifier.
This type of model is called a generative model, where a specific distribution is assumed.
FROM INF 4300: Discriminant functions for the Gaussian density
When finding the class with the highest probability, these functions are equivalent:
g_i(x) = P(\omega_i | x)
g_i(x) = p(x | \omega_i) P(\omega_i)
g_i(x) = \ln p(x | \omega_i) + \ln P(\omega_i)
With a multivariate Gaussian we get:
g_i(x) = -\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \frac{d}{2} \ln 2\pi - \frac{1}{2} \ln |\Sigma_i| + \ln P(\omega_i)
If we assume all classes have an equal diagonal covariance matrix \Sigma_i = \sigma^2 I, the discriminant function is a linear function of x:
g_i(x) = \frac{1}{\sigma^2} \mu_i^T x - \frac{1}{2\sigma^2} \mu_i^T \mu_i + \ln P(\omega_i)
Gaussian classifier vs. logistic regression
The Gaussian classifier (with diagonal covariance) and the logistic regression/softmax classifier will result in different linear decision boundaries.
If the Gaussian assumption is correct, we expect that the Gaussian classifier has the lowest error rate. Logistic regression might be better if the data is not entirely Gaussian.
NOTE: softmax reduces to logistic regression if we have 2 classes.
Support Vector Machine classifiers
Another popular classifier is the Support Vector Machine (SVM) formulation, which can also be formulated in terms of loss functions.
The following slides are for completeness; only a basic understanding of the SVM as a maximum-margin classifier is expected in this course.
Hyperplanes and margins (background, SVM)
Have a margin of \frac{2}{\|w\|}. Require that all samples are correctly classified:
w^T x_i + w_0 \ge 1 if y_i = 1
w^T x_i + w_0 \le -1 if y_i = -1
Goal: find w and w_0.
Support Vector Machine loss
An SVM loss function can be formulated by requiring as large a margin as possible. This is generalized to multiple classes: the SVM wants the correct class to have a score higher than the scores of the incorrect classes by some margin.
If s_j is the score for class j, the loss function for sample i is
L_i = \sum_{j \ne y_i} \max(0, s_j - s_{y_i} + 1)
This is called the hinge loss.
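A sketch of the multiclass hinge loss in numpy, with the margin fixed to 1 as in the formula above:

```python
import numpy as np

def multiclass_hinge_loss(scores, y, delta=1.0):
    """L_i = sum_{j != y_i} max(0, s_j - s_{y_i} + delta), averaged over i.
    scores: (m, C) class scores, y: (m,) correct class labels."""
    m = scores.shape[0]
    correct = scores[np.arange(m), y][:, None]      # s_{y_i}, shape (m, 1)
    margins = np.maximum(0.0, scores - correct + delta)
    margins[np.arange(m), y] = 0.0                  # exclude j == y_i
    return margins.sum() / m
```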
SVM and gradient descent
We can also solve the SVM using gradient descent. We will not cover this, but see http://www.robots.ox.ac.uk/~az/lectures/ml/lect2.pdf
Next week: Feed-forward nets and learning by backpropagation
Reading material:
http://cs231n.github.io/neural-networks-1/
http://cs231n.github.io/neural-networks-2/
http://cs231n.github.io/optimization-2/
Deep Learning, Chapter 6