Machine Learning Lecture 10

Neural Networks (Transcription)

Machine Learning Lecture 10: Neural Networks
26.11.2018
Bastian Leibe, RWTH Aachen
http://www.vision.rwth-aachen.de, leibe@vision.rwth-aachen.de

Today's Topic: Deep Learning

Course Outline
Fundamentals: Bayes Decision Theory, Probability Density Estimation
Classification Approaches: Linear Discriminants, Support Vector Machines
Ensemble Methods & Boosting: Random Forests
Deep Learning: Foundations, Convolutional Neural Networks, Recurrent Neural Networks

Recap: AdaBoost ("Adaptive Boosting")
Main idea [Freund & Schapire, 1996]: Iteratively select an ensemble of component classifiers. After each iteration, reweight misclassified training examples: either increase their chance of being selected in a sampled training set, or increase their misclassification cost when training on the full set.
Components h_m(x): "weak" or base classifiers. Condition: less than 50% training error over any distribution.
H(x): "strong" or final classifier.
AdaBoost constructs the strong classifier as a thresholded linear combination of the weighted weak classifiers:
H(x) = \mathrm{sign}\left( \sum_{m=1}^{M} \alpha_m h_m(x) \right)
(A small code illustration of this combination rule appears at the end of this recap.)

Recap: AdaBoost Algorithm
1. Initialization: Set w_n^{(1)} = 1/N for n = 1, ..., N.
2. For m = 1, ..., M iterations:
   a) Train a new weak classifier h_m(x) using the current weighting coefficients W^{(m)} by minimizing the weighted error function
      J_m = \sum_{n=1}^{N} w_n^{(m)} I(h_m(x_n) \neq t_n)
   b) Estimate the weighted error of this classifier on X:
      \epsilon_m = \frac{\sum_n w_n^{(m)} I(h_m(x_n) \neq t_n)}{\sum_n w_n^{(m)}}
   c) Calculate a weighting coefficient for h_m(x): \alpha_m = ?
   d) Update the weighting coefficients: w_n^{(m+1)} = ?
How should we do this exactly?

Recap: Minimizing Exponential Error
The original algorithm used an exponential error function
E = \sum_{n=1}^{N} \exp\{ -t_n f_m(x_n) \}
where f_m(x) is a classifier defined as a linear combination of base classifiers h_l(x):
f_m(x) = \frac{1}{2} \sum_{l=1}^{m} \alpha_l h_l(x)
Goal: Minimize E with respect to both the weighting coefficients \alpha_l and the parameters of the base classifiers h_l(x).
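To make the combination rule H(x) = sign(sum_m alpha_m h_m(x)) concrete, here is a minimal Python sketch that evaluates a strong classifier from a list of weak classifiers and their coefficients. The function name and the toy stumps are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def strong_classifier(x, weak_learners, alphas):
    """Evaluate H(x) = sign(sum_m alpha_m * h_m(x)) for a single input x.

    weak_learners: list of callables h_m mapping x -> {-1, +1}
    alphas:        list of the corresponding coefficients alpha_m
    """
    score = sum(a * h(x) for a, h in zip(alphas, weak_learners))
    return np.sign(score)

# Toy usage: three hand-made stumps on a 1-D input.
weak_learners = [lambda x: 1 if x > 0.2 else -1,
                 lambda x: 1 if x > 0.5 else -1,
                 lambda x: -1 if x > 0.8 else 1]
alphas = [0.7, 1.2, 0.4]
print(strong_classifier(0.6, weak_learners, alphas))  # -> 1.0
```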

Recap: Minimizing Exponential Error
Sequential minimization (continuation from last lecture): only minimize with respect to \alpha_m and h_m(x).
E = \sum_{n=1}^{N} \exp\{ -t_n f_m(x_n) \}   with   f_m(x) = \frac{1}{2} \sum_{l=1}^{m} \alpha_l h_l(x)
  = \sum_{n=1}^{N} \exp\left\{ -t_n f_{m-1}(x_n) - \frac{1}{2} t_n \alpha_m h_m(x_n) \right\}
  = \sum_{n=1}^{N} w_n^{(m)} \exp\left\{ -\frac{1}{2} t_n \alpha_m h_m(x_n) \right\}
where w_n^{(m)} = \exp\{ -t_n f_{m-1}(x_n) \} is constant in this step.

AdaBoost - Minimizing Exponential Error
Separating correctly and incorrectly classified points, this can be rewritten as
E = \left( e^{\alpha_m/2} - e^{-\alpha_m/2} \right) \sum_{n=1}^{N} w_n^{(m)} I(h_m(x_n) \neq t_n) + e^{-\alpha_m/2} \sum_{n=1}^{N} w_n^{(m)}
Minimize with respect to h_m(x): since the second term does not depend on h_m(x), this is equivalent to minimizing
J_m = \sum_{n=1}^{N} w_n^{(m)} I(h_m(x_n) \neq t_n)
(our weighted error function from step 2a) of the algorithm). We're on the right track. Let's continue...

AdaBoost - Minimizing Exponential Error
Minimize with respect to \alpha_m: setting \partial E / \partial \alpha_m = 0 gives
\frac{1}{2} \left( e^{\alpha_m/2} + e^{-\alpha_m/2} \right) \sum_n w_n^{(m)} I(h_m(x_n) \neq t_n) = \frac{1}{2} e^{-\alpha_m/2} \sum_n w_n^{(m)}
With the weighted error
\epsilon_m := \frac{\sum_n w_n^{(m)} I(h_m(x_n) \neq t_n)}{\sum_n w_n^{(m)}}
this yields (1 - \epsilon_m)/\epsilon_m = e^{\alpha_m}, i.e. the update for the coefficients:
\alpha_m = \ln \left\{ \frac{1 - \epsilon_m}{\epsilon_m} \right\}

AdaBoost - Minimizing Exponential Error
Remaining step: update the weights. Recall that
E = \sum_{n=1}^{N} w_n^{(m)} \exp\left\{ -\frac{1}{2} t_n \alpha_m h_m(x_n) \right\}
Therefore
w_n^{(m+1)} = w_n^{(m)} \exp\left\{ -\frac{1}{2} t_n \alpha_m h_m(x_n) \right\} = ... = w_n^{(m)} \exp\{ \alpha_m I(h_m(x_n) \neq t_n) \}
(up to a factor that is independent of n). This becomes w_n^{(m+1)} in the next iteration.

AdaBoost - Final Algorithm
1. Initialization: Set w_n^{(1)} = 1/N for n = 1, ..., N.
2. For m = 1, ..., M iterations:
   a) Train a new weak classifier h_m(x) using the current weighting coefficients W^{(m)} by minimizing the weighted error function
      J_m = \sum_{n=1}^{N} w_n^{(m)} I(h_m(x_n) \neq t_n)
   b) Estimate the weighted error of this classifier on X:
      \epsilon_m = \frac{\sum_n w_n^{(m)} I(h_m(x_n) \neq t_n)}{\sum_n w_n^{(m)}}
   c) Calculate a weighting coefficient for h_m(x):
      \alpha_m = \ln \left\{ \frac{1 - \epsilon_m}{\epsilon_m} \right\}
   d) Update the weighting coefficients:
      w_n^{(m+1)} = w_n^{(m)} \exp\{ \alpha_m I(h_m(x_n) \neq t_n) \}
(The complete loop is sketched in code at the end of this page.)

AdaBoost - Analysis
Result of this derivation: We now know that AdaBoost minimizes an exponential error function in a sequential fashion. This allows us to analyze AdaBoost's behavior in more detail. In particular, we can see how robust it is to outlier data points.
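The final algorithm translates into a short NumPy sketch. Decision stumps are used as the weak classifiers h_m(x) purely for illustration (the lecture does not prescribe a particular weak learner), and the helper names fit_stump and adaboost are made up; steps a) to d) follow the formulas derived above.

```python
import numpy as np

def fit_stump(X, t, w):
    """Illustrative weak learner: exhaustively pick the decision stump
    (feature, threshold, polarity) that minimizes the weighted error J_m."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for pol in (1, -1):
                pred = np.where(pol * (X[:, j] - thr) > 0, 1, -1)
                err = np.sum(w * (pred != t))
                if best is None or err < best[0]:
                    best = (err, j, thr, pol)
    _, j, thr, pol = best
    return lambda X: np.where(pol * (X[:, j] - thr) > 0, 1, -1)

def adaboost(X, t, M=10):
    """Sequential AdaBoost as derived above; labels t must be in {-1, +1}."""
    N = len(t)
    w = np.full(N, 1.0 / N)                    # 1. initialization: w_n^(1) = 1/N
    learners, alphas = [], []
    for m in range(M):                         # 2. for m = 1, ..., M
        h = fit_stump(X, t, w)                 # a) train weak classifier on w^(m)
        miss = (h(X) != t).astype(float)
        eps = np.sum(w * miss) / np.sum(w)     # b) weighted error eps_m
        eps = np.clip(eps, 1e-10, 1 - 1e-10)   # guard against eps = 0 or 1
        alpha = np.log((1 - eps) / eps)        # c) alpha_m = ln((1 - eps_m)/eps_m)
        w = w * np.exp(alpha * miss)           # d) w_n^(m+1) = w_n^(m) exp{alpha_m I(...)}
        learners.append(h)
        alphas.append(alpha)
    # Strong classifier H(x) = sign(sum_m alpha_m h_m(x))
    return lambda X: np.sign(sum(a * h(X) for a, h in zip(alphas, learners)))

# Toy usage on a 1-D problem that a single stump cannot solve.
X = np.array([[0.1], [0.3], [0.4], [0.6], [0.8], [0.9]])
t = np.array([-1, -1, 1, 1, -1, -1])
H = adaboost(X, t, M=5)
print(H(X))   # boosted prediction for the training points
```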

Recap: Error Functions
Ideal misclassification error function (shown in black in the lecture plots, plotted against z_n = t_n y(x_n)): this is what we want to approximate. Unfortunately, it is not differentiable, and its gradient is zero for misclassified points, so we cannot minimize it by gradient descent.

Recap: Error Functions
Squared error, used in least-squares classification: very popular and leads to closed-form solutions. However, it is sensitive to outliers due to the squared penalty, and it penalizes "too correct" data points (points with large positive z_n). It generally does not lead to good classifiers.

Recap: Error Functions
Hinge error, used in SVMs: zero error for points outside the margin (z_n > 1), which favors sparse solutions; a linear penalty for misclassified points (z_n < 1), which gives robustness to outliers. It is not differentiable around z_n = 1 and therefore cannot be optimized directly.

Discussion: AdaBoost Error Function
Exponential error, used in AdaBoost: a continuous approximation to the ideal misclassification function. Sequential minimization leads to the simple AdaBoost scheme. Properties?

Discussion: AdaBoost Error Function
Exponential error, used in AdaBoost: no penalty for "too correct" data points, fast convergence. Disadvantage: an exponential penalty for large negative values of z_n, which makes it less robust to outliers or misclassified data points.

Discussion: Other Possible Error Functions
Cross-entropy error:
E = - \sum_{n} \{ t_n \ln y_n + (1 - t_n) \ln(1 - y_n) \}
Cross-entropy error is used in logistic regression. It is similar to the exponential error for z_n > 0, but only grows linearly with large negative values of z_n. Making AdaBoost more robust by switching to this error function leads to "GentleBoost".
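The error functions compared above can all be written as functions of the margin quantity z_n = t_n y(x_n). The short sketch below is not from the lecture; the function names, the (1 - z)^2 form of the squared error (which assumes targets in {-1, +1}), and the rescaled logistic form of the cross-entropy error are my own choices. It evaluates each error so that their behavior for strongly misclassified points (large negative z) can be compared numerically.

```python
import numpy as np

def ideal_error(z):          # ideal misclassification error: 1 on the wrong side, else 0
    return (z <= 0).astype(float)

def squared_error(z):        # squared error as a function of z = t*y: (1 - z)^2 for t in {-1, +1}
    return (1 - z) ** 2

def hinge_error(z):          # hinge error used in SVMs: max(0, 1 - z)
    return np.maximum(0, 1 - z)

def exponential_error(z):    # exponential error used in AdaBoost: exp(-z)
    return np.exp(-z)

def cross_entropy_error(z):  # cross-entropy / logistic error, rescaled form: ln(1 + exp(-z))
    return np.log1p(np.exp(-z))

z = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
for name, f in [("ideal", ideal_error), ("squared", squared_error),
                ("hinge", hinge_error), ("exponential", exponential_error),
                ("cross-entropy", cross_entropy_error)]:
    print(f"{name:13s}", np.round(f(z), 3))
```

Note how the exponential error explodes for z = -3 while the hinge and cross-entropy errors grow only linearly, which is exactly the robustness argument made on the slides.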

Summary: AdaBoost
Properties: A simple combination of multiple classifiers that is easy to implement. It can be used with many different types of classifiers, and none of them needs to be too good on its own; in fact, they only have to be slightly better than chance. It is commonly used in many areas and shows empirically good generalization capabilities.
Limitations: The original AdaBoost is sensitive to misclassified training data points because of the exponential error function; GentleBoost improves on this. Single-class classifier; multiclass extensions are available.

Today's Topic: Deep Learning

Topics of This Lecture
Perceptrons: definition, loss functions, regularization
Multi-Layer Perceptrons: definition, learning with hidden units
Obtaining the Gradients: naive analytical differentiation, numerical differentiation, backpropagation

A Brief History of Neural Networks
1957: Rosenblatt invents the Perceptron, and a cool learning algorithm: "Perceptron Learning". Hardware implementation: the "Mark I Perceptron" for pixel image analysis. "The embryo of an electronic computer that [...] will be able to walk, talk, see, write, reproduce itself and be conscious of its existence." (Image source: Wikipedia, clipartpanda.com)

1957: Rosenblatt invents the Perceptron.
1969: Minsky & Papert show that (single-layer) Perceptrons cannot solve all problems. This was misunderstood by many to mean that Perceptrons were worthless. ("Neural Networks don't work!")

1957: Rosenblatt invents the Perceptron.
1969: Minsky & Papert.
1980s: Resurgence of Neural Networks, with some notable successes with multi-layer perceptrons and the backpropagation learning algorithm. ("OMG! They work like the human brain!" "Oh no! Killer robots will achieve world domination!") (Image sources: colourbox.de, thinkstock, clipartpanda.com, cliparts.co)

1957: Rosenblatt invents the Perceptron.
1969: Minsky & Papert.
1980s: Resurgence of Neural Networks. Some notable successes with multi-layer perceptrons and the backpropagation learning algorithm. But they are hard to train, tend to overfit, and have unintuitive parameters. So, the excitement fades again... ("sigh!")

Interest shifts to other learning methods, notably Support Vector Machines. Machine Learning becomes a discipline of its own. ("I can do science, me!") (Image source: clipartof.com, colourbox.de)

The general public and the press still love Neural Networks. ("I'm doing Machine Learning." "So, you're using Neural Networks?" "Actually..." "Are you using Neural Networks?" "Come on. Get real!") (Image source: clipartof.com)

Gradual progress: better understanding of how to successfully train deep networks, and the availability of large datasets and powerful GPUs. Still largely under the radar for many disciplines applying ML.

2012: Breakthrough results. In the ImageNet Large Scale Visual Recognition Challenge, a ConvNet halves the error rate of dedicated vision approaches. Deep Learning is widely adopted. ("It works!") (Image source: clipartpanda.com, clipartof.com)

Topics of This Lecture
Perceptrons: definition, loss functions, regularization
Multi-Layer Perceptrons: definition, learning with hidden units
Obtaining the Gradients: naive analytical differentiation, numerical differentiation, backpropagation

Perceptrons (Rosenblatt 1957)
Standard Perceptron: an input layer of hand-designed features (based on common sense), a layer of weights, and an output layer. The outputs can be linear outputs or logistic outputs. Learning = determining the weights w.

Extension: Multi-Class Networks
One output node per class, again with linear or logistic outputs. This can be used to do multidimensional linear regression or multiclass classification. (Slide adapted from Stefan Roth)

Extension: Non-Linear Basis Functions
Straightforward generalization: input layer, a fixed mapping to a feature layer, weights W_kd, and an output layer with linear or logistic outputs.

Extension: Non-Linear Basis Functions
Remarks: Perceptrons are generalized linear discriminants! Everything we know about the latter can also be applied here. Note: the feature functions \phi(x) are kept fixed, not learned! (A small code sketch of this view follows below.)
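To illustrate the "perceptron = generalized linear discriminant" remark, the sketch below computes multi-class outputs y_k(x) = sum_d W_kd phi_d(x) with a fixed, hand-designed feature mapping phi. The particular basis functions, array shapes, and names are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def phi(x):
    """Fixed (not learned) feature mapping phi(x): a hand-designed example
    consisting of a bias term, the raw inputs, and their pairwise products."""
    x = np.asarray(x, dtype=float)
    pairwise = np.outer(x, x)[np.triu_indices(len(x))]
    return np.concatenate(([1.0], x, pairwise))

def perceptron_outputs(x, W, logistic=False):
    """Multi-class perceptron outputs y_k(x) = sum_d W_kd * phi_d(x),
    optionally passed through a logistic non-linearity."""
    a = W @ phi(x)
    return 1.0 / (1.0 + np.exp(-a)) if logistic else a

# Toy usage: 3 classes, 2-D input -> phi(x) has 1 + 2 + 3 = 6 dimensions.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 6))
x = np.array([0.5, -1.0])
print(perceptron_outputs(x, W))             # linear outputs
print(np.argmax(perceptron_outputs(x, W)))  # predicted class
```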

Perceptron Learning
A very simple algorithm: process the training cases in some permutation.
If the output unit is correct, leave the weights alone.
If the output unit incorrectly outputs a zero, add the input vector to the weight vector.
If the output unit incorrectly outputs a one, subtract the input vector from the weight vector.
This is guaranteed to converge to a correct solution if such a solution exists. (Slide adapted from Geoff Hinton)

Perceptron Learning
Let's analyze this algorithm... Translation into a weight update:
w_{kj}^{(\tau+1)} = w_{kj}^{(\tau)} - \eta \, (y_k(x_n; w) - t_{kn}) \, \phi_j(x_n)
This is the Delta rule, a.k.a. the LMS rule! Perceptron Learning therefore corresponds to 1st-order (stochastic) gradient descent (e.g., of a quadratic error function). (Slide adapted from Geoff Hinton; a code sketch of this rule follows at the end of this page.)

Loss Functions
We can now also apply other loss functions:
L2 loss: least-squares regression
L1 loss: median regression
Cross-entropy loss: logistic regression
Hinge loss: SVM classification
Softmax loss: multi-class probabilistic classification,
L(t, y(x)) = - \sum_n \sum_k I(t_n = k) \ln \frac{\exp(y_k(x_n))}{\sum_j \exp(y_j(x_n))}

Regularization
In addition, we can apply regularizers, e.g. an L2 regularizer. This is known as "weight decay" in Neural Networks. We can also apply other regularizers, e.g. L1, which encourages sparsity. Since Neural Networks often have many parameters, regularization becomes very important in practice. We will see more complex regularization techniques later on...

Limitations of Perceptrons
What makes the task difficult? Perceptrons with fixed, hand-coded input features can model any separable function perfectly... given the right input features. For some tasks this requires an exponential number of input features, e.g., by enumerating all possible binary input vectors as separate feature units (similar to a look-up table). But this approach won't generalize to unseen test cases! It is the feature design that solves the task. Once the hand-coded features have been determined, there are very strong limitations on what a perceptron can learn. Classic example: the XOR function.

Wait... Didn't we just say that...
Perceptrons correspond to generalized linear discriminants, and Perceptrons are very limited... Doesn't this mean that what we have been doing so far in this lecture has the same problems? Yes, this is the case. A linear classifier cannot solve certain problems (e.g., XOR). However, with a non-linear classifier based on the right kind of features, the problem becomes solvable. So far, we have solved such problems by hand-designing good features \phi and kernels \phi^T \phi. Can we also learn such feature representations?

Topics of This Lecture
Perceptrons: definition, loss functions, regularization
Multi-Layer Perceptrons: definition, learning with hidden units
Obtaining the Gradients: naive analytical differentiation, numerical differentiation, backpropagation
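The sketch below implements the delta-rule translation of Perceptron Learning from the slides above, w_kj <- w_kj - eta (y_k(x_n; w) - t_kn) phi_j(x_n), with binary threshold outputs so that the update literally adds or subtracts the feature vector for wrongly classified cases. The toy data, learning rate, and helper names are assumptions for this example.

```python
import numpy as np

def perceptron_epoch(X, T, W, phi, eta=1.0):
    """One pass of the perceptron rule in its delta-rule form:
    w_kj <- w_kj - eta * (y_k(x_n; w) - t_kn) * phi_j(x_n).
    With binary threshold outputs this adds phi(x_n) to the weights when a
    unit wrongly outputs 0 and subtracts it when a unit wrongly outputs 1."""
    for x_n, t_n in zip(X, T):
        f = phi(x_n)                      # fixed features phi(x_n)
        y = (W @ f > 0).astype(float)     # thresholded outputs y_k(x_n; w)
        W -= eta * np.outer(y - t_n, f)   # no change where y_k == t_kn
    return W

# Toy usage: two output units with one-hot 0/1 targets, bias + raw inputs as features.
phi = lambda x: np.concatenate(([1.0], x))
X = np.array([[0.0, 0.2], [0.1, 0.9], [0.9, 0.1], [1.0, 1.0]])
T = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
W = np.zeros((2, 3))
for _ in range(20):
    W = perceptron_epoch(X, T, W, phi)
print([int(np.argmax(W @ phi(x))) for x in X])  # should converge to [0, 0, 1, 1]
```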

Multi-Layer Perceptrons
Adding more layers: input layer, a learned mapping to a hidden layer, and an output layer, with activation functions g^{(k)}. For example: g^{(1)}(a) = a, g^{(2)}(a) = \sigma(a). The hidden layer can have an arbitrary number of nodes, and there can also be multiple hidden layers. (Slide adapted from Stefan Roth)

Multi-Layer Perceptrons
Universal approximators: a 2-layer network (1 hidden layer) can approximate any continuous function of a compact domain arbitrarily well, assuming a sufficient number of hidden nodes. (Slide credit: Stefan Roth)

Learning with Hidden Units
Networks without hidden units are very limited in what they can learn. More layers of linear units do not help: the result is still linear. Fixed output non-linearities are not enough; we need multiple layers of adaptive non-linear hidden units. But how can we train such nets? We need an efficient way of adapting all weights, not just the last layer. Learning the weights to the hidden units = learning features. This is difficult, because nobody tells us what the hidden units should do. This is the main challenge in deep learning. (Slide adapted from Geoff Hinton)

Learning with Hidden Units
How can we train multi-layer networks efficiently? We need an efficient way of adapting all weights, not just the last layer. Idea: Gradient Descent. Set up an error function with a loss L(.) and a regularizer \Omega(.), e.g., an L2 loss with an L2 regularizer ("weight decay"), and update each weight in the direction of the gradient. (A minimal code sketch of such a network and error function follows at the end of this page.)

Gradient Descent
Two main steps:
1. Computing the gradients for each weight (today)
2. Adjusting the weights in the direction of the gradient (next lecture)

Topics of This Lecture
Perceptrons: definition, loss functions, regularization
Multi-Layer Perceptrons: definition, learning with hidden units
Obtaining the Gradients: naive analytical differentiation, numerical differentiation, backpropagation
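A minimal sketch of the two-layer MLP and error function just described: a learned hidden mapping, an output non-linearity, and an L2 loss plus an L2 weight-decay regularizer. Layer sizes, the tanh hidden activation (the slide's example uses g^(1)(a) = a and g^(2)(a) = sigma(a)), and all variable names are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, W1, W2, g1=np.tanh, g2=sigmoid):
    """Two-layer MLP (one hidden layer); the hidden mapping is learned.
    z1 = W1 x,  y1 = g1(z1)   (hidden layer)
    z2 = W2 y1, y2 = g2(z2)   (output layer)"""
    y1 = g1(W1 @ x)
    y2 = g2(W2 @ y1)
    return y1, y2

def error(X, T, W1, W2, lam=1e-3):
    """L2 loss plus an L2 regularizer ('weight decay'):
    E(W) = sum_n ||y(x_n; W) - t_n||^2 + lam * (||W1||^2 + ||W2||^2)."""
    loss = sum(np.sum((mlp_forward(x, W1, W2)[1] - t) ** 2) for x, t in zip(X, T))
    return loss + lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))

# Toy usage: 2 inputs -> 5 hidden units -> 1 output.
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(5, 2))
W2 = rng.normal(scale=0.5, size=(1, 5))
X = rng.normal(size=(10, 2))
T = (X[:, 0] * X[:, 1] > 0).astype(float).reshape(-1, 1)
print(error(X, T, W1, W2))
```

Gradient descent then needs the derivatives of this error with respect to W1 and W2, which is exactly the topic of the following slides.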

Obtaining the Gradients
Approach 1: Naive Analytical Differentiation. Compute the gradients for each variable analytically. What is the problem when doing this?

Excursion: Chain Rule of Differentiation
One-dimensional case: scalar functions.

Excursion: Chain Rule of Differentiation
Multi-dimensional case: total derivative. We need to sum over all paths that lead to the target variable x.

Obtaining the Gradients
Approach 1: Naive Analytical Differentiation. Compute the gradients for each variable analytically. The problem: with increasing depth, there will be exponentially many paths! It is infeasible to compute the gradients this way.

Topics of This Lecture
Perceptrons: definition, loss functions, regularization
Multi-Layer Perceptrons: definition, learning with hidden units
Obtaining the Gradients: naive analytical differentiation, numerical differentiation, backpropagation

Obtaining the Gradients
Approach 2: Numerical Differentiation. Given the current state W^{(\tau)}, we can evaluate E(W^{(\tau)}). Idea: make small changes to W^{(\tau)} and accept those that improve E(W^{(\tau)}). This is horribly inefficient! It needs several forward passes for each weight, and each forward pass is one run over the entire dataset. (A finite-difference sketch of this idea follows below.)
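Approach 2 can be written down directly as central finite differences: perturb each weight by plus or minus a small epsilon, re-evaluate the error, and divide. This needs two complete passes over the data per weight, which is why it is hopeless for training, though it remains useful as a sanity check for analytically computed gradients. The function name and step size below are assumptions.

```python
import numpy as np

def numerical_gradient(error_fn, W, eps=1e-6):
    """Estimate dE/dW by central differences: for every single weight,
    perturb it by +/- eps and re-evaluate the full error function.
    Cost: 2 * (number of weights) complete forward passes over the data."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        orig = W[idx]
        W[idx] = orig + eps
        e_plus = error_fn(W)
        W[idx] = orig - eps
        e_minus = error_fn(W)
        W[idx] = orig                           # restore the weight
        grad[idx] = (e_plus - e_minus) / (2 * eps)
    return grad

# Toy usage: quadratic "error" E(W) = ||W||^2, whose true gradient is 2W.
W = np.array([[1.0, -2.0], [0.5, 3.0]])
print(numerical_gradient(lambda W: np.sum(W ** 2), W))   # approx. 2 * W
```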

Obtaining the Gradients
Approach 3: Incremental Analytical Differentiation. Idea: compute the gradients layer by layer. Each layer below builds upon the results of the layer above; the gradient is propagated backwards through the layers. This is the backpropagation algorithm.

Topics of This Lecture
Perceptrons: definition, loss functions, regularization
Multi-Layer Perceptrons: definition, learning with hidden units
Obtaining the Gradients: naive analytical differentiation, numerical differentiation, backpropagation

Backpropagation Algorithm
Core steps:
1. Convert the discrepancy between each output and its target value into an error derivative.
2. Compute error derivatives in each hidden layer from error derivatives in the layer above.
3. Use error derivatives w.r.t. activities to get error derivatives w.r.t. the incoming weights.
(Slide adapted from Geoff Hinton)

Backpropagation Algorithm
Notation: y_j^{(k)} is the output of layer k, z_j^{(k)} is the input of layer k. Connections:
z_j = \sum_i w_{ji} \, y_i,    y_j = g(z_j)
The error derivatives are then propagated backwards as follows:
\frac{\partial E}{\partial z_j} = \frac{\partial y_j}{\partial z_j} \frac{\partial E}{\partial y_j} = g'(z_j) \frac{\partial E}{\partial y_j}
\frac{\partial E}{\partial y_i} = \sum_j \frac{\partial z_j}{\partial y_i} \frac{\partial E}{\partial z_j} = \sum_j w_{ji} \frac{\partial E}{\partial z_j}
\frac{\partial E}{\partial w_{ji}} = \frac{\partial z_j}{\partial w_{ji}} \frac{\partial E}{\partial z_j} = y_i \frac{\partial E}{\partial z_j}
(Slide adapted from Geoff Hinton)
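The three derivative equations above translate almost one-to-one into code. The sketch below runs a forward pass that stores z^(k) and y^(k), then propagates dE/dy, dE/dz, and dE/dw backwards through a stack of fully connected layers for a single training example and an L2 loss. The sigmoid activation, layer sizes, and names are assumptions made for this example, not prescribed by the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights):
    """Forward pass: store z^(k) and y^(k) for every layer (needed for backprop)."""
    ys, zs = [x], []
    for W in weights:
        z = W @ ys[-1]          # z_j = sum_i w_ji * y_i
        zs.append(z)
        ys.append(sigmoid(z))   # y_j = g(z_j)
    return ys, zs

def backward(ys, zs, weights, t):
    """Backward pass for an L2 loss E = 1/2 * ||y^(L) - t||^2."""
    grads = [None] * len(weights)
    dE_dy = ys[-1] - t                             # dE/dy at the output layer
    for k in reversed(range(len(weights))):
        g_prime = sigmoid(zs[k]) * (1 - sigmoid(zs[k]))
        dE_dz = g_prime * dE_dy                    # dE/dz_j = g'(z_j) * dE/dy_j
        grads[k] = np.outer(dE_dz, ys[k])          # dE/dw_ji = y_i * dE/dz_j
        dE_dy = weights[k].T @ dE_dz               # dE/dy_i = sum_j w_ji * dE/dz_j
    return grads

# Toy usage: 3 -> 4 -> 2 network, one training example.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(4, 3)), rng.normal(size=(2, 4))]
x, t = rng.normal(size=3), np.array([0.0, 1.0])
ys, zs = forward(x, weights)
grads = backward(ys, zs, weights, t)
print([g.shape for g in grads])   # [(4, 3), (2, 4)]: one gradient per weight matrix
```

The gradients returned here could be checked against the finite-difference sketch from the numerical-differentiation slide.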

Backpropagation Algorithm
Efficient propagation scheme: y_i is already known from the forward pass (dynamic programming). Propagate the gradient back from layer k and multiply with y_i. (Slide adapted from Geoff Hinton)

Summary: MLP Backpropagation
Forward Pass: for k = 1, ..., l, compute the activations of layer k from those of layer k-1.
Backward Pass: for k = l, l-1, ..., 1, propagate the error derivatives back through layer k and accumulate the weight gradients.
Notes: For efficiency, an entire batch of data X is processed at once. In the batched formulation, the symbol \odot denotes the element-wise product.

Analysis: Backpropagation
Backpropagation is the key to making deep NNs tractable. However, the Backprop algorithm given here is specific to MLPs: it does not work with more complex architectures, e.g. skip connections or recurrent networks! Whenever a new connection function induces a different functional form of the chain rule, you have to derive a new Backprop algorithm for it. Tedious... Let's analyze Backprop in more detail; this will lead us to a more flexible algorithm formulation in the next lecture.

References and Further Reading
More information on Neural Networks can be found in Chapters 6 and 7 of the Goodfellow & Bengio book:
I. Goodfellow, Y. Bengio, A. Courville, Deep Learning, MIT Press, 2016.
