Machine Learning. Ilya Narsky, Caltech


1 Machine Learning. Ilya Narsky, Caltech

2 Lecture 4. Multi-class problems. Multi-class versions of Neural Networks, Decision Trees, Support Vector Machines and AdaBoost. Reduction of a multi-class problem to a set of binary problems.

3 Multi-class problems. Various ad-hoc strategies can be used to reduce a multi-class problem to a set of two-class problems: One against One, One against All. This approach only works if one class clearly dominates. Not always the case. Example: the one-vs-one strategy for 3 classes, where a red arrow stands for "more likely than". [figure: pairwise "more likely than" relations among classes 1, 2 and 3] A unified framework for reducing multi-class problems to binary: Allwein, Schapire and Singer, J. of Machine Learning Research 1 (2000). Alternatively, use a multi-class version of your favorite classifier.
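A minimal Python sketch of the one-against-all strategy, assuming a hypothetical make_learner factory that returns any binary classifier with fit(X, t) and score(X) methods:

```python
import numpy as np

def one_vs_all_train(X, y, classes, make_learner):
    """Train one binary classifier per class: class k against everything else."""
    learners = {}
    for k in classes:
        t = np.where(y == k, 1.0, -1.0)  # relabel events: class k vs the rest
        clf = make_learner()             # hypothetical binary-learner factory
        clf.fit(X, t)
        learners[k] = clf
    return learners

def one_vs_all_classify(X, learners):
    """Assign each event to the class whose binary classifier scores highest."""
    keys = sorted(learners)
    scores = np.column_stack([learners[k].score(X) for k in keys])
    return [keys[i] for i in np.argmax(scores, axis=1)]
```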

4 Multi-class Neural Network. Extension to the multi-class case comes naturally: N(nodes in output layer) = n(classes), but we need only one node for n = 2. The k-th class is encoded as y_k = (0,...,0,1,0,...,0) (1 in the k-th position). The output of a neural net is a vector {f_k(x)}, k = 1,...,n, with 0 <= f_k(x) <= 1. An event is classified to category k if f_k(x) is largest.
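In code, the encoding and the classification rule amount to one-hot targets and an arg-max over the network outputs (a minimal numpy sketch):

```python
import numpy as np

def encode_one_hot(labels, n_classes):
    """k-th class -> unit vector with 1 in the k-th position (0-based labels)."""
    Y = np.zeros((len(labels), n_classes))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y

def classify(net_outputs):
    """net_outputs: (n_events, n_classes) array of f_k(x) values in [0, 1].
    Each event goes to the category with the largest output."""
    return np.argmax(net_outputs, axis=1)

print(encode_one_hot([0, 2], 3))  # [[1,0,0], [0,0,1]]
```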

5 Decision trees. n = 2: split to maximize the decrease in the Gini index 1 - p^2 - q^2, or in the cross-entropy -p*log(p) - q*log(q), with p + q = 1. Any n: Gini index 1 - sum_k p_k^2, or cross-entropy -sum_k p_k*log(p_k), subject to sum_k p_k = 1. Example with 3 classes: a node with A, B and C events per class splits into daughters with A_1, B_1, C_1 and A_2, B_2, C_2 events; the class fractions are p_A = A/(A + B + C), and likewise for B and C. Note that this has nothing to do with binary vs multi-way splits for tree construction. One can use binary splits for multi-class problems and multi-way splits for binary problems. Are multi-way splits a good idea? Depends on who you ask.
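A minimal Python sketch of the two impurity measures for one node's class counts:

```python
import numpy as np

def gini(counts):
    """Gini index 1 - sum_k p_k^2 for the class counts in one tree node."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return 1.0 - np.sum(p ** 2)

def cross_entropy(counts):
    """Cross-entropy -sum_k p_k log p_k; zero-count classes contribute nothing."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Example: a node with A = 50, B = 30, C = 20 events
print(gini([50, 30, 20]))           # 0.62
print(cross_entropy([50, 30, 20]))  # ~1.03
```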

6 Penalize some misclassifications more than others. L_ij = penalty for misclassifying class i as class j; L_kk = 0. Regular Gini index: 1 - sum_k p_k^2 = sum_{i != j} p_i p_j. Modified Gini index: sum_{i,j} L_ij p_i p_j. In HEP analysis, we don't play this game. (Although it would be interesting to try one day!) But unequal misclassification penalties are often used in the statistics literature.
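A sketch of the modified criterion; the penalty matrix L_harsh below is hypothetical, chosen only to show that equal off-diagonal penalties recover the regular Gini index:

```python
import numpy as np

def modified_gini(counts, L):
    """sum_{i,j} L_ij p_i p_j; with L_kk = 0 and all off-diagonal L_ij = 1
    this reduces to the regular Gini index 1 - sum_k p_k^2."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    return float(p @ L @ p)

L_equal = np.ones((3, 3)) - np.eye(3)   # all misclassifications cost the same
L_harsh = np.array([[0., 1., 5.],       # hypothetical: class 0 <-> class 2
                    [1., 0., 1.],       # confusion is penalized 5x
                    [5., 1., 0.]])
print(modified_gini([50, 30, 20], L_equal))  # 0.62, the regular Gini index
print(modified_gini([50, 30, 20], L_harsh))  # larger: heavy 0<->2 penalty
```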

7 Support Vector Machines. n = 2: f(x) = w_0 + w^T h(x), with y = +1 for signal and y = -1 for background; minimize over w: sum_{n=1}^N [1 - y_n f(x_n)]_+ + lambda*||w||^2. Lee, Lin and Wahba, Multicategory Support Vector Machines, J. of the American Statistical Association 99 (2004). Any n: the k-th class is encoded as y = (0,...,0,1,0,...,0) (1 in the k-th position); f(x) = {f_k(x)} with f_k(x) = w_k0 + w_k^T h_k(x). L(y) is a penalty function: L(y) = (1,...,1,0,1,...,1) (0 in the k-th position) if y belongs to class k. Minimize: sum_{n=1}^N L(y_n) . [f(x_n) - y_n]_+ + lambda * sum_k ||h_k||^2.

8 Multi-class SVM (continued). Minimize over h: sum_{n=1}^N L(y_n) . [f(x_n) - y_n]_+ + lambda * sum_k ||h_k||^2. If y_n is from class k: f_k(x_n) can be large, because the k-th component of L(y_n) is 0 and so does not contribute to the minimized loss; f_i(x_n) for i != k must be as small as possible, otherwise f_i(x_n) > y_n^(i) and the loss increases. One cannot make f_k(x_n) arbitrarily large and f_i(x_n) arbitrarily small because of the penalty term, which enforces smoothness of the solution. After training is completed, an event is classified to category k if f_k(x) is largest among {f_k(x)}.
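A minimal numpy sketch that evaluates this empirical loss for a given score matrix; the penalty argument stands in for sum_k ||h_k||^2, which depends on the function class and is supplied by the caller:

```python
import numpy as np

def multiclass_svm_loss(F, labels, lam, penalty):
    """Empirical loss from the slide: sum_n L(y_n) . [F_n - y_n]_+  + lam * penalty.

    F      : (N, n_classes) matrix of scores f_k(x_n)
    labels : true class index for each event (0-based)
    penalty: stand-in number for sum_k ||h_k||^2
    """
    N, K = F.shape
    Y = np.zeros((N, K))
    Y[np.arange(N), labels] = 1.0   # target: 1 in the true-class slot
    L = 1.0 - Y                     # penalty vector: 0 in the true-class slot
    hinge = np.maximum(F - Y, 0.0)  # [f(x) - y]_+ componentwise
    return float(np.sum(L * hinge)) + lam * penalty
```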

9 AdaBoost (still 2 classes only). Freund and Schapire, A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting, J. of Computer and System Sciences 55 (1997). Misclassification error: eps = (1/N) sum_n I(f(x_n) != y_n), where I(z) = 1 if z is true and 0 otherwise. Iteration m: build classifier f_m(x) with weighted error eps_m = sum_n w_n^(m) I(f_m(x_n) != y_n) < 0.5, the weight of misclassified events. Correctly classified events: w_n^(m+1) = w_n^(m) / (2(1 - eps_m)); misclassified events: w_n^(m+1) = w_n^(m) / (2 eps_m); classifier weight: beta_m = log((1 - eps_m)/eps_m). What do we need to change for multiple classes? Nothing! It is the same algorithm. Except: for 2 classes, one can always build a classifier with eps < 1/2 (if you have a classifier with eps > 1/2, you can flip its output); for multiple classes it is not always possible.
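The weight update and the classifier weight in code, a direct transcription of the formulas above:

```python
import numpy as np

def adaboost_reweight(w, correct, eps):
    """One boosting iteration: scale correctly classified events by
    1/(2(1 - eps)) and misclassified ones by 1/(2 eps). If eps was the
    weighted error, the new weights again sum to one."""
    return np.where(correct, w / (2.0 * (1.0 - eps)), w / (2.0 * eps))

def classifier_weight(eps):
    """beta_m = log((1 - eps_m)/eps_m): positive whenever eps < 1/2."""
    return np.log((1.0 - eps) / eps)
```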

10 AdaBoost.M1 (multiple classes). Solution: train until eps_m >= 1/2 and then abort. This algorithm is known as AdaBoost.M1. OK, this is a solution. But can we come up with something better? (see next slide) What is the output of multi-class AdaBoost? For 2 classes: F(x) = sum_{m=1}^M beta_m f_m(x). For each class k: F^(k)(x) = sum_{m=1}^M beta_m I(f_m(x) = y^(k)). In plain English: compute the average inverse misclassification error over the built classifiers which predict class k for this point, and then choose the class which maximizes this quantity.
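A sketch of this voting rule, assuming each trained classifier is a callable that returns a class index:

```python
import numpy as np

def adaboost_m1_predict(x, classifiers, betas, n_classes):
    """Multi-class AdaBoost.M1 output: each trained classifier casts a vote
    of weight beta_m for the class it predicts; return the arg-max class."""
    votes = np.zeros(n_classes)
    for f, beta in zip(classifiers, betas):
        votes[f(x)] += beta
    return int(np.argmax(votes))
```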

11 AdaBoost.MH: soft score vs hard score. Suppose we built a decision tree for 3 classes: A, B and C. In one of the terminal nodes we have N_A, N_B and N_C training events. Suppose N_A > N_B > N_C. The contribution to the overall classification error from this node is eps = (N_B + N_C)/N (N is the overall size of the training sample). Instead of returning a hard classification label (A, B or C), we can return a soft score, for example f_A = N_A/(N_A + N_B + N_C); same for f_B and f_C. In general, we can define a function -1 < h(x,y) < 1 which represents confidence that the true class label at x is y. What are the advantages of the soft score against the hard score? It can be shown that for any dataset the classification error defined through the soft score is guaranteed to be <= 1/2. It is continuous. It more accurately reflects the amount of misclassification.
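Turning terminal-node counts into soft scores is one line per class (Python sketch):

```python
def leaf_soft_scores(counts):
    """Soft scores from terminal-node class counts, e.g. f_A = N_A/(N_A + N_B + N_C)."""
    total = float(sum(counts.values()))
    return {label: n / total for label, n in counts.items()}

print(leaf_soft_scores({"A": 7, "B": 2, "C": 1}))  # {'A': 0.7, 'B': 0.2, 'C': 0.1}
```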

12 AdaBoost.MH: yet another trick. In the two-class version of AdaBoost, we had one weight per event, a measure of how often the event was assigned to the wrong category by the weak learners. But now we have many wrong categories: w(x_1, A), w(x_1, B), w(x_1, C), ..., w(x_3, C). Example: for event 1 drawn from class A it is easy to discriminate class A from class B, but not so easy to discriminate class A from class C. Solution: introduce one weight for each class for each event: ((x_1, A), +1), ((x_1, B), -1), ((x_1, C), -1); ((x_2, A), -1), ((x_2, B), +1), ((x_2, C), -1); ((x_3, A), -1), ((x_3, B), -1), ((x_3, C), +1). We have transformed a problem with 3 events, 3 classes and D input variables into a problem with 9 events, 2 classes and D+1 input variables. Now build a weak learner and compute h(x,y), y = A, B, C, for each training x using, for example, decision tree leaf purities.
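A sketch of this relabeling: each event expands into one binary example per candidate class, with the candidate class playing the role of the extra (D+1)-th input variable:

```python
def to_binary_problem(events, classes):
    """AdaBoost.MH relabeling: each (x, true_class) event becomes one
    (x, candidate_class) pair per class, with binary target +1 if the
    candidate matches the true class and -1 otherwise."""
    expanded = []
    for x, true_class in events:
        for c in classes:
            expanded.append(((x, c), +1 if c == true_class else -1))
    return expanded

events = [((0.3, 1.2), "A"), ((0.9, 0.1), "B")]
print(to_binary_problem(events, ["A", "B", "C"]))  # 6 binary examples
```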

13 AdaBoost.MH: full algorithm. Schapire and Singer, Improved Boosting Algorithms Using Confidence-rated Predictions, Machine Learning 37 (1999). Notation: N training points, n classes; y^(k) is the class label for class k. Replace each (x_n, y_n) with n points (x_n, k, J(y_n, y^(k))), where J(y_n, y^(k)) = +1 if y_n = y^(k) and -1 otherwise. Iteration 1: w_1(n, k) = 1/(nN); n = 1,...,N; k = 1,...,n. Iteration m: build a classifier with scoring function h_m(x, y); update w_{m+1}(n, k) = w_m(n, k) exp[-beta_m J(y_n, y^(k)) h_m(x_n, y^(k))] / Z_m. AdaBoost response: f^(k)(x) = sum_{m=1}^M beta_m h_m(x, y^(k)); classify x to the k with the largest f^(k)(x). Friedman, Hastie and Tibshirani, Additive logistic regression: A statistical view of boosting, The Annals of Statistics 28 (2000): the AdaBoost.MH algorithm for an n-class problem can be effectively reduced to solving n problems, each class against the rest.
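A numpy sketch of the weight update as reconstructed above; W, J and H are N x n arrays of current weights, +/-1 sign labels and weak-learner confidences:

```python
import numpy as np

def mh_reweight(W, J, H, beta):
    """AdaBoost.MH update: w_{m+1}(n,k) = w_m(n,k) * exp(-beta * J(n,k) * H(n,k)) / Z_m,
    where Z_m normalizes the weights so they sum to one."""
    W_new = W * np.exp(-beta * J * H)
    return W_new / W_new.sum()
```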

14 What have we learned? Many binary classifiers have multi-class versions. For some classifiers (neural net, decision tree), the generalization from binary to multi-class is obvious. For others (SVM, AdaBoost), less straightforward but doable. Multi-class algorithms are the subject of ongoing research. If you switch from grad school in physics to grad school in machine learning (not that I encourage you to), you may very well invent one. But let us look at this through the eyes of a practitioner. Even if you know how to do a multi-class version of your favorite algorithm, do you have means to do so? Do you have a piece of software? Is it easy to write one? What can you do if the answer to both questions above is no?

15 Reduce a multi-class problem to a set of two-class problems using an interaction matrix: an n x C matrix for n classes and C binary classifiers. Allwein, Schapire and Singer, Reducing Multiclass to Binary: A Unifying Approach to Margin Classifiers, J. of Machine Learning Research 1 (2000). For example, for a problem with 4 classes: M_ONE-VS-ALL is a 4 x 4 matrix with +1 on the diagonal and -1 elsewhere (C = n classifiers), and M_ONE-VS-ONE is a 4 x 6 matrix in which each column has +1 for one class of a pair, -1 for the other, and 0 for the classes not in the pair (C = n(n-1)/2 classifiers).
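The two standard matrices are easy to construct (Python sketch; entries follow the Allwein-Schapire-Singer convention of +1, -1 and 0):

```python
import numpy as np
from itertools import combinations

def one_vs_all_matrix(n):
    """n x n interaction matrix: +1 on the diagonal, -1 elsewhere."""
    return 2.0 * np.eye(n) - 1.0

def one_vs_one_matrix(n):
    """n x (n(n-1)/2) matrix: each column trains one class (+1) against
    another (-1); classes not in the pair get 0 and are ignored."""
    pairs = list(combinations(range(n), 2))
    M = np.zeros((n, len(pairs)))
    for c, (i, j) in enumerate(pairs):
        M[i, c], M[j, c] = 1.0, -1.0
    return M

print(one_vs_all_matrix(4))  # 4 columns: C = n classifiers
print(one_vs_one_matrix(4))  # 6 columns: C = n(n-1)/2 classifiers
```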

16 One-vs-one, one-vs-all: what else? One can implement any non-standard strategy. For example: two signals with similar signatures, plus background. We want to separate both signals from the background and then separate them from each other. [figure: a 3 x 2 interaction matrix M with rows for the classes (background, signal 1, signal 2) and one column per binary classifier; the entries are not recoverable from the transcription]

17 How do you classify new events? For each event, you get C numbers, one for each column of the interaction matrix. Compute a user-defined loss (quadratic, exponential, etc.) for each row of the interaction matrix. For example, compute the average quadratic error E_k = (1/C) sum_{c=1}^C (M_kc - f_c(x))^2, k = 1,...,n, and assign event x to the class k which gives the minimal quadratic error E_k.
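The quadratic-error decoding in code; f is the vector of C classifier outputs for one event, M the n x C interaction matrix:

```python
import numpy as np

def classify_by_quadratic_error(f, M):
    """Compute E_k = (1/C) sum_c (M_kc - f_c)^2 for every row (class) of the
    interaction matrix and return the class with the smallest error."""
    E = np.mean((M - f) ** 2, axis=1)
    return int(np.argmin(E))
```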

18 What about predictive power? It is not clear if true multi-class algorithms offer any advantage over multiclass-through-binary methods. Example: multiclass SVM by Lee, Lin and Wahba. [figure: performance comparison of the multiclass SVM, one-vs-rest and nearest-neighbor classifiers]

19 Lecture 5. Variable selection. Genetic algorithms. StatPatternRecognition.
