CS-621 Theory Gems                                                September 26, 2012

Lecture 3

Lecturer: Aleksander Mądry                Scribes: Yann Barbotin and Vlad Ureche

1 Introduction

The main topic of this lecture is another classical algorithm for finding linear classifiers: the Winnow algorithm. We also briefly discuss Support Vector Machines (SVMs).

2 The Winnow Algorithm

Recall the problem of linear classification we discussed in the last lecture. In this problem, we are given a collection of $m$ labeled points $\{(x^j, l^j)\}_j$ with each $x^j \in \mathbb{R}^n$. Our goal is to find a vector (linear classifier) $(w, \theta) \in \mathbb{R}^{n+1}$ such that, for all $j$,

$$\mathrm{sign}(w \cdot x^j - \theta) = l^j.$$

That is, we want the hyperplane corresponding to $(w, \theta)$ to separate the positive examples from the negative ones. As we already argued previously, wlog we can constrain ourselves to the case when $\theta = 0$ (i.e., the hyperplane passes through the origin) and there are only positive examples (i.e., $l^j = 1$ for all $j$).

Last time we presented a simple algorithm for this problem called the Perceptron algorithm. Today, we will see a different (and also simple) algorithm for the same task: the Winnow algorithm.

2.1 The Algorithm

The Winnow algorithm can be seen as a direct application of the multiplicative weights update paradigm that we discussed in Lecture 1. To make this connection precise, instead of providing an explicit description of the algorithm, we will cast it as an application of the Multiplicative Weights Update (MWU) algorithm to an appropriately tailored execution of the (general) learning-from-experts framework (cf. Section 5 in the notes from Lecture 1). As we will see later in the course, there are many, quite diverse, algorithms that can be cast in this manner; this is one reason why the multiplicative weights update paradigm has such a wide array of applications.

Similarly to the case of the analysis of the Perceptron algorithm, we will wlog assume that our data is normalized. However, this time we will normalize it in the $\ell_\infty$ norm instead of $\ell_2$. So, from now on, $\|x^j\|_\infty = \max_i |x^j_i| = 1$ for all $j$.

Winnow algorithm: Consider the following execution of the learning-from-expert-advice framework, interacting with the MWU algorithm (as described in Section 5 in the notes from Lecture 1) for $\rho = 1$ and some $0 < \varepsilon \leq \frac{1}{2}$. We have $n$ experts, one for each coordinate (feature). In each round $t$:

- Let $(p_1^t, \ldots, p_n^t)$ be the convex combination supplied by the MWU algorithm;
- Set $w^t$ to be the linear classifier given by the $p_i^t$s, i.e., $w^t := (p_1^t, \ldots, p_n^t)$;
- Check if there exists $x^{j_t}$ such that $w^t \cdot x^{j_t} \leq 0$ (i.e., $x^{j_t}$ is misclassified);
- If no such $j_t$ exists, then we just stop the algorithm and output the classifier $w^t$ (as it classifies all the data correctly);
- Otherwise, we set the loss vector to be $l_i^t := -x_i^{j_t}$ for each $i$, and proceed to the next round.
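To make the execution above concrete, here is a minimal Python sketch of the loop (not part of the original notes). It instantiates the MWU step with one standard update rule, $w_i \leftarrow w_i(1 - \varepsilon l_i^t)$; the precise rule from Lecture 1 may differ slightly, and all function and variable names here are our own.

```python
import numpy as np

def winnow(X, eps, max_rounds):
    """Winnow via MWU: a sketch, assuming the update w_i <- w_i * (1 - eps * l_i).

    X          -- (m, n) array of positive examples with max_i |x^j_i| = 1
    eps        -- MWU step size, 0 < eps <= 1/2
    max_rounds -- bound on the number of rounds to run
    Returns a correct classifier w, or None if none was found in time.
    """
    n = X.shape[1]
    weights = np.ones(n)                  # one expert per coordinate/feature
    for _ in range(max_rounds):
        w = weights / weights.sum()       # convex combination (p_1^t, ..., p_n^t)
        margins = X @ w
        bad = np.flatnonzero(margins <= 0)
        if bad.size == 0:                 # every point classified correctly:
            return w                      # stop and output w^t
        j = bad[0]                        # some misclassified point x^{j_t}
        loss = -X[j]                      # loss vector l_i^t = -x_i^{j_t}
        weights *= 1.0 - eps * loss       # multiplicative update (rho = 1)
    return None
```

Note that the update multiplies the weight of feature $i$ by $1 + \varepsilon x_i^{j_t}$, so features that agree with the misclassified positive example get promoted; this is exactly the classical description of Winnow.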

Clearly, given that our stopping condition ensures correctness of the output classifier, our only task is to bound the total number of executed rounds (provided an appropriate linear classifier exists).

Theorem 1. Let $w^*$ be a correct linear classifier such that $\|w^*\|_1 = 1$ and $w^*_i \geq 0$ for all $i$, and let $\gamma_W = \min_j x^j \cdot w^*$. If $\frac{\gamma_W}{4} \leq \varepsilon \leq \frac{\gamma_W}{2}$, then the number $T$ of iterations of the Winnow algorithm is at most

$$T \leq \frac{8 \ln n}{\gamma_W^2}.$$

Before we prove this theorem, let us discuss the assumptions that it makes, as well as provide some intuition underlying the algorithm.

As we already discussed in the last lecture, we can always assume that a correct linear classifier is normalized. On the other hand, to make sure that all coordinates of $w^*$ are non-negative, we can just apply to each $x^j$ the transformation $x^j \mapsto (x^j, -x^j)$. This will double the number of our features, but now, as it is not hard to see, if there existed a correct linear classifier for the original data then there is one for the transformed data that has all its coordinates non-negative and achieves the same margin. So, the assumptions made on $w^*$ do not really constrain the applicability of the theorem.

To understand the idea behind the algorithm and its analysis, note that we set up the loss vectors in such a way that, in each round $t$, the MWU algorithm suffers a loss proportional to the extent to which it misclassifies the point $x^{j_t}$. So, in particular, this loss is always non-negative.

On the other hand, the existence of the correct classifier $w^*$ is a certificate that there exists a convex combination of experts that never suffers a loss larger than $-\gamma_W$ (and this quantity is strictly negative). Therefore, by the asymptotically optimal loss guarantee of the MWU algorithm, we know that such poor performance of the MWU algorithm cannot last for too long. (In particular, the convex combination $(p_1^t, \ldots, p_n^t)$ that it maintains will eventually shift its mass towards the features/experts that are most important for correct classification.)

Finally, note that the theorem requires that the MWU algorithm is used with the value of $\varepsilon$ being within a factor of two of the value of $\frac{\gamma_W}{2}$. This might be problematic, as we usually do not know how large $\gamma_W$ is. There is, however, a simple technique that allows us to circumvent this problem.

In this technique, we start with $\varepsilon = \frac{1}{2}$ (which is an obvious upper bound on the value of $\frac{\gamma_W}{2}$) and run the Winnow algorithm for $\frac{8 \ln n}{\varepsilon^2}$ rounds. If the algorithm terminates by that time, we get a correct classifier and are done. Otherwise, by Theorem 1, we know that it must have been the case that our value of $\varepsilon$ was too large (i.e., $\frac{\gamma_W}{2} < \varepsilon$). Thus we halve the value of $\varepsilon$ and repeat the procedure (cf. Figure 1; a short sketch of this wrapper appears below). Clearly, this procedure will finish once $\varepsilon$ becomes less than $\frac{\gamma_W}{2}$ (or even earlier), and it is not hard to see that the overall number of iterations performed is at most twice as large as the one promised by Theorem 1 in case of $\varepsilon$ being set correctly.

Figure 1: Iterative margin estimation using Theorem 1. (The flowchart: start with $\varepsilon = \frac{1}{2}$; run MWU for $\frac{8 \ln n}{\varepsilon^2}$ iterations; if not converged, set $\varepsilon \leftarrow \frac{\varepsilon}{2}$ and repeat; once converged, output the correct classifier $w$.)
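The halving procedure is simple enough to state in a few lines. The following sketch (our own naming, building on the `winnow` sketch from Section 2.1) assumes the data is indeed linearly separable; otherwise the loop never terminates.

```python
import numpy as np

def winnow_unknown_margin(X):
    """Iterative margin estimation of Figure 1 (a sketch).

    Start with eps = 1/2 and halve eps whenever a budget of
    8 ln(n) / eps^2 rounds is exhausted without convergence.
    """
    n = X.shape[1]
    eps = 0.5                            # obvious upper bound on gamma_W / 2
    while True:
        budget = int(np.ceil(8 * np.log(n) / eps ** 2))
        w = winnow(X, eps, budget)       # the sketch from Section 2.1
        if w is not None:
            return w                     # converged: correct classifier found
        eps /= 2.0                       # by Theorem 1, gamma_W / 2 < eps
```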

We now turn to the proof of the theorem.

Proof. We have that the total loss $l^{\mathrm{MWU}}$ of the MWU algorithm is

$$l^{\mathrm{MWU}} = \sum_t \sum_i p_i^t l_i^t = -\sum_t \sum_i w_i^t x_i^{j_t} = -\sum_t w^t \cdot x^{j_t} \geq 0,$$

as each $x^{j_t}$ was chosen so that the classifier $w^t$ misclassifies it. Therefore, by the loss guarantee of the MWU algorithm (cf. Theorem 6 in the notes from Lecture 1), we have for any feature/expert $i$ that

$$0 \leq l^{\mathrm{MWU}} \leq \sum_t l_i^t + \varepsilon \sum_t |l_i^t| + \frac{\ln n}{\varepsilon} \leq -\sum_t x_i^{j_t} + \varepsilon T + \frac{\ln n}{\varepsilon}, \qquad (1)$$

as $\rho = 1$ in our setting (due to the normalization of all the $x^j$s) and thus $|l_i^t| \leq 1$ for all $i$ and $t$.

Let us multiply both sides of each equation (1) corresponding to some $i$ by $w^*_i$, and add them all together. As all the $w^*_i$ are non-negative and they sum up to one, we obtain

$$0 \leq \sum_i w^*_i \left( -\sum_t x_i^{j_t} \right) + \varepsilon T + \frac{\ln n}{\varepsilon} = -\sum_t w^* \cdot x^{j_t} + \varepsilon T + \frac{\ln n}{\varepsilon}.$$

Now, by definition of $\gamma_W$, we have that $w^* \cdot x^j \geq \gamma_W$ for each $j$. Therefore, it must be the case that

$$0 \leq -\sum_t w^* \cdot x^{j_t} + \varepsilon T + \frac{\ln n}{\varepsilon} \leq -\gamma_W T + \frac{\gamma_W}{2} T + \frac{4 \ln n}{\gamma_W},$$

as $\frac{\gamma_W}{4} \leq \varepsilon \leq \frac{\gamma_W}{2}$. Rearranging the terms and dividing both sides by $\frac{\gamma_W}{2}$, we get that

$$T \leq \frac{8 \ln n}{\gamma_W^2},$$

as desired.

Finally, it is worth pointing out that the MWU algorithm is treated here in a purely black-box fashion. So, it can be replaced by any other algorithm for the learning-from-expert-advice framework, as long as this algorithm offers a loss guarantee similar to the one described by Theorem 6 from Lecture 1.

2.2 Comparison of the Perceptron and Winnow Algorithms

At this point, we have seen two different algorithms (Perceptron and Winnow) for exactly the same task. It is thus natural to wonder which one of them one should use. As is often the case, there is no clear answer to that question. The performance guarantees of both these algorithms, although they look similar, are to a large extent incomparable. Recall that if we do not normalize the data, then the performance guarantees for these algorithms that we have established are:

Algorithm      Number of iterations
Perceptron     $\frac{1}{\gamma_P^2}$, with $\gamma_P = \max_w \min_j \frac{w \cdot x^j}{\|w\|_2 \, \|x^j\|_2}$
Winnow         $\frac{8 \ln n}{\gamma_W^2}$, with $\gamma_W = \max_w \min_j \frac{w \cdot x^j}{\|w\|_1 \, \|x^j\|_\infty}$
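Although the $\max_w$ in the table is over all classifiers, both margin notions are easy to evaluate for any fixed candidate $w$, which already suffices to illustrate the comparison discussed next. The following numpy sketch (our own helper names) computes both ratios and demonstrates the effect, discussed below, of padding the data with irrelevant features.

```python
import numpy as np

def l2_l2_margin(w, X):
    # min_j (w . x^j) / (||w||_2 ||x^j||_2): the gamma_P-style ratio for this w
    return np.min(X @ w / np.linalg.norm(X, axis=1)) / np.linalg.norm(w)

def linf_l1_margin(w, X):
    # min_j (w . x^j) / (||w||_1 ||x^j||_inf): the gamma_W-style ratio for this w
    return np.min(X @ w / np.abs(X).max(axis=1)) / np.abs(w).sum()

rng = np.random.default_rng(0)
X = rng.uniform(0.5, 1.0, size=(20, 4))          # 4 relevant features
X[:, 0] = 1.0                                    # ensure ||x^j||_inf = 1
w = np.array([0.5, 0.5, 0.0, 0.0])               # sparse target, ||w||_1 = 1

pad = rng.uniform(-1.0, 1.0, size=(20, 60))      # 60 irrelevant features
X_pad = np.hstack([X, pad])
w_pad = np.concatenate([w, np.zeros(60)])

# The l_inf/l_1 ratio is unchanged: w . x^j, ||w||_1 and ||x^j||_inf stay fixed.
# The l_2/l_2 ratio shrinks, since ||x^j||_2 grows with the padding.
print(linf_l1_margin(w, X), linf_l1_margin(w_pad, X_pad))
print(l2_l2_margin(w, X), l2_l2_margin(w_pad, X_pad))
```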

So, if both $\gamma_P$ and $\gamma_W$ are similar, using Perceptron should be better. However, in general, $\gamma_P$ (often called the $\ell_2/\ell_2$-margin) and $\gamma_W$ (often referred to as the $\ell_\infty/\ell_1$-margin) can be quite different, as the corresponding norms have different characteristics; cf. Figure 2.

In particular, one advantage of the Winnow algorithm is that the $\ell_\infty/\ell_1$-margin is much less sensitive to the presence of irrelevant features in our data set. More precisely, if one imagines that features take values from some fixed interval, say $[-1, 1]$, then adding new features that are not really needed for correct classification (i.e., the optimal classifier does not put any weight on them) does not affect the $\ell_\infty/\ell_1$-margin, while it still might significantly change the $\ell_2/\ell_2$-margin.

The general consensus is that the Winnow algorithm is preferable when one expects the target classifier to be rather sparse (i.e., a lot of features are irrelevant). On the other hand, if one suspects that most of the features are relevant for correct classification (i.e., the optimal classifier is non-zero on most features), then the Perceptron algorithm should perform better, as the $\ell_2$ norm of the target should be much smaller than its $\ell_1$ norm.

Figure 2: Unit balls in, respectively, the $\ell_1$ norm ($\|x\|_1 = |x_1| + |x_2| = 1$), the $\ell_2$ norm ($\|x\|_2 = \sqrt{x_1^2 + x_2^2} = 1$), and the $\ell_\infty$ norm ($\|x\|_\infty = \max(|x_1|, |x_2|) = 1$). The intersection with the red line corresponds to the point in these balls that maximizes a linear objective. The $\ell_1$ solution has only one of its components non-zero (sparse solution), while the $\ell_2$-norm solution is full and minimizes the energy. The $\ell_\infty$ solution favors homogeneity.

3 Support Vector Machines (SVMs)

Our general strategy so far was to relate the running time of our algorithms to the quality of the margin achieved by some optimal classifier. In this way, we were able to prove that if the data can be separated with a large margin, then our algorithms return a correct classifier. However, while doing so, we provided no guarantee on the quality of the margin of the returned classifier. This is rather unfortunate, as one would expect (and, in fact, it can be shown) that classifiers with larger margins have better generalization properties. Therefore, the ability to return a classifier with a good margin (if one exists) would be a very useful property.

To some extent, both the Perceptron and Winnow algorithms can be adjusted to provide this kind of guarantee (e.g., by treating all classifications that are correct but have small margin as misclassifications). However, it is sometimes better to take a more principled approach here and just state the search for the maximum-margin classifier as an optimization problem. Such optimization problems are called Support Vector Machines (SVMs).

Formally, in the case of the $\ell_2/\ell_2$-margin, we have our data normalized in the $\ell_2$ norm and we want to solve the following problem:

$$\min \|w\|_2^2 \quad \text{s.t.} \quad w \cdot x^j \geq 1 \ \ \forall j. \qquad (2)$$

(Note that the constraint is $\geq 1$ instead of $\geq 0$ here.)
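Since (2) is just a convex quadratic program, off-the-shelf solvers handle it directly. As a concrete illustration (not from the notes), here is a sketch using the cvxpy modelling library; `hard_margin_svm` is our own name, and `X` is assumed to be the matrix of ($\ell_2$-normalized, positive) examples.

```python
import cvxpy as cp

def hard_margin_svm(X):
    """Solve program (2): min ||w||_2^2  s.t.  w . x^j >= 1 for all j."""
    n = X.shape[1]
    w = cp.Variable(n)
    problem = cp.Problem(cp.Minimize(cp.sum_squares(w)),
                         [X @ w >= 1])
    problem.solve()
    return w.value
```

By Lemma 2 below, the achieved margin can then be read off as $1/\|w^*\|_2$, e.g. `1 / np.linalg.norm(hard_margin_svm(X))` with numpy.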

As the above program is convex, we can solve it efficiently. (In fact, there are very efficient methods specialized in solving such SVMs.) One can show (see below) that indeed the optimum solution to this program gives a maximum $\ell_2/\ell_2$-margin classifier. Also, one can use duality arguments to show that, similarly to the case of the Perceptron and Winnow algorithms, this optimal classifier is a combination of some of the data points $x^j$. These points are often called support vectors (as they "support" the separating hyperplane), which explains the name of SVMs.

Lemma 2. Let $w^*$ be the optimum solution to the SVM (2). The $\ell_2/\ell_2$-margin achieved by $w^*$ is equal to $\gamma_P$, and $\|w^*\|_2^2 = 1/\gamma_P^2$.

Proof. First, we show that $w^*$ is a correct classifier with $\ell_2/\ell_2$-margin at least $\frac{1}{\|w^*\|_2}$. This follows easily by noting that, for any $j$,

$$\frac{w^* \cdot x^j}{\|w^*\|_2 \, \|x^j\|_2} \geq \frac{1}{\|w^*\|_2},$$

where we used the feasibility of $w^*$ and the fact that all the $x^j$s are normalized.

On the other hand, if $w$ is a correct classifier with $\ell_2/\ell_2$-margin $\gamma$, then $\frac{w}{\gamma \|w\|_2}$ is a feasible solution to the SVM (2). To see this, note that for any $j$,

$$\frac{w}{\gamma \|w\|_2} \cdot x^j = \frac{w \cdot x^j}{\gamma \|w\|_2} = \frac{w \cdot x^j}{\min_{j'} \frac{w \cdot x^{j'}}{\|w\|_2} \, \|w\|_2} \geq 1.$$

Also, the objective value corresponding to $\frac{w}{\gamma \|w\|_2}$ is $1/\gamma^2$. The lemma now follows by combining the two above facts.

Although at first glance very appealing, in practice, looking for the maximum-margin classifier that correctly classifies all the data might not be the best idea. This is so because real-world data is usually noisy, and just a few outliers might either render the data not linearly separable, or at least significantly reduce the margin of the best linear classifier. Therefore, instead of treating the correct classification of all data points as a hard constraint, one might prefer to allow oneself to misclassify some samples if this enables separation of the remaining data with a large margin.

This type of soft trade-off between the correctness and the quality of the margin can be very conveniently expressed in the SVM framework, by introducing a so-called slack variable $\xi_j$ for each data point. Formally, one considers the following augmented optimization problem:

$$\min \|w\|_2^2 + C \sum_j \xi_j \quad \text{s.t.} \quad w \cdot x^j \geq 1 - \xi_j \ \ \forall j, \qquad \xi_j \geq 0 \ \ \forall j.$$

Here, the parameter $C$ is used to weigh the size of the margin of the obtained classifier against its classification error, measured by the total hinge loss. Clearly, if we set $C$ to be very large, we essentially recover the setting of the SVM in (2).
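A sketch of the soft-margin program in the same cvxpy style as before (again, our own naming; `C` is the trade-off parameter from the text):

```python
import cvxpy as cp

def soft_margin_svm(X, C):
    """min ||w||_2^2 + C * sum_j xi_j  s.t.  w . x^j >= 1 - xi_j, xi_j >= 0."""
    m, n = X.shape
    w = cp.Variable(n)
    xi = cp.Variable(m)                  # one slack variable per data point
    objective = cp.Minimize(cp.sum_squares(w) + C * cp.sum(xi))
    constraints = [X @ w >= 1 - xi, xi >= 0]
    cp.Problem(objective, constraints).solve()
    return w.value, xi.value
```

For very large $C$ the slacks are driven to zero and we essentially recover (2); for small $C$ the solver may prefer to misclassify a few outliers (those with $\xi_j \geq 1$) in exchange for a larger margin on the remaining points.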
