Chapter 2

The Experts/Multiplicative Weights Algorithm and Applications

We turn to the problem of online learning, and analyze a very powerful and versatile algorithm called the multiplicative weights update algorithm. Let's motivate the algorithm by considering a few diverse example problems; we'll then introduce the algorithm and later show how it can be applied to solve each of these problems.

2.1 Some optimization problems

Machine learning: Learning a linear classifier. In machine learning, a basic but classic setting consists of a set of k labeled examples (a_1, l_1), ..., (a_k, l_k), where the a_j ∈ R^N are feature vectors and the l_j are labels in {-1, 1}. The problem is to find a linear classifier: a unit vector x ∈ R^N_+ such that ‖x‖_1 = 1 and

    \forall j \in \{1, \ldots, k\}, \quad l_j (a_j \cdot x) \geq 0

(think of x as a distribution over the coordinates of the a_j, the features, that "explains", or correlates with, the labeling indicated by the l_j).

Machine learning: Boosting. Suppose given a sequence of training points x_1, ..., x_N sampled from some universe according to some (unknown) distribution D. Each point has an (unknown) label c(x_i) ∈ {-1, 1}, where the function c is taken from some restricted set of functions (the "concept class" C). The goal is to find a hypothesis function h ∈ C that assigns labels to points, and predicts the function c in the best way possible (on average over D). For instance, in the previous example the concept class is the class of all linear classifiers (or "hyperplanes"), the distribution is uniform, and the hypothesis function is x. Call a learning algorithm strong if with probability at least 1 - δ it outputs a hypothesis h such that

    \mathrm{E}_D\big[ 1_{h(x_i) \neq c(x_i)} \big] \leq \epsilon,

and weak if the error is at most 1/2 - γ. Suppose the only thing we have is a weak learning algorithm: for any distribution D, it returns a hypothesis h such that

    \mathrm{E}_D\big[ 1_{h(x_i) \neq c(x_i)} \big] \leq \frac{1}{2} - \gamma.
The goal is to use the weak learner in order to construct a strong learner.

Approximation algorithms: Set Cover. Given a universe U = {1, ..., N} and a collection C = {C_1, ..., C_m} of subsets of U, the goal is to find the smallest possible number of sets from C that together cover all of U.

Linear programming. Given a set of N linear constraints a_i · x ≥ b_i, where a_i ∈ R^m and b_i ∈ R, the goal is to decide if there is an (approximately) feasible solution: does there exist an x ∈ R^m_+ such that a_i · x ≥ b_i - δ for all i ∈ {1, ..., N}?

Game theory: zero-sum games. Consider a game between two players, the "row" and "column" players. Each player can choose an action, i, j ∈ {1, ..., N} respectively. If the actions are (i, j), the payoff to the row player is -M(i, j) ∈ R, and the payoff to the column player is M(i, j). A player can decide to randomize and choose a distribution over actions which she will sample from. In particular, if the row player plays according to a distribution p ∈ R^N_+ over row actions and the column player plays according to a distribution q ∈ R^N_+ over column actions, then the expected payoff of the row player is -p^T M q and the expected payoff of the column player is p^T M q. The goal is to find (approximately) optimal strategies for both players: a pair (p*, q*) such that (p*)^T M q* ≥ λ* - ε, where

    \lambda^* = \max_q \min_p \, p^\top M q = \min_p \max_q \, p^\top M q.

What do these five problems have in common? For each there is a natural "greedy" approach. In a zero-sum game, we could pick a random strategy to start with, find the second player's best response, then the first player's best response to this, etc. For the linear programming problem we can again choose a random x, find a constraint that is violated, and update x. The same goes for the linear classifier problem (which can itself be written as a linear program). In the set cover problem we can pick a first set that covers the most possible elements, then a second set covering the highest possible number of uncovered elements, etc. In boosting we can tweak the distribution over points and call a weak learner to improve the performance of our current hypothesis with respect to misclassified points.

In each case it is not so easy to analyze such a greedy approach, as there is always a risk of getting stuck in a local minimum. In this lecture we're going to develop an abstract "greedy-but-cautious" algorithm that can be applied to efficiently solve each of these problems.

2.2 Online learning

Consider the following situation. There are T steps, t = 1, ..., T. At each step t:

- The player chooses an action x^(t) ∈ K ⊆ R^N.
- She suffers a loss f^(t)(x^(t)), where f^(t) : K → [-1, 1] is the loss function.
The player's goal is to devise a strategy for choosing her actions x^(1), ..., x^(T) which minimizes the regret,

    R_T = \sum_{t=1}^T f^{(t)}(x^{(t)}) - \min_{x \in K} \sum_{t=1}^T f^{(t)}(x).    (2.1)

The regret measures how much worse the player is doing compared to a player who is told all loss functions in advance, but is restricted to play the same action at each step. Since the player may be adaptive, a priori the regret could be positive or negative. In virtually all cases the set K as well as the functions f^(t) will be convex, so that, setting x̄ = (1/T)(x^(1) + ... + x^(T)), we can see that under this convexity assumption the regret is always a non-negative quantity.

Exercise 1. What if regret were defined, not with respect to the best fixed action in hindsight, but by comparison with an offline player who is allowed to switch between experts at every step? Construct an example where the loss suffered by such an offline player is 0, but the expected loss of any online learning algorithm is at least T(1 - 1/N).

In online learning we can distinguish between the case where the player is told what the function f^(t) is, and the case where she only learns her specific loss f^(t)(x^(t)). The former scenario is called the "full information" model, and it is the model we'll focus on. The latter is usually referred to as the "multi-armed bandit" problem, but we won't discuss it further. In addition we are going to make the following assumptions:

- The set of allowed actions for the player is the probability simplex K_N = {p ∈ R^N_+ : Σ_i p_i = 1}.
- Loss functions are linear: f^(t)(p) = m^(t) · p, where m^(t) ∈ R^N is such that |m_i^(t)| ≤ 1 for all i ∈ {1, ..., N}. Note that this ensures that f^(t)(p) ∈ [-1, 1] for any p ∈ K_N.

This setting has an interpretation in terms of "experts": think of each of the coordinates i = 1, ..., N as an expert. At each time t, the experts make recommendations. The player has to bet on an expert, or more generally on a distribution over experts. Then she gets to see how well the experts' recommendations turned out: to each expert i is associated a loss m_i^(t), and the player suffers an average loss m^(t) · p^(t). The player's goal is to minimize her regret R_T, which in this case boils down to

    R_T = \sum_{t=1}^T f^{(t)}(p^{(t)}) - \min_{p \in K_N} \sum_{t=1}^T f^{(t)}(p) = \sum_{t=1}^T m^{(t)} \cdot p^{(t)} - \min_{i \in \{1, \ldots, N\}} \sum_{t=1}^T m_i^{(t)},

i.e. she is comparing her loss to that of the best expert in hindsight.
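In the experts setting the regret is a two-line computation. The helper below is a minimal sketch (the array layout, one row per time step, is our own convention):

```python
import numpy as np

def experts_regret(losses, plays):
    """Regret of a sequence of distributions over experts.

    losses: (T, N) array with losses[t, i] = m_i^(t) in [-1, 1].
    plays:  (T, N) array with plays[t] = the distribution p^(t).
    """
    player_loss = np.sum(losses * plays)       # sum_t m^(t) . p^(t)
    best_expert = losses.sum(axis=0).min()     # min_i sum_t m_i^(t)
    return player_loss - best_expert
```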
How would you choose which expert to follow so as to minimize your total regret? A natural strategy is to always choose the expert who has performed best so far. Suppose for simplicity that experts are always either right or wrong, so that the losses are m_i^(t) ∈ {0, 1}: 0 for right and 1 for wrong. Suppose also that there is at least one expert who always gets it right. Finally, suppose at each step we "hedge" and choose an expert at random among those who have always been correct so far. Then whenever we suffer an (expected) loss r_t ∈ [0, 1], it precisely means that a fraction r_t of the remaining experts (those who were right so far) got it wrong and are thus eliminated. Since N \prod_t (1 - r_t) ≥ 1 (there are N experts to start with and there must be at least 1 remaining), taking logarithms and using ln(1 - x) ≤ -x we see that Σ_t r_t ≤ ln N, which is not bad; in particular it doesn't grow with T.

But now let's remove our assumption that there is an expert that always gets it right: in practice, no one is perfect. Suppose at the first step all experts do equally well, so m^(1) = (1/2, ..., 1/2), but from then on experts get it right or wrong on alternate days: m^(2t) = (1, ..., 1, 0, ..., 0) and m^(2t+1) = (0, ..., 0, 1, ..., 1) for t ≥ 1, where the first half of the experts errs on even days and the second half on odd days. In this case, it turns out we're always going to pick the wrong expert! Our regret increases by 1/2 per iteration on average, and we never learn. Here a much better strategy would be to realize that no expert is consistently good, so that by picking a random expert at each step we suffer an expected loss of 1/2, which is just as good as the best expert: our regret would be 0 instead of T/2.

Follow the Regularized Leader. The solution is to consider a strategy which does not "jump around" too quickly. Instead of selecting the absolute best expert at every step, we're going to add a risk penalty which penalizes distributions that concentrate too much on a small number of experts; we'll only allow such high concentration if there is really high confidence that the chosen experts are doing much better than the others. Specifically, instead of playing the distribution p such that

    p^{(t+1)} = \mathrm{argmin}_{p \in K_N} \sum_{j \leq t} m^{(j)} \cdot p,

we'll choose

    p^{(t+1)} = \mathrm{argmin}_{p \in K_N} \Big( \eta \sum_{j \leq t} m^{(j)} \cdot p + R(p) \Big),    (2.2)

where R is a "regularizer" and η > 0 a small weight of our choosing, used to balance the two terms.
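Before turning to the choice of regularizer, it is worth simulating the alternating-loss example above to see the failure of the naive "follow the leader" strategy concretely. The sketch below uses our own tie-breaking convention (the leader with lowest index is chosen); follow-the-leader's regret grows like T/2, while always playing the uniform distribution keeps the regret bounded.

```python
import numpy as np

T, N = 1000, 10                       # horizon and number of experts
losses = np.zeros((T, N))
losses[0] = 0.5                       # step 1: every expert loses 1/2
losses[1::2, : N // 2] = 1.0          # even steps: first half is wrong
losses[2::2, N // 2 :] = 1.0          # odd steps t >= 3: second half is wrong

cum = np.zeros(N)                     # cumulative loss of each expert
ftl_loss = 0.0
for t in range(T):
    leader = int(np.argmin(cum))      # follow the leader (lowest index on ties)
    ftl_loss += losses[t, leader]
    cum += losses[t]

uniform_loss = losses.mean(axis=1).sum()    # always play the uniform distribution
best = cum.min()
print(f"FTL regret:     {ftl_loss - best:.1f}")      # close to T/2
print(f"uniform regret: {uniform_loss - best:.1f}")  # stays ~0
```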
2.3 The Multiplicative Weights Update algorithm

The multiplicative weights update (MWU for short) algorithm is a special case of follow the regularized leader where the regularizer R is chosen to be the (negative of the) entropy function.

Regularizing via entropy. Let R(p) = -H(p), where H(p) = -Σ_i p_i ln p_i is the entropy of p. (In this lecture we'll use the unusual convention that logarithms are measured in "nats". This is only for convenience, and in later lectures we'll revert to the binary logarithm log.) We'll use the following properties of entropy.

Lemma 2.1. Let p, q be distributions in K_N, H(p) = Σ_i p_i ln(1/p_i) the entropy of p, and D(p‖q) = Σ_i p_i ln(p_i/q_i) the relative entropy (sometimes also called KL divergence) between p and q. Then the following hold:

(1) For all p, 0 ≤ H(p) ≤ ln N.

(2) For all p and q, D(p‖q) ≥ 0.

Exercise 2. Prove the lemma.

The only way to have zero entropy is to have ln p_i = 0 for each i such that p_i ≠ 0, i.e. p_i = 1 for some i and zero elsewhere: low-entropy distributions are highly concentrated. At the opposite extreme, the only distribution with maximal entropy is the uniform distribution (as being uniform is the only way to saturate the Jensen inequality used in the proof of the lemma). Thus the effect of using the negative entropy as a regularizer is to penalize distributions that are too highly concentrated.

The multiplicative weights update rule. With this choice of regularizer it is possible to solve (2.2) explicitly. Specifically, note that the gradient of the objective in (2.2) at a point q has coordinates

    \frac{\partial}{\partial q_i} = \eta \sum_{j \leq t} m_i^{(j)} + \ln(q_i) + 1,

and setting this to zero gives q_i = e^{-1 - \eta \sum_{j \leq t} m_i^{(j)}}. This gives us the unique minimizer over the positive orthant. To take into account the normalization to a probability distribution, we project onto K_N and find the update rule

    p_i^{(t+1)} = \frac{e^{-\eta \sum_{j \leq t} m_i^{(j)}}}{\sum_k e^{-\eta \sum_{j \leq t} m_k^{(j)}}} = \frac{e^{-\eta m_i^{(t)}} p_i^{(t)}}{\sum_k e^{-\eta m_k^{(t)}} p_k^{(t)}}.

You can check using the KKT conditions that this solution indeed minimizes (2.2) over K_N, when R(p) = -H(p). This suggests the following multiplicative weights update algorithm:
Algorithm 1: Multiplicative Weights Algorithm (a.k.a. MW or MWA)

    Choose η ∈ (0, 1);
    Initialize weights w_i^(1) for each i ∈ {1, ..., N} (example: w_i^(1) = 1, but they can be chosen arbitrarily);
    for t = 1, 2, ..., T do
        Choose decisions proportional to the weights w^(t), i.e., use the distribution p^(t) = (w_1^(t)/Φ^(t), ..., w_N^(t)/Φ^(t)), where Φ^(t) = Σ_i w_i^(t);
        Observe the costs m^(t) of the decisions;
        Penalize costly decisions by updating their weights: w_i^(t+1) ← e^{-η m_i^(t)} w_i^(t);
    end

At every step we update our weights w^(t) → w^(t+1) by re-weighting each expert i by a multiplicative factor e^{-η m_i^(t)} that is directly related to its performance in the previous step. However, we do so smoothly, a small step at a time; how small the step is is governed by η, a parameter that we'll choose for best performance later.

Remark 2.2. Note that for small η, e^{-η m_i^(t)} ≈ 1 - η m_i^(t). So we could also use the modified update rule w_i^(t+1) ← (1 - η m_i^(t)) w_i^(t) directly in the algorithm; it sometimes leads to a slightly better bound (depending on the structure of the loss functions).

Analysis. The following theorem quantifies the performance of the Multiplicative Weights Algorithm.

Theorem 2.3. For all N ∈ N, 0 < η < 1, T ∈ N, and losses m_i^(t) ∈ [-1, 1] for i ∈ {1, ..., N} and t ∈ {1, ..., T}, the above procedure starting at some p^(1) ∈ K_N suffers a total loss such that

    \sum_{t=1}^T m^{(t)} \cdot p^{(t)} \leq \min_{p \in K_N} \Big( \sum_{t=1}^T m^{(t)} \cdot p + \frac{1}{\eta} D(p \,\|\, p^{(1)}) \Big) + \eta \sum_{t=1}^T \sum_{i=1}^N (m_i^{(t)})^2 p_i^{(t)},

where D(p‖q) = Σ_i p_i ln(p_i/q_i) is the relative entropy.

Before proving the theorem we can make a number of important comments. First let's make some observations that will give us a simpler formulation. By choosing p^(1) to be the uniform distribution we can ensure that D(p‖p^(1)) = ln N - H(p) is always at most ln N, whatever the optimal p is. Moreover, using |m_i^(t)| ≤ 1, the last term is at most ηT. Using the definition of the regret (2.1), we get

    R_T \leq \frac{\ln N}{\eta} + \eta T.
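Algorithm 1 takes only a few lines of code. The sketch below is one possible NumPy rendering (the callback loss_fn, which produces the loss vectors, is our own device and is not part of the pseudocode above); it can be used to check the regret bound empirically.

```python
import numpy as np

def multiplicative_weights(loss_fn, N, T, eta):
    """Run MW for T steps over N experts.

    loss_fn(t, p) must return the loss vector m^(t) in [-1, 1]^N.
    Returns the distributions played and the total loss suffered.
    """
    w = np.ones(N)                  # w_i^(1) = 1
    plays, total_loss = [], 0.0
    for t in range(T):
        p = w / w.sum()             # p^(t) is proportional to the weights
        plays.append(p)
        m = loss_fn(t, p)           # observe the costs m^(t)
        total_loss += m @ p
        w = w * np.exp(-eta * m)    # w_i^(t+1) <- e^{-eta m_i^(t)} w_i^(t)
    return np.array(plays), total_loss
```

With the choice of η made next, the total loss of this procedure exceeds that of the best expert by at most 2√(T ln N).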
The best choice of η in the bound R_T ≤ (ln N)/η + ηT is η = √(ln N / T), in which case we get

    R_T \leq 2 \sqrt{T \ln N}.

What this means is that, if we want to have an average regret, over the T rounds, that is at most ε, then it will suffice to run the procedure for a number of iterations T = 4 ln N / ε².

The most remarkable feature of this bound is the logarithmic dependence on N: in only a logarithmic number of iterations we are able to narrow down on a way to select experts that ensures our average loss is small. This matches the guarantee of our earlier naive procedure of choosing the best expert so far, except that now it works in all cases, whether the experts are consistent or not! The dependence on ε, however, is not so good. This is an important drawback of the MWA; in many cases it is not a serious limitation, but it is important to keep it in mind, and we'll return to this issue when we look at some examples.

First let's prove the theorem.

Proof of Theorem 2.3. The proof is based on the use of the relative entropy as a potential function: for any q we measure the decrease

    D(q \,\|\, p^{(t+1)}) - D(q \,\|\, p^{(t)}) = \sum_i q_i \ln \frac{p_i^{(t)}}{p_i^{(t+1)}}
    = \eta \sum_i q_i m_i^{(t)} + \ln \Big( \sum_i e^{-\eta m_i^{(t)}} p_i^{(t)} \Big)
    \leq \eta \, q \cdot m^{(t)} - \eta \, m^{(t)} \cdot p^{(t)} + \eta^2 \sum_i (m_i^{(t)})^2 p_i^{(t)},

where for the last line we used e^{-x} ≤ 1 - x + x² (valid for |x| ≤ 1) and ln(1 - x) ≤ -x (valid for all x ≤ 1). Summing over t and using that the relative entropy is always non-negative proves the theorem.

2.4 Applications

We now give three applications of the multiplicative weights algorithm: to finding linear classifiers, solving zero-sum games, and solving linear programs. You'll treat other applications in your homework.

2.4.1 Finding linear classifiers

Let's get back to our first example, learning a linear classifier. For simplicity, we are going to assume that there exists a strict classifier, in the sense that there exists x* ∈ R^N such that ‖x*‖_1 = 1 and

    \forall j \in \{1, \ldots, k\}, \quad l_j (a_j \cdot x^*) \geq \epsilon    (2.3)

for some ε > 0 (here ‖a‖_1 = Σ_i |a_i|).
First we observe that we may also assume that x* ∈ R^N_+. For this, create a new input by setting b_j = (a_j, -a_j) ∈ R^{2N}. Suppose there exists x* ∈ R^N such that (2.3) holds. Let x*_+ = max(x*, 0) and x*_- = -min(x*, 0) (component-wise), so x* = x*_+ - x*_-. Then y* = (x*_+, x*_-) ∈ R^{2N}_+ has non-negative entries summing to 1, and satisfies (2.3) with the b_j replacing the a_j. Conversely, for any y = (y_1, y_2) ∈ R^{2N}_+ with entries summing to 1 such that (2.3) holds (for the b_j), we can define x = (1/2)(y_1 - y_2), so that ‖x‖_1 ≤ 1 and x satisfies (2.3) with the a_j, and with ε replaced by ε/2. Thus the two problems are equivalent, and for the remainder of this section we assume that (2.3) is satisfied for some x* ∈ R^N_+ such that Σ_i x*_i = 1 (we keep using the notation a_j). We also define ρ = max_j ‖a_j‖_∞, where ‖a‖_∞ = max_{i=1,...,N} |a_i|.

The idea behind applying the MWA here is to consider each of the N coordinates of the vectors a_j as an expert, and to choose the loss vector m^(t) as a function of one of the a_j that is not well classified by the current vector p^(t). If p^(t) is such that all of the a_j satisfy l_j(a_j · p^(t)) ≥ 0, then all points are well classified by x = p^(t) and we can stop the algorithm. Otherwise, we update p^(t) through the MWA update rule and go to the next step. The question here is to determine why the algorithm terminates, and in how many steps.

Let us analyze the MW approach in a bit more detail. We initialize w_i^(1) = 1 for all i, and p^(t) accordingly. At each round t, we look for a value of j such that l_j(a_j · p^(t)) < 0, i.e. a_j is not classified correctly. If there is none, the algorithm stops. If there is one, we pick this j and define m^(t) = -(l_j/ρ) a_j; note that by the definition of ρ, m_i^(t) ∈ [-1, 1] for all i. By assumption, there exists x* such that m^(t) · x* = -l_j (a_j · x*)/ρ ≤ -ε/ρ. By the main MW theorem, since x* is a distribution, we have

    \sum_{t=1}^T m^{(t)} \cdot p^{(t)} \leq \sum_{t=1}^T m^{(t)} \cdot x^* + \frac{\ln N}{\eta} + \eta T.

Using m^(t) · x* ≤ -ε/ρ, we get

    \sum_{t=1}^T m^{(t)} \cdot p^{(t)} \leq -\frac{T\epsilon}{\rho} + \frac{\ln N}{\eta} + \eta T.

Now, one should remark that until the last time step, i.e. until the algorithm stops, we chose a_j such that l_j(a_j · p^(t)) < 0, which implies that for all t = 1, ..., T,

    m^{(t)} \cdot p^{(t)} = -l_j (a_j \cdot p^{(t)})/\rho > 0.

It immediately follows that

    0 < -\frac{T\epsilon}{\rho} + \frac{\ln N}{\eta} + \eta T.

Finally, choosing η = ε/(2ρ), we obtain T < 4ρ² ln N / ε². This proves that the algorithm terminates, and that it finds a classifier in fewer than 4ρ² ln N / ε² time steps.
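Here is a sketch of the resulting classification procedure, assuming the reduction above has already been applied (so that the classifier we seek is a probability distribution); the choice of which misclassified point to use at each step is arbitrary, and we take the first one.

```python
import numpy as np

def mw_linear_classifier(A, labels, eps):
    """Find p >= 0 with sum(p) = 1 and labels[j] * (A[j] @ p) >= 0 for all j.

    A: (k, N) array whose rows are the points a_j; labels: entries in {-1, +1}.
    Assumes a strict classifier with margin eps exists, as in (2.3).
    """
    k, N = A.shape
    rho = np.abs(A).max()                   # rho = max_j ||a_j||_inf
    eta = eps / (2 * rho)
    T = int(4 * rho**2 * np.log(N) / eps**2) + 1
    w = np.ones(N)
    for _ in range(T):
        p = w / w.sum()
        bad = np.flatnonzero(labels * (A @ p) < 0)   # misclassified points
        if bad.size == 0:
            return p                        # every point is classified correctly
        j = bad[0]
        m = -(labels[j] / rho) * A[j]       # loss vector, entries in [-1, 1]
        w = w * np.exp(-eta * m)
    raise ValueError("assumption (2.3) must be violated")
```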
2.4.2 Zero-sum games

Next we consider games in the game-theoretic sense: informally, a set of k players each have a set of actions they can choose from; every player chooses an action, after which he gets a payoff, or utility, that is a function of his own action and the actions of the other players. Here we will limit ourselves to 2-player zero-sum games, in which each of the two players has a finite number N of available actions. The payoffs of players 1 and 2 (let us call them the row and column players from now on) can be represented by a single payoff matrix M ∈ R^{N×N} such that if the row player plays action i and the column player plays j, then the column player receives payoff M(i, j) and the row player receives -M(i, j). A player can decide to randomize and choose a distribution over actions which she will sample from. In particular, if the row player plays according to a distribution p over row actions (instead of playing one specific action) and the column player plays according to a distribution q over column actions, then the expected payoff of the column player is p^T M q and the expected payoff of the row player is -p^T M q.

Given that each player is trying to maximize her utility, we can see that the row and column players in a zero-sum game have conflicting objectives: the column player wants to maximize p^T M q, while the row player wants to minimize the same quantity. One nice property of zero-sum games is that it does not matter for either of the players whether they act first: von Neumann's minimax theorem states that

    \max_q \min_p \, p^\top M q = \min_p \max_q \, p^\top M q = \lambda^*.

This means in particular that it does not matter whether the row player acts first, minimizing, and lets the column player act second and maximize min_p p^T M q, or the column player maximizes p^T M q first and the row player then minimizes max_q p^T M q: the utility both players end up getting does not change. This optimal utility λ* is called the value of the game.

We want to show that the MWA allows us to find strategies that are near-optimal for a zero-sum game, in the sense that for any ε > 0, and for a sufficient number of time steps T, the MWA finds a pair of strategies ((1/T) Σ_t p^(t), (1/T) Σ_t q^(t)) for the row and column players that satisfies

    \lambda^* \leq \Big( \frac{1}{T} \sum_{t=1}^T p^{(t)} \Big)^{\!\top} M \Big( \frac{1}{T} \sum_{t=1}^T q^{(t)} \Big) \leq \lambda^* + \epsilon.

The idea behind the MW here is to take p^(t) to be the strategy of the row player. At each time step, q^(t) is chosen as the best response to the strategy p^(t) (i.e. the strategy which maximizes the utility of the column player given that the row player plays p^(t)); then the row player applies the multiplicative weights update with the loss vector m^(t) = M q^(t), whose i-th entry is the expected loss of the row player when she plays action i and the column player plays q^(t), to decide what p^(t+1) will be in the next time step.
Intuitively, the row player tries to penalize the strategies that lead to a low payoff and to put more weight on the strategies that lead to a higher payoff. Formally, the algorithm proceeds as follows:

Algorithm 2: MWA for zero-sum games

    Choose η ∈ (0, 1);
    Initialize weights w_i^(1) := 1 for each i ∈ {1, ..., N};
    for t = 1, 2, ..., T do
        Define the distribution p^(t) = (w_1^(t)/Φ^(t), ..., w_N^(t)/Φ^(t)), where Φ^(t) = Σ_i w_i^(t);
        Define q^(t) = argmax_q (p^(t))^T M q and m^(t) = M q^(t);
        Penalize costly decisions by updating their weights: w_i^(t+1) ← e^{-η m_i^(t)} w_i^(t);
    end

Let's analyze this algorithm. Assume that the payoffs satisfy M(i, j) ∈ [-1, 1]. Choosing η = √(ln N / T) and T = 4 ln N / ε², according to the MW theorem, for any distribution p*,

    \frac{1}{T} \sum_{t=1}^T m^{(t)} \cdot p^{(t)} = \frac{1}{T} \sum_{t=1}^T (p^{(t)})^\top M q^{(t)} \leq \frac{1}{T} \sum_{t=1}^T (p^*)^\top M q^{(t)} + \epsilon.    (2.4)

Take p* = argmin_p (1/T) Σ_t p^T M q^(t). Then

    \frac{1}{T} \sum_{t=1}^T (p^*)^\top M q^{(t)} = \min_p \frac{1}{T} \sum_{t=1}^T p^\top M q^{(t)} \leq \max_q \min_p \, p^\top M q = \lambda^*,

and therefore (2.4) implies

    \frac{1}{T} \sum_{t=1}^T (p^{(t)})^\top M q^{(t)} \leq \lambda^* + \epsilon.

To lower bound (1/T) Σ_t (p^(t))^T M q^(t), note that for every distribution q we have (p^(t))^T M q^(t) ≥ (p^(t))^T M q, given the choice of q^(t) made in the algorithm; therefore

    (p^{(t)})^\top M q^{(t)} = \max_q \, (p^{(t)})^\top M q

and

    (p^{(t)})^\top M q^{(t)} \geq \min_p \max_q \, p^\top M q = \lambda^*.

Using the fact that the above bounds are valid for any t, by linearity we get that the distributions produced by the algorithm satisfy

    \lambda^* \leq \Big( \frac{1}{T} \sum_{t=1}^T p^{(t)} \Big)^{\!\top} M \Big( \frac{1}{T} \sum_{t=1}^T q^{(t)} \Big) \leq \lambda^* + \epsilon.
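Since the maximum of q ↦ (p^(t))^T M q over the simplex is always attained at a pure strategy, the best response in Algorithm 2 reduces to an argmax over the columns of M. A minimal sketch, assuming M(i, j) ∈ [-1, 1]:

```python
import numpy as np

def mw_zero_sum(M, eps):
    """Approximately solve the zero-sum game with payoff matrix M.

    M: (N, N) array with entries in [-1, 1] (payoffs to the column player).
    Returns average strategies (p_bar, q_bar) whose value is eps-close to lambda*.
    """
    N = M.shape[0]
    T = int(4 * np.log(N) / eps**2) + 1
    eta = np.sqrt(np.log(N) / T)
    w = np.ones(N)
    p_sum, q_sum = np.zeros(N), np.zeros(N)
    for _ in range(T):
        p = w / w.sum()
        j = int(np.argmax(p @ M))   # best response: a pure column strategy
        m = M[:, j]                 # loss vector m^(t) = M q^(t)
        p_sum += p
        q_sum[j] += 1.0
        w = w * np.exp(-eta * m)
    return p_sum / T, q_sum / T
```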
These inequalities show that p̄ = (1/T) Σ_t p^(t) and q̄ = (1/T) Σ_t q^(t) form an approximate equilibrium of the game, in the sense that neither the row nor the column player can improve her utility by more than ε by changing her strategy, when the strategy of the other player remains fixed. Indeed, imagine that the column player decides to change her strategy to some other q. Then she is going to get payoff

    \Big( \frac{1}{T} \sum_{t=1}^T p^{(t)} \Big)^{\!\top} M q \leq \frac{1}{T} \sum_{t=1}^T (p^{(t)})^\top M q^{(t)} \leq \lambda^* + \epsilon,

where the first inequality holds because each q^(t) is a best response to p^(t). However, she is guaranteed to earn payoff at least λ* when playing q̄, and can therefore not improve her payoff by more than ε by changing her strategy. A similar reasoning applies to the row player.

2.4.3 Solving linear programs

Our goal is to check the feasibility of a set of linear inequalities,

    Ax \geq b, \quad x \geq 0,

where A is an m × n matrix with rows a_1^T, ..., a_m^T and x is an n-dimensional vector; or, more precisely, to find an approximately feasible solution x ≥ 0 such that for some ε > 0,

    a_i \cdot x \geq b_i - \epsilon, \quad \forall i.

The analysis will be based on an oracle that answers the following question:

    Given a vector c and a scalar d, does there exist an x ≥ 0 such that c · x ≥ d?

With this oracle, we will be able to repeatedly check whether some convex combination of the initial linear inequalities a_i · x ≥ b_i is infeasible, a condition that is sufficient for the infeasibility of our original problem. Note that the oracle is straightforward to construct, as it involves a single inequality: in particular, it returns a negative answer if and only if d > 0 and c ≤ 0 (component-wise).

The algorithm is as follows. Experts correspond to each of the m constraints, and loss functions are associated with points x ≥ 0. The loss suffered by the i-th expert will be m_i = (1/ρ)(a_i · x - b_i), where ρ > 0 is a parameter used to ensure that the losses satisfy m_i ∈ [-1, 1]. (Although one might expect the penalty to be the violation of the constraint, it is exactly the opposite; the reason is that the algorithm is trying to actually prove infeasibility of the problem.)

In the t-th round, we use our distribution p^(t) over experts to generate an inequality that would be valid, if the problem were feasible: the inequality is

    \Big( \sum_i p_i^{(t)} a_i \Big) \cdot x \geq \sum_i p_i^{(t)} b_i.    (2.5)

We then use the oracle either to detect that this constraint is infeasible, in which case the original problem is infeasible, or to return a point x^(t) ≥ 0 that satisfies the inequality. The loss we suffer is equal to (1/ρ) Σ_i p_i^(t) (a_i · x^(t) - b_i), and we use this loss to update our weights. Note that in case infeasibility is not detected, the penalty we pay is always non-negative, since x^(t) satisfies the inequality (2.5).
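The oracle can be implemented directly from this description. The sketch below is one possibility; the choice of which feasible point to return is ours, and a practical implementation would also bound the norm of that point, which is what controls the width ρ discussed below.

```python
import numpy as np

def oracle(c, d):
    """Return some x >= 0 with c @ x >= d, or None if no such x exists.

    Infeasibility happens exactly when d > 0 and c <= 0 component-wise.
    """
    n = len(c)
    if d <= 0:
        return np.zeros(n)          # x = 0 already satisfies c @ x >= d
    i = int(np.argmax(c))
    if c[i] <= 0:
        return None                 # d > 0 and c <= 0: infeasible
    x = np.zeros(n)
    x[i] = d / c[i]                 # put all the mass on one positive coordinate
    return x
```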
The simplified form of the multiplicative weights theorem gives us the following guarantee:

    \sum_{t=1}^T m^{(t)} \cdot p^{(t)} \leq \sum_{t=1}^T m_i^{(t)} + 2\sqrt{T \ln m} \quad \text{for any } i.

Choose T = 4ρ² ln(m)/ε². Using that the left-hand side is non-negative, if we set x̄ = (1/T) Σ_t x^(t) we see that for every i, a_i · x̄ ≥ b_i - ε, as desired: indeed, multiplying the inequality 0 ≤ (1/ρ) Σ_t (a_i · x^(t) - b_i) + 2√(T ln m) by ρ/T gives a_i · x̄ - b_i ≥ -2ρ √(ln(m)/T) = -ε.

The number of iterations T required to decide approximate feasibility scales logarithmically with the number of constraints m. This is much better than standard algorithms such as the ellipsoid algorithm, which are polynomial in n. However, the dependence on the accuracy ε is quadratic, which is much worse than for these algorithms, for which the dependence is in log(1/ε). Finally, the dependence of T on ρ is also quadratic. The value of ρ depends on the oracle (it is called the width of the oracle), and depending on the specific LP we are trying to solve, it may or may not be possible to design an oracle that guarantees a small value of ρ.
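Putting the pieces together, here is a sketch of the full feasibility procedure. It reuses the oracle sketched above, and takes ρ as an input, assumed to bound |a_i · x - b_i| over all points x the oracle may return (the width).

```python
import numpy as np

def mw_lp_feasibility(A, b, rho, eps):
    """Find x >= 0 with A @ x >= b - eps, or None if Ax >= b, x >= 0 is infeasible.

    A: (m, n) constraint matrix, b: (m,) right-hand side; rho is the width.
    """
    m, n = A.shape
    T = int(4 * rho**2 * np.log(m) / eps**2) + 1
    eta = np.sqrt(np.log(m) / T)
    w = np.ones(m)
    x_sum = np.zeros(n)
    for _ in range(T):
        p = w / w.sum()
        x = oracle(p @ A, p @ b)        # the combined inequality (2.5)
        if x is None:
            return None                 # the original system is infeasible
        x_sum += x
        losses = (A @ x - b) / rho      # expert i suffers (a_i . x - b_i) / rho
        w = w * np.exp(-eta * losses)
    return x_sum / T                    # satisfies a_i . x >= b_i - eps for all i
```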