Chapter 2: The Experts/Multiplicative Weights Algorithm and Applications

We turn to the problem of online learning, and analyze a very powerful and versatile algorithm called the multiplicative weights update algorithm. Let's motivate the algorithm by considering a few diverse example problems; we'll then introduce the algorithm and later show how it can be applied to solve each of these problems.

2.1 Some optimization problems

Machine learning: learning a linear classifier. In machine learning, a basic but classic setting consists of a set of $k$ labeled examples $(a_1, \ell_1), \ldots, (a_k, \ell_k)$, where the $a_j \in \mathbb{R}^N$ are feature vectors and the $\ell_j$ are labels in $\{-1, 1\}$. The problem is to find a linear classifier: a unit vector $x \in \mathbb{R}^N_+$ such that $\|x\|_1 = 1$ and
\[
\forall j \in \{1, \ldots, k\}, \qquad \ell_j \, (a_j \cdot x) \geq 0
\]
(think of $x$ as a distribution over the coordinates of the $a_j$, the features, that "explains", or correlates with, the labeling indicated by the $\ell_j$). A small numeric check of this condition is sketched below.

Machine learning: boosting. Suppose given a sequence of training points $x_1, \ldots, x_N$ sampled from some universe according to some (unknown) distribution $D$. Each point has an (unknown) label $c(x_i) \in \{-1, 1\}$, where the function $c$ is taken from some restricted set of functions (the "concept class" $C$). The goal is to find a hypothesis function $h \in C$ that assigns labels to points, and predicts the function $c$ in the best way possible (on average over $D$). For instance, in the previous example the concept class is the class of all linear classifiers (or "hyperplanes"), the distribution is uniform, and the hypothesis function is $x$. Call a learning algorithm strong if it outputs with probability at least $1 - \delta$ a hypothesis $h$ such that $\mathbb{E}_D\big[\mathbf{1}\{h(x_i) \neq c(x_i)\}\big] \leq \varepsilon$, and weak if the error is at most $\frac{1}{2} - \gamma$. Suppose the only thing we have is a weak learning algorithm: for any distribution $D$, it returns a hypothesis $h$ such that
\[
\mathbb{E}_D\big[\mathbf{1}\{h(x_i) \neq c(x_i)\}\big] \,\leq\, \frac{1}{2} - \gamma.
\]
The goal is to use the weak learner in order to construct a strong learner.
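To make the classifier condition concrete, here is a minimal numeric check; the data matrix, labels, and candidate vector below are made up purely for illustration:

```python
import numpy as np

# Hypothetical toy data: k = 3 examples with N = 2 features each.
A = np.array([[1.0, -0.5],
              [0.8, 0.1],
              [-0.2, -1.0]])
labels = np.array([1, 1, -1])

# Candidate classifier: a distribution over the N features.
x = np.array([0.7, 0.3])
assert x.min() >= 0 and np.isclose(x.sum(), 1.0)

# x is a valid classifier iff l_j * (a_j . x) >= 0 for every example j.
margins = labels * (A @ x)
print(margins)                      # per-example margins
print(bool(np.all(margins >= 0)))  # True: x classifies every example
```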

Approximation algorithms: Set Cover. Given a universe $U = \{1, \ldots, N\}$ and a collection $C = \{C_1, \ldots, C_m\}$ of subsets of $U$, the goal is to find the smallest possible number of sets from $C$ that covers all of $U$.

Linear programming. Given a set of $N$ linear constraints $a_i \cdot x \geq b_i$, where $a_i \in \mathbb{R}^m$ and $b_i \in \mathbb{R}$, the goal is to decide if there is an (approximately) feasible solution: does there exist an $x \in \mathbb{R}^m_+$ such that $a_i \cdot x \geq b_i - \delta$ for all $i \in \{1, \ldots, N\}$?

Game theory: zero-sum games. Consider a game between two players, the row and column players. Each player can choose an action, $i, j \in \{1, \ldots, N\}$. If the actions are $(i, j)$ the payoff to the column player is $M(i, j) \in \mathbb{R}$, and the payoff to the row player is $-M(i, j)$. A player can decide to randomize and choose a distribution over actions which she will sample from. In particular, if the row player plays according to a distribution $p \in \mathbb{R}^N_+$ over row actions and the column player plays according to a distribution $q \in \mathbb{R}^N_+$ over column actions, then the expected payoff of the column player is $p^\top M q$ and the expected payoff of the row player is $-p^\top M q$. The goal is to find (approximately) optimal strategies for both players: $(p^*, q^*)$ such that $(p^*)^\top M q^* \geq \lambda - \varepsilon$, where
\[
\lambda \,=\, \max_q \min_p \; p^\top M q \,=\, \min_p \max_q \; p^\top M q.
\]

What do these five problems have in common? For each there is a natural greedy approach. In a zero-sum game, we could pick a random strategy to start with, find the second player's best response, then the first player's best response to this, etc. For the linear programming problem we can again choose a random $x$, find a constraint that is violated, and update $x$. Same for the linear classifier problem (which can itself be written as a linear program). In the set cover problem we can pick a first set that covers the most possible elements, then a second set covering the highest possible number of uncovered elements, etc. In boosting we can tweak the distribution over points and call a weak learner to improve the performance of our current hypothesis with respect to misclassified points.

In each case it is not so easy to analyze such a greedy approach, as there is always a risk of getting stuck in a local minimum. In this lecture we're going to develop an abstract greedy-but-cautious algorithm that can be applied to efficiently solve each of these problems.

2.2 Online learning

Consider the following situation. There are $T$ steps, $t = 1, \ldots, T$. At each step $t$, the player chooses an action $x^{(t)} \in K \subseteq \mathbb{R}^N$. She suffers a loss $f^{(t)}(x^{(t)})$, where $f^{(t)} : K \to [-1, 1]$ is the loss function.

The player's goal is to devise a strategy for choosing her actions $x^{(1)}, \ldots, x^{(T)}$ which minimizes the regret,
\[
R_T \,=\, \sum_{t=1}^T f^{(t)}(x^{(t)}) \,-\, \min_{x \in K} \sum_{t=1}^T f^{(t)}(x). \tag{2.1}
\]
The regret measures how much worse the player is doing compared to a player who is told all loss functions in advance, but is restricted to play the same action at each step. Since the player may be adaptive, a priori the regret could be positive or negative. In virtually all cases the set $K$ as well as the functions $f^{(t)}$ will be convex, so that setting $\bar{x} = \frac{1}{T}(x^{(1)} + \cdots + x^{(T)})$ we can see that under this convexity assumption the regret is always a non-negative quantity.

Exercise 1. What if regret were defined, not with respect to the best fixed action in hindsight, but by comparison with an offline player who is allowed to switch between experts at every step? Construct an example where the loss suffered by such an offline player is 0, but the expected loss of any online learning algorithm is at least $T(1 - 1/N)$.

In online learning we can distinguish between the cases where the player is told what the function $f^{(t)}$ is, or only what her specific loss $f^{(t)}(x^{(t)})$ is. The former scenario is called the full information model, and it is the model we'll focus on. The latter is usually referred to as the multi-armed bandit problem, but we won't discuss it further. In addition we are going to make the following assumptions:

- The set of allowed actions for the player is the probability simplex $K_N = \{p \in \mathbb{R}^N_+ : \sum_i p_i = 1\}$.

- Loss functions are linear, $f^{(t)}(p) = m^{(t)} \cdot p$, where $m^{(t)} \in \mathbb{R}^N$ is such that $|m^{(t)}_i| \leq 1$ for all $i \in \{1, \ldots, N\}$. Note that this ensures that $f^{(t)}(p) \in [-1, 1]$ for any $p \in K_N$.

This setting has an interpretation in terms of "experts": think of each of the coordinates $i = 1, \ldots, N$ as an expert. At each time $t$, the experts make recommendations. The player has to bet on an expert, or more generally on a distribution over experts. Then she gets to see how well the experts' recommendations turned out: to each expert $i$ is associated a loss $m^{(t)}_i$, and the player suffers an average loss $p^{(t)} \cdot m^{(t)}$. The player's goal is to minimize her regret $R_T$, which in this case boils down to
\[
R_T \,=\, \sum_{t=1}^T f^{(t)}(p^{(t)}) - \min_{p \in K_N} \sum_{t=1}^T f^{(t)}(p)
\,=\, \sum_{t=1}^T m^{(t)} \cdot p^{(t)} - \min_{i \in \{1, \ldots, N\}} \sum_{t=1}^T m^{(t)}_i,
\]
i.e. she is comparing her loss to that of the best expert in hindsight.

How would you choose which expert to follow so as to minimize your total regret? A natural strategy is to always choose the expert who has performed best so far. Suppose for simplicity that experts are always either right or wrong, so that the losses $m^{(t)}_i \in \{0, 1\}$: 0

for right and 1 for wrong. Suppose also that there is at least one expert who always gets it right. Finally, suppose at each step we hedge and choose an expert at random among those who have always been correct so far. Then whenever we suffer a loss $r_t \in [0, 1]$ it precisely means that a fraction $r_t$ of the remaining experts (those who were right so far) got it wrong and are thus eliminated. Since $N \prod_t (1 - r_t) \geq 1$ (there are $N$ experts to start with and there must be at least 1 remaining), taking logarithms we see $\sum_t r_t \leq \ln N$, which is not bad; in particular it doesn't grow with $T$.

But now let's remove our assumption that there is an expert that always gets it right: in practice, no one is perfect. Suppose at the first step all experts do equally well, so $m^{(1)} = (1/2, \ldots, 1/2)$. But now experts get it right or wrong on alternate days, $m^{(2t)} = (1, \ldots, 1, 0, \ldots, 0)$ and $m^{(2t+1)} = (0, \ldots, 0, 1, \ldots, 1)$ for $t \geq 1$. In this case, it turns out we're always going to pick the wrong expert! Our regret increases by 1/2 per iteration on average, and we never learn. In this case a much better strategy would be to realize that no expert is consistently good, so that by picking a random expert at each step we suffer an expected loss of 1/2, which is just as good as the best expert; our regret would be 0 instead of $T/2$.
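The failure of follow-the-leader on this example is easy to simulate. The sketch below uses made-up sizes, and relies on numpy's argmin breaking ties toward the lowest index; its loss grows like $T$ while the best fixed expert loses only about $T/2$:

```python
import numpy as np

T, N = 1000, 2
losses = np.zeros((T, N))
losses[0] = 0.5                  # round 1: every expert loses 1/2
losses[1::2, 0] = 1.0            # then expert 0 is wrong on alternate rounds
losses[2::2, 1] = 1.0            # and expert 1 is wrong on the other rounds

ftl_loss, cum = 0.0, np.zeros(N)
for t in range(T):
    leader = int(np.argmin(cum))  # follow the leader (ties -> lowest index)
    ftl_loss += losses[t, leader]
    cum += losses[t]

best_fixed = cum.min()            # best single expert in hindsight
print(ftl_loss, best_fixed)       # ~999.5 vs ~499.5: regret grows like T/2
```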

Follow the Regularized Leader. The solution is to consider a strategy which does not "jump around" too quickly. Instead of selecting the absolute best expert at every step, we're going to add a risk penalty which penalizes distributions that concentrate too much on a small number of experts; we'll only allow such high concentration if there is really high confidence that those experts are doing much better than the others. Specifically, instead of playing the distribution $p$ such that $p^{(t+1)} = \mathrm{argmin}_{p \in K_N} \sum_{j \leq t} m^{(j)} \cdot p$, we'll choose
\[
p^{(t+1)} \,=\, \mathrm{argmin}_{p \in K_N} \Big( \eta \sum_{j \leq t} m^{(j)} \cdot p \,+\, R(p) \Big), \tag{2.2}
\]
where $R$ is a "regularizer" and $\eta > 0$ a small weight of our choosing used to balance the two terms.

2.3 The Multiplicative Weights Update algorithm

The multiplicative weights update (MWU for short) algorithm is a special case of follow the regularized leader where the regularizer $R$ is chosen to be the (negative of the) entropy function.

Regularizing via entropy. Let $R(p) = -H(p)$, where $H(p) = -\sum_i p_i \ln p_i$ is the entropy of $p$. (In this lecture we'll use the unusual convention that logarithms are measured in "nats". This is only for convenience, and in later lectures we'll revert to the binary logarithm $\log$.) We'll use the following properties of entropy.

Lemma 2.1. Let $p, q$ be distributions in $K_N$, $H(p) = \sum_i p_i \ln \frac{1}{p_i}$ the entropy of $p$, and $D(p \,\|\, q) = \sum_i p_i \ln \frac{p_i}{q_i}$ the relative entropy (sometimes also called KL divergence) between $p$ and $q$. Then the following hold:

(1) For all $p$, $0 \leq H(p) \leq \ln N$.

(2) For all $p$ and $q$, $D(p \,\|\, q) \geq 0$.

Exercise 2. Prove the lemma.

The only way to have zero entropy is to have $\ln p_i = 0$ for each $i$ such that $p_i \neq 0$, i.e. $p_i = 1$ for some $i$ and zero elsewhere: low-entropy distributions are highly concentrated. At the opposite extreme, the only distribution with maximal entropy is the uniform distribution (as being uniform is the only way to saturate Jensen's inequality used in the proof of the lemma). Thus the effect of using the negative entropy as a regularizer is to penalize distributions that are too highly concentrated.

The multiplicative weights update rule. With this choice of regularizer it is possible to solve (2.2) explicitly. Specifically, note that the gradient of the objective at a point $q$ is simply $\eta \sum_{j \leq t} m^{(j)}_i + \ln(q_i) + 1$ in coordinate $i$, and setting this to zero gives $q_i = e^{-1 - \eta \sum_{j \leq t} m^{(j)}_i}$. This gives us the unique minimizer over $\mathbb{R}^N_+$. To take into account the normalization to a probability distribution, we project onto $K_N$ and find the update rule
\[
p_i^{(t+1)} \,=\, \frac{e^{-\eta \sum_{j \leq t} m^{(j)}_i}}{\sum_{i'} e^{-\eta \sum_{j \leq t} m^{(j)}_{i'}}}
\,=\, \frac{e^{-\eta m^{(t)}_i} \, p_i^{(t)}}{\sum_{i'} e^{-\eta m^{(t)}_{i'}} \, p_{i'}^{(t)}}.
\]
You can check using the KKT conditions that this solution indeed minimizes (2.2) over $K_N$, when $R(p) = -H(p)$.
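As a sanity check on this closed form, the sketch below (a random stand-in for the cumulative losses, made-up sizes) compares the normalized-exponential solution against a large sample of random points of the simplex; no sample should beat it on the objective (2.2):

```python
import numpy as np

rng = np.random.default_rng(0)
N, eta = 5, 0.3
L = rng.uniform(-7, 7, size=N)        # stand-in for cumulative losses

def objective(p):
    # eta * <L, p> - H(p): the follow-the-regularized-leader objective (2.2)
    return eta * L @ p + np.sum(p * np.log(p))

w = np.exp(-eta * L)
p_star = w / w.sum()                  # the closed-form (softmax) minimizer

samples = rng.dirichlet(np.ones(N), size=100000)
vals = eta * samples @ L + np.sum(samples * np.log(samples), axis=1)
print(bool(objective(p_star) <= vals.min()))   # True
```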

This suggests the following multiplicative weights update algorithm:

Algorithm 1: Multiplicative Weights Algorithm (a.k.a. MW or MWA)

  Choose $\eta \in (0, 1)$;
  Initialize weights $w_i^{(1)}$ for each $i \in \{1, \ldots, N\}$ (example: $w_i^{(1)} = 1$, but they can be chosen arbitrarily);
  for $t = 1, 2, \ldots, T$ do
    Choose decisions proportionally to the weights $w^{(t)}$, i.e., use the distribution $p^{(t)} = \big(w_1^{(t)}/\Phi^{(t)}, \ldots, w_N^{(t)}/\Phi^{(t)}\big)$, where $\Phi^{(t)} = \sum_i w_i^{(t)}$;
    Observe the costs $m^{(t)}$ of the decisions;
    Penalize costly decisions by updating their weights: $w_i^{(t+1)} \leftarrow e^{-\eta m_i^{(t)}} w_i^{(t)}$;
  end

At every step we update our weights $w^{(t)} \to w^{(t+1)}$ by re-weighting each expert $i$ by a multiplicative factor $e^{-\eta m_i^{(t)}}$ that is directly related to its performance in the previous step. However, we do so smoothly, a small step at a time; how small the step is, is governed by $\eta$, a parameter that we'll choose for best performance later.

Remark 2.2. Note that for small $\eta$, $e^{-\eta m_i^{(t)}} \approx 1 - \eta m_i^{(t)}$. So we could also use the modified update rule $w_i^{(t+1)} = (1 - \eta m_i^{(t)}) w_i^{(t)}$ directly in the algorithm; it sometimes leads to a slightly better bound (depending on the structure of the loss functions).

Analysis. The following theorem quantifies the performance of the Multiplicative Weights Algorithm.

Theorem 2.3. For all $N \in \mathbb{N}$, $0 < \eta < 1$, $T \in \mathbb{N}$, and losses $m_i^{(t)} \in [-1, 1]$ for $i \in \{1, \ldots, N\}$ and $t \in \{1, \ldots, T\}$, the above procedure starting at some $p^{(1)} \in K_N$ suffers a total loss such that
\[
\sum_{t=1}^T m^{(t)} \cdot p^{(t)} \;\leq\; \min_{p \in K_N} \Big( \sum_{t=1}^T m^{(t)} \cdot p + \frac{1}{\eta} D\big(p \,\|\, p^{(1)}\big) \Big) \,+\, \eta \sum_{t=1}^T \sum_{i=1}^N \big(m_i^{(t)}\big)^2 p_i^{(t)},
\]
where $D(p \,\|\, q) = \sum_i p_i \ln(p_i / q_i)$ is the relative entropy.

Before proving the theorem we can make a number of important comments. First let's make some observations that will give us a simpler formulation. By choosing $p^{(1)}$ to be the uniform distribution we can ensure that $D(p \,\|\, p^{(1)}) = \ln N - H(p)$ is always at most $\ln N$, whatever the optimal $p$ is. Moreover, using $|m_i^{(t)}| \leq 1$ the last term is at most $\eta T$. Using the definition of the regret (2.1), we get
\[
R_T \,\leq\, \frac{1}{\eta} \ln N + \eta T.
\]
The best choice of $\eta$ in this equation is $\eta = \sqrt{\ln N / T}$, in which case we get the bound
\[
R_T \,\leq\, 2 \sqrt{T \ln N}.
\]
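Algorithm 1 is only a few lines of code. The sketch below (function and variable names are my own) runs it on an arbitrary matrix of losses and compares the realized regret to the $2\sqrt{T \ln N}$ bound; the losses here are random, purely for illustration:

```python
import numpy as np

def multiplicative_weights(losses, eta):
    """Run MWU on losses of shape (T, N) with entries in [-1, 1].

    Returns the played distributions p^(1), ..., p^(T), shape (T, N).
    """
    T, N = losses.shape
    w = np.ones(N)                        # w_i^(1) = 1
    plays = np.empty((T, N))
    for t in range(T):
        plays[t] = w / w.sum()            # p^(t) proportional to weights
        w = w * np.exp(-eta * losses[t])  # penalize costly experts
    return plays

# Usage: random losses, eta = sqrt(ln N / T) as in the analysis.
rng = np.random.default_rng(1)
T, N = 10000, 50
losses = rng.uniform(-1, 1, size=(T, N))
p = multiplicative_weights(losses, eta=np.sqrt(np.log(N) / T))

regret = np.sum(p * losses) - losses.sum(axis=0).min()
print(regret, 2 * np.sqrt(T * np.log(N)))  # realized regret vs the bound
```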

What this means is that, if we want to have an average regret over the $T$ rounds that is at most $\varepsilon$, then it will suffice to run the procedure for a number of iterations $T = 4 \ln N / \varepsilon^2$. The most remarkable feature of this bound is the logarithmic dependence on $N$: in only a logarithmic number of iterations we are able to narrow down on a way to select experts that ensures our average loss is small. This matches the guarantee of our earlier naive procedure of choosing the best expert so far, except that now it works in all cases, whether the experts are consistent or not! The dependence on $\varepsilon$, however, is not so good. This is an important drawback of the MWA; in many cases it is not a serious limitation, but it is important to keep it in mind, and we'll return to this issue when we look at some examples. First let's prove the theorem.

Proof of Theorem 2.3. The proof is based on the use of the relative entropy as a potential function: for any $q$ we measure the decrease
\[
\begin{aligned}
D\big(q \,\|\, p^{(t+1)}\big) - D\big(q \,\|\, p^{(t)}\big)
&= \sum_i q_i \ln \frac{p_i^{(t)}}{p_i^{(t+1)}} \\
&= \eta \sum_i q_i m_i^{(t)} + \ln \Big( \sum_i e^{-\eta m_i^{(t)}} p_i^{(t)} \Big) \\
&\leq \eta \sum_i q_i m_i^{(t)} - \eta \sum_i m_i^{(t)} p_i^{(t)} + \eta^2 \sum_i \big(m_i^{(t)}\big)^2 p_i^{(t)},
\end{aligned}
\]
where for the last line we used $e^{-x} \leq 1 - x + x^2$ and $\ln(1 - x) \leq -x$, both valid for all $|x| \leq 1$. Summing over $t$, telescoping, and using that the relative entropy is always non-negative proves the theorem.
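The key step of the proof, the per-step bound on the potential decrease, can be checked numerically. The following sketch (made-up sizes, random instances) verifies it for the MWU update:

```python
import numpy as np

rng = np.random.default_rng(2)
N, eta = 6, 0.2

def kl(a, b):
    return np.sum(a * np.log(a / b))

for _ in range(1000):
    p = rng.dirichlet(np.ones(N))      # current play p^(t)
    q = rng.dirichlet(np.ones(N))      # arbitrary comparison distribution
    m = rng.uniform(-1, 1, size=N)     # losses m^(t)

    w = np.exp(-eta * m) * p
    p_next = w / w.sum()               # MWU update p^(t+1)

    lhs = kl(q, p_next) - kl(q, p)     # change in potential
    rhs = eta * q @ m - eta * p @ m + eta**2 * (m**2) @ p
    assert lhs <= rhs + 1e-12

print("per-step potential inequality holds on all random instances")
```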

2.4 Applications

We now give three applications of the multiplicative weights algorithm: to finding linear classifiers, solving zero-sum games, and solving linear programs. You'll treat other applications in your homework.

2.4.1 Finding linear classifiers

Let's get back to our first example, learning a linear classifier. For simplicity, we are going to assume that there exists a strict classifier, in the sense that there exists $x^* \in \mathbb{R}^N$ such that $\|x^*\|_1 = 1$ and
\[
\forall j \in \{1, \ldots, k\}, \qquad \ell_j \, (a_j \cdot x^*) \,\geq\, \varepsilon \tag{2.3}
\]
for some $\varepsilon > 0$ (here $\|a\|_1 = \sum_i |a_i|$).

First we observe that we may also assume that $x^* \in \mathbb{R}^N_+$. For this, create a new input by setting $b_j = \binom{a_j}{-a_j} \in \mathbb{R}^{2N}$. Suppose there exists $x \in \mathbb{R}^N$ such that (2.3) holds. Let $x_+ = \max(x, 0)$ and $x_- = \min(x, 0)$ (component-wise), so $x = x_+ + x_-$. Then $y = \binom{x_+}{-x_-}$ has non-negative entries summing to 1, and satisfies (2.3) with the $b_j$ replacing the $a_j$. Conversely, for any $y = \binom{y_1}{y_2} \in \mathbb{R}^{2N}_+$ with entries summing to 1 such that (2.3) holds (for the $b_j$), we can define $x = \frac{1}{2}(y_1 - y_2)$, so $\|x\|_1 \leq 1$ and $x$ satisfies (2.3) with the $a_j$, and $\varepsilon$ replaced by $\frac{1}{2}\varepsilon$. Thus the two problems are equivalent, and for the remainder of this section we assume that (2.3) is satisfied for some $x^* \in \mathbb{R}^N_+$ such that $\sum_i x^*_i = 1$ (we keep using the notation $a_j$). We also define $\rho = \max_j \|a_j\|_\infty$, where $\|a\|_\infty = \max_{i = 1, \ldots, N} |a_i|$.

The idea behind applying the MWA here is to consider each of the $N$ coordinates of the vectors $a_j$ as an expert, and choose the loss vector $m^{(t)}$ as a function of one of the $a_j$ that is not well classified by the current vector $p^{(t)}$. If $p^{(t)}$ is such that all of the $a_j$ satisfy $\ell_j (a_j \cdot p^{(t)}) \geq 0$, then all points are well classified by $x = p^{(t)}$ and we can stop the algorithm. Otherwise, we update $p^{(t)}$ through the MWA update rule and go to the next step. The question here is to determine why the algorithm terminates, and in how many steps.

Let us analyze the MW approach in a bit more detail. We initialize $w_i^{(1)} = 1$ for all $i$, and $p^{(t)}$ accordingly. At each round $t$, we look for a value of $j$ such that $\ell_j (a_j \cdot p^{(t)}) < 0$, i.e. $a_j$ is not classified correctly. If there is none, the algorithm stops. If there is one, pick this $j$ and define $m^{(t)} = -\frac{\ell_j}{\rho} a_j$; note that by definition of $\rho$, $m_i^{(t)} \in [-1, 1]$ for all $i$. By assumption, there exists $x^*$ such that $m^{(t)} \cdot x^* = -\ell_j (a_j \cdot x^*)/\rho \leq -\varepsilon/\rho$.

By the main MW theorem (Theorem 2.3), since $x^*$ is a distribution, we have
\[
\sum_{t=1}^T m^{(t)} \cdot p^{(t)} \,\leq\, \sum_{t=1}^T m^{(t)} \cdot x^* + \frac{\ln N}{\eta} + \eta T.
\]
Using $m^{(t)} \cdot x^* \leq -\varepsilon/\rho$, we get
\[
\sum_{t=1}^T m^{(t)} \cdot p^{(t)} \,\leq\, -\frac{\varepsilon}{\rho} T + \frac{\ln N}{\eta} + \eta T.
\]
Now, one should remark that until the last time step, i.e. until the algorithm stops, we chose $a_j$ such that $\ell_j (a_j \cdot p^{(t)}) < 0$, which implies that for all $t = 1, \ldots, T$, $m^{(t)} \cdot p^{(t)} = -\ell_j (a_j \cdot p^{(t)})/\rho > 0$. It immediately follows that
\[
0 \,<\, -\frac{\varepsilon}{\rho} T + \frac{\ln N}{\eta} + \eta T.
\]

Finally, choosing $\eta = \frac{\varepsilon}{2\rho}$, we obtain $T < \frac{4 \rho^2 \ln N}{\varepsilon^2}$. This proves that the algorithm terminates, and that it finds a classifier in fewer than $\frac{4 \rho^2 \ln N}{\varepsilon^2}$ time steps.
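Putting the pieces together gives a short Winnow-style training loop. The following is a sketch only: the instance at the bottom is made up, chosen so that $x^* = (0.6, 0.4)$ is a strict classifier with margin $\varepsilon = 0.1$, and the data is assumed to already be in the non-negative form discussed above:

```python
import numpy as np

def mwu_classifier(A, labels, eps, max_iter=100000):
    """Find p in the simplex with l_j * (a_j . p) >= 0 for all j,
    assuming a strict classifier with margin eps exists (after the
    reduction making x* non-negative)."""
    k, N = A.shape
    rho = np.abs(A).max()
    eta = eps / (2 * rho)
    w = np.ones(N)
    for _ in range(max_iter):
        p = w / w.sum()
        margins = labels * (A @ p)
        bad = int(np.argmin(margins))     # a misclassified example, if any
        if margins[bad] >= 0:
            return p                      # every example classified
        m = -labels[bad] * A[bad] / rho   # loss vector over the N experts
        w = w * np.exp(-eta * m)
    raise RuntimeError("no classifier found; margin assumption may fail")

# Hypothetical instance: x* = (0.6, 0.4) strictly classifies these points.
A = np.array([[0.5, 0.2], [0.1, 0.9], [-0.3, -0.4]])
labels = np.array([1, 1, -1])
print(mwu_classifier(A, labels, eps=0.1))
```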

2.4.2 Zero-sum games

Next we consider games in the game-theoretic sense: informally, a set of $k$ players each have a set of actions they can choose from; every player chooses an action, after which she gets a payoff, or utility, that is a function of her own action and the actions of the other players. Here we will limit ourselves to 2-player zero-sum games, in which each of the two players has a finite number $N$ of available actions. The payoffs of players 1 and 2 (let us call them the row and column players from now on) can be represented by a single payoff matrix $M \in \mathbb{R}^{N \times N}$ such that if the row player plays action $i$ and the column player plays $j$, then the column player receives payoff $M(i, j)$ and the row player receives $-M(i, j)$. A player can decide to randomize and choose a distribution over actions which she will sample from. In particular, if the row player plays according to a distribution $p$ over row actions (instead of playing one specific action) and the column player plays according to a distribution $q$ over column actions, then the expected payoff of the column player is $p^\top M q$ and the expected payoff of the row player is $-p^\top M q$.

Given that each player is trying to maximize her utility, we can see that the row and column players in a zero-sum game have conflicting objectives: the column player wants to maximize $p^\top M q$, while the row player wants to minimize the same quantity. One nice property of zero-sum games is that it does not matter for either of the players whether they act first: von Neumann's minimax theorem states that
\[
\max_q \min_p \; p^\top M q \,=\, \min_p \max_q \; p^\top M q \,=\, \lambda.
\]
This means in particular that it does not matter whether the row player acts first, minimizing $\max_q p^\top M q$, and lets the column player act second, or the column player acts first, maximizing $\min_p p^\top M q$, and the row player then acts second: the utility both players end up getting does not change. This optimal utility $\lambda$ is called the value of the game.

We want to show that the MWA allows us to find strategies that are near-optimal for a zero-sum game, in the sense that for any $\varepsilon > 0$ and for a sufficient number of time steps $T$, the MWA finds a pair of strategies $\big( \frac{1}{T} \sum_t p^{(t)}, \frac{1}{T} \sum_t q^{(t)} \big)$ for the row and column players that satisfies
\[
\lambda \,\leq\, \Big( \frac{1}{T} \sum_{t=1}^T p^{(t)} \Big)^\top M \Big( \frac{1}{T} \sum_{t=1}^T q^{(t)} \Big) \,\leq\, \lambda + \varepsilon.
\]
The idea behind the MW here is to take $p^{(t)}$ to be the strategy of the row player. At each time step, $q^{(t)}$ is chosen as the best response to strategy $p^{(t)}$ (i.e. the strategy which maximizes the utility of the column player given that the row player plays $p^{(t)}$); then, the row player uses the multiplicative weights update, based on a loss vector $m^{(t)}$ given by the expected losses of the row player when the column player plays $q^{(t)}$, to decide what $p^{(t+1)}$ will be in the next time step. Intuitively, the row player tries to penalize her actions that lead to a low payoff for her, and to put more weight on the actions that lead to a higher payoff. Formally, the algorithm proceeds as follows:

Algorithm 2: MWA for zero-sum games

  Choose $\eta \in (0, 1)$;
  Initialize weights $w_i^{(1)} := 1$ for each $i \in \{1, \ldots, N\}$;
  for $t = 1, 2, \ldots, T$ do
    Define the distribution $p^{(t)} = \big(w_1^{(t)}/\Phi^{(t)}, \ldots, w_N^{(t)}/\Phi^{(t)}\big)$, where $\Phi^{(t)} = \sum_i w_i^{(t)}$;
    Define $q^{(t)} = \mathrm{argmax}_q \, (p^{(t)})^\top M q$ and $m^{(t)} = M q^{(t)}$;
    Penalize costly decisions by updating their weights: $w_i^{(t+1)} \leftarrow e^{-\eta m_i^{(t)}} w_i^{(t)}$;
  end

Let's analyze this algorithm. Assume that the payoffs satisfy $M(i, j) \in [-1, 1]$. Choosing $\eta = \sqrt{\ln N / T}$ and $T = 4 \ln N / \varepsilon^2$, according to the MW theorem, for any distribution $p^*$,
\[
\frac{1}{T} \sum_{t=1}^T m^{(t)} \cdot p^{(t)} \,=\, \frac{1}{T} \sum_{t=1}^T (p^{(t)})^\top M q^{(t)} \,\leq\, \frac{1}{T} \sum_{t=1}^T (p^*)^\top M q^{(t)} + \varepsilon. \tag{2.4}
\]
Take $p^* = \mathrm{argmin}_p \, \frac{1}{T} \sum_t p^\top M q^{(t)}$. Then
\[
\frac{1}{T} \sum_{t=1}^T (p^*)^\top M q^{(t)} \,=\, \min_p \; p^\top M \Big( \frac{1}{T} \sum_{t=1}^T q^{(t)} \Big) \,\leq\, \max_q \min_p \; p^\top M q \,=\, \lambda,
\]
and therefore (2.4) implies
\[
\frac{1}{T} \sum_{t=1}^T (p^{(t)})^\top M q^{(t)} \,\leq\, \lambda + \varepsilon.
\]
To lower bound $\frac{1}{T} \sum_t (p^{(t)})^\top M q^{(t)}$, note that for every distribution $q$, $(p^{(t)})^\top M q^{(t)} \geq (p^{(t)})^\top M q$ given the choice of $q^{(t)}$ made in the algorithm; therefore,
\[
(p^{(t)})^\top M q^{(t)} \,=\, \max_q \, (p^{(t)})^\top M q \,\geq\, \min_p \max_q \; p^\top M q \,=\, \lambda.
\]
Using the fact that the above bounds are valid for any $t$, by linearity we get that the distributions produced by the algorithm satisfy
\[
\lambda \,\leq\, \Big( \frac{1}{T} \sum_{t=1}^T p^{(t)} \Big)^\top M \Big( \frac{1}{T} \sum_{t=1}^T q^{(t)} \Big) \,\leq\, \lambda + \varepsilon.
\]

This equation shows that $\bar{p} = \frac{1}{T} \sum_t p^{(t)}$ and $\bar{q} = \frac{1}{T} \sum_t q^{(t)}$ are approximate equilibria of the game, in the sense that neither the row nor the column player can improve her utility by more than $\varepsilon$ by changing her strategy, when the strategy of the other player remains fixed. Indeed, imagine that the column player decides to change her strategy to another $q$. Then, she is going to get payoff
\[
\Big( \frac{1}{T} \sum_{t=1}^T p^{(t)} \Big)^\top M q \,\leq\, \frac{1}{T} \sum_{t=1}^T (p^{(t)})^\top M q^{(t)} \,\leq\, \lambda + \varepsilon.
\]
However, she is guaranteed to earn payoff at least $\lambda$ when playing $\bar{q}$, and can therefore not improve her payoff by more than $\varepsilon$ by changing her strategy. A similar reasoning applies to the row player.
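Here is a sketch of Algorithm 2 (names are my own; the best response can always be taken to be a pure action, so the argmax runs over columns). It is run on matching pennies, a made-up example whose value is $\lambda = 0$ with optimal strategies $(1/2, 1/2)$:

```python
import numpy as np

def mwu_zero_sum(M, eps):
    """Approximate the value and optimal strategies of a zero-sum game
    with payoffs M[i, j] in [-1, 1] (paid by the row to the column player)."""
    N = M.shape[0]
    T = int(np.ceil(4 * np.log(N) / eps**2))
    eta = np.sqrt(np.log(N) / T)
    w = np.ones(N)
    p_avg = np.zeros(N)
    q_avg = np.zeros(M.shape[1])
    for _ in range(T):
        p = w / w.sum()
        j = int(np.argmax(p @ M))   # column player's best (pure) response
        p_avg += p / T
        q_avg[j] += 1.0 / T
        m = M[:, j]                 # row player's losses against action j
        w = w * np.exp(-eta * m)
    return p_avg, q_avg, p_avg @ M @ q_avg

# Matching pennies (hypothetical example).
M = np.array([[1.0, -1.0], [-1.0, 1.0]])
p, q, value = mwu_zero_sum(M, eps=0.1)
print(p, q, value)   # both close to (0.5, 0.5); value close to 0
```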

2.4.3 Solving linear programs

Our goal is to check the feasibility of a set of linear inequalities,
\[
Ax \geq b, \qquad x \geq 0,
\]
where $A$ is an $m \times n$ matrix with rows $a_1, \ldots, a_m$ and $x$ is an $n$-dimensional vector; or, more precisely, to find an approximately feasible solution $x \geq 0$ such that for some $\varepsilon > 0$, $a_i \cdot x \geq b_i - \varepsilon$ for all $i$.

The analysis will be based on an oracle that answers the following question: given a vector $c$ and a scalar $d$, does there exist an $x \geq 0$ such that $c \cdot x \geq d$? With this oracle, we will be able to repeatedly check whether a convex combination of the initial linear inequalities $a_i \cdot x \geq b_i$ is infeasible, a condition that is sufficient for the infeasibility of our original problem. Note that the oracle is straightforward to construct, as it involves a single inequality. In particular, it returns a negative answer exactly when $d > 0$ and $c \leq 0$ (component-wise).

The algorithm is as follows. Experts correspond to each of the $m$ constraints, and loss functions are associated with points $x \geq 0$. The loss suffered by the $i$-th expert will be $m_i = \frac{1}{\rho}(a_i \cdot x - b_i)$, where $\rho > 0$ is a parameter used to ensure that the losses satisfy $m_i \in [-1, 1]$. (Although one might expect the penalty to be the violation of the constraint, it is exactly the opposite; the reason is that the algorithm is trying to actually prove infeasibility of the problem.) In the $t$-th round, we use our distribution $p^{(t)}$ over experts to generate an inequality that would be valid, if the problem were feasible: the inequality is
\[
\sum_i p_i^{(t)} \, a_i \cdot x \,\geq\, \sum_i p_i^{(t)} \, b_i. \tag{2.5}
\]
We then use the oracle to detect whether this constraint is infeasible, in which case the original problem is infeasible, or to return a point $x^{(t)} \geq 0$ that satisfies the inequality. The loss we suffer is equal to $\frac{1}{\rho} \sum_i p_i^{(t)} (a_i \cdot x^{(t)} - b_i)$, and we use this loss to update our weights. Note that in case infeasibility is not detected, the penalty we pay is always non-negative, since $x^{(t)}$ satisfies the inequality (2.5).

The simplified form of the multiplicative weights theorem gives us the following guarantee: for any $i$,
\[
\sum_{t=1}^T m^{(t)} \cdot p^{(t)} \,\leq\, \sum_{t=1}^T m_i^{(t)} + 2 \sqrt{T \ln m}.
\]
Choose $T = 4 \rho^2 \ln(m) / \varepsilon^2$. Using that the left-hand side is non-negative, if we set $\bar{x} = \frac{1}{T} \sum_t x^{(t)}$ we see that for every $i$, $a_i \cdot \bar{x} \geq b_i - \varepsilon$, as desired. Indeed, $\sum_t m_i^{(t)} = \frac{T}{\rho}(a_i \cdot \bar{x} - b_i)$, and with this choice of $T$ we have $2\sqrt{T \ln m} = \frac{\varepsilon}{\rho} T$, so $0 \leq \frac{T}{\rho}(a_i \cdot \bar{x} - b_i) + \frac{\varepsilon}{\rho} T$.

The number of iterations $T$ required to decide approximate feasibility scales logarithmically with the number of constraints $m$. This is much better than standard algorithms such as the ellipsoid algorithm, which are polynomial in $n$. However, the dependence on the accuracy $\varepsilon$ is quadratic, which is much worse than these algorithms, for which the dependence is in $\log(1/\varepsilon)$. Finally, the dependence of $T$ on $\rho$ is also quadratic. The value of $\rho$ depends on the oracle (it is called the width of the oracle), and depending on the specific LP we are trying to solve it may or may not be possible to design an oracle that guarantees a small value of $\rho$.
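Finally, a sketch of the LP feasibility procedure. The oracle below is an assumption of this sketch: it maximizes $c \cdot x$ over a box $0 \leq x \leq R$, which keeps the width $\rho$ bounded but also restricts the search to that box; everything else follows the scheme above:

```python
import numpy as np

def mwu_lp_feasibility(A, b, eps, R=1.0):
    """Look for x >= 0 with A x >= b - eps, restricting the search to the
    box 0 <= x <= R (an assumed bound on the feasible region)."""
    m, n = A.shape
    rho = (np.abs(A).sum(axis=1) * R + np.abs(b)).max()  # width bound
    T = int(np.ceil(4 * rho**2 * np.log(m) / eps**2))
    eta = np.sqrt(np.log(m) / T)
    w = np.ones(m)
    x_avg = np.zeros(n)
    for _ in range(T):
        p = w / w.sum()
        c, d = p @ A, p @ b          # combined constraint c.x >= d
        x = np.where(c > 0, R, 0.0)  # oracle: maximize c.x over the box
        if c @ x < d:
            return None              # combined constraint infeasible
        x_avg += x / T
        losses = (A @ x - b) / rho   # satisfied constraints lose weight,
        w = w * np.exp(-eta * losses)  # violated ones gain weight
    return x_avg

# Hypothetical instance: x = (1, 1) satisfies both constraints.
A = np.array([[1.0, 0.0], [0.0, 1.0]])
b = np.array([0.5, 0.5])
print(mwu_lp_feasibility(A, b, eps=0.25))
```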