COS 511: Theoretical Machine Learning
Lecturer: Rob Schapire    Lecture #24    Scribe: Jad Bechara    May 2, 2018

1 Review of Game Theory

The games we will discuss are two-player games that can be modeled by a game matrix $M$ with all entries in $[0,1]$. Simultaneously, Mindy (the row player) chooses a row $i$ and Max (the column player) chooses a column $j$. From Mindy's point of view, her loss is $M(i,j)$, which she is trying to minimize and which Max is trying to maximize. Choosing a single row or a single column corresponds to playing deterministically, i.e. using pure strategies. Alternatively, we can allow the players to play in a randomized way (known as mixed strategies): simultaneously, Mindy chooses a distribution $P$ over rows (the strategy she ends up playing is $i \sim P$) and Max chooses a distribution $Q$ over columns (the strategy he ends up playing is $j \sim Q$). In this case, Mindy's expected loss (which we will sometimes refer to simply as her loss) is

$$\sum_{i,j} P(i)\, M(i,j)\, Q(j) \;=\; P^\top M Q \;=:\; M(P,Q) \qquad (1)$$

where $P(i)$ is the probability that Mindy plays strategy $i$, and $Q(j)$ is the probability that Max plays strategy $j$. [1]

2 Minimax Theorem

Suppose that the game is played sequentially. In other words, Mindy first picks a distribution $P$ over the rows; then, knowing $P$ and in order to maximize his expected gain, Max picks a distribution $Q = \arg\max_Q M(P,Q)$. [2] Now, knowing that Max will play in this fashion, Mindy will pick her mixed strategy to minimize her expected loss, which thus will be $\min_P \max_Q M(P,Q)$ (where the min is over all possible mixed row strategies, and the max is over all possible mixed column strategies). Similarly, if Max plays first, then Mindy's expected loss is $\max_Q \min_P M(P,Q)$. As we have seen when studying Support Vector Machines, in general, the second player is always at an advantage, i.e.

$$\max_Q \min_P M(P,Q) \;\le\; \min_P \max_Q M(P,Q) \qquad (2)$$

However, in this case, John von Neumann's Minimax theorem shows that for two-player zero-sum games (where one player's loss is the other player's gain):

$$\min_P \max_Q M(P,Q) \;=\; \max_Q \min_P M(P,Q) \;=:\; v \qquad (3)$$

[1] Note that we denote by $M(i,Q)$ Mindy's loss for playing some row $i$ when Max plays some distribution $Q$ over columns, and vice-versa for $M(P,j)$.
[2] For a fixed distribution $P$, this function is linear in $Q$, so the optimal strategy is a pure strategy: $\max_Q M(P,Q) = \max_j M(P,j)$.
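To make the matrix notation concrete, here is a minimal NumPy sketch (not from the lecture; the game matrix and strategies are made-up examples) of the expected loss $M(P,Q) = P^\top M Q$, and of the fact noted in footnote 2 that the inner optimization in the sequential game is attained at a pure strategy.

```python
import numpy as np

# Hypothetical game matrix with entries in [0, 1]; rows are Mindy's pure
# strategies, columns are Max's. This specific matrix is illustrative only.
M = np.array([[0.0, 1.0],
              [1.0, 0.0]])

def expected_loss(M, P, Q):
    """Mindy's expected loss M(P, Q) = P^T M Q for mixed strategies P, Q."""
    return float(P @ M @ Q)

def max_best_response(M, P):
    """max_Q M(P, Q): linear in Q, so attained at a pure column j."""
    return float(np.max(P @ M))

def mindy_best_response(M, Q):
    """min_P M(P, Q): likewise attained at a pure row i."""
    return float(np.min(M @ Q))

P = np.array([0.5, 0.5])
Q = np.array([0.5, 0.5])
print(expected_loss(M, P, Q))    # 0.5
print(max_best_response(M, P))   # 0.5: uniform P is minmax for this matrix
```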

where $v$ denotes the value of the game. In what follows, we will prove this theorem using techniques from online learning.

The Minimax theorem implies that there exists a $P^*$ such that for all $Q$, $M(P^*,Q) \le v$ (regardless of what Max plays, Mindy can ensure a loss of at most $v$), and that there exists a $Q^*$ such that for all $P$, $M(P,Q^*) \ge v$ (regardless of what Mindy plays, Max can ensure a gain of at least $v$). In this case, we say that $P^*$ is a minmax strategy and $Q^*$ a maxmin strategy. [3]

In classical game theory, having knowledge of $M$, one can compute $P^*$ and $Q^*$ directly (using linear programming). However, things are not always so simple. In some cases we might not have knowledge of $M$; in others, $M$ could be very large; and, most notably, playing $(P^*,Q^*)$ assumes both players are rational (optimal and adversarial), which is not always the case. Thus, a natural extension of the situation, which will allow for online learning, is to let the game be played repeatedly. Consider the following online algorithm: [4][5]

Algorithm 1: Learning Protocol
  INPUT: game matrix $M$, number of rows $n$, number of iterations $T$
  for $t = 1, \dots, T$ do
    Learner (Mindy) chooses $P_t$
    Environment (Max) chooses $Q_t$ (knowing $P_t$)
    Learner suffers loss $M(P_t, Q_t)$
    Learner observes $M(i, Q_t)$ for all $i$
  OUTCOME: Learner's total loss is $\sum_{t=1}^T M(P_t, Q_t)$.

We want to compare this total loss to the loss the learner would have gotten by fixing an optimal strategy for all rounds. In other words (writing everything as averages over the $T$ rounds), we want to prove:

$$\frac{1}{T}\sum_{t=1}^T M(P_t, Q_t) \;\le\; \min_P \frac{1}{T}\sum_{t=1}^T M(P, Q_t) \;+\; [\text{small regret}] \qquad (4)$$

Note that the first term on the right-hand side of the inequality is at most $v$, the value of the game (it can be smaller when the other player is sub-optimal).

2.1 Multiplicative Updates

Since the above looks like an online learning problem, we will construct an algorithm similar to the Weighted Majority algorithm (i.e. with multiplicative weights). Knowing that we need to compute $P_t$ on every update, consider the following update rules:

Algorithm 2: Multiplicative Weights (MW)
  $\forall i:\; P_1(i) = 1/n$
  $\forall i:\; P_{t+1}(i) = \dfrac{P_t(i)\,\beta^{M(i,Q_t)}}{\text{normalization}}$   for some fixed $\beta \in [0,1)$

This algorithm is called Multiplicative Weights, or MW. Notice that the greater the loss the learner would have suffered had row $i$ been played, the smaller the probability of choosing that row becomes.

[3] Also, $(P^*, Q^*)$ forms a Nash equilibrium.
[4] Everything is viewed from Mindy's perspective.
[5] On each iteration, the learner observes the loss it would have suffered for every row, not just for the row(s) it actually played.
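The protocol and the MW update fit in a few lines of code. Below is a minimal sketch (my own illustration, not code from the lecture): the environment is abstracted as a callback `choose_column`, the opponent is assumed to play pure columns for simplicity, and `beta` is left as a parameter (Theorem 1 below suggests a concrete choice).

```python
import numpy as np

def mw_play(M, T, choose_column, beta):
    """Algorithm 1 (learning protocol) with the Algorithm 2 (MW) update,
    viewed from Mindy's side. `choose_column(P, t)` models the environment
    and returns the column j_t played at round t (a pure Q_t, for simplicity)."""
    n, _ = M.shape
    P = np.full(n, 1.0 / n)              # P_1(i) = 1/n
    total_loss = 0.0
    for t in range(T):
        j = choose_column(P, t)          # environment's move Q_t
        losses = M[:, j]                 # learner observes M(i, Q_t) for all i
        total_loss += float(P @ losses)  # learner suffers M(P_t, Q_t)
        P = P * np.power(beta, losses)   # P_{t+1}(i) proportional to P_t(i) * beta^{M(i, Q_t)}
        P = P / P.sum()                  # normalization
    return total_loss
```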

Theorem 1. If we set $\beta = \dfrac{1}{1 + \sqrt{2\ln n / T}}$, then

$$\frac{1}{T}\sum_{t=1}^T M(P_t, Q_t) \;\le\; \min_P \frac{1}{T}\sum_{t=1}^T M(P, Q_t) \;+\; \Delta_T \qquad (5)$$

where $\Delta_T = O\!\left(\sqrt{\ln n / T}\right)$, so that the regret goes to $0$ as $T \to \infty$.

Proof. The proof involves an analysis similar to the one done for the Weighted Majority algorithm (a potential-function argument) and is omitted.

2.2 Minimax Theorem Proof

We will next prove the Minimax Theorem using the Multiplicative Weights algorithm and its analysis. Suppose that Mindy and Max behave according to the following procedure:

Algorithm 3
  for $t = 1, \dots, T$ do
    Mindy picks $P_t$ using Multiplicative Weights (Algorithm 2)
    Max picks $Q_t = \arg\max_Q M(P_t, Q)$

Define $\bar{P} = \frac{1}{T}\sum_{t=1}^T P_t$ and $\bar{Q} = \frac{1}{T}\sum_{t=1}^T Q_t$. Now, note the following chain of inequalities:

$$\min_P \max_Q M(P,Q) \;\le\; \max_Q M(\bar{P}, Q) \qquad (6)$$
$$= \max_Q \frac{1}{T}\sum_{t=1}^T M(P_t, Q) \qquad (7)$$
$$\le \frac{1}{T}\sum_{t=1}^T \max_Q M(P_t, Q) \qquad (8)$$
$$= \frac{1}{T}\sum_{t=1}^T M(P_t, Q_t) \qquad (9)$$
$$\le \min_P \frac{1}{T}\sum_{t=1}^T M(P, Q_t) + \Delta_T \qquad (10)$$
$$= \min_P M(P, \bar{Q}) + \Delta_T \qquad (11)$$
$$\le \max_Q \min_P M(P,Q) + \Delta_T \qquad (12)$$

where (6) is due to picking a particular value of $P$; (7) is by definition of $\bar{P}$; (8) is due to convexity (taking the maximum term-by-term can only give a greater or equal quantity); (9) is by definition of $Q_t$; (10) is by the MW algorithm and Theorem 1; (11) is by definition of $\bar{Q}$; and (12) is by definition of the max, since $\bar{Q}$ is one particular value of $Q$. Since $\Delta_T = O\!\left(\sqrt{\ln n / T}\right)$, taking $T \to \infty$ proves the result.
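Algorithm 3 is easy to simulate once MW is in place. The sketch below (illustrative code, not from the notes) runs Mindy by MW against a best-responding Max, returns the averaged strategies, and evaluates $\max_Q M(\bar{P}, Q)$, which by (13) below is at most $v + \Delta_T$.

```python
import numpy as np

def approximate_minmax(M, T):
    """Algorithm 3: Mindy plays MW, Max best-responds each round.
    Returns the average strategies and max_Q M(P_bar, Q), an upper
    estimate of the game value v up to Delta_T (see (13))."""
    n, m = M.shape
    beta = 1.0 / (1.0 + np.sqrt(2.0 * np.log(n) / T))   # as in Theorem 1
    P = np.full(n, 1.0 / n)
    P_bar, Q_bar = np.zeros(n), np.zeros(m)
    for t in range(T):
        j = int(np.argmax(P @ M))        # Q_t = argmax_Q M(P_t, Q) is pure
        P_bar += P / T
        Q_bar[j] += 1.0 / T
        P = P * np.power(beta, M[:, j])  # MW update
        P /= P.sum()
    v_upper = float(np.max(P_bar @ M))   # max_Q M(P_bar, Q) <= v + Delta_T
    return P_bar, Q_bar, v_upper
```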

Notice that by combining the right-hand term of (6) with the inequality in (12) we also get $\max_Q M(\bar{P}, Q) \le \max_Q \min_P M(P,Q) + \Delta_T$. Or, in other words:

$$\max_Q M(\bar{P}, Q) \;\le\; v + \Delta_T \qquad (13)$$

This implies that we can approximate the value of the game, $v$, using Algorithm 3 (i.e. $\bar{P}$ is an approximate minmax strategy).

Let us now observe how online learning and boosting can be seen as special cases of the above result.

3 Relation to Online Learning

Consider the following version of online learning:

Algorithm 4
  INPUT: finite hypothesis space $\mathcal{H}$, finite domain $X$, number of iterations $T$
  for $t = 1, \dots, T$ do
    Get $x_t \in X$
    Predict $\hat{y}_t \in \{0,1\}$
    Observe true label $c(x_t)$ (a mistake occurs if $\hat{y}_t \ne c(x_t)$)

We want to show:

$$\mathbb{E}[\#(\text{mistakes})] \;\le\; \min_{h \in \mathcal{H}} \#(\text{mistakes by } h) \;+\; [\text{small regret}].$$

In order to do so, let us construct a game matrix $M$ whose rows are indexed by $h \in \mathcal{H}$ and whose columns are indexed by $x \in X$. Define $M(h,x) = \mathbf{1}[h(x) \ne c(x)]$. Now, run the Multiplicative Weights algorithm on the matrix $M$. On each round $t$, MW will use the distribution $P_t$ to pick a hypothesis $h_t \sim P_t$ and predict $\hat{y}_t = h_t(x_t)$. Then, let $Q_t$ be concentrated on $x_t$ (i.e. probability $1$ on column $x_t$ and $0$ everywhere else). From the bound on the MW algorithm, we get:

$$\sum_{t=1}^T M(P_t, x_t) \;\le\; \min_{h \in \mathcal{H}} \sum_{t=1}^T M(h, x_t) \;+\; O\!\left(\sqrt{T \ln|\mathcal{H}|}\right) \qquad (14)$$

Note that it is enough to take the minimum on the right-hand side over pure strategies only. Also, notice that $\min_{h \in \mathcal{H}} \sum_t M(h, x_t)$ is the number of mistakes made by the best hypothesis in the class, and that:

$$\sum_{t=1}^T M(P_t, x_t) \;=\; \sum_{t=1}^T \mathbb{E}_{h_t \sim P_t}\!\left[\mathbf{1}[h_t(x_t) \ne c(x_t)]\right] \;=\; \sum_{t=1}^T \Pr_{\hat{y}_t}\!\left[\hat{y}_t \ne c(x_t)\right] \;=\; \mathbb{E}[\#(\text{mistakes})]. \qquad (15)$$

Plugging this into (14) yields the desired result.

4 Relation to Boosting

Now, let us look at a simplified version of boosting (with a known edge $\gamma > 0$):
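As a sanity check of this reduction, here is a small sketch (illustrative only; the function interface for hypotheses and the target $c$ is my assumption, not part of the lecture) of MW run over a finite hypothesis class, with each round's column concentrated on the arriving example.

```python
import numpy as np

def online_learning_via_mw(hypotheses, stream, c, beta, seed=0):
    """Rows of the game are hypotheses, columns are examples. On round t the
    learner draws h_t ~ P_t, predicts h_t(x_t), then reweights every hypothesis
    by beta^{1[h(x_t) != c(x_t)]}, i.e. MW on M(h, x) = 1[h(x) != c(x)]."""
    rng = np.random.default_rng(seed)
    n = len(hypotheses)
    P = np.full(n, 1.0 / n)
    mistakes = 0
    for x_t in stream:
        h_t = hypotheses[rng.choice(n, p=P)]    # pick h_t ~ P_t
        y_hat, y = h_t(x_t), c(x_t)             # predict, then observe the label
        mistakes += int(y_hat != y)
        losses = np.array([float(h(x_t) != y) for h in hypotheses])
        P = P * np.power(beta, losses)          # MW update
        P /= P.sum()
    return mistakes
```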

Algorithm 5
  INPUT: weak hypothesis space $\mathcal{H}$, training set $X$, number of iterations $T$
  for $t = 1, \dots, T$ do
    Booster chooses distribution $D_t$ over $X$
    Weak learner chooses $h_t \in \mathcal{H}$ such that $\Pr_{x \sim D_t}[h_t(x) \ne c(x)] \le \frac{1}{2} - \gamma$
  OUTPUT: $H = \mathrm{MAJORITY}(h_1, \dots, h_T)$.

Since the distributions $D_t$ are over the examples (not the hypotheses), using the matrix $M$ defined as in the previous analysis, construct $M' = \mathbf{1} - M^\top$ (that is, $M'(x,h) = 1 - M(h,x)$), so that $M'(x,h) = \mathbf{1}[h(x) = c(x)]$. In fact, $M$ and $M'$ represent exactly the same game but with the roles of the row and column players reversed: transposing switches the examples with the hypotheses; negating switches min and max; and adding $1$ to every entry simply translates the entries back to $[0,1]$ while having no impact on the game.

Now, run the Multiplicative Weights algorithm on the matrix $M'$. On each round $t$, MW will find a distribution $P_t$ on the rows of $M'$. So, let $D_t = P_t$, get $h_t \in \mathcal{H}$ from the weak learner, and let $Q_t$ be concentrated on $h_t$. From the bound on the MW algorithm, we get:

$$\frac{1}{T}\sum_{t=1}^T M'(P_t, h_t) \;\le\; \min_{x \in X} \frac{1}{T}\sum_{t=1}^T M'(x, h_t) \;+\; \Delta_T \qquad (16)$$

Note that $M'(P_t, h_t)$ is the probability of choosing a row (according to $D_t = P_t$) that is correctly classified by $h_t$, so $M'(P_t, h_t) \ge \frac{1}{2} + \gamma$. Combining this with the above and rearranging terms yields:

$$\forall x \in X: \quad \frac{1}{T}\sum_{t=1}^T M'(x, h_t) \;\ge\; \frac{1}{2} + \gamma - \Delta_T \;>\; \frac{1}{2} \qquad (17)$$

since $\Delta_T$ goes to $0$ as $T \to \infty$ (so the last inequality holds once $T$ is large enough that $\Delta_T < \gamma$). Notice that $\frac{1}{T}\sum_t M'(x, h_t)$ is the fraction of hypotheses that are correct on example $x$, so when we take the majority vote $H$, we immediately get that $H(x) = c(x)$. Therefore, the training error goes to $0$.

Finally, the two previous analyses show that (versions of) the Weighted Majority algorithm and AdaBoost can both be viewed as special cases of a more general algorithm for playing repeated games. Furthermore, the games used for online learning and boosting are in fact duals of each other, in the sense that they represent the exact same game, but with the row and column players reversed.
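The boosting game can be sketched in the same way. The code below illustrates the reduction only (the weak-learner interface and binary labels in {0,1} are my assumptions, not the lecture's): MW is run over the examples with loss $M'(x, h_t) = \mathbf{1}[h_t(x) = c(x)]$, and the output is the plain majority vote.

```python
import numpy as np

def boost_by_mw(xs, ys, weak_learner, T):
    """MW over the examples (rows of M'). `weak_learner(xs, ys, D)` is assumed
    to return h with Pr_{x~D}[h(x) != c(x)] <= 1/2 - gamma. Labels are in {0,1}."""
    m = len(xs)
    beta = 1.0 / (1.0 + np.sqrt(2.0 * np.log(m) / T))   # Theorem 1's choice, with n = |X|
    D = np.full(m, 1.0 / m)                              # D_1 uniform over the examples
    hs = []
    for t in range(T):
        h = weak_learner(xs, ys, D)
        hs.append(h)
        # Loss of example x under M' is 1[h_t(x) = c(x)], so correctly
        # classified examples are downweighted for the next round.
        correct = np.array([float(h(x) == y) for x, y in zip(xs, ys)])
        D = D * np.power(beta, correct)
        D /= D.sum()
    def majority(x):
        """Majority vote H(x) of the collected weak hypotheses."""
        return int(sum(h(x) for h in hs) > len(hs) / 2.0)
    return majority
```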