COS 511: Theoretical Machine Learning
Lecturer: Rob Schapire        Lecture 24
Scribe: Sachin Ravi           May 2, 2013


1 Review of Zero-Sum Games

At the end of last lecture, we discussed a model for two-player games (call the players Mindy (row) and Max (column)). We said that any such game could be described by its game matrix $M$, which is constructed from Mindy's point of view and thus describes Mindy's loss for a pair of chosen moves. We also said that there were two types of play: deterministic and randomized. In deterministic play, the players choose according to pure strategies, each picking the single move it will play. In randomized play, the players choose distributions over their sets of possible moves, according to which their moves will be picked.

In deterministic play, we said the outcome was $M(i, j) \in [0, 1]$ if Mindy plays row $i$ and Max plays column $j$. In randomized play, we have an expected outcome. If Mindy has picked distribution $P$ and Max has picked distribution $Q$, the expected outcome is
$$\sum_{i,j} P(i) \, M(i, j) \, Q(j) = P^\top M Q.$$
For randomized games, we often use the notation $M(P, Q) = P^\top M Q$ to denote the expected outcome as a function of the distributions chosen by the two players.

2 Minimax Theorem

On first look, there seems to be an obvious advantage to playing second, implying that
$$\max_Q \min_P M(P, Q) \le \min_P \max_Q M(P, Q).$$
However, John von Neumann showed that such an advantage does not exist: regardless of who goes first, in a game with optimal players, the expected outcome is always the same. Denoting this optimal value as $v$, we have
$$v = \max_Q \min_P M(P, Q) = \min_P \max_Q M(P, Q).$$
We will prove the following theorem using an online learning algorithm.

Theorem 1 (von Neumann minimax theorem). For randomized zero-sum games of two players:
$$\min_P \max_Q M(P, Q) = \max_Q \min_P M(P, Q).$$

Letting $P^* = \arg\min_P \max_Q M(P, Q)$ and $Q^* = \arg\max_Q \min_P M(P, Q)$, the theorem implies the following:
$$\forall Q : M(P^*, Q) \le v \qquad (1)$$
$$\forall P : M(P, Q^*) \ge v \qquad (2)$$
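To make the expected-outcome formula concrete, here is a minimal numerical sketch in Python (the $2 \times 2$ payoff matrix and the two distributions are made up for the example):

    import numpy as np

    # Hypothetical game matrix: M[i, j] is Mindy's loss when she plays
    # row i and Max plays column j (all entries in [0, 1]).
    M = np.array([[0.0, 1.0],
                  [1.0, 0.0]])

    P = np.array([0.5, 0.5])  # Mindy's distribution over rows
    Q = np.array([0.3, 0.7])  # Max's distribution over columns

    # Expected outcome M(P, Q) = sum_{i,j} P(i) M(i,j) Q(j) = P^T M Q
    print(P @ M @ Q)          # 0.5 for this matching-pennies-style game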

Here, (1) implies that even if Max knows Mindy's strategy $P^*$, the most loss Max can inflict is bounded by $v$, and (2) implies that regardless of what strategy Mindy uses, her loss will be at least $v$. In this sense, $P^*$ is optimal for Mindy. With knowledge of $M$, we can find $P^*$ through linear programming. However, there are some issues to consider to see why things are not always so simple:

1. We don't know the matrix $M$, because arbitrary interactions are being modeled.
2. The matrix $M$ is very large (we cannot apply linear programming).
3. In reality, the opponent may not be optimal and/or adversarial. The strategy $P^*$ is only optimal against an optimal opponent.

Thus, when games are played repeatedly, it is useful to learn the game matrix $M$ and/or the opponent's strategy without knowledge of either at the beginning of the game. We consider the following online version with $T$ iterations of the game:

for $t = 1, \ldots, T$ do
    Mindy (learner) chooses $P_t$
    Max (environment) chooses $Q_t$ with knowledge of $P_t$
    Learner suffers loss $M(P_t, Q_t)$
    Learner can observe $M(i, Q_t)$ $\forall i$
end for

Two important points to notice here are that Max chooses $Q_t$ with knowledge of Mindy's distribution $P_t$, and that Mindy can observe $M(i, Q_t)$ for all $i$, meaning for example that if $Q_t$ is concentrated on a pure strategy $j$, then Mindy can observe the entire column $j$ of $M$.

The total loss here is $\sum_{t=1}^T M(P_t, Q_t)$. The learner wishes to minimize this total loss when compared to the best possible loss had the learner chosen the best fixed strategy for all of the iterations. We want to show:
$$\sum_{t=1}^T M(P_t, Q_t) \le \min_P \sum_{t=1}^T M(P, Q_t) + \text{small}.$$

2.1 Multiplicative Updates

Let us consider a simple multiplicative weights update algorithm, which outputs distributions over rows in the following manner:
$$P_1(i) = \frac{1}{n}$$
$$P_{t+1}(i) = \frac{P_t(i) \, \beta^{M(i, Q_t)}}{Z_t},$$
where $0 < \beta < 1$ and $Z_t$ is the normalizing constant. This update is similar in style to the weighted majority algorithm: the bigger the loss for a row $i$, the smaller the probability that we play that row in the future. Using this update, we get the following result.
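As a minimal sketch of one round of this update in code (the observed loss column $M(i, Q_t)$ is taken as given, matching the protocol above, and $\beta$ is left as a parameter):

    import numpy as np

    def mw_update(P_t, losses, beta=0.9):
        # One multiplicative weights step: P_{t+1}(i) = P_t(i) * beta**M(i, Q_t) / Z_t.
        # losses[i] is the observed loss M(i, Q_t) for row i, in [0, 1].
        w = P_t * beta ** losses   # rows with bigger loss get downweighted
        return w / w.sum()         # divide by Z_t to renormalize

    n = 4
    P = np.ones(n) / n             # start from the uniform distribution P_1
    P = mw_update(P, losses=np.array([1.0, 0.0, 0.5, 0.0]))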

Theorem 2 (Multiplicative Weights Update). Using the multiplicative weights update, we get
$$\sum_{t=1}^T M(P_t, Q_t) \le a_\beta \min_P \sum_{t=1}^T M(P, Q_t) + c_\beta \ln n,$$
where $a_\beta$ and $c_\beta$ are functions of $\beta$.

We will not prove the above result, as it involves analysis similar to that done for the weighted majority algorithm, mainly using a potential function argument.

Corollary 3. We can choose $\beta$ so that
$$\frac{1}{T} \sum_{t=1}^T M(P_t, Q_t) \le \min_P \frac{1}{T} \sum_{t=1}^T M(P, Q_t) + \Delta_T,$$
where $\Delta_T = O\left(\sqrt{\frac{\ln n}{T}}\right)$.

We can see that $\Delta_T \to 0$ as $T \to \infty$, and so the corollary states that the average per-round loss for the learner approaches the best possible average per-round loss.

Using the algorithm and its corresponding result, we will prove the minimax theorem from above. We say that Mindy uses the multiplicative weights algorithm to set $P_t$ and that Max sets $Q_t$ so as to maximize Mindy's loss:
$$Q_t = \arg\max_Q M(P_t, Q).$$
We also define
$$\bar{P} = \frac{1}{T} \sum_{t=1}^T P_t \qquad \bar{Q} = \frac{1}{T} \sum_{t=1}^T Q_t.$$

Proof. We can see that both $\bar{P}$ and $\bar{Q}$ are distributions, since they are averages of distributions. Since we know that $\max_Q \min_P M(P, Q) \le \min_P \max_Q M(P, Q)$, to show equality we need to show that $\max_Q \min_P M(P, Q) \ge \min_P \max_Q M(P, Q)$. We have
$$\min_P \max_Q M(P, Q) \le \max_Q M(\bar{P}, Q) \qquad (3)$$
$$= \max_Q \frac{1}{T} \sum_{t=1}^T M(P_t, Q) \qquad (4)$$
$$\le \frac{1}{T} \sum_{t=1}^T \max_Q M(P_t, Q) \qquad (5)$$
$$= \frac{1}{T} \sum_{t=1}^T M(P_t, Q_t) \qquad (6)$$
$$\le \min_P \frac{1}{T} \sum_{t=1}^T M(P, Q_t) + \Delta_T \qquad (7)$$
$$= \min_P M(P, \bar{Q}) + \Delta_T \qquad (8)$$
$$\le \max_Q \min_P M(P, Q) + \Delta_T, \qquad (9)$$
where again $\Delta_T \to 0$ as $T \to \infty$, meaning that the desired result follows. Here, (3) follows by definition of the minimum, (4) by definition of $\bar{P}$, (5) by convexity (the maximum of an average is at most the average of the maxima), (6) by definition of $Q_t$, (7) by Corollary 3, (8) by definition of $\bar{Q}$, and (9) by definition of the maximum.

If we skip the first inequality, we get that $\max_Q M(\bar{P}, Q) \le v + \Delta_T$, where $v = \max_Q \min_P M(P, Q)$. Thus, taking the average of the $P_t$'s computed over the $T$ rounds of the algorithm, we get a distribution that is within $\Delta_T$ of the optimal. By running the algorithm for more rounds, we can get closer and closer to the optimal, and thus we call $\bar{P}$ an approximate minmax strategy. By a similar argument, we can show that $\bar{Q}$ is an approximate maxmin strategy.
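Putting these pieces together gives a tiny zero-sum game solver: Mindy runs multiplicative weights, Max best-responds each round, and the averaged strategy $\bar{P}$ approximates the minmax strategy. A minimal sketch (the game matrix is again a made-up example):

    import numpy as np

    def approx_minmax(M, T=10000, beta=0.95):
        # Approximate minmax row strategy for game matrix M:
        # MW for the row player against a best-responding column player.
        n = M.shape[0]
        P = np.ones(n) / n
        P_bar = np.zeros(n)
        for _ in range(T):
            j = int(np.argmax(P @ M))  # Q_t concentrated on Max's best column
            P_bar += P / T             # running average of the P_t's
            w = P * beta ** M[:, j]    # multiplicative update on observed column
            P = w / w.sum()
        return P_bar

    M = np.array([[0.0, 1.0],
                  [1.0, 0.0]])
    print(approx_minmax(M))            # close to [0.5, 0.5], the minmax strategy here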

3 Relation to Online Learning

We will now try to relate our previous analysis to the setting of online learning. The problem is as follows:

for $t = 1, \ldots, T$ do
    Observe $x_t \in X$
    Predict $\hat{y}_t \in \{0, 1\}$
    Observe true label $c(x_t)$ (mistake if $c(x_t) \ne \hat{y}_t$)
end for

Let $H$ be the set of all possible hypotheses. We associate each $h \in H$ with an expert, and we wish to perform as well as the best expert. As in the past, we would like to show

#mistakes $\le$ #mistakes of best $h$ + small.

Assuming the sets $H$ and $X$ are finite, we set up the game matrix $M$ for this problem as follows:
$$M = \begin{array}{c|cccc} & x_1 & x_2 & \cdots & x_n \\ \hline h_1 & M(1,1) & M(1,2) & \cdots & M(1,n) \\ h_2 & M(2,1) & M(2,2) & \cdots & M(2,n) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ h_m & M(m,1) & M(m,2) & \cdots & M(m,n) \end{array}$$
where
$$M(h, x) = \begin{cases} 1, & h(x) \ne c(x) \\ 0, & \text{otherwise.} \end{cases}$$

Given a particular $x_t \in X$, the algorithm uses the distribution $P_t$ to make a prediction on $x_t$: choose $h$ according to $P_t$, then let $\hat{y}_t = h(x_t)$. We then let $Q_t$ be the distribution concentrated on $x_t$. This setup implies the following result:
$$\underbrace{\sum_{t=1}^T M(P_t, x_t)}_{\mathbb{E}[\#\text{mistakes}]} \le \underbrace{\min_{h \in H} \sum_{t=1}^T M(h, x_t)}_{\#\text{mistakes of best } h} + \text{small}, \qquad (10)$$
since $M(P_t, x_t) = \sum_{h \in H} P_t(h) \, 1\{h(x_t) \ne c(x_t)\} = \Pr_{h \sim P_t}[h(x_t) \ne c(x_t)]$. If we properly plug into the above result, we will find that we achieve the same bound as in the analysis done for the weighted majority algorithm in the online learning model.
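A rough sketch of this reduction in code, reusing the same multiplicative update (the hypothesis class of thresholds, the target concept, and the stream of examples are stand-ins invented for illustration):

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy setup: H = threshold functions on [0, 1]; c is one of them.
    H = [lambda x, th=th: int(x >= th) for th in np.linspace(0, 1, 20)]
    c = H[7]                            # unknown target concept

    P = np.ones(len(H)) / len(H)        # P_t: distribution over experts
    beta, mistakes = 0.5, 0

    for t in range(100):
        x_t = rng.random()
        h = H[rng.choice(len(H), p=P)]  # draw an expert h ~ P_t ...
        y_hat = h(x_t)                  # ... and predict y_hat = h(x_t)
        y = c(x_t)
        mistakes += int(y_hat != y)
        # Q_t is concentrated on x_t, so the observed column of M is
        # losses[i] = M(h_i, x_t) = 1{h_i(x_t) != c(x_t)}.
        losses = np.array([int(h_i(x_t) != y) for h_i in H])
        w = P * beta ** losses
        P = w / w.sum()

    print(mistakes)  # expected mistakes exceed the best expert's by only O(ln |H|)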

4 Relation to Boosting

We now turn to boosting and see how to set it up as a game between two players: the boosting algorithm and the weak learner. Here, $H$ is the set of weak hypotheses and $X$ is the training set.

for $t = 1, \ldots, T$ do
    Boosting algorithm chooses distribution $D_t$ on $X$
    Weak learner chooses $h_t \in H$
end for

where it is assumed that
$$\Pr_{x \sim D_t}[h_t(x) \ne c(x)] \le \frac{1}{2} - \gamma.$$

We cannot use the game matrix $M$ from the last section because it would give us a distribution on rows (the $h$'s), whereas here we want a distribution on columns (the $x$'s). Thus, we flip the game matrix $M$ and renormalize so that the matrix values are in $[0, 1]$. We get the game matrix
$$M' = 1 - M^\top,$$
that is,
$$M' = \begin{array}{c|cccc} & h_1 & h_2 & \cdots & h_n \\ \hline x_1 & M'(1,1) & M'(1,2) & \cdots & M'(1,n) \\ x_2 & M'(2,1) & M'(2,2) & \cdots & M'(2,n) \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_m & M'(m,1) & M'(m,2) & \cdots & M'(m,n) \end{array}$$
where
$$M'(x, h) = \begin{cases} 1, & h(x) = c(x) \\ 0, & \text{otherwise.} \end{cases}$$

We then have that $D_t = P_t$ and that $Q_t$ is the distribution fully concentrated on the $h_t$ given to us. To be clear, the booster is simulating the multiplicative weights algorithm on this game matrix. Applying the multiplicative weights algorithm, we have
$$\frac{1}{T} \sum_{t=1}^T M'(P_t, h_t) \le \min_x \frac{1}{T} \sum_{t=1}^T M'(x, h_t) + \Delta_T.$$
Notice that
$$M'(P_t, h_t) = \sum_{x \in X} P_t(x) \, 1\{h_t(x) = c(x)\} = \Pr_{x \sim P_t}[h_t(x) = c(x)] \ge \frac{1}{2} + \gamma,$$
implying that
$$\frac{1}{2} + \gamma \le \min_x \frac{1}{T} \sum_{t=1}^T M'(x, h_t) + \Delta_T.$$
Since this inequality applies to the minimizing $x$, it also applies to all $x$:
$$\forall x : \frac{1}{T} \sum_{t=1}^T M'(x, h_t) \ge \frac{1}{2} + \gamma - \Delta_T > \frac{1}{2},$$
where the second inequality holds for large enough $T$, since $\Delta_T \to 0$ as $T \to \infty$. Note that $\frac{1}{T} \sum_{t=1}^T M'(x, h_t)$ is the fraction of weak hypotheses that correctly classify $x$. Since we have shown that for every $x$ this fraction is greater than $\frac{1}{2}$, the majority vote satisfies $\text{MAJ}(h_1(x), \ldots, h_T(x)) = c(x)$ for all $x$. Thus, this game formulation solves the problem of boosting in the simple case.
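A minimal sketch of this boosting game, with decision stumps standing in for the weak learner (the data, the stump class, and the exhaustive weak-learner search below are all invented for illustration; the booster's update is multiplicative weights on $M'$):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.random(50)                    # training points

    # Weak hypothesis class: decision stumps at fixed thresholds.
    stumps = [lambda x, th=th: (x >= th).astype(int)
              for th in np.linspace(0, 1, 11)]
    c = stumps[6](X)                      # hidden target: the stump at 0.6

    T, beta = 40, 0.7
    D = np.ones(len(X)) / len(X)          # D_t = P_t: distribution over examples
    votes = np.zeros(len(X))

    for t in range(T):
        # Weak learner: the stump with the smallest D_t-weighted error.
        errs = [np.dot(D, h(X) != c) for h in stumps]
        h_t = stumps[int(np.argmin(errs))]
        votes += h_t(X)                   # accumulate for the final majority vote
        # Booster = MW on M'(x, h) = 1{h(x) = c(x)}: correctly classified
        # points lose weight, so hard points get more attention.
        w = D * beta ** (h_t(X) == c)
        D = w / w.sum()

    final = (votes / T > 0.5).astype(int) # MAJ(h_1(x), ..., h_T(x))
    print((final == c).mean())            # 1.0: the vote fits every training point

Since the target here is itself a stump, the weak learning assumption holds on every round, and this is exactly the simple case above: a plain majority vote over the $h_t$'s fits the whole training set.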