Learning, Games, and Networks

Abhishek Sinha
Laboratory for Information and Decision Systems, MIT
ML Talk Series @ CNRG, December 12, 2016

Outline
1 Prediction with Experts' Advice
2 Application to Game Theory
3 Online Shortest Path (OSP)
4 Open Problems

Prediction with Experts' Advice

Prediction with Experts' Advice: Setup [1]

Setting: We play a game against an opponent for $T$ rounds. At each round $t$ we first output a real number $\hat{x}_t$, and then the opponent reveals another real number $y_t$. Our (convex) loss at round $t$ is $\ell(\hat{x}_t, y_t)$.

The Experts: There are $N$ experts to assist us. At round $t$, expert $i$ forecasts a number $f_i(t)$. The prediction $\hat{x}_t$ may be any function of the past and present forecasts.

Expert 1 forecasts = {10, 4, 2, 30, ...}
Expert 2 forecasts = {20, 6, 50, 10, ...}
Expert 3 forecasts = {30, 5, 10, 42, ...}

Observed Information: After each round, we observe the loss incurred by every expert, $\ell(f_i(t), y_t)$.

Prediction Strategy

Problem: Design an online strategy for selecting $\{\hat{x}_t\}$ so that our cumulative loss remains close to that of the best fixed expert in hindsight.

Let $L(i, t)$ denote the total loss of expert $i$ up to time $t-1$, i.e.,

$$L(i, t) = \sum_{\tau=1}^{t-1} \ell(f_i(\tau), y_\tau) \qquad (1)$$

Performance Metric: A natural way to measure the performance of an online algorithm $\pi$ is to bound its regret $R^\pi(T)$ up to time $T$, defined as

$$R^\pi(T) = \max_{\{y_t\}} \Big( L^\pi(T) - \min_i L(i, T) \Big)$$

A natural strategy is to predict with a weighted combination $w(t)$ of the experts' forecasts at time $t$, where the weights decrease with the experts' cumulative losses up to time $t-1$. This gives more weight to the experts who have been doing better in terms of total loss. In other words, for a suitable weight vector $w(t)$, we predict

$$\hat{x}_t = \frac{\sum_i w_{i,t} f_i(t)}{\sum_i w_{i,t}} \qquad (2)$$

Performance of EXP

Question: How should we select the weights?

Strategy EXP: for some $\eta > 0$,

$$w_{i,t} = \exp(-\eta L(i, t-1)) \qquad (3)$$

Theorem (Regret of EXP):

$$R^{\mathrm{EXP}}(T) \le \frac{\ln N}{\eta} + \eta T$$

By choosing $\eta \sim \sqrt{\ln N / T}$, we obtain an $O(\sqrt{T \ln N})$ regret bound for the exponential-weighting algorithm. Thus the average regret per round diminishes to zero as $T \to \infty$. The proof involves a Lyapunov argument with the logarithm of the total weight, $L(t) = \ln(\sum_i w_{i,t})$, as the Lyapunov function.
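To make the rule concrete, here is a minimal sketch of an EXP forecaster combining the weighting (3) with the prediction (2). It is an illustration added for this transcript, not code from the talk; the absolute loss and a $[0, 1]$ loss range are assumptions.

```python
# A minimal EXP (exponential weights) forecaster: a sketch, assuming the
# absolute loss and losses bounded in [0, 1]; not code from the talk.
import numpy as np

def exp_forecaster(forecasts, outcomes, eta, loss=lambda x, y: abs(x - y)):
    """forecasts: (T, N) array of expert predictions; outcomes: length-T array."""
    T, N = forecasts.shape
    L = np.zeros(N)                         # cumulative loss of each expert
    total_loss = 0.0
    for t in range(T):
        # Weights (3); subtracting L.min() is only for numerical stability
        # and does not change the normalized weights.
        w = np.exp(-eta * (L - L.min()))
        x_hat = w @ forecasts[t] / w.sum()  # weighted prediction (2)
        total_loss += loss(x_hat, outcomes[t])
        L += np.array([loss(f, outcomes[t]) for f in forecasts[t]])
    return total_loss, L
```

With $\eta = \sqrt{\ln N / T}$, the quantity `total_loss - L.min()` is the realized regret against the best expert, which the theorem bounds by $O(\sqrt{T \ln N})$.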

Lower Bounds: Can We Do Better?

Here is an existential proof that, asymptotically, we cannot do better than $\Omega(\sqrt{T})$ in terms of regret.

Proof: Assume that there are two experts and an environment, all of which output i.i.d. fair binary sequences $\{f_i(t)\}, \{y_t\}$, independent of each other. Choose the loss function $\ell(x, y) = |x - y|$. Now observe:
- The expected loss of each expert at any slot $t$ is simply $1/2$, independent of $y_t$.
- The expected loss of the player at each slot is $1/2$, irrespective of its output $\hat{x}_t \in \{0, 1\}$.

It therefore suffices to show that the regret with respect to any single expert (say, the first) is $\Omega(\sqrt{T})$ with constant probability.

Proof (continued)

Analysis:

$$P\big(R^\pi(T) \ge \sqrt{T}\big) \ge P\big(L^\pi(T) - L(1, T) \ge \sqrt{T}\big) = P\Big( \sum_{t=1}^{T} \big( |y_t - \hat{x}_t| - |y_t - f_1(t)| \big) \ge \sqrt{T} \Big)$$

Define the random variable $Z_t = |y_t - \hat{x}_t| - |y_t - f_1(t)|$. Clearly, the sequence $\{Z_t\}$ is i.i.d. with zero mean and variance $1/2$. Thus we have

$$\lim_{T \to \infty} P\big(R^\pi(T) \ge \sqrt{T}\big) \ge \lim_{T \to \infty} P\Big( \sum_{t=1}^T Z_t \ge \sqrt{T} \Big) \stackrel{\mathrm{CLT}}{=} 1 - \Phi(\sqrt{2}) \approx 0.07$$

Thus, for large enough $T$, there is a strictly positive probability that the regret exceeds $\sqrt{T}$. This shows that there exist a sequence of forecasts $f_i(t)$ and an adversarial sequence $y_t$ such that no online strategy achieves regret smaller than $\Omega(\sqrt{T})$.
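The CLT step is easy to sanity-check numerically. Below is a small Monte Carlo sketch (an addition for this transcript, not from the talk); since all coins are i.i.d. Bernoulli(1/2), any strategy is statistically equivalent here, so $\hat{x}_t$ is drawn uniformly at random.

```python
# Estimate P(sum_t Z_t >= sqrt(T)) for Z_t = |y_t - xhat_t| - |y_t - f1_t|
# with i.i.d. fair coins; the CLT puts the limit near 1 - Phi(sqrt(2)) ~ 0.07.
import numpy as np

rng = np.random.default_rng(0)
T, trials = 1_000, 5_000
y = rng.integers(0, 2, size=(trials, T))
xhat = rng.integers(0, 2, size=(trials, T))   # any strategy is equivalent here
f1 = rng.integers(0, 2, size=(trials, T))
Z = np.abs(y - xhat) - np.abs(y - f1)
print(np.mean(Z.sum(axis=1) >= np.sqrt(T)))   # near 0.07 (finite-T approximation)
```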

Application to Game Theory

Setup: Two-Player Zero-Sum Games [2]

Consider a finite, two-player (row and column), zero-sum game with payoff matrix $M$. The row player plays a randomized strategy $P$ and the column player plays a randomized strategy $Q$. The losses of the row and column players are $P^T M Q$ and $-P^T M Q$, respectively.

Assume the game is played sequentially, in one of two orders:

Case 1: The row player moves first. The optimal strategy is given by the solution of

$$V_1 = \min_P \max_Q P^T M Q \qquad (4)$$

Case 2: The column player moves first. The optimal strategy is given by the solution of

$$V_2 = \max_Q \min_P P^T M Q, \qquad (5)$$

where $V_1$ and $V_2$ are the corresponding payoffs (losses) of the row player.

Minimax Theorem

Intuitively, moving after the other player is advantageous, since the second mover knows the opponent's move. Hence it is no surprise that

$$V_2 = \max_Q \min_P P^T M Q \le \min_P \max_Q P^T M Q = V_1 \qquad (6)$$

The startling result of von Neumann is that the second player has no advantage in expectation if the first player plays optimally:

Theorem (Minimax Theorem). For any payoff matrix $M$:

$$\min_P \max_Q P^T M Q = \max_Q \min_P P^T M Q \qquad (7)$$
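As a concrete check (an example added here, not on the slides): for Matching Pennies with

$$M = \begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix},$$

any deterministic first move can be exploited by the second mover, but $P = Q = (1/2, 1/2)$ gives $P^T M Q = 0$ against every response, so $V_1 = V_2 = 0$, as the theorem promises.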

EXP and Game Theory

Questions:
- How do we find the minimax strategy? Classically, the minimax strategy is obtained by solving an LP, which requires knowledge of the full payoff matrix.
- What does this have to do with the theory of experts' advice?

Answers: We will see that the EXP algorithm gives an alternative online strategy for finding the minimax strategy, without requiring knowledge of the full payoff matrix $M$. It also yields a concise alternative proof of the minimax theorem (which is classically proved via LP duality).

A Minimax Strategy for the Learner

Setting: Repeated Games
- The learner is the row player and the adversary is the column player. The experts are the actions of the row player.
- At each round, the learner plays a randomized strategy $P_t$ and the adversary plays a randomized strategy $Q_t$ (which may depend on the learner's current choice and the past history).
- The learner observes the loss $M(i, Q_t)$ of each action $i$ at time $t$.
- At time $t$, the learner incurs the loss $M(P_t, Q_t) = P_t^T M Q_t$.

Learner's strategy: EXP. $P_t$ is chosen according to the EXP rule: the probability of playing action $i$ is proportional to the exponential of minus the cumulative loss incurred by action $i$, i.e.,

$$w_{t+1}(i) = w_t(i) \exp(-\eta M(i, Q_t)), \qquad P_t \propto w_t$$

Performance Analysis

Invoking the general regret bound for the EXP strategy with $\eta \sim \sqrt{\ln N / T}$, we readily obtain

$$\text{Avg. Loss} = \frac{1}{T} \sum_{t=1}^T M(P_t, Q_t) \le \min_P \frac{1}{T} \sum_{t=1}^T M(P, Q_t) + \Delta_T, \qquad (8)$$

where $\Delta_T = O\big(\sqrt{\ln N / T}\big) \to 0$ as $T \to \infty$.

Corollary: Note that the above bound holds for any choice of the opponent's strategy $Q_t$. First, let $Q_t = Q_t^*$ be the maximizing response against $P_t$. From the left-hand side,

$$\text{Avg. Loss} = \frac{1}{T} \sum_{t=1}^T M(P_t, Q_t^*) \ge \min_P \max_Q M(P, Q) \qquad (9)$$

Next, bound the right-hand side: since $M(P, Q)$ is linear in $Q$, we have $\min_P \frac{1}{T} \sum_t M(P, Q_t^*) = \min_P M(P, \bar{Q}^*)$ with $\bar{Q}^* = \frac{1}{T} \sum_t Q_t^*$, and hence

$$\text{Avg. Loss} \le \min_P M(P, \bar{Q}^*) + \Delta_T \le \max_Q \min_P M(P, Q) + \Delta_T \qquad (10)$$

Proof (continued)

Combining the two bounds above, we get

$$\min_P \max_Q M(P, Q) \le \text{Avg. Loss} \le \max_Q \min_P M(P, Q) + \Delta_T \qquad (11)$$

Letting $\Delta_T \to 0$ and combining with the elementary inequality (6), we recover the famous minimax theorem.

Moreover, the above proof shows that by choosing $Q_t = \arg\max_Q M(P_t, Q)$, updating $P_t$ according to EXP, and letting $T \to \infty$, we may obtain (approximate) equilibrium strategies for both players as the averages

$$\bar{P} = \frac{1}{T} \sum_{t=1}^T P_t, \qquad \bar{Q} = \frac{1}{T} \sum_{t=1}^T Q_t$$
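This averaging argument is straightforward to turn into code. The following is a minimal self-play sketch, added for this transcript rather than taken from the talk; the horizon, learning rate, and pure best-response adversary are all illustrative choices.

```python
# EXP self-play on a zero-sum game: the row player runs EXP over its actions,
# the column player best-responds; averaged strategies approximate minimax play.
import numpy as np

def exp_selfplay(M, T=5000):
    n, m = M.shape
    eta = np.sqrt(np.log(n) / T)          # learning rate from the regret bound
    w = np.ones(n)                        # EXP weights over row actions
    P_avg, Q_avg = np.zeros(n), np.zeros(m)
    for _ in range(T):
        P = w / w.sum()
        j = np.argmax(P @ M)              # adversary's best response Q_t (pure)
        w *= np.exp(-eta * M[:, j])       # EXP update with losses M(i, Q_t)
        P_avg += P
        Q_avg[j] += 1
    return P_avg / T, Q_avg / T

# Matching Pennies: value 0, uniform equilibrium strategies.
M = np.array([[1.0, -1.0], [-1.0, 1.0]])
P, Q = exp_selfplay(M)
print(P, Q, P @ M @ Q)   # both near (0.5, 0.5); value near 0
```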

Online Shortest Path (OSP)

Online Shortest Path

Consider the classic problem of online shortest path. We are given a graph $G(V, E)$ with a source-destination pair $(s, t)$. At every slot $t$, an adversary assigns a cost $c(e, t)$ to each edge $e$. The goal is to pick an $s$-$t$ path at each slot so that the cumulative regret up to time $T$ remains small with respect to the best fixed path.

We can frame this as an experts problem, with each path acting as an expert. The trouble with this naive approach is that there are exponentially many $s$-$t$ paths, and hence exponentially many experts, so the algorithm suffers an undesirable exponential slowdown.

The way around this issue is the following randomized strategy: Follow the Perturbed Leader (FPL).

The FPL Algorithm [3]

At each time $t$:
1 For each edge $e$, draw a perturbation $p_t(e)$ from a two-sided exponential distribution (with parameter $\epsilon$).
2 Use the shortest path $\pi_t$ in $G$ under edge weights $C_t(e) + p_t(e)$, where $C_t(e)$ is the total cost incurred on edge $e$ so far, i.e.,

$$C_t(e) = \sum_{\tau=1}^{t-1} c(e, \tau)$$

Theorem (Performance of FPL):

$$E(\text{Cost}) \le (1 + \epsilon)\, C^* + \frac{O(mn \log n)}{\epsilon}, \qquad (12)$$

where $C^*$ is the cumulative cost of the best fixed path, $m = |E|$, and $n = |V|$.

For completeness, we also mention that there is a clever way to implement the original EXP algorithm efficiently, despite the exponentially many experts, when the underlying graph is a DAG [1].
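A minimal sketch of FPL in this setting might look as follows. This is an illustration under stated assumptions, not the authors' code: it assumes networkx, a directed acyclic graph (so that negatively perturbed weights are safe for Bellman-Ford), and full-information feedback on all edge costs after each round.

```python
# FPL for online shortest path on a DAG: perturb cumulative edge costs with
# Laplace noise of rate eps, then route along the resulting shortest path.
import networkx as nx
import numpy as np

def fpl_shortest_paths(G, s, t, cost_seq, eps=0.1, seed=0):
    """G: nx.DiGraph (assumed acyclic); cost_seq: per-round dict edge -> cost."""
    rng = np.random.default_rng(seed)
    C = {e: 0.0 for e in G.edges}                 # cumulative edge costs so far
    total = 0.0
    for costs in cost_seq:
        for e in G.edges:
            # two-sided exponential (Laplace) perturbation with parameter eps
            G.edges[e]["w"] = C[e] + rng.laplace(scale=1.0 / eps)
        path = nx.bellman_ford_path(G, s, t, weight="w")
        edges = list(zip(path, path[1:]))
        total += sum(costs[e] for e in edges)      # incur this round's cost
        for e in G.edges:                          # full-information feedback
            C[e] += costs[e]
    return total
```

On a small layered DAG, one can compare the returned cost against the best fixed path's cumulative cost to estimate the realized regret.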

Open Problems

Online Sabotaged Shortest Path [4]

The setting is similar to OSP, except that at each slot some edges are sabotaged, and hence some paths are blocked. The regret with respect to a fixed path $P$ is accumulated only over the slots where $P$ is open:

$$R_T(P) = \sum_{t \,:\, P \text{ is open}} \big( \pi_t \cdot c_t - P \cdot c_t \big) \qquad (13)$$

where $\pi_t$ and $P$ are identified with their edge-indicator vectors and $c_t$ is the vector of edge costs. Again, the objective is to keep the regret small with respect to all paths $P$ simultaneously.

Baseline Algorithm: Sleeping Experts (a toy sketch follows below)
1 Maintain an expert for each path.
2 At each slot, if an expert is awake (its path is open), update its weight according to the EXP rule.
3 If an expert is asleep (its path is blocked), keep its weight unchanged.
4 Select an open path with probability proportional to its weight.
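Here is a toy sketch of that baseline, added for this transcript; note that enumerating all paths explicitly is exactly the exponential bottleneck the open problem asks to avoid.

```python
# Sleeping-experts baseline: one expert per path; only awake (open) experts
# are renormalized over and updated, sleeping experts keep their weights.
import numpy as np

def sleeping_exp(paths, rounds, eta=0.1, seed=0):
    """paths: list of paths, each a list of edges;
    rounds: list of (open_mask, edge_costs) pairs, one per slot."""
    rng = np.random.default_rng(seed)
    w = np.ones(len(paths))                    # one expert (weight) per path
    total = 0.0
    for open_mask, costs in rounds:
        awake = np.flatnonzero(open_mask)      # indices of open paths
        p = w[awake] / w[awake].sum()          # renormalize over open paths
        i = awake[rng.choice(len(awake), p=p)]
        total += sum(costs[e] for e in paths[i])
        losses = np.array([sum(costs[e] for e in paths[j]) for j in awake])
        w[awake] *= np.exp(-eta * losses)      # EXP update for awake experts only
    return total, w
```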

Performance

The regret bound is satisfactory:

$$E\, R_T(P) \le K \sqrt{T \ln D}, \qquad (14)$$

where $K$ is the length of the longest $s$-$t$ path and $D$ is the total number of $s$-$t$ paths (which could be exponential, but that does not matter, since only its logarithm enters the bound).

Issues:
1 The regret bound is suboptimal (by a factor of $K$).
2 The running time could be exponential.

Open Problem [4]: Can we design an efficient prediction strategy, similar to FPL, with a good regret guarantee?

References

[1] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[2] Yoav Freund and Robert E. Schapire. Game theory, on-line prediction and boosting. In Proceedings of the Ninth Annual Conference on Computational Learning Theory, pages 325-332. ACM, 1996.

[3] Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291-307, 2005.

[4] Wouter M. Koolen, Manfred K. Warmuth, and Dmitry Adamskiy. Open problem: Online sabotaged shortest path. In Proceedings of the 28th Conference on Learning Theory, pages 1764-1766, 2015.