The No-Regret Framework for Online Learning

The No-Regret Framework for Online Learning: A Tutorial Introduction. Nahum Shimkin, Technion, Israel Institute of Technology, Haifa, Israel. Stochastic Processes in Engineering, IIT Mumbai, March 2013.

Outline: (1) The online decision problem; (2) Blackwell's approachability; (3) The multiplicative weights algorithm; (4) Online convex programming; (5) Recent topics.

1. The Online Decision Problem

1.1 Preliminaries: Supervised Learning. The basic problem of supervised learning can be roughly formulated in the following terms. Given a sequence of examples $(x_n, y_n)_{n=1}^N$, where $x_n \in X$ is the input pattern (or feature vector) and $y_n \in Y$ the corresponding label, find a map $h : X \to Y$, $x \mapsto \hat{y}$, that will correctly predict the true label $y$ of other (yet unseen) input patterns $x$. When $Y$ is a discrete (finite) set, this formulation corresponds to a classification problem. If $Y = \{0, 1\}$ this is a binary classification problem, and for $Y$ a continuous set we obtain a regression problem. Statistical learning theory often assumes that the samples $(x_n, y_n)$ are i.i.d. draws from a fixed, yet unknown, probability distribution.

Supervised Learning (cont.). The quality of prediction is measured by a cost (or loss) function $c(\hat{y}, y)$. For classification this may be the 0-1 cost function $c(\hat{y}, y) = 1\{\hat{y} \ne y\}$; in regression, this may be the quadratic cost $c(\hat{y}, y) = |\hat{y} - y|^2$. It is sometimes convenient to allow the predictions $\hat{y}$ to take values in a larger set $\hat{Y}$ than the label set $Y$. For example, in binary classification problems we may want to allow probabilistic predictions ("80% chance of rain tomorrow"). We therefore consider prediction functions $h : X \to \hat{Y}$ and costs $c : \hat{Y} \times Y \to \mathbb{R}_+$. The prediction function $h$ is often restricted to some predefined class $H$. For example, in linear regression, $h(x) = \langle w, x \rangle$, where $w$ is a vector of weights to be tuned. Below we will variably denote the Euclidean inner product as $\langle w, x \rangle$, $w \cdot x$, or $w^T x$ (for column vectors), as convenient.

1.2 Online Learning and Regret. In online learning, examples are revealed sequentially, and learning takes place simultaneously with prediction. A generic template for this process is as follows. For $t = 1, 2, \ldots$: observe input $x_t \in X$; predict $\hat{y}_t \in \hat{Y}$; observe the true answer $y_t \in Y$; suffer the cost (or loss) $c(\hat{y}_t, y_t)$. The cumulative cost over $T$ periods is therefore $C_T = \sum_{t=1}^T c(\hat{y}_t, y_t)$. It is generally required to make this cumulative loss as small as possible, in some appropriate sense.
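To make the protocol above concrete, here is a minimal Python sketch of the online loop; the toy environment, the naive running-mean predictor, and the quadratic cost are illustrative assumptions, not part of the tutorial.

```python
import numpy as np

# Minimal sketch of the generic online-learning protocol (illustrative assumptions:
# toy environment, naive running-mean predictor, quadratic cost).
rng = np.random.default_rng(0)

def cost(y_hat, y):
    return (y_hat - y) ** 2          # quadratic cost, as in the regression example

T = 100
cumulative_cost = 0.0
running_mean = 0.0                   # naive predictor: mean of past labels

for t in range(1, T + 1):
    x_t = rng.uniform()                          # observe input x_t
    y_hat_t = running_mean                       # predict y_hat_t
    y_t = 0.7 * x_t + 0.1 * rng.normal()         # observe true answer y_t
    cumulative_cost += cost(y_hat_t, y_t)        # suffer loss c(y_hat_t, y_t)
    running_mean += (y_t - running_mean) / t     # update the learner

print(f"C_T = {cumulative_cost:.3f} over T = {T} rounds")
```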

Online Learning Examples. For concreteness, consider the following familiar examples. (1) Weather prediction: suppose we wish to predict tomorrow's weather. This may involve a classification problem (rain/no rain) or a regression problem (max/min temperature prediction). In any case, while we may be improving our prediction capability over time, the goal is clearly to provide accurate predictions throughout this period. (2) Spam filter: here we wish to classify incoming mail as spam/not spam. Some messages $x_t$ are displayed to the user to obtain their true label $y_t$. (The process of choosing which messages to display falls in the area of active learning, which we do not address here. It is also interesting that valuable information can be gained from unlabeled examples; this is addressed within semi-supervised learning.) Here, again, learning takes place online.

The Arbitrary Opponent. In these lectures we do not impose statistical assumptions on the example sequence. Rather, we refer to this sequence as arbitrary. It is convenient to think of the sequence as chosen by an opponent (often an imaginary one). We may further distinguish between the following cases. (1) An oblivious opponent: here the example sequence $(x_t, y_t)$ is preset, in the sense that it does not depend on the outputs $(\hat{y}_t)$ of our prediction algorithm. This would be the case in the weather prediction problem. (2) An adaptive, or adversarial, opponent: here the opponent may choose future examples based on past predictions of our algorithm. This might be the case in the spam filter example. We will make these assumptions more concrete in the problems discussed below. In either case, we refer to the choice of examples by the opponent as the opponent's strategy.

The Arbitrary Opponent (cont.). Either way, it is evident that the cumulative cost will depend on the example sequence. For example, in areas where it rains every day (or never rains), we expect to approach a 100% success rate. If, instead, rain follows an i.i.d. Bernoulli sequence with parameter $q \in [0, 1]$, the best we can hope for is an asymptotic error rate of $\min\{q, 1-q\}$. Thus, it is advisable to compare the performance of our prediction algorithm to an ideal, baseline performance. This is where the concept of regret comes in.

Regret. The (cumulative, $T$-step) regret relative to some fixed predictor $h$ is defined as $\mathrm{Regret}_T(h) = \sum_{t=1}^T c(\hat{y}_t, y_t) - \sum_{t=1}^T c(h(x_t), y_t)$. The regret relative to a predictor class $H$ (e.g., the set of all linear predictors) is then defined as $\mathrm{Regret}_T(H) = \max_{h \in H} \mathrm{Regret}_T(h) = \sum_{t=1}^T c(\hat{y}_t, y_t) - \min_{h \in H} \sum_{t=1}^T c(h(x_t), y_t)$. In the last term, the best fixed predictor $h \in H$ is chosen with the benefit of hindsight, i.e., given the entire sample sequence $(x_t, y_t)$ up to $T$.
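As an illustration, the sketch below computes $\mathrm{Regret}_T(H)$ for a small finite comparator class of constant predictors; the data, the learner's predictions, and the class $H$ are arbitrary stand-ins chosen only to make the definition concrete.

```python
import numpy as np

def regret_vs_class(x_seq, y_seq, y_hat_seq, H, cost):
    """Cumulative regret of the learner against the best h in H, chosen in hindsight."""
    learner_cost = sum(cost(yh, y) for yh, y in zip(y_hat_seq, y_seq))
    best_fixed_cost = min(sum(cost(h(x), y) for x, y in zip(x_seq, y_seq)) for h in H)
    return learner_cost - best_fixed_cost

# Toy example: constant predictors h_c(x) = c and quadratic cost.
rng = np.random.default_rng(1)
x_seq = rng.uniform(size=50)
y_seq = (x_seq > 0.5).astype(float)
y_hat_seq = rng.uniform(size=50)                    # some learner's predictions
H = [lambda x, c=c: c for c in (0.0, 0.5, 1.0)]     # finite comparator class
cost = lambda y_hat, y: (y_hat - y) ** 2

print("Regret_T(H) =", regret_vs_class(x_seq, y_seq, y_hat_seq, H, cost))
```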

The No-Regret Property. A learning algorithm is said to have the no-regret property (w.r.t. $H$) if the regret is guaranteed to grow sub-linearly in $T$, namely $\mathrm{Regret}_T(H) = o(T)$ (in some appropriate sense), for any strategy of the opponent.

1.3 Prediction. An important special case is the problem of sequential prediction. Here the challenge is to predict the next element $y_t$ based on the previous elements $(y_1, \ldots, y_{t-1})$ of the sequence, for $t = 1, 2, \ldots$. This may be viewed as a special case of the previous model, with the input patterns $x_t$ absent. If we assume a statistical structure (e.g., Markovian) on the sequence, the problem becomes one of statistical model estimation and prediction. Prediction of arbitrary sequences with regret bounds has been treated extensively in the information theory literature (under the names universal prediction or individual sequence prediction); see Merhav and Feder (1998).

Prediction with Expert Advice. A related important model is that of prediction with expert advice (Littlestone and Warmuth, 1994). Here we are assisted in our prediction task by a set $E$ of experts (which may themselves be prediction algorithms), among which we wish to follow the best one. The problem is formulated as follows. For $t = 1, 2, \ldots$: (1) the environment chooses the outcome $y_t$; (2) simultaneously, each expert $e$ chooses an advice $\hat{y}_{e,t}$, which is revealed to the forecaster; (3) the forecaster chooses a prediction $\hat{y}_t$, after which he observes the true answer $y_t$; (4) the forecaster suffers a loss $c(\hat{y}_t, y_t)$. The goal of the forecaster here is to minimize his regret with respect to the best expert, where the regret is defined as $\mathrm{Regret}_T = \sum_{t=1}^T c(\hat{y}_t, y_t) - \min_{e \in E} \sum_{t=1}^T c(\hat{y}_{e,t}, y_t)$.
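A hedged sketch of the experts protocol follows, with a deliberately simple forecaster that follows the expert with the smallest cumulative loss so far; the constant experts and the Bernoulli outcomes are toy assumptions (the multiplicative weights algorithm of Section 3 gives a much better forecaster).

```python
import numpy as np

# Experts protocol with a naive "follow the current best expert" forecaster
# (toy experts and outcomes; illustrative only).
rng = np.random.default_rng(2)
T = 200
cost = lambda y_hat, y: (y_hat - y) ** 2

experts = [lambda t: 0.0, lambda t: 1.0, lambda t: 0.5]   # constant "experts"
expert_loss = np.zeros(len(experts))
forecaster_loss = 0.0

for t in range(T):
    advice = [e(t) for e in experts]                 # each expert announces y_hat_{e,t}
    y_hat_t = advice[int(np.argmin(expert_loss))]    # follow the current leader
    y_t = float(rng.random() < 0.7)                  # environment's outcome y_t
    forecaster_loss += cost(y_hat_t, y_t)            # forecaster's loss
    expert_loss += np.array([cost(a, y_t) for a in advice])

print(f"Regret_T vs best expert: {forecaster_loss - expert_loss.min():.2f}")
```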

1.4 Repeated Games. The notion of no-regret strategies, or no-regret play, was introduced in the seminal paper of Hannan (1957) in the context of repeated matrix games.

Matrix Games (reminder). Recall that a zero-sum matrix game is defined through a payoff (or cost) matrix $\Gamma = \{c(i, j) : i \in I, j \in J\}$, where $I$ is the action set of player 1 (PL1, the learner) and $J$ is the action set of player 2 (PL2, the opponent). Let $p \in \Delta(I)$ denote a mixed action of PL1, and $q \in \Delta(J)$ a mixed action of PL2. The expected cost (to PL1) under mixed actions $p$ and $q$ is $c(p, q) = \sum_{i,j} p_i \, c(i, j) \, q_j$. The minimax value of the game (with PL1 the minimizer) is given by $v(\Gamma) = \min_{p \in \Delta(I)} \max_{q \in \Delta(J)} c(p, q) = \max_{q \in \Delta(J)} \min_{p \in \Delta(I)} c(p, q)$.
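For concreteness, the sketch below evaluates the expected cost $c(p, q) = p^T C q$ and computes the minimax value $v(\Gamma)$ via the standard linear program for zero-sum games (PL1 minimizes); the 2x2 cost matrix is an arbitrary example, and the LP formulation is a standard one rather than anything specific to these slides.

```python
import numpy as np
from scipy.optimize import linprog

# Zero-sum matrix game: expected cost and minimax value via the standard LP
# (illustrative 2x2 cost matrix, matching-pennies style).
C = np.array([[1.0, 0.0],
              [0.0, 1.0]])
m, n = C.shape

def expected_cost(p, q):
    return p @ C @ q                     # c(p, q) = sum_{i,j} p_i c(i,j) q_j

# LP: minimize v subject to (C^T p)_j <= v for all j, sum(p) = 1, p >= 0.
obj = np.r_[np.zeros(m), 1.0]            # decision variables: (p, v)
A_ub = np.c_[C.T, -np.ones(n)]           # C^T p - v <= 0
b_ub = np.zeros(n)
A_eq = np.r_[np.ones(m), 0.0].reshape(1, -1)
res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
              bounds=[(0, None)] * m + [(None, None)])
p_star, value = res.x[:m], res.x[-1]
print("minimax mixed action p* =", p_star, " value v(Gamma) =", value)
```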

The Repeated Game. The repeated game based on $\Gamma$ proceeds in stages $t = 1, 2, \ldots$, where at stage $t$ PL1 and PL2 simultaneously choose actions $i_t$ and $j_t$, respectively, which are then observed (and recalled) by both players. A strategy $\pi_1$ for PL1 is a map that assigns a mixed action $p_t \in \Delta(I)$ to each possible history $H_t = (i_1, j_1, \ldots, i_{t-1}, j_{t-1})$ and time $t \ge 1$; similarly, a strategy $\pi_2$ for PL2 assigns a mixed action $q_t \in \Delta(J)$ to each possible history. The actions $i_t$ and $j_t$ are chosen at random according to $p_t$ and $q_t$. Any pair $(\pi_1, \pi_2)$ induces a probability measure on the set of infinite play histories, which we denote by $P_{\pi_1, \pi_2}$. We shall be interested in the (long-term) cumulative cost (or loss) $C_T = \sum_{t=1}^T c(i_t, j_t)$.

Regret (again). The minimax value of this game (for any fixed $T$, and for $T \to \infty$) is easily seen to equal $v(\Gamma)$, the value of the single-shot game. However, assuming that PL2 is not necessarily adversarial (and, indeed, not necessarily rational in the game-theoretic sense), this raises the question of whether PL1 can gain more than the value by adapting to the observed action history of PL2. To this end, define the following (cumulative) regret for PL1: $R_T = \sum_{t=1}^T c(i_t, j_t) - \min_{i \in I} \sum_{t=1}^T c(i, j_t)$. (1) The second term on the right-hand side serves as our reference level, to which the actual cost is compared. Naturally, PL1 would like the regret to be as small as possible.

No-Regret Strategies. Define the average regret $\bar{R}_T = \frac{1}{T} R_T$. Definition: a strategy $\pi_1$ of PL1 is said to have the no-regret property (or be Hannan-consistent) if $\limsup_{T \to \infty} \bar{R}_T \le 0$, $P_{\pi_1, \pi_2}$-a.s., (2) for any strategy $\pi_2$ of PL2. More succinctly, we may write this property as $\bar{R}_T \le o(1)$ (a.s.), or $R_T \le o(T)$ (a.s.).

Some More Notation. The right-hand side of (1) can be written in another convenient form. Let $\bar{q}_T = \frac{1}{T} \sum_{t=1}^T e_{j_t}$ denote the empirical frequency vector of PL2's actions (here $e_j \in \Delta(J)$ is the mixed action concentrated on action $j$). Recalling our convention $c(i, q) = \sum_j c(i, j) q_j$, it follows that $\min_{i \in I} \frac{1}{T} \sum_{t=1}^T c(i, j_t) = \min_{i \in I} c(i, \bar{q}_T) = \min_{p \in \Delta(I)} c(p, \bar{q}_T) = c^*(\bar{q}_T)$, where $c^*(q) = \min_p c(p, q)$ is the best-response cost, also known as the Bayes risk of the game. The average regret may now be written more succinctly as $\bar{R}_T = \frac{1}{T} C_T - c^*(\bar{q}_T)$.
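The following small sketch computes the empirical frequency vector $\bar{q}_T$ and the best-response (Bayes) cost $c^*(\bar{q}_T)$; the cost matrix and the action sequence are illustrative assumptions.

```python
import numpy as np

# Empirical frequency of PL2's actions and the Bayes (best-response) cost
# (illustrative cost matrix and action sequence).
C = np.array([[1.0, 0.0],
              [0.0, 1.0]])                     # c(i, j)
j_seq = [0, 1, 1, 0, 1]                        # PL2's observed actions j_1, ..., j_T

q_bar = np.bincount(j_seq, minlength=C.shape[1]) / len(j_seq)
bayes_cost = (C @ q_bar).min()                 # c*(q_bar) = min_i sum_j c(i,j) q_bar_j
print("q_bar =", q_bar, " c*(q_bar) =", bayes_cost)
```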

The no-regret property $\bar{R}_T \le o(1)$ (a.s.) may now be written as $\frac{1}{T} C_T \le c^*(\bar{q}_T) + o(1)$ (a.s.). Accordingly, a no-regret strategy is sometimes said to attain the Bayes risk of the game.

Notes on Randomization. 1. Necessity of randomization: it is easily seen that no deterministic strategy, i.e., a map $i_t = f_t(h_{t-1})$, can satisfy the no-regret property. Indeed, an (adaptive) opponent has access to $h_{t-1}$ and can choose $j_t$ to maximize the loss against $i_t$. For example, in the binary prediction problem he might choose $j_t = 1 - i_t$, which yields a cumulative loss of $C_T = T$ and a cumulative regret of $\frac{1}{2} T$ or more. Hence, we use strictly mixed actions $p_t$ as part of the learning policy.

Notes on Randomization (cont.). 2. Smoothed regret: it is easily seen that the difference $d_t = c(p_t, j_t) - c(i_t, j_t)$ is a bounded martingale difference sequence with respect to $\mathcal{F}_t = \sigma\{h_{t-1}, j_t\}$. Hence the sum $\sum_{t=1}^T d_t$ is of order $O(\sqrt{T})$, and in particular the average $\frac{1}{T} \sum_{t=1}^T d_t$ converges to 0 (a.s.). We can therefore define the regret in terms of $c(p_t, j_t)$ in place of $c(i_t, j_t)$, namely $\tilde{R}_T = \sum_{t=1}^T c(p_t, j_t) - \min_{i \in I} \sum_{t=1}^T c(i, j_t)$. This definition allows us to establish sample-path (rather than a.s.) bounds on the regret. Henceforth we will use the latter definition.

Hannan's No-Regret Theorem. Theorem (Hannan '57): there exists a strategy $\pi$ for the learner such that $R_T \le c_0 \sqrt{T}$ for any strategy of the opponent and all $T \ge 1$. Here $c_0 = (\tfrac{3}{2} n_I)^{1/2} n_J \, \mathrm{span}(c)$. The constant $c_0$ in Hannan's result was subsequently improved, but not the rate $\sqrt{T}$, which is optimal. The proposed strategy was a perturbed FTL (follow the leader) scheme, which we briefly describe next.

FTL-Type Strategies. The FTL (follow the leader) strategy is given by $i_{t+1} = \operatorname{argmin}_{i \in I} \left\{ \sum_{s=1}^{t} c(i, j_s) \right\} = \operatorname{argmin}_{i \in I} \{ c(i, \bar{q}_t) \} = \mathrm{BR}(\bar{q}_t)$, with ties broken arbitrarily. Here the learner uses a best-response action against the empirical frequency of the opponent's actions. This simple rule is also known as fictitious play in the game-theory literature.
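A minimal sketch of the FTL (fictitious play) rule follows: the learner best-responds to the empirical frequency of the opponent's past actions. The cost matrix and the tie-breaking convention are illustrative assumptions.

```python
import numpy as np

def ftl_action(C, j_history):
    """FTL / fictitious play: best response to the empirical frequency of past opponent actions."""
    q_bar = np.bincount(j_history, minlength=C.shape[1]) / len(j_history)
    return int(np.argmin(C @ q_bar))     # ties broken toward the lowest index

# Binary prediction cost c(i, j) = 1{i != j} (illustrative).
C = 1.0 - np.eye(2)
print(ftl_action(C, [1, 1, 0]))          # predicts the majority label so far (here: 1)
```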

It is easily seen that FTL does not have the no-regret property, even against an oblivious opponent. Consider the binary prediction problem, namely $i, j \in \{0, 1\}$ with $c(i, j) = 1\{i \ne j\}$. Suppose PL2 chooses the sequence $(j^*, 1, 0, 1, 0, \ldots)$, where $j^*$ is some auxiliary first outcome with $c(0, j^*) = 0$ and $c(1, j^*) = 0.5$. In that case FTL yields the sequence $(i_t) = (?, 0, 1, 0, 1, \ldots)$, which oscillates opposite to PL2's actions, leading to $R_T \approx \frac{1}{2} T$. Still, FTL can be modified, essentially by smoothing the best-response map, so that the oscillation observed above is prevented.
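The sketch below replays this counterexample numerically: against the alternating sequence (with the auxiliary half-cost first outcome folded into the cumulative losses), FTL's regret grows linearly in $T$. The setup mirrors the example above and is illustrative only.

```python
import numpy as np

# FTL's failure on the alternating sequence: regret grows linearly in T.
# Cumulative losses start at (0, 0.5), reflecting the auxiliary first outcome j*
# with c(0, j*) = 0, c(1, j*) = 0.5 (as in the example above).
T = 1000
cum_loss = np.array([0.0, 0.5])       # hindsight loss of each fixed action so far
ftl_loss = 0.5                        # first-round cost is at most 0.5, whatever i_1 is

for t in range(2, T + 1):
    i_t = int(np.argmin(cum_loss))    # FTL: play the current leader
    j_t = 1 if t % 2 == 0 else 0      # opponent alternates 1, 0, 1, 0, ...
    ftl_loss += float(i_t != j_t)     # 0-1 prediction cost
    cum_loss += np.array([float(0 != j_t), float(1 != j_t)])

print(f"FTL loss = {ftl_loss:.1f},  best fixed = {cum_loss.min():.1f},  "
      f"regret = {ftl_loss - cum_loss.min():.1f}  (T = {T})")
```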

Perturbed FTL. Let $z_t = (z_{j,t})_{j \in J}$, $t \ge 1$, be a collection of i.i.d. random variables, with each $z_{j,t} \sim U[0, 1]$ uniformly distributed on $[0, 1]$, and let $i_{t+1} = \mathrm{BR}(\bar{q}_t + \lambda_t z_t)$. Hannan's result holds for this strategy with $\lambda_t = c_1 / \sqrt{t}$, where $c_1 = (3 n_J^2 / (2 n_I))^{1/2}$.
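Here is a hedged sketch of perturbed FTL for a matrix game, following the rule above with uniform perturbations and $\lambda_t \propto 1/\sqrt{t}$; the cost matrix, the opponent's sequence, and the constant in $\lambda_t$ are illustrative assumptions, not the exact values from Hannan's analysis.

```python
import numpy as np

# Perturbed FTL: best-respond to the perturbed frequency q_bar_t + lambda_t * z_t
# (illustrative cost matrix, opponent, and constant; not Hannan's exact tuning).
rng = np.random.default_rng(3)
C = 1.0 - np.eye(2)                   # 0-1 prediction cost c(i, j) = 1{i != j}
T = 1000
counts = np.zeros(2)                  # counts of PL2's past actions
learner_loss, cum_loss = 0.0, np.zeros(2)

for t in range(1, T + 1):
    q_bar = counts / max(t - 1, 1)
    lam = 1.0 / np.sqrt(t)            # lambda_t = c_1 / sqrt(t), with c_1 = 1 here
    z = rng.uniform(size=2)           # i.i.d. U[0,1] perturbations
    i_t = int(np.argmin(C @ (q_bar + lam * z)))
    j_t = 1 if t % 2 == 0 else 0      # the alternating opponent that defeats plain FTL
    learner_loss += C[i_t, j_t]
    cum_loss += C[:, j_t]
    counts[j_t] += 1

print(f"regret = {learner_loss - cum_loss.min():.1f} after T = {T} rounds")
```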

Smooth Fictitious Play. In terms of mixed actions, perturbed FTL effectively leads to $p_{t+1} = \mathrm{BR}_{\lambda_t}(\bar{q}_t)$, where $q \mapsto \mathrm{BR}_\lambda(q)$ is a smooth version of $\mathrm{BR}(\cdot)$ (for each $\lambda > 0$). In the next variant, introduced by Fudenberg and Levine (1995) and others, smoothing of the best-response map is obtained directly through function minimization: $\mathrm{BR}_\lambda(q) = \operatorname{argmin}_{p \in \Delta(I)} \{ c(p, q) + \lambda v(p) \}$, where $v : \Delta(I) \to \mathbb{R}$ is a smooth, strictly convex function with derivatives that are steep at the vertices of $\Delta(I)$. In particular, choosing $v(p) = \sum_i p_i \log p_i$ yields the logistic (softmax) map $\mathrm{BR}_\lambda(q)_i = \frac{\exp(-\lambda^{-1} c(i, q))}{\sum_{i'} \exp(-\lambda^{-1} c(i', q))}$.
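The entropy-smoothed best response above is simply a softmax over the (negative, scaled) costs; a minimal sketch follows, with an illustrative cost matrix and smoothing parameters.

```python
import numpy as np

def smooth_best_response(C, q, lam):
    """Entropy-smoothed best response BR_lambda(q): softmax of -c(i, q) / lambda."""
    costs = C @ q                                 # c(i, q) for each action i
    w = np.exp(-(costs - costs.min()) / lam)      # shift for numerical stability
    return w / w.sum()

C = 1.0 - np.eye(2)                               # illustrative 0-1 cost matrix
q = np.array([0.8, 0.2])                          # opponent's (empirical) mixed action
for lam in (1.0, 0.1, 0.01):
    print(f"lambda = {lam}: BR_lambda(q) = {smooth_best_response(C, q, lam)}")
```

As $\lambda \to 0$ the smoothed response concentrates on the exact best response, recovering plain fictitious play.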

Pointers to the Literature. There are a number of monographs and surveys that cover a variety of problems within the no-regret framework. The textbook by Cesa-Bianchi and Lugosi (2006) surveys and unifies the different approaches developed within the game theory, information theory, statistical decision theory, and machine learning communities. The monographs by Fudenberg and Levine (1998) and by Young (2004) provide a game-theoretic viewpoint. A survey by Shalev-Shwartz (2011) considers the general problem of online convex optimization, while Bubeck and Cesa-Bianchi (2012) provide an overview of the related (stochastic and nonstochastic) multi-armed bandit problem.