The No-Regret Framework for Online Learning
A Tutorial Introduction

Nahum Shimkin
Technion – Israel Institute of Technology, Haifa, Israel

Stochastic Processes in Engineering, IIT Mumbai, March 2013

N. Shimkin, Technion 1 / 28
Outline

1. The online decision problem
2. Blackwell's Approachability
3. The Multiplicative Weight Algorithm
4. Online convex programming
5. Recent topics
1. The Online Decision Problem
1.1 Preliminaries: Supervised Learning

The basic problem of supervised learning can be roughly formulated in the following terms. Given a sequence of examples (x_n, y_n), n = 1, …, N, where x_n ∈ X is the input pattern (or feature vector) and y_n ∈ Y the corresponding label, find a map h : x ∈ X ↦ ŷ ∈ Y that will correctly predict the true label y of other (yet unseen) input patterns x.

When Y is a discrete (finite) set, this formulation corresponds to a classification problem. If Y = {0, 1}, this is a binary classification problem, and for Y a continuous set we obtain a regression problem. Statistical learning theory often assumes that the samples (x_n, y_n) are i.i.d. draws from a fixed, yet unknown, probability distribution.
The quality of prediction is measured by a cost (or loss) function c(ŷ, y). For classification this may be the 0–1 cost function c(ŷ, y) = 1{ŷ ≠ y}; in regression, it may be the quadratic cost c(ŷ, y) = |ŷ − y|².

It is sometimes convenient to allow the predictions ŷ to take values in a larger set Ŷ than the label set Y. For example, in binary classification problems we may want to allow probabilistic predictions ("80% chance of rain tomorrow"). We therefore consider prediction functions h : X → Ŷ and costs c : Ŷ × Y → ℝ₊.

The prediction function h is often restricted to some predefined class H. For example, in linear regression, h(x) = ⟨w, x⟩, where w is a vector of weights to be tuned.

Below we will interchangeably denote the Euclidean inner product as ⟨w, x⟩, w·x, or wᵀx (for column vectors), as convenient.
1.2 Online Learning and Regret

In online learning, examples are presented sequentially, and learning takes place simultaneously with prediction. A generic template for this process is as follows.

For t = 1, 2, …:
  - observe input x_t ∈ X
  - predict ŷ_t ∈ Ŷ
  - observe the true answer y_t ∈ Y
  - suffer cost (or loss) c(ŷ_t, y_t)

The cumulative cost over T periods is therefore

    C_T = Σ_{t=1}^T c(ŷ_t, y_t).

It is generally required to make this cumulative loss as small as possible, in some appropriate sense.
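The protocol above is easy to transcribe into code. The following minimal sketch (all names here are illustrative, not from the lecture) runs the observe–predict–reveal–suffer loop under 0–1 loss, with a toy "repeat the previous label" forecaster:

```python
# Minimal sketch of the online learning protocol: observe x_t, predict,
# observe the true label, suffer a loss. Names are illustrative.

def zero_one_loss(y_hat, y):
    return 1 if y_hat != y else 0

def run_online(examples, predictor):
    """Run the online protocol; predictor(x, history) returns y_hat."""
    history = []            # past (x, y) pairs revealed so far
    total_cost = 0
    for x, y in examples:   # examples arrive one at a time
        y_hat = predictor(x, history)   # predict before seeing y
        total_cost += zero_one_loss(y_hat, y)
        history.append((x, y))          # label revealed after prediction
    return total_cost

# Toy predictor: repeat the most recent label (0 on the first round).
def last_label(x, history):
    return history[-1][1] if history else 0

examples = [(t, t % 2) for t in range(10)]   # labels alternate 0,1,0,1,...
cost = run_online(examples, last_label)      # wrong on every round after the first
```

On the alternating sequence this naive predictor errs on all rounds but the first, which already hints at why a baseline (regret) is needed to judge performance.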
For concreteness, consider the following familiar examples:

1. Weather prediction: Suppose we wish to predict tomorrow's weather. This may involve a classification problem (rain / no rain) or a regression problem (max/min temperature prediction). In any case, while we may be improving our prediction capability over time, the goal is clearly to provide accurate predictions throughout this period.

2. Spam filter: Here we wish to classify incoming mail as spam / not spam. Some messages x_t are displayed to the user to obtain their true label y_t. (The process of choosing which messages to display falls in the area of active learning, which we do not address here. It is also interesting that valuable information can be gained from unlabeled examples; this is addressed within semi-supervised learning.) Here, again, learning takes place online.
The Arbitrary Opponent

In these lectures we do not impose statistical assumptions on the example sequence. Rather, we refer to this sequence as arbitrary. It is convenient to think of the sequence as chosen by an opponent (often an imaginary one). We may further distinguish between the following cases:

1. An oblivious opponent: Here the example sequence (x_t, y_t) is preset, in the sense that it does not depend on the outputs (ŷ_t) of our prediction algorithm. This would be the case in the weather prediction problem.

2. An adaptive, or adversarial, opponent: Here the opponent may choose future examples based on past predictions of our algorithm. This might be the case in the spam filter example.

We will make these assumptions more concrete in the problems discussed below. In either case, we refer to the choice of samples by the opponent as the opponent's strategy.
Either way, it is evident that the cumulative cost will depend on the example sequence. For example, in areas where it rains every day (or never rains), we expect to approach a 100% success rate, while if rain follows an i.i.d. Bernoulli sequence with parameter q ∈ [0, 1], the best we can hope for is an asymptotic error rate of min{q, 1 − q}.

Thus, it is advisable to compare the performance of our learning prediction algorithm to an ideal, baseline performance. This is where the concept of regret comes in.
Regret

The (cumulative, T-step) regret relative to some fixed predictor h is defined as

    Regret_T(h) = Σ_{t=1}^T c(ŷ_t, y_t) − Σ_{t=1}^T c(h(x_t), y_t).

The regret relative to a predictor class H (e.g., the set of all linear predictors) is then defined as

    Regret_T(H) = max_{h∈H} Regret_T(h) = Σ_{t=1}^T c(ŷ_t, y_t) − min_{h∈H} Σ_{t=1}^T c(h(x_t), y_t).

In the last term, the best fixed predictor h ∈ H is chosen with the benefit of hindsight, i.e., given the entire sample sequence (x_t, y_t) up to time T.
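The definition is just a difference of two cumulative sums, with the second minimized in hindsight. As a concrete illustration (hypothetical names, 0–1 loss), here is the regret against the class of constant predictors:

```python
# Regret_T(H) for H = {constant predictors}, under 0-1 loss.
# Illustrative sketch; names are not from the lecture.

def cumulative_cost(predictions, labels):
    return sum(1 for p, y in zip(predictions, labels) if p != y)

def regret_vs_constants(predictions, labels, label_set=(0, 1)):
    c_alg = cumulative_cost(predictions, labels)
    # Best fixed predictor chosen in hindsight over the whole sequence.
    c_best = min(cumulative_cost([h] * len(labels), labels) for h in label_set)
    return c_alg - c_best

labels      = [1, 1, 0, 1, 1, 1]     # mostly ones
predictions = [0, 1, 1, 0, 1, 1]     # our online predictions (3 errors)
r = regret_vs_constants(predictions, labels)   # best constant h = 1 errs once
```

Here the hindsight benchmark (always predict 1) suffers a single error, so the regret equals 3 − 1 = 2.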
The No-Regret Property

A learning algorithm is said to have the no-regret property (w.r.t. H) if the regret is guaranteed to grow sub-linearly in T, namely

    Regret_T(H) = o(T)

(in some appropriate sense), for any strategy of the opponent.
1.3 Prediction

An important special case is the problem of sequential prediction. Here the challenge is to predict the next element y_t based on the previous elements (y_1, …, y_{t−1}) of the sequence, for t = 1, 2, …. This may be viewed as a special case of the previous model, with the patterns x_t absent.

If we assume a statistical structure (e.g., Markovian) on the sequence, the problem becomes one of statistical model estimation and prediction. Prediction of arbitrary sequences with regret bounds has been treated extensively in the information theory literature, under the names universal prediction and individual sequence prediction; see Merhav and Feder (1998).
Prediction with Expert Advice

A related important model is that of prediction with expert advice (Littlestone and Warmuth, 1994). Here we are assisted in our prediction task by a set E of experts (which may themselves be prediction algorithms), among which we wish to follow the best one.

The problem is formulated as follows. For t = 1, 2, …:

1. The environment chooses the outcome y_t.
2. Simultaneously, each expert e chooses an advice ŷ_{e,t}, which is revealed to the forecaster.
3. The forecaster chooses a prediction ŷ_t, after which he observes the true answer y_t.
4. The forecaster suffers a loss c(ŷ_t, y_t).

The goal of the forecaster is to minimize his regret with respect to the best expert, defined here as

    Regret_T = Σ_{t=1}^T c(ŷ_t, y_t) − min_{e∈E} Σ_{t=1}^T c(ŷ_{e,t}, y_t).
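A direct transcription of this benchmark (hypothetical names and made-up data) computes the forecaster's cumulative loss against the best expert in hindsight:

```python
# Regret with respect to the best expert, under 0-1 loss.
# Sketch only; expert advice and outcomes are made-up data.

def loss(y_hat, y):
    return 1 if y_hat != y else 0

def regret_vs_experts(forecasts, advice, outcomes):
    """advice[e][t] is expert e's prediction at time t."""
    c_forecaster = sum(loss(f, y) for f, y in zip(forecasts, outcomes))
    c_best = min(sum(loss(a, y) for a, y in zip(advice[e], outcomes))
                 for e in range(len(advice)))
    return c_forecaster - c_best

outcomes  = [1, 0, 1, 1, 0]
advice    = [[1, 1, 1, 1, 1],    # expert 0: always predicts 1 (2 errors)
             [1, 0, 1, 1, 0]]    # expert 1: happens to be perfect
forecasts = [0, 0, 1, 1, 1]      # the forecaster's own predictions (2 errors)
r = regret_vs_experts(forecasts, advice, outcomes)
```

Since expert 1 is perfect in hindsight, the regret here equals the forecaster's own loss.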
1.4 Repeated Games

The notion of no-regret strategies, or no-regret play, was introduced in a seminal paper by Hannan (1957) in the context of repeated matrix games.
Matrix Games (reminder)

Recall that a zero-sum matrix game is defined through a payoff (or cost) matrix Γ = {c(i, j) : i ∈ I, j ∈ J}, where I is the action set of player 1 (PL1, the learner) and J is the action set of player 2 (PL2, the opponent).

Let p ∈ Δ(I) denote a mixed action of PL1, and q ∈ Δ(J) a mixed action of PL2. The expected cost (to PL1) under mixed actions p and q is

    c(p, q) = Σ_{i,j} p_i c(i, j) q_j.

The minimax value of the game (with PL1 the minimizer) is given by

    v(Γ) = min_{p∈Δ(I)} max_{q∈Δ(J)} c(p, q) = max_{q∈Δ(J)} min_{p∈Δ(I)} c(p, q).
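Since c(p, q) is linear in q for fixed p, the inner maximum over Δ(J) is always attained at a pure action, so the guarantee of a mixed action can be checked against pure opponent actions only. A small numerical sketch (illustrative, using the binary prediction cost c(i, j) = 1{i ≠ j}):

```python
# Expected cost c(p, q) = sum_{i,j} p_i c(i,j) q_j, and the fact that for
# fixed p the worst-case opponent cost is attained at a pure action j.
# Illustrative sketch on the binary prediction game c(i, j) = 1{i != j}.

C = [[0, 1],
     [1, 0]]                      # cost matrix: 1 when i != j

def expected_cost(p, q, C=C):
    return sum(p[i] * C[i][j] * q[j]
               for i in range(len(p)) for j in range(len(q)))

def worst_case_cost(p, C=C):
    # Linearity in q => it suffices to check the pure actions of PL2.
    n_j = len(C[0])
    return max(expected_cost(p, [1 if j == k else 0 for j in range(n_j)])
               for k in range(n_j))

# Uniform mixing guarantees expected cost 1/2 against any opponent action;
# for this symmetric game that is also the minimax value v(Γ).
v_guarantee = worst_case_cost([0.5, 0.5])
```

Computing the value of a general matrix game requires a linear program; the point of this snippet is only the bilinear form of c(p, q) and the pure-action reduction of the inner maximum.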
The Repeated Game

The repeated game Γ proceeds in stages t = 1, 2, …, where at stage t PL1 and PL2 simultaneously choose actions i_t and j_t, respectively, which are then observed (and recalled) by both players.

A strategy π_1 for PL1 is a map that assigns a mixed action p_t ∈ Δ(I) to each possible history h_{t−1} = (i_1, j_1, …, i_{t−1}, j_{t−1}) and time t ≥ 1, and similarly a strategy π_2 for PL2 assigns a mixed action q_t ∈ Δ(J) to each possible history. The actions i_t and j_t are chosen at random according to p_t and q_t. Any pair (π_1, π_2) induces a probability measure on the space of infinite histories, which we denote by P_{π_1,π_2}.

We shall be interested in the (long-term) cumulative cost (or loss)

    C_T = Σ_{t=1}^T c(i_t, j_t).
Regret (again)

The minimax value of this game (for any fixed T, and as T → ∞) is easily seen to equal v(Γ), the value of the single-shot game. However, if PL2 is not necessarily adversarial (and, indeed, not necessarily rational in the game-theoretic sense), this raises the question of whether PL1 can do better than the value by adapting to the observed action history of PL2.

To this end, define the following (cumulative) regret for PL1:

    R_T = Σ_{t=1}^T c(i_t, j_t) − min_{i∈I} Σ_{t=1}^T c(i, j_t).   (1)

The second term on the RHS serves here as our reference level, to which the actual cost is compared. Naturally, PL1 would like the regret to be as small as possible.
No-regret Strategies

Define the average regret R̄_T = R_T / T.

Definition. A strategy π_1 of PL1 is said to have the no-regret property (or to be Hannan-consistent) if

    limsup_{T→∞} R̄_T ≤ 0,   P_{π_1,π_2}-a.s.   (2)

for any strategy π_2 of PL2.

More succinctly, we may write this property as R̄_T ≤ o(1) (a.s.), or R_T ≤ o(T) (a.s.).
Some More Notation...

The second term in (1) can be written in another convenient form. Let

    q̄_T = (1/T) Σ_{t=1}^T e_{j_t}

denote the empirical frequency vector of PL2's actions (here e_j ∈ Δ(J) is the mixed action concentrated on action j). Recalling our convention c(i, q) = Σ_j c(i, j) q_j, it follows that

    min_{i∈I} (1/T) Σ_{t=1}^T c(i, j_t) = min_{i∈I} c(i, q̄_T) = min_{p∈Δ(I)} c(p, q̄_T) = c*(q̄_T),

where c*(q) = min_p c(p, q) is the best-response cost, also known as the Bayes risk of the game. The average regret may now be written more succinctly as

    R̄_T = C̄_T − c*(q̄_T),

where C̄_T = C_T / T is the average cost.
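The Bayes risk c*(q) is a minimum of finitely many linear functions of q, so it can be evaluated by scanning the rows of the cost matrix at the empirical frequency vector. A small sketch (illustrative names and data):

```python
# Bayes risk c*(q) = min_i c(i, q) = min_i sum_j c(i,j) q_j, evaluated at
# the empirical frequency vector of the opponent's actions. Sketch only.

from collections import Counter

C = [[0, 1],
     [1, 0]]   # binary prediction cost c(i, j) = 1{i != j}

def empirical_frequencies(js, n_j):
    counts = Counter(js)
    return [counts[j] / len(js) for j in range(n_j)]

def bayes_risk(q, C=C):
    # Minimizing over pure actions suffices: c(p, q) is linear in p.
    return min(sum(C[i][j] * q[j] for j in range(len(q)))
               for i in range(len(C)))

js = [1, 1, 0, 1]                      # opponent played 1 on three rounds of four
q_bar = empirical_frequencies(js, 2)   # empirical frequencies of actions 0 and 1
risk = bayes_risk(q_bar)               # best fixed action is i = 1
```

Here q̄ = (1/4, 3/4), and the best fixed action i = 1 errs only on the single round where the opponent played 0, giving c*(q̄) = 1/4.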
The no-regret property R̄_T ≤ o(1) (a.s.) may now be written as

    (1/T) C_T ≤ c*(q̄_T) + o(1)   (a.s.).

Accordingly, a no-regret strategy is sometimes said to attain the Bayes risk of the game.
Notes on Randomization

1. Necessity of randomization: It is easily seen that no deterministic strategy, i.e., a map i_t = f_t(h_{t−1}), can satisfy the no-regret property. Indeed, an (adaptive) opponent has access to h_{t−1}, and can choose j_t to maximize the loss against i_t. For example, in the binary prediction problem he might choose j_t = 1 − i_t, which yields a cumulative loss of C_T = T and a cumulative regret of ½T or more. Hence, we use strictly mixed actions p_t as part of the learning policy.
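This argument is easy to check by simulation: since a deterministic map is fully predictable from the history, the mismatching opponent j_t = 1 − i_t forces a loss on every round. A sketch, with an arbitrary made-up deterministic rule:

```python
# An adaptive opponent defeats any deterministic strategy: knowing the map
# f_t, it plays j_t = 1 - i_t, so c(i_t, j_t) = 1{i_t != j_t} = 1 each round.
# The particular deterministic rule below is an arbitrary example.

def deterministic_rule(history):
    # e.g. "play the majority of the opponent's past actions"
    js = [j for (_, j) in history]
    return 1 if sum(js) * 2 > len(js) else 0

T = 50
history, cost = [], 0
for t in range(T):
    i_t = deterministic_rule(history)   # fully predictable from history
    j_t = 1 - i_t                       # adversary mismatches every time
    cost += 1 if i_t != j_t else 0
    history.append((i_t, j_t))
```

The cumulative loss comes out to exactly T, as claimed, whichever deterministic rule is plugged in.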
2. Smoothed regret: It is easily seen that the difference d_t = c(p_t, j_t) − c(i_t, j_t) is a bounded martingale difference sequence with respect to F_t = σ{h_{t−1}, j_t}. Hence, by standard martingale arguments, the sum Σ_{t=1}^T d_t is of order O(√T), and in particular its average converges to 0 (a.s.). We can therefore define the regret in terms of c(p_t, j_t) in place of c(i_t, j_t), namely

    R_T = Σ_{t=1}^T c(p_t, j_t) − min_{i∈I} Σ_{t=1}^T c(i, j_t).

This definition allows us to establish sample-path (rather than a.s.) bounds on the regret. Henceforth we will use this latter definition.
Hannan's No-regret Theorem

Theorem (Hannan '57). There exists a strategy π for the learner such that

    R_T ≤ c_0 √T

for any strategy of the opponent and all T ≥ 1. Here c_0 = (3n_I/2)^{1/2} n_J · span(c).

The constant c_0 in Hannan's result was subsequently improved, but not the rate √T, which is optimal. The proposed strategy was a perturbed FTL (follow-the-leader) scheme, which we briefly describe next.
FTL-Type Strategies

The FTL (follow-the-leader) strategy is given by

    i_{t+1} = argmin_{i∈I} Σ_{s=1}^t c(i, j_s) = argmin_{i∈I} c(i, q̄_t) = BR(q̄_t),

with ties broken arbitrarily. Here the learner uses a best-response action against the empirical frequency of the opponent's actions. This simple rule is also known as fictitious play in the game literature.
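A minimal FTL implementation over a finite action set (illustrative sketch; ties here are broken toward the lowest-index action, and the cost matrix is the binary prediction game):

```python
# Follow-the-leader: play a best response to the opponent's empirical
# action history. Sketch; ties are broken toward the lowest index.

C = [[0, 1],
     [1, 0]]   # binary prediction cost c(i, j) = 1{i != j}

def ftl_action(past_js, C=C):
    n_i = len(C)
    if not past_js:
        return 0                          # arbitrary first action
    # Cumulative cost of each fixed action against the observed sequence;
    # minimizing it is the same as best-responding to q_bar_t.
    cum = [sum(C[i][j] for j in past_js) for i in range(n_i)]
    return min(range(n_i), key=lambda i: cum[i])

# Against an opponent who mostly plays 1, FTL settles on action 1.
a = ftl_action([1, 0, 1, 1])
```

Minimizing the cumulative cost Σ_s c(i, j_s) and minimizing c(i, q̄_t) differ only by the factor 1/T, so the code works with the raw sums.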
It is easily seen that FTL does not have the no-regret property, even against an oblivious opponent.

Consider the binary prediction problem, namely i, j ∈ {0, 1} with c(i, j) = 1{i ≠ j}. Suppose PL2 plays the sequence (j*, 1, 0, 1, 0, …), where j* is some auxiliary action with c(0, j*) = 0 and c(1, j*) = 0.5. In that case FTL yields the sequence (i_t) = (?, 0, 1, 0, 1, …), which oscillates opposite to PL2's actions, leading to C_T ≈ T and hence R_T ≈ ½T.

Still, FTL can be modified by essentially smoothing the best-response map, so that the oscillation observed above is prevented.
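This failure mode is easy to reproduce numerically. In the sketch below (illustrative code), the auxiliary action j* is emulated by seeding the cumulative action costs at (0, 0.5), after which the opponent alternates 1, 0, 1, 0, …:

```python
# FTL against the alternating sequence: the learner is wrong every round,
# so C_T ~ T while the best fixed action pays only ~T/2 (regret ~ T/2).
# The auxiliary action j* is emulated by the seeded cumulative costs.

def simulate_ftl_failure(T):
    seed = [0.0, 0.5]                        # c(0, j*) = 0, c(1, j*) = 0.5
    cum = seed[:]                            # cumulative costs of actions 0, 1
    cost = 0
    for t in range(T):
        i_t = 0 if cum[0] <= cum[1] else 1   # FTL, ties toward action 0
        j_t = 1 if t % 2 == 0 else 0         # opponent plays 1, 0, 1, 0, ...
        cost += 1 if i_t != j_t else 0       # c(i, j) = 1{i != j}
        cum[0] += 1 if j_t == 1 else 0       # c(0, j_t)
        cum[1] += 1 if j_t == 0 else 0       # c(1, j_t)
    best_fixed = min(cum[0] - seed[0], cum[1] - seed[1])
    return cost, cost - best_fixed           # (C_T, regret over rounds 1..T)

c_T, r_T = simulate_ftl_failure(100)
```

The simulation confirms the claim: FTL is wrong on every round (C_T = T), while either fixed action errs on only half the rounds, so the regret grows linearly at rate ½.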
Perturbed FTL

Let z_t = (z_{j,t})_{j∈J}, t ≥ 1, be a collection of i.i.d. random variables, and set

    i_{t+1} = BR(q̄_t + λ_t z_t).

Hannan's result holds with z_{j,t} ~ U[0, 1] (uniformly distributed on [0, 1]) and λ_t = c_1/√t, where c_1 = (3n_J²/2n_I)^{1/2}.
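A sketch of the perturbed FTL rule (the constant c_1 and the schedule λ_t = c_1/√t follow the slide; the cost matrix, function names, and seeding are illustrative):

```python
# Perturbed FTL: best-respond to the perturbed empirical frequencies
# q_bar_t + lambda_t * z_t, with i.i.d. U[0,1] perturbations z_{j,t}.
# Sketch only; the cost matrix is the binary prediction game.

import math
import random

C = [[0, 1],
     [1, 0]]                      # c(i, j) = 1{i != j}

def perturbed_ftl_action(past_js, t, rng, C=C):
    n_i, n_j = len(C), len(C[0])
    q_bar = [past_js.count(j) / max(len(past_js), 1) for j in range(n_j)]
    c1 = math.sqrt(3 * n_j**2 / (2 * n_i))        # c_1 = (3 n_J^2 / 2 n_I)^(1/2)
    lam = c1 / math.sqrt(t)                       # lambda_t = c_1 / sqrt(t)
    q_pert = [q_bar[j] + lam * rng.random() for j in range(n_j)]  # z ~ U[0,1]
    # Best response to the perturbed (unnormalized) frequency vector.
    return min(range(n_i), key=lambda i: sum(C[i][j] * q_pert[j]
                                             for j in range(n_j)))

rng = random.Random(0)
a = perturbed_ftl_action([1, 1, 1, 0], t=5, rng=rng)
```

With the opponent having mostly played 1, the perturbation at t = 5 is already too small to flip the best response away from action 1, which is exactly the stabilizing effect the decreasing schedule λ_t is designed to achieve.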
Smooth Fictitious Play

In terms of mixed actions, perturbed FTL effectively leads to p_{t+1} = BR_{λ_t}(q̄_t), where p = BR_λ(q) is a smoothed version of BR(·) (for each λ > 0).

In the next variant, introduced by Fudenberg and Levine (1995) and others, smoothing of the best-response map is implemented directly through function minimization:

    BR_λ(q) = argmin_{p∈Δ(I)} {c(p, q) + λ v(p)},

where v : Δ(I) → ℝ is a smooth, strictly convex function, with derivatives that are steep at the vertices of Δ(I).

In particular, choosing v(p) = Σ_i p_i log p_i yields the logistic map

    BR_λ(q)_i = exp(−λ⁻¹ c(i, q)) / Σ_{i'∈I} exp(−λ⁻¹ c(i', q)).
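The logistic map is a softmax over (negated, scaled) action costs. A small sketch (illustrative names and data) showing how it interpolates between the pure best response (small λ) and the uniform mixture (large λ):

```python
# Smoothed best response via entropy regularization: the logistic map
# BR_lambda(q)_i proportional to exp(-c(i, q) / lambda). Sketch only.

import math

C = [[0, 1],
     [1, 0]]                      # c(i, j) = 1{i != j}

def logistic_br(q, lam, C=C):
    costs = [sum(C[i][j] * q[j] for j in range(len(q)))
             for i in range(len(C))]              # c(i, q) for each action i
    w = [math.exp(-c / lam) for c in costs]       # softmax weights
    z = sum(w)
    return [wi / z for wi in w]

q = [0.25, 0.75]                     # opponent mostly plays 1
p_sharp = logistic_br(q, lam=0.01)   # ~ pure best response: action 1
p_soft  = logistic_br(q, lam=100.0)  # ~ uniform mixture
```

As λ → 0 the map concentrates on argmin_i c(i, q), recovering plain fictitious play, while larger λ keeps the mixed action strictly in the interior of the simplex, which is what prevents the FTL oscillation.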
Pointers to the Literature

There are a number of monographs and surveys that encompass a variety of problems within the no-regret framework.

The textbook by Cesa-Bianchi and Lugosi (2006) surveys and unifies the different approaches developed within the game theory, information theory, statistical decision theory, and machine learning communities. The monographs by Fudenberg and Levine (1998) and by Young (2004) provide a game-theoretic viewpoint. A survey by Shalev-Shwartz (2011) considers the general problem of online convex optimization, while Bubeck and Cesa-Bianchi (2012) provide an overview of the related (stochastic and nonstochastic) multi-armed bandit problem.