Competing With Strategies

Size: px
Start display at page:

Download "Competing With Strategies"

Transcription

1 JMLR: Workshop and Conference Proceedings vol (203) 27 Competing With Strategies Wei Han Alexander Rakhlin Karthik Sridharan Abstract We study the problem of online learning with a notion of regret defined with respect to a set of strategies. We develop tools for analyzing the minimax rates and for deriving regret-minimization algorithms in this scenario. While the standard methods for minimizing the usual notion of regret fail, through our analysis we demonstrate existence of regret-minimization methods that compete with such sets of strategies as: autoregressive algorithms, strategies based on statistical models, regularized least squares, and follow the regularized leader strategies. In several cases we also derive efficient learning algorithms.. Introduction he common criterion for evaluating an online learning algorithm is regret, that is the difference between the cumulative loss of the algorithm and the cumulative loss of the best fixed decision, chosen in hindsight. While much work has been done on understanding noregret algorithms, such a definition of regret against a fixed decision often draws criticism: even if regret is small, the cumulative loss of a best fixed action can be large, thus rendering the result uninteresting. o address this problem, various generalizations of the regret notion have been proposed, including regret with respect to the cost of a slowly changing compound decision. While being a step in the right direction, such definitions are still static in the sense that the decision of each compound comparator per step does not depend on the sequence of realized outcomes. Arguably, a more interesting (and more difficult to deal with) notion is that of performing as well as a set of strategies (or, algorithms). A strategy π is a sequence of functions π t, for each time period t, mapping the observed outcomes to the next action. Of course, if the collection of such strategies is finite, we may disregard their dependence on the actual sequence and treat each strategy as a black box expert. his is precisely the reason the Multiplicative Weights and other expert algorithms gained such popularity. However, this black box approach is not always desirable since some measure of the effective number of experts must play a role in the complexity of the problem: experts that predict similarly should not count as two independent ones. But what is a notion of closeness of two strategies? Imagine that we would like to develop an algorithm that incurs loss comparable to that of the best of an infinite family of strategies. o obtain such a statement, one may try to discretize the space of strategies and invoke the black-box experts method. As we show in this paper, such an approach will not always work. Instead, we present a theoretical framework for the analysis of competing against strategies and for algorithmic development, based on the ideas in (Rakhlin et al., 200, 202). 203 W. Han, A. Rakhlin & K. Sridharan.

2 Han Rakhlin Sridharan he strategies considered in this paper are termed simulatable experts in (Cesa- Bianchi and Lugosi, 2006). he authors also distinguish static and non-static experts. In particular, for static experts and absolute loss, Cesa-Bianchi and Lugosi (999) were able to show that problem complexity is governed by the geometry of the class of static experts as captured by its i.i.d. Rademacher averages. For nonstatic experts, however, the authors note that unfortunately we do not have a characterization of the minimax regret by an empirical process, due to the fact that the sequential nature of the online problems is at odds with the i.i.d.-based notions of classical empirical process theory. In recent years, however, a martingale generalization of empirical process theory has emerged, and these tools were shown to characterize learnability of online ervised learning, online convex optimization, and other scenarios (Ben-David et al., 2009; Rakhlin et al., 200). Yet, the machinery developed so far is not directly applicable to the case of general simulatable experts which can be viewed as mappings from an ever-growing set of histories to the space of actions. he goal of this paper is precisely this: to extend the non-constructive as well as constructive techniques of (Rakhlin et al., 200, 202) to simulatable experts. We analyze a number of examples with the developed techniques, but we must admit that our work only scratches the surface. We can imagine further research developing methods that compete with interesting gradient descent methods (parametrized by step size choices), with Bayesian procedures (parametrized by choices of priors), and so on. We also note the connection to online algorithms, where one typically aims to prove a bound on the competitive ratio. Our results can be seen in that light as implying a competitive ratio of one. We close the introduction with a high-level outlook, which builds on the ideas of Merhav and Feder (998). Imagine we are faced with a sequence of data from a probabilistic source, such as a k-markov model with unknown transition probabilities. A well developed statistical theory tells us how to estimate the parameter under the assumption that the model is correct. We may view an estimator as a strategy for predicting the next outcome. Suppose we have a set of possible models, with a good prediction strategy for each model. Now, let us lift the assumption that the sequence is generated by one of these models, and set the goal as that of performing as well as the best prediction strategy. In this case, if the observed sequence is indeed given by one of the models, our loss will be small because one of the strategies will perform well. If not, we still have a valid statement that does not rely on the fact that the model is well specified. o illustrate the point, we will exhibit an example where we can compete with the set of all Bayesian strategies (parametrized by priors). We then obtain a statement that we perform as well as the best of them without assuming that the model is correct. he paper is organized as follows. In Section 2, we extend the minimax analysis of online learning problems to the case of competing with a set of strategies. In Section 3, we show that it is possible to compete with a set of autoregressive strategies, and that the usual online linear optimization algorithms do not attain the optimal bounds. We then derive an optimal and computationally efficient algorithm for one of the proposed regimes. 
In Section 4 we describe the general idea of competing with statistical models that use sufficient statistics, and demonstrate an example of competing with a set of strategies parametrized by priors. For this example, we derive an optimal and efficient randomized algorithm. In Section 5, we turn to the question of competing with regularized least squares algorithms indexed by the choice of a shift and a regularization parameter. In Section 6, 2

3 Competing With Strategies we consider online linear optimization and show that it is possible to compete with Follow the Regularized Leader methods parametrized by a shift and by a step size schedule. 2. Minimax Regret and Sequential Rademacher Complexity We consider the problem of online learning, or sequential prediction, that consists of rounds. At each time t = {,..., } [ ], the learner makes a prediction f t F and observes an outcome z t Z, where F and Z are abstract sets of decisions and outcomes. Let us fix a loss function l F Z R that measures the quality of prediction. A strategy π = (π t ) is a sequence of functions π t Z t F mapping history of outcomes to a decision. Let Π denote a set of strategies. he regret with respect to Π is the difference between the cumulative loss of the player and the cumulative loss of the best strategy Reg = l(f t, z t ) inf l(π t (z t ), z t ). where we use the notation z k {z,..., z k }. Let Q = (F) and P = (Z) be the sets of probability distributions on F and Z. We now define the value of the game against a set Π of strategies as V (Π) inf... inf [Reg ]. q Q z Z f q q Q z Z f q It was shown in (Rakhlin et al., 200; Abernethy et al., 2009) that one can derive nonconstructive upper bounds on the value through a process of sequential symmetrization, and in (Rakhlin et al., 202) it was shown that these non-constructive bounds can be used as relaxations to derive an algorithm. his is the path we take in this paper. Let us describe an important variant of the above problem that of ervised learning. Here, before making a real-valued prediction ŷ t on round t, the learner observes side information x t X. Simultaneously, the actual outcome y t Y is chosen by Nature. A strategy can therefore depend on the history x t, y t and the current x t, and we write such strategies as π t (x t, y t ), with π t X t Y t Y. Fix some loss function l(ŷ, y). he value V S (Π) is then defined as x inf q (Y) y Y... ŷ q x inf [ q (Y) y Y ŷ q l(ŷ t, y t ) inf l(π t (x t, y t ), y t )]. o proceed, we need to define a notion of a tree. A Z-valued tree z is a sequence of mappings {z,..., z } with z t {±} t Z. hroughout the paper, ɛ t {±} are i.i.d. Rademacher variables, and a realization of ɛ = (ɛ,..., ɛ ) defines a path on the tree, given by z t (ɛ) (z (ɛ),..., z t (ɛ)) for any t [ ]. We write z t (ɛ) for z t (ɛ t ). By convention, a sum b a = 0 for a > b and for simplicity assume that no loss is suffered on the first round. Definition Sequential Rademacher complexity of the set Π of strategies is defined as R(l, Π) w,z ɛ [ ɛ t l(π t (w (ɛ),..., w t (ɛ)), z t (ɛ))] () where the remum is over two Z-valued trees z and w of depth. 3

4 Han Rakhlin Sridharan he w tree can be thought of as providing history while z providing outcomes. We shall use these names throughout the paper. he reader might notice that in the above definition, the outcomes and history are decoupled. We now state the main result: heorem 2 he value of prediction problem with a set Π of strategies is upper bounded as V (Π) 2R(l, Π). While the statement is visually similar to those in Rakhlin et al. (200, 20), it does not follow from these works. Indeed, the proof (which appears in Appendix) needs to deal with the additional complications stemming from the dependence of strategies on the history. Further, we provide the proof for a more general case when sequences z,..., z are not arbitrary but need to satisfy constraints. As we show below, the sequential Rademacher complexity on the right-hand side allows us to analyze general non-static experts, thus addressing the question raised in (Cesa- Bianchi and Lugosi, 999). For real-valued strategies, we can erase a Lipschitz loss function, leading to the sequential Rademacher complexity of Π without the loss and without the z tree: R(Π) w R(Π, w) w ɛ [ ɛ t π t (w t (ɛ))] For example, pose Z = {0, }, the loss function is the indicator loss, and strategies have potentially dependence on the full history. hen one can verify that ɛ w,z [ k [ k = ɛ w,z ɛ t {π t (w t (ɛ)) z t (ɛ)}] ɛ t (π t (w t (ɛ))( 2z t (ɛ)) + z t (ɛ))] = R(Π). (2) he same result holds when F = [0, ] and l is the absolute loss. he process of erasing the loss (or, contraction) extends quite nicely to problems of ervised learning. Let us state the second main result: heorem 3 Suppose the loss function l Y Y R is convex and L-Lipschitz in the first argument, and let Y = [, ]. hen V S (Π) 2L x,y ɛ [ ɛ t π t (x t (ɛ), y t (ɛ))] where (x t (ɛ), y t (ɛ)) naturally takes place of w t (ɛ) in heorem 2. Further, if Y = [, ] and l(ŷ, y) = ŷ y, V S (Π) x ɛ [ ɛ t π t (x t (ɛ), ɛ t )]. Let us present a few simple examples as a warm-up. xample (History-independent strategies) Let π f Π be constant history-independent strategies π f =... = πf = f F. hen () recovers the definition of sequential Rademacher complexity in Rakhlin et al. (200). 4

5 Competing With Strategies xample 2 (Static experts) For static experts, each strategy π is a predetermined sequence of outcomes, and we may therefore associate each π with a vector in F. A direct consequence of heorem 3 for any convex L-Lipschitz loss is that V(Π) 2L ɛ [ ɛ t π t ] which is simply the classical i.i.d. Rademacher averages. For the case of F = [0, ], Z = {0, }, and the absolute loss, this is the result of Cesa-Bianchi and Lugosi (999). xample 3 (Finite-order Markov strategies) Let Π k be a set of strategies that only depend on the k most recent outcomes to determine the next move. heorem 2 implies that the value of the game is upper bounded as V(Π k ) 2 w,z ɛ k [ ɛ t l(π t (w t k (ɛ),..., w t (ɛ)), z t (ɛ))]. Now, pose that F = Z is a finite set, of cardinality s. hen there are effectively s sk strategies π. he bound on the sequential Rademacher complexity then scales as 2s k log(s), recovering the result of Feder et al. (992) (see (Cesa-Bianchi and Lugosi, 2006, Cor. 8.2)). In addition to providing an understanding of minimax regret against a set of strategies, sequential Rademacher complexity can serve as a starting point for algorithmic development. As shown in Rakhlin et al. (202), any admissible relaxation can be used to define a succinct algorithm with a regret guarantee. For the setting of this paper, this means the following. Let Rel Z t R, for each t, be a collection of functions satisfying two conditions: t, inf q t Q z t Z { l(f t, z t ) + Rel(z t )} Rel(z t ), and inf f t q t l(π t (z t ), z t ) Rel(z ). hen we say that the relaxation is admissible. It is then easy to show that regret of any algorithm that ensures above inequalities is bounded by Rel({}). heorem 4 he conditional sequential Rademacher complexity with respect to Π R(l, Π z,..., z t ) z,w is admissible. [2 ɛ t+ ɛ s l(π s (z t, w s t (ɛ)), z s t (ɛ)) l(π s (z s ), z s )] Conditional sequential Rademacher complexity can therefore be used as a starting point for possibly deriving computationally attractive algorithms, as shown throughout the paper. We may now define covering numbers for the set Π of strategies over the history trees. he development is a straightforward modification of the notions developed in (Rakhlin et al., 200), where we replace any tree x with a tree of histories w t. Definition 5 A set V of R-valued trees is an α-cover (with respect to l p ) of a set of strategies Π on an Z-valued history tree w if π Π, ɛ {±}, v V s.t. ( π t (w t (ɛ)) v t (ɛ) p ) /p α. (3) An α-covering number N p (Π, w, α) is the size of the smallest α-cover. t s= 5

6 Han Rakhlin Sridharan For ervised learning, (x t (ɛ), y t (ɛ)) takes place of w t (ɛ). Now, for any history tree w, sequential Rademacher averages of a class of [, ]-valued strategies Π satisfy R(Π, w) inf {α + 2 log N (Π, w, α) } α 0 and the Dudley entropy integral type bound also holds: R(Π, w) inf {4α + 2 log N2 (Π, w, δ) dδ}. (4) α 0 α In particular, this bound should be compared with heorem 7 in (Cesa-Bianchi and Lugosi, 999), which employs a covering number in terms of a pointwise metric between strategies that requires closeness for all histories and all time steps. Second, the results of (Cesa- Bianchi and Lugosi, 999) for real-valued prediction require strategies to be bounded away from 0 and by δ > 0 and this restriction spoils the rates. In the rest of the paper, we show how the results of this section (a) yield proofs of existence of regret-minimization strategies with certain rates and (b) guide in the development of algorithms. For some of these examples, standard methods (such as xponential Weights) come close to providing an optimal rate, while for others fail miserably. 3. Competing with Autoregressive Strategies In this section, we consider strategies that depend linearly on the past outcomes. o this end, we fix a set Θ R k, for some k > 0, and parametrize the set of strategies as Π Θ = {π θ π θ t (z,..., z t ) = k i=0 θ i+z t k+i, θ = (θ,..., θ k ) Θ}. For consistency of notation, we assume that the sequence of outcomes is padded with zeros for t 0. First, as an example where known methods can recover the correct rate, we consider the case of a constant look-back of size k. We then extend the study to cases where neither the regret behavior nor the algorithm is known in the literature, to the best of our knowledge. 3.. Finite Look-Back Suppose Z = F R d are l 2 unit balls, the loss is l(f, z) = f, z, and Θ R k is also a unit l 2 ball. Denoting by W (t k t ) = [w t k (ɛ),..., w t (ɛ)] a matrix with columns in Z, R(l, Π Θ ) = w,z = w,z ɛ θ Θ [ ɛ t π θ (w t k t (ɛ)), z t (ɛ) ] = w,z ɛ θ Θ [ ɛ t z t (ɛ) W (t k t ) θ] ɛ ɛ t z t (ɛ) W (t k t ) k (5) In fact, this bound against all strategies parametrized by Θ is achieved by the gradient descent (GD) method with the simple update θ t+ = Proj Θ (θ t η [z t k,..., z t ] z t ) where Proj Θ is the uclidean projection onto the set Θ. his can be seen by writing the loss as [z t k,..., z t ] θ t, z t = θ t, [z t k,..., z t ] z t. he regret of GD, θ t, [z t k,..., z t ] z t inf θ Θ θ, [z t k,..., z t ] z t, is precisely regret against strategies in Θ, and analysis of GD yields the rate in (5). 6

7 Competing With Strategies 3.2. Full Dependence on History he situation becomes less obvious when k = and strategies depend on the full history. he regret bound in (5) is vacuous, and the question is whether a better bound can be proved, under some additional assumptions on Θ. Can such a bound be achieved by GD? For simplicity, consider the case of F = Z = [, ], and assume that Θ = B p () R is a unit l p ball, for some p. Since k =, it is easier to re-index the coordinates so that π θ t (z t ) = t i= θ iz i. he sequential Rademacher complexity of the strategy class is R(l, Π Θ ) = w,z θ Θ [ ɛ t π θ (w t (ɛ)) z t (ɛ)] = w,z Rearranging the terms, the last expression is equal to w,z θ Θ [ θ t w t (ɛ) ( i=t+ where q is the Hölder conjugate of p. Observe that z t i=t ɛ i z i (ɛ) z [ θ Θ [ t ( i= ɛ i z i (ɛ))] [ w (ɛ) q max w,z i= ɛ i z i (ɛ) + t t i= θ i w i (ɛ)) ɛ t z t (ɛ)]. t i=t+ ɛ i z i (ɛ) ] 2 z t ɛ i z i (ɛ) ] t ɛ i z i (ɛ) Since {ɛ t z t (ɛ) t =,..., } is a bounded martingale difference sequence, the last term is of the order of O( ). Now, pose there is some β > 0 such that w (ɛ) q β for all ɛ. his assumption can be implemented if we consider constrained adversaries, where such l q -bound is required to hold for any prefix w t (ɛ) of history (In Appendix, we prove heorem 2 for the case of constrained sequences). hen R(l, Π Θ ) C β+/2 for some constant C. We now compare the rate of convergence of sequential Rademacher and the rate of the mirror descent algorithm for different settings of q in able 3.2. If θ p and w q β for q 2, the convergence rate of mirror descent with Legendre function F (θ) = 2 θ 2 p is q β+/2 (see (Srebro et al., 20)). Θ w sequential Radem. rate Mirror descent rate B () w log q 2 B p () w q β β+/2 q β+/2 B 2 () w 2 β β+/2 β+/2 q 2 B p () w q β β+/2 β+/q B () w β β+/2 able : Comparison of the rates of convergence (up to constant factors) We observe that mirror descent, which is known to be optimal for online linear optimization, and which gives the correct rate for the case of bounded look-back strategies, in several regimes fails to yield the correct rate for more general linearly parametrized strategies. ven in the most basic regime where Θ is a unit l ball and the sequence of data is not constrained (other than Z = [, ]), there is a gap of log between the Rademacher bound and the guarantee of mirror descent. Is there an algorithm that removes this factor? i= 7

8 Han Rakhlin Sridharan Algorithms for Θ = B () For the example considered in the previous section, with F = Z = [, ] and Θ = B (), the conditional sequential Rademacher complexity of heorem 4 becomes R (Π z,..., z t ) = z,w w [2 ɛ t+ [2 ɛ t+ ɛ s π s (z t, w s t (ɛ)) z s (ɛ) π s (z s ) z s ] t s= ɛ s π s (z t, w s t (ɛ)) z s π s (z s )] where the z tree is erased, as at the end of the proof of heorem 3. Define a s (ɛ) = 2ɛ s for s > t and z s otherwise; b i (ɛ) = w i (ɛ) for i > t and z i otherwise. We can then simply write w [ ɛ t+ θ Θ s= s a s (ɛ) i= θ i b i (ɛ)] = w which we may use as a relaxation: ɛ t+ θ Θ [ s= θ s b s (ɛ) + Lemma 6 Define a t s(ɛ) = 2ɛ s for s > t, and z s otherwise. hen, is an admissible relaxation. Rel(z t ) = ɛt+ max s at i (ɛ) t s= a i (ɛ)] ɛ t+ max s With this relaxation, the following method attains O( ) regret: prediction at step t is q t = argmin q [,] { f t z t + ɛt+ z t {±} f t q max s a t i(ɛ) } a i (ɛ) where the over z t [, ] is achieved at {±} due to convexity. Following (Rakhlin et al., 202), we can also derive randomized algorithms, which can be viewed as randomized playout generalizations of the Follow the Perturbed Leader algorithm. Lemma 7 Consider the randomized strategy where at round t we first draw ɛ t+,..., ɛ uniformly at random and then further draw our move f t according to the distribution q t (ɛ) = argmin q [,] zt {,} { f t q f t z t + max s at i (ɛ) } = 2 ( max {max s=,...,t t z i i=t+ ɛ i, max {max s=,...,t t z i + 2 i=t+ ɛ i, max,..., 2 ɛ i } max,..., 2 ɛ i }) he expected regret of this randomized strategy is upper bounded by sequential Rademacher complexity: [Reg ] 2R (Π), which was shown to be O( ) (see able 3.2). he time consuming parts of the above randomized method are to draw t random bits at round t and to calculate the partial sums. However, we may replace Rademacher random variables by Gaussian N (0, ) random variables and use known results on the distributions 8

9 Competing With Strategies of extrema of a Brownian motion. o this end, define a Gaussian analogue of conditional sequential Rademacher complexity G (Π z,..., z t ) = z,w [ 2π σ t+ σ s l(π s ((z t, w s t (ɛ)), z s t (ɛ)) l(π s (z s ), z s )] where σ t N (0, ), and ɛ = (sign(σ ),..., sign(σ )). For our example the O( ) bound can be shown for G (Π) by calculating the expectation of the maximum of Brownian motion. Proofs similar to heorem 2 and heorem 4 show that the conditional Gaussian complexity G (Π z,..., z t ) is an upper bound on R (Π z,..., z t ) and is admissible (see heorem in Appendix). Furthermore, the proof of Lemma 7 holds for Gaussian random variables, and gives the randomized algorithm as in Lemma 7 with ɛ t replaced by σ t. It is not difficult to see that we can keep track of the maximum and minimum of { t z i} between rounds in O() time. We can then draw three random variables from the joint distribution of the maximum, the minimum and the endpoint of a Brownian Motion and calculate the prediction in O() time per round of the game (the joint distribution can be found in Karatzas and Shreve (99)). In conclusion, we have derived an algorithm that for the case of Θ = B (), with time complexity of O() per round and the optimal regret bound of O( ). We leave it as an open question to develop efficient and optimal algorithms for the other settings in able Competing with Statistical Models In this section we consider competing with a set of strategies that arise from statistical models. For example, for the case of Bayesian models, strategies are parametrized by the choice of a prior. Regret bounds with respect to a set of such methods can be thought of as a robustness statement: we are aiming to perform as well as the strategy with the best choice of a prior. We start this section with a general setup that needs further investigation. 4.. Compression and Sufficient Statistics Assume that strategies in Π have a particular form: they all work with a sufficient statistic, or, more loosely, compression of the past data. Suppose sufficient statistics can take values in some set Γ. Fix a set Π of mappings π Γ F. We assume that all the strategies in Π are of the form π t (z,..., z t ) = π(γ(z,..., z t )) for some π Π and γ Z Γ. Such a bottleneck Γ can arise due to a finite memory or finite precision, but can also arise if the strategies in Π are actually solutions to a statistical problem. If we assume a certain stochastic source for the data, we may estimate the parameters of the model, and there is often a natural set of sufficient statistics associated with it. If we collect all such solutions to stochastic models in a set Π, we may compete with all these strategies as long as Γ is not too large and the dependence of estimators on these sufficient statistics is smooth. With the notation introduced in this paper, we need to study the sequential Rademacher complexity for strategies Π, which can be upper bounded by the complexity of Π on Γ-valued trees: R(Π) g,z ɛ π Π [ ɛ t l( π(g t (ɛ)), z t (ɛ))] t s= 9

10 Han Rakhlin Sridharan his complexity corresponds to our intuition that with sufficient statistics the dependence on the ever-growing history can be replaced with the dependence on a summary of the data. Next, we consider one particular case of this general idea, and refer to Foster et al. (20) for more details on these types of bounds Bernoulli Model with a Beta Prior Suppose the data z t {0, } is generated according to Bernoulli distribution with parameter p, and the prior on p [0, ] is p Beta(α, β). Given the data {z,..., z t }, the maximum a posteriori (MAP) estimator of p is ˆp = ( t i= z i + α )/(t + α + β 2). We now consider the problem of competing with Π = {π α,β α >, β (, C β ]} for some C β, where each π α,β predicts the corresponding MAP value for the next round: π α,β t (z,..., z t ) = ( t i= z i + α )/(t + α + β 2). Let us consider the absolute loss, which is equivalent to probability of a mistake of the randomized prediction with bias π α,β t. hus, the loss of a strategy π α,β on round t is π α,β t (z t ) z t. Using heorem 2 and the argument in (2) to erase the outcome tree, we conclude that there exists a regret minimization algorithm against the set Π which attains regret of at most 2 w ɛ α,β [ ɛ t i= w i(ɛ)+α t t +α+β 2 ]. o analyze the rate exhibited by this upper bound, construct a new tree with g (ɛ) = and g t (ɛ) = t i= w i(ɛ)+α [0, ] for t 2. With this notation, we can simply re-write the last expression as twice g ɛ α,β [ ɛ t g t (ɛ) t+α+β 3 ] he remum ranges over all [0, ]-valued trees g, but we can pass to the remum over all [, ]-valued trees (thus making the value larger). We then observe that the remum is achieved at a {±}-valued tree g, which can then be erased as in the end of the proof of heorem 3 (roughly speaking, it amounts to renaming ɛ t into ɛ t g t (ɛ t )). We obtain an upper bound R(Π) ɛ α,β ɛ t (t + α 2) t + α + β 3 ɛ ɛ t + ɛ α,β ɛ t (β ) t + α + β 3 = ( C β + ) (6) where we used Cauchy-Schwartz inequality for the second term. We note that an experts algorithm would require a discretization that depends on and will yield a regret bound of order O( log ). It is therefore interesting to find an algorithm that avoids the discretization and obtains this regret. o this end, we take the derived upper bound on the sequential Rademacher complexity and prove that it is an admissible relaxation.. Alternatively, we can consider strategies that predict according to {ˆp /2}, which better matches the choice of an absolute loss. However, in this situation, an experts algorithm on an appropriate discretization attains the bound. 0

11 Competing With Strategies Lemma 8 he relaxation is admissible. Rel(z t ) = ɛt+ α,β [2 ɛ s t s + α 2 s + α + β 3 s i= z i s + α + β 3 z s ] Given that this relaxation is admissible, we have a guarantee that the following algorithm attains the rate ( C β + ) given in (6): q t = arg min q [0,] max { f q f z t + ɛt+ [2 z t {0,} In fact, q t can be written as q t = 2 { ɛ t+ α,β ɛt+ α,β [2 [2 α,β ɛ s t s + α 2 ɛ s s + α + β 3 ( 2z s ) s= t s + α 2 ɛ s s + α + β 3 ( 2z s ) s= s= t s + α 2 s + α + β 3 s i= z i s + α + β 3 z s ]} s= s i= z i s + α + β 3 + s i= z i s + α + β 3 t i= z i t + α + β 3 ] t i= z i t + α + β 3 ]} For a given realization of random signs, the remum is an optimization of a sum of linear fractional functions of two variables. Such an optimization can be carried out in time O( log ) (see Chen et al. (2005)). o deal with the expectation over random signs, one may either average over many realizations or use the random playout idea and only draw one sequence. Such an algorithm is admissible for the above relaxation, obtains the O( ) bound, and runs in O( log ) time per step. We leave it as an open problem whether a more efficient algorithm with O( ) regret exists. 5. Competing with Regularized Least Squares Consider the ervised learning problem with Y = [, ] and some set X. Consider the Regularized Least Squares (RLS) strategies, parametrized by a regularization parameter λ and a shift w 0. hat is, given data (x, y ),..., (x t, y t ), the strategy solves For a given pair λ and w 0, the solution is arg min w t i= (y i x i, w ) 2 + λ w w 0 2. w λ,w 0 t+ = w 0 + (X X + λi) X Y, where X R t d and Y R t are the usual matrix representations of the data x t, y t. We would like to compete against a set of such RLS strategies which make prediction w λ,w 0 t, x t, given side information x t. Since the outcomes are in [, ], without loss of generality we clip the predictions of strategies to this interval, thus making our regret minimization goal only harder. o this end, let c(a) = a if a [, ] and c(a) = sign(a) for a >. hus, given side-information x t X, the prediction of strategies in Π = {π λ,w 0 λ λ min > 0, w 0 2 } is simply the clipped product π λ,w 0 t (x t, y t ) = c ( w λ,w 0 t, x t ). Let us take the squared loss function l(ŷ, y) = (ŷ y) 2.

12 Han Rakhlin Sridharan Lemma 9 For the set Π of strategies defined above, the minimax regret of competing against Regularized Least Squares strategies is V (Π) c log( λ for an absolute constant c. Observe that λ min enters only logarithmically, which allows us to set, for instance, λ min = /. Finally, we mention that the set of strategies includes λ =. his setting corresponds to a static strategy π λ,w 0 t (x t, y t ) = w 0, x t and regret against such a static family parametrized by w 0 B 2 () is exactly the objective of online linear regression (Vovk, 998). Lemma 9 thus shows that it is possible to have vanishing regret with respect to a much larger set of strategies. It is an interesting open question of whether one can develop an efficient algorithm with the above regret guarantee. min ) 6. Competing with Follow the Regularized Leader Strategies Consider the problem of online linear optimization with the loss function l(f t, z t ) = f t, z t for f t F, z t Z. For simplicity, assume that F = Z = B 2 (). An algorithm commonly used for online linear and online convex optimization problems is the Follow the Regularized Leader (FRL) algorithm. We now consider competing with a family of FRL algorithms π w0,λ indexed by w 0 {w w } and λ Λ where Λ is a family of functions λ R + [ ] R + specifying a schedule for the choice of regularization parameters. Specifically we consider strategies π w0,λ such that π w 0,λ t (z,..., z t ) = w t where w t = w 0 + argmin w w { t i= w, z i + 2 λ ( t i= z i, t) w 2 } (7) his can be written in closed form as w t = w 0 ( t i= z i)/ max {λ ( t i= z i, t), t i= z i }. Lemma 0 For a given class Λ of functions indicating choices of the regularization parameters, define a class Γ of functions on [0, ] [/, ] specified by a/(b ) Γ = {γ b [/, ], a [0, ], γ(a, b) = min {, }, λ Λ} λ(a/(b ), /b) hen the value of the online learning game competing against FRL strategies given by quation 7 is bounded as V (Π Λ ) R (Γ) where R (Γ) is the sequential Rademacher complexity (Rakhlin et al., 200) of Γ. Notice that if Λ < then the second term is bounded as R (Γ) log Λ. However, we may compete with an infinite set of step-size rules. Indeed, each γ Γ is a function [0, ] 2 [0, ]. Hence, even if one considers Γ to be the set of all -Lipschitz functions (Lipschitz w.r.t., say, l norm), it holds that R (Γ) 2 log. We conclude that it is possible to compete with set of FRL strategies that pick any w 0 in unit ball as starting point and further use for regularization parameter schedule any λ R 2 R that is such that a/(b ) λ(a/(b ),/b) is a -Lipchitz function for every a, b [/, ]. Beyond the finite and Lipschitz cases shown above, it would be interesting to analyze richer families of step size schedules, and possibly derive efficient algorithms. 2

13 Competing With Strategies Acknowledgements We gratefully acknowledge the port of NSF under grants CARR DMS and CCF-6928, as well as Dean s Research Fund. References J. Abernethy, A. Agarwal, P. L. Bartlett, and A. Rakhlin. A stochastic view of optimal regret through minimax duality. In COL, S. Ben-David, D. Pal, and S. Shalev-Shwartz. Agnostic online learning. In Proceedings of the 22th Annual Conference on Learning heory, N. Cesa-Bianchi and G. Lugosi. On prediction of individual sequences. he Annals of Statistics, 27(6):pp , 999. N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, D.Z. Chen, O. Daescu, Y. Dai, N. Katoh, X. Wu, and J. Xu. fficient algorithms and implementations for optimizing the sum of linear fractional functions, with applications. Journal of Combinatorial Optimization, 9():69 90, M. Feder, N. Merhav, and M. Gutman. Universal prediction of individual sequences. Information heory, I ransactions on, 38(4): , 992. D. P. Foster, A. Rakhlin, K. Sridharan, and A. ewari. Complexity-based approach to calibration with checking rules. Journal of Machine Learning Research - Proceedings rack, 9:293 34, 20. I. Karatzas and S.. Shreve. Brownian Motion and Stochastic Calculus. Springer-Verlag, Berlin, 2nd edition, 99. N. Merhav and M. Feder. Universal prediction. I ransactions on Information heory, 44: , 998. A. Rakhlin, K. Sridharan, and A. ewari. Online learning: Random averages, combinatorial parameters, and learnability. In Advances in Neural Information Processing Systems, 200. A. Rakhlin, K. Sridharan, and A. ewari. Online learning: Stochastic, constrained, and smoothed adversaries. In NIPS, pages , 20. A. Rakhlin, O. Shamir, and K. Sridharan. Relax and randomize : From value to algorithms. In Advances in Neural Information Processing Systems, 202. N. Srebro, K. Sridharan, and A. ewari. On the universality of online mirror descent. In NIPS, pages , 20. V. Vovk. Competitive on-line linear regression. In NIPS 97: Proceedings of the 997 conference on Advances in neural information processing systems 0, pages , Cambridge, MA, USA, 998. MI Press. 3

14 Han Rakhlin Sridharan Appendix A. Proofs Proof [of heorem 2] Let us prove a more general version of heorem 2, which we do not state in the main text due to lack of space. he extra twist is that we allow constraints on the sequences z,..., z played by the adversary. Specifically, the adversary at round t can only play x t that satisfy constraint C t (z,..., z t ) = where (C,..., C ) is a predetermined sequence of constraints with C t Z t {0, }. When each C t is the function that is always then we are in the setting of the theorem statement where we play an unconstrained/worst case adversary. However the proof here allows us to even analyze constrained adversaries which come in handy in many cases. Following (Rakhlin et al., 20), a restriction P on the adversary is a sequence P,..., P of mappings P t Z t 2 P such that P t (z t ) is a convex subset of P for any z t Z t. In the present proof we will only consider constrained adversaries, where P t = (C t (z t )) the set of all distributions on the constrained subset C t (z t ) {z Z C t (z,..., z t, z) = }. defined at time t via a binary constraint C t Z t {0, }. Notice that the set C t (z t ) is the subset of Z from which the adversary is allowed to pick instance z t from given the history so far. It was shown in Rakhlin et al. (20) that such constraints can model sequences with certain properties, such as slowly changing sequences, low-variance sequences, and so on. Let C be the set of Z-valued trees z such that for every ɛ {±} and t [ ], C t (z (ɛ),..., z t (ɛ)) =, that is, the set of trees such that the constraint is satisfied along any path. he statement we now prove is that the value of the prediction problem with respect to a set Π of strategies and against constrained adversaries (denoted by V (Π, C )) is upper bounded by twice the sequential complexity w C,z ɛ ɛ t l(π t (w (ɛ),..., w t (ɛ))), z t (ɛ)) (8) where it is crucial that the w tree ranges over trees that respect the constraints along all paths, while z is allowed to be an arbitrary Z-valued tree. his fact that w respects the constraints is the only difference with the original statement of heorem 2 in the main body of the paper. For ease of notation we use to denote repeated application of operators such has or inf. For instance, at A inf bt B rt P [F (a, b, r,..., a, b, r )] denotes a A inf b B r P... a A inf b B r P [F (a, b, r,..., a, b, r )]. 4

15 Competing With Strategies he value of a prediction problem with respect to a set of strategies and against constrained adversaries can be written as : V (Π, C ) = inf = [ q t Q p t P t(z t ) f t q t,z t p t [ inf p t P t(z t ) z t p t f t F [ p t P t(z t ) z t p t [ p t P t(z t ) z t,z t l(f t, z t ) inf l(π t (z t ), z t )] z t l(f t, z t) l(π t (z t ), z t )] z t l(π t (z t ), z t) l(π t (z t ), z t )] l(π t (z t ), z t) l(π t (z t ), z t )] Let us now define the selector function χ Z Z {±} Z by χ(z, z, ɛ) = { z if ɛ = z if ɛ = In other words, χ t selects between z t and z t depending on the sign of ɛ. We will use the shorthand χ t (ɛ t ) χ(z t, z t, ɛ t ) and χ t (ɛ t ) (χ(z, z, ɛ ),..., χ(z t, z t, ɛ t )). We can then re-write the last statement as p t P t(χ t (ɛ t )) z t,z t ɛ t [ ɛ t (l(π t (χ t (ɛ t )), χ t ( ɛ t )) l(π t (χ t (ɛ t )), χ t (ɛ t )))] One can indeed verify that we simply used χ t to switch between z t and z t according to ɛ t. Now, we can replace the second argument of the loss in both terms by a larger value to obtain an upper bound p t P t(χ t (ɛ t )) z t,z t z t,z t 2 p t P t(χ t (ɛ t )) z t,z t ɛ t z t ɛ t [ ɛ t (l(π t (χ t (ɛ t )), z t ) l(π t (χ t (ɛ t )), z t ))] [ ɛ t l(π t (χ t (ɛ t )), z t )] since the two terms obtained by splitting the rema are the same. We now pass to the rema over z t, z t, noting that the constraints need to hold: 2 z t,z t Ct(χ t (ɛ t )) (z,z ) C z ɛ = 2 z t ɛ t [ ɛ t l(π t (χ t (ɛ t )), z t )] [ ɛ t π t (χ(z, z, ɛ ),..., χ(z t (ɛ), z t (ɛ), ɛ t )), z (ɛ)] = ( ) where in the last step we passed to the tree notation. Importantly, the pair (z, z ) of trees does not range over all pairs, but only over those which satisfy the constraints: C = {(z, z ) ɛ {±}, t [ ], z t (ɛ), z t(ɛ) C t (χ(z, z, ɛ ),..., χ(z t (ɛ), z t (ɛ), ɛ t ))} 5

16 Han Rakhlin Sridharan Now, given the pair (z, z ) C, define a Z-valued tree of depth as: w =, w t (ɛ) = χ(z t (ɛ), z t (ɛ), ɛ t ) for all t > Clearly, this is a well-defined tree, and we now claim that it satisfies the constraints along every path. Indeed, we need to check that for any ɛ and t, both w t (ɛ t 2, +), w t (ɛ t 2, ) C t ( w,..., w t (ɛ t 2 )). his amounts to checking, by definition of w and the selector χ, that z t (ɛ t 2 ), z t (ɛ t 2 ) C t (χ(z, z, ɛ ),..., χ(z t 2 (ɛ), z t 2(ɛ), ɛ t 2 )). But this is true because (z, z ) C. Hence, w constructed from z, z satisfies the constraints along every path. We can therefore upper bound the expression in ( ) by twice w C z ɛ [ ɛ t l(π t ( w (ɛ),..., w t (ɛ)), z (ɛ))]. Define w = w( ) and w = w(+), we can expend the expectation with respect to ɛ of the above expression by 2 [ l(π ( ), z ( )) + ɛ t l(π t (w (ɛ)), z (ɛ))] w C z ɛ 2 t=2 + 2 [l(π ( ), z ( )) + ɛ t l(π t (w (ɛ)), z (ɛ))]. w C z ɛ 2 With the assumption that we do not suffer lose at the first round, which means l(π ( ), z ( )) = 0, we can see that both terms achieve the rema with the same w = w. herefore, the above expression can be rewrite as w C z t=2 [ ɛ t l(π t (w(ɛ)), z (ɛ))] ɛ 2 which is precisely (8). his concludes the proof of heorem 2. Proof [of heorem 3] By convexity of the loss, x t X x t X x t X inf [ q t (Y) y t Y ŷ t q t inf q t (Y) y t Y ŷ t q t inf l(ŷ t, y t ) inf l(π t (x t, y t ), y t )] q t (Y) y t Y ŷ t q t s t [ L,L] [ l (ŷ t, y t )(ŷ t π t (x t, y t ))] [ s t (ŷ t π t (x t, y t ))] where in the last step we passed to an upper bound by allowing for the worst-case choice s t of the derivative. We will often omit the range of the variables in our notation, and it 6

17 Competing With Strategies is understood that s t s range over [ L, L], while y t, ŷ t over Y and x t s over X. Now, by Jensen s inequality, we pass to an upper bound by exchanging ŷt and yt Y: x t = x t inf q t (Y) ŷ t q t y t inf ŷ t Y y t,s t s t [ s t (ŷ t π t (x t, y t ))] [ s t (ŷ t π t (x t, y t ))] Consider the last step, assuming all the other variables fixed: x inf ŷ = x [ s t (ŷ t π t (x t, y t ))] y,s inf ŷ p (Y [ L,L]) (y,s ) p [ s t (ŷ t π t (x t, y t ))] where the distribution p ranges over all distributions on Y [ L, L]. Now observe that the function inside the infimum is convex in ŷ, and the function inside p is linear in the distribution p. Hence, we can appeal to the minimax theorem, obtaining equality of the last expression to x = = = p (Y [ L,L]) s t ŷ t + x s t ŷ t + x s t ŷ t + x p p p inf ŷ [ s t ŷ t inf s t π t (x t, y t ))] (y,s ) p inf [s ŷ inf s t π t (x t, y t ))] ŷ (y,s ) p inf s ŷ (y,s ) p ŷ inf s t π t (x t, y t )) (y,s ) p (y,s ) p inf s ŷ (y,s ) p ŷ inf s t π t (x t, y t )) We can now upper bound the choice of ŷ by that given by π, yielding an upper bound s t ŷ t + x,p = s t ŷ t + x,p (y,s ) p inf s ŷ (y,s ) p ŷ s t π t (x t, y t )) s (y,s ) p (y,s ) p s π (x, y ) s t π t (x t, y t )) 7

18 Han Rakhlin Sridharan It is not difficult to verify that this process can be repeated for and so on. resulting upper bound is therefore V S (Π) s t s t x t,p t (y t,s t) p t (y t π t(x t, y t ),s t ) pt x t,p t = x t,p t x t 2 (y t,s t) p t (y t,s t ) p t (y t,s t) p t (y t,s t ) p t (y t,s t) (y t,s t ) ɛ t ɛ t [ (s t s t ) π t (x t, y t )] [ ɛ t (s t s t ) π t (x t, y t )] [ ɛ t (s t s t ) π t (x t, y t )] x t,y t s t,st ɛ t x t,y t s t [ L,L] ɛ t [ ɛ t (s t s t ) π t (x t, y t )] [ ɛ t s t π t (x t, y t )] Since the expression is convex in each s t, we can replace the range of s t by { L, L}, or, equivalently, V S (Π) 2L x t,y t s t {,} ɛ t he [ ɛ t s t π t (x t, y t )] (9) Now consider any arbitrary function ψ {±} R, we have that ɛ [ψ(s ɛ)] = s {±} s {±} 2 (ψ(+s) + ψ( s)) = 2 (ψ(+) + ψ( )) = ɛ [ψ(ɛ)] Since in quation (9), for each t, s t and ɛ t appear together as ɛ t s t using the above equation repeatedly, we conclude that V S (Π) 2L x t,y t ɛ t [ ɛ t π t (x t, y t )] = x,y ɛ [ ɛ t π t (x t (ɛ), y t (ɛ))] he lower bound is obtained by the same argument as in Rakhlin et al. (200). Proof [of heorem 4] Denote L t (π) = t s= l(π s (z s ), z s ). he first step of the proof is an application of the minimax theorem (we assume the necessary conditions hold): inf q t (F) z t Z = { [l(f t, z t )] + [2 ft qt z,w ɛ t+ inf p t (Z) f t F { [l(f t, z t )] + zt pt [2 zt pt z,w ɛ t+ ɛ s l(π s ((z t, w s t (ɛ)), z s t (ɛ)) L t (π)]} ɛ s l(π s ((z t, w s t (ɛ)), z s t (ɛ)) L t (π)]} 8

19 Competing With Strategies For any p t (Z), the infimum over f t of the above expression is equal to [2 z t p t z,w ɛ t+ [2 z t p t z,w ɛ t+ [2 z t,z t pt z,w ɛ t+ ɛ s l(π s ((z t, w s t (ɛ)), z s t (ɛ)) L t (π) + inf f t F ɛ s l(π s ((z t, w s t (ɛ)), z s t (ɛ)) L t (π) ɛ s l(π s ((z t, w s t (ɛ)), z s t (ɛ)) L t (π) [l(f t, z t )] l(π t (z t ), z t )] z t p t + z t p t [l(π t (z t ), z t )] l(π t (z t ), z t )] +l(π t (z t ), z t) l(π t (z t ), z t )] We now argue that the independent z t and z t have the same distribution p t, and thus we can introduce a random sign ɛ t. he above expression then equals to [2 z t,z t pt ɛ t z,w ɛ t+ z t,z t pt z,z ɛ t z,w [2 ɛ t+ ɛ s l(π s ((z t, χ t (ɛ t ), w s t (ɛ)), z s t (ɛ)) L t (π) +ɛ t (l(π t (z t ), χ t ( ɛ t ))) l(π t (z t ), χ t (ɛ t )))] ɛ s l(π s ((z t, χ t (ɛ t ), w s t (ɛ)), z s t (ɛ)) L t (π) +ɛ t (l(π t (z t ), z t ) l(π t (z t ), z t ))] Splitting the resulting expression into two parts, we arrive at the upper bound of 2 z t,z t pt z [ ɛ t z,w ɛ t+ [ z,z,z ɛ t z,w ɛ t+ R (Π z,..., z t ). ɛ s l(π s ((z t, χ t (ɛ t ), w s t (ɛ)), z s t (ɛ)) 2 L t (π) + ɛ t l(π t (z t ), z t )] 2ɛ s l(π s ((z t, χ t (ɛ t ), w s t (ɛ)), z s t (ɛ)) L t (π) + ɛ t l(π t (z t ), z t )] he first inequality is true as we upper bounded the expectation by the remum. he last inequality is easy to verify, as we are effectively filling in a root z t and z t for the two subtrees, for ɛ t = + and ɛ t =, respectively, and jointing the two trees with a root. One can see that the proof of admissibility corresponds to one step minimax swap and symmetrization in the proof of Rakhlin et al. (200). In contrast, in the latter paper, all minimax swaps are performed at once, followed by symmetrization steps. 9

20 Han Rakhlin Sridharan Proof [of Lemma 6 ] he first step of the proof is an application of the minimax theorem (we assume the necessary conditions hold): inf q t (F) z t Z { f t z t + f t q t max ɛ t+ s a t i(ɛ) } = inf p t (Z) f t F {f t z t + z t p t For any p t (Z), the infimum over f t of the above expression is equal to z t + zt pt max {max a t i(ɛ), max a t i(ɛ) } ɛ t+ zt p t s>t s>t s t s t max zt pt ɛ t+ s max {max a t zt pt i(ɛ), max a t i(ɛ) + z ɛ t+ z t } t p t max max z t,z t pt ɛ t+ s>t a t i(ɛ), max a t i(ɛ) + (z t z t ) s t i s,i t We now argue that the independent z t and z t have the same distribution p t, and thus we can introduce a random sign ɛ t. he above expression then equals to max max z t,z t pt ɛ t s>t a t i(ɛ), max a t i(ɛ) + ɛ t (z t z t ) s t i s,i t max max zt pt ɛ t s>t a t i(ɛ), max a t i(ɛ) + 2ɛ t z t s t i s,i t Now, the remum over p t is achieved at a delta distribution, yielding an upper bound max max z t [,] ɛ t s>t a t i(ɛ), max a t i(ɛ) + 2ɛ t z t s t i s,i t max max ɛ t s>t a t i(ɛ), max a t i(ɛ) + 2ɛ t s t i s,i t = ɛ t max s a t i (ɛ) a t i(ɛ) } Proof [of Lemma 8] Denote t s= s i= z i L t (α, β) = + β z s. he first step of the proof is an application of the minimax theorem: inf f t z t + q t (F) z t Z f t q t ɛ t+ α,β 2 ɛ s + β L t (α, β) = inf f t z t + p t (Z) f t F z t p t zt pt ɛ t+ α,β 2 ɛ s + β L t (α, β) 20

21 Competing With Strategies For any p t (Z), the infimum over f t of the above expression is equal to ɛt+ z t p t α,β 2 ɛ s + β L t (α, β) + inf f t z t f t F z t p t ɛt+ zt pt α,β 2 t i= z i ɛ s + β L t (α, β) + z t p t + β ɛt+ z t,z t pt α,β 2 t i= z i ɛ s + β L t (α, β) + + β t i= z i + β z t z t z t z t z t t i= z i + β t i= z i + β We now argue that the independent z t and z t have the same distribution p t, and thus we can introduce a random sign ɛ t. he above expression then equals to ɛt ɛt+ 2 z t,z t pt α,β ɛt 2 z t,z t Z α,β ɛ s ɛ s + β + β L t (α, β) + ɛ t L t (α, β) + ɛ t t i= z i + β t i= z i + β z t z t t i= z i + β t i= z i + β z t z t where we upper bounded the expectation by the remum. Splitting the resulting expression into two parts, we arrive at the upper bound of 2 z t Z = 2 z t Z ɛt α,β = 2 ɛt α,β ɛ s + β 2 L t (α, β) + ɛ t ɛ s + β 2 L t (α, β) + ɛ t ɛ s + β 2 L t (α, β) + ɛ t ɛt α,β t i= z i + β t i= z i + β t i= z i + β z t ( 2z t ) ɛ t z t where the last step is due to the fact that for any z t {0, }, ɛ t ( 2z t ) has the same distribution as ɛ t. We then proceed to upper bound 2 p 2 a {±} a p ɛt α,β 2 ɛt α,β ɛ s s=t ɛt α,β ɛ s + β 2 L a t (α, β) + ɛ t + β ɛ s + β 2 L a t (α, β) + ɛ t + β 2 L t (α, β) + β 2

22 Han Rakhlin Sridharan he initial condition is trivially satisfied as Rel(z ) = inf α,β s= s i= z i + β z s heorem he conditional sequential Rademacher complexity with respect to Π G (l, Π z,..., z t ) z,w is admissible. [ 2π σ t+ σ s l(π s ((z t, w s t (ɛ)), z s t (ɛ)) l(π s (z s ), z s )] Proof [of heorem ] Denote L t (π) = t s= l(π s (z s ), z s ). Let c = σ σ = 2/π. he first step of the proof is an application of the minimax theorem (we assume the necessary conditions hold): inf q t (F) z t Z = { [l(f t, z t )] + [ 2 ft qt z,w σ t+ c inf p t (Z) f t F { [l(f t, z t )] + zt pt [ 2 zt pt z,w σ t+ c t s= σ s l(π s ((z t, w s t (ɛ)), z s t (ɛ)) L t (π)]} For any p t (Z), the infimum over f t of the above expression is equal to [ 2 σ s l(π s ((z t, w s t (ɛ)), z s t (ɛ)) L t (π) z t p t z,w σ t+ c σ s l(π s ((z t, w s t (ɛ)), z s t (ɛ)) L t (π)]} + inf f t F [ 2 σ s l(π s ((z t, w s t (ɛ)), z s t (ɛ)) L t (π) z t p t z,w σ t+ c [ 2 σ s l(π s ((z t, w s t (ɛ)), z s t (ɛ)) L t (π) z t,z t pt z,w σ t+ c [l(f t, z t )] l(π t (z t ), z t )] z t p t + z t p t [l(π t (z t ), z t )] l(π t (z t ), z t )] +l(π t (z t ), z t) l(π t (z t ), z t )] We now argue that the independent z t and z t have the same distribution p t, and thus we can introduce a gaussian random variable σ t and a random sign ɛ t = sign(σ t ). he above 22

23 Competing With Strategies expression then equals to [ 2 σ s l(π s ((z t, χ t (ɛ t ), w s t (ɛ)), z s t (ɛ)) L t (π) z t,z t pt σ t z,w σ t+ c +ɛ t (l(π t (z t ), χ t ( ɛ t ))) l(π t (z t ), χ t (ɛ t )))] [ 2 σ s l(π s ((z t, χ t (ɛ t ), w s t (ɛ)), z s t (ɛ)) L t (π) z t,z t pt σ t z,w σ t+ c +ɛ t σ t σ t c (l(π t(z t ), χ t ( ɛ t ))) l(π t (z t ), χ t (ɛ t )))] Put the expectation outside and use the fact ɛ t σ t = σ t, we get [ 2 σ s l(π s ((z t, χ t (ɛ t ), w s t (ɛ)), z s t (ɛ)) L t (π) z t,z t pt σ t z,w σ t+ c z t,z t pt z,z + σ t c (l(π t(z t ), χ t ( ɛ t ))) l(π t (z t ), χ t (ɛ t )))] [ 2 ɛ s l(π s ((z t, χ t (ɛ t ), w s t (ɛ)), z s t (ɛ)) L t (π) σ t z,w σ t+ c + σ t c (l(π t(z t ), z t ) l(π t (z t ), z t ))] Splitting the resulting expression into two parts, we arrive at the upper bound of 2 z t,z t pt z [ σ t z,w σ t+ c σ s l(π s ((z t, χ t (ɛ t ), w s t (ɛ)), z s t (ɛ)) 2 L t (π) + σ t c l(π t(z t ), z t )] [ 2 σ s l(π s ((z t, χ t (ɛ t ), w s t (ɛ)), z s t (ɛ)) L t (π) z,z,z σ t z,w σ t+ c G (l, Π z,..., z t ). + 2σ t c l(π t(z t ), z t )] Proof [of Lemma 7] Let q t be the randomized strategy where we draw ɛ t+,..., ɛ uniformly at random and pick q t (ɛ) = argmin q [,] { f t z t + max z t {,} f t q s a t i(ɛ) } (0) 23

Competing With Strategies

Competing With Strategies Competing With Strategies Wei Han Univ. of Pennsylvania Alexander Rakhlin Univ. of Pennsylvania February 3, 203 Karthik Sridharan Univ. of Pennsylvania Abstract We study the problem of online learning

More information

Predictable Sequences and Competing with Strategies

Predictable Sequences and Competing with Strategies University of Pennsylvania ScholarlyCommons Publicly Accessible Penn Dissertations 1-1-2013 Predictable Sequences and Competing with Strategies Wei Han University of Pennsylvania, weihan.upenn@gmail.com

More information

Online Learning: Random Averages, Combinatorial Parameters, and Learnability

Online Learning: Random Averages, Combinatorial Parameters, and Learnability Online Learning: Random Averages, Combinatorial Parameters, and Learnability Alexander Rakhlin Department of Statistics University of Pennsylvania Karthik Sridharan Toyota Technological Institute at Chicago

More information

Adaptive Online Gradient Descent

Adaptive Online Gradient Descent University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 6-4-2007 Adaptive Online Gradient Descent Peter Bartlett Elad Hazan Alexander Rakhlin University of Pennsylvania Follow

More information

Online Optimization : Competing with Dynamic Comparators

Online Optimization : Competing with Dynamic Comparators Ali Jadbabaie Alexander Rakhlin Shahin Shahrampour Karthik Sridharan University of Pennsylvania University of Pennsylvania University of Pennsylvania Cornell University Abstract Recent literature on online

More information

On the Generalization Ability of Online Strongly Convex Programming Algorithms

On the Generalization Ability of Online Strongly Convex Programming Algorithms On the Generalization Ability of Online Strongly Convex Programming Algorithms Sham M. Kakade I Chicago Chicago, IL 60637 sham@tti-c.org Ambuj ewari I Chicago Chicago, IL 60637 tewari@tti-c.org Abstract

More information

Online Learning with Predictable Sequences

Online Learning with Predictable Sequences JMLR: Workshop and Conference Proceedings vol (2013) 1 27 Online Learning with Predictable Sequences Alexander Rakhlin Karthik Sridharan rakhlin@wharton.upenn.edu skarthik@wharton.upenn.edu Abstract We

More information

Unconstrained Online Linear Learning in Hilbert Spaces: Minimax Algorithms and Normal Approximations

Unconstrained Online Linear Learning in Hilbert Spaces: Minimax Algorithms and Normal Approximations JMLR: Workshop and Conference Proceedings vol 35:1 20, 2014 Unconstrained Online Linear Learning in Hilbert Spaces: Minimax Algorithms and Normal Approximations H. Brendan McMahan MCMAHAN@GOOGLE.COM Google,

More information

Exponential Weights on the Hypercube in Polynomial Time

Exponential Weights on the Hypercube in Polynomial Time European Workshop on Reinforcement Learning 14 (2018) October 2018, Lille, France. Exponential Weights on the Hypercube in Polynomial Time College of Information and Computer Sciences University of Massachusetts

More information

Full-information Online Learning

Full-information Online Learning Introduction Expert Advice OCO LM A DA NANJING UNIVERSITY Full-information Lijun Zhang Nanjing University, China June 2, 2017 Outline Introduction Expert Advice OCO 1 Introduction Definitions Regret 2

More information

Online Learning and Online Convex Optimization
