Online Convex Optimization with Unconstrained Domains and Losses


Ashok Cutkosky, Department of Computer Science, Stanford University
Kwabena Boahen, Department of Bioengineering, Stanford University

Abstract

We propose an online convex optimization algorithm (RescaledExp) that achieves optimal regret in the unconstrained setting without prior knowledge of any bounds on the loss functions. We prove a lower bound showing an exponential separation between the regret of existing algorithms that require a known bound on the loss functions and any algorithm that does not require such knowledge. RescaledExp matches this lower bound asymptotically in the number of iterations. RescaledExp is naturally hyperparameter-free and we demonstrate empirically that it matches prior optimization algorithms that require hyperparameter optimization.

1 Online Convex Optimization

Online Convex Optimization (OCO) [1, 2] provides an elegant framework for modeling noisy, antagonistic or changing environments. The problem can be stated formally with the help of the following definitions:

Convex Set: a set $W$ is convex if $W$ is contained in some real vector space and $tw + (1-t)w' \in W$ for all $w, w' \in W$ and $t \in [0, 1]$.

Convex Function: $f : W \to \mathbb{R}$ is a convex function if $f(tw + (1-t)w') \le t f(w) + (1-t) f(w')$ for all $w, w' \in W$ and $t \in [0, 1]$.

An OCO problem is a game of repeated rounds in which on round $t$ a learner first chooses an element $w_t$ in some convex space $W$, then receives a convex loss function $\ell_t$, and suffers loss $\ell_t(w_t)$. The regret of the learner with respect to some other $u \in W$ is defined by

$$R_T(u) = \sum_{t=1}^T \ell_t(w_t) - \ell_t(u).$$

The objective is to design an algorithm that can achieve low regret with respect to any $u$, even in the face of adversarially chosen $\ell_t$.

Many practical problems can be formulated as OCO problems. For example, the stochastic optimization problems found widely throughout machine learning have exactly the same form, but with i.i.d. loss functions, a subset of the OCO problems. In this setting the goal is to identify a vector $w$ with low generalization error $\mathbb{E}[\ell(w) - \ell(u)]$. We can solve this by running an OCO algorithm for $T$ rounds and setting $w$ to be the average value of $w_t$. By online-to-batch conversion results [3, 4], the generalization error is bounded by the expectation of the regret over the $\ell_t$ divided by $T$. Thus, OCO algorithms can be used to solve stochastic optimization problems while also performing well in non-i.i.d. settings.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.
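To make the protocol concrete, here is a minimal sketch (ours, not part of the paper) of the OCO game loop, the regret computation, and the online-to-batch average; the predict/update learner interface is an assumption of the sketch.

```python
import numpy as np

def oco_game(learner, make_loss, T):
    """Play T rounds of OCO: the learner proposes w_t, then sees the loss.

    `learner` is assumed to expose predict() -> w_t and update(g_t);
    `make_loss(t, w_t)` returns (loss_fn, subgrad_fn) and may depend
    adversarially on w_t. Returns the iterates and the loss functions."""
    iterates, loss_fns = [], []
    for t in range(T):
        w_t = learner.predict()
        loss_fn, subgrad_fn = make_loss(t, w_t)
        learner.update(subgrad_fn(w_t))      # access losses only via subgradients
        iterates.append(w_t)
        loss_fns.append(loss_fn)
    return iterates, loss_fns

def regret(iterates, loss_fns, u):
    # R_T(u) = sum_t l_t(w_t) - l_t(u)
    return sum(l(w) for l, w in zip(loss_fns, iterates)) - sum(l(u) for l in loss_fns)

def online_to_batch(iterates):
    # Averaged iterate: generalization error is bounded by E[regret] / T
    return np.mean(np.array(iterates), axis=0)
```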

The regret of an OCO problem is upper-bounded by the regret on a corresponding Online Linear Optimization (OLO) problem, in which each $\ell_t$ is further constrained to be a linear function: $\ell_t(w) = g_t \cdot w$ for some $g_t$. The reduction follows with the help of one more definition:

Subgradient: $g \in W$ is a subgradient of $f$ at $w$, denoted $g \in \partial f(w)$, if and only if $f(w) + g \cdot (w' - w) \le f(w')$ for all $w'$. Note that $\partial f(w) \ne \emptyset$ if $f$ is convex. (In full generality, a subgradient is an element of the dual space $W^*$. However, we will only consider cases where the subgradient is naturally identified with an element in the original space $W$, e.g. $W$ is finite dimensional, so that the definition in terms of dot-products suffices.)

To reduce OCO to OLO, suppose $g_t \in \partial \ell_t(w_t)$, and consider replacing $\ell_t(w)$ with the linear approximation $g_t \cdot w$. Then using the definition of subgradient,

$$R_T(u) = \sum_{t=1}^T \ell_t(w_t) - \ell_t(u) \le \sum_{t=1}^T g_t \cdot (w_t - u) = \sum_{t=1}^T g_t \cdot w_t - g_t \cdot u,$$

so that replacing $\ell_t(w)$ with $g_t \cdot w$ can only make the problem more difficult. All of the analysis in this paper therefore addresses OLO, accessing convex loss functions only through subgradients.

There are two major factors that influence the regret of OLO algorithms: the size of the space $W$ and the size of the subgradients $g_t$. When $W$ is a bounded set (the constrained case), then given $B = \max_{w \in W} \|w\|$, there exist OLO algorithms [5, 6] that can achieve $R_T(u) \le O(B L_{\max} \sqrt{T})$ without knowing $L_{\max} = \max_t \|g_t\|$. When $W$ is unbounded (the unconstrained case), then given $L_{\max}$, there exist algorithms [7, 8, 9] that achieve $R_T(u) \le \tilde{O}(\|u\| \log(\|u\|) L_{\max} \sqrt{T})$ or $R_T(u) \le \tilde{O}(\|u\| \sqrt{\log \|u\|}\, L_{\max} \sqrt{T})$, where $\tilde{O}$ hides factors that depend logarithmically on $L_{\max}$ and $T$. These algorithms are known to be optimal up to constants for their respective regimes [10, 7]. (There are algorithms that do not require $L_{\max}$, but they achieve only $O(\|u\|^2)$ dependence on $u$ in the regret [11].)

All algorithms for the unconstrained setting to-date require knowledge of $L_{\max}$ to achieve these optimal bounds. Thus a natural question is: can we achieve $O(\|u\| \log \|u\|)$ regret in the unconstrained, unknown-$L_{\max}$ setting? This problem has been posed as a COLT 2016 open problem [12], and is solved in this paper. A simple approach is to maintain an estimate of $L_{\max}$ and double it whenever we see a new $g_t$ that violates the assumed bound (the so-called doubling trick), thereby turning a known-$L_{\max}$ algorithm into an unknown-$L_{\max}$ algorithm. This strategy fails for previous known-$L_{\max}$ algorithms because their analysis makes strong use of the assumption that each and every $g_t$ is bounded by $L_{\max}$. The existence of even a small number of bound-violating $g_t$ can throw off the entire analysis.

In this paper, we prove that it is actually impossible to achieve regret

$$O\!\left(\|u\| \log(\|u\|)\, L_{\max} \sqrt{T} + L_{\max} \exp\!\left[\Big(\max_t \tfrac{\|g_t\|}{L_t}\Big)^{1/2 - \epsilon}\right]\right)$$

for any $\epsilon > 0$, where $L_{\max}$ and $L_t = \max_{t' < t} \|g_{t'}\|$ are unknown in advance (Section 2). This immediately rules out the ideal bound of $\tilde{O}(\|u\| \log(\|u\|) L_{\max} \sqrt{T})$, which is possible in the known-$L_{\max}$ case. Secondly, we provide an algorithm, RescaledExp, that matches our lower bound without prior knowledge of $L_{\max}$, leading to a naturally hyperparameter-free algorithm (Section 3). To our knowledge, this is the first algorithm to address the unknown-$L_{\max}$ issue while maintaining $O(\|u\| \log \|u\|)$ dependence on $u$. Finally, we present empirical results showing that RescaledExp performs well in practice (Section 4).

2 Lower Bound with Unknown $L_{\max}$

The following theorem rules out algorithms that achieve regret $O(\|u\| \log(\|u\|) L_{\max} \sqrt{T})$ without prior knowledge of $L_{\max}$. In fact, any such algorithm must pay an up-front penalty that is exponential in $T$. This lower bound resolves a COLT 2016 open problem (Parameter-Free and Scale-Free Online Algorithms [12]) in the negative.
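As a small illustration of this reduction (ours), using the hinge loss that also appears in the experiments of Section 4: the subgradient inequality guarantees that the linearized regret upper-bounds the true regret on every sample.

```python
import numpy as np

def hinge_loss(w, x, y):
    return max(0.0, 1.0 - y * np.dot(w, x))

def hinge_subgradient(w, x, y):
    # An element of the subdifferential: -y*x when the margin is violated, else 0.
    return -y * x if y * np.dot(w, x) < 1.0 else np.zeros_like(x)

rng = np.random.default_rng(0)
w_t, u, x, y = rng.normal(size=3), rng.normal(size=3), rng.normal(size=3), 1.0
g_t = hinge_subgradient(w_t, x, y)
# l_t(w_t) - l_t(u) <= g_t . (w_t - u), so OLO regret bounds OCO regret.
assert hinge_loss(w_t, x, y) - hinge_loss(u, x, y) <= np.dot(g_t, w_t - u) + 1e-12
```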

Theorem 1. For any constants $c, k, \epsilon > 0$, there exists a $T$ and an adversarial strategy picking $g_t \in \mathbb{R}$ in response to $w_t \in \mathbb{R}$ such that the regret satisfies

$$R_T(u) = \sum_{t=1}^T g_t w_t - g_t u \ge c\|u\|\log(\|u\| L_{\max}\sqrt{T})(\log L_{\max} + 1) + k L_{\max}\sqrt{T}(\log L_{\max} + 1) + k L_{\max}\exp(T^{1/2-\epsilon})$$
$$\ge c\|u\|\log(\|u\| L_{\max}\sqrt{T})(\log L_{\max} + 1) + k L_{\max}\sqrt{T}(\log L_{\max} + 1) + k L_{\max}\exp\!\left[\Big(\max_t \tfrac{g_t}{L_t}\Big)^{1/2-\epsilon}\right]$$

for some $u \in \mathbb{R}$, where $L_{\max} = \max_{t \le T} |g_t|$ and $L_t = \max_{t' < t} |g_{t'}|$.

Proof. We prove the theorem by showing that for sufficiently large $T$, the adversary can checkmate the learner by presenting it only with the subgradient $g_t = -1$. If the learner fails to have $w_t$ increase quickly, then there is a $u$ against which the learner has high regret. On the other hand, if the learner ever does make $w_t$ higher than a particular threshold, the adversary immediately punishes the learner with a subgradient $g_t = T$, again resulting in high regret.

Let $T$ be large enough that both of the following hold:

1. $(T - 2\sqrt{T})\,\dfrac{\exp(\sqrt{T})}{4\log c} > c\,u\log(u\sqrt{T}) + k\sqrt{T}(\log T + 1) + k\exp(T^{1/2-\epsilon})$, where $u = \exp(\sqrt{T})/(4\log c)$;
2. $\dfrac{T\exp(\sqrt{T})}{2 \cdot 4\log c} > kT\exp(T^{1/2-\epsilon}) + kT\sqrt{T}(\log T + 1)$.

The adversary plays the following strategy: for all $t \le T$, so long as $w_t < \exp(\sqrt{t})/(4\log c)$, give $g_t = -1$. As soon as $w_t \ge \exp(\sqrt{t})/(4\log c)$, give $g_t = T$ once, and $g_t = 0$ for all subsequent $t$. Let's analyze the regret at time $T$ in these two cases.

Case 1: $w_t < \exp(\sqrt{t})/(4\log c)$ for all $t$. In this case, let $u = \exp(\sqrt{T})/(4\log c)$. Then $L_{\max} = 1$, $\max_t g_t/L_t = 1$, and, since $\sum_{t=1}^T \exp(\sqrt{t}) \le 2\sqrt{T}\exp(\sqrt{T})$, using (1) the learner's regret is at least

$$R_T(u) \ge Tu - \sum_{t=1}^T w_t \ge Tu - 2\sqrt{T}\,u = (T - 2\sqrt{T})\frac{\exp(\sqrt{T})}{4\log c}$$
$$> c\|u\|\log(\|u\| L_{\max}\sqrt{T})(\log L_{\max} + 1) + kL_{\max}\sqrt{T}(\log L_{\max} + 1) + kL_{\max}\exp(T^{1/2-\epsilon}).$$

Case 2: $w_t \ge \exp(\sqrt{t})/(4\log c)$ for some $t$. In this case, $L_{\max} = T$ and $\max_t g_t/L_t = T$. For $u = 0$, using (2), the regret is at least

$$R_T(u) \ge \frac{T\exp(\sqrt{T})}{2 \cdot 4\log c} \ge kT\exp(T^{1/2-\epsilon}) + kT\sqrt{T}(\log T + 1) = kL_{\max}\exp(T^{1/2-\epsilon}) + kL_{\max}\sqrt{T}(\log L_{\max} + 1)$$
$$= c\|u\|\log(\|u\| L_{\max}\sqrt{T})(\log L_{\max} + 1) + kL_{\max}\sqrt{T}(\log L_{\max} + 1) + kL_{\max}\exp\!\left[\Big(\max_t \tfrac{g_t}{L_t}\Big)^{1/2-\epsilon}\right],$$

since $u = 0$. $\square$

The exponential lower bound arises because the learner has to move exponentially fast in order to deal with exponentially far away $u$, but then experiences exponential regret if the adversary provides a gradient of unprecedented magnitude in the opposite direction. However, if we play against an adversary that is constrained to give loss vectors $\|g_t\| \le L_{\max}$ for some $L_{\max}$ that does not grow with time, or if the losses do not grow too quickly, then we can still achieve $R_T(u) = O(\|u\|\log(\|u\|)L_{\max}\sqrt{T})$ asymptotically without knowing $L_{\max}$. In the following sections we describe an algorithm that accomplishes this.
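The checkmate strategy is easy to simulate. Below is a toy version (ours, with the threshold constants taken from the reconstruction above) that plays the adversary against an arbitrary black-box learner:

```python
import numpy as np

def checkmate_adversary(learner_step, T, c=np.e):
    """Feed g_t = -1 while w_t stays below exp(sqrt(t))/(4 log c); punish the
    first crossing with a single g_t = T, then give g_t = 0 afterwards."""
    w, punished, grads, total_loss = 0.0, False, [], 0.0
    for t in range(1, T + 1):
        threshold = np.exp(np.sqrt(t)) / (4.0 * np.log(c))
        if punished:
            g = 0.0
        elif w >= threshold:
            g, punished = float(T), True
        else:
            g = -1.0
        grads.append(g)
        total_loss += g * w          # linear loss g_t * w_t
        w = learner_step(w, g)       # learner's update rule (black box)
    return grads, total_loss

# e.g. a gradient-descent learner: checkmate_adversary(lambda w, g: w - 0.1 * g, T=100)
```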

3 RescaledExp

Our algorithm, RescaledExp, adapts to the unknown $L_{\max}$ using a guess-and-double strategy that is robust to a small number of bound-violating $g_t$. We initialize a guess $L_1$ for $L_{\max}$ to $2\|g_1\|$. Then we run a novel known-$L_{\max}$ algorithm that can achieve good regret in the unconstrained-$u$ setting. As soon as we see a $g_t$ with $\|g_t\| > L_t$, we update our guess to $2\|g_t\|$ and restart the known-$L_{\max}$ algorithm. To prove that this scheme is effective, we show (Lemma 3) that our known-$L_{\max}$ algorithm does not suffer too much regret when it sees a $g_t$ that violates its assumed bound.

Our known-$L_{\max}$ algorithm uses the Follow-the-Regularized-Leader (FTRL) framework. FTRL is an intuitive way to design OCO algorithms [13]: given functions $\psi_t : W \to \mathbb{R}$, at time $T$ we play

$$w_T = \operatorname{argmin}_w \left[\psi_{T-1}(w) + \sum_{t=1}^{T-1} \ell_t(w)\right].$$

The functions $\psi_t$ are called regularizers. A large number of OCO algorithms (e.g. gradient descent) can be cleanly formulated as instances of this framework.

Our known-$L_{\max}$ algorithm is FTRL with regularizers $\psi_t(w) = \psi(w)/\eta_t$, where

$$\psi(w) = (\|w\| + 1)\log(\|w\| + 1) - \|w\|$$

and $\eta_t$ is a scale-factor that we adapt over time. Specifically, we set $\eta_t = 1/\big(k\sqrt{M_t + \|g\|^2_{1:t}}\big)$, where we use the compressed-sum notations $g_{1:T} = \sum_{t=1}^T g_t$ and $\|g\|^2_{1:T} = \sum_{t=1}^T \|g_t\|^2$. $M_t$ is defined recursively by $M_0 = 0$ and $M_t = \max\big(M_{t-1}, \|g_{1:t}\|/p - \|g\|^2_{1:t}\big)$, so that $M_t \ge M_{t-1}$ and $M_t + \|g\|^2_{1:t} \ge \|g_{1:t}\|/p$. Here $k$ and $p$ are constants: $k = 2$ and $p = 1/(2L_{\max})$.

RescaledExp's strategy is to maintain an estimate $L_t$ of $L_{\max}$ at all time steps. Whenever it observes $\|g_t\| > L_t$, it updates $L_{t+1} = 2\|g_t\|$. We call periods during which $L_t$ is constant epochs. Every time it updates $L_t$, it restarts our known-$L_{\max}$ algorithm with $p = 1/L_{t+1}$, beginning a new epoch. Notice that since $L_t$ at least doubles every epoch, there will be at most $\log_2(L_{\max}/L_1) + 1$ total epochs. To address edge cases, we set $w_t = 0$ until we suffer a non-constant loss function, and we set the initial value of $L_t$ from the first non-zero $g_t$. Pseudo-code is given in Algorithm 1, and Theorem 2 states our regret bound. For simplicity, we re-index so that $g_1$ is the first non-zero gradient received. No regret is suffered when $g_t = 0$, so this does not affect our analysis.

Algorithm 1 RescaledExp
  Initialize: $k \leftarrow 2$, $M_0 \leftarrow 0$, $w_1 \leftarrow 0$, $t_1 \leftarrow 1$  // $t_1$ is the start-time of the current epoch
  for $t = 1$ to $T$ do
    Play $w_t$, receive subgradient $g_t \in \partial \ell_t(w_t)$.
    if $t = 1$ then
      $L_1 \leftarrow 2\|g_1\|$
      $p \leftarrow 1/L_1$
    end if
    $M_t \leftarrow \max\big(M_{t-1}, \|g_{t_1:t}\|/p - \|g\|^2_{t_1:t}\big)$
    $\eta_t \leftarrow 1/\big(k\sqrt{M_t + \|g\|^2_{t_1:t}}\big)$
    // Set $w_{t+1}$ using the FTRL update
    $w_{t+1} \leftarrow -\dfrac{g_{t_1:t}}{\|g_{t_1:t}\|}\left[\exp(\eta_t\|g_{t_1:t}\|) - 1\right]$  // $= \operatorname{argmin}_w\ \psi(w)/\eta_t + g_{t_1:t}\cdot w$
    if $\|g_t\| > L_t$ then
      // Begin a new epoch: update $L$ and restart FTRL
      $L_{t+1} \leftarrow 2\|g_t\|$
      $p \leftarrow 1/L_{t+1}$
      $t_1 \leftarrow t + 1$
      $M_t \leftarrow 0$
      $w_{t+1} \leftarrow 0$
    else
      $L_{t+1} \leftarrow L_t$
    end if
  end for

Theorem 2. Let $W$ be a separable real inner-product space with corresponding norm $\|\cdot\|$, and suppose (with mild abuse of notation) every loss function $\ell_t : W \to \mathbb{R}$ has some subgradient $\tilde{g}_t \in W^*$ at $w_t$ such that $\tilde{g}_t(w) = \langle g_t, w\rangle$ for some $g_t \in W$. Let $M_{\max} = \max_t M_t$. Then if $L_{\max} = \max_t \|g_t\|$
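For concreteness, here is a minimal NumPy sketch of Algorithm 1 under the reconstruction above ($k = 2$, $L_{t+1} = 2\|g_t\|$, $p = 1/L_{t+1}$); the class interface is our own, and this should be read as an illustration rather than a reference implementation.

```python
import numpy as np

class RescaledExp:
    """Sketch of Algorithm 1 for vectors in R^d (illustrative only)."""

    def __init__(self, dim, k=2.0):
        self.k = k
        self.w = np.zeros(dim)   # current prediction w_t
        self.L = None            # current guess L_t for L_max (set on first nonzero g)
        self.p = None
        self._restart(dim)

    def _restart(self, dim):
        self.M = 0.0                   # M_t
        self.g_sum = np.zeros(dim)     # g_{t1:t}, within-epoch gradient sum
        self.g_sq = 0.0                # sum of ||g_t||^2 within the epoch

    def predict(self):
        return self.w

    def update(self, g):
        norm_g = float(np.linalg.norm(g))
        if self.L is None:
            if norm_g == 0.0:          # play 0 until the first nonzero gradient
                return
            self.L, self.p = 2.0 * norm_g, 1.0 / (2.0 * norm_g)
        self.g_sum += g
        self.g_sq += norm_g ** 2
        s = float(np.linalg.norm(self.g_sum))
        self.M = max(self.M, s / self.p - self.g_sq)
        eta = 1.0 / (self.k * np.sqrt(self.M + self.g_sq))
        # FTRL update: w_{t+1} = -(g_sum/||g_sum||)(exp(eta*||g_sum||) - 1)
        self.w = np.zeros_like(self.w) if s == 0.0 else -(self.g_sum / s) * (np.exp(eta * s) - 1.0)
        if norm_g > self.L:            # bound violated: rescale guess, restart epoch
            self.L, self.p = 2.0 * norm_g, 1.0 / (2.0 * norm_g)
            self._restart(len(self.w))
            self.w = np.zeros_like(self.w)
```

Plugged into an online loop such as the one sketched in Section 1, this learner requires no learning-rate tuning; the only state carried across epochs is the scale guess $L_t$.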

and $L_t = \max_{t' < t}\|g_{t'}\|$, RescaledExp achieves regret:

$$R_T(u) \le (2\psi(u) + 96)\left(\log_2\frac{L_{\max}}{L_1} + 1\right)\sqrt{M_{\max} + \|g\|^2_{1:T}} + 8L_{\max}\left(\log_2\frac{L_{\max}}{L_1} + 1\right)\min\left\{\exp\Big(8\max_t\frac{\|g_t\|}{L_t}\Big),\ \exp(\sqrt{T})\right\}$$
$$= O\left(L_{\max}\log\frac{L_{\max}}{L_1}\left[\big(\|u\|\log\|u\| + 1\big)\sqrt{T} + \min\left\{\exp\Big(8\max_t\frac{\|g_t\|}{L_t}\Big),\ \exp(\sqrt{T})\right\}\right]\right).$$

The conditions on $W$ in Theorem 2 are fairly mild. In particular they are satisfied whenever $W$ is finite-dimensional, and in most kernel-method settings [14]. In the kernel-method setting, $W$ is an RKHS of functions $X \to \mathbb{R}$ and our losses take the form $\ell_t(w) = \ell_t(\langle w, k_{x_t}\rangle)$, where $k_{x_t}$ is the representing element in $W$ of some $x_t \in X$, so that $g_t = \tilde{g}_t k_{x_t}$ where $\tilde{g}_t \in \partial\ell_t(\langle w, k_{x_t}\rangle)$.

Although we nearly match our lower-bound exponential term of $\exp(T^{1/2-\epsilon})$, in order to have a practical algorithm we need to do much better. Fortunately, the $\max_t \|g_t\|/L_t$ term may be significantly smaller when the losses are not fully adversarial. For example, if the loss vectors satisfy $\|g_t\| = 2^t$, then the exponential term in our bound reduces to a manageable constant even though $\|g_t\|$ is growing quickly without bound.

To prove Theorem 2, we bound the regret of RescaledExp during each epoch. Recall that during an epoch, RescaledExp is running FTRL with $\psi_t(w) = \psi(w)/\eta_t$. Therefore our first order of business is to analyze the regret of FTRL across one of these epochs, which we do in Lemma 3 (proved in the appendix):

Lemma 3. Set $k = 2$. Suppose $\|g_t\| \le L$ for $t < T$, $1/(2L) \le p \le 2/L$, $\|g_T\| \le L_{\max}$ and $L_{\max} \ge L$. Let $W_{\max} = \max_{t\in[1,T]}\|w_t\|$. Then the regret of FTRL with regularizers $\psi_t(w) = \psi(w)/\eta_t$ is:

$$R_T(u) \le \frac{\psi(u)}{\eta_T} + 96\sqrt{M_{T-1} + \|g\|^2_{1:T-1}} + L_{\max}\min\left\{W_{\max},\ 4\exp\Big(\frac{4L_{\max}}{L}\Big),\ \exp(\sqrt{T})\right\}$$
$$\le (2\psi(u) + 96)\sqrt{2L\sum_{t=1}^{T-1}\|g_t\| + L_{\max}^2} + 8L_{\max}\min\left\{\exp\Big(\frac{4L_{\max}}{L}\Big),\ \exp(\sqrt{T})\right\}$$
$$\le 2L_{\max}\big[(\|u\| + 1)\log(\|u\| + 1) + 96\big]\sqrt{2T} + 8L_{\max}\min\left\{e^{4L_{\max}/L},\ e^{\sqrt{T}}\right\}.$$

Lemma 3 requires us to know the value of $L$ in order to set $p$. However, the crucial point is that it encompasses the case in which $L$ is misspecified on the last loss vector. This allows us to show that RescaledExp does not suffer too much by updating $p$ on-the-fly.

Proof of Theorem 2. The theorem follows by applying Lemma 3 to each epoch in which $L_t$ is constant. Let $1 = t_1 < t_2 < \dots < t_n$ be the increasing values of the epoch start-times as defined in Algorithm 1, and define $t_{n+1} = T + 1$. Then define

$$R_{a:b}(u) = \sum_{t=a}^{b-1} g_t\cdot(w_t - u),$$

so that $R_T(u) \le \sum_{j=1}^n R_{t_j:t_{j+1}}(u)$. We will bound $R_{t_j:t_{j+1}}(u)$ for each $j$.

Fix a particular $j < n$. Then $R_{t_j:t_{j+1}}(u)$ is simply the regret of FTRL with $k = 2$, $p = 1/L_{t_j}$, $\eta_t = 1/\big(k\sqrt{M_t + \|g\|^2_{t_j:t}}\big)$ and regularizers $\psi(w)/\eta_t$. By definition of $L_t$, for $t \in [t_j, t_{j+1} - 2]$ we have $\|g_t\| \le L_{t_j}$. Further, if we take $L = \max\big(L_{t_j}/2,\ \max_{t\in[t_j, t_{j+1}-2]}\|g_t\|\big)$, then $L_{t_j}/2 \le L \le L_{t_j}$, so that $1/(2L) \le p \le 2/L$. Further, we have $\|g_{t_{j+1}-1}\|/L_{t_j} \le \max_t \|g_t\|/L_t$. Thus by Lemma 3 we have

$$R_{t_j:t_{j+1}}(u) \le \frac{\psi(u)}{\eta_{t_{j+1}-1}} + 96\sqrt{M_{t_{j+1}-1} + \|g\|^2_{t_j:t_{j+1}-1}} + L_{\max}\min\left\{W_{\max},\ 4\exp\Big(4\frac{\|g_{t_{j+1}-1}\|}{L}\Big),\ \exp\big(\sqrt{t_{j+1}-t_j}\big)\right\}$$
$$\le (2\psi(u) + 96)\sqrt{M_{\max} + \|g\|^2_{1:T}} + 8L_{\max}\min\left\{\exp\Big(8\max_t\frac{\|g_t\|}{L_t}\Big),\ \exp(\sqrt{T})\right\},$$

where the second line uses $L \ge L_{t_j}/2$ and $\|g_{t_{j+1}-1}\|/L_{t_j} \le \max_t\|g_t\|/L_t$. Summing across epochs, we have

$$R_T(u) \le \sum_{j=1}^n R_{t_j:t_{j+1}}(u) \le n\left[(2\psi(u) + 96)\sqrt{M_{\max} + \|g\|^2_{1:T}} + 8L_{\max}\min\left\{\exp\Big(8\max_t\frac{\|g_t\|}{L_t}\Big),\ \exp(\sqrt{T})\right\}\right].$$

Observe that $n \le \log_2(L_{\max}/L_1) + 1$ to prove the first line of the theorem. The big-$O$ expression follows from the inequality $M_{t_{j+1}} \le 2L_{t_j}\sum_{t=t_j}^{t_{j+1}}\|g_t\| \le 2L_{\max}\sum_{t=1}^T\|g_t\| \le 2L_{\max}^2 T$. $\square$

Our specific choices for $k$ and $p$ are somewhat arbitrary. We suspect (although we do not prove) that the preceding theorems remain true for larger values of $k$ and any $p$ inversely proportional to $L_t$, albeit with differing constants. In Section 4 we perform experiments using the values for $k$, $p$ and $L_t$ described in Algorithm 1. In keeping with the spirit of designing a hyperparameter-free algorithm, no attempt was made to empirically optimize these values at any time.

4 Experiments

4.1 Linear Classification

To validate our theoretical results in practice, we evaluated RescaledExp on 8 classification datasets. The data for each task was pulled from the LIBSVM website [15], and can be found individually in a variety of sources [16, 17, 18, 19, 20, 21, 22]. We use linear classifiers with hinge loss for each task and we compare RescaledExp to five other optimization algorithms: AdaGrad [5], Scale-Invariant [23], PiSTOL [24], Adam [25], and AdaDelta [26]. Each of these algorithms requires tuning of some hyperparameter for unconstrained problems with unknown $L_{\max}$ (usually a scale-factor on a learning rate). In contrast, RescaledExp requires no such tuning.

We evaluate each algorithm by the average loss after one pass through the data, computing a prediction, an error, and an update to model parameters for each example in the dataset. Note that this is not the same as a cross-validated error, but is closer to the notion of regret addressed in our theorems. We plot this average loss versus hyperparameter setting for each dataset in Figures 1 and 2. These data bear out the effectiveness of RescaledExp: while it is not unilaterally the highest performer on all datasets, it shows remarkable robustness across datasets with zero manual tuning.

4.2 Convolutional Neural Networks

We also evaluated RescaledExp on two convolutional neural network models. These models have demonstrated remarkable success in computer vision tasks and are becoming increasingly popular in a variety of areas, but can require significant hyperparameter tuning to train. We consider the MNIST [18] and CIFAR-10 [27] image classification tasks. Our MNIST architecture consisted of two consecutive 5x5 convolution and max-pooling layers followed by a 512-neuron fully-connected layer. Our CIFAR-10 architecture was two consecutive 5x5 convolution and 3x3 max-pooling layers followed by a 384-neuron fully-connected layer and a 192-neuron fully-connected layer.
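A schematic version (ours) of the one-pass evaluation protocol of Section 4.1, using hinge loss on a stream of labeled examples; the dataset arrays are placeholders:

```python
import numpy as np

def one_pass_average_loss(learner, X, y):
    """Single pass over the data: predict, record hinge loss, update."""
    total = 0.0
    for x_t, y_t in zip(X, y):
        w_t = learner.predict()
        margin = y_t * np.dot(w_t, x_t)
        total += max(0.0, 1.0 - margin)                       # hinge loss
        g_t = -y_t * x_t if margin < 1.0 else np.zeros_like(x_t)
        learner.update(g_t)
    return total / len(y)
```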

[Figure 1 shows four panels (covtype, gisette_scale, madelon, mnist) plotting average loss against hyperparameter setting for PiSTOL, Scale-Invariant, Adam, AdaDelta, AdaGrad, and RescaledExp.]

Figure 1: Average loss vs hyperparameter setting for each algorithm across each dataset. RescaledExp has no hyperparameters and so is represented by a flat yellow line. Many of the other algorithms display large sensitivity to hyperparameter setting.

These models are highly non-convex, so none of our theoretical analysis applies. Our use of RescaledExp is motivated by the fact that in practice convex methods are used to train these models. We found that RescaledExp can match the performance of other popular algorithms (see Figure 3). In order to achieve this performance, we made a slight modification to RescaledExp: when we update $L_t$, instead of resetting $w_t$ to zero, we re-center the algorithm about the previous prediction point. We provide no theoretical justification for this modification, but only note that it makes intuitive sense in stochastic optimization problems, where one can reasonably expect that the previous prediction vector is closer to the optimal value than zero.

5 Conclusions

We have presented RescaledExp, an Online Convex Optimization algorithm that achieves regret $\tilde{O}\big(\|u\|\log(\|u\|)L_{\max}\sqrt{T} + \exp(8\max_t\|g_t\|/L_t)\big)$ where $L_{\max} = \max_t\|g_t\|$ is unknown in advance. Since RescaledExp does not use any prior knowledge about the losses or the comparison vector $u$, it is hyperparameter-free and so does not require any tuning of learning rates. We also prove a lower bound showing that any algorithm that addresses the unknown-$L_{\max}$ scenario must suffer an exponential penalty in the regret. We compare RescaledExp to prior optimization algorithms empirically and show that it matches their performance.

While our lower bound matches our regret bound for RescaledExp in terms of $T$, clearly there is much work to be done. For example, when RescaledExp is run on the adversarial loss sequence presented in Theorem 1, its regret matches the lower bound, suggesting that the optimality gap could be improved with superior analysis. We also hope that our lower bound inspires work in algorithms that adapt to non-adversarial properties of the losses to avoid the exponential penalty.

[Figure 2 shows four more panels (ijcnn1, epsilon_normalized, rcv1_train.multiclass, SenseIT Vehicle Combined) plotting average loss against hyperparameter setting for the same algorithms.]

Figure 2: Average loss vs hyperparameter setting, continued from Figure 1.

Figure 3: We compare RescaledExp to Adam, AdaGrad, and stochastic gradient descent (SGD), with learning-rate hyperparameter optimization for the latter three algorithms. All algorithms achieve a final validation accuracy of 99% on MNIST, and 84%, 84%, 83% and 85% respectively on CIFAR-10.

References

[1] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928-936, 2003.
[2] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2011.
[3] Nick Littlestone. From on-line to batch learning. In Proceedings of the Second Annual Workshop on Computational Learning Theory, pages 269-284, 1989.
[4] Nicolo Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050-2057, 2004.
[5] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. In Conference on Learning Theory (COLT), 2010.
[6] H. Brendan McMahan and Matthew Streeter. Adaptive bound optimization for online convex optimization. In Proceedings of the 23rd Annual Conference on Learning Theory (COLT), 2010.
[7] Brendan McMahan and Matthew Streeter. No-regret algorithms for unconstrained online convex optimization. In Advances in Neural Information Processing Systems, pages 2402-2410, 2012.
[8] Francesco Orabona. Dimension-free exponentiated gradient. In Advances in Neural Information Processing Systems, pages 1806-1814, 2013.
[9] Brendan McMahan and Jacob Abernethy. Minimax optimal algorithms for unconstrained linear optimization. In Advances in Neural Information Processing Systems, pages 2724-2732, 2013.
[10] Jacob Abernethy, Peter L. Bartlett, Alexander Rakhlin, and Ambuj Tewari. Optimal strategies and minimax lower bounds for online convex games. In Proceedings of the Nineteenth Annual Conference on Computational Learning Theory, 2008.
[11] Francesco Orabona and Dávid Pál. Scale-free online learning. arXiv preprint arXiv:1601.01974, 2016.
[12] Francesco Orabona and Dávid Pál. Open problem: Parameter-free and scale-free online algorithms. In Conference on Learning Theory, 2016.
[13] Shai Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, The Hebrew University of Jerusalem, 2007.
[14] Thomas Hofmann, Bernhard Schölkopf, and Alexander J. Smola. Kernel methods in machine learning. The Annals of Statistics, pages 1171-1220, 2008.
[15] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
[16] Isabelle Guyon, Steve Gunn, Asa Ben-Hur, and Gideon Dror. Result analysis of the NIPS 2003 feature selection challenge. In Advances in Neural Information Processing Systems, pages 545-552, 2004.
[17] Chih-Chung Chang and Chih-Jen Lin. IJCNN 2001 challenge: Generalization ability and text decoding. In Proceedings of IJCNN. IEEE, 2001.
[18] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
[19] David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. The Journal of Machine Learning Research, 5:361-397, 2004.
[20] Marco F. Duarte and Yu Hen Hu. Vehicle classification in distributed sensor networks. Journal of Parallel and Distributed Computing, 64(7):826-838, 2004.
[21] M. Lichman. UCI machine learning repository, 2013.
[22] Shimon Kogan, Dimitry Levin, Bryan R. Routledge, Jacob S. Sagi, and Noah A. Smith. Predicting risk from financial reports with regression. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 272-280. Association for Computational Linguistics, 2009.
[23] Francesco Orabona, Koby Crammer, and Nicolò Cesa-Bianchi. A generalized online mirror descent with applications to classification and regression. Machine Learning, 99(3):411-435, 2015.
[24] Francesco Orabona. Simultaneous model selection and optimization through parameter-free stochastic learning. In Advances in Neural Information Processing Systems, pages 1116-1124, 2014.
[25] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[26] Matthew D. Zeiler. Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
[27] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images, 2009.
[28] H. Brendan McMahan. A survey of algorithms and analysis for adaptive online learning. arXiv preprint arXiv:1403.3465, 2014.

A Follow-the-Regularized-Leader (FTRL) Regret

Recall that the FTRL algorithm uses the strategy

$$w_{t+1} = \operatorname{argmin}_w\left[\psi_t(w) + \sum_{t'=1}^{t}\ell_{t'}(w)\right],$$

where the functions $\psi_t$ are called regularizers.

Theorem 4. FTRL with regularizers $\psi_t$ and $\psi_0(w) = 0$ obtains regret

$$R_T(u) \le \psi_T(u) + \sum_{t=1}^T\left[\psi_{t-1}(w_{t+1}) - \psi_t(w_{t+1}) + \ell_t(w_t) - \ell_t(w_{t+1})\right]. \qquad (3)$$

Further, if the losses are linear, $\ell_t(w) = g_t\cdot w$, and $\psi_t(w) = \psi(w)/\eta_t$ for some values $\eta_t$ and fixed function $\psi$, then the regret is

$$R_T(u) \le \frac{\psi(u)}{\eta_T} + \sum_{t=1}^T\left[\left(\frac{1}{\eta_{t-1}} - \frac{1}{\eta_t}\right)\psi(w_{t+1}) + g_t\cdot(w_t - w_{t+1})\right]. \qquad (4)$$

Proof. The first part follows from some algebraic manipulations. Since $w_{T+1}$ minimizes $\psi_T(w) + \sum_{t=1}^T\ell_t(w)$, we have

$$-\sum_{t=1}^T\ell_t(u) \le \psi_T(u) - \psi_T(w_{T+1}) - \sum_{t=1}^T\ell_t(w_{T+1}),$$

so that

$$R_T(u) = \sum_{t=1}^T\ell_t(w_t) - \ell_t(u) \le \psi_T(u) - \psi_T(w_{T+1}) + \sum_{t=1}^T\left[\ell_t(w_t) - \ell_t(w_{T+1})\right]$$
$$= \psi_T(u) - \psi_T(w_{T+1}) + \ell_T(w_T) - \ell_T(w_{T+1}) + R_{T-1}(w_{T+1})$$
$$\le \psi_T(u) - \psi_T(w_{T+1}) + \ell_T(w_T) - \ell_T(w_{T+1}) + \psi_{T-1}(w_{T+1}) + \sum_{t=1}^{T-1}\left[\psi_{t-1}(w_{t+1}) - \psi_t(w_{t+1}) + \ell_t(w_t) - \ell_t(w_{t+1})\right]$$
$$= \psi_T(u) + \sum_{t=1}^T\left[\psi_{t-1}(w_{t+1}) - \psi_t(w_{t+1}) + \ell_t(w_t) - \ell_t(w_{t+1})\right],$$

where the third line applies the same argument inductively to $R_{T-1}(w_{T+1})$, and we use $\psi_0(w) = 0$ in the last step.

Now let's specialize to the case of linear losses $\ell_t(w) = g_t\cdot w$ and regularizers of the form $\psi_t(w) = \psi(w)/\eta_t$ for some fixed regularizer $\psi$ and varying scalings $\eta_t$. Plugging this into the previous bound gives

$$R_T(u) \le \frac{\psi(u)}{\eta_T} + \sum_{t=1}^T\left[\left(\frac{1}{\eta_{t-1}} - \frac{1}{\eta_t}\right)\psi(w_{t+1}) + g_t\cdot(w_t - w_{t+1})\right]. \qquad\square$$

While this formulation of the regret of FTRL is sufficient for our needs, our analysis is not tight. We refer the reader to [28] for a stronger FTRL bound that can improve constants in some analyses.

B Proof of Lemma 3

We start off by computing the FTRL updates with regularizers $\psi(w)/\eta_t$:

$$\psi(w) = (\|w\| + 1)\log(\|w\| + 1) - \|w\|, \qquad \nabla\psi(w) = \frac{w}{\|w\|}\log(\|w\| + 1),$$
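The closed-form minimizer computed next can be sanity-checked numerically; the following one-dimensional test (ours, using SciPy) compares it against direct minimization:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def psi(w):
    a = abs(w)
    return (a + 1.0) * np.log(a + 1.0) - a

eta, g_sum = 0.3, 1.7   # arbitrary test values for the scale factor and g_{1:t}
closed_form = -np.sign(g_sum) * (np.exp(eta * abs(g_sum)) - 1.0)
numeric = minimize_scalar(lambda w: psi(w) / eta + g_sum * w,
                          bounds=(-100.0, 100.0), method="bounded",
                          options={"xatol": 1e-10}).x
assert abs(closed_form - numeric) < 1e-6
```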

so that

$$w_{T+1} = \operatorname{argmin}_w\left[\frac{\psi(w)}{\eta_T} + g_{1:T}\cdot w\right] = -\frac{g_{1:T}}{\|g_{1:T}\|}\left(\exp(\eta_T\|g_{1:T}\|) - 1\right).$$

Our goal will be to show that the terms $(1/\eta_{t-1} - 1/\eta_t)\psi(w_{t+1}) + g_t\cdot(w_t - w_{t+1})$ in the sum in (4) are negative. In particular, note that the sequence $\eta_t$ is non-increasing, so that $(1/\eta_{t-1} - 1/\eta_t)\psi(w_{t+1}) \le 0$ for all $t$. Thus our strategy will be to bound $g_t\cdot(w_t - w_{t+1})$.

B.1 Reduction to one dimension

In order to bound $(1/\eta_{t-1} - 1/\eta_t)\psi(w_{t+1}) + g_t\cdot(w_t - w_{t+1})$, we first show that it suffices to consider the case when $g_t$ and $g_{1:t-1}$ are co-linear.

Theorem 5. Let $W$ be a separable inner-product space and suppose (with mild abuse of notation) every loss function $\ell_t : W \to \mathbb{R}$ has some subgradient $\tilde{g}_t \in W^*$ such that $\tilde{g}_t(w) = \langle g_t, w\rangle$ for some $g_t \in W$. Suppose we run an FTRL algorithm with regularizers $\psi(\|w\|)/\eta_t$ on loss functions $\ell_t$, such that

$$w_{t+1} = -\frac{g_{1:t}}{\|g_{1:t}\|}\,f(\eta_t\|g_{1:t}\|)$$

for some function $f$ for all $t$, where $\eta_t = 1/\big(c\sqrt{M_t + \|g\|^2_{1:t}}\big)$ for some constant $c$. Then for any $g_t$ with $\|g_t\| = L$, both $(1/\eta_{t-1} - 1/\eta_t)\psi(\|w_{t+1}\|) + g_t\cdot(w_t - w_{t+1})$ and $g_t\cdot(w_t - w_{t+1})$ are maximized when $g_t$ is a scalar multiple of $g_{1:t-1}$.

Proof. The proof is an application of Lagrange multipliers. Our Lagrangian for $(1/\eta_{t-1} - 1/\eta_t)\psi(\|w_{t+1}\|) + g_t\cdot(w_t - w_{t+1})$ is

$$\mathcal{L} = \left(\frac{1}{\eta_{t-1}} - \frac{1}{\eta_t}\right)\psi\big(f(\eta_t\|g_{1:t}\|)\big) + g_t\cdot\left(w_t + \frac{g_{1:t}}{\|g_{1:t}\|}f(\eta_t\|g_{1:t}\|)\right) + \frac{\lambda}{2}\big(\|g_t\|^2 - L^2\big).$$

Fix a countable orthonormal basis of $W$. For a vector $v \in W$ we let $v_i$ be the projection of $v$ along the $i$th basis vector of our countable orthonormal basis. We denote the derivative of $\mathcal{L}$ with respect to the $i$th coordinate of $g_t$ by $\mathcal{L}_i$. Differentiating, every term of $\mathcal{L}_i$ is a multiple of $g_{t,i}$, of $g_{1:t,i}$, of $w_{t,i}$, of $w_{t+1,i}$, or of $\partial M_t/\partial g_{t,i}$ (the multipliers involve $f$, $f'$, $\eta_t$ and $\eta_{t-1}$ but do not depend on $i$). Collecting terms we can write

$$\mathcal{L}_i = \lambda g_{t,i} + w_{t,i} - w_{t+1,i} + A\,g_{t,i} + B\,g_{1:t,i} + C\,\frac{\partial M_t}{\partial g_{t,i}},$$

where $A$, $B$ and $C$ do not depend on $i$. Since $w_{t,i}$ and $w_{t+1,i}$ are scalar multiples of $g_{1:t-1,i}$ and $g_{1:t,i}$ respectively, we can reassign the variables $A$ and $B$ to write

$$\mathcal{L}_i = A\,g_{t,i} + B\,g_{1:t,i} + C\,\frac{\partial M_t}{\partial g_{t,i}}.$$

Now we compute $\partial M_t/\partial g_{t,i}$. Since $M_t = \max\big(M_{t-1},\ \|g_{1:t}\|/p - \|g\|^2_{1:t}\big)$,

$$\frac{\partial M_t}{\partial g_{t,i}} = \begin{cases} 0 & : M_t = M_{t-1} \\ \dfrac{g_{1:t,i}}{p\,\|g_{1:t}\|} - 2g_{t,i} & : M_t = \|g_{1:t}\|/p - \|g\|^2_{1:t}. \end{cases}$$

Thus, after again reassigning the variables $A$ and $B$, we have

$$\mathcal{L}_i = A\,g_{t,i} + B\,g_{1:t,i}.$$

Therefore we can only have $\nabla\mathcal{L} = 0$ if $g_t$ is a scalar multiple of $g_{1:t}$, and hence of $g_{1:t-1}$, as desired.

For $g_t\cdot(w_t - w_{t+1})$, we apply exactly the same argument. The Lagrangian is

$$\mathcal{L} = g_t\cdot(w_t - w_{t+1}) + \frac{\lambda}{2}\big(\|g_t\|^2 - L^2\big) = g_t\cdot\left(w_t + \frac{g_{1:t}}{\|g_{1:t}\|}f(\eta_t\|g_{1:t}\|)\right) + \frac{\lambda}{2}\big(\|g_t\|^2 - L^2\big),$$

and differentiating as before we again obtain $\mathcal{L}_i = A\,g_{t,i} + B\,g_{1:t,i}$, so that again we are done. $\square$

We make the following intuitive definition:

Definition 6. For any vector $v \in W$, define $\operatorname{sign}(v) = v/\|v\|$ (with $\operatorname{sign}(0) = 0$).

In the next section, we prove bounds on the quantity $(1/\eta_{t-1} - 1/\eta_t)\psi(w_{t+1}) + g_t\cdot(w_t - w_{t+1})$. By Theorem 5 this quantity is maximized when $\operatorname{sign}(g_t) = \pm\operatorname{sign}(g_{1:t-1})$, and so we consider only this case.

B.2 One-dimensional FTRL

In this section we analyze the regret of our FTRL algorithm with the end-goal of proving Lemma 3. We make heavy use of Theorem 5 to allow us to consider only the case $\operatorname{sign}(g_t) = \pm\operatorname{sign}(g_{1:t-1})$. In this setting we may identify the one-dimensional space spanned by $g_t$ and $g_{1:t-1}$ with $\mathbb{R}$. Thus whenever we are operating under the assumption $\operatorname{sign}(g_t) = \pm\operatorname{sign}(g_{1:t-1})$ we will use ordinary multiplication in place of inner products and occasionally assume $g_{1:t} > 0$, as this holds without loss of generality. We feel that this notation and assumption aids intuition in visualizing the following results.

Lemma 7. Suppose $\operatorname{sign}(g_t) = \operatorname{sign}(g_{1:t-1})$. Then

$$\eta_t g_{1:t} - \eta_{t-1} g_{1:t-1} \le \eta_{t-1} g_t. \qquad (5)$$

Suppose instead that $\operatorname{sign}(g_t) = -\operatorname{sign}(g_{1:t-1})$ and also $|g_t| \le L$. Then we still have

$$\eta_t g_{1:t} - \eta_{t-1} g_{1:t-1} \le (1 + 2pL)\,\eta_{t-1}|g_t|. \qquad (6)$$

Proof. First, suppose $\operatorname{sign}(g_t) = \operatorname{sign}(g_{1:t-1})$. Then $\operatorname{sign}(g_{1:t}) = \operatorname{sign}(g_{1:t-1})$. Without loss of generality, assume $g_{1:t} > 0$. Notice that $\eta_t g_{1:t}$ is an increasing function of $g_t$ for $g_t > 0$, because $\eta_t g_{1:t}$ is proportional to either $g_{1:t}$ or $\sqrt{g_{1:t}}$, depending on whether $M_t = M_{t-1}$ or not. Then since $\eta_t \le \eta_{t-1}$ we have

$$\eta_t g_{1:t} - \eta_{t-1} g_{1:t-1} \le \eta_{t-1} g_{1:t} - \eta_{t-1} g_{1:t-1} = \eta_{t-1} g_t,$$

so that (5) holds.

Now suppose $\operatorname{sign}(g_t) = -\operatorname{sign}(g_{1:t-1})$ and $|g_t| \le L$. We consider two cases.

Case 1 ($\eta_t |g_{1:t}| \le \eta_{t-1}|g_{1:t-1}|$): since $\eta_t \le \eta_{t-1}$ and $\operatorname{sign}(g_{1:t-1}) = -\operatorname{sign}(g_t)$, we have

$$\eta_t g_{1:t} - \eta_{t-1}g_{1:t-1} \le \eta_{t-1}\big(|g_{1:t}| - |g_{1:t-1}|\big) \le \eta_{t-1}|g_t|,$$

so that we are done.

Case 2 ($\eta_t |g_{1:t}| \ge \eta_{t-1}|g_{1:t-1}|$): when $|g_t| < |g_{1:t-1}|$ and $\eta_t|g_{1:t}| \ge \eta_{t-1}|g_{1:t-1}|$, the difference $\eta_t|g_{1:t}| - \eta_{t-1}|g_{1:t-1}|$ is a decreasing function of $|g_t|$, because $\eta_{t-1}|g_{1:t-1}|$ is fixed while $|g_{1:t}|$ shrinks as $|g_t|$ grows toward $|g_{1:t-1}|$. Therefore it suffices to consider the case $|g_t| \ge |g_{1:t-1}|$, so that $\operatorname{sign}(g_{1:t}) = -\operatorname{sign}(g_{1:t-1})$ and $|g_{1:t}| \le |g_t|$. Since $|g_{1:t}| \le |g_{1:t-1}|$ in this regime, we have $M_t = M_{t-1}$, so we can expand

$$\eta_t g_{1:t} - \eta_{t-1}g_{1:t-1} = g_t\,\eta_{t-1} + \big(\eta_t g_{1:t} - \eta_{t-1}g_{1:t-1} - g_t\,\eta_{t-1}\big)$$

and bound the parenthesized remainder by $2pL\,\eta_{t-1}|g_t|$, using the identity $\sqrt{X + g_t^2} \le \sqrt{X} + g_t^2/(2\sqrt{X})$ together with $M_{t-1} + \|g\|^2_{1:t-1} \ge \|g_{1:t-1}\|/p$ and $|g_t| \le L$. This yields

$$\eta_t g_{1:t} - \eta_{t-1}g_{1:t-1} \le (1 + 2pL)\,\eta_{t-1}|g_t|. \qquad\square$$

Lemma 8. If $\|w_{T+1}\| \ge \exp\big(\sqrt{pB}/k\big) - 1$, then $\|g_{1:T}\| \ge B$.

Proof. First note that by definition of $M_T$ and $\eta_T$ we have $M_T + \|g\|^2_{1:T} \ge \|g_{1:T}\|/p$, so from some algebra

$$\|w_{T+1}\| + 1 = \exp(\eta_T\|g_{1:T}\|) = \exp\left(\frac{\|g_{1:T}\|}{k\sqrt{M_T + \|g\|^2_{1:T}}}\right) \le \exp\left(\frac{\sqrt{p\,\|g_{1:T}\|}}{k}\right).$$

Taking squares of logs and rearranging now gives $\|g_{1:T}\| \ge \frac{k^2}{p}\log^2(\|w_{T+1}\| + 1)$, which yields the desired inequality. $\square$
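The mechanism behind Lemma 8 (the definition of $M_t$ forces $M_t + \|g\|^2_{1:t} \ge \|g_{1:t}\|/p$, hence $\eta_t\|g_{1:t}\| \le \sqrt{p\,\|g_{1:t}\|}/k$) is easy to probe numerically; this small property test is our own illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
k, p = 2.0, 0.25
M, g_sum, g_sq = 0.0, 0.0, 0.0
for _ in range(1000):
    g = 2.0 * rng.normal()            # arbitrary stream of gradients
    g_sum, g_sq = g_sum + g, g_sq + g * g
    M = max(M, abs(g_sum) / p - g_sq)   # enforces M + g_sq >= |g_sum| / p
    eta = 1.0 / (k * np.sqrt(M + g_sq))
    assert eta * abs(g_sum) <= np.sqrt(p * abs(g_sum)) / k + 1e-9
```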

We have the following immediate corollary:

Corollary 9. Suppose $\operatorname{sign}(g_t) = \pm\operatorname{sign}(g_{1:t-1})$, $|g_t| \le L$, and

$$\|w_t\| \ge \exp\left(\frac{\sqrt{2pL}}{k}\right) - 1.$$

Then $\operatorname{sign}(g_{1:t}) = \operatorname{sign}(g_{1:t-1})$. (Indeed, by Lemma 8 the hypothesis implies $|g_{1:t-1}| \ge 2L > |g_t|$, so adding $g_t$ cannot flip the sign of the sum.)

Now we begin analysis of the sum term in (4).

Lemma 10. Suppose $\operatorname{sign}(g_{1:t}) = \operatorname{sign}(g_{1:t-1})$ and $|g_t| \le L$. Then

$$|w_t - w_{t+1}| \le |g_t|\,\eta_{t-1}\,(\|w_{t+1}\| + 1)(1 + 2pL)\exp\!\left[|g_t|\,\eta_{t-1}(1 + 2pL)\right].$$

Proof. Since $\operatorname{sign}(g_{1:t}) = \operatorname{sign}(g_{1:t-1})$, we have

$$|w_t - w_{t+1}| = \left|\exp(\eta_{t-1}|g_{1:t-1}|) - \exp(\eta_t|g_{1:t}|)\right| = (\|w_{t+1}\| + 1)\left|\exp\!\big(\eta_{t-1}|g_{1:t-1}| - \eta_t|g_{1:t}|\big) - 1\right|,$$

where the last equality uses the definition of $w_{t+1}$ to observe that $\|w_{t+1}\| + 1 = \exp(\eta_t|g_{1:t}|)$. Now we consider two cases, depending on whether $\eta_{t-1}|g_{1:t-1}| < \eta_t|g_{1:t}|$ or not.

Case 1 ($\eta_{t-1}|g_{1:t-1}| < \eta_t|g_{1:t}|$): since $1 - \exp(-x) \le x$, Lemma 7 gives

$$|w_t - w_{t+1}| \le (\|w_{t+1}\| + 1)\big(\eta_t|g_{1:t}| - \eta_{t-1}|g_{1:t-1}|\big) \le (\|w_{t+1}\| + 1)(1 + 2pL)\,\eta_{t-1}|g_t|,$$

so that the lemma holds.

Case 2 ($\eta_{t-1}|g_{1:t-1}| \ge \eta_t|g_{1:t}|$): since $\exp(x) - 1 \le x\exp(x)$, a symmetric application of Lemma 7 gives

$$|w_t - w_{t+1}| \le (\|w_{t+1}\| + 1)\big(\eta_{t-1}|g_{1:t-1}| - \eta_t|g_{1:t}|\big)\exp\!\big(\eta_{t-1}|g_{1:t-1}| - \eta_t|g_{1:t}|\big) \le (\|w_{t+1}\| + 1)(1 + 2pL)\,\eta_{t-1}|g_t|\exp\!\left[(1 + 2pL)\,\eta_{t-1}|g_t|\right],$$

so that the lemma still holds. $\square$

The next lemma is the main workhorse of our regret bounds:

Lemma 11. Suppose $|g_t| \le L$ and either of the following holds:

1. $p \le 1/L$, $k = 2$, and $\|w_t\| \ge 5$; or
2. $k = 2$ and $\|w_t\| \ge 4\exp(2pL)$.

Then

$$\left(\frac{1}{\eta_{t-1}} - \frac{1}{\eta_t}\right)\psi(w_{t+1}) + g_t(w_t - w_{t+1}) \le 0. \qquad (7)$$

Further, inequality (7) holds for any $k$ and sufficiently large $L$ if $\|w_t\| \ge \exp(pL)$.

Proof. By Theorem 5 it suffices to consider the case $\operatorname{sign}(g_t) = \pm\operatorname{sign}(g_{1:t-1})$, so that we may adopt our identification with $\mathbb{R}$ throughout this proof. For $p \le 1/L$ and $k = 2$ we have $5 \ge \exp(\sqrt{2pL}/k)$, and for sufficiently large $L$, $\exp(pL) \ge \exp(\sqrt{2pL}/k)$. Therefore in all cases $\|w_t\| \ge \exp(\sqrt{2pL}/k) - 1$, so that by Corollary 9 and Lemma 10 we have

$$g_t(w_t - w_{t+1}) \le \eta_{t-1}g_t^2\,(\|w_{t+1}\| + 1)(1 + 2pL)\exp\!\left[\eta_{t-1}|g_t|(1 + 2pL)\right]. \qquad (8)$$

First, we prove that (7) is guaranteed if the following holds:

$$(\|w_{t+1}\| + 1)(1 + 2pL)\exp\!\left[\eta_{t-1}|g_t|(1 + 2pL)\right] \le \frac{k^2}{2}\,\frac{\eta_t}{\eta_{t-1}}\,\psi(w_{t+1}). \qquad (9)$$

Using $\psi(w_{t+1}) = (\|w_{t+1}\| + 1)\log(\|w_{t+1}\| + 1) - \|w_{t+1}\| \ge (\|w_{t+1}\| + 1)\big(\log(\|w_{t+1}\| + 1) - 1\big)$ and dividing through by $\|w_{t+1}\| + 1$, the previous line (9) reduces to a condition on $\log(\|w_{t+1}\| + 1)$ alone:

$$(1 + 2pL)\exp\!\left[\eta_{t-1}|g_t|(1 + 2pL)\right] \le \frac{k^2}{2}\,\frac{\eta_t}{\eta_{t-1}}\,\big(\log(\|w_{t+1}\| + 1) - 1\big). \qquad (10)$$

Multiplying (10) by $\eta_{t-1}g_t^2$ we have

$$\eta_{t-1}g_t^2\,(\|w_{t+1}\| + 1)(1 + 2pL)\exp\!\left[\eta_{t-1}|g_t|(1 + 2pL)\right] \le \frac{k^2}{2}\,\eta_t g_t^2\,\psi(w_{t+1}). \qquad (11)$$

Combining (8) and (11), we see that (9) implies

$$g_t(w_t - w_{t+1}) \le \frac{k^2}{2}\,\eta_t g_t^2\,\psi(w_{t+1}).$$

Now we bound $(1/\eta_{t-1} - 1/\eta_t)\psi(w_{t+1})$:

$$\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} = k\left[\sqrt{M_t + \|g\|^2_{1:t}} - \sqrt{M_{t-1} + \|g\|^2_{1:t-1}}\right] \ge k\,\frac{g_t^2 + M_t - M_{t-1}}{2\sqrt{M_t + \|g\|^2_{1:t}}} \ge k\,\frac{g_t^2}{2\sqrt{M_t + \|g\|^2_{1:t}}} = \frac{k^2}{2}\,\eta_t g_t^2.$$

Thus when (9) holds we have

$$\left(\frac{1}{\eta_{t-1}} - \frac{1}{\eta_t}\right)\psi(w_{t+1}) + g_t(w_t - w_{t+1}) \le -\frac{k^2}{2}\,\eta_t g_t^2\,\psi(w_{t+1}) + \frac{k^2}{2}\,\eta_t g_t^2\,\psi(w_{t+1}) \le 0.$$

Therefore our objective is to show that our conditions on $w_t$ imply the condition (9) on $w_{t+1}$. First, we bound $\eta_{t-1}|g_t|$ in terms of $w_t$. Notice that

$$\|w_t\| + 1 = \exp(\eta_{t-1}\|g_{1:t-1}\|) \le \exp\left(\frac{\sqrt{p\,\|g_{1:t-1}\|}}{k}\right), \quad\text{so}\quad \|g_{1:t-1}\| \ge \frac{k^2}{p}\log^2(\|w_t\| + 1);$$

combining this with the definition of $\eta_{t-1}$ and $M_{t-1} + \|g\|^2_{1:t-1} \ge \|g_{1:t-1}\|/p$,

so that we can conclude

$$\eta_{t-1}|g_t| \le \frac{pL}{k^2\log(\|w_t\| + 1)}. \qquad (12)$$

Further, by Lemma 7 we have $\|w_t\| + 1 = \exp(\eta_{t-1}g_{1:t-1}) \ge \exp\!\big(\eta_t g_{1:t} - (1 + 2pL)\eta_{t-1}|g_t|\big)$, and therefore

$$\|w_{t+1}\| + 1 \le (\|w_t\| + 1)\exp\!\left[(1 + 2pL)\,\eta_{t-1}|g_t|\right]. \qquad (13)$$

From (13), we see that (9) is guaranteed if we have

$$(1 + 2pL)\exp\!\left[2(1 + 2pL)\,\eta_{t-1}|g_t|\right] \le \frac{k^2}{2}\,\frac{\eta_t}{\eta_{t-1}}\,\big(\log(\|w_t\| + 1) - 1\big). \qquad (14)$$

If we use our expression (12) and assume $\|w_t\| \ge \exp(pL)$, we see that the right-hand side of (14) grows at least linearly in $pL$ while the left-hand side remains bounded, and so (14) holds for sufficiently large $L$. For $p \le 1/L$, $k = 2$, and $\|w_t\| \ge 5$ we can verify (14) numerically by plugging in the bound (12). For the case $k = 2$, $\|w_t\| \ge 4\exp(2pL)$, we notice that by using (12) we can write (14) entirely in terms of $pL$; graphing both sides numerically as functions of $pL$ then allows us to verify the condition. $\square$

We have one final lemma we need before we can start stating some real regret bounds. This lemma can be viewed as observing that $\psi(w)$ is roughly strongly-convex for $\|w\|$ not much bigger than $D$.

Lemma 12. Suppose $p \le 1/L$, $k = 2$, $\|w_t\| \le D$ and $|g_t| \le L$. Then

$$g_t(w_t - w_{t+1}) \le 16\max\big(D + 1,\ e^{1/2}\big)\,g_t^2\,\eta_{t-1}.$$

Proof. By Theorem 5 it suffices to consider $\operatorname{sign}(g_t) = \pm\operatorname{sign}(g_{1:t-1})$. We show that $|w_t - w_{t+1}| \le 16\max(D + 1, e^{1/2})\,|g_t|\,\eta_{t-1}$, so that the result follows by multiplying by $|g_t|$. From Lemma 7, we have $|\eta_t g_{1:t} - \eta_{t-1}g_{1:t-1}| \le (1 + 2pL)\,\eta_{t-1}|g_t| \le 3\eta_{t-1}|g_t|$. Further, note that $\eta_{t-1}|g_t| \le 1/k = 1/2$. We consider two cases, depending on whether $\operatorname{sign}(g_{1:t}) = \operatorname{sign}(g_{1:t-1})$ or not.

Case 1 ($\operatorname{sign}(g_{1:t}) = \operatorname{sign}(g_{1:t-1})$):

$$|w_t - w_{t+1}| = \left|\exp(\eta_{t-1}|g_{1:t-1}|) - \exp(\eta_t|g_{1:t}|)\right| \le (D + 1)\,3\eta_{t-1}|g_t|\exp\!\big(3\eta_{t-1}|g_t|\big) \le (D + 1)\,3e^{3/2}\,\eta_{t-1}|g_t| \le 16(D + 1)\,\eta_{t-1}|g_t|.$$

Case 2 ($\operatorname{sign}(g_{1:t}) \ne \operatorname{sign}(g_{1:t-1})$): in this case, we must have $|g_{1:t-1}| \le |g_t|$. Let $X = \max\big(\eta_t|g_{1:t}|,\ \eta_{t-1}|g_{1:t-1}|\big)$. Then by the triangle inequality we have

$$|w_t - w_{t+1}| \le \max(\|w_t\|, \|w_{t+1}\|) + \min(\|w_t\|, \|w_{t+1}\|) \le 2\big(\exp(X) - 1\big) \le 2X\exp(X).$$

Since $|\eta_t g_{1:t} - \eta_{t-1}g_{1:t-1}| \le 3\eta_{t-1}|g_t|$ and $|g_{1:t-1}| \le |g_t|$, we have $X \le 3\eta_{t-1}|g_t|$, so that

$$|w_t - w_{t+1}| \le 6\,\eta_{t-1}|g_t|\,\max\big(\|w_t\|,\ \|w_{t+1}\| + 1\big).$$

Finally, in this case we also have $\|w_{t+1}\| + 1 = \exp(\eta_t|g_{1:t}|) \le \exp(\eta_{t-1}|g_t|) \le \exp(1/2)$, so that

$$|w_t - w_{t+1}| \le 6\,\eta_{t-1}|g_t|\,\max\big(\|w_t\|,\ \|w_{t+1}\| + 1\big) \le 16\max\big(D + 1,\ e^{1/2}\big)\,\eta_{t-1}|g_t|. \qquad\square$$

Now we are finally in a position to prove Lemma 3, which we restate below:

Lemma 3. Set $k = 2$. Suppose $\|g_t\| \le L$ for $t < T$, $1/(2L) \le p \le 2/L$, $\|g_T\| \le L_{\max}$ and $L_{\max} \ge L$. Let $W_{\max} = \max_{t\in[1,T]}\|w_t\|$. Then the regret of FTRL with regularizers $\psi_t(w) = \psi(w)/\eta_t$ is:

$$R_T(u) \le \frac{\psi(u)}{\eta_T} + 96\sqrt{M_{T-1} + \|g\|^2_{1:T-1}} + L_{\max}\min\left\{W_{\max},\ 4\exp\Big(\frac{4L_{\max}}{L}\Big),\ \exp(\sqrt{T})\right\}$$
$$\le (2\psi(u) + 96)\sqrt{2L\sum_{t=1}^{T-1}\|g_t\| + L_{\max}^2} + 8L_{\max}\min\left\{\exp\Big(\frac{4L_{\max}}{L}\Big),\ \exp(\sqrt{T})\right\}$$
$$\le 2L_{\max}\big[(\|u\| + 1)\log(\|u\| + 1) + 96\big]\sqrt{2T} + 8L_{\max}\min\left\{e^{4L_{\max}/L},\ e^{\sqrt{T}}\right\}.$$

Proof of Lemma 3. We combine Lemma 11 with Lemma 12: if $\|w_t\| \ge 5$ we have, for all $t < T$,

$$\left(\frac{1}{\eta_{t-1}} - \frac{1}{\eta_t}\right)\psi(w_{t+1}) + g_t(w_t - w_{t+1}) < 0,$$

and if $\|w_t\| \le 5$ we have

$$\left(\frac{1}{\eta_{t-1}} - \frac{1}{\eta_t}\right)\psi(w_{t+1}) + g_t(w_t - w_{t+1}) \le g_t(w_t - w_{t+1}) \le 16\max\big(6, e^{1/2}\big)\,\eta_{t-1}g_t^2 = 96\,\eta_{t-1}g_t^2.$$

Therefore for all $t < T$ we have $(1/\eta_{t-1} - 1/\eta_t)\psi(w_{t+1}) + g_t(w_t - w_{t+1}) \le 96\,\eta_{t-1}g_t^2$, and so

$$R_T(u) \le \frac{\psi(u)}{\eta_T} + \sum_{t=1}^{T-1}96\,\eta_{t-1}g_t^2 + \left(\frac{1}{\eta_{T-1}} - \frac{1}{\eta_T}\right)\psi(w_{T+1}) + g_T(w_T - w_{T+1}).$$

We have $(1/\eta_{T-1} - 1/\eta_T)\psi(w_{T+1}) \le 0$, so that

$$\left(\frac{1}{\eta_{T-1}} - \frac{1}{\eta_T}\right)\psi(w_{T+1}) + g_T(w_T - w_{T+1}) \le L_{\max}W_{\max}.$$

Further, again using Lemma 11, we have $(1/\eta_{T-1} - 1/\eta_T)\psi(w_{T+1}) + g_T(w_T - w_{T+1}) < 0$ for $\|w_T\| \ge 4\exp(2pL_{\max})$, since $k = 2$; with $p \le 2/L$ this threshold is at most $4\exp(4L_{\max}/L)$. Finally, notice that by definition of $\eta_t$ and $L$, we must have $\eta_t\|g_{1:t}\| \le \sqrt{p\|g_{1:t}\|}/k \le \sqrt{T}$, so that $\|w_t\| \le \exp(\sqrt{T})$. Thus

$$\left(\frac{1}{\eta_{T-1}} - \frac{1}{\eta_T}\right)\psi(w_{T+1}) + g_T(w_T - w_{T+1}) \le L_{\max}\min\left\{W_{\max},\ 4\exp\Big(\frac{4L_{\max}}{L}\Big),\ \exp(\sqrt{T})\right\}.$$

Now we make the following classic argument:

$$\sqrt{M_t + \|g\|^2_{1:t}} - \sqrt{M_{t-1} + \|g\|^2_{1:t-1}} \ge \frac{g_t^2 + M_t - M_{t-1}}{2\sqrt{M_t + \|g\|^2_{1:t}}} \ge \frac{g_t^2}{2\sqrt{M_t + \|g\|^2_{1:t}}} = \frac{k}{2}\,\eta_t g_t^2,$$

so that, summing the telescoping left-hand side, we can bound

$$R_T(u) \le \frac{\psi(u)}{\eta_T} + 96\sum_{t=1}^{T-1}\eta_t g_t^2 + \left(\frac{1}{\eta_{T-1}} - \frac{1}{\eta_T}\right)\psi(w_{T+1}) + g_T(w_T - w_{T+1})$$
$$\le \frac{\psi(u)}{\eta_T} + 96\sqrt{M_{T-1} + \|g\|^2_{1:T-1}} + L_{\max}\min\left\{W_{\max},\ 4\exp\Big(\frac{4L_{\max}}{L}\Big),\ \exp(\sqrt{T})\right\}.$$

To show the remaining two lines of the lemma, we prove by induction that $M_t + \|g\|^2_{1:t} \le 2L\sum_{t'=1}^{t}\|g_{t'}\|$ for all $t < T$. The statement is clearly true for $t = 1$. Suppose it holds for some $t$. Then notice that $\|g_{1:t+1}\| \le \|g_{t+1}\| + \|g_{1:t}\|$. So we have

$$M_{t+1} + \|g\|^2_{1:t+1} = \max\left(M_t + \|g\|^2_{1:t+1},\ \frac{\|g_{1:t+1}\|}{p}\right) \le \max\left(M_t + \|g\|^2_{1:t} + L\|g_{t+1}\|,\ 2L\|g_{1:t+1}\|\right) \le 2L\sum_{t'=1}^{t+1}\|g_{t'}\|.$$

Finally, we observe that

$$M_T + \|g\|^2_{1:T} \le \max\left(M_{T-1} + \|g\|^2_{1:T-1} + \|g_T\|^2,\ \frac{\|g_{1:T}\|}{p}\right) \le L_{\max}^2 + 2L\sum_{t=1}^{T-1}\|g_t\|,$$

and the last two lines of the lemma follow immediately. $\square$

C Additional Experimental Details

C.1 Hyperparameter Optimization

For the linear classification tasks, we optimized hyperparameters in a two-step process. First, we tested every power of 10 from $10^{-5}$ to $10^{2}$. Second, if $\lambda$ was the best hyperparameter setting in step 1, we additionally tested $\beta\lambda$ for $\beta \in \{0.2, 0.4, 0.8, 2.0, 4.0, 6.0, 8.0\}$.

For the neural network models, we optimized Adam's and AdaGrad's learning rates by testing every power of 10 from $10^{-5}$ to $10^{0}$. For stochastic gradient descent, we used the exponentially decaying learning rate schedule specified in TensorFlow's MNIST and CIFAR-10 example code.

C.2 Coordinate-wise updates

We proved all our results in arbitrarily many dimensions, leading to a dimension-independent regret bound. However, it is also possible to achieve dimension-dependent bounds by running an independent version of our algorithm on each coordinate. Formally, for OLO we have

$$R_T(u) = \sum_{t=1}^T g_t\cdot(w_t - u) = \sum_{i=1}^d\sum_{t=1}^T g_{t,i}(w_{t,i} - u_i) = \sum_{i=1}^d R_T^i(u_i),$$

where $R_T^i$ is the regret of a one-dimensional instance of the algorithm. This reduction can yield substantially better regret bounds when the gradients $g_t$ are known to be sparse, but can be much worse when they are not; a sketch of the wrapper appears below.

We use this coordinate-wise update strategy in our linear classification experiments for RescaledExp. We also considered coordinate-wise and non-coordinate-wise updates for the other algorithms, taking the better-performing of the two. For all algorithms in the linear classification experiments, we found that the difference between coordinate-wise and non-coordinate-wise updates was not very striking. However, for the neural network experiments we found RescaledExp performed extremely poorly when using coordinate-wise updates, and performed extremely well with non-coordinate-wise updates. We hypothesize that this is due to a combination of the non-convexity of the model and frequent resets occurring at different times for each coordinate.
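A sketch (ours) of the coordinate-wise wrapper described above, which runs one one-dimensional learner per coordinate:

```python
import numpy as np

class CoordinateWise:
    """Run an independent 1-D copy of a learner per coordinate, so that
    R_T(u) = sum_i R_T^i(u_i)."""

    def __init__(self, dim, make_learner):
        self.learners = [make_learner(1) for _ in range(dim)]

    def predict(self):
        return np.array([l.predict()[0] for l in self.learners])

    def update(self, g):
        for learner, g_i in zip(self.learners, g):
            learner.update(np.array([g_i]))

# e.g. CoordinateWise(dim=10, make_learner=RescaledExp), reusing the sketch above.
```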

[Table 1 lists the average normalized loss for AdaGrad, RescaledExp, AdaDelta, Scale-Invariant, Adam, and PiSTOL under each algorithm's best hyperparameter setting.]

Table 1: Average normalized loss, using the best hyperparameter setting for each algorithm.

C.3 Re-centering RescaledExp

For the non-convex neural network tasks we used a variant of RescaledExp in which we re-center our FTRL algorithm at the beginning of each epoch. Formally, the pseudo-code is provided below:

Algorithm 2 Re-centered RescaledExp
  Initialize: $k \leftarrow 2$, $M_0 \leftarrow 0$, $w_1 \leftarrow 0$, $t_1 \leftarrow 1$, $\bar{w} \leftarrow 0$
  for $t = 1$ to $T$ do
    Play $w_t$, receive subgradient $g_t \in \partial\ell_t(w_t)$.
    if $t = 1$ then
      $L_1 \leftarrow 2\|g_1\|$
      $p \leftarrow 1/L_1$
    end if
    $M_t \leftarrow \max\big(M_{t-1}, \|g_{t_1:t}\|/p - \|g\|^2_{t_1:t}\big)$
    $\eta_t \leftarrow 1/\big(k\sqrt{M_t + \|g\|^2_{t_1:t}}\big)$
    $w_{t+1} \leftarrow \bar{w} + \operatorname{argmin}_w\left[\dfrac{\psi(w)}{\eta_t} + g_{t_1:t}\cdot w\right] = \bar{w} - \dfrac{g_{t_1:t}}{\|g_{t_1:t}\|}\left[\exp(\eta_t\|g_{t_1:t}\|) - 1\right]$
    if $\|g_t\| > L_t$ then
      $L_{t+1} \leftarrow 2\|g_t\|$
      $p \leftarrow 1/L_{t+1}$
      $t_1 \leftarrow t + 1$
      $M_t \leftarrow 0$
      $\bar{w} \leftarrow w_t$
      $w_{t+1} \leftarrow \bar{w}$
    else
      $L_{t+1} \leftarrow L_t$
    end if
  end for

So long as $\|\bar{w} - u\| \le \|u\|$, this algorithm maintains the same regret bound as the non-re-centered version of RescaledExp. While it is intuitively reasonable to expect this to occur in a stochastic setting, an adversary can easily subvert this algorithm.

C.4 Aggregating Studies

It is difficult to interpret the results of a study such as our linear classification experiments (see Section 4) in which no particular algorithm is always the winner for every dataset. In particular, consider the case of an analyst who wishes to run one of these algorithms on some new dataset, and doesn't have either the resources or the inclination to implement and tune each algorithm. Which should she choose?

We suggest the following heuristic: pick the algorithm with the lowest loss averaged across datasets. This heuristic is problematic because datasets on which all algorithms do very poorly will dominate the cross-dataset average. In order to address this issue and compare losses across datasets properly, we compute a normalized loss for each algorithm and dataset. The normalized loss for an algorithm on a dataset is given by taking the loss experienced by the algorithm with its best hyperparameter setting on that dataset, divided by the lowest loss observed by any algorithm and hyperparameter setting on that dataset. Thus a normalized loss of 1 on a dataset indicates that an algorithm outperformed all other algorithms on the dataset (at least for its best hyperparameter setting). We then average the normalized loss for each algorithm across datasets to obtain the scores for each algorithm (see Table 1). These data indicate that while AdaGrad has a slight edge after tuning, RescaledExp and AdaDelta do nearly equivalently well (4% and 6% worse performance, respectively). Therefore we suggest that if our intrepid analyst is willing to perform some hyperparameter tuning, then AdaGrad may be slightly better, but her choice doesn't matter too much. On the other hand, using RescaledExp will allow her to skip any tuning step without compromising performance.
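The normalization heuristic of Section C.4 is straightforward to implement; this sketch (ours) computes Table 1-style scores from a table of best-hyperparameter losses:

```python
import numpy as np

def normalized_scores(losses):
    """losses[alg][dataset] -> best-hyperparameter loss for that pair.

    Each algorithm's loss on a dataset is divided by the best loss any
    algorithm achieved there; scores are then averaged across datasets
    (1.0 means best on every dataset)."""
    algs = list(losses)
    datasets = list(next(iter(losses.values())))
    best = {d: min(losses[a][d] for a in algs) for d in datasets}
    return {a: float(np.mean([losses[a][d] / best[d] for d in datasets])) for a in algs}
```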


More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

Minimizing Adaptive Regret with One Gradient per Iteration

Minimizing Adaptive Regret with One Gradient per Iteration Minimizing Adaptive Regret with One Gradient per Iteration Guanghui Wang, Dakuan Zhao, Lijun Zhang National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China wanggh@lamda.nju.edu.cn,

More information

Black-Box Reductions for Parameter-free Online Learning in Banach Spaces

Black-Box Reductions for Parameter-free Online Learning in Banach Spaces Proceedings of Machine Learning Research vol 75:1 37, 2018 31st Annual Conference on Learning Theory Black-Box Reductions for Parameter-free Online Learning in Banach Spaces Ashok Cutkosky Stanford University

More information

Foundations of Machine Learning On-Line Learning. Mehryar Mohri Courant Institute and Google Research

Foundations of Machine Learning On-Line Learning. Mehryar Mohri Courant Institute and Google Research Foundations of Machine Learning On-Line Learning Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Motivation PAC learning: distribution fixed over time (training and test). IID assumption.

More information

The Online Approach to Machine Learning

The Online Approach to Machine Learning The Online Approach to Machine Learning Nicolò Cesa-Bianchi Università degli Studi di Milano N. Cesa-Bianchi (UNIMI) Online Approach to ML 1 / 53 Summary 1 My beautiful regret 2 A supposedly fun game I

More information

Optimal and Adaptive Online Learning

Optimal and Adaptive Online Learning Optimal and Adaptive Online Learning Haipeng Luo Advisor: Robert Schapire Computer Science Department Princeton University Examples of Online Learning (a) Spam detection 2 / 34 Examples of Online Learning

More information

Deep Neural Networks (3) Computational Graphs, Learning Algorithms, Initialisation

Deep Neural Networks (3) Computational Graphs, Learning Algorithms, Initialisation Deep Neural Networks (3) Computational Graphs, Learning Algorithms, Initialisation Steve Renals Machine Learning Practical MLP Lecture 5 16 October 2018 MLP Lecture 5 / 16 October 2018 Deep Neural Networks

More information

Lecture 16: Perceptron and Exponential Weights Algorithm

Lecture 16: Perceptron and Exponential Weights Algorithm EECS 598-005: Theoretical Foundations of Machine Learning Fall 2015 Lecture 16: Perceptron and Exponential Weights Algorithm Lecturer: Jacob Abernethy Scribes: Yue Wang, Editors: Weiqing Yu and Andrew

More information

Online and Stochastic Learning with a Human Cognitive Bias

Online and Stochastic Learning with a Human Cognitive Bias Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence Online and Stochastic Learning with a Human Cognitive Bias Hidekazu Oiwa The University of Tokyo and JSPS Research Fellow 7-3-1

More information

Lecture: Adaptive Filtering

Lecture: Adaptive Filtering ECE 830 Spring 2013 Statistical Signal Processing instructors: K. Jamieson and R. Nowak Lecture: Adaptive Filtering Adaptive filters are commonly used for online filtering of signals. The goal is to estimate

More information

Accelerating Stochastic Optimization

Accelerating Stochastic Optimization Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz

More information

The FTRL Algorithm with Strongly Convex Regularizers

The FTRL Algorithm with Strongly Convex Regularizers CSE599s, Spring 202, Online Learning Lecture 8-04/9/202 The FTRL Algorithm with Strongly Convex Regularizers Lecturer: Brandan McMahan Scribe: Tamara Bonaci Introduction In the last lecture, we talked

More information

OLSO. Online Learning and Stochastic Optimization. Yoram Singer August 10, Google Research

OLSO. Online Learning and Stochastic Optimization. Yoram Singer August 10, Google Research OLSO Online Learning and Stochastic Optimization Yoram Singer August 10, 2016 Google Research References Introduction to Online Convex Optimization, Elad Hazan, Princeton University Online Learning and

More information

The Interplay Between Stability and Regret in Online Learning

The Interplay Between Stability and Regret in Online Learning The Interplay Between Stability and Regret in Online Learning Ankan Saha Department of Computer Science University of Chicago ankans@cs.uchicago.edu Prateek Jain Microsoft Research India prajain@microsoft.com

More information

Day 3 Lecture 3. Optimizing deep networks

Day 3 Lecture 3. Optimizing deep networks Day 3 Lecture 3 Optimizing deep networks Convex optimization A function is convex if for all α [0,1]: f(x) Tangent line Examples Quadratics 2-norms Properties Local minimum is global minimum x Gradient

More information

Introduction to Machine Learning (67577) Lecture 7

Introduction to Machine Learning (67577) Lecture 7 Introduction to Machine Learning (67577) Lecture 7 Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem Solving Convex Problems using SGD and RLM Shai Shalev-Shwartz (Hebrew

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

Online Learning: Random Averages, Combinatorial Parameters, and Learnability

Online Learning: Random Averages, Combinatorial Parameters, and Learnability Online Learning: Random Averages, Combinatorial Parameters, and Learnability Alexander Rakhlin Department of Statistics University of Pennsylvania Karthik Sridharan Toyota Technological Institute at Chicago

More information

Importance Reweighting Using Adversarial-Collaborative Training

Importance Reweighting Using Adversarial-Collaborative Training Importance Reweighting Using Adversarial-Collaborative Training Yifan Wu yw4@andrew.cmu.edu Tianshu Ren tren@andrew.cmu.edu Lidan Mu lmu@andrew.cmu.edu Abstract We consider the problem of reweighting a

More information

Online Learning and Sequential Decision Making

Online Learning and Sequential Decision Making Online Learning and Sequential Decision Making Emilie Kaufmann CNRS & CRIStAL, Inria SequeL, emilie.kaufmann@univ-lille.fr Research School, ENS Lyon, Novembre 12-13th 2018 Emilie Kaufmann Online Learning

More information

Deep Feedforward Networks

Deep Feedforward Networks Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3

More information

Convex Repeated Games and Fenchel Duality

Convex Repeated Games and Fenchel Duality Convex Repeated Games and Fenchel Duality Shai Shalev-Shwartz 1 and Yoram Singer 1,2 1 School of Computer Sci. & Eng., he Hebrew University, Jerusalem 91904, Israel 2 Google Inc. 1600 Amphitheater Parkway,

More information

A survey: The convex optimization approach to regret minimization

A survey: The convex optimization approach to regret minimization A survey: The convex optimization approach to regret minimization Elad Hazan September 10, 2009 WORKING DRAFT Abstract A well studied and general setting for prediction and decision making is regret minimization

More information

Support Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017

Support Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017 Support Vector Machines: Training with Stochastic Gradient Descent Machine Learning Fall 2017 1 Support vector machines Training by maximizing margin The SVM objective Solving the SVM optimization problem

More information

A Low Complexity Algorithm with O( T ) Regret and Finite Constraint Violations for Online Convex Optimization with Long Term Constraints

A Low Complexity Algorithm with O( T ) Regret and Finite Constraint Violations for Online Convex Optimization with Long Term Constraints A Low Complexity Algorithm with O( T ) Regret and Finite Constraint Violations for Online Convex Optimization with Long Term Constraints Hao Yu and Michael J. Neely Department of Electrical Engineering

More information

Pairwise Away Steps for the Frank-Wolfe Algorithm

Pairwise Away Steps for the Frank-Wolfe Algorithm Pairwise Away Steps for the Frank-Wolfe Algorithm Héctor Allende Department of Informatics Universidad Federico Santa María, Chile hallende@inf.utfsm.cl Ricardo Ñanculef Department of Informatics Universidad

More information

CS261: A Second Course in Algorithms Lecture #11: Online Learning and the Multiplicative Weights Algorithm

CS261: A Second Course in Algorithms Lecture #11: Online Learning and the Multiplicative Weights Algorithm CS61: A Second Course in Algorithms Lecture #11: Online Learning and the Multiplicative Weights Algorithm Tim Roughgarden February 9, 016 1 Online Algorithms This lecture begins the third module of the

More information

The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees Ofer Dekel Shai Shalev-Shwartz Yoram Singer School of Computer Science & Engineering The Hebrew University, Jerusalem 91904,

More information

Online Sparse Passive Aggressive Learning with Kernels

Online Sparse Passive Aggressive Learning with Kernels Online Sparse Passive Aggressive Learning with Kernels Jing Lu Peilin Zhao Steven C.H. Hoi Abstract Conventional online kernel methods often yield an unbounded large number of support vectors, making them

More information

EASINESS IN BANDITS. Gergely Neu. Pompeu Fabra University

EASINESS IN BANDITS. Gergely Neu. Pompeu Fabra University EASINESS IN BANDITS Gergely Neu Pompeu Fabra University EASINESS IN BANDITS Gergely Neu Pompeu Fabra University THE BANDIT PROBLEM Play for T rounds attempting to maximize rewards THE BANDIT PROBLEM Play

More information

Summary and discussion of: Dropout Training as Adaptive Regularization

Summary and discussion of: Dropout Training as Adaptive Regularization Summary and discussion of: Dropout Training as Adaptive Regularization Statistics Journal Club, 36-825 Kirstin Early and Calvin Murdock November 21, 2014 1 Introduction Multi-layered (i.e. deep) artificial

More information

Composite Objective Mirror Descent

Composite Objective Mirror Descent Composite Objective Mirror Descent John C. Duchi 1,3 Shai Shalev-Shwartz 2 Yoram Singer 3 Ambuj Tewari 4 1 University of California, Berkeley 2 Hebrew University of Jerusalem, Israel 3 Google Research

More information

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6 Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Neural Networks Week #6 Today Neural Networks A. Modeling B. Fitting C. Deep neural networks Today s material is (adapted)

More information

SADAGRAD: Strongly Adaptive Stochastic Gradient Methods

SADAGRAD: Strongly Adaptive Stochastic Gradient Methods Zaiyi Chen * Yi Xu * Enhong Chen Tianbao Yang Abstract Although the convergence rates of existing variants of ADAGRAD have a better dependence on the number of iterations under the strong convexity condition,

More information

Optimal Teaching for Online Perceptrons

Optimal Teaching for Online Perceptrons Optimal Teaching for Online Perceptrons Xuezhou Zhang Hrag Gorune Ohannessian Ayon Sen Scott Alfeld Xiaojin Zhu Department of Computer Sciences, University of Wisconsin Madison {zhangxz1123, gorune, ayonsn,

More information

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Ian J. Goodfellow, Aaron Courville, Yoshua Bengio ICML 2012 Presented by Xin Yuan January 17, 2013 1 Outline Contributions Spike-and-Slab

More information

CSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent

CSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent CSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent April 27, 2018 1 / 32 Outline 1) Moment and Nesterov s accelerated gradient descent 2) AdaGrad and RMSProp 4) Adam 5) Stochastic

More information

15-859E: Advanced Algorithms CMU, Spring 2015 Lecture #16: Gradient Descent February 18, 2015

15-859E: Advanced Algorithms CMU, Spring 2015 Lecture #16: Gradient Descent February 18, 2015 5-859E: Advanced Algorithms CMU, Spring 205 Lecture #6: Gradient Descent February 8, 205 Lecturer: Anupam Gupta Scribe: Guru Guruganesh In this lecture, we will study the gradient descent algorithm and

More information

The Forgetron: A Kernel-Based Perceptron on a Fixed Budget

The Forgetron: A Kernel-Based Perceptron on a Fixed Budget The Forgetron: A Kernel-Based Perceptron on a Fixed Budget Ofer Dekel Shai Shalev-Shwartz Yoram Singer School of Computer Science & Engineering The Hebrew University, Jerusalem 91904, Israel {oferd,shais,singer}@cs.huji.ac.il

More information

Optimizing CNNs. Timothy Dozat Stanford. Abstract. 1. Introduction. 2. Background Momentum

Optimizing CNNs. Timothy Dozat Stanford. Abstract. 1. Introduction. 2. Background Momentum Optimizing CNNs Timothy Dozat Stanford tdozat@stanford.edu Abstract This work aims to explore the performance of a popular class of related optimization algorithms in the context of convolutional neural

More information

Active Learning: Disagreement Coefficient

Active Learning: Disagreement Coefficient Advanced Course in Machine Learning Spring 2010 Active Learning: Disagreement Coefficient Handouts are jointly prepared by Shie Mannor and Shai Shalev-Shwartz In previous lectures we saw examples in which

More information

The No-Regret Framework for Online Learning

The No-Regret Framework for Online Learning The No-Regret Framework for Online Learning A Tutorial Introduction Nahum Shimkin Technion Israel Institute of Technology Haifa, Israel Stochastic Processes in Engineering IIT Mumbai, March 2013 N. Shimkin,

More information

Adaptive Sampling Under Low Noise Conditions 1

Adaptive Sampling Under Low Noise Conditions 1 Manuscrit auteur, publié dans "41èmes Journées de Statistique, SFdS, Bordeaux (2009)" Adaptive Sampling Under Low Noise Conditions 1 Nicolò Cesa-Bianchi Dipartimento di Scienze dell Informazione Università

More information

Robust Optimization for Non-Convex Objectives

Robust Optimization for Non-Convex Objectives Robust Optimization for Non-Convex Objectives Robert Chen Computer Science Harvard University Brendan Lucier Microsoft Research New England Yaron Singer Computer Science Harvard University Vasilis Syrgkanis

More information

Efficient and Principled Online Classification Algorithms for Lifelon

Efficient and Principled Online Classification Algorithms for Lifelon Efficient and Principled Online Classification Algorithms for Lifelong Learning Toyota Technological Institute at Chicago Chicago, IL USA Talk @ Lifelong Learning for Mobile Robotics Applications Workshop,

More information

Deep Learning & Artificial Intelligence WS 2018/2019

Deep Learning & Artificial Intelligence WS 2018/2019 Deep Learning & Artificial Intelligence WS 2018/2019 Linear Regression Model Model Error Function: Squared Error Has no special meaning except it makes gradients look nicer Prediction Ground truth / target

More information

Introduction to Machine Learning (67577)

Introduction to Machine Learning (67577) Introduction to Machine Learning (67577) Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem Deep Learning Shai Shalev-Shwartz (Hebrew U) IML Deep Learning Neural Networks

More information