Online Convex Optimization with Unconstrained Domains and Losses


Ashok Cutkosky, Department of Computer Science, Stanford University
Kwabena Boahen, Department of Bioengineering, Stanford University

Abstract

We propose an online convex optimization algorithm (RescaledExp) that achieves optimal regret in the unconstrained setting without prior knowledge of any bounds on the loss functions. We prove a lower bound showing an exponential separation between the regret of existing algorithms that require a known bound on the loss functions and any algorithm that does not require such knowledge. RescaledExp matches this lower bound asymptotically in the number of iterations. RescaledExp is naturally hyperparameter-free and we demonstrate empirically that it matches prior optimization algorithms that require hyperparameter optimization.

1 Online Convex Optimization

Online Convex Optimization (OCO) [1, 2] provides an elegant framework for modeling noisy, antagonistic or changing environments. The problem can be stated formally with the help of the following definitions:

Convex Set: a set $W$ is convex if $W$ is contained in some real vector space and $tw + (1-t)w' \in W$ for all $w, w' \in W$ and $t \in [0, 1]$.

Convex Function: $f : W \to \mathbb{R}$ is a convex function if $f(tw + (1-t)w') \le t f(w) + (1-t) f(w')$ for all $w, w' \in W$ and $t \in [0, 1]$.

An OCO problem is a game of repeated rounds in which on round $t$ a learner first chooses an element $w_t$ in some convex space $W$, then receives a convex loss function $\ell_t$, and suffers loss $\ell_t(w_t)$. The regret of the learner with respect to some other $u \in W$ is defined by

$$R_T(u) = \sum_{t=1}^T \ell_t(w_t) - \ell_t(u).$$

The objective is to design an algorithm that can achieve low regret with respect to any $u$, even in the face of adversarially chosen $\ell_t$.

Many practical problems can be formulated as OCO problems. For example, the stochastic optimization problems found widely throughout machine learning have exactly the same form, but with i.i.d. loss functions, a subset of the OCO problems. In this setting the goal is to identify a vector $w$ with low generalization error $\mathbb{E}[\ell(w) - \ell(u)]$. We can solve this by running an OCO algorithm for $T$ rounds and setting $w$ to be the average value of $w_t$. By online-to-batch conversion results [3, 4], the generalization error is bounded by the expectation of the regret over the $\ell_t$ divided by $T$. Thus, OCO algorithms can be used to solve stochastic optimization problems while also performing well in non-i.i.d. settings.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.
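To make the protocol concrete, here is a minimal sketch (ours, not part of the paper) of the OCO game loop, the regret computation, and the online-to-batch average; the predict/update learner interface is an assumption of the sketch.

```python
import numpy as np

def oco_game(learner, make_loss, T):
    """Play T rounds of OCO: the learner proposes w_t, then sees the loss.

    `learner` is assumed to expose predict() -> w_t and update(g_t);
    `make_loss(t, w_t)` returns (loss_fn, subgrad_fn) and may depend
    adversarially on w_t. Returns the iterates and the loss functions."""
    iterates, loss_fns = [], []
    for t in range(T):
        w_t = learner.predict()
        loss_fn, subgrad_fn = make_loss(t, w_t)
        learner.update(subgrad_fn(w_t))      # access losses only via subgradients
        iterates.append(w_t)
        loss_fns.append(loss_fn)
    return iterates, loss_fns

def regret(iterates, loss_fns, u):
    # R_T(u) = sum_t l_t(w_t) - l_t(u)
    return sum(l(w) for l, w in zip(loss_fns, iterates)) - sum(l(u) for l in loss_fns)

def online_to_batch(iterates):
    # Averaged iterate: generalization error is bounded by E[regret] / T
    return np.mean(np.array(iterates), axis=0)
```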

The regret of an OCO problem is upper-bounded by the regret on a corresponding Online Linear Optimization (OLO) problem, in which each $\ell_t$ is further constrained to be a linear function: $\ell_t(w) = g_t \cdot w$ for some $g_t$. The reduction follows with the help of one more definition:

Subgradient: $g \in W$ is a subgradient of $f$ at $w$, denoted $g \in \partial f(w)$, if and only if $f(w) + g \cdot (w' - w) \le f(w')$ for all $w'$. Note that $\partial f(w) \ne \emptyset$ if $f$ is convex. (In full generality, a subgradient is an element of the dual space $W^*$. However, we will only consider cases where the subgradient is naturally identified with an element in the original space $W$, e.g. $W$ is finite dimensional, so that the definition in terms of dot-products suffices.)

To reduce OCO to OLO, suppose $g_t \in \partial \ell_t(w_t)$, and consider replacing $\ell_t(w)$ with the linear approximation $g_t \cdot w$. Then using the definition of subgradient,

$$R_T(u) = \sum_{t=1}^T \ell_t(w_t) - \ell_t(u) \le \sum_{t=1}^T g_t \cdot (w_t - u) = \sum_{t=1}^T g_t \cdot w_t - g_t \cdot u,$$

so that replacing $\ell_t(w)$ with $g_t \cdot w$ can only make the problem more difficult. All of the analysis in this paper therefore addresses OLO, accessing convex loss functions only through subgradients.

There are two major factors that influence the regret of OLO algorithms: the size of the space $W$ and the size of the subgradients $g_t$. When $W$ is a bounded set (the constrained case), then given $B = \max_{w \in W} \|w\|$, there exist OLO algorithms [5, 6] that can achieve $R_T(u) \le O(B L_{\max} \sqrt{T})$ without knowing $L_{\max} = \max_t \|g_t\|$. When $W$ is unbounded (the unconstrained case), then given $L_{\max}$, there exist algorithms [7, 8, 9] that achieve $R_T(u) \le \tilde{O}(\|u\| \log(\|u\|) L_{\max} \sqrt{T})$ or $R_T(u) \le \tilde{O}(\|u\| \sqrt{\log \|u\|}\, L_{\max} \sqrt{T})$, where $\tilde{O}$ hides factors that depend logarithmically on $L_{\max}$ and $T$. These algorithms are known to be optimal up to constants for their respective regimes [10, 7]. (There are algorithms that do not require $L_{\max}$, but they achieve only $O(\|u\|^2)$ dependence on $u$ in the regret [11].)

All algorithms for the unconstrained setting to-date require knowledge of $L_{\max}$ to achieve these optimal bounds. Thus a natural question is: can we achieve $O(\|u\| \log \|u\|)$ regret in the unconstrained, unknown-$L_{\max}$ setting? This problem has been posed as a COLT 2016 open problem [12], and is solved in this paper. A simple approach is to maintain an estimate of $L_{\max}$ and double it whenever we see a new $g_t$ that violates the assumed bound (the so-called doubling trick), thereby turning a known-$L_{\max}$ algorithm into an unknown-$L_{\max}$ algorithm. This strategy fails for previous known-$L_{\max}$ algorithms because their analysis makes strong use of the assumption that each and every $g_t$ is bounded by $L_{\max}$. The existence of even a small number of bound-violating $g_t$ can throw off the entire analysis.

In this paper, we prove that it is actually impossible to achieve regret

$$O\!\left(\|u\| \log(\|u\|)\, L_{\max} \sqrt{T} + L_{\max} \exp\!\left[\Big(\max_t \tfrac{\|g_t\|}{L_t}\Big)^{1/2 - \epsilon}\right]\right)$$

for any $\epsilon > 0$, where $L_{\max}$ and $L_t = \max_{t' < t} \|g_{t'}\|$ are unknown in advance (Section 2). This immediately rules out the ideal bound of $\tilde{O}(\|u\| \log(\|u\|) L_{\max} \sqrt{T})$, which is possible in the known-$L_{\max}$ case. Secondly, we provide an algorithm, RescaledExp, that matches our lower bound without prior knowledge of $L_{\max}$, leading to a naturally hyperparameter-free algorithm (Section 3). To our knowledge, this is the first algorithm to address the unknown-$L_{\max}$ issue while maintaining $O(\|u\| \log \|u\|)$ dependence on $u$. Finally, we present empirical results showing that RescaledExp performs well in practice (Section 4).

2 Lower Bound with Unknown $L_{\max}$

The following theorem rules out algorithms that achieve regret $O(\|u\| \log(\|u\|) L_{\max} \sqrt{T})$ without prior knowledge of $L_{\max}$. In fact, any such algorithm must pay an up-front penalty that is exponential in $T$. This lower bound resolves a COLT 2016 open problem (Parameter-Free and Scale-Free Online Algorithms [12]) in the negative.
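As a small illustration of this reduction (ours), using the hinge loss that also appears in the experiments of Section 4: the subgradient inequality guarantees that the linearized regret upper-bounds the true regret on every sample.

```python
import numpy as np

def hinge_loss(w, x, y):
    return max(0.0, 1.0 - y * np.dot(w, x))

def hinge_subgradient(w, x, y):
    # An element of the subdifferential: -y*x when the margin is violated, else 0.
    return -y * x if y * np.dot(w, x) < 1.0 else np.zeros_like(x)

rng = np.random.default_rng(0)
w_t, u, x, y = rng.normal(size=3), rng.normal(size=3), rng.normal(size=3), 1.0
g_t = hinge_subgradient(w_t, x, y)
# l_t(w_t) - l_t(u) <= g_t . (w_t - u), so OLO regret bounds OCO regret.
assert hinge_loss(w_t, x, y) - hinge_loss(u, x, y) <= np.dot(g_t, w_t - u) + 1e-12
```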

Theorem 1. For any constants $c, k, \epsilon > 0$, there exists a $T$ and an adversarial strategy picking $g_t \in \mathbb{R}$ in response to $w_t \in \mathbb{R}$ such that the regret satisfies

$$R_T(u) = \sum_{t=1}^T g_t w_t - g_t u \ge c\|u\|\log(\|u\| L_{\max}\sqrt{T})(\log L_{\max} + 1) + k L_{\max}\sqrt{T}(\log L_{\max} + 1) + k L_{\max}\exp(T^{1/2-\epsilon})$$
$$\ge c\|u\|\log(\|u\| L_{\max}\sqrt{T})(\log L_{\max} + 1) + k L_{\max}\sqrt{T}(\log L_{\max} + 1) + k L_{\max}\exp\!\left[\Big(\max_t \tfrac{g_t}{L_t}\Big)^{1/2-\epsilon}\right]$$

for some $u \in \mathbb{R}$, where $L_{\max} = \max_{t \le T} |g_t|$ and $L_t = \max_{t' < t} |g_{t'}|$.

Proof. We prove the theorem by showing that for sufficiently large $T$, the adversary can checkmate the learner by presenting it only with the subgradient $g_t = -1$. If the learner fails to have $w_t$ increase quickly, then there is a $u$ against which the learner has high regret. On the other hand, if the learner ever does make $w_t$ higher than a particular threshold, the adversary immediately punishes the learner with a subgradient $g_t = T$, again resulting in high regret.

Let $T$ be large enough that both of the following hold:

1. $(T - 2\sqrt{T})\,\dfrac{\exp(\sqrt{T})}{4\log c} > c\,u\log(u\sqrt{T}) + k\sqrt{T}(\log T + 1) + k\exp(T^{1/2-\epsilon})$, where $u = \exp(\sqrt{T})/(4\log c)$;
2. $\dfrac{T\exp(\sqrt{T})}{2 \cdot 4\log c} > kT\exp(T^{1/2-\epsilon}) + kT\sqrt{T}(\log T + 1)$.

The adversary plays the following strategy: for all $t \le T$, so long as $w_t < \exp(\sqrt{t})/(4\log c)$, give $g_t = -1$. As soon as $w_t \ge \exp(\sqrt{t})/(4\log c)$, give $g_t = T$ once, and $g_t = 0$ for all subsequent $t$. Let's analyze the regret at time $T$ in these two cases.

Case 1: $w_t < \exp(\sqrt{t})/(4\log c)$ for all $t$. In this case, let $u = \exp(\sqrt{T})/(4\log c)$. Then $L_{\max} = 1$, $\max_t g_t/L_t = 1$, and, since $\sum_{t=1}^T \exp(\sqrt{t}) \le 2\sqrt{T}\exp(\sqrt{T})$, using (1) the learner's regret is at least

$$R_T(u) \ge Tu - \sum_{t=1}^T w_t \ge Tu - 2\sqrt{T}\,u = (T - 2\sqrt{T})\frac{\exp(\sqrt{T})}{4\log c}$$
$$> c\|u\|\log(\|u\| L_{\max}\sqrt{T})(\log L_{\max} + 1) + kL_{\max}\sqrt{T}(\log L_{\max} + 1) + kL_{\max}\exp(T^{1/2-\epsilon}).$$

Case 2: $w_t \ge \exp(\sqrt{t})/(4\log c)$ for some $t$. In this case, $L_{\max} = T$ and $\max_t g_t/L_t = T$. For $u = 0$, using (2), the regret is at least

$$R_T(u) \ge \frac{T\exp(\sqrt{T})}{2 \cdot 4\log c} \ge kT\exp(T^{1/2-\epsilon}) + kT\sqrt{T}(\log T + 1) = kL_{\max}\exp(T^{1/2-\epsilon}) + kL_{\max}\sqrt{T}(\log L_{\max} + 1)$$
$$= c\|u\|\log(\|u\| L_{\max}\sqrt{T})(\log L_{\max} + 1) + kL_{\max}\sqrt{T}(\log L_{\max} + 1) + kL_{\max}\exp\!\left[\Big(\max_t \tfrac{g_t}{L_t}\Big)^{1/2-\epsilon}\right],$$

since $u = 0$. $\square$

The exponential lower bound arises because the learner has to move exponentially fast in order to deal with exponentially far away $u$, but then experiences exponential regret if the adversary provides a gradient of unprecedented magnitude in the opposite direction. However, if we play against an adversary that is constrained to give loss vectors $\|g_t\| \le L_{\max}$ for some $L_{\max}$ that does not grow with time, or if the losses do not grow too quickly, then we can still achieve $R_T(u) = O(\|u\|\log(\|u\|)L_{\max}\sqrt{T})$ asymptotically without knowing $L_{\max}$. In the following sections we describe an algorithm that accomplishes this.
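The checkmate strategy is easy to simulate. Below is a toy version (ours, with the threshold constants taken from the reconstruction above) that plays the adversary against an arbitrary black-box learner:

```python
import numpy as np

def checkmate_adversary(learner_step, T, c=np.e):
    """Feed g_t = -1 while w_t stays below exp(sqrt(t))/(4 log c); punish the
    first crossing with a single g_t = T, then give g_t = 0 afterwards."""
    w, punished, grads, total_loss = 0.0, False, [], 0.0
    for t in range(1, T + 1):
        threshold = np.exp(np.sqrt(t)) / (4.0 * np.log(c))
        if punished:
            g = 0.0
        elif w >= threshold:
            g, punished = float(T), True
        else:
            g = -1.0
        grads.append(g)
        total_loss += g * w          # linear loss g_t * w_t
        w = learner_step(w, g)       # learner's update rule (black box)
    return grads, total_loss

# e.g. a gradient-descent learner: checkmate_adversary(lambda w, g: w - 0.1 * g, T=100)
```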

3 RescaledExp

Our algorithm, RescaledExp, adapts to the unknown $L_{\max}$ using a guess-and-double strategy that is robust to a small number of bound-violating $g_t$. We initialize a guess $L_1$ for $L_{\max}$ to $2\|g_1\|$. Then we run a novel known-$L_{\max}$ algorithm that can achieve good regret in the unconstrained-$u$ setting. As soon as we see a $g_t$ with $\|g_t\| > L_t$, we update our guess to $2\|g_t\|$ and restart the known-$L_{\max}$ algorithm. To prove that this scheme is effective, we show (Lemma 3) that our known-$L_{\max}$ algorithm does not suffer too much regret when it sees a $g_t$ that violates its assumed bound.

Our known-$L_{\max}$ algorithm uses the Follow-the-Regularized-Leader (FTRL) framework. FTRL is an intuitive way to design OCO algorithms [13]: given functions $\psi_t : W \to \mathbb{R}$, at time $T$ we play

$$w_T = \operatorname{argmin}_w \left[\psi_{T-1}(w) + \sum_{t=1}^{T-1} \ell_t(w)\right].$$

The functions $\psi_t$ are called regularizers. A large number of OCO algorithms (e.g. gradient descent) can be cleanly formulated as instances of this framework.

Our known-$L_{\max}$ algorithm is FTRL with regularizers $\psi_t(w) = \psi(w)/\eta_t$, where

$$\psi(w) = (\|w\| + 1)\log(\|w\| + 1) - \|w\|$$

and $\eta_t$ is a scale-factor that we adapt over time. Specifically, we set $\eta_t = 1/\big(k\sqrt{M_t + \|g\|^2_{1:t}}\big)$, where we use the compressed-sum notations $g_{1:T} = \sum_{t=1}^T g_t$ and $\|g\|^2_{1:T} = \sum_{t=1}^T \|g_t\|^2$. $M_t$ is defined recursively by $M_0 = 0$ and $M_t = \max\big(M_{t-1}, \|g_{1:t}\|/p - \|g\|^2_{1:t}\big)$, so that $M_t \ge M_{t-1}$ and $M_t + \|g\|^2_{1:t} \ge \|g_{1:t}\|/p$. Here $k$ and $p$ are constants: $k = 2$ and $p = 1/(2L_{\max})$.

RescaledExp's strategy is to maintain an estimate $L_t$ of $L_{\max}$ at all time steps. Whenever it observes $\|g_t\| > L_t$, it updates $L_{t+1} = 2\|g_t\|$. We call periods during which $L_t$ is constant epochs. Every time it updates $L_t$, it restarts our known-$L_{\max}$ algorithm with $p = 1/L_{t+1}$, beginning a new epoch. Notice that since $L_t$ at least doubles every epoch, there will be at most $\log_2(L_{\max}/L_1) + 1$ total epochs. To address edge cases, we set $w_t = 0$ until we suffer a non-constant loss function, and we set the initial value of $L_t$ from the first non-zero $g_t$. Pseudo-code is given in Algorithm 1, and Theorem 2 states our regret bound. For simplicity, we re-index so that $g_1$ is the first non-zero gradient received. No regret is suffered when $g_t = 0$, so this does not affect our analysis.

Algorithm 1 RescaledExp
  Initialize: $k \leftarrow 2$, $M_0 \leftarrow 0$, $w_1 \leftarrow 0$, $t_1 \leftarrow 1$  // $t_1$ is the start-time of the current epoch
  for $t = 1$ to $T$ do
    Play $w_t$, receive subgradient $g_t \in \partial \ell_t(w_t)$.
    if $t = 1$ then
      $L_1 \leftarrow 2\|g_1\|$
      $p \leftarrow 1/L_1$
    end if
    $M_t \leftarrow \max\big(M_{t-1}, \|g_{t_1:t}\|/p - \|g\|^2_{t_1:t}\big)$
    $\eta_t \leftarrow 1/\big(k\sqrt{M_t + \|g\|^2_{t_1:t}}\big)$
    // Set $w_{t+1}$ using the FTRL update
    $w_{t+1} \leftarrow -\dfrac{g_{t_1:t}}{\|g_{t_1:t}\|}\left[\exp(\eta_t\|g_{t_1:t}\|) - 1\right]$  // $= \operatorname{argmin}_w\ \psi(w)/\eta_t + g_{t_1:t}\cdot w$
    if $\|g_t\| > L_t$ then
      // Begin a new epoch: update $L$ and restart FTRL
      $L_{t+1} \leftarrow 2\|g_t\|$
      $p \leftarrow 1/L_{t+1}$
      $t_1 \leftarrow t + 1$
      $M_t \leftarrow 0$
      $w_{t+1} \leftarrow 0$
    else
      $L_{t+1} \leftarrow L_t$
    end if
  end for

Theorem 2. Let $W$ be a separable real inner-product space with corresponding norm $\|\cdot\|$, and suppose (with mild abuse of notation) every loss function $\ell_t : W \to \mathbb{R}$ has some subgradient $\tilde{g}_t \in W^*$ at $w_t$ such that $\tilde{g}_t(w) = \langle g_t, w\rangle$ for some $g_t \in W$. Let $M_{\max} = \max_t M_t$. Then if $L_{\max} = \max_t \|g_t\|$
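For concreteness, here is a minimal NumPy sketch of Algorithm 1 under the reconstruction above ($k = 2$, $L_{t+1} = 2\|g_t\|$, $p = 1/L_{t+1}$); the class interface is our own, and this should be read as an illustration rather than a reference implementation.

```python
import numpy as np

class RescaledExp:
    """Sketch of Algorithm 1 for vectors in R^d (illustrative only)."""

    def __init__(self, dim, k=2.0):
        self.k = k
        self.w = np.zeros(dim)   # current prediction w_t
        self.L = None            # current guess L_t for L_max (set on first nonzero g)
        self.p = None
        self._restart(dim)

    def _restart(self, dim):
        self.M = 0.0                   # M_t
        self.g_sum = np.zeros(dim)     # g_{t1:t}, within-epoch gradient sum
        self.g_sq = 0.0                # sum of ||g_t||^2 within the epoch

    def predict(self):
        return self.w

    def update(self, g):
        norm_g = float(np.linalg.norm(g))
        if self.L is None:
            if norm_g == 0.0:          # play 0 until the first nonzero gradient
                return
            self.L, self.p = 2.0 * norm_g, 1.0 / (2.0 * norm_g)
        self.g_sum += g
        self.g_sq += norm_g ** 2
        s = float(np.linalg.norm(self.g_sum))
        self.M = max(self.M, s / self.p - self.g_sq)
        eta = 1.0 / (self.k * np.sqrt(self.M + self.g_sq))
        # FTRL update: w_{t+1} = -(g_sum/||g_sum||)(exp(eta*||g_sum||) - 1)
        self.w = np.zeros_like(self.w) if s == 0.0 else -(self.g_sum / s) * (np.exp(eta * s) - 1.0)
        if norm_g > self.L:            # bound violated: rescale guess, restart epoch
            self.L, self.p = 2.0 * norm_g, 1.0 / (2.0 * norm_g)
            self._restart(len(self.w))
            self.w = np.zeros_like(self.w)
```

Plugged into an online loop such as the one sketched in Section 1, this learner requires no learning-rate tuning; the only state carried across epochs is the scale guess $L_t$.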

and $L_t = \max_{t' < t}\|g_{t'}\|$, RescaledExp achieves regret:

$$R_T(u) \le (2\psi(u) + 96)\left(\log_2\frac{L_{\max}}{L_1} + 1\right)\sqrt{M_{\max} + \|g\|^2_{1:T}} + 8L_{\max}\left(\log_2\frac{L_{\max}}{L_1} + 1\right)\min\left\{\exp\Big(8\max_t\frac{\|g_t\|}{L_t}\Big),\ \exp(\sqrt{T})\right\}$$
$$= O\left(L_{\max}\log\frac{L_{\max}}{L_1}\left[\big(\|u\|\log\|u\| + 1\big)\sqrt{T} + \min\left\{\exp\Big(8\max_t\frac{\|g_t\|}{L_t}\Big),\ \exp(\sqrt{T})\right\}\right]\right).$$

The conditions on $W$ in Theorem 2 are fairly mild. In particular they are satisfied whenever $W$ is finite-dimensional, and in most kernel-method settings [14]. In the kernel-method setting, $W$ is an RKHS of functions $X \to \mathbb{R}$ and our losses take the form $\ell_t(w) = \ell_t(\langle w, k_{x_t}\rangle)$, where $k_{x_t}$ is the representing element in $W$ of some $x_t \in X$, so that $g_t = \tilde{g}_t k_{x_t}$ where $\tilde{g}_t \in \partial\ell_t(\langle w, k_{x_t}\rangle)$.

Although we nearly match our lower-bound exponential term of $\exp(T^{1/2-\epsilon})$, in order to have a practical algorithm we need to do much better. Fortunately, the $\max_t \|g_t\|/L_t$ term may be significantly smaller when the losses are not fully adversarial. For example, if the loss vectors satisfy $\|g_t\| = 2^t$, then the exponential term in our bound reduces to a manageable constant even though $\|g_t\|$ is growing quickly without bound.

To prove Theorem 2, we bound the regret of RescaledExp during each epoch. Recall that during an epoch, RescaledExp is running FTRL with $\psi_t(w) = \psi(w)/\eta_t$. Therefore our first order of business is to analyze the regret of FTRL across one of these epochs, which we do in Lemma 3 (proved in the appendix):

Lemma 3. Set $k = 2$. Suppose $\|g_t\| \le L$ for $t < T$, $1/(2L) \le p \le 2/L$, $\|g_T\| \le L_{\max}$ and $L_{\max} \ge L$. Let $W_{\max} = \max_{t\in[1,T]}\|w_t\|$. Then the regret of FTRL with regularizers $\psi_t(w) = \psi(w)/\eta_t$ is:

$$R_T(u) \le \frac{\psi(u)}{\eta_T} + 96\sqrt{M_{T-1} + \|g\|^2_{1:T-1}} + L_{\max}\min\left\{W_{\max},\ 4\exp\Big(\frac{4L_{\max}}{L}\Big),\ \exp(\sqrt{T})\right\}$$
$$\le (2\psi(u) + 96)\sqrt{2L\sum_{t=1}^{T-1}\|g_t\| + L_{\max}^2} + 8L_{\max}\min\left\{\exp\Big(\frac{4L_{\max}}{L}\Big),\ \exp(\sqrt{T})\right\}$$
$$\le 2L_{\max}\big[(\|u\| + 1)\log(\|u\| + 1) + 96\big]\sqrt{2T} + 8L_{\max}\min\left\{e^{4L_{\max}/L},\ e^{\sqrt{T}}\right\}.$$

Lemma 3 requires us to know the value of $L$ in order to set $p$. However, the crucial point is that it encompasses the case in which $L$ is misspecified on the last loss vector. This allows us to show that RescaledExp does not suffer too much by updating $p$ on-the-fly.

Proof of Theorem 2. The theorem follows by applying Lemma 3 to each epoch in which $L_t$ is constant. Let $1 = t_1 < t_2 < \dots < t_n$ be the increasing values of the epoch start-times as defined in Algorithm 1, and define $t_{n+1} = T + 1$. Then define

$$R_{a:b}(u) = \sum_{t=a}^{b-1} g_t\cdot(w_t - u),$$

so that $R_T(u) \le \sum_{j=1}^n R_{t_j:t_{j+1}}(u)$. We will bound $R_{t_j:t_{j+1}}(u)$ for each $j$.

Fix a particular $j < n$. Then $R_{t_j:t_{j+1}}(u)$ is simply the regret of FTRL with $k = 2$, $p = 1/L_{t_j}$, $\eta_t = 1/\big(k\sqrt{M_t + \|g\|^2_{t_j:t}}\big)$ and regularizers $\psi(w)/\eta_t$. By definition of $L_t$, for $t \in [t_j, t_{j+1} - 2]$ we have $\|g_t\| \le L_{t_j}$. Further, if we take $L = \max\big(L_{t_j}/2,\ \max_{t\in[t_j, t_{j+1}-2]}\|g_t\|\big)$, then $L_{t_j}/2 \le L \le L_{t_j}$, so that $1/(2L) \le p \le 2/L$. Further, we have $\|g_{t_{j+1}-1}\|/L_{t_j} \le \max_t \|g_t\|/L_t$. Thus by Lemma 3 we have

$$R_{t_j:t_{j+1}}(u) \le \frac{\psi(u)}{\eta_{t_{j+1}-1}} + 96\sqrt{M_{t_{j+1}-1} + \|g\|^2_{t_j:t_{j+1}-1}} + L_{\max}\min\left\{W_{\max},\ 4\exp\Big(4\frac{\|g_{t_{j+1}-1}\|}{L}\Big),\ \exp\big(\sqrt{t_{j+1}-t_j}\big)\right\}$$
$$\le (2\psi(u) + 96)\sqrt{M_{\max} + \|g\|^2_{1:T}} + 8L_{\max}\min\left\{\exp\Big(8\max_t\frac{\|g_t\|}{L_t}\Big),\ \exp(\sqrt{T})\right\},$$

where the second line uses $L \ge L_{t_j}/2$ and $\|g_{t_{j+1}-1}\|/L_{t_j} \le \max_t\|g_t\|/L_t$. Summing across epochs, we have

$$R_T(u) \le \sum_{j=1}^n R_{t_j:t_{j+1}}(u) \le n\left[(2\psi(u) + 96)\sqrt{M_{\max} + \|g\|^2_{1:T}} + 8L_{\max}\min\left\{\exp\Big(8\max_t\frac{\|g_t\|}{L_t}\Big),\ \exp(\sqrt{T})\right\}\right].$$

Observe that $n \le \log_2(L_{\max}/L_1) + 1$ to prove the first line of the theorem. The big-$O$ expression follows from the inequality $M_{t_{j+1}} \le 2L_{t_j}\sum_{t=t_j}^{t_{j+1}}\|g_t\| \le 2L_{\max}\sum_{t=1}^T\|g_t\| \le 2L_{\max}^2 T$. $\square$

Our specific choices for $k$ and $p$ are somewhat arbitrary. We suspect (although we do not prove) that the preceding theorems remain true for larger values of $k$ and any $p$ inversely proportional to $L_t$, albeit with differing constants. In Section 4 we perform experiments using the values for $k$, $p$ and $L_t$ described in Algorithm 1. In keeping with the spirit of designing a hyperparameter-free algorithm, no attempt was made to empirically optimize these values at any time.

4 Experiments

4.1 Linear Classification

To validate our theoretical results in practice, we evaluated RescaledExp on 8 classification datasets. The data for each task was pulled from the LIBSVM website [15], and can be found individually in a variety of sources [16, 17, 18, 19, 20, 21, 22]. We use linear classifiers with hinge loss for each task and we compare RescaledExp to five other optimization algorithms: AdaGrad [5], Scale-Invariant [23], PiSTOL [24], Adam [25], and AdaDelta [26]. Each of these algorithms requires tuning of some hyperparameter for unconstrained problems with unknown $L_{\max}$ (usually a scale-factor on a learning rate). In contrast, RescaledExp requires no such tuning.

We evaluate each algorithm by the average loss after one pass through the data, computing a prediction, an error, and an update to model parameters for each example in the dataset. Note that this is not the same as a cross-validated error, but is closer to the notion of regret addressed in our theorems. We plot this average loss versus hyperparameter setting for each dataset in Figures 1 and 2. These data bear out the effectiveness of RescaledExp: while it is not unilaterally the highest performer on all datasets, it shows remarkable robustness across datasets with zero manual tuning.

4.2 Convolutional Neural Networks

We also evaluated RescaledExp on two convolutional neural network models. These models have demonstrated remarkable success in computer vision tasks and are becoming increasingly popular in a variety of areas, but can require significant hyperparameter tuning to train. We consider the MNIST [18] and CIFAR-10 [27] image classification tasks. Our MNIST architecture consisted of two consecutive 5x5 convolution and max-pooling layers followed by a 512-neuron fully-connected layer. Our CIFAR-10 architecture was two consecutive 5x5 convolution and 3x3 max-pooling layers followed by a 384-neuron fully-connected layer and a 192-neuron fully-connected layer.
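A schematic version (ours) of the one-pass evaluation protocol of Section 4.1, using hinge loss on a stream of labeled examples; the dataset arrays are placeholders:

```python
import numpy as np

def one_pass_average_loss(learner, X, y):
    """Single pass over the data: predict, record hinge loss, update."""
    total = 0.0
    for x_t, y_t in zip(X, y):
        w_t = learner.predict()
        margin = y_t * np.dot(w_t, x_t)
        total += max(0.0, 1.0 - margin)                       # hinge loss
        g_t = -y_t * x_t if margin < 1.0 else np.zeros_like(x_t)
        learner.update(g_t)
    return total / len(y)
```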

[Figure 1 shows four panels (covtype, gisette_scale, madelon, mnist) plotting average loss against hyperparameter setting for PiSTOL, Scale-Invariant, Adam, AdaDelta, AdaGrad, and RescaledExp.]

Figure 1: Average loss vs hyperparameter setting for each algorithm across each dataset. RescaledExp has no hyperparameters and so is represented by a flat yellow line. Many of the other algorithms display large sensitivity to hyperparameter setting.

These models are highly non-convex, so none of our theoretical analysis applies. Our use of RescaledExp is motivated by the fact that in practice convex methods are used to train these models. We found that RescaledExp can match the performance of other popular algorithms (see Figure 3). In order to achieve this performance, we made a slight modification to RescaledExp: when we update $L_t$, instead of resetting $w_t$ to zero, we re-center the algorithm about the previous prediction point. We provide no theoretical justification for this modification, but only note that it makes intuitive sense in stochastic optimization problems, where one can reasonably expect that the previous prediction vector is closer to the optimal value than zero.

5 Conclusions

We have presented RescaledExp, an Online Convex Optimization algorithm that achieves regret $\tilde{O}\big(\|u\|\log(\|u\|)L_{\max}\sqrt{T} + \exp(8\max_t\|g_t\|/L_t)\big)$ where $L_{\max} = \max_t\|g_t\|$ is unknown in advance. Since RescaledExp does not use any prior knowledge about the losses or the comparison vector $u$, it is hyperparameter-free and so does not require any tuning of learning rates. We also prove a lower bound showing that any algorithm that addresses the unknown-$L_{\max}$ scenario must suffer an exponential penalty in the regret. We compare RescaledExp to prior optimization algorithms empirically and show that it matches their performance.

While our lower bound matches our regret bound for RescaledExp in terms of $T$, clearly there is much work to be done. For example, when RescaledExp is run on the adversarial loss sequence presented in Theorem 1, its regret matches the lower bound, suggesting that the optimality gap could be improved with superior analysis. We also hope that our lower bound inspires work in algorithms that adapt to non-adversarial properties of the losses to avoid the exponential penalty.

[Figure 2 shows four more panels (ijcnn1, epsilon_normalized, rcv1_train.multiclass, SenseIT Vehicle Combined) plotting average loss against hyperparameter setting for the same algorithms.]

Figure 2: Average loss vs hyperparameter setting, continued from Figure 1.

Figure 3: We compare RescaledExp to Adam, AdaGrad, and stochastic gradient descent (SGD), with learning-rate hyperparameter optimization for the latter three algorithms. All algorithms achieve a final validation accuracy of 99% on MNIST, and 84%, 84%, 83% and 85% respectively on CIFAR-10.

References

[1] Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 928-936, 2003.
[2] Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107-194, 2011.
[3] Nick Littlestone. From on-line to batch learning. In Proceedings of the Second Annual Workshop on Computational Learning Theory, pages 269-284, 1989.
[4] Nicolo Cesa-Bianchi, Alex Conconi, and Claudio Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 50(9):2050-2057, 2004.
[5] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. In Conference on Learning Theory (COLT), 2010.
[6] H. Brendan McMahan and Matthew Streeter. Adaptive bound optimization for online convex optimization. In Proceedings of the 23rd Annual Conference on Learning Theory (COLT), 2010.
[7] Brendan McMahan and Matthew Streeter. No-regret algorithms for unconstrained online convex optimization. In Advances in Neural Information Processing Systems, pages 2402-2410, 2012.
[8] Francesco Orabona. Dimension-free exponentiated gradient. In Advances in Neural Information Processing Systems, pages 1806-1814, 2013.
[9] Brendan McMahan and Jacob Abernethy. Minimax optimal algorithms for unconstrained linear optimization. In Advances in Neural Information Processing Systems, pages 2724-2732, 2013.
[10] Jacob Abernethy, Peter L. Bartlett, Alexander Rakhlin, and Ambuj Tewari. Optimal strategies and minimax lower bounds for online convex games. In Proceedings of the Nineteenth Annual Conference on Computational Learning Theory, 2008.
[11] Francesco Orabona and Dávid Pál. Scale-free online learning. arXiv preprint arXiv:1601.01974, 2016.
[12] Francesco Orabona and Dávid Pál. Open problem: Parameter-free and scale-free online algorithms. In Conference on Learning Theory, 2016.
[13] Shai Shalev-Shwartz. Online Learning: Theory, Algorithms, and Applications. PhD thesis, The Hebrew University of Jerusalem, 2007.
[14] Thomas Hofmann, Bernhard Schölkopf, and Alexander J. Smola. Kernel methods in machine learning. The Annals of Statistics, pages 1171-1220, 2008.
[15] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):27, 2011.
[16] Isabelle Guyon, Steve Gunn, Asa Ben-Hur, and Gideon Dror. Result analysis of the NIPS 2003 feature selection challenge. In Advances in Neural Information Processing Systems, pages 545-552, 2004.
[17] Chih-Chung Chang and Chih-Jen Lin. IJCNN 2001 challenge: Generalization ability and text decoding. In Proceedings of IJCNN. IEEE, 2001.
[18] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
[19] David D. Lewis, Yiming Yang, Tony G. Rose, and Fan Li. RCV1: A new benchmark collection for text categorization research. The Journal of Machine Learning Research, 5:361-397, 2004.
[20] Marco F. Duarte and Yu Hen Hu. Vehicle classification in distributed sensor networks. Journal of Parallel and Distributed Computing, 64(7):826-838, 2004.
[21] M. Lichman. UCI machine learning repository, 2013.
[22] Shimon Kogan, Dimitry Levin, Bryan R. Routledge, Jacob S. Sagi, and Noah A. Smith. Predicting risk from financial reports with regression. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pages 272-280. Association for Computational Linguistics, 2009.
[23] Francesco Orabona, Koby Crammer, and Nicolò Cesa-Bianchi. A generalized online mirror descent with applications to classification and regression. Machine Learning, 99(3):411-435, 2015.
[24] Francesco Orabona. Simultaneous model selection and optimization through parameter-free stochastic learning. In Advances in Neural Information Processing Systems, pages 1116-1124, 2014.
[25] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[26] Matthew D. Zeiler. Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
[27] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images, 2009.
[28] H. Brendan McMahan. A survey of algorithms and analysis for adaptive online learning. arXiv preprint arXiv:1403.3465, 2014.

A Follow-the-Regularized-Leader (FTRL) Regret

Recall that the FTRL algorithm uses the strategy

$$w_{t+1} = \operatorname{argmin}_w\left[\psi_t(w) + \sum_{t'=1}^{t}\ell_{t'}(w)\right],$$

where the functions $\psi_t$ are called regularizers.

Theorem 4. FTRL with regularizers $\psi_t$ and $\psi_0(w) = 0$ obtains regret

$$R_T(u) \le \psi_T(u) + \sum_{t=1}^T\left[\psi_{t-1}(w_{t+1}) - \psi_t(w_{t+1}) + \ell_t(w_t) - \ell_t(w_{t+1})\right]. \qquad (3)$$

Further, if the losses are linear, $\ell_t(w) = g_t\cdot w$, and $\psi_t(w) = \psi(w)/\eta_t$ for some values $\eta_t$ and fixed function $\psi$, then the regret is

$$R_T(u) \le \frac{\psi(u)}{\eta_T} + \sum_{t=1}^T\left[\left(\frac{1}{\eta_{t-1}} - \frac{1}{\eta_t}\right)\psi(w_{t+1}) + g_t\cdot(w_t - w_{t+1})\right]. \qquad (4)$$

Proof. The first part follows from some algebraic manipulations. Since $w_{T+1}$ minimizes $\psi_T(w) + \sum_{t=1}^T\ell_t(w)$, we have

$$-\sum_{t=1}^T\ell_t(u) \le \psi_T(u) - \psi_T(w_{T+1}) - \sum_{t=1}^T\ell_t(w_{T+1}),$$

so that

$$R_T(u) = \sum_{t=1}^T\ell_t(w_t) - \ell_t(u) \le \psi_T(u) - \psi_T(w_{T+1}) + \sum_{t=1}^T\left[\ell_t(w_t) - \ell_t(w_{T+1})\right]$$
$$= \psi_T(u) - \psi_T(w_{T+1}) + \ell_T(w_T) - \ell_T(w_{T+1}) + R_{T-1}(w_{T+1})$$
$$\le \psi_T(u) - \psi_T(w_{T+1}) + \ell_T(w_T) - \ell_T(w_{T+1}) + \psi_{T-1}(w_{T+1}) + \sum_{t=1}^{T-1}\left[\psi_{t-1}(w_{t+1}) - \psi_t(w_{t+1}) + \ell_t(w_t) - \ell_t(w_{t+1})\right]$$
$$= \psi_T(u) + \sum_{t=1}^T\left[\psi_{t-1}(w_{t+1}) - \psi_t(w_{t+1}) + \ell_t(w_t) - \ell_t(w_{t+1})\right],$$

where the third line applies the same argument inductively to $R_{T-1}(w_{T+1})$, and we use $\psi_0(w) = 0$ in the last step.

Now let's specialize to the case of linear losses $\ell_t(w) = g_t\cdot w$ and regularizers of the form $\psi_t(w) = \psi(w)/\eta_t$ for some fixed regularizer $\psi$ and varying scalings $\eta_t$. Plugging this into the previous bound gives

$$R_T(u) \le \frac{\psi(u)}{\eta_T} + \sum_{t=1}^T\left[\left(\frac{1}{\eta_{t-1}} - \frac{1}{\eta_t}\right)\psi(w_{t+1}) + g_t\cdot(w_t - w_{t+1})\right]. \qquad\square$$

While this formulation of the regret of FTRL is sufficient for our needs, our analysis is not tight. We refer the reader to [28] for a stronger FTRL bound that can improve constants in some analyses.

B Proof of Lemma 3

We start off by computing the FTRL updates with regularizers $\psi(w)/\eta_t$:

$$\psi(w) = (\|w\| + 1)\log(\|w\| + 1) - \|w\|, \qquad \nabla\psi(w) = \frac{w}{\|w\|}\log(\|w\| + 1),$$
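The closed-form minimizer computed next can be sanity-checked numerically; the following one-dimensional test (ours, using SciPy) compares it against direct minimization:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def psi(w):
    a = abs(w)
    return (a + 1.0) * np.log(a + 1.0) - a

eta, g_sum = 0.3, 1.7   # arbitrary test values for the scale factor and g_{1:t}
closed_form = -np.sign(g_sum) * (np.exp(eta * abs(g_sum)) - 1.0)
numeric = minimize_scalar(lambda w: psi(w) / eta + g_sum * w,
                          bounds=(-100.0, 100.0), method="bounded",
                          options={"xatol": 1e-10}).x
assert abs(closed_form - numeric) < 1e-6
```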

so that

$$w_{T+1} = \operatorname{argmin}_w\left[\frac{\psi(w)}{\eta_T} + g_{1:T}\cdot w\right] = -\frac{g_{1:T}}{\|g_{1:T}\|}\left(\exp(\eta_T\|g_{1:T}\|) - 1\right).$$

Our goal will be to show that the terms $(1/\eta_{t-1} - 1/\eta_t)\psi(w_{t+1}) + g_t\cdot(w_t - w_{t+1})$ in the sum in (4) are negative. In particular, note that the sequence $\eta_t$ is non-increasing, so that $(1/\eta_{t-1} - 1/\eta_t)\psi(w_{t+1}) \le 0$ for all $t$. Thus our strategy will be to bound $g_t\cdot(w_t - w_{t+1})$.

B.1 Reduction to one dimension

In order to bound $(1/\eta_{t-1} - 1/\eta_t)\psi(w_{t+1}) + g_t\cdot(w_t - w_{t+1})$, we first show that it suffices to consider the case when $g_t$ and $g_{1:t-1}$ are co-linear.

Theorem 5. Let $W$ be a separable inner-product space and suppose (with mild abuse of notation) every loss function $\ell_t : W \to \mathbb{R}$ has some subgradient $\tilde{g}_t \in W^*$ such that $\tilde{g}_t(w) = \langle g_t, w\rangle$ for some $g_t \in W$. Suppose we run an FTRL algorithm with regularizers $\psi(\|w\|)/\eta_t$ on loss functions $\ell_t$, such that

$$w_{t+1} = -\frac{g_{1:t}}{\|g_{1:t}\|}\,f(\eta_t\|g_{1:t}\|)$$

for some function $f$ for all $t$, where $\eta_t = 1/\big(c\sqrt{M_t + \|g\|^2_{1:t}}\big)$ for some constant $c$. Then for any $g_t$ with $\|g_t\| = L$, both $(1/\eta_{t-1} - 1/\eta_t)\psi(\|w_{t+1}\|) + g_t\cdot(w_t - w_{t+1})$ and $g_t\cdot(w_t - w_{t+1})$ are maximized when $g_t$ is a scalar multiple of $g_{1:t-1}$.

Proof. The proof is an application of Lagrange multipliers. Our Lagrangian for $(1/\eta_{t-1} - 1/\eta_t)\psi(\|w_{t+1}\|) + g_t\cdot(w_t - w_{t+1})$ is

$$\mathcal{L} = \left(\frac{1}{\eta_{t-1}} - \frac{1}{\eta_t}\right)\psi\big(f(\eta_t\|g_{1:t}\|)\big) + g_t\cdot\left(w_t + \frac{g_{1:t}}{\|g_{1:t}\|}f(\eta_t\|g_{1:t}\|)\right) + \frac{\lambda}{2}\big(\|g_t\|^2 - L^2\big).$$

Fix a countable orthonormal basis of $W$. For a vector $v \in W$ we let $v_i$ be the projection of $v$ along the $i$th basis vector of our countable orthonormal basis. We denote the derivative of $\mathcal{L}$ with respect to the $i$th coordinate of $g_t$ by $\mathcal{L}_i$. Differentiating, every term of $\mathcal{L}_i$ is a multiple of $g_{t,i}$, of $g_{1:t,i}$, of $w_{t,i}$, of $w_{t+1,i}$, or of $\partial M_t/\partial g_{t,i}$ (the multipliers involve $f$, $f'$, $\eta_t$ and $\eta_{t-1}$ but do not depend on $i$). Collecting terms we can write

$$\mathcal{L}_i = \lambda g_{t,i} + w_{t,i} - w_{t+1,i} + A\,g_{t,i} + B\,g_{1:t,i} + C\,\frac{\partial M_t}{\partial g_{t,i}},$$

where $A$, $B$ and $C$ do not depend on $i$. Since $w_{t,i}$ and $w_{t+1,i}$ are scalar multiples of $g_{1:t-1,i}$ and $g_{1:t,i}$ respectively, we can reassign the variables $A$ and $B$ to write

$$\mathcal{L}_i = A\,g_{t,i} + B\,g_{1:t,i} + C\,\frac{\partial M_t}{\partial g_{t,i}}.$$

Now we compute $\partial M_t/\partial g_{t,i}$. Since $M_t = \max\big(M_{t-1},\ \|g_{1:t}\|/p - \|g\|^2_{1:t}\big)$,

$$\frac{\partial M_t}{\partial g_{t,i}} = \begin{cases} 0 & : M_t = M_{t-1} \\ \dfrac{g_{1:t,i}}{p\,\|g_{1:t}\|} - 2g_{t,i} & : M_t = \|g_{1:t}\|/p - \|g\|^2_{1:t}. \end{cases}$$

Thus, after again reassigning the variables $A$ and $B$, we have

$$\mathcal{L}_i = A\,g_{t,i} + B\,g_{1:t,i}.$$

Therefore we can only have $\nabla\mathcal{L} = 0$ if $g_t$ is a scalar multiple of $g_{1:t}$, and hence of $g_{1:t-1}$, as desired.

For $g_t\cdot(w_t - w_{t+1})$, we apply exactly the same argument. The Lagrangian is

$$\mathcal{L} = g_t\cdot(w_t - w_{t+1}) + \frac{\lambda}{2}\big(\|g_t\|^2 - L^2\big) = g_t\cdot\left(w_t + \frac{g_{1:t}}{\|g_{1:t}\|}f(\eta_t\|g_{1:t}\|)\right) + \frac{\lambda}{2}\big(\|g_t\|^2 - L^2\big),$$

and differentiating as before we again obtain $\mathcal{L}_i = A\,g_{t,i} + B\,g_{1:t,i}$, so that again we are done. $\square$

We make the following intuitive definition:

Definition 6. For any vector $v \in W$, define $\operatorname{sign}(v) = v/\|v\|$ (with $\operatorname{sign}(0) = 0$).

In the next section, we prove bounds on the quantity $(1/\eta_{t-1} - 1/\eta_t)\psi(w_{t+1}) + g_t\cdot(w_t - w_{t+1})$. By Theorem 5 this quantity is maximized when $\operatorname{sign}(g_t) = \pm\operatorname{sign}(g_{1:t-1})$, and so we consider only this case.

B.2 One-dimensional FTRL

In this section we analyze the regret of our FTRL algorithm with the end-goal of proving Lemma 3. We make heavy use of Theorem 5 to allow us to consider only the case $\operatorname{sign}(g_t) = \pm\operatorname{sign}(g_{1:t-1})$. In this setting we may identify the one-dimensional space spanned by $g_t$ and $g_{1:t-1}$ with $\mathbb{R}$. Thus whenever we are operating under the assumption $\operatorname{sign}(g_t) = \pm\operatorname{sign}(g_{1:t-1})$ we will use ordinary multiplication in place of inner products and occasionally assume $g_{1:t} > 0$, as this holds without loss of generality. We feel that this notation and assumption aids intuition in visualizing the following results.

Lemma 7. Suppose $\operatorname{sign}(g_t) = \operatorname{sign}(g_{1:t-1})$. Then

$$\eta_t g_{1:t} - \eta_{t-1} g_{1:t-1} \le \eta_{t-1} g_t. \qquad (5)$$

Suppose instead that $\operatorname{sign}(g_t) = -\operatorname{sign}(g_{1:t-1})$ and also $|g_t| \le L$. Then we still have

$$\eta_t g_{1:t} - \eta_{t-1} g_{1:t-1} \le (1 + 2pL)\,\eta_{t-1}|g_t|. \qquad (6)$$

Proof. First, suppose $\operatorname{sign}(g_t) = \operatorname{sign}(g_{1:t-1})$. Then $\operatorname{sign}(g_{1:t}) = \operatorname{sign}(g_{1:t-1})$. Without loss of generality, assume $g_{1:t} > 0$. Notice that $\eta_t g_{1:t}$ is an increasing function of $g_t$ for $g_t > 0$, because $\eta_t g_{1:t}$ is proportional to either $g_{1:t}$ or $\sqrt{g_{1:t}}$, depending on whether $M_t = M_{t-1}$ or not. Then since $\eta_t \le \eta_{t-1}$ we have

$$\eta_t g_{1:t} - \eta_{t-1} g_{1:t-1} \le \eta_{t-1} g_{1:t} - \eta_{t-1} g_{1:t-1} = \eta_{t-1} g_t,$$

so that (5) holds.

Now suppose $\operatorname{sign}(g_t) = -\operatorname{sign}(g_{1:t-1})$ and $|g_t| \le L$. We consider two cases.

Case 1 ($\eta_t |g_{1:t}| \le \eta_{t-1}|g_{1:t-1}|$): since $\eta_t \le \eta_{t-1}$ and $\operatorname{sign}(g_{1:t-1}) = -\operatorname{sign}(g_t)$, we have

$$\eta_t g_{1:t} - \eta_{t-1}g_{1:t-1} \le \eta_{t-1}\big(|g_{1:t}| - |g_{1:t-1}|\big) \le \eta_{t-1}|g_t|,$$

so that we are done.

Case 2 ($\eta_t |g_{1:t}| \ge \eta_{t-1}|g_{1:t-1}|$): when $|g_t| < |g_{1:t-1}|$ and $\eta_t|g_{1:t}| \ge \eta_{t-1}|g_{1:t-1}|$, the difference $\eta_t|g_{1:t}| - \eta_{t-1}|g_{1:t-1}|$ is a decreasing function of $|g_t|$, because $\eta_{t-1}|g_{1:t-1}|$ is fixed while $|g_{1:t}|$ shrinks as $|g_t|$ grows toward $|g_{1:t-1}|$. Therefore it suffices to consider the case $|g_t| \ge |g_{1:t-1}|$, so that $\operatorname{sign}(g_{1:t}) = -\operatorname{sign}(g_{1:t-1})$ and $|g_{1:t}| \le |g_t|$. Since $|g_{1:t}| \le |g_{1:t-1}|$ in this regime, we have $M_t = M_{t-1}$, so we can expand

$$\eta_t g_{1:t} - \eta_{t-1}g_{1:t-1} = g_t\,\eta_{t-1} + \big(\eta_t g_{1:t} - \eta_{t-1}g_{1:t-1} - g_t\,\eta_{t-1}\big)$$

and bound the parenthesized remainder by $2pL\,\eta_{t-1}|g_t|$, using the identity $\sqrt{X + g_t^2} \le \sqrt{X} + g_t^2/(2\sqrt{X})$ together with $M_{t-1} + \|g\|^2_{1:t-1} \ge \|g_{1:t-1}\|/p$ and $|g_t| \le L$. This yields

$$\eta_t g_{1:t} - \eta_{t-1}g_{1:t-1} \le (1 + 2pL)\,\eta_{t-1}|g_t|. \qquad\square$$

Lemma 8. If $\|w_{T+1}\| \ge \exp\big(\sqrt{pB}/k\big) - 1$, then $\|g_{1:T}\| \ge B$.

Proof. First note that by definition of $M_T$ and $\eta_T$ we have $M_T + \|g\|^2_{1:T} \ge \|g_{1:T}\|/p$, so from some algebra

$$\|w_{T+1}\| + 1 = \exp(\eta_T\|g_{1:T}\|) = \exp\left(\frac{\|g_{1:T}\|}{k\sqrt{M_T + \|g\|^2_{1:T}}}\right) \le \exp\left(\frac{\sqrt{p\,\|g_{1:T}\|}}{k}\right).$$

Taking squares of logs and rearranging now gives $\|g_{1:T}\| \ge \frac{k^2}{p}\log^2(\|w_{T+1}\| + 1)$, which yields the desired inequality. $\square$
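The mechanism behind Lemma 8 (the definition of $M_t$ forces $M_t + \|g\|^2_{1:t} \ge \|g_{1:t}\|/p$, hence $\eta_t\|g_{1:t}\| \le \sqrt{p\,\|g_{1:t}\|}/k$) is easy to probe numerically; this small property test is our own illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
k, p = 2.0, 0.25
M, g_sum, g_sq = 0.0, 0.0, 0.0
for _ in range(1000):
    g = 2.0 * rng.normal()            # arbitrary stream of gradients
    g_sum, g_sq = g_sum + g, g_sq + g * g
    M = max(M, abs(g_sum) / p - g_sq)   # enforces M + g_sq >= |g_sum| / p
    eta = 1.0 / (k * np.sqrt(M + g_sq))
    assert eta * abs(g_sum) <= np.sqrt(p * abs(g_sum)) / k + 1e-9
```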

We have the following immediate corollary:

Corollary 9. Suppose $\operatorname{sign}(g_t) = \pm\operatorname{sign}(g_{1:t-1})$, $|g_t| \le L$, and

$$\|w_t\| \ge \exp\left(\frac{\sqrt{2pL}}{k}\right) - 1.$$

Then $\operatorname{sign}(g_{1:t}) = \operatorname{sign}(g_{1:t-1})$. (Indeed, by Lemma 8 the hypothesis implies $|g_{1:t-1}| \ge 2L > |g_t|$, so adding $g_t$ cannot flip the sign of the sum.)

Now we begin analysis of the sum term in (4).

Lemma 10. Suppose $\operatorname{sign}(g_{1:t}) = \operatorname{sign}(g_{1:t-1})$ and $|g_t| \le L$. Then

$$|w_t - w_{t+1}| \le |g_t|\,\eta_{t-1}\,(\|w_{t+1}\| + 1)(1 + 2pL)\exp\!\left[|g_t|\,\eta_{t-1}(1 + 2pL)\right].$$

Proof. Since $\operatorname{sign}(g_{1:t}) = \operatorname{sign}(g_{1:t-1})$, we have

$$|w_t - w_{t+1}| = \left|\exp(\eta_{t-1}|g_{1:t-1}|) - \exp(\eta_t|g_{1:t}|)\right| = (\|w_{t+1}\| + 1)\left|\exp\!\big(\eta_{t-1}|g_{1:t-1}| - \eta_t|g_{1:t}|\big) - 1\right|,$$

where the last equality uses the definition of $w_{t+1}$ to observe that $\|w_{t+1}\| + 1 = \exp(\eta_t|g_{1:t}|)$. Now we consider two cases, depending on whether $\eta_{t-1}|g_{1:t-1}| < \eta_t|g_{1:t}|$ or not.

Case 1 ($\eta_{t-1}|g_{1:t-1}| < \eta_t|g_{1:t}|$): since $1 - \exp(-x) \le x$, Lemma 7 gives

$$|w_t - w_{t+1}| \le (\|w_{t+1}\| + 1)\big(\eta_t|g_{1:t}| - \eta_{t-1}|g_{1:t-1}|\big) \le (\|w_{t+1}\| + 1)(1 + 2pL)\,\eta_{t-1}|g_t|,$$

so that the lemma holds.

Case 2 ($\eta_{t-1}|g_{1:t-1}| \ge \eta_t|g_{1:t}|$): since $\exp(x) - 1 \le x\exp(x)$, a symmetric application of Lemma 7 gives

$$|w_t - w_{t+1}| \le (\|w_{t+1}\| + 1)\big(\eta_{t-1}|g_{1:t-1}| - \eta_t|g_{1:t}|\big)\exp\!\big(\eta_{t-1}|g_{1:t-1}| - \eta_t|g_{1:t}|\big) \le (\|w_{t+1}\| + 1)(1 + 2pL)\,\eta_{t-1}|g_t|\exp\!\left[(1 + 2pL)\,\eta_{t-1}|g_t|\right],$$

so that the lemma still holds. $\square$

The next lemma is the main workhorse of our regret bounds:

Lemma 11. Suppose $|g_t| \le L$ and either of the following holds:

1. $p \le 1/L$, $k = 2$, and $\|w_t\| \ge 5$; or
2. $k = 2$ and $\|w_t\| \ge 4\exp(2pL)$.

Then

$$\left(\frac{1}{\eta_{t-1}} - \frac{1}{\eta_t}\right)\psi(w_{t+1}) + g_t(w_t - w_{t+1}) \le 0. \qquad (7)$$

Further, inequality (7) holds for any $k$ and sufficiently large $L$ if $\|w_t\| \ge \exp(pL)$.

Proof. By Theorem 5 it suffices to consider the case $\operatorname{sign}(g_t) = \pm\operatorname{sign}(g_{1:t-1})$, so that we may adopt our identification with $\mathbb{R}$ throughout this proof. For $p \le 1/L$ and $k = 2$ we have $5 \ge \exp(\sqrt{2pL}/k)$, and for sufficiently large $L$, $\exp(pL) \ge \exp(\sqrt{2pL}/k)$. Therefore in all cases $\|w_t\| \ge \exp(\sqrt{2pL}/k) - 1$, so that by Corollary 9 and Lemma 10 we have

$$g_t(w_t - w_{t+1}) \le \eta_{t-1}g_t^2\,(\|w_{t+1}\| + 1)(1 + 2pL)\exp\!\left[\eta_{t-1}|g_t|(1 + 2pL)\right]. \qquad (8)$$

First, we prove that (7) is guaranteed if the following holds:

$$(\|w_{t+1}\| + 1)(1 + 2pL)\exp\!\left[\eta_{t-1}|g_t|(1 + 2pL)\right] \le \frac{k^2}{2}\,\frac{\eta_t}{\eta_{t-1}}\,\psi(w_{t+1}). \qquad (9)$$

Using $\psi(w_{t+1}) = (\|w_{t+1}\| + 1)\log(\|w_{t+1}\| + 1) - \|w_{t+1}\| \ge (\|w_{t+1}\| + 1)\big(\log(\|w_{t+1}\| + 1) - 1\big)$ and dividing through by $\|w_{t+1}\| + 1$, the previous line (9) reduces to a condition on $\log(\|w_{t+1}\| + 1)$ alone:

$$(1 + 2pL)\exp\!\left[\eta_{t-1}|g_t|(1 + 2pL)\right] \le \frac{k^2}{2}\,\frac{\eta_t}{\eta_{t-1}}\,\big(\log(\|w_{t+1}\| + 1) - 1\big). \qquad (10)$$

Multiplying (10) by $\eta_{t-1}g_t^2$ we have

$$\eta_{t-1}g_t^2\,(\|w_{t+1}\| + 1)(1 + 2pL)\exp\!\left[\eta_{t-1}|g_t|(1 + 2pL)\right] \le \frac{k^2}{2}\,\eta_t g_t^2\,\psi(w_{t+1}). \qquad (11)$$

Combining (8) and (11), we see that (9) implies

$$g_t(w_t - w_{t+1}) \le \frac{k^2}{2}\,\eta_t g_t^2\,\psi(w_{t+1}).$$

Now we bound $(1/\eta_{t-1} - 1/\eta_t)\psi(w_{t+1})$:

$$\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} = k\left[\sqrt{M_t + \|g\|^2_{1:t}} - \sqrt{M_{t-1} + \|g\|^2_{1:t-1}}\right] \ge k\,\frac{g_t^2 + M_t - M_{t-1}}{2\sqrt{M_t + \|g\|^2_{1:t}}} \ge k\,\frac{g_t^2}{2\sqrt{M_t + \|g\|^2_{1:t}}} = \frac{k^2}{2}\,\eta_t g_t^2.$$

Thus when (9) holds we have

$$\left(\frac{1}{\eta_{t-1}} - \frac{1}{\eta_t}\right)\psi(w_{t+1}) + g_t(w_t - w_{t+1}) \le -\frac{k^2}{2}\,\eta_t g_t^2\,\psi(w_{t+1}) + \frac{k^2}{2}\,\eta_t g_t^2\,\psi(w_{t+1}) \le 0.$$

Therefore our objective is to show that our conditions on $w_t$ imply the condition (9) on $w_{t+1}$. First, we bound $\eta_{t-1}|g_t|$ in terms of $w_t$. Notice that

$$\|w_t\| + 1 = \exp(\eta_{t-1}\|g_{1:t-1}\|) \le \exp\left(\frac{\sqrt{p\,\|g_{1:t-1}\|}}{k}\right), \quad\text{so}\quad \|g_{1:t-1}\| \ge \frac{k^2}{p}\log^2(\|w_t\| + 1);$$

combining this with the definition of $\eta_{t-1}$ and $M_{t-1} + \|g\|^2_{1:t-1} \ge \|g_{1:t-1}\|/p$,

so that we can conclude

$$\eta_{t-1}|g_t| \le \frac{pL}{k^2\log(\|w_t\| + 1)}. \qquad (12)$$

Further, by Lemma 7 we have $\|w_t\| + 1 = \exp(\eta_{t-1}g_{1:t-1}) \ge \exp\!\big(\eta_t g_{1:t} - (1 + 2pL)\eta_{t-1}|g_t|\big)$, and therefore

$$\|w_{t+1}\| + 1 \le (\|w_t\| + 1)\exp\!\left[(1 + 2pL)\,\eta_{t-1}|g_t|\right]. \qquad (13)$$

From (13), we see that (9) is guaranteed if we have

$$(1 + 2pL)\exp\!\left[2(1 + 2pL)\,\eta_{t-1}|g_t|\right] \le \frac{k^2}{2}\,\frac{\eta_t}{\eta_{t-1}}\,\big(\log(\|w_t\| + 1) - 1\big). \qquad (14)$$

If we use our expression (12) and assume $\|w_t\| \ge \exp(pL)$, we see that the right-hand side of (14) grows at least linearly in $pL$ while the left-hand side remains bounded, and so (14) holds for sufficiently large $L$. For $p \le 1/L$, $k = 2$, and $\|w_t\| \ge 5$ we can verify (14) numerically by plugging in the bound (12). For the case $k = 2$, $\|w_t\| \ge 4\exp(2pL)$, we notice that by using (12) we can write (14) entirely in terms of $pL$; graphing both sides numerically as functions of $pL$ then allows us to verify the condition. $\square$

We have one final lemma we need before we can start stating some real regret bounds. This lemma can be viewed as observing that $\psi(w)$ is roughly strongly-convex for $\|w\|$ not much bigger than $D$.

Lemma 12. Suppose $p \le 1/L$, $k = 2$, $\|w_t\| \le D$ and $|g_t| \le L$. Then

$$g_t(w_t - w_{t+1}) \le 16\max\big(D + 1,\ e^{1/2}\big)\,g_t^2\,\eta_{t-1}.$$

Proof. By Theorem 5 it suffices to consider $\operatorname{sign}(g_t) = \pm\operatorname{sign}(g_{1:t-1})$. We show that $|w_t - w_{t+1}| \le 16\max(D + 1, e^{1/2})\,|g_t|\,\eta_{t-1}$, so that the result follows by multiplying by $|g_t|$. From Lemma 7, we have $|\eta_t g_{1:t} - \eta_{t-1}g_{1:t-1}| \le (1 + 2pL)\,\eta_{t-1}|g_t| \le 3\eta_{t-1}|g_t|$. Further, note that $\eta_{t-1}|g_t| \le 1/k = 1/2$. We consider two cases, depending on whether $\operatorname{sign}(g_{1:t}) = \operatorname{sign}(g_{1:t-1})$ or not.

Case 1 ($\operatorname{sign}(g_{1:t}) = \operatorname{sign}(g_{1:t-1})$):

$$|w_t - w_{t+1}| = \left|\exp(\eta_{t-1}|g_{1:t-1}|) - \exp(\eta_t|g_{1:t}|)\right| \le (D + 1)\,3\eta_{t-1}|g_t|\exp\!\big(3\eta_{t-1}|g_t|\big) \le (D + 1)\,3e^{3/2}\,\eta_{t-1}|g_t| \le 16(D + 1)\,\eta_{t-1}|g_t|.$$

Case 2 ($\operatorname{sign}(g_{1:t}) \ne \operatorname{sign}(g_{1:t-1})$): in this case, we must have $|g_{1:t-1}| \le |g_t|$. Let $X = \max\big(\eta_t|g_{1:t}|,\ \eta_{t-1}|g_{1:t-1}|\big)$. Then by the triangle inequality we have

$$|w_t - w_{t+1}| \le \max(\|w_t\|, \|w_{t+1}\|) + \min(\|w_t\|, \|w_{t+1}\|) \le 2\big(\exp(X) - 1\big) \le 2X\exp(X).$$

Since $|\eta_t g_{1:t} - \eta_{t-1}g_{1:t-1}| \le 3\eta_{t-1}|g_t|$ and $|g_{1:t-1}| \le |g_t|$, we have $X \le 3\eta_{t-1}|g_t|$, so that

$$|w_t - w_{t+1}| \le 6\,\eta_{t-1}|g_t|\,\max\big(\|w_t\|,\ \|w_{t+1}\| + 1\big).$$

Finally, in this case we also have $\|w_{t+1}\| + 1 = \exp(\eta_t|g_{1:t}|) \le \exp(\eta_{t-1}|g_t|) \le \exp(1/2)$, so that

$$|w_t - w_{t+1}| \le 6\,\eta_{t-1}|g_t|\,\max\big(\|w_t\|,\ \|w_{t+1}\| + 1\big) \le 16\max\big(D + 1,\ e^{1/2}\big)\,\eta_{t-1}|g_t|. \qquad\square$$

Now we are finally in a position to prove Lemma 3, which we restate below:

Lemma 3. Set $k = 2$. Suppose $\|g_t\| \le L$ for $t < T$, $1/(2L) \le p \le 2/L$, $\|g_T\| \le L_{\max}$ and $L_{\max} \ge L$. Let $W_{\max} = \max_{t\in[1,T]}\|w_t\|$. Then the regret of FTRL with regularizers $\psi_t(w) = \psi(w)/\eta_t$ is:

$$R_T(u) \le \frac{\psi(u)}{\eta_T} + 96\sqrt{M_{T-1} + \|g\|^2_{1:T-1}} + L_{\max}\min\left\{W_{\max},\ 4\exp\Big(\frac{4L_{\max}}{L}\Big),\ \exp(\sqrt{T})\right\}$$
$$\le (2\psi(u) + 96)\sqrt{2L\sum_{t=1}^{T-1}\|g_t\| + L_{\max}^2} + 8L_{\max}\min\left\{\exp\Big(\frac{4L_{\max}}{L}\Big),\ \exp(\sqrt{T})\right\}$$
$$\le 2L_{\max}\big[(\|u\| + 1)\log(\|u\| + 1) + 96\big]\sqrt{2T} + 8L_{\max}\min\left\{e^{4L_{\max}/L},\ e^{\sqrt{T}}\right\}.$$

Proof of Lemma 3. We combine Lemma 11 with Lemma 12: if $\|w_t\| \ge 5$ we have, for all $t < T$,

$$\left(\frac{1}{\eta_{t-1}} - \frac{1}{\eta_t}\right)\psi(w_{t+1}) + g_t(w_t - w_{t+1}) < 0,$$

and if $\|w_t\| \le 5$ we have

$$\left(\frac{1}{\eta_{t-1}} - \frac{1}{\eta_t}\right)\psi(w_{t+1}) + g_t(w_t - w_{t+1}) \le g_t(w_t - w_{t+1}) \le 16\max\big(6, e^{1/2}\big)\,\eta_{t-1}g_t^2 = 96\,\eta_{t-1}g_t^2.$$

Therefore for all $t < T$ we have $(1/\eta_{t-1} - 1/\eta_t)\psi(w_{t+1}) + g_t(w_t - w_{t+1}) \le 96\,\eta_{t-1}g_t^2$, and so

$$R_T(u) \le \frac{\psi(u)}{\eta_T} + \sum_{t=1}^{T-1}96\,\eta_{t-1}g_t^2 + \left(\frac{1}{\eta_{T-1}} - \frac{1}{\eta_T}\right)\psi(w_{T+1}) + g_T(w_T - w_{T+1}).$$

We have $(1/\eta_{T-1} - 1/\eta_T)\psi(w_{T+1}) \le 0$, so that

$$\left(\frac{1}{\eta_{T-1}} - \frac{1}{\eta_T}\right)\psi(w_{T+1}) + g_T(w_T - w_{T+1}) \le L_{\max}W_{\max}.$$

Further, again using Lemma 11, we have $(1/\eta_{T-1} - 1/\eta_T)\psi(w_{T+1}) + g_T(w_T - w_{T+1}) < 0$ for $\|w_T\| \ge 4\exp(2pL_{\max})$, since $k = 2$; with $p \le 2/L$ this threshold is at most $4\exp(4L_{\max}/L)$. Finally, notice that by definition of $\eta_t$ and $L$, we must have $\eta_t\|g_{1:t}\| \le \sqrt{p\|g_{1:t}\|}/k \le \sqrt{T}$, so that $\|w_t\| \le \exp(\sqrt{T})$. Thus

$$\left(\frac{1}{\eta_{T-1}} - \frac{1}{\eta_T}\right)\psi(w_{T+1}) + g_T(w_T - w_{T+1}) \le L_{\max}\min\left\{W_{\max},\ 4\exp\Big(\frac{4L_{\max}}{L}\Big),\ \exp(\sqrt{T})\right\}.$$

Now we make the following classic argument:

$$\sqrt{M_t + \|g\|^2_{1:t}} - \sqrt{M_{t-1} + \|g\|^2_{1:t-1}} \ge \frac{g_t^2 + M_t - M_{t-1}}{2\sqrt{M_t + \|g\|^2_{1:t}}} \ge \frac{g_t^2}{2\sqrt{M_t + \|g\|^2_{1:t}}} = \frac{k}{2}\,\eta_t g_t^2,$$

so that, summing the telescoping left-hand side, we can bound

$$R_T(u) \le \frac{\psi(u)}{\eta_T} + 96\sum_{t=1}^{T-1}\eta_t g_t^2 + \left(\frac{1}{\eta_{T-1}} - \frac{1}{\eta_T}\right)\psi(w_{T+1}) + g_T(w_T - w_{T+1})$$
$$\le \frac{\psi(u)}{\eta_T} + 96\sqrt{M_{T-1} + \|g\|^2_{1:T-1}} + L_{\max}\min\left\{W_{\max},\ 4\exp\Big(\frac{4L_{\max}}{L}\Big),\ \exp(\sqrt{T})\right\}.$$

To show the remaining two lines of the lemma, we prove by induction that $M_t + \|g\|^2_{1:t} \le 2L\sum_{t'=1}^{t}\|g_{t'}\|$ for all $t < T$. The statement is clearly true for $t = 1$. Suppose it holds for some $t$. Then notice that $\|g_{1:t+1}\| \le \|g_{t+1}\| + \|g_{1:t}\|$. So we have

$$M_{t+1} + \|g\|^2_{1:t+1} = \max\left(M_t + \|g\|^2_{1:t+1},\ \frac{\|g_{1:t+1}\|}{p}\right) \le \max\left(M_t + \|g\|^2_{1:t} + L\|g_{t+1}\|,\ 2L\|g_{1:t+1}\|\right) \le 2L\sum_{t'=1}^{t+1}\|g_{t'}\|.$$

Finally, we observe that

$$M_T + \|g\|^2_{1:T} \le \max\left(M_{T-1} + \|g\|^2_{1:T-1} + \|g_T\|^2,\ \frac{\|g_{1:T}\|}{p}\right) \le L_{\max}^2 + 2L\sum_{t=1}^{T-1}\|g_t\|,$$

and the last two lines of the lemma follow immediately. $\square$

C Additional Experimental Details

C.1 Hyperparameter Optimization

For the linear classification tasks, we optimized hyperparameters in a two-step process. First, we tested every power of 10 from $10^{-5}$ to $10^{2}$. Second, if $\lambda$ was the best hyperparameter setting in step 1, we additionally tested $\beta\lambda$ for $\beta \in \{0.2, 0.4, 0.8, 2.0, 4.0, 6.0, 8.0\}$.

For the neural network models, we optimized Adam's and AdaGrad's learning rates by testing every power of 10 from $10^{-5}$ to $10^{0}$. For stochastic gradient descent, we used the exponentially decaying learning rate schedule specified in TensorFlow's MNIST and CIFAR-10 example code.

C.2 Coordinate-wise updates

We proved all our results in arbitrarily many dimensions, leading to a dimension-independent regret bound. However, it is also possible to achieve dimension-dependent bounds by running an independent version of our algorithm on each coordinate. Formally, for OLO we have

$$R_T(u) = \sum_{t=1}^T g_t\cdot(w_t - u) = \sum_{i=1}^d\sum_{t=1}^T g_{t,i}(w_{t,i} - u_i) = \sum_{i=1}^d R_T^i(u_i),$$

where $R_T^i$ is the regret of a one-dimensional instance of the algorithm. This reduction can yield substantially better regret bounds when the gradients $g_t$ are known to be sparse, but can be much worse when they are not; a sketch of the wrapper appears below.

We use this coordinate-wise update strategy in our linear classification experiments for RescaledExp. We also considered coordinate-wise and non-coordinate-wise updates for the other algorithms, taking the better-performing of the two. For all algorithms in the linear classification experiments, we found that the difference between coordinate-wise and non-coordinate-wise updates was not very striking. However, for the neural network experiments we found RescaledExp performed extremely poorly when using coordinate-wise updates, and performed extremely well with non-coordinate-wise updates. We hypothesize that this is due to a combination of the non-convexity of the model and frequent resets occurring at different times for each coordinate.
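A sketch (ours) of the coordinate-wise wrapper described above, which runs one one-dimensional learner per coordinate:

```python
import numpy as np

class CoordinateWise:
    """Run an independent 1-D copy of a learner per coordinate, so that
    R_T(u) = sum_i R_T^i(u_i)."""

    def __init__(self, dim, make_learner):
        self.learners = [make_learner(1) for _ in range(dim)]

    def predict(self):
        return np.array([l.predict()[0] for l in self.learners])

    def update(self, g):
        for learner, g_i in zip(self.learners, g):
            learner.update(np.array([g_i]))

# e.g. CoordinateWise(dim=10, make_learner=RescaledExp), reusing the sketch above.
```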

[Table 1 lists the average normalized loss for AdaGrad, RescaledExp, AdaDelta, Scale-Invariant, Adam, and PiSTOL under each algorithm's best hyperparameter setting.]

Table 1: Average normalized loss, using the best hyperparameter setting for each algorithm.

C.3 Re-centering RescaledExp

For the non-convex neural network tasks we used a variant of RescaledExp in which we re-center our FTRL algorithm at the beginning of each epoch. Formally, the pseudo-code is provided below:

Algorithm 2 Re-centered RescaledExp
  Initialize: $k \leftarrow 2$, $M_0 \leftarrow 0$, $w_1 \leftarrow 0$, $t_1 \leftarrow 1$, $\bar{w} \leftarrow 0$
  for $t = 1$ to $T$ do
    Play $w_t$, receive subgradient $g_t \in \partial\ell_t(w_t)$.
    if $t = 1$ then
      $L_1 \leftarrow 2\|g_1\|$
      $p \leftarrow 1/L_1$
    end if
    $M_t \leftarrow \max\big(M_{t-1}, \|g_{t_1:t}\|/p - \|g\|^2_{t_1:t}\big)$
    $\eta_t \leftarrow 1/\big(k\sqrt{M_t + \|g\|^2_{t_1:t}}\big)$
    $w_{t+1} \leftarrow \bar{w} + \operatorname{argmin}_w\left[\dfrac{\psi(w)}{\eta_t} + g_{t_1:t}\cdot w\right] = \bar{w} - \dfrac{g_{t_1:t}}{\|g_{t_1:t}\|}\left[\exp(\eta_t\|g_{t_1:t}\|) - 1\right]$
    if $\|g_t\| > L_t$ then
      $L_{t+1} \leftarrow 2\|g_t\|$
      $p \leftarrow 1/L_{t+1}$
      $t_1 \leftarrow t + 1$
      $M_t \leftarrow 0$
      $\bar{w} \leftarrow w_t$
      $w_{t+1} \leftarrow \bar{w}$
    else
      $L_{t+1} \leftarrow L_t$
    end if
  end for

So long as $\|\bar{w} - u\| \le \|u\|$, this algorithm maintains the same regret bound as the non-re-centered version of RescaledExp. While it is intuitively reasonable to expect this to occur in a stochastic setting, an adversary can easily subvert this algorithm.

C.4 Aggregating Studies

It is difficult to interpret the results of a study such as our linear classification experiments (see Section 4) in which no particular algorithm is always the winner for every dataset. In particular, consider the case of an analyst who wishes to run one of these algorithms on some new dataset, and doesn't have either the resources or the inclination to implement and tune each algorithm. Which should she choose?

We suggest the following heuristic: pick the algorithm with the lowest loss averaged across datasets. This heuristic is problematic because datasets on which all algorithms do very poorly will dominate the cross-dataset average. In order to address this issue and compare losses across datasets properly, we compute a normalized loss for each algorithm and dataset. The normalized loss for an algorithm on a dataset is given by taking the loss experienced by the algorithm with its best hyperparameter setting on that dataset, divided by the lowest loss observed by any algorithm and hyperparameter setting on that dataset. Thus a normalized loss of 1 on a dataset indicates that an algorithm outperformed all other algorithms on the dataset (at least for its best hyperparameter setting). We then average the normalized loss for each algorithm across datasets to obtain the scores for each algorithm (see Table 1). These data indicate that while AdaGrad has a slight edge after tuning, RescaledExp and AdaDelta do nearly equivalently well (4% and 6% worse performance, respectively). Therefore we suggest that if our intrepid analyst is willing to perform some hyperparameter tuning, then AdaGrad may be slightly better, but her choice doesn't matter too much. On the other hand, using RescaledExp will allow her to skip any tuning step without compromising performance.
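The normalization heuristic of Section C.4 is straightforward to implement; this sketch (ours) computes Table 1-style scores from a table of best-hyperparameter losses:

```python
import numpy as np

def normalized_scores(losses):
    """losses[alg][dataset] -> best-hyperparameter loss for that pair.

    Each algorithm's loss on a dataset is divided by the best loss any
    algorithm achieved there; scores are then averaged across datasets
    (1.0 means best on every dataset)."""
    algs = list(losses)
    datasets = list(next(iter(losses.values())))
    best = {d: min(losses[a][d] for a in algs) for d in datasets}
    return {a: float(np.mean([losses[a][d] / best[d] for d in datasets])) for a in algs}
```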


More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

Minimizing Adaptive Regret with One Gradient per Iteration

Minimizing Adaptive Regret with One Gradient per Iteration Minimizing Adaptive Regret with One Gradient per Iteration Guanghui Wang, Dakuan Zhao, Lijun Zhang National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China wanggh@lamda.nju.edu.cn,

More information

Black-Box Reductions for Parameter-free Online Learning in Banach Spaces

Black-Box Reductions for Parameter-free Online Learning in Banach Spaces Proceedings of Machine Learning Research vol 75:1 37, 2018 31st Annual Conference on Learning Theory Black-Box Reductions for Parameter-free Online Learning in Banach Spaces Ashok Cutkosky Stanford University

More information

Foundations of Machine Learning On-Line Learning. Mehryar Mohri Courant Institute and Google Research

Foundations of Machine Learning On-Line Learning. Mehryar Mohri Courant Institute and Google Research Foundations of Machine Learning On-Line Learning Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Motivation PAC learning: distribution fixed over time (training and test). IID assumption.

More information

The Online Approach to Machine Learning

The Online Approach to Machine Learning The Online Approach to Machine Learning Nicolò Cesa-Bianchi Università degli Studi di Milano N. Cesa-Bianchi (UNIMI) Online Approach to ML 1 / 53 Summary 1 My beautiful regret 2 A supposedly fun game I

More information

Optimal and Adaptive Online Learning

Optimal and Adaptive Online Learning Optimal and Adaptive Online Learning Haipeng Luo Advisor: Robert Schapire Computer Science Department Princeton University Examples of Online Learning (a) Spam detection 2 / 34 Examples of Online Learning

More information

Deep Neural Networks (3) Computational Graphs, Learning Algorithms, Initialisation

Deep Neural Networks (3) Computational Graphs, Learning Algorithms, Initialisation Deep Neural Networks (3) Computational Graphs, Learning Algorithms, Initialisation Steve Renals Machine Learning Practical MLP Lecture 5 16 October 2018 MLP Lecture 5 / 16 October 2018 Deep Neural Networks

More information

Lecture 16: Perceptron and Exponential Weights Algorithm

Lecture 16: Perceptron and Exponential Weights Algorithm EECS 598-005: Theoretical Foundations of Machine Learning Fall 2015 Lecture 16: Perceptron and Exponential Weights Algorithm Lecturer: Jacob Abernethy Scribes: Yue Wang, Editors: Weiqing Yu and Andrew

More information

Online and Stochastic Learning with a Human Cognitive Bias

Online and Stochastic Learning with a Human Cognitive Bias Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence Online and Stochastic Learning with a Human Cognitive Bias Hidekazu Oiwa The University of Tokyo and JSPS Research Fellow 7-3-1

More information

Lecture: Adaptive Filtering

Lecture: Adaptive Filtering ECE 830 Spring 2013 Statistical Signal Processing instructors: K. Jamieson and R. Nowak Lecture: Adaptive Filtering Adaptive filters are commonly used for online filtering of signals. The goal is to estimate

More information

Accelerating Stochastic Optimization

Accelerating Stochastic Optimization Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz

More information

The FTRL Algorithm with Strongly Convex Regularizers

The FTRL Algorithm with Strongly Convex Regularizers CSE599s, Spring 202, Online Learning Lecture 8-04/9/202 The FTRL Algorithm with Strongly Convex Regularizers Lecturer: Brandan McMahan Scribe: Tamara Bonaci Introduction In the last lecture, we talked

More information

OLSO. Online Learning and Stochastic Optimization. Yoram Singer August 10, Google Research

OLSO. Online Learning and Stochastic Optimization. Yoram Singer August 10, Google Research OLSO Online Learning and Stochastic Optimization Yoram Singer August 10, 2016 Google Research References Introduction to Online Convex Optimization, Elad Hazan, Princeton University Online Learning and

More information

The Interplay Between Stability and Regret in Online Learning

The Interplay Between Stability and Regret in Online Learning The Interplay Between Stability and Regret in Online Learning Ankan Saha Department of Computer Science University of Chicago ankans@cs.uchicago.edu Prateek Jain Microsoft Research India prajain@microsoft.com

More information

Day 3 Lecture 3. Optimizing deep networks

Day 3 Lecture 3. Optimizing deep networks Day 3 Lecture 3 Optimizing deep networks Convex optimization A function is convex if for all α [0,1]: f(x) Tangent line Examples Quadratics 2-norms Properties Local minimum is global minimum x Gradient

More information

Introduction to Machine Learning (67577) Lecture 7

Introduction to Machine Learning (67577) Lecture 7 Introduction to Machine Learning (67577) Lecture 7 Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem Solving Convex Problems using SGD and RLM Shai Shalev-Shwartz (Hebrew

More information

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels

Need for Deep Networks Perceptron. Can only model linear functions. Kernel Machines. Non-linearity provided by kernels Need for Deep Networks Perceptron Can only model linear functions Kernel Machines Non-linearity provided by kernels Need to design appropriate kernels (possibly selecting from a set, i.e. kernel learning)

More information

Online Learning: Random Averages, Combinatorial Parameters, and Learnability

Online Learning: Random Averages, Combinatorial Parameters, and Learnability Online Learning: Random Averages, Combinatorial Parameters, and Learnability Alexander Rakhlin Department of Statistics University of Pennsylvania Karthik Sridharan Toyota Technological Institute at Chicago

More information

Importance Reweighting Using Adversarial-Collaborative Training

Importance Reweighting Using Adversarial-Collaborative Training Importance Reweighting Using Adversarial-Collaborative Training Yifan Wu yw4@andrew.cmu.edu Tianshu Ren tren@andrew.cmu.edu Lidan Mu lmu@andrew.cmu.edu Abstract We consider the problem of reweighting a

More information

Online Learning and Sequential Decision Making

Online Learning and Sequential Decision Making Online Learning and Sequential Decision Making Emilie Kaufmann CNRS & CRIStAL, Inria SequeL, emilie.kaufmann@univ-lille.fr Research School, ENS Lyon, Novembre 12-13th 2018 Emilie Kaufmann Online Learning

More information

Deep Feedforward Networks

Deep Feedforward Networks Deep Feedforward Networks Liu Yang March 30, 2017 Liu Yang Short title March 30, 2017 1 / 24 Overview 1 Background A general introduction Example 2 Gradient based learning Cost functions Output Units 3

More information

Convex Repeated Games and Fenchel Duality

Convex Repeated Games and Fenchel Duality Convex Repeated Games and Fenchel Duality Shai Shalev-Shwartz 1 and Yoram Singer 1,2 1 School of Computer Sci. & Eng., he Hebrew University, Jerusalem 91904, Israel 2 Google Inc. 1600 Amphitheater Parkway,

More information

A survey: The convex optimization approach to regret minimization

A survey: The convex optimization approach to regret minimization A survey: The convex optimization approach to regret minimization Elad Hazan September 10, 2009 WORKING DRAFT Abstract A well studied and general setting for prediction and decision making is regret minimization

More information

Support Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017

Support Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017 Support Vector Machines: Training with Stochastic Gradient Descent Machine Learning Fall 2017 1 Support vector machines Training by maximizing margin The SVM objective Solving the SVM optimization problem

More information

A Low Complexity Algorithm with O( T ) Regret and Finite Constraint Violations for Online Convex Optimization with Long Term Constraints

A Low Complexity Algorithm with O( T ) Regret and Finite Constraint Violations for Online Convex Optimization with Long Term Constraints A Low Complexity Algorithm with O( T ) Regret and Finite Constraint Violations for Online Convex Optimization with Long Term Constraints Hao Yu and Michael J. Neely Department of Electrical Engineering

More information

Pairwise Away Steps for the Frank-Wolfe Algorithm

Pairwise Away Steps for the Frank-Wolfe Algorithm Pairwise Away Steps for the Frank-Wolfe Algorithm Héctor Allende Department of Informatics Universidad Federico Santa María, Chile hallende@inf.utfsm.cl Ricardo Ñanculef Department of Informatics Universidad

More information

CS261: A Second Course in Algorithms Lecture #11: Online Learning and the Multiplicative Weights Algorithm

CS261: A Second Course in Algorithms Lecture #11: Online Learning and the Multiplicative Weights Algorithm CS61: A Second Course in Algorithms Lecture #11: Online Learning and the Multiplicative Weights Algorithm Tim Roughgarden February 9, 016 1 Online Algorithms This lecture begins the third module of the

More information

The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees

The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees The Power of Selective Memory: Self-Bounded Learning of Prediction Suffix Trees Ofer Dekel Shai Shalev-Shwartz Yoram Singer School of Computer Science & Engineering The Hebrew University, Jerusalem 91904,

More information

Online Sparse Passive Aggressive Learning with Kernels

Online Sparse Passive Aggressive Learning with Kernels Online Sparse Passive Aggressive Learning with Kernels Jing Lu Peilin Zhao Steven C.H. Hoi Abstract Conventional online kernel methods often yield an unbounded large number of support vectors, making them

More information

EASINESS IN BANDITS. Gergely Neu. Pompeu Fabra University

EASINESS IN BANDITS. Gergely Neu. Pompeu Fabra University EASINESS IN BANDITS Gergely Neu Pompeu Fabra University EASINESS IN BANDITS Gergely Neu Pompeu Fabra University THE BANDIT PROBLEM Play for T rounds attempting to maximize rewards THE BANDIT PROBLEM Play

More information

Summary and discussion of: Dropout Training as Adaptive Regularization

Summary and discussion of: Dropout Training as Adaptive Regularization Summary and discussion of: Dropout Training as Adaptive Regularization Statistics Journal Club, 36-825 Kirstin Early and Calvin Murdock November 21, 2014 1 Introduction Multi-layered (i.e. deep) artificial

More information

Composite Objective Mirror Descent

Composite Objective Mirror Descent Composite Objective Mirror Descent John C. Duchi 1,3 Shai Shalev-Shwartz 2 Yoram Singer 3 Ambuj Tewari 4 1 University of California, Berkeley 2 Hebrew University of Jerusalem, Israel 3 Google Research

More information

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6

Machine Learning for Large-Scale Data Analysis and Decision Making A. Neural Networks Week #6 Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Neural Networks Week #6 Today Neural Networks A. Modeling B. Fitting C. Deep neural networks Today s material is (adapted)

More information

SADAGRAD: Strongly Adaptive Stochastic Gradient Methods

SADAGRAD: Strongly Adaptive Stochastic Gradient Methods Zaiyi Chen * Yi Xu * Enhong Chen Tianbao Yang Abstract Although the convergence rates of existing variants of ADAGRAD have a better dependence on the number of iterations under the strong convexity condition,

More information

Optimal Teaching for Online Perceptrons

Optimal Teaching for Online Perceptrons Optimal Teaching for Online Perceptrons Xuezhou Zhang Hrag Gorune Ohannessian Ayon Sen Scott Alfeld Xiaojin Zhu Department of Computer Sciences, University of Wisconsin Madison {zhangxz1123, gorune, ayonsn,

More information

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding

Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Large-Scale Feature Learning with Spike-and-Slab Sparse Coding Ian J. Goodfellow, Aaron Courville, Yoshua Bengio ICML 2012 Presented by Xin Yuan January 17, 2013 1 Outline Contributions Spike-and-Slab

More information

CSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent

CSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent CSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent April 27, 2018 1 / 32 Outline 1) Moment and Nesterov s accelerated gradient descent 2) AdaGrad and RMSProp 4) Adam 5) Stochastic

More information

15-859E: Advanced Algorithms CMU, Spring 2015 Lecture #16: Gradient Descent February 18, 2015

15-859E: Advanced Algorithms CMU, Spring 2015 Lecture #16: Gradient Descent February 18, 2015 5-859E: Advanced Algorithms CMU, Spring 205 Lecture #6: Gradient Descent February 8, 205 Lecturer: Anupam Gupta Scribe: Guru Guruganesh In this lecture, we will study the gradient descent algorithm and

More information

The Forgetron: A Kernel-Based Perceptron on a Fixed Budget

The Forgetron: A Kernel-Based Perceptron on a Fixed Budget The Forgetron: A Kernel-Based Perceptron on a Fixed Budget Ofer Dekel Shai Shalev-Shwartz Yoram Singer School of Computer Science & Engineering The Hebrew University, Jerusalem 91904, Israel {oferd,shais,singer}@cs.huji.ac.il

More information

Optimizing CNNs. Timothy Dozat Stanford. Abstract. 1. Introduction. 2. Background Momentum

Optimizing CNNs. Timothy Dozat Stanford. Abstract. 1. Introduction. 2. Background Momentum Optimizing CNNs Timothy Dozat Stanford tdozat@stanford.edu Abstract This work aims to explore the performance of a popular class of related optimization algorithms in the context of convolutional neural

More information

Active Learning: Disagreement Coefficient

Active Learning: Disagreement Coefficient Advanced Course in Machine Learning Spring 2010 Active Learning: Disagreement Coefficient Handouts are jointly prepared by Shie Mannor and Shai Shalev-Shwartz In previous lectures we saw examples in which

More information

The No-Regret Framework for Online Learning

The No-Regret Framework for Online Learning The No-Regret Framework for Online Learning A Tutorial Introduction Nahum Shimkin Technion Israel Institute of Technology Haifa, Israel Stochastic Processes in Engineering IIT Mumbai, March 2013 N. Shimkin,

More information

Adaptive Sampling Under Low Noise Conditions 1

Adaptive Sampling Under Low Noise Conditions 1 Manuscrit auteur, publié dans "41èmes Journées de Statistique, SFdS, Bordeaux (2009)" Adaptive Sampling Under Low Noise Conditions 1 Nicolò Cesa-Bianchi Dipartimento di Scienze dell Informazione Università

More information

Robust Optimization for Non-Convex Objectives

Robust Optimization for Non-Convex Objectives Robust Optimization for Non-Convex Objectives Robert Chen Computer Science Harvard University Brendan Lucier Microsoft Research New England Yaron Singer Computer Science Harvard University Vasilis Syrgkanis

More information

Efficient and Principled Online Classification Algorithms for Lifelon

Efficient and Principled Online Classification Algorithms for Lifelon Efficient and Principled Online Classification Algorithms for Lifelong Learning Toyota Technological Institute at Chicago Chicago, IL USA Talk @ Lifelong Learning for Mobile Robotics Applications Workshop,

More information

Deep Learning & Artificial Intelligence WS 2018/2019

Deep Learning & Artificial Intelligence WS 2018/2019 Deep Learning & Artificial Intelligence WS 2018/2019 Linear Regression Model Model Error Function: Squared Error Has no special meaning except it makes gradients look nicer Prediction Ground truth / target

More information

Introduction to Machine Learning (67577)

Introduction to Machine Learning (67577) Introduction to Machine Learning (67577) Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem Deep Learning Shai Shalev-Shwartz (Hebrew U) IML Deep Learning Neural Networks

More information