Notes on online convex optimization

Karl Stratos

Online convex optimization (OCO) is a principled framework for online learning. (These notes are a bird's-eye view of the incredible tutorial by Shai Shalev-Shwartz (2011); for full details, see the original tutorial.)

OnlineConvexOptimization
Input: convex set $S$, number of steps $T$
For $t = 1, 2, \ldots, T$:
    Select $w_t \in S$.
    Receive a convex loss $f_t : S \to \mathbb{R}$ chosen adversarially.
    Suffer loss $f_t(w_t)$.

Each hypothesis is a vector in some convex set $S$. The loss function $f_t : S \to \mathbb{R}$ is convex and defined for each time step individually. Our goal is to have small regret with respect to a hypothesis space $U$, namely

$\mathrm{Regret}_T(U) := \max_{u \in U} \mathrm{Regret}_T(u)$ where $\mathrm{Regret}_T(u) := \sum_{t=1}^T f_t(w_t) - \sum_{t=1}^T f_t(u)$

1 Unregularized aggregate loss minimization

At time $t$, we have observed losses $f_1 \ldots f_{t-1}$, so a natural choice of $w_t$ is one that minimizes the sum of all past losses. This is known as Follow-the-Leader (FTL):

$w_t := \arg\min_{w \in S} \sum_{i=1}^{t-1} f_i(w)$  (1)

Lemma 1.1. If we use Eq. (1) in OCO, we have

$\mathrm{Regret}_T(S) \leq \sum_{t=1}^T \left( f_t(w_t) - f_t(w_{t+1}) \right)$
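For illustration, here is a minimal sketch of the OCO loop with the FTL rule of Eq. (1), assuming squared losses $f_t(w) = \frac{1}{2}\|w - x_t\|_2^2$ on $S = \mathbb{R}^d$, for which the FTL minimizer is simply the mean of the points seen so far. The function name and the loss choice are only for this example.

import numpy as np

def ftl_squared_loss(points):
    """Run FTL (Eq. (1)) on squared losses f_t(w) = 0.5 * ||w - x_t||^2."""
    d = len(points[0])
    past = []                                          # x_1, ..., x_{t-1}
    total_loss = 0.0
    for x_t in points:
        # FTL: w_t minimizes the sum of past losses; for squared losses this is the mean.
        w_t = np.mean(past, axis=0) if past else np.zeros(d)
        total_loss += 0.5 * np.sum((w_t - x_t) ** 2)   # suffer f_t(w_t)
        past.append(np.asarray(x_t, dtype=float))      # f_t is revealed
    return total_loss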

2 Regularized aggregate loss minimization

Lemma 1.1 suggests a need for containing $f_t(w_t) - f_t(w_{t+1})$. If we assume $f_t$ is $L_t$-Lipschitz over $S$ with respect to some norm $\|\cdot\|$, we have

$f_t(w_t) - f_t(w_{t+1}) \leq L_t \|w_t - w_{t+1}\|$

which in turn suggests a need for containing $\|w_t - w_{t+1}\|$. If the objective in Eq. (1)

$F_t(w) := \sum_{i=1}^{t-1} f_i(w)$

happens to be $\sigma$-strongly-convex, $\|w_t - w_{t+1}\|$ cannot be arbitrarily large: by the definition of $w_t$ and $w_{t+1}$ and strong convexity,

$F_t(w_{t+1}) \geq F_t(w_t) + \frac{\sigma}{2} \|w_t - w_{t+1}\|^2$
$F_{t+1}(w_t) \geq F_{t+1}(w_{t+1}) + \frac{\sigma}{2} \|w_t - w_{t+1}\|^2$

Adding these two inequalities, we get $\sigma \|w_t - w_{t+1}\|^2 \leq f_t(w_t) - f_t(w_{t+1}) \leq L_t \|w_t - w_{t+1}\|$, hence

$\|w_t - w_{t+1}\| \leq \frac{L_t}{\sigma}$

We can always endow $\sigma$-strong-convexity on $F_t$ by adding a $\sigma$-strongly-convex regularizer $R : S \to \mathbb{R}$. This is known as Follow-the-Regularized-Leader (FoReL):

$w_t := \arg\min_{w \in S} R(w) + \sum_{i=1}^{t-1} f_i(w)$  (2)

By treating $R$ as the (convex) loss at time $t = 0$, we get the following corollary of Lemma 1.1.

Corollary 2.1. If we use Eq. (2) in OCO, for all $u \in S$ we have

$\mathrm{Regret}_T(u) \leq R(u) - \min_{v \in S} R(v) + \sum_{t=1}^T \left( f_t(w_t) - f_t(w_{t+1}) \right)$

Theorem 2.2. Let $f_t : S \to \mathbb{R}$ be convex loss functions that are $L_t$-Lipschitz over convex $S$ with respect to $\|\cdot\|$. Let $L \in \mathbb{R}$ be a constant such that $L^2 \geq (1/T) \sum_{t=1}^T L_t^2$, and let $R : S \to \mathbb{R}$ be a $\sigma$-strongly-convex regularizer. Then the regret of FoReL with respect to $u \in S$ is bounded above as:

$\mathrm{Regret}_T(u) \leq R(u) - \min_{v \in S} R(v) + \frac{T L^2}{\sigma}$

3 Linearization of convex losses

Theorem 2.2 assumes an oracle that solves Eq. (2), so it's not very useful for deriving concrete algorithms. But a technique known as linearization of convex losses greatly simplifies this task. Since $S$ is a convex set and $f_t$ is convex, at each round of OCO we can select $z_t \in \partial f_t(w_t)$ so that

$f_t(w_t) - f_t(w_{t+1}) \leq \langle z_t, w_t \rangle - \langle z_t, w_{t+1} \rangle$  (3)

Thus given a general convex loss $f_t$, we can pretend that it's a linear loss $g_t(u) := \langle z_t, u \rangle$ where $z_t$ is a sub-gradient of $f_t$ at $w_t$. In light of Corollary 2.1 and Eq. (3), running FoReL on these linearized losses:

$w_t := \arg\min_{w \in S} R(w) + \left\langle w, \sum_{i=1}^{t-1} z_i \right\rangle$  (4)

enjoys the same regret bound as in Theorem 2.2.
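As a small illustration of Eq. (4), the sketch below runs FoReL on linearized losses with the $\ell_2$ regularizer $R(w) = \frac{1}{2\eta}\|w\|_2^2$ and $S = \mathbb{R}^d$, in which case the minimizer of Eq. (4) has the closed form $w_t = -\eta \sum_{i<t} z_i$. The callable subgrad is a hypothetical user-supplied oracle returning a subgradient of the round-$t$ loss; the function name is only for this example.

import numpy as np

def forel_linearized(subgrad, d, eta, T):
    """FoReL on linearized losses (Eq. (4)) with R(w) = ||w||_2^2 / (2*eta) and S = R^d.

    subgrad(t, w) should return a subgradient z_t of the round-t loss f_t at w.
    """
    z_sum = np.zeros(d)                   # running sum of past subgradients
    iterates = []
    for t in range(1, T + 1):
        w_t = -eta * z_sum                # closed-form solution of Eq. (4)
        iterates.append(w_t)
        z_sum = z_sum + subgrad(t, w_t)   # f_t is revealed; take a subgradient at w_t
    return iterates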

3.1 Online mirror descent

Eq. (4) can be additionally analyzed in a dual framework known as online mirror descent (OMD). OMD frames Eq. (4) as two separate steps: starting with $\theta_1 := 0$,

$w_t = g(\theta_t)$
$\theta_{t+1} = \theta_t - z_t$

where $g(\theta) := \arg\max_{w \in S} \langle w, \theta \rangle - R(w)$ is known as the link function. The particular form of the link function comes from the convex conjugate of $R$ ($R$ is assumed to be closed and convex):

$R^*(\theta) := \max_{w \in S} \langle w, \theta \rangle - R(w)$

A property of $R^*$ is that if $z \in \partial R^*(\theta)$, then $R^*(\theta) = \langle z, \theta \rangle - R(z)$. Thus $g(\theta_t) \in \partial R^*(\theta_t)$. This framework can be used to show that OMD achieves

$\mathrm{Regret}_T(u) \leq R(u) - \min_{v \in S} R(v) + \sum_{t=1}^T D_{R^*}\!\left( -\sum_{i=1}^{t} z_i \,\Big\|\, -\sum_{i=1}^{t-1} z_i \right)$  (5)

where $D_{R^*}(u \| v)$ is the Bregman divergence between $u$ and $v$ under $R^*$. If $R$ is $(1/\eta)$-strongly-convex with respect to $\|\cdot\|$, then $R^*$ is $\eta$-strongly-smooth with respect to the dual norm $\|\cdot\|_*$: in this case,

$\mathrm{Regret}_T(u) \leq R(u) - \min_{v \in S} R(v) + \frac{\eta}{2} \sum_{t=1}^T \|z_t\|_*^2$  (6)

3.2 Example algorithms

We can now crank out algorithms under the OMD framework. All these algorithms enjoy the bound in Theorem 2.2 (or Eq. (6)).

Online gradient descent (OGD): Assumes an unconstrained domain $S = \mathbb{R}^d$ and an $\ell_2$ regularizer $R(w) = \frac{1}{2\eta} \|w\|_2^2$. We have $g(\theta) = \eta \theta$ and

$w_{t+1} = w_t - \eta z_t$  (7)

Online gradient descent with lazy projections (OGDLP): Assumes a general convex set $S$ and an $\ell_2$ regularizer $R(w) = \frac{1}{2\eta} \|w\|_2^2$. Note that

$\arg\max_{w \in S} \langle w, \theta \rangle - \frac{1}{2\eta} \|w\|_2^2 = \arg\min_{w \in S} \|w - \eta \theta\|_2^2$  (8)

Thus the link function $g(\theta)$ projects $\eta \theta$ onto $S$.

Unnormalized exponentiated gradient descent (UEG): Assumes an unconstrained domain $S = \mathbb{R}^d$ and a shifted entropy regularizer $R(w) = \frac{1}{\eta} \sum_i w_i (\log w_i - \log \lambda)$ where $\lambda > 0$. We have $g_i(\theta) = \lambda \exp(\eta \theta_i)$, thus $w_1 = (\lambda \ldots \lambda)$ and for $t \geq 1$:

$[w_{t+1}]_i = [w_t]_i \exp(-\eta [z_t]_i)$  (9)

Normalized exponentiated gradient descent (NEG): Assumes the probability simplex $S = \{w \in \mathbb{R}^d : w \geq 0, \sum_i w_i = 1\}$ and an entropy regularizer $R(w) = \frac{1}{\eta} \sum_i w_i \log w_i$. We have $g_i(\theta) = \frac{\exp(\eta \theta_i)}{\sum_j \exp(\eta \theta_j)}$, thus $w_1 = (1/d \ldots 1/d)$ and for $t \geq 1$:

$[w_{t+1}]_i = \frac{[w_t]_i \exp(-\eta [z_t]_i)}{\sum_j [w_t]_j \exp(-\eta [z_t]_j)}$  (10)
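The updates in Eq. (7) and Eq. (10) are simple enough to state directly in code; the sketch below implements the OGD and NEG steps as pure functions of the current iterate and a subgradient $z_t$ (the function names and example values are illustrative only).

import numpy as np

def ogd_step(w, z, eta):
    """Online gradient descent update, Eq. (7): S = R^d, R(w) = ||w||_2^2 / (2*eta)."""
    return w - eta * z

def neg_step(w, z, eta):
    """Normalized exponentiated gradient update, Eq. (10): S = probability simplex."""
    u = w * np.exp(-eta * z)
    return u / u.sum()

# Example: start NEG at the uniform distribution w_1 = (1/d, ..., 1/d).
d, eta = 4, 0.1
w = np.full(d, 1.0 / d)
z = np.array([0.2, 0.0, 1.0, 0.5])   # a hypothetical subgradient for this round
w = neg_step(w, z, eta)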

4 Applications to classification problems

The central step in applying OCO to a classification problem is finding the right convex surrogate of the problem.

4.1 Perceptron

At each round, we're given a point $x_t \in \mathbb{R}^d$. We predict $p_t \in \{-1, +1\}$ and receive the true class $y_t \in \{-1, +1\}$. The (non-convex) loss is given by

$l(p_t, y_t) := \begin{cases} 1 & \text{if } p_t \neq y_t \\ 0 & \text{if } p_t = y_t \end{cases}$

Note that the cumulative loss $M := \sum_{t=1}^T l(p_t, y_t)$ is the number of mistakes.

Convex surrogate: We maintain a vector $w_t \in \mathbb{R}^d$ that defines $p_t := \mathrm{sign} \langle w_t, x_t \rangle$. We use a hinge loss

$f_t(w_t) := \max(0, 1 - y_t \langle w_t, x_t \rangle)$

which by this particular construction is convex and upper-bounds the original loss $l(p_t, y_t)$. Using a sub-gradient $z_t \in \partial f_t(w_t)$, where $z_t = -y_t x_t$ if $y_t \langle w_t, x_t \rangle \leq 1$ and $z_t = 0$ otherwise, we can now run OGD using some $\eta > 0$: $w_1 := 0$ and

$w_{t+1} := \begin{cases} w_t + \eta y_t x_t & \text{if } y_t \langle w_t, x_t \rangle \leq 1 \\ w_t & \text{if } y_t \langle w_t, x_t \rangle > 1 \end{cases}$

Let $L := \max_t \|z_t\|_2$. It's possible to apply Eq. (6) and show that for any $u \in \mathbb{R}^d$

$M \leq \sum_t f_t(u) + \|u\|_2 L \sqrt{\sum_t f_t(u)} + L^2 \|u\|_2^2$

In particular, if there exists $u \in \mathbb{R}^d$ such that $\sum_t f_t(u) = 0$, we have $M \leq L^2 \|u\|_2^2$.
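The OGD perceptron above is straightforward to implement; here is a minimal sketch that runs the hinge-loss update on a stream of labeled points (the function name and the stream interface are illustrative only).

import numpy as np

def ogd_perceptron(stream, d, eta=1.0):
    """Run OGD on hinge losses f_t(w) = max(0, 1 - y_t * <w, x_t>).

    stream yields pairs (x_t, y_t) with x_t in R^d and y_t in {-1, +1}.
    Returns the final weight vector and the number of sign mistakes M.
    """
    w = np.zeros(d)                        # w_1 := 0
    mistakes = 0
    for x_t, y_t in stream:
        p_t = 1 if w @ x_t >= 0 else -1    # predict sign<w_t, x_t>
        mistakes += int(p_t != y_t)
        if y_t * (w @ x_t) <= 1:           # hinge loss active: subgradient is -y_t * x_t
            w = w + eta * y_t * x_t        # OGD step, Eq. (7)
    return w, mistakes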

4.2 Weighted majority

At each round, we're given a point $x_t \in X$ and $d$ hypotheses $H = \{h_1, \ldots, h_d\}$ where $h_i : X \to \{0, 1\}$. We make a choice $p_t \in [d]$ and receive the true class $y_t \in \{0, 1\}$. The (non-convex) loss is given by

$l(p_t, y_t) := \begin{cases} 1 & \text{if } h_{p_t}(x_t) \neq y_t \\ 0 & \text{if } h_{p_t}(x_t) = y_t \end{cases}$

Convex surrogate: We maintain a vector $w_t \in \{w \in \mathbb{R}^d : w \geq 0, \sum_i w_i = 1\}$. This vector defines "weighted majority": $p_t = 1$ if $\sum_{i=1}^d [w_t]_i h_i(x_t) \geq 1/2$ and $p_t = 0$ otherwise. We use the convex loss function

$f_t(w_t) := \left| \sum_{i=1}^d [w_t]_i h_i(x_t) - y_t \right| = \langle w_t, z_t \rangle$

where $[z_t]_i := |h_i(x_t) - y_t|$ (thus $z_t$ is also the gradient of $f_t$). Hence we have an online linear problem suitable for NEG. It's possible to show that if there exists some $h \in H$ such that $\sum_{t=1}^T |h(x_t) - y_t| = 0$, then NEG achieves $\sum_{t=1}^T f_t(w_t) \leq 4 \log d$.

4.2.1 Multi-armed bandit

A problem closely related to weighted majority is the so-called multi-armed bandit problem. At each round, there are $d$ slot machines ("one-armed bandits") to choose from. We make a choice $p_t \in [d]$ and receive the cost of playing that machine: $[y_t]_{p_t} \in [0, 1]$. A crucial aspect of the problem is the existence of unobserved costs $[y_t]_i \in [0, 1]$ for $i \neq p_t$: if we observed all of $y_t \in [0, 1]^d$, we could just formulate it as an online linear problem by minimizing the expected loss $f_t(w_t) := \langle w_t, y_t \rangle$, where $w_t \in \{w \in \mathbb{R}^d : w \geq 0, \sum_i w_i = 1\}$ again defines a weighted majority over the $d$ machines. Since $y_t$ is the gradient of $f_t$, another way of stating the difficulty is that gradients are not fully observed.

A solution is to use a $p_t$-dependent estimator $z_t^{(p_t)}$ of the gradient $y_t$ as follows:

$[z_t^{(p_t)}]_i := \begin{cases} [y_t]_i / [w_t]_i & \text{if } i = p_t \\ 0 & \text{if } i \neq p_t \end{cases}$

This is indeed an unbiased estimator of $y_t$ over the randomness of $p_t$ since

$\mathbb{E}_{p_t} [z_t^{(p_t)}]_i = \sum_{p=1}^d p(p_t = p) [z_t^{(p)}]_i = [w_t]_i \frac{[y_t]_i}{[w_t]_i} + 0 = [y_t]_i$

Thus we can run NEG by substituting the unobserved gradient $y_t$ with $z_t^{(p_t)}$. Note that the algorithm will be slightly different from the weighted majority algorithm, since we need to actually make the prediction $p_t \sim w_t$ (i.e., sample $p_t$ from $w_t$), which is required for computing $z_t^{(p_t)}$. It's possible to derive regret bounds where the regret is defined as the difference between the algorithm's expected cumulative cost (over the randomness of $p_t$) and the cumulative cost of the best machine:

$\mathbb{E}\left[ \sum_{t=1}^T [y_t]_{p_t} \right] - \min_{i \in [d]} \sum_{t=1}^T [y_t]_i$

Reference

Shalev-Shwartz, S. (2011). Online Learning and Online Convex Optimization.