Notes on Online Convex Optimization

Karl Stratos

Online convex optimization (OCO) is a principled framework for online learning. (These notes are a bird's-eye view of the incredible tutorial by Shalev-Shwartz (2011); for full details, see the original tutorial.)

OnlineConvexOptimization
Input: convex set $S$, number of steps $T$
For $t = 1, 2, \ldots, T$:
    Select $w_t \in S$.
    Receive a convex loss $f_t : S \to \mathbb{R}$ chosen adversarially.
    Suffer loss $f_t(w_t)$.

Each hypothesis $w_t$ is a vector in some convex set $S$. The loss function $f_t : S \to \mathbb{R}$ is convex and defined for each time step $t$ individually. Our goal is to have small regret with respect to a hypothesis space $U$, namely

$$\mathrm{Regret}_T(U) := \max_{u \in U} \mathrm{Regret}_T(u) \qquad \text{where} \qquad \mathrm{Regret}_T(u) := \sum_{t=1}^T f_t(w_t) - \sum_{t=1}^T f_t(u)$$

1 Unregularized aggregate loss minimization

At time $t$, we have observed losses $f_1 \ldots f_{t-1}$, so a natural choice of $w_t$ is one that minimizes the sum of all past losses. This is known as Follow-the-Leader (FTL):

$$w_t := \operatorname*{arg\,min}_{w \in S} \sum_{i=1}^{t-1} f_i(w) \tag{1}$$

Lemma 1.1. If we use Eq. (1) in OCO, we have

$$\mathrm{Regret}_T(S) \leq \sum_{t=1}^T f_t(w_t) - f_t(w_{t+1})$$
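To make FTL concrete, here is a minimal runnable sketch. The setup is my own toy assumption (it does not appear in these notes): squared losses $f_t(w) = \frac{1}{2}\|w - x_t\|_2^2$ over $S = \mathbb{R}^d$, for which the FTL minimizer in Eq. (1) has a closed form, the mean of the past points.

```python
# FTL (Eq. (1)) on toy squared losses f_t(w) = 0.5 * ||w - x_t||^2 over R^d.
# For these losses the aggregate-loss minimizer is the mean of past points.
import numpy as np

rng = np.random.default_rng(0)
d, T = 3, 1000
xs = rng.normal(size=(T, d))       # the adversary's points (random here)

w = np.zeros(d)                    # w_1: argmin of an empty sum, any point works
regret = 0.0
for t in range(T):
    regret += 0.5 * np.sum((w - xs[t]) ** 2)
    w = xs[: t + 1].mean(axis=0)   # FTL: w_{t+1} = argmin_w sum_{i<=t} f_i(w)

u = xs.mean(axis=0)                # best fixed hypothesis in hindsight
regret -= sum(0.5 * np.sum((u - x) ** 2) for x in xs)
print(f"Regret_T(u) = {regret:.2f} over T = {T} rounds")
```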
2 Regularized aggregate loss minimization

Lemma 1.1 suggests a need for containing $f_t(w_t) - f_t(w_{t+1})$. If we assume $f_t$ is $L_t$-Lipschitz with respect to $S$ and some norm $\|\cdot\|$, we have

$$f_t(w_t) - f_t(w_{t+1}) \leq L_t \|w_t - w_{t+1}\|$$

which in turn suggests a need for containing $\|w_t - w_{t+1}\|$. If the objective in Eq. (1)

$$F_t(w) := \sum_{i=1}^{t-1} f_i(w)$$

happens to be $\sigma$-strongly-convex, $\|w_t - w_{t+1}\|$ cannot be arbitrarily large: by the definition of $w_t$ and $w_{t+1}$ and strong convexity,

$$F_t(w_{t+1}) \geq F_t(w_t) + \frac{\sigma}{2} \|w_t - w_{t+1}\|^2$$

$$F_{t+1}(w_t) \geq F_{t+1}(w_{t+1}) + \frac{\sigma}{2} \|w_t - w_{t+1}\|^2$$

Adding these two inequalities and using $F_{t+1} = F_t + f_t$, we get

$$\|w_t - w_{t+1}\| \leq \frac{f_t(w_t) - f_t(w_{t+1})}{\sigma \|w_t - w_{t+1}\|} \leq \frac{L_t}{\sigma}$$

We can always endow $\sigma$-strong-convexity on $F_t$ by adding a $\sigma$-strongly-convex regularizer $R : S \to \mathbb{R}$. This is known as Follow-the-Regularized-Leader (FoReL):

$$w_t := \operatorname*{arg\,min}_{w \in S}\; R(w) + \sum_{i=1}^{t-1} f_i(w) \tag{2}$$

By treating $R$ as the (convex) loss at time $t = 0$, we get the following corollary of Lemma 1.1.

Corollary 2.1. If we use Eq. (2) in OCO, for all $u \in S$ we have

$$\mathrm{Regret}_T(u) \leq R(u) - \min_{v \in S} R(v) + \sum_{t=1}^T f_t(w_t) - f_t(w_{t+1})$$

Theorem 2.2. Let $f_t : S \to \mathbb{R}$ be convex loss functions that are $L_t$-Lipschitz over convex $S$ with respect to $\|\cdot\|$. Let $L \in \mathbb{R}$ be a constant such that $L^2 \geq (1/T) \sum_{t=1}^T L_t^2$, and let $R : S \to \mathbb{R}$ be a $\sigma$-strongly-convex regularizer. Then the regret of FoReL with respect to $u \in S$ is bounded above as:

$$\mathrm{Regret}_T(u) \leq R(u) - \min_{v \in S} R(v) + \frac{T L^2}{\sigma}$$

3 Linearization of convex losses

Theorem 2.2 assumes an oracle that solves Eq. (2), so it's not very useful for deriving concrete algorithms. But a technique known as linearization of convex losses greatly simplifies this task. Since $S$ is a convex set and $f_t$ is convex, at each round $t$ of OCO we can select $z_t \in \partial f_t(w_t)$ so that

$$f_t(w_t) - f_t(w_{t+1}) \leq \langle z_t, w_t \rangle - \langle z_t, w_{t+1} \rangle \tag{3}$$

Thus given a general convex loss $f_t$, we can pretend that it's a linear loss $g_t(u) := \langle z_t, u \rangle$ where $z_t$ is a sub-gradient of $f_t$ at $w_t$. In light of Corollary 2.1 and Eq. (3), running FoReL on these linearized losses

$$w_t := \operatorname*{arg\,min}_{w \in S}\; R(w) + \left\langle w, \sum_{i=1}^{t-1} z_i \right\rangle \tag{4}$$

enjoys the same regret bound in Theorem 2.2.
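As a sanity check on Eq. (4), consider the $\ell_2$ regularizer $R(w) = \frac{1}{2\eta}\|w\|_2^2$ over $S = \mathbb{R}^d$, for which the argmin is available in closed form: $w_t = -\eta \sum_{i=1}^{t-1} z_i$. The sketch below runs this on one-dimensional absolute losses $f_t(w) = |w - x_t|$; the losses and constants are my own toy choices, not from these notes.

```python
# Linearized FoReL (Eq. (4)) with R(w) = ||w||^2 / (2 eta) over S = R:
# the minimizer is w_t = -eta * (sum of past subgradients).
import numpy as np

rng = np.random.default_rng(0)
T, eta = 1000, 0.05
xs = rng.normal(size=T)            # adversary's points for f_t(w) = |w - x_t|

z_sum, gap = 0.0, 0.0
for t in range(T):
    w = -eta * z_sum               # closed-form FoReL iterate
    gap += abs(w - xs[t]) - abs(0.0 - xs[t])   # loss relative to u = 0
    z_sum += np.sign(w - xs[t])    # subgradient z_t of f_t at w_t

print(f"cumulative loss gap versus the fixed point u = 0: {gap:.2f}")
```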
3.1 Online mirror descent

Eq. (4) can be additionally analyzed in a dual framework known as online mirror descent (OMD). OMD frames Eq. (4) as two separate steps: starting with $\theta_1 := 0$,

$$w_t = g(\theta_t)$$

$$\theta_{t+1} = \theta_t - z_t$$

where

$$g(\theta) := \operatorname*{arg\,max}_{w \in S}\; \langle w, \theta \rangle - R(w)$$

is known as the link function. The particular form of the link function comes from the convex conjugate of $R$ ($R$ is assumed to be closed and convex):

$$R^*(\theta) := \max_{w \in S}\; \langle w, \theta \rangle - R(w)$$

A property of $R^*$ is that if $z \in \partial R^*(\theta)$, then $R^*(\theta) = \langle z, \theta \rangle - R(z)$. Thus $g(\theta_t) \in \partial R^*(\theta_t)$. This framework can be used to show that OMD achieves

$$\mathrm{Regret}_T(u) \leq R(u) - \min_{v \in S} R(v) + \sum_{t=1}^T D_{R^*}\!\left( -\sum_{i=1}^{t} z_i \,\middle\|\, -\sum_{i=1}^{t-1} z_i \right) \tag{5}$$

where $D_{R^*}(u \,\|\, v)$ is the Bregman divergence between $u$ and $v$ under $R^*$. If $R$ is $(1/\eta)$-strongly-convex with respect to $\|\cdot\|$, then $R^*$ is $\eta$-strongly-smooth with respect to the dual norm $\|\cdot\|_*$; in this case,

$$\mathrm{Regret}_T(u) \leq R(u) - \min_{v \in S} R(v) + \frac{\eta}{2} \sum_{t=1}^T \|z_t\|_*^2 \tag{6}$$

3.2 Example algorithms

We can now crank out algorithms under the OMD framework. All these algorithms enjoy the bound in Theorem 2.2 (or Eq. (6)).

Online gradient descent (OGD): Assumes an unconstrained domain $S = \mathbb{R}^d$ and an $l_2$ regularizer $R(w) = \frac{1}{2\eta} \|w\|_2^2$. We have $g(\theta) = \eta \theta$, thus $w_1 = 0$ and

$$w_{t+1} = w_t - \eta z_t \tag{7}$$

Online gradient descent with lazy projections (OGDLP): Assumes a general convex set $S$ and an $l_2$ regularizer $R(w) = \frac{1}{2\eta} \|w\|_2^2$. Note that

$$\operatorname*{arg\,max}_{w \in S}\; \langle w, \theta \rangle - \frac{1}{2\eta} \|w\|_2^2 = \operatorname*{arg\,min}_{w \in S}\; \|w - \eta \theta\|_2^2 \tag{8}$$

Thus the link function $g(\theta)$ projects $\eta \theta$ onto $S$.

Unnormalized exponentiated gradient descent (UEG): Assumes an unconstrained domain $S = \mathbb{R}^d$ and a shifted entropy regularizer $R(w) = \frac{1}{\eta} \sum_i w_i (\log w_i - \log \lambda)$ where $\lambda > 0$. We have $g_i(\theta) = \lambda \exp(\eta \theta_i)$, thus $w_1 = (\lambda, \ldots, \lambda)$ and for all $t \geq 1$:

$$[w_{t+1}]_i = [w_t]_i \exp(-\eta [z_t]_i) \tag{9}$$

Normalized exponentiated gradient descent (NEG): Assumes the probability simplex $S = \{w \in \mathbb{R}^d : w \geq 0, \sum_i w_i = 1\}$ and an entropy regularizer $R(w) = \frac{1}{\eta} \sum_i w_i \log w_i$. We have $g_i(\theta) = \frac{\exp(\eta \theta_i)}{\sum_j \exp(\eta \theta_j)}$, thus $w_1 = (1/d, \ldots, 1/d)$ and for all $t \geq 1$:

$$[w_{t+1}]_i = \frac{[w_t]_i \exp(-\eta [z_t]_i)}{\sum_j [w_t]_j \exp(-\eta [z_t]_j)} \tag{10}$$
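Under OMD, each of the algorithms above is just a choice of link function applied to the running dual variable $\theta_t$. The sketch below is a minimal illustration under my own assumptions (toy linear losses $f_t(w) = \langle w, z_t \rangle$ with gradients drawn at random; the function names are mine).

```python
# OMD: w_t = g(theta_t), theta_{t+1} = theta_t - z_t, for three link functions.
import numpy as np

def ogd_link(theta, eta):               # Eq. (7): S = R^d, R(w) = ||w||^2/(2 eta)
    return eta * theta

def ueg_link(theta, eta, lam=1.0):      # Eq. (9): shifted entropy regularizer
    return lam * np.exp(eta * theta)

def neg_link(theta, eta):               # Eq. (10): entropy on the simplex
    p = np.exp(eta * theta - np.max(eta * theta))  # shift by max for stability
    return p / p.sum()

def omd(zs, link):
    """Run OMD on a sequence of (sub)gradients zs; return the iterates w_t."""
    theta = np.zeros(zs.shape[1])       # theta_1 = 0
    ws = []
    for z in zs:
        ws.append(link(theta))          # w_t = g(theta_t)
        theta = theta - z               # theta_{t+1} = theta_t - z_t
    return np.array(ws)

rng = np.random.default_rng(0)
zs = rng.uniform(0, 1, size=(500, 4))   # toy gradient sequence
ws = omd(zs, lambda th: neg_link(th, eta=0.1))  # ogd_link/ueg_link plug in too
print("final NEG weights:", np.round(ws[-1], 3))
```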
4 Applications to classification problems

The central step in applying OCO to a classification problem is finding the right convex surrogate of the problem. (Runnable sketches of the two reductions below are collected at the end of this section.)

4.1 Perceptron

At each round $t$, we're given a point $x_t \in \mathbb{R}^d$. We predict $p_t \in \{-1, +1\}$ and receive the true class $y_t \in \{-1, +1\}$. The (non-convex) loss is given by

$$l(p_t, y_t) := \begin{cases} 1 & \text{if } p_t \neq y_t \\ 0 & \text{if } p_t = y_t \end{cases}$$

Note that the cumulative loss $M := \sum_t l(p_t, y_t)$ is the number of mistakes.

Convex surrogate: We maintain a vector $w_t \in \mathbb{R}^d$ that defines $p_t := \mathrm{sign} \langle w_t, x_t \rangle$. We use a hinge loss

$$f_t(w) := \max(0,\, 1 - y_t \langle w, x_t \rangle)$$

which by this particular construction is convex and upper-bounds the original loss $l(p_t, y_t)$. Using a sub-gradient $z_t \in \partial f_t(w_t)$, where $z_t = -y_t x_t$ if $y_t \langle w_t, x_t \rangle \leq 1$ and $z_t = 0$ otherwise, we can now run OGD using some $\eta > 0$: $w_1 := 0$ and

$$w_{t+1} := \begin{cases} w_t + \eta y_t x_t & \text{if } y_t \langle w_t, x_t \rangle \leq 1 \\ w_t & \text{if } y_t \langle w_t, x_t \rangle > 1 \end{cases}$$

Let $L := \max_t \|z_t\|_2$. It's possible to apply Eq. (6) and show that for any $u \in \mathbb{R}^d$

$$M \leq \sum_{t=1}^T f_t(u) + \|u\|_2 L \sqrt{\sum_{t=1}^T f_t(u)} + L^2 \|u\|_2^2$$

In particular, if there exists $u \in \mathbb{R}^d$ such that $\sum_t f_t(u) = 0$, we have $M \leq L^2 \|u\|_2^2$.

4.2 Weighted majority

At each round $t$, we're given a point $x_t \in X$ and $d$ hypotheses $H = \{h_1, \ldots, h_d\}$ where $h_i : X \to \{0, 1\}$. We make a choice $p_t \in [d]$ and receive the true class $y_t \in \{0, 1\}$. The (non-convex) loss is given by

$$l(p_t, y_t) := \begin{cases} 1 & \text{if } h_{p_t}(x_t) \neq y_t \\ 0 & \text{if } h_{p_t}(x_t) = y_t \end{cases}$$

Convex surrogate: We maintain a vector $w_t \in \{w \in \mathbb{R}^d : w \geq 0, \sum_i w_i = 1\}$. This vector defines a "weighted majority": $p_t = 1$ if $\sum_{i=1}^d [w_t]_i h_i(x_t) \geq 1/2$ and $p_t = 0$ otherwise. We use the convex loss function

$$f_t(w_t) := \sum_{i=1}^d [w_t]_i \,|h_i(x_t) - y_t| = \langle w_t, z_t \rangle$$

where $[z_t]_i := |h_i(x_t) - y_t|$ (thus $z_t$ is also the gradient of $f_t$). Hence we have an online linear problem suitable for NEG. It's possible to show that if there exists some $h \in H$ such that $\sum_{t=1}^T |h(x_t) - y_t| = 0$, then NEG achieves $\sum_{t=1}^T f_t(w_t) \leq 4 \log d$.

4.2.1 Multi-armed bandit

A problem closely related to weighted majority is the so-called multi-armed bandit problem. At each round $t$, there are $d$ slot machines ("one-armed bandits") to choose from. We make a choice $p_t \in [d]$ and receive the cost of playing that machine: $[y_t]_{p_t} \in [0, 1]$. A crucial aspect of the problem is the existence of unobserved costs $[y_t]_i \in [0, 1]$ for $i \neq p_t$: if we observed all of $y_t \in [0, 1]^d$, we could just formulate it as an online linear problem by minimizing the expected cost

$$f_t(w_t) := \langle w_t, y_t \rangle$$

where $w_t \in \{w \in \mathbb{R}^d : w \geq 0, \sum_i w_i = 1\}$ again defines a weighted majority over the $d$ machines. Since $y_t$ is the gradient of $f_t$, another way of stating the difficulty is that gradients are not fully observed. A solution is to use a $p_t$-dependent estimator $z_t^{(p_t)}$ of the gradient $y_t$ as follows:

$$[z_t^{(p_t)}]_i := \begin{cases} [y_t]_i / [w_t]_i & \text{if } i = p_t \\ 0 & \text{if } i \neq p_t \end{cases}$$

This is indeed an unbiased estimator of $y_t$ over the randomness of $p_t$ since

$$\mathbb{E}\big[z_t^{(p_t)}\big]_i = \sum_{p_t = 1}^d p(p_t) \,[z_t^{(p_t)}]_i = [w_t]_i \cdot \frac{[y_t]_i}{[w_t]_i} + \sum_{p_t \neq i} 0 = [y_t]_i$$

Thus we can run NEG by substituting the unobserved gradient $y_t$ with $z_t^{(p_t)}$. Note that the algorithm will be slightly different from the weighted majority algorithm since we need to actually sample the prediction $p_t \sim w_t$, which is required for computing $z_t^{(p_t)}$. It's possible to derive regret bounds where the regret is defined as the difference between the algorithm's expected cumulative cost (over the randomness of $p_1, \ldots, p_T$) and the cumulative cost of the best machine:

$$\mathbb{E}\left[\sum_{t=1}^T [y_t]_{p_t}\right] - \min_{i \in [d]} \sum_{t=1}^T [y_t]_i$$
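Both reductions in this section are easy to run end to end. First, a hedged sketch of the perceptron as OGD on hinge losses; the data-generating process is my own toy assumption (linearly separable labels), not part of these notes.

```python
# Perceptron via OGD on hinge losses: the update fires iff the hinge is active.
import numpy as np

rng = np.random.default_rng(0)
d, T, eta = 5, 2000, 1.0
u_true = rng.normal(size=d)
X = rng.normal(size=(T, d))
y = np.sign(X @ u_true)            # separable labels in {-1, +1}

w, mistakes = np.zeros(d), 0
for x_t, y_t in zip(X, y):
    if np.sign(w @ x_t) != y_t:    # prediction p_t = sign <w_t, x_t>
        mistakes += 1
    if y_t * (w @ x_t) <= 1:       # hinge active, so z_t = -y_t x_t
        w = w + eta * y_t * x_t    # OGD step w_{t+1} = w_t - eta z_t

print(f"mistakes M = {mistakes}")
```

Second, a sketch of NEG run on the importance-weighted gradient estimates $z_t^{(p_t)}$ (an EXP3-style bandit algorithm). The stochastic machines are a toy assumption, and the small uniform-exploration mixture $\gamma$ that keeps $[w_t]_{p_t}$ bounded away from zero is my addition, a standard device rather than something stated in these notes.

```python
# NEG with bandit feedback: play p_t ~ w_t, estimate the gradient by
# importance weighting, and apply the usual dual update theta -= z.
import numpy as np

rng = np.random.default_rng(0)
d, T, eta, gamma = 10, 5000, 0.05, 0.01
mean_costs = rng.uniform(0.2, 0.8, size=d)    # toy machines with fixed means

theta, total_cost = np.zeros(d), 0.0
for t in range(T):
    p = np.exp(eta * theta - np.max(eta * theta))
    w = (1 - gamma) * p / p.sum() + gamma / d     # NEG weights + exploration
    arm = rng.choice(d, p=w)                      # sample p_t ~ w_t
    cost = float(np.clip(mean_costs[arm] + 0.1 * rng.normal(), 0.0, 1.0))
    total_cost += cost
    z = np.zeros(d)
    z[arm] = cost / w[arm]                        # unbiased estimate of y_t
    theta -= z                                    # theta_{t+1} = theta_t - z_t

print(f"average cost {total_cost / T:.3f} vs best machine {mean_costs.min():.3f}")
```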
Reference

Shalev-Shwartz, S. (2011). Online Learning and Online Convex Optimization.