Adaptivity and Optimism: An Improved Exponentiated Gradient Algorithm

Jacob Steinhardt, Percy Liang
Stanford University
{jsteinhardt,pliang}@cs.stanford.edu
Jun 11, 2013
Setup

Setting: learning from experts, with $n$ experts and $T$ rounds.

For $t = 1, \ldots, T$:
- Learner chooses a distribution $w_t \in \Delta_n$ over the experts
- Nature reveals losses $z_t \in [-1,1]^n$ of the experts
- Learner suffers loss $w_t^\top z_t$

Goal: minimize
$$\mathrm{Regret} = \sum_{t=1}^T w_t^\top z_t - \sum_{t=1}^T z_{t,i^*},$$
where $i^*$ is the best fixed expert.

Typical algorithm: multiplicative weights (aka exponentiated gradient):
$$w_{t+1,i} \propto w_{t,i} \exp(-\eta z_{t,i}).$$
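As a concrete sketch of this setup, here is a minimal implementation of the multiplicative weights update on a made-up loss sequence (the function name, step size, and toy losses are mine, not from the talk):

```python
import math

def multiplicative_weights(losses, eta):
    """Multiplicative-weights (exponentiated gradient) update:
    w_{t+1,i} is proportional to w_{t,i} * exp(-eta * z_{t,i}).
    `losses` is a list of rounds, each a list of per-expert losses
    in [-1, 1]. Returns the learner's total loss sum_t w_t . z_t."""
    n = len(losses[0])
    w = [1.0 / n] * n                       # start uniform over experts
    total = 0.0
    for z in losses:
        total += sum(wi * zi for wi, zi in zip(w, z))   # loss w_t . z_t
        w = [wi * math.exp(-eta * zi) for wi, zi in zip(w, z)]
        norm = sum(w)
        w = [wi / norm for wi in w]         # renormalize onto the simplex
    return total

# Toy run: expert 0 is consistently better, so the learner's regret
# against it should stay small relative to T = 150.
losses = [[0.0, 1.0], [0.1, 0.9], [0.0, 1.0]] * 50
total = multiplicative_weights(losses, eta=0.1)
best = min(sum(z[i] for z in losses) for i in range(2))
regret = total - best
```

The weights concentrate exponentially fast on the better expert, so the regret stays bounded even as the gap in cumulative loss between the two experts grows linearly.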
Outline

- Compare two variants of the multiplicative weights (exponentiated gradient) algorithm
- Understand the difference through the lens of adaptive mirror descent (Orabona et al., 2013)
- Combine with the machinery of optimistic updates (Rakhlin & Sridharan, 2012) to beat the best existing bounds
Two Types of Updates

The literature contains two similar but distinct updates (Kivinen & Warmuth, 1997; Cesa-Bianchi et al., 2007):
$$w_{t+1,i} \propto w_{t,i} \exp(-\eta z_{t,i}) \quad \text{(MW1)}$$
$$w_{t+1,i} \propto w_{t,i} (1 - \eta z_{t,i}) \quad \text{(MW2)}$$

The regret is bounded as
$$\mathrm{Regret} \le \frac{\log n}{\eta} + \eta \sum_{t=1}^T \|z_t\|_\infty^2 \quad \text{(Regret:MW1)}$$
$$\mathrm{Regret} \le \frac{\log n}{\eta} + \eta \sum_{t=1}^T z_{t,i^*}^2 \quad \text{(Regret:MW2)}$$

If the best expert $i^*$ has loss close to zero, the second bound is better than the first. The gap can be $\Theta(\sqrt{T})$ (in actual performance, not just in the upper bounds).
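To see the gap concretely, here is a small sketch evaluating both upper bounds on a hypothetical sequence where the best expert suffers zero loss (the sequence, horizon, and step size are made up for illustration):

```python
import math

# Hypothetical loss sequence: expert 0 is perfect, expert 1 always loses.
T, n, eta = 100, 2, 0.05
z = [[0.0, 1.0] for _ in range(T)]

i_star = 0  # the best fixed expert
bound_mw1 = math.log(n) / eta + eta * sum(max(abs(v) for v in zt) ** 2 for zt in z)
bound_mw2 = math.log(n) / eta + eta * sum(zt[i_star] ** 2 for zt in z)

# bound_mw1 = log(2)/eta + eta*T, which grows linearly with T, while
# bound_mw2 = log(2)/eta, since the best expert's losses are all zero.
```

The two bounds differ by exactly $\eta \sum_t (\|z_t\|_\infty^2 - z_{t,i^*}^2)$, which here is $\eta T$.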
Two Types of Updates

Mirror descent is the gold-standard meta-algorithm for online learning. How do (MW1) and (MW2) relate to it?
- (MW1) is mirror descent with regularizer $\frac{1}{\eta} \sum_{i=1}^n w_i \log w_i$
- (MW2) is NOT mirror descent for any fixed regularizer

Unsettling: should we abandon mirror descent as a gold standard?

No: we can cast (MW2) as adaptive mirror descent (Orabona et al., 2013).
Adaptive Mirror Descent to the Rescue

Recall that mirror descent is the (meta-)algorithm
$$w_t = \operatorname*{argmin}_w \; \psi(w) + \sum_{s=1}^{t-1} w^\top z_s.$$
For $\psi(w) = \frac{1}{\eta} \sum_{i=1}^n w_i \log w_i$, we recover (MW1).

Adaptive mirror descent (Orabona et al., 2013) is the meta-algorithm
$$w_t = \operatorname*{argmin}_w \; \psi_t(w) + \sum_{s=1}^{t-1} w^\top z_s.$$

For $\psi_t(w) = \frac{1}{\eta} \sum_{i=1}^n w_i \log w_i + \eta \sum_{i=1}^n \sum_{s=1}^{t-1} w_i z_{s,i}^2$, we approximately recover (MW2). The update becomes
$$w_{t+1,i} \propto w_{t,i} \exp(-\eta z_{t,i} - \eta^2 z_{t,i}^2) \approx w_{t,i} (1 - \eta z_{t,i}).$$
This is enough to achieve the better regret bound; (MW2) can be recovered exactly with a more complicated $\psi_t$.
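The approximation $\exp(-\eta z - \eta^2 z^2) \approx 1 - \eta z$ is easy to sanity-check numerically; this quick sketch (the values of $\eta$ and $z$ are arbitrary choices of mine) compares the adaptive-mirror-descent update factor against the (MW2) factor:

```python
import math

eta = 0.01
for z in [-1.0, -0.3, 0.0, 0.4, 1.0]:
    adaptive = math.exp(-eta * z - eta ** 2 * z ** 2)  # corrected (MW1)-style factor
    mw2 = 1 - eta * z                                  # (MW2) factor
    # The two agree up to O(eta^2); this leftover is what
    # "approximately recover" refers to.
    assert abs(adaptive - mw2) < 1e-4
```

Since $\log(1 - x) = -x - x^2/2 - \cdots$, the two factors differ only in the coefficient of the quadratic correction, which vanishes as $\eta \to 0$.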
Advantages of Our Perspective

So far, we have cast (MW2) as adaptive mirror descent, with regularizer
$$\psi_t(w) = \sum_{i=1}^n w_i \left[ \frac{1}{\eta} \log w_i + \eta \sum_{s=1}^{t-1} z_{s,i}^2 \right].$$
This explains the better regret bound while staying within the mirror descent framework.

Our new perspective also allows us to apply lots of modern machinery:
- optimistic updates (Rakhlin & Sridharan, 2012)
- matrix multiplicative weights (Tsuda et al., 2005; Arora & Kale, 2007)

By turning the crank, we get results that beat the state of the art!
Beating the State of the Art

Known regret bounds, arranged on two axes: adaptivity increases from top to bottom (from the worst-case norm $\|\cdot\|_\infty$ to an individual coordinate $i$), and optimism increases from right to left (from raw losses $S$, to deviation from the mean $V$, to deviation from the previous loss $D$):

    D (Chiang et al., 2012)  |  V                              |  S (Kivinen & Warmuth, 1997)
    max_i D_i                |  max_i V_i (Hazan & Kale, 2008) |  max_i S_i
    D_i (this work)          |  V_i                            |  S_i (Cesa-Bianchi et al., 2007)

In the above we let
$$D = \sum_{t=1}^T \|z_t - z_{t-1}\|_\infty^2, \qquad V = \sum_{t=1}^T \|z_t - \bar{z}\|_\infty^2, \qquad S = \sum_{t=1}^T \|z_t\|_\infty^2,$$
$$D_i = \sum_{t=1}^T (z_{t,i} - z_{t-1,i})^2, \qquad V_i = \sum_{t=1}^T (z_{t,i} - \bar{z}_i)^2, \qquad S_i = \sum_{t=1}^T z_{t,i}^2.$$
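All six quantities are straightforward to compute for a given loss sequence. The sketch below (function name and toy sequence are mine; I also assume $z_0 = 0$ for the $t = 1$ difference term, which the slide leaves implicit) evaluates them:

```python
def variation_measures(z):
    """Compute D, V, S and per-coordinate D_i, V_i, S_i for a loss
    sequence z (list of rounds, each a list of per-expert losses).
    The t = 1 difference term uses z_0 = 0 (my assumption)."""
    T, n = len(z), len(z[0])
    zbar = [sum(zt[i] for zt in z) / T for i in range(n)]
    diffs = [[z[t][i] - (z[t - 1][i] if t > 0 else 0.0) for i in range(n)]
             for t in range(T)]
    S_i = [sum(zt[i] ** 2 for zt in z) for i in range(n)]
    V_i = [sum((zt[i] - zbar[i]) ** 2 for zt in z) for i in range(n)]
    D_i = [sum(dt[i] ** 2 for dt in diffs) for i in range(n)]
    S = sum(max(abs(v) for v in zt) ** 2 for zt in z)
    V = sum(max(abs(zt[i] - zbar[i]) for i in range(n)) ** 2 for zt in z)
    D = sum(max(abs(v) for v in dt) ** 2 for dt in diffs)
    return D, V, S, D_i, V_i, S_i

# Example: a slowly drifting sequence, where the D-type quantities are small.
D, V, S, D_i, V_i, S_i = variation_measures([[0.1, 0.9], [0.2, 0.8], [0.1, 0.7]])
```

Note the easy relations: $V_i = S_i - T \bar{z}_i^2 \le S_i$, and $\max_i S_i \le S$ since $\|z_t\|_\infty^2 \ge z_{t,i}^2$ for every $i$.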
Optimistic Updates: A Brief Review

(Normal) mirror descent:
$$w_t = \operatorname*{argmin}_w \left[ \psi(w) + w^\top \sum_{s=1}^{t-1} z_s \right]$$

Optimistic mirror descent (Rakhlin & Sridharan, 2012) adds a hint $m_t$:
$$w_t = \operatorname*{argmin}_w \left[ \psi(w) + w^\top \left( m_t + \sum_{s=1}^{t-1} z_s \right) \right]$$

The hint $m_t$ guesses the next term $z_t$ in the cost function, and we pay regret in terms of $z_t - m_t$ rather than $z_t$.
E.g.: $m_t = z_{t-1}$, or $m_t = \frac{1}{t-1} \sum_{s=1}^{t-1} z_s$.
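Instantiated with the entropy regularizer over the simplex, optimistic mirror descent with hint $m_t = z_{t-1}$ has a closed-form exponential-weights update. Here is a sketch (function name, toy data, and the choice $m_1 = 0$ are mine):

```python
import math

def optimistic_mw(losses, eta, use_hint=True):
    """Entropy-regularized mirror descent over the simplex: play
    w_t proportional to exp(-eta * (sum_{s<t} z_s + m_t)) with hint
    m_t = z_{t-1} (and m_1 = 0). use_hint=False disables the hint,
    giving plain mirror descent for comparison."""
    n = len(losses[0])
    cum = [0.0] * n        # cumulative losses sum_{s<t} z_s
    hint = [0.0] * n       # m_t = z_{t-1}
    total = 0.0
    for z in losses:
        scores = [math.exp(-eta * (c + (h if use_hint else 0.0)))
                  for c, h in zip(cum, hint)]
        norm = sum(scores)
        total += sum((x / norm) * zi for x, zi in zip(scores, z))
        cum = [c + zi for c, zi in zip(cum, z)]
        hint = z
    return total

# On a perfectly predictable (constant) sequence the hint is exact after
# round 1, so the optimistic variant should do no worse than the plain one.
losses = [[0.0, 1.0]] * 100
with_hint = optimistic_mw(losses, eta=0.1)
without_hint = optimistic_mw(losses, eta=0.1, use_hint=False)
```

With constant losses the hinted player is effectively "one round ahead" of the plain player, so its weight on the worse expert is strictly smaller from round 2 onward.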
Multiplicative Weights with Optimism

Name | Auxiliary Update                                                              | Prediction ($w_t$)
MW1  | $\beta_{t+1,i} = \beta_{t,i} - \eta z_{t,i}$                                  | $w_{t,i} \propto \exp(\beta_{t,i})$
MW2  | $\beta_{t+1,i} = \beta_{t,i} - \eta z_{t,i} - \eta^2 z_{t,i}^2$               | $w_{t,i} \propto \exp(\beta_{t,i})$
MW3  | $\beta_{t+1,i} = \beta_{t,i} - \eta z_{t,i} - \eta^2 (z_{t,i} - z_{t-1,i})^2$ | $w_{t,i} \propto \exp(\beta_{t,i} - \eta z_{t-1,i})$

Regret of MW3:
$$\mathrm{Regret} \le \frac{\log n}{\eta} + \eta \sum_{t=1}^T (z_{t,i^*} - z_{t-1,i^*})^2$$
This dominates all existing bounds in this setting!
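A sketch of MW3 exactly as described in the table (variable names are mine, and I take $z_0 = 0$, which the table leaves implicit). On a slowly varying sequence the path-length term $\sum_t (z_{t,i^*} - z_{t-1,i^*})^2$ is tiny, so the regret should stay near $\log(n)/\eta$:

```python
import math

def mw3(losses, eta):
    """MW3 from the table: auxiliary update
    beta_{t+1,i} = beta_{t,i} - eta*z_{t,i} - eta^2*(z_{t,i} - z_{t-1,i})^2,
    prediction w_{t,i} proportional to exp(beta_{t,i} - eta*z_{t-1,i}).
    Takes z_0 = 0 (my assumption). Returns the learner's total loss."""
    n = len(losses[0])
    beta = [0.0] * n
    prev = [0.0] * n       # z_{t-1}
    total = 0.0
    for z in losses:
        scores = [math.exp(b - eta * p) for b, p in zip(beta, prev)]
        norm = sum(scores)
        total += sum((x / norm) * zi for x, zi in zip(scores, z))
        beta = [b - eta * zi - eta ** 2 * (zi - pi) ** 2
                for b, zi, pi in zip(beta, z, prev)]
        prev = z
    return total

# Constant losses: the best expert's path length is just z_{1,i*}^2 = 0.04,
# so the regret should stay on the order of log(2)/eta, independent of T.
losses = [[0.2, 0.8]] * 200
regret = mw3(losses, eta=0.1) - 0.2 * 200
```

The prediction's extra $-\eta z_{t-1,i}$ term is the optimistic hint $m_t = z_{t-1}$, and the $\eta^2 (z_{t,i} - z_{t-1,i})^2$ term is the adaptive regularization; MW3 combines both.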
Summary

- Cast the multiplicative weights algorithm as adaptive mirror descent
- Applied the machinery of optimistic updates to beat the best existing bounds

Also in the paper:
- extension to general convex losses
- extension to matrices
- generalization of the FTRL lemma to convex cones