Full-information Online Learning
Lijun Zhang
LAMDA, Nanjing University, China
June 2, 2017
Outline
1 Introduction
  Definitions
  Regret
2 Prediction with Expert Advice
3 Online Convex Optimization
  Convex Functions
  Strongly Convex Functions
  Exponentially Concave Functions
What Happens in an Internet Minute?
http://www.dailyinfographic.com/what-happens-in-an-internet-minute-infographic
Online Algorithm vs Online Learning

[Karp, 1992] An online algorithm is one that receives a sequence of requests and performs an immediate action in response to each request.
  Studied in: Theoretical Computer Science, Computer Networks
  Example (the Ski Rental Problem): on the t-th day, rent ($1) or buy ($10)?

[Shalev-Shwartz, 2011] Online learning is the process of answering a sequence of questions given (maybe partial) knowledge of answers to previous questions and possibly additional information.
  Studied in: Machine Learning, Game Theory, Information Theory; with applications in Computer Vision and Data Mining
  Example (Online Classification): in the t-th round, receive $x_t \in \mathbb{R}^d$, predict $\hat{y}_t$, observe $y_t$
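The ski rental example above admits a classic break-even strategy, sketched below (a minimal illustration assuming the slide's prices of $1/day rent and $10 purchase; the function names are hypothetical): rent until the total rent paid would reach the purchase price, then buy. For any number of ski days, this costs at most twice the optimal offline cost.

```python
def ski_rental_cost(days_skied, rent=1, buy=10):
    """Cost of the break-even strategy when we end up skiing `days_skied` days."""
    if days_skied < buy:               # stopped before the break-even day: only rented
        return days_skied * rent
    return (buy - 1) * rent + buy      # rented for buy-1 days, then bought

def offline_optimal(days_skied, rent=1, buy=10):
    """Cost with full knowledge of the future: rent the whole time, or buy on day 1."""
    return min(days_skied * rent, buy)

# The break-even strategy is 2-competitive:
for d in [3, 10, 100]:
    assert ski_rental_cost(d) <= 2 * offline_optimal(d)
```

This is the standard contrast with online learning: the guarantee is a worst-case *competitive ratio* against the offline optimum rather than a regret bound.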
Full-information vs Bandit

[Shalev-Shwartz, 2011] Online learning is the process of answering a sequence of questions given (maybe partial) knowledge of answers to previous questions and possibly additional information.

Full-information (illustration)
Bandit (illustration)
Formal Definitions

1: for t = 1, 2, ..., T do
2:   Learner picks a decision $w_t \in W$; Adversary chooses a function $f_t(\cdot)$
3:   Learner suffers loss $f_t(w_t)$ and updates $w_t$
4: end for

(Learner: e.g., a classifier; Adversary: e.g., an example and a loss)

Cumulative Loss
\[ \text{Cumulative Loss} = \sum_{t=1}^T f_t(w_t) \]

Full-information: Learner observes the function $f_t(\cdot)$ (online first-order optimization)
Bandit: Learner only observes the value $f_t(w_t)$ (online zero-order optimization)
Regret

Cumulative Loss
\[ \text{Cumulative Loss} = \sum_{t=1}^T f_t(w_t) \]

Regret
\[ \text{Regret} = \underbrace{\sum_{t=1}^T f_t(w_t)}_{\text{Cumulative Loss of Online Learner}} - \underbrace{\min_{w \in W} \sum_{t=1}^T f_t(w)}_{\text{Minimal Loss of Batch Learner}} \]

Hannan Consistent
\[ \limsup_{T \to \infty} \frac{1}{T} \left( \sum_{t=1}^T f_t(w_t) - \min_{w \in W} \sum_{t=1}^T f_t(w) \right) = 0, \quad \text{with probability } 1 \]
equivalently,
\[ \sum_{t=1}^T f_t(w_t) - \min_{w \in W} \sum_{t=1}^T f_t(w) = o(T), \quad \text{with probability } 1 \]
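For a finite decision set the regret definition can be computed directly; the sketch below uses a hypothetical loss table, where `losses[t][w]` is the loss of decision `w` in round `t`:

```python
def regret(losses, decisions):
    """Regret of the learner's picks against the best fixed decision in hindsight.

    losses[t][w]: loss of decision w in round t; decisions[t]: learner's pick.
    """
    cumulative = sum(losses[t][decisions[t]] for t in range(len(losses)))
    best_fixed = min(sum(row[w] for row in losses) for w in range(len(losses[0])))
    return cumulative - best_fixed

losses = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
print(regret(losses, decisions=[1, 0, 1]))  # learner always wrong: 3 - 1 = 2.0
```

Note that regret can be negative: an online learner that adapts to the sequence may beat the best *fixed* decision.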
The Prediction Protocol

1: for t = 1, ..., T do
2:   The environment chooses the next outcome $y_t$ and the expert advice $\{f_{i,t} : i \in [N]\}$
3:   The expert advice is revealed to the forecaster
4:   The forecaster chooses the prediction $\hat{p}_t$
5:   The environment reveals the next outcome $y_t$
6:   The forecaster incurs loss $\ell(\hat{p}_t, y_t)$ and each expert $i$ incurs loss $\ell(f_{i,t}, y_t)$
7: end for

Regret with respect to expert $i$
\[ R_{i,T} = \sum_{t=1}^T \left( \ell(\hat{p}_t, y_t) - \ell(f_{i,t}, y_t) \right) = \hat{L}_T - L_{i,T} \]
Online Algorithms

Weighted Average Prediction
\[ \hat{p}_t = \frac{\sum_{i=1}^N w_{i,t-1} f_{i,t}}{\sum_{j=1}^N w_{j,t-1}} \]
where $w_{i,t-1}$ is the weight assigned to expert $i$.

Polynomially Weighted Average Forecaster
\[ w_{i,t-1} = \frac{2 \left( R_{i,t-1} \right)_+^{p-1}}{\left\| \left( R_{t-1} \right)_+ \right\|_p^{p-2}} \quad \text{and} \quad \hat{p}_t = \frac{\sum_{i=1}^N \left( \sum_{s=1}^{t-1} \left( \ell(\hat{p}_s, y_s) - \ell(f_{i,s}, y_s) \right) \right)_+^{p-1} f_{i,t}}{\sum_{j=1}^N \left( \sum_{s=1}^{t-1} \left( \ell(\hat{p}_s, y_s) - \ell(f_{j,s}, y_s) \right) \right)_+^{p-1}} \]

Exponentially Weighted Average Forecaster
\[ w_{i,t-1} = \frac{e^{\eta R_{i,t-1}}}{\sum_{j=1}^N e^{\eta R_{j,t-1}}} \quad \text{and} \quad \hat{p}_t = \frac{\sum_{i=1}^N \exp\left( \eta (\hat{L}_{t-1} - L_{i,t-1}) \right) f_{i,t}}{\sum_{j=1}^N \exp\left( \eta (\hat{L}_{t-1} - L_{j,t-1}) \right)} = \frac{\sum_{i=1}^N e^{-\eta L_{i,t-1}} f_{i,t}}{\sum_{j=1}^N e^{-\eta L_{j,t-1}}} \]
Regret Bounds I

Corollary 2.1 of [Cesa-Bianchi and Lugosi, 2006]: Assume that the loss function $\ell$ is convex in its first argument and that it takes values in $[0, 1]$. Then, for any sequence $y_1, y_2, \ldots$ of outcomes and for any $T \geq 1$, the regret of the polynomially weighted average forecaster satisfies
\[ \hat{L}_T - \min_{i=1,\ldots,N} L_{i,T} \leq \sqrt{T (p-1) N^{2/p}} \]

When $p = 2$, we have $\hat{L}_T - \min_{i} L_{i,T} \leq \sqrt{TN}$.
When $p = 2 \ln N$, we have $\hat{L}_T - \min_{i} L_{i,T} \leq \sqrt{T e (2 \ln N - 1)}$.
Regret Bounds II

Corollary 2.2 of [Cesa-Bianchi and Lugosi, 2006]: Assume that the loss function $\ell$ is convex in its first argument and that it takes values in $[0, 1]$. Then, for any sequence $y_1, y_2, \ldots$ of outcomes and for any $T \geq 1$, the regret of the exponentially weighted average forecaster satisfies
\[ \hat{L}_T - \min_{i=1,\ldots,N} L_{i,T} \leq \frac{\ln N}{\eta} + \frac{\eta T}{2} \]

When $\eta = \sqrt{2 \ln N / T}$, we have $\hat{L}_T - \min_{i} L_{i,T} \leq \sqrt{2 T \ln N}$.
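The exponentially weighted average forecaster and its tuned regret bound can be checked numerically; in the sketch below the absolute loss $\ell(p, y) = |p - y|$ and the random advice/outcomes are illustrative assumptions (the corollary applies since $|p - y|$ is convex in $p$ and takes values in $[0, 1]$):

```python
import math
import random

def ewa(expert_advice, outcomes, eta):
    """Exponentially weighted average forecaster with absolute loss.

    expert_advice[t][i]: advice of expert i in round t; outcomes[t] in [0, 1].
    Returns (forecaster's cumulative loss, best expert's cumulative loss).
    """
    N = len(expert_advice[0])
    L = [0.0] * N                    # cumulative expert losses L_{i,t-1}
    forecaster_loss = 0.0
    for f_t, y in zip(expert_advice, outcomes):
        w = [math.exp(-eta * L[i]) for i in range(N)]     # weights e^{-eta L_{i,t-1}}
        p = sum(w[i] * f_t[i] for i in range(N)) / sum(w)  # weighted average prediction
        forecaster_loss += abs(p - y)
        for i in range(N):
            L[i] += abs(f_t[i] - y)
    return forecaster_loss, min(L)

random.seed(0)
T, N = 500, 8
advice = [[random.random() for _ in range(N)] for _ in range(T)]
outcomes = [random.random() for _ in range(T)]
eta = math.sqrt(2 * math.log(N) / T)   # tuning from Corollary 2.2
L_hat, L_star = ewa(advice, outcomes, eta)
assert L_hat - L_star <= math.sqrt(2 * T * math.log(N))  # regret bound holds
```

The bound is worst-case: on typical random data the realized regret is usually well below $\sqrt{2 T \ln N}$.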
Extensions

Tighter Regret Bounds
Small Losses: $\hat{L}_T - \min_{i=1,\ldots,N} L_{i,T} \leq \sqrt{2 L_T^* \ln N} + \ln N$, where $L_T^* = \min_{i=1,\ldots,N} L_{i,T}$
Exp-concave Losses: $\hat{L}_T - \min_{i=1,\ldots,N} L_{i,T} \leq \frac{\ln N}{\eta}$

Tracking Regret
\[ R(i_1, \ldots, i_T) = \sum_{t=1}^T \left( \ell(\hat{p}_t, y_t) - \ell(f_{i_t,t}, y_t) \right) \]
Definitions

Online Convex Optimization
1: for t = 1, 2, ..., T do
2:   Learner picks a decision $w_t \in W$; Adversary chooses a function $f_t(\cdot)$
3:   Learner suffers loss $f_t(w_t)$ and updates $w_t$
4: end for
where $W$ and the $f_t$'s are convex.

Online Gradient Descent
\[ w_{t+1} = \Pi_W \left( w_t - \eta_t \nabla f_t(w_t) \right) \]
where $\Pi_W(\cdot)$ is the projection operator onto $W$.
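Online gradient descent is a one-line update plus a projection; a minimal sketch follows (the ball-shaped feasible set, the subgradient-oracle interface, and the step sizes $\eta_t = 1/\sqrt{t}$ are assumptions for illustration, matching the analysis for general convex functions):

```python
import math

def project_ball(w, R):
    """Euclidean projection onto the ball of radius R."""
    norm = math.sqrt(sum(x * x for x in w))
    return list(w) if norm <= R else [x * R / norm for x in w]

def ogd(grads, w0, R):
    """Online gradient descent on W = {w : ||w||_2 <= R}.

    grads: list of oracles, grads[t](w) -> subgradient of f_t at w.
    Returns the list of iterates w_1, ..., w_{T+1}.
    """
    w = list(w0)
    iterates = [list(w)]
    for t, grad in enumerate(grads, start=1):
        g = grad(w)
        eta = 1.0 / math.sqrt(t)                                  # eta_t = 1/sqrt(t)
        w = project_ball([wi - eta * gi for wi, gi in zip(w, g)], R)
        iterates.append(list(w))
    return iterates
```

For example, running it on the fixed loss $f_t(w) = (w - 0.5)^2$ drives the iterates toward $0.5$ while the projection keeps them inside the ball.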
Analysis I

Define $w'_{t+1} = w_t - \eta_t \nabla f_t(w_t)$. For any $w \in W$, we have
\[ \begin{aligned} f_t(w_t) - f_t(w) &\leq \langle \nabla f_t(w_t), w_t - w \rangle = \frac{1}{\eta_t} \langle w_t - w'_{t+1}, w_t - w \rangle \\ &= \frac{1}{2\eta_t} \left( \|w_t - w\|_2^2 - \|w'_{t+1} - w\|_2^2 + \|w_t - w'_{t+1}\|_2^2 \right) \\ &= \frac{1}{2\eta_t} \left( \|w_t - w\|_2^2 - \|w'_{t+1} - w\|_2^2 \right) + \frac{\eta_t}{2} \|\nabla f_t(w_t)\|_2^2 \\ &\leq \frac{1}{2\eta_t} \left( \|w_t - w\|_2^2 - \|w_{t+1} - w\|_2^2 \right) + \frac{\eta_t}{2} \|\nabla f_t(w_t)\|_2^2 \end{aligned} \]
where the last step uses the nonexpansiveness of the projection. By adding the inequalities of all iterations, we have
\[ \sum_{t=1}^T \left( f_t(w_t) - f_t(w) \right) \leq \frac{1}{2\eta_1} \|w_1 - w\|_2^2 - \frac{1}{2\eta_T} \|w_{T+1} - w\|_2^2 + \frac{1}{2} \sum_{t=2}^T \left( \frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} \right) \|w_t - w\|_2^2 + \frac{1}{2} \sum_{t=1}^T \eta_t \|\nabla f_t(w_t)\|_2^2 \]
Analysis II

Assuming $\eta_t \leq \eta_{t-1}$, $\|w_t - w\|_2^2 \leq D^2$, and $\|\nabla f_t(w_t)\|_2^2 \leq G^2$, we have
\[ \sum_{t=1}^T \left( f_t(w_t) - f_t(w) \right) \leq \frac{D^2}{2\eta_1} + \frac{D^2}{2} \sum_{t=2}^T \left( \frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} \right) + \frac{G^2}{2} \sum_{t=1}^T \eta_t = \frac{D^2}{2\eta_T} + \frac{G^2}{2} \sum_{t=1}^T \eta_t \]
By setting $\eta_t = \frac{1}{\sqrt{t}}$ and using $\sum_{t=1}^T \frac{1}{\sqrt{t}} \leq 2\sqrt{T}$, we have
\[ \sum_{t=1}^T \left( f_t(w_t) - f_t(w) \right) \leq \left( \frac{D^2}{2} + G^2 \right) \sqrt{T} \]
Analysis I

Define $w'_{t+1} = w_t - \eta_t \nabla f_t(w_t)$. For any $w \in W$, by $\lambda$-strong convexity we have
\[ \begin{aligned} f_t(w_t) - f_t(w) &\leq \langle \nabla f_t(w_t), w_t - w \rangle - \frac{\lambda}{2} \|w_t - w\|_2^2 \\ &\leq \frac{1}{2\eta_t} \left( \|w_t - w\|_2^2 - \|w_{t+1} - w\|_2^2 \right) + \frac{\eta_t}{2} \|\nabla f_t(w_t)\|_2^2 - \frac{\lambda}{2} \|w_t - w\|_2^2 \end{aligned} \]
By adding the inequalities of all iterations, we have
\[ \sum_{t=1}^T \left( f_t(w_t) - f_t(w) \right) \leq \left( \frac{1}{2\eta_1} - \frac{\lambda}{2} \right) \|w_1 - w\|_2^2 - \frac{1}{2\eta_T} \|w_{T+1} - w\|_2^2 + \frac{1}{2} \sum_{t=2}^T \left( \frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} - \lambda \right) \|w_t - w\|_2^2 + \frac{1}{2} \sum_{t=1}^T \eta_t \|\nabla f_t(w_t)\|_2^2 \]
Analysis II

Assuming $\|\nabla f_t(w_t)\|_2^2 \leq G^2$, we have
\[ \sum_{t=1}^T \left( f_t(w_t) - f_t(w) \right) \leq \left( \frac{1}{2\eta_1} - \frac{\lambda}{2} \right) \|w_1 - w\|_2^2 + \frac{1}{2} \sum_{t=2}^T \left( \frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} - \lambda \right) \|w_t - w\|_2^2 + \frac{G^2}{2} \sum_{t=1}^T \eta_t \]
By setting $\eta_t = \frac{1}{\lambda t}$, the first two terms vanish and we have
\[ \sum_{t=1}^T \left( f_t(w_t) - f_t(w) \right) \leq \frac{G^2}{2\lambda} \sum_{t=1}^T \frac{1}{t} \leq \frac{G^2 (\log T + 1)}{2\lambda} \]
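The logarithmic bound can be checked numerically; the sketch below uses hypothetical losses $f_t(w) = (w - a_t)^2$ on $W = [0, 1]$, which are $\lambda$-strongly convex with $\lambda = 2$ and satisfy $|f_t'(w)| \leq 2$ on $W$, so $G = 2$ and the bound $\frac{G^2(\log T + 1)}{2\lambda} = \log T + 1$ applies with step sizes $\eta_t = \frac{1}{\lambda t}$:

```python
import math
import random

random.seed(1)
T, lam = 1000, 2.0
a = [random.random() for _ in range(T)]        # targets a_t in [0, 1]

w, iterates = 0.0, []
for t in range(1, T + 1):
    iterates.append(w)
    g = 2 * (w - a[t - 1])                     # gradient of f_t at w_t
    w = min(1.0, max(0.0, w - g / (lam * t)))  # step eta_t = 1/(lam t), project onto [0, 1]

best = sum(a) / T                              # minimizer of sum_t (w - a_t)^2 over [0, 1]
regret = sum((wt - at) ** 2 - (best - at) ** 2 for wt, at in zip(iterates, a))
assert regret <= math.log(T) + 1               # logarithmic regret bound holds
```

Here `math.log` is the natural logarithm, matching the $\sum_{t=1}^T \frac{1}{t} \leq \log T + 1$ step of the analysis.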
Definitions

Exponential Concavity
A function $f(\cdot): W \mapsto \mathbb{R}$ is $\alpha$-exp-concave if $\exp(-\alpha f(\cdot))$ is concave over $W$.

Examples:
Logistic Loss: $f(w) = \log\left( 1 + \exp(-y\, x^\top w) \right)$
Square Loss: $f(w) = (x^\top w - y)^2$
Negative Logarithm Loss: $f(w) = -\log(x^\top w)$
Online Newton Step

The Algorithm
1: $A_1 = \frac{1}{\beta^2 D^2} I$
2: for t = 1, 2, ..., T do
3:   $A_{t+1} = A_t + \nabla f_t(w_t) \left[ \nabla f_t(w_t) \right]^\top$
4:   $w_{t+1} = \Pi_W^{A_{t+1}}\left( w'_{t+1} \right) = \operatorname{argmin}_{w \in W} \left( w - w'_{t+1} \right)^\top A_{t+1} \left( w - w'_{t+1} \right)$, where $w'_{t+1} = w_t - \frac{1}{\beta} A_{t+1}^{-1} \nabla f_t(w_t)$
5: end for
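A sketch of the update loop (an illustrative simplification: the generalized projection $\Pi_W^{A_{t+1}}$ is omitted, i.e., we assume the iterates stay inside $W$, which need not hold in general; the quadratic losses in the demo are hypothetical):

```python
import numpy as np

def ons(grads, w0, beta, D):
    """Online Newton Step without projection (assumes iterates remain in W).

    grads: list of oracles, grads[t](w) -> gradient of f_t at w.
    beta, D: parameters as in the algorithm; A_1 = I / (beta^2 D^2).
    """
    d = len(w0)
    A = np.eye(d) / (beta ** 2 * D ** 2)
    w = np.array(w0, dtype=float)
    iterates = [w.copy()]
    for grad in grads:
        g = np.asarray(grad(w), dtype=float)
        A += np.outer(g, g)                       # A_{t+1} = A_t + g g^T
        w = w - np.linalg.solve(A, g) / beta      # w_{t+1} = w_t - (1/beta) A_{t+1}^{-1} g
        iterates.append(w.copy())
    return iterates

# Demo on f_t(w) = ||w||^2 (gradient 2w), minimized at the origin:
grads = [lambda w: 2 * w for _ in range(50)]
iterates = ons(grads, w0=[1.0, -1.0], beta=1.0, D=1.0)
```

A full implementation would also project $w'_{t+1}$ onto $W$ in the norm induced by $A_{t+1}$, which requires solving a small quadratic program per round.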
Regret Bound

Theorem 2 of [Hazan et al., 2007]: Assume that for all $t$, the loss function $f_t: W \subseteq \mathbb{R}^d \mapsto \mathbb{R}$ is $\alpha$-exp-concave and has the property that
\[ \|\nabla f_t(w)\|_2 \leq G, \quad \forall w \in W, \; t \in [T] \]
Then, Online Newton Step with $\beta = \frac{1}{2} \min\left\{ \frac{1}{4GD}, \alpha \right\}$ has the following regret bound
\[ \sum_{t=1}^T f_t(w_t) - \min_{w \in W} \sum_{t=1}^T f_t(w) \leq 5 \left( \frac{1}{\alpha} + GD \right) d \log T \]
Online-to-batch Conversion

Statistical Assumption
$f_1, f_2, \ldots$ are sampled independently from $\mathbb{P}$. Define $F(w) = \mathbb{E}_{f \sim \mathbb{P}}[f(w)]$.

Regret Bound
\[ \sum_{t=1}^T f_t(w_t) - \sum_{t=1}^T f_t(w) \leq C \]
Taking expectation over both sides, we have
\[ \mathbb{E}\left[ \sum_{t=1}^T F(w_t) \right] - T F(w) \leq C \]

Risk Bound
\[ \mathbb{E}[F(\bar{w})] - F(w) \leq \frac{C}{T}, \quad \text{where } \bar{w} = \frac{1}{T} \sum_{t=1}^T w_t \]
(by Jensen's inequality, since $F$ is convex)
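The conversion is just "run the online algorithm on i.i.d. losses and average the iterates"; a sketch under hypothetical stochastic losses $f_t(w) = (w - z_t)^2$ with $z_t \sim \mathrm{Uniform}[0, 1]$, whose risk $F(w) = \mathbb{E}[(w - z)^2]$ is minimized at $w = \mathbb{E}[z] = 1/2$:

```python
import random

random.seed(2)
T = 2000
z = [random.random() for _ in range(T)]            # i.i.d. samples from P = Uniform[0, 1]

w, iterates = 0.0, []
for t in range(1, T + 1):
    iterates.append(w)
    g = 2 * (w - z[t - 1])                         # gradient of f_t at w_t
    w = min(1.0, max(0.0, w - g / t ** 0.5))       # OGD step eta_t = 1/sqrt(t), project onto [0, 1]

w_bar = sum(iterates) / T                          # the averaged iterate returned by the conversion
# F is minimized at 1/2, so w_bar should land near 0.5
```

The averaged iterate `w_bar` is the point whose expected risk the conversion bounds by $C/T$.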
Examples of Conversion

Convex Functions
\[ \mathbb{E}[F(\bar{w})] - F(w) \leq \frac{1}{\sqrt{T}} \left( \frac{D^2}{2} + G^2 \right) \]
High-probability bound: [Cesa-Bianchi et al., 2002]

Strongly Convex Functions
\[ \mathbb{E}[F(\bar{w})] - F(w) \leq \frac{G^2 (\log T + 1)}{2 \lambda T} \]
High-probability bound: [Kakade and Tewari, 2009]

Exponentially Concave Functions
\[ \mathbb{E}[F(\bar{w})] - F(w) \leq 5 \left( \frac{1}{\alpha} + GD \right) \frac{d \log T}{T} \]
High-probability bound: [Mahdavi et al., 2015]
Extensions

Faster Rates [Srebro et al., 2010]
\[ \sum_{t=1}^T f_t(w_t) - \min_{w \in W} \sum_{t=1}^T f_t(w) \leq O\left( \sqrt{T f^*} \right) \]
where $f^* = \min_{w \in W} \frac{1}{T} \sum_{t=1}^T f_t(w)$

Dynamic Regret [Zinkevich, 2003, Yang et al., 2016]
\[ R(u_1, \ldots, u_T) = \sum_{t=1}^T f_t(w_t) - \sum_{t=1}^T f_t(u_t) \]

Adaptive Regret [Hazan and Seshadhri, 2007, Daniely et al., 2015]
\[ R(T, \tau) = \max_{[s, s+\tau-1] \subseteq [T]} \left( \sum_{t=s}^{s+\tau-1} f_t(w_t) - \min_{w \in W} \sum_{t=s}^{s+\tau-1} f_t(w) \right) \]
References I

Cesa-Bianchi, N., Conconi, A., and Gentile, C. (2002). On the generalization ability of on-line learning algorithms. In Advances in Neural Information Processing Systems 14, pages 359–366.

Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, Learning, and Games. Cambridge University Press.

Daniely, A., Gonen, A., and Shalev-Shwartz, S. (2015). Strongly adaptive online learning. In Proceedings of the 32nd International Conference on Machine Learning.

Hazan, E., Agarwal, A., and Kale, S. (2007). Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169–192.

Hazan, E. and Seshadhri, C. (2007). Adaptive algorithms for online decision problems. Electronic Colloquium on Computational Complexity, 88.

Kakade, S. M. and Tewari, A. (2009). On the generalization ability of online strongly convex programming algorithms. In Advances in Neural Information Processing Systems 21, pages 801–808.
References II

Karp, R. M. (1992). On-line algorithms versus off-line algorithms: How much is it worth to know the future? In Proceedings of the IFIP 12th World Computer Congress on Algorithms, Software, Architecture: Information Processing, pages 416–429.

Mahdavi, M., Zhang, L., and Jin, R. (2015). Lower and upper bounds on the generalization of stochastic exponentially concave optimization. In Proceedings of the 28th Conference on Learning Theory.

Shalev-Shwartz, S. (2011). Online learning and online convex optimization. Foundations and Trends in Machine Learning, 4(2):107–194.

Srebro, N., Sridharan, K., and Tewari, A. (2010). Optimistic rates for learning with a smooth loss. ArXiv e-prints, arXiv:1009.3896.

Yang, T., Zhang, L., Jin, R., and Yi, J. (2016). Tracking slowly moving clairvoyant: Optimal dynamic regret of online learning with true and noisy gradient. In Proceedings of the 33rd International Conference on Machine Learning.
References III

Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th International Conference on Machine Learning, pages 928–936.