Online Prediction Peter Bartlett


1 Online Prediction
Peter Bartlett
Statistics and EECS, UC Berkeley, and Mathematical Sciences, Queensland University of Technology

2 Online Prediction
Repeated game: at each round $t$, the decision method plays $a_t \in A$, and the world reveals $\ell_t \in L$. Cumulative loss: $\hat L_n = \sum_{t=1}^n \ell_t(a_t)$. Aim to minimize regret, that is, perform well compared to the best (in retrospect) from some class:
\[ \text{regret} = \underbrace{\sum_{t=1}^n \ell_t(a_t)}_{\hat L_n} - \underbrace{\min_{a \in A} \sum_{t=1}^n \ell_t(a)}_{L_n^*}. \]
Data can be adversarially chosen.

3 Online Prediction
Minimax regret is the value of the game:
\[ \min_{a_1} \max_{\ell_1} \cdots \min_{a_n} \max_{\ell_n} \left( \sum_{t=1}^n \ell_t(a_t) - \min_{a \in A} \sum_{t=1}^n \ell_t(a) \right). \]

4 Online Prediction: Motivations
1. The adversarial model is often appropriate, e.g., in computer security and computational finance.
2. The adversarial model assumes little: it is often straightforward to convert a strategy for an adversarial environment to a method for a probabilistic environment.
3. Studying the adversarial model can reveal the deterministic core of a statistical problem: there are strong similarities between the performance guarantees in the two cases.
4. There are significant overlaps in the design of methods for the two problems: regularization plays a central role, and many online prediction strategies have a natural interpretation as a Bayesian method.

5 Computer Security: Spam Detection

6 Computer Security: Spam Detection
Here, the action $a_t$ might be a classification rule, and $\ell_t$ is the indicator for a particular email being incorrectly classified (e.g., spam allowed through). The sender can determine if an email is delivered (or detected as spam), and try to modify it. An adversarial model allows an arbitrary sequence. We cannot hope for good classification accuracy in an absolute sense; regret is relative to a comparison class. Minimizing regret ensures that the spam detection accuracy is close to the best performance in retrospect on the particular sequence.

7 Computer Security: Spam Detection
Suppose we consider features of email messages from some set $X$ (e.g., information about the header, about words in the message, about attachments). The decision method's action $a_t$ is a mapping from $X$ to $[0,1]$ (think of the value as an estimated probability that the message is spam). At each round, the adversary chooses a feature vector $x_t \in X$ and a label $y_t \in \{0,1\}$, and the loss is defined as $\ell_t(a_t) = (y_t - a_t(x_t))^2$. The regret is then the excess squared error, over the best achievable on the data sequence:
\[ \sum_{t=1}^n \ell_t(a_t) - \min_{a \in A} \sum_{t=1}^n \ell_t(a) = \sum_{t=1}^n (y_t - a_t(x_t))^2 - \min_{a \in A} \sum_{t=1}^n (y_t - a(x_t))^2. \]

8 Computational Finance: Portfolio Optimization

9 Computational Finance: Portfolio Optimization
Aim to choose a portfolio (distribution over financial instruments) to maximize utility. Other market players can profit from making our decisions bad ones. For example, if our trades have a market impact, someone can front-run (trade ahead of us). Here, the action $a_t$ is a distribution on instruments, and $\ell_t$ might be the negative logarithm of the portfolio's increase, $-\log(a_t \cdot r_t)$, where $r_t$ is the vector of relative price increases. We might compare our performance to the best stock (distribution is a delta function), or a set of indices (distribution corresponds to Dow Jones Industrial Average, etc.), or the set of all distributions.

10 Computational Finance: Portfolio Optimization
The decision method's action $a_t$ is a distribution on the $m$ instruments, $a_t \in \Delta_m = \{a \in [0,1]^m : \sum_i a_i = 1\}$. At each round, the adversary chooses a vector of returns $r_t \in \mathbb{R}^m_+$; the $i$th component is the ratio of the price of instrument $i$ at time $t$ to its price at the previous time, and the loss is defined as $\ell_t(a_t) = -\log(a_t \cdot r_t)$. The regret is then the log of the ratio of the maximum value the portfolio would have at the end (for the best mixture choice) to the final portfolio value:
\[ \sum_{t=1}^n \ell_t(a_t) - \min_{a \in A} \sum_{t=1}^n \ell_t(a) = \max_{a \in A} \sum_{t=1}^n \log(a \cdot r_t) - \sum_{t=1}^n \log(a_t \cdot r_t). \]

11 Online Prediction: Motivations
2. Online algorithms are also effective in probabilistic settings. It is easy to convert an online algorithm to a batch algorithm, and easy to show that good online performance implies good i.i.d. performance, for example.

12 Online Prediction: Motivations 3. Understanding statistical prediction methods. Many statistical methods, based on probabilistic assumptions, can be effective in an adversarial setting. Analyzing their performance in adversarial settings provides perspective on their robustness. We would like violations of the probabilistic assumptions to have a limited impact.

13 Key Points
Online prediction: a repeated game; aim to minimize regret; data can be adversarially chosen.
Motivations: often appropriate (security, finance); algorithms also effective in probabilistic settings; can provide insight into statistical prediction methods.

14 Synopsis
A finite comparison class: $A = \{1, \ldots, m\}$. An easy start.
Online, adversarial versus batch, probabilistic: similar bounds.
Optimal regret: dual game; Rademacher averages for probabilistic; sequential Rademacher averages for adversarial.
Online convex optimization: regularization methods.

15 Synopsis
A finite comparison class: $A = \{1, \ldots, m\}$.
Online, adversarial versus batch, probabilistic.
Optimal regret.
Online convex optimization.

16 Finite Comparison Class
1. Prediction with expert advice.
2. With perfect predictions: $\log m$ regret.
3. Exponential weights strategy: $\sqrt{n \log m}$ regret.
4. Refinements and extensions: exponential weights and $L^* = 0$; $n$ unknown; $L^*$ unknown; convex (versus linear) losses; Bayesian interpretation.
5. Probabilistic prediction with a finite class.

17 Prediction with Expert Advice
Suppose we are predicting whether it will rain tomorrow. We have access to a set of $m$ experts, who each make a forecast of 0 or 1. Can we ensure that we predict almost as well as the best expert? Here, $A = \{1, \ldots, m\}$. There are $m$ experts, and each has a forecast sequence $f_1^i, f_2^i, \ldots$ from $\{0,1\}$. At round $t$, the adversary chooses an outcome $y_t \in \{0,1\}$, and sets
\[ \ell_t(i) = 1[f_t^i \ne y_t] = \begin{cases} 1 & \text{if } f_t^i \ne y_t, \\ 0 & \text{otherwise.} \end{cases} \]

18 Online Prediction
Minimax regret is the value of the game:
\[ \min_{a_1} \max_{\ell_1} \cdots \min_{a_n} \max_{\ell_n} \left( \sum_{t=1}^n \ell_t(a_t) - \min_{a \in A} \sum_{t=1}^n \ell_t(a) \right), \]
where $\hat L_n = \sum_{t=1}^n \ell_t(a_t)$ and $L_n^* = \min_{a \in A} \sum_{t=1}^n \ell_t(a)$.

19 Prediction with Expert Advice
An easier game: suppose that the adversary is constrained to choose the sequence $y_t$ so that some expert incurs no loss ($L_n^* = 0$), that is, there is an $i \in \{1, \ldots, m\}$ such that for all $t$, $y_t = f_t^i$. How should we predict?

20 Prediction with Expert Advice: Halving
Define the set of experts who have been correct so far:
\[ C_t = \{i : \ell_1(i) = \cdots = \ell_{t-1}(i) = 0\}. \]
Choose $a_t$ as any element of
\[ \left\{ i : f_t^i = \text{majority}\left( \{f_t^j : j \in C_t\} \right) \right\}. \]
Theorem. This strategy has regret no more than $\log_2 m$.
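A minimal Python sketch of the halving strategy (the array layout, function name, and test data are my own illustration, not from the slides):

```python
import numpy as np

def halving(forecasts, outcomes):
    """Halving strategy: predict with the majority vote of the experts
    that have made no mistakes so far.  forecasts: (n, m) 0/1 array;
    outcomes: length-n 0/1 array.  Returns the number of mistakes."""
    n, m = forecasts.shape
    alive = np.ones(m, dtype=bool)        # C_t: experts with zero loss so far
    mistakes = 0
    for t in range(n):
        prediction = int(2 * forecasts[t, alive].sum() >= alive.sum())
        mistakes += int(prediction != outcomes[t])
        alive &= forecasts[t] == outcomes[t]   # eliminate experts that erred
    return mistakes

# One perfect expert among m = 64: at most log2(64) = 6 mistakes.
rng = np.random.default_rng(0)
n, m = 100, 64
outcomes = rng.integers(0, 2, size=n)
forecasts = rng.integers(0, 2, size=(n, m))
forecasts[:, 0] = outcomes
assert halving(forecasts, outcomes) <= np.log2(m)
```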

21 Prediction with Expert Advice: Halving
Theorem. The halving strategy has regret no more than $\log_2 m$.
Proof. If it makes a mistake (that is, $\ell_t(a_t) = 1$), then the minority of $\{f_t^j : j \in C_t\}$ is correct, so at least half of the experts are eliminated: $|C_{t+1}| \le |C_t| / 2$. And otherwise $|C_{t+1}| \le |C_t|$ (because $|C_t|$ never increases). Thus,
\[ \hat L_n = \sum_{t=1}^n \ell_t(a_t) \le \log_2 \frac{|C_1|}{|C_{n+1}|} = \log_2 m - \log_2 |C_{n+1}| \le \log_2 m. \]

22 Prediction with Expert Advice
The proof follows a pattern we shall see again: find some measure of progress (here, $|C_t|$) that changes monotonically when excess loss is incurred (here, it halves), and is somehow constrained (here, it cannot fall below 1, because there is an expert who predicts perfectly). What if there is no perfect expert?

23 Finite Comparison Class
1. Prediction with expert advice.
2. With perfect predictions: $\log m$ regret.
3. Exponential weights strategy: $\sqrt{n \log m}$ regret.
4. Refinements and extensions: exponential weights and $L^* = 0$; $n$ unknown; $L^*$ unknown; convex (versus linear) losses; Bayesian interpretation.
5. Probabilistic prediction with a finite class.

24 Prediction with Expert Advice: Mixed Strategies
We have $m$ experts. Allow a mixed strategy, that is, $a_t$ chosen from the simplex $\Delta_m$, the set of distributions on $\{1, \ldots, m\}$:
\[ \Delta_m = \left\{ a \in [0,1]^m : \sum_{i=1}^m a_i = 1 \right\}. \]
We can think of the strategy as choosing an element of $\{1, \ldots, m\}$ randomly, according to a distribution $a_t$. Or we can think of it as playing an element $a_t$ of $\Delta_m$, and incurring the expected loss,
\[ \ell_t(a_t) = \sum_{i=1}^m a_t^i \, \ell_t(e_i), \]
where $\ell_t(e_i) \in [0,1]$ is the loss incurred by expert $i$. ($e_i$ denotes the vector with a single 1 in the $i$th coordinate, and the rest zeros.)

25 Prediction with Expert Advice: Exponential Weights
Maintain a set of (unnormalized) weights over experts:
\[ w_1^i = 1, \qquad w_{t+1}^i = w_t^i \exp(-\eta \, \ell_t(e_i)). \]
Here, $\eta > 0$ is a parameter of the algorithm. Choose $a_t$ as the normalized vector,
\[ a_t = \frac{w_t}{\sum_{i=1}^m w_t^i}. \]
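The exponential weights strategy is only a few lines of code. A hedged Python sketch (the names and log-domain implementation detail are mine); the final lines check the regret bound from the next slide on random data:

```python
import numpy as np

def exponential_weights(losses, eta):
    """Exponential weights over m experts.  losses: (n, m) array with
    entries in [0, 1].  Returns the sequence of mixed strategies a_t."""
    n, m = losses.shape
    log_w = np.zeros(m)                  # log w_1^i = 0, i.e. w_1^i = 1
    strategies = np.empty((n, m))
    for t in range(n):
        w = np.exp(log_w - log_w.max()) # work in the log domain for stability
        strategies[t] = w / w.sum()      # a_t = w_t / sum_i w_t^i
        log_w -= eta * losses[t]         # w_{t+1}^i = w_t^i exp(-eta l_t(e_i))
    return strategies

# Check the regret bound of the next slide on random [0, 1] losses.
rng = np.random.default_rng(1)
n, m = 1000, 10
losses = rng.random((n, m))
a = exponential_weights(losses, np.sqrt(8 * np.log(m) / n))
regret = (a * losses).sum() - losses.sum(axis=0).min()
assert regret <= np.sqrt(n * np.log(m) / 2)
```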

26 Prediction with Expert Advice: Exponential Weights
Theorem. The exponential weights strategy with parameter $\eta = \sqrt{8 \ln m / n}$ has regret satisfying
\[ \hat L_n - L_n^* \le \sqrt{\frac{n \ln m}{2}}. \]

27 Exponential Weights: Proof Idea
We use a measure of progress: $W_t = \sum_{i=1}^m w_t^i$.
1. $W_n$ grows at least as $\exp\left( -\eta \min_i \sum_t \ell_t(e_i) \right)$.
2. $W_n$ grows no faster than $\exp\left( -\eta \sum_t \ell_t(a_t) \right)$.

28 Exponential Weights: Proof 1
\[ \ln \frac{W_{n+1}}{W_1} = \ln \left( \sum_{i=1}^m w_{n+1}^i \right) - \ln m = \ln \left( \sum_{i=1}^m \exp\Big( -\eta \sum_t \ell_t(e_i) \Big) \right) - \ln m \ge \ln \left( \max_i \exp\Big( -\eta \sum_t \ell_t(e_i) \Big) \right) - \ln m = -\eta \min_i \sum_t \ell_t(e_i) - \ln m = -\eta L_n^* - \ln m. \]

29 Exponential Weights: Proof 2
\[ \ln \frac{W_{t+1}}{W_t} = \ln \left( \frac{\sum_i \exp(-\eta \ell_t(e_i)) \, w_t^i}{\sum_i w_t^i} \right) \le -\eta \, \frac{\sum_i \ell_t(e_i) w_t^i}{\sum_i w_t^i} + \frac{\eta^2}{8} = -\eta \, \ell_t(a_t) + \frac{\eta^2}{8}, \]
where we have used Hoeffding's inequality: for a random variable $X \in [a,b]$ and $\lambda \in \mathbb{R}$,
\[ \ln \left( \mathbb{E} e^{\lambda X} \right) \le \lambda \mathbb{E} X + \frac{\lambda^2 (b-a)^2}{8}. \]

30 Aside: Proof of Hoeffding's Inequality
Define
\[ A(\lambda) = \log \mathbb{E} e^{\lambda X} = \log \int e^{\lambda x} \, dP(x), \]
where $X \sim P$. Then $A$ is the log normalization of the exponential family random variable $X_\lambda$ with reference measure $P$ and sufficient statistic $x$. Since $P$ has bounded support, $A(\lambda) < \infty$ for all $\lambda$, and we know that $A'(\lambda) = \mathbb{E}(X_\lambda)$ and $A''(\lambda) = \text{Var}(X_\lambda)$. Since $P$ has support in $[a,b]$, $\text{Var}(X_\lambda) \le (b-a)^2/4$. Then a Taylor expansion about $\lambda = 0$ (where $X_\lambda$ has the same distribution as $X$) gives
\[ A(\lambda) \le \lambda \mathbb{E} X + \frac{\lambda^2}{8} (b-a)^2. \]

31 Exponential Weights: Proof
Thus,
\[ -\eta L_n^* - \ln m \le \ln \frac{W_{n+1}}{W_1} \le -\eta \hat L_n + \frac{n \eta^2}{8} \quad \Longrightarrow \quad \hat L_n - L_n^* \le \frac{\ln m}{\eta} + \frac{\eta n}{8}. \]
Choosing the optimal $\eta$ gives the result:
Theorem. The exponential weights strategy with parameter $\eta = \sqrt{8 \ln m / n}$ has regret no more than $\sqrt{n \ln m / 2}$.

32 Key Points
For a finite set of actions (experts):
If one action is perfect (i.e., has zero loss), the halving algorithm gives per round regret of $\frac{\ln m}{n}$.
Exponential weights gives per round regret of $O\left( \sqrt{\frac{\ln m}{n}} \right)$.

33 Prediction with Expert Advice: Refinements
1. Does the exponential weights strategy give the faster rate if $L^* = 0$?
2. Do we need to know $n$ to set $\eta$?

34 Prediction with Expert Advice: Refinements
1. Does the exponential weights strategy give the faster rate if $L^* = 0$? Replace Hoeffding,
\[ \ln \mathbb{E} e^{\lambda X} \le \lambda \mathbb{E} X + \frac{\lambda^2}{8}, \]
with
\[ \ln \mathbb{E} e^{\lambda X} \le (e^\lambda - 1) \, \mathbb{E} X \]
(for $X \in [0,1]$: a linear upper bound on $e^{\lambda x}$).

35 Exponential Weights: Proof 2
Thus
\[ \ln \frac{W_{t+1}}{W_t} = \ln \left( \frac{\sum_i \exp(-\eta \ell_t(e_i)) \, w_t^i}{\sum_i w_t^i} \right) \le \left( e^{-\eta} - 1 \right) \ell_t(a_t), \]
so
\[ \hat L_n \le \frac{\eta}{1 - e^{-\eta}} \, L_n^* + \frac{\ln m}{1 - e^{-\eta}}. \]
For example, if $L_n^* = 0$ and $\eta$ is large, we obtain a regret bound of roughly $\ln m$ again. And $\eta$ large is like the halving algorithm (it puts equal weight on all experts that have zero loss so far).

36 Prediction with Expert Advice: Refinements
2. Do we need to know $n$ to set $\eta$? We used the optimal setting $\eta = \sqrt{8 \ln m / n}$. But can this regret bound be achieved uniformly across time? Yes: using a time-varying $\eta_t = \sqrt{8 \ln m / t}$ gives the same rate (worse constants). It is also possible to set $\eta$ as a function of $L_t^*$, the best cumulative loss so far, to give the improved bound for small losses uniformly across time (worse constants).

37 Prediction with Expert Advice: Refinements
3. We could work with arbitrary convex losses on $\Delta_m$. We defined the loss as linear in $a$: $\ell_t(a) = \sum_i a_i \ell_t(e_i)$. We could replace this with any bounded convex function on $\Delta_m$. The only change in the proof is that an equality becomes an inequality:
\[ -\eta \, \frac{\sum_i \ell_t(e_i) w_t^i}{\sum_i w_t^i} \le -\eta \, \ell_t(a_t). \]

38 Prediction with Expert Advice: Refinements
But note that the exponential weights strategy only competes with the corners of the simplex:
Theorem. For convex functions $\ell_t : \Delta_m \to [0,1]$, the exponential weights strategy, with $\eta = \sqrt{8 \ln m / n}$, satisfies
\[ \sum_{t=1}^n \ell_t(a_t) \le \min_i \sum_{t=1}^n \ell_t(e_i) + \sqrt{\frac{n \ln m}{2}}. \]

39 Bayesian Interpretation
We can interpret the exponential weights strategy as Bayesian prediction:
1. Exponential weights is equivalent to a Bayesian update with outcome vector $y$ and model $p(y \mid j) = h(y) \exp(-\eta y^j)$.
2. An easy regret bound for Bayesian prediction shows that its regret, wrt the scaled log loss
\[ \ell_{\text{Bayes}}(p, y) = -\frac{1}{\eta} \log \mathbb{E}_{J \sim p} \exp(-\eta y^J), \]
is no more than $(1/\eta) \log m$.
3. This convex loss matches the linear loss at the corners of the simplex, and (from Hoeffding) differs from the linear loss by no more than $\eta/8$.
4. This implies the earlier regret bound for exponential weights.

40 Bayesian Update
Parameter space: $\Theta$. Outcome space: $Y$. Assume the joint
\[ p(\theta, y) = \underbrace{\pi(\theta)}_{\text{prior}} \, \underbrace{p(y \mid \theta)}_{\text{likelihood}}. \]
Predictive distribution:
\[ \hat p_{t+1}(y) = p(y \mid y_1, \ldots, y_t) = \int p(y \mid \theta) \, \underbrace{p(\theta \mid y_1, \ldots, y_t)}_{p_{t+1}(\theta)} \, d\theta. \]
Update $p_t$ on $\Theta$:
\[ p_1(\theta) = \pi(\theta), \qquad p_{t+1}(\theta) = \frac{p_t(\theta) \, p(y_t \mid \theta)}{\int p_t(\theta') \, p(y_t \mid \theta') \, d\theta'}. \]
Cumulative log loss: $\sum_t -\log \hat p_t(y_t)$.

41 Bayesian Interpretation
Suppose that the likelihood is
\[ p(y \mid j) = h(y) \exp(-\eta y^j), \]
for $j = 1, \ldots, m$ and $y \in \mathbb{R}^m$. Then the Bayes update is
\[ p_{t+1}(j) = \frac{1}{Z} \, p_t(j) \exp(-\eta y_t^j) \]
(where $Z$ is the normalization). If $y_t^j$ is the loss of expert $j$, this is the exponential weights algorithm.

42 Performance of Bayesian Prediction
For the log loss, Bayesian prediction competes with any $\theta$, provided that the prior probability of performance better than $\theta$ is not too small.
Theorem. For any $\pi$, any sequence $y_1, \ldots, y_n$, and any $\theta \in \Theta$,
\[ \hat L_n \le L_n(\theta) - \ln \pi\left( \{\theta' : L_n(\theta') \le L_n(\theta)\} \right). \]

43 Performance of Bayesian Prediction: Proof
First,
\[ \underbrace{\hat p_1(y_1) \cdots \hat p_n(y_n)}_{\exp(-\hat L_n)} = p(y_1) \, p(y_2 \mid y_1) \cdots p(y_n \mid y_1, \ldots, y_{n-1}) = p(y_1, \ldots, y_n) = \int_\Theta \underbrace{p(y_1 \mid \theta) \cdots p(y_n \mid \theta)}_{\exp(-L_n(\theta))} \, d\pi(\theta), \]
hence
\[ \exp(-\hat L_n) = \int_\Theta \exp(-L_n(\theta)) \, d\pi(\theta) \ge \int_S \exp(-L_n(\theta)) \, d\pi(\theta) \ge \exp(-L_n(\theta_0)) \, \pi(S), \]
where $S = \{\theta \in \Theta : L_n(\theta) \le L_n(\theta_0)\}$. Thus $\hat L_n \le L_n(\theta_0) - \ln \pi(S)$.

44 Performance of Bayesian Prediction
For the log loss, Bayesian prediction competes with any $\theta$, provided that the prior probability of performance better than $\theta$ is not too small.
Theorem. For any $\pi$, any sequence $y_1, \ldots, y_n$, and any $\theta \in \Theta$,
\[ \hat L_n \le L_n(\theta) - \ln \pi\left( \{\theta' : L_n(\theta') \le L_n(\theta)\} \right). \]
So if $\pi(i) = 1/m$, the exponential weights strategy's log loss is within $\log m$ of optimal. But what is the log loss here?

45 Bayesian Interpretation
For a posterior $p_t$ on $\{1, \ldots, m\}$, the predicted probability is
\[ \hat p_t(y) = \mathbb{E}_{J \sim p_t} \, p(y \mid J) = c(y) \, \mathbb{E}_{J \sim p_t} \exp(-\eta y^J), \]
so setting the loss as the negative log of the predicted probability is equivalent to defining the loss of a posterior $p_t$ with outcome $y_t$ as $\eta \, \ell_{\text{Bayes}}$ with
\[ \ell_{\text{Bayes}}(p_t, y_t) = -\frac{1}{\eta} \log \left( \mathbb{E}_{J \sim p_t} \exp(-\eta y_t^J) \right) \]
(ignoring additive constants). Compare to the linear loss, $\ell(p_t, y_t) = \mathbb{E}_{J \sim p_t} y_t^J$. These are equal at the corners of the simplex: $\ell_{\text{Bayes}}(\delta_j, y) = \ell(\delta_j, y) = y^j$.

46 Bayesian Interpretation
The theorem shows that, with this loss and any prior,
\[ \hat L_{\text{Bayes},n} \le \min_j \left( \sum_{t=1}^n \ell(e_j, y_t) - \frac{\log \pi(j)}{\eta} \right). \]
But this is not the linear loss:
\[ \ell_{\text{Bayes}}(p_t, y_t) = -\frac{1}{\eta} \log \left( \mathbb{E}_{J \sim p_t} \exp(-\eta y_t^J) \right) \quad \text{versus} \quad \ell(p_t, y_t) = \mathbb{E}_{J \sim p_t} y_t^J. \]
They coincide at the corners, $p_t = e_j$, and $\ell_{\text{Bayes}}$ is convex. What is the gap in Jensen's inequality?

47 Bayesian Interpretation
\[ \ell_{\text{Bayes}}(p_t, y_t) = -\frac{1}{\eta} \log \left( \mathbb{E}_{J \sim p_t} \exp(-\eta y_t^J) \right), \qquad \ell(p_t, y_t) = \mathbb{E}_{J \sim p_t} y_t^J. \]
Hoeffding's inequality for $X \in [a,b]$,
\[ A(-\eta) = \log \left( \mathbb{E} \exp(-\eta X) \right) \le -\eta \, \mathbb{E} X + \frac{\eta^2}{8} (b-a)^2, \]
implies, as before,
\[ \hat L_n \le \hat L_{\text{Bayes},n} + \frac{\eta n}{8} \le \min_j \left( \sum_t \ell(e_j, y_t) - \frac{\log \pi(j)}{\eta} \right) + \frac{\eta n}{8} = \min_j \sum_t \ell(e_j, y_t) + \frac{\log m}{\eta} + \frac{\eta n}{8}. \]

48 Bayesian Interpretation
We can interpret the exponential weights strategy as Bayesian prediction:
1. Exponential weights is equivalent to a Bayesian update with model $p(y \mid j) = h(y) \exp(-\eta y^j)$.
2. An easy regret bound for Bayesian prediction shows that its regret wrt the scaled log loss $\ell_{\text{Bayes}}(p, y) = -\frac{1}{\eta} \log \mathbb{E}_{J \sim p} \exp(-\eta y^J)$ is no more than $(1/\eta) \log m$.
3. This convex loss matches the linear loss at the corners of the simplex, and (from Hoeffding) differs from the linear loss by no more than $\eta/8$.
4. This implies the earlier regret bound for exponential weights.

49 Finite Comparison Class
1. Prediction with expert advice.
2. With perfect predictions: $\log m$ regret.
3. Exponential weights strategy: $\sqrt{n \log m}$ regret.
4. Refinements and extensions: exponential weights and $L^* = 0$; $n$ unknown; $L^*$ unknown; convex (versus linear) losses; Bayesian interpretation.
5. Probabilistic prediction with a finite class.

50 Probabilistic Prediction Setting
Let's consider a probabilistic formulation of a prediction problem. There is a sample of size $n$ drawn i.i.d. from an unknown probability distribution $P$ on $X \times Y$: $(X_1, Y_1), \ldots, (X_n, Y_n)$. Some method chooses $\hat f : X \to Y$. It suffers regret
\[ \mathbb{E} \ell(\hat f(X), Y) - \min_{f \in F} \mathbb{E} \ell(f(X), Y). \]
Here, $F$ is a class of functions from $X$ to $Y$.

51 Probabilistic Setting: Zero Loss
Theorem. If some $f \in F$ has $\mathbb{E} \ell(f(X), Y) = 0$, then choosing
\[ \hat f \in C_n = \{f \in F : \hat{\mathbb{E}} \ell(f(X), Y) = 0\} \]
leads to regret that is
\[ O\left( \frac{\log |F|}{n} \right). \]

52 Probabilistic Setting: Zero Loss
Proof.
\[ \Pr(\mathbb{E} \ell(\hat f) \ge \epsilon) \le \Pr\left( \exists f \in F : \hat{\mathbb{E}} \ell(f) = 0, \; \mathbb{E} \ell(f) \ge \epsilon \right) \le |F| (1 - \epsilon)^n \le |F| e^{-n\epsilon}. \]
Integrating the tail bound $\Pr\left( n \, \mathbb{E} \ell(\hat f) \ge \ln |F| + x \right) \le e^{-x}$ gives $\mathbb{E} \ell(\hat f) \le c \ln |F| / n$.

53 Probabilistic Setting
Theorem. Choosing $\hat f$ to minimize the empirical risk, $\hat{\mathbb{E}} \ell(f(X), Y)$, leads to regret that is
\[ O\left( \sqrt{\frac{\log |F|}{n}} \right). \]

54 Probabilistic Setting
Proof. By the triangle inequality and the definition of $\hat f$,
\[ \mathbb{E} \ell_{\hat f} - \min_{f \in F} \mathbb{E} \ell_f \le 2 \, \mathbb{E} \sup_{f \in F} \left| \hat{\mathbb{E}} \ell_f - \mathbb{E} \ell_f \right|, \]
and
\[ \mathbb{E} \sup_{f \in F} \frac{1}{n} \sum_t \left( \ell(Y_t, f(X_t)) - P\ell(Y, f(X)) \right) \le \mathbb{E} \sup_{f \in F} \frac{1}{n} \sum_t \left( \ell(Y_t', f(X_t')) - \ell(Y_t, f(X_t)) \right) = \mathbb{E} \sup_{f \in F} \frac{1}{n} \sum_t \epsilon_t \left( \ell(Y_t', f(X_t')) - \ell(Y_t, f(X_t)) \right) \le 2 \, \mathbb{E} \sup_{f \in F} \frac{1}{n} \sum_t \epsilon_t \, \ell(Y_t, f(X_t)), \]
where $(X_t', Y_t')$ are independent, with the same distribution as $(X, Y)$, and the $\epsilon_t$ are independent Rademacher (uniform $\pm 1$) random variables.

55 Aside: Rademacher Averages of a Finite Class
Theorem. For $V \subseteq \mathbb{R}^n$,
\[ \mathbb{E} \max_{v \in V} \sum_{i=1}^n \epsilon_i v_i \le \sqrt{2 \ln |V|} \, \max_{v \in V} \|v\|. \]
Proof idea: Hoeffding's inequality.
\[ \exp\left( \lambda \, \mathbb{E} \max_v \sum_i \epsilon_i v_i \right) \le \mathbb{E} \exp\left( \lambda \max_v \sum_i \epsilon_i v_i \right) \le \sum_v \mathbb{E} \exp\left( \lambda \sum_i \epsilon_i v_i \right) = \sum_v \prod_i \mathbb{E} \exp(\lambda \epsilon_i v_i) \le \sum_v \exp\left( \lambda^2 \sum_i v_i^2 / 2 \right) \le |V| \exp\left( \lambda^2 \max_v \|v\|^2 / 2 \right). \]
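A quick Monte Carlo illustration of this finite-class bound (entirely my own construction, with a random Gaussian class $V$):

```python
import numpy as np

# Empirical E max_{v in V} sum_i eps_i v_i versus sqrt(2 ln|V|) max_v ||v||.
rng = np.random.default_rng(2)
n, num_v, trials = 50, 20, 2000
V = rng.standard_normal((num_v, n))
eps = rng.choice([-1.0, 1.0], size=(trials, n))
lhs = (eps @ V.T).max(axis=1).mean()
rhs = np.sqrt(2 * np.log(num_v)) * np.linalg.norm(V, axis=1).max()
print(f"empirical {lhs:.2f} <= bound {rhs:.2f}")
```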

56 Probabilistic Setting
\[ \mathbb{E} \ell_{\hat f} - \min_{f \in F} \mathbb{E} \ell_f \le 4 \, \mathbb{E} \sup_{f \in F} \frac{1}{n} \sum_t \epsilon_t \, \ell_f(X_t, Y_t) \le 4 \max_{X_i, Y_i, f} \left\| \left( \ell_f(X_i, Y_i) \right)_i \right\| \frac{\sqrt{2 \log |F|}}{n} \le 4 \sqrt{\frac{2 \log |F|}{n}}. \]

57 Probabilistic Setting: Key Points
For a finite function class:
If one function has zero loss, choosing $\hat f$ to minimize the empirical risk, $\hat{\mathbb{E}} \ell(f(X), Y)$, gives per round regret of $\frac{\ln |F|}{n}$.
In any case, $\hat f$ has per round regret of $O\left( \sqrt{\frac{\ln |F|}{n}} \right)$.
The same as the adversarial setting.

58 Synopsis
A finite comparison class: $A = \{1, \ldots, m\}$.
1. Prediction with expert advice.
2. With perfect predictions: $\log m$ regret.
3. Exponential weights strategy: $\sqrt{n \log m}$ regret.
4. Refinements and extensions.
5. Probabilistic prediction with a finite class.
Online, adversarial versus batch, probabilistic.
Optimal regret.
Online convex optimization.

59 Online to Batch Conversion
Suppose we have an online strategy that, given observations $\ell_1, \ldots, \ell_{t-1}$, produces $a_t = A(\ell_1, \ldots, \ell_{t-1})$. Can we convert this to a method that is suitable for a probabilistic setting? That is, if the $\ell_t$ are chosen i.i.d., can we use $A$'s choices $a_t$ to come up with an $\hat a \in A$ so that
\[ \mathbb{E} \ell_1(\hat a) - \min_{a \in A} \mathbb{E} \ell_1(a) \]
is small? Consider the following simple randomized method:
1. Pick $T$ uniformly from $\{0, \ldots, n\}$.
2. Let $\hat a = A(\ell_{T+1}, \ldots, \ell_n)$.
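A sketch of this randomized conversion in Python (the `strategy` interface is my own framing of $A(\ell_1, \ldots, \ell_{t-1})$, and the follow-the-leader example is purely illustrative):

```python
import numpy as np

def online_to_batch(strategy, losses, rng):
    """Randomized conversion from the slides: pick T uniformly from
    {0, ..., n} and output a_hat = A(l_{T+1}, ..., l_n).  `strategy`
    maps a list of observed losses to the next action."""
    n = len(losses)
    T = rng.integers(0, n + 1)          # uniform on {0, ..., n}
    return strategy(losses[T:])

# Example with a toy follow-the-leader strategy over two actions.
rng = np.random.default_rng(3)
losses = list(rng.random((500, 2)))     # i.i.d. loss vectors l_t
ftl = lambda ls: int(np.argmin(np.sum(ls, axis=0))) if ls else 0
a_hat = online_to_batch(ftl, losses, rng)
```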

60 Online to Batch Conversion
Theorem. If $A$ has a regret bound of $C_{n+1}$ for sequences of length $n+1$, then for any stationary process generating the $\ell_1, \ldots, \ell_{n+1}$, this method satisfies
\[ \mathbb{E} \ell_{n+1}(\hat a) - \min_{a \in A} \mathbb{E} \ell_n(a) \le \frac{C_{n+1}}{n+1}. \]
(Notice that the expectation averages also over the randomness of the method.)

61 Online to Batch Conversion
Proof.
\[ \mathbb{E} \ell_{n+1}(\hat a) = \mathbb{E} \ell_{n+1}(A(\ell_{T+1}, \ldots, \ell_n)) = \mathbb{E} \frac{1}{n+1} \sum_{t=0}^n \ell_{n+1}(A(\ell_{t+1}, \ldots, \ell_n)) = \mathbb{E} \frac{1}{n+1} \sum_{t=0}^n \ell_{n-t+1}(A(\ell_1, \ldots, \ell_{n-t})) = \mathbb{E} \frac{1}{n+1} \sum_{t=1}^{n+1} \ell_t(A(\ell_1, \ldots, \ell_{t-1})) \le \mathbb{E} \frac{1}{n+1} \left( \min_a \sum_{t=1}^{n+1} \ell_t(a) + C_{n+1} \right) \le \min_a \mathbb{E} \ell_{n+1}(a) + \frac{C_{n+1}}{n+1}, \]
where the third equality uses stationarity (so $\mathbb{E} \ell_t(a)$ does not depend on $t$).

62 Online to Batch Conversion
The theorem is for the expectation over the randomness of the method. For a high probability result, we could
1. Choose $\hat a = \frac{1}{n} \sum_{t=1}^n a_t$, provided $A$ is convex and the $\ell_t$ are all convex.
2. Choose
\[ \hat a = \arg\min_{a_t} \left( \frac{1}{n-t} \sum_{s=t+1}^n \ell_s(a_t) + c \sqrt{\frac{\log(n/\delta)}{n-t}} \right). \]
In both cases, the analysis involves concentration of martingales.

63 Key Point: Online to Batch Conversion
An online strategy with regret bound $C_n$ can be converted to a batch method. The regret per trial in the probabilistic setting is bounded by the regret per trial in the adversarial setting.

64 Synopsis
A finite comparison class: $A = \{1, \ldots, m\}$.
Online, adversarial versus batch, probabilistic.
Optimal regret:
1. Dual game.
2. Rademacher averages and sequential Rademacher averages.
3. Linear games.
Online convex optimization.

65 Optimal Regret
Joint work with Jake Abernethy, Alekh Agarwal, Sasha Rakhlin, Karthik Sridharan and Ambuj Tewari.
We have a set of actions $A$ and a set of loss functions $L$. At time $t$, the Player chooses an action $a_t$ from $A$, the Adversary chooses $\ell_t : A \to \mathbb{R}$ from $L$, and the Player incurs loss $\ell_t(a_t)$. Regret is the value of the game:
\[ V_n(A, L) = \inf_{a_1} \sup_{\ell_1} \cdots \inf_{a_n} \sup_{\ell_n} \left( \sum_{t=1}^n \ell_t(a_t) - \inf_{a \in A} \sum_{t=1}^n \ell_t(a) \right). \]

66 Optimal Regret: Dual Game
Theorem. If $A$ is compact and all $\ell_t$ are convex, continuous functions, then
\[ V_n(A, L) = \sup_P \mathbb{E} \left( \sum_{t=1}^n \inf_{a_t \in A} \mathbb{E} \left[ \ell_t(a_t) \mid \ell_1, \ldots, \ell_{t-1} \right] - \inf_{a \in A} \sum_{t=1}^n \ell_t(a) \right), \]
where the supremum is over joint distributions $P$ over sequences $\ell_1, \ldots, \ell_n$ in $L^n$.
As we'll see, this follows from a minimax theorem. Dual game: the adversary plays first by choosing $P$. The value of the game is the difference between the minimal conditional expected loss and the minimal empirical loss. If $P$ were i.i.d., this would be the difference between the minimal expected loss and the minimal empirical loss.

67 Optimal Regret: Extensions
We could replace $L^n$ by a set of sequences of loss functions: $\ell_1 \in L_1$, $\ell_2 \in L_2(\ell_1)$, $\ell_3 \in L_3(\ell_1, \ell_2)$, ..., $\ell_n \in L_n(\ell_1, \ell_2, \ldots, \ell_{n-1})$. That is, the constraints on the adversary's choice $\ell_t$ could depend on the previous choices $\ell_1, \ldots, \ell_{t-1}$. We can ensure convexity of the $\ell_t$ by allowing mixed strategies: replace $A$ by the set of probability distributions $P$ on $A$ and replace $\ell(a)$ by $\mathbb{E}_{a \sim P} \ell(a)$.

68 Dual Game: Proof Idea
Theorem (Sion, 1957). If $A$ is compact and for every $b \in B$, $f(\cdot, b)$ is a convex-like¹, lower semi-continuous function, and for every $a \in A$, $f(a, \cdot)$ is concave-like, then
\[ \inf_{a \in A} \sup_{b \in B} f(a, b) = \sup_{b \in B} \inf_{a \in A} f(a, b). \]
We'll define $B$ as the set of probability distributions on $L$ and $f(a, b) = c + \mathbb{E}[\ell(a) + \phi(\ell)]$, and we'll assume that $A$ is compact and each $\ell \in L$ is convex and continuous.
¹Convex-like [Fan, 1953]: $\forall a_1, a_2 \in A$, $\alpha \in [0,1]$, $\exists a \in A$: $\alpha \ell(a_1) + (1-\alpha) \ell(a_2) \ge \ell(a)$.

69 Dual Game: Proof Idea
\[ V_n(A, L) = \inf_{a_1} \sup_{\ell_1} \cdots \inf_{a_n} \sup_{\ell_n} \left( \sum_t \ell_t(a_t) - \inf_{a \in A} \sum_t \ell_t(a) \right) = \inf_{a_1} \sup_{\ell_1} \cdots \inf_{a_n} \sup_{P_n} \mathbb{E} \left( \sum_t \ell_t(a_t) - \inf_{a \in A} \sum_t \ell_t(a) \right), \]
because allowing mixed strategies does not help the adversary.

70 Dual Game: Proof Idea
\[ V_n(A, L) = \inf_{a_1} \sup_{\ell_1} \cdots \inf_{a_n} \sup_{P_n} \mathbb{E} \left( \sum_t \ell_t(a_t) - \inf_{a \in A} \sum_t \ell_t(a) \right) = \inf_{a_1} \sup_{\ell_1} \cdots \inf_{a_{n-1}} \sup_{\ell_{n-1}} \sup_{P_n} \left( \inf_{a_n} \mathbb{E} \left( \sum_t \ell_t(a_t) - \inf_{a \in A} \sum_t \ell_t(a) \right) \right), \]
by Sion's generalization of von Neumann's minimax theorem.

71 Dual Game: Proof Idea
\[ V_n(A, L) = \inf_{a_1} \sup_{\ell_1} \cdots \inf_{a_{n-1}} \sup_{P_{n-1}} \mathbb{E} \sup_{P_n} \left( \sum_{t=1}^{n-1} \ell_t(a_t) + \inf_{a_n} \mathbb{E}[\ell_n(a_n) \mid \ell_1, \ldots, \ell_{n-1}] - \mathbb{E} \left[ \inf_{a \in A} \sum_{t=1}^n \ell_t(a) \,\Big|\, \ell_1, \ldots, \ell_{n-1} \right] \right), \]
splitting the sum and allowing the adversary a mixed strategy at round $n-1$.

72 Dual Game: Proof Idea
\[ V_n(A, L) = \inf_{a_1} \sup_{\ell_1} \cdots \sup_{P_{n-1}} \inf_{a_{n-1}} \mathbb{E} \sup_{P_n} \left( \sum_{t=1}^{n-1} \ell_t(a_t) + \inf_{a_n} \mathbb{E}[\ell_n(a_n) \mid \ell_1, \ldots, \ell_{n-1}] - \mathbb{E} \left[ \inf_{a \in A} \sum_{t=1}^n \ell_t(a) \,\Big|\, \ell_1, \ldots, \ell_{n-1} \right] \right), \]
applying Sion's minimax theorem again.

73 Dual Game: Proof Idea
Repeating for rounds $n-2, \ldots, 1$ gives
\[ V_n(A, L) = \sup_P \mathbb{E} \left( \sum_{t=1}^n \inf_{a_t} \mathbb{E}[\ell_t(a_t) \mid \ell_1, \ldots, \ell_{t-1}] - \inf_{a \in A} \sum_{t=1}^n \ell_t(a) \right). \]

74 Optimal Regret Dual game. Rademacher averages and sequential Rademacher averages. Linear games.

75 Prediction in Probabilistic Settings
i.i.d. $(X, Y), (X_1, Y_1), \ldots, (X_n, Y_n) \sim P$ on $X \times Y$. Use the data $(X_1, Y_1), \ldots, (X_n, Y_n)$ to choose $f_n : X \to A$ with small risk,
\[ R(f_n) = P \ell(Y, f_n(X)), \]
ideally not much larger than the minimum risk over some comparison class $F$:
\[ \text{excess risk} = R(f_n) - \inf_{f \in F} R(f). \]

76 Tools for the analysis of probabilistic problems
For $f_n = \arg\min_{f \in F} \frac{1}{n} \sum_{t=1}^n \ell(Y_t, f(X_t))$,
\[ R(f_n) - \inf_{f \in F} P\ell(Y, f(X)) \le 2 \sup_{f \in F} \left| \frac{1}{n} \sum_{t=1}^n \ell(Y_t, f(X_t)) - P\ell(Y, f(X)) \right|. \]
So the supremum of an empirical process, indexed by $F$, gives an upper bound on the excess risk.

77 Tools for the analysis of probabilistic problems
Typically, this supremum is concentrated about its expectation,
\[ P \sup_{f \in F} \frac{1}{n} \sum_t \left( \ell(Y_t, f(X_t)) - P\ell(Y, f(X)) \right) \le \mathbb{E} \sup_{f \in F} \frac{1}{n} \sum_t \epsilon_t \left( \ell(Y_t', f(X_t')) - \ell(Y_t, f(X_t)) \right) \le 2 \, \mathbb{E} \sup_{f \in F} \frac{1}{n} \sum_t \epsilon_t \, \ell(Y_t, f(X_t)), \]
where $(X_t', Y_t')$ are independent, with the same distribution as $(X, Y)$, and the $\epsilon_t$ are independent Rademacher (uniform $\pm 1$) random variables.

78 Tools for the analysis of probabilistic problems
That is, for $f_n = \arg\min_{f \in F} \frac{1}{n} \sum_t \ell(Y_t, f(X_t))$, with high probability,
\[ R(f_n) - \inf_{f \in F} P\ell(Y, f(X)) \le c \, \mathbb{E} \sup_{f \in F} \frac{1}{n} \sum_t \epsilon_t \, \ell(Y_t, f(X_t)), \]
where the $\epsilon_t$ are independent Rademacher (uniform $\pm 1$) random variables. Rademacher averages capture the complexity of $\{(x, y) \mapsto \ell(y, f(x)) : f \in F\}$: they measure how well functions align with a random $(\epsilon_1, \ldots, \epsilon_n)$. Rademacher averages are a key tool in the analysis of many statistical methods: they are related to covering numbers (Dudley) and combinatorial dimensions (Vapnik-Chervonenkis, Pollard), for example. A related result applies in the online setting...

79 Optimal Regret and Sequential Rademacher Averages
Theorem.
\[ V_n(A, L) \le 2 \sup_{\ell_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{\ell_n} \mathbb{E}_{\epsilon_n} \sup_{a \in A} \sum_{t=1}^n \epsilon_t \, \ell_t(a), \]
where $\epsilon_1, \ldots, \epsilon_n$ are independent Rademacher (uniform $\pm 1$-valued) random variables. Compare to the bound involving Rademacher averages in the probabilistic setting:
\[ \text{excess risk} \le c \, \mathbb{E} \sup_{f \in F} \frac{1}{n} \sum_t \epsilon_t \, \ell(Y_t, f(X_t)). \]
In the adversarial case, the choice of $\ell_t$ is deterministic, and can depend on $\epsilon_1, \ldots, \epsilon_{t-1}$.

80 Sequential Rademacher Averages: Proof Idea
\[ V_n(A, L) = \sup_P \mathbb{E} \left( \sum_t \inf_{a_t \in A} \mathbb{E}[\ell_t(a_t) \mid \ell_1, \ldots, \ell_{t-1}] - \inf_{a \in A} \sum_t \ell_t(a) \right) \le \sup_P \mathbb{E} \left( \sum_t \left( \mathbb{E}[\ell_t(\hat a) \mid \ell_1, \ldots, \ell_{t-1}] - \ell_t(\hat a) \right) \right), \]
where $\hat a$ minimizes $\sum_t \ell_t(a)$.

81 Sequential Rademacher Averages: Proof Idea
\[ V_n(A, L) \le \sup_P \mathbb{E} \left( \sum_t \left( \mathbb{E}[\ell_t(\hat a) \mid \ell_1, \ldots, \ell_{t-1}] - \ell_t(\hat a) \right) \right) \le \sup_P \mathbb{E} \sup_{a \in A} \sum_t \left( \mathbb{E}[\ell_t(a) \mid \ell_1, \ldots, \ell_{t-1}] - \ell_t(a) \right). \]

82 Sequential Rademacher Averages: Proof Idea
\[ V_n(A, L) \le \sup_P \mathbb{E} \sup_{a \in A} \sum_t \left( \mathbb{E}[\ell_t(a) \mid \ell_1, \ldots, \ell_{t-1}] - \ell_t(a) \right) = \sup_P \mathbb{E} \sup_{a \in A} \sum_t \left( \mathbb{E}[\ell_t'(a) \mid \ell_1, \ldots, \ell_n] - \ell_t(a) \right), \]
where $\ell_t'$ is a tangent sequence: $\ell_t'$ is conditionally independent of $\ell_t$ given $\ell_1, \ldots, \ell_{t-1}$, with the same conditional distribution.

83 Sequential Rademacher Averages: Proof Idea
\[ V_n(A, L) \le \sup_P \mathbb{E} \sup_{a \in A} \sum_t \left( \mathbb{E}[\ell_t'(a) \mid \ell_1, \ldots, \ell_n] - \ell_t(a) \right) \le \sup_P \mathbb{E} \sup_{a \in A} \sum_t \left( \ell_t'(a) - \ell_t(a) \right), \]
moving the supremum inside the expectation.

84 Sequential Rademacher Averages: Proof Idea
\[ V_n(A, L) \le \sup_P \mathbb{E} \sup_{a \in A} \sum_t \left( \ell_t'(a) - \ell_t(a) \right) = \sup_P \mathbb{E} \sup_{a \in A} \left( \sum_{t=1}^{n-1} \left( \ell_t'(a) - \ell_t(a) \right) + \epsilon_n \left( \ell_n'(a) - \ell_n(a) \right) \right), \]
for $\epsilon_n \in \{-1, 1\}$, since $\ell_n'$ has the same conditional distribution, given $\ell_1, \ldots, \ell_{n-1}$, as $\ell_n$.

85 Sequential Rademacher Averages: Proof Idea
\[ V_n(A, L) \le \sup_P \mathbb{E}_{\ell_1, \ldots, \ell_{n-1}} \mathbb{E}_{\ell_n, \ell_n'} \mathbb{E}_{\epsilon_n} \sup_{a \in A} \left( \sum_{t=1}^{n-1} \left( \ell_t'(a) - \ell_t(a) \right) + \epsilon_n \left( \ell_n'(a) - \ell_n(a) \right) \right) \le \sup_P \mathbb{E}_{\ell_1, \ldots, \ell_{n-1}} \sup_{\ell_n, \ell_n'} \mathbb{E}_{\epsilon_n} \sup_{a \in A} \left( \sum_{t=1}^{n-1} \left( \ell_t'(a) - \ell_t(a) \right) + \epsilon_n \left( \ell_n'(a) - \ell_n(a) \right) \right). \]

86 Sequential Rademacher Averages: Proof Idea
Repeating for rounds $n-1, \ldots, 1$:
\[ V_n(A, L) \le \sup_{\ell_1, \ell_1'} \mathbb{E}_{\epsilon_1} \cdots \sup_{\ell_n, \ell_n'} \mathbb{E}_{\epsilon_n} \sup_{a \in A} \sum_{t=1}^n \epsilon_t \left( \ell_t'(a) - \ell_t(a) \right). \]

87 Sequential Rademacher Averages: Proof Idea
\[ V_n(A, L) \le \sup_{\ell_1, \ell_1'} \mathbb{E}_{\epsilon_1} \cdots \sup_{\ell_n, \ell_n'} \mathbb{E}_{\epsilon_n} \sup_{a \in A} \sum_t \epsilon_t \left( \ell_t'(a) - \ell_t(a) \right) \le 2 \sup_{\ell_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{\ell_n} \mathbb{E}_{\epsilon_n} \sup_{a \in A} \sum_t \epsilon_t \, \ell_t(a), \]
since the two sums are identical in distribution ($\epsilon_t$ and $-\epsilon_t$ have the same distribution).

88 Optimal Regret and Sequential Rademacher Averages
Theorem.
\[ V_n(A, L) \le 2 \sup_{\ell_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{\ell_n} \mathbb{E}_{\epsilon_n} \sup_{a \in A} \sum_{t=1}^n \epsilon_t \, \ell_t(a), \]
where $\epsilon_1, \ldots, \epsilon_n$ are independent Rademacher (uniform $\pm 1$-valued) random variables. Compare to the bound involving Rademacher averages in the probabilistic setting:
\[ \text{excess risk} \le c \, \mathbb{E} \sup_{f \in F} \frac{1}{n} \sum_t \epsilon_t \, \ell(Y_t, f(X_t)). \]
In the adversarial case, the choice of $\ell_t$ is deterministic, and can depend on $\epsilon_1, \ldots, \epsilon_{t-1}$: sequential Rademacher averages.

89 Sequential Rademacher Averages: Example
Consider step functions on $\mathbb{R}$:
\[ f_a : x \mapsto 1[x \ge a], \qquad \ell_a(y, x) = 1[f_a(x) \ne y], \qquad L = \{a \mapsto 1[f_a(x) \ne y] : x \in \mathbb{R}, \; y \in \{0, 1\}\}. \]
Fix a distribution on $\mathbb{R} \times \{0, 1\}$, and consider the Rademacher averages,
\[ \mathbb{E} \sup_{a \in \mathbb{R}} \sum_t \epsilon_t \, \ell_a(Y_t, X_t). \]

90 Rademacher Averages: Example
For step functions on $\mathbb{R}$, the Rademacher averages are:
\[ \mathbb{E} \sup_{a \in \mathbb{R}} \sum_t \epsilon_t \, \ell_a(Y_t, X_t) = \mathbb{E} \sup_{a \in \mathbb{R}} \sum_t \epsilon_t \, \ell_a(1, X_t) \le \sup_{x_t} \mathbb{E} \sup_{a \in \mathbb{R}} \sum_t \epsilon_t \, 1[x_t < a] = \mathbb{E} \max_{0 \le i \le n+1} \sum_{t \le i} \epsilon_t = O(\sqrt{n}). \]

91 Sequential Rademacher Averages: Example
Consider the sequential Rademacher averages:
\[ \sup_{\ell_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{\ell_n} \mathbb{E}_{\epsilon_n} \sup_a \sum_t \epsilon_t \, \ell_t(a) = \sup_{x_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{x_n} \mathbb{E}_{\epsilon_n} \sup_a \sum_t \epsilon_t \, 1[x_t < a]. \]
If $\epsilon_t = 1$, we'd like to choose $a$ such that $x_t < a$. If $\epsilon_t = -1$, we'd like to choose $a$ such that $x_t \ge a$.

92 Sequential Rademacher Averages: Example
The sequential Rademacher averages are
\[ \sup_{x_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{x_n} \mathbb{E}_{\epsilon_n} \sup_a \sum_t \epsilon_t \, 1[x_t < a]. \]
We can choose $x_1 = 0$ and, for $t = 2, \ldots, n$,
\[ x_t = \sum_{i=1}^{t-1} 2^{-i} \epsilon_i = x_{t-1} + 2^{-(t-1)} \epsilon_{t-1}. \]
Then if we set $a = x_n + 2^{-n} \epsilon_n$, we have
\[ \epsilon_t \, 1[x_t < a] = \begin{cases} 1 & \text{if } \epsilon_t = 1, \\ 0 & \text{otherwise,} \end{cases} \]
which is maximal.
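A small simulation (my own) of this adversarial construction, confirming that the inner sum equals the number of rounds with $\epsilon_t = 1$:

```python
import numpy as np

# Simulate the adversarial choice of x_t: x_1 = 0,
# x_t = x_{t-1} + 2^{-(t-1)} eps_{t-1}, and a = x_n + 2^{-n} eps_n.
rng = np.random.default_rng(4)
n = 30
eps = rng.choice([-1, 1], size=n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = x[t - 1] + 2.0 ** (-t) * eps[t - 1]
a = x[-1] + 2.0 ** (-n) * eps[-1]
# eps_t * 1[x_t < a] is 1 exactly when eps_t = +1, so the sum is ~ n/2.
assert sum(e * (xt < a) for e, xt in zip(eps, x)) == (eps == 1).sum()
```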

93 Sequential Rademacher Averages: Example
So the sequential Rademacher averages are
\[ \sup_{\ell_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{\ell_n} \mathbb{E}_{\epsilon_n} \sup_a \sum_t \epsilon_t \, \ell_t(a) = \mathbb{E} \sum_{t=1}^n 1[\epsilon_t = 1] = \frac{n}{2}. \]
Compare with the Rademacher averages:
\[ \mathbb{E} \sup_{a \in \mathbb{R}} \sum_t \epsilon_t \, \ell_a(Y_t, X_t) = O(\sqrt{n}). \]

94 Optimal Regret Dual game. Rademacher averages and sequential Rademacher averages. Linear games.

95 Optimal Regret: Linear Games
Loss is $\ell(a) = \langle c, a \rangle$. Examples:
Online linear optimization: $L$ is a set of bounded linear functions on the bounded set $A \subseteq \mathbb{R}^d$.
Prediction with expert advice: $A = \Delta_m$, $L = [0,1]^m$.

96 Optimal Regret: Linear Games
Theorem. For the linear loss class $\{\ell(a) = \langle c, a \rangle : c \in C\}$, the regret satisfies
\[ V_n(A, L) \le 2 \sup_{\{Z_t\} \in M_{C-C}} \mathbb{E} \sup_{a \in A} \langle Z_n, a \rangle, \]
where $M_{C-C}$ is the set of martingales with differences in $C - C$. If $C$ is symmetric ($-C = C$), then
\[ V_n(A, L) \ge \sup_{\{Z_t\} \in M_C} \mathbb{E} \sup_{a \in A} \langle Z_n, a \rangle. \]

97 Linear Games: Proof Idea
The sequential Rademacher averages can be written
\[ \sup_{\ell_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{\ell_n} \mathbb{E}_{\epsilon_n} \sup_a \sum_t \epsilon_t \, \ell_t(a) = \sup_{c_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{c_n} \mathbb{E}_{\epsilon_n} \sup_a \left\langle \sum_t \epsilon_t c_t, a \right\rangle \le \sup_{\{Z_t\} \in M_{C-C}} \mathbb{E} \sup_a \langle Z_n, a \rangle, \]
where $M_{C-C}$ is the set of martingales with differences in $C - C$.

98 Linear Games: Proof Idea
For the lower bound, consider the duality result:
\[ V_n(A, L) = \sup_P \mathbb{E} \left( \sum_t \inf_{a_t \in A} \mathbb{E}[\langle c_t, a_t \rangle \mid c_1, \ldots, c_{t-1}] - \inf_{a \in A} \sum_t \langle c_t, a \rangle \right), \]
where the supremum is over joint distributions of sequences $c_1, \ldots, c_n$. If we restrict $P$ to the set $M_C$ of martingales with differences in $C = -C$, the first term is zero and we have
\[ V_n(A, L) \ge \sup_{\{Z_t\} \in M_C} \mathbb{E} \left( -\inf_{a \in A} \langle Z_n, a \rangle \right) = \sup_{\{Z_t\} \in M_C} \mathbb{E} \sup_{a \in A} \langle Z_n, a \rangle, \]
using the symmetry of the class (if $\{Z_t\} \in M_C$ then $\{-Z_t\} \in M_C$).

99 Linear Games: Convex Duality
[Sham Kakade, Karthik Sridharan and Ambuj Tewari, 2009]
Theorem. For the linear loss class $\{\ell(a) = \langle c, a \rangle : c \in C\}$, the regret satisfies
\[ V_n(A, L) \le 2 \sup_{\{Z_t\} \in M_{C-C}} \mathbb{E} \sup_{a \in A} \langle Z_n, a \rangle. \]
The linear criterion brings to mind a dual norm. Suppose we have a norm $\|\cdot\|$ defined on $C$. Then we can view $A$ as a subset of the dual space of linear functions on $C$, with the dual norm
\[ \|a\|_* = \sup\{\langle c, a \rangle : c \in C, \; \|c\| \le 1\}. \]

100 Linear Games: Convex Duality
We will measure the size of $L$ using $\sup_{c \in C} \|c\|$. We will measure the size of $A$ using $\sup_{a \in A} R(a)$, for a strongly convex $R$. We'll call a function $R : A \to \mathbb{R}$ $\sigma$-strongly convex wrt $\|\cdot\|_*$ if for all $a, b \in A$ and $\alpha \in [0,1]$,
\[ R(\alpha a + (1 - \alpha) b) \le \alpha R(a) + (1 - \alpha) R(b) - \frac{\sigma}{2} \alpha (1 - \alpha) \|a - b\|_*^2. \]

101 Linear Games: Convex Duality
The Legendre dual of $R$ is
\[ R^*(c) = \sup_{a \in A} \left( \langle c, a \rangle - R(a) \right). \]
If $\inf_{a \in A} R(a) = 0$, then $R^*(0) = 0$. If $R$ is $\sigma$-strongly convex, then $R^*$ is differentiable and $\sigma$-smooth wrt $\|\cdot\|$, that is, for all $c, d \in C$,
\[ R^*(c + d) \le R^*(c) + \langle \nabla R^*(c), d \rangle + \frac{1}{2\sigma} \|d\|^2. \]

102 Linear Games: Convex Duality
Theorem. For the linear loss class $\{\ell(a) = \langle c, a \rangle : c \in C\}$, if $R : A \to \mathbb{R}$ is $\sigma$-strongly convex and satisfies $\inf_{a \in A} R(a) = 0$,
\[ \sup_{c \in C} \|c\| = 1, \qquad \sup_{a \in A} R(a) = A^2, \]
then the regret satisfies
\[ V_n(A, L) \le 2 \sup_{\{Z_t\} \in M_{C-C}} \mathbb{E} \sup_{a \in A} \langle Z_n, a \rangle \le 2A \sqrt{\frac{2n}{\sigma}}. \]

103 Linear Games: Proof Idea
The definition of the Legendre dual of $R$ is
\[ R^*(c) = \sup_{a \in A} \left( \langle c, a \rangle - R(a) \right). \]
From the definition, for any $\lambda > 0$,
\[ \mathbb{E} \sup_{a \in A} \langle a, Z_n \rangle \le \mathbb{E} \sup_{a \in A} \frac{1}{\lambda} \left( R(a) + R^*(\lambda Z_n) \right) \le \frac{A^2}{\lambda} + \frac{\mathbb{E} R^*(\lambda Z_n)}{\lambda}. \]

104 Linear Games: Proof Idea
Consider the evolution of $R^*(\lambda Z_t)$. Because $R^*$ is $\sigma$-smooth,
\[ \mathbb{E}[R^*(\lambda Z_t) \mid Z_1, \ldots, Z_{t-1}] \le R^*(\lambda Z_{t-1}) + \mathbb{E}[\langle \nabla R^*(\lambda Z_{t-1}), \lambda(Z_t - Z_{t-1}) \rangle \mid Z_1, \ldots, Z_{t-1}] + \frac{1}{2\sigma} \mathbb{E}[\lambda^2 \|Z_t - Z_{t-1}\|^2 \mid Z_1, \ldots, Z_{t-1}] \le R^*(\lambda Z_{t-1}) + \frac{\lambda^2}{2\sigma}, \]
since the middle term vanishes by the martingale property. Thus $\mathbb{E} R^*(\lambda Z_n) \le n\lambda^2 / (2\sigma)$.

105 Linear Games: Proof Idea
\[ \mathbb{E} \sup_{a \in A} \langle a, Z_n \rangle \le \frac{A^2}{\lambda} + \frac{\mathbb{E} R^*(\lambda Z_n)}{\lambda} \le \frac{A^2}{\lambda} + \frac{n\lambda}{2\sigma} = A \sqrt{\frac{2n}{\sigma}} \]
for $\lambda = \sqrt{2 A^2 \sigma / n}$.

106 Linear Games: Convex Duality
Theorem. For the linear loss class $\{\ell(a) = \langle c, a \rangle : c \in C\}$, if $R : A \to \mathbb{R}$ is $\sigma$-strongly convex and satisfies $\inf_{a \in A} R(a) = 0$,
\[ \sup_{c \in C} \|c\| = 1, \qquad \sup_{a \in A} R(a) = A^2, \]
then the regret satisfies
\[ V_n(A, L) \le 2A \sqrt{\frac{2n}{\sigma}}. \]

107 Linear Games: Examples
The $p$- and $q$-norms (with $1/p + 1/q = 1$) on $C = B_p(1)$ and $A = B_q(A)$:
\[ \|c\| = \|c\|_p, \qquad \|a\|_* = \|a\|_q, \qquad R(a) = \frac{1}{2} \|a\|_q^2 \le A^2, \qquad \sigma = 2(q-1). \]
Here, $V_n(A, L) \le 2A \sqrt{(p-1)n}$.

108 Linear Games: Examples
The $\infty$- and 1-norms on $C = [-1, 1]^d$ and $A = \Delta_d$:
\[ \|c\| = \|c\|_\infty, \qquad \|a\|_* = \|a\|_1, \qquad R(a) = \log d + \sum_i a_i \log a_i \le \log d, \qquad \sigma = 1. \]
Here, $V_n(A, L) \le \sqrt{2n \log d}$.
(The bounds in these examples are tight within small constant factors.)

109 Optimal Regret: Lower Bounds
[Sasha Rakhlin, Karthik Sridharan and Ambuj Tewari, 2010]
For the case of prediction with absolute loss, $\ell_t(a_t) = |y_t - a_t(x_t)|$, there are (almost) corresponding lower bounds:
\[ c_1 \frac{R_n(A)}{\log^{3/2} n} \le V_n \le c_2 R_n(A), \]
where
\[ R_n(A) = \sup_{x_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{x_n} \mathbb{E}_{\epsilon_n} \sup_{a \in A} \sum_t \epsilon_t \, a(x_t). \]

110 Optimal Regret: Structural Results
\[ R_n(A) = \sup_{x_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{x_n} \mathbb{E}_{\epsilon_n} \sup_{a \in A} \sum_t \epsilon_t \, a(x_t). \]
It is straightforward to verify that the following properties extend to these sequential Rademacher averages:
$A \subseteq B$ implies $R_n(A) \le R_n(B)$.
$R_n(\text{co}(A)) = R_n(A)$.
$R_n(cA) = |c| \, R_n(A)$.
$R_n(\phi \circ A) \le \|\phi\|_{\text{Lip}} \, R_n(A)$.
$R_n(A + b) = R_n(A)$.

111 Optimal Regret: Key Points
Dual game: the adversary chooses a joint distribution to maximize the difference between the minimal conditional expected loss and the minimal empirical loss.
Upper bound in terms of sequential Rademacher averages.
Linear games: bound on a martingale using a strongly convex function.

112 Synopsis
A finite comparison class: $A = \{1, \ldots, m\}$.
Online, adversarial versus batch, probabilistic.
Optimal regret.
Online convex optimization:
1. Problem formulation.
2. Empirical minimization fails.
3. Gradient algorithm.
4. Regularized minimization.
5. Regret bounds.

113 Online Convex Optimization
1. Problem formulation.
2. Empirical minimization fails.
3. Gradient algorithm.
4. Regularized minimization: Bregman divergence; regularized minimization equivalent to minimizing the latest loss plus the divergence from the previous decision; constrained minimization equivalent to unconstrained plus Bregman projection; linearization; mirror descent.
5. Regret bounds: unconstrained minimization; seeing the future; strong convexity; examples (gradient, exponentiated gradient); extensions.

114 Online Convex Optimization
$A$ = a convex subset of $\mathbb{R}^d$. $L$ = a set of convex real functions on $A$. For example:
$\ell_t(a) = (x_t \cdot a - y_t)^2$.
$\ell_t(a) = |x_t \cdot a - y_t|$.
$\ell_t(a) = -\log p_a(y_t)$ with $p_a(y) = \exp(a \cdot T(y) - A(a))$, for $A(a)$ the log normalization of this exponential family, with sufficient statistic $T(y)$.

115 Online Convex Optimization: Example
Choosing $a_t$ to minimize past losses, $a_t = \arg\min_{a \in A} \sum_{s=1}^{t-1} \ell_s(a)$, can fail. ("fictitious play", "follow the leader") Suppose $A = [-1, 1]$, $L = \{a \mapsto v \cdot a : |v| \le 1\}$. Consider the following sequence of losses (played out numerically in the sketch below):
\[ a_1 = 0, \; \ell_1(a) = \tfrac{1}{2} a; \quad a_2 = -1, \; \ell_2(a) = -a; \quad a_3 = 1, \; \ell_3(a) = a; \quad a_4 = -1, \; \ell_4(a) = -a; \quad a_5 = 1, \; \ell_5(a) = a; \; \ldots \]
$a = 0$ shows $L_n^* \le 0$, but $\hat L_n = n - 1$.
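The following Python sketch (my own) plays out this example:

```python
import numpy as np

# Follow the leader on A = [-1, 1] with l_t(a) = v_t * a,
# v = (1/2, -1, +1, -1, +1, ...): total loss n - 1, while a = 0 incurs 0.
n = 100
v = np.array([0.5] + [(-1.0) ** t for t in range(1, n)])
a, total = 0.0, 0.0                       # a_1 = 0
for t in range(n):
    total += v[t] * a                     # incur l_t(a_t)
    cum = v[: t + 1].sum()                # leader minimizes cum * a over [-1, 1]
    a = -1.0 if cum > 0 else (1.0 if cum < 0 else 0.0)
print(total)                              # n - 1 = 99
```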

116 Online Convex Optimization: Example Choosing a t to minimize past losses can fail. The strategy must avoid overfitting, just as in probabilistic settings. Similar approaches (regularization; Bayesian inference) are applicable in the online setting. First approach: gradient steps. Stay close to previous decisions, but move in a direction of improvement.

117 Online Convex Optimization: Gradient Method
\[ a_1 \in A, \qquad a_{t+1} = \Pi_A \left( a_t - \eta \nabla \ell_t(a_t) \right), \]
where $\Pi_A$ is the Euclidean projection on $A$, $\Pi_A(x) = \arg\min_{a \in A} \|x - a\|$.
Theorem. For $G = \max_t \|\nabla \ell_t(a_t)\|$ and $D = \text{diam}(A)$, the gradient strategy with $\eta = D / (G \sqrt{n})$ has regret satisfying
\[ \hat L_n - L_n^* \le G D \sqrt{n}. \]
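A minimal implementation sketch of the projected gradient strategy (function names and the linear-loss test are my own; the final check uses the theorem's bound):

```python
import numpy as np

def ogd(grad, project, a1, eta, n):
    """Projected online gradient descent:
    a_{t+1} = Pi_A(a_t - eta * grad l_t(a_t))."""
    a, actions = a1.copy(), []
    for t in range(n):
        actions.append(a.copy())
        a = project(a - eta * grad(t, a))
    return actions

# Linear losses l_t(a) = <v_t, a> on the unit Euclidean ball: G = 1, D = 2.
rng = np.random.default_rng(5)
d, n = 5, 1000
V = rng.standard_normal((n, d))
V /= np.maximum(1.0, np.linalg.norm(V, axis=1, keepdims=True))
project = lambda x: x / max(1.0, np.linalg.norm(x))
G, D = 1.0, 2.0
acts = ogd(lambda t, a: V[t], project, np.zeros(d), D / (G * np.sqrt(n)), n)
best = -np.linalg.norm(V.sum(axis=0))     # min over the ball of <sum v_t, a>
regret = sum(V[t] @ acts[t] for t in range(n)) - best
assert regret <= G * D * np.sqrt(n)
```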

118 Online Convex Optimization: Gradient Method
Theorem. For $G = \max_t \|\nabla \ell_t(a_t)\|$ and $D = \text{diam}(A)$, the gradient strategy with $\eta = D / (G \sqrt{n})$ has regret satisfying $\hat L_n - L_n^* \le G D \sqrt{n}$.
Example. $A = \{a \in \mathbb{R}^d : \|a\| \le 1\}$, $L = \{a \mapsto v \cdot a : \|v\| \le 1\}$. $D = 2$, $G \le 1$. Regret is no more than $2\sqrt{n}$. (And $O(\sqrt{n})$ is optimal.)

119 Online Convex Optimization: Gradient Method
Theorem. For $G = \max_t \|\nabla \ell_t(a_t)\|$ and $D = \text{diam}(A)$, the gradient strategy with $\eta = D / (G \sqrt{n})$ has regret satisfying $\hat L_n - L_n^* \le G D \sqrt{n}$.
Example. $A = \Delta_m$, $L = \{a \mapsto v \cdot a : \|v\|_\infty \le 1\}$. $D = \sqrt{2}$, $G \le \sqrt{m}$. Regret is no more than $\sqrt{2mn}$. Since competing with the whole simplex is equivalent to competing with the vertices (experts) for linear losses, this is worse than exponential weights ($\sqrt{m}$ versus $\sqrt{\log m}$).

120 Online Convex Optimization: Gradient Method
Proof. Define $\tilde a_{t+1} = a_t - \eta \nabla \ell_t(a_t)$ and $a_{t+1} = \Pi_A(\tilde a_{t+1})$. Fix $a \in A$ and consider the measure of progress $\|a_t - a\|$:
\[ \|a_{t+1} - a\|^2 \le \|\tilde a_{t+1} - a\|^2 = \|a_t - a\|^2 + \eta^2 \|\nabla \ell_t(a_t)\|^2 - 2\eta \nabla \ell_t(a_t) \cdot (a_t - a). \]
By convexity, $\ell_t(a_t) - \ell_t(a) \le \nabla \ell_t(a_t) \cdot (a_t - a)$, so
\[ \sum_{t=1}^n \left( \ell_t(a_t) - \ell_t(a) \right) \le \frac{\|a_1 - a\|^2 - \|a_{n+1} - a\|^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^n \|\nabla \ell_t(a_t)\|^2 \le \frac{D^2}{2\eta} + \frac{\eta n G^2}{2} = G D \sqrt{n}. \]

121 Online Convex Optimization
1. Problem formulation.
2. Empirical minimization fails.
3. Gradient algorithm.
4. Regularized minimization: Bregman divergence; regularized minimization equivalent to minimizing the latest loss plus the divergence from the previous decision; constrained minimization equivalent to unconstrained plus Bregman projection; linearization; mirror descent.
5. Regret bounds: unconstrained minimization; seeing the future; strong convexity; examples (gradient, exponentiated gradient); extensions.

122 Online Convex Optimization: A Regularization Viewpoint
Suppose $\ell_t$ is linear: $\ell_t(a) = g_t \cdot a$. Suppose $A = \mathbb{R}^d$. Then minimizing the regularized criterion
\[ a_{t+1} = \arg\min_{a \in A} \left( \eta \sum_{s=1}^t \ell_s(a) + \frac{1}{2} \|a\|^2 \right) \]
corresponds to the gradient step
\[ a_{t+1} = a_t - \eta \nabla \ell_t(a_t). \]

123 Online Convex Optimization: Regularization
Regularized minimization. Consider the family of strategies of the form:
\[ a_{t+1} = \arg\min_{a \in A} \left( \eta \sum_{s=1}^t \ell_s(a) + R(a) \right). \]
The regularizer $R : \mathbb{R}^d \to \mathbb{R}$ is strictly convex and differentiable.

124 Online Convex Optimization: Regularization
Regularized minimization:
\[ a_{t+1} = \arg\min_{a \in A} \left( \eta \sum_{s=1}^t \ell_s(a) + R(a) \right). \]
$R$ keeps the sequence of $a_t$'s stable: it diminishes $\ell_t$'s influence. We can view the choice of $a_{t+1}$ as trading off two competing forces: making $\ell_t(a_{t+1})$ small, and keeping $a_{t+1}$ close to $a_t$. This is a perspective that motivated many algorithms in the literature. We'll investigate why regularized minimization can be viewed this way.

125 Properties of Regularization Methods
In the unconstrained case ($A = \mathbb{R}^d$), regularized minimization is equivalent to minimizing the latest loss and the distance to the previous decision. The appropriate notion of distance is the Bregman divergence $D_{\Phi_{t-1}}$. Define
\[ \Phi_0 = R, \qquad \Phi_t = \Phi_{t-1} + \eta \ell_t, \]
so that
\[ a_{t+1} = \arg\min_{a \in A} \left( \eta \sum_{s=1}^t \ell_s(a) + R(a) \right) = \arg\min_{a \in A} \Phi_t(a). \]

126 Bregman Divergence
Definition. For a strictly convex, differentiable $\Phi : \mathbb{R}^d \to \mathbb{R}$, the Bregman divergence wrt $\Phi$ is defined, for $a, b \in \mathbb{R}^d$, as
\[ D_\Phi(a, b) = \Phi(a) - \left( \Phi(b) + \nabla \Phi(b) \cdot (a - b) \right). \]
$D_\Phi(a, b)$ is the difference between $\Phi(a)$ and the value at $a$ of the linear approximation of $\Phi$ about $b$.

127 Bregman Divergence
\[ D_\Phi(a, b) = \Phi(a) - \left( \Phi(b) + \nabla \Phi(b) \cdot (a - b) \right). \]
Example. For $a \in \mathbb{R}^d$, the squared Euclidean norm, $\Phi(a) = \frac{1}{2} \|a\|^2$, has
\[ D_\Phi(a, b) = \frac{1}{2} \|a\|^2 - \left( \frac{1}{2} \|b\|^2 + b \cdot (a - b) \right) = \frac{1}{2} \|a - b\|^2, \]
the squared Euclidean norm.

128 Bregman Divergence
\[ D_\Phi(a, b) = \Phi(a) - \left( \Phi(b) + \nabla \Phi(b) \cdot (a - b) \right). \]
Example. For $a \in [0, \infty)^d$, the unnormalized negative entropy, $\Phi(a) = \sum_{i=1}^d a_i (\ln a_i - 1)$, has
\[ D_\Phi(a, b) = \sum_i \left( a_i (\ln a_i - 1) - b_i (\ln b_i - 1) - \ln b_i \, (a_i - b_i) \right) = \sum_i \left( a_i \ln \frac{a_i}{b_i} + b_i - a_i \right), \]
the unnormalized KL divergence. Thus, for $a \in \Delta_d$, $\Phi(a) = \sum_i a_i \ln a_i$ has $D_\Phi(a, b) = \sum_i a_i \ln \frac{a_i}{b_i}$.
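For concreteness, a short Python sketch (my own) of these two divergences:

```python
import numpy as np

def bregman_sq(a, b):
    """D_Phi for Phi(a) = (1/2)||a||^2: half the squared Euclidean distance."""
    return 0.5 * np.sum((a - b) ** 2)

def bregman_negent(a, b):
    """D_Phi for Phi(a) = sum_i a_i (ln a_i - 1): unnormalized KL divergence."""
    return np.sum(a * np.log(a / b) + b - a)

a = np.array([0.2, 0.3, 0.5])
b = np.full(3, 1.0 / 3)
print(bregman_sq(a, b), bregman_negent(a, b))
# On the simplex, the b - a terms cancel, leaving the usual KL divergence.
```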

129 Bregman Divergence
When the domain of $\Phi$ is $A \subseteq \mathbb{R}^d$, in addition to differentiability and strict convexity, we make two more assumptions: the interior of $A$ is convex, and for a sequence approaching the boundary of $A$, $\|\nabla \Phi(a_n)\| \to \infty$. We say that such a $\Phi$ is a Legendre function.

130 Bregman Divergence
Properties:
1. $D_\Phi \ge 0$, $D_\Phi(a, a) = 0$.
2. $D_{A+B} = D_A + D_B$.
3. The Bregman projection, $\Pi_A^\Phi(b) = \arg\min_{a \in A} D_\Phi(a, b)$, is uniquely defined for closed, convex $A$.
4. Generalized Pythagoras: for closed, convex $A$, $a^* = \Pi_A^\Phi(b)$, and $a \in A$,
\[ D_\Phi(a, b) \ge D_\Phi(a, a^*) + D_\Phi(a^*, b). \]
5. $\nabla_a D_\Phi(a, b) = \nabla \Phi(a) - \nabla \Phi(b)$.
6. For $\ell$ linear, $D_{\Phi + \ell} = D_\Phi$.
7. For $\Phi^*$ the Legendre dual of $\Phi$, $\nabla \Phi^* = (\nabla \Phi)^{-1}$ and $D_\Phi(a, b) = D_{\Phi^*}(\nabla \Phi(b), \nabla \Phi(a))$.

131 Legendre Dual
For a Legendre function $\Phi : A \to \mathbb{R}$, the Legendre dual is
\[ \Phi^*(u) = \sup_{v \in A} \left( u \cdot v - \Phi(v) \right). \]
$\Phi^*$ is Legendre. $\text{dom}(\Phi^*) = \nabla \Phi(\text{int dom} \, \Phi)$. $\nabla \Phi^* = (\nabla \Phi)^{-1}$. $D_\Phi(a, b) = D_{\Phi^*}(\nabla \Phi(b), \nabla \Phi(a))$. $\Phi^{**} = \Phi$.

132 Legendre Dual
Example. For $\Phi = \frac{1}{2} \|\cdot\|_p^2$, the Legendre dual is $\Phi^* = \frac{1}{2} \|\cdot\|_q^2$, where $1/p + 1/q = 1$.
Example. For $\Phi(a) = \sum_{i=1}^d e^{a_i}$, so that $\nabla \Phi(a) = (e^{a_1}, \ldots, e^{a_d})$,
\[ (\nabla \Phi)^{-1}(u) = \nabla \Phi^*(u) = (\ln u_1, \ldots, \ln u_d), \qquad \Phi^*(u) = \sum_i u_i (\ln u_i - 1). \]

133 Online Convex Optimization
1. Problem formulation.
2. Empirical minimization fails.
3. Gradient algorithm.
4. Regularized minimization: Bregman divergence; regularized minimization equivalent to minimizing the latest loss plus the divergence from the previous decision; constrained minimization equivalent to unconstrained plus Bregman projection; linearization; mirror descent.
5. Regret bounds: unconstrained minimization; seeing the future; strong convexity; examples (gradient, exponentiated gradient); extensions.

134 Properties of Regularization Methods
In the unconstrained case ($A = \mathbb{R}^d$), regularized minimization is equivalent to minimizing the latest loss and the distance (Bregman divergence) to the previous decision.
Theorem. Define $\tilde a_1$ via $\nabla R(\tilde a_1) = 0$, and set
\[ \tilde a_{t+1} = \arg\min_{a \in \mathbb{R}^d} \left( \eta \ell_t(a) + D_{\Phi_{t-1}}(a, \tilde a_t) \right). \]
Then
\[ \tilde a_{t+1} = \arg\min_{a \in \mathbb{R}^d} \left( \eta \sum_{s=1}^t \ell_s(a) + R(a) \right). \]

135 Properties of Regularization Methods
Proof. By the definition of $\Phi_t$,
\[ \eta \ell_t(a) + D_{\Phi_{t-1}}(a, \tilde a_t) = \Phi_t(a) - \Phi_{t-1}(a) + D_{\Phi_{t-1}}(a, \tilde a_t). \]
The derivative wrt $a$ is
\[ \nabla \Phi_t(a) - \nabla \Phi_{t-1}(a) + \nabla_a D_{\Phi_{t-1}}(a, \tilde a_t) = \nabla \Phi_t(a) - \nabla \Phi_{t-1}(a) + \nabla \Phi_{t-1}(a) - \nabla \Phi_{t-1}(\tilde a_t). \]
Setting this to zero shows that
\[ \nabla \Phi_t(\tilde a_{t+1}) = \nabla \Phi_{t-1}(\tilde a_t) = \cdots = \nabla \Phi_0(\tilde a_1) = \nabla R(\tilde a_1) = 0, \]
so $\tilde a_{t+1}$ minimizes $\Phi_t$.

136 Properties of Regularization Methods
Constrained minimization is equivalent to unconstrained minimization, followed by Bregman projection:
Theorem. For
\[ a_{t+1} = \arg\min_{a \in A} \Phi_t(a), \qquad \tilde a_{t+1} = \arg\min_{a \in \mathbb{R}^d} \Phi_t(a), \]
we have $a_{t+1} = \Pi_A^{\Phi_t}(\tilde a_{t+1})$.

137 Properties of Regularization Methods
Proof. Let $a_{t+1}'$ denote $\Pi_A^{\Phi_t}(\tilde a_{t+1})$. First, by the definition of $a_{t+1}$,
\[ \Phi_t(a_{t+1}) \le \Phi_t(a_{t+1}'). \]
Conversely,
\[ D_{\Phi_t}(a_{t+1}', \tilde a_{t+1}) \le D_{\Phi_t}(a_{t+1}, \tilde a_{t+1}). \]
But $\nabla \Phi_t(\tilde a_{t+1}) = 0$, so
\[ D_{\Phi_t}(a, \tilde a_{t+1}) = \Phi_t(a) - \Phi_t(\tilde a_{t+1}), \]
and thus $\Phi_t(a_{t+1}') \le \Phi_t(a_{t+1})$.

138 Properties of Regularization Methods
Example. For linear $\ell_t$, regularized minimization is equivalent to minimizing the last loss plus the Bregman divergence wrt $R$ to the previous decision:
\[ \arg\min_{a \in A} \left( \eta \sum_{s=1}^t \ell_s(a) + R(a) \right) = \Pi_A^R \left( \arg\min_{a \in \mathbb{R}^d} \left( \eta \ell_t(a) + D_R(a, \tilde a_t) \right) \right), \]
because adding a linear function to $\Phi$ does not change $D_\Phi$.

139 Online Convex Optimization
1. Problem formulation.
2. Empirical minimization fails.
3. Gradient algorithm.
4. Regularized minimization: Bregman divergence; regularized minimization equivalent to minimizing the latest loss plus the divergence from the previous decision; constrained minimization equivalent to unconstrained plus Bregman projection; linearization; mirror descent.
5. Regret bounds: unconstrained minimization; seeing the future; strong convexity; examples (gradient, exponentiated gradient); extensions.

140 Properties of Regularization Methods: Linear Loss
We can replace $\ell_t$ by $a \mapsto \nabla \ell_t(a_t) \cdot a$, and this leads to an upper bound on regret.
Theorem. Any strategy for online linear optimization, with regret satisfying
\[ \sum_{t=1}^n g_t \cdot a_t - \min_{a \in A} \sum_{t=1}^n g_t \cdot a \le C_n(g_1, \ldots, g_n), \]
can be used to construct a strategy for online convex optimization, with regret
\[ \sum_{t=1}^n \ell_t(a_t) - \min_{a \in A} \sum_{t=1}^n \ell_t(a) \le C_n(\nabla \ell_1(a_1), \ldots, \nabla \ell_n(a_n)). \]
Proof. Convexity implies $\ell_t(a_t) - \ell_t(a) \le \nabla \ell_t(a_t) \cdot (a_t - a)$.

141 Properties of Regularization Methods: Linear Loss
Key point: we can replace $\ell_t$ by $a \mapsto \nabla \ell_t(a_t) \cdot a$, and this leads to an upper bound on regret. Thus, we can work with linear $\ell_t$.

142 Regularization Methods: Mirror Descent
Regularized minimization for linear losses can be viewed as mirror descent, taking a gradient step in a dual space:
Theorem. The decisions
\[ \tilde a_{t+1} = \arg\min_{a \in \mathbb{R}^d} \left( \eta \sum_{s=1}^t g_s \cdot a + R(a) \right) \]
can be written
\[ \tilde a_{t+1} = (\nabla R)^{-1} \left( \nabla R(\tilde a_t) - \eta g_t \right). \]
This corresponds to first mapping from $\tilde a_t$ through $\nabla R$, then taking a step in the direction $-g_t$, then mapping back through $(\nabla R)^{-1} = \nabla R^*$ to $\tilde a_{t+1}$.

143 Regularization Methods: Mirror Descent
Proof. For the unconstrained minimization, we have
\[ \nabla R(\tilde a_{t+1}) = -\eta \sum_{s=1}^t g_s, \qquad \nabla R(\tilde a_t) = -\eta \sum_{s=1}^{t-1} g_s, \]
so $\nabla R(\tilde a_{t+1}) = \nabla R(\tilde a_t) - \eta g_t$, which can be written
\[ \tilde a_{t+1} = (\nabla R)^{-1} \left( \nabla R(\tilde a_t) - \eta g_t \right). \]
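A one-step sketch of the mirror descent update (my own framing; `grad_R` and `grad_R_star` are assumed to be the maps $\nabla R$ and $\nabla R^* = (\nabla R)^{-1}$):

```python
import numpy as np

def mirror_descent_step(a, g, eta, grad_R, grad_R_star):
    """One unconstrained mirror descent step: map to the dual space via
    grad R, take a gradient step, and map back via grad R* = (grad R)^{-1}."""
    return grad_R_star(grad_R(a) - eta * g)

# With R(a) = (1/2)||a||^2 both maps are the identity, and the step
# reduces to an ordinary gradient step a - eta * g.
a, g = np.array([1.0, -2.0]), np.array([0.5, 0.5])
print(mirror_descent_step(a, g, 0.1, lambda x: x, lambda u: u))
```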

144 Online Convex Optimization
1. Problem formulation.
2. Empirical minimization fails.
3. Gradient algorithm.
4. Regularized minimization and Bregman divergences.
5. Regret bounds: unconstrained minimization; seeing the future; strong convexity; examples (gradient, exponentiated gradient); extensions.

145 Online Convex Optimization: Regularization
Regularized minimization:
\[ a_{t+1} = \arg\min_{a \in A} \left( \eta \sum_{s=1}^t \ell_s(a) + R(a) \right). \]
The regularizer $R : \mathbb{R}^d \to \mathbb{R}$ is strictly convex and differentiable.

146 Regularization Methods: Regret
Theorem. For $A = \mathbb{R}^d$, regularized minimization suffers regret against any $a \in A$ of
\[ \sum_{t=1}^n \ell_t(a_t) - \sum_{t=1}^n \ell_t(a) = \frac{D_R(a, a_1) - D_{\Phi_n}(a, a_{n+1})}{\eta} + \frac{1}{\eta} \sum_{t=1}^n D_{\Phi_t}(a_t, a_{t+1}), \]
and thus
\[ \hat L_n \le \inf_{a \in \mathbb{R}^d} \left( \sum_{t=1}^n \ell_t(a) + \frac{D_R(a, a_1)}{\eta} \right) + \frac{1}{\eta} \sum_{t=1}^n D_{\Phi_t}(a_t, a_{t+1}). \]
So the sizes of the steps $D_{\Phi_t}(a_t, a_{t+1})$ determine the regret bound.

147 Regularization Methods: Regret
Theorem. For $A = \mathbb{R}^d$, regularized minimization suffers regret
\[ \hat L_n \le \inf_{a \in \mathbb{R}^d} \left( \sum_{t=1}^n \ell_t(a) + \frac{D_R(a, a_1)}{\eta} \right) + \frac{1}{\eta} \sum_{t=1}^n D_{\Phi_t}(a_t, a_{t+1}). \]
Notice that we can write
\[ D_{\Phi_t}(a_t, a_{t+1}) = D_{\Phi_t^*}(\nabla \Phi_t(a_{t+1}), \nabla \Phi_t(a_t)) = D_{\Phi_t^*}(0, \nabla \Phi_{t-1}(a_t) + \eta \nabla \ell_t(a_t)) = D_{\Phi_t^*}(0, \eta \nabla \ell_t(a_t)). \]
So it is the size of the gradient steps, $D_{\Phi_t^*}(0, \eta \nabla \ell_t(a_t))$, that determines the regret.

148 Regularization Methods: Regret Bounds
Example. Suppose $R = \frac{1}{2} \|\cdot\|^2$. Then we have
\[ \hat L_n \le L_n^* + \frac{\|a^* - a_1\|^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^n \|g_t\|^2. \]
And if $\|g_t\| \le G$ and $\|a^* - a_1\| \le D$, choosing $\eta$ appropriately gives
\[ \hat L_n \le L_n^* + D G \sqrt{n}. \]

149 Online Convex Optimization
1. Problem formulation.
2. Empirical minimization fails.
3. Gradient algorithm.
4. Regularized minimization and Bregman divergences.
5. Regret bounds: unconstrained minimization; seeing the future; strong convexity; examples (gradient, exponentiated gradient); extensions.

150 Regularization Methods: Regret Bounds
Seeing the future gives small regret:
Theorem. For regularized minimization, that is, for
\[ a_{t+1} = \arg\min_{a \in A} \left( \eta \sum_{s=1}^t \ell_s(a) + R(a) \right), \]
we have, for all $a \in A$,
\[ \sum_{t=1}^n \ell_t(a_{t+1}) \le \sum_{t=1}^n \ell_t(a) + \frac{1}{\eta} \left( R(a) - R(a_1) \right). \]

151 Regularization Methods: Regret Bounds
Proof. Since $a_{t+1}$ minimizes $\Phi_t$,
\[ \eta \sum_{s=1}^t \ell_s(a) + R(a) \ge \eta \sum_{s=1}^t \ell_s(a_{t+1}) + R(a_{t+1}) = \eta \ell_t(a_{t+1}) + \eta \sum_{s=1}^{t-1} \ell_s(a_{t+1}) + R(a_{t+1}) \ge \eta \ell_t(a_{t+1}) + \eta \sum_{s=1}^{t-1} \ell_s(a_t) + R(a_t) \ge \cdots \ge \eta \sum_{s=1}^t \ell_s(a_{s+1}) + R(a_1). \]
Applying this with $t = n$ and rearranging gives the result.

152 Regularization Methods: Regret Bounds
Theorem. For all $a \in A$,
\[ \sum_{t=1}^n \ell_t(a_{t+1}) \le \sum_{t=1}^n \ell_t(a) + \frac{1}{\eta} \left( R(a) - R(a_1) \right). \]
Thus, if $a_t$ and $a_{t+1}$ are close, then the regret is small:
Corollary. For all $a \in A$,
\[ \sum_{t=1}^n \left( \ell_t(a_t) - \ell_t(a) \right) \le \sum_{t=1}^n \left( \ell_t(a_t) - \ell_t(a_{t+1}) \right) + \frac{1}{\eta} \left( R(a) - R(a_1) \right). \]
So how can we control the increments $\ell_t(a_t) - \ell_t(a_{t+1})$?

153 Online Convex Optimization
1. Problem formulation.
2. Empirical minimization fails.
3. Gradient algorithm.
4. Regularized minimization: Bregman divergence; regularized minimization equivalent to minimizing the latest loss plus the divergence from the previous decision; constrained minimization equivalent to unconstrained plus Bregman projection; linearization; mirror descent.
5. Regret bounds: unconstrained minimization; seeing the future; strong convexity; examples (gradient, exponentiated gradient); extensions.

154 Regularization Methods: Regret Bounds
Definition. We say $R$ is strongly convex wrt a norm $\|\cdot\|$ if, for all $a, b$,
\[ R(a) \ge R(b) + \nabla R(b) \cdot (a - b) + \frac{1}{2} \|a - b\|^2. \]
For linear losses and strongly convex regularizers, the dual norm of the gradient is small:
Theorem. If $R$ is strongly convex wrt a norm $\|\cdot\|$, and $\ell_t(a) = g_t \cdot a$, then
\[ \|a_t - a_{t+1}\| \le \eta \|g_t\|_*, \]
where $\|\cdot\|_*$ is the dual norm to $\|\cdot\|$: $\|v\|_* = \sup\{v \cdot a : a \in A, \|a\| \le 1\}$.

155 Regularization Methods: Regret Bounds
Proof.
\[ R(a_t) \ge R(a_{t+1}) + \nabla R(a_{t+1}) \cdot (a_t - a_{t+1}) + \frac{1}{2} \|a_t - a_{t+1}\|^2, \]
\[ R(a_{t+1}) \ge R(a_t) + \nabla R(a_t) \cdot (a_{t+1} - a_t) + \frac{1}{2} \|a_t - a_{t+1}\|^2. \]
Combining,
\[ \|a_t - a_{t+1}\|^2 \le \left( \nabla R(a_t) - \nabla R(a_{t+1}) \right) \cdot (a_t - a_{t+1}) \le \|\nabla R(a_t) - \nabla R(a_{t+1})\|_* \, \|a_t - a_{t+1}\|. \]
Hence,
\[ \|a_t - a_{t+1}\| \le \|\nabla R(a_t) - \nabla R(a_{t+1})\|_* = \eta \|g_t\|_*. \]

156 Regularization Methods: Regret Bounds
This leads to the regret bound:
Corollary. For linear losses, if $R$ is strongly convex wrt $\|\cdot\|$, then for all $a \in A$,
\[ \sum_{t=1}^n \left( \ell_t(a_t) - \ell_t(a) \right) \le \eta \sum_{t=1}^n \|g_t\|_*^2 + \frac{1}{\eta} \left( R(a) - R(a_1) \right). \]
Thus, for $\|g_t\|_* \le G$ and $R(a) - R(a_1) \le D^2$, choosing $\eta$ appropriately gives regret no more than $2 G D \sqrt{n}$.

157 Regularization Methods: Regret Bounds
Example. Consider $R(a) = \frac{1}{2} \|a\|^2$, $a_1 = 0$, and $A$ contained in a Euclidean ball of diameter $D$. Then $R$ is strongly convex wrt $\|\cdot\|$, and $\|\cdot\|_* = \|\cdot\|$. And the mapping between primal and dual spaces is the identity. So if $\sup_{a \in A} \|\nabla \ell_t(a)\| \le G$, then the regret is no more than $2 G D \sqrt{n}$.

158 Regularization Methods: Regret Bounds
Example. Consider $A = \Delta_m$, $R(a) = \sum_i a_i \ln a_i$. Then the mapping between primal and dual spaces is $\nabla R(a) = \ln(a)$ (component-wise). And the divergence is the KL divergence, $D_R(a, b) = \sum_i a_i \ln(a_i / b_i)$. And $R$ is strongly convex wrt $\|\cdot\|_1$. Suppose that $\|g_t\|_\infty \le 1$. Also, $R(a) - R(a_1) \le \ln m$, so the regret is no more than $2 \sqrt{n \ln m}$.

159 Regularization Methods: Regret Bounds
Example. $A = \Delta_m$, $R(a) = \sum_i a_i \ln a_i$. What are the updates?
\[ a_{t+1} = \Pi_A^R(\tilde a_{t+1}) = \Pi_A^R \left( \nabla R^* \left( \nabla R(\tilde a_t) - \eta g_t \right) \right) = \Pi_A^R \left( \nabla R^* \left( \ln(\tilde a_t e^{-\eta g_t}) \right) \right) = \Pi_A^R \left( \tilde a_t e^{-\eta g_t} \right), \]
where the $\ln$ and $\exp$ functions are applied component-wise. This is exponentiated gradient: mirror descent with $\nabla R = \ln$. It is easy to check that the projection corresponds to normalization, $\Pi_A^R(\tilde a) = \tilde a / \|\tilde a\|_1$.
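A sketch of the exponentiated gradient update in Python (my own code; the step size $\eta = \sqrt{\ln m / n}$ is the tuning that yields the $2\sqrt{n \ln m}$ bound of the previous slide):

```python
import numpy as np

def exponentiated_gradient(grads, eta):
    """Exponentiated gradient on the simplex: mirror descent with
    R(a) = sum_i a_i ln a_i; the Bregman projection is l1-normalization."""
    n, m = grads.shape
    a, actions = np.full(m, 1.0 / m), []
    for g in grads:
        actions.append(a)
        w = a * np.exp(-eta * g)        # dual step: ln a - eta g, mapped back
        a = w / w.sum()                  # projection = normalization
    return actions

# Linear losses with ||g_t||_inf <= 1: regret at most 2 sqrt(n ln m).
rng = np.random.default_rng(6)
n, m = 1000, 10
G = rng.uniform(-1, 1, size=(n, m))
acts = exponentiated_gradient(G, np.sqrt(np.log(m) / n))
regret = sum(g @ a for g, a in zip(G, acts)) - G.sum(axis=0).min()
assert regret <= 2 * np.sqrt(n * np.log(m))
```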


More information

Logarithmic Regret Algorithms for Strongly Convex Repeated Games

Logarithmic Regret Algorithms for Strongly Convex Repeated Games Logarithmic Regret Algorithms for Strongly Convex Repeated Games Shai Shalev-Shwartz 1 and Yoram Singer 1,2 1 School of Computer Sci & Eng, The Hebrew University, Jerusalem 91904, Israel 2 Google Inc 1600

More information

Machine Learning Theory (CS 6783)

Machine Learning Theory (CS 6783) Machine Learning Theory (CS 6783) Tu-Th 1:25 to 2:40 PM Hollister, 306 Instructor : Karthik Sridharan ABOUT THE COURSE No exams! 5 assignments that count towards your grades (55%) One term project (40%)

More information

Stat 260/CS Learning in Sequential Decision Problems. Peter Bartlett

Stat 260/CS Learning in Sequential Decision Problems. Peter Bartlett Stat 260/CS 294-102. Learning in Sequential Decision Problems. Peter Bartlett 1. Adversarial bandits Definition: sequential game. Lower bounds on regret from the stochastic case. Exp3: exponential weights

More information

Introduction to Algorithms / Algorithms I Lecturer: Michael Dinitz Topic: Intro to Learning Theory Date: 12/8/16

Introduction to Algorithms / Algorithms I Lecturer: Michael Dinitz Topic: Intro to Learning Theory Date: 12/8/16 600.463 Introduction to Algorithms / Algorithms I Lecturer: Michael Dinitz Topic: Intro to Learning Theory Date: 12/8/16 25.1 Introduction Today we re going to talk about machine learning, but from an

More information

Learnability, Stability, Regularization and Strong Convexity

Learnability, Stability, Regularization and Strong Convexity Learnability, Stability, Regularization and Strong Convexity Nati Srebro Shai Shalev-Shwartz HUJI Ohad Shamir Weizmann Karthik Sridharan Cornell Ambuj Tewari Michigan Toyota Technological Institute Chicago

More information

Naïve Bayes classification

Naïve Bayes classification Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss

More information

Lecture 8: Decision-making under total uncertainty: the multiplicative weight algorithm. Lecturer: Sanjeev Arora

Lecture 8: Decision-making under total uncertainty: the multiplicative weight algorithm. Lecturer: Sanjeev Arora princeton univ. F 13 cos 521: Advanced Algorithm Design Lecture 8: Decision-making under total uncertainty: the multiplicative weight algorithm Lecturer: Sanjeev Arora Scribe: (Today s notes below are

More information

THE WEIGHTED MAJORITY ALGORITHM

THE WEIGHTED MAJORITY ALGORITHM THE WEIGHTED MAJORITY ALGORITHM Csaba Szepesvári University of Alberta CMPUT 654 E-mail: szepesva@ualberta.ca UofA, October 3, 2006 OUTLINE 1 PREDICTION WITH EXPERT ADVICE 2 HALVING: FIND THE PERFECT EXPERT!

More information

Robustness and duality of maximum entropy and exponential family distributions

Robustness and duality of maximum entropy and exponential family distributions Chapter 7 Robustness and duality of maximum entropy and exponential family distributions In this lecture, we continue our study of exponential families, but now we investigate their properties in somewhat

More information

Generalization bounds

Generalization bounds Advanced Course in Machine Learning pring 200 Generalization bounds Handouts are jointly prepared by hie Mannor and hai halev-hwartz he problem of characterizing learnability is the most basic question

More information

0.1 Motivating example: weighted majority algorithm

0.1 Motivating example: weighted majority algorithm princeton univ. F 16 cos 521: Advanced Algorithm Design Lecture 8: Decision-making under total uncertainty: the multiplicative weight algorithm Lecturer: Sanjeev Arora Scribe: Sanjeev Arora (Today s notes

More information

Tutorial: PART 1. Online Convex Optimization, A Game- Theoretic Approach to Learning.

Tutorial: PART 1. Online Convex Optimization, A Game- Theoretic Approach to Learning. Tutorial: PART 1 Online Convex Optimization, A Game- Theoretic Approach to Learning http://www.cs.princeton.edu/~ehazan/tutorial/tutorial.htm Elad Hazan Princeton University Satyen Kale Yahoo Research

More information

Tutorial: PART 2. Online Convex Optimization, A Game- Theoretic Approach to Learning

Tutorial: PART 2. Online Convex Optimization, A Game- Theoretic Approach to Learning Tutorial: PART 2 Online Convex Optimization, A Game- Theoretic Approach to Learning Elad Hazan Princeton University Satyen Kale Yahoo Research Exploiting curvature: logarithmic regret Logarithmic regret

More information

CS261: A Second Course in Algorithms Lecture #11: Online Learning and the Multiplicative Weights Algorithm

CS261: A Second Course in Algorithms Lecture #11: Online Learning and the Multiplicative Weights Algorithm CS61: A Second Course in Algorithms Lecture #11: Online Learning and the Multiplicative Weights Algorithm Tim Roughgarden February 9, 016 1 Online Algorithms This lecture begins the third module of the

More information

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44 Convex Optimization Newton s method ENSAE: Optimisation 1/44 Unconstrained minimization minimize f(x) f convex, twice continuously differentiable (hence dom f open) we assume optimal value p = inf x f(x)

More information

Ad Placement Strategies

Ad Placement Strategies Case Study : Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD AdaGrad Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 7 th, 04 Ad

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

No-Regret Algorithms for Unconstrained Online Convex Optimization

No-Regret Algorithms for Unconstrained Online Convex Optimization No-Regret Algorithms for Unconstrained Online Convex Optimization Matthew Streeter Duolingo, Inc. Pittsburgh, PA 153 matt@duolingo.com H. Brendan McMahan Google, Inc. Seattle, WA 98103 mcmahan@google.com

More information

Bregman Divergence and Mirror Descent

Bregman Divergence and Mirror Descent Bregman Divergence and Mirror Descent Bregman Divergence Motivation Generalize squared Euclidean distance to a class of distances that all share similar properties Lots of applications in machine learning,

More information

Introduction to Bandit Algorithms. Introduction to Bandit Algorithms

Introduction to Bandit Algorithms. Introduction to Bandit Algorithms Stochastic K-Arm Bandit Problem Formulation Consider K arms (actions) each correspond to an unknown distribution {ν k } K k=1 with values bounded in [0, 1]. At each time t, the agent pulls an arm I t {1,...,

More information

The Algorithmic Foundations of Adaptive Data Analysis November, Lecture The Multiplicative Weights Algorithm

The Algorithmic Foundations of Adaptive Data Analysis November, Lecture The Multiplicative Weights Algorithm he Algorithmic Foundations of Adaptive Data Analysis November, 207 Lecture 5-6 Lecturer: Aaron Roth Scribe: Aaron Roth he Multiplicative Weights Algorithm In this lecture, we define and analyze a classic,

More information

Class 2 & 3 Overfitting & Regularization

Class 2 & 3 Overfitting & Regularization Class 2 & 3 Overfitting & Regularization Carlo Ciliberto Department of Computer Science, UCL October 18, 2017 Last Class The goal of Statistical Learning Theory is to find a good estimator f n : X Y, approximating

More information

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish

More information

Machine Learning Lecture 7

Machine Learning Lecture 7 Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant

More information

Agnostic Online learnability

Agnostic Online learnability Technical Report TTIC-TR-2008-2 October 2008 Agnostic Online learnability Shai Shalev-Shwartz Toyota Technological Institute Chicago shai@tti-c.org ABSTRACT We study a fundamental question. What classes

More information

Optimization, Learning, and Games with Predictable Sequences

Optimization, Learning, and Games with Predictable Sequences Optimization, Learning, and Games with Predictable Sequences Alexander Rakhlin University of Pennsylvania Karthik Sridharan University of Pennsylvania Abstract We provide several applications of Optimistic

More information

Machine Learning Theory (CS 6783)

Machine Learning Theory (CS 6783) Machine Learning Theory (CS 6783) Tu-Th 1:25 to 2:40 PM Kimball, B-11 Instructor : Karthik Sridharan ABOUT THE COURSE No exams! 5 assignments that count towards your grades (55%) One term project (40%)

More information

Adaptive Gradient Methods AdaGrad / Adam. Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade

Adaptive Gradient Methods AdaGrad / Adam. Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade Adaptive Gradient Methods AdaGrad / Adam Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade 1 Announcements: HW3 posted Dual coordinate ascent (some review of SGD and random

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline

More information

Peter Hoff Minimax estimation November 12, Motivation and definition. 2 Least favorable prior 3. 3 Least favorable prior sequence 11

Peter Hoff Minimax estimation November 12, Motivation and definition. 2 Least favorable prior 3. 3 Least favorable prior sequence 11 Contents 1 Motivation and definition 1 2 Least favorable prior 3 3 Least favorable prior sequence 11 4 Nonparametric problems 15 5 Minimax and admissibility 18 6 Superefficiency and sparsity 19 Most of

More information

l 1 -Regularized Linear Regression: Persistence and Oracle Inequalities

l 1 -Regularized Linear Regression: Persistence and Oracle Inequalities l -Regularized Linear Regression: Persistence and Oracle Inequalities Peter Bartlett EECS and Statistics UC Berkeley slides at http://www.stat.berkeley.edu/ bartlett Joint work with Shahar Mendelson and

More information

Stat 260/CS Learning in Sequential Decision Problems. Peter Bartlett

Stat 260/CS Learning in Sequential Decision Problems. Peter Bartlett Stat 260/CS 294-102. Learning in Sequential Decision Problems. Peter Bartlett 1. Multi-armed bandit algorithms. Concentration inequalities. P(X ǫ) exp( ψ (ǫ))). Cumulant generating function bounds. Hoeffding

More information

Lecture 21: Minimax Theory

Lecture 21: Minimax Theory Lecture : Minimax Theory Akshay Krishnamurthy akshay@cs.umass.edu November 8, 07 Recap In the first part of the course, we spent the majority of our time studying risk minimization. We found many ways

More information

Unconstrained Online Linear Learning in Hilbert Spaces: Minimax Algorithms and Normal Approximations

Unconstrained Online Linear Learning in Hilbert Spaces: Minimax Algorithms and Normal Approximations JMLR: Workshop and Conference Proceedings vol 35:1 20, 2014 Unconstrained Online Linear Learning in Hilbert Spaces: Minimax Algorithms and Normal Approximations H. Brendan McMahan MCMAHAN@GOOGLE.COM Google,

More information

Learning theory. Ensemble methods. Boosting. Boosting: history

Learning theory. Ensemble methods. Boosting. Boosting: history Learning theory Probability distribution P over X {0, 1}; let (X, Y ) P. We get S := {(x i, y i )} n i=1, an iid sample from P. Ensemble methods Goal: Fix ɛ, δ (0, 1). With probability at least 1 δ (over

More information

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer Adaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer Vicente L. Malave February 23, 2011 Outline Notation minimize a number of functions φ

More information

Online Learning with Predictable Sequences

Online Learning with Predictable Sequences JMLR: Workshop and Conference Proceedings vol (2013) 1 27 Online Learning with Predictable Sequences Alexander Rakhlin Karthik Sridharan rakhlin@wharton.upenn.edu skarthik@wharton.upenn.edu Abstract We

More information

Online Prediction: Bayes versus Experts

Online Prediction: Bayes versus Experts Marcus Hutter - 1 - Online Prediction Bayes versus Experts Online Prediction: Bayes versus Experts Marcus Hutter Istituto Dalle Molle di Studi sull Intelligenza Artificiale IDSIA, Galleria 2, CH-6928 Manno-Lugano,

More information

CS261: A Second Course in Algorithms Lecture #12: Applications of Multiplicative Weights to Games and Linear Programs

CS261: A Second Course in Algorithms Lecture #12: Applications of Multiplicative Weights to Games and Linear Programs CS26: A Second Course in Algorithms Lecture #2: Applications of Multiplicative Weights to Games and Linear Programs Tim Roughgarden February, 206 Extensions of the Multiplicative Weights Guarantee Last

More information

Overfitting, Bias / Variance Analysis

Overfitting, Bias / Variance Analysis Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic

More information

The Moment Method; Convex Duality; and Large/Medium/Small Deviations

The Moment Method; Convex Duality; and Large/Medium/Small Deviations Stat 928: Statistical Learning Theory Lecture: 5 The Moment Method; Convex Duality; and Large/Medium/Small Deviations Instructor: Sham Kakade The Exponential Inequality and Convex Duality The exponential

More information

Worst-Case Bounds for Gaussian Process Models

Worst-Case Bounds for Gaussian Process Models Worst-Case Bounds for Gaussian Process Models Sham M. Kakade University of Pennsylvania Matthias W. Seeger UC Berkeley Abstract Dean P. Foster University of Pennsylvania We present a competitive analysis

More information

Weighted Majority and the Online Learning Approach

Weighted Majority and the Online Learning Approach Statistical Techniques in Robotics (16-81, F12) Lecture#9 (Wednesday September 26) Weighted Majority and the Online Learning Approach Lecturer: Drew Bagnell Scribe:Narek Melik-Barkhudarov 1 Figure 1: Drew

More information

Extreme Abridgment of Boyd and Vandenberghe s Convex Optimization

Extreme Abridgment of Boyd and Vandenberghe s Convex Optimization Extreme Abridgment of Boyd and Vandenberghe s Convex Optimization Compiled by David Rosenberg Abstract Boyd and Vandenberghe s Convex Optimization book is very well-written and a pleasure to read. The

More information

Sequential Investment, Universal Portfolio Algos and Log-loss

Sequential Investment, Universal Portfolio Algos and Log-loss 1/37 Sequential Investment, Universal Portfolio Algos and Log-loss Chaitanya Ryali, ECE UCSD March 3, 2014 Table of contents 2/37 1 2 3 4 Definitions and Notations 3/37 A market vector x = {x 1,x 2,...,x

More information

LECTURE 5 NOTES. n t. t Γ(a)Γ(b) pt+a 1 (1 p) n t+b 1. The marginal density of t is. Γ(t + a)γ(n t + b) Γ(n + a + b)

LECTURE 5 NOTES. n t. t Γ(a)Γ(b) pt+a 1 (1 p) n t+b 1. The marginal density of t is. Γ(t + a)γ(n t + b) Γ(n + a + b) LECTURE 5 NOTES 1. Bayesian point estimators. In the conventional (frequentist) approach to statistical inference, the parameter θ Θ is considered a fixed quantity. In the Bayesian approach, it is considered

More information

Statistical Measures of Uncertainty in Inverse Problems

Statistical Measures of Uncertainty in Inverse Problems Statistical Measures of Uncertainty in Inverse Problems Workshop on Uncertainty in Inverse Problems Institute for Mathematics and Its Applications Minneapolis, MN 19-26 April 2002 P.B. Stark Department

More information

Constrained Optimization and Lagrangian Duality

Constrained Optimization and Lagrangian Duality CIS 520: Machine Learning Oct 02, 2017 Constrained Optimization and Lagrangian Duality Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may

More information

Lecture 16: FTRL and Online Mirror Descent

Lecture 16: FTRL and Online Mirror Descent Lecture 6: FTRL and Online Mirror Descent Akshay Krishnamurthy akshay@cs.umass.edu November, 07 Recap Last time we saw two online learning algorithms. First we saw the Weighted Majority algorithm, which

More information

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Charles Elkan elkan@cs.ucsd.edu January 17, 2013 1 Principle of maximum likelihood Consider a family of probability distributions

More information

Lecture 8 October Bayes Estimators and Average Risk Optimality

Lecture 8 October Bayes Estimators and Average Risk Optimality STATS 300A: Theory of Statistics Fall 205 Lecture 8 October 5 Lecturer: Lester Mackey Scribe: Hongseok Namkoong, Phan Minh Nguyen Warning: These notes may contain factual and/or typographic errors. 8.

More information

Relative Loss Bounds for Multidimensional Regression Problems

Relative Loss Bounds for Multidimensional Regression Problems Relative Loss Bounds for Multidimensional Regression Problems Jyrki Kivinen and Manfred Warmuth Presented by: Arindam Banerjee A Single Neuron For a training example (x, y), x R d, y [0, 1] learning solves

More information

Exponentiated Gradient Descent

Exponentiated Gradient Descent CSE599s, Spring 01, Online Learning Lecture 10-04/6/01 Lecturer: Ofer Dekel Exponentiated Gradient Descent Scribe: Albert Yu 1 Introduction In this lecture we review norms, dual norms, strong convexity,

More information

Applications of on-line prediction. in telecommunication problems

Applications of on-line prediction. in telecommunication problems Applications of on-line prediction in telecommunication problems Gábor Lugosi Pompeu Fabra University, Barcelona based on joint work with András György and Tamás Linder 1 Outline On-line prediction; Some

More information

Sequential prediction with coded side information under logarithmic loss

Sequential prediction with coded side information under logarithmic loss under logarithmic loss Yanina Shkel Department of Electrical Engineering Princeton University Princeton, NJ 08544, USA Maxim Raginsky Department of Electrical and Computer Engineering Coordinated Science

More information

Chapter 2. Binary and M-ary Hypothesis Testing 2.1 Introduction (Levy 2.1)

Chapter 2. Binary and M-ary Hypothesis Testing 2.1 Introduction (Levy 2.1) Chapter 2. Binary and M-ary Hypothesis Testing 2.1 Introduction (Levy 2.1) Detection problems can usually be casted as binary or M-ary hypothesis testing problems. Applications: This chapter: Simple hypothesis

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

Stochastic and online algorithms

Stochastic and online algorithms Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem

More information

Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization

Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization JMLR: Workshop and Conference Proceedings vol (2010) 1 16 24th Annual Conference on Learning heory Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization

More information

Computational Oracle Inequalities for Large Scale Model Selection Problems

Computational Oracle Inequalities for Large Scale Model Selection Problems for Large Scale Model Selection Problems University of California at Berkeley Queensland University of Technology ETH Zürich, September 2011 Joint work with Alekh Agarwal, John Duchi and Clément Levrard.

More information

Theory and Applications of A Repeated Game Playing Algorithm. Rob Schapire Princeton University [currently visiting Yahoo!

Theory and Applications of A Repeated Game Playing Algorithm. Rob Schapire Princeton University [currently visiting Yahoo! Theory and Applications of A Repeated Game Playing Algorithm Rob Schapire Princeton University [currently visiting Yahoo! Research] Learning Is (Often) Just a Game some learning problems: learn from training

More information

FORMULATION OF THE LEARNING PROBLEM

FORMULATION OF THE LEARNING PROBLEM FORMULTION OF THE LERNING PROBLEM MIM RGINSKY Now that we have seen an informal statement of the learning problem, as well as acquired some technical tools in the form of concentration inequalities, we

More information

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner Fundamentals CS 281A: Statistical Learning Theory Yangqing Jia Based on tutorial slides by Lester Mackey and Ariel Kleiner August, 2011 Outline 1 Probability 2 Statistics 3 Linear Algebra 4 Optimization

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 254 Part V

More information

Online Optimization : Competing with Dynamic Comparators

Online Optimization : Competing with Dynamic Comparators Ali Jadbabaie Alexander Rakhlin Shahin Shahrampour Karthik Sridharan University of Pennsylvania University of Pennsylvania University of Pennsylvania Cornell University Abstract Recent literature on online

More information

Iterative Markov Chain Monte Carlo Computation of Reference Priors and Minimax Risk

Iterative Markov Chain Monte Carlo Computation of Reference Priors and Minimax Risk Iterative Markov Chain Monte Carlo Computation of Reference Priors and Minimax Risk John Lafferty School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 lafferty@cs.cmu.edu Abstract

More information

STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION

STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION Tong Zhang The Annals of Statistics, 2004 Outline Motivation Approximation error under convex risk minimization

More information

A Mini-Course on Convex Optimization with a view toward designing fast algorithms. Nisheeth K. Vishnoi

A Mini-Course on Convex Optimization with a view toward designing fast algorithms. Nisheeth K. Vishnoi A Mini-Course on Convex Optimization with a view toward designing fast algorithms Nisheeth K. Vishnoi i Preface These notes are largely based on lectures delivered by the author in the Fall of 2014. The

More information

A survey: The convex optimization approach to regret minimization

A survey: The convex optimization approach to regret minimization A survey: The convex optimization approach to regret minimization Elad Hazan September 10, 2009 WORKING DRAFT Abstract A well studied and general setting for prediction and decision making is regret minimization

More information

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013 Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description

More information

Bayesian Methods: Naïve Bayes

Bayesian Methods: Naïve Bayes Bayesian Methods: aïve Bayes icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Last Time Parameter learning Learning the parameter of a simple coin flipping model Prior

More information

Max Margin-Classifier

Max Margin-Classifier Max Margin-Classifier Oliver Schulte - CMPT 726 Bishop PRML Ch. 7 Outline Maximum Margin Criterion Math Maximizing the Margin Non-Separable Data Kernels and Non-linear Mappings Where does the maximization

More information