Learning Methods for Online Prediction Problems. Peter Bartlett Statistics and EECS UC Berkeley


1 Learning Methods for Online Prediction Problems Peter Bartlett Statistics and EECS UC Berkeley

2 Online Learning. Repeated game: at each round $t$, the decision method plays $a_t$, then the world reveals $\ell_t$. Aim: minimize the cumulative loss $\hat L_n = \sum_{t=1}^n \ell_t(a_t)$. For example, aim to minimize regret, that is, perform well compared to the best (in retrospect) from some class $A$:
$$\text{regret} = \sum_{t=1}^n \ell_t(a_t) - \min_{a \in A} \sum_{t=1}^n \ell_t(a) = \hat L_n - L_n^*.$$
Data can be adversarially chosen.

3 Online Learning. Minimax regret is the value of the game:
$$\min_{a_1} \max_{\ell_1} \cdots \min_{a_n} \max_{\ell_n} \left( \sum_{t=1}^n \ell_t(a_t) - \min_{a \in A} \sum_{t=1}^n \ell_t(a) \right).$$

4 Online Learning: Motivations 1. Adversarial model is appropriate for Computer security. Computational finance. 2. Adversarial model assumes little: It is often straightforward to convert a strategy for an adversarial environment to a method for a probabilistic environment. 3. Studying the adversarial model sometimes reveals the deterministic core of a statistical problem: there are strong similarities between the performance guarantees in the two cases, and in particular between their dependence on the complexity of the class of prediction rules. 4. There are significant overlaps in the design of methods for the two problems: Regularization plays a central role. Many online prediction strategies have a natural interpretation as a Bayesian method.

5 Computer Security: Spam Detection

6 Computer Security: Spam Detection. Here, the action $a_t$ might be a classification rule, and $\ell_t$ is the indicator for a particular email being incorrectly classified (e.g., spam allowed through). The sender can determine if an email is delivered (or detected as spam), and try to modify it. An adversarial model allows an arbitrary sequence. We cannot hope for good classification accuracy in an absolute sense; regret is relative to a comparison class. Minimizing regret ensures that the spam detection accuracy is close to the best performance in retrospect on the particular spam sequence.

7 Computer Security: Spam Detection. Suppose we consider features of messages from some set $X$ (e.g., information about the header, about words in the message, about attachments). The decision method's action $a_t$ is a mapping from $X$ to $[0,1]$ (think of the value as an estimated probability that the message is spam). At each round, the adversary chooses a feature vector $x_t \in X$ and a label $y_t \in \{0,1\}$, and the loss is defined as $\ell_t(a_t) = (y_t - a_t(x_t))^2$. The regret is then the excess squared error, over the best achievable on the data sequence:
$$\sum_{t=1}^n \ell_t(a_t) - \min_{a \in A} \sum_{t=1}^n \ell_t(a) = \sum_{t=1}^n (y_t - a_t(x_t))^2 - \min_{a \in A} \sum_{t=1}^n (y_t - a(x_t))^2.$$

8 Computer Security: Web Spam Detection Web Spam Challenge (

9 Computer Security: Detecting Denial of Service ACM

10 Computational Finance: Portfolio Optimization

11 Computational Finance: Portfolio Optimization. Aim to choose a portfolio (distribution over financial instruments) to maximize utility. Other market players can profit from making our decisions bad ones. For example, if our trades have a market impact, someone can front-run (trade ahead of us). Here, the action $a_t$ is a distribution on instruments, and $\ell_t$ might be the negative logarithm of the portfolio's increase, $a_t \cdot r_t$, where $r_t$ is the vector of relative price increases. We might compare our performance to the best stock (distribution is a delta function), or a set of indices (distribution corresponds to Dow Jones Industrial Average, etc.), or the set of all distributions.

12 Computational Finance: Portfolio Optimization. The decision method's action $a_t$ is a distribution on the $m$ instruments, $a_t \in \Delta_m = \{a \in [0,1]^m : \sum_i a_i = 1\}$. At each round, the adversary chooses a vector of returns $r_t \in \mathbb{R}^m_+$; the $i$th component is the ratio of the price of instrument $i$ at time $t$ to its price at the previous time, and the loss is defined as $\ell_t(a_t) = -\log(a_t \cdot r_t)$. The regret is then the log of the ratio of the maximum value the portfolio would have at the end (for the best mixture choice) to the final portfolio value:
$$\sum_{t=1}^n \ell_t(a_t) - \min_{a \in A} \sum_{t=1}^n \ell_t(a) = \max_{a \in A} \sum_{t=1}^n \log(a \cdot r_t) - \sum_{t=1}^n \log(a_t \cdot r_t).$$

13 Online Learning: Motivations 2. Online algorithms are also effective in probabilistic settings. Easy to convert an online algorithm to a batch algorithm. Easy to show that good online performance implies good i.i.d. performance, for example.

14 Online Learning: Motivations 3. Understanding statistical prediction methods. Many statistical methods, based on probabilistic assumptions, can be effective in an adversarial setting. Analyzing their performance in adversarial settings provides perspective on their robustness. We would like violations of the probabilistic assumptions to have a limited impact.

15 Key Points Online Learning: repeated game. aim to minimize regret. Data can be adversarially chosen. Motivations: Often appropriate (security, finance). Algorithms also effective in probabilistic settings. Can provide insight into statistical prediction methods.

16 Course Synopsis A finite comparison class: A = {1,..., m}. Converting online to batch. Online convex optimization. Log loss.

17 Finite Comparison Class. 1. Prediction with expert advice. 2. With perfect predictions: $\log m$ regret. 3. Exponential weights strategy: $\sqrt{n \log m}$ regret. 4. Refinements and extensions: exponential weights and $L^* = 0$; $n$ unknown; $L^*$ unknown; Bayesian interpretation; convex (versus linear) losses. 5. Statistical prediction with a finite class.

18 Prediction with Expert Advice. Suppose we are predicting whether it will rain tomorrow. We have access to a set of $m$ experts, who each make a forecast of 0 or 1. Can we ensure that we predict almost as well as the best expert? Here, $A = \{1,\ldots,m\}$. There are $m$ experts, and each has a forecast sequence $f^i_1, f^i_2, \ldots$ from $\{0,1\}$. At round $t$, the adversary chooses an outcome $y_t \in \{0,1\}$, and sets
$$\ell_t(i) = 1[f^i_t \ne y_t] = \begin{cases} 1 & \text{if } f^i_t \ne y_t, \\ 0 & \text{otherwise.} \end{cases}$$

19 Online Learning. Minimax regret is the value of the game:
$$\min_{a_1} \max_{\ell_1} \cdots \min_{a_n} \max_{\ell_n} \left( \sum_{t=1}^n \ell_t(a_t) - \min_{a \in A} \sum_{t=1}^n \ell_t(a) \right),$$
where $\hat L_n = \sum_{t=1}^n \ell_t(a_t)$ and $L_n^* = \min_{a \in A} \sum_{t=1}^n \ell_t(a)$.

20 Prediction with Expert Advice. An easier game: suppose that the adversary is constrained to choose the sequence $y_t$ so that some expert incurs no loss ($L_n^* = 0$), that is, there is an $i \in \{1,\ldots,m\}$ such that for all $t$, $y_t = f^i_t$. How should we predict?

21 Prediction with Expert Advice: Halving. Define the set of experts who have been correct so far: $C_t = \{i : \ell_1(i) = \cdots = \ell_{t-1}(i) = 0\}$. Choose $a_t$ as any element of $\{i : f^i_t = \text{majority}(\{f^j_t : j \in C_t\})\}$. Theorem: This strategy has regret no more than $\log_2 m$.

22 Prediction with Expert Advice: Halving. Theorem: The halving strategy has regret no more than $\log_2 m$. Proof: If it makes a mistake (that is, $\ell_t(a_t) = 1$), then the minority of $\{f^j_t : j \in C_t\}$ is correct, so at least half of the experts are eliminated: $|C_{t+1}| \le |C_t|/2$. And otherwise $|C_{t+1}| \le |C_t|$ (because $C_t$ never increases). Thus,
$$\hat L_n = \sum_{t=1}^n \ell_t(a_t) \le \log_2 \frac{|C_1|}{|C_{n+1}|} = \log_2 m - \log_2 |C_{n+1}| \le \log_2 m.$$
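A minimal sketch of the halving strategy in Python (illustrative; it assumes binary forecasts and that some expert is perfect, so the set of consistent experts never becomes empty):

```python
import numpy as np

def halving_predictions(forecasts, outcomes):
    """Run the halving strategy.

    forecasts: (n, m) array of expert forecasts in {0, 1}.
    outcomes:  (n,) array of true outcomes in {0, 1}.
    Returns the sequence of predictions and the number of mistakes.
    """
    n, m = forecasts.shape
    consistent = np.ones(m, dtype=bool)          # C_t: experts with no mistakes so far
    predictions, mistakes = [], 0
    for t in range(n):
        votes = forecasts[t, consistent]
        a_t = 1 if votes.mean() >= 0.5 else 0    # majority vote of consistent experts
        predictions.append(a_t)
        mistakes += int(a_t != outcomes[t])
        consistent &= (forecasts[t] == outcomes[t])   # eliminate experts that erred
    return predictions, mistakes
```

When a perfect expert exists, the returned mistake count is at most $\log_2 m$, matching the theorem.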

23 Prediction with Expert Advice The proof follows a pattern we shall see again: find some measure of progress (here, C t ) that changes monotonically when excess loss is incurred (here, it halves), is somehow constrained (here, it cannot fall below 1, because there is an expert who predicts perfectly). What if there is no perfect expert? Maintaining C t makes no sense.

24 Finite Comparison Class. 1. Prediction with expert advice. 2. With perfect predictions: $\log m$ regret. 3. Exponential weights strategy: $\sqrt{n \log m}$ regret. 4. Refinements and extensions: exponential weights and $L^* = 0$; $n$ unknown; $L^*$ unknown; Bayesian interpretation; convex (versus linear) losses. 5. Statistical prediction with a finite class.

25 Prediction with Expert Advice: Mixed Strategies. We have $m$ experts. Allow a mixed strategy, that is, $a_t$ chosen from the simplex $\Delta_m$, the set of distributions on $\{1,\ldots,m\}$,
$$\Delta_m = \left\{ a \in [0,1]^m : \sum_{i=1}^m a_i = 1 \right\}.$$
We can think of the strategy as choosing an element of $\{1,\ldots,m\}$ randomly, according to a distribution $a_t$. Or we can think of it as playing an element $a_t$ of $\Delta_m$, and incurring the expected loss
$$\ell_t(a_t) = \sum_{i=1}^m a^i_t\, \ell_t(e_i),$$
where $\ell_t(e_i) \in [0,1]$ is the loss incurred by expert $i$. ($e_i$ denotes the vector with a single 1 in the $i$th coordinate, and the rest zeros.)

26 Prediction with Expert Advice: Exponential Weights. Maintain a set of (unnormalized) weights over experts:
$$w^i_0 = 1, \qquad w^i_{t+1} = w^i_t \exp(-\eta \ell_t(e_i)).$$
Here, $\eta > 0$ is a parameter of the algorithm. Choose $a_t$ as the normalized vector,
$$a_t = \frac{w_t}{\sum_{i=1}^m w^i_t}.$$

27 Prediction with Expert Advice: Exponential Weights. Theorem: The exponential weights strategy with parameter $\eta = \sqrt{8 \ln m / n}$ has regret satisfying
$$\hat L_n - L_n^* \le \sqrt{\frac{n \ln m}{2}}.$$
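A minimal sketch of the exponential weights strategy for losses in $[0,1]$ (illustrative; the loss matrix is supplied by the caller):

```python
import numpy as np

def exponential_weights(loss_matrix, eta=None):
    """Play exponential weights against a sequence of loss vectors.

    loss_matrix: (n, m) array; loss_matrix[t, i] is the loss of expert i at
                 round t, assumed to lie in [0, 1].
    eta:         learning rate; defaults to sqrt(8 ln m / n).
    Returns the strategy's total (expected) loss and the best expert's loss.
    """
    n, m = loss_matrix.shape
    if eta is None:
        eta = np.sqrt(8.0 * np.log(m) / n)
    log_w = np.zeros(m)                      # log of unnormalized weights
    total_loss = 0.0
    for t in range(n):
        a_t = np.exp(log_w - log_w.max())
        a_t /= a_t.sum()                     # normalized weights = mixed action
        total_loss += a_t @ loss_matrix[t]   # expected loss this round
        log_w -= eta * loss_matrix[t]        # multiplicative update, in log-space
    best_expert_loss = loss_matrix.sum(axis=0).min()
    return total_loss, best_expert_loss
```

The difference between the two returned values is the regret, which the theorem bounds by $\sqrt{n \ln m / 2}$.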

28 Exponential Weights: Proof Idea. We use a measure of progress: $W_t = \sum_{i=1}^m w^i_t$.
1. $W_n$ grows at least as $\exp\left(-\eta \min_i \sum_{t=1}^n \ell_t(e_i)\right)$.
2. $W_n$ grows no faster than $\exp\left(-\eta \sum_{t=1}^n \ell_t(a_t)\right)$.

29 Exponential Weights: Proof 1.
$$\ln \frac{W_{n+1}}{W_1} = \ln\left(\sum_{i=1}^m w^i_{n+1}\right) - \ln m = \ln\left(\sum_{i=1}^m \exp\left(-\eta \sum_t \ell_t(e_i)\right)\right) - \ln m \ge \ln\left(\max_i \exp\left(-\eta \sum_t \ell_t(e_i)\right)\right) - \ln m = -\eta \min_i \sum_t \ell_t(e_i) - \ln m = -\eta L_n^* - \ln m.$$

30 Exponential Weights: Proof 2.
$$\ln \frac{W_{t+1}}{W_t} = \ln\left(\frac{\sum_{i=1}^m \exp(-\eta \ell_t(e_i))\, w^i_t}{\sum_i w^i_t}\right) \le -\eta\, \frac{\sum_i \ell_t(e_i) w^i_t}{\sum_i w^i_t} + \frac{\eta^2}{8} = -\eta \ell_t(a_t) + \frac{\eta^2}{8},$$
where we have used Hoeffding's inequality: for a random variable $X \in [a,b]$ and $\lambda \in \mathbb{R}$,
$$\ln\left(\mathbb{E}\, e^{\lambda X}\right) \le \lambda\, \mathbb{E} X + \frac{\lambda^2 (b-a)^2}{8}.$$

31 Aside: Proof of Hoeffding's inequality. Define
$$A(\lambda) = \log\left(\mathbb{E}\, e^{\lambda X}\right) = \log\left(\int e^{\lambda x}\, dP(x)\right),$$
where $X \sim P$. Then $A$ is the log normalization of the exponential family random variable $X_\lambda$ with reference measure $P$ and sufficient statistic $x$. Since $P$ has bounded support, $A(\lambda) < \infty$ for all $\lambda$, and we know that $A'(\lambda) = \mathbb{E}(X_\lambda)$ and $A''(\lambda) = \mathrm{Var}(X_\lambda)$. Since $P$ has support in $[a,b]$, $\mathrm{Var}(X_\lambda) \le (b-a)^2/4$. Then a Taylor expansion about $\lambda = 0$ (where $X_\lambda$ has the same distribution as $X$) gives
$$A(\lambda) \le \lambda\, \mathbb{E} X + \frac{\lambda^2}{8}(b-a)^2.$$

32 Exponential Weights: Proof. Thus,
$$-\eta L_n^* - \ln m \le \ln \frac{W_{n+1}}{W_1} \le -\eta \hat L_n + \frac{n\eta^2}{8}, \qquad \text{so} \qquad \hat L_n - L_n^* \le \frac{\ln m}{\eta} + \frac{\eta n}{8}.$$
Choosing the optimal $\eta$ gives the result: Theorem: The exponential weights strategy with parameter $\eta = \sqrt{8 \ln m / n}$ has regret no more than $\sqrt{n \ln m / 2}$.

33 Key Points. For a finite set of actions (experts): If one is perfect (zero loss), the halving algorithm gives per round regret of $\frac{\ln m}{n}$. Exponential weights gives per round regret of $O\left(\sqrt{\frac{\ln m}{n}}\right)$.

34 Prediction with Expert Advice: Refinements. 1. Does the exponential weights strategy give the faster rate if $L^* = 0$? 2. Do we need to know $n$ to set $\eta$?

35 Prediction with Expert Advice: Refinements. 1. Does the exponential weights strategy give the faster rate if $L^* = 0$? Replace Hoeffding,
$$\ln \mathbb{E}\, e^{\lambda X} \le \lambda\, \mathbb{E} X + \frac{\lambda^2}{8},$$
with Bernstein (for $X \in [0,1]$):
$$\ln \mathbb{E}\, e^{\lambda X} \le (e^\lambda - 1)\, \mathbb{E} X.$$

36 Exponential Weights: Proof 2. Thus
$$\ln \frac{W_{t+1}}{W_t} = \ln\left(\frac{\sum_{i=1}^m \exp(-\eta \ell_t(e_i))\, w^i_t}{\sum_i w^i_t}\right) \le -\left(1 - e^{-\eta}\right) \ell_t(a_t),$$
which gives
$$\hat L_n \le \frac{\eta}{1 - e^{-\eta}} L_n^* + \frac{\ln m}{1 - e^{-\eta}}.$$
For example, if $L_n^* = 0$ and $\eta$ is large, we obtain a regret bound of roughly $\ln m / n$ again. And $\eta$ large is like the halving algorithm (it puts roughly equal weight on all experts that have zero loss so far).

37 Prediction with Expert Advice: Refinements. 2. Do we need to know $n$ to set $\eta$? We used the optimal setting $\eta = \sqrt{8 \ln m / n}$. But can this regret bound be achieved uniformly across time? Yes; using a time-varying $\eta_t = \sqrt{8 \ln m / t}$ gives the same rate (worse constants). It is also possible to set $\eta$ as a function of $L^*_t$, the best cumulative loss so far, to give the improved bound for small losses uniformly across time (worse constants).

38 Prediction with Expert Advice: Refinements. 3. We can interpret the exponential weights strategy as computing a Bayesian posterior. Consider $f^i_t \in [0,1]$, $y_t \in \{0,1\}$, and $\ell^i_t = |f^i_t - y_t|$. Then consider a Bayesian prior that is uniform on $m$ distributions. Given the $i$th distribution, $y_t$ is a Bernoulli random variable with parameter
$$\frac{e^{-\eta(1 - f^i_t)}}{e^{-\eta(1 - f^i_t)} + e^{-\eta f^i_t}}.$$
Then exponential weights is computing the posterior distribution over the $m$ distributions.

39 Prediction with Expert Advice: Refinements. 4. We could work with arbitrary convex losses on $\Delta_m$: We defined the loss as linear in $a$: $\ell_t(a) = \sum_i a_i \ell_t(e_i)$. We could replace this with any bounded convex function on $\Delta_m$. The only change in the proof is that an equality becomes an inequality:
$$-\eta\, \frac{\sum_i \ell_t(e_i) w^i_t}{\sum_i w^i_t} \le -\eta \ell_t(a_t).$$

40 Prediction with Expert Advice: Refinements. But note that the exponential weights strategy only competes with the corners of the simplex: Theorem: For convex functions $\ell_t : \Delta_m \to [0,1]$, the exponential weights strategy, with $\eta = \sqrt{8 \ln m / n}$, satisfies
$$\sum_{t=1}^n \ell_t(a_t) \le \min_i \sum_{t=1}^n \ell_t(e_i) + \sqrt{\frac{n \ln m}{2}}.$$

41 Finite Comparison Class. 1. Prediction with expert advice. 2. With perfect predictions: $\log m$ regret. 3. Exponential weights strategy: $\sqrt{n \log m}$ regret. 4. Refinements and extensions: exponential weights and $L^* = 0$; $n$ unknown; $L^*$ unknown; Bayesian interpretation; convex (versus linear) losses. 5. Statistical prediction with a finite class.

42 Probabilistic Prediction Setting. Let's consider a probabilistic formulation of a prediction problem. There is a sample of size $n$ drawn i.i.d. from an unknown probability distribution $P$ on $X \times Y$: $(X_1, Y_1), \ldots, (X_n, Y_n)$. Some method chooses $\hat f : X \to Y$. It suffers regret
$$\mathbb{E}\,\ell(\hat f(X), Y) - \min_{f \in F} \mathbb{E}\,\ell(f(X), Y).$$
Here, $F$ is a class of functions from $X$ to $Y$.

43 Probabilistic Setting: Zero Loss. Theorem: If some $f \in F$ has $\mathbb{E}\,\ell(f(X), Y) = 0$, then choosing
$$\hat f \in C_n = \left\{ f \in F : \hat{\mathbb{E}}\,\ell(f(X), Y) = 0 \right\}$$
leads to regret that is $O\left(\frac{\log |F|}{n}\right)$.

44 Probabilistic Setting: Zero Loss. Proof:
$$\Pr(\mathbb{E}\,\ell(\hat f) \ge \epsilon) \le \Pr\left(\exists f \in F : \hat{\mathbb{E}}\,\ell(f) = 0,\ \mathbb{E}\,\ell(f) \ge \epsilon\right) \le |F|(1 - \epsilon)^n \le |F| e^{-n\epsilon}.$$
Integrating the tail bound $\Pr\left(\mathbb{E}\,\ell(\hat f)\, n / \ln|F| \ge x\right) \le e^{1-x}$ gives $\mathbb{E}\,\ell(\hat f) \le c \ln|F| / n$.

45 Probabilistic Setting. Theorem: Choosing $\hat f$ to minimize the empirical risk, $\hat{\mathbb{E}}\,\ell(f(X), Y)$, leads to regret that is $O\left(\sqrt{\frac{\log |F|}{n}}\right)$.

46 Probabilistic Setting. Proof: By the triangle inequality and the definition of $\hat f$,
$$\mathbb{E}\,\ell_{\hat f} - \min_{f \in F} \mathbb{E}\,\ell_f \le 2\,\mathbb{E} \sup_{f \in F} \left| \mathbb{E}\,\ell_f - \hat{\mathbb{E}}\,\ell_f \right|.$$
A symmetrization argument with i.i.d. Rademacher (uniform $\pm 1$) variables $\epsilon_t$ bounds this uniform deviation:
$$\mathbb{E} \sup_{f \in F} \left| \mathbb{E}\,\ell_f - \hat{\mathbb{E}}\,\ell_f \right| \le 2\,\mathbb{E} \sup_{f \in F} \left| \frac{1}{n} \sum_{t=1}^n \epsilon_t\, \ell_f(X_t, Y_t) \right| \le 2 \sqrt{\frac{2 \log |F|}{n}},$$
where the last step uses the finiteness of the class.

47 Probabilistic Setting: Key Points. For a finite function class: If one is perfect (zero loss), choosing $\hat f$ to minimize the empirical risk, $\hat{\mathbb{E}}\,\ell(f(X), Y)$, gives per round regret of $\frac{\ln |F|}{n}$. In any case, this $\hat f$ has per round regret of $O\left(\sqrt{\frac{\ln |F|}{n}}\right)$, just as in the adversarial setting.

48 Course Synopsis. A finite comparison class: $A = \{1,\ldots,m\}$. 1. Prediction with expert advice. 2. With perfect predictions: $\log m$ regret. 3. Exponential weights strategy: $\sqrt{n \log m}$ regret. 4. Refinements and extensions. 5. Statistical prediction with a finite class. Converting online to batch. Online convex optimization. Log loss.

49 Online to Batch Conversion. Suppose we have an online strategy that, given observations $\ell_1, \ldots, \ell_{t-1}$, produces $a_t = A(\ell_1, \ldots, \ell_{t-1})$. Can we convert this to a method that is suitable for a probabilistic setting? That is, if the $\ell_t$ are chosen i.i.d., can we use $A$'s choices $a_t$ to come up with an $\hat a \in A$ so that
$$\mathbb{E}\,\ell_1(\hat a) - \min_{a \in A} \mathbb{E}\,\ell_1(a)$$
is small? Consider the following simple randomized method: 1. Pick $T$ uniformly from $\{0, \ldots, n\}$. 2. Let $\hat a = A(\ell_{T+1}, \ldots, \ell_n)$.
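A minimal sketch of this randomized conversion (illustrative; the online strategy `A` is assumed to take a list of past loss functions and return an action):

```python
import random

def online_to_batch(A, losses):
    """Randomized online-to-batch conversion.

    A:      online strategy; A(history) returns its next action given a list
            of past loss functions (assumed interface).
    losses: list [l_1, ..., l_n] of i.i.d. loss functions.
    Returns a_hat = A(l_{T+1}, ..., l_n) for T uniform on {0, ..., n}.
    """
    n = len(losses)
    T = random.randint(0, n)      # uniform on {0, 1, ..., n}, inclusive
    return A(losses[T:])          # losses[T:] is the suffix l_{T+1}, ..., l_n
```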

50 Online to Batch Conversion. Theorem: If $A$ has a regret bound of $C_{n+1}$ for sequences of length $n+1$, then for any stationary process generating the $\ell_1, \ldots, \ell_{n+1}$, this method satisfies
$$\mathbb{E}\,\ell_{n+1}(\hat a) - \min_{a \in A} \mathbb{E}\,\ell_n(a) \le \frac{C_{n+1}}{n+1}.$$
(Notice that the expectation averages also over the randomness of the method.)

51 Online to Batch Conversion. Proof:
$$\mathbb{E}\,\ell_{n+1}(\hat a) = \mathbb{E}\,\ell_{n+1}(A(\ell_{T+1}, \ldots, \ell_n)) = \mathbb{E}\,\frac{1}{n+1} \sum_{t=0}^n \ell_{n+1}(A(\ell_{t+1}, \ldots, \ell_n)) = \mathbb{E}\,\frac{1}{n+1} \sum_{t=0}^n \ell_{n-t+1}(A(\ell_1, \ldots, \ell_{n-t})) = \mathbb{E}\,\frac{1}{n+1} \sum_{t=1}^{n+1} \ell_t(A(\ell_1, \ldots, \ell_{t-1}))$$
$$\le \mathbb{E}\,\frac{1}{n+1}\left( \min_{a} \sum_{t=1}^{n+1} \ell_t(a) + C_{n+1} \right) \le \min_a \mathbb{E}\,\ell_t(a) + \frac{C_{n+1}}{n+1}.$$

52 Online to Batch Conversion. The theorem is for the expectation over the randomness of the method. For a high probability result, we could
1. Choose $\hat a = \frac{1}{n}\sum_{t=1}^n a_t$, provided $A$ is convex and the $\ell_t$ are all convex.
2. Choose $\hat a = \arg\min_{a_t} \left( \frac{1}{n-t} \sum_{s=t+1}^n \ell_s(a_t) + c\sqrt{\frac{\log(n/\delta)}{n-t}} \right)$.
In both cases, the analysis involves concentration of martingale sequences. The second (more general) approach does not recover the $C_n/n$ result: the penalty has the wrong form when $C_n = o(\sqrt{n})$.

53 Key Point: Online to Batch Conversion An online strategy with regret bound C n can be converted to a batch method. The regret per trial in the probabilistic setting is bounded by the regret per trial in the adversarial setting.

54 Course Synopsis A finite comparison class: A = {1,..., m}. Converting online to batch. Online convex optimization. 1. Problem formulation 2. Empirical minimization fails. 3. Gradient algorithm. 4. Regularized minimization 5. Regret bounds Log loss.

55 Online Convex Optimization 1. Problem formulation 2. Empirical minimization fails. 3. Gradient algorithm. 4. Regularized minimization Bregman divergence Regularized minimization equivalent to minimizing latest loss and divergence from previous decision Constrained minimization equivalent to unconstrained plus Bregman projection Linearization Mirror descent 5. Regret bounds Unconstrained minimization Seeing the future Strong convexity Examples (gradient, exponentiated gradient) Extensions

56 Online Convex Optimization. $A$ = convex subset of $\mathbb{R}^d$. $L$ = set of convex real functions on $A$. For example, $\ell_t(a) = (x_t \cdot a - y_t)^2$, or $\ell_t(a) = |x_t \cdot a - y_t|$.

57 Online Convex Optimization: Example. Choosing $a_t$ to minimize past losses, $a_t = \arg\min_{a \in A} \sum_{s=1}^{t-1} \ell_s(a)$, can fail. ("fictitious play", "follow the leader") Suppose $A = [-1, 1]$, $L = \{a \mapsto v\,a : |v| \le 1\}$. Consider the following sequence of losses:
$$a_1 = 0,\ \ell_1(a) = \tfrac{1}{2}a; \quad a_2 = -1,\ \ell_2(a) = -a; \quad a_3 = 1,\ \ell_3(a) = a; \quad a_4 = -1,\ \ell_4(a) = -a; \quad a_5 = 1,\ \ell_5(a) = a; \quad \ldots$$
$a = 0$ shows $L_n^* \le 0$, but $\hat L_n = n - 1$.
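A small simulation of this failure example (illustrative; follow-the-leader over $A = [-1,1]$ with the linear losses above):

```python
def follow_the_leader_example(n):
    """Replay the failure example: A = [-1, 1], losses l_t(a) = v_t * a."""
    v = [0.5] + [(-1) ** i for i in range(1, n)]   # v_1 = 1/2, then -1, +1, -1, ...
    cumulative_v, ftl_loss = 0.0, 0.0
    a_t = 0.0                                      # a_1 = 0
    for t in range(n):
        ftl_loss += v[t] * a_t                     # loss of the leader's choice
        cumulative_v += v[t]
        a_t = -1.0 if cumulative_v > 0 else 1.0    # minimize cumulative_v * a over [-1, 1]
    return ftl_loss                                # roughly n - 1, while a = 0 incurs loss 0
```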

58 Online Convex Optimization: Example Choosing a t to minimize past losses can fail. The strategy must avoid overfitting, just as in probabilistic settings. Similar approaches (regularization; Bayesian inference) are applicable in the online setting. First approach: gradient steps. Stay close to previous decisions, but move in a direction of improvement.

59 Online Convex Optimization: Gradient Method.
$$a_1 \in A, \qquad a_{t+1} = \Pi_A\left(a_t - \eta \nabla \ell_t(a_t)\right),$$
where $\Pi_A$ is the Euclidean projection on $A$, $\Pi_A(x) = \arg\min_{a \in A} \|x - a\|$. Theorem: For $G = \max_t \|\nabla \ell_t(a_t)\|$ and $D = \mathrm{diam}(A)$, the gradient strategy with $\eta = D/(G\sqrt{n})$ has regret satisfying
$$\hat L_n - L_n^* \le GD\sqrt{n}.$$
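A minimal sketch of projected online gradient descent (illustrative; the gradient oracles and the projection are supplied by the caller, and a unit-ball projection is included as one possible choice):

```python
import numpy as np

def online_gradient_descent(gradients, project, a1, eta):
    """Projected online gradient descent.

    gradients: list of callables; gradients[t](a) returns grad l_t at a.
    project:   Euclidean projection onto the convex set A.
    a1:        starting point in A.
    eta:       step size, e.g. D / (G * sqrt(n)).
    Returns the sequence of plays a_1, ..., a_n.
    """
    a = np.array(a1, dtype=float)
    plays = [a.copy()]
    for grad in gradients[:-1]:
        a = project(a - eta * grad(a))   # gradient step, then project back onto A
        plays.append(a.copy())
    return plays

def project_unit_ball(x):
    """Projection onto the unit Euclidean ball {a : ||a|| <= 1}."""
    norm = np.linalg.norm(x)
    return x if norm <= 1.0 else x / norm
```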

60 Online Convex Optimization: Gradient Method. Theorem: For $G = \max_t \|\nabla \ell_t(a_t)\|$ and $D = \mathrm{diam}(A)$, the gradient strategy with $\eta = D/(G\sqrt{n})$ has regret satisfying $\hat L_n - L_n^* \le GD\sqrt{n}$. Example: $A = \{a \in \mathbb{R}^d : \|a\| \le 1\}$, $L = \{a \mapsto v \cdot a : \|v\| \le 1\}$. Then $D = 2$, $G \le 1$, so regret is no more than $2\sqrt{n}$. (And $O(\sqrt{n})$ is optimal.)

61 Online Convex Optimization: Gradient Method. Theorem: For $G = \max_t \|\nabla \ell_t(a_t)\|$ and $D = \mathrm{diam}(A)$, the gradient strategy with $\eta = D/(G\sqrt{n})$ has regret satisfying $\hat L_n - L_n^* \le GD\sqrt{n}$. Example: $A = \Delta_m$, $L = \{a \mapsto v \cdot a : \|v\|_\infty \le 1\}$. Then $D \le 2$ and $G \le \sqrt{m}$, so regret is no more than $2\sqrt{mn}$. Since competing with the whole simplex is equivalent to competing with the vertices (experts) for linear losses, this is worse than exponential weights ($\sqrt{m}$ versus $\sqrt{\log m}$).

62 Online Convex Optimization: Gradient Method. Proof: Define $\tilde a_{t+1} = a_t - \eta \nabla \ell_t(a_t)$ and $a_{t+1} = \Pi_A(\tilde a_{t+1})$. Fix $a \in A$ and consider the measure of progress $\|a_t - a\|$.
$$\|a_{t+1} - a\|^2 \le \|\tilde a_{t+1} - a\|^2 = \|a_t - a\|^2 + \eta^2 \|\nabla \ell_t(a_t)\|^2 - 2\eta\, \nabla \ell_t(a_t) \cdot (a_t - a).$$
By convexity,
$$\sum_{t=1}^n (\ell_t(a_t) - \ell_t(a)) \le \sum_{t=1}^n \nabla \ell_t(a_t) \cdot (a_t - a) \le \frac{\|a_1 - a\|^2 - \|a_{n+1} - a\|^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^n \|\nabla \ell_t(a_t)\|^2.$$

63 Online Convex Optimization 1. Problem formulation 2. Empirical minimization fails. 3. Gradient algorithm. 4. Regularized minimization Bregman divergence Regularized minimization equivalent to minimizing latest loss and divergence from previous decision Constrained minimization equivalent to unconstrained plus Bregman projection Linearization Mirror descent 5. Regret bounds Unconstrained minimization Seeing the future Strong convexity Examples (gradient, exponentiated gradient) Extensions

64 Online Convex Optimization: A Regularization Viewpoint. Suppose $\ell_t$ is linear: $\ell_t(a) = g_t \cdot a$. Suppose $A = \mathbb{R}^d$. Then minimizing the regularized criterion
$$a_{t+1} = \arg\min_{a \in A} \left( \eta \sum_{s=1}^t \ell_s(a) + \frac{1}{2}\|a\|^2 \right)$$
corresponds to the gradient step $a_{t+1} = a_t - \eta \nabla \ell_t(a_t)$.

65 Online Convex Optimization: Regularization. Regularized minimization: Consider the family of strategies of the form
$$a_{t+1} = \arg\min_{a \in A} \left( \eta \sum_{s=1}^t \ell_s(a) + R(a) \right).$$
The regularizer $R : \mathbb{R}^d \to \mathbb{R}$ is strictly convex and differentiable.

66 Online Convex Optimization: Regularization. Regularized minimization:
$$a_{t+1} = \arg\min_{a \in A} \left( \eta \sum_{s=1}^t \ell_s(a) + R(a) \right).$$
$R$ keeps the sequence of $a_t$'s stable: it diminishes $\ell_t$'s influence. We can view the choice of $a_{t+1}$ as trading off two competing forces: making $\ell_t(a_{t+1})$ small, and keeping $a_{t+1}$ close to $a_t$. This is a perspective that motivated many algorithms in the literature. We'll investigate why regularized minimization can be viewed this way.

67 Properties of Regularization Methods. In the unconstrained case ($A = \mathbb{R}^d$), regularized minimization is equivalent to minimizing the latest loss and the distance to the previous decision. The appropriate notion of distance is the Bregman divergence $D_{\Phi_{t-1}}$. Define $\Phi_0 = R$ and $\Phi_t = \Phi_{t-1} + \eta \ell_t$, so that
$$a_{t+1} = \arg\min_{a \in A} \left( \eta \sum_{s=1}^t \ell_s(a) + R(a) \right) = \arg\min_{a \in A} \Phi_t(a).$$

68 Bregman Divergence. Definition: For a strictly convex, differentiable $\Phi : \mathbb{R}^d \to \mathbb{R}$, the Bregman divergence wrt $\Phi$ is defined, for $a, b \in \mathbb{R}^d$, as
$$D_\Phi(a, b) = \Phi(a) - \left( \Phi(b) + \nabla\Phi(b) \cdot (a - b) \right).$$
$D_\Phi(a, b)$ is the difference between $\Phi(a)$ and the value at $a$ of the linear approximation of $\Phi$ about $b$.

69 Bregman Divergence. $D_\Phi(a, b) = \Phi(a) - (\Phi(b) + \nabla\Phi(b) \cdot (a - b))$. Example: For $a \in \mathbb{R}^d$, the squared euclidean norm, $\Phi(a) = \frac{1}{2}\|a\|^2$, has
$$D_\Phi(a, b) = \frac{1}{2}\|a\|^2 - \left( \frac{1}{2}\|b\|^2 + b \cdot (a - b) \right) = \frac{1}{2}\|a - b\|^2,$$
the squared euclidean norm.

70 Bregman Divergence. $D_\Phi(a, b) = \Phi(a) - (\Phi(b) + \nabla\Phi(b) \cdot (a - b))$. Example: For $a \in [0,\infty)^d$, the unnormalized negative entropy, $\Phi(a) = \sum_{i=1}^d a_i (\ln a_i - 1)$, has
$$D_\Phi(a, b) = \sum_i \left( a_i(\ln a_i - 1) - b_i(\ln b_i - 1) - \ln b_i\,(a_i - b_i) \right) = \sum_i \left( a_i \ln \frac{a_i}{b_i} + b_i - a_i \right),$$
the unnormalized KL divergence. Thus, for $a \in \Delta_d$, $\Phi(a) = \sum_i a_i \ln a_i$ has $D_\Phi(a, b) = \sum_i a_i \ln \frac{a_i}{b_i}$.
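A small sketch of the two Bregman divergences in these examples (squared Euclidean norm and unnormalized negative entropy; illustrative helper functions):

```python
import numpy as np

def bregman_squared_norm(a, b):
    """D_Phi for Phi(a) = 0.5 * ||a||^2, which equals 0.5 * ||a - b||^2."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return 0.5 * np.sum((a - b) ** 2)

def bregman_neg_entropy(a, b):
    """D_Phi for Phi(a) = sum_i a_i (ln a_i - 1): the unnormalized KL divergence."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return np.sum(a * np.log(a / b) + b - a)
```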

71 Bregman Divergence. When the domain of $\Phi$ is $A \subseteq \mathbb{R}^d$, in addition to differentiability and strict convexity, we make two more assumptions: the interior of $A$ is convex, and for a sequence approaching the boundary of $A$, $\|\nabla\Phi(a_n)\| \to \infty$. We say that such a $\Phi$ is a Legendre function.

72 Bregman Divergence. Properties:
1. $D_\Phi \ge 0$, $D_\Phi(a, a) = 0$.
2. $D_{A+B} = D_A + D_B$.
3. The Bregman projection $\Pi^\Phi_A(b) = \arg\min_{a \in A} D_\Phi(a, b)$ is uniquely defined for closed, convex $A$.
4. Generalized Pythagoras: for closed, convex $A$, $b' = \Pi^\Phi_A(b)$, and $a \in A$, $D_\Phi(a, b) \ge D_\Phi(a, b') + D_\Phi(b', b)$.
5. $\nabla_a D_\Phi(a, b) = \nabla\Phi(a) - \nabla\Phi(b)$.
6. For $\ell$ linear, $D_{\Phi + \ell} = D_\Phi$.
7. For $\Phi^*$ the Legendre dual of $\Phi$, $\nabla\Phi^* = (\nabla\Phi)^{-1}$ and $D_\Phi(a, b) = D_{\Phi^*}(\nabla\Phi(b), \nabla\Phi(a))$.

73 Legendre Dual. For a Legendre function $\Phi : A \to \mathbb{R}$, the Legendre dual is
$$\Phi^*(u) = \sup_{v \in A} \left( u \cdot v - \Phi(v) \right).$$
$\Phi^*$ is Legendre. $\mathrm{dom}(\Phi^*) = \nabla\Phi(\mathrm{int}\,\mathrm{dom}\,\Phi)$. $\nabla\Phi^* = (\nabla\Phi)^{-1}$. $D_\Phi(a, b) = D_{\Phi^*}(\nabla\Phi(b), \nabla\Phi(a))$. $\Phi^{**} = \Phi$.

74 Legendre Dual. Example: For $\Phi = \frac{1}{2}\|\cdot\|_p^2$, the Legendre dual is $\Phi^* = \frac{1}{2}\|\cdot\|_q^2$, where $1/p + 1/q = 1$. Example: For $\Phi(a) = \sum_{i=1}^d e^{a_i}$, so $\nabla\Phi(a) = (e^{a_1}, \ldots, e^{a_d})$,
$$(\nabla\Phi)^{-1}(u) = \nabla\Phi^*(u) = (\ln u_1, \ldots, \ln u_d), \qquad \Phi^*(u) = \sum_i u_i(\ln u_i - 1).$$

75 Online Convex Optimization 1. Problem formulation 2. Empirical minimization fails. 3. Gradient algorithm. 4. Regularized minimization Bregman divergence Regularized minimization equivalent to minimizing latest loss and divergence from previous decision Constrained minimization equivalent to unconstrained plus Bregman projection Linearization Mirror descent 5. Regret bounds Unconstrained minimization Seeing the future Strong convexity Examples (gradient, exponentiated gradient) Extensions

76 Properties of Regularization Methods. In the unconstrained case ($A = \mathbb{R}^d$), regularized minimization is equivalent to minimizing the latest loss and the distance (Bregman divergence) to the previous decision. Theorem: Define $\tilde a_1$ via $\nabla R(\tilde a_1) = 0$, and set
$$\tilde a_{t+1} = \arg\min_{a \in \mathbb{R}^d} \left( \eta \ell_t(a) + D_{\Phi_{t-1}}(a, \tilde a_t) \right).$$
Then
$$\tilde a_{t+1} = \arg\min_{a \in \mathbb{R}^d} \left( \eta \sum_{s=1}^t \ell_s(a) + R(a) \right).$$

77 Properties of Regularization Methods. Proof: By the definition of $\Phi_t$,
$$\eta \ell_t(a) + D_{\Phi_{t-1}}(a, \tilde a_t) = \Phi_t(a) - \Phi_{t-1}(a) + D_{\Phi_{t-1}}(a, \tilde a_t).$$
The derivative wrt $a$ is
$$\nabla\Phi_t(a) - \nabla\Phi_{t-1}(a) + \nabla_a D_{\Phi_{t-1}}(a, \tilde a_t) = \nabla\Phi_t(a) - \nabla\Phi_{t-1}(a) + \nabla\Phi_{t-1}(a) - \nabla\Phi_{t-1}(\tilde a_t).$$
Setting this to zero shows that
$$\nabla\Phi_t(\tilde a_{t+1}) = \nabla\Phi_{t-1}(\tilde a_t) = \cdots = \nabla\Phi_0(\tilde a_1) = \nabla R(\tilde a_1) = 0,$$
so $\tilde a_{t+1}$ minimizes $\Phi_t$.

78 Properties of Regularization Methods. Constrained minimization is equivalent to unconstrained minimization, followed by Bregman projection: Theorem: For
$$a_{t+1} = \arg\min_{a \in A} \Phi_t(a), \qquad \tilde a_{t+1} = \arg\min_{a \in \mathbb{R}^d} \Phi_t(a),$$
we have $a_{t+1} = \Pi^{\Phi_t}_A(\tilde a_{t+1})$.

79 Properties of Regularization Methods. Proof: Let $a'_{t+1}$ denote $\Pi^{\Phi_t}_A(\tilde a_{t+1})$. First, by the definition of $a_{t+1}$, $\Phi_t(a_{t+1}) \le \Phi_t(a'_{t+1})$. Conversely, $D_{\Phi_t}(a'_{t+1}, \tilde a_{t+1}) \le D_{\Phi_t}(a_{t+1}, \tilde a_{t+1})$. But $\nabla\Phi_t(\tilde a_{t+1}) = 0$, so $D_{\Phi_t}(a, \tilde a_{t+1}) = \Phi_t(a) - \Phi_t(\tilde a_{t+1})$. Thus, $\Phi_t(a'_{t+1}) \le \Phi_t(a_{t+1})$.

80 Properties of Regularization Methods. Example: For linear $\ell_t$, regularized minimization is equivalent to minimizing the last loss plus the Bregman divergence wrt $R$ to the previous decision:
$$\arg\min_{a \in A} \left( \eta \sum_{s=1}^t \ell_s(a) + R(a) \right) = \Pi^R_A \left( \arg\min_{a \in \mathbb{R}^d} \left( \eta \ell_t(a) + D_R(a, \tilde a_t) \right) \right),$$
because adding a linear function to $\Phi$ does not change $D_\Phi$.

81 Online Convex Optimization 1. Problem formulation 2. Empirical minimization fails. 3. Gradient algorithm. 4. Regularized minimization Bregman divergence Regularized minimization equivalent and Bregman divergence from previous Constrained minimization equivalent to unconstrained plus Bregman projection Linearization Mirror descent 5. Regret bounds Unconstrained minimization Seeing the future Strong convexity Examples (gradient, exponentiated gradient) Extensions

82 Properties of Regularization Methods: Linear Loss. We can replace $\ell_t$ by $\nabla\ell_t(a_t)$, and this leads to an upper bound on regret. Theorem: Any strategy for online linear optimization with regret satisfying
$$\sum_{t=1}^n g_t \cdot a_t - \min_{a \in A} \sum_{t=1}^n g_t \cdot a \le C_n(g_1, \ldots, g_n)$$
can be used to construct a strategy for online convex optimization, with regret
$$\sum_{t=1}^n \ell_t(a_t) - \min_{a \in A} \sum_{t=1}^n \ell_t(a) \le C_n(\nabla\ell_1(a_1), \ldots, \nabla\ell_n(a_n)).$$
Proof: Convexity implies $\ell_t(a_t) - \ell_t(a) \le \nabla\ell_t(a_t) \cdot (a_t - a)$.

83 Properties of Regularization Methods: Linear Loss. Key Point: We can replace $\ell_t$ by the linear loss $a \mapsto \nabla\ell_t(a_t) \cdot a$, and this leads to an upper bound on regret. Thus, we can work with linear $\ell_t$.

84 Regularization Methods: Mirror Descent. Regularized minimization for linear losses can be viewed as mirror descent: taking a gradient step in a dual space. Theorem: The decisions
$$\tilde a_{t+1} = \arg\min_{a \in \mathbb{R}^d} \left( \eta \sum_{s=1}^t g_s \cdot a + R(a) \right)$$
can be written
$$\tilde a_{t+1} = (\nabla R)^{-1}\left( \nabla R(\tilde a_t) - \eta g_t \right).$$
This corresponds to first mapping from $\tilde a_t$ through $\nabla R$, then taking a step in the direction $-g_t$, then mapping back through $(\nabla R)^{-1} = \nabla R^*$ to $\tilde a_{t+1}$.

85 Regularization Methods: Mirror Descent. Proof: For the unconstrained minimization, we have
$$\nabla R(\tilde a_{t+1}) = -\eta \sum_{s=1}^t g_s, \qquad \nabla R(\tilde a_t) = -\eta \sum_{s=1}^{t-1} g_s,$$
so $\nabla R(\tilde a_{t+1}) = \nabla R(\tilde a_t) - \eta g_t$, which can be written $\tilde a_{t+1} = (\nabla R)^{-1}\left( \nabla R(\tilde a_t) - \eta g_t \right)$.
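A minimal sketch of one unconstrained mirror descent step with a pluggable regularizer (the mirror maps `grad_R` and `grad_R_inv` are assumptions supplied by the caller):

```python
def mirror_descent_step(a, g, eta, grad_R, grad_R_inv):
    """One unconstrained mirror descent step.

    a:          current point (e.g. a numpy array).
    g:          gradient of the current linear loss.
    eta:        step size.
    grad_R:     the mirror map, nabla R.
    grad_R_inv: its inverse, (nabla R)^{-1} = nabla R*.
    """
    dual_point = grad_R(a) - eta * g      # gradient step in the dual space
    return grad_R_inv(dual_point)         # map back to the primal space
```

With $R(a) = \frac{1}{2}\|a\|^2$ both maps are the identity and this is a plain gradient step; with the negative entropy regularizer it becomes the exponentiated gradient update discussed below.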

86 Online Convex Optimization 1. Problem formulation 2. Empirical minimization fails. 3. Gradient algorithm. 4. Regularized minimization and Bregman divergences 5. Regret bounds Unconstrained minimization Seeing the future Strong convexity Examples (gradient, exponentiated gradient) Extensions

87 Online Convex Optimization: Regularization. Regularized minimization:
$$a_{t+1} = \arg\min_{a \in A} \left( \eta \sum_{s=1}^t \ell_s(a) + R(a) \right).$$
The regularizer $R : \mathbb{R}^d \to \mathbb{R}$ is strictly convex and differentiable.

88 Regularization Methods: Regret. Theorem: For $A = \mathbb{R}^d$, regularized minimization suffers regret against any $a \in A$ of
$$\sum_{t=1}^n \ell_t(a_t) - \sum_{t=1}^n \ell_t(a) = \frac{D_R(a, a_1) - D_{\Phi_n}(a, a_{n+1})}{\eta} + \frac{1}{\eta} \sum_{t=1}^n D_{\Phi_t}(a_t, a_{t+1}),$$
and thus
$$\hat L_n \le \inf_{a \in \mathbb{R}^d} \left( \sum_{t=1}^n \ell_t(a) + \frac{D_R(a, a_1)}{\eta} \right) + \frac{1}{\eta} \sum_{t=1}^n D_{\Phi_t}(a_t, a_{t+1}).$$
So the sizes of the steps $D_{\Phi_t}(a_t, a_{t+1})$ determine the regret bound.

89 Regularization Methods: Regret. Theorem: For $A = \mathbb{R}^d$, regularized minimization suffers regret
$$\hat L_n \le \inf_{a \in \mathbb{R}^d} \left( \sum_{t=1}^n \ell_t(a) + \frac{D_R(a, a_1)}{\eta} \right) + \frac{1}{\eta} \sum_{t=1}^n D_{\Phi_t}(a_t, a_{t+1}).$$
Notice that we can write
$$D_{\Phi_t}(a_t, a_{t+1}) = D_{\Phi_t^*}(\nabla\Phi_t(a_{t+1}), \nabla\Phi_t(a_t)) = D_{\Phi_t^*}(0, \nabla\Phi_{t-1}(a_t) + \eta\nabla\ell_t(a_t)) = D_{\Phi_t^*}(0, \eta\nabla\ell_t(a_t)).$$
So it is the size of the gradient steps, $D_{\Phi_t^*}(0, \eta\nabla\ell_t(a_t))$, that determines the regret.

90 Regularization Methods: Regret Bounds. Example: Suppose $R = \frac{1}{2}\|\cdot\|^2$. Then we have
$$\hat L_n \le L_n^* + \frac{\|a^* - a_1\|^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^n \|g_t\|^2.$$
And if $\|g_t\| \le G$ and $\|a^* - a_1\| \le D$, choosing $\eta$ appropriately gives $\hat L_n \le L_n^* + DG\sqrt{n}$.

91 Online Convex Optimization 1. Problem formulation 2. Empirical minimization fails. 3. Gradient algorithm. 4. Regularized minimization and Bregman divergences 5. Regret bounds Unconstrained minimization Seeing the future Strong convexity Examples (gradient, exponentiated gradient) Extensions

92 Regularization Methods: Regret Bounds. Seeing the future gives small regret: Theorem: For all $a \in A$,
$$\sum_{t=1}^n \ell_t(a_{t+1}) - \sum_{t=1}^n \ell_t(a) \le \frac{1}{\eta}\left( R(a) - R(a_1) \right).$$

93 Regularization Methods: Regret Bounds. Proof: Since $a_{t+1}$ minimizes $\Phi_t$,
$$\eta \sum_{s=1}^t \ell_s(a) + R(a) \ge \eta \sum_{s=1}^t \ell_s(a_{t+1}) + R(a_{t+1}) = \eta\ell_t(a_{t+1}) + \eta\sum_{s=1}^{t-1}\ell_s(a_{t+1}) + R(a_{t+1}) \ge \eta\ell_t(a_{t+1}) + \eta\sum_{s=1}^{t-1}\ell_s(a_t) + R(a_t) \ge \cdots \ge \eta\sum_{s=1}^t \ell_s(a_{s+1}) + R(a_1).$$

94 Regularization Methods: Regret Bounds. Theorem: For all $a \in A$,
$$\sum_{t=1}^n \ell_t(a_{t+1}) - \sum_{t=1}^n \ell_t(a) \le \frac{1}{\eta}\left( R(a) - R(a_1) \right).$$
Thus, if $a_t$ and $a_{t+1}$ are close, then regret is small: Corollary: For all $a \in A$,
$$\sum_{t=1}^n (\ell_t(a_t) - \ell_t(a)) \le \sum_{t=1}^n (\ell_t(a_t) - \ell_t(a_{t+1})) + \frac{1}{\eta}\left( R(a) - R(a_1) \right).$$
So how can we control the increments $\ell_t(a_t) - \ell_t(a_{t+1})$?

95 Online Convex Optimization 1. Problem formulation 2. Empirical minimization fails. 3. Gradient algorithm. 4. Regularized minimization Bregman divergence Regularized minimization equivalent and Bregman divergence from previous Constrained minimization equivalent to unconstrained plus Bregman projection Linearization Mirror descent 5. Regret bounds Unconstrained minimization Seeing the future Strong convexity Examples (gradient, exponentiated gradient) Extensions

96 Regularization Methods: Regret Bounds. Definition: We say $R$ is strongly convex wrt a norm $\|\cdot\|$ if, for all $a, b$,
$$R(a) \ge R(b) + \nabla R(b) \cdot (a - b) + \frac{1}{2}\|a - b\|^2.$$
For linear losses and strongly convex regularizers, the dual norm of the gradient is small: Theorem: If $R$ is strongly convex wrt a norm $\|\cdot\|$, and $\ell_t(a) = g_t \cdot a$, then
$$\|a_t - a_{t+1}\| \le \eta \|g_t\|_*,$$
where $\|\cdot\|_*$ is the dual norm to $\|\cdot\|$: $\|v\|_* = \sup\{v \cdot a : \|a\| \le 1\}$.

97 Regularization Methods: Regret Bounds. Proof:
$$R(a_t) \ge R(a_{t+1}) + \nabla R(a_{t+1}) \cdot (a_t - a_{t+1}) + \frac{1}{2}\|a_t - a_{t+1}\|^2,$$
$$R(a_{t+1}) \ge R(a_t) + \nabla R(a_t) \cdot (a_{t+1} - a_t) + \frac{1}{2}\|a_t - a_{t+1}\|^2.$$
Combining,
$$\|a_t - a_{t+1}\|^2 \le \left( \nabla R(a_t) - \nabla R(a_{t+1}) \right) \cdot (a_t - a_{t+1}).$$
Hence,
$$\|a_t - a_{t+1}\| \le \|\nabla R(a_t) - \nabla R(a_{t+1})\|_* = \eta\|g_t\|_*.$$

98 Regularization Methods: Regret Bounds. This leads to the regret bound: Corollary: For linear losses, if $R$ is strongly convex wrt $\|\cdot\|$, then for all $a \in A$,
$$\sum_{t=1}^n (\ell_t(a_t) - \ell_t(a)) \le \eta \sum_{t=1}^n \|g_t\|_*^2 + \frac{1}{\eta}\left( R(a) - R(a_1) \right).$$
Thus, for $\|g_t\|_* \le G$ and $R(a) - R(a_1) \le D^2$, choosing $\eta$ appropriately gives regret no more than $2GD\sqrt{n}$.

99 Regularization Methods: Regret Bounds. Example: Consider $R(a) = \frac{1}{2}\|a\|^2$, $a_1 = 0$, and $A$ contained in a Euclidean ball of diameter $D$. Then $R$ is strongly convex wrt $\|\cdot\| = \|\cdot\|_2$, and $\|\cdot\|_* = \|\cdot\|_2$. And the mapping between primal and dual spaces is the identity. So if $\sup_{a \in A} \|\nabla\ell_t(a)\| \le G$, then regret is no more than $2GD\sqrt{n}$.

100 Regularization Methods: Regret Bounds. Example: Consider $A = \Delta_m$, $R(a) = \sum_i a_i \ln a_i$. Then the mapping between primal and dual spaces is $\nabla R(a) = \ln(a)$ (component-wise), and the divergence is the KL divergence, $D_R(a, b) = \sum_i a_i \ln(a_i/b_i)$. And $R$ is strongly convex wrt $\|\cdot\|_1$ (check!). Suppose that $\|g_t\|_\infty \le 1$. Also, $R(a) - R(a_1) \le \ln m$, so the regret is no more than $2\sqrt{n \ln m}$.

101 Regularization Methods: Regret Bounds. Example: $A = \Delta_m$, $R(a) = \sum_i a_i \ln a_i$. What are the updates?
$$a_{t+1} = \Pi^R_A(\tilde a_{t+1}) = \Pi^R_A\left( \nabla R^*(\nabla R(\tilde a_t) - \eta g_t) \right) = \Pi^R_A\left( \nabla R^*(\ln(\tilde a_t e^{-\eta g_t})) \right) = \Pi^R_A\left( \tilde a_t e^{-\eta g_t} \right),$$
where the $\ln$ and $\exp$ functions are applied component-wise. This is exponentiated gradient: mirror descent with $\nabla R = \ln$. It is easy to check that the projection corresponds to normalization, $\Pi^R_A(\tilde a) = \tilde a / \|\tilde a\|_1$.
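A minimal sketch of the exponentiated gradient update on the simplex, one possible instantiation of the above (the gradient sequence is supplied by the caller):

```python
import numpy as np

def exponentiated_gradient(gradients, m, eta):
    """Exponentiated gradient over the m-simplex for linear losses g_t . a.

    gradients: iterable of length-m gradient vectors g_t.
    Returns the sequence of plays a_1, a_2, ...
    """
    a = np.full(m, 1.0 / m)                    # a_1: uniform distribution
    plays = [a.copy()]
    for g in gradients:
        w = a * np.exp(-eta * np.asarray(g))   # component-wise multiplicative update
        a = w / w.sum()                        # Bregman (KL) projection = normalization
        plays.append(a.copy())
    return plays
```

For linear losses this is exactly the exponential weights strategy sketched earlier.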

102 Regularization Methods: Regret Bounds. Notice that when the losses are linear, exponentiated gradient is exactly the exponential weights strategy we discussed for a finite comparison class. Compare $R(a) = \sum_i a_i \ln a_i$ with $R(a) = \frac{1}{2}\|a\|^2$, for $\|g_t\|_\infty \le 1$, $A = \Delta_m$: $O(\sqrt{n \ln m})$ versus $O(\sqrt{mn})$.

103 Online Convex Optimization 1. Problem formulation 2. Empirical minimization fails. 3. Gradient algorithm. 4. Regularized minimization Bregman divergence Regularized minimization equivalent and Bregman divergence from previous Constrained minimization equivalent to unconstrained plus Bregman projection Linearization Mirror descent 5. Regret bounds Unconstrained minimization Strong convexity Examples (gradient, exponentiated gradient) Extensions

104 Regularization Methods: Extensions. Instead of
$$a_{t+1} = \arg\min_{a \in A} \left( \eta\ell_t(a) + D_{\Phi_{t-1}}(a, \tilde a_t) \right),$$
we can use
$$a_{t+1} = \arg\min_{a \in A} \left( \eta\ell_t(a) + D_{\Phi_{t-1}}(a, a_t) \right).$$
And analogous results apply. For instance, this is the approach used by the first gradient method we considered. We can get faster rates with stronger assumptions on the losses.

105 Regularization Methods: Varying η. Theorem: Define
$$a_{t+1} = \arg\min_{a \in \mathbb{R}^d} \left( \sum_{s=1}^t \eta_s \ell_s(a) + R(a) \right).$$
For any $a \in \mathbb{R}^d$,
$$\hat L_n - \sum_{t=1}^n \ell_t(a) \le \sum_{t=1}^n \frac{1}{\eta_t} \left( D_{\Phi_t}(a_t, a_{t+1}) + D_{\Phi_{t-1}}(a, a_t) - D_{\Phi_t}(a, a_{t+1}) \right).$$
If we linearize the $\ell_t$, we have
$$\hat L_n - \sum_{t=1}^n \ell_t(a) \le \sum_{t=1}^n \frac{1}{\eta_t} \left( D_R(a_t, a_{t+1}) + D_R(a, a_t) - D_R(a, a_{t+1}) \right).$$
But what if the $\ell_t$ are strongly convex?

106 Regularization Methods: Strongly Convex Losses. Theorem: If $\ell_t$ is $\sigma$-strongly convex wrt $R$, that is, for all $a, b \in \mathbb{R}^d$,
$$\ell_t(a) \ge \ell_t(b) + \nabla\ell_t(b) \cdot (a - b) + \frac{\sigma}{2} D_R(a, b),$$
then for any $a \in \mathbb{R}^d$, this strategy with $\eta_t = \frac{2}{t\sigma}$ has regret
$$\hat L_n - \sum_{t=1}^n \ell_t(a) \le \sum_{t=1}^n \frac{1}{\eta_t} D_R(a_t, a_{t+1}).$$

107 Strongly Convex Losses: Proof idea.
$$\sum_{t=1}^n (\ell_t(a_t) - \ell_t(a)) \le \sum_{t=1}^n \left( \nabla\ell_t(a_t) \cdot (a_t - a) - \frac{\sigma}{2} D_R(a, a_t) \right) \le \sum_{t=1}^n \frac{1}{\eta_t}\left( D_R(a_t, a_{t+1}) + D_R(a, a_t) - D_R(a, a_{t+1}) \right) - \sum_{t=1}^n \frac{\sigma}{2} D_R(a, a_t)$$
$$\le \sum_{t=1}^n \frac{1}{\eta_t} D_R(a_t, a_{t+1}) + \left( \frac{1}{\eta_1} - \frac{\sigma}{2} \right) D_R(a, a_1) + \sum_{t=2}^n \left( \frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} - \frac{\sigma}{2} \right) D_R(a, a_t).$$
And choosing $\eta_t$ appropriately eliminates the second and third terms.

108 Strongly Convex Losses. Example: For $R(a) = \frac{1}{2}\|a\|^2$, we have
$$\hat L_n - L_n^* \le \frac{1}{2} \sum_{t=1}^n \eta_t \|\nabla\ell_t\|^2 \le \sum_{t=1}^n \frac{G^2}{t\sigma} = O\left( \frac{G^2}{\sigma} \log n \right).$$

109 Strongly Convex Losses. Key Point: When the loss is strongly convex wrt the regularizer, the regret rate can be faster; in the case of quadratic $R$ (and $\ell_t$), it is $O(\log n)$, versus $O(\sqrt{n})$.

110 Course Synopsis A finite comparison class: A = {1,..., m}. Converting online to batch. Online convex optimization. Log loss. Three views of log loss. Normalized maximum likelihood. Sequential investment. Constantly rebalanced portfolios.

111 Log Loss. A family of decision problems with several equivalent interpretations: maximizing long term rate of growth in portfolio optimization; minimizing redundancy in data compression; minimizing likelihood ratio in sequential probability assignment. See Nicolò Cesa-Bianchi and Gábor Lugosi, Prediction, Learning and Games, Chapters 9, 10.

112 Log Loss. Consider a finite outcome space $Y = \{1, \ldots, m\}$. The comparison class $A$ is a set of sequences $f_1, f_2, \ldots$ of maps from $Y^{t-1}$ to distributions on $Y$. We write $f_t(y_t \mid y_1, \ldots, y_{t-1})$, notation that is suggestive of a conditional probability distribution. The adversary chooses, at round $t$, a value $y_t \in Y$, and the loss function for a particular sequence $f$ is
$$\ell_t(f) = -\ln\left( f_t(y_t \mid y_1, \ldots, y_{t-1}) \right).$$

113 Log Loss: Notation.
$$y^n = y_1^n = (y_1, \ldots, y_n), \qquad f^n(y^n) = \prod_{t=1}^n f_t(y_t \mid y^{t-1}), \qquad a^n(y^n) = \prod_{t=1}^n a_t(y_t \mid y^{t-1}).$$
Again, this notation is suggestive of probability distributions. Check: $f^n(y^n) \ge 0$ and $\sum_{y^n \in Y^n} f^n(y^n) = 1$.

114 Log Loss: Three applications Sequential probability assignment. Gambling/investment. Data compression.

115 Log Loss: Sequential Probability Assignment. Think of $y_t$ as the indicator for the event that it rains on day $t$. Minimizing log loss is forecasting $\Pr(y_t \mid y^{t-1})$ sequentially:
$$L_n^* = \inf_{f \in F} \sum_{t=1}^n -\ln f_t(y_t \mid y^{t-1}), \qquad \hat L_n = \sum_{t=1}^n -\ln a_t(y_t \mid y^{t-1}),$$
$$\hat L_n - L_n^* = \sup_{f \in F} \ln \frac{f^n(y^n)}{a^n(y^n)},$$
which is the worst ratio of log likelihoods.

116 Log Loss: Gambling. Suppose we are investing our initial capital $C$ in proportions $a_t(1), \ldots, a_t(m)$ across $m$ horses. If horse $i$ wins, it pays odds $o_t(i) \ge 0$. In that case, our capital becomes $C\,a_t(i)\,o_t(i)$. Let $y_t \in \{1, \ldots, m\}$ denote the winner of race $t$. Suppose that $a_t(y \mid y_1, \ldots, y_{t-1})$ depends on the previous winners. Then our capital goes from $C$ to
$$C \prod_{t=1}^n a_t(y_t \mid y^{t-1})\, o_t(y_t).$$

117 Log Loss: Gambling. Compared to a set $F$ of experts (who also start with capital $C$), the ratio of the best expert's final capital to ours is
$$\sup_{f \in F} \frac{C \prod_{t=1}^n f_t(y_t \mid y^{t-1})\, o_t(y_t)}{C \prod_{t=1}^n a_t(y_t \mid y^{t-1})\, o_t(y_t)} = \sup_{f \in F} \frac{f^n(y^n)}{a^n(y^n)} = \exp\left( \sup_{f \in F} \ln \frac{f^n(y^n)}{a^n(y^n)} \right).$$

118 Log Loss: Data Compression. We can identify probability distributions with codes, and view $-\ln p(y^n)$ as the length (in nats) of an optimal sequentially constructed codeword encoding the sequence $y^n$, under the assumption that $y^n$ is generated by $p$. Then
$$-\ln p_n(y^n) - \inf_{f \in F}\left( -\ln f_n(y^n) \right) = \hat L - L^*$$
is the redundancy (excess length) of the code with respect to a family $F$ of codes.

119 Log Loss: Optimal Prediction. The minimax regret for a class $F$ is
$$V_n(F) = \inf_a \sup_{y^n \in Y^n} \ln \frac{\sup_{f \in F} f^n(y^n)}{a^n(y^n)}.$$
For a class $F$ and $n > 0$, define the normalized maximum likelihood strategy $a^*$ by
$$a^*_n(y^n) = \frac{\sup_{f \in F} f^n(y^n)}{\sum_{x^n \in Y^n} \sup_{f \in F} f^n(x^n)}.$$

120 Log Loss: Optimal Prediction. Theorem:
1. $a^*$ is the unique strategy that satisfies
$$\sup_{y^n \in Y^n} \ln \frac{\sup_{f \in F} f^n(y^n)}{a^*_n(y^n)} = V_n(F).$$
2. For all $y^n \in Y^n$,
$$\ln \frac{\sup_{f \in F} f^n(y^n)}{a^*_n(y^n)} = \ln \sum_{x^n \in Y^n} \sup_{f \in F} f^n(x^n).$$

121 Log Loss: Optimal Prediction. Proof: 2. By the definition of $a^*_n$,
$$\ln \frac{\sup_{f \in F} f^n(y^n)}{a^*_n(y^n)} = \ln \sum_{x^n \in Y^n} \sup_{f \in F} f^n(x^n).$$
1. For any other $a$, there must be a $y^n \in Y^n$ with $a_n(y^n) < a^*_n(y^n)$. Then
$$\ln \frac{\sup_{f \in F} f^n(y^n)}{a_n(y^n)} > \ln \frac{\sup_{f \in F} f^n(y^n)}{a^*_n(y^n)},$$
which implies the sup over $y^n$ is bigger than its value for $a^*$.

122 Log Loss: Optimal Prediction. How do we compute the normalized maximum likelihood strategy?
$$a^*_n(y^n) = \frac{\sup_{f \in F} f^n(y^n)}{\sum_{x^n \in Y^n} \sup_{f \in F} f^n(x^n)}.$$
This $a^*_n$ is a probability distribution on $Y^n$. We can calculate it sequentially via
$$a^*_t(y_t \mid y^{t-1}) = \frac{a^*_t(y^t)}{a^*_{t-1}(y^{t-1})}, \qquad \text{where } a^*_t(y^t) = \sum_{y^n_{t+1} \in Y^{n-t}} a^*_n(y^n).$$

123 Log Loss: Optimal Prediction. In general, these are big sums. The normalized maximum likelihood strategy does not exist if we cannot sum $\sup_{f \in F} f^n(x^n)$ over $x^n \in Y^n$. We need to know the horizon $n$: it is not possible to extend the strategy for $n-1$ to the strategy for $n$. In many cases, there are efficient strategies that approximate the performance of the optimal (normalized maximum likelihood) strategy.

124 Log Loss: Minimax Regret. Example: Suppose $|F| = N$. Then we have
$$V_n(F) = \ln \sum_{y^n \in Y^n} \sup_{f \in F} f^n(y^n) \le \ln \sum_{y^n \in Y^n} \sum_{f \in F} f^n(y^n) = \ln \sum_{f \in F} \sum_{y^n \in Y^n} f^n(y^n) = \ln N.$$

125 Log Loss: Minimax Regret. Example: Consider the class $F$ of all constant experts: $f_t(y \mid y^{t-1}) = f(y)$. For $|Y| = 2$,
$$V_n(F) = \frac{1}{2}\ln n + \frac{1}{2}\ln\frac{\pi}{2} + o(1).$$

126 Minimax Regret: Proof Idea.
$$V_n(F) = \ln \sum_{y^n \in Y^n} \sup_{f \in F} f^n(y^n).$$
Suppose that $f(1) = q$, $f(0) = 1 - q$. Clearly, $f^n(y^n)$ depends only on the number $n_1$ of 1s in $y^n$, and it's easy to check that the maximizing value of $q$ is $n_1/n$, so
$$\sup_{f \in F} f^n(y^n) = \max_q (1-q)^{n - n_1} q^{n_1} = \left( \frac{n - n_1}{n} \right)^{n - n_1} \left( \frac{n_1}{n} \right)^{n_1}.$$
Thus (using Stirling's approximation),
$$V_n(F) = \ln \sum_{n_1 = 0}^n \binom{n}{n_1} \left( \frac{n - n_1}{n} \right)^{n - n_1} \left( \frac{n_1}{n} \right)^{n_1} = \ln\left( \sqrt{\frac{n\pi}{2}}\,(1 + o(1)) \right).$$

127 Log Loss Three views of log loss. Normalized maximum likelihood. Sequential investment. Constantly rebalanced portfolios.

128 Sequential Investment. Suppose that we have $m$ financial instruments (let's call them $1, 2, \ldots, m$), and at each period we need to choose how to spread our capital. We invest a proportion $p_i$ in instrument $i$ (with $p_i \ge 0$ and $\sum_i p_i = 1$). During the period, the value of instrument $i$ increases by a factor of $x_i \ge 0$ and so our wealth increases by a factor of
$$p \cdot x = \sum_{i=1}^m p_i x_i.$$
For instance, $x_1 = 1$ and $x_2 \in \{0, 2\}$ corresponds to a choice between doing nothing and placing a fair bet at even odds.

129 Asymptotic growth rate optimality of logarithmic utility. Logarithmic utility has the attractive property that, if the vectors of market returns $X_1, X_2, \ldots, X_t, \ldots$ are random, then maximizing expected log wealth leads to the optimal asymptotic growth rate. We'll illustrate with a simple example, and then state a general result. Suppose that we are betting on two instruments many times. Their one-period returns (that is, the ratio of the instrument's value after period $t$ to that before period $t$) satisfy
$$\Pr(X_{t,1} = 1) = 1, \qquad \Pr(X_{t,2} = 0) = p, \qquad \Pr(X_{t,2} = 2) = 1 - p.$$
Clearly, one is risk free, and the other has two possible outcomes: complete loss of the investment, and doubling of the investment.

130 Asymptotic growth rate optimality of logarithmic utility. For instance, suppose that we start with wealth at $t = 0$ of $V_0 > 0$, and $0 < p < 1$. If we bet all of our money on instrument 2 at each step, then after $T$ rounds we end up with expected wealth of $\mathbb{E} V_T = (2(1-p))^T V_0$, and this is the maximum value of expected wealth over all strategies. But with probability one, we will eventually have wealth zero if we follow this strategy. What should we do?

131 Asymptotic growth rate optimality of logarithmic utility. Suppose that, for period $t$, we bet a fraction $b_t$ of our wealth on instrument 2. Then if we define $W_t = 1[X_{t,2} = 2]$ (that is, we win the bet), we have
$$V_{t+1} = (1 + b_t)^{W_t} (1 - b_t)^{1 - W_t} V_t.$$
Consider the asymptotic growth rate of wealth,
$$G = \lim_{T \to \infty} \frac{1}{T} \log_2 \frac{V_T}{V_0}.$$
(This extracts the exponent.)

132 Asymptotic growth rate optimality of logarithmic utility. By the weak law of large numbers, we have
$$G = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^T \left( W_t \log_2(1 + b_t) + (1 - W_t)\log_2(1 - b_t) \right) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^T \left( (1-p)\log_2(1 + b_t) + p\log_2(1 - b_t) \right).$$
For what values of $b_t$ is this maximized? The concavity of $\log_2$, together with Jensen's inequality, implies that, for all $x_i \ge 0$ with $\sum_i x_i = x$,
$$\max \sum_i x_i \log y_i \quad \text{s.t.} \quad \sum_i y_i = y$$
has the solution $y_i = x_i y / x$. Thus, we should set $b_t = 1 - 2p$.
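A quick numerical check of this optimal fraction (illustrative; a simple grid search over the betting fraction $b$):

```python
import numpy as np

def kelly_fraction(p, grid=10001):
    """Numerically maximize (1-p)*log2(1+b) + p*log2(1-b) over b in [0, 1)."""
    b = np.linspace(0.0, 0.999999, grid)
    growth = (1 - p) * np.log2(1 + b) + p * np.log2(1 - b)
    return b[np.argmax(growth)]

# kelly_fraction(0.2) is approximately 1 - 2*0.2 = 0.6, as the argument predicts.
```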

133 Asymptotic growth rate optimality of logarithmic utility. That is, if we choose the proportion $b_t$ to allocate to each instrument so as to maximize the expected log return,
$$(1-p)\log_2(1 + b_t) + p\log_2(1 - b_t),$$
then we obtain the optimal exponent in the asymptotic growth rate, which is
$$G^* = (1-p)\log_2(2(1-p)) + p\log_2(2p).$$
Notice that if $p$ is strictly less than $1/2$, $G^* > 0$; that is, we have exponential growth. Compare this with the two individual alternatives: choosing instrument 1 gives no growth, whereas choosing instrument 2 gives expected wealth that grows exponentially, but it leads to ruin, almost surely.

134 Asymptotic growth rate optimality of logarithmic utility. This result was first pointed out by Kelly [5]. Kelly viewed $p$ as the probability that a one-bit message containing the future outcome $X_t$ was transmitted through a communication channel incorrectly, and then the optimal exponent $G^*$ is equal to the channel capacity,
$$G^* = 1 - \left( (1-p)\log_2\frac{1}{1-p} + p\log_2\frac{1}{p} \right).$$

135 Asymptotic growth rate optimality of logarithmic utility. Maximizing expected log return is asymptotically optimal much more generally. To state the general result, suppose that, in period $t$, we need to distribute our wealth over $m$ instruments. We allocate proportion $b_{t,i}$ to the $i$th, and assume that $b_t \in \Delta_m$, the $m$-simplex. Then, if the period $t$ returns are $X_{t,1}, \ldots, X_{t,m} \ge 0$, the yield per dollar invested is $b_t \cdot X_t$, so that our capital $V_t$ becomes $V_{t+1} = V_t\, b_t \cdot X_t$. By a strategy, we mean a sequence of functions $\{b_t\}$ which, at time $t$, uses the allocation $b_t(X_1, \ldots, X_{t-1}) \in \Delta_m$.

136 Asymptotic growth rate optimality of logarithmic utility. Definition: If $X_t \in \mathbb{R}^m_+$ denotes the random returns of $m$ instruments during period $t$, we say that a strategy $b^*$ is log-optimal if
$$b^*_t(X_0, \ldots, X_{t-1}) = \arg\max_{b \in \Delta_m} \mathbb{E}\left[ \log(b \cdot X_t) \mid X_0, \ldots, X_{t-1} \right].$$

137 Asymptotic growth rate optimality of logarithmic utility. Breiman [3] proved the following result for i.i.d. discrete-valued returns; Algoet and Cover [1] proved the general case. Theorem: Suppose that the log-optimal strategy $b^*$ has capital growth $V_0, V_1^*, \ldots, V_T^*$ over $T$ periods and some strategy $b$ has capital growth $V_0, V_1, \ldots, V_T$. Then almost surely
$$\limsup_{T \to \infty} \frac{1}{T} \log \frac{V_T}{V_T^*} \le 0.$$
In particular, if the returns are i.i.d., then in each period the optimal strategy (at least, optimal to first order in the exponent) allocates its capital according to some fixed mixture $b \in \Delta_m$. This mixture is the one that maximizes the expected logarithm of the one-period yield.

138 Asymptotic growth rate optimality of logarithmic utility This is an appealing property: if we are interested in what happens asymptotically, then we should use log as a utility function, and maximize the expected log return during each period.

139 Constantly rebalanced portfolios. A constantly rebalanced portfolio (CRP) is an investment strategy defined by a mixture vector $b \in \Delta_m$. At every time step, it allocates proportion $b_j$ of the total capital to instrument $j$. We have seen that, for i.i.d. returns, the asymptotic growth rate is maximized by a particular CRP. The Dow Jones Industrial Average measures the performance of another CRP (the one that allocates one thirtieth of its capital to each of thirty stocks). Investing in a single stock is another special case of a CRP. (As an illustration of the benefits provided by rebalancing, consider an i.i.d. market with two instruments and return vectors chosen uniformly from $\{(2, 1/2), (1/2, 2)\}$. Investing in any single instrument leads to a growth rate of 0, whereas a $(1/2, 1/2)$ CRP will have wealth that increases by a factor of $5/4$ in each period.)

140 Constantly rebalanced portfolios. Now that we've motivated CRPs, we'll drop all probabilistic assumptions and move back to an online setting. Suppose that the market is adversarial (a reasonable assumption), and consider the problem of competing with the best CRP in hindsight. That is, at each step $t$ we must choose an allocation of our capital $b_t$ so that, after $T$ rounds, the logarithm of our wealth is close to that of the best CRP.

141 Constantly rebalanced portfolios. The following theorem is due to Cover [4] (the proof we give is due to Blum and Kalai [2]). It shows that there is a universal portfolio strategy, that is, one that competes with the best CRP. Theorem: There is a strategy (call it $b^U$) for which
$$\log(V_T) \ge \log(V_T(b^*)) - (m-1)\log(T+1) - 1,$$
where $b^*$ is the best CRP. The strategy is conceptually very simple: it distributes the initial capital uniformly across all CRPs and lets each of them run.

142 Constantly rebalanced portfolios. Consider competing with the $m$ single instrument portfolios. We could just place our money uniformly across the $m$ instruments at the start, and leave it there. Then we have
$$\log(V_T) = \log\left( \sum_{j=1}^m \prod_{t=1}^T X_{t,j}\,(V_0/m) \right) \ge \max_j \log\left( \prod_{t=1}^T X_{t,j}\,(V_0/m) \right) = \max_j \log\left( \prod_{t=1}^T X_{t,j}\,V_0 \right) - \log(m),$$
that is, our regret with respect to the best single instrument portfolio (in hindsight) is no more than $\log m$.

143 Constantly rebalanced portfolios. To compete with the set of CRPs, we adopt a similar strategy: we allocate our capital uniformly over $\Delta_m$, and then calculate the mixture $b_t$ that corresponds at time $t$ to this initial distribution. Consider an infinitesimal region around a point $b \in \Delta_m$. If $\mu$ is the uniform measure on $\Delta_m$, the initial investment in CRP $b$ is $d\mu(b)\,V_0$. By time $t-1$, this has grown to $V_{t-1}(b)\,d\mu(b)\,V_0$, and so this is the contribution to the overall mixture $b_t$. And of course we need to appropriately normalize (by the total capital at time $t-1$):
$$b_t = \frac{\int_{\Delta_m} b\,V_{t-1}(b)\,d\mu(b)}{\int_{\Delta_m} V_{t-1}(b)\,d\mu(b)}.$$
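A minimal Monte Carlo sketch of this update: sample CRPs uniformly from the simplex (a Dirichlet(1,...,1) draw) and weight each by its wealth so far. This is an illustrative approximation; the strategy above is the exact integral.

```python
import numpy as np

def universal_portfolio_weights(past_returns, m, num_samples=100000, rng=None):
    """Approximate b_t = (int b V_{t-1}(b) dmu) / (int V_{t-1}(b) dmu) by sampling.

    past_returns: array-like of shape (t-1, m) with return vectors x_1, ..., x_{t-1}.
    """
    rng = np.random.default_rng(rng)
    past_returns = np.asarray(past_returns, dtype=float).reshape(-1, m)
    # Uniform samples from the m-simplex = Dirichlet(1, ..., 1).
    b = rng.dirichlet(np.ones(m), size=num_samples)      # (num_samples, m)
    # Per-dollar wealth of each sampled CRP on the observed return sequence.
    wealth = np.prod(past_returns @ b.T, axis=0)          # (num_samples,)
    return (wealth[:, None] * b).sum(axis=0) / wealth.sum()
```

With no past returns the weights reduce to (approximately) the center of the simplex, i.e. the uniform portfolio.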

144 Constantly rebalanced portfolios. How does this strategy perform? Suppose that $b^*$ is the best CRP in hindsight. Then the region around $b^*$ contains very similar mixtures, and provided that there is enough volume of sufficiently similar CRPs, our strategy should be able to compete with $b^*$. Indeed, consider the set of mixtures of $b^*$ with some other vector $a \in \Delta_m$,
$$S = \{(1-\epsilon)b^* + \epsilon a : a \in \Delta_m\}.$$
For every $b \in S$, we have
$$\frac{V_1(b)}{V_0} = \frac{V_1((1-\epsilon)b^* + \epsilon a)}{V_0} \ge (1-\epsilon)\frac{V_1(b^*)}{V_0}.$$
Thus, after $T$ steps, $V_T(b) \ge V_T(b^*)(1-\epsilon)^T$.

145 Constantly rebalanced portfolios. Also, the proportion of initial wealth allocated to CRPs in $S$ is
$$\mu(S) = \mu(\{\epsilon a : a \in \Delta_m\}) = \epsilon^{m-1}.$$
Combining these two facts, we have that
$$\log\left( \frac{V_T(b^U)}{V_T(b^*)} \right) \ge \log\left( (1-\epsilon)^T \epsilon^{m-1} \right).$$
Setting $\epsilon = 1/(T+1)$ gives a regret of
$$-\log\left( (1 - 1/(T+1))^T (T+1)^{-(m-1)} \right) < 1 + (m-1)\log(T+1).$$

146 Constantly rebalanced portfolios. There are other approaches to portfolio optimization based on the online prediction strategies that we have seen earlier. For instance, the exponential weights algorithm can be used in this setting, although it leads to $\sqrt{T}$ regret, rather than $\log T$. Gradient descent approaches have also been investigated. For a Newton update method, logarithmic regret bounds have been proved.

147
[1] Paul H. Algoet and Thomas M. Cover. Asymptotic optimality and asymptotic equipartition properties of log-optimum investment. The Annals of Probability, 16(2):876-898, 1988.
[2] Avrim Blum and Adam Kalai. Universal portfolios with and without transaction costs. Machine Learning, 35:193-205, 1999.
[3] Leo Breiman. Optimal gambling systems for favorable games. In Proc. Fourth Berkeley Symp. Math. Statist. Probab., volume 1, pages 65-78. Univ. California Press, 1961.
[4] Thomas M. Cover. Universal portfolios. Mathematical Finance, 1(1):1-29, 1991.
[5] J. L. Kelly, Jr. A new interpretation of information rate. J. Oper. Res. Soc., 57, 1956.

148 Course Synopsis A finite comparison class: A = {1,..., m}. Converting online to batch. Online convex optimization. Log loss.


Stat 260/CS Learning in Sequential Decision Problems. Peter Bartlett Stat 260/CS 294-102. Learning in Sequential Decision Problems. Peter Bartlett 1. Adversarial bandits Definition: sequential game. Lower bounds on regret from the stochastic case. Exp3: exponential weights

More information

Stat 260/CS Learning in Sequential Decision Problems. Peter Bartlett

Stat 260/CS Learning in Sequential Decision Problems. Peter Bartlett Stat 260/CS 294-102. Learning in Sequential Decision Problems. Peter Bartlett 1. Multi-armed bandit algorithms. Concentration inequalities. P(X ǫ) exp( ψ (ǫ))). Cumulant generating function bounds. Hoeffding

More information

Exponential Weights on the Hypercube in Polynomial Time

Exponential Weights on the Hypercube in Polynomial Time European Workshop on Reinforcement Learning 14 (2018) October 2018, Lille, France. Exponential Weights on the Hypercube in Polynomial Time College of Information and Computer Sciences University of Massachusetts

More information

Fast Rates for Estimation Error and Oracle Inequalities for Model Selection

Fast Rates for Estimation Error and Oracle Inequalities for Model Selection Fast Rates for Estimation Error and Oracle Inequalities for Model Selection Peter L. Bartlett Computer Science Division and Department of Statistics University of California, Berkeley bartlett@cs.berkeley.edu

More information

Theory and Applications of A Repeated Game Playing Algorithm. Rob Schapire Princeton University [currently visiting Yahoo!

Theory and Applications of A Repeated Game Playing Algorithm. Rob Schapire Princeton University [currently visiting Yahoo! Theory and Applications of A Repeated Game Playing Algorithm Rob Schapire Princeton University [currently visiting Yahoo! Research] Learning Is (Often) Just a Game some learning problems: learn from training

More information

The Online Approach to Machine Learning

The Online Approach to Machine Learning The Online Approach to Machine Learning Nicolò Cesa-Bianchi Università degli Studi di Milano N. Cesa-Bianchi (UNIMI) Online Approach to ML 1 / 53 Summary 1 My beautiful regret 2 A supposedly fun game I

More information

Ad Placement Strategies

Ad Placement Strategies Case Study : Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD AdaGrad Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 7 th, 04 Ad

More information

Logarithmic Regret Algorithms for Strongly Convex Repeated Games

Logarithmic Regret Algorithms for Strongly Convex Repeated Games Logarithmic Regret Algorithms for Strongly Convex Repeated Games Shai Shalev-Shwartz 1 and Yoram Singer 1,2 1 School of Computer Sci & Eng, The Hebrew University, Jerusalem 91904, Israel 2 Google Inc 1600

More information

CS261: A Second Course in Algorithms Lecture #12: Applications of Multiplicative Weights to Games and Linear Programs

CS261: A Second Course in Algorithms Lecture #12: Applications of Multiplicative Weights to Games and Linear Programs CS26: A Second Course in Algorithms Lecture #2: Applications of Multiplicative Weights to Games and Linear Programs Tim Roughgarden February, 206 Extensions of the Multiplicative Weights Guarantee Last

More information

Learning theory. Ensemble methods. Boosting. Boosting: history

Learning theory. Ensemble methods. Boosting. Boosting: history Learning theory Probability distribution P over X {0, 1}; let (X, Y ) P. We get S := {(x i, y i )} n i=1, an iid sample from P. Ensemble methods Goal: Fix ɛ, δ (0, 1). With probability at least 1 δ (over

More information

A survey: The convex optimization approach to regret minimization

A survey: The convex optimization approach to regret minimization A survey: The convex optimization approach to regret minimization Elad Hazan September 10, 2009 WORKING DRAFT Abstract A well studied and general setting for prediction and decision making is regret minimization

More information

Machine Learning Lecture 7

Machine Learning Lecture 7 Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant

More information

Meta Algorithms for Portfolio Selection. Technical Report

Meta Algorithms for Portfolio Selection. Technical Report Meta Algorithms for Portfolio Selection Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 EECS Building 200 Union Street SE Minneapolis, MN 55455-0159 USA TR

More information

Multi-armed bandit models: a tutorial

Multi-armed bandit models: a tutorial Multi-armed bandit models: a tutorial CERMICS seminar, March 30th, 2016 Multi-Armed Bandit model: general setting K arms: for a {1,..., K}, (X a,t ) t N is a stochastic process. (unknown distributions)

More information

Algorithms, Games, and Networks January 17, Lecture 2

Algorithms, Games, and Networks January 17, Lecture 2 Algorithms, Games, and Networks January 17, 2013 Lecturer: Avrim Blum Lecture 2 Scribe: Aleksandr Kazachkov 1 Readings for today s lecture Today s topic is online learning, regret minimization, and minimax

More information

Agnostic Online learnability

Agnostic Online learnability Technical Report TTIC-TR-2008-2 October 2008 Agnostic Online learnability Shai Shalev-Shwartz Toyota Technological Institute Chicago shai@tti-c.org ABSTRACT We study a fundamental question. What classes

More information

Peter Hoff Minimax estimation November 12, Motivation and definition. 2 Least favorable prior 3. 3 Least favorable prior sequence 11

Peter Hoff Minimax estimation November 12, Motivation and definition. 2 Least favorable prior 3. 3 Least favorable prior sequence 11 Contents 1 Motivation and definition 1 2 Least favorable prior 3 3 Least favorable prior sequence 11 4 Nonparametric problems 15 5 Minimax and admissibility 18 6 Superefficiency and sparsity 19 Most of

More information

On the Generalization Ability of Online Strongly Convex Programming Algorithms

On the Generalization Ability of Online Strongly Convex Programming Algorithms On the Generalization Ability of Online Strongly Convex Programming Algorithms Sham M. Kakade I Chicago Chicago, IL 60637 sham@tti-c.org Ambuj ewari I Chicago Chicago, IL 60637 tewari@tti-c.org Abstract

More information

Bandit models: a tutorial

Bandit models: a tutorial Gdt COS, December 3rd, 2015 Multi-Armed Bandit model: general setting K arms: for a {1,..., K}, (X a,t ) t N is a stochastic process. (unknown distributions) Bandit game: a each round t, an agent chooses

More information

Generalization theory

Generalization theory Generalization theory Daniel Hsu Columbia TRIPODS Bootcamp 1 Motivation 2 Support vector machines X = R d, Y = { 1, +1}. Return solution ŵ R d to following optimization problem: λ min w R d 2 w 2 2 + 1

More information

Online Convex Optimization

Online Convex Optimization Advanced Course in Machine Learning Spring 2010 Online Convex Optimization Handouts are jointly prepared by Shie Mannor and Shai Shalev-Shwartz A convex repeated game is a two players game that is performed

More information

Foundations of Machine Learning On-Line Learning. Mehryar Mohri Courant Institute and Google Research

Foundations of Machine Learning On-Line Learning. Mehryar Mohri Courant Institute and Google Research Foundations of Machine Learning On-Line Learning Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Motivation PAC learning: distribution fixed over time (training and test). IID assumption.

More information

Empirical Risk Minimization

Empirical Risk Minimization Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space

More information

Gaussian and Linear Discriminant Analysis; Multiclass Classification

Gaussian and Linear Discriminant Analysis; Multiclass Classification Gaussian and Linear Discriminant Analysis; Multiclass Classification Professor Ameet Talwalkar Slide Credit: Professor Fei Sha Professor Ameet Talwalkar CS260 Machine Learning Algorithms October 13, 2015

More information

A New Interpretation of Information Rate

A New Interpretation of Information Rate A New Interpretation of Information Rate reproduced with permission of AT&T By J. L. Kelly, jr. (Manuscript received March 2, 956) If the input symbols to a communication channel represent the outcomes

More information

Yevgeny Seldin. University of Copenhagen

Yevgeny Seldin. University of Copenhagen Yevgeny Seldin University of Copenhagen Classical (Batch) Machine Learning Collect Data Data Assumption The samples are independent identically distributed (i.i.d.) Machine Learning Prediction rule New

More information

Optimization, Learning, and Games with Predictable Sequences

Optimization, Learning, and Games with Predictable Sequences Optimization, Learning, and Games with Predictable Sequences Alexander Rakhlin University of Pennsylvania Karthik Sridharan University of Pennsylvania Abstract We provide several applications of Optimistic

More information

Max Margin-Classifier

Max Margin-Classifier Max Margin-Classifier Oliver Schulte - CMPT 726 Bishop PRML Ch. 7 Outline Maximum Margin Criterion Math Maximizing the Margin Non-Separable Data Kernels and Non-linear Mappings Where does the maximization

More information

Alireza Shafaei. Machine Learning Reading Group The University of British Columbia Summer 2017

Alireza Shafaei. Machine Learning Reading Group The University of British Columbia Summer 2017 s s Machine Learning Reading Group The University of British Columbia Summer 2017 (OCO) Convex 1/29 Outline (OCO) Convex Stochastic Bernoulli s (OCO) Convex 2/29 At each iteration t, the player chooses

More information

LEM. New Results on Betting Strategies, Market Selection, and the Role of Luck. Giulio Bottazzi Daniele Giachini

LEM. New Results on Betting Strategies, Market Selection, and the Role of Luck. Giulio Bottazzi Daniele Giachini LEM WORKING PAPER SERIES New Results on Betting Strategies, Market Selection, and the Role of Luck Giulio Bottazzi Daniele Giachini Institute of Economics, Scuola Sant'Anna, Pisa, Italy 2018/08 February

More information

Worst-Case Bounds for Gaussian Process Models

Worst-Case Bounds for Gaussian Process Models Worst-Case Bounds for Gaussian Process Models Sham M. Kakade University of Pennsylvania Matthias W. Seeger UC Berkeley Abstract Dean P. Foster University of Pennsylvania We present a competitive analysis

More information

Online Passive-Aggressive Algorithms

Online Passive-Aggressive Algorithms Online Passive-Aggressive Algorithms Koby Crammer Ofer Dekel Shai Shalev-Shwartz Yoram Singer School of Computer Science & Engineering The Hebrew University, Jerusalem 91904, Israel {kobics,oferd,shais,singer}@cs.huji.ac.il

More information

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44 Convex Optimization Newton s method ENSAE: Optimisation 1/44 Unconstrained minimization minimize f(x) f convex, twice continuously differentiable (hence dom f open) we assume optimal value p = inf x f(x)

More information

An Algorithms-based Intro to Machine Learning

An Algorithms-based Intro to Machine Learning CMU 15451 lecture 12/08/11 An Algorithmsbased Intro to Machine Learning Plan for today Machine Learning intro: models and basic issues An interesting algorithm for combining expert advice Avrim Blum [Based

More information

Sequential prediction with coded side information under logarithmic loss

Sequential prediction with coded side information under logarithmic loss under logarithmic loss Yanina Shkel Department of Electrical Engineering Princeton University Princeton, NJ 08544, USA Maxim Raginsky Department of Electrical and Computer Engineering Coordinated Science

More information

Optimal and Adaptive Online Learning

Optimal and Adaptive Online Learning Optimal and Adaptive Online Learning Haipeng Luo Advisor: Robert Schapire Computer Science Department Princeton University Examples of Online Learning (a) Spam detection 2 / 34 Examples of Online Learning

More information

THE WEIGHTED MAJORITY ALGORITHM

THE WEIGHTED MAJORITY ALGORITHM THE WEIGHTED MAJORITY ALGORITHM Csaba Szepesvári University of Alberta CMPUT 654 E-mail: szepesva@ualberta.ca UofA, October 3, 2006 OUTLINE 1 PREDICTION WITH EXPERT ADVICE 2 HALVING: FIND THE PERFECT EXPERT!

More information

Worst-Case Analysis of the Perceptron and Exponentiated Update Algorithms

Worst-Case Analysis of the Perceptron and Exponentiated Update Algorithms Worst-Case Analysis of the Perceptron and Exponentiated Update Algorithms Tom Bylander Division of Computer Science The University of Texas at San Antonio San Antonio, Texas 7849 bylander@cs.utsa.edu April

More information

Regret bounded by gradual variation for online convex optimization

Regret bounded by gradual variation for online convex optimization Noname manuscript No. will be inserted by the editor Regret bounded by gradual variation for online convex optimization Tianbao Yang Mehrdad Mahdavi Rong Jin Shenghuo Zhu Received: date / Accepted: date

More information

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Charles Elkan elkan@cs.ucsd.edu January 17, 2013 1 Principle of maximum likelihood Consider a family of probability distributions

More information

Internal Regret in On-line Portfolio Selection

Internal Regret in On-line Portfolio Selection Internal Regret in On-line Portfolio Selection Gilles Stoltz (gilles.stoltz@ens.fr) Département de Mathématiques et Applications, Ecole Normale Supérieure, 75005 Paris, France Gábor Lugosi (lugosi@upf.es)

More information

From Bandits to Experts: A Tale of Domination and Independence

From Bandits to Experts: A Tale of Domination and Independence From Bandits to Experts: A Tale of Domination and Independence Nicolò Cesa-Bianchi Università degli Studi di Milano N. Cesa-Bianchi (UNIMI) Domination and Independence 1 / 1 From Bandits to Experts: A

More information

Overfitting, Bias / Variance Analysis

Overfitting, Bias / Variance Analysis Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic

More information

Peter Hoff Minimax estimation October 31, Motivation and definition. 2 Least favorable prior 3. 3 Least favorable prior sequence 11

Peter Hoff Minimax estimation October 31, Motivation and definition. 2 Least favorable prior 3. 3 Least favorable prior sequence 11 Contents 1 Motivation and definition 1 2 Least favorable prior 3 3 Least favorable prior sequence 11 4 Nonparametric problems 15 5 Minimax and admissibility 18 6 Superefficiency and sparsity 19 Most of

More information

OLSO. Online Learning and Stochastic Optimization. Yoram Singer August 10, Google Research

OLSO. Online Learning and Stochastic Optimization. Yoram Singer August 10, Google Research OLSO Online Learning and Stochastic Optimization Yoram Singer August 10, 2016 Google Research References Introduction to Online Convex Optimization, Elad Hazan, Princeton University Online Learning and

More information

Online prediction with expert advise

Online prediction with expert advise Online prediction with expert advise Jyrki Kivinen Australian National University http://axiom.anu.edu.au/~kivinen Contents 1. Online prediction: introductory example, basic setting 2. Classification with

More information

CS 188: Artificial Intelligence Spring Today

CS 188: Artificial Intelligence Spring Today CS 188: Artificial Intelligence Spring 2006 Lecture 9: Naïve Bayes 2/14/2006 Dan Klein UC Berkeley Many slides from either Stuart Russell or Andrew Moore Bayes rule Today Expectations and utilities Naïve

More information

Tutorial: PART 2. Online Convex Optimization, A Game- Theoretic Approach to Learning

Tutorial: PART 2. Online Convex Optimization, A Game- Theoretic Approach to Learning Tutorial: PART 2 Online Convex Optimization, A Game- Theoretic Approach to Learning Elad Hazan Princeton University Satyen Kale Yahoo Research Exploiting curvature: logarithmic regret Logarithmic regret

More information

Online Prediction: Bayes versus Experts

Online Prediction: Bayes versus Experts Marcus Hutter - 1 - Online Prediction Bayes versus Experts Online Prediction: Bayes versus Experts Marcus Hutter Istituto Dalle Molle di Studi sull Intelligenza Artificiale IDSIA, Galleria 2, CH-6928 Manno-Lugano,

More information

Lecture 4: Lower Bounds (ending); Thompson Sampling

Lecture 4: Lower Bounds (ending); Thompson Sampling CMSC 858G: Bandits, Experts and Games 09/12/16 Lecture 4: Lower Bounds (ending); Thompson Sampling Instructor: Alex Slivkins Scribed by: Guowei Sun,Cheng Jie 1 Lower bounds on regret (ending) Recap from

More information

Geometry and Optimization of Relative Arbitrage

Geometry and Optimization of Relative Arbitrage Geometry and Optimization of Relative Arbitrage Ting-Kam Leonard Wong joint work with Soumik Pal Department of Mathematics, University of Washington Financial/Actuarial Mathematic Seminar, University of

More information

Adaptive Gradient Methods AdaGrad / Adam. Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade

Adaptive Gradient Methods AdaGrad / Adam. Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade Adaptive Gradient Methods AdaGrad / Adam Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade 1 Announcements: HW3 posted Dual coordinate ascent (some review of SGD and random

More information

CS261: Problem Set #3

CS261: Problem Set #3 CS261: Problem Set #3 Due by 11:59 PM on Tuesday, February 23, 2016 Instructions: (1) Form a group of 1-3 students. You should turn in only one write-up for your entire group. (2) Submission instructions:

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

Online Optimization : Competing with Dynamic Comparators

Online Optimization : Competing with Dynamic Comparators Ali Jadbabaie Alexander Rakhlin Shahin Shahrampour Karthik Sridharan University of Pennsylvania University of Pennsylvania University of Pennsylvania Cornell University Abstract Recent literature on online

More information

Robustness and duality of maximum entropy and exponential family distributions

Robustness and duality of maximum entropy and exponential family distributions Chapter 7 Robustness and duality of maximum entropy and exponential family distributions In this lecture, we continue our study of exponential families, but now we investigate their properties in somewhat

More information

The Free Matrix Lunch

The Free Matrix Lunch The Free Matrix Lunch Wouter M. Koolen Wojciech Kot lowski Manfred K. Warmuth Tuesday 24 th April, 2012 Koolen, Kot lowski, Warmuth (RHUL) The Free Matrix Lunch Tuesday 24 th April, 2012 1 / 26 Introduction

More information

Generalization bounds

Generalization bounds Advanced Course in Machine Learning pring 200 Generalization bounds Handouts are jointly prepared by hie Mannor and hai halev-hwartz he problem of characterizing learnability is the most basic question

More information

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Part I. Sébastien Bubeck Theory Group

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Part I. Sébastien Bubeck Theory Group Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Part I Sébastien Bubeck Theory Group i.i.d. multi-armed bandit, Robbins [1952] i.i.d. multi-armed bandit, Robbins [1952] Known

More information

Online Learning. Jordan Boyd-Graber. University of Colorado Boulder LECTURE 21. Slides adapted from Mohri

Online Learning. Jordan Boyd-Graber. University of Colorado Boulder LECTURE 21. Slides adapted from Mohri Online Learning Jordan Boyd-Graber University of Colorado Boulder LECTURE 21 Slides adapted from Mohri Jordan Boyd-Graber Boulder Online Learning 1 of 31 Motivation PAC learning: distribution fixed over

More information

CS264: Beyond Worst-Case Analysis Lecture #20: From Unknown Input Distributions to Instance Optimality

CS264: Beyond Worst-Case Analysis Lecture #20: From Unknown Input Distributions to Instance Optimality CS264: Beyond Worst-Case Analysis Lecture #20: From Unknown Input Distributions to Instance Optimality Tim Roughgarden December 3, 2014 1 Preamble This lecture closes the loop on the course s circle of

More information

Advanced Machine Learning

Advanced Machine Learning Advanced Machine Learning Online Convex Optimization MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Outline Online projected sub-gradient descent. Exponentiated Gradient (EG). Mirror descent.

More information

FORMULATION OF THE LEARNING PROBLEM

FORMULATION OF THE LEARNING PROBLEM FORMULTION OF THE LERNING PROBLEM MIM RGINSKY Now that we have seen an informal statement of the learning problem, as well as acquired some technical tools in the form of concentration inequalities, we

More information

CS 188: Artificial Intelligence Fall 2008

CS 188: Artificial Intelligence Fall 2008 CS 188: Artificial Intelligence Fall 2008 Lecture 23: Perceptrons 11/20/2008 Dan Klein UC Berkeley 1 General Naïve Bayes A general naive Bayes model: C E 1 E 2 E n We only specify how each feature depends

More information

General Naïve Bayes. CS 188: Artificial Intelligence Fall Example: Overfitting. Example: OCR. Example: Spam Filtering. Example: Spam Filtering

General Naïve Bayes. CS 188: Artificial Intelligence Fall Example: Overfitting. Example: OCR. Example: Spam Filtering. Example: Spam Filtering CS 188: Artificial Intelligence Fall 2008 General Naïve Bayes A general naive Bayes model: C Lecture 23: Perceptrons 11/20/2008 E 1 E 2 E n Dan Klein UC Berkeley We only specify how each feature depends

More information

Adaptive Sampling Under Low Noise Conditions 1

Adaptive Sampling Under Low Noise Conditions 1 Manuscrit auteur, publié dans "41èmes Journées de Statistique, SFdS, Bordeaux (2009)" Adaptive Sampling Under Low Noise Conditions 1 Nicolò Cesa-Bianchi Dipartimento di Scienze dell Informazione Università

More information

Linear Classifiers and the Perceptron

Linear Classifiers and the Perceptron Linear Classifiers and the Perceptron William Cohen February 4, 2008 1 Linear classifiers Let s assume that every instance is an n-dimensional vector of real numbers x R n, and there are only two possible

More information

CS60021: Scalable Data Mining. Large Scale Machine Learning

CS60021: Scalable Data Mining. Large Scale Machine Learning J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 CS60021: Scalable Data Mining Large Scale Machine Learning Sourangshu Bhattacharya Example: Spam filtering Instance

More information

Minimax risk bounds for linear threshold functions

Minimax risk bounds for linear threshold functions CS281B/Stat241B (Spring 2008) Statistical Learning Theory Lecture: 3 Minimax risk bounds for linear threshold functions Lecturer: Peter Bartlett Scribe: Hao Zhang 1 Review We assume that there is a probability

More information

Online Bounds for Bayesian Algorithms

Online Bounds for Bayesian Algorithms Online Bounds for Bayesian Algorithms Sham M. Kakade Computer and Information Science Department University of Pennsylvania Andrew Y. Ng Computer Science Department Stanford University Abstract We present

More information