Learning Methods for Online Prediction Problems. Peter Bartlett Statistics and EECS UC Berkeley


1 Learning Methods for Online Prediction Problems Peter Bartlett Statistics and EECS UC Berkeley

2 Online Learning. Repeated game: at each round $t$, the decision method plays $a_t$, then the world reveals $\ell_t$. Aim: minimize the cumulative loss $\hat L_n = \sum_{t=1}^n \ell_t(a_t)$. For example, aim to minimize regret, that is, perform well compared to the best (in retrospect) from some class $A$:
$$\text{regret} = \sum_{t=1}^n \ell_t(a_t) - \min_{a \in A} \sum_{t=1}^n \ell_t(a) = \hat L_n - L_n^*.$$
Data can be adversarially chosen.

3 Online Learning. Minimax regret is the value of the game:
$$\min_{a_1} \max_{\ell_1} \cdots \min_{a_n} \max_{\ell_n} \left( \sum_{t=1}^n \ell_t(a_t) - \min_{a \in A} \sum_{t=1}^n \ell_t(a) \right).$$

4 Online Learning: Motivations 1. Adversarial model is appropriate for Computer security. Computational finance. 2. Adversarial model assumes little: It is often straightforward to convert a strategy for an adversarial environment to a method for a probabilistic environment. 3. Studying the adversarial model sometimes reveals the deterministic core of a statistical problem: there are strong similarities between the performance guarantees in the two cases, and in particular between their dependence on the complexity of the class of prediction rules. 4. There are significant overlaps in the design of methods for the two problems: Regularization plays a central role. Many online prediction strategies have a natural interpretation as a Bayesian method.

5 Computer Security: Spam Detection

6 Computer Security: Spam Detection. Here, the action $a_t$ might be a classification rule, and $\ell_t$ is the indicator for a particular email being incorrectly classified (e.g., spam allowed through). The sender can determine if an email is delivered (or detected as spam), and try to modify it. An adversarial model allows an arbitrary sequence. We cannot hope for good classification accuracy in an absolute sense; regret is relative to a comparison class. Minimizing regret ensures that the spam detection accuracy is close to the best performance in retrospect on the particular spam sequence.

7 Computer Security: Spam Detection. Suppose we consider features of messages from some set $X$ (e.g., information about the header, about words in the message, about attachments). The decision method's action $a_t$ is a mapping from $X$ to $[0,1]$ (think of the value as an estimated probability that the message is spam). At each round, the adversary chooses a feature vector $x_t \in X$ and a label $y_t \in \{0,1\}$, and the loss is defined as $\ell_t(a_t) = (y_t - a_t(x_t))^2$. The regret is then the excess squared error, over the best achievable on the data sequence:
$$\sum_{t=1}^n \ell_t(a_t) - \min_{a \in A} \sum_{t=1}^n \ell_t(a) = \sum_{t=1}^n (y_t - a_t(x_t))^2 - \min_{a \in A} \sum_{t=1}^n (y_t - a(x_t))^2.$$

8 Computer Security: Web Spam Detection Web Spam Challenge (

9 Computer Security: Detecting Denial of Service ACM

10 Computational Finance: Portfolio Optimization

11 Computational Finance: Portfolio Optimization. Aim to choose a portfolio (distribution over financial instruments) to maximize utility. Other market players can profit from making our decisions bad ones. For example, if our trades have a market impact, someone can front-run (trade ahead of us). Here, the action $a_t$ is a distribution on instruments, and $\ell_t$ might be the negative logarithm of the portfolio's increase, $a_t \cdot r_t$, where $r_t$ is the vector of relative price increases. We might compare our performance to the best stock (distribution is a delta function), or a set of indices (distribution corresponds to Dow Jones Industrial Average, etc.), or the set of all distributions.

12 Computational Finance: Portfolio Optimization. The decision method's action $a_t$ is a distribution on the $m$ instruments, $a_t \in \Delta_m = \{a \in [0,1]^m : \sum_i a_i = 1\}$. At each round, the adversary chooses a vector of returns $r_t \in \mathbb{R}^m_+$; the $i$th component is the ratio of the price of instrument $i$ at time $t$ to its price at the previous time, and the loss is defined as $\ell_t(a_t) = -\log(a_t \cdot r_t)$. The regret is then the log of the ratio of the maximum value the portfolio would have at the end (for the best mixture choice) to the final portfolio value:
$$\sum_{t=1}^n \ell_t(a_t) - \min_{a \in A} \sum_{t=1}^n \ell_t(a) = \max_{a \in A} \sum_{t=1}^n \log(a \cdot r_t) - \sum_{t=1}^n \log(a_t \cdot r_t).$$

13 Online Learning: Motivations 2. Online algorithms are also effective in probabilistic settings. Easy to convert an online algorithm to a batch algorithm. Easy to show that good online performance implies good i.i.d. performance, for example.

14 Online Learning: Motivations 3. Understanding statistical prediction methods. Many statistical methods, based on probabilistic assumptions, can be effective in an adversarial setting. Analyzing their performance in adversarial settings provides perspective on their robustness. We would like violations of the probabilistic assumptions to have a limited impact.

15 Key Points Online Learning: repeated game. aim to minimize regret. Data can be adversarially chosen. Motivations: Often appropriate (security, finance). Algorithms also effective in probabilistic settings. Can provide insight into statistical prediction methods.

16 Course Synopsis A finite comparison class: A = {1,..., m}. Converting online to batch. Online convex optimization. Log loss.

17 Finite Comparison Class. 1. Prediction with expert advice. 2. With perfect predictions: $\log m$ regret. 3. Exponential weights strategy: $\sqrt{n \log m}$ regret. 4. Refinements and extensions: exponential weights and $L^* = 0$; $n$ unknown; $L^*$ unknown; Bayesian interpretation; convex (versus linear) losses. 5. Statistical prediction with a finite class.

18 Prediction with Expert Advice. Suppose we are predicting whether it will rain tomorrow. We have access to a set of $m$ experts, who each make a forecast of 0 or 1. Can we ensure that we predict almost as well as the best expert? Here, $A = \{1,\ldots,m\}$. There are $m$ experts, and each has a forecast sequence $f^i_1, f^i_2, \ldots$ from $\{0,1\}$. At round $t$, the adversary chooses an outcome $y_t \in \{0,1\}$, and sets
$$\ell_t(i) = 1[f^i_t \ne y_t] = \begin{cases} 1 & \text{if } f^i_t \ne y_t, \\ 0 & \text{otherwise.} \end{cases}$$

19 Online Learning. Minimax regret is the value of the game:
$$\min_{a_1} \max_{\ell_1} \cdots \min_{a_n} \max_{\ell_n} \left( \sum_{t=1}^n \ell_t(a_t) - \min_{a \in A} \sum_{t=1}^n \ell_t(a) \right),$$
where $\hat L_n = \sum_{t=1}^n \ell_t(a_t)$ and $L_n^* = \min_{a \in A} \sum_{t=1}^n \ell_t(a)$.

20 Prediction with Expert Advice. An easier game: suppose that the adversary is constrained to choose the sequence $y_t$ so that some expert incurs no loss ($L_n^* = 0$), that is, there is an $i \in \{1,\ldots,m\}$ such that for all $t$, $y_t = f^i_t$. How should we predict?

21 Prediction with Expert Advice: Halving. Define the set of experts who have been correct so far: $C_t = \{i : \ell_1(i) = \cdots = \ell_{t-1}(i) = 0\}$. Choose $a_t$ as any element of $\{i : f^i_t = \text{majority}(\{f^j_t : j \in C_t\})\}$. Theorem: This strategy has regret no more than $\log_2 m$.

22 Prediction with Expert Advice: Halving. Theorem: The halving strategy has regret no more than $\log_2 m$. Proof: If it makes a mistake (that is, $\ell_t(a_t) = 1$), then the minority of $\{f^j_t : j \in C_t\}$ is correct, so at least half of the experts are eliminated: $|C_{t+1}| \le |C_t|/2$. And otherwise $|C_{t+1}| \le |C_t|$ (because $C_t$ never increases). Thus,
$$\hat L_n = \sum_{t=1}^n \ell_t(a_t) \le \log_2 \frac{|C_1|}{|C_{n+1}|} = \log_2 m - \log_2 |C_{n+1}| \le \log_2 m.$$
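A minimal sketch of the halving strategy in Python (illustrative; it assumes binary forecasts and that some expert is perfect, so the set of consistent experts never becomes empty):

```python
import numpy as np

def halving_predictions(forecasts, outcomes):
    """Run the halving strategy.

    forecasts: (n, m) array of expert forecasts in {0, 1}.
    outcomes:  (n,) array of true outcomes in {0, 1}.
    Returns the sequence of predictions and the number of mistakes.
    """
    n, m = forecasts.shape
    consistent = np.ones(m, dtype=bool)          # C_t: experts with no mistakes so far
    predictions, mistakes = [], 0
    for t in range(n):
        votes = forecasts[t, consistent]
        a_t = 1 if votes.mean() >= 0.5 else 0    # majority vote of consistent experts
        predictions.append(a_t)
        mistakes += int(a_t != outcomes[t])
        consistent &= (forecasts[t] == outcomes[t])   # eliminate experts that erred
    return predictions, mistakes
```

When a perfect expert exists, the returned mistake count is at most $\log_2 m$, matching the theorem.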

23 Prediction with Expert Advice The proof follows a pattern we shall see again: find some measure of progress (here, C t ) that changes monotonically when excess loss is incurred (here, it halves), is somehow constrained (here, it cannot fall below 1, because there is an expert who predicts perfectly). What if there is no perfect expert? Maintaining C t makes no sense.

24 Finite Comparison Class. 1. Prediction with expert advice. 2. With perfect predictions: $\log m$ regret. 3. Exponential weights strategy: $\sqrt{n \log m}$ regret. 4. Refinements and extensions: exponential weights and $L^* = 0$; $n$ unknown; $L^*$ unknown; Bayesian interpretation; convex (versus linear) losses. 5. Statistical prediction with a finite class.

25 Prediction with Expert Advice: Mixed Strategies. We have $m$ experts. Allow a mixed strategy, that is, $a_t$ chosen from the simplex $\Delta_m$, the set of distributions on $\{1,\ldots,m\}$,
$$\Delta_m = \left\{ a \in [0,1]^m : \sum_{i=1}^m a_i = 1 \right\}.$$
We can think of the strategy as choosing an element of $\{1,\ldots,m\}$ randomly, according to a distribution $a_t$. Or we can think of it as playing an element $a_t$ of $\Delta_m$, and incurring the expected loss
$$\ell_t(a_t) = \sum_{i=1}^m a^i_t\, \ell_t(e_i),$$
where $\ell_t(e_i) \in [0,1]$ is the loss incurred by expert $i$. ($e_i$ denotes the vector with a single 1 in the $i$th coordinate, and the rest zeros.)

26 Prediction with Expert Advice: Exponential Weights. Maintain a set of (unnormalized) weights over experts:
$$w^i_0 = 1, \qquad w^i_{t+1} = w^i_t \exp(-\eta \ell_t(e_i)).$$
Here, $\eta > 0$ is a parameter of the algorithm. Choose $a_t$ as the normalized vector,
$$a_t = \frac{w_t}{\sum_{i=1}^m w^i_t}.$$

27 Prediction with Expert Advice: Exponential Weights. Theorem: The exponential weights strategy with parameter $\eta = \sqrt{8 \ln m / n}$ has regret satisfying
$$\hat L_n - L_n^* \le \sqrt{\frac{n \ln m}{2}}.$$
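A minimal sketch of the exponential weights strategy for losses in $[0,1]$ (illustrative; the loss matrix is supplied by the caller):

```python
import numpy as np

def exponential_weights(loss_matrix, eta=None):
    """Play exponential weights against a sequence of loss vectors.

    loss_matrix: (n, m) array; loss_matrix[t, i] is the loss of expert i at
                 round t, assumed to lie in [0, 1].
    eta:         learning rate; defaults to sqrt(8 ln m / n).
    Returns the strategy's total (expected) loss and the best expert's loss.
    """
    n, m = loss_matrix.shape
    if eta is None:
        eta = np.sqrt(8.0 * np.log(m) / n)
    log_w = np.zeros(m)                      # log of unnormalized weights
    total_loss = 0.0
    for t in range(n):
        a_t = np.exp(log_w - log_w.max())
        a_t /= a_t.sum()                     # normalized weights = mixed action
        total_loss += a_t @ loss_matrix[t]   # expected loss this round
        log_w -= eta * loss_matrix[t]        # multiplicative update, in log-space
    best_expert_loss = loss_matrix.sum(axis=0).min()
    return total_loss, best_expert_loss
```

The difference between the two returned values is the regret, which the theorem bounds by $\sqrt{n \ln m / 2}$.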

28 Exponential Weights: Proof Idea. We use a measure of progress: $W_t = \sum_{i=1}^m w^i_t$.
1. $W_n$ grows at least as $\exp\left(-\eta \min_i \sum_{t=1}^n \ell_t(e_i)\right)$.
2. $W_n$ grows no faster than $\exp\left(-\eta \sum_{t=1}^n \ell_t(a_t)\right)$.

29 Exponential Weights: Proof 1.
$$\ln \frac{W_{n+1}}{W_1} = \ln\left(\sum_{i=1}^m w^i_{n+1}\right) - \ln m = \ln\left(\sum_{i=1}^m \exp\left(-\eta \sum_t \ell_t(e_i)\right)\right) - \ln m \ge \ln\left(\max_i \exp\left(-\eta \sum_t \ell_t(e_i)\right)\right) - \ln m = -\eta \min_i \sum_t \ell_t(e_i) - \ln m = -\eta L_n^* - \ln m.$$

30 Exponential Weights: Proof 2.
$$\ln \frac{W_{t+1}}{W_t} = \ln\left(\frac{\sum_{i=1}^m \exp(-\eta \ell_t(e_i))\, w^i_t}{\sum_i w^i_t}\right) \le -\eta\, \frac{\sum_i \ell_t(e_i) w^i_t}{\sum_i w^i_t} + \frac{\eta^2}{8} = -\eta \ell_t(a_t) + \frac{\eta^2}{8},$$
where we have used Hoeffding's inequality: for a random variable $X \in [a,b]$ and $\lambda \in \mathbb{R}$,
$$\ln\left(\mathbb{E}\, e^{\lambda X}\right) \le \lambda\, \mathbb{E} X + \frac{\lambda^2 (b-a)^2}{8}.$$

31 Aside: Proof of Hoeffding's inequality. Define
$$A(\lambda) = \log\left(\mathbb{E}\, e^{\lambda X}\right) = \log\left(\int e^{\lambda x}\, dP(x)\right),$$
where $X \sim P$. Then $A$ is the log normalization of the exponential family random variable $X_\lambda$ with reference measure $P$ and sufficient statistic $x$. Since $P$ has bounded support, $A(\lambda) < \infty$ for all $\lambda$, and we know that $A'(\lambda) = \mathbb{E}(X_\lambda)$ and $A''(\lambda) = \mathrm{Var}(X_\lambda)$. Since $P$ has support in $[a,b]$, $\mathrm{Var}(X_\lambda) \le (b-a)^2/4$. Then a Taylor expansion about $\lambda = 0$ (where $X_\lambda$ has the same distribution as $X$) gives
$$A(\lambda) \le \lambda\, \mathbb{E} X + \frac{\lambda^2}{8}(b-a)^2.$$

32 Exponential Weights: Proof. Thus,
$$-\eta L_n^* - \ln m \le \ln \frac{W_{n+1}}{W_1} \le -\eta \hat L_n + \frac{n\eta^2}{8}, \qquad \text{so} \qquad \hat L_n - L_n^* \le \frac{\ln m}{\eta} + \frac{\eta n}{8}.$$
Choosing the optimal $\eta$ gives the result: Theorem: The exponential weights strategy with parameter $\eta = \sqrt{8 \ln m / n}$ has regret no more than $\sqrt{n \ln m / 2}$.

33 Key Points. For a finite set of actions (experts): If one is perfect (zero loss), the halving algorithm gives per round regret of $\frac{\ln m}{n}$. Exponential weights gives per round regret of $O\left(\sqrt{\frac{\ln m}{n}}\right)$.

34 Prediction with Expert Advice: Refinements. 1. Does the exponential weights strategy give the faster rate if $L^* = 0$? 2. Do we need to know $n$ to set $\eta$?

35 Prediction with Expert Advice: Refinements. 1. Does the exponential weights strategy give the faster rate if $L^* = 0$? Replace Hoeffding,
$$\ln \mathbb{E}\, e^{\lambda X} \le \lambda\, \mathbb{E} X + \frac{\lambda^2}{8},$$
with Bernstein (for $X \in [0,1]$):
$$\ln \mathbb{E}\, e^{\lambda X} \le (e^\lambda - 1)\, \mathbb{E} X.$$

36 Exponential Weights: Proof 2. Thus
$$\ln \frac{W_{t+1}}{W_t} = \ln\left(\frac{\sum_{i=1}^m \exp(-\eta \ell_t(e_i))\, w^i_t}{\sum_i w^i_t}\right) \le -\left(1 - e^{-\eta}\right) \ell_t(a_t),$$
which gives
$$\hat L_n \le \frac{\eta}{1 - e^{-\eta}} L_n^* + \frac{\ln m}{1 - e^{-\eta}}.$$
For example, if $L_n^* = 0$ and $\eta$ is large, we obtain a regret bound of roughly $\ln m / n$ again. And $\eta$ large is like the halving algorithm (it puts roughly equal weight on all experts that have zero loss so far).

37 Prediction with Expert Advice: Refinements. 2. Do we need to know $n$ to set $\eta$? We used the optimal setting $\eta = \sqrt{8 \ln m / n}$. But can this regret bound be achieved uniformly across time? Yes; using a time-varying $\eta_t = \sqrt{8 \ln m / t}$ gives the same rate (worse constants). It is also possible to set $\eta$ as a function of $L^*_t$, the best cumulative loss so far, to give the improved bound for small losses uniformly across time (worse constants).

38 Prediction with Expert Advice: Refinements. 3. We can interpret the exponential weights strategy as computing a Bayesian posterior. Consider $f^i_t \in [0,1]$, $y_t \in \{0,1\}$, and $\ell^i_t = |f^i_t - y_t|$. Then consider a Bayesian prior that is uniform on $m$ distributions. Given the $i$th distribution, $y_t$ is a Bernoulli random variable with parameter
$$\frac{e^{-\eta(1 - f^i_t)}}{e^{-\eta(1 - f^i_t)} + e^{-\eta f^i_t}}.$$
Then exponential weights is computing the posterior distribution over the $m$ distributions.

39 Prediction with Expert Advice: Refinements. 4. We could work with arbitrary convex losses on $\Delta_m$: We defined the loss as linear in $a$: $\ell_t(a) = \sum_i a_i \ell_t(e_i)$. We could replace this with any bounded convex function on $\Delta_m$. The only change in the proof is that an equality becomes an inequality:
$$-\eta\, \frac{\sum_i \ell_t(e_i) w^i_t}{\sum_i w^i_t} \le -\eta \ell_t(a_t).$$

40 Prediction with Expert Advice: Refinements. But note that the exponential weights strategy only competes with the corners of the simplex: Theorem: For convex functions $\ell_t : \Delta_m \to [0,1]$, the exponential weights strategy, with $\eta = \sqrt{8 \ln m / n}$, satisfies
$$\sum_{t=1}^n \ell_t(a_t) \le \min_i \sum_{t=1}^n \ell_t(e_i) + \sqrt{\frac{n \ln m}{2}}.$$

41 Finite Comparison Class. 1. Prediction with expert advice. 2. With perfect predictions: $\log m$ regret. 3. Exponential weights strategy: $\sqrt{n \log m}$ regret. 4. Refinements and extensions: exponential weights and $L^* = 0$; $n$ unknown; $L^*$ unknown; Bayesian interpretation; convex (versus linear) losses. 5. Statistical prediction with a finite class.

42 Probabilistic Prediction Setting. Let's consider a probabilistic formulation of a prediction problem. There is a sample of size $n$ drawn i.i.d. from an unknown probability distribution $P$ on $X \times Y$: $(X_1, Y_1), \ldots, (X_n, Y_n)$. Some method chooses $\hat f : X \to Y$. It suffers regret
$$\mathbb{E}\,\ell(\hat f(X), Y) - \min_{f \in F} \mathbb{E}\,\ell(f(X), Y).$$
Here, $F$ is a class of functions from $X$ to $Y$.

43 Probabilistic Setting: Zero Loss. Theorem: If some $f \in F$ has $\mathbb{E}\,\ell(f(X), Y) = 0$, then choosing
$$\hat f \in C_n = \left\{ f \in F : \hat{\mathbb{E}}\,\ell(f(X), Y) = 0 \right\}$$
leads to regret that is $O\left(\frac{\log |F|}{n}\right)$.

44 Probabilistic Setting: Zero Loss. Proof:
$$\Pr(\mathbb{E}\,\ell(\hat f) \ge \epsilon) \le \Pr\left(\exists f \in F : \hat{\mathbb{E}}\,\ell(f) = 0,\ \mathbb{E}\,\ell(f) \ge \epsilon\right) \le |F|(1 - \epsilon)^n \le |F| e^{-n\epsilon}.$$
Integrating the tail bound $\Pr\left(\mathbb{E}\,\ell(\hat f)\, n / \ln|F| \ge x\right) \le e^{1-x}$ gives $\mathbb{E}\,\ell(\hat f) \le c \ln|F| / n$.

45 Probabilistic Setting. Theorem: Choosing $\hat f$ to minimize the empirical risk, $\hat{\mathbb{E}}\,\ell(f(X), Y)$, leads to regret that is $O\left(\sqrt{\frac{\log |F|}{n}}\right)$.

46 Probabilistic Setting. Proof: By the triangle inequality and the definition of $\hat f$,
$$\mathbb{E}\,\ell_{\hat f} - \min_{f \in F} \mathbb{E}\,\ell_f \le 2\,\mathbb{E} \sup_{f \in F} \left| \mathbb{E}\,\ell_f - \hat{\mathbb{E}}\,\ell_f \right|.$$
A symmetrization argument with i.i.d. Rademacher (uniform $\pm 1$) variables $\epsilon_t$ bounds this uniform deviation:
$$\mathbb{E} \sup_{f \in F} \left| \mathbb{E}\,\ell_f - \hat{\mathbb{E}}\,\ell_f \right| \le 2\,\mathbb{E} \sup_{f \in F} \left| \frac{1}{n} \sum_{t=1}^n \epsilon_t\, \ell_f(X_t, Y_t) \right| \le 2 \sqrt{\frac{2 \log |F|}{n}},$$
where the last step uses the finiteness of the class.

47 Probabilistic Setting: Key Points. For a finite function class: If one is perfect (zero loss), choosing $\hat f$ to minimize the empirical risk, $\hat{\mathbb{E}}\,\ell(f(X), Y)$, gives per round regret of $\frac{\ln |F|}{n}$. In any case, this $\hat f$ has per round regret of $O\left(\sqrt{\frac{\ln |F|}{n}}\right)$, just as in the adversarial setting.

48 Course Synopsis. A finite comparison class: $A = \{1,\ldots,m\}$. 1. Prediction with expert advice. 2. With perfect predictions: $\log m$ regret. 3. Exponential weights strategy: $\sqrt{n \log m}$ regret. 4. Refinements and extensions. 5. Statistical prediction with a finite class. Converting online to batch. Online convex optimization. Log loss.

49 Online to Batch Conversion. Suppose we have an online strategy that, given observations $\ell_1, \ldots, \ell_{t-1}$, produces $a_t = A(\ell_1, \ldots, \ell_{t-1})$. Can we convert this to a method that is suitable for a probabilistic setting? That is, if the $\ell_t$ are chosen i.i.d., can we use $A$'s choices $a_t$ to come up with an $\hat a \in A$ so that
$$\mathbb{E}\,\ell_1(\hat a) - \min_{a \in A} \mathbb{E}\,\ell_1(a)$$
is small? Consider the following simple randomized method: 1. Pick $T$ uniformly from $\{0, \ldots, n\}$. 2. Let $\hat a = A(\ell_{T+1}, \ldots, \ell_n)$.
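A minimal sketch of this randomized conversion (illustrative; the online strategy `A` is assumed to take a list of past loss functions and return an action):

```python
import random

def online_to_batch(A, losses):
    """Randomized online-to-batch conversion.

    A:      online strategy; A(history) returns its next action given a list
            of past loss functions (assumed interface).
    losses: list [l_1, ..., l_n] of i.i.d. loss functions.
    Returns a_hat = A(l_{T+1}, ..., l_n) for T uniform on {0, ..., n}.
    """
    n = len(losses)
    T = random.randint(0, n)      # uniform on {0, 1, ..., n}, inclusive
    return A(losses[T:])          # losses[T:] is the suffix l_{T+1}, ..., l_n
```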

50 Online to Batch Conversion. Theorem: If $A$ has a regret bound of $C_{n+1}$ for sequences of length $n+1$, then for any stationary process generating the $\ell_1, \ldots, \ell_{n+1}$, this method satisfies
$$\mathbb{E}\,\ell_{n+1}(\hat a) - \min_{a \in A} \mathbb{E}\,\ell_n(a) \le \frac{C_{n+1}}{n+1}.$$
(Notice that the expectation averages also over the randomness of the method.)

51 Online to Batch Conversion. Proof:
$$\mathbb{E}\,\ell_{n+1}(\hat a) = \mathbb{E}\,\ell_{n+1}(A(\ell_{T+1}, \ldots, \ell_n)) = \mathbb{E}\,\frac{1}{n+1} \sum_{t=0}^n \ell_{n+1}(A(\ell_{t+1}, \ldots, \ell_n)) = \mathbb{E}\,\frac{1}{n+1} \sum_{t=0}^n \ell_{n-t+1}(A(\ell_1, \ldots, \ell_{n-t})) = \mathbb{E}\,\frac{1}{n+1} \sum_{t=1}^{n+1} \ell_t(A(\ell_1, \ldots, \ell_{t-1}))$$
$$\le \mathbb{E}\,\frac{1}{n+1}\left( \min_{a} \sum_{t=1}^{n+1} \ell_t(a) + C_{n+1} \right) \le \min_a \mathbb{E}\,\ell_t(a) + \frac{C_{n+1}}{n+1}.$$

52 Online to Batch Conversion. The theorem is for the expectation over the randomness of the method. For a high probability result, we could
1. Choose $\hat a = \frac{1}{n}\sum_{t=1}^n a_t$, provided $A$ is convex and the $\ell_t$ are all convex.
2. Choose $\hat a = \arg\min_{a_t} \left( \frac{1}{n-t} \sum_{s=t+1}^n \ell_s(a_t) + c\sqrt{\frac{\log(n/\delta)}{n-t}} \right)$.
In both cases, the analysis involves concentration of martingale sequences. The second (more general) approach does not recover the $C_n/n$ result: the penalty has the wrong form when $C_n = o(\sqrt{n})$.

53 Key Point: Online to Batch Conversion An online strategy with regret bound C n can be converted to a batch method. The regret per trial in the probabilistic setting is bounded by the regret per trial in the adversarial setting.

54 Course Synopsis A finite comparison class: A = {1,..., m}. Converting online to batch. Online convex optimization. 1. Problem formulation 2. Empirical minimization fails. 3. Gradient algorithm. 4. Regularized minimization 5. Regret bounds Log loss.

55 Online Convex Optimization 1. Problem formulation 2. Empirical minimization fails. 3. Gradient algorithm. 4. Regularized minimization Bregman divergence Regularized minimization equivalent to minimizing latest loss and divergence from previous decision Constrained minimization equivalent to unconstrained plus Bregman projection Linearization Mirror descent 5. Regret bounds Unconstrained minimization Seeing the future Strong convexity Examples (gradient, exponentiated gradient) Extensions

56 Online Convex Optimization. $A$ = convex subset of $\mathbb{R}^d$. $L$ = set of convex real functions on $A$. For example, $\ell_t(a) = (x_t \cdot a - y_t)^2$, or $\ell_t(a) = |x_t \cdot a - y_t|$.

57 Online Convex Optimization: Example. Choosing $a_t$ to minimize past losses, $a_t = \arg\min_{a \in A} \sum_{s=1}^{t-1} \ell_s(a)$, can fail. ("fictitious play", "follow the leader") Suppose $A = [-1, 1]$, $L = \{a \mapsto v\,a : |v| \le 1\}$. Consider the following sequence of losses:
$$a_1 = 0,\ \ell_1(a) = \tfrac{1}{2}a; \quad a_2 = -1,\ \ell_2(a) = -a; \quad a_3 = 1,\ \ell_3(a) = a; \quad a_4 = -1,\ \ell_4(a) = -a; \quad a_5 = 1,\ \ell_5(a) = a; \quad \ldots$$
$a = 0$ shows $L_n^* \le 0$, but $\hat L_n = n - 1$.
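A small simulation of this failure example (illustrative; follow-the-leader over $A = [-1,1]$ with the linear losses above):

```python
def follow_the_leader_example(n):
    """Replay the failure example: A = [-1, 1], losses l_t(a) = v_t * a."""
    v = [0.5] + [(-1) ** i for i in range(1, n)]   # v_1 = 1/2, then -1, +1, -1, ...
    cumulative_v, ftl_loss = 0.0, 0.0
    a_t = 0.0                                      # a_1 = 0
    for t in range(n):
        ftl_loss += v[t] * a_t                     # loss of the leader's choice
        cumulative_v += v[t]
        a_t = -1.0 if cumulative_v > 0 else 1.0    # minimize cumulative_v * a over [-1, 1]
    return ftl_loss                                # roughly n - 1, while a = 0 incurs loss 0
```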

58 Online Convex Optimization: Example Choosing a t to minimize past losses can fail. The strategy must avoid overfitting, just as in probabilistic settings. Similar approaches (regularization; Bayesian inference) are applicable in the online setting. First approach: gradient steps. Stay close to previous decisions, but move in a direction of improvement.

59 Online Convex Optimization: Gradient Method.
$$a_1 \in A, \qquad a_{t+1} = \Pi_A\left(a_t - \eta \nabla \ell_t(a_t)\right),$$
where $\Pi_A$ is the Euclidean projection on $A$, $\Pi_A(x) = \arg\min_{a \in A} \|x - a\|$. Theorem: For $G = \max_t \|\nabla \ell_t(a_t)\|$ and $D = \mathrm{diam}(A)$, the gradient strategy with $\eta = D/(G\sqrt{n})$ has regret satisfying
$$\hat L_n - L_n^* \le GD\sqrt{n}.$$
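A minimal sketch of projected online gradient descent (illustrative; the gradient oracles and the projection are supplied by the caller, and a unit-ball projection is included as one possible choice):

```python
import numpy as np

def online_gradient_descent(gradients, project, a1, eta):
    """Projected online gradient descent.

    gradients: list of callables; gradients[t](a) returns grad l_t at a.
    project:   Euclidean projection onto the convex set A.
    a1:        starting point in A.
    eta:       step size, e.g. D / (G * sqrt(n)).
    Returns the sequence of plays a_1, ..., a_n.
    """
    a = np.array(a1, dtype=float)
    plays = [a.copy()]
    for grad in gradients[:-1]:
        a = project(a - eta * grad(a))   # gradient step, then project back onto A
        plays.append(a.copy())
    return plays

def project_unit_ball(x):
    """Projection onto the unit Euclidean ball {a : ||a|| <= 1}."""
    norm = np.linalg.norm(x)
    return x if norm <= 1.0 else x / norm
```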

60 Online Convex Optimization: Gradient Method. Theorem: For $G = \max_t \|\nabla \ell_t(a_t)\|$ and $D = \mathrm{diam}(A)$, the gradient strategy with $\eta = D/(G\sqrt{n})$ has regret satisfying $\hat L_n - L_n^* \le GD\sqrt{n}$. Example: $A = \{a \in \mathbb{R}^d : \|a\| \le 1\}$, $L = \{a \mapsto v \cdot a : \|v\| \le 1\}$. Then $D = 2$, $G \le 1$, so regret is no more than $2\sqrt{n}$. (And $O(\sqrt{n})$ is optimal.)

61 Online Convex Optimization: Gradient Method. Theorem: For $G = \max_t \|\nabla \ell_t(a_t)\|$ and $D = \mathrm{diam}(A)$, the gradient strategy with $\eta = D/(G\sqrt{n})$ has regret satisfying $\hat L_n - L_n^* \le GD\sqrt{n}$. Example: $A = \Delta_m$, $L = \{a \mapsto v \cdot a : \|v\|_\infty \le 1\}$. Then $D \le 2$ and $G \le \sqrt{m}$, so regret is no more than $2\sqrt{mn}$. Since competing with the whole simplex is equivalent to competing with the vertices (experts) for linear losses, this is worse than exponential weights ($\sqrt{m}$ versus $\sqrt{\log m}$).

62 Online Convex Optimization: Gradient Method. Proof: Define $\tilde a_{t+1} = a_t - \eta \nabla \ell_t(a_t)$ and $a_{t+1} = \Pi_A(\tilde a_{t+1})$. Fix $a \in A$ and consider the measure of progress $\|a_t - a\|$.
$$\|a_{t+1} - a\|^2 \le \|\tilde a_{t+1} - a\|^2 = \|a_t - a\|^2 + \eta^2 \|\nabla \ell_t(a_t)\|^2 - 2\eta\, \nabla \ell_t(a_t) \cdot (a_t - a).$$
By convexity,
$$\sum_{t=1}^n (\ell_t(a_t) - \ell_t(a)) \le \sum_{t=1}^n \nabla \ell_t(a_t) \cdot (a_t - a) \le \frac{\|a_1 - a\|^2 - \|a_{n+1} - a\|^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^n \|\nabla \ell_t(a_t)\|^2.$$

63 Online Convex Optimization 1. Problem formulation 2. Empirical minimization fails. 3. Gradient algorithm. 4. Regularized minimization Bregman divergence Regularized minimization equivalent to minimizing latest loss and divergence from previous decision Constrained minimization equivalent to unconstrained plus Bregman projection Linearization Mirror descent 5. Regret bounds Unconstrained minimization Seeing the future Strong convexity Examples (gradient, exponentiated gradient) Extensions

64 Online Convex Optimization: A Regularization Viewpoint. Suppose $\ell_t$ is linear: $\ell_t(a) = g_t \cdot a$. Suppose $A = \mathbb{R}^d$. Then minimizing the regularized criterion
$$a_{t+1} = \arg\min_{a \in A} \left( \eta \sum_{s=1}^t \ell_s(a) + \frac{1}{2}\|a\|^2 \right)$$
corresponds to the gradient step $a_{t+1} = a_t - \eta \nabla \ell_t(a_t)$.

65 Online Convex Optimization: Regularization. Regularized minimization: Consider the family of strategies of the form
$$a_{t+1} = \arg\min_{a \in A} \left( \eta \sum_{s=1}^t \ell_s(a) + R(a) \right).$$
The regularizer $R : \mathbb{R}^d \to \mathbb{R}$ is strictly convex and differentiable.

66 Online Convex Optimization: Regularization. Regularized minimization:
$$a_{t+1} = \arg\min_{a \in A} \left( \eta \sum_{s=1}^t \ell_s(a) + R(a) \right).$$
$R$ keeps the sequence of $a_t$'s stable: it diminishes $\ell_t$'s influence. We can view the choice of $a_{t+1}$ as trading off two competing forces: making $\ell_t(a_{t+1})$ small, and keeping $a_{t+1}$ close to $a_t$. This is a perspective that motivated many algorithms in the literature. We'll investigate why regularized minimization can be viewed this way.

67 Properties of Regularization Methods. In the unconstrained case ($A = \mathbb{R}^d$), regularized minimization is equivalent to minimizing the latest loss and the distance to the previous decision. The appropriate notion of distance is the Bregman divergence $D_{\Phi_{t-1}}$. Define $\Phi_0 = R$ and $\Phi_t = \Phi_{t-1} + \eta \ell_t$, so that
$$a_{t+1} = \arg\min_{a \in A} \left( \eta \sum_{s=1}^t \ell_s(a) + R(a) \right) = \arg\min_{a \in A} \Phi_t(a).$$

68 Bregman Divergence. Definition: For a strictly convex, differentiable $\Phi : \mathbb{R}^d \to \mathbb{R}$, the Bregman divergence wrt $\Phi$ is defined, for $a, b \in \mathbb{R}^d$, as
$$D_\Phi(a, b) = \Phi(a) - \left( \Phi(b) + \nabla\Phi(b) \cdot (a - b) \right).$$
$D_\Phi(a, b)$ is the difference between $\Phi(a)$ and the value at $a$ of the linear approximation of $\Phi$ about $b$.

69 Bregman Divergence. $D_\Phi(a, b) = \Phi(a) - (\Phi(b) + \nabla\Phi(b) \cdot (a - b))$. Example: For $a \in \mathbb{R}^d$, the squared euclidean norm, $\Phi(a) = \frac{1}{2}\|a\|^2$, has
$$D_\Phi(a, b) = \frac{1}{2}\|a\|^2 - \left( \frac{1}{2}\|b\|^2 + b \cdot (a - b) \right) = \frac{1}{2}\|a - b\|^2,$$
the squared euclidean norm.

70 Bregman Divergence. $D_\Phi(a, b) = \Phi(a) - (\Phi(b) + \nabla\Phi(b) \cdot (a - b))$. Example: For $a \in [0,\infty)^d$, the unnormalized negative entropy, $\Phi(a) = \sum_{i=1}^d a_i (\ln a_i - 1)$, has
$$D_\Phi(a, b) = \sum_i \left( a_i(\ln a_i - 1) - b_i(\ln b_i - 1) - \ln b_i\,(a_i - b_i) \right) = \sum_i \left( a_i \ln \frac{a_i}{b_i} + b_i - a_i \right),$$
the unnormalized KL divergence. Thus, for $a \in \Delta_d$, $\Phi(a) = \sum_i a_i \ln a_i$ has $D_\Phi(a, b) = \sum_i a_i \ln \frac{a_i}{b_i}$.
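A small sketch of the two Bregman divergences in these examples (squared Euclidean norm and unnormalized negative entropy; illustrative helper functions):

```python
import numpy as np

def bregman_squared_norm(a, b):
    """D_Phi for Phi(a) = 0.5 * ||a||^2, which equals 0.5 * ||a - b||^2."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return 0.5 * np.sum((a - b) ** 2)

def bregman_neg_entropy(a, b):
    """D_Phi for Phi(a) = sum_i a_i (ln a_i - 1): the unnormalized KL divergence."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    return np.sum(a * np.log(a / b) + b - a)
```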

71 Bregman Divergence. When the domain of $\Phi$ is $A \subseteq \mathbb{R}^d$, in addition to differentiability and strict convexity, we make two more assumptions: the interior of $A$ is convex, and for a sequence approaching the boundary of $A$, $\|\nabla\Phi(a_n)\| \to \infty$. We say that such a $\Phi$ is a Legendre function.

72 Bregman Divergence. Properties:
1. $D_\Phi \ge 0$, $D_\Phi(a, a) = 0$.
2. $D_{A+B} = D_A + D_B$.
3. The Bregman projection $\Pi^\Phi_A(b) = \arg\min_{a \in A} D_\Phi(a, b)$ is uniquely defined for closed, convex $A$.
4. Generalized Pythagoras: for closed, convex $A$, $b' = \Pi^\Phi_A(b)$, and $a \in A$, $D_\Phi(a, b) \ge D_\Phi(a, b') + D_\Phi(b', b)$.
5. $\nabla_a D_\Phi(a, b) = \nabla\Phi(a) - \nabla\Phi(b)$.
6. For $\ell$ linear, $D_{\Phi + \ell} = D_\Phi$.
7. For $\Phi^*$ the Legendre dual of $\Phi$, $\nabla\Phi^* = (\nabla\Phi)^{-1}$ and $D_\Phi(a, b) = D_{\Phi^*}(\nabla\Phi(b), \nabla\Phi(a))$.

73 Legendre Dual. For a Legendre function $\Phi : A \to \mathbb{R}$, the Legendre dual is
$$\Phi^*(u) = \sup_{v \in A} \left( u \cdot v - \Phi(v) \right).$$
$\Phi^*$ is Legendre. $\mathrm{dom}(\Phi^*) = \nabla\Phi(\mathrm{int}\,\mathrm{dom}\,\Phi)$. $\nabla\Phi^* = (\nabla\Phi)^{-1}$. $D_\Phi(a, b) = D_{\Phi^*}(\nabla\Phi(b), \nabla\Phi(a))$. $\Phi^{**} = \Phi$.

74 Legendre Dual. Example: For $\Phi = \frac{1}{2}\|\cdot\|_p^2$, the Legendre dual is $\Phi^* = \frac{1}{2}\|\cdot\|_q^2$, where $1/p + 1/q = 1$. Example: For $\Phi(a) = \sum_{i=1}^d e^{a_i}$, so $\nabla\Phi(a) = (e^{a_1}, \ldots, e^{a_d})$,
$$(\nabla\Phi)^{-1}(u) = \nabla\Phi^*(u) = (\ln u_1, \ldots, \ln u_d), \qquad \Phi^*(u) = \sum_i u_i(\ln u_i - 1).$$

75 Online Convex Optimization 1. Problem formulation 2. Empirical minimization fails. 3. Gradient algorithm. 4. Regularized minimization Bregman divergence Regularized minimization equivalent to minimizing latest loss and divergence from previous decision Constrained minimization equivalent to unconstrained plus Bregman projection Linearization Mirror descent 5. Regret bounds Unconstrained minimization Seeing the future Strong convexity Examples (gradient, exponentiated gradient) Extensions

76 Properties of Regularization Methods. In the unconstrained case ($A = \mathbb{R}^d$), regularized minimization is equivalent to minimizing the latest loss and the distance (Bregman divergence) to the previous decision. Theorem: Define $\tilde a_1$ via $\nabla R(\tilde a_1) = 0$, and set
$$\tilde a_{t+1} = \arg\min_{a \in \mathbb{R}^d} \left( \eta \ell_t(a) + D_{\Phi_{t-1}}(a, \tilde a_t) \right).$$
Then
$$\tilde a_{t+1} = \arg\min_{a \in \mathbb{R}^d} \left( \eta \sum_{s=1}^t \ell_s(a) + R(a) \right).$$

77 Properties of Regularization Methods. Proof: By the definition of $\Phi_t$,
$$\eta \ell_t(a) + D_{\Phi_{t-1}}(a, \tilde a_t) = \Phi_t(a) - \Phi_{t-1}(a) + D_{\Phi_{t-1}}(a, \tilde a_t).$$
The derivative wrt $a$ is
$$\nabla\Phi_t(a) - \nabla\Phi_{t-1}(a) + \nabla_a D_{\Phi_{t-1}}(a, \tilde a_t) = \nabla\Phi_t(a) - \nabla\Phi_{t-1}(a) + \nabla\Phi_{t-1}(a) - \nabla\Phi_{t-1}(\tilde a_t).$$
Setting this to zero shows that
$$\nabla\Phi_t(\tilde a_{t+1}) = \nabla\Phi_{t-1}(\tilde a_t) = \cdots = \nabla\Phi_0(\tilde a_1) = \nabla R(\tilde a_1) = 0,$$
so $\tilde a_{t+1}$ minimizes $\Phi_t$.

78 Properties of Regularization Methods. Constrained minimization is equivalent to unconstrained minimization, followed by Bregman projection: Theorem: For
$$a_{t+1} = \arg\min_{a \in A} \Phi_t(a), \qquad \tilde a_{t+1} = \arg\min_{a \in \mathbb{R}^d} \Phi_t(a),$$
we have $a_{t+1} = \Pi^{\Phi_t}_A(\tilde a_{t+1})$.

79 Properties of Regularization Methods. Proof: Let $a'_{t+1}$ denote $\Pi^{\Phi_t}_A(\tilde a_{t+1})$. First, by the definition of $a_{t+1}$, $\Phi_t(a_{t+1}) \le \Phi_t(a'_{t+1})$. Conversely, $D_{\Phi_t}(a'_{t+1}, \tilde a_{t+1}) \le D_{\Phi_t}(a_{t+1}, \tilde a_{t+1})$. But $\nabla\Phi_t(\tilde a_{t+1}) = 0$, so $D_{\Phi_t}(a, \tilde a_{t+1}) = \Phi_t(a) - \Phi_t(\tilde a_{t+1})$. Thus, $\Phi_t(a'_{t+1}) \le \Phi_t(a_{t+1})$.

80 Properties of Regularization Methods. Example: For linear $\ell_t$, regularized minimization is equivalent to minimizing the last loss plus the Bregman divergence wrt $R$ to the previous decision:
$$\arg\min_{a \in A} \left( \eta \sum_{s=1}^t \ell_s(a) + R(a) \right) = \Pi^R_A \left( \arg\min_{a \in \mathbb{R}^d} \left( \eta \ell_t(a) + D_R(a, \tilde a_t) \right) \right),$$
because adding a linear function to $\Phi$ does not change $D_\Phi$.

81 Online Convex Optimization 1. Problem formulation 2. Empirical minimization fails. 3. Gradient algorithm. 4. Regularized minimization Bregman divergence Regularized minimization equivalent and Bregman divergence from previous Constrained minimization equivalent to unconstrained plus Bregman projection Linearization Mirror descent 5. Regret bounds Unconstrained minimization Seeing the future Strong convexity Examples (gradient, exponentiated gradient) Extensions

82 Properties of Regularization Methods: Linear Loss. We can replace $\ell_t$ by $\nabla\ell_t(a_t)$, and this leads to an upper bound on regret. Theorem: Any strategy for online linear optimization with regret satisfying
$$\sum_{t=1}^n g_t \cdot a_t - \min_{a \in A} \sum_{t=1}^n g_t \cdot a \le C_n(g_1, \ldots, g_n)$$
can be used to construct a strategy for online convex optimization, with regret
$$\sum_{t=1}^n \ell_t(a_t) - \min_{a \in A} \sum_{t=1}^n \ell_t(a) \le C_n(\nabla\ell_1(a_1), \ldots, \nabla\ell_n(a_n)).$$
Proof: Convexity implies $\ell_t(a_t) - \ell_t(a) \le \nabla\ell_t(a_t) \cdot (a_t - a)$.

83 Properties of Regularization Methods: Linear Loss. Key Point: We can replace $\ell_t$ by the linear loss $a \mapsto \nabla\ell_t(a_t) \cdot a$, and this leads to an upper bound on regret. Thus, we can work with linear $\ell_t$.

84 Regularization Methods: Mirror Descent. Regularized minimization for linear losses can be viewed as mirror descent: taking a gradient step in a dual space. Theorem: The decisions
$$\tilde a_{t+1} = \arg\min_{a \in \mathbb{R}^d} \left( \eta \sum_{s=1}^t g_s \cdot a + R(a) \right)$$
can be written
$$\tilde a_{t+1} = (\nabla R)^{-1}\left( \nabla R(\tilde a_t) - \eta g_t \right).$$
This corresponds to first mapping from $\tilde a_t$ through $\nabla R$, then taking a step in the direction $-g_t$, then mapping back through $(\nabla R)^{-1} = \nabla R^*$ to $\tilde a_{t+1}$.

85 Regularization Methods: Mirror Descent. Proof: For the unconstrained minimization, we have
$$\nabla R(\tilde a_{t+1}) = -\eta \sum_{s=1}^t g_s, \qquad \nabla R(\tilde a_t) = -\eta \sum_{s=1}^{t-1} g_s,$$
so $\nabla R(\tilde a_{t+1}) = \nabla R(\tilde a_t) - \eta g_t$, which can be written $\tilde a_{t+1} = (\nabla R)^{-1}\left( \nabla R(\tilde a_t) - \eta g_t \right)$.
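A minimal sketch of one unconstrained mirror descent step with a pluggable regularizer (the mirror maps `grad_R` and `grad_R_inv` are assumptions supplied by the caller):

```python
def mirror_descent_step(a, g, eta, grad_R, grad_R_inv):
    """One unconstrained mirror descent step.

    a:          current point (e.g. a numpy array).
    g:          gradient of the current linear loss.
    eta:        step size.
    grad_R:     the mirror map, nabla R.
    grad_R_inv: its inverse, (nabla R)^{-1} = nabla R*.
    """
    dual_point = grad_R(a) - eta * g      # gradient step in the dual space
    return grad_R_inv(dual_point)         # map back to the primal space
```

With $R(a) = \frac{1}{2}\|a\|^2$ both maps are the identity and this is a plain gradient step; with the negative entropy regularizer it becomes the exponentiated gradient update discussed below.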

86 Online Convex Optimization 1. Problem formulation 2. Empirical minimization fails. 3. Gradient algorithm. 4. Regularized minimization and Bregman divergences 5. Regret bounds Unconstrained minimization Seeing the future Strong convexity Examples (gradient, exponentiated gradient) Extensions

87 Online Convex Optimization: Regularization. Regularized minimization:
$$a_{t+1} = \arg\min_{a \in A} \left( \eta \sum_{s=1}^t \ell_s(a) + R(a) \right).$$
The regularizer $R : \mathbb{R}^d \to \mathbb{R}$ is strictly convex and differentiable.

88 Regularization Methods: Regret. Theorem: For $A = \mathbb{R}^d$, regularized minimization suffers regret against any $a \in A$ of
$$\sum_{t=1}^n \ell_t(a_t) - \sum_{t=1}^n \ell_t(a) = \frac{D_R(a, a_1) - D_{\Phi_n}(a, a_{n+1})}{\eta} + \frac{1}{\eta} \sum_{t=1}^n D_{\Phi_t}(a_t, a_{t+1}),$$
and thus
$$\hat L_n \le \inf_{a \in \mathbb{R}^d} \left( \sum_{t=1}^n \ell_t(a) + \frac{D_R(a, a_1)}{\eta} \right) + \frac{1}{\eta} \sum_{t=1}^n D_{\Phi_t}(a_t, a_{t+1}).$$
So the sizes of the steps $D_{\Phi_t}(a_t, a_{t+1})$ determine the regret bound.

89 Regularization Methods: Regret. Theorem: For $A = \mathbb{R}^d$, regularized minimization suffers regret
$$\hat L_n \le \inf_{a \in \mathbb{R}^d} \left( \sum_{t=1}^n \ell_t(a) + \frac{D_R(a, a_1)}{\eta} \right) + \frac{1}{\eta} \sum_{t=1}^n D_{\Phi_t}(a_t, a_{t+1}).$$
Notice that we can write
$$D_{\Phi_t}(a_t, a_{t+1}) = D_{\Phi_t^*}(\nabla\Phi_t(a_{t+1}), \nabla\Phi_t(a_t)) = D_{\Phi_t^*}(0, \nabla\Phi_{t-1}(a_t) + \eta\nabla\ell_t(a_t)) = D_{\Phi_t^*}(0, \eta\nabla\ell_t(a_t)).$$
So it is the size of the gradient steps, $D_{\Phi_t^*}(0, \eta\nabla\ell_t(a_t))$, that determines the regret.

90 Regularization Methods: Regret Bounds. Example: Suppose $R = \frac{1}{2}\|\cdot\|^2$. Then we have
$$\hat L_n \le L_n^* + \frac{\|a^* - a_1\|^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^n \|g_t\|^2.$$
And if $\|g_t\| \le G$ and $\|a^* - a_1\| \le D$, choosing $\eta$ appropriately gives $\hat L_n \le L_n^* + DG\sqrt{n}$.

91 Online Convex Optimization 1. Problem formulation 2. Empirical minimization fails. 3. Gradient algorithm. 4. Regularized minimization and Bregman divergences 5. Regret bounds Unconstrained minimization Seeing the future Strong convexity Examples (gradient, exponentiated gradient) Extensions

92 Regularization Methods: Regret Bounds. Seeing the future gives small regret: Theorem: For all $a \in A$,
$$\sum_{t=1}^n \ell_t(a_{t+1}) - \sum_{t=1}^n \ell_t(a) \le \frac{1}{\eta}\left( R(a) - R(a_1) \right).$$

93 Regularization Methods: Regret Bounds. Proof: Since $a_{t+1}$ minimizes $\Phi_t$,
$$\eta \sum_{s=1}^t \ell_s(a) + R(a) \ge \eta \sum_{s=1}^t \ell_s(a_{t+1}) + R(a_{t+1}) = \eta\ell_t(a_{t+1}) + \eta\sum_{s=1}^{t-1}\ell_s(a_{t+1}) + R(a_{t+1}) \ge \eta\ell_t(a_{t+1}) + \eta\sum_{s=1}^{t-1}\ell_s(a_t) + R(a_t) \ge \cdots \ge \eta\sum_{s=1}^t \ell_s(a_{s+1}) + R(a_1).$$

94 Regularization Methods: Regret Bounds. Theorem: For all $a \in A$,
$$\sum_{t=1}^n \ell_t(a_{t+1}) - \sum_{t=1}^n \ell_t(a) \le \frac{1}{\eta}\left( R(a) - R(a_1) \right).$$
Thus, if $a_t$ and $a_{t+1}$ are close, then regret is small: Corollary: For all $a \in A$,
$$\sum_{t=1}^n (\ell_t(a_t) - \ell_t(a)) \le \sum_{t=1}^n (\ell_t(a_t) - \ell_t(a_{t+1})) + \frac{1}{\eta}\left( R(a) - R(a_1) \right).$$
So how can we control the increments $\ell_t(a_t) - \ell_t(a_{t+1})$?

95 Online Convex Optimization 1. Problem formulation 2. Empirical minimization fails. 3. Gradient algorithm. 4. Regularized minimization Bregman divergence Regularized minimization equivalent and Bregman divergence from previous Constrained minimization equivalent to unconstrained plus Bregman projection Linearization Mirror descent 5. Regret bounds Unconstrained minimization Seeing the future Strong convexity Examples (gradient, exponentiated gradient) Extensions

96 Regularization Methods: Regret Bounds. Definition: We say $R$ is strongly convex wrt a norm $\|\cdot\|$ if, for all $a, b$,
$$R(a) \ge R(b) + \nabla R(b) \cdot (a - b) + \frac{1}{2}\|a - b\|^2.$$
For linear losses and strongly convex regularizers, the dual norm of the gradient is small: Theorem: If $R$ is strongly convex wrt a norm $\|\cdot\|$, and $\ell_t(a) = g_t \cdot a$, then
$$\|a_t - a_{t+1}\| \le \eta \|g_t\|_*,$$
where $\|\cdot\|_*$ is the dual norm to $\|\cdot\|$: $\|v\|_* = \sup\{v \cdot a : \|a\| \le 1\}$.

97 Regularization Methods: Regret Bounds. Proof:
$$R(a_t) \ge R(a_{t+1}) + \nabla R(a_{t+1}) \cdot (a_t - a_{t+1}) + \frac{1}{2}\|a_t - a_{t+1}\|^2,$$
$$R(a_{t+1}) \ge R(a_t) + \nabla R(a_t) \cdot (a_{t+1} - a_t) + \frac{1}{2}\|a_t - a_{t+1}\|^2.$$
Combining,
$$\|a_t - a_{t+1}\|^2 \le \left( \nabla R(a_t) - \nabla R(a_{t+1}) \right) \cdot (a_t - a_{t+1}).$$
Hence,
$$\|a_t - a_{t+1}\| \le \|\nabla R(a_t) - \nabla R(a_{t+1})\|_* = \eta\|g_t\|_*.$$

98 Regularization Methods: Regret Bounds. This leads to the regret bound: Corollary: For linear losses, if $R$ is strongly convex wrt $\|\cdot\|$, then for all $a \in A$,
$$\sum_{t=1}^n (\ell_t(a_t) - \ell_t(a)) \le \eta \sum_{t=1}^n \|g_t\|_*^2 + \frac{1}{\eta}\left( R(a) - R(a_1) \right).$$
Thus, for $\|g_t\|_* \le G$ and $R(a) - R(a_1) \le D^2$, choosing $\eta$ appropriately gives regret no more than $2GD\sqrt{n}$.

99 Regularization Methods: Regret Bounds. Example: Consider $R(a) = \frac{1}{2}\|a\|^2$, $a_1 = 0$, and $A$ contained in a Euclidean ball of diameter $D$. Then $R$ is strongly convex wrt $\|\cdot\| = \|\cdot\|_2$, and $\|\cdot\|_* = \|\cdot\|_2$. And the mapping between primal and dual spaces is the identity. So if $\sup_{a \in A} \|\nabla\ell_t(a)\| \le G$, then regret is no more than $2GD\sqrt{n}$.

100 Regularization Methods: Regret Bounds. Example: Consider $A = \Delta_m$, $R(a) = \sum_i a_i \ln a_i$. Then the mapping between primal and dual spaces is $\nabla R(a) = \ln(a)$ (component-wise), and the divergence is the KL divergence, $D_R(a, b) = \sum_i a_i \ln(a_i/b_i)$. And $R$ is strongly convex wrt $\|\cdot\|_1$ (check!). Suppose that $\|g_t\|_\infty \le 1$. Also, $R(a) - R(a_1) \le \ln m$, so the regret is no more than $2\sqrt{n \ln m}$.

101 Regularization Methods: Regret Bounds. Example: $A = \Delta_m$, $R(a) = \sum_i a_i \ln a_i$. What are the updates?
$$a_{t+1} = \Pi^R_A(\tilde a_{t+1}) = \Pi^R_A\left( \nabla R^*(\nabla R(\tilde a_t) - \eta g_t) \right) = \Pi^R_A\left( \nabla R^*(\ln(\tilde a_t e^{-\eta g_t})) \right) = \Pi^R_A\left( \tilde a_t e^{-\eta g_t} \right),$$
where the $\ln$ and $\exp$ functions are applied component-wise. This is exponentiated gradient: mirror descent with $\nabla R = \ln$. It is easy to check that the projection corresponds to normalization, $\Pi^R_A(\tilde a) = \tilde a / \|\tilde a\|_1$.
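A minimal sketch of the exponentiated gradient update on the simplex, one possible instantiation of the above (the gradient sequence is supplied by the caller):

```python
import numpy as np

def exponentiated_gradient(gradients, m, eta):
    """Exponentiated gradient over the m-simplex for linear losses g_t . a.

    gradients: iterable of length-m gradient vectors g_t.
    Returns the sequence of plays a_1, a_2, ...
    """
    a = np.full(m, 1.0 / m)                    # a_1: uniform distribution
    plays = [a.copy()]
    for g in gradients:
        w = a * np.exp(-eta * np.asarray(g))   # component-wise multiplicative update
        a = w / w.sum()                        # Bregman (KL) projection = normalization
        plays.append(a.copy())
    return plays
```

For linear losses this is exactly the exponential weights strategy sketched earlier.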

102 Regularization Methods: Regret Bounds. Notice that when the losses are linear, exponentiated gradient is exactly the exponential weights strategy we discussed for a finite comparison class. Compare $R(a) = \sum_i a_i \ln a_i$ with $R(a) = \frac{1}{2}\|a\|^2$, for $\|g_t\|_\infty \le 1$, $A = \Delta_m$: $O(\sqrt{n \ln m})$ versus $O(\sqrt{mn})$.

103 Online Convex Optimization 1. Problem formulation 2. Empirical minimization fails. 3. Gradient algorithm. 4. Regularized minimization Bregman divergence Regularized minimization equivalent and Bregman divergence from previous Constrained minimization equivalent to unconstrained plus Bregman projection Linearization Mirror descent 5. Regret bounds Unconstrained minimization Strong convexity Examples (gradient, exponentiated gradient) Extensions

104 Regularization Methods: Extensions. Instead of
$$a_{t+1} = \arg\min_{a \in A} \left( \eta\ell_t(a) + D_{\Phi_{t-1}}(a, \tilde a_t) \right),$$
we can use
$$a_{t+1} = \arg\min_{a \in A} \left( \eta\ell_t(a) + D_{\Phi_{t-1}}(a, a_t) \right).$$
And analogous results apply. For instance, this is the approach used by the first gradient method we considered. We can get faster rates with stronger assumptions on the losses.

105 Regularization Methods: Varying η. Theorem: Define
$$a_{t+1} = \arg\min_{a \in \mathbb{R}^d} \left( \sum_{s=1}^t \eta_s \ell_s(a) + R(a) \right).$$
For any $a \in \mathbb{R}^d$,
$$\hat L_n - \sum_{t=1}^n \ell_t(a) \le \sum_{t=1}^n \frac{1}{\eta_t} \left( D_{\Phi_t}(a_t, a_{t+1}) + D_{\Phi_{t-1}}(a, a_t) - D_{\Phi_t}(a, a_{t+1}) \right).$$
If we linearize the $\ell_t$, we have
$$\hat L_n - \sum_{t=1}^n \ell_t(a) \le \sum_{t=1}^n \frac{1}{\eta_t} \left( D_R(a_t, a_{t+1}) + D_R(a, a_t) - D_R(a, a_{t+1}) \right).$$
But what if the $\ell_t$ are strongly convex?

106 Regularization Methods: Strongly Convex Losses. Theorem: If $\ell_t$ is $\sigma$-strongly convex wrt $R$, that is, for all $a, b \in \mathbb{R}^d$,
$$\ell_t(a) \ge \ell_t(b) + \nabla\ell_t(b) \cdot (a - b) + \frac{\sigma}{2} D_R(a, b),$$
then for any $a \in \mathbb{R}^d$, this strategy with $\eta_t = \frac{2}{t\sigma}$ has regret
$$\hat L_n - \sum_{t=1}^n \ell_t(a) \le \sum_{t=1}^n \frac{1}{\eta_t} D_R(a_t, a_{t+1}).$$

107 Strongly Convex Losses: Proof idea.
$$\sum_{t=1}^n (\ell_t(a_t) - \ell_t(a)) \le \sum_{t=1}^n \left( \nabla\ell_t(a_t) \cdot (a_t - a) - \frac{\sigma}{2} D_R(a, a_t) \right) \le \sum_{t=1}^n \frac{1}{\eta_t}\left( D_R(a_t, a_{t+1}) + D_R(a, a_t) - D_R(a, a_{t+1}) \right) - \sum_{t=1}^n \frac{\sigma}{2} D_R(a, a_t)$$
$$\le \sum_{t=1}^n \frac{1}{\eta_t} D_R(a_t, a_{t+1}) + \left( \frac{1}{\eta_1} - \frac{\sigma}{2} \right) D_R(a, a_1) + \sum_{t=2}^n \left( \frac{1}{\eta_t} - \frac{1}{\eta_{t-1}} - \frac{\sigma}{2} \right) D_R(a, a_t).$$
And choosing $\eta_t$ appropriately eliminates the second and third terms.

108 Strongly Convex Losses. Example: For $R(a) = \frac{1}{2}\|a\|^2$, we have
$$\hat L_n - L_n^* \le \frac{1}{2} \sum_{t=1}^n \eta_t \|\nabla\ell_t\|^2 \le \sum_{t=1}^n \frac{G^2}{t\sigma} = O\left( \frac{G^2}{\sigma} \log n \right).$$

109 Strongly Convex Losses. Key Point: When the loss is strongly convex wrt the regularizer, the regret rate can be faster; in the case of quadratic $R$ (and $\ell_t$), it is $O(\log n)$, versus $O(\sqrt{n})$.

110 Course Synopsis A finite comparison class: A = {1,..., m}. Converting online to batch. Online convex optimization. Log loss. Three views of log loss. Normalized maximum likelihood. Sequential investment. Constantly rebalanced portfolios.

111 Log Loss. A family of decision problems with several equivalent interpretations: maximizing long term rate of growth in portfolio optimization; minimizing redundancy in data compression; minimizing likelihood ratio in sequential probability assignment. See Nicolò Cesa-Bianchi and Gábor Lugosi, Prediction, Learning and Games, Chapters 9, 10.

112 Log Loss. Consider a finite outcome space $Y = \{1, \ldots, m\}$. The comparison class $A$ is a set of sequences $f_1, f_2, \ldots$ of maps from $Y^{t-1}$ to distributions on $Y$. We write $f_t(y_t \mid y_1, \ldots, y_{t-1})$, notation that is suggestive of a conditional probability distribution. The adversary chooses, at round $t$, a value $y_t \in Y$, and the loss function for a particular sequence $f$ is
$$\ell_t(f) = -\ln\left( f_t(y_t \mid y_1, \ldots, y_{t-1}) \right).$$

113 Log Loss: Notation.
$$y^n = y_1^n = (y_1, \ldots, y_n), \qquad f^n(y^n) = \prod_{t=1}^n f_t(y_t \mid y^{t-1}), \qquad a^n(y^n) = \prod_{t=1}^n a_t(y_t \mid y^{t-1}).$$
Again, this notation is suggestive of probability distributions. Check: $f^n(y^n) \ge 0$ and $\sum_{y^n \in Y^n} f^n(y^n) = 1$.

114 Log Loss: Three applications Sequential probability assignment. Gambling/investment. Data compression.

115 Log Loss: Sequential Probability Assignment. Think of $y_t$ as the indicator for the event that it rains on day $t$. Minimizing log loss is forecasting $\Pr(y_t \mid y^{t-1})$ sequentially:
$$L_n^* = \inf_{f \in F} \sum_{t=1}^n -\ln f_t(y_t \mid y^{t-1}), \qquad \hat L_n = \sum_{t=1}^n -\ln a_t(y_t \mid y^{t-1}),$$
$$\hat L_n - L_n^* = \sup_{f \in F} \ln \frac{f^n(y^n)}{a^n(y^n)},$$
which is the worst ratio of log likelihoods.

116 Log Loss: Gambling. Suppose we are investing our initial capital $C$ in proportions $a_t(1), \ldots, a_t(m)$ across $m$ horses. If horse $i$ wins, it pays odds $o_t(i) \ge 0$. In that case, our capital becomes $C\,a_t(i)\,o_t(i)$. Let $y_t \in \{1, \ldots, m\}$ denote the winner of race $t$. Suppose that $a_t(y \mid y_1, \ldots, y_{t-1})$ depends on the previous winners. Then our capital goes from $C$ to
$$C \prod_{t=1}^n a_t(y_t \mid y^{t-1})\, o_t(y_t).$$

117 Log Loss: Gambling. Compared to a set $F$ of experts (who also start with capital $C$), the ratio of the best expert's final capital to ours is
$$\sup_{f \in F} \frac{C \prod_{t=1}^n f_t(y_t \mid y^{t-1})\, o_t(y_t)}{C \prod_{t=1}^n a_t(y_t \mid y^{t-1})\, o_t(y_t)} = \sup_{f \in F} \frac{f^n(y^n)}{a^n(y^n)} = \exp\left( \sup_{f \in F} \ln \frac{f^n(y^n)}{a^n(y^n)} \right).$$

118 Log Loss: Data Compression. We can identify probability distributions with codes, and view $-\ln p(y^n)$ as the length (in nats) of an optimal sequentially constructed codeword encoding the sequence $y^n$, under the assumption that $y^n$ is generated by $p$. Then
$$-\ln p_n(y^n) - \inf_{f \in F}\left( -\ln f_n(y^n) \right) = \hat L - L^*$$
is the redundancy (excess length) of the code with respect to a family $F$ of codes.

119 Log Loss: Optimal Prediction. The minimax regret for a class $F$ is
$$V_n(F) = \inf_a \sup_{y^n \in Y^n} \ln \frac{\sup_{f \in F} f^n(y^n)}{a^n(y^n)}.$$
For a class $F$ and $n > 0$, define the normalized maximum likelihood strategy $a^*$ by
$$a^*_n(y^n) = \frac{\sup_{f \in F} f^n(y^n)}{\sum_{x^n \in Y^n} \sup_{f \in F} f^n(x^n)}.$$

120 Log Loss: Optimal Prediction. Theorem:
1. $a^*$ is the unique strategy that satisfies
$$\sup_{y^n \in Y^n} \ln \frac{\sup_{f \in F} f^n(y^n)}{a^*_n(y^n)} = V_n(F).$$
2. For all $y^n \in Y^n$,
$$\ln \frac{\sup_{f \in F} f^n(y^n)}{a^*_n(y^n)} = \ln \sum_{x^n \in Y^n} \sup_{f \in F} f^n(x^n).$$

121 Log Loss: Optimal Prediction. Proof: 2. By the definition of $a^*_n$,
$$\ln \frac{\sup_{f \in F} f^n(y^n)}{a^*_n(y^n)} = \ln \sum_{x^n \in Y^n} \sup_{f \in F} f^n(x^n).$$
1. For any other $a$, there must be a $y^n \in Y^n$ with $a_n(y^n) < a^*_n(y^n)$. Then
$$\ln \frac{\sup_{f \in F} f^n(y^n)}{a_n(y^n)} > \ln \frac{\sup_{f \in F} f^n(y^n)}{a^*_n(y^n)},$$
which implies the sup over $y^n$ is bigger than its value for $a^*$.

122 Log Loss: Optimal Prediction. How do we compute the normalized maximum likelihood strategy?
$$a^*_n(y^n) = \frac{\sup_{f \in F} f^n(y^n)}{\sum_{x^n \in Y^n} \sup_{f \in F} f^n(x^n)}.$$
This $a^*_n$ is a probability distribution on $Y^n$. We can calculate it sequentially via
$$a^*_t(y_t \mid y^{t-1}) = \frac{a^*_t(y^t)}{a^*_{t-1}(y^{t-1})}, \qquad \text{where } a^*_t(y^t) = \sum_{y^n_{t+1} \in Y^{n-t}} a^*_n(y^n).$$

123 Log Loss: Optimal Prediction. In general, these are big sums. The normalized maximum likelihood strategy does not exist if we cannot sum $\sup_{f \in F} f^n(x^n)$ over $x^n \in Y^n$. We need to know the horizon $n$: it is not possible to extend the strategy for $n-1$ to the strategy for $n$. In many cases, there are efficient strategies that approximate the performance of the optimal (normalized maximum likelihood) strategy.

124 Log Loss: Minimax Regret. Example: Suppose $|F| = N$. Then we have
$$V_n(F) = \ln \sum_{y^n \in Y^n} \sup_{f \in F} f^n(y^n) \le \ln \sum_{y^n \in Y^n} \sum_{f \in F} f^n(y^n) = \ln \sum_{f \in F} \sum_{y^n \in Y^n} f^n(y^n) = \ln N.$$

125 Log Loss: Minimax Regret. Example: Consider the class $F$ of all constant experts: $f_t(y \mid y^{t-1}) = f(y)$. For $|Y| = 2$,
$$V_n(F) = \frac{1}{2}\ln n + \frac{1}{2}\ln\frac{\pi}{2} + o(1).$$

126 Minimax Regret: Proof Idea.
$$V_n(F) = \ln \sum_{y^n \in Y^n} \sup_{f \in F} f^n(y^n).$$
Suppose that $f(1) = q$, $f(0) = 1 - q$. Clearly, $f^n(y^n)$ depends only on the number $n_1$ of 1s in $y^n$, and it's easy to check that the maximizing value of $q$ is $n_1/n$, so
$$\sup_{f \in F} f^n(y^n) = \max_q (1-q)^{n - n_1} q^{n_1} = \left( \frac{n - n_1}{n} \right)^{n - n_1} \left( \frac{n_1}{n} \right)^{n_1}.$$
Thus (using Stirling's approximation),
$$V_n(F) = \ln \sum_{n_1 = 0}^n \binom{n}{n_1} \left( \frac{n - n_1}{n} \right)^{n - n_1} \left( \frac{n_1}{n} \right)^{n_1} = \ln\left( \sqrt{\frac{n\pi}{2}}\,(1 + o(1)) \right).$$

127 Log Loss Three views of log loss. Normalized maximum likelihood. Sequential investment. Constantly rebalanced portfolios.

128 Sequential Investment. Suppose that we have $m$ financial instruments (let's call them $1, 2, \ldots, m$), and at each period we need to choose how to spread our capital. We invest a proportion $p_i$ in instrument $i$ (with $p_i \ge 0$ and $\sum_i p_i = 1$). During the period, the value of instrument $i$ increases by a factor of $x_i \ge 0$ and so our wealth increases by a factor of
$$p \cdot x = \sum_{i=1}^m p_i x_i.$$
For instance, $x_1 = 1$ and $x_2 \in \{0, 2\}$ corresponds to a choice between doing nothing and placing a fair bet at even odds.

129 Asymptotic growth rate optimality of logarithmic utility. Logarithmic utility has the attractive property that, if the vectors of market returns $X_1, X_2, \ldots, X_t, \ldots$ are random, then maximizing expected log wealth leads to the optimal asymptotic growth rate. We'll illustrate with a simple example, and then state a general result. Suppose that we are betting on two instruments many times. Their one-period returns (that is, the ratio of the instrument's value after period $t$ to that before period $t$) satisfy
$$\Pr(X_{t,1} = 1) = 1, \qquad \Pr(X_{t,2} = 0) = p, \qquad \Pr(X_{t,2} = 2) = 1 - p.$$
Clearly, one is risk free, and the other has two possible outcomes: complete loss of the investment, and doubling of the investment.

130 Asymptotic growth rate optimality of logarithmic utility. For instance, suppose that we start with wealth at $t = 0$ of $V_0 > 0$, and $0 < p < 1$. If we bet all of our money on instrument 2 at each step, then after $T$ rounds we end up with expected wealth of $\mathbb{E} V_T = (2(1-p))^T V_0$, and this is the maximum value of expected wealth over all strategies. But with probability one, we will eventually have wealth zero if we follow this strategy. What should we do?

131 Asymptotic growth rate optimality of logarithmic utility. Suppose that, for period $t$, we bet a fraction $b_t$ of our wealth on instrument 2. Then if we define $W_t = 1[X_{t,2} = 2]$ (that is, we win the bet), we have
$$V_{t+1} = (1 + b_t)^{W_t} (1 - b_t)^{1 - W_t} V_t.$$
Consider the asymptotic growth rate of wealth,
$$G = \lim_{T \to \infty} \frac{1}{T} \log_2 \frac{V_T}{V_0}.$$
(This extracts the exponent.)

132 Asymptotic growth rate optimality of logarithmic utility. By the weak law of large numbers, we have
$$G = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^T \left( W_t \log_2(1 + b_t) + (1 - W_t)\log_2(1 - b_t) \right) = \lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^T \left( (1-p)\log_2(1 + b_t) + p\log_2(1 - b_t) \right).$$
For what values of $b_t$ is this maximized? The concavity of $\log_2$, together with Jensen's inequality, implies that, for all $x_i \ge 0$ with $\sum_i x_i = x$,
$$\max \sum_i x_i \log y_i \quad \text{s.t.} \quad \sum_i y_i = y$$
has the solution $y_i = x_i y / x$. Thus, we should set $b_t = 1 - 2p$.
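A quick numerical check of this optimal fraction (illustrative; a simple grid search over the betting fraction $b$):

```python
import numpy as np

def kelly_fraction(p, grid=10001):
    """Numerically maximize (1-p)*log2(1+b) + p*log2(1-b) over b in [0, 1)."""
    b = np.linspace(0.0, 0.999999, grid)
    growth = (1 - p) * np.log2(1 + b) + p * np.log2(1 - b)
    return b[np.argmax(growth)]

# kelly_fraction(0.2) is approximately 1 - 2*0.2 = 0.6, as the argument predicts.
```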

133 Asymptotic growth rate optimality of logarithmic utility. That is, if we choose the proportion $b_t$ to allocate to each instrument so as to maximize the expected log return,
$$(1-p)\log_2(1 + b_t) + p\log_2(1 - b_t),$$
then we obtain the optimal exponent in the asymptotic growth rate, which is
$$G^* = (1-p)\log_2(2(1-p)) + p\log_2(2p).$$
Notice that if $p$ is strictly less than $1/2$, $G^* > 0$; that is, we have exponential growth. Compare this with the two individual alternatives: choosing instrument 1 gives no growth, whereas choosing instrument 2 gives expected wealth that grows exponentially, but it leads to ruin, almost surely.

134 Asymptotic growth rate optimality of logarithmic utility. This result was first pointed out by Kelly [5]. Kelly viewed $p$ as the probability that a one-bit message containing the future outcome $X_t$ was transmitted through a communication channel incorrectly, and then the optimal exponent $G^*$ is equal to the channel capacity,
$$G^* = 1 - \left( (1-p)\log_2\frac{1}{1-p} + p\log_2\frac{1}{p} \right).$$

135 Asymptotic growth rate optimality of logarithmic utility. Maximizing expected log return is asymptotically optimal much more generally. To state the general result, suppose that, in period $t$, we need to distribute our wealth over $m$ instruments. We allocate proportion $b_{t,i}$ to the $i$th, and assume that $b_t \in \Delta_m$, the $m$-simplex. Then, if the period $t$ returns are $X_{t,1}, \ldots, X_{t,m} \ge 0$, the yield per dollar invested is $b_t \cdot X_t$, so that our capital $V_t$ becomes $V_{t+1} = V_t\, b_t \cdot X_t$. By a strategy, we mean a sequence of functions $\{b_t\}$ which, at time $t$, uses the allocation $b_t(X_1, \ldots, X_{t-1}) \in \Delta_m$.

136 Asymptotic growth rate optimality of logarithmic utility. Definition: If $X_t \in \mathbb{R}^m_+$ denotes the random returns of $m$ instruments during period $t$, we say that a strategy $b^*$ is log-optimal if
$$b^*_t(X_0, \ldots, X_{t-1}) = \arg\max_{b \in \Delta_m} \mathbb{E}\left[ \log(b \cdot X_t) \mid X_0, \ldots, X_{t-1} \right].$$

137 Asymptotic growth rate optimality of logarithmic utility. Breiman [3] proved the following result for i.i.d. discrete-valued returns; Algoet and Cover [1] proved the general case. Theorem: Suppose that the log-optimal strategy $b^*$ has capital growth $V_0, V_1^*, \ldots, V_T^*$ over $T$ periods and some strategy $b$ has capital growth $V_0, V_1, \ldots, V_T$. Then almost surely
$$\limsup_{T \to \infty} \frac{1}{T} \log \frac{V_T}{V_T^*} \le 0.$$
In particular, if the returns are i.i.d., then in each period the optimal strategy (at least, optimal to first order in the exponent) allocates its capital according to some fixed mixture $b \in \Delta_m$. This mixture is the one that maximizes the expected logarithm of the one-period yield.

138 Asymptotic growth rate optimality of logarithmic utility This is an appealing property: if we are interested in what happens asymptotically, then we should use log as a utility function, and maximize the expected log return during each period.

139 Constantly rebalanced portfolios. A constantly rebalanced portfolio (CRP) is an investment strategy defined by a mixture vector $b \in \Delta_m$. At every time step, it allocates proportion $b_j$ of the total capital to instrument $j$. We have seen that, for i.i.d. returns, the asymptotic growth rate is maximized by a particular CRP. The Dow Jones Industrial Average measures the performance of another CRP (the one that allocates one thirtieth of its capital to each of thirty stocks). Investing in a single stock is another special case of a CRP. (As an illustration of the benefits provided by rebalancing, consider an i.i.d. market with two instruments and return vectors chosen uniformly from $\{(2, 1/2), (1/2, 2)\}$. Investing in any single instrument leads to a growth rate of 0, whereas a $(1/2, 1/2)$ CRP will have wealth that increases by a factor of $5/4$ in each period.)

140 Constantly rebalanced portfolios. Now that we've motivated CRPs, we'll drop all probabilistic assumptions and move back to an online setting. Suppose that the market is adversarial (a reasonable assumption), and consider the problem of competing with the best CRP in hindsight. That is, at each step $t$ we must choose an allocation of our capital $b_t$ so that, after $T$ rounds, the logarithm of our wealth is close to that of the best CRP.

141 Constantly rebalanced portfolios. The following theorem is due to Cover [4] (the proof we give is due to Blum and Kalai [2]). It shows that there is a universal portfolio strategy, that is, one that competes with the best CRP. Theorem: There is a strategy (call it $b^U$) for which
$$\log(V_T) \ge \log(V_T(b^*)) - (m-1)\log(T+1) - 1,$$
where $b^*$ is the best CRP. The strategy is conceptually very simple: it distributes the initial capital uniformly across all CRPs and lets each of them run.

142 Constantly rebalanced portfolios. Consider competing with the $m$ single instrument portfolios. We could just place our money uniformly across the $m$ instruments at the start, and leave it there. Then we have
$$\log(V_T) = \log\left( \sum_{j=1}^m \prod_{t=1}^T X_{t,j}\,(V_0/m) \right) \ge \max_j \log\left( \prod_{t=1}^T X_{t,j}\,(V_0/m) \right) = \max_j \log\left( \prod_{t=1}^T X_{t,j}\,V_0 \right) - \log(m),$$
that is, our regret with respect to the best single instrument portfolio (in hindsight) is no more than $\log m$.

143 Constantly rebalanced portfolios. To compete with the set of CRPs, we adopt a similar strategy: we allocate our capital uniformly over $\Delta_m$, and then calculate the mixture $b_t$ that corresponds at time $t$ to this initial distribution. Consider an infinitesimal region around a point $b \in \Delta_m$. If $\mu$ is the uniform measure on $\Delta_m$, the initial investment in CRP $b$ is $d\mu(b)\,V_0$. By time $t-1$, this has grown to $V_{t-1}(b)\,d\mu(b)\,V_0$, and so this is the contribution to the overall mixture $b_t$. And of course we need to appropriately normalize (by the total capital at time $t-1$):
$$b_t = \frac{\int_{\Delta_m} b\,V_{t-1}(b)\,d\mu(b)}{\int_{\Delta_m} V_{t-1}(b)\,d\mu(b)}.$$
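A minimal Monte Carlo sketch of this update: sample CRPs uniformly from the simplex (a Dirichlet(1,...,1) draw) and weight each by its wealth so far. This is an illustrative approximation; the strategy above is the exact integral.

```python
import numpy as np

def universal_portfolio_weights(past_returns, m, num_samples=100000, rng=None):
    """Approximate b_t = (int b V_{t-1}(b) dmu) / (int V_{t-1}(b) dmu) by sampling.

    past_returns: array-like of shape (t-1, m) with return vectors x_1, ..., x_{t-1}.
    """
    rng = np.random.default_rng(rng)
    past_returns = np.asarray(past_returns, dtype=float).reshape(-1, m)
    # Uniform samples from the m-simplex = Dirichlet(1, ..., 1).
    b = rng.dirichlet(np.ones(m), size=num_samples)      # (num_samples, m)
    # Per-dollar wealth of each sampled CRP on the observed return sequence.
    wealth = np.prod(past_returns @ b.T, axis=0)          # (num_samples,)
    return (wealth[:, None] * b).sum(axis=0) / wealth.sum()
```

With no past returns the weights reduce to (approximately) the center of the simplex, i.e. the uniform portfolio.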

144 Constantly rebalanced portfolios. How does this strategy perform? Suppose that $b^*$ is the best CRP in hindsight. Then the region around $b^*$ contains very similar mixtures, and provided that there is enough volume of sufficiently similar CRPs, our strategy should be able to compete with $b^*$. Indeed, consider the set of mixtures of $b^*$ with some other vector $a \in \Delta_m$,
$$S = \{(1-\epsilon)b^* + \epsilon a : a \in \Delta_m\}.$$
For every $b \in S$, we have
$$\frac{V_1(b)}{V_0} = \frac{V_1((1-\epsilon)b^* + \epsilon a)}{V_0} \ge (1-\epsilon)\frac{V_1(b^*)}{V_0}.$$
Thus, after $T$ steps, $V_T(b) \ge V_T(b^*)(1-\epsilon)^T$.

145 Constantly rebalanced portfolios. Also, the proportion of initial wealth allocated to CRPs in $S$ is
$$\mu(S) = \mu(\{\epsilon a : a \in \Delta_m\}) = \epsilon^{m-1}.$$
Combining these two facts, we have that
$$\log\left( \frac{V_T(b^U)}{V_T(b^*)} \right) \ge \log\left( (1-\epsilon)^T \epsilon^{m-1} \right).$$
Setting $\epsilon = 1/(T+1)$ gives a regret of
$$-\log\left( (1 - 1/(T+1))^T (T+1)^{-(m-1)} \right) < 1 + (m-1)\log(T+1).$$

146 Constantly rebalanced portfolios. There are other approaches to portfolio optimization based on the online prediction strategies that we have seen earlier. For instance, the exponential weights algorithm can be used in this setting, although it leads to $\sqrt{T}$ regret, rather than $\log T$. Gradient descent approaches have also been investigated. For a Newton update method, logarithmic regret bounds have been proved.

147
[1] Paul H. Algoet and Thomas M. Cover. Asymptotic optimality and asymptotic equipartition properties of log-optimum investment. The Annals of Probability, 16(2):876-898, 1988.
[2] Avrim Blum and Adam Kalai. Universal portfolios with and without transaction costs. Machine Learning, 35:193-205, 1999.
[3] Leo Breiman. Optimal gambling systems for favorable games. In Proc. Fourth Berkeley Symp. Math. Statist. Probab., volume 1, pages 65-78. Univ. California Press, 1961.
[4] Thomas M. Cover. Universal portfolios. Mathematical Finance, 1(1):1-29, 1991.
[5] J. L. Kelly, Jr. A new interpretation of information rate. J. Oper. Res. Soc., 57, 1956.

148 Course Synopsis A finite comparison class: A = {1,..., m}. Converting online to batch. Online convex optimization. Log loss.


Stat 260/CS Learning in Sequential Decision Problems. Peter Bartlett Stat 260/CS 294-102. Learning in Sequential Decision Problems. Peter Bartlett 1. Adversarial bandits Definition: sequential game. Lower bounds on regret from the stochastic case. Exp3: exponential weights

More information

Stat 260/CS Learning in Sequential Decision Problems. Peter Bartlett

Stat 260/CS Learning in Sequential Decision Problems. Peter Bartlett Stat 260/CS 294-102. Learning in Sequential Decision Problems. Peter Bartlett 1. Multi-armed bandit algorithms. Concentration inequalities. P(X ǫ) exp( ψ (ǫ))). Cumulant generating function bounds. Hoeffding

More information

Exponential Weights on the Hypercube in Polynomial Time

Exponential Weights on the Hypercube in Polynomial Time European Workshop on Reinforcement Learning 14 (2018) October 2018, Lille, France. Exponential Weights on the Hypercube in Polynomial Time College of Information and Computer Sciences University of Massachusetts

More information

Fast Rates for Estimation Error and Oracle Inequalities for Model Selection

Fast Rates for Estimation Error and Oracle Inequalities for Model Selection Fast Rates for Estimation Error and Oracle Inequalities for Model Selection Peter L. Bartlett Computer Science Division and Department of Statistics University of California, Berkeley bartlett@cs.berkeley.edu

More information

Theory and Applications of A Repeated Game Playing Algorithm. Rob Schapire Princeton University [currently visiting Yahoo!

Theory and Applications of A Repeated Game Playing Algorithm. Rob Schapire Princeton University [currently visiting Yahoo! Theory and Applications of A Repeated Game Playing Algorithm Rob Schapire Princeton University [currently visiting Yahoo! Research] Learning Is (Often) Just a Game some learning problems: learn from training

More information

The Online Approach to Machine Learning

The Online Approach to Machine Learning The Online Approach to Machine Learning Nicolò Cesa-Bianchi Università degli Studi di Milano N. Cesa-Bianchi (UNIMI) Online Approach to ML 1 / 53 Summary 1 My beautiful regret 2 A supposedly fun game I

More information

Ad Placement Strategies

Ad Placement Strategies Case Study : Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD AdaGrad Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 7 th, 04 Ad

More information

Logarithmic Regret Algorithms for Strongly Convex Repeated Games

Logarithmic Regret Algorithms for Strongly Convex Repeated Games Logarithmic Regret Algorithms for Strongly Convex Repeated Games Shai Shalev-Shwartz 1 and Yoram Singer 1,2 1 School of Computer Sci & Eng, The Hebrew University, Jerusalem 91904, Israel 2 Google Inc 1600

More information

CS261: A Second Course in Algorithms Lecture #12: Applications of Multiplicative Weights to Games and Linear Programs

CS261: A Second Course in Algorithms Lecture #12: Applications of Multiplicative Weights to Games and Linear Programs CS26: A Second Course in Algorithms Lecture #2: Applications of Multiplicative Weights to Games and Linear Programs Tim Roughgarden February, 206 Extensions of the Multiplicative Weights Guarantee Last

More information

Learning theory. Ensemble methods. Boosting. Boosting: history

Learning theory. Ensemble methods. Boosting. Boosting: history Learning theory Probability distribution P over X {0, 1}; let (X, Y ) P. We get S := {(x i, y i )} n i=1, an iid sample from P. Ensemble methods Goal: Fix ɛ, δ (0, 1). With probability at least 1 δ (over

More information

A survey: The convex optimization approach to regret minimization

A survey: The convex optimization approach to regret minimization A survey: The convex optimization approach to regret minimization Elad Hazan September 10, 2009 WORKING DRAFT Abstract A well studied and general setting for prediction and decision making is regret minimization

More information

Machine Learning Lecture 7

Machine Learning Lecture 7 Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant

More information

Meta Algorithms for Portfolio Selection. Technical Report

Meta Algorithms for Portfolio Selection. Technical Report Meta Algorithms for Portfolio Selection Technical Report Department of Computer Science and Engineering University of Minnesota 4-192 EECS Building 200 Union Street SE Minneapolis, MN 55455-0159 USA TR

More information

Multi-armed bandit models: a tutorial

Multi-armed bandit models: a tutorial Multi-armed bandit models: a tutorial CERMICS seminar, March 30th, 2016 Multi-Armed Bandit model: general setting K arms: for a {1,..., K}, (X a,t ) t N is a stochastic process. (unknown distributions)

More information

Algorithms, Games, and Networks January 17, Lecture 2

Algorithms, Games, and Networks January 17, Lecture 2 Algorithms, Games, and Networks January 17, 2013 Lecturer: Avrim Blum Lecture 2 Scribe: Aleksandr Kazachkov 1 Readings for today s lecture Today s topic is online learning, regret minimization, and minimax

More information

Agnostic Online learnability

Agnostic Online learnability Technical Report TTIC-TR-2008-2 October 2008 Agnostic Online learnability Shai Shalev-Shwartz Toyota Technological Institute Chicago shai@tti-c.org ABSTRACT We study a fundamental question. What classes

More information

Peter Hoff Minimax estimation November 12, Motivation and definition. 2 Least favorable prior 3. 3 Least favorable prior sequence 11

Peter Hoff Minimax estimation November 12, Motivation and definition. 2 Least favorable prior 3. 3 Least favorable prior sequence 11 Contents 1 Motivation and definition 1 2 Least favorable prior 3 3 Least favorable prior sequence 11 4 Nonparametric problems 15 5 Minimax and admissibility 18 6 Superefficiency and sparsity 19 Most of

More information

On the Generalization Ability of Online Strongly Convex Programming Algorithms

On the Generalization Ability of Online Strongly Convex Programming Algorithms On the Generalization Ability of Online Strongly Convex Programming Algorithms Sham M. Kakade I Chicago Chicago, IL 60637 sham@tti-c.org Ambuj ewari I Chicago Chicago, IL 60637 tewari@tti-c.org Abstract

More information

Bandit models: a tutorial

Bandit models: a tutorial Gdt COS, December 3rd, 2015 Multi-Armed Bandit model: general setting K arms: for a {1,..., K}, (X a,t ) t N is a stochastic process. (unknown distributions) Bandit game: a each round t, an agent chooses

More information

Generalization theory

Generalization theory Generalization theory Daniel Hsu Columbia TRIPODS Bootcamp 1 Motivation 2 Support vector machines X = R d, Y = { 1, +1}. Return solution ŵ R d to following optimization problem: λ min w R d 2 w 2 2 + 1

More information

Online Convex Optimization

Online Convex Optimization Advanced Course in Machine Learning Spring 2010 Online Convex Optimization Handouts are jointly prepared by Shie Mannor and Shai Shalev-Shwartz A convex repeated game is a two players game that is performed

More information

Foundations of Machine Learning On-Line Learning. Mehryar Mohri Courant Institute and Google Research

Foundations of Machine Learning On-Line Learning. Mehryar Mohri Courant Institute and Google Research Foundations of Machine Learning On-Line Learning Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Motivation PAC learning: distribution fixed over time (training and test). IID assumption.

More information

Empirical Risk Minimization

Empirical Risk Minimization Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space

More information

Gaussian and Linear Discriminant Analysis; Multiclass Classification

Gaussian and Linear Discriminant Analysis; Multiclass Classification Gaussian and Linear Discriminant Analysis; Multiclass Classification Professor Ameet Talwalkar Slide Credit: Professor Fei Sha Professor Ameet Talwalkar CS260 Machine Learning Algorithms October 13, 2015

More information

A New Interpretation of Information Rate

A New Interpretation of Information Rate A New Interpretation of Information Rate reproduced with permission of AT&T By J. L. Kelly, jr. (Manuscript received March 2, 956) If the input symbols to a communication channel represent the outcomes

More information

Yevgeny Seldin. University of Copenhagen

Yevgeny Seldin. University of Copenhagen Yevgeny Seldin University of Copenhagen Classical (Batch) Machine Learning Collect Data Data Assumption The samples are independent identically distributed (i.i.d.) Machine Learning Prediction rule New

More information

Optimization, Learning, and Games with Predictable Sequences

Optimization, Learning, and Games with Predictable Sequences Optimization, Learning, and Games with Predictable Sequences Alexander Rakhlin University of Pennsylvania Karthik Sridharan University of Pennsylvania Abstract We provide several applications of Optimistic

More information

Max Margin-Classifier

Max Margin-Classifier Max Margin-Classifier Oliver Schulte - CMPT 726 Bishop PRML Ch. 7 Outline Maximum Margin Criterion Math Maximizing the Margin Non-Separable Data Kernels and Non-linear Mappings Where does the maximization

More information

Alireza Shafaei. Machine Learning Reading Group The University of British Columbia Summer 2017

Alireza Shafaei. Machine Learning Reading Group The University of British Columbia Summer 2017 s s Machine Learning Reading Group The University of British Columbia Summer 2017 (OCO) Convex 1/29 Outline (OCO) Convex Stochastic Bernoulli s (OCO) Convex 2/29 At each iteration t, the player chooses

More information

LEM. New Results on Betting Strategies, Market Selection, and the Role of Luck. Giulio Bottazzi Daniele Giachini

LEM. New Results on Betting Strategies, Market Selection, and the Role of Luck. Giulio Bottazzi Daniele Giachini LEM WORKING PAPER SERIES New Results on Betting Strategies, Market Selection, and the Role of Luck Giulio Bottazzi Daniele Giachini Institute of Economics, Scuola Sant'Anna, Pisa, Italy 2018/08 February

More information

Worst-Case Bounds for Gaussian Process Models

Worst-Case Bounds for Gaussian Process Models Worst-Case Bounds for Gaussian Process Models Sham M. Kakade University of Pennsylvania Matthias W. Seeger UC Berkeley Abstract Dean P. Foster University of Pennsylvania We present a competitive analysis

More information

Online Passive-Aggressive Algorithms

Online Passive-Aggressive Algorithms Online Passive-Aggressive Algorithms Koby Crammer Ofer Dekel Shai Shalev-Shwartz Yoram Singer School of Computer Science & Engineering The Hebrew University, Jerusalem 91904, Israel {kobics,oferd,shais,singer}@cs.huji.ac.il

More information

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44 Convex Optimization Newton s method ENSAE: Optimisation 1/44 Unconstrained minimization minimize f(x) f convex, twice continuously differentiable (hence dom f open) we assume optimal value p = inf x f(x)

More information

An Algorithms-based Intro to Machine Learning

An Algorithms-based Intro to Machine Learning CMU 15451 lecture 12/08/11 An Algorithmsbased Intro to Machine Learning Plan for today Machine Learning intro: models and basic issues An interesting algorithm for combining expert advice Avrim Blum [Based

More information

Sequential prediction with coded side information under logarithmic loss

Sequential prediction with coded side information under logarithmic loss under logarithmic loss Yanina Shkel Department of Electrical Engineering Princeton University Princeton, NJ 08544, USA Maxim Raginsky Department of Electrical and Computer Engineering Coordinated Science

More information

Optimal and Adaptive Online Learning

Optimal and Adaptive Online Learning Optimal and Adaptive Online Learning Haipeng Luo Advisor: Robert Schapire Computer Science Department Princeton University Examples of Online Learning (a) Spam detection 2 / 34 Examples of Online Learning

More information

THE WEIGHTED MAJORITY ALGORITHM

THE WEIGHTED MAJORITY ALGORITHM THE WEIGHTED MAJORITY ALGORITHM Csaba Szepesvári University of Alberta CMPUT 654 E-mail: szepesva@ualberta.ca UofA, October 3, 2006 OUTLINE 1 PREDICTION WITH EXPERT ADVICE 2 HALVING: FIND THE PERFECT EXPERT!

More information

Worst-Case Analysis of the Perceptron and Exponentiated Update Algorithms

Worst-Case Analysis of the Perceptron and Exponentiated Update Algorithms Worst-Case Analysis of the Perceptron and Exponentiated Update Algorithms Tom Bylander Division of Computer Science The University of Texas at San Antonio San Antonio, Texas 7849 bylander@cs.utsa.edu April

More information

Regret bounded by gradual variation for online convex optimization

Regret bounded by gradual variation for online convex optimization Noname manuscript No. will be inserted by the editor Regret bounded by gradual variation for online convex optimization Tianbao Yang Mehrdad Mahdavi Rong Jin Shenghuo Zhu Received: date / Accepted: date

More information

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Charles Elkan elkan@cs.ucsd.edu January 17, 2013 1 Principle of maximum likelihood Consider a family of probability distributions

More information

Internal Regret in On-line Portfolio Selection

Internal Regret in On-line Portfolio Selection Internal Regret in On-line Portfolio Selection Gilles Stoltz (gilles.stoltz@ens.fr) Département de Mathématiques et Applications, Ecole Normale Supérieure, 75005 Paris, France Gábor Lugosi (lugosi@upf.es)

More information

From Bandits to Experts: A Tale of Domination and Independence

From Bandits to Experts: A Tale of Domination and Independence From Bandits to Experts: A Tale of Domination and Independence Nicolò Cesa-Bianchi Università degli Studi di Milano N. Cesa-Bianchi (UNIMI) Domination and Independence 1 / 1 From Bandits to Experts: A

More information

Overfitting, Bias / Variance Analysis

Overfitting, Bias / Variance Analysis Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic

More information

Peter Hoff Minimax estimation October 31, Motivation and definition. 2 Least favorable prior 3. 3 Least favorable prior sequence 11

Peter Hoff Minimax estimation October 31, Motivation and definition. 2 Least favorable prior 3. 3 Least favorable prior sequence 11 Contents 1 Motivation and definition 1 2 Least favorable prior 3 3 Least favorable prior sequence 11 4 Nonparametric problems 15 5 Minimax and admissibility 18 6 Superefficiency and sparsity 19 Most of

More information

OLSO. Online Learning and Stochastic Optimization. Yoram Singer August 10, Google Research

OLSO. Online Learning and Stochastic Optimization. Yoram Singer August 10, Google Research OLSO Online Learning and Stochastic Optimization Yoram Singer August 10, 2016 Google Research References Introduction to Online Convex Optimization, Elad Hazan, Princeton University Online Learning and

More information

Online prediction with expert advise

Online prediction with expert advise Online prediction with expert advise Jyrki Kivinen Australian National University http://axiom.anu.edu.au/~kivinen Contents 1. Online prediction: introductory example, basic setting 2. Classification with

More information

CS 188: Artificial Intelligence Spring Today

CS 188: Artificial Intelligence Spring Today CS 188: Artificial Intelligence Spring 2006 Lecture 9: Naïve Bayes 2/14/2006 Dan Klein UC Berkeley Many slides from either Stuart Russell or Andrew Moore Bayes rule Today Expectations and utilities Naïve

More information

Tutorial: PART 2. Online Convex Optimization, A Game- Theoretic Approach to Learning

Tutorial: PART 2. Online Convex Optimization, A Game- Theoretic Approach to Learning Tutorial: PART 2 Online Convex Optimization, A Game- Theoretic Approach to Learning Elad Hazan Princeton University Satyen Kale Yahoo Research Exploiting curvature: logarithmic regret Logarithmic regret

More information

Online Prediction: Bayes versus Experts

Online Prediction: Bayes versus Experts Marcus Hutter - 1 - Online Prediction Bayes versus Experts Online Prediction: Bayes versus Experts Marcus Hutter Istituto Dalle Molle di Studi sull Intelligenza Artificiale IDSIA, Galleria 2, CH-6928 Manno-Lugano,

More information

Lecture 4: Lower Bounds (ending); Thompson Sampling

Lecture 4: Lower Bounds (ending); Thompson Sampling CMSC 858G: Bandits, Experts and Games 09/12/16 Lecture 4: Lower Bounds (ending); Thompson Sampling Instructor: Alex Slivkins Scribed by: Guowei Sun,Cheng Jie 1 Lower bounds on regret (ending) Recap from

More information

Geometry and Optimization of Relative Arbitrage

Geometry and Optimization of Relative Arbitrage Geometry and Optimization of Relative Arbitrage Ting-Kam Leonard Wong joint work with Soumik Pal Department of Mathematics, University of Washington Financial/Actuarial Mathematic Seminar, University of

More information

Adaptive Gradient Methods AdaGrad / Adam. Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade

Adaptive Gradient Methods AdaGrad / Adam. Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade Adaptive Gradient Methods AdaGrad / Adam Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade 1 Announcements: HW3 posted Dual coordinate ascent (some review of SGD and random

More information

CS261: Problem Set #3

CS261: Problem Set #3 CS261: Problem Set #3 Due by 11:59 PM on Tuesday, February 23, 2016 Instructions: (1) Form a group of 1-3 students. You should turn in only one write-up for your entire group. (2) Submission instructions:

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

Online Optimization : Competing with Dynamic Comparators

Online Optimization : Competing with Dynamic Comparators Ali Jadbabaie Alexander Rakhlin Shahin Shahrampour Karthik Sridharan University of Pennsylvania University of Pennsylvania University of Pennsylvania Cornell University Abstract Recent literature on online

More information

Robustness and duality of maximum entropy and exponential family distributions

Robustness and duality of maximum entropy and exponential family distributions Chapter 7 Robustness and duality of maximum entropy and exponential family distributions In this lecture, we continue our study of exponential families, but now we investigate their properties in somewhat

More information

The Free Matrix Lunch

The Free Matrix Lunch The Free Matrix Lunch Wouter M. Koolen Wojciech Kot lowski Manfred K. Warmuth Tuesday 24 th April, 2012 Koolen, Kot lowski, Warmuth (RHUL) The Free Matrix Lunch Tuesday 24 th April, 2012 1 / 26 Introduction

More information

Generalization bounds

Generalization bounds Advanced Course in Machine Learning pring 200 Generalization bounds Handouts are jointly prepared by hie Mannor and hai halev-hwartz he problem of characterizing learnability is the most basic question

More information

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Part I. Sébastien Bubeck Theory Group

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Part I. Sébastien Bubeck Theory Group Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Part I Sébastien Bubeck Theory Group i.i.d. multi-armed bandit, Robbins [1952] i.i.d. multi-armed bandit, Robbins [1952] Known

More information

Online Learning. Jordan Boyd-Graber. University of Colorado Boulder LECTURE 21. Slides adapted from Mohri

Online Learning. Jordan Boyd-Graber. University of Colorado Boulder LECTURE 21. Slides adapted from Mohri Online Learning Jordan Boyd-Graber University of Colorado Boulder LECTURE 21 Slides adapted from Mohri Jordan Boyd-Graber Boulder Online Learning 1 of 31 Motivation PAC learning: distribution fixed over

More information

CS264: Beyond Worst-Case Analysis Lecture #20: From Unknown Input Distributions to Instance Optimality

CS264: Beyond Worst-Case Analysis Lecture #20: From Unknown Input Distributions to Instance Optimality CS264: Beyond Worst-Case Analysis Lecture #20: From Unknown Input Distributions to Instance Optimality Tim Roughgarden December 3, 2014 1 Preamble This lecture closes the loop on the course s circle of

More information

Advanced Machine Learning

Advanced Machine Learning Advanced Machine Learning Online Convex Optimization MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Outline Online projected sub-gradient descent. Exponentiated Gradient (EG). Mirror descent.

More information

FORMULATION OF THE LEARNING PROBLEM

FORMULATION OF THE LEARNING PROBLEM FORMULTION OF THE LERNING PROBLEM MIM RGINSKY Now that we have seen an informal statement of the learning problem, as well as acquired some technical tools in the form of concentration inequalities, we

More information

CS 188: Artificial Intelligence Fall 2008

CS 188: Artificial Intelligence Fall 2008 CS 188: Artificial Intelligence Fall 2008 Lecture 23: Perceptrons 11/20/2008 Dan Klein UC Berkeley 1 General Naïve Bayes A general naive Bayes model: C E 1 E 2 E n We only specify how each feature depends

More information

General Naïve Bayes. CS 188: Artificial Intelligence Fall Example: Overfitting. Example: OCR. Example: Spam Filtering. Example: Spam Filtering

General Naïve Bayes. CS 188: Artificial Intelligence Fall Example: Overfitting. Example: OCR. Example: Spam Filtering. Example: Spam Filtering CS 188: Artificial Intelligence Fall 2008 General Naïve Bayes A general naive Bayes model: C Lecture 23: Perceptrons 11/20/2008 E 1 E 2 E n Dan Klein UC Berkeley We only specify how each feature depends

More information

Adaptive Sampling Under Low Noise Conditions 1

Adaptive Sampling Under Low Noise Conditions 1 Manuscrit auteur, publié dans "41èmes Journées de Statistique, SFdS, Bordeaux (2009)" Adaptive Sampling Under Low Noise Conditions 1 Nicolò Cesa-Bianchi Dipartimento di Scienze dell Informazione Università

More information

Linear Classifiers and the Perceptron

Linear Classifiers and the Perceptron Linear Classifiers and the Perceptron William Cohen February 4, 2008 1 Linear classifiers Let s assume that every instance is an n-dimensional vector of real numbers x R n, and there are only two possible

More information

CS60021: Scalable Data Mining. Large Scale Machine Learning

CS60021: Scalable Data Mining. Large Scale Machine Learning J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 CS60021: Scalable Data Mining Large Scale Machine Learning Sourangshu Bhattacharya Example: Spam filtering Instance

More information

Minimax risk bounds for linear threshold functions

Minimax risk bounds for linear threshold functions CS281B/Stat241B (Spring 2008) Statistical Learning Theory Lecture: 3 Minimax risk bounds for linear threshold functions Lecturer: Peter Bartlett Scribe: Hao Zhang 1 Review We assume that there is a probability

More information

Online Bounds for Bayesian Algorithms

Online Bounds for Bayesian Algorithms Online Bounds for Bayesian Algorithms Sham M. Kakade Computer and Information Science Department University of Pennsylvania Andrew Y. Ng Computer Science Department Stanford University Abstract We present

More information