Online Prediction Peter Bartlett


1 Online Prediction
Peter Bartlett
Statistics and EECS, UC Berkeley, and Mathematical Sciences, Queensland University of Technology

2 Online Prediction
Repeated game: at each round $t$, the decision method plays $a_t \in A$, and the world reveals $\ell_t \in L$. Cumulative loss: $\hat L_n = \sum_{t=1}^n \ell_t(a_t)$. Aim to minimize regret, that is, perform well compared to the best (in retrospect) from some class:
\[ \text{regret} = \underbrace{\sum_{t=1}^n \ell_t(a_t)}_{\hat L_n} - \underbrace{\min_{a \in A} \sum_{t=1}^n \ell_t(a)}_{L_n^*}. \]
Data can be adversarially chosen.

3 Online Prediction
Minimax regret is the value of the game:
\[ \min_{a_1} \max_{\ell_1} \cdots \min_{a_n} \max_{\ell_n} \left( \sum_{t=1}^n \ell_t(a_t) - \min_{a \in A} \sum_{t=1}^n \ell_t(a) \right). \]

4 Online Prediction: Motivations
1. The adversarial model is often appropriate, e.g., in computer security and computational finance.
2. The adversarial model assumes little: it is often straightforward to convert a strategy for an adversarial environment to a method for a probabilistic environment.
3. Studying the adversarial model can reveal the deterministic core of a statistical problem: there are strong similarities between the performance guarantees in the two cases.
4. There are significant overlaps in the design of methods for the two problems: regularization plays a central role, and many online prediction strategies have a natural interpretation as a Bayesian method.

5 Computer Security: Spam Detection

6 Computer Security: Spam Detection
Here, the action $a_t$ might be a classification rule, and $\ell_t$ is the indicator for a particular email being incorrectly classified (e.g., spam allowed through). The sender can determine if an email is delivered (or detected as spam), and try to modify it. An adversarial model allows an arbitrary sequence. We cannot hope for good classification accuracy in an absolute sense; regret is relative to a comparison class. Minimizing regret ensures that the spam detection accuracy is close to the best performance in retrospect on the particular sequence.

7 Computer Security: Spam Detection
Suppose we consider features of email messages from some set $X$ (e.g., information about the header, about words in the message, about attachments). The decision method's action $a_t$ is a mapping from $X$ to $[0,1]$ (think of the value as an estimated probability that the message is spam). At each round, the adversary chooses a feature vector $x_t \in X$ and a label $y_t \in \{0,1\}$, and the loss is defined as $\ell_t(a_t) = (y_t - a_t(x_t))^2$. The regret is then the excess squared error, over the best achievable on the data sequence:
\[ \sum_{t=1}^n \ell_t(a_t) - \min_{a \in A} \sum_{t=1}^n \ell_t(a) = \sum_{t=1}^n (y_t - a_t(x_t))^2 - \min_{a \in A} \sum_{t=1}^n (y_t - a(x_t))^2. \]

8 Computational Finance: Portfolio Optimization

9 Computational Finance: Portfolio Optimization
Aim to choose a portfolio (distribution over financial instruments) to maximize utility. Other market players can profit from making our decisions bad ones. For example, if our trades have a market impact, someone can front-run (trade ahead of us). Here, the action $a_t$ is a distribution on instruments, and $\ell_t$ might be the negative logarithm of the portfolio's increase, $-\log(a_t \cdot r_t)$, where $r_t$ is the vector of relative price increases. We might compare our performance to the best stock (distribution is a delta function), or a set of indices (distribution corresponds to Dow Jones Industrial Average, etc.), or the set of all distributions.

10 Computational Finance: Portfolio Optimization
The decision method's action $a_t$ is a distribution on the $m$ instruments, $a_t \in \Delta_m = \{a \in [0,1]^m : \sum_i a_i = 1\}$. At each round, the adversary chooses a vector of returns $r_t \in \mathbb{R}^m_+$; the $i$th component is the ratio of the price of instrument $i$ at time $t$ to its price at the previous time, and the loss is defined as $\ell_t(a_t) = -\log(a_t \cdot r_t)$. The regret is then the log of the ratio of the maximum value the portfolio would have at the end (for the best mixture choice) to the final portfolio value:
\[ \sum_{t=1}^n \ell_t(a_t) - \min_{a \in A} \sum_{t=1}^n \ell_t(a) = \max_{a \in A} \sum_{t=1}^n \log(a \cdot r_t) - \sum_{t=1}^n \log(a_t \cdot r_t). \]

11 Online Prediction: Motivations
2. Online algorithms are also effective in probabilistic settings. It is easy to convert an online algorithm to a batch algorithm, and easy to show that good online performance implies good i.i.d. performance, for example.

12 Online Prediction: Motivations 3. Understanding statistical prediction methods. Many statistical methods, based on probabilistic assumptions, can be effective in an adversarial setting. Analyzing their performance in adversarial settings provides perspective on their robustness. We would like violations of the probabilistic assumptions to have a limited impact.

13 Key Points
Online prediction: a repeated game; aim to minimize regret; data can be adversarially chosen.
Motivations: often appropriate (security, finance); algorithms also effective in probabilistic settings; can provide insight into statistical prediction methods.

14 Synopsis
A finite comparison class: $A = \{1, \ldots, m\}$. An easy start.
Online, adversarial versus batch, probabilistic: similar bounds.
Optimal regret: dual game; Rademacher averages for probabilistic; sequential Rademacher averages for adversarial.
Online convex optimization: regularization methods.

15 Synopsis
A finite comparison class: $A = \{1, \ldots, m\}$.
Online, adversarial versus batch, probabilistic.
Optimal regret.
Online convex optimization.

16 Finite Comparison Class
1. Prediction with expert advice.
2. With perfect predictions: $\log m$ regret.
3. Exponential weights strategy: $\sqrt{n \log m}$ regret.
4. Refinements and extensions: exponential weights and $L^* = 0$; $n$ unknown; $L^*$ unknown; convex (versus linear) losses; Bayesian interpretation.
5. Probabilistic prediction with a finite class.

17 Prediction with Expert Advice
Suppose we are predicting whether it will rain tomorrow. We have access to a set of $m$ experts, who each make a forecast of 0 or 1. Can we ensure that we predict almost as well as the best expert? Here, $A = \{1, \ldots, m\}$. There are $m$ experts, and each has a forecast sequence $f_1^i, f_2^i, \ldots$ from $\{0,1\}$. At round $t$, the adversary chooses an outcome $y_t \in \{0,1\}$, and sets
\[ \ell_t(i) = 1[f_t^i \ne y_t] = \begin{cases} 1 & \text{if } f_t^i \ne y_t, \\ 0 & \text{otherwise.} \end{cases} \]

18 Online Prediction
Minimax regret is the value of the game:
\[ \min_{a_1} \max_{\ell_1} \cdots \min_{a_n} \max_{\ell_n} \left( \sum_{t=1}^n \ell_t(a_t) - \min_{a \in A} \sum_{t=1}^n \ell_t(a) \right), \]
where $\hat L_n = \sum_{t=1}^n \ell_t(a_t)$ and $L_n^* = \min_{a \in A} \sum_{t=1}^n \ell_t(a)$.

19 Prediction with Expert Advice
An easier game: suppose that the adversary is constrained to choose the sequence $y_t$ so that some expert incurs no loss ($L_n^* = 0$), that is, there is an $i \in \{1, \ldots, m\}$ such that for all $t$, $y_t = f_t^i$. How should we predict?

20 Prediction with Expert Advice: Halving
Define the set of experts who have been correct so far:
\[ C_t = \{i : \ell_1(i) = \cdots = \ell_{t-1}(i) = 0\}. \]
Choose $a_t$ as any element of
\[ \left\{ i : f_t^i = \text{majority}\left( \{f_t^j : j \in C_t\} \right) \right\}. \]
Theorem. This strategy has regret no more than $\log_2 m$.
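A minimal Python sketch of the halving strategy (the array layout, function name, and test data are my own illustration, not from the slides):

```python
import numpy as np

def halving(forecasts, outcomes):
    """Halving strategy: predict with the majority vote of the experts
    that have made no mistakes so far.  forecasts: (n, m) 0/1 array;
    outcomes: length-n 0/1 array.  Returns the number of mistakes."""
    n, m = forecasts.shape
    alive = np.ones(m, dtype=bool)        # C_t: experts with zero loss so far
    mistakes = 0
    for t in range(n):
        prediction = int(2 * forecasts[t, alive].sum() >= alive.sum())
        mistakes += int(prediction != outcomes[t])
        alive &= forecasts[t] == outcomes[t]   # eliminate experts that erred
    return mistakes

# One perfect expert among m = 64: at most log2(64) = 6 mistakes.
rng = np.random.default_rng(0)
n, m = 100, 64
outcomes = rng.integers(0, 2, size=n)
forecasts = rng.integers(0, 2, size=(n, m))
forecasts[:, 0] = outcomes
assert halving(forecasts, outcomes) <= np.log2(m)
```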

21 Prediction with Expert Advice: Halving
Theorem. The halving strategy has regret no more than $\log_2 m$.
Proof. If it makes a mistake (that is, $\ell_t(a_t) = 1$), then the minority of $\{f_t^j : j \in C_t\}$ is correct, so at least half of the experts are eliminated: $|C_{t+1}| \le |C_t| / 2$. And otherwise $|C_{t+1}| \le |C_t|$ (because $|C_t|$ never increases). Thus,
\[ \hat L_n = \sum_{t=1}^n \ell_t(a_t) \le \log_2 \frac{|C_1|}{|C_{n+1}|} = \log_2 m - \log_2 |C_{n+1}| \le \log_2 m. \]

22 Prediction with Expert Advice
The proof follows a pattern we shall see again: find some measure of progress (here, $|C_t|$) that changes monotonically when excess loss is incurred (here, it halves), and is somehow constrained (here, it cannot fall below 1, because there is an expert who predicts perfectly). What if there is no perfect expert?

23 Finite Comparison Class
1. Prediction with expert advice.
2. With perfect predictions: $\log m$ regret.
3. Exponential weights strategy: $\sqrt{n \log m}$ regret.
4. Refinements and extensions: exponential weights and $L^* = 0$; $n$ unknown; $L^*$ unknown; convex (versus linear) losses; Bayesian interpretation.
5. Probabilistic prediction with a finite class.

24 Prediction with Expert Advice: Mixed Strategies
We have $m$ experts. Allow a mixed strategy, that is, $a_t$ chosen from the simplex $\Delta_m$, the set of distributions on $\{1, \ldots, m\}$:
\[ \Delta_m = \left\{ a \in [0,1]^m : \sum_{i=1}^m a_i = 1 \right\}. \]
We can think of the strategy as choosing an element of $\{1, \ldots, m\}$ randomly, according to a distribution $a_t$. Or we can think of it as playing an element $a_t$ of $\Delta_m$, and incurring the expected loss,
\[ \ell_t(a_t) = \sum_{i=1}^m a_t^i \, \ell_t(e_i), \]
where $\ell_t(e_i) \in [0,1]$ is the loss incurred by expert $i$. ($e_i$ denotes the vector with a single 1 in the $i$th coordinate, and the rest zeros.)

25 Prediction with Expert Advice: Exponential Weights
Maintain a set of (unnormalized) weights over experts:
\[ w_1^i = 1, \qquad w_{t+1}^i = w_t^i \exp(-\eta \, \ell_t(e_i)). \]
Here, $\eta > 0$ is a parameter of the algorithm. Choose $a_t$ as the normalized vector,
\[ a_t = \frac{w_t}{\sum_{i=1}^m w_t^i}. \]
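The exponential weights strategy is only a few lines of code. A hedged Python sketch (the names and log-domain implementation detail are mine); the final lines check the regret bound from the next slide on random data:

```python
import numpy as np

def exponential_weights(losses, eta):
    """Exponential weights over m experts.  losses: (n, m) array with
    entries in [0, 1].  Returns the sequence of mixed strategies a_t."""
    n, m = losses.shape
    log_w = np.zeros(m)                  # log w_1^i = 0, i.e. w_1^i = 1
    strategies = np.empty((n, m))
    for t in range(n):
        w = np.exp(log_w - log_w.max()) # work in the log domain for stability
        strategies[t] = w / w.sum()      # a_t = w_t / sum_i w_t^i
        log_w -= eta * losses[t]         # w_{t+1}^i = w_t^i exp(-eta l_t(e_i))
    return strategies

# Check the regret bound of the next slide on random [0, 1] losses.
rng = np.random.default_rng(1)
n, m = 1000, 10
losses = rng.random((n, m))
a = exponential_weights(losses, np.sqrt(8 * np.log(m) / n))
regret = (a * losses).sum() - losses.sum(axis=0).min()
assert regret <= np.sqrt(n * np.log(m) / 2)
```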

26 Prediction with Expert Advice: Exponential Weights
Theorem. The exponential weights strategy with parameter $\eta = \sqrt{8 \ln m / n}$ has regret satisfying
\[ \hat L_n - L_n^* \le \sqrt{\frac{n \ln m}{2}}. \]

27 Exponential Weights: Proof Idea
We use a measure of progress: $W_t = \sum_{i=1}^m w_t^i$.
1. $W_n$ grows at least as $\exp\left( -\eta \min_i \sum_t \ell_t(e_i) \right)$.
2. $W_n$ grows no faster than $\exp\left( -\eta \sum_t \ell_t(a_t) \right)$.

28 Exponential Weights: Proof 1
\[ \ln \frac{W_{n+1}}{W_1} = \ln \left( \sum_{i=1}^m w_{n+1}^i \right) - \ln m = \ln \left( \sum_{i=1}^m \exp\Big( -\eta \sum_t \ell_t(e_i) \Big) \right) - \ln m \ge \ln \left( \max_i \exp\Big( -\eta \sum_t \ell_t(e_i) \Big) \right) - \ln m = -\eta \min_i \sum_t \ell_t(e_i) - \ln m = -\eta L_n^* - \ln m. \]

29 Exponential Weights: Proof 2
\[ \ln \frac{W_{t+1}}{W_t} = \ln \left( \frac{\sum_i \exp(-\eta \ell_t(e_i)) \, w_t^i}{\sum_i w_t^i} \right) \le -\eta \, \frac{\sum_i \ell_t(e_i) w_t^i}{\sum_i w_t^i} + \frac{\eta^2}{8} = -\eta \, \ell_t(a_t) + \frac{\eta^2}{8}, \]
where we have used Hoeffding's inequality: for a random variable $X \in [a,b]$ and $\lambda \in \mathbb{R}$,
\[ \ln \left( \mathbb{E} e^{\lambda X} \right) \le \lambda \mathbb{E} X + \frac{\lambda^2 (b-a)^2}{8}. \]

30 Aside: Proof of Hoeffding's Inequality
Define
\[ A(\lambda) = \log \mathbb{E} e^{\lambda X} = \log \int e^{\lambda x} \, dP(x), \]
where $X \sim P$. Then $A$ is the log normalization of the exponential family random variable $X_\lambda$ with reference measure $P$ and sufficient statistic $x$. Since $P$ has bounded support, $A(\lambda) < \infty$ for all $\lambda$, and we know that $A'(\lambda) = \mathbb{E}(X_\lambda)$ and $A''(\lambda) = \text{Var}(X_\lambda)$. Since $P$ has support in $[a,b]$, $\text{Var}(X_\lambda) \le (b-a)^2/4$. Then a Taylor expansion about $\lambda = 0$ (where $X_\lambda$ has the same distribution as $X$) gives
\[ A(\lambda) \le \lambda \mathbb{E} X + \frac{\lambda^2}{8} (b-a)^2. \]

31 Exponential Weights: Proof
Thus,
\[ -\eta L_n^* - \ln m \le \ln \frac{W_{n+1}}{W_1} \le -\eta \hat L_n + \frac{n \eta^2}{8} \quad \Longrightarrow \quad \hat L_n - L_n^* \le \frac{\ln m}{\eta} + \frac{\eta n}{8}. \]
Choosing the optimal $\eta$ gives the result:
Theorem. The exponential weights strategy with parameter $\eta = \sqrt{8 \ln m / n}$ has regret no more than $\sqrt{n \ln m / 2}$.

32 Key Points
For a finite set of actions (experts):
If one action is perfect (i.e., has zero loss), the halving algorithm gives per round regret of $\frac{\ln m}{n}$.
Exponential weights gives per round regret of $O\left( \sqrt{\frac{\ln m}{n}} \right)$.

33 Prediction with Expert Advice: Refinements
1. Does the exponential weights strategy give the faster rate if $L^* = 0$?
2. Do we need to know $n$ to set $\eta$?

34 Prediction with Expert Advice: Refinements
1. Does the exponential weights strategy give the faster rate if $L^* = 0$? Replace Hoeffding,
\[ \ln \mathbb{E} e^{\lambda X} \le \lambda \mathbb{E} X + \frac{\lambda^2}{8}, \]
with
\[ \ln \mathbb{E} e^{\lambda X} \le (e^\lambda - 1) \, \mathbb{E} X \]
(for $X \in [0,1]$: a linear upper bound on $e^{\lambda x}$).

35 Exponential Weights: Proof 2
Thus
\[ \ln \frac{W_{t+1}}{W_t} = \ln \left( \frac{\sum_i \exp(-\eta \ell_t(e_i)) \, w_t^i}{\sum_i w_t^i} \right) \le \left( e^{-\eta} - 1 \right) \ell_t(a_t), \]
so
\[ \hat L_n \le \frac{\eta}{1 - e^{-\eta}} \, L_n^* + \frac{\ln m}{1 - e^{-\eta}}. \]
For example, if $L_n^* = 0$ and $\eta$ is large, we obtain a regret bound of roughly $\ln m$ again. And $\eta$ large is like the halving algorithm (it puts equal weight on all experts that have zero loss so far).

36 Prediction with Expert Advice: Refinements
2. Do we need to know $n$ to set $\eta$? We used the optimal setting $\eta = \sqrt{8 \ln m / n}$. But can this regret bound be achieved uniformly across time? Yes: using a time-varying $\eta_t = \sqrt{8 \ln m / t}$ gives the same rate (worse constants). It is also possible to set $\eta$ as a function of $L_t^*$, the best cumulative loss so far, to give the improved bound for small losses uniformly across time (worse constants).

37 Prediction with Expert Advice: Refinements
3. We could work with arbitrary convex losses on $\Delta_m$. We defined the loss as linear in $a$: $\ell_t(a) = \sum_i a_i \ell_t(e_i)$. We could replace this with any bounded convex function on $\Delta_m$. The only change in the proof is that an equality becomes an inequality:
\[ -\eta \, \frac{\sum_i \ell_t(e_i) w_t^i}{\sum_i w_t^i} \le -\eta \, \ell_t(a_t). \]

38 Prediction with Expert Advice: Refinements
But note that the exponential weights strategy only competes with the corners of the simplex:
Theorem. For convex functions $\ell_t : \Delta_m \to [0,1]$, the exponential weights strategy, with $\eta = \sqrt{8 \ln m / n}$, satisfies
\[ \sum_{t=1}^n \ell_t(a_t) \le \min_i \sum_{t=1}^n \ell_t(e_i) + \sqrt{\frac{n \ln m}{2}}. \]

39 Bayesian Interpretation
We can interpret the exponential weights strategy as Bayesian prediction:
1. Exponential weights is equivalent to a Bayesian update with outcome vector $y$ and model $p(y \mid j) = h(y) \exp(-\eta y^j)$.
2. An easy regret bound for Bayesian prediction shows that its regret, wrt the scaled log loss
\[ \ell_{\text{Bayes}}(p, y) = -\frac{1}{\eta} \log \mathbb{E}_{J \sim p} \exp(-\eta y^J), \]
is no more than $(1/\eta) \log m$.
3. This convex loss matches the linear loss at the corners of the simplex, and (from Hoeffding) differs from the linear loss by no more than $\eta/8$.
4. This implies the earlier regret bound for exponential weights.

40 Bayesian Update
Parameter space: $\Theta$. Outcome space: $Y$. Assume the joint
\[ p(\theta, y) = \underbrace{\pi(\theta)}_{\text{prior}} \, \underbrace{p(y \mid \theta)}_{\text{likelihood}}. \]
Predictive distribution:
\[ \hat p_{t+1}(y) = p(y \mid y_1, \ldots, y_t) = \int p(y \mid \theta) \, \underbrace{p(\theta \mid y_1, \ldots, y_t)}_{p_{t+1}(\theta)} \, d\theta. \]
Update $p_t$ on $\Theta$:
\[ p_1(\theta) = \pi(\theta), \qquad p_{t+1}(\theta) = \frac{p_t(\theta) \, p(y_t \mid \theta)}{\int p_t(\theta') \, p(y_t \mid \theta') \, d\theta'}. \]
Cumulative log loss: $\sum_t -\log \hat p_t(y_t)$.

41 Bayesian Interpretation
Suppose that the likelihood is
\[ p(y \mid j) = h(y) \exp(-\eta y^j), \]
for $j = 1, \ldots, m$ and $y \in \mathbb{R}^m$. Then the Bayes update is
\[ p_{t+1}(j) = \frac{1}{Z} \, p_t(j) \exp(-\eta y_t^j) \]
(where $Z$ is the normalization). If $y_t^j$ is the loss of expert $j$, this is the exponential weights algorithm.

42 Performance of Bayesian Prediction
For the log loss, Bayesian prediction competes with any $\theta$, provided that the prior probability of performance better than $\theta$ is not too small.
Theorem. For any $\pi$, any sequence $y_1, \ldots, y_n$, and any $\theta \in \Theta$,
\[ \hat L_n \le L_n(\theta) - \ln \pi\left( \{\theta' : L_n(\theta') \le L_n(\theta)\} \right). \]

43 Performance of Bayesian Prediction: Proof
First,
\[ \underbrace{\hat p_1(y_1) \cdots \hat p_n(y_n)}_{\exp(-\hat L_n)} = p(y_1) \, p(y_2 \mid y_1) \cdots p(y_n \mid y_1, \ldots, y_{n-1}) = p(y_1, \ldots, y_n) = \int_\Theta \underbrace{p(y_1 \mid \theta) \cdots p(y_n \mid \theta)}_{\exp(-L_n(\theta))} \, d\pi(\theta), \]
hence
\[ \exp(-\hat L_n) = \int_\Theta \exp(-L_n(\theta)) \, d\pi(\theta) \ge \int_S \exp(-L_n(\theta)) \, d\pi(\theta) \ge \exp(-L_n(\theta_0)) \, \pi(S), \]
where $S = \{\theta \in \Theta : L_n(\theta) \le L_n(\theta_0)\}$. Thus $\hat L_n \le L_n(\theta_0) - \ln \pi(S)$.

44 Performance of Bayesian Prediction
For the log loss, Bayesian prediction competes with any $\theta$, provided that the prior probability of performance better than $\theta$ is not too small.
Theorem. For any $\pi$, any sequence $y_1, \ldots, y_n$, and any $\theta \in \Theta$,
\[ \hat L_n \le L_n(\theta) - \ln \pi\left( \{\theta' : L_n(\theta') \le L_n(\theta)\} \right). \]
So if $\pi(i) = 1/m$, the exponential weights strategy's log loss is within $\log m$ of optimal. But what is the log loss here?

45 Bayesian Interpretation
For a posterior $p_t$ on $\{1, \ldots, m\}$, the predicted probability is
\[ \hat p_t(y) = \mathbb{E}_{J \sim p_t} \, p(y \mid J) = c(y) \, \mathbb{E}_{J \sim p_t} \exp(-\eta y^J), \]
so setting the loss as the negative log of the predicted probability is equivalent to defining the loss of a posterior $p_t$ with outcome $y_t$ as $\eta \, \ell_{\text{Bayes}}$ with
\[ \ell_{\text{Bayes}}(p_t, y_t) = -\frac{1}{\eta} \log \left( \mathbb{E}_{J \sim p_t} \exp(-\eta y_t^J) \right) \]
(ignoring additive constants). Compare to the linear loss, $\ell(p_t, y_t) = \mathbb{E}_{J \sim p_t} y_t^J$. These are equal at the corners of the simplex: $\ell_{\text{Bayes}}(\delta_j, y) = \ell(\delta_j, y) = y^j$.

46 Bayesian Interpretation
The theorem shows that, with this loss and any prior,
\[ \hat L_{\text{Bayes},n} \le \min_j \left( \sum_{t=1}^n \ell(e_j, y_t) - \frac{\log \pi(j)}{\eta} \right). \]
But this is not the linear loss:
\[ \ell_{\text{Bayes}}(p_t, y_t) = -\frac{1}{\eta} \log \left( \mathbb{E}_{J \sim p_t} \exp(-\eta y_t^J) \right) \quad \text{versus} \quad \ell(p_t, y_t) = \mathbb{E}_{J \sim p_t} y_t^J. \]
They coincide at the corners, $p_t = e_j$, and $\ell_{\text{Bayes}}$ is convex. What is the gap in Jensen's inequality?

47 Bayesian Interpretation
\[ \ell_{\text{Bayes}}(p_t, y_t) = -\frac{1}{\eta} \log \left( \mathbb{E}_{J \sim p_t} \exp(-\eta y_t^J) \right), \qquad \ell(p_t, y_t) = \mathbb{E}_{J \sim p_t} y_t^J. \]
Hoeffding's inequality for $X \in [a,b]$,
\[ A(-\eta) = \log \left( \mathbb{E} \exp(-\eta X) \right) \le -\eta \, \mathbb{E} X + \frac{\eta^2}{8} (b-a)^2, \]
implies, as before,
\[ \hat L_n \le \hat L_{\text{Bayes},n} + \frac{\eta n}{8} \le \min_j \left( \sum_t \ell(e_j, y_t) - \frac{\log \pi(j)}{\eta} \right) + \frac{\eta n}{8} = \min_j \sum_t \ell(e_j, y_t) + \frac{\log m}{\eta} + \frac{\eta n}{8}. \]

48 Bayesian Interpretation
We can interpret the exponential weights strategy as Bayesian prediction:
1. Exponential weights is equivalent to a Bayesian update with model $p(y \mid j) = h(y) \exp(-\eta y^j)$.
2. An easy regret bound for Bayesian prediction shows that its regret wrt the scaled log loss $\ell_{\text{Bayes}}(p, y) = -\frac{1}{\eta} \log \mathbb{E}_{J \sim p} \exp(-\eta y^J)$ is no more than $(1/\eta) \log m$.
3. This convex loss matches the linear loss at the corners of the simplex, and (from Hoeffding) differs from the linear loss by no more than $\eta/8$.
4. This implies the earlier regret bound for exponential weights.

49 Finite Comparison Class
1. Prediction with expert advice.
2. With perfect predictions: $\log m$ regret.
3. Exponential weights strategy: $\sqrt{n \log m}$ regret.
4. Refinements and extensions: exponential weights and $L^* = 0$; $n$ unknown; $L^*$ unknown; convex (versus linear) losses; Bayesian interpretation.
5. Probabilistic prediction with a finite class.

50 Probabilistic Prediction Setting
Let's consider a probabilistic formulation of a prediction problem. There is a sample of size $n$ drawn i.i.d. from an unknown probability distribution $P$ on $X \times Y$: $(X_1, Y_1), \ldots, (X_n, Y_n)$. Some method chooses $\hat f : X \to Y$. It suffers regret
\[ \mathbb{E} \ell(\hat f(X), Y) - \min_{f \in F} \mathbb{E} \ell(f(X), Y). \]
Here, $F$ is a class of functions from $X$ to $Y$.

51 Probabilistic Setting: Zero Loss
Theorem. If some $f \in F$ has $\mathbb{E} \ell(f(X), Y) = 0$, then choosing
\[ \hat f \in C_n = \{f \in F : \hat{\mathbb{E}} \ell(f(X), Y) = 0\} \]
leads to regret that is
\[ O\left( \frac{\log |F|}{n} \right). \]

52 Probabilistic Setting: Zero Loss
Proof.
\[ \Pr(\mathbb{E} \ell(\hat f) \ge \epsilon) \le \Pr\left( \exists f \in F : \hat{\mathbb{E}} \ell(f) = 0, \; \mathbb{E} \ell(f) \ge \epsilon \right) \le |F| (1 - \epsilon)^n \le |F| e^{-n\epsilon}. \]
Integrating the tail bound $\Pr\left( n \, \mathbb{E} \ell(\hat f) \ge \ln |F| + x \right) \le e^{-x}$ gives $\mathbb{E} \ell(\hat f) \le c \ln |F| / n$.

53 Probabilistic Setting
Theorem. Choosing $\hat f$ to minimize the empirical risk, $\hat{\mathbb{E}} \ell(f(X), Y)$, leads to regret that is
\[ O\left( \sqrt{\frac{\log |F|}{n}} \right). \]

54 Probabilistic Setting
Proof. By the triangle inequality and the definition of $\hat f$,
\[ \mathbb{E} \ell_{\hat f} - \min_{f \in F} \mathbb{E} \ell_f \le 2 \, \mathbb{E} \sup_{f \in F} \left| \hat{\mathbb{E}} \ell_f - \mathbb{E} \ell_f \right|, \]
and
\[ \mathbb{E} \sup_{f \in F} \frac{1}{n} \sum_t \left( \ell(Y_t, f(X_t)) - P\ell(Y, f(X)) \right) \le \mathbb{E} \sup_{f \in F} \frac{1}{n} \sum_t \left( \ell(Y_t', f(X_t')) - \ell(Y_t, f(X_t)) \right) = \mathbb{E} \sup_{f \in F} \frac{1}{n} \sum_t \epsilon_t \left( \ell(Y_t', f(X_t')) - \ell(Y_t, f(X_t)) \right) \le 2 \, \mathbb{E} \sup_{f \in F} \frac{1}{n} \sum_t \epsilon_t \, \ell(Y_t, f(X_t)), \]
where $(X_t', Y_t')$ are independent, with the same distribution as $(X, Y)$, and the $\epsilon_t$ are independent Rademacher (uniform $\pm 1$) random variables.

55 Aside: Rademacher Averages of a Finite Class
Theorem. For $V \subseteq \mathbb{R}^n$,
\[ \mathbb{E} \max_{v \in V} \sum_{i=1}^n \epsilon_i v_i \le \sqrt{2 \ln |V|} \, \max_{v \in V} \|v\|. \]
Proof idea: Hoeffding's inequality.
\[ \exp\left( \lambda \, \mathbb{E} \max_v \sum_i \epsilon_i v_i \right) \le \mathbb{E} \exp\left( \lambda \max_v \sum_i \epsilon_i v_i \right) \le \sum_v \mathbb{E} \exp\left( \lambda \sum_i \epsilon_i v_i \right) = \sum_v \prod_i \mathbb{E} \exp(\lambda \epsilon_i v_i) \le \sum_v \exp\left( \lambda^2 \sum_i v_i^2 / 2 \right) \le |V| \exp\left( \lambda^2 \max_v \|v\|^2 / 2 \right). \]
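A quick Monte Carlo illustration of this finite-class bound (entirely my own construction, with a random Gaussian class $V$):

```python
import numpy as np

# Empirical E max_{v in V} sum_i eps_i v_i versus sqrt(2 ln|V|) max_v ||v||.
rng = np.random.default_rng(2)
n, num_v, trials = 50, 20, 2000
V = rng.standard_normal((num_v, n))
eps = rng.choice([-1.0, 1.0], size=(trials, n))
lhs = (eps @ V.T).max(axis=1).mean()
rhs = np.sqrt(2 * np.log(num_v)) * np.linalg.norm(V, axis=1).max()
print(f"empirical {lhs:.2f} <= bound {rhs:.2f}")
```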

56 Probabilistic Setting
\[ \mathbb{E} \ell_{\hat f} - \min_{f \in F} \mathbb{E} \ell_f \le 4 \, \mathbb{E} \sup_{f \in F} \frac{1}{n} \sum_t \epsilon_t \, \ell_f(X_t, Y_t) \le 4 \max_{X_i, Y_i, f} \left\| \left( \ell_f(X_i, Y_i) \right)_i \right\| \frac{\sqrt{2 \log |F|}}{n} \le 4 \sqrt{\frac{2 \log |F|}{n}}. \]

57 Probabilistic Setting: Key Points
For a finite function class:
If one function has zero loss, choosing $\hat f$ to minimize the empirical risk, $\hat{\mathbb{E}} \ell(f(X), Y)$, gives per round regret of $\frac{\ln |F|}{n}$.
In any case, $\hat f$ has per round regret of $O\left( \sqrt{\frac{\ln |F|}{n}} \right)$.
The same as the adversarial setting.

58 Synopsis
A finite comparison class: $A = \{1, \ldots, m\}$.
1. Prediction with expert advice.
2. With perfect predictions: $\log m$ regret.
3. Exponential weights strategy: $\sqrt{n \log m}$ regret.
4. Refinements and extensions.
5. Probabilistic prediction with a finite class.
Online, adversarial versus batch, probabilistic.
Optimal regret.
Online convex optimization.

59 Online to Batch Conversion
Suppose we have an online strategy that, given observations $\ell_1, \ldots, \ell_{t-1}$, produces $a_t = A(\ell_1, \ldots, \ell_{t-1})$. Can we convert this to a method that is suitable for a probabilistic setting? That is, if the $\ell_t$ are chosen i.i.d., can we use $A$'s choices $a_t$ to come up with an $\hat a \in A$ so that
\[ \mathbb{E} \ell_1(\hat a) - \min_{a \in A} \mathbb{E} \ell_1(a) \]
is small? Consider the following simple randomized method:
1. Pick $T$ uniformly from $\{0, \ldots, n\}$.
2. Let $\hat a = A(\ell_{T+1}, \ldots, \ell_n)$.
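A sketch of this randomized conversion in Python (the `strategy` interface is my own framing of $A(\ell_1, \ldots, \ell_{t-1})$, and the follow-the-leader example is purely illustrative):

```python
import numpy as np

def online_to_batch(strategy, losses, rng):
    """Randomized conversion from the slides: pick T uniformly from
    {0, ..., n} and output a_hat = A(l_{T+1}, ..., l_n).  `strategy`
    maps a list of observed losses to the next action."""
    n = len(losses)
    T = rng.integers(0, n + 1)          # uniform on {0, ..., n}
    return strategy(losses[T:])

# Example with a toy follow-the-leader strategy over two actions.
rng = np.random.default_rng(3)
losses = list(rng.random((500, 2)))     # i.i.d. loss vectors l_t
ftl = lambda ls: int(np.argmin(np.sum(ls, axis=0))) if ls else 0
a_hat = online_to_batch(ftl, losses, rng)
```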

60 Online to Batch Conversion
Theorem. If $A$ has a regret bound of $C_{n+1}$ for sequences of length $n+1$, then for any stationary process generating the $\ell_1, \ldots, \ell_{n+1}$, this method satisfies
\[ \mathbb{E} \ell_{n+1}(\hat a) - \min_{a \in A} \mathbb{E} \ell_n(a) \le \frac{C_{n+1}}{n+1}. \]
(Notice that the expectation averages also over the randomness of the method.)

61 Online to Batch Conversion
Proof.
\[ \mathbb{E} \ell_{n+1}(\hat a) = \mathbb{E} \ell_{n+1}(A(\ell_{T+1}, \ldots, \ell_n)) = \mathbb{E} \frac{1}{n+1} \sum_{t=0}^n \ell_{n+1}(A(\ell_{t+1}, \ldots, \ell_n)) = \mathbb{E} \frac{1}{n+1} \sum_{t=0}^n \ell_{n-t+1}(A(\ell_1, \ldots, \ell_{n-t})) = \mathbb{E} \frac{1}{n+1} \sum_{t=1}^{n+1} \ell_t(A(\ell_1, \ldots, \ell_{t-1})) \le \mathbb{E} \frac{1}{n+1} \left( \min_a \sum_{t=1}^{n+1} \ell_t(a) + C_{n+1} \right) \le \min_a \mathbb{E} \ell_{n+1}(a) + \frac{C_{n+1}}{n+1}, \]
where the third equality uses stationarity (so $\mathbb{E} \ell_t(a)$ does not depend on $t$).

62 Online to Batch Conversion
The theorem is for the expectation over the randomness of the method. For a high probability result, we could
1. Choose $\hat a = \frac{1}{n} \sum_{t=1}^n a_t$, provided $A$ is convex and the $\ell_t$ are all convex.
2. Choose
\[ \hat a = \arg\min_{a_t} \left( \frac{1}{n-t} \sum_{s=t+1}^n \ell_s(a_t) + c \sqrt{\frac{\log(n/\delta)}{n-t}} \right). \]
In both cases, the analysis involves concentration of martingales.

63 Key Point: Online to Batch Conversion
An online strategy with regret bound $C_n$ can be converted to a batch method. The regret per trial in the probabilistic setting is bounded by the regret per trial in the adversarial setting.

64 Synopsis
A finite comparison class: $A = \{1, \ldots, m\}$.
Online, adversarial versus batch, probabilistic.
Optimal regret:
1. Dual game.
2. Rademacher averages and sequential Rademacher averages.
3. Linear games.
Online convex optimization.

65 Optimal Regret
Joint work with Jake Abernethy, Alekh Agarwal, Sasha Rakhlin, Karthik Sridharan and Ambuj Tewari.
We have a set of actions $A$ and a set of loss functions $L$. At time $t$, the Player chooses an action $a_t$ from $A$, the Adversary chooses $\ell_t : A \to \mathbb{R}$ from $L$, and the Player incurs loss $\ell_t(a_t)$. Regret is the value of the game:
\[ V_n(A, L) = \inf_{a_1} \sup_{\ell_1} \cdots \inf_{a_n} \sup_{\ell_n} \left( \sum_{t=1}^n \ell_t(a_t) - \inf_{a \in A} \sum_{t=1}^n \ell_t(a) \right). \]

66 Optimal Regret: Dual Game
Theorem. If $A$ is compact and all $\ell_t$ are convex, continuous functions, then
\[ V_n(A, L) = \sup_P \mathbb{E} \left( \sum_{t=1}^n \inf_{a_t \in A} \mathbb{E} \left[ \ell_t(a_t) \mid \ell_1, \ldots, \ell_{t-1} \right] - \inf_{a \in A} \sum_{t=1}^n \ell_t(a) \right), \]
where the supremum is over joint distributions $P$ over sequences $\ell_1, \ldots, \ell_n$ in $L^n$.
As we'll see, this follows from a minimax theorem. Dual game: the adversary plays first by choosing $P$. The value of the game is the difference between the minimal conditional expected loss and the minimal empirical loss. If $P$ were i.i.d., this would be the difference between the minimal expected loss and the minimal empirical loss.

67 Optimal Regret: Extensions
We could replace $L^n$ by a set of sequences of loss functions: $\ell_1 \in L_1$, $\ell_2 \in L_2(\ell_1)$, $\ell_3 \in L_3(\ell_1, \ell_2)$, ..., $\ell_n \in L_n(\ell_1, \ell_2, \ldots, \ell_{n-1})$. That is, the constraints on the adversary's choice $\ell_t$ could depend on the previous choices $\ell_1, \ldots, \ell_{t-1}$. We can ensure convexity of the $\ell_t$ by allowing mixed strategies: replace $A$ by the set of probability distributions $P$ on $A$ and replace $\ell(a)$ by $\mathbb{E}_{a \sim P} \ell(a)$.

68 Dual Game: Proof Idea
Theorem (Sion, 1957). If $A$ is compact and for every $b \in B$, $f(\cdot, b)$ is a convex-like¹, lower semi-continuous function, and for every $a \in A$, $f(a, \cdot)$ is concave-like, then
\[ \inf_{a \in A} \sup_{b \in B} f(a, b) = \sup_{b \in B} \inf_{a \in A} f(a, b). \]
We'll define $B$ as the set of probability distributions on $L$ and $f(a, b) = c + \mathbb{E}[\ell(a) + \phi(\ell)]$, and we'll assume that $A$ is compact and each $\ell \in L$ is convex and continuous.
¹Convex-like [Fan, 1953]: $\forall a_1, a_2 \in A$, $\alpha \in [0,1]$, $\exists a \in A$: $\alpha \ell(a_1) + (1-\alpha) \ell(a_2) \ge \ell(a)$.

69 Dual Game: Proof Idea
\[ V_n(A, L) = \inf_{a_1} \sup_{\ell_1} \cdots \inf_{a_n} \sup_{\ell_n} \left( \sum_t \ell_t(a_t) - \inf_{a \in A} \sum_t \ell_t(a) \right) = \inf_{a_1} \sup_{\ell_1} \cdots \inf_{a_n} \sup_{P_n} \mathbb{E} \left( \sum_t \ell_t(a_t) - \inf_{a \in A} \sum_t \ell_t(a) \right), \]
because allowing mixed strategies does not help the adversary.

70 Dual Game: Proof Idea
\[ V_n(A, L) = \inf_{a_1} \sup_{\ell_1} \cdots \inf_{a_n} \sup_{P_n} \mathbb{E} \left( \sum_t \ell_t(a_t) - \inf_{a \in A} \sum_t \ell_t(a) \right) = \inf_{a_1} \sup_{\ell_1} \cdots \inf_{a_{n-1}} \sup_{\ell_{n-1}} \sup_{P_n} \left( \inf_{a_n} \mathbb{E} \left( \sum_t \ell_t(a_t) - \inf_{a \in A} \sum_t \ell_t(a) \right) \right), \]
by Sion's generalization of von Neumann's minimax theorem.

71 Dual Game: Proof Idea
\[ V_n(A, L) = \inf_{a_1} \sup_{\ell_1} \cdots \inf_{a_{n-1}} \sup_{P_{n-1}} \mathbb{E} \sup_{P_n} \left( \sum_{t=1}^{n-1} \ell_t(a_t) + \inf_{a_n} \mathbb{E}[\ell_n(a_n) \mid \ell_1, \ldots, \ell_{n-1}] - \mathbb{E} \left[ \inf_{a \in A} \sum_{t=1}^n \ell_t(a) \,\Big|\, \ell_1, \ldots, \ell_{n-1} \right] \right), \]
splitting the sum and allowing the adversary a mixed strategy at round $n-1$.

72 Dual Game: Proof Idea
\[ V_n(A, L) = \inf_{a_1} \sup_{\ell_1} \cdots \sup_{P_{n-1}} \inf_{a_{n-1}} \mathbb{E} \sup_{P_n} \left( \sum_{t=1}^{n-1} \ell_t(a_t) + \inf_{a_n} \mathbb{E}[\ell_n(a_n) \mid \ell_1, \ldots, \ell_{n-1}] - \mathbb{E} \left[ \inf_{a \in A} \sum_{t=1}^n \ell_t(a) \,\Big|\, \ell_1, \ldots, \ell_{n-1} \right] \right), \]
applying Sion's minimax theorem again.

73 Dual Game: Proof Idea
Repeating for rounds $n-2, \ldots, 1$ gives
\[ V_n(A, L) = \sup_P \mathbb{E} \left( \sum_{t=1}^n \inf_{a_t} \mathbb{E}[\ell_t(a_t) \mid \ell_1, \ldots, \ell_{t-1}] - \inf_{a \in A} \sum_{t=1}^n \ell_t(a) \right). \]

74 Optimal Regret Dual game. Rademacher averages and sequential Rademacher averages. Linear games.

75 Prediction in Probabilistic Settings
i.i.d. $(X, Y), (X_1, Y_1), \ldots, (X_n, Y_n) \sim P$ on $X \times Y$. Use the data $(X_1, Y_1), \ldots, (X_n, Y_n)$ to choose $f_n : X \to A$ with small risk,
\[ R(f_n) = P \ell(Y, f_n(X)), \]
ideally not much larger than the minimum risk over some comparison class $F$:
\[ \text{excess risk} = R(f_n) - \inf_{f \in F} R(f). \]

76 Tools for the analysis of probabilistic problems
For $f_n = \arg\min_{f \in F} \frac{1}{n} \sum_{t=1}^n \ell(Y_t, f(X_t))$,
\[ R(f_n) - \inf_{f \in F} P\ell(Y, f(X)) \le 2 \sup_{f \in F} \left| \frac{1}{n} \sum_{t=1}^n \ell(Y_t, f(X_t)) - P\ell(Y, f(X)) \right|. \]
So the supremum of an empirical process, indexed by $F$, gives an upper bound on the excess risk.

77 Tools for the analysis of probabilistic problems
Typically, this supremum is concentrated about its expectation,
\[ P \sup_{f \in F} \frac{1}{n} \sum_t \left( \ell(Y_t, f(X_t)) - P\ell(Y, f(X)) \right) \le \mathbb{E} \sup_{f \in F} \frac{1}{n} \sum_t \epsilon_t \left( \ell(Y_t', f(X_t')) - \ell(Y_t, f(X_t)) \right) \le 2 \, \mathbb{E} \sup_{f \in F} \frac{1}{n} \sum_t \epsilon_t \, \ell(Y_t, f(X_t)), \]
where $(X_t', Y_t')$ are independent, with the same distribution as $(X, Y)$, and the $\epsilon_t$ are independent Rademacher (uniform $\pm 1$) random variables.

78 Tools for the analysis of probabilistic problems
That is, for $f_n = \arg\min_{f \in F} \frac{1}{n} \sum_t \ell(Y_t, f(X_t))$, with high probability,
\[ R(f_n) - \inf_{f \in F} P\ell(Y, f(X)) \le c \, \mathbb{E} \sup_{f \in F} \frac{1}{n} \sum_t \epsilon_t \, \ell(Y_t, f(X_t)), \]
where the $\epsilon_t$ are independent Rademacher (uniform $\pm 1$) random variables. Rademacher averages capture the complexity of $\{(x, y) \mapsto \ell(y, f(x)) : f \in F\}$: they measure how well functions align with a random $(\epsilon_1, \ldots, \epsilon_n)$. Rademacher averages are a key tool in the analysis of many statistical methods: they are related to covering numbers (Dudley) and combinatorial dimensions (Vapnik-Chervonenkis, Pollard), for example. A related result applies in the online setting...

79 Optimal Regret and Sequential Rademacher Averages
Theorem.
\[ V_n(A, L) \le 2 \sup_{\ell_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{\ell_n} \mathbb{E}_{\epsilon_n} \sup_{a \in A} \sum_{t=1}^n \epsilon_t \, \ell_t(a), \]
where $\epsilon_1, \ldots, \epsilon_n$ are independent Rademacher (uniform $\pm 1$-valued) random variables. Compare to the bound involving Rademacher averages in the probabilistic setting:
\[ \text{excess risk} \le c \, \mathbb{E} \sup_{f \in F} \frac{1}{n} \sum_t \epsilon_t \, \ell(Y_t, f(X_t)). \]
In the adversarial case, the choice of $\ell_t$ is deterministic, and can depend on $\epsilon_1, \ldots, \epsilon_{t-1}$.

80 Sequential Rademacher Averages: Proof Idea
\[ V_n(A, L) = \sup_P \mathbb{E} \left( \sum_t \inf_{a_t \in A} \mathbb{E}[\ell_t(a_t) \mid \ell_1, \ldots, \ell_{t-1}] - \inf_{a \in A} \sum_t \ell_t(a) \right) \le \sup_P \mathbb{E} \left( \sum_t \left( \mathbb{E}[\ell_t(\hat a) \mid \ell_1, \ldots, \ell_{t-1}] - \ell_t(\hat a) \right) \right), \]
where $\hat a$ minimizes $\sum_t \ell_t(a)$.

81 Sequential Rademacher Averages: Proof Idea
\[ V_n(A, L) \le \sup_P \mathbb{E} \left( \sum_t \left( \mathbb{E}[\ell_t(\hat a) \mid \ell_1, \ldots, \ell_{t-1}] - \ell_t(\hat a) \right) \right) \le \sup_P \mathbb{E} \sup_{a \in A} \sum_t \left( \mathbb{E}[\ell_t(a) \mid \ell_1, \ldots, \ell_{t-1}] - \ell_t(a) \right). \]

82 Sequential Rademacher Averages: Proof Idea
\[ V_n(A, L) \le \sup_P \mathbb{E} \sup_{a \in A} \sum_t \left( \mathbb{E}[\ell_t(a) \mid \ell_1, \ldots, \ell_{t-1}] - \ell_t(a) \right) = \sup_P \mathbb{E} \sup_{a \in A} \sum_t \left( \mathbb{E}[\ell_t'(a) \mid \ell_1, \ldots, \ell_n] - \ell_t(a) \right), \]
where $\ell_t'$ is a tangent sequence: $\ell_t'$ is conditionally independent of $\ell_t$ given $\ell_1, \ldots, \ell_{t-1}$, with the same conditional distribution.

83 Sequential Rademacher Averages: Proof Idea
\[ V_n(A, L) \le \sup_P \mathbb{E} \sup_{a \in A} \sum_t \left( \mathbb{E}[\ell_t'(a) \mid \ell_1, \ldots, \ell_n] - \ell_t(a) \right) \le \sup_P \mathbb{E} \sup_{a \in A} \sum_t \left( \ell_t'(a) - \ell_t(a) \right), \]
moving the supremum inside the expectation.

84 Sequential Rademacher Averages: Proof Idea
\[ V_n(A, L) \le \sup_P \mathbb{E} \sup_{a \in A} \sum_t \left( \ell_t'(a) - \ell_t(a) \right) = \sup_P \mathbb{E} \sup_{a \in A} \left( \sum_{t=1}^{n-1} \left( \ell_t'(a) - \ell_t(a) \right) + \epsilon_n \left( \ell_n'(a) - \ell_n(a) \right) \right), \]
for $\epsilon_n \in \{-1, 1\}$, since $\ell_n'$ has the same conditional distribution, given $\ell_1, \ldots, \ell_{n-1}$, as $\ell_n$.

85 Sequential Rademacher Averages: Proof Idea
\[ V_n(A, L) \le \sup_P \mathbb{E}_{\ell_1, \ldots, \ell_{n-1}} \mathbb{E}_{\ell_n, \ell_n'} \mathbb{E}_{\epsilon_n} \sup_{a \in A} \left( \sum_{t=1}^{n-1} \left( \ell_t'(a) - \ell_t(a) \right) + \epsilon_n \left( \ell_n'(a) - \ell_n(a) \right) \right) \le \sup_P \mathbb{E}_{\ell_1, \ldots, \ell_{n-1}} \sup_{\ell_n, \ell_n'} \mathbb{E}_{\epsilon_n} \sup_{a \in A} \left( \sum_{t=1}^{n-1} \left( \ell_t'(a) - \ell_t(a) \right) + \epsilon_n \left( \ell_n'(a) - \ell_n(a) \right) \right). \]

86 Sequential Rademacher Averages: Proof Idea
Repeating for rounds $n-1, \ldots, 1$:
\[ V_n(A, L) \le \sup_{\ell_1, \ell_1'} \mathbb{E}_{\epsilon_1} \cdots \sup_{\ell_n, \ell_n'} \mathbb{E}_{\epsilon_n} \sup_{a \in A} \sum_{t=1}^n \epsilon_t \left( \ell_t'(a) - \ell_t(a) \right). \]

87 Sequential Rademacher Averages: Proof Idea
\[ V_n(A, L) \le \sup_{\ell_1, \ell_1'} \mathbb{E}_{\epsilon_1} \cdots \sup_{\ell_n, \ell_n'} \mathbb{E}_{\epsilon_n} \sup_{a \in A} \sum_t \epsilon_t \left( \ell_t'(a) - \ell_t(a) \right) \le 2 \sup_{\ell_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{\ell_n} \mathbb{E}_{\epsilon_n} \sup_{a \in A} \sum_t \epsilon_t \, \ell_t(a), \]
since the two sums are identical in distribution ($\epsilon_t$ and $-\epsilon_t$ have the same distribution).

88 Optimal Regret and Sequential Rademacher Averages
Theorem.
\[ V_n(A, L) \le 2 \sup_{\ell_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{\ell_n} \mathbb{E}_{\epsilon_n} \sup_{a \in A} \sum_{t=1}^n \epsilon_t \, \ell_t(a), \]
where $\epsilon_1, \ldots, \epsilon_n$ are independent Rademacher (uniform $\pm 1$-valued) random variables. Compare to the bound involving Rademacher averages in the probabilistic setting:
\[ \text{excess risk} \le c \, \mathbb{E} \sup_{f \in F} \frac{1}{n} \sum_t \epsilon_t \, \ell(Y_t, f(X_t)). \]
In the adversarial case, the choice of $\ell_t$ is deterministic, and can depend on $\epsilon_1, \ldots, \epsilon_{t-1}$: sequential Rademacher averages.

89 Sequential Rademacher Averages: Example
Consider step functions on $\mathbb{R}$:
\[ f_a : x \mapsto 1[x \ge a], \qquad \ell_a(y, x) = 1[f_a(x) \ne y], \qquad L = \{a \mapsto 1[f_a(x) \ne y] : x \in \mathbb{R}, \; y \in \{0, 1\}\}. \]
Fix a distribution on $\mathbb{R} \times \{0, 1\}$, and consider the Rademacher averages,
\[ \mathbb{E} \sup_{a \in \mathbb{R}} \sum_t \epsilon_t \, \ell_a(Y_t, X_t). \]

90 Rademacher Averages: Example
For step functions on $\mathbb{R}$, the Rademacher averages are:
\[ \mathbb{E} \sup_{a \in \mathbb{R}} \sum_t \epsilon_t \, \ell_a(Y_t, X_t) = \mathbb{E} \sup_{a \in \mathbb{R}} \sum_t \epsilon_t \, \ell_a(1, X_t) \le \sup_{x_t} \mathbb{E} \sup_{a \in \mathbb{R}} \sum_t \epsilon_t \, 1[x_t < a] = \mathbb{E} \max_{0 \le i \le n+1} \sum_{t \le i} \epsilon_t = O(\sqrt{n}). \]

91 Sequential Rademacher Averages: Example
Consider the sequential Rademacher averages:
\[ \sup_{\ell_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{\ell_n} \mathbb{E}_{\epsilon_n} \sup_a \sum_t \epsilon_t \, \ell_t(a) = \sup_{x_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{x_n} \mathbb{E}_{\epsilon_n} \sup_a \sum_t \epsilon_t \, 1[x_t < a]. \]
If $\epsilon_t = 1$, we'd like to choose $a$ such that $x_t < a$. If $\epsilon_t = -1$, we'd like to choose $a$ such that $x_t \ge a$.

92 Sequential Rademacher Averages: Example
The sequential Rademacher averages are
\[ \sup_{x_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{x_n} \mathbb{E}_{\epsilon_n} \sup_a \sum_t \epsilon_t \, 1[x_t < a]. \]
We can choose $x_1 = 0$ and, for $t = 2, \ldots, n$,
\[ x_t = \sum_{i=1}^{t-1} 2^{-i} \epsilon_i = x_{t-1} + 2^{-(t-1)} \epsilon_{t-1}. \]
Then if we set $a = x_n + 2^{-n} \epsilon_n$, we have
\[ \epsilon_t \, 1[x_t < a] = \begin{cases} 1 & \text{if } \epsilon_t = 1, \\ 0 & \text{otherwise,} \end{cases} \]
which is maximal.
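A small simulation (my own) of this adversarial construction, confirming that the inner sum equals the number of rounds with $\epsilon_t = 1$:

```python
import numpy as np

# Simulate the adversarial choice of x_t: x_1 = 0,
# x_t = x_{t-1} + 2^{-(t-1)} eps_{t-1}, and a = x_n + 2^{-n} eps_n.
rng = np.random.default_rng(4)
n = 30
eps = rng.choice([-1, 1], size=n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = x[t - 1] + 2.0 ** (-t) * eps[t - 1]
a = x[-1] + 2.0 ** (-n) * eps[-1]
# eps_t * 1[x_t < a] is 1 exactly when eps_t = +1, so the sum is ~ n/2.
assert sum(e * (xt < a) for e, xt in zip(eps, x)) == (eps == 1).sum()
```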

93 Sequential Rademacher Averages: Example
So the sequential Rademacher averages are
\[ \sup_{\ell_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{\ell_n} \mathbb{E}_{\epsilon_n} \sup_a \sum_t \epsilon_t \, \ell_t(a) = \mathbb{E} \sum_{t=1}^n 1[\epsilon_t = 1] = \frac{n}{2}. \]
Compare with the Rademacher averages:
\[ \mathbb{E} \sup_{a \in \mathbb{R}} \sum_t \epsilon_t \, \ell_a(Y_t, X_t) = O(\sqrt{n}). \]

94 Optimal Regret Dual game. Rademacher averages and sequential Rademacher averages. Linear games.

95 Optimal Regret: Linear Games
Loss is $\ell(a) = \langle c, a \rangle$. Examples:
Online linear optimization: $L$ is a set of bounded linear functions on the bounded set $A \subseteq \mathbb{R}^d$.
Prediction with expert advice: $A = \Delta_m$, $L = [0,1]^m$.

96 Optimal Regret: Linear Games
Theorem. For the linear loss class $\{\ell(a) = \langle c, a \rangle : c \in C\}$, the regret satisfies
\[ V_n(A, L) \le 2 \sup_{\{Z_t\} \in M_{C-C}} \mathbb{E} \sup_{a \in A} \langle Z_n, a \rangle, \]
where $M_{C-C}$ is the set of martingales with differences in $C - C$. If $C$ is symmetric ($-C = C$), then
\[ V_n(A, L) \ge \sup_{\{Z_t\} \in M_C} \mathbb{E} \sup_{a \in A} \langle Z_n, a \rangle. \]

97 Linear Games: Proof Idea
The sequential Rademacher averages can be written
\[ \sup_{\ell_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{\ell_n} \mathbb{E}_{\epsilon_n} \sup_a \sum_t \epsilon_t \, \ell_t(a) = \sup_{c_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{c_n} \mathbb{E}_{\epsilon_n} \sup_a \left\langle \sum_t \epsilon_t c_t, a \right\rangle \le \sup_{\{Z_t\} \in M_{C-C}} \mathbb{E} \sup_a \langle Z_n, a \rangle, \]
where $M_{C-C}$ is the set of martingales with differences in $C - C$.

98 Linear Games: Proof Idea
For the lower bound, consider the duality result:
\[ V_n(A, L) = \sup_P \mathbb{E} \left( \sum_t \inf_{a_t \in A} \mathbb{E}[\langle c_t, a_t \rangle \mid c_1, \ldots, c_{t-1}] - \inf_{a \in A} \sum_t \langle c_t, a \rangle \right), \]
where the supremum is over joint distributions of sequences $c_1, \ldots, c_n$. If we restrict $P$ to the set $M_C$ of martingales with differences in $C = -C$, the first term is zero and we have
\[ V_n(A, L) \ge \sup_{\{Z_t\} \in M_C} \mathbb{E} \left( -\inf_{a \in A} \langle Z_n, a \rangle \right) = \sup_{\{Z_t\} \in M_C} \mathbb{E} \sup_{a \in A} \langle Z_n, a \rangle, \]
using the symmetry of the class (if $\{Z_t\} \in M_C$ then $\{-Z_t\} \in M_C$).

99 Linear Games: Convex Duality
[Sham Kakade, Karthik Sridharan and Ambuj Tewari, 2009]
Theorem. For the linear loss class $\{\ell(a) = \langle c, a \rangle : c \in C\}$, the regret satisfies
\[ V_n(A, L) \le 2 \sup_{\{Z_t\} \in M_{C-C}} \mathbb{E} \sup_{a \in A} \langle Z_n, a \rangle. \]
The linear criterion brings to mind a dual norm. Suppose we have a norm $\|\cdot\|$ defined on $C$. Then we can view $A$ as a subset of the dual space of linear functions on $C$, with the dual norm
\[ \|a\|_* = \sup\{\langle c, a \rangle : c \in C, \; \|c\| \le 1\}. \]

100 Linear Games: Convex Duality
We will measure the size of $L$ using $\sup_{c \in C} \|c\|$. We will measure the size of $A$ using $\sup_{a \in A} R(a)$, for a strongly convex $R$. We'll call a function $R : A \to \mathbb{R}$ $\sigma$-strongly convex wrt $\|\cdot\|_*$ if for all $a, b \in A$ and $\alpha \in [0,1]$,
\[ R(\alpha a + (1 - \alpha) b) \le \alpha R(a) + (1 - \alpha) R(b) - \frac{\sigma}{2} \alpha (1 - \alpha) \|a - b\|_*^2. \]

101 Linear Games: Convex Duality
The Legendre dual of $R$ is
\[ R^*(c) = \sup_{a \in A} \left( \langle c, a \rangle - R(a) \right). \]
If $\inf_{a \in A} R(a) = 0$, then $R^*(0) = 0$. If $R$ is $\sigma$-strongly convex, then $R^*$ is differentiable and $\sigma$-smooth wrt $\|\cdot\|$, that is, for all $c, d \in C$,
\[ R^*(c + d) \le R^*(c) + \langle \nabla R^*(c), d \rangle + \frac{1}{2\sigma} \|d\|^2. \]

102 Linear Games: Convex Duality
Theorem. For the linear loss class $\{\ell(a) = \langle c, a \rangle : c \in C\}$, if $R : A \to \mathbb{R}$ is $\sigma$-strongly convex and satisfies $\inf_{a \in A} R(a) = 0$,
\[ \sup_{c \in C} \|c\| = 1, \qquad \sup_{a \in A} R(a) = A^2, \]
then the regret satisfies
\[ V_n(A, L) \le 2 \sup_{\{Z_t\} \in M_{C-C}} \mathbb{E} \sup_{a \in A} \langle Z_n, a \rangle \le 2A \sqrt{\frac{2n}{\sigma}}. \]

103 Linear Games: Proof Idea
The definition of the Legendre dual of $R$ is
\[ R^*(c) = \sup_{a \in A} \left( \langle c, a \rangle - R(a) \right). \]
From the definition, for any $\lambda > 0$,
\[ \mathbb{E} \sup_{a \in A} \langle a, Z_n \rangle \le \mathbb{E} \sup_{a \in A} \frac{1}{\lambda} \left( R(a) + R^*(\lambda Z_n) \right) \le \frac{A^2}{\lambda} + \frac{\mathbb{E} R^*(\lambda Z_n)}{\lambda}. \]

104 Linear Games: Proof Idea
Consider the evolution of $R^*(\lambda Z_t)$. Because $R^*$ is $\sigma$-smooth,
\[ \mathbb{E}[R^*(\lambda Z_t) \mid Z_1, \ldots, Z_{t-1}] \le R^*(\lambda Z_{t-1}) + \mathbb{E}[\langle \nabla R^*(\lambda Z_{t-1}), \lambda(Z_t - Z_{t-1}) \rangle \mid Z_1, \ldots, Z_{t-1}] + \frac{1}{2\sigma} \mathbb{E}[\lambda^2 \|Z_t - Z_{t-1}\|^2 \mid Z_1, \ldots, Z_{t-1}] \le R^*(\lambda Z_{t-1}) + \frac{\lambda^2}{2\sigma}, \]
since the middle term vanishes by the martingale property. Thus $\mathbb{E} R^*(\lambda Z_n) \le n\lambda^2 / (2\sigma)$.

105 Linear Games: Proof Idea
\[ \mathbb{E} \sup_{a \in A} \langle a, Z_n \rangle \le \frac{A^2}{\lambda} + \frac{\mathbb{E} R^*(\lambda Z_n)}{\lambda} \le \frac{A^2}{\lambda} + \frac{n\lambda}{2\sigma} = A \sqrt{\frac{2n}{\sigma}} \]
for $\lambda = \sqrt{2 A^2 \sigma / n}$.

106 Linear Games: Convex Duality
Theorem. For the linear loss class $\{\ell(a) = \langle c, a \rangle : c \in C\}$, if $R : A \to \mathbb{R}$ is $\sigma$-strongly convex and satisfies $\inf_{a \in A} R(a) = 0$,
\[ \sup_{c \in C} \|c\| = 1, \qquad \sup_{a \in A} R(a) = A^2, \]
then the regret satisfies
\[ V_n(A, L) \le 2A \sqrt{\frac{2n}{\sigma}}. \]

107 Linear Games: Examples
The $p$- and $q$-norms (with $1/p + 1/q = 1$) on $C = B_p(1)$ and $A = B_q(A)$:
\[ \|c\| = \|c\|_p, \qquad \|a\|_* = \|a\|_q, \qquad R(a) = \frac{1}{2} \|a\|_q^2 \le A^2, \qquad \sigma = 2(q-1). \]
Here, $V_n(A, L) \le 2A \sqrt{(p-1)n}$.

108 Linear Games: Examples
The $\infty$- and 1-norms on $C = [-1, 1]^d$ and $A = \Delta_d$:
\[ \|c\| = \|c\|_\infty, \qquad \|a\|_* = \|a\|_1, \qquad R(a) = \log d + \sum_i a_i \log a_i \le \log d, \qquad \sigma = 1. \]
Here, $V_n(A, L) \le \sqrt{2n \log d}$.
(The bounds in these examples are tight within small constant factors.)

109 Optimal Regret: Lower Bounds
[Sasha Rakhlin, Karthik Sridharan and Ambuj Tewari, 2010]
For the case of prediction with absolute loss, $\ell_t(a_t) = |y_t - a_t(x_t)|$, there are (almost) corresponding lower bounds:
\[ c_1 \frac{R_n(A)}{\log^{3/2} n} \le V_n \le c_2 R_n(A), \]
where
\[ R_n(A) = \sup_{x_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{x_n} \mathbb{E}_{\epsilon_n} \sup_{a \in A} \sum_t \epsilon_t \, a(x_t). \]

110 Optimal Regret: Structural Results
\[ R_n(A) = \sup_{x_1} \mathbb{E}_{\epsilon_1} \cdots \sup_{x_n} \mathbb{E}_{\epsilon_n} \sup_{a \in A} \sum_t \epsilon_t \, a(x_t). \]
It is straightforward to verify that the following properties extend to these sequential Rademacher averages:
$A \subseteq B$ implies $R_n(A) \le R_n(B)$.
$R_n(\text{co}(A)) = R_n(A)$.
$R_n(cA) = |c| \, R_n(A)$.
$R_n(\phi \circ A) \le \|\phi\|_{\text{Lip}} \, R_n(A)$.
$R_n(A + b) = R_n(A)$.

111 Optimal Regret: Key Points
Dual game: the adversary chooses a joint distribution to maximize the difference between the minimal conditional expected loss and the minimal empirical loss.
Upper bound in terms of sequential Rademacher averages.
Linear games: bound on a martingale using a strongly convex function.

112 Synopsis
A finite comparison class: $A = \{1, \ldots, m\}$.
Online, adversarial versus batch, probabilistic.
Optimal regret.
Online convex optimization:
1. Problem formulation.
2. Empirical minimization fails.
3. Gradient algorithm.
4. Regularized minimization.
5. Regret bounds.

113 Online Convex Optimization
1. Problem formulation.
2. Empirical minimization fails.
3. Gradient algorithm.
4. Regularized minimization: Bregman divergence; regularized minimization equivalent to minimizing the latest loss plus the divergence from the previous decision; constrained minimization equivalent to unconstrained plus Bregman projection; linearization; mirror descent.
5. Regret bounds: unconstrained minimization; seeing the future; strong convexity; examples (gradient, exponentiated gradient); extensions.

114 Online Convex Optimization
$A$ = a convex subset of $\mathbb{R}^d$. $L$ = a set of convex real functions on $A$. For example:
$\ell_t(a) = (x_t \cdot a - y_t)^2$.
$\ell_t(a) = |x_t \cdot a - y_t|$.
$\ell_t(a) = -\log p_a(y_t)$ with $p_a(y) = \exp(a \cdot T(y) - A(a))$, for $A(a)$ the log normalization of this exponential family, with sufficient statistic $T(y)$.

115 Online Convex Optimization: Example
Choosing $a_t$ to minimize past losses, $a_t = \arg\min_{a \in A} \sum_{s=1}^{t-1} \ell_s(a)$, can fail. ("fictitious play", "follow the leader") Suppose $A = [-1, 1]$, $L = \{a \mapsto v \cdot a : |v| \le 1\}$. Consider the following sequence of losses (played out numerically in the sketch below):
\[ a_1 = 0, \; \ell_1(a) = \tfrac{1}{2} a; \quad a_2 = -1, \; \ell_2(a) = -a; \quad a_3 = 1, \; \ell_3(a) = a; \quad a_4 = -1, \; \ell_4(a) = -a; \quad a_5 = 1, \; \ell_5(a) = a; \; \ldots \]
$a = 0$ shows $L_n^* \le 0$, but $\hat L_n = n - 1$.
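The following Python sketch (my own) plays out this example:

```python
import numpy as np

# Follow the leader on A = [-1, 1] with l_t(a) = v_t * a,
# v = (1/2, -1, +1, -1, +1, ...): total loss n - 1, while a = 0 incurs 0.
n = 100
v = np.array([0.5] + [(-1.0) ** t for t in range(1, n)])
a, total = 0.0, 0.0                       # a_1 = 0
for t in range(n):
    total += v[t] * a                     # incur l_t(a_t)
    cum = v[: t + 1].sum()                # leader minimizes cum * a over [-1, 1]
    a = -1.0 if cum > 0 else (1.0 if cum < 0 else 0.0)
print(total)                              # n - 1 = 99
```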

116 Online Convex Optimization: Example Choosing a t to minimize past losses can fail. The strategy must avoid overfitting, just as in probabilistic settings. Similar approaches (regularization; Bayesian inference) are applicable in the online setting. First approach: gradient steps. Stay close to previous decisions, but move in a direction of improvement.

117 Online Convex Optimization: Gradient Method
\[ a_1 \in A, \qquad a_{t+1} = \Pi_A \left( a_t - \eta \nabla \ell_t(a_t) \right), \]
where $\Pi_A$ is the Euclidean projection on $A$, $\Pi_A(x) = \arg\min_{a \in A} \|x - a\|$.
Theorem. For $G = \max_t \|\nabla \ell_t(a_t)\|$ and $D = \text{diam}(A)$, the gradient strategy with $\eta = D / (G \sqrt{n})$ has regret satisfying
\[ \hat L_n - L_n^* \le G D \sqrt{n}. \]
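A minimal implementation sketch of the projected gradient strategy (function names and the linear-loss test are my own; the final check uses the theorem's bound):

```python
import numpy as np

def ogd(grad, project, a1, eta, n):
    """Projected online gradient descent:
    a_{t+1} = Pi_A(a_t - eta * grad l_t(a_t))."""
    a, actions = a1.copy(), []
    for t in range(n):
        actions.append(a.copy())
        a = project(a - eta * grad(t, a))
    return actions

# Linear losses l_t(a) = <v_t, a> on the unit Euclidean ball: G = 1, D = 2.
rng = np.random.default_rng(5)
d, n = 5, 1000
V = rng.standard_normal((n, d))
V /= np.maximum(1.0, np.linalg.norm(V, axis=1, keepdims=True))
project = lambda x: x / max(1.0, np.linalg.norm(x))
G, D = 1.0, 2.0
acts = ogd(lambda t, a: V[t], project, np.zeros(d), D / (G * np.sqrt(n)), n)
best = -np.linalg.norm(V.sum(axis=0))     # min over the ball of <sum v_t, a>
regret = sum(V[t] @ acts[t] for t in range(n)) - best
assert regret <= G * D * np.sqrt(n)
```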

118 Online Convex Optimization: Gradient Method
Theorem. For $G = \max_t \|\nabla \ell_t(a_t)\|$ and $D = \text{diam}(A)$, the gradient strategy with $\eta = D / (G \sqrt{n})$ has regret satisfying $\hat L_n - L_n^* \le G D \sqrt{n}$.
Example. $A = \{a \in \mathbb{R}^d : \|a\| \le 1\}$, $L = \{a \mapsto v \cdot a : \|v\| \le 1\}$. $D = 2$, $G \le 1$. Regret is no more than $2\sqrt{n}$. (And $O(\sqrt{n})$ is optimal.)

119 Online Convex Optimization: Gradient Method
Theorem. For $G = \max_t \|\nabla \ell_t(a_t)\|$ and $D = \text{diam}(A)$, the gradient strategy with $\eta = D / (G \sqrt{n})$ has regret satisfying $\hat L_n - L_n^* \le G D \sqrt{n}$.
Example. $A = \Delta_m$, $L = \{a \mapsto v \cdot a : \|v\|_\infty \le 1\}$. $D = \sqrt{2}$, $G \le \sqrt{m}$. Regret is no more than $\sqrt{2mn}$. Since competing with the whole simplex is equivalent to competing with the vertices (experts) for linear losses, this is worse than exponential weights ($\sqrt{m}$ versus $\sqrt{\log m}$).

120 Online Convex Optimization: Gradient Method
Proof. Define $\tilde a_{t+1} = a_t - \eta \nabla \ell_t(a_t)$ and $a_{t+1} = \Pi_A(\tilde a_{t+1})$. Fix $a \in A$ and consider the measure of progress $\|a_t - a\|$:
\[ \|a_{t+1} - a\|^2 \le \|\tilde a_{t+1} - a\|^2 = \|a_t - a\|^2 + \eta^2 \|\nabla \ell_t(a_t)\|^2 - 2\eta \nabla \ell_t(a_t) \cdot (a_t - a). \]
By convexity, $\ell_t(a_t) - \ell_t(a) \le \nabla \ell_t(a_t) \cdot (a_t - a)$, so
\[ \sum_{t=1}^n \left( \ell_t(a_t) - \ell_t(a) \right) \le \frac{\|a_1 - a\|^2 - \|a_{n+1} - a\|^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^n \|\nabla \ell_t(a_t)\|^2 \le \frac{D^2}{2\eta} + \frac{\eta n G^2}{2} = G D \sqrt{n}. \]

121 Online Convex Optimization
1. Problem formulation.
2. Empirical minimization fails.
3. Gradient algorithm.
4. Regularized minimization: Bregman divergence; regularized minimization equivalent to minimizing the latest loss plus the divergence from the previous decision; constrained minimization equivalent to unconstrained plus Bregman projection; linearization; mirror descent.
5. Regret bounds: unconstrained minimization; seeing the future; strong convexity; examples (gradient, exponentiated gradient); extensions.

122 Online Convex Optimization: A Regularization Viewpoint
Suppose $\ell_t$ is linear: $\ell_t(a) = g_t \cdot a$. Suppose $A = \mathbb{R}^d$. Then minimizing the regularized criterion
\[ a_{t+1} = \arg\min_{a \in A} \left( \eta \sum_{s=1}^t \ell_s(a) + \frac{1}{2} \|a\|^2 \right) \]
corresponds to the gradient step
\[ a_{t+1} = a_t - \eta \nabla \ell_t(a_t). \]

123 Online Convex Optimization: Regularization
Regularized minimization. Consider the family of strategies of the form:
\[ a_{t+1} = \arg\min_{a \in A} \left( \eta \sum_{s=1}^t \ell_s(a) + R(a) \right). \]
The regularizer $R : \mathbb{R}^d \to \mathbb{R}$ is strictly convex and differentiable.

124 Online Convex Optimization: Regularization
Regularized minimization:
\[ a_{t+1} = \arg\min_{a \in A} \left( \eta \sum_{s=1}^t \ell_s(a) + R(a) \right). \]
$R$ keeps the sequence of $a_t$'s stable: it diminishes $\ell_t$'s influence. We can view the choice of $a_{t+1}$ as trading off two competing forces: making $\ell_t(a_{t+1})$ small, and keeping $a_{t+1}$ close to $a_t$. This is a perspective that motivated many algorithms in the literature. We'll investigate why regularized minimization can be viewed this way.

125 Properties of Regularization Methods
In the unconstrained case ($A = \mathbb{R}^d$), regularized minimization is equivalent to minimizing the latest loss and the distance to the previous decision. The appropriate notion of distance is the Bregman divergence $D_{\Phi_{t-1}}$. Define
\[ \Phi_0 = R, \qquad \Phi_t = \Phi_{t-1} + \eta \ell_t, \]
so that
\[ a_{t+1} = \arg\min_{a \in A} \left( \eta \sum_{s=1}^t \ell_s(a) + R(a) \right) = \arg\min_{a \in A} \Phi_t(a). \]

126 Bregman Divergence
Definition. For a strictly convex, differentiable $\Phi : \mathbb{R}^d \to \mathbb{R}$, the Bregman divergence wrt $\Phi$ is defined, for $a, b \in \mathbb{R}^d$, as
\[ D_\Phi(a, b) = \Phi(a) - \left( \Phi(b) + \nabla \Phi(b) \cdot (a - b) \right). \]
$D_\Phi(a, b)$ is the difference between $\Phi(a)$ and the value at $a$ of the linear approximation of $\Phi$ about $b$.

127 Bregman Divergence
\[ D_\Phi(a, b) = \Phi(a) - \left( \Phi(b) + \nabla \Phi(b) \cdot (a - b) \right). \]
Example. For $a \in \mathbb{R}^d$, the squared Euclidean norm, $\Phi(a) = \frac{1}{2} \|a\|^2$, has
\[ D_\Phi(a, b) = \frac{1}{2} \|a\|^2 - \left( \frac{1}{2} \|b\|^2 + b \cdot (a - b) \right) = \frac{1}{2} \|a - b\|^2, \]
the squared Euclidean norm.

128 Bregman Divergence
\[ D_\Phi(a, b) = \Phi(a) - \left( \Phi(b) + \nabla \Phi(b) \cdot (a - b) \right). \]
Example. For $a \in [0, \infty)^d$, the unnormalized negative entropy, $\Phi(a) = \sum_{i=1}^d a_i (\ln a_i - 1)$, has
\[ D_\Phi(a, b) = \sum_i \left( a_i (\ln a_i - 1) - b_i (\ln b_i - 1) - \ln b_i \, (a_i - b_i) \right) = \sum_i \left( a_i \ln \frac{a_i}{b_i} + b_i - a_i \right), \]
the unnormalized KL divergence. Thus, for $a \in \Delta_d$, $\Phi(a) = \sum_i a_i \ln a_i$ has $D_\Phi(a, b) = \sum_i a_i \ln \frac{a_i}{b_i}$.
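For concreteness, a short Python sketch (my own) of these two divergences:

```python
import numpy as np

def bregman_sq(a, b):
    """D_Phi for Phi(a) = (1/2)||a||^2: half the squared Euclidean distance."""
    return 0.5 * np.sum((a - b) ** 2)

def bregman_negent(a, b):
    """D_Phi for Phi(a) = sum_i a_i (ln a_i - 1): unnormalized KL divergence."""
    return np.sum(a * np.log(a / b) + b - a)

a = np.array([0.2, 0.3, 0.5])
b = np.full(3, 1.0 / 3)
print(bregman_sq(a, b), bregman_negent(a, b))
# On the simplex, the b - a terms cancel, leaving the usual KL divergence.
```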

129 Bregman Divergence
When the domain of $\Phi$ is $A \subseteq \mathbb{R}^d$, in addition to differentiability and strict convexity, we make two more assumptions: the interior of $A$ is convex, and for a sequence approaching the boundary of $A$, $\|\nabla \Phi(a_n)\| \to \infty$. We say that such a $\Phi$ is a Legendre function.

130 Bregman Divergence
Properties:
1. $D_\Phi \ge 0$, $D_\Phi(a, a) = 0$.
2. $D_{A+B} = D_A + D_B$.
3. The Bregman projection, $\Pi_A^\Phi(b) = \arg\min_{a \in A} D_\Phi(a, b)$, is uniquely defined for closed, convex $A$.
4. Generalized Pythagoras: for closed, convex $A$, $a^* = \Pi_A^\Phi(b)$, and $a \in A$,
\[ D_\Phi(a, b) \ge D_\Phi(a, a^*) + D_\Phi(a^*, b). \]
5. $\nabla_a D_\Phi(a, b) = \nabla \Phi(a) - \nabla \Phi(b)$.
6. For $\ell$ linear, $D_{\Phi + \ell} = D_\Phi$.
7. For $\Phi^*$ the Legendre dual of $\Phi$, $\nabla \Phi^* = (\nabla \Phi)^{-1}$ and $D_\Phi(a, b) = D_{\Phi^*}(\nabla \Phi(b), \nabla \Phi(a))$.

131 Legendre Dual
For a Legendre function $\Phi : A \to \mathbb{R}$, the Legendre dual is
\[ \Phi^*(u) = \sup_{v \in A} \left( u \cdot v - \Phi(v) \right). \]
$\Phi^*$ is Legendre. $\text{dom}(\Phi^*) = \nabla \Phi(\text{int dom} \, \Phi)$. $\nabla \Phi^* = (\nabla \Phi)^{-1}$. $D_\Phi(a, b) = D_{\Phi^*}(\nabla \Phi(b), \nabla \Phi(a))$. $\Phi^{**} = \Phi$.

132 Legendre Dual
Example. For $\Phi = \frac{1}{2} \|\cdot\|_p^2$, the Legendre dual is $\Phi^* = \frac{1}{2} \|\cdot\|_q^2$, where $1/p + 1/q = 1$.
Example. For $\Phi(a) = \sum_{i=1}^d e^{a_i}$, so that $\nabla \Phi(a) = (e^{a_1}, \ldots, e^{a_d})$,
\[ (\nabla \Phi)^{-1}(u) = \nabla \Phi^*(u) = (\ln u_1, \ldots, \ln u_d), \qquad \Phi^*(u) = \sum_i u_i (\ln u_i - 1). \]

133 Online Convex Optimization
1. Problem formulation.
2. Empirical minimization fails.
3. Gradient algorithm.
4. Regularized minimization: Bregman divergence; regularized minimization equivalent to minimizing the latest loss plus the divergence from the previous decision; constrained minimization equivalent to unconstrained plus Bregman projection; linearization; mirror descent.
5. Regret bounds: unconstrained minimization; seeing the future; strong convexity; examples (gradient, exponentiated gradient); extensions.

134 Properties of Regularization Methods
In the unconstrained case ($A = \mathbb{R}^d$), regularized minimization is equivalent to minimizing the latest loss and the distance (Bregman divergence) to the previous decision.
Theorem. Define $\tilde a_1$ via $\nabla R(\tilde a_1) = 0$, and set
\[ \tilde a_{t+1} = \arg\min_{a \in \mathbb{R}^d} \left( \eta \ell_t(a) + D_{\Phi_{t-1}}(a, \tilde a_t) \right). \]
Then
\[ \tilde a_{t+1} = \arg\min_{a \in \mathbb{R}^d} \left( \eta \sum_{s=1}^t \ell_s(a) + R(a) \right). \]

135 Properties of Regularization Methods
Proof. By the definition of $\Phi_t$,
\[ \eta \ell_t(a) + D_{\Phi_{t-1}}(a, \tilde a_t) = \Phi_t(a) - \Phi_{t-1}(a) + D_{\Phi_{t-1}}(a, \tilde a_t). \]
The derivative wrt $a$ is
\[ \nabla \Phi_t(a) - \nabla \Phi_{t-1}(a) + \nabla_a D_{\Phi_{t-1}}(a, \tilde a_t) = \nabla \Phi_t(a) - \nabla \Phi_{t-1}(a) + \nabla \Phi_{t-1}(a) - \nabla \Phi_{t-1}(\tilde a_t). \]
Setting this to zero shows that
\[ \nabla \Phi_t(\tilde a_{t+1}) = \nabla \Phi_{t-1}(\tilde a_t) = \cdots = \nabla \Phi_0(\tilde a_1) = \nabla R(\tilde a_1) = 0, \]
so $\tilde a_{t+1}$ minimizes $\Phi_t$.

136 Properties of Regularization Methods
Constrained minimization is equivalent to unconstrained minimization, followed by Bregman projection:
Theorem. For
\[ a_{t+1} = \arg\min_{a \in A} \Phi_t(a), \qquad \tilde a_{t+1} = \arg\min_{a \in \mathbb{R}^d} \Phi_t(a), \]
we have $a_{t+1} = \Pi_A^{\Phi_t}(\tilde a_{t+1})$.

137 Properties of Regularization Methods
Proof. Let $a_{t+1}'$ denote $\Pi_A^{\Phi_t}(\tilde a_{t+1})$. First, by the definition of $a_{t+1}$,
\[ \Phi_t(a_{t+1}) \le \Phi_t(a_{t+1}'). \]
Conversely,
\[ D_{\Phi_t}(a_{t+1}', \tilde a_{t+1}) \le D_{\Phi_t}(a_{t+1}, \tilde a_{t+1}). \]
But $\nabla \Phi_t(\tilde a_{t+1}) = 0$, so
\[ D_{\Phi_t}(a, \tilde a_{t+1}) = \Phi_t(a) - \Phi_t(\tilde a_{t+1}), \]
and thus $\Phi_t(a_{t+1}') \le \Phi_t(a_{t+1})$.

138 Properties of Regularization Methods
Example. For linear $\ell_t$, regularized minimization is equivalent to minimizing the last loss plus the Bregman divergence wrt $R$ to the previous decision:
\[ \arg\min_{a \in A} \left( \eta \sum_{s=1}^t \ell_s(a) + R(a) \right) = \Pi_A^R \left( \arg\min_{a \in \mathbb{R}^d} \left( \eta \ell_t(a) + D_R(a, \tilde a_t) \right) \right), \]
because adding a linear function to $\Phi$ does not change $D_\Phi$.

139 Online Convex Optimization
1. Problem formulation.
2. Empirical minimization fails.
3. Gradient algorithm.
4. Regularized minimization: Bregman divergence; regularized minimization equivalent to minimizing the latest loss plus the divergence from the previous decision; constrained minimization equivalent to unconstrained plus Bregman projection; linearization; mirror descent.
5. Regret bounds: unconstrained minimization; seeing the future; strong convexity; examples (gradient, exponentiated gradient); extensions.

140 Properties of Regularization Methods: Linear Loss
We can replace $\ell_t$ by $a \mapsto \nabla \ell_t(a_t) \cdot a$, and this leads to an upper bound on regret.
Theorem. Any strategy for online linear optimization, with regret satisfying
\[ \sum_{t=1}^n g_t \cdot a_t - \min_{a \in A} \sum_{t=1}^n g_t \cdot a \le C_n(g_1, \ldots, g_n), \]
can be used to construct a strategy for online convex optimization, with regret
\[ \sum_{t=1}^n \ell_t(a_t) - \min_{a \in A} \sum_{t=1}^n \ell_t(a) \le C_n(\nabla \ell_1(a_1), \ldots, \nabla \ell_n(a_n)). \]
Proof. Convexity implies $\ell_t(a_t) - \ell_t(a) \le \nabla \ell_t(a_t) \cdot (a_t - a)$.

141 Properties of Regularization Methods: Linear Loss
Key point: we can replace $\ell_t$ by $a \mapsto \nabla \ell_t(a_t) \cdot a$, and this leads to an upper bound on regret. Thus, we can work with linear $\ell_t$.

142 Regularization Methods: Mirror Descent
Regularized minimization for linear losses can be viewed as mirror descent, taking a gradient step in a dual space:
Theorem. The decisions
\[ \tilde a_{t+1} = \arg\min_{a \in \mathbb{R}^d} \left( \eta \sum_{s=1}^t g_s \cdot a + R(a) \right) \]
can be written
\[ \tilde a_{t+1} = (\nabla R)^{-1} \left( \nabla R(\tilde a_t) - \eta g_t \right). \]
This corresponds to first mapping from $\tilde a_t$ through $\nabla R$, then taking a step in the direction $-g_t$, then mapping back through $(\nabla R)^{-1} = \nabla R^*$ to $\tilde a_{t+1}$.

143 Regularization Methods: Mirror Descent
Proof. For the unconstrained minimization, we have
\[ \nabla R(\tilde a_{t+1}) = -\eta \sum_{s=1}^t g_s, \qquad \nabla R(\tilde a_t) = -\eta \sum_{s=1}^{t-1} g_s, \]
so $\nabla R(\tilde a_{t+1}) = \nabla R(\tilde a_t) - \eta g_t$, which can be written
\[ \tilde a_{t+1} = (\nabla R)^{-1} \left( \nabla R(\tilde a_t) - \eta g_t \right). \]
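A one-step sketch of the mirror descent update (my own framing; `grad_R` and `grad_R_star` are assumed to be the maps $\nabla R$ and $\nabla R^* = (\nabla R)^{-1}$):

```python
import numpy as np

def mirror_descent_step(a, g, eta, grad_R, grad_R_star):
    """One unconstrained mirror descent step: map to the dual space via
    grad R, take a gradient step, and map back via grad R* = (grad R)^{-1}."""
    return grad_R_star(grad_R(a) - eta * g)

# With R(a) = (1/2)||a||^2 both maps are the identity, and the step
# reduces to an ordinary gradient step a - eta * g.
a, g = np.array([1.0, -2.0]), np.array([0.5, 0.5])
print(mirror_descent_step(a, g, 0.1, lambda x: x, lambda u: u))
```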

144 Online Convex Optimization
1. Problem formulation.
2. Empirical minimization fails.
3. Gradient algorithm.
4. Regularized minimization and Bregman divergences.
5. Regret bounds: unconstrained minimization; seeing the future; strong convexity; examples (gradient, exponentiated gradient); extensions.

145 Online Convex Optimization: Regularization
Regularized minimization:
\[ a_{t+1} = \arg\min_{a \in A} \left( \eta \sum_{s=1}^t \ell_s(a) + R(a) \right). \]
The regularizer $R : \mathbb{R}^d \to \mathbb{R}$ is strictly convex and differentiable.

146 Regularization Methods: Regret
Theorem. For $A = \mathbb{R}^d$, regularized minimization suffers regret against any $a \in A$ of
\[ \sum_{t=1}^n \ell_t(a_t) - \sum_{t=1}^n \ell_t(a) = \frac{D_R(a, a_1) - D_{\Phi_n}(a, a_{n+1})}{\eta} + \frac{1}{\eta} \sum_{t=1}^n D_{\Phi_t}(a_t, a_{t+1}), \]
and thus
\[ \hat L_n \le \inf_{a \in \mathbb{R}^d} \left( \sum_{t=1}^n \ell_t(a) + \frac{D_R(a, a_1)}{\eta} \right) + \frac{1}{\eta} \sum_{t=1}^n D_{\Phi_t}(a_t, a_{t+1}). \]
So the sizes of the steps $D_{\Phi_t}(a_t, a_{t+1})$ determine the regret bound.

147 Regularization Methods: Regret
Theorem. For $A = \mathbb{R}^d$, regularized minimization suffers regret
\[ \hat L_n \le \inf_{a \in \mathbb{R}^d} \left( \sum_{t=1}^n \ell_t(a) + \frac{D_R(a, a_1)}{\eta} \right) + \frac{1}{\eta} \sum_{t=1}^n D_{\Phi_t}(a_t, a_{t+1}). \]
Notice that we can write
\[ D_{\Phi_t}(a_t, a_{t+1}) = D_{\Phi_t^*}(\nabla \Phi_t(a_{t+1}), \nabla \Phi_t(a_t)) = D_{\Phi_t^*}(0, \nabla \Phi_{t-1}(a_t) + \eta \nabla \ell_t(a_t)) = D_{\Phi_t^*}(0, \eta \nabla \ell_t(a_t)). \]
So it is the size of the gradient steps, $D_{\Phi_t^*}(0, \eta \nabla \ell_t(a_t))$, that determines the regret.

148 Regularization Methods: Regret Bounds
Example. Suppose $R = \frac{1}{2} \|\cdot\|^2$. Then we have
\[ \hat L_n \le L_n^* + \frac{\|a^* - a_1\|^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^n \|g_t\|^2. \]
And if $\|g_t\| \le G$ and $\|a^* - a_1\| \le D$, choosing $\eta$ appropriately gives
\[ \hat L_n \le L_n^* + D G \sqrt{n}. \]

149 Online Convex Optimization
1. Problem formulation.
2. Empirical minimization fails.
3. Gradient algorithm.
4. Regularized minimization and Bregman divergences.
5. Regret bounds: unconstrained minimization; seeing the future; strong convexity; examples (gradient, exponentiated gradient); extensions.

150 Regularization Methods: Regret Bounds
Seeing the future gives small regret:
Theorem. For regularized minimization, that is, for
\[ a_{t+1} = \arg\min_{a \in A} \left( \eta \sum_{s=1}^t \ell_s(a) + R(a) \right), \]
we have, for all $a \in A$,
\[ \sum_{t=1}^n \ell_t(a_{t+1}) \le \sum_{t=1}^n \ell_t(a) + \frac{1}{\eta} \left( R(a) - R(a_1) \right). \]

151 Regularization Methods: Regret Bounds
Proof. Since $a_{t+1}$ minimizes $\Phi_t$,
\[ \eta \sum_{s=1}^t \ell_s(a) + R(a) \ge \eta \sum_{s=1}^t \ell_s(a_{t+1}) + R(a_{t+1}) = \eta \ell_t(a_{t+1}) + \eta \sum_{s=1}^{t-1} \ell_s(a_{t+1}) + R(a_{t+1}) \ge \eta \ell_t(a_{t+1}) + \eta \sum_{s=1}^{t-1} \ell_s(a_t) + R(a_t) \ge \cdots \ge \eta \sum_{s=1}^t \ell_s(a_{s+1}) + R(a_1). \]
Applying this with $t = n$ and rearranging gives the result.

152 Regularization Methods: Regret Bounds
Theorem. For all $a \in A$,
\[ \sum_{t=1}^n \ell_t(a_{t+1}) \le \sum_{t=1}^n \ell_t(a) + \frac{1}{\eta} \left( R(a) - R(a_1) \right). \]
Thus, if $a_t$ and $a_{t+1}$ are close, then the regret is small:
Corollary. For all $a \in A$,
\[ \sum_{t=1}^n \left( \ell_t(a_t) - \ell_t(a) \right) \le \sum_{t=1}^n \left( \ell_t(a_t) - \ell_t(a_{t+1}) \right) + \frac{1}{\eta} \left( R(a) - R(a_1) \right). \]
So how can we control the increments $\ell_t(a_t) - \ell_t(a_{t+1})$?

153 Online Convex Optimization
1. Problem formulation.
2. Empirical minimization fails.
3. Gradient algorithm.
4. Regularized minimization: Bregman divergence; regularized minimization equivalent to minimizing the latest loss plus the divergence from the previous decision; constrained minimization equivalent to unconstrained plus Bregman projection; linearization; mirror descent.
5. Regret bounds: unconstrained minimization; seeing the future; strong convexity; examples (gradient, exponentiated gradient); extensions.

154 Regularization Methods: Regret Bounds
Definition. We say $R$ is strongly convex wrt a norm $\|\cdot\|$ if, for all $a, b$,
\[ R(a) \ge R(b) + \nabla R(b) \cdot (a - b) + \frac{1}{2} \|a - b\|^2. \]
For linear losses and strongly convex regularizers, the dual norm of the gradient is small:
Theorem. If $R$ is strongly convex wrt a norm $\|\cdot\|$, and $\ell_t(a) = g_t \cdot a$, then
\[ \|a_t - a_{t+1}\| \le \eta \|g_t\|_*, \]
where $\|\cdot\|_*$ is the dual norm to $\|\cdot\|$: $\|v\|_* = \sup\{v \cdot a : a \in A, \|a\| \le 1\}$.

155 Regularization Methods: Regret Bounds
Proof.
\[ R(a_t) \ge R(a_{t+1}) + \nabla R(a_{t+1}) \cdot (a_t - a_{t+1}) + \frac{1}{2} \|a_t - a_{t+1}\|^2, \]
\[ R(a_{t+1}) \ge R(a_t) + \nabla R(a_t) \cdot (a_{t+1} - a_t) + \frac{1}{2} \|a_t - a_{t+1}\|^2. \]
Combining,
\[ \|a_t - a_{t+1}\|^2 \le \left( \nabla R(a_t) - \nabla R(a_{t+1}) \right) \cdot (a_t - a_{t+1}) \le \|\nabla R(a_t) - \nabla R(a_{t+1})\|_* \, \|a_t - a_{t+1}\|. \]
Hence,
\[ \|a_t - a_{t+1}\| \le \|\nabla R(a_t) - \nabla R(a_{t+1})\|_* = \eta \|g_t\|_*. \]

156 Regularization Methods: Regret Bounds
This leads to the regret bound:
Corollary. For linear losses, if $R$ is strongly convex wrt $\|\cdot\|$, then for all $a \in A$,
\[ \sum_{t=1}^n \left( \ell_t(a_t) - \ell_t(a) \right) \le \eta \sum_{t=1}^n \|g_t\|_*^2 + \frac{1}{\eta} \left( R(a) - R(a_1) \right). \]
Thus, for $\|g_t\|_* \le G$ and $R(a) - R(a_1) \le D^2$, choosing $\eta$ appropriately gives regret no more than $2 G D \sqrt{n}$.

157 Regularization Methods: Regret Bounds
Example. Consider $R(a) = \frac{1}{2} \|a\|^2$, $a_1 = 0$, and $A$ contained in a Euclidean ball of diameter $D$. Then $R$ is strongly convex wrt $\|\cdot\|$, and $\|\cdot\|_* = \|\cdot\|$. And the mapping between primal and dual spaces is the identity. So if $\sup_{a \in A} \|\nabla \ell_t(a)\| \le G$, then the regret is no more than $2 G D \sqrt{n}$.

158 Regularization Methods: Regret Bounds
Example. Consider $A = \Delta_m$, $R(a) = \sum_i a_i \ln a_i$. Then the mapping between primal and dual spaces is $\nabla R(a) = \ln(a)$ (component-wise). And the divergence is the KL divergence, $D_R(a, b) = \sum_i a_i \ln(a_i / b_i)$. And $R$ is strongly convex wrt $\|\cdot\|_1$. Suppose that $\|g_t\|_\infty \le 1$. Also, $R(a) - R(a_1) \le \ln m$, so the regret is no more than $2 \sqrt{n \ln m}$.

159 Regularization Methods: Regret Bounds
Example. $A = \Delta_m$, $R(a) = \sum_i a_i \ln a_i$. What are the updates?
\[ a_{t+1} = \Pi_A^R(\tilde a_{t+1}) = \Pi_A^R \left( \nabla R^* \left( \nabla R(\tilde a_t) - \eta g_t \right) \right) = \Pi_A^R \left( \nabla R^* \left( \ln(\tilde a_t e^{-\eta g_t}) \right) \right) = \Pi_A^R \left( \tilde a_t e^{-\eta g_t} \right), \]
where the $\ln$ and $\exp$ functions are applied component-wise. This is exponentiated gradient: mirror descent with $\nabla R = \ln$. It is easy to check that the projection corresponds to normalization, $\Pi_A^R(\tilde a) = \tilde a / \|\tilde a\|_1$.
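A sketch of the exponentiated gradient update in Python (my own code; the step size $\eta = \sqrt{\ln m / n}$ is the tuning that yields the $2\sqrt{n \ln m}$ bound of the previous slide):

```python
import numpy as np

def exponentiated_gradient(grads, eta):
    """Exponentiated gradient on the simplex: mirror descent with
    R(a) = sum_i a_i ln a_i; the Bregman projection is l1-normalization."""
    n, m = grads.shape
    a, actions = np.full(m, 1.0 / m), []
    for g in grads:
        actions.append(a)
        w = a * np.exp(-eta * g)        # dual step: ln a - eta g, mapped back
        a = w / w.sum()                  # projection = normalization
    return actions

# Linear losses with ||g_t||_inf <= 1: regret at most 2 sqrt(n ln m).
rng = np.random.default_rng(6)
n, m = 1000, 10
G = rng.uniform(-1, 1, size=(n, m))
acts = exponentiated_gradient(G, np.sqrt(np.log(m) / n))
regret = sum(g @ a for g, a in zip(G, acts)) - G.sum(axis=0).min()
assert regret <= 2 * np.sqrt(n * np.log(m))
```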


More information

Logarithmic Regret Algorithms for Strongly Convex Repeated Games

Logarithmic Regret Algorithms for Strongly Convex Repeated Games Logarithmic Regret Algorithms for Strongly Convex Repeated Games Shai Shalev-Shwartz 1 and Yoram Singer 1,2 1 School of Computer Sci & Eng, The Hebrew University, Jerusalem 91904, Israel 2 Google Inc 1600

More information

Machine Learning Theory (CS 6783)

Machine Learning Theory (CS 6783) Machine Learning Theory (CS 6783) Tu-Th 1:25 to 2:40 PM Hollister, 306 Instructor : Karthik Sridharan ABOUT THE COURSE No exams! 5 assignments that count towards your grades (55%) One term project (40%)

More information

Stat 260/CS Learning in Sequential Decision Problems. Peter Bartlett

Stat 260/CS Learning in Sequential Decision Problems. Peter Bartlett Stat 260/CS 294-102. Learning in Sequential Decision Problems. Peter Bartlett 1. Adversarial bandits Definition: sequential game. Lower bounds on regret from the stochastic case. Exp3: exponential weights

More information

Introduction to Algorithms / Algorithms I Lecturer: Michael Dinitz Topic: Intro to Learning Theory Date: 12/8/16

Introduction to Algorithms / Algorithms I Lecturer: Michael Dinitz Topic: Intro to Learning Theory Date: 12/8/16 600.463 Introduction to Algorithms / Algorithms I Lecturer: Michael Dinitz Topic: Intro to Learning Theory Date: 12/8/16 25.1 Introduction Today we re going to talk about machine learning, but from an

More information

Learnability, Stability, Regularization and Strong Convexity

Learnability, Stability, Regularization and Strong Convexity Learnability, Stability, Regularization and Strong Convexity Nati Srebro Shai Shalev-Shwartz HUJI Ohad Shamir Weizmann Karthik Sridharan Cornell Ambuj Tewari Michigan Toyota Technological Institute Chicago

More information

Naïve Bayes classification

Naïve Bayes classification Naïve Bayes classification 1 Probability theory Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: A person s height, the outcome of a coin toss

More information

Lecture 8: Decision-making under total uncertainty: the multiplicative weight algorithm. Lecturer: Sanjeev Arora

Lecture 8: Decision-making under total uncertainty: the multiplicative weight algorithm. Lecturer: Sanjeev Arora princeton univ. F 13 cos 521: Advanced Algorithm Design Lecture 8: Decision-making under total uncertainty: the multiplicative weight algorithm Lecturer: Sanjeev Arora Scribe: (Today s notes below are

More information

THE WEIGHTED MAJORITY ALGORITHM

THE WEIGHTED MAJORITY ALGORITHM THE WEIGHTED MAJORITY ALGORITHM Csaba Szepesvári University of Alberta CMPUT 654 E-mail: szepesva@ualberta.ca UofA, October 3, 2006 OUTLINE 1 PREDICTION WITH EXPERT ADVICE 2 HALVING: FIND THE PERFECT EXPERT!

More information

Robustness and duality of maximum entropy and exponential family distributions

Robustness and duality of maximum entropy and exponential family distributions Chapter 7 Robustness and duality of maximum entropy and exponential family distributions In this lecture, we continue our study of exponential families, but now we investigate their properties in somewhat

More information

Generalization bounds

Generalization bounds Advanced Course in Machine Learning pring 200 Generalization bounds Handouts are jointly prepared by hie Mannor and hai halev-hwartz he problem of characterizing learnability is the most basic question

More information

0.1 Motivating example: weighted majority algorithm

0.1 Motivating example: weighted majority algorithm princeton univ. F 16 cos 521: Advanced Algorithm Design Lecture 8: Decision-making under total uncertainty: the multiplicative weight algorithm Lecturer: Sanjeev Arora Scribe: Sanjeev Arora (Today s notes

More information

Tutorial: PART 1. Online Convex Optimization, A Game- Theoretic Approach to Learning.

Tutorial: PART 1. Online Convex Optimization, A Game- Theoretic Approach to Learning. Tutorial: PART 1 Online Convex Optimization, A Game- Theoretic Approach to Learning http://www.cs.princeton.edu/~ehazan/tutorial/tutorial.htm Elad Hazan Princeton University Satyen Kale Yahoo Research

More information

Tutorial: PART 2. Online Convex Optimization, A Game- Theoretic Approach to Learning

Tutorial: PART 2. Online Convex Optimization, A Game- Theoretic Approach to Learning Tutorial: PART 2 Online Convex Optimization, A Game- Theoretic Approach to Learning Elad Hazan Princeton University Satyen Kale Yahoo Research Exploiting curvature: logarithmic regret Logarithmic regret

More information

CS261: A Second Course in Algorithms Lecture #11: Online Learning and the Multiplicative Weights Algorithm

CS261: A Second Course in Algorithms Lecture #11: Online Learning and the Multiplicative Weights Algorithm CS61: A Second Course in Algorithms Lecture #11: Online Learning and the Multiplicative Weights Algorithm Tim Roughgarden February 9, 016 1 Online Algorithms This lecture begins the third module of the

More information

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44

Convex Optimization. Newton s method. ENSAE: Optimisation 1/44 Convex Optimization Newton s method ENSAE: Optimisation 1/44 Unconstrained minimization minimize f(x) f convex, twice continuously differentiable (hence dom f open) we assume optimal value p = inf x f(x)

More information

Ad Placement Strategies

Ad Placement Strategies Case Study : Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD AdaGrad Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 7 th, 04 Ad

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

No-Regret Algorithms for Unconstrained Online Convex Optimization

No-Regret Algorithms for Unconstrained Online Convex Optimization No-Regret Algorithms for Unconstrained Online Convex Optimization Matthew Streeter Duolingo, Inc. Pittsburgh, PA 153 matt@duolingo.com H. Brendan McMahan Google, Inc. Seattle, WA 98103 mcmahan@google.com

More information

Bregman Divergence and Mirror Descent

Bregman Divergence and Mirror Descent Bregman Divergence and Mirror Descent Bregman Divergence Motivation Generalize squared Euclidean distance to a class of distances that all share similar properties Lots of applications in machine learning,

More information

Introduction to Bandit Algorithms. Introduction to Bandit Algorithms

Introduction to Bandit Algorithms. Introduction to Bandit Algorithms Stochastic K-Arm Bandit Problem Formulation Consider K arms (actions) each correspond to an unknown distribution {ν k } K k=1 with values bounded in [0, 1]. At each time t, the agent pulls an arm I t {1,...,

More information

The Algorithmic Foundations of Adaptive Data Analysis November, Lecture The Multiplicative Weights Algorithm

The Algorithmic Foundations of Adaptive Data Analysis November, Lecture The Multiplicative Weights Algorithm he Algorithmic Foundations of Adaptive Data Analysis November, 207 Lecture 5-6 Lecturer: Aaron Roth Scribe: Aaron Roth he Multiplicative Weights Algorithm In this lecture, we define and analyze a classic,

More information

Class 2 & 3 Overfitting & Regularization

Class 2 & 3 Overfitting & Regularization Class 2 & 3 Overfitting & Regularization Carlo Ciliberto Department of Computer Science, UCL October 18, 2017 Last Class The goal of Statistical Learning Theory is to find a good estimator f n : X Y, approximating

More information

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability

Naïve Bayes classification. p ij 11/15/16. Probability theory. Probability theory. Probability theory. X P (X = x i )=1 i. Marginal Probability Probability theory Naïve Bayes classification Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s height, the outcome of a coin toss Distinguish

More information

Machine Learning Lecture 7

Machine Learning Lecture 7 Course Outline Machine Learning Lecture 7 Fundamentals (2 weeks) Bayes Decision Theory Probability Density Estimation Statistical Learning Theory 23.05.2016 Discriminative Approaches (5 weeks) Linear Discriminant

More information

Agnostic Online learnability

Agnostic Online learnability Technical Report TTIC-TR-2008-2 October 2008 Agnostic Online learnability Shai Shalev-Shwartz Toyota Technological Institute Chicago shai@tti-c.org ABSTRACT We study a fundamental question. What classes

More information

Optimization, Learning, and Games with Predictable Sequences

Optimization, Learning, and Games with Predictable Sequences Optimization, Learning, and Games with Predictable Sequences Alexander Rakhlin University of Pennsylvania Karthik Sridharan University of Pennsylvania Abstract We provide several applications of Optimistic

More information

Machine Learning Theory (CS 6783)

Machine Learning Theory (CS 6783) Machine Learning Theory (CS 6783) Tu-Th 1:25 to 2:40 PM Kimball, B-11 Instructor : Karthik Sridharan ABOUT THE COURSE No exams! 5 assignments that count towards your grades (55%) One term project (40%)

More information

Adaptive Gradient Methods AdaGrad / Adam. Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade

Adaptive Gradient Methods AdaGrad / Adam. Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade Adaptive Gradient Methods AdaGrad / Adam Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade 1 Announcements: HW3 posted Dual coordinate ascent (some review of SGD and random

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline

More information

Peter Hoff Minimax estimation November 12, Motivation and definition. 2 Least favorable prior 3. 3 Least favorable prior sequence 11

Peter Hoff Minimax estimation November 12, Motivation and definition. 2 Least favorable prior 3. 3 Least favorable prior sequence 11 Contents 1 Motivation and definition 1 2 Least favorable prior 3 3 Least favorable prior sequence 11 4 Nonparametric problems 15 5 Minimax and admissibility 18 6 Superefficiency and sparsity 19 Most of

More information

l 1 -Regularized Linear Regression: Persistence and Oracle Inequalities

l 1 -Regularized Linear Regression: Persistence and Oracle Inequalities l -Regularized Linear Regression: Persistence and Oracle Inequalities Peter Bartlett EECS and Statistics UC Berkeley slides at http://www.stat.berkeley.edu/ bartlett Joint work with Shahar Mendelson and

More information

Stat 260/CS Learning in Sequential Decision Problems. Peter Bartlett

Stat 260/CS Learning in Sequential Decision Problems. Peter Bartlett Stat 260/CS 294-102. Learning in Sequential Decision Problems. Peter Bartlett 1. Multi-armed bandit algorithms. Concentration inequalities. P(X ǫ) exp( ψ (ǫ))). Cumulant generating function bounds. Hoeffding

More information

Lecture 21: Minimax Theory

Lecture 21: Minimax Theory Lecture : Minimax Theory Akshay Krishnamurthy akshay@cs.umass.edu November 8, 07 Recap In the first part of the course, we spent the majority of our time studying risk minimization. We found many ways

More information

Unconstrained Online Linear Learning in Hilbert Spaces: Minimax Algorithms and Normal Approximations

Unconstrained Online Linear Learning in Hilbert Spaces: Minimax Algorithms and Normal Approximations JMLR: Workshop and Conference Proceedings vol 35:1 20, 2014 Unconstrained Online Linear Learning in Hilbert Spaces: Minimax Algorithms and Normal Approximations H. Brendan McMahan MCMAHAN@GOOGLE.COM Google,

More information

Learning theory. Ensemble methods. Boosting. Boosting: history

Learning theory. Ensemble methods. Boosting. Boosting: history Learning theory Probability distribution P over X {0, 1}; let (X, Y ) P. We get S := {(x i, y i )} n i=1, an iid sample from P. Ensemble methods Goal: Fix ɛ, δ (0, 1). With probability at least 1 δ (over

More information

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer Adaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer Vicente L. Malave February 23, 2011 Outline Notation minimize a number of functions φ

More information

Online Learning with Predictable Sequences

Online Learning with Predictable Sequences JMLR: Workshop and Conference Proceedings vol (2013) 1 27 Online Learning with Predictable Sequences Alexander Rakhlin Karthik Sridharan rakhlin@wharton.upenn.edu skarthik@wharton.upenn.edu Abstract We

More information

Online Prediction: Bayes versus Experts

Online Prediction: Bayes versus Experts Marcus Hutter - 1 - Online Prediction Bayes versus Experts Online Prediction: Bayes versus Experts Marcus Hutter Istituto Dalle Molle di Studi sull Intelligenza Artificiale IDSIA, Galleria 2, CH-6928 Manno-Lugano,

More information

CS261: A Second Course in Algorithms Lecture #12: Applications of Multiplicative Weights to Games and Linear Programs

CS261: A Second Course in Algorithms Lecture #12: Applications of Multiplicative Weights to Games and Linear Programs CS26: A Second Course in Algorithms Lecture #2: Applications of Multiplicative Weights to Games and Linear Programs Tim Roughgarden February, 206 Extensions of the Multiplicative Weights Guarantee Last

More information

Overfitting, Bias / Variance Analysis

Overfitting, Bias / Variance Analysis Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic

More information

The Moment Method; Convex Duality; and Large/Medium/Small Deviations

The Moment Method; Convex Duality; and Large/Medium/Small Deviations Stat 928: Statistical Learning Theory Lecture: 5 The Moment Method; Convex Duality; and Large/Medium/Small Deviations Instructor: Sham Kakade The Exponential Inequality and Convex Duality The exponential

More information

Worst-Case Bounds for Gaussian Process Models

Worst-Case Bounds for Gaussian Process Models Worst-Case Bounds for Gaussian Process Models Sham M. Kakade University of Pennsylvania Matthias W. Seeger UC Berkeley Abstract Dean P. Foster University of Pennsylvania We present a competitive analysis

More information

Weighted Majority and the Online Learning Approach

Weighted Majority and the Online Learning Approach Statistical Techniques in Robotics (16-81, F12) Lecture#9 (Wednesday September 26) Weighted Majority and the Online Learning Approach Lecturer: Drew Bagnell Scribe:Narek Melik-Barkhudarov 1 Figure 1: Drew

More information

Extreme Abridgment of Boyd and Vandenberghe s Convex Optimization

Extreme Abridgment of Boyd and Vandenberghe s Convex Optimization Extreme Abridgment of Boyd and Vandenberghe s Convex Optimization Compiled by David Rosenberg Abstract Boyd and Vandenberghe s Convex Optimization book is very well-written and a pleasure to read. The

More information

Sequential Investment, Universal Portfolio Algos and Log-loss

Sequential Investment, Universal Portfolio Algos and Log-loss 1/37 Sequential Investment, Universal Portfolio Algos and Log-loss Chaitanya Ryali, ECE UCSD March 3, 2014 Table of contents 2/37 1 2 3 4 Definitions and Notations 3/37 A market vector x = {x 1,x 2,...,x

More information

LECTURE 5 NOTES. n t. t Γ(a)Γ(b) pt+a 1 (1 p) n t+b 1. The marginal density of t is. Γ(t + a)γ(n t + b) Γ(n + a + b)

LECTURE 5 NOTES. n t. t Γ(a)Γ(b) pt+a 1 (1 p) n t+b 1. The marginal density of t is. Γ(t + a)γ(n t + b) Γ(n + a + b) LECTURE 5 NOTES 1. Bayesian point estimators. In the conventional (frequentist) approach to statistical inference, the parameter θ Θ is considered a fixed quantity. In the Bayesian approach, it is considered

More information

Statistical Measures of Uncertainty in Inverse Problems

Statistical Measures of Uncertainty in Inverse Problems Statistical Measures of Uncertainty in Inverse Problems Workshop on Uncertainty in Inverse Problems Institute for Mathematics and Its Applications Minneapolis, MN 19-26 April 2002 P.B. Stark Department

More information

Constrained Optimization and Lagrangian Duality

Constrained Optimization and Lagrangian Duality CIS 520: Machine Learning Oct 02, 2017 Constrained Optimization and Lagrangian Duality Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may

More information

Lecture 16: FTRL and Online Mirror Descent

Lecture 16: FTRL and Online Mirror Descent Lecture 6: FTRL and Online Mirror Descent Akshay Krishnamurthy akshay@cs.umass.edu November, 07 Recap Last time we saw two online learning algorithms. First we saw the Weighted Majority algorithm, which

More information

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Charles Elkan elkan@cs.ucsd.edu January 17, 2013 1 Principle of maximum likelihood Consider a family of probability distributions

More information

Lecture 8 October Bayes Estimators and Average Risk Optimality

Lecture 8 October Bayes Estimators and Average Risk Optimality STATS 300A: Theory of Statistics Fall 205 Lecture 8 October 5 Lecturer: Lester Mackey Scribe: Hongseok Namkoong, Phan Minh Nguyen Warning: These notes may contain factual and/or typographic errors. 8.

More information

Relative Loss Bounds for Multidimensional Regression Problems

Relative Loss Bounds for Multidimensional Regression Problems Relative Loss Bounds for Multidimensional Regression Problems Jyrki Kivinen and Manfred Warmuth Presented by: Arindam Banerjee A Single Neuron For a training example (x, y), x R d, y [0, 1] learning solves

More information

Exponentiated Gradient Descent

Exponentiated Gradient Descent CSE599s, Spring 01, Online Learning Lecture 10-04/6/01 Lecturer: Ofer Dekel Exponentiated Gradient Descent Scribe: Albert Yu 1 Introduction In this lecture we review norms, dual norms, strong convexity,

More information

Applications of on-line prediction. in telecommunication problems

Applications of on-line prediction. in telecommunication problems Applications of on-line prediction in telecommunication problems Gábor Lugosi Pompeu Fabra University, Barcelona based on joint work with András György and Tamás Linder 1 Outline On-line prediction; Some

More information

Sequential prediction with coded side information under logarithmic loss

Sequential prediction with coded side information under logarithmic loss under logarithmic loss Yanina Shkel Department of Electrical Engineering Princeton University Princeton, NJ 08544, USA Maxim Raginsky Department of Electrical and Computer Engineering Coordinated Science

More information

Chapter 2. Binary and M-ary Hypothesis Testing 2.1 Introduction (Levy 2.1)

Chapter 2. Binary and M-ary Hypothesis Testing 2.1 Introduction (Levy 2.1) Chapter 2. Binary and M-ary Hypothesis Testing 2.1 Introduction (Levy 2.1) Detection problems can usually be casted as binary or M-ary hypothesis testing problems. Applications: This chapter: Simple hypothesis

More information

Statistical Data Mining and Machine Learning Hilary Term 2016

Statistical Data Mining and Machine Learning Hilary Term 2016 Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes

More information

Stochastic and online algorithms

Stochastic and online algorithms Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem

More information

Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization

Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization JMLR: Workshop and Conference Proceedings vol (2010) 1 16 24th Annual Conference on Learning heory Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization

More information

Computational Oracle Inequalities for Large Scale Model Selection Problems

Computational Oracle Inequalities for Large Scale Model Selection Problems for Large Scale Model Selection Problems University of California at Berkeley Queensland University of Technology ETH Zürich, September 2011 Joint work with Alekh Agarwal, John Duchi and Clément Levrard.

More information

Theory and Applications of A Repeated Game Playing Algorithm. Rob Schapire Princeton University [currently visiting Yahoo!

Theory and Applications of A Repeated Game Playing Algorithm. Rob Schapire Princeton University [currently visiting Yahoo! Theory and Applications of A Repeated Game Playing Algorithm Rob Schapire Princeton University [currently visiting Yahoo! Research] Learning Is (Often) Just a Game some learning problems: learn from training

More information

FORMULATION OF THE LEARNING PROBLEM

FORMULATION OF THE LEARNING PROBLEM FORMULTION OF THE LERNING PROBLEM MIM RGINSKY Now that we have seen an informal statement of the learning problem, as well as acquired some technical tools in the form of concentration inequalities, we

More information

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner

Fundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner Fundamentals CS 281A: Statistical Learning Theory Yangqing Jia Based on tutorial slides by Lester Mackey and Ariel Kleiner August, 2011 Outline 1 Probability 2 Statistics 3 Linear Algebra 4 Optimization

More information

Cheng Soon Ong & Christian Walder. Canberra February June 2018

Cheng Soon Ong & Christian Walder. Canberra February June 2018 Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 254 Part V

More information

Online Optimization : Competing with Dynamic Comparators

Online Optimization : Competing with Dynamic Comparators Ali Jadbabaie Alexander Rakhlin Shahin Shahrampour Karthik Sridharan University of Pennsylvania University of Pennsylvania University of Pennsylvania Cornell University Abstract Recent literature on online

More information

Iterative Markov Chain Monte Carlo Computation of Reference Priors and Minimax Risk

Iterative Markov Chain Monte Carlo Computation of Reference Priors and Minimax Risk Iterative Markov Chain Monte Carlo Computation of Reference Priors and Minimax Risk John Lafferty School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 lafferty@cs.cmu.edu Abstract

More information

STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION

STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION Tong Zhang The Annals of Statistics, 2004 Outline Motivation Approximation error under convex risk minimization

More information

A Mini-Course on Convex Optimization with a view toward designing fast algorithms. Nisheeth K. Vishnoi

A Mini-Course on Convex Optimization with a view toward designing fast algorithms. Nisheeth K. Vishnoi A Mini-Course on Convex Optimization with a view toward designing fast algorithms Nisheeth K. Vishnoi i Preface These notes are largely based on lectures delivered by the author in the Fall of 2014. The

More information

A survey: The convex optimization approach to regret minimization

A survey: The convex optimization approach to regret minimization A survey: The convex optimization approach to regret minimization Elad Hazan September 10, 2009 WORKING DRAFT Abstract A well studied and general setting for prediction and decision making is regret minimization

More information

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013

Learning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013 Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description

More information

Bayesian Methods: Naïve Bayes

Bayesian Methods: Naïve Bayes Bayesian Methods: aïve Bayes icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Last Time Parameter learning Learning the parameter of a simple coin flipping model Prior

More information

Max Margin-Classifier

Max Margin-Classifier Max Margin-Classifier Oliver Schulte - CMPT 726 Bishop PRML Ch. 7 Outline Maximum Margin Criterion Math Maximizing the Margin Non-Separable Data Kernels and Non-linear Mappings Where does the maximization

More information