Online Prediction Peter Bartlett
1 Online Prediction. Peter Bartlett. Statistics and EECS, UC Berkeley, and Mathematical Sciences, Queensland University of Technology.
2 Online Prediction Repeated game: the decision method plays a_t ∈ A; the world reveals l_t ∈ L; the method incurs l_t(a_t). Cumulative loss: L̂_n = ∑_{t=1}^n l_t(a_t). Aim to minimize regret, that is, perform well compared to the best (in retrospect) from some class:
regret = ∑_{t=1}^n l_t(a_t) − min_{a∈A} ∑_{t=1}^n l_t(a) = L̂_n − L_n.
Data can be adversarially chosen.
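To make the definitions concrete, here is a minimal sketch (names are my own, the loss table is synthetic) that computes the regret of a sequence of plays against a finite action set:

```python
def regret(actions, loss_rows, action_set):
    """Cumulative loss of the plays minus the cumulative loss of the
    best fixed action in hindsight; loss_rows[t][a] is l_t(a)."""
    L_hat = sum(row[a] for row, a in zip(loss_rows, actions))
    L_star = min(sum(row[a] for row in loss_rows) for a in action_set)
    return L_hat - L_star

# Three rounds, two actions: always playing action 0 pays 1 + 0 + 1 = 2,
# while the best fixed action in hindsight (a = 1) pays 0 + 1 + 0 = 1.
losses = [[1, 0], [0, 1], [1, 0]]
print(regret([0, 0, 0], losses, [0, 1]))  # 1
```

Note that regret is relative: a method can have small regret even when every action accumulates large absolute loss.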
3 Online Prediction Minimax regret is the value of the game:
min_{a_1} max_{l_1} ⋯ min_{a_n} max_{l_n} ( ∑_{t=1}^n l_t(a_t) − min_{a∈A} ∑_{t=1}^n l_t(a) ).
4 Online Prediction: Motivations 1. Adversarial model is often appropriate, e.g., in Computer security. Computational finance. 2. Adversarial model assumes little: It is often straightforward to convert a strategy for an adversarial environment to a method for a probabilistic environment. 3. Studying the adversarial model can reveal the deterministic core of a statistical problem: there are strong similarities between the performance guarantees in the two cases. 4. There are significant overlaps in the design of methods for the two problems: Regularization plays a central role. Many online prediction strategies have a natural interpretation as a Bayesian method.
5 Computer Security: Spam Detection
6 Computer Security: Spam Detection Here, the action a_t might be a classification rule, and l_t is the indicator for a particular email being incorrectly classified (e.g., spam allowed through). The sender can determine if an email is delivered (or detected as spam), and try to modify it. An adversarial model allows an arbitrary sequence of emails. We cannot hope for good classification accuracy in an absolute sense; regret is relative to a comparison class. Minimizing regret ensures that the spam detection accuracy is close to the best performance in retrospect on the particular sequence.
7 Computer Security: Spam Detection Suppose we consider features of messages from some set X (e.g., information about the header, about words in the message, about attachments). The decision method's action a_t is a mapping from X to [0,1] (think of the value as an estimated probability that the message is spam). At each round, the adversary chooses a feature vector x_t ∈ X and a label y_t ∈ {0,1}, and the loss is defined as l_t(a_t) = (y_t − a_t(x_t))². The regret is then the excess squared error, over the best achievable on the data sequence:
∑_{t=1}^n l_t(a_t) − min_{a∈A} ∑_{t=1}^n l_t(a) = ∑_{t=1}^n (y_t − a_t(x_t))² − min_{a∈A} ∑_{t=1}^n (y_t − a(x_t))².
8 Computational Finance: Portfolio Optimization
9 Computational Finance: Portfolio Optimization Aim to choose a portfolio (distribution over financial instruments) to maximize utility. Other market players can profit from making our decisions bad ones. For example, if our trades have a market impact, someone can front-run (trade ahead of us). Here, the action a_t is a distribution on instruments, and l_t might be the negative logarithm of the portfolio's increase, ⟨a_t, r_t⟩, where r_t is the vector of relative price increases. We might compare our performance to the best stock (distribution is a delta function), or a set of indices (distribution corresponds to the Dow Jones Industrial Average, etc.), or the set of all distributions.
10 Computational Finance: Portfolio Optimization The decision method's action a_t is a distribution on the m instruments, a_t ∈ Δ_m = {a ∈ [0,1]^m : ∑_i a_i = 1}. At each round, the adversary chooses a vector of returns r_t ∈ R_+^m; the ith component is the ratio of the price of instrument i at time t to its price at the previous time, and the loss is defined as l_t(a_t) = −log⟨a_t, r_t⟩. The regret is then the log of the ratio of the maximum value the portfolio would have at the end (for the best mixture choice) to the final portfolio value:
∑_{t=1}^n l_t(a_t) − min_{a∈A} ∑_{t=1}^n l_t(a) = max_{a∈A} ∑_{t=1}^n log⟨a, r_t⟩ − ∑_{t=1}^n log⟨a_t, r_t⟩.
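A small sketch (my own illustration, synthetic returns) of this cumulative log loss, showing why even the best single stock can be a weak comparator:

```python
import math

def portfolio_log_loss(portfolios, returns):
    """Cumulative loss sum_t -log(<a_t, r_t>); portfolios[t] is the
    distribution a_t over instruments, returns[t] the return vector r_t."""
    return -sum(math.log(sum(a_i * r_i for a_i, r_i in zip(a, r)))
                for a, r in zip(portfolios, returns))

# Two instruments that alternate gains and losses: the uniform mixture
# holds its value exactly (<a, r> = 1 each round), while each single
# stock ends at 1.1 * 0.9 = 0.99 of its starting value.
returns = [[1.1, 0.9], [0.9, 1.1]]
uniform_loss = portfolio_log_loss([[0.5, 0.5]] * 2, returns)
best_stock_loss = min(portfolio_log_loss([[1.0, 0.0]] * 2, returns),
                      portfolio_log_loss([[0.0, 1.0]] * 2, returns))
```

On this sequence the uniform mixture has negative regret against the best single stock; competing with the set of all distributions is the harder goal.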
11 Online Prediction: Motivations 2. Online algorithms are also effective in probabilistic settings. Easy to convert an online algorithm to a batch algorithm. Easy to show that good online performance implies good i.i.d. performance, for example.
12 Online Prediction: Motivations 3. Understanding statistical prediction methods. Many statistical methods, based on probabilistic assumptions, can be effective in an adversarial setting. Analyzing their performance in adversarial settings provides perspective on their robustness. We would like violations of the probabilistic assumptions to have a limited impact.
13 Key Points Online Prediction: repeated game; aim to minimize regret; data can be adversarially chosen. Motivations: often appropriate (security, finance); algorithms also effective in probabilistic settings; can provide insight into statistical prediction methods.
14 Synopsis A finite comparison class: A = {1,..., m}. An easy start. Online, adversarial versus batch, probabilistic. Similar bounds. Optimal regret: dual game. Rademacher averages for probabilistic. Sequential Rademacher averages for adversarial. Online convex optimization. Regularization methods.
15 Synopsis A finite comparison class: A = {1,..., m}. Online, adversarial versus batch, probabilistic. Optimal regret. Online convex optimization.
16 Finite Comparison Class 1. Prediction with expert advice. 2. With perfect predictions: log m regret. 3. Exponential weights strategy: √(n log m) regret. 4. Refinements and extensions: Exponential weights and L_n = 0; n unknown; L_n unknown; Convex (versus linear) losses; Bayesian interpretation. 5. Probabilistic prediction with a finite class.
17 Prediction with Expert Advice Suppose we are predicting whether it will rain tomorrow. We have access to a set of m experts, who each make a forecast of 0 or 1. Can we ensure that we predict almost as well as the best expert? Here, A = {1,...,m}. There are m experts, and each has a forecast sequence f_1^i, f_2^i, ... from {0,1}. At round t, the adversary chooses an outcome y_t ∈ {0,1}, and sets l_t(i) = 1[f_t^i ≠ y_t], that is, 1 if f_t^i ≠ y_t and 0 otherwise.
18 Online Prediction Minimax regret is the value of the game:
min_{a_1} max_{l_1} ⋯ min_{a_n} max_{l_n} ( ∑_{t=1}^n l_t(a_t) − min_{a∈A} ∑_{t=1}^n l_t(a) ),
where L̂_n = ∑_{t=1}^n l_t(a_t) and L_n = min_{a∈A} ∑_{t=1}^n l_t(a).
19 Prediction with Expert Advice An easier game: suppose that the adversary is constrained to choose the sequence y_t so that some expert incurs no loss (L_n = 0), that is, there is an i ∈ {1,...,m} such that for all t, y_t = f_t^i. How should we predict?
20 Prediction with Expert Advice: Halving Define the set of experts who have been correct so far: C_t = {i : l_1(i) = ⋯ = l_{t−1}(i) = 0}. Choose a_t as any element of {i : f_t^i = majority({f_t^j : j ∈ C_t})}. Theorem This strategy has regret no more than log₂ m.
21 Prediction with Expert Advice: Halving Theorem The halving strategy has regret no more than log₂ m. Proof. If it makes a mistake (that is, l_t(a_t) = 1), then the minority of {f_t^j : j ∈ C_t} is correct, so at least half of the experts are eliminated: |C_{t+1}| ≤ |C_t|/2. And otherwise |C_{t+1}| ≤ |C_t| (because C_t never increases). Thus
L̂_n = ∑_{t=1}^n l_t(a_t) ≤ log₂ (|C_1| / |C_{n+1}|) = log₂ m − log₂ |C_{n+1}| ≤ log₂ m.
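A minimal sketch of the halving strategy (function and variable names are my own) makes the elimination argument concrete:

```python
import math
import random

def run_halving(expert_forecasts, outcomes):
    """Halving strategy: predict the majority vote of the experts that
    have been correct so far; return the number of mistakes.

    expert_forecasts[t][i] is expert i's {0,1} forecast at round t.
    """
    active = set(range(len(expert_forecasts[0])))
    mistakes = 0
    for f_t, y_t in zip(expert_forecasts, outcomes):
        votes = sum(f_t[i] for i in active)
        a_t = 1 if 2 * votes >= len(active) else 0
        if a_t != y_t:
            mistakes += 1
        # keep only the experts that were correct this round
        active = {i for i in active if f_t[i] == y_t}
    return mistakes

# Demo: m = 8 experts, expert 0 perfect; mistakes never exceed log2(8) = 3.
random.seed(0)
m, n = 8, 50
outcomes = [random.randint(0, 1) for _ in range(n)]
forecasts = [[outcomes[t] if i == 0 else random.randint(0, 1)
              for i in range(m)] for t in range(n)]
mistakes = run_halving(forecasts, outcomes)
```

On a mistake the majority of `active` is wrong and gets dropped, so `active` at least halves, which is exactly the measure-of-progress argument in the proof.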
22 Prediction with Expert Advice The proof follows a pattern we shall see again: find some measure of progress (here, |C_t|) that changes monotonically when excess loss is incurred (here, it halves), and is somehow constrained (here, it cannot fall below 1, because there is an expert who predicts perfectly). What if there is no perfect expert?
23 Finite Comparison Class 1. Prediction with expert advice. 2. With perfect predictions: log m regret. 3. Exponential weights strategy: √(n log m) regret. 4. Refinements and extensions: Exponential weights and L_n = 0; n unknown; L_n unknown; Convex (versus linear) losses; Bayesian interpretation. 5. Probabilistic prediction with a finite class.
24 Prediction with Expert Advice: Mixed Strategies We have m experts. Allow a mixed strategy, that is, a_t chosen from the simplex Δ_m, the set of distributions on {1,...,m}:
Δ_m = {a ∈ [0,1]^m : ∑_{i=1}^m a_i = 1}.
We can think of the strategy as choosing an element of {1,...,m} randomly, according to a distribution a_t. Or we can think of it as playing an element a_t of Δ_m, and incurring the expected loss
l_t(a_t) = ∑_{i=1}^m a_t^i l_t(e_i),
where l_t(e_i) ∈ [0,1] is the loss incurred by expert i. (e_i denotes the vector with a single 1 in the ith coordinate, and the rest zeros.)
25 Prediction with Expert Advice: Exponential Weights Maintain a set of (unnormalized) weights over experts:
w_1^i = 1, w_{t+1}^i = w_t^i exp(−η l_t(e_i)).
Here, η > 0 is a parameter of the algorithm. Choose a_t as the normalized vector, a_t = w_t / ∑_{i=1}^m w_t^i.
26 Prediction with Expert Advice: Exponential Weights Theorem The exponential weights strategy with parameter η = √(8 ln m / n) has regret satisfying
L̂_n − L_n ≤ √(n ln m / 2).
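A minimal sketch of the strategy (my own illustration; the loss matrix is synthetic), which also checks the theorem's bound numerically on one sequence:

```python
import math
import random

def exp_weights(losses, eta):
    """Run exponential weights on losses[t][i] in [0, 1]; return the
    cumulative expected loss sum_t <a_t, l_t>."""
    m = len(losses[0])
    w = [1.0] * m
    total = 0.0
    for l_t in losses:
        W = sum(w)
        total += sum(w_i / W * l_i for w_i, l_i in zip(w, l_t))  # <a_t, l_t>
        w = [w_i * math.exp(-eta * l_i) for w_i, l_i in zip(w, l_t)]
    return total

# The bound holds for every loss sequence, e.g. a random one.
random.seed(0)
n, m = 200, 5
losses = [[random.random() for _ in range(m)] for _ in range(n)]
L_hat = exp_weights(losses, math.sqrt(8 * math.log(m) / n))
L_star = min(sum(l[i] for l in losses) for i in range(m))
```

Since the guarantee is worst-case, the observed gap `L_hat - L_star` is typically well below √(n ln m / 2).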
27 Exponential Weights: Proof Idea We use a measure of progress: W_t = ∑_{i=1}^m w_t^i. 1. W_{n+1} is at least exp(−η min_i ∑_t l_t(e_i)). 2. W_{n+1} grows no faster than exp(−η ∑_t l_t(a_t)), up to a small correction.
28 Exponential Weights: Proof 1
ln (W_{n+1}/W_1) = ln ( ∑_{i=1}^m w_{n+1}^i ) − ln m
= ln ( ∑_{i=1}^m exp( −η ∑_t l_t(e_i) ) ) − ln m
≥ ln ( max_i exp( −η ∑_t l_t(e_i) ) ) − ln m
= −η min_i ∑_t l_t(e_i) − ln m
= −η L_n − ln m.
29 Exponential Weights: Proof 2
ln (W_{t+1}/W_t) = ln ( ∑_i exp(−η l_t(e_i)) w_t^i / ∑_i w_t^i )
≤ −η ∑_i l_t(e_i) w_t^i / ∑_i w_t^i + η²/8
= −η l_t(a_t) + η²/8,
where we have used Hoeffding's inequality: for a random variable X ∈ [a,b] and λ ∈ R,
ln (E e^{λX}) ≤ λ E X + λ²(b−a)²/8.
30 Aside: Proof of Hoeffding's inequality Define A(λ) = log (E e^{λX}) = log ( ∫ e^{λx} dP(x) ), where X ∼ P. Then A is the log normalization of the exponential family random variable X_λ with reference measure P and sufficient statistic x. Since P has bounded support, A(λ) < ∞ for all λ, and we know that A′(λ) = E(X_λ), A″(λ) = Var(X_λ). Since P has support in [a,b], Var(X_λ) ≤ (b−a)²/4. Then a Taylor expansion about λ = 0 (where X_λ has the same distribution as X) gives A(λ) ≤ λ E X + (λ²/8)(b−a)².
31 Exponential Weights: Proof Thus,
−η L_n − ln m ≤ ln (W_{n+1}/W_1) ≤ −η L̂_n + n η²/8,
so L̂_n − L_n ≤ ln m / η + η n / 8.
Choosing the optimal η gives the result: Theorem The exponential weights strategy with parameter η = √(8 ln m / n) has regret no more than √(n ln m / 2).
32 Key Points For a finite set of actions (experts): If one action is perfect (i.e., has zero loss), the halving algorithm gives per round regret of (log₂ m)/n. Exponential weights gives per round regret of O(√(ln m / n)).
33 Prediction with Expert Advice: Refinements 1. Does the exponential weights strategy give the faster rate if L_n = 0? 2. Do we need to know n to set η?
34 Prediction with Expert Advice: Refinements 1. Does the exponential weights strategy give the faster rate if L_n = 0? Replace Hoeffding,
ln E e^{λX} ≤ λ E X + λ²/8,
with
ln E e^{λX} ≤ (e^λ − 1) E X
(for X ∈ [0,1]: a linear upper bound on e^{λx}).
35 Exponential Weights: Proof 2 Thus
ln (W_{t+1}/W_t) = ln ( ∑_i exp(−η l_t(e_i)) w_t^i / ∑_i w_t^i ) ≤ (e^{−η} − 1) l_t(a_t),
which gives
L̂_n ≤ (η / (1 − e^{−η})) L_n + ln m / (1 − e^{−η}).
For example, if L_n = 0 and η is large, we obtain a regret bound of roughly ln m again. And η large is like the halving algorithm (it puts equal weight on all experts that have zero loss so far).
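A numeric sanity check (my own, with a synthetic loss sequence) of the refined bound when the best expert is perfect: with a large η the cumulative loss stays below ln(m)/(1 − e^{−η}), close to the halving-style ln m rate.

```python
import math
import random

def exp_weights_loss(losses, eta):
    """Cumulative expected loss of exponential weights on losses[t][i]."""
    m = len(losses[0])
    w = [1.0] * m
    total = 0.0
    for l_t in losses:
        W = sum(w)
        total += sum(w_i / W * l_i for w_i, l_i in zip(w, l_t))
        w = [w_i * math.exp(-eta * l_i) for w_i, l_i in zip(w, l_t)]
    return total

# Expert 0 never errs (L_n = 0); the refined bound gives
# L_hat <= ln(m) / (1 - exp(-eta)), roughly ln(m) for large eta.
random.seed(0)
n, m, eta = 100, 8, 5.0
losses = [[0.0] + [random.random() for _ in range(m - 1)] for _ in range(n)]
L_hat = exp_weights_loss(losses, eta)
bound = math.log(m) / (1 - math.exp(-eta))
```

Note the contrast with the √(n ln m) bound: when L_n = 0 the regret does not grow with n at all.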
36 Prediction with Expert Advice: Refinements 2. Do we need to know n to set η? We used the optimal setting η = √(8 ln m / n). But can this regret bound be achieved uniformly across time? Yes; using a time-varying η_t = √(8 ln m / t) gives the same rate (worse constants). It is also possible to set η as a function of L_t, the best cumulative loss so far, to give the improved bound for small losses uniformly across time (worse constants).
37 Prediction with Expert Advice: Refinements 3. We could work with arbitrary convex losses on Δ_m: We defined loss as linear in a: l_t(a) = ∑_i a_i l_t(e_i). We could replace this with any bounded convex function on Δ_m. The only change in the proof is that an equality becomes an inequality:
−η ∑_i l_t(e_i) w_t^i / ∑_i w_t^i ≤ −η l_t(a_t).
38 Prediction with Expert Advice: Refinements But note that the exponential weights strategy only competes with the corners of the simplex: Theorem For convex functions l_t : Δ_m → [0,1], the exponential weights strategy, with η = √(8 ln m / n), satisfies
∑_{t=1}^n l_t(a_t) ≤ min_i ∑_{t=1}^n l_t(e_i) + √(n ln m / 2).
39 Bayesian Interpretation We can interpret the exponential weights strategy as Bayesian prediction: 1. Exponential weights is equivalent to a Bayesian update with outcome vector y and model p(y | j) = h(y) exp(−η y_j). 2. An easy regret bound for Bayesian prediction shows that its regret, with respect to the scaled log loss
l_Bayes(p, y) = −(1/η) log E_{J∼p} exp(−η y_J),
is no more than (1/η) log m. 3. This convex loss matches the linear loss at the corners of the simplex, and (from Hoeffding) differs from the linear loss by no more than η/8. 4. This implies the earlier regret bound for exponential weights.
40 Bayesian Update Parameter space: Θ. Outcome space: Y. Assume a joint distribution p(θ, y) = π(θ) p(y | θ) (prior times likelihood). Predictive distribution:
p̂_{t+1}(y) = p(y | y_1,...,y_t) = ∫ p(y | θ) p_{t+1}(θ) dθ,
where the update p_t on Θ is p_1(θ) = π(θ) and
p_{t+1}(θ) = p_t(θ) p(y_t | θ) / ∫ p_t(θ′) p(y_t | θ′) dθ′.
Cumulative log loss: L̂_n = −∑_{t=1}^n log p̂_t(y_t).
41 Bayesian Interpretation Suppose that the likelihood is p(y | j) = h(y) exp(−η y_j), for j = 1,...,m and y ∈ R^m. Then the Bayes update is
p_{t+1}(j) = (1/Z) p_t(j) exp(−η y_t^j)
(where Z is the normalization). If y_t^j is the loss of expert j, this is the exponential weights algorithm.
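A small sketch (my own, hypothetical names) checking that the Bayes posterior under the likelihood p(y | j) ∝ exp(−η y_j) coincides with the normalized exponential weights:

```python
import math
import random

def bayes_posterior(prior, loss_rounds, eta):
    """Posterior over the m experts after observing the loss vectors
    y_1, ..., y_t, with likelihood p(y | j) proportional to exp(-eta * y_j)."""
    p = list(prior)
    for y in loss_rounds:
        p = [p_j * math.exp(-eta * y_j) for p_j, y_j in zip(p, y)]
        Z = sum(p)
        p = [p_j / Z for p_j in p]
    return p

# With a uniform prior, the posterior equals the normalized weights
# w_j = exp(-eta * cumulative loss of expert j).
random.seed(0)
m, eta = 4, 0.5
losses = [[random.random() for _ in range(m)] for _ in range(3)]
posterior = bayes_posterior([1.0 / m] * m, losses, eta)
cum = [sum(l[j] for l in losses) for j in range(m)]
w = [math.exp(-eta * c) for c in cum]
weights = [w_j / sum(w) for w_j in w]
```

The per-round normalization by Z commutes with the product of likelihoods, which is why the two computations agree.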
42 Performance of Bayesian Prediction For the log loss, Bayesian prediction competes with any θ, provided that the prior probability of performance better than θ is not too small. Theorem For any π, any sequence y_1,...,y_n, and any θ ∈ Θ,
L̂_n − L_n(θ) ≤ −ln ( π({θ′ : L_n(θ′) ≤ L_n(θ)}) ).
43 Performance of Bayesian Prediction: Proof First,
p̂_1(y_1) ⋯ p̂_n(y_n) = p(y_1) p(y_2 | y_1) ⋯ p(y_n | y_1,...,y_{n−1}) = p(y_1,...,y_n) = ∫_Θ p(y_1 | θ) ⋯ p(y_n | θ) dπ(θ),
that is, exp(−L̂_n) = ∫_Θ exp(−L_n(θ)) dπ(θ). Hence
exp(−L̂_n) ≥ ∫_S exp(−L_n(θ)) dπ(θ) ≥ exp(−L_n(θ_0)) π(S),
where S = {θ ∈ Θ : L_n(θ) ≤ L_n(θ_0)}. Thus L̂_n − L_n(θ_0) ≤ −ln(π(S)).
44 Performance of Bayesian Prediction For the log loss, Bayesian prediction competes with any θ, provided that the prior probability of performance better than θ is not too small. Theorem For any π, any sequence y_1,...,y_n, and any θ ∈ Θ,
L̂_n − L_n(θ) ≤ −ln ( π({θ′ : L_n(θ′) ≤ L_n(θ)}) ).
So if π(i) = 1/m, the exponential weights strategy's log loss is within log m of optimal. But what is the log loss here?
45 Bayesian Interpretation For a posterior p_t on {1,...,m}, the predicted probability is
p̂_t(y) = E_{J∼p_t} p(y | J) = c(y) E_{J∼p_t} exp(−η y_J).
So setting the loss as the negative log of the predicted probability is equivalent to defining the loss of a posterior p_t with outcome y_t as η l_Bayes, with
l_Bayes(p_t, y_t) = −(1/η) log ( E_{J∼p_t} exp(−η y_t^J) )
(ignoring additive constants). Compare to the linear loss, l(p_t, y_t) = E_{J∼p_t} y_t^J. These are equal at the corners of the simplex: l_Bayes(δ_j, y) = l(δ_j, y) = y_j.
46 Bayesian Interpretation The theorem shows that, with this loss and any prior,
L̂_Bayes,n ≤ min_j ( ∑_t l(e_j, y_t) − log π(j) / η ).
But this is not the linear loss:
l_Bayes(p_t, y_t) = −(1/η) log ( E_{J∼p_t} exp(−η y_t^J) ) versus l(p_t, y_t) = E_{J∼p_t} y_t^J.
They coincide at the corners, p_t = e_j, and l_Bayes is convex. What is the gap in Jensen's inequality?
47 Bayesian Interpretation
l_Bayes(p_t, y_t) = −(1/η) log ( E_{J∼p_t} exp(−η y_t^J) ), l(p_t, y_t) = E_{J∼p_t} y_t^J.
Hoeffding's inequality for X ∈ [a,b],
A(−η) = log ( E exp(−ηX) ) ≤ −η E X + (η²/8)(b−a)²,
implies l(p_t, y_t) ≤ l_Bayes(p_t, y_t) + η/8, so, as before,
L̂_n ≤ L̂_Bayes,n + ηn/8 ≤ min_j ( ∑_t l(e_j, y_t) − log π(j)/η ) + ηn/8 = min_j ∑_t l(e_j, y_t) + log m / η + ηn/8.
48 Bayesian Interpretation We can interpret the exponential weights strategy as Bayesian prediction: 1. Exponential weights is equivalent to a Bayesian update with model p(y | j) = h(y) exp(−η y_j). 2. An easy regret bound for Bayesian prediction shows that its regret with respect to the scaled log loss l_Bayes(p, y) = −(1/η) log E_{J∼p} exp(−η y_J) is no more than (1/η) log m. 3. This convex loss matches the linear loss at the corners of the simplex, and (from Hoeffding) differs from the linear loss by no more than η/8. 4. This implies the earlier regret bound for exponential weights.
49 Finite Comparison Class 1. Prediction with expert advice. 2. With perfect predictions: log m regret. 3. Exponential weights strategy: √(n log m) regret. 4. Refinements and extensions: Exponential weights and L_n = 0; n unknown; L_n unknown; Convex (versus linear) losses; Bayesian interpretation. 5. Probabilistic prediction with a finite class.
50 Probabilistic Prediction Setting Let's consider a probabilistic formulation of a prediction problem. There is a sample of size n drawn i.i.d. from an unknown probability distribution P on X × Y: (X_1,Y_1),...,(X_n,Y_n). Some method chooses f̂ : X → Y. It suffers regret
E l(f̂(X), Y) − min_{f∈F} E l(f(X), Y).
Here, F is a class of functions from X to Y.
51 Probabilistic Setting: Zero Loss Theorem If some f ∈ F has E l(f(X),Y) = 0, then choosing f̂ ∈ C_n = {f ∈ F : Ê l(f(X),Y) = 0} leads to regret that is O( log|F| / n ).
52 Probabilistic Setting: Zero Loss Proof.
Pr(E l(f̂) ≥ ε) ≤ Pr(∃f ∈ F : Ê l(f) = 0, E l(f) ≥ ε) ≤ |F| (1 − ε)^n ≤ |F| e^{−nε}.
Integrating the tail bound Pr(n E l(f̂) ≥ ln|F| + x) ≤ e^{−x} gives E l(f̂) ≤ c ln|F| / n.
53 Probabilistic Setting Theorem Choosing f̂ to minimize the empirical risk, Ê l(f(X),Y), leads to regret that is O( √(log|F| / n) ).
54 Probabilistic Setting Proof. By the triangle inequality and the definition of f̂,
E l_{f̂} − min_{f∈F} E l_f ≤ 2 E sup_{f∈F} |Ê l_f − E l_f|, and
E sup_{f∈F} (1/n) ∑_{t=1}^n ( P l(Y, f(X)) − l(Y_t, f(X_t)) )
≤ E sup_{f∈F} (1/n) ∑_{t=1}^n ( l(Y′_t, f(X′_t)) − l(Y_t, f(X_t)) )
= E sup_{f∈F} (1/n) ∑_{t=1}^n ε_t ( l(Y′_t, f(X′_t)) − l(Y_t, f(X_t)) )
≤ 2 E sup_{f∈F} (1/n) ∑_{t=1}^n ε_t l(Y_t, f(X_t)),
where (X′_t, Y′_t) are independent, with the same distribution as (X,Y), and ε_t are independent Rademacher (uniform ±1) random variables.
55 Aside: Rademacher Averages of a Finite Class Theorem: For V ⊂ R^n,
E max_{v∈V} ∑_{i=1}^n ε_i v_i ≤ √(2 ln|V|) max_{v∈V} ‖v‖.
Proof idea: Hoeffding's inequality.
exp( λ E max_v ∑_i ε_i v_i ) ≤ E exp( λ max_v ∑_i ε_i v_i )
≤ ∑_v E exp( λ ∑_i ε_i v_i )
= ∑_v ∏_i E exp(λ ε_i v_i)
≤ ∑_v ∏_i exp(λ² v_i²/2)
≤ |V| exp( λ² max_v ‖v‖² / 2 ).
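A quick Monte-Carlo illustration (mine, not from the slides) of the finite-class bound, estimating the Rademacher average of a small random class:

```python
import math
import random

def rademacher_average(V, trials=2000, seed=0):
    """Monte-Carlo estimate of E max_{v in V} sum_i eps_i v_i."""
    rng = random.Random(seed)
    n = len(V[0])
    total = 0.0
    for _ in range(trials):
        eps = [rng.choice((-1, 1)) for _ in range(n)]
        total += max(sum(e * v_i for e, v_i in zip(eps, v)) for v in V)
    return total / trials

# The estimate sits well below sqrt(2 ln |V|) * max_v ||v||_2.
random.seed(1)
V = [[random.uniform(-1, 1) for _ in range(20)] for _ in range(10)]
estimate = rademacher_average(V)
bound = (math.sqrt(2 * math.log(len(V)))
         * max(math.sqrt(sum(v_i * v_i for v_i in v)) for v in V))
```

The log dependence on |V| is what produces the √(log|F|/n) rates on the surrounding slides.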
56 Probabilistic Setting
E l_{f̂} − min_{f∈F} E l_f ≤ 4 E sup_{f∈F} (1/n) ∑_t ε_t l_f(X_t, Y_t)
≤ 4 √(2 log|F|) max_{f, (X_t,Y_t)} √( ∑_t l_f(X_t, Y_t)² ) / n
≤ 4 √( 2 log|F| / n ).
57 Probabilistic Setting: Key Points For a finite function class: If one function has zero loss, choosing f̂ to minimize the empirical risk, Ê l(f(X),Y), gives per round regret of (ln|F|)/n. In any case, f̂ has per round regret of O( √(ln|F| / n) ). The same as the adversarial setting.
58 Synopsis A finite comparison class: A = {1,...,m}. 1. Prediction with expert advice. 2. With perfect predictions: log m regret. 3. Exponential weights strategy: √(n log m) regret. 4. Refinements and extensions. 5. Probabilistic prediction with a finite class. Online, adversarial versus batch, probabilistic. Optimal regret. Online convex optimization.
59 Online to Batch Conversion Suppose we have an online strategy that, given observations l_1,...,l_{t−1}, produces a_t = A(l_1,...,l_{t−1}). Can we convert this to a method that is suitable for a probabilistic setting? That is, if the l_t are chosen i.i.d., can we use A's choices a_t to come up with an â ∈ A so that
E l_1(â) − min_{a∈A} E l_1(a)
is small? Consider the following simple randomized method: 1. Pick T uniformly from {0,...,n}. 2. Let â = A(l_{T+1},...,l_n).
60 Online to Batch Conversion Theorem If A has a regret bound of C_{n+1} for sequences of length n+1, then for any stationary process generating l_1,...,l_{n+1}, this method satisfies
E l_{n+1}(â) − min_{a∈A} E l_{n+1}(a) ≤ C_{n+1} / (n+1).
(Notice that the expectation averages also over the randomness of the method.)
61 Online to Batch Conversion Proof.
E l_{n+1}(â) = E l_{n+1}(A(l_{T+1},...,l_n))
= E (1/(n+1)) ∑_{t=0}^n l_{n+1}(A(l_{t+1},...,l_n))
= E (1/(n+1)) ∑_{t=0}^n l_{n−t+1}(A(l_1,...,l_{n−t}))
= E (1/(n+1)) ∑_{t=1}^{n+1} l_t(A(l_1,...,l_{t−1}))
≤ E (1/(n+1)) ( min_{a∈A} ∑_{t=1}^{n+1} l_t(a) + C_{n+1} )
≤ min_{a∈A} E l_{n+1}(a) + C_{n+1}/(n+1),
using stationarity for the middle equalities and the last step.
62 Online to Batch Conversion The theorem is for the expectation over the randomness of the method. For a high probability result, we could 1. Choose â = (1/n) ∑_{t=1}^n a_t, provided A is convex and the l_t are all convex. 2. Choose
â = argmin_{a_t} ( (1/(n−t)) ∑_{s=t+1}^n l_s(a_t) + c √( log(n/δ) / (n−t) ) ).
In both cases, the analysis involves concentration of martingales.
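A sketch of the randomized conversion, with exponential weights standing in for the online strategy A (my choice of A for the illustration; any online strategy works):

```python
import math
import random

def exp_weights_action(loss_seq, m, eta):
    """Mixed action exponential weights plays after observing loss_seq."""
    w = [1.0] * m
    for l in loss_seq:
        w = [w_i * math.exp(-eta * l_i) for w_i, l_i in zip(w, l)]
    W = sum(w)
    return [w_i / W for w_i in w]

def online_to_batch(loss_seq, m, eta, rng):
    """Pick T uniformly from {0, ..., n}; return the action A would
    play after seeing the suffix l_{T+1}, ..., l_n."""
    T = rng.randint(0, len(loss_seq))  # randint is inclusive on both ends
    return exp_weights_action(loss_seq[T:], m, eta)

rng = random.Random(0)
m, n = 4, 60
loss_seq = [[rng.random() for _ in range(m)] for _ in range(n)]
a_hat = online_to_batch(loss_seq, m, math.sqrt(8 * math.log(m) / n), rng)
```

Running A on the suffix l_{T+1},...,l_n (rather than a prefix) is what makes the proof's index-reversal under stationarity go through.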
63 Key Point: Online to Batch Conversion An online strategy with regret bound C n can be converted to a batch method. The regret per trial in the probabilistic setting is bounded by the regret per trial in the adversarial setting.
64 Synopsis A finite comparison class: A = {1,..., m}. Online, adversarial versus batch, probabilistic. Optimal regret. 1. Dual game. 2. Rademacher averages and sequential Rademacher averages. 3. Linear games. Online convex optimization.
65 Optimal Regret Joint work with Jake Abernethy, Alekh Agarwal, Sasha Rakhlin, Karthik Sridharan and Ambuj Tewari. We have: a set of actions A, a set of loss functions L. At time t, Player chooses an action a_t from A. Adversary chooses l_t : A → R from L. Player incurs loss l_t(a_t). Regret is the value of the game:
V_n(A, L) = inf_{a_1} sup_{l_1} ⋯ inf_{a_n} sup_{l_n} ( ∑_{t=1}^n l_t(a_t) − inf_{a∈A} ∑_{t=1}^n l_t(a) ).
66 Optimal Regret: Dual Game Theorem If A is compact and all l_t are convex, continuous functions, then
V_n(A, L) = sup_P E ( ∑_{t=1}^n inf_{a_t∈A} E[l_t(a_t) | l_1,...,l_{t−1}] − inf_{a∈A} ∑_{t=1}^n l_t(a) ),
where the supremum is over joint distributions P over sequences l_1,...,l_n in L^n. As we'll see, this follows from a minimax theorem. Dual game: the adversary plays first by choosing P. The value of the game is the difference between the minimal conditional expected loss and the minimal empirical loss. If P were i.i.d., this would be the difference between the minimal expected loss and the minimal empirical loss.
67 Optimal Regret: Extensions We could replace L^n by a set of sequences of loss functions: l_1 ∈ L_1, l_2 ∈ L_2(l_1), l_3 ∈ L_3(l_1, l_2), ..., l_n ∈ L_n(l_1, l_2, ..., l_{n−1}). That is, the constraints on the adversary's choice l_t could depend on the previous choices l_1,...,l_{t−1}. We can ensure convexity of the l_t by allowing mixed strategies: replace A by the set of probability distributions P on A and replace l(a) by E_{a∼P} l(a).
68 Dual Game: Proof Idea Theorem (Sion, 1957) If A is compact and for every b ∈ B, f(·, b) is a convex-like,¹ lower semi-continuous function, and for every a ∈ A, f(a, ·) is concave-like, then
inf_{a∈A} sup_{b∈B} f(a, b) = sup_{b∈B} inf_{a∈A} f(a, b).
We'll define B as the set of probability distributions on L and f(a, b) = c + E[l(a) + φ(l)], and we'll assume that A is compact and each l ∈ L is convex and continuous.
¹ Convex-like [Fan, 1953]: for all a_1, a_2 ∈ A and α ∈ [0,1], there is an a ∈ A with α l(a_1) + (1−α) l(a_2) ≥ l(a).
69 Dual Game: Proof Idea
V_n(A, L) = inf_{a_1} sup_{l_1} ⋯ inf_{a_n} sup_{l_n} ( ∑_t l_t(a_t) − inf_{a∈A} ∑_t l_t(a) )
= inf_{a_1} sup_{l_1} ⋯ inf_{a_n} sup_{P_n} E ( ∑_t l_t(a_t) − inf_{a∈A} ∑_t l_t(a) ),
because allowing mixed strategies does not help the adversary.
70 Dual Game: Proof Idea
V_n(A, L) = inf_{a_1} sup_{l_1} ⋯ inf_{a_n} sup_{P_n} E ( ∑_t l_t(a_t) − inf_{a∈A} ∑_t l_t(a) )
= inf_{a_1} sup_{l_1} ⋯ inf_{a_{n−1}} sup_{l_{n−1}} sup_{P_n} inf_{a_n} E ( ∑_t l_t(a_t) − inf_{a∈A} ∑_t l_t(a) ),
by Sion's generalization of von Neumann's minimax theorem.
71 Dual Game: Proof Idea
V_n(A, L) = inf_{a_1} sup_{l_1} ⋯ inf_{a_{n−1}} sup_{l_{n−1}} sup_{P_n} inf_{a_n} E ( ∑_t l_t(a_t) − inf_{a∈A} ∑_t l_t(a) )
= inf_{a_1} sup_{l_1} ⋯ inf_{a_{n−1}} sup_{P_{n−1}} sup_{P_n} E ( ∑_{t=1}^{n−1} l_t(a_t) + inf_{a_n} E[l_n(a_n) | l_1,...,l_{n−1}] − inf_{a∈A} ∑_{t=1}^n l_t(a) ),
splitting the sum and allowing the adversary a mixed strategy at round n−1.
72 Dual Game: Proof Idea
V_n(A, L) = inf_{a_1} sup_{l_1} ⋯ inf_{a_{n−1}} sup_{P_{n−1}} sup_{P_n} E ( ∑_{t=1}^{n−1} l_t(a_t) + inf_{a_n} E[l_n(a_n) | l_1,...,l_{n−1}] − inf_{a∈A} ∑_{t=1}^n l_t(a) )
= inf_{a_1} sup_{l_1} ⋯ sup_{P_{n−1}} sup_{P_n} inf_{a_{n−1}} E ( ∑_{t=1}^{n−1} l_t(a_t) + inf_{a_n} E[l_n(a_n) | l_1,...,l_{n−1}] − inf_{a∈A} ∑_{t=1}^n l_t(a) ),
applying Sion's minimax theorem again.
73 Dual Game: Proof Idea Repeating for rounds n−2, ..., 1 gives
V_n(A, L) = sup_P E ( ∑_{t=1}^n inf_{a_t} E[l_t(a_t) | l_1,...,l_{t−1}] − inf_{a∈A} ∑_{t=1}^n l_t(a) ).
74 Optimal Regret Dual game. Rademacher averages and sequential Rademacher averages. Linear games.
75 Prediction in Probabilistic Settings i.i.d. (X,Y), (X_1,Y_1),...,(X_n,Y_n) ∼ P on X × Y. Use the data (X_1,Y_1),...,(X_n,Y_n) to choose f_n : X → A with small risk,
R(f_n) = P l(Y, f_n(X)),
ideally not much larger than the minimum risk over some comparison class F:
excess risk = R(f_n) − inf_{f∈F} R(f).
76 Tools for the analysis of probabilistic problems For f_n = argmin_{f∈F} ∑_{t=1}^n l(Y_t, f(X_t)),
R(f_n) − inf_{f∈F} P l(Y, f(X)) ≤ 2 sup_{f∈F} | (1/n) ∑_t l(Y_t, f(X_t)) − P l(Y, f(X)) |.
So the supremum of an empirical process, indexed by F, gives an upper bound on the excess risk.
77 Tools for the analysis of probabilistic problems Typically, this supremum is concentrated about its expectation,
E sup_{f∈F} (1/n) ∑_{t=1}^n ( P l(Y, f(X)) − l(Y_t, f(X_t)) )
≤ E sup_{f∈F} (1/n) ∑_{t=1}^n ( l(Y′_t, f(X′_t)) − l(Y_t, f(X_t)) )
= E sup_{f∈F} (1/n) ∑_{t=1}^n ε_t ( l(Y′_t, f(X′_t)) − l(Y_t, f(X_t)) )
≤ 2 E sup_{f∈F} (1/n) ∑_{t=1}^n ε_t l(Y_t, f(X_t)),
where (X′_t, Y′_t) are independent, with the same distribution as (X,Y), and ε_t are independent Rademacher (uniform ±1) random variables.
78 Tools for the analysis of probabilistic problems That is, for f_n = argmin_{f∈F} ∑_t l(Y_t, f(X_t)), with high probability,
R(f_n) − inf_{f∈F} P l(Y, f(X)) ≤ c E sup_{f∈F} (1/n) ∑_t ε_t l(Y_t, f(X_t)),
where the ε_t are independent Rademacher (uniform ±1) random variables. Rademacher averages capture the complexity of {(x,y) ↦ l(y, f(x)) : f ∈ F}: they measure how well functions align with a random (ε_1,...,ε_n). Rademacher averages are a key tool in the analysis of many statistical methods: they are related to covering numbers (Dudley) and combinatorial dimensions (Vapnik-Chervonenkis, Pollard), for example. A related result applies in the online setting...
79 Optimal Regret and Sequential Rademacher Averages Theorem
V_n(A, L) ≤ 2 sup_{l_1} E_{ε_1} ⋯ sup_{l_n} E_{ε_n} sup_{a∈A} ∑_{t=1}^n ε_t l_t(a),
where ε_1,...,ε_n are independent Rademacher (uniform ±1-valued) random variables. Compare to the bound involving Rademacher averages in the probabilistic setting:
excess risk ≤ c E sup_{f∈F} (1/n) ∑_t ε_t l(Y_t, f(X_t)).
In the adversarial case, the choice of l_t is deterministic, and can depend on ε_1,...,ε_{t−1}.
80 Sequential Rademacher Averages: Proof Idea
V_n(A, L) = sup_P E ( ∑_t inf_{a_t∈A} E[l_t(a_t) | l_1,...,l_{t−1}] − inf_{a∈A} ∑_t l_t(a) )
≤ sup_P E ( ∑_t ( E[l_t(â) | l_1,...,l_{t−1}] − l_t(â) ) ),
where â minimizes ∑_t l_t(a).
81 Sequential Rademacher Averages: Proof Idea
V_n(A, L) ≤ sup_P E ( ∑_t ( E[l_t(â) | l_1,...,l_{t−1}] − l_t(â) ) )
≤ sup_P E sup_{a∈A} ∑_t ( E[l_t(a) | l_1,...,l_{t−1}] − l_t(a) ).
82 Sequential Rademacher Averages: Proof Idea
V_n(A, L) ≤ sup_P E sup_{a∈A} ∑_t ( E[l_t(a) | l_1,...,l_{t−1}] − l_t(a) )
= sup_P E sup_{a∈A} ∑_t ( E[l′_t(a) | l_1,...,l_{t−1}] − l_t(a) ),
where l′_t is a tangent sequence: conditionally independent of l_t given l_1,...,l_{t−1}, with the same conditional distribution.
83 Sequential Rademacher Averages: Proof Idea
V_n(A, L) ≤ sup_P E sup_{a∈A} ∑_t ( E[l′_t(a) | l_1,...,l_{t−1}] − l_t(a) )
≤ sup_P E sup_{a∈A} ∑_t ( l′_t(a) − l_t(a) ),
moving the supremum inside the expectation.
84 Sequential Rademacher Averages: Proof Idea
V_n(A, L) ≤ sup_P E sup_{a∈A} ∑_t ( l′_t(a) − l_t(a) )
= sup_P E sup_{a∈A} ( ∑_{t=1}^{n−1} ( l′_t(a) − l_t(a) ) + ε_n ( l′_n(a) − l_n(a) ) ),
for ε_n ∈ {−1, 1}, since l′_n has the same conditional distribution, given l_1,...,l_{n−1}, as l_n.
85 Sequential Rademacher Averages: Proof Idea
V_n(A, L) ≤ sup_P E_{l_1,...,l_{n−1}} E_{l_n, l′_n} E_{ε_n} sup_{a∈A} ( ∑_{t=1}^{n−1} ( l′_t(a) − l_t(a) ) + ε_n ( l′_n(a) − l_n(a) ) )
≤ sup_P E_{l_1,...,l_{n−1}} sup_{l_n, l′_n} E_{ε_n} sup_{a∈A} ( ∑_{t=1}^{n−1} ( l′_t(a) − l_t(a) ) + ε_n ( l′_n(a) − l_n(a) ) ).
86 Sequential Rademacher Averages: Proof Idea Repeating for t = n−1, ..., 1:
V_n(A, L) ≤ sup_P E_{l_1,...,l_{n−1}} sup_{l_n, l′_n} E_{ε_n} sup_{a∈A} ( ∑_{t=1}^{n−1} ( l′_t(a) − l_t(a) ) + ε_n ( l′_n(a) − l_n(a) ) )
≤ sup_{l_1, l′_1} E_{ε_1} ⋯ sup_{l_n, l′_n} E_{ε_n} sup_{a∈A} ∑_t ε_t ( l′_t(a) − l_t(a) ).
87 Sequential Rademacher Averages: Proof Idea
V_n(A, L) ≤ sup_{l_1, l′_1} E_{ε_1} ⋯ sup_{l_n, l′_n} E_{ε_n} sup_{a∈A} ∑_t ε_t ( l′_t(a) − l_t(a) )
≤ 2 sup_{l_1} E_{ε_1} ⋯ sup_{l_n} E_{ε_n} sup_{a∈A} ∑_t ε_t l_t(a),
since the two sums are identical in distribution (ε_t and −ε_t have the same distribution).
88 Optimal Regret and Sequential Rademacher Averages Theorem
V_n(A, L) ≤ 2 sup_{l_1} E_{ε_1} ⋯ sup_{l_n} E_{ε_n} sup_{a∈A} ∑_{t=1}^n ε_t l_t(a),
where ε_1,...,ε_n are independent Rademacher (uniform ±1-valued) random variables. Compare to the bound involving Rademacher averages in the probabilistic setting:
excess risk ≤ c E sup_{f∈F} (1/n) ∑_t ε_t l(Y_t, f(X_t)).
In the adversarial case, the choice of l_t is deterministic, and can depend on ε_1,...,ε_{t−1}: sequential Rademacher averages.
89 Sequential Rademacher Averages: Example Consider step functions on R: f_a : x ↦ 1[x < a], with loss l_a(y, x) = 1[f_a(x) ≠ y] and L = {a ↦ 1[f_a(x) ≠ y] : x ∈ R, y ∈ {0,1}}. Fix a distribution on R × {0,1}, and consider the Rademacher averages
E sup_{a∈R} ∑_t ε_t l_a(Y_t, X_t).
90 Rademacher Averages: Example For step functions on R, the Rademacher averages are:
E sup_{a∈R} ∑_t ε_t l_a(Y_t, X_t) = E sup_{a∈R} ∑_t ε_t l_a(1, X_t)
≤ sup_{x_t} E sup_{a∈R} ∑_t ε_t 1[x_t < a]
= E max_{0≤i≤n} ∑_{t=1}^i ε_t
= O(√n).
91 Sequential Rademacher Averages: Example Consider the sequential Rademacher averages:
sup_{l_1} E_{ε_1} ⋯ sup_{l_n} E_{ε_n} sup_a ∑_t ε_t l_t(a) = sup_{x_1} E_{ε_1} ⋯ sup_{x_n} E_{ε_n} sup_a ∑_t ε_t 1[x_t < a].
If ε_t = 1, we'd like to choose a such that x_t < a. If ε_t = −1, we'd like to choose a such that x_t ≥ a.
92 Sequential Rademacher Averages: Example The sequential Rademacher averages are
sup_{x_1} E_{ε_1} ⋯ sup_{x_n} E_{ε_n} sup_a ∑_t ε_t 1[x_t < a].
We can choose x_1 = 0 and, for t = 2,...,n,
x_t = ∑_{i=1}^{t−1} 2^{−i} ε_i = x_{t−1} + 2^{−(t−1)} ε_{t−1}.
Then if we set a = x_n + 2^{−n} ε_n, we have
ε_t 1[x_t < a] = 1 if ε_t = 1, and 0 otherwise,
which is maximal.
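The adversary's construction can be implemented as a bisection (an equivalent variant of the slide's 2^{−i} expansion; names are mine): each x_t is committed before ε_t is revealed, yet a single threshold a realizes the maximal value.

```python
import random

def sequential_adversary(eps_seq):
    """Choose x_t by bisection before seeing eps_t; afterwards a single
    threshold a satisfies 1[x_t < a] = 1 exactly when eps_t = +1."""
    lo, hi = 0.0, 1.0
    xs = []
    for eps in eps_seq:
        x = (lo + hi) / 2  # committed before eps is revealed
        xs.append(x)
        if eps == 1:
            lo = x  # the threshold must end up above x
        else:
            hi = x  # the threshold must end up at or below x
    return xs, hi

# sum_t eps_t 1[x_t < a] equals the number of +1 signs, so the
# sequential Rademacher average is E sum_t 1[eps_t = 1] = n/2.
random.seed(0)
eps_seq = [random.choice((-1, 1)) for _ in range(20)]
xs, a = sequential_adversary(eps_seq)
value = sum(e * (1 if x < a else 0) for e, x in zip(eps_seq, xs))
```

This is exactly what the i.i.d. setting forbids: the adversarial x_t may depend on the earlier signs, which inflates O(√n) to n/2.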
93 Sequential Rademacher Averages: Example So the sequential Rademacher averages are
sup_{l_1} E_{ε_1} ⋯ sup_{l_n} E_{ε_n} sup_a ∑_t ε_t l_t(a) = E ∑_t 1[ε_t = 1] = n/2.
Compare with the Rademacher averages:
E sup_{a∈R} ∑_t ε_t l_a(Y_t, X_t) = O(√n).
94 Optimal Regret Dual game. Rademacher averages and sequential Rademacher averages. Linear games.
95 Optimal Regret: Linear Games Loss is l(a) = ⟨c, a⟩. Examples: Online linear optimization: L is a set of bounded linear functions on the bounded set A ⊂ R^d. Prediction with expert advice: A = Δ_m, L = [0,1]^m.
96 Optimal Regret: Linear Games Theorem For the linear loss class {l(a) = ⟨c, a⟩ : c ∈ C}, the regret satisfies
V_n(A, L) ≤ 2 sup_{{Z_t}∈M_{C−C}} E sup_{a∈A} ⟨Z_n, a⟩,
where M_{C−C} is the set of martingales with differences in C − C. If C is symmetric (−C = C), then
V_n(A, L) ≥ sup_{{Z_t}∈M_C} E sup_{a∈A} ⟨Z_n, a⟩.
97 Linear Games: Proof Idea The sequential Rademacher averages can be written
sup_{l_1} E_{ε_1} ⋯ sup_{l_n} E_{ε_n} sup_a ∑_t ε_t l_t(a) = sup_{c_1} E_{ε_1} ⋯ sup_{c_n} E_{ε_n} sup_a ⟨∑_t ε_t c_t, a⟩
≤ sup_{{Z_t}∈M_{C−C}} E sup_a ⟨Z_n, a⟩,
where M_{C−C} is the set of martingales with differences in C − C.
98 Linear Games: Proof Idea For the lower bound, consider the duality result:
V_n(A, L) = sup_P E ( ∑_t inf_{a_t∈A} E[⟨c_t, a_t⟩ | c_1,...,c_{t−1}] − inf_{a∈A} ⟨∑_t c_t, a⟩ ),
where the supremum is over joint distributions of sequences c_1,...,c_n. If we restrict P to the set M_C of martingales with differences in C, the first term is zero and we have
V_n(A, L) ≥ sup_{{Z_t}∈M_C} E ( −inf_{a∈A} ⟨Z_n, a⟩ ) = sup_{{Z_t}∈M_C} E sup_{a∈A} ⟨Z_n, a⟩,
using the symmetry of C to replace Z_n by −Z_n.
99 Linear Games: Convex Duality [Sham Kakade, Karthik Sridharan and Ambuj Tewari, 2009] Theorem For the linear loss class {l(a) = ⟨c, a⟩ : c ∈ C}, the regret satisfies
V_n(A, L) ≤ 2 sup_{{Z_t}∈M_{C−C}} E sup_{a∈A} ⟨Z_n, a⟩.
The linear criterion brings to mind a dual norm. Suppose we have a norm ‖·‖ defined on C. Then we can view A as a subset of the dual space of linear functions on C, with the dual norm
‖a‖_* = sup { ⟨c, a⟩ : c ∈ C, ‖c‖ ≤ 1 }.
100 Linear Games: Convex Duality

We will measure the size of L using sup_{c∈C} ‖c‖. We will measure the size of A using sup_{a∈A} R(a), for a strongly convex R. We'll call a function R : A → R σ-strongly convex wrt ‖·‖_* if for all a, b ∈ A and α ∈ [0, 1],

R(αa + (1−α)b) ≤ αR(a) + (1−α)R(b) − (σ/2) α(1−α) ‖a − b‖_*².
101 Linear Games: Convex Duality

The Legendre dual of R is

R*(c) = sup_{a∈A} ( ⟨c, a⟩ − R(a) ).

If inf_{a∈A} R(a) = 0, then R*(0) = 0. If R is σ-strongly convex, then R* is differentiable and σ-smooth wrt ‖·‖, that is, for all c, d,

R*(c + d) ≤ R*(c) + ⟨∇R*(c), d⟩ + (1/(2σ)) ‖d‖².
102 Linear Games: Convex Duality

Theorem. For the linear loss class {l(a) = ⟨c, a⟩ : c ∈ C}, if R : A → R is σ-strongly convex, satisfies inf_{a∈A} R(a) = 0, and

sup_{c∈C} ‖c‖ = 1,  sup_{a∈A} R(a) = A²,

then the regret satisfies

V_n(A, L) ≤ 2 sup_{Z ∈ M_{C−C}} E sup_{a∈A} ⟨Z_n, a⟩ ≤ 2A √(2n/σ).
103 Linear Games: Proof Idea

The definition of the Legendre dual of R is

R*(c) = sup_{a∈A} ( ⟨c, a⟩ − R(a) ).

From the definition, for any λ > 0,

E sup_{a∈A} ⟨a, Z_n⟩ ≤ E sup_{a∈A} (1/λ) ( R(a) + R*(λZ_n) ) ≤ A²/λ + E R*(λZ_n)/λ.
104 Linear Games: Proof Idea

Consider the evolution of R*(λZ_t). Because R* is σ-smooth,

E[R*(λZ_t) | Z_1, ..., Z_{t−1}]
≤ R*(λZ_{t−1}) + E[ ⟨∇R*(λZ_{t−1}), λ(Z_t − Z_{t−1})⟩ | Z_1, ..., Z_{t−1} ] + (1/(2σ)) E[ λ² ‖Z_t − Z_{t−1}‖² | Z_1, ..., Z_{t−1} ]
≤ R*(λZ_{t−1}) + λ²/(2σ),

where the middle term vanishes by the martingale property. Thus, E R*(λZ_n) ≤ nλ²/(2σ).
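For the Euclidean case the smoothness bound is in fact an identity, which a simulation makes concrete. A sketch (plain Python with a fixed seed; the parameters are illustrative): take R(a) = a²/2, which is 1-strongly convex, so R*(u) = u²/2 is 1-smooth, and estimate E R*(λZ_n) for a random-sign martingale.

```python
import random

# R(a) = a^2/2 is 1-strongly convex, so R*(u) = u^2/2 is 1-smooth.
# For a +/-1 random-sign martingale Z_t, E R*(lam * Z_n) = n * lam^2 / 2,
# matching the bound n * lam^2 / (2 * sigma) with sigma = 1.
random.seed(4)
lam, n, trials = 0.3, 30, 20000
acc = 0.0
for _ in range(trials):
    z = sum(random.choice([-1, 1]) for _ in range(n))
    acc += 0.5 * (lam * z) ** 2
estimate = acc / trials
assert abs(estimate - n * lam * lam / 2) < 0.2
```

Here E Z_n² = n exactly, so the Monte Carlo estimate concentrates around n λ²/2 = 1.35.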
105 Linear Games: Proof Idea

E sup_{a∈A} ⟨a, Z_n⟩ ≤ A²/λ + E R*(λZ_n)/λ ≤ A²/λ + nλ/(2σ) = A √(2n/σ)

for λ = √(2A²σ/n).
106 Linear Games: Convex Duality

Theorem. For the linear loss class {l(a) = ⟨c, a⟩ : c ∈ C}, if R : A → R is σ-strongly convex, satisfies inf_{a∈A} R(a) = 0, and

sup_{c∈C} ‖c‖ = 1,  sup_{a∈A} R(a) = A²,

then the regret satisfies

V_n(A, L) ≤ 2A √(2n/σ).
107 Linear Games: Examples

The p- and q-norms (with 1/p + 1/q = 1) on C = B_p(1) and A = B_q(A):

‖c‖ = ‖c‖_p,  ‖a‖_* = ‖a‖_q,  R(a) = (1/2)‖a‖_q² ≤ A²,  σ = 2(q − 1).

Here, V_n(A, L) ≤ A √((p − 1)n).
108 Linear Games: Examples

The ∞- and 1-norms on C = [−1, 1]^d and A = Δ_d:

‖c‖ = ‖c‖_∞,  ‖a‖_* = ‖a‖_1,  R(a) = log d + Σ_i a_i log a_i ≤ log d,  σ = 1.

Here, V_n(A, L) ≤ √(2n log d). (The bounds in these examples are tight within small constant factors.)
109 Optimal Regret: Lower Bounds

[Sasha Rakhlin, Karthik Sridharan and Ambuj Tewari, 2010]

For the case of prediction with absolute loss, l_t(a_t) = |y_t − a_t(x_t)|, there are (almost) corresponding lower bounds:

c_1 R_n(A) / log^{3/2} n ≤ V_n ≤ c_2 R_n(A),

where

R_n(A) = sup_{x_1} E_{ε_1} ⋯ sup_{x_n} E_{ε_n} sup_{a∈A} Σ_{t=1}^n ε_t a(x_t).
110 Optimal Regret: Structural Results

R_n(A) = sup_{x_1} E_{ε_1} ⋯ sup_{x_n} E_{ε_n} sup_{a∈A} Σ_{t=1}^n ε_t a(x_t).

It is straightforward to verify that the following properties extend to these sequential Rademacher averages:

- A ⊆ B implies R_n(A) ≤ R_n(B).
- R_n(co(A)) = R_n(A).
- R_n(cA) = |c| R_n(A).
- R_n(φ ∘ A) ≤ ‖φ‖_Lip R_n(A).
- R_n(A + b) = R_n(A).
111 Optimal Regret: Key Points

- Dual game: the adversary chooses a joint distribution to maximize the difference between the minimal conditional expected loss and the minimal empirical loss.
- Upper bound in terms of sequential Rademacher averages.
- Linear games: bound on a martingale using a strongly convex function.
112 Synopsis

- A finite comparison class: A = {1, ..., m}.
- Online, adversarial versus batch, probabilistic.
- Optimal regret.
- Online convex optimization.
  1. Problem formulation
  2. Empirical minimization fails.
  3. Gradient algorithm.
  4. Regularized minimization
  5. Regret bounds
113 Online Convex Optimization

1. Problem formulation
2. Empirical minimization fails.
3. Gradient algorithm.
4. Regularized minimization
   - Bregman divergence
   - Regularized minimization equivalent to minimizing latest loss and divergence from previous decision
   - Constrained minimization equivalent to unconstrained plus Bregman projection
   - Linearization
   - Mirror descent
5. Regret bounds
   - Unconstrained minimization
   - Seeing the future
   - Strong convexity
   - Examples (gradient, exponentiated gradient)
   - Extensions
114 Online Convex Optimization

A = convex subset of R^d. L = set of convex real functions on A. For example,

- l_t(a) = (⟨x_t, a⟩ − y_t)².
- l_t(a) = |⟨x_t, a⟩ − y_t|.
- l_t(a) = −log exp( ⟨a, T(y_t)⟩ − A(a) ) = A(a) − ⟨a, T(y_t)⟩, for A(a) the log normalization of this exponential family, with sufficient statistic T(y).
115 Online Convex Optimization: Example

Choosing a_t to minimize past losses, a_t = arg min_{a∈A} Σ_{s=1}^{t−1} l_s(a), can fail. ("fictitious play", "follow the leader") Suppose A = [−1, 1], L = {a ↦ va : |v| ≤ 1}. Consider the following sequence of losses:

a_1 = 0,  l_1(a) = a/2,
a_2 = −1, l_2(a) = −a,
a_3 = 1,  l_3(a) = a,
a_4 = −1, l_4(a) = −a,
a_5 = 1,  l_5(a) = a,
...

a = 0 shows L*_n ≤ 0, but ˆL_n = n − 1.
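The failure is easy to reproduce. A minimal sketch of the alternating-loss construction above (plain Python; the function name is illustrative): follow-the-leader always jumps to the endpoint that the next loss punishes, so it pays 1 every round after the first, while the fixed point a = 0 pays nothing.

```python
def ftl_regret(n):
    """Follow-the-leader on A = [-1, 1] with the alternating linear
    losses from the slide: l_1(a) = a/2, then l_t(a) = -a, +a, -a, ...
    Returns (cumulative loss of FTL, cumulative loss of a = 0)."""
    vs = [0.5] + [(-1.0 if t % 2 == 0 else 1.0) for t in range(2, n + 1)]
    cum = 0.0          # coefficient of the cumulative linear loss so far
    loss_ftl = 0.0
    a = 0.0            # a_1 = 0 (a minimizer of the empty sum)
    for v in vs:
        loss_ftl += v * a
        cum += v
        a = -1.0 if cum > 0 else 1.0   # leader: minimize cum * a on [-1, 1]
    return loss_ftl, 0.0

lf, lstar = ftl_regret(100)
assert lf == 99.0 and lstar == 0.0    # hat L_n = n - 1, L*_n = 0
```

The leader's cumulative coefficient oscillates between +1/2 and −1/2, so the argmin flips sign every round, exactly out of phase with the losses.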
116 Online Convex Optimization: Example Choosing a t to minimize past losses can fail. The strategy must avoid overfitting, just as in probabilistic settings. Similar approaches (regularization; Bayesian inference) are applicable in the online setting. First approach: gradient steps. Stay close to previous decisions, but move in a direction of improvement.
117 Online Convex Optimization: Gradient Method

a_1 ∈ A,  a_{t+1} = Π_A( a_t − η ∇l_t(a_t) ),

where Π_A is the Euclidean projection on A, Π_A(x) = arg min_{a∈A} ‖x − a‖.

Theorem. For G = max_t ‖∇l_t(a_t)‖ and D = diam(A), the gradient strategy with η = D/(G√n) has regret satisfying

ˆL_n − L*_n ≤ GD √n.
118 Online Convex Optimization: Gradient Method

Theorem. For G = max_t ‖∇l_t(a_t)‖ and D = diam(A), the gradient strategy with η = D/(G√n) has regret satisfying ˆL_n − L*_n ≤ GD√n.

Example. A = {a ∈ R^d : ‖a‖ ≤ 1}, L = {a ↦ ⟨v, a⟩ : ‖v‖ ≤ 1}. D = 2, G ≤ 1. Regret is no more than 2√n. (And O(√n) is optimal.)
119 Online Convex Optimization: Gradient Method

Theorem. For G = max_t ‖∇l_t(a_t)‖ and D = diam(A), the gradient strategy with η = D/(G√n) has regret satisfying ˆL_n − L*_n ≤ GD√n.

Example. A = Δ_m, L = {a ↦ ⟨v, a⟩ : ‖v‖_∞ ≤ 1}. D ≤ 2, G ≤ √m. Regret is no more than 2√(mn). Since competing with the whole simplex is equivalent to competing with the vertices (experts) for linear losses, this is worse than exponential weights (√m versus √(log m)).
120 Online Convex Optimization: Gradient Method

Proof. Define ã_{t+1} = a_t − η ∇l_t(a_t) and a_{t+1} = Π_A(ã_{t+1}). Fix a ∈ A and consider the measure of progress ‖a_t − a‖:

‖a_{t+1} − a‖² ≤ ‖ã_{t+1} − a‖² = ‖a_t − a‖² + η² ‖∇l_t(a_t)‖² − 2η ⟨∇l_t(a_t), a_t − a⟩.

By convexity,

Σ_{t=1}^n ( l_t(a_t) − l_t(a) ) ≤ Σ_{t=1}^n ⟨∇l_t(a_t), a_t − a⟩ ≤ ( ‖a_1 − a‖² − ‖a_{n+1} − a‖² )/(2η) + (η/2) Σ_{t=1}^n ‖∇l_t(a_t)‖².
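The projected gradient strategy and its GD√n guarantee can be checked end to end on a random linear-loss instance. A sketch (plain Python; the instance, the seed, and the names are illustrative): play on the Euclidean unit ball, where the best fixed action against the summed gradient s is −s/‖s‖.

```python
import math, random

def ogd(grads, eta, radius):
    """Projected online gradient descent on the Euclidean ball."""
    d = len(grads[0])
    a = [0.0] * d
    plays = []
    for g in grads:
        plays.append(list(a))
        a = [ai - eta * gi for ai, gi in zip(a, g)]
        norm = math.sqrt(sum(ai * ai for ai in a))
        if norm > radius:                  # Euclidean projection onto the ball
            a = [ai * radius / norm for ai in a]
    return plays

random.seed(1)
n, d = 200, 3
grads = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
G = max(math.sqrt(sum(gi * gi for gi in g)) for g in grads)
D = 2.0                                    # diameter of the unit ball
plays = ogd(grads, eta=D / (G * math.sqrt(n)), radius=1.0)

# linear losses l_t(a) = <g_t, a>; best fixed point is -s/||s|| for s = sum g_t
loss = sum(sum(gi * ai for gi, ai in zip(g, a)) for g, a in zip(grads, plays))
s = [sum(g[i] for g in grads) for i in range(d)]
best = -math.sqrt(sum(si * si for si in s))
assert loss - best <= G * D * math.sqrt(n) + 1e-9   # theorem's regret bound
```

The bound holds deterministically by the proof above, not just for this seed.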
121 Online Convex Optimization

1. Problem formulation
2. Empirical minimization fails.
3. Gradient algorithm.
4. Regularized minimization
   - Bregman divergence
   - Regularized minimization equivalent to minimizing latest loss and divergence from previous decision
   - Constrained minimization equivalent to unconstrained plus Bregman projection
   - Linearization
   - Mirror descent
5. Regret bounds
   - Unconstrained minimization
   - Seeing the future
   - Strong convexity
   - Examples (gradient, exponentiated gradient)
   - Extensions
122 Online Convex Optimization: A Regularization Viewpoint

Suppose l_t is linear: l_t(a) = ⟨g_t, a⟩. Suppose A = R^d. Then minimizing the regularized criterion

a_{t+1} = arg min_a ( η Σ_{s=1}^t l_s(a) + (1/2)‖a‖² )

corresponds to the gradient step

a_{t+1} = a_t − η ∇l_t(a_t).
123 Online Convex Optimization: Regularization

Regularized minimization. Consider the family of strategies of the form:

a_{t+1} = arg min_{a∈A} ( η Σ_{s=1}^t l_s(a) + R(a) ).

The regularizer R : R^d → R is strictly convex and differentiable.
124 Online Convex Optimization: Regularization

Regularized minimization:

a_{t+1} = arg min_{a∈A} ( η Σ_{s=1}^t l_s(a) + R(a) ).

R keeps the sequence of a_t's stable: it diminishes l_t's influence. We can view the choice of a_{t+1} as trading off two competing forces: making l_t(a_{t+1}) small, and keeping a_{t+1} close to a_t. This is a perspective that has motivated many algorithms in the literature. We'll investigate why regularized minimization can be viewed this way.
125 Properties of Regularization Methods

In the unconstrained case (A = R^d), regularized minimization is equivalent to minimizing the latest loss and the distance to the previous decision. The appropriate notion of distance is the Bregman divergence D_{Φ_{t−1}}: define Φ_0 = R and Φ_t = Φ_{t−1} + η l_t, so that

a_{t+1} = arg min_{a∈A} ( η Σ_{s=1}^t l_s(a) + R(a) ) = arg min_{a∈A} Φ_t(a).
126 Bregman Divergence

Definition. For a strictly convex, differentiable Φ : R^d → R, the Bregman divergence wrt Φ is defined, for a, b ∈ R^d, as

D_Φ(a, b) = Φ(a) − ( Φ(b) + ⟨∇Φ(b), a − b⟩ ).

D_Φ(a, b) is the difference between Φ(a) and the value at a of the linear approximation of Φ about b.
127 Bregman Divergence

D_Φ(a, b) = Φ(a) − ( Φ(b) + ⟨∇Φ(b), a − b⟩ ).

Example. For a ∈ R^d, the squared Euclidean norm, Φ(a) = (1/2)‖a‖², has

D_Φ(a, b) = (1/2)‖a‖² − ( (1/2)‖b‖² + ⟨b, a − b⟩ ) = (1/2)‖a − b‖²,

the squared Euclidean distance.
128 Bregman Divergence

Example. For a ∈ [0, ∞)^d, the unnormalized negative entropy, Φ(a) = Σ_{i=1}^d a_i (ln a_i − 1), has

D_Φ(a, b) = Σ_i ( a_i(ln a_i − 1) − b_i(ln b_i − 1) − ln b_i (a_i − b_i) ) = Σ_i ( a_i ln(a_i/b_i) + b_i − a_i ),

the unnormalized KL divergence. Thus, for a ∈ Δ_d, Φ(a) = Σ_i a_i ln a_i has D_Φ(a, b) = Σ_i a_i ln(a_i/b_i).
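Both examples drop out of the one-line definition. A small sketch (plain Python; the helper names are illustrative) that evaluates D_Φ directly from Φ and ∇Φ and checks the two closed forms above:

```python
import math

def bregman(phi, grad_phi, a, b):
    """D_Phi(a, b) = Phi(a) - Phi(b) - <grad Phi(b), a - b>."""
    return phi(a) - phi(b) - sum(g * (ai - bi)
                                 for g, ai, bi in zip(grad_phi(b), a, b))

sq = lambda a: 0.5 * sum(x * x for x in a)            # squared Euclidean norm
sq_grad = lambda a: list(a)
negent = lambda a: sum(x * (math.log(x) - 1) for x in a)   # unnorm. neg. entropy
negent_grad = lambda a: [math.log(x) for x in a]

a, b = [0.2, 0.8], [0.5, 0.5]
# squared Euclidean norm gives squared Euclidean distance
assert abs(bregman(sq, sq_grad, a, b) - 0.5 * (0.3**2 + 0.3**2)) < 1e-9
# unnormalized negative entropy gives unnormalized KL divergence
kl = sum(x * math.log(x / y) + y - x for x, y in zip(a, b))
assert abs(bregman(negent, negent_grad, a, b) - kl) < 1e-9
```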
129 Bregman Divergence

When the domain of Φ is A ⊆ R^d, in addition to differentiability and strict convexity, we make two more assumptions:

- The interior of A is convex.
- For a sequence approaching the boundary of A, ‖∇Φ(a_n)‖ → ∞.

We say that such a Φ is a Legendre function.
130 Bregman Divergence

Properties:

1. D_Φ ≥ 0, and D_Φ(a, a) = 0.
2. D_{A+B} = D_A + D_B.
3. The Bregman projection Π_A^Φ(b) = arg min_{a∈A} D_Φ(a, b) is uniquely defined for closed, convex A.
4. Generalized Pythagoras: for closed, convex A, a' = Π_A^Φ(b), and a ∈ A,
   D_Φ(a, b) ≥ D_Φ(a, a') + D_Φ(a', b).
5. ∇_a D_Φ(a, b) = ∇Φ(a) − ∇Φ(b).
6. For l linear, D_{Φ+l} = D_Φ.
7. For Φ* the Legendre dual of Φ, ∇Φ* = (∇Φ)^{−1} and D_Φ(a, b) = D_{Φ*}(∇Φ(b), ∇Φ(a)).
131 Legendre Dual

For a Legendre function Φ : A → R, the Legendre dual is

Φ*(u) = sup_{v∈A} ( ⟨u, v⟩ − Φ(v) ).

- Φ* is Legendre.
- dom(∇Φ*) = ∇Φ(int dom Φ).
- ∇Φ* = (∇Φ)^{−1}.
- D_Φ(a, b) = D_{Φ*}(∇Φ(b), ∇Φ(a)).
- Φ** = Φ.
132 Legendre Dual

Example. For Φ = (1/2)‖·‖_p², the Legendre dual is Φ* = (1/2)‖·‖_q², where 1/p + 1/q = 1.

Example. For Φ(a) = Σ_{i=1}^d e^{a_i}, so ∇Φ(a) = (e^{a_1}, ..., e^{a_d}),

(∇Φ)^{−1}(u) = ∇Φ*(u) = (ln u_1, ..., ln u_d),

and Φ*(u) = Σ_i u_i (ln u_i − 1).
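The exponential/entropy pair above also gives a concrete check of the inverse-gradient property ∇Φ* = (∇Φ)^{−1} and of the Fenchel equality Φ(a) + Φ*(u) = ⟨u, a⟩ at u = ∇Φ(a). A sketch (plain Python; the test point is illustrative):

```python
import math

# Phi(a) = sum exp(a_i)  =>  grad Phi(a) = exp(a) component-wise,
# Phi*(u) = sum u_i (ln u_i - 1)  =>  grad Phi*(u) = ln(u) component-wise.
grad_phi = lambda a: [math.exp(x) for x in a]
grad_phi_star = lambda u: [math.log(x) for x in u]

a = [0.3, -1.2, 2.0]
u = grad_phi(a)
# grad Phi* is the inverse of grad Phi
assert all(abs(x - y) < 1e-12 for x, y in zip(grad_phi_star(u), a))
# Fenchel-Young holds with equality at u = grad Phi(a)
phi = sum(math.exp(x) for x in a)
phi_star = sum(x * (math.log(x) - 1) for x in u)
inner = sum(x * y for x, y in zip(u, a))
assert abs(phi + phi_star - inner) < 1e-9
```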
133 Online Convex Optimization

1. Problem formulation
2. Empirical minimization fails.
3. Gradient algorithm.
4. Regularized minimization
   - Bregman divergence
   - Regularized minimization equivalent to minimizing latest loss and divergence from previous decision
   - Constrained minimization equivalent to unconstrained plus Bregman projection
   - Linearization
   - Mirror descent
5. Regret bounds
   - Unconstrained minimization
   - Seeing the future
   - Strong convexity
   - Examples (gradient, exponentiated gradient)
   - Extensions
134 Properties of Regularization Methods

In the unconstrained case (A = R^d), regularized minimization is equivalent to minimizing the latest loss and the distance (Bregman divergence) to the previous decision.

Theorem. Define ã_1 via ∇R(ã_1) = 0, and set

ã_{t+1} = arg min_{a∈R^d} ( η l_t(a) + D_{Φ_{t−1}}(a, ã_t) ).

Then

ã_{t+1} = arg min_{a∈R^d} ( η Σ_{s=1}^t l_s(a) + R(a) ).
135 Properties of Regularization Methods

Proof. By the definition of Φ_t,

η l_t(a) + D_{Φ_{t−1}}(a, ã_t) = Φ_t(a) − Φ_{t−1}(a) + D_{Φ_{t−1}}(a, ã_t).

The derivative wrt a is

∇Φ_t(a) − ∇Φ_{t−1}(a) + ∇_a D_{Φ_{t−1}}(a, ã_t) = ∇Φ_t(a) − ∇Φ_{t−1}(a) + ∇Φ_{t−1}(a) − ∇Φ_{t−1}(ã_t).

Setting this to zero shows that

∇Φ_t(ã_{t+1}) = ∇Φ_{t−1}(ã_t) = ⋯ = ∇Φ_0(ã_1) = ∇R(ã_1) = 0,

so ã_{t+1} minimizes Φ_t.
136 Properties of Regularization Methods

Constrained minimization is equivalent to unconstrained minimization, followed by Bregman projection:

Theorem. For

a_{t+1} = arg min_{a∈A} Φ_t(a),  ã_{t+1} = arg min_{a∈R^d} Φ_t(a),

we have a_{t+1} = Π_A^{Φ_t}(ã_{t+1}).
137 Properties of Regularization Methods

Proof. Let a'_{t+1} denote Π_A^{Φ_t}(ã_{t+1}). First, by the definition of a_{t+1},

Φ_t(a_{t+1}) ≤ Φ_t(a'_{t+1}).

Conversely, since ∇Φ_t(ã_{t+1}) = 0,

D_{Φ_t}(a, ã_{t+1}) = Φ_t(a) − Φ_t(ã_{t+1}).

Thus, D_{Φ_t}(a'_{t+1}, ã_{t+1}) ≤ D_{Φ_t}(a_{t+1}, ã_{t+1}) implies Φ_t(a'_{t+1}) ≤ Φ_t(a_{t+1}).
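For the entropy regularizer on the simplex, this "unconstrained minimizer, then Bregman projection" recipe has a closed form, and a brute-force search confirms it. A sketch (plain Python; the two-expert instance and the names are illustrative): with R(a) = Σ a_i ln a_i and a linear cumulative loss ⟨G, a⟩, the unconstrained minimizer is exp(−ηG) component-wise, and the entropy's Bregman projection onto the simplex is normalization, i.e. a softmax.

```python
import math

eta = 0.5
G = [0.7, -0.2]                     # cumulative linear-loss coefficients
phi = lambda p: (eta * (G[0] * p + G[1] * (1 - p))
                 + p * math.log(p) + (1 - p) * math.log(1 - p))

# unconstrained minimizer (exp(-eta * G)), then Bregman projection
# (normalization for the entropy): together, a softmax of -eta * G
w = [math.exp(-eta * g) for g in G]
p_star = w[0] / (w[0] + w[1])

# brute-force check against a fine grid on the 2-point simplex
grid = [i / 10000 for i in range(1, 10000)]
p_grid = min(grid, key=phi)
assert abs(p_star - p_grid) < 1e-3
```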
138 Properties of Regularization Methods

Example. For linear l_t, regularized minimization is equivalent to minimizing the last loss plus the Bregman divergence wrt R to the previous decision:

arg min_{a∈A} ( η Σ_{s=1}^t l_s(a) + R(a) ) = Π_A^R ( arg min_{a∈R^d} ( η l_t(a) + D_R(a, ã_t) ) ),

because adding a linear function to Φ does not change D_Φ.
139 Online Convex Optimization

1. Problem formulation
2. Empirical minimization fails.
3. Gradient algorithm.
4. Regularized minimization
   - Bregman divergence
   - Regularized minimization equivalent to minimizing latest loss and divergence from previous decision
   - Constrained minimization equivalent to unconstrained plus Bregman projection
   - Linearization
   - Mirror descent
5. Regret bounds
   - Unconstrained minimization
   - Seeing the future
   - Strong convexity
   - Examples (gradient, exponentiated gradient)
   - Extensions
140 Properties of Regularization Methods: Linear Loss

We can replace l_t by the linear approximation a ↦ ⟨∇l_t(a_t), a⟩, and this leads to an upper bound on regret.

Theorem. Any strategy for online linear optimization, with regret satisfying

Σ_{t=1}^n ⟨g_t, a_t⟩ − min_{a∈A} Σ_{t=1}^n ⟨g_t, a⟩ ≤ C_n(g_1, ..., g_n),

can be used to construct a strategy for online convex optimization, with regret

Σ_{t=1}^n l_t(a_t) − min_{a∈A} Σ_{t=1}^n l_t(a) ≤ C_n( ∇l_1(a_1), ..., ∇l_n(a_n) ).

Proof. Convexity implies l_t(a_t) − l_t(a) ≤ ⟨∇l_t(a_t), a_t − a⟩.
141 Properties of Regularization Methods: Linear Loss

Key Point: We can replace l_t by the linear loss a ↦ ⟨∇l_t(a_t), a⟩, and this leads to an upper bound on regret. Thus, we can work with linear l_t.
142 Regularization Methods: Mirror Descent

Regularized minimization for linear losses can be viewed as mirror descent, taking a gradient step in a dual space:

Theorem. The decisions

ã_{t+1} = arg min_{a∈R^d} ( η Σ_{s=1}^t ⟨g_s, a⟩ + R(a) )

can be written

ã_{t+1} = (∇R)^{−1} ( ∇R(ã_t) − η g_t ).

This corresponds to first mapping from ã_t through ∇R, then taking a step in the direction −g_t, then mapping back through (∇R)^{−1} = ∇R* to ã_{t+1}.
143 Regularization Methods: Mirror Descent

Proof. For the unconstrained minimization, we have

∇R(ã_{t+1}) = −η Σ_{s=1}^t g_s,  ∇R(ã_t) = −η Σ_{s=1}^{t−1} g_s,

so ∇R(ã_{t+1}) = ∇R(ã_t) − η g_t, which can be written

ã_{t+1} = (∇R)^{−1} ( ∇R(ã_t) − η g_t ).
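The dual-space step is short to write down. A sketch of one unconstrained mirror-descent step with the entropy regularizer (plain Python; the function name and test point are illustrative), checking that "map through ∇R, subtract ηg, map back through (∇R)^{−1}" collapses to a multiplicative update:

```python
import math

def mirror_step(a, g, eta):
    """One unconstrained mirror-descent step for R(a) = sum a_i ln a_i
    (so grad R = ln a + 1 and (grad R)^{-1}(u) = exp(u - 1)):
    grad R(a_new) = grad R(a) - eta * g."""
    dual = [math.log(x) + 1 - eta * gi for x, gi in zip(a, g)]
    return [math.exp(u - 1) for u in dual]

a = [0.2, 0.3, 0.5]
g = [1.0, -1.0, 0.0]
eta = 0.1
stepped = mirror_step(a, g, eta)
# the dual-space step is exactly the multiplicative update a_i * exp(-eta*g_i)
expected = [x * math.exp(-eta * gi) for x, gi in zip(a, g)]
assert all(abs(x - y) < 1e-12 for x, y in zip(stepped, expected))
```

The "+1" and "−1" from ∇R and its inverse cancel, leaving a_i e^{−ηg_i}, which is the exponentiated-gradient update before normalization.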
144 Online Convex Optimization

1. Problem formulation
2. Empirical minimization fails.
3. Gradient algorithm.
4. Regularized minimization and Bregman divergences
5. Regret bounds
   - Unconstrained minimization
   - Seeing the future
   - Strong convexity
   - Examples (gradient, exponentiated gradient)
   - Extensions
145 Online Convex Optimization: Regularization

Regularized minimization:

a_{t+1} = arg min_{a∈A} ( η Σ_{s=1}^t l_s(a) + R(a) ).

The regularizer R : R^d → R is strictly convex and differentiable.
146 Regularization Methods: Regret

Theorem. For A = R^d, regularized minimization suffers regret against any a ∈ A of

Σ_{t=1}^n ( l_t(a_t) − l_t(a) ) = ( D_R(a, a_1) − D_{Φ_n}(a, a_{n+1}) )/η + (1/η) Σ_{t=1}^n D_{Φ_t}(a_t, a_{t+1}),

and thus

ˆL_n ≤ inf_{a∈R^d} ( Σ_{t=1}^n l_t(a) + D_R(a, a_1)/η ) + (1/η) Σ_{t=1}^n D_{Φ_t}(a_t, a_{t+1}).

So the sizes of the steps D_{Φ_t}(a_t, a_{t+1}) determine the regret bound.
147 Regularization Methods: Regret

Theorem. For A = R^d, regularized minimization suffers regret

ˆL_n ≤ inf_{a∈R^d} ( Σ_{t=1}^n l_t(a) + D_R(a, a_1)/η ) + (1/η) Σ_{t=1}^n D_{Φ_t}(a_t, a_{t+1}).

Notice that we can write

D_{Φ_t}(a_t, a_{t+1}) = D_{Φ_t*}( ∇Φ_t(a_{t+1}), ∇Φ_t(a_t) ) = D_{Φ_t*}( 0, ∇Φ_{t−1}(a_t) + η ∇l_t(a_t) ) = D_{Φ_t*}( 0, η ∇l_t(a_t) ).

So it is the size of the gradient steps, D_{Φ_t*}(0, η ∇l_t(a_t)), that determines the regret.
148 Regularization Methods: Regret Bounds

Example. Suppose R = (1/2)‖·‖². Then we have

ˆL_n ≤ L*_n + ‖a − a_1‖²/(2η) + (η/2) Σ_{t=1}^n ‖g_t‖².

And if ‖g_t‖ ≤ G and ‖a − a_1‖ ≤ D, choosing η appropriately gives ˆL_n ≤ L*_n + DG√n.
149 Online Convex Optimization

1. Problem formulation
2. Empirical minimization fails.
3. Gradient algorithm.
4. Regularized minimization and Bregman divergences
5. Regret bounds
   - Unconstrained minimization
   - Seeing the future
   - Strong convexity
   - Examples (gradient, exponentiated gradient)
   - Extensions
150 Regularization Methods: Regret Bounds

Seeing the future gives small regret:

Theorem. For regularized minimization, that is,

a_{t+1} = arg min_{a∈A} ( η Σ_{s=1}^t l_s(a) + R(a) ),

for all a ∈ A,

Σ_{t=1}^n l_t(a_{t+1}) ≤ Σ_{t=1}^n l_t(a) + (1/η)( R(a) − R(a_1) ).
151 Regularization Methods: Regret Bounds

Proof. Since a_{t+1} minimizes Φ_t,

η Σ_{s=1}^t l_s(a) + R(a) ≥ η Σ_{s=1}^t l_s(a_{t+1}) + R(a_{t+1})
= η l_t(a_{t+1}) + η Σ_{s=1}^{t−1} l_s(a_{t+1}) + R(a_{t+1})
≥ η l_t(a_{t+1}) + η Σ_{s=1}^{t−1} l_s(a_t) + R(a_t)
≥ ⋯ ≥ η Σ_{s=1}^t l_s(a_{s+1}) + R(a_1).
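The "seeing the future" inequality can be verified directly in one dimension. A sketch (plain Python with a fixed seed; the instance is illustrative): with R(a) = a²/2 and linear losses g_t·a, the regularized leader is a_{t+1} = −η Σ_{s≤t} g_s and a_1 = 0, and the theorem's bound holds for every comparator tested.

```python
import random

random.seed(2)
eta, n = 0.1, 50
gs = [random.uniform(-1, 1) for _ in range(n)]

# regularized leaders for R(a) = a^2 / 2: a_{t+1} = -eta * sum_{s<=t} g_s
leaders, s = [], 0.0
for g in gs:
    s += g
    leaders.append(-eta * s)

future_loss = sum(g * a for g, a in zip(gs, leaders))  # sum_t l_t(a_{t+1})
for a in [-1.0, -0.3, 0.0, 0.4, 1.0]:
    # theorem: future_loss <= sum_t l_t(a) + (R(a) - R(a_1)) / eta, a_1 = 0
    comparator = sum(g * a for g in gs) + (0.5 * a * a - 0.0) / eta
    assert future_loss <= comparator + 1e-9
```

As the proof shows, the inequality is deterministic: it holds for any loss sequence, not just this random one.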
152 Regularization Methods: Regret Bounds

Theorem. For all a ∈ A,

Σ_{t=1}^n l_t(a_{t+1}) ≤ Σ_{t=1}^n l_t(a) + (1/η)( R(a) − R(a_1) ).

Thus, if a_t and a_{t+1} are close, then regret is small:

Corollary. For all a ∈ A,

Σ_{t=1}^n ( l_t(a_t) − l_t(a) ) ≤ Σ_{t=1}^n ( l_t(a_t) − l_t(a_{t+1}) ) + (1/η)( R(a) − R(a_1) ).

So how can we control the increments l_t(a_t) − l_t(a_{t+1})?
153 Online Convex Optimization

1. Problem formulation
2. Empirical minimization fails.
3. Gradient algorithm.
4. Regularized minimization
   - Bregman divergence
   - Regularized minimization equivalent to minimizing latest loss and divergence from previous decision
   - Constrained minimization equivalent to unconstrained plus Bregman projection
   - Linearization
   - Mirror descent
5. Regret bounds
   - Unconstrained minimization
   - Seeing the future
   - Strong convexity
   - Examples (gradient, exponentiated gradient)
   - Extensions
154 Regularization Methods: Regret Bounds

Definition. We say R is strongly convex wrt a norm ‖·‖ if, for all a, b,

R(a) ≥ R(b) + ⟨∇R(b), a − b⟩ + (1/2)‖a − b‖².

For linear losses and strongly convex regularizers, the dual norm of the gradient is small:

Theorem. If R is strongly convex wrt a norm ‖·‖, and l_t(a) = ⟨g_t, a⟩, then

‖a_t − a_{t+1}‖ ≤ η ‖g_t‖_*,

where ‖·‖_* is the dual norm to ‖·‖: ‖v‖_* = sup{ ⟨v, a⟩ : ‖a‖ ≤ 1 }.
155 Regularization Methods: Regret Bounds

Proof.

R(a_t) ≥ R(a_{t+1}) + ⟨∇R(a_{t+1}), a_t − a_{t+1}⟩ + (1/2)‖a_t − a_{t+1}‖²,
R(a_{t+1}) ≥ R(a_t) + ⟨∇R(a_t), a_{t+1} − a_t⟩ + (1/2)‖a_t − a_{t+1}‖².

Combining,

‖a_t − a_{t+1}‖² ≤ ⟨∇R(a_t) − ∇R(a_{t+1}), a_t − a_{t+1}⟩ ≤ ‖∇R(a_t) − ∇R(a_{t+1})‖_* ‖a_t − a_{t+1}‖.

Hence, ‖a_t − a_{t+1}‖ ≤ ‖∇R(a_t) − ∇R(a_{t+1})‖_* = η‖g_t‖_*.
156 Regularization Methods: Regret Bounds

This leads to the regret bound:

Corollary. For linear losses, if R is strongly convex wrt ‖·‖, then for all a ∈ A,

Σ_{t=1}^n ( l_t(a_t) − l_t(a) ) ≤ η Σ_{t=1}^n ‖g_t‖_*² + (1/η)( R(a) − R(a_1) ).

Thus, for ‖g_t‖_* ≤ G and R(a) − R(a_1) ≤ D², choosing η appropriately gives regret no more than 2GD√n.
157 Regularization Methods: Regret Bounds

Example. Consider R(a) = (1/2)‖a‖², a_1 = 0, and A contained in a Euclidean ball of diameter D. Then R is strongly convex wrt ‖·‖ = ‖·‖_2, and ‖·‖_* = ‖·‖_2. And the mapping between primal and dual spaces is the identity. So if sup_{a∈A} ‖∇l_t(a)‖ ≤ G, then regret is no more than 2GD√n.
158 Regularization Methods: Regret Bounds

Example. Consider A = Δ_m, R(a) = Σ_i a_i ln a_i. Then the mapping between primal and dual spaces is ∇R(a) = ln(a) (component-wise). And the divergence is the KL divergence, D_R(a, b) = Σ_i a_i ln(a_i/b_i). And R is strongly convex wrt ‖·‖_1. Suppose that ‖g_t‖_∞ ≤ 1. Also, R(a) − R(a_1) ≤ ln m, so the regret is no more than 2√(n ln m).
159 Regularization Methods: Regret Bounds

Example. A = Δ_m, R(a) = Σ_i a_i ln a_i. What are the updates?

a_{t+1} = Π_A^R(ã_{t+1}) = Π_A^R( ∇R*( ∇R(ã_t) − η g_t ) ) = Π_A^R( ∇R*( ln(ã_t) − η g_t ) ) = Π_A^R( ã_t exp(−η g_t) ),

where the ln and exp functions are applied component-wise. This is exponentiated gradient: mirror descent with ∇R = ln. It is easy to check that the projection corresponds to normalization, Π_A^R(ã) = ã/‖ã‖_1.
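Putting the last two slides together gives a complete algorithm. A runnable sketch of exponentiated gradient (plain Python with a fixed seed; the instance size and names are illustrative): multiplicative update, then normalization, with the learning rate and the 2√(n ln m) bound from the previous slide.

```python
import math, random

def exponentiated_gradient(grads, eta):
    """EG on the simplex: multiplicative update, then normalization."""
    m = len(grads[0])
    a = [1.0 / m] * m                     # a_1 = uniform = argmin R
    plays = []
    for g in grads:
        plays.append(list(a))
        w = [ai * math.exp(-eta * gi) for ai, gi in zip(a, g)]
        z = sum(w)
        a = [wi / z for wi in w]
    return plays

random.seed(3)
n, m = 500, 10
grads = [[random.uniform(-1, 1) for _ in range(m)] for _ in range(n)]
eta = math.sqrt(math.log(m) / n)
plays = exponentiated_gradient(grads, eta)

loss = sum(sum(gi * ai for gi, ai in zip(g, a)) for g, a in zip(grads, plays))
best = min(sum(g[i] for g in grads) for i in range(m))  # best single expert
# for linear losses, the best vertex is the best point of the whole simplex
assert loss - best <= 2 * math.sqrt(n * math.log(m)) + 1e-9
```

The bound is the corollary's η n + (ln m)/η with ‖g_t‖_∞ ≤ 1 and R(a) − R(a_1) ≤ ln m, optimized over η.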
Machine Learning Theory (CS 6783) Tu-Th 1:25 to 2:40 PM Kimball, B-11 Instructor : Karthik Sridharan ABOUT THE COURSE No exams! 5 assignments that count towards your grades (55%) One term project (40%)
More informationAdaptive Gradient Methods AdaGrad / Adam. Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade
Adaptive Gradient Methods AdaGrad / Adam Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade 1 Announcements: HW3 posted Dual coordinate ascent (some review of SGD and random
More informationMachine Learning for NLP
Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline
More informationPeter Hoff Minimax estimation November 12, Motivation and definition. 2 Least favorable prior 3. 3 Least favorable prior sequence 11
Contents 1 Motivation and definition 1 2 Least favorable prior 3 3 Least favorable prior sequence 11 4 Nonparametric problems 15 5 Minimax and admissibility 18 6 Superefficiency and sparsity 19 Most of
More informationl 1 -Regularized Linear Regression: Persistence and Oracle Inequalities
l -Regularized Linear Regression: Persistence and Oracle Inequalities Peter Bartlett EECS and Statistics UC Berkeley slides at http://www.stat.berkeley.edu/ bartlett Joint work with Shahar Mendelson and
More informationStat 260/CS Learning in Sequential Decision Problems. Peter Bartlett
Stat 260/CS 294-102. Learning in Sequential Decision Problems. Peter Bartlett 1. Multi-armed bandit algorithms. Concentration inequalities. P(X ǫ) exp( ψ (ǫ))). Cumulant generating function bounds. Hoeffding
More informationLecture 21: Minimax Theory
Lecture : Minimax Theory Akshay Krishnamurthy akshay@cs.umass.edu November 8, 07 Recap In the first part of the course, we spent the majority of our time studying risk minimization. We found many ways
More informationUnconstrained Online Linear Learning in Hilbert Spaces: Minimax Algorithms and Normal Approximations
JMLR: Workshop and Conference Proceedings vol 35:1 20, 2014 Unconstrained Online Linear Learning in Hilbert Spaces: Minimax Algorithms and Normal Approximations H. Brendan McMahan MCMAHAN@GOOGLE.COM Google,
More informationLearning theory. Ensemble methods. Boosting. Boosting: history
Learning theory Probability distribution P over X {0, 1}; let (X, Y ) P. We get S := {(x i, y i )} n i=1, an iid sample from P. Ensemble methods Goal: Fix ɛ, δ (0, 1). With probability at least 1 δ (over
More informationAdaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer Vicente L. Malave February 23, 2011 Outline Notation minimize a number of functions φ
More informationOnline Learning with Predictable Sequences
JMLR: Workshop and Conference Proceedings vol (2013) 1 27 Online Learning with Predictable Sequences Alexander Rakhlin Karthik Sridharan rakhlin@wharton.upenn.edu skarthik@wharton.upenn.edu Abstract We
More informationOnline Prediction: Bayes versus Experts
Marcus Hutter - 1 - Online Prediction Bayes versus Experts Online Prediction: Bayes versus Experts Marcus Hutter Istituto Dalle Molle di Studi sull Intelligenza Artificiale IDSIA, Galleria 2, CH-6928 Manno-Lugano,
More informationCS261: A Second Course in Algorithms Lecture #12: Applications of Multiplicative Weights to Games and Linear Programs
CS26: A Second Course in Algorithms Lecture #2: Applications of Multiplicative Weights to Games and Linear Programs Tim Roughgarden February, 206 Extensions of the Multiplicative Weights Guarantee Last
More informationOverfitting, Bias / Variance Analysis
Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic
More informationThe Moment Method; Convex Duality; and Large/Medium/Small Deviations
Stat 928: Statistical Learning Theory Lecture: 5 The Moment Method; Convex Duality; and Large/Medium/Small Deviations Instructor: Sham Kakade The Exponential Inequality and Convex Duality The exponential
More informationWorst-Case Bounds for Gaussian Process Models
Worst-Case Bounds for Gaussian Process Models Sham M. Kakade University of Pennsylvania Matthias W. Seeger UC Berkeley Abstract Dean P. Foster University of Pennsylvania We present a competitive analysis
More informationWeighted Majority and the Online Learning Approach
Statistical Techniques in Robotics (16-81, F12) Lecture#9 (Wednesday September 26) Weighted Majority and the Online Learning Approach Lecturer: Drew Bagnell Scribe:Narek Melik-Barkhudarov 1 Figure 1: Drew
More informationExtreme Abridgment of Boyd and Vandenberghe s Convex Optimization
Extreme Abridgment of Boyd and Vandenberghe s Convex Optimization Compiled by David Rosenberg Abstract Boyd and Vandenberghe s Convex Optimization book is very well-written and a pleasure to read. The
More informationSequential Investment, Universal Portfolio Algos and Log-loss
1/37 Sequential Investment, Universal Portfolio Algos and Log-loss Chaitanya Ryali, ECE UCSD March 3, 2014 Table of contents 2/37 1 2 3 4 Definitions and Notations 3/37 A market vector x = {x 1,x 2,...,x
More informationLECTURE 5 NOTES. n t. t Γ(a)Γ(b) pt+a 1 (1 p) n t+b 1. The marginal density of t is. Γ(t + a)γ(n t + b) Γ(n + a + b)
LECTURE 5 NOTES 1. Bayesian point estimators. In the conventional (frequentist) approach to statistical inference, the parameter θ Θ is considered a fixed quantity. In the Bayesian approach, it is considered
More informationStatistical Measures of Uncertainty in Inverse Problems
Statistical Measures of Uncertainty in Inverse Problems Workshop on Uncertainty in Inverse Problems Institute for Mathematics and Its Applications Minneapolis, MN 19-26 April 2002 P.B. Stark Department
More informationConstrained Optimization and Lagrangian Duality
CIS 520: Machine Learning Oct 02, 2017 Constrained Optimization and Lagrangian Duality Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may
More informationLecture 16: FTRL and Online Mirror Descent
Lecture 6: FTRL and Online Mirror Descent Akshay Krishnamurthy akshay@cs.umass.edu November, 07 Recap Last time we saw two online learning algorithms. First we saw the Weighted Majority algorithm, which
More informationMaximum Likelihood, Logistic Regression, and Stochastic Gradient Training
Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Charles Elkan elkan@cs.ucsd.edu January 17, 2013 1 Principle of maximum likelihood Consider a family of probability distributions
More informationLecture 8 October Bayes Estimators and Average Risk Optimality
STATS 300A: Theory of Statistics Fall 205 Lecture 8 October 5 Lecturer: Lester Mackey Scribe: Hongseok Namkoong, Phan Minh Nguyen Warning: These notes may contain factual and/or typographic errors. 8.
More informationRelative Loss Bounds for Multidimensional Regression Problems
Relative Loss Bounds for Multidimensional Regression Problems Jyrki Kivinen and Manfred Warmuth Presented by: Arindam Banerjee A Single Neuron For a training example (x, y), x R d, y [0, 1] learning solves
More informationExponentiated Gradient Descent
CSE599s, Spring 01, Online Learning Lecture 10-04/6/01 Lecturer: Ofer Dekel Exponentiated Gradient Descent Scribe: Albert Yu 1 Introduction In this lecture we review norms, dual norms, strong convexity,
More informationApplications of on-line prediction. in telecommunication problems
Applications of on-line prediction in telecommunication problems Gábor Lugosi Pompeu Fabra University, Barcelona based on joint work with András György and Tamás Linder 1 Outline On-line prediction; Some
More informationSequential prediction with coded side information under logarithmic loss
under logarithmic loss Yanina Shkel Department of Electrical Engineering Princeton University Princeton, NJ 08544, USA Maxim Raginsky Department of Electrical and Computer Engineering Coordinated Science
More informationChapter 2. Binary and M-ary Hypothesis Testing 2.1 Introduction (Levy 2.1)
Chapter 2. Binary and M-ary Hypothesis Testing 2.1 Introduction (Levy 2.1) Detection problems can usually be casted as binary or M-ary hypothesis testing problems. Applications: This chapter: Simple hypothesis
More informationStatistical Data Mining and Machine Learning Hilary Term 2016
Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes
More informationStochastic and online algorithms
Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem
More informationBeyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization
JMLR: Workshop and Conference Proceedings vol (2010) 1 16 24th Annual Conference on Learning heory Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization
More informationComputational Oracle Inequalities for Large Scale Model Selection Problems
for Large Scale Model Selection Problems University of California at Berkeley Queensland University of Technology ETH Zürich, September 2011 Joint work with Alekh Agarwal, John Duchi and Clément Levrard.
More informationTheory and Applications of A Repeated Game Playing Algorithm. Rob Schapire Princeton University [currently visiting Yahoo!
Theory and Applications of A Repeated Game Playing Algorithm Rob Schapire Princeton University [currently visiting Yahoo! Research] Learning Is (Often) Just a Game some learning problems: learn from training
More informationFORMULATION OF THE LEARNING PROBLEM
FORMULTION OF THE LERNING PROBLEM MIM RGINSKY Now that we have seen an informal statement of the learning problem, as well as acquired some technical tools in the form of concentration inequalities, we
More informationFundamentals. CS 281A: Statistical Learning Theory. Yangqing Jia. August, Based on tutorial slides by Lester Mackey and Ariel Kleiner
Fundamentals CS 281A: Statistical Learning Theory Yangqing Jia Based on tutorial slides by Lester Mackey and Ariel Kleiner August, 2011 Outline 1 Probability 2 Statistics 3 Linear Algebra 4 Optimization
More informationCheng Soon Ong & Christian Walder. Canberra February June 2018
Cheng Soon Ong & Christian Walder Research Group and College of Engineering and Computer Science Canberra February June 2018 (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 254 Part V
More informationOnline Optimization : Competing with Dynamic Comparators
Ali Jadbabaie Alexander Rakhlin Shahin Shahrampour Karthik Sridharan University of Pennsylvania University of Pennsylvania University of Pennsylvania Cornell University Abstract Recent literature on online
More informationIterative Markov Chain Monte Carlo Computation of Reference Priors and Minimax Risk
Iterative Markov Chain Monte Carlo Computation of Reference Priors and Minimax Risk John Lafferty School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 lafferty@cs.cmu.edu Abstract
More informationSTATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION
STATISTICAL BEHAVIOR AND CONSISTENCY OF CLASSIFICATION METHODS BASED ON CONVEX RISK MINIMIZATION Tong Zhang The Annals of Statistics, 2004 Outline Motivation Approximation error under convex risk minimization
More informationA Mini-Course on Convex Optimization with a view toward designing fast algorithms. Nisheeth K. Vishnoi
A Mini-Course on Convex Optimization with a view toward designing fast algorithms Nisheeth K. Vishnoi i Preface These notes are largely based on lectures delivered by the author in the Fall of 2014. The
More informationA survey: The convex optimization approach to regret minimization
A survey: The convex optimization approach to regret minimization Elad Hazan September 10, 2009 WORKING DRAFT Abstract A well studied and general setting for prediction and decision making is regret minimization
More informationLearning Theory. Ingo Steinwart University of Stuttgart. September 4, 2013
Learning Theory Ingo Steinwart University of Stuttgart September 4, 2013 Ingo Steinwart University of Stuttgart () Learning Theory September 4, 2013 1 / 62 Basics Informal Introduction Informal Description
More informationBayesian Methods: Naïve Bayes
Bayesian Methods: aïve Bayes icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Last Time Parameter learning Learning the parameter of a simple coin flipping model Prior
More informationMax Margin-Classifier
Max Margin-Classifier Oliver Schulte - CMPT 726 Bishop PRML Ch. 7 Outline Maximum Margin Criterion Math Maximizing the Margin Non-Separable Data Kernels and Non-linear Mappings Where does the maximization
More information