Tutorial, Part 1: Online Convex Optimization, a Game-Theoretic Approach to Learning
http://www.cs.princeton.edu/~ehazan/tutorial/tutorial.htm
Elad Hazan (Princeton University), Satyen Kale (Yahoo Research)
Agenda
1. Motivating examples
2. Unified framework & why it makes sense
3. Algorithms & main results / techniques
4. Research directions & open questions
Section 1: Motivating Examples Inherently adversarial & online
Sequential Spam Classification: online & adversarial learning
Observe n features (words): a_t ∈ R^n
Predict label b̂_t ∈ {−1, +1}; feedback: true label b_t
Objective: average error → that of the best linear model in hindsight:
min_x Σ_t log(1 + e^{−b_t · x^⊤ a_t})
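A minimal sketch (not from the slides) of the logistic loss above, evaluated on a made-up feature vector and weight vector; the loss is small when the model agrees with the label and large when it disagrees:

```python
import numpy as np

def logistic_loss(x, a, b):
    """Loss of linear model x on example (a, b), with b in {-1, +1}."""
    return np.log1p(np.exp(-b * np.dot(x, a)))  # log(1 + e^{-b x^T a})

# hypothetical example: 3 word-count features, arbitrary model weights
a = np.array([1.0, 0.0, 2.0])
x = np.array([0.5, -0.2, 0.1])
loss_agree = logistic_loss(x, a, +1)   # x^T a = 0.7 > 0, agrees with +1
loss_disagree = logistic_loss(x, a, -1)  # disagrees with -1
```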
Spam Classifier Selection
Stream of emails; candidate classifiers (with a ∈ R^n) differ in their regularization:
log(1 + e^{−b · a^⊤ x}) + λ_1 ‖x‖²
log(1 + e^{−b · a^⊤ x}) + λ_2 ‖x‖²
...
log(1 + e^{−b · a^⊤ x}) + λ_n ‖x‖²
Objective: make predictions with accuracy → that of the best classifier in hindsight
Universal Portfolio Selection
Price relatives: r_t(i) = (closing price) / (opening price) for commodity i
Distribution of wealth: x_t; no statistical assumptions on the market
Example: r_t = (1.5, 1, 1, 1/2) and x_t = (1/2, 0, 1/2, 0) give
wealth multiplier = r_t^⊤ x_t = ½·1.5 + 0·1 + ½·1 + 0·½ = 1¼
log W_{t+1} = log W_t + log(r_t^⊤ x_t)
Objective: choose portfolios s.t. avg. log wealth → avg. log wealth of the best CRP
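The wealth-multiplier arithmetic on the slide can be checked directly; this sketch just evaluates the inner product r_t^⊤ x_t from the example:

```python
import numpy as np

# price relatives and wealth distribution from the slide's example
r = np.array([1.5, 1.0, 1.0, 0.5])  # closing / opening price per commodity
x = np.array([0.5, 0.0, 0.5, 0.0])  # fraction of wealth in each commodity

multiplier = r @ x                  # one-day wealth multiplier r^T x
log_wealth_gain = np.log(multiplier)  # increment to log W
```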
Constant Rebalanced Portfolio
[table of daily price relatives lost in extraction]
A single currency depreciates exponentially; the 50/50 Constant Rebalanced Portfolio (CRP) makes 5% every day
Recommendation Systems
Partially observed ±1 ratings matrix: ~480,000 users × ~18,000 movies
Objective: rating accuracy → that of the best low-rank matrix
Computational challenge: optimizing over low-rank matrices
Ad Selection
Yahoo, Google, etc. display ads on pages; clicks generate revenue
Abstraction: observe a stream of users; given a user, select one ad to display
Feedback: click (or not) on the ad shown
Objective: average revenue per ad → that of the best policy in a given class
Additional difficulty: partial (bandit) feedback (feedback only for the ad actually shown)
2. Methodology: Online Convex Optimization
Definition & relation to PAC learning; go through the examples; argue that it makes sense
Statistical (PAC) Learning
Nature: i.i.d. samples from distribution D over A × B = {(a, b)}: (a_1, b_1), ..., (a_m, b_m)
Learner: hypothesis h ∈ H; loss, e.g. ℓ(h, (a, b)) = (h(a) − b)²
err(h) = E_{(a,b)∼D}[ℓ(h, (a, b))]
Hypothesis class H: X → Y is learnable if for all ε, δ > 0 there exists an algorithm s.t. after seeing m examples, for m = poly(1/δ, 1/ε, dimension(H)), it finds h s.t. with probability 1 − δ:
err(h) ≤ min_{h* ∈ H} err(h*) + ε
Online Learning in Games
Iteratively, for t = 1, 2, ..., T:
Player: h_t ∈ H
Adversary: (a_t, b_t) ∈ A × B
Loss: ℓ(h_t, (a_t, b_t))
Goal: minimize (average, expected) regret:
(1/T) [ Σ_t ℓ(h_t, (a_t, b_t)) − min_{h ∈ H} Σ_t ℓ(h, (a_t, b_t)) ] → 0 as T → ∞
Can we learn in games efficiently? Regret = o(T)? (i.e. an analogue of the fundamental theorem of statistical learning)
Online vs. Batch (PAC, Statistical)
Online convex optimization & regret minimization:
- learning against an adversary; performance measure: regret
- streaming setting: process examples sequentially
- fewer assumptions, stronger guarantees
- harder algorithmically
Batch (PAC / statistical) learning:
- nature is benign: i.i.d. samples from an unknown distribution; performance measure: generalization error
- batch: process the dataset multiple times
- weaker guarantees (changing distribution, etc.)
- easier algorithmically
Online Convex Optimization
Adversary: convex cost function f_t
Online player: point x_t in convex set K ⊆ R^n; suffers loss f_t(x_t)
Total loss: Σ_t f_t(x_t)
Access to K, f_t?
Regret = Σ_t f_t(x_t) − min_{x* ∈ K} Σ_t f_t(x*)
Prediction from Expert Advice
Decision set = set of all distributions over n experts:
K = Δ_n = {x ∈ R^n : Σ_i x_i = 1, x_i ≥ 0}
Cost functions? Let c_t(i) = loss of expert i in round t:
f_t(x) = c_t^⊤ x = Σ_i c_t(i) x(i) = expected loss when choosing an expert according to distribution x
Regret = difference in expected loss vs. the best expert:
Regret = Σ_t f_t(x_t) − min_{x* ∈ K} Σ_t f_t(x*)
Parameterized Hypothesis Classes
H parameterized by a vector x in a convex set K ⊆ R^n; in round t, define the loss function from the example (a_t, b_t):
1. Online linear spam filtering: K = {x : ‖x‖ ≤ ω}; loss f_t(x) = (a_t^⊤ x − b_t)²
2. Online soft-margin SVM: K = R^n; loss f_t(x) = max{0, 1 − b_t a_t^⊤ x} + λ‖x‖²
3. Online matrix completion: K = {X ∈ R^{n×m} : matrices with bounded nuclear norm ‖X‖_* ≤ k}; at time t, if a_t = (i_t, j_t), loss f_t(X) = (X(i_t, j_t) − b_t)²
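The three loss functions above can be written down directly; this is an illustrative sketch (names and the sample values are made up, λ fixed arbitrarily):

```python
import numpy as np

def spam_loss(x, a, b):
    """Online linear spam filtering: squared loss (a^T x - b)^2."""
    return (a @ x - b) ** 2

def svm_loss(x, a, b, lam=0.1):
    """Online soft-margin SVM: hinge loss plus L2 regularization."""
    return max(0.0, 1.0 - b * (a @ x)) + lam * (x @ x)

def completion_loss(X, i, j, b):
    """Online matrix completion: squared loss on the observed entry."""
    return (X[i, j] - b) ** 2

# toy evaluations at the all-zeros model
x0 = np.zeros(2)
l1 = spam_loss(x0, np.ones(2), 1.0)            # (0 - 1)^2 = 1
l2 = svm_loss(x0, np.ones(2), 1.0)             # hinge = 1, reg = 0
l3 = completion_loss(np.zeros((2, 2)), 0, 1, 1.0)
```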
Universal Portfolio Selection
Recall the change in log-wealth: log W_{t+1} = log W_t + log(r_t^⊤ x_t)
Thus: K = Δ_n = {x ∈ R^n : Σ_i x_i = 1, x_i ≥ 0} and f_t(x) = −log(r_t^⊤ x)
A universal portfolio algorithm:
Regret_T = (1/T) Σ_t f_t(x_t) − (1/T) min_{x* ∈ K} Σ_t f_t(x*) → 0
Bandit Online Convex Optimization
Same as OCO, except f_t is not revealed to the learner; only f_t(x_t) is.
E.g. ad selection: a bandit version of prediction with expert advice.
Decision set K = set of distributions over possible ads to be shown:
K = Δ_n = {x ∈ R^n : Σ_i x_i = 1, x_i ≥ 0}
Loss function: let c_t(i) = 0 if ad i would be clicked were it shown, and 1 otherwise:
f_t(x) = c_t^⊤ x = Σ_i c_t(i) x(i)
Algorithms
- randomized weighted majority
- follow-the-leader
- online gradient descent, regularization
- log-regret, Online Newton Step
- bandit algorithms (FKM, barriers, volumetric spanners)
- projection-free methods (FW algorithm, online FW)
Part 1: Foundations
Prediction from Expert Advice
n experts (e.g. classifiers with different regularizations); loss per round: 0 if correct, 1 for a mistake
Can we perform as well as the best expert? (non-stochastic, adversarial setting)
Weighted Majority [Littlestone-Warmuth '89]
Binary classification, {0, 1} loss; n experts with weights x_t(h), initially x_1(h) = 1 for all h.
In round t, predict by weighted majority vote.
Update weights:
x_{t+1}(h) = (1 − η) x_t(h) if expert h errs
x_{t+1}(h) = x_t(h) otherwise
Thm: Total errors ≤ 2(1 + η) · (best expert's errors) + (2 log n)/η
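A single round of the update above, as a sketch (the round's expert predictions and truth are made up):

```python
import numpy as np

def weighted_majority_round(weights, predictions, truth, eta=0.5):
    """One round of deterministic Weighted Majority with {0,1} labels.

    weights: current expert weights; predictions: each expert's {0,1} guess.
    Returns (algorithm's prediction, updated weights)."""
    # predict with the weighted majority vote
    vote_for_1 = weights[predictions == 1].sum()
    pred = 1 if vote_for_1 >= weights.sum() / 2 else 0
    # multiply the weight of every expert that erred by (1 - eta)
    new_weights = np.where(predictions == truth, weights, (1 - eta) * weights)
    return pred, new_weights

weights = np.ones(4)                     # 4 experts, initial weight 1
preds = np.array([1, 1, 0, 1])
pred, weights = weighted_majority_round(weights, preds, truth=0)
```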
Weighted Majority: Analysis
Total weight: Φ_t = Σ_h x_t(h)
1. An algorithm mistake means at least half the weight was on experts who erred, so:
Φ_{t+1} ≤ ½ Φ_t (1 − η) + ½ Φ_t = Φ_t (1 − η/2)
2. If h* is the best expert:
Φ_T ≥ x_T(h*) = (1 − η)^{#(h* errors)} = (1 − η)^{#(best expert errors)}
Thus: (1 − η)^{#(best expert errors)} ≤ Φ_T ≤ Φ_0 (1 − η/2)^{#(alg errors)} = n (1 − η/2)^{#(alg errors)}
Taking logarithms: #(alg errors) ≤ 2(1 + η) #(best expert errors) + (2 log n)/η
Randomized Weighted Majority [LW '89]
More general losses: expert h at time t suffers loss c_t(h) ∈ [0, 1]; n experts with weights x_t(h), initially x_1(h) = 1.
In round t, predict h with probability x_t(h) / Σ_{h'} x_t(h')
Update weights: x_{t+1}(h) = e^{−η c_t(h)} x_t(h)
Thm: For η = √(log n / T): Regret = O(√(T log n))
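A self-contained simulation of the algorithm above against random [0, 1] costs (the cost distribution and horizon are arbitrary choices, not from the slides); the observed regret stays well under the O(√(T log n)) bound:

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 10, 400
eta = np.sqrt(np.log(n) / T)        # step size from the theorem

weights = np.ones(n)
alg_loss = 0.0
total_cost = np.zeros(n)
for t in range(T):
    x = weights / weights.sum()     # play the normalized weight distribution
    c = rng.random(n)               # adversary's per-expert costs in [0, 1]
    alg_loss += x @ c               # expected loss this round
    total_cost += c
    weights *= np.exp(-eta * c)     # multiplicative weights update

regret = alg_loss - total_cost.min()  # vs. best expert in hindsight
```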
Reminder: Online Convex Optimization
Adversary: convex cost function f_t
Online player: point x_t in convex set K ⊆ R^n; suffers loss f_t(x_t)
Total loss: Σ_t f_t(x_t)
Access to f_t? Benchmark?
Regret = Σ_t f_t(x_t) − min_{x* ∈ K} Σ_t f_t(x*)
Minimize Regret: Best-in-Hindsight
Most natural approach: play the leader,
x_{t+1} = arg min_{x ∈ K} Σ_{i=1}^{t} f_i(x)
Provably works if x_t is close to x_{t+1} (even for non-convex sets [Kalai-Vempala '05]).
But the decisions may be unstable! (teaser: this can be fixed with regularization:
x_{t+1} = arg min_{x ∈ K} { Σ_{i=1}^{t} f_i(x) + (1/η) R(x) })
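The instability can be seen concretely: on a standard example (alternating linear losses on the interval [−1, 1]; the specific cost sequence here is an illustrative choice, not from the slides), follow-the-leader flips between the endpoints every round and suffers regret linear in T:

```python
# adversary plays linear losses f_t(x) = c_t * x on K = [-1, 1];
# c_1 = 0.5, then the costs alternate -1, +1, -1, ...
T = 100
cs = [0.5] + [(-1.0) ** t for t in range(1, T)]

def ftl(cum_cost):
    """FTL on [-1, 1]: minimize cum_cost * x, i.e. pick an endpoint."""
    return -1.0 if cum_cost > 0 else 1.0

cum, alg_loss = 0.0, 0.0
for c in cs:
    x = ftl(cum)          # play the leader of rounds seen so far
    alg_loss += c * x     # FTL lands on the wrong endpoint every round
    cum += c

best = min(cum, -cum)     # best fixed point in hindsight (x = +1 or -1)
regret = alg_loss - best  # grows linearly: regret = T
```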
Offline: Steepest Descent
Move in the direction of steepest descent: [∇f(x)]_i = ∂f(x)/∂x_i
(figure: iterates p_1, p_2, p_3, ... converging to the optimum p*)
Online Gradient Descent [Zinkevich '03]
y_{t+1} = x_t − η ∇f_t(x_t)
x_{t+1} = arg min_{x ∈ K} ‖y_{t+1} − x‖   (projection back into K)
Theorem: Regret = O(√T)
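A minimal sketch of the two-step update (gradient step, then projection), using a Euclidean ball as K since its projection has a closed form; the test run with a fixed quadratic loss is an illustrative choice, not from the slides:

```python
import numpy as np

def project_ball(y, radius=1.0):
    """Euclidean projection onto K = {x : ||x|| <= radius}."""
    norm = np.linalg.norm(y)
    return y if norm <= radius else y * (radius / norm)

def ogd(grad_fn, x0, T, eta):
    """Online gradient descent: y_{t+1} = x_t - eta*grad, then project."""
    x = x0.copy()
    xs = []
    for t in range(T):
        xs.append(x.copy())
        g = grad_fn(t, x)                 # gradient of f_t at x_t
        x = project_ball(x - eta * g)     # gradient step + projection
    return xs

# toy run: every f_t(x) = ||x - target||^2 with target inside the ball
target = np.array([0.3, -0.4])
xs = ogd(lambda t, x: 2 * (x - target), np.zeros(2), T=200, eta=0.1)
```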
Analysis
Write ∇_t = ∇f_t(x_t). Observation 1:
‖y_{t+1} − x*‖² = ‖x_t − x*‖² − 2η ∇_t^⊤ (x_t − x*) + η² ‖∇_t‖²
Observation 2 (Pythagoras): ‖x_{t+1} − x*‖ ≤ ‖y_{t+1} − x*‖
Thus: ‖x_{t+1} − x*‖² ≤ ‖x_t − x*‖² − 2η ∇_t^⊤ (x_t − x*) + η² ‖∇_t‖²
Convexity: Σ_t [f_t(x_t) − f_t(x*)] ≤ Σ_t ∇_t^⊤ (x_t − x*)
≤ (1/2η) Σ_t (‖x_t − x*‖² − ‖x_{t+1} − x*‖²) + (η/2) Σ_t ‖∇_t‖²
≤ (1/2η) ‖x_1 − x*‖² + (η/2) T G² = O(√T)
(where G bounds ‖∇_t‖ and η is set ∝ 1/√T)
Lower Bound
2 experts, T iterations: the first expert has random loss in {−1, 1}; the second expert's loss is the negation of the first's.
Any algorithm has expected loss 0, so (thinking of the first expert only):
Regret = E[ | #(+1's) − #(−1's) | ] = Ω(√T)
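The Ω(√T) quantity above can be estimated empirically; this sketch (trial count and seed arbitrary) averages |sum of T random signs|, which concentrates near √(2T/π) ≈ 0.8·√T:

```python
import numpy as np

rng = np.random.default_rng(1)
T, trials = 10_000, 200

# expert 1's losses are uniform in {-1, +1}; expert 2's are the negation,
# so any algorithm's expected loss is 0 and the regret against the better
# expert is |sum of expert 1's losses|
sums = rng.choice([-1, 1], size=(trials, T)).sum(axis=1)
avg_regret = np.abs(sums).mean()   # empirical estimate of E|sum|, ~0.8*sqrt(T)
```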
Algorithms
- randomized weighted majority
- follow-the-leader
- online gradient descent, regularization
- log-regret, Online Newton Step
- bandit algorithms (FKM, barriers, volumetric spanners)
- projection-free methods (FW algorithm, online FW)
Part 2 coming up