Learnability, Stability, Regularization and Strong Convexity


1 Learnability, Stability, Regularization and Strong Convexity
Nati Srebro (Toyota Technological Institute at Chicago), Shai Shalev-Shwartz (HUJI), Ohad Shamir (Weizmann), Karthik Sridharan (Cornell), Ambuj Tewari (Michigan) (2008)

2 What is this tutorial about?
- Characterizing learnability: what makes a problem learnable? Necessary and sufficient conditions; the role of stability.
- Universal learning rules: anything learnable is learnable using XXX; only need to consider rules of the form XXX.
- The relationship between online and statistical learning.

3 Plan
- Statistical learning and its relationship to online learning
- Characterizing (statistical) learnability: stability as the master property
- Stability in online learning
- Convex problems: strong convexity as the master property

4 The General Learning Setting, aka Stochastic Optimization [Vapnik 95]
min_{w ∈ W} F(w) = E_{z~D}[f(w, z)]
- Known objective function f: W × Z → R
- Unknown distribution D over Z; equivalently, an unknown distribution over random functions w ↦ f(w, z)
- Given an i.i.d. sample z_1, z_2, … ~ D, i.e. i.i.d. samples of the random functions f(·, z_1), f(·, z_2), …

5 General Learning: Examples
Minimize F(w) = E_z[f(w; z)] based on a sample z_1, z_2, …, z_n.
- Supervised learning: z = (x, y); w specifies a predictor h_w: X → Y; f(w; (x,y)) = loss(h_w(x), y). E.g. linear prediction: w, φ(x) ∈ R^d, f(w; (x,y)) = loss(⟨w, φ(x)⟩, y).
- Unsupervised learning, e.g. k-means clustering: z = x ∈ R^d; w = (μ[1], …, μ[k]) ∈ R^{d×k} specifies k cluster centers; f((μ[1], …, μ[k]); x) = min_j ‖μ[j] − x‖².
- Density estimation: w specifies a probability density p_w(x); f(w; x) = −log p_w(x).
- Optimization in an uncertain environment, e.g.: z = traffic delays on each road segment; w = route chosen (indicator over road segments in the route); f(w; z) = ⟨w, z⟩ = total delay along the route.
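To make the template concrete, here is a small Python sketch (the toy data, function names, and the particular losses are illustrative choices, not from the slides) instantiating f(w; z) for two of the examples above: k-means clustering and linear prediction with the hinge loss.

```python
import numpy as np

# Each example below is an instance of the template F(w) = E_z[f(w; z)],
# estimated here by an empirical average over a sample z_1, ..., z_n.

def f_kmeans(centers, x):
    """k-means objective: squared distance from point x to its nearest center.
    centers: (k, d) array, x: (d,) array."""
    return np.min(np.sum((centers - x) ** 2, axis=1))

def f_linear_hinge(w, z):
    """Linear prediction with hinge loss: z = (x, y), y in {-1, +1}."""
    x, y = z
    return max(0.0, 1.0 - y * np.dot(w, x))

def empirical_objective(f, w, sample):
    """F_hat(w) = (1/n) * sum_i f(w; z_i)."""
    return np.mean([f(w, z) for z in sample])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # toy k-means sample: 100 points in R^2, 3 candidate centers
    xs = rng.normal(size=(100, 2))
    centers = rng.normal(size=(3, 2))
    print("k-means F_hat:", empirical_objective(f_kmeans, centers, xs))
    # toy supervised sample: (x, y) pairs with random labels
    pairs = [(rng.normal(size=5), rng.choice([-1, 1])) for _ in range(100)]
    w = np.zeros(5)
    print("hinge F_hat:", empirical_objective(f_linear_hinge, w, pairs))
```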

6 General Learning Setting: Learnability
Minimize F(w) = E_z[f(w, z)] based on a sample z_1, …, z_m. The problem is specified by f and the domain W.
The problem is learnable if there is a learning rule w(z_1, …, z_m) for choosing w based on z_1, …, z_m such that for every ε > 0 and large enough sample size m, for any distribution D:
E_{z_1,…,z_m ~ D}[F(w(z_1, …, z_m))] ≤ inf_{w ∈ W} F(w) + ε = F(w*) + ε
Supervised learning: when f(w;(x,y)) = loss(h_w(x), y), learnable ⟺ agnostic PAC learnable.

7 Online Learning (Optimization)
Adversary plays f(·; z_1), f(·; z_2), f(·; z_3), …; learner plays w_1, w_2, w_3, …
- Known function f, unknown sequence z_1, z_2, …
- Online learning rule: w_i(z_1, …, z_{i−1})
- Goal: minimize Σ_i f(w_i, z_i)
- Differences vs the stochastic setting: any sequence, not necessarily i.i.d.; no distinction between train and test.

8 Online and Stochastic Regret
Problem specified by f(·,·) and domain W.
- Online regret: for any sequence, (1/m) Σ_{i=1}^m f(w_i(z_1, …, z_{i−1}), z_i) ≤ inf_{w ∈ W} (1/m) Σ_{i=1}^m f(w, z_i) + Reg(m)
- Statistical regret: for any distribution D, E_{z_1,…,z_m ~ D}[F_D(w(z_1, …, z_m))] ≤ inf_{w ∈ W} F_D(w) + ε(m)
- Online-to-batch: set w(z_1, …, z_m) = w_i with probability 1/m; then E[F(w)] ≤ F(w*) + Reg(m)
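A minimal sketch of the online-to-batch conversion just described, assuming some online rule is given as a callable; the mean-estimation losses and the running-mean rule below are placeholder choices for illustration, not from the slides.

```python
import numpy as np

def online_to_batch(online_rule, sample, rng):
    """Run the online rule over the sample, collect the iterates w_1..w_m,
    and return one of them chosen uniformly at random.
    By the online-to-batch argument, E[F(w)] <= F(w*) + Reg(m)."""
    iterates = []
    for i in range(len(sample)):
        # w_i may depend only on the prefix z_1, ..., z_{i-1}
        iterates.append(online_rule(sample[:i]))
    return iterates[rng.integers(len(iterates))]

# --- placeholder instantiation: squared loss for mean estimation, f(w, z) = (w - z)^2 ---
def running_mean_rule(prefix):
    """A simple online rule: play the mean of the observed prefix (0 on the empty prefix)."""
    return float(np.mean(prefix)) if len(prefix) > 0 else 0.0

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sample = rng.normal(loc=1.0, scale=1.0, size=200)
    w = online_to_batch(running_mean_rule, sample, rng)
    print("selected iterate:", w)  # should be close to the population minimizer 1.0
```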

9 Online vs Statistical
Online learnable vs statistically learnable: the thresholds example. θ ∈ [0,1], h_θ(x) = [x ≥ θ]; z = (x, y) with x ∈ [0,1], y ∈ ±1; f(θ; (x, y)) = [y ≠ h_θ(x)] = [y ≠ sign(x − θ)].
- Statistically: learnable with excess error ε using m = O(1/ε²) samples
- Online: an adversary can force the learner to always make a mistake on an adversarial sequence (Reg(m) = 1)

10 What is Learnable? Is a problem, specified by f and W, learnable? I.e., can we drive the regret / excess error ε to zero? Using what learning rule?

11 Supervised Classification
f(w;(x,y)) = loss(h_w(x), y), binary hypothesis classes: h_w(x) ∈ ±1. Diagram relating the following properties:
- { h_w | w ∈ W } has finite VC dimension
- Uniform convergence: sup_{w ∈ W} |F(w) − F̂(w)| → 0 as n → ∞
- Online learnable
- Learnable using ERM: ŵ = arg min F̂(w), with F(ŵ) → F(w*) as n → ∞
- Learnable (using some rule): F(w̃) → F(w*) as n → ∞

12 Supervised Classification
f(w;(x,y)) = loss(h_w(x), y), real-valued h_w (with an appropriate loss function). Same diagram [Alon Ben-David Cesa-Bianchi Haussler 93]:
- { h_w | w ∈ W } has finite fat-shattering dimension
- Uniform convergence: sup_{w ∈ W} |F(w) − F̂(w)| → 0 as n → ∞
- Online learnable
- Learnable using ERM: ŵ = arg min F̂(w), with F(ŵ) → F(w*) as n → ∞
- Learnable (using some rule): F(w̃) → F(w*) as n → ∞

13 Beyond Supervised Learning
- Supervised learning, f(w; (x,y)) = loss(h_w(x), y): a combinatorial condition (VC / fat-shattering dimension) is necessary and sufficient for learnability, and ERM is universal (if learnable at all, learnable with ERM).
- General learning / stochastic optimization, arbitrary f(w, z): ????

14 Convex Lipschitz Problems
- W a convex bounded subset of a Hilbert space (e.g. R^d): for all w ∈ W, ‖w‖₂ ≤ B
- For each z, f(w, z) is convex and L-Lipschitz w.r.t. w: |f(w, z) − f(w′, z)| ≤ L‖w − w′‖₂
- Online Gradient Descent: Reg(m) ≤ √(B²L²/m)
- Stochastic setting?
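A sketch of projected online gradient descent under the assumptions on this slide (‖w‖₂ ≤ B, L-Lipschitz losses) with the usual step size η = B/(L√m); the linear losses in the demo are an illustrative instance, not part of the slides.

```python
import numpy as np

def project_ball(w, B):
    """Euclidean projection onto {w : ||w||_2 <= B}."""
    norm = np.linalg.norm(w)
    return w if norm <= B else w * (B / norm)

def ogd(grad_f, zs, dim, B, L):
    """Projected online gradient descent on the sequence z_1, ..., z_m.
    grad_f(w, z) returns a (sub)gradient of f(., z) at w, with norm <= L.
    Step size eta = B / (L * sqrt(m)) gives averaged regret O(sqrt(B^2 L^2 / m))."""
    m = len(zs)
    eta = B / (L * np.sqrt(m))
    w = np.zeros(dim)
    iterates = []
    for z in zs:
        iterates.append(w.copy())            # w_i is chosen before z_i is revealed
        w = project_ball(w - eta * grad_f(w, z), B)
    return iterates

if __name__ == "__main__":
    # toy instance: linear losses f(w, z) = <w, z>, so grad_f(w, z) = z
    rng = np.random.default_rng(0)
    zs = rng.uniform(-1, 1, size=(500, 3))
    ws = ogd(lambda w, z: z, zs, dim=3, B=1.0, L=np.sqrt(3))
    avg_loss = np.mean([np.dot(w, z) for w, z in zip(ws, zs)])
    print("average online loss:", avg_loss)
```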

15 Stochastic Convex Optimization
Generalized linear objectives: f(w, z) = g(⟨w, φ(z)⟩; z) with g being l_g-Lipschitz and ‖φ‖ ≤ R, so f is L-Lipschitz in w with L = l_g·R.
- Uniform convergence: sup_{w ∈ W} |F(w) − F̂(w)| ≤ O(√(B²L² log(1/δ)/m)) w.p. 1 − δ, hence E[F(ŵ)] ≤ F(w*) + O(√(B²L²/m))
General Lipschitz-bounded problems:
- Learnable in the statistical setting? Of course, via online-to-batch!
- Using ERM? Is there uniform convergence? A finite VC-type dimension?

16 Empirical Minimizer Not Consistent!
f(w; α) = ‖α ⊙ w‖ (⊙ = elementwise product), W = {w ∈ R^d : ‖w‖ ≤ 1}, α ∈ {0,1}^d with α ~ Uniform({0,1}^d), i.e. α[i] i.i.d. uniform 0/1.
If d ≫ 2^m (think of d = ∞), then with high probability there exists a coordinate j that is never on in the sample, i.e. α_i[j] = 0 for all i = 1..m. Then F̂(e_j) = 0 while F(e_j) = ½, so sup_{w ∈ W} |F(w) − F̂(w)| ≥ ½: no uniform convergence! Moreover e_j is an empirical minimizer with F(e_j) = ½, far from F(w*) = F(0) = 0.

17 Empirical Minimizer Not Consistent!
Same construction with an offset: f(w; (α, x)) = ‖α ⊙ (w − x)‖², with α ∈ {0,1}^d, x ∈ R^d, ‖x‖ ≤ 1, and α ~ Uniform({0,1}^d), i.e. α[i] i.i.d. uniform 0/1.
As before, if d ≫ 2^m then with high probability some coordinate j is never on in the sample (α_i[j] = 0 for all i = 1..m), so F̂(e_j) = 0 while F(e_j) = ½, and sup_{w ∈ W} |F(w) − F̂(w)| ≥ ½: no uniform convergence, and e_j is an empirical minimizer with F(e_j) = ½, far from F(w*) = F(0) = 0.
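The construction is easy to simulate. A small sketch, assuming the simpler x = 0 version of the previous slide (f(w; α) = ‖α ⊙ w‖), that finds a coordinate e_j whose empirical risk is zero while its population risk is ½:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 6, 1000                              # d >> 2^m, so some coordinate is off in every sample point
alphas = rng.integers(0, 2, size=(m, d))    # alpha[i] ~ iid Uniform{0,1}

def f(w, alpha):
    """f(w; alpha) = || alpha (elementwise) w ||_2, the x = 0 version of the example."""
    return np.linalg.norm(alpha * w)

# find a coordinate j that is never "on" in the sample
never_on = np.flatnonzero(alphas.sum(axis=0) == 0)
assert never_on.size > 0, "increase d (need d >> 2^m)"
j = never_on[0]
e_j = np.zeros(d)
e_j[j] = 1.0

emp_risk = np.mean([f(e_j, a) for a in alphas])   # = 0: e_j is an empirical minimizer
pop_risk = 0.5                                    # E[f(e_j; alpha)] = E[alpha[j]] = 1/2
print("empirical risk of e_j:", emp_risk, " population risk:", pop_risk)
# F(w*) = F(0) = 0, so this empirical minimizer is far from optimal: no uniform convergence
```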

18 The same diagram, now in the general setting
The boxes from slides 11 and 12, for a general f(w, z); the arrows marked "supervised learning" hold only in the supervised setting, while the remaining implications hold in the general setting:
- { z ↦ f(w;z) | w ∈ W } has finite fat-shattering dimension
- Uniform convergence: sup_{w ∈ W} |F(w) − F̂(w)| → 0 as n → ∞
- Learnable with ERM: F(ŵ) → F(w*) as n → ∞
- Online learnable
- Learnable (using some rule): F(w̃) → F(w*) as n → ∞

19 Stochastic Convex Optimization and Supervised Learning
For supervised learning objectives, e.g. f(w; (x, y)) = [1 − y·h_w(x)]₊ (y ∈ ±1) or f(w; (x, y)) = (h_w(x) − y)², learnability is equivalent to uniform convergence.
What types of predictors h_w yield convex problems?
- When y = +1, h_w(x) should be convex in w; when y = −1, h_w(x) should be concave in w ⇒ h_w(x) should be linear in w.
Convex supervised learning ⇒ generalized linear objective (y ∈ {−1, +1}).

20 Stochastic Convex Optimization
- Empirical minimization might not be consistent
- Learnable using a specific procedural rule (online-to-batch conversion of online gradient descent)
- ??????????

21 Strongly Convex Objective
f(w, z) is λ-strongly convex in w iff f(αw + (1−α)w′, z) ≤ α f(w, z) + (1−α) f(w′, z) − (λ/2)·α(1−α)·‖w − w′‖₂² (equivalent to ∇²_w f(w, z) ⪰ λ when twice differentiable).
If f(w, z) is λ-strongly convex and L-Lipschitz w.r.t. w:
- Online Gradient Descent [Hazan Kalai Kale Agarwal 2006]: Reg(m) ≤ O(L² log(m) / (λm))
- Stochastic setting? ERM?

22 Strongly Convex Objective
(Same setting as the previous slide: f(w, z) is λ-strongly convex and L-Lipschitz w.r.t. w.)
- Online Gradient Descent [Hazan Kalai Kale Agarwal 2006]: Reg(m) ≤ O(L² log(m) / (λm))
- Empirical Risk Minimization: E[F(ŵ)] ≤ F(w*) + O(L² / (λm))
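A sketch of online gradient descent for λ-strongly convex losses with the step size η_t = 1/(λt) used in the logarithmic-regret analysis; the regularized hinge losses are just an illustrative instance, and projection onto W is omitted for brevity.

```python
import numpy as np

def ogd_strongly_convex(grad_f, zs, dim, lam):
    """OGD with step eta_t = 1/(lam * t), appropriate for lam-strongly convex,
    L-Lipschitz losses; regret O(L^2 log(m) / (lam * m)) [Hazan-Kalai-Kale-Agarwal].
    (Projection onto W is omitted here for brevity.)"""
    w = np.zeros(dim)
    iterates = []
    for t, z in enumerate(zs, start=1):
        iterates.append(w.copy())
        w = w - (1.0 / (lam * t)) * grad_f(w, z)
    return iterates

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    lam = 0.1
    # illustrative strongly convex losses: f(w, (x, y)) = lam/2 ||w||^2 + max(0, 1 - y <w, x>)
    def grad_f(w, z):
        x, y = z
        g = lam * w
        if 1.0 - y * np.dot(w, x) > 0:
            g = g - y * x
        return g
    zs = [(rng.normal(size=5), rng.choice([-1.0, 1.0])) for _ in range(1000)]
    ws = ogd_strongly_convex(grad_f, zs, dim=5, lam=lam)
    print("final iterate norm:", np.linalg.norm(ws[-1]))
```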

23 Strong Convexity and Stability
Definition: a rule w(z_1, …, z_m) is β(m)-stable if |f(w(z_1, …, z_{m−1}), z_m) − f(w(z_1, …, z_m), z_m)| ≤ β(m).
- If f is λ-strongly convex and L-Lipschitz, then for the ERM: |f(w(z_1, …, z_{m−1}), z_m) − f(w(z_1, …, z_m), z_m)| ≤ β = 4L²/(λm)
- If a symmetric w is β-stable: E[F(w_{m−1})] ≤ E[F̂(w_m)] + β
- Conclusion for the ERM: E[F̂(ŵ)] ≤ E[F̂(w*)] = F(w*), hence E[F(ŵ)] ≤ F(w*) + β(m)

24 The same diagram, revisited
- { z ↦ f(w;z) | w ∈ W } has finite fat-shattering dimension
- Uniform convergence: sup_{w ∈ W} |F(w) − F̂(w)| → 0 as n → ∞ (here: not even locally)
- Empirical minimizer is consistent: F(ŵ) → F(w*) as n → ∞
- Online learnable
- Solvable (using some algorithm): F(w̃) → F(w*) as n → ∞
(Arrows marked "supervised learning" hold only in the supervised setting.)

25 Back to Weak Convexity
f(w, z) is L-Lipschitz (and convex), ‖w‖₂ ≤ B. Use regularized ERM:
w_λ = arg min_{w ∈ W} F̂(w) + (λ/2)‖w‖₂²
Setting λ = √(L²/(B²m)): E[F(w_λ)] ≤ F(w*) + O(√(L²B²/m))
Key: the strongly convex regularizer ensures stability.
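A sketch of the regularized ERM rule w_λ with λ set as on this slide; the toy absolute-loss data, the rough Lipschitz constant, and the solver choice (subgradient descent on the strongly convex regularized empirical objective) are assumptions for illustration, not part of the slides.

```python
import numpy as np

def regularized_erm(sample, f_grad, dim, B, L, steps=2000):
    """w_lambda = argmin_w  F_hat(w) + (lam/2) ||w||^2,  with lam = sqrt(L^2 / (B^2 * m)).
    Solved here by subgradient descent on the (lam-strongly convex) regularized objective."""
    m = len(sample)
    lam = np.sqrt(L ** 2 / (B ** 2 * m))
    w = np.zeros(dim)
    for t in range(1, steps + 1):
        # subgradient of F_hat(w) + (lam/2)||w||^2 at the current point
        g = np.mean([f_grad(w, z) for z in sample], axis=0) + lam * w
        w = w - (1.0 / (lam * t)) * g       # step size suited to a lam-strongly convex objective
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # toy absolute-loss problem: f(w; z) = |<w, x> - y|, z = (x, y)
    sample = [(rng.normal(size=3), rng.normal()) for _ in range(200)]
    def f_grad(w, z):
        x, y = z
        return np.sign(np.dot(w, x) - y) * x
    w_lam = regularized_erm(sample, f_grad, dim=3, B=1.0, L=np.sqrt(3))
    print("w_lambda:", w_lam)
```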

26 The Role of Regularization
Typical view of regularization: adding a regularization term effectively constrains the domain to a lower-complexity domain W_r = {w : ‖w‖ ≤ r}; learning guarantees (e.g. for SVMs, LASSO) are actually for empirical minimization inside W_r, and are based on uniform convergence in W_r.
In our case:
- No uniform convergence in W_r, for any r > 0
- No uniform convergence even of the regularized loss
- Cannot solve the stochastic optimization problem by restricting to W_r, for any r
What regularization buys here is stability.

27 Stability Characterizes Learnability
Theorem: Learnable with a (symmetric) ERM w iff, for every D, E[|f(w(z_1, …, z_{m−1}), z_m) − f(w(z_1, …, z_m), z_m)|] ≤ β(m) for some β(m) → 0.
Theorem: Learnable iff there is a symmetric w such that, for every D,
- w is an almost ERM (AERM): E[F̂(w) − inf_{w′ ∈ W} F̂(w′)] ≤ ε(m), and
- w is stable: E[|f(w(z_1, …, z_{m−1}), z_m) − f(w(z_1, …, z_m), z_m)|] ≤ β(m)
for some ε(m) → 0, β(m) → 0.
Summary diagram: finite fat-shattering dimension, uniform convergence, stable ERM, learnable with ERM (tied together in supervised learning); stable AERM ⟺ learnable.

28 Stability in Online Learning
Reminder: a rule w(z_1, …, z_m) is β(m)-stable if |f(w(z_1, …, z_{m−1}), z_m) − f(w(z_1, …, z_m), z_m)| ≤ β(m).
- Follow The Leader (FTL): w_m(z_1, …, z_{m−1}) = arg min_w Σ_{i=1}^{m−1} f(w, z_i)
- Be The Leader (BTL): w_m = arg min_w Σ_{i=1}^{m} f(w, z_i) (uses z_m too), with Reg_BTL(m) ≤ 0
- If the ERM is β(m)-stable: Reg_FTL(m) ≤ Reg_BTL(m) + (1/m) Σ_i β(i) ≤ (1/m) Σ_i β(i)
- E.g. for λ-strongly convex and L-Lipschitz f: Reg_FTL(m) ≤ O(L² log(m) / (λm))
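A sketch of Follow The Leader on a strongly convex instance where the leader has a closed form (squared losses, so the leader on a prefix is its running mean); in general the arg min would have to be computed numerically. The instance and names are illustrative assumptions.

```python
import numpy as np

def follow_the_leader(zs):
    """FTL for f(w, z) = (w - z)^2: the leader w_m = argmin_w sum_{i<m} f(w, z_i)
    is the mean of z_1..z_{m-1}. For strongly convex losses FTL is stable and
    enjoys O(log m / m) averaged regret."""
    plays, losses = [], []
    for i, z in enumerate(zs):
        w = np.mean(zs[:i]) if i > 0 else 0.0   # leader on the prefix (0 on the empty prefix)
        plays.append(w)
        losses.append((w - z) ** 2)
    return plays, losses

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    zs = rng.normal(loc=0.5, size=1000)
    plays, losses = follow_the_leader(zs)
    best_fixed = np.mean((np.mean(zs) - zs) ** 2)   # loss of the best fixed w in hindsight
    print("averaged regret:", np.mean(losses) - best_fixed)
```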

29 Regularization and Stability
How do we deal with a non-strongly-convex objective?
- Follow The Regularized Leader (FTRL): w_m(z_1, …, z_{m−1}) = arg min_w Σ_{i=1}^{m−1} f(w, z_i) + λ·r(w)
- Be The Regularized Leader (BTRL): w_m = arg min_w Σ_{i=1}^{m} f(w, z_i) + λ·r(w), with Reg_BTRL(m) ≤ λ·r(w*)
- If the regularized ERM is β(λ, m)-stable: Reg_FTRL(m) ≤ Reg_BTRL(m) + (1/m) Σ_i β(λ, i) ≤ λ·r(w*) + (1/m) Σ_i β(λ, i)
- Follow The Almost Leader (FTAL): play w_m such that (1/(m−1)) Σ_{i=1}^{m−1} f(w_m, z_i) ≤ inf_w (1/(m−1)) Σ_{i=1}^{m−1} f(w, z_i) + ε(m); then Reg_FTAL(m) ≤ Reg_BTAL(m) + (1/m) Σ_i β(i) ≤ (1/m) Σ_i ε(i) + (1/m) Σ_i β(i)
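A sketch of FTRL for one convenient special case, linear losses f(w, z) = ⟨w, z⟩ with regularizer (λ/2)‖w‖₂² over a Euclidean ball, where the regularized leader has a closed form; this is an illustrative instance and parameter choice, not the general algorithm from the slide.

```python
import numpy as np

def project_ball(w, B):
    """Euclidean projection onto {w : ||w||_2 <= B}."""
    n = np.linalg.norm(w)
    return w if n <= B else w * (B / n)

def ftrl_linear(zs, lam, B):
    """FTRL with regularizer (lam/2)||w||^2 on linear losses f(w, z) = <w, z>:
    w_m = argmin_{||w|| <= B}  sum_{i<m} <w, z_i> + (lam/2)||w||^2
        = Proj_B( -(1/lam) * sum_{i<m} z_i ).
    Stability of the regularized leader is what controls the regret."""
    plays = []
    cum = np.zeros(zs.shape[1])
    for z in zs:
        plays.append(project_ball(-cum / lam, B))
        cum = cum + z
    return plays

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    zs = rng.uniform(-1, 1, size=(500, 3))
    plays = ftrl_linear(zs, lam=np.sqrt(500), B=1.0)
    avg_loss = np.mean([np.dot(w, z) for w, z in zip(plays, zs)])
    print("average online loss:", avg_loss)
```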

30 Regularization in Online Gradient Descent
w_{i+1} ← arg min_w F(w_i) + ⟨∇_i, w − w_i⟩ + (1/(2η))‖w − w_i‖₂² = w_i − η∇_i, where ∇_i = ∇f(w_i, z_i).
The first two terms are a first-order model of F(w) around w_i (the tangent F(w_i) + ⟨∇F(w_i), w − w_i⟩ in the figure); the model is only valid near w_i, so the quadratic term says "don't go too far".

31 Regularization in Online Gradient Descent
w_{i+1} ← arg min_w F(w_i) + ⟨∇_i, w − w_i⟩ + (1/(2η))‖w − w_i‖₂² = arg min_w η⟨∇_i, w⟩ + ½‖w − w_i‖₂² = w_i − η∇_i

32 Regularization in Online Gradient Descent
Same update: w_{i+1} ← arg min_w η⟨∇_i, w⟩ + ½‖w − w_i‖₂² = w_i − η∇_i
Replace ½‖·‖₂² with another regularizer? Effect of this regularization: performance depends on ‖w‖₂.

33 Online Mirror Descent
Main ingredient: R(w) strongly convex w.r.t. some norm ‖w‖:
R(αw + (1−α)w′) ≤ αR(w) + (1−α)R(w′) − (λ/2)·α(1−α)·‖w − w′‖²
Mirror Descent update: w_{i+1} ← arg min_{w ∈ W} η⟨∇f(w_i, z_i), w⟩ + D_R(w‖w_i) = Π^R_W(∇R⁻¹(∇R(w_i) − η∇_i)), where D_R is the Bregman divergence of R.
If R(w) ≤ B² on W and f(w, z) is L-Lipschitz w.r.t. ‖·‖, i.e. |f(w, z) − f(w′, z)| ≤ L‖w − w′‖, i.e. ‖∇f‖* ≤ L (for generalized linear objectives: ‖φ‖* ≤ L), then Reg(m) ≤ √(B²L²/m).
Example: with R(w) = entropy(w) = KL(w‖unif) and ‖w‖₁ ≤ 1, MD = Multiplicative Weights / Exponentiated Gradient.
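A sketch of the entropic instantiation mentioned in the example: mirror descent with the entropy regularizer on the probability simplex, i.e. multiplicative weights / exponentiated gradient, assuming linear losses; the renormalization step plays the role of the Bregman projection onto W, and the step size choice is a standard one, not from the slides.

```python
import numpy as np

def exponentiated_gradient(grad_f, zs, dim, eta):
    """Mirror descent with R(w) = (negative) entropy on the probability simplex:
    w_{i+1}[k]  ∝  w_i[k] * exp(-eta * grad_i[k])   (multiplicative weights / EG).
    R is strongly convex w.r.t. ||.||_1, so the regret scales with the sup-norm
    of the gradients and only logarithmically with the dimension."""
    w = np.full(dim, 1.0 / dim)        # uniform start, the minimizer of R on the simplex
    iterates = []
    for z in zs:
        iterates.append(w.copy())
        w = w * np.exp(-eta * grad_f(w, z))
        w = w / w.sum()                # Bregman (KL) projection back onto the simplex
    return iterates

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, m = 50, 1000
    zs = rng.uniform(0, 1, size=(m, dim))      # linear losses f(w, z) = <w, z>
    eta = np.sqrt(2 * np.log(dim) / m)
    ws = exponentiated_gradient(lambda w, z: z, zs, dim, eta)
    print("final weights entropy:", -np.sum(ws[-1] * np.log(ws[-1] + 1e-12)))
```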

34 Strong Convexity as the Master Property
R(w) strongly convex w.r.t. ‖w‖ gives:
- Uniform convergence of {x ↦ ⟨w, x⟩ : ‖w‖ ≤ B} ⇒ statistical learnability of generalized linear objectives (including supervised)
- arg min F̂(w) + λR(w) is stable (RERM) ⇒ statistical learnability of {f : f is L-Lipschitz} over {‖w‖ ≤ B}
- Online Mirror Descent / FTRL ⇒ online learnability of {f : f is L-Lipschitz} over {‖w‖ ≤ B}

35 Answers and Questions
Statistical setting:
- Stable AERM (almost ERM) is necessary and sufficient
- Is stable RERM (regularized ERM) necessary? I.e., do we always have w_λ = arg min F̂(w) + λ·r(w)?
Online general setting:
- Stable FTRL (Follow The Regularized Leader) is sufficient
- Is stable FTAL (Follow The Almost Leader) sufficient? Is stability/regularization necessary and sufficient?
Convex problems:
- Strong convexity is necessary and sufficient for online learning; strong convexity implies stability; Online Mirror Descent (and FTRL) is universal
- Strong convexity is sufficient and almost necessary for stochastic learning
- Direct analysis of Mirror Descent in terms of stability?
