Computational and Statistical Learning Theory


Slide 1: Computational and Statistical Learning Theory
TTIC 31120, Prof. Nati Srebro
Lecture 17:
- Part I: Stochastic Optimization
- Part II: Realizable vs Agnostic Rates
- Part III: Nearest Neighbor Classification

Slide 2: Stochastic (Sub)-Gradient Descent
Online-to-stochastic conversion [Cesa-Bianchi et al 02]: Online Gradient Descent [Zinkevich 03] → Stochastic Gradient Descent [Nemirovski Yudin 78]; Online Mirror Descent [Shalev-Shwartz Singer 07] → Stochastic Mirror Descent [Nemirovski Yudin 78].
Optimize $F(w) = \mathbb{E}_{z \sim \mathcal{D}}[f(w, z)]$ s.t. $w \in \mathcal{W}$:
1. Initialize $w_1 = 0 \in \mathcal{W}$
2. At iteration $t = 1, 2, 3, \dots$:
   1. Sample $z_t \sim \mathcal{D}$ (obtain $g_t$ s.t. $\mathbb{E}[g_t] \in \partial F(w_t)$)
   2. $w_{t+1} = \Pi_{\mathcal{W}}\big(w_t - \eta_t \nabla f(w_t, z_t)\big)$
3. Return $\bar{w}_m = \frac{1}{m}\sum_{t=1}^{m} w_t$
If $\|\nabla f(w, z)\|_2 \le G$, then with an appropriate step size:
$\mathbb{E}\big[F(\bar{w}_m)\big] \le \inf_{w \in \mathcal{W},\, \|w\|_2 \le B} F(w) + O\!\left(\sqrt{\frac{B^2 G^2}{m}}\right)$
The same holds for Stochastic Mirror Descent.
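To make the template concrete, here is a minimal NumPy sketch of projected SGD over the ball $\mathcal{W} = \{\|w\|_2 \le B\}$ with the fixed step size $\eta = B/(G\sqrt{T})$; the function name and the sampling/subgradient callables are placeholders of mine, not part of the lecture.

```python
import numpy as np

def projected_sgd(sample_z, subgrad, dim, B, G, T, seed=0):
    """Projected SGD for F(w) = E_z[f(w, z)] over the ball ||w||_2 <= B.

    sample_z: callable rng -> a fresh draw z ~ D
    subgrad:  callable (w, z) -> a subgradient of f(., z) at w
    Assumes ||subgrad|| <= G; eta = B/(G*sqrt(T)) gives the O(sqrt(B^2 G^2 / T)) rate.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(dim)
    w_sum = np.zeros(dim)
    eta = B / (G * np.sqrt(T))
    for _ in range(T):
        z = sample_z(rng)
        g = subgrad(w, z)
        w = w - eta * g
        norm = np.linalg.norm(w)
        if norm > B:                 # Euclidean projection onto {||w||_2 <= B}
            w = w * (B / norm)
        w_sum += w
    return w_sum / T                 # averaged iterate w_bar_T
```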

Slide 3: Stochastic Optimization
$\min_{w \in \mathcal{W}} F(w) = \mathbb{E}_{z \sim \mathcal{D}}[f(w, z)]$, based only on stochastic information about $F$:
- Only unbiased estimates of $F(w)$ and $\nabla F(w)$; no direct access to $F$
- E.g., $f(w, z)$ is fixed but $\mathcal{D}$ is unknown: optimize $F(w)$ based on an i.i.d. sample $z_1, z_2, \dots, z_m \sim \mathcal{D}$
- $g = \nabla f(w, z_t)$ is an unbiased estimate of $\nabla F(w)$
Traditional applications: optimization under uncertainty
- Uncertainty about network performance
- Uncertainty about client demands
- Uncertainty about system behavior in control problems
- Complex systems where it is easier to sample than to integrate over $z$

Slide 4: Machine Learning is Stochastic Optimization
[Photos: Vladimir Vapnik, Arkadi Nemirovski]
General learning = stochastic optimization:
$\min_h L(h) = \mathbb{E}_{z \sim \mathcal{D}}[\ell(h, z)] = \mathbb{E}_{(x, y) \sim \mathcal{D}}[\mathrm{loss}(h(x), y)]$
- Optimization variable: the predictor $h$
- Objective: the generalization error $L(h)$
- Stochasticity over $z = (x, y)$

Slide 5: Stochastic Optimization vs Statistical Learning
Stochastic Optimization (Nemirovski):
- Focus on computational efficiency
- Generally assumes unlimited sampling, as in Monte Carlo methods for complicated objectives
- Optimization variable generally a vector in a normed space; complexity controlled through the norm
- Mostly convex objectives
Statistical Learning (Vapnik):
- Focus on sample size: what can be done with a fixed number of samples?
- Abstract hypothesis classes: linear predictors, but also combinatorial hypothesis classes
- Generic measures of complexity such as VC dimension, fat-shattering dimension, Rademacher complexity
- Also non-convex classes and loss functions

Slide 6: Two Approaches to Stochastic Optimization / Learning
$\min_{w \in \mathcal{W}} F(w) = \mathbb{E}_{z \sim \mathcal{D}}[f(w, z)]$
Empirical Risk Minimization (ERM) / Sample Average Approximation (SAA):
- Collect a sample $z_1, \dots, z_m$
- Minimize $F_S(w) = \frac{1}{m}\sum_i f(w, z_i)$
- Analysis typically based on uniform convergence
Stochastic Approximation (SA) [Robbins Monro 1951]:
- Update $w_t$ based on $z_t$, e.g. based on $g_t = \nabla f(w_t, z_t)$
- E.g. stochastic gradient descent
- Online-to-batch conversion of an online algorithm

Slide 7: SA/SGD for Machine Learning
In learning with ERM, we need to optimize $\hat{w} = \arg\min_{w \in \mathcal{W}} L_S(w)$, where $L_S(w) = \frac{1}{m}\sum_i \ell(w, z_i)$.
- $\nabla L_S(w)$ is expensive to evaluate exactly: $O(md)$ time
- It is cheap to get an unbiased gradient estimate: draw $i \sim \mathrm{Unif}(1..m)$ and take $g = \nabla\ell(w, z_i)$, so $\mathbb{E}[g] = \sum_i \frac{1}{m}\nabla\ell(w, z_i) = \nabla L_S(w)$, at $O(d)$ time
SGD guarantee:
$\mathbb{E}\big[L_S(\bar{w}_T)\big] \le \inf_{w \in \mathcal{W}} L_S(w) + \frac{\sup\|\nabla\ell\| \cdot \sup_{w \in \mathcal{W}}\|w\|_2}{\sqrt{T}}$
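A quick numerical sanity check of the unbiasedness claim (my own illustration, using squared loss for concreteness rather than a generic $\ell$): averaging single-example gradients over random indices recovers the full empirical gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 1000, 5
X = rng.normal(size=(m, d))
y = rng.normal(size=m)
w = rng.normal(size=d)

def grad_one(w, i):
    # gradient of (1/2)(<w, x_i> - y_i)^2 at w -- O(d) time
    return (X[i] @ w - y[i]) * X[i]

full_grad = X.T @ (X @ w - y) / m                    # gradient of L_S(w) -- O(md) time
idx = rng.integers(0, m, size=20000)                 # i ~ Unif(1..m)
est = np.mean([grad_one(w, i) for i in idx], axis=0)
print(np.linalg.norm(est - full_grad))               # small: E[g] = grad L_S(w)
```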

Slide 8: SGD for SVM
$\min L_S(w)$ s.t. $\|w\|_2 \le B$. Use $g_t = \nabla_w\, \mathrm{loss}_{\mathrm{hinge}}\big(\langle w_t, \phi(x_{i_t})\rangle;\, y_{i_t}\big)$ for a random index $i_t$:
- Initialize $w_0 = 0$
- At iteration $t$:
  - Pick $i \in 1..m$ at random
  - If $y_i\langle w_t, \phi(x_i)\rangle < 1$: $w_{t+1} \leftarrow w_t + \eta_t y_i \phi(x_i)$; else: $w_{t+1} \leftarrow w_t$
  - If $\|w_{t+1}\|_2 > B$: $w_{t+1} \leftarrow B\,\frac{w_{t+1}}{\|w_{t+1}\|_2}$
- Return $\bar{w}_T = \frac{1}{T}\sum_{t=1}^{T} w_t$
With $\|\phi(x)\|_2 \le G$ we have $\|g_t\|_2 \le G$, and (in expectation over the randomness in the algorithm):
$L_S(\bar{w}_T) \le L_S(\hat{w}) + \sqrt{\frac{B^2 G^2}{T}}$
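A direct NumPy rendering of the update above, assuming the identity feature map $\phi(x) = x$ and the fixed step size $\eta = B/(G\sqrt{T})$ (the step-size choice and the function name are mine):

```python
import numpy as np

def sgd_svm_erm(X, y, B, T, seed=0):
    """SGD on the empirical hinge loss over the ball ||w||_2 <= B (slide 8, a sketch).

    X: (m, d) feature matrix with phi(x) = x assumed, y: labels in {-1, +1}.
    Returns the averaged iterate w_bar_T.
    """
    rng = np.random.default_rng(seed)
    m, d = X.shape
    G = np.max(np.linalg.norm(X, axis=1))   # bound on ||phi(x)||, hence on ||g_t||
    eta = B / (G * np.sqrt(T))              # fixed step size for the B*G/sqrt(T) rate
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for _ in range(T):
        i = rng.integers(m)
        if y[i] * (w @ X[i]) < 1:           # hinge subgradient step
            w = w + eta * y[i] * X[i]
        norm = np.linalg.norm(w)
        if norm > B:                        # project back onto the B-ball
            w = w * (B / norm)
        w_sum += w
    return w_sum / T
```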

Slide 9: Stochastic vs Batch
$\min L_S(w)$ s.t. $\|w\|_2 \le B$
[Figure: batch gradient descent computes $g_i = \nabla\mathrm{loss}(w, (x_i, y_i))$ for all $i = 1, \dots, m$ and steps along $\nabla L_S(w) = \frac{1}{m}\sum_i g_i$, whereas stochastic gradient descent takes a step $w \leftarrow w - \eta\, g_i$ after each single example $(x_i, y_i)$.]

Slide 10: Stochastic vs Batch
Intuitive argument: if we only take simple gradient steps, it is better to be stochastic.
To get $L_S(\tilde{w}) \le L_S(\hat{w}) + \epsilon_{\mathrm{opt}}$:
- Batch GD: #iter $\frac{B^2 G^2}{\epsilon_{\mathrm{opt}}^2}$, cost/iter $md$, runtime $md\,\frac{B^2 G^2}{\epsilon_{\mathrm{opt}}^2}$
- SGD: #iter $\frac{B^2 G^2}{\epsilon_{\mathrm{opt}}^2}$, cost/iter $d$, runtime $d\,\frac{B^2 G^2}{\epsilon_{\mathrm{opt}}^2}$
Questions:
- How does this compare to methods with a $\log(1/\epsilon_{\mathrm{opt}})$ dependence that use the structure of $L_S(w)$ (not only local access)?
- How small should $\epsilon_{\mathrm{opt}}$ be? What about $L(w)$, which is what we really care about?
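To make the gap concrete (numbers chosen for illustration, not from the slides): with $m = 10^6$ examples, dimension $d = 10^3$, and $B^2 G^2/\epsilon_{\mathrm{opt}}^2 = 10^4$ iterations, batch GD costs on the order of $md \cdot 10^4 = 10^{13}$ operations, while SGD costs on the order of $d \cdot 10^4 = 10^7$; the factor-$m$ difference in cost per iteration carries over directly to the total runtime.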

Slide 11: Overall Analysis of $L_D(w)$
Recall for ERM $\hat{w} = \arg\min_{\|w\| \le B} L_S(w)$:
$L_D(\hat{w}) \le L_D(w^*) + 2\sup_w |L_D(w) - L_S(w)|$, where $w^* = \arg\min_{\|w\| \le B} L_D(w)$.
For an $\epsilon_{\mathrm{opt}}$-suboptimal ERM $\bar{w}$:
$L_D(\bar{w}) \le L_D(w^*) + 2\sup_w |L_D(w) - L_S(w)| + \big(L_S(\bar{w}) - L_S(\hat{w})\big)$
- approximation error $\epsilon_{\mathrm{aprox}} = L_D(w^*)$
- estimation error $\epsilon_{\mathrm{est}} \le 2\sqrt{\frac{B^2 G^2}{m}}$
- optimization error $\epsilon_{\mathrm{opt}} \le \sqrt{\frac{B^2 G^2}{T}}$
Take $\epsilon_{\mathrm{opt}} \approx \epsilon_{\mathrm{est}}$, i.e. #iterations $T \approx$ sample size $m$.
To ensure $L_D(\bar{w}) \le L_D(w^*) + \epsilon$: $T, m = O\!\left(\frac{B^2 G^2}{\epsilon^2}\right)$.

Slide 12: Direct Online-to-Batch: SGD on $L_D(w)$
$\min_w L_D(w)$; use $g_t = \nabla_w\, \mathrm{loss}_{\mathrm{hinge}}\big(y\langle w, \phi(x)\rangle\big)$ for a random $(x, y) \sim \mathcal{D}$, so that $\mathbb{E}[g_t] \in \partial L_D(w)$:
- Initialize $w_0 = 0$
- At iteration $t$: draw $(x_t, y_t) \sim \mathcal{D}$; if $y_t\langle w_t, \phi(x_t)\rangle < 1$: $w_{t+1} \leftarrow w_t + \eta_t y_t \phi(x_t)$; else: $w_{t+1} \leftarrow w_t$
- Return $\bar{w}_T = \frac{1}{T}\sum_{t=1}^{T} w_t$
Guarantee: $L_D(\bar{w}_T) \le \inf_{\|w\|_2 \le B} L_D(w) + \sqrt{\frac{B^2 G^2}{T}}$, with $m = T = O\!\left(\frac{B^2 G^2}{\epsilon^2}\right)$.
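A sketch of this direct (fresh-sample) variant, again assuming $\phi(x) = x$; note that, unlike the ERM version, no projection step is used and $B, G$ enter only through the step size (these choices and the names are mine):

```python
import numpy as np

def sgd_online_to_batch(sample_xy, d, B, G, T, seed=0):
    """Direct SA / online-to-batch SGD on the population hinge loss (slide 12, a sketch).

    sample_xy: callable rng -> (x, y) with x in R^d and y in {-1, +1},
               i.e. a fresh draw from D at every iteration.
    """
    rng = np.random.default_rng(seed)
    eta = B / (G * np.sqrt(T))            # B, G only enter through the step size
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for _ in range(T):
        x, y = sample_xy(rng)
        if y * (w @ x) < 1:               # hinge subgradient step on a fresh sample
            w = w + eta * y * x
        w_sum += w
    return w_sum / T
```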

Slide 13: SGD for Machine Learning
Direct SA (online-to-batch) approach, $\min_w L(w)$:
- Initialize $w_0 = 0$
- At iteration $t$: draw $(x_t, y_t) \sim \mathcal{D}$; if $y_t\langle w_t, \phi(x_t)\rangle < 1$: $w_{t+1} \leftarrow w_t + \eta_t y_t \phi(x_t)$; else: $w_{t+1} \leftarrow w_t$
- Return $\bar{w}_T = \frac{1}{T}\sum_{t=1}^{T} w_t$
- Fresh sample at each iteration, $m = T$
- No need to project nor to require $\|w\| \le B$
- Implicit regularization via early stopping
SGD on ERM, $\min_{\|w\|_2 \le B} L_S(w)$:
- Draw $(x_1, y_1), \dots, (x_m, y_m) \sim \mathcal{D}$
- Initialize $w_0 = 0$
- At iteration $t$: pick $i \in 1..m$ at random; if $y_i\langle w_t, \phi(x_i)\rangle < 1$: $w_{t+1} \leftarrow w_t + \eta_t y_i \phi(x_i)$; else: $w_{t+1} \leftarrow w_t$; then project $w_{t+1}$ onto $\{\|w\| \le B\}$
- Return $\bar{w}_T = \frac{1}{T}\sum_{t=1}^{T} w_t$
- Can have $T > m$ iterations
- Need to project onto $\|w\| \le B$
- Explicit regularization via $\|w\|$

Slide 14: SGD for Machine Learning (cont.)
Same two algorithms as on slide 13 (direct SA / online-to-batch on $L(w)$ vs SGD on the ERM $\min_{\|w\|_2 \le B} L_S(w)$), with step size $\eta_t = \sqrt{\frac{B^2}{G^2 t}}$. Guarantees, with $w^* = \arg\min_{\|w\| \le B} L(w)$:
- Direct SA: $L(\bar{w}_T) \le L(w^*) + \sqrt{\frac{B^2 G^2}{T}}$
- SGD on ERM: $L(\bar{w}_T) \le L(w^*) + 2\sqrt{\frac{B^2 G^2}{m}} + \sqrt{\frac{B^2 G^2}{T}}$

Slide 15: SGD for Machine Learning (cont.)
Direct SA (online-to-batch) approach, $\min_w L(w)$:
- Initialize $w_0 = 0$
- At iteration $t$: draw $(x_t, y_t) \sim \mathcal{D}$; if $y_t\langle w_t, \phi(x_t)\rangle < 1$: $w_{t+1} \leftarrow w_t + \eta_t y_t \phi(x_t)$; else: $w_{t+1} \leftarrow w_t$
- Return $\bar{w}_T = \frac{1}{T}\sum_{t=1}^{T} w_t$
- Fresh sample at each iteration, $m = T$
- No need to shrink $w$; implicit regularization via early stopping
SGD on RERM, $\min_w L_S(w) + \frac{\lambda}{2}\|w\|^2$:
- Draw $(x_1, y_1), \dots, (x_m, y_m) \sim \mathcal{D}$
- Initialize $w_0 = 0$
- At iteration $t$: pick $i \in 1..m$ at random; if $y_i\langle w_t, \phi(x_i)\rangle < 1$: $w_{t+1} \leftarrow w_t + \eta_t y_i \phi(x_i)$; else: $w_{t+1} \leftarrow w_t$; then shrink: $w_{t+1} \leftarrow w_{t+1} - \eta_t \lambda w_t$
- Return $\bar{w}_T = \frac{1}{T}\sum_{t=1}^{T} w_t$
- Can have $T > m$ iterations
- Need to shrink $w$; explicit regularization via $\|w\|$
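A sketch of the right-hand column (SGD on the regularized ERM), assuming $\phi(x) = x$ and the step size $\eta_t = 1/(\lambda(t+1))$ that is standard for a $\lambda$-strongly-convex objective; the step size is my assumption, not stated on the slide:

```python
import numpy as np

def sgd_rerm_hinge(X, y, lam, T, seed=0):
    """SGD on the regularized ERM  L_S(w) + (lam/2)||w||^2  (slide 15, a sketch)."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for t in range(T):
        eta = 1.0 / (lam * (t + 1))      # assumed step size for the strongly convex objective
        i = rng.integers(m)
        w_prev = w
        if y[i] * (w @ X[i]) < 1:        # hinge subgradient step
            w = w + eta * y[i] * X[i]
        w = w - eta * lam * w_prev       # shrinkage from the (lam/2)||w||^2 term
        w_sum += w
    return w_sum / T
```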

Slide 16: SGD vs ERM
[Figure: the ball $\{\|w\|_2 \le B\}$ with $\hat{w} = \arg\min_{\|w\|_2 \le B} L_S(w)$ and $w^* = \arg\min_{\|w\|_2 \le B} L(w)$; the SGD path from $w_0$ reaches $\bar{w}_m$ after $m$ steps and, with $T > m$ iterations and projection, approaches $\hat{w}$; an $O\!\big(\sqrt{B^2 G^2/m}\big)$ gap is indicated; the unconstrained $\arg\min_w L_S(w)$ overfits.]

Slide 17: Mixed Approach: SGD on ERM
[Plot: test misclassification error vs # iterations (up to about 3,000,000), for training-set sizes m = 300,000, 400,000, 500,000; Reuters RCV1 data, CCAT task.]
The mixed approach (reusing examples) can make sense.

Slide 18: Mixed Approach: SGD on ERM (cont.)
[Same plot as slide 17: test misclassification error vs # iterations for m = 300,000, 400,000, 500,000; Reuters RCV1 data, CCAT task.]
The mixed approach (reusing examples) can make sense. Still, fresh samples are better:
- With a larger training set, the generalization error decreases faster per iteration
- A larger training set means less runtime to reach a target generalization error

Slide 19: Online Optimization vs Stochastic Approximation
In both the online setting and stochastic approximation:
- Samples are received sequentially
- $w$ is updated after each sample
But in the online setting:
- The objective is empirical regret, i.e. behavior on the observed instances
- $z_t$ is chosen adversarially (no distribution involved)
whereas in stochastic approximation:
- The objective is $\mathbb{E}[\ell(w, z)]$, i.e. behavior on future samples
- The $z_t$ are i.i.d. samples
Stochastic approximation is a computational approach; online learning is an analysis setup (e.g., Follow the Leader).

Slide 20: Part II: Realizable vs Agnostic Rates

Slide 21: Realizable vs Agnostic Rates
Recall, for finite hypothesis classes:
$L_D(\hat{h}) \le \inf_{h \in H} L_D(h) + 2\sqrt{\frac{\log|H| + \log\frac{2}{\delta}}{2m}}$, i.e. $m = O\!\left(\frac{\log|H|}{\epsilon^2}\right)$
But in the realizable case, if $\inf_{h \in H} L_D(h) = 0$:
$L_D(\hat{h}) \le \frac{\log|H| + \log\frac{1}{\delta}}{m}$, i.e. $m = O\!\left(\frac{\log|H|}{\epsilon}\right)$
Also for VC classes: in general $m = O\!\left(\frac{\mathrm{VCdim}}{\epsilon^2}\right)$, while in the realizable case $m = O\!\left(\frac{\mathrm{VCdim}\,\log(1/\epsilon)}{\epsilon}\right)$.
What happens if $L^* = \inf_{h \in H} L_D(h)$ is low, but not zero?

Slide 22: Estimating the Bias of a Coin
Hoeffding: $|p - \hat{p}| \le \sqrt{\frac{\log\frac{2}{\delta}}{2m}}$
Bernstein: $|p - \hat{p}| \le \sqrt{\frac{2p\log\frac{2}{\delta}}{m}} + \frac{2\log\frac{2}{\delta}}{3m}$
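The two bounds behave very differently when $p$ is small; here is a small numerical comparison (the values of $\delta$, $m$, and $p$ below are illustrative choices of mine):

```python
import numpy as np

# Compare the two confidence widths from slide 22 for a coin with small bias p.
delta, m = 0.05, 10_000
log_term = np.log(2 / delta)

hoeffding = np.sqrt(log_term / (2 * m))
for p in [0.5, 0.1, 0.01, 0.001]:
    bernstein = np.sqrt(2 * p * log_term / m) + 2 * log_term / (3 * m)
    print(f"p={p:6.3f}  Hoeffding width={hoeffding:.4f}  Bernstein width={bernstein:.4f}")
# For small p the Bernstein width shrinks roughly like sqrt(p/m) + 1/m,
# which is the source of the 'optimistic' (realizable-style) rates that follow.
```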

Slide 23: Optimistic VC Bound (aka $L^*$-bound, multiplicative bound)
Let $\hat{h} = \arg\min_{h \in H} L_S(h)$ and $L^* = \inf_{h \in H} L(h)$. For a hypothesis class with VC dimension $D$, with probability $1-\delta$ over $m$ samples:
$L(\hat{h}) \le L^* + 2\sqrt{L^*\,\frac{D\log\frac{2em}{D} + \log\frac{2}{\delta}}{m}} + 4\,\frac{D\log\frac{2em}{D} + \log\frac{2}{\delta}}{m}$
Equivalently, in multiplicative form: for every $\alpha > 0$, $L(\hat{h}) \le (1+\alpha)L^* + O\!\left(\left(1 + \frac{1}{\alpha}\right)\frac{D\log\frac{2em}{D} + \log\frac{2}{\delta}}{m}\right)$.
Sample complexity to get $L(\hat{h}) \le L^* + \epsilon$: $m(\epsilon) = O\!\left(\frac{D}{\epsilon}\cdot\frac{L^* + \epsilon}{\epsilon}\log\frac{1}{\epsilon}\right)$
Extends to bounded real-valued losses in terms of the VC-subgraph dimension.

Slide 24: From Parametric to Scale Sensitive
$L(h) = \mathbb{E}[\mathrm{loss}(h(x), y)]$, $h \in H$. Instead of the VC dimension or VC-subgraph dimension ($\approx$ #params), rely on a metric scale to control complexity, e.g. $H = \{x \mapsto \langle w, x\rangle \mid \|w\|_2 \le B\}$.
Learning depends on:
- Metric complexity measures: fat-shattering dimension, covering numbers, Rademacher complexity
- Scale sensitivity of the loss (bound on derivatives, or margin)
For $H$ with Rademacher complexity $R_m(H)$ and a $G$-Lipschitz loss:
$L(\hat{h}) \le L^* + 2G R_m(H) + \sqrt{\frac{\log\frac{2}{\delta}}{2m}}$
where, e.g., $R_m(H) = \sqrt{\frac{B^2 \sup\|x\|^2}{m}}$ for the class above.

Slide 25: Non-Parametric Optimistic Rate for Smooth Loss
Theorem: for any $H$ with (worst-case) Rademacher complexity $R_m(H)$, and any $H$-smooth loss ($|\mathrm{loss}''| \le H$) bounded by $b$, with probability $1-\delta$ over $m$ samples, up to logarithmic factors:
$L(\hat{h}) \le L^* + \tilde{O}\!\left(\sqrt{L^*\, H R_m^2(H)} + H R_m^2(H)\right)$
with the corresponding sample complexity (see the table on the next slide).

Slide 26: Parametric vs Non-Parametric
Excess-risk rates (columns: parametric classes with $\dim(H) \le D$, $|h| \le 1$; scale-sensitive classes with Rademacher complexity $R_m(H) \le R_m$):
- Lipschitz loss, $|\phi'| \le G$ (e.g. hinge, $\ell_1$): parametric $\frac{GD}{m} + \sqrt{\frac{L^* GD}{m}}$; scale-sensitive $\sqrt{G^2 R_m^2}$
- Smooth loss, $|\phi''| \le H$ (e.g. logistic, Huber, smoothed hinge): parametric $\frac{HD}{m} + \sqrt{\frac{L^* HD}{m}}$; scale-sensitive $H R_m^2 + \sqrt{L^* H R_m^2}$
- Smooth and strongly convex loss, $\mu \le \phi'' \le H$ (e.g. square loss): parametric $\frac{H}{\mu}\cdot\frac{HD}{m}$; scale-sensitive $H R_m^2 + \sqrt{L^* H R_m^2}$
All rates are min-max tight up to poly-log factors.

Slide 27: Optimistic Learning Guarantees
$L(\hat{h}) \le (1+\alpha)L^* + \left(1 + \frac{1}{\alpha}\right)\tilde{O}\!\left(\frac{R}{m}\right)$, i.e. $m(\epsilon) = O\!\left(\frac{R}{\epsilon}\cdot\frac{L^* + \epsilon}{\epsilon}\right)$
Guarantees of this form hold for:
- Parametric classes
- Scale-sensitive classes with smooth loss
- The Perceptron guarantee
- Margin bounds
- Stability-based guarantees with smooth loss
- Online learning/optimization with smooth loss
But not for:
- Non-parametric (scale-sensitive) classes with non-smooth loss
- Online learning/optimization with non-smooth loss

Slide 28: Why Optimistic Guarantees?
$L(\hat{h}) \le (1+\alpha)L^* + \left(1 + \frac{1}{\alpha}\right)\tilde{O}\!\left(\frac{R}{m}\right)$, i.e. $m(\epsilon) = O\!\left(\frac{R}{\epsilon}\cdot\frac{L^* + \epsilon}{\epsilon}\right)$
The optimistic regime (approximation error $L^*$ comparable to or below the estimation error $\epsilon$) is typically the relevant regime:
- If $\epsilon \ll L^*$, it is better to spend energy on lowering the approximation error (use a more complex class)
- Often important in highlighting true phenomena

Slide 29: Part III: Nearest Neighbor Classification

Slide 30: The Nearest Neighbor Classifier
Training sample $S = (x_1, y_1), \dots, (x_m, y_m)$; we want to predict the label of a new point $x$.
The nearest neighbor rule:
- Find the closest training point: $i^* = \arg\min_i \rho(x, x_i)$
- Predict the label of $x$ as $y_{i^*}$

Slide 31: The Nearest Neighbor Classifier
Training sample $S = (x_1, y_1), \dots, (x_m, y_m)$; we want to predict the label of a new point $x$.
The nearest neighbor rule:
- Find the closest training point: $i^* = \arg\min_i \rho(x, x_i)$
- Predict the label of $x$ as $y_{i^*}$
As a learning rule: $\mathrm{NN}(S) = h$ where $h(x) = y_{\arg\min_i \rho(x, x_i)}$
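A minimal NumPy sketch of the NN rule with $\rho(x, x') = \|x - x'\|_2$, i.e. taking $\phi(x) = x$ (the function name is mine):

```python
import numpy as np

def nn_rule(X_train, y_train, X_test):
    """The 1-nearest-neighbor rule NN(S) with rho(x, x') = ||x - x'||_2.

    X_train: (m, d), y_train: (m,), X_test: (n, d). Returns predicted labels.
    """
    # Pairwise squared Euclidean distances, shape (n, m)
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)       # index of the closest training point
    return y_train[nearest]           # predict its label
```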

Slide 32: Where is the Bias Hiding?
Find the closest training point $i^* = \arg\min_i \rho(x, x_i)$ and predict the label of $x$ as $y_{i^*}$. But what is the right distance between images? Between sound waves? Between sentences?
Option 1: $\rho(x, x') = \|\phi(x) - \phi(x')\|_2$. What representation $\phi(x)$?
[Figure: the same points under the representation $\phi(x) = (5x_1, x_2)$; rescaling coordinates changes which training point is nearest.]

Slide 33: Where is the Bias Hiding?
Find the closest training point $i^* = \arg\min_i \rho(x, x_i)$ and predict the label of $x$ as $y_{i^*}$. What is the right distance between images? Between sound waves? Between sentences?
Option 1: $\rho(x, x') = \|\phi(x) - \phi(x')\|_2$. What representation $\phi(x)$?
Maybe a different distance? $\|\phi(x) - \phi(x')\|_1$? $\|\phi(x) - \phi(x')\|_\infty$? $\sin\big(\angle(\phi(x), \phi(x'))\big)$? $\mathrm{KL}\big(\phi(x)\,\|\,\phi(x')\big)$?
[Figure: the same points under $\phi(x) = (5x_1, x_2)$ and under $\ell_1$ vs $\ell_2$ distances.]

Slide 34: Where is the Bias Hiding?
Find the closest training point $i^* = \arg\min_i \rho(x, x_i)$ and predict the label of $x$ as $y_{i^*}$. What is the right distance between images? Between sound waves? Between sentences?
Option 1: $\rho(x, x') = \|\phi(x) - \phi(x')\|_2$. What representation $\phi(x)$? Maybe a different distance: $\|\phi(x) - \phi(x')\|_1$? $\|\phi(x) - \phi(x')\|_\infty$? $\sin\big(\angle(\phi(x), \phi(x'))\big)$? $\mathrm{KL}\big(\phi(x)\,\|\,\phi(x')\big)$?
Option 2: A special-purpose distance measure on $x$, e.g. edit distance, a deformation measure, etc.

Slide 35: Nearest Neighbor Learning Guarantee
Optimal (Bayes) predictor: $h^* = \arg\min_h L_D(h)$, i.e. $h^*(x) = +1$ if $\eta(x) > 0.5$ and $-1$ if $\eta(x) < 0.5$, where $\eta(x) = P_D(y = 1 \mid x)$.
For the NN rule with $\rho(x, x') = \|\phi(x) - \phi(x')\|_2$, $\phi: \mathcal{X} \to [0,1]^d$, and $|\eta(x) - \eta(x')| \le c_D\,\rho(x, x')$:
$\mathbb{E}_{S \sim D^m}\big[L(\mathrm{NN}(S))\big] \le 2L(h^*) + \frac{4 c_D \sqrt{d}}{\sqrt[d+1]{m}}$

Slide 36: Data Fit / Complexity Tradeoff
$\mathbb{E}_{S \sim D^m}\big[L(\mathrm{NN}(S))\big] \le 2L(h^*) + \frac{4 c_D \sqrt{d}}{\sqrt[d+1]{m}}$
k-nearest neighbor: predict according to the majority vote among the $k$ closest points from $S$.
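A corresponding k-NN sketch with majority vote under the Euclidean distance, assuming labels in $\{-1, +1\}$ (again an illustration of mine, not code from the lecture):

```python
import numpy as np

def knn_rule(X_train, y_train, X_test, k):
    """k-nearest-neighbor rule: majority vote among the k closest training
    points under the Euclidean distance. Labels are assumed to be in {-1, +1}.
    """
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    # indices of the k nearest training points for each test point (unordered)
    knn_idx = np.argpartition(d2, kth=k - 1, axis=1)[:, :k]
    votes = y_train[knn_idx].sum(axis=1)     # sum of +/-1 votes
    return np.where(votes >= 0, 1, -1)       # ties broken in favor of +1
```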

37 k-nearest Neighbor: Data Fit / Complexity Tradeoff h*= S= k=1 k=50 k=5 k=12 k=100 k=200

38 k-nearest Neighbor Guarantee For k-nn with ρ x, x = φ x φ x 2, and φ: X [0,1] d : η x η x c D ρ(x, x ) E S D m L NN k S k L h + 6c D d + k d+1 m Should increase k with sample size m Above theory suggests k m L h 2/3 m 2 3(d+1) Universal Learning: for any smooth D and representation φ (with continuous P(y φ x )), if we increase k slowly enough, we will eventually converge to optimal L(h ) Very non-uniform: sample complexity depends not only on h, but also on D

Slide 39: Uniform and Non-Uniform Learnability
Definition: A hypothesis class $H$ is agnostically PAC-learnable if there exists a learning rule $A$ such that $\forall \epsilon, \delta > 0$, $\exists m(\epsilon, \delta)$, $\forall D$, $\forall h$: for $S \sim D^{m(\epsilon,\delta)}$, with probability at least $1-\delta$, $L_D(A(S)) \le L_D(h) + \epsilon$.
Definition: A hypothesis class $H$ is non-uniformly learnable if there exists a learning rule $A$ such that $\forall \epsilon, \delta > 0$, $\forall h$, $\exists m(\epsilon, \delta, h)$, $\forall D$: for $S \sim D^{m(\epsilon,\delta,h)}$, with probability at least $1-\delta$, $L_D(A(S)) \le L_D(h) + \epsilon$.
Definition: A hypothesis class $H$ is consistently learnable if there exists a learning rule $A$ such that $\forall \epsilon, \delta > 0$, $\forall h$, $\forall D$, $\exists m(\epsilon, \delta, h, D)$: for $S \sim D^{m(\epsilon,\delta,h,D)}$, with probability at least $1-\delta$, $L_D(A(S)) \le L_D(h) + \epsilon$.
Realizable/optimistic guarantees: the dependence on $D$ enters only through $L_D(h)$.
