Computational and Statistical Learning Theory


Slide 1: Computational and Statistical Learning Theory
TTIC 31120, Prof. Nati Srebro
Lecture 17:
- Part I: Stochastic Optimization
- Part II: Realizable vs Agnostic Rates
- Part III: Nearest Neighbor Classification

Slide 2: Stochastic (Sub)-Gradient Descent
Online-to-stochastic conversion [Cesa-Bianchi et al 02]: Online Gradient Descent [Zinkevich 03] → Stochastic Gradient Descent [Nemirovski Yudin 78]; Online Mirror Descent [Shalev-Shwartz Singer 07] → Stochastic Mirror Descent [Nemirovski Yudin 78].
Optimize $F(w) = \mathbb{E}_{z \sim \mathcal{D}}[f(w, z)]$ s.t. $w \in \mathcal{W}$:
1. Initialize $w_1 = 0 \in \mathcal{W}$
2. At iteration $t = 1, 2, 3, \dots$:
   1. Sample $z_t \sim \mathcal{D}$ (obtain $g_t$ s.t. $\mathbb{E}[g_t] \in \partial F(w_t)$)
   2. $w_{t+1} = \Pi_{\mathcal{W}}\big(w_t - \eta_t \nabla f(w_t, z_t)\big)$
3. Return $\bar{w}_m = \frac{1}{m}\sum_{t=1}^{m} w_t$
If $\|\nabla f(w, z)\|_2 \le G$, then with an appropriate step size:
$\mathbb{E}\big[F(\bar{w}_m)\big] \le \inf_{w \in \mathcal{W},\, \|w\|_2 \le B} F(w) + O\!\left(\sqrt{\frac{B^2 G^2}{m}}\right)$
The same holds for Stochastic Mirror Descent.
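To make the template concrete, here is a minimal NumPy sketch of projected SGD over the ball $\mathcal{W} = \{\|w\|_2 \le B\}$ with the fixed step size $\eta = B/(G\sqrt{T})$; the function name and the sampling/subgradient callables are placeholders of mine, not part of the lecture.

```python
import numpy as np

def projected_sgd(sample_z, subgrad, dim, B, G, T, seed=0):
    """Projected SGD for F(w) = E_z[f(w, z)] over the ball ||w||_2 <= B.

    sample_z: callable rng -> a fresh draw z ~ D
    subgrad:  callable (w, z) -> a subgradient of f(., z) at w
    Assumes ||subgrad|| <= G; eta = B/(G*sqrt(T)) gives the O(sqrt(B^2 G^2 / T)) rate.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(dim)
    w_sum = np.zeros(dim)
    eta = B / (G * np.sqrt(T))
    for _ in range(T):
        z = sample_z(rng)
        g = subgrad(w, z)
        w = w - eta * g
        norm = np.linalg.norm(w)
        if norm > B:                 # Euclidean projection onto {||w||_2 <= B}
            w = w * (B / norm)
        w_sum += w
    return w_sum / T                 # averaged iterate w_bar_T
```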

Slide 3: Stochastic Optimization
$\min_{w \in \mathcal{W}} F(w) = \mathbb{E}_{z \sim \mathcal{D}}[f(w, z)]$, based only on stochastic information about $F$:
- Only unbiased estimates of $F(w)$ and $\nabla F(w)$; no direct access to $F$
- E.g., $f(w, z)$ is fixed but $\mathcal{D}$ is unknown: optimize $F(w)$ based on an i.i.d. sample $z_1, z_2, \dots, z_m \sim \mathcal{D}$
- $g = \nabla f(w, z_t)$ is an unbiased estimate of $\nabla F(w)$
Traditional applications: optimization under uncertainty
- Uncertainty about network performance
- Uncertainty about client demands
- Uncertainty about system behavior in control problems
- Complex systems where it is easier to sample than to integrate over $z$

Slide 4: Machine Learning is Stochastic Optimization
[Photos: Vladimir Vapnik, Arkadi Nemirovski]
General learning = stochastic optimization:
$\min_h L(h) = \mathbb{E}_{z \sim \mathcal{D}}[\ell(h, z)] = \mathbb{E}_{(x, y) \sim \mathcal{D}}[\mathrm{loss}(h(x), y)]$
- Optimization variable: the predictor $h$
- Objective: the generalization error $L(h)$
- Stochasticity over $z = (x, y)$

Slide 5: Stochastic Optimization vs Statistical Learning
Stochastic Optimization (Nemirovski):
- Focus on computational efficiency
- Generally assumes unlimited sampling, as in Monte Carlo methods for complicated objectives
- Optimization variable generally a vector in a normed space; complexity controlled through the norm
- Mostly convex objectives
Statistical Learning (Vapnik):
- Focus on sample size: what can be done with a fixed number of samples?
- Abstract hypothesis classes: linear predictors, but also combinatorial hypothesis classes
- Generic measures of complexity such as VC dimension, fat-shattering dimension, Rademacher complexity
- Also non-convex classes and loss functions

Slide 6: Two Approaches to Stochastic Optimization / Learning
$\min_{w \in \mathcal{W}} F(w) = \mathbb{E}_{z \sim \mathcal{D}}[f(w, z)]$
Empirical Risk Minimization (ERM) / Sample Average Approximation (SAA):
- Collect a sample $z_1, \dots, z_m$
- Minimize $F_S(w) = \frac{1}{m}\sum_i f(w, z_i)$
- Analysis typically based on uniform convergence
Stochastic Approximation (SA) [Robbins Monro 1951]:
- Update $w_t$ based on $z_t$, e.g. based on $g_t = \nabla f(w_t, z_t)$
- E.g. stochastic gradient descent
- Online-to-batch conversion of an online algorithm

Slide 7: SA/SGD for Machine Learning
In learning with ERM, we need to optimize $\hat{w} = \arg\min_{w \in \mathcal{W}} L_S(w)$, where $L_S(w) = \frac{1}{m}\sum_i \ell(w, z_i)$.
- $\nabla L_S(w)$ is expensive to evaluate exactly: $O(md)$ time
- It is cheap to get an unbiased gradient estimate: draw $i \sim \mathrm{Unif}(1..m)$ and take $g = \nabla\ell(w, z_i)$, so $\mathbb{E}[g] = \sum_i \frac{1}{m}\nabla\ell(w, z_i) = \nabla L_S(w)$, at $O(d)$ time
SGD guarantee:
$\mathbb{E}\big[L_S(\bar{w}_T)\big] \le \inf_{w \in \mathcal{W}} L_S(w) + \frac{\sup\|\nabla\ell\| \cdot \sup_{w \in \mathcal{W}}\|w\|_2}{\sqrt{T}}$
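A quick numerical sanity check of the unbiasedness claim (my own illustration, using squared loss for concreteness rather than a generic $\ell$): averaging single-example gradients over random indices recovers the full empirical gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 1000, 5
X = rng.normal(size=(m, d))
y = rng.normal(size=m)
w = rng.normal(size=d)

def grad_one(w, i):
    # gradient of (1/2)(<w, x_i> - y_i)^2 at w -- O(d) time
    return (X[i] @ w - y[i]) * X[i]

full_grad = X.T @ (X @ w - y) / m                    # gradient of L_S(w) -- O(md) time
idx = rng.integers(0, m, size=20000)                 # i ~ Unif(1..m)
est = np.mean([grad_one(w, i) for i in idx], axis=0)
print(np.linalg.norm(est - full_grad))               # small: E[g] = grad L_S(w)
```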

Slide 8: SGD for SVM
$\min L_S(w)$ s.t. $\|w\|_2 \le B$. Use $g_t = \nabla_w\, \mathrm{loss}_{\mathrm{hinge}}\big(\langle w_t, \phi(x_{i_t})\rangle;\, y_{i_t}\big)$ for a random index $i_t$:
- Initialize $w_0 = 0$
- At iteration $t$:
  - Pick $i \in 1..m$ at random
  - If $y_i\langle w_t, \phi(x_i)\rangle < 1$: $w_{t+1} \leftarrow w_t + \eta_t y_i \phi(x_i)$; else: $w_{t+1} \leftarrow w_t$
  - If $\|w_{t+1}\|_2 > B$: $w_{t+1} \leftarrow B\,\frac{w_{t+1}}{\|w_{t+1}\|_2}$
- Return $\bar{w}_T = \frac{1}{T}\sum_{t=1}^{T} w_t$
With $\|\phi(x)\|_2 \le G$ we have $\|g_t\|_2 \le G$, and (in expectation over the randomness in the algorithm):
$L_S(\bar{w}_T) \le L_S(\hat{w}) + \sqrt{\frac{B^2 G^2}{T}}$
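A direct NumPy rendering of the update above, assuming the identity feature map $\phi(x) = x$ and the fixed step size $\eta = B/(G\sqrt{T})$ (the step-size choice and the function name are mine):

```python
import numpy as np

def sgd_svm_erm(X, y, B, T, seed=0):
    """SGD on the empirical hinge loss over the ball ||w||_2 <= B (slide 8, a sketch).

    X: (m, d) feature matrix with phi(x) = x assumed, y: labels in {-1, +1}.
    Returns the averaged iterate w_bar_T.
    """
    rng = np.random.default_rng(seed)
    m, d = X.shape
    G = np.max(np.linalg.norm(X, axis=1))   # bound on ||phi(x)||, hence on ||g_t||
    eta = B / (G * np.sqrt(T))              # fixed step size for the B*G/sqrt(T) rate
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for _ in range(T):
        i = rng.integers(m)
        if y[i] * (w @ X[i]) < 1:           # hinge subgradient step
            w = w + eta * y[i] * X[i]
        norm = np.linalg.norm(w)
        if norm > B:                        # project back onto the B-ball
            w = w * (B / norm)
        w_sum += w
    return w_sum / T
```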

Slide 9: Stochastic vs Batch
$\min L_S(w)$ s.t. $\|w\|_2 \le B$
[Figure: batch gradient descent computes $g_i = \nabla\mathrm{loss}(w, (x_i, y_i))$ for all $i = 1, \dots, m$ and steps along $\nabla L_S(w) = \frac{1}{m}\sum_i g_i$, whereas stochastic gradient descent takes a step $w \leftarrow w - \eta\, g_i$ after each single example $(x_i, y_i)$.]

Slide 10: Stochastic vs Batch
Intuitive argument: if we only take simple gradient steps, it is better to be stochastic.
To get $L_S(\tilde{w}) \le L_S(\hat{w}) + \epsilon_{\mathrm{opt}}$:
- Batch GD: #iter $\frac{B^2 G^2}{\epsilon_{\mathrm{opt}}^2}$, cost/iter $md$, runtime $md\,\frac{B^2 G^2}{\epsilon_{\mathrm{opt}}^2}$
- SGD: #iter $\frac{B^2 G^2}{\epsilon_{\mathrm{opt}}^2}$, cost/iter $d$, runtime $d\,\frac{B^2 G^2}{\epsilon_{\mathrm{opt}}^2}$
Questions:
- How does this compare to methods with a $\log(1/\epsilon_{\mathrm{opt}})$ dependence that use the structure of $L_S(w)$ (not only local access)?
- How small should $\epsilon_{\mathrm{opt}}$ be? What about $L(w)$, which is what we really care about?
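To make the gap concrete (numbers chosen for illustration, not from the slides): with $m = 10^6$ examples, dimension $d = 10^3$, and $B^2 G^2/\epsilon_{\mathrm{opt}}^2 = 10^4$ iterations, batch GD costs on the order of $md \cdot 10^4 = 10^{13}$ operations, while SGD costs on the order of $d \cdot 10^4 = 10^7$; the factor-$m$ difference in cost per iteration carries over directly to the total runtime.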

Slide 11: Overall Analysis of $L_D(w)$
Recall for ERM $\hat{w} = \arg\min_{\|w\| \le B} L_S(w)$:
$L_D(\hat{w}) \le L_D(w^*) + 2\sup_w |L_D(w) - L_S(w)|$, where $w^* = \arg\min_{\|w\| \le B} L_D(w)$.
For an $\epsilon_{\mathrm{opt}}$-suboptimal ERM $\bar{w}$:
$L_D(\bar{w}) \le L_D(w^*) + 2\sup_w |L_D(w) - L_S(w)| + \big(L_S(\bar{w}) - L_S(\hat{w})\big)$
- approximation error $\epsilon_{\mathrm{aprox}} = L_D(w^*)$
- estimation error $\epsilon_{\mathrm{est}} \le 2\sqrt{\frac{B^2 G^2}{m}}$
- optimization error $\epsilon_{\mathrm{opt}} \le \sqrt{\frac{B^2 G^2}{T}}$
Take $\epsilon_{\mathrm{opt}} \approx \epsilon_{\mathrm{est}}$, i.e. #iterations $T \approx$ sample size $m$.
To ensure $L_D(\bar{w}) \le L_D(w^*) + \epsilon$: $T, m = O\!\left(\frac{B^2 G^2}{\epsilon^2}\right)$.

Slide 12: Direct Online-to-Batch: SGD on $L_D(w)$
$\min_w L_D(w)$; use $g_t = \nabla_w\, \mathrm{loss}_{\mathrm{hinge}}\big(y\langle w, \phi(x)\rangle\big)$ for a random $(x, y) \sim \mathcal{D}$, so that $\mathbb{E}[g_t] \in \partial L_D(w)$:
- Initialize $w_0 = 0$
- At iteration $t$: draw $(x_t, y_t) \sim \mathcal{D}$; if $y_t\langle w_t, \phi(x_t)\rangle < 1$: $w_{t+1} \leftarrow w_t + \eta_t y_t \phi(x_t)$; else: $w_{t+1} \leftarrow w_t$
- Return $\bar{w}_T = \frac{1}{T}\sum_{t=1}^{T} w_t$
Guarantee: $L_D(\bar{w}_T) \le \inf_{\|w\|_2 \le B} L_D(w) + \sqrt{\frac{B^2 G^2}{T}}$, with $m = T = O\!\left(\frac{B^2 G^2}{\epsilon^2}\right)$.
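A sketch of this direct (fresh-sample) variant, again assuming $\phi(x) = x$; note that, unlike the ERM version, no projection step is used and $B, G$ enter only through the step size (these choices and the names are mine):

```python
import numpy as np

def sgd_online_to_batch(sample_xy, d, B, G, T, seed=0):
    """Direct SA / online-to-batch SGD on the population hinge loss (slide 12, a sketch).

    sample_xy: callable rng -> (x, y) with x in R^d and y in {-1, +1},
               i.e. a fresh draw from D at every iteration.
    """
    rng = np.random.default_rng(seed)
    eta = B / (G * np.sqrt(T))            # B, G only enter through the step size
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for _ in range(T):
        x, y = sample_xy(rng)
        if y * (w @ x) < 1:               # hinge subgradient step on a fresh sample
            w = w + eta * y * x
        w_sum += w
    return w_sum / T
```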

Slide 13: SGD for Machine Learning
Direct SA (online-to-batch) approach, $\min_w L(w)$:
- Initialize $w_0 = 0$
- At iteration $t$: draw $(x_t, y_t) \sim \mathcal{D}$; if $y_t\langle w_t, \phi(x_t)\rangle < 1$: $w_{t+1} \leftarrow w_t + \eta_t y_t \phi(x_t)$; else: $w_{t+1} \leftarrow w_t$
- Return $\bar{w}_T = \frac{1}{T}\sum_{t=1}^{T} w_t$
- Fresh sample at each iteration, $m = T$
- No need to project nor to require $\|w\| \le B$
- Implicit regularization via early stopping
SGD on ERM, $\min_{\|w\|_2 \le B} L_S(w)$:
- Draw $(x_1, y_1), \dots, (x_m, y_m) \sim \mathcal{D}$
- Initialize $w_0 = 0$
- At iteration $t$: pick $i \in 1..m$ at random; if $y_i\langle w_t, \phi(x_i)\rangle < 1$: $w_{t+1} \leftarrow w_t + \eta_t y_i \phi(x_i)$; else: $w_{t+1} \leftarrow w_t$; then project $w_{t+1}$ onto $\{\|w\| \le B\}$
- Return $\bar{w}_T = \frac{1}{T}\sum_{t=1}^{T} w_t$
- Can have $T > m$ iterations
- Need to project onto $\|w\| \le B$
- Explicit regularization via $\|w\|$

Slide 14: SGD for Machine Learning (cont.)
Same two algorithms as on slide 13 (direct SA / online-to-batch on $L(w)$ vs SGD on the ERM $\min_{\|w\|_2 \le B} L_S(w)$), with step size $\eta_t = \sqrt{\frac{B^2}{G^2 t}}$. Guarantees, with $w^* = \arg\min_{\|w\| \le B} L(w)$:
- Direct SA: $L(\bar{w}_T) \le L(w^*) + \sqrt{\frac{B^2 G^2}{T}}$
- SGD on ERM: $L(\bar{w}_T) \le L(w^*) + 2\sqrt{\frac{B^2 G^2}{m}} + \sqrt{\frac{B^2 G^2}{T}}$

Slide 15: SGD for Machine Learning (cont.)
Direct SA (online-to-batch) approach, $\min_w L(w)$:
- Initialize $w_0 = 0$
- At iteration $t$: draw $(x_t, y_t) \sim \mathcal{D}$; if $y_t\langle w_t, \phi(x_t)\rangle < 1$: $w_{t+1} \leftarrow w_t + \eta_t y_t \phi(x_t)$; else: $w_{t+1} \leftarrow w_t$
- Return $\bar{w}_T = \frac{1}{T}\sum_{t=1}^{T} w_t$
- Fresh sample at each iteration, $m = T$
- No need to shrink $w$; implicit regularization via early stopping
SGD on RERM, $\min_w L_S(w) + \frac{\lambda}{2}\|w\|^2$:
- Draw $(x_1, y_1), \dots, (x_m, y_m) \sim \mathcal{D}$
- Initialize $w_0 = 0$
- At iteration $t$: pick $i \in 1..m$ at random; if $y_i\langle w_t, \phi(x_i)\rangle < 1$: $w_{t+1} \leftarrow w_t + \eta_t y_i \phi(x_i)$; else: $w_{t+1} \leftarrow w_t$; then shrink: $w_{t+1} \leftarrow w_{t+1} - \eta_t \lambda w_t$
- Return $\bar{w}_T = \frac{1}{T}\sum_{t=1}^{T} w_t$
- Can have $T > m$ iterations
- Need to shrink $w$; explicit regularization via $\|w\|$
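A sketch of the right-hand column (SGD on the regularized ERM), assuming $\phi(x) = x$ and the step size $\eta_t = 1/(\lambda(t+1))$ that is standard for a $\lambda$-strongly-convex objective; the step size is my assumption, not stated on the slide:

```python
import numpy as np

def sgd_rerm_hinge(X, y, lam, T, seed=0):
    """SGD on the regularized ERM  L_S(w) + (lam/2)||w||^2  (slide 15, a sketch)."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for t in range(T):
        eta = 1.0 / (lam * (t + 1))      # assumed step size for the strongly convex objective
        i = rng.integers(m)
        w_prev = w
        if y[i] * (w @ X[i]) < 1:        # hinge subgradient step
            w = w + eta * y[i] * X[i]
        w = w - eta * lam * w_prev       # shrinkage from the (lam/2)||w||^2 term
        w_sum += w
    return w_sum / T
```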

Slide 16: SGD vs ERM
[Figure: the ball $\{\|w\|_2 \le B\}$ with $\hat{w} = \arg\min_{\|w\|_2 \le B} L_S(w)$ and $w^* = \arg\min_{\|w\|_2 \le B} L(w)$; the SGD path from $w_0$ reaches $\bar{w}_m$ after $m$ steps and, with $T > m$ iterations and projection, approaches $\hat{w}$; an $O\!\big(\sqrt{B^2 G^2/m}\big)$ gap is indicated; the unconstrained $\arg\min_w L_S(w)$ overfits.]

Slide 17: Mixed Approach: SGD on ERM
[Plot: test misclassification error vs # iterations (up to about 3,000,000), for training-set sizes m = 300,000, 400,000, 500,000; Reuters RCV1 data, CCAT task.]
The mixed approach (reusing examples) can make sense.

Slide 18: Mixed Approach: SGD on ERM (cont.)
[Same plot as slide 17: test misclassification error vs # iterations for m = 300,000, 400,000, 500,000; Reuters RCV1 data, CCAT task.]
The mixed approach (reusing examples) can make sense. Still, fresh samples are better:
- With a larger training set, the generalization error decreases faster per iteration
- A larger training set means less runtime to reach a target generalization error

Slide 19: Online Optimization vs Stochastic Approximation
In both the online setting and stochastic approximation:
- Samples are received sequentially
- $w$ is updated after each sample
But in the online setting:
- The objective is empirical regret, i.e. behavior on the observed instances
- $z_t$ is chosen adversarially (no distribution involved)
whereas in stochastic approximation:
- The objective is $\mathbb{E}[\ell(w, z)]$, i.e. behavior on future samples
- The $z_t$ are i.i.d. samples
Stochastic approximation is a computational approach; online learning is an analysis setup (e.g., Follow the Leader).

Slide 20: Part II: Realizable vs Agnostic Rates

Slide 21: Realizable vs Agnostic Rates
Recall, for finite hypothesis classes:
$L_D(\hat{h}) \le \inf_{h \in H} L_D(h) + 2\sqrt{\frac{\log|H| + \log\frac{2}{\delta}}{2m}}$, i.e. $m = O\!\left(\frac{\log|H|}{\epsilon^2}\right)$
But in the realizable case, if $\inf_{h \in H} L_D(h) = 0$:
$L_D(\hat{h}) \le \frac{\log|H| + \log\frac{1}{\delta}}{m}$, i.e. $m = O\!\left(\frac{\log|H|}{\epsilon}\right)$
Also for VC classes: in general $m = O\!\left(\frac{\mathrm{VCdim}}{\epsilon^2}\right)$, while in the realizable case $m = O\!\left(\frac{\mathrm{VCdim}\,\log(1/\epsilon)}{\epsilon}\right)$.
What happens if $L^* = \inf_{h \in H} L_D(h)$ is low, but not zero?

Slide 22: Estimating the Bias of a Coin
Hoeffding: $|p - \hat{p}| \le \sqrt{\frac{\log\frac{2}{\delta}}{2m}}$
Bernstein: $|p - \hat{p}| \le \sqrt{\frac{2p\log\frac{2}{\delta}}{m}} + \frac{2\log\frac{2}{\delta}}{3m}$
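The two bounds behave very differently when $p$ is small; here is a small numerical comparison (the values of $\delta$, $m$, and $p$ below are illustrative choices of mine):

```python
import numpy as np

# Compare the two confidence widths from slide 22 for a coin with small bias p.
delta, m = 0.05, 10_000
log_term = np.log(2 / delta)

hoeffding = np.sqrt(log_term / (2 * m))
for p in [0.5, 0.1, 0.01, 0.001]:
    bernstein = np.sqrt(2 * p * log_term / m) + 2 * log_term / (3 * m)
    print(f"p={p:6.3f}  Hoeffding width={hoeffding:.4f}  Bernstein width={bernstein:.4f}")
# For small p the Bernstein width shrinks roughly like sqrt(p/m) + 1/m,
# which is the source of the 'optimistic' (realizable-style) rates that follow.
```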

Slide 23: Optimistic VC Bound (aka $L^*$-bound, multiplicative bound)
Let $\hat{h} = \arg\min_{h \in H} L_S(h)$ and $L^* = \inf_{h \in H} L(h)$. For a hypothesis class with VC dimension $D$, with probability $1-\delta$ over $m$ samples:
$L(\hat{h}) \le L^* + 2\sqrt{L^*\,\frac{D\log\frac{2em}{D} + \log\frac{2}{\delta}}{m}} + 4\,\frac{D\log\frac{2em}{D} + \log\frac{2}{\delta}}{m}$
Equivalently, in multiplicative form: for every $\alpha > 0$, $L(\hat{h}) \le (1+\alpha)L^* + O\!\left(\left(1 + \frac{1}{\alpha}\right)\frac{D\log\frac{2em}{D} + \log\frac{2}{\delta}}{m}\right)$.
Sample complexity to get $L(\hat{h}) \le L^* + \epsilon$: $m(\epsilon) = O\!\left(\frac{D}{\epsilon}\cdot\frac{L^* + \epsilon}{\epsilon}\log\frac{1}{\epsilon}\right)$
Extends to bounded real-valued losses in terms of the VC-subgraph dimension.

Slide 24: From Parametric to Scale Sensitive
$L(h) = \mathbb{E}[\mathrm{loss}(h(x), y)]$, $h \in H$. Instead of the VC dimension or VC-subgraph dimension ($\approx$ #params), rely on a metric scale to control complexity, e.g. $H = \{x \mapsto \langle w, x\rangle \mid \|w\|_2 \le B\}$.
Learning depends on:
- Metric complexity measures: fat-shattering dimension, covering numbers, Rademacher complexity
- Scale sensitivity of the loss (bound on derivatives, or margin)
For $H$ with Rademacher complexity $R_m(H)$ and a $G$-Lipschitz loss:
$L(\hat{h}) \le L^* + 2G R_m(H) + \sqrt{\frac{\log\frac{2}{\delta}}{2m}}$
where, e.g., $R_m(H) = \sqrt{\frac{B^2 \sup\|x\|^2}{m}}$ for the class above.

Slide 25: Non-Parametric Optimistic Rate for Smooth Loss
Theorem: for any $H$ with (worst-case) Rademacher complexity $R_m(H)$, and any $H$-smooth loss ($|\mathrm{loss}''| \le H$) bounded by $b$, with probability $1-\delta$ over $m$ samples, up to logarithmic factors:
$L(\hat{h}) \le L^* + \tilde{O}\!\left(\sqrt{L^*\, H R_m^2(H)} + H R_m^2(H)\right)$
with the corresponding sample complexity (see the table on the next slide).

Slide 26: Parametric vs Non-Parametric
Excess-risk rates (columns: parametric classes with $\dim(H) \le D$, $|h| \le 1$; scale-sensitive classes with Rademacher complexity $R_m(H) \le R_m$):
- Lipschitz loss, $|\phi'| \le G$ (e.g. hinge, $\ell_1$): parametric $\frac{GD}{m} + \sqrt{\frac{L^* GD}{m}}$; scale-sensitive $\sqrt{G^2 R_m^2}$
- Smooth loss, $|\phi''| \le H$ (e.g. logistic, Huber, smoothed hinge): parametric $\frac{HD}{m} + \sqrt{\frac{L^* HD}{m}}$; scale-sensitive $H R_m^2 + \sqrt{L^* H R_m^2}$
- Smooth and strongly convex loss, $\mu \le \phi'' \le H$ (e.g. square loss): parametric $\frac{H}{\mu}\cdot\frac{HD}{m}$; scale-sensitive $H R_m^2 + \sqrt{L^* H R_m^2}$
All rates are min-max tight up to poly-log factors.

Slide 27: Optimistic Learning Guarantees
$L(\hat{h}) \le (1+\alpha)L^* + \left(1 + \frac{1}{\alpha}\right)\tilde{O}\!\left(\frac{R}{m}\right)$, i.e. $m(\epsilon) = O\!\left(\frac{R}{\epsilon}\cdot\frac{L^* + \epsilon}{\epsilon}\right)$
Guarantees of this form hold for:
- Parametric classes
- Scale-sensitive classes with smooth loss
- The Perceptron guarantee
- Margin bounds
- Stability-based guarantees with smooth loss
- Online learning/optimization with smooth loss
But not for:
- Non-parametric (scale-sensitive) classes with non-smooth loss
- Online learning/optimization with non-smooth loss

Slide 28: Why Optimistic Guarantees?
$L(\hat{h}) \le (1+\alpha)L^* + \left(1 + \frac{1}{\alpha}\right)\tilde{O}\!\left(\frac{R}{m}\right)$, i.e. $m(\epsilon) = O\!\left(\frac{R}{\epsilon}\cdot\frac{L^* + \epsilon}{\epsilon}\right)$
The optimistic regime (approximation error $L^*$ comparable to or below the estimation error $\epsilon$) is typically the relevant regime:
- If $\epsilon \ll L^*$, it is better to spend energy on lowering the approximation error (use a more complex class)
- Often important in highlighting true phenomena

Slide 29: Part III: Nearest Neighbor Classification

Slide 30: The Nearest Neighbor Classifier
Training sample $S = (x_1, y_1), \dots, (x_m, y_m)$; we want to predict the label of a new point $x$.
The nearest neighbor rule:
- Find the closest training point: $i^* = \arg\min_i \rho(x, x_i)$
- Predict the label of $x$ as $y_{i^*}$

Slide 31: The Nearest Neighbor Classifier
Training sample $S = (x_1, y_1), \dots, (x_m, y_m)$; we want to predict the label of a new point $x$.
The nearest neighbor rule:
- Find the closest training point: $i^* = \arg\min_i \rho(x, x_i)$
- Predict the label of $x$ as $y_{i^*}$
As a learning rule: $\mathrm{NN}(S) = h$ where $h(x) = y_{\arg\min_i \rho(x, x_i)}$
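A minimal NumPy sketch of the NN rule with $\rho(x, x') = \|x - x'\|_2$, i.e. taking $\phi(x) = x$ (the function name is mine):

```python
import numpy as np

def nn_rule(X_train, y_train, X_test):
    """The 1-nearest-neighbor rule NN(S) with rho(x, x') = ||x - x'||_2.

    X_train: (m, d), y_train: (m,), X_test: (n, d). Returns predicted labels.
    """
    # Pairwise squared Euclidean distances, shape (n, m)
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)       # index of the closest training point
    return y_train[nearest]           # predict its label
```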

Slide 32: Where is the Bias Hiding?
Find the closest training point $i^* = \arg\min_i \rho(x, x_i)$ and predict the label of $x$ as $y_{i^*}$. But what is the right distance between images? Between sound waves? Between sentences?
Option 1: $\rho(x, x') = \|\phi(x) - \phi(x')\|_2$. What representation $\phi(x)$?
[Figure: the same points under the representation $\phi(x) = (5x_1, x_2)$; rescaling coordinates changes which training point is nearest.]

Slide 33: Where is the Bias Hiding?
Find the closest training point $i^* = \arg\min_i \rho(x, x_i)$ and predict the label of $x$ as $y_{i^*}$. What is the right distance between images? Between sound waves? Between sentences?
Option 1: $\rho(x, x') = \|\phi(x) - \phi(x')\|_2$. What representation $\phi(x)$?
Maybe a different distance? $\|\phi(x) - \phi(x')\|_1$? $\|\phi(x) - \phi(x')\|_\infty$? $\sin\big(\angle(\phi(x), \phi(x'))\big)$? $\mathrm{KL}\big(\phi(x)\,\|\,\phi(x')\big)$?
[Figure: the same points under $\phi(x) = (5x_1, x_2)$ and under $\ell_1$ vs $\ell_2$ distances.]

Slide 34: Where is the Bias Hiding?
Find the closest training point $i^* = \arg\min_i \rho(x, x_i)$ and predict the label of $x$ as $y_{i^*}$. What is the right distance between images? Between sound waves? Between sentences?
Option 1: $\rho(x, x') = \|\phi(x) - \phi(x')\|_2$. What representation $\phi(x)$? Maybe a different distance: $\|\phi(x) - \phi(x')\|_1$? $\|\phi(x) - \phi(x')\|_\infty$? $\sin\big(\angle(\phi(x), \phi(x'))\big)$? $\mathrm{KL}\big(\phi(x)\,\|\,\phi(x')\big)$?
Option 2: A special-purpose distance measure on $x$, e.g. edit distance, a deformation measure, etc.

Slide 35: Nearest Neighbor Learning Guarantee
Optimal (Bayes) predictor: $h^* = \arg\min_h L_D(h)$, i.e. $h^*(x) = +1$ if $\eta(x) > 0.5$ and $-1$ if $\eta(x) < 0.5$, where $\eta(x) = P_D(y = 1 \mid x)$.
For the NN rule with $\rho(x, x') = \|\phi(x) - \phi(x')\|_2$, $\phi: \mathcal{X} \to [0,1]^d$, and $|\eta(x) - \eta(x')| \le c_D\,\rho(x, x')$:
$\mathbb{E}_{S \sim D^m}\big[L(\mathrm{NN}(S))\big] \le 2L(h^*) + \frac{4 c_D \sqrt{d}}{\sqrt[d+1]{m}}$

Slide 36: Data Fit / Complexity Tradeoff
$\mathbb{E}_{S \sim D^m}\big[L(\mathrm{NN}(S))\big] \le 2L(h^*) + \frac{4 c_D \sqrt{d}}{\sqrt[d+1]{m}}$
k-nearest neighbor: predict according to the majority vote among the $k$ closest points from $S$.
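A corresponding k-NN sketch with majority vote under the Euclidean distance, assuming labels in $\{-1, +1\}$ (again an illustration of mine, not code from the lecture):

```python
import numpy as np

def knn_rule(X_train, y_train, X_test, k):
    """k-nearest-neighbor rule: majority vote among the k closest training
    points under the Euclidean distance. Labels are assumed to be in {-1, +1}.
    """
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    # indices of the k nearest training points for each test point (unordered)
    knn_idx = np.argpartition(d2, kth=k - 1, axis=1)[:, :k]
    votes = y_train[knn_idx].sum(axis=1)     # sum of +/-1 votes
    return np.where(votes >= 0, 1, -1)       # ties broken in favor of +1
```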

37 k-nearest Neighbor: Data Fit / Complexity Tradeoff h*= S= k=1 k=50 k=5 k=12 k=100 k=200

38 k-nearest Neighbor Guarantee For k-nn with ρ x, x = φ x φ x 2, and φ: X [0,1] d : η x η x c D ρ(x, x ) E S D m L NN k S k L h + 6c D d + k d+1 m Should increase k with sample size m Above theory suggests k m L h 2/3 m 2 3(d+1) Universal Learning: for any smooth D and representation φ (with continuous P(y φ x )), if we increase k slowly enough, we will eventually converge to optimal L(h ) Very non-uniform: sample complexity depends not only on h, but also on D

Slide 39: Uniform and Non-Uniform Learnability
Definition: A hypothesis class $H$ is agnostically PAC-learnable if there exists a learning rule $A$ such that $\forall \epsilon, \delta > 0$, $\exists m(\epsilon, \delta)$, $\forall D$, $\forall h$: for $S \sim D^{m(\epsilon,\delta)}$, with probability at least $1-\delta$, $L_D(A(S)) \le L_D(h) + \epsilon$.
Definition: A hypothesis class $H$ is non-uniformly learnable if there exists a learning rule $A$ such that $\forall \epsilon, \delta > 0$, $\forall h$, $\exists m(\epsilon, \delta, h)$, $\forall D$: for $S \sim D^{m(\epsilon,\delta,h)}$, with probability at least $1-\delta$, $L_D(A(S)) \le L_D(h) + \epsilon$.
Definition: A hypothesis class $H$ is consistently learnable if there exists a learning rule $A$ such that $\forall \epsilon, \delta > 0$, $\forall h$, $\forall D$, $\exists m(\epsilon, \delta, h, D)$: for $S \sim D^{m(\epsilon,\delta,h,D)}$, with probability at least $1-\delta$, $L_D(A(S)) \le L_D(h) + \epsilon$.
Realizable/optimistic guarantees: the dependence on $D$ enters only through $L_D(h)$.
