Computational and Statistical Learning Theory
|
|
- Silvia Short
- 5 years ago
- Views:
Transcription
1 Computational and Statistical Learning Theory TTIC Prof. Nati Srebro Lecture 17: Stochastic Optimization Part II: Realizable vs Agnostic Rates Part III: Nearest Neighbor Classification
2 Stochastic (Sub)-Gradient Descent online2stochastic [Zinkevich 03] [Cesa-Binachi et al 02] [Nemirovski Yudin 78] Online Mirror Descent [Shalev-Shwatz Singer 07] [Nemirovski Yudin 78] Online Gradient Descent Stochastic Gradient Descent Stochastic Mirror Descent Optimize F w = E z D f w, z s.t. w W 1. Initialize w 1 = 0 W 2. At iteration t = 1,2,3, 1. Sample z t D (Obtain g t s.t. E g t F(w t )) 2. w t+1 = Π W w t η t f w t, z t 3. Return w m = 1 σ m t=1 m w t g t If f w, z 2 G then with appropriate step size: E F w m inf w W, w 2 B F w + O B 2 G 2 m Similarly, also Stochastic Mirror Descent
3 Stochastic Optimization min F(w) = E w W z~d [f w, z ] based only on stochastic information on F Only unbiased estimates of F w, F(w) No direct access to F E.g., fixed f(w, z) but D unknown Optimize F(w) based on iid sample z 1, z 2,, z m ~D g = f(w, z t ) is unbiased estimate of F w Traditional applications Optimization under uncertainty Uncertainty about network performance Uncertainty about client demands Uncertainty about system behavior in control problems Complex systems where its easier to sample then integrate over z
4 Valdimir Vapnik Machine Learning is Stochastic Optimization Arkadi Nemirovskii min h L h = E z D l h, z = E x,y D loss h x, y Optimization variable: predictor h Objective: generalization error L(h) Stochasticity over z = (x, y) General Learning Stochastic Optimization:
5 Stochastic Optimization Focus on computational efficiency Generally assumes unlimited sampling - as in monte-carlo methods for complicated objectives Optimization variable generally a vector in a normed space - complexity control through norm Mostly convex objectives vs Statistical Learning Focus on sample size What can be done with a fixed number of samples? Abstract hypothesis classes - linear predictors, but also combinatorial hypothesis classes - generic measures of complexity such as VC-dim, fat shattering, Radamacher Also non-convex classes and loss functions Arkadi Nemirovskii Valdimir Vapnik
6 Two Approaches to Stochastic Optimization / Learning min F(w) = E z~d[f w, z ] w W Empirical Risk Minimization (ERM) / Sample Average Approximation (SAA): Collect sample z 1,,z m Minimize F S w = 1 m σ i f w, z i Analysis typically based on Uniform Convergence Stochastic Approximation (SA): [Robins Monro 1951] Update w t based on z t E.g., based on g t = f(w, z t ) E.g.: stochastic gradient descent Online-to-batch conversion of online algorithm
7 SA/SGD for Machine Learning In learning with ERM, need to optimize w = arg min w W L S(w) L S w = 1 m σ i l(w, z i ) L S w is expensive to evaluate exactly O(md) time Cheap to get unbiased gradient estimate O(d) time SGD guarantee: i Unif(1 m) E g = σ i 1 m l w, z i g = l w, z i = L S (w) E L S ഥw T inf L S w + sup l sup w 2 w W T
8 SGD for SVM min L S w s. t. w 2 B Use g t = w loss hinge w t, φ it x ; y it for random i t Initialize w 0 = 0 At iteration t: Pick i 1 m at random If y i w t, φ x i < 1, w t+1 w t + η t y i φ x i else: w t+1 w t If w t+1 2 > B, then w t+1 B Return ഥw T = 1 σ T T t=1 w t w t+1 w t+1 2 φ x 2 G g t 2 G L S ഥw T L S w + B2 G 2 (in expectation over randomness in algorithm) T
9 Stochastic vs Batch min L S w s. t. w 2 B w w g 1 g 1 = loss(w, (x 1, y 1 ) ) g 2 = loss(w, (x 2, y 2 ) ) x 1,y 1 x 2,y 2 g 1 = loss(w, (x 1, y 1 ) ) g 2 = loss(w, (x 2, y 2 ) ) w w g 2 g 3 = loss(w, (x 3, y 3 ) ) x 3,y 3 g 3 = loss(w, (x 3, y 3 ) ) w w g 3 w w g 4 w w g 5 g 4 = loss(w, (x 4, y 4 ) ) g 5 = loss(w, (x 5, y 5 ) ) x 4,y 4 x 5,y 5 g 4 = loss(w, (x 4, y 4 ) ) g 5 = loss(w, x 5, y 5 ) L S w = 1 m σg i w w g m 1 w w g g m = loss(w, (x m, y m ) ) m x m,y m g m = loss(w, (x m, y m ) ) w w g i
10 Stochastic vs Batch Intuitive argument: if only taking simple gradient steps, better to be stochastic To get L S w L S w + ε opt : Batch GD SGD #iter cost/iter runtime B 2 G 2 2 /ε opt md md B2 G 2 ε2 opt B 2 G 2 2 /ε opt d d B2 G 2 ε2 opt Comparison to methods with a log 1/ε opt dependence that use the structure of L S (w) (not only local access)? How small should ε opt be? What about L w, which is what we really care about?
11 Overall Analysis of L D (w) Recall for ERM: L D w L D w + 2sup w = arg min w B L S(w) w L D w + L S w For ε opt suboptimal ERM ഥw: L D ഥw L D w + 2sup L D w L S w + L S ഥw L S w ε aprox w w = arg min w B L D(w) ε est 2 B2 G 2 m ε opt B2 G 2 T Take ε opt ε est, i.e. #iter T sample size m To ensure L D w L D w + ε: T, m = O B2 G 2 ε 2
12 Direct Online-to-Batch: SGD on L D (w) min w L D(w) use g t = w hinge y w, φ x E g t Initialize w 0 = 0 At iteration t: Draw x t, y t ~D If y t w t, φ x t < 1, w t+1 w t + η t y t φ x t else: w t+1 w t Return ഥw T T = 1 T σ t=1 w t = L D w for random y, x~d L D ഥw T inf w 2 B L D w + B2 G 2 T m = T = O B2 G 2 ε 2
13 SGD for Machine Learning Direct SA (online2batch) Approach: Initialize w 0 = 0 At iteration t: Draw x t, y t ~D If y t w t, φ x t < 1, w t+1 w t + η t y t φ x t else: w t+1 w t Return ഥw T = 1 σ T T t=1 w t min w L w SGD on ERM: min w 2 B L S w Draw x 1, y 1,, x m, y m ~D Initialize w 0 = 0 At iteration t: Pick i 1 m at random If y i w t, φ x i < 1, w t+1 w t + η t y i φ x i else: w t+1 w t w t+1 proj w t+1 to w B Return ഥw T = 1 σ T T t=1 w t Fresh sample at each iteration, m = T No need to project nor require w B Implicit regularization via early stopping Can have T > m iterations Need to project to w B Explicit regularization via w
14 SGD for Machine Learning Direct SA (online2batch) Approach: Initialize w 0 = 0 At iteration t: Draw x t, y t ~D If y t w t, φ x t < 1, w t+1 w t + η t y t φ x t else: w t+1 w t Return ഥw T = 1 σ T T t=1 w t η t = min w B 2 /G 2 t L w SGD on ERM: min w 2 B L S w Draw x 1, y 1,, x m, y m ~D Initialize w 0 = 0 At iteration t: Pick i 1 m at random If y i w t, φ x i < 1, w t+1 w t + η t y i φ x i else: w t+1 w t w t+1 proj w t+1 to w B Return ഥw T = 1 σ T T t=1 w t L ഥw T L w + B2 G 2 T L ഥw T L w + 2 B2 G 2 m + B2 G 2 T w = arg min w B L(w)
15 SGD for Machine Learning min w Direct SA (online2batch) Approach: L w SGD on RERM: min L S w + λ 2 w Initialize w 0 = 0 At iteration t: Draw x t, y t ~D If y t w t, φ x t < 1, w t+1 w t + η t y t φ x t else: w t+1 w t Return ഥw T = 1 σ T T t=1 w t Fresh sample at each iteration, m = T No need shrink w Implicit regularization via early stopping Draw x 1, y 1,, x m, y m ~D Initialize w 0 = 0 At iteration t: Pick i 1 m at random If y i w t, φ x i < 1, w t+1 w t + η t y i φ x i else: w t+1 w t w t+1 w t+1 λw (t) Return ഥw T = 1 σ T T t=1 w t Can have T > m iterations Need to shrink w Explicit regularization via w
16 SGD vs ERM w = arg min w B L S(w) w = arg min w B L(w) w 0 T > m w/ proj w ഥw m w O B 2 G 2 m arg min L S(w) w (overfit)
17 Test misclassification error Mixed Approach: SGD on ERM m=300,000 m=400,000 m=500, ,000,000 2,000,000 3,000,000 # iterations k Reuters RCV1 data, CCAT task The mixed approach (reusing examples) can make sense
18 Test misclassification error Mixed Approach: SGD on ERM m=300,000 m=400,000 m=500, ,000,000 2,000,000 3,000,000 # iterations k Reuters RCV1 data, CCAT task The mixed approach (reusing examples) can make sense Still: fresh samples are better ) With a larger training set, can reduce generalization error faster ) Larger training set means less runtime to get target generalization error
19 Online Optimization vs Stochastic Approximation In both Online Setting and Stochastic Approximation Receive samples sequentially Update w after each sample But, in Online Setting: Objective is empirical regret, i.e. behavior on observed instances z t chosen adversarialy (no distribution involved) As opposed on Stochastic Approximation: Objective is E l w, z, i.e. behavior on future samples i.i.d. samples z t Stochastic Approximation is a computational approach, Online Learning is an analysis setup E.g. Follow the leader
20 Part II: Realizable vs Agnostic Rates
21 Realizable vs Agnostic Rates Recall for finite hypothesis classes: log H +log2 δ 2m L D h inf L D(h) + 2 h H But in the realizable case, if inf L D(h) = 0: h H L D h log H +log1 δ m m = O log H ε m = O log H ε 2 Also for VC-classes, in general m = O VCdim realizable case m = O VCdim log 1/ε ε ε 2 while in the What happens if L = inf h H L D(h) is low, but not zero?
22 Estimating the Bias of a Coin p pƹ log 2 δ 2m p pƹ 2 p log 2 δ m + 2 log 2 δ 3m
23 Optimistic VC bound (aka L -bound, multiplicative bound) h = arg min h H L S(h) L = inf h H L(h) For a hypothesis class with VC-dim D, w.p. 1-δ over n samples: L h L + 2 L = inf α D log2emτ D +log m 2 δ 1 + α L α D log2emτ + 4 D +log 2 δ m D log2emτ 4 D +log 2 δ m Sample complexity to get L h L + ε: m ε = O D ε L + ε ε log 1 ε Extends to bounded real-valued loss in terms of VC-subgraph dim
24 From Parametric to Scale Sensitive L h = E loss h x, y h H Instead of VC-dim or VC-subgraph-dim ( #params), rely on metric scale to control complexity, e.g.: H = w w, x w 2 B } R Learning depends on: Metric complexity measures: fat shattering dimension, covering numbers, Rademacher Complexity Scale sensitivity of loss (bound on derivatives or margin ) For Hwith Rademacher Complexity R m, and loss G: L h L + 2GR m + log 2 δ 2m R m R m L + O G 2 R + log 2 δ 2m = R m H = B2 sup x 2 m
25 Non-Parametric Optimistic Rate for Smooth Loss Theorem: for any H with (worst case) Rademacher Complexity R m (H), and any smooth loss with loss loss b, w.p. 1 δ over n samples: H, Sample complexity
26 Parametric vs Non-Parametric Parametric dim(h) D, h 1 Scale-Sensitive R n H R n Lipschitz: φ G (e.g. hinge, l 1 ) G D m + L GD m G 2 R m Smooth: φ H (e.g. logistic, Huber, smoothed hinge) H D m + L HD m H R m + L HR m Smooth & strongly convex: μ φ H (e.g. square loss) H μ H D m H R m + L HR m Min-max tight up to poly-log factors
27 Optimistic Learning Guarantees L h 1 + α L α O R m m ε O R ε L + ε ε Parametric classes Scale-sensitive classes with smooth loss Perceptron gurantee Margin Bounds Stability-based guarantees with smooth loss Online Learning/Optimization with smooth loss Non-param (scale sensitive) classes with non-smooth loss Online Learning/Optimization with non-smooth loss
28 Why Optimistic Guarantees? L h 1 + α L α O R m m ε O R ε L + ε ε Optimistic regime typically relevant regime: Approximation error L Estimation error ε If ε L, better to spend energy on lowering approx. error (use more complex class) Often important in highlighting true phenomena
29 Part III: Nearest Neighbor Classification
30 The Nearest Neighbor Classifier Training sample S = x 1, y 1,, x m, y m Want to predict label of new point x The Nearest Neighbor Rule: Find the closest training point: i = arg min i ρ(x, x i ) Predict label of x as y i?
31 The Nearest Neighbor Classifier Training sample S = x 1, y 1,, x m, y m Want to predict label of new point x The Nearest Neighbor Rule: Find the closest training point: i = arg min i ρ(x, x i ) Predict label of x as y i As learning rule: NN S = h where h(x) = y arg min i ρ x,x i
32 Where is the Bias Hiding? Find the closest training point: i = arg min i ρ(x, x i ) Predict label of x as y i What is the right distance between images? Between sound waves? Between sentences? Option 1: ρ x, x = φ x φ x 2 What representation φ(x)? xx x x 12 φ x φ(x ) 2 φ x = (5x 1, x 2 )
33 Where is the Bias Hiding? Find the closest training point: i = arg min i ρ(x, x i ) Predict label of x as y i What is the right distance between images? Between sound waves? Between sentences? Option 1: ρ x, x = φ x φ x 2 What representation φ(x)? Maybe a different distance? φ x φ x 1? φ x φ x? sin( (φ x, φ x )? KL(φ(x) φ x )? φ x φ(x ) 2 φ x = (5x 1, x 2 ) x x 1 x x
34 Where is the Bias Hiding? Find the closest training point: i = arg min i ρ(x, x i ) Predict label of x as y i What is the right distance between images? Between sound waves? Between sentences? Option 1: ρ x, x = φ x φ x 2 What representation φ(x)? Maybe a different distance? φ x φ x 1? φ x φ x? sin( (φ x, φ x )? KL(φ(x) φ x )? Option 2: Special-purpose distance measure on x E.g. edit distance, deformation measure, etc
35 Nearest Neighbor Learning Guarantee Optimal predictor: h = arg min L D (h) h +1, η x > 0.5 x = ቊ 1, η x < 0.5 η x = P D (y = 1 x) For the NN rule with ρ x, x = φ x φ x 2, and φ: X [0,1] d : E S D m L NN S 2L h + 4c D d d+1 m η x η x c D ρ(x, x )
36 Data Fit / Complexity Tradeoff E S D m L NN S 2L(h ) + 4c D d d+1 m k-nearest Neighbor: predict according to majority among k closest point from S.
37 k-nearest Neighbor: Data Fit / Complexity Tradeoff h*= S= k=1 k=50 k=5 k=12 k=100 k=200
38 k-nearest Neighbor Guarantee For k-nn with ρ x, x = φ x φ x 2, and φ: X [0,1] d : η x η x c D ρ(x, x ) E S D m L NN k S k L h + 6c D d + k d+1 m Should increase k with sample size m Above theory suggests k m L h 2/3 m 2 3(d+1) Universal Learning: for any smooth D and representation φ (with continuous P(y φ x )), if we increase k slowly enough, we will eventually converge to optimal L(h ) Very non-uniform: sample complexity depends not only on h, but also on D
39 Uniform and Non-Uniform Learnability Definition: A hypothesis class H is agnostically PAC-Learnable if there exists a δ learning rule A such that ε, δ > 0, m ε, δ, D, h, S D m ε,δ, L D A S L D h + ε Definition: A hypothesis class H is non-uniformly learnable if there exists a δ learning rule A such that ε, δ > 0, h, m ε, δ, h, D, S D m ε,δ,h, L D A S L D h + ε Definition: A hypothesis class H is consistently learnable if there exists a δ learning rule A such that ε, δ > 0, h D, m ε, δ, h, D, S D m ε,δ,h,d, L D A S L D h + ε Realizable/Optimistic Guarantees: D dependence through L D (h)
Optimistic Rates Nati Srebro
Optimistic Rates Nati Srebro Based on work with Karthik Sridharan and Ambuj Tewari Examples based on work with Andy Cotter, Elad Hazan, Tomer Koren, Percy Liang, Shai Shalev-Shwartz, Ohad Shamir, Karthik
More informationComputational and Statistical Learning Theory
Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 12: Weak Learnability and the l 1 margin Converse to Scale-Sensitive Learning Stability Convex-Lipschitz-Bounded Problems
More informationLearnability, Stability, Regularization and Strong Convexity
Learnability, Stability, Regularization and Strong Convexity Nati Srebro Shai Shalev-Shwartz HUJI Ohad Shamir Weizmann Karthik Sridharan Cornell Ambuj Tewari Michigan Toyota Technological Institute Chicago
More informationMachine Learning. Support Vector Machines. Fabio Vandin November 20, 2017
Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training
More informationLecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron
CS446: Machine Learning, Fall 2017 Lecture 14 : Online Learning, Stochastic Gradient Descent, Perceptron Lecturer: Sanmi Koyejo Scribe: Ke Wang, Oct. 24th, 2017 Agenda Recap: SVM and Hinge loss, Representer
More informationStochastic Subgradient Method
Stochastic Subgradient Method Lingjie Weng, Yutian Chen Bren School of Information and Computer Science UC Irvine Subgradient Recall basic inequality for convex differentiable f : f y f x + f x T (y x)
More informationOptimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade
Optimization in the Big Data Regime 2: SVRG & Tradeoffs in Large Scale Learning. Sham M. Kakade Machine Learning for Big Data CSE547/STAT548 University of Washington S. M. Kakade (UW) Optimization for
More informationSupport vector machines Lecture 4
Support vector machines Lecture 4 David Sontag New York University Slides adapted from Luke Zettlemoyer, Vibhav Gogate, and Carlos Guestrin Q: What does the Perceptron mistake bound tell us? Theorem: The
More informationEmpirical Risk Minimization
Empirical Risk Minimization Fabrice Rossi SAMM Université Paris 1 Panthéon Sorbonne 2018 Outline Introduction PAC learning ERM in practice 2 General setting Data X the input space and Y the output space
More informationStochastic optimization in Hilbert spaces
Stochastic optimization in Hilbert spaces Aymeric Dieuleveut Aymeric Dieuleveut Stochastic optimization Hilbert spaces 1 / 48 Outline Learning vs Statistics Aymeric Dieuleveut Stochastic optimization Hilbert
More informationIntroduction to Machine Learning (67577) Lecture 7
Introduction to Machine Learning (67577) Lecture 7 Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem Solving Convex Problems using SGD and RLM Shai Shalev-Shwartz (Hebrew
More informationPAC-learning, VC Dimension and Margin-based Bounds
More details: General: http://www.learning-with-kernels.org/ Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz PAC-learning, VC Dimension and Margin-based
More informationDay 3: Classification, logistic regression
Day 3: Classification, logistic regression Introduction to Machine Learning Summer School June 18, 2018 - June 29, 2018, Chicago Instructor: Suriya Gunasekar, TTI Chicago 20 June 2018 Topics so far Supervised
More informationComputational and Statistical Learning Theory
Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 7: Computational Complexity of Learning Agnostic Learning Hardness of Learning via Crypto Assumption: No poly-time algorithm
More informationThe sample complexity of agnostic learning with deterministic labels
The sample complexity of agnostic learning with deterministic labels Shai Ben-David Cheriton School of Computer Science University of Waterloo Waterloo, ON, N2L 3G CANADA shai@uwaterloo.ca Ruth Urner College
More informationECS171: Machine Learning
ECS171: Machine Learning Lecture 4: Optimization (LFD 3.3, SGD) Cho-Jui Hsieh UC Davis Jan 22, 2018 Gradient descent Optimization Goal: find the minimizer of a function min f (w) w For now we assume f
More informationLarge Scale Machine Learning with Stochastic Gradient Descent
Large Scale Machine Learning with Stochastic Gradient Descent Léon Bottou leon@bottou.org Microsoft (since June) Summary i. Learning with Stochastic Gradient Descent. ii. The Tradeoffs of Large Scale Learning.
More informationMachine Learning in the Data Revolution Era
Machine Learning in the Data Revolution Era Shai Shalev-Shwartz School of Computer Science and Engineering The Hebrew University of Jerusalem Machine Learning Seminar Series, Google & University of Waterloo,
More informationComputational and Statistical Learning Theory
Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 4: MDL and PAC-Bayes Uniform vs Non-Uniform Bias No Free Lunch: we need some inductive bias Limiting attention to hypothesis
More informationMachine Learning. Regularization and Feature Selection. Fabio Vandin November 13, 2017
Machine Learning Regularization and Feature Selection Fabio Vandin November 13, 2017 1 Learning Model A: learning algorithm for a machine learning task S: m i.i.d. pairs z i = (x i, y i ), i = 1,..., m,
More informationStochastic Gradient Descent
Stochastic Gradient Descent Machine Learning CSE546 Carlos Guestrin University of Washington October 9, 2013 1 Logistic Regression Logistic function (or Sigmoid): Learn P(Y X) directly Assume a particular
More informationComputational and Statistical Learning Theory
Coputational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 2: PAC Learning and VC Theory I Fro Adversarial Online to Statistical Three reasons to ove fro worst-case deterinistic
More informationLarge-scale Stochastic Optimization
Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation
More informationTrade-Offs in Distributed Learning and Optimization
Trade-Offs in Distributed Learning and Optimization Ohad Shamir Weizmann Institute of Science Includes joint works with Yossi Arjevani, Nathan Srebro and Tong Zhang IHES Workshop March 2016 Distributed
More informationMachine Learning. Linear Models. Fabio Vandin October 10, 2017
Machine Learning Linear Models Fabio Vandin October 10, 2017 1 Linear Predictors and Affine Functions Consider X = R d Affine functions: L d = {h w,b : w R d, b R} where ( d ) h w,b (x) = w, x + b = w
More informationComputational and Statistical Learning Theory
Computational and Statistical Learning Theory Problem set 1 Due: Monday, October 10th Please send your solutions to learning-submissions@ttic.edu Notation: Input space: X Label space: Y = {±1} Sample:
More informationMachine Learning. Linear Models. Fabio Vandin October 10, 2017
Machine Learning Linear Models Fabio Vandin October 10, 2017 1 Linear Predictors and Affine Functions Consider X = R d Affine functions: L d = {h w,b : w R d, b R} where ( d ) h w,b (x) = w, x + b = w
More informationClassification Logistic Regression
Announcements: Classification Logistic Regression Machine Learning CSE546 Sham Kakade University of Washington HW due on Friday. Today: Review: sub-gradients,lasso Logistic Regression October 3, 26 Sham
More informationStochastic Optimization
Introduction Related Work SGD Epoch-GD LM A DA NANJING UNIVERSITY Lijun Zhang Nanjing University, China May 26, 2017 Introduction Related Work SGD Epoch-GD Outline 1 Introduction 2 Related Work 3 Stochastic
More informationAccelerating Stochastic Optimization
Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz
More informationCSE 417T: Introduction to Machine Learning. Lecture 11: Review. Henry Chai 10/02/18
CSE 417T: Introduction to Machine Learning Lecture 11: Review Henry Chai 10/02/18 Unknown Target Function!: # % Training data Formal Setup & = ( ), + ),, ( -, + - Learning Algorithm 2 Hypothesis Set H
More informationIntroduction to Machine Learning (67577) Lecture 5
Introduction to Machine Learning (67577) Lecture 5 Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem Nonuniform learning, MDL, SRM, Decision Trees, Nearest Neighbor Shai
More informationDan Roth 461C, 3401 Walnut
CIS 519/419 Applied Machine Learning www.seas.upenn.edu/~cis519 Dan Roth danroth@seas.upenn.edu http://www.cis.upenn.edu/~danroth/ 461C, 3401 Walnut Slides were created by Dan Roth (for CIS519/419 at Penn
More informationOverfitting, Bias / Variance Analysis
Overfitting, Bias / Variance Analysis Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 8, 207 / 40 Outline Administration 2 Review of last lecture 3 Basic
More informationMachine Learning. VC Dimension and Model Complexity. Eric Xing , Fall 2015
Machine Learning 10-701, Fall 2015 VC Dimension and Model Complexity Eric Xing Lecture 16, November 3, 2015 Reading: Chap. 7 T.M book, and outline material Eric Xing @ CMU, 2006-2015 1 Last time: PAC and
More informationMachine Learning Basics Lecture 4: SVM I. Princeton University COS 495 Instructor: Yingyu Liang
Machine Learning Basics Lecture 4: SVM I Princeton University COS 495 Instructor: Yingyu Liang Review: machine learning basics Math formulation Given training data x i, y i : 1 i n i.i.d. from distribution
More informationTopics we covered. Machine Learning. Statistics. Optimization. Systems! Basics of probability Tail bounds Density Estimation Exponential Families
Midterm Review Topics we covered Machine Learning Optimization Basics of optimization Convexity Unconstrained: GD, SGD Constrained: Lagrange, KKT Duality Linear Methods Perceptrons Support Vector Machines
More informationCOMS 4771 Introduction to Machine Learning. Nakul Verma
COMS 4771 Introduction to Machine Learning Nakul Verma Announcements HW2 due now! Project proposal due on tomorrow Midterm next lecture! HW3 posted Last time Linear Regression Parametric vs Nonparametric
More informationLearning Theory. Piyush Rai. CS5350/6350: Machine Learning. September 27, (CS5350/6350) Learning Theory September 27, / 14
Learning Theory Piyush Rai CS5350/6350: Machine Learning September 27, 2011 (CS5350/6350) Learning Theory September 27, 2011 1 / 14 Why Learning Theory? We want to have theoretical guarantees about our
More informationLinear classifiers: Overfitting and regularization
Linear classifiers: Overfitting and regularization Emily Fox University of Washington January 25, 2017 Logistic regression recap 1 . Thus far, we focused on decision boundaries Score(x i ) = w 0 h 0 (x
More informationContents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016
ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 LECTURERS: NAMAN AGARWAL AND BRIAN BULLINS SCRIBE: KIRAN VODRAHALLI Contents 1 Introduction 1 1.1 History of Optimization.....................................
More informationUnderstanding Generalization Error: Bounds and Decompositions
CIS 520: Machine Learning Spring 2018: Lecture 11 Understanding Generalization Error: Bounds and Decompositions Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the
More informationLinear Regression (continued)
Linear Regression (continued) Professor Ameet Talwalkar Professor Ameet Talwalkar CS260 Machine Learning Algorithms February 6, 2017 1 / 39 Outline 1 Administration 2 Review of last lecture 3 Linear regression
More informationWarm up: risk prediction with logistic regression
Warm up: risk prediction with logistic regression Boss gives you a bunch of data on loans defaulting or not: {(x i,y i )} n i= x i 2 R d, y i 2 {, } You model the data as: P (Y = y x, w) = + exp( yw T
More informationOLSO. Online Learning and Stochastic Optimization. Yoram Singer August 10, Google Research
OLSO Online Learning and Stochastic Optimization Yoram Singer August 10, 2016 Google Research References Introduction to Online Convex Optimization, Elad Hazan, Princeton University Online Learning and
More informationCSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent
CSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent April 27, 2018 1 / 32 Outline 1) Moment and Nesterov s accelerated gradient descent 2) AdaGrad and RMSProp 4) Adam 5) Stochastic
More informationAd Placement Strategies
Case Study : Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD AdaGrad Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 7 th, 04 Ad
More informationIntroduction to Machine Learning (67577) Lecture 3
Introduction to Machine Learning (67577) Lecture 3 Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem General Learning Model and Bias-Complexity tradeoff Shai Shalev-Shwartz
More informationAdaptive Gradient Methods AdaGrad / Adam. Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade
Adaptive Gradient Methods AdaGrad / Adam Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade 1 Announcements: HW3 posted Dual coordinate ascent (some review of SGD and random
More informationLittlestone s Dimension and Online Learnability
Littlestone s Dimension and Online Learnability Shai Shalev-Shwartz Toyota Technological Institute at Chicago The Hebrew University Talk at UCSD workshop, February, 2009 Joint work with Shai Ben-David
More informationCase Study 1: Estimating Click Probabilities. Kakade Announcements: Project Proposals: due this Friday!
Case Study 1: Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 4, 017 1 Announcements:
More informationConvex Optimization Lecture 16
Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean
More informationKernelized Perceptron Support Vector Machines
Kernelized Perceptron Support Vector Machines Emily Fox University of Washington February 13, 2017 What is the perceptron optimizing? 1 The perceptron algorithm [Rosenblatt 58, 62] Classification setting:
More informationStochastic Approximation: Mini-Batches, Optimistic Rates and Acceleration
Stochastic Approximation: Mini-Batches, Optimistic Rates and Acceleration Nati Srebro Toyota Technological Institute at Chicago a philanthropically endowed academic computer science institute dedicated
More informationSupport Vector Machines. Machine Learning Fall 2017
Support Vector Machines Machine Learning Fall 2017 1 Where are we? Learning algorithms Decision Trees Perceptron AdaBoost 2 Where are we? Learning algorithms Decision Trees Perceptron AdaBoost Produce
More informationSupport Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017
Support Vector Machines: Training with Stochastic Gradient Descent Machine Learning Fall 2017 1 Support vector machines Training by maximizing margin The SVM objective Solving the SVM optimization problem
More informationOptimization and Gradient Descent
Optimization and Gradient Descent INFO-4604, Applied Machine Learning University of Colorado Boulder September 12, 2017 Prof. Michael Paul Prediction Functions Remember: a prediction function is the function
More informationPart of the slides are adapted from Ziko Kolter
Part of the slides are adapted from Ziko Kolter OUTLINE 1 Supervised learning: classification........................................................ 2 2 Non-linear regression/classification, overfitting,
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More informationRegression and Classification" with Linear Models" CMPSCI 383 Nov 15, 2011!
Regression and Classification" with Linear Models" CMPSCI 383 Nov 15, 2011! 1 Todayʼs topics" Learning from Examples: brief review! Univariate Linear Regression! Batch gradient descent! Stochastic gradient
More informationAlgorithmic Stability and Generalization Christoph Lampert
Algorithmic Stability and Generalization Christoph Lampert November 28, 2018 1 / 32 IST Austria (Institute of Science and Technology Austria) institute for basic research opened in 2009 located in outskirts
More informationThe Perceptron algorithm
The Perceptron algorithm Tirgul 3 November 2016 Agnostic PAC Learnability A hypothesis class H is agnostic PAC learnable if there exists a function m H : 0,1 2 N and a learning algorithm with the following
More informationGeneralization, Overfitting, and Model Selection
Generalization, Overfitting, and Model Selection Sample Complexity Results for Supervised Classification Maria-Florina (Nina) Balcan 10/03/2016 Two Core Aspects of Machine Learning Algorithm Design. How
More informationLogistic Regression Logistic
Case Study 1: Estimating Click Probabilities L2 Regularization for Logistic Regression Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 10 th,
More informationDiscriminative Models
No.5 Discriminative Models Hui Jiang Department of Electrical Engineering and Computer Science Lassonde School of Engineering York University, Toronto, Canada Outline Generative vs. Discriminative models
More informationIntroduction to Machine Learning
Introduction to Machine Learning 236756 Prof. Nir Ailon Lecture 4: Computational Complexity of Learning & Surrogate Losses Efficient PAC Learning Until now we were mostly worried about sample complexity
More informationOverparametrization for Landscape Design in Non-convex Optimization
Overparametrization for Landscape Design in Non-convex Optimization Jason D. Lee University of Southern California September 19, 2018 The State of Non-Convex Optimization Practical observation: Empirically,
More informationLecture 4: Linear predictors and the Perceptron
Lecture 4: Linear predictors and the Perceptron Introduction to Learning and Analysis of Big Data Kontorovich and Sabato (BGU) Lecture 4 1 / 34 Inductive Bias Inductive bias is critical to prevent overfitting.
More informationGeneralization bounds
Advanced Course in Machine Learning pring 200 Generalization bounds Handouts are jointly prepared by hie Mannor and hai halev-hwartz he problem of characterizing learnability is the most basic question
More informationGeneralization theory
Generalization theory Daniel Hsu Columbia TRIPODS Bootcamp 1 Motivation 2 Support vector machines X = R d, Y = { 1, +1}. Return solution ŵ R d to following optimization problem: λ min w R d 2 w 2 2 + 1
More informationCS60021: Scalable Data Mining. Large Scale Machine Learning
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 CS60021: Scalable Data Mining Large Scale Machine Learning Sourangshu Bhattacharya Example: Spam filtering Instance
More informationSample Complexity of Learning Mahalanobis Distance Metrics. Nakul Verma Janelia, HHMI
Sample Complexity of Learning Mahalanobis Distance Metrics Nakul Verma Janelia, HHMI feature 2 Mahalanobis Metric Learning Comparing observations in feature space: x 1 [sq. Euclidean dist] x 2 (all features
More informationMachine Learning for NLP
Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline
More informationApplied Machine Learning Lecture 5: Linear classifiers, continued. Richard Johansson
Applied Machine Learning Lecture 5: Linear classifiers, continued Richard Johansson overview preliminaries logistic regression training a logistic regression classifier side note: multiclass linear classifiers
More informationMidterm exam CS 189/289, Fall 2015
Midterm exam CS 189/289, Fall 2015 You have 80 minutes for the exam. Total 100 points: 1. True/False: 36 points (18 questions, 2 points each). 2. Multiple-choice questions: 24 points (8 questions, 3 points
More informationSupport Vector Machines
Support Vector Machines INFO-4604, Applied Machine Learning University of Colorado Boulder September 28, 2017 Prof. Michael Paul Today Two important concepts: Margins Kernels Large Margin Classification
More informationMachine Learning
Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University October 11, 2012 Today: Computational Learning Theory Probably Approximately Coorrect (PAC) learning theorem
More informationComputational and Statistical Learning Theory
Computational and Statistical Learning Theory TTIC 31120 Prof. Nati Srebro Lecture 8: Boosting (and Compression Schemes) Boosting the Error If we have an efficient learning algorithm that for any distribution
More informationECS171: Machine Learning
ECS171: Machine Learning Lecture 3: Linear Models I (LFD 3.2, 3.3) Cho-Jui Hsieh UC Davis Jan 17, 2018 Linear Regression (LFD 3.2) Regression Classification: Customer record Yes/No Regression: predicting
More informationMachine Learning 4771
Machine Learning 477 Instructor: Tony Jebara Topic 5 Generalization Guarantees VC-Dimension Nearest Neighbor Classification (infinite VC dimension) Structural Risk Minimization Support Vector Machines
More informationPAC-learning, VC Dimension and Margin-based Bounds
More details: General: http://www.learning-with-kernels.org/ Example of more complex bounds: http://www.research.ibm.com/people/t/tzhang/papers/jmlr02_cover.ps.gz PAC-learning, VC Dimension and Margin-based
More informationGeneralization Bounds in Machine Learning. Presented by: Afshin Rostamizadeh
Generalization Bounds in Machine Learning Presented by: Afshin Rostamizadeh Outline Introduction to generalization bounds. Examples: VC-bounds Covering Number bounds Rademacher bounds Stability bounds
More informationECE 5424: Introduction to Machine Learning
ECE 5424: Introduction to Machine Learning Topics: Ensemble Methods: Bagging, Boosting PAC Learning Readings: Murphy 16.4;; Hastie 16 Stefan Lee Virginia Tech Fighting the bias-variance tradeoff Simple
More informationMachine Learning. Lecture 9: Learning Theory. Feng Li.
Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell
More informationLecture 2 - Learning Binary & Multi-class Classifiers from Labelled Training Data
Lecture 2 - Learning Binary & Multi-class Classifiers from Labelled Training Data DD2424 March 23, 2017 Binary classification problem given labelled training data Have labelled training examples? Given
More informationIFT Lecture 7 Elements of statistical learning theory
IFT 6085 - Lecture 7 Elements of statistical learning theory This version of the notes has not yet been thoroughly checked. Please report any bugs to the scribes or instructor. Scribe(s): Brady Neal and
More informationCS260: Machine Learning Algorithms
CS260: Machine Learning Algorithms Lecture 4: Stochastic Gradient Descent Cho-Jui Hsieh UCLA Jan 16, 2019 Large-scale Problems Machine learning: usually minimizing the training loss min w { 1 N min w {
More informationCSC242: Intro to AI. Lecture 21
CSC242: Intro to AI Lecture 21 Administrivia Project 4 (homeworks 18 & 19) due Mon Apr 16 11:59PM Posters Apr 24 and 26 You need an idea! You need to present it nicely on 2-wide by 4-high landscape pages
More informationFull-information Online Learning
Introduction Expert Advice OCO LM A DA NANJING UNIVERSITY Full-information Lijun Zhang Nanjing University, China June 2, 2017 Outline Introduction Expert Advice OCO 1 Introduction Definitions Regret 2
More informationMaking Gradient Descent Optimal for Strongly Convex Stochastic Optimization
Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization Alexander Rakhlin University of Pennsylvania Ohad Shamir Microsoft Research New England Karthik Sridharan University of Pennsylvania
More informationLecture 3: Introduction to Complexity Regularization
ECE90 Spring 2007 Statistical Learning Theory Instructor: R. Nowak Lecture 3: Introduction to Complexity Regularization We ended the previous lecture with a brief discussion of overfitting. Recall that,
More informationLogistic Regression. Stochastic Gradient Descent
Tutorial 8 CPSC 340 Logistic Regression Stochastic Gradient Descent Logistic Regression Model A discriminative probabilistic model for classification e.g. spam filtering Let x R d be input and y { 1, 1}
More informationTutorial: PART 1. Online Convex Optimization, A Game- Theoretic Approach to Learning.
Tutorial: PART 1 Online Convex Optimization, A Game- Theoretic Approach to Learning http://www.cs.princeton.edu/~ehazan/tutorial/tutorial.htm Elad Hazan Princeton University Satyen Kale Yahoo Research
More informationLecture 8. Instructor: Haipeng Luo
Lecture 8 Instructor: Haipeng Luo Boosting and AdaBoost In this lecture we discuss the connection between boosting and online learning. Boosting is not only one of the most fundamental theories in machine
More informationStochastic Gradient Descent. CS 584: Big Data Analytics
Stochastic Gradient Descent CS 584: Big Data Analytics Gradient Descent Recap Simplest and extremely popular Main Idea: take a step proportional to the negative of the gradient Easy to implement Each iteration
More informationMachine Learning Tom M. Mitchell Machine Learning Department Carnegie Mellon University. September 20, 2012
Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University September 20, 2012 Today: Logistic regression Generative/Discriminative classifiers Readings: (see class website)
More informationMachine Learning
Machine Learning 10-601 Tom M. Mitchell Machine Learning Department Carnegie Mellon University October 11, 2012 Today: Computational Learning Theory Probably Approximately Coorrect (PAC) learning theorem
More informationMachine Learning Theory (CS 6783)
Machine Learning Theory (CS 6783) Tu-Th 1:25 to 2:40 PM Hollister, 306 Instructor : Karthik Sridharan ABOUT THE COURSE No exams! 5 assignments that count towards your grades (55%) One term project (40%)
More informationSTA141C: Big Data & High Performance Statistical Computing
STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied
More informationRegression with Numerical Optimization. Logistic
CSG220 Machine Learning Fall 2008 Regression with Numerical Optimization. Logistic regression Regression with Numerical Optimization. Logistic regression based on a document by Andrew Ng October 3, 204
More information