Inverse Time Dependency in Convex Regularized Learning

1 Inverse Time Dependency in Convex Regularized Learning. Zeyuan A. Zhu (Tsinghua University), Weizhu Chen (MSRA), Chenguang Zhu (Tsinghua University), Gang Wang (MSRA), Haixun Wang (MSRA), Zheng Chen (MSRA).

2 Observation. 10K samples: 1 hour, 2.3% error. 1M samples: 1 week, 2.29% error. (Example borrowed from Shalev-Shwartz & Srebro's slides.)

3 Observation. 10K samples: 1 hour, 2.3% error. 1M samples: 1 week, 2.29% error, or 1 hour, 2.3% error. (Example borrowed from Shalev-Shwartz & Srebro's slides.)

4 Observation. 10K samples: 1 hour, 2.3% error. 1M samples: 1 week, 2.29% error, or 1 hour, 2.3% error, or 10 minutes, 2.3% error? Can we? (Example borrowed from Shalev-Shwartz & Srebro's slides.)

5 Observation. 10K samples: 1 hour, 2.3% error. 1M samples: 1 week, 2.29% error, or 1 hour, 2.3% error, or 10 minutes, 2.3% error? Can we? The runtime decreases as the number of samples increases, when the desired accuracy is fixed: Inverse Time Dependency. (Example borrowed from Shalev-Shwartz & Srebro's slides.)

6 Our Contribution. We propose a Primal Gradient Solver (PGS) and prove its inverse time dependency property. This work generalizes the state-of-the-art $\ell_2$-SVM result to the $\ell_p$-norm with convex loss functions. By bounding $S$ (the domain of $w$), PGS is able to support more loss functions, for example least square. It is the first to demonstrate that both the logistic loss and the least square loss can be adopted into PGS and achieve the inverse time dependency property.

7 Error Decomposition of err(w). Optimization error: the error due to the early stop of the optimization algorithm. Estimation error: the extra error due to the difference between the training set and the real sample distribution. Approximation error: the best achievable error for the given model. (Example borrowed from Shalev-Shwartz & Srebro's slides.)

8 Error Decomposition of err(w). Optimization error: the error due to the early stop of the optimization algorithm. Estimation error: the extra error due to the difference between the training set and the real sample distribution. Approximation error: the best achievable error for the given model. [Figure: estimation error (blue) and prediction error (green) versus the data set size n and the number of iterations of stochastic gradient descent; their sum is the desired accuracy, and the total running time follows.] (Example borrowed from Shalev-Shwartz & Srebro's slides.)

9 Error Decomposition of err(w). Optimization error: the error due to the early stop of the optimization algorithm. Estimation error: the extra error due to the difference between the training set and the real sample distribution. Approximation error: the best achievable error for the given model. [Figure as on the previous slide.] Remark: Generalization Error (formally defined later) = Optimization Error + Estimation Error. (Example borrowed from Shalev-Shwartz & Srebro's slides.)
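To make the remark precise, a standard way to write the decomposition (the notation here is my own: $\hat w$ denotes the exact minimizer of the empirical objective and $w^\ast$ the best predictor in the model class; the slides do not spell this out) is

    err(\bar w) \;=\; \underbrace{\big(\mathrm{err}(\bar w) - \mathrm{err}(\hat w)\big)}_{\text{optimization error}}
    \;+\; \underbrace{\big(\mathrm{err}(\hat w) - \mathrm{err}(w^\ast)\big)}_{\text{estimation error}}
    \;+\; \underbrace{\mathrm{err}(w^\ast)}_{\text{approximation error}},

so the generalization error in the sense used here is the sum of the first two terms.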

10 Error Decomposition. The optimization error is controlled as $\hat F_\sigma(\bar w) \le \min_w \hat F_\sigma(w) + \frac{C \log T}{\sigma T \delta}$, the generalization error as $l(\bar w) \le l(w_0) + O\big(\frac{1}{\sigma T \delta}\big) + O\big(\frac{1}{\sigma m}\big) + \frac{\sigma}{2(p-1)}\|w_0\|_p^2$, and the number of iterations needed for a target error $\varepsilon$ in err(w) is $T = O\Big(\frac{1/\delta}{2\varepsilon^2(p-1)/\|w_0\|_p^2 - O(1/m)}\Big)$ (these bounds are formalized in Theorems 1 and 2 below).

11 Convex Regularized Learning. Given a training set $\Psi = \{(x_i, y_i)\}_{i=1}^m$ with $x_i \in \mathbb{R}^n$ and $y_i \in \{-1, 1\}$, minimize $F_\sigma(w) = \frac{\sigma}{2}\|w\|^2 + \frac{1}{m}\sum_{i=1}^m \max(0,\, 1 - y_i\langle w, x_i\rangle)$.

12 Convex Regularized Learning. $\Psi = \{(x_i, y_i)\}_{i=1}^m$, $x_i \in \mathbb{R}^n$, $y_i \in \{-1, 1\}$; $F_\sigma(w) = \frac{\sigma}{2}\|w\|^2 + \frac{1}{m}\sum_{i=1}^m \max(0,\, 1 - y_i\langle w, x_i\rangle)$.

13 Convex Regularized Learning. $\Psi = \{(x_i, y_i)\}_{i=1}^m$, $x_i \in \mathbb{R}^n$, $y_i \in \{-1, 1\}$; $F_\sigma(w) = \frac{\sigma}{2}\|w\|^2 + \frac{1}{m}\sum_{i=1}^m \max(0,\, 1 - y_i\langle w, x_i\rangle)$. The first term, $\frac{\sigma}{2}\|w\|^2$, is the regularizer.

14 Convex Regularized Learning. $\Psi = \{(x_i, y_i)\}_{i=1}^m$, $x_i \in \mathbb{R}^n$, $y_i \in \{-1, 1\}$; $F_\sigma(w) = \frac{\sigma}{2}\|w\|^2 + \frac{1}{m}\sum_{i=1}^m \max(0,\, 1 - y_i\langle w, x_i\rangle)$. The second term, $\frac{1}{m}\sum_{i=1}^m \max(0,\, 1 - y_i\langle w, x_i\rangle)$, is the loss.

15 Convex Regularized Learning. Writing $\Psi = \{\theta_i\}_{i=1}^m = \{(x_i, y_i)\}_{i=1}^m$, the objective $F_\sigma(w) = \frac{\sigma}{2}\|w\|^2 + \frac{1}{m}\sum_{i=1}^m \max(0,\, 1 - y_i\langle w, x_i\rangle)$ generalizes to $F_\sigma(w) = \sigma\, r(w) + \frac{1}{m}\sum_{i=1}^m l(\langle w, x_i\rangle; \theta_i)$. The $\ell_p$-norm regularizer: $r(w) = \frac{1}{2(p-1)}\|w\|_p^2$, $p \in (1, 2]$. The SVM hinge loss: $l(\langle w, x\rangle; \theta) = \max(0,\, 1 - y\langle w, x\rangle)$. The logistic loss: $l(\langle w, x\rangle; \theta) = \log\big(1 + e^{-y\langle w, x\rangle}\big)$. The least square loss: $l(\langle w, x\rangle; \theta) = (\langle w, x\rangle - y)^2$.
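As a concrete illustration of these definitions, here is a minimal NumPy sketch of the empirical objective with the $\ell_p$ regularizer and the three losses above; the function names and arguments are my own, not from the paper.

    import numpy as np

    def lp_regularizer(w, p):
        # r(w) = ||w||_p^2 / (2 (p - 1)), strongly convex for p in (1, 2]
        norm_p = np.sum(np.abs(w) ** p) ** (1.0 / p)
        return norm_p ** 2 / (2.0 * (p - 1.0))

    def empirical_objective(w, X, y, sigma, p, loss="hinge"):
        # F_sigma(w) = sigma * r(w) + (1/m) * sum_i l(<w, x_i>; theta_i)
        scores = X @ w
        if loss == "hinge":
            losses = np.maximum(0.0, 1.0 - y * scores)
        elif loss == "logistic":
            losses = np.log1p(np.exp(-y * scores))
        elif loss == "least_square":
            losses = (scores - y) ** 2
        else:
            raise ValueError(loss)
        return sigma * lp_regularizer(w, p) + losses.mean()

For instance, empirical_objective(w, X, y, sigma=1e-4, p=1.8, loss="logistic") evaluates an $\ell_{1.8}$-regularized logistic objective of the kind used in the experiments later.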

16 Generalization. Generalization objective: $F_\sigma(w) = \sigma\, r(w) + \mathbb{E}_{\theta \sim \mathrm{Dist}}\, l(\langle w, x\rangle; \theta)$. Empirical objective: $\hat F_\sigma(w) = \sigma\, r(w) + \frac{1}{m}\sum_{i=1}^m l(\langle w, x_i\rangle; \theta_i)$.

18 Inverse Time Dependency. Generalization objective: $F_\sigma(w) = \sigma\, r(w) + \mathbb{E}_{\theta \sim \mathrm{Dist}}\, l(\langle w, x\rangle; \theta)$. Empirical objective: $\hat F_\sigma(w) = \sigma\, r(w) + \frac{1}{m}\sum_{i=1}^m l(\langle w, x_i\rangle; \theta_i)$.

19 Generalization. Generalization objective: $F_\sigma(w) = \sigma\, r(w) + \mathbb{E}_{\theta \sim \mathrm{Dist}}\, l(\langle w, x\rangle; \theta)$. Empirical objective: $\hat F_\sigma(w) = \sigma\, r(w) + \frac{1}{m}\sum_{i=1}^m l(\langle w, x_i\rangle; \theta_i)$. Optimization error: $\varepsilon_{acc}$, satisfying $\hat F_\sigma(\bar w) \le \min_w \hat F_\sigma(w) + \varepsilon_{acc}$. Generalization error: $\varepsilon$, satisfying $l(\bar w) \le l(w_0) + \varepsilon$.

20 Generalization. Theorem 1 relates the running time to the optimization error, and Theorem 2 relates the optimization error to the generalization error. Optimization error: $\varepsilon_{acc}$, satisfying $\hat F_\sigma(\bar w) \le \min_w \hat F_\sigma(w) + \varepsilon_{acc}$. Generalization error: $\varepsilon$, satisfying $l(\bar w) \le l(w_0) + \varepsilon$.

21 Thm 1 - Primal Gradient Solver.
INPUT: training sample space $\Psi = \{\theta_1, \ldots, \theta_m\}$; $p$, $\sigma$, $T$, $k$.
INITIALIZE: $w_0 \leftarrow 0$, $\lambda \leftarrow 0$, $q \leftarrow 1/(1 - 1/p)$.
FOR $t = 1, 2, \ldots, T$:
  Choose $A_t \subseteq \Psi$ satisfying $|A_t| = k$.
  Set $g_t(w) \leftarrow \frac{1}{|A_t|}\sum_{\theta \in A_t} l(\langle w, x\rangle; \theta)$.
  Choose $\lambda_t \in \partial g_t(w_{t-1})$.
  Let $\lambda \leftarrow \lambda - \lambda_t$.
  Define $w_t \leftarrow \nabla r^\ast\!\big(\frac{\lambda}{(t+1)\sigma}\big)$, where $r(w) = \frac{1}{2(p-1)}\|w\|_p^2$.
RETURN a random $w_i \in \{w_1, \ldots, w_T\}$ as the linear predictor.
We will create an algorithm based on Stochastic Gradient Descent, and then build the relationship between $\varepsilon_{acc}$ and $T$.
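Below is a minimal Python sketch of this loop for the hinge loss in the special case $p = 2$, where the link step $\nabla r^\ast(\lambda/((t+1)\sigma))$ reduces to $\lambda/((t+1)\sigma)$; the function name and the random sampling of $A_t$ are illustrative choices of mine, and the general $p$-norm link function is spelled out on the appendix slide at the end.

    import numpy as np

    def pgs_hinge_l2(X, y, sigma, T, k, seed=0):
        # Sketch of the solver loop for the hinge loss with p = 2, in which case
        # grad r*(lambda / ((t+1) sigma)) is simply lambda / ((t+1) sigma).
        rng = np.random.default_rng(seed)
        m, n = X.shape
        w = np.zeros(n)                  # w_0
        lam = np.zeros(n)                # accumulated negative subgradients
        iterates = []
        for t in range(1, T + 1):
            idx = rng.choice(m, size=k, replace=False)               # A_t with |A_t| = k
            Xb, yb = X[idx], y[idx]
            active = yb * (Xb @ w) < 1.0                             # points where the hinge is active
            lam_t = -(Xb[active] * yb[active][:, None]).sum(axis=0) / k   # subgradient of g_t at w_{t-1}
            lam -= lam_t                                             # lambda <- lambda - lambda_t
            w = lam / ((t + 1) * sigma)                              # w_t (p = 2 link step)
            iterates.append(w.copy())
        return iterates[rng.integers(T)]                             # return a random w_i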

28 Thm 1 - Primal Gradient Solver (algorithm as above). Thm 1: with probability at least $1 - \delta$ over the choices of $A_1, \ldots, A_T$ and the index $i$, $\hat F_\sigma(w_i) \le \min_w \hat F_\sigma(w) + \frac{C \log T}{\sigma T \delta}$.

29 Thm 1 - Primal Gradient Solver, analyzed via online strongly convex optimization. Thm 1: with probability at least $1 - \delta$ over the choices of $A_1, \ldots, A_T$ and the index $i$, $\hat F_\sigma(w_i) \le \min_w \hat F_\sigma(w) + \frac{C \log T}{\sigma T \delta}$.

30 Theorem 2. Thm 2: Suppose $\bar w$ is the predictor optimized by the Primal Gradient Solver. If the desired error rate $\varepsilon$ obeys $l(\bar w) \le l(w_0) + \varepsilon$ for $w_0 \in S$, then the required number of iterations satisfies $T = O\Big(\frac{1/\delta}{2\varepsilon^2(p-1)/\|w_0\|_p^2 - O(1/m)}\Big)$. Sketch of the proof, using an oracle inequality: $l(\bar w) - l(w_0) = \big[F_\sigma(\bar w) - \hat F_\sigma(\bar w)\big] + \big[\hat F_\sigma(\bar w) - F_\sigma(w_0)\big] - \frac{\sigma}{2(p-1)}\|\bar w\|_p^2 + \frac{\sigma}{2(p-1)}\|w_0\|_p^2$.

31 Theorem 2. Thm 2: Suppose $\bar w$ is the predictor optimized by the Primal Gradient Solver. If the desired error rate $\varepsilon$ obeys $l(\bar w) \le l(w_0) + \varepsilon$ for $w_0 \in S$, then the required number of iterations satisfies $T = O\Big(\frac{1/\delta}{2\varepsilon^2(p-1)/\|w_0\|_p^2 - O(1/m)}\Big)$. Sketch of the proof, using an oracle inequality: $l(\bar w) - l(w_0) = \big[F_\sigma(\bar w) - \hat F_\sigma(\bar w)\big] + \big[\hat F_\sigma(\bar w) - F_\sigma(w_0)\big] - \frac{\sigma}{2(p-1)}\|\bar w\|_p^2 + \frac{\sigma}{2(p-1)}\|w_0\|_p^2 \le O\big(\frac{1}{\sigma T \delta}\big) + O\big(\frac{1}{\sigma m}\big) + \big[\min_w \hat F_\sigma(w) - F_\sigma(w_0)\big] - \frac{\sigma}{2(p-1)}\|\bar w\|_p^2 + \frac{\sigma}{2(p-1)}\|w_0\|_p^2$, where the $O\big(\frac{1}{\sigma T \delta}\big)$ term comes from Theorem 1 and the $O\big(\frac{1}{\sigma m}\big)$ term from the oracle inequality of Karthik Sridharan, Nathan Srebro, and Shai Shalev-Shwartz, "Fast Rates for Regularized Objectives," in NIPS, 2008.

32 Theorem 2. Thm 2: Suppose $\bar w$ is the predictor optimized by the Primal Gradient Solver. If the desired error rate $\varepsilon$ obeys $l(\bar w) \le l(w_0) + \varepsilon$ for $w_0 \in S$, then the required number of iterations satisfies $T = O\Big(\frac{1/\delta}{2\varepsilon^2(p-1)/\|w_0\|_p^2 - O(1/m)}\Big)$. Sketch of the proof, using an oracle inequality: $l(\bar w) - l(w_0) = \big[F_\sigma(\bar w) - \hat F_\sigma(\bar w)\big] + \big[\hat F_\sigma(\bar w) - F_\sigma(w_0)\big] - \frac{\sigma}{2(p-1)}\|\bar w\|_p^2 + \frac{\sigma}{2(p-1)}\|w_0\|_p^2$; the term $-\frac{\sigma}{2(p-1)}\|\bar w\|_p^2$ is non-positive and can be dropped.

33 Theorem 2. Thm 2: Suppose $\bar w$ is the predictor optimized by the Primal Gradient Solver. If the desired error rate $\varepsilon$ obeys $l(\bar w) \le l(w_0) + \varepsilon$ for $w_0 \in S$, then the required number of iterations satisfies $T = O\Big(\frac{1/\delta}{2\varepsilon^2(p-1)/\|w_0\|_p^2 - O(1/m)}\Big)$. Sketch of the proof, using an oracle inequality: $l(\bar w) - l(w_0) = \big[F_\sigma(\bar w) - \hat F_\sigma(\bar w)\big] + \big[\hat F_\sigma(\bar w) - F_\sigma(w_0)\big] - \frac{\sigma}{2(p-1)}\|\bar w\|_p^2 + \frac{\sigma}{2(p-1)}\|w_0\|_p^2$, which leads to $l(\bar w) - l(w_0) \le O\big(\frac{1}{\sigma T \delta}\big) + O\big(\frac{1}{\sigma m}\big) + \frac{\sigma}{2(p-1)}\|w_0\|_p^2$.
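One way to read the iteration bound of Theorem 2 off this last inequality (the particular choice of $\sigma$ below is my own illustration of the argument, not taken verbatim from the slides): choose $\sigma = \varepsilon(p-1)/\|w_0\|_p^2$, so that the last term equals $\varepsilon/2$, and require the two remaining terms to total at most $\varepsilon/2$. This gives

    \frac{1}{T\delta} \;\le\; O\!\left(\frac{\varepsilon^2 (p-1)}{\|w_0\|_p^2}\right) - O\!\left(\frac{1}{m}\right)
    \quad\Longrightarrow\quad
    T \;=\; O\!\left(\frac{1/\delta}{\,2\varepsilon^2(p-1)/\|w_0\|_p^2 - O(1/m)\,}\right),

which, for a fixed target $\varepsilon$, decreases as the sample size $m$ grows: the inverse time dependency.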

34 Error Decomposition. Theorem 1: $\hat F_\sigma(\bar w) \le \min_w \hat F_\sigma(w) + \frac{C \log T}{\sigma T \delta}$. Theorem 2: $l(\bar w) \le l(w_0) + O\big(\frac{1}{\sigma T \delta}\big) + O\big(\frac{1}{\sigma m}\big) + \frac{\sigma}{2(p-1)}\|w_0\|_p^2$, hence $T = O\Big(\frac{1/\delta}{2\varepsilon^2(p-1)/\|w_0\|_p^2 - O(1/m)}\Big)$. Recall the definitions: $\delta$ is the confidence parameter, $\varepsilon$ the desired generalization error, $p$ the exponent of the $p$-norm regularizer, $w_0$ the optimal predictor, and $m$ the number of samples.
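To see the inverse time dependency numerically, the short sketch below plugs illustrative values into the bound on $T$; every constant hidden by the $O(\cdot)$ notation is set to 1, which is an assumption made only for this demonstration.

    def required_iterations(eps, delta, p, w0_norm, m):
        # T = O( (1/delta) / ( 2 eps^2 (p-1) / ||w_0||_p^2 - O(1/m) ) ),
        # with all hidden constants set to 1 purely for illustration.
        denom = 2.0 * eps ** 2 * (p - 1.0) / w0_norm ** 2 - 1.0 / m
        if denom <= 0.0:
            raise ValueError("m is too small for this target accuracy eps")
        return (1.0 / delta) / denom

    for m in (10_000, 100_000, 1_000_000):
        print(m, round(required_iterations(eps=0.02, delta=0.1, p=2.0, w0_norm=1.0, m=m)))

The printed iteration count shrinks as m grows, while the target accuracy stays fixed.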

35 Experimental Results. Accuracy of PGS on the CCAT dataset, in comparison with the best achievable accuracy by Quasi-Newton (QN):

Regularizer | Loss                | QN Accuracy (2 hours) | PGS Accuracy | PGS Training Time
l2          | Logistic Regression |                       | ±            | sec
l1.8        | Logistic Regression |                       | ±            | sec
l2          | Least Square        |                       | ±            | sec

Speed of PGS on the CCAT dataset: to achieve an accuracy of 94%, PGS takes 10 seconds for p = 2 and 20 seconds for p = 1.8, while Quasi-Newton takes 600 seconds for both.

36 Experimental Results.

37 Experimental Results.

38 Further Discussion. $p$-norm with $p \in (1, 2]$? Non-linear? Kernel? Welcome to my talk on Tuesday, 2-4 PM: P-packSVM: Parallel Primal gradient descent Kernel SVM. Other applications?

39 Conclusion. A fast Primal Gradient Solver for $\ell_p$-norm regularized convex learning. Generalization error = Optimization error + Estimation error. The running time is inversely dependent on the input data size (inverse time dependency).

40 Thanks. Questions? Acknowledgment: Shai Shalev-Shwartz from the Hebrew University.

41 Thm 1 - Primal Gradient Solver: explicit calculation of $w_t = \nabla r^\ast\!\big(\frac{\lambda}{(t+1)\sigma}\big)$. We use the superscript $(j)$ to denote the $j$-th coordinate of a vector.
1. INPUT: $\lambda$, $p$, $S$. Let $n$ be the feature dimension and $q = 1/(1 - 1/p)$.
2. FOR $j = 1, 2, \ldots, n$:
3.   $w_t^{(j)} \leftarrow \frac{1}{q-1} \Big\|\frac{\lambda}{(t+1)\sigma}\Big\|_q^{2-q} \Big|\frac{\lambda^{(j)}}{(t+1)\sigma}\Big|^{q-1} \operatorname{sgn}\big(\lambda^{(j)}\big)$
4. IF $S = \mathbb{R}^n$, RETURN $w_t$.
5. IF $S = \{w : \|w\|_p \le B\}$:
6.   IF $\|w_t\|_p > B$ THEN $w_t \leftarrow \frac{B}{\|w_t\|_p}\, w_t$.
7. RETURN $w_t$.
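The Python sketch below mirrors this computation: the $p$-norm link function followed by the optional rescaling onto the $\ell_p$ ball of radius $B$. The helper name pnorm_link is my own, and the vectorized form is an illustration of the coordinate-wise steps above.

    import numpy as np

    def pnorm_link(lam, t, sigma, p, B=None):
        # w_t = grad r*( lam / ((t+1) sigma) ) for r(w) = ||w||_p^2 / (2 (p-1)),
        # computed via the dual exponent q = 1 / (1 - 1/p); optionally rescale
        # onto S = { w : ||w||_p <= B }.
        q = 1.0 / (1.0 - 1.0 / p)
        v = lam / ((t + 1) * sigma)
        v_q = np.sum(np.abs(v) ** q) ** (1.0 / q)        # ||v||_q
        if v_q == 0.0:
            return np.zeros_like(v)
        w = (np.abs(v) ** (q - 1.0)) * np.sign(v) * v_q ** (2.0 - q) / (q - 1.0)
        if B is not None:                                 # rescaling step for a bounded S
            w_p = np.sum(np.abs(w) ** p) ** (1.0 / p)
            if w_p > B:
                w *= B / w_p
        return w

For p = 2 this reduces to w = lam / ((t + 1) * sigma), matching the special case used in the earlier sketch.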
