Advanced Topics in Machine Learning


1 Advanced Topics in Machine Learning
1. Learning SVMs / Primal Methods
Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany
1 / 16

2 Outline
10. Linearization of Nonlinear Kernels
2 / 16

4 Subgradient Descent

minimize
$$f(\beta, \beta_0; \mathcal{D}) := \frac{1}{|\mathcal{D}|} \sum_{(x,y) \in \mathcal{D}} [1 - y(\beta^T x + \beta_0)]_+ + \lambda \|\beta\|^2$$

subgradient
$$g_\beta(\beta, \beta_0; \mathcal{D}) := -\frac{1}{|\mathcal{D}|} \sum_{\substack{(x,y) \in \mathcal{D}:\\ y(\beta^T x + \beta_0) < 1}} y\,x + \lambda \beta,
\qquad
g_{\beta_0}(\beta, \beta_0; \mathcal{D}) := -\frac{1}{|\mathcal{D}|} \sum_{\substack{(x,y) \in \mathcal{D}:\\ y(\beta^T x + \beta_0) < 1}} y$$

2 / 16
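
To make the notation concrete, here is a minimal numpy sketch of the objective and of the hinge-part subgradient (the function names and the dense data matrix X are illustrative assumptions, not part of the original slides; the regularizer's contribution is kept separate, matching the (1 − η_t λ) shrinkage used in the pseudocode below):

```python
import numpy as np

def svm_objective(beta, beta0, X, y, lam):
    # f(beta, beta0; D) = mean hinge loss + lam * ||beta||^2
    margins = y * (X @ beta + beta0)
    return np.maximum(0.0, 1.0 - margins).mean() + lam * beta @ beta

def hinge_subgradient(beta, beta0, X, y):
    # Subgradient of the hinge-loss part only; the regularizer is folded
    # into the update as (1 - eta * lam) * beta in the descent loops below.
    viol = y * (X @ beta + beta0) < 1.0          # margin violators
    g_beta = -(y[viol] @ X[viol]) / len(y)
    g_beta0 = -y[viol].sum() / len(y)
    return g_beta, g_beta0
```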

5 Subgradient Descent

learn-linear-svm-subgradient-descent-primal(training predictors x, training targets y,
        regularization λ, accuracy ε, step lengths η_t):
    n := |x|
    β̂ := 0
    β̂₀ := 0
    t := 0
    do
        g := −(1/n) Σ_{i=1..n : y_i(β̂ᵀx_i + β̂₀) < 1} y_i x_i
        g₀ := −(1/n) Σ_{i=1..n : y_i(β̂ᵀx_i + β̂₀) < 1} y_i
        β̂ := (1 − η_t λ) β̂ − η_t g
        β̂₀ := β̂₀ − η_t g₀
        t := t + 1
    while ‖η_t g‖ ≥ ε
    return (β̂, β̂₀)

3 / 16
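
A runnable sketch of this loop, reusing hinge_subgradient from above (passing the step-length schedule as a function eta(t) is my assumption; the slide leaves the schedule abstract):

```python
def learn_linear_svm_subgradient(X, y, lam, eps, eta):
    # Batch subgradient descent on the primal SVM objective.
    beta, beta0 = np.zeros(X.shape[1]), 0.0
    t = 0
    while True:
        g_beta, g_beta0 = hinge_subgradient(beta, beta0, X, y)
        beta = (1.0 - eta(t) * lam) * beta - eta(t) * g_beta
        beta0 -= eta(t) * g_beta0
        t += 1
        if np.linalg.norm(eta(t) * g_beta) < eps:  # update has stalled
            return beta, beta0

# example usage with a hypothetical 1/(1+t) schedule:
# beta, beta0 = learn_linear_svm_subgradient(X, y, lam=0.01, eps=1e-4,
#                                            eta=lambda t: 1.0 / (1.0 + t))
```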

6 Subgradient Descent (subsample approximation)

Idea: do not use all training examples to estimate the error and the gradient, but just a subsample D⁽ᵗ⁾ ⊆ D. The subsample may vary over the steps t; in step t, approximate f(·; D) by f(·; D⁽ᵗ⁾).

Extremes:
- all samples: subgradient descent
- just a single (random) sample: stochastic subgradient descent

4 / 16

7 Stochastic Subgradient Descent

learn-linear-svm-stochastic-subgradient-descent-primal(training predictors x, training targets y,
        regularization λ, accuracy ε, step lengths η_t, stop count t₀):
    n := |x|
    β̂ := 0
    β̂₀ := 0
    t := 0
    l_{t'} := 0 for t' = 0, …, t₀ − 1
    do
        draw i randomly from {1, …, n}
        g := −δ(y_i(β̂ᵀx_i + β̂₀) < 1) · y_i x_i
        g₀ := −δ(y_i(β̂ᵀx_i + β̂₀) < 1) · y_i
        β̂ := (1 − η_t λ) β̂ − η_t g
        β̂₀ := β̂₀ − η_t g₀
        l_{t mod t₀} := ‖η_t g‖
        t := t + 1
    while Σ_{t'=0}^{t₀−1} l_{t'} ≥ ε
    return (β̂, β̂₀)

5 / 16
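
A sketch of the single-example variant, including the rolling window of the last t₀ update magnitudes that replaces the (now too noisy) per-step stopping criterion; the seed argument is my addition for reproducibility:

```python
def learn_linear_svm_sgd(X, y, lam, eps, eta, t0, seed=0):
    # Stochastic subgradient descent: one randomly drawn example per step.
    rng = np.random.default_rng(seed)
    beta, beta0 = np.zeros(X.shape[1]), 0.0
    window = np.zeros(t0)                 # magnitudes of the last t0 updates
    t = 0
    while True:
        i = rng.integers(len(y))
        viol = y[i] * (X[i] @ beta + beta0) < 1.0    # the delta indicator
        g_beta = -y[i] * X[i] if viol else np.zeros(X.shape[1])
        g_beta0 = -y[i] if viol else 0.0
        beta = (1.0 - eta(t) * lam) * beta - eta(t) * g_beta
        beta0 -= eta(t) * g_beta0
        window[t % t0] = np.linalg.norm(eta(t) * g_beta)
        t += 1
        if window.sum() < eps:            # last t0 updates jointly small
            return beta, beta0
```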

8 Subgradient Descent with Subsample Approximation

learn-linear-svm-approx-subgradient-descent-primal(training predictors x, training targets y,
        regularization λ, accuracy ε, step lengths η_t, stop count t₀, subsample size k):
    n := |x|
    β̂ := 0
    β̂₀ := 0
    t := 0
    l_{t'} := 0 for t' = 0, …, t₀ − 1
    do
        draw subset I randomly from {1, …, n} with |I| = k
        g := −(1/k) Σ_{i∈I : y_i(β̂ᵀx_i + β̂₀) < 1} y_i x_i
        g₀ := −(1/k) Σ_{i∈I : y_i(β̂ᵀx_i + β̂₀) < 1} y_i
        β̂ := (1 − η_t λ) β̂ − η_t g
        β̂₀ := β̂₀ − η_t g₀
        l_{t mod t₀} := ‖η_t g‖
        t := t + 1
    while Σ_{t'=0}^{t₀−1} l_{t'} ≥ ε
    return (β̂, β̂₀)

6 / 16
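
Relative to the previous two variants only the subgradient estimate changes; a sketch (the rng argument for sampling is a hypothetical name):

```python
def minibatch_hinge_subgradient(beta, beta0, X, y, k, rng):
    # Hinge subgradient estimated from a random subsample I with |I| = k;
    # k = 1 recovers the stochastic variant, k = n the batch variant.
    I = rng.choice(len(y), size=k, replace=False)
    XI, yI = X[I], y[I]
    viol = yI * (XI @ beta + beta0) < 1.0
    g_beta = -(yI[viol] @ XI[viol]) / k
    g_beta0 = -yI[viol].sum() / k
    return g_beta, g_beta0
```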

9 Subgradient Descent (subsample approximation)

Shalev-Shwartz, Singer, and Srebro 2007 experimented with approximations by samples of fixed size k, i.e., |D⁽ᵗ⁾| = k for all t.

[Figure 3 of Shalev-Shwartz, Singer, and Srebro 2007: the effect of k on the objective value of Pegasos on the Astro-Physics dataset. Left: T is fixed. Right: kT is fixed (kT = 10⁴, 10⁵, 10⁶).]

7 / 16

11 Maintaining Small Parameters

Lemma (Shalev-Shwartz, Singer, and Srebro 2007). The optimal β* satisfies ‖β*‖ ≤ 1/√λ.

Proof. Due to strong duality, for the optimal β*, β₀* and dual optimum α*:
$$f(\beta^*) = \frac{1}{|\mathcal{D}|} \sum_{(x,y) \in \mathcal{D}} [1 - y(\beta^{*T} x + \beta_0^*)]_+ + \lambda \|\beta^*\|^2
\overset{!}{=} f(\alpha^*) = -\frac{1}{2\lambda}\, \alpha^{*T} (XX^T \circ yy^T)\, \alpha^* + \frac{1}{|\mathcal{D}|} \|\alpha^*\|_1$$
with $\beta^* = \frac{1}{\lambda} X^T (y \circ \alpha^*)$.

8 / 16

12 Maintaining Small Parameters

Proof (ctd.). Substituting $\beta^* = \frac{1}{\lambda} X^T (y \circ \alpha^*)$, the quadratic dual term equals $-\frac{1}{2}\lambda \|\beta^*\|^2$, so
$$\frac{1}{|\mathcal{D}|} \sum_{(x,y) \in \mathcal{D}} [1 - y(\beta^{*T} x + \beta_0^*)]_+ + \lambda \|\beta^*\|^2 = -\frac{1}{2} \lambda \|\beta^*\|^2 + \frac{1}{|\mathcal{D}|} \|\alpha^*\|_1 .$$
As the hinge sum is nonnegative, dropping it only decreases the left-hand side:
$$\lambda \|\beta^*\|^2 \le \frac{3}{2} \lambda \|\beta^*\|^2 \le \frac{1}{|\mathcal{D}|} \|\alpha^*\|_1$$
and with $0 \le \alpha^* \le 1$ we have $\frac{1}{|\mathcal{D}|} \|\alpha^*\|_1 \le 1$, hence $\|\beta^*\| \le \frac{1}{\sqrt{\lambda}}$. ∎

9 / 16

13 Primal Estimated sub-gradient SOlver for SVM (PEGASOS)

Basic ideas:
- use the subsample approximation with a fixed subsample size k (but k = 1, i.e., stochastic gradient descent, turns out to be optimal)
- retain ‖β‖ ≤ 1/√λ by rescaling in each step: β := β / max(1, √λ ‖β‖)
- decrease the step size over time: η_t := 1/(λt)

10 / 16

14 Decrease Step Size Over Time

[Figure 2 of Shalev-Shwartz, Singer, and Srebro 2007: comparisons of Pegasos to Norma (left) and of Pegasos to stochastic gradient descent with a fixed learning rate (right) on the Astro-Physics dataset. In the left plot, the solid lines designate the objective value and the dashed lines depict the loss on the test set.]

11 / 16

15 Pegasos

learn-linear-svm-pegasos(training predictors x, training targets y,
        regularization λ, accuracy ε, stop count t₀, subsample size k):
    n := |x|
    β̂ := 0
    β̂₀ := 0
    t := 1
    l_{t'} := 0 for t' = 0, …, t₀ − 1
    do
        draw subset I randomly from {1, …, n} with |I| = k
        g := −(1/k) Σ_{i∈I : y_i(β̂ᵀx_i + β̂₀) < 1} y_i x_i
        g₀ := −(1/k) Σ_{i∈I : y_i(β̂ᵀx_i + β̂₀) < 1} y_i
        η_t := 1/(λt)
        β̂ := (1 − η_t λ) β̂ − η_t g
        β̂₀ := β̂₀ − η_t g₀
        β̂ := β̂ / max(1, √λ ‖β̂‖)
        l_{t mod t₀} := ‖η_t g‖
        t := t + 1
    while Σ_{t'=0}^{t₀−1} l_{t'} ≥ ε
    return (β̂, β̂₀)

12 / 16
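
Putting the pieces together, a sketch of Pegasos that reuses minibatch_hinge_subgradient from above (again, the seed argument is my addition):

```python
def learn_linear_svm_pegasos(X, y, lam, eps, t0, k, seed=0):
    # Pegasos: mini-batch subgradient steps with eta_t = 1 / (lam * t),
    # projecting beta back onto the ball of radius 1 / sqrt(lam).
    rng = np.random.default_rng(seed)
    beta, beta0 = np.zeros(X.shape[1]), 0.0
    window = np.zeros(t0)
    t = 1                            # start at 1 so eta_1 = 1/lam is finite
    while True:
        g_beta, g_beta0 = minibatch_hinge_subgradient(beta, beta0, X, y, k, rng)
        eta = 1.0 / (lam * t)
        beta = (1.0 - eta * lam) * beta - eta * g_beta
        beta0 -= eta * g_beta0
        beta /= max(1.0, np.sqrt(lam) * np.linalg.norm(beta))  # projection
        window[t % t0] = np.linalg.norm(eta * g_beta)
        t += 1
        if window.sum() < eps:
            return beta, beta0
```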

16 Comparison Dual Coordinate Descent vs. Pegasos

[Figure 1 of C. J. Hsieh et al. 2008: time versus the relative error of the objective, for L1-SVM and L2-SVM on the astro-physic, news20, and rcv1 datasets. DCDL1-S and DCDL2-S are DCDL1 and DCDL2 with shrinking.]

13 / 16

17 Comparison Dual Coordinate Descent vs. Pegasos

[Figure 2 of C. J. Hsieh et al. 2008: time versus the difference of testing accuracy between the current model and the reference model, for L1-SVM and L2-SVM on the astro-physic, news20, and rcv1 datasets.]

14 / 16

18 Outline
10. Linearization of Nonlinear Kernels
15 / 16

19 10. Linearization of Nonlinear Kernels: Basic Idea

Instead of using a nonlinear kernel, e.g., the polynomial kernel of degree d,
$$K(x, z) := (\gamma x^T z + r)^d$$
with hyperparameters d, γ, and r for data x, z ∈ ℝⁿ, use the explicit embedding, e.g., for d = 2 and r = 1:
$$\phi(x) := \left(1,\ \sqrt{2\gamma}\, x_1, \ldots, \sqrt{2\gamma}\, x_n,\ \gamma x_1^2, \ldots, \gamma x_n^2,\ \sqrt{2}\,\gamma\, x_1 x_2, \ldots, \sqrt{2}\,\gamma\, x_{n-1} x_n\right)$$
or, more simply,
$$\phi(x) := (1,\ x_1, \ldots, x_n,\ x_1^2, \ldots, x_n^2,\ x_1 x_2, \ldots, x_{n-1} x_n)$$
of dimension $\frac{(n+d)!}{n!\, d!}$.

15 / 16
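
A sketch of the explicit degree-2 embedding; one can check numerically that phi(x) @ phi(z) equals (gamma * x @ z + 1) ** 2:

```python
def poly2_features(X, gamma=1.0):
    # Explicit embedding phi for K(x, z) = (gamma * x^T z + 1)^2,
    # i.e. phi(x) @ phi(z) == K(x, z). X has shape (m, n).
    m, n = X.shape
    cols = [np.ones(m)]                                        # constant term
    cols += [np.sqrt(2 * gamma) * X[:, i] for i in range(n)]   # linear terms
    cols += [gamma * X[:, i] ** 2 for i in range(n)]           # squared terms
    cols += [np.sqrt(2) * gamma * X[:, i] * X[:, j]            # cross terms
             for i in range(n) for j in range(i + 1, n)]
    return np.stack(cols, axis=1)
```

Running any of the linear solvers above on poly2_features(X) then trains the degree-2 polynomial-kernel SVM with explicit features, which is the linearization idea evaluated by Chang et al. 2010.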

20 10. Linearization of Nonlinear Kernels: Comparison Linearized vs. Nonlinear Kernel

[Tables 4 and 5 of Y. Chang et al. 2010, on the datasets a9a, real-sim, ijcnn1, MNIST, covtype, and webspam. Table 4: comparison of linear SVM (LIBLINEAR; columns C, time in seconds, accuracy) and nonlinear SVM with RBF kernel (LIBSVM; columns C, γ, time in seconds, accuracy). Table 5: training time (in seconds) and testing accuracy of using the degree-2 polynomial mapping; the last two columns show the accuracy difference to the results using linear and RBF; NA indicates that programs do not terminate after 300,000 seconds.]

16 / 16

21 References

Chang, Yin-Wen et al. (2010). "Training and Testing Low-degree Polynomial Data Mappings via Linear SVM". In: Journal of Machine Learning Research 11.
Hsieh, C. J. et al. (2008). "A Dual Coordinate Descent Method for Large-scale Linear SVM". In: Proceedings of the 25th International Conference on Machine Learning.
Shalev-Shwartz, S., Y. Singer, and N. Srebro (2007). "Pegasos: Primal Estimated Sub-gradient Solver for SVM". In: Proceedings of the 24th International Conference on Machine Learning.
