Pairwise Away Steps for the Frank-Wolfe Algorithm

Héctor Allende, Department of Informatics, Universidad Federico Santa María, Chile
Ricardo Ñanculef, Department of Informatics, Universidad Federico Santa María, Chile
Emanuele Frandi, Department of Science and High Technology, University of Insubria, Italy
Claudio Sartori, Department of Computer Science and Engineering, University of Bologna, Italy

Abstract

Recently, there has been renewed interest in the machine learning community in variants of a sparse greedy approximation algorithm for concave optimization known as the Frank-Wolfe (FW) method. This algorithm has been successfully applied, for example, to train large-scale instances of non-linear Support Vector Machines (SVMs). In this paper, we investigate an improvement of the FW method based on a new way to perform away steps, a classic strategy used to guarantee the linear convergence of the original FW procedure. On the theoretical side, we present some results about the convergence rate of the algorithm. On the practical side, we assess the performance of the method on several SVM problems. We conclude that our method is faster than traditional FW methods most of the time, and works well even in cases where standard away steps slow down the algorithm.

1 Introduction

Consider the following optimization problem:

    maximize_α  g(α)   subject to   α ∈ S := { α ∈ R^m : Σ_i α_i = 1, α_i ≥ 0 },   (1)

where g is concave but not necessarily strongly or strictly concave. This problem encompasses several models used in machine learning [4, 9]. The FW method computes a sequence of approximations α_1, α_2, ..., α_k to a solution of problem (1) by iterating two simple steps until convergence: it first finds the largest coordinate of the gradient, i.e. i* ∈ argmax_i ∇g(α_k)_i, and then moves the current solution along the direction d_k^FW = (e_{i*} − α_k), seeking the best feasible improvement of the objective function [4, 11, 16]. It can be shown that the convergence of this method can be boosted by considering a slight variant [2, 6, 8]: instead of always moving α_k towards the ascent vertex e_{i*}, we can consider moving α_k away from the (descent) vertex e_{j*}, with j* ∈ argmin_{j: α_j > 0} ∇g(α_k)_j. This variant, known as the Modified Frank-Wolfe (MFW) method, is summarized in Algorithm 1. It has been shown that MFW asymptotically exhibits linear convergence to the solution of problem (1) under some assumptions on the form of the objective function and the feasible set [1, 2, 8, 17]. In addition, the MFW algorithm has the potential to compute sparser solutions in practice, since in contrast to the FW method it allows setting a coordinate of α_k to zero at each step. However, MFW often fails to improve the running times of FW and is sometimes indeed slower [2, 12, 17].
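
To fix ideas, the following is a minimal sketch of the plain FW loop just described, written for a generic concave objective supplied as callables g and grad_g that return the objective value and the gradient as a NumPy array; the function names and the simple grid line search are illustrative choices, not part of the paper.

```python
import numpy as np

def frank_wolfe_simplex(g, grad_g, alpha0, max_iter=1000, eps=1e-6):
    """Sketch of the classical FW loop on the unit simplex for maximizing a
    concave g. The line search is a plain grid search for simplicity."""
    alpha = np.asarray(alpha0, dtype=float).copy()
    grid = np.linspace(0.0, 1.0, 101)
    for _ in range(max_iter):
        grad = grad_g(alpha)
        i_star = int(np.argmax(grad))          # ascent vertex e_{i*}
        # primal-dual gap: max_i grad_i - alpha^T grad
        if grad[i_star] - alpha @ grad <= eps:
            break
        d_fw = -alpha.copy()
        d_fw[i_star] += 1.0                    # d^FW = e_{i*} - alpha
        lam = max(grid, key=lambda l: g(alpha + l * d_fw))
        alpha = alpha + lam * d_fw             # toward (FW) step
    return alpha
```

For quadratic objectives such as the SVM duals considered in Section 4, the line search admits a closed form, but the structure of the iteration is the same.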

Algorithm 1: MFW method for problem (1).
1. Compute an initial estimate α_0.
2. Set I_0 = {i : α_{0,i} ≠ 0}.
3. for k = 0, 1, ... do
4.   Search for i* ∈ argmax_i ∇g(α_k)_i and define d_k^FW = e_{i*} − α_k.
5.   Search for j* ∈ argmin_{j ∈ I_k} ∇g(α_k)_j and define d_k^A = α_k − e_{j*}.
6.   if ∇g(α_k)^T d_k^FW ≥ ∇g(α_k)^T d_k^A then
7.     Perform a line-search to find λ_fw ∈ argmax_{λ ∈ [0,1]} g(α_k + λ d_k^FW).
8.     Perform the FW step α_{k+1} = α_k + λ_fw (e_{i*} − α_k).
9.     Update I_k by I_{k+1} = I_k ∪ {i*}.
10.  else
11.    Perform a line-search to find λ_away ∈ argmax_{λ ∈ [0,1]} g(α_k + λ d_k^A).
12.    Clip the line-search parameter: λ_away = min(λ_away, α_{k,j*}/(1 − α_{k,j*})).
13.    Perform the AWAY step α_{k+1} = α_k + λ_away (α_k − e_{j*}).
14.    If λ_away = α_{k,j*}/(1 − α_{k,j*}), then I_{k+1} = I_k \ {j*}.

2 New Away Steps for FW

Here we define a new type of away step. At each iteration, we find the ascent vertex e_{i*} and the away vertex e_{j*}, where i* ∈ argmax_i ∇g(α_k)_i and j* ∈ argmin_{j: α_j > 0} ∇g(α_k)_j, just as in the MFW method. However, instead of considering the update α_{k+1} = α_k + λ (α_k − e_{j*}), we propose a pairwise away step of the form

    α_{k+1} = α_k + λ (e_{i*} − e_{j*}),   (2)

where λ is determined by a line-search. That is, instead of exploring the away direction d_k^MFW = (α_k − e_{j*}), our algorithm explores the direction d_k^SWAP = (e_{i*} − e_{j*}). The method, referred to as the SWAP method, is summarized in Algorithm 2 and depicted in Fig. 1. Note that this away step allows the algorithm to move away from e_{j*} and toward e_{i*} in the same iteration (see the code sketch following Algorithm 2).

Algorithm 2: The SWAP method for problem (1).
1. Set k = 0.
2. Compute an initial estimate α_0 and set I_0 = {i : α_{0,i} ≠ 0}.
3. for k = 0, 1, ... do
4.   Search for i* ∈ argmax_i ∇g(α_k)_i (ascent direction).
5.   Search for j* ∈ argmin_{j: α_{k,j} ≥ ε_w} ∇g(α_k)_j (descent direction).
6.   Perform a line-search to find λ_swap ∈ argmax_{λ ∈ [0, α_{k,j*}]} g(α_k + λ(e_{i*} − e_{j*})).
7.   Perform a line-search to find λ_fw ∈ argmax_{λ ∈ [0,1]} g(α_k + λ(e_{i*} − α_k)).
8.   Compute δ_swap = g(α_k + λ_swap(e_{i*} − e_{j*})) − g(α_k) (improvement of a SWAP step).
9.   Compute δ_fw = g(α_k + λ_fw(e_{i*} − α_k)) − g(α_k) (improvement of a toward step).
10.  Compute δ_k = max(δ_swap, δ_fw) (the best improvement).
11.  if δ_k = δ_swap then
12.    If λ_swap = α_{k,j*}, mark the iteration as a SWAP-drop step.
13.    Otherwise, mark the iteration as a SWAP-add step.
14.    Perform the SWAP step α_{k+1} = α_k + λ_swap(e_{i*} − e_{j*}).
15.    Set I_{k+1} = I_k ∪ {i*}.
16.    If a SWAP-drop step was performed, I_{k+1} = I_{k+1} \ {j*}.
17.  else
18.    Mark the iteration as a FW step.
19.    Perform the FW step α_{k+1} = α_k + λ_fw(e_{i*} − α_k).
20.    Set I_{k+1} = I_k ∪ {i*}.
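
As an illustration only (not the paper's implementation), one iteration of Algorithm 2 can be sketched as follows, using the same g/grad_g callables as before, a Python set I for the support, and a grid line search standing in for the exact one; eps_w plays the role of the threshold ε_w in step 5.

```python
import numpy as np

def swap_iteration(g, grad_g, alpha, I, eps_w=1e-10):
    """Sketch of one iteration of Algorithm 2: compare a toward (FW) step with
    a pairwise SWAP step and apply whichever improves g more."""
    grid = np.linspace(0.0, 1.0, 101)
    grad = grad_g(alpha)
    i_star = int(np.argmax(grad))                                # ascent vertex e_{i*}
    active = [j for j in range(len(alpha)) if alpha[j] >= eps_w]
    j_star = min(active, key=lambda j: grad[j])                  # descent vertex e_{j*}

    e_i = np.eye(len(alpha))[i_star]
    e_j = np.eye(len(alpha))[j_star]

    # line search for the SWAP direction over [0, alpha_{k,j*}]
    lam_swap = max(grid * alpha[j_star], key=lambda l: g(alpha + l * (e_i - e_j)))
    # line search for the toward (FW) direction over [0, 1]
    lam_fw = max(grid, key=lambda l: g(alpha + l * (e_i - alpha)))

    delta_swap = g(alpha + lam_swap * (e_i - e_j)) - g(alpha)
    delta_fw = g(alpha + lam_fw * (e_i - alpha)) - g(alpha)

    if delta_swap >= delta_fw:                                   # SWAP step
        drop = np.isclose(lam_swap, alpha[j_star])               # SWAP-drop vs SWAP-add
        alpha = alpha + lam_swap * (e_i - e_j)
        I = I | {i_star}
        if drop:                                                 # j* leaves the support
            I = I - {j_star}
    else:                                                        # toward (FW) step
        alpha = alpha + lam_fw * (e_i - alpha)
        I = I | {i_star}
    return alpha, I
```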

Figure 1: A sketch of the search directions used by the FW, MFW and SWAP methods.

3 Convergence Analysis

Here we state some results about the convergence of the proposed method. Proofs are omitted due to space constraints. Note that the stopping criterion in Proposition 3, d*(α_k) = max_i ∇g(α_k)_i − α_k^T ∇g(α_k) ≤ ε, corresponds to a primal-dual measure of optimality, as considered in [2, 4, 11] (a code sketch of its evaluation is given at the end of this section).

Proposition 1 (Global Convergence). Suppose ∇g is Lipschitz-continuous on the feasible set. Then, Algorithm 2 produces a sequence of iterates {α_k}_k such that g(α_k) converges to g(α*), where α* is a solution of problem (1). If α* is unique, {α_k}_k converges to α*.

Proposition 2 (Linear Convergence). Suppose g is twice continuously differentiable and that there is a solution α* of (1) satisfying the strong sufficient condition of Robinson in [13]. Then, for sufficiently large k, any iteration marked as SWAP-add or FW in Algorithm 2 produces an iterate α_{k+1} satisfying

    g(α*) − g(α_{k+1}) ≤ (1 − 1/M) (g(α*) − g(α_k))   (3)

for some constant M > 1.

Proposition 3. Suppose g is twice continuously differentiable and suppose we stop the algorithm using the stopping condition d*(α_k) = max_i ∇g(α_k)_i − α_k^T ∇g(α_k) ≤ ε for some ε ∈ (0, 1). Let K be the number of unclipped iterations performed by Algorithm 2. Then

    K ≤ Q + M/ε,   (4)

where Q, M are constants independent of m and ε.
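
For reference, the stopping measure above is cheap to evaluate once the gradient is available; a minimal helper (illustrative naming) is:

```python
import numpy as np

def primal_dual_gap(grad, alpha):
    """Stopping measure of Proposition 3: d*(alpha) = max_i grad_i - alpha^T grad."""
    return float(np.max(grad) - alpha @ grad)

# Example stopping test, with eps = 1e-6 as in the experiments of Section 4:
# if primal_dual_gap(grad_g(alpha_k), alpha_k) <= 1e-6: stop.
```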

4 Experiments

In this section, we evaluate the performance of the proposed method, specialized to the problem of training L2 Support Vector Machines [5, 6, 15]. We present experiments performed on well-known classification data available in several public repositories [3, 7]. To give an idea of the size of each problem, we specify the size m of the training set and the number of classes K. For multi-category classification problems, we adopt a one-versus-one (OVO) approach [10]. For the initialization of all the methods, that is, the computation of a starting solution α_0, we adopted the method proposed in [15]: the starting solution is obtained by solving problem (1) on a random subset of p training patterns, and the coordinates of α_0 corresponding to the other data points are set to zero. We used p = 20 points for initialization and the stopping criterion of Proposition 3 with ε = 10^−6 for all the algorithms. In all the experiments presented in this paper, SVMs were trained using an RBF (Gaussian) kernel k(x_1, x_2) = exp(−‖x_1 − x_2‖²/2σ²) with scale parameter σ². The parameter σ² was determined using the default method employed in [15], i.e. it was set to the average squared distance among training patterns. The parameter C was determined on the logarithmic grid [2^0, 2^12] using a validation set consisting of a randomly selected 30% of the training set. We also adopted the LRR caching strategy designed in [14] to avoid recomputing recently used kernel values, and the probabilistic speed-up described in [14] to accelerate the search for i*. A code sketch of this setup is given after Table 2.

Table 1: Running times (in seconds) of the FW, MFW and SWAP methods on the classification problems Adult a1a, Adult a5a, Adult a8a, Web w1a, Web w5a, Web w8a, Protein, Usps-Ext, Shuttle and Kdd-10pc (columns: Dataset, K, m, FW, MFW, SWAP).

Table 2: Testing accuracies of the FW, MFW and SWAP methods on the same classification problems (columns: Dataset, K, m, FW, MFW, SWAP).
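
For concreteness, the sketch below (an illustrative reconstruction, not the code used for the experiments) shows one common way, following the CVM formulation of [15], to cast L2-SVM training as an instance of problem (1): the dual objective is g(α) = −α^T K̃ α over the unit simplex, where K̃_ij = y_i y_j k(x_i, x_j) + y_i y_j + δ_ij/C and k is the RBF kernel with the default σ² described above. Function names are ours.

```python
import numpy as np

def rbf_kernel_matrix(X):
    """RBF kernel with the default scale used in the experiments:
    sigma^2 = average squared distance among training patterns."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
    sigma2 = d2.mean()                                # default sigma^2
    return np.exp(-d2 / (2.0 * sigma2))

def l2_svm_problem(X, y, C):
    """Concave objective g and gradient grad_g for the L2-SVM dual in the form
    of problem (1), using the CVM-style modified kernel (illustrative sketch)."""
    K_tilde = np.outer(y, y) * (rbf_kernel_matrix(X) + 1.0) + np.eye(len(y)) / C
    g = lambda a: -float(a @ K_tilde @ a)
    grad_g = lambda a: -2.0 * K_tilde @ a
    return g, grad_g
```

The returned callables can be plugged directly into the FW and SWAP sketches given earlier; the experiments additionally rely on the caching and sampling speed-ups of [14], which are omitted here.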

5 Conclusions

We presented a variant of the FW method for the general problem of maximizing a concave function on the unit simplex, introducing a novel way to perform away steps. On the theoretical side, we showed that the method converges globally and that the unclipped optimization steps provide a linear rate of convergence. We also showed that the method achieves a primal-dual gap d*(α_k) = max_i ∇g(α_k)_i − α_k^T ∇g(α_k) [4, 11] lower than a given tolerance ε in O(1/ε) unclipped iterations, independently of m, the dimensionality of the feasible space and the number of examples in SVM problems. On the experimental side, we showed that the proposed method was faster than the FW and MFW methods on several SVM training problems. We observed that the SWAP method performed better than FW and MFW in those cases in which classic away steps effectively boost the convergence of the FW method, and proved to be a robust alternative to MFW in the cases where classic away steps failed.

References

[1] S. Damla Ahipasaoglu, Peng Sun, and Michael J. Todd. Linear convergence of a modified Frank-Wolfe algorithm for computing minimum volume enclosing ellipsoids. Optimization Methods and Software, 23(1):5–19, 2008.
[2] Héctor Allende, Emanuele Frandi, Ricardo Ñanculef, and Claudio Sartori. Novel Frank-Wolfe methods for SVM learning. Technical report.
[3] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines.
[4] Kenneth Clarkson. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. In Proceedings of SODA '08. SIAM, 2008.
[5] Emanuele Frandi, Maria Grazia Gasparo, Stefano Lodi, Ricardo Ñanculef, and Claudio Sartori. A new algorithm for training SVMs using approximate minimal enclosing balls. In Proceedings of the 15th Iberoamerican Congress on Pattern Recognition, Lecture Notes in Computer Science. Springer.
[6] Emanuele Frandi, Maria Grazia Gasparo, Stefano Lodi, Ricardo Ñanculef, and Claudio Sartori. Training support vector machines using Frank-Wolfe methods. International Journal of Pattern Recognition and Artificial Intelligence, 27(3).
[7] Andrew Frank and Arthur Asuncion. The UCI KDD Archive.
[8] Jacques Guélat and Patrice Marcotte. Some comments on Wolfe's away step. Mathematical Programming, 35, 1986.
[9] Bernd Gärtner and Martin Jaggi. Coresets for polytope distance. In John Hershberger and Efi Fogel, editors, Symposium on Computational Geometry. ACM.
[10] Thomas Hofmann, Bernhard Schölkopf, and Alexander Smola. Kernel methods in machine learning. Annals of Statistics, 36(3), 2008.
[11] Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In International Conference on Machine Learning (to appear).
[12] Hua Ouyang and Alexander Gray. Fast stochastic Frank-Wolfe algorithms for nonlinear SVMs. In SDM.
[13] Stephen Robinson. Generalized equations and their solutions, part II: Applications to nonlinear programming. In Optimality and Stability in Mathematical Programming, volume 19 of Mathematical Programming Studies. Springer Berlin Heidelberg, 1982.
[14] Ivor Tsang, Andras Kocsor, and James Kwok. LibCVM Toolkit.
[15] Ivor Tsang, James Kwok, and Pak-Ming Cheung. Core vector machines: Fast SVM training on very large data sets. Journal of Machine Learning Research, 6:363–392, 2005.
[16] Philip Wolfe. Convergence theory in nonlinear programming. In J. Abadie, editor, Integer and Nonlinear Programming. North-Holland, Amsterdam, 1970.
[17] Emre Alper Yildirim. Two algorithms for the minimum enclosing ball problem. SIAM Journal on Optimization, 19(3), 2008.
