Pairwise Away Steps for the Frank-Wolfe Algorithm
Héctor Allende
Department of Informatics, Universidad Federico Santa María, Chile

Ricardo Ñanculef
Department of Informatics, Universidad Federico Santa María, Chile

Emanuele Frandi
Department of Science and High Technology, University of Insubria, Italy

Claudio Sartori
Department of Computer Science and Engineering, University of Bologna, Italy

Abstract

Recently, there has been renewed interest in the machine learning community in variants of a sparse greedy approximation algorithm for concave optimization known as the Frank-Wolfe (FW) method. This algorithm has been successfully applied, for example, to train large-scale instances of non-linear Support Vector Machines (SVMs). In this paper, we investigate an improvement of the FW method based on a new way to perform away steps, a classic strategy used to guarantee the linear convergence of the original FW procedure. On the theoretical side, we present some results about the convergence rate of the algorithm. On the practical side, we assess the performance of the method on several SVM problems. We conclude that our method is in most cases faster than traditional FW methods, and works well even in the cases where standard away steps slow down the algorithm.

1 Introduction

Consider the following optimization problem:

$$\max_{\alpha} \; g(\alpha) \quad \text{subject to} \quad \alpha \in S := \Big\{ \alpha \in \mathbb{R}^m : \sum_i \alpha_i = 1, \ \alpha_i \ge 0 \Big\}, \tag{1}$$

where $g$ is concave but not necessarily strongly or strictly concave. This problem encompasses several models used in machine learning [4, 9].

The FW method computes a sequence of approximations $\alpha_0, \alpha_1, \ldots, \alpha_k$ to a solution of problem (1) by iterating two simple steps until convergence. It first finds the largest coordinate of the gradient, i.e. $i^* \in \arg\max_i \nabla g(\alpha_k)_i$, and then it moves the current solution in the direction $d_k^{FW} = (e_{i^*} - \alpha_k)$, where $e_i$ denotes the $i$-th vertex of the simplex $S$, seeking the best feasible improvement of the objective function [4, 11, 16].
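The two steps of a plain FW iteration can be sketched in a few lines. The block below is a minimal illustration, not the authors' implementation: it assumes the concave quadratic objective $g(\alpha) = -\alpha^T K \alpha$ with a symmetric positive semidefinite matrix `K` (a hypothetical choice that makes the line search available in closed form), and the helper names `matvec`, `dot`, `grad`, `line_search` and `fw_step` are invented for this example.

```python
def matvec(K, v):
    """Product of a square matrix (given as a list of rows) with a vector."""
    return [sum(k * x for k, x in zip(row, v)) for row in K]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def grad(K, alpha):
    """Gradient of the assumed objective g(alpha) = -alpha^T K alpha (K symmetric)."""
    return [-2.0 * x for x in matvec(K, alpha)]


def line_search(K, alpha, d, lam_max):
    """Maximizer of g(alpha + lam * d) over [0, lam_max] for the assumed quadratic g."""
    slope = dot(grad(K, alpha), d)          # derivative at lam = 0
    curvature = 2.0 * dot(d, matvec(K, d))  # minus the second derivative
    if slope <= 0.0:
        return 0.0        # d is not an ascent direction
    if curvature <= 0.0:
        return lam_max    # g is linear along d: go as far as allowed
    return min(lam_max, slope / curvature)


def fw_step(K, alpha):
    """One FW iteration: pick the ascent vertex e_{i*}, then move toward it."""
    g = grad(K, alpha)
    i_star = max(range(len(alpha)), key=lambda i: g[i])
    d = [(1.0 if i == i_star else 0.0) - a for i, a in enumerate(alpha)]
    lam = line_search(K, alpha, d, 1.0)
    new_alpha = [a + lam * di for a, di in zip(alpha, d)]
    return new_alpha, i_star, lam
```

Because the update is a convex combination of $\alpha_k$ and a vertex, the new iterate stays on the simplex without any projection. For instance, `fw_step([[2.0, 0.0], [0.0, 1.0]], [1.0, 0.0])` moves from the first vertex toward the second one.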
It can be shown that the convergence of this method can be improved by considering a slight variant [2, 6, 8]: instead of always moving $\alpha_k$ towards the ascent vertex $e_{i^*}$, we can also consider moving $\alpha_k$ away from the descent vertex $e_{j^*}$, where $j^* \in \arg\min_{j : \alpha_{k,j} > 0} \nabla g(\alpha_k)_j$. This variant, known as the Modified Frank-Wolfe (MFW) method, is summarized in Algorithm 1. It has been shown that MFW asymptotically exhibits linear convergence to the solution of problem (1) under some assumptions on the form of the objective function and the feasible set [1, 2, 8, 17]. In addition, the MFW algorithm has the potential to compute sparser solutions in practice since, in contrast to the FW method, it allows setting a coordinate of $\alpha_k$ to zero at each step. However, MFW often fails to improve on the running time of FW and is sometimes slower [2, 12, 17].
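The away step described above can be illustrated with a short sketch. As before, this is only a hedged example under stated assumptions: the objective is the hypothetical concave quadratic $g(\alpha) = -\alpha^T K \alpha$ with symmetric positive semidefinite `K`, and all function names are invented for illustration. Moving along $\alpha_k - e_{j^*}$ with step $\lambda$ changes the $j^*$-th coordinate to $\alpha_{k,j^*} + \lambda(\alpha_{k,j^*} - 1)$, so feasibility requires $\lambda \le \alpha_{k,j^*}/(1 - \alpha_{k,j^*})$; when that bound is attained, the coordinate becomes exactly zero.

```python
def matvec(K, v):
    return [sum(k * x for k, x in zip(row, v)) for row in K]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def grad(K, alpha):
    """Gradient of the assumed objective g(alpha) = -alpha^T K alpha (K symmetric)."""
    return [-2.0 * x for x in matvec(K, alpha)]


def line_search(K, alpha, d, lam_max):
    """Maximizer of g(alpha + lam * d) over [0, lam_max] for the assumed quadratic g."""
    slope = dot(grad(K, alpha), d)
    curvature = 2.0 * dot(d, matvec(K, d))
    if slope <= 0.0:
        return 0.0
    if curvature <= 0.0:
        return lam_max
    return min(lam_max, slope / curvature)


def away_step(K, alpha):
    """Move away from the worst active vertex e_{j*}; report whether j* was dropped."""
    g = grad(K, alpha)
    support = [j for j, a in enumerate(alpha) if a > 0.0]
    j_star = min(support, key=lambda j: g[j])
    a_j = alpha[j_star]
    if a_j >= 1.0:
        # alpha is the vertex e_{j*} itself, so the away direction is zero.
        return list(alpha), j_star, False
    d = [a - (1.0 if j == j_star else 0.0) for j, a in enumerate(alpha)]
    lam_max = a_j / (1.0 - a_j)  # largest step keeping alpha[j_star] >= 0
    # Search over [0, 1], then clip to the feasibility bound.
    lam = min(line_search(K, alpha, d, 1.0), lam_max)
    new_alpha = [a + lam * dj for a, dj in zip(alpha, d)]
    dropped = lam == lam_max
    if dropped:
        new_alpha[j_star] = 0.0  # remove round-off on the dropped coordinate
    return new_alpha, j_star, dropped
```

The returned flag distinguishes a clipped step, which removes $j^*$ from the active set, from an unclipped one, which leaves the active set unchanged.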
Algorithm 1: MFW method for problem (1).
 1  Compute an initial estimate $\alpha_0$.
 2  Set $I_0 = \{ i : \alpha_{0,i} > 0 \}$.
 3  for $k = 0, 1, \ldots$ do
 4      Search for $i^* \in \arg\max_i \nabla g(\alpha_k)_i$ and define $d_k^{FW} = e_{i^*} - \alpha_k$.
 5      Search for $j^* \in \arg\min_{j \in I_k} \nabla g(\alpha_k)_j$ and define $d_k^{A} = \alpha_k - e_{j^*}$.
 6      if $\nabla g(\alpha_k)^T d_k^{FW} \ge \nabla g(\alpha_k)^T d_k^{A}$ then
 7          Perform a line-search to find $\lambda_{fw} \in \arg\max_{\lambda \in [0,1]} g(\alpha_k + \lambda d_k^{FW})$.
 8          Perform the FW step $\alpha_{k+1} = \alpha_k + \lambda_{fw} (e_{i^*} - \alpha_k)$.
 9          Update $I_k$ by $I_{k+1} = I_k \cup \{ i^* \}$.
10      else
11          Perform a line-search to find $\lambda_{away} \in \arg\max_{\lambda \in [0,1]} g(\alpha_k + \lambda d_k^{A})$.
12          Clip the line-search parameter, $\lambda_{away} = \min(\lambda_{away}, \alpha_{k,j^*} / (1 - \alpha_{k,j^*}))$.
13          Perform the AWAY step $\alpha_{k+1} = \alpha_k + \lambda_{away} (\alpha_k - e_{j^*})$.
14          if $\lambda_{away} = \alpha_{k,j^*} / (1 - \alpha_{k,j^*})$, then $I_{k+1} = I_k \setminus \{ j^* \}$.

2 New Away Steps for FW

Here we define a new type of away step. At each iteration, we find the ascent vertex $e_{i^*}$ and the away vertex $e_{j^*}$, where $i^* \in \arg\max_i \nabla g(\alpha_k)_i$ and $j^* \in \arg\min_{j : \alpha_{k,j} > 0} \nabla g(\alpha_k)_j$, just as in the MFW method. However, instead of considering the update $\alpha_{k+1} = \alpha_k + \lambda (\alpha_k - e_{j^*})$, we propose a pairwise away step of the form

$$\alpha_{k+1} = \alpha_k + \lambda (e_{i^*} - e_{j^*}), \tag{2}$$

where $\lambda$ is determined by a line-search. That is, instead of exploring the away direction $d_k^{MFW} = (\alpha_k - e_{j^*})$, our algorithm explores the direction $d_k^{SWAP} = (e_{i^*} - e_{j^*})$. The method, referred to as the SWAP method, is summarized in Algorithm 2 and depicted in Fig. 1. Note that this away step allows the iterate to move away from $e_{j^*}$ and toward $e_{i^*}$ in the same iteration.

Algorithm 2: The SWAP method for problem (1).
 1  Set $k = 0$.
 2  Compute an initial estimate $\alpha_0$ and set $I_0 = \{ i : \alpha_{0,i} > 0 \}$.
 3  for $k = 0, 1, \ldots$ do
 4      Search for $i^* \in \arg\max_i \nabla g(\alpha_k)_i$ (ascent direction).
 5      Search for $j^* \in \arg\min_{j : \alpha_{k,j} > \epsilon_w} \nabla g(\alpha_k)_j$ (descent direction).
 6      Perform a line-search to find $\lambda_{swap} \in \arg\max_{\lambda \in [0, \alpha_{k,j^*}]} g(\alpha_k + \lambda (e_{i^*} - e_{j^*}))$.
 7      Perform a line-search to find $\lambda_{fw} \in \arg\max_{\lambda \in [0,1]} g(\alpha_k + \lambda (e_{i^*} - \alpha_k))$.
 8      Compute $\delta_{swap} = g(\alpha_k + \lambda_{swap} (e_{i^*} - e_{j^*})) - g(\alpha_k)$ (improvement of a SWAP step).
 9      Compute $\delta_{fw} = g(\alpha_k + \lambda_{fw} (e_{i^*} - \alpha_k)) - g(\alpha_k)$ (improvement of a toward step).
10      Compute $\delta_k = \max(\delta_{swap}, \delta_{fw})$ (the best improvement).
11      if $\delta_k = \delta_{swap}$ then
12          If $\lambda_{swap} = \alpha_{k,j^*}$, mark the iteration as a SWAP-drop step.
13          Otherwise, mark the iteration as a SWAP-add step.
14          Perform the SWAP step $\alpha_{k+1} = \alpha_k + \lambda_{swap} (e_{i^*} - e_{j^*})$.
15          Set $I_{k+1} = I_k \cup \{ i^* \}$.
16          If a SWAP-drop step was performed, $I_{k+1} = I_{k+1} \setminus \{ j^* \}$.
17      else
18          Mark the iteration as a FW step.
19          Perform the FW step $\alpha_{k+1} = \alpha_k + \lambda_{fw} (e_{i^*} - \alpha_k)$.
20          Set $I_{k+1} = I_k \cup \{ i^* \}$.
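One iteration of Algorithm 2 can be sketched as follows. This is a hedged illustration rather than the authors' code: it assumes the hypothetical concave quadratic objective $g(\alpha) = -\alpha^T K \alpha$ with symmetric positive semidefinite `K`, selects the away vertex among coordinates with $\alpha_{k,j} > 0$ (no tolerance parameter), labels every unclipped SWAP step as "swap-add", and uses invented function names throughout.

```python
def matvec(K, v):
    return [sum(k * x for k, x in zip(row, v)) for row in K]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def objective(K, alpha):
    """Assumed objective g(alpha) = -alpha^T K alpha."""
    return -dot(alpha, matvec(K, alpha))


def grad(K, alpha):
    return [-2.0 * x for x in matvec(K, alpha)]


def line_search(K, alpha, d, lam_max):
    """Maximizer of g(alpha + lam * d) over [0, lam_max] for the assumed quadratic g."""
    slope = dot(grad(K, alpha), d)
    curvature = 2.0 * dot(d, matvec(K, d))
    if slope <= 0.0:
        return 0.0
    if curvature <= 0.0:
        return lam_max
    return min(lam_max, slope / curvature)


def swap_iteration(K, alpha):
    """Try a pairwise (SWAP) step and a FW step; keep the larger improvement."""
    n = len(alpha)
    g = grad(K, alpha)
    i_star = max(range(n), key=lambda i: g[i])
    j_star = min((j for j in range(n) if alpha[j] > 0.0), key=lambda j: g[j])
    e_i = [1.0 if t == i_star else 0.0 for t in range(n)]
    e_j = [1.0 if t == j_star else 0.0 for t in range(n)]

    # Pairwise direction e_{i*} - e_{j*}: mass moves from j* to i* only.
    d_swap = [a - b for a, b in zip(e_i, e_j)]
    lam_swap = line_search(K, alpha, d_swap, alpha[j_star])
    cand_swap = [a + lam_swap * d for a, d in zip(alpha, d_swap)]

    # Standard toward direction e_{i*} - alpha.
    d_fw = [a - b for a, b in zip(e_i, alpha)]
    lam_fw = line_search(K, alpha, d_fw, 1.0)
    cand_fw = [a + lam_fw * d for a, d in zip(alpha, d_fw)]

    base = objective(K, alpha)
    delta_swap = objective(K, cand_swap) - base
    delta_fw = objective(K, cand_fw) - base

    if delta_swap >= delta_fw:
        if lam_swap == alpha[j_star]:
            cand_swap[j_star] = 0.0  # clipped step: coordinate j* leaves the support
            return cand_swap, "swap-drop"
        return cand_swap, "swap-add"
    return cand_fw, "fw"
```

Note that the pairwise step changes only two coordinates of $\alpha_k$, whereas the FW step rescales all of them; this is visible in the sketch as `cand_swap` differing from `alpha` in positions `i_star` and `j_star` only.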
[Figure 1: A sketch of the search directions used by the FW, MFW and SWAP methods. The sketch shows the simplex with vertices $e_1$, $e_2$, $e_3$, the current iterate $\alpha_k$, the directions $d^{FW}$, $d^{MFW}$, $d^{SWAP}$, and the resulting iterates $\alpha^{FW}$, $\alpha^{MFW}$, $\alpha^{SWAP}$.]

3 Convergence Analysis

Here we state some results about the convergence of the proposed method. Proofs are omitted due to space constraints. Note that the stopping criterion in Proposition 3, $d(\alpha_k) = \max_i \nabla g(\alpha_k)_i - \alpha_k^T \nabla g(\alpha_k) \le \varepsilon$, corresponds to a primal-dual measure of optimality, as considered in [2, 4, 11].

Proposition 1. (Global Convergence) Suppose $g$ is Lipschitz-continuous on the feasible set. Then, Algorithm 2 produces a sequence of iterates $\{\alpha_k\}_k$ such that $g(\alpha_k)$ converges to $g(\alpha^*)$, where $\alpha^*$ is a solution of problem (1). If $\alpha^*$ is unique, $\{\alpha_k\}_k$ converges to $\alpha^*$.

Proposition 2. (Linear Convergence) Suppose $g$ is twice continuously differentiable and that there is a solution $\alpha^*$ of (1) satisfying the strong sufficient condition of Robinson in [13]. Then, for sufficiently large $k$, any iteration marked as SWAP-add or FW in Algorithm 2 produces an iterate $\alpha_{k+1}$ satisfying the inequality

$$\frac{g(\alpha^*) - g(\alpha_{k+1})}{g(\alpha^*) - g(\alpha_k)} \le \left( 1 - \frac{1}{M} \right) \tag{3}$$

for some constant $M > 1$.

Proposition 3. Suppose $g$ is twice continuously differentiable and suppose we stop the algorithm by using the stopping condition $d(\alpha_k) = \max_i \nabla g(\alpha_k)_i - \alpha_k^T \nabla g(\alpha_k) \le \varepsilon$ for some $\varepsilon \in (0, 1)$. Let $K$ be the number of unclipped iterations performed by Algorithm 2. Then,

$$K \le Q + \frac{M}{\varepsilon}, \tag{4}$$

where $Q$, $M$ are constants independent of $m$ and $\varepsilon$.

4 Experiments

In this section, we evaluate the performance of the proposed method, specialized to the problem of training L2 Support Vector Machines [5, 6, 15]. We present experiments performed on well-known classification data available in several public repositories [3, 7]. In order to provide an idea of the size of each problem, we specify the size $m$ of the training set and the number of classes $K$. In the case of multi-category classification problems, we adopt a one-versus-one approach (OVO) [10].

For the initialization of all the methods, that is, the computation of a starting solution $\alpha_0$, we adopted the method proposed in [15]. In this approach, the starting solution is obtained by solving the problem on a random subset of $p$ training patterns. The entries of $\alpha_0$ corresponding to other data points are set to zero. We used $p = 20$ points for initialization and the stopping criterion of Proposition 3 with $\varepsilon = 10^{-6}$ for all the algorithms.

In all the experiments presented in this paper, SVMs were trained using an RBF (Gaussian) kernel $k(x_1, x_2) = \exp(-\| x_1 - x_2 \|^2 / 2\sigma^2)$ with scale parameter $\sigma^2$. Parameter $\sigma^2$ was determined using the default method employed in [15], i.e. it was set to the average squared distance among training patterns. Parameter $C$ was determined on the logarithmic grid $[2^0, 2^{12}]$ using a validation set consisting of a randomly selected 30% of the training set. We also adopted the LRR caching strategy designed in [14] to avoid recomputing recently used kernel values, and the probabilistic speed-up described in [14] to accelerate the search for $i^*$.

[Table 1: Running times (seconds) of the different methods on some classification problems. Columns: Dataset, K, m, FW, MFW, SWAP. Rows: Adult a1a, Adult a5a, Adult a8a, Web w1a, Web w5a, Web w8a, Protein, Usps-Ext, Shuttle, Kdd-10pc.]

[Table 2: Testing accuracies of the different methods on some classification problems. Columns: Dataset, K, m, FW, MFW, SWAP. Rows: Adult a1a, Adult a5a, Adult a8a, Web w1a, Web w5a, Web w8a, Protein, Usps-Ext, Shuttle, Kdd-10pc.]
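The kernel setup used in the experiments can be illustrated with a small sketch. It is an assumption-laden example, not the code of [14] or [15]: in particular, "average squared distance among training patterns" is interpreted here as the mean of $\| x_i - x_j \|^2$ over all $m^2$ ordered pairs, which may differ from the exact convention of the toolkit, and the function names are invented.

```python
import math


def squared_distance(x, y):
    """Squared Euclidean distance between two patterns given as sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def average_squared_distance(X):
    """Mean of ||x_i - x_j||^2 over all m^2 ordered pairs (assumed convention)."""
    m = len(X)
    total = sum(squared_distance(x, y) for x in X for y in X)
    return total / (m * m)


def rbf_kernel(x, y, sigma2):
    """Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 * sigma2))."""
    return math.exp(-squared_distance(x, y) / (2.0 * sigma2))
```

A typical use would compute `sigma2 = average_squared_distance(X_train)` once and then evaluate `rbf_kernel(x, y, sigma2)` on demand. The double loop costs $O(m^2)$ distance evaluations, so on large training sets one would presumably estimate the average on a subsample; that choice is not specified in the text above.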
5 Conclusions

We presented a variant of the FW method for the general problem of maximizing a concave function on the unit simplex, introducing a novel way to perform away steps in the FW method. On the theoretical side, we showed that the method converges globally and that the unclipped optimization steps provide a linear rate of convergence. We also stated that the method achieves a primal-dual gap $d(\alpha_k) = \max_i \nabla g(\alpha_k)_i - \alpha_k^T \nabla g(\alpha_k)$ [4, 11] lower than a given tolerance $\varepsilon$ in $O(1/\varepsilon)$ unclipped iterations, independently of $m$, which is both the dimensionality of the feasible space and the number of examples in SVM problems. On the experimental side, we showed that the proposed method was faster than the FW and MFW methods on several SVM training problems. We observed that the SWAP method performed better than FW and MFW in those cases in which classic away steps effectively boost the convergence of the FW method, and proved to be a robust alternative to MFW in the cases where classic away steps failed.
References

[1] S. Damla Ahipasaoglu, Peng Sun, and Michael Todd. Linear convergence of a modified Frank-Wolfe algorithm for computing minimum volume enclosing ellipsoids. Optimization Methods and Software, 23(1):5-19.
[2] Héctor Allende, Emanuele Frandi, Ricardo Ñanculef, and Claudio Sartori. Novel Frank-Wolfe methods for SVM learning. Technical report.
[3] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines.
[4] Kenneth Clarkson. Coresets, sparse greedy approximation, and the Frank-Wolfe algorithm. In Proceedings of SODA '08. SIAM.
[5] Emanuele Frandi, Maria Grazia Gasparo, Stefano Lodi, Ricardo Ñanculef, and Claudio Sartori. A new algorithm for training SVMs using approximate minimal enclosing balls. In Proceedings of the 15th Iberoamerican Congress on Pattern Recognition, Lecture Notes in Computer Science. Springer.
[6] Emanuele Frandi, Maria Grazia Gasparo, Stefano Lodi, Ricardo Ñanculef, and Claudio Sartori. Training support vector machines using Frank-Wolfe methods. International Journal of Pattern Recognition and Artificial Intelligence, 27(3).
[7] Andrew Frank and Arthur Asuncion. The UCI KDD Archive.
[8] Jacques Guélat and Patrice Marcotte. Some comments on Wolfe's away step. Mathematical Programming, 35.
[9] Bernd Gärtner and Martin Jaggi. Coresets for polytope distance. In John Hershberger and Efi Fogel, editors, Symposium on Computational Geometry. ACM.
[10] Thomas Hofmann, Bernhard Schölkopf, and Alexander Smola. Kernel methods in machine learning. Annals of Statistics, 36(3).
[11] Martin Jaggi. Revisiting Frank-Wolfe: Projection-free sparse convex optimization. In International Conference on Machine Learning (to appear).
[12] Hua Ouyang and Alexander Gray. Fast stochastic Frank-Wolfe algorithms for nonlinear SVMs. In SDM.
[13] Stephen Robinson. Generalized equations and their solutions, part II: Applications to nonlinear programming. In Optimality and Stability in Mathematical Programming, volume 19 of Mathematical Programming Studies. Springer Berlin Heidelberg.
[14] Ivor Tsang, Andras Kocsor, and James Kwok. LibCVM Toolkit.
[15] Ivor Tsang, James Kwok, and Pak-Ming Cheung. Core vector machines: Fast SVM training on very large data sets. Journal of Machine Learning Research, 6.
[16] Philip Wolfe. Convergence theory in nonlinear programming. In J. Abadie, editor, Integer and Nonlinear Programming. North-Holland, Amsterdam.
[17] Emre Alper Yildirim. Two algorithms for the minimum enclosing ball problem. SIAM Journal on Optimization, 19(3).
More informationCutting Plane Training of Structural SVM
Cutting Plane Training of Structural SVM Seth Neel University of Pennsylvania sethneel@wharton.upenn.edu September 28, 2017 Seth Neel (Penn) Short title September 28, 2017 1 / 33 Overview Structural SVMs
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015 Outline One versus all/one versus one Ranking loss for multiclass/multilabel classification Scaling to millions of labels Multiclass
More informationA Randomized Algorithm for Large Scale Support Vector Learning
A Randomized Algorithm for Large Scale Support Vector Learning Krishnan S. Department of Computer Science and Automation, Indian Institute of Science, Bangalore-12 rishi@csa.iisc.ernet.in Chiranjib Bhattacharyya
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 18, 2016 Outline One versus all/one versus one Ranking loss for multiclass/multilabel classification Scaling to millions of labels Multiclass
More informationk-means Clustering via the Frank-Wolfe Algorithm
k-means Clustering via the Frank-Wolfe Algorithm Christian Bauckhage B-IT, University of Bonn, Bonn, Germany Fraunhofer IAIS, Sankt Augustin, Germany http://multimedia-pattern-recognition.info Abstract.
More informationAdaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer Vicente L. Malave February 23, 2011 Outline Notation minimize a number of functions φ
More informationDistributed Box-Constrained Quadratic Optimization for Dual Linear SVM
Distributed Box-Constrained Quadratic Optimization for Dual Linear SVM Lee, Ching-pei University of Illinois at Urbana-Champaign Joint work with Dan Roth ICML 2015 Outline Introduction Algorithm Experiments
More informationSparse Support Vector Machines by Kernel Discriminant Analysis
Sparse Support Vector Machines by Kernel Discriminant Analysis Kazuki Iwamura and Shigeo Abe Kobe University - Graduate School of Engineering Kobe, Japan Abstract. We discuss sparse support vector machines
More informationMaximum likelihood estimation
Maximum likelihood estimation Guillaume Obozinski Ecole des Ponts - ParisTech Master MVA Maximum likelihood estimation 1/26 Outline 1 Statistical concepts 2 A short review of convex analysis and optimization
More informationLecture 18: Kernels Risk and Loss Support Vector Regression. Aykut Erdem December 2016 Hacettepe University
Lecture 18: Kernels Risk and Loss Support Vector Regression Aykut Erdem December 2016 Hacettepe University Administrative We will have a make-up lecture on next Saturday December 24, 2016 Presentations
More information1 Maximizing a Submodular Function
6.883 Learning with Combinatorial Structure Notes for Lecture 16 Author: Arpit Agarwal 1 Maximizing a Submodular Function In the last lecture we looked at maximization of a monotone submodular function,
More informationSupport Vector Machine Classification via Parameterless Robust Linear Programming
Support Vector Machine Classification via Parameterless Robust Linear Programming O. L. Mangasarian Abstract We show that the problem of minimizing the sum of arbitrary-norm real distances to misclassified
More informationUses of duality. Geoff Gordon & Ryan Tibshirani Optimization /
Uses of duality Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Remember conjugate functions Given f : R n R, the function is called its conjugate f (y) = max x R n yt x f(x) Conjugates appear
More informationLearning with kernels and SVM
Learning with kernels and SVM Šámalova chata, 23. května, 2006 Petra Kudová Outline Introduction Binary classification Learning with Kernels Support Vector Machines Demo Conclusion Learning from data find
More informationLeast Squares SVM Regression
Least Squares SVM Regression Consider changing SVM to LS SVM by making following modifications: min (w,e) ½ w 2 + ½C Σ e(i) 2 subject to d(i) (w T Φ( x(i))+ b) = e(i), i, and C>0. Note that e(i) is error
More informationLecture 9: Large Margin Classifiers. Linear Support Vector Machines
Lecture 9: Large Margin Classifiers. Linear Support Vector Machines Perceptrons Definition Perceptron learning rule Convergence Margin & max margin classifiers (Linear) support vector machines Formulation
More informationSimple Optimization, Bigger Models, and Faster Learning. Niao He
Simple Optimization, Bigger Models, and Faster Learning Niao He Big Data Symposium, UIUC, 2016 Big Data, Big Picture Niao He (UIUC) 2/26 Big Data, Big Picture Niao He (UIUC) 3/26 Big Data, Big Picture
More informationAdvanced Topics in Machine Learning
Advanced Topics in Machine Learning 1. Learning SVMs / Primal Methods Lars Schmidt-Thieme Information Systems and Machine Learning Lab (ISMLL) University of Hildesheim, Germany 1 / 16 Outline 10. Linearization
More informationPoS(CENet2017)018. Privacy Preserving SVM with Different Kernel Functions for Multi-Classification Datasets. Speaker 2
Privacy Preserving SVM with Different Kernel Functions for Multi-Classification Datasets 1 Shaanxi Normal University, Xi'an, China E-mail: lizekun@snnu.edu.cn Shuyu Li Shaanxi Normal University, Xi'an,
More informationReading Group on Deep Learning Session 1
Reading Group on Deep Learning Session 1 Stephane Lathuiliere & Pablo Mesejo 2 June 2016 1/31 Contents Introduction to Artificial Neural Networks to understand, and to be able to efficiently use, the popular
More informationSupport Vector and Kernel Methods
SIGIR 2003 Tutorial Support Vector and Kernel Methods Thorsten Joachims Cornell University Computer Science Department tj@cs.cornell.edu http://www.joachims.org 0 Linear Classifiers Rules of the Form:
More informationKernel Learning via Random Fourier Representations
Kernel Learning via Random Fourier Representations L. Law, M. Mider, X. Miscouridou, S. Ip, A. Wang Module 5: Machine Learning L. Law, M. Mider, X. Miscouridou, S. Ip, A. Wang Kernel Learning via Random
More informationI D I A P. Online Policy Adaptation for Ensemble Classifiers R E S E A R C H R E P O R T. Samy Bengio b. Christos Dimitrakakis a IDIAP RR 03-69
R E S E A R C H R E P O R T Online Policy Adaptation for Ensemble Classifiers Christos Dimitrakakis a IDIAP RR 03-69 Samy Bengio b I D I A P December 2003 D a l l e M o l l e I n s t i t u t e for Perceptual
More information