CTJLSVM: Componentwise Triple Jump Acceleration for Training Linear SVM
Han-Shen Huang, Porter Chang (Ker2), and Chun-Nan Hsu
AI for Investigating Anti-cancer solutions (AIIA Lab), Institute of Information Science, Academia Sinica, Taipei, Taiwan
ICML 2008 Large Scale Learning Workshop, July 9, 2008
Outline
1. Triple Jump Extrapolation
2. Application to Linear SVM
3. Implementation
4. Discussion

Section 1: Triple Jump Extrapolation
Aitken's Acceleration

Fixed-point iteration: solve w^* = M(w^*), w \in R^d, by iteratively substituting the output of M in the previous iteration as its input:

  w^{(t+1)} = M(w^{(t)}), \quad w^{(t+2)} = M(w^{(t+1)}), \ldots

until w^* = M(w^*). Examples: EM, Generalized Iterative Scaling, gradient descent, etc.

Aitken's acceleration: apply a linear Taylor expansion of M around w^*:

  w^{(t+1)} = M(w^{(t)}) \approx w^* + M'(w^*)(w^{(t)} - w^*) = w^* + J(w^{(t)} - w^*).    (1)

Applying M to w^{(t)} consecutively h times gives w^{(t+h)} \approx w^* + J^h (w^{(t)} - w^*). Supposing eig(J) \subset (-1, 1), summing the geometric series yields the multivariate Aitken's acceleration:

  w^* \approx w^{(t)} + \sum_{h=0}^{\infty} J^h (w^{(t+1)} - w^{(t)}) = w^{(t)} + (I - J)^{-1} (M(w^{(t)}) - w^{(t)}).
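To make the geometric-series identity concrete, the following minimal numpy sketch (ours, not from the slides) applies the multivariate Aitken formula to a toy linear map M(w) = Jw + b whose Jacobian is known exactly; because the map is linear, one Aitken step lands on the fixed point that plain iteration only approaches slowly. The particular J and b are made up for illustration.

    import numpy as np

    # Toy linear fixed-point map M(w) = J w + b with spectral radius < 1.
    J = np.array([[0.9, 0.0],
                  [0.0, 0.5]])
    b = np.array([1.0, 1.0])

    def M(w):
        return J @ w + b

    w_star = np.linalg.solve(np.eye(2) - J, b)       # exact fixed point of M

    w = np.zeros(2)
    for _ in range(10):                              # plain fixed-point iteration
        w = M(w)
    print("error after 10 plain iterations:", np.linalg.norm(w - w_star))

    # One multivariate Aitken step from the same starting point:
    w0 = np.zeros(2)
    w_aitken = w0 + np.linalg.solve(np.eye(2) - J, M(w0) - w0)
    print("error after one Aitken step:   ", np.linalg.norm(w_aitken - w_star))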
Triple Jump Extrapolation

Approximating the Jacobian: since it is prohibitively expensive to compute J, starting from J(w^{(t)} - w^{(t-1)}) \approx M(w^{(t)}) - w^{(t)}, replace J with a scalar

  \gamma^{(t)} := \frac{\|M(w^{(t)}) - w^{(t)}\|}{\|w^{(t)} - w^{(t-1)}\|}    (2)

to obtain

  w^{(t+1)} = w^{(t)} + (1 - \gamma^{(t)})^{-1} (M(w^{(t)}) - w^{(t)}).    (3)

Algorithm Triple Jump
 1: Initialize w^{(0)}, t <- 0
 2: repeat in iteration t
 3:   if mod(t, 2) == 0 then
 4:     hop: w^{(t+1)} <- M(w^{(t)})
 5:   else
 6:     step: apply M again to obtain M(w^{(t)})
 7:     jump: apply Equation (3) to obtain w^{(t+1)}
 8:   end if
 9:   t <- t + 1
10: until convergence
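The hop/step/jump schedule can be written as a short driver around any fixed-point mapping. The sketch below is our illustration, not the authors' code: mapping is any function w -> M(w), the tolerance test is a simple convergence check we added, and the scalar gamma follows the norm-ratio estimate of Equation (2).

    import numpy as np

    def triple_jump(mapping, w0, max_iter=100, tol=1e-10):
        """Triple-jump acceleration of a fixed-point iteration with a scalar gamma (Eqs. 2-3)."""
        w_prev, w = w0, mapping(w0)                           # iteration t = 0: hop
        for t in range(1, max_iter):
            if np.linalg.norm(w - w_prev) < tol:              # simple convergence check
                break
            if t % 2 == 0:                                    # hop
                w_prev, w = w, mapping(w)
            else:                                             # step, then jump
                Mw = mapping(w)                               # step: apply M once more
                gamma = np.linalg.norm(Mw - w) / np.linalg.norm(w - w_prev)
                w_prev, w = w, w + (Mw - w) / (1.0 - gamma)   # jump, Equation (3)
        return w

    # Example on the toy linear map from the previous sketch:
    # w_hat = triple_jump(M, np.zeros(2))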
Triple Jump GIS

[Figure: penalized log-likelihood versus number of forward-backward evaluations on an independent data set, comparing CTJPGIS, TJPGIS, and PGIS.]
Global and Componentwise Extrapolation

Eigenvalue approximation and global rate of convergence: another way to derive the triple jump method is to approximate J with its eigenvalue. The global rate of convergence of M is

  R = \lim_{t \to \infty} R^{(t)} := \lim_{t \to \infty} \frac{\|w^{(t+1)} - w^*\|}{\|w^{(t)} - w^*\|} = \lambda_{\max} \approx \frac{\|w^{(t+1)} - w^{(t)}\|}{\|w^{(t)} - w^{(t-1)}\|}.

That R equals \lambda_{\max} of J was established by Dempster et al. (1977) when M is EM. In practice w^* is unknown; replacing it with empirical values leads to Equation (2).

Componentwise rate of convergence: the i-th componentwise rate of convergence of M is defined as

  R_i = \lim_{t \to \infty} R_i^{(t)} := \lim_{t \to \infty} \frac{|w_i^{(t+1)} - w_i^*|}{|w_i^{(t)} - w_i^*|} = \lambda_i \approx \frac{|w_i^{(t+1)} - w_i^{(t)}|}{|w_i^{(t)} - w_i^{(t-1)}|}.
Componentwise Triple Jump

Componentwise extrapolation: we can estimate the i-th eigenvalue \lambda_i similarly by

  \gamma_i^{(t)} := \frac{M(w^{(t)})_i - w_i^{(t)}}{w_i^{(t)} - w_i^{(t-1)}}, \quad \forall i,    (4)

and extrapolate at each dimension i by

  w_i^{(t+1)} = w_i^{(t)} + (1 - \gamma_i^{(t)})^{-1} (M(w^{(t)})_i - w_i^{(t)}).    (5)

Global vs. componentwise: when R_i \neq R, i.e., the componentwise rates differ from the global rate, componentwise extrapolation should be preferred (Schafer 1997; Hsu et al. 2006).
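A minimal numpy sketch of the componentwise jump of Equations (4)-(5), written by us for illustration; the epsilon guard on the denominator and the optional clip argument (anticipating the kappa bound introduced later) are our additions, not something stated on this slide.

    import numpy as np

    def componentwise_jump(w_prev, w, Mw, clip=None, eps=1e-12):
        """One componentwise triple-jump extrapolation (Equations 4-5).

        w_prev, w : previous and current iterates
        Mw        : the mapping applied once more, i.e. M(w)
        clip      : optional upper bound on each gamma_i (the kappa of later slides)
        eps       : guards against zero denominators (our addition)
        """
        d = np.where(np.abs(w - w_prev) > eps, w - w_prev, eps)
        gamma = (Mw - w) / d                        # Equation (4)
        if clip is not None:
            gamma = np.minimum(clip, gamma)         # prevents erratic extrapolation
        return w + (Mw - w) / (1.0 - gamma)         # Equation (5)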
Section 2: Application to Linear SVM
SGD for Linear SVM

Decomposing the primal objective function for linear SVM:

  P(w; X) = \sum_{i=1}^{m} P(w; x_i) = \sum_{i=1}^{m} \left[ C \max(0, 1 - y_i w \cdot x_i) + \frac{1}{2m} \|w\|^2 \right]

Gradient Descent (off-line)
1: Initialize w^{(0)}
2: repeat in iteration t
3:   w^{(t+1)} <- w^{(t)} - \eta \nabla P(w^{(t)}; X)
4:   update \eta
5: until convergence

SGD (on-line)
1: Initialize w^{(0)}
2: repeat in iteration t
3:   w^{(t+1)} <- w^{(t)} - \eta \nabla P(w^{(t)}; x_i)
4:   update \eta
5: until convergence
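To fix ideas for the next slide, here is a sketch of one SGD epoch over the decomposed objective above, treated as the mapping M. It is written from the formula on this slide rather than taken from the authors' implementation; the random shuffling and the dense update are our simplifications, and the sparsity trick mentioned later is not shown.

    import numpy as np

    def sgd_epoch(w, X, y, C, eta, rng=None):
        """One SGD pass over P(w; x_i) = C * max(0, 1 - y_i w.x_i) + ||w||^2 / (2m)."""
        if rng is None:
            rng = np.random.default_rng(0)
        m = X.shape[0]
        for i in rng.permutation(m):
            grad = w / m                            # gradient of the ||w||^2 / (2m) term
            if y[i] * X[i].dot(w) < 1.0:            # hinge loss is active
                grad = grad - C * y[i] * X[i]
            w = w - eta * grad
        return w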
Accelerating SGD with CTJ

By treating one SGD epoch as the mapping M, we can apply CTJ (componentwise triple jump) to accelerate SGD for training linear SVM. CTJ takes only O(d) more time per iteration than SGD, and the sparsity trick for SGD can still be applied here.

Algorithm CTJLSVM
 1: Initialize w^{(0)}, t <- 0, \kappa
 2: repeat in iteration t
 3:   if mod(t, 2) == 0 then
 4:     hop: w^{(t+1)} <- M(w^{(t)})
 5:   else
 6:     step: apply M again to obtain M(w^{(t)})
 7:     compute \gamma^{(t)} by Equation (4)
 8:     \gamma_i^{(t)} <- min(\kappa, \gamma_i^{(t)}) for all i, to prevent erratic extrapolation
 9:     jump: apply the componentwise extrapolation of Equation (5) to obtain w^{(t+1)}
10:   end if
11:   update \eta
12:   t <- t + 1
13: until convergence
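Putting the pieces together, the sketch below wires the sgd_epoch and componentwise_jump helpers from the earlier sketches into the CTJLSVM loop. It is our reconstruction of the pseudocode, not the released code: the fixed number of epochs and the untouched eta are placeholders, since the actual step-size schedule and stopping rule are described on the Implementation slides.

    import numpy as np

    def ctjlsvm(X, y, C, eta0=0.1, kappa=0.7, max_epochs=40):
        """Componentwise triple-jump acceleration of SGD for linear SVM (sketch)."""
        rng = np.random.default_rng(0)
        eta = eta0
        w_prev = np.zeros(X.shape[1])
        w = sgd_epoch(w_prev, X, y, C, eta, rng)              # iteration t = 0: hop
        for t in range(1, max_epochs):
            if t % 2 == 0:                                    # hop
                w_prev, w = w, sgd_epoch(w, X, y, C, eta, rng)
            else:                                             # step, then componentwise jump
                Mw = sgd_epoch(w, X, y, C, eta, rng)
                w_prev, w = w, componentwise_jump(w_prev, w, Mw, clip=kappa)
            # "Update eta" and the convergence test are placeholders here; the
            # settings actually used appear on the Implementation slides.
        return w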
Section 3: Implementation
Parameter Settings

- \kappa: upper bound for the eigenvalue estimates. Since we did not implement convergence checking, we set \kappa = 0.7, a conservative setting considering that eigenvalues can be close to 1.
- \eta^{(0)}: initial step size = 0.1.
- Mini-batch size for SGD is always 1.
Stopping Condition

Since this is not a dual method, we have to design our own stopping condition. Initial condition:

  \frac{P(w^{(t-1)}; X) - P(w^{(t)}; X)}{P(w^{(t-1)}; X)} < \delta

If this condition is satisfied 2 times successively, then decrease \eta by a factor of 0.1. Repeat until \eta has been decreased \tau = 7 times. A sketch of this rule follows.
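As a rough illustration of how the rule above could wrap a training loop: step stands for one training iteration (e.g., an SGD epoch or a CTJ hop/step/jump pair), objective evaluates P(w; X), and the delta value below is a placeholder of ours because the threshold on the slide did not survive transcription.

    def train_with_annealing(step, objective, w, eta0=0.1, delta=1e-4, tau=7):
        """Shrink eta by 0.1 whenever the relative improvement stays below delta
        twice in a row; stop after eta has been shrunk tau times (delta is illustrative)."""
        eta, shrinks, stalls = eta0, 0, 0
        prev_obj = objective(w)
        while shrinks < tau:
            w = step(w, eta)
            obj = objective(w)
            stalls = stalls + 1 if (prev_obj - obj) / prev_obj < delta else 0
            if stalls >= 2:                       # condition satisfied 2 times successively
                eta, shrinks, stalls = 0.1 * eta, shrinks + 1, 0
            prev_obj = obj
        return w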
Section 4: Discussion
Key Tricks

CTJLSVM vs. SGD:
- CTJLSVM effectively accelerates SGD.
- CTJLSVM usually achieves a lower objective.
- CTJLSVM's convergence is less sensitive to C.

Comparison:
- CTJLSVM02: CTJLSVM with the same settings as our final Linear-track submissions.
- lsvm04: SGD LSVM with a constant \eta.
- linearSVM03: SGD LSVM with \eta decreased per example by the textbook update rule \eta <- \eta^{(0)} / (1 + t/\lambda).

Results shown are for the Alpha data set.
CTJ accelerates SGD. [Figure]

CTJ takes no extra time. [Figure]

CTJ is less sensitive to C. [Figure]
Performance Tuning

Tuning SGD: the textbook update \eta <- \eta^{(0)} / (1 + t/\lambda) is slow! A fixed \eta makes great progress in the beginning.

Tuning CTJ: how to guarantee convergence? One option is to skip the CTJ jump when it fails to improve the objective, but this check takes time. Instead, we use a smaller \kappa = 0.7; we usually use \kappa = 0.9 for EM and CRF. The upper bound on the extrapolation rate is (1 - \gamma)^{-1} = 3.33 for \kappa = 0.7 and (1 - \gamma)^{-1} = 10.0 for \kappa = 0.9.
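For reference, a tiny sketch of the two knobs discussed above: the textbook step-size schedule and the extrapolation-factor bound implied by kappa. The lambda value below is a made-up example, not a setting used by the authors.

    def eta_textbook(t, eta0=0.1, lam=1e4):
        """Textbook per-example decay eta = eta0 / (1 + t / lambda); lam is illustrative."""
        return eta0 / (1.0 + t / lam)

    def max_jump(kappa):
        """Upper bound (1 - gamma)^{-1} on the extrapolation factor when gamma <= kappa."""
        return 1.0 / (1.0 - kappa)

    print(max_jump(0.7))   # ~3.33: the conservative setting used for CTJLSVM
    print(max_jump(0.9))   # 10.0: the setting usually used for EM and CRF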
Wild vs. Linear Track

- Running convert.py to produce SVMlight format may hurt aoPRC?
- C = 0.5 yields much lower aoPRC than the required C.
- The data sets may not be linearly separable, so a lower objective leads to a higher aoPRC!
- \tau (the number of \eta decreases) in the stopping condition controls the trade-off between objective and aoPRC: a large \tau leads to a lower objective but higher aoPRC; a small \tau leads to lower aoPRC but a higher objective.
- We forgot to reduce \tau for the wild track... OUCH!!
Tricks that we didn't do

We found several tricks that participants could use to take advantage of the scoring rules:
- Compute the objective with a smaller data set to minimize obj.
- Scale C from 10^5 instead of 10^4 to minimize avgtime for C.
- Report worse results for small data sizes to boost Effort.
- Carefully choose T_1...T_10 to minimize autime vs. obj.
- Report results on huge real-world data sets.
CTJLSVM can handle large data

Table: Post-Challenge Test Result of CTJLSVM for OCR — rows for CTJLSVM (T1) and CTJLSVM (converged), with columns Data, Obj., and CPU Time, plus the test rank, the test rank when converged, and the overall test rank; the numeric entries did not survive transcription.

CPU time in seconds, calibrated; T1: first reporting time point. No parameter re-tuning for CTJLSVM.
Final Words

- CTJ is a generic method that can be applied to accelerate fixed-point iteration methods in machine learning, including EM, GIS, SGD, etc.
- Accelerating SGD by CTJ for training linear SVM performs reasonably well.
- A cleverer way to adjust the step size in SGD and the stopping condition would improve CTJLSVM.
- We thank the organizers and hope there will be a next year.
Thank you for your attention!
chunnan

This work is supported in part by the Advanced Bioinformatics Core (ABC), National Research Program for Genomic Medicine (NRPGM), National Science Council, Taiwan. We also wish to thank Yuh-Jye Lee for his useful advice on SVM implementation.