CTJLSVM: Componentwise Triple Jump Acceleration for Training Linear SVM


Han-Shen Huang, Porter Chang (Ker2) and Chun-Nan Hsu
AI for Investigating Anti-cancer solutions (AIIA Lab)
Institute of Information Science, Academia Sinica, Taipei, Taiwan

ICML 2008 Large Scale Learning Workshop, July 9, 2008

Outline

1. Triple Jump Extrapolation
2. Application to Linear SVM
3. Implementation
4. Discussion

1. Triple Jump Extrapolation

Aitken's Acceleration

Fixed point iteration
Solve $w^* = M(w^*)$, $w \in \mathbb{R}^d$, by iteratively substituting the input of $M$ with the output of $M$ from the previous iteration:
$$w^{(t+1)} = M(w^{(t)}),\quad w^{(t+2)} = M(w^{(t+1)}),\ \dots$$
until $w^* = M(w^*)$. E.g., EM, Generalized Iterative Scaling, Gradient Descent, etc.

Aitken's acceleration
Apply a linear Taylor expansion of $M$ around $w^*$:
$$w^{(t+1)} = M(w^{(t)}) \approx w^* + M'(w^*)(w^{(t)} - w^*) = w^* + J(w^{(t)} - w^*). \tag{1}$$
By applying $M$ to $w^{(t)}$ consecutively for $h$ times, $w^{(t+h)} \approx w^* + J^h(w^{(t)} - w^*)$. Supposing that $\mathrm{eig}(J) \subset (-1, 1)$, we have the multivariate Aitken's acceleration:
$$w^* \approx w^{(t)} + \sum_{h=0}^{\infty} J^h\,(w^{(t+1)} - w^{(t)}) = w^{(t)} + (I - J)^{-1}\,(M(w^{(t)}) - w^{(t)}).$$

Triple Jump Extrapolation: Approximating the Jacobian

Since it is prohibitively expensive to compute $J$, start from $J(w^{(t)} - w^{(t-1)}) \approx M(w^{(t)}) - w^{(t)}$ and replace $J$ with a scalar:
$$\gamma^{(t)} := \frac{\lVert M(w^{(t)}) - w^{(t)} \rVert}{\lVert w^{(t)} - w^{(t-1)} \rVert} \tag{2}$$
to obtain
$$w^{(t+1)} = w^{(t)} + (1 - \gamma^{(t)})^{-1}\,(M(w^{(t)}) - w^{(t)}). \tag{3}$$

Algorithm: Triple Jump
1: Initialize w^(0), t ← 0
2: repeat in iteration t
3:   if mod(t, 2) == 0 then
4:     hop: w^(t+1) ← M(w^(t))
5:   else
6:     step: apply M again to obtain M(w^(t))
7:     jump: apply Equation (3) to obtain w^(t+1)
8:   end if
9:   t ← t + 1
10: until convergence
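To make the hop/step/jump pattern concrete, here is a minimal Python sketch of the algorithm above for a generic fixed-point map. The function name `triple_jump`, the tolerance `tol`, and the safeguard that caps γ below 1 are illustrative assumptions, not part of the slides.

```python
import numpy as np

def triple_jump(M, w0, tol=1e-8, max_iter=1000):
    """Triple-jump acceleration of the fixed-point iteration w <- M(w)."""
    w_prev = np.asarray(w0, dtype=float)
    w = M(w_prev)                                   # hop: one plain update
    for _ in range(max_iter):
        Mw = M(w)                                   # step: apply M again
        denom = np.linalg.norm(w - w_prev)
        if denom < tol:                             # already at a fixed point
            return w
        gamma = np.linalg.norm(Mw - w) / denom      # Eq. (2): scalar Jacobian proxy
        gamma = min(gamma, 0.95)                    # safeguard: keep 1 - gamma > 0
        w_next = w + (Mw - w) / (1.0 - gamma)       # Eq. (3): jump
        if np.linalg.norm(w_next - w) < tol:
            return w_next
        w_prev, w = w_next, M(w_next)               # hop of the next round
    return w
```

For example, with a linear contraction M(w) = A w + b this typically reaches the fixed point (I - A)^{-1} b in noticeably fewer applications of M than plain substitution when the spectral radius of A is close to 1.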

Triple Jump GIS

[Figure: penalized log-likelihood versus number of forward-backward evaluations on an independent data set, comparing CTJPGIS, TJPGIS, and PGIS.]

Global and Componentwise Extrapolation

Eigenvalue approximation and global rate of convergence
Another way to derive the triple jump method is to approximate $J$ with its eigenvalue. The global rate of convergence of $M$ is given by
$$R = \lim_{t\to\infty} R^{(t)} := \lim_{t\to\infty} \frac{\lVert w^{(t+1)} - w^* \rVert}{\lVert w^{(t)} - w^* \rVert} = \lambda_{\max} \approx \frac{\lVert w^{(t+1)} - w^{(t)} \rVert}{\lVert w^{(t)} - w^{(t-1)} \rVert}.$$
That $R = \lambda_{\max}$ of $J$ was established by Dempster et al. (1977) when $M$ is EM. In practice $w^*$ is unknown; replacing it with empirical values leads to Equation (2).

Componentwise rate of convergence
The $i$-th componentwise rate of convergence of $M$ is defined as
$$R_i = \lim_{t\to\infty} R_i^{(t)} := \lim_{t\to\infty} \frac{\lvert w_i^{(t+1)} - w_i^* \rvert}{\lvert w_i^{(t)} - w_i^* \rvert} = \lambda_i \approx \frac{\lvert w_i^{(t+1)} - w_i^{(t)} \rvert}{\lvert w_i^{(t)} - w_i^{(t-1)} \rvert}.$$

Componentwise Triple Jump

Componentwise extrapolation
We can estimate the $i$-th eigenvalue $\lambda_i$ similarly by
$$\gamma_i^{(t)} := \frac{M(w^{(t)})_i - w_i^{(t)}}{w_i^{(t)} - w_i^{(t-1)}}, \quad \forall i, \tag{4}$$
and extrapolate at each dimension $i$ by
$$w_i^{(t+1)} = w_i^{(t)} + (1 - \gamma_i^{(t)})^{-1}\,(M(w^{(t)})_i - w_i^{(t)}). \tag{5}$$

Global vs. componentwise
When the componentwise rates $R_i$ differ from the global rate $R$, componentwise extrapolation should be preferred (Schafer 1997; Hsu et al. 2006).
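The componentwise jump of Equations (4) and (5) is only a few lines in NumPy. The sketch below assumes `w_prev`, `w`, and `Mw` hold $w^{(t-1)}$, $w^{(t)}$, and $M(w^{(t)})$ as float arrays; the `eps` guard against tiny denominators is an added assumption.

```python
import numpy as np

def componentwise_jump(w_prev, w, Mw, eps=1e-12):
    """Componentwise triple-jump extrapolation, Equations (4) and (5)."""
    diff_prev = w - w_prev            # w^(t) - w^(t-1)
    diff_new = Mw - w                 # M(w^(t)) - w^(t)
    # Eq. (4): per-dimension eigenvalue estimate; where the previous change is
    # ~0, fall back to gamma_i = 0, which reduces to a plain fixed-point step.
    gamma = np.divide(diff_new, diff_prev,
                      out=np.zeros_like(diff_new),
                      where=np.abs(diff_prev) > eps)
    # Eq. (5): extrapolate each component independently.
    # Note: CTJLSVM later caps gamma at kappa < 1 so that 1 - gamma stays positive.
    return w + diff_new / (1.0 - gamma)
```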

2. Application to Linear SVM

SGD for Linear SVM

Decompose the primal objective function for linear SVM:
$$P(w; X) = \sum_{i=1}^{m} P(w; x_i) = \sum_{i=1}^{m} \left[ C \max(0,\ 1 - y_i w x_i) + \frac{1}{2m}\lVert w \rVert^2 \right]$$

Gradient Descent (off-line)
1: Initialize w^(0)
2: repeat in iteration t
3:   w^(t+1) ← w^(t) - η ∇P(w^(t); X)
4:   Update η
5: until convergence

SGD (on-line)
1: Initialize w^(0)
2: repeat in iteration t
3:   w^(t+1) ← w^(t) - η ∇P(w^(t); x_i)
4:   Update η
5: until convergence
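As a reference point for what one application of the map M looks like in the next slides, here is a minimal NumPy sketch of a single SGD epoch over this decomposed objective. The function name `sgd_epoch` and the dense-array representation are illustrative simplifications, not the authors' implementation.

```python
import numpy as np

def sgd_epoch(w, X, y, C=1.0, eta=0.1):
    """One SGD pass over P(w; x_i) = C*max(0, 1 - y_i w.x_i) + ||w||^2 / (2m)."""
    m = X.shape[0]
    for i in np.random.permutation(m):
        grad = w / m                          # gradient of the (1/2m)||w||^2 term
        if y[i] * X[i].dot(w) < 1.0:          # inside the margin: hinge is active
            grad -= C * y[i] * X[i]           # subgradient of the hinge term
        w = w - eta * grad
    return w
```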

Accelerating SGD with CTJ

By treating an SGD epoch as the map M, we can apply CTJ to accelerate SGD for training linear SVM. CTJ takes only O(d) more time per iteration than SGD, and the sparsity trick for SGD can still be applied here.

Algorithm: CTJLSVM
1: Initialize w^(0), t ← 0, κ
2: repeat in iteration t
3:   if mod(t, 2) == 0 then
4:     hop: w^(t+1) ← M(w^(t))
5:   else
6:     step: apply M again to obtain M(w^(t))
7:     compute γ^(t) by Equation (4)
8:     γ_i^(t) ← min(κ, γ_i^(t)) for all i, to prevent erratic extrapolation
9:     jump: apply Equation (3) componentwise, i.e., Equation (5), to obtain w^(t+1)
10:  end if
11:  Update η
12:  t ← t + 1
13: until convergence
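Putting the pieces together, the following sketch wires the componentwise jump and the κ cap around the `sgd_epoch` sketch given earlier. The epoch budget `n_epochs` is a placeholder, and the η schedule and stopping test are left out here because the Implementation slides describe the settings actually used.

```python
import numpy as np

def ctjlsvm(X, y, C=1.0, eta=0.1, kappa=0.7, n_epochs=100):
    """CTJ-accelerated SGD for linear SVM: alternate hop and step+jump epochs."""
    w = np.zeros(X.shape[1])
    w_prev = w.copy()
    for t in range(n_epochs):
        if t % 2 == 0:
            w_prev, w = w, sgd_epoch(w, X, y, C, eta)     # hop: w <- M(w)
        else:
            Mw = sgd_epoch(w, X, y, C, eta)               # step: compute M(w)
            diff_prev, diff_new = w - w_prev, Mw - w
            # Eq. (4): componentwise eigenvalue estimate (0 where undefined)
            gamma = np.divide(diff_new, diff_prev,
                              out=np.zeros_like(w),
                              where=np.abs(diff_prev) > 1e-12)
            gamma = np.minimum(gamma, kappa)              # cap to prevent erratic jumps
            w = w + diff_new / (1.0 - gamma)              # jump: Eq. (5)
        # eta update and convergence check omitted (see the Implementation slides)
    return w
```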

3. Implementation

Parameter Settings

- κ: upper bound for the eigenvalue estimates. Since we did not implement convergence checking, we set κ = 0.7, a conservative setting considering that eigenvalues could be close to 1.
- η^(0): initial step size = 0.1.
- Mini-batch size for SGD is always 1.

Stopping Condition

Since this is not a dual method, we have to design our own stopping condition.

Initial condition:
$$\frac{P(w^{(t-1)}; X) - P(w^{(t)}; X)}{P(w^{(t-1)}; X)} < \delta$$
If this condition is satisfied 2 times successively, then decrease η by a factor of 0.1. Repeat until η has been decreased τ = 7 times.
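A sketch of this stopping logic in Python; `step_fn` stands for one training epoch (plain SGD or a CTJ hop/jump round), `objective` computes P(w; X), and δ is left as a parameter. These names are illustrative assumptions.

```python
def train_with_stopping(w, step_fn, objective, delta, eta=0.1, factor=0.1, tau=7):
    """Shrink eta once the relative objective decrease stays below delta
    twice in a row; stop after eta has been shrunk tau times."""
    shrinks, small_steps = 0, 0
    prev_obj = objective(w)
    while shrinks < tau:
        w = step_fn(w, eta)
        obj = objective(w)
        small_steps = small_steps + 1 if (prev_obj - obj) / prev_obj < delta else 0
        if small_steps >= 2:          # condition satisfied 2 times successively
            eta *= factor             # decrease eta by a factor of 0.1
            shrinks += 1
            small_steps = 0
        prev_obj = obj
    return w
```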

4. Discussion

Key Tricks

CTJLSVM vs. SGD
- CTJLSVM effectively accelerates SGD.
- CTJLSVM usually achieves a lower objective.
- CTJLSVM's convergence is less sensitive to C.

Comparison (results for the Alpha data set)
- CTJLSVM02: CTJLSVM with the same settings as the final Linear track submissions.
- lsvm04: SGD LSVM with a constant η.
- linearSVM03: SGD LSVM with η decreased per example by the textbook update rule $\eta \leftarrow \eta^{(0)}/(1 + t/\lambda)$.

CTJ accelerates SGD

[Figure]

CTJ takes no extra time

[Figure]

CTJ is less sensitive to C

[Figure]

Performance Tuning

Tuning SGD
- The textbook update $\eta \leftarrow \eta^{(0)}/(1 + t/\lambda)$ is slow!
- A fixed η makes great progress in the beginning.

Tuning CTJ
- How to guarantee convergence? Skip CTJ when it fails to improve the objective, but this check takes time. Instead, we use a smaller κ = 0.7; we usually use κ = 0.9 for EM and CRF.
- Upper bound of the extrapolation rate: $(1 - \gamma)^{-1} = 3.33$ for κ = 0.7, versus $(1 - \gamma)^{-1} = 10.0$ for κ = 0.9.

Wild vs. Linear Track

- Running convert.py to produce SVMlight format may hurt aoprc?
- C = 0.5 yields a much lower aoprc than the required C.
- Data sets may not be linearly separable, so a lower obj leads to a higher aoprc!
- τ (the number of η decreases) in the stopping condition controls obj and aoprc: a large τ leads to lower obj but higher aoprc; a small τ leads to lower aoprc but higher obj.
- We forgot to reduce τ for the wild track... OUCH!!

Tricks That We Didn't Do

We found several tricks that participants could use to take advantage of the scoring rules:
- Compute the objective with a smaller data set to minimize obj.
- Scale C from 10^-5 instead of 10^-4 to minimize avgtime over C.
- Report worse results for small data sizes to boost Effort.
- Carefully choose T1 ... T10 to minimize autime vs. obj.
- Report results on huge real-world data sets.

CTJLSVM can handle large data

Table: Post-Challenge Test Results of CTJLSVM for OCR
[Columns: Method, Data, Obj., CPU Time. Rows: CTJLSVM (T1), CTJLSVM (converged); test rank, test rank (converged), overall test rank. Numeric entries not recovered.]

Notes: CPU time in seconds, calibrated; T1: first reporting time point; no parameter re-tuning for CTJLSVM.

Final Words

- CTJ is a generic method that can be applied to accelerate fixed-point iteration methods in machine learning, including EM, GIS, SGD, etc.
- Accelerating SGD by CTJ for training linear SVM performs reasonably well.
- A cleverer way to adjust the SGD step size and the stopping condition would improve CTJLSVM.
- We thank the organizers and hope there will be a next year.

Thank you for your attention!

This work is supported in part by the Advanced Bioinformatics Core (ABC), National Research Program for Genomic Medicine (NRPGM), National Science Council, Taiwan. We also wish to thank Yuh-Jye Lee for his useful advice on SVM implementation.
