StreamSVM: Linear SVMs and Logistic Regression When Data Does Not Fit In Memory
1 StreamSVM: Linear SVMs and Logistic Regression When Data Does Not Fit In Memory
S.V. N. (Vishy) Vishwanathan, Purdue University and Microsoft
October 9, 2012
S.V. N. Vishwanathan (Purdue, Microsoft) StreamSVM 1 / 41
2 Linear Support Vector Machines: Outline
1. Linear Support Vector Machines
2. StreamSVM
3. Experiments - SVM
4. Logistic Regression
5. Experiments - Logistic Regression
6. Conclusion
3 Linear Support Vector Machines: Binary Classification
[Figure, two frames: two classes of points labeled $y_i = +1$ and $y_i = -1$, then the same data with a separating hyperplane $\{x \mid \langle w, x \rangle = 0\}$]
4 Linear Support Vector Machines: Linear Support Vector Machines
[Figure: maximum-margin separation of the classes $y_i = +1$ and $y_i = -1$; hyperplanes $\{x \mid \langle w, x \rangle = 1\}$, $\{x \mid \langle w, x \rangle = -1\}$, and $\{x \mid \langle w, x \rangle = 0\}$; for points $x_1$, $x_2$ on opposite margin hyperplanes, $\langle w, x_1 - x_2 \rangle = 2$, so the margin width is $2 / \|w\|$]
5 Linear Support Vector Machines: Optimization Problem

$$\min_w \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \max\left(0,\, 1 - y_i \langle w, x_i \rangle\right)$$

[Figure: the hinge loss $\max(0, 1 - \text{margin})$ plotted against the margin]
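As a concrete check of the objective above, the following short NumPy sketch (not from the talk) evaluates the regularized hinge-loss objective:

```python
import numpy as np

def svm_primal_objective(w, X, y, C):
    """Regularized hinge-loss objective from the slide:
    0.5 * ||w||^2 + C * sum_i max(0, 1 - y_i <w, x_i>)."""
    margins = y * (X @ w)                       # y_i <w, x_i> for every i
    hinge = np.maximum(0.0, 1.0 - margins)      # per-example hinge loss
    return 0.5 * (w @ w) + C * hinge.sum()
```

Only examples with margin below 1 contribute to the loss term; correctly classified points far from the boundary contribute nothing.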
6 Linear Support Vector Machines: The Dual Objective

$$\min_\alpha \; D(\alpha) := \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle - \sum_i \alpha_i \quad \text{s.t. } 0 \le \alpha_i \le C$$

- Convex and quadratic objective
- Linear (box) constraints
- $w = \sum_i \alpha_i y_i x_i$
- Each $\alpha_i$ corresponds to an $x_i$
7 Linear Support Vector Machines: Coordinate Descent
[Animation over several frames: minimizing a two-dimensional function by alternately optimizing along the x and y coordinate directions]
8 Linear Support Vector Machines: Coordinate Descent in the Dual

One-dimensional function:

$$\hat{D}(\alpha_t) = \frac{\alpha_t^2}{2} \langle x_t, x_t \rangle + \alpha_t \sum_{i \neq t} \alpha_i y_i y_t \langle x_i, x_t \rangle - \alpha_t + \text{const.} \quad \text{s.t. } 0 \le \alpha_t \le C$$

Solve:

$$\hat{D}'(\alpha_t) = \alpha_t \langle x_t, x_t \rangle + \sum_{i \neq t} \alpha_i y_i y_t \langle x_i, x_t \rangle - 1 = 0$$

$$\alpha_t = \mathrm{median}\left(0,\; C,\; \frac{1 - \sum_{i \neq t} \alpha_i y_i y_t \langle x_i, x_t \rangle}{\langle x_t, x_t \rangle}\right)$$
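The median operation above is just clipping the unconstrained minimizer to [0, C]. A minimal NumPy sketch of the resulting dual coordinate descent loop (an illustration, not the StreamSVM implementation) keeps $w = \sum_i \alpha_i y_i x_i$ up to date so each coordinate update is cheap:

```python
import numpy as np

def dual_cd_linear_svm(X, y, C=1.0, epochs=10, seed=0):
    """Dual coordinate descent for the linear SVM (dense-data sketch).

    Maintaining w = sum_i alpha_i * y_i * x_i lets us compute the
    coordinate gradient y_t <w, x_t> - 1 without touching other rows.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)
    Q = np.einsum("ij,ij->i", X, X)             # Q_ii = <x_i, x_i>
    for _ in range(epochs):
        for t in rng.permutation(n):
            if Q[t] == 0.0:
                continue                        # skip all-zero rows
            g = y[t] * (w @ X[t]) - 1.0         # gradient of the 1-D subproblem
            # median(0, C, .) from the slide == clip to the box [0, C]
            new_alpha = min(max(alpha[t] - g / Q[t], 0.0), C)
            w += (new_alpha - alpha[t]) * y[t] * X[t]
            alpha[t] = new_alpha
    return w, alpha
```

Randomly permuting the coordinate order each epoch is a common variant; cyclic order works as well.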
9 StreamSVM: Outline
1. Linear Support Vector Machines
2. StreamSVM
3. Experiments - SVM
4. Logistic Regression
5. Experiments - Logistic Regression
6. Conclusion
10 StreamSVM: We are Collecting Lots of Data...
[Animation over several frames: more and more labeled points $y_i = +1$, $y_i = -1$ accumulate]
11 StreamSVM: What if Data Does not Fit in Memory?

Idea 1: Block Minimization [Yu et al., KDD 2010]
- Split data into blocks B_1, B_2, ... such that each B_j fits in memory
- Compress and store each block separately
- Load one block of data at a time and optimize only the corresponding α_i's

Idea 2: Selective Block Minimization [Chang and Roth, KDD 2011]
- Split data into blocks B_1, B_2, ... such that each B_j fits in memory
- Compress and store each block separately
- Load one block of data at a time and optimize only the corresponding α_i's
- Retain informative samples from each block in main memory
12 StreamSVM: What are Informative Samples?
[Figure]
13 StreamSVM: Some Observations

SBM and BM are wasteful:
- Both split data into blocks and compress the blocks; this requires reading the entire data at least once (expensive)
- Both pause optimization while a block is loaded into memory

Hardware 101:
- Disk I/O is slower than CPU (sometimes by a factor of 100)
- Random access on HDD is terrible; sequential access is reasonably fast (factor of 10)
- Multi-core processors are becoming commonplace

How can we exploit this?
14 StreamSVM: Our Architecture
[Diagram: a Reader thread streams data from HDD into a working set held in RAM, while a Trainer thread reads the working set and updates the weight vector, also in RAM]
15 StreamSVM: Our Philosophy
Iterate over the data in main memory while streaming data from disk. Evict from main memory primarily those examples that are uninformative.
16 StreamSVM: Reader

for k = 1, ..., max_iter do
  for i = 1, ..., n do
    if |A| = Ω then
      randomly select i' ∈ A
      A ← A \ {i'}
      delete y_{i'}, Q_{i'i'}, x_{i'} from RAM
    end if
    read y_i, x_i from disk
    calculate Q_ii = ⟨x_i, x_i⟩
    store y_i, Q_ii, x_i in RAM
    A ← A ∪ {i}
  end for
  if stopping criterion is met then exit
end for
17 StreamSVM: Trainer

α¹ = 0, w¹ = 0, ε = ∞, ε_new = 0, β = 0.9
while stopping criterion is not met do
  for t = 1, ..., n do
    if |A| > 0.9 Ω then ε ← βε
    randomly select i ∈ A and read y_i, Q_ii, x_i from RAM
    compute ∇_i D := y_i ⟨w^t, x_i⟩ − 1
    if (α_i^t = 0 and ∇_i D > ε) or (α_i^t = C and ∇_i D < −ε) then
      A ← A \ {i} and delete y_i, Q_ii, x_i from RAM
      continue
    end if
    α_i^{t+1} = median(0, C, α_i^t − ∇_i D / Q_ii)
    w^{t+1} = w^t + (α_i^{t+1} − α_i^t) y_i x_i
    ε_new = max(ε_new, |∇_i D|)
  end for
  update stopping criterion; ε ← ε_new
end while
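The trainer's update-and-evict step can be sketched in isolation. This single-threaded NumPy sketch omits the reader thread, the RAM budget Ω, and the geometric shrinking of ε, so it only illustrates one sweep of updates with active eviction over a cached working set A:

```python
import numpy as np

def trainer_pass(X, y, A, alpha, w, C, eps):
    """One trainer sweep over the cached working set A (index set).

    Evicts points whose KKT conditions already hold with slack eps;
    updates alpha and w in place. Returns the largest gradient seen,
    which the caller can use as the next eps.
    """
    rng = np.random.default_rng(0)
    eps_new = 0.0
    for i in rng.permutation(sorted(A)):        # random order over the cache
        q = X[i] @ X[i]                         # Q_ii = <x_i, x_i>
        g = y[i] * (w @ X[i]) - 1.0             # coordinate gradient of the dual
        # active eviction: drop points that are clearly uninformative
        if (alpha[i] == 0.0 and g > eps) or (alpha[i] == C and g < -eps):
            A.discard(int(i))
            continue
        new_a = min(max(alpha[i] - g / q, 0.0), C)
        w += (new_a - alpha[i]) * y[i] * X[i]
        alpha[i] = new_a
        eps_new = max(eps_new, abs(g))
    return eps_new
```

In StreamSVM the evicted slots are refilled by the reader thread streaming fresh points from disk; here the working set simply shrinks.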
18 StreamSVM: Proof of Convergence (Sketch)

Definition (Luo and Tseng Problem):

$$\min_\alpha \; g(E\alpha) + b^\top \alpha \quad \text{subject to } L_i \le \alpha_i \le U_i$$

with $\alpha, b \in \mathbb{R}^n$, $E \in \mathbb{R}^{d \times n}$, $L_i \in [-\infty, \infty)$, $U_i \in (-\infty, \infty]$. The above optimization problem is a Luo and Tseng problem if the following conditions hold:
1. E has no zero columns
2. the set of optimal solutions A* is non-empty
3. the function g is strictly convex and twice continuously differentiable
4. for all optimal solutions α* ∈ A*, ∇²g(Eα*) is positive definite
19 StreamSVM: Proof of Convergence (Sketch)

Lemma (see Theorem 1 of Hsieh et al., ICML 2008): If no data point x_i = 0, then the SVM dual is a Luo and Tseng problem.

Definition (Almost cyclic rule): There exists an integer B ≥ n such that every coordinate is iterated upon at least once every B successive iterations.

Theorem (Theorem 2.1 of Luo and Tseng, JOTA 1992): Let {α^t} be a sequence of iterates generated by a coordinate descent method using the almost cyclic rule. Then α^t converges linearly to an element of A*.
20 StreamSVM: Proof of Convergence (Sketch)

Lemma: If the trainer accesses the cached training data sequentially, and the following conditions hold:
- the trainer is at most κ ≥ 1 times faster than the reading thread, that is, the trainer performs at most κ coordinate updates in the time that it takes the reader to read one training point from disk, and
- a point is never evicted from RAM unless the α_i corresponding to that point has been updated,
then the method conforms to the almost cyclic rule.
21 Experiments - SVM: Outline
1. Linear Support Vector Machines
2. StreamSVM
3. Experiments - SVM
4. Logistic Regression
5. Experiments - Logistic Regression
6. Conclusion
22 Experiments - SVM: Datasets

dataset    | n      | d   | s (%) | n+ : n− | data size
ocr        | 3.5 M  | …   | …     | …       | … GB
dna        | 50 M   | …   | …     | …       | … GB
webspam-t  | 0.35 M | … M | …     | …       | … GB
kddb       | … M    | … M | …     | …       | … GB
23 Experiments - SVM: Does Active Eviction Work?
[Plot: relative function value difference vs. wall clock time (sec, ×10⁴) for the linear SVM on webspam-t, C = 1.0; curves for Random vs. Active eviction]
24 Experiments - SVM: Comparison with Block Minimization
[Plots, four slides: relative function value difference vs. wall clock time (sec, ×10⁴) for StreamSVM, SBM, and BM with C = 1.0, on ocr, webspam-t, kddb, and dna]
25 Experiments - SVM: Effect of Varying C
[Plots, three slides: relative function value difference vs. wall clock time (sec, ×10⁵) for StreamSVM, SBM, and BM on dna, for C = 10.0 and larger values of C]
26 Experiments - SVM: Varying Cache Size
[Plot: relative function value difference vs. wall clock time (sec, ×10⁴) on kddb, C = …, with cache sizes … MB, 1 GB, 4 GB, and 16 GB]
27 Experiments - SVM: Expanding Features
[Plot: relative function value difference vs. wall clock time (sec, ×10⁵) on dna with expanded features, C = …; curves for … GB and 32 GB cache]
28 Logistic Regression: Outline
1. Linear Support Vector Machines
2. StreamSVM
3. Experiments - SVM
4. Logistic Regression
5. Experiments - Logistic Regression
6. Conclusion
29 Logistic Regression: Optimization Problem

$$\min_w \; \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{m} \log\left(1 + \exp\left(-y_i \langle w, x_i \rangle\right)\right)$$

[Figure: the hinge loss and the logistic loss plotted against the margin]
30 Logistic Regression: The Dual Objective

$$\min_\alpha \; D(\alpha) := \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle + \sum_i \alpha_i \log \alpha_i + (C - \alpha_i) \log(C - \alpha_i) \quad \text{s.t. } 0 \le \alpha_i \le C$$

- $w = \sum_i \alpha_i y_i x_i$
- Convex objective (but not quadratic)
- Box constraints
31 Logistic Regression: Coordinate Descent

One-dimensional function:

$$\hat{D}(\alpha_t) := \frac{\alpha_t^2}{2} \langle x_t, x_t \rangle + \alpha_t \sum_{i \neq t} \alpha_i y_i y_t \langle x_i, x_t \rangle + \alpha_t \log \alpha_t + (C - \alpha_t) \log(C - \alpha_t) \quad \text{s.t. } 0 \le \alpha_t \le C$$

Solve: no closed-form solution; a modified Newton method does the job. See Yu, Huang, and Lin (2011) for details.
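Since the one-dimensional subproblem has no closed form, a safeguarded Newton iteration can solve it numerically. The sketch below is a simplified stand-in for the modified Newton method of Yu, Huang, and Lin (2011), with `m` standing for the fixed cross terms $\sum_{i \neq t} \alpha_i y_i y_t \langle x_i, x_t \rangle$:

```python
import numpy as np

def solve_1d_logistic_dual(a0, qtt, m, C, iters=50):
    """Safeguarded Newton for the 1-D logistic-dual subproblem

        min_{0 < a < C}  0.5*qtt*a^2 + m*a + a*log(a) + (C-a)*log(C-a)

    qtt = <x_t, x_t>; m collects the fixed cross terms. The entropy
    terms force the solution strictly inside (0, C)."""
    a = min(max(a0, 1e-8), C - 1e-8)       # start strictly inside (0, C)
    for _ in range(iters):
        g = qtt * a + m + np.log(a / (C - a))   # first derivative
        h = qtt + C / (a * (C - a))             # second derivative, always > 0
        step = g / h
        a_new = a - step
        # safeguard: halve the step until the iterate stays inside (0, C)
        while a_new <= 0.0 or a_new >= C:
            step *= 0.5
            a_new = a - step
        if abs(a_new - a) < 1e-12:
            return a_new
        a = a_new
    return a
```

Because the objective is strictly convex with unbounded derivative at the interval endpoints, the damped Newton iterates stay interior and converge quickly in practice.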
32 Logistic Regression: Determining Importance, SVM Case
Shrinking: evict point i when
α_i^t = 0 and ∇_i D(α) > ε, or
α_i^t = C and ∇_i D(α) < −ε
33 Logistic Regression: Determining Importance, SVM Case
[Plot: the hinge loss and its gradient plotted against the margin]
At optimality the projected gradient vanishes: π_i D(α) = 0 (KKT condition).
Computing ε: ε = max_{i ∈ A} |π_i D(α)|
34 Logistic Regression: Determining Importance, Logistic Regression
[Plot: the logistic loss and its gradient plotted against the margin]
The gradient of the loss at x_i corresponds to α_i; at optimality ∇_i D(α) = 0 (KKT condition).
Computing ε: ε = max_{i ∈ A} |∇_i D(α)|
35 Experiments - Logistic Regression: Outline
1. Linear Support Vector Machines
2. StreamSVM
3. Experiments - SVM
4. Logistic Regression
5. Experiments - Logistic Regression
6. Conclusion
36 Experiments - Logistic Regression: Does Active Eviction Work?
[Plot: relative function value difference vs. wall clock time (sec, ×10⁴) for logistic regression on webspam-t, C = 1.0; curves for Random vs. Active eviction]
37 Experiments - Logistic Regression: Comparison with Block Minimization
[Plot: relative function value difference vs. wall clock time (sec, ×10⁴) on webspam-t, C = 1.0; curves for StreamSVM and BM]
38 Conclusion: Outline
1. Linear Support Vector Machines
2. StreamSVM
3. Experiments - SVM
4. Logistic Regression
5. Experiments - Logistic Regression
6. Conclusion
39 Conclusion: Things I Did Not Talk About / Work in Progress

StreamSVM:
- Amortizing the cost of training different models
- Other storage hierarchies (e.g. solid state drives)

Extensions to other problems:
- Multiclass problems
- Structured output prediction
- Distributed version of the algorithm
40 Conclusion: References
Shin Matsushima, S. V. N. Vishwanathan, and Alexander J. Smola. Linear Support Vector Machines via Dual Cached Loops. KDD 2012.
Code for StreamSVM available from streamsvm.html
41 Conclusion: Joint Work With
[Diagram: the reading thread performs sequential reads of the dataset on disk and random-access loads into the cached working set in RAM; the training thread performs random-access reads of the working set and updates the weight vector in RAM]
More informationOutline. Basic concepts: SVM and kernels SVM primal/dual problems. Chih-Jen Lin (National Taiwan Univ.) 1 / 22
Outline Basic concepts: SVM and kernels SVM primal/dual problems Chih-Jen Lin (National Taiwan Univ.) 1 / 22 Outline Basic concepts: SVM and kernels Basic concepts: SVM and kernels SVM primal/dual problems
More informationSupport Vector Machines for Classification: A Statistical Portrait
Support Vector Machines for Classification: A Statistical Portrait Yoonkyung Lee Department of Statistics The Ohio State University May 27, 2011 The Spring Conference of Korean Statistical Society KAIST,
More informationA Magiv CV Theory for Large-Margin Classifiers
A Magiv CV Theory for Large-Margin Classifiers Hui Zou School of Statistics, University of Minnesota June 30, 2018 Joint work with Boxiang Wang Outline 1 Background 2 Magic CV formula 3 Magic support vector
More informationMaster 2 MathBigData. 3 novembre CMAP - Ecole Polytechnique
Master 2 MathBigData S. Gaïffas 1 3 novembre 2014 1 CMAP - Ecole Polytechnique 1 Supervised learning recap Introduction Loss functions, linearity 2 Penalization Introduction Ridge Sparsity Lasso 3 Some
More informationLinear Models for Regression CS534
Linear Models for Regression CS534 Example Regression Problems Predict housing price based on House size, lot size, Location, # of rooms Predict stock price based on Price history of the past month Predict
More informationSelected Topics in Optimization. Some slides borrowed from
Selected Topics in Optimization Some slides borrowed from http://www.stat.cmu.edu/~ryantibs/convexopt/ Overview Optimization problems are almost everywhere in statistics and machine learning. Input Model
More informationICS-E4030 Kernel Methods in Machine Learning
ICS-E4030 Kernel Methods in Machine Learning Lecture 3: Convex optimization and duality Juho Rousu 28. September, 2016 Juho Rousu 28. September, 2016 1 / 38 Convex optimization Convex optimisation This
More informationStatistical Machine Learning from Data
Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Support Vector Machines Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique
More informationMotivation Subgradient Method Stochastic Subgradient Method. Convex Optimization. Lecture 15 - Gradient Descent in Machine Learning
Convex Optimization Lecture 15 - Gradient Descent in Machine Learning Instructor: Yuanzhang Xiao University of Hawaii at Manoa Fall 2017 1 / 21 Today s Lecture 1 Motivation 2 Subgradient Method 3 Stochastic
More informationBlock stochastic gradient update method
Block stochastic gradient update method Yangyang Xu and Wotao Yin IMA, University of Minnesota Department of Mathematics, UCLA November 1, 2015 This work was done while in Rice University 1 / 26 Stochastic
More informationCS60021: Scalable Data Mining. Large Scale Machine Learning
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 CS60021: Scalable Data Mining Large Scale Machine Learning Sourangshu Bhattacharya Example: Spam filtering Instance
More informationSupport Vector Machines
Support Vector Machines Here we approach the two-class classification problem in a direct way: We try and find a plane that separates the classes in feature space. If we cannot, we get creative in two
More informationSerious limitations of (single-layer) perceptrons: Cannot learn non-linearly separable tasks. Cannot approximate (learn) non-linear functions
BACK-PROPAGATION NETWORKS Serious limitations of (single-layer) perceptrons: Cannot learn non-linearly separable tasks Cannot approximate (learn) non-linear functions Difficult (if not impossible) to design
More informationAccelerating Stochastic Optimization
Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz
More informationBasis Expansion and Nonlinear SVM. Kai Yu
Basis Expansion and Nonlinear SVM Kai Yu Linear Classifiers f(x) =w > x + b z(x) = sign(f(x)) Help to learn more general cases, e.g., nonlinear models 8/7/12 2 Nonlinear Classifiers via Basis Expansion
More informationARock: an algorithmic framework for asynchronous parallel coordinate updates
ARock: an algorithmic framework for asynchronous parallel coordinate updates Zhimin Peng, Yangyang Xu, Ming Yan, Wotao Yin ( UCLA Math, U.Waterloo DCO) UCLA CAM Report 15-37 ShanghaiTech SSDS 15 June 25,
More informationPASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent
Cho-Jui Hsieh Hsiang-Fu Yu Inderjit S. Dhillon Department of Computer Science, The University of Texas, Austin, TX 787, USA CJHSIEH@CS.UTEXAS.EDU ROFUYU@CS.UTEXAS.EDU INDERJIT@CS.UTEXAS.EDU Abstract Stochastic
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1394 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1394 1 / 34 Table
More informationAn Improved Conjugate Gradient Scheme to the Solution of Least Squares SVM
An Improved Conjugate Gradient Scheme to the Solution of Least Squares SVM Wei Chu Chong Jin Ong chuwei@gatsby.ucl.ac.uk mpeongcj@nus.edu.sg S. Sathiya Keerthi mpessk@nus.edu.sg Control Division, Department
More informationA Distributed Solver for Kernelized SVM
and A Distributed Solver for Kernelized Stanford ICME haoming@stanford.edu bzhe@stanford.edu June 3, 2015 Overview and 1 and 2 3 4 5 Support Vector Machines and A widely used supervised learning model,
More informationLarge-scale Collaborative Ranking in Near-Linear Time
Large-scale Collaborative Ranking in Near-Linear Time Liwei Wu Depts of Statistics and Computer Science UC Davis KDD 17, Halifax, Canada August 13-17, 2017 Joint work with Cho-Jui Hsieh and James Sharpnack
More informationStochastic Subgradient Method
Stochastic Subgradient Method Lingjie Weng, Yutian Chen Bren School of Information and Computer Science UC Irvine Subgradient Recall basic inequality for convex differentiable f : f y f x + f x T (y x)
More informationNeural Networks. CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington
Neural Networks CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Perceptrons x 0 = 1 x 1 x 2 z = h w T x Output: z x D A perceptron
More informationCSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18
CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$
More informationIntroduction to Support Vector Machines
Introduction to Support Vector Machines Andreas Maletti Technische Universität Dresden Fakultät Informatik June 15, 2006 1 The Problem 2 The Basics 3 The Proposed Solution Learning by Machines Learning
More informationBig Data Analytics: Optimization and Randomization
Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.
More informationHomework 4. Convex Optimization /36-725
Homework 4 Convex Optimization 10-725/36-725 Due Friday November 4 at 5:30pm submitted to Christoph Dann in Gates 8013 (Remember to a submit separate writeup for each problem, with your name at the top)
More information