Lock-Free Approaches to Parallelizing Stochastic Gradient Descent
1 Lock-Free Approaches to Parallelizing Stochastic Gradient Descent. Benjamin Recht, Department of Computer Sciences, University of Wisconsin-Madison, with Feng Niu, Christopher Ré, and Stephen Wright.
2 minimize_x f(x) = Σ_{j=1}^N f_j(x)

Incremental Stochastic Gradient Descent: for each k, choose j_k and compute

    x_{k+1} = x_k − α_k ∇f_{j_k}(x_k)    (α_k = stepsize)

Robbins and Monro (1950); Least Mean Squares + Recursive Estimators (1960s-1990s); Back Propagation in Neural Networks (1980s); NIPS (2007ish)
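A minimal sketch of this update loop (Python, scalar case; the names sgd, grad_fns, and stepsize are illustrative, not from the talk):

    import random

    def sgd(grad_fns, x0, stepsize, iters, seed=0):
        # Incremental SGD: x_{k+1} = x_k - alpha_k * grad f_{j_k}(x_k)
        rng = random.Random(seed)
        x = x0
        for k in range(1, iters + 1):
            j = rng.randrange(len(grad_fns))      # choose j_k uniformly
            x = x - stepsize(k) * grad_fns[j](x)  # take the step
        return x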
3 Example: computing the mean

    minimize Σ_{k=1}^4 (x − k)²,   x_0 = 0,   stepsize = 1/(2k)

    x_1 = x_0 − (x_0 − 1)   = 1
    x_2 = x_1 − (x_1 − 2)/2 = 1.5
    x_3 = x_2 − (x_2 − 3)/3 = 2
    x_4 = x_3 − (x_3 − 4)/4 = 2.5

In general, if we minimize Σ_{k=1}^N (x − z_k)², SGD returns the running mean:

    x_N = (1/N) Σ_{k=1}^N z_k
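The four iterates above can be reproduced with the sketch below (a quick check, assuming f_k(x) = (x − z_k)² and stepsize 1/(2k)):

    z = [1.0, 2.0, 3.0, 4.0]
    x = 0.0
    for k, zk in enumerate(z, start=1):
        # gradient of (x - z_k)^2 is 2(x - z_k); stepsize is 1/(2k)
        x = x - (1.0 / (2 * k)) * 2 * (x - zk)
        print(k, x)  # prints 1.0, 1.5, 2.0, 2.5: the running means
    # x now equals sum(z)/len(z) = 2.5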
4 minimize Σ_{k=0}^9 (cos(πk/10) x_1 + sin(πk/10) x_2)² = 5x_1² + 5x_2²,   stepsize = 1/2

    ∇f_j(x) = 2(c_j x_1 + s_j x_2)(c_j, s_j)ᵀ,   c_j = cos(πj/10), s_j = sin(πj/10)

[Two trajectory plots in the (x_1, x_2) plane: choose directions in order vs. choose a direction uniformly with replacement]
5 Convergence Rates

    minimize_x f(x) = Σ_{j=1}^N f_j(x)

Choose a direction uniformly with replacement.

ASSUME: cI ⪯ ∇²f ⪯ LI,   ‖∇f_j(x)‖ ≤ M,   D_0 = ‖x_0 − x_opt‖²

Diminishing stepsizes (Nemirovski et al., 2009, or 1983): with γ_k = Θ/(2ck) and Θ > 1,

    E‖x_k − x_opt‖² ≤ (1/k) max{ Θ²M² / (c²(4Θ − 4)), D_0 }

Constant stepsizes: with γ_k = ϑ/(2c) and ϑ < 1, the error contracts linearly to a noise floor proportional to ϑM²/c², reached after roughly k ≥ (1/ϑ) log(4D_0 c² / (ϑM²)) iterations; reducing the stepsize by a factor β after each such epoch (each reduction costing a further log(2/β)/(4(1 − β)) factor of iterations) then drives the error to zero.
6 minimize_{x ∈ R^d} f(x) = Σ_{j=1}^N f_j(x)

    Algorithm   Time per iteration    Error after T iterations   Error after N items
    Newton      O(N d² + d³)          C_N^(2^T)                  C_N²
    Gradient    O(N d)                C_G^T                      C_G
    SGD         O(d) (or constant)    C_S / T                    C_S / N

(One pass over the N items is a single Newton or gradient iteration, but N SGD iterations.)
7 SGD and BIG Data

    minimize_x f(x) = Σ_{j=1}^N f_j(x)

Stochastic Gradient Descent: for each k, choose j_k and compute

    x_{k+1} = x_k − α_k ∇f_{j_k}(x_k)    (α_k = stepsize)

Ideal for big data analysis: robustness against noise in the data, rapid learning rates, small and predictable memory footprint.
8 Is SGD inherently serial? How to parallelize SGD?

- Master/Worker (Bertsekas and Tsitsiklis, 1985)
- Round Robin (Langford et al., 2009)
- Average Runs (Zinkevich et al., 2010)
- Average Gradients (Duchi et al., Dekel et al., 2010)

All require massive overhead due to lock contention and synchronization. Don't lock! Don't communicate! What happens when we run parallel instances of SGD without locks?
9 Sparse function: f(x) = Σ_{e∈E} f_e(x_e)

Hypergraph G = (V, E): V = the coordinates of x ∈ R^D; v ∈ e for e ∈ E if f_e depends on x_v.

Graph statistics:

    Ω := max_{e∈E} |e|                           (maximum edge size)
    Δ := max_v |{e ∈ E : v ∈ e}| / |E|           (maximum degree)
    ρ := max_{e∈E} |{ê ∈ E : ê ∩ e ≠ ∅}| / |E|   (maximum edge degree)
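For a concrete hypergraph, the three statistics could be computed as below (a sketch; graph_stats is an illustrative helper, with edges given as sets of coordinate indices):

    def graph_stats(edges):
        # Returns (Omega, Delta, rho) for a hypergraph given as a list of edges.
        E = [set(e) for e in edges]
        n_edges = len(E)
        omega = max(len(e) for e in E)  # maximum edge size
        verts = set().union(*E)
        # maximum (normalized) number of edges containing any one coordinate
        delta = max(sum(v in e for e in E) for v in verts) / n_edges
        # maximum (normalized) number of edges intersecting an edge
        rho = max(sum(bool(e & f) for f in E) for e in E) / n_edges
        return omega, delta, rho

    print(graph_stats([{0, 1}, {1, 2}, {3}]))  # Omega=2, Delta=2/3, rho=2/3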
10 Sparse Support Vector Machines (sparse vectors z_α)

    minimize_x Σ_{α∈E} max(1 − y_α xᵀz_α, 0) + λ‖x‖₂²

Rewrite the regularizer over the support e_α of each example, with d_u the edge degrees:

    minimize_x Σ_{α∈E} [ max(1 − y_α xᵀz_α, 0) + λ Σ_{u∈e_α} x_u² / d_u ]

    Ω = max_α ‖z_α‖_0,   Δ = max_u d_u / |E|,   ρ ∈ (0, 1]
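One SGD step for this formulation touches only the nonzero coordinates of the sampled example (a sketch; svm_step and its arguments are illustrative, with the example z_α stored as index/value pairs and d[u] the edge degree of coordinate u):

    def svm_step(x, idx, vals, y, lam, d, gamma):
        # One step on max(1 - y x^T z, 0) + lam * sum_{u in e} x_u^2 / d_u.
        margin = y * sum(x[u] * zu for u, zu in zip(idx, vals))
        for u, zu in zip(idx, vals):
            g = 2.0 * lam * x[u] / d[u]  # per-coordinate regularization term
            if margin < 1.0:             # hinge subgradient term
                g -= y * zu
            x[u] -= gamma * g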
11 Matrix Completion

    M ≈ L R*,   M: k×n, L: k×r, R*: r×n,   entries specified on the set E

    minimize_X Σ_{(u,v)∈E} (X_uv − M_uv)² + µ‖X‖_*

Idea: approximate X ≈ LRᵀ:

    minimize_{(L,R)} Σ_{(u,v)∈E} (L_u R_vᵀ − M_uv)² + Σ_u µ_u ‖L_u‖_F² + Σ_v µ_v ‖R_v‖_F²

    Ω = 2r,   Δ = O(log(n)/n),   ρ = O(log(n)/n)
12 Graph Cuts (image segmentation, entity resolution, topic modeling)

    minimize_x Σ_{(u,v)∈E} w_uv ‖x_u − x_v‖₁
    subject to 1_Kᵀ x_v = 1, x_v ≥ 0, for v = 1, …, D

    Ω = 2K,   Δ = d_max / |E|,   ρ = 2 d_max / |E|
13 HOGWILD!

Run SGD in parallel without locks. Each processor independently runs:

    1. Sample e from E
    2. Read the current state of x_e
    3. For v in e: x_v ← x_v − α [∇f_e(x_e)]_v

Only assume atomicity of the single-component update x_v ← x_v + a.

Issues: updates can be very old; processors can overwrite each other's work.
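A minimal HOGWILD!-style sketch using Python threads over a shared NumPy vector (illustrative only: CPython's GIL hides most of the races a C implementation would see, and grad_e is an assumed callback returning the gradient components for the coordinates in e):

    import threading
    import numpy as np

    def hogwild(x, edges, grad_e, gamma, n_threads=4, steps=10000, seed=0):
        # Every thread runs lock-free SGD on the shared iterate x.
        def worker(tid):
            rng = np.random.default_rng(seed + tid)
            for _ in range(steps):
                e = edges[rng.integers(len(edges))]  # 1. sample e from E
                g = grad_e(x, e)                     # 2. read current x_e
                for v, gv in zip(e, g):              # 3. per-component update,
                    x[v] -= gamma * gv               #    with no locks anywhere
        threads = [threading.Thread(target=worker, args=(t,))
                   for t in range(n_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return x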
14 Convergence Theory

    Ω := max_{e∈E} |e|
    Δ := max_v |{e ∈ E : v ∈ e}| / |E|
    ρ := max_{e∈E} |{ê ∈ E : ê ∩ e ≠ ∅}| / |E|

ASSUME: cI ⪯ ∇²f ⪯ LI,   ‖∇f_e(x)‖ ≤ M,   D_0 = ‖x_0 − x_opt‖²

ASSUME: the longest delay between when an update is computed and when its memory is read is τ.

Choose

    k ≥ 2M² (1 + 6τρ + 6τ²ΩΔ^{1/2}) log(L D_0 / ε) / (c² ε)

Then after k gradient updates with an appropriate choice of constant stepsize, we have E[‖x_k − x_opt‖²] ≤ ε.
15 Hogs gone wild!

          data set   size (GB)   ρ         Δ         time (s)   speedup
    SVM   RCV1       0.9         4.4E-01   1.0E+00   …          …
    MC    Netflix    1.5         2.5E-03   2.3E-03   …          …
    MC    KDD        3.9         3.0E-03   1.8E-03   …          …
    MC    JUMBO      30          2.6E-07   1.4E-07   9,453      …
    CUTS  DBLife     0.003       8.6E-03   4.3E-03   …          …
    CUTS  Abdomen    18          9.2E-04   9.2E-04   1,181      4.1

Experiments run on a 12 core machine: 10 cores for gradients, 2 cores for data shuffling.
16 Speedups

[Three plots of speedup vs. number of splits, comparing Hogwild, AIG, and RR: SVM on RCV1, MC on Netflix, CUTS on Abdomen]

Experiments run on a 12 core machine: 10 cores for gradients, 2 cores for data shuffling.
17 JELLYFISH

SGD for matrix factorizations. Example:

    minimize_X Σ_{(u,v)∈E} (X_uv − M_uv)² + µ‖X‖_*

Idea: approximate X ≈ LRᵀ:

    minimize_{(L,R)} Σ_{(u,v)∈E} (L_u R_vᵀ − M_uv)² + Σ_u µ_u ‖L_u‖_F² + Σ_v µ_v ‖R_v‖_F²

Step 1: pick (u,v) ∈ E and compute the residual e = L_u R_vᵀ − M_uv
Step 2: take a gradient step:

    L_u ← (1 − γµ_u) L_u − γ e R_v
    R_v ← (1 − γµ_v) R_v − γ e L_u
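The two steps in code (a sketch; L and R are assumed to be NumPy arrays whose rows are the factors L_u and R_v):

    import numpy as np

    def mc_step(L, R, u, v, M_uv, mu_u, mu_v, gamma):
        # One SGD step on (L_u R_v^T - M_uv)^2 + mu_u ||L_u||^2 + mu_v ||R_v||^2.
        e = L[u] @ R[v] - M_uv  # Step 1: residual
        Lu_old = L[u].copy()    # keep the old L_u for the R_v update
        L[u] = (1 - gamma * mu_u) * L[u] - gamma * e * R[v]  # Step 2
        R[v] = (1 - gamma * mu_v) * R[v] - gamma * e * Lu_old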
18 Mixture of hundreds of models, including nuclear norm (a.k.a. SVD).
19 JELLYFISH

Observation: with-replacement sampling = poor locality.
Idea: bias the sample to improve locality.

Algorithm: shuffle the data, then:
    1. Process {L_1 R_1, L_2 R_2, L_3 R_3} in parallel
    2. Process {L_1 R_2, L_2 R_3, L_3 R_1} in parallel
    3. Process {L_1 R_3, L_2 R_1, L_3 R_2} in parallel

Big wins: no locks, no overwrites, strong locality.
20 Example Optimization

Shuffle all the rows and columns. Apply block partitioning. Train on each block independently. Repeat. A sketch of the resulting schedule is below.
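A sketch of that schedule (illustrative names: blocks[i][j] holds the ratings whose shuffled row falls in chunk i and column in chunk j, and process_block runs serial SGD over one block):

    from concurrent.futures import ThreadPoolExecutor

    def jellyfish_epoch(blocks, process_block, p):
        # In round r, the p blocks {(i, (i + r) % p)} share no rows or columns,
        # so they can be trained in parallel with no locks and no overwrites.
        with ThreadPoolExecutor(max_workers=p) as pool:
            for r in range(p):
                jobs = [pool.submit(process_block, blocks[i][(i + r) % p])
                        for i in range(p)]
                for job in jobs:
                    job.result()  # finish this round before starting the next

For p = 3 this reproduces the three rounds listed on the JELLYFISH slide above.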
21 JELLYFISH is 100x faster than standard solvers, a 25% speedup over HOGWILD! on a 12 core machine, and a 3x speedup on a 40 core machine. Netflix data set: 100M examples, 17,770 rows, 480,189 columns.
22 Theory treats without-replacement sampling like deterministic sampling. In practice, without-replacement sampling is always faster. Open question: why?
23 Simplest problem? Least squares:

    minimize_x Σ_{k=1}^N (a_kᵀ x − b_k)²

Assume overdetermined: a_kᵀ x_opt = b_k for all k.

Fixed order:

    x_N = x_opt + ∏_{k=1}^N (I − γ a_k a_kᵀ) (x_0 − x_opt)

With replacement:

    E[x_N] = x_opt + (I − (γ/N) Σ_{k=1}^N a_k a_kᵀ)^N (x_0 − x_opt)

Which is better?
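The two orderings can be compared numerically (a small illustrative experiment on a random consistent system, not from the talk):

    import numpy as np

    rng = np.random.default_rng(0)
    N, d, gamma = 50, 10, 0.05
    A = rng.standard_normal((N, d))
    x_opt = rng.standard_normal(d)
    b = A @ x_opt  # consistent system: a_k^T x_opt = b_k for all k

    def run(order):
        x = np.zeros(d)
        for k in order:
            x -= gamma * (A[k] @ x - b[k]) * A[k]  # step on (a_k^T x - b_k)^2
        return np.linalg.norm(x - x_opt)

    fixed = run(range(N))
    with_repl = np.mean([run(rng.integers(N, size=N)) for _ in range(200)])
    print(fixed, with_repl)  # error after one pass: fixed order vs. with replacement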
24 Simple question

Given A_1, …, A_N ⪰ 0, D×D, is it true that

    ‖∏_{i=1}^N A_i‖ ≤ ‖(1/N) Σ_{i=1}^N A_i‖^N ?

If the A_i are scalars, this is always true by the arithmetic-geometric mean inequality. Is it true for matrices? True for N = 2 (Bhatia and Kittaneh, 1990).
25 What about generically?

Given A_1, …, A_N ⪰ 0, D×D: is ‖∏_{i=1}^N A_i‖ ≤ ‖(1/N) Σ_{i=1}^N A_i‖^N true if the A_i are sampled from the normal distribution,

    A_i = g_i g_iᵀ,   g_i ~ N(0, (1/D) I_D) ?

Writing Q_N = ∏_{i=1}^N A_i, the expected trace E Tr(Q_N) can be evaluated in closed form as a series in D/N involving the hypergeometric function ₂F₁(·, ·; 2; D/N).
26 Simple question

Given A_1, …, A_N ⪰ 0, D×D, is it true that

    ‖∏_{i=1}^N A_i‖ ≤ ‖(1/N) Σ_{i=1}^N A_i‖^N ?

If the A_i are scalars, this is always true by the arithmetic-geometric mean inequality. Is it true for matrices?

- True for N = 2 (Bhatia and Kittaneh, 1990).
- True for generic A_i.
- Straightforward bound: ‖∏_{i=1}^N A_i‖ ≤ D^N ‖(1/N) Σ_{i=1}^N A_i‖^N.
27 Proof of the straightforward bound:

    log ‖∏_{i=1}^N A_i‖ ≤ Σ_{i=1}^N log ‖A_i‖                (submultiplicativity)
                        ≤ Σ_{i=1}^N log Tr(A_i)              (‖A‖ ≤ Tr(A) for A ⪰ 0)
                        ≤ N log( (1/N) Σ_{i=1}^N Tr(A_i) )   (concavity of log)
                        = N log Tr( (1/N) Σ_{i=1}^N A_i )    (linearity of trace)
                        ≤ N log( D ‖(1/N) Σ_{i=1}^N A_i‖ )   (Tr(A) ≤ D‖A‖)

Exponentiating gives ‖∏_{i=1}^N A_i‖ ≤ D^N ‖(1/N) Σ_{i=1}^N A_i‖^N.
28 Given A_1, …, A_N ⪰ 0, D×D, is it true that ‖∏_{i=1}^N A_i‖ ≤ ‖(1/N) Σ_{i=1}^N A_i‖^N?

No! Counterexample:

    A_k = I + [ cos(2πk/N)   sin(2πk/N) ]
              [ sin(2πk/N)  −cos(2πk/N) ]

Then (1/N) Σ_{i=1}^N A_i = I, while

    ‖∏_{i=1}^N A_i‖ = 2^N cos(π/N)^{N−1},

which saturates the bound ‖∏ A_i‖ ≤ D^N ‖(1/N) Σ A_i‖^N (here D = 2). Similar counterexamples exist for all N and D > 2.
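The counterexample is easy to check numerically (a sketch):

    import numpy as np

    def check(N):
        theta = 2 * np.pi * np.arange(1, N + 1) / N
        A = [np.eye(2) + np.array([[np.cos(t), np.sin(t)],
                                   [np.sin(t), -np.cos(t)]]) for t in theta]
        lhs = np.linalg.norm(np.linalg.multi_dot(A), 2)  # ||A_1 A_2 ... A_N||
        rhs = np.linalg.norm(sum(A) / N, 2) ** N         # ||(1/N) sum A_i||^N = 1
        predicted = 2.0 ** N * np.cos(np.pi / N) ** (N - 1)
        return lhs, rhs, predicted

    print(check(10))  # lhs matches 2^N cos(pi/N)^(N-1), about 651, while rhs = 1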
29 minimize Σ_{k=0}^9 (cos(πk/10) x_1 + sin(πk/10) x_2)² = 5x_1² + 5x_2²,   stepsize = 1/2

    ∇f_j(x) = 2(c_j x_1 + s_j x_2)(c_j, s_j)ᵀ,   c_j = cos(πj/10), s_j = sin(πj/10)

[Two trajectory plots in the (x_1, x_2) plane: choose directions in order vs. choose a direction uniformly with replacement]
30 But what about on average?

Given A_1, …, A_N ⪰ 0, D×D, is it true that

    ‖(1/N!) Σ_{σ∈S_N} ∏_{i=1}^N A_{σ(i)}‖ ≤ ‖(1/N) Σ_{i=1}^N A_i‖^N ?

Using the counterexample above (A_k = I plus the reflection with angle 2πk/N, so that (1/N) Σ_{i=1}^N A_i = I), the permutation average (1/N!) Σ_{σ∈S_N} ∏_{i=1}^N A_{σ(i)} evaluates in closed form to a hypergeometric (₂F₃) expression that is O(1), so no violation is observed. Yet to find a counterexample for averaging! Is there a noncommutative arithmetic-geometric mean inequality?
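The averaged quantity can be probed on the same counterexample by sampling permutations (a Monte Carlo sketch; the exact hypergeometric evaluation is not reproduced here):

    import numpy as np

    def avg_product_norm(N, trials=2000, seed=0):
        rng = np.random.default_rng(seed)
        theta = 2 * np.pi * np.arange(1, N + 1) / N
        A = [np.eye(2) + np.array([[np.cos(t), np.sin(t)],
                                   [np.sin(t), -np.cos(t)]]) for t in theta]
        total = np.zeros((2, 2))
        for _ in range(trials):
            perm = rng.permutation(N)  # average the product over random orders
            total += np.linalg.multi_dot([A[i] for i in perm])
        return np.linalg.norm(total / trials, 2)

    print(avg_product_norm(10))  # noisy estimate, but far below the ~651
                                 # attained by the in-order product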
31 Summary

Practical lessons: Don't lock! Locality matters.

Theory: bias the sampling for better data access. Understanding is only beginning here:
- Noncommutative arithmetic-geometric mean in full generality
- Biased orderings of SGD
- Automatic identification of locality
32 Acknowledgements

See … for all references.

Reports:
- HOGWILD!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent. Niu, Recht, Ré, and Wright. 2011. code: …
- Parallel Stochastic Gradient Algorithms for Large-Scale Matrix Completion. Recht and Ré. 2011. code: …