Efficient Serial and Parallel Coordinate Descent Methods for Huge-Scale Convex Optimization
Martin Takáč, The University of Edinburgh

Based on:
- P. Richtárik and M. Takáč. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function. ERGO Tech. report.
- P. Richtárik and M. Takáč. Efficient serial and parallel coordinate descent methods for huge-scale truss topology design. ERGO Tech. report.

October 06
Outline
1 Randomized Coordinate Descent Method
  - Theory
  - GPU implementation
  - Numerical examples
2 Facing the Multicore-Challenge II (a few insights from a conference)
3 Discussion
  - StarSs
  - Intel Cilk Plus
  - SIMD and auto-vectorization
Randomized Coordinate Descent Method
Problem Formulation

  min_{x ∈ ℝ^N} F(x) := f(x) + Ψ(x),   (P)

where
- f is convex and smooth,
- Ψ is convex, simple and block-separable,
- N is huge (N ≥ 10^6).

  Ψ(x)                          meaning
  0                             smooth minimization
  λ‖x‖₁                         sparsity-inducing term
  indicator function of a set   block-separable constraints

General Iterative Algorithm
1 choose initial point x := x₀ ∈ ℝ^N
2 iterate:
  - choose direction d := d(x)
  - choose step length α(d, x)
  - set x := x + αd

More sophisticated direction => fewer iterations.
More sophisticated direction => more work per iteration.
Why Coordinate Descent?

- a very old optimization method (B. O'Donnell 1967)
- can move only in the directions {e₁, e₂, ..., e_N}, where e_i = (0, 0, ..., 0, 1, 0, ..., 0)ᵀ

Methods for choosing the direction:
- cyclic (or essentially cyclic)
- coordinate with the maximal directional derivative (greedy)
- RANDOM (WHY?)

Applications:
- Sparse regression, LASSO
- Truss topology design
- (Sparse) group LASSO
- Elastic Net
- L₁-regularized linear support vector machines (SVM) with various (smooth) loss functions
- ...
Randomized Coordinate Descent: a 2D example

  f(x) = ½ xᵀ( )x − ( )ᵀx,   x₀ = (0 0)ᵀ

What do we need? Generate a random coordinate (we need an n-dimensional coin). Luckily, in our case n = 2.
[A sequence of slides animates the iterations of randomized coordinate descent on the 2D example f(x) = ½ xᵀ( )x − ( )ᵀx, x₀ = (0 0)ᵀ; the plots are not reproduced.]
Block Structure of ℝ^N

Any vector x ∈ ℝ^N can be written uniquely as

  x = Σ_{i=1}^n U_i x^(i),   where x^(i) = U_iᵀ x ∈ ℝ^{N_i}.
Block Norms and Induced Norms

Block norms (norms in ℝ^{N_i}): we equip ℝ^{N_i} with a pair of conjugate Euclidean norms:

  ‖t‖_(i) = ⟨B_i t, t⟩^{1/2},   ‖t‖*_(i) = ⟨B_i⁻¹ t, t⟩^{1/2},   t ∈ ℝ^{N_i},

where B_i ∈ ℝ^{N_i × N_i} is a positive definite matrix.

Induced norms (norms in ℝ^N): for fixed positive scalars w₁, ..., w_n and x ∈ ℝ^N we define

  ‖x‖_W = [ Σ_{i=1}^n w_i (‖x^(i)‖_(i))² ]^{1/2},   where W = Diag(w₁, ..., w_n).

The conjugate norm is defined by

  ‖y‖*_W = max_{‖x‖_W ≤ 1} ⟨y, x⟩ = [ Σ_{i=1}^n w_i⁻¹ (‖y^(i)‖*_(i))² ]^{1/2}.
Assumptions on f and Ψ

The gradient of f is block coordinate-wise Lipschitz with positive constants L₁, ..., L_n:

  ‖∇_i f(x + U_i t) − ∇_i f(x)‖*_(i) ≤ L_i ‖t‖_(i),   ∀ x ∈ ℝ^N, t ∈ ℝ^{N_i}, i = 1, ..., n,   (1)

where ∇_i f(x) := (∇f(x))^(i) = U_iᵀ ∇f(x) ∈ ℝ^{N_i}.

An important consequence of (1) is the following inequality:

  f(x + U_i t) ≤ f(x) + ⟨∇_i f(x), t⟩ + (L_i/2) ‖t‖²_(i).   (2)

We denote L = Diag(L₁, ..., L_n).

Ψ is block-separable:

  Ψ(x) = Σ_{i=1}^n Ψ_i(x^(i)).
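When Ψ ≡ 0, the right-hand side of (2) is a simple quadratic in t, and minimizing it over t yields the smooth-case update in closed form. A one-line derivation, using ‖t‖²_(i) = ⟨B_i t, t⟩:

```latex
\min_{t \in \mathbb{R}^{N_i}} \; \langle \nabla_i f(x), t\rangle + \frac{L_i}{2}\langle B_i t, t\rangle
\;\;\Longrightarrow\;\;
\nabla_i f(x) + L_i B_i t = 0
\;\;\Longrightarrow\;\;
t^{*} = -\frac{1}{L_i}\, B_i^{-1} \nabla_i f(x).
```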
UCDC: Uniform Coordinate Descent for Composite Functions - Algorithm

An upper bound on F(x + U_i t) is readily available:

  F(x + U_i t) = f(x + U_i t) + Ψ(x + U_i t)
              ≤ f(x) + ⟨∇_i f(x), t⟩ + (L_i/2)‖t‖²_(i) + Ψ_i(x^(i) + t) + Σ_{j ≠ i} Ψ_j(x^(j))   [by (2)]
              =: V_i(x, t).

Algorithm 1 UCDC(x₀)
  for k = 0, 1, 2, ...
    choose block coordinate i ∈ {1, 2, ..., n} with probability 1/n
    T^(i)(x_k) = arg min { V_i(x_k, t) : t ∈ ℝ^{N_i} }
    x_{k+1} = x_k + U_i T^(i)(x_k)
  end for

- In each step we minimize a function of N_i variables.
- We assume Ψ is simple: T^(i)(x_k) can be computed in closed form.
- If Ψ ≡ 0, then T^(i)(x_k) = −(1/L_i) B_i⁻¹ ∇_i f(x_k).
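As a concrete instance of the Ψ ≡ 0 case, the following sketch (an illustration, not code from the talk; the matrix Q, vector b, iteration budget and function name are assumptions) runs UCDC with one-dimensional blocks and B_i = 1 on the quadratic f(x) = ½xᵀQx − bᵀx, whose coordinate Lipschitz constants are L_i = Q_ii:

```cpp
#include <random>
#include <vector>

// Minimal UCDC sketch for Psi == 0 with 1-dimensional blocks (B_i = 1), so
// T(i)(x_k) = -(1/L_i) * grad_i f(x_k).  Applied to the dense quadratic
// f(x) = 0.5*x'Qx - b'x with coordinate Lipschitz constants L_i = Q_ii.
std::vector<double> ucdc_quadratic(const std::vector<std::vector<double>>& Q,
                                   const std::vector<double>& b,
                                   int iters, unsigned seed) {
    const int n = static_cast<int>(b.size());
    std::vector<double> x(n, 0.0);
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> coin(0, n - 1);  // the "n-dimensional coin"
    for (int k = 0; k < iters; ++k) {
        int i = coin(gen);                       // uniform random coordinate
        double g = -b[i];                        // grad_i f(x) = (Qx - b)_i
        for (int j = 0; j < n; ++j) g += Q[i][j] * x[j];
        x[i] -= g / Q[i][i];                     // x <- x + U_i T(i)(x)
    }
    return x;
}
```

Each iteration touches only one row of Q, which is what makes a step cheap relative to a full gradient step.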
The Main Result

Theorem 1. Choose an initial point x₀ and a confidence level ρ ∈ (0, 1). Further, let the target accuracy ε > 0 and the iteration counter k be chosen in either of the following two ways:

(i) ε < F(x₀) − F* and

  k ≥ (2n max{R²_L(x₀), F(x₀) − F*} / ε) (1 + log(1/ρ)) + 2 − 2n max{R²_L(x₀), F(x₀) − F*} / (F(x₀) − F*),

(ii) ε < min{R²_L(x₀), F(x₀) − F*} and

  k ≥ (2n R²_L(x₀) / ε) log((F(x₀) − F*) / (ερ)).

If x_k is the random point generated by UCDC(x₀) as applied to F, then

  P(F(x_k) − F* ≤ ε) ≥ 1 − ρ.

Here R²_W(x₀) = max_x max_{x* ∈ X*} { ‖x − x*‖²_W : F(x) ≤ F(x₀) }.
Performance on Strongly Convex Functions

Theorem 2. Let F be strongly convex with respect to ‖·‖_L with convexity parameter µ > 0, and choose an accuracy level ε > 0, a confidence level 0 < ρ < 1, and

  k ≥ (n / γ_µ) log((F(x₀) − F*) / (ρε)),

where γ_µ is given by

  γ_µ = µ/4 if µ ≤ 2,   γ_µ = 1 − 1/µ otherwise.   (3)

If x_k is the random point generated by UCDC(x₀), then P(F(x_k) − F* ≤ ε) ≥ 1 − ρ.
Comparison with other coordinate descent methods for composite minimization

          Lip      Ψ          block  coordinate choice  E(F(x_k)) − F* ≤ ε          1 iter
[YT09]    L(∇f)    separable  Yes    greedy             O(nL(∇f)‖x₀ − x*‖² / ε)     O(n²)
[SST09]   L_i = β  ‖·‖₁       No     1/n                O(nβ‖x₀ − x*‖² / ε)         cheap
[ST10]    L(∇f)    ‖·‖₁       No     cyclic             O(nL(∇f)‖x₀ − x*‖² / ε)     cheap
Alg 1     L_i      separable  Yes    1/n                O(n‖x₀ − x*‖²_L / ε)        cheap

[YT09] S. Yun and P. Tseng. A block-coordinate gradient descent method for linearly constrained nonsmooth separable optimization. Journal of Optimization Theory and Applications, 140.
[SST09] S. Shalev-Shwartz and A. Tewari. Stochastic methods for l₁-regularized loss minimization. In: ICML 2009.
[ST10] A. Saha and A. Tewari. On the finite time convergence of cyclic coordinate descent methods. arXiv, 2009.
Comparison with Nesterov's randomized coordinate descent

... for achieving P(F(x_k) − F* ≤ ε) ≥ 1 − ρ in the general case (F not strongly convex):

Alg    Ψ          p_i    norms    complexity                                     objective
[N10]  0          1/n    Euclid.  (8nR²_L(x₀)/ε) log(4(f(x₀) − f*)/(ερ))         f(x) + (ε/(8R²_L(x₀)))‖x − x₀‖²_L
[N10]  0          L_i/S  Euclid.  (2n + 8SR²_I(x₀)/ε) log(4(f(x₀) − f*)/(ερ))    f(x) + (ε/(8R²_I(x₀)))‖x − x₀‖²_I
Alg 1  0          > 0    general  (2R²_{LP⁻¹}(x₀)/ε)(1 + log(1/ρ)) − 2           f
Alg 1  separable  1/n    general  (2nM/ε)(1 + log(1/ρ))                          F
Alg 2  separable  1/n    general  (2nR²_L(x₀)/ε) log((F(x₀) − F*)/(ερ))          F

Symbols used in the table: S = Σ_i L_i, M = max{R²_L(x₀), F(x₀) − F*}.

[N10] Yu. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. CORE Discussion Paper #2010/2.
Application: Sparse Regression

Problem formulation & Lipschitz constants:

  min_{x ∈ ℝ^N} ½‖Ax − b‖² + γ‖x‖₁

- L_i = a_iᵀ a_i (easy to compute)
- Note: the Lipschitz constant of ∇f is λ_max(AᵀA) (hard to compute)

UCDC for Sparse Regression
  choose x₀ = 0 and set g₀ = Ax₀ − b = −b
  for k = 0, 1, 2, ... do
    choose i_k = i ∈ {1, 2, ..., N} uniformly
    ∇_i f(x_k) = a_iᵀ g_k
    T^(i)(x_k) =
      −(∇_i f(x_k) + γ)/L_i   if x_k^(i) − (∇_i f(x_k) + γ)/L_i > 0,
      −(∇_i f(x_k) − γ)/L_i   if x_k^(i) − (∇_i f(x_k) − γ)/L_i < 0,
      −x_k^(i)                otherwise,
    x_{k+1} = x_k + e_i T^(i)(x_k)
    g_{k+1} = g_k + a_i T^(i)(x_k)
  end for
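A dense, self-contained re-implementation of the loop above may make the closed-form T^(i) concrete. This is an illustrative sketch only: the data layout, names and sizes are assumptions, and the talk's instances are huge and sparse.

```cpp
#include <random>
#include <vector>

// UCDC for 0.5*||Ax - b||^2 + gamma*||x||_1 with L_i = a_i' a_i and the
// residual g = Ax - b maintained incrementally (g_{k+1} = g_k + a_i * t).
struct Lasso {
    std::vector<std::vector<double>> cols;  // A stored column-wise
    std::vector<double> b;
};

std::vector<double> ucdc_lasso(const Lasso& p, double gamma, int iters,
                               unsigned seed) {
    const int m = static_cast<int>(p.b.size());
    const int n = static_cast<int>(p.cols.size());
    std::vector<double> x(n, 0.0);
    std::vector<double> g(m);                   // x0 = 0  =>  g = -b
    for (int r = 0; r < m; ++r) g[r] = -p.b[r];
    std::vector<double> L(n, 0.0);              // L_i = a_i' a_i
    for (int i = 0; i < n; ++i)
        for (double v : p.cols[i]) L[i] += v * v;
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> coin(0, n - 1);
    for (int k = 0; k < iters; ++k) {
        int i = coin(gen);
        double gi = 0.0;                        // grad_i f(x) = a_i' g
        for (int r = 0; r < m; ++r) gi += p.cols[i][r] * g[r];
        double t;                               // closed-form T(i)(x)
        if (x[i] - (gi + gamma) / L[i] > 0)      t = -(gi + gamma) / L[i];
        else if (x[i] - (gi - gamma) / L[i] < 0) t = -(gi - gamma) / L[i];
        else                                     t = -x[i];
        x[i] += t;                              // x_{k+1} = x_k + e_i t
        for (int r = 0; r < m; ++r) g[r] += p.cols[i][r] * t;  // g += a_i t
    }
    return x;
}
```

Keeping g up to date is what makes one iteration cost O(‖a_i‖₀) rather than a full matrix-vector product.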
Instance 1 run on CPU: A ∈ ℝ^{m×N}, ‖A‖₀ = 10^8.
[Progress table: columns k/n, F(x_k) − F*, ‖x_k‖₀, time [s]; the numeric entries are not reproduced.]
Instance 2 run on CPU: A ∈ ℝ^{m×N}.
[Progress table: columns k/n, F(x_k) − F*, ‖x_k‖₀, time [s]; the numeric entries are not reproduced.]
Acceleration on a Graphics Processing Unit (GPU)

         CPU        GPU
cores    1 core     448 cores
clock    3.00 GHz   1.15 GHz
peak     3 GFlops   1 TFlop

GPU (Graphics Processing Unit): performs a single operation in parallel on multiple data (e.g., vector addition).
GPU Architecture

- Tesla C2050: 14 multiprocessors
- each multiprocessor has 32 cores (each at 1.15 GHz)
- every core in one multiprocessor has to perform the same operation on different data => SIMD (Single Instruction, Multiple Data)
Huge and Sparse Data = Small Chance of Collision
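The point of this slide can be quantified with the birthday-problem bound. The sketch below is not from the talk: it assumes τ parallel threads each draw one of n coordinates uniformly at random, and computes the probability that at least two threads pick the same coordinate.

```cpp
// Birthday-problem bound: probability that at least two of tau threads,
// each picking a coordinate uniformly from n, pick the same coordinate.
// For huge sparse problems (n >> tau) this probability is tiny, which is
// why parallel coordinate updates rarely collide.
double collision_probability(int tau, double n) {
    double p_no_collision = 1.0;
    for (int i = 1; i < tau; ++i)
        p_no_collision *= 1.0 - static_cast<double>(i) / n;
    return 1.0 - p_no_collision;
}
```

For example, 448 threads (one per GPU core) over n = 10^6 coordinates collide with probability below 10%, and the chance shrinks further as n grows.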
Race Conditions and Atomic Operations
CPU (C) vs GPU (CUDA): step-by-step comparison, n = ...
[Comparison figure not reproduced.]
Acceleration by GPU
Naive Implementation

Implementation:
- each thread chooses its coordinate independently,
- matrix A is naturally stored in the CSC sparse format.
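For reference, a minimal sketch of the CSC (compressed sparse column) layout mentioned above, together with the column-residual dot product a thread would compute; the struct and field names are illustrative assumptions.

```cpp
#include <vector>

// CSC layout: column i of A occupies entries col_ptr[i] .. col_ptr[i+1]-1
// of (row_idx, val), so a thread working on coordinate i can scan its
// column a_i directly.
struct CscMatrix {
    std::vector<int> col_ptr;   // size n+1
    std::vector<int> row_idx;   // size nnz
    std::vector<double> val;    // size nnz
};

// Dot product of column i with a dense residual vector g:  a_i' g.
double column_dot(const CscMatrix& A, int i, const std::vector<double>& g) {
    double s = 0.0;
    for (int p = A.col_ptr[i]; p < A.col_ptr[i + 1]; ++p)
        s += A.val[p] * g[A.row_idx[p]];
    return s;
}
```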
Naive Implementation

Data access:
- is NOT coalesced (is not sequential at all),
- is misaligned (performance decreases).
Misaligned copy
Better Implementation

Memory access improvement:
- all threads in ONE WARP generate the same random number, which points to the beginning of a memory bank (from where data will be read),
- each thread then computes its coordinate i (one warp works on ALIGNED and SEQUENTIAL coordinates).
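A host-side sketch of the improved scheme described above (the warp size, the aligned-bank interpretation, and the requirement that n be a multiple of the warp size are assumptions; on the GPU the shared draw would come from per-warp RNG state):

```cpp
#include <random>
#include <vector>

// Instead of each thread drawing its own coordinate, the warp draws one
// aligned base index and lane l works on base + l, so the 32 lanes touch
// 32 consecutive, aligned coordinates.  Assumes n is a multiple of 32.
std::vector<int> warp_aligned_coordinates(int n, unsigned seed) {
    const int kWarpSize = 32;
    std::mt19937 gen(seed);
    // one shared random draw per warp: an aligned base (multiple of 32)
    std::uniform_int_distribution<int> bank(0, n / kWarpSize - 1);
    int base = bank(gen) * kWarpSize;
    std::vector<int> coords(kWarpSize);
    for (int lane = 0; lane < kWarpSize; ++lane)
        coords[lane] = base + lane;  // sequential, aligned accesses
    return coords;
}
```

The trade-off is that the 32 coordinates are no longer independent draws, but the memory traffic becomes coalesced.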
Pipelining

C++ code:

  int offset = ....;
  for (int i = offset; i < N; i++) {
      A[i] = A[i + offset] * s;
  }

The slide compares two variants of this loop: on the left, offset is a constant; on the right, offset is a variable, so the compiler cannot optimize as well as before.
Facing the Multicore-Challenge II (a few insights from a conference)
SIMD and Auto-Vectorization

CPU: 2.2 GHz, 4 flops/cycle, 12 cores => 2.2 × 4 × 12 = 105.6 GFlops (AMD Magny-Cours)
GPU: ≈ 1 TFlop (Tesla C2050/C2070 GPU Computing Processor)

"...the memory subsystem (or at least the cache) must be able to sustain sufficient bandwidth to keep all units busy..."
Intel Compiler 12+

- -guide=n — analyzes the code and gives some recommendations
- -vec-report n — reports on loop vectorization; shows which loops were vectorized and which were not
- #pragma ivdep — vectorize, ignoring assumed data dependences
- -ax — auto-vectorization

C++ code:

  int C[N];
  ....
  for (int j = 0; j < C[7]; j++) {  // C[7] == 2034
      A[j] = 2 * B[j];
  }

With the dependence hint:

  int C[N];
  ....
  #pragma ivdep
  for (int j = 0; j < C[7]; j++) {  // C[7] == 2034
      A[j] = 2 * B[j];
  }
StarSs - Introduction

StarSs - Renaming

StarSs - GPU

[Four figure-only slides; the images are not reproduced.]
Intel Cilk Plus

- Write parallel programs using a simple model: with only three keywords to learn, C and C++ developers move quickly into the parallel programming domain.
- Utilize data parallelism through simple array notations that include elemental function capabilities.
- Leverage existing serial tools: the serial semantics of Intel Cilk Plus allow you to debug in a familiar serial debugger.
- Scale for the future: the runtime system operates smoothly on systems with hundreds of cores.
Cilk - Arrays

  __declspec(vector) float saxpy_elemental(float a, float x, float y)
  {
      return (a * x + y);
  }

  // Here is how to invoke such vectorized elemental
  // functions using array notation:
  z[:] = saxpy_elemental(a, b[:], c[:]);
  a[:] = sin(b[:]);    // vector math library function
  a[:] = pow(b[:], c); // b[:]^c via a vector math library function
Cilk - Fibonacci - Create new thread

  cilk int fib(int n)
  {
      if (n < 2) return n;
      else
      {
          int x, y;
          x = spawn fib(n - 1);
          y = spawn fib(n - 2);
          sync;
          return (x + y);
      }
  }
Any Questions? Thank you for your attention.
References

- Georg Hager and Gerhard Wellein: Introduction to High Performance Computing for Scientists and Engineers
- Jesus Labarta, Rosa Badia, Eduard Ayguade, Marta Garcia, Vladimir Marjanovic: Hybrid Parallel Programming with MPI+SMPSs
Proximal Newton Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: primal-dual interior-point method Given the problem min x subject to f(x) h i (x) 0, i = 1,... m Ax = b where f, h
More informationStatistical Data Mining and Machine Learning Hilary Term 2016
Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes
More informationarxiv: v1 [hep-lat] 7 Oct 2010
arxiv:.486v [hep-lat] 7 Oct 2 Nuno Cardoso CFTP, Instituto Superior Técnico E-mail: nunocardoso@cftp.ist.utl.pt Pedro Bicudo CFTP, Instituto Superior Técnico E-mail: bicudo@ist.utl.pt We discuss the CUDA
More informationStochastic Gradient Descent with Variance Reduction
Stochastic Gradient Descent with Variance Reduction Rie Johnson, Tong Zhang Presenter: Jiawen Yao March 17, 2015 Rie Johnson, Tong Zhang Presenter: JiawenStochastic Yao Gradient Descent with Variance Reduction
More informationStochastic Optimization: First order method
Stochastic Optimization: First order method Taiji Suzuki Tokyo Institute of Technology Graduate School of Information Science and Engineering Department of Mathematical and Computing Sciences JST, PRESTO
More informationCoordinate Update Algorithm Short Course The Package TMAC
Coordinate Update Algorithm Short Course The Package TMAC Instructor: Wotao Yin (UCLA Math) Summer 2016 1 / 16 TMAC: A Toolbox of Async-Parallel, Coordinate, Splitting, and Stochastic Methods C++11 multi-threading
More informationA Massively Parallel Eigenvalue Solver for Small Matrices on Multicore and Manycore Architectures
A Massively Parallel Eigenvalue Solver for Small Matrices on Multicore and Manycore Architectures Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences,
More information10-725/ Optimization Midterm Exam
10-725/36-725 Optimization Midterm Exam November 6, 2012 NAME: ANDREW ID: Instructions: This exam is 1hr 20mins long Except for a single two-sided sheet of notes, no other material or discussion is permitted
More informationIs the test error unbiased for these programs?
Is the test error unbiased for these programs? Xtrain avg N o Preprocessing by de meaning using whole TEST set 2017 Kevin Jamieson 1 Is the test error unbiased for this program? e Stott see non for f x
More informationWelcome to MCS 572. content and organization expectations of the course. definition and classification
Welcome to MCS 572 1 About the Course content and organization expectations of the course 2 Supercomputing definition and classification 3 Measuring Performance speedup and efficiency Amdahl s Law Gustafson
More informationLinear Convergence under the Polyak-Łojasiewicz Inequality
Linear Convergence under the Polyak-Łojasiewicz Inequality Hamed Karimi, Julie Nutini, Mark Schmidt University of British Columbia Linear of Convergence of Gradient-Based Methods Fitting most machine learning
More informationParallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption. Langshi CHEN 1,2,3 Supervised by Serge PETITON 2
1 / 23 Parallel Asynchronous Hybrid Krylov Methods for Minimization of Energy Consumption Langshi CHEN 1,2,3 Supervised by Serge PETITON 2 Maison de la Simulation Lille 1 University CNRS March 18, 2013
More informationFaster Machine Learning via Low-Precision Communication & Computation. Dan Alistarh (IST Austria & ETH Zurich), Hantian Zhang (ETH Zurich)
Faster Machine Learning via Low-Precision Communication & Computation Dan Alistarh (IST Austria & ETH Zurich), Hantian Zhang (ETH Zurich) 2 How many bits do you need to represent a single number in machine
More informationJ. Sadeghi E. Patelli M. de Angelis
J. Sadeghi E. Patelli Institute for Risk and, Department of Engineering, University of Liverpool, United Kingdom 8th International Workshop on Reliable Computing, Computing with Confidence University of
More informationGradient Descent. Ryan Tibshirani Convex Optimization /36-725
Gradient Descent Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: canonical convex programs Linear program (LP): takes the form min x subject to c T x Gx h Ax = b Quadratic program (QP): like
More informationGeneralized Power Method for Sparse Principal Component Analysis
Generalized Power Method for Sparse Principal Component Analysis Peter Richtárik CORE/INMA Catholic University of Louvain Belgium VOCAL 2008, Veszprém, Hungary CORE Discussion Paper #2008/70 joint work
More informationl 1 and l 2 Regularization
David Rosenberg New York University February 5, 2015 David Rosenberg (New York University) DS-GA 1003 February 5, 2015 1 / 32 Tikhonov and Ivanov Regularization Hypothesis Spaces We ve spoken vaguely about
More informationAccelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization
Accelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization Jinghui Chen Department of Systems and Information Engineering University of Virginia Quanquan Gu
More informationMachine Learning in the Data Revolution Era
Machine Learning in the Data Revolution Era Shai Shalev-Shwartz School of Computer Science and Engineering The Hebrew University of Jerusalem Machine Learning Seminar Series, Google & University of Waterloo,
More informationProximal Minimization by Incremental Surrogate Optimization (MISO)
Proximal Minimization by Incremental Surrogate Optimization (MISO) (and a few variants) Julien Mairal Inria, Grenoble ICCOPT, Tokyo, 2016 Julien Mairal, Inria MISO 1/26 Motivation: large-scale machine
More informationMachine Learning CSE546 Carlos Guestrin University of Washington. October 7, Efficiency: If size(w) = 100B, each prediction is expensive:
Simple Variable Selection LASSO: Sparse Regression Machine Learning CSE546 Carlos Guestrin University of Washington October 7, 2013 1 Sparsity Vector w is sparse, if many entries are zero: Very useful
More informationLinear Convergence under the Polyak-Łojasiewicz Inequality
Linear Convergence under the Polyak-Łojasiewicz Inequality Hamed Karimi, Julie Nutini and Mark Schmidt The University of British Columbia LCI Forum February 28 th, 2017 1 / 17 Linear Convergence of Gradient-Based
More informationMore Science per Joule: Bottleneck Computing
More Science per Joule: Bottleneck Computing Georg Hager Erlangen Regional Computing Center (RRZE) University of Erlangen-Nuremberg Germany PPAM 2013 September 9, 2013 Warsaw, Poland Motivation (1): Scalability
More informationCSC 576: Gradient Descent Algorithms
CSC 576: Gradient Descent Algorithms Ji Liu Department of Computer Sciences, University of Rochester December 22, 205 Introduction The gradient descent algorithm is one of the most popular optimization
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning Proximal-Gradient Mark Schmidt University of British Columbia Winter 2018 Admin Auditting/registration forms: Pick up after class today. Assignment 1: 2 late days to hand in
More informationGPU Acceleration of Cutoff Pair Potentials for Molecular Modeling Applications
GPU Acceleration of Cutoff Pair Potentials for Molecular Modeling Applications Christopher Rodrigues, David J. Hardy, John E. Stone, Klaus Schulten, Wen-Mei W. Hwu University of Illinois at Urbana-Champaign
More informationGradient descent. Barnabas Poczos & Ryan Tibshirani Convex Optimization /36-725
Gradient descent Barnabas Poczos & Ryan Tibshirani Convex Optimization 10-725/36-725 1 Gradient descent First consider unconstrained minimization of f : R n R, convex and differentiable. We want to solve
More informationMinimizing Finite Sums with the Stochastic Average Gradient Algorithm
Minimizing Finite Sums with the Stochastic Average Gradient Algorithm Joint work with Nicolas Le Roux and Francis Bach University of British Columbia Context: Machine Learning for Big Data Large-scale
More informationJulian Merten. GPU Computing and Alternative Architecture
Future Directions of Cosmological Simulations / Edinburgh 1 / 16 Julian Merten GPU Computing and Alternative Architecture Institut für Theoretische Astrophysik Zentrum für Astronomie Universität Heidelberg
More information