Parallel Coordinate Optimization


Parallel Coordinate Optimization
Julie Nutini, MLRG - Spring Term, March 6th, 2018

Coordinate Descent in 2D
Contours of a function F : R² → R. Goal: find the minimizer of F.
[Figure: successive coordinate descent iterates x^k, x^{k+2}, x^{k+3}, ..., each step moving along a single coordinate direction, converging to the minimizer x*.]

Coordinate Descent
Update a single coordinate i at each iteration:
x_i^{k+1} = x_i^k − α_k ∇_i f(x^k)
Easy to implement, low memory requirements, cheap iteration costs. Suitable for large-scale optimization (dimension n is large): certain smooth (unconstrained) problems, and non-smooth problems with separable constraints/regularizers (e.g., ℓ1-regularization, bound constraints). Faster than gradient descent if iterations are n times cheaper. Adaptable to distributed settings. For truly huge-scale problems, it is absolutely necessary to parallelize.
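
To make the update concrete, here is a minimal sketch (mine, not from the slides) of randomized coordinate descent on the quadratic loss f(x) = ½‖Ax − b‖², where the step 1/L_i with L_i = ‖A_{:,i}‖² is also the exact minimizer along coordinate i:

```python
import numpy as np

def coordinate_descent(A, b, iters=10_000, seed=0):
    """Randomized coordinate descent for f(x) = 0.5 * ||Ax - b||^2."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    x = np.zeros(n)
    r = A @ x - b                 # maintain the residual r = Ax - b
    L = (A ** 2).sum(axis=0)      # coordinate-wise Lipschitz constants L_i = ||A_{:,i}||^2
    for _ in range(iters):
        i = rng.integers(n)       # pick a coordinate uniformly at random
        g_i = A[:, i] @ r         # gradient component grad_i f(x) = A_{:,i}^T (Ax - b)
        step = g_i / L[i]
        x[i] -= step              # x_i <- x_i - (1/L_i) grad_i f(x)
        r -= step * A[:, i]       # cheap residual update instead of recomputing Ax - b
    return x
```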

Problem
Consider the optimization problem
min_{x ∈ R^n} F(x) := f(x) + g(x),
where f is a loss function, convex (smooth or nonsmooth), and g is a regularizer, convex (smooth or nonsmooth) and separable.

Regularizer Examples
g(x) = Σ_{i=1}^n g_i(x_i),  x = (x_1, x_2, ..., x_n)^T
No regularizer: g_i(x_i) ≡ 0.
Weighted ℓ1-norm: g_i(x_i) = λ_i |x_i| (λ_i > 0), e.g., LASSO.
Weighted ℓ2-norm: g_i(x_i) = λ_i (x_i)² (λ_i > 0).
Box constraints: g_i(x_i) = 0 if x_i ∈ X_i, +∞ otherwise, e.g., SVM dual.
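
Separability of g is what keeps the coordinate subproblems one-dimensional: the proximal step for g splits into n independent scalar problems. A small illustration (mine, not from the slides) of the coordinate-wise proximal maps for the examples above:

```python
import numpy as np

def prox_l1(v, lam):
    """Soft-thresholding: prox of g_i(x) = lam * |x|."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def prox_weighted_l2(v, lam):
    """Prox of g_i(x) = lam * x^2: simple shrinkage v / (1 + 2*lam)."""
    return v / (1.0 + 2.0 * lam)

def prox_box(v, lo, hi):
    """Prox of the indicator of [lo, hi]: projection onto the box."""
    return np.clip(v, lo, hi)

print(prox_l1(np.array([-2.0, 0.3, 1.5]), lam=0.5))        # [-1.5, 0., 1.]
print(prox_weighted_l2(2.0, lam=0.5), prox_box(2.0, 0.0, 1.0))
```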

Loss Examples
Quadratic loss: f(x) = ½‖Ax − y‖²₂ = ½ Σ_{j=1}^m (A_{j:}x − y_j)²  [Bradley et al., 2011]
Logistic loss: f(x) = Σ_{j=1}^m log(1 + exp(−y_j A_{j:}x))  [Richtárik & Takáč, 2011b, 2013a]
Square hinge loss: f(x) = ½ (max{0, 1 − y_j A_{j:}x})²  [Takáč et al., 2013]
L-infinity: f(x) = ‖Ax − y‖_∞ = max_{1≤j≤m} |A_{j:}x − y_j|
L1-regression: f(x) = ‖Ax − y‖₁ = Σ_{j=1}^m |A_{j:}x − y_j|  [Fercoq & Richtárik, 2013]
Exponential loss: f(x) = log((1/m) Σ_{j=1}^m exp(y_j A_{j:}x))

Parallel Coordinate Descent
Embarrassingly parallel if the objective is separable: speedup equal to the number of processors, τ. For partially-separable objectives: assign the i-th processor the task of updating the i-th component of x; each processor communicates its x_i^+ to the processors that require it. The i-th processor needs the current value of x_j only if ∇_i f or ∇²_{ii} f depends on x_j. Parallel implementations are suitable when the dependency graph is sparse.

Dependency Graph
Given a fixed serial ordering of updates, those in red (updates that do not use each other's results) can be done in parallel. (i, j) is an arc of the dependency graph iff the update function h_j depends on x_i.
Update order {1, 2, 3, 4}:
x_1^{k+1} = h_1(x_1^k, x_3^k)
x_2^{k+1} = h_2(x_1^{k+1}, x_2^k)
x_3^{k+1} = h_3(x_2^{k+1}, x_3^k, x_4^k)
x_4^{k+1} = h_4(x_2^{k+1}, x_4^k)
Better update order {1, 3, 4, 2}:
x_1^{k+1} = h_1(x_1^k, x_3^k)
x_3^{k+1} = h_3(x_2^k, x_3^k, x_4^k)
x_4^{k+1} = h_4(x_2^k, x_4^k)
x_2^{k+1} = h_2(x_1^{k+1}, x_2^k)
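
One simple way (my own illustration, not from the slides) to see how much parallelism a given update order exposes is to sweep through the order and start a new parallel group whenever an update reads a coordinate already written in the current group; reads of coordinates written later in the same group use the old values, matching the serial semantics above:

```python
def parallel_groups(order, reads):
    """Greedily split a serial update order into groups that can run in parallel.

    reads[j] = set of coordinates that the update function h_j reads.
    """
    groups, current, written = [], [], set()
    for j in order:
        if reads[j] & written:         # j needs a value produced in the current group
            groups.append(current)     # so the current group must finish first
            current, written = [], set()
        current.append(j)
        written.add(j)
    groups.append(current)
    return groups

# Dependencies from the slide: h_1 reads {1,3}, h_2 reads {1,2}, h_3 reads {2,3,4}, h_4 reads {2,4}.
reads = {1: {1, 3}, 2: {1, 2}, 3: {2, 3, 4}, 4: {2, 4}}
print(parallel_groups([1, 2, 3, 4], reads))  # [[1], [2], [3, 4]]  -> three sequential stages
print(parallel_groups([1, 3, 4, 2], reads))  # [[1, 3, 4], [2]]    -> two stages, three updates at once
```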

Parallel Coordinate Descent
Synchronous parallelism: divide the iterate updates between processors, followed by a synchronization step. Very slow for large-scale problems (wait for the slowest processor).
Asynchronous parallelism: each processor has access to x, chooses an index i, loads the components of x that are needed to compute the gradient component ∇_i f(x), then updates the i-th component x_i. No attempt to coordinate or synchronize with other processors. The algorithm is always using a stale x, and convergence results restrict how stale it can be. Many numerical results actually use an asynchronous implementation and ignore the synchronization step required by the theory.

Totally Asynchronous Algorithm
Definition: An algorithm is totally asynchronous if (1) each index i ∈ {1, 2, ..., n} of x is updated at infinitely many iterations, and (2) if ν_j^k denotes the iteration at which component j of the vector x̂^k was last updated, then ν_j^k → ∞ as k → ∞ for all j = 1, 2, ..., n. There is no condition on how stale x̂_j is; it is only required that it will be updated eventually.
Theorem (Bertsekas and Tsitsiklis, 1989): Suppose the mapping T(x) := x − α∇f(x) for some α > 0 satisfies
‖T(x) − x*‖_∞ ≤ η ‖x − x*‖_∞ for some η ∈ (0, 1).
Then if we set α_k ≡ α in Algorithm 7, the sequence {x^k} converges to x*.
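
A quick numerical illustration (mine, assuming a strongly convex quadratic f(x) = ½xᵀQx − bᵀx): here T(x) − x* = (I − αQ)(x − x*), so the ℓ∞ contraction factor is η = ‖I − αQ‖_∞, which can be checked directly before iterating x ← T(x).

```python
import numpy as np

# Diagonally dominant quadratic: f(x) = 0.5 x^T Q x - b^T x, minimizer x* solves Qx = b.
Q = np.array([[4.0, 1.0, 0.5],
              [1.0, 3.0, 0.5],
              [0.5, 0.5, 2.0]])
b = np.array([1.0, 2.0, 3.0])
x_star = np.linalg.solve(Q, b)

alpha = 0.2
eta = np.abs(np.eye(3) - alpha * Q).sum(axis=1).max()   # ||I - alpha*Q||_inf = 0.8 < 1
print("contraction factor:", eta)

x = np.zeros(3)
for _ in range(200):
    x = x - alpha * (Q @ x - b)                         # x <- T(x)
print("max error:", np.max(np.abs(x - x_star)))         # essentially zero
```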

Partly Asynchronous Algorithm
There is no convergence rate for the totally asynchronous setting, given the weak assumptions on x̂^k, and the ℓ∞ contraction assumption on the mapping T is quite strong. Liu et al. (2015) assume that no component of x̂^k is older than a nonnegative integer τ (the maximum delay) at any k. The maximum delay τ is related to the number of processors (an indicator of the potential parallelism): if all processors complete their updates at approximately the same rate, τ is a small constant multiple of the number of processors. They obtain linear convergence if essential strong convexity holds, and sublinear convergence for general convex functions. Near-linear speedup if the number of processors is O(n^{1/2}) in unconstrained optimization, or O(n^{1/4}) in the separable-constrained case.
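
As a toy illustration only (mine; this is not the algorithm analyzed by Liu et al.), one can simulate the bounded-delay model by letting each update read a copy of x that is at most tau_delay iterations old:

```python
import numpy as np

def bounded_delay_cd(A, b, tau_delay=10, iters=20_000, seed=0):
    """Coordinate descent where each update reads an iterate up to tau_delay steps stale."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    L = (A ** 2).sum(axis=0)
    x = np.zeros(n)
    history = [x.copy()]                      # recent iterates (bounded-delay model)
    for _ in range(iters):
        delay = rng.integers(min(tau_delay, len(history)))
        x_stale = history[-1 - delay]         # stale read
        i = rng.integers(n)
        g_i = A[:, i] @ (A @ x_stale - b)     # gradient component from stale information
        x[i] -= g_i / (2.0 * L[i])            # damped step to tolerate the staleness
        history.append(x.copy())
        if len(history) > tau_delay:
            history.pop(0)
    return x

rng = np.random.default_rng(1)
A, b = rng.standard_normal((50, 20)), rng.standard_normal(50)
x_hat = bounded_delay_cd(A, b)
print(np.linalg.norm(A.T @ (A @ x_hat - b)))  # gradient norm, small if the run converged
```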

Question
Under what structural assumptions does parallelization lead to acceleration?

Convergence of Randomized Coordinate Descent
In R^n, randomized coordinate descent with uniform selection requires O(n · ξ(ε)) iterations, where
ξ(ε) = log(1/ε) for strongly convex F,
ξ(ε) = 1/ε for smooth or simple nonsmooth F,
ξ(ε) = 1/ε² for difficult nonsmooth F.
When dealing with big data, the factor n is the one we care about.

The Parallelization Dream
Serial (1 coordinate per iteration): O(n ξ(ε)) iterations. Parallel (τ coordinates per iteration): O((n/τ) ξ(ε)) iterations.
What do we actually get? O((nβ/τ) ξ(ε)) iterations; we want β = O(1). The achievable β depends on the extent to which we can add up the individual updates, i.e., on properties of F and on the selection of coordinates at each iteration.

Naive Parallelization
Consider the function f(x_1, x_2) = (x_1 + x_2 − 1)². Just compute the update for more/all coordinates and then add up the updates?
[Figure: starting from x^k = (0, 0) with f(0, 0) = 1, each coordinate-wise update reaches the line f(x_1, x_2) = 0, but adding the two updates lands at (1, 1) with f(1, 1) = 1, so the combined step overshoots and makes no progress. NO GOOD!]

Naive Parallelization
Consider the function f(x_1, x_2) = (x_1 + x_2 − 1)². What about averaging the updates?
[Figure: starting from x^k, averaging the two coordinate-wise updates lands exactly on a minimizer x*. SOLVED!]

Naive Parallelization
Consider the function f(x_1, x_2) = (x_1 − 1)² + (x_2 − 1)². What about averaging the updates? COULD BE TOO CONSERVATIVE...
[Figure: starting from x^k, each averaged step x^{k+1}, x^{k+2}, ... covers only half of the remaining distance to the minimizer, and so on.]
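
A tiny numerical check (mine, not from the slides) of the two combining rules on the two examples above, using exact coordinate-wise minimization from x = (0, 0) (which for these quadratics equals the step −∇_i f(x)/L_i):

```python
import numpy as np

def coord_updates(grad, L, x):
    """Per-coordinate steps h_i = -grad_i f(x) / L_i, each computed independently (in parallel)."""
    return -grad(x) / L

# Non-separable: f(x1, x2) = (x1 + x2 - 1)^2, grad f = 2(x1 + x2 - 1) * [1, 1], L_i = 2.
f1 = lambda x: (x[0] + x[1] - 1.0) ** 2
g1 = lambda x: 2.0 * (x[0] + x[1] - 1.0) * np.ones(2)
h = coord_updates(g1, np.array([2.0, 2.0]), np.zeros(2))
print(f1(h), f1(h / 2))   # summing: f(1,1) = 1, no progress; averaging: f(0.5,0.5) = 0, solved

# Separable: f(x1, x2) = (x1 - 1)^2 + (x2 - 1)^2, grad f = 2(x - 1), L_i = 2.
f2 = lambda x: np.sum((x - 1.0) ** 2)
g2 = lambda x: 2.0 * (x - 1.0)
h = coord_updates(g2, np.array([2.0, 2.0]), np.zeros(2))
print(f2(h), f2(h / 2))   # summing: solved in one step; averaging: f(0.5,0.5) = 0.5, too conservative
```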

Averaging may be too conservative...
Consider the function f(x) = (x_1 − 1)² + (x_2 − 1)² + ... + (x_n − 1)². Evaluate at x^0 = 0: f(x^0) = n. We want f(x^k) = n(1 − 1/n)^{2k} ≤ ε. With averaging, this requires
k ≥ (n/2) log(n/ε).
The factor of n is bad! We wanted O((nβ/τ) ξ(ε)).
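
A quick check of this rate (mine): run the averaged parallel update on the separable quadratic and compare the observed objective with n(1 − 1/n)^{2k}.

```python
import numpy as np

n, iters = 100, 500
x = np.zeros(n)
for _ in range(iters):
    h = 1.0 - x                    # exact coordinate-wise minimizers, computed in parallel
    x = x + h / n                  # averaging: beta = n
f_obs = np.sum((x - 1.0) ** 2)
f_pred = n * (1.0 - 1.0 / n) ** (2 * iters)
print(f_obs, f_pred)               # the two values agree

eps = 1e-3
print(0.5 * n * np.log(n / eps))   # iterations needed: (n/2) log(n/eps), about 576 for n = 100
```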

What to do?
We can write the combined coordinate descent update as
x⁺ ← x + (1/β) Σ_{i=1}^n h_i e_i,
where h_i is the update to coordinate i and e_i is the i-th unit coordinate vector. Averaging: β = n. Summation: β = 1. When can we safely use β ≈ 1?

When can we use small β?
Three models for f with small β:
1. Smooth partially separable f [Richtárik & Takáč, 2011b]: f(x + te_i) ≤ f(x) + ∇f(x)ᵀ(te_i) + (L_i/2)t², with f(x) = Σ_{J∈𝒥} f_J(x), where each f_J depends on x_i for i ∈ J only; ω := max_{J∈𝒥} |J|.
2. Nonsmooth max-type f [Fercoq & Richtárik, 2013]: f(x) = max_{z∈Q} {zᵀAx − g(z)}; ω := max_{1≤j≤m} |{i : A_{ji} ≠ 0}|.
3. f with bounded Hessian [Bradley et al., 2011; Richtárik & Takáč, 2013a]: f(x + h) ≤ f(x) + ∇f(x)ᵀh + ½hᵀAᵀAh, with L = diag(AᵀA); σ := λ_max(L^{−1/2}AᵀAL^{−1/2}).
Here ω is the degree of partial separability and σ is a spectral radius.

Parallel Coordinate Descent Method
At iteration k, select a random set S_k. S_k is a realization of a random set-valued mapping (or sampling) Ŝ. The update h_i depends on F, on x, and on the law describing Ŝ. The method continuously interpolates between serial coordinate descent and gradient descent, by varying E[|Ŝ|] between 1 and n.

ESO: Expected Separable Overapproximation
We say that f admits a (β, w)-ESO with respect to a (uniform) sampling Ŝ, written (f, Ŝ) ~ ESO(β, w), if for all x, h ∈ R^n
E[f(x + h_[Ŝ])] ≤ f(x) + (E[|Ŝ|]/n) (∇f(x)ᵀh + (β/2)‖h‖²_w),
where h_[Ŝ] = Σ_{i∈Ŝ} h_i e_i and ‖h‖²_w := Σ_{i=1}^n w_i (h_i)².
We note that ∇f(x)ᵀh + (β/2)‖h‖²_w is separable in h. Minimizing it with respect to h in parallel yields the update
x_i⁺ ← x_i − ∇_i f(x)/(β w_i),
with the updates computed for i ∈ Ŝ only. This is a separable quadratic overapproximation of E[f] evaluated at the update.
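
Here is a minimal sketch (mine, under simplifying assumptions) of this update for the quadratic loss f(x) = ½‖Ax − b‖² with τ-nice sampling, using the weights w_i = L_i = ‖A_{:,i}‖²; the step-size parameter β is taken as an input and would be supplied by the ESO theory on the following slides.

```python
import numpy as np

def pcdm_step(x, grad, S_hat, beta, w):
    """For each sampled coordinate i: x_i <- x_i - grad_i f(x) / (beta * w_i)."""
    x = x.copy()
    x[S_hat] -= grad[S_hat] / (beta * w[S_hat])
    return x

def pcdm(A, b, tau, beta, iters=1000, seed=0):
    """Parallel coordinate descent for f(x) = 0.5 * ||Ax - b||^2 with tau-nice sampling."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    w = (A ** 2).sum(axis=0)                             # w_i = L_i = ||A_{:,i}||^2
    x = np.zeros(n)
    for _ in range(iters):
        S_hat = rng.choice(n, size=tau, replace=False)   # tau-nice sampling
        grad = A.T @ (A @ x - b)                         # in practice, only grad[S_hat] is needed
        x = pcdm_step(x, grad, S_hat, beta, w)
    return x

rng = np.random.default_rng(1)
A, b = rng.standard_normal((200, 50)), rng.standard_normal(200)
x = pcdm(A, b, tau=8, beta=8.0)   # dense A: omega = n, so the ESO theorem later gives beta = tau
print(0.5 * np.linalg.norm(A @ x - b) ** 2)
```

Each sampled coordinate would be handled by its own processor; the vectorized assignment over S_hat stands in for that parallel work.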

Convergence Rate for Convex f
If (f, Ŝ) ~ ESO(β, w), then [Richtárik & Takáč, 2011b]
k ≥ (βn / E[|Ŝ|]) (2R²_w(x⁰, x*) / ε) log((F(x⁰) − F*) / (ερ))
implies that P(F(x^k) − F* ≤ ε) ≥ 1 − ρ.
ε: error tolerance; n: number of coordinates; E[|Ŝ|]: average number of coordinates updated per iteration; β: step-size parameter.

Convergence Rate for Strongly Convex f
If (f, Ŝ) ~ ESO(β, w), then [Richtárik & Takáč, 2011b]
k ≥ (n / E[|Ŝ|]) ((β + μ_g(w)) / (μ_f(w) + μ_g(w))) log((F(x⁰) − F*) / (ερ))
implies that P(F(x^k) − F* ≤ ε) ≥ 1 − ρ.
μ_f(w): strong convexity constant of the loss f; μ_g(w): strong convexity constant of the regularizer g. If μ_g(w) is large, then the slowdown effect of β is eliminated.

What if the problem is only partially separable?
Uniform sampling: P(Ŝ = {i}) = 1/n.
τ-nice sampling (for shared-memory systems): P(Ŝ = S) = 1/(n choose τ) if |S| = τ, and 0 otherwise.
At each iteration: choose a set of τ coordinates, with each subset of size τ chosen with the same probability; assign each i to a dedicated processor; compute and apply the update. All sampled sets have the same size τ (any other set has probability 0).

What if the problem is only partially separable?
Doubly uniform (DU) sampling: P(Ŝ = S) = q_{|S|} / (n choose |S|). Generates all sets of equal cardinality with equal probability. Can model unreliable processors/machines. Let q_τ = P(|Ŝ| = τ).
[Figure: example distributions (q_1, q_2, q_3, q_4, q_5) over the cardinality |Ŝ| for n = 5 coordinates.]

What if the problem is only partially separable?
Binomial sampling: consider independent, equally unreliable processors. Each of τ processors is available with probability p_b and busy with probability 1 − p_b. The number of available processors (the number of blocks that can be updated in parallel) at each iteration is then a binomial random variable with parameters τ and p_b. Use explicit or implicit selection.
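
For concreteness, a small sketch (mine) of these samplings as NumPy routines; the binomial variant first draws the number of available processors and then a uniformly random set of that size, which makes it a doubly uniform sampling:

```python
import numpy as np

rng = np.random.default_rng(0)

def uniform_sampling(n):
    """Serial sampling: a single coordinate, uniformly at random."""
    return rng.integers(n, size=1)

def tau_nice_sampling(n, tau):
    """Every subset of size tau has the same probability 1 / C(n, tau)."""
    return rng.choice(n, size=tau, replace=False)

def binomial_sampling(n, tau, p_b):
    """Draw the number of available processors ~ Binomial(tau, p_b), then a set of that size."""
    tau_avail = rng.binomial(tau, p_b)
    return rng.choice(n, size=tau_avail, replace=False)

print(uniform_sampling(10), tau_nice_sampling(10, 4), binomial_sampling(10, 4, 0.75))
```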

ESO Theory
1. Smooth partially separable f [Richtárik & Takáč, 2011b]: f(x + te_i) ≤ f(x) + ∇f(x)ᵀ(te_i) + (L_i/2)t², with f(x) = Σ_{J∈𝒥} f_J(x), where each f_J depends on x_i for i ∈ J only, and ω := max_{J∈𝒥} |J|.
Theorem: If Ŝ is doubly uniform, then
E[f(x + h_[Ŝ])] ≤ f(x) + (E[|Ŝ|]/n) (∇f(x)ᵀh + (β/2)‖h‖²_w),
where β = 1 + (ω − 1)(E[|Ŝ|²]/E[|Ŝ|] − 1)/(n − 1) and w_i = L_i, i = 1, 2, ..., n.
β is small if ω is small (i.e., f is more separable).
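
For illustration (mine, not from the slides): for the quadratic loss f(x) = ½‖Ax − b‖², the term f_J associated with row j of A depends only on the coordinates in that row's nonzero pattern, so ω is the maximum number of nonzeros in a row; and for τ-nice sampling E[|Ŝ|²]/E[|Ŝ|] = τ, so the formula reduces to β = 1 + (ω − 1)(τ − 1)/(n − 1).

```python
import numpy as np
import scipy.sparse as sp

def eso_beta_tau_nice(A, tau):
    """beta = 1 + (omega - 1)(tau - 1)/(n - 1), with omega = max nonzeros per row of A."""
    A = sp.csr_matrix(A)
    n = A.shape[1]
    omega = int(np.diff(A.indptr).max())      # nonzeros per row, take the maximum
    return 1.0 + (omega - 1) * (tau - 1) / (n - 1), omega

A = sp.random(1000, 500, density=0.01, format="csr", random_state=0)
beta, omega = eso_beta_tau_nice(A, tau=16)
print(omega, beta)                            # sparse A -> small omega -> beta stays close to 1
```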

ESO Theory
(Richtárik & Takáč, 2013) Parallel Coordinate Descent Methods for Big Data Optimization.

Theoretical Speedup
s(r) = τ / (1 + r(τ − 1)), with r = (ω − 1)/(n − 1).
ω is often a small constant relative to n. r is a measure of density; MUCH OF BIG DATA IS HERE (the small-r regime)!
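
A quick look at what this formula predicts (my own numbers): for sparse data (small r) the speedup is nearly linear in τ, while for dense data it saturates at roughly 1/r.

```python
def speedup(tau, omega, n):
    """Theoretical speedup s = tau / (1 + r*(tau - 1)), r = (omega - 1)/(n - 1)."""
    r = (omega - 1) / (n - 1)
    return tau / (1 + r * (tau - 1))

n = 10**6
for omega in (10, 1000, 100000):
    print(omega, [round(speedup(tau, omega, n), 1) for tau in (8, 64, 512)])
# omega = 10     -> [8.0, 64.0, 509.7]   near-linear speedup
# omega = 1000   -> [7.9, 60.2, 339.0]
# omega = 100000 -> [4.7, 8.8, 9.8]      saturating toward 1/r = 10
```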

Theory vs Practice
[Figure: theoretical (left) and experimental (right) speedup versus the number of processors τ, for n = 1000 coordinates.]

Experiment: a 1 billion-by-2 billion LASSO problem [Richtárik & Takáč, 2012]
f(x) = ½‖Ax − b‖²₂, g(x) = ‖x‖₁.
A has 2 × 10⁹ rows and n = 10⁹ columns; ‖x*‖₀ = 10⁵; ‖A_{:,i}‖₀ = 20 for every column; max_j ‖A_{j,:}‖₀ = 35 over the rows, so ω = 35 (the degree of partial separability of f). Used an approximation of τ-nice sampling Ŝ (independent sampling, τ ≪ n). Asynchronous implementation: older information is used to update coordinates, but the observed slowdown is limited.

Experiment: Coordinate Updates
For each τ, serial and parallel CD need approximately the same number of coordinate updates. The method identifies the active set.

Experiment: Iterations
Doubling τ roughly translates to halving the number of iterations.

Experiment: Wall Time
Doubling τ roughly translates to halving the wall time.

Citation / Algorithm / Paper
(Bradley et al., 2011) Shotgun: Parallel coordinate descent for L1-regularized loss minimization. ICML, 2011 (arXiv).
(Richtárik & Takáč, 2011) SCD: Efficient serial and parallel coordinate descent methods for huge-scale truss topology design. Operations Research Proceedings, 27-32, 2012 (Opt. Online 08/2011).
(Richtárik & Takáč, 2012) PCDM: Parallel coordinate descent methods for big data optimization. Mathematical Programming, 2015 (arXiv).
(Fercoq & Richtárik, 2013) SPCDM: Smooth minimization of nonsmooth functions with parallel coordinate descent method (arXiv).
(Richtárik & Takáč, 2013a) HYDRA: Distributed coordinate descent method for learning with big data (arXiv).
(Liu et al., 2013) AsySCD: An asynchronous parallel stochastic coordinate descent algorithm. ICML, 2014 (arXiv).
(Fercoq & Richtárik, 2013) APPROX: Accelerated, parallel and proximal coordinate descent (arXiv).
(Yang, 2013) DisDCA: Trading computation for communication: distributed stochastic dual coordinate ascent. NIPS, 2013.
(Bian et al., 2013) PCDN: Parallel coordinate descent Newton method for efficient l1-regularized minimization (arXiv).
(Liu & Wright, 2014) AsySPCD: Asynchronous stochastic coordinate descent: parallelism and convergence properties. SIAM J. Optim., 25(1), 2015 (arXiv).
(Mahajan et al., 2014) DBCD: A distributed block coordinate descent method for training l1-regularized linear classifiers. arXiv, 2014.

Citation / Algorithm / Paper (continued)
(Fercoq et al., 2014) Hydra2: Fast distributed coordinate descent for non-strongly convex losses. MLSP, 2014 (arXiv).
(Mareček, Richtárik & Takáč, 2014) DBCD: Distributed block coordinate descent for minimizing partially separable functions. Numerical Analysis and Optimization, Springer Proceedings in Mathematics and Statistics (arXiv).
(Jaggi, Smith, Takáč et al., 2014) CoCoA: Communication-efficient distributed dual coordinate ascent. NIPS, 2014 (arXiv).
(Qu, Richtárik & Zhang, 2014) QUARTZ: Randomized dual coordinate ascent with arbitrary sampling. arXiv, 2014.
(Ma, Smith, Jaggi et al., 2015) CoCoA+: Adding vs. averaging in distributed primal-dual optimization. ICML, 2015.
(Tappenden, Takáč & Richtárik, 2015) PCDM: On the complexity of parallel coordinate descent. arXiv, 2015.
(Hsieh, Yu & Dhillon, 2015) PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent. ICML, 2015.
(Peng et al., 2016) ARock: An algorithmic framework for asynchronous parallel coordinate updates. SIAM J. Sci. Comput., 2016.
(You et al., 2016) Asy-GCD: Asynchronous parallel greedy coordinate descent. NIPS, 2016.

Conclusions
Coordinate descent scales very well to big data problems of special structure; it requires (partial) separability / a sparse dependency graph. Care is needed when combining updates (add them up? average them?). Sampling strategies can take unreliable processors into account.
