Parallel Coordinate Optimization
Julie Nutini, MLRG, Spring Term, March 6th, 2018.
Coordinate Descent in 2D. [Figure: contour plot of a function F : R^2 -> R; the animation traces coordinate-descent iterates x^k, x^{k+2}, x^{k+3}, ... zig-zagging along the coordinate directions toward the minimizer x^*.] Goal: find the minimizer of F.
Coordinate Descent. Update a single coordinate i at each iteration:

x_i^{k+1} = x_i^k - alpha_k * grad_i f(x^k)

- Easy to implement, low memory requirements, cheap iteration costs.
- Suitable for large-scale optimization (dimension n is large):
  - certain smooth (unconstrained) problems;
  - non-smooth problems with separable constraints/regularizers, e.g., l1-regularization, bound constraints.
- Faster than gradient descent if iterations are n times cheaper.
- Adaptable to distributed settings.
- For truly huge-scale problems, it is absolutely necessary to parallelize.
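As a concrete reference point, a minimal serial randomized coordinate descent loop might look as follows. This is a sketch, not from the talk: the step size 1/L_i (coordinate-wise Lipschitz constant) and the toy separable quadratic are illustrative choices.

```python
import random

def coordinate_descent(grad_i, lipschitz, x, iters=1000, seed=0):
    """Serial randomized CD: pick one coordinate uniformly at random
    and take a gradient step of size 1/L_i on that coordinate only."""
    rng = random.Random(seed)
    n = len(x)
    for _ in range(iters):
        i = rng.randrange(n)
        x[i] -= grad_i(x, i) / lipschitz[i]
    return x

# Toy problem: f(x) = 0.5 * sum((x_i - 1)^2), so grad_i = x_i - 1 and L_i = 1.
x = coordinate_descent(lambda v, i: v[i] - 1.0, [1.0] * 3, [0.0] * 3)
```

With exact coordinate-wise steps (step 1/L_i on a quadratic), each coordinate is solved the first time it is drawn, which is the "cheap iteration" regime the slide describes.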
Problem. Consider the optimization problem

min_{x in R^n} F(x) := f(x) + g(x),

where f is a convex loss function (smooth or nonsmooth) and g is a convex, separable regularizer (smooth or nonsmooth).
Regularizer Examples. g(x) = sum_{i=1}^n g_i(x_i), with x = (x_1, x_2, ..., x_n)^T.
- No regularizer: g_i(x_i) = 0.
- Weighted L1-norm: g_i(x_i) = lambda_i |x_i| (lambda_i > 0), e.g., LASSO.
- Weighted L2-norm: g_i(x_i) = lambda_i (x_i)^2 (lambda_i > 0).
- Box constraints: g_i(x_i) = 0 if x_i in X_i, +infinity otherwise; e.g., the SVM dual.
Loss Examples.
- Quadratic loss: f(x) = (1/2)||Ax - y||_2^2 = (1/2) sum_{j=1}^m (A_{j:} x - y_j)^2. [Bradley et al., 2011]
- Logistic loss: sum_{j=1}^m log(1 + exp(-y_j A_{j:} x)). [Richtárik & Takáč, 2011b, 2013a]
- Square hinge loss: (1/2) sum_{j=1}^m (max{0, 1 - y_j A_{j:} x})^2. [Takáč et al., 2013]
- L-infinity: ||Ax - y||_inf = max_{1 <= j <= m} |A_{j:} x - y_j|.
- L1-regression: ||Ax - y||_1 = sum_{j=1}^m |A_{j:} x - y_j|. [Fercoq & Richtárik, 2013]
- Exponential loss: log( (1/m) sum_{j=1}^m exp(y_j A_{j:} x) ).
Parallel Coordinate Descent.
- Embarrassingly parallel if the objective is separable: speedup equal to the number of processors, tau.
- For partially-separable objectives:
  - Assign the i-th processor the task of updating the i-th component of x.
  - Each processor communicates its updated value x_i^+ to the processors that require it.
  - The i-th processor needs the current value of x_j only if grad_i f or the Hessian entry [grad^2 f]_{ii} depends on x_j.
- Parallel implementations are suitable when the dependency graph is sparse.
Dependency Graph. Given a fixed serial ordering of updates, those marked in red on the slide can be done in parallel. (i, j) is an arc of the dependency graph iff the update function h_j depends on x_i.

Update order {1, 2, 3, 4}:
x_1^+ = h_1(x_1^k, x_3^k)
x_2^+ = h_2(x_1^+, x_2^k)
x_3^+ = h_3(x_2^+, x_3^k, x_4^k)
x_4^+ = h_4(x_2^+, x_4^k)

Better update order {1, 3, 4, 2}:
x_1^+ = h_1(x_1^k, x_3^k)
x_3^+ = h_3(x_2^k, x_3^k, x_4^k)
x_4^+ = h_4(x_2^k, x_4^k)
x_2^+ = h_2(x_1^+, x_2^k)

With the second ordering, updates 1, 3 and 4 read only iteration-k values, so they can run in parallel; only update 2 must wait for x_1^+.
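Finding such an ordering can be automated: treat "update j reads the output of update i" as an arc i -> j and greedily pack updates into parallel rounds. A small sketch (the helper name and the 4-update example encoding are mine, mirroring the slide):

```python
def parallel_rounds(deps):
    """Group updates into parallel rounds: an update may run only after
    every update it depends on has run in an earlier round.
    deps[j] = set of updates whose *updated* output h_j reads."""
    done, rounds = set(), []
    remaining = set(deps)
    while remaining:
        ready = {j for j in remaining if deps[j] <= done}
        if not ready:
            raise ValueError("cyclic dependencies")
        rounds.append(sorted(ready))
        done |= ready
        remaining -= ready
    return rounds

# Slide example: h_2 reads the updated x_1; h_1, h_3, h_4 read only old values.
rounds = parallel_rounds({1: set(), 2: {1}, 3: set(), 4: set()})
```

This recovers the slide's "better" schedule: updates 1, 3, 4 in one round, then update 2.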
Parallel Coordinate Descent.
Synchronous parallelism:
- Divide iterate updates between processors, followed by a synchronization step.
- Very slow for large-scale problems (must wait for the slowest processor).
Asynchronous parallelism:
- Each processor has access to x, chooses an index i, loads the components of x needed to compute the gradient component grad_i f(x), then updates the i-th component x_i.
- No attempt to coordinate or synchronize with other processors.
- Always using a stale x; convergence results restrict how stale.
- Many numerical results actually use an asynchronous implementation and ignore the synchronization step required by the theory.
Totally Asynchronous Algorithm.
Definition: An algorithm is totally asynchronous if
1. each index i in {1, 2, ..., n} of x is updated at infinitely many iterations, and
2. if nu_j^k denotes the iteration at which component j of the vector x-hat^k was last updated, then nu_j^k -> infinity as k -> infinity for all j = 1, 2, ..., n.
No condition on how stale x-hat_j is; it is only required that it will be updated eventually.
Theorem (Bertsekas and Tsitsiklis, 1989). Suppose the mapping T(x) := x - alpha * grad f(x) for some alpha > 0 satisfies

||T(x) - x^*||_inf <= eta * ||x - x^*||_inf for some eta in (0, 1).

Then if we set alpha_k = alpha in Algorithm 7, the sequence {x^k} converges to x^*.
Partly Asynchronous Algorithm.
- No convergence rate for totally asynchronous algorithms, given the weak assumptions on x-hat^k.
- The l_inf contraction assumption on the mapping T is quite strong.
- Liu et al. (2015) assume no component of x-hat^k is older than a nonnegative integer tau (the maximum delay) at any k.
  - tau is related to the number of processors (an indicator of potential parallelism); if all processors complete updates at approximately the same rate, tau <= c * (number of processors) for some positive integer c.
- Linear convergence if essential strong convexity holds.
- Sublinear convergence for general convex functions.
- Near-linear speedup if the number of processors is: O(n^{1/2}) in unconstrained optimization, O(n^{1/4}) in the separable-constrained case.
Question. Under what structural assumptions does parallelization lead to acceleration?
Convergence of Randomized Coordinate Descent. In R^n, randomized coordinate descent with uniform selection requires O(n * xi(epsilon)) iterations, where:
- strongly convex F: xi(epsilon) = log(1/epsilon);
- smooth, or simple nonsmooth F: xi(epsilon) = 1/epsilon;
- difficult nonsmooth F: xi(epsilon) = 1/epsilon^2.
When dealing with big data, we only care about n.
The Parallelization Dream.
- Serial (1 coordinate per iteration): O(n * xi(epsilon)) iterations.
- Parallel (tau coordinates per iteration): O((n/tau) * xi(epsilon)) iterations.
What do we actually get? O((n*beta/tau) * xi(epsilon)) iterations; we want beta = O(1).
beta depends on the extent to which we can add up the individual updates: on properties of F, and on the selection of coordinates at each iteration.
Naive Parallelization. Consider the function f(x_1, x_2) = (x_1 + x_2 - 1)^2. Just compute the update for more/all coordinates and then add up the updates. [Figure: starting from x^k = (0, 0) with f(0, 0) = 1, each coordinate's exact update lands on the line x_1 + x_2 = 1 where f = 0, but summing both updates jumps to (1, 1) with f(1, 1) = 1; the iterates oscillate.] NO GOOD!
Naive Parallelization. Consider the function f(x_1, x_2) = (x_1 + x_2 - 1)^2. What about averaging the updates? [Figure: from x^k, averaging the two coordinate updates moves the iterate onto the line x_1 + x_2 = 1, reaching a minimizer x^*.] SOLVED!
Naive Parallelization. Consider the function f(x_1, x_2) = (x_1 - 1)^2 + (x_2 - 1)^2. What about averaging the updates? [Figure: from x^k, the averaged updates move only partway toward the minimizer each iteration: x^{k+1}, x^{k+2}, and so on.] COULD BE TOO CONSERVATIVE...
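The three behaviours above (summing oscillates on the coupled function, averaging fixes it, averaging crawls on the separable function) are easy to reproduce numerically. A sketch using exact coordinate minimization, starting from (0, 0); the helper names are mine:

```python
def cd_step(argmin_i, x, combine):
    """One parallel step: compute each coordinate's exact minimizer
    holding the others fixed, then combine the per-coordinate moves h_i."""
    h = [argmin_i(x, i) - x[i] for i in range(len(x))]
    if combine == "sum":
        return [xi + hi for xi, hi in zip(x, h)]
    return [xi + hi / len(x) for xi, hi in zip(x, h)]  # average

# Coupled f = (x1 + x2 - 1)^2: the argmin over x_i is 1 - x_other.
coupled = lambda v, i: 1.0 - v[1 - i]
# Separable f = (x1 - 1)^2 + (x2 - 1)^2: the argmin over each x_i is 1.
separable = lambda v, i: 1.0

x_sum = [0.0, 0.0]
for _ in range(10):
    x_sum = cd_step(coupled, x_sum, "sum")      # oscillates (0,0) <-> (1,1)

x_avg = [0.0, 0.0]
for _ in range(10):
    x_avg = cd_step(coupled, x_avg, "average")  # lands on the line x1 + x2 = 1

x_slow = [0.0, 0.0]
for _ in range(10):
    x_slow = cd_step(separable, x_slow, "average")  # halves the gap each step
```

After ten steps the summed iterate is back at (0, 0), the averaged iterate sits on the optimal line, and the averaged iterate on the separable function has closed only a 1 - 2^{-10} fraction of the gap.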
Averaging may be too conservative...
Consider the function f(x) = (x_1 - 1)^2 + (x_2 - 1)^2 + ... + (x_n - 1)^2.
Evaluated at x^0 = 0: f(x^0) = n. We want

f(x^k) = n (1 - 1/n)^{2k} <= epsilon.

With averaging, this requires k >= (n/2) log(n/epsilon). The factor of n is bad: we wanted O((n*beta/tau) * xi(epsilon)).
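The decay f(x^k) = n(1 - 1/n)^{2k} for averaged updates on this separable quadratic can be checked numerically; n and k below are arbitrary choices:

```python
n, k = 50, 200
x = [0.0] * n
for _ in range(k):
    # exact per-coordinate minimizer is 1; the averaged update moves 1/n of the way
    x = [xi + (1.0 - xi) / n for xi in x]

f = sum((xi - 1.0) ** 2 for xi in x)
predicted = n * (1.0 - 1.0 / n) ** (2 * k)
```

Each coordinate's residual shrinks by a factor (1 - 1/n) per iteration, so f matches the prediction to floating-point accuracy.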
What to do? We can write the coordinate descent update as

x^+ <- x + (1/beta) * sum_{i=1}^n h_i e_i,

where h_i is the update to coordinate i and e_i is the i-th unit coordinate vector.
- Averaging: beta = n. Summation: beta = 1.
- When can we safely use beta close to 1?
When can we use small beta? Three models for f with small beta:
1. Smooth partially separable f [Richtárik & Takáč, 2011b]:
   f(x + t e_i) <= f(x) + grad f(x)^T (t e_i) + (L_i/2) t^2,
   f(x) = sum_{J in calJ} f_J(x), where f_J depends on x_i for i in J only,
   omega := max_{J in calJ} |J|.
2. Nonsmooth max-type f [Fercoq & Richtárik, 2013]:
   f(x) = max_{z in Q} { z^T A x - g(z) },
   omega := max_{1 <= j <= m} |{ i : A_{ji} != 0 }|.
3. f with bounded Hessian [Bradley et al., 2011; Richtárik & Takáč, 2013a]:
   f(x + h) <= f(x) + grad f(x)^T h + (1/2) h^T A^T A h,
   L = diag(A^T A), sigma := lambda_max(L^{-1/2} A^T A L^{-1/2}).
Here omega is the degree of partial separability and sigma is a spectral radius.
Parallel Coordinate Descent Method.
- At iteration k, select a random set S_k; S_k is a realization of a random set-valued mapping (a "sampling") S-hat.
- The update h_i depends on F, on x, and on the law describing S-hat.
- Continuously interpolates between serial coordinate descent and gradient descent, by varying E[|S-hat|] between 1 and n.
ESO: Expected Separable Overapproximation. We say that f admits a (beta, w)-ESO with respect to a (uniform) sampling S-hat, written (f, S-hat) ~ ESO(beta, w), if for all x, h in R^n

E[ f(x + h_[S-hat]) ] <= f(x) + (E[|S-hat|]/n) * ( grad f(x)^T h + (beta/2) ||h||_w^2 ),

where h_[S-hat] = sum_{i in S-hat} h_i e_i and ||h||_w^2 := sum_{i=1}^n w_i (h_i)^2.
- Note that grad f(x)^T h + (beta/2) ||h||_w^2 is separable in h; minimizing it with respect to h in parallel yields the update

x_i^+ <- x_i - grad_i f(x) / (beta * w_i),

computed for i in S-hat only.
- This is a separable quadratic overapproximation of E[f] evaluated at the update.
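The resulting parallel step is simple: every sampled coordinate takes a gradient step scaled by 1/(beta * w_i), all computed from the same x. A sketch on a fully coupled quadratic, with w_i = L_i and beta set by hand (beta = n corresponds to averaging, which is conservative but safe here; tighter beta comes from the ESO theory on the following slides):

```python
import random

def parallel_cd_step(x, grad_i, w, beta, sample):
    """One ESO-style parallel step: read all gradients from the same x,
    then update each sampled coordinate by -grad_i / (beta * w_i)."""
    g = {i: grad_i(x, i) for i in sample}
    for i in sample:
        x[i] -= g[i] / (beta * w[i])
    return x

# Fully coupled f(x) = (sum(x) - 1)^2 / 2, so grad_i f = sum(x) - 1
# and the coordinate-wise Lipschitz constants are w_i = L_i = 1.
n = 4
beta = float(n)  # averaging choice, beta = n
x = [0.0] * n
rng = random.Random(1)
for _ in range(200):
    S = rng.sample(range(n), 2)  # tau-nice sampling with tau = 2
    x = parallel_cd_step(x, lambda v, i: sum(v) - 1.0, [1.0] * n, beta, S)
```

Each step shrinks the residual sum(x) - 1 by a constant factor, so the iterates converge to the solution set sum(x) = 1.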
Convergence Rate for Convex f. If (f, S-hat) ~ ESO(beta, w), then [Richtárik & Takáč, 2011b]

k >= (beta * n / E[|S-hat|]) * (2 R_w^2(x^0, x^*) / epsilon) * log( (F(x^0) - F^*) / (epsilon * rho) )

implies that P(F(x^k) - F^* <= epsilon) >= 1 - rho.
- epsilon: error tolerance; n: number of coordinates; E[|S-hat|]: average number of updated coordinates per iteration; beta: stepsize parameter.
Convergence Rate for Strongly Convex f. If (f, S-hat) ~ ESO(beta, w), then [Richtárik & Takáč, 2011b]

k >= (n / E[|S-hat|]) * ( (beta + mu_g(w)) / (mu_f(w) + mu_g(w)) ) * log( (F(x^0) - F^*) / (epsilon * rho) )

implies that P(F(x^k) - F^* <= epsilon) >= 1 - rho.
- mu_f(w): strong convexity constant of the loss f; mu_g(w): strong convexity constant of the regularizer g.
- If mu_g(w) is large, the slowdown effect of beta is eliminated.
What if the problem is only partially separable?
Uniform sampling: P(S-hat = {i}) = 1/n.
tau-nice sampling (for shared-memory systems):

P(S-hat = S) = 1 / (n choose tau) if |S| = tau, and 0 otherwise.

At each iteration:
- choose a set of coordinates, each subset of tau coordinates chosen with the same probability;
- assign each i to a dedicated processor;
- compute and apply the update.
All sampled sets have the same size tau (otherwise the probability is 0).
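tau-nice sampling is exactly uniform sampling of a tau-subset of the coordinate indices, which the standard library already provides (a sketch; the function name is mine):

```python
import random

def tau_nice(n, tau, rng):
    """Draw a subset of exactly tau of the n coordinates, with every
    tau-subset equally likely: the tau-nice sampling."""
    return set(rng.sample(range(n), tau))

rng = random.Random(0)
S = tau_nice(10, 3, rng)
```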
What if the problem is only partially separable?
Doubly uniform (DU) sampling:

P(S-hat = S) = q_{|S|} / (n choose |S|)

- Generates all sets of equal cardinality with equal probability.
- Can model unreliable processors/machines.
[Figure: example with n = 5 coordinates; writing q_tau = P(|S-hat| = tau), the animation steps through the probabilities q_1, q_2, q_3, q_4, q_5.]
What if the problem is only partially separable?
Binomial sampling: consider independent, equally unreliable processors.
- Each of tau processors is available with probability p_b and busy with probability 1 - p_b.
- The number of available processors (the number of blocks that can be updated in parallel) at each iteration is a binomial random variable with parameters tau and p_b.
- Use explicit or implicit selection.
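One natural reading of binomial sampling composes two draws: first the number of available processors from Binomial(tau, p_b), then that many coordinates uniformly. A sketch (names and this two-stage decomposition are my interpretation of the slide):

```python
import random

def binomial_sampling(n, tau, p_avail, rng):
    """Each of tau processors is independently available with probability
    p_avail; each available processor updates one coordinate."""
    k = sum(rng.random() < p_avail for _ in range(tau))  # ~ Binomial(tau, p)
    return set(rng.sample(range(n), k))                  # k-nice given k

rng = random.Random(42)
S = binomial_sampling(100, 8, 0.75, rng)
```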
ESO Theory.
1. Smooth partially separable f [Richtárik & Takáč, 2011b]:
   f(x + t e_i) <= f(x) + grad f(x)^T (t e_i) + (L_i/2) t^2,
   f(x) = sum_{J in calJ} f_J(x), where f_J depends on x_i for i in J only, omega := max_{J in calJ} |J|.
Theorem: If S-hat is doubly uniform, then

E[ f(x + h_[S-hat]) ] <= f(x) + (E[|S-hat|]/n) * ( grad f(x)^T h + (beta/2) ||h||_w^2 ),

where beta = 1 + (omega - 1) * ( E[|S-hat|^2] / E[|S-hat|] - 1 ) / (n - 1) and w_i = L_i for i = 1, 2, ..., n.
- beta is small if omega is small (i.e., if f is more separable).
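The beta formula from the theorem is a one-liner; for tau-nice sampling the set size is deterministic (E[|S|] = tau, E[|S|^2] = tau^2), so it reduces to beta = 1 + (omega - 1)(tau - 1)/(n - 1). A sketch (the example values are arbitrary):

```python
def beta_doubly_uniform(omega, n, E_S, E_S2):
    """ESO stepsize parameter for a doubly uniform sampling:
    beta = 1 + (omega - 1) * (E|S|^2 / E|S| - 1) / (n - 1)."""
    return 1.0 + (omega - 1) * (E_S2 / E_S - 1.0) / (n - 1)

# tau-nice with tau = 8: |S| = 8 with probability 1.
beta = beta_doubly_uniform(omega=10, n=1000, E_S=8, E_S2=64)

# Fully separable case (omega = 1): beta = 1, updates add up exactly.
assert beta_doubly_uniform(1, 1000, 8, 64) == 1.0
```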
ESO Theory. (Richtárik & Takáč, 2013) Parallel Coordinate Descent Methods for Big Data Optimization.
Theoretical Speedup.

s(tau) = tau / (1 + r(tau - 1)), where r = (omega - 1)/(n - 1).

- omega is often a constant that does not grow with n.
- r is a measure of density; small r (sparse problems) gives near-linear speedup. MUCH OF BIG DATA IS HERE!
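The speedup curve is easy to tabulate; the formula matches the DU beta via s = tau / beta for tau-nice sampling. A sketch:

```python
def speedup(tau, omega, n):
    """Theoretical parallel speedup for tau-nice sampling:
    s = tau / (1 + r * (tau - 1)), with r = (omega - 1) / (n - 1)."""
    r = (omega - 1) / (n - 1)
    return tau / (1.0 + r * (tau - 1))

# Fully separable (omega = 1): perfect linear speedup s = tau.
# Sparse big data (omega << n, so r small): near-linear speedup.
s_separable = speedup(16, 1, 1000)
s_sparse = speedup(16, 10, 1000)
```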
Theory vs Practice. [Figure: tau = number of processors vs. theoretical (left) and experimental (right) speedup, for n = 1000 coordinates.]
Experiment: a billion-by-2-billion LASSO problem [Richtárik & Takáč, 2012].
- f(x) = (1/2)||Ax - b||_2^2, g(x) = ||x||_1.
- A has billions of rows and n = 10^9 columns.
- Sparse solution: ||x^*||_0 = 10^5.
- Column density ||A_{:,i}||_0 = 20; maximum row density max_j ||A_{j,:}||_0 = 35.
- omega = 35 (degree of partial separability of f).
- Used an approximation of tau-nice sampling S-hat (independent sampling, tau << n).
- Asynchronous implementation: older information is used to update coordinates, but the observed slowdown is limited.
Experiment: Coordinate Updates. [Figure] For each tau, serial and parallel CD need approximately the same number of coordinate updates. The method identifies the active set.
Experiment: Iterations. [Figure] Doubling tau roughly translates to halving the number of iterations.
Experiment: Wall Time. [Figure] Doubling tau roughly translates to halving the wall time.
References (citation, algorithm, paper):
- (Bradley et al., 2011) Shotgun: Parallel coordinate descent for L1-regularized loss minimization. ICML 2011.
- (Richtárik & Takáč, 2011) SCD: Efficient serial and parallel coordinate descent methods for huge-scale truss topology design. Operations Research Proceedings, 27-32, 2012.
- (Richtárik & Takáč, 2012) PCDM: Parallel coordinate descent methods for big data optimization. Mathematical Programming, 2015.
- (Fercoq & Richtárik, 2013) SPCDM: Smooth minimization of nonsmooth functions with parallel coordinate descent methods. arXiv preprint.
- (Richtárik & Takáč, 2013a) HYDRA: Distributed coordinate descent method for learning with big data. arXiv preprint.
- (Liu et al., 2013) AsySCD: An asynchronous parallel stochastic coordinate descent algorithm. ICML 2014.
- (Fercoq & Richtárik, 2013) APPROX: Accelerated, parallel and proximal coordinate descent. arXiv preprint.
- (Yang, 2013) DisDCA: Trading computation for communication: distributed stochastic dual coordinate ascent. NIPS 2013.
- (Bian et al., 2013) PCDN: Parallel coordinate descent Newton method for efficient l1-regularized minimization. arXiv preprint.
- (Liu & Wright, 2014) AsySPCD: Asynchronous stochastic coordinate descent: parallelism and convergence properties. SIAM J. Optim. 25(1), 2015.
- (Mahajan et al., 2014) DBCD: A distributed block coordinate descent method for training l1-regularized linear classifiers. arXiv preprint, 2014.
References, continued:
- (Fercoq et al., 2014) Hydra2: Fast distributed coordinate descent for non-strongly convex losses. MLSP 2014.
- (Mareček, Richtárik & Takáč, 2014) DBCD: Distributed block coordinate descent for minimizing partially separable functions. Numerical Analysis and Optimization, Springer Proceedings in Mathematics and Statistics.
- (Jaggi, Smith, Takáč et al., 2014) CoCoA: Communication-efficient distributed dual coordinate ascent. NIPS 2014.
- (Qu, Richtárik & Zhang, 2014) QUARTZ: Randomized dual coordinate ascent with arbitrary sampling. arXiv preprint, 2014.
- (Ma, Smith, Jaggi et al., 2015) CoCoA+: Adding vs. averaging in distributed primal-dual optimization. ICML 2015.
- (Tappenden, Takáč & Richtárik, 2015) PCDM: On the complexity of parallel coordinate descent. arXiv preprint, 2015.
- (Hsieh, Yu & Dhillon, 2015) PASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent. ICML 2015.
- (Peng et al., 2016) ARock: An algorithmic framework for asynchronous parallel coordinate updates. SIAM J. Sci. Comput., 2016.
- (You et al., 2016) Asy-GCD: Asynchronous parallel greedy coordinate descent. NIPS 2016.
Conclusions.
- Coordinate descent scales very well to big data problems of special structure: it requires (partial) separability, i.e., a sparse dependency graph.
- Care is needed when combining updates (add them up? average?).
- Sampling strategies can take unreliable processors into account.
More informationEfficient Quasi-Newton Proximal Method for Large Scale Sparse Optimization
Efficient Quasi-Newton Proximal Method for Large Scale Sparse Optimization Xiaocheng Tang Department of Industrial and Systems Engineering Lehigh University Bethlehem, PA 18015 xct@lehigh.edu Katya Scheinberg
More informationConditional Gradient (Frank-Wolfe) Method
Conditional Gradient (Frank-Wolfe) Method Lecturer: Aarti Singh Co-instructor: Pradeep Ravikumar Convex Optimization 10-725/36-725 1 Outline Today: Conditional gradient method Convergence analysis Properties
More informationThe Randomized Newton Method for Convex Optimization
The Randomized Newton Method for Convex Optimization Vaden Masrani UBC MLRG April 3rd, 2018 Introduction We have some unconstrained, twice-differentiable convex function f : R d R that we want to minimize:
More informationNewton s Method. Javier Peña Convex Optimization /36-725
Newton s Method Javier Peña Convex Optimization 10-725/36-725 1 Last time: dual correspondences Given a function f : R n R, we define its conjugate f : R n R, f ( (y) = max y T x f(x) ) x Properties and
More informationAdaptive Probabilities in Stochastic Optimization Algorithms
Research Collection Master Thesis Adaptive Probabilities in Stochastic Optimization Algorithms Author(s): Zhong, Lei Publication Date: 2015 Permanent Link: https://doi.org/10.3929/ethz-a-010421465 Rights
More informationLecture 23: November 21
10-725/36-725: Convex Optimization Fall 2016 Lecturer: Ryan Tibshirani Lecture 23: November 21 Scribes: Yifan Sun, Ananya Kumar, Xin Lu Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:
More informationLecture 17: October 27
0-725/36-725: Convex Optimiation Fall 205 Lecturer: Ryan Tibshirani Lecture 7: October 27 Scribes: Brandon Amos, Gines Hidalgo Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These
More informationUses of duality. Geoff Gordon & Ryan Tibshirani Optimization /
Uses of duality Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Remember conjugate functions Given f : R n R, the function is called its conjugate f (y) = max x R n yt x f(x) Conjugates appear
More informationAsynchronous Algorithms for Conic Programs, including Optimal, Infeasible, and Unbounded Ones
Asynchronous Algorithms for Conic Programs, including Optimal, Infeasible, and Unbounded Ones Wotao Yin joint: Fei Feng, Robert Hannah, Yanli Liu, Ernest Ryu (UCLA, Math) DIMACS: Distributed Optimization,
More informationPASSCoDe: Parallel ASynchronous Stochastic dual Co-ordinate Descent
Cho-Jui Hsieh Hsiang-Fu Yu Inderjit S. Dhillon Department of Computer Science, The University of Texas, Austin, TX 787, USA CJHSIEH@CS.UTEXAS.EDU ROFUYU@CS.UTEXAS.EDU INDERJIT@CS.UTEXAS.EDU Abstract Stochastic
More informationAdding vs. Averaging in Distributed Primal-Dual Optimization
Chenxin Ma Industrial and Systems Engineering, Lehigh University, USA Virginia Smith University of California, Berkeley, USA Martin Jaggi ETH Zürich, Switzerland Michael I. Jordan University of California,
More informationSOR- and Jacobi-type Iterative Methods for Solving l 1 -l 2 Problems by Way of Fenchel Duality 1
SOR- and Jacobi-type Iterative Methods for Solving l 1 -l 2 Problems by Way of Fenchel Duality 1 Masao Fukushima 2 July 17 2010; revised February 4 2011 Abstract We present an SOR-type algorithm and a
More informationAsynchronous Non-Convex Optimization For Separable Problem
Asynchronous Non-Convex Optimization For Separable Problem Sandeep Kumar and Ketan Rajawat Dept. of Electrical Engineering, IIT Kanpur Uttar Pradesh, India Distributed Optimization A general multi-agent
More informationLASSO Review, Fused LASSO, Parallel LASSO Solvers
Case Study 3: fmri Prediction LASSO Review, Fused LASSO, Parallel LASSO Solvers Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade May 3, 2016 Sham Kakade 2016 1 Variable
More informationAccelerated primal-dual methods for linearly constrained convex problems
Accelerated primal-dual methods for linearly constrained convex problems Yangyang Xu SIAM Conference on Optimization May 24, 2017 1 / 23 Accelerated proximal gradient For convex composite problem: minimize
More informationQuasi-Newton Methods. Zico Kolter (notes by Ryan Tibshirani, Javier Peña, Zico Kolter) Convex Optimization
Quasi-Newton Methods Zico Kolter (notes by Ryan Tibshirani, Javier Peña, Zico Kolter) Convex Optimization 10-725 Last time: primal-dual interior-point methods Given the problem min x f(x) subject to h(x)
More informationMax Margin-Classifier
Max Margin-Classifier Oliver Schulte - CMPT 726 Bishop PRML Ch. 7 Outline Maximum Margin Criterion Math Maximizing the Margin Non-Separable Data Kernels and Non-linear Mappings Where does the maximization
More informationGradient Descent. Ryan Tibshirani Convex Optimization /36-725
Gradient Descent Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: canonical convex programs Linear program (LP): takes the form min x subject to c T x Gx h Ax = b Quadratic program (QP): like
More informationChapter 2. Optimization. Gradients, convexity, and ALS
Chapter 2 Optimization Gradients, convexity, and ALS Contents Background Gradient descent Stochastic gradient descent Newton s method Alternating least squares KKT conditions 2 Motivation We can solve
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Nov 2, 2016 Outline SGD-typed algorithms for Deep Learning Parallel SGD for deep learning Perceptron Prediction value for a training data: prediction
More informationNetwork Newton. Aryan Mokhtari, Qing Ling and Alejandro Ribeiro. University of Pennsylvania, University of Science and Technology (China)
Network Newton Aryan Mokhtari, Qing Ling and Alejandro Ribeiro University of Pennsylvania, University of Science and Technology (China) aryanm@seas.upenn.edu, qingling@mail.ustc.edu.cn, aribeiro@seas.upenn.edu
More informationr=1 r=1 argmin Q Jt (20) After computing the descent direction d Jt 2 dt H t d + P (x + d) d i = 0, i / J
7 Appendix 7. Proof of Theorem Proof. There are two main difficulties in proving the convergence of our algorithm, and none of them is addressed in previous works. First, the Hessian matrix H is a block-structured
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning First-Order Methods, L1-Regularization, Coordinate Descent Winter 2016 Some images from this lecture are taken from Google Image Search. Admin Room: We ll count final numbers
More informationA Study on Trust Region Update Rules in Newton Methods for Large-scale Linear Classification
JMLR: Workshop and Conference Proceedings 1 16 A Study on Trust Region Update Rules in Newton Methods for Large-scale Linear Classification Chih-Yang Hsia r04922021@ntu.edu.tw Dept. of Computer Science,
More informationFinite-sum Composition Optimization via Variance Reduced Gradient Descent
Finite-sum Composition Optimization via Variance Reduced Gradient Descent Xiangru Lian Mengdi Wang Ji Liu University of Rochester Princeton University University of Rochester xiangru@yandex.com mengdiw@princeton.edu
More informationRegression with Numerical Optimization. Logistic
CSG220 Machine Learning Fall 2008 Regression with Numerical Optimization. Logistic regression Regression with Numerical Optimization. Logistic regression based on a document by Andrew Ng October 3, 204
More informationSDNA: Stochastic Dual Newton Ascent for Empirical Risk Minimization
Zheng Qu ZHENGQU@HKU.HK Department of Mathematics, The University of Hong Kong, Hong Kong Peter Richtárik PETER.RICHTARIK@ED.AC.UK School of Mathematics, The University of Edinburgh, UK Martin Takáč TAKAC.MT@GMAIL.COM
More informationOptimization methods
Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,
More informationAccelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization
Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization Lin Xiao (Microsoft Research) Joint work with Qihang Lin (CMU), Zhaosong Lu (Simon Fraser) Yuchen Zhang (UC Berkeley)
More informationSelected Topics in Optimization. Some slides borrowed from
Selected Topics in Optimization Some slides borrowed from http://www.stat.cmu.edu/~ryantibs/convexopt/ Overview Optimization problems are almost everywhere in statistics and machine learning. Input Model
More informationECS171: Machine Learning
ECS171: Machine Learning Lecture 4: Optimization (LFD 3.3, SGD) Cho-Jui Hsieh UC Davis Jan 22, 2018 Gradient descent Optimization Goal: find the minimizer of a function min f (w) w For now we assume f
More informationADMM and Fast Gradient Methods for Distributed Optimization
ADMM and Fast Gradient Methods for Distributed Optimization João Xavier Instituto Sistemas e Robótica (ISR), Instituto Superior Técnico (IST) European Control Conference, ECC 13 July 16, 013 Joint work
More informationAn interior-point gradient method for large-scale totally nonnegative least squares problems
An interior-point gradient method for large-scale totally nonnegative least squares problems Michael Merritt and Yin Zhang Technical Report TR04-08 Department of Computational and Applied Mathematics Rice
More informationGeneralized Conditional Gradient and Its Applications
Generalized Conditional Gradient and Its Applications Yaoliang Yu University of Alberta UBC Kelowna, 04/18/13 Y-L. Yu (UofA) GCG and Its Apps. UBC Kelowna, 04/18/13 1 / 25 1 Introduction 2 Generalized
More informationComposite nonlinear models at scale
Composite nonlinear models at scale Dmitriy Drusvyatskiy Mathematics, University of Washington Joint work with D. Davis (Cornell), M. Fazel (UW), A.S. Lewis (Cornell) C. Paquette (Lehigh), and S. Roy (UW)
More informationConstrained Optimization and Lagrangian Duality
CIS 520: Machine Learning Oct 02, 2017 Constrained Optimization and Lagrangian Duality Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may
More informationRandom coordinate descent algorithms for. huge-scale optimization problems. Ion Necoara
Random coordinate descent algorithms for huge-scale optimization problems Ion Necoara Automatic Control and Systems Engineering Depart. 1 Acknowledgement Collaboration with Y. Nesterov, F. Glineur ( Univ.
More informationNewton s Method. Ryan Tibshirani Convex Optimization /36-725
Newton s Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: dual correspondences Given a function f : R n R, we define its conjugate f : R n R, Properties and examples: f (y) = max x
More informationLARGE-SCALE NONCONVEX STOCHASTIC OPTIMIZATION BY DOUBLY STOCHASTIC SUCCESSIVE CONVEX APPROXIMATION
LARGE-SCALE NONCONVEX STOCHASTIC OPTIMIZATION BY DOUBLY STOCHASTIC SUCCESSIVE CONVEX APPROXIMATION Aryan Mokhtari, Alec Koppel, Gesualdo Scutari, and Alejandro Ribeiro Department of Electrical and Systems
More informationOptimization methods
Optimization methods Optimization-Based Data Analysis http://www.cims.nyu.edu/~cfgranda/pages/obda_spring16 Carlos Fernandez-Granda /8/016 Introduction Aim: Overview of optimization methods that Tend to
More informationLinear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers)
Support vector machines In a nutshell Linear classifiers selecting hyperplane maximizing separation margin between classes (large margin classifiers) Solution only depends on a small subset of training
More informationOptimization Methods for Machine Learning
Optimization Methods for Machine Learning Sathiya Keerthi Microsoft Talks given at UC Santa Cruz February 21-23, 2017 The slides for the talks will be made available at: http://www.keerthis.com/ Introduction
More informationLock-Free Approaches to Parallelizing Stochastic Gradient Descent
Lock-Free Approaches to Parallelizing Stochastic Gradient Descent Benjamin Recht Department of Computer Sciences University of Wisconsin-Madison with Feng iu Christopher Ré Stephen Wright minimize x f(x)
More informationCPSC 540: Machine Learning
CPSC 540: Machine Learning Proximal-Gradient Mark Schmidt University of British Columbia Winter 2018 Admin Auditting/registration forms: Pick up after class today. Assignment 1: 2 late days to hand in
More information