Linearized Alternating Direction Method: Two Blocks and Multiple Blocks. Zhouchen Lin 林宙辰 北京大学
1 Linearized Alternating Direction Method: Two Blocks and Multiple Blocks. Zhouchen Lin 林宙辰 北京大学. Dec. 3, 2014
2 Outline: Alternating Direction Method (ADM); Linearized Alternating Direction Method (LADM), Two Blocks; LADM, Multiple Blocks; Proximal LADM, Multiple Blocks; Conclusions. References: Lin et al., Linearized Alternating Direction Method with Adaptive Penalty for Low-Rank Representation, NIPS 2011; Lin et al., Linearized Alternating Direction Method with Parallel Splitting and Adaptive Penalty for Separable Convex Programs in Machine Learning, Machine Learning, 2015.
3 Background. Optimization is everywhere.
Compressed Sensing: $\min_x \|x\|_1$, s.t. $Ax = b$.
RPCA w/ Missing Values: $\min_{A,E} \|A\|_* + \lambda\|E\|_1$, s.t. $\pi_\Omega(A + E) = d$.
LASSO: $\min_x \|Ax - b\|^2$, s.t. $\|x\|_1 \le \varepsilon$.
Image Restoration: $\min_x \|Ax - b\|^2 + \lambda\|\nabla x\|_1$, s.t. $0 \le x \le 255$.
Covariance Selection: $\min_X \mathrm{tr}(\Sigma X) - \log\det(X) + \rho\, e^T|X|e$, s.t. $X \in S_n$, where $S_n = \{X \succeq 0 \mid \lambda_{\min} I \preceq X \preceq \lambda_{\max} I\}$.
Pose Estimation: $\min_Q \mathrm{tr}(WQ)$, s.t. $\mathrm{tr}(A_i Q) = 0$, $i = 1,\dots,m$; $Q \succeq 0$; $\mathrm{rank}(Q) \le 1$.
4 Alternating Direction Method (ADM). Model problem: $\min_{x_1,x_2} f_1(x_1) + f_2(x_2)$, s.t. $A_1(x_1) + A_2(x_2) = b$, where the $f_i$ are convex functions and the $A_i$ are linear mappings. Augmented Lagrangian function: $\tilde{L}(x_1, x_2, \lambda) = f_1(x_1) + f_2(x_2) + \langle \lambda, A_1(x_1) + A_2(x_2) - b \rangle + \frac{\beta}{2}\|A_1(x_1) + A_2(x_2) - b\|_F^2$. Updates: $x_1^{k+1} = \arg\min_{x_1} \tilde{L}(x_1, x_2^k, \lambda^k)$; $x_2^{k+1} = \arg\min_{x_2} \tilde{L}(x_1^{k+1}, x_2, \lambda^k)$; $\lambda^{k+1} = \lambda^k + \beta_k[A_1(x_1^{k+1}) + A_2(x_2^{k+1}) - b]$. Assumption: the subproblems are easy to solve.
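The ADM iteration above can be sketched in NumPy on a toy instance; this is an illustrative sketch (not from the slides), taking $f_1 = \|\cdot\|_1$, $f_2 = \frac{1}{2}\|\cdot\|^2$ and $A_1 = A_2 = I$, so that both subproblems have closed forms:

```python
import numpy as np

def soft(v, eps):
    # proximal operator of eps * ||.||_1 (soft-thresholding)
    return np.sign(v) * np.maximum(np.abs(v) - eps, 0.0)

def adm_toy(b, beta=1.0, iters=200):
    # min ||x1||_1 + 0.5 ||x2||^2  s.t.  x1 + x2 = b   (A1 = A2 = I)
    x1 = np.zeros_like(b); x2 = np.zeros_like(b); lam = np.zeros_like(b)
    for _ in range(iters):
        # x1-step: argmin ||x1||_1 + beta/2 ||x1 + x2 - b + lam/beta||^2
        x1 = soft(b - x2 - lam / beta, 1.0 / beta)
        # x2-step: argmin 0.5 ||x2||^2 + beta/2 ||x1 + x2 - b + lam/beta||^2
        x2 = beta * (b - x1 - lam / beta) / (1.0 + beta)
        # multiplier update
        lam = lam + beta * (x1 + x2 - b)
    return x1, x2, lam
```

For this separable toy problem the limit of $x_1$ is the soft-thresholding of $b$, which gives a quick correctness check.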
5 Linearized Alternating Direction Method (LADM). $x_1^{k+1} = \arg\min_{x_1} f_1(x_1) + \frac{\beta_k}{2}\|A_1(x_1) + A_2(x_2^k) - b + \lambda^k/\beta_k\|^2$; $x_2^{k+1} = \arg\min_{x_2} f_2(x_2) + \frac{\beta_k}{2}\|A_1(x_1^{k+1}) + A_2(x_2) - b + \lambda^k/\beta_k\|^2$. These subproblems are easy when they reduce to proximal operations $\min_x f_i(x) + \frac{\sigma}{2}\|x - w\|_F^2$. Examples: $\arg\min_x \varepsilon\|x\|_1 + \frac{1}{2}\|x - w\|^2 = T_\varepsilon(w)$, where $T_\varepsilon(x) = \mathrm{sgn}(x)\max(|x| - \varepsilon, 0)$ is soft-thresholding; $\arg\min_X \varepsilon\|X\|_* + \frac{1}{2}\|X - W\|_F^2 = \Theta_\varepsilon(W) = U T_\varepsilon(S) V^T$, where $W = USV^T$ is the singular value decomposition (SVD) of $W$.
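The two proximal operations on this slide can be sketched directly in NumPy; `soft_threshold` and `svt` below are illustrative names for $T_\varepsilon$ and $\Theta_\varepsilon$:

```python
import numpy as np

def soft_threshold(x, eps):
    # T_eps(x) = sgn(x) * max(|x| - eps, 0): prox of eps * ||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - eps, 0.0)

def svt(W, eps):
    # Singular value thresholding: prox of eps * ||.||_* at W,
    # obtained by soft-thresholding the singular values of W
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ np.diag(soft_threshold(s, eps)) @ Vt
```

Note that `svt` shrinks every singular value by `eps` and truncates those that fall below zero, which is why it promotes low rank.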
6 Linearized Alternating Direction Method (LADM). A common workaround is to introduce auxiliary variables: $\min_{x_1,x_2,x_3,x_4} f_1(x_1) + f_2(x_2)$, s.t. $x_1 = x_3$, $x_2 = x_4$, $A_1(x_3) + A_2(x_4) = b$, with augmented Lagrangian $\tilde{L}(x_1,x_2,x_3,x_4,\lambda_1,\lambda_2,\lambda_3) = f_1(x_1) + f_2(x_2) + \langle\lambda_1, x_1 - x_3\rangle + \langle\lambda_2, x_2 - x_4\rangle + \langle\lambda_3, A_1(x_3) + A_2(x_4) - b\rangle + \frac{\beta}{2}\left(\|x_1 - x_3\|_F^2 + \|x_2 - x_4\|_F^2 + \|A_1(x_3) + A_2(x_4) - b\|_F^2\right)$. Three drawbacks: 1. More blocks: more memory & slower convergence. 2. Matrix inversion is expensive. 3. Convergence is NOT guaranteed!
7 Linearized Alternating Direction Method (LADM). Instead, linearize the quadratic term at the current iterate:
$x_1^{k+1} = \arg\min_{x_1} f_1(x_1) + \langle A_1^*(\lambda^k) + \beta_k A_1^*(A_1(x_1^k) + A_2(x_2^k) - b),\, x_1 - x_1^k\rangle + \frac{\beta_k\eta_1}{2}\|x_1 - x_1^k\|^2 = \arg\min_{x_1} f_1(x_1) + \frac{\beta_k\eta_1}{2}\left\|x_1 - x_1^k + A_1^*\left(\lambda^k + \beta_k(A_1(x_1^k) + A_2(x_2^k) - b)\right)/(\beta_k\eta_1)\right\|^2$;
$x_2^{k+1} = \arg\min_{x_2} f_2(x_2) + \frac{\beta_k\eta_2}{2}\left\|x_2 - x_2^k + A_2^*\left(\lambda^k + \beta_k(A_1(x_1^{k+1}) + A_2(x_2^k) - b)\right)/(\beta_k\eta_2)\right\|^2$.
Here $A^*$ is the adjoint operator: $\langle A(x), y\rangle = \langle x, A^*(y)\rangle$, $\forall x, y$.
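The linearized updates above can be sketched for a concrete two-block problem, $\min \|x_1\|_1 + \|x_2\|_1$ s.t. $A_1 x_1 + A_2 x_2 = b$, where each step is a soft-thresholding; this is an illustrative sketch with a fixed penalty $\beta$ and names of our own choosing:

```python
import numpy as np

def soft(v, eps):
    # prox of eps * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - eps, 0.0)

def ladm_l1(A1, A2, b, beta=1.0, iters=5000):
    # min ||x1||_1 + ||x2||_1  s.t.  A1 x1 + A2 x2 = b, by linearized ADM
    eta1 = 1.02 * np.linalg.norm(A1, 2) ** 2   # eta_i > ||A_i||^2
    eta2 = 1.02 * np.linalg.norm(A2, 2) ** 2
    x1 = np.zeros(A1.shape[1]); x2 = np.zeros(A2.shape[1])
    lam = np.zeros_like(b)
    for _ in range(iters):
        r = A1 @ x1 + A2 @ x2 - b
        # linearized x1-step: prox of ||.||_1 at a gradient point
        x1 = soft(x1 - A1.T @ (lam + beta * r) / (beta * eta1), 1.0 / (beta * eta1))
        r = A1 @ x1 + A2 @ x2 - b           # recompute with the new x1
        x2 = soft(x2 - A2.T @ (lam + beta * r) / (beta * eta2), 1.0 / (beta * eta2))
        lam = lam + beta * (A1 @ x1 + A2 @ x2 - b)
    return x1, x2
```

On a tiny instance such as $\min |x_1| + |x_2|$ s.t. $x_1 + x_2 = 2$, the iterates become feasible with nonnegative components, matching the KKT analysis.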
8 LADM with Adaptive Penalty (LADMAP). Theorem: If $\{\beta_k\}$ is non-decreasing and upper bounded and $\eta_i > \|A_i\|^2$, $i = 1, 2$, then the sequence $\{(x_1^k, x_2^k, \lambda^k)\}$ converges to a KKT point of the model problem.
9 LADM with Adaptive Penalty (LADMAP). The updates
$x_1^{k+1} = \arg\min_{x_1} f_1(x_1) + \frac{\beta_k\eta_1}{2}\left\|x_1 - x_1^k + A_1^*\left(\lambda^k + \beta_k(A_1(x_1^k) + A_2(x_2^k) - b)\right)/(\beta_k\eta_1)\right\|^2$,
$x_2^{k+1} = \arg\min_{x_2} f_2(x_2) + \frac{\beta_k\eta_2}{2}\left\|x_2 - x_2^k + A_2^*\left(\lambda^k + \beta_k(A_1(x_1^{k+1}) + A_2(x_2^k) - b)\right)/(\beta_k\eta_2)\right\|^2$
imply the optimality conditions
$-\beta_k\eta_1(x_1^{k+1} - x_1^k) - A_1^*\left(\lambda^k + \beta_k(A_1(x_1^k) + A_2(x_2^k) - b)\right) \in \partial f_1(x_1^{k+1})$,
$-\beta_k\eta_2(x_2^{k+1} - x_2^k) - A_2^*\left(\lambda^k + \beta_k(A_1(x_1^{k+1}) + A_2(x_2^k) - b)\right) \in \partial f_2(x_2^{k+1})$.
KKT condition: $\exists\,(x_1^*, x_2^*, \lambda^*)$ such that $A_1(x_1^*) + A_2(x_2^*) - b = 0$, $-A_1^*(\lambda^*) \in \partial f_1(x_1^*)$, $-A_2^*(\lambda^*) \in \partial f_2(x_2^*)$.
10 LADM with Adaptive Penalty (LADMAP). Both $\beta_k\eta_1\|x_1^{k+1} - x_1^k\|/\|A_1^*(b)\|$ and $\beta_k\eta_2\|x_2^{k+1} - x_2^k\|/\|A_2^*(b)\|$ should be small. Since $\eta_i \geq \|A_i\|^2$, approximate $\|A_i^*(b)\|$ by $\sqrt{\eta_i}\|b\|$. Adaptive penalty: $\beta_{k+1} = \min(\beta_{\max}, \rho\beta_k)$, where $\rho = \rho_0$ if $\beta_k\max\left(\sqrt{\eta_1}\|x_1^{k+1} - x_1^k\|_F, \sqrt{\eta_2}\|x_2^{k+1} - x_2^k\|_F\right)/\|b\|_F < \varepsilon_2$ and $\rho = 1$ otherwise, with $\rho_0 \geq 1$ a constant. Loop until $\|A_1(x_1^{k+1}) + A_2(x_2^{k+1}) - b\|_F/\|b\|_F < \varepsilon_1$ and $\beta_k\max\left(\sqrt{\eta_1}\|x_1^{k+1} - x_1^k\|_F, \sqrt{\eta_2}\|x_2^{k+1} - x_2^k\|_F\right)/\|b\|_F < \varepsilon_2$.
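The adaptive rule above amounts to a one-line helper: the penalty grows by a factor $\rho_0$ only when the (scaled) change in the variables has fallen below $\varepsilon_2$, and is capped at $\beta_{\max}$. A minimal sketch, with argument names of our own choosing:

```python
def update_penalty(beta, change, eps2, rho0=1.9, beta_max=1e10):
    # beta_{k+1} = min(beta_max, rho * beta_k), where rho = rho0 if the
    # scaled variable change fell below eps2 (iterates have nearly stalled,
    # so the penalty may grow), and rho = 1 otherwise
    rho = rho0 if change < eps2 else 1.0
    return min(beta_max, rho * beta)
```

Here `change` stands for the quantity $\beta_k\max_i\sqrt{\eta_i}\|x_i^{k+1}-x_i^k\|/\|b\|$ computed by the caller.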
11 LADM with Adaptive Penalty (LADMAP). Choice of parameters: 1. $\beta_0 = \alpha\varepsilon_2$, where $\alpha$ is proportional to the size of $b$; $\beta_0$ should not be too large, so that $\beta_k$ increases in the first few iterations. 2. $\rho_0 \geq 1$ should be chosen such that $\beta_k$ increases steadily (but not necessarily every iteration).
12 LADM with Adaptive Penalty (LADMAP). An example (LRR): $\min_{Z,E} \|Z\|_* + \mu\|E\|_1$, s.t. $X = XZ + E$. Here $A_1(Z) = XZ$, $A_2(E) = E$; $A_1^*(Z) = X^T Z$, $A_2^*(E) = E$; $\eta_1 = \|X\|^2$, $\eta_2 = 1$.
13 Experiment. Table 1: Comparison among APG, ADM, LADM and LADMAP on synthetic data. For each quadruple $(s, p, d, \tilde{r})$, the LRR problem, with $\mu = 0.1$, was solved for the same data using different algorithms. Reported are typical running time (in $10^3$ seconds), iteration number, relative errors $\|\hat{Z} - Z_0\|/\|Z_0\|$ and $\|\hat{E} - E_0\|/\|E_0\|$ of the output solution $(\hat{E}, \hat{Z})$, and the clustering accuracy (%). Sizes tested: (10, 20, 200, 5), (15, 20, 300, 5), (20, 25, 500, 5), (30, 30, 900, 5). [The numerical entries of the table are not recoverable from this transcription; some entries are N.A.]
14 LADM with Parallel Splitting and Adaptive Penalty (LADMPSAP). Model problem: $\min_{x_1,\dots,x_n} \sum_{i=1}^n f_i(x_i)$, s.t. $\sum_{i=1}^n A_i(x_i) = b$. Example (nonnegative noisy matrix completion, reformulated step by step):
$\min_X \|X\|_* + \frac{1}{2\mu}\|b - \pi_\Omega(X)\|^2$, s.t. $X \geq 0$;
$\Leftrightarrow \min_{X,e} \|X\|_* + \frac{1}{2\mu}\|e\|^2$, s.t. $b = \pi_\Omega(X) + e$, $X \geq 0$;
$\Leftrightarrow \min_{X,Y,e} \|X\|_* + \frac{1}{2\mu}\|e\|^2$, s.t. $b = \pi_\Omega(Y) + e$, $X = Y$, $Y \geq 0$;
$\Leftrightarrow \min_{X,Y,e} \|X\|_* + \frac{1}{2\mu}\|e\|^2 + \chi_{Y \geq 0}(Y)$, s.t. $b = \pi_\Omega(Y) + e$, $X = Y$.
15 LADM with Parallel Splitting and Adaptive Penalty (LADMPSAP). Can we naively generalize two-block LADMAP to multi-block problems? No! The naive generalization of LADMAP may diverge, e.g., when applied to the following problem with $n \geq 5$: $\min_{x_1,\dots,x_n} \sum_{i=1}^n \|x_i\|_1$, s.t. $\sum_{i=1}^n A_i x_i = b$. See also C. Chen et al., The Direct Extension of ADMM for Multi-block Convex Minimization Problems is Not Necessarily Convergent, preprint.
16 LADM with Parallel Splitting and Adaptive Penalty (LADMPSAP). Update all blocks in parallel, each using only the previous iterate:
$x_i^{k+1} = \arg\min_{x_i} f_i(x_i) + \frac{\sigma_i^{(k)}}{2}\left\|x_i - x_i^k + A_i^*(\hat{\lambda}^k)/\sigma_i^{(k)}\right\|^2$, $i = 1,\dots,n$, where $\hat{\lambda}^k = \lambda^k + \beta_k\left(\sum_{j=1}^n A_j(x_j^k) - b\right)$ and $\sigma_i^{(k)} = \eta_i\beta_k$ (Parallel!);
$\lambda^{k+1} = \lambda^k + \beta_k\left(\sum_{i=1}^n A_i(x_i^{k+1}) - b\right)$;
$\beta_{k+1} = \min(\beta_{\max}, \rho\beta_k)$, where $\rho = \rho_0$ if $\beta_k\max\left\{\sqrt{\eta_i}\|x_i^{k+1} - x_i^k\|,\ i = 1,\dots,n\right\}/\|b\| < \varepsilon_2$ and $\rho = 1$ otherwise, with $\rho_0 > 1$ a constant and $0 < \varepsilon_2 \ll 1$ a threshold.
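The parallel updates and the adaptive penalty can be sketched together for the $\ell_1$ problem of the previous slide; this is an illustrative sketch with our own variable names, using $\eta_i > n\|A_i\|^2$ as the convergence theorem requires:

```python
import numpy as np

def soft(v, eps):
    # prox of eps * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - eps, 0.0)

def ladmpsap_l1(As, b, beta=0.1, rho0=1.5, eps2=1e-3, beta_max=1e8, iters=3000):
    # min sum_i ||x_i||_1  s.t.  sum_i A_i x_i = b, all blocks in parallel
    n = len(As)
    etas = [1.02 * n * np.linalg.norm(A, 2) ** 2 for A in As]  # eta_i > n ||A_i||^2
    xs = [np.zeros(A.shape[1]) for A in As]
    lam = np.zeros_like(b)
    for _ in range(iters):
        r = sum(A @ x for A, x in zip(As, xs)) - b
        lam_hat = lam + beta * r
        # parallel step: every block uses only the previous iterate via lam_hat
        new_xs = [soft(x - A.T @ lam_hat / (beta * eta), 1.0 / (beta * eta))
                  for A, x, eta in zip(As, xs, etas)]
        change = max(np.sqrt(eta) * np.linalg.norm(nx - x)
                     for nx, x, eta in zip(new_xs, xs, etas))
        xs = new_xs
        lam = lam + beta * (sum(A @ x for A, x in zip(As, xs)) - b)
        # adaptive penalty: grow beta only when the variables have nearly stalled
        if beta * change / np.linalg.norm(b) < eps2:
            beta = min(beta_max, rho0 * beta)
    return xs
```

On the toy instance $\min \sum_{i=1}^3 |x_i|$ s.t. $x_1 + x_2 + x_3 = 3$, every feasible nonnegative split is optimal with objective 3, so near-feasibility and near-optimal objective are easy to check.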
17 LADM with Parallel Splitting and Adaptive Penalty (LADMPSAP). Theorem: If $\{\beta_k\}$ is non-decreasing and upper bounded and $\eta_i > n\|A_i\|^2$, $i = 1,\dots,n$, then $\{(\{x_i^k\}, \lambda^k)\}$ generated by LADMPSAP converges to a KKT point of the problem. Remark: When $n = 2$, LADMPSAP is weaker than LADMAP: it requires $\eta_i > 2\|A_i\|^2$ vs. $\eta_i > \|A_i\|^2$. Related work: He & Yuan, Linearized Alternating Direction Method with Gaussian Back Substitution for Separable Convex Programming, preprint.
18 LADM with Parallel Splitting and Adaptive Penalty (LADMPSAP). Model problem with set constraints: $\min_{x_1,\dots,x_n} \sum_{i=1}^n f_i(x_i)$, s.t. $\sum_{i=1}^n A_i(x_i) = b$, $x_i \in X_i$, $i = 1,\dots,n$, where each $X_i \subseteq \mathbb{R}^{d_i}$ is a closed convex set. Reformulate with auxiliary variables $x_{n+1},\dots,x_{2n}$: $\min_{x_1,\dots,x_{2n}} \sum_{i=1}^n f_i(x_i) + \sum_{i=n+1}^{2n} \chi_{X_{i-n}}(x_i)$, s.t. $\sum_{i=1}^n A_i(x_i) = b$, $x_i = x_{n+i}$, $i = 1,\dots,n$. Theorem: If $\{\beta_k\}$ is non-decreasing and upper bounded, and $\eta_i > n(\|A_i\|^2 + 1)$, $\eta_{n+i} > n$, $i = 1,\dots,n$, then $\{(\{x_i^k\}, \lambda^k)\}$ generated by LADMPSAP converges to a KKT point of the problem.
19 Experiment. Solve the nonnegative noisy matrix completion problem $\min_X \|X\|_* + \frac{1}{2\mu}\|b - \pi_\Omega(X)\|^2$, s.t. $X \geq 0$, via its reformulation $\min_{X,Y,e} \|X\|_* + \frac{1}{2\mu}\|e\|^2 + \chi_{Y \geq 0}(Y)$, s.t. $b = \pi_\Omega(Y) + e$, $X = Y$.
20 Experiment. Table 1: Numerical comparison on the NMC problem with synthetic data, average of 10 runs. $q$, $t$ and $d_r$ denote, respectively, the sample ratio, the number of measurements $t = q(mn)$, and the "degree of freedom" defined by $d_r = r(m + n - r)$ for a matrix with rank $r$. Here $m = n$ and $r = 10$ in all tests; reported are iteration number, time (s), relative error, and FA for LADM and LADMPSAP. Table 2: Numerical comparison on the image inpainting problem (\#Iter., time, PSNR, FA) for FPCA, LADM, and LADMPSAP. [The numerical entries of both tables are not recoverable from this transcription.]
21 LADM with Parallel Splitting and Adaptive Penalty (LADMPSAP). Enhanced convergence results. Theorem 1: If $\{\beta_k\}$ is non-decreasing with $\sum_{k=1}^{+\infty} \beta_k^{-1} = +\infty$, $\eta_i > n\|A_i\|^2$, and $\partial f_i(x)$ is bounded, $i = 1,\dots,n$, then the sequence $\{x_i^k\}$ generated by LADMPSAP converges to an optimal solution to the model problem. Theorem 2: If $\{\beta_k\}$ is non-decreasing, $\eta_i > n\|A_i\|^2$, and $\partial f_i(x)$ is bounded, $i = 1,\dots,n$, then $\sum_{k=1}^{+\infty} \beta_k^{-1} = +\infty$ is also a necessary condition for the global convergence of $\{x_i^k\}$ generated by LADMPSAP to an optimal solution to the model problem. Consequently, when all the subgradients of the component objective functions are bounded, the upper bound $\beta_{\max}$ can be removed.
22 LADM with Parallel Splitting and Adaptive Penalty (LADMPSAP). Define $\mathbf{x} = (x_1^T, \dots, x_n^T)^T$, $\mathbf{x}^* = ((x_1^*)^T, \dots, (x_n^*)^T)^T$ and $f(\mathbf{x}) = \sum_{i=1}^n f_i(x_i)$, where $(x_1^*, \dots, x_n^*, \lambda^*)$ is a KKT point of the model problem. Proposition: $\tilde{\mathbf{x}}$ is an optimal solution to the model problem iff there exists $\alpha > 0$ such that $f(\tilde{\mathbf{x}}) - f(\mathbf{x}^*) + \sum_{i=1}^n \langle A_i^*(\lambda^*), \tilde{x}_i - x_i^* \rangle + \alpha\left\|\sum_{i=1}^n A_i(\tilde{x}_i) - b\right\|^2 = 0$. Our criterion for checking the optimality of a solution is much simpler than that in He & Yuan 2011, which has to compare with all $(x_1, \dots, x_n, \lambda) \in \mathbb{R}^{d_1} \times \cdots \times \mathbb{R}^{d_n} \times \mathbb{R}^m$. B. S. He and X. Yuan, On the O(1/t) convergence rate of alternating direction method, preprint, 2011.
23 LADM with Parallel Splitting and Adaptive Penalty (LADMPSAP). Theorem 3 (ergodic convergence rate): Define $\bar{\mathbf{x}}^K = \sum_{k=0}^K \gamma_k \mathbf{x}^{k+1}$, where $\gamma_k = \beta_k^{-1}\big/\sum_{j=0}^K \beta_j^{-1}$. Then
$f(\bar{\mathbf{x}}^K) - f(\mathbf{x}^*) + \sum_{i=1}^n \langle A_i^*(\lambda^*), \bar{x}_i^K - x_i^* \rangle + \alpha\left\|\sum_{i=1}^n A_i(\bar{x}_i^K) - b\right\|^2 \leq \frac{C_0}{2\sum_{k=0}^K \beta_k^{-1}}$, (1)
where $\alpha^{-1} = (n+1)\max\left(1, \max_i \frac{\|A_i\|^2}{\eta_i - n\|A_i\|^2}\right)$ and $C_0 = \sum_{i=1}^n \eta_i\|x_i^0 - x_i^*\|^2 + \beta_0^{-2}\|\lambda^0 - \lambda^*\|^2$. A much simpler proof of the convergence rate (in the ergodic sense)!
24 Proximal LADMPSAP. An even more general problem: $\min_{x_1,\dots,x_n} \sum_{i=1}^n f_i(x_i)$, s.t. $\sum_{i=1}^n A_i(x_i) = b$, where $f_i(x_i) = g_i(x_i) + h_i(x_i)$, both $g_i$ and $h_i$ are convex, $g_i$ is $C^{1,1}$: $\|\nabla g_i(x) - \nabla g_i(y)\| \leq L_i\|x - y\|$, $\forall x, y \in \mathbb{R}^{d_i}$, and $h_i$ may not be differentiable but its proximal operation is easily solvable.
25 Proximal LADMPSAP. Linearize the augmented term to obtain $x_i^{k+1} = \arg\min_{x_i} h_i(x_i) + g_i(x_i) + \frac{\sigma_i^{(k)}}{2}\left\|x_i - x_i^k + A_i^*(\hat{\lambda}^k)/\sigma_i^{(k)}\right\|^2$, $i = 1,\dots,n$. Further linearize $g_i$:
$x_i^{k+1} = \arg\min_{x_i} h_i(x_i) + g_i(x_i^k) + \langle \nabla g_i(x_i^k) + A_i^*(\hat{\lambda}^k),\, x_i - x_i^k \rangle + \frac{\tau_i^{(k)}}{2}\|x_i - x_i^k\|^2 = \arg\min_{x_i} h_i(x_i) + \frac{\tau_i^{(k)}}{2}\left\|x_i - x_i^k + \left[A_i^*(\hat{\lambda}^k) + \nabla g_i(x_i^k)\right]/\tau_i^{(k)}\right\|^2$.
Convergence condition: $\tau_i^{(k)} = T_i + \eta_i\beta_k$, where $T_i \geq L_i$ and $\eta_i > n\|A_i\|^2$ are both positive constants.
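The doubly linearized step can be sketched for a single block ($n = 1$) with $h = \|\cdot\|_1$ and a smooth quadratic $g(x) = \frac{1}{2}\|Cx - d\|^2$; this is an illustrative sketch with a fixed penalty and our own names:

```python
import numpy as np

def soft(v, eps):
    # prox of eps * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - eps, 0.0)

def prox_ladm(C, d, A, b, beta=1.0, iters=4000):
    # min ||x||_1 + 0.5 ||Cx - d||^2  s.t.  Ax = b  (single block, n = 1)
    L = np.linalg.norm(C, 2) ** 2           # Lipschitz constant of grad g
    eta = 1.02 * np.linalg.norm(A, 2) ** 2  # eta > ||A||^2
    x = np.zeros(C.shape[1]); lam = np.zeros_like(b)
    for _ in range(iters):
        lam_hat = lam + beta * (A @ x - b)
        tau = L + eta * beta                # tau^(k) = T + eta * beta_k, T >= L
        grad = C.T @ (C @ x - d)            # gradient of the smooth part g
        # prox step on h at the doubly linearized point
        x = soft(x - (A.T @ lam_hat + grad) / tau, 1.0 / tau)
        lam = lam + beta * (A @ x - b)
    return x
```

For $C = I$, $d = (2, 0)$ and the constraint $x_1 + x_2 = 1$, a short case analysis gives the optimum $x^* = (1, 0)$, which the sketch can be checked against.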
26 Experiment: Group Sparse Logistic Regression with Overlap.
$\min_{w,b} \frac{1}{s}\sum_{i=1}^s \log\left(1 + \exp\left(-y_i(w^T x_i + b)\right)\right) + \mu\sum_{j=1}^t \|S_j w\|$, (1)
where $x_i$ and $y_i$, $i = 1,\dots,s$, are the training data and labels, respectively, and $w$ and $b$ parameterize the linear classifier. The $S_j$, $j = 1,\dots,t$, are selection matrices, with only one 1 in each row and all other entries zero. The groups of entries $S_j w$, $j = 1,\dots,t$, may overlap each other. Introducing $\bar{w} = (w^T, b)^T$, $\bar{x}_i = (x_i^T, 1)^T$, $z = (z_1^T, z_2^T, \dots, z_t^T)^T$, and $\bar{S} = (S, 0)$, where $S = (S_1^T, \dots, S_t^T)^T$, (1) can be rewritten as
$\min_{\bar{w},z} \frac{1}{s}\sum_{i=1}^s \log\left(1 + \exp\left(-y_i \bar{w}^T \bar{x}_i\right)\right) + \mu\sum_{j=1}^t \|z_j\|$, s.t. $z = \bar{S}\bar{w}$. (2)
The Lipschitz constant of the gradient of the logistic term with respect to $\bar{w}$ can be proven to satisfy $L_{\bar{w}} \leq \frac{1}{4s}\|\bar{X}\|^2$, where $\bar{X} = (\bar{x}_1, \bar{x}_2, \dots, \bar{x}_s)$.
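The Lipschitz bound on the logistic gradient follows from the Hessian bound $\frac{1}{s}\bar{X}\,\mathrm{diag}(\sigma'(m_i))\,\bar{X}^T \preceq \frac{1}{4s}\bar{X}\bar{X}^T$ (since $\sigma' \leq 1/4$), and it can be sanity-checked numerically; the helper names below are our own:

```python
import numpy as np

def logistic_grad(wbar, Xbar, y):
    # gradient of (1/s) * sum_i log(1 + exp(-y_i * wbar^T xbar_i)) w.r.t. wbar,
    # where the columns of Xbar are the augmented samples xbar_i
    s = len(y)
    m = y * (Xbar.T @ wbar)                  # margins y_i * wbar^T xbar_i
    return -(Xbar @ (y / (1.0 + np.exp(m)))) / s

def logistic_lipschitz_bound(Xbar):
    # claimed bound: L <= ||Xbar||_2^2 / (4 s)
    return np.linalg.norm(Xbar, 2) ** 2 / (4.0 * Xbar.shape[1])
```

Sampling random pairs of weight vectors and checking $\|\nabla g(w_1) - \nabla g(w_2)\| \leq L\|w_1 - w_2\|$ gives a quick empirical confirmation of the bound.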
27 Experiment. Comparison of ADM, LADM, LADMPS, LADMPSAP and pLADMPSAP on group sparse logistic regression: time, \#Iter., and relative errors $\|\hat{\bar{w}} - \bar{w}^*\|/\|\bar{w}^*\|$ and $\|\hat{z} - z^*\|/\|z^*\|$ for sizes $(s, p, t, q) = (300, 901, 100, 10)$, $(450, 1351, 150, 15)$, $(600, 1801, 200, 20)$. [The numerical entries of the table are not recoverable from this transcription.]
28 Conclusions. LADMAP, LADMPSAP, and proximal LADMPSAP (pLADMPSAP) are very general methods for solving various convex programs. The adaptive penalty is important for fast convergence.
29 Thanks!
Warm up D cai.yo.ie p IExrL9CxsYD Sglx.Ddl f E Luo fhlexi.si dbll Fix any a, b, c > 0. 1. What is the x 2 R that minimizes ax 2 + bx + c x a b Ta OH 2 ax 16 0 x 1 Za fhkxiiso3ii draulx.h dp.d 2. What is
More informationSparse Optimization Lecture: Dual Methods, Part I
Sparse Optimization Lecture: Dual Methods, Part I Instructor: Wotao Yin July 2013 online discussions on piazza.com Those who complete this lecture will know dual (sub)gradient iteration augmented l 1 iteration
More informationAlgorithms for constrained local optimization
Algorithms for constrained local optimization Fabio Schoen 2008 http://gol.dsi.unifi.it/users/schoen Algorithms for constrained local optimization p. Feasible direction methods Algorithms for constrained
More informationMATH 680 Fall November 27, Homework 3
MATH 680 Fall 208 November 27, 208 Homework 3 This homework is due on December 9 at :59pm. Provide both pdf, R files. Make an individual R file with proper comments for each sub-problem. Subgradients and
More information4 Frequent Directions
4 Frequent Directions Edo Liberty[3] discovered a strong connection between matrix sketching and frequent items problems. In FREQUENTITEMS problem, we are given a stream S = hs 1,s 2,...,s n i of n items
More informationDistributed Convex Optimization
Master Program 2013-2015 Electrical Engineering Distributed Convex Optimization A Study on the Primal-Dual Method of Multipliers Delft University of Technology He Ming Zhang, Guoqiang Zhang, Richard Heusdens
More informationStatistical Methods for SVM
Statistical Methods for SVM Support Vector Machines Here we approach the two-class classification problem in a direct way: We try and find a plane that separates the classes in feature space. If we cannot,
More informationNumerical Linear Algebra Primer. Ryan Tibshirani Convex Optimization /36-725
Numerical Linear Algebra Primer Ryan Tibshirani Convex Optimization 10-725/36-725 Last time: proximal gradient descent Consider the problem min g(x) + h(x) with g, h convex, g differentiable, and h simple
More informationCoordinate Update Algorithm Short Course Proximal Operators and Algorithms
Coordinate Update Algorithm Short Course Proximal Operators and Algorithms Instructor: Wotao Yin (UCLA Math) Summer 2016 1 / 36 Why proximal? Newton s method: for C 2 -smooth, unconstrained problems allow
More informationLINEAR CLASSIFIERS. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition
LINEAR CLASSIFIERS Classification: Problem Statement 2 In regression, we are modeling the relationship between a continuous input variable x and a continuous target variable t. In classification, the input
More information2.3. Clustering or vector quantization 57
Multivariate Statistics non-negative matrix factorisation and sparse dictionary learning The PCA decomposition is by construction optimal solution to argmin A R n q,h R q p X AH 2 2 under constraint :
More informationProbabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms
Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms François Caron Department of Statistics, Oxford STATLEARN 2014, Paris April 7, 2014 Joint work with Adrien Todeschini,
More information6. Proximal gradient method
L. Vandenberghe EE236C (Spring 2016) 6. Proximal gradient method motivation proximal mapping proximal gradient method with fixed step size proximal gradient method with line search 6-1 Proximal mapping
More informationLINEARIZED AUGMENTED LAGRANGIAN AND ALTERNATING DIRECTION METHODS FOR NUCLEAR NORM MINIMIZATION
LINEARIZED AUGMENTED LAGRANGIAN AND ALTERNATING DIRECTION METHODS FOR NUCLEAR NORM MINIMIZATION JUNFENG YANG AND XIAOMING YUAN Abstract. The nuclear norm is widely used to induce low-rank solutions for
More informationKernel Methods and Support Vector Machines
Kernel Methods and Support Vector Machines Oliver Schulte - CMPT 726 Bishop PRML Ch. 6 Support Vector Machines Defining Characteristics Like logistic regression, good for continuous input features, discrete
More informationWarm up. Regrade requests submitted directly in Gradescope, do not instructors.
Warm up Regrade requests submitted directly in Gradescope, do not email instructors. 1 float in NumPy = 8 bytes 10 6 2 20 bytes = 1 MB 10 9 2 30 bytes = 1 GB For each block compute the memory required
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Oct 27, 2015 Outline One versus all/one versus one Ranking loss for multiclass/multilabel classification Scaling to millions of labels Multiclass
More informationLeast Squares Regression
E0 70 Machine Learning Lecture 4 Jan 7, 03) Least Squares Regression Lecturer: Shivani Agarwal Disclaimer: These notes are a brief summary of the topics covered in the lecture. They are not a substitute
More informationLecture Notes 10: Matrix Factorization
Optimization-based data analysis Fall 207 Lecture Notes 0: Matrix Factorization Low-rank models. Rank- model Consider the problem of modeling a quantity y[i, j] that depends on two indices i and j. To
More informationLecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods.
Lecture 5: Linear models for classification. Logistic regression. Gradient Descent. Second-order methods. Linear models for classification Logistic regression Gradient descent and second-order methods
More informationAn ADMM algorithm for optimal sensor and actuator selection
An ADMM algorithm for optimal sensor and actuator selection Neil K. Dhingra, Mihailo R. Jovanović, and Zhi-Quan Luo 53rd Conference on Decision and Control, Los Angeles, California, 2014 1 / 25 2 / 25
More informationA Proof of Theorem 1. Dual Weighted Generalized Nuclear Norm. C Proof of Theorem 2
A Proof of Theorem Proof or all Z = P i c i i with kak A = P i c i, where i = Au i v T i BT, we can construct the i-th column of W and H as Clearly, we have Z = WH T and w i = p c i Au i and h i = p c
More information1 Computing with constraints
Notes for 2017-04-26 1 Computing with constraints Recall that our basic problem is minimize φ(x) s.t. x Ω where the feasible set Ω is defined by equality and inequality conditions Ω = {x R n : c i (x)
More informationPositive Definite Matrix
1/29 Chia-Ping Chen Professor Department of Computer Science and Engineering National Sun Yat-sen University Linear Algebra Positive Definite, Negative Definite, Indefinite 2/29 Pure Quadratic Function
More informationsublinear time low-rank approximation of positive semidefinite matrices Cameron Musco (MIT) and David P. Woodru (CMU)
sublinear time low-rank approximation of positive semidefinite matrices Cameron Musco (MIT) and David P. Woodru (CMU) 0 overview Our Contributions: 1 overview Our Contributions: A near optimal low-rank
More informationarxiv: v1 [math.oc] 23 May 2017
A DERANDOMIZED ALGORITHM FOR RP-ADMM WITH SYMMETRIC GAUSS-SEIDEL METHOD JINCHAO XU, KAILAI XU, AND YINYU YE arxiv:1705.08389v1 [math.oc] 23 May 2017 Abstract. For multi-block alternating direction method
More informationProximal Newton Method. Ryan Tibshirani Convex Optimization /36-725
Proximal Newton Method Ryan Tibshirani Convex Optimization 10-725/36-725 1 Last time: primal-dual interior-point method Given the problem min x subject to f(x) h i (x) 0, i = 1,... m Ax = b where f, h
More informationDual and primal-dual methods
ELE 538B: Large-Scale Optimization for Data Science Dual and primal-dual methods Yuxin Chen Princeton University, Spring 2018 Outline Dual proximal gradient method Primal-dual proximal gradient method
More informationContraction Methods for Convex Optimization and Monotone Variational Inequalities No.18
XVIII - 1 Contraction Methods for Convex Optimization and Monotone Variational Inequalities No18 Linearized alternating direction method with Gaussian back substitution for separable convex optimization
More informationNONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition
NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function
More informationLeast Squares Regression
CIS 50: Machine Learning Spring 08: Lecture 4 Least Squares Regression Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may not cover all the
More information