Lecture 20: November 7


10-725/36-725: Convex Optimization, Fall 2015
Lecturer: Ryan Tibshirani    Scribes: Varsha Chinnaobireddy, Joon Sik Kim, Lingyao Zhang
Note: LaTeX template courtesy of UC Berkeley EECS dept.
Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications. They may be distributed outside this class only with the permission of the Instructor.

20.1 Background of Coordinate Descent

We have studied many sophisticated methods for solving convex minimization problems, e.g. gradient descent, proximal gradient descent, stochastic gradient descent, Newton's method, quasi-Newton methods, proximal Newton's method, the barrier method, and the primal-dual interior point method. These methods update all coordinates of the variable at the same time. But the coordinates may not be equally important: one coordinate can influence the criterion value more than the others do. So what if we instead minimize the criterion along each coordinate separately? We might first want to answer the following questions.

Q: Given a convex, differentiable function f : R^n → R, if we are at a point x such that f(x) is minimized along each coordinate axis, have we found a global minimizer? That is, does f(x + δe_i) ≥ f(x) for all δ and i imply f(x) = min_z f(z)? Here e_i = (0, ..., 1, ..., 0) ∈ R^n is the i-th standard basis vector.

A: Yes! Proof: Since f is minimized along each coordinate axis, every partial derivative vanishes, so

    0 = ∇f(x) = (∂f/∂x_1 (x), ..., ∂f/∂x_n (x)),    (20.1)

and for a convex, differentiable f a zero gradient implies global optimality.

Q: Same question, but now for f convex and not differentiable?

A: No. See the counterexample in Figure 20.1. If we are at the intersection of the two red lines, where f is not differentiable, moving along either axis always increases the criterion value, yet this point is not a global minimum.

[Figure 20.1: A counterexample.]

Q: Same question again, but now for f(x) = g(x) + h(x) = g(x) + ∑_{i=1}^n h_i(x_i), with g convex and differentiable and each h_i convex? (Here the non-smooth part is called separable.)

A: Yes! Proof: We want to show that for all y ∈ R^n,

    f(y) − f(x) ≥ 0.    (20.2)

We know that

    f(x + δe_i) = g(x + δe_i) + ∑_{j ≠ i} h_j(x_j) + h_i(x_i + δ).    (20.3)

Since x is optimal along the i-th axis, by subgradient optimality we have

    0 ∈ ∇_i g(x) + ∂h_i(x_i),    (20.4)

i.e. −∇_i g(x) ∈ ∂h_i(x_i), and so by the definition of the subgradient,

    h_i(y_i) − h_i(x_i) ≥ −∇_i g(x)(y_i − x_i),   i.e.   ∇_i g(x)(y_i − x_i) + h_i(y_i) − h_i(x_i) ≥ 0.

Since g is convex, by the first-order characterization we have

    f(y) − f(x) ≥ ∇g(x)^T (y − x) + ∑_{i=1}^n [h_i(y_i) − h_i(x_i)]
                = ∑_{i=1}^n [∇_i g(x)(y_i − x_i) + h_i(y_i) − h_i(x_i)] ≥ 0.    (20.5)

20.2 Coordinate Descent

For the problem

    min_x f(x)    (20.6)

where f(x) = g(x) + ∑_{i=1}^n h_i(x_i), with g convex and differentiable and each h_i convex, we can use coordinate descent: let x^(0) ∈ R^n, and for k = 1, 2, 3, ... repeat

    x_i^(k) = argmin_{x_i} f(x_1^(k), ..., x_{i−1}^(k), x_i, x_{i+1}^(k−1), ..., x_n^(k−1)),    i = 1, 2, ..., n.

Note that we always use the most recent information possible. Tseng [4] proves that for such f (provided f is continuous on the compact set {x : f(x) ≤ f(x^(0))} and f attains its minimum), every limit point of x^(k), k = 1, 2, 3, ..., is a minimizer of f.

Here are some useful and important notes on coordinate descent:

1. The order of the cycle through the coordinates is arbitrary; any permutation of {1, 2, ..., n} can be used.
2. Individual coordinates can everywhere be replaced by blocks of coordinates, i.e. we can update a group of coordinates at the same time.
3. The one-at-a-time update scheme is critical; the all-at-once scheme does not necessarily converge.
4. The analogy for solving linear systems: Gauss-Seidel versus the Jacobi method.

20.3 Examples of Coordinate Descent

20.3.1 Linear Regression

For classical linear regression, we consider

    min_β (1/2) ||y − Xβ||_2^2    (20.7)

where y ∈ R^n and X ∈ R^{n×p}. Take the gradient of the objective with respect to β_i (the i-th element of β), with all other β_j fixed, and set it to zero to get the update step:

    X_i^T (Xβ − y) = X_i^T X_i β_i + X_i^T (X_{−i} β_{−i} − y) = 0   ⟹   β_i ← X_i^T (y − X_{−i} β_{−i}) / (X_i^T X_i),    (20.8)

where X_{−i} and β_{−i} are the matrix X and the vector β with the i-th column and element removed, respectively. Repeat this update for i = 1, 2, ..., p, 1, 2, .... These are exactly the Gauss-Seidel updates.

Remark. The computational cost (in flops) of one full cycle of coordinate descent is O(np), since computing X_i^T (y − X_{−i} β_{−i}) for each update in a cycle takes O(n). This is the same as the cost of one iteration of gradient descent.
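To make the update (20.8) concrete, here is a minimal numpy sketch (not part of the original notes); the function name and the fixed number of cycles are placeholders, and the columns of X are assumed to be nonzero. The residual is maintained in O(n) per coordinate update, so one full cycle costs O(np), in line with the remark above.

```python
import numpy as np

def coord_descent_lsq(X, y, n_cycles=100):
    """Coordinate descent for min_beta 0.5*||y - X beta||_2^2, using update (20.8)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)              # X_i^T X_i for every column i
    r = y - X @ beta                             # residual, kept up to date
    for _ in range(n_cycles):
        for i in range(p):
            r_i = r + X[:, i] * beta[i]          # residual with coordinate i removed
            beta_new = X[:, i] @ r_i / col_sq[i] # exact minimizer along coordinate i
            r += X[:, i] * (beta[i] - beta_new)  # O(n) residual update
            beta[i] = beta_new
    return beta
```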

20.3.2 LASSO Regression

For the classical lasso, we consider

    min_β (1/2) ||y − Xβ||_2^2 + λ ||β||_1    (20.9)

where y ∈ R^n and X ∈ R^{n×p}. Note that we can use coordinate descent, since the regularizer decomposes as a sum of convex functions, namely ||β||_1 = ∑_{i=1}^p |β_i|. Take the subgradient of the objective with respect to β_i, with all other β_j fixed, and set it to zero to get the update step:

    X_i^T X_i β_i + X_i^T (X_{−i} β_{−i} − y) + λ s_i = 0   ⟹   β_i ← S_{λ/||X_i||_2^2} ( X_i^T (y − X_{−i} β_{−i}) / (X_i^T X_i) ),    (20.10)

where s_i ∈ ∂|β_i| and S_λ is the soft-thresholding operator,

    [S_λ(β)]_i = β_i − λ    if β_i > λ,
                 0          if −λ ≤ β_i ≤ λ,
                 β_i + λ    if β_i < −λ.

Repeat this update for i = 1, 2, ..., p, 1, 2, ....
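Similarly (again not part of the original notes), here is a minimal numpy sketch of the lasso update (20.10); the helper name soft_threshold, the function name, and the number of cycles are placeholders.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator S_t(z)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def coord_descent_lasso(X, y, lam, n_cycles=100):
    """Coordinate descent for min_beta 0.5*||y - X beta||_2^2 + lam*||beta||_1, update (20.10)."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)               # ||X_i||_2^2 for every column i
    r = y.copy()                                  # residual y - X beta (beta starts at 0)
    for _ in range(n_cycles):
        for i in range(p):
            r_i = r + X[:, i] * beta[i]           # remove coordinate i from the residual
            z = X[:, i] @ r_i / col_sq[i]         # unpenalized coordinate minimizer
            beta_new = soft_threshold(z, lam / col_sq[i])
            r += X[:, i] * (beta[i] - beta_new)   # O(n) residual update
            beta[i] = beta_new
    return beta
```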

20.3.3 Box-Constrained QP

A box-constrained QP has the form

    min_x (1/2) x^T Q x + b^T x   subject to   l ≤ x ≤ u    (20.11)

for b ∈ R^n and Q ∈ S^n_+. Note that we can use coordinate descent, since the constraint decomposes into element-wise convex constraints: I(l ≤ x ≤ u) = ∑_{i=1}^n I(l_i ≤ x_i ≤ u_i), with I the indicator function. Similar steps, taking the (sub)gradient of the objective with respect to x_i with all other elements x_j fixed, give the update step:

    x_i ← T_{[l_i, u_i]} ( (−b_i − ∑_{j ≠ i} Q_{ij} x_j) / Q_{ii} ),    (20.12)

where T_{[l_i, u_i]} is the projection operator onto the interval [l_i, u_i] that clips the value:

    T_{[l_i, u_i]}(z) = u_i    if z > u_i,
                        z      if l_i ≤ z ≤ u_i,
                        l_i    if z < l_i.

Repeat this update for i = 1, 2, ..., n, 1, 2, ....
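As an illustration (not in the original notes), a minimal numpy sketch of the clipped coordinate update (20.12); the function name and the number of cycles are placeholders, and Q is assumed to have a strictly positive diagonal so the division is well defined.

```python
import numpy as np

def coord_descent_box_qp(Q, b, l, u, n_cycles=100):
    """Coordinate descent for min 0.5*x^T Q x + b^T x  s.t.  l <= x <= u, update (20.12).

    Assumes the diagonal of Q is strictly positive.
    """
    n = Q.shape[0]
    x = np.clip(np.zeros(n), l, u)                     # feasible starting point
    for _ in range(n_cycles):
        for i in range(n):
            # exact minimizer of the objective in x_i with the other coordinates fixed ...
            z = -(b[i] + Q[i] @ x - Q[i, i] * x[i]) / Q[i, i]
            x[i] = np.clip(z, l[i], u[i])              # ... projected onto [l_i, u_i]
    return x
```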

20.3.4 Support Vector Machines

Consider the SVM dual problem:

    min_α (1/2) α^T X̃ X̃^T α − 1^T α   subject to   0 ≤ α ≤ C·1,  α^T y = 0.    (20.13)

[3] introduces Sequential Minimal Optimization (SMO), a blockwise coordinate descent method that uses greedy heuristics to select the next block of 2 coordinates instead of simple cycling. SMO repeats the following updates:

1. Greedily choose a block of indices i and j such that α_i, α_j violate the complementary slackness conditions. That is, select two indices (according to some heuristic) for which

       α_i (1 − ξ_i − (X̃β)_i − y_i β_0) = 0,    (C − α_i) ξ_i = 0

   fail to hold, where β, β_0, ξ are the primal variables.

2. Minimize the objective over the two chosen variables while keeping the others fixed.

For more recent work on coordinate descent methods for SVMs, refer to [2].

20.4 History of Coordinate Descent

Until Friedman et al. 2007 [1], coordinate descent was considered an interesting toy method. This could be because people were implementing the Jacobi version of it, without distinguishing between one-at-a-time and all-at-once updates.

20.4.1 Why is coordinate descent used today?

Coordinate descent is very simple and easy to implement, and it can achieve state-of-the-art performance when implemented with the tricks described in the next section. This is especially true for functions consisting of a quadratic plus a separable component, either directly or under proximal Newton. Examples: lasso regression, lasso GLMs (under proximal Newton), SVMs, group lasso, graphical lasso (applied to the dual), etc.

20.5 Implementation Tricks: Pathwise Coordinate Descent

Pathwise coordinate descent for the lasso has the following structure.

Outer loop (pathwise strategy): the idea is to go from a sparse to a dense solution.

- Compute the solution over a sequence λ_1 > λ_2 > ... > λ_r of tuning parameter values.
- For tuning parameter value λ_k, initialize the coordinate descent algorithm at the computed solution for λ_{k+1} (warm start).

Inner loop (active set strategy): this step is efficient since we only work with the active set.

- Perform one coordinate cycle (or a small number of cycles), and record the active set A of coefficients that are nonzero.
- Cycle over only the coefficients in A until convergence.
- Check the KKT conditions over all coefficients; if they are not all satisfied, add the offending coefficients to A and go back one step.

Pathwise coordinate descent combined with screening rules makes practical coordinate descent very efficient; a sketch of this structure follows.
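The following is a rough, self-contained numpy sketch of this structure (not part of the original notes): the helper and function names, the fixed number of inner cycles, and the ordering of the λ sequence are placeholders, and the final KKT check is only indicated as a comment.

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cycle(X, y, lam, beta, coords):
    """One pass of coordinate descent (update (20.10)) over the coordinates in `coords`."""
    col_sq = np.sum(X ** 2, axis=0)
    for i in coords:
        r_i = y - X @ beta + X[:, i] * beta[i]
        beta[i] = soft_threshold(X[:, i] @ r_i / col_sq[i], lam / col_sq[i])
    return beta

def pathwise_lasso(X, y, lams, n_inner=50):
    """Pathwise coordinate descent: one lasso solve per tuning parameter,
    warm-starting each solve at the solution from the neighbouring one on the path."""
    p = X.shape[1]
    beta = np.zeros(p)
    path = []
    for lam in lams:                         # outer loop over the lambda sequence
        beta = lasso_cycle(X, y, lam, beta, coords=range(p))   # one full cycle
        active = np.flatnonzero(beta)        # record the active set A
        for _ in range(n_inner):             # inner loop: cycle only over A
            beta = lasso_cycle(X, y, lam, beta, coords=active)
        # a full implementation would now check the KKT conditions over all
        # coordinates and, if any are violated, add them to A and repeat
        path.append(beta.copy())
    return path
```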

20.6 Coordinate Gradient Descent

For a smooth function f, the iterations

    x_i^(k) = x_i^(k−1) − t_{k,i} · ∇_i f(x_1^(k), ..., x_{i−1}^(k), x_i^(k−1), x_{i+1}^(k−1), ..., x_n^(k−1)),    i = 1, ..., n    (20.14)

for k = 1, 2, 3, ... are called coordinate gradient descent, and when f = g + h, with g smooth and h = ∑_{i=1}^n h_i, the iterations

    x_i^(k) = prox_{h_i, t_{k,i}} ( x_i^(k−1) − t_{k,i} · ∇_i g(x_1^(k), ..., x_{i−1}^(k), x_i^(k−1), x_{i+1}^(k−1), ..., x_n^(k−1)) ),    i = 1, ..., n    (20.15)

for k = 1, 2, 3, ... are called coordinate proximal gradient descent. When g is quadratic, (proximal) coordinate gradient descent is the same as coordinate descent under a proper step size. Roughly speaking, theory suggests that the convergence results for coordinate descent are similar to those for proximal gradient descent.

References

[1] Jerome Friedman, Trevor Hastie, Holger Höfling, and Robert Tibshirani. Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302–332, 2007.

[2] Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S. Sathiya Keerthi, and Sellamanickam Sundararajan. A dual coordinate descent method for large-scale linear SVM. In Proceedings of the 25th International Conference on Machine Learning, pages 408–415. ACM, 2008.

[3] John Platt. Sequential minimal optimization: A fast algorithm for training support vector machines. 1998.

[4] Paul Tseng. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of Optimization Theory and Applications, 109(3):475–494, 2001.