Generalized Conditional Gradient and Its Applications


Slide 1: Generalized Conditional Gradient and Its Applications
Yaoliang Yu, University of Alberta
UBC Kelowna, 04/18/13

Slide 2 (outline):
1. Introduction
2. Generalized Conditional Gradient
3. Polar Operator
4. Conclusions

Slide 3: Table of Contents
1. Introduction
2. Generalized Conditional Gradient
3. Polar Operator
4. Conclusions

Slide 4: Regularized Loss Minimization
Generic form for many ML problems:
$\min_w\; f(w) + \lambda\, h(w)$,
where f is the loss function and h is the regularizer.
Assuming f and h to be convex/smooth, candidate solvers include:
Interior point method; Mirror descent / Proximal gradient; Averaging gradient; Conditional gradient.

Slide 5: Machine Learning Examples
Example (Matrix Completion): Netflix problem; covariance matrix estimation; etc.
$\min_{X \in \mathbb{R}^{m\times n}}\; \tfrac{1}{2}\sum_{ij\in O}(X_{ij} - Z_{ij})^2 + \lambda \|X\|_{\mathrm{tr}}$
Example (Group Lasso): statistical estimation; inverse problems; denoising; etc.
$\min_{w \in \mathbb{R}^d}\; \tfrac{1}{2}\|Aw - b\|_2^2 + \lambda \sum_{g\in G}\|w_g\|$
Interesting case: m, n or d are extremely large.

Slide 6: Conditional gradient (Frank-Wolfe 56)
Consider $\min_{x \in C} f(x)$, where C is compact convex and f is smooth convex.
1. $y_t \in \arg\min_{x \in C} \langle x, \nabla f(x_t) \rangle$;
2. $x_{t+1} = (1-\eta) x_t + \eta y_t$.
(Frank-Wolfe 56; Canon-Cullum 68) proved that CG converges at $\Theta(1/t)$.
Gained much recent attention due to its simplicity and the greedy nature of step 1.
Refs: (Zhang 03; Clarkson 10; Hazan 08; Jaggi-Sulovsky 10; Bach 12; etc.)
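
A minimal sketch of this two-step loop, assuming the feasible set is given through a user-supplied linear minimization oracle; the names `grad_f` and `lmo` and the open-loop step size are illustrative choices, not prescribed by the slides:

```python
import numpy as np

def conditional_gradient(grad_f, lmo, x0, iters=100):
    """Frank-Wolfe / conditional gradient over a compact convex set C.

    grad_f(x): gradient of the smooth objective at x.
    lmo(g):    a point y in C minimizing <y, g> (linear minimization oracle).
    """
    x = np.asarray(x0, dtype=float)
    for t in range(iters):
        g = grad_f(x)
        y = lmo(g)                      # step 1: greedy vertex
        eta = 2.0 / (t + 2)             # a common open-loop step size
        x = (1 - eta) * x + eta * y     # step 2: convex combination
    return x
```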

Slides 7-18: An Example (one slide, built up frame by frame)
$\min_{a,b}\; a^2 + (b+1)^2, \quad \text{s.t. } |a| \le 1,\ 2 \ge b \ge 0$
[Figure: the feasible box with the conditional-gradient iterates (step size 1/(k+2)) starting at $x_1 = (1, 1)$ and passing through $x_2, x_3, x_4, \ldots$ toward $x^\star$, zig-zagging between the vertices $y_1 = (-1, 0)$ and $y_2 = (1, 0)$.]
Can show $f(x_k) - f(x^\star) = 4/k + o(1/k)$.
Projected gradient converges in two iterations.
Refs: (Levitin-Polyak 66; Polyak 87; Beck-Teboulle 04) for faster rates.
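
A small numerical sketch of this example, under the reconstruction above (box $|a| \le 1$, $0 \le b \le 2$); the code uses the standard step size $2/(k+2)$, which may differ from the exact convention on the slide:

```python
import numpy as np

# Toy problem: f(a, b) = a^2 + (b + 1)^2 over the box |a| <= 1, 0 <= b <= 2.
# The constrained optimum (0, 0) sits on an edge, which is what makes plain
# conditional gradient zig-zag and converge slowly.
def grad_f(x):
    return np.array([2.0 * x[0], 2.0 * (x[1] + 1.0)])

def lmo_box(g):
    # Vertex of the box minimizing <y, g>.
    a = -1.0 if g[0] > 0 else 1.0
    b = 0.0 if g[1] > 0 else 2.0
    return np.array([a, b])

x = np.array([1.0, 1.0])            # x_1 = (1, 1)
f_star = 1.0                        # f at the constrained optimum (0, 0)
for k in range(1, 1001):
    y = lmo_box(grad_f(x))          # alternates between (-1, 0) and (1, 0)
    eta = 2.0 / (k + 2)
    x = (1 - eta) * x + eta * y
print(x[0] ** 2 + (x[1] + 1.0) ** 2 - f_star)   # gap decays like O(1/k)
```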

Slide 19: The revival of CG: sparsity!
The revived popularity of conditional gradient is due to (Clarkson 10; Shalev-Shwartz-Srebro-Zhang 10), both focusing on $\min_x f(x)$ over a set (e.g., the simplex or the $\ell_1$ ball) whose vertices are 1-sparse:
$y_t \in \arg\min_y \langle y, \nabla f(x_t)\rangle$, with $\mathrm{card}(y_t) = 1$;
$x_{t+1} = (1-\eta)x_t + \eta y_t$, so $\mathrm{card}(x_{t+1}) \le \mathrm{card}(x_t) + 1$.
Explicit control of the sparsity: $1/\epsilon$ vs. $1/\sqrt{\epsilon}$.
Sparsity, and more generally structure, is the key to the success of ML.
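
A minimal sketch of why each CG step adds at most one coordinate, assuming the constraint set is the unit $\ell_1$ ball (the name `lmo_l1_ball` is illustrative):

```python
import numpy as np

def lmo_l1_ball(g):
    """Linear minimization oracle over the unit l1 ball.

    Returns a 1-sparse vertex y minimizing <y, g>: the signed coordinate
    vector at the largest-magnitude entry of the gradient g.
    """
    y = np.zeros_like(g)
    i = np.argmax(np.abs(g))
    y[i] = -np.sign(g[i])
    return y

# Each update x <- (1 - eta) * x + eta * y touches at most one new coordinate,
# so after t steps the iterate has at most t nonzeros.
```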

Slide 20: Table of Contents
1. Introduction
2. Generalized Conditional Gradient
3. Polar Operator
4. Conclusions

Slide 21: Generalized conditional gradient
Consider $\min_x\; f(x) + \lambda\,\kappa(x)$, where f is smooth convex and $\kappa$ is a gauge (not necessarily smooth).
Important distinctions: composite, with a non-smooth term; unconstrained, hence unbounded domain.
1. Polar operator: $y_t \in \arg\min_{x:\,\kappa(x) \le 1} \langle x, \nabla f(x_t)\rangle$;
2. Line search: $s_t \in \arg\min_{s \ge 0}\; f((1-\eta)x_t + \eta s y_t) + \lambda\eta s$;
3. $x_{t+1} = (1-\eta)x_t + \eta s_t y_t$.
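
A minimal sketch of these three steps, assuming a polar oracle `polar` that returns a unit-gauge atom minimizing the inner product with the gradient, and a 1-D solver for the scale; the helper names and the use of `scipy.optimize.minimize_scalar` are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def gcg(f, grad_f, polar, lam, x0, iters=100):
    """Generalized conditional gradient for min_x f(x) + lam * kappa(x).

    polar(g): an atom y with kappa(y) <= 1 minimizing <y, g>.
    """
    x = np.asarray(x0, dtype=float)
    for t in range(iters):
        g = grad_f(x)
        y = polar(g)                                 # step 1: polar operator
        eta = 2.0 / (t + 2)
        obj = lambda s: f((1 - eta) * x + eta * s * y) + lam * eta * s
        s = minimize_scalar(obj, bounds=(0.0, 1e6), method="bounded").x
        x = (1 - eta) * x + eta * s * y              # step 3: update
    return x
```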

Slide 22: Convergence Rate
$\min_x\; f(x) + \lambda\,\kappa(x)$
Theorem (Zhang-Y-Schuurmans 12). If f and $\kappa$ have bounded level sets and $f \in C^1$, then GCG converges at rate O(1/t), where the constant is independent of $\lambda$.
Proof is simple:
Line search is as good as knowing $\kappa(x^\star)$;
Note that we upper bound $\kappa((1-\eta)x_t + \eta s y_t) \le (1-\eta)\kappa(x_t) + \eta s$;
Still too slow!
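
To make the proof sketch concrete, here is the standard per-step estimate obtained by combining the two bullets above with smoothness of f (a generic Frank-Wolfe-style calculation written under the assumption that $\nabla f$ is $L$-Lipschitz; it is not a transcription of the authors' proof). Take $s = \kappa(x^\star)$, which the line search can only improve upon:

```latex
% F(x) = f(x) + \lambda\kappa(x); one GCG step with s = \kappa(x^\star)
\begin{aligned}
F(x_{t+1})
 &\le f\big((1-\eta)x_t + \eta s y_t\big) + \lambda\big[(1-\eta)\kappa(x_t) + \eta s\big] \\
 &\le F(x_t) + \eta\big[\langle \nabla f(x_t),\, s y_t - x_t\rangle + \lambda s - \lambda\kappa(x_t)\big]
      + \tfrac{L\eta^2}{2}\,\| s y_t - x_t\|^2 \\
 &\le F(x_t) - \eta\big[F(x_t) - F(x^\star)\big] + \tfrac{L\eta^2}{2}\,\| s y_t - x_t\|^2 .
\end{aligned}
```

The last line uses that $s\,y_t$ minimizes the linearization over $\{x : \kappa(x) \le s\}$ together with convexity of f; unrolling the recursion with $\eta = 2/(t+2)$ and the bounded-level-set assumption (which bounds $\|s y_t - x_t\|$) gives the O(1/t) rate.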

Slide 23: Local improvement
Assume some procedure (say BFGS) that can locally minimize the nonsmooth problem $\min_x f(x) + \lambda\,\kappa(x)$, or some variation of it.
Can we combine this local procedure with some globally convergent routine?
Two conditions:
The local procedure cannot incur big overhead;
It cannot ruin the globally convergent routine.
Both are met by GCG.
Refs: (Burer-Monteiro 05; Mishra et al 11; Laue 12)

Slide 24: Case study: Matrix completion with trace norm
Consider $\min_X\; \tfrac{1}{2}\sum_{ij\in O}(X_{ij} - Z_{ij})^2 + \lambda \|X\|_{\mathrm{tr}}$.
$\|\cdot\|_{\mathrm{tr}}$ is the convex hull of rank on the unit ball $\{X : \|X\|_{\mathrm{sp}} \le 1\}$.
The only nontrivial step in GCG, the polar operator $Y_t \in \arg\min_{\|Y\|_{\mathrm{tr}} \le 1} \langle Y, G_t\rangle$, amounts to the dominating singular vectors of $G_t$.
In contrast, popular gradient methods need the full SVD of $G_t$.
Variation: $\min_{U,V}\; \tfrac{1}{2}\sum_{ij\in O}\big((UV^\top)_{ij} - Z_{ij}\big)^2 + \lambda\big(\|U\|_F^2 + \|V\|_F^2\big)$.
Not jointly convex in U and V, but smooth in U and V;
$Y_t$ in GCG is rank-1, hence $X_t = UV^\top$ is of rank at most t.
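
A minimal sketch of this polar step, assuming we only need the leading singular vector pair of the gradient matrix; power iteration is one cheap way to get it (the helper names are illustrative):

```python
import numpy as np

def top_singular_pair(G, iters=200):
    """Leading left/right singular vectors of G via power iteration."""
    v = np.random.default_rng(0).standard_normal(G.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        u = G @ v
        u /= np.linalg.norm(u)
        v = G.T @ u
        v /= np.linalg.norm(v)
    return u, v

def trace_norm_polar_atom(G):
    """Atom Y with ||Y||_tr <= 1 minimizing <Y, G>: minus the dominating
    rank-1 factor of G (no full SVD needed, unlike proximal methods)."""
    u, v = top_singular_pair(G)
    return -np.outer(u, v)
```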

Slide 25: Case study: Experiment
[Figure-only slide: experimental results.]

Slide 26: Interpretation
Dictionary learning problem: $\min_{D \in \mathbb{R}^{m\times r},\, \Phi \in \mathbb{R}^{r\times n}} L(X, D\Phi)$.
Many applications: NMF, sparse coding, topic models...
Not jointly convex, in fact NP-hard for fixed r;
Convexify by not constraining the rank explicitly: relax r!
Refs: (Bengio et al 05; Bach-Mairal-Ponce 08; Zhang-Y-White-Sch 10)

Slide 27: Convexification
Let $D_{:i}$ have unit norm (say $\ell_2$); $\min_{D,\Phi}\; L(X, D\Phi) + \lambda\,\Omega(\Phi)$.
Put a row-wise norm on $\Phi$: implicitly constraining the rank;
Rewrite $\hat X := D\Phi = \sum_i \|\Phi_{i:}\|\, D_{:i}\, \frac{\Phi_{i:}}{\|\Phi_{i:}\|}$;
Reformulate $\min_{\hat X}\; L(X, \hat X) + \lambda\,\kappa(\hat X)$, where $\kappa(X) = \inf\big\{\sum_i \sigma_i : X = \sum_i \sigma_i\, D_{:i}\, \frac{\Phi_{i:}}{\|\Phi_{i:}\|}\big\}$;
Can apply GCG now; PO: $\min_{\|d\|\le 1,\ \|\phi\|\le 1}\; d^\top G_t\, \phi$.
Setting both norms to $\ell_2$, we recover the matrix completion example.

Slide 28: Table of Contents
1. Introduction
2. Generalized Conditional Gradient
3. Polar Operator
4. Conclusions

Slide 29: Computing the Polar
The complexity of GCG is packed into the PO:
$\min_{x:\,\kappa(x)\le 1} \langle g, x\rangle = -\kappa^\circ(-g)$.
Recall that in the dictionary learning problem:
$\min_{d,\phi}\; d^\top G\,\phi = -\max_{d}\; \|G^\top d\|^\circ$ (maximizing out $\phi$ via its dual norm).
Can easily become computationally intractable!
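
A minimal sketch of one common heuristic for this bilinear polar, alternating between the two norm-constrained blocks; the particular $\ell_2$/$\ell_1$ choice of norms and the helper name are illustrative assumptions, and each half-step is exact even though the overall scheme is only a heuristic for the joint maximum:

```python
import numpy as np

def dict_polar_heuristic(G, iters=50, seed=0):
    """Alternating maximization for max d^T G phi with ||d||_2 <= 1 and
    ||phi||_1 <= 1 (a sparse-coding flavored gauge)."""
    rng = np.random.default_rng(seed)
    phi = np.zeros(G.shape[1])
    phi[rng.integers(G.shape[1])] = 1.0          # random 1-sparse start
    for _ in range(iters):
        d = G @ phi
        d /= np.linalg.norm(d)                   # best unit-l2 d for fixed phi
        c = G.T @ d
        j = np.argmax(np.abs(c))                 # best unit-l1 phi is 1-sparse
        phi = np.zeros_like(phi)
        phi[j] = np.sign(c[j])
    return d, phi
```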

Slide 30: Multi-view Learning
Partition $d = \begin{bmatrix} b \\ w \end{bmatrix}$ and constrain their norms respectively.
Harder than single-view, but still doable (White-Y-Zhang-Sch 12):
$\max_{\|b\|=1,\,\|w\|=1} \begin{bmatrix} b \\ w \end{bmatrix}^\top G G^\top \begin{bmatrix} b \\ w \end{bmatrix} = \max\; \operatorname{tr}\Big(GG^\top \begin{bmatrix} b \\ w \end{bmatrix} \begin{bmatrix} b^\top & w^\top \end{bmatrix}\Big)$.

Slide 31: Reducing PO to Proximal
Consider the group regularizer $\Psi(w) = \sum_g \|w_g\|$.
Its polar
$\Psi^\circ(u) = \inf\big\{\max_g \|z^g\|_g^\circ : \sum_g z^g = u\big\}$
does not seem to be easy to compute.
Theorem. For any gauge $\Omega$, its polar $\Omega^\circ(y)$ equals the smallest $\zeta \ge 0$ s.t.
$\max_x\,\big\{\langle y, x\rangle - \zeta\,\Omega(x) - \tfrac{1}{2}\|x\|_2^2\big\} = 0$, i.e. $\operatorname{Prox}_{\zeta\Omega}(y) = 0$,
where $\operatorname{prox}_f(y) = \min_x \tfrac{1}{2}\|x - y\|_2^2 + f(x)$ and $\operatorname{Prox}_f(y)$ denotes the (unique) minimizer.
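
A small numerical sketch of this theorem for the simplest case of non-overlapping groups, where both sides can be checked directly: bisect on $\zeta$ until the group-soft-thresholding prox returns zero, and compare with the closed form $\max_g \|y_g\|_2$ for disjoint groups (the helper names and the bisection tolerance are illustrative):

```python
import numpy as np

def prox_group(y, groups, zeta):
    """Prox of zeta * sum_g ||w_g||_2 for disjoint groups: group soft-thresholding."""
    x = np.zeros_like(y)
    for g in groups:
        ng = np.linalg.norm(y[g])
        if ng > zeta:
            x[g] = (1.0 - zeta / ng) * y[g]
    return x

def polar_via_prox(y, groups, tol=1e-10):
    """Smallest zeta >= 0 with Prox_{zeta * Omega}(y) = 0, found by bisection."""
    lo, hi = 0.0, np.linalg.norm(y, 1)          # at hi the prox is surely zero
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if np.allclose(prox_group(y, groups, mid), 0.0):
            hi = mid
        else:
            lo = mid
    return hi

rng = np.random.default_rng(0)
y = rng.standard_normal(6)
groups = [np.array([0, 1, 2]), np.array([3, 4]), np.array([5])]
print(polar_via_prox(y, groups))                  # approx Omega_polar(y)
print(max(np.linalg.norm(y[g]) for g in groups))  # closed form for disjoint groups
```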

Slide 32: Proximal Gradient
Consider $\min_{x\in C} f(x)$, where $f \in C^1_L$:
$x_{t+1} = \arg\min_{x\in C}\; f(x_t) + \langle x - x_t, \nabla f(x_t)\rangle + \tfrac{L}{2}\|x - x_t\|_2^2$.
More generally, $\min_{x\in C}\; f(x) + g(x)$, where $f \in C^1_L$:
$x_{t+1} = \arg\min_{x\in C}\; f(x_t) + \langle x - x_t, \nabla f(x_t)\rangle + \tfrac{L}{2}\|x - x_t\|_2^2 + g(x)$
$\quad\;\;\, = \arg\min_{x\in C}\; g(x) + \tfrac{L}{2}\big\|x - \big(x_t - \tfrac{1}{L}\nabla f(x_t)\big)\big\|_2^2$.
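
A minimal sketch of this update for the lasso case $g(x) = \lambda\|x\|_1$ and $C = \mathbb{R}^n$, where the inner minimization is elementwise soft-thresholding; the function names and fixed step size $1/L$ with $L = \|A\|_2^2$ are illustrative choices:

```python
import numpy as np

def soft_threshold(v, tau):
    """Prox of tau * ||.||_1: elementwise shrinkage."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def proximal_gradient_lasso(A, b, lam, iters=500):
    """ISTA for min_x 0.5 * ||A x - b||^2 + lam * ||x||_1."""
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)
        x = soft_threshold(x - grad / L, lam / L)
    return x
```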

Slide 33: Decomposing the Proximal
How to compute the proximal operator for $\Psi(w) = \sum_g \|w_g\|$?
Theorem (NEW?). $\operatorname{Prox}_{\Omega+\Phi} = \operatorname{Prox}_\Phi \circ \operatorname{Prox}_\Omega$ for all gauges $\Omega$ iff $\Phi = c\,\|\cdot\|_2$ for some $c \ge 0$.
Corollary (Jenatton et al 11). Let G be a collection of tree-structured groups, that is, for any two groups either $g \subseteq g'$ or $g' \subseteq g$ or $g \cap g' = \emptyset$. Then $\operatorname{Prox}_{\sum_i \|\cdot\|_{g_i}} = \operatorname{Prox}_{g_1} \circ \cdots \circ \operatorname{Prox}_{g_m}$, where we arrange the groups so that $g_i \subsetneq g_j \Rightarrow i > j$.
$\operatorname{Prox}_{2\Omega} = \operatorname{Prox}_\Omega \circ \operatorname{Prox}_\Omega$? More generally, $\operatorname{Prox}_{\Omega+\Phi} = f(\operatorname{Prox}_\Omega, \operatorname{Prox}_\Phi)$?
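
A minimal sketch of the corollary for $\ell_2$ group norms: composing the individual group proxes with the smallest groups applied first (the corollary's composition read right to left) gives the prox of the sum on tree-structured groups; the group indices and the weight below are illustrative:

```python
import numpy as np

def prox_l2_group(x, idx, tau):
    """Prox of tau * ||x_idx||_2, applied on a copy of x."""
    y = x.copy()
    n = np.linalg.norm(y[idx])
    y[idx] = 0.0 if n <= tau else (1.0 - tau / n) * y[idx]
    return y

def prox_tree_sum(x, groups, tau):
    """Prox of tau * sum_g ||x_g||_2 for tree-structured groups.

    `groups` must list children before parents (smallest groups first).
    """
    y = x
    for idx in groups:
        y = prox_l2_group(y, idx, tau)
    return y

# Example: nested groups {4,5} inside {3,4,5} inside {0,...,5}.
x = np.array([3.0, -1.0, 0.5, 2.0, -2.0, 1.0])
groups = [np.array([4, 5]), np.array([3, 4, 5]), np.arange(6)]
print(prox_tree_sum(x, groups, tau=0.7))
```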

Slide 34: Table of Contents
1. Introduction
2. Generalized Conditional Gradient
3. Polar Operator
4. Conclusions

Slide 35: Conclusions
We have: introduced GCG; discussed efficient computation of the PO; applied it to matrix completion, group Lasso, etc.
Further questions: what if the PO is hard? nonsmooth loss? online? stochastic?

Slide 36: Thank you!
