Generalized Conditional Gradient and Its Applications
Generalized Conditional Gradient and Its Applications
Yaoliang Yu (University of Alberta)
UBC Kelowna, 04/18/13
Table of Contents
1 Introduction
2 Generalized Conditional Gradient
3 Polar Operator
4 Conclusions
Regularized Loss Minimization
Generic form for many ML problems:
    min_w f(w) + λ h(w)
where f is the loss function and h is the regularizer. Assuming f and h to be convex/smooth:
- Interior point methods;
- Mirror descent / proximal gradient;
- Averaging gradient;
- Conditional gradient.
Machine Learning Examples
Example (Matrix Completion): Netflix problem; covariance matrix estimation; etc.
    min_{X ∈ R^{m×n}} (1/2) Σ_{(i,j) ∈ O} (X_{ij} − Z_{ij})² + λ ‖X‖_tr
Example (Group Lasso): statistical estimation; inverse problems; denoising; etc.
    min_{w ∈ R^d} (1/2) ‖Aw − b‖² + λ Σ_{g ∈ G} ‖w_g‖
Interesting case: m, n or d are extremely large.
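The group-lasso objective above is easy to evaluate directly. A minimal sketch on toy data; the specific matrix, vector, regularization weight, and (disjoint) group partition below are illustrative assumptions, not from the slides.

```python
import numpy as np

# Hedged sketch: evaluating the group-lasso objective from the slide,
#   0.5 * ||A w - b||^2 + lam * sum over groups g of ||w_g||_2,
# on toy data. The group structure here is a disjoint partition of the
# coordinates (an assumption for illustration; groups may also overlap).

def group_lasso_objective(A, b, w, groups, lam):
    """0.5 * ||A w - b||_2^2 + lam * sum_g ||w_g||_2."""
    loss = 0.5 * np.sum((A @ w - b) ** 2)
    penalty = sum(np.linalg.norm(w[g]) for g in groups)
    return loss + lam * penalty

A = np.eye(4)
b = np.array([1.0, 1.0, 0.0, 0.0])
w = np.array([1.0, 1.0, 0.0, 0.0])   # fits b exactly, so only the penalty remains
groups = [[0, 1], [2, 3]]
obj = group_lasso_objective(A, b, w, groups, lam=1.0)  # loss 0, penalty sqrt(2)
```
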
Conditional gradient (Frank-Wolfe 56)
Consider min_{x ∈ C} f(x), with C compact convex and f smooth convex:
1. y_t ∈ argmin_{x ∈ C} ⟨x, ∇f(x_t)⟩;
2. x_{t+1} = (1 − η) x_t + η y_t.
(Frank-Wolfe 56; Canon-Cullum 68) proved that CG converges at Θ(1/t). Gained much recent attention due to its simplicity and the greedy nature of step 1.
Refs: (Zhang 03; Clarkson 10; Hazan 08; Jaggi-Sulovsky 10; Bach 12; etc.)
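The two steps above can be sketched for a box constraint, where the linear minimization oracle just picks a vertex coordinate-wise. The step size 2/(t+2) is the standard choice (slightly different from the 1/(k+2) used in the zig-zag example on the next slide), and the toy objective is the one from that example; everything else is an illustrative assumption.

```python
import numpy as np

# Minimal Frank-Wolfe sketch for min f(x) over a box C = [lo, hi]^d,
# matching the two steps on the slide: a linear minimization oracle,
# then a convex averaging step.

def frank_wolfe(grad, x0, lo, hi, iters=200):
    x = x0.astype(float)
    for t in range(iters):
        g = grad(x)
        # Step 1: y_t minimizes <y, grad f(x_t)> over the box, coordinate-wise
        # (a vertex of C): pick lo_i where the gradient is positive, else hi_i.
        y = np.where(g > 0, lo, hi)
        # Step 2: convex combination with the standard step size 2/(t+2).
        eta = 2.0 / (t + 2)
        x = (1 - eta) * x + eta * y
    return x

# f(a, b) = a^2 + (b+1)^2 over [-1, 1] x [-2, 0]; the minimizer is (0, -1).
grad = lambda x: np.array([2 * x[0], 2 * (x[1] + 1)])
x = frank_wolfe(grad, np.array([1.0, -1.0]),
                np.array([-1.0, -2.0]), np.array([1.0, 0.0]))
```
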
An Example
    min_{a,b} a² + (b + 1)²,  s.t. −1 ≤ a ≤ 1, −2 ≤ b ≤ 0
With step size 1/(k+2), conditional gradient starts at x_1 = (1, −1) and alternates between the vertices y_1 = (−1, 0) and y_2 = (1, 0), producing zig-zagging iterates x_2, x_3, x_4, ...
Can show f(x_k) − f(x*) = 4/k + o(1/k). Projected gradient converges in two iterations.
Refs: (Levitin-Polyak 66; Polyak 87; Beck-Teboulle 04) for faster rates.
The revival of CG: sparsity!
The revived popularity of conditional gradient is due to (Clarkson 10; Shalev-Shwartz-Srebro-Zhang 10), both focusing on min_{x: ‖x‖_1 ≤ 1} f(x):
1. y_t ∈ argmin_{y: ‖y‖_1 ≤ 1} ⟨y, ∇f(x_t)⟩, with card(y_t) = 1;
2. x_{t+1} = (1 − η) x_t + η y_t, with card(x_{t+1}) ≤ card(x_t) + 1.
Explicit control of the sparsity, at the price of O(1/ɛ) iterations vs. O(1/√ɛ) for accelerated gradient methods. Sparsity, more generally structure, is the key to the success of ML.
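The cardinality claim in step 1 comes from the shape of the ℓ1-ball oracle: the linear minimizer over ‖y‖_1 ≤ 1 is always a signed standard basis vector. A minimal sketch (the particular test vector is an illustrative assumption):

```python
import numpy as np

# Sketch of the l1-ball linear minimization oracle behind the sparsity
# claim: the minimizer of <y, g> over ||y||_1 <= 1 is -sign(g_i) e_i for
# the coordinate i with largest |g_i|, so each CG step adds at most one
# nonzero to the iterate.

def l1_lmo(g):
    i = int(np.argmax(np.abs(g)))
    y = np.zeros_like(g, dtype=float)
    y[i] = -np.sign(g[i])
    return y

g = np.array([0.5, -3.0, 1.0])
y = l1_lmo(g)   # 1-sparse: coordinate 1, with sign +1
```
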
Table of Contents
1 Introduction
2 Generalized Conditional Gradient
3 Polar Operator
4 Conclusions
Generalized conditional gradient
Consider min_x f(x) + λ κ(x), with f smooth convex and κ a gauge (not necessarily smooth). Important distinction: composite, with a non-smooth term; unconstrained, hence unbounded domain.
1. Polar operator: y_t ∈ argmin_{x: κ(x) ≤ 1} ⟨x, ∇f(x_t)⟩;
2. Line search: s_t ∈ argmin_{s ≥ 0} f((1 − η) x_t + η s y_t) + λ η s;
3. x_{t+1} = (1 − η) x_t + η s_t y_t.
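The three steps can be sketched end-to-end on a toy instance with κ = ‖·‖_1 and f(x) = ½‖x − z‖². This is only an illustrative sketch: the slides leave the search over (η, s) abstract, so here both are found by a crude grid search, which is an assumption of this example rather than the method in the talk.

```python
import numpy as np

# Hedged sketch of the three GCG steps for the toy problem
#   min_x 0.5 * ||x - z||^2 + lam * ||x||_1      (kappa = l1 norm).
# The polar operator is the l1-ball oracle; the (eta, s) search is a grid
# search purely for illustration.

def gcg_l1(z, lam, iters=200):
    x = np.zeros_like(z)
    for _ in range(iters):
        g = x - z                           # grad f(x_t)
        # Step 1 (polar operator): minimize <y, g> over ||y||_1 <= 1.
        i = int(np.argmax(np.abs(g)))
        y = np.zeros_like(z)
        y[i] = -np.sign(g[i])
        # Steps 2-3: crude grid search over eta in [0, 1] and scale s >= 0.
        best, best_x = np.inf, x
        for eta in np.linspace(0.0, 1.0, 21):
            for s in np.linspace(0.0, np.abs(z).max(), 21):
                cand = (1 - eta) * x + eta * s * y
                val = 0.5 * np.sum((cand - z) ** 2) + lam * np.sum(np.abs(cand))
                if val < best:
                    best, best_x = val, cand
        x = best_x
    return x

z = np.array([3.0, 0.5, -2.0])
x = gcg_l1(z, lam=1.0)            # should approach soft-threshold(z, 1) = [2, 0, -1]
obj = 0.5 * np.sum((x - z) ** 2) + np.sum(np.abs(x))   # optimum is 4.125
```

Note how the iterate stays exactly sparse: coordinate 1 is never selected by the polar step, so it remains zero throughout.
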
Convergence Rate
    min_x f(x) + λ κ(x)
Theorem (Zhang-Y-Schuurmans 12): If f and κ have bounded level sets and f has Lipschitz-continuous gradient, then GCG converges at rate O(1/t), where the constant is independent of λ.
Proof is simple:
- Line search is as good as knowing κ(x*);
- Note that we upper bound κ((1 − η) x_t + η s y_t) ≤ (1 − η) κ(x_t) + η s.
Still too slow!
Local improvement
Assume some procedure (say BFGS) that can locally minimize the nonsmooth problem min_x f(x) + λ κ(x), or some variation of it. Can we combine this local procedure with some globally convergent routine? Two conditions:
- The local procedure cannot incur big overhead;
- It cannot ruin the globally convergent routine.
Both are met by GCG.
Refs: (Burer-Monteiro 05; Mishra et al 11; Laue 12)
Case study: Matrix completion with trace norm
Consider
    min_X (1/2) Σ_{(i,j) ∈ O} (X_{ij} − Z_{ij})² + λ ‖X‖_tr.
‖·‖_tr is the convex hull of rank on the unit spectral ball {X : ‖X‖_sp ≤ 1}. The only nontrivial step in GCG is the polar operator Y_t ∈ argmin_{‖Y‖_tr ≤ 1} ⟨Y, G_t⟩, which amounts to computing the dominant singular vectors of G_t. In contrast, popular gradient methods need the full SVD of G_t.
Variation:
    min_{U,V} (1/2) Σ_{(i,j) ∈ O} ((UV)_{ij} − Z_{ij})² + (λ/2) (‖U‖²_F + ‖V‖²_F).
Not jointly convex in U and V, but smooth in U and V; Y_t in GCG is rank-1, hence X_t = UV is of rank at most t.
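The trace-norm polar step above can be sketched directly: the minimizer of ⟨Y, G⟩ over ‖Y‖_tr ≤ 1 is −uvᵀ for the dominant singular pair (u, v) of G. For simplicity this sketch uses a full SVD on a tiny matrix; at the scales the slides target, one would instead compute only the top singular pair (e.g. with a Lanczos-type solver).

```python
import numpy as np

# The polar step for the trace norm: arg min over {||Y||_tr <= 1} of <Y, G>
# is -u v^T built from the dominant singular pair of G, and the attained
# value is -||G||_sp (minus the spectral norm).

def trace_norm_polar(G):
    U, S, Vt = np.linalg.svd(G)
    return -np.outer(U[:, 0], Vt[0])    # rank-1, trace norm 1

G = np.diag([3.0, 1.0])
Y = trace_norm_polar(G)
val = float(np.sum(Y * G))    # <Y, G> = -||G||_sp = -3
```
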
Case study: Experiment
[figure]
Interpretation
Dictionary learning problem:
    min_{D ∈ R^{m×r}, Φ ∈ R^{r×n}} L(X, DΦ).
Many applications: NMF, sparse coding, topic models, ...
Not jointly convex, in fact NP-hard for fixed r; convexify by not constraining the rank explicitly: relax r!
Refs: (Bengio et al 05; Bach-Mairal-Ponce 08; Zhang-Y-White-Sch 10)
Convexification
Let each column D_{:i} have unit norm (say ℓ2), and solve
    min_{D,Φ} L(X, DΦ) + λ Ω(Φ).
Putting a row-wise norm on Φ implicitly constrains the rank. Rewrite
    X̂ := DΦ = Σ_i ‖Φ_{i:}‖ D_{:i} (Φ_{i:} / ‖Φ_{i:}‖),
and reformulate as
    min_{X̂} L(X, X̂) + λ κ(X̂),  where κ(X) = inf{ Σ_i σ_i : X = Σ_i σ_i D_{:i} Φ_{i:}, ‖D_{:i}‖ = 1, ‖Φ_{i:}‖ = 1 }.
Can apply GCG now; the polar operator becomes max_{d,φ} dᵀ G_t φ over unit-norm d and φ.
Setting both norms to ℓ2, we recover the matrix completion example.
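In the ℓ2/ℓ2 case, max dᵀGφ over unit vectors is the top singular pair of G, which can be found by simple alternating maximization over d and φ. This power-iteration-style sketch is an illustrative assumption (the slides do not prescribe the solver), and its convergence to the global maximum needs a gap between the top two singular values.

```python
import numpy as np

# Hedged sketch of the polar operator for the dictionary-learning gauge with
# both norms set to l2: maximize d^T G phi over unit vectors d, phi. Each
# half-step is exact (normalize G phi, then G^T d), giving a power-iteration
# scheme whose value converges to the top singular value of G.

def rank1_polar(G, iters=100):
    rng = np.random.default_rng(0)          # fixed seed: deterministic sketch
    phi = rng.standard_normal(G.shape[1])
    phi /= np.linalg.norm(phi)
    for _ in range(iters):
        d = G @ phi
        d /= np.linalg.norm(d)
        phi = G.T @ d
        phi /= np.linalg.norm(phi)
    return d, phi, float(d @ G @ phi)

G = np.diag([2.0, 5.0, 1.0])
d, phi, val = rank1_polar(G)    # val approaches the top singular value, 5
```
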
Table of Contents
1 Introduction
2 Generalized Conditional Gradient
3 Polar Operator
4 Conclusions
Computing the Polar
The complexity of GCG is packed into the polar operator:
    min_{x: κ(x) ≤ 1} ⟨g, x⟩ = −κ°(g).
Recall that in the dictionary learning problem:
    max_{‖d‖ ≤ 1, ‖φ‖ ≤ 1} dᵀ G φ = max_{‖d‖ ≤ 1} ‖Gᵀ d‖°,
where ‖·‖° denotes the dual of the norm on φ. Can easily become computationally intractable!
Multi-view Learning
Partition d = [b; w] and constrain their norms separately. The polar then requires
    max_{‖b‖ = 1, ‖w‖ = 1} [b; w]ᵀ G Gᵀ [b; w] = max_{‖b‖ = 1, ‖w‖ = 1} tr( G Gᵀ [b; w] [bᵀ wᵀ] ).
Harder than single-view, but still doable (White-Y-Zhang-Sch 12).
Reducing PO to Proximal
Consider the group regularizer Ψ(w) = Σ_g ‖w_g‖. Its polar
    Ψ°(u) = inf{ max_g ‖z_g‖°_g : Σ_g z_g = u }
does not seem to be easy to compute.
Theorem: For any gauge Ω, its polar Ω°(y) equals the smallest ζ ≥ 0 such that Prox_{ζΩ}(y) = 0, where prox_f(y) = min_x (1/2)‖x − y‖²₂ + f(x) and Prox_f(y) denotes the (unique) minimizer.
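The theorem suggests a concrete recipe: search for the smallest ζ with Prox_{ζΩ}(y) = 0, e.g. by bisection. A sketch for the non-overlapping group norm, whose prox is blockwise soft-thresholding; the bisection-based search, tolerances, and toy data are illustrative assumptions (for disjoint groups the polar is known in closed form, max_g ‖y_g‖, which the sketch recovers).

```python
import numpy as np

# Sketch of the slide's reduction: find the polar Omega°(y) as the smallest
# zeta >= 0 with Prox_{zeta * Omega}(y) = 0, via bisection. Omega here is the
# non-overlapping group norm sum_g ||w_g||_2.

def prox_group(y, groups, zeta):
    """Blockwise soft-thresholding: the prox of zeta * sum_g ||.||_2."""
    x = np.zeros_like(y)
    for g in groups:
        n = np.linalg.norm(y[g])
        if n > zeta:
            x[g] = (1 - zeta / n) * y[g]
    return x

def polar_via_prox(y, groups, tol=1e-10):
    lo, hi = 0.0, float(np.linalg.norm(y))   # at hi the prox is surely 0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if np.allclose(prox_group(y, groups, mid), 0.0):
            hi = mid      # prox vanished: zeta can be smaller
        else:
            lo = mid      # prox nonzero: zeta must be larger
    return hi

y = np.array([3.0, 4.0, 1.0])
groups = [[0, 1], [2]]
zeta = polar_via_prox(y, groups)   # closed form for disjoint groups: max(5, 1) = 5
```
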
Proximal Gradient
Consider min_{x ∈ C} f(x), where f ∈ C¹_L:
    x_{t+1} = argmin_{x ∈ C} f(x_t) + ⟨x − x_t, ∇f(x_t)⟩ + (L/2) ‖x − x_t‖²₂.
More generally, min_{x ∈ C} f(x) + g(x), where f ∈ C¹_L:
    x_{t+1} = argmin_{x ∈ C} f(x_t) + ⟨x − x_t, ∇f(x_t)⟩ + (L/2) ‖x − x_t‖²₂ + g(x)
            = argmin_{x ∈ C} g(x) + (L/2) ‖x − (x_t − (1/L) ∇f(x_t))‖²₂.
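The composite update above, instantiated with g = λ‖·‖_1, is the classical ISTA iteration for the lasso. A minimal sketch; the toy data and the eigenvalue-based step size are illustrative assumptions.

```python
import numpy as np

# The composite proximal-gradient update from the slide,
#   x_{t+1} = Prox_{g/L}( x_t - (1/L) grad f(x_t) ),
# instantiated for the lasso min 0.5*||A x - b||^2 + lam*||x||_1, where the
# prox of the l1 norm is soft-thresholding.

def soft(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(A, b, lam, iters=100):
    L = np.linalg.eigvalsh(A.T @ A).max()   # Lipschitz constant of grad f
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)
        x = soft(x - grad / L, lam / L)
    return x

A = np.eye(3)
b = np.array([3.0, 0.5, -2.0])
x = ista(A, b, lam=1.0)   # for A = I this is soft-thresholding of b: [2, 0, -1]
```
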
Decomposing the Proximal
How to compute the proximal operator for Ψ(w) = Σ_g ‖w_g‖?
Theorem (NEW?): Prox_{Ω+Φ} = Prox_Φ ∘ Prox_Ω for all gauges Ω iff Φ = c ‖·‖₂ for some c ≥ 0.
Corollary (Jenatton et al 11): Let G be a collection of tree-structured groups, that is, either g ⊆ g′, or g′ ⊆ g, or g ∩ g′ = ∅. Then Prox_{Σ_i ‖·‖_{g_i}} = Prox_{g_1} ∘ ... ∘ Prox_{g_m}, where we arrange the groups so that g_i ⊊ g_j ⟹ i > j.
Open: Prox_{2Ω} = Prox_Ω ∘ Prox_Ω? More generally, Prox_{Ω+Φ} = f(Prox_Ω, Prox_Φ)?
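The corollary can be sketched on the smallest nontrivial tree: two nested ℓ2 groups, with the individual proxes applied from the smallest group to the largest. The two-dimensional toy vector and threshold are illustrative assumptions.

```python
import numpy as np

# Sketch of the corollary: for tree-structured (here, nested) groups, the
# prox of the sum of group l2 norms is the composition of the individual
# group proxes, applied from the smallest group up to the largest.

def prox_one_group(y, g, tau):
    """Prox of tau * ||.||_2 restricted to coordinates g (others untouched)."""
    x = y.copy()
    n = np.linalg.norm(y[g])
    x[g] = (1 - tau / n) * y[g] if n > tau else 0.0
    return x

def prox_tree(y, nested_groups, tau):
    # nested_groups ordered smallest-to-largest, e.g. [[0], [0, 1]].
    x = y
    for g in nested_groups:
        x = prox_one_group(x, g, tau)
    return x

y = np.array([3.0, 4.0])
x = prox_tree(y, [[0], [0, 1]], tau=1.0)
# Step 1 shrinks y[0]: 3 -> 2.  Step 2 shrinks [2, 4] by (1 - 1/sqrt(20)).
```
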
Table of Contents
1 Introduction
2 Generalized Conditional Gradient
3 Polar Operator
4 Conclusions
Conclusions
We have:
- introduced the GCG;
- discussed efficient computation of the polar operator;
- applied it to matrix completion, group Lasso, etc.
Further questions: what if the PO is hard? Nonsmooth loss? Online? Stochastic?
Thank you!
More informationMachine Learning for Signal Processing Sparse and Overcomplete Representations
Machine Learning for Signal Processing Sparse and Overcomplete Representations Abelino Jimenez (slides from Bhiksha Raj and Sourish Chaudhuri) Oct 1, 217 1 So far Weights Data Basis Data Independent ICA
More informationOptimization in Machine Learning
Optimization in Machine Learning Stephen Wright University of Wisconsin-Madison Singapore, 14 Dec 2012 Stephen Wright (UW-Madison) Optimization Singapore, 14 Dec 2012 1 / 48 1 Learning from Data: Applications
More information2.3. Clustering or vector quantization 57
Multivariate Statistics non-negative matrix factorisation and sparse dictionary learning The PCA decomposition is by construction optimal solution to argmin A R n q,h R q p X AH 2 2 under constraint :
More informationNon-Smooth, Non-Finite, and Non-Convex Optimization
Non-Smooth, Non-Finite, and Non-Convex Optimization Deep Learning Summer School Mark Schmidt University of British Columbia August 2015 Complex-Step Derivative Using complex number to compute directional
More informationGeneralized Power Method for Sparse Principal Component Analysis
Generalized Power Method for Sparse Principal Component Analysis Peter Richtárik CORE/INMA Catholic University of Louvain Belgium VOCAL 2008, Veszprém, Hungary CORE Discussion Paper #2008/70 joint work
More informationStochastic Gradient Descent with Only One Projection
Stochastic Gradient Descent with Only One Projection Mehrdad Mahdavi, ianbao Yang, Rong Jin, Shenghuo Zhu, and Jinfeng Yi Dept. of Computer Science and Engineering, Michigan State University, MI, USA Machine
More informationSelected Methods for Modern Optimization in Data Analysis Department of Statistics and Operations Research UNC-Chapel Hill Fall 2018
Selected Methods for Modern Optimization in Data Analysis Department of Statistics and Operations Research UNC-Chapel Hill Fall 08 Instructor: Quoc Tran-Dinh Scriber: Quoc Tran-Dinh Lecture 4: Selected
More informationBig Data Analytics: Optimization and Randomization
Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.
More informationAn Extended Frank-Wolfe Method, with Application to Low-Rank Matrix Completion
An Extended Frank-Wolfe Method, with Application to Low-Rank Matrix Completion Robert M. Freund, MIT joint with Paul Grigas (UC Berkeley) and Rahul Mazumder (MIT) CDC, December 2016 1 Outline of Topics
More informationStochastic Methods for l 1 Regularized Loss Minimization
Shai Shalev-Shwartz Ambuj Tewari Toyota Technological Institute at Chicago, 6045 S Kenwood Ave, Chicago, IL 60637, USA SHAI@TTI-CORG TEWARI@TTI-CORG Abstract We describe and analyze two stochastic methods
More informationUses of duality. Geoff Gordon & Ryan Tibshirani Optimization /
Uses of duality Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Remember conjugate functions Given f : R n R, the function is called its conjugate f (y) = max x R n yt x f(x) Conjugates appear
More informationORIE 4741: Learning with Big Messy Data. Proximal Gradient Method
ORIE 4741: Learning with Big Messy Data Proximal Gradient Method Professor Udell Operations Research and Information Engineering Cornell November 13, 2017 1 / 31 Announcements Be a TA for CS/ORIE 1380:
More informationMatrix Factorizations: A Tale of Two Norms
Matrix Factorizations: A Tale of Two Norms Nati Srebro Toyota Technological Institute Chicago Maximum Margin Matrix Factorization S, Jason Rennie, Tommi Jaakkola (MIT), NIPS 2004 Rank, Trace-Norm and Max-Norm
More information