Generalized Conditional Gradient and Its Applications


Slide 1: Generalized Conditional Gradient and Its Applications
Yaoliang Yu, University of Alberta
UBC Kelowna, 04/18/13

Slide 2 (outline):
1. Introduction
2. Generalized Conditional Gradient
3. Polar Operator
4. Conclusions

Slide 3: Table of Contents
1. Introduction
2. Generalized Conditional Gradient
3. Polar Operator
4. Conclusions

Slide 4: Regularized Loss Minimization
Generic form for many ML problems:
$\min_w\; f(w) + \lambda\, h(w)$,
where f is the loss function and h is the regularizer.
Assuming f and h to be convex/smooth, candidate solvers include:
Interior point method; Mirror descent / Proximal gradient; Averaging gradient; Conditional gradient.

Slide 5: Machine Learning Examples
Example (Matrix Completion): Netflix problem; covariance matrix estimation; etc.
$\min_{X \in \mathbb{R}^{m\times n}}\; \tfrac{1}{2}\sum_{ij\in O}(X_{ij} - Z_{ij})^2 + \lambda \|X\|_{\mathrm{tr}}$
Example (Group Lasso): statistical estimation; inverse problems; denoising; etc.
$\min_{w \in \mathbb{R}^d}\; \tfrac{1}{2}\|Aw - b\|_2^2 + \lambda \sum_{g\in G}\|w_g\|$
Interesting case: m, n or d are extremely large.

Slide 6: Conditional gradient (Frank-Wolfe 56)
Consider $\min_{x \in C} f(x)$, where C is compact convex and f is smooth convex.
1. $y_t \in \arg\min_{x \in C} \langle x, \nabla f(x_t) \rangle$;
2. $x_{t+1} = (1-\eta) x_t + \eta y_t$.
(Frank-Wolfe 56; Canon-Cullum 68) proved that CG converges at $\Theta(1/t)$.
Gained much recent attention due to its simplicity and the greedy nature of step 1.
Refs: (Zhang 03; Clarkson 10; Hazan 08; Jaggi-Sulovsky 10; Bach 12; etc.)
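
A minimal sketch of this two-step loop, assuming the feasible set is given through a user-supplied linear minimization oracle; the names `grad_f` and `lmo` and the open-loop step size are illustrative choices, not prescribed by the slides:

```python
import numpy as np

def conditional_gradient(grad_f, lmo, x0, iters=100):
    """Frank-Wolfe / conditional gradient over a compact convex set C.

    grad_f(x): gradient of the smooth objective at x.
    lmo(g):    a point y in C minimizing <y, g> (linear minimization oracle).
    """
    x = np.asarray(x0, dtype=float)
    for t in range(iters):
        g = grad_f(x)
        y = lmo(g)                      # step 1: greedy vertex
        eta = 2.0 / (t + 2)             # a common open-loop step size
        x = (1 - eta) * x + eta * y     # step 2: convex combination
    return x
```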

Slides 7-18: An Example (one slide, built up frame by frame)
$\min_{a,b}\; a^2 + (b+1)^2, \quad \text{s.t. } |a| \le 1,\ 2 \ge b \ge 0$
[Figure: the feasible box with the conditional-gradient iterates (step size 1/(k+2)) starting at $x_1 = (1, 1)$ and passing through $x_2, x_3, x_4, \ldots$ toward $x^\star$, zig-zagging between the vertices $y_1 = (-1, 0)$ and $y_2 = (1, 0)$.]
Can show $f(x_k) - f(x^\star) = 4/k + o(1/k)$.
Projected gradient converges in two iterations.
Refs: (Levitin-Polyak 66; Polyak 87; Beck-Teboulle 04) for faster rates.
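
A small numerical sketch of this example, under the reconstruction above (box $|a| \le 1$, $0 \le b \le 2$); the code uses the standard step size $2/(k+2)$, which may differ from the exact convention on the slide:

```python
import numpy as np

# Toy problem: f(a, b) = a^2 + (b + 1)^2 over the box |a| <= 1, 0 <= b <= 2.
# The constrained optimum (0, 0) sits on an edge, which is what makes plain
# conditional gradient zig-zag and converge slowly.
def grad_f(x):
    return np.array([2.0 * x[0], 2.0 * (x[1] + 1.0)])

def lmo_box(g):
    # Vertex of the box minimizing <y, g>.
    a = -1.0 if g[0] > 0 else 1.0
    b = 0.0 if g[1] > 0 else 2.0
    return np.array([a, b])

x = np.array([1.0, 1.0])            # x_1 = (1, 1)
f_star = 1.0                        # f at the constrained optimum (0, 0)
for k in range(1, 1001):
    y = lmo_box(grad_f(x))          # alternates between (-1, 0) and (1, 0)
    eta = 2.0 / (k + 2)
    x = (1 - eta) * x + eta * y
print(x[0] ** 2 + (x[1] + 1.0) ** 2 - f_star)   # gap decays like O(1/k)
```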

Slide 19: The revival of CG: sparsity!
The revived popularity of conditional gradient is due to (Clarkson 10; Shalev-Shwartz-Srebro-Zhang 10), both focusing on $\min_x f(x)$ over a set (e.g., the simplex or the $\ell_1$ ball) whose vertices are 1-sparse:
$y_t \in \arg\min_y \langle y, \nabla f(x_t)\rangle$, with $\mathrm{card}(y_t) = 1$;
$x_{t+1} = (1-\eta)x_t + \eta y_t$, so $\mathrm{card}(x_{t+1}) \le \mathrm{card}(x_t) + 1$.
Explicit control of the sparsity: $1/\epsilon$ vs. $1/\sqrt{\epsilon}$.
Sparsity, and more generally structure, is the key to the success of ML.
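
A minimal sketch of why each CG step adds at most one coordinate, assuming the constraint set is the unit $\ell_1$ ball (the name `lmo_l1_ball` is illustrative):

```python
import numpy as np

def lmo_l1_ball(g):
    """Linear minimization oracle over the unit l1 ball.

    Returns a 1-sparse vertex y minimizing <y, g>: the signed coordinate
    vector at the largest-magnitude entry of the gradient g.
    """
    y = np.zeros_like(g)
    i = np.argmax(np.abs(g))
    y[i] = -np.sign(g[i])
    return y

# Each update x <- (1 - eta) * x + eta * y touches at most one new coordinate,
# so after t steps the iterate has at most t nonzeros.
```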

Slide 20: Table of Contents
1. Introduction
2. Generalized Conditional Gradient
3. Polar Operator
4. Conclusions

Slide 21: Generalized conditional gradient
Consider $\min_x\; f(x) + \lambda\,\kappa(x)$, where f is smooth convex and $\kappa$ is a gauge (not necessarily smooth).
Important distinctions: composite, with a non-smooth term; unconstrained, hence unbounded domain.
1. Polar operator: $y_t \in \arg\min_{x:\,\kappa(x) \le 1} \langle x, \nabla f(x_t)\rangle$;
2. Line search: $s_t \in \arg\min_{s \ge 0}\; f((1-\eta)x_t + \eta s y_t) + \lambda\eta s$;
3. $x_{t+1} = (1-\eta)x_t + \eta s_t y_t$.
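
A minimal sketch of these three steps, assuming a polar oracle `polar` that returns a unit-gauge atom minimizing the inner product with the gradient, and a 1-D solver for the scale; the helper names and the use of `scipy.optimize.minimize_scalar` are illustrative assumptions:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def gcg(f, grad_f, polar, lam, x0, iters=100):
    """Generalized conditional gradient for min_x f(x) + lam * kappa(x).

    polar(g): an atom y with kappa(y) <= 1 minimizing <y, g>.
    """
    x = np.asarray(x0, dtype=float)
    for t in range(iters):
        g = grad_f(x)
        y = polar(g)                                 # step 1: polar operator
        eta = 2.0 / (t + 2)
        obj = lambda s: f((1 - eta) * x + eta * s * y) + lam * eta * s
        s = minimize_scalar(obj, bounds=(0.0, 1e6), method="bounded").x
        x = (1 - eta) * x + eta * s * y              # step 3: update
    return x
```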

Slide 22: Convergence Rate
$\min_x\; f(x) + \lambda\,\kappa(x)$
Theorem (Zhang-Y-Schuurmans 12). If f and $\kappa$ have bounded level sets and $f \in C^1$, then GCG converges at rate O(1/t), where the constant is independent of $\lambda$.
Proof is simple:
Line search is as good as knowing $\kappa(x^\star)$;
Note that we upper bound $\kappa((1-\eta)x_t + \eta s y_t) \le (1-\eta)\kappa(x_t) + \eta s$;
Still too slow!
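
To make the proof sketch concrete, here is the standard per-step estimate obtained by combining the two bullets above with smoothness of f (a generic Frank-Wolfe-style calculation written under the assumption that $\nabla f$ is $L$-Lipschitz; it is not a transcription of the authors' proof). Take $s = \kappa(x^\star)$, which the line search can only improve upon:

```latex
% F(x) = f(x) + \lambda\kappa(x); one GCG step with s = \kappa(x^\star)
\begin{aligned}
F(x_{t+1})
 &\le f\big((1-\eta)x_t + \eta s y_t\big) + \lambda\big[(1-\eta)\kappa(x_t) + \eta s\big] \\
 &\le F(x_t) + \eta\big[\langle \nabla f(x_t),\, s y_t - x_t\rangle + \lambda s - \lambda\kappa(x_t)\big]
      + \tfrac{L\eta^2}{2}\,\| s y_t - x_t\|^2 \\
 &\le F(x_t) - \eta\big[F(x_t) - F(x^\star)\big] + \tfrac{L\eta^2}{2}\,\| s y_t - x_t\|^2 .
\end{aligned}
```

The last line uses that $s\,y_t$ minimizes the linearization over $\{x : \kappa(x) \le s\}$ together with convexity of f; unrolling the recursion with $\eta = 2/(t+2)$ and the bounded-level-set assumption (which bounds $\|s y_t - x_t\|$) gives the O(1/t) rate.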

Slide 23: Local improvement
Assume some procedure (say BFGS) that can locally minimize the nonsmooth problem $\min_x f(x) + \lambda\,\kappa(x)$, or some variation of it.
Can we combine this local procedure with some globally convergent routine?
Two conditions:
The local procedure cannot incur big overhead;
It cannot ruin the globally convergent routine.
Both are met by GCG.
Refs: (Burer-Monteiro 05; Mishra et al 11; Laue 12)

Slide 24: Case study: Matrix completion with trace norm
Consider $\min_X\; \tfrac{1}{2}\sum_{ij\in O}(X_{ij} - Z_{ij})^2 + \lambda \|X\|_{\mathrm{tr}}$.
$\|\cdot\|_{\mathrm{tr}}$ is the convex hull of rank on the unit ball $\{X : \|X\|_{\mathrm{sp}} \le 1\}$.
The only nontrivial step in GCG, the polar operator $Y_t \in \arg\min_{\|Y\|_{\mathrm{tr}} \le 1} \langle Y, G_t\rangle$, amounts to the dominating singular vectors of $G_t$.
In contrast, popular gradient methods need the full SVD of $G_t$.
Variation: $\min_{U,V}\; \tfrac{1}{2}\sum_{ij\in O}\big((UV^\top)_{ij} - Z_{ij}\big)^2 + \lambda\big(\|U\|_F^2 + \|V\|_F^2\big)$.
Not jointly convex in U and V, but smooth in U and V;
$Y_t$ in GCG is rank-1, hence $X_t = UV^\top$ is of rank at most t.
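
A minimal sketch of this polar step, assuming we only need the leading singular vector pair of the gradient matrix; power iteration is one cheap way to get it (the helper names are illustrative):

```python
import numpy as np

def top_singular_pair(G, iters=200):
    """Leading left/right singular vectors of G via power iteration."""
    v = np.random.default_rng(0).standard_normal(G.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        u = G @ v
        u /= np.linalg.norm(u)
        v = G.T @ u
        v /= np.linalg.norm(v)
    return u, v

def trace_norm_polar_atom(G):
    """Atom Y with ||Y||_tr <= 1 minimizing <Y, G>: minus the dominating
    rank-1 factor of G (no full SVD needed, unlike proximal methods)."""
    u, v = top_singular_pair(G)
    return -np.outer(u, v)
```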

Slide 25: Case study: Experiment
[Figure-only slide: experimental results.]

Slide 26: Interpretation
Dictionary learning problem: $\min_{D \in \mathbb{R}^{m\times r},\, \Phi \in \mathbb{R}^{r\times n}} L(X, D\Phi)$.
Many applications: NMF, sparse coding, topic models...
Not jointly convex, in fact NP-hard for fixed r;
Convexify by not constraining the rank explicitly: relax r!
Refs: (Bengio et al 05; Bach-Mairal-Ponce 08; Zhang-Y-White-Sch 10)

Slide 27: Convexification
Let $D_{:i}$ have unit norm (say $\ell_2$); $\min_{D,\Phi}\; L(X, D\Phi) + \lambda\,\Omega(\Phi)$.
Put a row-wise norm on $\Phi$: implicitly constraining the rank;
Rewrite $\hat X := D\Phi = \sum_i \|\Phi_{i:}\|\, D_{:i}\, \frac{\Phi_{i:}}{\|\Phi_{i:}\|}$;
Reformulate $\min_{\hat X}\; L(X, \hat X) + \lambda\,\kappa(\hat X)$, where $\kappa(X) = \inf\big\{\sum_i \sigma_i : X = \sum_i \sigma_i\, D_{:i}\, \frac{\Phi_{i:}}{\|\Phi_{i:}\|}\big\}$;
Can apply GCG now; PO: $\min_{\|d\|\le 1,\ \|\phi\|\le 1}\; d^\top G_t\, \phi$.
Setting both norms to $\ell_2$, we recover the matrix completion example.

Slide 28: Table of Contents
1. Introduction
2. Generalized Conditional Gradient
3. Polar Operator
4. Conclusions

Slide 29: Computing the Polar
The complexity of GCG is packed into the PO:
$\min_{x:\,\kappa(x)\le 1} \langle g, x\rangle = -\kappa^\circ(-g)$.
Recall that in the dictionary learning problem:
$\min_{d,\phi}\; d^\top G\,\phi = -\max_{d}\; \|G^\top d\|^\circ$ (maximizing out $\phi$ via its dual norm).
Can easily become computationally intractable!
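
A minimal sketch of one common heuristic for this bilinear polar, alternating between the two norm-constrained blocks; the particular $\ell_2$/$\ell_1$ choice of norms and the helper name are illustrative assumptions, and each half-step is exact even though the overall scheme is only a heuristic for the joint maximum:

```python
import numpy as np

def dict_polar_heuristic(G, iters=50, seed=0):
    """Alternating maximization for max d^T G phi with ||d||_2 <= 1 and
    ||phi||_1 <= 1 (a sparse-coding flavored gauge)."""
    rng = np.random.default_rng(seed)
    phi = np.zeros(G.shape[1])
    phi[rng.integers(G.shape[1])] = 1.0          # random 1-sparse start
    for _ in range(iters):
        d = G @ phi
        d /= np.linalg.norm(d)                   # best unit-l2 d for fixed phi
        c = G.T @ d
        j = np.argmax(np.abs(c))                 # best unit-l1 phi is 1-sparse
        phi = np.zeros_like(phi)
        phi[j] = np.sign(c[j])
    return d, phi
```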

Slide 30: Multi-view Learning
Partition $d = \begin{bmatrix} b \\ w \end{bmatrix}$ and constrain their norms respectively.
Harder than single-view, but still doable (White-Y-Zhang-Sch 12):
$\max_{\|b\|=1,\,\|w\|=1} \begin{bmatrix} b \\ w \end{bmatrix}^\top G G^\top \begin{bmatrix} b \\ w \end{bmatrix} = \max\; \operatorname{tr}\Big(GG^\top \begin{bmatrix} b \\ w \end{bmatrix} \begin{bmatrix} b^\top & w^\top \end{bmatrix}\Big)$.

Slide 31: Reducing PO to Proximal
Consider the group regularizer $\Psi(w) = \sum_g \|w_g\|$.
Its polar
$\Psi^\circ(u) = \inf\big\{\max_g \|z^g\|_g^\circ : \sum_g z^g = u\big\}$
does not seem to be easy to compute.
Theorem. For any gauge $\Omega$, its polar $\Omega^\circ(y)$ equals the smallest $\zeta \ge 0$ s.t.
$\max_x\,\big\{\langle y, x\rangle - \zeta\,\Omega(x) - \tfrac{1}{2}\|x\|_2^2\big\} = 0$, i.e. $\operatorname{Prox}_{\zeta\Omega}(y) = 0$,
where $\operatorname{prox}_f(y) = \min_x \tfrac{1}{2}\|x - y\|_2^2 + f(x)$ and $\operatorname{Prox}_f(y)$ denotes the (unique) minimizer.
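
A small numerical sketch of this theorem for the simplest case of non-overlapping groups, where both sides can be checked directly: bisect on $\zeta$ until the group-soft-thresholding prox returns zero, and compare with the closed form $\max_g \|y_g\|_2$ for disjoint groups (the helper names and the bisection tolerance are illustrative):

```python
import numpy as np

def prox_group(y, groups, zeta):
    """Prox of zeta * sum_g ||w_g||_2 for disjoint groups: group soft-thresholding."""
    x = np.zeros_like(y)
    for g in groups:
        ng = np.linalg.norm(y[g])
        if ng > zeta:
            x[g] = (1.0 - zeta / ng) * y[g]
    return x

def polar_via_prox(y, groups, tol=1e-10):
    """Smallest zeta >= 0 with Prox_{zeta * Omega}(y) = 0, found by bisection."""
    lo, hi = 0.0, np.linalg.norm(y, 1)          # at hi the prox is surely zero
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if np.allclose(prox_group(y, groups, mid), 0.0):
            hi = mid
        else:
            lo = mid
    return hi

rng = np.random.default_rng(0)
y = rng.standard_normal(6)
groups = [np.array([0, 1, 2]), np.array([3, 4]), np.array([5])]
print(polar_via_prox(y, groups))                  # approx Omega_polar(y)
print(max(np.linalg.norm(y[g]) for g in groups))  # closed form for disjoint groups
```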

Slide 32: Proximal Gradient
Consider $\min_{x\in C} f(x)$, where $f \in C^1_L$:
$x_{t+1} = \arg\min_{x\in C}\; f(x_t) + \langle x - x_t, \nabla f(x_t)\rangle + \tfrac{L}{2}\|x - x_t\|_2^2$.
More generally, $\min_{x\in C}\; f(x) + g(x)$, where $f \in C^1_L$:
$x_{t+1} = \arg\min_{x\in C}\; f(x_t) + \langle x - x_t, \nabla f(x_t)\rangle + \tfrac{L}{2}\|x - x_t\|_2^2 + g(x)$
$\quad\;\;\, = \arg\min_{x\in C}\; g(x) + \tfrac{L}{2}\big\|x - \big(x_t - \tfrac{1}{L}\nabla f(x_t)\big)\big\|_2^2$.
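
A minimal sketch of this update for the lasso case $g(x) = \lambda\|x\|_1$ and $C = \mathbb{R}^n$, where the inner minimization is elementwise soft-thresholding; the function names and fixed step size $1/L$ with $L = \|A\|_2^2$ are illustrative choices:

```python
import numpy as np

def soft_threshold(v, tau):
    """Prox of tau * ||.||_1: elementwise shrinkage."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def proximal_gradient_lasso(A, b, lam, iters=500):
    """ISTA for min_x 0.5 * ||A x - b||^2 + lam * ||x||_1."""
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)
        x = soft_threshold(x - grad / L, lam / L)
    return x
```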

Slide 33: Decomposing the Proximal
How to compute the proximal operator for $\Psi(w) = \sum_g \|w_g\|$?
Theorem (NEW?). $\operatorname{Prox}_{\Omega+\Phi} = \operatorname{Prox}_\Phi \circ \operatorname{Prox}_\Omega$ for all gauges $\Omega$ iff $\Phi = c\,\|\cdot\|_2$ for some $c \ge 0$.
Corollary (Jenatton et al 11). Let G be a collection of tree-structured groups, that is, for any two groups either $g \subseteq g'$ or $g' \subseteq g$ or $g \cap g' = \emptyset$. Then $\operatorname{Prox}_{\sum_i \|\cdot\|_{g_i}} = \operatorname{Prox}_{g_1} \circ \cdots \circ \operatorname{Prox}_{g_m}$, where we arrange the groups so that $g_i \subsetneq g_j \Rightarrow i > j$.
$\operatorname{Prox}_{2\Omega} = \operatorname{Prox}_\Omega \circ \operatorname{Prox}_\Omega$? More generally, $\operatorname{Prox}_{\Omega+\Phi} = f(\operatorname{Prox}_\Omega, \operatorname{Prox}_\Phi)$?
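
A minimal sketch of the corollary for $\ell_2$ group norms: composing the individual group proxes with the smallest groups applied first (the corollary's composition read right to left) gives the prox of the sum on tree-structured groups; the group indices and the weight below are illustrative:

```python
import numpy as np

def prox_l2_group(x, idx, tau):
    """Prox of tau * ||x_idx||_2, applied on a copy of x."""
    y = x.copy()
    n = np.linalg.norm(y[idx])
    y[idx] = 0.0 if n <= tau else (1.0 - tau / n) * y[idx]
    return y

def prox_tree_sum(x, groups, tau):
    """Prox of tau * sum_g ||x_g||_2 for tree-structured groups.

    `groups` must list children before parents (smallest groups first).
    """
    y = x
    for idx in groups:
        y = prox_l2_group(y, idx, tau)
    return y

# Example: nested groups {4,5} inside {3,4,5} inside {0,...,5}.
x = np.array([3.0, -1.0, 0.5, 2.0, -2.0, 1.0])
groups = [np.array([4, 5]), np.array([3, 4, 5]), np.arange(6)]
print(prox_tree_sum(x, groups, tau=0.7))
```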

Slide 34: Table of Contents
1. Introduction
2. Generalized Conditional Gradient
3. Polar Operator
4. Conclusions

Slide 35: Conclusions
We have: introduced GCG; discussed efficient computation of the PO; applied it to matrix completion, group Lasso, etc.
Further questions: what if the PO is hard? nonsmooth loss? online? stochastic?

Slide 36: Thank you!
