Smoothing Proximal Gradient Method for General Structured Sparse Regression

1 Smoothing Proximal Gradient Method for General Structured Sparse Regression
Xi Chen, Qihang Lin, Seyoung Kim, Jaime G. Carbonell, Eric P. Xing (Annals of Applied Statistics, 2012)
Gatsby Unit, Tea Talk, October 25, 2013

2 Outline
Motivation: (structured) sparse coding.
Proximal operators, FISTA.
Solution: dual norm + smooth approximation.

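3 Motivation: least squares, sparse coding
Given: $x \in \mathbb{R}^{d_x}$, $D \in \mathbb{R}^{d_x \times d_\alpha}$.
Least squares problem:
$J(\alpha) = \frac{1}{2} \|x - D\alpha\|_2^2 \to \min_{\alpha \in \mathbb{R}^{d_\alpha}}$. (1)
Sparse coding (JPEG; convex relaxation, Lasso, $w > 0$):
$J(\alpha) = \frac{1}{2} \|x - D\alpha\|_2^2 + w \|\alpha\|_1 \to \min_{\alpha \in \mathbb{R}^{d_\alpha}}$. (2)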

4 Motivation: structured sparse coding
Group Lasso ($\mathcal{G}$: partition, i.e., non-overlapping blocks):
$J(\alpha) = \frac{1}{2} \|x - D\alpha\|_2^2 + w \sum_{G \in \mathcal{G}} \|\alpha_G\|_2 \to \min_{\alpha \in \mathbb{R}^{d_\alpha}}$. (3)
Overlapping $\mathcal{G}$: hierarchy, grid, total variation, graphs.
Many successful applications: gene analysis, facial expression recognition, ...

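5 Non-overlapping group Lasso
FISTA objective:
$J(\alpha) = f(\alpha) + g(\alpha) \to \min_{\alpha \in \mathbb{R}^{d_\alpha}}$. (4)
Assumptions: $f$, $g$ convex; $f$ smooth (its gradient is Lipschitz continuous with constant $L$).
Fast convergence:
$J(\alpha_t) - J(\alpha^*) = O\left(\frac{1}{t^2}\right)$. (5)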

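6 FISTA
Ingredients:
Gradient of the smooth term: $\nabla f$.
Lipschitz constant of $\nabla f$: $L$.
Proximal operator of the non-smooth term ($p > 0$):
$\mathrm{prox}_{pg}(v) = \arg\min_y \left[ g(y) + \frac{1}{2p} \|y - v\|_2^2 \right]$. (6)
Example: $f(\alpha) = \frac{1}{2} \|x - D\alpha\|_2^2$, $g(\alpha) = w \sum_{G \in \mathcal{G}} \|\alpha_G\|_2$,
$\nabla f(\alpha) = D^T (D\alpha - x)$, $L = \lambda_{\max}(D^T D)$, (7)
$\mathrm{prox}_g$: analytical (for partition $\mathcal{G}$). (8)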
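The analytical proximal operator mentioned in (8) is commonly written as block soft-thresholding: each block $v_G$ is shrunk toward zero by $pw$ in Euclidean norm. The following is a minimal sketch under that assumption; the function name `prox_group_lasso` and the toy data are illustrative and not taken from the slides.

```python
import math

def prox_group_lasso(v, groups, w, p):
    """prox_{p*g}(v) for g(a) = w * sum_G ||a_G||_2, groups forming a partition.

    Block soft-thresholding: every block keeps its direction and its
    Euclidean norm is reduced by p*w (blocks with norm <= p*w become zero).
    """
    out = list(v)
    for G in groups:
        norm = math.sqrt(sum(v[i] ** 2 for i in G))
        scale = max(0.0, 1.0 - p * w / norm) if norm > 0 else 0.0
        for i in G:
            out[i] = scale * v[i]
    return out

v = [3.0, 4.0, 0.1, -0.1]
groups = [[0, 1], [2, 3]]   # a partition of the coordinates (assumed toy example)
result = prox_group_lasso(v, groups, w=1.0, p=1.0)
```

The first block has norm 5 and is scaled by 0.8; the second block has norm below the threshold and is set to zero, which is the group-sparsity effect the penalty is meant to produce.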

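7 Goal
Objective ($\lambda > 0$; $w_G > 0$, $\forall G \in \mathcal{G}$):
$J(\alpha) = f(\alpha) + \Omega(\alpha) + \lambda \|\alpha\|_1 \to \min_{\alpha \in \mathbb{R}^{d_\alpha}}$, (9)
$\Omega(\alpha) = \sum_{G \in \mathcal{G}} w_G \|\alpha_G\|_2$. (10)
Assumption: $f$ convex and smooth (FISTA assumptions).
$\mathcal{G}$: overlapping $\Rightarrow$ no analytical formula for $\mathrm{prox}_{pg}$.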

8 Solution
The $\ell_2$-norm is self-dual:
$\|a\|_2 = \max_{b : \|b\|_2 \le 1} b^T a$. (11)

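9 Solution
Using the self-duality (11), we rewrite $\Omega$ ($\beta_G \in \mathbb{R}^{|G|}$: auxiliary variable):
$\beta = [(\beta_G)_{G \in \mathcal{G}}] \in \mathbb{R}^{\sum_{G \in \mathcal{G}} |G|}$, (12)
$\Omega(\alpha) = \sum_{G \in \mathcal{G}} w_G \|\alpha_G\|_2 = \sum_{G \in \mathcal{G}} w_G \max_{\beta_G : \|\beta_G\|_2 \le 1} \beta_G^T \alpha_G$ (13)
$= \max_{\beta \in Q} \sum_{G \in \mathcal{G}} w_G \beta_G^T \alpha_G =: \max_{\beta \in Q} \beta^T C \alpha$, (14)
$Q = \{\beta : \|\beta_G\|_2 \le 1, \forall G \in \mathcal{G}\}$ (product of unit balls).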
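The rewriting in (12) to (14) can be checked numerically on a small overlapping example. The sketch below assumes one concrete layout of $C$ that is consistent with (14): one row per pair (group $G$, index $j \in G$), holding $w_G$ in column $j$. The helper names `build_C`, `omega`, and `dual_maximizer` are hypothetical.

```python
import math

def build_C(groups, weights, d):
    """Rows of C: one row per (group G, index j in G), with w_G in column j."""
    rows = []
    for G, w in zip(groups, weights):
        for j in G:
            row = [0.0] * d
            row[j] = w
            rows.append(row)
    return rows

def omega(a, groups, weights):
    """Omega(a) = sum_G w_G * ||a_G||_2, computed directly from (10)."""
    return sum(w * math.sqrt(sum(a[j] ** 2 for j in G))
               for G, w in zip(groups, weights))

def dual_maximizer(a, groups):
    """beta_G = a_G / ||a_G||_2 (zero block if a_G = 0) attains the max in (13)."""
    beta = []
    for G in groups:
        norm = math.sqrt(sum(a[j] ** 2 for j in G))
        beta += [a[j] / norm if norm > 0 else 0.0 for j in G]
    return beta

a = [3.0, 4.0, 12.0]
groups, weights = [[0, 1], [1, 2]], [1.0, 0.5]   # index 1 is shared by both groups
C = build_C(groups, weights, d=len(a))
Ca = [sum(cij * aj for cij, aj in zip(row, a)) for row in C]
beta = dual_maximizer(a, groups)
bilinear = sum(b * c for b, c in zip(beta, Ca))   # beta^T C a at the maximizer
```

Here `bilinear` coincides with `omega(a, groups, weights)` up to rounding, even though the two groups overlap.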

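10 Solution - continued
Smooth approximation to $\Omega(\alpha)$ ($\mu \ge 0$):
$\Omega(\alpha) = \max_{\beta \in Q} \beta^T C \alpha \approx \max_{\beta \in Q} \left( \beta^T C \alpha - \mu s(\beta) \right) =: \Omega_\mu(\alpha)$, $s(\beta) = \frac{1}{2} \|\beta\|_2^2$. (15)
Maximum gap is $\mu M$:
$M = \max_{\beta \in Q} s(\beta) = \frac{|\mathcal{G}|}{2}$, (16)
$\Omega(\alpha) - \mu M \le \Omega_\mu(\alpha) \le \Omega(\alpha)$. (17)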
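Because the maximization in (15) separates over groups, $\Omega_\mu$ has a closed form per group: a quadratic when $\|w_G \alpha_G\|_2 \le \mu$ and a shifted norm otherwise. The sketch below uses that closed form (derived here, not stated on the slides) to check the sandwich bound (17) on toy data; all names and values are illustrative.

```python
import math

def omega(a, groups, weights):
    """Omega(a) = sum_G w_G * ||a_G||_2."""
    return sum(w * math.sqrt(sum(a[j] ** 2 for j in G))
               for G, w in zip(groups, weights))

def omega_mu(a, groups, weights, mu):
    """Smoothed penalty: sum over groups of max_{||b||<=1} (w b^T a_G - mu/2 ||b||^2)."""
    total = 0.0
    for G, w in zip(groups, weights):
        norm = w * math.sqrt(sum(a[j] ** 2 for j in G))   # ||w_G a_G||_2
        if norm <= mu:
            total += norm ** 2 / (2.0 * mu)   # maximizer lies inside the unit ball
        else:
            total += norm - mu / 2.0          # maximizer lies on the boundary
    return total

a = [0.1, 0.2, 3.0]
groups, weights = [[0, 1], [1, 2]], [1.0, 1.0]
mu = 0.5
M = len(groups) / 2.0                          # M = |G| / 2 as in (16)
exact, smooth = omega(a, groups, weights), omega_mu(a, groups, weights, mu)
```

In this example the first group falls in the quadratic regime and the second in the linear one, and `exact - mu * M <= smooth <= exact` holds as (17) predicts.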

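11 Solution: FISTA on the smooth approximation
Original objective ($\lambda > 0$):
$J(\alpha) = f(\alpha) + \Omega(\alpha) + \lambda \|\alpha\|_1 \to \min_{\alpha \in \mathbb{R}^{d_\alpha}}$. (18)
Smooth approximation ($\mu > 0$, $\lambda > 0$):
$J_\mu(\alpha) = \underbrace{f(\alpha) + \Omega_\mu(\alpha)}_{\text{FISTA: } f} + \underbrace{\lambda \|\alpha\|_1}_{\text{FISTA: } g} \to \min_{\alpha \in \mathbb{R}^{d_\alpha}}$. (19)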

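12 Result (= FISTA can be applied)
$\Omega_\mu(\alpha)$: convex with Lipschitz continuous gradient
$\nabla \Omega_\mu(\alpha) = C^T \beta^*$, (20)
$\beta^* = \arg\max_{\beta \in Q} \left( \beta^T C \alpha - \mu s(\beta) \right)$ (21)
$= \left[ \left( \Pi_2 \left( \frac{w_G \alpha_G}{\mu} \right) \right)_{G \in \mathcal{G}} \right]$, (22)
where $\Pi_2$ denotes the projection onto the unit $\ell_2$ ball.
Lipschitz constant: $L_\mu = \frac{1}{\mu} \|C\|_2^2$.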
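Putting (19) to (22) together gives a complete iteration. The sketch below is one possible pure-Python rendering, not the authors' implementation: it assumes $f(\alpha) = \frac{1}{2}\|x - D\alpha\|_2^2$, starts from $\alpha_0 = 0$, and replaces $\lambda_{\max}(D^T D)$ by the upper bound $\|D\|_F^2$ to avoid an eigenvalue computation. It also uses $\|C\|_2^2 = \max_j \sum_{G \ni j} w_G^2$, which holds when each row of $C$ has a single nonzero entry $w_G$. All function names are hypothetical.

```python
import math

def grad_omega_mu(a, groups, weights, mu):
    """C^T beta*, with beta*_G the projection of w_G a_G / mu onto the unit l2 ball."""
    grad = [0.0] * len(a)
    for G, w in zip(groups, weights):
        z = [w * a[j] / mu for j in G]
        scale = 1.0 / max(1.0, math.sqrt(sum(v * v for v in z)))
        for j, zj in zip(G, z):
            grad[j] += w * scale * zj
    return grad

def soft_threshold(v, thr):
    """Proximal operator of thr * |.| (handles the lambda * ||a||_1 term)."""
    return math.copysign(max(abs(v) - thr, 0.0), v)

def objective(a, x, D, groups, weights, lam):
    """Original (non-smoothed) objective J from (18) with f = least squares."""
    res = sum((sum(dij * aj for dij, aj in zip(row, a)) - xi) ** 2
              for row, xi in zip(D, x))
    pen = sum(w * math.sqrt(sum(a[j] ** 2 for j in G))
              for G, w in zip(groups, weights))
    return 0.5 * res + pen + lam * sum(abs(aj) for aj in a)

def spg(x, D, groups, weights, lam, mu, n_iter):
    n, d = len(D), len(D[0])
    # Assumed Lipschitz bounds: ||D||_F^2 >= lambda_max(D^T D), plus ||C||_2^2 / mu.
    L_f = sum(D[i][j] ** 2 for i in range(n) for j in range(d))
    C_sq = max(sum(w * w for G, w in zip(groups, weights) if j in G)
               for j in range(d))
    L = L_f + C_sq / mu
    a_prev, y, t = [0.0] * d, [0.0] * d, 1.0
    for _ in range(n_iter):
        r = [sum(D[i][j] * y[j] for j in range(d)) - x[i] for i in range(n)]
        g_f = [sum(D[i][j] * r[i] for i in range(n)) for j in range(d)]
        g = [u + v for u, v in zip(g_f, grad_omega_mu(y, groups, weights, mu))]
        a = [soft_threshold(yj - gj / L, lam / L) for yj, gj in zip(y, g)]
        t_next = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = [aj + (t - 1.0) / t_next * (aj - pj) for aj, pj in zip(a, a_prev)]
        a_prev, t = a, t_next
    return a_prev

x = [1.0, 2.0, 3.0]
D = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
groups, weights = [[0, 1], [1, 2]], [0.1, 0.1]      # overlapping toy groups
a_hat = spg(x, D, groups, weights, lam=0.1, mu=0.01, n_iter=200)
```

Using an upper bound for the Lipschitz constant only makes the step size more conservative; a tighter value of $\lambda_{\max}(D^T D)$ would allow larger steps.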

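13 Proof (intuition)
Convexity, smoothness of $\Omega_\mu$:
$\Omega_\mu(\alpha) = \max_{\beta \in Q} \left( \beta^T C \alpha - \mu s(\beta) \right) = \mu \max_{\beta \in Q} \left( \beta^T \frac{C\alpha}{\mu} - s(\beta) \right) = \mu d^* \left( \frac{C\alpha}{\mu} \right)$, (23)
with $d^*$ the conjugate of $s$ restricted to $Q$.
Gradient $\nabla \Omega_\mu$: Danskin's theorem with
$h(\alpha) = \max_{\beta \in K} \varphi(\beta, \alpha)$, $K$: compact, (24)
$\nabla h(\alpha) = \nabla_\alpha \varphi(\beta^*, \alpha)$. (25)
Lipschitz constant $L_\mu$: Nesterov '05.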

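14 Convergence rate: $O\left(\frac{1}{\epsilon}\right)$
Given: $\epsilon$ (precision). We want
$J(\alpha_t) - J(\alpha^*) \le \epsilon$. (26)
Set $\mu = \frac{\epsilon}{2M}$, where $M = \frac{|\mathcal{G}|}{2}$.
Sufficient number of iterations:
$\sqrt{ \frac{4 \|\alpha_0 - \alpha^*\|_2^2}{\epsilon} \left[ \lambda_{\max}(D^T D) + \frac{2M \|C\|_2^2}{\epsilon} \right] } = O\left(\frac{1}{\epsilon}\right)$.
Note (subgradient descent is much slower): $O\left(\frac{1}{\epsilon^2}\right)$.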
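The iteration bound is easy to evaluate for concrete numbers. The sketch below takes $\lambda_{\max}(D^T D)$ and $\|\alpha_0 - \alpha^*\|_2^2$ as given inputs (in practice the latter is unknown and must be estimated) and computes $\|C\|_2^2$ as $\max_j \sum_{G \ni j} w_G^2$, assuming each row of $C$ has a single nonzero entry $w_G$. The function name `spg_iteration_bound` is hypothetical.

```python
import math

def spg_iteration_bound(eps, dist_sq, lam_max, groups, weights, d):
    """Return (mu, t): smoothing parameter eps/(2M) and a sufficient iteration count."""
    M = len(groups) / 2.0
    # ||C||_2^2: largest total squared weight of the groups covering one coordinate.
    C_sq = max(sum(w * w for G, w in zip(groups, weights) if j in G)
               for j in range(d))
    mu = eps / (2.0 * M)
    t = math.sqrt(4.0 * dist_sq / eps * (lam_max + 2.0 * M * C_sq / eps))
    return mu, math.ceil(t)

groups, weights = [[0, 1], [1, 2]], [1.0, 1.0]
mu, t = spg_iteration_bound(eps=0.1, dist_sq=1.0, lam_max=1.0,
                            groups=groups, weights=weights, d=3)
mu_half, t_half = spg_iteration_bound(eps=0.05, dist_sq=1.0, lam_max=1.0,
                                      groups=groups, weights=weights, d=3)
```

Halving `eps` roughly doubles the returned count, which is the $O(1/\epsilon)$ behaviour stated on the slide.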

15 Summary
Task: overlapping group Lasso.
Difficulty: overlapping $\Rightarrow$ non-separability.
Proposed solution: $\|\cdot\|_2^* = \|\cdot\|_2$ (self-duality).
Smooth approximation $\Rightarrow$ $|\mathcal{G}|$ independent subproblems with analytical expressions $\Rightarrow$ FISTA.
Convergence rate: $O\left(\frac{1}{\epsilon}\right)$.

16 Thank you for the attention!

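17 Analytical solution for $\beta^*$
$\beta^* = \arg\max_{\beta \in Q} \left( \beta^T C \alpha - \frac{\mu}{2} \|\beta\|_2^2 \right)$ (27)
$= \arg\max_{\beta \in Q} \sum_{G \in \mathcal{G}} \left( w_G \beta_G^T \alpha_G - \frac{\mu}{2} \|\beta_G\|_2^2 \right)$ (28)
$= \arg\min_{\beta \in Q} \sum_{G \in \mathcal{G}} \left\| \beta_G - \frac{w_G \alpha_G}{\mu} \right\|_2^2$. (29)
Thus
$(\beta^*)_G = \Pi_2 \left( \frac{w_G \alpha_G}{\mu} \right)$. (30)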

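18 Combination of Lipschitz constants
Let $L_f$ ($L_g$) be a Lipschitz constant of $\nabla f$ ($\nabla g$). Then $L_{f+g} \le L_f + L_g$, since
$\|\nabla (f + g)(x) - \nabla (f + g)(y)\|_2$ (31)
$= \|[\nabla f(x) + \nabla g(x)] - [\nabla f(y) + \nabla g(y)]\|_2$ (32)
$\le \|\nabla f(x) - \nabla f(y)\|_2 + \|\nabla g(x) - \nabla g(y)\|_2$ (33)
$\le L_f \|x - y\|_2 + L_g \|x - y\|_2$ (34)
$= (L_f + L_g) \|x - y\|_2$. (35)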

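19 Rate of convergence for SPG
$J(\alpha_t) - J(\alpha^*) = [J(\alpha_t) - J_\mu(\alpha_t)] + [J_\mu(\alpha_t) - J_\mu(\alpha^*)] + [J_\mu(\alpha^*) - J(\alpha^*)]$ (36)
$\le \mu M + \frac{2 L_\mu \|\alpha_0 - \alpha^*\|_2^2}{t^2}$ (37)
$\le \mu M + \frac{2 \|\alpha_0 - \alpha^*\|_2^2}{t^2} \left( \lambda_{\max}(D^T D) + \frac{\|C\|_2^2}{\mu} \right)$. (38)
In (36) the first bracket is at most $\mu M$ and the third is at most $0$ by (17); the second is the FISTA bound, where $L_\mu$ in (37) stands for the Lipschitz constant of $\nabla (f + \Omega_\mu)$.
Plug in $\mu = \frac{\epsilon}{2M}$ and solve for $t$:
$J(\alpha_t) - J(\alpha^*) \le \frac{\epsilon}{2} + \frac{2 \|\alpha_0 - \alpha^*\|_2^2}{t^2} \left( \lambda_{\max}(D^T D) + \frac{2M \|C\|_2^2}{\epsilon} \right) \le \epsilon$,
which holds whenever $t \ge \sqrt{ \frac{4 \|\alpha_0 - \alpha^*\|_2^2}{\epsilon} \left( \lambda_{\max}(D^T D) + \frac{2M \|C\|_2^2}{\epsilon} \right) }$.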

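20 Proximal operator
$f : \mathbb{R}^d \to \mathbb{R} \cup \{\infty\}$: closed proper convex function, i.e.,
$\mathrm{epi}(f) = \{(y, t) \in \mathbb{R}^d \times \mathbb{R} : f(y) \le t\}$ (39)
is a nonempty closed convex set.
Proximal operator of $f$:
$\mathrm{prox}_f(v) = \arg\min_y \left[ f(y) + \frac{1}{2} \|y - v\|_2^2 \right]$. (40)
The objective in (40) is strongly (hence strictly) convex $\Rightarrow$ $\mathrm{prox}_f(v)$ exists and is unique.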

21 Proximal operator = generalization of projection
$C$: closed convex set. $f = I_C$: indicator function of $C$,
$I_C(y) = 0$ if $y \in C$, and $I_C(y) = \infty$ if $y \notin C$. (41)
Then $\mathrm{prox}_f$ = Euclidean projection onto $C$:
$\mathrm{prox}_{I_C}(v) = \Pi_C(v) = \arg\min_{y \in C} \|v - y\|_2$. (42)
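As a concrete instance of (42), take $C$ to be the $\ell_2$ ball of a given radius: the projection leaves points inside the ball unchanged and rescales points outside onto the boundary. The sketch below is illustrative; the function name `project_l2_ball` is hypothetical, and with radius 1 it plays the role of $\Pi_2$ used earlier in the talk.

```python
import math

def project_l2_ball(v, radius=1.0):
    """Euclidean projection of v onto {y : ||y||_2 <= radius}."""
    norm = math.sqrt(sum(vi * vi for vi in v))
    if norm <= radius:
        return list(v)                        # already feasible: prox is the identity
    return [radius * vi / norm for vi in v]   # rescale onto the boundary

outside = project_l2_ball([3.0, 4.0])         # norm 5, lands on the unit sphere
inside = project_l2_ball([0.3, 0.4])          # norm 0.5, returned unchanged
```

No other point of the ball is closer to `[3.0, 4.0]` than `outside`, which is what the argmin in (42) requires.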

22 Conjugate function
$f : \mathbb{R}^d \to \mathbb{R}$, not necessarily convex. Conjugate of $f$:
$f^*(v) = \sup_y \left[ v^T y - f(y) \right]$. (43)
Notes:
$f^*$ is convex: it is a pointwise supremum of affine (hence convex) functions of $v$.
If $f$ is convex and closed: $(f^*)^* = f$.
If $f$ is differentiable: $f^*$ = Legendre transform of $f$.

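23 Conjugate function: properties
If $f$ = indicator function of a unit ball, i.e.,
$f = I_C$, $C = B = \{y \in \mathbb{R}^d : \|y\| \le 1\}$, (44)
then $f^*$ is the dual norm:
$f^*(v) = \|v\|_* = \max_{y \in \mathbb{R}^d : \|y\| \le 1} v^T y$. (45)
The dual norm of $\|\cdot\|_p$ ($p \ge 1$) is $\|\cdot\|_{p'}$ with $\frac{1}{p} + \frac{1}{p'} = 1$.
Similarly ($\mathcal{G}$: partition):
$\|u\| = \sum_{G \in \mathcal{G}} \|u_G\|_p$, $\|u\|_* = \max_{G \in \mathcal{G}} \|u_G\|_{p'}$. (46)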
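The pair of norms in (46) can be illustrated for $p = p' = 2$ on a small partition. The sketch below checks that $v^T u \le \|v\|_* \|u\|$ for a feasible $u$, and that the maximum in (45) is attained by a unit vector supported on the group with the largest $\|v_G\|_2$. The function names and the test vectors are illustrative.

```python
import math

def group_norm(u, groups):
    """||u|| = sum_G ||u_G||_2 (the p = 2 case of (46))."""
    return sum(math.sqrt(sum(u[j] ** 2 for j in G)) for G in groups)

def dual_group_norm(v, groups):
    """||v||_* = max_G ||v_G||_2 (the p' = 2 case of (46))."""
    return max(math.sqrt(sum(v[j] ** 2 for j in G)) for G in groups)

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

groups = [[0, 1], [2, 3]]            # a partition of four coordinates
v = [3.0, 4.0, 1.0, 0.0]
u_star = [0.6, 0.8, 0.0, 0.0]        # v_G / ||v_G||_2 on the dominant group
u_other = [0.0, 0.5, 0.5, 0.0]       # another point with group norm 1
```

Both `u_star` and `u_other` have group norm 1; `dot(v, u_star)` equals `dual_group_norm(v, groups)`, while `dot(v, u_other)` stays below it.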
