OWL to the rescue of LASSO

1 OWL to the rescue of LASSO. IISc IBM Day 2018. Joint work with R. Sankaran and Francis Bach (AISTATS 17). Chiranjib Bhattacharyya, Professor, Department of Computer Science and Automation, Indian Institute of Science, Bangalore. March 7, 2018.

2 Identifying groups of strongly correlated variables through Smoothed Ordered Weighted L1-norms

3 Outline: LASSO: the method of choice for feature selection. Ordered Weighted L1 (OWL) norm and submodular penalties. Ω_S: Smoothed OWL (SOWL).

4 LASSO: The method of choice for feature selection

5 Linear Regression in High Dimension. Question: let x ∈ R^d, y ∈ R with y = w*^T x + ε, ε ~ N(0, σ^2). From data D = {(x_i, y_i) : i ∈ [n], x_i ∈ R^d, y_i ∈ R}, can we find w*? Stack the data as X ∈ R^{n×d} with rows x_1^T, ..., x_n^T, and Y ∈ R^n.

6 Linear Regression in High Dimension. Least squares: given X ∈ R^{n×d}, Y ∈ R^n, w_LS = argmin_{w ∈ R^d} (1/2) ||Xw − Y||_2^2. Assumptions: labels centered (Σ_{j=1}^n y_j = 0); features normalized (||x_i||_2 = 1, x_i^T 1 = 0).

7 Linear Regression in High Dimension. w_LS = (X^T X)^{-1} X^T Y and E(w_LS) = w*. Variance of the predictive error: (1/n) E(||X(w_LS − w*)||_2^2) = σ^2 d / n.
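
A minimal numpy sketch of this closed-form estimator, with a rough empirical check of the σ^2 d / n scaling (sizes, seed and the lstsq-based solve are illustrative choices, not from the talk):

```python
import numpy as np

def least_squares(X, Y):
    """Closed-form least-squares estimate w_LS = (X^T X)^{-1} X^T Y.

    Assumes X (n x d) has full column rank; lstsq is used instead of
    forming the inverse explicitly, for numerical stability.
    """
    w_ls, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return w_ls

# Illustrative check of the sigma^2 * d / n prediction-error scaling.
rng = np.random.default_rng(0)
n, d, sigma = 200, 20, 1.0
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
Y = X @ w_star + sigma * rng.standard_normal(n)
w_ls = least_squares(X, Y)
print(np.mean((X @ (w_ls - w_star)) ** 2), sigma ** 2 * d / n)
```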

8 Linear Regression in High Dimension. If rank(X) = d: w_LS is unique, but predictive performance is poor when d is close to n.

9 Linear Regression in High Dimension. If rank(X) = d: w_LS is unique, but predictive performance is poor when d is close to n. If rank(X) < d (e.g. d > n): w_LS is not unique.

10 Regularized Linear Regression. Given X ∈ R^{n×d}, y ∈ R^n and Ω : R^d → R: min_{w ∈ R^d} (1/2)||Xw − y||_2^2 s.t. Ω(w) ≤ t. Regularizer Ω(w): non-negative, convex (typically a norm), possibly non-differentiable.

11 Lasso Regression [Tibshirani, 1996]. Ridge: Ω(w) = ||w||_2^2; does not promote sparsity; closed-form solution. Lasso: Ω(w) = ||w||_1; encourages sparse solutions; solved as a convex optimization problem.
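
To see the contrast in sparsity, a small sketch using scikit-learn's Lasso and Ridge on synthetic data (the library, data sizes and alpha values are our choices, not the talk's):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic sparse regression problem (illustrative sizes).
rng = np.random.default_rng(0)
n, d = 100, 50
X = rng.standard_normal((n, d))
w_star = np.zeros(d)
w_star[:5] = 2.0                      # only 5 active features
y = X @ w_star + 0.5 * rng.standard_normal(n)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# Ridge shrinks but typically keeps all coefficients nonzero; Lasso zeroes most.
print("ridge nonzeros:", np.sum(np.abs(ridge.coef_) > 1e-8))
print("lasso nonzeros:", np.sum(np.abs(lasso.coef_) > 1e-8))
```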

12 Regularized Linear Regression. Given X ∈ R^{n×d}, y ∈ R^n, Ω : R^d → R: min_{w ∈ R^d} (1/2)||Xw − y||_2^2 + λ Ω(w). Unconstrained, and equivalent to the constrained version.

13 Lasso: Properties at a Glance. Computational: proximal methods such as IST, FISTA and Chambolle-Pock [Beck and Teboulle, 2009, Chambolle and Pock, 2011]; convergence rate O(1/T^2) in T iterations; assumes the proximal operator of Ω is available (easy for ℓ1).

14 Lasso: Properties at a Glance. Computational: proximal methods such as IST, FISTA and Chambolle-Pock [Beck and Teboulle, 2009, Chambolle and Pock, 2011]; convergence rate O(1/T^2) in T iterations; assumes the proximal operator of Ω is available (easy for ℓ1). Statistical properties [Wainwright, 2009]: support recovery (will it recover the true support?); sample complexity (how many samples are needed?); prediction error (what is the expected error in prediction?).

15 Lasso Model Recovery [Wainwright, 2009, Theorem 1]. Setup: y = Xw* + ε, ε_i ~ N(0, σ^2), X ∈ R^{n×d}. Support of w*: S = {j ∈ [d] : w*_j ≠ 0}. Lasso: min_{w ∈ R^d} (1/2)||Xw − y||_2^2 + λ||w||_1.

16 Lasso Model Recovery [Wainwright, 2009, Theorem 1]. Conditions: 1. Incoherence: ||X_{S^c}^T X_S (X_S^T X_S)^{-1}||_∞ ≤ 1 − γ, with γ ∈ (0, 1]. 2. Minimum eigenvalue: Λ_min((1/n) X_S^T X_S) ≥ C_min > 0. 3. λ > λ_0 = (2/γ) √(2σ^2 log d / n). (Here ||M||_∞ = max_i Σ_j |M_ij|.)

17 Lasso Model Recovery [Wainwright, 2009, Theorem 1]. Under the conditions above, w.h.p. the following holds: ||ŵ_S − w*_S||_∞ ≤ λ ( ||(X_S^T X_S / n)^{-1}||_∞ + 4σ/√C_min ).

18 Lasso Model Recovery: Special Cases. Case X_{S^c}^T X_S = 0: the incoherence condition holds trivially with γ = 1; the threshold λ_0 is smaller, so the recovery error is smaller. Case X_S^T X_S = I: C_min = 1/n, the largest possible for a given n; larger C_min means smaller recovery error.

19 Lasso Model Recovery: Special Cases. Case X_{S^c}^T X_S = 0: the incoherence condition holds trivially with γ = 1; the threshold λ_0 is smaller, so the recovery error is smaller. Case X_S^T X_S = I: C_min = 1/n, the largest possible for a given n; larger C_min means smaller recovery error. When does Lasso work well? Lasso prefers low correlation between support and non-support columns, and low correlation among columns within the support leads to better recovery.
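
Both quantities are easy to compute for a given design matrix; a minimal numpy sketch (the function name and interface are ours, for illustration only):

```python
import numpy as np

def incoherence_and_cmin(X, S):
    """Compute the incoherence slack and C_min for a support set S.

    Returns (gamma, c_min) with
      gamma = 1 - ||X_{S^c}^T X_S (X_S^T X_S)^{-1}||_inf  (max absolute row sum)
      c_min = smallest eigenvalue of X_S^T X_S / n
    """
    n, d = X.shape
    Sc = np.setdiff1d(np.arange(d), S)
    XS, XSc = X[:, S], X[:, Sc]
    M = XSc.T @ XS @ np.linalg.inv(XS.T @ XS)
    gamma = 1.0 - np.max(np.sum(np.abs(M), axis=1))
    c_min = np.linalg.eigvalsh(XS.T @ XS / n).min()
    return gamma, c_min
```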

20 Lasso Model Recovery: Implications. Setting: strongly correlated columns in X. Correlation between features i and j: ρ_ij = x_i^T x_j. Large correlation between X_S and X_{S^c}: γ is small. Large correlation within X_S: C_min is small.

21 Lasso Model Recovery: Implications. Setting: strongly correlated columns in X. Correlation between features i and j: ρ_ij = x_i^T x_j. Large correlation between X_S and X_{S^c}: γ is small. Large correlation within X_S: C_min is small. The r.h.s. of the bound is then large (a loose bound), so w.h.p. Lasso fails in model recovery. In other words: Lasso solutions differ with the solver used; the solution is typically not unique. The prediction error may not be as bad, though [Hebiri and Lederer, 2013].

22 Lasso Model Recovery: Implications. Setting: strongly correlated columns in X. Correlation between features i and j: ρ_ij = x_i^T x_j. Large correlation between X_S and X_{S^c}: γ is small. Large correlation within X_S: C_min is small. The r.h.s. of the bound is then large (a loose bound), so w.h.p. Lasso fails in model recovery. In other words: Lasso solutions differ with the solver used; the solution is typically not unique. The prediction error may not be as bad, though [Hebiri and Lederer, 2013]. Requirements: need consistent estimates independent of the solver, and preferably select all the correlated variables as a group.

23 Illustration: Lasso under Correlation [Zeng and Figueiredo, 2015]. Setting: strongly correlated features. {1, ..., d} = G_1 ∪ ... ∪ G_k with G_m ∩ G_l = ∅ for l ≠ m; ρ_ij = x_i^T x_j is very high (≈ 1) for pairs i, j ∈ G_m.

24 Illustration: Lasso under Correlation [Zeng and Figueiredo, 2015]. Setting: strongly correlated features. {1, ..., d} = G_1 ∪ ... ∪ G_k with G_m ∩ G_l = ∅ for l ≠ m; ρ_ij = x_i^T x_j is very high (≈ 1) for pairs i, j ∈ G_m. Toy example: d = 40, k = 4, G_1 = [1:10], G_2 = [11:20], G_3 = [21:30], G_4 = [31:40]. Lasso (Ω(w) = ||w||_1) gives sparse recovery but arbitrarily selects the variables within a group. Figures: original signal vs. recovered signal.
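
A sketch reproducing the flavour of this toy example (group construction, noise levels and the alpha value are our choices; two of the four groups carry signal for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Four groups of ten near-duplicate features (d = 40, k = 4).
rng = np.random.default_rng(0)
n, k, p = 100, 4, 10
base = rng.standard_normal((n, k))
X = np.repeat(base, p, axis=1) + 0.01 * rng.standard_normal((n, k * p))
X /= np.linalg.norm(X, axis=0)

# Signal on groups 2 and 4 (illustrative, not the talk's exact signal).
w_star = np.zeros(k * p)
w_star[10:20] = 2.0
w_star[30:40] = 2.0
y = X @ w_star + 0.1 * rng.standard_normal(n)

w_hat = Lasso(alpha=0.05, max_iter=10000).fit(X, y).coef_
# Within each active group, Lasso typically keeps only a few representatives.
for j in range(k):
    sel = np.sum(np.abs(w_hat[j * p:(j + 1) * p]) > 1e-8)
    print(f"group {j + 1}: {sel}/{p} features selected")
```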

25 Possible Solutions: 2-Stage Procedures. Cluster Group Lasso [Buhlmann et al., 2013]: first identify strongly correlated groups G = {G_1, ..., G_k} via canonical correlation; then do group selection with Ω(w) = Σ_{j=1}^k α_j ||w_{G_j}||_2, which selects all or none of the variables in each group.

26 Possible Solutions: 2-Stage Procedures. Cluster Group Lasso [Buhlmann et al., 2013]: first identify strongly correlated groups G = {G_1, ..., G_k} via canonical correlation; then do group selection with Ω(w) = Σ_{j=1}^k α_j ||w_{G_j}||_2, which selects all or none of the variables in each group. Goal: learn G and w simultaneously?

27 Ordered Weighted ℓ1 (OWL) Norms. OSCAR [Bondell and Reich, 2008]: Ω_O(w) = Σ_{i=1}^d c_i |w|_(i), with c_i = c_0 + (d − i)μ and c_0, μ > 0. (Notation: |w|_(i) is the i-th largest entry of |w|.)

28 Ordered Weighted ℓ1 (OWL) Norms. OSCAR [Bondell and Reich, 2008]: Ω_O(w) = Σ_{i=1}^d c_i |w|_(i), with c_i = c_0 + (d − i)μ and c_0, μ > 0. OWL [Figueiredo and Nowak, 2016]: general weights c_1 ≥ ... ≥ c_d ≥ 0. (Notation: |w|_(i) is the i-th largest entry of |w|.)

29 Ordered Weighted ℓ1 (OWL) Norms. OSCAR [Bondell and Reich, 2008]: Ω_O(w) = Σ_{i=1}^d c_i |w|_(i), with c_i = c_0 + (d − i)μ and c_0, μ > 0. OWL [Figueiredo and Nowak, 2016]: general weights c_1 ≥ ... ≥ c_d ≥ 0. (Notation: |w|_(i) is the i-th largest entry of |w|.) Figure: recovered signal with OWL.
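
A small numpy sketch of the OWL norm and the OSCAR weight sequence as defined above (the values are illustrative):

```python
import numpy as np

def owl_norm(w, c):
    """OWL norm Omega_O(w) = sum_i c_i |w|_(i), with c_1 >= ... >= c_d >= 0."""
    return np.sort(np.abs(w))[::-1] @ c

def oscar_weights(d, c0, mu):
    """OSCAR weights c_i = c0 + (d - i) * mu, i = 1..d (a special case of OWL)."""
    return c0 + (d - np.arange(1, d + 1)) * mu

w = np.array([0.5, -3.0, 1.0, 0.0])
c = oscar_weights(len(w), c0=1.0, mu=0.5)
print(owl_norm(w, c))           # weights the largest |w_i| most heavily
print(owl_norm(w, np.ones(4)))  # with c = 1_d this is just ||w||_1
```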

30 OSCAR: Sparsity Illustrated. Figure: examples of solutions [Bondell and Reich, 2008]. Solutions are encouraged towards vertices of the norm ball, which encourages blockwise constant solutions. See [Bondell and Reich, 2008].

31 OWL-Properties. Grouping covariates [Bondell and Reich, 2008, Theorem 1], [Figueiredo and Nowak, 2016, Theorem 1]: ŵ_i = ŵ_j if λ ≥ λ^0_ij.

32 OWL-Properties. Grouping covariates [Bondell and Reich, 2008, Theorem 1], [Figueiredo and Nowak, 2016, Theorem 1]: ŵ_i = ŵ_j if λ ≥ λ^0_ij, with λ^0_ij ∝ √(1 − ρ_ij^2). Strongly correlated pairs are grouped early in the regularization path. Groups: G_j = {i : ŵ_i = α_j}.

33 OWL-Issues. Figures: true model vs. recovered signal with OWL. Bias towards piecewise constant ŵ; easily understood through the norm balls; requires more samples for consistent estimation.

34 OWL-Issues. Figures: true model vs. recovered signal with OWL. Bias towards piecewise constant ŵ; easily understood through the norm balls; requires more samples for consistent estimation. Lack of interpretation for choosing c.

35 Ordered Weighted L1(OWL) norm and Submodular penalties

36 Preliminaries: Penalties on the Support. Goal: encourage w to have a desired support structure, where supp(w) = {i : w_i ≠ 0}.

37 Preliminaries: Penalties on the Support. Goal: encourage w to have a desired support structure, where supp(w) = {i : w_i ≠ 0}. Idea: penalize the support [Obozinski and Bach, 2012]: pen(w) = F(supp(w)) + ||w||_p^p, p ∈ [1, ∞].

38 Preliminaries: Penalties on the Support. Goal: encourage w to have a desired support structure, where supp(w) = {i : w_i ≠ 0}. Idea: penalize the support [Obozinski and Bach, 2012]: pen(w) = F(supp(w)) + ||w||_p^p, p ∈ [1, ∞]. Relaxation of pen(w): Ω^F_p(w), the tightest positively homogeneous convex lower bound.

39 Preliminaries: Penalties on the Support. Goal: encourage w to have a desired support structure, where supp(w) = {i : w_i ≠ 0}. Idea: penalize the support [Obozinski and Bach, 2012]: pen(w) = F(supp(w)) + ||w||_p^p, p ∈ [1, ∞]. Relaxation of pen(w): Ω^F_p(w), the tightest positively homogeneous convex lower bound. Example: F(supp(w)) = |supp(w)|.

40 Preliminaries: Penalties on the Support. Goal: encourage w to have a desired support structure, where supp(w) = {i : w_i ≠ 0}. Idea: penalize the support [Obozinski and Bach, 2012]: pen(w) = F(supp(w)) + ||w||_p^p, p ∈ [1, ∞]. Relaxation of pen(w): Ω^F_p(w), the tightest positively homogeneous convex lower bound. Example: F(supp(w)) = |supp(w)| gives Ω^F_p(w) = ||w||_1. Familiar! Message: the cardinality function always relaxes to the ℓ1 norm.

41 Nondecreasing Submodular Penalties on Cardinality. Assumptions: write F ∈ 𝓕 if F, defined on subsets A ⊆ {1, ..., d}, is: 1. Submodular [Bach, 2011]: for A ⊆ B and k ∉ B, F(A ∪ {k}) − F(A) ≥ F(B ∪ {k}) − F(B); its Lovász extension f : R^d → R is the convex extension of F to R^d. 2. Cardinality-based: F(A) = g(|A|) (invariant to permutations). 3. Nondecreasing: g(0) = 0, g(x) ≥ g(x − 1). Implication: F ∈ 𝓕 is completely specified through g. Example: let V = {1, ..., d} and F(A) = |A| · |V \ A|; then f(w) = Σ_{i<j} |w_i − w_j|.
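
For cardinality-based F the Lovász extension can be evaluated by sorting; the sketch below (our illustration, not code from the talk) checks the stated example F(A) = |A|·|V \ A| against f(w) = Σ_{i<j} |w_i − w_j|:

```python
import numpy as np
from itertools import combinations

def lovasz_cardinality(w, g):
    """Lovász extension of a cardinality-based set function F(A) = g(|A|).

    With w sorted in decreasing order, f(w) = sum_k w_(k) * (g(k) - g(k-1)).
    """
    ws = np.sort(w)[::-1]
    return sum(ws[k] * (g(k + 1) - g(k)) for k in range(len(w)))

d = 5
g = lambda a: a * (d - a)           # F(A) = |A| * |V \ A|
w = np.random.default_rng(0).standard_normal(d)

lhs = lovasz_cardinality(w, g)
rhs = sum(abs(w[i] - w[j]) for i, j in combinations(range(d), 2))
print(lhs, rhs)                     # the two values agree
```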

42 Ω^F_p and the Lovász Extension. Result for p = ∞ [Bach, 2010]: Ω^F_∞(w) = f(|w|), i.e. the ℓ∞ relaxation coincides with the Lovász extension on the positive orthant. To work with Ω^F_∞ one may use existing results on submodular function minimization. Ω^F_p is not known in closed form for p < ∞.

43 Equivalence of OWL and Lovász Extensions: Statement. Proposition [Sankaran et al., 2017]: for F ∈ 𝓕, Ω^F_∞(w) ≡ Ω_O(w): 1. Given F(A) = g(|A|), Ω^F_∞(w) = Ω_O(w) with c_i = g(i) − g(i − 1). 2. Given c_1 ≥ ... ≥ c_d ≥ 0, Ω_O(w) = Ω^F_∞(w) with g(i) = c_1 + ... + c_i. Interpretations: gives alternate interpretations of OWL.

44 Equivalence of OWL and Lovász Extensions: Statement. Proposition [Sankaran et al., 2017]: for F ∈ 𝓕, Ω^F_∞(w) ≡ Ω_O(w): 1. Given F(A) = g(|A|), Ω^F_∞(w) = Ω_O(w) with c_i = g(i) − g(i − 1). 2. Given c_1 ≥ ... ≥ c_d ≥ 0, Ω_O(w) = Ω^F_∞(w) with g(i) = c_1 + ... + c_i. Interpretations: gives alternate interpretations of OWL. Ω^F_∞ has undesired extreme points [Bach, 2011], which explains the piecewise constant solutions of OWL.

45 Equivalence of OWL and Lovász Extensions: Statement. Proposition [Sankaran et al., 2017]: for F ∈ 𝓕, Ω^F_∞(w) ≡ Ω_O(w): 1. Given F(A) = g(|A|), Ω^F_∞(w) = Ω_O(w) with c_i = g(i) − g(i − 1). 2. Given c_1 ≥ ... ≥ c_d ≥ 0, Ω_O(w) = Ω^F_∞(w) with g(i) = c_1 + ... + c_i. Interpretations: gives alternate interpretations of OWL. Ω^F_∞ has undesired extreme points [Bach, 2011], which explains the piecewise constant solutions of OWL. Motivates Ω^F_p(w) for p < ∞.

46 Ω S : Smoothed OWL (SOWL)

47 SOWL: Definition. Smoothed OWL: Ω_S(w) := Ω^F_2(w).

48 SOWL: Definition. Smoothed OWL: Ω_S(w) := Ω^F_2(w). Variational form for Ω^F_2 [Obozinski and Bach, 2012]: Ω_S(w) = min_{η ∈ R^d_+} (1/2) ( Σ_{i=1}^d w_i^2 / η_i + f(η) ).

49 SOWL: Definition. Smoothed OWL: Ω_S(w) := Ω^F_2(w). Variational form for Ω^F_2 [Obozinski and Bach, 2012]: Ω_S(w) = min_{η ∈ R^d_+} (1/2) ( Σ_{i=1}^d w_i^2 / η_i + f(η) ). Using the OWL equivalence, f(η) = Ω^F_∞(η) = Σ_{i=1}^d c_i η_(i) for η ≥ 0, so Ω_S(w) = min_{η ∈ R^d_+} (1/2) Σ_{i=1}^d ( w_i^2 / η_i + c_i η_(i) ) =: min_{η ∈ R^d_+} Ψ(w, η). (SOWL)

50 OWL vs SOWL. Case c = 1_d: Ω_S(w) = ||w||_1 and Ω_O(w) = ||w||_1. Case c = [1, 0, ..., 0]: Ω_S(w) = ||w||_2 and Ω_O(w) = ||w||_∞.

51 OWL vs SOWL. Case c = 1_d: Ω_S(w) = ||w||_1 and Ω_O(w) = ||w||_1. Case c = [1, 0, ..., 0]: Ω_S(w) = ||w||_2 and Ω_O(w) = ||w||_∞. Figure: norm balls of OWL and SOWL for different values of c.

52 Group Lasso and Ω_S. SOWL objective after eliminating η: Ω_S(w) = Σ_{j=1}^k ||w_{G_j}||_2 √(Σ_{i ∈ G_j} c_i). Let η_w denote the optimal η given w.

53 Group Lasso and Ω_S. SOWL objective after eliminating η: Ω_S(w) = Σ_{j=1}^k ||w_{G_j}||_2 √(Σ_{i ∈ G_j} c_i). Let η_w denote the optimal η given w. Key differences: the groups are defined through η_w = [δ_1, ..., δ_1 (on G_1), ..., δ_k, ..., δ_k (on G_k)] and are influenced by the choice of c.
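
A minimal sketch evaluating this eliminated-η expression when a partition is supplied explicitly. Note the assumptions: in SOWL the groups and the assignment of the weights c_i come from the optimal η, whereas here both the partition and the group-by-group consumption of c are passed in by hand for illustration:

```python
import numpy as np

def sowl_norm_given_groups(w, c, groups):
    """Evaluate Omega_S(w) = sum_j ||w_{G_j}||_2 * sqrt(sum_{i in G_j} c_i)
    for an explicit partition `groups` (list of index lists).

    Assumption: the weights c are consumed group by group, in order; in SOWL
    the grouping and the weight assignment come from the optimal eta instead.
    """
    val, used = 0.0, 0
    for G in groups:
        cG = c[used:used + len(G)]
        used += len(G)
        val += np.linalg.norm(w[G]) * np.sqrt(cG.sum())
    return val

w = np.array([3.0, -3.0, 0.5])

# Sanity checks against the special cases on the earlier slide:
print(sowl_norm_given_groups(w, np.ones(3), [[0], [1], [2]]))               # = ||w||_1
print(sowl_norm_given_groups(w, np.array([1.0, 0.0, 0.0]), [[0, 1, 2]]))    # = ||w||_2
```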

54 Open Questions: 1. Does Ω_S promote grouping of correlated variables like Ω_O? Are there any benefits over Ω_O?

55 Open Questions: 1. Does Ω_S promote grouping of correlated variables like Ω_O? Are there any benefits over Ω_O? 2. Is using Ω_S computationally feasible?

56 Open Questions: 1. Does Ω_S promote grouping of correlated variables like Ω_O? Are there any benefits over Ω_O? 2. Is using Ω_S computationally feasible? 3. Theoretical properties of Ω_S vs Group Lasso?

57 Grouping Property of Ω_S: Statement. Learning problem (LS-SOWL): min_{w ∈ R^d, η ∈ R^d_+} (1/2n)||Xw − y||_2^2 + (λ/2) Σ_{i=1}^d ( w_i^2 / η_i + c_i η_(i) ) =: min Γ^(λ)(w, η).

58 Grouping Property of Ω_S: Statement. Learning problem (LS-SOWL): min_{w ∈ R^d, η ∈ R^d_+} (1/2n)||Xw − y||_2^2 + (λ/2) Σ_{i=1}^d ( w_i^2 / η_i + c_i η_(i) ) =: min Γ^(λ)(w, η). Theorem [Sankaran et al., 2017]. Define: (ŵ^(λ), η̂^(λ)) = argmin_{w,η} Γ^(λ)(w, η); ρ_ij = x_i^T x_j; c̄ = min_i (c_i − c_{i+1}).

59 Grouping Property of Ω_S: Statement. Learning problem (LS-SOWL): min_{w ∈ R^d, η ∈ R^d_+} (1/2n)||Xw − y||_2^2 + (λ/2) Σ_{i=1}^d ( w_i^2 / η_i + c_i η_(i) ) =: min Γ^(λ)(w, η). Theorem [Sankaran et al., 2017]. Define: (ŵ^(λ), η̂^(λ)) = argmin_{w,η} Γ^(λ)(w, η); ρ_ij = x_i^T x_j; c̄ = min_i (c_i − c_{i+1}). There exists λ_0 with 0 ≤ λ_0 ≤ (||y||_2 / c̄) (4 − 4ρ_ij^2)^{1/4} such that for all λ > λ_0, η̂^(λ)_i = η̂^(λ)_j.

60 Grouping Property of Ω_S: Interpretation. 1. Variables η_i, η_j are grouped if ρ_ij ≈ 1 (even for small λ). Similar to Figueiredo and Nowak [2016, Theorem 1], which applies to the absolute values of w.

61 Grouping Property of Ω_S: Interpretation. 1. Variables η_i, η_j are grouped if ρ_ij ≈ 1 (even for small λ). Similar to Figueiredo and Nowak [2016, Theorem 1], which applies to the absolute values of w. 2. SOWL differentiates the grouping variable η from the model variable w.

62 Grouping Property of Ω_S: Interpretation. 1. Variables η_i, η_j are grouped if ρ_ij ≈ 1 (even for small λ). Similar to Figueiredo and Nowak [2016, Theorem 1], which applies to the absolute values of w. 2. SOWL differentiates the grouping variable η from the model variable w. 3. Allows model variance within a group: ŵ^(λ)_i ≠ ŵ^(λ)_j is possible as long as c has distinct values.

63 Illustration: Group Discovery using SOWL. Aim: illustrate group discovery with SOWL. Consider z ∈ R^d, compute prox_{λΩ_S}(z), and study the regularization path.

64 Illustration: Group Discovery using SOWL. Aim: illustrate group discovery with SOWL. Consider z ∈ R^d, compute prox_{λΩ_S}(z), and study the regularization path. Figure: original signal.

65 Illustration: Group Discovery using SOWL. Aim: illustrate group discovery with SOWL. Consider z ∈ R^d, compute prox_{λΩ_S}(z), and study the regularization path. Figure: original signal. Figure: regularization paths (x-axis: λ, y-axis: ŵ) for Lasso (w), OWL (w), SOWL (w) and SOWL (η); early group discovery, with model variation within groups.

66 Proximal Methods: A Brief Overview. Proximal operator: prox_Ω(z) = argmin_w (1/2)||w − z||_2^2 + Ω(w). Easy to evaluate for many simple norms, e.g. prox_{λℓ1}(z) = sign(z)(|z| − λ)_+. A generalization of projected gradient descent.
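
A one-line numpy sketch of the ℓ1 proximal operator quoted above (values are illustrative):

```python
import numpy as np

def prox_l1(z, lam):
    """Soft-thresholding: prox_{lam * ||.||_1}(z) = sign(z) * (|z| - lam)_+."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Large entries shrink by lam; entries below lam in magnitude go to zero.
print(prox_l1(np.array([3.0, -0.2, 1.0]), 0.5))
```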

67 FISTA [Beck and Teboulle, 2009]. Initialization: t^(1) = 1, w^(1) = x^(1) = 0. Steps, for k > 1: w^(k) = prox_Ω( x^(k-1) − (1/L) ∇f(x^(k-1)) ); t^(k) = (1 + √(1 + 4 (t^(k-1))^2)) / 2; x^(k) = w^(k) + ((t^(k-1) − 1)/t^(k)) (w^(k) − w^(k-1)). Guarantee: convergence rate O(1/T^2), with no assumptions beyond those of IST; known to be optimal for this class of minimization problems.
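
A minimal FISTA sketch for the lasso objective (1/2)||Xw − y||_2^2 + λ||w||_1, using the squared spectral norm of X as the Lipschitz constant L; this is an illustration of the scheme above, not the talk's implementation:

```python
import numpy as np

def prox_l1(z, lam):
    """Soft-thresholding operator for the l1 norm."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def fista_lasso(X, y, lam, T=500):
    """FISTA for (1/2)||Xw - y||_2^2 + lam * ||w||_1 (minimal sketch)."""
    d = X.shape[1]
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the smooth gradient
    w = x = np.zeros(d)
    t = 1.0
    for _ in range(T):
        grad = X.T @ (X @ x - y)           # gradient of the smooth part at x
        w_next = prox_l1(x - grad / L, lam / L)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        x = w_next + ((t - 1.0) / t_next) * (w_next - w)
        w, t = w_next, t_next
    return w
```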

68 Computing prox_{Ω_S}. Problem: prox_{λΩ}(z) = argmin_w (1/2)||w − z||_2^2 + λΩ(w). Write w^(λ) = prox_{λΩ_S}(z) and η^(λ)_w = argmin_η Ψ(w^(λ), η).

69 Computing prox_{Ω_S}. Problem: prox_{λΩ}(z) = argmin_w (1/2)||w − z||_2^2 + λΩ(w). Write w^(λ) = prox_{λΩ_S}(z) and η^(λ)_w = argmin_η Ψ(w^(λ), η). Key idea: the ordering of η^(λ)_w remains the same for all λ. Figure: η as a function of λ (linear and log scale).

70 Computing prox_{Ω_S}. Problem: prox_{λΩ}(z) = argmin_w (1/2)||w − z||_2^2 + λΩ(w). Write w^(λ) = prox_{λΩ_S}(z) and η^(λ)_w = argmin_η Ψ(w^(λ), η). Key idea: the ordering of η^(λ)_w remains the same for all λ. Figure: η as a function of λ (linear and log scale). Given η, w^(λ)_i = η_i z_i / (η_i + λ). Same complexity as computing the norm Ω_S (O(d log d)); true for all cardinality-based F.

71 Random Design. Problem setting: LS-SOWL. True model: y = Xw* + ε, with rows (X)_i ~ N(μ, Σ) and ε_i ~ N(0, σ^2). Notation: J = {i : w*_i ≠ 0}; η_{w*} = [δ_1, ..., δ_1 (on G_1), ..., δ_k, ..., δ_k (on G_k)]. (D: diagonal matrix defined using the groups G_1, ..., G_k and w*.)

72 Random Design. Problem setting: LS-SOWL. True model: y = Xw* + ε, with rows (X)_i ~ N(μ, Σ) and ε_i ~ N(0, σ^2). Notation: J = {i : w*_i ≠ 0}; η_{w*} = [δ_1, ..., δ_1 (on G_1), ..., δ_k, ..., δ_k (on G_k)]. Assumptions: Σ_{J,J} is invertible, λ → 0, and λ√n → ∞. (D: diagonal matrix defined using the groups G_1, ..., G_k and w*.)

73 Random Design. Problem setting: LS-SOWL. True model: y = Xw* + ε, with rows (X)_i ~ N(μ, Σ) and ε_i ~ N(0, σ^2). Notation: J = {i : w*_i ≠ 0}; η_{w*} = [δ_1, ..., δ_1 (on G_1), ..., δ_k, ..., δ_k (on G_k)]. Assumptions: Σ_{J,J} is invertible, λ → 0, and λ√n → ∞. Irrepresentability conditions: 1. δ_k = 0 if G_k ⊆ J^c. 2. ||Σ_{J^c,J} (Σ_{J,J})^{-1} D w*_J||_2 ≤ β < 1. (D: diagonal matrix defined using the groups G_1, ..., G_k and w*.)

74 Random Design. Problem setting: LS-SOWL. True model: y = Xw* + ε, with rows (X)_i ~ N(μ, Σ) and ε_i ~ N(0, σ^2). Notation: J = {i : w*_i ≠ 0}; η_{w*} = [δ_1, ..., δ_1 (on G_1), ..., δ_k, ..., δ_k (on G_k)]. Assumptions: Σ_{J,J} is invertible, λ → 0, and λ√n → ∞. Irrepresentability conditions: 1. δ_k = 0 if G_k ⊆ J^c. 2. ||Σ_{J^c,J} (Σ_{J,J})^{-1} D w*_J||_2 ≤ β < 1. Result [Sankaran et al., 2017]: 1. ŵ →_p w*. 2. P(Ĵ = J) → 1. (D: diagonal matrix defined using the groups G_1, ..., G_k and w*.)

75 Random Design. Problem setting: LS-SOWL. True model: y = Xw* + ε, with rows (X)_i ~ N(μ, Σ) and ε_i ~ N(0, σ^2). Notation: J = {i : w*_i ≠ 0}; η_{w*} = [δ_1, ..., δ_1 (on G_1), ..., δ_k, ..., δ_k (on G_k)]. Assumptions: Σ_{J,J} is invertible, λ → 0, and λ√n → ∞. Irrepresentability conditions: 1. δ_k = 0 if G_k ⊆ J^c. 2. ||Σ_{J^c,J} (Σ_{J,J})^{-1} D w*_J||_2 ≤ β < 1. Result [Sankaran et al., 2017]: 1. ŵ →_p w*. 2. P(Ĵ = J) → 1. Similar to Group Lasso [Bach, 2008, Theorem 2], but learns the weights without explicit group information. (D: diagonal matrix defined using the groups G_1, ..., G_k and w*.)

76 Quantitative Simulation: Predictive Accuracy. Aim: learn ŵ using LS-SOWL and evaluate the prediction error. (The experiments follow the setup of Bondell and Reich [2008].)

77 Quantitative Simulation: Predictive Accuracy. Aim: learn ŵ using LS-SOWL and evaluate the prediction error. Generate samples: x ~ N(0, Σ), ε ~ N(0, σ^2), y = x^T w* + ε. (The experiments follow the setup of Bondell and Reich [2008].)

78 Quantitative Simulation: Predictive Accuracy. Aim: learn ŵ using LS-SOWL and evaluate the prediction error. Generate samples: x ~ N(0, Σ), ε ~ N(0, σ^2), y = x^T w* + ε. Metric: E[(x^T (w* − ŵ))^2] = (w* − ŵ)^T Σ (w* − ŵ). (The experiments follow the setup of Bondell and Reich [2008].)

79 Quantitative Simulation: Predictive Accuracy. Aim: learn ŵ using LS-SOWL and evaluate the prediction error. Generate samples: x ~ N(0, Σ), ε ~ N(0, σ^2), y = x^T w* + ε. Metric: E[(x^T (w* − ŵ))^2] = (w* − ŵ)^T Σ (w* − ŵ). Data: w* = [0_10, 2·1_10, 0_10, 2·1_10]; n = 100, σ = 15; Σ_ij = 0.5 if i ≠ j and 1 if i = j. (The experiments follow the setup of Bondell and Reich [2008].)

80 Quantitative Simulation: Predictive Accuracy. Aim: learn ŵ using LS-SOWL and evaluate the prediction error. Generate samples: x ~ N(0, Σ), ε ~ N(0, σ^2), y = x^T w* + ε. Metric: E[(x^T (w* − ŵ))^2] = (w* − ŵ)^T Σ (w* − ŵ). Data: w* = [0_10, 2·1_10, 0_10, 2·1_10]; n = 100, σ = 15; Σ_ij = 0.5 if i ≠ j and 1 if i = j. Models with group variance: measure E[(x^T (w̃ − ŵ))^2], where w̃ = w* + ε̃, ε̃ ~ U[−τ, τ], τ = 0, 0.2, 0.4. (The experiments follow the setup of Bondell and Reich [2008].)
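
A sketch of this simulation setup under the stated parameters (the LS-SOWL estimator itself is not reproduced; any estimator fitted on (X, y) can be plugged in and scored with the metric):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma, tau = 40, 100, 15.0, 0.2

# Covariance: Sigma_ij = 0.5 for i != j, 1 on the diagonal.
Sigma = np.full((d, d), 0.5) + 0.5 * np.eye(d)

# True model w* = [0_10, 2*1_10, 0_10, 2*1_10], perturbed within groups by U[-tau, tau].
w_star = np.concatenate([np.zeros(10), 2 * np.ones(10), np.zeros(10), 2 * np.ones(10)])
w_tilde = w_star + rng.uniform(-tau, tau, size=d)

X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
y = X @ w_tilde + sigma * rng.standard_normal(n)

def prediction_mse(w_hat, w_ref, Sigma):
    """E[(x^T (w_ref - w_hat))^2] = (w_ref - w_hat)^T Sigma (w_ref - w_hat)."""
    delta = w_ref - w_hat
    return delta @ Sigma @ delta

# Example scoring: prediction_mse(w_hat, w_tilde, Sigma) for any fitted w_hat.
```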

81 Predictive Accuracy Results. Columns: Med. MSE, MSE (10th perc.), MSE (90th perc.); each cell lists numbers for τ = 0 / 0.2 / 0.4.
LASSO: 46.1 / 45.2 / / 32.7 / / 61.5 / 61.4
OWL: 27.6 / 27.0 / / 19.2 / / 40.4 / 39.2
El. Net: 30.8 / 30.7 / / 22.6 / / 43.0 / 41.4
Ω_S: 23.9 / 23.3 / / 16.8 / / 35.4 / 33.2

82 Summary. 1. Proposed a new family of norms, Ω_S. 2. Properties: equivalent to OWL in group identification; efficient computational tools; equivalences to Group Lasso. 3. Illustrated the performance through simulations.

83 Questions?

84 Thank you!!!

85 References I
F. Bach. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9, 2008.
F. Bach. Structured sparsity-inducing norms through submodular functions. In Neural Information Processing Systems, 2010.
F. Bach. Learning with submodular functions: A convex optimization perspective. Foundations and Trends in Machine Learning, 6(2-3):145-373, 2013.
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.
H. D. Bondell and B. J. Reich. Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics, 64:115-123, 2008.

86 References II
P. Buhlmann, P. Rütimann, S. van de Geer, and C.-H. Zhang. Correlated variables in regression: Clustering and sparse estimation. Journal of Statistical Planning and Inference, 143, 2013.
A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120-145, 2011.
M. A. T. Figueiredo and R. D. Nowak. Ordered weighted l1 regularized regression with strongly correlated covariates: Theoretical aspects. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.
M. Hebiri and J. Lederer. How correlations influence lasso prediction. IEEE Transactions on Information Theory, 59(3), 2013.
G. Obozinski and F. Bach. Convex relaxation for combinatorial penalties. Technical report, HAL, 2012.

87 References III
R. Sankaran, F. Bach, and C. Bhattacharyya. Identifying groups of strongly correlated variables through smoothed ordered weighted l1-norms. In Artificial Intelligence and Statistics (AISTATS), 2017.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267-288, 1996.
M. J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using l1-constrained quadratic programming (lasso). IEEE Transactions on Information Theory, 55(5):2183-2202, 2009.
X. Zeng and M. Figueiredo. The ordered weighted l1 norm: Atomic formulation, projections, and algorithms. arXiv preprint, v5, 2015.
