OWL to the rescue of LASSO
1 OWL to the rescue of LASSO. IISc-IBM Day 2018. Joint work with R. Sankaran and Francis Bach (AISTATS 2017). Chiranjib Bhattacharyya, Professor, Department of Computer Science and Automation, Indian Institute of Science, Bangalore. March 7, 2018.
2 Identifying groups of strongly correlated variables through Smoothed Ordered Weighted $\ell_1$-norms.
3 Outline: (1) LASSO, the method of choice for feature selection; (2) Ordered Weighted $\ell_1$ (OWL) norms and submodular penalties; (3) $\Omega_S$: Smoothed OWL (SOWL).
4 LASSO: The method of choice for feature selection
5 Linear Regression in High Dimension. Question: let $x \in \mathbb{R}^d$, $y \in \mathbb{R}$, and $y = w^{*\top} x + \epsilon$ with $\epsilon \sim N(0, \sigma^2)$. From data $D = \{(x_i, y_i) : i \in [n],\ x_i \in \mathbb{R}^d,\ y_i \in \mathbb{R}\}$, can we find $w^*$? Stack the data as $X = [x_1^\top; x_2^\top; \dots; x_n^\top] \in \mathbb{R}^{n \times d}$ and $Y \in \mathbb{R}^n$.
6 Linear Regression in High Dimension. Least-squares linear regression: given $X \in \mathbb{R}^{n \times d}$ and $Y \in \mathbb{R}^n$, $w_{LS} = \arg\min_{w \in \mathbb{R}^d} \tfrac{1}{2}\|Xw - Y\|_2^2$. Assumptions: labels centered, $\sum_{j=1}^n y_j = 0$; features normalized, $\|x_i\|_2 = 1$ and $x_i^\top \mathbf{1} = 0$ for each feature $x_i$.
7 Linear Regression in High Dimension. Least-squares solution: $w_{LS} = (X^\top X)^{-1} X^\top Y$, with $\mathbb{E}(w_{LS}) = w^*$. Variance of the predictive error: $\tfrac{1}{n}\mathbb{E}\|X(w_{LS} - w^*)\|_2^2 = \sigma^2 \tfrac{d}{n}$.
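A minimal numerical check of this $\sigma^2 d/n$ rate with a random Gaussian design (not from the slides; the dimensions, noise level, and number of repetitions are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 500, 20, 1.0
w_star = rng.normal(size=d)

errs = []
for _ in range(200):                                  # Monte Carlo repetitions
    X = rng.normal(size=(n, d))
    y = X @ w_star + sigma * rng.normal(size=n)
    w_ls = np.linalg.lstsq(X, y, rcond=None)[0]       # least-squares estimate
    errs.append(np.sum((X @ (w_ls - w_star)) ** 2) / n)

print(np.mean(errs))        # empirical (1/n) E ||X (w_LS - w*)||^2
print(sigma**2 * d / n)     # predicted sigma^2 d / n = 0.04
```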
8 Linear Regression in High Dimension. If $\mathrm{rank}(X) = d$: $w_{LS}$ is unique, but predictive performance is poor when $d$ is close to $n$.
9 Linear Regression in High Dimension. If $\mathrm{rank}(X) = d$: $w_{LS}$ is unique, but predictive performance is poor when $d$ is close to $n$. If $\mathrm{rank}(X) < d$ (in particular when $d > n$): $w_{LS}$ is not unique.
10 Regularized Linear Regression. Given $X \in \mathbb{R}^{n \times d}$, $y \in \mathbb{R}^n$ and $\Omega : \mathbb{R}^d \to \mathbb{R}$: $\min_{w \in \mathbb{R}^d} \tfrac{1}{2}\|Xw - y\|_2^2$ subject to $\Omega(w) \le t$. Regularizer $\Omega(w)$: a non-negative convex function, typically a norm, possibly non-differentiable.
11 Lasso regression [Tibshirani, 1996]. Ridge: $\Omega(w) = \|w\|_2^2$; does not promote sparsity; closed-form solution. Lasso: $\Omega(w) = \|w\|_1$; encourages sparse solutions; requires solving a convex optimization problem.
12 Regularized Linear Regression. Unconstrained (penalized) form: $\min_{w \in \mathbb{R}^d} \tfrac{1}{2}\|Xw - y\|_2^2 + \lambda\,\Omega(w)$, equivalent to the constrained version.
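In practice the penalized form is what standard solvers implement. A minimal sketch with scikit-learn (the regularization values below are arbitrary; scikit-learn's Lasso scales the squared loss by $1/(2n)$):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
n, d = 100, 200                          # high-dimensional setting: d > n
X = rng.normal(size=(n, d))
w_star = np.zeros(d); w_star[:5] = 3.0
y = X @ w_star + 0.5 * rng.normal(size=n)

# Lasso minimizes (1/(2n)) ||y - Xw||^2 + alpha * ||w||_1
lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print(np.count_nonzero(lasso.coef_))     # sparse: only a few nonzero coefficients
print(np.count_nonzero(ridge.coef_))     # dense: essentially all d coefficients nonzero
```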
13 Lasso: Properties at a glance. Computational: proximal methods (IST, FISTA, Chambolle-Pock) [Chambolle and Pock, 2011, Beck and Teboulle, 2009]; convergence rate $O(1/T^2)$ in $T$ iterations; assumption: availability of the proximal operator of $\Omega$ (easy for $\ell_1$).
14 Lasso: Properties at a glance. Computational properties as above. Statistical properties [Wainwright, 2009]: support recovery (will it recover the true support?); sample complexity (how many samples are needed?); prediction error (what is the expected error in prediction?).
15 Lasso Model Recovery [Wainwright, 2009, Theorem 1]. Setup: $y = Xw^* + \epsilon$, $\epsilon_i \sim N(0, \sigma^2)$, $X \in \mathbb{R}^{n \times d}$. Support of $w^*$: $S = \{j : w^*_j \neq 0,\ j \in [d]\}$. Lasso: $\min_{w \in \mathbb{R}^d} \tfrac{1}{2}\|Xw - y\|_2^2 + \lambda\|w\|_1$.
16 Lasso Model Recovery [Wainwright, 2009, Theorem 1]. Conditions: 1. Incoherence: $\|X_{S^c}^\top X_S (X_S^\top X_S)^{-1}\|_\infty \le 1 - \gamma$ with $\gamma \in (0, 1]$. 2. Minimum eigenvalue: $\Lambda_{\min}\big(\tfrac{1}{n} X_S^\top X_S\big) \ge C_{\min}$. 3. Regularization: $\lambda > \lambda_0 \triangleq \tfrac{2}{\gamma}\sqrt{\tfrac{2\sigma^2 \log d}{n}}$. (Here $\|M\|_\infty = \max_i \sum_j |M_{ij}|$.)
17 Lasso Model Recovery [Wainwright, 2009, Theorem 1]. Under the conditions above, w.h.p. the following holds: $\|\hat w_S - w^*_S\|_\infty \le \lambda\Big(\big\|\big(\tfrac{1}{n}X_S^\top X_S\big)^{-1}\big\|_\infty + 4\sigma/\sqrt{C_{\min}}\Big)$.
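Both quantities appearing in the conditions are easy to compute for a concrete design. A small sketch with an arbitrary Gaussian design and an assumed support (not from the talk):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 50
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=0)              # normalize the columns
S = np.arange(5)                            # assumed true support
Sc = np.setdiff1d(np.arange(d), S)

XS, XSc = X[:, S], X[:, Sc]
A = XSc.T @ XS @ np.linalg.inv(XS.T @ XS)   # X_{S^c}^T X_S (X_S^T X_S)^{-1}
inf_norm = np.abs(A).sum(axis=1).max()      # ||.||_inf = maximum absolute row sum
gamma = 1.0 - inf_norm                      # incoherence constant
C_min = np.linalg.eigvalsh(XS.T @ XS / n).min()

print(f"gamma = {gamma:.3f}, C_min = {C_min:.4f}")
```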
18 Lasso Model Recovery: Special cases. Case $X_{S^c}^\top X_S = 0$: the incoherence condition trivially holds with $\gamma = 1$; the threshold $\lambda_0$ is smaller, and the recovery error is smaller. Case $X_S^\top X_S = I$: $C_{\min} = 1/n$, the largest possible for a given $n$; a larger $C_{\min}$ means a smaller recovery error.
19 Lasso Model Recovery: Special cases. When does Lasso work well? Lasso prefers low correlation between support and non-support columns, and low correlation among columns within the support leads to better recovery.
20 Lasso Model Recovery: Implications. Setting: strongly correlated columns in $X$, with the correlation between feature $i$ and feature $j$ given by $\rho_{ij} \triangleq x_i^\top x_j$. Large correlation between $X_S$ and $X_{S^c}$ makes $\gamma$ small; large correlation within $X_S$ makes $C_{\min}$ small.
21 Lasso Model Recovery: Implications. Consequently the r.h.s. of the bound is large (a loose bound), and w.h.p. Lasso fails in model recovery. In other words: Lasso solutions differ with the solver used, and the solution is typically not unique. The prediction error may not be as bad, though [Hebiri and Lederer, 2013].
22 Lasso Model Recovery: Implications. Requirements: we need consistent estimates independent of the solver, and preferably the selection of all correlated variables as a group.
23 Illustration: Lasso under correlation [Zeng and Figueiredo, 2015]. Setting: strongly correlated features. $\{1, \dots, d\} = G_1 \cup \dots \cup G_k$ with $G_m \cap G_l = \emptyset$ for $l \neq m$, and $\rho_{ij} = x_i^\top x_j$ very high ($\approx 1$) for pairs $i, j \in G_m$.
24 Illustration: Lasso under correlation [Zeng and Figueiredo, 2015]. Toy example: $d = 40$, $k = 4$, $G_1 = [1{:}10]$, $G_2 = [11{:}20]$, $G_3 = [21{:}30]$, $G_4 = [31{:}40]$. Lasso ($\Omega(w) = \|w\|_1$) gives sparse recovery but arbitrarily selects the variables within a group. Figure: original signal. Figure: recovered signal.
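A sketch of this kind of toy setting: groups of near-duplicate columns built from shared latent factors, with Lasso typically keeping only a few arbitrary representatives per active group (the correlation level, noise, and regularization value are illustrative, not the exact construction of the talk):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, d, k = 100, 40, 4
Z = rng.normal(size=(n, k))                             # one latent factor per group
X = np.repeat(Z, d // k, axis=1) + 0.01 * rng.normal(size=(n, d))
w_star = np.zeros(d); w_star[10:20] = 2.0; w_star[30:40] = 2.0
y = X @ w_star + 0.5 * rng.normal(size=n)

w_hat = Lasso(alpha=0.1).fit(X, y).coef_
for j in range(k):                                      # nonzeros selected per group
    print(f"group {j + 1}: {np.count_nonzero(w_hat[10 * j:10 * (j + 1)])} selected")
```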
25 Possible solutions: 2-stage procedures. Cluster Group Lasso [Buhlmann et al., 2013]: (i) identify strongly correlated groups $G = \{G_1, \dots, G_k\}$ via canonical correlation; (ii) group selection with $\Omega(w) = \sum_{j=1}^k \alpha_j \|w_{G_j}\|_2$, which selects all or none of the variables from each group.
26 Possible solutions: 2-stage procedures. Goal: can we learn $G$ and $w$ simultaneously?
27 Ordered Weighted $\ell_1$ (OWL) norms. OSCAR [Bondell and Reich, 2008]: $\Omega_O(w) = \sum_{i=1}^d c_i\,|w|_{(i)}$ with $c_i = c_0 + (d - i)\mu$ and $c_0, \mu, c_d > 0$. (Notation: $|w|_{(i)}$ is the $i$-th largest entry of $|w|$.)
28 Ordered Weighted $\ell_1$ (OWL) norms. OWL [Figueiredo and Nowak, 2016]: the same construction with general non-increasing weights $c_1 \ge \dots \ge c_d \ge 0$.
29 Ordered Weighted $\ell_1$ (OWL) norms. Figure: signal recovered with OWL for the toy example above.
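Evaluating $\Omega_O$ is just a sort followed by a dot product. A minimal sketch with OSCAR-style weights ($c_0$ and $\mu$ are arbitrary), including checks against the two special cases discussed later in the talk:

```python
import numpy as np

def owl_norm(w, c):
    """Ordered weighted l1 norm: sum_i c_i |w|_(i), with |w| sorted decreasingly."""
    return np.sort(np.abs(w))[::-1] @ c

d = 5
c0, mu = 1.0, 0.5
c = c0 + (d - np.arange(1, d + 1)) * mu            # OSCAR weights c_i = c_0 + (d - i) mu
w = np.array([0.3, -2.0, 1.1, 0.0, -0.7])
print(owl_norm(w, c))

print(owl_norm(w, np.ones(d)), np.abs(w).sum())    # c = 1_d  -> l1 norm
e1 = np.zeros(d); e1[0] = 1.0
print(owl_norm(w, e1), np.abs(w).max())            # c = e_1  -> l_inf norm
```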
30 OSCAR: Sparsity Illustrated. Figure: examples of solutions [Bondell and Reich, 2008]. Solutions are encouraged towards the vertices of the norm ball, which encourages blockwise-constant solutions; see [Bondell and Reich, 2008].
31 OWL Properties. Grouping covariates [Bondell and Reich, 2008, Theorem 1], [Figueiredo and Nowak, 2016, Theorem 1]: $|\hat w_i| = |\hat w_j|$ whenever $\lambda \ge \lambda^0_{ij}$.
32 OWL Properties. $\lambda^0_{ij} \propto \sqrt{1 - \rho^2_{ij}}$, so strongly correlated pairs are grouped early in the regularization path. Groups: $G_j = \{i : |\hat w_i| = \alpha_j\}$.
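The tying of coefficients can also be seen directly in the proximal operator of $\Omega_O$. The sketch below uses the standard sort / isotonic-regression construction of the OWL (SLOPE) prox; this construction is an assumption of this note rather than something spelled out in the talk:

```python
import numpy as np
from sklearn.isotonic import isotonic_regression

def prox_owl(z, c):
    """prox of Omega_O at z: sort |z|, shift by the weights c, project onto the
    nonincreasing nonnegative cone (isotonic regression), then undo the sort."""
    sign, mag = np.sign(z), np.abs(z)
    order = np.argsort(mag)[::-1]                        # indices sorting |z| decreasingly
    v = isotonic_regression(mag[order] - c, increasing=False)
    v = np.maximum(v, 0.0)
    out = np.empty_like(z)
    out[order] = v
    return sign * out

d = 5
c = 1.0 + 0.4 * (d - np.arange(1, d + 1))                # decreasing OSCAR-type weights
z = np.array([3.0, -2.9, 1.0, 0.2, -0.15])
print(prox_owl(z, c))   # the two large, nearly equal entries come out tied in magnitude
```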
33 OWL Issues. Figure: true model vs. recovered (OWL). Bias towards piecewise-constant $\hat w$, easily understood through the norm balls; requires more samples for consistent estimation.
34 OWL Issues. In addition: lack of interpretation for choosing $c$.
35 Ordered Weighted $\ell_1$ (OWL) norms and submodular penalties
36 Preliminaries: Penalties on the Support. Goal: encourage $w$ to have a desired support structure, where $\mathrm{supp}(w) = \{i : w_i \neq 0\}$.
37 Preliminaries: Penalties on the Support. Idea: penalize the support [Obozinski and Bach, 2012]: $\mathrm{pen}(w) = F(\mathrm{supp}(w)) + \|w\|_p^p$, $p \in [1, \infty]$.
38 Preliminaries: Penalties on the Support. Relaxation of $\mathrm{pen}(w)$: $\Omega^F_p(w)$, the tightest positively homogeneous convex lower bound.
39 Preliminaries: Penalties on the Support. Example: $F(\mathrm{supp}(w)) = |\mathrm{supp}(w)|$.
40 Preliminaries: Penalties on the Support. For this choice, $\Omega^F_p(w) = \|w\|_1$. Familiar! Message: the cardinality function always relaxes to the $\ell_1$ norm.
41 Nondecreasing Submodular Penalties on Cardinality. Assumptions: write $F \in \mathcal{F}$ if $F$, defined on subsets $A \subseteq \{1, \dots, d\}$, is: 1. Submodular [Bach, 2011]: for $A \subseteq B$ and $k \notin B$, $F(A \cup \{k\}) - F(A) \ge F(B \cup \{k\}) - F(B)$; its Lovász extension $f : \mathbb{R}^d \to \mathbb{R}$ is the convex extension of $F$ to $\mathbb{R}^d$. 2. Cardinality-based: $F(A) = g(|A|)$ (invariant to permutations). 3. Non-decreasing: $g(0) = 0$, $g(x) \ge g(x-1)$. Implication: $F \in \mathcal{F}$ is completely specified through $g$. Example: let $V = \{1, \dots, d\}$ and $F(A) = |A| \cdot |V \setminus A|$; then $f(w) = \sum_{i<j} |w_i - w_j|$.
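A quick numerical check of the example: for $F(A) = |A|\,|V \setminus A|$, the generic Lovász-extension formula for cardinality-based functions agrees with $\sum_{i<j}|w_i - w_j|$ on the positive orthant (a minimal sketch, not from the talk):

```python
import numpy as np
from itertools import combinations

d = 6
g = lambda a: a * (d - a)                    # cardinality-based F(A) = |A| |V \ A|

def lovasz_cardinality(w):
    """Lovasz extension of F(A) = g(|A|) on the positive orthant:
    f(w) = sum_i (g(i) - g(i-1)) * w_(i), with w sorted decreasingly."""
    ws = np.sort(w)[::-1]
    incr = np.array([g(i) - g(i - 1) for i in range(1, d + 1)])
    return incr @ ws

rng = np.random.default_rng(4)
w = rng.uniform(size=d)                      # nonnegative test point
pairwise = sum(abs(w[i] - w[j]) for i, j in combinations(range(d), 2))
print(lovasz_cardinality(w), pairwise)       # the two values coincide
```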
42 $\Omega^F_\infty$ and the Lovász extension. Result for $p = \infty$ [Bach, 2010]: $\Omega^F_\infty(w) = f(|w|)$, i.e. the $\ell_\infty$ relaxation coincides with the Lovász extension on the positive orthant. To work with $\Omega^F_\infty$ one may use existing results on submodular function minimization. $\Omega^F_p$ is not known in closed form for $p < \infty$.
43 Equivalence of OWL and Lovász extensions: Statement. Proposition [Sankaran et al., 2017]: for $F \in \mathcal{F}$, $\Omega^F_\infty(w) \equiv \Omega_O(w)$. 1. Given $F(A) = g(|A|)$, $\Omega^F_\infty(w) = \Omega_O(w)$ with $c_i = g(i) - g(i-1)$. 2. Given $c_1 \ge \dots \ge c_d \ge 0$, $\Omega_O(w) = \Omega^F_\infty(w)$ with $g(i) = c_1 + \dots + c_i$. Interpretation: gives alternate interpretations for OWL.
44 Equivalence of OWL and Lovász extensions: Statement. Moreover, $\Omega^F_\infty$ has undesired extreme points [Bach, 2011], which explains the piecewise-constant solutions of OWL.
45 Equivalence of OWL and Lovász extensions: Statement. This motivates studying $\Omega^F_p(w)$ for $p < \infty$.
46 $\Omega_S$: Smoothed OWL (SOWL)
47 SOWL: Definition. Smoothed OWL: $\Omega_S(w) := \Omega^F_2(w)$.
48 SOWL: Definition. Variational form for $\Omega^F_2$ [Obozinski and Bach, 2012]: $\Omega_S(w) = \min_{\eta \in \mathbb{R}^d_+} \tfrac{1}{2}\Big(\sum_{i=1}^d \tfrac{w_i^2}{\eta_i} + f(\eta)\Big)$.
49 SOWL: Definition. Using the OWL equivalence, $f(\eta) = \Omega^F_\infty(\eta) = \sum_{i=1}^d c_i\,\eta_{(i)}$ for $\eta \ge 0$, which gives $\Omega_S(w) = \min_{\eta \in \mathbb{R}^d_+} \Psi(w, \eta)$ with $\Psi(w, \eta) = \tfrac{1}{2}\sum_{i=1}^d \Big(\tfrac{w_i^2}{\eta_i} + c_i\,\eta_{(i)}\Big)$. (SOWL)
50 OWL vs SOWL. Case $c = \mathbf{1}_d$: $\Omega_S(w) = \|w\|_1$ and $\Omega_O(w) = \|w\|_1$. Case $c = [1, 0, \dots, 0]$ (with $d-1$ zeros): $\Omega_S(w) = \|w\|_2$, whereas $\Omega_O(w) = \|w\|_\infty$.
51 OWL vs SOWL. Norm balls. Figure: norm balls for OWL and SOWL for different values of $c$.
52 Group Lasso and $\Omega_S$. Eliminating $\eta$ from the SOWL objective: $\Omega_S(w) = \sum_{j=1}^k \|w_{G_j}\|_2 \sqrt{\sum_{i \in G_j} c_i}$. Denote by $\eta_w$ the optimal $\eta$ given $w$.
53 Group Lasso and $\Omega_S$. Key differences from Group Lasso: the groups are defined through $\eta_w = [\delta_1, \dots, \delta_1, \dots, \delta_k, \dots, \delta_k]$ (constant over each of $G_1, \dots, G_k$), and they are influenced by the choice of $c$.
54 Open Questions. 1. Does $\Omega_S$ promote grouping of correlated variables like $\Omega_O$? Are there any benefits over $\Omega_O$?
55 Open Questions. 2. Is using $\Omega_S$ computationally feasible?
56 Open Questions. 3. Theoretical properties of $\Omega_S$ vs. Group Lasso?
57 Grouping property of $\Omega_S$: Statement. Learning problem (LS-SOWL): $\min_{w \in \mathbb{R}^d,\, \eta \in \mathbb{R}^d_+} \Gamma^{(\lambda)}(w, \eta)$, where $\Gamma^{(\lambda)}(w, \eta) = \tfrac{1}{2n}\|Xw - y\|_2^2 + \tfrac{\lambda}{2}\sum_{i=1}^d \Big(\tfrac{w_i^2}{\eta_i} + c_i\,\eta_{(i)}\Big)$.
58 Grouping property of $\Omega_S$: Statement. Theorem [Sankaran et al., 2017]. Define $(\hat w^{(\lambda)}, \hat\eta^{(\lambda)}) = \arg\min_{w, \eta} \Gamma^{(\lambda)}(w, \eta)$, $\rho_{ij} = x_i^\top x_j$, and $\bar c = \min_i (c_i - c_{i+1})$.
59 Grouping property of $\Omega_S$: Statement. Then there exists $0 \le \lambda_0 \le \tfrac{\|y\|_2}{\bar c}\,(4 - 4\rho_{ij}^2)^{1/4}$ such that for all $\lambda > \lambda_0$, $\hat\eta^{(\lambda)}_i = \hat\eta^{(\lambda)}_j$.
60 Grouping property of $\Omega_S$: Interpretation. 1. Variables $\eta_i, \eta_j$ are grouped if $\rho_{ij} \approx 1$ (even for small $\lambda$). Similar to Figueiredo and Nowak [2016, Theorem 1], which groups the absolute values of $w$.
61 Grouping property of $\Omega_S$: Interpretation. 2. SOWL differentiates the grouping variable $\eta$ from the model variable $w$.
62 Grouping property of $\Omega_S$: Interpretation. 3. Allows model variation within a group: $\hat w^{(\lambda)}_i$ need not equal $\hat w^{(\lambda)}_j$, as long as $c$ has distinct values.
63 Illustration: Group Discovery using SOWL. Aim: illustrate the group discovery of SOWL. Consider $z \in \mathbb{R}^d$, compute $\mathrm{prox}_{\lambda\Omega_S}(z)$, and study the regularization path.
64 Illustration: Group Discovery using SOWL. Figure: original signal.
65 Illustration: Group Discovery using SOWL. Figure: regularization paths ($x$-axis: $\lambda$, $y$-axis: $\hat w$) for Lasso ($w$), OWL ($w$), SOWL ($w$), and SOWL ($\eta$). Observations: early group discovery; model variation within groups.
66 Proximal Methods: A brief overview. Proximal operator: $\mathrm{prox}_\Omega(z) = \arg\min_w \tfrac{1}{2}\|w - z\|_2^2 + \Omega(w)$. Easy to evaluate for many simple norms, e.g. $\mathrm{prox}_{\lambda \ell_1}(z) = \mathrm{sign}(z)\,(|z| - \lambda)_+$. A generalization of projected gradient descent.
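The $\ell_1$ prox in code (a minimal sketch):

```python
import numpy as np

def prox_l1(z, lam):
    """Soft-thresholding: the proximal operator of lam * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

z = np.array([2.0, -0.3, 0.7, -1.5])
print(prox_l1(z, 0.5))        # [ 1.5  -0.    0.2  -1. ]
```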
67 FISTA [Beck and Teboulle, 2009]. Initialization: $t^{(1)} = 1$, $w^{(1)} = \bar w^{(1)} = 0$. Steps for $k > 1$: $w^{(k)} = \mathrm{prox}_{\frac{1}{L}\Omega}\big(\bar w^{(k-1)} - \tfrac{1}{L}\nabla f(\bar w^{(k-1)})\big)$; $t^{(k)} = \big(1 + \sqrt{1 + 4\,(t^{(k-1)})^2}\,\big)/2$; $\bar w^{(k)} = w^{(k)} + \tfrac{t^{(k-1)} - 1}{t^{(k)}}\big(w^{(k)} - w^{(k-1)}\big)$. Guarantee: convergence rate $O(1/T^2)$, with no additional assumptions beyond IST; known to be optimal for this class of minimization problems.
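A minimal FISTA sketch for the lasso objective $\tfrac{1}{2}\|Xw - y\|_2^2 + \lambda\|w\|_1$, with step size $1/L$ where $L$ is the largest eigenvalue of $X^\top X$ (data dimensions and $\lambda$ are illustrative):

```python
import numpy as np

def prox_l1(z, lam):
    # soft-thresholding, the prox of lam * ||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def fista_lasso(X, y, lam, n_iter=500):
    """FISTA for min_w 0.5 * ||Xw - y||^2 + lam * ||w||_1."""
    L = np.linalg.eigvalsh(X.T @ X).max()          # Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    w_bar, t = w.copy(), 1.0
    for _ in range(n_iter):
        grad = X.T @ (X @ w_bar - y)               # gradient of the smooth part at w_bar
        w_new = prox_l1(w_bar - grad / L, lam / L) # proximal gradient step
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        w_bar = w_new + ((t - 1.0) / t_new) * (w_new - w)
        w, t = w_new, t_new
    return w

rng = np.random.default_rng(5)
X = rng.normal(size=(50, 100))
w_star = np.zeros(100); w_star[:5] = 1.0
y = X @ w_star + 0.1 * rng.normal(size=50)
w_hat = fista_lasso(X, y, lam=1.0)
print(np.flatnonzero(np.abs(w_hat) > 1e-6))        # indices of the selected variables
```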
68 Computing $\mathrm{prox}_{\Omega_S}$. Problem: $\mathrm{prox}_{\lambda\Omega}(z) = \arg\min_w \tfrac{1}{2}\|w - z\|_2^2 + \lambda\Omega(w)$. Write $w^{(\lambda)} = \mathrm{prox}_{\lambda\Omega_S}(z)$ and $\eta^{(\lambda)}_w = \arg\min_\eta \Psi(w^{(\lambda)}, \eta)$.
69 Computing $\mathrm{prox}_{\Omega_S}$. Key idea: the ordering of $\eta^{(\lambda)}_w$ remains the same for all $\lambda$. Figure: $\eta$ vs. $\log(\lambda)$ and $\eta$ vs. $\lambda$.
70 Computing $\mathrm{prox}_{\Omega_S}$. Given the optimal $\eta$, the prox is obtained coordinate-wise, $w^{(\lambda)}_i = \tfrac{\eta_i z_i}{\eta_i + \lambda}$. Overall, the same complexity as computing the norm $\Omega_S$, namely $O(d \log d)$; this holds for all cardinality-based $F$.
71 Random Design. Problem setting: LS-SOWL. True model: $y = Xw^* + \varepsilon$, rows $(X)_i \sim N(\mu, \Sigma)$, $\varepsilon_i \sim N(0, \sigma^2)$. Notation: $J = \{i : w^*_i \neq 0\}$, $\eta_{w^*} = [\delta_1, \dots, \delta_1, \dots, \delta_k, \dots, \delta_k]$ (constant over each of $G_1, \dots, G_k$); $D$ is a diagonal matrix defined using the groups $G_1, \dots, G_k$ and $w^*$.
72 Random Design. Assumptions: $\Sigma_{J,J}$ is invertible, $\lambda \to 0$, and $\lambda\sqrt{n} \to \infty$.
73 Random Design. Irrepresentability conditions: 1. $\delta_k = 0$ on $J^c$. 2. $\big\|\Sigma_{J^c,J}\,(\Sigma_{J,J})^{-1} D\, w^*_J\big\|_2 \le \beta < 1$.
74 Random Design. Result [Sankaran et al., 2017]: 1. $\hat w \xrightarrow{p} w^*$. 2. $P(\hat J = J) \to 1$.
75 Random Design. This is similar to Group Lasso [Bach, 2008, Theorem 2], but the weights are learned without explicit group information.
76 Quantitative Simulation: Predictive Accuracy. Aim: learn $\hat w$ using LS-SOWL and evaluate the prediction error. (The experiments follow the setup of Bondell and Reich [2008].)
77 Quantitative Simulation: Predictive Accuracy. Generate samples: $x \sim N(0, \Sigma)$, $\epsilon \sim N(0, \sigma^2)$, $y = x^\top w^* + \epsilon$.
78 Quantitative Simulation: Predictive Accuracy. Metric: $\mathbb{E}[(x^\top(w^* - \hat w))^2] = (w^* - \hat w)^\top \Sigma (w^* - \hat w)$.
79 Quantitative Simulation: Predictive Accuracy. Data: $w^* = [0_{10}, 2_{10}, 0_{10}, 2_{10}]$, $n = 100$, $\sigma = 15$, and $\Sigma_{i,j} = 0.5$ if $i \neq j$, $1$ if $i = j$.
80 Quantitative Simulation: Predictive Accuracy. Models with group variance: measure $\mathbb{E}[(x^\top(\tilde w - \hat w))^2]$ with $\tilde w = w^* + \epsilon'$, $\epsilon' \sim U[-\tau, \tau]$, $\tau = 0, 0.2, 0.4$.
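A sketch of the data generation and the error metric $(w^* - \hat w)^\top \Sigma (w^* - \hat w)$; plain Lasso stands in here for the estimators compared in the talk, and the regularization value is arbitrary:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(6)
d, n, sigma = 40, 100, 15.0
w_star = np.concatenate([np.zeros(10), 2 * np.ones(10), np.zeros(10), 2 * np.ones(10)])
Sigma = np.full((d, d), 0.5); np.fill_diagonal(Sigma, 1.0)
chol = np.linalg.cholesky(Sigma)

X = rng.normal(size=(n, d)) @ chol.T          # rows x ~ N(0, Sigma)
y = X @ w_star + sigma * rng.normal(size=n)

w_hat = Lasso(alpha=1.0).fit(X, y).coef_
diff = w_star - w_hat
print(diff @ Sigma @ diff)                    # estimate of E[(x^T (w* - w_hat))^2]
```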
81 Predictive accuracy results (columns: median MSE, 10th-percentile MSE, 90th-percentile MSE; each column lists numbers for $\tau = 0, 0.2, 0.4$).
LASSO: 46.1 / 45.2 / / 32.7 / / 61.5 / 61.4
OWL: 27.6 / 27.0 / / 19.2 / / 40.4 / 39.2
El. Net: 30.8 / 30.7 / / 22.6 / / 43.0 / 41.4
$\Omega_S$: 23.9 / 23.3 / / 16.8 / / 35.4 / 33.2
82 Summary. 1. Proposed a new family of norms $\Omega_S$. 2. Properties: equivalent to OWL in group identification; efficient computational tools; equivalences to Group Lasso. 3. Illustrated performance through simulations.
83 Questions?
84 Thank you!!!
85 References I
F. Bach. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9, 2008.
F. Bach. Structured sparsity-inducing norms through submodular functions. In Neural Information Processing Systems, 2010.
F. Bach. Learning with submodular functions: A convex optimization perspective. Foundations and Trends in Machine Learning, 6(2-3), 2013.
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1), March 2009.
H. D. Bondell and B. J. Reich. Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics, 64, 2008.
86 References II
P. Buhlmann, P. Rütimann, S. van de Geer, and C.-H. Zhang. Correlated variables in regression: Clustering and sparse estimation. Journal of Statistical Planning and Inference, 143, 2013.
A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1), May 2011.
M. A. T. Figueiredo and R. D. Nowak. Ordered weighted l1 regularized regression with strongly correlated covariates: Theoretical aspects. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.
M. Hebiri and J. Lederer. How correlations influence lasso prediction. IEEE Transactions on Information Theory, 59(3), 2013.
G. Obozinski and F. Bach. Convex relaxation for combinatorial penalties. Technical report, HAL, 2012.
87 References III
R. Sankaran, F. Bach, and C. Bhattacharya. Identifying groups of strongly correlated variables through smoothed ordered weighted l1-norms. In Artificial Intelligence and Statistics, 2017.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58, 1996.
M. J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using l1-constrained quadratic programming (lasso). IEEE Transactions on Information Theory, 55(5), 2009.
X. Zeng and M. Figueiredo. The ordered weighted l1 norm: Atomic formulation, projections, and algorithms. arXiv preprint, 2015.