OWL to the rescue of LASSO


OWL to the rescue of LASSO. IISc IBM Day 2018. Joint work with R. Sankaran and Francis Bach (AISTATS 2017). Chiranjib Bhattacharyya, Professor, Department of Computer Science and Automation, Indian Institute of Science, Bangalore. March 7, 2018.

Identifying groups of strongly correlated variables through Smoothed Ordered Weighted $\ell_1$-norms.

Outline:
LASSO: The method of choice for feature selection
Ordered Weighted L1 (OWL) norm and submodular penalties
$\Omega_S$: Smoothed OWL (SOWL)

LASSO: The method of choice for feature selection

Linear Regression in High Dimension. Question: let $x \in \mathbb{R}^d$, $y \in \mathbb{R}$, with $y = w^{*\top} x + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$. From $D = \{(x_i, y_i) : i \in [n],\ x_i \in \mathbb{R}^d,\ y_i \in \mathbb{R}\}$, can we find $w^*$? Stack the data as $X \in \mathbb{R}^{n \times d}$ with rows $x_1^\top, \dots, x_n^\top$, and $Y \in \mathbb{R}^n$.

Linear Regression in High Dimension. Least squares linear regression: given $X \in \mathbb{R}^{n \times d}$ and $Y \in \mathbb{R}^n$,
$w_{LS} = \arg\min_{w \in \mathbb{R}^d} \tfrac{1}{2}\|Xw - Y\|_2^2$.
Assumptions: labels centered, $\sum_{j=1}^n y_j = 0$; features normalized, $\|x_i\|_2 = 1$ and $x_i^\top \mathbf{1} = 0$.

Linear Regression in High Dimension. Least squares linear regression:
$w_{LS} = (X^\top X)^{-1} X^\top Y$, and $\mathbb{E}(w_{LS}) = w^*$.
Variance of the predictive error: $\tfrac{1}{n}\,\mathbb{E}\big(\|X(w_{LS} - w^*)\|_2^2\big) = \sigma^2 \tfrac{d}{n}$.
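A minimal simulation sketch (not from the talk) illustrating the $\sigma^2 d/n$ formula; the dimensions, noise level, and random seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 200, 20, 1.0
w_star = rng.normal(size=d)
X = rng.normal(size=(n, d))        # fixed design, full column rank w.h.p.

errs = []
for _ in range(500):
    y = X @ w_star + sigma * rng.normal(size=n)
    w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)      # unique least-squares solution
    errs.append(np.sum((X @ (w_ls - w_star)) ** 2) / n)

# empirical average of (1/n)||X(w_LS - w*)||^2 vs. the formula sigma^2 * d / n
print(np.mean(errs), sigma**2 * d / n)
```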

Linear Regression in High dimension rank(x ) = d w LS is unique Poor predictive performance, d is close to n

Linear Regression in High dimension rank(x ) = d rank(x ) < d w LS is unique Poor predictive performance, d is close to n d > n w LS is not unique.

Regularized Linear Regression. Given $X \in \mathbb{R}^{n \times d}$, $y \in \mathbb{R}^n$, and $\Omega : \mathbb{R}^d \to \mathbb{R}$:
$\min_{w \in \mathbb{R}^d} \tfrac{1}{2}\|Xw - y\|_2^2 \quad \text{s.t.} \quad \Omega(w) \le t$.
Regularizer: $\Omega(w)$ is a non-negative convex function, typically a norm, possibly non-differentiable.

Lasso regression [Tibshirani, 1994]. Ridge: $\Omega(w) = \|w\|_2^2$; does not promote sparsity; has a closed-form solution. Lasso: $\Omega(w) = \|w\|_1$; encourages sparse solutions; requires solving a convex optimization problem.

Regularized Linear Regression (penalized form). Given $X \in \mathbb{R}^{n \times d}$, $y \in \mathbb{R}^n$, and $\Omega : \mathbb{R}^d \to \mathbb{R}$:
$\min_{w \in \mathbb{R}^d} \tfrac{1}{2}\|Xw - y\|_2^2 + \lambda\,\Omega(w)$.
Unconstrained, and equivalent to the constrained version.
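For concreteness, a hedged sketch using scikit-learn, whose `Lasso` solves the penalized form with the loss scaled by $1/n$, i.e. $\frac{1}{2n}\|Xw - y\|_2^2 + \alpha\|w\|_1$ (so `alpha` plays the role of $\lambda/n$); the data and regularization values below are arbitrary and simply contrast the sparse Lasso solution with the dense ridge solution.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
n, d = 100, 50
X = rng.normal(size=(n, d))
w_star = np.zeros(d); w_star[:5] = 2.0           # sparse ground truth
y = X @ w_star + 0.5 * rng.normal(size=n)

# sklearn's Lasso minimizes (1/(2n))||Xw - y||_2^2 + alpha*||w||_1
lasso = Lasso(alpha=0.5, fit_intercept=False).fit(X, y)
ridge = Ridge(alpha=1.0, fit_intercept=False).fit(X, y)

print("lasso nonzeros:", np.sum(lasso.coef_ != 0))   # sparse solution
print("ridge nonzeros:", np.sum(ridge.coef_ != 0))   # dense solution
```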

Lasso: Properties at a glance. Computational: proximal methods such as IST, FISTA, and Chambolle-Pock [Chambolle and Pock, 2011, Beck and Teboulle, 2009]; convergence rate $O(1/T^2)$ in $T$ iterations; assumes availability of the proximal operator of $\Omega$ (easy for $\ell_1$).

Lasso: Properties at a glance. Computational: proximal methods such as IST, FISTA, and Chambolle-Pock [Chambolle and Pock, 2011, Beck and Teboulle, 2009]; convergence rate $O(1/T^2)$ in $T$ iterations; assumes availability of the proximal operator of $\Omega$ (easy for $\ell_1$). Statistical properties [Wainwright, 2009]: support recovery (will it recover the true support?), sample complexity (how many samples are needed?), prediction error (what is the expected error in prediction?).

Lasso Model Recovery [Wainwright, 2009, Theorem 1]. Setup: $y = Xw^* + \epsilon$, $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$, $X \in \mathbb{R}^{n \times d}$. Support of $w^*$: $S = \{j : w^*_j \ne 0,\ j \in [d]\}$. Lasso: $\min_{w \in \mathbb{R}^d} \tfrac{1}{2}\|Xw - y\|_2^2 + \lambda\|w\|_1$.

Lasso Model Recovery [Wainwright, 2009, Theorem 1]. Conditions:
1. Incoherence: $\|X_{S^c}^\top X_S (X_S^\top X_S)^{-1}\|_\infty \le 1 - \gamma$ for some $\gamma \in (0, 1]$.
2. Minimum eigenvalue: $\Lambda_{\min}\big(\tfrac{1}{n} X_S^\top X_S\big) \ge C_{\min}$.
3. Regularization: $\lambda > \lambda_0 \triangleq \tfrac{2}{\gamma}\sqrt{\tfrac{2\sigma^2 \log d}{n}}$.
(Here $\|M\|_\infty = \max_i \sum_j |M_{ij}|$.)

Lasso Model Recovery [Wainwright, 2009, Theorem 1]. Conditions:
1. Incoherence: $\|X_{S^c}^\top X_S (X_S^\top X_S)^{-1}\|_\infty \le 1 - \gamma$ for some $\gamma \in (0, 1]$.
2. Minimum eigenvalue: $\Lambda_{\min}\big(\tfrac{1}{n} X_S^\top X_S\big) \ge C_{\min}$.
3. Regularization: $\lambda > \lambda_0 \triangleq \tfrac{2}{\gamma}\sqrt{\tfrac{2\sigma^2 \log d}{n}}$.
Under these conditions, w.h.p. the following holds:
$\|\hat{w}_S - w^*_S\|_\infty \le \lambda\Big(\big\|(X_S^\top X_S / n)^{-1}\big\|_\infty + 4\sigma/\sqrt{C_{\min}}\Big)$.
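A small sketch (not from the paper) that computes the incoherence constant $\gamma$ and $C_{\min}$ of the conditions above for a given design and support; `check_wainwright_conditions` is a hypothetical helper name and the random design is an arbitrary illustration.

```python
import numpy as np

def check_wainwright_conditions(X, S):
    """Return (gamma, C_min) for support S: gamma = 1 - ||X_Sc^T X_S (X_S^T X_S)^-1||_inf,
       C_min = smallest eigenvalue of X_S^T X_S / n.  (A sketch, not a library routine.)"""
    n, d = X.shape
    Sc = [j for j in range(d) if j not in S]
    Xs, Xsc = X[:, S], X[:, Sc]
    A = Xsc.T @ Xs @ np.linalg.inv(Xs.T @ Xs)
    inf_norm = np.max(np.sum(np.abs(A), axis=1))     # max absolute row sum
    gamma = 1.0 - inf_norm                           # > 0 means incoherence holds
    c_min = np.min(np.linalg.eigvalsh(Xs.T @ Xs / n))
    return gamma, c_min

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 20))
X /= np.linalg.norm(X, axis=0)                       # normalize columns as in the setup
print(check_wainwright_conditions(X, S=[0, 1, 2]))
```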

Lasso Model Recovery: Special cases. Case $X_{S^c}^\top X_S = 0$: the incoherence condition trivially holds with $\gamma = 1$; the threshold $\lambda_0$ is smaller, and the recovery error is smaller. Case $X_S^\top X_S = I$: $C_{\min} = 1/n$, the largest possible for a given $n$; a larger $C_{\min}$ gives a smaller recovery error.

Lasso Model Recovery: Special cases. Case $X_{S^c}^\top X_S = 0$: the incoherence condition trivially holds with $\gamma = 1$; the threshold $\lambda_0$ is smaller, and the recovery error is smaller. Case $X_S^\top X_S = I$: $C_{\min} = 1/n$, the largest possible for a given $n$; a larger $C_{\min}$ gives a smaller recovery error. When does Lasso work well? Lasso prefers low correlation between support and non-support columns, and low correlation among the columns within the support leads to better recovery.

Lasso Model Recovery: Implications. Setting: strongly correlated columns in $X$. Correlation between feature $i$ and feature $j$: $\rho_{ij} \triangleq x_i^\top x_j$. Large correlation between $X_S$ and $X_{S^c}$ makes $\gamma$ small; large correlation within $X_S$ makes $C_{\min}$ small.

Lasso Model Recovery: Implications. Setting: strongly correlated columns in $X$. Correlation between feature $i$ and feature $j$: $\rho_{ij} \triangleq x_i^\top x_j$. Large correlation between $X_S$ and $X_{S^c}$ makes $\gamma$ small; large correlation within $X_S$ makes $C_{\min}$ small. The r.h.s. of the bound is then large (a loose bound), and w.h.p. Lasso fails in model recovery. In other words: Lasso solutions differ with the solver used, and the solution is typically not unique. The prediction error may not be as bad, though [Hebiri and Lederer, 2013].

Lasso Model Recovery: Implications. Setting: strongly correlated columns in $X$. Correlation between feature $i$ and feature $j$: $\rho_{ij} \triangleq x_i^\top x_j$. Large correlation between $X_S$ and $X_{S^c}$ makes $\gamma$ small; large correlation within $X_S$ makes $C_{\min}$ small. The r.h.s. of the bound is then large (a loose bound), and w.h.p. Lasso fails in model recovery. In other words: Lasso solutions differ with the solver used, and the solution is typically not unique. The prediction error may not be as bad, though [Hebiri and Lederer, 2013]. Requirements: we need consistent estimates independent of the solver, and preferably all the correlated variables should be selected as a group.

Illustration: Lasso under correlation [Zeng and Figueiredo, 2015]. Setting: strongly correlated features. $\{1, \dots, d\} = G_1 \cup \dots \cup G_k$, with $G_m \cap G_l = \emptyset$ for $l \ne m$, and $\rho_{ij} \triangleq x_i^\top x_j$ very high ($\approx 1$) for pairs $i, j \in G_m$.

Illustration: Lasso under correlation [Zeng and Figueiredo, 2015]. Setting: strongly correlated features. $\{1, \dots, d\} = G_1 \cup \dots \cup G_k$, with $G_m \cap G_l = \emptyset$ for $l \ne m$, and $\rho_{ij} \triangleq x_i^\top x_j$ very high ($\approx 1$) for pairs $i, j \in G_m$. Toy example: $d = 40$, $k = 4$, $G_1 = [1{:}10]$, $G_2 = [11{:}20]$, $G_3 = [21{:}30]$, $G_4 = [31{:}40]$. Lasso ($\Omega(w) = \|w\|_1$) gives sparse recovery but arbitrarily selects the variables within a group. [Figures: original signal vs. recovered signal.]
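A hedged sketch that mimics this toy setting with near-duplicate columns inside each group; the exact signal and noise levels of the talk's figures are not reproduced, and the `alpha` value is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, d, k = 100, 40, 4
groups = [np.arange(10 * j, 10 * (j + 1)) for j in range(k)]

# strongly correlated columns within each group: shared latent factor + tiny noise
Z = rng.normal(size=(n, k))
X = np.empty((n, d))
for j, G in enumerate(groups):
    X[:, G] = Z[:, [j]] + 0.01 * rng.normal(size=(n, len(G)))
X /= np.linalg.norm(X, axis=0)

w_star = np.zeros(d)
w_star[groups[1]] = 4.0                    # one active group, constant weights
y = X @ w_star + 0.1 * rng.normal(size=n)

w_lasso = Lasso(alpha=0.01, fit_intercept=False, max_iter=10000).fit(X, y).coef_
# Lasso may keep only some of the 10 near-duplicate columns; the choice is solver-dependent
print("selected within active group:", np.sum(w_lasso[groups[1]] != 0), "out of 10")
```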

Possible solutions: 2-stage procedures. Cluster Group Lasso [Buhlmann et al., 2013]: (1) identify strongly correlated groups $G = \{G_1, \dots, G_k\}$ via canonical correlation; (2) group selection with $\Omega(w) = \sum_{j=1}^k \alpha_j \|w_{G_j}\|_2$, which selects all or none of the variables from each group.

Possible solutions: 2-stage procedures. Cluster Group Lasso [Buhlmann et al., 2013]: (1) identify strongly correlated groups $G = \{G_1, \dots, G_k\}$ via canonical correlation; (2) group selection with $\Omega(w) = \sum_{j=1}^k \alpha_j \|w_{G_j}\|_2$, which selects all or none of the variables from each group. Goal: can we learn $G$ and $w$ simultaneously?

Ordered Weighted $\ell_1$ (OWL) norms. OSCAR [Bondell and Reich, 2008]:
$\Omega_O(w) = \sum_{i=1}^d c_i |w|_{(i)}$, with $c_i = c_0 + (d - i)\mu$ and $c_0, \mu, c_d > 0$.
(Notation: $|w|_{(i)}$ is the $i$-th largest entry of $|w|$.)

Ordered Weighted $\ell_1$ (OWL) norms. OSCAR [Bondell and Reich, 2008]:
$\Omega_O(w) = \sum_{i=1}^d c_i |w|_{(i)}$, with $c_i = c_0 + (d - i)\mu$ and $c_0, \mu, c_d > 0$.
OWL [Figueiredo and Nowak, 2016]: general non-increasing weights $c_1 \ge \dots \ge c_d \ge 0$.

Ordered Weighted $\ell_1$ (OWL) norms. OSCAR [Bondell and Reich, 2008]:
$\Omega_O(w) = \sum_{i=1}^d c_i |w|_{(i)}$, with $c_i = c_0 + (d - i)\mu$ and $c_0, \mu, c_d > 0$.
OWL [Figueiredo and Nowak, 2016]: general non-increasing weights $c_1 \ge \dots \ge c_d \ge 0$. [Figure: signal recovered by OWL.]
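Evaluating $\Omega_O$ only requires a sort. A minimal sketch, assuming a non-increasing weight vector $c$; `owl_norm` and `oscar_weights` are hypothetical helper names, with the OSCAR weights $c_i = c_0 + (d - i)\mu$ from the slide.

```python
import numpy as np

def owl_norm(w, c):
    """Ordered weighted l1 norm: sum_i c_i |w|_(i), with c assumed non-increasing."""
    return float(np.dot(c, np.sort(np.abs(w))[::-1]))

def oscar_weights(d, c0=1.0, mu=0.5):
    """OSCAR weights c_i = c_0 + (d - i)*mu for i = 1..d (a non-increasing sequence)."""
    return c0 + (d - np.arange(1, d + 1)) * mu

w = np.array([0.5, -3.0, 2.0, 0.0])
print(owl_norm(w, oscar_weights(len(w))))
```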

OSCAR: Sparsity Illustrated. [Figure: examples of solutions, from Bondell and Reich, 2008.] Solutions are encouraged towards vertices of the norm ball, which encourages blockwise-constant solutions. See [Bondell and Reich, 2008].

OWL Properties. Grouping covariates [Bondell and Reich, 2008, Theorem 1], [Figueiredo and Nowak, 2016, Theorem 1]: $\hat{w}_i = \hat{w}_j$ if $\lambda \ge \lambda^0_{ij}$.

OWL Properties. Grouping covariates [Bondell and Reich, 2008, Theorem 1], [Figueiredo and Nowak, 2016, Theorem 1]: $\hat{w}_i = \hat{w}_j$ if $\lambda \ge \lambda^0_{ij}$, with $\lambda^0_{ij} \propto \sqrt{1 - \rho_{ij}^2}$. Strongly correlated pairs are therefore grouped early in the regularization path. Groups: $G_j = \{i : \hat{w}_i = \alpha_j\}$.

OWL Issues. [Figures: true model vs. model recovered by OWL.] Bias towards piecewise-constant $\hat{w}$, easily understood through the norm balls; requires more samples for consistent estimation.

OWL Issues. [Figures: true model vs. model recovered by OWL.] Bias towards piecewise-constant $\hat{w}$, easily understood through the norm balls; requires more samples for consistent estimation. Lack of interpretation for choosing $c$.

Ordered Weighted L1 (OWL) norm and Submodular penalties

Preliminaries: Penalties on the Support. Goal: encourage $w$ to have a desired support structure, where $\mathrm{supp}(w) = \{i : w_i \ne 0\}$.

Preliminaries: Penalties on the Support. Goal: encourage $w$ to have a desired support structure, where $\mathrm{supp}(w) = \{i : w_i \ne 0\}$. Idea: penalty on the support [Obozinski and Bach, 2012]: $\mathrm{pen}(w) = F(\mathrm{supp}(w)) + \|w\|_p^p$, $p \in [1, \infty]$.

Preliminaries: Penalties on the Support. Goal: encourage $w$ to have a desired support structure, where $\mathrm{supp}(w) = \{i : w_i \ne 0\}$. Idea: penalty on the support [Obozinski and Bach, 2012]: $\mathrm{pen}(w) = F(\mathrm{supp}(w)) + \|w\|_p^p$, $p \in [1, \infty]$. Relaxation of $\mathrm{pen}(w)$: $\Omega^F_p(w)$, the tightest positively homogeneous convex lower bound.

Preliminaries: Penalties on the Support. Goal: encourage $w$ to have a desired support structure, where $\mathrm{supp}(w) = \{i : w_i \ne 0\}$. Idea: penalty on the support [Obozinski and Bach, 2012]: $\mathrm{pen}(w) = F(\mathrm{supp}(w)) + \|w\|_p^p$, $p \in [1, \infty]$. Relaxation of $\mathrm{pen}(w)$: $\Omega^F_p(w)$, the tightest positively homogeneous convex lower bound. Example: $F(\mathrm{supp}(w)) = |\mathrm{supp}(w)|$.

Preliminaries: Penalties on the Support. Goal: encourage $w$ to have a desired support structure, where $\mathrm{supp}(w) = \{i : w_i \ne 0\}$. Idea: penalty on the support [Obozinski and Bach, 2012]: $\mathrm{pen}(w) = F(\mathrm{supp}(w)) + \|w\|_p^p$, $p \in [1, \infty]$. Relaxation of $\mathrm{pen}(w)$: $\Omega^F_p(w)$, the tightest positively homogeneous convex lower bound. Example: $F(\mathrm{supp}(w)) = |\mathrm{supp}(w)|$ gives $\Omega^F_p(w) = \|w\|_1$. Familiar! Message: the cardinality function always relaxes to the $\ell_1$ norm.

Nondecreasing Submodular Penalties on Cardinality. Assumptions: write $F \in \mathcal{F}$ if $F : 2^{\{1, \dots, d\}} \to \mathbb{R}$ is:
1. Submodular [Bach, 2011]: for all $A \subseteq B$, $F(A \cup \{k\}) - F(A) \ge F(B \cup \{k\}) - F(B)$; its Lovász extension $f : \mathbb{R}^d \to \mathbb{R}$ is the convex extension of $F$ to $\mathbb{R}^d$.
2. Cardinality-based: $F(A) = g(|A|)$ (invariant to permutations).
3. Non-decreasing: $g(0) = 0$, $g(x) \ge g(x-1)$.
Implication: $F \in \mathcal{F}$ is completely specified through $g$.
Example: let $V = \{1, \dots, d\}$ and define $F(A) = |A|\,|V \setminus A|$. Then $f(w) = \sum_{i<j} |w_i - w_j|$.

$\Omega^F_\infty$ and the Lovász extension. Result for $p = \infty$ [Bach, 2010]: $\Omega^F_\infty(w) = f(|w|)$, i.e. the $\ell_\infty$ relaxation coincides with the Lovász extension on the positive orthant. To work with $\Omega^F_\infty$, one may use existing results on submodular function minimization. $\Omega^F_p$ is not known in closed form for $p < \infty$.
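For a cardinality-based $F(A) = g(|A|)$, the Lovász extension at $w \ge 0$ reduces to a sort: $f(w) = \sum_i \big(g(i) - g(i-1)\big)\, w_{(i)}$ with $w$ sorted decreasingly. A small sketch checking this against the example $F(A) = |A|\,|V \setminus A|$ above; `lovasz_cardinality` is a hypothetical helper name.

```python
import numpy as np
from itertools import combinations

def lovasz_cardinality(w, g):
    """Lovasz extension of F(A) = g(|A|) at w >= 0:
       f(w) = sum_i (g(i) - g(i-1)) * w_(i), with w sorted decreasingly.
       g is an array of length d+1 with g[0] = g(0), ..., g[d] = g(d)."""
    diffs = np.diff(g)                       # g(i) - g(i-1) for i = 1..d
    return float(np.dot(diffs, np.sort(w)[::-1]))

# check against the example F(A) = |A| * |V \ A|  =>  f(w) = sum_{i<j} |w_i - w_j|
d = 5
g = np.array([j * (d - j) for j in range(d + 1)], dtype=float)
w = np.abs(np.random.default_rng(4).normal(size=d))
lhs = lovasz_cardinality(w, g)
rhs = sum(abs(w[i] - w[j]) for i, j in combinations(range(d), 2))
print(lhs, rhs)                              # the two values should agree
```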

Equivalence of OWL and Lovász extensions: Statement. Proposition [Sankaran et al., 2017]: for $F \in \mathcal{F}$, $\Omega^F_\infty(w) \equiv \Omega_O(w)$:
1. Given $F(A) = g(|A|)$, $\Omega^F_\infty(w) = \Omega_O(w)$ with $c_i = g(i) - g(i-1)$.
2. Given $c_1 \ge \dots \ge c_d \ge 0$, $\Omega_O(w) = \Omega^F_\infty(w)$ with $g(i) = c_1 + \dots + c_i$.
Interpretations: this gives alternative interpretations for OWL.

Equivalence of OWL and Lovász extensions: Statement. Proposition [Sankaran et al., 2017]: for $F \in \mathcal{F}$, $\Omega^F_\infty(w) \equiv \Omega_O(w)$:
1. Given $F(A) = g(|A|)$, $\Omega^F_\infty(w) = \Omega_O(w)$ with $c_i = g(i) - g(i-1)$.
2. Given $c_1 \ge \dots \ge c_d \ge 0$, $\Omega_O(w) = \Omega^F_\infty(w)$ with $g(i) = c_1 + \dots + c_i$.
Interpretations: this gives alternative interpretations for OWL. $\Omega^F_\infty$ has undesired extreme points [Bach, 2011], which explains the piecewise-constant solutions of OWL.

Equivalence of OWL and Lovász extensions: Statement. Proposition [Sankaran et al., 2017]: for $F \in \mathcal{F}$, $\Omega^F_\infty(w) \equiv \Omega_O(w)$:
1. Given $F(A) = g(|A|)$, $\Omega^F_\infty(w) = \Omega_O(w)$ with $c_i = g(i) - g(i-1)$.
2. Given $c_1 \ge \dots \ge c_d \ge 0$, $\Omega_O(w) = \Omega^F_\infty(w)$ with $g(i) = c_1 + \dots + c_i$.
Interpretations: this gives alternative interpretations for OWL. $\Omega^F_\infty$ has undesired extreme points [Bach, 2011], which explains the piecewise-constant solutions of OWL. It also motivates studying $\Omega^F_p(w)$ for $p < \infty$.

$\Omega_S$: Smoothed OWL (SOWL)

SOWL: Definition. Smoothed OWL: $\Omega_S(w) := \Omega^F_2(w)$.

SOWL: Definition. Smoothed OWL: $\Omega_S(w) := \Omega^F_2(w)$. Variational form for $\Omega^F_2$ [Obozinski and Bach, 2012]:
$\Omega_S(w) = \min_{\eta \in \mathbb{R}^d_+} \frac{1}{2}\left(\sum_{i=1}^d \frac{w_i^2}{\eta_i} + f(\eta)\right)$.

SOWL: Definition. Smoothed OWL: $\Omega_S(w) := \Omega^F_2(w)$. Variational form for $\Omega^F_2$ [Obozinski and Bach, 2012]:
$\Omega_S(w) = \min_{\eta \in \mathbb{R}^d_+} \frac{1}{2}\left(\sum_{i=1}^d \frac{w_i^2}{\eta_i} + f(\eta)\right)$.
Using the OWL equivalence, $f(\eta) = \Omega^F_\infty(\eta) = \sum_{i=1}^d c_i \eta_{(i)}$ for $\eta \ge 0$, so
$\Omega_S(w) = \min_{\eta \in \mathbb{R}^d_+} \underbrace{\frac{1}{2}\sum_{i=1}^d \left(\frac{w_i^2}{\eta_i} + c_i \eta_{(i)}\right)}_{\Psi(w,\eta)}$.   (SOWL)

OWL vs SOWL. Case $c = \mathbf{1}_d$: $\Omega_S(w) = \|w\|_1$ and $\Omega_O(w) = \|w\|_1$. Case $c = [1, 0, \dots, 0]$: $\Omega_S(w) = \|w\|_2$ and $\Omega_O(w) = \|w\|_\infty$.

OWL vs SOWL. Case $c = \mathbf{1}_d$: $\Omega_S(w) = \|w\|_1$ and $\Omega_O(w) = \|w\|_1$. Case $c = [1, 0, \dots, 0]$: $\Omega_S(w) = \|w\|_2$ and $\Omega_O(w) = \|w\|_\infty$. [Figure: norm balls of OWL and SOWL for different values of $c$.]
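As a sanity check on the two special cases above, the SOWL variational form can be minimized numerically over $\eta$. This is a brute-force sketch (the objective is convex but non-smooth, so L-BFGS-B only returns an approximate value), not the closed form or the algorithm from the paper; `sowl_norm` is a hypothetical helper name.

```python
import numpy as np
from scipy.optimize import minimize

def sowl_norm(w, c, eps=1e-9):
    """Numerically evaluate Omega_S(w) = min_{eta >= 0} 0.5 * sum_i (w_i^2/eta_i + c_i * eta_(i)).
       Brute-force sketch; the sorted term makes the objective non-smooth, so the
       result is only approximate."""
    w, c = np.asarray(w, float), np.asarray(c, float)

    def objective(eta):
        eta_sorted = np.sort(eta)[::-1]              # eta_(1) >= ... >= eta_(d)
        return 0.5 * (np.sum(w**2 / eta) + np.dot(c, eta_sorted))

    res = minimize(objective, np.abs(w) + 1.0, bounds=[(eps, None)] * len(w))
    return res.fun

w = np.array([3.0, -1.0, 2.0])
print(sowl_norm(w, np.ones(3)), np.sum(np.abs(w)))                  # c = 1_d  -> ~||w||_1
print(sowl_norm(w, np.array([1.0, 0.0, 0.0])), np.linalg.norm(w))   # c = e_1  -> ~||w||_2
```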

Group Lasso and $\Omega_S$. SOWL objective after eliminating $\eta$:
$\Omega_S(w) = \sum_{j=1}^k \|w_{G_j}\|_2 \sqrt{\sum_{i \in G_j} c_i}$.
Let $\eta_w$ denote the optimal $\eta$ for a given $w$.

Group Lasso and $\Omega_S$. SOWL objective after eliminating $\eta$:
$\Omega_S(w) = \sum_{j=1}^k \|w_{G_j}\|_2 \sqrt{\sum_{i \in G_j} c_i}$.
Let $\eta_w$ denote the optimal $\eta$ for a given $w$. Key differences from Group Lasso: the groups are defined through $\eta_w = [\underbrace{\delta_1, \dots, \delta_1}_{G_1}, \dots, \underbrace{\delta_k, \dots, \delta_k}_{G_k}]$ and are influenced by the choice of $c$.

Open Questions: 1. Does $\Omega_S$ promote grouping of correlated variables like $\Omega_O$ does? Are there any benefits over $\Omega_O$?

Open Questions: 1. Does $\Omega_S$ promote grouping of correlated variables like $\Omega_O$ does? Are there any benefits over $\Omega_O$? 2. Is using $\Omega_S$ computationally feasible?

Open Questions: 1. Does $\Omega_S$ promote grouping of correlated variables like $\Omega_O$ does? Are there any benefits over $\Omega_O$? 2. Is using $\Omega_S$ computationally feasible? 3. What are the theoretical properties of $\Omega_S$ compared to the Group Lasso?

Grouping property of $\Omega_S$: Statement. Learning problem (LS-SOWL):
$\min_{w \in \mathbb{R}^d,\ \eta \in \mathbb{R}^d_+} \underbrace{\frac{1}{2n}\|Xw - y\|_2^2 + \frac{\lambda}{2}\sum_{i=1}^d\left(\frac{w_i^2}{\eta_i} + c_i \eta_{(i)}\right)}_{\Gamma^{(\lambda)}(w,\eta)}$.

Grouping property of $\Omega_S$: Statement. Learning problem (LS-SOWL):
$\min_{w \in \mathbb{R}^d,\ \eta \in \mathbb{R}^d_+} \underbrace{\frac{1}{2n}\|Xw - y\|_2^2 + \frac{\lambda}{2}\sum_{i=1}^d\left(\frac{w_i^2}{\eta_i} + c_i \eta_{(i)}\right)}_{\Gamma^{(\lambda)}(w,\eta)}$.
Theorem [Sankaran et al., 2017]. Define $(\hat{w}^{(\lambda)}, \hat{\eta}^{(\lambda)}) = \arg\min_{w,\eta} \Gamma^{(\lambda)}(w, \eta)$, $\rho_{ij} = x_i^\top x_j$, and $\bar{c} = \min_i (c_i - c_{i+1})$.

Grouping property of $\Omega_S$: Statement. Learning problem (LS-SOWL):
$\min_{w \in \mathbb{R}^d,\ \eta \in \mathbb{R}^d_+} \underbrace{\frac{1}{2n}\|Xw - y\|_2^2 + \frac{\lambda}{2}\sum_{i=1}^d\left(\frac{w_i^2}{\eta_i} + c_i \eta_{(i)}\right)}_{\Gamma^{(\lambda)}(w,\eta)}$.
Theorem [Sankaran et al., 2017]. Define $(\hat{w}^{(\lambda)}, \hat{\eta}^{(\lambda)}) = \arg\min_{w,\eta} \Gamma^{(\lambda)}(w, \eta)$, $\rho_{ij} = x_i^\top x_j$, and $\bar{c} = \min_i (c_i - c_{i+1})$. Then there exists $0 \le \lambda_0 \le \frac{\|y\|_2}{\bar{c}}\,(4 - 4\rho_{ij}^2)^{1/4}$ such that for all $\lambda > \lambda_0$, $\hat{\eta}^{(\lambda)}_i = \hat{\eta}^{(\lambda)}_j$.

Grouping property of $\Omega_S$: Interpretation. 1. Variables $\eta_i, \eta_j$ are grouped if $\rho_{ij} \approx 1$ (even for small $\lambda$); similar to Figueiredo and Nowak [2016, Theorem 1], which holds for the absolute values of $w$.

Grouping property of $\Omega_S$: Interpretation. 1. Variables $\eta_i, \eta_j$ are grouped if $\rho_{ij} \approx 1$ (even for small $\lambda$); similar to Figueiredo and Nowak [2016, Theorem 1], which holds for the absolute values of $w$. 2. SOWL differentiates the grouping variable $\eta$ from the model variable $w$.

Grouping property of $\Omega_S$: Interpretation. 1. Variables $\eta_i, \eta_j$ are grouped if $\rho_{ij} \approx 1$ (even for small $\lambda$); similar to Figueiredo and Nowak [2016, Theorem 1], which holds for the absolute values of $w$. 2. SOWL differentiates the grouping variable $\eta$ from the model variable $w$. 3. It allows model variance within a group: $\hat{w}^{(\lambda)}_i \ne \hat{w}^{(\lambda)}_j$ is possible as long as $c$ has distinct values.

Illustration: Group Discovery using SOWL. Aim: illustrate group discovery by SOWL. Take $z \in \mathbb{R}^d$, compute $\mathrm{prox}_{\lambda\Omega_S}(z)$, and study the regularization path.

Illustration: Group Discovery using SOWL. Aim: illustrate group discovery by SOWL. Take $z \in \mathbb{R}^d$, compute $\mathrm{prox}_{\lambda\Omega_S}(z)$, and study the regularization path. [Figure: original signal.]

Illustration: Group Discovery using SOWL. Aim: illustrate group discovery by SOWL. Take $z \in \mathbb{R}^d$, compute $\mathrm{prox}_{\lambda\Omega_S}(z)$, and study the regularization path. [Figure: original signal.] [Figure: regularization paths ($x$-axis: $\lambda$, $y$-axis: $\hat{w}$) for Lasso $\hat{w}$, OWL $\hat{w}$, SOWL $\hat{w}$, and SOWL $\eta$.] Observations: early group discovery; model variation within groups.

Proximal Methods: A brief overview. Proximal operator: $\mathrm{prox}_\Omega(z) = \arg\min_w \tfrac{1}{2}\|w - z\|_2^2 + \Omega(w)$. Easy to evaluate for many simple norms, e.g. $\mathrm{prox}_{\lambda\ell_1}(z) = \mathrm{sign}(z)\,(|z| - \lambda)_+$. Generalizes projected gradient descent.
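Soft thresholding, the prox of $\lambda\|\cdot\|_1$, in a couple of lines; `prox_l1` is a hypothetical helper name.

```python
import numpy as np

def prox_l1(z, lam):
    """Soft thresholding: prox of lam*||.||_1, i.e. sign(z) * (|z| - lam)_+."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

print(prox_l1(np.array([3.0, -0.5, 1.2]), lam=1.0))   # -> [ 2.  -0.   0.2]
```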

FISTA [Beck and Teboulle, 2009]. Initialization: $t^{(1)} = 1$, $w^{(1)} = \tilde{w}^{(1)} = 0$. Steps, for $k > 1$:
$w^{(k)} = \mathrm{prox}_{\Omega}\big(\tilde{w}^{(k-1)} - \tfrac{1}{L}\nabla f(\tilde{w}^{(k-1)})\big)$,
$t^{(k)} = \big(1 + \sqrt{1 + 4\,(t^{(k-1)})^2}\big)/2$,
$\tilde{w}^{(k)} = w^{(k)} + \frac{t^{(k-1)} - 1}{t^{(k)}}\,(w^{(k)} - w^{(k-1)})$.
Guarantee: convergence rate $O(1/T^2)$, with no assumptions beyond those of IST; known to be optimal for this class of minimization problems.
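A minimal FISTA sketch specialized to the Lasso objective $\tfrac{1}{2}\|Xw - y\|_2^2 + \lambda\|w\|_1$, with step size $1/L$ and $L = \|X\|_2^2$; `fista_lasso` is a hypothetical name, not the talk's code.

```python
import numpy as np

def fista_lasso(X, y, lam, L=None, T=500):
    """FISTA (Beck & Teboulle, 2009) for 0.5*||Xw - y||_2^2 + lam*||w||_1.
       Minimal sketch: L is the Lipschitz constant of the smooth gradient."""
    n, d = X.shape
    if L is None:
        L = np.linalg.norm(X, 2) ** 2              # largest eigenvalue of X^T X
    w = np.zeros(d)
    w_tilde = w.copy()
    t = 1.0
    for _ in range(T):
        grad = X.T @ (X @ w_tilde - y)             # gradient of the smooth part at w_tilde
        z = w_tilde - grad / L
        w_new = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # prox of (lam/L)*||.||_1
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t**2)) / 2.0
        w_tilde = w_new + ((t - 1.0) / t_new) * (w_new - w)        # momentum step
        w, t = w_new, t_new
    return w
```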

Computing $\mathrm{prox}_{\Omega_S}$. Problem: $\mathrm{prox}_{\lambda\Omega}(z) = \arg\min_w \tfrac{1}{2}\|w - z\|_2^2 + \lambda\,\Omega(w)$. Write $w^{(\lambda)} = \mathrm{prox}_{\lambda\Omega_S}(z)$ and $\eta^{(\lambda)}_w = \arg\min_\eta \Psi(w^{(\lambda)}, \eta)$.

Computing $\mathrm{prox}_{\Omega_S}$. Problem: $\mathrm{prox}_{\lambda\Omega}(z) = \arg\min_w \tfrac{1}{2}\|w - z\|_2^2 + \lambda\,\Omega(w)$. Write $w^{(\lambda)} = \mathrm{prox}_{\lambda\Omega_S}(z)$ and $\eta^{(\lambda)}_w = \arg\min_\eta \Psi(w^{(\lambda)}, \eta)$. Key idea: the ordering of $\eta^{(\lambda)}_w$ remains the same for all $\lambda$. [Figure: $\eta$ paths plotted against $\log(\lambda)$ and against $\lambda$.]

Computing $\mathrm{prox}_{\Omega_S}$. Problem: $\mathrm{prox}_{\lambda\Omega}(z) = \arg\min_w \tfrac{1}{2}\|w - z\|_2^2 + \lambda\,\Omega(w)$. Write $w^{(\lambda)} = \mathrm{prox}_{\lambda\Omega_S}(z)$ and $\eta^{(\lambda)}_w = \arg\min_\eta \Psi(w^{(\lambda)}, \eta)$. Key idea: the ordering of $\eta^{(\lambda)}_w$ remains the same for all $\lambda$, and $\eta^{(\lambda)}_w = (\eta_z - \lambda)_+$. The prox therefore has the same complexity as computing the norm $\Omega_S$, namely $O(d \log d)$, and this holds for all cardinality-based $F$.
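The talk's $O(d \log d)$ prox is for $\Omega_S$; for comparison, here is a hedged sketch of the standard prox of the non-smoothed OWL norm $\Omega_O$ in the spirit of Zeng and Figueiredo [2015]: sort $|z|$, subtract $\lambda c$, project onto the monotone non-increasing cone by isotonic regression, clip at zero, undo the sort, and restore the signs. `prox_owl` is a hypothetical helper name; scikit-learn's `isotonic_regression` is assumed available.

```python
import numpy as np
from sklearn.isotonic import isotonic_regression

def prox_owl(z, c, lam):
    """Prox of lam * Omega_O with non-increasing, non-negative weights c (sketch)."""
    sign = np.sign(z)
    v = np.abs(z)
    order = np.argsort(-v)                         # indices sorting |z| decreasingly
    # subtract the weights, project onto the non-increasing cone, clip at zero
    u = isotonic_regression(v[order] - lam * np.asarray(c), increasing=False)
    u = np.maximum(u, 0.0)
    out = np.empty_like(v)
    out[order] = u                                 # undo the sort
    return sign * out

z = np.array([0.3, -4.0, 3.9, 0.1])
c = np.array([2.0, 1.5, 1.0, 0.5])
print(prox_owl(z, c, lam=0.5))                     # the two large entries get tied magnitudes (~3.08)
```

In the example, the two near-tied large entries of $z$ are pulled to a common magnitude, which is the clustering behaviour of OWL discussed earlier.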

Random Design. Problem setting: LS-SOWL. True model: $y = Xw^* + \varepsilon$, with rows $(X)_i \sim \mathcal{N}(\mu, \Sigma)$ and $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$. Notation: $J = \{i : w^*_i \ne 0\}$; $\eta_{w^*} = [\underbrace{\delta^*_1, \dots, \delta^*_1}_{G_1}, \dots, \underbrace{\delta^*_k, \dots, \delta^*_k}_{G_k}]$; $D$ is a diagonal matrix defined using the groups $G_1, \dots, G_k$ and $w^*$.

Random Design. Problem setting: LS-SOWL. True model: $y = Xw^* + \varepsilon$, with rows $(X)_i \sim \mathcal{N}(\mu, \Sigma)$ and $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$. Notation: $J = \{i : w^*_i \ne 0\}$; $\eta_{w^*} = [\underbrace{\delta^*_1, \dots, \delta^*_1}_{G_1}, \dots, \underbrace{\delta^*_k, \dots, \delta^*_k}_{G_k}]$; $D$ is a diagonal matrix defined using the groups $G_1, \dots, G_k$ and $w^*$. Assumptions: $\Sigma_{J,J}$ is invertible, $\lambda \to 0$, and $\lambda\sqrt{n} \to \infty$.

Random Design. Problem setting: LS-SOWL. True model: $y = Xw^* + \varepsilon$, with rows $(X)_i \sim \mathcal{N}(\mu, \Sigma)$ and $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$. Notation: $J = \{i : w^*_i \ne 0\}$; $\eta_{w^*} = [\underbrace{\delta^*_1, \dots, \delta^*_1}_{G_1}, \dots, \underbrace{\delta^*_k, \dots, \delta^*_k}_{G_k}]$; $D$ is a diagonal matrix defined using the groups $G_1, \dots, G_k$ and $w^*$. Assumptions: $\Sigma_{J,J}$ is invertible, $\lambda \to 0$, and $\lambda\sqrt{n} \to \infty$. Irrepresentability conditions: 1. $\delta^*_k = 0$ if $G_k \subseteq J^c$. 2. $\|\Sigma_{J^c,J}(\Sigma_{J,J})^{-1} D\, w^*_J\|_2 \le \beta < 1$.

Random Design. Problem setting: LS-SOWL. True model: $y = Xw^* + \varepsilon$, with rows $(X)_i \sim \mathcal{N}(\mu, \Sigma)$ and $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$. Notation: $J = \{i : w^*_i \ne 0\}$; $\eta_{w^*} = [\underbrace{\delta^*_1, \dots, \delta^*_1}_{G_1}, \dots, \underbrace{\delta^*_k, \dots, \delta^*_k}_{G_k}]$; $D$ is a diagonal matrix defined using the groups $G_1, \dots, G_k$ and $w^*$. Assumptions: $\Sigma_{J,J}$ is invertible, $\lambda \to 0$, and $\lambda\sqrt{n} \to \infty$. Irrepresentability conditions: 1. $\delta^*_k = 0$ if $G_k \subseteq J^c$. 2. $\|\Sigma_{J^c,J}(\Sigma_{J,J})^{-1} D\, w^*_J\|_2 \le \beta < 1$. Result [Sankaran et al., 2017]: 1. $\hat{w} \xrightarrow{p} w^*$. 2. $P(\hat{J} = J) \to 1$.

Random Design. Problem setting: LS-SOWL. True model: $y = Xw^* + \varepsilon$, with rows $(X)_i \sim \mathcal{N}(\mu, \Sigma)$ and $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$. Notation: $J = \{i : w^*_i \ne 0\}$; $\eta_{w^*} = [\underbrace{\delta^*_1, \dots, \delta^*_1}_{G_1}, \dots, \underbrace{\delta^*_k, \dots, \delta^*_k}_{G_k}]$; $D$ is a diagonal matrix defined using the groups $G_1, \dots, G_k$ and $w^*$. Assumptions: $\Sigma_{J,J}$ is invertible, $\lambda \to 0$, and $\lambda\sqrt{n} \to \infty$. Irrepresentability conditions: 1. $\delta^*_k = 0$ if $G_k \subseteq J^c$. 2. $\|\Sigma_{J^c,J}(\Sigma_{J,J})^{-1} D\, w^*_J\|_2 \le \beta < 1$. Result [Sankaran et al., 2017]: 1. $\hat{w} \xrightarrow{p} w^*$. 2. $P(\hat{J} = J) \to 1$. This is similar to the Group Lasso [Bach, 2008, Theorem 2], but the weights are learned without explicit group information.

Quantitative Simulation: Predictive Accuracy. Aim: learn $\hat{w}$ using LS-SOWL and evaluate the prediction error. (The experiments follow the setup of Bondell and Reich [2008].)

Quantitative Simulation: Predictive Accuracy. Aim: learn $\hat{w}$ using LS-SOWL and evaluate the prediction error. Generate samples: $x \sim \mathcal{N}(0, \Sigma)$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$, $y = x^\top w^* + \epsilon$.

Quantitative Simulation: Predictive Accuracy. Aim: learn $\hat{w}$ using LS-SOWL and evaluate the prediction error. Generate samples: $x \sim \mathcal{N}(0, \Sigma)$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$, $y = x^\top w^* + \epsilon$. Metric: $\mathbb{E}[(x^\top(w^* - \hat{w}))^2] = (w^* - \hat{w})^\top \Sigma\,(w^* - \hat{w})$.

Quantitative Simulation: Predictive Accuracy. Aim: learn $\hat{w}$ using LS-SOWL and evaluate the prediction error. Generate samples: $x \sim \mathcal{N}(0, \Sigma)$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$, $y = x^\top w^* + \epsilon$. Metric: $\mathbb{E}[(x^\top(w^* - \hat{w}))^2] = (w^* - \hat{w})^\top \Sigma\,(w^* - \hat{w})$. Data: $w^* = [0_{10}, 2_{10}, 0_{10}, 2_{10}]$, $n = 100$, $\sigma = 15$, and $\Sigma_{ij} = 0.5$ for $i \ne j$, $1$ for $i = j$.

Quantitative Simulation: Predictive Accuracy. Aim: learn $\hat{w}$ using LS-SOWL and evaluate the prediction error. Generate samples: $x \sim \mathcal{N}(0, \Sigma)$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$, $y = x^\top w^* + \epsilon$. Metric: $\mathbb{E}[(x^\top(w^* - \hat{w}))^2] = (w^* - \hat{w})^\top \Sigma\,(w^* - \hat{w})$. Data: $w^* = [0_{10}, 2_{10}, 0_{10}, 2_{10}]$, $n = 100$, $\sigma = 15$, and $\Sigma_{ij} = 0.5$ for $i \ne j$, $1$ for $i = j$. Models with group variance: measure $\mathbb{E}[(x^\top(\tilde{w} - \hat{w}))^2]$ with $\tilde{w} = w^* + \epsilon$, $\epsilon \sim U[-\tau, \tau]$, $\tau \in \{0, 0.2, 0.4\}$.
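A sketch of this data-generating process and metric (not the authors' code); the seed and the trivial-estimator check at the end are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, sigma, tau = 100, 40, 15.0, 0.2

# w* = [0_10, 2_10, 0_10, 2_10]; Sigma has unit diagonal and 0.5 off-diagonal
w_star = np.concatenate([np.zeros(10), 2 * np.ones(10), np.zeros(10), 2 * np.ones(10)])
Sigma = 0.5 * np.ones((d, d)) + 0.5 * np.eye(d)

X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
y = X @ w_star + sigma * rng.normal(size=n)

# perturbed model for the "group variance" setting
w_tilde = w_star + rng.uniform(-tau, tau, size=d)

def prediction_error(w_hat, w_ref=w_star):
    """E[(x^T (w_ref - w_hat))^2] = (w_ref - w_hat)^T Sigma (w_ref - w_hat)."""
    delta = w_ref - w_hat
    return float(delta @ Sigma @ delta)

print(prediction_error(np.zeros(d)))      # error of the trivial all-zero estimator
```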

Predictive accuracy results (each cell reports values for $\tau = 0 / 0.2 / 0.4$):

Algorithm  | Median MSE         | MSE (10th perc.)   | MSE (90th perc.)
LASSO      | 46.1 / 45.2 / 45.5 | 32.8 / 32.7 / 33.2 | 60.0 / 61.5 / 61.4
OWL        | 27.6 / 27.0 / 26.4 | 19.8 / 19.2 / 19.2 | 42.7 / 40.4 / 39.2
El. Net    | 30.8 / 30.7 / 30.6 | 21.9 / 22.6 / 23.0 | 42.4 / 43.0 / 41.4
$\Omega_S$ | 23.9 / 23.3 / 23.4 | 16.9 / 16.8 / 16.8 | 35.2 / 35.4 / 33.2

Summary. 1. Proposed a new family of norms $\Omega_S$. 2. Properties: equivalent to OWL in group identification; efficient computational tools; equivalences to the Group Lasso. 3. Illustrated performance through simulations.

Questions?

Thank you!!!

References

F. Bach. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179-1225, 2008.

F. Bach. Structured sparsity-inducing norms through submodular functions. In Neural Information Processing Systems, 2010.

F. Bach. Learning with submodular functions: A convex optimization perspective. Foundations and Trends in Machine Learning, 6(2-3):145-373, 2011.

A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183-202, 2009.

H. D. Bondell and B. J. Reich. Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics, 64:115-123, 2008.

P. Buhlmann, P. Rütimann, S. van de Geer, and C.-H. Zhang. Correlated variables in regression: Clustering and sparse estimation. Journal of Statistical Planning and Inference, 143:1835-1858, 2013.

A. Chambolle and T. Pock. A first-order primal-dual algorithm for convex problems with applications to imaging. Journal of Mathematical Imaging and Vision, 40(1):120-145, 2011.

M. A. T. Figueiredo and R. D. Nowak. Ordered weighted l1 regularized regression with strongly correlated covariates: Theoretical aspects. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.

M. Hebiri and J. Lederer. How correlations influence lasso prediction. IEEE Transactions on Information Theory, 59(3):1846-1854, 2013.

G. Obozinski and F. Bach. Convex relaxation for combinatorial penalties. Technical Report 00694765, HAL, 2012.

R. Sankaran, F. Bach, and C. Bhattacharya. Identifying groups of strongly correlated variables through smoothed ordered weighted l1-norms. In Artificial Intelligence and Statistics, pages 1123-1131, 2017.

R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267-288, 1994.

M. J. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using l1-constrained quadratic programming (Lasso). IEEE Transactions on Information Theory, 55(5):2183-2202, 2009.

X. Zeng and M. Figueiredo. The ordered weighted l1 norm: Atomic formulation, projections, and algorithms. arXiv preprint 1409.4271v5, 2015.