RegML 2017 @ SIMULA, Oslo. Class 6: Sparsity-based regularization. Lorenzo Rosasco (UNIGE, MIT, IIT). May 4, 2017.
Learning from data is possible only under assumptions, enforced through regularization: $\min_w \hat{\mathcal{E}}(w) + \lambda r(w)$. Typical assumptions: smoothness, sparsity.
Sparsity: the function of interest depends only on a few building blocks.
Why sparsity? Interpretability, high-dimensional statistics, compression.
What is sparsity? For $f(x) = \sum_{j=1}^{d} x^j w_j$, sparse coefficients means that only a few $w_j \neq 0$.
Sparsity and dictionaries. More generally, consider $f(x) = \sum_{j=1}^{p} \varphi_j(x)\, w_j$, with $\varphi_1, \dots, \varphi_p$ the atoms of a dictionary. The notion of sparsity depends on the considered dictionary.
Linear inverse problems: $n < d$, i.e. more variables than observations.
Sparse regularization: $\min_w \frac{1}{n}\|\hat X w - \hat y\|^2 + \lambda \|w\|_0$, where the $\ell_0$-norm is $\|w\|_0 = \sum_{j=1}^{d} \mathbf{1}_{\{w_j \neq 0\}}$.
Best subset selection: $\min_w \frac{1}{n}\|\hat X w - \hat y\|^2 + \lambda \|w\|_0$ is as hard as trying all possible subsets of variables. Two families of workarounds: 1. greedy methods, 2. convex relaxations.
Greedy methods: initialize; then select a variable, compute the solution, update, and repeat.
Matching pursuit. Initialize $r_0 = \hat y$, $w_0 = 0$, $I_0 = \emptyset$. For $i = 1$ to $T$:
1. Let $\hat X_j = \hat X e_j$ and select the $j \in \{1,\dots,d\}$ maximizing $a_j = v_j^2 \|\hat X_j\|^2$, with $v_j = \frac{r_{i-1}^\top \hat X_j}{\|\hat X_j\|^2}$
2. $I_i = I_{i-1} \cup \{j\}$
3. $w_i = w_{i-1} + v_j e_j$
4. $r_i = \hat y - \hat X w_i$ (equivalently, $r_i = r_{i-1} - v_j \hat X_j$)
Note that $v_j = \arg\min_{v \in \mathbb{R}} \|\hat X_j v - r_{i-1}\|^2$ and $a_j = \|r_{i-1}\|^2 - \|\hat X_j v_j - r_{i-1}\|^2$, so the selected atom is the one that best fits the current residual.
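As a concrete reference, here is a minimal NumPy sketch of the matching pursuit loop above; the function name, the stopping rule (a fixed number of iterations T) and the variable names are my own choices, not prescribed by the slides.

```python
import numpy as np

def matching_pursuit(X, y, T):
    """Greedy matching pursuit: T rounds of selecting one column and updating one coefficient."""
    d = X.shape[1]
    w = np.zeros(d)
    r = y.copy()                                 # r_0 = y
    col_norms = np.sum(X ** 2, axis=0)           # ||X_j||^2 for each column j
    for _ in range(T):
        v = (X.T @ r) / col_norms                # v_j = <r, X_j> / ||X_j||^2
        a = v ** 2 * col_norms                   # a_j = v_j^2 ||X_j||^2
        j = int(np.argmax(a))                    # atom that best fits the current residual
        w[j] += v[j]                             # w_i = w_{i-1} + v_j e_j
        r = y - X @ w                            # r_i = y - X w_i
    return w
```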
Orthogonal matching pursuit. Initialize $r_0 = \hat y$, $w_0 = 0$, $I_0 = \emptyset$. For $i = 1$ to $T$:
1. Select the $j \in \{1,\dots,d\}$ maximizing $v_j^2 \|\hat X e_j\|^2$, with $v_j = \frac{r_{i-1}^\top \hat X e_j}{\|\hat X e_j\|^2}$
2. $I_i = I_{i-1} \cup \{j\}$
3. $w_i = \arg\min_w \|\hat X M_{I_i} w - \hat y\|^2$, where $(M_{I_i} w)_j = \delta_{j \in I_i}\, w_j$, i.e. only the coefficients indexed by $I_i$ are free
4. $r_i = \hat y - \hat X w_i$
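And a corresponding sketch of orthogonal matching pursuit, where at every step the coefficients on the selected set are refit by least squares (numpy.linalg.lstsq performs the restricted solve; again, the names and the fixed iteration budget are only illustrative).

```python
import numpy as np

def orthogonal_matching_pursuit(X, y, T):
    """OMP: greedily select columns, then refit all selected coefficients by least squares."""
    d = X.shape[1]
    w = np.zeros(d)
    r = y.copy()
    selected = []                                # active set I_i
    col_norms = np.sum(X ** 2, axis=0)
    for _ in range(T):
        v = (X.T @ r) / col_norms
        j = int(np.argmax(v ** 2 * col_norms))   # same selection rule as matching pursuit
        if j not in selected:
            selected.append(j)
        # Least-squares refit restricted to the selected columns.
        w_sel, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
        w = np.zeros(d)
        w[selected] = w_sel
        r = y - X @ w
    return w
```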
Convex relaxation: $\min_w \frac{1}{n}\|\hat X w - \hat y\|^2 + \lambda \|w\|_1$, with the $\ell_1$-norm $\|w\|_1 = \sum_{i=1}^{d} |w_i|$. Two aspects to discuss: modeling and optimization.
The problem of sparsity: $\min_w \|w\|_1$ subject to $\hat X w = \hat y$.
Ridge regression and sparsity: what if we replace $\|w\|_1$ with $\|w\|_2$?
Unlike ridge regression, $\ell_1$ regularization leads to sparse solutions!
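To see this difference empirically, here is a minimal sketch (assuming NumPy and scikit-learn are available; the data sizes and regularization strengths are made up for illustration) comparing how many non-zero coefficients ridge regression and the lasso return on the same sparse problem.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Hypothetical synthetic problem: n = 50 observations, d = 200 variables,
# and a ground-truth vector with only 5 non-zero coefficients.
rng = np.random.default_rng(0)
n, d, s = 50, 200, 5
X = rng.standard_normal((n, d))
w_true = np.zeros(d)
w_true[:s] = rng.standard_normal(s)
y = X @ w_true + 0.01 * rng.standard_normal(n)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

# Ridge shrinks the coefficients but leaves essentially all of them non-zero;
# the lasso sets most of them exactly to zero.
print("ridge non-zeros:", int(np.sum(np.abs(ridge.coef_) > 1e-8)))
print("lasso non-zeros:", int(np.sum(np.abs(lasso.coef_) > 1e-8)))
```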
Sparse regularization: $\min_w \frac{1}{n}\|\hat X w - \hat y\|^2 + \lambda \|w\|_1$. Called the Lasso or Basis Pursuit. Convex but not smooth.
Optimization: the problem could be solved via the subgradient method, but the objective function is composite: $\min_w \underbrace{\tfrac{1}{n}\|\hat X w - \hat y\|^2}_{\text{convex, smooth}} + \underbrace{\lambda \|w\|_1}_{\text{convex}}$.
Proximal methods for $\min_w E(w) + R(w)$. Let $\mathrm{Prox}_R(w) = \arg\min_v \tfrac{1}{2}\|v - w\|^2 + R(v)$ and, starting from $w_0 = 0$, iterate $w_t = \mathrm{Prox}_{\gamma R}\big(w_{t-1} - \gamma \nabla E(w_{t-1})\big)$.
Proximal methods (cont.). Consider $\min_w E(w) + R(w)$, with $R : \mathbb{R}^p \to \mathbb{R}$ convex and continuous, and $E : \mathbb{R}^p \to \mathbb{R}$ differentiable and convex, with Lipschitz continuous gradient: $\|\nabla E(w) - \nabla E(w')\| \le L \|w - w'\|$ (e.g. $\sup_w \|\underbrace{H(w)}_{\text{Hessian}}\| \le L$). Then, for $\gamma = 1/L$, the iteration $w_t = \mathrm{Prox}_{\gamma R}\big(w_{t-1} - \gamma \nabla E(w_{t-1})\big)$ converges to a minimizer of $E + R$.
Soft thresholding. For $R(w) = \lambda \|w\|_1$,
$(\mathrm{Prox}_{\lambda \|\cdot\|_1}(w))_j = \begin{cases} w_j - \lambda & w_j > \lambda \\ 0 & w_j \in [-\lambda, \lambda] \\ w_j + \lambda & w_j < -\lambda \end{cases}$
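A short NumPy sketch of this operator (the helper name soft_threshold is mine, not from the slides):

```python
import numpy as np

def soft_threshold(w, lam):
    """Component-wise proximal operator of lam * ||.||_1 (soft thresholding)."""
    # Shrink each coefficient towards zero by lam; anything in [-lam, lam] becomes exactly 0.
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

# Example: small coefficients are set exactly to zero.
print(soft_threshold(np.array([3.0, 0.5, -2.0, -0.1]), lam=1.0))  # [ 2.  0. -1. -0.]
```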
ISTA (iterative soft thresholding algorithm):
$w_{t+1} = \mathrm{Prox}_{\gamma\lambda\|\cdot\|_1}\big(w_t - \tfrac{2\gamma}{n}\hat X^\top(\hat X w_t - \hat y)\big)$, with
$(\mathrm{Prox}_{\gamma\lambda\|\cdot\|_1}(w))_j = \begin{cases} w_j - \gamma\lambda & w_j > \gamma\lambda \\ 0 & w_j \in [-\gamma\lambda, \gamma\lambda] \\ w_j + \gamma\lambda & w_j < -\gamma\lambda \end{cases}$
Small coefficients are set to zero!
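Putting the pieces together, a minimal sketch of the ISTA iteration for the lasso; the step-size choice $\gamma = 1/L$ and the fixed iteration count are illustrative defaults, not part of the slides.

```python
import numpy as np

def soft_threshold(w, lam):
    # Proximal operator of lam * ||.||_1, as in the previous sketch.
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def ista_lasso(X, y, lam, n_iter=500):
    """ISTA for (1/n) * ||Xw - y||^2 + lam * ||w||_1."""
    n, d = X.shape
    # Step size gamma = 1/L, with L the Lipschitz constant of the gradient
    # of the smooth data-fit term, here (2/n) * ||X^T X||.
    L = 2.0 / n * np.linalg.norm(X.T @ X, 2)
    gamma = 1.0 / L
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = 2.0 / n * X.T @ (X @ w - y)                   # gradient step on the smooth part
        w = soft_threshold(w - gamma * grad, gamma * lam)    # prox step on the l1 part
    return w
```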
Back to inverse problems: $\hat X w^* + \delta = \hat y$. If the $x_i$ are i.i.d. random and $n \gtrsim 2 s \log\frac{d}{s}$, where $s$ is the number of non-zero entries of $w^*$, then $\ell_1$ regularization recovers $w^*$.
Sampling theorem: for a signal with maximum frequency $\omega_0$, $2\omega_0$ samples are needed.
LASSO: $\min_w \frac{1}{n}\|\hat X w - \hat y\|^2 + \lambda \|w\|_1$. Interpretability: variable selection!
Variable selection and correlation: $\min_w \frac{1}{n}\|\hat X w - \hat y\|^2 + \lambda \|w\|_1$ is not strictly convex in general, and it cannot handle correlations between the variables.
Elastic net regularization: $\min_w \frac{1}{n}\|\hat X w - \hat y\|^2 + \lambda\big(\alpha \|w\|_1 + (1-\alpha)\|w\|^2\big)$.
ISTA for the elastic net:
$w_{t+1} = \mathrm{Prox}_{\gamma\lambda\alpha\|\cdot\|_1}\big(w_t - \tfrac{2\gamma}{n}\hat X^\top(\hat X w_t - \hat y) - \gamma\lambda(1-\alpha)\, w_t\big)$, with
$(\mathrm{Prox}_{\gamma\lambda\alpha\|\cdot\|_1}(w))_j = \begin{cases} w_j - \gamma\lambda\alpha & w_j > \gamma\lambda\alpha \\ 0 & w_j \in [-\gamma\lambda\alpha, \gamma\lambda\alpha] \\ w_j + \gamma\lambda\alpha & w_j < -\gamma\lambda\alpha \end{cases}$
Small coefficients are set to zero!
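Following the same pattern, a sketch of the elastic-net variant; here the gradient of the squared-norm penalty is written out in full, so the constant factors may differ slightly from the shorthand update above.

```python
import numpy as np

def soft_threshold(w, lam):
    # Proximal operator of lam * ||.||_1, as in the previous sketches.
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def ista_elastic_net(X, y, lam, alpha, n_iter=500):
    """ISTA for (1/n) * ||Xw - y||^2 + lam * (alpha * ||w||_1 + (1 - alpha) * ||w||^2)."""
    n, d = X.shape
    # Lipschitz constant of the smooth part (data fit + squared-norm penalty).
    L = 2.0 / n * np.linalg.norm(X.T @ X, 2) + 2.0 * lam * (1.0 - alpha)
    gamma = 1.0 / L
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = 2.0 / n * X.T @ (X @ w - y) + 2.0 * lam * (1.0 - alpha) * w
        w = soft_threshold(w - gamma * grad, gamma * lam * alpha)  # l1 prox with threshold gamma*lam*alpha
    return w
```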
Grouping effect: strong convexity $\Rightarrow$ all relevant (possibly correlated) variables are selected.
Elastic net and $\ell_p$ norms. [Figure: unit balls of $\tfrac{1}{2}\|w\|_1 + \tfrac{1}{2}\|w\|^2 = 1$ and of $\big(\sum_{j=1}^{d} |w_j|^p\big)^{1/p} = 1$.] $\ell_p$ norms are similar to the elastic net penalty, but they are smooth (no kink!).
This class: sparsity, geometry, computations, variable selection and the elastic net.
Next class: structured sparsity.