MIT 9.520/6.860, Fall 2018
Statistical Learning Theory and Applications
Class 08: Sparsity Based Regularization
Lorenzo Rosasco
Learning algorithms so far
ERM + explicit $\ell_2$ penalty:
$$\min_{w \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^{n} \ell(y_i, w^\top x_i) + \lambda \|w\|^2.$$
Implicit regularization by optimization. Regularization with projections/sketching. Non-linear extensions with features/kernels. What about other norms/penalties?
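As a reference point before moving to other penalties, the explicit $\ell_2$-penalized problem above (ridge regression, with the square loss) has a closed form via the normal equations. A minimal NumPy sketch, with illustrative names:

```python
import numpy as np

def ridge(X, Y, lam):
    """Solve min_w (1/n)||Xw - Y||^2 + lam * ||w||^2 in closed form.

    Setting the gradient to zero gives (X^T X / n + lam * I) w = X^T Y / n.
    """
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ Y / n)
```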
Sparsity
The function of interest depends on a few building blocks.
Why sparsity?
Interpretability. High-dimensional statistics, $n \ll d$. Compression.
What is sparsity?
$$f(x) = \sum_{j=1}^{d} x_j w_j$$
Sparse coefficients: few $w_j \neq 0$.
Sparsity and dictionaries
More generally, consider
$$f(x) = \sum_{j=1}^{p} \varphi_j(x) w_j$$
with $\varphi_1, \ldots, \varphi_p$ a dictionary.
Sparsity and dictionaries (cont.)
The concept of sparsity depends on the considered dictionary. If we let $(\varphi_j)_j, (\psi_j)_j$ be two dictionaries of linearly independent features such that
$$f(x) = \sum_j \varphi_j(x)\,\beta_j = \sum_j \psi_j(x)\, b_j,$$
then $\|f\| = \|\beta\| = \|b\|$. However, sparsity on $(\varphi_j)_j$ and $(\psi_j)_j$ can be very different!
Linear
We stick to linear functions for the sake of simplicity:
$$f(x) = \sum_{j=1}^{d} x_j w_j.$$
Given data, consider the linear system $Xw = \hat{Y}$.
Linear systems with sparsity
$Xw = \hat{Y}$, with $n \ll d$: there is a solution with $s \ll d$ non-zero entries, in unknown locations.
Best subset selection
Solve for all possible column subsets. A.k.a. torturing the data until they confess.
Sparse regularization
Best subset selection is equivalent to
$$\min_{w \in \mathbb{R}^d} \|w\|_0, \quad \text{subj. to } Xw = \hat{Y},$$
or
$$\min_{w \in \mathbb{R}^d} \frac{1}{n}\|Xw - \hat{Y}\|^2 + \lambda \|w\|_0,$$
with the $\ell_0$-norm
$$\|w\|_0 = \sum_{j=1}^{d} \mathbf{1}_{\{w_j \neq 0\}}.$$
Best subset selection
$$\min_{w \in \mathbb{R}^d} \|w\|_0, \quad \text{subj. to } Xw = \hat{Y}.$$
The problem is combinatorially hard. Approximate approaches include:
1. Greedy methods.
2. Convex relaxations.
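To see the combinatorial cost concretely, here is a brute-force sketch of best subset selection. It enumerates all column subsets of size at most $s$ and solves a least-squares problem for each, so it is only feasible for tiny $d$ (names and structure are mine):

```python
import numpy as np
from itertools import combinations

def best_subset(X, Y, s):
    """Brute-force best subset selection: one least-squares fit per subset.

    Enumerates all column subsets of size <= s; the number of fits grows
    combinatorially in d, which is why greedy/convex surrogates are used.
    """
    n, d = X.shape
    best_w, best_err = np.zeros(d), np.inf
    for k in range(1, s + 1):
        for subset in combinations(range(d), k):
            cols = list(subset)
            w_sub, *_ = np.linalg.lstsq(X[:, cols], Y, rcond=None)
            err = np.linalg.norm(X[:, cols] @ w_sub - Y) ** 2
            if err < best_err:
                best_err, best_w = err, np.zeros(d)
                best_w[cols] = w_sub
    return best_w
```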
Greedy methods
Initialize, then iterate: select a variable, compute the solution, update. Repeat.
Matching pursuit
$r_0 = \hat{Y}$, $w_0 = 0$, $I_0 = \emptyset$.
For $i = 1$ to $T$:
let $X^j = X e_j$, and select $j \in \{1,\ldots,d\}$ maximizing
$$a_j = v_j^2 \|X^j\|^2 \quad \text{with} \quad v_j = \frac{\langle r_{i-1}, X^j\rangle}{\|X^j\|^2},$$
$$I_i = I_{i-1} \cup \{j\}, \qquad w_i = w_{i-1} + v_j e_j,$$
$$r_i = r_{i-1} - X^j v_j = \hat{Y} - X w_i.$$
Note that $v_j = \operatorname{argmin}_{v \in \mathbb{R}} \|X^j v - r_{i-1}\|^2$, and $\|X^j v_j - r_{i-1}\|^2 = \|r_{i-1}\|^2 - a_j$.
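A minimal NumPy sketch of the pursuit above; the function name and interface are mine:

```python
import numpy as np

def matching_pursuit(X, Y, T):
    """Matching pursuit: T greedy steps on the residual."""
    d = X.shape[1]
    col_sq = (X ** 2).sum(axis=0)            # ||X^j||^2 for each column j
    w, r = np.zeros(d), Y.astype(float).copy()   # w_0 = 0, r_0 = Y_hat
    for _ in range(T):
        v = X.T @ r / col_sq                 # v_j = <r_{i-1}, X^j> / ||X^j||^2
        j = int(np.argmax(v ** 2 * col_sq))  # maximize a_j = v_j^2 ||X^j||^2
        w[j] += v[j]                         # w_i = w_{i-1} + v_j e_j
        r -= v[j] * X[:, j]                  # r_i = r_{i-1} - X^j v_j
    return w
```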
Orthogonal matching pursuit
$r_0 = \hat{Y}$, $w_0 = 0$, $I_0 = \emptyset$.
For $i = 1$ to $T$:
let $X^j = X e_j$, and select $j \in \{1,\ldots,d\}$ maximizing $a_j = v_j^2 \|X^j\|^2$ with $v_j = \frac{\langle r_{i-1}, X^j\rangle}{\|X^j\|^2}$,
$$I_i = I_{i-1} \cup \{j\},$$
$$w_i = \operatorname{argmin}_{w} \|X M_{I_i} w - \hat{Y}\|^2, \quad \text{where } (M_{I_i} w)_j = \delta_{j \in I_i}\, w_j,$$
$$r_i = \hat{Y} - X w_i.$$
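The same sketch adapted to OMP: the selection rule is unchanged, but at each step the coefficients are re-fit by least squares on all selected columns (again, names are illustrative):

```python
import numpy as np

def orthogonal_matching_pursuit(X, Y, T):
    """OMP: greedy selection, then least squares on the selected columns."""
    d = X.shape[1]
    col_sq = (X ** 2).sum(axis=0)
    w, r, I = np.zeros(d), Y.astype(float).copy(), []
    for _ in range(T):
        v = X.T @ r / col_sq
        j = int(np.argmax(v ** 2 * col_sq))
        if j not in I:
            I.append(j)                      # I_i = I_{i-1} U {j}
        w_sub, *_ = np.linalg.lstsq(X[:, I], Y, rcond=None)  # refit on I_i
        w = np.zeros(d)
        w[I] = w_sub
        r = Y - X @ w                        # r_i = Y_hat - X w_i
    return w
```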
Convex relaxation
Lasso (statistics) or Basis Pursuit (signal processing):
$$\min_{w \in \mathbb{R}^d} \frac{1}{n}\|Xw - \hat{Y}\|^2 + \lambda \|w\|_1,$$
with the $\ell_1$-norm
$$\|w\|_1 = \sum_{i=1}^{d} |w_i|.$$
Next, we discuss modeling + optimization aspects.
The geometry of sparsity
$$\min \|w\|_1, \quad \text{s.t. } Xw = \hat{Y}$$
Ridge regression and sparsity
Replace $\|w\|_1$ with $\|w\|$?
$\ell_1$ vs $\ell_2$
Unlike ridge regression, $\ell_1$ regularization leads to sparsity!
Optimization for sparse regularization
Convex but not smooth:
$$\min_{w \in \mathbb{R}^d} \frac{1}{n}\|Xw - \hat{Y}\|^2 + \lambda \|w\|_1$$
Optimization
Could be solved via the subgradient method. The objective function is composite:
$$\min_{w \in \mathbb{R}^d} \underbrace{\frac{1}{n}\|Xw - \hat{Y}\|^2}_{\text{convex, smooth}} + \underbrace{\lambda\|w\|_1}_{\text{convex}}$$
Proximal methods
$$\min_{w \in \mathbb{R}^d} E(w) + R(w)$$
Let
$$\operatorname{Prox}_R(w) = \operatorname{argmin}_{v \in \mathbb{R}^d} \frac{1}{2}\|v - w\|^2 + R(v)$$
and, for $w_0 = 0$,
$$w_t = \operatorname{Prox}_{\gamma R}\big(w_{t-1} - \gamma \nabla E(w_{t-1})\big).$$
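As a sketch, the generic iteration can be written as follows, with `grad_E` and `prox_R` supplied by the caller (the interface is mine):

```python
import numpy as np

def proximal_gradient(grad_E, prox_R, d, gamma, T):
    """Run w_t = prox_R(w_{t-1} - gamma * grad_E(w_{t-1}), gamma) from w_0 = 0."""
    w = np.zeros(d)
    for _ in range(T):
        w = prox_R(w - gamma * grad_E(w), gamma)
    return w
```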
Proximal Methods (cont.)
$$\min_{w \in \mathbb{R}^d} E(w) + R(w)$$
Let $R : \mathbb{R}^p \to \mathbb{R}$ be convex and continuous, and $E : \mathbb{R}^p \to \mathbb{R}$ differentiable, convex, and such that
$$\|\nabla E(w) - \nabla E(w')\| \le L\,\|w - w'\|$$
(e.g. $\sup_w \|\underbrace{H(w)}_{\text{hessian}}\| \le L$). Then, for $\gamma = 1/L$,
$$w_t = \operatorname{Prox}_{\gamma R}\big(w_{t-1} - \gamma \nabla E(w_{t-1})\big)$$
converges to a minimizer of $E + R$.
Soft thresholding
$R(w) = \lambda \|w\|_1$:
$$(\operatorname{Prox}_{\lambda\|\cdot\|_1}(w))_j = \begin{cases} w_j - \lambda & w_j > \lambda \\ 0 & w_j \in [-\lambda, \lambda] \\ w_j + \lambda & w_j < -\lambda \end{cases}$$
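In code, this prox is a one-liner; a minimal sketch (the function name is mine):

```python
import numpy as np

def soft_threshold(w, lam):
    """Prox of lam * ||.||_1: shrink each coordinate toward zero by lam."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)
```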
ISTA
$$w_{t+1} = \operatorname{Prox}_{\gamma\lambda\|\cdot\|_1}\Big(w_t - \gamma\,\frac{2}{n} X^\top(Xw_t - \hat{Y})\Big)$$
$$(\operatorname{Prox}_{\gamma\lambda\|\cdot\|_1}(w))_j = \begin{cases} w_j - \gamma\lambda & w_j > \gamma\lambda \\ 0 & w_j \in [-\gamma\lambda, \gamma\lambda] \\ w_j + \gamma\lambda & w_j < -\gamma\lambda \end{cases}$$
Small coefficients are set to zero!
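Putting the gradient step and the prox together gives a compact ISTA sketch for the lasso; the step size $\gamma = 1/L$ with $L = 2\|X\|^2/n$ follows the convergence condition above (names are mine):

```python
import numpy as np

def ista(X, Y, lam, T):
    """ISTA for the lasso: gradient step on the data term, then soft thresholding."""
    n, d = X.shape
    gamma = n / (2 * np.linalg.norm(X, 2) ** 2)   # gamma = 1/L, L = 2||X||_2^2 / n
    w = np.zeros(d)
    for _ in range(T):
        g = w - gamma * (2 / n) * (X.T @ (X @ w - Y))              # gradient step
        w = np.sign(g) * np.maximum(np.abs(g) - gamma * lam, 0.0)  # prox step
    return w
```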
Back to inverse problems
$$Xw = \hat{Y}$$
If the $x_i$ are i.i.d. Gaussian vectors, $\|w\|_0 = s$, and
$$n \gtrsim 2\, s \log\frac{d}{s},$$
then $\ell_1$ regularization recovers $w$ with high probability.
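A small simulation of this recovery phenomenon, using scikit-learn's Lasso as a stand-in for the $\ell_1$-regularized solver; the sample size, regularization level, and thresholds are ad hoc choices for illustration:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
d, s = 200, 5
n = int(2 * s * np.log(d / s)) + 10          # roughly the regime described above

X = rng.standard_normal((n, d))              # i.i.d. Gaussian measurements
w_true = np.zeros(d)
w_true[rng.choice(d, size=s, replace=False)] = rng.standard_normal(s)
Y = X @ w_true                               # noiseless observations

# A small lambda approximates the equality-constrained basis pursuit problem.
w_hat = Lasso(alpha=1e-3, fit_intercept=False, max_iter=100_000).fit(X, Y).coef_
print("true support:     ", sorted(np.flatnonzero(w_true)))
print("recovered support:", sorted(np.flatnonzero(np.abs(w_hat) > 1e-2)))
```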
Sampling theorem
Classically, $2\omega_0$ samples are needed (for a signal with maximal frequency $\omega_0$).
LASSO
$$\min_{w \in \mathbb{R}^d} \frac{1}{n}\|Xw - \hat{Y}\|^2 + \lambda \|w\|_1$$
Interpretability: variable selection!
Variable selection and correlation
$$\min_{w \in \mathbb{R}^d} \underbrace{\frac{1}{n}\|Xw - \hat{Y}\|^2 + \lambda\|w\|_1}_{\text{not strictly convex}}$$
Cannot handle correlations between the variables: the solution need not be unique, and among a group of correlated variables the lasso tends to select one arbitrarily.
Elastic net regularization
$$\min_{w \in \mathbb{R}^d} \frac{1}{n}\|Xw - \hat{Y}\|^2 + \lambda\big(\alpha\|w\|_1 + (1-\alpha)\|w\|^2\big)$$
ISTA for elastic net
$$w_{t+1} = \operatorname{Prox}_{\gamma\lambda\alpha\|\cdot\|_1}\Big(w_t - \gamma\,\frac{2}{n} X^\top(Xw_t - \hat{Y}) - \gamma\lambda(1-\alpha)\,w_t\Big)$$
$$(\operatorname{Prox}_{\gamma\lambda\alpha\|\cdot\|_1}(w))_j = \begin{cases} w_j - \gamma\lambda\alpha & w_j > \gamma\lambda\alpha \\ 0 & w_j \in [-\gamma\lambda\alpha, \gamma\lambda\alpha] \\ w_j + \gamma\lambda\alpha & w_j < -\gamma\lambda\alpha \end{cases}$$
Small coefficients are set to zero!
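A sketch of this variant, mirroring the update above: the $\ell_2$ part of the penalty enters the gradient step, the $\ell_1$ part enters the prox (names are mine):

```python
import numpy as np

def ista_elastic_net(X, Y, lam, alpha, gamma, T):
    """ISTA for elastic net: l2 term joins the gradient, l1 term joins the prox."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(T):
        g = (w - gamma * (2 / n) * (X.T @ (X @ w - Y))   # data-fit gradient step
             - gamma * lam * (1 - alpha) * w)            # l2 penalty term
        w = np.sign(g) * np.maximum(np.abs(g) - gamma * lam * alpha, 0.0)
    return w
```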
Grouping effect
Strong convexity $\Rightarrow$ all relevant (possibly correlated) variables are selected.
Elastic net and $\ell_p$ norms
[Figure: level sets of $\frac{1}{2}\|w\|_1 + \frac{1}{2}\|w\|^2 = 1$ and of $\big(\sum_{j=1}^d |w_j|^p\big)^{1/p} = 1$.]
$\ell_p$ norms are similar to the elastic net, but they are smooth (no kink!).
Summary
- Sparsity
- Geometry
- Computations
- Variable selection and elastic net