Oslo Class 6 Sparsity based regularization

1 Oslo Class 6: Sparsity based regularization. Lorenzo Rosasco, UNIGE-MIT-IIT. May 4, 2017

2 Learning from data
Possible only under assumptions: regularization
$$\min_w \hat{E}(w) + \lambda R(w)$$
Smoothness
Sparsity

3 Sparsity: the function of interest depends on few building blocks.

4 Why sparsity: interpretability, high dimensional statistics, compression.

5 What is sparsity?
$$f(x) = \sum_{j=1}^{d} x_j w_j$$
Sparse coefficients: few $w_j \neq 0$.

6 Sparsity and dictionaries
More generally consider
$$f(x) = \sum_{j=1}^{p} \phi_j(x) w_j$$
with $\phi_1, \dots, \phi_p$ atoms of a dictionary. The concept of sparsity depends on the considered dictionary.
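As a small illustration of a sparse expansion over a dictionary, the sketch below evaluates $f(x) = \sum_j \phi_j(x) w_j$ for a made-up dictionary of monomials. The names `atoms` and `evaluate` and the choice of monomial atoms are assumptions for the example, not part of the slides.

```python
import numpy as np

# Hypothetical dictionary: monomial atoms phi_j(x) = x**j for j = 0..3.
atoms = [lambda x, j=j: x ** j for j in range(4)]

def evaluate(x, atoms, w):
    """Compute f(x) = sum_j phi_j(x) * w_j, skipping zero coefficients."""
    return sum(w_j * phi(x) for phi, w_j in zip(atoms, w) if w_j != 0)

# Sparse coefficient vector: only 2 of the 4 atoms are used.
w = np.array([0.0, 2.0, 0.0, -1.0])
value = evaluate(2.0, atoms, w)  # 2 * 2 - 1 * 2**3
```

The same function would look dense in a different dictionary, which is the sense in which sparsity depends on the dictionary.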

7 Linear inverse problems: $n < d$, more variables than observations.

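8 Sparse regularization
$$\min_w \frac{1}{n}\|\hat{X}w - \hat{y}\|^2 + \lambda \|w\|_0$$
The $\ell_0$-norm takes the place of the ridge penalty $\|w\|_2^2$:
$$\|w\|_0 = \sum_{j=1}^{d} \mathbf{1}_{\{w_j \neq 0\}}$$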
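A minimal sketch of the $\ell_0$-penalized objective, assuming NumPy arrays for the data matrix and outputs. The function names `l0_norm` and `l0_objective` are illustrative choices.

```python
import numpy as np

def l0_norm(w):
    """Number of nonzero entries of w (called a norm by convention only)."""
    return int(np.count_nonzero(w))

def l0_objective(X, y, w, lam):
    """Value of (1/n) * ||X w - y||^2 + lam * ||w||_0."""
    n = X.shape[0]
    return np.sum((X @ w - y) ** 2) / n + lam * l0_norm(w)
```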

9 Best subset selection

10 Best subset selection
$$\min_w \frac{1}{n}\|\hat{X}w - \hat{y}\|^2 + \lambda \|w\|_0$$
As hard as trying all possible subsets...

11 Best subset selection
$$\min_w \frac{1}{n}\|\hat{X}w - \hat{y}\|^2 + \lambda \|w\|_0$$
As hard as trying all possible subsets.
1. Greedy methods
2. Convex relaxations
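To make the cost of "trying all possible subsets" concrete, here is a brute-force sketch that enumerates every subset of variables and solves a least-squares problem on each. It is only usable for very small $d$, since there are $2^d$ subsets. The function name `best_subset` is an assumption for illustration.

```python
import itertools
import numpy as np

def best_subset(X, y, lam):
    """Exhaustive search over supports for the l0-penalized least-squares problem."""
    n, d = X.shape
    best_w = np.zeros(d)
    best_obj = np.sum(y ** 2) / n  # objective of the empty subset (w = 0)
    for k in range(1, d + 1):
        for subset in itertools.combinations(range(d), k):
            cols = list(subset)
            # Least squares restricted to the chosen columns.
            coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            w = np.zeros(d)
            w[cols] = coef
            obj = np.sum((X @ w - y) ** 2) / n + lam * k
            if obj < best_obj:
                best_obj, best_w = obj, w
    return best_w, best_obj
```

The two nested loops visit all $2^d - 1$ nonempty subsets, which is what motivates the greedy methods and convex relaxations listed above.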

12 Greedy methods
Initialize, then: select a variable, compute the solution, update, repeat.

13-16 Matching pursuit
$r_0 = \hat{y}$, $w_0 = 0$, $I_0 = \emptyset$
for $i = 1$ to $T$
    Let $\hat{X}_j = \hat{X}e_j$, and select $j \in \{1, \dots, d\}$ maximizing [1]
    $$a_j = v_j^2 \|\hat{X}_j\|^2, \quad \text{with } v_j = \frac{r_{i-1}^\top \hat{X}_j}{\|\hat{X}_j\|^2}$$
    $I_i = I_{i-1} \cup \{j\}$
    $w_i = w_{i-1} + v_j e_j$
    $r_i = r_{i-1} - v_j \hat{X}_j = \hat{y} - \hat{X}w_i$
[1] Note that $v_j = \arg\min_{v \in \mathbb{R}} \|\hat{X}_j v - r_{i-1}\|^2$, and $\|\hat{X}_j v_j - r_{i-1}\|^2 = \|r_{i-1}\|^2 - a_j$.
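The matching pursuit iteration can be sketched in NumPy as follows. This is an illustrative implementation under stated assumptions: the data matrix has no all-zero columns, the residual is updated as $r_i = r_{i-1} - v_j \hat{X}_j$ (which equals $\hat{y} - \hat{X}w_i$), and the function name `matching_pursuit` is made up for the example.

```python
import numpy as np

def matching_pursuit(X, y, T):
    """Run T matching pursuit steps; returns (w, residual, selected index set)."""
    n, d = X.shape
    w = np.zeros(d)
    r = np.asarray(y, dtype=float).copy()   # r_0 = y
    selected = set()                        # I_0 = empty set
    col_sq = np.sum(X ** 2, axis=0)         # ||X_j||^2 for every column
    for _ in range(T):
        v = (X.T @ r) / col_sq              # v_j = <r, X_j> / ||X_j||^2
        a = v ** 2 * col_sq                 # decrease in squared residual norm
        j = int(np.argmax(a))
        selected.add(j)
        w[j] += v[j]                        # w_i = w_{i-1} + v_j e_j
        r = r - v[j] * X[:, j]              # r_i = r_{i-1} - v_j X_j
    return w, r, selected
```

A variable can be selected more than once, since only one coefficient is adjusted per step and earlier coefficients are not refitted.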

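17 Orthogonal matching pursuit
$r_0 = \hat{y}$, $w_0 = 0$, $I_0 = \emptyset$
for $i = 1$ to $T$
    Select $j \in \{1, \dots, d\}$ which maximizes
    $$v_j^2 \|\hat{X}e_j\|^2, \quad \text{with } v_j = \frac{r_{i-1}^\top \hat{X}e_j}{\|\hat{X}e_j\|^2}$$
    $I_i = I_{i-1} \cup \{j\}$
    $w_i = \arg\min_w \|\hat{X}M_{I_i}w - \hat{y}\|^2$, where $(M_{I_i}w)_j = \delta_{j \in I_i} w_j$
    $r_i = \hat{y} - \hat{X}w_i$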
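A sketch of orthogonal matching pursuit along the same lines. The selection rule is the one used by matching pursuit, but all coefficients on the selected set are refitted by least squares at every step. Assumptions: no all-zero columns, the residual is recomputed as $\hat{y} - \hat{X}w_i$, and the function name is illustrative.

```python
import numpy as np

def orthogonal_matching_pursuit(X, y, T):
    """Run T OMP steps; returns (w, residual, list of selected indices)."""
    n, d = X.shape
    y = np.asarray(y, dtype=float)
    w = np.zeros(d)
    r = y.copy()                            # r_0 = y
    selected = []                           # I_0 = empty set
    col_sq = np.sum(X ** 2, axis=0)         # ||X e_j||^2 for every column
    for _ in range(T):
        v = (X.T @ r) / col_sq
        j = int(np.argmax(v ** 2 * col_sq))
        if j not in selected:
            selected.append(j)
        # Refit all coefficients supported on the selected set.
        coef, *_ = np.linalg.lstsq(X[:, selected], y, rcond=None)
        w = np.zeros(d)
        w[selected] = coef
        r = y - X @ w                       # residual of the refitted solution
    return w, r, selected
```

After each refit the residual is orthogonal to the selected columns, which is where the method gets its name.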

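18 Convex relaxation
$$\min_w \frac{1}{n}\|\hat{X}w - \hat{y}\|^2 + \lambda \|w\|_1$$
The $\ell_1$-norm takes the place of the ridge penalty $\|w\|_2^2$:
$$\|w\|_1 = \sum_{i=1}^{d} |w_i|$$
Modeling
Optimization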

19 The problem of sparsity
$$\min_w \|w\|_1, \quad \text{s.t. } \hat{X}w = \hat{y}$$

20 Ridge regression and sparsity: replace $\|w\|_1$ with $\|w\|_2$?

21 Unlike ridge regression, $\ell_1$ regularization leads to sparsity!
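A one-dimensional worked example (an illustration, not taken from the slides) makes the contrast concrete. For a scalar target $a$ and $\lambda > 0$:

```latex
\begin{align*}
\text{ridge:}\quad & \min_{w \in \mathbb{R}} \; (w - a)^2 + \lambda w^2
  &&\Longrightarrow\quad w = \frac{a}{1 + \lambda}, \\
\ell_1\text{:}\quad & \min_{w \in \mathbb{R}} \; (w - a)^2 + \lambda |w|
  &&\Longrightarrow\quad w = \operatorname{sign}(a)\,\max\!\left(|a| - \tfrac{\lambda}{2},\, 0\right).
\end{align*}
```

The ridge solution shrinks $a$ but is nonzero whenever $a \neq 0$, while the $\ell_1$ solution is exactly zero as soon as $|a| \le \lambda/2$.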

22 Sparse regularization
$$\min_w \frac{1}{n}\|\hat{X}w - \hat{y}\|^2 + \lambda \|w\|_1$$
Called Lasso or Basis Pursuit.
Convex but not smooth.

23 Optimization
Could be solved via the subgradient method.
Objective function is composite:
$$\min_w \underbrace{\frac{1}{n}\|\hat{X}w - \hat{y}\|^2}_{\text{convex smooth}} + \underbrace{\lambda \|w\|_1}_{\text{convex}}$$

24 Proximal methods
$$\min_w E(w) + R(w)$$
Let
$$\mathrm{Prox}_R(w) = \arg\min_v \frac{1}{2}\|v - w\|^2 + R(v)$$
and, for $w_0 = 0$,
$$w_t = \mathrm{Prox}_{\gamma R}(w_{t-1} - \gamma \nabla E(w_{t-1}))$$

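25 Proximal methods (cont.)
$$\min_w E(w) + R(w)$$
Let $R : \mathbb{R}^p \to \mathbb{R}$ be convex and continuous, and $E : \mathbb{R}^p \to \mathbb{R}$ be differentiable, convex, and such that
$$\|\nabla E(w) - \nabla E(w')\| \le L \|w - w'\|$$
(e.g. $\sup_w \|H(w)\| \le L$, with $H$ the Hessian of $E$). Then, for $\gamma = 1/L$,
$$w_t = \mathrm{Prox}_{\gamma R}(w_{t-1} - \gamma \nabla E(w_{t-1}))$$
converges to a minimizer of $E + R$.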
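The proximal gradient iteration can be sketched generically, with the gradient of $E$ and the proximal operator of $\gamma R$ passed in as functions. The signature `prox(u, gamma)` and the toy problem below ($E(w) = \frac{1}{2}\|w - b\|^2$, $R(w) = \frac{\mu}{2}\|w\|^2$, for which $L = 1$ and $\mathrm{Prox}_{\gamma R}(u) = u / (1 + \gamma\mu)$) are assumptions chosen for illustration.

```python
import numpy as np

def proximal_gradient(grad_E, prox, w0, gamma, T):
    """Iterate w_t = prox(w_{t-1} - gamma * grad_E(w_{t-1}), gamma) for T steps."""
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(T):
        w = prox(w - gamma * grad_E(w), gamma)
    return w

# Toy problem (hypothetical): the minimizer of E + R is b / (1 + mu).
b = np.array([1.0, -2.0, 3.0])
mu = 0.5
w_hat = proximal_gradient(
    grad_E=lambda w: w - b,                         # gradient of 0.5 * ||w - b||^2
    prox=lambda u, gamma: u / (1.0 + gamma * mu),   # prox of gamma * (mu/2) * ||.||^2
    w0=np.zeros(3),
    gamma=1.0,                                      # gamma = 1 / L with L = 1
    T=50,
)
```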

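26 Soft thresholding
$R(w) = \lambda \|w\|_1$
$$(\mathrm{Prox}_{\lambda\|\cdot\|_1}(w))_j = \begin{cases} w_j - \lambda & w_j > \lambda \\ 0 & w_j \in [-\lambda, \lambda] \\ w_j + \lambda & w_j < -\lambda \end{cases}$$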
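The three cases of the soft-thresholding operator collapse into a single vectorized expression. A minimal sketch, with the function name `soft_threshold` chosen for illustration:

```python
import numpy as np

def soft_threshold(w, lam):
    """Componentwise prox of lam * ||.||_1: shrink toward zero by lam, clip at zero."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

shrunk = soft_threshold(np.array([3.0, 0.5, -2.0]), 1.0)
```

Here the entry with magnitude below the threshold is set exactly to zero, and the other two move toward zero by `lam`.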

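27 ISTA
$$w_{t+1} = \mathrm{Prox}_{\gamma\lambda\|\cdot\|_1}\left(w_t - \frac{2\gamma}{n}\hat{X}^\top(\hat{X}w_t - \hat{y})\right)$$
$$(\mathrm{Prox}_{\gamma\lambda\|\cdot\|_1}(w))_j = \begin{cases} w_j - \gamma\lambda & w_j > \gamma\lambda \\ 0 & w_j \in [-\gamma\lambda, \gamma\lambda] \\ w_j + \gamma\lambda & w_j < -\gamma\lambda \end{cases}$$
Small coefficients are set to zero!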
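A sketch of ISTA for the Lasso objective $\frac{1}{n}\|\hat{X}w - \hat{y}\|^2 + \lambda\|w\|_1$. The gradient of the smooth term is taken with its factor 2, i.e. $\frac{2}{n}\hat{X}^\top(\hat{X}w - \hat{y})$, and the step size is set to $\gamma = 1/L$ with $L = \frac{2}{n}\|\hat{X}\|^2$ (squared spectral norm), which is one admissible choice under the Lipschitz condition stated earlier. Function names and the default number of iterations are assumptions.

```python
import numpy as np

def soft_threshold(u, tau):
    """Componentwise prox of tau * ||.||_1."""
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def ista(X, y, lam, T=500):
    """Iterative soft thresholding for (1/n) * ||X w - y||^2 + lam * ||w||_1."""
    n, d = X.shape
    L = 2.0 * np.linalg.norm(X, 2) ** 2 / n   # Lipschitz constant of the gradient
    gamma = 1.0 / L
    w = np.zeros(d)                           # w_0 = 0
    for _ in range(T):
        grad = (2.0 / n) * X.T @ (X @ w - y)
        w = soft_threshold(w - gamma * grad, gamma * lam)
    return w
```

With an identity design matrix the problem decouples across coordinates, so the output can be checked by hand against soft thresholding of the observations at level $n\lambda/2$.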

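28 Back to inverse problems
$$\hat{X}w + \delta = \hat{y}$$
If the $x_i$ are i.i.d. random and
$$n \geq 2s \log\frac{d}{s},$$
where $s$ is the number of nonzero entries of $w$, then $\ell_1$ regularization recovers $w$.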
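For a sense of scale, the snippet below evaluates the sample-size condition for a concrete case. It assumes the condition reads $n \geq 2s\log(d/s)$ with the natural logarithm; the function name and the example values of $d$ and $s$ are made up for illustration.

```python
import math

def samples_needed(d, s):
    """Smallest integer n with n >= 2 * s * log(d / s) (natural log assumed)."""
    return math.ceil(2 * s * math.log(d / s))

n_min = samples_needed(d=1000, s=10)  # 2 * 10 * log(100) is about 92.1, so n_min = 93
```

Under this reading, a 10-sparse vector in 1000 dimensions needs on the order of a hundred random measurements rather than a thousand.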

29 Sampling theorem: $2\omega_0$ samples needed.

30 LASSO
$$\min_w \frac{1}{n}\|\hat{X}w - \hat{y}\|^2 + \lambda \|w\|_1$$
Interpretability: variable selection!

31 Variable selection and correlation
$$\min_w \underbrace{\frac{1}{n}\|\hat{X}w - \hat{y}\|^2 + \lambda \|w\|_1}_{\text{not strictly convex in general (e.g. } n < d\text{)}}$$
Cannot handle correlations between the variables.

32 Elastic net regularization
$$\min_w \frac{1}{n}\|\hat{X}w - \hat{y}\|^2 + \lambda\left(\alpha \|w\|_1 + (1 - \alpha)\|w\|_2^2\right)$$

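33 ISTA for elastic net
$$w_{t+1} = \mathrm{Prox}_{\gamma\lambda\alpha\|\cdot\|_1}\left(w_t - \frac{2\gamma}{n}\hat{X}^\top(\hat{X}w_t - \hat{y}) - 2\gamma\lambda(1 - \alpha)w_t\right)$$
$$(\mathrm{Prox}_{\gamma\lambda\alpha\|\cdot\|_1}(w))_j = \begin{cases} w_j - \gamma\lambda\alpha & w_j > \gamma\lambda\alpha \\ 0 & w_j \in [-\gamma\lambda\alpha, \gamma\lambda\alpha] \\ w_j + \gamma\lambda\alpha & w_j < -\gamma\lambda\alpha \end{cases}$$
Small coefficients are set to zero!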
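A sketch of ISTA for the elastic net, assuming the penalty $\lambda(\alpha\|w\|_1 + (1 - \alpha)\|w\|_2^2)$ with a squared $\ell_2$ term. Under that assumption the smooth part is the squared loss plus $\lambda(1-\alpha)\|w\|_2^2$, its gradient is $\frac{2}{n}\hat{X}^\top(\hat{X}w - \hat{y}) + 2\lambda(1-\alpha)w$, and only the $\ell_1$ term goes through the proximal step. Function names, the step-size choice $\gamma = 1/L$, and the default number of iterations are illustrative.

```python
import numpy as np

def soft_threshold(u, tau):
    """Componentwise prox of tau * ||.||_1."""
    return np.sign(u) * np.maximum(np.abs(u) - tau, 0.0)

def ista_elastic_net(X, y, lam, alpha, T=1000):
    """ISTA for (1/n)||Xw - y||^2 + lam * (alpha * ||w||_1 + (1 - alpha) * ||w||_2^2)."""
    n, d = X.shape
    ridge = lam * (1.0 - alpha)
    # Lipschitz constant of the gradient of the smooth part (loss + ridge term).
    L = 2.0 * np.linalg.norm(X, 2) ** 2 / n + 2.0 * ridge
    gamma = 1.0 / L
    w = np.zeros(d)
    for _ in range(T):
        grad = (2.0 / n) * X.T @ (X @ w - y) + 2.0 * ridge * w
        w = soft_threshold(w - gamma * grad, gamma * lam * alpha)
    return w
```

Setting `alpha=1.0` removes the ridge term and reduces the iteration to plain ISTA for the Lasso.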

34 Grouping effect
Strong convexity $\Rightarrow$ all relevant (possibly correlated) variables are selected.

35 Elastic net and $\ell_p$ norms
$$\|w\|_1 + \|w\|_2^2 = 1 \qquad \left(\sum_{j=1}^{d} |w_j|^p\right)^{1/p} = 1$$
$\ell_p$ norms are similar to elastic net but they are smooth (no kink!).

36 This class: sparsity, geometry, computations, variable selection and elastic net.

37 Next class: structured sparsity.
