Oslo Class 7 Structured sparsity

Size: px

Start display at page:

Download "Oslo Class 7 Structured sparsity"

Tyrone Fox
5 years ago
Views:

1 Oslo Class 7 Structured sparsity Lorenzo Rosasco UNIGE-MIT-IIT May 4, 2017

2 Exploiting structure Building blocks of a function can be more structured than single variables

3 Sparsity Variables divided in non-overlapping groups

4 Group sparsity f(x) = d j=1 w jx j w = (w 1,...,...,..., w }{{} d ) }{{} w(1) w(g) each group G g has size G g, so w(g) R Gg

5 Group sparsity regularization Regularization exploiting structure R group (w) = w(g) = Gg (w(g)) 2 j j=1

6 Group sparsity regularization Regularization exploiting structure R group (w) = w(g) = Gg (w(g)) 2 j j=1 Compare to G g w(g) 2 = (w(g)) 2 j j=1

7 Group sparsity regularization Regularization exploiting structure R group (w) = w(g) = Gg (w(g)) 2 j j=1 Compare to G g w(g) 2 = (w(g)) 2 j j=1 or G g w(g) 2 = (w(g)) j j=1

8 l 1 l 2 norm We take the l 2 norm of all the groups ( w(1),..., w(g) ) and then the l 1 norm of the above vector w(g)

9 Groups lasso min w 1 n ˆXw ŷ 2 + λ w(g) reduces to the Lasso if groups have cardinality one

10 Computations min w 1 n ˆXw ŷ 2 + λ w(g) }{{} non differentiable Convex, non-smooth, but composite structure w t+1 = Prox γλrgroup (w t γ 2 ) n ˆX ( ˆXw t ŷ)

11 Block thresholding It can be shown that Prox λrgroup (w) = (Prox λ (w(1)),..., Prox λ (w(g)) (Prox λ (w(g))) j = {w(g) j λ w(g)j w(g) w(g) > λ 0 w(g) λ Entire groups of coefficients set to zero! Reduces to softhresholding if groups have cardinality one

12 Other norms l 1 l p norms R(w) = w(g) p = j=1 G g (w(g)) p j 1 p

13 Overlapping groups Variables divided in possibly overlapping groups

14 Regularization with overlapping groups Group Lasso R GL (w) = w(g)

15 Regularization with overlapping groups Group Lasso R GL (w) = w(g) The selected variables are union of group complements

16 Regularization with overlapping groups Let w(g) R d be equal to w(g) on group G g and zero otherwise

17 Regularization with overlapping groups Let w(g) R d be equal to w(g) on group G g and zero otherwise Group Lasso with overlap { G } R GLO (w) = inf w(g) w(1),..., w(g) s.t. w = w(g)

18 Regularization with overlapping groups Let w(g) R d be equal to w(g) on group G g and zero otherwise Group Lasso with overlap { G } R GLO (w) = inf w(g) w(1),..., w(g) s.t. w = w(g) Multiple ways to write w = G w(g)

19 Regularization with overlapping groups Let w(g) R d be equal to w(g) on group G g and zero otherwise Group Lasso with overlap { G } R GLO (w) = inf w(g) w(1),..., w(g) s.t. w = w(g) Multiple ways to write w = G w(g) Selected variables are groups!

20 An equivalence It holds min w 1 n ˆXw ŷ λr GLO (w) min w n X w ŷ 2 + λ w(g) X is the matrix obtained by replicating columns/variables w = (w(1),..., w(g)), vector with (nonoverlapping!) groups

21 An equivalence (cont.) Indeed min w 1 n ˆXw ŷ 2 + λ w(1),...,w(g) s.t. inf w(1),...,w(g) s.t. G w(g)=w w(g) = 1 inf n ˆXw ŷ 2 + λ w(g) = G w(g)=w 1 inf w(1),...,w(g) n ˆX( 1 G inf w(1),...,w(g) n w(g)) ŷ 2 + λ w(g) = ˆX Gg w(g) ŷ 2 + λ w(g) = 1 min w n X w ŷ 2 + λ w(g)

22 Computations Can use block thresholding with replicated variables = potentially wasteful The proximal operator for R GLO can be computed efficiently but not in closed form

23 More structure Structured overlapping groups trees DAG... Structure can be exploited in computations...

24 Beyond linear models Consider a dictionary made by union of distinct dictionaries f(x) = G f g (x) }{{} = Φ g (x) w(g), where each dictionary defines a feature map Φ g (x) = (φ g 1 (x),..., φg p g (x)) Easy extension with usual change of variable...

25 Representer theorems Let f(x) = x ( w(g)) = x(g) w(g) = f g (x),

26 Representer theorems Let Idea Show that i.e. f g (x) = f(x) = x ( w(g)) = w(g) = x(g) w(g) = n x(g) i c(g) i, i=1 n x(g) x(g) i c(g) i = i=1 n f g (x), x(g) x(g) i }{{} i=1 Φ g(x) Φ g(x i)=k g(x,x i) c(g) i

27 Representer theorems Let Idea Show that i.e. f g (x) = f(x) = x ( w(g)) = w(g) = x(g) w(g) = n x(g) i c(g) i, i=1 n x(g) x(g) i c(g) i = i=1 Note that in this case n f g (x), x(g) x(g) i }{{} i=1 Φ g(x) Φ g(x i)=k g(x,x i) f g 2 = w(g) 2 = c(g) ˆX(g) ˆX(g) c(g) }{{} ˆK(g) c(g) i

28 Coefficients update c t+1 = Prox γλrgroup (c t γ( ˆKc ) t ŷ)) where ˆK = ( ˆK(1),..., ˆK(G)), and c t = (c t (1),..., c t (G)) Block Thresholding It can be shown that c(g) j c(g) λ j (Prox λ (c(g))) j c(g) ˆK(g)c(g) f g > λ = }{{} fg 0 f g λ

29 Non-parametric sparsity f(x) = f g (x) n n f g (x) = x(g) x(g) i (c(g)) i f g (x) = K g (x, x i )(c(g)) i i=1 i=1 (K 1,..., K G ) family of kernels w(g) = f g Kg

30 l 1 MKL 1 inf n ˆXw ŷ 2 + λ w(g) = G w(g)=w w(1),...,w(g) s.t. min f 1,..., f g s.t. G f g = f 1 n n (y i f(x i )) 2 + λ f g Kg i=1

31 l 2 MKL w(g) 2 = f g 2 K g Corresponds to using the kernel K(x, x ) = K g (x, x )

32 l 1 or l 2 MKL l 2 *much* faster l 1 could be useful is only few kernels are relevant

33 Why MKL? Data fusion different features Model selection, e.g. gaussian kernels with different widths Richer model many kernels!

34 MKL & kernel learning It can be shown that min f 1,..., f g s.t. G f g = f 1 n n (y i f(x i )) 2 + λ f g Kg i=1 min min 1 K K f H K n n (y i f(x i )) 2 + λ f 2 K i=1 where K = {K K = g K gα g, α g 0, }

35 Sparsity beyond vectors Recall multi-variable regression (x i, y i ) i=1 n, x i R d, y i R T f(x) = x }{{} W d T min W ˆXW Ŷ 2 F + λ Tr(W AW )

36 Sparse regularization We have seen d T Tr(W W ) = (W t,j ) 2 j=1 t=1 We could consider now d T W t,j j=1 t=1...

37 Spectral Norms/p-Schatten norms We have seen Tr(W W ) = min{d,t } t=1 σ 2 i We could consider now or R(W ) = W = min{d,t } t=1 σ i, nuclear norm min{d,t } R(W ) = ( (σ i ) p ) 1/p, p-schatten norm t=1

38 Nuclear norm regularization min ˆXW Ŷ W 2 F + λ W

39 Computations W t+1 = Prox γλ ( W t 2γ ˆX ( XW t Ŷ ) ) Let W = UΣV, Σ = diag(σ 1,..., σ p ) Prox (W ) = U diag(prox 1 (σ 1,..., σ p ))V

40 This class Structured sparsity MKL Matrix sparsity

41 Next class Solve intelligence!

Oslo Class 6 Sparsity based regularization

RegML2017@SIMULA Oslo Class 6 Sparsity based regularization Lorenzo Rosasco UNIGE-MIT-IIT May 4, 2017 Learning from data Possible only under assumptions regularization min Ê(w) + λr(w) w Smoothness Sparsity