
1 MLCC 2018 Variable Selection and Sparsity Lorenzo Rosasco UNIGE-MIT-IIT

2 Outline
Variable Selection
Subset Selection
Greedy Methods: (Orthogonal) Matching Pursuit
Convex Relaxation: LASSO & Elastic Net

5 Prediction and Interpretability
In many practical situations, beyond prediction, it is important to obtain interpretable results.
Interpretability is often determined by detecting which factors allow good prediction.
We look at this question from the perspective of variable selection.

8 Linear Models
Consider a linear model
$f_w(x) = w^\top x = \sum_{j=1}^{D} w_j x_j$
Here the components $x_j$ of an input can be seen as measurements (pixel values, dictionary word counts, gene expressions, ...).
Given data, the goal of variable selection is to detect which variables are important for prediction.
Key assumption: the best possible prediction rule is sparse, that is, only a few of the coefficients are non-zero.

9 Outline
Variable Selection
Subset Selection
Greedy Methods: (Orthogonal) Matching Pursuit
Convex Relaxation: LASSO & Elastic Net

12 Linear Models
Consider a linear model
$f_w(x) = w^\top x = \sum_{j=1}^{D} w_j x_j \qquad (1)$
Here the components $x_j$ of an input are specific measurements (pixel values, dictionary word counts, gene expressions, ...).
Given data, the goal of variable selection is to detect which variables are important for prediction.
Key assumption: the best possible prediction rule is sparse, that is, only a few of the coefficients are non-zero.

15 Notation
We need some notation:
$X_n$, the $n \times D$ data matrix
$X^j \in \mathbb{R}^n$, $j = 1, \dots, D$, its columns
$Y_n \in \mathbb{R}^n$, the output vector

17 High-dimensional Statistics
Estimating a linear model corresponds to solving a linear system $X_n w = Y_n$.
Classically $n \gg D$: low dimensional/overdetermined system.
Lately $n \ll D$: high dimensional/underdetermined system.
Buzzwords: compressed sensing, high-dimensional statistics, ...

19 High-dimensional Statistics
Estimating a linear model corresponds to solving a linear system $X_n w = Y_n$.
[Figure: the system $X_n w = Y_n$ drawn with $n$ rows and $D$ columns, $n \ll D$]
Sparsity!
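
To make the underdetermined setting concrete, here is a minimal NumPy sketch (not from the slides; sizes and data are illustrative) showing that when $n \ll D$ the plain least squares solution returned by `np.linalg.lstsq` is dense even when the generating vector is sparse, which is what motivates an explicit sparsity assumption.

```python
import numpy as np

# Hypothetical sizes: n << D (underdetermined system)
n, D, sparsity = 20, 100, 3

rng = np.random.default_rng(0)
Xn = rng.standard_normal((n, D))          # n-by-D data matrix
w_true = np.zeros(D)
w_true[rng.choice(D, sparsity, replace=False)] = rng.standard_normal(sparsity)
Yn = Xn @ w_true                          # noiseless outputs

# Minimum-norm solution of the underdetermined system X_n w = Y_n
w_ls, *_ = np.linalg.lstsq(Xn, Yn, rcond=None)

print("non-zeros in true w:", np.count_nonzero(w_true))
print("non-zeros in least squares w:", np.count_nonzero(np.abs(w_ls) > 1e-10))
```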

21 Brute Force Approach
Sparsity can be measured by the $\ell_0$ norm
$\|w\|_0 = \#\{j \mid w_j \neq 0\}$
that counts the non-zero components of $w$.
If we consider the square loss, a regularization approach is given by
$\min_{w \in \mathbb{R}^D} \frac{1}{n} \sum_{i=1}^{n} (y_i - f_w(x_i))^2 + \lambda \|w\|_0$

23 Best Subset Selection is Hard
$\min_{w \in \mathbb{R}^D} \frac{1}{n} \sum_{i=1}^{n} (y_i - f_w(x_i))^2 + \lambda \|w\|_0$
The above approach is as hard as a brute force approach: it amounts to considering all training problems obtained with all possible subsets of variables (singles, pairs, triplets, ... of variables).
The computational complexity is combinatorial. In the following we consider two possible approximate approaches:
greedy methods
convex relaxation
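
As a rough illustration of why the $\ell_0$ approach is combinatorial, the following sketch enumerates every subset of variables and solves a least squares problem on each. The function name, data, and the choice of counting the subset size as $\|w\|_0$ are illustrative; the loop is only feasible for very small $D$.

```python
import itertools
import numpy as np

def best_subset(Xn, Yn, lam):
    """Exhaustive l0-regularized least squares: try every subset of columns.
    Cost is combinatorial (2^D subsets), feasible only for very small D."""
    n, D = Xn.shape
    best_obj, best_w = np.inf, np.zeros(D)
    for k in range(D + 1):
        for subset in itertools.combinations(range(D), k):
            w = np.zeros(D)
            if subset:
                cols = Xn[:, list(subset)]
                w_sub, *_ = np.linalg.lstsq(cols, Yn, rcond=None)
                w[list(subset)] = w_sub
            obj = np.mean((Yn - Xn @ w) ** 2) + lam * k   # k plays the role of ||w||_0
            if obj < best_obj:
                best_obj, best_w = obj, w
    return best_w

# Tiny illustrative example (D must stay small)
rng = np.random.default_rng(0)
Xn = rng.standard_normal((30, 8))
w_true = np.zeros(8); w_true[[1, 4]] = [2.0, -1.5]
Yn = Xn @ w_true + 0.1 * rng.standard_normal(30)
print(np.nonzero(best_subset(Xn, Yn, lam=0.1))[0])
```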

24 Outline
Variable Selection
Subset Selection
Greedy Methods: (Orthogonal) Matching Pursuit
Convex Relaxation: LASSO & Elastic Net

29 Greedy Methods Approach
Greedy approaches encompass the following steps:
1. initialize the residual, the coefficient vector, and the index set
2. find the variable most correlated with the residual
3. update the index set to include the index of that variable
4. update/compute the coefficient vector
5. update the residual
The simplest such procedure is called forward stage-wise regression in statistics and matching pursuit (MP) in signal processing.

31 Initialization
Let $r$, $w$, $I$ denote the residual, the coefficient vector, and an index set, respectively.
The MP algorithm starts by initializing the residual $r_0 \in \mathbb{R}^n$, the coefficient vector $w_0 \in \mathbb{R}^D$, and the index set $I_0 \subseteq \{1, \dots, D\}$:
$r_0 = Y_n, \qquad w_0 = 0, \qquad I_0 = \emptyset$

32 Selection
The variable most correlated with the residual is given by
$k = \arg\max_{j=1,\dots,D} a_j, \qquad a_j = \frac{(r_{i-1}^\top X^j)^2}{\|X^j\|^2}$
where we note that
$v_j = \frac{r_{i-1}^\top X^j}{\|X^j\|^2} = \arg\min_{v \in \mathbb{R}} \|r_{i-1} - X^j v\|^2, \qquad \|r_{i-1} - X^j v_j\|^2 = \|r_{i-1}\|^2 - a_j$

33 Selection (cont.)
Such a selection rule has two interpretations: we select the variable with the largest projection on the output, or, equivalently, we select the variable whose corresponding column best explains the output vector in a least squares sense.

35 Active Set, Solution and Residual Update
Then, the index set is updated as $I_i = I_{i-1} \cup \{k\}$, and the coefficient vector is given by
$w_i = w_{i-1} + w_k, \qquad w_k = v_k e_k$
where $e_k$ is the element of the canonical basis in $\mathbb{R}^D$ with $k$-th component different from zero.
Finally, the residual is updated as
$r_i = r_{i-1} - X_n w_k$
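
The whole MP loop (initialization, selection, active set/coefficient/residual update) can be sketched in a few lines of NumPy; `matching_pursuit` and the synthetic data below are illustrative choices, not code from the course.

```python
import numpy as np

def matching_pursuit(Xn, Yn, T):
    """Matching pursuit: T greedy iterations of select-and-update."""
    n, D = Xn.shape
    r = Yn.copy()                      # residual r_0 = Y_n
    w = np.zeros(D)                    # coefficients w_0 = 0
    I = set()                          # active index set I_0
    for _ in range(T):
        # a_j = (r^T X^j)^2 / ||X^j||^2 for every column j
        corr = Xn.T @ r
        a = corr ** 2 / np.sum(Xn ** 2, axis=0)
        k = int(np.argmax(a))          # most correlated variable
        v_k = corr[k] / np.sum(Xn[:, k] ** 2)
        I.add(k)                       # update active set
        w[k] += v_k                    # w_i = w_{i-1} + v_k e_k
        r = r - Xn[:, k] * v_k         # residual update
    return w, I

# Example usage on synthetic sparse data
rng = np.random.default_rng(0)
Xn = rng.standard_normal((50, 30))
w_true = np.zeros(30); w_true[[3, 17]] = [1.5, -2.0]
Yn = Xn @ w_true + 0.05 * rng.standard_normal(50)
w_mp, I = matching_pursuit(Xn, Yn, T=5)
print(sorted(I), np.round(w_mp[sorted(I)], 2))
```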

36 Orthogonal Matching Pursuit
A variant of the above procedure, called Orthogonal Matching Pursuit (OMP), is also often considered, where the coefficient computation is replaced by
$w_i = \arg\min_{w \in \mathbb{R}^D} \|Y_n - X_n M_{I_i} w\|^2$
where the $D \times D$ matrix $M_I$ is such that $(M_I w)_j = w_j$ if $j \in I$ and $(M_I w)_j = 0$ otherwise. Moreover, the residual update is replaced by
$r_i = Y_n - X_n w_i$
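
A corresponding sketch of the OMP variant, where at each iteration the coefficients on the active set are refit by least squares and the residual is recomputed from scratch; the function name and the stopping rule (a fixed number of iterations $T$) are illustrative choices.

```python
import numpy as np

def orthogonal_matching_pursuit(Xn, Yn, T):
    """OMP: select greedily, then refit all active coefficients by least squares."""
    n, D = Xn.shape
    r = Yn.copy()
    w = np.zeros(D)
    I = []
    for _ in range(T):
        a = (Xn.T @ r) ** 2 / np.sum(Xn ** 2, axis=0)    # correlation scores a_j
        k = int(np.argmax(a))
        if k not in I:
            I.append(k)
        # w_i = argmin_w ||Y_n - X_n M_{I_i} w||^2: least squares on active columns
        w_I, *_ = np.linalg.lstsq(Xn[:, I], Yn, rcond=None)
        w = np.zeros(D)
        w[I] = w_I
        r = Yn - Xn @ w                                  # r_i = Y_n - X_n w_i
    return w, I
```

Compared with MP, the refit makes the residual orthogonal to all selected columns, so (in exact arithmetic) the same variable is not selected twice; the `if k not in I` guard is just a safeguard.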

37 Theoretical Guarantees
If the solution is sparse and the columns of the data matrix are not too correlated, OMP can be shown to recover the right vector of coefficients with high probability.

38 Outline
Variable Selection
Subset Selection
Greedy Methods: (Orthogonal) Matching Pursuit
Convex Relaxation: LASSO & Elastic Net

40 $\ell_1$ Norm and Regularization
Another popular approach to find sparse solutions is based on a convex relaxation. Namely, the $\ell_0$ norm is replaced by the $\ell_1$ norm
$\|w\|_1 = \sum_{j=1}^{D} |w_j|$
In the case of least squares, one can consider
$\min_{w \in \mathbb{R}^D} \frac{1}{n} \sum_{i=1}^{n} (y_i - f_w(x_i))^2 + \lambda \|w\|_1$

41 Convex Relaxation
$\min_{w \in \mathbb{R}^D} \frac{1}{n} \sum_{i=1}^{n} (y_i - f_w(x_i))^2 + \lambda \|w\|_1$
The above problem is called LASSO in statistics and Basis Pursuit in signal processing.
The objective function defining the corresponding minimization problem is convex but not differentiable.
Tools from non-smooth convex optimization are needed to find a solution.

44 Iterative Soft Thresholding
A simple yet powerful procedure to compute a solution is based on the so-called iterative soft thresholding algorithm (ISTA):
$w_0 = 0, \qquad w_i = S_{\lambda\gamma}\left(w_{i-1} + \frac{2\gamma}{n} X_n^\top (Y_n - X_n w_{i-1})\right), \quad i = 1, \dots, T_{\max}$
At each iteration a non-linear soft thresholding operator is applied to a gradient step.

45 Iterative Soft Thresholding (cont.)
$w_0 = 0, \qquad w_i = S_{\lambda\gamma}\left(w_{i-1} + \frac{2\gamma}{n} X_n^\top (Y_n - X_n w_{i-1})\right), \quad i = 1, \dots, T_{\max}$
The iteration should be run until a convergence criterion is met, e.g. $\|w_i - w_{i-1}\| \leq \epsilon$ for some precision $\epsilon$, or until a maximum number of iterations $T_{\max}$ is reached.
To ensure convergence we should choose the step-size $\gamma = \frac{n}{2\|X_n^\top X_n\|}$

46 Splitting Methods
In ISTA the contributions of the error term and of the regularization term are split: the argument of the soft thresholding operator corresponds to a step of gradient descent on the empirical error, along the direction
$\frac{2}{n} X_n^\top (Y_n - X_n w_{i-1})$
The soft thresholding operator depends only on the regularization and acts component-wise on a vector $w$, so that
$S_\alpha(u) = (|u| - \alpha)_+ \, \frac{u}{|u|}$
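
Putting the two pieces together, here is a minimal sketch of ISTA for the LASSO, with the soft thresholding operator implemented component-wise and the step-size chosen as on the slides; the function names, the tolerance, and the synthetic data are illustrative.

```python
import numpy as np

def soft_threshold(u, alpha):
    """S_alpha(u) = (|u| - alpha)_+ * sign(u), applied component-wise."""
    return np.sign(u) * np.maximum(np.abs(u) - alpha, 0.0)

def ista_lasso(Xn, Yn, lam, T_max=1000, tol=1e-6):
    """ISTA for the LASSO: soft thresholding applied to a gradient step."""
    n, D = Xn.shape
    gamma = n / (2 * np.linalg.norm(Xn.T @ Xn, 2))    # step-size rule from the slides
    w = np.zeros(D)
    for _ in range(T_max):
        grad_step = w + (2 * gamma / n) * (Xn.T @ (Yn - Xn @ w))
        w_new = soft_threshold(grad_step, lam * gamma)
        if np.linalg.norm(w_new - w) <= tol:          # convergence criterion
            return w_new
        w = w_new
    return w

# Example usage on synthetic sparse data
rng = np.random.default_rng(0)
Xn = rng.standard_normal((60, 40))
w_true = np.zeros(40); w_true[[2, 9, 25]] = [1.0, -2.0, 1.5]
Yn = Xn @ w_true + 0.1 * rng.standard_normal(60)
w_hat = ista_lasso(Xn, Yn, lam=0.1)
print("selected variables:", np.nonzero(np.abs(w_hat) > 1e-8)[0])
```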

48 Soft Thresholding and Sparsity
$S_\alpha(u) = (|u| - \alpha)_+ \, \frac{u}{|u|}$
The above expression shows that the coefficients of the solution computed by ISTA can be exactly zero.
This can be contrasted to Tikhonov regularization, where this is hardly the case.

50 Lasso meets Tikhonov: Elastic Net
Indeed, it is possible to see that:
while Tikhonov regularization computes a stable solution, in general its solution is not sparse;
on the other hand, the solution of LASSO might not be stable.
The elastic net algorithm, defined as
$\min_{w \in \mathbb{R}^D} \frac{1}{n} \sum_{i=1}^{n} (y_i - f_w(x_i))^2 + \lambda\left(\alpha \|w\|_1 + (1-\alpha) \|w\|_2^2\right), \qquad \alpha \in [0, 1] \qquad (2)$
can be seen as a hybrid algorithm which interpolates between Tikhonov and LASSO.

51 ISTA for Elastic Net
The ISTA procedure can be adapted to solve the elastic net problem, where the gradient descent step also incorporates the derivative of the $\ell_2$ penalty term. The resulting algorithm is
$w_0 = 0, \qquad w_i = S_{\lambda\alpha\gamma}\left((1 - \lambda\gamma(1-\alpha))\, w_{i-1} + \frac{2\gamma}{n} X_n^\top (Y_n - X_n w_{i-1})\right), \quad i = 1, \dots, T_{\max}$
To ensure convergence we should choose the step-size $\gamma = \frac{n}{2(\|X_n^\top X_n\| + \lambda(1-\alpha))}$
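
The LASSO sketch above only needs a small change to cover the elastic net update written on this slide: the previous iterate is shrunk by the $\ell_2$ part of the penalty before the gradient step, and the step-size accounts for the extra term. As before this is an illustrative sketch, with `elastic_net_ista` a name of my choosing.

```python
import numpy as np

def elastic_net_ista(Xn, Yn, lam, alpha, T_max=1000, tol=1e-6):
    """ISTA adapted to the elastic net penalty lam*(alpha*||w||_1 + (1-alpha)*||w||_2^2)."""
    n, D = Xn.shape
    # step-size rule from the slides
    gamma = n / (2 * (np.linalg.norm(Xn.T @ Xn, 2) + lam * (1 - alpha)))
    w = np.zeros(D)
    for _ in range(T_max):
        shrunk = (1 - lam * gamma * (1 - alpha)) * w          # l2 part of the penalty
        grad_step = shrunk + (2 * gamma / n) * (Xn.T @ (Yn - Xn @ w))
        # soft thresholding at level lam * alpha * gamma (l1 part of the penalty)
        w_new = np.sign(grad_step) * np.maximum(np.abs(grad_step) - lam * alpha * gamma, 0.0)
        if np.linalg.norm(w_new - w) <= tol:
            return w_new
        w = w_new
    return w
```

Setting $\alpha = 1$ recovers the LASSO iteration, while smaller $\alpha$ trades sparsity for stability.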

52 Wrapping Up
Sparsity and interpretable models
Greedy methods
Convex relaxation

53 Next Class
Unsupervised learning: clustering!
