Generalized Concomitant Multi-Task Lasso for sparse multimodal regression


1 Generalized Concomitant Multi-Task Lasso for sparse multimodal regression Mathurin Massias INRIA Saclay Joint work with: Olivier Fercoq (Télécom ParisTech) Alexandre Gramfort (INRIA Saclay) Joseph Salmon (Télécom ParisTech)

2 Table of Contents: Motivation - problem setup; Joint estimators of the noise level; General noise model; Block homoscedastic model

3 "One" motivation: M/EEG inverse problem sensors: magneto- and electro-encephalogram measurements during a cognitive experiment sources: brain locations

4 The M/EEG inverse problem: modelling. Assume Gaussian variables: E ∼ N(0, Σ_E), X ∼ N(0, Σ_X). Additive model: M = GX + E. M and G are known; X is obtained by maximizing the likelihood, which leads to X̂ = argmin_X ‖M − GX‖²_{Σ_E^{-1}} + ‖X‖²_{Σ_X^{-1}}, i.e., X̂ = Σ_X G^⊤ (G Σ_X G^⊤ + Σ_E)^{-1} M. (Figure: illustration of the covariances Σ_E, Σ_X and of the dimensions of M, G and X.)
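
To make the closed-form likelihood solution above concrete, here is a minimal NumPy sketch under the reconstructed Gaussian model; all sizes and covariance choices below are illustrative assumptions, not the slide's actual data.

    import numpy as np

    rng = np.random.default_rng(0)
    n_sensors, n_sources, n_times = 100, 500, 50   # illustrative sizes (assumption)
    G = rng.standard_normal((n_sensors, n_sources))            # forward (gain) matrix
    cov_x = np.eye(n_sources)                                   # source covariance Sigma_X
    cov_e = 0.1 * np.eye(n_sensors)                             # noise covariance Sigma_E
    X = rng.standard_normal((n_sources, n_times))
    M = G @ X + np.sqrt(0.1) * rng.standard_normal((n_sensors, n_times))

    # MAP / minimum-norm estimate: X_hat = Sigma_X G^T (G Sigma_X G^T + Sigma_E)^{-1} M
    X_hat = cov_x @ G.T @ np.linalg.solve(G @ cov_x @ G.T + cov_e, M)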

5 The M/EEG inverse problem: modelling. Multi-task regression: n observations, q tasks, p features. Y ∈ R^{n×q}: observation matrix; X ∈ R^{n×p}: forward matrix. Model: Y = XB + E, where B ∈ R^{p×q} is the true source activity matrix and E ∈ R^{n×q} is additive white noise.

6 ℓ2,1 regularization. Penalty: Group-Lasso (ℓ2,1): ‖B‖_{2,1} = Σ_{j=1}^p ‖B_j‖₂ (B_j: j-th row of B). We solve: B̂ ∈ argmin_{B ∈ R^{p×q}} (1/2)‖Y − XB‖²_F + λ‖B‖_{2,1}. (Figure: estimated source activity B̂ ∈ R^{p×q}, rows indexed by sources, columns by time.) A.k.a. Multiple Measurement Vector (MMV) in signal processing, or multi-task Lasso in ML [Obozinski et al., 2010].
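
For illustration, this ℓ2,1-penalized problem can be solved with scikit-learn's MultiTaskLasso. This is only a sketch on synthetic data (an assumption of this note, not the solver used in the talk); sklearn's alpha corresponds to λ up to the 1/(2n) scaling of its quadratic term.

    import numpy as np
    from sklearn.linear_model import MultiTaskLasso

    rng = np.random.default_rng(0)
    n, p, q = 100, 300, 20
    X = rng.standard_normal((n, p))
    B_true = np.zeros((p, q))
    B_true[:10] = rng.standard_normal((10, q))          # 10 active rows, shared across tasks
    Y = X @ B_true + 0.1 * rng.standard_normal((n, q))

    clf = MultiTaskLasso(alpha=0.1, fit_intercept=False).fit(X, Y)
    B_hat = clf.coef_.T                                  # sklearn stores coef_ with shape (q, p)
    print("active rows recovered:", int((np.linalg.norm(B_hat, axis=1) != 0).sum()))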

7 Table of Contents: Motivation - problem setup; Joint estimators of the noise level; General noise model; Block homoscedastic model

8 Choice of λ and noise level. The noise E is assumed white Gaussian, with i.i.d. entries E_{i,t} ∼ N(0, σ²). For optimal performance of the Lasso, λ should be proportional to the noise level: λ ∝ σ √(log p / n) [Bickel et al., 2009]. Yet σ is unknown in practice!

9 Joint estimation of β and σ (illustrated on the Lasso for simplicity: model is y = Xβ + ε). Intuitive idea: run the Lasso with some λ, get β̂; estimate σ from the residuals: σ̂ = ‖y − Xβ̂‖ / √n; relaunch the Lasso with λ ∝ σ̂; etc. Note: this is the original implementation proposed for the Scaled-Lasso [Sun and Zhang, 2012].
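
A minimal sketch of this alternating scheme, assuming an off-the-shelf Lasso solver and an illustrative choice of the proportionality constant in λ (not the authors' exact settings):

    import numpy as np
    from sklearn.linear_model import Lasso

    def iterative_scaled_lasso(X, y, n_iter=10):
        n, p = X.shape
        sigma = np.linalg.norm(y) / np.sqrt(n)           # crude initial noise estimate
        beta = np.zeros(p)
        for _ in range(n_iter):
            lam = sigma * np.sqrt(2 * np.log(p) / n)     # lambda proportional to sigma
            beta = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_
            sigma = np.linalg.norm(y - X @ beta) / np.sqrt(n)   # residual-based estimate
        return beta, sigma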

10 Concomitant Lasso. Concomitant Lasso [Owen, 2007] (inspired by Huber [1981]): (β̂, σ̂) ∈ argmin_{β ∈ R^p, σ > 0} ‖y − Xβ‖² / (2nσ) + σ/2 + λ‖β‖₁. Also equivalent to the Square-root/Scaled Lasso [Belloni et al., 2011, Sun and Zhang, 2012]. Note: the σ/2 term acts as a penalty over the noise level.

11 Solving the Concomitant Lasso. A jointly convex formulation (w.r.t. both β and σ): argmin_{β ∈ R^p, σ > 0} ‖y − Xβ‖² / (2nσ) + σ/2 + λ‖β‖₁. β update: smooth + separable function, make CD steps [Friedman et al., 2007].

12 Solving the Concomitant Lasso. A jointly convex formulation (w.r.t. both β and σ): argmin_{β ∈ R^p, σ > 0} ‖y − Xβ‖² / (2nσ) + σ/2 + λ‖β‖₁. β update: smooth + separable function, make CD steps [Friedman et al., 2007]. σ update: 1D optimization problem, closed form: σ = ‖y − Xβ‖ / √n.


14 Solving the Concomitant Lasso. A jointly convex formulation (w.r.t. both β and σ): argmin_{β ∈ R^p, σ > 0} ‖y − Xβ‖² / (2nσ) + σ/2 + λ‖β‖₁. β update: smooth + separable function, make CD steps [Friedman et al., 2007]. σ update: 1D optimization problem, closed form: σ = ‖y − Xβ‖ / √n. But what if we hit y = Xβ?

15 Smoothed Concomitant Lasso. Smoothed Concomitant Lasso [Ndiaye et al., 2017]: argmin_{β ∈ R^p, σ ≥ σ̲} ‖y − Xβ‖² / (2nσ) + σ/2 + λ‖β‖₁, where σ̲ > 0 is a pre-set lower bound on the noise level.

16 Smoothed Concomitant Lasso. Smoothed Concomitant Lasso [Ndiaye et al., 2017]: argmin_{β ∈ R^p, σ ≥ σ̲} ‖y − Xβ‖² / (2nσ) + σ/2 + λ‖β‖₁. Called "smoothed" following the terminology of Nesterov [2005], because it amounts to smoothing in the dual: θ̂ = argmax_{θ ∈ Δ_X} ⟨y, λθ⟩ + σ̲ (1/2 − nλ²‖θ‖²/2), with Δ_X = {θ ∈ R^n : ‖X^⊤θ‖_∞ ≤ 1, ‖θ‖ ≤ 1/(λ√n)}. State-of-the-art solvers: as fast as for the Lasso.
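
Below is a minimal coordinate-descent sketch of the alternate minimization for the smoothed concomitant formulation above (β update by soft-thresholding, σ update in closed form, clipped at σ̲). It is an illustrative reimplementation under the stated objective, not the authors' optimized solver.

    import numpy as np

    def smoothed_concomitant_cd(X, y, lam, sigma_min, n_iter=100):
        n, p = X.shape
        beta = np.zeros(p)
        resid = y.copy()
        lipschitz = (X ** 2).sum(axis=0)                 # ||X_j||^2 for each coordinate
        sigma = max(sigma_min, np.linalg.norm(y) / np.sqrt(n))
        for _ in range(n_iter):
            for j in range(p):
                resid += X[:, j] * beta[j]               # put coefficient j back in the residual
                zj = X[:, j] @ resid / lipschitz[j]
                thresh = n * lam * sigma / lipschitz[j]  # soft-threshold level for fixed sigma
                beta[j] = np.sign(zj) * max(abs(zj) - thresh, 0.0)
                resid -= X[:, j] * beta[j]
            sigma = max(sigma_min, np.linalg.norm(resid) / np.sqrt(n))   # closed-form sigma step
        return beta, sigma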

17 Table of Contents: Motivation - problem setup; Joint estimators of the noise level; General noise model; Block homoscedastic model

18 What about more complex noise models? What can we do if the noise is not white? Smoothed Generalized Concomitant Lasso (SGCL): (B̂, Σ̂) ∈ argmin_{B ∈ R^{p×q}, Σ ∈ S^n_{++}, Σ ⪰ Σ̲} ‖Y − XB‖²_{Σ^{-1}} / (2nq) + Tr(Σ)/(2n) + λ‖B‖_{2,1}, with ‖Z‖²_A = Tr(Z^⊤ A Z) and Σ̲ = σ̲ Id.

19 What about more complex noise models? What can we do if the noise is not white? Smoothed Generalized Concomitant Lasso (SGCL): (B̂, Σ̂) ∈ argmin_{B ∈ R^{p×q}, Σ ∈ S^n_{++}, Σ ⪰ Σ̲} ‖Y − XB‖²_{Σ^{-1}} / (2nq) + Tr(Σ)/(2n) + λ‖B‖_{2,1}, with ‖Z‖²_A = Tr(Z^⊤ A Z) and Σ̲ = σ̲ Id. In the case Σ = σ Id, we recover the Smoothed Concomitant.

20 What about more complex noise models? What can we do if the noise is not white? Smoothed Generalized Concomitant Lasso (SGCL): (B̂, Σ̂) ∈ argmin_{B ∈ R^{p×q}, Σ ∈ S^n_{++}, Σ ⪰ Σ̲} ‖Y − XB‖²_{Σ^{-1}} / (2nq) + Tr(Σ)/(2n) + λ‖B‖_{2,1}, with ‖Z‖²_A = Tr(Z^⊤ A Z) and Σ̲ = σ̲ Id. In the case Σ = σ Id, we recover the Smoothed Concomitant. Note: the noise penalty is now on the sum of the eigenvalues of Σ.

21 Solving the SGCL. Jointly convex formulation: we can still use alternate minimization, with the duality gap as a stopping criterion. Σ fixed: smooth + ℓ1-type penalty, BCD works.

22 Solving the SGCL. Jointly convex formulation: we can still use alternate minimization, with the duality gap as a stopping criterion. B fixed: with the current residuals R = Y − XB, the problem is: argmin_{Σ ∈ S^n_{++}, Σ ⪰ Σ̲} (1/(2nq)) Tr[R^⊤ Σ^{-1} R] + (1/(2n)) Tr(Σ). Closed-form solution: if U diag(s_1, …, s_n) U^⊤ is the spectral decomposition of (RR^⊤/q)^{1/2}, then Σ̂ = U diag(max(σ̲, s_1), …, max(σ̲, s_n)) U^⊤.
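
A short NumPy sketch of this closed-form Σ step, assuming the reconstruction above (unconstrained minimizer (RR^⊤/q)^{1/2}, eigenvalues clipped at σ̲):

    import numpy as np

    def update_sigma(R, q, sigma_min):
        """Minimizer of Tr(R^T Sigma^{-1} R)/(2nq) + Tr(Sigma)/(2n) s.t. Sigma >= sigma_min * Id."""
        eigvals, U = np.linalg.eigh(R @ R.T / q)
        s = np.sqrt(np.maximum(eigvals, 0.0))            # eigenvalues of (R R^T / q)^{1/2}
        return (U * np.maximum(s, sigma_min)) @ U.T      # clip the spectrum at sigma_min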

23 Alternate minimization.
Algorithm: Alternate min. for multi-task SGCL
  input: X, Y, Σ̲, λ, f, T
  init: B = 0_{p,q}, Σ^{-1} = Σ̲^{-1}, R = Y
  for iter = 1, …, T do
    if iter ≡ 1 (mod f) then
      Σ ← Ψ(R, Σ̲)                                  // closed-form sol. of minimization in Σ
      for j = 1, …, p do L_j = X_j^⊤ Σ^{-1} X_j     // Lipschitz constants
    for j = 1, …, p do
      R ← R + X_j B_j                               // partial residual update
      B_j ← BST(X_j^⊤ Σ^{-1} R / L_j, λnq / L_j)    // coef. update
      R ← R − X_j B_j                               // residual update
  return B, Σ
Complexity? OK if we store Σ^{-1}X and Σ^{-1}R instead of R.
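
For concreteness, here is a minimal Python sketch of the loop above (block soft-thresholding BST, Σ refreshed every f inner passes). It is an illustrative reimplementation under the reconstructed updates, not the authors' solver.

    import numpy as np

    def block_soft_threshold(u, tau):
        """BST(u, tau): shrink the row vector u towards 0 by tau in l2 norm."""
        norm_u = np.linalg.norm(u)
        return np.zeros_like(u) if norm_u <= tau else (1.0 - tau / norm_u) * u

    def sgcl(X, Y, lam, sigma_min, f=10, n_iter=100):
        n, p = X.shape
        q = Y.shape[1]
        B = np.zeros((p, q))
        R = Y.copy()
        for it in range(n_iter):
            if it % f == 0:
                # closed-form Sigma step: eigenvalues of (R R^T / q)^{1/2}, clipped at sigma_min
                eigvals, U = np.linalg.eigh(R @ R.T / q)
                s = np.maximum(np.sqrt(np.maximum(eigvals, 0.0)), sigma_min)
                Sigma = (U * s) @ U.T
                Sigma_inv_X = np.linalg.solve(Sigma, X)              # store Sigma^{-1} X
                L = np.einsum('ij,ij->j', X, Sigma_inv_X)            # L_j = X_j^T Sigma^{-1} X_j
            for j in range(p):
                R += np.outer(X[:, j], B[j])                         # partial residual update
                B[j] = block_soft_threshold(Sigma_inv_X[:, j] @ R / L[j], lam * n * q / L[j])
                R -= np.outer(X[:, j], B[j])                         # residual update
        return B, Sigma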

24 Main drawbacks. The Σ update is OK for M/EEG because n = 300, but the O(n³) SVD is problematic otherwise. O(n²) parameters to infer for Σ, with only nq observations: works only when q ≳ 10n.

25 Table of Contents: Motivation - problem setup; Joint estimators of the noise level; General noise model; Block homoscedastic model

26 Block Homoscedastic model. In our case we record 3 different types of signals: electrodes measure electric potentials, magnetometers measure the magnetic field, gradiometers measure the gradient of the magnetic field. Different physical natures, different noise levels. Observations are divided into 3 blocks and the partition is known.

27 Block Homoscedastic model. K groups of observations, stacked by blocks of rows: X = [X^1; …; X^K], Y = [Y^1; …; Y^K], E = [E^1; …; E^K], and Σ = diag(σ_1 Id_{n_1}, …, σ_K Id_{n_K}). For each block, a homoscedastic model with white noise: Y^k = X^k B + σ_k E^k, with entries of E^k i.i.d. N(0, 1).

28 Smoothed Block Homoscedastic Concomitant (SBHCL). Plugging a diagonal Σ (constant over consecutive blocks) into the general model yields the Block Homoscedastic Concomitant: argmin_{B ∈ R^{p×q}, (σ_1, …, σ_K) ∈ R^K_{++}, σ_k ≥ σ̲_k ∀k ∈ [K]} Σ_{k=1}^K ( ‖Y^k − X^k B‖² / (2nqσ_k) + n_k σ_k / (2n) ) + λ‖B‖_{2,1}. Σ goes down from n(n+1)/2 parameters (hopeless without more structure) to K.

29 Solving the SBHCL. The (block) coordinate descent steps remain the same. Computing Σ^{-1}R for the BCD is easier.

30 Solving the SBHCL. The (block) coordinate descent steps remain the same. Computing Σ^{-1}R for the BCD is easier. The σ_k updates are simple and can even be performed at each B_j update (as for the Concomitant).


32 Alternate minimization.
Algorithm: Alternate min. for multi-task SBHCL
  input: X^1, …, X^K, Y^1, …, Y^K, σ̲_1, …, σ̲_K, λ, T
  init: B = 0_{p,q}; for k ∈ [K]: σ_k = ‖Y^k‖ / √(n_k q), R^k = Y^k; for k ∈ [K], j ∈ [p]: L_{k,j} = ‖X_j^k‖²₂
  for iter = 1, …, T do
    for j = 1, …, p do
      for k = 1, …, K do R^k ← R^k + X_j^k B_j                                  // partial residual update
      B_j ← BST( Σ_{k=1}^K X_j^{k⊤} R^k / σ_k , λnq ) / ( Σ_{k=1}^K L_{k,j} / σ_k )   // coef. update
      for k = 1, …, K do
        R^k ← R^k − X_j^k B_j                                                   // residual update
        σ_k ← max(σ̲_k, ‖R^k‖ / √(n_k q))                                       // smart std dev update
  return B, σ_1, …, σ_K
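
A minimal Python sketch of this block-homoscedastic variant, with per-block residuals and noise levels (illustrative reimplementation; X_blocks and Y_blocks are lists of the per-block arrays X^k, Y^k):

    import numpy as np

    def block_soft_threshold(u, tau):
        norm_u = np.linalg.norm(u)
        return np.zeros_like(u) if norm_u <= tau else (1.0 - tau / norm_u) * u

    def sbhcl(X_blocks, Y_blocks, lam, sigma_mins, n_iter=100):
        K = len(X_blocks)
        p, q = X_blocks[0].shape[1], Y_blocks[0].shape[1]
        n = sum(Xk.shape[0] for Xk in X_blocks)
        B = np.zeros((p, q))
        R = [Yk.copy() for Yk in Y_blocks]
        L = np.array([(Xk ** 2).sum(axis=0) for Xk in X_blocks])     # L[k, j] = ||X_j^k||^2
        sigma = np.array([max(sm, np.linalg.norm(Yk) / np.sqrt(Yk.shape[0] * q))
                          for sm, Yk in zip(sigma_mins, Y_blocks)])
        for _ in range(n_iter):
            for j in range(p):
                for k in range(K):
                    R[k] += np.outer(X_blocks[k][:, j], B[j])        # partial residual update
                corr = sum(X_blocks[k][:, j] @ R[k] / sigma[k] for k in range(K))
                lip = sum(L[k, j] / sigma[k] for k in range(K))
                B[j] = block_soft_threshold(corr, lam * n * q) / lip  # coef. update
                for k in range(K):
                    R[k] -= np.outer(X_blocks[k][:, j], B[j])        # residual update
                    n_k = X_blocks[k].shape[0]
                    sigma[k] = max(sigma_mins[k], np.linalg.norm(R[k]) / np.sqrt(n_k * q))
        return B, sigma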

33 In practice. Design: (n, p, q) = (300, 1000, 100). X Toeplitz-correlated: Cov(X_i, X_j) = ρ^{|i − j|}, ρ ∈ ]0, 1[. 3 blocks, with noise levels in ratio 1, 2, 5.
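
A small sketch generating data following this design (Toeplitz-correlated features and three noise blocks in ratio 1:2:5); the number of active rows, ρ value and seed are illustrative assumptions:

    import numpy as np
    from scipy.linalg import toeplitz

    rng = np.random.default_rng(0)
    n, p, q, rho = 300, 1000, 100, 0.5
    cov = toeplitz(rho ** np.arange(p))                      # Cov(X_i, X_j) = rho^|i - j|
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    B_true = np.zeros((p, q))
    support = rng.choice(p, size=20, replace=False)
    B_true[support] = rng.standard_normal((20, q))           # 20 active rows (assumption)
    noise_std = np.repeat([1.0, 2.0, 5.0], n // 3)           # 3 blocks, noise ratio 1:2:5
    Y = X @ B_true + noise_std[:, None] * rng.standard_normal((n, q))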

34 Support recovery. (Figure) ROC curves of true support recovery (false positive rate vs. true positive rate), for SBHCL, MTL and SCL on all blocks, and for MTL and SCL on the least noisy block only. Top: SNR = 1, ρ = 0.1; Bottom: SNR = 1, ρ = 0.9.

35 Prediction performance. (Figure) log10(RMSE / RMSE_oracle) as a function of λ/λ_max, for SBHCL and SCL on each of the 3 blocks.

36 Take home message. More general noise models: it is possible to estimate a full covariance if there are enough tasks. If more noise structure is known (e.g., block homoscedastic model): not more costly than the multi-task Lasso (MTL). Taking into account multiple noise levels helps, both for prediction and support identification. Using additional (though noisier) data helps! Python code is available online. This work was funded by ERC Starting Grant SLAB ERC-YStG. Slides powered by MooseTeX.

37 References I
G. Obozinski, B. Taskar, and M. I. Jordan. Joint covariate selection and joint subspace selection for multiple classification problems. Statistics and Computing, 20(2), 2010.
P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Ann. Statist., 37(4), 2009.
T. Sun and C.-H. Zhang. Scaled sparse linear regression. Biometrika, 99(4), 2012.
A. B. Owen. A robust hybrid of lasso and ridge regression. Contemporary Mathematics, 443:59-72, 2007.
P. J. Huber. Robust Statistics. John Wiley & Sons Inc., 1981.
A. Belloni, V. Chernozhukov, and L. Wang. Square-root Lasso: pivotal recovery of sparse signals via conic programming. Biometrika, 98(4), 2011.
J. Friedman, T. J. Hastie, H. Höfling, and R. Tibshirani. Pathwise coordinate optimization. Ann. Appl. Stat., 1(2), 2007.
E. Ndiaye, O. Fercoq, A. Gramfort, V. Leclère, and J. Salmon. Efficient smoothed concomitant Lasso estimation for high dimensional regression. In NCMIP, 2017.
Y. Nesterov. Smooth minimization of non-smooth functions. Math. Program., 103(1), 2005.
