Machine Learning for Big Data, CSE547/STAT548, University of Washington. Emily Fox, February 4th, 2014.

Case Study 3: fMRI Prediction
Fused LASSO, LARS, Parallel LASSO Solvers

LASSO Regression
LASSO: least absolute shrinkage and selection operator
New objective: $\hat\beta = \arg\min_\beta \|X\beta - y\|_2^2 + \lambda \|\beta\|_1$

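Geometric Intuition for Sparsity
[Figure: Lasso and Ridge Regression panels in the $(\beta_1, \beta_2)$ plane, each marking the estimate $\hat\beta$]

Soft Thresholding
$$\hat\beta_j = \begin{cases} (c_j + \lambda)/a_j & c_j < -\lambda \\ 0 & c_j \in [-\lambda, \lambda] \\ (c_j - \lambda)/a_j & c_j > \lambda \end{cases}$$
From Kevin Murphy textbook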
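The soft-thresholding rule above follows from a one-coordinate subgradient argument. The derivation below is a sketch that assumes the unscaled objective $F(\beta) = \|X\beta - y\|_2^2 + \lambda\|\beta\|_1$ and the definitions of $a_j$ and $c_j$ used on the shooting-algorithm slide; $x^i$ denotes the $i$-th data point and $\beta_{-j}$ the coefficients with coordinate $j$ removed.

```latex
\begin{align*}
F(\beta) &= \sum_{i=1}^N \bigl(y^i - \beta' x^i\bigr)^2 + \lambda \|\beta\|_1, \\
\frac{\partial}{\partial \beta_j} \sum_{i=1}^N \bigl(y^i - \beta' x^i\bigr)^2
  &= a_j \beta_j - c_j, \qquad
  a_j = 2\sum_{i=1}^N (x^i_j)^2, \quad
  c_j = 2\sum_{i=1}^N x^i_j \bigl(y^i - \beta_{-j}' x^i_{-j}\bigr), \\
0 &\in a_j \hat\beta_j - c_j + \lambda\, \partial|\hat\beta_j|
  \quad \text{(optimality in coordinate } j \text{)}, \\
\hat\beta_j > 0 &\;\Rightarrow\; \hat\beta_j = (c_j - \lambda)/a_j,
  \quad \text{which requires } c_j > \lambda, \\
\hat\beta_j < 0 &\;\Rightarrow\; \hat\beta_j = (c_j + \lambda)/a_j,
  \quad \text{which requires } c_j < -\lambda, \\
\hat\beta_j = 0 &\;\Rightarrow\; 0 \in -c_j + \lambda[-1, 1],
  \quad \text{i.e. } c_j \in [-\lambda, \lambda].
\end{align*}
```

The three cases are mutually exclusive and cover every value of $c_j$, which is why the coordinate update has a closed form.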

LASSO Coefficient Path
[Figure: LASSO coefficient path. From Kevin Murphy textbook]

Sparsistency
Typical statistical consistency analysis:
  Holding model size (p) fixed, as number of samples (N) goes to infinity, estimated parameter goes to true parameter
Here we want to examine p >> N domains
Let both model size p and sample size N go to infinity!
  Hard case: N = k log p

Sparsistency
Rescale LASSO objective by N: [formula not recoverable]
Theorem (Wainwright 2008, Zhao and Yu 2006, ...):
  Under some constraints on the design matrix X, if we solve the LASSO regression using [condition on $\lambda_n$ not recoverable]
  Then for some $c_1 > 0$, the following holds with at least probability [expression not recoverable]:
    The LASSO problem has a unique solution with support contained within the true support
    If $\min_{j \in S(\beta^*)} |\beta^*_j| > c_2 \lambda_n$ for some $c_2 > 0$, then $S(\hat\beta) = S(\beta^*)$

Case Study 3: fMRI Prediction
Fused LASSO

fMRI Prediction Subtask
Goal: Predict semantic features from fMRI image
[Figure label: Features of word]

Fused LASSO
Might want coefficients of neighboring voxels to be similar
How to modify LASSO penalty to account for this?
Graph-guided fused LASSO
  Assume a 2d lattice graph connecting neighboring pixels in the fMRI image
  Penalty: [formula not recoverable]
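The slide's penalty formula did not survive, so the sketch below assumes one common form of the graph-guided fused LASSO penalty, $\lambda_1 \sum_j |\beta_j| + \lambda_2 \sum_{(s,t) \in E} |\beta_s - \beta_t|$, where $E$ is the edge set of the 2d lattice. The function names and the two-parameter split are illustrative assumptions, not taken from the lecture.

```python
def lattice_edges(n_rows, n_cols):
    """Edges between horizontally and vertically adjacent cells of a 2d lattice.

    Cells are numbered row by row, so cell (r, c) has index r * n_cols + c.
    """
    edges = []
    for r in range(n_rows):
        for c in range(n_cols):
            s = r * n_cols + c
            if c + 1 < n_cols:
                edges.append((s, s + 1))        # right neighbor
            if r + 1 < n_rows:
                edges.append((s, s + n_cols))   # neighbor below
    return edges


def fused_penalty(beta, edges, lam_sparse, lam_fuse):
    """Assumed penalty: lam_sparse * ||beta||_1 + lam_fuse * sum |beta_s - beta_t|."""
    sparsity = sum(abs(b) for b in beta)
    fusion = sum(abs(beta[s] - beta[t]) for s, t in edges)
    return lam_sparse * sparsity + lam_fuse * fusion
```

For a 2 x 2 image, `lattice_edges(2, 2)` yields four edges, and a coefficient image that is constant within each row pays the fusion cost only on the vertical edges between the rows.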

Generalized LASSO
Assume a structured linear regression model: [formula not recoverable]
If D is invertible, then get a new LASSO problem if we substitute [formula not recoverable]
Otherwise, not equivalent
For solution path, see Ryan Tibshirani and Jonathan Taylor, The Solution Path of the Generalized Lasso. Annals of Statistics.

Generalized LASSO
Let D = [matrix not recoverable]
$$\hat\beta = \arg\min_{\beta \in \mathbb{R}^n} \frac{1}{2}\|y - \beta\|_2^2 + \lambda\|D\beta\|_1$$
This is the 1d fused lasso

Generalized LASSO
$$\hat\beta = \arg\min_{\beta \in \mathbb{R}^n} \frac{1}{2}\|y - \beta\|_2^2 + \lambda\|D\beta\|_1$$
Suppose D gives adjacent differences in $\beta$: $D_i = (0, 0, \ldots, -1, \ldots, 1, \ldots, 0)$, where adjacency is defined according to a graph G. For a 2d grid, this is the 2d fused lasso.

Generalized LASSO: Trend filtering
Let D = [matrix not recoverable]
$$\hat\beta = \arg\min_{\beta \in \mathbb{R}^n} \frac{1}{2}\|y - \beta\|_2^2 + \lambda\|D\beta\|_1$$
This is linear trend filtering

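Generalized LASSO
$$\hat\beta = \arg\min_{\beta \in \mathbb{R}^n} \frac{1}{2}\|y - \beta\|_2^2 + \lambda\|D\beta\|_1$$
Let D = [matrix not recoverable]
Get quadratic trend filtering

Generalized LASSO
Tracing out the fits as a function of the regularization parameter $\lambda$
[Figure: $\hat\beta_\lambda$ for $\lambda = 25$; $\hat\beta_\lambda$ for $\lambda \in [0, \infty]$]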
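The penalty matrices D for the 1d fused lasso and for trend filtering were lost from the slides. The sketch below assumes the standard choice of discrete difference operators: first differences for the 1d fused lasso, second differences for linear trend filtering, and third differences for quadratic trend filtering. The helper names are hypothetical, and the sign convention of each row does not affect $\|D\beta\|_1$.

```python
def difference_matrix(n, order=1):
    """Return the (n - order) x n matrix of order-th discrete differences.

    order=1 gives rows like (-1, 1), order=2 gives (1, -2, 1),
    order=3 gives (-1, 3, -3, 1), each shifted along the diagonal.
    """
    # Start from the identity and difference consecutive rows `order` times.
    D = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(order):
        D = [[nxt - cur for cur, nxt in zip(D[i], D[i + 1])]
             for i in range(len(D) - 1)]
    return D


def l1_of_differences(beta, D):
    """Compute ||D beta||_1, the generalized LASSO penalty without lambda."""
    return sum(abs(sum(d * b for d, b in zip(row, beta))) for row in D)
```

Under this assumption a piecewise-constant signal has a small first-difference penalty, a linear signal has zero second-difference penalty, and a quadratic signal has zero third-difference penalty, which matches the shapes each method favors.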

Acknowledgements
Some material relating to the fused/generalized LASSO slides was provided by Ryan Tibshirani

Case Study 3: fMRI Prediction
LASSO Solvers Part 1: LARS

LASSO Algorithms
Standard convex optimizer
Now: Least angle regression (LAR), Efron et al.
  Computes entire path of solutions
  State-of-the-art until 2008
Next up:
  Pathwise coordinate descent ("shooting"): new
  Parallel (approx.) methods

LARS (Efron et al.)
LAR is an efficient stepwise variable selection algorithm
  useful and less greedy version of traditional forward selection methods
Can be modified to compute regularization path of LASSO → LARS (Least angle regression and shrinkage)
Increasing upper bound B, coefficients gradually turn on
  Few critical values of B where support changes
  Non-zero coefficients increase or decrease linearly between critical points
  Can solve for critical values analytically
Complexity: [not recoverable]

LASSO Coefficient Path
[Figure: LASSO coefficient path. From Kevin Murphy textbook]

LARS Algorithm Overview
Start with all coefficient estimates set to zero
Let $A$ be the active set of covariates most correlated with the current residual
Initially, $A = \{x_{j_1}\}$ for some covariate $x_{j_1}$
Take the largest possible step in the direction of $x_{j_1}$ until another covariate $x_{j_2}$ enters $A$
Continue in the direction equiangular between $x_{j_1}$ and $x_{j_2}$ until a third covariate $x_{j_3}$ enters $A$
Continue in the direction equiangular between $x_{j_1}$, $x_{j_2}$, $x_{j_3}$ until a fourth covariate $x_{j_4}$ enters $A$
This procedure continues until all covariates are added, at which point the fit reaches the full least-squares solution

Comments
LARS increases $A$, but LASSO allows it to decrease
Only involves a single index at a time
If p > N, LASSO returns at most N variables
If group of variables are highly correlated, LASSO tends to choose one to include rather arbitrarily
  Straightforward to observe from LARS algorithm. Sensitive to noise.

More Comments
In general, can't solve analytically for GLM (e.g., logistic reg.)
  Gradually decrease λ and use efficiency of computing $\hat\beta(\lambda_k)$ from $\hat\beta(\lambda_{k-1})$ = warm-start strategy
  See Friedman et al. for coordinate descent + warm-starting strategy
If N > p, but variables are correlated, ridge regression tends to have better predictive performance than LASSO (Zou & Hastie 2005)
  Elastic net is hybrid between LASSO and ridge regression

Case Study 3: fMRI Prediction
LASSO Solvers Part 2: SCD for LASSO (Shooting), Parallel SCD (Shotgun), Parallel SGD, Averaging Solutions

Scaling Up LASSO Solvers
Another way to solve LASSO problem:
  Stochastic Coordinate Descent (SCD)
  Minimizing a coordinate in LASSO
A simple SCD for LASSO (Shooting)
  Your HW, a more efficient implementation!
  Analysis of SCD
Parallel SCD (Shotgun)
Other parallel learning approaches for linear models
  Parallel stochastic gradient descent (SGD)
  Parallel independent solutions then averaging

Coordinate Descent
Given a function F
  Want to find minimum
Often, hard to find minimum for all coordinates, but easy for one coordinate
Coordinate descent: [update rule not recoverable]
How do we pick a coordinate?
When does this converge to optimum?

Soft Thresholding
$$\hat\beta_j = \begin{cases} (c_j + \lambda)/a_j & c_j < -\lambda \\ 0 & c_j \in [-\lambda, \lambda] \\ (c_j - \lambda)/a_j & c_j > \lambda \end{cases}$$
From Kevin Murphy textbook

Stochastic Coordinate Descent for LASSO (aka Shooting Algorithm)
Repeat until convergence
  Pick a coordinate j at random
  Set:
  $$\hat\beta_j = \begin{cases} (c_j + \lambda)/a_j & c_j < -\lambda \\ 0 & c_j \in [-\lambda, \lambda] \\ (c_j - \lambda)/a_j & c_j > \lambda \end{cases}$$
  Where:
  $$a_j = 2\sum_{i=1}^N (x^i_j)^2, \qquad c_j = 2\sum_{i=1}^N x^i_j \bigl(y^i - \beta_{-j}' x^i_{-j}\bigr)$$

Analysis of SCD [Shalev-Shwartz, Tewari '09/'11]
Analysis works for LASSO, L1 regularized logistic regression, and other objectives!
For (coordinate-wise) strongly convex functions: [condition not recoverable]
Theorem:
  Starting from [not recoverable]
  After T iterations [bound not recoverable]
  Where E[ ] is wrt random coordinate choices of SCD
Natural question: How do SCD & SGD convergence rates differ?
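The shooting update above can be sketched in a few lines. This is a minimal, dependency-free illustration that assumes the objective $\|X\beta - y\|_2^2 + \lambda\|\beta\|_1$, rows of `X` as data points, and a fixed iteration budget in place of a convergence test; the function names and the residual-caching trick are choices made here, not the homework's reference implementation.

```python
import random


def soft_threshold(c, a, lam):
    """Closed-form minimizer of the one-coordinate LASSO problem."""
    if c < -lam:
        return (c + lam) / a
    if c > lam:
        return (c - lam) / a
    return 0.0


def shooting(X, y, lam, n_iters=2000, seed=0):
    """Stochastic coordinate descent for the LASSO (rows of X are data points)."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    # a_j = 2 * sum_i (x_ij)^2 does not depend on beta, so compute it once.
    a = [2.0 * sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    # Cache the residual r = y - X beta so each update costs O(N), not O(N p).
    r = list(y)
    for _ in range(n_iters):
        j = rng.randrange(p)
        if a[j] == 0.0:
            continue  # an all-zero column never enters the model
        # c_j = 2 * sum_i x_ij * (y_i - beta_{-j}' x_{i,-j})
        #     = 2 * sum_i x_ij * (r_i + x_ij * beta_j)
        c = 2.0 * sum(X[i][j] * (r[i] + X[i][j] * beta[j]) for i in range(n))
        new_bj = soft_threshold(c, a[j], lam)
        delta = new_bj - beta[j]
        if delta != 0.0:
            for i in range(n):
                r[i] -= X[i][j] * delta
            beta[j] = new_bj
    return beta
```

With an orthonormal design each coordinate is solved exactly on its first visit: for `X` the 2 x 2 identity, `y = [3.0, 0.1]` and `lam = 1.0`, the first coefficient is shrunk from 3 to 2.5 and the second is set exactly to zero.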

Shooting: Sequential SCD
Lasso: $\min_\beta F(\beta)$ where $F(\beta) = \|X\beta - y\|_2^2 + \lambda\|\beta\|_1$
Stochastic Coordinate Descent (SCD) (e.g., Shalev-Shwartz & Tewari, 2009)
While not converged,
  Choose random coordinate j,
  Update $\beta_j$ (closed-form minimization)
[Figure: F(β) contour]

Shotgun: Parallel SCD [Bradley et al. '11]
Lasso: $\min_\beta F(\beta)$ where $F(\beta) = \|X\beta - y\|_2^2 + \lambda\|\beta\|_1$
Shotgun (Parallel SCD)
While not converged,
  On each of P processors,
    Choose random coordinate j,
    Update $\beta_j$ (same as for Shooting)
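The Shotgun loop above can be simulated on a single machine by computing P coordinate updates from the same stale β and then applying them together. The sketch below makes two assumptions that are not in the slides: coordinates within a round are drawn without replacement so no coordinate is updated twice, and a fixed number of rounds replaces the convergence test. It simulates parallelism rather than using threads.

```python
import random


def soft_threshold(c, a, lam):
    """Closed-form minimizer of the one-coordinate LASSO problem."""
    if c < -lam:
        return (c + lam) / a
    if c > lam:
        return (c - lam) / a
    return 0.0


def shotgun(X, y, lam, P, n_rounds=500, seed=0):
    """Simulated parallel SCD: P coordinates per round, all read the same beta."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    a = [2.0 * sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    r = list(y)  # residual y - X beta
    for _ in range(n_rounds):
        # Assumption: distinct coordinates per round (requires P <= p).
        chosen = rng.sample(range(p), P)
        # Each simulated processor computes its update from the shared residual,
        # without seeing the other processors' updates from this round.
        deltas = {}
        for j in chosen:
            if a[j] == 0.0:
                continue
            c = 2.0 * sum(X[i][j] * (r[i] + X[i][j] * beta[j]) for i in range(n))
            deltas[j] = soft_threshold(c, a[j], lam) - beta[j]
        # Apply the collective update and refresh the residual.
        for j, d in deltas.items():
            beta[j] += d
            for i in range(n):
                r[i] -= X[i][j] * d
    return beta
```

With `P = 1` this reduces to Shooting. With uncorrelated columns the simultaneous updates do not interfere, so the result matches the sequential solution; with strongly correlated columns and large P the collective step can overshoot.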

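Is SCD inherently sequential?
Lasso: $\min_\beta F(\beta)$ where $F(\beta) = \|X\beta - y\|_2^2 + \lambda\|\beta\|_1$
Coordinate update: $\beta_j \leftarrow \beta_j + \delta\beta_j$ (closed-form minimization)
Collective update: $\Delta\beta = (\delta\beta_i, 0, 0, \ldots, \delta\beta_j, 0)^\top$

Is SCD inherently sequential?
Lasso: $\min_\beta F(\beta)$ where $F(\beta) = \|X\beta - y\|_2^2 + \lambda\|\beta\|_1$
Theorem: If X is normalized s.t. $\mathrm{diag}(X^\top X) = 1$,
$$F(\beta + \Delta\beta) - F(\beta) \le -\sum_{i_j \in P} (\delta\beta_{i_j})^2 + \sum_{i_j, i_k \in P,\; j \ne k} (X^\top X)_{i_j, i_k}\, \delta\beta_{i_j}\, \delta\beta_{i_k}$$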

Is SCD inherently sequential?
Theorem: If X is normalized s.t. $\mathrm{diag}(X^\top X) = 1$,
$$F(\beta + \Delta\beta) - F(\beta) \le -\sum_{i_j \in P} (\delta\beta_{i_j})^2 + \sum_{i_j, i_k \in P,\; j \ne k} (X^\top X)_{i_j, i_k}\, \delta\beta_{i_j}\, \delta\beta_{i_k}$$
Nice case: Uncorrelated features
Bad case: Correlated features

Shotgun: Convergence Analysis
Lasso: $\min_\beta F(\beta)$ where $F(\beta) = \|X\beta - y\|_2^2 + \lambda\|\beta\|_1$
Assume # parallel updates $P < p/\rho + 1$
Generalizes bounds for Shooting (Shalev-Shwartz & Tewari, 2009)

Convergence Analysis
Lasso: $\min_\beta F(\beta)$ where $F(\beta) = \|X\beta - y\|_2^2 + \lambda\|\beta\|_1$
Theorem: Shotgun Convergence
Assume $P < p/\rho + 1$ where $\rho$ = spectral radius of $X^\top X$
$$E\bigl[F(\beta^{(T)})\bigr] - F(\beta^*) \le \frac{p\left(\frac{1}{2}\|\beta^*\|_2^2 + F(\beta^{(0)})\right)}{TP}$$
Nice case: Uncorrelated features, $\rho = 1$, $P_{\max} = p$
Bad case: Correlated features, $\rho = p$, $P_{\max} = 1$ (at worst)

Empirical Evaluation
[Figure: iterations to convergence vs. P (# simulated parallel updates). Panels: Mug32_singlepixcam (p = 1024) and Ball64_singlepixcam (p = 4096, $P_{\max}$ = 3)]
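To see how the condition $P < p/\rho + 1$ behaves on a given design matrix, the spectral radius of $X^\top X$ can be estimated numerically. The sketch below uses plain power iteration, which is an assumption made here for a dependency-free example: it can converge slowly when the top two eigenvalues are close, and the function names are hypothetical. Columns of `X` are assumed to be normalized to unit length, as in the theorem.

```python
import math
import random


def gram(X):
    """Compute G = X^T X for a matrix given as a list of rows."""
    n, p = len(X), len(X[0])
    return [[sum(X[i][j] * X[i][k] for i in range(n)) for k in range(p)]
            for j in range(p)]


def spectral_radius(G, n_iters=200, seed=0):
    """Estimate the largest eigenvalue of a symmetric PSD matrix by power iteration."""
    rng = random.Random(seed)
    p = len(G)
    v = [rng.random() + 0.1 for _ in range(p)]  # random positive start vector
    for _ in range(n_iters):
        w = [sum(G[j][k] * v[k] for k in range(p)) for j in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0  # G is the zero matrix
        v = [x / norm for x in w]
    # Rayleigh quotient v' G v with ||v|| = 1.
    w = [sum(G[j][k] * v[k] for k in range(p)) for j in range(p)]
    return sum(vj * wj for vj, wj in zip(v, w))


def parallel_update_bound(X):
    """Return p / rho + 1, the strict upper bound on P in the theorem."""
    p = len(X[0])
    return p / spectral_radius(gram(X)) + 1.0
```

For three orthonormal columns the bound is about 4, so up to 3 parallel updates are covered; for three identical unit columns it drops to 2, so only a single update per round is covered.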

Stepping Back
Stochastic coordinate descent
  Optimization:
  Parallel SCD:
    Issue:
    Solution:
Natural counterpart:
  Optimization:
  Parallel
    Issue:
    Solution:

Parallel SGD with No Locks [e.g., Hogwild!, Niu et al. '11]
Each processor in parallel:
  Pick data point i at random
  For j = 1, ..., p:
  Assume atomicity of:

What you need to know
Sparsistency
Fused LASSO
LASSO Solvers
  LARS
  A simple SCD for LASSO (Shooting)
    Your HW, a more efficient implementation!
  Analysis of SCD
  Parallel SCD (Shotgun)
