Machine Learning for Big Data, CSE547/STAT548, University of Washington. Emily Fox, February 4th, 2014.

Case Study 3: fMRI Prediction. Fused LASSO, LARS, Parallel LASSO Solvers. Machine Learning for Big Data CSE547/STAT548, University of Washington. Emily Fox, February 4th, 2014.

LASSO Regression. LASSO: least absolute shrinkage and selection operator. New objective: add an $\ell_1$ penalty to the least-squares loss, $F(\beta) = \|X\beta - y\|_2^2 + \lambda\|\beta\|_1$ (the form used on the solver slides later).

Geometric Intuition for Sparsity. [Figure: constraint regions in the $(\beta_1, \beta_2)$ plane with the least-squares solution $\hat\beta$; the $\ell_1$ diamond (Lasso) versus the $\ell_2$ disk (Ridge Regression).]

Soft Thresholding.
$$\hat\beta_j = \begin{cases} (c_j + \lambda)/a_j & c_j < -\lambda \\ 0 & c_j \in [-\lambda, \lambda] \\ (c_j - \lambda)/a_j & c_j > \lambda \end{cases}$$
From Kevin Murphy textbook.

LASSO Coefficient Path. [Figure: coefficient values along the regularization path; from Kevin Murphy textbook.]

Sparsistency. Typical statistical consistency analysis: holding the model size (p) fixed, as the number of samples (N) goes to infinity, the estimated parameter goes to the true parameter. Here we want to examine p >> N domains, so let both the model size p and the sample size N go to infinity. Hard case: N = k log p.

Sparsistency. Rescale the LASSO objective by N.

Theorem (Wainwright 2008, Zhao and Yu 2006, ...): under some constraints on the design matrix X, if we solve the LASSO regression using a suitably chosen $\lambda_N$, then for some $c_1 > 0$ the following holds with at least the stated probability: the LASSO problem has a unique solution $\hat\beta$ whose support is contained within the true support, $S(\hat\beta) \subseteq S(\beta^*)$; and if $\min_{j \in S(\beta^*)} |\beta^*_j| > c_2 \lambda_N$ for some $c_2 > 0$, then $S(\hat\beta) = S(\beta^*)$. (The choice of $\lambda_N$ and the probability bound are left blank on the slide.)

Case Study 3: fMRI Prediction. Fused LASSO. Machine Learning for Big Data CSE547/STAT548, University of Washington. Emily Fox, February 4th, 2014.

fMRI Prediction Subtask. Goal: predict semantic features (features of the word) from the fMRI image.

Fused LASSO. We might want coefficients of neighboring voxels to be similar. How do we modify the LASSO penalty to account for this? Graph-guided fused LASSO: assume a 2d lattice graph connecting neighboring pixels in the fMRI image. Penalty: a sketch of a standard form is given below.
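The slide leaves the penalty blank; a commonly used form of the graph-guided fused LASSO penalty (stated here as an assumption, not taken from the slides) combines a sparsity term with a fusion term over the edges E of the lattice graph:
$$\lambda \sum_{j=1}^{p} |\beta_j| \;+\; \gamma \sum_{(j,k) \in E} |\beta_j - \beta_k|,$$
so that coefficients joined by an edge are encouraged to take similar values.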

Generalized LASSO. Assume a structured linear regression model penalized through a matrix D, e.g., $\min_\beta \tfrac{1}{2}\|y - X\beta\|_2^2 + \lambda\|D\beta\|_1$. If D is invertible, then we get a new LASSO problem if we substitute (e.g., $\theta = D\beta$, so $\beta = D^{-1}\theta$); otherwise, the problems are not equivalent. For the solution path, see Ryan Tibshirani and Jonathan Taylor, The Solution Path of the Generalized Lasso, Annals of Statistics, 2011.

Generalized LASSO.
$$\hat\beta_\lambda = \arg\min_{\beta \in \mathbb{R}^n} \tfrac{1}{2}\|y - \beta\|_2^2 + \lambda\|D\beta\|_1.$$
Let
$$D = \begin{bmatrix} -1 & 1 & 0 & 0 & \cdots \\ 0 & -1 & 1 & 0 & \cdots \\ 0 & 0 & -1 & 1 & \cdots \\ & & & \ddots & \end{bmatrix}.$$
This is the 1d fused lasso. [Figure: example 1d fused lasso fit, plotted over indices 0 to 100.]

Generalized LASSO.
$$\hat\beta_\lambda = \arg\min_{\beta \in \mathbb{R}^n} \tfrac{1}{2}\|y - \beta\|_2^2 + \lambda\|D\beta\|_1.$$
Suppose D gives adjacent differences in $\beta$: $D_i = (0, 0, \ldots, -1, \ldots, 1, \ldots, 0)$, where adjacency is defined according to a graph G. For a 2d grid, this is the 2d fused lasso.

Generalized LASSO: trend filtering. With the same objective, let
$$D = \begin{bmatrix} 1 & -2 & 1 & 0 & \cdots \\ 0 & 1 & -2 & 1 & \cdots \\ 0 & 0 & 1 & -2 & \cdots \\ & & & \ddots & \end{bmatrix}.$$
This is linear trend filtering. [Figure: example linear trend filtering fit, plotted over indices 0 to 100.] A small numpy sketch of constructing these difference matrices follows below.
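A minimal numpy sketch (illustrative, not from the slides) of building these penalty matrices: the first-difference D for the 1d fused lasso and the second-difference D for linear trend filtering.

```python
import numpy as np

def first_difference_matrix(n):
    """D for the 1d fused lasso: row i is (0, ..., -1, 1, ..., 0)."""
    D = np.zeros((n - 1, n))
    for i in range(n - 1):
        D[i, i] = -1.0
        D[i, i + 1] = 1.0
    return D

def second_difference_matrix(n):
    """D for linear trend filtering: row i is (0, ..., 1, -2, 1, ..., 0)."""
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i] = 1.0
        D[i, i + 1] = -2.0
        D[i, i + 2] = 1.0
    return D

# Example: the penalty ||D beta||_1 used in the generalized LASSO objective
beta = np.linspace(0.0, 5.0, 10)
print(np.abs(first_difference_matrix(10) @ beta).sum())
```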

Generalized LASSO. With the same objective, let
$$D = \begin{bmatrix} -1 & 3 & -3 & 1 & \cdots \\ 0 & -1 & 3 & -3 & \cdots \\ 0 & 0 & -1 & 3 & \cdots \\ & & & \ddots & \end{bmatrix}.$$
This gives quadratic trend filtering. [Figure: example quadratic trend filtering fit, plotted over indices 0 to 100.]

Generalized LASSO. Tracing out the fits as a function of the regularization parameter. [Figures: $\hat\beta_\lambda$ for $\lambda = 25$, and $\hat\beta_\lambda$ traced over a range of $\lambda$, plotted over indices 0 to 100.]

Acknowledgements. Some material relating to the fused/generalized LASSO slides was provided by Ryan Tibshirani.

Case Study 3: fMRI Prediction. LASSO Solvers, Part 1: LARS. Machine Learning for Big Data CSE547/STAT548, University of Washington. Emily Fox, February 4th, 2014.

LASSO Algorithms. A standard convex optimizer can be used. Now: least angle regression (LAR), Efron et al. 2004, which computes the entire path of solutions and was state-of-the-art until 2008. Next up: pathwise coordinate descent ("shooting") and newer parallel (approximate) methods.

LARS (Efron et al. 2004). LAR is an efficient stepwise variable selection algorithm, a useful and less greedy version of traditional forward selection methods. It can be modified to compute the regularization path of the LASSO, giving LARS (least angle regression and shrinkage). As the upper bound B increases, coefficients gradually turn on; there are a few critical values of B where the support changes; non-zero coefficients increase or decrease linearly between critical points; and the critical values can be solved for analytically. Complexity: (left blank on the slide).

LASSO Coefficient Path. [Figure: coefficient path, from Kevin Murphy textbook.]

LARS Algorithm Overview. Start with all coefficient estimates set to zero. Let A be the active set of covariates most correlated with the current residual. Initially, $A = \{x_{j_1}\}$ for some covariate $x_{j_1}$. Take the largest possible step in the direction of $x_{j_1}$ until another covariate $x_{j_2}$ enters A. Continue in the direction equiangular between $x_{j_1}$ and $x_{j_2}$ until a third covariate $x_{j_3}$ enters A. Continue in the direction equiangular between $x_{j_1}$, $x_{j_2}$, $x_{j_3}$ until a fourth covariate $x_{j_4}$ enters A. This procedure continues until all covariates have been added, at which point the fit reaches the full least-squares solution.

Comments. LARS only ever grows A, but the LASSO modification allows variables to leave A; only a single index changes at a time. If p > N, the LASSO returns at most N variables. If a group of variables is highly correlated, the LASSO tends to choose one of them to include rather arbitrarily (straightforward to observe from the LARS algorithm) and is therefore sensitive to noise.

More Comments. In general, we can't solve analytically for GLMs (e.g., logistic regression). Instead, gradually decrease λ and exploit the efficiency of computing $\hat\beta(\lambda_k)$ from $\hat\beta(\lambda_{k-1})$: the warm-start strategy. See Friedman et al. 2010 for a coordinate descent + warm-starting strategy; a small sketch of the idea is given below. If N > p but the variables are correlated, ridge regression tends to have better predictive performance than the LASSO (Zou & Hastie 2005). The elastic net is a hybrid between the LASSO and ridge regression.
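A minimal sketch of the warm-start idea, using scikit-learn's Lasso as a stand-in for a glmnet-style coordinate-descent solver (the λ grid, function name, and use of scikit-learn are illustrative assumptions, not from the lecture):

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_path_warm_start(X, y, n_lambdas=20):
    """Solve a sequence of LASSO problems from large to small lambda,
    reusing the previous solution as the starting point (warm start)."""
    # Smallest alpha at which all coefficients are exactly zero (note: sklearn's
    # alpha corresponds roughly to lambda / (2N) in the slides' objective),
    # followed by a geometric grid decreasing from it.
    alpha_max = np.max(np.abs(X.T @ y)) / len(y)
    alphas = np.logspace(np.log10(alpha_max), np.log10(alpha_max * 1e-3), n_lambdas)

    model = Lasso(alpha=alphas[0], warm_start=True, fit_intercept=False, max_iter=10000)
    path = []
    for alpha in alphas:
        model.set_params(alpha=alpha)   # coefficients from the previous fit are kept
        model.fit(X, y)
        path.append(model.coef_.copy())
    return alphas, np.array(path)
```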

Case Study 3: fMRI Prediction. LASSO Solvers, Part 2: SCD for LASSO (Shooting), Parallel SCD (Shotgun), Parallel SGD, Averaging Solutions. Machine Learning for Big Data CSE547/STAT548, University of Washington. Emily Fox, February 4th, 2014.

Scaling Up LASSO Solvers. Another way to solve the LASSO problem: stochastic coordinate descent (SCD). Topics: minimizing a single coordinate of the LASSO objective; a simple SCD for LASSO (Shooting), with a more efficient implementation in your HW; analysis of SCD; parallel SCD (Shotgun); and other parallel learning approaches for linear models, namely parallel stochastic gradient descent (SGD) and parallel independent solutions followed by averaging.

Coordinate Descent. Given a function F, we want to find its minimum. Often it is hard to minimize over all coordinates at once, but easy to minimize over a single coordinate. Coordinate descent: repeatedly pick a coordinate and minimize F over it while holding the others fixed. How do we pick a coordinate? When does this converge to the optimum?

Soft Thresholding.
$$\hat\beta_j = \begin{cases} (c_j + \lambda)/a_j & c_j < -\lambda \\ 0 & c_j \in [-\lambda, \lambda] \\ (c_j - \lambda)/a_j & c_j > \lambda \end{cases}$$
From Kevin Murphy textbook.

Stochastic Coordinate Descent for LASSO (aka Shooting Algorithm). Repeat until convergence: pick a coordinate j at random and set
$$\hat\beta_j = \begin{cases} (c_j + \lambda)/a_j & c_j < -\lambda \\ 0 & c_j \in [-\lambda, \lambda] \\ (c_j - \lambda)/a_j & c_j > \lambda \end{cases}$$
where
$$a_j = 2\sum_{i=1}^{N} (x_j^i)^2, \qquad c_j = 2\sum_{i=1}^{N} x_j^i \big(y^i - \beta_{-j}^\top x_{-j}^i\big).$$
A runnable sketch of this loop is given below.

Analysis of SCD [Shalev-Shwartz, Tewari '09/'11]. The analysis works for LASSO, L1-regularized logistic regression, and other objectives. For (coordinate-wise) strongly convex functions: Theorem: starting from a given initial point, after T iterations the expected suboptimality $\mathbb{E}[F(\beta^{(T)})] - F(\beta^*)$ is bounded (the explicit bound is left blank on the slide), where the expectation is over the random coordinate choices of SCD. Natural question: how do the SCD and SGD convergence rates differ?
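A minimal numpy sketch of the Shooting loop above, assuming no intercept term; the function and variable names are illustrative, and this is not the homework implementation:

```python
import numpy as np

def soft_threshold(c, a, lam):
    """Coordinate-wise LASSO minimizer from the soft-thresholding slide."""
    if c < -lam:
        return (c + lam) / a
    elif c > lam:
        return (c - lam) / a
    return 0.0

def shooting_lasso(X, y, lam, n_iters=10000, seed=0):
    """Stochastic coordinate descent (Shooting) for
    F(beta) = ||X beta - y||_2^2 + lam * ||beta||_1."""
    rng = np.random.default_rng(seed)
    N, p = X.shape
    beta = np.zeros(p)
    a = 2.0 * np.sum(X ** 2, axis=0)           # a_j = 2 * sum_i (x_j^i)^2
    for _ in range(n_iters):
        j = rng.integers(p)                     # pick a coordinate at random
        # residual with coordinate j removed from the current fit
        r_minus_j = y - X @ beta + X[:, j] * beta[j]
        c_j = 2.0 * X[:, j] @ r_minus_j         # c_j = 2 * sum_i x_j^i (y^i - beta_{-j}.x_{-j}^i)
        beta[j] = soft_threshold(c_j, a[j], lam)
    return beta

# Tiny usage example on synthetic data
X = np.random.default_rng(1).normal(size=(50, 10))
true_beta = np.zeros(10)
true_beta[:3] = [2.0, -1.0, 0.5]
y = X @ true_beta + 0.1 * np.random.default_rng(2).normal(size=50)
print(np.round(shooting_lasso(X, y, lam=1.0), 2))
```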

Shooting: Sequential SCD. Lasso: $\min_\beta F(\beta)$ where $F(\beta) = \|X\beta - y\|_2^2 + \lambda\|\beta\|_1$. Stochastic coordinate descent (SCD) (e.g., Shalev-Shwartz & Tewari, 2009): while not converged, choose a random coordinate j and update $\beta_j$ (closed-form minimization). [Figure: contours of F(β).]

Shotgun: Parallel SCD [Bradley et al. '11]. Lasso: $\min_\beta F(\beta)$ where $F(\beta) = \|X\beta - y\|_2^2 + \lambda\|\beta\|_1$. Shotgun (parallel SCD): while not converged, on each of P processors choose a random coordinate j and update $\beta_j$ (same as for Shooting). A simulated-parallel sketch follows below.
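A minimal sketch that simulates Shotgun's parallel step (an assumed setup, not Bradley et al.'s code): P coordinate updates are all computed from the same current β and then applied together, matching the collective update Δβ on the next slide.

```python
import numpy as np

def soft_threshold(c, a, lam):
    if c < -lam:
        return (c + lam) / a
    elif c > lam:
        return (c - lam) / a
    return 0.0

def shotgun_lasso(X, y, lam, P=4, n_rounds=2000, seed=0):
    """Simulated Shotgun: each round, P coordinates are updated 'in parallel',
    i.e., their new values are computed from the same beta and applied jointly."""
    rng = np.random.default_rng(seed)
    N, p = X.shape
    beta = np.zeros(p)
    a = 2.0 * np.sum(X ** 2, axis=0)
    for _ in range(n_rounds):
        coords = rng.choice(p, size=min(P, p), replace=False)
        delta = np.zeros(p)                       # collective update Delta-beta
        for j in coords:                          # in real Shotgun: one per processor
            r_minus_j = y - X @ beta + X[:, j] * beta[j]
            c_j = 2.0 * X[:, j] @ r_minus_j
            delta[j] = soft_threshold(c_j, a[j], lam) - beta[j]
        beta += delta                             # apply all P updates at once
    return beta
```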

Is SCD inherently sequential? Lasso: $\min_\beta F(\beta)$ where $F(\beta) = \|X\beta - y\|_2^2 + \lambda\|\beta\|_1$. Coordinate update: $\beta_j \leftarrow \beta_j + \delta\beta_j$ (closed-form minimization). Collective update: $\Delta\beta$ is the vector with entries $\delta\beta_i, \ldots, \delta\beta_j$ in the updated coordinates and zeros elsewhere.

Is SCD inherently sequential? Theorem: if X is normalized so that diag($X^T X$) = 1, then
$$F(\beta + \Delta\beta) - F(\beta) \le -\sum_{i_j \in P} (\delta\beta_{i_j})^2 + \sum_{\substack{i_j, i_k \in P\\ j \ne k}} (X^T X)_{i_j, i_k}\, \delta\beta_{i_j}\, \delta\beta_{i_k}.$$

Is SCD inherently sequential? With the theorem above: nice case, uncorrelated features (the cross terms vanish, so parallel updates do not interfere); bad case, correlated features (the cross terms can offset the decrease).

Shotgun: Convergence Analysis. Lasso: $\min_\beta F(\beta)$ where $F(\beta) = \|X\beta - y\|_2^2 + \lambda\|\beta\|_1$. Assume the number of parallel updates satisfies $P < d/\rho + 1$. This generalizes the bounds for Shooting (Shalev-Shwartz & Tewari, 2009).

Convergence Analysis. Lasso: $\min_\beta F(\beta)$ where $F(\beta) = \|X\beta - y\|_2^2 + \lambda\|\beta\|_1$. Theorem (Shotgun convergence): assume $P < d/\rho + 1$, where ρ is the spectral radius of $X^T X$ and d is the number of features. Then
$$\mathbb{E}\big[F(\beta^{(T)})\big] - F(\beta^*) \le \frac{d\left(\tfrac{1}{2}\|\beta^*\|_2^2 + F(\beta^{(0)})\right)}{T\,P}.$$
Nice case, uncorrelated features: ρ = 1, so $P_{\max} = d$. Bad case, correlated features: ρ = d (at worst), so $P_{\max} = 1$.

Empirical Evaluation. [Figures: iterations to convergence versus P, the number of simulated parallel updates. Mug32_singlepixcam: d = 1024, ρ = 6.4967, $P_{\max}$ = 158. Ball64_singlepixcam: d = 4096, ρ = 2047.8, $P_{\max}$ = 3.]

Stepping Back. Stochastic coordinate descent: the optimization approach, its parallel version (Shotgun), the issue that arises, and its solution (left blank on the slide). Natural counterpart, stochastic gradient descent: the optimization approach, its parallel version, the issue that arises, and its solution (left blank on the slide).

Parallel SGD with No Locks [e.g., Hogwild!, Niu et al. '11]. Each processor in parallel: pick a data point i at random; for j = 1, ..., p, update coordinate j using that data point's gradient. Assume atomicity of the individual coordinate update. A minimal sketch appears below.
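A minimal Python sketch of the lock-free idea for the LASSO-style least-squares loss, using threads as stand-ins for processors (thread and variable names are illustrative; Python threads only emulate Hogwild!-style shared-memory updates and this is not Niu et al.'s implementation):

```python
import numpy as np
import threading

def hogwild_sgd_lasso(X, y, lam, n_threads=4, steps_per_thread=5000, lr=1e-3, seed=0):
    """Lock-free parallel SGD sketch for F(beta) = ||X beta - y||_2^2 + lam ||beta||_1,
    using a subgradient for the l1 term. beta is shared; no locks are taken."""
    N, p = X.shape
    beta = np.zeros(p)          # shared parameter vector

    def worker(thread_id):
        rng = np.random.default_rng(seed + thread_id)
        for _ in range(steps_per_thread):
            i = rng.integers(N)                     # pick a data point at random
            resid = X[i] @ beta - y[i]              # read shared beta without locking
            grad = 2.0 * resid * X[i] + lam * np.sign(beta)
            for j in range(p):                      # per-coordinate writes, no locks
                beta[j] -= lr * grad[j]

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return beta
```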

What you need to know: sparsistency; fused LASSO; and LASSO solvers: LARS, a simple SCD for LASSO (Shooting) with a more efficient implementation in your HW, analysis of SCD, and parallel SCD (Shotgun).