LASSO Review, Fused LASSO, Parallel LASSO Solvers


Case Study 3: fMRI Prediction. LASSO Review, Fused LASSO, Parallel LASSO Solvers. Machine Learning for Big Data CSE547/STAT548, University of Washington. Sham Kakade, May 3, 2016.

Variable Selection. Ridge regression penalizes large weights, but what if we want to perform feature selection? E.g., which regions of the brain are important for word prediction? We can't simply choose the predictors with the largest coefficients in the ridge solution. All-subsets regression is computationally infeasible, and stepwise procedures are sensitive to data perturbations and often include features with negligible improvement in fit. Instead, try a new penalty that penalizes non-zero weights: the L1 penalty ||β||_1 = Σ_j |β_j|. This leads to sparse solutions, and just like ridge regression, the solution is indexed by a continuous parameter λ.

LASSO Regression. LASSO: least absolute shrinkage and selection operator. New objective: minimize the residual sum of squares plus an L1 penalty, Σ_i (y_i − x_i^T β)² + λ||β||_1.
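The objective can be written equivalently in penalized or constrained form; the constrained form is the one used later when LARS is described via an upper bound B on the L1 norm:

```latex
\hat{\beta}^{\text{lasso}}
  = \arg\min_{\beta}\;\sum_{i=1}^{N}\bigl(y_i - x_i^{T}\beta\bigr)^2
    + \lambda\,\|\beta\|_1
\quad\Longleftrightarrow\quad
\min_{\beta}\;\sum_{i=1}^{N}\bigl(y_i - x_i^{T}\beta\bigr)^2
  \;\;\text{s.t.}\;\; \|\beta\|_1 \le B
```

For every λ there is a corresponding bound B giving the same solution.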

Geometric Intuition for Sparsity. The L1 constraint region ||β||_1 ≤ B is a diamond with corners on the coordinate axes, so the constrained optimum often lands on a corner, where some coefficients are exactly zero; the L2 ball of ridge regression has no corners, so its solutions are shrunk but rarely exactly sparse.

Soft Thresholding. The coordinate-wise LASSO update is a soft-thresholding operation: soft(a; t) = sign(a) max(|a| − t, 0). From the Kevin Murphy textbook.
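A minimal implementation of the soft-thresholding operator (the function name is ours, not from the slides):

```python
# Soft-thresholding operator used in the LASSO coordinate update:
# soft(a; t) = sign(a) * max(|a| - t, 0).
def soft_threshold(a, t):
    if a > t:
        return a - t
    if a < -t:
        return a + t
    return 0.0
```

Values with magnitude below t are set exactly to zero, which is what produces sparsity; larger values are shrunk toward zero by t.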

LASSO Coefficient Path. From the Kevin Murphy textbook.

LASSO Example.

Sparsistency. Typical statistical consistency analysis: holding the model size (p) fixed, as the number of samples (N) goes to infinity, the estimated parameter goes to the true parameter. Here we want to examine p >> N domains, so let both the model size p and the sample size N go to infinity. The hard case is N = k log p.

Sparsistency. Rescale the LASSO objective by N: (1/N)||y − Xβ||²_2 + λ_N ||β||_1. Theorem (Zhao and Yu, 2006): under some constraints on the design matrix X, if we solve the LASSO regression with an appropriately chosen λ_N, then for some c_1 > 0 the following holds with probability tending to one: the LASSO problem has a unique solution with support contained within the true support. If, in addition, the nonzero entries of the true parameter exceed a threshold depending on some c_2 > 0, then the support is recovered exactly.
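A commonly cited form of the regularization scaling in sparsistency results (exact constants and normalizations differ across papers, so treat this as illustrative):

```latex
\min_{\beta}\;\frac{1}{N}\,\|y - X\beta\|_2^2 + \lambda_N \|\beta\|_1,
\qquad
\lambda_N \asymp \sqrt{\frac{\log p}{N}} .
```

This choice lets λ_N shrink to zero (so the estimate is not over-biased) while staying above the noise level, which is what allows support recovery even when N grows only like log p.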

Case Study 3: fMRI Prediction. Fused LASSO.

fMRI Prediction Subtask. Goal: predict the semantic features of a word from its fMRI image.

Fused LASSO. We might want the coefficients of neighboring voxels to be similar. How do we modify the LASSO penalty to account for this? Graph-guided fused LASSO: assume a 2-D lattice graph connecting neighboring pixels in the fMRI image, and penalize both the coefficients and their differences across edges: λ Σ_j |β_j| + γ Σ_{(i,j)∈E} |β_i − β_j|.
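A sketch of evaluating this penalty on a 2-D lattice (the function name and the choice of weights lam, gam are ours; the edge set is the horizontal/vertical neighbor pairs of the lattice):

```python
import numpy as np

def fused_lasso_penalty(B, lam, gam):
    """Graph-guided fused LASSO penalty for coefficients B laid out
    on a 2-D image lattice: lam * sum |b_j| + gam * sum over neighbor
    pairs |b_i - b_j|."""
    sparsity = lam * np.abs(B).sum()
    # |beta_i - beta_j| over vertical (axis=0) and horizontal (axis=1) edges
    fusion = gam * (np.abs(np.diff(B, axis=0)).sum() +
                    np.abs(np.diff(B, axis=1)).sum())
    return sparsity + fusion
```

The fusion term is zero exactly when neighboring coefficients agree, which is what pushes the solution toward spatially smooth, piecewise-constant coefficient maps.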

Generalized LASSO. Assume a structured linear regression model: min_β ||y − Xβ||²_2 + λ||Dβ||_1 for a penalty matrix D. If D is invertible, then we get a standard LASSO problem by substituting θ = Dβ; otherwise, the problems are not equivalent. For the solution path, see Ryan Tibshirani and Jonathan Taylor, "The Solution Path of the Generalized Lasso," Annals of Statistics, 2011.

Generalized LASSO Examples [figures not transcribed].

Generalized LASSO. Tracing out the fits as a function of the regularization parameter.

Case Study 3: fMRI Prediction. LASSO Solvers, Part 1: LARS.

LASSO Algorithms. So far: a standard convex optimizer. Now: least angle regression (LAR), Efron et al. 2004, which computes the entire path of solutions. Next up: pathwise coordinate descent ("shooting") and parallel (approximate) methods.

LARS (Efron et al. 2004). LAR is an efficient stepwise variable selection algorithm: a useful and less greedy version of traditional forward selection methods. It can be modified to compute the regularization path of the LASSO; this variant is LARS (least angle regression and shrinkage). As the upper bound B on ||β||_1 increases, coefficients gradually turn on. There are few critical values of B where the support changes; between critical values, the non-zero coefficients increase or decrease linearly, and the critical values can be solved for analytically. Complexity: comparable to a single least-squares fit.

LASSO Coefficient Path. From the Kevin Murphy textbook.

LARS Algorithm Overview. Start with all coefficient estimates at zero. Let A be the active set: the covariates most correlated with the current residual. Initially, A = {x_{j1}} for some covariate x_{j1}. Take the largest possible step in the direction of x_{j1} until another covariate x_{j2} enters A. Continue in the direction equiangular between x_{j1} and x_{j2} until a third covariate x_{j3} enters A, then in the direction equiangular between x_{j1}, x_{j2}, x_{j3} until a fourth covariate enters. This procedure continues until all covariates have been added.

Comments. LAR only ever grows the active set, but the LASSO modification allows a coefficient to cross zero and be dropped. Each step involves only a single index at a time. If p > N, the LASSO returns at most N variables. If a group of variables is highly correlated, the LASSO tends to choose one of them to include rather arbitrarily (straightforward to observe from the LARS algorithm), and this choice is sensitive to noise.

More Comments. In general, we can't solve the path analytically for GLMs (e.g., logistic regression). Instead, gradually decrease λ and exploit the efficiency of computing the solution at the new λ starting from the solution at the previous λ: the warm-start strategy. See Friedman et al. 2010 for coordinate descent combined with warm starts. If N > p but the variables are correlated, ridge regression tends to have better predictive performance than the LASSO (Zou & Hastie 2005). The elastic net is a hybrid between the LASSO and ridge regression.
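The warm-start strategy can be sketched as follows (a minimal illustration, not the glmnet implementation; the solver here is a plain cyclic coordinate-descent LASSO):

```python
import numpy as np

def cd_lasso(X, y, lam, beta0, sweeps=200):
    """Cyclic coordinate descent for min ||y - X b||^2 + lam * ||b||_1,
    started from beta0 (the warm start)."""
    beta = beta0.copy()
    a = 2.0 * np.sum(X ** 2, axis=0)           # a_j = 2 * sum_i x_ij^2
    for _ in range(sweeps):
        for j in range(X.shape[1]):
            r = y - X @ beta + X[:, j] * beta[j]   # residual ignoring coord j
            c = 2.0 * X[:, j] @ r
            beta[j] = np.sign(c) * max(abs(c) - lam, 0.0) / a[j]
    return beta

def lasso_path(X, y, lams):
    """Solve for decreasing lambda, initializing each solve at the
    previous solution (the warm-start strategy)."""
    beta = np.zeros(X.shape[1])
    path = []
    for lam in sorted(lams, reverse=True):
        beta = cd_lasso(X, y, lam, beta)
        path.append((lam, beta.copy()))
    return path
```

Since consecutive solutions on the path are close, each warm-started solve needs far fewer sweeps than solving from scratch, which is why pathwise solvers compute the whole path cheaply.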

Case Study 3: fMRI Prediction. LASSO Solvers, Part 2: SCD for LASSO (Shooting), Parallel SCD (Shotgun), Parallel SGD, Averaging Solutions.

Scaling Up LASSO Solvers. Another way to solve the LASSO problem: stochastic coordinate descent (SCD). Topics: minimizing a single coordinate in the LASSO; a simple SCD for LASSO (shooting), with a more efficient implementation in your HW; analysis of SCD; parallel SCD (Shotgun); and other parallel learning approaches for linear models: parallel stochastic gradient descent (SGD), solving independent problems in parallel and then averaging, and ADMM.

Coordinate Descent. Given a function F, we want to find its minimum. Often it is hard to minimize over all coordinates jointly, but easy to minimize over one coordinate at a time. Coordinate descent: repeatedly pick a coordinate j and set it to the minimizer of F along that coordinate, holding the others fixed. How do we pick a coordinate? When does this converge to the optimum?

Soft Thresholding. From the Kevin Murphy textbook.

Stochastic Coordinate Descent for LASSO (aka Shooting Algorithm). Repeat until convergence: pick a coordinate j at random and set β_j = soft(c_j / a_j; λ / a_j), where a_j = 2 Σ_i x_ij² and c_j = 2 Σ_i x_ij (y_i − Σ_{k≠j} x_ik β_k); that is, c_j measures the correlation of feature j with the residual that ignores coordinate j.
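A minimal sketch of the shooting algorithm with the update above (fixed iteration budget instead of a convergence test, for simplicity):

```python
import numpy as np

def shooting_lasso(X, y, lam, n_iters=2000, seed=0):
    """Stochastic coordinate descent for min ||y - X b||^2 + lam * ||b||_1."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    a = 2.0 * np.sum(X ** 2, axis=0)           # a_j = 2 * sum_i x_ij^2
    for _ in range(n_iters):
        j = rng.integers(p)                     # pick a coordinate at random
        r = y - X @ beta + X[:, j] * beta[j]    # residual ignoring coord j
        c = 2.0 * X[:, j] @ r                   # c_j = 2 * x_j . r
        # closed-form coordinate minimizer: beta_j = soft(c_j, lam) / a_j
        beta[j] = np.sign(c) * max(abs(c) - lam, 0.0) / a[j]
    return beta
```

Note that each update is exact for its coordinate (no step size to tune), and a_j can be precomputed once, which is what makes coordinate methods attractive for the LASSO.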

Analysis of SCD [Shalev-Shwartz, Tewari '09/'11]. The analysis works for the LASSO, L1-regularized logistic regression, and other objectives. For (coordinate-wise) strongly convex functions: Theorem: starting from zero, after T iterations the expected suboptimality E[F(β^(T))] − F(β*) is bounded by a term that shrinks with T and scales with the number of coordinates p, where E[·] is with respect to the random coordinate choices of SCD. A natural question: how do the SCD and SGD convergence rates differ?

Shooting: Sequential SCD. Lasso: min_β F(β), where F(β) = ||Xβ − y||²_2 + λ||β||_1. Stochastic coordinate descent (SCD) (e.g., Shalev-Shwartz & Tewari, 2009): while not converged, choose a random coordinate j and update β_j (a closed-form minimization). [Figure: contours of F(β).]

Shotgun: Parallel SCD [Bradley et al. '11]. Lasso: min_β F(β), where F(β) = ||Xβ − y||²_2 + λ||β||_1. Shotgun (parallel SCD): while not converged, on each of P processors choose a random coordinate j and update β_j (same update as for shooting).
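Shotgun's behavior can be simulated serially, which makes the key difference from shooting explicit: each round, P updates are all computed from the same snapshot of the iterate and then applied together, just as P processors working concurrently would do (a sketch, not the actual multicore implementation):

```python
import numpy as np

def shotgun_lasso(X, y, lam, P=4, n_rounds=500, seed=0):
    """Serial simulation of Shotgun: P coordinate updates per round,
    all computed from the SAME current iterate, then applied at once."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    a = 2.0 * np.sum(X ** 2, axis=0)
    for _ in range(n_rounds):
        js = rng.choice(p, size=min(P, p), replace=False)
        r = y - X @ beta                        # shared residual snapshot
        new = {}
        for j in js:                            # "P processors" in parallel
            c = 2.0 * X[:, j] @ (r + X[:, j] * beta[j])
            new[j] = np.sign(c) * max(abs(c) - lam, 0.0) / a[j]
        for j, v in new.items():                # apply the P updates together
            beta[j] = v
    return beta
```

Because the P updates ignore each other, they can conflict when the chosen coordinates are correlated; the Shotgun analysis bounds how large P can be as a function of the correlation structure of X before this hurts convergence.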

Stepping Back. Stochastic coordinate descent: each iteration solves one coordinate exactly in closed form. Parallel SCD (Shotgun): issue: simultaneous updates of correlated coordinates can conflict and slow convergence; solution: limit the number of parallel updates P based on the correlation structure of X. Natural counterpart: stochastic gradient descent: each iteration takes a noisy gradient step using one data point. Parallel SGD: issue: concurrent processors interfere with each other's updates; solution: run independent copies and average the solutions.

What you need to know: sparsistency; fused LASSO; LASSO solvers: LARS; a simple SCD for LASSO (shooting), with a more efficient implementation in your HW; analysis of SCD; parallel SCD (Shotgun).