Direct Learning: Linear Regression. Donglin Zeng, Department of Biostatistics, University of North Carolina


Parametric learning. We consider the core function in the prediction rule to be a parametric function. The most commonly used choice is a linear function: for squared loss, $f(X) = \beta_0 + X^T\beta$; for a classification problem, $I(\beta_0 + X^T\beta > 0)$. The optimal prediction rule is the one in this class that minimizes the expected prediction error (EPE).

Why linear rules? They are simple and easy to interpret: the coefficients are informative about the importance of each input feature. Estimation using linear rules is less variable. Although the best linear rule may not be the Bayes rule, its prediction performance is usually satisfactory in practice, especially with high-dimensional and noisy feature variables. Linear rules can be generalized to allow nonlinear effects and interactions by replacing X with basis functions of X (e.g., tensor splines).

Linear rule with squared loss. Training data: $(X_1, Y_1), \ldots, (X_n, Y_n)$. Learning the optimal linear rule is equivalent to minimizing the least-squares criterion
$$\sum_{i=1}^n \Big(Y_i - \beta_0 - \sum_{j=1}^p X_{ij}\beta_j\Big)^2.$$
The optimal rule is $\hat\beta_0 + x^T\hat\beta$, where $(\hat\beta_0, \hat\beta)^T = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$, $\mathbf{X}$ is the feature variable matrix (including a column of ones for the intercept), and $\mathbf{Y}$ is the response vector. For a future subject with features $x$ (augmented with a leading 1), the prediction is $x^T(\hat\beta_0, \hat\beta)^T = x^T(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$. For Gaussian linear models, inference for the finite-sample distribution of $(\hat\beta_0, \hat\beta)$ is well established.
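
A minimal R sketch of the least-squares rule on simulated data (the data and object names such as X, Y, and beta_hat are purely illustrative):

```r
## Simulated data; all names here are illustrative.
set.seed(1)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("X", 1:p)
beta <- c(2, -1, 0.5, 0, 0)
Y <- drop(1 + X %*% beta + rnorm(n))

## Closed-form LSE: (X^T X)^{-1} X^T Y, with a column of ones for the intercept
Xd <- cbind(1, X)
beta_hat <- solve(t(Xd) %*% Xd, t(Xd) %*% Y)

## Same fit via lm(), which also provides finite-sample Gaussian inference
fit <- lm(Y ~ X)
cbind(closed_form = beta_hat, lm = coef(fit))

## Prediction for a future subject x (augmented with a leading 1)
x_new <- c(1, rnorm(p))
sum(x_new * beta_hat)
```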

Analysis of prostate cancer data

Improve LSE for linear rules. LSE often has low bias but large variance. It is often of interest to determine a small subset of the feature variables that are most predictive of the outcome. This becomes even more important when the dimensionality of the feature space is high, i.e., $p \gg n$.

Best-subset selection. This is an exhaustive search method to identify the best subset of $\{X_1, \ldots, X_p\}$ optimizing a given criterion. The procedure: for each subset size $k \in \{0, 1, 2, \ldots, p\}$, use an efficient algorithm (the leaps-and-bounds procedure; Furnival and Wilson, 1974) to identify the best subset of size $k$, $C_k$, i.e., the one giving the smallest RSS. We then select the $k$ whose $C_k$ minimizes a given criterion (to be discussed later).
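
A hedged R sketch of best-subset search with the leaps-and-bounds algorithm, assuming the leaps package and the simulated X and Y from the earlier sketch:

```r
## Best-subset selection via leaps and bounds (requires the 'leaps' package).
library(leaps)
best <- regsubsets(x = X, y = Y, nvmax = p, method = "exhaustive")
summ <- summary(best)

summ$rss                         # RSS of the best subset C_k for each size k = 1, ..., p
k_star <- which.min(summ$cp)     # one possible criterion (Mallow's Cp, discussed later)
coef(best, id = k_star)          # coefficients of the selected subset
```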

Best-subset selection for prostate cancer data

Suboptimal-subset selection. Best-subset selection is infeasible unless $p$ is small (it becomes impractical for, say, $p > 40$). Suboptimal but computationally efficient subset-selection methods include: forward-stepwise selection, a greedy algorithm that searches along a sequence of increasing models, adding one variable at a time; and backward-stepwise selection, which starts from the full model and sequentially eliminates one variable at a time. Stepwise selection adds or deletes one variable based on a significance test at each step, so it is a locally optimal search.
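
A sketch of forward- and backward-stepwise selection using base R's step(); note that step() adds or drops variables by AIC rather than by significance tests, so it is only a stand-in for the procedure described above (data as in the earlier sketches):

```r
## Stepwise selection with step(); AIC is the add/drop criterion here.
dat  <- data.frame(Y = Y, X)

full <- lm(Y ~ ., data = dat)                      # full model
back <- step(full, direction = "backward", trace = 0)

null <- lm(Y ~ 1, data = dat)                      # intercept-only model
fwd  <- step(null, direction = "forward",
             scope = formula(full), trace = 0)

coef(back); coef(fwd)
```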

Comparing different subset selection

Shrinkage methods. Subset selection is a discrete process, so the variability from jumping from one model to another is high, and a discrete search algorithm is usually computationally intensive. Shrinkage methods provide smoother search procedures for identifying the best models. Such methods are typically carried out via smooth regularization/penalization.

Ridge regression. The estimate of $\beta$ is obtained by minimizing
$$\sum_{i=1}^n (Y_i - \beta_0 - X_i^T\beta)^2 + \lambda \sum_{j=1}^p \beta_j^2,$$
resulting in $\hat\beta = (\mathbf{X}^T\mathbf{X} + \lambda I)^{-1}\mathbf{X}^T\mathbf{Y}$. That is, we add an $L_2$-penalty to shrink the coefficients towards zero. The optimization is equivalent to
$$\min \sum_{i=1}^n (Y_i - \beta_0 - X_i^T\beta)^2, \quad \text{subject to } \sum_{j=1}^p \beta_j^2 \le C.$$
Both $\lambda$ and $C$ are regularization parameters (tuning parameters) and will be chosen data-dependently based on a certain criterion (to be discussed later).
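
A sketch of ridge regression, assuming the glmnet package: the closed form above is computed directly, and glmnet with alpha = 0 fits the same $L_2$ penalty, although it standardizes the features, leaves the intercept unpenalized, and scales lambda differently, so its coefficients will not match the naive closed form exactly.

```r
## Ridge regression: closed form and glmnet (alpha = 0 gives the L2 penalty).
library(glmnet)

lambda <- 1
Xd <- cbind(1, X)
ridge_closed <- solve(t(Xd) %*% Xd + lambda * diag(ncol(Xd)), t(Xd) %*% Y)

ridge_fit <- glmnet(X, Y, alpha = 0)   # whole coefficient path over a lambda grid
coef(ridge_fit, s = lambda)            # coefficients at a chosen lambda
```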

Prostate cancer example (ridge regression)

Lasso shrinkage. The lasso minimizes
$$\frac{1}{2}\sum_{i=1}^n (Y_i - \beta_0 - X_i^T\beta)^2 + \lambda \sum_{j=1}^p |\beta_j|,$$
so the regularization is an $L_1$-penalty on $\beta$. The objective function is convex, and the computation is a quadratic programming problem. The whole solution path can be solved efficiently using the Least Angle Regression (LAR) algorithm, a forward stepwise-type selection procedure.
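
A sketch of the lasso solution path, assuming the glmnet package; the lars package, if available, implements the LAR algorithm mentioned above, and the value s = 0.1 is illustrative.

```r
## Lasso path via coordinate descent in glmnet.
library(glmnet)

lasso_fit <- glmnet(X, Y, alpha = 1)   # entire solution path over lambda
plot(lasso_fit, xvar = "lambda")       # coefficient profiles versus log(lambda)
coef(lasso_fit, s = 0.1)               # sparse coefficients at lambda = 0.1

## LAR solution path (if the 'lars' package is available):
## library(lars); lar_fit <- lars(X, Y, type = "lar")
```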

LAR algorithm

Prostate cancer example (Lasso)

Comparison among shrinkage methods

Comparison among shrinkage methods in prostate cancer data

Structured shrinkage methods. Structural regularization:
Group lasso for $L$ groups of features:
$$\min \sum_{i=1}^n \Big(Y_i - \beta_0 - \sum_{l=1}^L X_{il}^T\beta_l\Big)^2 + \lambda \sum_{l=1}^L \sqrt{p_l}\,\|\beta_l\|_{L_2},$$
where $p_l$ is the number of features in group $l$.
Elastic-net penalty:
$$\lambda \sum_{j=1}^p \big(\alpha\beta_j^2 + (1-\alpha)|\beta_j|\big).$$
Laplacian regularization with a Laplacian eigenmap matrix $D$: the penalty $\beta^T D\beta$ encourages similar coefficients for two variables sharing an edge of a network.
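
A sketch of two of these penalties. glmnet parameterizes the elastic net as lambda * [(1 - alpha)/2 * sum(beta_j^2) + alpha * sum(|beta_j|)], so its alpha plays the opposite role of the alpha above; the grpreg package and the group labels are illustrative assumptions.

```r
## Elastic net via glmnet (alpha between 0 and 1 mixes the L2 and L1 penalties).
library(glmnet)
enet_fit <- glmnet(X, Y, alpha = 0.5)

## Group lasso, assuming the 'grpreg' package and an illustrative grouping:
## library(grpreg)
## grp <- c(1, 1, 2, 2, 3)                       # group labels for the 5 features
## gl_fit <- grpreg(X, Y, group = grp, penalty = "grLasso")
```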

Sparsity shrinkage methods. Oracle selection regularization: it sets $\hat\beta_j = 0$ for the features whose true coefficients satisfy $\beta_j = 0$ with probability tending to 1 (the oracle property); it also shrinks non-zero coefficients towards zero; the regularization is a non-convex function of $\beta$, such as the smoothly clipped absolute deviation (SCAD) penalty, defined through its derivative
$$\frac{\partial q_\lambda(\beta)}{\partial \beta} = \lambda\,\mathrm{sign}(\beta)\left[ I(|\beta| \le \lambda) + \frac{(\alpha\lambda - |\beta|)_+}{(\alpha-1)\lambda}\, I(|\beta| > \lambda) \right].$$
Computation for such regularization relies on local approximation, so it may not achieve the global minimum.
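
A sketch of SCAD-penalized regression, assuming the ncvreg package, which fits the non-convex penalty by iterative local approximation (consistent with the caveat above); the lambda value shown is illustrative.

```r
## SCAD regression via ncvreg (default SCAD shape parameter gamma = 3.7).
library(ncvreg)
scad_fit <- ncvreg(X, Y, penalty = "SCAD")
coef(scad_fit, lambda = 0.1)          # coefficients at an illustrative lambda
```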

Sparsity shrinkage methods (continued). Alternatively, the adaptive lasso (ALasso) method uses the regularization term
$$\lambda \sum_{j=1}^p |\beta_j| / |\tilde\beta_j|^\gamma, \quad \gamma > 0,$$
where $\tilde\beta_j$ is a consistent initial estimator of $\beta_j$. ALasso requires initial estimates, so it may not be applicable when $p \gg n$.
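
A sketch of the adaptive lasso via glmnet's per-coefficient weights (penalty.factor); using ridge coefficients as the initial estimator, gamma = 1, and the lambda values are illustrative choices.

```r
## Adaptive lasso: weight each |beta_j| by 1 / |initial estimate|^gamma.
library(glmnet)

init  <- as.vector(coef(glmnet(X, Y, alpha = 0), s = 0.1))[-1]  # ridge initial estimates
gamma <- 1
w     <- 1 / abs(init)^gamma
alasso_fit <- glmnet(X, Y, alpha = 1, penalty.factor = w)
coef(alasso_fit, s = 0.05)
```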

Graphic comparisons among all penalties

Sample R code for implementation: best-subset selection, ridge regression, and the lasso (see the code sketches following the corresponding slides above).

Tuning parameter selection. There are often tuning parameters that need to be chosen: the subset size $k$ in best-subset selection, and the regularization parameters in all shrinkage methods. Larger regularization parameters lead to more heavily shrunk coefficients (a sparser model) and thus less variable prediction; however, the resulting model yields higher bias in prediction. There is therefore a bias-variance trade-off in tuning parameter selection. Model selection criteria such as AIC and BIC can be used; however, they are information criteria, so they do not directly target prediction.

Mallow's Cp criterion for subset selection. The criterion is based on the prediction error $E[(Y - \hat f_k(x_0))^2 \mid X = x_0]$, where $\hat f_k$ is the estimated function using the best $k$ feature variables. Assuming $\mathrm{Var}(Y - f(X)) = \sigma^2$, this prediction error is
$$\sigma^2 + \big(f(x_0) - E[\hat f_k(x_0)]\big)^2 + \mathrm{Var}\big(\hat f_k(x_0)\big),$$
so when averaged over $x_0$ drawn from the empirical data, it is
$$\sigma^2 + \frac{1}{n}\sum_{i=1}^n \big(f(X_i) - E[\hat f_k(X_i)]\big)^2 + \frac{\sigma^2}{n}\,\mathrm{Trace}\big(\mathbf{X}_k(\mathbf{X}_k^T\mathbf{X}_k)^{-1}\mathbf{X}_k^T\big).$$

Mallow's Cp (continued). Since the in-sample error, $n^{-1}\sum_{i=1}^n (Y_i - \hat f_k(X_i))^2$, has an expectation approximated by
$$\sigma^2 + \frac{1}{n}\sum_{i=1}^n \big\{f(X_i) - E[\hat f_k(X_i)]\big\}^2 - \frac{1}{n}\sum_{i=1}^n \mathrm{Var}\big(\hat f_k(X_i)\big),$$
the expectation of the prediction error equals the expectation of the in-sample error plus
$$\frac{2\sigma^2}{n}\,\mathrm{Trace}\big(\mathbf{X}_k(\mathbf{X}_k^T\mathbf{X}_k)^{-1}\mathbf{X}_k^T\big) = 2\sigma^2 k/n.$$
Mallow's Cp selects $k$ to minimize
$$\frac{1}{n}\sum_{i=1}^n (Y_i - \hat f_k(X_i))^2 + 2\hat\sigma^2 k/n.$$
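
A sketch of Mallow's Cp computed from the regsubsets fit in the earlier best-subset sketch; sigma^2 is estimated from the full least-squares model, and k counts the selected features as in the formula above.

```r
## Mallow's Cp for the best subset of each size k (objects from earlier sketches).
sigma2_hat <- summary(lm(Y ~ X))$sigma^2
k  <- 1:p
cp <- summ$rss / n + 2 * sigma2_hat * k / n
which.min(cp)                 # subset size selected by Cp
## leaps reports its own (differently scaled) Cp as summ$cp
```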

Data-adaptive selection: cross-validation. The goal is to mimic the scenario of learning prediction rules on training samples and then evaluating their performance on future data. The idea is to randomly split the data into a training sample and a testing sample: the training sample is used to train prediction rules using the learning methods; the testing sample is used to evaluate the prediction errors of the learned rules. To avoid the influence of incidentally good or bad splits, this procedure is repeated multiple times. Common recommendations are leave-one-out, 5-fold, or 10-fold cross-validation. The best tuning parameters are chosen to minimize the average of the prediction errors.
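
A sketch of K-fold cross-validation for the lasso tuning parameter, assuming cv.glmnet; the 10-fold choice is illustrative (5-fold or leave-one-out is obtained by changing nfolds).

```r
## 10-fold cross-validation over the lasso lambda path.
library(glmnet)

cv_fit <- cv.glmnet(X, Y, alpha = 1, nfolds = 10)
cv_fit$lambda.min                 # lambda minimizing the CV prediction error
cv_fit$lambda.1se                 # largest lambda within one SE of the minimum
coef(cv_fit, s = "lambda.min")
```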

Generalized cross-validation. CV is computationally costly, especially leave-one-out CV. An approximation, called generalized cross-validation (GCV), is often used in practice:
$$\frac{1}{n}\sum_{i=1}^n \left[\frac{Y_i - \hat f(X_i)}{1 - \mathrm{trace}(\Sigma)/n}\right]^2,$$
where $\Sigma$ is the linear smoother matrix such that $\Sigma\mathbf{Y}$ gives the predictions for all the subjects.
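
A sketch of the GCV formula applied to ridge regression, where Sigma is the smoother matrix mapping Y to the fitted values; the function name and lambda grid are illustrative.

```r
## Generalized cross-validation for ridge regression.
gcv_ridge <- function(X, Y, lambda) {
  Xd    <- cbind(1, X)
  Sigma <- Xd %*% solve(t(Xd) %*% Xd + lambda * diag(ncol(Xd)), t(Xd))  # smoother matrix
  Yhat  <- Sigma %*% Y
  mean(((Y - Yhat) / (1 - sum(diag(Sigma)) / length(Y)))^2)
}

lambdas <- 10^seq(-2, 2, length.out = 50)
gcv <- sapply(lambdas, gcv_ridge, X = X, Y = Y)
lambdas[which.min(gcv)]           # lambda chosen by GCV
```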