Other likelihoods. Patrick Breheny. High-Dimensional Data Analysis (BIOS 7600). April 25. Multinomial regression, robust regression, Cox regression.


Introduction
In principle, the idea of penalized regression can be extended to any sort of regression model, although of course complications may arise in the details. In this final lecture for the topic, we'll take a brief look at three additional regression models and their penalized versions, along with the complications that arise: multinomial regression, robust regression, and Cox regression.

Penalized multinomial regression
Suppose, instead of a binary response, we have an outcome with multiple possible categories (i.e., a multinomial response, $y_i \in \{1, \ldots, K\}$). Extending penalized regression methods to multinomial regression is fairly straightforward:
$$Q(\beta \mid X, y) = -\frac{1}{n} \sum_{k=1}^{K} \sum_{i : y_i = k} \log \pi_{ik} + \sum_k P_\lambda(\beta_k),$$
where $\eta_{ki} = x_i^T \beta_k$ and $\pi_{ik} = \exp(\eta_{ki}) / \sum_l \exp(\eta_{li})$.

Model fitting
Can we use the same algorithm we discussed last time for multinomial models? At first glance, it might appear the answer is no, since we now have matrices of weights, and so on. However, if we adopt the general framework of looping over the response categories, the problem is equivalent to running logistic regression updates for each $\beta_k$:
repeat
    for k = 1, 2, ..., K:
        Construct quadratic approximation based on $\beta_k$
        Coordinate descent to update $\beta_k$
until convergence
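As a concrete illustration of the loop above, here is a minimal R sketch of blockwise coordinate descent for a lasso-penalized multinomial model at a single value of λ. All names (fit_multinomial_lasso, soft_threshold) are illustrative; this is not how glmnet is actually implemented, which adds intercepts, screening rules, and numerical safeguards.

    # Minimal sketch: lasso penalty, single lambda, no intercepts, no safeguards for w near 0
    soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

    fit_multinomial_lasso <- function(X, y, lambda, maxit = 100, tol = 1e-6) {
      n <- nrow(X); p <- ncol(X)
      classes <- sort(unique(y)); K <- length(classes)
      Y <- outer(y, classes, "==") * 1          # n x K indicator matrix
      B <- matrix(0, p, K)                      # coefficients, one column per class
      for (iter in 1:maxit) {
        B_old <- B
        for (k in 1:K) {
          eta <- X %*% B                        # current linear predictors
          P <- exp(eta) / rowSums(exp(eta))     # fitted multinomial probabilities
          w <- P[, k] * (1 - P[, k])            # logistic-type weights for class k
          z <- eta[, k] + (Y[, k] - P[, k]) / w # working response for class k
          r <- z - X %*% B[, k]                 # residuals under the quadratic approximation
          for (j in 1:p) {                      # coordinate descent on beta_k
            r <- r + X[, j] * B[j, k]
            num <- sum(w * X[, j] * r) / n
            B[j, k] <- soft_threshold(num, lambda) / (sum(w * X[, j]^2) / n)
            r <- r - X[, j] * B[j, k]
          }
        }
        if (max(abs(B - B_old)) < tol) break
      }
      B
    }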

Identifiability and reference categories
For the most part, if you have fit a multinomial regression model before, the penalized version will be familiar. One noticeable exception, however, is the notion of a reference category: in traditional multinomial regression, one category must be chosen as a reference group (i.e., have its $\beta_k$ set to 0) or else the problem is not identifiable.

Penalization and identifiability
With penalized regression, this restriction is not necessary. For example, suppose $K = 2$; then $\beta_{1j} = 1, \beta_{2j} = -1$ produces the exact same $\{\pi_{ik}\}$ as $\beta_{1j} = 2, \beta_{2j} = 0$. As it is impossible to tell the two models apart (along with an infinite range of other models), we cannot estimate $\{\beta_k\}$. With, say, a ridge penalty, this is no longer the case: $\sum_k \beta_{jk}^2 = 2$ in the first situation and 4 in the second, so the proper estimate is clear. A similar phenomenon occurs for the lasso penalty, although of course there is now the possibility of sparsity, perhaps with respect to multiple classes.
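A quick numerical check of this example in R (purely illustrative; the softmax helper below is just the definition of $\pi_{ik}$):

    softmax <- function(eta) exp(eta) / rowSums(exp(eta))
    eta1 <- cbind(1, -1)   # beta_1j = 1, beta_2j = -1 (for an observation with x_ij = 1)
    eta2 <- cbind(2,  0)   # beta_1j = 2, beta_2j =  0
    softmax(eta1)          # 0.881 0.119
    softmax(eta2)          # 0.881 0.119  -- identical probabilities
    sum(eta1^2)            # 2: the ridge penalty prefers this parameterization
    sum(eta2^2)            # 4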

Ramaswamy data
As an example of such data, consider a study by Ramaswamy et al. (2001). The authors collected 198 tumor samples, spanning 14 common tumor types, which they split into training (n = 144) and testing (n = 54) sets. Their goal was to develop a method for classifying the samples based on microarray data only, with the clinical goal of aiding in diagnosis, particularly of metastatic or atypical tumors.

glmnet
Penalized multinomial regression models are implemented in glmnet (but are not yet available in ncvreg). To use:
    fit <- glmnet(x, y, family="multinomial")
    ## Or:
    cvfit <- cv.glmnet(x, y, family="multinomial")
Here, y is allowed to be a matrix, a factor, or a vector capable of being coerced into a factor.
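Predictions then follow the usual glmnet conventions; for instance, a short sketch using the cross-validated fit above:

    # Predicted class labels and fitted class probabilities at the CV-selected lambda
    predict(cvfit, newx = x, s = "lambda.min", type = "class")
    predict(cvfit, newx = x, s = "lambda.min", type = "response")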

Cross-validation (training data only)
[Figure: cross-validation curve of multinomial deviance versus log(λ) on the training data, with the number of selected variables displayed along the top axis.]

Coefficients
The number of selected variables reported at the top of the previous plot is potentially misleading: we have $\beta_1, \beta_2, \ldots$, and the plot only lists the number of nonzero values in $\beta_1$ (which, here, refers to breast cancer). For example, $\beta_{1,5466} = 0.58$, while $\beta_{j,5466} = 0$ for all $j \neq 1$. This implies that a 1-unit change in gene 5466 nearly doubles the odds of breast cancer ($e^{0.58} \approx 1.8$), but has no effect on the relative odds of the other tumor types.
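To inspect all K coefficient vectors rather than just the first, the coefficients can be extracted for every class; for multinomial fits, glmnet returns one sparse coefficient matrix per class. A sketch, again assuming the cvfit object from before:

    B <- coef(cvfit, s = "lambda.min")         # list of sparse matrices, one per tumor type
    sapply(B, function(b) sum(b[-1, ] != 0))   # nonzero coefficients per class (dropping the intercept)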

Ramaswamy results
Classification results on the test set (rows: predicted class; columns: actual class; cell entries are the percentage of each actual class receiving that prediction; the final row and column give test-set counts):

Pred.    Br   Pr   Ln   Cl   Ly   Bl   Ml   Ut   Lk   Rn   Pn   Ov   Ms   CN  Tot.
Br        0    0    0    0    0    0    0    0    0   33    0    0    0    0     1
Pr        0   50    0    0    0    0    0    0    0    0    0   25    0    0     4
Ln        0    0   75    0    0    0    0    0    0    0    0    0    0    0     3
Cl        0    0    0  100    0    0    0    0    0    0   33    0    0    0     5
Ly        0    0    0    0  100    0    0    0    0    0    0    0    0    0     6
Bl       25    0    0    0    0   33    0    0    0    0    0    0    0    0     2
Ml        0    0    0    0    0    0  100    0    0    0    0    0    0    0     2
Ut        0    0    0    0    0    0    0  100    0    0    0    0    0    0     2
Lk        0   33    0    0    0    0    0    0  100    0    0    0    0    0     8
Rn       25    0    0    0    0   33    0    0    0    0   33    0    0    0     3
Pn        0    0    0    0    0    0    0    0    0    0   33    0    0    0     1
Ov       50   17    0    0    0   33    0    0    0   33    0   75    0    0     8
Ms        0    0   25    0    0    0    0    0    0   33    0    0  100   25     6
CN        0    0    0    0    0    0    0    0    0    0    0    0    0   75     3
Tot.      4    6    4    4    6    3    2    2    6    3    3    4    3    4    54

Overall classification accuracy: 69% (random accuracy: 9%)

Motivation
Our linear regression models from earlier were based on a least squares loss function, which is appropriate for normally distributed data. Real data, however, sometimes have thicker tails, and least squares estimates tend to be highly influenced by outliers. One alternative would be to fit a least absolute deviation (LAD) model:
$$Q(\beta \mid X, y) = \frac{1}{n} \sum_i |y_i - x_i^T \beta| + P_\lambda(\beta);$$
the effect is analogous to summarizing a univariate distribution with a median instead of a mean.

Model fitting: Complications
The main challenge introduced by this model is the fact that now the loss function is not differentiable either. This poses a problem for coordinate descent methods: we didn't go into much detail about their convergence properties, but convergence requires separable penalties (OK) and a differentiable loss function (problem). Indeed, one can construct counterexamples in which coordinate descent fails for $L_1$-penalized LAD regression.

Huber loss approximation
One solution is to replace the absolute value function with a differentiable approximation known as the Huber loss, after Peter Huber, who first proposed the idea in 1964:
$$L(r_i) = \begin{cases} r_i^2 & \text{if } |r_i| \le \delta \\ \delta(2|r_i| - \delta) & \text{if } |r_i| > \delta \end{cases}$$
One could either use this idea directly, or continuously relax $\delta \to 0$ to solve the original LAD problem; both of these options are available in the R package hqreg. Its usage is very similar to glmnet/ncvreg, with an option method=c("huber", "quantile", "ls") to specify the loss function.
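A hedged sketch of what the calls might look like, using the package's glmnet-style interface (exact defaults may differ from what is shown here):

    library(hqreg)
    fit   <- hqreg(X, y, method = "huber")     # Huber loss with a lasso penalty path
    cvfit <- cv.hqreg(X, y, method = "huber")  # cross-validation over lambda, as in cv.glmnet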

Simulation setup
To see the potential advantages of robust regression, consider the following simple setup:
$$y = X\beta + \varepsilon, \quad n = p = 100, \quad \beta_1 = \beta_2 = 1, \quad \beta_3 = \cdots = \beta_{100} = 0, \quad x_{ij} \sim N(0, 1)$$
However, instead of $\varepsilon$ following a $N(0, 1)$ distribution, in this simulation it will arise from a mixture distribution:
With 90% probability, $\varepsilon \sim N(0, 1)$
With 10% probability, the errors are drawn from a noisier distribution, $\varepsilon \sim N(0, \sigma)$, where we will consider varying $\sigma$
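Data from this setup are easy to generate in R; a sketch for a single replication (here σ is treated as the standard deviation of the noisier error component, which is one reading of the setup above):

    set.seed(1)
    n <- 100; p <- 100; sigma <- 5
    X <- matrix(rnorm(n * p), n, p)
    beta <- c(1, 1, rep(0, p - 2))
    noisy <- rbinom(n, 1, 0.1) == 1                 # ~10% of observations get the noisier errors
    eps <- rnorm(n, sd = ifelse(noisy, sigma, 1))
    y <- as.numeric(X %*% beta + eps)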

Simulation results
[Figure: MSE as a function of σ for the least squares and Huber loss estimators.]

Remarks
As you would expect, least squares is optimal for σ = 1, and it remains better than Huber loss for slightly messy data. As the outliers get more extreme, however, Huber loss is clearly superior to least squares. Both methods get worse as the noise increases, but least squares is much less robust.

Cox regression
A popular model for time-to-event data is the Cox proportional hazards model, named after Sir David Cox, who proposed the method in 1972. Suppose individual j dies at time $t_j$, and that $R(t_j)$ denotes the set of subjects who were at risk (i.e., could have died) at time $t_j$. The model states that the relative likelihood that subject j is the one who died, conditional on the fact that someone died at $t_j$ (this is called a partial likelihood), is given by
$$\frac{\exp(x_j^T \beta)}{\sum_{k \in R(t_j)} \exp(x_k^T \beta)}$$
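Combining this partial likelihood with a penalty gives an objective of the same form as the earlier ones; written out (a standard formulation, assuming no ties among the event times):
$$Q(\beta \mid X, y) = -\frac{1}{n} \sum_{j \in D} \left\{ x_j^T \beta - \log \sum_{k \in R(t_j)} \exp(x_k^T \beta) \right\} + P_\lambda(\beta),$$
where D denotes the set of subjects whose death was observed.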

Challenge: Nondiagonal W
The Cox partial likelihood is differentiable, so we can form a quadratic approximation
$$L(\beta) \approx \frac{1}{2n} (\tilde{y} - X\beta)^T W (\tilde{y} - X\beta)$$
as in the GLM case. However, the relative nature of the likelihood poses a challenge: unlike the GLM case, W is no longer a diagonal matrix. This is not necessarily a conceptual difficulty, but it presents a considerable obstacle in terms of practical computation: the speed of coordinate descent relies on the fact that each coordinate update takes an insignificant amount of time, and if we have to do lots of calculations like $x_j^T W x_j$, the algorithm will be much slower.

Do we need the exact value of W?
Thankfully, it turns out not to be terribly critical to use the exact Hessian matrix W when solving for the value of $\beta$ that minimizes the quadratic approximation. For example, in logistic regression, using $w_i = 0.25$ for all i also works in terms of converging to the solution $\hat\beta$, and surprisingly it is often just as fast as using $w_i = \hat\pi_i(1 - \hat\pi_i)$. Setting $w_i = 0.25$ is an example of an MM algorithm; convergence follows from Jensen's inequality. Approximating W is also the main idea behind what are known as quasi-Newton methods.

Approximating W with its diagonal
A simple idea that turns out to work reasonably well is to approximate W by its diagonal. Thus, although W is a dense n × n matrix, we can get away with calculating only the elements along the diagonal and using this diagonalized W in the same manner as we did for penalized GLMs. I am not aware of any formal proof that this guarantees convergence (indeed, convergence is never guaranteed for quadratic approximations without additional steps such as step-halving), but it seems to work well in practice.

Challenge: Cross-validation
Another challenge for Cox regression is how to carry out cross-validation. In linear, logistic, multinomial, or robust regression, we obtain actual predicted values for each observation and can evaluate the loss for each observation i in isolation. For Cox regression, we estimate only relative risk, and thus have no way of assessing whether, say, $x_i^T \beta = 0.8$ is a good prediction or not without a frame of reference.

Proposal #1
One approach to cross-validation for Cox regression models was proposed by Verweij and van Houwelingen (1993). Their idea was to calculate $L(\hat\beta^{(-i)})$, the partial likelihood for the full data evaluated at the estimate obtained leaving out fold i, and $L^{(-i)}(\hat\beta^{(-i)})$, the partial likelihood for the data leaving out fold i evaluated at the same estimate. The cross-validation error is then
$$\mathrm{CVE} = \sum_i \left\{ L(\hat\beta^{(-i)}) - L^{(-i)}(\hat\beta^{(-i)}) \right\}$$
This is the approach used by glmnet.

Proposal #2
An alternative approach, used by ncvreg, is to:
(1) Calculate the full set of cross-validated linear predictors, $\tilde\eta = \{\eta^{(-1)}, \ldots, \eta^{(-n)}\}$, where $\eta^{(-i)}$ is computed from the fit that leaves out the fold containing observation i
(2) Calculate the ordinary Cox partial likelihood for the full data set based on these cross-validated linear predictors
Both approaches are equivalent to regular cross-validation for ordinary likelihoods. Which method is better is still an open question, although the ncvreg approach tends to be a bit more liberal.

Lung cancer data
To illustrate, let's look at data from a study by Shedden et al. (2008). In the study, retrospective data were collected for 442 patients with adenocarcinoma of the lung, including their survival times, some additional clinical and demographic data, and expression levels of 22,283 genes taken from the tumor sample. In the results that follow, I included age, sex, and treatment (whether the patient received adjuvant chemotherapy) as unpenalized predictors.

R code
To fit these models in glmnet (here Z holds the unpenalized clinical covariates and X the gene expression measurements):
    XX <- cbind(Z, X)
    w <- rep(0:1, c(ncol(Z), ncol(X)))
    cv.glmnet(XX, S, family="cox", penalty.factor=w)
And in ncvreg:
    cv.ncvsurv(XX, S, penalty="lasso", penalty.factor=w)
For both models, I used lambda.min=0.35; proceeding further along the path resulted in clear overfitting.
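Here S is the survival outcome; a hedged sketch of how it would typically be set up (assuming vectors time and status containing follow-up time and the event indicator, and that both packages accept this form of outcome):

    library(survival)
    S <- Surv(time, status)   # status = 1 for death, 0 for censoring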

Results: glmnet
[Figure: cross-validated partial likelihood deviance versus log(λ) from cv.glmnet, with the number of selected variables displayed along the top axis.]

Results: ncvreg
[Figure: cross-validation error versus log(λ) from cv.ncvsurv, with the number of variables selected displayed along the top axis.]

Remarks
Both approaches suggest an improvement in accuracy as the first dozen or so genes are added to the model, followed by a relatively flat region; whether this drop in cross-validation error is significant or not may depend on the approach. As with other penalized regression models, an appealing feature is that coefficients retain their interpretation. For example, the gene GLCCI1 has a hazard ratio of 0.78 ($e^{\hat\beta_j} = 0.78$), indicating that patients with high GLCCI1 expression have a 22% reduction in risk; as one might expect, the effect is estimated to be larger using MCP-penalized Cox regression (HR = 0.67).