TGDR: An Introduction

1 TGDR: An Introduction
Julian Wolfson
Student Seminar, March 28, 2007

2 Outline
1. Variable Selection
2. Penalization, Solution Paths and TGDR
3. Applying TGDR
4. Extensions
5. Final Thoughts

3 Some motivating examples
We are interested in identifying which covariates from a set $X = \{X_1, \ldots, X_p\}$ best predict an outcome $Y$ measured on $n$ individuals, where $p \gg n$. For example:
- Y is blood pressure at age 50; X is a set of answers from a lengthy Food Frequency Questionnaire.
- Y is an indicator of volcano activity; X is a set of geological measurements in the vicinity of the volcano.
- Y is a survival endpoint (T, C) representing time to acquisition of HIV drug resistance; X is a portion of the viral genome.

4 For the last example, which we will pursue, a typical dataset might have n = 300 individuals with amino acid sequences many sites long. With 21 possible amino acids (AAs) per site, the number of covariates (sites × 21) far exceeds n.

5 The Problem
- When $p \gg n$, standard regression approaches yield estimates with huge variance and poor predictive ability.
- Cox regression typically fails with even modestly large numbers of covariates (about 100).
- Standard approaches typically force small or no bias in the parameter estimates, and so do not trade off bias against variance: $\mathrm{MSE} = \mathrm{Var} + \mathrm{Bias}^2$.

Idea
- Accept some bias in exchange for more stable estimates with better predictive power.
- Select a subset of variables which best predicts the outcome.
- Use the available data to estimate their relative importance.
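As a rough numerical illustration of MSE = Var + Bias², the sketch below compares a sample mean with a deliberately shrunken version of it. The shrinkage factor 0.8, the noise level and the simulation sizes are illustrative assumptions, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean, sigma, n, n_sims = 2.0, 5.0, 20, 5000  # illustrative values (assumptions)

# One sample mean per simulated dataset of size n.
sample_means = rng.normal(true_mean, sigma, size=(n_sims, n)).mean(axis=1)

results = {}
for shrink in (1.0, 0.8):  # 1.0 = unbiased estimator, 0.8 = biased but less variable
    est = shrink * sample_means
    var = est.var()                          # ddof=0, so the decomposition holds exactly
    bias = est.mean() - true_mean
    mse = np.mean((est - true_mean) ** 2)
    results[shrink] = {"mse": mse, "var": var, "bias": bias}
    print(f"shrink={shrink}: MSE={mse:.3f}  Var={var:.3f}  Bias^2={bias**2:.3f}")
```

In theory the unshrunken mean has MSE = σ²/n = 1.25 here, while the shrunken one has 0.64 · 1.25 + 0.4² = 0.96, so a little bias buys a lower overall error.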

7 Loss functions
Estimation is based on a loss function $L$:
- Squared-error loss (linear regression): $L = \sum_i (Y_i - X_i \beta)^2$
- Negative log-likelihood (many contexts): $L = -l(\beta; X)$
- Negative log partial likelihood (Cox regression): $L = -l_p(\beta; X)$
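The two losses used most in this talk can be sketched in NumPy as follows. The function names are hypothetical, and the Cox version assumes time-fixed covariates and no tied event times, which is a simplification.

```python
import numpy as np

def squared_error_loss(beta, X, y):
    """L = sum_i (Y_i - X_i beta)^2."""
    resid = y - X @ beta
    return float(resid @ resid)

def neg_log_partial_likelihood(beta, X, time, event):
    """L = -l_p(beta; X) for Cox regression (assumes no tied event times)."""
    eta = X @ beta
    loss = 0.0
    for i in np.flatnonzero(event):
        at_risk = time >= time[i]          # risk set at the i-th event time
        loss -= eta[i] - np.log(np.exp(eta[at_risk]).sum())
    return float(loss)

# Tiny hand-checkable example: four subjects, the second one censored.
X_demo = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]])
time_demo = np.array([1.0, 2.0, 3.0, 4.0])
event_demo = np.array([True, False, True, True])
y_demo = np.array([1.0, 2.0, 3.0, 4.0])
beta0 = np.zeros(2)
```

At β = 0 every subject has the same relative risk, so the partial-likelihood loss reduces to the sum of log risk-set sizes: log 4 + log 2 + log 1 = log 8 for this example.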

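8 Penalization
Common way to trade off bias and variance: penalize the loss function $L$ via $P(\beta)$. This yields a modified loss $\tilde{L} = L + \lambda P(\beta)$.
Two common penalties:
1. $P(\beta) = \sum_i \beta_i^2$ (ridge regression)
2. $P(\beta) = \sum_i |\beta_i|$ (LASSO)

Examples
- Linear regression, ridge penalty: $\tilde{L} = \sum_i (Y_i - X_i \beta)^2 + \lambda \sum_i \beta_i^2$
- Cox regression, LASSO penalty: $\tilde{L} = -l_p(\beta; X) + \lambda \sum_i |\beta_i|$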

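10 We seek $\hat\beta = \arg\min_\beta \tilde{L} = \arg\min_\beta \, [L + \lambda P(\beta)]$
- This is a constrained optimization problem (equivalent to $\arg\min_\beta L$ subject to $P(\beta) \le \lambda'$, where the bound $\lambda'$ corresponds to $\lambda$).
- $\lambda$ controls how much the estimates are penalized.
- It also indexes a one-dimensional path through the parameter space.
- The optimal $\lambda$ is usually chosen via cross-validation.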
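For the ridge penalty with squared-error loss, the minimizer has the closed form $(X^T X + \lambda I)^{-1} X^T Y$, which makes the path indexed by λ and its cross-validated choice easy to sketch. The simulated data, the λ grid and the helper names below are assumptions for illustration only.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize sum (y - X b)^2 + lam * sum b^2 via the normal equations."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def cv_error(X, y, lam, n_folds=5):
    """Mean squared prediction error over interleaved folds."""
    folds = np.arange(len(y)) % n_folds
    err = 0.0
    for k in range(n_folds):
        test = folds == k
        beta = ridge_fit(X[~test], y[~test], lam)
        err += np.sum((y[test] - X[test] @ beta) ** 2)
    return err / len(y)

rng = np.random.default_rng(1)
n, p = 40, 60                                  # p > n, as in the talk's setting
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]               # sparse truth (assumption)
y = X @ beta_true + rng.normal(size=n)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]        # illustrative grid
errors = [cv_error(X, y, lam) for lam in lambdas]
best_lam = lambdas[int(np.argmin(errors))]
norms = [np.linalg.norm(ridge_fit(X, y, lam)) for lam in lambdas]
print("CV-selected lambda:", best_lam)
```

Each λ gives one point on the ridge path; larger λ shrinks the coefficient vector toward zero, which is what `norms` tracks.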

11 Solution Paths

12 Problems of Penalization?
- The choice of penalty $P(\beta)$ defines a set of possible paths, but what if none of these paths passes near the true parameter value?
- We might prefer a technique which does not require us to choose a penalty function a priori.
- Constrained optimization procedures can be tricky to use.

14 Enter TGDR
TGDR: Threshold Gradient Descent Regularization
- Suggested by Friedman and Popescu (2004)

Idea
- Construct paths in the parameter space iteratively.
- Choose the point on the constructed path which is closest to the true parameter value (usually via cross-validation).

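15 Iterative path construction
- Basic calculus: $g(\beta) = -\partial f / \partial \beta$ gives the direction of steepest descent.
- Steepest descent algorithm for finding the minimum of a function $f$:
  $\hat\beta(\lambda + \Delta\lambda) = \hat\beta(\lambda) + \Delta\lambda \, g(\beta)\big|_{\beta = \hat\beta(\lambda)}$
- To reduce instability of the estimates, consider instead the step
  $\hat\beta(\lambda + \Delta\lambda) = \hat\beta(\lambda) + \Delta\lambda \, T(\beta)\, g(\beta)\big|_{\beta = \hat\beta(\lambda)}$
  where the threshold indicator is applied componentwise: $T_i(\beta) = 1\big[\, |g_i| \ge \tau \max_{k=1,\ldots,p} |g_k| \,\big]$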
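A minimal sketch of the thresholded step for squared-error loss is given below. The 1/(2n) scaling of the loss, the step size, the number of steps and the simulated data are all assumptions chosen for illustration; `tgdr_path` is a hypothetical name.

```python
import numpy as np

def tgdr_path(X, y, tau, step=0.01, n_steps=100):
    """Threshold gradient descent path for f(beta) = (1/2n) * sum (y - X beta)^2."""
    n, p = X.shape
    beta = np.zeros(p)
    path = [beta.copy()]
    for _ in range(n_steps):
        g = X.T @ (y - X @ beta) / n                 # g = -df/dbeta
        keep = np.abs(g) >= tau * np.abs(g).max()    # T_i = 1[|g_i| >= tau * max_k |g_k|]
        beta = beta + step * keep * g                # move only the selected coordinates
        path.append(beta.copy())
    return np.array(path)

def loss(beta, X, y):
    return 0.5 * np.mean((y - X @ beta) ** 2)

rng = np.random.default_rng(0)
n, p = 50, 200
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + rng.normal(size=n)

sparse_path = tgdr_path(X, y, tau=1.0)   # only the largest-gradient coordinate moves
dense_path = tgdr_path(X, y, tau=0.0)    # ordinary gradient descent: every coordinate moves
print("nonzero coefficients, tau=1:", np.count_nonzero(sparse_path[-1]))
print("nonzero coefficients, tau=0:", np.count_nonzero(dense_path[-1]))
```

With τ = 1 at most one new coefficient can become nonzero per step, so the path stays sparse; with τ = 0 the update is plain steepest descent and all coefficients move.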

16 Thresholding

19 Recap
We now have a general method for constructing paths in the parameter space. To apply it, we need:
- A (differentiable) loss function (squared error, log-likelihood, etc.)
- A way to choose the threshold parameter τ
- A way to choose the path parameter λ
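One way to pick τ and λ together is a grid over τ combined with cross-validation along each path, treating λ as the cumulative step length. The sketch below does this for squared-error loss; the fold scheme, grids, data and function names are assumptions, not the procedure used in the talk.

```python
import numpy as np

def tgdr_path(X, y, tau, step, n_steps):
    """Thresholded gradient descent path for (1/2n) * squared-error loss."""
    beta = np.zeros(X.shape[1])
    path = [beta.copy()]
    for _ in range(n_steps):
        g = X.T @ (y - X @ beta) / len(y)
        beta = beta + step * (np.abs(g) >= tau * np.abs(g).max()) * g
        path.append(beta.copy())
    return np.array(path)                       # shape (n_steps + 1, p)

def cv_select(X, y, taus, step=0.02, n_steps=150, n_folds=5):
    """Return (tau, lambda, CV error) minimizing held-out squared error."""
    folds = np.arange(len(y)) % n_folds
    best = (np.inf, None, None)
    for tau in taus:
        err = np.zeros(n_steps + 1)             # held-out error at each path point
        for k in range(n_folds):
            test = folds == k
            path = tgdr_path(X[~test], y[~test], tau, step, n_steps)
            resid = y[test][None, :] - path @ X[test].T
            err += (resid ** 2).sum(axis=1)
        k_best = int(np.argmin(err))
        if err[k_best] < best[0]:
            best = (err[k_best], tau, k_best)
    total_err, tau_hat, k_hat = best
    return tau_hat, k_hat * step, total_err / len(y)

rng = np.random.default_rng(2)
n, p = 60, 100
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + rng.normal(size=n)

taus = [0.0, 0.5, 1.0]
tau_hat, lam_hat, cv_err = cv_select(X, y, taus)
print(f"tau={tau_hat}, lambda={lam_hat:.2f}, CV error={cv_err:.3f}")
```

Because the whole path is stored, each τ needs only one fit per fold, and every candidate λ on that path is scored at once.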

20 TGDR for Cox regression
- Gui and Li (2005) extended TGDR to Cox regression (partial likelihood loss).
- Recall: $L = -l_p(\beta; X)$ and $g = -\partial L / \partial \beta$.
- We started by adapting TGDR to handle time-varying covariates.
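With $L = -l_p$, the step direction $g = -\partial L/\partial\beta$ is the partial-likelihood score. The sketch below computes it and takes one thresholded step. It assumes time-fixed covariates and no tied event times (so it does not cover the time-varying adaptation mentioned above), and all names and simulated data are hypothetical.

```python
import numpy as np

def neg_log_partial_likelihood(beta, X, time, event):
    """L = -l_p(beta; X), assuming no tied event times."""
    eta = X @ beta
    return -sum(eta[i] - np.log(np.exp(eta[time >= time[i]]).sum())
                for i in np.flatnonzero(event))

def cox_direction(beta, X, time, event):
    """g = -dL/dbeta: sum over events of x_i minus the risk-set weighted mean of x."""
    w = np.exp(X @ beta)
    g = np.zeros(X.shape[1])
    for i in np.flatnonzero(event):
        at_risk = time >= time[i]
        g += X[i] - (w[at_risk] @ X[at_risk]) / w[at_risk].sum()
    return g

def tgdr_step(beta, X, time, event, tau, step):
    """One thresholded gradient step on the Cox partial-likelihood loss."""
    g = cox_direction(beta, X, time, event)
    keep = np.abs(g) >= tau * np.abs(g).max()
    return beta + step * keep * g

rng = np.random.default_rng(3)
n, p = 30, 5
X = rng.normal(size=(n, p))
time = rng.exponential(size=n)          # continuous times, so ties are not expected
event = rng.random(n) < 0.7             # roughly 30% censoring (assumption)
beta = rng.normal(scale=0.1, size=p)

g = cox_direction(beta, X, time, event)
beta_next = tgdr_step(np.zeros(p), X, time, event, tau=1.0, step=0.01)
```

Iterating `tgdr_step` from β = 0 traces out the TGDR path for the Cox model in the same way as for squared-error loss.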

21 Application: ACTG 398
Relevant Data
- HIV envelope protein sequences collected post-infection for approximately two years
- Current drug regimen

Endpoint of Interest
- (T, C), where T is the time until a patient fails a drug regimen and C is the censoring indicator

Question
- Which amino acid positions on HIV (mutations, insertions, deletions) are associated with time until drug regimen failure?

24 Results: ACTG 398 Data
Estimated coefficients from training set (60% of data), tabulated by threshold τ (coefficient values not preserved in this copy).

Mutation:   70R  74V  103N  108I  118I  122E  123E  181C  184V  190A
Wild type:  K    L    K     V     V     K     D     Y     M     G

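25 Results (cont'd)
- Get $\hat\eta = X\hat\beta$ from test set (40% of data).
- HR = hazard ratio comparing the group with $\hat\eta \ge 0$ ("high risk") to the group with $\hat\eta < 0$ ("low risk").
- Table columns: τ, HR, 95% CI (values not preserved in this copy).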
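The test-set comparison can be sketched as a Cox model with a single binary covariate (the high-risk indicator), fitted by Newton-Raphson. This is an illustration on simulated data with a true hazard ratio of 2, assuming no tied event times and a Wald-type interval; it is not the ACTG 398 analysis, and `hazard_ratio` is a hypothetical helper.

```python
import numpy as np

def hazard_ratio(high_risk, time, event, n_iter=25):
    """HR and Wald 95% CI for a binary group indicator in a Cox model."""
    z = high_risk.astype(float)
    b = 0.0
    for _ in range(n_iter):
        w = np.exp(b * z)
        score, info = 0.0, 0.0
        for i in np.flatnonzero(event):
            at_risk = time >= time[i]
            zbar = (w[at_risk] * z[at_risk]).sum() / w[at_risk].sum()
            score += z[i] - zbar
            info += zbar * (1.0 - zbar)     # variance of a binary z within the risk set
        b += score / info                   # Newton-Raphson update
    se = 1.0 / np.sqrt(info)
    return np.exp(b), (np.exp(b - 1.96 * se), np.exp(b + 1.96 * se))

rng = np.random.default_rng(4)
n = 1000
eta_hat = rng.normal(size=n)                # stand-in for X_test @ beta_hat
high_risk = eta_hat >= 0                    # "high risk" vs "low risk" split
rate = np.where(high_risk, 2.0, 1.0)        # simulated true hazard ratio of 2
t_event = rng.exponential(1.0 / rate)
t_cens = rng.exponential(1.0 / 0.3, size=n)
time = np.minimum(t_event, t_cens)
event = t_event <= t_cens

hr, (ci_low, ci_high) = hazard_ratio(high_risk, time, event)
print(f"HR = {hr:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

An HR clearly above 1 on held-out data would suggest that the sign of $\hat\eta$ separates patients by risk.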

26 Extensions
- For log-likelihood (or log partial likelihood) loss, the descent direction is just $g = \partial l / \partial \beta = \nabla_\beta l$, the score function.
- There is an extensive literature on modified/adapted/approximate/quasi score functions which allow for:
  - Missing data
  - Measurement error
  - Heteroskedasticity
  - ...
- It is straightforward to incorporate these methods, which propose some modification $\tilde{g}$ of our original step direction $g$.
- Currently working on allowing TGDR to handle missing data (based on work of Lin and Ying) and measurement error (Augustin).

31 Crazy ideas (i.e. future work)
- TGDR with more sophisticated steps (Newton-Raphson, BFGS, etc.)
- Incorporating biological knowledge (restricting some coefficients > 0, etc.)
- TGDR for GEE? (based on estimating functions...)
- TGDR as a meta-method? (TGDR with LASSO loss...)

32 In Conclusion
TGDR is...
- Variable selection based on thresholded gradient descent
- Beautifully simple
- Computationally tractable
- Easy to extend to more complex data structures

But TGDR is not...
- Popular (yet)
- Particularly amenable to inference (confidence intervals?)
- Well studied from a theoretical perspective: When does it work? How well does it work? How does it compare to competing methods?

34 A word about LaTeX and presentations
This presentation is a PDF file generated from a LaTeX (text) document, with the help of a package called beamer. More info is available at http://latex-beamer.sourceforge.net/
Ask me if you have any questions... but no guarantees.

35 Acknowledgements
- Prof. Peter Gilbert (thesis supervisor)
- Prof. Victor DeGruttola (for providing ACTG data)

Thanks! Questions?
