Ensemble estimation and variable selection with semiparametric regression models

1 Ensemble estimation and variable selection with semiparametric regression models
Sunyoung Shin
Department of Mathematical Sciences, University of Texas at Dallas
Joint work with Jason Fine, Yufeng Liu, and Steve Cole.
July 30, 2018

2 Multicenter AIDS cohort study (MACS)
Largest prospective HIV/AIDS cohort study, created to examine the natural history of AIDS: aidscohortstudy.org (Kaslow, 1987)
Clinical research sites:
  Los Angeles: UCLA
  Chicago: Northwestern University
  Pittsburgh/Columbus: University of Pittsburgh
  Baltimore/Washington DC: Johns Hopkins University

3 Multicenter AIDS cohort study (MACS)
Participants: 7,000 homosexual men
Study design: every six months, the participants' biological and behavioral data are collected from a physical exam, questionnaires, and laboratory testing.
Goal: find risk factors for the natural history of HIV infection.
Response: time to HIV infection, measured from the date of birth
Possible risk factors: demographics, drug usage, sexual behaviors/disease, and medical histories.

4 Prospective doubly censored data (PDCD)
Many subjects are already infected at the time of study enrollment, while those who are uninfected at enrollment may not become infected during follow-up.
[Figure: subject timelines plotted against time (axes: Subjects, Time), with markers for Monitoring, Event, and Right Censoring.]

5 Decomposition of prospective doubly censored data
[Figure: the subject timelines (axes: Subjects, Time; markers: Monitoring, Event, Right Censoring) split into separate panels, with some subjects labeled "Not infected".]

6 Decomposition of prospective doubly censored data
[Figure: three panels of subject timelines (axes: Subjects, Time; markers: Monitoring, Event, Right Censoring). Panel titles: Prospective doubly censored data; Current status data; Left truncated right censored data.]

7 Ensemble of current status data and left truncated right censored data
Full likelihood for prospective doubly censored data
  Maximization is difficult.
  Approximate EM algorithms (Kim, 2010; Su and Wang, 2016)
Component likelihoods for current status data and left truncated right censored data
  Both component likelihoods are easy to maximize.
  Both have been well studied, with theoretical and computational issues addressed rigorously.
Ensemble method
  Separate component estimation
  Efficient combination of the component estimators

8 Semiparametric regression models
Semiparametric regression model with $(\theta, \Lambda)$
  $\theta$: $p$-dimensional regression parameter
  $\Lambda$: an infinite-dimensional nuisance parameter
True parameter: $(\theta_0, \Lambda_0)$
Sparse $\theta_0 = (\theta_{10}^T, \theta_{20}^T)^T = (\theta_{10}^T, 0^T)^T$, where $\theta_{10} \in \mathbb{R}^s$ and $\theta_{20} = 0$.

9 Factorizable semiparametric likelihoods
Data: $n$ independent, identically distributed observations $(z_1, \ldots, z_n)$
The log likelihood based on the data:
$$l_n(\theta, \Lambda) = \sum_{i=1}^n l_{\theta,\Lambda}(z_i)$$
The log likelihood separates into a fixed number $K$ of component log likelihoods (Cox, 2001):
$$l_n(\theta, \Lambda) = \sum_{k=1}^K l^k_{\theta,\Lambda}(z)$$
Goals
  Efficient estimation of the regression parameter
  Selection of the zero coefficients of the regression parameter
  Oracle estimation of the nonzero coefficients

10 PDCD: Likelihood construction
$(C_i, T_i, R_i, x_i)$: enrollment time, failure time, study termination time, and covariate, $i = 1, \ldots, n$
$(C_i, C_i \vee Y_i, \delta_i, \nu_i, x_i)$ is observed, where $Y_i = T_i \wedge R_i$, $\delta_i = I(T_i \le C_i)$ is the infection status at enrollment, and $\nu_i = I(T_i \le R_i)$ is the right censoring status.
Full likelihood $\prod_{i=1}^n L(C_i, Y_i, \delta_i, \nu_i \mid x_i)$, with $i$th contribution
$$F(C_i \mid x_i)^{\delta_i} \, f(Y_i \mid x_i)^{\nu_i (1-\delta_i)} \, \{1 - F(Y_i \mid x_i)\}^{(1-\nu_i)(1-\delta_i)}$$

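11 PDCD: Likelihood factorization
Full likelihood $\prod_{i=1}^n L(C_i, Y_i, \delta_i, \nu_i \mid x_i)$, with $i$th contribution
$$F(C_i \mid x_i)^{\delta_i} \, f(Y_i \mid x_i)^{\nu_i (1-\delta_i)} \, \{1 - F(Y_i \mid x_i)\}^{(1-\nu_i)(1-\delta_i)}$$
Current status component $\prod_{i=1}^n L(C_i, \delta_i \mid x_i)$, with $i$th contribution
$$F(C_i \mid x_i)^{\delta_i} \, \{1 - F(C_i \mid x_i)\}^{1-\delta_i}$$
Left truncated right censored component $\prod_{i=1}^n L(Y_i, \nu_i \mid C_i, \delta_i, x_i)$, with $i$th contribution
$$\left\{ \left[ \frac{f(Y_i \mid x_i)}{1 - F(C_i \mid x_i)} \right]^{\nu_i} \left[ \frac{1 - F(Y_i \mid x_i)}{1 - F(C_i \mid x_i)} \right]^{1-\nu_i} \right\}^{1-\delta_i}$$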
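The factorization on this slide can be checked numerically. The sketch below is only an illustration: it assumes a covariate-free exponential failure time with a hypothetical rate, and the function names and toy observations are not from the talk. It evaluates the full log likelihood and the two component log likelihoods and confirms that they add up.

```python
import numpy as np

def log_surv(t, rate):
    # log{1 - F(t)} for an exponential failure time (illustrative assumption)
    return -rate * t

def log_density(t, rate):
    # log f(t) for the same exponential failure time
    return np.log(rate) - rate * t

def pdcd_loglik(c, y, delta, nu, rate):
    # Full PDCD log likelihood, summed over subjects
    log_cdf_c = np.log1p(-np.exp(log_surv(c, rate)))
    return np.sum(delta * log_cdf_c
                  + nu * (1 - delta) * log_density(y, rate)
                  + (1 - nu) * (1 - delta) * log_surv(y, rate))

def cs_loglik(c, delta, rate):
    # Current status component: only (C, delta) is used
    log_cdf_c = np.log1p(-np.exp(log_surv(c, rate)))
    return np.sum(delta * log_cdf_c + (1 - delta) * log_surv(c, rate))

def ltrc_loglik(c, y, delta, nu, rate):
    # Left truncated right censored component: conditional on survival past C
    event = log_density(y, rate) - log_surv(c, rate)
    censored = log_surv(y, rate) - log_surv(c, rate)
    return np.sum((1 - delta) * (nu * event + (1 - nu) * censored))

# Toy observations: subject 1 infected at enrollment, subject 2 infected
# during follow-up, subject 3 right censored.
c = np.array([1.0, 0.5, 2.0])
y = np.array([1.0, 1.5, 3.0])
delta = np.array([1, 0, 0])
nu = np.array([0, 1, 0])
rate = 0.7  # hypothetical value

full = pdcd_loglik(c, y, delta, nu, rate)
parts = cs_loglik(c, delta, rate) + ltrc_loglik(c, y, delta, nu, rate)
```

Here `full` and `parts` agree up to floating-point error, because the `log_surv(c, rate)` terms of the two components cancel for every subject with `delta = 0`.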

12 Cox model with PDCD
The Cox model is designed to estimate the effect of covariates on the hazard rate as well as the baseline hazard function (Cox, 1972).
The model parameter has two components $(\beta, H)$
  $\beta$: the effect parameter
  $H$: the baseline cumulative hazard function (equivalently, the baseline survival curve), with $h$ the corresponding baseline hazard rate
The conditional cumulative hazard is assumed to satisfy $H(t \mid x) = H(t) \exp(x^T \beta)$.
$$l_n^{PDCD}(\beta, H) = \sum_{i=1}^n \Big[ \delta_i \log\big[1 - \exp\{-\exp(x_i^T \beta) H(C_i)\}\big] + (1-\delta_i)\{-\exp(x_i^T \beta) H(C_i)\} \Big]$$
$$\quad + \sum_{i=1}^n \nu_i (1-\delta_i) \{ x_i^T \beta + \log h(Y_i) - \exp(x_i^T \beta) H(Y_i) + \exp(x_i^T \beta) H(C_i) \}$$
$$\quad + \sum_{i=1}^n (1-\nu_i)(1-\delta_i) \{ -\exp(x_i^T \beta) H(Y_i) + \exp(x_i^T \beta) H(C_i) \}$$

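13 PDCD: Likelihoods in proportional hazards models
$$l_n^{PDCD}(\beta, H) = l_n^{LTRC}(\beta, H) + l_n^{CS}(\beta, H)$$
$$l_n^{LTRC}(\beta, H) = \sum_{i=1}^n \nu_i (1-\delta_i) \{ x_i^T \beta + \log h(Y_i) - \exp(x_i^T \beta) H(Y_i) + \exp(x_i^T \beta) H(C_i) \} + \sum_{i=1}^n (1-\nu_i)(1-\delta_i) \{ -\exp(x_i^T \beta) H(Y_i) + \exp(x_i^T \beta) H(C_i) \}$$
$$l_n^{CS}(\beta, H) = \sum_{i=1}^n \Big[ \delta_i \log\big[1 - \exp\{-\exp(x_i^T \beta) H(C_i)\}\big] + (1-\delta_i)\{-\exp(x_i^T \beta) H(C_i)\} \Big]$$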

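14 I. Initial ensemble estimation
Efficient estimators of $\theta$ from the component likelihoods: $\hat\theta_F^k$, $k = 1, \ldots, K$
An estimator of the $k$th asymptotic inverse covariance $\tilde{I}^k_{\theta_0,\Lambda_0}$: $\hat{I}_F^k$
Initial ensemble estimator
$$\hat\theta_F = \arg\min_\theta \sum_{k=1}^K (\theta - \hat\theta_F^k)^T \hat{I}_F^k (\theta - \hat\theta_F^k) = \Big( \sum_{k=1}^K \hat{I}_F^k \Big)^{-1} \Big( \sum_{k=1}^K \hat{I}_F^k \hat\theta_F^k \Big)$$
An estimator of $\tilde{I}_{\theta_0,\Lambda_0}$: $\hat{I}_F = \sum_{k=1}^K \hat{I}_F^k$ (Cox, 2001)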
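The closed form of the initial ensemble estimator is an information-weighted average of the component estimators. A minimal sketch, assuming the component estimates and their inverse covariance estimates are already available as NumPy arrays; the function name and the numbers are hypothetical and only illustrate the formula.

```python
import numpy as np

def ensemble_combine(estimates, informations):
    """Information-weighted combination of K component estimators."""
    estimates = [np.asarray(t, dtype=float) for t in estimates]
    informations = [np.asarray(m, dtype=float) for m in informations]
    # I_F = sum_k I_F^k
    info_f = sum(informations)
    # sum_k I_F^k theta_F^k
    weighted = sum(m @ t for m, t in zip(informations, estimates))
    # Solve I_F theta = weighted rather than forming an explicit inverse
    theta_f = np.linalg.solve(info_f, weighted)
    return theta_f, info_f

# Hypothetical component fits for p = 2
theta_cs = np.array([0.9, 0.1])
theta_ltrc = np.array([0.7, -0.1])
info_cs = np.array([[4.0, 0.0], [0.0, 1.0]])
info_ltrc = np.array([[12.0, 0.0], [0.0, 3.0]])

theta_f, info_f = ensemble_combine([theta_cs, theta_ltrc], [info_cs, info_ltrc])
print(theta_f)  # approximately [0.75, -0.05]
```

The component with the larger information (here the second one) pulls the combined estimate toward its own value, which is the sense in which the combination is efficient.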

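15 II. Ensemble variable selection
Intermediate estimator with ensemble variable selection
$$\hat\theta_{E,\lambda_n} = \arg\min_\theta \Big\{ (\theta - \hat\theta_F)^T \hat{I}_F (\theta - \hat\theta_F) + \lambda_n \sum_{j=1}^p |\theta_j| / |\hat\theta_{F,j}| \Big\}$$
Least squares approximation of the profile likelihood of $\theta$ with adaptive lasso penalty (Wang and Leng, 2007; Zou, 2006)
The tuning parameter minimizing the modified BIC is selected:
$$BIC_{\lambda_n} = (\hat\theta_{E,\lambda_n} - \hat\theta_F)^T \hat{I}_F (\hat\theta_{E,\lambda_n} - \hat\theta_F) + \log n \sum_{j=1}^p I(\hat\theta_{E,\lambda_n,j} \neq 0) / n$$
The selected model from ensemble variable selection: $A = \{ j : \hat\theta_{E,\lambda_n,j} \neq 0 \}$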
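The penalized least squares problem and the modified BIC on this slide can be sketched in a few lines. This illustration uses plain coordinate descent with soft thresholding, which is a swapped-in solver chosen for brevity and not necessarily the algorithm used in the talk. The function names, the grid of tuning parameters, and the example inputs are hypothetical, and the initial estimate is assumed to have no exactly-zero entries.

```python
import numpy as np

def lsa_adaptive_lasso(theta_f, info_f, lam, n_iter=200, tol=1e-10):
    """Minimize (t - theta_f)' I (t - theta_f) + lam * sum_j |t_j| / |theta_f_j|."""
    theta_f = np.asarray(theta_f, dtype=float)
    info_f = np.asarray(info_f, dtype=float)
    weights = 1.0 / np.abs(theta_f)  # adaptive lasso weights
    theta = theta_f.copy()
    for _ in range(n_iter):
        previous = theta.copy()
        for j in range(theta.size):
            diff = theta - theta_f
            # Cross term from the other coordinates: sum_{l != j} I_jl (t_l - theta_f_l)
            r_j = info_f[j] @ diff - info_f[j, j] * diff[j]
            z = theta_f[j] - r_j / info_f[j, j]  # unpenalized coordinate minimizer
            threshold = lam * weights[j] / (2.0 * info_f[j, j])
            theta[j] = np.sign(z) * max(abs(z) - threshold, 0.0)  # soft threshold
        if np.max(np.abs(theta - previous)) < tol:
            break
    return theta

def modified_bic(theta_e, theta_f, info_f, n):
    diff = theta_e - theta_f
    return diff @ info_f @ diff + np.log(n) * np.count_nonzero(theta_e) / n

def select_model(theta_f, info_f, n, lambdas):
    """Return the BIC-optimal fit and the indices of its nonzero coefficients."""
    fits = [lsa_adaptive_lasso(theta_f, info_f, lam) for lam in lambdas]
    scores = [modified_bic(fit, theta_f, info_f, n) for fit in fits]
    best = fits[int(np.argmin(scores))]
    return best, np.flatnonzero(best)

# Hypothetical initial ensemble fit with one weak coefficient
theta_init = np.array([0.8, 0.02, -0.6])
info_init = np.array([[1.0, 0.3, 0.0], [0.3, 1.0, 0.3], [0.0, 0.3, 1.0]])
theta_sel, model = select_model(theta_init, info_init, n=250, lambdas=[0.0, 0.01, 0.1, 0.5])
```

Because the penalty weight is the reciprocal of the initial estimate, small initial coefficients face a large threshold and are set to zero first, while large coefficients are shrunk only slightly.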

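16 III. Refit & ensemble re-estimation
Refit component estimators: $\hat\theta^k \in \Theta \subset \mathbb{R}^p$
  Its subvector indexed by the model $A$: $\hat\theta_A^k = (\hat\theta_j^k, j \in A) \in \mathbb{R}^{|A|}$
  Its asymptotic inverse covariance estimator: $\hat{I}_A^k \in \mathbb{R}^{|A| \times |A|}$
Ensemble re-estimator by $A$:
$$\hat\theta_A = \Big( \sum_{k=1}^K \hat{I}_A^k \Big)^{-1} \Big( \sum_{k=1}^K \hat{I}_A^k \hat\theta_A^k \Big) \in \mathbb{R}^{|A|}$$
  Its asymptotic inverse covariance estimator: $\hat{I}_A = \sum_{k=1}^K \hat{I}_A^k \in \mathbb{R}^{|A| \times |A|}$
The resulting estimator: $\hat\theta = [\hat\theta_A ; \hat\theta_{A^c}] = [\hat\theta_A ; 0] \in \mathbb{R}^p$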
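The last step places the re-estimated coefficients back into a $p$-vector with zeros outside the selected model. A short sketch, assuming the refit component estimates on $A$ and their inverse covariance estimates have already been computed; the function name and the numbers are hypothetical.

```python
import numpy as np

def ensemble_refit(selected, refit_estimates, refit_informations, p):
    """Combine refit component estimators on the selected model and pad with zeros."""
    selected = np.asarray(selected, dtype=int)
    infos = [np.asarray(m, dtype=float) for m in refit_informations]
    ests = [np.asarray(t, dtype=float) for t in refit_estimates]
    info_a = sum(infos)                                  # I_A = sum_k I_A^k
    weighted = sum(m @ t for m, t in zip(infos, ests))   # sum_k I_A^k theta_A^k
    theta_a = np.linalg.solve(info_a, weighted)
    theta = np.zeros(p)          # coefficients outside A stay exactly zero
    theta[selected] = theta_a
    return theta, info_a

# Hypothetical refits on the selected model A = {0, 3} with p = 5
theta_hat, info_a = ensemble_refit(
    selected=[0, 3],
    refit_estimates=[[1.0, 0.5], [0.8, 0.7]],
    refit_informations=[np.eye(2), 3.0 * np.eye(2)],
    p=5,
)
print(theta_hat)  # approximately [0.85, 0, 0, 0.65, 0]
```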

17 Uncorrelated condition
C1. Component score functions for $\theta$ are pairwise uncorrelated: $E[\dot{l}^k_{\theta_0,\Lambda_0} \dot{l}^{k' T}_{\theta_0,\Lambda_0}] = 0$, $k \neq k'$.
C2. Component tangent spaces are pairwise orthogonal: $\dot{P}^k_{\theta_0,\Lambda_0} \perp \dot{P}^{k'}_{\theta_0,\Lambda_0}$, $k \neq k'$.
C3. Component score functions for $\theta$ are orthogonal to all other component tangent spaces: $\dot{l}^k_{\theta_0,\Lambda_0} \perp \dot{P}^{k'}_{\theta_0,\Lambda_0}$, $k \neq k'$.

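18 Full efficient score and information additivity
Proposition 1. Under Conditions C1-C3, $\tilde{l}_{\theta_0,\Lambda_0} = \sum_{k=1}^K \tilde{l}^k_{\theta_0,\Lambda_0}$, and $E(\tilde{l}^k_{\theta_0,\Lambda_0} \tilde{l}^{k' T}_{\theta_0,\Lambda_0}) = 0$, $k \neq k'$. Consequently, $\tilde{I}_{\theta_0,\Lambda_0} = \sum_{k=1}^K \tilde{I}^k_{\theta_0,\Lambda_0}$.
Example
  Data: i.i.d. observations of $(Z_i, W_i, X_i)$
  $X_i$: a covariate; $W_i$ and $Z_i$: dependent variables
  $X_i \perp (\theta_0, \Lambda_0)$: the distribution of $X_i$ does not depend on the parameters
  There is a semiparametric regression model which leads to a full log likelihood:
$$\sum_{i=1}^n l_{\theta,\Lambda}(Z_i, W_i \mid X_i) = \sum_{i=1}^n l^1_{\theta,\Lambda}(W_i \mid X_i) + \sum_{i=1}^n l^2_{\theta,\Lambda}(Z_i \mid W_i, X_i)$$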

19 Assumptions on the initial estimation
A1. Semiparametric efficient estimation of component regression parameters
  $\hat\theta_F^k$, $k = 1, \ldots, K$, are regular, asymptotically linear, and semiparametric efficient with the component likelihoods, such that
$$n^{1/2} (\hat\theta_F^k - \theta_0) = n^{-1/2} \sum_{i=1}^n (\tilde{I}^k_{\theta_0,\Lambda_0})^{-1} \tilde{l}^k_{\theta_0,\Lambda_0}(z_i) + o_p(1),$$
  and $n^{1/2} (\hat\theta_F^k - \theta_0) \to_D N(0, (\tilde{I}^k_{\theta_0,\Lambda_0})^{-1})$.
A2. Consistent estimation of component efficient information
  A consistent estimator $\hat{I}_F^k$ of $\tilde{I}^k_{\theta_0,\Lambda_0}$ exists for $k = 1, \ldots, K$.
Remark. The convergence rates of the component nuisance parameter estimators may be slower than the $n^{1/2}$ rate.

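20 Asymptotic properties
Theorem 1 (Asymptotic Efficiency). $n^{1/2} (\hat\theta_F - \theta_0) = O_p(1)$ and $n^{1/2} (\hat\theta_F - \theta_0) \to_D N(0, \tilde{I}^{-1}_{\theta_0,\Lambda_0})$.
Theorem 2 (Selection Consistency). If $n^{1/2} \lambda_n \to 0$ and $n \lambda_n \to \infty$, then $n^{1/2} (\hat\theta - \theta_0) = O_p(1)$ and $P(\hat\theta_2 = 0) \to 1$.
Theorem 3 (Oracle Property). If $n^{1/2} \lambda_n \to 0$ and $n \lambda_n \to \infty$, then $n^{1/2} (\hat\theta_1 - \theta_{10}) \to_D N(0, \tilde{I}^{-1}_{a,\theta_{10},\Lambda_0})$.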

21 Simulation set-up
Data from the following exponential hazard model: $h(t \mid x) = \exp(x^T \beta)$, $\beta = (0.8, 0, 0, 1, 0, 0, 0.6, 0, 0, 0)^T$
$x \sim N(0, \Sigma)$, $\Sigma_{ij} = 0.5^{|i-j|}$
$C_i$ follows an exponential distribution; $R_i$ follows an exponential distribution shifted by $C_i$.
Censoring rates: (20%, 20%), (30%, 30%)
$n = 250$ or $n = 500$
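A data generator for this set-up might look as follows. The slide does not give the exponential rates for $C_i$ and $R_i - C_i$, so `enroll_rate` and `followup_rate` below are hypothetical placeholders that are not calibrated to the stated censoring rates; the function name and the seed are likewise illustrative.

```python
import numpy as np

def simulate_pdcd(n, beta, rho=0.5, enroll_rate=1.0, followup_rate=0.5, seed=0):
    """Draw one prospective doubly censored dataset (rates are hypothetical)."""
    rng = np.random.default_rng(seed)
    beta = np.asarray(beta, dtype=float)
    p = beta.size
    idx = np.arange(p)
    sigma = rho ** np.abs(idx[:, None] - idx[None, :])  # Sigma_ij = rho^|i-j|
    x = rng.multivariate_normal(np.zeros(p), sigma, size=n)
    # A constant hazard exp(x'beta) means T is exponential with mean exp(-x'beta)
    t = rng.exponential(scale=np.exp(-(x @ beta)))
    c = rng.exponential(scale=1.0 / enroll_rate, size=n)       # enrollment time
    r = c + rng.exponential(scale=1.0 / followup_rate, size=n)  # termination time
    y = np.minimum(t, r)
    delta = (t <= c).astype(int)  # infected at enrollment
    nu = (t <= r).astype(int)     # failure not right censored
    return {"x": x, "c": c, "observed_time": np.maximum(c, y),
            "delta": delta, "nu": nu, "sigma": sigma}

beta_true = np.array([0.8, 0, 0, 1, 0, 0, 0.6, 0, 0, 0])
data = simulate_pdcd(250, beta_true)
print(data["delta"].mean(), 1 - data["nu"].mean())  # empirical censoring proportions
```

In practice the two rates would be tuned until the printed proportions match the target censoring rates.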

22 Simulation results
n = 250, censoring rates (20%, 20%)
Columns: RMSE, TP, FP, UF, CF, OF
Rows:
  CS: Oracle, MLE, LSA, Refit LSA
  LTRC: Oracle, MLE, LSA, Refit LSA
  Ensemble: Oracle, EE, EVS, Refit on CS, Refit on LTRC, Refit Ensemble
Table: RMSE: mean squared error relative to the ensemble oracle estimator; TP: true positives; FP: false positives; (UF, CF, OF): proportions of simulated datasets that are underfitted, correctly fitted, or overfitted relative to the true model.

23 MACS data analysis
Covariates | EE (SE) | EVS | LTRC (SE) | CS (SE) | ENS (SE)
UNEMP   0.19 (0.09)  .  .  .  .
BLACK   0.71 (0.08) *  (0.18)  0.69 (0.08)  0.70 (0.08)
HISPA   0.28 (0.10) *  (0.22)  0.32 (0.13)  0.20 (0.11)
OTHER   0.06 (0.20)  .  .  .  .
PRECOL  (0.07) *  .  .  .  .
COL     (0.08) *  (0.13)  (0.06)  (0.06)
POSCOL  (0.07) *  (0.12)  (0.11)  (0.07)
NDRNK   (0.01)  .  .  .  .
PACKS   (0.03)  .  .  .  .
NEEDL   0.54 (0.10) *  (0.22)  0.45 (0.12)  0.57 (0.10)
COK2Y   0.49 (0.06) *  (0.10)  0.55 (0.09)  0.41 (0.07)
HAS2Y   0.12 (0.08) *  .  .  .  .
MSX2Y   0.31 (0.08) *  (0.14)  0.27 (0.09)  0.24 (0.07)
OPI2Y   (0.12)

24 MACS data analysis
Covariates | EE (SE) | EVS | LTRC (SE) | CS (SE) | ENS (SE)
DIABE   (0.29)  .  .  .  .
GONOE   0.53 (0.08) *  (0.10)  0.50 (0.14)  0.62 (0.08)
RADTE   (0.21)  .  .  .  .
WARTE   0.37 (0.05) *  (0.11)  0.39 (0.05)  0.34 (0.05)
CON2P   0.07 (0.23)  .  .  .  .
CON2Y   (0.21)  .  .  .  .
REC2P   0.58 (0.05) *  (0.12)  0.59 (0.05)  0.57 (0.05)
REC2Y   0.06 (0.05)  .  .  .  .
Table: EE: initial ensemble estimator; EVS: intermediate estimator with ensemble variable selection; LTRC, CS: refit estimators; ENS: ensemble re-estimator

25 Summary & Future research
Summary
  Prospective doubly censored data
  Semiparametric factorizable likelihoods
  Ensemble method
    Efficient combination of component estimators
    Lasso penalization
Future work
  Ensemble method in the high-dimensional context, $p \gg n$
  Ensemble estimation of the nonparametric nuisance parameter, $\Lambda$

26 Bibliography I
D. R. Cox. Regression models and life-tables. Journal of the Royal Statistical Society. Series B (Methodological), 1972.
D. R. Cox. Some remarks on likelihood factorization. Lecture Notes-Monograph Series, 2001.
R. A. Kaslow and C. R. Rinaldo. The multicenter AIDS cohort study: rationale, organization, and selected characteristics of the participants. American Journal of Epidemiology, 126(1), 1987.
Y. Kim. Asymptotic properties of the maximum likelihood estimator for the proportional hazards model with doubly censored data. Journal of Multivariate Analysis, 101(101), 2010.
Y.-R. Su and J.-L. Wang. Semiparametric efficient estimation for shared-frailty models with doubly-censored clustered data. The Annals of Statistics, 44, 2016.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 1996.
H. Wang and C. Leng. Unified lasso estimation by least squares approximation. Journal of the American Statistical Association, 102(479), 2007.
H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101(476), 2006.

27 Thank you!

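28 Preliminary
$\dot{l}_{\theta_0,\Lambda_0}(z_i) = \partial l_{\theta,\Lambda_0}(z_i) / \partial\theta \,|_{\theta=\theta_0} \in L_2(P_{\theta_0,\Lambda_0})$: the score function for $\theta$ at $\theta_0$
$\partial l_{\theta_0,\Lambda(t)}(z_i) / \partial t \,|_{t=0}$: score function for the submodel of $\Lambda$
  $\Lambda(t)$: one-dimensional parametric submodels of $\Lambda$
  $\Lambda(t) \to \Lambda_0$ as $t \to 0$
$\dot{P}_{\theta_0,\Lambda_0} \subset L_2(P_{\theta_0,\Lambda_0})$: tangent space for $\Lambda$ at the true distribution
  the closed span of the tangent set, which is a collection of score functions for one-dimensional submodels
$\tilde{l}_{\theta_0,\Lambda_0} = \dot{l}_{\theta_0,\Lambda_0} - \Pi_0 \dot{l}_{\theta_0,\Lambda_0}$: efficient score function at $(\theta_0, \Lambda_0)$, where $\Pi_0$ is the projection onto $\dot{P}_{\theta_0,\Lambda_0}$ in $L_2(P_{\theta_0,\Lambda_0})$.
$\tilde{I}_{\theta_0,\Lambda_0} = E(\tilde{l}_{\theta_0,\Lambda_0} \tilde{l}^T_{\theta_0,\Lambda_0})$: efficient information matrix.
Similarly, define them with respect to the component likelihoods: $\dot{l}^k_{\theta_0,\Lambda_0}(z_i)$, $\partial l^k_{\theta_0,\Lambda(t)}(z_i) / \partial t \,|_{t=0}$, $\dot{P}^k_{\theta_0,\Lambda_0}$, $\tilde{l}^k_{\theta_0,\Lambda_0}$, and $\tilde{I}^k_{\theta_0,\Lambda_0}$.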

29 Application of ensemble method
I. Efficient MLEs: $\hat\beta_F^{CS}$, $\hat\beta_F^{LTRC}$
  Consistent variance estimators: $(\hat{I}_F^{CS})^{-1}$, $(\hat{I}_F^{LTRC})^{-1}$
  Computation: iterative convex minorant algorithm, partial likelihood maximization; Huang (1996), Andersen et al. (1997)
Full ensemble estimator:
$$\hat\beta_F = (\hat{I}_F^{CS} + \hat{I}_F^{LTRC})^{-1} (\hat{I}_F^{CS} \hat\beta_F^{CS} + \hat{I}_F^{LTRC} \hat\beta_F^{LTRC})$$
$$\hat{I}_F = \hat{I}_F^{CS} + \hat{I}_F^{LTRC}$$

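30 Application of ensemble method
II. Intermediate estimator with ensemble variable selection:
$$\hat\beta_{E,\lambda_n} = \arg\min_{\beta \in \mathbb{R}^p} (\beta - \hat\beta_F)^T \hat{I}_F (\beta - \hat\beta_F) + \lambda_n \sum_{j=1}^p |\beta_j| / |\hat\beta_{F,j}|$$
The selected model: $M = \{ j : \hat\beta_{E,\lambda_n,j} \neq 0 \}$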

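31 Application of ensemble method
III. Refit estimators: $\hat\beta^{CS}, \hat\beta^{LTRC} \in \mathbb{R}^p$
  Its subvector indexed by $M$: $\hat\beta_M^{LTRC}$, $\hat\beta_M^{CS}$
  Its asymptotic covariances: $(\hat{I}_M^{CS})^{-1}$, $(\hat{I}_M^{LTRC})^{-1}$
Ensemble re-estimator indexed by $M$:
$$\hat\beta_M = (\hat{I}_M^{CS} + \hat{I}_M^{LTRC})^{-1} (\hat{I}_M^{CS} \hat\beta_M^{CS} + \hat{I}_M^{LTRC} \hat\beta_M^{LTRC})$$
$$\hat{I}_M = \hat{I}_M^{CS} + \hat{I}_M^{LTRC}$$
Final estimator: $\hat\beta = [\hat\beta_M ; \hat\beta_{M^c}] = [\hat\beta_M ; 0] \in \mathbb{R}^p$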

32 I. Full ensemble estimation
MLE from $l^{CS}_{\beta,H}(C, \delta \mid x)$: $\hat\beta_F^{CS}$
  A consistent bootstrap variance estimator, $(\hat{I}_F^{CS})^{-1}$
  Computation: the iterative convex minorant (ICM) algorithm
MLE from $l^{LTRC}_{\beta,H}(Y, \nu \mid C, \delta, x)$: $\hat\beta_F^{LTRC}$
  A consistent variance estimator, $(\hat{I}_F^{LTRC})^{-1}$
  Computation: maximizing the partial log likelihood
Full ensemble estimator:
$$\hat\beta_F = (\hat{I}_F^{CS} + \hat{I}_F^{LTRC})^{-1} (\hat{I}_F^{CS} \hat\beta_F^{CS} + \hat{I}_F^{LTRC} \hat\beta_F^{LTRC})$$
  Its inverse covariance estimator is $\hat{I}_F = \hat{I}_F^{CS} + \hat{I}_F^{LTRC}$

33 II. Ensemble variable selection
Intermediate estimator with ensemble variable selection
$$\hat\beta_{E,\lambda_n} = \arg\min_{\beta \in \mathbb{R}^p} (\beta - \hat\beta_F)^T \hat{I}_F (\beta - \hat\beta_F) + \lambda_n \sum_{j=1}^p |\beta_j| / |\hat\beta_{F,j}|$$
  Selection of the optimal tuning parameter: minimizing the modified BIC
  Computation: the path-finding algorithm, least angle regression (lars) (Efron et al., 2004)
The selected model: $M = \{ j : \hat\beta_{E,\lambda_n,j} \neq 0 \}$

34 III. Refit ensemble estimation
Refit estimators based on $M$: $\hat\beta^{CS}$ and $\hat\beta^{LTRC}$
  Its subvector indexed by $M$: $\hat\beta_M^{CS}$ and $\hat\beta_M^{LTRC}$
  Its asymptotic inverse covariances: $\hat{I}_M^{CS}$ and $\hat{I}_M^{LTRC}$
  Computation: maximizing the component likelihoods under the constraint $\{ \beta_j = 0 \text{ for } j \notin M \}$
Ensemble re-estimator indexed by $M$:
$$\hat\beta_M = (\hat{I}_M^{CS} + \hat{I}_M^{LTRC})^{-1} (\hat{I}_M^{CS} \hat\beta_M^{CS} + \hat{I}_M^{LTRC} \hat\beta_M^{LTRC})$$
  Asymptotic inverse covariance estimator: $\hat{I}_M = \hat{I}_M^{CS} + \hat{I}_M^{LTRC}$
Final estimator: $\hat\beta = [\hat\beta_M ; \hat\beta_{M^c}] = [\hat\beta_M ; 0]$

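35 Norm of a matrix $A \in \mathbb{R}^{m \times m}$
Frobenius norm: $\|A\|_F = \sqrt{\mathrm{tr}(A^T A)}$
Spectral norm: $\|A\|_2 = \sqrt{\lambda_{\max}(A^T A)}$, where $\lambda_{\max}(A^T A)$ is the largest eigenvalue of $A^T A$.
$L_1$ norm: $\|A\|_1 = \sum_{i=1}^m \sum_{j=1}^m |a_{ij}|$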
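The three norms on this slide are straightforward to compute directly from their definitions. A small sketch, with a hypothetical function name and example matrix:

```python
import numpy as np

def matrix_norms(a):
    """Frobenius, spectral, and entrywise L1 norms of a square matrix."""
    a = np.asarray(a, dtype=float)
    gram = a.T @ a
    frobenius = np.sqrt(np.trace(gram))
    # eigvalsh applies because A'A is symmetric
    spectral = np.sqrt(np.max(np.linalg.eigvalsh(gram)))
    l1 = np.sum(np.abs(a))  # sum of absolute entries
    return frobenius, spectral, l1

a = np.array([[3.0, 0.0], [0.0, -4.0]])
fro, spec, l1 = matrix_norms(a)
print(fro, spec, l1)  # approximately 5.0, 4.0, 7.0
```

Note that the entrywise $L_1$ norm used here differs from `np.linalg.norm(a, 1)`, which returns the maximum absolute column sum.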

36 Theoretical properties
Corollary 4 (Asymptotic Efficiency). We have $n^{1/2} (\hat\beta_F - \beta_0) \to_D N(0, \tilde{I}^{-1}_{\beta_0,H_0})$.
Corollary 5 (Selection Consistency). If $n^{1/2} \lambda_n \to 0$ and $n \lambda_n \to \infty$, then $n^{1/2} (\hat\beta - \beta_0) = O_p(1)$ and $P(M = M_T) \to 1$.
Corollary 6 (Oracle Property). If $n^{1/2} \lambda_n \to 0$ and $n \lambda_n \to \infty$, then $n^{1/2} (\hat\beta_a - \beta_{0a}) \to_D N(0, \tilde{I}^{-1}_{a,\beta_{0a},H_0})$.

37 Baseline hazard rate estimators
1. The baseline hazard function estimator $\hat{H}^{LTRC}$ in the Cox model with left-truncated and right-censored data has the regular convergence rate, $n^{1/2}$.
2. The convergence rate of $\hat{H}^{CS}$ in the Cox model with current status data is $n^{1/3}$.
Regardless of the convergence rates of the component baseline hazard estimators, the proposed method establishes valid asymptotic properties for the estimation of $\beta$.
In practice, $\hat{H}^{LTRC}$ may be used as the baseline hazard function estimator.

38 Decomposition into marginal and conditional likelihoods
Data: $n$ independent, identically distributed observations of $(Z_i, W_i, X_i)$
  $X_i$ is a covariate, and $W_i$ and $Z_i$ are dependent variables.
A semiparametric regression model with parameters $(\theta, \Lambda)$ which leads to a full log likelihood,
$$l_{\theta,\Lambda}(Z, W \mid X) = \sum_{i=1}^n l_{\theta,\Lambda}(Z_i, W_i \mid X_i).$$
The distribution of $X_i$ is independent of $(\theta, \Lambda)$.
Decomposition of the full log likelihood into the marginal and conditional log likelihoods:
$$l^1_{\theta,\Lambda}(W \mid X) = \sum_{i=1}^n l^1_{\theta,\Lambda}(W_i \mid X_i) \quad \text{and} \quad l^2_{\theta,\Lambda}(Z \mid W, X) = \sum_{i=1}^n l^2_{\theta,\Lambda}(Z_i \mid W_i, X_i).$$

39 Variance estimators of $\hat\beta_1$
n = 250, censoring rates (20%, 20%)
Rows: ESE, ASE, 95% coverage
Columns:
  CS: Oracle, MLE, Refit LSA
  LTRC: Oracle, MLE, Refit LSA
  Ensemble: Oracle, EE, Refit on CS, Refit on LTRC, Refit Ensemble
Table: ESE: empirical standard errors; ASE: average estimated standard errors; 95% empirical coverage probabilities of $\hat\beta_1$

40 Norm of empirical covariances between two component MLEs
Columns: censoring rates (20%, 20%) with n = 250 and n = 500; censoring rates (30%, 30%) with n = 250 and n = 500
Rows (norm): Frobenius, Spectral, $L_1$
