Lecture 9: Learning Optimal Dynamic Treatment Regimes. Donglin Zeng, Department of Biostatistics, University of North Carolina

1 Lecture 9: Learning Optimal Dynamic Treatment Regimes

2 Introduction

3 Refresh: Dynamic Treatment Regimes (DTRs) DTRs: sequential decision rules, tailored at each stage by patients' time-varying features and intermediate outcomes in previous stages (Lavori & Dawson 1998, Lavori et al. 2000, Murphy et al. 2001). Used in cancer, psychiatry, and substance abuse research. Examples of DTRs: Adaptive Pharmacological and Behavioral Treatments for Children with Attention Deficit Hyperactivity Disorder (ADHD, Pelham 2002). DTR1: Prescribe medication (MED) as initial treatment; if a child responds, then continue; if a child does not respond, then augment with behavioral modification (BMOD). DTR2: Prescribe BMOD as initial treatment; if a child responds, then continue; if a child does not respond, then augment with MED.

6 Existing Methods

7 General Multi-Stage DTR notation $A_k$: treatment at stage $k$, taking values in $\{-1, 1\}$. $H_k$: historical information at stage $k$. $R_k$: reward at stage $k$. DTR: a sequence of decision functions $D = (D_1, D_2, \ldots, D_K)$ that maps from the historical information domain $(H_1, H_2, \ldots, H_K)$ to the treatment domain, a vector with entries in $\{-1, 1\}$.

8 Value Function and Optimal DTR The value function associated with $D$ is the expected total reward if $D$ is actually implemented: $V(D) = E^D(R_1 + \cdots + R_K)$. Optimal DTR: $D^* = \operatorname{argmax}_D V(D)$. Key relationship based on SMART designs: $V(D)$ is $E\left[ \frac{I(A_1 = D_1(H_1),\, A_2 = D_2(H_2),\, \ldots)}{P(A_1 \mid H_1)\, P(A_2 \mid H_2) \cdots} (R_1 + R_2 + \cdots) \right]$.
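As a rough illustration of the key relationship above, the following Python sketch estimates $V(D)$ by inverse probability weighting from SMART-type data. The data layout (a list of per-stage `(history, action, reward)` triples), the function name `ipw_value`, and the use of known randomization probabilities are assumptions made for this example, not part of the lecture.

```python
def ipw_value(trajectories, rules, propensities):
    """Inverse-probability-weighted estimate of V(D).

    trajectories: list of subjects; each subject is a list of
                  (history, action, reward) triples, one per stage.
    rules:        list of functions D_k(history) -> action in {-1, 1}.
    propensities: list of functions P_k(action, history) -> probability.
    All three layouts are hypothetical choices for this sketch.
    """
    total = 0.0
    for subject in trajectories:
        weight = 1.0
        for (h, a, _), rule, prop in zip(subject, rules, propensities):
            if a != rule(h):
                # The subject deviates from D, so the indicator is zero.
                weight = 0.0
                break
            weight /= prop(a, h)
        total += weight * sum(r for _, _, r in subject)
    return total / len(trajectories)


# Tiny two-stage example with 1:1 randomization at both stages.
always_treat = lambda h: 1
half = lambda a, h: 0.5
data = [
    [((1.0,), 1, 2.0), ((1.0, 1, 2.0), 1, 3.0)],    # follows the rule at both stages
    [((1.0,), -1, 5.0), ((1.0, -1, 5.0), 1, 1.0)],  # deviates at stage 1
]
value_estimate = ipw_value(data, [always_treat, always_treat], [half, half])
print(value_estimate)
```

Only the first subject is consistent with the regime, so that subject's total reward is up-weighted by $1 / (0.5 \times 0.5)$ and averaged over both subjects.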

9 Existing Methods for Learning DTRs Dynamic modelling of clinical outcomes: G-computation, Monte Carlo simulation, Bayesian approaches (Lavori & Dawson 2004; Wathen & Thall 2008). Sequential modelling of Q-functions (expected individual outcome given the best treatment in prospect): Q-learning, A-learning, or doubly robust regression models (Murphy et al. 2006; Robins 2004). Sequential maximization of value functions (maximal expected benefit for a given treatment strategy): O-learning (Zhao et al. 2012, 2014). We focus on the last two methods.

10 Q-learning: two-stage example Data: $(H_1, A_1, R_1, H_2, A_2, R_2)$ where H: state; A: treatment; R: reward. Goal: maximize $R_1 + R_2$ by estimating the best treatment at each stage. Q-learning uses backward-induction logic: Compare the expected outcomes from a second-stage regression model under the two treatments. Pick the treatment with the larger expected outcome. Imputation: create a pseudo second-stage outcome $\tilde R_2$ as the maximum across the two treatments from the above regression model. Fit a regression model where the outcome is $R_1 + \tilde R_2$. Pick the stage-1 treatment maximizing the regression expected value for a given set of baseline variables.

11 Q-learning: algorithm At stage 2 (no future), we fit $Q_2(H_2, A_2) = E[R_2 \mid H_2, A_2]$, then estimate $\hat D_2 = \operatorname{argmax}_{a \in \{-1,1\}} Q_2(H_2, a)$. At stage 1, we obtain the individual optimal future reward as $\tilde R_1 = R_1 + \max_{a \in \{-1,1\}} Q_2(H_2, a)$, so we estimate $Q_1(H_1, A_1)$ as $E[\tilde R_1 \mid H_1, A_1]$. We obtain $\hat D_1 = \operatorname{argmax}_{a \in \{-1,1\}} Q_1(H_1, a)$.
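The backward recursion can be sketched in a few lines of Python. To stay dependency-free, this example replaces the regression models with saturated cell means over discrete histories; that substitution, the tuple layout of `data`, and all function names are assumptions of the sketch rather than the lecture's implementation.

```python
from collections import defaultdict

ACTIONS = (-1, 1)


def fit_cell_means(rows):
    """Saturated regression: mean outcome for each (history, action) cell."""
    sums, counts = defaultdict(float), defaultdict(int)
    for h, a, y in rows:
        sums[(h, a)] += y
        counts[(h, a)] += 1
    return {cell: sums[cell] / counts[cell] for cell in sums}


def q_max(q, h):
    """max over a of Q(h, a), ignoring cells never observed."""
    return max(q[(h, a)] for a in ACTIONS if (h, a) in q)


def q_argmax(q, h):
    """argmax over a of Q(h, a), ignoring cells never observed."""
    return max((a for a in ACTIONS if (h, a) in q), key=lambda a: q[(h, a)])


def q_learning_two_stage(data):
    """data: list of (h1, a1, r1, h2, a2, r2) with hashable histories."""
    # Stage 2: Q2(H2, A2) = E[R2 | H2, A2].
    q2 = fit_cell_means((h2, a2, r2) for _, _, _, h2, a2, r2 in data)
    # Pseudo-outcome: R1 + max_a Q2(H2, a).
    pseudo = [(h1, a1, r1 + q_max(q2, h2)) for h1, a1, r1, h2, _, _ in data]
    # Stage 1: Q1(H1, A1) = E[pseudo-outcome | H1, A1].
    q1 = fit_cell_means(pseudo)
    return q1, q2


# Toy data: a1 = 1 pays more immediately, but a1 = -1 leads to a better stage 2.
data = [
    ("x", 1, 1.0, ("x", 1), 1, 2.0),
    ("x", 1, 1.0, ("x", 1), -1, 0.0),
    ("x", -1, 0.0, ("x", -1), 1, 5.0),
    ("x", -1, 0.0, ("x", -1), -1, 1.0),
]
q1, q2 = q_learning_two_stage(data)
d1 = q_argmax(q1, "x")
d2 = q_argmax(q2, ("x", d1))
print(d1, d2, q1)
```

In the toy data the stage-1 rule picks $-1$ even though $+1$ has the larger immediate reward, which is the point of carrying the stage-2 maximum backward.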

12 Pros and Cons Pros: Each step is a regression analysis. Make use of all the subjects. Cons: Regression models may be misspecified. The objective function is for model fitting but not directly for value maximization.

13 Extension Single-Stage O-Learning

14 O-learning: single stage Directly maximize the value function (Zhao et al. 2012): $V(D) = E^D(R) = E\left[ \frac{R\, I(A = D(H))}{P(A \mid H)} \right]$. Interpretation: for subjects with high rewards, we most likely want $D(H)$ to be the same as the assigned treatment; for subjects with low rewards, we may want $D(H)$ to be the opposite of the assigned treatment. O-learning is a weighted classification problem with the outcome as weights (classification tree, SVM).
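The weighted-classification reading can be made explicit with one line of algebra. The display below is a worked step under the slide's setup (binary $A$ and $D(H)$), offered as a sketch rather than a formula taken from the lecture.

```latex
\begin{align*}
V(D) = E\left[\frac{R\, I(A = D(H))}{P(A \mid H)}\right]
     = E\left[\frac{R}{P(A \mid H)}\right]
     - E\left[\frac{R\, I(A \neq D(H))}{P(A \mid H)}\right].
\end{align*}
```

The first term does not depend on $D$, so maximizing $V(D)$ amounts to minimizing a misclassification error in which subject $i$ has label $A_i$ and weight $R_i / P(A_i \mid H_i)$.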

15 O-learning: multiple stages A backward algorithm (Zhao et al. 2014): At stage 2, apply single-stage O-learning to estimate $D_2^*$. For stage 1, keep only the subjects whose observed treatment is the same as the estimated optimal one, $A_2 = D_2^*(H_2)$. For this subgroup of patients, apply single-stage O-learning to estimate $D_1^*$.

16 Pros and Cons Pros: Directly maximizes the value function for optimal treatments. It only uses the subjects who actually follow optimal regimes in the future, so it is robust. Cons: Need to handle negative weights (Zhao et al. recommend subtracting a small constant). Highly variable weights may affect performance. Discards a significant proportion of subjects in the backward procedure.

17 Improve O-Learning via Augmentation

18 New Approach: AMOL Augmented Multistage O-Learning (AMOL): based on a backward O-learning but with three novel improvements. Improvement 1: we aim to reduce the variability of weights. Improvement 2: we can handle negative weights. Improvement 3: we utilize all the subjects including those who may not take optimal treatments in future stages.

19 Improvement 1: Fitting residuals to reduce variability Fit a regression model $s(h)$: $R_i \sim H_i$. Change the weights from the observed outcome $R_i$ to the residual $R_i - s(H_i)$. $V(D) = E\left[\frac{R\, I(A = D(H))}{P(A \mid H)}\right] = E\left[\frac{(R - s(H))\, I(A = D(H))}{P(A \mid H)}\right] + E[s(H)]$.
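The identity holds because the subtracted term has an expectation that is free of $D$. A short worked step, assuming $P(a \mid H) > 0$ for both treatments (an assumption stated here for the sketch):

```latex
\begin{align*}
E\left[\frac{s(H)\, I(A = D(H))}{P(A \mid H)}\right]
 &= E\left[ s(H)\, E\left\{ \frac{I(A = D(H))}{P(A \mid H)} \,\Big|\, H \right\} \right] \\
 &= E\left[ s(H) \sum_{a \in \{-1, 1\}} P(a \mid H)\, \frac{I(a = D(H))}{P(a \mid H)} \right]
  = E[s(H)].
\end{align*}
```

Any choice of $s$ therefore leaves the maximizer unchanged; a good fit simply makes the weights $R - s(H)$ smaller in magnitude.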

20 Improvement 2: Accommodate negative weights Note $\operatorname{argmax}_D E\left[\frac{R\, I(A = D(H))}{P(A \mid H)}\right] = \operatorname{argmax}_D E\left[\frac{|R|\, I(A\, \mathrm{sign}(R) = D(H))}{P(A \mid H)}\right]$. When $R_i > 0$, the desirable rule for large $R_i$ should be $D(H_i) = A_i$. When $R_i < 0$, the desirable rule for large $|R_i|$ is $D(H_i) = -A_i$.
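One way to check the equality of the two maximizers is the pointwise identity below, written with the shorthand $R^- = \max(-R, 0)$ (notation introduced for this sketch) and using that $A$ and $D(H)$ take values in $\{-1, 1\}$.

```latex
\begin{align*}
R\, I(A = D(H)) &= |R|\, I\big(A\, \mathrm{sign}(R) = D(H)\big) - R^- .
\end{align*}
% Case R > 0: R^- = 0 and sign(R) = 1, so both sides equal R I(A = D(H)).
% Case R < 0: |R| I(-A = D(H)) - |R| = -|R| I(A = D(H)) = R I(A = D(H)).
% Case R = 0: both sides are 0.
```

Dividing by $P(A \mid H)$ and taking expectations, the term $E[R^- / P(A \mid H)]$ does not involve $D$, so the two objectives share the same maximizer.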

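21 Improved O-learning using surrogate loss: Weighted classification problem: $\hat f = \operatorname{argmin}_f\; n^{-1} \sum_{i=1}^n \left(1 - \mathrm{sign}(r_i) A_i f(H_i)\right)_+ \frac{|r_i|}{\pi_i} + \lambda \|f\|^2$, where $r_i = R_i - s(H_i)$ is the residual, $\pi_i = P(A_i \mid H_i)$, $D^*(h) = \mathrm{sign}(f(h))$, $f(x) = \beta^\top x + \beta_0$, and $\|f\|^2 = \|\beta\|^2$.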
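A minimal sketch of this optimization for a linear $f$, using plain subgradient descent in pure Python. The solver choice, step size, iteration count, and function names are assumptions of this example; the lecture mentions SVM-type classifiers but does not specify an optimizer.

```python
def sign(v):
    return (v > 0) - (v < 0)


def fit_weighted_hinge(H, A, r, pi, lam=0.01, lr=0.05, steps=500):
    """Subgradient descent for
    (1/n) * sum_i w_i * max(0, 1 - y_i * f(H_i)) + lam * ||beta||^2,
    with y_i = sign(r_i) * A_i, w_i = |r_i| / pi_i, f(x) = beta.x + beta0.
    """
    n, p = len(H), len(H[0])
    y = [sign(ri) * ai for ri, ai in zip(r, A)]
    w = [abs(ri) / pii for ri, pii in zip(r, pi)]
    beta, beta0 = [0.0] * p, 0.0
    for _ in range(steps):
        grad = [2.0 * lam * b for b in beta]
        grad0 = 0.0
        for x, yi, wi in zip(H, y, w):
            f = sum(b * xj for b, xj in zip(beta, x)) + beta0
            if yi * f < 1.0:  # hinge term is active for this subject
                for j in range(p):
                    grad[j] -= wi * yi * x[j] / n
                grad0 -= wi * yi / n
        beta = [b - lr * g for b, g in zip(beta, grad)]
        beta0 -= lr * grad0
    return beta, beta0


def rule(beta, beta0, x):
    """D(h) = sign(f(h)); ties are sent to +1 by convention in this sketch."""
    f = sum(b * xj for b, xj in zip(beta, x)) + beta0
    return 1 if f >= 0 else -1


# Toy data: a positive residual means "keep A", a negative one means "flip A".
H = [[1.0], [2.0], [-1.0], [-2.0]]
A = [1, -1, -1, 1]
r = [1.0, -1.0, 1.0, -1.0]
pi = [0.5, 0.5, 0.5, 0.5]
beta, beta0 = fit_weighted_hinge(H, A, r, pi)
print(rule(beta, beta0, [3.0]), rule(beta, beta0, [-3.0]))
```

In the toy data the effective labels $\mathrm{sign}(r_i) A_i$ equal the sign of the covariate, so the fitted rule recommends $+1$ for positive and $-1$ for negative covariate values.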

22 Improvement 3: Use all subjects at each stage Ideas: At each stage, O-learning requires knowing the incremental reward for each subject, i.e., the future reward if they are treated optimally. For subjects who actually take non-optimal treatments, their future value increment is missing. Augmentation techniques for missing data can be used. The augmentation needs the imputation of incremental reward for these subjects: models in Q-learning provide natural imputation. Therefore, this approach integrates O- and Q-learning.

23 Augmented Inverse Probability Weighted Estimation AIPW in the missing-data literature (Robins et al. 1994, Robins 1999): Estimate $\mu$, the mean of the sample $Y_i$'s. Some $Y_i$'s are missing; $Z_i = I(Y_i \text{ is observed})$. $H_i$'s are predictors. If either of the two parametric models, $\mu(H, \gamma_1) = E(Y \mid H)$ or $\pi(H, \gamma_3) = P(Z = 1 \mid H)$, is correctly specified, the estimator $\hat\mu$ is consistent. If both are correct, it is most efficient. $\hat\mu = n^{-1} \sum_{i=1}^n \left[ \frac{Z_i Y_i}{\pi(H_i, \hat\gamma_3)} - \frac{Z_i - \pi(H_i, \hat\gamma_3)}{\pi(H_i, \hat\gamma_3)}\, \mu(H_i, \hat\gamma_1) \right]$.
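The estimator is easy to compute once the two fitted models are available. The sketch below takes the fitted values $\pi(H_i, \hat\gamma_3)$ and $\mu(H_i, \hat\gamma_1)$ as given lists; the function name `aipw_mean` and the use of `None` for missing outcomes are assumptions of this example.

```python
def aipw_mean(y, z, pi_hat, mu_hat):
    """Augmented inverse-probability-weighted estimate of E[Y].

    y:      outcomes, with None where Y_i is missing
    z:      1 if Y_i is observed, 0 otherwise
    pi_hat: fitted P(Z = 1 | H_i)
    mu_hat: fitted E(Y | H_i)
    """
    total = 0.0
    for yi, zi, pi, mu in zip(y, z, pi_hat, mu_hat):
        ipw_term = yi / pi if zi == 1 else 0.0   # Z_i * Y_i / pi_i
        augmentation = (zi - pi) / pi * mu       # (Z_i - pi_i) / pi_i * mu_i
        total += ipw_term - augmentation
    return total / len(z)


y = [2.0, None, 4.0, None]
z = [1, 0, 1, 0]
pi_hat = [0.5, 0.5, 0.5, 0.5]
mu_hat = [3.0, 3.0, 3.0, 3.0]
mu_aipw = aipw_mean(y, z, pi_hat, mu_hat)
print(mu_aipw)
```

For subjects with missing outcomes the contribution reduces to the model prediction $\mu(H_i, \hat\gamma_1)$, which is how the augmentation term recovers information from incomplete cases.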

24 Augmented Multistage O-learning Algorithm AMOL Complete Algorithm At stage 2, with $r_2 = R_2 - s_2(H_2)$, we minimize $E\left[\frac{I(\mathrm{sign}(r_2) A_2 \neq D_2(H_2))\, |r_2|}{P(A_2 \mid H_2)}\right]$ using O-learning to obtain $D_2^*$. At stage 2, fit a Q-learning model to compute $\tilde R_2(h) = \max_{a \in \{-1,1\}} E[R_2 \mid A_2 = a, H_2 = h]$.

25 Augmented Multistage O-learning Algorithm AMOL Complete Algorithm (continued) Compute the augmented incremental reward at stage 2: $Q_2 = \frac{I(A_2 = D_2^*(H_2))}{P(A_2 \mid H_2)} R_2 - \frac{I(A_2 = D_2^*(H_2)) - P(A_2 \mid H_2)}{P(A_2 \mid H_2)} \tilde R_2(H_2)$. At stage 1, calculate $r_1 = R_1 + Q_2 - s_1(H_1)$, then minimize $E\left[\frac{I(\mathrm{sign}(r_1) A_1 \neq D_1(H_1))\, |r_1|}{P(A_1 \mid H_1)}\right]$.
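The per-subject bookkeeping in this step can be sketched as two small Python helpers. The names `augmented_reward` and `stage1_label_and_weight` and their argument layout are hypothetical; the fitted quantities ($D_2^*$, the Q-learning maximum, and $s_1$) are assumed to come from earlier steps.

```python
def augmented_reward(r2, a2, d2_opt, p2, q2_max):
    """Augmented stage-2 increment Q_2 for one subject.

    r2:     observed stage-2 reward R_2
    a2:     observed stage-2 treatment A_2
    d2_opt: estimated optimal treatment D_2^*(H_2)
    p2:     randomization probability P(A_2 | H_2)
    q2_max: Q-learning imputation, max_a E[R_2 | A_2 = a, H_2]
    """
    follows = 1.0 if a2 == d2_opt else 0.0
    return follows / p2 * r2 - (follows - p2) / p2 * q2_max


def stage1_label_and_weight(r1, q2_aug, s1, a1, p1):
    """Label sign(r_1) * A_1 and weight |r_1| / P(A_1 | H_1) for stage-1 O-learning."""
    resid = r1 + q2_aug - s1
    # A zero residual gets weight 0, so its label does not matter.
    label = a1 if resid >= 0 else -a1
    return label, abs(resid) / p1


q_follow = augmented_reward(r2=4.0, a2=1, d2_opt=1, p2=0.5, q2_max=3.0)
q_deviate = augmented_reward(r2=4.0, a2=-1, d2_opt=1, p2=0.5, q2_max=3.0)
label, weight = stage1_label_and_weight(r1=1.0, q2_aug=q_deviate, s1=5.0, a1=1, p1=0.5)
print(q_follow, q_deviate, label, weight)
```

A subject who deviates from $D_2^*$ still contributes, with the imputed value standing in for the unobserved optimal reward; this is what lets stage 1 use all subjects.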

26 AMOL Theoretical Results Theorem 1: Consistency of AMOL optimal treatment rule For any function $\mu(H_k)$ which maps from the history information $H_k$ to the outcome domain, $R_{k-1} + \frac{\prod_{j=k}^K I(A_j = D_j^*(H_j)) \left(\sum_{j=k}^K R_j\right)}{\prod_{j=k}^K \pi_j(A_j, H_j)} - \frac{\prod_{j=k}^K I(A_j = D_j^*(H_j)) - \prod_{j \ge k} \pi_j(A_j, H_j)}{\prod_{j \ge k} \pi_j(A_j, H_j)}\, \mu(H_k)$ is always unbiased for $E[R_{k-1} + \tilde R_k \mid H_k]$, and for pure randomization, its conditional variance is minimized if $\mu(H_k) = \tilde R_k(H_k)$.

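27 AMOL Theoretical Results Theorem 2: Convergence rate of the value function Under some regularity conditions, including geometric noise conditions for the boundary, convergence rates of the Q-learning models, and the rate of the Gaussian kernel bandwidth for the RKHS, we obtain a probability bound on the gap between the value of the estimated rules, $V_k(\hat f_k, \ldots, \hat f_K)$, and the optimal value, $V_k(f_k^*, \ldots, f_K^*)$, in terms of stage-wise error terms $\epsilon_{nj}(\tau)$, $j = k, \ldots, K$. [Displayed bound and the expression for $\epsilon_{nk}(\tau)$, which involves $\lambda_{nk}$, $v_k$, $\delta_k$, $q_k$, $\beta_k$, $n$, and $\tau$, are not legible in the extracted slide.]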

28 Comparing Performance in Simulation Study

29 Simulation Set-up Two-stage and four-stage settings. 500 replicates and an independent test set of 10,000. 50 baseline covariates: $X_1, X_2, \ldots, X_{50}$ from $N(0, 1)$. Treatments $A_k$ are randomly assigned to $\{-1, 1\}$ with equal probability. Scenario 1: $R_1 = X_1 A_1 + N(0, 1)$; $R_2 = (R_1 + X_2^2 + [\ldots]) A_2 + N(0, 1)$. Scenario 2 extends Scenario 1 to four stages: $R_1 = X_3 A_1 + N(0, 1)$; $R_2 = (R_1 + [\ldots]) A_2 + N(0, 1)$; $R_3 = 2 (R_2 + X_3) A_3 + 3X_4 - 4X_5 + N(0, 1)$; $R_4 = 3 (R_3 + X_6) A_4 - 4X_3 + N(0, 1)$.

30 Scenario 1 Simulation Results Figure: Scenario 1: two-stage trial with 500 replicates and optimal value 2.86. Empirical value/std from the optimal value versus n for Q-learning, O-learning, O-learning Residual, and AMOL.

31 Scenario 2 Simulation Results Figure: Scenario 2: four-stage trial with 500 replicates and optimal value 25.6. Empirical value/std from the optimal value versus n for Q-learning, O-learning, O-learning Residual, and AMOL.

32 Analysis of ADHD Study

33 ADHD Data Analysis Interventions include different doses of methylphenidate (MED) and different intensities of behavioral modification (BMOD). The first stage lasted 2 months, and an impairment rating scale and an individualized list of target behaviors were used to assess response. Children who didn't respond were re-randomized to either intensified or switched treatment. The primary outcome is a school performance score ranging from 1 to 5.

34 Additional Information on ADHD Data A total of 150 subjects at the initial stage. Four baseline covariates: prior medication history, ADHD impairment score, ODD diagnosis, and race. Two time-varying covariates (adherence to treatment, months to remission) for stage 2. Participants who did not respond to the first-stage intervention were re-randomized in the second stage.

35 ADHD Data Analysis Mean: Q-learning 3.601 (0.0284); O-learning 3.097 (0.0387); AMOL 3.660 (0.0268). Figure: Predicted values based on cross-validation (CV) for Q-learning, O-learning, AMOL, and AMOL Sparse.

36 ADHD Coefficients for stage 2 Table: Coefficients for ADHD stage 2 (values not shown). Columns: Q-L; O-L ($\times 10^2$); AMOL. Rows: Intercept; ODD Diagnosis; ADHD score (cont.); Medication prior; Race (white=1); trt1 (1 for BMOD; -1 for MED); ODD Diagnosis*trt; ADHD*trt; Prior med*trt; Race*trt; Months to non-response; Adherence to trt1; Months to non-response*trt; Adherence to trt1*trt; Adherence to trt1*trt. Q-learning also includes other interaction terms with trt2, which are omitted from the table.

37 ADHD Coefficients for stage 1 Table: Coefficients for ADHD stage 1 (values not shown). Columns: Q-L; O-L ($\times 10^3$); AMOL ($\times 10^3$). Rows: Intercept; ODD Diagnosis; ADHD score (cont.); Medication prior; Race (white=1); trt1 (1 for BMOD; -1 for MED); ODD Diagnosis*trt; ADHD*trt; Prior med*trt; Race*trt.

38 Interpretations of the Coefficients Sparse optimal rules from AMOL: For the first stage, children with Prior Med = 1 receive MED; otherwise, BMOD. For the second stage, children who adhere to the initial treatment INTENSIFY; otherwise, ADD the other treatment. DTRs and observed values: BMOD then ADD MED; BMOD then INTENSIFY BMOD; MED then ADD BMOD; MED then INTENSIFY MED 2.789.

39 Future Consideration

40 Additional Issues To find tailoring variables for future studies: ranking the importance of feature variables and feature selection for DTRs. More interpretable rules: incorporate tree models. Exploration of other classifiers. Identifying high-benefit subgroups. Other types of outcomes. Multi-dimensional outcomes: value function (benefit-risk).
