Submitted to Statistical Science

SUPPLEMENT TO "Q- AND A-LEARNING METHODS FOR ESTIMATING OPTIMAL DYNAMIC TREATMENT REGIMES"

Phillip J. Schulte, Anastasios A. Tsiatis, Eric B. Laber, and Marie Davidian

Abstract. This supplemental article contains technical details related to material presented in the parent article. Section and equation numbers not preceded by "A" refer to those in the parent article.

Phillip J. Schulte is Biostatistician, Duke Clinical Research Institute, Durham, North Carolina 27710, USA (e-mail: phillip.schulte@duke.edu). Anastasios A. Tsiatis is Gertrude M. Cox Distinguished Professor, Department of Statistics, North Carolina State University, Raleigh, North Carolina 27695, USA (e-mail: tsiatis@ncsu.edu). Eric B. Laber is Assistant Professor, Department of Statistics, North Carolina State University, Raleigh, North Carolina 27695, USA (e-mail: eblaber@ncsu.edu). Marie Davidian is William Neal Reynolds Professor, Department of Statistics, North Carolina State University, Raleigh, North Carolina 27695, USA (e-mail: davidian@ncsu.edu).

A.1. DEMONSTRATION THAT (5)-(8) DEFINE AN OPTIMAL REGIME

For $k = 1, \ldots, K$ and any $d \in \mathcal{D}$, define the random variables $\alpha_k\{\bar S_k^*(\bar d_{k-1})\}$ such that

(A.1)   $\alpha_k\{\bar S_k^*(\bar d_{k-1})\}(\omega) = V_k^{(1)}(\bar s_k, \bar u_{k-1})$ for any $\omega \in \Omega$,

where $(\bar s_k, \bar u_{k-1})$ are defined by (3). Because $V_k^{(1)}$ as defined in (6) and (8) is a function of arguments $(\bar s_k, \bar a_{k-1}) \in \bar{\mathcal S}_k \times \bar{\mathcal A}_{k-1}$, whereas the random variable $\bar S_k^*(\bar d_{k-1})$ has realizations only in $\bar{\mathcal S}_k$, we find it convenient to introduce (A.1) to make this distinction clear. In words, $\alpha_k\{\bar S_k^*(\bar d_{k-1})\}$ is the expected outcome conditional on a patient in the population having followed regime $d$ through decision point $k-1$ and then following the optimal regime from that point on.

We now argue that $d^{(1)}$ is an optimal regime; i.e., $d^{(1)}$ satisfies (4). We first show that, for any $d \in \mathcal{D}$,

$E\{Y^*(d) \mid S_1 = s_1, S_2^*(d_1), \ldots, S_K^*(\bar d_{K-1})\} \le E\{Y^*(\bar d_{K-1}, d_K^{(1)}) \mid S_1 = s_1, S_2^*(d_1), \ldots, S_K^*(\bar d_{K-1})\}$
(A.2)   $= \alpha_K\{s_1, S_2^*(d_1), \ldots, S_K^*(\bar d_{K-1})\}$.

This follows because, for the set in $\Omega$ where $\{S_2^*(d_1) = s_2, \ldots, S_K^*(\bar d_{K-1}) = s_K\}$, the left- and right-hand sides of the first line of (A.2) are equal to

(A.3)   $E\{Y^*(d) \mid \bar S_K^*(\bar d_{K-1}) = \bar s_K\} = E\{Y^*(\bar u_{K-1}, u_K) \mid \bar S_K^*(\bar u_{K-1}) = \bar s_K\}$,
(A.4)   $E\{Y^*(\bar d_{K-1}, d_K^{(1)}) \mid \bar S_K^*(\bar d_{K-1}) = \bar s_K\} = E[Y^*\{\bar u_{K-1}, d_K^{(1)}(\bar s_K, \bar u_{K-1})\} \mid \bar S_K^*(\bar u_{K-1}) = \bar s_K]$,

respectively. By the definition of $d_K^{(1)}$ in (5), (A.4) is greater than or equal to (A.3), and, by the definition of $V_K^{(1)}$ in (6), (A.4) equals $V_K^{(1)}(\bar s_K, \bar u_{K-1})$. Because these results hold for sets $\{S_2^*(d_1) = s_2, \ldots, S_K^*(\bar d_{K-1}) = s_K\}$ for any $(s_2, \ldots, s_K)$ such that $(s_1, \ldots, s_K, u_1, \ldots, u_{K-1}) \in \Gamma_K$, and by the definition of $\alpha_K$ in (A.1), (A.2) holds. Taking conditional expectations given $S_1 = s_1$ yields

$E\{Y^*(d) \mid S_1 = s_1\} \le E\{Y^*(\bar d_{K-1}, d_K^{(1)}) \mid S_1 = s_1\}$
(A.5)   $= E[\alpha_K\{s_1, S_2^*(d_1), \ldots, S_K^*(\bar d_{K-1})\} \mid S_1 = s_1]$.

The equality in (A.5) holds for any $\bar d_{K-1} = (d_1, \ldots, d_{K-1})$ with $d \in \mathcal{D}$; hence it must hold for

$(d_1, \ldots, d_{K-2}, d_{K-1}^{(1)})$. Thus, we also have that

(A.6)   $E\{Y^*(\bar d_{K-2}, d_{K-1}^{(1)}, d_K^{(1)}) \mid S_1 = s_1\} = E[\alpha_K\{S_1, S_2^*(d_1), \ldots, S_{K-1}^*(\bar d_{K-2}), S_K^*(\bar d_{K-2}, d_{K-1}^{(1)})\} \mid S_1 = s_1]$.

Similarly, for any $k = 1, \ldots, K-1$, we can show that

$E[\alpha_{k+1}\{S_1, S_2^*(d_1), \ldots, S_{k+1}^*(\bar d_k)\} \mid S_1 = s_1, S_2^*(d_1), \ldots, S_k^*(\bar d_{k-1})]$
$\le E[\alpha_{k+1}\{S_1, S_2^*(d_1), \ldots, S_{k+1}^*(\bar d_{k-1}, d_k^{(1)})\} \mid S_1 = s_1, S_2^*(d_1), \ldots, S_k^*(\bar d_{k-1})] = \alpha_k\{s_1, S_2^*(d_1), \ldots, S_k^*(\bar d_{k-1})\}$,

which implies, for $k = 1, \ldots, K-1$,

$E[\alpha_{k+1}\{s_1, S_2^*(d_1), \ldots, S_{k+1}^*(\bar d_k)\} \mid S_1 = s_1] \le E[\alpha_{k+1}\{s_1, S_2^*(d_1), \ldots, S_{k+1}^*(\bar d_{k-1}, d_k^{(1)})\} \mid S_1 = s_1]$
(A.7)   $= E[\alpha_k\{s_1, S_2^*(d_1), \ldots, S_k^*(\bar d_{k-1})\} \mid S_1 = s_1]$.

Using (A.5) and (A.7) with $k = K-1$, we thus have

(A.8)   $E\{Y^*(d) \mid S_1 = s_1\} \le E\{Y^*(\bar d_{K-1}, d_K^{(1)}) \mid S_1 = s_1\} = E[\alpha_K\{s_1, S_2^*(d_1), \ldots, S_K^*(\bar d_{K-1})\} \mid S_1 = s_1]$
$\le E[\alpha_K\{s_1, S_2^*(d_1), \ldots, S_K^*(\bar d_{K-2}, d_{K-1}^{(1)})\} \mid S_1 = s_1] = E[\alpha_{K-1}\{s_1, S_2^*(d_1), \ldots, S_{K-1}^*(\bar d_{K-2})\} \mid S_1 = s_1]$.

Because of (A.6), the third term in (A.8) is equal to $E\{Y^*(\bar d_{K-2}, d_{K-1}^{(1)}, d_K^{(1)}) \mid S_1 = s_1\}$. Hence,

$E\{Y^*(d) \mid S_1 = s_1\} \le E\{Y^*(\bar d_{K-1}, d_K^{(1)}) \mid S_1 = s_1\} \le E\{Y^*(\bar d_{K-2}, d_{K-1}^{(1)}, d_K^{(1)}) \mid S_1 = s_1\}$
(A.9)   $= E[\alpha_{K-1}\{s_1, S_2^*(d_1), \ldots, S_{K-1}^*(\bar d_{K-2})\} \mid S_1 = s_1]$.

Again, because $\bar d_{K-2}$ is arbitrary, if we replace it by $(\bar d_{K-3}, d_{K-2}^{(1)})$, the equality in (A.9) implies

(A.10)   $E\{Y^*(\bar d_{K-3}, \underline{d}_{K-2}^{(1)}) \mid S_1 = s_1\} = E[\alpha_{K-1}\{s_1, S_2^*(d_1), \ldots, S_{K-1}^*(\bar d_{K-3}, d_{K-2}^{(1)})\} \mid S_1 = s_1]$,

where, for any $d$, $\underline{d}_k = (d_k, \ldots, d_K)$. Using (A.7) with $k = K-2$, (A.9), and (A.10), we obtain

$E\{Y^*(\bar d_{K-2}, \underline{d}_{K-1}^{(1)}) \mid S_1 = s_1\} = E[\alpha_{K-1}\{s_1, S_2^*(d_1), \ldots, S_{K-1}^*(\bar d_{K-2})\} \mid S_1 = s_1]$
$\le E[\alpha_{K-1}\{s_1, S_2^*(d_1), \ldots, S_{K-1}^*(\bar d_{K-3}, d_{K-2}^{(1)})\} \mid S_1 = s_1] = E\{Y^*(\bar d_{K-3}, \underline{d}_{K-2}^{(1)}) \mid S_1 = s_1\} = E[\alpha_{K-2}\{s_1, S_2^*(d_1), \ldots, S_{K-2}^*(\bar d_{K-3})\} \mid S_1 = s_1]$.

Continuing in this fashion, we may conclude that, for any $d \in \mathcal{D}$,

$E\{Y^*(d) \mid S_1 = s_1\} \le E\{Y^*(\bar d_{k-1}, \underline{d}_k^{(1)}) \mid S_1 = s_1\} \le E\{Y^*(d^{(1)}) \mid S_1 = s_1\}$,

showing that $d^{(1)}$ defined in (5) and (7) is an optimal regime satisfying (4).
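For concreteness, the telescoping argument specializes as follows when $K = 2$:

\[
E\{Y^*(d) \mid S_1 = s_1\} \le E\{Y^*(d_1, d_2^{(1)}) \mid S_1 = s_1\} = E[\alpha_2\{s_1, S_2^*(d_1)\} \mid S_1 = s_1]
\le E[\alpha_2\{s_1, S_2^*(d_1^{(1)})\} \mid S_1 = s_1] = \alpha_1(s_1) = E\{Y^*(d^{(1)}) \mid S_1 = s_1\},
\]

where the first inequality and equality are (A.5) with $K = 2$, the second inequality and equality are (A.7) with $k = 1$, and the final equality holds because $\alpha_1(s_1)$ is the value $V_1^{(1)}(s_1)$ of following the optimal regime from the first decision point on.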

A.2. DEMONSTRATION OF CORRESPONDENCE IN (19)

We make the consistency, sequential randomization, and positivity (15) assumptions given in Section 3; the latter states that, for any $(\bar s_k, \bar a_{k-1})$ for which $\mathrm{pr}(\bar S_k = \bar s_k, \bar A_{k-1} = \bar a_{k-1}) > 0$,

$\mathrm{pr}(A_k = a_k \mid \bar S_k = \bar s_k, \bar A_{k-1} = \bar a_{k-1}) > 0$ if $(\bar s_k, \bar a_{k-1}) \in \Gamma_k$ and $a_k \in \Psi_k(\bar s_k, \bar a_{k-1})$, $k = 1, \ldots, K$.

This ensures that the observed data contain information on the possible treatment options involved in the class of regimes under consideration. We wish to show (16)-(18), which we restate here for convenience as

(A.11)   $\mathrm{pr}(\bar S_k = \bar s_k, \bar A_k = \bar a_k) > 0$,
(A.12)   $\mathrm{pr}(S_{k+1} = s_{k+1} \mid \bar S_k = \bar s_k, \bar A_k = \bar a_k) = \mathrm{pr}\{S_{k+1}^*(\bar a_k) = s_{k+1} \mid \bar S_k = \bar s_k, \bar A_{k-1} = \bar a_{k-1}\}$
(A.13)   $= \mathrm{pr}\{S_{k+1}^*(\bar a_k) = s_{k+1} \mid \bar S_j = \bar s_j, \bar A_{j-1} = \bar a_{j-1}, S_{j+1}^*(\bar a_j) = s_{j+1}, \ldots, S_k^*(\bar a_{k-1}) = s_k\}$,

for any $(\bar s_k, \bar a_{k-1}) \in \Gamma_k$ and $a_k \in \Psi_k(\bar s_k, \bar a_{k-1})$, $k = 1, \ldots, K$, and for $j = 1, \ldots, k$, where we define (A.13) with $j = k$ to be the same as the expression on the right-hand side of (A.12) and take $S_{K+1} = Y$ and $S_{K+1}^*(\bar a_K) = Y^*(\bar a_K)$.

If we can show (A.11), then the quantities in (9)-(14) are well defined. If (A.12)-(A.13) also hold, then the conditional distributions of the observed data involved in the quantities in (9)-(14) are the same as the conditional distributions of the potential outcomes involved in (5)-(8), and (19) follows.

Assume for the moment that (A.11) is true. We now demonstrate that (A.12) and (A.13) must follow. For any fixed $k$, by the consistency assumption, the left-hand expression in (A.12) is equal to $\mathrm{pr}\{S_{k+1}^*(\bar a_k) = s_{k+1} \mid \bar S_k = \bar s_k, \bar A_{k-1} = \bar a_{k-1}, A_k = a_k\}$. It follows by the sequential randomization assumption, which implies $A_k \perp S_{k+1}^*(\bar a_k) \mid \bar S_k, \bar A_{k-1}$, that this is equal to the right-hand side of (A.12).

The equality in (A.13) follows by induction. First note that the right-hand side of (A.12) is equal to (A.13) with $j = k$. The equality of (A.12) and (A.13) for all $j = 1, \ldots, k$ can be deduced if we can show that (A.13) being true for a given $j$ implies that it is also true for $j - 1$. For a given $j = 2, \ldots, k$, by the consistency assumption, (A.13) is equal to

$\mathrm{pr}\{S_{k+1}^*(\bar a_k) = s_{k+1} \mid \bar S_{j-1} = \bar s_{j-1}, \bar A_{j-2} = \bar a_{j-2}, A_{j-1} = a_{j-1}, S_j^*(\bar a_{j-1}) = s_j, \ldots, S_k^*(\bar a_{k-1}) = s_k\}$.

By the sequential randomization assumption, $A_{j-1} \perp \{S_j^*(\bar a_{j-1}), \ldots, S_{k+1}^*(\bar a_k)\} \mid \bar S_{j-1}, \bar A_{j-2}$, this expression is equal to

$\mathrm{pr}\{S_{k+1}^*(\bar a_k) = s_{k+1} \mid \bar S_{j-1} = \bar s_{j-1}, \bar A_{j-2} = \bar a_{j-2}, S_j^*(\bar a_{j-1}) = s_j, \ldots, S_k^*(\bar a_{k-1}) = s_k\}$,

which is (A.13) for $j - 1$. Note, then, that this implies that the conditional densities in (A.13), which are $j$-dependent, are the same as those on the left-hand side of (A.12), which are not, and are also equal to $\mathrm{pr}\{S_{k+1}^*(\bar a_k) = s_{k+1} \mid \bar S_k^*(\bar a_{k-1}) = \bar s_k\}$; i.e., the left-hand side of (A.12) is completely in terms of the distribution of the observed data, whereas (A.13) with $j = 1$ is completely in terms of the distribution of the potential outcomes.

We now prove (A.11) by induction. Assume we have shown (A.11) for a fixed $k$; i.e., if $(\bar s_k, \bar a_{k-1}) \in \Gamma_k$ and $a_k \in \Psi_k(\bar s_k, \bar a_{k-1})$, then $\mathrm{pr}(\bar S_k = \bar s_k, \bar A_k = \bar a_k) > 0$. Then we must show that

(A.14)   $\mathrm{pr}(\bar S_{k+1} = \bar s_{k+1}, \bar A_{k+1} = \bar a_{k+1}) > 0$ if $(\bar s_{k+1}, \bar a_k) \in \Gamma_{k+1}$ and $a_{k+1} \in \Psi_{k+1}(\bar s_{k+1}, \bar a_k)$.

Note that

(A.15)   $\mathrm{pr}(\bar S_{k+1} = \bar s_{k+1}, \bar A_k = \bar a_k) = \mathrm{pr}(S_{k+1} = s_{k+1} \mid \bar S_k = \bar s_k, \bar A_k = \bar a_k)\, \mathrm{pr}(\bar S_k = \bar s_k, \bar A_k = \bar a_k)$.

By (A.15), (A.14) will be true if $\mathrm{pr}(S_{k+1} = s_{k+1} \mid \bar S_k = \bar s_k, \bar A_k = \bar a_k) > 0$. But by the arguments above,

$\mathrm{pr}(S_{k+1} = s_{k+1} \mid \bar S_k = \bar s_k, \bar A_k = \bar a_k) = \mathrm{pr}\{S_{k+1}^*(\bar a_k) = s_{k+1} \mid \bar S_k^*(\bar a_{k-1}) = \bar s_k\}$,

which is positive because $(\bar s_{k+1}, \bar a_k) \in \Gamma_{k+1}$. Next,

$\mathrm{pr}(\bar S_{k+1} = \bar s_{k+1}, \bar A_{k+1} = \bar a_{k+1}) = \mathrm{pr}(A_{k+1} = a_{k+1} \mid \bar S_{k+1} = \bar s_{k+1}, \bar A_k = \bar a_k)\, \mathrm{pr}(\bar S_{k+1} = \bar s_{k+1}, \bar A_k = \bar a_k)$.

However, because $a_{k+1} \in \Psi_{k+1}(\bar s_{k+1}, \bar a_k)$, by the positivity assumption, $\mathrm{pr}(A_{k+1} = a_{k+1} \mid \bar S_{k+1} = \bar s_{k+1}, \bar A_k = \bar a_k) > 0$. The proof is completed by noting that $\mathrm{pr}(S_1 = s_1, A_1 = a_1) = \mathrm{pr}(A_1 = a_1 \mid S_1 = s_1)\, \mathrm{pr}(S_1 = s_1)$. If $s_1 \in \Gamma_1$, $\mathrm{pr}(S_1 = s_1) > 0$, and $\mathrm{pr}(A_1 = a_1 \mid S_1 = s_1) > 0$ for $a_1 \in \Psi_1(s_1)$ by the positivity assumption.

To demonstrate (19), consider first the definitions of $d_K^{(1)}(\bar s_K, \bar a_{K-1})$ and $V_K^{(1)}(\bar s_K, \bar a_{K-1})$ given in (5) and (6). These quantities involve the conditional expectation of the potential outcome $Y^*(\bar a_K)$ given $\bar S_K^*(\bar a_{K-1})$, which by (A.12)-(A.13) is the same as the conditional expectation of $Y$ given $\{\bar S_K = \bar s_K, \bar A_K = \bar a_K\}$. Thus, $d_K^{(1)}(\bar s_K, \bar a_{K-1})$ and $V_K^{(1)}(\bar s_K, \bar a_{K-1})$ are the same as $d_K^{opt}(\bar s_K, \bar a_{K-1})$ and $V_K(\bar s_K, \bar a_{K-1})$ defined in (10) and (11). Next, from (7) and (8),

$d_{K-1}^{(1)}(\bar s_{K-1}, \bar a_{K-2}) = \arg\max_{a_{K-1} \in \Psi_{K-1}(\bar s_{K-1}, \bar a_{K-2})} E[V_K^{(1)}\{\bar s_{K-1}, S_K^*(\bar a_{K-2}, a_{K-1}), \bar a_{K-2}, a_{K-1}\} \mid \bar S_{K-1}^*(\bar a_{K-2}) = \bar s_{K-1}]$.

This involves the conditional expectation of $V_K^{(1)}$, a function of $\bar S_K^*(\bar a_{K-1})$, given $\bar S_{K-1}^*(\bar a_{K-2}) = \bar s_{K-1}$. Again, by (A.12)-(A.13), this is the same as the conditional expectation of the function $V_K^{(1)}$ of $\bar S_K$ given $\{\bar S_{K-1} = \bar s_{K-1}, \bar A_{K-1} = \bar a_{K-1}\}$. Because we have already shown that $V_K^{(1)}$ is the same as $V_K$, this implies that $d_{K-1}^{(1)}(\bar s_{K-1}, \bar a_{K-2})$ is given by

$\arg\max_{a_{K-1} \in \Psi_{K-1}(\bar s_{K-1}, \bar a_{K-2})} E\{V_K(\bar s_{K-1}, S_K, \bar a_{K-2}, a_{K-1}) \mid \bar S_{K-1} = \bar s_{K-1}, \bar A_{K-1} = (\bar a_{K-2}, a_{K-1})\}$,

which is the same as $d_{K-1}^{opt}(\bar s_{K-1}, \bar a_{K-2})$ given by (13) with $k = K-1$. The argument continues in a backward iterative fashion for $k = K-2, \ldots, 1$.

A.3. JUSTIFICATION FOR $\widetilde V_{ki}$ IN A-LEARNING

We wish to show that

(A.16)   $E\big(\, V_{k+1}(\bar S_{k+1}, \bar A_k) + C_k(\bar S_k, \bar A_{k-1})[\, I\{C_k(\bar S_k, \bar A_{k-1}) > 0\} - A_k\,] \,\big|\, \bar S_k, \bar A_{k-1} \big) = V_k(\bar S_k, \bar A_{k-1})$.

Defining $\Gamma(\bar S_{k+1}, \bar A_k) = V_{k+1}(\bar S_{k+1}, \bar A_k) + C_k(\bar S_k, \bar A_{k-1})[\, I\{C_k(\bar S_k, \bar A_{k-1}) > 0\} - A_k\,]$, we may write the left-hand side of (A.16) as

(A.17)   $E[\, E\{\Gamma(\bar S_{k+1}, \bar A_k) \mid \bar S_k, \bar A_k\} \mid \bar S_k, \bar A_{k-1}\,]$.

The inner expectation in (A.17) may be seen to be equal to

$E\{V_{k+1}(\bar S_{k+1}, \bar A_k) \mid \bar S_k, \bar A_k\} + C_k(\bar S_k, \bar A_{k-1})[\, I\{C_k(\bar S_k, \bar A_{k-1}) > 0\} - A_k\,] = Q_k(\bar S_k, \bar A_k) + C_k(\bar S_k, \bar A_{k-1})[\, I\{C_k(\bar S_k, \bar A_{k-1}) > 0\} - A_k\,]$.

Substituting $Q_k(\bar S_k, \bar A_k) = h_k(\bar S_k, \bar A_{k-1}) + A_k\, C_k(\bar S_k, \bar A_{k-1})$, where $h_k(\bar S_k, \bar A_{k-1}) = Q_k(\bar S_k, \bar A_{k-1}, 0)$, we obtain

$E\{\Gamma(\bar S_{k+1}, \bar A_k) \mid \bar S_k, \bar A_k\} = h_k(\bar S_k, \bar A_{k-1}) + C_k(\bar S_k, \bar A_{k-1})\, I\{C_k(\bar S_k, \bar A_{k-1}) > 0\} = V_k(\bar S_k, \bar A_{k-1})$.

Substituting this in (A.17) yields the result.

A.4. DEMONSTRATION OF EQUIVALENCE OF Q- AND A-LEARNING IN A SPECIAL CASE

We take $K = 1$ and let $\mathrm{pr}(A_1 = 1 \mid S_1 = s_1) = \pi$. Consider the A-learning estimating equations (31) with $k = 1$, and take $\lambda_1(s_1; \psi_1) = \partial C_1(s_1; \psi_1)/\partial \psi_1$. Then the equations become

$\sum_{i=1}^n \frac{\partial C_1(S_{1i}; \psi_1)}{\partial \psi_1}\, (A_{1i} - \pi)\, \{Y_i - A_{1i} C_1(S_{1i}; \psi_1) - h_1(S_{1i}; \beta_1)\} = 0$,
$\sum_{i=1}^n \frac{\partial h_1(S_{1i}; \beta_1)}{\partial \beta_1}\, \{Y_i - A_{1i} C_1(S_{1i}; \psi_1) - h_1(S_{1i}; \beta_1)\} = 0$.

Likewise, under these conditions, taking $Q_1(s_1, a_1; \xi_1) = a_1 C_1(s_1; \psi_1) + h_1(s_1; \beta_1)$, the Q-learning estimating equation is

$\sum_{i=1}^n \frac{\partial Q_1(S_{1i}, A_{1i}; \xi_1)}{\partial \xi_1}\, \{Y_i - A_{1i} C_1(S_{1i}; \psi_1) - h_1(S_{1i}; \beta_1)\} = 0$,

where, with $\xi_1 = (\psi_1^T, \beta_1^T)^T$,

$\frac{\partial Q_1(S_{1i}, A_{1i}; \xi_1)}{\partial \xi_1} = \begin{pmatrix} A_{1i}\, \partial C_1(S_{1i}; \psi_1)/\partial \psi_1 \\ \partial h_1(S_{1i}; \beta_1)/\partial \beta_1 \end{pmatrix}$.

Thus, note that, with $C_1(s_1; \psi_1)$ and $h_1(s_1; \beta_1)$ linear in functions of $S_1$, as long as the terms in $C_1(s_1; \psi_1)$ are contained among those in $h_1(s_1; \beta_1)$, the Q- and A-learning estimating equations are identical, as then

$\sum_{i=1}^n \frac{\partial C_1(S_{1i}; \psi_1)}{\partial \psi_1}\, \{Y_i - A_{1i} C_1(S_{1i}; \psi_1) - h_1(S_{1i}; \beta_1)\} = 0$.

For example, if $C_1(s_1; \psi_1) = \psi_{10} + s_1^T \psi_{11}$ and $h_1(s_1; \beta_1) = \beta_{10} + s_1^T \beta_{11}$, then note that

$\frac{\partial C_1(S_{1i}; \psi_1)}{\partial \psi_1} = \frac{\partial h_1(S_{1i}; \beta_1)}{\partial \beta_1} = \begin{pmatrix} 1 \\ S_{1i} \end{pmatrix}$,

and the result is immediate. See Chakraborty et al. (2010) for discussion of the case $K = 2$.

A.5. EXAMPLE OF INCOMPATIBILITY OF Q-FUNCTION MODELS

To show (33), noting $H_2 = (1, s_1, a_1, s_2)^T = (\tilde a_1^T, s_2)^T$, we have

$E\{V_2(s_1, S_2, a_1; \xi_2) \mid S_1 = s_1, A_1 = a_1\} = \tilde a_1^T \beta_{21} + \beta_{22}\, E(S_2 \mid S_1 = s_1, A_1 = a_1)$
$\quad + (\tilde a_1^T \psi_{21})\, E\{I(\tilde a_1^T \psi_{21} + S_2 \psi_{22} > 0) \mid S_1 = s_1, A_1 = a_1\} + \psi_{22}\, E\{S_2\, I(\tilde a_1^T \psi_{21} + S_2 \psi_{22} > 0) \mid S_1 = s_1, A_1 = a_1\}$.

Taking $\psi_{22} > 0$, we also have $I(\tilde a_1^T \psi_{21} + S_2 \psi_{22} > 0) = I(S_2 > -\tilde a_1^T \psi_{21}/\psi_{22})$, from which it follows that

$E\{I(\tilde a_1^T \psi_{21} + S_2 \psi_{22} > 0) \mid S_1 = s_1, A_1 = a_1\} = 1 - \Phi\{(-\tilde a_1^T \psi_{21}/\psi_{22} - \tilde a_1^T \gamma)/\sigma\} = 1 - \Phi(\eta)$

for $\eta = -\tilde a_1^T(\psi_{21}/\psi_{22} + \gamma)/\sigma$. Similarly,

$E\{S_2\, I(\tilde a_1^T \psi_{21} + S_2 \psi_{22} > 0) \mid S_1 = s_1, A_1 = a_1\} = E\{S_2\, I(S_2 > -\tilde a_1^T \psi_{21}/\psi_{22}) \mid S_1 = s_1, A_1 = a_1\}$.

It is straightforward to deduce that this is equal to

$\int_\eta^\infty (\sigma t + \tilde a_1^T \gamma)\, \phi(t)\, dt = \sigma \phi(\eta) + (\tilde a_1^T \gamma)\{1 - \Phi(\eta)\}$.

Using $E(S_2 \mid S_1 = s_1, A_1 = a_1) = \tilde a_1^T \gamma$ and combining yields (33).
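The truncated normal moment in the last display is elementary but easy to get wrong; the following short Python check confirms it numerically, with arbitrary illustrative values standing in for $\tilde a_1^T \gamma$, $\sigma$, and the threshold $-\tilde a_1^T \psi_{21}/\psi_{22}$.

import numpy as np
from scipy.stats import norm

mu, sigma, c = 0.7, 1.3, -0.4           # illustrative: mu = a1'gamma, threshold c = -a1'psi21/psi22
eta = (c - mu) / sigma                  # standardized threshold, as in the definition of eta above
closed_form = sigma * norm.pdf(eta) + mu * (1 - norm.cdf(eta))

rng = np.random.default_rng(1)
s2 = rng.normal(mu, sigma, size=10**7)  # S2 | S1, A1 ~ Normal(mu, sigma^2)
monte_carlo = np.mean(s2 * (s2 > c))    # E{S2 I(S2 > c)}
print(closed_form, monte_carlo)         # the two values agree to about three decimals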

A.6. CALCULATION OF $E\{H(\hat d^{\,opt})\}$ AND $R(\hat d^{\,opt})$

Calculation for $K = 1$. We consider the generative data model in Section 6.1 and treatment regimes of the form $d(s_1) = d_1(s_1) = I(\psi_{10} + \psi_{11} s_1 > 0)$ for arbitrary $\psi_{10}, \psi_{11}$. It is possible to derive analytically $H(d) = E\{Y^*(d)\}$ in this case. Under the generative data model,

$E\{Y^*(d)\} = E[E\{Y^*(d) \mid S_1\}] = E[E\{Y \mid S_1, A_1 = d_1(S_1)\}] = \beta_{10}^* + \beta_{11}^* E(S_1) + \beta_{12}^* E(S_1^2) + E\{I(\psi_{10} + \psi_{11} S_1 > 0)(\psi_{10}^* + \psi_{11}^* S_1)\}$,

and $S_1 \sim$ Normal$(0, 1)$. It is straightforward to deduce that $E\{I(\psi_{10} + \psi_{11} S_1 > 0)\} = \mathrm{pr}(S_1 > -\psi_{10}/\psi_{11})$ or $\mathrm{pr}(S_1 < -\psi_{10}/\psi_{11})$ as $\psi_{11} > 0$ or $\psi_{11} < 0$, which is readily obtained from the standard normal cdf. Likewise, $E\{S_1 I(\psi_{10} + \psi_{11} S_1 > 0)\} = E(S_1 \mid S_1 > -\psi_{10}/\psi_{11})\, \mathrm{pr}(S_1 > -\psi_{10}/\psi_{11})$ if $\psi_{11} > 0$ and $E\{S_1 I(\psi_{10} + \psi_{11} S_1 > 0)\} = E(S_1 \mid S_1 < -\psi_{10}/\psi_{11})\, \mathrm{pr}(S_1 < -\psi_{10}/\psi_{11})$ if $\psi_{11} < 0$, which are again easily calculated in a manner similar to that in Section A.5. Thus, $E\{Y^*(d^{opt})\}$ is obtained by substituting $\psi_{10}^*, \psi_{11}^*$ for $\psi_{10}, \psi_{11}$ in the resulting expression.

To approximate $E\{H(\hat d^{\,opt})\}$ and hence $R(\hat d^{\,opt})$ for $\hat d^{\,opt} = \hat d_Q^{\,opt}$ or $\hat d_A^{\,opt}$, we may use Monte Carlo simulation. Specifically, for the $b$th of $B$ Monte Carlo data sets, substitute the estimates $\hat\psi_{10,b}, \hat\psi_{11,b}$, say, defining $\hat d^{\,opt}$ for that data set in the expression for $E\{Y^*(d)\}$, and call the resulting quantity $U_b$. Then $E\{H(\hat d^{\,opt})\}$ is approximated by $B^{-1} \sum_{b=1}^B U_b$. Combining yields the approximation to $R(\hat d^{\,opt})$.

Calculation for $K = 2$. The developments are analogous to those above. We consider the generative data model in Section 6.2 and treatment regimes of the form $d = (d_1, d_2)$, where $d_1(s_1) = I(\psi_{10} + \psi_{11} s_1 > 0)$ and $d_2(s_1, s_2, a_1) = I(\psi_{20} + \psi_{21} a_1 + \psi_{22} s_2 > 0)$ for arbitrary $\psi_{10}, \psi_{11}, \psi_{20}, \psi_{21}, \psi_{22}$. Here,

$E\{Y^*(d)\} = E\big( E[\, E\{Y^*(d) \mid S_2^*(d_1), S_1\} \mid S_1 ]\big) = E\Big( E\big(\, E[\, Y \mid S_2, S_1, A_1 = d_1(S_1), A_2 = d_2\{S_2, S_1, d_1(S_1)\} ] \,\big|\, S_1, A_1 = d_1(S_1) \big) \Big)$.

Because $S_1$ is binary taking values in $\{0, 1\}$,

$E\{Y^*(d)\} = E\big(\, E[\, Y \mid S_2, S_1, A_1 = d_1(0), A_2 = d_2\{S_2, 0, d_1(0)\} ] \,\big|\, S_1 = 0, A_1 = d_1(0) \big)\, \mathrm{pr}(S_1 = 0)$
$\quad + E\big(\, E[\, Y \mid S_2, S_1, A_1 = d_1(1), A_2 = d_2\{S_2, 1, d_1(1)\} ] \,\big|\, S_1 = 1, A_1 = d_1(1) \big)\, \mathrm{pr}(S_1 = 1)$.

Under the generative model, writing $a_1 = I(\psi_{10} + \psi_{11} s_1 > 0)$ for brevity, these expectations are of the form

$E\big(\, E[\, Y \mid S_2, S_1, A_1 = d_1(s_1), A_2 = d_2\{S_2, s_1, d_1(s_1)\} ] \,\big|\, S_1 = s_1, A_1 = d_1(s_1) \big)$
$= \beta_{20}^* + \beta_{21}^* s_1 + \beta_{22}^* a_1 + \beta_{23}^* s_1 a_1 + \beta_{24}^* E\{S_2 \mid S_1 = s_1, A_1 = d_1(s_1)\} + \beta_{25}^* E\{S_2^2 \mid S_1 = s_1, A_1 = d_1(s_1)\}$
$\quad + (\psi_{20}^* + \psi_{21}^* a_1)\, E\{I(\psi_{20} + \psi_{21} a_1 + \psi_{22} S_2 > 0) \mid S_1 = s_1, A_1 = d_1(s_1)\} + \psi_{22}^*\, E\{S_2\, I(\psi_{20} + \psi_{21} a_1 + \psi_{22} S_2 > 0) \mid S_1 = s_1, A_1 = d_1(s_1)\}$,

for $s_1 = 0, 1$. In the generative data model, the conditional distribution of $S_2$ given $S_1, A_1$ is normal; accordingly, it is straightforward to calculate $E\{S_2 \mid S_1 = s_1, A_1 = d_1(s_1)\}$, $E\{S_2^2 \mid S_1 = s_1, A_1 = d_1(s_1)\}$, $E\{I(\psi_{20} + \psi_{21} a_1 + \psi_{22} S_2 > 0) \mid S_1 = s_1, A_1 = d_1(s_1)\}$, and $E\{S_2\, I(\psi_{20} + \psi_{21} a_1 + \psi_{22} S_2 > 0) \mid S_1 = s_1, A_1 = d_1(s_1)\}$ in a manner analogous to those for the case $K = 1$. Approximation of $E\{H(\hat d^{\,opt})\}$ and hence $R(\hat d^{\,opt})$ for $\hat d^{\,opt} = \hat d_Q^{\,opt}$ or $\hat d_A^{\,opt}$ may then be carried out as for the case $K = 1$.
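As a concrete illustration of the "Calculation for $K = 1$" above, the following Python sketch evaluates $H(d) = E\{Y^*(d)\}$ for a linear rule both in closed form and by Monte Carlo. The generative model follows the form described there, but all numerical parameter values are hypothetical, chosen only for illustration.

import numpy as np
from scipy.stats import norm

# Hypothetical true parameters of a K = 1 generative model of the stated form:
# S1 ~ N(0,1), Y | S1, A1 ~ N(b0 + b1*S1 + b2*S1^2 + A1*(t0 + t1*S1), 9).
b0, b1, b2 = 1.0, 1.0, -0.5
t0, t1 = 1.0, 0.5
psi0, psi1 = 0.8, 0.6                   # regime d(s1) = I(psi0 + psi1*s1 > 0) to be evaluated

def H_analytic(psi0, psi1):
    c = -psi0 / psi1                    # boundary of the treatment region (assumes psi1 != 0)
    if psi1 > 0:                        # treat when S1 > c
        p, m = 1 - norm.cdf(c), norm.pdf(c)    # pr(S1 > c), E{S1 I(S1 > c)}
    else:                               # treat when S1 < c
        p, m = norm.cdf(c), -norm.pdf(c)       # pr(S1 < c), E{S1 I(S1 < c)}
    # E(S1) = 0 and E(S1^2) = 1 for standard normal S1
    return b0 + b2 + t0 * p + t1 * m

def H_monte_carlo(psi0, psi1, B=10**6, seed=0):
    s1 = np.random.default_rng(seed).standard_normal(B)
    a1 = (psi0 + psi1 * s1 > 0)
    # average the conditional mean of Y; the Normal(0, 9) error has mean zero
    return np.mean(b0 + b1 * s1 + b2 * s1**2 + a1 * (t0 + t1 * s1))

print(H_analytic(psi0, psi1), H_monte_carlo(psi0, psi1))   # agree up to Monte Carlo error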

Calculation by simulation. When an analytical expression for $H(d) = E\{Y^*(d)\}$ for regimes of a certain form $d$ is not available, $H(d)$ for a fixed $d$ may be approximated by simulation using the g-computation algorithm of Robins (1986). We demonstrate for $K = 2$, so that $d = (d_1, d_2)$; the procedure for $K = 1$ is then immediate. For total number of simulations $B$, for each $b = 1, \ldots, B$, the steps are: (i) generate $s_{1b}$ from the true distribution of $S_1$; (ii) generate $s_{2b}$ from the true conditional distribution of $S_2$ given $S_1 = s_{1b}$ and $A_1 = d_1(s_{1b})$; (iii) evaluate the true $E(Y \mid \bar S_2 = \bar s_2, \bar A_2 = \bar a_2)$ at $\bar s_2 = \bar s_{2b} = (s_{1b}, s_{2b})$ and $\bar a_2 = [d_1(s_{1b}), d_2\{\bar s_{2b}, d_1(s_{1b})\}]$, and call the resulting value $U_b$; and (iv) estimate $H(d) = E\{Y^*(d)\}$ by $B^{-1} \sum_{b=1}^B U_b$ (a code sketch of these steps is given below, following Section A.7). When $d = \hat d_Q^{\,opt}$ or $\hat d_A^{\,opt}$, one would follow the above procedure for each Monte Carlo data set. In each of steps (i)-(iii), it is important to recognize that, while $\hat d_Q^{\,opt}$ and $\hat d_A^{\,opt}$ are determined by the estimated $\psi$, the distributions from which realizations are generated depend on the true $\beta^*$ and $\psi^*$. The values of $E\{H(\hat d_Q^{\,opt})\}$ and $E\{H(\hat d_A^{\,opt})\}$ may then be approximated by the average of the estimated $H(\hat d_Q^{\,opt})$ and $H(\hat d_A^{\,opt})$ across the Monte Carlo data sets, as before.

A.7. CREATING EQUIVALENTLY MISSPECIFIED PAIRS WHEN BOTH THE PROPENSITY MODEL AND Q-FUNCTION ARE MISSPECIFIED

Consider the $K = 2$ decision point scenario; the developments apply equally to the $K = 1$ setting. To identify pairs $(\beta_{25}^*, \phi_{25}^*)$ that are equivalently misspecified, for each of the combinations of $\beta_{25}^*$ and $\phi_{25}^*$ within a pre-specified grid, say $(\beta_{25}^*, \phi_{25}^*) \in [-1, 1] \times [-1, 1]$ with a step size of 0.05, we generate a large data set of size $n = 10{,}000$ from the generative data model in Section 6.2 with all other parameters fixed at their true values. This yields $41^2 = 1681$ combinations and hence 1681 such data sets. For each data set, the linear regression model for the response and the logistic model for the propensity of treatment assignment are then fitted, and the ratio of standard errors for $\hat\phi_{25}$ and $\hat\beta_{25}$, $SE(\hat\phi_{25})/SE(\hat\beta_{25})$, say, is obtained. We then fit to these values a polynomial model in $\phi_{25}^*$, $f(\phi_{25}^*)$, say, and select the polynomial degree yielding a sufficiently large adjusted $R^2$. Setting $\beta_{25}^* = \phi_{25}^*/f(\phi_{25}^*)$ then yields the result that the corresponding t-statistics will be approximately equal. These were re-checked in the course of running the simulations so that the t-statistics differed by less than some reasonable value, usually at most a 5 percent difference, as it cannot be guaranteed that they will be precisely the same.
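As a rough illustration of this calibration, the following Python sketch implements the same idea in a simplified one-decision analogue, with a single covariate and quadratic misspecification entering both working models. The generative model, parameter values, and helper names here are hypothetical stand-ins for the Section 6.2 setup, intended only to show the mechanics of the SE-ratio regression and the setting $\beta_2^* = \phi_2^*/f(\phi_2^*)$.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

def se_ratio(beta2, phi2, n=10_000):
    # Simplified stand-in for the article's models: quadratic coefficients beta2 and
    # phi2 control misspecification of the outcome and propensity models, respectively.
    s = rng.standard_normal(n)
    quad = np.column_stack([np.ones(n), s, s**2])
    a = rng.binomial(1, 1 / (1 + np.exp(-(2.0 * s + phi2 * s**2))))
    y = 1 + s + beta2 * s**2 + a * (1 + 0.5 * s) + rng.normal(0, 3, n)
    ols = sm.OLS(y, np.column_stack([quad, a, a * s])).fit()   # response model
    logit = sm.Logit(a, quad).fit(disp=0)                      # propensity model
    return logit.bse[2] / ols.bse[2]     # SE(phi2_hat) / SE(beta2_hat)

grid = np.round(np.arange(-1, 1.0001, 0.05), 2)        # 41 values, as in the 41 x 41 grid
ratios = np.array([se_ratio(0.0, p) for p in grid])    # beta2 held at 0 here for brevity
f = np.poly1d(np.polyfit(grid, ratios, deg=3))         # degree would be chosen via adjusted R^2
beta2_equiv = grid / f(grid)                           # equivalently misspecified beta2 values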

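Returning to Section A.6, here is a minimal sketch of steps (i)-(iv) of the g-computation simulation; the distributions, conditional mean, and rules d1, d2 below are hypothetical stand-ins for the true generative model of Section 6.2 and a fixed regime $d$.

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins for the true generative model and a fixed regime d = (d1, d2):
d1 = lambda s1: (0.8 + 0.6 * s1 > 0).astype(float)
d2 = lambda s1, s2, a1: (0.5 - 0.2 * a1 + 0.3 * s2 > 0).astype(float)
true_mean_Y = lambda s1, s2, a1, a2: 1 + s1 + 0.5 * s2 + a1 + a2 * (1 - 0.5 * s2)

B = 10**6
s1 = rng.standard_normal(B)            # (i)   s_1b from the true distribution of S1
a1 = d1(s1)                            #       A1 set by the regime being evaluated
s2 = rng.normal(0.5 * s1 + a1, 1.0)    # (ii)  s_2b from the true conditional of S2 given (S1, A1)
a2 = d2(s1, s2, a1)
U = true_mean_Y(s1, s2, a1, a2)        # (iii) true E(Y | S2bar, A2bar) at the generated history
print(U.mean())                        # (iv)  H(d) approximated by the average of the U_b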
A.8. DERIVATION OF $h_1(S_1; \beta_1^*)$ AND $C_1(S_1; \psi_1^*)$ IN THE TWO DECISION POINT SCENARIO

We seek to identify the true $h_1(s_1)$ and $C_1(s_1)$, where $S_1$ and $A_1$ are Bernoulli. With $h_1(s_1) = \beta_{10}^* + \beta_{11}^* s_1$ and $C_1(s_1) = \psi_{10}^* + \psi_{11}^* s_1$, it follows that the true Q-function at the first decision is $Q_1(s_1, a_1) = h_1(s_1) + a_1 C_1(s_1)$. We thus calculate $Q_1(s_1, a_1)$ under the generative model and equate terms to determine the form of $\beta_{10}^*$, $\beta_{11}^*$, $\psi_{10}^*$, and $\psi_{11}^*$. The true value function at the second decision is

$V_2(S_1, S_2, A_1) = h_2(S_1, S_2, A_1) + C_2(S_1, S_2, A_1)\, I\{C_2(S_1, S_2, A_1) > 0\}$.

Thus,

$Q_1(s_1, a_1) = E\{V_2(S_1, S_2, A_1) \mid S_1 = s_1, A_1 = a_1\}$
$= \beta_{20}^* + \beta_{21}^* s_1 + \beta_{22}^* a_1 + \beta_{23}^* s_1 a_1 + \beta_{24}^* E\{S_2 \mid S_1 = s_1, A_1 = a_1\} + \beta_{25}^* E\{S_2^2 \mid S_1 = s_1, A_1 = a_1\} + E[C_2(S_1, S_2, A_1)\, I\{C_2(S_1, S_2, A_1) > 0\} \mid S_1 = s_1, A_1 = a_1]$.

The conditional expectations in this expression may be calculated in a manner analogous to that in Section A.5 to obtain the form of $Q_1(s_1, a_1)$. It follows that $Q_1(0, 0) = \beta_{10}^*$, $Q_1(1, 0) = \beta_{10}^* + \beta_{11}^*$, $Q_1(0, 1) = \beta_{10}^* + \psi_{10}^*$, and $Q_1(1, 1) = \beta_{10}^* + \beta_{11}^* + \psi_{10}^* + \psi_{11}^*$, which may be solved to yield $\beta_{10}^* = Q_1(0, 0)$, $\beta_{11}^* = Q_1(1, 0) - Q_1(0, 0)$, $\psi_{10}^* = Q_1(0, 1) - Q_1(0, 0)$, and $\psi_{11}^* = Q_1(1, 1) - Q_1(1, 0) - Q_1(0, 1) + Q_1(0, 0)$.

A.9. ADDITIONAL SIMULATION RESULTS

In this section, we present additional summaries of results of the simulations reported in Sections 6.1 and 6.2. The quantities $R(\hat d_Q^{\,opt})$ and $R(\hat d_A^{\,opt})$ plotted in Figures 1-6 are the v-efficiencies of the estimated regimes produced by each method, as discussed at the beginning of Section 6 of the parent article. These are based on the averages $E\{H(\hat d_Q^{\,opt})\}$ and $E\{H(\hat d_A^{\,opt})\}$. Because the distributions of the $H(\hat d_Q^{\,opt})$ and $H(\hat d_A^{\,opt})$ may be left-skewed, it is instructive to also present both the distributions themselves and alternatives to $R(\hat d_Q^{\,opt})$ and $R(\hat d_A^{\,opt})$ based on the median. The left-hand panel of Figure A.1 is the same as the right-most panel of Figure 1 in the main paper, shown for convenience. For comparison, the right-hand panel shows both these distributions and the alternative measures of estimated regime efficiency based on medians rather than averages. Figures A.2 and A.3 show the same information corresponding to Figures 2 and 3 in the main paper. Likewise, Figures A.4-A.6 show the same information corresponding to the lower right-hand panels of Figures 4-6 in the main article. In nearly all cases, the plots using the median dominate those using the mean, reflecting the expected skewness.

Fig. A.1. One decision point, misspecified propensity model. Left panel is the same as the right-most panel of Figure 1 of the main article. Right panel shows the alternative measure of efficiency based on medians and their distributions. (Both panels plot percent of ideal response against $\phi_{12}^*$.)

Fig. A.2. One decision point, misspecified Q-function. Left panel is the same as the right-most panel of Figure 2 of the main article. Right panel shows the alternative measure of efficiency based on medians and their distributions. (Both panels plot percent of ideal response against $\beta_{12}^*$.)

Fig. A.3. One decision point, both models misspecified. Left panel is the same as the right-most panel of Figure 3 of the main article. Right panel shows the alternative measure of efficiency based on medians and their distributions. (Both panels plot percent of ideal response against $\phi_{12}^*$, corresponding to equivalent $\beta_{12}^*$.)

Fig. A.4. Two decision points, misspecified propensity model. Left panel is the same as the lower right-most panel of Figure 4 of the main article. Right panel shows the alternative measure of efficiency based on medians and their distributions. (Both panels plot percent of ideal response against $\phi_{25}^*$.)

Fig. A.5. Two decision points, misspecified Q-function. Left panel is the same as the lower right-most panel of Figure 5 of the main article. Right panel shows the alternative measure of efficiency based on medians and their distributions. (Both panels plot percent of ideal response against $\beta_{25}^*$.)

Fig. A.6. Two decision points, both models misspecified. Left panel is the same as the right-most panel of Figure 6 of the main article. Right panel shows the alternative measure of efficiency based on medians and their distributions. (Both panels plot percent of ideal response against $\phi_{25}^*$, corresponding to equivalent $\beta_{25}^*$.)

For each level of misspecification in the misspecified propensity score scenario for one decision point, Figure A.7 plots the average across all simulated data sets of the standard deviations of the estimated propensity scores for each simulated data set. The figure shows that, as the level of misspecification increases, the standard deviation decreases, consistent with the claim in the main article. The bottom panels of Figure A.7 show similar results for the two decision point misspecified propensity scenario. For the second decision point, shown in the bottom right panel of Figure A.7, where the propensity model is misspecified, the behavior is similar (the standard deviation decreases as the level of misspecification increases). For comparison, the bottom left panel of Figure A.7 shows the behavior of the estimated propensities at the first decision point, where the propensity model is correctly specified; as expected, there is no discernible pattern.

We also carried out simulations under the case of misspecified contrast function in a scenario similar to that in Section 6.1. Here, we used the class of generative models defined in (34), with the exception that

$Y \mid S_1 = s_1, A_1 = a_1 \sim \mathrm{Normal}\{\beta_{10}^* + \beta_{11}^* s_1 + \beta_{12}^* s_1^2 + a_1(\psi_{10}^* + \psi_{11}^* s_1 + \psi_{12}^* s_1^2),\, 9\}$,

so that $\psi^* = (\psi_{10}^*, \psi_{11}^*, \psi_{12}^*)^T$, and thus $d^{opt} = d_1^{opt}$, $d_1^{opt}(s_1) = I(\psi_{10}^* + \psi_{11}^* s_1 + \psi_{12}^* s_1^2 > 0)$. For A-learning, we assumed models $h_1(s_1; \beta_1) = \beta_{10} + \beta_{11} s_1$, $C_1(s_1; \psi_1) = \psi_{10} + \psi_{11} s_1$, and $\pi_1(s_1; \phi_1) = \mathrm{expit}(\phi_{10} + \phi_{11} s_1)$, and for Q-learning used $Q_1(s_1, a_1; \xi_1) = h_1(s_1; \beta_1) + a_1 C_1(s_1; \psi_1)$. To simplify notation and distinguish between different components of the Q-function, we refer to $h_1(s_1; \beta_1)$ as the h-function. These models involve correctly specified contrast functions only when $\psi_{12}^* = 0$. The h-function is correctly specified when $\beta_{12}^* = 0$, and the propensity model, $\pi_1(s_1; \phi_1)$, is correctly specified when $\phi_{12}^* = 0$. We studied the effects of misspecification of the contrast function by systematically varying $\beta_{12}^*$, $\phi_{12}^*$, and $\psi_{12}^*$ while keeping the others fixed, considering parameter settings of the form $\phi^* = (0, 2, \phi_{12}^*)^T$, $\beta^* = (1, 1, \beta_{12}^*)^T$, and $\psi^* = (1, 0.5, \psi_{12}^*)^T$. Three scenarios were considered (a data-generating sketch for the first follows the list):

1. misspecified contrast function alone ($\beta_{12}^* = \phi_{12}^* = 0$, and nonzero $\psi_{12}^*$);
2. misspecified contrast and h-function ($\phi_{12}^* = 0$, and nonzero $\beta_{12}^*$ and $\psi_{12}^*$);
3. misspecified contrast and propensity model ($\beta_{12}^* = 0$, and nonzero $\phi_{12}^*$ and $\psi_{12}^*$).
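To make the first scenario concrete, the following Python sketch generates data with a quadratic true contrast and fits the misspecified linear Q-learning working model by least squares; the value $\psi_{12}^* = 0.5$ and the sample size are illustrative choices, not values from the article.

import numpy as np

rng = np.random.default_rng(2)
n, psi12 = 5000, 0.5                       # illustrative sample size and contrast curvature
# Scenario 1: beta12* = phi12* = 0, so only the contrast is misspecified.
# True parameters per the stated settings: phi* = (0, 2, 0), beta* = (1, 1, 0), psi* = (1, 0.5, psi12).
s1 = rng.standard_normal(n)
p = 1 / (1 + np.exp(-(0 + 2 * s1)))        # expit(phi10* + phi11* s1); correctly specified propensity
a1 = rng.binomial(1, p)
y = 1 + s1 + a1 * (1 + 0.5 * s1 + psi12 * s1**2) + rng.normal(0, 3, n)

# Q-learning with the misspecified linear working model
# Q1(s1, a1; xi) = beta10 + beta11*s1 + a1*(psi10 + psi11*s1), fit by ordinary least squares.
X = np.column_stack([np.ones(n), s1, a1, a1 * s1])
beta10, beta11, psi10, psi11 = np.linalg.lstsq(X, y, rcond=None)[0]
d_hat = lambda s: (psi10 + psi11 * s > 0).astype(int)
# With psi12 = 0.5, the true optimal rule treats everyone (1 + 0.5 s + 0.5 s^2 > 0 for all s),
# whereas the fitted linear contrast forces d_hat(s) = 0 on a half line, capping v-efficiency below 1.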

Fig. A.7. Misspecified propensity model. Each symbol represents the average across all simulated data sets of the standard deviation of the estimated propensity scores for each data set. Top: one decision point (versus $\phi_{12}^*$). Lower left: two decision points, first stage. Lower right: two decision points, second stage (versus $\phi_{25}^*$).

Under scenarios 2 and 3, we identified $\beta_{12}^* = \beta_{12}^*(\psi_{12}^*)$ and $\psi_{12}^* = \psi_{12}^*(\phi_{12}^*)$, respectively, using the approach outlined in Section A.7, so that an analyst might be equally likely to detect either form of misspecification. Further, we observe that in the second and third scenarios, $\mathrm{sgn}(\beta_{12}^*) = \mathrm{sgn}(\psi_{12}^*)$ and $\mathrm{sgn}(\phi_{12}^*) = \mathrm{sgn}(\psi_{12}^*)$, respectively, where $\mathrm{sgn}(x) = I(x > 0)$.

Figure A.8 presents the v-efficiencies of estimators for $d^{opt}$ under Q- and A-learning for each scenario. Here, the true $d^{opt}$ is an indicator of whether the quadratic function $\psi_{10}^* + \psi_{11}^* s_1 + \psi_{12}^* s_1^2$ is positive. For the given parameterization, when $\psi_{12}^* > 0$ the true $d_1^{opt}(s_1) = 1$ for all $s_1$. However, estimators for $d^{opt}$ under Q- and A-learning assume a linear contrast function, so that $\hat d_{Q,1}^{\,opt}(s_1) = 0$ or $\hat d_{A,1}^{\,opt}(s_1) = 0$ for some $s_1$. This is illustrated in all three panels by $R(\hat d_Q^{\,opt}) < 1$ and $R(\hat d_A^{\,opt}) < 1$ for $\psi_{12}^* > 0$. The top panel of Figure A.8 indicates that, when only the contrast function is misspecified (scenario 1) and $\psi_{12}^* \le 0$, a small gain in v-efficiency is achieved by A-learning over Q-learning. Alternatively, for $\psi_{12}^* > 0$, Q-learning yields slightly better performance as $\psi_{12}^*$ increases.

For the second scenario, we considered misspecification of the contrast and h-function. The results are shown in the bottom left panel of Figure A.8. Similar to the prior scenario, we observe small gains in v-efficiency for A-learning over Q-learning when $\psi_{12}^* \le 0$ (and $\beta_{12}^* = \beta_{12}^*(\psi_{12}^*)$) and superior performance under Q-learning for some values of $\psi_{12}^* > 0$. Finally, the bottom right panel shows the results for the scenario of both misspecified contrast and misspecified propensity (scenario 3). Both Q- and A-learning yield estimators for $d^{opt}$ that exhibit poor v-efficiency for much of the range where $\phi_{12}^* < 0$ (and $\psi_{12}^* < 0$), while A-learning shows better v-efficiency relative to Q-learning when $\phi_{12}^* > 0$ (and $\psi_{12}^* > 0$). These results demonstrate that neither method need dominate the other in terms of performance as reflected by v-efficiency when the contrast function is misspecified.

A.10. DESIGN OF STAR*D

Figure A.9 presents a schematic of the STAR*D study design. At levels 2 and 3, patients/physicians expressed preference for switch or augment, and were then randomized to the options shown. Patients entering level 2A were randomized; those entering level 4 were not.

A.11. EXPLORATORY ANALYSIS AND DIAGNOSTICS

The development of the posited Q-functions used in the analysis of the STAR*D data in Section 7 of the main paper relied on input from the investigators. The information included in $S_1$ and the dependence of the models on QIDS slope were guided by discussions with A. John Rush, the principal investigator.

Fig. A.8. V-efficiencies $R(\hat d_Q^{\,opt})$ and $R(\hat d_A^{\,opt})$ for estimating the true $d^{opt}$ under misspecification of the contrast model, one decision point. Top: misspecified contrast function only (versus $\psi_{12}^*$). Lower left: misspecified contrast function and misspecified h-function (versus $\psi_{12}^*$, corresponding to equivalent $\beta_{12}^*$). Lower right: misspecified contrast function and misspecified propensity model (versus $\phi_{12}^*$, corresponding to equivalent $\psi_{12}^*$).

Fig. A.9. Schematic depiction of the STAR*D study. Level 1, initial treatment: CIT. Level 2, switch treatments: BUP, CT, SER, or VEN; augment treatments: CIT + (BUP, BUS, or CT). Level 2A (if received CT in Level 2): BUP or VEN. Level 3, switch treatments: MIRT or NTP; augment treatments: previous treatment + (Li or THY). Level 4: switch to TCP or MIRT + VEN. Follow-up occurs between levels.

We present diagnostic plots for the first and second stage regressions used in the STAR*D analysis. Figure A.10 displays standard regression diagnostics for the second stage regression models used in Q- and A-learning. The figure suggests that there are no major deviations from the assumed linear model and that there is no evidence of outliers or high-influence points. Figure A.11 displays regression diagnostics for the first stage regression models used in Q- and A-learning. The figure suggests that a linear model may be a satisfactory approximation, though it appears that the linear model overestimates the first stage Q-function for non-responders and underestimates the first stage Q-function for responders. Recall that, by design, responders have higher responses at the end of the first stage. Thus, responders tend to have higher fitted values and positive residuals. There do not appear to be any outliers or high-influence points.

ACKNOWLEDGMENTS

This work was supported by NIH grants R37 AI031789, R01 CA051962, R01 CA085848, P01 CA142538, and T32 HL079896.

Fig. A.10. Regression diagnostics for the second stage Q-function used in Q- and A-learning. Upper left: residuals vs. fitted values; the figure does not suggest deviation from an underlying linear model with homoscedastic variance, and there do not appear to be any outliers. Upper right: studentized residuals vs. fitted values; the plot likewise does not suggest deviation from an underlying linear model with homoscedastic variance. Lower left: Q-Q plot of the residuals; note that the sample quantiles form a step function because we have treated the discrete variable QIDS as continuous. Lower right: studentized residuals vs. leverage; there do not appear to be any high-influence points.

Fig. A.11. Regression diagnostics for the first stage Q-function used in Q- and A-learning. Responders and non-responders are depicted with different plotting symbols. Upper left: residuals vs. fitted values; there is a clear distinction between the responders and non-responders. Responders, by design, have higher first stage responses and thus tend to have higher fitted values and positive residuals. Upper right: studentized residuals vs. fitted values; again, the separation between responders and non-responders is clear. Lower left: Q-Q plot of the residuals; note that the sample quantiles form a step function because we have treated the discrete variable QIDS as continuous. Lower right: studentized residuals vs. leverage; there do not appear to be any high-influence points.

REFERENCES

Chakraborty, B., Murphy, S. A. and Strecher, V. (2010). Inference for non-regular parameters in optimal dynamic treatment regimes. Statistical Methods in Medical Research 19 317-343.

Robins, J. M. (1986). A new approach to causal inference in mortality studies with sustained exposure periods: Application to control of the healthy worker survivor effect. Mathematical Modelling 7 1393-1512.


More information

Robustness to Parametric Assumptions in Missing Data Models

Robustness to Parametric Assumptions in Missing Data Models Robustness to Parametric Assumptions in Missing Data Models Bryan Graham NYU Keisuke Hirano University of Arizona April 2011 Motivation Motivation We consider the classic missing data problem. In practice

More information

Part 6: Multivariate Normal and Linear Models

Part 6: Multivariate Normal and Linear Models Part 6: Multivariate Normal and Linear Models 1 Multiple measurements Up until now all of our statistical models have been univariate models models for a single measurement on each member of a sample of

More information

ECE276A: Sensing & Estimation in Robotics Lecture 10: Gaussian Mixture and Particle Filtering

ECE276A: Sensing & Estimation in Robotics Lecture 10: Gaussian Mixture and Particle Filtering ECE276A: Sensing & Estimation in Robotics Lecture 10: Gaussian Mixture and Particle Filtering Lecturer: Nikolay Atanasov: natanasov@ucsd.edu Teaching Assistants: Siwei Guo: s9guo@eng.ucsd.edu Anwesan Pal:

More information

Structure learning in human causal induction

Structure learning in human causal induction Structure learning in human causal induction Joshua B. Tenenbaum & Thomas L. Griffiths Department of Psychology Stanford University, Stanford, CA 94305 jbt,gruffydd @psych.stanford.edu Abstract We use

More information

Introduction to bivariate analysis

Introduction to bivariate analysis Introduction to bivariate analysis When one measurement is made on each observation, univariate analysis is applied. If more than one measurement is made on each observation, multivariate analysis is applied.

More information

MODULE -4 BAYEIAN LEARNING

MODULE -4 BAYEIAN LEARNING MODULE -4 BAYEIAN LEARNING CONTENT Introduction Bayes theorem Bayes theorem and concept learning Maximum likelihood and Least Squared Error Hypothesis Maximum likelihood Hypotheses for predicting probabilities

More information

5.6 The Normal Distributions

5.6 The Normal Distributions STAT 41 Lecture Notes 13 5.6 The Normal Distributions Definition 5.6.1. A (continuous) random variable X has a normal distribution with mean µ R and variance < R if the p.d.f. of X is f(x µ, ) ( π ) 1/

More information

Introduction to bivariate analysis

Introduction to bivariate analysis Introduction to bivariate analysis When one measurement is made on each observation, univariate analysis is applied. If more than one measurement is made on each observation, multivariate analysis is applied.

More information

Stratification and Weighting Via the Propensity Score in Estimation of Causal Treatment Effects: A Comparative Study

Stratification and Weighting Via the Propensity Score in Estimation of Causal Treatment Effects: A Comparative Study Stratification and Weighting Via the Propensity Score in Estimation of Causal Treatment Effects: A Comparative Study Jared K. Lunceford 1 and Marie Davidian 2 1 Merck Research Laboratories, RY34-A316,

More information

arxiv: v1 [stat.me] 15 May 2011

arxiv: v1 [stat.me] 15 May 2011 Working Paper Propensity Score Analysis with Matching Weights Liang Li, Ph.D. arxiv:1105.2917v1 [stat.me] 15 May 2011 Associate Staff of Biostatistics Department of Quantitative Health Sciences, Cleveland

More information

Lecture 2: Linear and Mixed Models

Lecture 2: Linear and Mixed Models Lecture 2: Linear and Mixed Models Bruce Walsh lecture notes Introduction to Mixed Models SISG, Seattle 18 20 July 2018 1 Quick Review of the Major Points The general linear model can be written as y =

More information

Authors and Affiliations: Nianbo Dong University of Missouri 14 Hill Hall, Columbia, MO Phone: (573)

Authors and Affiliations: Nianbo Dong University of Missouri 14 Hill Hall, Columbia, MO Phone: (573) Prognostic Propensity Scores: A Method Accounting for the Correlations of the Covariates with Both the Treatment and the Outcome Variables in Matching and Diagnostics Authors and Affiliations: Nianbo Dong

More information

2) For a normal distribution, the skewness and kurtosis measures are as follows: A) 1.96 and 4 B) 1 and 2 C) 0 and 3 D) 0 and 0

2) For a normal distribution, the skewness and kurtosis measures are as follows: A) 1.96 and 4 B) 1 and 2 C) 0 and 3 D) 0 and 0 Introduction to Econometrics Midterm April 26, 2011 Name Student ID MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question. (5,000 credit for each correct

More information

Group Sequential Designs: Theory, Computation and Optimisation

Group Sequential Designs: Theory, Computation and Optimisation Group Sequential Designs: Theory, Computation and Optimisation Christopher Jennison Department of Mathematical Sciences, University of Bath, UK http://people.bath.ac.uk/mascj 8th International Conference

More information

Harvard University. A Note on the Control Function Approach with an Instrumental Variable and a Binary Outcome. Eric Tchetgen Tchetgen

Harvard University. A Note on the Control Function Approach with an Instrumental Variable and a Binary Outcome. Eric Tchetgen Tchetgen Harvard University Harvard University Biostatistics Working Paper Series Year 2014 Paper 175 A Note on the Control Function Approach with an Instrumental Variable and a Binary Outcome Eric Tchetgen Tchetgen

More information

* Tuesday 17 January :30-16:30 (2 hours) Recored on ESSE3 General introduction to the course.

* Tuesday 17 January :30-16:30 (2 hours) Recored on ESSE3 General introduction to the course. Name of the course Statistical methods and data analysis Audience The course is intended for students of the first or second year of the Graduate School in Materials Engineering. The aim of the course

More information

Putzer s Algorithm. Norman Lebovitz. September 8, 2016

Putzer s Algorithm. Norman Lebovitz. September 8, 2016 Putzer s Algorithm Norman Lebovitz September 8, 2016 1 Putzer s algorithm The differential equation dx = Ax, (1) dt where A is an n n matrix of constants, possesses the fundamental matrix solution exp(at),

More information

Semiparametric Regression and Machine Learning Methods for Estimating Optimal Dynamic Treatment Regimes

Semiparametric Regression and Machine Learning Methods for Estimating Optimal Dynamic Treatment Regimes Semiparametric Regression and Machine Learning Methods for Estimating Optimal Dynamic Treatment Regimes by Yebin Tao A dissertation submitted in partial fulfillment of the requirements for the degree of

More information

Default Priors and Effcient Posterior Computation in Bayesian

Default Priors and Effcient Posterior Computation in Bayesian Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature

More information

Parameter Estimation for Random Differential Equation Models

Parameter Estimation for Random Differential Equation Models Parameter Estimation for Random Differential Equation Models H.T. Banks and Michele L. Joyner Center for Research in Scientific Computation North Carolina State University Raleigh, NC, United States and

More information