Penalized design for dose finding
Luc Pronzato, Laboratoire I3S, CNRS / Univ. Nice Sophia Antipolis, France
Bernoulli-type experiments: Y_i ∈ {0,1} (success or failure),
η(x,θ) = Prob(Y_i = 1 | x_i = x, θ)

θ̂^n the maximum-likelihood estimator:
θ̂^n = arg max_{θ∈Θ} Σ_{i=1}^n { Y_i log[η(x_i,θ)] + (1 − Y_i) log[1 − η(x_i,θ)] }

Fisher information matrix:
M(ξ,θ) = ∫_X f_θ(x) f_θ^T(x) ξ(dx), with f_θ(x) = {η(x,θ)[1 − η(x,θ)]}^{−1/2} ∂η(x,θ)/∂θ
Maximize log det M(ξ,θ) under the constraint Φ(ξ,θ) ≤ C, with Φ(ξ,θ) = ∫_X φ(x,θ) ξ(dx)
(φ(x,θ) = cost of one observation at x)

Necessary and sufficient condition for the optimality of ξ*: Φ(ξ*,θ) ≤ C and there exists λ* = λ*(C,θ) ≥ 0 such that
λ*[C − Φ(ξ*,θ)] = 0,
∀x ∈ X, f_θ^T(x) M^{−1}(ξ*,θ) f_θ(x) ≤ p + λ*[φ(x,θ) − Φ(ξ*,θ)]

In practice: maximize H_θ(ξ,λ) = log det M(ξ,θ) − λ Φ(ξ,θ) for an increasing sequence {λ_i} of Lagrange multipliers, starting from λ_0 = 0 and stopping at the first λ_i for which ξ*(λ_i) satisfies Φ[ξ*(λ_i),θ] ≤ C [Mikulecká 1983]
→ not more complicated than the determination of (a sequence of) optimal design(s)!
Dose-response problem in clinical trials: define φ(x,θ) from the success probability (efficacy, no toxicity) [Dragalin & Fedorov 2006; Dragalin, Fedorov & Wu 2008]
λ in H_θ(ξ,λ) = log det M(ξ,θ) − λ Φ(ξ,θ) sets a compromise between the information gain (a problem of collective ethics) and the rejection of inefficient or toxic doses (a problem of individual ethics for the enrolled patients)
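The maximization of H_θ(ξ,λ) over a finite design space can be sketched with a vertex-direction (Fedorov–Wynn type) algorithm: at each iteration, mass moves toward the point maximizing the penalized directional derivative F(x) = f_θ^T(x) M^{−1}(ξ,θ) f_θ(x) − p − λ[φ(x,θ) − Φ(ξ,θ)]. The two-parameter logistic model, the cost φ(x) = x² and all numerical values below are illustrative assumptions, not taken from the slides.

```python
import math

# Penalized D-optimal design on a finite grid via a vertex-direction
# (Fedorov-Wynn type) algorithm -- a sketch with an illustrative model.
# Model: logistic Bernoulli response, eta(x, th) = 1/(1 + exp(-(th0 + th1*x))).

def f_vec(x, th):
    eta = 1.0 / (1.0 + math.exp(-(th[0] + th[1] * x)))
    w = math.sqrt(eta * (1.0 - eta))          # {eta(1-eta)}^{1/2}
    return (w, w * x)                          # f_theta(x), p = 2

def info_matrix(X, weights, th):
    M = [[0.0, 0.0], [0.0, 0.0]]
    for x, wx in zip(X, weights):
        f = f_vec(x, th)
        for i in range(2):
            for j in range(2):
                M[i][j] += wx * f[i] * f[j]
    return M

def inv2(M):
    d = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / d, -M[0][1] / d], [-M[1][0] / d, M[0][0] / d]]

def maximize_H(X, th, lam, phi, n_iter=2000):
    """Maximize log det M(xi,th) - lam * Phi(xi,th) over designs on X."""
    w = [1.0 / len(X)] * len(X)               # start from the uniform design
    for k in range(n_iter):
        Minv = inv2(info_matrix(X, w, th))
        Phi = sum(wi * phi(x) for wi, x in zip(w, X))
        def F(x):                              # penalized directional derivative
            f = f_vec(x, th)
            d = sum(f[i] * Minv[i][j] * f[j] for i in range(2) for j in range(2))
            return d - 2 - lam * (phi(x) - Phi)   # p = 2 here
        i_star = max(range(len(X)), key=lambda i: F(X[i]))
        alpha = 1.0 / (k + 2)                  # diminishing step
        w = [(1 - alpha) * wi for wi in w]
        w[i_star] += alpha
    return w
```

Increasing λ pulls the design mass toward cheap points: with φ(x) = x², the average cost Σ_i w_i x_i² of the computed design decreases as λ grows.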
ξ*(λ) maximizes H_θ(ξ,λ) = log det M(ξ,θ) − λ Φ(ξ,θ), with Φ(ξ,θ) = ∫_X φ(x,θ) ξ(dx)

3 remarks:
1. If Φ(ξ,θ) = (1/p) log Ψ(ξ,θ), then maximizing log det[M(ξ,θ)/Ψ^λ(ξ,θ)] ⇔ maximizing log det[N M(ξ,θ)] with the total cost constraint N Ψ^λ(ξ,θ) ≤ C, any C > 0 [Dragalin & Fedorov 2006]
2. Φ[ξ*(λ),θ] ≤ min_x φ(x,θ) + dim(θ)/λ
3. If φ(x,θ) has a unique minimum in X at x* and is flat enough around x*, the support points of ξ*(λ) maximizing log det M(ξ,θ) − λ Φ(ξ,θ) tend to gather around x* when λ → ∞ [LP, JSPI, 2009]

Suppose X finite with φ(x^(1),θ) ≤ … ≤ φ(x^(m),θ) < φ(x^(m+1),θ) ≤ … ≤ φ(x^(K),θ).
Define X_m = {x^(1),…,x^(m)}, δ = φ(x^(m+1),θ) − φ(x^(m),θ), and suppose that f_θ(x^(1)),…,f_θ(x^(m)) span R^p.
Define φ̄(x,θ) = φ(x,θ) − φ(x^(1),θ) and consider Φ_q(ξ,θ) = ∫_X φ̄^q(x,θ) ξ(dx).
Then ξ*(λ) maximizing log det M(ξ,θ) − λ Φ_q(ξ,θ) is supported on X_m when λ and q are large enough
[q > log(2 t̂_m) / log{1 + δ/[φ(x^(m),θ) − φ(x^(1),θ)]} and λ > λ_{m,q} = p/Φ_q(ξ̂_m,θ),
with t̂_m = max_{x∈X∖X_m} f_θ^T(x) M^{−1}(ξ̂_m,θ) f_θ(x) and ξ̂_m = arg min_{ξ supported on X_m} max_{x∈X∖X_m} f_θ^T(x) M^{−1}(ξ,θ) f_θ(x)]
Example 1 [Dragalin & Fedorov 2006]
Cox model for bivariate responses: Y efficacy, Z toxicity.
11 doses available, evenly spread in [−3,3]: X = {x^(1),…,x^(11)}, x^(i) < x^(i+1), i = 1,…,10.
Prob{Y = y, Z = z | x, θ} = π_yz(x,θ), y, z ∈ {0,1}
θ̄ = (a_11, b_11, a_10, b_10, a_01, b_01) = (3, 3, 4, 2, 0, 2) ∈ R^6
π_11(x,θ) = e^{a_11 + b_11 x} / D(x,θ)
π_10(x,θ) = e^{a_10 + b_10 x} / D(x,θ)
π_01(x,θ) = e^{a_01 + b_01 x} / D(x,θ)
π_00(x,θ) = 1 / D(x,θ)
where D(x,θ) = 1 + e^{a_01 + b_01 x} + e^{a_10 + b_10 x} + e^{a_11 + b_11 x}
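The probabilities π_yz above can be coded directly; the sketch below uses the slide's nominal value θ̄ = (3,3,4,2,0,2) and checks which of the 11 doses maximizes π_10 (efficacy without toxicity).

```python
import math

# Bivariate Cox-type model from Example 1, coded from the slide's formulas
# with theta-bar = (a11, b11, a10, b10, a01, b01) = (3, 3, 4, 2, 0, 2).

THETA = dict(a11=3.0, b11=3.0, a10=4.0, b10=2.0, a01=0.0, b01=2.0)

def cox_probs(x, th=THETA):
    """Return (pi11, pi10, pi01, pi00) = Prob{Y=y, Z=z | x}."""
    e11 = math.exp(th["a11"] + th["b11"] * x)
    e10 = math.exp(th["a10"] + th["b10"] * x)
    e01 = math.exp(th["a01"] + th["b01"] * x)
    D = 1.0 + e01 + e10 + e11
    return e11 / D, e10 / D, e01 / D, 1.0 / D

# 11 doses evenly spread in [-3, 3]
doses = [-3.0 + 0.6 * i for i in range(11)]
# dose maximizing pi10 (efficacy and no toxicity)
best = max(doses, key=lambda x: cox_probs(x)[1])
```

On this grid the maximizer is the dose x = −0.6, i.e. x^(5), which agrees with the optimal dose quoted on the next slide.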
φ_1(x,θ) = π_10^{−1}(x,θ) (π_10 = probability of efficacy and no toxicity)
Optimal dose x*: minimizes φ_1(x,θ) → x* = x^(5) = −0.6
[Figure: φ_1(x,θ̄) versus dose x ∈ [−3,3], minimum at x* = −0.6]
φ_1(x,θ) = π_10^{−1}(x,θ); optimal dose x* = x^(5) = −0.6
[Figure: support points x^(i) of ξ*(λ) versus λ ∈ [0,100], penalty φ_1]
Other cost function: φ_2(x,θ) = {π_10^{−1}(x,θ) − [max_x π_10(x,θ)]^{−1}}^2 (flatter than φ_1(x,θ) around x*)
[Figure: φ_2(x,θ̄) versus dose x ∈ [−3,3]]
φ_2(x,θ) = {π_10^{−1}(x,θ) − [max_x π_10(x,θ)]^{−1}}^2 (flatter than φ_1(x,θ) around x*)
[Figure: support points x^(i) of ξ*(λ) versus λ ∈ [1,100], penalty φ_2]
φ_2(x,θ) = {π_10^{−1}(x,θ) − [max_x π_10(x,θ)]^{−1}}^2 (flatter than φ_1(x,θ) around x*)
[Figure: support points x^(i) of ξ*(λ) versus λ ∈ [1,1000], penalty φ_2]
Yet another cost function (more importance given to toxicity):
φ_3(x,θ) = π_10^{−1}(x,θ) [1 − π_{·1}(x,θ)]^{−1}, with π_{·1}(x,θ) = π_01(x,θ) + π_11(x,θ) = marginal probability of toxicity
(φ_3(x,θ) is minimum at x^(4) = −1.2)
[Figure: φ_3(x,θ̄) versus dose x ∈ [−3,3]]
φ_3(x,θ) = π_10^{−1}(x,θ) [1 − π_{·1}(x,θ)]^{−1}, with π_{·1}(x,θ) = π_01(x,θ) + π_11(x,θ) (minimum at x^(4) = −1.2)
[Figure: support points x^(i) of ξ*(λ) versus λ ∈ [0,100], penalty φ_3]
Motivation: why sequential?
Regression model η(x,θ), nonlinear in θ → the optimal design for estimating θ depends on θ!
Information matrix M(ξ,θ) = ∫_X f_θ(x) f_θ^T(x) ξ(dx), with ξ a probability measure on X and
f_θ(x) = (1/σ) ∂η(x,θ)/∂θ (continuously differentiable w.r.t. θ for any x ∈ X)
ξ*_D(θ) is D-optimal for θ: ξ*_D(θ) maximizes log det[M(ξ,θ)]
Examples
[Box & Lucas, 1959]: η(x,θ) = θ_1/(θ_1 − θ_2) [exp(−θ_2 x) − exp(−θ_1 x)];
for θ = (0.7, 0.2), ξ*_D = ½ δ_{x^(1)} + ½ δ_{x^(2)} with x^(1) ≈ 1.25 and x^(2) ≈ 6.60
Michaelis-Menten: η(x,θ) = θ_1 x/(θ_2 + x), x ∈ (0, x̄], θ_1, θ_2 > 0;
ξ*_D = ½ δ_{x^(1)} + ½ δ_{x^(2)} with x^(1) = θ_2 x̄/(2θ_2 + x̄) and x^(2) = x̄
Exponential decrease: η(x,θ) = θ_1 exp(−θ_2 x), x ≥ x̲, θ_1, θ_2 > 0;
ξ*_D = ½ δ_{x^(1)} + ½ δ_{x^(2)} with x^(1) = x̲ and x^(2) = x̲ + 1/θ_2 …
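The closed-form two-point design quoted above for the Michaelis-Menten model can be checked by brute force: maximize det M over pairs of candidate points with equal weights ½. The values θ = (1, 2) and x̄ = 10 are illustrative assumptions, not from the slides.

```python
# Numerical check (a sketch) of the Michaelis-Menten D-optimal design:
# with eta(x, theta) = th1*x/(th2 + x) on (0, xbar], the equal-weight
# two-point D-optimal design has x(1) = th2*xbar/(2*th2 + xbar), x(2) = xbar.

th1, th2, xbar = 1.0, 2.0, 10.0        # illustrative parameter values

def grad(x):
    # f_theta(x) = d eta / d theta (taking sigma = 1)
    return (x / (th2 + x), -th1 * x / (th2 + x) ** 2)

def det_two_point(x1, x2):
    # det of M = 0.5*f(x1)f(x1)' + 0.5*f(x2)f(x2)' for a 2x2 rank-one pair:
    # det M = 0.25 * (cross product of f(x1), f(x2))^2
    f1, f2 = grad(x1), grad(x2)
    return 0.25 * (f1[0] * f2[1] - f1[1] * f2[0]) ** 2

# brute-force search over a grid of candidate support pairs
grid = [0.02 * i for i in range(1, 501)]          # 0.02, 0.04, ..., 10.0
best = max(((x1, x2) for x1 in grid for x2 in grid if x1 < x2),
           key=lambda p: det_two_point(*p))
```

The search returns the pair (θ_2 x̄/(2θ_2 + x̄), x̄) up to the grid resolution, confirming the formula.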
"Full sequential" design: choose x_1,…,x_{n_0}, estimate θ̂^{n_0}, set k = n_0; then
x_{k+1} → observe Y_{k+1} → re-estimate θ̂^{k+1} → k ← k+1 → …
D-optimality:
x_{k+1} = arg max_{x∈X} det[ Σ_{i=1}^k f_{θ̂^k}(x_i) f_{θ̂^k}^T(x_i) + f_{θ̂^k}(x) f_{θ̂^k}^T(x) ]
or equivalently
x_{k+1} = arg max_{x∈X} f_{θ̂^k}^T(x) M^{−1}(ξ_k, θ̂^k) f_{θ̂^k}(x),
with ξ_k = (1/k) Σ_{i=1}^k δ_{x_i} the empirical measure defined by x_1,…,x_k and M(ξ_k,θ) = (1/k) Σ_{i=1}^k f_θ(x_i) f_θ^T(x_i)
We hope that θ̂^n → θ̄ a.s. and √n (θ̂^n − θ̄) →_d N(0, M^{−1}[ξ*_D(θ̄), θ̄])
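A minimal sketch of this full-sequential scheme for the Michaelis-Menten model, with least-squares re-estimation by grid search; the model, noise level, grids and sample sizes are illustrative assumptions, not from the slides.

```python
import math, random

# Sequential D-optimal design for eta(x, th) = th[0]*x/(th[1] + x) on a
# finite design space, with least-squares re-estimation at each step.

random.seed(0)
THETA_BAR = (1.0, 2.0)                              # "true" parameter (assumed)
X = [0.5 * i for i in range(1, 21)]                 # doses 0.5, 1.0, ..., 10.0
SIGMA = 0.05                                        # noise standard deviation

def eta(x, th):
    return th[0] * x / (th[1] + x)

def grad(x, th):                                    # f_theta(x), sigma = 1
    return (x / (th[1] + x), -th[0] * x / (th[1] + x) ** 2)

def ls_estimate(xs, ys):
    # brute-force least squares over a parameter grid (keeps the sketch simple)
    cand = [(0.5 + 0.05 * i, 1.0 + 0.05 * j) for i in range(21) for j in range(41)]
    return min(cand, key=lambda th: sum((y - eta(x, th)) ** 2 for x, y in zip(xs, ys)))

def next_dose(xs, th):
    # x_{k+1} = argmax_x f'(x) M^{-1} f(x), computed via the equivalent
    # determinant-increment criterion (avoids inverting M explicitly)
    M = [[0.0, 0.0], [0.0, 0.0]]
    for x in xs:
        f = grad(x, th)
        for a in range(2):
            for b in range(2):
                M[a][b] += f[a] * f[b]
    def det_with(x):
        f = grad(x, th)
        N = [[M[0][0] + f[0] * f[0], M[0][1] + f[0] * f[1]],
             [M[1][0] + f[1] * f[0], M[1][1] + f[1] * f[1]]]
        return N[0][0] * N[1][1] - N[0][1] * N[1][0]
    return max(X, key=det_with)

xs = [1.0, 10.0]                                    # initial design (n0 = 2)
ys = [eta(x, THETA_BAR) + random.gauss(0, SIGMA) for x in xs]
for k in range(40):
    th_hat = ls_estimate(xs, ys)
    x_new = next_dose(xs, th_hat)
    xs.append(x_new)
    ys.append(eta(x_new, THETA_BAR) + random.gauss(0, SIGMA))
th_hat = ls_estimate(xs, ys)
```

With a small noise level the final estimate lands close to θ̄, and the selected doses concentrate on a few support points, as the two-point D-optimal structure of the previous slide suggests.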
LS estimation in nonlinear regression
Y_i = η(x_i, θ̄) + ε_i, x_i ∈ X ⊂ R^d, θ ∈ Θ ⊂ R^p, {ε_i} i.i.d. with variance σ^2
LS estimation: θ̂^n = arg min_{θ∈Θ} S_n(θ), with S_n(θ) = Σ_{i=1}^n [Y_i − η(x_i,θ)]^2
Conditions for strong consistency of θ̂^n (θ̂^n → θ̄ a.s.) differ very much depending on whether the x_k are constants or depend on the ε_i, i < k
Sequential design
For n_0 ≤ k < n, x_{k+1} = arg max_{x∈X} f_{θ̂^k}^T(x) M^{−1}(ξ_k, θ̂^k) f_{θ̂^k}(x)
How to ensure that θ̂^n → θ̄ a.s. and √n (θ̂^n − θ̄) →_d N(0, M^{−1}[ξ*_D(θ̄), θ̄]) when n → ∞?
❶ deterministic choice of x_k when k ∈ {k_1, k_2, …}, with k_i ~ i^α, α ∈ (1,2) [Lai 1994]
❷ let n_0 tend to ∞ when n → ∞
❸ suppose that X is a finite set → replication of observations
X a finite set: X = {x^(1),…,x^(K)}
Convergence of θ̂^n to θ̄: everything is fine if D_n(θ, θ̄) = Σ_{k=1}^n [η(x_k,θ) − η(x_k,θ̄)]^2 grows fast enough for all θ ≠ θ̄
Theorem 1 (convergence) [LP, S&P Letters, 2009]
If D_n(θ, θ̄) = Σ_{k=1}^n [η(x_k,θ) − η(x_k,θ̄)]^2 satisfies
for all δ > 0, inf_{‖θ−θ̄‖ ≥ δ/τ_n} D_n(θ, θ̄) / (log log n) → ∞ a.s.,
with X finite and {τ_n} a non-decreasing sequence of positive constants, then θ̂^n satisfies τ_n ‖θ̂^n − θ̄‖ → 0 a.s.
Theorem 2 (asymptotic normality) [LP, S&P Letters, 2009]
If there exists a sequence of symmetric positive-definite matrices C_n such that C_n^{−1} M^{1/2}(ξ_n, θ̄) →_p I, with c_n = λ_min(C_n) and D_n(θ, θ̄) satisfying n^{1/4} c_n → ∞ and
for all δ > 0, inf_{‖θ−θ̄‖ ≥ c_n δ^2} D_n(θ, θ̄) / (log log n) → ∞ a.s.,
then θ̂^n satisfies √n M^{1/2}(ξ_n, θ̂^n)(θ̂^n − θ̄) →_d N(0, I)
Apply this to x_{k+1} = arg max_{x∈X} f_{θ̂^k}^T(x) M^{−1}(ξ_k, θ̂^k) f_{θ̂^k}(x) with X finite
→ for any sequence {θ̂^k}, the sampling rate of a non-singular design is > 0
x^(1), x^(2), …, x^(j), …, x^(l), …, x^(K), where the doses x^(j),…,x^(l) (under the brace) satisfy lim inf_n n_i/n > α > 0
⇒ θ̂^n → θ̄ a.s., M(ξ_n, θ̂^n) → M[ξ*_D(θ̄), θ̄] a.s., and [n M(ξ_n, θ̂^n)]^{1/2} (θ̂^n − θ̄) →_d N(0, I)
x_{k+1} = arg max_{x∈X} { f_{θ̂^k}^T(x) M^{−1}(ξ_k, θ̂^k) f_{θ̂^k}(x) − λ_k φ(x, θ̂^k) }
Again, sequential construction → independence is lost. Use the assumption that X is finite: X = {x^(1),…,x^(K)}
If λ_k = constant = λ: θ̂^n → θ̄ a.s. and M(ξ_n, θ̄) → M*(θ̄) (optimal for the criterion log det M(ξ, θ̄) − λ Φ(ξ, θ̄)) and √n M^{1/2}(ξ_n, θ̂^n)(θ̂^n − θ̄) →_d N(0, I)
Also true if λ_k = bounded measurable function of x_1, Y_1, …, x_k, Y_k
(e.g., λ_k = λ*(θ̂^k) = the optimal Lagrange coefficient for the maximization of log det M(ξ, θ̂^k) under the constraint Φ(ξ, θ̂^k) ≤ C)
If λ_k ↗ ∞ with (λ_k log log k)/k → 0, then θ̂^n → θ̄ a.s.; moreover, convergence to the minimum-cost design:
Φ(ξ_n, θ̄) = (1/n) Σ_{i=1}^n φ(x_i, θ̄) → φ̄_θ̄ = min_{x∈X} φ(x, θ̄) a.s.
… and ξ_n(x^(i*)) → 1 a.s. if φ(x, θ̄) has a unique minimum at x^(i*) ∈ X = {x^(1),…,x^(K)}
We can thus optimize Σ_{i=1}^n φ(x_i, θ̄) without knowing θ̄: self-tuning optimization
x_{k+1} = arg max_{x∈X} { f_{θ̂^k}^T(x) M^{−1}(ξ_k, θ̂^k) f_{θ̂^k}(x) − λ_k φ(x, θ̂^k) }
Already suggested for linear regression (η(x,θ) linear in θ) [Åström & Wittenmark, 1989]; condition on λ_k in [LP, AS 2000]. Here, LS in nonlinear regression, or ML, but with X finite.
Beware: x_{k+1} = arg min_{x∈X} φ(x, θ̂^k) may not work!
Example 2 (self-tuning regulation)
Observe Y_i = θ_1 x/(θ_2 + x) + ε_i, {ε_i} i.i.d. N(0, σ^2)
Find x* such that Ψ(x*, θ̄) = T, where Ψ(x,θ) = θ_1 [1 − exp(−θ_2 x/3)] ≠ η(x,θ)
(θ̄ = (1, 1): Ψ(x*, θ̄) = T = 1/2 for x* = 3 log(2) ≈ 2.08)
→ minimize φ(x,θ) = [Ψ(x,θ) − T]^2; note that we do not observe Ψ(x, θ̄)!
x_1 = 1, x_2 = 10, then
x_{k+1} = arg max_{x∈X} { f_{θ̂^k}^T(x) M^{−1}(ξ_k, θ̂^k) f_{θ̂^k}(x) − λ_k φ(x, θ̂^k) }
for three sequences {λ_k}:
(a) λ_k = log^2 k
(b) λ_k = k/(1 + log^2 k)
(c) λ_k = k^{1.1}
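The allocation rule of Example 2 can be sketched as follows. To keep the run deterministic, the penalized rule is evaluated at the known θ̄ (the self-tuning scheme would replace θ̄ by the current estimate θ̂^k); the dose grid, the horizon and θ̄ = (1, 1) are assumptions consistent with x* = 3 log 2.

```python
import math

# Deterministic sketch of the penalized sequential rule for Example 2,
# with lambda_k = k/(1 + log^2 k) (sequence (b) on the slide).

TH = (1.0, 1.0)                                      # assumed theta-bar
T = 0.5                                              # target for Psi
X = [0.5 * i for i in range(1, 21)]                  # doses 0.5, ..., 10.0

def grad(x, th):                                     # f_theta for eta = th1*x/(th2+x)
    return (x / (th[1] + x), -th[0] * x / (th[1] + x) ** 2)

def phi(x, th):                                      # cost [Psi(x,th) - T]^2
    psi = th[0] * (1.0 - math.exp(-th[1] * x / 3.0))
    return (psi - T) ** 2

def run(n_steps=2000):
    xs = [1.0, 10.0]                                 # x_1 = 1, x_2 = 10
    M = [[0.0, 0.0], [0.0, 0.0]]                     # un-normalized information
    for x in xs:
        f = grad(x, TH)
        for a in range(2):
            for b in range(2):
                M[a][b] += f[a] * f[b]
    for k in range(len(xs), n_steps):
        lam = k / (1.0 + math.log(k) ** 2)
        det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
        Minv = [[M[1][1] / det, -M[0][1] / det], [-M[1][0] / det, M[0][0] / det]]
        def score(x):
            f = grad(x, TH)
            q = sum(f[a] * Minv[a][b] * f[b] for a in range(2) for b in range(2))
            return k * q - lam * phi(x, TH)          # M(xi_k)^{-1} = k * M^{-1}
        x_new = max(X, key=score)
        xs.append(x_new)
        f = grad(x_new, TH)
        for a in range(2):
            for b in range(2):
                M[a][b] += f[a] * f[b]
    return xs
```

As λ_k grows, the allocations drift from the (information-oriented) D-optimal support toward doses with small cost φ(x, θ̄), i.e. toward x* ≈ 2.08, mirroring the behaviour shown in the figures.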
[Figure: Ψ(x_k, θ̄) for k = 1,…,500, three panels: (a) λ_k = log^2 k, (b) λ_k = k/(1 + log^2 k), (c) λ_k = k^{1.1}]
Replace φ(x,θ) = [Ψ(x,θ) − T]^2 by φ(x,θ) = [Ψ(x,θ) − T]^4 and take λ_k = 10^3 log^2 k → the support points of ξ_k tend to concentrate around x*
[Figure: Ψ(x_k, θ̄) and doses x_k for k = 1,…,2000]
Example 1 [Dragalin & Fedorov 2006] (continued)
Clinical trials: 11 doses, Y for efficacy, Z for toxicity, 36 patients.
Up-and-down method [Ivanova 2003]:
x_{N+1} = max{x^(i_N − 1), x^(1)} (step down) if Z_N = 1,
x_{N+1} = x^(i_N) (stay) if Y_N = 1 and Z_N = 0,
x_{N+1} = min{x^(i_N + 1), x^(11)} (step up) if Y_N = 0 and Z_N = 0,
for the first 10 patients, then sequential design, switching at the first observed toxicity (with a maximum increase of one dose at each step) [Dragalin & Fedorov 2006]
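The up-and-down rule above can be transcribed directly on dose indices 1,…,11; toxicity takes precedence over efficacy, and moves are clipped to the available dose range.

```python
# Up-and-down allocation rule [Ivanova 2003] on dose indices 1..K,
# transcribed from the slide's three cases.

def up_and_down(i, Y, Z, K=11):
    """Next dose index given current index i and observed (Y, Z)."""
    if Z == 1:                      # toxicity: step down (not below dose 1)
        return max(i - 1, 1)
    if Y == 1:                      # efficacy, no toxicity: repeat the dose
        return i
    return min(i + 1, K)            # no efficacy, no toxicity: step up
```

For instance, from dose index 5 the rule steps down to 4 on toxicity, stays at 5 on efficacy without toxicity, and steps up to 6 otherwise.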
Penalty function φ_1(x,θ) = π_10^{−1}(x,θ)
[Figure: dose allocations x_N for N = 1,…,36; markers distinguish the outcomes (Y=0,Z=0), (Y=1,Z=0), (Y=1,Z=1) and (Y=0,Z=1)]
Penalty function φ_3(x,θ) = π_10^{−1}(x,θ) [1 − π_{·1}(x,θ)]^{−1}
[Figure: dose allocations x_N for N = 1,…,36]
(36 patients, 1000 repetitions)

              Φ_1(ξ,θ̄)  J(ξ,θ̄)  x_{t<4}  x_{t=4}  x_{t=5}  x_{t=6}  x_{t>6}  #x^(11)
S1              1.87     28.02     2%      38.6%    36.9%     8.6%    13.9%      0
ξ_u&d(θ̄)       1.47     29.4
S2              3.16     17.23     0       19.8%    70.5%     7.8%     1.9%      5%
ξ*_D(θ̄)        4.45     14.99
S3              2.38     18.78     0       22.3%    68.2%     7%       2.5%      2.3%
ξ*_{λ=2}(θ̄)    1.97     17.00

Φ_1(ξ,θ) = ∫ π_10^{−1}(x,θ) ξ(dx), J(ξ,θ) = det^{−1/6}[M(ξ,θ)], OSD = x^(5)
S1 = up & down
S2 = up & down + sequential D-optimal design
S3 = up & down + sequential penalized design (λ = 2)
(240 patients, 150 repetitions)

              Φ_1(ξ,θ̄)  J(ξ,θ̄)  x_{t<4}  x_{t=4}  x_{t=5}  x_{t=6}  x_{t>6}  #x^(11)
S1              1.54     29.04     0       14%      77.3%     7.3%     1.3%      0
ξ_u&d(θ̄)       1.47     29.4
S4              1.52     27.87     0       8.7%     90%       0.7%     0.7%      0.1%

S4 = up & down + sequential penalized design with logarithmic increase λ_N ↗ ∞,
switching when σ(optimal estimated dose) < Δx, with Δx = the interval between 2 consecutive doses,
and cautious increase of doses when σ(optimal estimated dose) > Δx/2
→ S4 better than S1 (= up & down)
Convergence and asymptotic normality of estimators under sequential D-optimal design
x_{k+1} = arg max_{x∈X} f_{θ̂^k}^T(x) M^{−1}(ξ_k, θ̂^k) f_{θ̂^k}(x)
and sequential penalized design
x_{k+1} = arg max_{x∈X} { f_{θ̂^k}^T(x) M^{−1}(ξ_k, θ̂^k) f_{θ̂^k}(x) − λ_k φ(x, θ̂^k) }
when λ_k is bounded
… assuming that X is finite (only a technical assumption?)
λ_n ↗ ∞ not too fast → self-tuning regulation/optimization
Asymptotic normality when λ_n → ∞ slowly enough? λ_min[M(ξ_n, θ̂^n)] ≈ A/λ_n?
Find a sequence of symmetric positive-definite matrices C_n such that C_n^{−1} M^{1/2}(ξ_n, θ̄) →_p I:
construct C_n from x_{k+1} = arg max_{x∈X} { f_θ̄^T(x) M^{−1}(ξ_k, θ̄) f_θ̄(x) − λ_k φ(x, θ̄) }?
use C_n = M^{1/2}(ξ*(θ̄, λ_n), θ̄), with ξ*(θ,λ) maximizing log det M(ξ,θ) − λ Φ(ξ,θ)?
35/ 35 Thank you for your attention!