Penalized D-optimal design for dose finding


Penalized D-optimal design for dose finding. Luc Pronzato, Laboratoire I3S, CNRS-Univ. Nice Sophia Antipolis, France

2/ 35 Outline

3/ 35 Bernoulli-type experiments: Y_i ∈ {0,1} (success or failure), η(x,θ) = Prob(Y_i = 1 | x_i = x, θ)
θ̂^n is the maximum-likelihood estimator: θ̂^n = arg max_{θ∈Θ} Σ_{i=1}^n { Y_i log[η(x_i,θ)] + (1 − Y_i) log[1 − η(x_i,θ)] }
Fisher information matrix: M(ξ,θ) = ∫_X f_θ(x) f_θᵀ(x) ξ(dx), with f_θ(x) = {η(x,θ)[1 − η(x,θ)]}^{−1/2} ∂η(x,θ)/∂θ

4/ 35 Maximize log det M(ξ,θ) under the constraint Φ(ξ,θ) ≤ C, with Φ(ξ,θ) = ∫_X φ(x,θ) ξ(dx) (φ(x,θ) = cost of one observation at x)
NSC for the optimality of ξ*: Φ(ξ*,θ) ≤ C and there exists λ* = λ*(C,θ) ≥ 0 such that
λ* [C − Φ(ξ*,θ)] = 0 and, ∀x ∈ X, f_θᵀ(x) M⁻¹(ξ*,θ) f_θ(x) ≤ p + λ* [φ(x,θ) − Φ(ξ*,θ)]
In practice: maximize H_θ(ξ,λ) = log det M(ξ,θ) − λ Φ(ξ,θ) for an increasing sequence {λ_i} of Lagrange multipliers, starting from λ_0 = 0 and stopping at the first λ_i for which ξ*(λ_i) satisfies Φ[ξ*(λ_i),θ] ≤ C [Mikulecká 1983]
⟹ not more complicated than the determination of (a sequence of) D-optimal design(s)!

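A minimal numerical sketch of this construction (not from the talk): penalized D-optimal design on a finite grid, maximizing H_θ(ξ,λ) = log det M(ξ,θ) − λ Φ(ξ,θ) with a Wynn-type vertex-direction algorithm, then increasing λ until the cost constraint Φ ≤ C is met. The two-parameter logistic model, the toy cost 1/η, the grid and all numerical values are illustrative assumptions.

import numpy as np

def f_theta(x, theta):
    """Sensitivity f_theta(x) for a two-parameter logistic model (assumed model)."""
    a, b = theta
    eta = 1.0 / (1.0 + np.exp(-(a + b * x)))           # success probability
    g = np.array([1.0, x])                              # gradient of a + b*x w.r.t. theta
    return np.sqrt(eta * (1.0 - eta)) * g               # {eta(1-eta)}^{1/2} * g

def info_matrix(weights, X, theta):
    F = np.array([f_theta(x, theta) for x in X])
    return F.T @ (weights[:, None] * F)

def penalized_D_design(X, theta, lam, phi, n_iter=2000):
    """Maximize log det M(xi) - lam * Phi(xi) over designs supported on the grid X."""
    w = np.full(len(X), 1.0 / len(X))
    cost = np.array([phi(x, theta) for x in X])
    for k in range(n_iter):
        Minv = np.linalg.inv(info_matrix(w, X, theta))
        d = np.array([f_theta(x, theta) @ Minv @ f_theta(x, theta) for x in X])
        grad = d - lam * (cost - w @ cost)               # directional derivative toward delta_x
        j = int(np.argmax(grad))
        alpha = 1.0 / (k + 2)                            # diminishing step (Wynn-type)
        w = (1 - alpha) * w
        w[j] += alpha
    return w

# Increase lambda until the average cost satisfies Phi(xi, theta) <= C
X = np.linspace(-3, 3, 11)                               # 11 candidate doses (illustrative)
theta0 = (0.5, 1.0)
phi = lambda x, th: 1.0 + np.exp(-(th[0] + th[1] * x))   # = 1/eta(x, theta), a toy cost
C = 2.0
for lam in [0.0, 0.5, 1.0, 2.0, 5.0, 10.0]:
    w = penalized_D_design(X, theta0, lam, phi)
    Phi = w @ np.array([phi(x, theta0) for x in X])
    if Phi <= C:
        break
print("lambda =", lam, " Phi =", round(float(Phi), 3), " support:", X[w > 1e-3])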

5/ 35 Dose-response problem in clinical trials: define φ(x,θ) from the success probability (efficacy, no toxicity) [Dragalin & Fedorov 2006; Dragalin, Fedorov & Wu 2008]
λ in H_θ(ξ,λ) = log det M(ξ,θ) − λ Φ(ξ,θ) sets a compromise between the information gain (a problem of collective ethics) and the rejection of inefficient or toxic doses (a problem of individual ethics for the enrolled patients)

6/ 35 ξ*(λ) maximizes H_θ(ξ,λ) = log det M(ξ,θ) − λ Φ(ξ,θ), with Φ(ξ,θ) = ∫_X φ(x,θ) ξ(dx)
3 remarks:
1. If Φ(ξ,θ) = (1/p) log Ψ(ξ,θ): maximize log det[M(ξ,θ)/Ψ^λ(ξ,θ)] ⟺ maximize log det[N M(ξ,θ)] with the total cost constraint N Ψ^λ(ξ,θ) ≤ C, any C > 0 [Dragalin & Fedorov 2006]
2. Φ[ξ*(λ),θ] ≤ min_x φ(x,θ) + dim(θ)/λ

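Remark 2 can be read off the optimality condition on slide 4; a short reconstruction of the argument (with p = dim θ):

% evaluate the NSC of slide 4 at the minimum-cost point x_* = arg min_x phi(x, theta)
\[
0 \;\le\; f_\theta^\top(x_*)\,M^{-1}(\xi^*(\lambda),\theta)\,f_\theta(x_*)
  \;\le\; p + \lambda\,[\,\phi(x_*,\theta) - \Phi(\xi^*(\lambda),\theta)\,]
\]
\[
\Longrightarrow\quad
\Phi(\xi^*(\lambda),\theta) \;\le\; \min_{x\in\mathcal{X}} \phi(x,\theta) + \frac{\dim(\theta)}{\lambda}.
\]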

7/ 35 3. If φ(x,θ) has a unique minimum in X at x* and is flat enough around x*, the support points of ξ*(λ) maximizing log det M(ξ,θ) − λ Φ(ξ,θ) tend to gather around x* when λ → ∞ [LP, JSPI, 2009]
Suppose X finite with φ(x^(1),θ) ≤ ... ≤ φ(x^(m),θ) < φ(x^(m+1),θ) ≤ ... ≤ φ(x^(K),θ). Define X_m = {x^(1),...,x^(m)}, δ = φ(x^(m+1),θ) − φ(x^(m),θ), and suppose that f_θ(x^(1)),...,f_θ(x^(m)) span R^p.
Define φ̄(x,θ) = φ(x,θ) − φ(x^(1),θ) and consider Φ_q(ξ,θ) = ∫_X φ̄^q(x,θ) ξ(dx).
Then ξ*(λ) maximizing log det M(ξ,θ) − λ Φ_q(ξ,θ) is supported on X_m when λ and q are large enough
[q > log(2 t̂_m) / log{1 + δ/[φ(x^(m),θ) − φ(x^(1),θ)]} and λ > λ_{m,q} = p/Φ_q(ξ̂_m,θ), with t̂_m = max_{x ∈ X\X_m} f_θᵀ(x) M⁻¹(ξ̂_m,θ) f_θ(x) and ξ̂_m = arg min_{ξ supported on X_m} max_{x ∈ X\X_m} f_θᵀ(x) M⁻¹(ξ,θ) f_θ(x)]

8/ 35 Example 1 [Dragalin & Fedorov 2006]
Cox model for bivariate responses: Y efficacy, Z toxicity
11 doses available, evenly spread in [−3,3]: X = {x^(1),...,x^(11)}, x^(i) < x^(i+1), i = 1,...,10
Prob{Y = y, Z = z | x, θ} = π_yz(x,θ), y, z ∈ {0,1}
θ = (a_11, b_11, a_10, b_10, a_01, b_01) = (3,3,4,2,0,2) ∈ R^6
π_11(x,θ) = e^{a_11 + b_11 x} / [1 + e^{a_01 + b_01 x} + e^{a_10 + b_10 x} + e^{a_11 + b_11 x}], and similarly π_10(x,θ) and π_01(x,θ) with numerators e^{a_10 + b_10 x} and e^{a_01 + b_01 x}
π_00(x,θ) = [1 + e^{a_01 + b_01 x} + e^{a_10 + b_10 x} + e^{a_11 + b_11 x}]^{−1}

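A small sketch of this example as read off the slide: the joint probabilities π_yz and the penalties φ_1 (slide 9) and φ_3 (slide 13), evaluated on the 11 equally spaced doses. The final print statements are an illustrative check, not a slide result.

import numpy as np

doses = np.linspace(-3.0, 3.0, 11)                       # x^(1), ..., x^(11)
theta = dict(a11=3.0, b11=3.0, a10=4.0, b10=2.0, a01=0.0, b01=2.0)

def joint_probs(x, th):
    """Return (pi_11, pi_10, pi_01, pi_00) of the Cox model at dose x."""
    e11 = np.exp(th["a11"] + th["b11"] * x)
    e10 = np.exp(th["a10"] + th["b10"] * x)
    e01 = np.exp(th["a01"] + th["b01"] * x)
    den = 1.0 + e01 + e10 + e11
    return e11 / den, e10 / den, e01 / den, 1.0 / den

def phi1(x, th):
    """Penalty 1/pi_10: large when 'efficacy and no toxicity' is unlikely."""
    return 1.0 / joint_probs(x, th)[1]

def phi3(x, th):
    """Penalty 1/{pi_10 (1 - pi_.1)}, with pi_.1 = pi_01 + pi_11 the marginal toxicity."""
    p11, p10, p01, _ = joint_probs(x, th)
    return 1.0 / (p10 * (1.0 - (p01 + p11)))

print("dose minimizing phi_1:", doses[np.argmin([phi1(x, theta) for x in doses])])   # -0.6
print("dose minimizing phi_3:", doses[np.argmin([phi3(x, theta) for x in doses])])   # -1.2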

9/ 35 φ_1(x,θ) = π_10^{−1}(x,θ) (π_10 = efficacy and no toxicity)
Optimal dose x*: minimizes φ_1(x,θ); x* = x^(5) = −0.6
[Figure: φ_1(x,θ) versus dose x ∈ [−3,3]]

10/ 35 φ_1(x,θ) = π_10^{−1}(x,θ) (π_10 = efficacy and no toxicity); optimal dose x*: minimizes φ_1(x,θ), x* = x^(5) = −0.6
[Figure: support points ξ*{x^(i)} of the penalized design versus λ ∈ [0,100]]

11/ 35 Other cost function: φ_2(x,θ) = {π_10^{−1}(x,θ) − [max_x π_10(x,θ)]^{−1}}² (flatter than φ_1(x,θ) around x*)
[Figure: φ_2(x,θ) versus dose x ∈ [−3,3]]

12/ 35 φ_2(x,θ) = {π_10^{−1}(x,θ) − [max_x π_10(x,θ)]^{−1}}² (flatter than φ_1(x,θ) around x*)
[Figures: support points ξ*{x^(i)} versus λ, shown for λ ∈ [1,100] and for λ ∈ [1,1000]]

13/ 35 Yet another cost function (more importance given to toxicity): φ_3(x,θ) = π_10^{−1}(x,θ) [1 − π_{.1}(x,θ)]^{−1}, with π_{.1}(x,θ) = π_01(x,θ) + π_11(x,θ) = marginal probability of toxicity
(φ_3(x,θ) is minimum at x^(4) = −1.2)
[Figure: φ_3(x,θ) versus dose x ∈ [−3,3]]

14/ 35 φ_3(x,θ) = π_10^{−1}(x,θ) [1 − π_{.1}(x,θ)]^{−1}, with π_{.1}(x,θ) = π_01(x,θ) + π_11(x,θ) = marginal probability of toxicity (minimum at x^(4) = −1.2)
[Figure: support points ξ*{x^(i)} of the penalized design versus λ ∈ [0,100]]

15/ 35 Motivation: why sequential design?
Regression model η(x,θ), nonlinear in θ ⟹ the optimal design for estimating θ depends on θ!
Information matrix M(ξ,θ) = ∫_X f_θ(x) f_θᵀ(x) ξ(dx), with ξ a probability measure on X and f_θ(x) = (1/σ) ∂η(x,θ)/∂θ (continuously differentiable w.r.t. θ for any x ∈ X)
ξ_D(θ) is D-optimal for θ: ξ_D(θ) maximizes log det[M(ξ,θ)]

16/ 35 Examples [Box & Lucas, 1959]: η(x,θ) = θ_1/(θ_1 − θ_2) [exp(−θ_2 x) − exp(−θ_1 x)]; for θ = (0.7, 0.2), ξ_D = (1/2)δ_{x^(1)} + (1/2)δ_{x^(2)} with x^(1) ≈ 1.25 and x^(2) ≈ 6.60
Michaelis-Menten: η(x,θ) = θ_1 x/(θ_2 + x), x ∈ (0, x̄], θ_1, θ_2 > 0; ξ_D = (1/2)δ_{x^(1)} + (1/2)δ_{x^(2)} with x^(1) = θ_2 x̄/(2θ_2 + x̄) and x^(2) = x̄
Exponential decrease: η(x,θ) = θ_1 exp(−θ_2 x), x ≥ x̲, θ_1, θ_2 > 0; ξ_D = (1/2)δ_{x^(1)} + (1/2)δ_{x^(2)} with x^(1) = x̲ and x^(2) = x̲ + 1/θ_2 ...

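A quick numerical check (not from the talk) of the Michaelis-Menten design quoted above: a grid search over equal-weight two-point designs is compared with the closed form x^(1) = θ_2 x̄/(2θ_2 + x̄), x^(2) = x̄. The values θ = (1, 2) and x̄ = 10, and the grid, are illustrative assumptions.

import numpy as np

theta1, theta2, xbar = 1.0, 2.0, 10.0

def f(x):
    """Sensitivity of eta(x, theta) = theta1*x/(theta2 + x) w.r.t. (theta1, theta2)."""
    return np.array([x / (theta2 + x), -theta1 * x / (theta2 + x) ** 2])

def logdet_two_point(x1, x2):
    M = 0.5 * np.outer(f(x1), f(x1)) + 0.5 * np.outer(f(x2), f(x2))
    sign, val = np.linalg.slogdet(M)
    return val if sign > 0 else -np.inf

grid = np.linspace(0.05, xbar, 400)
best = max(((logdet_two_point(a, b), a, b) for a in grid for b in grid if a < b))
print("numerical optimum :", round(best[1], 3), round(best[2], 3))
print("closed form       :", round(theta2 * xbar / (2 * theta2 + xbar), 3), xbar)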

17/ 35 Full sequential design: choose x_1,...,x_{n_0}, estimate θ̂^{n_0}, set k = n_0; then x_{k+1} → observe Y_{k+1} → re-estimate θ̂^{k+1} → k ← k+1 → ...
D-optimality: x_{k+1} = arg max_{x∈X} det[ Σ_{i=1}^k f_{θ̂^k}(x_i) f_{θ̂^k}ᵀ(x_i) + f_{θ̂^k}(x) f_{θ̂^k}ᵀ(x) ],
or equivalently x_{k+1} = arg max_{x∈X} f_{θ̂^k}ᵀ(x) M⁻¹(ξ_k, θ̂^k) f_{θ̂^k}(x), with ξ_k = (1/k) Σ_{i=1}^k δ_{x_i} the empirical measure defined by x_1,...,x_k and M(ξ_k,θ) = (1/k) Σ_{i=1}^k f_θ(x_i) f_θᵀ(x_i)
We hope that θ̂^n → θ̄ a.s. and √n (θ̂^n − θ̄) →d N(0, M⁻¹[ξ_D(θ̄), θ̄])

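A minimal sketch of this fully sequential scheme for a nonlinear regression model, with the Michaelis-Menten model as an illustrative choice; the noise level, design grid, true θ and the least-squares routine are assumptions, not slide content.

import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
theta_true = np.array([1.0, 2.0])
X_grid = np.linspace(0.5, 10.0, 20)                      # finite design space (assumption)
sigma = 0.05

def eta(x, th):
    return th[0] * x / (th[1] + x)

def f(x, th):
    return np.array([x / (th[1] + x), -th[0] * x / (th[1] + x) ** 2])

def ls_fit(xs, ys, th0):
    res = least_squares(lambda th: eta(np.array(xs), th) - np.array(ys), th0,
                        bounds=([1e-3, 1e-3], [10.0, 10.0]))
    return res.x

# initial design x_1, ..., x_{n0} and estimate theta_hat^{n0}
xs = [1.0, 5.0, 10.0]
ys = [eta(x, theta_true) + sigma * rng.standard_normal() for x in xs]
th = ls_fit(xs, ys, np.array([0.5, 1.0]))

for k in range(len(xs), 50):
    M = sum(np.outer(f(x, th), f(x, th)) for x in xs) / len(xs)
    Minv = np.linalg.inv(M)
    x_next = max(X_grid, key=lambda x: f(x, th) @ Minv @ f(x, th))   # D-optimal one-step-ahead
    xs.append(x_next)
    ys.append(eta(x_next, theta_true) + sigma * rng.standard_normal())
    th = ls_fit(xs, ys, th)

print("theta_hat after 50 observations:", np.round(th, 3))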

18/ 35 LS estimation in nonlinear regression: Y_i = η(x_i, θ̄) + ε_i, x_i ∈ X ⊂ R^d, θ ∈ Θ ⊂ R^p, {ε_i} i.i.d. with variance σ²
LS estimation: θ̂^n = arg min_{θ∈Θ} S_n(θ), with S_n(θ) = Σ_{i=1}^n [Y_i − η(x_i,θ)]²
Conditions for strong consistency of θ̂^n (θ̂^n → θ̄ a.s.) differ very much depending on whether the x_k are constants or depend on ε_i, i < k

19/ 35 Sequential design: for n_0 ≤ k < n, x_{k+1} = arg max_{x∈X} f_{θ̂^k}ᵀ(x) M⁻¹(ξ_k, θ̂^k) f_{θ̂^k}(x)
How to ensure that θ̂^n → θ̄ a.s. and √n (θ̂^n − θ̄) →d N(0, M⁻¹[ξ_D(θ̄), θ̄]) when n → ∞?
❶ deterministic choice of x_k when k ∈ {k_1, k_2, ...}, with k_i ~ i^α, α ∈ (1,2) [Lai 1994]
❷ let n_0 tend to ∞ when n → ∞
❸ suppose that X is a finite set ⟹ replication of observations

20/ 35 X is a finite set: X = {x^(1),...,x^(K)}
Convergence of θ̂^n to θ̄: everything is fine if D_n(θ, θ̄) = Σ_{k=1}^n [η(x_k,θ) − η(x_k, θ̄)]² grows fast enough for all θ ≠ θ̄
Theorem 1 (convergence) [LP, S&P Letters, 2009]: if D_n(θ, θ̄) = Σ_{k=1}^n [η(x_k,θ) − η(x_k, θ̄)]² satisfies
for all δ > 0, inf_{‖θ − θ̄‖ ≥ δ/τ_n} [D_n(θ, θ̄)] / (log log n) → ∞ a.s.,
with X finite and {τ_n} a non-decreasing sequence of positive constants, then θ̂^n satisfies τ_n ‖θ̂^n − θ̄‖ → 0 a.s.

21/ 35 Theorem 2 (asymptotic normality) [LP, S&P Letters, 2009]: if there exists a sequence {C_n} of symmetric positive-definite matrices such that C_n⁻¹ M^{1/2}(ξ_n, θ̄) →p I, with c_n = λ_min(C_n) and D_n(θ, θ̄) satisfying n^{1/4} c_n → ∞ and
for all δ > 0, inf_{‖θ − θ̄‖ ≥ c_n δ²} [D_n(θ, θ̄)] / (log log n) → ∞ a.s.,
then θ̂^n satisfies √n M^{1/2}(ξ_n, θ̂^n)(θ̂^n − θ̄) →d N(0, I)
Apply this to x_{k+1} = arg max_{x∈X} f_{θ̂^k}ᵀ(x) M⁻¹(ξ_k, θ̂^k) f_{θ̂^k}(x) with X finite ⟹ for any sequence {θ̂^k}, the sampling rate of a non-singular design is > 0

22/ 35 Among x^(1), x^(2), ..., x^(K), a subset x^(j), ..., x^(l) forming a non-singular design is sampled at a strictly positive rate: lim inf_n n_i/n > α > 0
⟹ θ̂^n → θ̄ a.s. ⟹ M(ξ_n, θ̂^n) → M[ξ_D(θ̄), θ̄] a.s. and [n M(ξ_n, θ̂^n)]^{1/2} (θ̂^n − θ̄) →d N(0, I)

23/ 35 Sequential penalized design: x_{k+1} = arg max_{x∈X} { f_{θ̂^k}ᵀ(x) M⁻¹(ξ_k, θ̂^k) f_{θ̂^k}(x) − λ_k φ(x, θ̂^k) }
Again, sequential construction ⟹ independence is lost
Use the assumption that X is finite: X = {x^(1),...,x^(K)}
If λ_k = constant λ: θ̂^n → θ̄ a.s., M(ξ_n, θ̄) → M*(θ̄) (optimal for the criterion log det M(ξ, θ̄) − λ Φ(ξ, θ̄)), and √n M^{1/2}(ξ_n, θ̂^n)(θ̂^n − θ̄) →d N(0, I)
Also true if λ_k = bounded measurable function of x_1, Y_1, ..., x_k, Y_k (e.g., λ_k = λ*(θ̂^k) = optimal Lagrange coefficient for the maximization of log det M(ξ, θ̂^k) under the constraint Φ(ξ, θ̂^k) ≤ C)

24/ 35 If λ_k ↗ ∞ with (λ_k log log k)/k → 0, then θ̂^n → θ̄ a.s.; moreover, convergence to the minimum-cost design:
Φ(ξ_n, θ̄) = (1/n) Σ_{i=1}^n φ(x_i, θ̄) → φ*_{θ̄} = min_{x∈X} φ(x, θ̄) a.s.
... and ξ_n(x^(i*)) → 1 a.s. if φ(x, θ̄) has a unique minimum at x^(i*) ∈ X = {x^(1),...,x^(K)}
We can thus optimize Σ_{i=1}^n φ(x_i, θ̄) without knowing θ̄: self-tuning optimization
x_{k+1} = arg max_{x∈X} { f_{θ̂^k}ᵀ(x) M⁻¹(ξ_k, θ̂^k) f_{θ̂^k}(x) − λ_k φ(x, θ̂^k) }
Already suggested for linear regression (η(x,θ) linear in θ) [Åström & Wittenmark, 1989]; condition on λ_k in [LP, AS 2000]
Here, LS in nonlinear regression, or ML, but with X finite
Beware: x_{k+1} = arg min_{x∈X} φ(x, θ̂^k) may not work!

25/ 35 Example 2 (self-tuning regulation)
Observe Y_i = θ_1 x_i/(θ_2 + x_i) + ε_i, {ε_i} i.i.d. N(0, σ²)
Find x* such that Ψ(x*, θ̄) = T, where Ψ(x,θ) = θ_1 [1 − exp(−θ_2 x/3)] ≠ η(x,θ) (here θ̄ = (1,1) and T = 1/2, so Ψ(x*, θ̄) = 1/2 for x* = 3 log(2) ≈ 2.08)
⟹ minimise φ(x,θ) = [Ψ(x,θ) − T]²; note that we do not observe Ψ(x, θ̄)!
x_1 = 1, x_2 = 10, then x_{k+1} = arg max_{x∈X} { f_{θ̂^k}ᵀ(x) M⁻¹(ξ_k, θ̂^k) f_{θ̂^k}(x) − λ_k φ(x, θ̂^k) } for three sequences {λ_k}:
(a) λ_k = log² k, (b) λ_k = k/(1 + log² k), (c) λ_k = k^{1.1}

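A minimal simulation sketch of this example (assumptions: noise standard deviation 0.1, a 100-point design grid, and the sequence λ_k = k/(1 + log² k), i.e. case (b); none of these values are asserted by the slides beyond what is quoted above).

import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(1)
theta_bar = np.array([1.0, 1.0])                         # Psi(3*log 2, theta_bar) = 1/2
T, sigma = 0.5, 0.1                                      # sigma is an assumption
X_grid = np.linspace(0.1, 10.0, 100)

eta = lambda x, th: th[0] * x / (th[1] + x)              # observed regression response
Psi = lambda x, th: th[0] * (1.0 - np.exp(-th[1] * x / 3.0))   # regulated (unobserved) quantity
phi = lambda x, th: (Psi(x, th) - T) ** 2                # penalty [Psi - T]^2
f = lambda x, th: np.array([x / (th[1] + x), -th[0] * x / (th[1] + x) ** 2])

def ls_fit(xs, ys, th0):
    r = least_squares(lambda th: eta(np.array(xs), th) - np.array(ys), th0,
                      bounds=([1e-3, 1e-3], [10.0, 10.0]))
    return r.x

xs = [1.0, 10.0]                                         # x_1 = 1, x_2 = 10 as on the slide
ys = [eta(x, theta_bar) + sigma * rng.standard_normal() for x in xs]
th = ls_fit(xs, ys, np.array([0.5, 0.5]))

for k in range(2, 500):
    lam_k = k / (1.0 + np.log(k) ** 2)                   # sequence (b)
    M = sum(np.outer(f(x, th), f(x, th)) for x in xs) / len(xs)
    Minv = np.linalg.inv(M)
    score = lambda x: f(x, th) @ Minv @ f(x, th) - lam_k * phi(x, th)
    x_next = max(X_grid, key=score)
    xs.append(x_next)
    ys.append(eta(x_next, theta_bar) + sigma * rng.standard_normal())
    th = ls_fit(xs, ys, th)

print("last doses:", np.round(xs[-5:], 2), " target x* = 3 log 2 =", round(3 * np.log(2), 2))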

26/ 35 Ψ(x_k, θ̄), k = 1,...,500, for (a) λ_k = log² k, (b) λ_k = k/(1 + log² k), (c) λ_k = k^{1.1}
[Figure: three panels (a), (b), (c) showing ψ(x_n, θ̄) versus n]

27/ 35 Replace φ(x,θ) = [ψ(x,θ) − T]² by φ(x,θ) = [ψ(x,θ) − T]⁴ and take λ_k = 10³ log² k ⟹ the support points of ξ_k tend to concentrate around x*
[Figure: ψ(x_n, θ̄) and x_n versus n, n = 1,...,2000]

28/ 35 Example 1 [Dragalin & Fedorov 2006] (continued)
Clinical trials: 11 doses, Y for efficacy, Z for toxicity, 36 patients
Up-and-down method [Ivanova 2003]:
x_{N+1} = max{x^(i_N − 1), x^(1)} (step down) if Z_N = 1,
x_{N+1} = x^(i_N) if Y_N = 1 and Z_N = 0,
x_{N+1} = min{x^(i_N + 1), x^(11)} (step up) if Y_N = 0 and Z_N = 0,
for the first 10 patients, then switch to the sequential design at the first observed toxicity (with a maximum increase of one dose at each step) [Dragalin & Fedorov 2006]
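
A sketch of the up-and-down rule as stated above, simulating responses from the Cox model of Example 1; the starting dose and the random seed are illustrative choices.

import numpy as np

doses = np.linspace(-3.0, 3.0, 11)
theta = dict(a11=3.0, b11=3.0, a10=4.0, b10=2.0, a01=0.0, b01=2.0)
rng = np.random.default_rng(2)

def sample_yz(x, th):
    """Draw (efficacy Y, toxicity Z) from the Cox model at dose x."""
    e11 = np.exp(th["a11"] + th["b11"] * x)
    e10 = np.exp(th["a10"] + th["b10"] * x)
    e01 = np.exp(th["a01"] + th["b01"] * x)
    p = np.array([e11, e10, e01, 1.0]) / (1.0 + e01 + e10 + e11)
    return [(1, 1), (1, 0), (0, 1), (0, 0)][rng.choice(4, p=p)]

def up_and_down(i, y, z, n_doses=11):
    """Next dose index from the current index i and the last responses (y, z)."""
    if z == 1:
        return max(i - 1, 0)            # toxicity: step down
    if y == 1:
        return i                        # efficacy, no toxicity: repeat the dose
    return min(i + 1, n_doses - 1)      # no efficacy, no toxicity: step up

i = 0                                    # start at the lowest dose (illustrative choice)
path = [doses[i]]
for _ in range(10):                      # first 10 patients, as on the slide
    y, z = sample_yz(doses[i], theta)
    i = up_and_down(i, y, z)
    path.append(doses[i])
print("dose path:", np.round(path, 1))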

29/ 35 Penalty function φ_1(x,θ) = π_10^{−1}(x,θ)
[Figure: doses x_N allocated to the 36 patients (N = 1,...,36), with distinct markers for the outcomes (Y=0,Z=0), (Y=1,Z=0), (Y=1,Z=1) and (Y=0,Z=1)]

30/ 35 Penalty function φ_3(x,θ) = π_10^{−1}(x,θ)[1 − π_{.1}(x,θ)]^{−1}
[Figure: doses x_N allocated to the 36 patients (N = 1,...,36)]

31/ 35 (36 patients, 1000 repetitions)
              Φ_1(ξ,θ)   J(ξ,θ)   x_{t<4}   x_{t=4}   x_{t=5}   x_{t=6}   x_{t>6}   #x^(11)
S1              1.87      28.02     2%       38.6%     36.9%     8.6%      13.9%      0
ξ_u&d(θ̄)        1.47      29.4
S2              3.16      17.23     0        19.8%     70.5%     7.8%      1.9%       5%
ξ_D(θ̄)          4.45      14.99
S3              2.38      18.78     0        22.3%     68.2%     7%        2.5%       2.3%
ξ*_{λ=2}(θ̄)     1.97      17.00
φ_1(x,θ) = π_10^{−1}(x,θ), J(ξ,θ) = det^{−1/6}[M(ξ,θ)], OSD = x^(5)
S1 = up & down; S2 = up & down + sequential D-optimal design; S3 = up & down + sequential penalized design

32/ 35 (240 patients, 150 repetitions)
              Φ_1(ξ,θ)   J(ξ,θ)   x_{t<4}   x_{t=4}   x_{t=5}   x_{t=6}   x_{t>6}   #x^(11)
S1              1.54      29.04     0        14%       77.3%     7.3%      1.3%       0
ξ_u&d(θ̄)        1.47      29.4
S4              1.52      27.87     0        8.7%      90%       0.7%      0.7%       0.1%
S4 = up & down + sequential penalized design with logarithmic increase λ_N ↗, switching when σ(optimal estimated dose) < Δx (Δx = interval between 2 consecutive doses), with a cautious increase of doses when σ(optimal estimated dose) > Δx/2
S4 better than S1 (= up & down)

33/ 35 Convergence and asymptotic normality of estimators under sequential D-optimal design
x_{k+1} = arg max_{x∈X} f_{θ̂^k}ᵀ(x) M⁻¹(ξ_k, θ̂^k) f_{θ̂^k}(x)
and sequential penalized design
x_{k+1} = arg max_{x∈X} { f_{θ̂^k}ᵀ(x) M⁻¹(ξ_k, θ̂^k) f_{θ̂^k}(x) − λ_k φ(x, θ̂^k) }
when λ_k is bounded ... assuming that X is finite (only a technical assumption?)
λ_n → ∞ not too fast ⟹ self-tuning regulation/optimization

34/ 35 Asymptotic normality when λ_n → ∞ slowly enough? λ_min[M(ξ_n, θ̂^n)] ~ A/λ_n?
Find a sequence of matrices C_n, symmetric positive definite, such that C_n⁻¹ M^{1/2}(ξ_n, θ̄) →p I:
construct C_n from x_{k+1} = arg max_{x∈X} { f_θ̄ᵀ(x) M⁻¹(ξ_k, θ̄) f_θ̄(x) − λ_k φ(x, θ̄) }?
use C_n = M^{1/2}(ξ*(θ̄, λ_n), θ̄), with ξ*(θ, λ) maximizing log det M(ξ,θ) − λ Φ(ξ,θ)?

35/ 35 Thank you for your attention!