Statistical analysis of short-term studies in regulatory toxicology using R

Ludwig A. Hothorn, hothorn@biostat.uni-hannover.de
Institute of Biostatistics, Leibniz University Hannover, Germany
May 17, 2010

Aims for the next two hours I
- Evaluation strategies for toxicological studies
- Using confidence intervals
- Using two-sample confidence intervals for a proof of hazard without FWER control
- Using Dunnett/Williams procedures with FWER control according to NTP recommendations: parametric (incl. variance heterogeneity), non-parametric, proportions
- Using proof of hazard or proof of safety
- Using R (based on real data examples). Sorry: not interactive with R, but frequent R users can run the programs in parallel to the slides.

Motivating examples I
- What does "short-term study" mean? Repeated administration of a drug/compound to rats, mice or dogs (dogs are not really covered today because their sample sizes are too small), e.g. 4-week, 13-week or 6-month studies, e.g. according to OECD guideline 408: repeated dose 90-day oral toxicity (in contrast to long-term carcinogenicity studies, where time-to-death/tumor relationships are of interest).

Motivating examples II
- Example I: Continuous endpoints
- The data of a 13-week feeding study on sodium dichromate dihydrate in F344 rats were downloaded from the NTP.
- For each sex, 10 rats were randomized to control, 62.5, 125, 250, 500 and 1000 mg/kg.
- Several hematological and clinical chemistry endpoints were measured after 5, 23 and 93 days of administration. We use the clinical chemistry endpoints after 93 days as an example.
- Organ weight data and weekly measured body weight data are available as well.
- These data are fairly representative of short-term studies: both sexes, 10 animals/group, many endpoints. However, three instead of five dose groups are more common, usually with only one final measurement, although sometimes a baseline measurement is available.

Motivating examples III
- Example II: Histopathological findings. Incidence data: the incidences of tubular epithelial hyaline droplet degeneration in male rats in a 28-day oral dose toxicity study of nonylphenol were reported as 0/10, 0/10, 3/10, 8/10 [WSI+07].
- Example III: Graded findings. Non-neoplastic lesions in the p-cresidine carcinogenicity study with 30 male mice per group: 1) hyperplasia in the parotid gland (salivary glands) and 2) kidney hydronephrosis (where the single finding "minimal" was categorized as no finding, like the unlisted animals). The second example shows no finding in the control at all.
- Data files are available in .csv format.

Motivating examples IV
- Data characteristics in toxicology:
  - small sample sizes, particularly in in-vivo studies, i.e. 5 to 12 animals/group
  - the randomized unit (the animal) is the relevant sample size unit, not the pup in reprotox studies
  - comparisons of treatment groups versus control, but more often of dose groups, i.e. dose-response analysis
  - multiple endpoints, e.g. chronic toxicity studies with more than 100 endpoints
- Specific: endpoints are approximately normally distributed, continuous but not normally distributed, proportions, or counts
- Specific: variance heterogeneity occurs commonly in continuous data

Required R packages I
- multcomp: parametric simultaneous inference using multiple contrasts [BHW02]
- pairwiseCI: two-sample confidence intervals
- MCPAN: simultaneous confidence intervals for proportions (and poly3 estimates)
- nparcomp: non-parametric simultaneous inference using multiple contrasts
- mratios: ratio-to-control inference [DSH07]
- ETC: Bofinger approach for equivalence of several treatments with respect to control, and the related Bonferroni-TOST
Please download these packages from CRAN to your R installation!

Guidelines I
- Regulatory toxicology follows guidelines. However, the statistical design and analysis are described rather noncommittally, e.g. OECD guideline 408: "when applicable, numerical results should be evaluated by an appropriate and generally acceptable statistical method."
- U.S. National Toxicology Program: Body/organ weights are assumed to be normally distributed, and hence the Dunnett/Williams approach [Dun55], [Wil71] is recommended, controlling the FWER (familywise error rate). All other continuous endpoints should be analyzed non-parametrically by the Dunn/Shirley procedure [DUN64], [SHI77]; for proportions the arcsine transformation to normality is used; severity data (ordered categorical data) should be analyzed by the WMW test. (More details in the paper version.)

Guidelines II
- The 2001 FDA Guidance on Statistical Aspects of the Design, Analysis, and Interpretation of Chronic Rodent Carcinogenicity Studies of Pharmaceuticals gives a detailed description of the evaluation of neoplastic lesions only (not covered today).
- However, in the separate recommendation for the evaluation of immunotoxicological studies, the Bartlett test on homogeneity of variances in a one-way k-sample design is used as a pre-test. When it is significant, i.e. the variances are heterogeneous, non-parametric approaches are recommended, otherwise parametric ones.
- Some textbooks on statistics in toxicology exist, e.g. [Gad05], [Mor96], [PB97], [Kir00], [Cho98], [KP03], but none provides data examples and their appropriate evaluation with related software.
- Conclusion: Dunnett/Williams and Dunn/Shirley procedures using R, and alternatives, in this ASA webinar.

Presentation style of significances I
For tests, the following presentation styles of significances are common:
1) yes/no decision $H_0$ vs. $H_1$
2) yes/no decision $H_0$ vs. $H_1$, but for three α levels (0.05, 0.01, 0.001), each marked by its own symbol
3) p-values, e.g. [MZL+08]

Presentation style of significances II
4) confidence intervals

Presentation style of significances III
- The p-value is motivated by Popper's falsification principle: we can never prove an effect directly, only by the small probability of its opposite. I.e. the p-value is the smallest possible false positive (f+) rate (Explain it!)
- Advantage: simple, reduces any complex question to one value, commonly used
- Disadvantage:
  i) it is only a probability between 0 and 1, but we need a measure of the effect,
  ii) it depends on the sample size n, i.e. in un-designed experiments any small p-value can be achieved by increasing the sample size, independently of the effect size $\mu_T - \mu_C$,
  iii) it (commonly) refers to a point-zero null hypothesis $H_0: \mu_T - \mu_C = 0$, but in bio-medicine we are never interested in tiny differences

Presentation style of significances IV
- A better alternative is the use of effect sizes and their confidence intervals
- For continuous data: $\mu_T - \mu_C$ or $\mu_T/\mu_C$
- Confidence intervals (CI) for these measures are obtained by re-formulating the t-test $\frac{\bar{x}_T - \bar{x}_C}{SD\sqrt{2/n}} = t_{df,1-p}$ with $p = \min(\alpha)$ into $\left[(\bar{x}_T - \bar{x}_C) - SD\sqrt{2/n}\; t_{df,1-\alpha},\ (\bar{x}_T - \bar{x}_C) + SD\sqrt{2/n}\; t_{df,1-\alpha}\right]$
- ICH E9 Guidance for RCTs: "Estimates of treatment effects should be accompanied by confidence intervals, whenever possible..."
- Sometimes interpretation is easier as a percentage change, e.g. the k-fold rule in mutagenicity assays, and a confidence interval for $\mu_T/\mu_C$ is recommended (switch from an additive to a multiplicative model). This is a bit more complicated (no formula here) and follows Fieller [Fie54].
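The re-formulation of the t-test into a confidence interval can be sketched in base R. The data below are simulated purely for illustration (they are not from the NTP study), equal group sizes and a common variance are assumed, and the quantile $t_{df,1-\alpha/2}$ is used because the sketch computes a two-sided interval.

```r
# Hypothetical illustration data: two groups of n = 10 animals
set.seed(1)
n   <- 10
x_C <- rnorm(n, mean = 50, sd = 5)   # control
x_T <- rnorm(n, mean = 56, sd = 5)   # treated

alpha <- 0.05
df    <- 2 * n - 2
SD    <- sqrt((var(x_C) + var(x_T)) / 2)   # pooled SD for equal group sizes
d_hat <- mean(x_T) - mean(x_C)             # estimated effect size
half  <- SD * sqrt(2 / n) * qt(1 - alpha / 2, df)
ci    <- c(lower = d_hat - half, upper = d_hat + half)

# Cross-check against the built-in pooled-variance t-test
ci_ref <- t.test(x_T, x_C, var.equal = TRUE, conf.level = 1 - alpha)$conf.int
print(ci)
print(ci_ref)
```

Both intervals should agree; the hand-computed version only makes the three ingredients (estimate, standard error, t-quantile) visible.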

Presentation style of significances V
- Properties of confidence intervals for ratio-to-control:
  i) asymmetric, because they are multiplicative, e.g. [0.5, 1.5],
  ii) problems with control values near zero,
  iii) useful for a direct comparison of multiple endpoints,
  iv) useful for endpoints with different scales (continuous, proportions, counts),
  v) one confidence interval approach for superiority and non-inferiority,
  vi) others
- Notice: one-sided intervals are available as well

Presentation style of significances VI
- Problems:
  i) The width of the confidence interval, i.e. $SD\sqrt{2/n}\; t_{df,1-\alpha}$, is a function of the sample size: larger sample sizes give a smaller (more significant) width and analogously smaller p-values, for the same effect size and variance. The sample size must therefore be defined a priori, by guideline or by a power approach (see later).
  ii) Variance heterogeneity
  iii) Violation of the normal distribution assumption
- The common misunderstanding between statistical significance and biological relevance results from inappropriate use of p-values, testing a point-zero $H_0$, and un-designed experiments. Therefore, toxicological studies should be characterized by effect measures and their confidence intervals.
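How strongly the interval width depends on the sample size can be made concrete with a few lines of base R. The standard deviation and the sample sizes below are hypothetical values chosen only for illustration; the two-sided quantile $t_{df,1-\alpha/2}$ is assumed.

```r
# Half-width of the two-sample confidence interval as a function of n per group
SD    <- 5                      # hypothetical common standard deviation
alpha <- 0.05
n     <- c(5, 10, 20, 50)       # hypothetical animals per group
half_width <- SD * sqrt(2 / n) * qt(1 - alpha / 2, df = 2 * n - 2)
print(data.frame(n = n, half_width = round(half_width, 2)))
```

The half-width shrinks with n both through the factor $\sqrt{2/n}$ and through the t-quantile, which is why the sample size has to be fixed before the study.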

Effect sizes and their confidence intervals I
- Recent critique for studies with individuals, i.e. volunteers in RCTs [Bro10]: "I once asked a well-published medical researcher what p < 0.05 meant to him. He said: 'It means that everyone on X did better than everyone on Y.'"
- "What is needed is ... translating a significant t-test p value and sample size into a form that can more clearly express the magnitude of the effect."
- "While traditional significance testing usually deals with differences between population means, there is an increasing focus in fields such as medicine on the probability of one treatment being more successful than another on a per-individual basis."

Effect sizes and their confidence intervals II
- "We need ... a measure of how often a random subject receiving treatment X will outperform a random subject receiving treatment Y, typically expressed as P(X > Y)."
- Relative effect size: a measure of stochastic order, $p_{01} = P(X_{01} < X_{11})$ (for continuous data), i.e. the probability that a randomly selected patient in the control group shows a smaller response value than a randomly selected patient in the treatment group. This measure $p_{01}$ is called the relative effect [BM00].

Effect sizes and their confidence intervals III
- Important generalization for non-continuous data (ties, score data, ...): $p_{01}$ as an ordinal effect size measure [RA08], defined as
  $p_{01} = \int F_0\, dF_1 = P(X_{01} < X_{11}) + 0.5\,P(X_{01} = X_{11})$.   (1)
- This is an effect size in the sense of Browne [Bro10], and nowadays both confidence intervals and software (R library nparcomp) exist.
- Interpretation: compared with $X_{0j}$, $X_{1j}$ tends stochastically to larger values if $p_{01} > 0.5$ and to smaller values if $p_{01} < 0.5$; there is no decision against $H_0$ if $p_{01} = 0.5$.
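The point estimate of the relative effect in (1) can be computed directly from all pairs of observations. The sketch below uses base R only; the helper `rel_effect` and the severity scores are hypothetical and serve just to show how ties enter with weight 0.5.

```r
# Plug-in estimate of p01 = P(X0 < X1) + 0.5 * P(X0 = X1) over all pairs
rel_effect <- function(x0, x1) {
  lt <- outer(x0, x1, "<")    # pairs where the control value is smaller
  eq <- outer(x0, x1, "==")   # tied pairs
  mean(lt) + 0.5 * mean(eq)
}

x0 <- c(0, 0, 1, 1, 2)   # hypothetical severity scores, control
x1 <- c(1, 2, 2, 3, 3)   # hypothetical severity scores, dose group
p01 <- rel_effect(x0, x1)
print(p01)   # 20 of 25 pairs smaller, 4 pairs tied: 0.8 + 0.5 * 0.16 = 0.88

# The same estimate is the Wilcoxon-Mann-Whitney statistic divided by n0 * n1
W <- suppressWarnings(wilcox.test(x1, x0)$statistic)
print(unname(W) / (length(x0) * length(x1)))
```

This gives only the point estimate; confidence intervals for $p_{01}$ need the variance estimators implemented in nparcomp.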

Proof of hazard vs. proof of safety I
- Common decision in toxicology: harmless if the p-value of an appropriate test for $D_j$ vs. $C$ is non-significant (> 0.05), otherwise harmful (based on the sample sizes in the guidelines, for the design $C, D_1, D_2, D_3$, independently for both sexes, each endpoint, each time point), with a subsequent discussion of biological relevance if p < 0.05.
- Issue 1: Point-zero null hypotheses $H_0: \mu_C - \mu_D = \delta = 0$ are not appropriate. Better: a-priori definition of relevance thresholds in toxicology. But a consensus, particularly for the many endpoints, seems to be hopeless. Therefore: estimation of confidence intervals and their post-hoc interpretation in terms of tolerable thresholds.

Proof of hazard vs. proof of safety II
- Issue 2: Kirkland (1999): "be confident in negative results" [LYH+00]. But the commonly used hypotheses are
  $H_0: \mu_C = \mu_D$ (harmless)
  $H_1: \mu_C < \mu_D$ (harmful)
- Remember the falsification principle. Crux: Neyman-Pearson tests are asymmetric, i.e. only one error rate can be controlled directly, namely the error of falsely rejecting $H_0$ (false positive (f+) error rate, type I error rate, α). I.e. the commonly used t-test at level α = 0.05 allows a 5% f+ rate when rejecting $H_0$.
- Alternative: proof of safety, i.e. we formulate the hypotheses such that we control the more important error in toxicology, the false negative (f−) rate (Explain it! $\pi = 1 - f^-$):
  $H_0: \mu_C - \mu_D \geq \delta$ (harmful, toxic)
  $H_1: \mu_C - \mu_D < \delta$ (harmless, non-toxic)

Proof of hazard vs. proof of safety III
- Issue 3: In chron-tox studies, the variances are endpoint-specific (and in long-term carcinogenicity studies the spontaneous tumor rates are specific). Therefore, the f− rate in the commonly used proof of hazard is endpoint-specific: crazy!
- Issue 4: Although multiple endpoints occur in a chron-tox study, we will not perform multiplicity adjustment. This differs from multiple efficacy endpoints in an RCT, where efficacy is claimed for $Y_1$ OR $Y_2$ OR ...
- Issue 5: Although multiple comparisons occur in the design $C, D_1, \dots, D_k$, we will not adjust for multiplicity in tox, because we are interested in keeping the more important f− error rate low instead of strictly controlling the less important f+ error rate.

Proof of hazard vs. proof of safety IV
- Curious: in tox with the design $[C, D_1, \dots, D_k]$ the most used approach is the Dunnett test (it is the 14th most cited statistical paper [RW05], with a majority of citations in tox). An example, Van Vleet et al. [VVWS+07]: "Statistical Analyses: Dunnett's test was used to confirm/rule-out apparent dose-related trends."
- Why is multiplicity adjustment according to Dunnett less appropriate in the proof of hazard?
  i) A dose-related trend test is better, e.g. the Williams procedure [Wil71].
  ii) Problems with down-turn effects at higher doses? A protected Williams approach is available [BH03].
  iii) Why multiplicity adjustment in toxicology at all? For an efficacy endpoint in an RCT we must pay a price because of the claim for $D_1$ OR $D_2$ OR $D_3$. What happens in toxicology when using multiplicity adjustment?

Proof of hazard vs. proof of safety V
  i) The power π will be reduced; this is particularly critical because most sample sizes are (too) small.
  ii) There is no claim for a toxic effect at $D_1$ OR $D_2$ OR ...; any toxic dose is an outcome.
- Either proof of safety, i.e. $D_1$ OR $D_2$ OR $D_3$ are equivalent (or non-inferior) to control, OR proof of hazard, but without multiplicity adjustment. I.e. we tolerate an increasing f+ rate instead of an increasing f− error rate.
- Precisely: control of the comparisonwise error rate may be sufficient in toxicology.

Proof of hazard vs. proof of safety VI
- But the NTP recommendation: Dunnett/Dunn and Williams/Shirley do control the FWER. Therefore, all three approaches are covered today.
- Notice: Why are simple confidence intervals of two-sample tests sometimes not the best? Answer: their small degrees of freedom $\nu_{0i} = n_0 + n_i - 2$ are fewer than the common df of a one-way layout (assuming variance homogeneity), i.e. $\nu = n_0 + n_1 + \dots + n_k - (k+1)$, which is particularly important in tox because of the small number of animals.
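The loss in degrees of freedom can be illustrated with simple arithmetic in base R. The design below (control plus three dose groups with 10 animals each) is a hypothetical example, not one of the study data sets.

```r
# Degrees of freedom: two-sample comparison vs. common one-way layout
n <- c(10, 10, 10, 10)                 # hypothetical: control + 3 doses
k <- length(n) - 1                     # number of dose groups
df_two_sample <- n[1] + n[2] - 2       # nu_0i = n_0 + n_i - 2
df_one_way    <- sum(n) - (k + 1)      # nu = n_0 + ... + n_k - (k + 1)
print(c(two_sample = df_two_sample, one_way = df_one_way))

# Two-sided 95% t-quantiles: fewer df give a larger quantile, hence wider intervals
t_quantiles <- qt(0.975, df = c(df_two_sample, df_one_way))
print(t_quantiles)
```

The gain from the common variance estimator only holds under variance homogeneity, which is the assumption stated on the slide.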

One- or two-sided formulation of hypothesis? I
- There are controversial arguments for and against one- and two-sided hypotheses. In RCTs, efficacy is commonly tested by a two-sided hypothesis or a one-sided one at level α/2.
- Most endpoints in tox are directed, e.g. increasing tumor rate, ASAT, finding rates. But two-sided problems exist (rarely) as well, e.g. body weight changes.
- A simple way: two-sided generally, but:
  i) the f− error rate increases unnecessarily in the proof of hazard,
  ii) in the proof of safety a clear distinction exists between testing equivalence (two-sided) and non-inferiority (one-sided).
- Therefore: most hypotheses in tox are one-sided, and hence testing non-inferiority is the main approach.

Dunnett procedure or Williams procedure or...? I
- The NTP recommends the parametric Dunnett/Williams or the non-parametric Dunn/Shirley procedure. Which one is appropriate?
- The common dose-response design $C$ (i.e. $D = 0$), $D_1, \dots, D_k$ should be analyzed by the Williams procedure, assuming a one-sided and monotonic trend $H_1: \mu_C \leq \mu_1 \leq \dots \leq \mu_k$ (notice: two-sided trend hypotheses are possible, but hard to imagine).
- Why should the Dunnett procedure for $H_1: \mu_0 < \mu_i$ (for at least one i, any one) be used?
  i) changes are of interest, i.e. two-sided alternatives, for which the Dunnett procedure was constructed,
  ii) still one-sided, but with doubts about monotonicity.
- An alternative for down-turn effects at high doses: the modified Williams test [BH02].
- For high-throughput analysis: two-sided Dunnett procedure. For specific analysis: one-sided Williams procedure, sometimes modified against non-monotonicity.

Proof of hazard using unadjusted comparisons I
- A consequence of the primary importance of the f− rate in the proof of hazard is not to control a familywise f+ rate, neither across several doses/treatments nor across multiple endpoints, i.e. to use unadjusted two-sample comparisons throughout. This holds even though the (not really estimable) f+ rate increases seriously, and sentences such as "although statistically significant, this increase in ... is biologically not relevant" are used frequently.
- To achieve comparability between differently scaled multiple endpoints, unadjusted two-sided $(1 - \alpha)$ confidence intervals for ratios-to-control can be recommended. A parametric Fieller-type version [Fie54] for heterogeneous variances [TL04], [HVH08] is available in the R package pairwiseCI. Hodges-Lehmann-type intervals have been proposed [HM02], whereas a Behrens-Fisher modification is not available.

Proof of hazard using unadjusted comparisons II
- Alternatively, the related confidence intervals for relative effects, as a Behrens-Fisher solution, can be used by means of the R package nparcomp. Notice that two serious limitations exist for the non-parametric approach: control values near zero and small sample sizes (e.g. $n_i < 10$).
- Example 1: evaluation of relative organ weights analogously to [WJD+09]. Re-analysis using pairwiseCI. Analyze using R!: parametric approach, Hodges-Lehmann approach, relative effect size approach
- Questions so far?

Evaluation of Example 1 I

    setwd("e:\\aktuell_e\\ PUB\\_PAPER\\_StatTox2010\\Datenbeispiele")
    organ <- read.csv("organ.csv")
    organ$dose <- as.factor(organ$dose)
    library(pairwiseCI)
    library(nparcomp)
    # Hodges-Lehmann-type ratio-to-control intervals (the next four calls are truncated in the source)
    exa1 <- pairwiseCI(weight ~ dose, by="organ", data=organ, alternative="two.sided", method="HL.ratio", control="
    plot(exa1, CIvert=FALSE, H0line=c(0.8,1,1.25), H0lty=c(2,1,2), main="relative organ weights", xlab="non-parame
    # parametric (Fieller-type) ratio-to-control intervals
    exa1a <- pairwiseCI(weight ~ dose, by="organ", data=organ, alternative="two.sided", method="Param.ratio", var.e
    plot(exa1a, CIvert=FALSE, H0line=c(0.8,1,1.25), H0lty=c(2,1,2), main="relative organ weights", xlab="parametri
    # thymus weights, control vs. dose 500 only
    tym <- organ[organ$organ=="thymus", ]
    tym500 <- tym[tym$dose==0 | tym$dose==500, ]
    tym500$dose <- factor(tym500$dose, levels=c(0,500))
    tym500 <- tym500[, c(4,6)]
    # relative effect (call truncated in the source)
    npar.t.test(weight ~ dose, data=tym500, alternative="two.sided", p.permu = TRUE, plot.simci = TRUE, info = TRUE
    # ratio-to-control t-test
    pairwiseTest(weight ~ dose, data=tym500, alternative="two.sided", method="t.test.ratio")

- Interpretation: i) Compare parametric vs. non-parametric! ii) Use directional decisions! iii) Interpret: significant, but biologically not relevant

Evaluation of Example 1 II
- Tests instead: compare $p_{rel.effect} = 0.026$ with $p_{ratio.to.control.Sasabuchi} = 0.045$
- Notice the high f+ rate for this approach

Simultaneous confidence intervals in toxicological studies I
- Tox studies use similar designs: $[C, T_1, \dots, T_k]$ or $[C, D_1, \dots, D_k]$, i.e. comparisons of treatments or doses versus C
- A still better design includes a further positive control: $[C, D_1, \dots, D_k, C^+]$. Two options: i) proof of assay sensitivity in advance (to limit the f− error rate), ii) characterizing a dose effect relative to C− and relative to C+.
- Typical point-zero hypothesis for $T_i$ vs. $C$ for a difference: $H_0: \mu_0 = \dots = \mu_k$ vs. $H_1: \mu_0 < \mu_i$ (for at least one i, any one; 0 is the index of the control)
- OR for non-inferiority (toxic): $H_0: \mu_i - \mu_0 \leq \delta_i$ vs. $H_1: \mu_i - \mu_0 > \delta_i$
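For a single dose group, a non-inferiority decision of the form $H_1: \mu_i - \mu_0 > \delta_i$ can be sketched with a one-sided confidence limit in base R. Everything specific here is an assumption for illustration: the data are simulated, a decrease is taken as the harmful direction, the threshold `delta` is a hypothetical tolerable decrease, and variance homogeneity is assumed. No multiplicity adjustment is made.

```r
set.seed(2)
x0 <- rnorm(10, mean = 100, sd = 8)   # control (hypothetical data)
xi <- rnorm(10, mean =  98, sd = 8)   # dose group (hypothetical data)
delta <- -15                          # hypothetical tolerable decrease
alpha <- 0.05

# One-sided lower (1 - alpha) confidence limit for mu_i - mu_0
lower <- t.test(xi, x0, alternative = "greater", var.equal = TRUE,
                conf.level = 1 - alpha)$conf.int[1]

# Non-inferiority is claimed if the lower limit exceeds the threshold
non_inferior <- lower > delta

# Equivalent formulation: shifted one-sided t-test of H0: mu_i - mu_0 <= delta
p_shift <- t.test(xi, x0, mu = delta, alternative = "greater",
                  var.equal = TRUE)$p.value
print(c(lower = lower, p_shift = p_shift))
```

The confidence limit and the shifted test always give the same decision; the interval has the advantage that it can be interpreted post hoc against any tolerable threshold.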

Simultaneous confidence intervals in toxicological studies II
- Ordered alternative: $H_1: \mu_0 \leq \mu_1 \leq \dots \leq \mu_k$, with at least $\mu_0 < \mu_k$
- Therefore only two methods, assuming $N(\mu_i, \sigma^2)$:
  i) Dunnett (1955) [Dun55], two- or one-sided,
  ii) Williams (1971) [Wil71], one-sided for a monotone increase (or decrease)

Multiple comparison procedures for differences of $\mu_i$ - demonstrated as multiple contrast test I
- Aim: simultaneous confidence intervals for $(\mu_i - \mu_{i'})$, using linear test statistics
- Special case: comparisons vs. control, $(\mu_i - \mu_0)$
- Simultaneous lower confidence limits according to Dunnett (1955) [Dun55]: $\left[\bar{x}_i - \bar{x}_0 - S\sqrt{n_i^{-1} + n_0^{-1}}\; t_{k,df,R,1-\alpha};\ \infty\right)$
- A contrast is a suitable linear combination of means: $\sum_{i=0}^{k} c_i \bar{x}_i$. A contrast test is standardized, $t_{Contrast} = \sum_{i=0}^{k} c_i \bar{x}_i \Big/ \left(S\sqrt{\sum_{i=0}^{k} c_i^2/n_i}\right)$, where $\sum_{i=0}^{k} c_i = 0$ guarantees a $t_{df,1-\alpha}$-distributed level-α test.
- A multiple contrast test is defined as a maximum test, $t_{MCT} = \max(t_1, \dots, t_q)$, where $(t_1, \dots, t_q)$ jointly follow a q-variate t-distribution with df degrees of freedom and correlation matrix R, with $\rho_{ab} = \frac{\sum_{i=0}^{k} a_i b_i/n_i}{\sqrt{\sum_{i=0}^{k} a_i^2/n_i \; \sum_{i=0}^{k} b_i^2/n_i}}$
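The correlation between two contrasts depends only on the contrast coefficients and the group sizes, so it can be checked in a few lines of base R. The helper `contrast_cor` and the sample sizes are hypothetical; the sketch assumes a common variance, so that the covariance matrix of the contrasts is proportional to $C\,\mathrm{diag}(1/n_i)\,C^T$.

```r
# Correlation matrix R of a set of contrasts (rows of C) for group sizes n
contrast_cor <- function(C, n) {
  V <- C %*% diag(1 / n) %*% t(C)   # proportional to the covariance of the contrasts
  cov2cor(V)
}

# Dunnett-type contrasts for control + 2 treatments
C_dunnett <- rbind(c_a = c(-1, 0, 1),
                   c_b = c(-1, 1, 0))

R_balanced   <- contrast_cor(C_dunnett, n = c(10, 10, 10))
R_unbalanced <- contrast_cor(C_dunnett, n = c(20, 10, 10))
print(R_balanced)     # off-diagonal 0.5 in the balanced design
print(R_unbalanced)   # off-diagonal 1/3 with a doubled control group
```

A larger control group lowers the correlation between the comparisons; the quantile of the q-variate t-distribution itself requires a package such as mvtnorm (used internally by multcomp) and is not computed here.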

Multiple comparison procedures for differences of $\mu_i$ - demonstrated as multiple contrast test II
- Notice: with increasing average correlation and a lower number of contrasts q, the q-variate t-distribution tends to the univariate t-distribution, i.e. the degree of adjustment is reduced
- Question: which contrasts, and how many? Aim: few, correlated contrasts that are relevant to the tox questions
- Simple examples (balanced design, k = 3)
- Dunnett, one-sided:

    c_i    C    T_1   T_2
    c_a   -1     0     1
    c_b   -1     1     0

Multiple comparison procedures for differences of $\mu_i$ - demonstrated as multiple contrast test III
- Tukey all-pairs comparisons (two-sided):

    c_i    C    T_1   T_2
    c_a   -1     0     1
    c_b   -1     1     0
    c_c    0    -1     1
    c_d    1    -1     0
    c_e    1     0    -1
    c_f    0     1    -1

- Williams procedure as multiple contrast [Bre06]:

    c_i    C    D_1   D_2
    c_a   -1     0     1
    c_b   -1    1/2   1/2

Multiple comparison procedures for differences of $\mu_i$ - demonstrated as multiple contrast test IV
- Two-sided confidence intervals: $\left[\sum_{i=0}^{k} c_i \bar{x}_i \pm S\, t_{q,df,R,2\text{-sided},1-\alpha} \sqrt{\sum_{i=0}^{k} c_i^2/n_i}\right]$
- Notice: multiplicity-adjusted p-values are available as an alternative to simultaneous confidence intervals. They are compatible, i.e. they yield the same decisions.
- Notice: although simultaneous confidence intervals for stepwise MCPs were recently made available [SB08], they are non-informative and cannot be recommended, regardless of their (small) power advantage.

Multiple comparison procedures for differences of $\mu_i$ - demonstrated as multiple contrast test V
- Example 2: Clinical chemistry data of the 13-week study on sodium dichromate dihydrate in female F344 rats, final data at day 93 only, endpoint ALT
- Analyze using R!: jittered box-plots, two-sided Dunnett procedure assuming variance homogeneity (for heterogeneity see below), two-sided simultaneous confidence intervals

Evaluation of Example 2 I

    clin <- read.csv("clin.csv")
    clin$dose <- as.factor(clin$dose)
    # jittered box-plots
    boxplot(alt ~ dose, data=clin, outline=FALSE)
    points(jitter(as.integer(clin$dose)), clin$alt, cex=1, pch=17)
    library(multcomp)
    library(sandwich)
    # one-way linear model and two-sided Dunnett comparisons vs. control
    myalt <- lm(alt ~ dose, data=clin)
    exa2 <- glht(myalt, linfct = mcp(dose = "Dunnett"), alternative="two.sided")
    plot(exa2, xlim=c(0,450), main="clinical chemistry: ALT- variance homogeneity")

Evaluation of Example 2 II
- Interpretation: i) Although the box-plots indicate variance heterogeneity, the standard Dunnett procedure is used here. ii) Use directional decisions! iii) Interpret: significant, but biologically not relevant
- Notice the high f− rate for this approach, since the FWER is controlled

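Multiple comparisons for ratios of $\mu_i$ I
- Aim: simultaneous confidence intervals for $\mu_i/\mu_0$
- Trick: re-formulate the ratios in a linear form, $Z_{i0} = \bar{x}_i - \theta \bar{x}_0$ (Fieller, 1954) [Fie54] (assumption: $\theta$ = const.)
- Therefore $Z_{i0} \sim N(0, \sigma^2_{Z_{i0}})$, where $\sigma^2_{Z_{i0}} = \left[\frac{1}{n_i} + \frac{\theta^2}{n_0}\right]\sigma^2$
- $t_{i0}(\theta) = \frac{\bar{x}_i - \theta \bar{x}_0}{S_{Z_{i0}}}$ is univariate t-distributed
- Simultaneous confidence intervals for the ratios $\gamma_{i0} = \mu_i/\mu_0$: $\left\{\hat{\gamma}_i \pm \left[\hat{\gamma}_i^2 - (1 - G)\left(\hat{\gamma}_i^2 - \frac{N G}{n_i}\right)\right]^{1/2}\right\} \Big/ (1 - G)$, $i = 1, \dots, q$, where $G = S^2 q^2_{\alpha,m,\nu,R}/(N \bar{x}_0^2)$ and $N$ is the sample size of the control group
- Notice: the equi-coordinate percentage point $t_{q,\nu,R,1-\alpha}$ depends on the unknown ratios $\gamma_{i0}$ through the correlation matrix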
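For a single ratio-to-control, Fieller's trick can be sketched in base R by solving the quadratic that results from $t_{i0}^2(\theta) = t^2_{df,1-\alpha/2}$. The function `fieller_ci` and the data are hypothetical; the sketch assumes two independent samples with a common variance and makes no multiplicity adjustment (the simultaneous version is what mratios provides).

```r
# Unadjusted two-sided Fieller interval for mean(xi) / mean(x0)
fieller_ci <- function(xi, x0, conf.level = 0.95) {
  ni <- length(xi); n0 <- length(x0)
  df <- ni + n0 - 2
  s2 <- ((ni - 1) * var(xi) + (n0 - 1) * var(x0)) / df   # pooled variance
  tq <- qt(1 - (1 - conf.level) / 2, df)
  # (mean(xi) - theta * mean(x0))^2 = tq^2 * s2 * (1/ni + theta^2/n0)
  # rearranged as a * theta^2 + b * theta + c0 = 0
  a  <- mean(x0)^2 - tq^2 * s2 / n0
  b  <- -2 * mean(xi) * mean(x0)
  c0 <- mean(xi)^2 - tq^2 * s2 / ni
  if (a <= 0) stop("control mean too close to zero: interval is unbounded")
  root <- sqrt(b^2 - 4 * a * c0)
  c(estimate = mean(xi) / mean(x0),
    lower = (-b - root) / (2 * a),
    upper = (-b + root) / (2 * a))
}

set.seed(3)
x0 <- rnorm(10, mean = 100, sd = 10)   # control (hypothetical data)
xi <- rnorm(10, mean = 120, sd = 10)   # dose group (hypothetical data)
ci <- fieller_ci(xi, x0)
print(round(ci, 3))
```

The stop condition mirrors the known problem with control values near zero: when the control mean is not clearly different from zero, the quadratic has no bounded solution.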

Multiple comparisons for ratios of $\mu_i$ II
- Solutions: Bonferroni, Sidak, plug-in [DBGH04]
- Software: R package mratios [DSH07]
- Example 3: Clinical chemistry data of the 13-week study on sodium dichromate dihydrate in female F344 rats, final data at day 93 only, endpoints cholesterol and triglyceride
- Analyze using R!: jittered box-plots, two-sided Dunnett-type procedure for ratios-to-control assuming variance homogeneity

Evaluation of Example 3 I

    library(mratios)
    # simultaneous Dunnett-type confidence intervals for ratios-to-control
    plot(sci.ratio(cholesterol ~ dose, data=clin, type="Dunnett"), main="cholesterol")
    plot(sci.ratio(triglyceride ~ dose, data=clin, type="Dunnett"), main="triglyceride")

- Interpretation: i) Use the dimensionlessness of ratio-to-control comparisons and interpret both endpoints in terms of significance and relevance. ii) Use directional decisions!

Modifications for variance heterogeneity I
- Variance heterogeneity is more likely in real toxicological data than variance homogeneity, because of a possible proportionality between variance and mean
- Particularly in unbalanced designs ($n_i$ inversely related to $s_i$), neither two-sample tests nor multiple contrast tests control α
- Therefore, modifications for variance heterogeneity are highly recommended in toxicology. They can be used as the default approach (accepting some conservativeness for homogeneous variances) or conditionally on pre-tests, e.g. according to Levene [PF09]
- Three approaches: i) a sandwich estimator for the variance-covariance matrix in the linear model [HSH10], ii) Welch-type df adjustment for multiple contrast tests [Has09], iii) Behrens-Fisher modification of non-parametric tests [FK09]. R programs are available.
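The effect of ignoring variance heterogeneity in an unbalanced design can be seen already in the two-sample case. The base R sketch below uses simulated, purely hypothetical data in which the small group has the larger variance, and compares the pooled-variance t-test with the Welch modification; it illustrates only the idea behind the df adjustment, not the multiple contrast versions cited above.

```r
set.seed(4)
x0 <- rnorm(20, mean = 10, sd = 1)   # large control group, small variance (hypothetical)
x1 <- rnorm(5,  mean = 10, sd = 5)   # small dose group, large variance (hypothetical)

pooled <- t.test(x1, x0, var.equal = TRUE)    # assumes a common variance
welch  <- t.test(x1, x0, var.equal = FALSE)   # Welch-type df adjustment

# Degrees of freedom: n0 + n1 - 2 for the pooled test, reduced for Welch
print(c(df_pooled = unname(pooled$parameter), df_welch = unname(welch$parameter)))

# Widths of the two-sided 95% confidence intervals for mu_1 - mu_0
print(c(width_pooled = diff(pooled$conf.int), width_welch = diff(welch$conf.int)))
```

When the small group is the more variable one, the pooled interval is typically too narrow, which is the situation in which α is not controlled.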

Modifications for variance heterogeneity II
- Example 4: Clinical chemistry data of the 13-week study on sodium dichromate dihydrate in female F344 rats, final data at day 93 only, endpoint ALT
- Analyze using R!: jittered box-plots, two- and one-sided Dunnett-type procedure assuming variance heterogeneity

Evaluation of Example 4 I

    boxplot(alt ~ dose, data=clin, outline=FALSE)
    points(jitter(as.integer(clin$dose)), clin$alt, cex=1, pch=17)
    # heteroscedasticity-consistent covariance estimates for the model from Example 2
    sandwich(myalt)
    myvcov <- vcovHC(myalt, type = "HC")
    # Dunnett comparisons using the sandwich estimator, two-sided
    exa4 <- glht(myalt, linfct=mcp(dose = "Dunnett"), vcov=sandwich, alternative="two.sided")
    plot(exa4, xlim=c(0,450), main="clinical chemistry: ALT- variance heterogeneity")
    # one-sided for an increase
    exa4a <- glht(myalt, linfct=mcp(dose = "Dunnett"), vcov=sandwich, alternative="greater")
    plot(exa4a, xlim=c(0,450), main="clinical chemistry: ALT- variance heterogeneity for an increase")

Evaluation of Example 4 II
- Interpretation: i) Notice the variance heterogeneity in the box-plots, ii) Use the sandwich estimator
- Compare one- and two-sided intervals

Evaluation of Example 5 I
- Example 5: Clinical chemistry data of the 13-week study on sodium dichromate dihydrate in female F344 rats, final data at day 93 only, endpoint cholesterol
- Analyze using R!: jittered box-plots, two-sided Dunnett-type procedure assuming variance heterogeneity

    boxplot(cholesterol ~ dose, data=clin, outline=FALSE, main="cholesterol")
    points(jitter(as.integer(clin$dose)), clin$cholesterol, cex=0.5, pch=16)
    library(mratios)
    # ratio-to-control intervals allowing for heterogeneous variances
    exa5 <- sci.ratioVH(cholesterol ~ dose, data=clin, type="Dunnett")
    plot(exa5, main="cholesterol: variance heterogeneity")

Evaluation of Example 5 II
- Interpretation: i) Notice the variance heterogeneity in the box-plots, ii) Use the new Welch modification [Has09], iii) Compare one- and two-sided intervals
- Questions so far?

Trend tests and related simultaneous confidence intervals I
- An important criterion of relevance in the proof of hazard: a significant trend. Question: what does trend mean? Two criteria: i) one-sided, ii) monotone, i.e. $H_1: \mu_C \leq \mu_1 \leq \dots \leq \mu_k$, covering all possible elementary hypotheses, not just a linear trend. The alternative $H_1: \mu_C < \mu_1 = \dots = \mu_k$ is hard to accept as a trend for some toxicologists, but it is a trend alternative.
- Therefore, a trend test must be sensitive against all possible elementary alternatives, not against just one, e.g. the linear one, as in the widely used Cochran-Armitage trend test [Arm55] for proportions or the Jonckheere trend test for pairwise ranks.
- At least two approaches: the MLE test according to [Bar59] (quadratic test statistics) and the MCT (linear test statistics)
- A trend test that compares vs. control: the Williams trend test [Wil71].

Trend tests and related simultaneous confidence intervals II
- For studies with 2 to 4 doses (typical in toxicology), model-based approaches are difficult (see R library MCPMod [BPB05])
- But the Williams (1971) procedure [Wil71] is the first choice, because it tests a monotone alternative vs. control
- The contrast structure:

    c_i    C    D_1   D_2
    c_a   -1     0     1
    c_b   -1    1/2   1/2

- Example 6: Clinical chemistry data of the 13-week study on sodium dichromate dihydrate in female F344 rats, final data at day 93 only, endpoint cholesterol
- Analyze using R!: Williams procedure for a monotone decrease assuming variance heterogeneity

Trend tests and related simultaneous confidence intervals III

    library(multcomp)
    library(sandwich)
    mychol <- lm(cholesterol ~ dose, data=clin)
    # heteroscedasticity-consistent covariance estimates
    sandwich(mychol)
    myvcov <- vcovHC(mychol, type = "HC")
    # Williams-type contrasts for a decrease, using the sandwich estimator
    exa6 <- glht(mychol, linfct=mcp(dose = "Williams"), vcov=sandwich, alternative="less")
    plot(exa6, main="williams trend approach: Cholesterol- variance heterogeneity")

- Interpretation: i) global monotonic decrease, ii) minimal effective dose: 1000

Trend tests and related simultaneous confidence intervals IV
- Example 7: Clinical chemistry data of the 13-week study on sodium dichromate dihydrate in female F344 rats, final data at day 93 only, endpoint cholesterol
- Analyze using R!: Williams-type procedure for monotonically decreasing ratios-to-control assuming variance heterogeneity

    library(mratios)
    exa7 <- sci.ratioVH(cholesterol ~ dose, data=clin, type="Williams", alternative="less")
    plot(exa7, main="cholesterol: Williams-type ratios-to-control")

Non-parametric approaches and related simultaneous confidence intervals I
- For non-normal data, the trend test according to Shirley [SHI77] is widely used in toxicology, i.e. the observations are jointly ranked and the Williams test [WIL72] is applied.
- $H_0^F: F_0 = \dots = F_k$, formulated in terms of the distribution functions, is tested against the ordered alternative $H_1^F: F_0 \leq \dots \leq F_k$ with at least one strict inequality $F_i < F_s$, $i \neq s$. The test controls the FWER strongly.
- The distribution of the rank means is unknown under the alternative; hence neither are simultaneous confidence intervals numerically available for a general unbalanced design, nor can power be estimated.
- Tied or ordered categorical data, such as severity counts, should be analyzed as well. Therefore, a non-parametric approach is required that includes continuous, discrete, and even dichotomous data in a unified way.

Non-parametric approaches and related simultaneous confidence intervals II
- Since variance heterogeneity occurs (particularly increasing variances with increasing effects, in unbalanced designs with $n_{Control} = k\,n_{Doses}$), the control of the FWER may be problematic. Therefore, a related robust procedure is needed, the so-called Behrens-Fisher (BF) modification, as an analogue to the related parametric approach for multiple contrast tests under variance heterogeneity [HH08].
- Using the relative effect size [BM00], [RA08]:
  $p_{01} = \int F_0\, dF_1 = P(X_{01} < X_{11}) + 0.5\,P(X_{01} = X_{11})$.   (2)
  Here the term $0.5\,P(X_{01} = X_{11})$ ensures that data with ties are taken into account.

Non-parametric approaches and related simultaneous confidence intervals III
- Note that the numerator of the Wilcoxon statistic estimates the relative effect $p_{01}$. The Wilcoxon test, however, can only be used for testing the hypothesis $H_0^F: F_0 = F_1$ formulated in terms of the distribution functions. Moreover, the Wilcoxon test procedure is not robust against variance heterogeneity. Therefore, test procedures which test the hypothesis in terms of the relative effect, e.g. the Brunner-Munzel test [BM00], are more appropriate.
- Relative Shirley-type effects: let

  $C_{q \times (k+1)} = \begin{pmatrix} -1 & 0 & \dots & 0 & 1 \\ -1 & 0 & \dots & \frac{n_{k-1}}{n_{k-1}+n_k} & \frac{n_k}{n_{k-1}+n_k} \\ \vdots & \vdots & & \vdots & \vdots \\ -1 & \frac{n_1}{n_1+\dots+n_k} & \dots & \frac{n_{k-1}}{n_1+\dots+n_k} & \frac{n_k}{n_1+\dots+n_k} \end{pmatrix}$

  denote the Williams contrast matrix [Bre06].

Non-parametric approaches and related simultaneous confidence intervals IV
- E.g., for the common balanced design with three dose groups and one control, the three contrasts are:
\[
C_{3 \times 4} =
\begin{pmatrix}
-1 & 0 & 0 & 1 \\
-1 & 0 & 0.5 & 0.5 \\
-1 & 0.33 & 0.33 & 0.33
\end{pmatrix}
\]
That is, the first contrast indicates a strictly global trend, the second contrast a plateau for the two higher doses, and the third contrast a plateau of all doses, just different from control.
56 / 96
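A minimal base-R sketch of how such Williams-type coefficients can be built for arbitrary sample sizes. The helper `williams_contrasts` is a hypothetical name, the control is assumed to carry the coefficient -1, and the unbalanced sample sizes are invented for illustration.

```r
# Williams-type contrast matrix for sample sizes n = c(n0, n1, ..., nk).
# Row j compares the control with the n-weighted pool of the j highest doses.
williams_contrasts <- function(n) {
  k  <- length(n) - 1                  # number of dose groups
  cm <- matrix(0, nrow = k, ncol = k + 1)
  cm[, 1] <- -1                        # assumption: control gets weight -1
  for (j in seq_len(k)) {
    pooled <- (k - j + 1):k            # indices of the j highest doses
    cm[j, pooled + 1] <- n[pooled + 1] / sum(n[pooled + 1])
  }
  cm
}

C_bal <- williams_contrasts(c(10, 10, 10, 10))  # balanced, three doses
C_unb <- williams_contrasts(c(20, 10, 10, 8))   # hypothetical unbalanced design
```

In the balanced case the rows reduce to the coefficients 1, (0.5, 0.5) and (1/3, 1/3, 1/3) for the pooled doses; in every case each row sums to zero.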

Non-parametric approaches and related simultaneous confidence intervals V
- Therefore, treatment effects can be defined by using the relative effect between the distribution of the negative control group $F_0$ and the distribution of the samples $M_l$, $l = 1, \dots, q$:
\[
\begin{aligned}
p_1 &= p_{0k} \\
p_2 &= \frac{n_{k-1}}{n_{k-1}+n_k}\,p_{0(k-1)} + \frac{n_k}{n_{k-1}+n_k}\,p_{0k} \\
&\;\;\vdots \\
p_q &= \frac{n_1}{n_1+\dots+n_k}\,p_{01} + \dots + \frac{n_k}{n_1+\dots+n_k}\,p_{0k}
\end{aligned}
\]
- The effects $p_1, \dots, p_q$ are called relative Shirley-type effects. They are linear combinations of the two-sample relative effects between the negative control group and the active treatments. Therefore, in case of a monotonically increasing order of location, the relative Shirley-type effects decrease, i.e., $p_1 \ge p_2 \ge \dots \ge p_q$.
57 / 96
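The weighted combinations above can be illustrated with a few lines of base R. The data are invented (a control and three dose groups with increasing location), and the sketch only computes point estimates; the simultaneous confidence intervals of nparcomp are not reproduced here.

```r
# Hypothetical data: control plus three dose groups (not from the study)
groups <- list(
  C  = c(4.1, 4.3, 4.0, 4.4, 4.2),
  D1 = c(4.2, 4.5, 4.1, 4.6, 4.3),
  D2 = c(4.6, 4.8, 4.4, 4.9, 4.7),
  D3 = c(5.0, 5.2, 4.9, 5.4, 5.1)
)

# Two-sample relative effect of a dose group versus control (ties count 0.5)
rel_effect <- function(x0, x1) {
  mean(outer(x0, x1, "<") + 0.5 * outer(x0, x1, "=="))
}

n_dose <- lengths(groups[-1])                            # n_1, ..., n_k
p_0i   <- sapply(groups[-1], rel_effect, x0 = groups$C)  # p_01, ..., p_0k
k      <- length(p_0i)

# p_j pools the j highest doses with weights n_i / (n_{k-j+1} + ... + n_k)
shirley_effects <- sapply(seq_len(k), function(j) {
  idx <- (k - j + 1):k
  sum(n_dose[idx] * p_0i[idx]) / sum(n_dose[idx])
})
```

For these toy data the estimated effects are non-increasing from the first to the last contrast, as expected under a monotone increase in location.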

Non-parametric approaches and related simultaneous confidence intervals VI
- Example 8: Shirley-type test for potassium (clinical chemistry data of the 13-week study on sodium dichromate dihydrate in female F344 rats, final data at day 93 only)
- The box-plots indicate a right-skewed distribution with variance heterogeneity. Therefore, a Behrens-Fisher modification of the Shirley trend test for relative effects is used, by means of the R library nparcomp.
- The relative effect size allows a scale-independent comparison of multiple endpoints, such as in clinical chemistry.
- Analyse using R!
- Interpretation: i) normal distribution questionable, ii) NTP requires a non-parametric trend test, iii) variance heterogeneity, iv) relative effect sizes and their most extreme confidence limits: not harmful
58 / 96

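Non-parametric approaches and related simultaneous confidence intervals VII
# box-plots without outlier symbols; the raw observations are overlaid as jittered points
boxplot(potassium ~ dose, data = clin, outline = FALSE, main = "potassium")
points(jitter(as.integer(clin$dose)), clin$potassium, cex = 0.5, pch = 15)
library(nparcomp)
# Shirley-type (Williams) contrasts; the call is cut off on the slide after 'rounds'
nparcomp(potassium ~ dose, data = clin, type = "Williams", conflevel = 0.95, alternative = "two.sided", rounds
59 / 96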

Non-parametric approaches and related simultaneous confidence intervals VIII
- Example 9: Shirley-type test for graded histopathological findings
- Score data are particularly suitable for statistics based on relative effects [RA08]
- The graded findings [None, Mild, Moderate, Marked] are converted into the equidistant scores [0, 1, 2, 3]
- The relative effect sizes and their Shirley-type simultaneous confidence intervals are estimated by means of the R library nparcomp
- Analyse using R!
parotid <- read.csv("parotid.csv")
parotid$group <- as.factor(parotid$group)   # factor needed for as.integer() below; lowercase column name assumed
boxplot(score ~ group, data = parotid, outline = FALSE, main = "graded histopathological findings")
points(jitter(as.integer(parotid$group)), parotid$score, cex = 0.5, pch = 16)
library(nparcomp)
# the call is cut off on the slide after 'alternative = "grea'
nparcomp(score ~ group, data = parotid, asy.method = "mult.t", type = "Williams", alternative = "grea
60 / 96

Non-parametric approaches and related simultaneous confidence intervals IX
- Interpretation: i) analysis of ordered categorical data, particularly on trend, is not trivial, ii) interpretation of ordered categorical data as ordinal effect size measure [RA08], iii)
- Questions so far?
61 / 96

Simultaneous confidence intervals for proportions I
- Rates are rather typical in toxicological studies, e.g. histopathological findings, mortality, tumor rates
- General contradiction in toxicological risk assessment: the evaluation of continuous endpoints, such as body weight or hematology, is powerful and related statistical approaches are widely available; however, their predictive value is limited. On the other hand, the predictive value of proportions, such as selected clinical findings, is larger, but the power is much lower and appropriate statistical approaches are rarely available for such small sample sizes
- Moreover, for sample sizes of $n_i = 50, \dots, 10$ there is no hope for valid $(1-\alpha)$ Wald intervals. Therefore, we need confidence intervals whose coverage probability is approximately 95% also for smaller samples (though not for really small samples)
62 / 96

Simultaneous confidence intervals for proportions II - And, for all proportions a one-sided alternative for an increase is appropriate, never a two-sided alternative - As effect size the difference of proportions is common. Alternatively the relative risk or the odds ratio could be used 63 / 96

Two-sample comparisons I
- There is an ongoing discussion on appropriate small-sample confidence intervals; here we focus on one-sided lower limits.
- Based on the Wilson score interval, Newcombe [New98] introduced an interval for the difference of proportions, referred to as Newcombe's Hybrid Score (NHS) interval. Its variance term is constructed from the Wilson score confidence limits for the single proportions. The lower $(1-\alpha/2)$ limit is:
\[ \hat\pi_2 - \hat\pi_1 - z_{1-\alpha/2}\sqrt{\frac{l_2(1-l_2)}{n_2} + \frac{u_1(1-u_1)}{n_1}}, \qquad (3) \]
where $l_i$, $u_i$ are the lower and upper limits of the $(1-\alpha)$ Wilson score interval for the single proportion $\pi_i$:
64 / 96

Two-sample comparisons II
\[ [l_i, u_i] = \frac{y_i + z_{1-\alpha/2}^2/2}{n_i + z_{1-\alpha/2}^2} \pm \frac{z_{1-\alpha/2}\sqrt{n_i\,\hat\pi_i(1-\hat\pi_i) + z_{1-\alpha/2}^2/4}}{n_i + z_{1-\alpha/2}^2} \qquad (4) \]
- In the R library pairwiseCI, related confidence intervals are available via the option "NHS".
- Example 10: Unadjusted two-sample lower NHS confidence limits for tubular epithelia hyaline droplet degeneration in male rats. Note that the data are not in a two-by-two table but, more realistically, in an animal-specific flat file
- Analyse using R!
65 / 96
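Equations (3) and (4) can be checked by hand in base R. The sketch below is not the pairwiseCI implementation; the function names and the argument `level` (the normal quantile level, e.g. 0.975 for $1-\alpha/2$ with $\alpha = 0.05$) are illustrative choices. It uses the control and highest-dose incidences of the nonylphenol example (0/10 and 8/10).

```r
# Wilson score limits for one proportion y/n, Equation (4)
wilson <- function(y, n, level = 0.975) {
  z      <- qnorm(level)
  p      <- y / n
  centre <- (y + z^2 / 2) / (n + z^2)
  half   <- z * sqrt(n * p * (1 - p) + z^2 / 4) / (n + z^2)
  c(lower = centre - half, upper = centre + half)
}

# Lower NHS limit for pi_2 - pi_1 (group 2 = dose, group 1 = control), Equation (3)
nhs_lower <- function(y1, n1, y2, n2, level = 0.975) {
  z  <- qnorm(level)
  l2 <- wilson(y2, n2, level)[["lower"]]
  u1 <- wilson(y1, n1, level)[["upper"]]
  (y2 / n2 - y1 / n1) - z * sqrt(l2 * (1 - l2) / n2 + u1 * (1 - u1) / n1)
}

# Incidences from the nonylphenol example: control 0/10, highest dose 8/10
w_high   <- wilson(8, 10)
low_high <- nhs_lower(y1 = 0, n1 = 10, y2 = 8, n2 = 10)
```

The Wilson limits agree with `prop.test(8, 10, correct = FALSE)$conf.int` from base R. Setting `level = 0.95` would give the corresponding one-sided 95% lower limit.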

Two-sample comparisons III
tubepi <- read.csv("tubepi.csv") # NTP P-Cresidine carcinogenicity study histop
library(pairwiseCI)
# the next call is cut off on the slide after 'alter'
exa10 <- pairwiseCI(cbind(tubularepithelia, without) ~ Group, data = tubepi, alter
plot(exa10, main = "proportions of tubular epithelia")
- Interpretation:
66 / 96

A Dunnett-type for proportions I
- One-sided, lower $(1-\alpha)$ Wald-type confidence limits for the difference of the proportions of the treatments against that of a control are
\[ \sum_{i=1}^{I} c_i\,p_i - z_{q,R,1-\alpha}\sqrt{\sum_{i=1}^{I} c_i^2\,\hat V(p_i)} \qquad (5) \]
with $\hat V(p_i) = p_i(1-p_i)/n_i$ and $z_{q,R,1-\alpha}$ denoting the $(1-\alpha)$ quantile of the q-variate normal distribution. Its correlation matrix $R$ depends not only on the known contrast coefficients $c_{im}$ and sample sizes $n_i$ but also on the unknown $\pi_i$ and $V(p_i)$, where plugging in the ML estimators $\hat\pi_i$ and $\tilde V(p_i)$ works well.
- However, Wald limits for binomial proportions are known to keep the $(1-\alpha)$ coverage probability only asymptotically, i.e. for large sample sizes [AC00], [PB04]
67 / 96

A Dunnett-type for proportions II
- [AC98] showed that adding a total of four pseudo-observations to the observed successes and failures yields approximate confidence intervals for one binomial proportion with good small-sample performance
- One-sided limits were investigated only by [Cai05] in the case of a single binomial proportion, and recently by [SV09]:
\[ \sum_{i=1}^{I} c_i\,\tilde p_i - z_{q,R,1-\alpha}\sqrt{\sum_{i=1}^{I} c_i^2\,\tilde V(\tilde p_i)} \qquad (6) \]
Table: Choices for $\tilde p_i$ and $\tilde V(\tilde p_i)$
Notation   $\tilde p_i$                 $\tilde V(\tilde p_i)$
Wald       $Y_i / n_i$                  $\tilde p_i(1-\tilde p_i) / n_i$
add-1      $(Y_i + 0.5) / (n_i + 1)$    $\tilde p_i(1-\tilde p_i) / (n_i + 1)$
add-2      $(Y_i + 1) / (n_i + 2)$      $\tilde p_i(1-\tilde p_i) / (n_i + 2)$
68 / 96
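The estimators in the table are easy to compute directly. The following base-R sketch applies them to the nonylphenol incidences (0/10, 0/10, 3/10, 8/10) and forms lower limits for each dose-minus-control difference in the spirit of Equation (6). One deliberate simplification, stated as an assumption: the multivariate normal quantile $z_{q,R,1-\alpha}$ is replaced by a Bonferroni quantile, which is conservative and is not what MCPAN computes. The function names are illustrative.

```r
# Shifted estimators from the table: pseudo = 0 (Wald), 1 (add-1), 2 (add-2)
shifted_prop <- function(y, n, pseudo = 1) {
  p <- (y + pseudo / 2) / (n + pseudo)
  v <- p * (1 - p) / (n + pseudo)
  list(p = p, v = v)
}

# Lower limits for dose_i - control with contrast coefficients (-1, +1).
# ASSUMPTION: Bonferroni quantile instead of the multivariate normal quantile.
lower_vs_control <- function(y, n, pseudo = 1, alpha = 0.05) {
  est <- shifted_prop(y, n, pseudo)
  q   <- length(y) - 1                    # number of comparisons to control
  z   <- qnorm(1 - alpha / q)
  d   <- est$p[-1] - est$p[1]
  d - z * sqrt(est$v[-1] + est$v[1])
}

# Incidences from the nonylphenol example: 0/10, 0/10, 3/10, 8/10
y <- c(0, 0, 3, 8)
n <- rep(10, 4)
add1     <- shifted_prop(y, n, pseudo = 1)
low_wald <- lower_vs_control(y, n, pseudo = 0)
low_add1 <- lower_vs_control(y, n, pseudo = 1)
```

Note how the Wald variance is exactly zero for the groups with 0/10 findings, whereas the add-1 version keeps a positive variance estimate; this is one reason the shifted estimators behave better in small samples.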

A Dunnett-type for proportions III
- A simulation study [SSH08] supports the use of the add-1 approximation for one-sided lower limits when sample sizes are not too small
- Example 11: Simultaneous confidence limits for tubular epithelia hyaline droplet degeneration in male rats. Note that the data are not in a two-by-two table but, more realistically, in an animal-specific flat file
- Analyse using R!
69 / 96

A Dunnett-type for proportions IV
library(MCPAN)
# the next call is cut off on the slide after 'cmat=n'
exa11 <- binomRDci(tubularepithelia ~ Group, data = tubepi, type = "Dunnett", cmat=n
plot(exa11, main = "dunnett-type procedure for proportions of tubular epithelia")
- Interpretation
70 / 96

A Williams-type for proportions I
- Lower simultaneous confidence limits for a Williams-type approach with small-to-medium sample sizes are obtained analogously to the Dunnett-type approach; the contrast coefficients $c_i$ reflect the specific order restriction, see the parametric approach above (a recent publication: Hothorn and Schaarschmidt 2010)
- Example 12: Simultaneous confidence limits for tubular epithelia hyaline droplet degeneration in male rats. Note that the data are not in a two-by-two table but, more realistically, in an animal-specific flat file
- Analyse using R!
71 / 96

A Williams-type for proportions II
library(MCPAN)
# the next call is cut off on the slide after 'cmat='
exa12 <- binomRDci(tubularepithelia ~ Group, data = tubepi, type = "Williams", cmat=
plot(exa12, main = "williams-type procedure for proportions of tubular epithelia")
- Interpretation
- Questions so far?
72 / 96

Proof of safety by means of confidence intervals I
- Proof of hazard is not adequate: absence of evidence is not evidence of absence [AB04]
- Advantage of a proof of safety: direct control of the more important false negative error rate, i.e. the consumer's risk
- Therefore, hypotheses are formulated either on equivalence, for endpoints where an increase OR a decrease is a possible toxic effect (e.g. body weight change), or on non-inferiority, for endpoints where exactly one direction is a toxic effect (e.g. increasing tumor rates)
- Both hypotheses need an a priori definition of a relevance threshold δ (for the difference to control) or θ (for the ratio to control), e.g. the 2-fold rule of the Ames assay.
- However, such endpoint-specific thresholds (k-fold rules) are rarely found in the guidelines. A more realistic strategy is to estimate the confidence intervals and to interpret their limits post hoc as thresholds.
73 / 96

Proof of safety by means of confidence intervals II
- Question: difference or ratio? i) Choice between an additive and a multiplicative model, ii) a ratio is dimensionless, i.e. a % change, and therefore appropriate for multiple endpoints
- Definition of local or global safety in the common design [C, D1, D2, D3] related to the dose groups: i) local: D1 OR D2 OR D3 is safe (UIT), ii) global: D1 AND D2 AND D3 are safe (IUT)
- Marginal or global safety for multiple endpoints: i) marginal: Y_i is safe, ii) global: Y_1 AND Y_2 AND ... AND Y_p are safe (IUT)
- Note that IUTs are rather conservative, and a solution taking the correlations into account is not available yet
74 / 96

Proof of safety for non-inferiority: normal distributed endpoints I
- Assumptions: i) directional toxic decision, e.g. increase is toxic, ii) $N(\mu_i, \sigma^2)$, iii) randomized one-way layout
- Hypotheses: $H_0: \mu_D - \mu_C \ge \delta$ with $\delta > 0$ (harmful) versus $H_1: \mu_D - \mu_C < \delta$ (harmless)
- Translation into a confidence limit: harmless if upper$[\mu_i - \mu_0] < \delta$, respectively upper$[\mu_i/\mu_0] < \theta$; harmful otherwise
- One-sided simultaneous $(1-\alpha)$ confidence intervals according to Dunnett [Dun55] using library(multcomp)
75 / 96
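The decision rule "harmless if the upper limit is below δ" can be sketched in base R. Everything specific here is an assumption for illustration: the simulated data, the threshold `delta = 1`, and the use of a Bonferroni-adjusted t quantile as a conservative stand-in for the Dunnett quantile that multcomp would provide.

```r
set.seed(1)
# Hypothetical one-way layout: control ("0") and three doses, 10 animals each
dat <- data.frame(
  dose = factor(rep(c("0", "D1", "D2", "D3"), each = 10)),
  y    = rnorm(40, mean = rep(c(10, 10.1, 10.3, 11.5), each = 10), sd = 1)
)

fit   <- lm(y ~ dose, data = dat)
s     <- summary(fit)$sigma               # pooled standard deviation
nu    <- fit$df.residual                  # residual degrees of freedom
by_g  <- split(dat$y, dat$dose)
means <- sapply(by_g, mean)
n     <- lengths(by_g)
k     <- nlevels(dat$dose) - 1            # number of comparisons to control

# One-sided upper limits for mu_i - mu_0.
# ASSUMPTION: Bonferroni quantile instead of the Dunnett quantile.
alpha <- 0.05
tq    <- qt(1 - alpha / k, df = nu)
upper <- (means[-1] - means[1]) + tq * s * sqrt(1 / n[-1] + 1 / n[1])

delta       <- 1.0                        # hypothetical relevance threshold
harmless    <- upper < delta              # decision per dose group
global_safe <- all(harmless)              # IUT: every dose must be harmless
```

The last line corresponds to the global (intersection-union) notion of safety across dose groups; `any(harmless)` would correspond to the local notion.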

Proof of safety for non-inferiority: normal distributed endpoints II
- Example 13: Hematology parameter hemoglobin of the 13-week study on sodium dichromate dihydrate in female F344 rats, final data at day 93 only
- A priori we assume that only decreasing hemoglobin values are hazardous
- The box-plots indicate an approximately symmetric distribution, but variance heterogeneity occurs
- I.e. the one-sided lower limits are of interest for claiming non-inferiority, but we do not know any safety threshold
- Analyse using R!
76 / 96

Proof of safety for non-inferiority: normal distributed endpoints III
hema <- read.csv("hema.csv")
hema$dose <- as.factor(hema$dose)
hemaf <- hema[hema$sex == "female", ]
boxplot(hb ~ dose, data = hemaf, outline = FALSE, main = "hemoglobin")
points(jitter(as.integer(hemaf$dose)), hemaf$hb, cex = 1, pch = 12)
myhb <- lm(hb ~ dose, data = hemaf)
library(multcomp)
# one-sided Dunnett comparisons to control; alternative = "greater" yields lower confidence limits
exa13 <- glht(myhb, linfct = mcp(dose = "Dunnett"), alternative = "greater")
plot(exa13, main = "hb - proof of non-inferiority")
77 / 96

Proof of safety for non-inferiority: normal distributed endpoints IV
Interpretation: using δ = 1.0, the lower doses 62.5, ..., 500 mg/kg are harmless (non-inferior with respect to control), but the highest dose is not. The post-hoc choice of δ on the scale of hemoglobin is the problem...
78 / 96

Proof of safety for non-inferiority: ratios to control I
- Relative changes are easier to interpret, particularly for multiple endpoints, e.g. in chronic studies
- One-sided simultaneous $(1-\alpha)$ confidence intervals for ratios to control according to [DBGH04] using library(mratios)
- Example 14: Hemoglobin; again the lower limits are of interest. Interpretation using a relative threshold θ = 0.8 may be more appropriate
79 / 96

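Proof of safety for non-inferiority: ratios to control II
library(mratios)
# ratios to control, allowing heterogeneous variances; one-sided lower limits
exa14 <- sci.ratioVH(hb ~ dose, data = hemaf, type = "Dunnett", alternative = "greater")
# the plot call is cut off on the slide after 'main="hb-'
plot(exa14, rho0 = c(0.9, 1), rho0lty = c(2, 1), rho0col = c("blue", "black"), main="hb-
80 / 96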

Proof of safety for non-inferiority: proportions I
- Rates are rather typical in toxicological studies, e.g. histopathological findings, mortality, tumor rates
- For sample sizes of $n_i = 50, \dots, 10$ there is no hope for valid $(1-\alpha)$ Wald intervals
- And, for all proportions a one-sided alternative for an increase is appropriate, never a two-sided alternative
- Alternative: [RM99], but for two-sample comparisons only
- Alternative for one-sided confidence intervals: add-1 intervals according to Agresti [AC00] can be used for moderate sample sizes [SV09], i.e. instead of $p_i = r_i/n_i$ we use $\tilde p_i = (r_i + 0.5)/(n_i + 1)$, available in the R library MCPAN
81 / 96

Proof of safety for non-inferiority: proportions II
- Example 15: In a chronic study with the design (0, 10, 50, 100 mg/kg), 4, 1, 6 and 8 of the 40, 20, 20 and 20 randomized animals died
- Analyse using R!
library(MCPAN)
died <- c(4, 1, 6, 8)
animals <- c(40, 20, 20, 20)
dosesn <- c("0", "10", "50", "100")
# the next call is cut off on the slide after 'method="'
exa15 <- binomRDci(n = animals, x = died, names = dosesn, alternative = "less", method="
plot(exa15, main = "mortality rates - proof of safety")
- Interpretation
82 / 96
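As a worked illustration of the add-1 estimator $\tilde p_i = (r_i + 0.5)/(n_i + 1)$ on the mortality counts of Example 15 (plain arithmetic on the numbers given on the slide, rounded to three decimals):

```latex
\[
\tilde p_{0}   = \frac{4 + 0.5}{40 + 1} \approx 0.110, \quad
\tilde p_{10}  = \frac{1 + 0.5}{20 + 1} \approx 0.071, \quad
\tilde p_{50}  = \frac{6 + 0.5}{20 + 1} \approx 0.310, \quad
\tilde p_{100} = \frac{8 + 0.5}{20 + 1} \approx 0.405
\]
```

Compared with the raw rates 0.100, 0.050, 0.300 and 0.400, each estimate is pulled slightly toward 0.5, and the shift is larger in the smaller groups.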

Two-sided hypotheses: claiming equivalence I
- Bofinger's [BB95] procedure for claiming equivalence of several treatments with respect to a control group
- Hypotheses: $H_{0i}: |\mu_i - \mu_0| \ge \delta$ (harmful) vs. $H_{1i}: |\mu_i - \mu_0| < \delta$ (harmless), $1 \le i \le k$, with a relevant threshold $\delta > 0$.
- The null hypotheses can be written as $H_{0i}: \mu_i - \mu_0 \le -\delta$ or $\mu_i - \mu_0 \ge \delta$ $(1 \le i \le k)$.
83 / 96

Two-sided hypotheses: claiming equivalence II
The limits of the two-sided $(1-\alpha)100\%$ simultaneous confidence intervals are given as
\[
\begin{aligned}
\hat\delta_i^{(l)} &= \min\left( \bar X_i - \bar X_0 - t_{k,1-\alpha}(\nu, R)\, S \sqrt{\frac{1}{n_i} + \frac{1}{n_0}},\; 0 \right), \\
\hat\delta_i^{(u)} &= \max\left( \bar X_i - \bar X_0 + t_{k,1-\alpha}(\nu, R)\, S \sqrt{\frac{1}{n_i} + \frac{1}{n_0}},\; 0 \right)
\end{aligned}
\qquad (1 \le i \le k) \qquad (7)
\]
with the lower $(1-\alpha)$ quantile $t_{k,1-\alpha}(\nu, R)$ of an underlying k-variate t-distribution with $\nu = \sum_{i=0}^{k}(n_i - 1)$ degrees of freedom and correlation matrix $R = (r_{im})_{i,m}$ according to Tong [Ton69] and Bofinger and Bofinger [BB95],
84 / 96

Two-sided hypotheses: claiming equivalence III
where
\[
r_{im} =
\begin{cases}
1, & i = m, \\
\rho, & i \ne m,\; i, m \in \{1, 2, \dots, t\} \text{ or } i, m \in \{t+1, t+2, \dots, k\}, \\
-\rho, & i \ne m,\; i \in \{1, 2, \dots, t\} \text{ and } m \in \{t+1, t+2, \dots, k\}, \\
-\rho, & i \ne m,\; m \in \{1, 2, \dots, t\} \text{ and } i \in \{t+1, t+2, \dots, k\}
\end{cases}
\qquad (8)
\]
with
\[ \rho = \frac{1}{\sqrt{(1 + n_0/n_1)(1 + n_0/n_1)}} \qquad (9) \]
and $t = \lfloor k/2 \rfloor$ (the integral part of $k/2$). Note that the approach of Bofinger and Bofinger [BB95] is only correct when the non-control dose groups are balanced. In the unbalanced case, a single $\rho$ for all $i$ and $m$ cannot be derived, and a Bonferroni-adjusted TOST approach is an alternative.
85 / 96
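The block structure of the correlation matrix in Equation (8) can be made concrete with a short base-R sketch. Two points are assumptions of this sketch rather than statements taken from the slide: entries between the two index blocks are given a negative sign, and for balanced dose groups of size `n` the correlation is taken as `1 / (1 + n0 / n)`. The function name `bofinger_corr` is illustrative.

```r
# Correlation matrix for k dose groups of common size n and control size n0.
# ASSUMPTION: +rho within the blocks {1..t} and {t+1..k}, -rho across them,
# with t = floor(k / 2) and rho = 1 / (1 + n0 / n) for balanced dose groups.
bofinger_corr <- function(k, n0, n) {
  rho   <- 1 / (1 + n0 / n)
  t     <- floor(k / 2)
  block <- ifelse(seq_len(k) <= t, 1, -1)   # block membership as a sign
  R     <- rho * outer(block, block)        # sign product gives +rho / -rho
  diag(R) <- 1
  R
}

R5 <- bofinger_corr(k = 5, n0 = 10, n = 10)
```

With five dose groups, `t = 2`, so groups 1 and 2 form one block and groups 3 to 5 the other. Such a matrix could then be passed to a multivariate t quantile routine to obtain $t_{k,1-\alpha}(\nu, R)$; that step is not shown here.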

Two-sided hypotheses: claiming equivalence IV
- Example 16: Evaluation of relative thymus weights (×1000) using ETC
- Analyse using R!: parametric Bofinger approach, assuming that hazardous changes in thymus weights are possible in both directions (i.e. increase or decrease), relevance threshold δ = 0.15
library(ETC)
organ <- read.csv("organ.csv")
organ$dose <- as.factor(organ$dose)
tym <- organ[organ$organ == "thymus", ]
# the three lines that end abruptly below are cut off on the slide
tym$tym_r_weight <- tym$weight*1
boxplot(tym_r_weight ~ dose, data = tym, outline = FALSE, main="relative thym
points(jitter(as.integer(tym$dose)), tym$tym_r_weight, cex = 1, pch = 21)
summary(etc.diff(tym_r_weight ~ dose, data = tym, margin.up = 0.15, method="bof
86 / 96

Two-sided hypotheses: claiming equivalence V
          estimate   statistic  lower     upper    p.value
62.5-0    -0.006887  -2.919     -0.12213  0.10836  0.01194
125-0      0.021996  -2.611     -0.09325  0.13724  0.02636
250-0      0.049691  -2.046     -0.06555  0.16494  0.09822
500-0     -0.078376  -1.461     -0.19362  0.03687  0.29485
1000-0     0.026811  -2.513     -0.08843  0.14206  0.03381
Interpretation
87 / 96

Two-sided hypotheses: claiming equivalence VI
- A Bonferroni-TOST approach [HVH08]
- Bofinger's approach is limited to Gaussian distributed endpoints, variance homogeneity and balanced sample sizes in the dose groups: rather restrictive assumptions for real toxicological studies.
- Taking the special structure of the correlation matrix in Equation (8) into account, it becomes clear that a Bonferroni-type alternative [HKH99] does not lose much power even when all assumptions are fulfilled, is much simpler, and can be generalized to several situations.
- We denote it here as the Bonferroni-TOST approach because it is based on the two one-sided t-tests (TOST) and the multiplicity adjustment according to Bonferroni.
- The loss in power under the margin of the null hypothesis (where the power equals the type I error) was about 7%, and for settings under the alternative hypothesis about 3% or less [HVH08]
88 / 96

Two-sided hypotheses: claiming equivalence VII
- The Bonferroni-TOST approach for ratios to control: Because of the specific structure of the correlation matrix in the Bofinger approach, there is no hope for a ratio-based version, although ratios to control are appropriate for multiple endpoints. Fieller-type confidence intervals can be used accordingly
- Bonferroni-TOST approach when variance heterogeneity occurs: Variance heterogeneity occurs sometimes in toxicological studies, e.g. increasing variance with increasing effects in the dose groups, where the control is a zero-dose group. Two-sided $(1-2\alpha)100\%$ Welch-type confidence intervals are estimated for the individual comparisons $D_i - C$, each at the Bonferroni-adjusted level $\alpha/k$
- Non-parametric Bonferroni-TOST: Non-parametric exact Hodges-Lehmann intervals [HL63] or intervals for relative effects can be used
89 / 96
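The Welch-type Bonferroni-TOST variant needs nothing beyond `t.test()` from base R. The sketch below uses simulated data with increasing variances and an invented margin `delta = 0.15`; both are assumptions for illustration, and the code is not the ETC implementation.

```r
set.seed(2)
# Hypothetical data: control ("0") and three dose groups, variances unequal
dat <- data.frame(
  dose = factor(rep(c("0", "D1", "D2", "D3"), each = 10)),
  y    = c(rnorm(10, 1.00, 0.05), rnorm(10, 1.01, 0.05),
           rnorm(10, 1.02, 0.08), rnorm(10, 1.20, 0.12))
)

alpha <- 0.05
delta <- 0.15                           # hypothetical equivalence margin
doses <- levels(dat$dose)[-1]
k     <- length(doses)

# Two-sided (1 - 2 * alpha / k) Welch interval for each D_i - C;
# equivalence is claimed when the whole interval lies inside (-delta, delta).
# The logical decision is stored as 0/1 in the numeric result matrix.
tost <- t(sapply(doses, function(d) {
  ci <- t.test(dat$y[dat$dose == d], dat$y[dat$dose == "0"],
               var.equal = FALSE, conf.level = 1 - 2 * alpha / k)$conf.int
  c(lower = ci[1], upper = ci[2],
    equivalent = ci[1] > -delta && ci[2] < delta)
}))
```

Each row of `tost` holds the interval for one dose group and a 0/1 flag for the equivalence claim. Global safety in the intersection-union sense would require the flag to be 1 in every row.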