Maximally selected chi-square statistics for ordinal variables

Similar documents
Maximally selected chi-square statistics for at least ordinal scaled variables

Boulesteix: Maximally selected chi-square statistics and binary splits of nominal variables

Random Forests for Ordinal Response Data: Prediction and Variable Selection

THE ROYAL STATISTICAL SOCIETY HIGHER CERTIFICATE

Inferential Statistics

CS6375: Machine Learning Gautam Kunapuli. Decision Trees

Generalized Maximally Selected Statistics

Introduction to Machine Learning CMU-10701

Discrete Multivariate Statistics

9/2/2010. Wildlife Management is a very quantitative field of study. throughout this course and throughout your career.

Lecture 9. Selected material from: Ch. 12 The analysis of categorical data and goodness of fit tests

epub WU Institutional Repository

Generalization to Multi-Class and Continuous Responses. STA Data Mining I

the tree till a class assignment is reached

Math 1040 Final Exam Form A Introduction to Statistics Fall Semester 2010

INSTITUTE OF ACTUARIES OF INDIA

Lecture 01: Introduction

Statistical aspects of prediction models with high-dimensional data

Small n, σ known or unknown, underlying nongaussian

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

Testing Independence

Bias Correction in Classification Tree Construction ICML 2001

EDF 7405 Advanced Quantitative Methods in Educational Research. Data are available on IQ of the child and seven potential predictors.

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007)

The Design and Analysis of Benchmark Experiments Part II: Analysis

Hypothesis testing. 1 Principle of hypothesis testing 2

Detection of Uniform and Non-Uniform Differential Item Functioning by Item Focussed Trees

Frequency Distribution Cross-Tabulation

Regression tree methods for subgroup identification I

STATISTICS ( CODE NO. 08 ) PAPER I PART - I

Hypothesis testing:power, test statistic CMS:

Decision Trees. CS57300 Data Mining Fall Instructor: Bruno Ribeiro

General Comments Proofs by Mathematical Induction

CDA Chapter 3 part II

Contents. Acknowledgments. xix

Practice problems from chapters 2 and 3

Data Mining. CS57300 Purdue University. Bruno Ribeiro. February 8, 2018

Empirical Risk Minimization, Model Selection, and Model Assessment

STAT Section 3.4: The Sign Test. The sign test, as we will typically use it, is a method for analyzing paired data.

Class Notes: Week 8. Probit versus Logit Link Functions and Count Data

8 Nominal and Ordinal Logistic Regression

Growing a Large Tree

WEIGHTED LIKELIHOOD NEGATIVE BINOMIAL REGRESSION

APPENDIX B Sample-Size Calculation Methods: Classical Design

General Comments on Proofs by Mathematical Induction

Machine Learning & Data Mining

Learning with multiple models. Boosting.

SALES AND MARKETING Department MATHEMATICS. 3rd Semester. Probability distributions. Tutorials and exercises

Chapter Fifteen. Frequency Distribution, Cross-Tabulation, and Hypothesis Testing

14.30 Introduction to Statistical Methods in Economics Spring 2009

Duke University. Duke Biostatistics and Bioinformatics (B&B) Working Paper Series. Randomized Phase II Clinical Trials using Fisher s Exact Test

26 Chapter 4 Classification

Contents Kruskal-Wallis Test Friedman s Two-way Analysis of Variance by Ranks... 47

Logistic regression: Miscellaneous topics

Comparison of Two Samples

Tutorial 6: Tutorial on Translating between GLIMMPSE Power Analysis and Data Analysis. Acknowledgements:

Question. Hypothesis testing. Example. Answer: hypothesis. Test: true or not? Question. Average is not the mean! μ average. Random deviation or not?

Generalized Linear Modeling - Logistic Regression

Decision Tree Learning Lecture 2

STATISTICS ANCILLARY SYLLABUS. (W.E.F. the session ) Semester Paper Code Marks Credits Topic

Basic Business Statistics, 10/e

STA6938-Logistic Regression Model

Chapte The McGraw-Hill Companies, Inc. All rights reserved.

NAG Library Chapter Introduction. G08 Nonparametric Statistics

Estimators for the binomial distribution that dominate the MLE in terms of Kullback Leibler risk

HANDBOOK OF APPLICABLE MATHEMATICS

Contingency Tables. Safety equipment in use Fatal Non-fatal Total. None 1, , ,128 Seat belt , ,878

Basic methods for simulation of random variables: 1. Inversion

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

* * MATHEMATICS (MEI) 4767 Statistics 2 ADVANCED GCE. Monday 25 January 2010 Morning. Duration: 1 hour 30 minutes. Turn over

Model Selection in GLMs. (should be able to implement frequentist GLM analyses!) Today: standard frequentist methods for model selection

The computationally optimal test set size in simulation studies on supervised learning

Analysis of Categorical Data. Nick Jackson University of Southern California Department of Psychology 10/11/2013

Institute of Actuaries of India

Problem #1 #2 #3 #4 #5 #6 Total Points /6 /8 /14 /10 /8 /10 /56

Statistical Hypothesis Testing with SAS and R

Multinomial Logistic Regression Models

Chi-Squared Tests. Semester 1. Chi-Squared Tests

Marginal Screening and Post-Selection Inference

An Analysis of College Algebra Exam Scores December 14, James D Jones Math Section 01

In Defence of Score Intervals for Proportions and their Differences

Test Yourself! Methodological and Statistical Requirements for M.Sc. Early Childhood Research

Models, Data, Learning Problems

Textbook Examples of. SPSS Procedure

Lecture Slides. Elementary Statistics. by Mario F. Triola. and the Triola Statistics Series

Lecture Slides. Section 13-1 Overview. Elementary Statistics Tenth Edition. Chapter 13 Nonparametric Statistics. by Mario F.

2018 CS420, Machine Learning, Lecture 5. Tree Models. Weinan Zhang Shanghai Jiao Tong University

Statistical Methods for Astronomy

COPYRIGHTED MATERIAL CONTENTS. Preface Preface to the First Edition

CS 6375 Machine Learning

FULL LIKELIHOOD INFERENCES IN THE COX MODEL

Binary Logistic Regression

Computational Systems Biology: Biology X

Entering and recoding variables

Investigation of goodness-of-fit test statistic distributions by random censored samples

Finding the Gold in Your Data

Salt Lake Community College MATH 1040 Final Exam Fall Semester 2011 Form E

Glossary for the Triola Statistics Series

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Intelligent Data Analysis. Decision Trees

Transcription:

bimj header will be provided by the publisher Maximally selected chi-square statistics for ordinal variables Anne-Laure Boulesteix 1 Department of Statistics, University of Munich, Akademiestrasse 1, D-80799 Munich, Germany. Summary The association between a binary variable Y and a variable X having an at least ordinal measurement scale might be examined by selecting a cutpoint in the range of X and then performing an association test for the obtained 2 2 contingency table using the chi-square statistic. The distribution of the maximally selected chi-square statistic (i.e. the maximal chi-square statistic over all possible cutpoints) under the null-hypothesis of no association between X and Y is different from the known chi-square distribution. In the last decades, this topic has been extensively studied for continuous X variables, but not for non-continuous variables of at least ordinal measurement scale (which include e.g. classical ordinal or discretized continuous variables). In this paper, we suggest an exact method to determine the finite-sample distribution of maximally selected chi-square statistics in this context. This novel approach can be seen as a method to measure the association between a binary variable and variables having an at least ordinal scale of different types (ordinal, discretized continuous, etc). As an illustration, this method is applied to a new data set describing pregnancy and birth for 811 babies. Key words: Association test, contingency table, exact distribution, variable selection, selection bias. 1 Introduction The following situation is not uncommon in medical data analysis. An at least ordinal scaled variable X is suspected by the investigator to be associated with a binary variable Y. Let (x i, y i ) i=1,...,n denote N independently and identically distributed realizations of the variables X and Y. N 1 and N 2 denote the numbers of observations with y i = 1 and y i = 2, respectively. In this paper, we derive the exact distribution of the maximally selected χ 2 statistic for such X variables. This distribution can be used to measure the association between at least ordinal scaled X variables and the binary variable Y and allows the comparison of several X variables with different numbers of possible values. If X were nominal scaled, association tests such as the asymptotic χ 2 test or Fisher s exact test for small samples (for a binary X) could be employed to examine the association between X and Y using the sample (x i, y i ) i=1,...,n. If X were continuous, tests based on the normality assumption such as the two-samples t-test or rank tests such as Wilcoxon s rank sum test for two samples may be applied. The case of an at least ordinal scaled but not continuous variable is much more difficult to handle. Without loss of generality, such a variable X can be assumed to take K distinct levels a 1,..., a K R in the e-mail:anne-laure.boulesteix@stat.uni-muenchen.de, Phone:+49 89-2180-3466, Fax: +49 89-2180-3466 1 Copyright line will be provided by the publisher

sample (x i, y i ) i=1,...,n, where 2 K N and a 1 < < a K. An option to measure the association between X and Y is to transform X into binary variables X (k) for k = 1,..., K 1 as follows X (k) = 0 if X a k X (k) = 1 otherwise. Fisher s exact test or the asymptotic χ 2 test for 2 2 contingency tables may then be applied to each variable X (k) successively. However, one must be careful when interpreting the p-values output by these tests. Selecting the X (k) yielding the smallest p-value and claiming that a k is a relevant cutpoint of X because the p-value is low would be an inappropriate approach. This issue has been extensively studied in the case of continuous variables. Miller and Siegmund (1982) show that the maximally selected χ 2 statistic converges to a normalized Brownian bridge under the null-hypothesis of no association between X and Y. The problem of the asymptotic null distribution of maximally selected statistics for binary, ordered, quantitative and censored response variables is discussed in Lausen and Schumacher (1992, 1996); Lausen, Lerche and Schumacher (2002). Hothorn and Lausen (2003) derive a lower bound for the exact distribution of maximally selected rank statistics. The distribution of the maximally selected χ 2 statistic in the small sample case is examined by Halpern (1982) in a simulation study, whereas Koziol (1991) derives the exact distribution of maximally selected χ 2 statistics given N 1 and N 2 using Durbin s combinatorial approach (Durbin, 1971). As suggested by Koziol (1991), the determinant formula by Steck (1970) might also be used for the same purpose. Maximally selected χ 2 statistics in k 2 contingency tables are investigated in Betensky and Rabinowitz (1999). The distributions of other maximally selected statistics such as the statistic used in Fisher s exact test (Halpern, 1999) or McNemar s statistic (Rabinowitz and Betensky, 2000) have also been studied in the last few years. In this paper, we are interested in the finite-sample distribution of the χ 2 statistic for all types of at least ordinal scaled variables. Via simulations, we show in Section 3 that Koziol s approach is inappropriate to measure association between a binary variable Y and a non-continuous variable X with equal realizations in the sample (x i, y i ) i=1,...,n (K < N). More specifically, under the null-hypothesis of no association between X and Y, Koziol s approach tends to detect more association if K is large. In the present paper, we are concerned with at least ordinal scaled non-continuous variables, which include e.g. classical ordinal variables (for instance a variable with possible values very good, good, bad, very bad ), discrete metric variables (for instance the number of children in a family), or essentially continuous variables which are measured in a discretized form in practice. For instance, the height of a newborn baby is often given in centimeters and can thus take only a few values ranging from about 47 to 54 cm. For such discretized continuous variables, we generally have K < N if N is large enough, whereas continuous variables may be assumed to take N distinct values in the sample (x i, y i ) i=1,...,n (K = N). In the framework of maximally selected statistics, a variable with K < N cannot be handled as a variable with K = N. The fact that some values of X are taken several times in the sample must be taken into account when deriving the distribution of the maximally selected χ 2 statistic, since the number of possible cutpoints is K 1. In this paper, we propose a novel method to derive the exact

finite-sample distribution of the maximally selected χ 2 statistic for all types of at least ordinal scaled variables. This distribution depends on parameters N 1, N 2 and m 1,..., m K, which are defined as n m k = I(x i = a k ), for k = 1,..., K, i=1 where I is the indicator function. Our novel approach is an adaptation of the procedure proposed by Koziol (1991) and uses Durbin s combinatorial approach. The exact distribution of the maximally selected χ 2 statistic can be used to compute a measure of association between X and a binary variable Y using a sample (x i, y i ) i=1,...,n. This method has potentially many applications, especially in the field of medicine. The paper is organized as follows. In Section 2, a method to compute the exact distribution function of the maximally selected χ 2 statistic given N 1, N 2, m 1,..., m K is proposed. In Section 3, we show via simulations that our method is more appropriate than Koziol s method to compare variables with different K. In Section 4, we illustrate our method by measuring the association between various binary variables and at least ordinal scaled variables from a data set describing pregnancy and delivery for 811 babies. 2 Distribution of the maximally selected χ 2 statistic 2.1 Framework In this section, X is assumed to be a variable with an at least ordinal measurement scale. Y is a binary variable with levels Y = 1, 2. For a given sample (x i, y i ) i=1,...,n, let a 1 < < a K denote the different values taken by X. We consider the following 2 2 contingency table, for k = 1,..., K 1: X a k X > a k Σ Y = 1 n 1, ak n 1,>ak N 1 Y = 2 n 2, ak n 2,>ak N 2 Σ n., ak = k j=1 m j n.,>ak = K j=k+1 m j where N 1 and N 2 denote the numbers of realizations with y i = 1 and y i = 2, respectively and m j denotes the number of realizations with X = a j in the sample (x i, y i ) i=1,...,n. The corresponding χ 2 statistic can be computed as χ 2 k = N(n 1, a k n 2,>ak n 1,>ak n 2, ak ) 2 N 1 N 2 n., ak n.,>ak. (1) In this context, we define the maximally selected χ 2 statistic as χ 2 max = arg max k=1,...,k 1 χ2 k. (2) The aim of this paper is to derive the distribution of χ 2 max given N 1, N 2, m 1,..., m K under the nullhypothesis H 0 of no association between X and Y. For simplicity, we will omit N 1, N 2, m 1,..., m K in the following: F denotes the distribution function of χ 2 max under H 0 given the parameters N 1, N 2, m 1,..., m K : F (d) = P H0 (χ 2 max d). N

2.2 Method According to Miller and Siegmund (1982), the χ 2 statistic obtained for the binary variable X (k) can be formulated as χ 2 k = A2 k, where A k = N N 1 ( n 2, a k N 2 n., a k N )/ n., ak N (1 n., a k N )( 1 + 1 ), (3) N 1 N 2 for all k = 1,..., K 1. Let d be an arbitrary strictly positive real number. After simple computations, one obtains from Equation 3 that χ 2 max d if and only if all the points with coordinates (n. ak, n 2, ak ) for k = 1,..., K 1 lie on or above the function lower d (x) = N 2x N N 1N 2 d x N N (1 x N )( 1 + 1 ) (4) N 1 N 2 and on or below the function upper d (x) = N 2x N + N 1N 2 d x N N (1 x N )( 1 + 1 ). (5) N 1 N 2 These curves may be denoted as boundaries. Let x (1) x (N) denote the ordered realizations of X. Let N 2 (i) denote the number of realizations with Y = 2 and X x (i). The functions lower d (x) and upper d (x) can also be represented on the graph (i, N 2 (i)). A sufficient and necessary condition for χ 2 max d is that the graph (i, N 2 (i)) does not pass through any point of integer coordinates (i, j) with i = n., ak and upper d (i) < j min(n 2, i) or max(0, i N 1 ) j < lower d (i), where k = 1,..., K 1. Let us denote these points as B 1,..., B q and their coordinates as (i 1, j 1 ),..., (i q, j q ), where B 1,..., B q are labeled in order of increasing i and increasing j within each i. Under the nullhypothesis of no association between X and Y, the probability that the path (i, N 2 (i)) passes through at least one of the points B 1,..., B q can be computed using Durbin s combinatorial approach (Durbin, 1971), as described by Koziol (1991). This approach is based on a Markov representation of N 2 (i) as the path of a binomial process with constant probability of success and with unit jumps, conditional on N 2 (N) = N 2. Here, we follow Koziol s formulation. Let P s denote the set of the paths from (0, 0) to B s that do not pass through points B 1,..., B s 1 and b s the number of paths in P s. Since the sets P s, s = 1,..., q are mutually disjoint, b s, s = 1,..., q can be computed recursively as b 1 = b s = i 1 j 1 i s j s, s 1 r=1 i s i r j s j r b r, s = 2,..., q.

The number of paths from (0, 0) to (N, N 2 ) that pass through B s, s = 1,..., q but not through B 1,..., B s 1 is then given as N i s N 2 j s b s. P H0 (χ 2 max > d) equals the probability that the path (i, N 2 (i)) passes through at least one of the points B 1,..., B q and is obtained as P H0 (χ 2 max > d) = N N 2 1 q N i s b s. (6) N 2 j s s=1 It follows F (d) = 1 N N 2 1 q N i s b s. (7) N 2 j s s=1 Our method, which is strongly related to the procedure described in Koziol (1991), allows explicitly K < N. The two approaches differ in the definition of the points B 1,..., B q. In Koziol (1991), the boundaries are formed by the points B 1,..., B q of coordinates (i, j) satisfying i = max {x N : j > N 2x N + N 1N 2 d x N N (1 x N )( 1 + 1 )} and 1 j N 2, N 1 N 2 or i = min {x N : j < N 2x N N 1N 2 d x N N (1 x N )( 1 + 1 )} and 0 j N 2 1. N 1 N 2 As an example, the boundaries obtained with Koziol s method and our new method are represented in Figure 1 for N 1 = 30, N 2 = 40, m 1 = 25, m 2 = 10, m 3 = 25, m 4 = 10 and d = 9. It can seen that the points B 1,..., B q defined by Koziol form a closed corridor. With Koziol s approach, paths which pass through a boundary point with abscissa i 0 such that x (i0) = x (i0+1) are counted for the computation of F (d), although they do not correspond to any concrete possible cutpoint. Thus, this approach is inappropriate for X variables with possibly equal realizations. In our approach, only the paths that would yield χ 2 k > d for at least one k are counted. These are the paths that are strictly above the upper boundary or strictly below the lower boundary at abscissa n., ak (k = 1,..., K 1) only. Note that the obtained distribution function F is the same with both methods in the special case K = N. In this special case, Koziol s approach is recommended, since computationally faster. The formula given in Equation 6 can be used to measure the association between a binary variable Y and an at least ordinal scaled variable X using a sample (x i, y i ) i=1,...,n as follows. For all k = 1,..., K 1, the value χ 2 k of the χ2 statistic for the sample (x i, y i ) i=1,...,n is computed using Equation 1 and the maximal χ 2 statistic χ 2 max is obtained from Equation 2. F (χ 2 max) is a measure of association between X and Y. In the following section, we show via simulations that the boundaries defined in our approach are more appropriate to measure association between a binary variable Y and a non-continuous variable X than Koziol s boundaries.

Obtained boundaries: an example N 2 (i) 0 10 20 30 40 50 60 70 Koziol New method 0 10 20 30 40 50 60 70 i Fig. 1 Boundaries obtained with Koziol s approach and our new approach for d = 9 N 1 = 30, N 2 = 40, m 1 = 25, m 2 = 10, m 3 = 25, m 4 = 10. 3 Simulations In this part, we examine two important features of our new method and compare it to Koziol s approach via simulations: Our method as well as Koziol s method can be used to identify predictor variables which are strongly associated with a binary variable Y. Thus, they can be seen as variable selection methods. Section 3.1 deals with the selection bias of these methods in the null case (no association between the variables X and Y ). The power of both approaches to identify predictor variables is examined in Section 3.2 for different measurement precisions of an essentially continuous X variable. 3.1 Null case In the whole section, we make the hypothesis of no association between the binary variable Y and some at least ordinal scaled variables X 1, X 2, X 3. Suppose that we compute a measure of association for each pair (X 1, Y ), (X 2, Y ) and (X 3, Y ) based on a given sample and select the variable X i with the highest association measure. An effective measure of association is expected to select X 1, X 2 and X 3 with probability 1 3. In this section, we show that Koziol s approach selects variables with large K more often than variables with small K. In contrast, the frequency of selection does not depend on the number of different values in the sample with our method. We present only the simulation results obtained with classical ordinal predictor variables for which the set of possible values {a 1,..., a K } does not depend

on the specific sample. However, similar results could be obtained with essentially continuous variables which are measured as discrete variables. The sample size is set successively to N = 50 and N = 100. For each value of N, N run = 1000 data sets containing N independent observations are simulated. Different distributions of Y are examined successively: First case (I) The two classes have equal probabilities: p 1 = P (Y = 1) = 0.5 p 2 = P (Y = 2) = 0.5. Second case (II) The two classes have non-equal probabilities: p 1 = P (Y = 1) = 0.7 p 2 = P (Y = 2) = 0.3. X 1, X 2 and X 3 are mutually independent and distributed as follows. The set of possible values is {1, 2, 3} for X 1, {1, 2, 3, 4, 5, 6, 7} for X 2 and {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} for X 3. Let K i denote the number of possible values of variable X i. We study two cases successively for the distribution of the variables X i : First case (A) For each variable X i, the different levels have equal probability: P (X i = 1) =... = P (X i = K i ) = 1 K i Second case (B) For each variable X i, for k = 1,..., K P (X i = k) = c 0.1 if k is odd, P (X i = k) = c 0.2 if k is even, where c is a normalizing factor such that K i k=1 P (X i = k) = 1. Thus, four configurations (I/A,I/B,II/A and II/B) are studied. For each configuration and sample size (N = 50 and N = 100), N run = 1000 simulation data sets are drawn randomly. For each simulated data set and each variable, χ 2 max is determined and F (χ 2 max) is computed using successively Koziol s boundaries and the new boundaries defined in Section 2. The variable(s) with the highest F (χ 2 max) is (are) selected. The obtained frequencies of selection for each configuration (I/A,I/B,II/A and II/B) and each N can be found in Table 1. As can be seen from Table 1, variable selection using Koziol s method is strongly

X 1 X 2 X 3 N = 50, p 1 = 0.5 36, 22 33, 38 31, 45 N = 50, p 1 = 0.7 37, 23 30, 33 34, 47 N = 100, p 1 = 0.5 33, 20 33, 34 34, 47 N = 100, p 1 = 0.7 34, 19 33, 34 34, 48 N = 50, p 1 = 0.5 35, 22 32, 35 33, 45 N = 50, p 1 = 0.7 37, 21 32, 34 31, 46 N = 100, p 1 = 0.5 33, 18 32, 34 35, 49 N = 100, p 1 = 0.7 34, 20 33, 34 34, 47 Table 1 Classical ordinal variables: Frequency of selection (in %) of X 1, X 2, X 3 for different N, different p 1, p 2, with our method (normal font) and Koziol s method (italic). Top: Case A (equal probabilities), bottom: Case B (non-equal probabilities). biased. Variables with large K are selected more often. With our new method, no bias is observed. Similar results are obtained for the different values of N and p 1 and for the different distributions of X 1, X 2, X 3. The bias can be visualized for case I/A with N = 100 in Figure 2. 3.2 Power study Another important topic is the capacity of our approach to detect predictors which are associated with the response variable Y. In the whole section, we consider discretized continuous X variables which have different distributions in the two classes Y = 1 and Y = 2. We aim to show that our approach has a higher power than Koziol s method and that the gain of power obtained by using our approach increases when the number of values taken by the X variable decreases. In the whole power study, we simulate data sets containing a binary response variable Y and an informative predictor variable X. The response variable Y is generated as in Section 3.1, case I (i.e. from the distribution P (Y = 1) = P (Y = 2) = 0.5). The predictor variable X is generated as follows. We consider a random variable Z which is normally distributed with mean 0 within class Y = 1 and µ within class Y = 2: Z Y = 1 N (0, 1) Z Y = 2 N (µ, 1) X is defined as X = round(z/ν) ν,

Ordinal variables (case A), n=100, p 1 =0.5 Frequency of selection 0.0 0.1 0.2 0.3 0.4 X 1 new X 1 Koziol X 2 new X 2 Koziol X 3 new X 3 Koziol Fig. 2 Barplot representing the frequencies of selection of X 1, X 2 and X 3 for N = 100 and p 1 = 0.5, with our method (gray) and with Koziol s method (black) for classical ordinal variables (Case A: equal probabilities). where ν is a real number corresponding to the measurement precision and round(x) denotes the integer approximation of x. In our simulation, we set ν to ν = 0.5, 1, 2 and µ to µ = 0, 0.25, 0.5, 0.75, 1, 1.25, 1.5 successively. For each value of the parameters µ and ν, we generate N run = 1000 data sets with N = 50 independent observations of X and Y and measure the association between X and Y by computing F (χ 2 max) using successively our boundaries and Koziol s boundaries. The proportion of runs for which significant association (i.e. F (χ 2 max) α, where α equals e.g. 0.95) is detected is then computed and denoted as empirical detection rate. It is represented against µ in Figure 3 for ν = 0.5 (top), ν = 1 (middle) and ν = 2 (bottom) with α = 0.95. Similar results could be obtained with other confidence levels, e.g. α = 0.9 or α = 0.99. As expected, the empirical detection rate increases with µ with both approaches and equals 1 for large values of µ. For all three values of ν, the empirical detection rate is higher with our approach than with Koziol s approach. Moreover, the difference between both approaches increases with ν. Whereas both approaches are equivalent for continuous variables, Koziol s approach performs poorly for variables which take few values in the available sample. The difference between both approaches also depends on the parameter µ. Whereas our approach performs much better than Koziol s approach for intermediate values of µ, almost no difference is observed when X and Y are strongly associated (i.e. for very large µ).

ν=0.5 Detection rate 0.0 0.6 0.0 0.5 1.0 1.5 µ New method Koziol ν=1 Detection rate 0.0 0.6 0.0 0.5 1.0 1.5 µ ν=2 Detection rate 0.0 0.6 0.0 0.5 1.0 1.5 µ Fig. 3 Proportion of runs for which significant association (F (χ 2 max) 0.95) is detected using Koziol s method (dashed) and our method (normal lines) with ν = 0.5 (top), ν = 1 (middle) and ν = 2 (bottom). There is trade-off between type I and type II error: the detection rate is larger with our approach than with Koziol s approach for all values of µ, even for µ = 0, which may be seen as an inconvenience of our approach. However, a major advantage of our method is that the detection rate does not depend on the measurement precision, whereas it does with Koziol s method: our approach is especially appropriate to compare several predictor variables taking different numbers of values in the available sample. 4 Application to pregnancy and birth data To illustrate our method, we consider a pregnancy and birth data set which we collected by ourselves directly from internet users recruited on french-speaking pregnancy and birth websites. The investigated variables include the binary variables CESA, EPISIO, INDUCED, SEX and the discretized continuous variables WEIGHTBB, WEIGHTMO, AGE and DURATION (see Table 4 for a description). Moreover, we consider the transformed variables WEIGHTBB3, WEIGHTMO4, AGE5 and DURATION3, which are obtained as described in Table 4. The transformations of WEIGHTBB and DURATION are performed as usual in obstetrical studies: birthweights under 2500 g, between 2500 g and 4000 g and above 4000 g are considered as low, normal and high respectively. Similarly, a baby is considered preterm if born before the 37th pregnancy week and post-term if born after the 41th week. The transformation of

Variable K Description CESA Type of delivery (1:vaginal, 2:cesarean). EPISIO Did the obstetrician perform an episiotomy (1:no, 2:yes)? INDUCED Was the labor induced (1:no, 2:yes)? SEX Sex of the baby (1:male, 2:female). WEIGHTBB 39 Weight of the baby at birth (in g, measurement precision: 100 g). WEIGHTBB3 3 Weight of the baby at birth. 1: 2500 g, 2: 2500 4000 g, 3: > 4000 g. WEIGHTMO 90 Weight of the mother before pregnancy (in kg, measurement precision: 0.5 kg). WEIGHTMO4 4 Weight of the mother before pregnancy. 1: 50 kg, 2: 50 60 kg, 3: 60 70 kg, 4: > 70 kg. AGE 25 Age of the mother (in year, measurement precision: 1 year). AGE5 5 Age of the mother. 1: 20 years, 2: 20 25 years, 3: 25 30 years, 4: 30 35 years, 5: > 35 years. DURATION 16 Duration of the pregnancy (in week, measurement precision: 1 week). DURATION3 3 Duration of the pregnancy. 1: 37 weeks, 2: 37 41 weeks, 3: > 41 weeks. Table 2 Binary response variables (top) and discretized continuous variables (bottom). WEIGHTMO and AGE is performed based on intervals of equal length. Table 4 also gives the number K of distinct values taken by the variables in the data set. In this section, we measure the association between the binary variables and the discretized continuous variables from Table 4 using our approach and Koziol s approach. We show that our approach is more successful in detecting true association for variables with few possible values than Koziol s approach by comparing the obtained association measures for the original and transformed variables. All 32 pairs formed by a binary response variable and a discretized continuous variable from Table 4 are examined successively. The following procedure is applied to each pair. 1. Denote as a 1,..., a K the different values taken by the ordinal variable in the sample, with a 1 < < a K. In the case of a classical ordinal variable with values 1,..., K, we have a 1 = 1,..., a K = K. In the extreme case of a variable taking different values for all N observations, we have a 1 = x (1),..., a N = x (N). 2. For k = 1,..., K 1, compute the chi-square statistic χ 2 k from the 2 2 contingency table X a k X > a k Y = 1 n 1, ak n 1,>ak Y = 2 n 2, ak n 2,>ak 3. Determine χ 2 max = max k=1,...,k 1 χ 2 k.

CESA EPISIO INDUCED SEX WEIGHTBB 1, 1 0.23, 0.09 0.98, 0.95 1, 1 WEIGHTBB3 1, 1 0.18, 0 1, 0.87 0.99, 0.85 WEIGHTMO 0.05, 0.02 0.63, 0.54 0.99, 0.99 0.38, 0.25 WEIGHTMO4 0.06, 0 0.46, 0 1, 0.95 0.07, 0 AGE 0.98, 0.93 0.79, 0.61 0.73, 0.54 0.74, 0.53 AGE5 0.97, 0.69 0.92, 0.62 0.93, 0.54 0.42, 0 DURATION 1, 1 0.54, 0.15 1, 1 0.58, 0.19 DURATION3 1, 0.99 0.12, 0 1, 1 0.76, 0.02 Table 3 Measure of association between the binary variables CESA, EPISIO, INDUCED, SEX, and the variables WEIGHTBB/WEIGHTBB3, WEIGHTMO/WEIGHTMO4, AGE/AGE5, DURATION/DURATION3 using our approach (normal) and Koziol s approach (italic) 4. Compute F (χ 2 max) with the parameters N 1, N 2, m 1,..., m K using successively our new boundaries and Koziol s boundaries. The results can be found in Table 3. As can be seen from Table 3, our method detects approximately as much association for the transformed variables WEIGHTBB3, WEIGHTMO4, AGE5 and DURATION3 as for the original variables WEIGHTBB, WEIGHTMO, AGE and DURATION. In contrast, Koziol s method often detects more association for the original variables than for the transformed variables, e.g. for the pairs AGE/CESA, WEIGHTBB/INDUCED and or WEIGHTMO/INDUCED. Thus, the association measure obtained using our boundaries seems to depend less on the measurement precision than the association measure obtained using Koziol s boundaries. This is a desirable property in the context of medical data analysis, since one often has to analyze data sets including continuous variables measured with high precision as well as ordinal variables with few possible values. Our method allows a direct comparison of candidate predictor variables of different types. Such a comparison can not be achieved using classical approaches. For instance, the t-statistic may not be employed for ordinal variables. Maximally selected impurity criteria used in classification trees such as the Gini-index or the cross-entropy are not appropriate because they select predictor variables with many cutpoints more often. Furthermore, our method is more successful than Koziol s method in detecting true associations which are known from the obstetrical literature. For instance, it is well-known that male babies are heavier in average than female babies (Liebermann et al., 1997). Association between the variables SEX and WEIGHTBB is detected by both approaches, but no significant association is detected between SEX and the transformed variable WEIGHTBB3 using Koziol s approach. Similarly, labor induction is obviously much more frequent for bigger babies. The association between the variables INDUCED and

WEIGHTBB/WEIGHTBB3 is better detected by our approach than by Koziol s approach. A study by Cnattingius, Cnattingius and Notzon (1998) also points out that the risks for cesarean increase with maternal age, which is consistent with the results obtained with our method but not with Koziol s method. In a word, our novel method is more efficient than Koziol s method to (i) compare the association between a binary variable and at least ordinal scaled variables taking different numbers of values in the available sample (ii) detect association between a binary variable and a candidate predictor variable with high power. 5 Discussion In this paper, we proposed a simple exact procedure based on Durbin s combinatorial approach (Durbin, 1971) to compute the finite-sample distribution of maximally selected χ 2 statistics for at least ordinal scaled variables. This procedure can be used to identify and compare prognostic factors in clinical studies. In contrast to Koziol s method, the proposed method does not induce any selection bias when the candidate predictor variables have different numbers of distinct values in the available sample. As an exact procedure, it is especially appropriate for small sample sizes which are common in clinical studies but becomes computationally intensive for very large sample sizes and large numbers of possible values. Our approach can be seen as a global framework to measure association between a binary variable and all types of at least ordinal scaled variables, as demonstrated in the real data study. Unlike other usual association criteria, it allows the comparison of e.g. a continuous predictor variable and an ordinal predictor variable with the possible values bad - normal - good. In future work, our method could be interestingly applied to classification trees. In recent papers, maximally selected statistics and their associated p-values have been successfully applied to the problem of variable and cutpoint selection in classification and regression trees (Shih, 2004; Lausen et al., 2004). The association measure developed in this paper might also be used as a selection criterion for choosing the best splitting variable and the best cutpoint. Since it allows the comparison of predictor variables of different types and can also be used in the case of small sample sizes, we expect it to perform better than the usual criteria in some cases. Acknowledgments I thank the referee, Gerhard Tutz and Korbinian Strimmer for critical comments and Stefanie Pildner von Steinburg for helping me to understand the data. References Betensky, R. A. and Rabinowitz, D. (1999). Maximally selected χ 2 statistics for k 2 tables. Biometrics 55, 317 320.

Cnattingius, R., Cnattingius, S. and Notzon, F. C. (1998). Obstacles to reducing cesarean rates in a lowcesarean setting: the effect of maternal age, height, and weight. Obstetrics and Gynecology 92, 501 506. Durbin, J. (1971). Boundary-crossing probabilities for the Brownian motion and Poisson processes and techniques for computing the power of the Kolmogorov-Smirnov test. Journal of Applied Probability 8, 431 453. Halpern, A. L. (1999). Minimally selected p and other tests for a single abrupt changepoint in a binary sequence. Biometrics 55, 1044 1050. Halpern, J. (1982). Maximally selected Chi square statistics for small samples. Biometrics 38, 1017 1023. Hothorn, T. and Lausen, B. (2003). On the exact distribution of maximally selected rank statistics. Computational Statistics and Data Analysis 43, 121 137. Koziol, J. A. (1991). On maximally selected Chi-square statistics. Biometrics 47, 1557 1561. Lausen, B., Hothorn, T., Bretz, F. and Schumacher, M. (2004). Assessment of optimal selected prognostic factors. Biometrical Journal 46, 364 374. Lausen, B., Lerche, R. and Schumacher, M. (2002). Maximally selected rank statistics for dose-response problems. Biometrical Journal 44, 131 147. Lausen, B. and Schumacher, M. (1992). Maximally selected rank statistics. Biometrics 48, 73 85. Lausen, B. and Schumacher, M. (1996). Evaluating the effect of optimized cutoff values in the assessment of prognostic factors. Computational Statistics and Data Analysis 21, 307 326. Liebermann, E. and Lang, J. M. and Cohen, A. P. and Frigoletto, F. D. and Acker, D. and Rao, R. (1997). The association of fetal sex with the rate of cesarean section. American Journal of Obstetrics and Gynecology 176, 667 671. Miller, R. and Siegmund, D. (1982). Maximally selected Chi square statistics. Biometrics 48, 1011-1016. Rabinowitz, D. and Betensky, R. A. (2000). Approximating the distribution of maximally selected Mc- Nemar s statistics. Biometrics 56, 897 902. Shih, Y. S. (2004). A note on split selection bias in classification trees. Computational Statistics and Data Analysis 45, 457 466. Steck (1970). Rectangle probabilities for uniform order statistics and the probability that the empirical distribution function lies between two distribution functions. Annals of Mathematical Statistics 42, 1 11.