Maximally selected chi-square statistics for ordinal variables

Size: px
Start display at page:

Download "Maximally selected chi-square statistics for ordinal variables"

Transcription

1 bimj header will be provided by the publisher Maximally selected chi-square statistics for ordinal variables Anne-Laure Boulesteix 1 Department of Statistics, University of Munich, Akademiestrasse 1, D Munich, Germany. Summary The association between a binary variable Y and a variable X having an at least ordinal measurement scale might be examined by selecting a cutpoint in the range of X and then performing an association test for the obtained 2 2 contingency table using the chi-square statistic. The distribution of the maximally selected chi-square statistic (i.e. the maximal chi-square statistic over all possible cutpoints) under the null-hypothesis of no association between X and Y is different from the known chi-square distribution. In the last decades, this topic has been extensively studied for continuous X variables, but not for non-continuous variables of at least ordinal measurement scale (which include e.g. classical ordinal or discretized continuous variables). In this paper, we suggest an exact method to determine the finite-sample distribution of maximally selected chi-square statistics in this context. This novel approach can be seen as a method to measure the association between a binary variable and variables having an at least ordinal scale of different types (ordinal, discretized continuous, etc). As an illustration, this method is applied to a new data set describing pregnancy and birth for 811 babies. Key words: Association test, contingency table, exact distribution, variable selection, selection bias. 1 Introduction The following situation is not uncommon in medical data analysis. An at least ordinal scaled variable X is suspected by the investigator to be associated with a binary variable Y. Let (x i, y i ) i=1,...,n denote N independently and identically distributed realizations of the variables X and Y. N 1 and N 2 denote the numbers of observations with y i = 1 and y i = 2, respectively. In this paper, we derive the exact distribution of the maximally selected χ 2 statistic for such X variables. This distribution can be used to measure the association between at least ordinal scaled X variables and the binary variable Y and allows the comparison of several X variables with different numbers of possible values. If X were nominal scaled, association tests such as the asymptotic χ 2 test or Fisher s exact test for small samples (for a binary X) could be employed to examine the association between X and Y using the sample (x i, y i ) i=1,...,n. If X were continuous, tests based on the normality assumption such as the two-samples t-test or rank tests such as Wilcoxon s rank sum test for two samples may be applied. The case of an at least ordinal scaled but not continuous variable is much more difficult to handle. Without loss of generality, such a variable X can be assumed to take K distinct levels a 1,..., a K R in the anne-laure.boulesteix@stat.uni-muenchen.de, Phone: , Fax: Copyright line will be provided by the publisher

2 sample (x i, y i ) i=1,...,n, where 2 K N and a 1 < < a K. An option to measure the association between X and Y is to transform X into binary variables X (k) for k = 1,..., K 1 as follows X (k) = 0 if X a k X (k) = 1 otherwise. Fisher s exact test or the asymptotic χ 2 test for 2 2 contingency tables may then be applied to each variable X (k) successively. However, one must be careful when interpreting the p-values output by these tests. Selecting the X (k) yielding the smallest p-value and claiming that a k is a relevant cutpoint of X because the p-value is low would be an inappropriate approach. This issue has been extensively studied in the case of continuous variables. Miller and Siegmund (1982) show that the maximally selected χ 2 statistic converges to a normalized Brownian bridge under the null-hypothesis of no association between X and Y. The problem of the asymptotic null distribution of maximally selected statistics for binary, ordered, quantitative and censored response variables is discussed in Lausen and Schumacher (1992, 1996); Lausen, Lerche and Schumacher (2002). Hothorn and Lausen (2003) derive a lower bound for the exact distribution of maximally selected rank statistics. The distribution of the maximally selected χ 2 statistic in the small sample case is examined by Halpern (1982) in a simulation study, whereas Koziol (1991) derives the exact distribution of maximally selected χ 2 statistics given N 1 and N 2 using Durbin s combinatorial approach (Durbin, 1971). As suggested by Koziol (1991), the determinant formula by Steck (1970) might also be used for the same purpose. Maximally selected χ 2 statistics in k 2 contingency tables are investigated in Betensky and Rabinowitz (1999). The distributions of other maximally selected statistics such as the statistic used in Fisher s exact test (Halpern, 1999) or McNemar s statistic (Rabinowitz and Betensky, 2000) have also been studied in the last few years. In this paper, we are interested in the finite-sample distribution of the χ 2 statistic for all types of at least ordinal scaled variables. Via simulations, we show in Section 3 that Koziol s approach is inappropriate to measure association between a binary variable Y and a non-continuous variable X with equal realizations in the sample (x i, y i ) i=1,...,n (K < N). More specifically, under the null-hypothesis of no association between X and Y, Koziol s approach tends to detect more association if K is large. In the present paper, we are concerned with at least ordinal scaled non-continuous variables, which include e.g. classical ordinal variables (for instance a variable with possible values very good, good, bad, very bad ), discrete metric variables (for instance the number of children in a family), or essentially continuous variables which are measured in a discretized form in practice. For instance, the height of a newborn baby is often given in centimeters and can thus take only a few values ranging from about 47 to 54 cm. For such discretized continuous variables, we generally have K < N if N is large enough, whereas continuous variables may be assumed to take N distinct values in the sample (x i, y i ) i=1,...,n (K = N). In the framework of maximally selected statistics, a variable with K < N cannot be handled as a variable with K = N. The fact that some values of X are taken several times in the sample must be taken into account when deriving the distribution of the maximally selected χ 2 statistic, since the number of possible cutpoints is K 1. In this paper, we propose a novel method to derive the exact

3 finite-sample distribution of the maximally selected χ 2 statistic for all types of at least ordinal scaled variables. This distribution depends on parameters N 1, N 2 and m 1,..., m K, which are defined as n m k = I(x i = a k ), for k = 1,..., K, i=1 where I is the indicator function. Our novel approach is an adaptation of the procedure proposed by Koziol (1991) and uses Durbin s combinatorial approach. The exact distribution of the maximally selected χ 2 statistic can be used to compute a measure of association between X and a binary variable Y using a sample (x i, y i ) i=1,...,n. This method has potentially many applications, especially in the field of medicine. The paper is organized as follows. In Section 2, a method to compute the exact distribution function of the maximally selected χ 2 statistic given N 1, N 2, m 1,..., m K is proposed. In Section 3, we show via simulations that our method is more appropriate than Koziol s method to compare variables with different K. In Section 4, we illustrate our method by measuring the association between various binary variables and at least ordinal scaled variables from a data set describing pregnancy and delivery for 811 babies. 2 Distribution of the maximally selected χ 2 statistic 2.1 Framework In this section, X is assumed to be a variable with an at least ordinal measurement scale. Y is a binary variable with levels Y = 1, 2. For a given sample (x i, y i ) i=1,...,n, let a 1 < < a K denote the different values taken by X. We consider the following 2 2 contingency table, for k = 1,..., K 1: X a k X > a k Σ Y = 1 n 1, ak n 1,>ak N 1 Y = 2 n 2, ak n 2,>ak N 2 Σ n., ak = k j=1 m j n.,>ak = K j=k+1 m j where N 1 and N 2 denote the numbers of realizations with y i = 1 and y i = 2, respectively and m j denotes the number of realizations with X = a j in the sample (x i, y i ) i=1,...,n. The corresponding χ 2 statistic can be computed as χ 2 k = N(n 1, a k n 2,>ak n 1,>ak n 2, ak ) 2 N 1 N 2 n., ak n.,>ak. (1) In this context, we define the maximally selected χ 2 statistic as χ 2 max = arg max k=1,...,k 1 χ2 k. (2) The aim of this paper is to derive the distribution of χ 2 max given N 1, N 2, m 1,..., m K under the nullhypothesis H 0 of no association between X and Y. For simplicity, we will omit N 1, N 2, m 1,..., m K in the following: F denotes the distribution function of χ 2 max under H 0 given the parameters N 1, N 2, m 1,..., m K : F (d) = P H0 (χ 2 max d). N

4 2.2 Method According to Miller and Siegmund (1982), the χ 2 statistic obtained for the binary variable X (k) can be formulated as χ 2 k = A2 k, where A k = N N 1 ( n 2, a k N 2 n., a k N )/ n., ak N (1 n., a k N )( ), (3) N 1 N 2 for all k = 1,..., K 1. Let d be an arbitrary strictly positive real number. After simple computations, one obtains from Equation 3 that χ 2 max d if and only if all the points with coordinates (n. ak, n 2, ak ) for k = 1,..., K 1 lie on or above the function lower d (x) = N 2x N N 1N 2 d x N N (1 x N )( ) (4) N 1 N 2 and on or below the function upper d (x) = N 2x N + N 1N 2 d x N N (1 x N )( ). (5) N 1 N 2 These curves may be denoted as boundaries. Let x (1) x (N) denote the ordered realizations of X. Let N 2 (i) denote the number of realizations with Y = 2 and X x (i). The functions lower d (x) and upper d (x) can also be represented on the graph (i, N 2 (i)). A sufficient and necessary condition for χ 2 max d is that the graph (i, N 2 (i)) does not pass through any point of integer coordinates (i, j) with i = n., ak and upper d (i) < j min(n 2, i) or max(0, i N 1 ) j < lower d (i), where k = 1,..., K 1. Let us denote these points as B 1,..., B q and their coordinates as (i 1, j 1 ),..., (i q, j q ), where B 1,..., B q are labeled in order of increasing i and increasing j within each i. Under the nullhypothesis of no association between X and Y, the probability that the path (i, N 2 (i)) passes through at least one of the points B 1,..., B q can be computed using Durbin s combinatorial approach (Durbin, 1971), as described by Koziol (1991). This approach is based on a Markov representation of N 2 (i) as the path of a binomial process with constant probability of success and with unit jumps, conditional on N 2 (N) = N 2. Here, we follow Koziol s formulation. Let P s denote the set of the paths from (0, 0) to B s that do not pass through points B 1,..., B s 1 and b s the number of paths in P s. Since the sets P s, s = 1,..., q are mutually disjoint, b s, s = 1,..., q can be computed recursively as b 1 = b s = i 1 j 1 i s j s, s 1 r=1 i s i r j s j r b r, s = 2,..., q.

5 The number of paths from (0, 0) to (N, N 2 ) that pass through B s, s = 1,..., q but not through B 1,..., B s 1 is then given as N i s N 2 j s b s. P H0 (χ 2 max > d) equals the probability that the path (i, N 2 (i)) passes through at least one of the points B 1,..., B q and is obtained as P H0 (χ 2 max > d) = N N 2 1 q N i s b s. (6) N 2 j s s=1 It follows F (d) = 1 N N 2 1 q N i s b s. (7) N 2 j s s=1 Our method, which is strongly related to the procedure described in Koziol (1991), allows explicitly K < N. The two approaches differ in the definition of the points B 1,..., B q. In Koziol (1991), the boundaries are formed by the points B 1,..., B q of coordinates (i, j) satisfying i = max {x N : j > N 2x N + N 1N 2 d x N N (1 x N )( )} and 1 j N 2, N 1 N 2 or i = min {x N : j < N 2x N N 1N 2 d x N N (1 x N )( )} and 0 j N 2 1. N 1 N 2 As an example, the boundaries obtained with Koziol s method and our new method are represented in Figure 1 for N 1 = 30, N 2 = 40, m 1 = 25, m 2 = 10, m 3 = 25, m 4 = 10 and d = 9. It can seen that the points B 1,..., B q defined by Koziol form a closed corridor. With Koziol s approach, paths which pass through a boundary point with abscissa i 0 such that x (i0) = x (i0+1) are counted for the computation of F (d), although they do not correspond to any concrete possible cutpoint. Thus, this approach is inappropriate for X variables with possibly equal realizations. In our approach, only the paths that would yield χ 2 k > d for at least one k are counted. These are the paths that are strictly above the upper boundary or strictly below the lower boundary at abscissa n., ak (k = 1,..., K 1) only. Note that the obtained distribution function F is the same with both methods in the special case K = N. In this special case, Koziol s approach is recommended, since computationally faster. The formula given in Equation 6 can be used to measure the association between a binary variable Y and an at least ordinal scaled variable X using a sample (x i, y i ) i=1,...,n as follows. For all k = 1,..., K 1, the value χ 2 k of the χ2 statistic for the sample (x i, y i ) i=1,...,n is computed using Equation 1 and the maximal χ 2 statistic χ 2 max is obtained from Equation 2. F (χ 2 max) is a measure of association between X and Y. In the following section, we show via simulations that the boundaries defined in our approach are more appropriate to measure association between a binary variable Y and a non-continuous variable X than Koziol s boundaries.

6 Obtained boundaries: an example N 2 (i) Koziol New method i Fig. 1 Boundaries obtained with Koziol s approach and our new approach for d = 9 N 1 = 30, N 2 = 40, m 1 = 25, m 2 = 10, m 3 = 25, m 4 = Simulations In this part, we examine two important features of our new method and compare it to Koziol s approach via simulations: Our method as well as Koziol s method can be used to identify predictor variables which are strongly associated with a binary variable Y. Thus, they can be seen as variable selection methods. Section 3.1 deals with the selection bias of these methods in the null case (no association between the variables X and Y ). The power of both approaches to identify predictor variables is examined in Section 3.2 for different measurement precisions of an essentially continuous X variable. 3.1 Null case In the whole section, we make the hypothesis of no association between the binary variable Y and some at least ordinal scaled variables X 1, X 2, X 3. Suppose that we compute a measure of association for each pair (X 1, Y ), (X 2, Y ) and (X 3, Y ) based on a given sample and select the variable X i with the highest association measure. An effective measure of association is expected to select X 1, X 2 and X 3 with probability 1 3. In this section, we show that Koziol s approach selects variables with large K more often than variables with small K. In contrast, the frequency of selection does not depend on the number of different values in the sample with our method. We present only the simulation results obtained with classical ordinal predictor variables for which the set of possible values {a 1,..., a K } does not depend

7 on the specific sample. However, similar results could be obtained with essentially continuous variables which are measured as discrete variables. The sample size is set successively to N = 50 and N = 100. For each value of N, N run = 1000 data sets containing N independent observations are simulated. Different distributions of Y are examined successively: First case (I) The two classes have equal probabilities: p 1 = P (Y = 1) = 0.5 p 2 = P (Y = 2) = 0.5. Second case (II) The two classes have non-equal probabilities: p 1 = P (Y = 1) = 0.7 p 2 = P (Y = 2) = 0.3. X 1, X 2 and X 3 are mutually independent and distributed as follows. The set of possible values is {1, 2, 3} for X 1, {1, 2, 3, 4, 5, 6, 7} for X 2 and {1, 2, 3, 4, 5, 6, 7, 8, 9, 10} for X 3. Let K i denote the number of possible values of variable X i. We study two cases successively for the distribution of the variables X i : First case (A) For each variable X i, the different levels have equal probability: P (X i = 1) =... = P (X i = K i ) = 1 K i Second case (B) For each variable X i, for k = 1,..., K P (X i = k) = c 0.1 if k is odd, P (X i = k) = c 0.2 if k is even, where c is a normalizing factor such that K i k=1 P (X i = k) = 1. Thus, four configurations (I/A,I/B,II/A and II/B) are studied. For each configuration and sample size (N = 50 and N = 100), N run = 1000 simulation data sets are drawn randomly. For each simulated data set and each variable, χ 2 max is determined and F (χ 2 max) is computed using successively Koziol s boundaries and the new boundaries defined in Section 2. The variable(s) with the highest F (χ 2 max) is (are) selected. The obtained frequencies of selection for each configuration (I/A,I/B,II/A and II/B) and each N can be found in Table 1. As can be seen from Table 1, variable selection using Koziol s method is strongly

8 X 1 X 2 X 3 N = 50, p 1 = , 22 33, 38 31, 45 N = 50, p 1 = , 23 30, 33 34, 47 N = 100, p 1 = , 20 33, 34 34, 47 N = 100, p 1 = , 19 33, 34 34, 48 N = 50, p 1 = , 22 32, 35 33, 45 N = 50, p 1 = , 21 32, 34 31, 46 N = 100, p 1 = , 18 32, 34 35, 49 N = 100, p 1 = , 20 33, 34 34, 47 Table 1 Classical ordinal variables: Frequency of selection (in %) of X 1, X 2, X 3 for different N, different p 1, p 2, with our method (normal font) and Koziol s method (italic). Top: Case A (equal probabilities), bottom: Case B (non-equal probabilities). biased. Variables with large K are selected more often. With our new method, no bias is observed. Similar results are obtained for the different values of N and p 1 and for the different distributions of X 1, X 2, X 3. The bias can be visualized for case I/A with N = 100 in Figure Power study Another important topic is the capacity of our approach to detect predictors which are associated with the response variable Y. In the whole section, we consider discretized continuous X variables which have different distributions in the two classes Y = 1 and Y = 2. We aim to show that our approach has a higher power than Koziol s method and that the gain of power obtained by using our approach increases when the number of values taken by the X variable decreases. In the whole power study, we simulate data sets containing a binary response variable Y and an informative predictor variable X. The response variable Y is generated as in Section 3.1, case I (i.e. from the distribution P (Y = 1) = P (Y = 2) = 0.5). The predictor variable X is generated as follows. We consider a random variable Z which is normally distributed with mean 0 within class Y = 1 and µ within class Y = 2: Z Y = 1 N (0, 1) Z Y = 2 N (µ, 1) X is defined as X = round(z/ν) ν,

9 Ordinal variables (case A), n=100, p 1 =0.5 Frequency of selection X 1 new X 1 Koziol X 2 new X 2 Koziol X 3 new X 3 Koziol Fig. 2 Barplot representing the frequencies of selection of X 1, X 2 and X 3 for N = 100 and p 1 = 0.5, with our method (gray) and with Koziol s method (black) for classical ordinal variables (Case A: equal probabilities). where ν is a real number corresponding to the measurement precision and round(x) denotes the integer approximation of x. In our simulation, we set ν to ν = 0.5, 1, 2 and µ to µ = 0, 0.25, 0.5, 0.75, 1, 1.25, 1.5 successively. For each value of the parameters µ and ν, we generate N run = 1000 data sets with N = 50 independent observations of X and Y and measure the association between X and Y by computing F (χ 2 max) using successively our boundaries and Koziol s boundaries. The proportion of runs for which significant association (i.e. F (χ 2 max) α, where α equals e.g. 0.95) is detected is then computed and denoted as empirical detection rate. It is represented against µ in Figure 3 for ν = 0.5 (top), ν = 1 (middle) and ν = 2 (bottom) with α = Similar results could be obtained with other confidence levels, e.g. α = 0.9 or α = As expected, the empirical detection rate increases with µ with both approaches and equals 1 for large values of µ. For all three values of ν, the empirical detection rate is higher with our approach than with Koziol s approach. Moreover, the difference between both approaches increases with ν. Whereas both approaches are equivalent for continuous variables, Koziol s approach performs poorly for variables which take few values in the available sample. The difference between both approaches also depends on the parameter µ. Whereas our approach performs much better than Koziol s approach for intermediate values of µ, almost no difference is observed when X and Y are strongly associated (i.e. for very large µ).

10 ν=0.5 Detection rate µ New method Koziol ν=1 Detection rate µ ν=2 Detection rate µ Fig. 3 Proportion of runs for which significant association (F (χ 2 max) 0.95) is detected using Koziol s method (dashed) and our method (normal lines) with ν = 0.5 (top), ν = 1 (middle) and ν = 2 (bottom). There is trade-off between type I and type II error: the detection rate is larger with our approach than with Koziol s approach for all values of µ, even for µ = 0, which may be seen as an inconvenience of our approach. However, a major advantage of our method is that the detection rate does not depend on the measurement precision, whereas it does with Koziol s method: our approach is especially appropriate to compare several predictor variables taking different numbers of values in the available sample. 4 Application to pregnancy and birth data To illustrate our method, we consider a pregnancy and birth data set which we collected by ourselves directly from internet users recruited on french-speaking pregnancy and birth websites. The investigated variables include the binary variables CESA, EPISIO, INDUCED, SEX and the discretized continuous variables WEIGHTBB, WEIGHTMO, AGE and DURATION (see Table 4 for a description). Moreover, we consider the transformed variables WEIGHTBB3, WEIGHTMO4, AGE5 and DURATION3, which are obtained as described in Table 4. The transformations of WEIGHTBB and DURATION are performed as usual in obstetrical studies: birthweights under 2500 g, between 2500 g and 4000 g and above 4000 g are considered as low, normal and high respectively. Similarly, a baby is considered preterm if born before the 37th pregnancy week and post-term if born after the 41th week. The transformation of

11 Variable K Description CESA Type of delivery (1:vaginal, 2:cesarean). EPISIO Did the obstetrician perform an episiotomy (1:no, 2:yes)? INDUCED Was the labor induced (1:no, 2:yes)? SEX Sex of the baby (1:male, 2:female). WEIGHTBB 39 Weight of the baby at birth (in g, measurement precision: 100 g). WEIGHTBB3 3 Weight of the baby at birth. 1: 2500 g, 2: g, 3: > 4000 g. WEIGHTMO 90 Weight of the mother before pregnancy (in kg, measurement precision: 0.5 kg). WEIGHTMO4 4 Weight of the mother before pregnancy. 1: 50 kg, 2: kg, 3: kg, 4: > 70 kg. AGE 25 Age of the mother (in year, measurement precision: 1 year). AGE5 5 Age of the mother. 1: 20 years, 2: years, 3: years, 4: years, 5: > 35 years. DURATION 16 Duration of the pregnancy (in week, measurement precision: 1 week). DURATION3 3 Duration of the pregnancy. 1: 37 weeks, 2: weeks, 3: > 41 weeks. Table 2 Binary response variables (top) and discretized continuous variables (bottom). WEIGHTMO and AGE is performed based on intervals of equal length. Table 4 also gives the number K of distinct values taken by the variables in the data set. In this section, we measure the association between the binary variables and the discretized continuous variables from Table 4 using our approach and Koziol s approach. We show that our approach is more successful in detecting true association for variables with few possible values than Koziol s approach by comparing the obtained association measures for the original and transformed variables. All 32 pairs formed by a binary response variable and a discretized continuous variable from Table 4 are examined successively. The following procedure is applied to each pair. 1. Denote as a 1,..., a K the different values taken by the ordinal variable in the sample, with a 1 < < a K. In the case of a classical ordinal variable with values 1,..., K, we have a 1 = 1,..., a K = K. In the extreme case of a variable taking different values for all N observations, we have a 1 = x (1),..., a N = x (N). 2. For k = 1,..., K 1, compute the chi-square statistic χ 2 k from the 2 2 contingency table X a k X > a k Y = 1 n 1, ak n 1,>ak Y = 2 n 2, ak n 2,>ak 3. Determine χ 2 max = max k=1,...,k 1 χ 2 k.

12 CESA EPISIO INDUCED SEX WEIGHTBB 1, , , , 1 WEIGHTBB3 1, , 0 1, , 0.85 WEIGHTMO 0.05, , , , 0.25 WEIGHTMO4 0.06, , 0 1, , 0 AGE 0.98, , , , 0.53 AGE5 0.97, , , , 0 DURATION 1, , , , 0.19 DURATION3 1, , 0 1, , 0.02 Table 3 Measure of association between the binary variables CESA, EPISIO, INDUCED, SEX, and the variables WEIGHTBB/WEIGHTBB3, WEIGHTMO/WEIGHTMO4, AGE/AGE5, DURATION/DURATION3 using our approach (normal) and Koziol s approach (italic) 4. Compute F (χ 2 max) with the parameters N 1, N 2, m 1,..., m K using successively our new boundaries and Koziol s boundaries. The results can be found in Table 3. As can be seen from Table 3, our method detects approximately as much association for the transformed variables WEIGHTBB3, WEIGHTMO4, AGE5 and DURATION3 as for the original variables WEIGHTBB, WEIGHTMO, AGE and DURATION. In contrast, Koziol s method often detects more association for the original variables than for the transformed variables, e.g. for the pairs AGE/CESA, WEIGHTBB/INDUCED and or WEIGHTMO/INDUCED. Thus, the association measure obtained using our boundaries seems to depend less on the measurement precision than the association measure obtained using Koziol s boundaries. This is a desirable property in the context of medical data analysis, since one often has to analyze data sets including continuous variables measured with high precision as well as ordinal variables with few possible values. Our method allows a direct comparison of candidate predictor variables of different types. Such a comparison can not be achieved using classical approaches. For instance, the t-statistic may not be employed for ordinal variables. Maximally selected impurity criteria used in classification trees such as the Gini-index or the cross-entropy are not appropriate because they select predictor variables with many cutpoints more often. Furthermore, our method is more successful than Koziol s method in detecting true associations which are known from the obstetrical literature. For instance, it is well-known that male babies are heavier in average than female babies (Liebermann et al., 1997). Association between the variables SEX and WEIGHTBB is detected by both approaches, but no significant association is detected between SEX and the transformed variable WEIGHTBB3 using Koziol s approach. Similarly, labor induction is obviously much more frequent for bigger babies. The association between the variables INDUCED and

13 WEIGHTBB/WEIGHTBB3 is better detected by our approach than by Koziol s approach. A study by Cnattingius, Cnattingius and Notzon (1998) also points out that the risks for cesarean increase with maternal age, which is consistent with the results obtained with our method but not with Koziol s method. In a word, our novel method is more efficient than Koziol s method to (i) compare the association between a binary variable and at least ordinal scaled variables taking different numbers of values in the available sample (ii) detect association between a binary variable and a candidate predictor variable with high power. 5 Discussion In this paper, we proposed a simple exact procedure based on Durbin s combinatorial approach (Durbin, 1971) to compute the finite-sample distribution of maximally selected χ 2 statistics for at least ordinal scaled variables. This procedure can be used to identify and compare prognostic factors in clinical studies. In contrast to Koziol s method, the proposed method does not induce any selection bias when the candidate predictor variables have different numbers of distinct values in the available sample. As an exact procedure, it is especially appropriate for small sample sizes which are common in clinical studies but becomes computationally intensive for very large sample sizes and large numbers of possible values. Our approach can be seen as a global framework to measure association between a binary variable and all types of at least ordinal scaled variables, as demonstrated in the real data study. Unlike other usual association criteria, it allows the comparison of e.g. a continuous predictor variable and an ordinal predictor variable with the possible values bad - normal - good. In future work, our method could be interestingly applied to classification trees. In recent papers, maximally selected statistics and their associated p-values have been successfully applied to the problem of variable and cutpoint selection in classification and regression trees (Shih, 2004; Lausen et al., 2004). The association measure developed in this paper might also be used as a selection criterion for choosing the best splitting variable and the best cutpoint. Since it allows the comparison of predictor variables of different types and can also be used in the case of small sample sizes, we expect it to perform better than the usual criteria in some cases. Acknowledgments I thank the referee, Gerhard Tutz and Korbinian Strimmer for critical comments and Stefanie Pildner von Steinburg for helping me to understand the data. References Betensky, R. A. and Rabinowitz, D. (1999). Maximally selected χ 2 statistics for k 2 tables. Biometrics 55,

14 Cnattingius, R., Cnattingius, S. and Notzon, F. C. (1998). Obstacles to reducing cesarean rates in a lowcesarean setting: the effect of maternal age, height, and weight. Obstetrics and Gynecology 92, Durbin, J. (1971). Boundary-crossing probabilities for the Brownian motion and Poisson processes and techniques for computing the power of the Kolmogorov-Smirnov test. Journal of Applied Probability 8, Halpern, A. L. (1999). Minimally selected p and other tests for a single abrupt changepoint in a binary sequence. Biometrics 55, Halpern, J. (1982). Maximally selected Chi square statistics for small samples. Biometrics 38, Hothorn, T. and Lausen, B. (2003). On the exact distribution of maximally selected rank statistics. Computational Statistics and Data Analysis 43, Koziol, J. A. (1991). On maximally selected Chi-square statistics. Biometrics 47, Lausen, B., Hothorn, T., Bretz, F. and Schumacher, M. (2004). Assessment of optimal selected prognostic factors. Biometrical Journal 46, Lausen, B., Lerche, R. and Schumacher, M. (2002). Maximally selected rank statistics for dose-response problems. Biometrical Journal 44, Lausen, B. and Schumacher, M. (1992). Maximally selected rank statistics. Biometrics 48, Lausen, B. and Schumacher, M. (1996). Evaluating the effect of optimized cutoff values in the assessment of prognostic factors. Computational Statistics and Data Analysis 21, Liebermann, E. and Lang, J. M. and Cohen, A. P. and Frigoletto, F. D. and Acker, D. and Rao, R. (1997). The association of fetal sex with the rate of cesarean section. American Journal of Obstetrics and Gynecology 176, Miller, R. and Siegmund, D. (1982). Maximally selected Chi square statistics. Biometrics 48, Rabinowitz, D. and Betensky, R. A. (2000). Approximating the distribution of maximally selected Mc- Nemar s statistics. Biometrics 56, Shih, Y. S. (2004). A note on split selection bias in classification trees. Computational Statistics and Data Analysis 45, Steck (1970). Rectangle probabilities for uniform order statistics and the probability that the empirical distribution function lies between two distribution functions. Annals of Mathematical Statistics 42, 1 11.

Maximally selected chi-square statistics for at least ordinal scaled variables

Maximally selected chi-square statistics for at least ordinal scaled variables Maximally selected chi-square statistics for at least ordinal scaled variables Anne-Laure Boulesteix anne-laure.boulesteix@stat.uni-muenchen.de Department of Statistics, University of Munich, Akademiestrasse

More information

Boulesteix: Maximally selected chi-square statistics and binary splits of nominal variables

Boulesteix: Maximally selected chi-square statistics and binary splits of nominal variables Boulesteix: Maximally selected chi-square statistics and binary splits of nominal variables Sonderforschungsbereich 386, Paper 449 (2005) Online unter: http://epub.ub.uni-muenchen.de/ Projektpartner Maximally

More information

Random Forests for Ordinal Response Data: Prediction and Variable Selection

Random Forests for Ordinal Response Data: Prediction and Variable Selection Silke Janitza, Gerhard Tutz, Anne-Laure Boulesteix Random Forests for Ordinal Response Data: Prediction and Variable Selection Technical Report Number 174, 2014 Department of Statistics University of Munich

More information

THE ROYAL STATISTICAL SOCIETY HIGHER CERTIFICATE

THE ROYAL STATISTICAL SOCIETY HIGHER CERTIFICATE THE ROYAL STATISTICAL SOCIETY 004 EXAMINATIONS SOLUTIONS HIGHER CERTIFICATE PAPER II STATISTICAL METHODS The Society provides these solutions to assist candidates preparing for the examinations in future

More information

Inferential Statistics

Inferential Statistics Inferential Statistics Eva Riccomagno, Maria Piera Rogantin DIMA Università di Genova riccomagno@dima.unige.it rogantin@dima.unige.it Part G Distribution free hypothesis tests 1. Classical and distribution-free

More information

CS6375: Machine Learning Gautam Kunapuli. Decision Trees

CS6375: Machine Learning Gautam Kunapuli. Decision Trees Gautam Kunapuli Example: Restaurant Recommendation Example: Develop a model to recommend restaurants to users depending on their past dining experiences. Here, the features are cost (x ) and the user s

More information

Generalized Maximally Selected Statistics

Generalized Maximally Selected Statistics Generalized Maximally Selected Statistics Torsten Hothorn Ludwig-Maximilians-Universität München Achim Zeileis Wirtschaftsuniversität Wien Abstract Maximally selected statistics for the estimation of simple

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU-10701 23. Decision Trees Barnabás Póczos Contents Decision Trees: Definition + Motivation Algorithm for Learning Decision Trees Entropy, Mutual Information, Information

More information

Discrete Multivariate Statistics

Discrete Multivariate Statistics Discrete Multivariate Statistics Univariate Discrete Random variables Let X be a discrete random variable which, in this module, will be assumed to take a finite number of t different values which are

More information

9/2/2010. Wildlife Management is a very quantitative field of study. throughout this course and throughout your career.

9/2/2010. Wildlife Management is a very quantitative field of study. throughout this course and throughout your career. Introduction to Data and Analysis Wildlife Management is a very quantitative field of study Results from studies will be used throughout this course and throughout your career. Sampling design influences

More information

Lecture 9. Selected material from: Ch. 12 The analysis of categorical data and goodness of fit tests

Lecture 9. Selected material from: Ch. 12 The analysis of categorical data and goodness of fit tests Lecture 9 Selected material from: Ch. 12 The analysis of categorical data and goodness of fit tests Univariate categorical data Univariate categorical data are best summarized in a one way frequency table.

More information

epub WU Institutional Repository

epub WU Institutional Repository epub WU Institutional Repository Torsten Hothorn and Achim Zeileis Generalized Maximally Selected Statistics Paper Original Citation: Hothorn, Torsten and Zeileis, Achim (2007) Generalized Maximally Selected

More information

Generalization to Multi-Class and Continuous Responses. STA Data Mining I

Generalization to Multi-Class and Continuous Responses. STA Data Mining I Generalization to Multi-Class and Continuous Responses STA 5703 - Data Mining I 1. Categorical Responses (a) Splitting Criterion Outline Goodness-of-split Criterion Chi-square Tests and Twoing Rule (b)

More information

the tree till a class assignment is reached

the tree till a class assignment is reached Decision Trees Decision Tree for Playing Tennis Prediction is done by sending the example down Prediction is done by sending the example down the tree till a class assignment is reached Definitions Internal

More information

Math 1040 Final Exam Form A Introduction to Statistics Fall Semester 2010

Math 1040 Final Exam Form A Introduction to Statistics Fall Semester 2010 Math 1040 Final Exam Form A Introduction to Statistics Fall Semester 2010 Instructor Name Time Limit: 120 minutes Any calculator is okay. Necessary tables and formulas are attached to the back of the exam.

More information

INSTITUTE OF ACTUARIES OF INDIA

INSTITUTE OF ACTUARIES OF INDIA INSTITUTE OF ACTUARIES OF INDIA EXAMINATIONS 4 th November 008 Subject CT3 Probability & Mathematical Statistics Time allowed: Three Hours (10.00 13.00 Hrs) Total Marks: 100 INSTRUCTIONS TO THE CANDIDATES

More information

Lecture 01: Introduction

Lecture 01: Introduction Lecture 01: Introduction Dipankar Bandyopadhyay, Ph.D. BMTRY 711: Analysis of Categorical Data Spring 2011 Division of Biostatistics and Epidemiology Medical University of South Carolina Lecture 01: Introduction

More information

Statistical aspects of prediction models with high-dimensional data

Statistical aspects of prediction models with high-dimensional data Statistical aspects of prediction models with high-dimensional data Anne Laure Boulesteix Institut für Medizinische Informationsverarbeitung, Biometrie und Epidemiologie February 15th, 2017 Typeset by

More information

Small n, σ known or unknown, underlying nongaussian

Small n, σ known or unknown, underlying nongaussian READY GUIDE Summary Tables SUMMARY-1: Methods to compute some confidence intervals Parameter of Interest Conditions 95% CI Proportion (π) Large n, p 0 and p 1 Equation 12.11 Small n, any p Figure 12-4

More information

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages: Glossary The ISI glossary of statistical terms provides definitions in a number of different languages: http://isi.cbs.nl/glossary/index.htm Adjusted r 2 Adjusted R squared measures the proportion of the

More information

Testing Independence

Testing Independence Testing Independence Dipankar Bandyopadhyay Department of Biostatistics, Virginia Commonwealth University BIOS 625: Categorical Data & GLM 1/50 Testing Independence Previously, we looked at RR = OR = 1

More information

Bias Correction in Classification Tree Construction ICML 2001

Bias Correction in Classification Tree Construction ICML 2001 Bias Correction in Classification Tree Construction ICML 21 Alin Dobra Johannes Gehrke Department of Computer Science Cornell University December 15, 21 Classification Tree Construction Outlook Temp. Humidity

More information

EDF 7405 Advanced Quantitative Methods in Educational Research. Data are available on IQ of the child and seven potential predictors.

EDF 7405 Advanced Quantitative Methods in Educational Research. Data are available on IQ of the child and seven potential predictors. EDF 7405 Advanced Quantitative Methods in Educational Research Data are available on IQ of the child and seven potential predictors. Four are medical variables available at the birth of the child: Birthweight

More information

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007)

CHAPTER 17 CHI-SQUARE AND OTHER NONPARAMETRIC TESTS FROM: PAGANO, R. R. (2007) FROM: PAGANO, R. R. (007) I. INTRODUCTION: DISTINCTION BETWEEN PARAMETRIC AND NON-PARAMETRIC TESTS Statistical inference tests are often classified as to whether they are parametric or nonparametric Parameter

More information

The Design and Analysis of Benchmark Experiments Part II: Analysis

The Design and Analysis of Benchmark Experiments Part II: Analysis The Design and Analysis of Benchmark Experiments Part II: Analysis Torsten Hothorn Achim Zeileis Friedrich Leisch Kurt Hornik Friedrich Alexander Universität Erlangen Nürnberg http://www.imbe.med.uni-erlangen.de/~hothorn/

More information

Hypothesis testing. 1 Principle of hypothesis testing 2

Hypothesis testing. 1 Principle of hypothesis testing 2 Hypothesis testing Contents 1 Principle of hypothesis testing One sample tests 3.1 Tests on Mean of a Normal distribution..................... 3. Tests on Variance of a Normal distribution....................

More information

Detection of Uniform and Non-Uniform Differential Item Functioning by Item Focussed Trees

Detection of Uniform and Non-Uniform Differential Item Functioning by Item Focussed Trees arxiv:1511.07178v1 [stat.me] 23 Nov 2015 Detection of Uniform and Non-Uniform Differential Functioning by Focussed Trees Moritz Berger & Gerhard Tutz Ludwig-Maximilians-Universität München Akademiestraße

More information

Frequency Distribution Cross-Tabulation

Frequency Distribution Cross-Tabulation Frequency Distribution Cross-Tabulation 1) Overview 2) Frequency Distribution 3) Statistics Associated with Frequency Distribution i. Measures of Location ii. Measures of Variability iii. Measures of Shape

More information

Regression tree methods for subgroup identification I

Regression tree methods for subgroup identification I Regression tree methods for subgroup identification I Xu He Academy of Mathematics and Systems Science, Chinese Academy of Sciences March 25, 2014 Xu He (AMSS, CAS) March 25, 2014 1 / 34 Outline The problem

More information

STATISTICS ( CODE NO. 08 ) PAPER I PART - I

STATISTICS ( CODE NO. 08 ) PAPER I PART - I STATISTICS ( CODE NO. 08 ) PAPER I PART - I 1. Descriptive Statistics Types of data - Concepts of a Statistical population and sample from a population ; qualitative and quantitative data ; nominal and

More information

Hypothesis testing:power, test statistic CMS:

Hypothesis testing:power, test statistic CMS: Hypothesis testing:power, test statistic The more sensitive the test, the better it can discriminate between the null and the alternative hypothesis, quantitatively, maximal power In order to achieve this

More information

Decision Trees. CS57300 Data Mining Fall Instructor: Bruno Ribeiro

Decision Trees. CS57300 Data Mining Fall Instructor: Bruno Ribeiro Decision Trees CS57300 Data Mining Fall 2016 Instructor: Bruno Ribeiro Goal } Classification without Models Well, partially without a model } Today: Decision Trees 2015 Bruno Ribeiro 2 3 Why Trees? } interpretable/intuitive,

More information

General Comments Proofs by Mathematical Induction

General Comments Proofs by Mathematical Induction CMSC351 Notes on Mathematical Induction Proofs These are examples of proofs used in cmsc250. These proofs tend to be very detailed. You can be a little looser. General Comments Proofs by Mathematical Induction

More information

CDA Chapter 3 part II

CDA Chapter 3 part II CDA Chapter 3 part II Two-way tables with ordered classfications Let u 1 u 2... u I denote scores for the row variable X, and let ν 1 ν 2... ν J denote column Y scores. Consider the hypothesis H 0 : X

More information

Contents. Acknowledgments. xix

Contents. Acknowledgments. xix Table of Preface Acknowledgments page xv xix 1 Introduction 1 The Role of the Computer in Data Analysis 1 Statistics: Descriptive and Inferential 2 Variables and Constants 3 The Measurement of Variables

More information

Practice problems from chapters 2 and 3

Practice problems from chapters 2 and 3 Practice problems from chapters and 3 Question-1. For each of the following variables, indicate whether it is quantitative or qualitative and specify which of the four levels of measurement (nominal, ordinal,

More information

Data Mining. CS57300 Purdue University. Bruno Ribeiro. February 8, 2018

Data Mining. CS57300 Purdue University. Bruno Ribeiro. February 8, 2018 Data Mining CS57300 Purdue University Bruno Ribeiro February 8, 2018 Decision trees Why Trees? interpretable/intuitive, popular in medical applications because they mimic the way a doctor thinks model

More information

Empirical Risk Minimization, Model Selection, and Model Assessment

Empirical Risk Minimization, Model Selection, and Model Assessment Empirical Risk Minimization, Model Selection, and Model Assessment CS6780 Advanced Machine Learning Spring 2015 Thorsten Joachims Cornell University Reading: Murphy 5.7-5.7.2.4, 6.5-6.5.3.1 Dietterich,

More information

STAT Section 3.4: The Sign Test. The sign test, as we will typically use it, is a method for analyzing paired data.

STAT Section 3.4: The Sign Test. The sign test, as we will typically use it, is a method for analyzing paired data. STAT 518 --- Section 3.4: The Sign Test The sign test, as we will typically use it, is a method for analyzing paired data. Examples of Paired Data: Similar subjects are paired off and one of two treatments

More information

Class Notes: Week 8. Probit versus Logit Link Functions and Count Data

Class Notes: Week 8. Probit versus Logit Link Functions and Count Data Ronald Heck Class Notes: Week 8 1 Class Notes: Week 8 Probit versus Logit Link Functions and Count Data This week we ll take up a couple of issues. The first is working with a probit link function. While

More information

8 Nominal and Ordinal Logistic Regression

8 Nominal and Ordinal Logistic Regression 8 Nominal and Ordinal Logistic Regression 8.1 Introduction If the response variable is categorical, with more then two categories, then there are two options for generalized linear models. One relies on

More information

Growing a Large Tree

Growing a Large Tree STAT 5703 Fall, 2004 Data Mining Methodology I Decision Tree I Growing a Large Tree Contents 1 A Single Split 2 1.1 Node Impurity.................................. 2 1.2 Computation of i(t)................................

More information

WEIGHTED LIKELIHOOD NEGATIVE BINOMIAL REGRESSION

WEIGHTED LIKELIHOOD NEGATIVE BINOMIAL REGRESSION WEIGHTED LIKELIHOOD NEGATIVE BINOMIAL REGRESSION Michael Amiguet 1, Alfio Marazzi 1, Victor Yohai 2 1 - University of Lausanne, Institute for Social and Preventive Medicine, Lausanne, Switzerland 2 - University

More information

APPENDIX B Sample-Size Calculation Methods: Classical Design

APPENDIX B Sample-Size Calculation Methods: Classical Design APPENDIX B Sample-Size Calculation Methods: Classical Design One/Paired - Sample Hypothesis Test for the Mean Sign test for median difference for a paired sample Wilcoxon signed - rank test for one or

More information

General Comments on Proofs by Mathematical Induction

General Comments on Proofs by Mathematical Induction Fall 2015: CMSC250 Notes on Mathematical Induction Proofs General Comments on Proofs by Mathematical Induction All proofs by Mathematical Induction should be in one of the styles below. If not there should

More information

Machine Learning & Data Mining

Machine Learning & Data Mining Group M L D Machine Learning M & Data Mining Chapter 7 Decision Trees Xin-Shun Xu @ SDU School of Computer Science and Technology, Shandong University Top 10 Algorithm in DM #1: C4.5 #2: K-Means #3: SVM

More information

Learning with multiple models. Boosting.

Learning with multiple models. Boosting. CS 2750 Machine Learning Lecture 21 Learning with multiple models. Boosting. Milos Hauskrecht milos@cs.pitt.edu 5329 Sennott Square Learning with multiple models: Approach 2 Approach 2: use multiple models

More information

SALES AND MARKETING Department MATHEMATICS. 3rd Semester. Probability distributions. Tutorials and exercises

SALES AND MARKETING Department MATHEMATICS. 3rd Semester. Probability distributions. Tutorials and exercises SALES AND MARKETING Department MATHEMATICS 3rd Semester Probability distributions Tutorials and exercises Online document : on http://jff-dut-tc.weebly.com section DUT Maths S3. IUT de Saint-Etienne Département

More information

Chapter Fifteen. Frequency Distribution, Cross-Tabulation, and Hypothesis Testing

Chapter Fifteen. Frequency Distribution, Cross-Tabulation, and Hypothesis Testing Chapter Fifteen Frequency Distribution, Cross-Tabulation, and Hypothesis Testing Copyright 2010 Pearson Education, Inc. publishing as Prentice Hall 15-1 Internet Usage Data Table 15.1 Respondent Sex Familiarity

More information

14.30 Introduction to Statistical Methods in Economics Spring 2009

14.30 Introduction to Statistical Methods in Economics Spring 2009 MIT OpenCourseWare http://ocw.mit.edu 4.0 Introduction to Statistical Methods in Economics Spring 009 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

Duke University. Duke Biostatistics and Bioinformatics (B&B) Working Paper Series. Randomized Phase II Clinical Trials using Fisher s Exact Test

Duke University. Duke Biostatistics and Bioinformatics (B&B) Working Paper Series. Randomized Phase II Clinical Trials using Fisher s Exact Test Duke University Duke Biostatistics and Bioinformatics (B&B) Working Paper Series Year 2010 Paper 7 Randomized Phase II Clinical Trials using Fisher s Exact Test Sin-Ho Jung sinho.jung@duke.edu This working

More information

26 Chapter 4 Classification

26 Chapter 4 Classification 26 Chapter 4 Classification The preceding tree cannot be simplified. 2. Consider the training examples shown in Table 4.1 for a binary classification problem. Table 4.1. Data set for Exercise 2. Customer

More information

Contents Kruskal-Wallis Test Friedman s Two-way Analysis of Variance by Ranks... 47

Contents Kruskal-Wallis Test Friedman s Two-way Analysis of Variance by Ranks... 47 Contents 1 Non-parametric Tests 3 1.1 Introduction....................................... 3 1.2 Advantages of Non-parametric Tests......................... 4 1.3 Disadvantages of Non-parametric Tests........................

More information

Logistic regression: Miscellaneous topics

Logistic regression: Miscellaneous topics Logistic regression: Miscellaneous topics April 11 Introduction We have covered two approaches to inference for GLMs: the Wald approach and the likelihood ratio approach I claimed that the likelihood ratio

More information

Comparison of Two Samples

Comparison of Two Samples 2 Comparison of Two Samples 2.1 Introduction Problems of comparing two samples arise frequently in medicine, sociology, agriculture, engineering, and marketing. The data may have been generated by observation

More information

Tutorial 6: Tutorial on Translating between GLIMMPSE Power Analysis and Data Analysis. Acknowledgements:

Tutorial 6: Tutorial on Translating between GLIMMPSE Power Analysis and Data Analysis. Acknowledgements: Tutorial 6: Tutorial on Translating between GLIMMPSE Power Analysis and Data Analysis Anna E. Barón, Keith E. Muller, Sarah M. Kreidler, and Deborah H. Glueck Acknowledgements: The project was supported

More information

Question. Hypothesis testing. Example. Answer: hypothesis. Test: true or not? Question. Average is not the mean! μ average. Random deviation or not?

Question. Hypothesis testing. Example. Answer: hypothesis. Test: true or not? Question. Average is not the mean! μ average. Random deviation or not? Hypothesis testing Question Very frequently: what is the possible value of μ? Sample: we know only the average! μ average. Random deviation or not? Standard error: the measure of the random deviation.

More information

Generalized Linear Modeling - Logistic Regression

Generalized Linear Modeling - Logistic Regression 1 Generalized Linear Modeling - Logistic Regression Binary outcomes The logit and inverse logit interpreting coefficients and odds ratios Maximum likelihood estimation Problem of separation Evaluating

More information

Decision Tree Learning Lecture 2

Decision Tree Learning Lecture 2 Machine Learning Coms-4771 Decision Tree Learning Lecture 2 January 28, 2008 Two Types of Supervised Learning Problems (recap) Feature (input) space X, label (output) space Y. Unknown distribution D over

More information

STATISTICS ANCILLARY SYLLABUS. (W.E.F. the session ) Semester Paper Code Marks Credits Topic

STATISTICS ANCILLARY SYLLABUS. (W.E.F. the session ) Semester Paper Code Marks Credits Topic STATISTICS ANCILLARY SYLLABUS (W.E.F. the session 2014-15) Semester Paper Code Marks Credits Topic 1 ST21012T 70 4 Descriptive Statistics 1 & Probability Theory 1 ST21012P 30 1 Practical- Using Minitab

More information

Basic Business Statistics, 10/e

Basic Business Statistics, 10/e Chapter 1 1-1 Basic Business Statistics 11 th Edition Chapter 1 Chi-Square Tests and Nonparametric Tests Basic Business Statistics, 11e 009 Prentice-Hall, Inc. Chap 1-1 Learning Objectives In this chapter,

More information

STA6938-Logistic Regression Model

STA6938-Logistic Regression Model Dr. Ying Zhang STA6938-Logistic Regression Model Topic 2-Multiple Logistic Regression Model Outlines:. Model Fitting 2. Statistical Inference for Multiple Logistic Regression Model 3. Interpretation of

More information

Chapte The McGraw-Hill Companies, Inc. All rights reserved.

Chapte The McGraw-Hill Companies, Inc. All rights reserved. er15 Chapte Chi-Square Tests d Chi-Square Tests for -Fit Uniform Goodness- Poisson Goodness- Goodness- ECDF Tests (Optional) Contingency Tables A contingency table is a cross-tabulation of n paired observations

More information

NAG Library Chapter Introduction. G08 Nonparametric Statistics

NAG Library Chapter Introduction. G08 Nonparametric Statistics NAG Library Chapter Introduction G08 Nonparametric Statistics Contents 1 Scope of the Chapter.... 2 2 Background to the Problems... 2 2.1 Parametric and Nonparametric Hypothesis Testing... 2 2.2 Types

More information

Estimators for the binomial distribution that dominate the MLE in terms of Kullback Leibler risk

Estimators for the binomial distribution that dominate the MLE in terms of Kullback Leibler risk Ann Inst Stat Math (0) 64:359 37 DOI 0.007/s0463-00-036-3 Estimators for the binomial distribution that dominate the MLE in terms of Kullback Leibler risk Paul Vos Qiang Wu Received: 3 June 009 / Revised:

More information

HANDBOOK OF APPLICABLE MATHEMATICS

HANDBOOK OF APPLICABLE MATHEMATICS HANDBOOK OF APPLICABLE MATHEMATICS Chief Editor: Walter Ledermann Volume VI: Statistics PART A Edited by Emlyn Lloyd University of Lancaster A Wiley-Interscience Publication JOHN WILEY & SONS Chichester

More information

Contingency Tables. Safety equipment in use Fatal Non-fatal Total. None 1, , ,128 Seat belt , ,878

Contingency Tables. Safety equipment in use Fatal Non-fatal Total. None 1, , ,128 Seat belt , ,878 Contingency Tables I. Definition & Examples. A) Contingency tables are tables where we are looking at two (or more - but we won t cover three or more way tables, it s way too complicated) factors, each

More information

Basic methods for simulation of random variables: 1. Inversion

Basic methods for simulation of random variables: 1. Inversion INSTITUT FOR MATEMATISKE FAG c AALBORG UNIVERSITET FREDRIK BAJERS VEJ 7 G 9220 AALBORG ØST Tlf.: 96 35 88 63 URL: www.math.auc.dk Fax: 98 15 81 29 E-mail: jm@math.aau.dk Basic methods for simulation of

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

* * MATHEMATICS (MEI) 4767 Statistics 2 ADVANCED GCE. Monday 25 January 2010 Morning. Duration: 1 hour 30 minutes. Turn over

* * MATHEMATICS (MEI) 4767 Statistics 2 ADVANCED GCE. Monday 25 January 2010 Morning. Duration: 1 hour 30 minutes. Turn over ADVANCED GCE MATHEMATICS (MEI) 4767 Statistics 2 Candidates answer on the Answer Booklet OCR Supplied Materials: 8 page Answer Booklet Graph paper MEI Examination Formulae and Tables (MF2) Other Materials

More information

Model Selection in GLMs. (should be able to implement frequentist GLM analyses!) Today: standard frequentist methods for model selection

Model Selection in GLMs. (should be able to implement frequentist GLM analyses!) Today: standard frequentist methods for model selection Model Selection in GLMs Last class: estimability/identifiability, analysis of deviance, standard errors & confidence intervals (should be able to implement frequentist GLM analyses!) Today: standard frequentist

More information

The computationally optimal test set size in simulation studies on supervised learning

The computationally optimal test set size in simulation studies on supervised learning Mathias Fuchs, Xiaoyu Jiang, Anne-Laure Boulesteix The computationally optimal test set size in simulation studies on supervised learning Technical Report Number 189, 2016 Department of Statistics University

More information

Analysis of Categorical Data. Nick Jackson University of Southern California Department of Psychology 10/11/2013

Analysis of Categorical Data. Nick Jackson University of Southern California Department of Psychology 10/11/2013 Analysis of Categorical Data Nick Jackson University of Southern California Department of Psychology 10/11/2013 1 Overview Data Types Contingency Tables Logit Models Binomial Ordinal Nominal 2 Things not

More information

Institute of Actuaries of India

Institute of Actuaries of India Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics For 2018 Examinations Subject CT3 Probability and Mathematical Statistics Core Technical Syllabus 1 June 2017 Aim The

More information

Problem #1 #2 #3 #4 #5 #6 Total Points /6 /8 /14 /10 /8 /10 /56

Problem #1 #2 #3 #4 #5 #6 Total Points /6 /8 /14 /10 /8 /10 /56 STAT 391 - Spring Quarter 2017 - Midterm 1 - April 27, 2017 Name: Student ID Number: Problem #1 #2 #3 #4 #5 #6 Total Points /6 /8 /14 /10 /8 /10 /56 Directions. Read directions carefully and show all your

More information

Statistical Hypothesis Testing with SAS and R

Statistical Hypothesis Testing with SAS and R Statistical Hypothesis Testing with SAS and R Statistical Hypothesis Testing with SAS and R Dirk Taeger Institute for Prevention and Occupational Medicine of the German Social Accident Insurance, Institute

More information

Multinomial Logistic Regression Models

Multinomial Logistic Regression Models Stat 544, Lecture 19 1 Multinomial Logistic Regression Models Polytomous responses. Logistic regression can be extended to handle responses that are polytomous, i.e. taking r>2 categories. (Note: The word

More information

Chi-Squared Tests. Semester 1. Chi-Squared Tests

Chi-Squared Tests. Semester 1. Chi-Squared Tests Semester 1 Goodness of Fit Up to now, we have tested hypotheses concerning the values of population parameters such as the population mean or proportion. We have not considered testing hypotheses about

More information

Marginal Screening and Post-Selection Inference

Marginal Screening and Post-Selection Inference Marginal Screening and Post-Selection Inference Ian McKeague August 13, 2017 Ian McKeague (Columbia University) Marginal Screening August 13, 2017 1 / 29 Outline 1 Background on Marginal Screening 2 2

More information

An Analysis of College Algebra Exam Scores December 14, James D Jones Math Section 01

An Analysis of College Algebra Exam Scores December 14, James D Jones Math Section 01 An Analysis of College Algebra Exam s December, 000 James D Jones Math - Section 0 An Analysis of College Algebra Exam s Introduction Students often complain about a test being too difficult. Are there

More information

In Defence of Score Intervals for Proportions and their Differences

In Defence of Score Intervals for Proportions and their Differences In Defence of Score Intervals for Proportions and their Differences Robert G. Newcombe a ; Markku M. Nurminen b a Department of Primary Care & Public Health, Cardiff University, Cardiff, United Kingdom

More information

Test Yourself! Methodological and Statistical Requirements for M.Sc. Early Childhood Research

Test Yourself! Methodological and Statistical Requirements for M.Sc. Early Childhood Research Test Yourself! Methodological and Statistical Requirements for M.Sc. Early Childhood Research HOW IT WORKS For the M.Sc. Early Childhood Research, sufficient knowledge in methods and statistics is one

More information

Models, Data, Learning Problems

Models, Data, Learning Problems Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Models, Data, Learning Problems Tobias Scheffer Overview Types of learning problems: Supervised Learning (Classification, Regression,

More information

Textbook Examples of. SPSS Procedure

Textbook Examples of. SPSS Procedure Textbook s of IBM SPSS Procedures Each SPSS procedure listed below has its own section in the textbook. These sections include a purpose statement that describes the statistical test, identification of

More information

Lecture Slides. Elementary Statistics. by Mario F. Triola. and the Triola Statistics Series

Lecture Slides. Elementary Statistics. by Mario F. Triola. and the Triola Statistics Series Lecture Slides Elementary Statistics Tenth Edition and the Triola Statistics Series by Mario F. Triola Slide 1 Chapter 13 Nonparametric Statistics 13-1 Overview 13-2 Sign Test 13-3 Wilcoxon Signed-Ranks

More information

Lecture Slides. Section 13-1 Overview. Elementary Statistics Tenth Edition. Chapter 13 Nonparametric Statistics. by Mario F.

Lecture Slides. Section 13-1 Overview. Elementary Statistics Tenth Edition. Chapter 13 Nonparametric Statistics. by Mario F. Lecture Slides Elementary Statistics Tenth Edition and the Triola Statistics Series by Mario F. Triola Slide 1 Chapter 13 Nonparametric Statistics 13-1 Overview 13-2 Sign Test 13-3 Wilcoxon Signed-Ranks

More information

2018 CS420, Machine Learning, Lecture 5. Tree Models. Weinan Zhang Shanghai Jiao Tong University

2018 CS420, Machine Learning, Lecture 5. Tree Models. Weinan Zhang Shanghai Jiao Tong University 2018 CS420, Machine Learning, Lecture 5 Tree Models Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/cs420/index.html ML Task: Function Approximation Problem setting

More information

Statistical Methods for Astronomy

Statistical Methods for Astronomy Statistical Methods for Astronomy Probability (Lecture 1) Statistics (Lecture 2) Why do we need statistics? Useful Statistics Definitions Error Analysis Probability distributions Error Propagation Binomial

More information

COPYRIGHTED MATERIAL CONTENTS. Preface Preface to the First Edition

COPYRIGHTED MATERIAL CONTENTS. Preface Preface to the First Edition Preface Preface to the First Edition xi xiii 1 Basic Probability Theory 1 1.1 Introduction 1 1.2 Sample Spaces and Events 3 1.3 The Axioms of Probability 7 1.4 Finite Sample Spaces and Combinatorics 15

More information

CS 6375 Machine Learning

CS 6375 Machine Learning CS 6375 Machine Learning Decision Trees Instructor: Yang Liu 1 Supervised Classifier X 1 X 2. X M Ref class label 2 1 Three variables: Attribute 1: Hair = {blond, dark} Attribute 2: Height = {tall, short}

More information

FULL LIKELIHOOD INFERENCES IN THE COX MODEL

FULL LIKELIHOOD INFERENCES IN THE COX MODEL October 20, 2007 FULL LIKELIHOOD INFERENCES IN THE COX MODEL BY JIAN-JIAN REN 1 AND MAI ZHOU 2 University of Central Florida and University of Kentucky Abstract We use the empirical likelihood approach

More information

Binary Logistic Regression

Binary Logistic Regression The coefficients of the multiple regression model are estimated using sample data with k independent variables Estimated (or predicted) value of Y Estimated intercept Estimated slope coefficients Ŷ = b

More information

Computational Systems Biology: Biology X

Computational Systems Biology: Biology X Bud Mishra Room 1002, 715 Broadway, Courant Institute, NYU, New York, USA L#7:(Mar-23-2010) Genome Wide Association Studies 1 The law of causality... is a relic of a bygone age, surviving, like the monarchy,

More information

Entering and recoding variables

Entering and recoding variables Entering and recoding variables To enter: You create a New data file Define the variables on Variable View Enter the values on Data View To create the dichotomies: Transform -> Recode into Different Variable

More information

Investigation of goodness-of-fit test statistic distributions by random censored samples

Investigation of goodness-of-fit test statistic distributions by random censored samples d samples Investigation of goodness-of-fit test statistic distributions by random censored samples Novosibirsk State Technical University November 22, 2010 d samples Outline 1 Nonparametric goodness-of-fit

More information

Finding the Gold in Your Data

Finding the Gold in Your Data Finding the Gold in Your Data An introduction to Data Mining Originally presented @ SAS Global Forum David A. Dickey North Carolina State University Decision Trees A divisive method (splits) Start with

More information

Salt Lake Community College MATH 1040 Final Exam Fall Semester 2011 Form E

Salt Lake Community College MATH 1040 Final Exam Fall Semester 2011 Form E Salt Lake Community College MATH 1040 Final Exam Fall Semester 011 Form E Name Instructor Time Limit: 10 minutes Any hand-held calculator may be used. Computers, cell phones, or other communication devices

More information

Glossary for the Triola Statistics Series

Glossary for the Triola Statistics Series Glossary for the Triola Statistics Series Absolute deviation The measure of variation equal to the sum of the deviations of each value from the mean, divided by the number of values Acceptance sampling

More information

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Intelligent Data Analysis. Decision Trees

Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen. Intelligent Data Analysis. Decision Trees Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Intelligent Data Analysis Decision Trees Paul Prasse, Niels Landwehr, Tobias Scheffer Decision Trees One of many applications:

More information