Partition of the Chi-Squared Statistic in a Contingency Table


Partition of the Chi-Squared Statistic in a Contingency Table

by Jo Ann Colas

Thesis submitted to the Faculty of Graduate and Postdoctoral Studies in partial fulfillment of the requirements for the MSc degree in Mathematics and Statistics.

Department of Mathematics and Statistics, Faculty of Sciences, University of Ottawa

© Jo Ann Colas, Ottawa, Canada, 2014

Abstract

The Pearson statistic, a well-known goodness-of-fit test in the analysis of contingency tables, gives little guidance as to why a null hypothesis is rejected. One approach to determine the source(s) of deviation from the null is the decomposition of a chi-squared statistic. This allows writing the statistic as the sum of independent chi-squared statistics. First, three major types of contingency tables and the usual chi-squared tests are reviewed. Three types of decompositions are presented and applied: one based on the partition of the contingency table into independent subtables; one derived from smooth models; and one from the eigendecomposition of the central matrix defining the statistic. A comparison of some of the omnibus statistics decomposed above to a χ²(1)-distributed statistic shows that the omnibus statistics lack power compared to this statistic for testing the hypothesis of equal success probabilities against a monotonic trend in the success probabilities in a column-binomial contingency table.

Acknowledgements

Foremost, I would like to express my sincere gratitude to my supervisors, Dr. M. Alvo and Dr. P.-J. Bergeron, for their continuous support of my Master's study and research, and for their patience, motivation and knowledge. Their guidance helped me throughout the researching and writing of this thesis. I would also like to thank my committee members, Dr. M. Zarepour and Dr. S. Sinha, for making my defense an enjoyable moment, and for their brilliant comments and suggestions. I thank my fellow labmates at the Department of Mathematics and Statistics: Maryam, Blanche Nadege, Nada, Cheikh, Farid, Hicham, Rachid, Ewa, Farnoosh, Jack and Gael, for the stimulating discussions, for the sleepless nights we worked together before deadlines, and for all the fun we have had in the last two years. Last but not least, I would like to thank my family: my mother Rosie as my editor, my father as my chauffeur and my brother as my motivator.

Contents

1 Introduction
2 Description of the Chi-Squared Test for a Contingency Table
  2.1 The Chi-Squared Distribution
  2.2 Contingency Tables
  2.3 The Three Models Associated to Contingency Tables
    2.3.1 The unrestricted bivariate sampling model
    2.3.2 The product-multinomial model
    2.3.3 The permutation model
    2.3.4 The link between the three models
  2.4 Chi-Squared Hypothesis Tests
    2.4.1 The chi-squared goodness-of-fit test
    2.4.2 The chi-squared test for homogeneity
    2.4.3 The chi-squared test for independence
3 Tools for the Decomposition of the Pearson Statistic
  3.1 Ordinal Categories and Testing for Trends
  3.2 Ranked Data and Contingency Tables
    3.2.1 Ranked data and their properties
    3.2.2 Advantages and disadvantages of working with ranked data
  3.3 Smooth Goodness-of-Fit Models
    3.3.1 Neyman-type smooth models
    3.3.2 Barton-type smooth models
    3.3.3 Advantages and disadvantages of smooth models
  3.4 Orthonormal Polynomials Defined From Distributional Moments
4 Decompositions of Some Chi-Squared Statistics
  4.1 The Decomposition Into Subtables
  4.2 The Decomposition of the Pearson Statistic
    4.2.1 Decomposition of the Pearson statistic in two-way tables
    4.2.2 Under the unrestricted bivariate sampling model
    4.2.3 Under the multinomial model
    4.2.4 Under the binomial model
    4.2.5 Under the randomised design model
    4.2.6 Decomposition of the Pearson statistic in three-way tables
  4.3 The Decomposition of the Cumulative Chi-Squared Statistic
    4.3.1 General form of the decomposition of the cumulative chi-squared statistic
    4.3.2 The distribution of the cumulative chi-squared statistic
    4.3.3 Link to smooth models
    4.3.4 Explicit solutions when the columns are equiprobable
  4.4 The Decomposition of the Chi-Squared Likelihood-Ratio Test Statistic
    4.4.1 Log-linear models
    4.4.2 Decomposition of the chi-squared likelihood-ratio test statistic
    4.4.3 Extension of the decomposition
5 Comparisons of the CCS Statistics and a Correlation-Based Statistic
  5.1 The Alvo & Berthelot Spearman Test
  5.2 The Comparisons
    5.2.1 The distributions under the null hypothesis
    5.2.2 Comparison of the significance levels
    5.2.3 Comparison of the powers
6 Conclusion
A Obtaining the first two orthogonal polynomials from the Emerson recursion
B Proof of symmetry of (AA′)⁻¹ when the columns are equiprobable

List of Tables

2.1 An I × J contingency table
2.2 An I × J contingency table under the product-multinomial model
2.3 An I × J contingency table under the permutation model
4.1 The s-th subtable
4.2 A rank-by-rank table (without ties)
4.3 A treatment-by-rank table (without ties)
4.4 The I × 2 table around the overall sample median
4.5 A treatment-by-rank table (without ties)
4.6 The s-th collapsed I × 2 contingency table
4.7 The 2 × J collapsed table around the i-th row
4.8 Hierarchical log-linear models for an I × J contingency table
4.9 Hierarchical log-linear models for an I × J × K contingency table
5.1 Minimum number of simulations
5.2 Significance levels for the Taguchi and Nair CCS statistics
5.3 Significance levels for the Pearson and Spearman statistics
5.4 Comparison of the powers: strictly increasing success probabilities
5.5 Comparison of the powers: increasing success probabilities with repeated values
5.6 Comparison of the powers: non-increasing and non-decreasing success probabilities

Chapter 1

Introduction

Karl Pearson's chi-squared goodness-of-fit test was introduced at the end of the development of several important concepts in the preceding centuries. In 1733, de Moivre[19] established the asymptotic normality of the variable
$$X = \frac{Y - n\pi}{\sqrt{n\pi(1-\pi)}}, \qquad \text{where } Y \sim \text{Binomial}(n, \pi).$$
Consequently, $X^2$ is asymptotically distributed as the square of a $N(0,1)$ variable. Bienaymé[1] published the distribution of the sum of squares of $m$ independently distributed $N(0,1)$ variables in its gamma function form. Bravais[15], Schols[4] and Edgeworth[0] developed the joint multivariate normal distribution.

A contemporary of Pearson, Sheppard[43, 44] considered possible tests of goodness-of-fit by comparing observed frequencies to expected frequencies for each cell of a contingency table: in a good fit, the differences would be small. In particular, Sheppard looked at 2 × 2 tables as a dichotomy of a bivariate normal distribution. However, he could not find a generalization to I × J tables due to the awkward form of the covariance matrix. By considering a multinomial distribution instead, Pearson[3] found a more tractable form of the covariance matrix and provided the widely used chi-squared test of goodness-of-fit.

The advantages of the Pearson test include: it is easy to compute; it applies to categorical variables; it can be used to compare two or more populations/samples; and it makes no assumption on the distribution of the population(s). However, it has some shortcomings. Since it is based on categorical variables, some information is lost when it is used with samples from continuous distributions. Also, since the distribution of the test statistic is obtained asymptotically, the Pearson test is sensitive to sample size.

In this thesis, we focus on the fact that the Pearson test is an omnibus test. When it rejects the null hypothesis, it tells us there is enough evidence to suggest a relationship, but it says nothing about the strength or the type of this relationship. One approach to overcome this drawback is to use the additive property of the chi-squared distribution, which allows writing a chi-squared variable as the sum of asymptotically independent chi-squared variables. Such a breakdown of a chi-squared statistic is called a partition or a decomposition of the statistic. Many decompositions of the Pearson statistic have been proposed, allowing for the detection of different deviations from

the null distribution. Agresti[1] and Iversen[5] present a decomposition of the Pearson statistic based on the decomposition of the I × J contingency table into subtables. If the data are in the form of ranks, Rayner & Best provide a partition of the Pearson statistic based on a set of polynomials orthonormal with respect to a distribution. Depending on the model of the contingency table, this approach yields extensions of widely-used tests on two-way contingency tables, such as the Pearson product-moment correlation coefficient[36], the Kruskal-Wallis test[37] and the Friedman test[16]. If the alternative hypothesis is directional, other chi-squared test statistics have been proposed and decomposed, in particular the class of cumulative chi-squared (CCS) tests based on a weighted sum of increasing sums of squares, initially proposed by Taguchi[45, 46]. If the data are binary and the alternative is that of a monotone trend in the proportions, Alvo & Berthelot proposed a test statistic motivated by rankings.

This thesis reviews the above-mentioned decompositions. In Chapter 2, we first review the types of contingency tables and then go over the usual chi-squared tests based on the Pearson statistic. In Chapter 3, we review the necessary tools for the decompositions, including properties of ranked data, a brief review of smooth models and the derivation of polynomials orthonormal with respect to a distribution. Chapter 4 discusses the different decompositions. Finally, in Chapter 5, we study power simulations of the Pearson test, the cumulative chi-squared tests and the Spearman statistic of Alvo & Berthelot, to show that when we want to test homogeneity against monotonicity in one parameter, alternative and simpler tests are more powerful.

Chapter 2

Description of the Chi-Squared Test for a Contingency Table

In this chapter, we first define the chi-squared distribution and describe some of its properties. We then describe the three major types of contingency tables and give the distribution of the cell counts. We end the chapter with a review of the usual chi-squared hypothesis tests and the associated Pearson and likelihood ratio test statistics.

2.1 The Chi-Squared Distribution

Definition 2.1. A random variable W has a chi-squared distribution with ν degrees of freedom, denoted by W ∼ χ²(ν), if its density function is given by
$$f(w \mid \nu) = \frac{1}{\Gamma(\nu/2)\, 2^{\nu/2}}\, w^{\nu/2 - 1} e^{-w/2}, \qquad 0 \le w < \infty,\ \nu > 0.$$

From the definition of the density function for the chi-squared distribution, we can see that this distribution is in fact a special case of the gamma distribution. The mean, variance and characteristic function are respectively given by
$$E[W] = \nu, \qquad \operatorname{Var}[W] = 2\nu, \qquad \varphi_W(t) = \frac{1}{(1 - 2it)^{\nu/2}}.$$

Theorem 2.1. If W₁ and W₂ are two independent variables such that W₁ ∼ χ²(ν₁) and W₂ ∼ χ²(ν₂), then the random variable W₁ + W₂ ∼ χ²(ν₁ + ν₂). Conversely, if a random variable W ∼ χ²(ν), then W can always be expressed as the sum of two independent random variables W₁ and W₂ such that W₁ ∼ χ²(ν₁), W₂ ∼ χ²(ν₂) and ν = ν₁ + ν₂.

Proof. Let $\varphi_{W_k}(t) = (1 - 2it)^{-\nu_k/2}$ be the characteristic function of $W_k$, $k = 1, 2$. Since W₁ and W₂ are independent, the characteristic function

of W₁ + W₂ is
$$\varphi_{W_1 + W_2}(t) = \varphi_{W_1}(t)\,\varphi_{W_2}(t) = \frac{1}{(1 - 2it)^{\nu_1/2}} \cdot \frac{1}{(1 - 2it)^{\nu_2/2}} = \frac{1}{(1 - 2it)^{(\nu_1 + \nu_2)/2}},$$
which is the characteristic function of the χ²(ν₁ + ν₂) distribution. ∎

Remark 2.2. By recursion, Theorem 2.1 holds for any finite number of independent chi-squared variables such that the sum of their degrees of freedom is ν.

Lemma 2.3. Let Z be a standard normal variable. Then Z² has a χ²(1) distribution.

Proof. Let W = Z². For all w ∈ (0, ∞), the cdf of W is
$$F_W(w) = P(W \le w) = P(Z^2 \le w) = P(-w^{1/2} \le Z \le +w^{1/2}) = \Phi(+w^{1/2}) - \Phi(-w^{1/2}),$$
where Φ is the cdf of the N(0,1) distribution, whose pdf is $\phi(z) = e^{-z^2/2}/\sqrt{2\pi}$ for $-\infty < z < +\infty$. Then the pdf of W = Z² is
$$f_W(w) = \frac{d}{dw} F_W(w) = \left[\phi(+w^{1/2}) + \phi(-w^{1/2})\right] \frac{1}{2} w^{-1/2} = \frac{1}{\sqrt{2\pi w}}\, e^{-w/2} = \frac{1}{\Gamma(1/2)\, 2^{1/2}}\, w^{-1/2} e^{-w/2},$$
which is the density function of the χ²(1) distribution. ∎

Theorem 2.4. Let Z₁, …, Z_ν be ν independent and identically distributed (iid) variables with a N(0,1) distribution. Then W = Z₁² + ⋯ + Z_ν² ∼ χ²(ν).

Proof. By Lemma 2.3, the Z₁², …, Z_ν² are iid with a χ²(1) distribution, and so by Remark 2.2, W = Z₁² + ⋯ + Z_ν² ∼ χ²(ν). ∎

Alternatively, some authors define a chi-squared variable as the sum of ν independent squared standard normal variables and then show that it is in fact a special case of a gamma variable.
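Theorem 2.1 and Theorem 2.4 are easy to check numerically. The following is a minimal simulation sketch (not part of the thesis), assuming NumPy and SciPy are available; the sample size and degrees of freedom are arbitrary illustrative choices.

```python
# Minimal simulation sketch: Z^2 should behave like chi2(1) (Lemma 2.3),
# and a sum of independent chi-squared variables should behave like a
# chi-squared variable with the summed degrees of freedom (Theorem 2.1).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
m = 100_000

z = rng.standard_normal(m)
w1 = rng.chisquare(df=3, size=m)
w2 = rng.chisquare(df=5, size=m)

# Kolmogorov-Smirnov distances against the claimed limiting distributions;
# both should be small, with p-values far from 0.
print(stats.kstest(z**2, stats.chi2(df=1).cdf))
print(stats.kstest(w1 + w2, stats.chi2(df=8).cdf))
```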

2.2 Contingency Tables

We recall the definition of a categorical variable.

Definition 2.2. A categorical variable is a variable that can take on one of a limited, and usually fixed, number of possible values, which are defined in terms of categories.

A categorical variable can be represented on either a nominal or an ordinal scale. For example, the variable colour has a nominal scale: the categories red, orange, yellow, green, blue, indigo and violet have no particular order. On the other hand, the variable work qualification, with categories qualified, semi-qualified and not qualified, has an ordinal scale, where there is a decreasing level of qualification from the first to the last category.

Let X be a categorical variable with classification
$$\mathcal{A} = \{A_1, \ldots, A_I : A_i \cap A_j = \emptyset,\ i \ne j = 1, \ldots, I,\ \textstyle\bigcup_i A_i = \Xi\},$$
where Ξ is the set of possible values for X. Similarly, let Y be a categorical variable with classification
$$\mathcal{B} = \{B_1, \ldots, B_J : B_i \cap B_j = \emptyset,\ i \ne j = 1, \ldots, J,\ \textstyle\bigcup_j B_j = \Psi\},$$
where Ψ is the set of possible values for Y.

Suppose (X₁, Y₁), …, (X_n, Y_n) is a random sample of the random vector (X, Y). Let N_ij denote the random variable which counts the number of observations that fall into the cross-category A_i ∩ B_j; that is,
$$N_{ij} = \sum_{k=1}^n I[X_k \in A_i,\ Y_k \in B_j], \qquad i = 1, \ldots, I,\ j = 1, \ldots, J,$$
where I[·] is the indicator function, defined as I[A] = 1 if the event A occurs and 0 otherwise.

Table 2.1 represents an I × J contingency table for X and Y.

Table 2.1: An I × J contingency table

          B_1   ⋯   B_j   ⋯   B_J  | Total
  A_1     N_11  ⋯   N_1j  ⋯   N_1J | N_1·
  ⋮
  A_i     N_i1  ⋯   N_ij  ⋯   N_iJ | N_i·
  ⋮
  A_I     N_I1  ⋯   N_Ij  ⋯   N_IJ | N_I·
  Total   N_·1  ⋯   N_·j  ⋯   N_·J | n

Here:
- N_ij represents the count for the A_i ∩ B_j cross-category;
- N_i·, called the i-th row total, represents the count for the category A_i:
$$N_{i\cdot} = \sum_{k=1}^n I[X_k \in A_i] = N_{i1} + \cdots + N_{iJ}, \qquad i = 1, \ldots, I;$$
- N_·j, called the j-th column total, represents the count for the category B_j:
$$N_{\cdot j} = \sum_{k=1}^n I[Y_k \in B_j] = N_{1j} + \cdots + N_{Ij}, \qquad j = 1, \ldots, J;$$
- n represents the total number of observations, which is always fixed and known.
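As a concrete illustration of how Table 2.1 is formed, here is a minimal sketch (not from the thesis) that builds the I × J table of counts N_ij from hypothetical category-coded observations; the data and dimensions are made up for the example.

```python
# Build the table of counts N_ij from paired categorical observations.
import numpy as np

# Hypothetical (X, Y) pairs coded as category indices 0..I-1 and 0..J-1.
x = np.array([0, 1, 1, 2, 0, 2, 1, 0])
y = np.array([0, 1, 0, 1, 1, 1, 1, 0])

I, J = 3, 2
N = np.zeros((I, J), dtype=int)
np.add.at(N, (x, y), 1)        # N[i, j] = #{k : X_k in A_i and Y_k in B_j}

print(N)
print(N.sum(axis=1))           # row totals N_{i.}
print(N.sum(axis=0))           # column totals N_{.j}
print(N.sum())                 # n
```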

Parameters of interest in an I × J contingency table are the cell and marginal probabilities:
- π_ij denotes the probability that (X, Y) falls into the (i,j)-th cross-category:
$$\pi_{ij} = P(X \in A_i,\ Y \in B_j), \qquad i = 1, \ldots, I,\ j = 1, \ldots, J;$$
- π_i· denotes the probability that X is in category A_i: π_i· = P(X ∈ A_i), i = 1, …, I. Since
$$P(X \in A_i) = P(X \in A_i,\ Y \in \Psi) = \sum_{j=1}^J P(X \in A_i,\ Y \in B_j),$$
where the B_j are disjoint, we have π_i· = π_i1 + ⋯ + π_iJ;
- π_·j denotes the probability that Y is in category B_j: π_·j = P(Y ∈ B_j), j = 1, …, J. Since
$$P(Y \in B_j) = P(X \in \Xi,\ Y \in B_j) = \sum_{i=1}^I P(X \in A_i,\ Y \in B_j),$$
where the A_i are disjoint, we have π_·j = π_1j + ⋯ + π_Ij.

Here we have presented two-way contingency tables, but this concept can be extended to multi-way contingency tables, which cross-classify three or more categorical variables.

2.3 The Three Models Associated to Contingency Tables

As Lancaster[7] indicates, there are three probability models that allow for analysis of contingency tables. In all of the models, n is fixed, while the sets of marginal totals may or may not be fixed.

2.3.1 The unrestricted bivariate sampling model

Suppose X and Y are categorical variables and the random vector (X, Y) has joint distribution P(X ∈ A_i, Y ∈ B_j) = π_ij, i = 1, …, I, j = 1, …, J, such that π_11 + ⋯ + π_IJ = 1, where the A_i and B_j are the categories for X and Y, respectively. Let us assume n observations of (X, Y) are chosen by a random process with assigned weights π_ij. Let N_ij denote the number of observations in the (i,j)-th cross-category. The data can be arranged as in Table 2.1.

The probability of the cell counts N_ij of the I × J contingency table for X and Y is given by the multinomial distribution
$$P(\{N_{ij} = n_{ij}\} \mid \pi, n) = \binom{n}{n_{11}, \ldots, n_{IJ}} \prod_{i=1}^I \prod_{j=1}^J \pi_{ij}^{n_{ij}},$$
where π = (π_11, …, π_IJ)′. The log-likelihood function for π is
$$l(\pi \mid \{n_{ij}\}, n) \propto \sum_i \sum_j n_{ij} \ln \pi_{ij}.$$
Under the constraint π_11 + ⋯ + π_IJ = 1, the Lagrange function is
$$L(\{n_{ij}\}, \lambda \mid \pi, n) \propto \sum_i \sum_j n_{ij} \ln \pi_{ij} - \lambda\Big(\sum_i \sum_j \pi_{ij} - 1\Big).$$
To find the ML estimator for π_ij, we maximise the Lagrange function:
$$\frac{\partial}{\partial \pi_{ij}} L(\{n_{ij}\}, \lambda \mid \pi, n) = 0 \iff \frac{n_{ij}}{\pi_{ij}} - \lambda = 0 \iff \pi_{ij} = \frac{n_{ij}}{\lambda}. \tag{2.1}$$
Substituting into the constraint yields (n_11 + ⋯ + n_IJ)/λ = 1, so λ = n, and the unrestricted ML estimators of the probabilities π_ij are
$$\hat\pi_{ij} = \frac{N_{ij}}{n}, \qquad i = 1, \ldots, I,\ j = 1, \ldots, J.$$
Given that the cell counts have a multinomial distribution, the unrestricted ML estimators of the means are μ̂_ij = n π̂_ij, and the unrestricted ML estimators of the covariances are
$$\widehat{\operatorname{cov}}[N_{ij}, N_{ab}] = n \begin{cases} \hat\pi_{ij}(1 - \hat\pi_{ij}), & i = a,\ j = b,\\ -\hat\pi_{ij}\hat\pi_{ab}, & \text{otherwise}, \end{cases} \qquad \text{i.e. } \widehat{\operatorname{cov}}[N] = n\hat R,$$
where R is the IJ × IJ matrix given by
$$R = \operatorname{diag}(\pi) - \pi\pi'.$$
In this model, the only fixed quantity is the total number of observations.

Example 2.1. Suppose we observe n = 50 individuals and cross-classify each individual with respect to their sex (male or female) and their handedness (right-handed, left-handed or ambidextrous). Then the 2 × 3 contingency table for the variables sex and handedness follows an unrestricted bivariate sampling model.
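The ML estimates and the estimated covariance structure above are straightforward to compute. The following minimal sketch (not from the thesis) does so for a hypothetical 2 × 3 table; the counts are arbitrary.

```python
# Unrestricted ML estimates under the bivariate sampling (multinomial) model:
# pi_hat_ij = N_ij / n, mu_hat_ij = n * pi_hat_ij,
# cov_hat = n * (diag(pi_hat) - pi_hat pi_hat').
import numpy as np

N = np.array([[20, 30, 10],
              [40, 25, 25]])
n = N.sum()

pi_hat = (N / n).ravel()
mu_hat = n * pi_hat
R_hat = np.diag(pi_hat) - np.outer(pi_hat, pi_hat)
cov_hat = n * R_hat

print(mu_hat.reshape(N.shape))
print(np.diag(cov_hat).reshape(N.shape))   # variances n * pi_ij * (1 - pi_ij)
```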

2.3.2 The product-multinomial model

Lancaster[7] presents this as the comparative trial model; however, it is now more commonly known as the product-multinomial model. Let X₁, …, X_I be random variables defined on the same outcome space Ξ, which has the partition
$$\mathcal{B} = \{B_1, \ldots, B_J : B_i \cap B_j = \emptyset,\ i \ne j = 1, \ldots, J,\ \textstyle\bigcup_j B_j = \Xi\}.$$
Suppose X_{i1}, …, X_{in_{i·}} is a random sample for X_i, i = 1, …, I. Let N_ij denote the number of observations of the i-th sample which are in the j-th category; that is,
$$N_{ij} = \sum_{k=1}^{n_{i\cdot}} I[X_{ik} \in B_j], \qquad i = 1, \ldots, I,\ j = 1, \ldots, J.$$
Then the I × J contingency table for the variables X₁, …, X_I against the classification B is as in Table 2.2.

Table 2.2: An I × J contingency table under the product-multinomial model

          B_1   ⋯   B_j   ⋯   B_J  | Total
  X_1     N_11  ⋯   N_1j  ⋯   N_1J | n_1·
  ⋮
  X_i     N_i1  ⋯   N_ij  ⋯   N_iJ | n_i·
  ⋮
  X_I     N_I1  ⋯   N_Ij  ⋯   N_IJ | n_I·
  Total   N_·1  ⋯   N_·j  ⋯   N_·J | n

In this model, the rows are independent. We denote the row totals by n_i·, to emphasize that they are known and fixed:
$$n_{i\cdot} = N_{i1} + \cdots + N_{iJ}, \qquad i = 1, \ldots, I,$$
and the product-multinomial model is then called the row-multinomial model. Let π_ij denote the probability that an observation of the i-th variable falls into the j-th category; that is, P(X_i ∈ B_j) = π_ij. Since B partitions Ξ, the probabilities in each row sum to one:
$$\pi_{i1} + \cdots + \pi_{iJ} = 1, \qquad i = 1, \ldots, I.$$
Then, assuming that each observation is independent of every other observation, the i-th row N_i has a multinomial(n_i·, π_i) distribution, where π_i = (π_i1, …, π_iJ)′, i = 1, …, I, and the probability of the cell counts is given by the product of I multinomial distributions:
$$P(\{N_{ij} = n_{ij}\} \mid \pi, \{n_{i\cdot}\}, n) = \prod_{i=1}^I \binom{n_{i\cdot}}{n_{i1}, \ldots, n_{iJ}} \prod_{j=1}^J \pi_{ij}^{n_{ij}}, \tag{2.2}$$

where π = (π₁′, …, π_I′)′. The log-likelihood function for π is
$$l(\pi \mid \{n_{ij}\}, \{n_{i\cdot}\}, n) \propto \sum_i \sum_j n_{ij} \ln \pi_{ij}.$$
Under the constraints π_i1 + ⋯ + π_iJ = 1, i = 1, …, I, the Lagrange function is
$$L(\{n_{ij}\}, \lambda \mid \pi, n) \propto \sum_i \sum_j n_{ij} \ln \pi_{ij} - \sum_i \lambda_i \Big(\sum_j \pi_{ij} - 1\Big).$$
To find the ML estimator for π_ij, we maximise the Lagrange function:
$$\frac{\partial}{\partial \pi_{ij}} L(\{n_{ij}\}, \lambda \mid \pi, n) = 0 \iff \pi_{ij} = \frac{n_{ij}}{\lambda_i}.$$
Substituting into the i-th constraint yields (n_i1 + ⋯ + n_iJ)/λ_i = 1, so λ_i = n_i·, i = 1, …, I, and the unrestricted ML estimators of the probabilities π_ij are
$$\hat\pi_{ij} = \frac{N_{ij}}{n_{i\cdot}}, \qquad i = 1, \ldots, I,\ j = 1, \ldots, J.$$
Given that the rows have independent multinomial distributions, the unrestricted ML estimators of the means are μ̂_ij = n_i· π̂_ij, and the unrestricted ML estimators of the covariances are
$$\widehat{\operatorname{cov}}[N_{ij}, N_{ab}] = \begin{cases} n_{i\cdot}\,\hat\pi_{ij}(1 - \hat\pi_{ij}), & i = a,\ j = b,\\ -n_{i\cdot}\,\hat\pi_{ij}\hat\pi_{ib}, & i = a,\ j \ne b,\\ 0, & i \ne a, \end{cases} \qquad \text{i.e. } \begin{cases} n_{i\cdot}\hat R_i, & i = a,\\ 0, & i \ne a, \end{cases}$$
where R_i is the J × J matrix given by
$$R_i = \operatorname{diag}(\pi_i) - \pi_i\pi_i', \qquad i = 1, \ldots, I. \tag{2.3}$$
In the same way that the row totals were fixed, the column totals could be fixed instead; these would be denoted by n_·j, j = 1, …, J. Then the j-th column N_j has a multinomial(n_·j, π_j) distribution, the probability of the cell counts would be given by
$$P(\{N_{ij} = n_{ij}\} \mid \pi, \{n_{\cdot j}\}, n) = \prod_{j=1}^J \binom{n_{\cdot j}}{n_{1j}, \ldots, n_{Ij}} \prod_{i=1}^I \pi_{ij}^{n_{ij}},$$
and the product-multinomial model is now called a column-multinomial model.

Example 2.2. Suppose we sample n_i· = 50 individuals from each of I = 5 countries. For each country, we classify each individual with respect to their socio-economic class (upper, middle or lower class). Then the 5 × 3 contingency table for the variables country and socio-economic class follows a product-multinomial model.
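Under the row-multinomial model the estimation is row by row. A minimal sketch (not from the thesis), with hypothetical counts and fixed row totals:

```python
# Row-multinomial ML estimates: pi_hat_ij = N_ij / n_i. (each row sums to one).
import numpy as np

N = np.array([[12, 18, 20],    # n_1. = 50 (fixed by design)
              [25, 15, 10]])   # n_2. = 50
row_totals = N.sum(axis=1, keepdims=True)

pi_hat = N / row_totals
print(pi_hat)
print(pi_hat.sum(axis=1))      # each row: 1.0
print(row_totals * pi_hat)     # mu_hat_ij = n_i. * pi_hat_ij (equals N here)
```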

2.3.3 The permutation model

Suppose we have n independent observations which can be categorised according to one of two possible classifications,
$$\mathcal{A} = \{A_1, \ldots, A_I : A_i \cap A_j = \emptyset,\ i \ne j,\ \textstyle\bigcup_i A_i = \Xi\} \quad \text{and} \quad \mathcal{B} = \{B_1, \ldots, B_J : B_i \cap B_j = \emptyset,\ i \ne j,\ \textstyle\bigcup_j B_j = \Xi\},$$
where Ξ is the set of possible values for the observations. Let n_i·, i = 1, …, I, and n_·j, j = 1, …, J, be known and fixed positive integers such that
$$\sum_{i=1}^I n_{i\cdot} = \sum_{j=1}^J n_{\cdot j} = n.$$
From the n observations, select n_1· without replacement and assign them to A_1, then select n_2· without replacement and assign them to A_2, and so on. Now restart the process: from these same n observations, select n_·1 without replacement and assign them to B_1, then select n_·2 without replacement and assign them to B_2, and so on. In this way, each object now has two labels. The I × J contingency table for this cross-classification is as in Table 2.3.

Table 2.3: An I × J contingency table under the permutation model

          B_1   ⋯   B_j   ⋯   B_J  | Total
  A_1     N_11  ⋯   N_1j  ⋯   N_1J | n_1·
  ⋮
  A_i     N_i1  ⋯   N_ij  ⋯   N_iJ | n_i·
  ⋮
  A_I     N_I1  ⋯   N_Ij  ⋯   N_IJ | n_I·
  Total   n_·1  ⋯   n_·j  ⋯   n_·J | n

The row and column totals are denoted by n_i· and n_·j, respectively, to emphasize that they are fixed:
$$n_{i\cdot} = N_{i1} + \cdots + N_{iJ}, \qquad n_{\cdot j} = N_{1j} + \cdots + N_{Ij}, \qquad i = 1, \ldots, I,\ j = 1, \ldots, J.$$
In this model, the fixed quantities are the total number of observations and both sets of marginal totals. The number of ways of obtaining the cell counts in the i-th row is
$$\binom{n_{i\cdot}}{n_{i1}, \ldots, n_{iJ}} = \frac{n_{i\cdot}!}{\prod_{j=1}^J n_{ij}!},$$
and so the number of ways of obtaining an I × J contingency table with entries {n_ij} is
$$\prod_{i=1}^I \binom{n_{i\cdot}}{n_{i1}, \ldots, n_{iJ}} = \frac{\prod_{i=1}^I n_{i\cdot}!}{\prod_{i=1}^I \prod_{j=1}^J n_{ij}!}.$$
The number of ways of obtaining the column totals is
$$\binom{n}{n_{\cdot 1}, \ldots, n_{\cdot J}} = \frac{n!}{\prod_{j=1}^J n_{\cdot j}!}.$$

So the probability of the cell counts is given by
$$P(\{N_{ij} = n_{ij}\} \mid \{n_{i\cdot}\}, \{n_{\cdot j}\}, n) = \binom{n}{n_{\cdot 1}, \ldots, n_{\cdot J}}^{-1} \prod_{i=1}^I \binom{n_{i\cdot}}{n_{i1}, \ldots, n_{iJ}} = \frac{\left(\prod_{i} n_{i\cdot}!\right)\left(\prod_{j} n_{\cdot j}!\right)}{n!\, \prod_{i} \prod_{j} n_{ij}!}. \tag{2.4}$$
As per Lemma 5.1 of Lancaster's The Chi-Squared Distribution, the means of the cell counts are μ_ij = n π_i· π_·j, where, for the permutation model, the row and column probabilities are fixed at π_i· = n_i·/n and π_·j = n_·j/n, respectively, i = 1, …, I, j = 1, …, J. Continuing from Lemma 5.1, the covariances are given by
$$\operatorname{cov}[N_{ij}, N_{ab}] = \frac{n}{n-1} \begin{cases} \pi_{i\cdot}(1-\pi_{i\cdot})\,\pi_{\cdot j}(1-\pi_{\cdot j}), & i = a,\ j = b,\\ -\pi_{i\cdot}(1-\pi_{i\cdot})\,\pi_{\cdot j}\pi_{\cdot b}, & i = a,\ j \ne b,\\ -\pi_{i\cdot}\pi_{a\cdot}\,\pi_{\cdot j}(1-\pi_{\cdot j}), & i \ne a,\ j = b,\\ \pi_{i\cdot}\pi_{a\cdot}\,\pi_{\cdot j}\pi_{\cdot b}, & i \ne a,\ j \ne b, \end{cases}$$
so that the covariance matrix of the counts is $\frac{n}{n-1} S \otimes T$, where S and T are respectively the I × I and J × J matrices given by
$$S = \operatorname{diag}(\pi_{i\cdot}) - \pi_r\pi_r', \tag{2.5}$$
$$T = \operatorname{diag}(\pi_{\cdot j}) - \pi_c\pi_c', \tag{2.6}$$
with π_r = (π_1·, …, π_I·)′ and π_c = (π_·1, …, π_·J)′.

Example 2.3. Suppose we have two variables, sex and socio-economic class. We choose 150 males and 100 females such that we have 50 individuals classified as upper class, 75 as middle class and 125 as lower class. Then the 2 × 3 contingency table for the variables sex and socio-economic class follows a permutation model.

2.3.4 The link between the three models

While the cell counts have different distributions under the three models, the following theorem links them.

Theorem 2.5. Under the hypothesis of independence of the marginal variables, the distribution of the entries in a two-way contingency table does not involve the marginal parameters {π_i·} and {π_·j} in the unrestricted bivariate sampling model. Under the hypothesis of homogeneity, π_ij = π_·j, the distribution of the entries in a two-way contingency table does not involve the marginal parameters {π_·j} in the product-multinomial model.

Proof. For the unrestricted bivariate sampling model, let us condition on both sets of marginal totals:

$$P(\{N_{ij}=n_{ij}\} \mid \pi, \{N_{i\cdot}=n_{i\cdot}\}, \{N_{\cdot j}=n_{\cdot j}\}, n) = \frac{P(\{N_{ij}=n_{ij}\} \mid \pi, n)}{P(\{N_{i\cdot}=n_{i\cdot}\} \mid \pi, n)\,P(\{N_{\cdot j}=n_{\cdot j}\} \mid \pi, n)}$$
$$= \frac{\displaystyle\binom{n}{n_{11},\ldots,n_{IJ}} \prod_i\prod_j \pi_{ij}^{n_{ij}}}{\displaystyle\left[\binom{n}{n_{1\cdot},\ldots,n_{I\cdot}}\prod_i \pi_{i\cdot}^{n_{i\cdot}}\right]\left[\binom{n}{n_{\cdot 1},\ldots,n_{\cdot J}}\prod_j \pi_{\cdot j}^{n_{\cdot j}}\right]} = \frac{\left(\prod_i n_{i\cdot}!\right)\left(\prod_j n_{\cdot j}!\right)}{n!\,\prod_i\prod_j n_{ij}!}\; \prod_i\prod_j \left[\frac{\pi_{ij}}{\pi_{i\cdot}\pi_{\cdot j}}\right]^{n_{ij}}.$$
Under the hypothesis of independence, π_ij = π_i· π_·j, and so the probability is now equal to (2.4), which does not involve the marginal parameters {π_i·} and {π_·j}.

For the row-multinomial model, let us condition on the column totals:
$$P(\{N_{ij}=n_{ij}\} \mid \pi, \{n_{i\cdot}\}, \{N_{\cdot j}=n_{\cdot j}\}, n) = \frac{P(\{N_{ij}=n_{ij}\} \mid \pi, \{n_{i\cdot}\}, n)}{P(\{N_{\cdot j}=n_{\cdot j}\} \mid \pi, n)} = \frac{\displaystyle\prod_i \binom{n_{i\cdot}}{n_{i1},\ldots,n_{iJ}} \prod_j \pi_{ij}^{n_{ij}}}{\displaystyle\binom{n}{n_{\cdot 1},\ldots,n_{\cdot J}} \prod_j \pi_{\cdot j}^{n_{\cdot j}}} = \frac{\left(\prod_i n_{i\cdot}!\right)\left(\prod_j n_{\cdot j}!\right)}{n!\,\prod_i\prod_j n_{ij}!} \prod_i\prod_j \left[\frac{\pi_{ij}}{\pi_{\cdot j}}\right]^{n_{ij}}.$$
Under the hypothesis of homogeneity, π_ij = π_·j, and so the probability is again equal to (2.4), which does not involve the column parameters {π_·j}. ∎

This theorem, presented by Lancaster[7], tells us that, under the appropriate hypothesis and given both sets of marginal totals, the cell counts under the three models have the same conditional probability,
$$P(\{N_{ij} = n_{ij}\} \mid \{N_{i\cdot} = n_{i\cdot}\}, \{N_{\cdot j} = n_{\cdot j}\}, n) = \frac{\left(\prod_i n_{i\cdot}!\right)\left(\prod_j n_{\cdot j}!\right)}{n!\,\prod_i\prod_j n_{ij}!}. \tag{2.7}$$
In fact, if we condition on both sets of marginal probabilities, we have the same result.
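Equation (2.7) can be evaluated directly. The sketch below (not from the thesis) computes it for a hypothetical 2 × 2 table and cross-checks against the hypergeometric pmf, which is the conditional distribution of N₁₁ given both margins in the 2 × 2 case.

```python
# P({N_ij = n_ij} | margins) from equation (2.7), with a hypergeometric check.
from math import factorial, prod
import numpy as np
from scipy import stats

def table_prob(N):
    N = np.asarray(N)
    n = N.sum()
    num = prod(factorial(int(r)) for r in N.sum(axis=1)) * \
          prod(factorial(int(c)) for c in N.sum(axis=0))
    den = factorial(int(n)) * prod(factorial(int(x)) for x in N.ravel())
    return num / den

N = [[3, 2],
     [1, 4]]                      # hypothetical table; margins (5,5) and (4,6)
print(table_prob(N))              # 0.2381...
# For a 2 x 2 table, N_11 | margins ~ Hypergeometric(M=n, n=N_.1, N=n_1.).
print(stats.hypergeom(M=10, n=4, N=5).pmf(3))
```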

2.4 Chi-Squared Hypothesis Tests

We consider test statistics which asymptotically have a χ² distribution under the null hypothesis.

2.4.1 The chi-squared goodness-of-fit test

A goodness-of-fit test is used to decide if a sample comes from a specified distribution. This is done by verifying whether the empirical distribution and the specified distribution give the same probabilities for a fixed number of sets. The most widely used chi-squared test for goodness-of-fit was introduced by Karl Pearson[3] in 1900.

Let X be a random variable, defined on the outcome space Ξ, with unknown cumulative distribution function F. Let X₁, …, X_n be a random sample of X. The null hypothesis for a chi-squared test of goodness-of-fit is
$$H_0: F(x) = F_0(x) \quad \text{for all } x \in \mathbb{R}, \tag{2.8}$$
where F₀ is a completely specified distribution. The data are grouped in the following way: let
$$\mathcal{B} = \{B_1, \ldots, B_J : B_i \cap B_j = \emptyset,\ i \ne j,\ \textstyle\bigcup_j B_j = \Xi\}$$
be a pre-specified partition of Ξ. Let N_j denote the random variable that counts the number of observations falling in the j-th category B_j:
$$N_j = \sum_{k=1}^n I[X_k \in B_j], \qquad j = 1, \ldots, J.$$
Then we have the following 1 × J contingency table:

          B_1  ⋯  B_j  ⋯  B_J | Total
  X       N_1  ⋯  N_j  ⋯  N_J | n

This contingency table falls under the row-multinomial model with I = 1, cell counts N_1j = N_j, cell probabilities π_1j = π_j = P(X ∈ B_j), which satisfy π_1 + ⋯ + π_J = 1, and a fixed (row) total of n. Now the null hypothesis for a chi-squared test of goodness-of-fit is equivalent to
$$H_0': \pi_j = \pi_{j0}, \qquad j = 1, \ldots, J, \tag{2.9}$$
where π_j0 is the probability of the j-th cell under the distribution F₀. Since the hypothesis H₀ is more restrictive than the hypothesis H₀′, if the hypothesis H₀′ is rejected then it is natural to reject the hypothesis H₀. If the hypothesis H₀′ is not rejected, then we can say that the grouped data do not contradict the hypothesis H₀.

The Pearson chi-squared statistic

Given that the above 1 × J contingency table falls under the row-multinomial model, the random vector of cell counts N = (N₁, …, N_J)′ has a multinomial(n, π) distribution, where π = (π₁, …, π_J)′. Pearson suggested the following statistic, the Pearson chi-squared statistic, to test against the alternative hypothesis H₁: π_j ≠ π_j0 for some j:
$$X_P^2 = \sum_{j=1}^J \frac{(N_j - \hat\mu_{j0})^2}{\hat\mu_{j0}},$$
where μ̂_j0 is the ML estimator of μ_j = nπ_j under H₀′: π_j = π_j0. This statistic looks at the difference between the observed counts and their expected values under H₀′. The expected value of N_j under H₀′ is μ̂_j0 = nπ_j0. Thus the Pearson statistic for goodness-of-fit is
$$X_P^2 = \sum_{j=1}^J \frac{(N_j - n\pi_{j0})^2}{n\pi_{j0}} = \sum_{j=1}^J \frac{N_j^2 - 2n\pi_{j0}N_j + (n\pi_{j0})^2}{n\pi_{j0}} = \sum_{j=1}^J \frac{N_j^2}{n\pi_{j0}} - 2\sum_{j=1}^J N_j + n\sum_{j=1}^J \pi_{j0} = \sum_{j=1}^J \frac{N_j^2}{n\pi_{j0}} - n.$$

Theorem 2.6. Under the null hypothesis H₀′: π_j = π_j0, j = 1, …, J, the Pearson statistic X_P² asymptotically has a χ²(J − 1) distribution as n → ∞.

Proof. The proof is presented in Kendall & Stuart[6]. ∎
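As a minimal numerical sketch (not from the thesis), the Pearson goodness-of-fit statistic and its χ²(J − 1) p-value can be computed as follows; the counts and null probabilities are hypothetical, and scipy.stats.chisquare reproduces the same statistic.

```python
# X_P^2 = sum_j (N_j - n pi_j0)^2 / (n pi_j0), referred to chi2(J - 1).
import numpy as np
from scipy import stats

N = np.array([18, 29, 32, 21])            # observed counts, n = 100
pi0 = np.array([0.25, 0.25, 0.25, 0.25])  # completely specified null F_0
n = N.sum()

X_P2 = np.sum((N - n * pi0) ** 2 / (n * pi0))
print(X_P2, stats.chi2(df=len(N) - 1).sf(X_P2))
print(stats.chisquare(N, f_exp=n * pi0))  # same statistic and p-value
```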

The likelihood ratio test statistic

An alternative statistic for testing the above hypotheses is derived by treating this problem as a two-sided likelihood ratio test (LRT). From subsection 2.3.2, the likelihood function for π is
$$L(\pi \mid \{n_j\}, n) = \binom{n}{n_1, \ldots, n_J} \prod_{j=1}^J \pi_j^{n_j},$$
and the unrestricted ML estimators of the probabilities π_j are π̂_j = N_j/n, j = 1, …, J. The ML estimators of the probabilities π_j under H₀′ are π̂_j0 = π_j0, j = 1, …, J. Therefore, the two-sided LRT statistic is
$$\Lambda(N, n) = \frac{L(\hat\pi_{j0} \mid N, n)}{L(\hat\pi_j \mid N, n)} = \frac{\binom{n}{N_1, \ldots, N_J} \prod_j \pi_{j0}^{N_j}}{\binom{n}{N_1, \ldots, N_J} \prod_j (N_j/n)^{N_j}} = \prod_{j=1}^J \left(\frac{n\pi_{j0}}{N_j}\right)^{N_j}.$$
Wilks[49] showed that, if the null hypothesis is true, −2 ln Λ asymptotically has a chi-squared distribution as n → ∞. The degrees of freedom are ν = dim Θ − dim Θ₀, where Θ is the parameter space and Θ₀ is the parameter space under the null hypothesis. The alternative statistic, called the chi-squared likelihood ratio test statistic, is
$$G^2 = -2 \ln \Lambda.$$
So the chi-squared LRT statistic for goodness-of-fit is
$$G^2 = -2 \ln \Lambda = 2 \sum_{j=1}^J N_j \ln\left(\frac{N_j}{n\pi_{j0}}\right).$$
The parameter space is
$$\Theta = \{(\pi_1, \ldots, \pi_J) \in [0,1]^J : \pi_1 + \cdots + \pi_J = 1\},$$
so dim Θ = J − 1 because of the relation Σ_j π_j = 1. Under H₀′: π_j = π_j0, the probabilities are fixed, so dim Θ₀ = 0, and the chi-squared LRT statistic G² asymptotically has a χ²(J − 1) distribution as n → ∞.
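A companion sketch (not from the thesis) for the LRT statistic, on the same hypothetical counts; note how close G² is to X_P² here, anticipating the equivalence result of the next subsection.

```python
# G^2 = 2 sum_j N_j ln(N_j / (n pi_j0)), referred to chi2(J - 1).
import numpy as np
from scipy import stats

N = np.array([18, 29, 32, 21])
pi0 = np.full(4, 0.25)
n = N.sum()

G2 = 2 * np.sum(N * np.log(N / (n * pi0)))   # assumes all N_j > 0
X_P2 = np.sum((N - n * pi0) ** 2 / (n * pi0))
df = len(N) - 1
print(G2, X_P2)                              # numerically close
print(stats.chi2(df).sf(G2), stats.chi2(df).sf(X_P2))
```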

Comparison of the likelihood ratio test statistic and the Pearson chi-squared statistic

When the null hypothesis of goodness-of-fit is true, both the Pearson statistic and the chi-squared LRT statistic asymptotically have a χ²(J − 1) distribution.

Theorem 2.7. Under the same null hypothesis, the statistics X_P² and G² are asymptotically equivalent.

Proof. Let μ̂_j0 be the ML estimator of μ_j = E(N_j) under the null hypothesis. Then the chi-squared LRT statistic is
$$G^2 = 2\sum_{j=1}^J N_j \ln\left(\frac{N_j}{\hat\mu_{j0}}\right) = 2\sum_{j=1}^J (\hat\mu_{j0} + N_j - \hat\mu_{j0}) \ln\left(1 + \frac{N_j - \hat\mu_{j0}}{\hat\mu_{j0}}\right).$$
Let g(x) = ln(1 + x). Then the k-th derivative of g is
$$g^{(k)}(x) = \frac{(-1)^{k-1}(k-1)!}{(1+x)^k}, \qquad k = 1, 2, \ldots$$
The Taylor expansion of g around a constant c is
$$T_g(x, c) = \sum_{i \ge 1} \frac{g^{(i)}(c)}{i!}(x - c)^i = \sum_{i \ge 1} \frac{(-1)^{i-1}}{i}\left(\frac{x - c}{1 + c}\right)^i.$$
For c = 0,
$$T_g(x, 0) = \sum_{i \ge 1} \frac{(-1)^{i-1}}{i} x^i = x - \frac{x^2}{2} + \frac{x^3}{3} - \frac{x^4}{4} + \cdots$$

Then, writing d_j = N_j − μ̂_j0, the Taylor expansion around c = 0 of the j-th term of G² is
$$2(\hat\mu_{j0} + d_j)\, T_g\!\left(\frac{d_j}{\hat\mu_{j0}}, 0\right) = 2\hat\mu_{j0}\left[\frac{d_j}{\hat\mu_{j0}} - \frac{1}{2}\frac{d_j^2}{\hat\mu_{j0}^2} + \frac{1}{3}\frac{d_j^3}{\hat\mu_{j0}^3} - \cdots\right] + 2 d_j\left[\frac{d_j}{\hat\mu_{j0}} - \frac{1}{2}\frac{d_j^2}{\hat\mu_{j0}^2} + \cdots\right]$$
$$= 2 d_j + \frac{d_j^2}{\hat\mu_{j0}} - \frac{1}{3}\frac{d_j^3}{\hat\mu_{j0}^2} + \cdots = 2(N_j - \hat\mu_{j0}) + \frac{(N_j - \hat\mu_{j0})^2}{\hat\mu_{j0}} + O\!\left[(N_j - \hat\mu_{j0})^3\right].$$
Since under the null hypothesis $\sum_{j=1}^J (N_j - \hat\mu_{j0}) = 0$, as n → ∞,
$$G^2 \approx \sum_{j=1}^J \frac{(N_j - \hat\mu_{j0})^2}{\hat\mu_{j0}} = X_P^2. \qquad \blacksquare$$
Since the proof does not use the explicit form of the ML estimator μ̂_j0, X_P² and G² are always asymptotically equivalent under the same null hypothesis.

2.4.2 The chi-squared test for homogeneity

The objective of the chi-squared test for homogeneity is to decide if two or more samples come from the same population, by determining if they give the same probabilities for a fixed number of sets. Let X₁, …, X_I be random variables, defined on the same outcome space Ξ, which has the partition
$$\mathcal{B} = \{B_1, \ldots, B_J : B_i \cap B_j = \emptyset,\ i \ne j,\ \textstyle\bigcup_j B_j = \Xi\},$$
with unknown cdfs F₁, …, F_I, respectively. Let X_{i1}, …, X_{in_{i·}} be a random sample from F_i, i = 1, …, I. We set
$$\pi_{ij} = P_i(X_i \in B_j), \qquad i = 1, \ldots, I,\ j = 1, \ldots, J,$$
where P_i is the probability function associated with F_i, i = 1, …, I. The data are grouped in the following way: let N_ij denote the number of observations in the i-th sample that fall in the j-th category,
$$N_{ij} = \sum_{k=1}^{n_{i\cdot}} I[X_{ik} \in B_j], \qquad i = 1, \ldots, I,\ j = 1, \ldots, J.$$

Then we have an I × J contingency table as per Table 2.2. The null hypothesis for a chi-squared test of homogeneity of the random variables is
$$H_0: F_1(x) = \cdots = F_I(x) \quad \text{for all } x \in \mathbb{R}, \tag{2.10}$$
which is equivalent to
$$H_0': \pi_{1j} = \cdots = \pi_{Ij} = \pi_j, \qquad j = 1, \ldots, J. \tag{2.11}$$
Since the hypothesis H₀ is more restrictive than the hypothesis H₀′, if the hypothesis H₀′ is rejected then it is natural to reject the hypothesis H₀. If the hypothesis H₀′ is not rejected, then we can say that the grouped data do not contradict the hypothesis H₀.

The Pearson chi-squared statistic

Given that the row totals are fixed, the contingency table falls under the row-multinomial model and the i-th row N_i has a multinomial(n_i·, π_i) distribution, i = 1, …, I. The Pearson statistic to test H₀′ against the alternative H₁: π_ij ≠ π_j for some (i, j) is
$$X_P^2 = \sum_{i=1}^I \sum_{j=1}^J \frac{\big(N_{ij} - \hat\mu_{ij}^{(0)}\big)^2}{\hat\mu_{ij}^{(0)}},$$
where μ̂_ij^{(0)} is the ML estimator of μ_ij = n_i· π_ij under H₀′, i = 1, …, I, j = 1, …, J. We need to find the ML estimators π̂_ij^{(0)} under H₀′. Under the row-multinomial model, the log-likelihood function for π is
$$l(\pi \mid \{n_{ij}\}, \{n_{i\cdot}\}, n) \propto \sum_i \sum_j n_{ij} \ln \pi_{ij}.$$
Under H₀′, the log-likelihood is
$$l_0(\pi \mid \{n_{ij}\}, \{n_{i\cdot}\}, n) \propto \sum_j n_{\cdot j} \ln \pi_j.$$
Under the constraint π₁ + ⋯ + π_J = 1, the Lagrange function under H₀′ is
$$L_0(\{n_{ij}\}, \lambda \mid \{\pi_j\}, n) \propto \sum_j n_{\cdot j} \ln \pi_j - \lambda\Big(\sum_j \pi_j - 1\Big).$$
To find the ML estimator for π_j, we maximise the Lagrange function:
$$\frac{\partial}{\partial \pi_j} L_0(\{n_{ij}\}, \lambda \mid \{\pi_j\}, n) = 0 \iff \pi_j = \frac{n_{\cdot j}}{\lambda}.$$
Substituting into the constraint yields (n_·1 + ⋯ + n_·J)/λ = 1, so λ = n, and the ML estimators of the marginal column probabilities π_j under H₀′ are
$$\hat\pi_j^{(0)} = \frac{N_{\cdot j}}{n}, \qquad j = 1, \ldots, J.$$

Then the ML estimators of the cell probabilities π_ij under H₀′: π_1j = ⋯ = π_Ij = π_j are
$$\hat\pi_{ij}^{(0)} = \hat\pi_j^{(0)} = \frac{N_{\cdot j}}{n}, \qquad i = 1, \ldots, I,\ j = 1, \ldots, J,$$
and the ML estimators of the cell means μ_ij under H₀′ are
$$\hat\mu_{ij}^{(0)} = \frac{n_{i\cdot} N_{\cdot j}}{n}, \qquad i = 1, \ldots, I,\ j = 1, \ldots, J.$$
Thus the Pearson statistic for homogeneity is
$$X_P^2 = \sum_i \sum_j \frac{(n N_{ij} - n_{i\cdot} N_{\cdot j})^2}{n\, n_{i\cdot} N_{\cdot j}} = \frac{1}{n} \sum_i \sum_j \frac{n^2 N_{ij}^2 - 2 n N_{ij}\, n_{i\cdot} N_{\cdot j} + (n_{i\cdot} N_{\cdot j})^2}{n_{i\cdot} N_{\cdot j}} = n\left(\sum_i \sum_j \frac{N_{ij}^2}{n_{i\cdot} N_{\cdot j}} - 1\right).$$

Theorem 2.8. Under the hypothesis H₀′: π_1j = ⋯ = π_Ij = π_j, j = 1, …, J, the Pearson statistic X_P² has an approximately chi-squared distribution with (I − 1)(J − 1) degrees of freedom as n_i· → ∞, i = 1, …, I, and n → ∞.

Proof. Under H₀′, by the same approach as for Theorem 2.6, we obtain that for each row,
$$X_i^2 = \sum_{j=1}^J \frac{(N_{ij} - n_{i\cdot} N_{\cdot j}/n)^2}{n_{i\cdot} N_{\cdot j}/n}, \qquad i = 1, \ldots, I,$$
approximately has a χ²(J − 1) distribution as n_i· → ∞. Since only I − 1 of the rows are independent under H₀′, X_P² = Σᵢ X_i² approximately has a χ²((I − 1)(J − 1)) distribution as n_i· → ∞, i = 1, …, I, and n → ∞. ∎
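The closed form above is easy to check numerically. A minimal sketch (not from the thesis) on a hypothetical two-sample table; scipy.stats.chi2_contingency computes the same statistic (it does not distinguish homogeneity from independence, since the formula is identical).

```python
# X_P^2 = n * (sum_ij N_ij^2 / (n_i. N_.j) - 1) for a row-multinomial table.
import numpy as np
from scipy import stats

N = np.array([[20, 30, 50],    # sample 1, n_1. = 100
              [35, 25, 40]])   # sample 2, n_2. = 100
n = N.sum()
row = N.sum(axis=1, keepdims=True)
col = N.sum(axis=0, keepdims=True)

X_P2 = n * (np.sum(N ** 2 / (row * col)) - 1)
df = (N.shape[0] - 1) * (N.shape[1] - 1)
print(X_P2, stats.chi2(df).sf(X_P2))

chi2, p, dof, expected = stats.chi2_contingency(N, correction=False)
print(chi2, p)                 # matches the closed form
```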

The likelihood ratio test statistic

Under the row-multinomial model, the likelihood function for π = (π_11, …, π_IJ)′ is
$$L(\pi \mid \{n_{ij}\}, \{n_{i\cdot}\}, n) = \prod_{i=1}^I \binom{n_{i\cdot}}{n_{i1}, \ldots, n_{iJ}} \prod_{j=1}^J \pi_{ij}^{n_{ij}},$$
and the unrestricted ML estimators of the probabilities π_ij are π̂_ij = N_ij/n_i·, i = 1, …, I, j = 1, …, J. As we saw, π̂_ij^{(0)} = π̂_j^{(0)} = N_·j/n. Therefore, the two-sided LRT statistic is
$$\Lambda(N, n) = \frac{L\big(\hat\pi_{ij}^{(0)} \mid N, n\big)}{L(\hat\pi_{ij} \mid N, n)} = \frac{\prod_i \binom{n_{i\cdot}}{N_{i1}, \ldots, N_{iJ}} \prod_j (N_{\cdot j}/n)^{N_{ij}}}{\prod_i \binom{n_{i\cdot}}{N_{i1}, \ldots, N_{iJ}} \prod_j (N_{ij}/n_{i\cdot})^{N_{ij}}} = \prod_{i=1}^I \prod_{j=1}^J \left(\frac{n_{i\cdot} N_{\cdot j}}{n N_{ij}}\right)^{N_{ij}},$$
and the chi-squared LRT statistic for homogeneity is
$$G^2 = -2 \ln \Lambda = 2 \sum_{i=1}^I \sum_{j=1}^J N_{ij} \ln\left(\frac{n N_{ij}}{n_{i\cdot} N_{\cdot j}}\right).$$
The parameter space is
$$\Theta = \Big\{(\pi_{11}, \ldots, \pi_{IJ}) \in [0,1]^{IJ} : \sum_{j=1}^J \pi_{ij} = 1,\ i = 1, \ldots, I\Big\},$$
and dim Θ = I(J − 1) because of the relation π_iJ = 1 − Σ_{j<J} π_ij for each of the I rows. The parameter space under H₀′ is
$$\Theta_0 = \Big\{(\pi_1, \ldots, \pi_J) \in [0,1]^J : \sum_{j=1}^J \pi_j = 1\Big\},$$
and dim Θ₀ = J − 1 because of the relation π_J = 1 − Σ_{j<J} π_j. So under H₀′, the chi-squared LRT statistic G² has an approximate chi-squared distribution with I(J − 1) − (J − 1) = (I − 1)(J − 1) degrees of freedom as n_i· → ∞, i = 1, …, I, and n → ∞.
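And similarly for G² (a minimal sketch, not from the thesis); scipy's power-divergence option lambda_="log-likelihood" reproduces it.

```python
# G^2 = 2 sum_ij N_ij ln(n N_ij / (n_i. N_.j)) for the homogeneity test.
import numpy as np
from scipy import stats

N = np.array([[20, 30, 50],
              [35, 25, 40]])
E = N.sum(axis=1, keepdims=True) * N.sum(axis=0, keepdims=True) / N.sum()

G2 = 2 * np.sum(N * np.log(N / E))          # assumes all N_ij > 0
df = (N.shape[0] - 1) * (N.shape[1] - 1)
print(G2, stats.chi2(df).sf(G2))
print(stats.chi2_contingency(N, correction=False,
                             lambda_="log-likelihood")[:2])
```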

2.4.3 The chi-squared test for independence

The objective of the chi-squared test of independence is to decide if two variables are independent by comparing the observed counts to their expected values under the hypothesis of independence. Let X be a random variable, defined on the outcome space Ξ, which has partition A = {A₁, …, A_I : A_i ∩ A_j = ∅, i ≠ j, ∪A_i = Ξ}. Similarly, let Y be a random variable, defined on the outcome space Ψ, which has partition B = {B₁, …, B_J : B_i ∩ B_j = ∅, i ≠ j, ∪B_j = Ψ}. The null hypothesis of independence between X and Y is
$$H_0: F(x, y) = F_X(x) F_Y(y) \quad \text{for all } (x, y) \in \mathbb{R}^2, \tag{2.12}$$
where F is the joint cdf, and F_X and F_Y are the cdfs of X and Y, respectively. Suppose (X₁, Y₁), …, (X_n, Y_n) is a random sample of the random vector (X, Y), and let N_ij denote the random variable that counts the number of observations falling in the (i,j)-th cross-category:
$$N_{ij} = \sum_{k=1}^n I[X_k \in A_i,\ Y_k \in B_j], \qquad i = 1, \ldots, I,\ j = 1, \ldots, J.$$
Then we have an I × J contingency table as per Table 2.1. We set
$$\pi_{ij} = P(X \in A_i,\ Y \in B_j), \qquad i = 1, \ldots, I,\ j = 1, \ldots, J,$$
where P is the probability function associated with the joint cdf F. If the hypothesis H₀ is true, then π_ij = π_i· π_·j for all i and j. Thus a corresponding null hypothesis is
$$H_0': \pi_{ij} = \pi_{i\cdot}\pi_{\cdot j}, \qquad i = 1, \ldots, I,\ j = 1, \ldots, J. \tag{2.13}$$
Since the hypothesis H₀ is narrower than the hypothesis H₀′, if the hypothesis H₀′ is rejected then it is natural to reject the hypothesis H₀. If the hypothesis H₀′ is not rejected, then we can say that the grouped data do not contradict the hypothesis H₀.

The Pearson chi-squared statistic

Given that only the total number of observations is fixed, the contingency table falls under the unrestricted bivariate sampling model, and so the random vector of cell counts N = (N_11, …, N_IJ)′ has a multinomial(n, π) distribution with π_11 + ⋯ + π_IJ = 1. The Pearson statistic to test H₀′ against H₁: π_ij ≠ π_i·π_·j for some (i, j) is
$$X_P^2 = \sum_{i=1}^I \sum_{j=1}^J \frac{\big(N_{ij} - \hat\mu_{ij}^{(0)}\big)^2}{\hat\mu_{ij}^{(0)}},$$
where μ̂_ij^{(0)} is the ML estimator of μ_ij = nπ_ij under H₀′. We need to find the ML estimators π̂_ij^{(0)} under H₀′: π_ij = π_i·π_·j. Under the unrestricted bivariate sampling model, the log-likelihood under H₀′ is
$$l_0(\pi \mid \{n_{ij}\}, n) \propto \sum_i n_{i\cdot} \ln \pi_{i\cdot} + \sum_j n_{\cdot j} \ln \pi_{\cdot j}.$$
Under the constraints π_1· + ⋯ + π_I· = 1 and π_·1 + ⋯ + π_·J = 1, the Lagrange function under H₀′ is
$$L_0(\{n_{ij}\}, \lambda \mid \{\pi_{i\cdot}\}, \{\pi_{\cdot j}\}, n) \propto \sum_i n_{i\cdot} \ln \pi_{i\cdot} + \sum_j n_{\cdot j} \ln \pi_{\cdot j} - \lambda_1\Big(\sum_i \pi_{i\cdot} - 1\Big) - \lambda_2\Big(\sum_j \pi_{\cdot j} - 1\Big).$$
To find the ML estimator for π_i·, we maximise the Lagrange function:
$$\frac{\partial}{\partial \pi_{i\cdot}} L_0 = 0 \iff \pi_{i\cdot} = \frac{n_{i\cdot}}{\lambda_1}.$$
Substituting into the first constraint yields (n_1· + ⋯ + n_I·)/λ₁ = 1, so λ₁ = n, and the ML estimators of the marginal row probabilities π_i· under H₀′ are
$$\hat\pi_{i\cdot}^{(0)} = \frac{N_{i\cdot}}{n}, \qquad i = 1, \ldots, I.$$

Similarly, λ₂ = n and the ML estimators of the marginal column probabilities π_·j under H₀′ are
$$\hat\pi_{\cdot j}^{(0)} = \frac{N_{\cdot j}}{n}, \qquad j = 1, \ldots, J.$$
Then the ML estimators of the cell probabilities π_ij under H₀′ are
$$\hat\pi_{ij}^{(0)} = \hat\pi_{i\cdot}^{(0)}\,\hat\pi_{\cdot j}^{(0)} = \frac{N_{i\cdot}}{n}\frac{N_{\cdot j}}{n}, \qquad i = 1, \ldots, I,\ j = 1, \ldots, J,$$
and the ML estimators of the cell means μ_ij under H₀′ are
$$\hat\mu_{ij}^{(0)} = n\,\hat\pi_{ij}^{(0)} = \frac{N_{i\cdot} N_{\cdot j}}{n}, \qquad i = 1, \ldots, I,\ j = 1, \ldots, J.$$
Thus the Pearson statistic for independence is
$$X_P^2 = \sum_i \sum_j \frac{(n N_{ij} - N_{i\cdot} N_{\cdot j})^2}{n N_{i\cdot} N_{\cdot j}} = \frac{1}{n}\sum_i \sum_j \frac{(n N_{ij})^2 - 2 n N_{ij} N_{i\cdot} N_{\cdot j} + (N_{i\cdot} N_{\cdot j})^2}{N_{i\cdot} N_{\cdot j}} = n\left(\sum_i \sum_j \frac{N_{ij}^2}{N_{i\cdot} N_{\cdot j}} - 1\right).$$

Theorem 2.9. Under the hypothesis H₀′: π_ij = π_i·π_·j, i = 1, …, I, j = 1, …, J, the Pearson statistic X_P² asymptotically has a χ²((I − 1)(J − 1)) distribution as n → ∞.

Proof. Similar to that for the Pearson statistic for the test of homogeneity, but with a vector of length IJ. ∎

The likelihood ratio test statistic

Under the unrestricted bivariate sampling model, the likelihood function for π = (π_11, …, π_IJ)′ is
$$L(\pi \mid \{n_{ij}\}, n) = \binom{n}{n_{11}, \ldots, n_{IJ}} \prod_{i=1}^I \prod_{j=1}^J \pi_{ij}^{n_{ij}},$$
and the unrestricted ML estimators of the cell probabilities π_ij are π̂_ij = N_ij/n, i = 1, …, I, j = 1, …, J. As we saw, π̂_ij^{(0)} = π̂_i·^{(0)} π̂_·j^{(0)} = N_i·N_·j/n² for all i and j. Therefore, the two-sided LRT statistic is
$$\Lambda(N, n) = \frac{L\big(\hat\pi_{ij}^{(0)} \mid N, n\big)}{L(\hat\pi_{ij} \mid N, n)} = \frac{\binom{n}{N_{11}, \ldots, N_{IJ}} \prod_i \prod_j \big(N_{i\cdot}N_{\cdot j}/n^2\big)^{N_{ij}}}{\binom{n}{N_{11}, \ldots, N_{IJ}} \prod_i \prod_j (N_{ij}/n)^{N_{ij}}} = \prod_{i=1}^I \prod_{j=1}^J \left(\frac{N_{i\cdot} N_{\cdot j}}{n N_{ij}}\right)^{N_{ij}},$$

and the chi-squared LRT statistic for independence is
$$G^2 = -2 \ln \Lambda = 2 \sum_{i=1}^I \sum_{j=1}^J N_{ij} \ln\left(\frac{n N_{ij}}{N_{i\cdot} N_{\cdot j}}\right).$$
The parameter space is
$$\Theta = \Big\{(\pi_{11}, \ldots, \pi_{IJ}) \in [0,1]^{IJ} : \sum_i \sum_j \pi_{ij} = 1\Big\},$$
and dim Θ = IJ − 1 because of the relation π_IJ = 1 − Σ_{(i,j) ≠ (I,J)} π_ij. The parameter space under H₀′ is
$$\Theta_0 = \Big\{(\pi_{1\cdot}\pi_{\cdot 1}, \ldots, \pi_{I\cdot}\pi_{\cdot J}) \in [0,1]^{IJ} : \sum_i \pi_{i\cdot} = 1,\ \sum_j \pi_{\cdot j} = 1\Big\},$$
and dim Θ₀ = (I − 1) + (J − 1) because of the relations π_I· = 1 − Σ_{i<I} π_i· and π_·J = 1 − Σ_{j<J} π_·j. So under H₀′, the chi-squared LRT statistic G² has an approximate chi-squared distribution with IJ − 1 − (I − 1) − (J − 1) = (I − 1)(J − 1) degrees of freedom as n → ∞.
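The degrees-of-freedom claim can be checked by simulation. Below is a minimal sketch (not from the thesis): tables are drawn from an unrestricted bivariate sampling model satisfying independence, and the simulated X_P² values are compared to χ²((I − 1)(J − 1)); all settings are arbitrary illustrative choices.

```python
# Under H_0': pi_ij = pi_i. * pi_.j, X_P^2 should be ~ chi2((I-1)(J-1)).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
I, J, n, reps = 2, 3, 500, 2000
pi = np.outer([0.4, 0.6], [0.2, 0.3, 0.5]).ravel()   # independent cell probs

sims = []
for _ in range(reps):
    N = rng.multinomial(n, pi).reshape(I, J)
    E = N.sum(axis=1, keepdims=True) * N.sum(axis=0, keepdims=True) / n
    sims.append(np.sum((N - E) ** 2 / E))

print(stats.kstest(sims, stats.chi2(df=(I - 1) * (J - 1)).cdf))
```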

Chapter 3

Tools for the Decomposition of the Pearson Statistic

In this chapter, we first discuss why the Pearson and likelihood ratio tests are sometimes not appropriate. We then present the three major concepts that we will rely on for the decomposition of the Pearson statistic: ranked data, smooth models and polynomials orthonormal with respect to a distribution.

3.1 Ordinal Categories and Testing for Trends

Chi-squared tests of independence merely indicate the degree of evidence of association between the row and column variables/classifications. One major fault of the Pearson and LRT chi-squared tests of independence is that the expected cell counts depend only on the marginal totals: the order of the categories of the rows and the columns is not taken into account. Thus the Pearson and LRT statistics treat both the row and column variables/classifications as nominal. If the chi-squared tests are applied when at least one variable/classification is ordinal, this information is ignored and may lead to a false decision.

Example 3.1 (From Agresti[1]). In the following table the variables are income and job satisfaction, measured for the black males in a national (USA) sample. Both classifications are ordinal, with the categories very dissatisfied (VD), little dissatisfied (LD), moderately satisfied (MS) and very satisfied (VS).

                          Job Satisfaction
  Income (USD)         VD   LD   MS   VS | Total
  < 15,000              1    3   10    6 |  20
  15,000–25,000         2    3   10    7 |  22
  25,000–40,000         1    6   14   12 |  33
  > 40,000              0    1    9   11 |  21
  Total                 4   13   43   36 |  96

The Pearson and LRT statistics for testing independence are X_P² = 6.0 and G² = 6.8 with 9 degrees of freedom; the p-values are 0.74 and 0.66, respectively. The statistics show little evidence of association. However, we can permute the columns and rows of this table. Since the columns and rows of the permuted table are simply a rearrangement of those of the first table, the cell counts are still compared to the same estimated cell means, and the Pearson and LRT statistics keep the same values (as the sketch below demonstrates), again not rejecting the hypothesis of independence between income and job satisfaction; yet we can see there is an increasing trend in the original table.
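The invariance claim in Example 3.1 is easy to demonstrate. A minimal sketch (not from the thesis), reversing the row and column order of the table above; any permutation gives the same result.

```python
# Permuting rows/columns leaves the omnibus statistics unchanged.
import numpy as np
from scipy import stats

N = np.array([[1, 3, 10,  6],    # < 15,000:       VD, LD, MS, VS
              [2, 3, 10,  7],    # 15,000-25,000
              [1, 6, 14, 12],    # 25,000-40,000
              [0, 1,  9, 11]])   # > 40,000

perm = N[::-1, ::-1]             # reverse both the row and the column order
print(stats.chi2_contingency(N, correction=False)[0])
print(stats.chi2_contingency(perm, correction=False)[0])   # identical X_P^2
```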

In this thesis, we also look at chi-squared tests of homogeneity against monotonic trend. Let X₁, …, X_I be random variables, defined on the same outcome space Ξ, which has the partition
$$\mathcal{B} = \{B_1, \ldots, B_J : B_i \cap B_j = \emptyset,\ i \ne j,\ \textstyle\bigcup_j B_j = \Xi\},$$
with unknown cdfs F₁, …, F_I, respectively. We set π_ij = P_i(X_i ∈ B_j), i = 1, …, I, j = 1, …, J, where P_i is the probability function associated with F_i. The null hypothesis for a chi-squared test of homogeneity of the random variables is
$$H_0: F_1(x) = \cdots = F_I(x) \quad \text{for all } x \in \mathbb{R},$$
which is equivalent to
$$H_0': \tau_{1j} = \cdots = \tau_{Ij}, \qquad j = 1, \ldots, J,$$
where
$$\tau_{ij} = \begin{cases} \pi_{i1} + \cdots + \pi_{ij}, & j = 1, \ldots, J-1,\\ 1, & j = J, \end{cases}$$
are the cumulative cell probabilities. The chi-squared test of homogeneity against monotonic trend tests H₀ against
$$H_1: F_1(x) \ge \cdots \ge F_I(x) \quad \text{for all } x \in \mathbb{R},$$
$$H_2: F_1(x) \le \cdots \le F_I(x) \quad \text{for all } x \in \mathbb{R},$$
or
$$H_3: F_1(x) \ge \cdots \ge F_I(x) \quad \text{or} \quad F_1(x) \le \cdots \le F_I(x) \quad \text{for all } x \in \mathbb{R},$$
with at least one strict inequality in the alternatives. This is equivalent to testing H₀′ against
$$H_1': \tau_{1j} \ge \cdots \ge \tau_{Ij}, \qquad j = 1, \ldots, J, \tag{3.1}$$
$$H_2': \tau_{1j} \le \cdots \le \tau_{Ij}, \qquad j = 1, \ldots, J, \tag{3.2}$$
or
$$H_3': \tau_{1j} \ge \cdots \ge \tau_{Ij} \quad \text{or} \quad \tau_{1j} \le \cdots \le \tau_{Ij}, \qquad j = 1, \ldots, J, \tag{3.3}$$
with at least one strict inequality in the alternative hypotheses. The alternative (3.1) is equivalent to a monotone increasing trend in the row distributions of a row-multinomial model, the alternative (3.2) is equivalent to a monotone decreasing trend, and the alternative (3.3) is equivalent to a monotonic trend.
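A minimal sketch (not from the thesis) of the quantities involved in (3.1)–(3.3): the estimated cumulative cell probabilities τ̂_ij for a hypothetical row-multinomial table whose rows are stochastically increasing, so that τ̂_1j ≥ τ̂_2j ≥ τ̂_3j for every column j.

```python
# Estimated cumulative cell probabilities tau_hat_ij = (N_i1 + ... + N_ij) / n_i.
import numpy as np

N = np.array([[30, 40, 20, 10],
              [20, 35, 25, 20],
              [10, 25, 30, 35]])
tau_hat = np.cumsum(N, axis=1) / N.sum(axis=1, keepdims=True)
print(tau_hat)
# Rows decrease in every column here -- the pattern alternative (3.1) describes.
print(np.all(np.diff(tau_hat, axis=0) <= 0))
```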

Once again, the Pearson and LRT chi-squared tests are invariant to the possible order of the columns when testing against the above directional alternatives, and may lead to a false decision.

Example 3.2 (From Bagdonavičius et al.[4]). Investigating the granular composition of quartz in Lithuanian and Saharan sand samples, data on the lengths (in cm) of the maximal axis of the quartz grains were used. Using the granular composition of the samples, conclusions on the geological sand formation conditions were drawn. The data are grouped in intervals.

[Table: observed counts of quartz grains, by interval midpoint of maximal-axis length, for the Lithuanian and Saharan sand samples]

The Pearson and LRT statistics for verifying the hypothesis that the length of the maximal axis has the same distribution in Lithuanian and Saharan sand are X_P² = 72.5 and G² = 74.5 with 7 degrees of freedom, and p-values < 10⁻¹². The statistics show there is strong evidence of association. Looking at the table of cumulative cell probabilities,

[Table: cumulative cell proportions by interval midpoint for the Lithuanian and Saharan sand samples]

an appropriate follow-up test would compare the hypothesis of homogeneity against a decreasing trend in the row distributions.

For both of these situations, ordinal variables or trend alternatives, alterations to the usual chi-squared tests have been made to increase the power of the test. Without loss of generality, we are interested in detecting monotone increasing trends.

3.2 Ranked Data and Contingency Tables

We consider nonparametric tests based on statistics which depend only on the locations of the observations in the ordered sample and not directly on their values.

3.2.1 Ranked data and their properties

Let X₁, …, X_n be a random sample of a random variable X and let X₍₁₎ ≤ ⋯ ≤ X₍ₙ₎ be the order statistics. If the distribution of X is absolutely continuous, i.e. the probability of having tied ranks is zero, and X₍₁₎ < ⋯ < X₍ₙ₎, the rank R_i of X_i is the order number of X_i in the ordered sample X₍₁₎, …, X₍ₙ₎:
$$R_i = \sum_{j=1}^n j\, I\big[X_i = X_{(j)}\big].$$

Since ranks take the values 1, …, n, the sum and the sum of squares of the ranks are constant:
$$\sum_{i=1}^n R_i = \sum_{i=1}^n i = \frac{n(n+1)}{2}, \qquad \sum_{i=1}^n R_i^2 = \sum_{i=1}^n i^2 = \frac{n(n+1)(2n+1)}{6}.$$
Then, if the distribution of the random variable X is absolutely continuous, the ranks satisfy P(R_i = r) = 1/n, r = 1, …, n, and
$$E[R_i] = \frac{n+1}{2}, \qquad \operatorname{cov}[R_i, R_j] = \frac{n+1}{12}\big[n I_n - 1_n 1_n'\big]_{ij}, \qquad i, j = 1, \ldots, n,$$
where I_n is the n × n identity matrix and 1_n is the vector of n ones.

The notion of rank can be generalized to the case where the distribution of the random variable X is not necessarily absolutely continuous, thus allowing for tied ranks. If there is a group of s observations coinciding at the j-th order statistic in the ordered sample X₍₁₎ ≤ ⋯ ≤ X₍ₙ₎ and X_i = X₍ⱼ₎, then R_i is defined as the arithmetic mean of the positions of the coinciding observations:
$$R_i = \frac{j + (j+1) + \cdots + (j+s-1)}{s} = \frac{1}{s}\sum_{a=j}^{j+s-1} a = j + \frac{1}{s}\sum_{a=1}^{s-1} a = j + \frac{s-1}{2}.$$
These adjusted ranks, called mid-ranks, have no effect on the mean of the ranks. Suppose X_{i₁} = ⋯ = X_{iₛ} = X₍ⱼ₎; then the sum of their ranks is
$$\sum_{b=1}^s R_{i_b} = \sum_{b=1}^s \left(j + \frac{s-1}{2}\right) = \sum_{a=j}^{j+s-1} a,$$
and the sum of all ranks remains the same. Set
$$S = \sum_{l=1}^k s_l\big(s_l^2 - 1\big),$$
where k denotes the number of groups with coinciding members in the ordered sample and s_l denotes the number of members in the l-th group.

Theorem 3.1. Conditional on the observed ties, the means and the covariances of the ranks are
$$E[R_i] = \frac{n+1}{2}, \qquad \operatorname{cov}[R_i, R_j] = \left(\frac{n+1}{12} - \frac{S}{12\, n(n-1)}\right)\big[n I_n - 1_n 1_n'\big]_{ij}, \qquad i, j = 1, \ldots, n.$$

Proof. The proof is provided by Bagdonavičius et al.[4]. ∎
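Mid-ranks and the moment identities above can be illustrated with scipy.stats.rankdata (a minimal sketch, not from the thesis; the data are made up).

```python
# Mid-ranks: tied observations receive the average of their positions,
# and the total rank sum stays n(n+1)/2.
import numpy as np
from scipy import stats

x = np.array([3.1, 1.2, 3.1, 2.5, 3.1, 0.7])   # one tie group of size s = 3
r = stats.rankdata(x)                           # 'average' method = mid-ranks
n = len(x)

print(r)                        # [5. 2. 5. 3. 5. 1.]; positions 4,5,6 -> 5
print(r.sum(), n * (n + 1) / 2) # 21.0  21.0: the rank sum is unaffected by ties
print(r.mean(), (n + 1) / 2)    # 3.5   3.5
```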

3.2.2 Advantages and disadvantages of working with ranked data

Conover[17] gives two reasons why one would prefer to work with ranks over the actual data:

- If the numbers assigned to the observations have no meaning by themselves but rather attain meaning only in an ordinal comparison with other observations, then the numbers contain no more information than the ranks contain.
- Even if the numbers have meaning but the distribution function is not normal, the probability theory is beyond our reach when the statistic is based on the actual data. The probability theory based on ranks is relatively simple and, in many cases, does not depend on the distribution.

Conover[17] defines a nonparametric method as follows:

Definition 3.1. A statistical method is nonparametric if it satisfies at least one of the following criteria:
- The method may be used on data with a nominal scale of measurement.
- The method may be used on data with an ordinal scale of measurement.
- The method may be used on data with an interval or ratio scale of measurement, where the distribution function of the random variable producing the data is either unspecified or specified except for an infinite number of unknown parameters.

Thus, satisfying the second criterion, a statistical method using ranked data is considered nonparametric, and it has the following advantages:

- As given by the definition, nonparametric methods based on ranked data can be used on data with an ordinal, interval or ratio scale of measurement.
- As they were originally developed before the wide use of computers, nonparametric methods are intuitive and simple to carry out by hand, for small samples at least.
- There are no, or very limited, assumptions on the format of the data. A nonparametric method may be preferable when the assumptions required for a parametric method are not valid, for example:
  - the scale of measurement: the Pearson correlation assumes the data are at least interval, while its nonparametric equivalent, the Spearman correlation, assumes the data are at least ordinal;
  - the underlying distribution: the one-sample t-test requires observations drawn from a normally distributed population, while its nonparametric equivalent, the Wilcoxon signed rank test, simply assumes the observations are drawn from the same population.

However, we need to acknowledge that there are also some disadvantages to the use of nonparametric methods:

- If the sample size is large, nonparametric methods may be difficult to compute by hand; and, unlike for parametric methods, appropriate computer software for nonparametric methods can be computationally intensive and limited, although the situation is improving.
- If the assumptions of the corresponding parametric method hold, or the sample size is not large enough, nonparametric methods may lack power compared to their parametric equivalents.
- Tied values can be problematic when they are common, and adjustments to the test statistic may be necessary.
- When the data follow a normal distribution, for example, the mean and standard deviation are all that is required to understand the distribution and make inference. With nonparametric methods, because only probabilities are involved, reducing the data to a few numbers (e.g., the median) does not give an accurate picture.

In this thesis, though there may be a large number of categories, with an ordinal variable and/or a trend we reduce the amount of information necessary to better interpret the data.


More information

Central Limit Theorem ( 5.3)

Central Limit Theorem ( 5.3) Central Limit Theorem ( 5.3) Let X 1, X 2,... be a sequence of independent random variables, each having n mean µ and variance σ 2. Then the distribution of the partial sum S n = X i i=1 becomes approximately

More information

Multivariate Distributions

Multivariate Distributions IEOR E4602: Quantitative Risk Management Spring 2016 c 2016 by Martin Haugh Multivariate Distributions We will study multivariate distributions in these notes, focusing 1 in particular on multivariate

More information

INTERVAL ESTIMATION AND HYPOTHESES TESTING

INTERVAL ESTIMATION AND HYPOTHESES TESTING INTERVAL ESTIMATION AND HYPOTHESES TESTING 1. IDEA An interval rather than a point estimate is often of interest. Confidence intervals are thus important in empirical work. To construct interval estimates,

More information

Testing Statistical Hypotheses

Testing Statistical Hypotheses E.L. Lehmann Joseph P. Romano Testing Statistical Hypotheses Third Edition 4y Springer Preface vii I Small-Sample Theory 1 1 The General Decision Problem 3 1.1 Statistical Inference and Statistical Decisions

More information

5 Introduction to the Theory of Order Statistics and Rank Statistics

5 Introduction to the Theory of Order Statistics and Rank Statistics 5 Introduction to the Theory of Order Statistics and Rank Statistics This section will contain a summary of important definitions and theorems that will be useful for understanding the theory of order

More information

Good Confidence Intervals for Categorical Data Analyses. Alan Agresti

Good Confidence Intervals for Categorical Data Analyses. Alan Agresti Good Confidence Intervals for Categorical Data Analyses Alan Agresti Department of Statistics, University of Florida visiting Statistics Department, Harvard University LSHTM, July 22, 2011 p. 1/36 Outline

More information

PRINCIPLES OF STATISTICAL INFERENCE

PRINCIPLES OF STATISTICAL INFERENCE Advanced Series on Statistical Science & Applied Probability PRINCIPLES OF STATISTICAL INFERENCE from a Neo-Fisherian Perspective Luigi Pace Department of Statistics University ofudine, Italy Alessandra

More information

One-Way Tables and Goodness of Fit

One-Way Tables and Goodness of Fit Stat 504, Lecture 5 1 One-Way Tables and Goodness of Fit Key concepts: One-way Frequency Table Pearson goodness-of-fit statistic Deviance statistic Pearson residuals Objectives: Learn how to compute the

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

Components of the Pearson-Fisher Chi-squared Statistic

Components of the Pearson-Fisher Chi-squared Statistic JOURNAL OF APPLIED MATHEMATICS AND DECISION SCIENCES, 6(4), 241 254 Copyright c 2002, Lawrence Erlbaum Associates, Inc. Components of the Pearson-Fisher Chi-squared Statistic G.D. RAYNER National Australia

More information

Spring 2012 Math 541B Exam 1

Spring 2012 Math 541B Exam 1 Spring 2012 Math 541B Exam 1 1. A sample of size n is drawn without replacement from an urn containing N balls, m of which are red and N m are black; the balls are otherwise indistinguishable. Let X denote

More information

Geometry of Goodness-of-Fit Testing in High Dimensional Low Sample Size Modelling

Geometry of Goodness-of-Fit Testing in High Dimensional Low Sample Size Modelling Geometry of Goodness-of-Fit Testing in High Dimensional Low Sample Size Modelling Paul Marriott 1, Radka Sabolova 2, Germain Van Bever 2, and Frank Critchley 2 1 University of Waterloo, Waterloo, Ontario,

More information

3 Joint Distributions 71

3 Joint Distributions 71 2.2.3 The Normal Distribution 54 2.2.4 The Beta Density 58 2.3 Functions of a Random Variable 58 2.4 Concluding Remarks 64 2.5 Problems 64 3 Joint Distributions 71 3.1 Introduction 71 3.2 Discrete Random

More information

Classification. Chapter Introduction. 6.2 The Bayes classifier

Classification. Chapter Introduction. 6.2 The Bayes classifier Chapter 6 Classification 6.1 Introduction Often encountered in applications is the situation where the response variable Y takes values in a finite set of labels. For example, the response Y could encode

More information

Institute of Actuaries of India

Institute of Actuaries of India Institute of Actuaries of India Subject CT3 Probability & Mathematical Statistics May 2011 Examinations INDICATIVE SOLUTION Introduction The indicative solution has been written by the Examiners with the

More information

The Multinomial Model

The Multinomial Model The Multinomial Model STA 312: Fall 2012 Contents 1 Multinomial Coefficients 1 2 Multinomial Distribution 2 3 Estimation 4 4 Hypothesis tests 8 5 Power 17 1 Multinomial Coefficients Multinomial coefficient

More information

Non-parametric Inference and Resampling

Non-parametric Inference and Resampling Non-parametric Inference and Resampling Exercises by David Wozabal (Last update. Juni 010) 1 Basic Facts about Rank and Order Statistics 1.1 10 students were asked about the amount of time they spend surfing

More information

Review of One-way Tables and SAS

Review of One-way Tables and SAS Stat 504, Lecture 7 1 Review of One-way Tables and SAS In-class exercises: Ex1, Ex2, and Ex3 from http://v8doc.sas.com/sashtml/proc/z0146708.htm To calculate p-value for a X 2 or G 2 in SAS: http://v8doc.sas.com/sashtml/lgref/z0245929.htmz0845409

More information

Lecture 21: October 19

Lecture 21: October 19 36-705: Intermediate Statistics Fall 2017 Lecturer: Siva Balakrishnan Lecture 21: October 19 21.1 Likelihood Ratio Test (LRT) To test composite versus composite hypotheses the general method is to use

More information

Log-linear Models for Contingency Tables

Log-linear Models for Contingency Tables Log-linear Models for Contingency Tables Statistics 149 Spring 2006 Copyright 2006 by Mark E. Irwin Log-linear Models for Two-way Contingency Tables Example: Business Administration Majors and Gender A

More information

Unit 9: Inferences for Proportions and Count Data

Unit 9: Inferences for Proportions and Count Data Unit 9: Inferences for Proportions and Count Data Statistics 571: Statistical Methods Ramón V. León 12/15/2008 Unit 9 - Stat 571 - Ramón V. León 1 Large Sample Confidence Interval for Proportion ( pˆ p)

More information

Glossary for the Triola Statistics Series

Glossary for the Triola Statistics Series Glossary for the Triola Statistics Series Absolute deviation The measure of variation equal to the sum of the deviations of each value from the mean, divided by the number of values Acceptance sampling

More information

Empirical Power of Four Statistical Tests in One Way Layout

Empirical Power of Four Statistical Tests in One Way Layout International Mathematical Forum, Vol. 9, 2014, no. 28, 1347-1356 HIKARI Ltd, www.m-hikari.com http://dx.doi.org/10.12988/imf.2014.47128 Empirical Power of Four Statistical Tests in One Way Layout Lorenzo

More information

Subject CS1 Actuarial Statistics 1 Core Principles

Subject CS1 Actuarial Statistics 1 Core Principles Institute of Actuaries of India Subject CS1 Actuarial Statistics 1 Core Principles For 2019 Examinations Aim The aim of the Actuarial Statistics 1 subject is to provide a grounding in mathematical and

More information

Testing Independence

Testing Independence Testing Independence Dipankar Bandyopadhyay Department of Biostatistics, Virginia Commonwealth University BIOS 625: Categorical Data & GLM 1/50 Testing Independence Previously, we looked at RR = OR = 1

More information

THE ROYAL STATISTICAL SOCIETY HIGHER CERTIFICATE

THE ROYAL STATISTICAL SOCIETY HIGHER CERTIFICATE THE ROYAL STATISTICAL SOCIETY 004 EXAMINATIONS SOLUTIONS HIGHER CERTIFICATE PAPER II STATISTICAL METHODS The Society provides these solutions to assist candidates preparing for the examinations in future

More information

Unit 9: Inferences for Proportions and Count Data

Unit 9: Inferences for Proportions and Count Data Unit 9: Inferences for Proportions and Count Data Statistics 571: Statistical Methods Ramón V. León 1/15/008 Unit 9 - Stat 571 - Ramón V. León 1 Large Sample Confidence Interval for Proportion ( pˆ p)

More information

NAG Library Chapter Introduction. G08 Nonparametric Statistics

NAG Library Chapter Introduction. G08 Nonparametric Statistics NAG Library Chapter Introduction G08 Nonparametric Statistics Contents 1 Scope of the Chapter.... 2 2 Background to the Problems... 2 2.1 Parametric and Nonparametric Hypothesis Testing... 2 2.2 Types

More information

Review of Statistics 101

Review of Statistics 101 Review of Statistics 101 We review some important themes from the course 1. Introduction Statistics- Set of methods for collecting/analyzing data (the art and science of learning from data). Provides methods

More information

Part IB Statistics. Theorems with proof. Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua. Lent 2015

Part IB Statistics. Theorems with proof. Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua. Lent 2015 Part IB Statistics Theorems with proof Based on lectures by D. Spiegelhalter Notes taken by Dexter Chua Lent 2015 These notes are not endorsed by the lecturers, and I have modified them (often significantly)

More information

Categorical Variables and Contingency Tables: Description and Inference

Categorical Variables and Contingency Tables: Description and Inference Categorical Variables and Contingency Tables: Description and Inference STAT 526 Professor Olga Vitek March 3, 2011 Reading: Agresti Ch. 1, 2 and 3 Faraway Ch. 4 3 Univariate Binomial and Multinomial Measurements

More information

ML Testing (Likelihood Ratio Testing) for non-gaussian models

ML Testing (Likelihood Ratio Testing) for non-gaussian models ML Testing (Likelihood Ratio Testing) for non-gaussian models Surya Tokdar ML test in a slightly different form Model X f (x θ), θ Θ. Hypothesist H 0 : θ Θ 0 Good set: B c (x) = {θ : l x (θ) max θ Θ l

More information

Recall the Basics of Hypothesis Testing

Recall the Basics of Hypothesis Testing Recall the Basics of Hypothesis Testing The level of significance α, (size of test) is defined as the probability of X falling in w (rejecting H 0 ) when H 0 is true: P(X w H 0 ) = α. H 0 TRUE H 1 TRUE

More information

Final Examination Statistics 200C. T. Ferguson June 11, 2009

Final Examination Statistics 200C. T. Ferguson June 11, 2009 Final Examination Statistics 00C T. Ferguson June, 009. (a) Define: X n converges in probability to X. (b) Define: X m converges in quadratic mean to X. (c) Show that if X n converges in quadratic mean

More information

Frequency Distribution Cross-Tabulation

Frequency Distribution Cross-Tabulation Frequency Distribution Cross-Tabulation 1) Overview 2) Frequency Distribution 3) Statistics Associated with Frequency Distribution i. Measures of Location ii. Measures of Variability iii. Measures of Shape

More information

Lecture 7: Hypothesis Testing and ANOVA

Lecture 7: Hypothesis Testing and ANOVA Lecture 7: Hypothesis Testing and ANOVA Goals Overview of key elements of hypothesis testing Review of common one and two sample tests Introduction to ANOVA Hypothesis Testing The intent of hypothesis

More information

Preface Introduction to Statistics and Data Analysis Overview: Statistical Inference, Samples, Populations, and Experimental Design The Role of

Preface Introduction to Statistics and Data Analysis Overview: Statistical Inference, Samples, Populations, and Experimental Design The Role of Preface Introduction to Statistics and Data Analysis Overview: Statistical Inference, Samples, Populations, and Experimental Design The Role of Probability Sampling Procedures Collection of Data Measures

More information

STATISTICS ANCILLARY SYLLABUS. (W.E.F. the session ) Semester Paper Code Marks Credits Topic

STATISTICS ANCILLARY SYLLABUS. (W.E.F. the session ) Semester Paper Code Marks Credits Topic STATISTICS ANCILLARY SYLLABUS (W.E.F. the session 2014-15) Semester Paper Code Marks Credits Topic 1 ST21012T 70 4 Descriptive Statistics 1 & Probability Theory 1 ST21012P 30 1 Practical- Using Minitab

More information

6 Single Sample Methods for a Location Parameter

6 Single Sample Methods for a Location Parameter 6 Single Sample Methods for a Location Parameter If there are serious departures from parametric test assumptions (e.g., normality or symmetry), nonparametric tests on a measure of central tendency (usually

More information

Computational Systems Biology: Biology X

Computational Systems Biology: Biology X Bud Mishra Room 1002, 715 Broadway, Courant Institute, NYU, New York, USA L#7:(Mar-23-2010) Genome Wide Association Studies 1 The law of causality... is a relic of a bygone age, surviving, like the monarchy,

More information

Testing Statistical Hypotheses

Testing Statistical Hypotheses E.L. Lehmann Joseph P. Romano, 02LEu1 ttd ~Lt~S Testing Statistical Hypotheses Third Edition With 6 Illustrations ~Springer 2 The Probability Background 28 2.1 Probability and Measure 28 2.2 Integration.........

More information

Contents. Acknowledgments. xix

Contents. Acknowledgments. xix Table of Preface Acknowledgments page xv xix 1 Introduction 1 The Role of the Computer in Data Analysis 1 Statistics: Descriptive and Inferential 2 Variables and Constants 3 The Measurement of Variables

More information

Bivariate Relationships Between Variables

Bivariate Relationships Between Variables Bivariate Relationships Between Variables BUS 735: Business Decision Making and Research 1 Goals Specific goals: Detect relationships between variables. Be able to prescribe appropriate statistical methods

More information

the long tau-path for detecting monotone association in an unspecified subpopulation

the long tau-path for detecting monotone association in an unspecified subpopulation the long tau-path for detecting monotone association in an unspecified subpopulation Joe Verducci Current Challenges in Statistical Learning Workshop Banff International Research Station Tuesday, December

More information

Practice Problems Section Problems

Practice Problems Section Problems Practice Problems Section 4-4-3 4-4 4-5 4-6 4-7 4-8 4-10 Supplemental Problems 4-1 to 4-9 4-13, 14, 15, 17, 19, 0 4-3, 34, 36, 38 4-47, 49, 5, 54, 55 4-59, 60, 63 4-66, 68, 69, 70, 74 4-79, 81, 84 4-85,

More information

simple if it completely specifies the density of x

simple if it completely specifies the density of x 3. Hypothesis Testing Pure significance tests Data x = (x 1,..., x n ) from f(x, θ) Hypothesis H 0 : restricts f(x, θ) Are the data consistent with H 0? H 0 is called the null hypothesis simple if it completely

More information

BTRY 4090: Spring 2009 Theory of Statistics

BTRY 4090: Spring 2009 Theory of Statistics BTRY 4090: Spring 2009 Theory of Statistics Guozhang Wang September 25, 2010 1 Review of Probability We begin with a real example of using probability to solve computationally intensive (or infeasible)

More information

8 Nominal and Ordinal Logistic Regression

8 Nominal and Ordinal Logistic Regression 8 Nominal and Ordinal Logistic Regression 8.1 Introduction If the response variable is categorical, with more then two categories, then there are two options for generalized linear models. One relies on

More information

CDA Chapter 3 part II

CDA Chapter 3 part II CDA Chapter 3 part II Two-way tables with ordered classfications Let u 1 u 2... u I denote scores for the row variable X, and let ν 1 ν 2... ν J denote column Y scores. Consider the hypothesis H 0 : X

More information

ADJUSTED POWER ESTIMATES IN. Ji Zhang. Biostatistics and Research Data Systems. Merck Research Laboratories. Rahway, NJ

ADJUSTED POWER ESTIMATES IN. Ji Zhang. Biostatistics and Research Data Systems. Merck Research Laboratories. Rahway, NJ ADJUSTED POWER ESTIMATES IN MONTE CARLO EXPERIMENTS Ji Zhang Biostatistics and Research Data Systems Merck Research Laboratories Rahway, NJ 07065-0914 and Dennis D. Boos Department of Statistics, North

More information

An Overview of Methods in the Analysis of Dependent Ordered Categorical Data: Assumptions and Implications

An Overview of Methods in the Analysis of Dependent Ordered Categorical Data: Assumptions and Implications WORKING PAPER SERIES WORKING PAPER NO 7, 2008 Swedish Business School at Örebro An Overview of Methods in the Analysis of Dependent Ordered Categorical Data: Assumptions and Implications By Hans Högberg

More information

Chapter Fifteen. Frequency Distribution, Cross-Tabulation, and Hypothesis Testing

Chapter Fifteen. Frequency Distribution, Cross-Tabulation, and Hypothesis Testing Chapter Fifteen Frequency Distribution, Cross-Tabulation, and Hypothesis Testing Copyright 2010 Pearson Education, Inc. publishing as Prentice Hall 15-1 Internet Usage Data Table 15.1 Respondent Sex Familiarity

More information

A Very Brief Summary of Statistical Inference, and Examples

A Very Brief Summary of Statistical Inference, and Examples A Very Brief Summary of Statistical Inference, and Examples Trinity Term 2009 Prof. Gesine Reinert Our standard situation is that we have data x = x 1, x 2,..., x n, which we view as realisations of random

More information

Chapte The McGraw-Hill Companies, Inc. All rights reserved.

Chapte The McGraw-Hill Companies, Inc. All rights reserved. er15 Chapte Chi-Square Tests d Chi-Square Tests for -Fit Uniform Goodness- Poisson Goodness- Goodness- ECDF Tests (Optional) Contingency Tables A contingency table is a cross-tabulation of n paired observations

More information

BIOS 625 Fall 2015 Homework Set 3 Solutions

BIOS 625 Fall 2015 Homework Set 3 Solutions BIOS 65 Fall 015 Homework Set 3 Solutions 1. Agresti.0 Table.1 is from an early study on the death penalty in Florida. Analyze these data and show that Simpson s Paradox occurs. Death Penalty Victim's

More information

Master s Written Examination - Solution

Master s Written Examination - Solution Master s Written Examination - Solution Spring 204 Problem Stat 40 Suppose X and X 2 have the joint pdf f X,X 2 (x, x 2 ) = 2e (x +x 2 ), 0 < x < x 2

More information

ECON 4160, Autumn term Lecture 1

ECON 4160, Autumn term Lecture 1 ECON 4160, Autumn term 2017. Lecture 1 a) Maximum Likelihood based inference. b) The bivariate normal model Ragnar Nymoen University of Oslo 24 August 2017 1 / 54 Principles of inference I Ordinary least

More information

n y π y (1 π) n y +ylogπ +(n y)log(1 π).

n y π y (1 π) n y +ylogπ +(n y)log(1 π). Tests for a binomial probability π Let Y bin(n,π). The likelihood is L(π) = n y π y (1 π) n y and the log-likelihood is L(π) = log n y +ylogπ +(n y)log(1 π). So L (π) = y π n y 1 π. 1 Solving for π gives

More information

Introduction to Statistical Analysis

Introduction to Statistical Analysis Introduction to Statistical Analysis Changyu Shen Richard A. and Susan F. Smith Center for Outcomes Research in Cardiology Beth Israel Deaconess Medical Center Harvard Medical School Objectives Descriptive

More information

Comparison of Two Samples

Comparison of Two Samples 2 Comparison of Two Samples 2.1 Introduction Problems of comparing two samples arise frequently in medicine, sociology, agriculture, engineering, and marketing. The data may have been generated by observation

More information

2.6.3 Generalized likelihood ratio tests

2.6.3 Generalized likelihood ratio tests 26 HYPOTHESIS TESTING 113 263 Generalized likelihood ratio tests When a UMP test does not exist, we usually use a generalized likelihood ratio test to verify H 0 : θ Θ against H 1 : θ Θ\Θ It can be used

More information

Lecture Slides. Elementary Statistics. by Mario F. Triola. and the Triola Statistics Series

Lecture Slides. Elementary Statistics. by Mario F. Triola. and the Triola Statistics Series Lecture Slides Elementary Statistics Tenth Edition and the Triola Statistics Series by Mario F. Triola Slide 1 Chapter 13 Nonparametric Statistics 13-1 Overview 13-2 Sign Test 13-3 Wilcoxon Signed-Ranks

More information

PSY 307 Statistics for the Behavioral Sciences. Chapter 20 Tests for Ranked Data, Choosing Statistical Tests

PSY 307 Statistics for the Behavioral Sciences. Chapter 20 Tests for Ranked Data, Choosing Statistical Tests PSY 307 Statistics for the Behavioral Sciences Chapter 20 Tests for Ranked Data, Choosing Statistical Tests What To Do with Non-normal Distributions Tranformations (pg 382): The shape of the distribution

More information

Sections 2.3, 2.4. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 770: Categorical Data Analysis 1 / 21

Sections 2.3, 2.4. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 770: Categorical Data Analysis 1 / 21 Sections 2.3, 2.4 Timothy Hanson Department of Statistics, University of South Carolina Stat 770: Categorical Data Analysis 1 / 21 2.3 Partial association in stratified 2 2 tables In describing a relationship

More information

Lecture Slides. Section 13-1 Overview. Elementary Statistics Tenth Edition. Chapter 13 Nonparametric Statistics. by Mario F.

Lecture Slides. Section 13-1 Overview. Elementary Statistics Tenth Edition. Chapter 13 Nonparametric Statistics. by Mario F. Lecture Slides Elementary Statistics Tenth Edition and the Triola Statistics Series by Mario F. Triola Slide 1 Chapter 13 Nonparametric Statistics 13-1 Overview 13-2 Sign Test 13-3 Wilcoxon Signed-Ranks

More information

HYPOTHESIS TESTING II TESTS ON MEANS. Sorana D. Bolboacă

HYPOTHESIS TESTING II TESTS ON MEANS. Sorana D. Bolboacă HYPOTHESIS TESTING II TESTS ON MEANS Sorana D. Bolboacă OBJECTIVES Significance value vs p value Parametric vs non parametric tests Tests on means: 1 Dec 14 2 SIGNIFICANCE LEVEL VS. p VALUE Materials and

More information

1 Directional Derivatives and Differentiability

1 Directional Derivatives and Differentiability Wednesday, January 18, 2012 1 Directional Derivatives and Differentiability Let E R N, let f : E R and let x 0 E. Given a direction v R N, let L be the line through x 0 in the direction v, that is, L :=

More information

ON SMALL SAMPLE PROPERTIES OF PERMUTATION TESTS: INDEPENDENCE BETWEEN TWO SAMPLES

ON SMALL SAMPLE PROPERTIES OF PERMUTATION TESTS: INDEPENDENCE BETWEEN TWO SAMPLES ON SMALL SAMPLE PROPERTIES OF PERMUTATION TESTS: INDEPENDENCE BETWEEN TWO SAMPLES Hisashi Tanizaki Graduate School of Economics, Kobe University, Kobe 657-8501, Japan e-mail: tanizaki@kobe-u.ac.jp Abstract:

More information

TABLE OF CONTENTS CHAPTER 1 COMBINATORIAL PROBABILITY 1

TABLE OF CONTENTS CHAPTER 1 COMBINATORIAL PROBABILITY 1 TABLE OF CONTENTS CHAPTER 1 COMBINATORIAL PROBABILITY 1 1.1 The Probability Model...1 1.2 Finite Discrete Models with Equally Likely Outcomes...5 1.2.1 Tree Diagrams...6 1.2.2 The Multiplication Principle...8

More information

Research Article A Nonparametric Two-Sample Wald Test of Equality of Variances

Research Article A Nonparametric Two-Sample Wald Test of Equality of Variances Advances in Decision Sciences Volume 211, Article ID 74858, 8 pages doi:1.1155/211/74858 Research Article A Nonparametric Two-Sample Wald Test of Equality of Variances David Allingham 1 andj.c.w.rayner

More information

Basic Business Statistics, 10/e

Basic Business Statistics, 10/e Chapter 1 1-1 Basic Business Statistics 11 th Edition Chapter 1 Chi-Square Tests and Nonparametric Tests Basic Business Statistics, 11e 009 Prentice-Hall, Inc. Chap 1-1 Learning Objectives In this chapter,

More information

CHI SQUARE ANALYSIS 8/18/2011 HYPOTHESIS TESTS SO FAR PARAMETRIC VS. NON-PARAMETRIC

CHI SQUARE ANALYSIS 8/18/2011 HYPOTHESIS TESTS SO FAR PARAMETRIC VS. NON-PARAMETRIC CHI SQUARE ANALYSIS I N T R O D U C T I O N T O N O N - P A R A M E T R I C A N A L Y S E S HYPOTHESIS TESTS SO FAR We ve discussed One-sample t-test Dependent Sample t-tests Independent Samples t-tests

More information

Simulating Realistic Ecological Count Data

Simulating Realistic Ecological Count Data 1 / 76 Simulating Realistic Ecological Count Data Lisa Madsen Dave Birkes Oregon State University Statistics Department Seminar May 2, 2011 2 / 76 Outline 1 Motivation Example: Weed Counts 2 Pearson Correlation

More information

STAT 461/561- Assignments, Year 2015

STAT 461/561- Assignments, Year 2015 STAT 461/561- Assignments, Year 2015 This is the second set of assignment problems. When you hand in any problem, include the problem itself and its number. pdf are welcome. If so, use large fonts and

More information