Discrete Multivariate Statistics

Univariate Discrete Random Variables

Let X be a discrete random variable which, in this module, will be assumed to take a finite number of t different values, associated with each of the t different categories for X. These categories may be either nominal or ordinal and we will suppose that they are labelled 1, ..., t.

(i) A nominal variable (sometimes called a categorical variable) is one that has t ≥ 2 categories, but there is no intrinsic ordering to these categories. For example, gender is a nominal variable having two categories (male and female). Eye color is also a nominal variable having, say, four categories (blue, green, brown, hazel), but there is no agreed way to order these from lowest to highest.

(ii) An ordinal variable is similar to a nominal variable in that it has t ≥ 2 categories, but the difference between the two is that there is a clear ordering of these categories. The numbers assigned to the t categories are then directly related to their rank order. For example, the variable size has categories small, medium and large coded 1, 2 and 3, respectively.

The probability mass function of X is of the form p_j = P(X = j) for j = 1, ..., t, and the p_j's satisfy the constraint that Σ_{j=1}^t p_j = 1. We can represent these probabilities as a vector of cell probabilities denoted by p = (p_1, ..., p_t)^T.

The simplest case is when X is a binary random variable taking possible values 1 and 2. Then we have P(X = 1) = p_1 and P(X = 2) = 1 − p_1.

If we have an independent random sample (irs) of N individuals (i.e. X_1, ..., X_N) then we can represent the data by the cell counts Y_1, ..., Y_t, where Σ_{j=1}^t Y_j = N. Here, Y_j denotes the number of individuals in the sample which have a value of X equal to j. A suitable probability model for the random vector Y = (Y_1, ..., Y_t)^T is the multinomial distribution with parameters N and p. (When t = 2 this is equivalent to the Bi(N, p_1) distribution.)
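As a concrete numerical sketch of this model (the counts and probabilities below are hypothetical, and `multinomial_pmf` is an illustrative helper name, not a function from the module), the multinomial probability mass function can be evaluated directly:

```python
import math

def multinomial_pmf(y, p):
    """P(Y_1 = y_1, ..., Y_t = y_t) = N!/(y_1! ... y_t!) * p_1^y_1 * ... * p_t^y_t."""
    N = sum(y)
    coef = math.factorial(N)
    for yj in y:
        coef //= math.factorial(yj)   # multinomial coefficient N!/(y_1! ... y_t!)
    prob = float(coef)
    for yj, pj in zip(y, p):
        prob *= pj ** yj
    return prob

# With t = 2 this reduces to the binomial Bi(N, p_1) pmf:
print(multinomial_pmf([3, 1], [0.5, 0.5]))   # 4 * 0.5**4 = 0.25
```

The final line checks the remark above: for two categories the multinomial probability coincides with the corresponding binomial probability.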
The probability p_j can then be interpreted as the relative frequency of category j in the population. As you have seen in the Statistical Inference module (chapter 4), the maximum likelihood estimator of p_j is Y_j / N, the sample proportion. Under this multinomial model, the expected cell count for category j is E(Y_j) = N p_j and the unconstrained maximum likelihood estimator of E(Y_j) is N p̂_j = Y_j.
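A short sketch may help fix ideas; the cell counts below are hypothetical and the variable names are my own:

```python
# Multinomial MLEs: sample proportions and fitted expected counts
# (hypothetical cell counts for t = 3 categories).
y = [30, 50, 20]                      # observed cell counts Y_1, ..., Y_t
N = sum(y)                            # total sample size

p_hat = [yj / N for yj in y]          # MLE of p_j is the sample proportion Y_j / N
fitted = [N * pj for pj in p_hat]     # fitted expected counts N * p_hat_j

print(p_hat)      # [0.3, 0.5, 0.2]
print(fitted)     # equals the observed counts, as noted above
```

The last line illustrates the remark above: the unconstrained MLE of E(Y_j) reproduces the observed count Y_j exactly.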
We can test hypotheses of the form

    H_0 : p_j = π_j,   j = 1, ..., t

regarding a set of particular values π for the probabilities p using the generalized likelihood ratio test. In Statistical Inference, chapter 4, the test statistic is shown to be

    2 log Λ = 2 Σ_{j=1}^t Y_j log( Y_j / (N π_j) )

where N π_j = E(Y_j | H_0). The null distribution of this statistic is χ²_{t−1}, which can be used to obtain a critical value or p-value for assessing significance. For sufficiently large N, this statistic is well approximated by the alternative (Pearson) statistic

    X² = Σ_{j=1}^t (Y_j − N π_j)² / (N π_j) = Σ_{j=1}^t (O_j − E_j)² / E_j

where O_j is the observed value of Y_j and E_j = E(Y_j | H_0), the expected value of Y_j under H_0. X² also has a null χ²_{t−1} distribution.

Multivariate Discrete Random Variables

Suppose now that we have a random vector X = (X_1, ..., X_p)^T where each of the X_j's is a discrete random variable. Suppose that X_j has c_j different categories. The probability distribution of X is now described by a set of probabilities, each giving the probability of falling into one of the c_1 c_2 ... c_p different cells in the full cross-classification of the p variables.

The simplest scenario is when p = 2 and both X_1 and X_2 are binary random variables (i.e. c_1 = c_2 = 2). We can then present the probabilities in the following 2 × 2 table (or array):

                  X_2
    X_1         1        2      Total
    1         p_11     p_12     p_1+
    2         p_21     p_22     p_2+
    Total     p_+1     p_+2     1
where Σ_{j=1}^2 Σ_{k=1}^2 p_jk = 1 and p_jk = P(X_1 = j, X_2 = k), for j = 1, 2 and k = 1, 2.

If we now have a random sample X_1, ..., X_N then we can summarize the data in the form of a contingency table, which is a c_1 × c_2 × ... × c_p array of counts of the numbers of individuals falling into each of the cells. Again, this is easiest to illustrate for a 2 × 2 contingency table. We have:

                  X_2
    X_1         1        2      Total
    1         Y_11     Y_12     Y_1+
    2         Y_21     Y_22     Y_2+
    Total     Y_+1     Y_+2     N

where Σ_{j=1}^2 Σ_{k=1}^2 Y_jk = N and Y_jk denotes the number of individuals in the sample having X_1 = j and X_2 = k.

Unlike the univariate case, with this structure the positions of the cells tell us something about the individuals falling into them. For example, all individuals in a specific cell have one characteristic in common with all the individuals in the other cell in the same row, and another characteristic in common with all the individuals in the other cell in the same column. Note that if we had a third random variable X_3, along with the binary variables X_1 and X_2, this would create a structure of c_3 two-way tables.

Example - Coronary Heart Patient Data

A random sample of N = 200 coronary heart disease patients had their blood pressure (BP) and serum cholesterol (SC) levels measured, resulting in the following data summary:

                  SC
    BP          1        2      Total
    1          23       26       49
    2          82       69      151
    Total     105       95      200

Note that low values for each variable are coded by a 1 and high values by a 2. It can be generally helpful to express the data as proportions rather than counts. In this example a fixed-size random sample of 200 patients was obtained and then each individual was classified into one of the four cells of the table. It therefore seems sensible to express the counts as proportions of the total sample size, i.e.
                  SC
    BP          1        2      Total
    1         0.115    0.130    0.245
    2         0.410    0.345    0.755
    Total     0.525    0.475    1.000

There is a tendency for higher proportions of patients to have high blood pressure than low blood pressure, at both low and high serum cholesterol levels. The proportions having low and high serum cholesterol levels are fairly similar at both blood pressure levels.

A hypothesis of particular interest in this scenario is whether the row and column variables are independently distributed. In the above example, this corresponds to testing whether serum cholesterol level is independent of blood pressure level. We can express the null hypothesis as:

    H_0 : p_jk = p_j+ p_+k,   j, k = 1, 2

This independence hypothesis can also be formulated in terms of the four parameters {p_jk ; j, k = 1, 2} by writing H_0 for j = 1, k = 1 as follows:

    p_11 = (p_11 + p_12)(p_11 + p_21)

Multiplying p_11 on the left-hand side by 1 = (p_11 + p_12 + p_21 + p_22) and expanding both sides, the terms p_11², p_11 p_12 and p_11 p_21 cancel, leaving

    p_11 p_22 = p_12 p_21

We could also obtain this result for any other combination of j and k, and it is thus equivalent to the condition expressed under H_0. Therefore, H_0 is true if and only if

    ρ = (p_11 p_22) / (p_12 p_21) = 1

ρ as defined above is called the odds ratio. Hence, H_0 can be equivalently expressed as H_0 : ρ = 1 or, in terms of the log-odds ratio, as H_0 : log ρ = 0.

Since the MLE of p_jk is y_jk / N, we can estimate ρ for a particular set of data as

    r = (y_11 y_22) / (y_12 y_21)
and

    log r = log y_11 + log y_22 − log y_12 − log y_21

which is a linear contrast of the log cell frequencies. In terms of random variables, we have

    R = (Y_11 Y_22) / (Y_12 Y_21)

and it can be shown that

    V(log R) ≈ (1/N) (1/p_11 + 1/p_12 + 1/p_21 + 1/p_22)

which is estimated by

    1/y_11 + 1/y_12 + 1/y_21 + 1/y_22

A 100(1 − α)% confidence interval for log ρ is therefore given by

    log r ± z_{α/2} √(1/y_11 + 1/y_12 + 1/y_21 + 1/y_22)

where z_{α/2} is the 100(1 − α/2)% point of the N(0, 1) distribution. Confidence limits for ρ are then obtained by exponentiating the end points. Note that we have formulated the problem in terms of log R because the asymptotic distribution of log R is fairly well approximated by a Normal distribution, while that of R can be markedly skewed for values close to zero.

Therefore, to check the independence hypothesis, we can construct, say, a 95% confidence interval for ρ and check whether the value 1 is contained in the interval. For the heart patient data above we have

    r = (23 × 69) / (26 × 82) = 0.744

so that log r = −0.295. A 95% confidence interval for log ρ is therefore given by

    −0.295 ± 1.96 √(1/23 + 1/26 + 1/82 + 1/69),   i.e. (−0.941, 0.351)

A 95% confidence interval for ρ is then (0.390, 1.420), which does contain the value 1. We therefore conclude that serum cholesterol and blood pressure levels are independently distributed for coronary heart patients.
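The interval can be reproduced in a few lines (a sketch using only the Python standard library; any small differences from the rounded values in the text are due to intermediate rounding there):

```python
import math

# 2x2 cell counts for the heart patient data (1 = low, 2 = high).
y11, y12 = 23, 26        # BP = 1 row: SC = 1, SC = 2
y21, y22 = 82, 69        # BP = 2 row: SC = 1, SC = 2

r = (y11 * y22) / (y12 * y21)                  # sample odds ratio
log_r = math.log(r)
se = math.sqrt(1/y11 + 1/y12 + 1/y21 + 1/y22)  # estimated SD of log R

z = 1.96                                       # upper 2.5% point of N(0, 1)
lo, hi = log_r - z * se, log_r + z * se        # 95% CI for log rho
ci = (math.exp(lo), math.exp(hi))              # 95% CI for rho

print(round(r, 3))                         # 0.744
print(round(lo, 3), round(hi, 3))          # -0.941 0.351
print(round(ci[0], 2), round(ci[1], 2))    # 0.39 1.42
```

Since the interval for ρ contains 1, the independence hypothesis is not rejected at the 5% level, matching the conclusion above.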