Discrete Multivariate Statistics


Univariate Discrete Random Variables

Let $X$ be a discrete random variable which, in this module, is assumed to take a finite number $t$ of different values, one for each of the $t$ categories of $X$. These categories may be either nominal or ordinal, and we suppose that they are labeled $1, \ldots, t$.

(i) A nominal variable (sometimes called a categorical variable) is one that has $t \ge 2$ categories with no intrinsic ordering. For example, gender is a nominal variable having two categories ("male" and "female"). Eye color is also a nominal variable having, say, four categories ("blue", "green", "brown", "hazel"), but there is no agreed way to order these from lowest to highest.

(ii) An ordinal variable is similar to a nominal variable in that it has $t \ge 2$ categories, but the difference is that there is a clear ordering of these categories. The numbers assigned to the $t$ categories are then directly related to their rank order. An example is the variable size, with categories "small", "medium" and "large" coded 1, 2 and 3, respectively.

The probability mass function of $X$ is of the form $p_j = P(X = j)$ for $j = 1, \ldots, t$, and the $p_j$ satisfy the constraint

$$\sum_{j=1}^{t} p_j = 1.$$

We can represent these probabilities as a vector of cell probabilities, denoted by $p = (p_1, \ldots, p_t)^T$. The simplest case is when $X$ is a binary random variable taking the possible values 1 and 2. Then we have $P(X = 1) = p$ and $P(X = 2) = 1 - p$.

If we have an independent random sample (irs) of $N$ individuals (i.e. $X_1, \ldots, X_N$), then we can represent the data by the cell counts $Y_1, \ldots, Y_t$, where $\sum_{j=1}^{t} Y_j = N$. Here $Y_j$ denotes the number of individuals in the sample with a value of $X$ equal to $j$. A suitable probability model for the random vector $Y = (Y_1, \ldots, Y_t)^T$ is the multinomial distribution with parameters $N$ and $p$. (When $t = 2$ this is equivalent to $Y_1 \sim \mathrm{Bi}(N, p)$.)

The probability $p_j$ can then be interpreted as the relative frequency of category $j$ in the population. As you have seen in the Statistical Inference module (chapter 4), the maximum likelihood estimator of $p_j$ is $\hat{p}_j = Y_j / N$, the sample proportion. Under this multinomial model, the expected cell count for category $j$ is $E(Y_j) = N p_j$, and the unconstrained maximum likelihood estimator of $E(Y_j)$ is $N \hat{p}_j = Y_j$.
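The multinomial model and its maximum likelihood estimator can be illustrated with a short simulation. The following is a minimal sketch, assuming NumPy is available; the cell probabilities `p_true`, the sample size and the helper name `multinomial_mle` are hypothetical choices made for illustration and do not come from the notes.

```python
import numpy as np


def multinomial_mle(counts):
    """Return (p_hat, N), where p_hat[j] = Y_j / N is the sample proportion."""
    counts = np.asarray(counts, dtype=float)
    n_total = counts.sum()
    return counts / n_total, n_total


# Hypothetical cell probabilities for t = 3 categories (illustration only).
p_true = np.array([0.2, 0.5, 0.3])
N = 1000

rng = np.random.default_rng(seed=0)
y = rng.multinomial(N, p_true)       # cell counts Y_1, ..., Y_t, summing to N

p_hat, n_total = multinomial_mle(y)
expected_counts = N * p_true         # E(Y_j) = N p_j under the model
fitted_counts = n_total * p_hat      # N * p_hat_j, which reproduces Y_j

print("cell counts:", y)
print("sample proportions:", p_hat)
print("expected counts under p_true:", expected_counts)
```

The fitted counts reproduce the observed counts exactly, which is the sense in which the unconstrained estimator of $E(Y_j)$ is $Y_j$ itself.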

We can test hypotheses of the form

$$H_0 : p_j = \pi_j, \quad j = 1, \ldots, t,$$

regarding a set of particular values $\pi = (\pi_1, \ldots, \pi_t)^T$ for the probabilities $p$ using the generalized likelihood ratio test. In Statistical Inference, chapter 4, the test statistic is shown to be

$$-2 \log \Lambda = 2 \sum_{j=1}^{t} Y_j \log\left(\frac{Y_j}{N \pi_j}\right),$$

where $N \pi_j = E(Y_j \mid H_0)$. For large $N$, the null distribution of this statistic is approximately $\chi^2_{t-1}$, which can be used to obtain a critical value or p-value for assessing significance. For sufficiently large $N$, this statistic is well approximated by the alternative statistic

$$X^2 = \sum_{j=1}^{t} \frac{(Y_j - N \pi_j)^2}{N \pi_j} = \sum_{j=1}^{t} \frac{(O_j - E_j)^2}{E_j},$$

where $O_j$ is the observed value of $Y_j$ and $E_j = E(Y_j \mid H_0)$ is the expected value of $Y_j$ under $H_0$. $X^2$ also has an approximate null $\chi^2_{t-1}$ distribution.

Multivariate Discrete Random Variables

Suppose now that we have a random vector $X = (X_1, \ldots, X_p)^T$, where each of the $X_j$ is a discrete random variable. Suppose that $X_j$ has $c_j$ different categories. The probability distribution of $X$ is now described by a set of probabilities, each giving the probability of falling into one of the $c_1 \times c_2 \times \cdots \times c_p$ different cells in the full cross-classification of the $p$ variables. The simplest scenario is when $p = 2$ and both $X_1$ and $X_2$ are binary random variables (i.e. $c_1 = c_2 = 2$). We can then present the probabilities in the following $2 \times 2$ table (or array):

$$
\begin{array}{c|cc|c}
 & X_2 = 1 & X_2 = 2 & \text{Total} \\ \hline
X_1 = 1 & p_{11} & p_{12} & p_{1+} \\
X_1 = 2 & p_{21} & p_{22} & p_{2+} \\ \hline
\text{Total} & p_{+1} & p_{+2} & 1
\end{array}
$$

where $\sum_{j=1}^{2} \sum_{k=1}^{2} p_{jk} = 1$ and $p_{jk} = P(X_1 = j, X_2 = k)$ for $j = 1, 2$ and $k = 1, 2$.

If we now have a random sample of $N$ observations on $X$, then we can summarize the data in the form of a contingency table, which is a $c_1 \times c_2 \times \cdots \times c_p$ array of counts of the numbers of individuals falling into each of the cells. Again, this is easiest to illustrate for a $2 \times 2$ contingency table. We have:

$$
\begin{array}{c|cc|c}
 & X_2 = 1 & X_2 = 2 & \text{Total} \\ \hline
X_1 = 1 & Y_{11} & Y_{12} & Y_{1+} \\
X_1 = 2 & Y_{21} & Y_{22} & Y_{2+} \\ \hline
\text{Total} & Y_{+1} & Y_{+2} & N
\end{array}
$$

where $\sum_{j=1}^{2} \sum_{k=1}^{2} Y_{jk} = N$ and $Y_{jk}$ denotes the number of individuals in the sample having $X_1 = j$ and $X_2 = k$. Unlike the univariate case, with this structure the positions of the cells tell us something about the individuals falling into them. For example, all individuals in a specific cell have one characteristic in common with all the individuals in the other cell in the same row, and another characteristic in common with all the individuals in the other cell in the same column. Note that if we had a third random variable $X_3$ with $c_3$ categories, along with the binary variables $X_1$ and $X_2$, this would create a structure of $c_3$ two-way tables.

Example - Coronary Heart Patient Data

A random sample of $N = 200$ coronary heart disease patients had their blood pressure (BP) and serum cholesterol (SC) levels measured, resulting in the following data summary:

$$
\begin{array}{c|cc|c}
 & \text{SC} = 1 & \text{SC} = 2 & \text{Total} \\ \hline
\text{BP} = 1 & 23 & 26 & 49 \\
\text{BP} = 2 & 82 & 69 & 151 \\ \hline
\text{Total} & 105 & 95 & 200
\end{array}
$$

Note that low values for each variable are coded by 1 and high values by 2. It can be generally helpful to express the data as proportions rather than counts. In this example a fixed-size random sample of 200 patients was obtained, and then each individual was classified into one of the four cells of the table. It therefore seems sensible to express the counts as proportions of the total sample size.
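The conversion from counts to proportions for the heart patient data can be sketched in a few lines. This is a minimal illustration assuming NumPy; the variable names are hypothetical and are not part of the original notes.

```python
import numpy as np

# Counts from the coronary heart patient example:
# rows are BP (1 = low, 2 = high), columns are SC (1 = low, 2 = high).
counts = np.array([[23, 26],
                   [82, 69]])

N = counts.sum()                 # total sample size
props = counts / N               # cell proportions y_jk / N
row_props = props.sum(axis=1)    # BP margins (row totals as proportions)
col_props = props.sum(axis=0)    # SC margins (column totals as proportions)

print("cell proportions:\n", props)
print("BP margins:", row_props)
print("SC margins:", col_props)
```

Dividing by the grand total is appropriate here because only $N$ was fixed by the sampling design; neither the row totals nor the column totals were fixed in advance.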

Expressed as proportions of the total sample size, the data are:

$$
\begin{array}{c|cc|c}
 & \text{SC} = 1 & \text{SC} = 2 & \text{Total} \\ \hline
\text{BP} = 1 & 0.115 & 0.130 & 0.245 \\
\text{BP} = 2 & 0.410 & 0.345 & 0.755 \\ \hline
\text{Total} & 0.525 & 0.475 & 1.000
\end{array}
$$

At both low and high serum cholesterol levels, a higher proportion of patients have high blood pressure than low blood pressure. The proportions having low and high serum cholesterol levels are fairly similar at both blood pressure levels.

A hypothesis of particular interest in this scenario is whether the row and column variables are independently distributed. In the above example, this corresponds to testing whether serum cholesterol level is independent of blood pressure level. We can express the null hypothesis as

$$H_0 : p_{jk} = p_{j+} \, p_{+k}, \quad j, k = 1, 2.$$

This independence hypothesis can also be formulated in terms of the four parameters $\{p_{jk};\ j, k = 1, 2\}$ by writing $H_0$ for $j = 1$, $k = 1$ as follows:

$$p_{11} = (p_{11} + p_{12})(p_{11} + p_{21}).$$

Multiplying $p_{11}$ on the left-hand side by $1 = p_{11} + p_{12} + p_{21} + p_{22}$ and expanding both sides, the terms $p_{11}^2$, $p_{11} p_{12}$ and $p_{11} p_{21}$ cancel, leaving

$$p_{11} \, p_{22} = p_{12} \, p_{21}.$$

We could also obtain this result for any other combination of $j$ and $k$, and it is thus equivalent to the condition expressed under $H_0$. Therefore, $H_0$ is true if and only if

$$\rho = \frac{p_{11} \, p_{22}}{p_{12} \, p_{21}} = 1.$$

The quantity $\rho$ defined above is called the odds ratio. Hence, $H_0$ can be equivalently expressed as $H_0 : \rho = 1$ or, in terms of the log-odds ratio, as $H_0 : \log \rho = 0$. Since the MLE of $p_{jk}$ is $y_{jk} / N$, we can estimate $\rho$ for a particular set of data as

$$r = \frac{y_{11} \, y_{22}}{y_{12} \, y_{21}}$$

and

$$\log r = \log y_{11} + \log y_{22} - \log y_{12} - \log y_{21},$$

which is a linear contrast of the log cell frequencies. In terms of random variables, we have

$$R = \frac{Y_{11} \, Y_{22}}{Y_{12} \, Y_{21}},$$

and it can be shown that

$$V(\log R) \approx \frac{1}{N} \left( \frac{1}{p_{11}} + \frac{1}{p_{12}} + \frac{1}{p_{21}} + \frac{1}{p_{22}} \right),$$

which is estimated by

$$\frac{1}{y_{11}} + \frac{1}{y_{12}} + \frac{1}{y_{21}} + \frac{1}{y_{22}}.$$

A $100(1 - \alpha)\%$ confidence interval for $\log \rho$ is therefore given by

$$\log r \pm z_{\alpha/2} \sqrt{\frac{1}{y_{11}} + \frac{1}{y_{12}} + \frac{1}{y_{21}} + \frac{1}{y_{22}}},$$

where $z_{\alpha/2}$ is the $100(1 - \alpha/2)\%$ point of the $N(0, 1)$ distribution. Confidence limits for $\rho$ are then obtained by exponentiating the end points. Note that we have formulated the problem in terms of $\log R$ because the asymptotic distribution of $\log R$ is fairly well approximated by a Normal distribution, while that of $R$ can be markedly skewed for values close to zero. Therefore, to check the independence hypothesis, we can construct, say, a 95% confidence interval for $\rho$ and check whether the value 1 is contained in the interval.

For the heart patient data above we have $r = (23 \times 69) / (26 \times 82) = 0.744$, so that $\log r = -0.295$. A 95% confidence interval for $\log \rho$ is therefore given by

$$-0.295 \pm 1.96 \sqrt{\frac{1}{23} + \frac{1}{26} + \frac{1}{82} + \frac{1}{69}},$$

i.e. $(-0.940, 0.351)$. A 95% confidence interval for $\rho$ is then $(0.390, 1.420)$, which does contain the value 1. We therefore do not reject $H_0$: the data are consistent with serum cholesterol and blood pressure levels being independently distributed for coronary heart patients.
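The interval calculation above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `odds_ratio_ci` and its argument names are hypothetical, and the code assumes all four cell counts are strictly positive, since the variance estimate is undefined otherwise.

```python
import math
from statistics import NormalDist


def odds_ratio_ci(y11, y12, y21, y22, alpha=0.05):
    """Sample odds ratio r with a 100(1 - alpha)% CI for log(rho) and for rho.

    Uses the large-sample approximation
    V(log R) ~ 1/y11 + 1/y12 + 1/y21 + 1/y22.
    """
    if min(y11, y12, y21, y22) <= 0:
        raise ValueError("all cell counts must be positive")
    r = (y11 * y22) / (y12 * y21)
    log_r = math.log(r)
    se = math.sqrt(1 / y11 + 1 / y12 + 1 / y21 + 1 / y22)
    z = NormalDist().inv_cdf(1 - alpha / 2)   # upper alpha/2 point of N(0, 1)
    ci_log = (log_r - z * se, log_r + z * se)
    ci_rho = (math.exp(ci_log[0]), math.exp(ci_log[1]))
    return r, ci_log, ci_rho


# Heart patient data: rows BP = 1, 2 and columns SC = 1, 2.
r, ci_log, ci_rho = odds_ratio_ci(23, 26, 82, 69)
print(f"r = {r:.3f}")
print(f"95% CI for log(rho): ({ci_log[0]:.3f}, {ci_log[1]:.3f})")
print(f"95% CI for rho: ({ci_rho[0]:.3f}, {ci_rho[1]:.3f})")
print("interval contains 1:", ci_rho[0] < 1 < ci_rho[1])
```

Up to rounding in the last digit, this agrees with the interval for $\rho$ quoted in the example. Working on the log scale and exponentiating the end points keeps the interval within the positive range that an odds ratio must occupy.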