Categorical Data Analysis. STA 312: Fall 2012. See last slide for copyright information.

Variables and Cases There are $n$ cases (people, rats, factories, wolf packs) in a data set. A variable is a characteristic or piece of information that can be recorded for each case in the data set. For example, cases could be patients in a hospital, and variables could be Age, Sex, Diagnosis, Have family doctor (Yes-No), Family history of heart disease (Yes-No), etc.

Variables can be Categorical or Continuous Categorical: Gender, Diagnosis, Job category, Have family doctor, Family history of heart disease, 5-year survival (Y-N). Some categories are ordered (birth order, health status). Continuous: Height, Weight, Blood pressure. Some questions: Are all normally distributed variables continuous? Are all continuous variables quantitative? Are all quantitative variables continuous? Are there really any data sets with continuous variables?

Variables can be Explanatory or Response Explanatory variables are sometimes called independent variables. The x variables in regression are explanatory variables. Response variables are sometimes called dependent variables. The Y variable in regression is the response variable. Sometimes the distinction is not useful: Does each twin get cancer, Yes or No?

Our main interest is in categorical variables, especially categorical response variables. In ordinary regression, outcomes are normally distributed, and so continuous. But often, outcomes of interest are categorical: buy the product, or not; marital status 5 years after graduation; survive the operation, or not. Ordered categorical response variables too: for example, highest level of hockey ever played.

Distributions We will mostly use the Bernoulli, Binomial, Multinomial, and Poisson distributions.

The Poisson process Why the Poisson distribution is such a useful model for count data: events happen randomly in space or time, with independent increments, and for a small region or interval the chance of 2 or more events is negligible while the chance of an event is roughly proportional to the size of the region or interval. Then (solve a system of differential equations), the probability of observing $x$ events in a region of size $t$ is
$$\frac{e^{-\lambda t}(\lambda t)^x}{x!} \quad \text{for } x = 0, 1, \ldots$$
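A quick numerical check of this formula, not from the original slides: a minimal R sketch with assumed illustrative values of $\lambda$ and $t$, comparing the probability computed by hand with R's built-in dpois.

# Check the Poisson probability formula; lambda and t are assumed values.
lambda = 1.5; t = 2; x = 0:5
byhand = exp(-lambda*t) * (lambda*t)^x / factorial(x)
builtin = dpois(x, lambda*t)  # Poisson probabilities with mean lambda*t
cbind(x, byhand, builtin)     # the two columns should match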

Poisson process examples Some variables that have a Poisson distribution: calls coming in to an emergency number; customers arriving in a given time period; number of raisins in a loaf of raisin bread; number of bomb craters in a region after a bombing raid (London, WWII); in a jar of peanut butter...

Steps in the process of statistical analysis One possible approach: consider a fairly realistic example or problem; decide on a statistical model; perhaps decide sample size; acquire the data; examine and clean the data, generating displays and descriptive statistics; estimate parameters, perhaps by maximum likelihood; carry out tests, compute confidence intervals, or both; perhaps re-consider the model and go back to estimation; based on the results of inference, draw conclusions about the example or problem.

Coffee taste test A fast food chain is considering a change in the blend of coffee beans they use to make their coffee. To determine whether their customers prefer the new blend, the company plans to select a random sample of n = 100 coffee-drinking customers and ask them to taste coffee made with the new blend and with the old blend, in cups marked A and B. Half the time the new blend will be in cup A, and half the time it will be in cup B. Management wants to know if there is a difference in preference for the two blends.

Statistical model Letting $\pi$ denote the probability that a consumer will choose the new blend, treat the data $Y_1, \ldots, Y_n$ as a random sample from a Bernoulli distribution. That is, independently for $i = 1, \ldots, n$,
$$P(y_i \mid \pi) = \pi^{y_i}(1-\pi)^{1-y_i} \quad \text{for } y_i = 0 \text{ or } y_i = 1,$$
and zero otherwise. Note that $Y = \sum_{i=1}^n Y_i$ is the number of consumers who choose the new blend. Because $Y \sim B(n, \pi)$, the whole experiment could also be treated as a single observation from a Binomial.
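A small simulation sketch, not from the slides, illustrating the two equivalent views of the data: $n$ individual Bernoulli trials, or a single Binomial count. The value $\pi = 0.6$ is an assumed illustration.

# Simulate n Bernoulli trials; pi = 0.6 is an assumed value.
set.seed(312)
n = 100; pi = 0.6
y = rbinom(n, size=1, prob=pi)  # Y1, ..., Yn: individual choices
Y = sum(y)                      # number choosing the new blend
Y                               # a single observation from B(n, pi)
dbinom(Y, n, pi)                # its Binomial probability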

Find the MLE of $\pi$. Show your work. Maximize the log likelihood:
$$\frac{\partial}{\partial\pi} \log \ell
= \frac{\partial}{\partial\pi} \log \prod_{i=1}^n P(y_i \mid \pi)
= \frac{\partial}{\partial\pi} \log \prod_{i=1}^n \pi^{y_i}(1-\pi)^{1-y_i}
= \frac{\partial}{\partial\pi} \log \left( \pi^{\sum_{i=1}^n y_i} (1-\pi)^{n - \sum_{i=1}^n y_i} \right)$$
$$= \frac{\partial}{\partial\pi} \left( \left(\sum_{i=1}^n y_i\right) \log \pi + \left(n - \sum_{i=1}^n y_i\right) \log(1-\pi) \right)
= \frac{\sum_{i=1}^n y_i}{\pi} - \frac{n - \sum_{i=1}^n y_i}{1-\pi}$$

Setting the derivative to zero,
$$\frac{\sum_{i=1}^n y_i}{\pi} = \frac{n - \sum_{i=1}^n y_i}{1-\pi}
\;\Rightarrow\; (1-\pi)\sum_{i=1}^n y_i = \pi\left(n - \sum_{i=1}^n y_i\right)
\;\Rightarrow\; \sum_{i=1}^n y_i - \pi\sum_{i=1}^n y_i = n\pi - \pi\sum_{i=1}^n y_i$$
$$\;\Rightarrow\; \sum_{i=1}^n y_i = n\pi
\;\Rightarrow\; \hat{\pi} = \frac{\sum_{i=1}^n y_i}{n} = \bar{y} = p$$
So it looks like the MLE is the sample proportion. Carrying out the second derivative test to be sure:

Second derivative test
$$\frac{\partial^2 \log \ell}{\partial\pi^2}
= \frac{\partial}{\partial\pi}\left( \frac{\sum_{i=1}^n y_i}{\pi} - \frac{n - \sum_{i=1}^n y_i}{1-\pi} \right)
= -\frac{\sum_{i=1}^n y_i}{\pi^2} - \frac{n - \sum_{i=1}^n y_i}{(1-\pi)^2}
= -n\left( \frac{\bar{y}}{\pi^2} + \frac{1-\bar{y}}{(1-\pi)^2} \right) < 0$$
Concave down, so it is a maximum, and $\hat{\pi} = \bar{y} = p$.
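The calculus can be double-checked numerically. Here is a sketch, not from the slides, that maximizes the log likelihood with R's optimize() on assumed data; the maximizer should agree with the sample proportion.

# Numerical check: the log likelihood peaks at the sample proportion.
set.seed(312)
y = rbinom(100, 1, 0.6)  # assumed illustrative data
loglik = function(pi) sum(y)*log(pi) + (length(y)-sum(y))*log(1-pi)
optimize(loglik, interval=c(0.001,0.999), maximum=TRUE)$maximum
mean(y)  # should agree to several decimals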

Numerical estimate Suppose 60 of the 100 consumers prefer the new blend. Give a point estimate of the parameter $\pi$. Your answer is a number.
> p = 60/100; p
[1] 0.6

Carry out a test to answer the question Is there a difference in preference for the two blends? Start by stating the null hypothesis: $H_0: \pi = 0.50$ versus $H_1: \pi \neq 0.50$. A case could be made for a one-sided test, but we'll stick with two-sided. $\alpha = 0.05$ as usual. The Central Limit Theorem says $\hat{\pi} = \bar{Y}$ is approximately normal with mean $\pi$ and variance $\frac{\pi(1-\pi)}{n}$.

Several valid test statistics for $H_0: \pi = \pi_0$ are available Two of them are
$$Z_1 = \frac{\sqrt{n}(p - \pi_0)}{\sqrt{\pi_0(1-\pi_0)}} \quad \text{and} \quad Z_2 = \frac{\sqrt{n}(p - \pi_0)}{\sqrt{p(1-p)}}$$
What is the critical value? Your answer is a number.
> alpha = 0.05
> qnorm(1-alpha/2)
[1] 1.959964

Calculate the test statistic(s) and the p-value(s)
> pi0 = .5; p = .6; n = 100
> Z1 = sqrt(n)*(p-pi0)/sqrt(pi0*(1-pi0)); Z1
[1] 2
> pval1 = 2 * (1-pnorm(Z1)); pval1
[1] 0.04550026
>
> Z2 = sqrt(n)*(p-pi0)/sqrt(p*(1-p)); Z2
[1] 2.041241
> pval2 = 2 * (1-pnorm(Z2)); pval2
[1] 0.04122683

Conclusions Do you reject $H_0$? Yes, just barely. Isn't the $\alpha = 0.05$ significance level pretty arbitrary? Yes, but if people insist on a Yes or No answer, this is what you give them. What do you conclude, in symbols? $\pi \neq 0.50$. Specifically, $\pi > 0.50$. What do you conclude, in plain language? Your answer is a statement about coffee: more consumers prefer the new blend of coffee beans. Can you really draw directional conclusions when all you did was reject a non-directional null hypothesis? Yes. Decompose the two-sided size $\alpha$ test into two one-sided tests of size $\alpha/2$, as in the sketch below. This approach works in general. It is very important to state directional conclusions, and to state them clearly in terms of the subject matter. Say what happened! If you are asked to state the conclusion in plain language, your answer must be free of statistical mumbo-jumbo.
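To make the decomposition concrete, here is a minimal sketch, not from the slides, of the two one-sided size $\alpha/2 = 0.025$ tests for the coffee data with $Z_1 = 2$.

# Decompose the two-sided test into two one-sided tests of size alpha/2.
Z1 = 2
crit = qnorm(0.975)  # 1.96: critical value for each one-sided test
Z1 > crit            # TRUE: reject in the direction pi > 0.50
Z1 < -crit           # FALSE: no evidence that pi < 0.50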

What about negative conclusions? What would you say if Z = 1.84? Here are two possibilities. By conventional standards, this study does not provide enough evidence to conclude that consumers prefer one blend of coffee beans over the other. The results are consistent with no difference in preference for the two coffee bean blends. In this course, we will not just casually accept the null hypothesis.

Confidence Intervals Approximately, for large $n$,
$$1-\alpha \approx Pr\{-z_{\alpha/2} < Z < z_{\alpha/2}\}
= Pr\left\{ -z_{\alpha/2} < \frac{\sqrt{n}(p-\pi)}{\sqrt{p(1-p)}} < z_{\alpha/2} \right\}
= Pr\left\{ p - z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} < \pi < p + z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} \right\}$$
Could express this as $p \pm z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}$. The quantity $z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}$ is sometimes called the margin of error. If $\alpha = 0.05$, it's the 95% margin of error.

Give a 95% confidence interval for the taste test data. The answer is a pair of numbers. Show some work.
$$\left( p - z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}, \; p + z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} \right)
= \left( 0.60 - 1.96\sqrt{\frac{0.6 \times 0.4}{100}}, \; 0.60 + 1.96\sqrt{\frac{0.6 \times 0.4}{100}} \right)
= (0.504, 0.696)$$
In a report, you could say: The estimated proportion preferring the new coffee bean blend is $0.60 \pm 0.096$, or: Sixty percent of consumers preferred the new blend. These results are expected to be accurate within 10 percentage points, 19 times out of 20.
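The same interval in R, a minimal sketch of the calculation above:

# 95% confidence interval for the taste test data
p = 0.60; n = 100; alpha = 0.05
me = qnorm(1-alpha/2) * sqrt(p*(1-p)/n)  # margin of error, about 0.096
c(p - me, p + me)                        # (0.504, 0.696)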

Meaning of the confidence interval We calculated a 95% confidence interval of $(0.504, 0.696)$ for $\pi$. Does this mean $Pr\{0.504 < \pi < 0.696\} = 0.95$? No! The quantities $0.504$, $0.696$ and $\pi$ are all constants, so $Pr\{0.504 < \pi < 0.696\}$ is either zero or one. The endpoints of the confidence interval are random variables, and the numbers 0.504 and 0.696 are realizations of those random variables, arising from a particular random sample. Meaning of the probability statement: if we were to calculate an interval in this manner for a large number of random samples, the interval would contain the true parameter around 95% of the time. So we sometimes say that we are 95% confident that $0.504 < \pi < 0.696$.

Confidence intervals (regions) correspond to tests Recall $Z_1 = \frac{\sqrt{n}(p-\pi_0)}{\sqrt{\pi_0(1-\pi_0)}}$ and $Z_2 = \frac{\sqrt{n}(p-\pi_0)}{\sqrt{p(1-p)}}$. From the derivation of the confidence interval, $-z_{\alpha/2} < Z_2 < z_{\alpha/2}$ if and only if
$$p - z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} < \pi_0 < p + z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}}$$
So the confidence interval consists of those parameter values $\pi_0$ for which $H_0: \pi = \pi_0$ is not rejected. That is, the null hypothesis is rejected at significance level $\alpha$ if and only if the value given by the null hypothesis is outside the $(1-\alpha) \times 100\%$ confidence interval. There is a confidence interval corresponding to $Z_1$ too. Maybe it's better; see Chapter 1. In general, any test can be inverted to obtain a confidence region.
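For the record, inverting the $Z_1$ test gives the Wilson (score) interval. The slides do not show this computation, but one way to get it in R is prop.test with the continuity correction turned off; a sketch:

# Score (Wilson) interval: invert the Z1 test rather than Z2.
prop.test(x=60, n=100, p=0.5, correct=FALSE)$conf.int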

Selecting sample size Where did that $n = 100$ come from? Probably off the top of someone's head. We can (and should) be more systematic. Sample size can be selected to achieve a desired margin of error, to achieve a desired statistical power, or in other reasonable ways.

Power The power of a test is the probability of rejecting $H_0$ when $H_0$ is false. More power is good. Power is not just one number; it is a function of the parameter. Usually: for any $n$, the more incorrect $H_0$ is, the greater the power; and for any parameter value satisfying the alternative hypothesis, the larger $n$ is, the greater the power.

Statistical power analysis To select sample size: pick an effect you'd like to be able to detect, a parameter value such that $H_0$ is false. It should be just over the boundary of interesting and meaningful. Pick a desired power, a probability with which you'd like to be able to detect the effect by rejecting the null hypothesis. Start with a fairly small $n$ and calculate the power. Increase the sample size until the desired power is reached. There are two main issues. What is an interesting or meaningful parameter value? How do you calculate the probability of rejecting $H_0$?
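One way to answer the second question, before deriving a formula on the next slide, is by simulation. A Monte Carlo sketch, not from the slides, with assumed values $\pi = 0.60$ and $n = 100$:

# Estimate power by simulation: generate data under the true pi,
# run the Z1 test each time, and record the proportion of rejections.
set.seed(312)
pi.true = 0.60; n = 100; pi0 = 0.50; alpha = 0.05
reject = replicate(10000, {
  p = mean(rbinom(n, 1, pi.true))
  Z1 = sqrt(n)*(p - pi0)/sqrt(pi0*(1-pi0))
  abs(Z1) > qnorm(1-alpha/2)
})
mean(reject)  # roughly comparable to the normal-approximation value 0.516

The simulated value can differ somewhat from the normal approximation, because the underlying Binomial is discrete.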

Calculating power for the test of a single proportion The true parameter value is $\pi$, and $Z_1 = \frac{\sqrt{n}(p-\pi_0)}{\sqrt{\pi_0(1-\pi_0)}}$. Then
$$\text{Power} = 1 - Pr\{-z_{\alpha/2} < Z_1 < z_{\alpha/2}\}
= 1 - Pr\left\{ -z_{\alpha/2} < \frac{\sqrt{n}(p-\pi_0)}{\sqrt{\pi_0(1-\pi_0)}} < z_{\alpha/2} \right\} = \cdots$$
$$= 1 - Pr\left\{ \frac{\sqrt{n}(\pi_0-\pi)}{\sqrt{\pi(1-\pi)}} - z_{\alpha/2}\frac{\sqrt{\pi_0(1-\pi_0)}}{\sqrt{\pi(1-\pi)}} < \frac{\sqrt{n}(p-\pi)}{\sqrt{\pi(1-\pi)}} < \frac{\sqrt{n}(\pi_0-\pi)}{\sqrt{\pi(1-\pi)}} + z_{\alpha/2}\frac{\sqrt{\pi_0(1-\pi_0)}}{\sqrt{\pi(1-\pi)}} \right\}$$
$$\approx 1 - Pr\left\{ \frac{\sqrt{n}(\pi_0-\pi)}{\sqrt{\pi(1-\pi)}} - z_{\alpha/2}\frac{\sqrt{\pi_0(1-\pi_0)}}{\sqrt{\pi(1-\pi)}} < Z < \frac{\sqrt{n}(\pi_0-\pi)}{\sqrt{\pi(1-\pi)}} + z_{\alpha/2}\frac{\sqrt{\pi_0(1-\pi_0)}}{\sqrt{\pi(1-\pi)}} \right\}$$
$$= 1 - \Phi\left( \frac{\sqrt{n}(\pi_0-\pi)}{\sqrt{\pi(1-\pi)}} + z_{\alpha/2}\frac{\sqrt{\pi_0(1-\pi_0)}}{\sqrt{\pi(1-\pi)}} \right) + \Phi\left( \frac{\sqrt{n}(\pi_0-\pi)}{\sqrt{\pi(1-\pi)}} - z_{\alpha/2}\frac{\sqrt{\pi_0(1-\pi_0)}}{\sqrt{\pi(1-\pi)}} \right),$$
where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal.

An R function to calculate approximate power For the test of a single proportion,
$$\text{Power} = 1 - \Phi\left( \frac{\sqrt{n}(\pi_0-\pi)}{\sqrt{\pi(1-\pi)}} + z_{\alpha/2}\sqrt{\frac{\pi_0(1-\pi_0)}{\pi(1-\pi)}} \right) + \Phi\left( \frac{\sqrt{n}(\pi_0-\pi)}{\sqrt{\pi(1-\pi)}} - z_{\alpha/2}\sqrt{\frac{\pi_0(1-\pi_0)}{\pi(1-\pi)}} \right)$$
Z1power = function(pi,n,pi0=0.50,alpha=0.05)
{
# a and b are the two pieces of the power formula above
a = sqrt(n)*(pi0-pi)/sqrt(pi*(1-pi))
b = qnorm(1-alpha/2) * sqrt(pi0*(1-pi0)/(pi*(1-pi)))
Z1power = 1 - pnorm(a+b) + pnorm(a-b)  # Power = 1 - Phi(a+b) + Phi(a-b)
Z1power
} # End of function Z1power

Some numerical examples
> Z1power(0.50,100)
[1] 0.05
>
> Z1power(0.55,100)
[1] 0.168788
> Z1power(0.60,100)
[1] 0.5163234
> Z1power(0.65,100)
[1] 0.8621995
> Z1power(0.40,100)
[1] 0.5163234
> Z1power(0.55,500)
[1] 0.6093123
> Z1power(0.55,1000)
[1] 0.8865478
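Plotting power as a function of the true $\pi$ makes the pattern in these numbers visible; a minimal sketch using the Z1power function above:

# Power curve for n = 100 over a range of true values of pi
pi.vals = seq(0.35, 0.65, by=0.01)
pow = sapply(pi.vals, Z1power, n=100)
plot(pi.vals, pow, type='l', xlab='True pi', ylab='Power')
abline(h=0.05, lty=2)  # power equals alpha = 0.05 at pi = pi0 = 0.50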

Find the smallest sample size needed to detect $\pi = 0.55$ as different from $\pi_0 = 0.50$ with probability at least 0.80
> samplesize = 50
> power = Z1power(pi=0.55,n=samplesize); power
[1] 0.1076602
> while(power < 0.80)
+ {
+   samplesize = samplesize+1
+   power = Z1power(pi=0.55,n=samplesize)
+ }
> samplesize; power
[1] 783
[1] 0.8002392

Find the smallest sample size needed to detect $\pi = 0.60$ as different from $\pi_0 = 0.50$ with probability at least 0.80
> samplesize = 50
> power = Z1power(pi=0.60,n=samplesize); power
[1] 0.2890491
> while(power < 0.80)
+ {
+   samplesize = samplesize+1
+   power = Z1power(pi=0.60,n=samplesize)
+ }
> samplesize; power
[1] 194
[1] 0.8003138
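The search can be wrapped in a small function so it is easy to re-run for other effect sizes. A sketch; the function name and defaults are assumptions, not from the slides.

# Hypothetical wrapper around the while-loop search above.
findSampleSize = function(pi, wantpower=0.80, start=50)
{
  n = start
  while(Z1power(pi,n) < wantpower) n = n+1
  n
}
findSampleSize(0.55)  # 783, matching the search above
findSampleSize(0.60)  # 194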

Conclusions from the power analysis Detecting true $\pi = 0.60$ as different from 0.50 is a reasonable goal. Power with $n = 100$ is barely above one half: pathetic. As Fisher said, "To call in the statistician after the experiment is done may be no more than asking him to perform a postmortem examination: he may be able to say what the experiment died of." $n = 200$ is much better. How about $n = 250$?
> Z1power(pi=0.60,n=250)
[1] 0.8901088
It depends on what you can afford, but I like $n = 250$.

What is required of the scientist who wants to select sample size by power analysis The scientist must specify the parameter values that he or she wants to be able to detect as different from the $H_0$ value, and the desired power (probability of detection). It's not always easy for the scientist to think in terms of the parameters of a statistical model.

Copyright Information This slide show was prepared by Jerry Brunner, Department of Statistics, University of Toronto. It is licensed under a Creative Commons Attribution - ShareAlike 3.0 Unported License. Use any part of it as you like and share the result freely. The LaTeX source code is available from the course website: http://www.utstat.toronto.edu/~brunner/oldclass/312f12