Unit 3. Discrete Distributions

Size: px

Start display at page:

Download "Unit 3. Discrete Distributions"

Lorena Simon
6 years ago
Views:

1 PubHlth Discrete Distributions Page 1 of 39 Unit 3. Discrete Distributions Topic 1. Proportions and Rates in Epidemiological Research Review - Bernoulli Distribution. 3. Review - Binomial Distribution Poisson Distribution Hypergeometric Distribution.. 6. Fisher s Eact Test.. 7. Discrete Distribution Themes Appendices A. Mean and Variance of Bernoulli(π).. B. The Binomial(N,π) is the Sum of N Independent Bernoulli(π)... C. The Poisson is an Etension of a Binomial

2 PubHlth Discrete Distributions Page 2 of Proportions and Rates in Epidemiological Research The concepts of proportion of and rate of describe different aspects of disease occurrence. This distinction is important to epidemiological research. A proportion is a relative frequency. Proportion = It is dimensionless. # events that actually occurred # events that could have occurred Valid range: 0 to 1 E.g. Toss a coin 10 times. If we observe 2 heads : Proportion heads = 2/10 = 20% # Events that could have occurred = 10 tosses # Events occurred = 2 heads Prevalence measures are eamples of proportions. A rate is a count of event occurrence per unit of time. As such, it is measured relative to an interval of time. It is not dimensionless. Rate = # events that actually occurred # time periods eperienced Valid range: 0 to E.g., persons are known to have smoked, collectively, for 1,000 pack years. If we observe 3 occurrences of lung cancer: Rate lung cancer = 3/1000 pack years # Time periods eperienced = 1000 pack years # Events occurred = 3 Incidence densities are eamples of rates.

3 PubHlth Discrete Distributions Page 3 of 39 Some Commonly Used Proportions and Rates Proportions describe either eisting disease or new disease within a time frame. flow of occurrence of new disease with time. Rates describe the force or Prevalence = Prevalence # persons with disease at a point in time # persons in the population at a point in time Type of Measure: Proportion Denominator = # events that could have occurred Numerator = # events that actually occurred # persons in the population at a point in time # persons having the disease at a point in time Valid range: 0 to 1 Intepretation Other names The proportion of the population with disease at a point in time Point prevalence Eample: In 1988, the New York State Breast and Cervical Cancer Screening Program registry included 16,529 women with a baseline mammogram. Upon review of their medical records, 528 were found to have a history of breast cancer. The prevalence of breast cancer in this 1988 selected cohort was 528 P = = = 3.19% 16,529 These 528 women were ecluded from the analyses to determine the factors associated with repeat screening mammogram.

4 PubHlth Discrete Distributions Page 4 of 39 Cumulative incidence = # disease onsets during interval of time # persons at risk in population at start of interval Cumulative Incidence Denominator = # events that could have happened Numerator = # events that actually occurred Type of Measure: Proportion # persons in the population at the start of the time interval who could possibly develop disease. Thus, all are disease free at the start of the interval. # persons developing the disease during the time interval. Valid range: 0 to 1 Interpretation Note!! The proportion of healthy persons who go on to develop disease over a specified time period. We assume that every person in the population at the start was followed for the entire interval of time. Eample: In this eample, the event of interest is the completion of a repeat screening mammogram. There were 9,485 women in the New York State Breast and Cervical Cancer Screening Program registry as of 1988 with a negative mammogram, who did not have a history of breast cancer, and provided complete data. 2,604 obtained a repeat screening mammogram during the 6 year period The 6-year cumulative incidence of repeat screening mammogram is therefore 2,604 CI 6year = = = 27% 9,485 Further analysis is focused on the identification of its correlates.

5 PubHlth Discrete Distributions Page 5 of 39 Incidence Density = Incidence Density # disease onsets during interval of time Sum of individual lengths of time actually at risk Type of Measure: Rate Denominator = # time periods eperienced event free. Sum of individual lengths of time during which there was opportunity for event occurrence during a specified interval of time. a This sum is called person time and is N i= 1 (time for i th person free of event) It is also called person years or risk time. Numerator = # new event occurrences Valid range: Intepretation Other names Note!! # disease onsets during a specified interval of time. 0 to The force of disease occurrence per unit of time Incidence rate We must assume that the risk of event remains constant over time. When it doesn t, stratified approaches are required. Be careful in the reporting of an incidence density. Don t forget the scale of measurement of time. I = 7 per person year = 7/52 = 0.13 per person week = 7/ = per person day

6 PubHlth Discrete Distributions Page 6 of 39 A prevalence estimate is useful when interest is in What to Use? who has disease now versus who does not (one time camera picture) planning services; e.g. - delivery of health care The concept of prevalence is not meaningfully applicable to etiologic studies. Susceptibililty and duration of disease contribute to prevalence Thus, prevalence = function(susceptibililty, incidence, survival) Etiologic studies of disease occurrence often use the cumulative incidence measure of frequency. Recall that we must assume complete follow-up of entire study cohort. The cumulative incidence measure of disease frequency is not helpful to us if persons migrate in and out of the study population. Individuals no longer have the same opportunity for event recognition. Etiologic studies of disease occurrence in dynamic populations will then use the incidence density measure of frequency. Be careful here, too! Does risk of event change with time? With age? If so, calculate person time separately in each of several blocks of time. This is a stratified analysis approach.

7 PubHlth Discrete Distributions Page 7 of Review - Bernoulli Distribution In order to analyze variations in proportions and rates, we ll need a few probability models. There are 4 and they are interrelated. Two of them were introduced in PubHlth 540, Introductory Biostatisics: Bernoulli; and Binomial. Two additional distributions that are often used in analyses of discrete data are the following: Poisson; and hypergeometric. Recall - A Bernoulli Trial is the Simplest Binomial Random Variable. A simple eample is the coin toss chance of heads can be re-cast as a random variable. Let Z = random variable representing outcome of one toss, with Z = 1 if heads 0 if tails π= Probability [ coin lands heads }. Thus, π = Probability [ Z = 1 ]

8 PubHlth Discrete Distributions Page 8 of 39 We have what we need to define a probability distribution. Enumeration of all possible outcomes - outcomes are mutually eclusive - outcomes ehaust all possibilities 1 0 Associated probabilities of each - each probability is between 0 and 1 - sum of probabilities totals 1 Outcome Pr[outcome] 1 Pr[Z=1] = π 0 Pr[Z=0] = (1-π) Total = 1 In epidemiology, the Bernoulli might be a model for the description of ONE individual (N=1): This person is in one of two states. He or she is either in a state of: 1) event with probability π; or 2) non event with probability (1-π) The description of the likelihood of being either in the event state or the non-event state is given by the Bernoulli distribution

9 PubHlth Discrete Distributions Page 9 of 39 Bernoulli Distribution Suppose Z can take on only two values, 1 or 0, and suppose: Probability [ Z = 1 ] = π Probability [ Z = 0 ] = (1-π) This gives us the following epression for the likelihood of Z=z. Tip - Recall from PubHlth 540, Unit 5, that the likelihood function is called a probability density function and is written with the notation f Z (z). When the random variable is discrete (as is the case for the Bernoulli), we can write the following: f Z (z) = Probability [ Z = z ] = π z (1-π) 1-z for z = 0 or 1. Epected value of Z is E[Z] = μ = π Variance of Z is Var[Z] = σ 2 = π (1-π) Eample: Z is the result of tossing a coin once. If it lands heads with probability =.5, then π =.5. Later we ll see that individual Bernoulli distributions are the basis of describing patterns of disease occurrence in a logistic regression analysis.

10 PubHlth Discrete Distributions Page 10 of Review - Binomial Distribution Recall from PubHlth 540 (Unit 4) that the Binomial distribution model is the appropriate model to use for the outcome that is the number of successes in a set of N independent success/failure Bernoulli trials. E.g. - What is the probability that 2 of 6 graduate students are female? - What is the probability that of 100 infected persons, 4 will die within a year? Binomial Distribution Among N trials, where The N trials are independent Each trial has two possible outcomes: 1 or 0 Pr [ outcome =1] = π for each trial; and therefore Pr [ outcome = 0] = (1-π) X = # events of outcome The probability density function f X () is now: f X () = Probability [ X = ] = F HG N I π KJ (1-π) N- for =0,, N where F HG Epected value is E[X] = μ = N π Variance is Var[X] = σ 2 = N π (1-π) I N = # ways to choose X from N = KJ N!! (N-)! and where N! = N(N-1)(N-2)(N-3) (4)(3)(2)(1) and is called the factorial.

11 PubHlth Discrete Distributions Page 11 of 39 Your Turn A roulette wheel lands on each of the digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9 with probability =.10. Write down the epression for the calculation of the following. #1. The probability of 5 or 6 eactly 3 times in 20 spins. #2. The probability of digit greater than 6 at most 3 times in 20 spins.

12 PubHlth Discrete Distributions Page 12 of 39 #1. Solution: Note The k is your. The p is your π Event is outcome of either 5 or 6 Pr[event] = π =.20 N = 20 X is distributed Binomial(N=20, π=.20) Pr[X=3]= [.20] [ 1.20] 3 20 [.20] 3 [.80] 17 = 3 =.2054 #2. Solution: Event is outcome of either 7 or 8 or 9 Pr[event] = π =.30 N = 20 X is distributed Binomial(N=20, π=.30) Note The k is your. The p is your π Translation: At most 3 times is the same as less than or equal to 3 times Pr[X 3] = Pr[X=0] + Pr[X=1] + Pr[X=2] + Pr[X=3] = [.30] [.70] = = [.30] [.70] + [.30] [.70] = [ ] [ ] [ ] [ ] 17

13 PubHlth Discrete Distributions Page 13 of 39 Review of the Normal Approimation for the Calculation of Binomial Probabilities Calculations of eact binomial probabilities become quite tedious as the number of trials and number of events gets large. Fortunately, the central limit theorem allows us to replace these eact calculations with very good approimate calculations. The approimate calculations are actually normal probability calculations. The following is an eample where the calculation of the required binomial probability would be too much to do! Eample Calculate the chances of between 5 and 28 events (inclusive) in 180 trials with probability of event =.041. Idea of Solution Translate the required eact calculation into a very good approimate one using the z-score. X distributed Binomial (N=180, π=.041) says that μ BINOMIAL = nπ = (180)(.041) = 7.38 σ = n π (1-π) = (180)(.041)(.959) = 7.08 σ 2 BINOMIAL BINOMIAL = σ = BINOMIAL The approimate calculation using the z-score uses μ = μ BINOMIAL and σ = σbinomial 5-μ 28-μ Pr[5 X 28] Pr[ Normal(0,1) ] σ σ = Pr[ Normal(0,1) ] = Pr [ Normal(0,1) ] Pr [ Normal(0,1) ], because is in the etreme right tail. = Pr [ Normal (0,1) ], because of symmetry of the tails of the normal =.8146

14 PubHlth Discrete Distributions Page 14 of 39 Review of the Normal Approimation for the Calculation of Binomial Probabilities continued. More generally, we can use a z-score and the normal distribution for the following reasons. Binomial probabilities are likelihood calculations for a discrete random variable. Normal distribution probabilities are likelihood calculations for a continuous random variable. When substituting for the eact probabilities, we use the Normal distribution that has mean and variance parameter values equal to those of our Binomial distribution. μ = μ = nπ normal binomial σ = σ = n π (1-π) 2 2 normal binomial Desired Binomial Probability Calculation Normal Approimation w Correction Pr [ X=k] Pr [ (k-1/2) < X < (k+1/2) ] Pr [ X > k ] Pr [ X > (k-1/2) ] Pr [ X < k ] Pr [ X < (k+1/2) ]

15 PubHlth Discrete Distributions Page 15 of Poisson Distribution So far we have considered The description of 1 event occurrence for 1 person (Bernoulli) The description of X event occurrences in sample of N persons (Binomial) In stead of thinking in terms of persons, think instead about PERSON TIME: A familiar eample of the idea of person time is pack years smoking E.g. How shall we describe a small number of cancer deaths relative to a large accumulation of person time, such as 3 cancer deaths in 1000 pack years of smoking? Setting. It is no longer the analysis of events among N persons as a proportion. Instead, it is an analysis of events over person time as an incidence rate. T he Poisson Distribution can be appreciated as an etension of the Binomial Distribution. See Appendi C for details. The concept of N persons A large accumulation of person time The likelihood of an event eperienced by 1 person the likelihood of an event in 1 unit of person time. This will be quite small!

16 PubHlth Discrete Distributions Page 16 of 39 If X is distributed Poisson with mean μ, Poisson Distribution f X () = Probability [ X = ] = [ ] μ ep -μ! for = 0, 1, Epected value of X is E[X] = μ Variance of X is Var[X] =σ 2 = μ That s right mean and variance are the same. Where the mean μ is the parameter representing the product of the (incidence rate) (period of observation) The poisson distribution is an appropriate model for describing the frequency of occurrence of a rare event in a very large number of trials. Eample: Suppose lung cancer occurs at a rate of 2 per 1000 pack years. Using a Poisson distribution model, calculate the probability of eactly 3 cases of lung cancer in 3000 pack years Solution: Step 1 Solve for Poisson parameter μ μ = (incidence rate) (period of observation) = ( 2 per 1000 pack years) (3000 pack years) = (.002)*(3000) = 6 Step 2 Identify desired calculation Calculate the probability of eactly 3 cases of lung cancer is translated as follows For X distributed Poisson(μ=6), Probability [X=3] =???

17 PubHlth Discrete Distributions Page 17 of 39 Step 3 Solve, by hand or with use of calculator on the web. Probability [X=3] = 3 μ ep(-μ) 6 ep(-6) = =.0892! 3!

18 PubHlth Discrete Distributions Page 18 of 39 Eample continued. This same eample can be used to illustrate the correspondence between the Binomial and Poisson likelihoods. If lung cancer occurs at a rate of 2 per 1000 pack years, calculate the probability of eactly 3 cases of lung cancer in 3000 pack years using (a) a Binomial model, and (b) a Poisson model. (a) Binomial n = # trials π = Pr [ event ] (b) Poisson What happens as n and π 0? Epected # events nπ μ Pr [ X= events ] n π ( 1-π ) n- μ ep(-μ)! Where ep = numerical constant = e = Solution using the Binomial: Solution using the Poisson: # trials = n = 3000 Length of interval, T = 3000 π =.002 Rate per unit length, λ =.002 per 1 pack year μ = λt = (.002/pack year)(3000 pack years)=6 Pr[3 cases cancer] = F I Pr[3 cases cancer] ] [. 998] = HG 3 KJ[. 6 3 ep[ 6 ] 3! =.0891 =.0892

19 PubHlth Discrete Distributions Page 19 of 39 An eample of the Poisson sampling design is a cross-sectional study Disease Healthy Eposed a b a+b Not eposed c d c+d a+c b+d a+b+c+d a = # persons who are both eposed and with disease b = # persons who are both eposed and healthy c = # persons without eposure with disease d = # persons without eposure and healthy The counts a, b, c, and d are each free to vary and follow their own distribution. a, b, c, and d are 4 independent Poisson random variables. Let s denote the means of these μ 11, μ 12, μ 21, and μ 22. Likelihood[22 table] = L[a,b,c,d] = L NM a μ ep μ a! b b go Lμ ep μ QP NM b! c b go Lμ ep μ QP NM c! d b go Lμ epb μ QP NM d! go QP An association between eposure and disease means the following. 1) Pr[disease given eposure] Pr [disease given no eposure] μ11 μ 21 μ + μ μ + μ ) Pr[eposure given disease] Pr [eposure given no disease] μ11 μ12 μ + μ μ + μ

20 PubHlth Discrete Distributions Page 20 of 39 An eample of the Binomial sampling design is a cohort study Disease Healthy Eposed a b a+b fied by design Not eposed c d c+d fied by design a+c b+d a+b+c+d a = # eposed persons develop disease b = # eposed persons who do not develop disease c = # uneposed persons who develop disease d = # uneposed persons who do not develop disease The counts a and c are each free to vary and follow their own distribution. The counts b and d do not vary because of the fied row totals a and c are 2 independent Binomial random variables. Let s denote the means of these π 1 and π 2. Likelihood[22 table] = L[a,c given (a+b) and (c+d) are fied] = LF N MHG I KJ O Q P LF N MHG a+ b c+ d a b c d π ( 1 π ) π ( 1 π ) a c I KJ O Q P An association between eposure and disease means the following. Pr[disease among eposed] Pr [disease among non-eposed] π π 1 2

21 PubHlth Discrete Distributions Page 21 of 39 An eample of the Binomial sampling design is a case-control study Disease Healthy Eposed a B a+b Not eposed c D c+d a+c fied b+d fied a+b+c+d a = # persons with disease whose recall reveals eposure c = # persons with disease whose recall reveals no eposure b = # healthy persons whose recall reveals eposure d = # healthy persons whose recall reveals no eposure The counts a and b are each free to vary and follow their own distribution. The counts c and d do not vary because of the fied column totals a and b are 2 independent Binomial random variables. Let s denote the means of these θ 1 and θ 2. Likelihood[22 table] = L[a,b give (a+c) and (b+d) are fied] a + c b d θ (1 θ ) + θ (1 θ ) d a b a c b = An association between eposure and disease means the following. Pr[eposure among disease] Pr [eposure among healthy] θ θ 1 2

22 PubHlth Discrete Distributions Page 22 of Hypergeometric Distribution Sometimes, a 22 table analysis of an eposure-disease relationship has as its only focus the investigation of association. This will become more clear in units 4 and 5. In this setting, it turns out that the most appropriate probability model is the hypergeometric distribution model. The Hypergeometric Distribution is introduced in two ways: 1 - The Hypergeometric Distribution in Games of Poker 2 - The Hypergeometric Distribution in Analyses of 22 Tables 1 - The Hypergeometric Distribution in Games of Poker. Eample - In a five card hand, what is the probability of getting 2 queens? 2 queens is, in reality, 2 queens and 3 NON queens. 52 We know the total number of possible hands is 5 Because there are 4 queens in a full deck, the # selections of 2 queens from 4 is 4 2 Similarly, there are (52 4) = 48 non queens in a full deck. 48 Thus, the # selections of 3 non queens from 48 is 3 Putting these together, the number of hands that have 2 queens and 3 non queens is the product Pr[2 queens in five card hand] = (Number of five card hands that have 2 queens) (Total number of five card hands) = 52 5

23 PubHlth Discrete Distributions Page 23 of The Hypergeometric Distribution in Analyses of a 22 Table. Eample A biotech company has N=259 pregnant women in its employ. 23 of them work with video display terminals. Of the 259 pregnancies, 4 ended in spontaneous abortion. Assuming a central hypergeometric model, what is the probability that 2 of the 4 spontaneous abortions were among the 23 women who worked with video display terminals? Spontaneous Abortion Healthy Video Display Terminal Not In this eample, the number of women who worked with video display terminals (23) is fied. That is, the row totals are fied. Similarly, the total number of spontaneous abortions (4) is also fied. That is, the column totals are fied. And the total number of pregnancies (259) is fied. Thus, only one cell count can vary. Having chosen one, all others are known by subtraction. In this eample, we want to know if spontaneous abortion has occurred disproportionately often among the video display terminal workers? Is it reasonable that 2 of the 4 cases of spontaneous abortion occurred among the relatively small number (23) of women who worked with video display terminals? The chance model is the central hypergeometric model just introduced. Using the approach just learned is the following.

24 PubHlth Discrete Distributions Page 24 of 39 Eample At this biotech company, what is the probability that 2 of the 4 abortions occurred among the 23 who worked with video display terminals? 2 abortions among VDT workers is in, reality, 2 abortions among VDT workers and 2 abortions among NON-VDT workers. We know the total number of possible choices of 4 abortions from 259 pregnancies is As there are 23 VDT workers, the # selections of 2 from this group is 23 2 Similarly, as there are (259 23) = 236 non-vdt workers, the number of selections of from this group is 2 Thus, the probability of 2 of the 4 abortions occurring among VDT workers by chance is the hypergeometric probability = 259 4

25 PubHlth Discrete Distributions Page 25 of A Hypergeometric Probability Model for a 22 Table of Eposure-Disease Counts Now we consider the general setting of a 22 table analysis of eposure-disease count data. Case Control Eposed a b (a+b) fied Not eposed c d (c+d) fied (a+c) fied (b+d) fied n = a + b + c + d Here, the only interesting cell count is a = # persons with both eposure and disease Eample In this 22 table with fied total eposed, and fied total number of events, what is the probability of getting a events of (eposed and event)? a events of case among (a+b) eposed is in, reality, a events among (a+b) eposed AND c events among (c+d) NOT eposed. We know the total number of possible choices of (a+c) cases from (a+b+c+d) total is a+b+c+d a+c As there are (a+b) eposed, the # selections of a from this group is a+b a Similarly, as there are (c+d) NOT eposed, the number of selections of c from this c+d group is c Thus, the probability of a events of CASE occurring among (a+c) EXPOSED by chance is the hypergeometric probability a+ b c+ d a c = a+ b+ c+ d a+ c Note This is actually the central hypergeometric. More on this later.

26 PubHlth Discrete Distributions Page 26 of 39 Nice Result The central hypergeometric probability calculation for the 22 table is the same regardless of arrangement of rows and columns. Cohort Design Pr["a" cases among "(a+b)" eposed fied marginals] a+b c+d a c = a+b+c+d a+c Case-Control Design Pr["a" eposed among "(a+c)" cases fied marginals] a+c b+d a b = a+b+c+d a+b

27 PubHlth Discrete Distributions Page 27 of Fisher s Eact Test of Association in a 22 Table Now we can put this all together. We have a null hypothesis chance model for the probability of obtaining whatever count a we get in the upper left cell of the 22 table. It is the hypergeometric probability distribution model. This chance model (the central hypergeometric distribution) is the null hypothesis probability model that is used in the Fisher s eact test of no association in a 22 table. The odds ratio OR is a single parameter which describes the eposure disease association. Consider again the VDT eposure and occurrence of spontaneous abortion data: Disease Healthy Eposed Not eposed We re not interested in the row totals. Nor are we interested in the column totals. Thus, neither the Poisson nor the Bionomial likelihoods are appropriate models for our particular question. Rather, our interest is in the number of persons who have both traits (eposure=yes) and (disease=yes). Is the count of 2 with eposure and disease significantly larger than what might have been epected if there were NO association between eposure and disease? We re interested in this likelihood because it is the association, the odds ratio OR,that is of interest, not the row totals, nor the column totals. Thus, we will take advantage of the central hypergeometric model.

28 PubHlth Discrete Distributions Page 28 of 39 What is the null hypothesis likelihood of the 22 table if the row and column totals are fied? Recall the layout of our 22 table. With row and column totals fied, only one cell count can vary. Let this be the count a. Disease Healthy Eposed a b a+b Not eposed c d c+d a+c b+d a+b+c+d Null Hypothesis Conditional Probability of 22 Table (Conditional means: Row totals fied and column totals fied) When the OR =1, we have the central hypergeometric model just introduced. Probability [ a given row and column totals given OR=1 ] = F HG F HG IF KJ HG a+ c b+ d a b a+ b+ c+ d a+ b (optional for the interested reader): When the OR 1, the correct probability model for the count a is a different hypergeometric distribution; this latter is called a non-central hypergeometric distribution. Interestingly, the probability calculation involves the magnitude of the OR which is now a number different from 1. I KJ I KJ Probability [ a given row and column totals when OR 1] is the non-central hypergeometric probability = F HG min( a+ b, a+ c) u= ma( 0, a d) a+ c a a+ F HG u IFb KJ HG b cif KJ HG + d I KJ OR I KJ b+ d a+ b u OR Note You will not need to work with the non-central hypergeometric distribution in this course. a u

29 PubHlth Discrete Distributions Page 29 of 39 How to compute the p-value for this hypothesis test: The solution for the p-value will be the sum of the probabilities of each possible value of the count a that are as etreme or more etreme relative to a null hypothesis odds ratio OR = 1. Step 1: If the row and column totals are fied, what values of the count a are possible? Answer: Because the column total is 4, the only possible values of the count a are 0, 1, 2, 3, and 4 Step 2: What are the probabilities of the five possible 22 tables? Answer: We calculate the hypergeometric distribution likelihood for each of the 5 tables. While we re at it, we ll calculate the empirical odds ratio (OR) accompanying each possibility: a=0 Disease Healthy Eposed Pr(table)=.6875 Not eposed OR= a=1 Disease Healthy Eposed Pr(table)=.2715 Not eposed OR=

30 PubHlth Discrete Distributions Page 30 of 39 a=2 Note This is our observed Disea se Healthy Eposed Pr(table)=.0386 Not eposed OR= a=3 Disease Healthy Eposed Pr(table)=.0023 Not eposed OR= a=4 Disease Healthy Eposed Pr(table)=.0001 Not eposed OR=infinite Step 3: What is the p-value associated with the observed table? Answer: The observed table has a count of a =2. Recalling our understanding of the ideas of statistical hypothesis testing, the p-value associated with this table is the likelihood of this table or one that is more etreme. The more etreme tables are the ones with higher counts of a because these tables correspond to settings where the OR are as large or larger than the observed value of 11.1 Step 4: p-value = Pr [Table with a = 2] + Pr [Table with a = 3] + Pr [Table with a = 4] = =.0410 Interpretation of Fisher s Eact Test calculations. Under the null hypothesis of no association between eposure and disease, the chances of obtaining an OR=11.1 or greater, when the row and column totals are fied, are approimately 4 in 100.

31 PubHlth Discrete Distributions Page 31 of Discrete Distributions - Themes With Replacement Without Replacement Framework A proportion π are event A fied population of size =N contains a subset of size =M that are event Likelihood of outcome Sampling Sample of size=n with replacement Outcome # events = F HG n I K J π ( 1 π) n Eample Test of equality of proportion Sample of size=n without replacement # events = F HG M IF KJ HG F HG N n N n I KJ M I KJ Test of independence

32 PubHlth Discrete Distributions Page 32 of 39 Appendi A Mean (μ) and Variance (σ 2 ) of a Bernoulli Distribution Mean of Z = μ = π The mean of Z is represented as E[Z]. recall: E stands for epected value of E[Z] = π because the following is true: E[Z]= All possible z [z]probability[z=z] = [0] Pr [Z=0] + [1] Pr [Z=1] =[ 0] (1-π) + [1] (π) = π Variance of Z = σ 2 = (π)(1-π) The variance of Z is Var[Z] = E[ (Z (E[Z]) 2 ]. Var[Z] = π(1-π) because the following is true: Var[Z] = E [(Z-π) ] = 2 2 All possible z (z-π) Probability[Z=z] = 2 2 [(0- π) ]Pr[Z=0]+[(1- π) ]Pr[Z=1] 2 2 = [π ] (1-π) + [(1-π) ](π) = π (1-π) [ π + (1-π) ] = π (1-π)

33 PubHlth Discrete Distributions Page 33 of 39 Appendi B The Binomial(N,π) is the Sum of N Independent Bernoulli(π) The eample of tossing one coin one time is an eample of 1 Bernoulli trial. Suppose we call this random variable Z: Z is distributed Bernoulli (π) with Z=1 when the event occurs and Z=0 when it does not. Tip - The use of the 0 and 1 coding scheme, with 1 being the designation for event occurrence, is key, here. If we toss the same coin several times, say N times, we have N independent Bernoulli trials: Z 1, Z 2,., Z N are each distributed Bernoulli (π) and they re independent. Consider what happens if we add up the Z s. We re actually adding up 1 s and 0 s. The total of Z 1, Z 2,., Z N is thus the total number of 1 s. It is also the net number of events of success in N trials. Let s call this number of events of success a new random variable X. N Z i = X = # events in N trials i=1 This new random variable X is distributed Binomial. X represents the net number of successes in a set of independent Bernoulli trials. A simple eample is the outcome of several coin tosses (eg how many heads did I get? ). The word choice net is deliberate here, to remind ourselves that we re not interested in keeping track of the particular trials that yielded events of success, only the net number of trials that yielded event of success. E.g. - What is the probability that 2 of 6 graduate students are female? - What is the probability that of 100 infected persons, 4 will die within a year?

34 PubHlth Discrete Distributions Page 34 of 39 Steps in calculating the probability of N i=1 Z = X = i Step 1 Pick just one arrangements of events in N trials and calculate its probability The easiest is the ordered sequence consisting of () events followed by (N-) non events events N- non events Pr [ ordered sequence ] = Pr [( Z 1 =1), (Z 2 =1) (Z =1), (Z +1 =0), (Z +2 =0) (Z N =0) ] = π π π π (1- π) (1- π) (1- π) (1- π) = π (1- π) N- Note! Can you see that the result is the same product π the sequence the events occurred? How handy! (1- π) N- regardless of where in Step 2 Determine the number of qualifying ordered sequences that satisfy the requirement of having eactly events and (N-) non events. N Number of qualifying ordered sequences = = N!! (N-)! Step 3 The probability of getting X= events, without regard to sequencing, is thus the sum of the probabilities of each qualifying ordered sequence, all of which have the same probability that was obtained in step 1. Probability [N trials yields events] = (# qualifying sequence) (Pr[one sequence]) N N- N Pr [X = ] = Pr [ Z i = ] = π (1-π) i=1

35 PubHlth Discrete Distributions Page 35 of 39 Appendi C The Poisson Distribution is an Etension of Binomial The concept of N persons A large accumulation of person time The likelihood of an event eperienced by 1 person the likelihood of an event in 1 unit of person time. This will be quite small! The etension begins with a Binomial Distribution Situation. We begin by constructing a binomial likelihood situation. Let T = total accumulation of person time (e.g pack years) n = number of sub-intervals of T (eg 1000) T/n = length of 1 sub-interval of T (e.g.- 1 pack year) λ = event rate per unit length of person time What is our Binomial distribution probability parameter π? π = λ (T/n) because it is (rate)(length of 1 sub-interval) What is our Binomial distribution number of trials? n = number of sub-intervals of T We ll need 3 assumptions 1) The rate of events in each sub-interval is less than 1. Rate per sub-interval = Pr[1 event per sub-interval] 0 < π = (λ)(t/n) < 1 2) The chances of 2 or more events in a sub-interval is zero. 3) The subintervals are mutually independent.

36 PubHlth Discrete Distributions Page 36 of 39 Now we can describe event occurrence over the entire interval of length T with the Binomial. F HG I K J π Let X be the count of number of events. n Probability [ X = ] = (1-π) n- for =0,, n n T T λ λ n n = 1 n because π = ( λ)[ T/ n]. Some algebra (if you care to follow along) will get us to the poisson distribution probability formula. The algebra involves two things Letting n in the binomial distribution probability; and Recognizing that the epected number of events over the entire interval of length T is λt because λ is the per unit subinterval rate and T is the number of units. (analogy: for rate of heads =.50, number of coin tosses = 20, the epected number of head is [.50][20] = 10). This allows us, eventually, to make the substitution of λt = μ. n λt λt Pr[X=] = 1- n n N- n n λt λt λt = 1-1- n n n -

37 PubHlth Discrete Distributions Page 37 of 39 Work with each term on the right hand side one at a time: - 1 st term - n n! n(n-1)(n-2)...(n-+1)(n-)! n(n-1)(n-2)...(n-+1) = = =!(n-)!!(n-)!! As n infinity, the product of terms in the numerator (n)(n)(n) (n) = n. Thus, as n infinity n n! - 2 nd term - λt 1 1 = [λt] = [ μ] n n n Thus, λt 1 = [ μ] n n - 3 rd term - n λt μ 1- = 1- n n n What happens net is a bit of calculus. As n infinity Thus, as n infinity n λt μ 1- = 1- n n n e μ n μ 1- e n -μ where e = constant = 2.718

38 PubHlth Discrete Distributions Page 38 of 39-4 th term - Finally, as n the quotient (λt/n ) is increasingly like (0/n) so that - - λt [] 1 1 n n = Thus, as n infinity - λt 1- = 1 n Now put together the product of the 4 terms and what happens as n infinity n n λt λt λt Pr [X=] = 1-1- n n n - as n infinity n! [ ] 1 n μ { e μ } { 1 } μ e μ ep = =!! -μ -μ

39 PubHlth Discrete Distributions Page 39 of 39 If X is distributed Poisson, Poisson Distribution f X () = Probability [ X = ] = Epected value of X is E[X] = μ Variance of X is Var[X] =σ 2 = μ [ ] μ ep -μ! for = 0, 1, Thus, the poisson distribution is an appropriate model for describing the frequency of occurrence of a rare event in a very large number of trials.

Unit 3. Discrete Distributions

BIOSTATS 640 - Spring 2016 3. Discrete Distributions Page 1 of 52 Unit 3. Discrete Distributions Chance favors only those who know how to court her - Charles Nicolle In many research settings, the outcome