Basic Biostatistics International Graduate School of Genetic and Molecular Epidemiology GAME


Basic Biostatistics
International Graduate School of Genetic and Molecular Epidemiology (GAME)
Paul W. Dickman
Department of Medical Epidemiology and Biostatistics, Karolinska Institutet
September 2003

Course structure
Lectures
Classroom exercises
Computer labs

Textbook (Rosner, Fundamentals of Biostatistics, 5th edition) [1]
Relatively cheap
Appropriate level of mathematical detail
Suitable as a reference text for the future
At least some examples relevant to genetics and molecular biology
I will present material in approximately the same order as the text and provide references to relevant sections in the text for each topic.

Basic Biostatistics, GAME, September 2003

Exam Friday September 19
A review session will be held from 8:30 to 10:30 on Friday September 19, at the end of which each student will be assigned a small item from one of the lab sessions they have done during the previous two weeks, e.g., a piece of Stata output, a graph, or a table. Each student must then prepare to explain/interpret this in a 5 min presentation in the afternoon, with at most two slides or transparencies (which they can prepare between 10:30 and 13:00). The presentations will take place between 14:00 and 17:00, and each presentation will be followed by some questions from the lecturers, for no more than 5 mins, both on the material presented and on other course material (e.g. classroom exercises).
Each student will be assigned one of three grades: Fail, Pass, or Pass with Distinction.

What is statistics? Some textbook definitions
A standard textbook definition of statistics: Analysis and interpretation of data with a view toward objective evaluation of the reliability of the conclusions based on the data. (J. H. Zar, Introductory Biostatistics, Prentice Hall, 1996)
Biostatistics is the development and use of statistical methods to solve problems and answer questions that arise in human biology and medicine.

But what exactly is statistics?
The essence of statistics is understanding variation.
Understanding that all measurements are subject to random variation.
Being able to distinguish systematic variation from random variation.
Drawing conclusions in the presence of random variation.
To assist us in this task we often use mathematical functions (probability distributions) to describe how characteristics of interest vary in a population.

Consider the following statement:
The number of children who drowned in garden ponds has fallen, the Royal Society for the Prevention of Accidents reports. Eight children died in garden drowning accidents in 2000 compared with ten in [year missing]. The drop has been attributed to better safety awareness. [The Independent, 23 January 2002]

Is statistics a branch of mathematics?
The field of mathematical statistics could possibly be considered a branch of mathematics. We are not, however, here to study mathematical statistics.
Most individuals have an intuitive understanding of the concept of probability and the distinction between systematic and random variation. Statistics provides a formal framework for these concepts which enables us to, for example, formally establish that a new treatment is superior to an existing one or to solve complex multi-dimensional problems (e.g. quantify the risk of cardiovascular disease for an individual with specified age, weight, and lifestyle).
Much of the formal framework is mathematical so mathematics cannot be completely avoided. Our aim is to focus on the concepts without emphasizing the mathematics. The concepts can then be placed in a more rigorous mathematical framework at a later stage.

Is there a distinction between statistics and epidemiology?

What does statistics cover?
The following is the general sequence of steps in a research project:
Planning
Design
Execution (data collection)
Data processing
Data analysis
Presentation
Interpretation
Publication
Statisticians can contribute to every stage, although the major steps where statistical thinking is required are design, analysis, and interpretation. (The Zar definition misses the design part.)
In Chapter 1, Rosner describes a study in which he participated and discusses the role of statistics at each stage.

Aims for a first course in biostatistics
It is important to be able to understand statistics (mathematical statistics) and to be able to carry out standard statistical analyses (i.e. solve textbook statistical exercises). In the bigger picture, the challenge is to be able to place each applied problem in a statistical framework. This involves:
Selecting an analysis appropriate for the study design and type of data at hand.
Verifying that assumptions are not violated (e.g. normality, random sample, independence of observations).
Assessing whether confounding or bias may impact on the interpretation.
In general, computations can be carried out using a computer so knowing the mathematical formulae is not essential. It is crucial, however, to apply the correct method of analysis and to understand the limitations of the method (if any), the assumptions involved, and how to interpret the results.

Possible explanations for an observed association between an exposure and an outcome
Confounding. Randomisation all but eliminates this problem. In an observational study, good design and appropriate analysis minimise confounding.
Bias. For example, selection bias, misclassification of exposure, recall bias. Good planning and conduct of the study can minimise bias.
Chance. Hypothesis tests and confidence intervals guide us.
Causal effect of the factor under study.

Descriptive statistics (Rosner Chapter 2)
See the handout on descriptive statistics.

Some properties of the mean, variance, and standard deviation (Rosner 2.3 & 2.5)
Suppose we have a sample $x_1, \ldots, x_n$. Now suppose we add a constant value, $k$, to each observation to obtain the sample $y_1, \ldots, y_n$ such that $y_i = x_i + k$. (If $k$ is negative then this is equivalent to subtracting a constant value from each observation.)
Exercise: show that $\bar{y} = \bar{x} + k$.
If $s_x^2$ is the sample variance of $x_1, \ldots, x_n$ and $s_y^2$ is the sample variance of $y_1, \ldots, y_n$ then we can easily show that $s_x^2 = s_y^2$ (Equation 2.5, page 22).

Now suppose we multiply each observation $x_1, \ldots, x_n$ by a constant value, $k$, to obtain the sample $z_1, \ldots, z_n$ such that $z_i = k x_i$. We can easily show that $\bar{z} = k \bar{x}$ (Equation 2.2, page 17). We also see that $s_z^2 = k^2 s_x^2$ (Equation 2.6, page 23). That is, if we multiply each value in a sample by a constant $k$ then the sample variance is multiplied by $k^2$. However, the standard deviation is multiplied by $k$ (for $k > 0$). That is, $s_z = k s_x$.
The standard deviation is in the same units as the mean, which is why we usually work with the standard deviation rather than the variance.

Graphic methods (§2.8)
Bar graphs
Histograms
Stem-and-leaf plots
Box plots

Figure 1: Histogram (Rosner Figure 2.6)
Figure 2: Stem-and-leaf plot (Rosner Figure 2.7)
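
The shift and scale properties of the mean, variance, and standard deviation can be checked numerically. The sketch below uses Python's standard `statistics` module; the sample values and the constant `k` are hypothetical and chosen only for illustration.

```python
from statistics import mean, stdev, variance

# Hypothetical sample values, chosen only for illustration.
x = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]
k = 3.0

y = [xi + k for xi in x]  # add a constant to every observation
z = [k * xi for xi in x]  # multiply every observation by a constant

# Adding k shifts the mean by k and leaves the sample variance unchanged.
print("mean(y) vs mean(x) + k:", mean(y), mean(x) + k)
print("var(y)  vs var(x):     ", variance(y), variance(x))

# Multiplying by k scales the mean by k, the variance by k^2,
# and the standard deviation by k (k is positive here).
print("mean(z) vs k * mean(x):", mean(z), k * mean(x))
print("var(z)  vs k^2 * var(x):", variance(z), k**2 * variance(x))
print("sd(z)   vs k * sd(x):  ", stdev(z), k * stdev(x))
```

Note that `variance` and `stdev` use the n - 1 divisor, i.e. they are the sample versions discussed in the text.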

Presenting data visually
Use the technique most appropriate for displaying the information (which may be a table). For example, a graph is not required to display two numbers.
The Visual Display of Quantitative Information by Edward Tufte provides an excellent exposition of how, and how not, to present quantitative information visually.
Please resist the urge to present three dimensional figures (bar charts, histograms, pie charts, and the like).

Figure 3: Boxplot (from the Rosner study guide)

Probability (§3.2)
An event is the basic element to which probability can be applied. For example, we might consider the event that a 30 year old woman lives to see her 70th birthday or the event that a newborn baby is a boy. An event either occurs or it does not occur. In the study of probability, events are denoted by uppercase letters such as A, B, and C.
The probability of an event occurring can be defined as the proportion of times that event would occur if we repeated the experiment a large number of times under identical circumstances. (Definition 3.1, page 46)
For example, we estimate the probability that a newborn baby is a boy by observing the proportion of boys in a large number of births.

Basic rules for probability
The probability of an event $E$, denoted by $\Pr(E)$ or $P(E)$, always satisfies $0 \le \Pr(E) \le 1$.
Two events A and B that cannot occur simultaneously are said to be mutually exclusive. (Definition 3.1, page 47) For example, the events A (baby is a boy) and B (baby is a girl) are mutually exclusive.
For two mutually exclusive events, the probability of either occurring is the sum of the two individual probabilities.
For example, if the probability of being blood group A is 0.43 and the probability of being blood group B is 0.08 then the probability of being blood group A or B is 0.43 + 0.08 = 0.51. In mathematical language we write $P(A \cup B) = P(A) + P(B)$ where A and B are mutually exclusive events. $A \cup B$, the union of A and B, is the event "either A, or B, or both A and B".

Figure 4: Diagrammatic representation of $A \cup B$; A, B mutually exclusive

The complement event (Definition 3.6)
$\bar{A}$ is the event that A does not occur. It is called the complement of A. $\Pr(\bar{A}) = 1 - \Pr(A)$ since either A or $\bar{A}$ must occur (i.e. $\Pr(A) + \Pr(\bar{A}) = 1$).

Figure 5: Diagrammatic representation of $A \cup B$; A, B not mutually exclusive

$A \cap B$, the intersection of A and B, is the event "both A and B occur".

Figure 6: Diagrammatic representation of $A \cap B$

The multiplication law of probability (§3.4)
If two or more events are independent then the probability of each and every event occurring is the product of the individual probabilities. For example, if 3 unrelated patients are in a waiting room then the probability of them all being blood group A is $0.43 \times 0.43 \times 0.43 = 0.0795$. If the three patients are two parents and their child then this result does not apply since the events are not independent.
Independence is an essential concept in statistics. By independence we mean that if we know the outcome of one event then this tells us nothing about the other event(s).
If A and B are independent events then $P(A \cap B) = P(A) \times P(B)$. The events A and B are dependent if $P(A \cap B) \ne P(A) \times P(B)$.
Note that if A and B are mutually exclusive events then, by definition, $P(A \cap B) = 0$. Independence and mutual exclusivity are separate concepts!

Rosner Example 3.14 (page 51)

Rosner Example 3.15 (page 51)
What would it mean (conceptually) if the events $A^+$ and $B^+$ were independent?

The addition law of probability (§3.5)
For any two events, labelled A and B, which may or may not be mutually exclusive we have $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ where $P(A \cap B)$ is the probability that both A and B occur. See Figures 4 and 5.
If A and B are mutually exclusive events then $P(A \cap B) = 0$. For example, if the event A is "person is blood type A" and the event B is "person is blood type B" then $P(A \cap B) = 0$ so $P(A \cup B)$ is simply the sum of the individual probabilities.

Conditional probability (§3.6)
We are often interested in determining the probability an event A will occur given that we already know the outcome of another event B.
For example, we may wish to estimate the probability an individual will live to be age 90 given that he or she has already survived to age 85. The notation $P(A \mid B)$ is used to represent the probability that A occurs given that B has already occurred.
The conditional probability of A occurring given that B has already occurred is
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$
The probability of both A and B occurring is given by $P(A \cap B) = P(A \mid B) P(B)$. The probability of both A and B occurring can also be written as $P(A \cap B) = P(B \mid A) P(A)$.
Note that the rule $P(A \cap B) = P(A \mid B) P(B) = P(B \mid A) P(A)$ is applicable whether A and B are independent, dependent, or mutually exclusive.

If A and B are independent then $P(A \mid B) = P(A)$ and $P(B \mid A) = P(B)$. Therefore, for independent events, $P(A \cap B) = P(A) P(B)$.
Note that the terms independent and mutually exclusive do not mean the same thing. If A and B are independent then the probability of B is not affected by knowledge of whether or not A has occurred, that is $P(B \mid A) = P(B)$. However, if A and B are mutually exclusive then $P(B \mid A) = 0$.

Relative risk (Definition 3.10, page 55)
The relative risk (RR) is simply a ratio of conditional probabilities
$$RR = \frac{\Pr(D \mid E)}{\Pr(D \mid \bar{E})} \qquad (1)$$
where D is the event "diseased" and E is the event "exposed".
For example, penetrance is a conditional probability, P(diseased | genotype). In linkage analysis we set up a model for the joint probability of the genotype and phenotype which is a function of the penetrance, the marker genotyping error rate (another conditional probability), population allele frequencies, and transmission probabilities (which are functions of the recombination fraction).

Example: applying basic probability rules
Consider the following information:
P(A) = P(subject is male) = 0.5
P(B) = P(subject is taller than 1.75m) = 0.6
P(A ∩ B) = P(subject is a male who is taller than 1.75m) = 0.4
Are the two events independent? What is the probability of being male given that one is taller than 1.75m? What is the probability of being taller than 1.75m given that one is male?

Example: sexually transmitted disease (Example 3.20, page 55)
Using the data in Example 3.15 from Rosner, find the conditional probability that doctor B makes a positive diagnosis given that doctor A makes a positive diagnosis. What is the conditional probability that doctor B makes a positive diagnosis given that doctor A makes a negative diagnosis? What is the relative risk that doctor B makes a positive diagnosis given that doctor A makes a positive diagnosis?
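
The three questions in the male/height example can be worked through with the rules already given: independence requires P(A ∩ B) = P(A)P(B), and conditional probabilities follow from P(A | B) = P(A ∩ B)/P(B). The sketch below is one way to do the arithmetic in Python; the variable names are ours, not part of the course material.

```python
# Probabilities stated in the example.
p_male = 0.5            # P(A)
p_tall = 0.6            # P(B)
p_male_and_tall = 0.4   # P(A and B)

# Independence would require P(A and B) to equal P(A) * P(B) = 0.3.
product = p_male * p_tall
independent = abs(p_male_and_tall - product) < 1e-12

# Conditional probabilities from the definition P(A | B) = P(A and B) / P(B).
p_male_given_tall = p_male_and_tall / p_tall
p_tall_given_male = p_male_and_tall / p_male

print("P(A)P(B) =", product, "-> independent?", independent)
print("P(male | tall) =", round(p_male_given_tall, 3))
print("P(tall | male) =", round(p_tall_given_male, 3))
```

Since 0.4 differs from 0.5 × 0.6 = 0.3, the events are dependent: knowing that a subject is tall raises the probability that the subject is male from 0.5 to about 0.67.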

The birthday problem
Given a group of 25 people, what is the probability of two people in the group having the same birthday (i.e. celebrating their birthdays on the same day)?

Diagnostic tests
The concept of conditional probability is central to issues of diagnostic testing or screening. The following concepts should be familiar to you from your first course in epidemiology.
sensitivity = P(test positive | disease)
specificity = P(test negative | no disease)
false positive = P(test positive | no disease) = 1 − specificity
false negative = P(test negative | disease) = 1 − sensitivity
positive predictive value = P(disease | test positive)
negative predictive value = P(no disease | test negative)

Bayes rule (§3.7)
Imagine that we have a screening test for a disease (with known prevalence) where the sensitivity and specificity of the screening test are known. We wish to calculate the positive predictive value.
Let A be the event "have the disease" and B be the event "return a positive test". We noted earlier that $P(A \cap B) = P(A \mid B) P(B) = P(B \mid A) P(A)$. Bayes rule states that
$$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}.$$
The known sensitivity of the test is $P(B \mid A) = 0.99$. That is, the probability of a positive test among individuals with the disease is 0.99. We also know that the disease prevalence is 1 in 10,000. That is, $P(A) = 0.0001$.
$P(B)$, the probability of a positive test, is slightly more difficult to calculate. Since disease and no disease are mutually exclusive we sum the probabilities of a positive test in each of these two groups. We use the formula
$$P(B) = P(B \mid A) P(A) + P(B \mid \bar{A}) P(\bar{A})$$
where $\bar{A}$ means not A, that is, not having the disease.
Note that $P(\bar{B} \mid \bar{A})$ is the specificity of the test. That is, the probability of a negative test among individuals who do not have the disease.
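
The positive predictive value described here can be sketched as a small Python function. The function name and argument names are ours; the sensitivity (0.99) and prevalence (1 in 10,000) are the values stated in the text, and a specificity of 0.98 is assumed for the purpose of the illustration.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Return P(disease | positive test) using Bayes rule."""
    p_pos_given_disease = sensitivity            # P(B | A)
    p_pos_given_no_disease = 1.0 - specificity   # P(B | not A)
    # Total probability of a positive test, summed over the two groups.
    p_pos = (p_pos_given_disease * prevalence
             + p_pos_given_no_disease * (1.0 - prevalence))
    return p_pos_given_disease * prevalence / p_pos


# Specificity of 0.98 is an assumed value for this illustration.
ppv = positive_predictive_value(sensitivity=0.99, specificity=0.98,
                                prevalence=1 / 10_000)
print("positive predictive value:", round(ppv, 4))

# The same result as expected counts among 10,000 screened people.
n = 10_000
true_positives = n * (1 / 10_000) * 0.99
false_positives = n * (1 - 1 / 10_000) * 0.02
print("expected true positives: ", round(true_positives, 2))
print("expected false positives:", round(false_positives, 2))
```

The expected counts make the result easier to see: about one true positive is swamped by roughly 200 false positives, so only a small fraction of positive tests correspond to disease.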

Since $P(A) = 0.0001$ then $P(\bar{A}) = 0.9999$. If we assume that the specificity is 0.98 then $P(B \mid \bar{A}) = 0.02$ and we have
$$P(B) = P(B \mid A) P(A) + P(B \mid \bar{A}) P(\bar{A}) = 0.99 \times 0.0001 + 0.02 \times 0.9999 = 0.020097.$$
The quantity we want to calculate is $P(A \mid B)$, the probability of having the disease given that one has had a positive test. Using Bayes rule
$$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)} = \frac{0.99 \times 0.0001}{0.020097} = 0.0049.$$
We have a diagnostic test which gives the correct result 99% of the time when the patient has the disease and the correct answer 98% of the time when the patient does not have the disease. Nevertheless, among the patients who return a positive test, less than 1% of them have the disease!
If you think this is a mathematical trick then consider what we expect to occur with a group of 10,000 people randomly selected from the population and screened for disease using this test. We expect one person with the disease, who will almost certainly test positive, and about 200 false positives among the 9,999 people without the disease. This is one of the reasons that disease screening programs must be very carefully considered.
Rosner presents a similar example (Example 3.23, page 59).

Examples of probability calculations
In a certain country, the probability of becoming infected with HIV via a blood transfusion is 1%.
1. Assume that a person receives a blood transfusion on 30 occasions. What is the probability that the person becomes infected with HIV at some time by blood transfusion?
2. What is the corresponding probability for a person who has received a blood transfusion on 100 occasions?
3. Consider a test used to detect the presence of performance enhancing drugs in athletes. The test returns a false positive 1% of the time. If an athlete is tested, on average, 100 times per year would you be surprised if one test per year was positive?

1. Pr(infected at least once in 30 transfusions) = 1 − Pr(not infected in any of the 30 transfusions) = $1 - 0.99^{30} = 0.26$.
Therefore, the probability that the person becomes infected with HIV at some time by the blood transfusion is 0.26.
2. Pr(infected at least once in 100 transfusions) = 1 − Pr(not infected in any of the 100 transfusions) = $1 - 0.99^{100} = 0.63$. Therefore, the probability that the person becomes infected with HIV at some time by the blood transfusion is 0.63.
3. The probability of at least one positive test in a year is $1 - 0.99^{100} = 0.63$. Therefore it should not be a surprise if an athlete returns a single positive test.

Calculating probabilities: a further example
Assume that it is known that 30% of all doctoral students at KI wear glasses during lectures. Now assume that we randomly select 5 students in this class.
1. What is the probability that all 5 students are wearing glasses?
2. What is the probability that no students are wearing glasses?
3. What is the probability that at least 1 student is wearing glasses?
4. What is the probability that exactly 1 student is wearing glasses?
5. What is the probability that exactly 2 students are wearing glasses?

Random variables and probability distributions
Any characteristic that can be measured or categorised is called a variable. If a variable can assume a number of different values such that any particular outcome is determined by chance, it is a random variable. Random variables are typically represented by upper case letters such as X, Y, and Z.
A discrete random variable can assume only a finite or countable number of outcomes, e.g., marital status. A continuous random variable, such as weight or height, can take on any value within a specified range.
Every random variable has a corresponding probability distribution. In the discrete case, it specifies all possible outcomes of the random variable along with the probability that each will occur.
Rosner takes a slightly more theoretical approach: A random variable is a numeric function that assigns probabilities to different events in a sample space. (Definition 4.1)
If a random variable can take on a large number of values, a probability distribution may not be a useful way to summarise its behaviour. It is therefore common to talk about the mean and variance of a random variable.

A discrete probability distribution
Let the random variable X represent the number of individuals in the sample (i.e. the 5 selected students) who are wearing glasses. X must take one of the values 0, 1, 2, 3, 4, or 5. The probability distribution of X can be written in a table.

r    Pr(X = r)
0    0.16807
1    0.36015
2    0.30870
3    0.13230
4    0.02835
5    0.00243
Σ    1.00000
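
The "at least once" questions in the transfusion, drug-testing, and glasses examples all use the complement rule together with the multiplication law, which requires the individual trials to be independent. A minimal Python sketch follows; the function name is ours and independence between trials is an assumption of the calculation.

```python
def prob_at_least_once(p, n):
    """P(event occurs at least once in n independent trials), each with probability p."""
    # Complement rule: 1 - P(event occurs in none of the n trials).
    return 1.0 - (1.0 - p) ** n


# Transfusion example: 1% risk per transfusion.
print("30 transfusions: ", round(prob_at_least_once(0.01, 30), 2))
print("100 transfusions:", round(prob_at_least_once(0.01, 100), 2))

# Glasses example: 30% wear glasses, 5 students selected.
p_all_five = 0.3 ** 5                        # all 5 wear glasses
p_none = 0.7 ** 5                            # none wear glasses
p_at_least_one = prob_at_least_once(0.3, 5)  # at least 1 wears glasses
print("all 5:      ", round(p_all_five, 5))
print("none:       ", round(p_none, 5))
print("at least 1: ", round(p_at_least_one, 5))
```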

X is a discrete random variable; as opposed to a continuous random variable, it can only take a certain (countable) number of values. Note that $\Pr(X = 0) = 0.7^5 = 0.16807$ and that $\Pr(X = 5) = 0.3^5 = 0.00243$.
The probabilities sum to one, as they must for a valid probability distribution.
There is a general formula for calculating these probabilities (the binomial distribution). A mathematical formula is not, however, a requirement for a discrete probability distribution (see the next slide).
Common discrete probability distributions (all of which can be described by mathematical formulae) are the binomial, Poisson, and hypergeometric.

Hypothetical probability distribution for the colour of non stops (Swedish candy similar to M&M's)
The table lists a probability for each colour: red, green, orange, black, yellow, brown.
The concepts of mean and variance are not relevant for such a probability distribution.

The binomial distribution (§4.8)
Applicable when the outcome variable of interest is binary, that is, for each individual in the study, the outcome must assume one of two possible values. For example:
1. dead/alive in a toxicity test
2. contaminated/sterile in a sterility test
3. cancer/no cancer in a cancer trial
4. heads/tails on tossing a coin (the classic literature example)
5. pass/fail on an exam
The two possible values of the outcome are often called success and failure. I will call the two outcomes C and not-C (where C stands for characteristic).
Such data are sometimes called proportion data. We usually study more than one individual so our summarised results are often in terms of the proportion of items with C in the sample. For example:
1. 3/8 subjects died in a toxicity test
2. …/400 ampoules were found to be sterile in a sterility test
3. 12/348 subjects were diagnosed with cancer in a cancer trial
4. 11/20 tosses of a coin resulted in heads
5. 10/12 students pass an exam
For example, in the pharmaceutical industry, sterility tests are performed to ensure that products labelled sterile are free from living microbes. Depending on the product, it may be acceptable that up to 5% not be strictly sterile. As each shipment is produced, a random sample of the product is drawn and tested for sterility. Based on the number of contaminated items in the sample, we can draw inference about the proportion of contaminated items in the entire shipment.

The number of sample items in a study with the characteristic C can be described by a binomial distribution if the following conditions are satisfied:
1. a sample of size n is drawn from a theoretically infinite population containing two types of members (C and not-C);
2. the type being studied, C, exists in proportion p of the population;
3. every item in the study has the same probability, p, of being type C; and
4. each selection of a sample item is independent of all others.
Issues of interest are:
1. for given values of n and p, what is the distribution of the number of sample items with the characteristic C;
2. if a sample of size n contains r items with characteristic C, what does this tell us about the true value of p in the population.
Consider first the situation where we know the values of n and p. The glasses example fits into this framework.

Example of the binomial distribution: wearing glasses
Assume that it is known that 30% of all epidemiology doctoral students wear glasses during lectures, that is p = 0.3. Now assume that we randomly select 5 students in this class, that is n = 5.
Let the random variable X represent the number of individuals in the sample (i.e. the 5 selected students) who are wearing glasses.

r    Pr(X = r)
0    0.16807
1    0.36015
2    0.30870
3    0.13230
4    0.02835
5    0.00243
Σ    1.00000

We would like to have a general formula for calculating these probabilities.
We first note that $\Pr(X = 0) = 0.7^5 = 0.16807$ and that $\Pr(X = 5) = 0.3^5 = 0.00243$.
The probability that the first person is wearing glasses and the other 4 are not is given by $0.3 \times 0.7^4 = 0.07203$. This is not, however, the same as $\Pr(X = 1)$ (the probability that exactly one of the five is wearing glasses). We also need to consider, for example, the probability that the 2nd person is wearing glasses and the other four are not. There are actually 5 ways that we can have 1 person wearing glasses and 4 people not. Therefore, $\Pr(X = 1) = 5 \times 0.3 \times 0.7^4 = 0.36015$.
It seems that a general formula could take the form
$$\Pr(X = r) = k \, p^r (1 - p)^{n - r}$$
where $k$ is the number of ways in which we can obtain $r$ people with the characteristic and $n - r$ people without the characteristic.
Note that there is only one way in which we can have all people wearing glasses and one way in which we could have all people not wearing glasses.
How many ways can we have 2 people wearing glasses and 3 people not wearing glasses? It is possible to work out that there are 10 possible ways: the first person with each of the other 4, plus the second person with the third, fourth, and fifth person, plus the third person with the fourth and fifth person, plus the fourth and fifth person (4 + 3 + 2 + 1 = 10).
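
The counting argument can be checked by listing the possibilities explicitly. The sketch below numbers the five students 1 to 5 (our labelling) and uses `itertools.combinations` from the Python standard library to enumerate who could be the glasses wearers.

```python
from itertools import combinations

students = range(1, 6)  # the five selected students, labelled 1..5
p = 0.3                 # probability that a student wears glasses

# Every way of choosing which 1 student, or which 2 students, wear glasses.
ways_one = list(combinations(students, 1))
ways_two = list(combinations(students, 2))

print("ways with exactly 1 wearer: ", len(ways_one))
print("ways with exactly 2 wearers:", len(ways_two))
print("the pairs:", ways_two)

# Each specific arrangement with r wearers has probability p^r (1 - p)^(5 - r),
# so Pr(X = r) is that probability times the number of arrangements.
pr_x1 = len(ways_one) * p**1 * (1 - p)**4
pr_x2 = len(ways_two) * p**2 * (1 - p)**3
print("Pr(X = 1) =", round(pr_x1, 5))
print("Pr(X = 2) =", round(pr_x2, 5))
```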

We can therefore construct the probability distribution as follows:

r    number of ways    probability of each    Pr(X = r)
0    1                 0.16807                0.16807
1    5                 0.07203                0.36015
2    10                0.03087                0.30870
3    10                0.01323                0.13230
4    5                 0.00567                0.02835
5    1                 0.00243                0.00243

An arithmetic expression exists for the number of ways in which we can arrange r Cs and (n − r) non-Cs in a sample of size n.
We first have to define n-factorial, which exists for positive integers n, is written n!, and is equal to the product of all integers from n down to 1.
$$n! = n \times (n - 1) \times (n - 2) \times (n - 3) \times \cdots \times 1$$
0! is defined to be equal to 1.

Table 1: Values of n! for selected values of n.

The binomial coefficient (Definition 4.11, page 90)
The number of ways of obtaining r Cs and (n − r) non-Cs in a sample of size n is given by
$$\binom{n}{r} = \frac{n!}{r!(n - r)!}, \quad \text{for } r = 0, 1, 2, 3, \ldots, n.$$
$\binom{n}{r}$ is known as the binomial coefficient and is read "n choose r". It is the number of combinations of n things taken r at a time, where the order in which they are taken is unimportant.
For example, to win lotto the player must correctly select 7 out of 35 numbers, but the order in which the 7 numbers are drawn is not important. There are $\binom{35}{7}$ = 6,724,520 different 7-number combinations. If ordering were important, there would be 35 × 34 × 33 × 32 × 31 × 30 × 29 = 33,891,580,800 ordered selections. That is, there would be 7! times more possible outcomes and lotto would be almost impossible to win.
Ordering unimportant: 35!/(28! 7!)
Ordering important: 35!/28!

For example, the number of ways in which a total of one person can be wearing glasses in a group of 5 is given by
$$\binom{5}{1} = \frac{5!}{1!(5 - 1)!} = \frac{5 \times 4 \times 3 \times 2 \times 1}{1 \times (4 \times 3 \times 2 \times 1)} = 5.$$
The number of ways in which a total of two people can be wearing glasses in a group of 5 is given by
$$\binom{5}{2} = \frac{5!}{2!(5 - 2)!} = \frac{5 \times 4 \times 3 \times 2 \times 1}{(2 \times 1) \times (3 \times 2 \times 1)} = \frac{120}{12} = 10.$$
That is, in a group of 5 individuals we can identify 10 unique pairs of individuals.
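
The factorial and binomial coefficient calculations can be reproduced with Python's standard `math` module (`math.comb` and `math.perm` require Python 3.8 or later). This is only a check of the arithmetic in the lotto and glasses examples; the helper function name is ours.

```python
from math import comb, factorial, perm


def n_choose_r(n, r):
    """Binomial coefficient from the factorial definition n! / (r! (n - r)!)."""
    return factorial(n) // (factorial(r) * factorial(n - r))


# Glasses example: ways to pick 1 or 2 wearers out of 5.
print("5 choose 1:", n_choose_r(5, 1))
print("5 choose 2:", n_choose_r(5, 2))

# Lotto: 7 numbers out of 35.
unordered = comb(35, 7)   # order unimportant: 35! / (28! 7!)
ordered = perm(35, 7)     # order important:   35! / 28!
print("unordered selections:", unordered)
print("ordered selections:  ", ordered)
print("ratio (should equal 7!):", ordered // unordered, factorial(7))
```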
The binomial distribution where n = 10 and p = 0.5
We toss an unbiased coin 10 times and are interested in the distribution of X, the total number of heads obtained in the 10 tosses.

r        number of ways    probability of each    Pr(X = r)
0        1                 0.00098                0.00098
1        10                0.00098                0.00977
2        45                0.00098                0.04395
3        120               0.00098                0.11719
4        210               0.00098                0.20508
5        252               0.00098                0.24609
6        210               0.00098                0.20508
7        120               0.00098                0.11719
8        45                0.00098                0.04395
9        10                0.00098                0.00977
10       1                 0.00098                0.00098
Total    1024                                     1

The random variable X (number of heads in 10 tosses of a coin) can be described by a binomial distribution where n = 10 and p = 0.5.
Note that the probabilities sum to one, as they must for a valid probability distribution.
Note also that each of the 1024 different orderings has the same probability of occurring (1/1024 = 0.00098). This only occurs in the special case where p = 0.5.
The probability distribution is symmetric when p = 0.5 and becomes less symmetric as p moves further from 0.5.
The probability of tossing 8 or more heads in 10 tosses of an unbiased coin is 0.04395 + 0.00977 + 0.00098 = 0.0547. This is identical to the probability of tossing 2 or fewer heads since the distribution is symmetric.

General formulae for the binomial distribution
If X is a random variable described by a binomial distribution with parameters n and p then the probability distribution of X is given by
$$\Pr(X = r) = \frac{n!}{r!(n - r)!} \, p^r (1 - p)^{n - r}, \quad \text{for } r = 0, 1, 2, 3, \ldots, n.$$
$\Pr(X = r)$ is the probability of obtaining r Cs in a sample of size n where the proportion of Cs in the population is p.
If X is a random variable described by a binomial distribution with parameters n and p then the mean and variance of X are given by
$$E(X) = np, \qquad \mathrm{Var}(X) = np(1 - p).$$
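
The general formula translates directly into code. The sketch below (function name ours) evaluates the binomial probabilities for the coin example with n = 10 and p = 0.5, and compares the mean and variance computed from the distribution with np and np(1 − p).

```python
from math import comb


def binomial_pmf(r, n, p):
    """Pr(X = r) for a binomial random variable with parameters n and p."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)


n, p = 10, 0.5
pmf = [binomial_pmf(r, n, p) for r in range(n + 1)]

for r, pr in enumerate(pmf):
    print(f"r = {r:2d}  ways = {comb(n, r):3d}  Pr(X = r) = {pr:.5f}")

total = sum(pmf)                       # must equal 1
p_8_or_more = sum(pmf[8:])             # Pr(X >= 8)
p_2_or_fewer = sum(pmf[:3])            # Pr(X <= 2), equal by symmetry
mean = sum(r * pr for r, pr in enumerate(pmf))
var = sum((r - mean) ** 2 * pr for r, pr in enumerate(pmf))

print("sum of probabilities:", total)
print("Pr(X >= 8):", round(p_8_or_more, 4))
print("mean:", mean, " np =", n * p)
print("variance:", var, " np(1 - p) =", n * p * (1 - p))
```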

The binomial distribution where n = 5 and p = 1/4

Table 2: The probability distribution for a binomial random variable with n = 5 and p = 1/4.

r        number of ways    probability of each    Pr(X = r)
0        1                 0.23730                0.23730
1        5                 0.07910                0.39551
2        10                0.02637                0.26367
3        10                0.00879                0.08789
4        5                 0.00293                0.01465
5        1                 0.00098                0.00098
Total    32                                       1

Figure 7: The probability distribution for a binomial random variable with n = 5 and p = 1/4.

The expected value and variance of X are E(X) = 5/4 and Var(X) = 15/16.
Obviously, we cannot obtain 5/4 items with the characteristic C in a single trial. E(X) = 5/4 represents the average number of items with the characteristic C in a long series of trials.
The variance is greatest when p = 0.5.

Rosner Example 4.25 (page 93)

The binomial distribution is a discrete distribution; a discrete random variable must take one of a countable number of values.
The normal distribution, on the other hand, is a continuous distribution and is defined for all real numbers between −∞ and ∞.

The expected value (mean) (§4.4) and variance (§4.5) of a discrete random variable
For a random variable X (discrete or continuous), the analogue of the arithmetic mean, x̄, is referred to as the expected value and is denoted by E(X) or µ.
For a discrete random variable
$$E(X) = \mu = \sum_{i=1}^{R} x_i \Pr(X = x_i) \qquad \text{(Definition 4.5)}$$
where x_1, ..., x_R are all possible values of the random variable (i.e. values with non-zero probability).
Consider a game where we toss a coin and if the result is heads the player receives $2 and if the result is tails the player receives nothing. Let the random variable X represent the amount you receive following a single game (tossing the coin once). What is the probability distribution (probability mass function) of X? What is E(X)? How much would you be willing to pay to play this game?
The variance is
$$\mathrm{Var}(X) = E(X - \mu)^2 = \sum_{i=1}^{R} (x_i - \mu)^2 \Pr(X = x_i) \qquad \text{(Definition 4.6)}$$

Expected return for the game of sicbo
Sicbo is a casino game where bets are placed on the outcome of 3 dice. One type of bet is a single number bet. We may choose, for example, to bet on the number 5. If none of the three dice is a 5 we lose our money. If one die is a 5 our bet returns even money (we receive our original stake plus an equal amount in winnings), if two dice are 5s we win double our stake money (plus the original stake is returned), and if all three dice are 5s we win triple our stake.

Result        Odds    Return for $1
No die        0       0
One die       1-1     2
Two dice      2-1     3
Three dice    3-1     4

Note: Odds reported in a gambling context traditionally refer to the odds that the event does not occur (so-called odds against) whereas odds in epidemiology refer to the odds that the event occurs.
If we place a $1 bet, what is our expected return?
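
One way to work out the expected return is to combine the binomial probabilities for the number of dice showing the chosen number (n = 3, p = 1/6, assuming fair and independent dice) with the definitions of E(X) and Var(X). The sketch below uses exact fractions; the returns of 0, 2, 3, and 4 per 1 unit staked follow the description of the bet, and the variable names are ours.

```python
from fractions import Fraction
from math import comb

p = Fraction(1, 6)  # probability that a single fair die shows the chosen number

# Return per 1 unit staked, indexed by the number of dice showing the number.
returns = {0: 0, 1: 2, 2: 3, 3: 4}

# Binomial probabilities for 0, 1, 2, or 3 matching dice out of 3.
probs = {k: comb(3, k) * p**k * (1 - p) ** (3 - k) for k in range(4)}

expected = sum(returns[k] * probs[k] for k in probs)
variance = sum((returns[k] - expected) ** 2 * probs[k] for k in probs)

for k in probs:
    print(f"{k} dice: return {returns[k]}, probability {probs[k]} = {float(probs[k]):.4f}")
print("expected return:", expected, "=", round(float(expected), 4))
print("variance of the return:", round(float(variance), 4))
```

An expected return below 1 per unit staked means the bet is unfavourable to the player in the long run.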

Let X be a random variable representing the return to the player for a 1 unit investment. Recall that

\[ E(X) = \mu = \sum_{i=1}^{R} x_i \Pr(X = x_i). \]

First we need to calculate the probability of each of the four outcomes, which can be done using a binomial distribution with n = 3 and p = 1/6.

Result        x_i    Pr(X = x_i)          x_i Pr(X = x_i)
No die        0      125/216 = 0.5787     0
One die       2      75/216 = 0.3472      0.6944
Two dice      3      15/216 = 0.0694      0.2083
Three dice    4      1/216 = 0.0046       0.0185
Total                1                    0.9213

Is the game favourable to the player?

We could also calculate the variance of the expected return. How could knowledge of the variance be of interest to a gambler?

Some casinos offer odds of 6-1 or 12-1 for three dice the same.

Rosner Example 4.27 (page 93)

We observed an incidence of 0.15 compared to the expected incidence of 0.05. Is it possible that such a high incidence could occur due to chance, or does this provide evidence that children in households where both parents are chronic bronchitics are at higher risk?

We will get Stata to do the calculations using the bitesti command. The output lists N, Observed k, Expected k, Assumed p and Observed p, followed by the one-sided probabilities Pr(k >= 3) and Pr(k <= 3).

We see that observing 3 or more cases has a probability of about 0.075; it is unusual, but not extremely unusual.

That is, if the children in these 20 households had the same risk of developing chronic bronchitis as the national average (i.e. 5%) then there is a probability of about 0.075 that we would observe 3 or more cases simply due to chance. Usually we require this probability to be lower than 5% before we are prepared to say that it is not due to chance (statistically significant). (We will discuss the concepts of hypothesis tests and P-values later.)

If we actually performed the hypothesis test because we observed a high incidence then standard statistical tests are invalid; the hypothesis should be constructed before we collect the data.
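The one-sided probability in the bronchitis example can be reproduced without Stata. The sketch below sums the binomial probabilities for 3 or more cases among 20 children when each has a 5% risk; the function name `binomial_upper_tail` is an illustrative choice, not a Stata or library routine.

```python
from math import comb

def binomial_upper_tail(n, k, p):
    """Pr(X >= k) for X ~ binomial(n, p), by direct summation."""
    return sum(comb(n, r) * p**r * (1 - p)**(n - r) for r in range(k, n + 1))

# 20 children, national risk 5%, 3 or more cases observed.
p_tail = binomial_upper_tail(20, 3, 0.05)
print(round(p_tail, 4))
```

Because the result is above 0.05, the observed incidence would not usually be declared statistically significant on a one-sided test.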
Continuous Probability Distributions (Chapter 5)

A continuous random variable can take on any value in a specified range or interval.

Example: Let X be a random variable which represents the amount of serum triglycerides in the blood, measured in mg/dl.

The probability distribution of X is depicted by a smooth curve called a probability density function. The probability density function (pdf) indicates which values are more likely to occur than others.

For a truly continuous variable (for which there is an infinite number of possible values), the probability associated with any specific value is equal to 0. Instead of assigning probabilities to specific outcomes of the random variable X, as we did with discrete random variables, probabilities are assigned to ranges of values.

The area under the curve between any two points a and b is the probability that X falls between a and b. The total area under the probability density function is equal to 1.

The cumulative distribution function (Definition 5.2)

The cumulative distribution function of X is P(X ≤ a) = F(a). Its value is the area under the probability density function to the left of a.

Expected value and variance of a continuous random variable (page 120)

The expected value and variance of a continuous random variable X have the same meaning as they do for a discrete random variable.

- E(X), or µ, is the average value taken on by the variable X.
- Var(X), or σ², is the average squared distance of each possible value of X from µ.
- The standard deviation is the positive square root of the variance, σ = √Var(X).

There are no simple equations for E(X) and Var(X); their values must be obtained using calculus.

The Normal Distribution (§5.3)

The normal distribution is the most widely used distribution in statistics. It is also called the Gaussian distribution or the bell-shaped curve.

Many random variables, including blood pressure, weight, height, and serum cholesterol level, are approximately normally distributed. However, the distribution's real value will be seen in the areas of estimation and hypothesis testing. It is also used as an approximation to many other distributions.

The probability density function of a normal random variable X is given by

\[ f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[ -\frac{1}{2\sigma^2} (x - \mu)^2 \right] \]

where −∞ < x < ∞ and π is a constant. µ and σ are the parameters of the normal distribution; they completely define its shape. It so happens that µ = E(X) and σ² = Var(X).

We write N(µ, σ²) to denote a normal distribution with mean µ and variance σ².

Figure: Normal distribution with mean µ and standard deviation σ.

If values in a population follow a normal distribution then approximately 68% of all observations fall within one standard deviation of the mean and approximately 95% of the values fall within two standard deviations of the mean. I used this property earlier without motivating it.

Example: If we knew (or were willing to accept) that serum cholesterol in a certain population followed a normal distribution with mean 205 mg/dl and standard deviation 38 mg/dl then we would know the complete distribution of serum cholesterol values in the population. For example, we would know that approximately 68% of the population have values between 167 mg/dl and 243 mg/dl.

We might use the random variable X to represent serum cholesterol and write X ~ N(205, 38²). We could then calculate, for example, the proportion of the population with a value less than 220 mg/dl.
To find this probability we look up statistical tables (nowadays it is actually more common to use a computer). But there exist an infinite number of normal distributions and we cannot have tables for them all!

Probabilities are tabulated for what is called the standard normal distribution, which is the normal distribution with µ = 0 and σ = 1. It is common to use the letter Z to refer to a random variable which has a standard normal distribution (although Rosner does not follow this convention).

Approximately 68% of the area under the standard normal curve lies between −1 and +1, about 95% between −2 and +2, and about 99% between −2.5 and +2.5.
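The areas quoted above can be checked numerically. The sketch below codes the normal density from its formula and approximates the area under it with the trapezoid rule; the helper names and the number of steps are arbitrary illustrative choices. For the cholesterol question it uses `statistics.NormalDist` from the Python standard library in place of tables.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

def area(a, b, mu, sigma, steps=10_000):
    """Trapezoid-rule approximation to Pr(a < X < b)."""
    h = (b - a) / steps
    total = 0.5 * (normal_pdf(a, mu, sigma) + normal_pdf(b, mu, sigma))
    total += sum(normal_pdf(a + i * h, mu, sigma) for i in range(1, steps))
    return total * h

# Standard normal: area within 1, 2 and 2.5 standard deviations of the mean.
for k in (1, 2, 2.5):
    print(k, round(area(-k, k, 0, 1), 4))

# Serum cholesterol example: X ~ N(205, 38^2).
chol = NormalDist(mu=205, sigma=38)
print(round(area(167, 243, 205, 38), 4))   # within one SD of the mean
print(round(chol.cdf(220), 4))             # proportion below 220 mg/dl
```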

The cumulative distribution function for a standard normal curve is represented by

\[ \Phi(x) = P(X \le x) \quad \text{where } X \sim N(0, 1). \]

Using standard normal tables

- What is P(−1 ≤ X ≤ 1)? From column D in Table 3, P(−1 ≤ X ≤ 1) = 0.6827.
- What is P(X ≤ −2)? From column B, P(X ≥ 2) = 0.0228. Therefore, by symmetry, P(X ≤ −2) = 0.0228.
- For what value of b is it true that P(−b ≤ X ≤ b) = 0.95? From column D, b = 1.96.

The (100 × u)th percentile of the standard normal distribution is denoted by z_u:

\[ P(X < z_u) = u \]

What is the 80th percentile of the standard normal curve? We want the value z_.80 for which P(X < z_.80) = 0.80. From column A of Table 3 in Rosner, P(X < 0.84) = 0.7995 and P(X < 0.85) = 0.8023. Therefore, the 80th percentile is approximately 0.84.

It is common these days to use a computer program (e.g. Stata or Excel) rather than statistical tables.

. tablesq Z 1.96
. tablesq Z 2
. tablesqi Z 0.8

The first two commands report Pr(Z <= z), Pr(Z >= z) and Pr(|Z| >= z) for z = 1.96 and z = 2; the third reports the quantile z for which Pr(Z <= z) = 0.8.
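For readers without Stata, the same look-ups can be sketched with `statistics.NormalDist` from the Python standard library, whose `cdf` and `inv_cdf` methods play the roles of the table columns and the percentile look-up. The variable name `Z` is just a label for the standard normal.

```python
from statistics import NormalDist

Z = NormalDist()   # standard normal: mean 0, standard deviation 1

print(round(Z.cdf(1) - Z.cdf(-1), 4))   # P(-1 <= Z <= 1), a "column D" area
print(round(1 - Z.cdf(2), 4))           # P(Z >= 2), a "column B" area
print(round(Z.inv_cdf(0.975), 2))       # b such that P(-b <= Z <= b) = 0.95
print(round(Z.inv_cdf(0.80), 4))        # 80th percentile
```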

Conversion from an N(µ, σ²) distribution to an N(0, 1) distribution (§5.5)

If X ~ N(µ, σ²) and

\[ Z = \frac{X - \mu}{\sigma} \]

then Z ~ N(0, 1).

By transforming X into Z, the table of areas for the standard normal curve can be used to estimate probabilities associated with X. This procedure is known as standardisation of a normal variable.

Random variables have the following properties:

- E(X + c) = E(X) + c
- E(cX) = c E(X)
- Var(X + c) = Var(X)
- Var(cX) = c² Var(X)

Therefore,

\[ E(Z) = E\left(\frac{X - \mu}{\sigma}\right) = \frac{1}{\sigma} E(X - \mu) = \frac{1}{\sigma}\left[E(X) - \mu\right] = \frac{1}{\sigma}\left[\mu - \mu\right] = 0 \]

and

\[ \mathrm{Var}(Z) = \mathrm{Var}\left(\frac{X - \mu}{\sigma}\right) = \frac{1}{\sigma^2} \mathrm{Var}(X - \mu) = \frac{1}{\sigma^2} \mathrm{Var}(X) = \frac{1}{\sigma^2}(\sigma^2) = 1 \]

Example: The diastolic blood pressures of males in a specified age range are normally distributed with µ = 80 mm Hg and σ² = 144 mm Hg². Individuals with blood pressures above 95 mm Hg are considered to be hypertensive. What is the probability that a randomly selected male has a blood pressure above 95 mm Hg?

\[ P(X > 95) = P\left(Z > \frac{95 - 80}{12}\right) = P(Z > 1.25) = 0.1056 \]

Approximately 10.6% of this population would be classified as hypertensive.
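The standardisation step in the blood pressure example can be sketched as follows. The code computes the probability twice, once through Z and once directly from N(80, 12²), to show that the two routes agree; the variable names are illustrative.

```python
from statistics import NormalDist

mu = 80
sigma = 144 ** 0.5              # variance 144 mm Hg^2, so sigma = 12 mm Hg

# Route 1: standardise, then use the standard normal distribution.
z = (95 - mu) / sigma
p_standardised = 1 - NormalDist().cdf(z)

# Route 2: work with N(80, 12^2) directly.
p_direct = 1 - NormalDist(mu, sigma).cdf(95)

print(z, round(p_standardised, 4), round(p_direct, 4))
```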

What value of diastolic blood pressure cuts off the upper 5% of this population?

Using column B of the table of the standard normal distribution, the value Z = 1.645 cuts off an area of 0.05. We want the value of X which corresponds to Z = 1.645:

\[ Z = \frac{X - \mu}{\sigma}, \qquad 1.645 = \frac{X - 80}{12}, \qquad X = 99.7 \]

Approximately 5% of this population has a diastolic blood pressure greater than 99.7 mm Hg.

Distribution of scores on an intelligence test

The Encyclopædia Britannica Online entry for "intelligence test" contains the following text:

"On the IQ scale about 2 out of 3 scores fall between 85 and 115 and about 19 out of 20 scores fall between 70 and 130. A score of about 130 or above is considered gifted, while a score below about 70 is considered mentally deficient or retarded."

What statistical distribution do IQ scores have? What proportion of the population have an IQ of 140 or more?

The normal approximation to the binomial distribution (§5.7)

Recall that if a random variable X has a binomial distribution with parameters n and p then the mean and variance of X are given by

\[ E(X) = np, \qquad \mathrm{Var}(X) = np(1 - p) \]

If n is sufficiently large and p is not too close to 0 or 1 then X has an approximate normal distribution with the same mean and variance. As a general rule, the normal approximation is valid when both np and n(1 − p) are greater than 5.

Inference (i.e. hypothesis testing, confidence intervals for parameters) is much simpler when the normal approximation is appropriate.

Figure 8: The probability distribution for a binomial random variable with n = 5 and p = 1/4 with a superimposed normal distribution with µ = 5/4 and σ² = 15/16 (Pr(X = r) plotted against r). The normal approximation to the binomial is not appropriate.

For the case where n = 5 and p = 1/4 (Table 2) we would not expect the normal approximation to be good since np = 5/4 < 5.
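The rule of thumb above is easy to encode. The sketch below checks it for the n = 5, p = 1/4 case and, for contrast, for two n = 50 cases (p = 1/4 and p = 0.01); the function name and the `threshold` parameter are illustrative, and the cut-off of 5 is the one quoted in these notes rather than a universal constant.

```python
def normal_approx_ok(n, p, threshold=5):
    """Rule of thumb: both np and n(1 - p) should exceed the threshold."""
    return n * p > threshold and n * (1 - p) > threshold

for n, p in [(5, 0.25), (50, 0.25), (50, 0.01)]:
    mean = n * p
    var = n * p * (1 - p)
    print(n, p, mean, var, normal_approx_ok(n, p))
```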
Note that the normal distribution is always symmetric whereas the binomial distribution is not necessarily symmetric. In this example, the normal approximation assumes that there is a non-zero probability of obtaining, for example, −1 items with the characteristic C. We are using a continuous distribution to approximate a discrete distribution.

If we sample 50 items rather than 5 (i.e. X has a binomial distribution with n = 50 and p = 1/4):

Figure 9: The probability distribution for a binomial random variable with n = 50 and p = 1/4 with a superimposed normal distribution with µ = 50/4 and σ² = 150/16 (Pr(X = r) plotted against r). The normal approximation to the binomial is appropriate.

The normal approximation is problematic when p is close to 0 or 1 (unless n is very large) since the binomial distribution will be heavily skewed.

Figure 10: The probability distribution for a binomial random variable with n = 50 and p = 0.01 (Pr(X = r) plotted against r). The distribution is asymmetric and the normal approximation with µ = 0.5 and σ² = 0.495 is clearly inappropriate.

Continuity correction

Because the binomial distribution is discrete, it is specified by listing Pr(X = r) for r = 0, 1, ..., n. The normal distribution is continuous so we must approximate Pr(X = r) by the probability between r − 1/2 and r + 1/2 in N(np, np(1 − p)).
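The continuity correction can be sketched for the n = 50, p = 1/4 case shown in Figure 9. The code compares the exact binomial probability of a single value r with the normal area between r − 1/2 and r + 1/2; the function names and the particular values of r are illustrative choices.

```python
from math import comb, sqrt
from statistics import NormalDist

def exact_point(n, p, r):
    """Exact binomial probability Pr(X = r)."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

def approx_point(n, p, r):
    """Normal approximation to Pr(X = r) using the continuity correction."""
    approx = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))
    return approx.cdf(r + 0.5) - approx.cdf(r - 0.5)

n, p = 50, 0.25
for r in (10, 12, 15):
    print(r, round(exact_point(n, p, r), 4), round(approx_point(n, p, r), 4))
```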

Example

An exam consists of 50 multiple choice questions, each with 4 alternative answers. One student has not studied, so attempts to guess every single question.

1. What is the probability that the student answers exactly 12 questions correctly?

Let X be a random variable which represents the number of questions answered correctly. X has a binomial distribution with n = 50 and p = 0.25. The exact answer is

\[ \Pr(X = 12) = \binom{50}{12} (0.25)^{12} (0.75)^{38} = 0.1294 \]

Using the normal approximation, we assume X has a normal distribution with µ = 12.5 and σ² = 9.375 and calculate the probability that X is between 11.5 and 12.5:

\[ \Pr(11.5 < X < 12.5) = \Pr(X < 12.5) - \Pr(X < 11.5) \]
\[ = \Pr\left(Z < \frac{12.5 - 12.5}{\sqrt{9.375}}\right) - \Pr\left(Z < \frac{11.5 - 12.5}{\sqrt{9.375}}\right) \]
\[ = \Pr(Z < 0) - \Pr(Z < -0.326) \quad \text{where } Z \sim N(0, 1) \]
\[ = 0.500 - 0.372 = 0.128 \]

2. What is the probability that the student answers 15 or more questions correctly?

In Stata, tablesq B for B(50, .25) with k = 15 reports Pr(k == 15), Pr(k >= 15) and Pr(k <= 15).

The exact solution is Pr(X ≥ 15) = 0.252. Using the normal approximation, the solution is

\[ \Pr(X > 14.5) = \Pr\left(Z > \frac{14.5 - 12.5}{\sqrt{9.375}}\right) = \Pr(Z > 0.653) = 0.257 \quad \text{where } Z \sim N(0, 1) \]

Sampling from a population

- Population too large to study conveniently.
- Study a sample (or samples) and make inference about the population.
- With uncertainty!

Population

- Can be real (people (e.g. blood sample), procedures, ...) or conceptual (future).
- The more homogeneous it is, the easier it is to describe.
- Should be clearly defined (e.g. cholesterol level of Swedish males 20-74).
- When comparing, for example, exposed to unexposed we assume that there are two separate populations and we take a sample from each.

Sample

- Must be representative (random).
- The bigger the sample, the better the inference.
- Drawing a satisfactory sample can pose many problems.
- We will assume during this course that all samples are random samples from the reference population.
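Returning to the second question of the exam example, the sketch below compares the exact binomial tail for 15 or more correct answers with the continuity-corrected normal approximation, which starts the tail at 14.5. It is a stand-alone check using only the Python standard library; the variable names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 50, 0.25

# Exact: add up Pr(X = r) for r = 15, ..., 50.
exact_tail = sum(comb(n, r) * p**r * (1 - p)**(n - r) for r in range(15, n + 1))

# Normal approximation with continuity correction: Pr(X > 14.5).
approx = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))
approx_tail = 1 - approx.cdf(14.5)

print(round(exact_tail, 4), round(approx_tail, 4))
```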
Examples

- Estimating serum cholesterol in Swedish men aged 20-74.
- Estimating the proportion who experience a specific adverse event when taking a new drug.
- Survival of women diagnosed with breast cancer.

Estimation (Chapter 6)

So far we have explored the properties of different probability models (the binomial and the normal). In doing this, we have assumed that the specific probability distributions were known. That is, we have described what happens when we sample from a population where we know both the form of the statistical distribution (normal or binomial) and the values of the relevant parameters (p, n, µ, σ).

Now we focus on the process of drawing conclusions about an entire population based on the data in a sample; this is statistical inference. Estimation is one of the main areas of statistical inference; hypothesis testing is another.

Two methods of estimation are:

- Point estimation: calculating a single number to estimate the population parameter of interest
  - µ for a normal distribution (also σ²)
  - p for a binomial distribution
- Interval estimation: calculating a range of reasonable values for the parameter (i.e. a confidence interval)

Overview of the schedule

- Point estimation (mean & proportion) for a single sample
- Interval estimation (mean & proportion) for a single sample
- Hypothesis testing for a single sample (means and proportions)
- Inference for two-sample problems (means and proportions)

Slide labels: Estimation, Testing.

Estimating the mean of a distribution (§6.5)

Example: Checking the accuracy of a specific automatic pipette for measuring volume.

- Estimate the mean of n deliveries of 100 µl of water at 20 °C using an electronic balance which can weigh to 0.1 mg (1/1000 of the weight of 100 µl of water); the reference/target population is conceptual.
- Estimate µ based on the information contained in a sample drawn from the population. (Assuming pipettes are accurate and used correctly and the water is air-free and at 20 °C then, according to standard tables, deliveries should weigh 99.8 mg.)
- Deliveries should be independent (re-zero the balance, use new pipette tips for each delivery). (We will ignore the possible effects of training and fatigue.)

A natural approach is to use the sample mean to estimate the unknown population mean (µ); X̄ is an estimator of µ. For a random sample of size n, estimate µ by

\[ \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \]

For example, in a sample of n = 50 observations we might estimate the mean to be 99.3 mg. This is only one sample, representative of all possible samples. If we took another sample of size 100 and calculated the sample mean we would get a different value.

Example from Rosner: sampling birthweights

On page 163, Rosner provides the birthweights of 1000 consecutive deliveries at Boston City Hospital and examines several samples of size 10.

What properties of the sample mean make it a good estimator of the population mean?

The sample mean is an unbiased estimator; the average value over a large number of repeated samples will be the parameter of interest. Mathematically: an estimator θ̂ of a parameter θ is unbiased if E(θ̂) = θ.

Many unbiased estimators of µ exist (for a symmetric population such as the normal), including the sample median and the average of the smallest and largest values. The sample mean is preferred because it is the minimum variance unbiased estimator of µ (see the illustration on the next slide).

The concepts of bias and variance as related to estimators in mathematical statistics are defined mathematically and are quantifiable. They are conceptually similar to the concepts of bias, precision, and validity familiar to epidemiologists.

In the framework of mathematical statistics

- X is a random variable (it has a probability distribution).
- x is a realisation of X (a single observed value from the distribution).
- x_1, ..., x_n represent a sample of size n from the distribution (these are observed values).
- x̄ is the observed value of the sample mean (for a single sample).
- X̄ is a random variable (it has a probability distribution). It is an estimator of the population mean µ (which is an unknown constant).
- x̄ is a single realisation of X̄ (from a single sample). It is an estimate, not an estimator.
- X̄ is an unbiased estimator of the population mean; E(X̄) = µ.

Standard error of the mean (§6.5.2)

X̄ is an unbiased estimator of the population mean irrespective of sample size. So why do we prefer larger samples? A larger sample size improves the precision (decreases the variance).

X̄ is a random variable so has a corresponding probability distribution with a mean and a variance. We know that E(X̄) = E(X) = µ, but what is Var(X̄)? Because the observations are independent, the variance of their sum is the sum of their variances, so

\[ \mathrm{Var}(\bar{X}) = \mathrm{Var}\left(\frac{1}{n} \sum_{i=1}^{n} X_i\right) = \frac{1}{n^2} \mathrm{Var}\left(\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2} \sum_{i=1}^{n} \mathrm{Var}(X_i) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n} \]

The standard deviation of the sample mean is therefore σ/√n. This is called the standard error of the mean. In general the standard deviation of an estimator is called a standard error.

The central limit theorem (§6.5.3)

We have found that if we take a random sample of size n from a population with mean µ and variance σ² then the sample mean will have expected value µ and variance σ²/n. We have not made any assumption about the shape of the underlying distribution (only that it has mean µ and variance σ²).

If the underlying population is normally distributed with mean µ and variance σ² then the sample mean is also normally distributed with mean µ and variance σ²/n.
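The results E(X̄) = µ and Var(X̄) = σ²/n can be verified exactly on a small population by listing every possible sample. The sketch below uses a hypothetical population, the six faces of a fair die, which is not taken from these notes, and enumerates all ordered samples of size 3 drawn independently; the helper names are illustrative.

```python
from fractions import Fraction
from itertools import product

faces = [1, 2, 3, 4, 5, 6]   # hypothetical population: one fair die

def mean(values):
    """Arithmetic mean as an exact fraction."""
    return Fraction(sum(values), len(values))

def variance(values):
    """Population variance (divisor = number of values) as an exact fraction."""
    m = mean(values)
    return sum((Fraction(v) - m) ** 2 for v in values) / len(values)

mu, sigma2 = mean(faces), variance(faces)

n = 3
# Each ordered sample of size n is equally likely under independent draws.
sample_means = [mean(s) for s in product(faces, repeat=n)]

print(mean(sample_means), mu)                  # expected value of the sample mean vs mu
print(variance(sample_means), sigma2 / n)      # variance of the sample mean vs sigma^2 / n
```

The die is far from normally distributed, which illustrates that neither result depends on the shape of the population.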
But what is the distribution of the sample mean when the underlying population is not normal?

A result in mathematical statistics, called the central limit theorem, tells us that even if the underlying distribution from which we draw a sample is not normal, the distribution of the sample mean will be approximately normal (for sufficiently large samples).

What constitutes a sufficiently large sample depends on how non-normal the underlying data are; a sample size of around 30 is commonly taken to be sufficient, even for markedly non-normal distributions.

This theorem is extremely important because it enables us to perform statistical inference based on the normal distribution, even if the underlying population is not normally distributed.

For example, assume we draw 5 random numbers between the values 0 and 9 where each number is equally likely. The distribution we are drawing from is called a uniform distribution. If we do this 2000 times and calculate the mean each time then we might obtain the distribution of the 2000 means shown in the figure (taken from the text by Altman [2]).

Averages are more Normal than basic observations.
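The experiment described above is simple to simulate. The sketch below draws 5 digits from 0 to 9, records their mean, repeats this 2000 times and prints a crude text histogram; the seed is an arbitrary choice made only so that a run can be repeated, and the exact counts will differ from those in Altman's figure.

```python
import random
from statistics import mean, pstdev

random.seed(2003)   # arbitrary seed, used only for repeatability

def sample_mean(n=5):
    """Mean of n digits drawn uniformly from 0, 1, ..., 9."""
    return mean(random.randint(0, 9) for _ in range(n))

means = [sample_mean() for _ in range(2000)]

# Theory: mu = 4.5 and sigma^2 = 8.25 for one digit,
# so the sample mean has SD sqrt(8.25 / 5), about 1.28.
print(round(mean(means), 2), round(pstdev(means), 2))

# Crude histogram of the 2000 means in bins of width 1.
for lo in range(10):
    count = sum(lo <= m < lo + 1 for m in means)
    print(f"{lo}-{lo + 1}: {'#' * (count // 20)}")
```

Although single digits are uniformly distributed, the histogram of means should look roughly bell-shaped and centred near 4.5.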

Interval estimation for the mean (§6.5.4)

Let's return to the pipette example and imagine that from a sample of size n = 50 we find that x̄ = 99.3.

To supplement the point estimate for µ, we would like to construct a range of plausible values for µ. For now, assume that σ is known (σ = 1.75).

The central limit theorem states that X̄ ~ N(µ, σ²/n) if n is large. In other words,

\[ Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \]

is distributed as a standard normal random variable.

If Z follows the standard normal distribution, P(−1.96 ≤ Z ≤ 1.96) = 0.95. So, if µ and σ are the population mean and standard deviation,

\[ P\left(-1.96 \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le 1.96\right) = 0.95 \]

Rearranging for µ,

\[ P\left(\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95 \]

If we took repeated samples, the interval (x̄ − 1.96 σ/√n, x̄ + 1.96 σ/√n) would contain µ 95% of the time.

This is the 95% confidence interval (C.I.) for the mean µ. x̄ ± 1.96(σ/√n) are 95% confidence limits for µ.

The 100(1 − α)% C.I. for µ is

\[ \left(\bar{x} - z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}},\; \bar{x} + z_{1-\alpha/2}\frac{\sigma}{\sqrt{n}}\right) \]

where z_{1−α/2} is the value which cuts off the upper (α/2) × 100% of the standard normal curve.

The pipette example: Construct a 95% C.I. for the mean delivery of water of the pipette.

- For a sample of n = 50, x̄ = 99.3
- Assume σ = 1.75 (known)

We get

\[ \left(99.3 - 1.96\frac{1.75}{\sqrt{50}},\; 99.3 + 1.96\frac{1.75}{\sqrt{50}}\right) \quad \text{or} \quad (98.81, 99.79) \]

What does this interval really mean?

If 100 random samples are selected from the population with mean µ and variance σ², and a 95% C.I. is calculated from each, then approximately 95 of the intervals would contain µ and 5 would not.

It does not mean that µ is a random variable which assumes a value within the interval 95% of the time; µ is fixed, X̄ is random.

What would be a 99% C.I. for the mean?

For N(0, 1), 99% of the observations lie between −2.58 and 2.58. So the 99% C.I. for µ is

\[ \left(\bar{x} - 2.58\frac{\sigma}{\sqrt{n}},\; \bar{x} + 2.58\frac{\sigma}{\sqrt{n}}\right) \]

Substituting n = 50, x̄ = 99.3 and σ = 1.75 yields (98.66, 99.94). N.B. This is wider than the 95% C.I. (98.81, 99.79).

The 90% C.I. for µ is

\[ \left(\bar{x} - 1.645\frac{\sigma}{\sqrt{n}},\; \bar{x} + 1.645\frac{\sigma}{\sqrt{n}}\right) \quad \text{or} \quad (98.89, 99.71) \]

How could this 90% interval be made tighter?

- Increasing the sample size n decreases the standard error σ/√n.
- If n were 70, the 90% C.I. would be (98.96, 99.64).

How large a sample would be needed to construct a 90% C.I. with length 0.6?

Setting the interval length equal to 0.6,

\[ 2(1.645)\frac{1.75}{\sqrt{n}} = 0.6 \]

If we rearrange,

\[ \sqrt{n} = \frac{2(1.645)(1.75)}{0.6} = 9.596 \]

So n = (9.596)² = 92.1. A sample of size 93 would be needed.

Confidence interval for µ when σ is unknown

Up until now we have assumed that the standard deviation of the underlying population, σ, is known. If µ is unknown, σ is unlikely to be known. It is logical to estimate σ using the sample. The sample variance

\[ s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

seems a natural choice.
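The interval calculations for the pipette example (σ assumed known) can be sketched in a few lines. The functions `z_interval` and `n_for_length` are illustrative names; the multiplier is taken from `statistics.NormalDist` rather than from tables, so it is carried to more decimal places than 1.96, 2.58 or 1.645.

```python
from math import ceil, sqrt
from statistics import NormalDist

def z_interval(xbar, sigma, n, level=0.95):
    """Confidence interval for mu when sigma is known."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width

def n_for_length(sigma, length, level=0.90):
    """Smallest sample size giving an interval no longer than `length`."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return ceil((2 * z * sigma / length) ** 2)

# Pipette example: n = 50, sample mean 99.3, sigma = 1.75.
for level in (0.90, 0.95, 0.99):
    lo, hi = z_interval(99.3, 1.75, 50, level)
    print(level, round(lo, 2), round(hi, 2))

print(z_interval(99.3, 1.75, 70, 0.90))   # tighter 90% interval with n = 70
print(n_for_length(1.75, 0.6))            # sample size for a 90% interval of length 0.6
```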

For a random sample X_1, X_2, ..., X_n drawn from a population with mean µ and variance σ², S² is an unbiased estimator of σ². That is, E(S²) = σ². This is why we use (n − 1) rather than n as the divisor in the formula for the sample variance.

The central limit theorem states that

\[ \bar{X} \sim N(\mu, \sigma^2/n), \qquad \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1) \]

What about \( (\bar{X} - \mu)/(S/\sqrt{n}) \)? In addition to the sampling variability in the mean, there is also sampling variability in the variance.

Confidence interval for µ when σ is unknown: the t distribution

If the random sample X_1, X_2, ..., X_n is drawn from a normal population with mean µ then

\[ t = \frac{\bar{X} - \mu}{S/\sqrt{n}} \]

has a t distribution with n − 1 degrees of freedom.

- Like the normal distribution, the t distribution is symmetric.
- Compared with the standard normal, the t distribution tends to be flatter in the centre and to have thicker tails.
- The t distribution is actually a family of distributions indexed by the degrees of freedom (df). For each possible value of df, there is a different t distribution.
- The df measure the amount of information in the data that can be used to estimate σ². Given a sample of size n, one degree of freedom is lost by estimating the mean.
- As the df increase (and S becomes a more reliable estimator of σ), the t distribution approaches the normal distribution.

Summary so far

When σ is known (or n > 200 when σ is estimated by s), the 100(1 − α)% C.I. for µ is

\[ \left(\bar{x} - z_{1-\alpha/2}\left[\frac{s}{\sqrt{n}}\right],\; \bar{x} + z_{1-\alpha/2}\left[\frac{s}{\sqrt{n}}\right]\right) \]

with σ in place of s when σ is known.

When σ is estimated by s (and n ≤ 200), the 100(1 − α)% C.I. for µ is

\[ \left(\bar{x} - t_{n-1,\,1-\alpha/2}\left[\frac{s}{\sqrt{n}}\right],\; \bar{x} + t_{n-1,\,1-\alpha/2}\left[\frac{s}{\sqrt{n}}\right]\right) \]

Tables of areas are available for the t distribution (e.g. Table 5 in Rosner). Many authors claim that we can use the normal distribution provided n > 200; Rosner uses a more conservative rule.
Pipette example again: Suppose we estimate the mean weight of pipette delivery based on a sample of 14 independent measurements. The weights are normally distributed with unknown mean µ and unknown variance σ².

Results from the sample:

- x̄ = mean weight of deliveries = 99.3 mg
- s = 1.65 mg

Estimate the population mean µ using a 95% confidence interval. From tables, t_{13, .975} = 2.160, so the 95% C.I. for µ is

\[ \left(99.3 - 2.160\left[\frac{1.65}{\sqrt{14}}\right],\; 99.3 + 2.160\left[\frac{1.65}{\sqrt{14}}\right]\right) \quad \text{or} \quad (98.35, 100.25) \]

Confidence interval for a mean using Stata

. use "gage&pipette.dta"
. summarize pipette
. ci pipette
. ci pipette, level(99)

The summarize output lists Obs, Mean, Std. Dev., Min and Max for the variable pipette; each ci command lists Obs, Mean, Std. Err. and the 95% (or 99%) confidence interval.
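The t-based interval from the 14 pipette measurements can be sketched as below. Python's standard library has no t quantile function, so the multiplier 2.160 is passed in as the tabulated value quoted above; the function name `t_interval` is illustrative.

```python
from math import sqrt

def t_interval(xbar, s, n, t_crit):
    """CI for mu when sigma is estimated by s.

    t_crit is the tabulated value t_{n-1, 1-alpha/2}, supplied by the caller.
    """
    half_width = t_crit * s / sqrt(n)
    return xbar - half_width, xbar + half_width

# 14 measurements, mean 99.3 mg, s = 1.65 mg, t_{13, .975} = 2.160 from tables.
lo, hi = t_interval(99.3, 1.65, 14, t_crit=2.160)
print(round(lo, 2), round(hi, 2))
```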

If σ were known to be 1.65, the 95% C.I. would be

\[ \left(99.3 - 1.96\left[\frac{1.65}{\sqrt{14}}\right],\; 99.3 + 1.96\left[\frac{1.65}{\sqrt{14}}\right]\right) \quad \text{or} \quad (98.44, 100.16) \]

(for σ unknown it was (98.35, 100.25)). C.I.s based on the t distribution are wider than C.I.s based on the normal distribution.

Point and interval estimates for the binomial proportion p (§6.8)

We are often interested in estimating p, the proportion in the population with the characteristic C.

Recall that if X is a random variable described by a binomial distribution with parameters n and p then the mean and variance of X are given by

\[ E(X) = np, \qquad \mathrm{Var}(X) = np(1 - p) \]

A rule in mathematical statistics is that if X is a random variable and k a constant then E(kX) = k E(X) and Var(kX) = k² Var(X). Therefore, E(X/n) = p and Var(X/n) = p(1 − p)/n.

A point estimate for p based on a sample is p̂ = r/n where r is the number of Cs in a sample of size n. The variance is estimated by p̂(1 − p̂)/n since we do not know p.

A 95% confidence interval for p, assuming a normal approximation, is given by

\[ \left(\hat{p} - 1.96\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}},\; \hat{p} + 1.96\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}\right) \]

Consider a clinical trial of a new drug where we wish to estimate the proportion of individuals in the population who would experience a specific side effect if they took the drug. Of 200 individuals in the study, 15 experience the side effect.

The point estimate of p, the proportion of individuals in the population who would experience side effects if they took the drug, is p̂ = 15/200 = 0.075.

The standard error is given by

\[ \sqrt{\frac{0.075(1 - 0.075)}{200}} = 0.0186 \]

An approximate 95% CI for p is therefore 0.075 ± 1.96 × 0.0186 = (0.0385, 0.112).

This confidence interval is approximate because it is based on the normal approximation to the binomial. Exact (i.e. based on the binomial distribution) confidence intervals can also be constructed (Rosner 6.8.3).
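Both kinds of interval for the side-effect example can be sketched in plain Python. The approximate interval follows the formula above. The exact interval is found here by bisection on the binomial tail probabilities, taking each tail to be 0.025 for a 95% interval; this is one common construction and is an assumption about what "exact" means, not necessarily the algorithm used by any particular package. All function names are illustrative.

```python
from math import comb, sqrt

def wald_interval(r, n, z=1.96):
    """Approximate CI for p based on the normal approximation."""
    p_hat = r / n
    se = sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se

def binom_cdf(k, n, p):
    """Pr(X <= k) for X ~ binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def solve(f, target, increasing, lo=0.0, hi=1.0):
    """Bisection for the p in (lo, hi) with f(p) = target, f monotone in p."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if (f(mid) < target) == increasing:
            lo = mid      # root lies above mid
        else:
            hi = mid      # root lies below mid
    return (lo + hi) / 2

def exact_interval(r, n, alpha=0.05):
    """Exact CI: each limit puts probability alpha/2 in one binomial tail."""
    # Lower limit: Pr(X >= r | p) = alpha/2, which increases with p.
    lower = solve(lambda p: 1 - binom_cdf(r - 1, n, p), alpha / 2, increasing=True)
    # Upper limit: Pr(X <= r | p) = alpha/2, which decreases with p.
    upper = solve(lambda p: binom_cdf(r, n, p), alpha / 2, increasing=False)
    return lower, upper

print(wald_interval(15, 200))
print(exact_interval(15, 200))
```

The exact interval is not symmetric about 0.075, reflecting the skewness of the binomial distribution when p is small.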
For an exact 95% confidence interval, the lower limit is the value of p such that Pr(X ≥ 15) = 0.025 and the upper limit is the value of p such that Pr(X ≤ 15) = 0.025. In this example, the exact confidence interval is (0.0426, 0.121).

By default, Stata generally provides exact, rather than approximate, confidence intervals for binomial proportions. For example, the cii command reports, under the heading "Binomial Exact", Obs, Mean, Std. Err. and the 95% confidence interval.

With the advent of fast computers and efficient algorithms, exact tests are now available much more often than 10 years ago. Exact tests and confidence intervals should be used when they are available. For small studies, exact tests must be used (StatXact and LogXact provide exact tests for many study designs).

Hypothesis testing: one-sample inference (Chapter 7)

So far we have discussed methods for point and interval estimation of parameters of interest. However, we may have preconceived ideas about what these parameters might be and wish to test whether the data conform with these ideas.

Example: hemoglobin levels in children under age 6

- In the general population µ = 12.29 g/100 ml, σ = 0.85 g/100 ml.
- In a sample of children exposed to high levels of lead, x̄ = 12.0 g/100 ml.
- Is this compatible with the hypothesised value of 12.29? What if x̄ is 11.0? Or 10.0?

This type of question is formulated in a hypothesis testing framework by specifying two hypotheses, a null and an alternative hypothesis, and comparing the relative probabilities of obtaining the sample data under these two competing hypotheses.

We assume that we have a random sample of children from the population of all children exposed to lead. We will assume (for the moment) that this population has mean µ (which is unknown) and standard deviation σ = 0.85 (assumed known).

We will now construct two competing hypotheses for the value of the population mean µ.
The null hypothesis is that there is no difference in the mean hemoglobin value in the population of exposed children and the general population:

H_0: µ = 12.29 g/100 ml

The alternative hypothesis, which contradicts H_0, is

H_A: µ ≠ 12.29 g/100 ml

We know that, even if the null hypothesis is true, we do not expect our sample mean (12.0 g/100 ml) to be exactly equal to the population mean (12.29 g/100 ml). The question is whether the observed difference is consistent with what might be expected due to chance, or whether the observed difference is so large that it provides evidence against the null hypothesis.

We now continue under the assumption that the null hypothesis is true. That is, we assume our sample of children exposed to lead is a random sample from a population with µ = 12.29 g/100 ml and σ = 0.85 g/100 ml. If this is the case, then the sample mean should be normally distributed (by the central limit theorem) with mean 12.29 g/100 ml and standard deviation σ/√n = 0.85/√74 g/100 ml.

We could now examine whether a value of 12.0 is consistent with this distribution.

Since we don't have tables for this distribution, we standardise to the standard normal distribution. Under the null hypothesis we know that

\[ Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \]

will have a standard normal distribution. The observed value of z for this sample is

\[ z = \frac{12.0 - 12.29}{0.85/\sqrt{74}} = -2.93 \]

Is such a value consistent with the standard normal distribution? We quantify the answer to this question by calculating how often we would observe a value as extreme as, or more extreme than, this value.

From tables we see that the probability of observing a value less than or equal to −2.93 is 0.0017. We also need to consider that, by chance, we could have observed a sample mean as extreme in the other direction (i.e. a sample mean greater than expected).

Therefore, if the null hypothesis were true, the probability of observing values as extreme as, or more extreme than, what we observed is 2 × 0.0017 = 0.0034. That is, the probability of observing such a difference due to chance is less than 1%.

A more likely explanation is that the sample has been drawn from a population with a mean other than 12.29, that is, the null hypothesis is false.

How do we determine what level of variation is reasonable due to chance? A convention is that if the probability of observing such an extreme value due to chance is less than 5% then we conclude that the observed difference is not due to chance.

The statistic, z, that we calculated to be −2.93 is called a test statistic. The probability of obtaining a test statistic as extreme as, or more extreme than, the actual test statistic obtained (assuming that the null hypothesis is true) is called the p-value.
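The test statistic and two-sided p-value for the lead example can be sketched as follows, assuming as in the text that σ = 0.85 is known and that n = 74. The function name `z_test` is illustrative. Because the code does not round z to −2.93 before looking up the probability, its p-value differs slightly in the fourth decimal place from the hand calculation.

```python
from math import sqrt
from statistics import NormalDist

def z_test(xbar, mu0, sigma, n):
    """Test statistic and two-sided p-value for H0: mu = mu0 (sigma known)."""
    z = (xbar - mu0) / (sigma / sqrt(n))
    p_two_sided = 2 * NormalDist().cdf(-abs(z))
    return z, p_two_sided

z, p = z_test(12.0, 12.29, 0.85, 74)
print(round(z, 2), round(p, 4))

# The "what if" sample means mentioned earlier in the example.
for xbar in (11.0, 10.0):
    print(xbar, z_test(xbar, 12.29, 0.85, 74))
```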
The p-value in the above example is 0.0034. The 5% rule corresponds to examining whether the p-value is less than 0.05. Rather than calculating the actual p-value, it has been convention in the past to compare the calculated test statistic to a critical value of the test statistic, that is, the test statistic corresponding to a p-value of 0.05. It is now simple to calculate the actual p-value and the critical value approach should be avoided.

In practice, we generally use statistical software to perform hypothesis tests and most software packages report the actual p-value. The following Stata output is for a t-test (which we will look at next) and the p-value is therefore slightly different.

. ttesti

[Stata output: one-sample t test for the lead example, showing Obs, Mean, Std. Err., Std. Dev. and the 95% confidence interval, followed by t and P for Ha: mean <, Ha: mean != and Ha: mean >; arguments and numeric values lost in transcription]

P<0.05 is, by convention, considered to be evidence of a statistically significant difference. Statistical significance is not, however, a black and white issue. A P-value of, for example, 0.07 still indicates weak evidence of a difference. If you need to report the result of a statistical hypothesis test, report the exact P-value and let the reader make their own judgement. Do not, for example, report NS (for not significant) if the P-value is 0.06, or P<0.05 when the actual P-value is considerably smaller than 0.05.

The relationship between hypothesis testing and confidence intervals (§7.7)

Suppose we test $H_0: \mu = \mu_0$ versus $H_A: \mu \neq \mu_0$.

$H_0$ is rejected at the 5% significance level if, and only if, the 95% confidence interval for $\mu$ does not contain $\mu_0$.

We fail to reject $H_0$ at the 5% significance level if, and only if, the 95% confidence interval for $\mu$ does contain $\mu_0$.
In the lead example, we rejected the null hypothesis that $\mu = 12.29$ so we do not expect the 95% CI to contain the value 12.29. A 95% CI for the mean hemoglobin value in the population of children exposed to lead is

$\left(\bar{x} - z_{1-\alpha/2}\left[\frac{s}{\sqrt{n}}\right],\ \bar{x} + z_{1-\alpha/2}\left[\frac{s}{\sqrt{n}}\right]\right) = \left(12.0 - 1.96\left[\frac{0.85}{\sqrt{74}}\right],\ 12.0 + 1.96\left[\frac{0.85}{\sqrt{74}}\right]\right) = (11.81, 12.19)$

The 95% CI does not contain the value 12.29.
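The calculations for the lead example can be checked with a few lines of code. The following is a minimal sketch in Python (the course itself uses Stata; Python and the variable names here are my own choice for illustration), using only the numbers quoted above. Because the code does not round z to two decimals before looking up the tail area, the p-value may differ from the table-based 0.0034 in the fourth decimal place.

```python
from math import sqrt
from statistics import NormalDist

# Values quoted in the lead example.
n = 74          # sample size
xbar = 12.0     # sample mean (g/100 ml)
mu0 = 12.29     # mean under the null hypothesis (g/100 ml)
sigma = 0.85    # standard deviation, assumed known (g/100 ml)

std_normal = NormalDist()      # standard normal: mean 0, sd 1
se = sigma / sqrt(n)           # standard error of the sample mean
z = (xbar - mu0) / se          # test statistic

# Two-sided p-value: a value at least this extreme in either tail.
p_two_sided = 2 * std_normal.cdf(-abs(z))

# 95% confidence interval based on the normal distribution.
z_crit = std_normal.inv_cdf(0.975)
ci_low = xbar - z_crit * se
ci_high = xbar + z_crit * se

print(f"z = {z:.2f}, two-sided p = {p_two_sided:.4f}")
print(f"95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

The interval excludes 12.29 exactly when the two-sided p-value is below 0.05, which is the relationship between tests and confidence intervals described above.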

Hypothesis testing when σ is unknown

σ is unknown in most practical applications. We estimate $\sigma^2$ by

$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$

As previously mentioned, if the random sample $x_1, x_2, \ldots, x_n$ is drawn from a normal population with mean $\mu$ then

$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$

has a t distribution with $n - 1$ degrees of freedom. The test statistic is now a t-statistic rather than a z-statistic. For a two-sided test, $H_0: \mu = \mu_0$ would be rejected if

$t < t_{n-1,\alpha/2}$ or $t > t_{n-1,1-\alpha/2}$

Pipette example again: Suppose we estimate the mean weight of pipette delivery based on a sample of 14 independent measurements. The weights are normally distributed with unknown mean $\mu$ and unknown variance $\sigma^2$.

Results from the sample of n = 14 deliveries:
- $\bar{x}$ = mean weight of deliveries = 99.3 mg
- s = 1.65 mg

An accurate pipette would deliver 99.8 mg of water. Is the tested pipette accurate, or is there systematic error?

$H_0: \mu = \mu_0 = 99.8$ mg
$H_A: \mu \neq 99.8$ mg

Test at the α = 0.05 level of significance:

$t = \frac{99.3 - 99.8}{1.65/\sqrt{14}} = -1.13$

Since $t_{13,0.025} = -2.160$ and $-2.160 < -1.13 < 2.160$, we conclude that there is no evidence to reject the null hypothesis that $\mu = 99.8$ mg. Recall that the 95% confidence interval was (98.35, 100.25).

Stata commands for the one-sample t-test (n = 14, $\bar{x}$ = 99.3, s = 1.65, $\mu_0$ = 99.8)

. ttesti 14 99.3 1.65 99.8, level(95)

[Stata output: one-sample t test with 13 degrees of freedom, Ho: mean(x) = 99.8, showing Obs, Mean, Std. Err., Std. Dev. and the 95% confidence interval, followed by t and P for Ha: mean < 99.8, Ha: mean != 99.8 and Ha: mean > 99.8; numeric values lost in transcription]

The lead example in Stata

. ttesti

[Stata output: one-sample t test for the lead example; arguments and numeric values lost in transcription]

The test statistic is the same as when we assumed σ was known, but the reference distribution is now the t rather than the normal and the p-value is therefore slightly higher (the t distribution has more weight in the tails).
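The arithmetic of the pipette example can be reproduced as follows. This is a sketch in Python rather than Stata, using only the summary statistics and the tabulated critical value 2.160 quoted above; it does not compute a p-value, since that would need the t distribution, which is not in Python's standard library.

```python
from math import sqrt

# Summary statistics from the pipette example.
n = 14          # number of deliveries
xbar = 99.3     # sample mean (mg)
s = 1.65        # sample standard deviation (mg)
mu0 = 99.8      # mean under the null hypothesis (mg)

# Magnitude of the two-sided 5% critical value for 13 df, from tables.
t_crit = 2.160

se = s / sqrt(n)            # estimated standard error of the mean
t = (xbar - mu0) / se       # one-sample t statistic

# Two-sided test at the 5% level: reject only if |t| exceeds the critical value.
reject_h0 = abs(t) > t_crit

# Matching 95% confidence interval for the mean.
ci_low = xbar - t_crit * se
ci_high = xbar + t_crit * se

print(f"t = {t:.2f}, reject H0: {reject_h0}")
print(f"95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

The interval contains 99.8, which agrees with the decision not to reject the null hypothesis.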
The 95% CI is also slightly wider when we assume σ is unknown (c.f. (11.81, 12.19) when we assumed σ was known).

Hypothesis testing: an overview of the approach

We are generally interested in showing that a difference exists. The approach is
1. Define a null hypothesis of no difference, e.g., $\mu_1 = \mu_0$ or $p_1 = p_0$.
2. Define a test statistic which has a known statistical distribution under the null hypothesis (i.e. when the null hypothesis is true).
3. Calculate the value of the test statistic based on the sample.
4. Assess whether the value of the test statistic is consistent with what we would expect for a value drawn from the known distribution under the null.
5. If the test statistic is in the upper or lower tails of the distribution we conclude that the test statistic must have come from some other distribution; we conclude that the null hypothesis must therefore be false.
6. If the test statistic is consistent with what we would expect then we fail to reject the null hypothesis.

Failing to reject the null hypothesis does not mean that the null hypothesis is true; it means there was not conclusive evidence that it was false. In other words, a finding of no evidence of a difference is not the same as evidence of no difference.

Sometimes our aim is simply to establish that one treatment is as good as another. For example, a new treatment for a disease may have fewer side effects, but we must first demonstrate that it is as good as the existing treatment(s) before it can be approved for sale. Proving non-inferiority or equivalence is a much more difficult issue and will not be covered here.

One-sided or two-sided test?

The choice between a one-sided and a two-sided test can be controversial. A one-sided test can sometimes achieve significance where a two-sided test does not. Some journals refuse to publish studies that rely on one-sided tests.
The decision about which type of test to use should always be made before the data are collected. A two-sided test is the more conservative choice. Just because we believe an exposure should produce an effect in a certain direction is not sufficient grounds to perform a one-sided test.

P-values and their interpretation

The P value is the probability of having observed our data (or more extreme data) when the null hypothesis is true (Altman [2, §8.5.1, p. 167]). For example, in a clinical trial our data essentially refers to the observed difference between the treatment groups (the null hypothesis is that the true effects are the same in each treatment group).

If the P value is large, say greater than 0.2, then the data we observed are consistent with what would occur often when the null hypothesis is true, and we fail to reject the null hypothesis. Conversely, if the P value is very small, then the null hypothesis appears implausible because the data we observed could hardly ever occur by chance when the null hypothesis is true, and we reject the null hypothesis.

It is common to use an arbitrary cutoff of 0.05 and call any result with a P value less than 0.05 statistically significant (sometimes written P<0.05 even when the P value is 0.001) while other results are not significant (sometimes written as P>0.05 even when the P value is 0.53). This approach is discouraged. If one has to use a significance test it is preferable to quote the P value so that the reader may make his or her own interpretation of the strength of the evidence against the null hypothesis.

A crucial proviso, not mentioned by Altman, is that this interpretation assumes there are no sources of bias in the data collection or analysis processes. A very small P-value, for example, does not necessarily mean that there is an association between exposure and disease (the observed difference may be due to selection bias) and definitely does not mean that the exposure caused the disease.

Clinical significance versus statistical significance

Clinical significance and statistical significance are very different concepts.
A study compared blood pressure measurements in the left and right arms and found a difference of around 1 mm Hg in both the systolic and diastolic blood pressure (Altman §5.4). The difference was highly statistically significant but of no clinical importance. On the other hand, many clinically important differences may not be statistically significant due to low power (small sample sizes).

Hypothesis testing: testing a sample proportion against a fixed constant (§7.9)

Assume that a new drug will only be approved for sale if we can demonstrate that we expect that no more than 10% of individuals who take the drug will experience side effects. That is, we require p < 0.1. In our sample of 200 individuals, 15 experienced side effects. Based on our sample, we wish to test the hypothesis that the true proportion of individuals who would experience side effects if they took the drug is less than 0.1.

We test the null hypothesis, $H_0: p = 0.1$, against the alternative hypothesis $H_1: p < 0.1$ in a one-sided test. Under the null hypothesis, we would expect 20 individuals in our trial to experience side effects. Only 15 actually experienced side effects, which is promising. However, we don't know whether we have observed a lower number of side effects simply due to chance (with p not less than 0.1) or because the true value of p is less than 0.1.

We proceed by calculating the probability that we observed a lower number of side effects than expected simply due to chance. If the null hypothesis is true, the probability that 15 or fewer individuals in the sample experience side effects is given by

$\sum_{r=0}^{15} \Pr(X = r)$ where $X \sim \mathrm{Bin}(200, 0.1)$

If this probability is less than 0.05 we would conclude that there is evidence that p is less than 0.1. The probability is actually 0.143, which is consistent with the null hypothesis.
That is, we fail to reject the null hypothesis so the drug will not be approved (at least not on the basis of this study).

We can calculate the probability in Stata via the StataQuest Calculator / Statistical tables menu item.

. tablesq B

[Stata output for B(200, 0.1) at k = 15, giving Pr(k == 15), Pr(k >= 15) and Pr(k <= 15); numeric values lost in transcription]

This is an exact solution, based on the binomial distribution, for which we were required to evaluate and sum each of the 16 probabilities for r = 0 up to 15.

The normal approximation is appropriate. Under $H_0$, X has an approximate normal distribution with $\mu = 20$ and $\sigma^2 = 18$. Based on the normal approximation (and using a continuity correction), the probability that 15 or fewer individuals in the study experience the side effect, assuming the null hypothesis is true, is

$\Pr(X \leq 15.5) = \Pr\left(Z \leq \frac{15.5 - 20}{\sqrt{18}}\right) = \Pr(Z \leq -1.06) \approx 0.14$, where $Z \sim N(0, 1)$

The P-value for the test is approximately 0.14 (c.f. 0.143 for the exact method), meaning we fail to reject $H_0$. That is, the drug will not be approved.

You can perform this test in Stata using the Statistics / Summaries, tables,... / Classical tests of hypotheses / Binomial probability test calculator menu. By default, Stata gives the exact test.

. bintesti

[Stata output: exact binomial test, showing N, Observed k, Expected k, Assumed p and Observed p, followed by Pr(k >= 15) (one-sided test), Pr(k <= 15) (one-sided test) and Pr(k <= 15 or k >= 25) (two-sided test), and the exact binomial 95% confidence interval; arguments and numeric values lost in transcription]
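The exact calculation and the normal approximation can be compared directly. The sketch below is written in Python rather than Stata and uses only the numbers from the drug example; the helper function `binom_cdf` is my own name, not a library routine. It sums the 16 binomial probabilities for r = 0 to 15 and then applies the continuity-corrected normal approximation.

```python
from math import comb, sqrt
from statistics import NormalDist

def binom_cdf(k, n, p):
    """Pr(X <= k) for X ~ Bin(n, p), summing the individual probabilities."""
    return sum(comb(n, r) * p**r * (1 - p)**(n - r) for r in range(k + 1))

n = 200         # individuals in the study
p0 = 0.1        # proportion under the null hypothesis
observed = 15   # individuals who experienced side effects

# Exact one-sided p-value: Pr(X <= 15) under H0.
p_exact = binom_cdf(observed, n, p0)

# Normal approximation with a continuity correction.
mean = n * p0                   # expected count under H0 (20)
variance = n * p0 * (1 - p0)    # variance under H0 (18)
z = (observed + 0.5 - mean) / sqrt(variance)
p_approx = NormalDist().cdf(z)

print(f"exact p = {p_exact:.3f}")
print(f"z = {z:.2f}, approximate p = {p_approx:.3f}")
```

Both probabilities are well above 0.05, so either method leads to the same conclusion.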

If we use a normal approximation:

. bintesti , normal

[Stata output: normal-approximation test of Ho: p = .1, showing Obs, Proportion and Std. Error, the z statistic and its two-sided P-value, and 95% CI = (0.0385, 0.1115); arguments and remaining numeric values lost in transcription]

The P-value provided (0.2386) is for a two-sided test, whereas a one-sided test is appropriate for our study.

If possible, it is preferable to report confidence intervals for a parameter of interest rather than P-values from hypothesis tests. In the above example, reporting that the exact 95% confidence interval for p is (0.0426, 0.121) conveys the information that the null hypothesis that p = 0.1 is not rejected at the 5% (two-sided) level (since the CI contains 0.1) and, in addition, provides a range of plausible values for p.

Confidence intervals are almost always two-sided. Our hypothesis test is one-sided. If we want to base a one-sided hypothesis test at the 5% significance level on whether or not a confidence interval contains the null value, we must construct a 90% confidence interval.

Power of a statistical test (§7.5)

             Reject H0           Fail to reject H0
H0 true      Type I error (α)    ok
H0 false     ok                  Type II error (β)

Table 3: The two types of error in a statistical hypothesis test.

Recall that there are two errors we can make in a statistical hypothesis test.
1. Type I error: reject the null hypothesis when it is true. The probability of a type I error is usually fixed at α = 0.05 by convention.
2. Type II error: fail to reject the null hypothesis when it is false. The probability of a type II error is denoted by β.

Ideally, α and β should be as small as possible. The power of a study is given by (1 − β). For a fixed α, power depends on
- Sample size; the bigger the better.
- The size of the effect we wish to detect; bigger effects are easier to detect.
- Level of random variation; systematic effects are easier to detect when the level of random variation is lower.
The level of random variation is a function of baseline disease prevalence when the outcome is a proportion or the variance when the outcome is a mean.

In the previous example, the point estimate ($\hat{p} = 0.075$) was within the range where the drug would be accepted for use, yet we failed to reject the null hypothesis that p = 0.1, meaning the drug could not be accepted. It is quite possible that the true proportion experiencing side effects is less than 0.1 (our best estimate was $\hat{p} = 0.075$) but our study did not have sufficient power to demonstrate this conclusively. That is, we could not rule out the possibility that the low proportion of side effects was due to chance.

If the study were larger then we would be more certain about our point estimate (the variance of the estimator would be smaller). If the same proportion of individuals in the study experienced the side effect, but the study was 5 times larger, that is, if 75 out of 1000 individuals experienced side effects, we would then have sufficient evidence that the true proportion is less than 0.1.

. bintesti

[Stata output: exact binomial test of Ho: proportion = .1, showing N, Observed k, Expected k, Assumed p and Observed p, followed by Pr(k >= 75) (one-sided test), Pr(k <= 75) (one-sided test) and Pr(k <= 75 or k >= 127) (two-sided test), and the exact binomial 95% confidence interval; arguments and numeric values lost in transcription]

The power of a study is the probability of rejecting the null hypothesis when it is false. It is important to estimate power for a range of possible scenarios when designing a study. A good study should have power of at least 80%.

It is a waste of time, and unethical when live subjects are involved, to perform a study which has low power. If the study has low power then the result will almost certainly be that we fail to reject the null hypothesis, irrespective of whether it is true or false.
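The effect of sample size on the drug example can be sketched numerically. The code below is Python rather than Stata, and the helper functions are my own names. It first computes the exact one-sided p-values for 15/200 and 75/1000, then the exact power of the one-sided 5% test for both study sizes. The power calculation needs a value for the true proportion; I assume 0.075 purely for illustration (it equals the observed proportion, and is not a figure given in the course material).

```python
from math import comb

def binom_cdf(k, n, p):
    """Pr(X <= k) for X ~ Bin(n, p)."""
    return sum(comb(n, r) * p**r * (1 - p)**(n - r) for r in range(k + 1))

def critical_value(n, p0, alpha=0.05):
    """Largest k with Pr(X <= k) <= alpha under H0 (-1 if there is none)."""
    k = -1
    while binom_cdf(k + 1, n, p0) <= alpha:
        k += 1
    return k

def power(n, p0, p_true, alpha=0.05):
    """Probability of rejecting H0: p = p0 in favour of p < p0 when p = p_true."""
    return binom_cdf(critical_value(n, p0, alpha), n, p_true)

p0 = 0.1              # proportion under the null hypothesis
p_true = 0.075        # ASSUMED true proportion, for illustration only

# One-sided exact p-values for the same observed proportion, two study sizes.
p_value_200 = binom_cdf(15, 200, p0)
p_value_1000 = binom_cdf(75, 1000, p0)

# Power of the one-sided 5% test for each study size.
power_200 = power(200, p0, p_true)
power_1000 = power(1000, p0, p_true)

print(f"n = 200:  p-value = {p_value_200:.3f}, power = {power_200:.2f}")
print(f"n = 1000: p-value = {p_value_1000:.4f}, power = {power_1000:.2f}")
```

Under this assumption the smaller study has low power, whereas the study that is five times larger has a much better chance of detecting the difference.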
The take-home message is to consult a statistician while designing a study.

[Figure: from Purcell et al. (2003), Bioinformatics]

Power and sample size calculation

To calculate power, we need to provide details of
- Number of subjects in the study and how they are allocated between exposed and unexposed.
- The size of the effect we wish to detect; bigger effects are easier to detect.
- Level of random variation; systematic effects are easier to detect when the level of random variation is lower. The level of random variation is a function of baseline disease prevalence and the exposure prevalence (or allele frequency) when the outcome is a proportion or the variance when the outcome is a mean.

DSTPLAN (ftp://odin.mdacc.tmc.edu/pub/win32/dstplan_4.2_se.exe) is a popular freeware program for sample size and power calculation in epidemiology and biostatistics. Commercial packages also exist. Stata has functions for sample size and power calculation.

QUANTO (free to download) computes sample size or power for association studies of genes, gene-environment interaction, or gene-gene interaction [3].

Multiple comparisons

Hypothesis testing: two-sample inference (Chapter 8)

We have previously discussed one-sample tests. The underlying parameters of the population from which a sample was drawn were compared with comparable values from other populations whose parameters were assumed to be known.

More often we want to compare the parameters of two different populations, neither of whose values is assumed known. This type of problem is commonly referred to as two-sample inference: we obtain samples from two populations.

Two-sample designs can be classified as independent-sample or paired-sample. For example, we may measure blood pressure among a group of women using oral contraceptives and make comparisons with a group of women who have never used oral contraceptives. This is an independent-sample design: two completely different groups of women are being compared.
Alternatively, we may measure the BP of a group of women who have never used oral contraceptives and then rescreen the women 1 year later to identify those who have become OC users; these women become the study population. We then measure BP at the follow-up visit and compare BP at the time they were not using the pill to the time when they were using the pill. This is a paired-sample design because each data point in the first sample is related to a unique data point in the second sample; the two groups of BP measurements are not independent.

What do you think of this design as a means of studying the effect of OC use on BP?

Independent vs. paired samples

Paired samples: Each observation in the first group has a corresponding observation in the second group (corresponding observations are typically not independent!).

Independent samples: Observations in each of the two groups are not related to each other.

Example of a paired-sample design

A study was conducted to examine the effectiveness of aspirin in reducing temperature among children suffering from influenza. Twelve 5-year-olds with the flu had their temperatures taken immediately before being given aspirin and then again one hour later. Is there a statistically significant reduction in temperature?

We could plot the data as follows:

[Figure: plot of the temperatures before and after aspirin]

Set up the analysis:

$x_{i1}$ = temperature of child i before aspirin
$x_{i2}$ = temperature of child i after aspirin
$d_i = x_{i2} - x_{i1}$ = differences

Assume the $d_i$ are normally distributed with mean $\Delta$:
- before aspirin: temperature of child i is normally distributed with mean $\mu_i$
- after aspirin: temperature of child i is normally distributed with mean $\mu_i + \Delta$

If $\Delta = 0$ then temperature has not been reduced. Use a one-sample t-test, treating $d_i$ as a single measurement: the paired t-test.

. use influenza
. list

[Stata output: listing of before, after and difference for the 12 children; values lost in transcription]

. summarize

[Stata output: Obs, Mean, Std. Dev., Min and Max for before, after and difference; values lost in transcription]

Paired t-test

$H_0: \Delta = 0$, no temperature reduction
$H_A: \Delta \neq 0$

Test of $H_0$ versus $H_A$ at the α = 0.05 level.

For the 12 children with the flu, the sample mean difference in temperatures and the sample standard deviation (both in °F; the numeric values were lost in transcription) are

$\bar{d} = \frac{\sum_{i=1}^{n} d_i}{n}$ and $s_d = \sqrt{\frac{\sum_{i=1}^{n}(d_i - \bar{d})^2}{n - 1}}$

The test statistic is

$t = \frac{\bar{d} - 0}{s_d/\sqrt{n}}$ with n = 12

Under $H_0$, this is the outcome of a t random variable with 12 − 1 = 11 df. For $t_{11}$, the area under the curve to the left of the observed value is < 0.0005. Therefore, p < 2(0.0005) = 0.001, so we reject $H_0$, concluding that aspirin was effective in reducing temperature.

A 95% confidence interval for $\Delta$

As for a single measurement,

$\left(\bar{d} - t_{n-1,1-\alpha/2}\left(\frac{s_d}{\sqrt{n}}\right),\ \bar{d} + t_{n-1,1-\alpha/2}\left(\frac{s_d}{\sqrt{n}}\right)\right)$

In this case $t_{11,.975} = 2.201$, so the 95% confidence interval for $\Delta$ is

$\left(\bar{d} - 2.201\left(\frac{s_d}{\sqrt{12}}\right),\ \bar{d} + 2.201\left(\frac{s_d}{\sqrt{12}}\right)\right)$

which gives (−2.235, −1.115), which does not contain 0. We would report that the difference is highly statistically significant (p < 0.001).

Paired t-test in Stata comparing mean(after - before)

. use influenza
. ttest after=before

[Stata output: paired t test of Ho: mean(after - before) = mean(diff) = 0, showing Obs, Mean, Std. Err., Std. Dev. and the 95% confidence interval for after, before and diff, followed by t and P for Ha: mean(diff) < 0, Ha: mean(diff) != 0 and Ha: mean(diff) > 0; numeric values lost in transcription]

What if we compare mean(before - after)?

. ttest before=after

[Stata output: paired t test of Ho: mean(before - after) = mean(diff) = 0, in the same layout; numeric values lost in transcription]

Paired samples through matching

Paired samples can also arise through matching subjects in one group with subjects in the other group.

Example: Gynecology, the effect of different contraceptive methods on fertility. How long does it take for users of oral contraceptives and diaphragms to become pregnant after stopping? For each OC user find a diaphragm user in the same age, race, parity, and SES groups.

Two-sample t-test for independent samples (§8.4)

Example: A study was conducted to compare the serum iron levels of healthy children to those of children with cystic fibrosis.

A random sample of $n_1 = 9$ healthy children has mean serum iron level $\bar{x}_1 = 18.9$ µmol/l and standard deviation $s_1 = 5.9$ µmol/l. For a sample of $n_2 = 13$ children with cystic fibrosis, the mean is $\bar{x}_2 = 11.9$ µmol/l and $s_2 = 6.3$ µmol/l. Note: the two sample sizes need not be the same.

Is the mean serum iron level the same for each of these groups of children? Assume that the two underlying populations of serum iron levels are independent and normally distributed:

$X_1 \sim N(\mu_1, \sigma_1^2)$ and $X_2 \sim N(\mu_2, \sigma_2^2)$

Test the null hypothesis versus the alternative

$H_0: \mu_1 = \mu_2$
$H_A: \mu_1 \neq \mu_2$

Reject $H_0$ if $\bar{x}_1 - \bar{x}_2$ is too far from 0.

Two different situations can arise: the variances of the underlying populations are either equal or they are not equal.

Two-sample t-test for independent samples with equal variance (§8.5)

Assume for now that $\sigma_1^2 = \sigma_2^2 = \sigma^2$.

Recall the central limit theorem (CLT) for a single normal population

$\bar{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$

For two normal populations, an extension of the CLT says that

$\bar{X}_1 - \bar{X}_2 \sim N\left(\mu_1 - \mu_2,\ \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}\right)$

Using the CLT extension

$\bar{x}_1 - \bar{x}_2 \sim N\left(\mu_1 - \mu_2,\ \sigma^2\left[\frac{1}{n_1} + \frac{1}{n_2}\right]\right)$

Hence, if σ is known,

$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\sigma^2[(1/n_1) + (1/n_2)]}}$

will follow a standard normal distribution.

If σ is unknown (as is usually the case), then

$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{s_p^2[(1/n_1) + (1/n_2)]}}$

can be used as a test statistic, where $s_p^2$ is an estimate of the variance.

Under $H_0$ ($\mu_1 - \mu_2 = 0$) the test statistic t has a t distribution with $n_1 + n_2 - 2$ df, so compare it to the table of areas for the t distribution to find p, the probability of observing a discrepancy as large as or larger than $\bar{x}_1 - \bar{x}_2$.

If $p \leq \alpha$, reject $H_0$.
If $p > \alpha$, do not reject $H_0$.

This test is the two-sample t test.

Estimating $\sigma^2$ using $s_p^2$

$s_p^2$ is called the pooled estimate of the variance (Equation 8.10, page 281). It is a weighted average of the two sample variances

$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$

Investigating the relationship between cystic fibrosis and serum iron levels

$H_0: \mu_1 = \mu_2$, $H_A: \mu_1 \neq \mu_2$

First, calculate the pooled estimate of the variance

$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{(9 - 1)(5.9)^2 + (13 - 1)(6.3)^2}{9 + 13 - 2} = 37.74$

The test statistic is

$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{s_p^2[(1/n_1) + (1/n_2)]}} = \frac{(18.9 - 11.9) - 0}{\sqrt{(37.74)[(1/9) + (1/13)]}} = 2.63$

Referring to the t distribution with 9 + 13 − 2 = 20 df, 2.63 cuts off an area between 0.005 and 0.01 in the upper tail of the distribution, so 0.01 < p < 0.02.

Reject $H_0$ at the α = 0.05 level of significance: mean serum iron level is lower for children with cystic fibrosis than it is for healthy children.

We can use $\bar{x}_1 - \bar{x}_2 = 7.0$ as a point estimate for the true difference in population means $\mu_1 - \mu_2$ and construct a confidence interval.
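The pooled variance and the t statistic for the cystic fibrosis example can be reproduced from the summary statistics alone. This is a short sketch in Python rather than Stata; the variable names are mine.

```python
from math import sqrt

# Summary statistics from the serum iron example (µmol/l).
n1, xbar1, s1 = 9, 18.9, 5.9      # healthy children
n2, xbar2, s2 = 13, 11.9, 6.3     # children with cystic fibrosis

# Pooled estimate of the common variance: a weighted average of the
# two sample variances, weighted by their degrees of freedom.
df = n1 + n2 - 2
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df

# Standard error of the difference in means, and the t statistic
# under H0: mu1 - mu2 = 0.
se_diff = sqrt(sp2 * (1 / n1 + 1 / n2))
t = (xbar1 - xbar2) / se_diff

print(f"pooled variance = {sp2:.2f}, df = {df}, t = {t:.2f}")
```

The statistic is then referred to the t distribution with 20 df, as described above.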

A 95% CI for $\mu_1 - \mu_2$ for the cystic fibrosis example

For a t distribution with 20 df, 95% of the observations lie between −2.086 and 2.086, so

$P\left(-2.086 \leq \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{s_p^2[(1/n_1) + (1/n_2)]}} \leq 2.086\right) = 0.95$

Rearranging terms, confidence limits are

$(\bar{x}_1 - \bar{x}_2) \pm 2.086\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$

or

$(18.9 - 11.9) \pm 2.086\sqrt{37.74\left(\frac{1}{9} + \frac{1}{13}\right)}$

So the 95% CI is (1.4, 12.6), which does not contain 0.

Note that we can use the 95% CI for $\mu_1 - \mu_2$ to test the hypothesis $H_0: \mu_1 = \mu_2$ at the 5% level. We cannot, however, perform a hypothesis test by calculating separate confidence intervals for $\mu_1$ and $\mu_2$ and studying if they overlap. If the two confidence intervals do not overlap then we will reject $H_0$ in a formal hypothesis test but it is possible that we will reject $H_0$ even when the confidence intervals overlap (as in the example on the next slide).

Two-sample t-test in Stata

. ttesti

[Stata output: two-sample t test with equal variances, 20 degrees of freedom, Ho: mean(x) - mean(y) = diff = 0, showing Obs, Mean, Std. Err., Std. Dev. and the 95% confidence interval for x, y, combined and diff, followed by t and P for Ha: diff < 0, Ha: diff != 0 and Ha: diff > 0; arguments and numeric values lost in transcription]

ttesti is for the two-sample t-test (obtained via the statistics/summaries... /classical tests of hypotheses/two-sample mean comparison calculator menu item).

Depending on the format of the data, the Stata commands for a two-sample t-test are:

ttest x, by(group) (via the statistics/summaries... /classical tests of hypotheses/group mean comparisons tests menu item)

or

ttest x1==x2, unpaired (via the statistics/summaries.../classical tests of hypotheses/two-sample mean comparison test menu item).

Testing for the equality of two variances (§8.6)

When performing the two-sample t-test for independent samples we assumed that the underlying variances of the two samples were the same (although we discussed separate tests depending on whether this common variance was known or unknown).
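The confidence interval for the difference, and the warning about overlapping intervals, can both be illustrated with the serum iron summary statistics. The sketch below is in Python rather than Stata. The critical value 2.086 is quoted above; the values 2.306 (8 df) and 2.179 (12 df) are standard t-table values that I have added, and using this data set to illustrate the overlap point is my own assumption about which example the slides intend.

```python
from math import sqrt

# Summary statistics from the serum iron example (µmol/l).
n1, xbar1, s1 = 9, 18.9, 5.9      # healthy children
n2, xbar2, s2 = 13, 11.9, 6.3     # children with cystic fibrosis

# 95% CI for the difference in means, using the pooled variance.
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
se_diff = sqrt(sp2 * (1 / n1 + 1 / n2))
t20 = 2.086                        # t_{20, 0.975}, quoted in the text
diff = xbar1 - xbar2
ci_diff = (diff - t20 * se_diff, diff + t20 * se_diff)

# Separate 95% CIs for each mean (critical values from standard t tables).
t8, t12 = 2.306, 2.179             # t_{8, 0.975} and t_{12, 0.975}
half1 = t8 * s1 / sqrt(n1)
half2 = t12 * s2 / sqrt(n2)
ci_1 = (xbar1 - half1, xbar1 + half1)
ci_2 = (xbar2 - half2, xbar2 + half2)

# Two intervals overlap if each one starts before the other ends.
overlap = ci_1[0] <= ci_2[1] and ci_2[0] <= ci_1[1]

print(f"CI for difference: ({ci_diff[0]:.1f}, {ci_diff[1]:.1f})")
print(f"CI healthy: ({ci_1[0]:.2f}, {ci_1[1]:.2f})")
print(f"CI cystic fibrosis: ({ci_2[0]:.2f}, {ci_2[1]:.2f})")
print(f"separate intervals overlap: {overlap}")
```

Here the interval for the difference excludes 0, so $H_0$ is rejected at the 5% level, even though the two separate intervals overlap.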
It is possible to test the hypothesis that the variances are equal. The test statistic is the ratio of the sample variances

$F = s_1^2/s_2^2$

which has an F distribution under the null hypothesis. Details are given in Rosner (§8.6). The test can be performed in Stata (see the screen shot on the following slide).

You will encounter the F distribution again next week when studying analysis of variance (ANOVA), which is the extension of the t-test to more than two groups.

Two-sample t-test for independent samples with unequal variance (§8.7)

As you might expect, if there is reason to believe that the variances of the two underlying populations are different then we need to use a slightly different form of the t-test. Details are given in Rosner §8.7.

Flowchart for two-sample inference concerning means

[Figure: flowchart for two-sample inference concerning means]

t-tests in Stata (using the command line)

. help ttest

Mean comparison tests

ttest varname == # [, level(#) ]
ttest varname1 == varname2 [, unpaired unequal welch level(#) ]
ttest varname, by(groupvar) [ unequal welch level(#) ]
ttesti #obs #mean #sd #val [, level(#) ]
ttesti #obs1 #mean1 #sd1 #obs2 #mean2 #sd2 [, unequal welch level(#) ]

Simply choose the appropriate form of the test depending on the format of your data, and the appropriate options (unpaired and unequal) to specify a paired or unpaired test and, for the unpaired version (i.e. independent samples), whether the variances are assumed to be equal or not.

t-tests in Stata (using the menus)

[Figure: Stata menus for t-tests]

Nonparametric methods (Chapter 9)

Previously we have assumed that data have come from some underlying distribution, such as the normal or binomial, whose general form is assumed known. Methods of estimation were developed based on these assumptions. Such methods are known as parametric statistical methods because the parametric form of the distribution is assumed known.

If we are not willing to make assumptions about the shape of the distribution, or cannot rely on the central limit theorem due to, for example, small sample size, then nonparametric methods can be applied.

Parametric methods presuppose that there is a meaningful measure of distance between possible data values; nonparametric methods are a useful alternative if such a measure is not available.

For example, the Wilcoxon signed-rank test (§9.3) is a nonparametric alternative to the paired t-test. The Wilcoxon rank-sum test (§9.4) is a nonparametric alternative to the t-test for two independent samples.

I will not further discuss nonparametric methods in this course although you should be aware that they exist. If the underlying parametric assumptions are justified then parametric methods are more powerful than nonparametric methods.
The chi-square distribution (§6.7.2)

If $G = \sum_{i=1}^{n} Z_i^2$ where $Z_1, \ldots, Z_n$ are independent random variables each with a standard normal distribution then G is said to follow a chi-square distribution with n degrees of freedom. We write $G \sim \chi^2_n$.

The chi-square distribution is a family of distributions indexed by the parameter n. It can be shown that the expected value of the $\chi^2_n$ distribution is n and the variance is 2n.

The chi-square distribution is usually used to describe the distribution of test statistics rather than naturally occurring quantities. An example is tests for association for R × C contingency tables, which we will look at next.

Testing independence of two categorical variables (§10.6)
(aka. tests for association for R × C contingency tables)

If each member of a population is examined for two characteristics, and each characteristic classified into a number of categories, we may want to know if the two characteristics are independent.

I'll present the approach using an example. The mathematical details can be found in any textbook (e.g. Rosner §10.6).

A sample of 250 seedlings were classified for vigour and leaf colour with the following results.

[Table: leaf colour (Green, Yellow-green, Yellow) by vigour (Good, Average, Weak), with row and column totals; cell counts lost in transcription]

This is an example of a cross-sectional study.

The proportion of seedlings with good vigour is 67/250 and the proportion of green seedlings is 138/250. These are called marginal proportions.

If the two characteristics are independent, then the expected proportions in each of the 9 cells in the table will be the product of the marginal proportions.
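As a worked illustration of the last statement, the expected count for the green seedlings with good vigour can be obtained from the two marginal totals quoted above. The sketch below is in Python rather than Stata; the function name `expected_count` is my own, and only this one cell is computed because the other totals were not recoverable.

```python
def expected_count(row_total, col_total, grand_total):
    """Expected cell count under independence.

    The expected proportion is the product of the two marginal
    proportions; multiplying by the grand total gives the count.
    """
    expected_proportion = (row_total / grand_total) * (col_total / grand_total)
    return expected_proportion * grand_total

total = 250          # seedlings in the sample
green = 138          # row total: green seedlings
good_vigour = 67     # column total: seedlings with good vigour

expected_green_good = expected_count(green, good_vigour, total)
print(f"expected green seedlings with good vigour: {expected_green_good:.2f}")
```

This is equivalent to the usual shortcut of row total times column total divided by the grand total (138 × 67 / 250).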


1 The Basic Counting Principles 1 The Basic Counting Principles The Multiplication Rule If an operation consists of k steps and the first step can be performed in n 1 ways, the second step can be performed in n ways [regardless of how

More information

University of Jordan Fall 2009/2010 Department of Mathematics

University of Jordan Fall 2009/2010 Department of Mathematics handouts Part 1 (Chapter 1 - Chapter 5) University of Jordan Fall 009/010 Department of Mathematics Chapter 1 Introduction to Introduction; Some Basic Concepts Statistics is a science related to making

More information

Continuous Probability Distributions

Continuous Probability Distributions 1 Chapter 5 Continuous Probability Distributions 5.1 Probability density function Example 5.1.1. Revisit Example 3.1.1. 11 12 13 14 15 16 21 22 23 24 25 26 S = 31 32 33 34 35 36 41 42 43 44 45 46 (5.1.1)

More information

Part 3: Parametric Models

Part 3: Parametric Models Part 3: Parametric Models Matthew Sperrin and Juhyun Park August 19, 2008 1 Introduction There are three main objectives to this section: 1. To introduce the concepts of probability and random variables.

More information

2.6 Tools for Counting sample points

2.6 Tools for Counting sample points 2.6 Tools for Counting sample points When the number of simple events in S is too large, manual enumeration of every sample point in S is tedious or even impossible. (Example) If S contains N equiprobable

More information

Discrete Mathematics and Probability Theory Spring 2016 Rao and Walrand Note 14

Discrete Mathematics and Probability Theory Spring 2016 Rao and Walrand Note 14 CS 70 Discrete Mathematics and Probability Theory Spring 2016 Rao and Walrand Note 14 Introduction One of the key properties of coin flips is independence: if you flip a fair coin ten times and get ten

More information

MATH 10 INTRODUCTORY STATISTICS

MATH 10 INTRODUCTORY STATISTICS MATH 10 INTRODUCTORY STATISTICS Ramesh Yapalparvi Week 2 Chapter 4 Bivariate Data Data with two/paired variables, Pearson correlation coefficient and its properties, general variance sum law Chapter 6

More information

Unit 4 Probability. Dr Mahmoud Alhussami

Unit 4 Probability. Dr Mahmoud Alhussami Unit 4 Probability Dr Mahmoud Alhussami Probability Probability theory developed from the study of games of chance like dice and cards. A process like flipping a coin, rolling a die or drawing a card from

More information

Lecture 1: Probability Fundamentals

Lecture 1: Probability Fundamentals Lecture 1: Probability Fundamentals IB Paper 7: Probability and Statistics Carl Edward Rasmussen Department of Engineering, University of Cambridge January 22nd, 2008 Rasmussen (CUED) Lecture 1: Probability

More information

Probability. Chapter 1 Probability. A Simple Example. Sample Space and Probability. Sample Space and Event. Sample Space (Two Dice) Probability

Probability. Chapter 1 Probability. A Simple Example. Sample Space and Probability. Sample Space and Event. Sample Space (Two Dice) Probability Probability Chapter 1 Probability 1.1 asic Concepts researcher claims that 10% of a large population have disease H. random sample of 100 people is taken from this population and examined. If 20 people

More information

Discrete Mathematics and Probability Theory Fall 2014 Anant Sahai Note 15. Random Variables: Distributions, Independence, and Expectations

Discrete Mathematics and Probability Theory Fall 2014 Anant Sahai Note 15. Random Variables: Distributions, Independence, and Expectations EECS 70 Discrete Mathematics and Probability Theory Fall 204 Anant Sahai Note 5 Random Variables: Distributions, Independence, and Expectations In the last note, we saw how useful it is to have a way of

More information

Probability Notes (A) , Fall 2010

Probability Notes (A) , Fall 2010 Probability Notes (A) 18.310, Fall 2010 We are going to be spending around four lectures on probability theory this year. These notes cover approximately the first three lectures on it. Probability theory

More information

Discrete Probability. Chemistry & Physics. Medicine

Discrete Probability. Chemistry & Physics. Medicine Discrete Probability The existence of gambling for many centuries is evidence of long-running interest in probability. But a good understanding of probability transcends mere gambling. The mathematics

More information

CS 361: Probability & Statistics

CS 361: Probability & Statistics February 19, 2018 CS 361: Probability & Statistics Random variables Markov s inequality This theorem says that for any random variable X and any value a, we have A random variable is unlikely to have an

More information

MATH1231 Algebra, 2017 Chapter 9: Probability and Statistics

MATH1231 Algebra, 2017 Chapter 9: Probability and Statistics MATH1231 Algebra, 2017 Chapter 9: Probability and Statistics A/Prof. Daniel Chan School of Mathematics and Statistics University of New South Wales danielc@unsw.edu.au Daniel Chan (UNSW) MATH1231 Algebra

More information

Lecture #13 Tuesday, October 4, 2016 Textbook: Sections 7.3, 7.4, 8.1, 8.2, 8.3

Lecture #13 Tuesday, October 4, 2016 Textbook: Sections 7.3, 7.4, 8.1, 8.2, 8.3 STATISTICS 200 Lecture #13 Tuesday, October 4, 2016 Textbook: Sections 7.3, 7.4, 8.1, 8.2, 8.3 Objectives: Identify, and resist the temptation to fall for, the gambler s fallacy Define random variable

More information

Module 8 Probability

Module 8 Probability Module 8 Probability Probability is an important part of modern mathematics and modern life, since so many things involve randomness. The ClassWiz is helpful for calculating probabilities, especially those

More information

( ) P A B : Probability of A given B. Probability that A happens

( ) P A B : Probability of A given B. Probability that A happens A B A or B One or the other or both occurs At least one of A or B occurs Probability Review A B A and B Both A and B occur ( ) P A B : Probability of A given B. Probability that A happens given that B

More information

Outline. Probability. Math 143. Department of Mathematics and Statistics Calvin College. Spring 2010

Outline. Probability. Math 143. Department of Mathematics and Statistics Calvin College. Spring 2010 Outline Math 143 Department of Mathematics and Statistics Calvin College Spring 2010 Outline Outline 1 Review Basics Random Variables Mean, Variance and Standard Deviation of Random Variables 2 More Review

More information

Dr. Junchao Xia Center of Biophysics and Computational Biology. Fall /13/2016 1/33

Dr. Junchao Xia Center of Biophysics and Computational Biology. Fall /13/2016 1/33 BIO5312 Biostatistics Lecture 03: Discrete and Continuous Probability Distributions Dr. Junchao Xia Center of Biophysics and Computational Biology Fall 2016 9/13/2016 1/33 Introduction In this lecture,

More information

Introduction to Probability, Fall 2009

Introduction to Probability, Fall 2009 Introduction to Probability, Fall 2009 Math 30530 Review questions for exam 1 solutions 1. Let A, B and C be events. Some of the following statements are always true, and some are not. For those that are

More information

Preliminary Statistics Lecture 2: Probability Theory (Outline) prelimsoas.webs.com

Preliminary Statistics Lecture 2: Probability Theory (Outline) prelimsoas.webs.com 1 School of Oriental and African Studies September 2015 Department of Economics Preliminary Statistics Lecture 2: Probability Theory (Outline) prelimsoas.webs.com Gujarati D. Basic Econometrics, Appendix

More information

Statistical Experiment A statistical experiment is any process by which measurements are obtained.

Statistical Experiment A statistical experiment is any process by which measurements are obtained. (التوزيعات الا حتمالية ( Distributions Probability Statistical Experiment A statistical experiment is any process by which measurements are obtained. Examples of Statistical Experiments Counting the number

More information

Lecture 6. Probability events. Definition 1. The sample space, S, of a. probability experiment is the collection of all

Lecture 6. Probability events. Definition 1. The sample space, S, of a. probability experiment is the collection of all Lecture 6 1 Lecture 6 Probability events Definition 1. The sample space, S, of a probability experiment is the collection of all possible outcomes of an experiment. One such outcome is called a simple

More information

Properties of Probability

Properties of Probability Econ 325 Notes on Probability 1 By Hiro Kasahara Properties of Probability In statistics, we consider random experiments, experiments for which the outcome is random, i.e., cannot be predicted with certainty.

More information

Elementary Statistics

Elementary Statistics Elementary Statistics Q: What is data? Q: What does the data look like? Q: What conclusions can we draw from the data? Q: Where is the middle of the data? Q: Why is the spread of the data important? Q:

More information

7.1 What is it and why should we care?

7.1 What is it and why should we care? Chapter 7 Probability In this section, we go over some simple concepts from probability theory. We integrate these with ideas from formal language theory in the next chapter. 7.1 What is it and why should

More information

Probabilistic models

Probabilistic models Kolmogorov (Andrei Nikolaevich, 1903 1987) put forward an axiomatic system for probability theory. Foundations of the Calculus of Probabilities, published in 1933, immediately became the definitive formulation

More information

Lecture 10: Probability distributions TUESDAY, FEBRUARY 19, 2019

Lecture 10: Probability distributions TUESDAY, FEBRUARY 19, 2019 Lecture 10: Probability distributions DANIEL WELLER TUESDAY, FEBRUARY 19, 2019 Agenda What is probability? (again) Describing probabilities (distributions) Understanding probabilities (expectation) Partial

More information

Discrete Mathematics and Probability Theory Fall 2013 Vazirani Note 12. Random Variables: Distribution and Expectation

Discrete Mathematics and Probability Theory Fall 2013 Vazirani Note 12. Random Variables: Distribution and Expectation CS 70 Discrete Mathematics and Probability Theory Fall 203 Vazirani Note 2 Random Variables: Distribution and Expectation We will now return once again to the question of how many heads in a typical sequence

More information

P(A) = Definitions. Overview. P - denotes a probability. A, B, and C - denote specific events. P (A) - Chapter 3 Probability

P(A) = Definitions. Overview. P - denotes a probability. A, B, and C - denote specific events. P (A) - Chapter 3 Probability Chapter 3 Probability Slide 1 Slide 2 3-1 Overview 3-2 Fundamentals 3-3 Addition Rule 3-4 Multiplication Rule: Basics 3-5 Multiplication Rule: Complements and Conditional Probability 3-6 Probabilities

More information

Find the value of n in order for the player to get an expected return of 9 counters per roll.

Find the value of n in order for the player to get an expected return of 9 counters per roll. . A biased die with four faces is used in a game. A player pays 0 counters to roll the die. The table below shows the possible scores on the die, the probability of each score and the number of counters

More information

1 Normal Distribution.

1 Normal Distribution. Normal Distribution.. Introduction A Bernoulli trial is simple random experiment that ends in success or failure. A Bernoulli trial can be used to make a new random experiment by repeating the Bernoulli

More information

MATH 19B FINAL EXAM PROBABILITY REVIEW PROBLEMS SPRING, 2010

MATH 19B FINAL EXAM PROBABILITY REVIEW PROBLEMS SPRING, 2010 MATH 9B FINAL EXAM PROBABILITY REVIEW PROBLEMS SPRING, 00 This handout is meant to provide a collection of exercises that use the material from the probability and statistics portion of the course The

More information

COVENANT UNIVERSITY NIGERIA TUTORIAL KIT OMEGA SEMESTER PROGRAMME: ECONOMICS

COVENANT UNIVERSITY NIGERIA TUTORIAL KIT OMEGA SEMESTER PROGRAMME: ECONOMICS COVENANT UNIVERSITY NIGERIA TUTORIAL KIT OMEGA SEMESTER PROGRAMME: ECONOMICS COURSE: CBS 221 DISCLAIMER The contents of this document are intended for practice and leaning purposes at the undergraduate

More information

Management Programme. MS-08: Quantitative Analysis for Managerial Applications

Management Programme. MS-08: Quantitative Analysis for Managerial Applications MS-08 Management Programme ASSIGNMENT SECOND SEMESTER 2013 MS-08: Quantitative Analysis for Managerial Applications School of Management Studies INDIRA GANDHI NATIONAL OPEN UNIVERSITY MAIDAN GARHI, NEW

More information

Chapter 4 Probability

Chapter 4 Probability 4-1 Review and Preview Chapter 4 Probability 4-2 Basic Concepts of Probability 4-3 Addition Rule 4-4 Multiplication Rule: Basics 4-5 Multiplication Rule: Complements and Conditional Probability 4-6 Counting

More information

F71SM STATISTICAL METHODS

F71SM STATISTICAL METHODS F71SM STATISTICAL METHODS RJG SUMMARY NOTES 2 PROBABILITY 2.1 Introduction A random experiment is an experiment which is repeatable under identical conditions, and for which, at each repetition, the outcome

More information

Probability. Lecture Notes. Adolfo J. Rumbos

Probability. Lecture Notes. Adolfo J. Rumbos Probability Lecture Notes Adolfo J. Rumbos October 20, 204 2 Contents Introduction 5. An example from statistical inference................ 5 2 Probability Spaces 9 2. Sample Spaces and σ fields.....................

More information

Applied Statistics I

Applied Statistics I Applied Statistics I (IMT224β/AMT224β) Department of Mathematics University of Ruhuna A.W.L. Pubudu Thilan Department of Mathematics University of Ruhuna Applied Statistics I(IMT224β/AMT224β) 1/158 Chapter

More information

A Probability Primer. A random walk down a probabilistic path leading to some stochastic thoughts on chance events and uncertain outcomes.

A Probability Primer. A random walk down a probabilistic path leading to some stochastic thoughts on chance events and uncertain outcomes. A Probability Primer A random walk down a probabilistic path leading to some stochastic thoughts on chance events and uncertain outcomes. Are you holding all the cards?? Random Events A random event, E,

More information

Lectures Conditional Probability and Independence

Lectures Conditional Probability and Independence Lectures 5 11 Conditional Probability and Independence Purpose: Calculate probabilities under restrictions, conditions or partial information on the random experiment. Break down complex probabilistic

More information

lecture notes October 22, Probability Theory

lecture notes October 22, Probability Theory 8.30 lecture notes October 22, 203 Probability Theory Lecturer: Michel Goemans These notes cover the basic definitions of discrete probability theory, and then present some results including Bayes rule,

More information

Class 26: review for final exam 18.05, Spring 2014

Class 26: review for final exam 18.05, Spring 2014 Probability Class 26: review for final eam 8.05, Spring 204 Counting Sets Inclusion-eclusion principle Rule of product (multiplication rule) Permutation and combinations Basics Outcome, sample space, event

More information

5.3 Conditional Probability and Independence

5.3 Conditional Probability and Independence 28 CHAPTER 5. PROBABILITY 5. Conditional Probability and Independence 5.. Conditional Probability Two cubical dice each have a triangle painted on one side, a circle painted on two sides and a square painted

More information

STAT:5100 (22S:193) Statistical Inference I

STAT:5100 (22S:193) Statistical Inference I STAT:5100 (22S:193) Statistical Inference I Week 3 Luke Tierney University of Iowa Fall 2015 Luke Tierney (U Iowa) STAT:5100 (22S:193) Statistical Inference I Fall 2015 1 Recap Matching problem Generalized

More information

BIOL 51A - Biostatistics 1 1. Lecture 1: Intro to Biostatistics. Smoking: hazardous? FEV (l) Smoke

BIOL 51A - Biostatistics 1 1. Lecture 1: Intro to Biostatistics. Smoking: hazardous? FEV (l) Smoke BIOL 51A - Biostatistics 1 1 Lecture 1: Intro to Biostatistics Smoking: hazardous? FEV (l) 1 2 3 4 5 No Yes Smoke BIOL 51A - Biostatistics 1 2 Box Plot a.k.a box-and-whisker diagram or candlestick chart

More information

Let us think of the situation as having a 50 sided fair die; any one number is equally likely to appear.

Let us think of the situation as having a 50 sided fair die; any one number is equally likely to appear. Probability_Homework Answers. Let the sample space consist of the integers through. {, 2, 3,, }. Consider the following events from that Sample Space. Event A: {a number is a multiple of 5 5, 0, 5,, }

More information

4. Discrete Probability Distributions. Introduction & Binomial Distribution

4. Discrete Probability Distributions. Introduction & Binomial Distribution 4. Discrete Probability Distributions Introduction & Binomial Distribution Aim & Objectives 1 Aims u Introduce discrete probability distributions v Binomial distribution v Poisson distribution 2 Objectives

More information

Random Variable. Discrete Random Variable. Continuous Random Variable. Discrete Random Variable. Discrete Probability Distribution

Random Variable. Discrete Random Variable. Continuous Random Variable. Discrete Random Variable. Discrete Probability Distribution Random Variable Theoretical Probability Distribution Random Variable Discrete Probability Distributions A variable that assumes a numerical description for the outcome of a random eperiment (by chance).

More information

Lecture Slides. Elementary Statistics Eleventh Edition. by Mario F. Triola. and the Triola Statistics Series 4.1-1

Lecture Slides. Elementary Statistics Eleventh Edition. by Mario F. Triola. and the Triola Statistics Series 4.1-1 Lecture Slides Elementary Statistics Eleventh Edition and the Triola Statistics Series by Mario F. Triola 4.1-1 4-1 Review and Preview Chapter 4 Probability 4-2 Basic Concepts of Probability 4-3 Addition

More information

Review. More Review. Things to know about Probability: Let Ω be the sample space for a probability measure P.

Review. More Review. Things to know about Probability: Let Ω be the sample space for a probability measure P. 1 2 Review Data for assessing the sensitivity and specificity of a test are usually of the form disease category test result diseased (+) nondiseased ( ) + A B C D Sensitivity: is the proportion of diseased

More information

Discrete Mathematics and Probability Theory Spring 2016 Rao and Walrand Note 16. Random Variables: Distribution and Expectation

Discrete Mathematics and Probability Theory Spring 2016 Rao and Walrand Note 16. Random Variables: Distribution and Expectation CS 70 Discrete Mathematics and Probability Theory Spring 206 Rao and Walrand Note 6 Random Variables: Distribution and Expectation Example: Coin Flips Recall our setup of a probabilistic experiment as

More information

Discussion 03 Solutions

Discussion 03 Solutions STAT Discussion Solutions Spring 8. A new flavor of toothpaste has been developed. It was tested by a group of people. Nine of the group said they liked the new flavor, and the remaining indicated they

More information

ACM 116: Lecture 2. Agenda. Independence. Bayes rule. Discrete random variables Bernoulli distribution Binomial distribution

ACM 116: Lecture 2. Agenda. Independence. Bayes rule. Discrete random variables Bernoulli distribution Binomial distribution 1 ACM 116: Lecture 2 Agenda Independence Bayes rule Discrete random variables Bernoulli distribution Binomial distribution Continuous Random variables The Normal distribution Expected value of a random

More information

Examples of frequentist probability include games of chance, sample surveys, and randomized experiments. We will focus on frequentist probability sinc

Examples of frequentist probability include games of chance, sample surveys, and randomized experiments. We will focus on frequentist probability sinc FPPA-Chapters 13,14 and parts of 16,17, and 18 STATISTICS 50 Richard A. Berk Spring, 1997 May 30, 1997 1 Thinking about Chance People talk about \chance" and \probability" all the time. There are many

More information

DISCRETE VARIABLE PROBLEMS ONLY

DISCRETE VARIABLE PROBLEMS ONLY DISCRETE VARIABLE PROBLEMS ONLY. A biased die with four faces is used in a game. A player pays 0 counters to roll the die. The table below shows the possible scores on the die, the probability of each

More information

Probability. Introduction to Biostatistics

Probability. Introduction to Biostatistics Introduction to Biostatistics Probability Second Semester 2014/2015 Text Book: Basic Concepts and Methodology for the Health Sciences By Wayne W. Daniel, 10 th edition Dr. Sireen Alkhaldi, BDS, MPH, DrPH

More information

STAT2201. Analysis of Engineering & Scientific Data. Unit 3

STAT2201. Analysis of Engineering & Scientific Data. Unit 3 STAT2201 Analysis of Engineering & Scientific Data Unit 3 Slava Vaisman The University of Queensland School of Mathematics and Physics What we learned in Unit 2 (1) We defined a sample space of a random

More information

Discrete Mathematics and Probability Theory Fall 2012 Vazirani Note 14. Random Variables: Distribution and Expectation

Discrete Mathematics and Probability Theory Fall 2012 Vazirani Note 14. Random Variables: Distribution and Expectation CS 70 Discrete Mathematics and Probability Theory Fall 202 Vazirani Note 4 Random Variables: Distribution and Expectation Random Variables Question: The homeworks of 20 students are collected in, randomly

More information

Probability Theory. Introduction to Probability Theory. Principles of Counting Examples. Principles of Counting. Probability spaces.

Probability Theory. Introduction to Probability Theory. Principles of Counting Examples. Principles of Counting. Probability spaces. Probability Theory To start out the course, we need to know something about statistics and probability Introduction to Probability Theory L645 Advanced NLP Autumn 2009 This is only an introduction; for

More information

2. AXIOMATIC PROBABILITY

2. AXIOMATIC PROBABILITY IA Probability Lent Term 2. AXIOMATIC PROBABILITY 2. The axioms The formulation for classical probability in which all outcomes or points in the sample space are equally likely is too restrictive to develop

More information

STEP Support Programme. Statistics STEP Questions

STEP Support Programme. Statistics STEP Questions STEP Support Programme Statistics STEP Questions This is a selection of STEP I and STEP II questions. The specification is the same for both papers, with STEP II questions designed to be more difficult.

More information

What is the probability of getting a heads when flipping a coin

What is the probability of getting a heads when flipping a coin Chapter 2 Probability Probability theory is a branch of mathematics dealing with chance phenomena. The origins of the subject date back to the Italian mathematician Cardano about 1550, and French mathematicians

More information

Answer keys for Assignment 10: Measurement of study variables (The correct answer is underlined in bold text)

Answer keys for Assignment 10: Measurement of study variables (The correct answer is underlined in bold text) Answer keys for Assignment 10: Measurement of study variables (The correct answer is underlined in bold text) 1. A quick and easy indicator of dispersion is a. Arithmetic mean b. Variance c. Standard deviation

More information

Conditional Probability

Conditional Probability Conditional Probability Terminology: The probability of an event occurring, given that another event has already occurred. P A B = ( ) () P A B : The probability of A given B. Consider the following table:

More information

Mathematical Probability

Mathematical Probability Mathematical Probability STA 281 Fall 2011 1 Introduction Engineers and scientists are always exposed to data, both in their professional capacities and in everyday activities. The discipline of statistics

More information

Probability deals with modeling of random phenomena (phenomena or experiments whose outcomes may vary)

Probability deals with modeling of random phenomena (phenomena or experiments whose outcomes may vary) Chapter 14 From Randomness to Probability How to measure a likelihood of an event? How likely is it to answer correctly one out of two true-false questions on a quiz? Is it more, less, or equally likely

More information

Discrete Distributions

Discrete Distributions Discrete Distributions STA 281 Fall 2011 1 Introduction Previously we defined a random variable to be an experiment with numerical outcomes. Often different random variables are related in that they have

More information

The t-distribution. Patrick Breheny. October 13. z tests The χ 2 -distribution The t-distribution Summary

The t-distribution. Patrick Breheny. October 13. z tests The χ 2 -distribution The t-distribution Summary Patrick Breheny October 13 Patrick Breheny Biostatistical Methods I (BIOS 5710) 1/25 Introduction Introduction What s wrong with z-tests? So far we ve (thoroughly!) discussed how to carry out hypothesis

More information

Probabilistic models

Probabilistic models Probabilistic models Kolmogorov (Andrei Nikolaevich, 1903 1987) put forward an axiomatic system for probability theory. Foundations of the Calculus of Probabilities, published in 1933, immediately became

More information

1. When applied to an affected person, the test comes up positive in 90% of cases, and negative in 10% (these are called false negatives ).

1. When applied to an affected person, the test comes up positive in 90% of cases, and negative in 10% (these are called false negatives ). CS 70 Discrete Mathematics for CS Spring 2006 Vazirani Lecture 8 Conditional Probability A pharmaceutical company is marketing a new test for a certain medical condition. According to clinical trials,

More information

Discrete Distributions

Discrete Distributions A simplest example of random experiment is a coin-tossing, formally called Bernoulli trial. It happens to be the case that many useful distributions are built upon this simplest form of experiment, whose

More information

MATH2206 Prob Stat/20.Jan Weekly Review 1-2

MATH2206 Prob Stat/20.Jan Weekly Review 1-2 MATH2206 Prob Stat/20.Jan.2017 Weekly Review 1-2 This week I explained the idea behind the formula of the well-known statistic standard deviation so that it is clear now why it is a measure of dispersion

More information

E509A: Principle of Biostatistics. GY Zou

E509A: Principle of Biostatistics. GY Zou E509A: Principle of Biostatistics (Week 4: Inference for a single mean ) GY Zou gzou@srobarts.ca Example 5.4. (p. 183). A random sample of n =16, Mean I.Q is 106 with standard deviation S =12.4. What

More information

Review of Basic Probability Theory

Review of Basic Probability Theory Review of Basic Probability Theory James H. Steiger Department of Psychology and Human Development Vanderbilt University James H. Steiger (Vanderbilt University) 1 / 35 Review of Basic Probability Theory

More information

3 PROBABILITY TOPICS

3 PROBABILITY TOPICS Chapter 3 Probability Topics 135 3 PROBABILITY TOPICS Figure 3.1 Meteor showers are rare, but the probability of them occurring can be calculated. (credit: Navicore/flickr) Introduction It is often necessary

More information

Announcements. Topics: To Do:

Announcements. Topics: To Do: Announcements Topics: In the Probability and Statistics module: - Sections 1 + 2: Introduction to Stochastic Models - Section 3: Basics of Probability Theory - Section 4: Conditional Probability; Law of

More information

Probability Distribution

Probability Distribution Economic Risk and Decision Analysis for Oil and Gas Industry CE81.98 School of Engineering and Technology Asian Institute of Technology January Semester Presented by Dr. Thitisak Boonpramote Department

More information

Probability theory basics
