BIOS 2041: Introduction to Statistical Methods



Abdus S Wahed*

*Some of the materials in this chapter have been adapted from Dr. John Wilson's lecture notes for the same course.


Chapter 1 Introduction to Statistical Methods

1.1 What is Statistics?

Statistics: the science of making inferences about specific random phenomena based on limited sample materials.

The discipline provides methods for answering questions such as:

- What effect does air pollution have on the residents of Pittsburgh?
- What proportion of Pittsburgh residents invest in stocks or bonds?
- Is drug A better than drug B in relieving certain asthma symptoms?
- Does vitamin A prevent cancer?
- Based on this quarter's performance of stock returns, what strategy will optimize the expected return in the next quarter?

A central task of statistical analysis is to draw a conclusion ("make inference") about a population of interest based on evidence in a sample from that population.

Population = the set of all subjects or individuals who could be measured for some variable of interest. Another viewpoint is that the population is the group about which you wish to draw a conclusion.
Example: All women in Allegheny County.

A parameter is a numeric characteristic of a population.
Example: Proportion of women in Allegheny County having a female relative who has been treated for breast cancer.

A sample is a subset of the population selected for study. The idea is that the sample will provide the information used in drawing the conclusion about the population.
Example: 200 Allegheny County women selected by random-digit telephone dialing.

A statistic is a numeric characteristic of a sample.
Example: The observed proportion of women with a female relative who had been treated for breast cancer is 35%.

Inference is the conclusion drawn about the population on the basis of the sample.
Example: The proportion of Allegheny County women having a female relative who has had breast cancer is 35%.

Another example:

Population: All patients treated for Acute Myelocytic Leukemia (AML) who are in first complete remission (CR1).

Parameter: Median duration of remission of treated AML patients in CR1.
Sample: 35 AML / CR1 patients treated at the University of Pittsburgh Cancer Institute during [...].
Statistic: The median duration of CR1 in these 35 patients was 13 months.
Inference: The median duration of CR1 in patients treated for AML is 13 months.

1.2 What is Biostatistics?

Biostatistics: the branch of statistics that applies statistical methods to medical and biological problems.

Biostatisticians help researchers (basic scientists, medical researchers, drug developers) from the inception of a study to its completion. The role of a biostatistician in the process is:

- To formulate the research question in concrete terms (a hypothesis).
- To plan the experiment/study that will answer the research question accurately and efficiently, e.g.:
  - How many subjects (mice, patients, machines) will be needed to answer the research question?
  - How would subjects be assigned to different groups?
  - What data should be collected on each subject?
  - How would the data be verified and processed?
- To anticipate issues with the data, e.g.:
  - How would missing data be handled?
  - Are there measurement errors in the data? How will they be handled?
- To analyze the collected data to draw conclusions regarding the hypotheses.

Example: Drug development.

XYZ pharmaceuticals has been conducting research on developing drugs for hepatitis C (Hep C) treatment since [...]. Their basic science researchers have convinced the Food and Drug Administration (FDA), through phase I and II trials, that they have discovered a new molecule of the standard interferon that can be administered once weekly instead of once daily, and they claim that the drug provides a better response rate compared to standard interferon. The company is planning to test the drug on a large cohort of hepatitis C patients. The statistician assigned to this study will generally start by asking basic questions like:

1. How would you quantify the response? (Usually a simplified answer would be: absence of Hep C virus in the serum 24 weeks after the end of the treatment.)
2. How much improvement do you expect in response rate among the users of the new drug compared to standard interferon users? (The Phase II trial would indicate some ballpark figure for this.)

Based on the answers, the statistician will:

- Formulate the hypothesis in quantitative terms:

  $H_0 : P_1 = P_2$, (1.2.1)

  where $P_1$ is the response rate in the standard interferon group and $P_2$ is the response rate for the new treatment (weekly interferon).
- Determine the number of patients to be recruited in the (standard) daily interferon group and in the (new) weekly interferon group.
- Make sure that patient safety and privacy are ensured in the protocol, keeping in mind the objective of the study.
- Devise a randomization scheme (possibly double-blinded) to assign treatments to patients so that the two groups are comparable with respect to patient characteristics.
- Suggest a data collection, verification and management plan.

- How many sites will be used for patient recruitment?
- What data need to be collected?
- What system will be used to transfer the data?
- How will the data be processed?
- What information should be presented to the DSMB (Data Safety Monitoring Board), and how often?
- What criteria should be used to declare the new treatment significantly better?
- How many interim analyses should be planned?
- What criteria should be used for stopping the trial?

Finally, when the trial ends, the statistician will conduct or oversee the data analysis to arrive at a conclusion regarding the hypothesis.

In this course, we will mainly talk about:

- Statistical methods to analyze collected data so that specific questions of interest can be answered.
- Design issues, for example, sample size and power.

We will cover: Chapters 1-8 (in full), [...] (partial).


Chapter 2 Descriptive Statistics

In most cases data consist of many sample points. In a bid to interpret data, the first task is to summarize the data in some concise manner.

2.1 Types of data

Data collected, outcomes of experiments, etc. are often referred to as variables or outcomes, which come in several varieties. The type of outcome observed plays a role in determining which statistical procedures are appropriate.

Categorical (discrete): data can be assigned to discrete categories.
a) Unordered
   i) Gender
   ii) Political party to which one belongs
   iii) Exposed vs not exposed
   iv) Disease or no disease
b) Ordered
   i) Good-Better-Best classification
   ii) Number of times a patient is admitted to hospital for illness during a given year.

Continuous variables
a) Ordinary or uncensored
   i) Standard scale measurements
      - height
      - weight
      - optical density
      - pH
   ii) Survival times that are actually observed.
b) Censored data
   i) Survival time: it may be known only that the time is greater than some observed time.

Here are the first 10 records from a dataset:

Table 2.1: Several records from a dataset
Obs   ID   AGE   SEX   LEADTYP   IQF
[...]

Many numerical and graphical techniques are available for the purpose of summarizing data. We will start with continuous variables.

2.2 Measures of Location

The first set of summary measures will define the center (or middle) of the sample data. Such measures are known as measures of location or measures of central tendency. We will start with the simplest of these measures, the arithmetic mean (or simply, the mean).

Arithmetic Mean

The arithmetic mean is the sum of the observations divided by the number of observations.

Formula: If X is what is measured (observed) and $x_1, x_2, \ldots, x_n$ are the values of n measurements, then the arithmetic mean is given by the formula:

$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{\sum_{i=1}^{n} x_i}{n}$. (2.2.1)

Example (Table 2.1, Rosner)

Table 2.2: Sample of birthweights (g) of live-born infants born at a private hospital in San Diego, California, during a 1-week period.
New-born   Weight (g)
[...]

X = birthweights (g) of live-born infants

$\bar{x} = [...]$ g. (2.2.2)

Facts about the mean:
- The arithmetic mean is easy to compute.
- If the sample points change in scale by a factor of c, the mean changes by a factor of c.
- In some cases it fails to reflect the center of the sample, specifically in the presence of unusually high or low values (outliers).

- It is the most widely used measure of location.

Median

Loosely speaking, the median is a number such that, in the ordered sample, half of the sample points lie below it and half above it.

Formula: If n is odd, then the $\left(\frac{n+1}{2}\right)$th largest observation is the median. Otherwise, the median is defined as the average of the $\left(\frac{n}{2}\right)$th and $\left(\frac{n}{2}+1\right)$th largest observations.

Example (Table 2.2, Rosner). White blood cell counts (×1000) for a sample of 9 patients entering a hospital. The ordered sample is as follows:

3, 5, 7, 8, 8, 9, 10, 12, 35

Here n = 9, and hence $\frac{n+1}{2} = 5$. The median white blood cell count for this sample is the 5th observation, which is 8 (×1000), i.e. 8000.
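As a small numerical check of the two definitions above, the following Python sketch computes the mean and the median of the white blood cell sample, with and without the extreme value 35. The function and variable names are illustrative and not part of the notes.

```python
def mean(values):
    """Arithmetic mean: sum of the observations divided by their number."""
    return sum(values) / len(values)


def median(values):
    """Median of a sample, following the odd/even rule stated in the notes."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        # n odd: the ((n+1)/2)-th observation (1-based), hence the -1 for indexing
        return ordered[(n + 1) // 2 - 1]
    # n even: average of the (n/2)-th and (n/2 + 1)-th observations
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


# White blood cell counts (x1000) from the example in the notes
wbc = [3, 5, 7, 8, 8, 9, 10, 12, 35]
# Same sample with the extreme value 35 dropped, for comparison only
wbc_trimmed = wbc[:-1]

print(round(mean(wbc), 2), median(wbc))
print(mean(wbc_trimmed), median(wbc_trimmed))
```

Dropping the single extreme value moves the mean from about 10.78 to 7.75, while the median stays at 8, which is the sensitivity to outliers noted among the facts about the mean.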

Facts about the median:
- The median is not highly influenced by extreme observations, unless there are only one or two data points.
- The median depends only on one or two middle observations and hence is less sensitive to the magnitude of other observations in the sample.

Mode

The mode is the most frequently occurring value in the sample. In the above example, the mode white blood cell count is 8000, as it occurs more frequently than any other white blood cell count.

Facts about the mode:
- If all the data points occur exactly the same number of times, then there is no mode.
- A sample with one mode is called unimodal; two modes, bimodal; three modes, trimodal; and so on.

Geometric Mean

The geometric mean is often used for summarizing ratios, percentages, indices, or other data sets bounded by zero. The geometric mean of n positive numbers $x_1, x_2, \ldots, x_n$ is defined as the n-th root of their product.

Formula:

$GM = \sqrt[n]{x_1 x_2 \cdots x_n} = (x_1 x_2 \cdots x_n)^{1/n}$. (2.2.3)

In Example (2.2.2), the geometric mean is

$(3 \times 5 \times 7 \times 8 \times 8 \times 9 \times 10 \times 12 \times 35)^{1/9} = 8.59$.

Facts about the geometric mean:
- Only defined for non-negative numbers.
- Usually, if a distribution on the positive axis is asymmetric, then a log transformation is used to make it symmetric. For such distributions the geometric mean is used.
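The mode and geometric mean of the white blood cell sample can be reproduced with a short Python sketch. The helper names are illustrative. Computing the geometric mean through logarithms is a choice made here to mirror the log-transformation remark above; it is equivalent to taking the n-th root of the product.

```python
import math
from collections import Counter


def modes(values):
    """Return the most frequent value(s); an empty list means 'no mode'."""
    counts = Counter(values)
    highest = max(counts.values())
    # If several distinct values all occur equally often, there is no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)


def geometric_mean(values):
    """n-th root of the product, computed as exp(mean of logs)."""
    if any(v <= 0 for v in values):
        raise ValueError("this sketch requires strictly positive numbers")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# White blood cell counts (x1000) from the example in the notes
wbc = [3, 5, 7, 8, 8, 9, 10, 12, 35]

print(modes(wbc))                     # the value 8 occurs twice
print(round(geometric_mean(wbc), 2))  # agrees with the 8.59 quoted in the notes
```

A sample such as 1, 1, 2, 2, 3 would return two modes (bimodal), and 1, 2, 3 would return an empty list.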

2.3 Measures of Spread/Variation/Dispersion

Refer to Figure 2.4 (FOB).

Range

The range is the difference between the largest and the smallest observations. For the birthweights data in Table 2.1, the range is 2077 g.

For the data in Figure 2.4 (FOB), the range for the Autoanalyzer method is 49 mg/dl, whereas that for the Microenzymatic method is 17 mg/dl. Thus, one can claim that:

The Microenzymatic method measures cholesterol levels more consistently than the Autoanalyzer method does.

Or, equivalently,

Measurements of cholesterol levels using the Microenzymatic method are more precise than those using the Autoanalyzer method.

Or, equivalently,

Microenzymatic cholesterol measurements have lower variability compared to Autoanalyzer cholesterol measurements.

Facts about the range:
- Easy to compute.
- Depends highly on the extreme values.

Percentiles/Quantiles and Interquartile Range

The 100p-th percentile ($0 \le p \le 1$) of a distribution is the value $V_p$ such that 100p% of the sample points are less than or equal to $V_p$. The median is the 50th percentile.

For the birthweights data in Table 2.1 (n = 20), some of the percentiles are calculated as:

Percentile   Position                  How we calculated it from the ordered data
10th         np = 20 × 0.10 = 2        The average of the 2nd and 3rd observations.
25th         np = 20 × 0.25 = 5        The average of the 5th and 6th observations.
50th         np = 20 × 0.50 = 10       The average of the 10th and 11th observations.
75th         np = 20 × 0.75 = 15       The average of the 15th and 16th observations.
90th         np = 20 × 0.90 = 18       The average of the 18th and 19th observations.
99th         np = 20 × 0.99 = 19.8     The 20th observation.

Table 2.3: Percentiles for the Birthweights data in Table 2.1 (Rosner)

Facts about percentiles:
- Percentiles are also known as quantiles.
- Percentiles characterize the relative positioning of the observations in the sample.
- The spread of the distribution about the center can be characterized by specifying certain quantiles. For instance, the 25th and 75th percentiles tell us that the middle half of the sample points lies between these two values.
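The position rule used in Table 2.3 can be sketched in Python as follows: when np is a whole number k, average the k-th and (k+1)-th ordered observations; otherwise take the next observation above position np. This is a reading of the table rather than a definition quoted from the notes, and the function name `percentile` is illustrative. Because the birthweights themselves are not reproduced here, the sketch uses the ranks 1 to 20 as stand-in data, so the returned value shows which positions are used.

```python
import math


def percentile(sorted_values, p):
    """100p-th percentile of an ordered sample, using the np position rule."""
    if not 0 < p < 1:
        raise ValueError("this sketch handles 0 < p < 1 only")
    n = len(sorted_values)
    position = n * p
    nearest = round(position)
    if abs(position - nearest) < 1e-9:
        # np is a whole number k: average the k-th and (k+1)-th observations
        return (sorted_values[nearest - 1] + sorted_values[nearest]) / 2
    # otherwise: the first observation above position np (1-based ceil)
    return sorted_values[math.ceil(position) - 1]


# Stand-in data: the ranks 1..20, so results can be read as positions
ranks = list(range(1, 21))
for p in (0.10, 0.25, 0.50, 0.75, 0.99):
    print(p, percentile(ranks, p))

# White blood cell counts (x1000) from the median example
wbc = [3, 5, 7, 8, 8, 9, 10, 12, 35]
print(percentile(wbc, 0.25), percentile(wbc, 0.50), percentile(wbc, 0.75))
```

With the ranks as data, a result of 2.5 means "average of the 2nd and 3rd observations" and 20 means "the 20th observation". Statistical packages use several slightly different percentile conventions, so library results may differ from this rule for small samples.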

The 25th percentile and the 75th percentile of a distribution are commonly referred to as the 1st (lower) and 3rd (upper) quartiles. Here are the percentiles for the cholesterol data in Figure 2.4 (FOB):

Method   N   Lower Quartile   Median   Upper Quartile   IQR
Auto     [...]
Micro    [...]

Table 2.4: Percentiles for the Cholesterol data in Figure 2.4 (Rosner)

Interquartile range

The distance between the 1st quartile ($Q_1$) and the 3rd quartile ($Q_3$) is known as the interquartile range (IQR). The interquartile range is useful for comparing the spread of two distributions as well as for detecting outliers. The higher the IQR, the more variable the distribution is. For the cholesterol data, the IQRs for the Autoanalyzer method and the Microenzymatic method are 16 and 5 respectively, which supports our previous claim that the Autoanalyzer method is not as precise as the Microenzymatic method.

- For a positively skewed distribution, the distance between the median and the upper quartile is greater than the distance between the median and the lower quartile.
- For a negatively skewed distribution, the distance between the median and the upper quartile is smaller than the distance between the median and the lower quartile. [Birthweights data (Table 2.1, FOB)]
- For a symmetric distribution, the distance between the median and the upper quartile is approximately equal to the distance between the median and the lower quartile. [For the menstrual cycle data in Table 2.3 (FOB), $Q_1 = 28$ = Median, $Q_3 = 29$.]

Outliers

Outliers are extremely high or low values that are isolated from the overall distribution. Outliers in a data set can be identified based on the lower and upper quartiles.

Formula:

An observation x can be treated as an outlier if either
1. $x > Q_3 + 1.5 \times IQR$, or
2. $x < Q_1 - 1.5 \times IQR$.

Formula: An observation x is an extreme outlier if either
1. $x > Q_3 + 3 \times IQR$, or
2. $x < Q_1 - 3 \times IQR$.

Are there any outliers in the cholesterol data set?

Mean deviation

Let us look at the cholesterol data one more time.

[INSERT CHOLESTEROL FIGURE]

Look at how each observation differs from the mean; i.e., $x_1 - \bar{x}, x_2 - \bar{x}, x_3 - \bar{x}, \ldots, x_n - \bar{x}$. One way to measure the spread is to look at how the sample points in the data differ from the mean. However, the mean of these differences is zero for any data. For the autoanalyzer method sample, the differences $x_i - \bar{x}$ are −23, −7, −5, 9, and 26, and the mean difference is zero. The same is true for the microenzymatic method. Therefore the mean difference about the mean cannot be used to distinguish between samples based on spread.

What if we just take the average of the distances, instead of the differences, i.e., $|x_1 - \bar{x}|, |x_2 - \bar{x}|, |x_3 - \bar{x}|, \ldots, |x_n - \bar{x}|$? The average of the distances from the mean is known as the mean deviation. For the autoanalyzer method sample, the distances are 23, 7, 5, 9, and 26, with an average of 14. On the other hand, the mean deviation for the microenzymatic method is 4.4.
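The outlier question and the mean deviation can both be checked numerically. The sketch below assumes that the autoanalyzer readings are 177, 193, 195, 209, 226 mg/dl; these are the values listed as Sample 2 in the variance discussion, and their deviations from the mean of 200 match the differences quoted above. It also assumes the common 1.5 × IQR multiplier for ordinary outliers and 3 × IQR for extreme ones, and reuses the np position rule for quartiles. All names are illustrative.

```python
import math

# Assumed autoanalyzer cholesterol readings (mg/dl); see the lead-in
auto = [177, 193, 195, 209, 226]


def percentile(sorted_values, p):
    """100p-th percentile of an ordered sample, using the np position rule."""
    position = len(sorted_values) * p
    nearest = round(position)
    if abs(position - nearest) < 1e-9:
        return (sorted_values[nearest - 1] + sorted_values[nearest]) / 2
    return sorted_values[math.ceil(position) - 1]


def outliers(values, multiplier=1.5):
    """Values outside [Q1 - m*IQR, Q3 + m*IQR]; m = 3 gives extreme outliers."""
    ordered = sorted(values)
    q1 = percentile(ordered, 0.25)
    q3 = percentile(ordered, 0.75)
    iqr = q3 - q1
    low, high = q1 - multiplier * iqr, q3 + multiplier * iqr
    return [x for x in ordered if x < low or x > high]


def mean_deviation(values):
    """Average absolute distance of the observations from their mean."""
    m = sum(values) / len(values)
    return sum(abs(x - m) for x in values) / len(values)


sample_mean = sum(auto) / len(auto)
deviations = [x - sample_mean for x in auto]

print(percentile(auto, 0.75) - percentile(auto, 0.25))  # IQR
print(outliers(auto), outliers(auto, multiplier=3))     # flagged values, if any
print(deviations, sum(deviations))                      # signed differences and their sum
print(mean_deviation(auto))                             # average distance from the mean
```

Under these assumptions the IQR comes out as 16, matching the value quoted for the Autoanalyzer method, and no reading falls outside the fences.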

2.3.4 Variance and Standard Deviation

In the definition of the mean deviation, we used absolute values of the differences between individual observations and the sample mean. Absolute values are sometimes difficult to deal with. Another measure of spread uses the squared deviations from the mean and averages them over the whole sample. The measure, known as the variance, is defined as:

$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$. (2.3.1)

The use of $n - 1$ instead of $n$ in the denominator has a special justification, which we will discuss in Chapter 6. The standard deviation is defined as the positive square root of the variance:

$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$. (2.3.2)

For the autoanalyzer method, the variance is

$s^2 = \frac{(-23)^2 + (-7)^2 + (-5)^2 + 9^2 + 26^2}{4} = 340$.

For the microenzymatic method, the variance is

$s^2 = \frac{(-8)^2 + (-3)^2 + 0^2 + 2^2 + 9^2}{4} = 39.5$.

The corresponding standard deviations are respectively $s = \sqrt{340} = 18.4$ and $s = \sqrt{39.5} = 6.3$. Thus the spread of the autoanalyzer measurements, as measured by the standard deviation, is approximately three times as large as that of the microenzymatic measurements.

Facts about variance and standard deviation:
- Variance and standard deviation remain unchanged when all the observations in the sample are shifted by the same constant. For example, the following two samples have the same variance (340) and standard deviation (18.4):
  Sample 1: 77, 93, 95, 109, 126
  Sample 2: 177, 193, 195, 209, 226
- The standard deviation has the same unit of measurement as the original sample.

- If the sample points change in scale by a factor of c, the variance changes by a factor of $c^2$ and the standard deviation changes by a factor of c.
- The standard deviation is the most widely used measure of spread (dispersion).

Coefficient of Variation

Suppose you are comparing two distributions having different means. How would you compare the variability of a sample with mean 10 and standard deviation 5 to a sample with mean 100 and standard deviation 5? Of course, the former is more variable, as the magnitude of the standard deviation relative to the mean is much higher for that sample compared to the latter. The coefficient of variation is designed to account for the magnitude of the mean when assessing the spread. It is defined as:

$CV = \frac{s}{\bar{x}} \times 100\%$. (2.3.3)
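The shift and scale facts, and the coefficient of variation, can be verified on the two samples listed above with a short Python sketch. The function names are illustrative, and the formulas follow (2.3.1) to (2.3.3) with the n − 1 denominator.

```python
import math


def variance(values):
    """Sample variance with the n - 1 denominator, as in (2.3.1)."""
    n = len(values)
    m = sum(values) / n
    return sum((x - m) ** 2 for x in values) / (n - 1)


def std_dev(values):
    """Positive square root of the sample variance, as in (2.3.2)."""
    return math.sqrt(variance(values))


def coefficient_of_variation(values):
    """Standard deviation as a percentage of the mean, as in (2.3.3)."""
    return 100 * std_dev(values) / (sum(values) / len(values))


sample1 = [77, 93, 95, 109, 126]
sample2 = [177, 193, 195, 209, 226]   # sample1 shifted by 100
doubled = [2 * x for x in sample1]    # sample1 rescaled by c = 2

# Shifting leaves the variance unchanged
print(variance(sample1), variance(sample2))
# Rescaling by c multiplies the variance by c**2 and the SD by c
print(variance(doubled), std_dev(doubled) / std_dev(sample1))
# Same SD, different means: the CV differs
print(round(coefficient_of_variation(sample1), 1),
      round(coefficient_of_variation(sample2), 1))
```

The two samples share the same variance and standard deviation, yet their coefficients of variation differ because the means are 100 and 200, which is the situation the coefficient of variation is designed for.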

For the cholesterol data in Table 2.4 (FOB), the coefficients of variation for the Autoanalyzer and Microenzymatic methods are respectively 9.2% and 3.1%.

2.4 Graphical Representation

Histogram

A histogram is a useful way of presenting data graphically. It presents frequencies (or relative frequencies) on the Y-axis against the data points on the X-axis. The frequencies along with the values are usually referred to as the frequency distribution, or simply the distribution. When the number of unique observations is too large, the range of the variable is divided into contiguous intervals and the number of observations belonging to each interval is reported.

Distributions having two approximately similar tails are called symmetric distributions. For such distributions

Mean ≈ Median ≈ Mode.

[Figure: Histogram of Menstrual Cycle; x-axis: Time (days); y-axis: Relative Frequency]

Figure 2.1: Distribution of time intervals between successive menstrual periods (days) of college women (Table 2.3; Rosner; Page 13). Mean = 28.5; Median = 28; Mode = 28.

A distribution which has a longer tail on the right is called a positively skewed distribution. For such distributions, data points to the right of the median tend to be farther from the median in absolute value than points below the median, and

Mean > Median > Mode.

Figure 2.2: Example of a distribution which is neither skewed nor symmetric.

Distributions with a tail on the left are known as negatively skewed distributions. For such distributions

Mean < Median < Mode.

For more examples of symmetric, positively skewed and negatively skewed distributions, refer to page 12 of FOB.
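The frequency distribution behind a histogram can be sketched in Python with a simple text display. The example reuses the white blood cell sample from Section 2.2, and the interval width of 5 is an arbitrary choice made for illustration. The function name `frequency_table` is hypothetical.

```python
import statistics
from collections import Counter


def frequency_table(values, start, width):
    """Count observations in the intervals [start + k*width, start + (k+1)*width)."""
    counts = Counter((x - start) // width for x in values)
    table = []
    for k in range(max(counts) + 1):
        low = start + k * width
        table.append((low, low + width, counts.get(k, 0)))
    return table


# White blood cell counts (x1000) from the median example
wbc = [3, 5, 7, 8, 8, 9, 10, 12, 35]
table = frequency_table(wbc, start=0, width=5)

# Text histogram: one star per observation, then the relative frequency
for low, high, count in table:
    print(f"[{low:>2}, {high:>2}) {'*' * count:<6} {count / len(wbc):.2f}")

print(statistics.mean(wbc) > statistics.median(wbc))
```

The single value in the interval [35, 40) forms a long right tail, and the mean of this sample exceeds its median, in line with the description of a positively skewed distribution.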

2.4.2 Stem-and-leaf Plot

A stem-and-leaf plot is similar to a histogram, but it stays closer to the actual data by using the observations from the actual sample. It shows the basic shape of the distribution just as a histogram does.

[Figure: stem-and-leaf display; columns: Stem, Leaf, Number; scale: multiply Stem.Leaf by 10**+3]

Figure 2.3: Stem-and-leaf plot for the birthweights data in Table 2.1 (FOB)

[Figure: stem-and-leaf display; columns: Stem, Leaf, Number; scale: multiply Stem.Leaf by 10**+1]

Figure 2.4: Stem-and-leaf plot for the variable IQF from the dataset Lead in the case study described in section 2.9 (FOB).

Box plot

[Figure: side-by-side box plots; x-axis: LEAD_TYP (1, 2)]

Figure 2.5: Box plot for the variable IQF from the dataset Lead in the case study described in section 2.9 (FOB), by exposure type.
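The stem-and-leaf construction of Section 2.4.2 can be sketched in a few lines of Python. The function name and the `leaf_unit` parameter are illustrative and only loosely mirror the "Multiply Stem.Leaf by ..." scale shown in the figures. The example uses the white blood cell sample from Section 2.2, because the data behind Figures 2.3 and 2.4 are not reproduced here.

```python
from collections import defaultdict


def stem_and_leaf(values, leaf_unit=1):
    """Return the lines of a stem-and-leaf display for non-negative integers.

    The last digit (in units of leaf_unit) is the leaf; the rest is the stem.
    """
    rows = defaultdict(list)
    for x in sorted(values):
        scaled = x // leaf_unit
        rows[scaled // 10].append(scaled % 10)
    lines = []
    # Include empty stems so that gaps in the data remain visible
    for stem in range(min(rows), max(rows) + 1):
        leaves = "".join(str(digit) for digit in rows.get(stem, []))
        lines.append(f"{stem:>3} | {leaves}")
    return lines


# White blood cell counts (x1000) from the median example
wbc = [3, 5, 7, 8, 8, 9, 10, 12, 35]
lines = stem_and_leaf(wbc)
print("\n".join(lines))
```

Each row keeps the actual leaf digits, so the individual observations can be read back from the display, while the row lengths show the same shape a histogram would.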
