Statistics in medicine

Statistics in medicine, Lecture 1, part 1: Describing variation, and graphical presentation

Fatma Shebl, MD, MS, MPH, PhD
Assistant Professor, Chronic Disease Epidemiology Department
Yale School of Public Health
Fatma.shebl@yale.edu

Outline
- Sources of variation
- Types of variables

Readings and resources
- Chapter 9, p105-118: Jekel's epidemiology, biostatistics, preventive medicine, and public health by David L. Katz et al (4th edition).

Almost every characteristic that is measured on a patient varies
That is why it is called a variable.
Examples:
- Blood glucose level
- Blood pressure
- Diet
- Electrolytes, etc.

There are different sources of variation
Let us consider blood pressure as an example.
- Biologic differences: age, race, and diet affect blood pressure. Older patients, patients of African descent, and those who consume a high-salt diet tend to have higher blood pressure.
- Measurement conditions: time of day, anxiety, fatigue, etc. Higher blood pressure is observed following exercise and with anxiety.
- Measurement error:
  - Systematic error distorts the data in one direction, leading to bias that obscures the truth. Ex. a defective BP cuff that tends to give high readings.
  - Random error consists of slight, inevitable inaccuracies. It is not systematic because it makes some readings too high and some too low.
- Statistics can adjust for random error but cannot fix systematic error.

To understand variation, you have to describe it
Descriptive statistics definition: statistics, such as the mean, the standard deviation, the proportion, and the rate, used to describe attributes of a set of data.

A variable can be quantitative or qualitative
- Qualitative: skin color, jaundice, heart murmurs
- Quantitative: blood pressure, electrolyte levels
Image source: http://clinicalgate.com/wp-content/uploads/2015/06/b9781437729306000483_f48-02- 97 81437729306.jpg

There are different types of variables
- Nominal
- Dichotomous (binary)
- Ordinal (ranked)
- Continuous (interval)
- Continuous (ratio)
- Risks and proportions
- Counts and units of observation
- Combining data

Nominal variables (qualitative)
- Nominal variables are naming variables.
- The simplest scale of measurement.
- Used for characteristics that have no numerical values, no measurement scales, and no rank order.
- Also called a categorical or qualitative scale.
- Ex. skin color. A different number can be assigned to each color, e.g. 1: purple, 2: black, 3: white, 4: blue, 5: tan.
- It makes no difference to the statistical analysis which number is assigned to which color, because the number is merely a numerical name for a color.
- Percentages and proportions are commonly used to summarize the data.

Dichotomous variables (qualitative)
- Dichotomous, from the Greek for "cut into two".
- Ex. normal/abnormal skin color, living/dead.
- Sometimes it is not enough to describe the data as two categories (living/dead); it is also important to know how long the patient survived, which calls for survival analysis.

Ordinal (ranked) variables
- Used for characteristics that have an underlying order to their values, with a clearly implied direction from better to worse.
- Are categorical (qualitative) scales.
- Three or more levels.
- Although there is an order among categories, the difference between two adjacent categories is not the same throughout the scale.

Ordinal (ranked) variables
- Ex. pitting edema grading scale: 0 (no edema) to 4+ (severe edema).
- Ex. pain scale: 0 (no pain) to 10 (worst imaginable pain).
- Percentages and proportions are commonly used to summarize the data.
- Medians are sometimes used to describe the whole data.
Image sources: http://biology-forums.com/gallery/2137_18_05_12_2_25_00.jpeg and https://openclipart.org/detail/218053/pain-scale

Numerical scales (quantitative)
- The highest level of measurement.
- Used for characteristics that can be given numerical values; the differences between numbers have meaning, ex. BMI, height.
- Types: interval, ratio, discrete.
- Measures of central tendency are usually used to summarize the data: means, medians.

Numerical scales (continuous)
- Has a value on a continuum.
- Interval: arbitrary zero point. Ex. the Centigrade temperature scale.
- Ratio: absolute zero point. Ex. the Kelvin temperature scale.
Image source: https://www.google.com/url?sa=i&rct=j&q=&esrc=s&source=images&cd= &cad=rja&uact=8&ved=0ahukewiuo6nf8sjoahuekh4khxtzanuqjrwi Bw&url=http%3A%2F%2Fwww.livescience.com%2F39994- kelv in.html&psig=afqjcnfgvvg1wdlx78w2v44wdlzqdqb17a&ust=147 1 538633651130

Numerical scales (discrete)
- Has values equal to integers.
- Units of observation: person, animal, thing, etc.
- Presented in frequency tables: one characteristic along the rows, one characteristic along the columns, and counts in the cells.

Frequency table of gender by whether serum total cholesterol was checked or not

  Gender    Checked     Not checked   Total
  Female    17 (63%)    10 (37%)      27 (100%)
  Male      25 (57%)    19 (43%)      44 (100%)
  Total     42 (59%)    29 (41%)      71 (100%)

Source: Jekel's epidemiology, biostatistics, preventive medicine, and public health by David L. Katz et al (4th edition).

Risks and proportions
- Risk is the conditional probability of an event (e.g. death) in a defined population in a defined period.
- Risks and proportions share some characteristics of discrete variables and some characteristics of continuous variables. Ex. a discrete event (e.g., death) occurred in a fraction of the population.
- Calculated as the ratio of counts in the numerator to counts in the denominator.

Combining data
- A continuous variable can be converted to an ordinal variable.
- When data are converted to categories, individual information is lost.
- The fewer the categories, the greater the amount of information lost.
[Figure: Histogram of neonatal mortality rate per 1000 live births, by birth weight group, United States, 1980. x-axis: birth weight (g). Source: Buehler W et al. Public Health Rep 102:151-161, 1987]

Statistics in medicine, Lecture 1, part 2: Describing variation, and graphical presentation

Outline
- Frequency distributions
- Frequency distribution of continuous data
- Frequency distribution of binary data

Frequency distribution is
- A TABLE of data displaying the VALUE of each data point (or range of data points) in one column and the FREQUENCY with which that value occurs in the other column.
- A PLOT of data displaying the VALUE of each data point (or range of data points) on one axis and the FREQUENCY with which that value occurs on the other axis.

Frequency tables
Definition: a table showing the number and/or the percentage of observations occurring at different values (or ranges of values) of a variable.
Steps of creating a frequency table:
- Decide on the number of non-overlapping intervals.
  - It is better to have intervals of equal width.
  - Usually 6 to 14 intervals are adequate to demonstrate the shape of the distribution.
  - Creating intervals means that a continuous variable is converted to an ordinal variable, so information at the individual level is lost.
- Count the number of observations in each interval.
- Percentages can be calculated as well: percentage = the number of observations in the interval divided by the total number of observations, multiplied by 100.
- Presented graphically by a histogram.
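The steps above can be sketched in a few lines of Python. This is only an illustration, not the software used in the lecture: it assumes the 20 glucose readings from the lecture's worked examples and the four glucose categories from the lecture's example frequency table, and the names `intervals`, `interval_label`, and `frequency_table` are hypothetical.

```python
from collections import Counter

# Glucose readings (mg/dl) of the 20 subjects in the lecture's worked examples.
glucose = [88, 86, 93, 79, 83, 98, 74, 96, 95, 78,
           75, 98, 90, 108, 81, 108, 76, 97, 72, 106]

# Non-overlapping intervals as (label, lower bound, upper bound), bounds inclusive.
# None means "no bound on that side".
intervals = [
    ("<=70", None, 70),
    ("71-100", 71, 100),
    ("101-125", 101, 125),
    (">=126", 126, None),
]

def interval_label(value):
    """Return the label of the interval that contains value."""
    for label, low, high in intervals:
        if (low is None or value >= low) and (high is None or value <= high):
            return label
    raise ValueError(f"{value} falls outside every interval")

def frequency_table(values):
    """Return one (label, count, percentage) row per interval."""
    counts = Counter(interval_label(v) for v in values)
    total = len(values)
    return [(label, counts[label], 100 * counts[label] / total)
            for label, _, _ in intervals]

table = frequency_table(glucose)
for label, count, pct in table:
    print(f"{label:>8} {count:3d} {pct:6.2f}%")
```

Once the values are binned, only the interval counts remain, which is the loss of individual-level information noted in the steps.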

Frequency tables

Categories of glucose level of 180 participants

  Category   Count   %
  <=70        14      7.78
  71-100     104     57.78
  101-125     26     14.44
  >=126       36     20.00

Glucose level of 180 participants

  Glucose level  Count    Glucose level  Count    Glucose level  Count
   52             1         88             2        140            4
   66             1         89             5        143            1
   67             1         90             8        145            5
   68             2         92             3        149            2
   69             2         95            11        150            4
   70             7         96             1        155            2
   71             1         98             1        158            1
   72             2        100            12        160            1
   73             1        103             4        165            4
   75            12        108             1        168            1
   76             2        110            11        170            1
   77             4        115             1        172            1
   78             6        120             6        220            1
   79             4        121             1
   80            11        122             1
   82             2        124             1
   83             2        130             3
   85             9        133             1
   86             4        135             3
   87             1        139             1

There are REAL and THEORETICAL frequency distributions
- Real: obtained from the actual data.
- Theoretical: calculated using certain assumptions.
  - The most commonly used is the NORMAL (GAUSSIAN) DISTRIBUTION.
  - Most statistical methods assume that the data are normally distributed.
  - Real data are seldom perfectly normally distributed.
  - Based on the central limit theorem, if the sample size is large, the assumption of a normal distribution (for the sample mean) usually holds even if the data are skewed.

Normal (Gaussian) distribution
- Continuous distribution.
- Used if the population standard deviation (σ) is known.
- A symmetric, bell-shaped probability distribution with a shape that is determined by the mean (µ) and standard deviation (σ).
[Figure: normal curves with the same µ and different σ, and with different µ and the same σ]

Normal (Gaussian) distribution
Properties:
- Bell shape
- Depends on the mean (µ) and standard deviation (σ)
- Symmetric about the mean (µ)
- Mean = median = mode

Normal (Gaussian) distribution
- The area under the curve is the probability (relative frequency) of the values comprising the normal distribution.
- The area under the whole curve = 1.
- 68% within µ ± 1σ
- 95% within µ ± 2σ (actually 1.96σ)
- 99.7% within µ ± 3σ (99% within µ ± 2.58σ)

Normal (Gaussian) distribution, example
If the math test scores are normally distributed with a mean of 10 and a standard deviation of 3, what is the range of scores in which 68% of the student scores will lie?
- 68% of the students will have a score within µ ± 1σ.
- 10 ± 3 = between 7 and 13.

Standard normal distribution (z)
- The normal distribution with mean 0 and standard deviation 1.
- If the mean ≠ 0 or the SD ≠ 1, apply the z transformation, which allows using the standard normal table:
  $z = \frac{x - \mu}{\sigma}$, where x is the value of the variable, µ is the mean, and σ is the SD.
- A positive z means the value is above the mean.
- A negative z means the value is below the mean.
- If z is known, you can get x: $x = \mu + z\sigma$.
[Figure: graph generated by R]

Standard normal distribution (z)
Properties:
- Bell shape
- Symmetric about the mean
- Mean = median = mode
- Mean = 0
- Standard deviation = 1
- The area under the curve = 1
- 68% within µ ± 1σ
- 95% within µ ± 2σ
- 99.7% within µ ± 3σ
[Figure: graph generated by R]
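As a rough check of the z transformation and the areas quoted above, the following sketch uses `statistics.NormalDist` from the Python standard library in place of a printed z table. The helper names `z_score`, `from_z`, and `area_within` are illustrative assumptions, not part of the lecture.

```python
from statistics import NormalDist

def z_score(x, mu, sigma):
    """z transformation: how many SDs the value x lies from the mean."""
    return (x - mu) / sigma

def from_z(z, mu, sigma):
    """Inverse transformation: recover x from a known z."""
    return mu + z * sigma

standard_normal = NormalDist(mu=0, sigma=1)

def area_within(k):
    """Area under the standard normal curve between -k and +k."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

# Math test example: mean 10, SD 3; 68% of scores lie within mean +/- 1 SD.
low, high = from_z(-1, 10, 3), from_z(1, 10, 3)
print(low, high)

# Areas for the multipliers mentioned in the lecture.
for k in (1, 1.96, 2, 2.58, 3):
    print(k, round(area_within(k), 4))
```

The loop shows why 1.96 (rather than 2) gives 95% and why 2.58 (rather than 3) gives 99%.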

Standard normal distribution (z) tables
- Give the areas under the standard normal curve (z scores).
- Can be used to find the proportion above, below, or between any z scores.
- The first column includes the stem of the z value.
- The top row includes the second and third digits of the z value.
[Figure: z table (positive and negative z) listing, for each z score, the area under the curve to the left, i.e. below z. Source: http://image.slidesharecdn.com/copyofz-table-130515110049-phpapp02/95/copy-of-ztable-1-638.jpg?cb=1368615687]

Standard normal distribution (z), example
If the mean of students' test scores is 80 and the standard deviation is 10, what is the test score that divides off the highest 5% of scores (i.e. find the students at or above the 95th percentile)?
Solution:
- Find the z score that marks the upper 5%: 1.645.
- The test score = µ + 1.645σ = 80 + 1.645 × 10 = 96.45.
Conclusion: the upper 5% have a test score > 96.45.
Image source: https://i.ytimg.com/vi/sshcpcs5cys/maxresdefault.jpg

Standard normal distribution (z) tables, example
If the mean of HDL cholesterol is 45 mg/dl and the standard deviation is 5, what proportion of the population has HDL values > 40 mg/dl?
Solution:
- Find the z score equivalent to 40 mg/dl: $z = \frac{x - \mu}{\sigma} = \frac{40 - 45}{5} = -1$.
- $P(HDL > 40) = P(z > -1) = 1 - P(z \le -1)$.
- Find the area (probability) below z = −1 (HDL = 40) in the negative z table: 0.1587.
- P(HDL > 40) = 1 − 0.1587 = 0.8413.
Conclusion: 84.13% of people in the population are expected to have an HDL value > 40 mg/dl.

T-distribution
A symmetric distribution with mean 0 and standard deviation larger than that for the normal distribution for small sample sizes.
- Used if the population standard deviation is unknown.
- Needed when the sample size is small.
- The t and z distributions are very similar if n > 30.
Properties:
- Symmetric
- Bell shape
- Shape changes based on the degrees of freedom k
- Mean = median = mode = 0
- Standard deviation > 1
- z and t are almost identical when the sample size is ~30
[Figure: graph generated by R]
Image source (negative z table): http://www.gridgit.com/postpic/2014/10/negative-z-score-table-pdf_287337.png
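Returning briefly to the two z-table examples worked before the t-distribution was introduced (the test scores and the HDL values), the table lookups can be reproduced in code. This is a sketch only: it assumes Python's standard-library `statistics.NormalDist` as a stand-in for the printed table, and the variable names are hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# Example 1: test scores with mean 80 and SD 10.
# The z score with 95% of the area below it marks the upper 5%.
z_upper5 = std_normal.inv_cdf(0.95)
cutoff = 80 + z_upper5 * 10          # x = mu + z * sigma

# Example 2: HDL with mean 45 mg/dl and SD 5; proportion above 40 mg/dl.
z_hdl = (40 - 45) / 5                # z = (x - mu) / sigma
p_below = std_normal.cdf(z_hdl)      # area to the left of z
p_above = 1 - p_below

print(round(z_upper5, 3), round(cutoff, 2))
print(round(p_below, 4), round(p_above, 4))
```

`inv_cdf` plays the role of reading the table "backwards" (from an area to a z score), while `cdf` is the usual forward lookup.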

T-distribution
Degrees of freedom (df)
- The number of observations that are free to vary.
- When calculating the mean, the sum of the observations is fixed. Therefore, when adding up the N observations, each observation can vary except the last one, because the total has to be fixed.
- Therefore, only N − 1 observations can vary if one mean is to be estimated (one-sample), and (N1 + N2) − 2 observations can vary if two means are to be estimated (two-sample).
- df = total sample size − number of means that are calculated.

T-distribution
Table of critical values of the t distribution

        Levels of significance for a one-tailed test
        0.2500  0.2000  0.1500  0.1000  0.0500  0.0250  0.0100  0.0050   0.0005
        Levels of significance for a two-tailed test
  df    0.5000  0.4000  0.3000  0.2000  0.1000  0.0500  0.0200  0.0100   0.0010
   1    1.000   1.376   1.963   3.078   6.314   12.706  31.821  63.657   636.619
   2    0.816   1.061   1.386   1.886   2.920    4.303   6.964   9.925    31.599
   3    0.765   0.978   1.250   1.638   2.353    3.182   4.541   5.841    12.924
   4    0.741   0.941   1.189   1.533   2.132    2.776   3.747   4.604     8.610
   5    0.727   0.920   1.156   1.476   2.015    2.570   3.365   4.032     6.869
   6    0.718   0.906   1.134   1.440   1.943    2.447   3.143   3.707     5.959
   7    0.711   0.896   1.119   1.415   1.895    2.365   2.998   3.499     5.408
   8    0.706   0.889   1.108   1.397   1.859    2.306   2.896   3.355     5.041
   9    0.703   0.883   1.100   1.383   1.833    2.262   2.821   3.250     4.781
  10    0.700   0.879   1.093   1.372   1.812    2.228   2.764   3.169     4.587

Source: http://elvers.us/psy216/tables/tvalues.htm

Binomial distribution is used to describe the frequency distribution of dichotomous data
- The probability distribution that describes the number of successes X observed in n independent trials, each with the same probability of occurrence.
- For binary variables.
- Defined by n and π.
- If the sample is large or the proportion is ~0.5, the z distribution can be used.

Chi-square distribution (χ²) is used for analysis of counts
The distribution used to analyze counts in frequency tables.
- A nonsymmetrical distribution with mean (µ) and variance (σ²).
- Used for categorical (nominal) data.
Properties:
- Degrees of freedom = ν
- µ = ν
- σ² = 2ν
- Approaches the normal distribution as df increases
[Figures: graphs generated by R]
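To make the binomial description earlier in this part concrete (number of successes X in n independent trials, each with probability π), here is a minimal sketch of the binomial probability formula, $P(X = k) = \binom{n}{k}\pi^k(1-\pi)^{n-k}$. The formula is the standard one rather than something quoted from the slides, the function name `binomial_pmf` is hypothetical, and n = 10 with π = 0.5 is an arbitrary illustrative choice.

```python
from math import comb

def binomial_pmf(k, n, pi):
    """Probability of exactly k successes in n independent trials,
    each with success probability pi."""
    return comb(n, k) * pi**k * (1 - pi)**(n - k)

# Illustrative values only (not from the lecture): 10 trials, pi = 0.5.
n, pi = 10, 0.5
distribution = [binomial_pmf(k, n, pi) for k in range(n + 1)]

for k, p in enumerate(distribution):
    print(f"P(X = {k:2d}) = {p:.4f}")
```

With π = 0.5 the printed probabilities are symmetric around n/2, which hints at why the z distribution can serve as an approximation in that situation.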

Statistics in medicine, Lecture 1, part 3: Describing variation, and graphical presentation

Summarizing numerical data
- Continuous variable:
  - Measures of central tendency
  - Measures of dispersion
- Nominal data:
  - Proportions
  - Percentages
  - Ratios
  - Rates

Measures of central tendency
Index or summary numbers that describe the middle of a distribution.
Types:
- Mean
- Median
- Mode

The mean
Types:
- Arithmetic
- Geometric

The arithmetic mean
- The most commonly used statistic.
- The arithmetic average of the observations, denoted by µ in the population and by $\bar{X}$ in the sample.
- In a sample, the mean is the sum of the X values divided by the number n in the sample.
- Arithmetic mean's calculation: $\bar{X} = \frac{\sum X}{n}$
- Sensitive to extreme values.
- Can be used with numerical scales.
- Should NOT be used with ordinal scales.

Example of arithmetic mean's calculation
Arithmetic mean = (88 + 86 + 93 + ... + 106) / 20 = 1781 / 20 = 89.05

  Subject  Glucose    Subject  Glucose
   1        88         11       75
   2        86         12       98
   3        93         13       90
   4        79         14      108
   5        83         15       81
   6        98         16      108
   7        74         17       76
   8        96         18       97
   9        95         19       72
  10        78         20      106

Basic Statistical Measures (SAS 9.4 output)

  Location               Variability
  Mean     89.05000      Std Deviation         11.58254
  Median   89.00000      Variance             134.15526
  Mode     98.00000      Range                 36.00000
                         Interquartile Range   19.00000

The geometric mean
- Less commonly used than the arithmetic mean.
- The nth root of the product of n observations.
- Geometric mean's calculation: $GM = \sqrt[n]{X_1 X_2 \cdots X_n}$, or equivalently $\log GM = \frac{\sum \log X}{n}$, i.e. the mean of the log values; exponentiating gives GM.
- Used with skewed distributions or logarithms.

Example of geometric mean's calculation

  Subject           Glucose   Log glucose
   1                 88       4.477337
   2                 86       4.454347
   3                 93       4.532599
   4                 79       4.369448
   5                 83       4.418841
   6                 98       4.584967
   7                 74       4.304065
   8                 96       4.564348
   9                 95       4.553877
  10                 78       4.356709
  11                 75       4.317488
  12                 98       4.584967
  13                 90       4.49981
  14                108       4.682131
  15                 81       4.394449
  16                108       4.682131
  17                 76       4.330733
  18                 97       4.574711
  19                 72       4.276666
  20                106       4.663439
  Sum              1781       89.62306
  Arithmetic mean  89.05      4.481153
  Geometric mean   88.33649

The median
- A measure of central tendency. It is the middle observation, i.e., the one that divides the distribution of values into halves. It is also equal to the 50th percentile.
- Median's calculation:
  - Arrange the observations in ascending or descending order.
  - Count in to find the middle. Odd number of observations: the middle value. Even number of observations: the mean of the two middle values.
- Less sensitive to extreme values than the mean.
- Can be used with numerical scales.
- Can be used with ordinal scales.

Example of median's calculation
Glucose values in ascending order: 72, 74, 75, 76, 78, 79, 81, 83, 86, 88, 90, 93, 95, 96, 97, 98, 98, 106, 108, 108
Median = (88 + 90) / 2 = 89
SAS 9.4 output: Median 89.00000

The mode
- The value of a numerical variable that occurs the most frequently.
- Mode's calculation:
  - Count the number of times each value occurs.
  - The mode is the value that is most frequent.
- Some data might not have a mode.
- Some data might have two modes (bimodal).
- Some data might have more than two modes (multimodal).
- A modal class can be estimated, which is the interval that has the largest number of observations.
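The three measures of central tendency, plus the geometric mean, can be recomputed for the 20 glucose readings with Python's standard `statistics` module. This is a sketch for checking the worked examples, assuming Python 3.8 or later (for `geometric_mean` and `multimode`); it is not the SAS procedure that produced the lecture's output.

```python
import math
import statistics

# Glucose readings (mg/dl) of the 20 subjects in the worked examples.
glucose = [88, 86, 93, 79, 83, 98, 74, 96, 95, 78,
           75, 98, 90, 108, 81, 108, 76, 97, 72, 106]

arithmetic_mean = statistics.mean(glucose)     # sum of values / n
median = statistics.median(glucose)            # mean of the two middle values (n is even)
modes = statistics.multimode(glucose)          # every value tied for most frequent

# Geometric mean in the two steps described in the lecture:
# take the mean of the natural logs, then exponentiate.
mean_log = sum(math.log(x) for x in glucose) / len(glucose)
geometric_mean = math.exp(mean_log)

print(arithmetic_mean, median, modes)
print(round(mean_log, 6), round(geometric_mean, 5))
print(round(statistics.geometric_mean(glucose), 5))  # library equivalent
```

`multimode` reports both 98 and 108 because each occurs twice; software that prints a single mode, as in the SAS output above, is showing only one of the tied values.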

Example of mode's calculation
Glucose values in ascending order: 72, 74, 75, 76, 78, 79, 81, 83, 86, 88, 90, 93, 95, 96, 97, 98, 98, 106, 108, 108
Modes: 98 and 108

Use of measures of central tendency
What is the best measure for a particular dataset? The choice depends on:
- Type of scale
  - Numerical: arithmetic mean or median
  - Ordinal: median
  - Logarithmic scale: geometric mean
- Distribution
  - Symmetrical (the same shape on both sides of the mean): arithmetic mean or median
  - Skewed (outliers in one direction): median
  - Bimodal: mode

Measures of spread (dispersion)
Index or summary numbers that describe the spread of observations about the middle value.
Types:
- Range
- Standard deviation
- Coefficient of variation
- Percentiles
- Interquartile range

The range
- The difference between the largest and the smallest observation.
- Range's calculation:
  - Rank the data.
  - Range = largest value − smallest value.
- Sometimes the minimum and maximum values are displayed instead of the range.

Example of range's calculation
Glucose values in ascending order: 72, 74, 75, 76, 78, 79, 81, 83, 86, 88, 90, 93, 95, 96, 97, 98, 98, 106, 108, 108
Range = 108 − 72 = 36, or present the lower and upper values (72, 108).
SAS 9.4 output: Range 36.00000

The standard deviation
- The most common measure of spread, denoted by σ in the population and by SD or s in the sample.
- It can be used with the mean to describe the distribution of observations.
- It is the square root of the average of the squared deviations of the observations from their mean.
- SD's calculation (in a sample, the sum of squared deviations is divided by n − 1): $SD = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}$
- Other computational formulas exist.

The standard deviation
- SD is used in many statistical tests.
- It can be used with the mean to describe the distribution of observations.
- If mean − 2 SD contains zero (i.e., the mean is smaller than 2 SD), the observations are probably skewed.
- Characteristics of SD:
  - If the distribution is bell shaped:
    - 68% of observations lie between mean ± 1 SD
    - 95% of observations lie between mean ± 2 SD
    - 99.7% of observations lie between mean ± 3 SD
  - Regardless of the shape, at least 75% of observations lie between mean ± 2 SD.

Example of SD's calculation

  Subject  Glucose  X − mean   (X − mean)²
   1        88       −1.05       1.1025
   2        86       −3.05       9.3025
   3        93        3.95      15.6025
   4        79      −10.05     101.0025
   5        83       −6.05      36.6025
   6        98        8.95      80.1025
   7        74      −15.05     226.5025
   8        96        6.95      48.3025
   9        95        5.95      35.4025
  10        78      −11.05     122.1025
  11        75      −14.05     197.4025
  12        98        8.95      80.1025
  13        90        0.95       0.9025
  14       108       18.95     359.1025
  15        81       −8.05      64.8025
  16       108       18.95     359.1025
  17        76      −13.05     170.3025
  18        97        7.95      63.2025
  19        72      −17.05     290.7025
  20       106       16.95     287.3025
  Sum     1781                 2548.95
  Mean     89.05
  SD       11.58254

SAS 9.4 output: Std Deviation 11.58254, Variance 134.15526
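The table above can be retraced step by step in code. The sketch below assumes the sample formula with an n − 1 divisor, which is what reproduces the SD of 11.58254 shown in the SAS output; the variable names are illustrative.

```python
import math
import statistics

# Glucose readings (mg/dl) of the 20 subjects in the worked examples.
glucose = [88, 86, 93, 79, 83, 98, 74, 96, 95, 78,
           75, 98, 90, 108, 81, 108, 76, 97, 72, 106]

n = len(glucose)
mean = sum(glucose) / n

# Deviation of each observation from the mean, and its square.
deviations = [x - mean for x in glucose]
sum_of_squares = sum(d * d for d in deviations)

variance = sum_of_squares / (n - 1)   # sample variance (n - 1 divisor)
sd = math.sqrt(variance)              # sample standard deviation

data_range = max(glucose) - min(glucose)

print(round(sum_of_squares, 2), round(variance, 5), round(sd, 5), data_range)
print(round(statistics.stdev(glucose), 5))  # library equivalent of sd
```

Dividing by n instead of n − 1 would give the population form of the formula and a slightly smaller value.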

The coefficient of variation
- The standard deviation divided by the mean.
- It is used to obtain a measure of relative variation, i.e. variation relative to the size of the mean.
- CV's calculation: $CV = \frac{SD}{\bar{X}}$ (often expressed as a percentage)
- Commonly used in quality control.

Percentiles
- A number that indicates the percentage of a distribution that is less than or equal to that number.
- Commonly used to compare individual values to a norm, e.g. growth charts.
- Used to determine normal laboratory ranges: the interval between the 2½ and 97½ percentiles contains the central 95% of the distribution.

Quantiles (SAS 9.4 output)

  Level         Quantile
  100% Max      108.0
  99%           108.0
  95%           108.0
  90%           107.0
  75% Q3         97.5
  50% Median     89.0
  25% Q1         78.5
  10%            74.5
  5%             73.0
  1%             72.0
  0% Min         72.0

Interquartile range
- The difference between the 25th percentile (first quartile) and the 75th percentile (third quartile).
- It contains the central 50% of the distribution.
- Some authors present the first and third quartile values instead of the difference.

Interquartile range, example
Glucose values in ascending order: 72, 74, 75, 76, 78, 79, 81, 83, 86, 88, 90, 93, 95, 96, 97, 98, 98, 106, 108, 108
Interquartile range = 97.5 − 78.5 = 19, or present the first and third quartiles (78.5, 97.5).
SAS 9.4 output: Interquartile Range 19.00000
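Percentile values depend on the interpolation rule a program uses, so different software can report slightly different quartiles for the same data. The sketch below uses one common rule, chosen here because it reproduces the quantiles in the output above: when n × p/100 is a whole number j, average the j-th and (j+1)-th ordered values, otherwise take the next ordered value. That this matches the statistical package's default is an assumption, and the function name `percentile` is hypothetical.

```python
import statistics

# Glucose readings (mg/dl) of the 20 subjects in the worked examples.
glucose = [88, 86, 93, 79, 83, 98, 74, 96, 95, 78,
           75, 98, 90, 108, 81, 108, 76, 97, 72, 106]

def percentile(values, pct):
    """Percentile for an integer pct in 0..100 (averaging rule at exact ranks)."""
    ordered = sorted(values)
    n = len(ordered)
    # Integer arithmetic avoids floating-point trouble when n * pct / 100 is whole.
    j, remainder = divmod(n * pct, 100)
    if j >= n:                      # pct = 100: the maximum
        return float(ordered[-1])
    if remainder > 0:               # fractional rank: take the next ordered value
        return float(ordered[j])
    if j == 0:                      # pct = 0: the minimum
        return float(ordered[0])
    return (ordered[j - 1] + ordered[j]) / 2   # whole rank: average two neighbours

q1, q3 = percentile(glucose, 25), percentile(glucose, 75)
iqr = q3 - q1

# Coefficient of variation, expressed as a percentage of the mean.
cv_percent = 100 * statistics.stdev(glucose) / statistics.mean(glucose)

print(q1, q3, iqr, round(cv_percent, 2))
```

By comparison, `statistics.quantiles` in the standard library interpolates between neighbouring values and would give slightly different quartiles for these data.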

Use of measures of spread
What is the best measure for a particular dataset? The choice depends on:
- Type of measure of central tendency
  - Mean: standard deviation
  - Median: interquartile range
- Distribution
  - Symmetrical (the same shape on both sides of the mean): standard deviation or interquartile range
  - Skewed (outliers in one direction): interquartile range
- Purpose
  - Compare to norms: percentiles
  - Compare distributions measured on different scales: coefficient of variation
  - Describe the central 50% of the distribution: interquartile range
  - Emphasize the extreme values: range

Error bar plots
- A graph that displays the mean and a measure of spread for one or more groups.
- Deciphering the error bar plot:
  - The circle: the mean.
  - The bars: the standard deviation. Some authors present the standard error.

The proportions and percentages
- Proportion definition: the number of observations with the characteristic of interest divided by the total number of observations.
- Proportion's calculation: if the data contain two groups a and b, then the proportion of a is $\frac{a}{a + b}$.
- Can be used with nominal scales, ordinal scales, and numerical scales.
- Percentage: the proportion multiplied by 100%.

The ratios
- A part divided by another part. It is the number of observations WITH the characteristic of interest divided by the number of observations WITHOUT the characteristic of interest.
- Ratio's calculation: if the data contain two groups a and b, then the ratio of a to b is $\frac{a}{b}$.
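The difference between a proportion and a ratio is easy to blur, so here is a small sketch contrasting them. The counts are taken from the lecture's cholesterol frequency table (17 women had their cholesterol checked, 10 did not); the function names are illustrative assumptions.

```python
def proportion(a, b):
    """Part divided by the whole: a / (a + b)."""
    return a / (a + b)

def percentage(a, b):
    """Proportion multiplied by 100."""
    return 100 * proportion(a, b)

def ratio(a, b):
    """Part divided by another part: a / b."""
    return a / b

# Women with cholesterol checked (a) and not checked (b).
checked, not_checked = 17, 10

print(round(proportion(checked, not_checked), 4))   # share of women who were checked
print(round(percentage(checked, not_checked)))      # same share as a whole-number percentage
print(ratio(checked, not_checked))                  # checked per one not checked
```

A proportion always lies between 0 and 1 because the numerator is part of the denominator, whereas a ratio can exceed 1, as it does here.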

The rates
- A proportion associated with a multiplier, called the base (e.g., 1000 or 100,000), and computed over a specified period.
- Rate's calculation: if the data contain two groups a and b, then the rate of a is $\frac{a}{a + b} \times \text{base}$.

The rates
- Use of rates in epidemiology and medicine:
  - Mortality rates
  - Cause-specific mortality rates
  - Morbidity rates
- Adjusting rates:
  - Why might a crude rate not be suitable? When comparing populations with dissimilar characteristics such as age, gender, or race.
  - Types: direct adjustment, indirect adjustment.
  - Details of the calculations will be covered in the epidemiology and public health thread class.

One of the problems in the analysis of frequency distributions is SKEWNESS
- Horizontal stretching of the distribution: the right and left sides of the distribution are not mirror images, i.e. one tail is longer than the other.
- The tail indicates the direction and type of skewed distribution:
  - Tail pointing to the right: skewed to the right (positively skewed).
  - Tail pointing to the left: skewed to the left (negatively skewed).
- The mean follows the tail regardless of the type of skewed distribution.
- The sequence from the tail to the apex is mean, median, mode (note that this is alphabetical order):
  - Mean > median > mode: skewed to the right (positively skewed).
  - Mean < median < mode: skewed to the left (negatively skewed).
Graph source: http://www.statisticshowto.com/wpcontent/uploads/2014/02/pearson-mode-skewness.jpg

Statistics in medicine, Lecture 1, part 4: Describing variation, and graphical presentation

There are several ways to depict the frequency distribution of a continuous variable
- Histograms
- Frequency polygons
- Line graphs
- Stem-and-leaf diagrams
- Quantiles
- Boxplots

Frequency distribution is usually presented with a histogram
- A bar graph of a frequency distribution of numerical observations.
- Steps of creating a histogram:
  - Decide on the number of non-overlapping intervals (statistical software might determine this automatically).
  - Put the intervals on the x-axis.
  - Put the numbers or percentages on the y-axis. Percentages are used to compare two histograms based on different sample sizes.
  - The frequencies/percentages are presented with bars.
- The area of each bar is in proportion to the percentage of individuals in that interval.
- Combining observations into intervals gives a smoother curve compared to histograms of individual values.

[Figure: Histogram of Glucose level. x-axis: glucose level (60 to 210); y-axis: frequency. Minitab 17 output]

Interpreting the graph
Most participants had a fasting blood glucose level of 65 to 125. Only two participants had a blood glucose level less than 60 mg/dl. Additionally, the distribution is skewed to the right (positively skewed); several participants had a fasting blood glucose level much higher than the target of ≤ 125 mg/dl.

Frequency polygons are another presentation of the frequency distribution
- Frequency polygon definition: a line graph connecting the mid-points of the tops of the columns of a histogram.
- It is useful in comparing two frequency distributions.
- Steps of creating frequency polygons:
  - Create a histogram.
  - Connect the mid-points of the tops of the columns of the histogram.
[Figure: Frequency polygon of Glucose level. x-axis: glucose level (60 to 210); y-axis: frequency. Minitab 17 output]

Percentage polygons
- Percentage polygon definition: a line graph connecting the mid-points of the tops of the columns of a histogram based on percentages instead of counts.
- It is useful in comparing two or more frequency distributions when the frequencies are not equal.
- Steps of creating percentage polygons:
  - Create a histogram based on percentages.
  - Connect the mid-points of the tops of the columns of the histogram.
  - Extend the line from the mid-points of the first and last columns to the x-axis.
[Figure: Percentage polygon. x-axis: glucose level (60 to 210); y-axis: percent]

Stem-and-leaf plots
A graphical display for numerical data.
- It is similar to both a frequency table and a histogram.
- Used for tallying observations.
- Steps of creating a stem-and-leaf plot:
  - Decide on the number of non-overlapping intervals.
  - Draw a vertical line.
  - Put the first digits of each interval on the left side of the vertical line (the stem).
  - For each individual, put the second digit on the right side of the vertical line (the leaves). If the observation is one digit, that digit is the leaf.
  - Reorder the leaves from lowest to highest within each interval.
  - Count from either end to locate the median.

Stem-and-Leaf Display: Glucose level
Stem-and-leaf of Glucose level, N = 180, Leaf Unit = 1.0 (Minitab 17 output; the vertical line was added manually)

     n   Stem | Leaf
     1     5  | 2
     7     6  | 678899
    46     7  | 000000012235555555555556677778888889999
    82     8  | 000000000002233555555555666678899999
   (24)    9  | 000000002225555555555568
    74    10  | 00000000000033338
    57    11  | 000000000005
    45    12  | 000000124
    36    13  | 00035559
    28    14  | 000035555599
    16    15  | 0000558
     9    16  | 055558
     3    17  | 02
     1    18  |
     1    19  |
     1    20  |
     1    21  |
     1    22  | 0

The median is in the line marked (24): median = 91.
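The steps for building a stem-and-leaf plot translate directly into code. The sketch below is a simplified illustration, not the Minitab routine: it assumes whole-number observations with a leaf unit of 1 (stem = all digits but the last, leaf = last digit), uses the 20 glucose readings from the lecture's worked examples rather than the 180-participant data, and the function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict

# Glucose readings (mg/dl) of the 20 subjects in the worked examples.
glucose = [88, 86, 93, 79, 83, 98, 74, 96, 95, 78,
           75, 98, 90, 108, 81, 108, 76, 97, 72, 106]

def stem_and_leaf(values):
    """Return the display as a list of text lines, one per stem."""
    leaves_by_stem = defaultdict(list)
    # Sorting first means the leaves are already ordered within each stem.
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        leaves_by_stem[stem].append(leaf)
    lines = []
    # Include every stem in the range, even those without observations.
    for stem in range(min(leaves_by_stem), max(leaves_by_stem) + 1):
        leaves = "".join(str(leaf) for leaf in leaves_by_stem.get(stem, []))
        lines.append(f"{stem:>3} | {leaves}")
    return lines

display = stem_and_leaf(glucose)
print("\n".join(display))
```

Because every leaf is kept, the original values can be read back from the display, which a histogram with the same intervals would not allow.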

Box plots (box-and-whisker plot)
- A graph that summarizes the data by displaying the minimum, first quartile, median, third quartile, and maximum statistics.
- It can be created from the information displayed in a stem-and-leaf plot or a frequency table.

Box plots (box-and-whisker plot)
Deciphering the box-and-whisker plot:
- The box:
  - The top of the box is the third quartile.
  - The bottom of the box is the first quartile.
  - The length of the box is the interquartile range.
  - The median is presented with a horizontal line in the box.
  - The mean is presented with a plus sign in the box (some programs).
- The whiskers depict the minimum and the maximum values.
Source: http://www.physics.csbsju.edu/stats/simpl e.box.defs.gif

[Figure: Boxplot of Glucose level, y-axis from 50 to 225, with the mean (101.033) and median (91) marked]

Interpreting the results
The boxplot shows:
- The range (whiskers) is (52, 172).
- The longer upper whisker and the larger box area above the median indicate that the data are skewed to the right (positively skewed).
- The median is 91.
- The mean is 101.033.
- The interquartile range is (79, 118.75).
- One outlier is present.

Tabular and graphical presentation of nominal and ordinal data
Contingency frequency tables: a table used to display counts and/or frequencies for two or more nominal or quantitative variables.

  Gender   Post graduate   College   High school
  Male     1               3         3
  Female   5               6         2

Tabular and graphical presentation of nominal and ordinal data
Bar charts
- A graph used with nominal characteristics to display the numbers or percentages of observations with the characteristic of interest.
- The categories are placed on the x-axis.
- The numbers or percentages are placed on the y-axis.

Dot plots
- A graphical presentation using dots.

Graphs for two characteristics
Two characteristics are nominal:
- Bar charts
- Dot plots

Graphs for two characteristics
One characteristic is nominal and the other is numerical:
- Box plots
- Error plots
[Figures: error plots and box plots, SAS 9.4 output]

Graphs for two characteristics
Two characteristics are numerical:
- Scatterplots (bivariate plots): a two-dimensional graph displaying the relationship between two numerical characteristics or variables.
- Creating a scatterplot:
  - If the data do not have an outcome and a predictor, the choice of the x and y axes does not matter.
  - If the data have an outcome and a predictor, put the explanatory variable (risk factor, predictor) on the x-axis and the outcome on the y-axis.
  - Put a circle for each observation at the point of intersection of its x and y values.
[Figure: scatter plots, SAS 9.4 output]

Quiz
A pharmaceutical company tested the effect of sofosbuvir (a new HCV drug) on sustained viral response (SVR) in four HCV genotypes. In genotypes 1, 2, 3, and 4, the drug was shown to cause SVR in 90%, 93%, 84%, and 96% of the patients, respectively. What type of graphical depiction is best suited to show the data?
A. Pie chart
B. Venn diagram
C. Bar diagram
D. Histogram