UNIVERSITY OF CALGARY. Measuring Observer Agreement on Categorical Data. Andrea Soo A THESIS SUBMITTED TO THE FACULTY OF GRADUATE STUDIES


UNIVERSITY OF CALGARY

Measuring Observer Agreement on Categorical Data

by

Andrea Soo

A THESIS SUBMITTED TO THE FACULTY OF GRADUATE STUDIES IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

GRADUATE PROGRAM IN COMMUNITY HEALTH SCIENCES

CALGARY, ALBERTA

April, 2015

© Andrea Soo 2015

Abstract

In order for a patient to receive proper and appropriate health care, one requires error-free assessment of clinical measurements. For example, a diagnostic test that classifies an individual as having or not having a disease needs to produce accurate and reliable results in order to ensure that an individual who needs treatment receives the correct therapy. Agreement and reliability studies aim to evaluate the accuracy and consistency of diagnostic tests or measurement tools. A model developed by Shoukri and Donner allows for the concurrent assessment of inter-rater (between-rater) agreement and intra-rater (within-rater) reliability by incorporating two measurements per rater per subject. The main purpose of this research was to develop methods for the maximum likelihood (ML) approach using the Shoukri-Donner model and to compare those methods to the method of moments (MM) approach using Monte Carlo computer simulation studies. Little difference between ML and MM was observed in point estimation. In general, the MM Wald test and MM confidence interval (CI) performed better than any of the other methods. In fact, the goodness of fit (GOF) test and GOF CI (for both ML and MM) were shown to have high empirical type I errors and low coverage levels, respectively, for the inter-rater agreement parameter in some parameter combinations for the 3 parameter case and in all considered parameter combinations for the 4 parameter case. The poor performance of the GOF approach requires further investigation before it could be recommended as a better alternative to the MM approach, and it does not appear that the ML approach is necessarily better than the MM approach. Lastly, extending this research to a more general 5 parameter model requires the resolution of several issues before it can be evaluated in point estimation, hypothesis testing, and CI construction.

Acknowledgements

I would like to thank my supervisors, Dr. Misha Eliasziw and Dr. Gordon Fick, for their guidance and mentorship during my studies. I'd also like to thank Dr. Fick for the many invaluable breakthrough sessions and discoveries we had leading up to the completion of this thesis. I would like to thank my committee members, Dr. Brenda Hemmelgarn and Dr. Alberto Nettel-Aguirre, for providing their invaluable feedback and advice. I would also like to thank Dr. Xuewen Lu and Dr. Julie Horrocks for participating in my examination and providing me with additional suggestions that improved this thesis. I am grateful for receiving an Alberta Innovates - Health Solutions (AIHS) PhD studentship which supported this research. Lastly, I would like to thank my family and friends for their support and encouragement throughout this process. I could not have done this without you. Especially to my dad, mom and brother, thank you for your patience these past few years and for supporting me in more ways than you can imagine. And to Nathan, thank you for your patience, encouragement, and believing in me.

Table of Contents

Abstract
Acknowledgements
Table of Contents
List of Tables
List of Figures and Illustrations
List of Symbols, Abbreviations and Nomenclature

1 Introduction to Agreement and Reliability
   Association
   Reliability and Agreement
   Concurrent Assessment of Intra-rater Reliability and Inter-rater Agreement
   Example: Percentage of Diffusion-Perfusion Mismatch
   Research Objectives

2 Measuring Agreement and Reliability
   Quantitative Measures
      Concordance Correlation Coefficient
      Intraclass Correlation Coefficient
   Categorical Measures
   Adjusted Measures
      Chance-Corrected Measures
      Other Measures
   Models for Reliability and Agreement
      Log-Linear and Latent Class Models
      Common Correlation Model

3 Shoukri and Donner Model
   The 4 Parameter Shoukri and Donner Model
   The 3 Parameter Shoukri and Donner Model
   Method of Moments Point Estimation
   Maximum Likelihood Point Estimation
   Confidence Intervals and Hypothesis Testing
      Method of Moments Variance Estimation
      Maximum Likelihood Variance Estimation
      Wald-Type Confidence Intervals
      Goodness of Fit Approach for Confidence Intervals
      Profile Likelihood Confidence Intervals
      Wald-Type Hypothesis Testing
      Goodness of Fit Approach for Hypothesis Testing
   Simulation Studies
      Evaluation
      Generating the Data
      Point Estimates
      Hypothesis Testing
      Confidence Intervals
   Monte Carlo Simulation Results For the 3 Parameter Model
      Point Estimates
      Hypothesis Testing
      Confidence Intervals
   Monte Carlo Simulation Results For the 4 Parameter Model
      Point Estimates
      Hypothesis Testing

      3.8.3 Confidence Intervals
   Special Cases
      Relative Log-Likelihood
      Example of Perfect Reliability for One of the Raters

4 Extended Shoukri-Donner Model - Beginning Steps
   Sarmanov Distribution
   The 5 Parameter Shoukri and Donner Model
   Method of Moments Point Estimation
   Method of Moments Variance Estimation

5 Application to Data
   Example 1: Intra-rater Reliability and Inter-rater Agreement of Assessing Percentage of Diffusion-Perfusion Mismatch
   Example 2: Intra-rater Reliability and Inter-rater Agreement of Assessing Presence of Dysplasia
   Example 3: Intra-rater Reliability and Inter-rater Agreement of Assessing Mammograms
   Example 4: Intra-rater Reliability and Inter-rater Agreement of Lumen Narrowing

6 Conclusions and Recommendations
   Major Findings and Discussion
      Shoukri and Donner Model for 3 and 4 Parameters
      Special Cases
      Shoukri and Donner Model for 5 Parameters
   Recommendations
   Conclusion and Future Directions

Bibliography

List of Tables

1.1 Basic 2x2 contingency table
1.2 2x2 table for diagnostic tests
1.3 Example of high disagreement and high association
1.4 Binary data for 13 subjects being rated by 2 raters twice
1.5 Mismatch data at first time point
1.6 Mismatch data at second time point
2.1 Strength of agreement benchmarks for CCC
2.2 Estimate of the intraclass correlation for 2 raters each rating n subjects
2.3 Basic 2x2 table
2.4 Strength of agreement benchmarks for kappa
2.5 Common correlation model
3.1 Study design for 2 raters each rating n subjects twice
3.2 Joint probability distribution of the sum of ratings from rater 1 (X_i1) and rater 2 (X_i2)
3.3 Joint frequency table of the sum of the ratings from rater 1 (X_i1) and rater 2 (X_i2)
3.4 Method of moments estimates for the 3 parameter Shoukri and Donner model
3.5 Method of moments estimates for the 4 parameter Shoukri and Donner model
3.6 Category groupings for goodness of fit test for 3 parameter model
3.7 Category groupings for goodness of fit test for 4 parameter model
3.8 Observed and expected frequencies for goodness of fit test for mammogram data
3.9 Joint frequency table of the sum of the ratings from rater 1 (X_i1) and rater 2 (X_i2) for n_01 = n_11 = n_21 = 0
3.10 Parameter estimates for data in Table 3.9
4.1 Method of moments estimates for the 5 parameter Shoukri and Donner model
5.1 Joint frequency table of the sum of the ratings from rater 1 (X_i1) and rater 2 (X_i2) for mismatch data
5.2 Parameter estimates for mismatch data
5.3 Confidence intervals for mismatch data
5.4 P-values for mismatch data
5.5 Joint frequency table of the sum of the ratings from pathologist 1 (X_i1) and pathologist 2 (X_i2) for dysplasia data
5.6 Parameter estimates for dysplasia data
5.7 Confidence intervals for dysplasia data
5.8 P-values for dysplasia data

5.9 Joint frequency table of the sum of the ratings from radiologist 1 (X_i1) and radiologist 2 (X_i2) for mammogram data
5.10 Parameter estimates for mammogram data
5.11 Confidence intervals for mammogram data
5.12 P-values for mammogram data
5.13 Joint frequency table of the sum of the ratings from rater 1 (X_i1) and rater 2 (X_i2) for lumen narrowing data
5.14 Parameter estimates for lumen narrowing data
5.15 Confidence intervals for lumen narrowing data
5.16 P-values for lumen narrowing data

List of Figures and Illustrations

3.1 General example of a log profile likelihood depending on parameter θ
3.2 Empirical bias when π =
3.3 Empirical bias when π =
3.4 Empirical bias when π =
3.5 Empirical bias when π =
3.6 Empirical bias when π =
3.7 Empirical bias when π =
3.8 Distribution of deviation from the true value for ρ_w and ρ_b when π =
3.9 Distribution of deviation from the true value for ρ_w and ρ_b when π =
3.10 Distribution of deviation from the true value for ρ_w and ρ_b when π =
3.11 Plot of MME vs MLE for n = 75 subjects when π =
3.12 Plot of MME vs MLE for n = 75 subjects when π =
3.13 Plot of MME vs MLE for n = 75 subjects when π =
3.14 Empirical RMSE when π =
3.15 Empirical RMSE when π =
3.16 Empirical RMSE when π =
3.17 Empirical RMSE when π =
3.18 Empirical RMSE when π =
3.19 Empirical RMSE when π =
3.20 Empirical type I error levels when π =
3.21 Empirical type I error levels when π =
3.22 Empirical type I error levels when π =
3.23 Empirical type I error levels when π =
3.24 Empirical type I error levels when π =
3.25 Empirical type I error levels when π =
3.26 Empirical type I error levels when π =
3.27 Empirical type I error levels when π =
3.28 Empirical type I error levels when π =
3.29 Empirical coverage levels when π =
3.30 Empirical coverage levels when π =
3.31 Empirical coverage levels when π =
3.32 Empirical coverage levels when π =
3.33 Empirical coverage levels when π =
3.34 Empirical coverage levels when π =
3.35 Empirical coverage levels when π =
3.36 Empirical coverage levels when π =
3.37 Empirical coverage levels when π =
3.38 Empirical median widths when π =
3.39 Empirical median widths when π =
3.40 Empirical median widths when π =
3.41 Empirical median widths when π =
3.42 Empirical median widths when π =

3.43 Empirical median widths when π =
3.44 Empirical median widths when π =
3.45 Empirical median widths when π =
3.46 Empirical median widths when π =
3.47 Empirical bias when π = 0.1 for the 4 parameter model
3.48 Empirical bias when π = 0.1 for the 4 parameter model
3.49 Empirical bias when π = 0.3 for the 4 parameter model
3.50 Empirical bias when π = 0.3 for the 4 parameter model
3.51 Empirical RMSE when π = 0.1 for the 4 parameter model
3.52 Empirical RMSE when π = 0.1 for the 4 parameter model
3.53 Empirical RMSE when π = 0.3 for the 4 parameter model
3.54 Empirical RMSE when π = 0.3 for the 4 parameter model
3.55 Empirical RMSE when π = 0.5 for the 4 parameter model
3.56 Empirical RMSE when π = 0.5 for the 4 parameter model
3.57 Empirical type I error levels when π = 0.1 for the 4 parameter model
3.58 Empirical type I error levels when π = 0.1 for the 4 parameter model
3.59 Empirical type I error levels when π = 0.3 for the 4 parameter model
3.60 Empirical type I error levels when π = 0.3 for the 4 parameter model
3.61 Empirical type I error levels when π = 0.5 for the 4 parameter model
3.62 Empirical type I error levels when π = 0.5 for the 4 parameter model
3.63 Empirical coverage levels when π = 0.1 for the 4 parameter model
3.64 Empirical coverage levels when π = 0.1 for the 4 parameter model
3.65 Empirical coverage levels when π = 0.3 for the 4 parameter model
3.66 Empirical coverage levels when π = 0.3 for the 4 parameter model
3.67 Empirical coverage levels when π = 0.5 for the 4 parameter model
3.68 Empirical coverage levels when π = 0.5 for the 4 parameter model
3.69 Special case example: Contour plots for n_01 = n_11 = n_21 = 0 (perfect reliability for rater 2)
3.70 Special case example: ρ_c2 vs partial derivative of log-likelihood for varying values of ρ_c
3.71 Special case example: ρ_c2 vs partial derivative of log-likelihood for varying values of π
3.72 Special case example: ρ_c2 vs partial derivative of log-likelihood for varying values of ρ_b
3.73 Profile log-likelihood graphs of ρ_b, ρ_w1 and ρ_w2 for n_01 = n_11 = n_21 = 0
3.74 Profile log-likelihood graphs for mismatch data

List of Symbols, Abbreviations and Nomenclature

Abbreviation   Definition
CCC    Concordance Correlation Coefficient
CDF    Cumulative Distribution Function
CI     Confidence Interval
CIA    Coefficient of Individual Agreement
CIE    Coefficient of Individual Equivalence
DWI    Diffusion-Weighted MRI
FN     False Negative
FP     False Positive
GOF    Goodness of Fit
ICC    Intraclass Correlation Coefficient
LRT    Likelihood Ratio Test
ML     Maximum Likelihood
MLE    Maximum Likelihood Estimate
MM     Method of Moments
MME    Method of Moments Estimate
MR     Magnetic Resonance
MRI    Magnetic Resonance Imaging
MSD    Mean Squared Deviation
MSE    Mean Square Error
MSR    Mean Square Rater
MSS    Mean Square Subject
NPV    Negative Predictive Value
OR     Odds Ratio

PABAK  Prevalence Adjusted Bias Adjusted Kappa
PL     Profile Likelihood
PPV    Positive Predictive Value
PWI    Perfusion-Weighted MRI
RD     Risk Difference
RMSE   Root Mean Square Error
rmtt   Relative Mean Transit Time
RR     Risk Ratio
TN     True Negative
TP     True Positive

Chapter 1
Introduction to Agreement and Reliability

In order for a patient to receive proper and appropriate health care, one requires error-free assessment of clinical measurements. These clinical measurements are used to determine the appropriate course of action, such as whether or not a patient should receive a treatment or therapy. A diagnostic test that classifies an individual as having or not having a disease needs to produce accurate (close or identical to the true value) and reliable (consistent) results in order to ensure that an individual who needs treatment receives the correct therapy. These results need to be the same whether they are recorded by one specific physician at different visits on the same patient or by two different physicians on the same patient. One example of the need for agreement and reliability studies is the introduction of a new diagnostic test or measurement tool that may be preferred over a gold standard or older test for practical reasons such as cost, efficiency, and invasiveness. Another example is comparing the agreement between administrative data and chart data: researchers often use administrative data to determine the presence or absence of a medical condition instead of examining patient charts, which is much more time-consuming. However, it is necessary to determine the performance of the new diagnostic test, measurement tool, or administrative data before implementation by quantifying the level of accuracy and reliability. Agreement and reliability studies are investigations in which diagnostic tests or measurement tools are evaluated to determine their accuracy and reliability, respectively. The terms agreement and reliability are part of the broader category of association. Thus, when we have reliability and/or agreement between variables or measurements, we also have association. However, when there is an association between variables or measurements, we do not necessarily have agreement and/or reliability. Differences between association, agreement and reliability will be further discussed below.

1.1 Association

There are many ways to analyze data from a basic 2x2 study. In a 2x2 study, association measures such as the odds ratio, sensitivity, specificity, positive predictive value, negative predictive value, risk ratios, and risk differences [5, 6, 12, 20, 26, 33, 53, 62] are used to describe the relationship between two variables that have two categories each. For example, one can describe the relationship between an exposure (e.g. smoking status) and a disease (e.g. whether or not an individual develops heart disease). Consider Table 1.1, a basic 2x2 contingency table that summarizes the disease and exposure results collected by this 2x2 study on a total of n subjects. The frequency of each cell is shown. We define the association measures mentioned above in terms of these frequencies.

                         Disease
                    Yes       No        Total
Exposure   Yes       a         b        a + b
           No        c         d        c + d
Total              a + c     b + d     n = a + b + c + d

Table 1.1: Basic 2x2 contingency table

The first measure we discuss is the odds ratio [12, 20]. The estimated odds of exposed individuals developing the disease are odds_1 = a/b, while the estimated odds of unexposed individuals developing the disease are odds_2 = c/d.

Thus, the odds ratio (OR) is estimated as the ratio of these two odds [62]:

    OR = odds_1 / odds_2 = (a/b) / (c/d) = ad / (bc).

An odds ratio can range from 0 to ∞ [12]. If there is no difference between the odds of developing the disease for the two exposure statuses, the estimated OR will be 1. An OR of 1 indicates no association or relationship between disease and exposure; the further the deviation from 1, the stronger the association or relationship. If the OR is 2, for example, then the odds of developing the disease for exposed individuals are 2 times the odds of developing the disease for unexposed individuals. An OR < 1 indicates that exposed individuals have lower odds, while an OR > 1 indicates that exposed individuals have higher odds.

Next we discuss the risk ratio [12, 62]. The estimated risk (or proportion) of exposed individuals (e.g. smokers) developing the disease is

    risk_1 = a / (a + b),

while the estimated risk of unexposed individuals (e.g. non-smokers) developing the disease is

    risk_2 = c / (c + d).

The risk ratio (RR), or relative risk, is defined as the ratio of the proportion of exposed individuals developing the disease to the proportion of unexposed individuals developing the disease [62]:

    RR = risk_1 / risk_2 = [a / (a + b)] / [c / (c + d)].

It can also range from 0 to ∞. Similar to an OR, an RR of 1 indicates no association or relationship between disease and exposure. An RR < 1 indicates that exposed individuals have a lower risk, while an RR > 1 indicates that exposed individuals have a higher risk.

The risk difference (RD) is defined as the difference between the proportions:

    RD = risk_1 − risk_2 = a / (a + b) − c / (c + d).
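These measures are simple arithmetic on the four cell counts of Table 1.1. A minimal sketch in Python (the function name and the illustrative counts are ours, not from the thesis):

```python
def association_measures(a, b, c, d):
    """Association measures for a 2x2 table laid out as in Table 1.1:
    a = exposed with disease, b = exposed without disease,
    c = unexposed with disease, d = unexposed without disease."""
    odds_ratio = (a * d) / (b * c)   # OR = (a/b) / (c/d) = ad/(bc)
    risk1 = a / (a + b)              # risk of disease among the exposed
    risk2 = c / (c + d)              # risk of disease among the unexposed
    return odds_ratio, risk1 / risk2, risk1 - risk2  # OR, RR, RD

# Hypothetical counts: 30 of 100 smokers and 10 of 100 non-smokers develop disease.
print(association_measures(30, 70, 10, 90))  # OR ≈ 3.86, RR ≈ 3.0, RD ≈ 0.2
```

Here the OR, RR, and RD all indicate a positive association between exposure and disease, each deviating from its respective null value (1, 1, and 0).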

A risk difference can range from -1 to 1. A risk difference of 0 indicates no association or relationship between disease and exposure; the further the deviation from 0, the stronger the association or relationship.

Sensitivity and specificity [5, 53] are commonly used to assess the accuracy of a new test for diagnosing the presence or absence of a disease with respect to the true disease status, as determined by a traditionally used test or a test accepted as the gold standard. They can be estimated from the 2x2 table below (Table 1.2), which is Table 1.1 with a = TP, b = FP, c = FN, and d = TN.

                            Gold standard
                        Yes         No         Total
Test   Positive (+)     TP          FP         TP + FP
       Negative (-)     FN          TN         FN + TN
Total                 TP + FN     FP + TN      n = TP + FP + FN + TN

TP = true positive, FP = false positive, FN = false negative, TN = true negative

Table 1.2: 2x2 table for diagnostic tests

Sensitivity is the probability that the new test indicates presence of the disease when the gold standard indicates that it is present, and can be estimated as

    sensitivity = P(+ | disease) = TP / (TP + FN),

where TP = true positives and FN = false negatives. Specificity is the probability that the new test indicates absence of the disease when the gold standard indicates that it is absent, and can be estimated as

    specificity = P(- | no disease) = TN / (TN + FP),

where TN = true negatives and FP = false positives.

Positive and negative predictive values are similarly used in assessing the accuracy of a new test with respect to a gold standard [6]. The positive predictive value (PPV)

is the probability of the presence of disease given a positive test result, and the negative predictive value (NPV) is the probability of the absence of disease given a negative test result. They are estimated as

    PPV = P(disease | +) = TP / (TP + FP)

and

    NPV = P(no disease | -) = TN / (TN + FN).

Sensitivity, specificity, and the positive and negative predictive values are all proportions, so their values range from 0 to 1. The closer these values are to 1, the better the test. Sensitivity and specificity are both properties of the test itself; positive and negative predictive values, however, also depend on the prevalence of disease.

1.2 Reliability and Agreement

The above association measures should not be used to quantify agreement or reliability. A reliable measurement gives consistent results every time it is repeated under the same conditions, for example on the same subject. If a measurement is reliable, subjects can be distinguished or differentiated from each other despite measurement errors [21]. Reliability is typically defined as the ratio of the variability between ratings of the same subjects (i.e. by different raters or at different times) to the total variability of all ratings [21, 31], i.e.:

    variability between ratings of the same subjects / (variability between ratings of the same subjects + measurement error).    (1.1)

From this, we can see that reliability estimates will be low when there is little variability among the ratings obtained from the measurement instrument under evaluation, which happens when the range of ratings is small (little or no variability between ratings of the same subjects) or when prevalence is very high or very low [31]. Agreement measures how close or accurate the results of repeated measurements are, or the degree to which scores or ratings are identical [21, 32]. In this thesis, we will focus on inter-rater agreement and intra-rater reliability. Intra-rater (within-rater) reliability is the degree to which a measurement instrument is able to distinguish or differentiate among subjects when the same rater repeats the measurement two or more times [21, 32]. Here, we will consider two measurements (i.e. two time points) per rater. Perfect intra-rater reliability will occur when a rater gives the same rating at both time points on all subjects. Inter-rater agreement is the degree to which two or more raters achieve identical or congruent results under similar assessment conditions [32]. Perfect agreement between raters occurs when both raters yield the exact same rating on all subjects. High association, however, does not necessarily imply high agreement: raters could have perfect (or high) disagreement, which would still mean perfect (or high) association. Table 1.3 shows an example of a 2x2 study in which the raters have high disagreement, but still have high association. The 2 raters agree on only 45 of the 200 subjects: they agree that 25 subjects have the disease and that 20 subjects do not have the disease, while they disagree on the remaining 155 subjects. To illustrate, we introduce the simplest measure of agreement, the observed proportion of agreement p_0, which can be calculated as p_0 = (a + d)/n. Further measures of agreement will be discussed in the next chapter.

                              Rater 1
                     Disease    No disease    Total
Rater 2   Disease       25          75         100
       No disease       80          20         100
Total                  105          95        n = 200

Table 1.3: Example of high disagreement and high association
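The contrast between association and agreement can be checked numerically. A short sketch (function names are ours) using the Table 1.3 cell counts a = 25, b = 75, c = 80, d = 20:

```python
def odds_ratio(a, b, c, d):
    # OR = (a/b) / (c/d); a value far from 1 indicates strong association
    return (a * d) / (b * c)

def proportion_agreement(a, d, n):
    # simplest agreement measure: observed proportion of agreement p0 = (a + d)/n
    return (a + d) / n

a, b, c, d = 25, 75, 80, 20                       # Table 1.3 cell counts
print(odds_ratio(a, b, c, d))                     # ≈ 0.083: far from 1, strong association
print(proportion_agreement(a, d, a + b + c + d))  # 0.225: low agreement
```

The same table thus yields strong association but low agreement, which is exactly the point of the example.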

From the data in Table 1.3, we get

    OR = (25/75) / (80/20) = 0.083

and

    RR = (25/100) / (80/100) = 0.3125.

Since these are far from an OR or RR of 1, this suggests high association between the ratings of rater 1 and the ratings of rater 2 (an OR or RR equal to 1 would indicate no association). However, the observed proportion of agreement is p_0 = 45/200 = 0.225, which is relatively low agreement (or high disagreement). This illustrates that association measures cannot necessarily be used as measures of agreement or reliability.

1.3 Concurrent Assessment of Intra-rater Reliability and Inter-rater Agreement

Clustered binary data occur frequently in medical research. For example, in ophthalmology, pairs of eyes are evaluated through diagnostic procedures for the presence of certain diseases. Another example occurs in twin studies, where one may evaluate whether a trait of interest has an inherent genetic and/or environmental component by assessing or measuring the trait in twins. In this setting, standard methods such as logistic regression ignore the correlation between outcomes by assuming independence between individual observations [46]. However, an individual's right eye will tend to be more highly correlated with that individual's left eye than with other individuals' eyes. Similarly, in agreement or reliability studies, a rating from rater 1 on an individual will tend to be more highly correlated with a rating from rater 2 on the same individual than with other individuals' ratings. Thus, these standard methods are inadequate, and significant research has been done on the analysis of correlated binary data. In agreement and reliability studies, researchers typically use only half of the data to evaluate inter-rater agreement and the other half to evaluate intra-rater reliability, which results in less efficient estimates [7]. However, a model developed by Shoukri and Donner [50, 51, 52] allows for the concurrent assessment

of inter-rater agreement and intra-rater reliability by incorporating multiple measurements per rater per subject (which also increases the available sample size for estimating inter-rater agreement). This may allow fewer patients to be enrolled in a study, and thus reduce enrolment costs, when the cost of recruiting additional subjects is much higher than the cost of an additional measurement on existing subjects. Shoukri and Donner [50] showed that the gain in precision obtained from increasing the number of measurements per rater from 1 to 2 may allow fewer subjects to be recruited into a study with no net loss in efficiency for estimating inter-rater agreement. Slater [55] assessed the properties of the method of moments estimates from the Shoukri and Donner model [50] for some 3 parameter combinations (inter-rater agreement, intra-rater reliability, probability of a positive rating) and compared them to estimates from analysis of variance. Slater found that the estimates were negatively biased and appeared to be consistent. Slater also extended the goodness-of-fit approach used in the common correlation model [22] to the Shoukri and Donner model [50] and compared its inference properties (type I error rates) to those of a Wald test using a large-sample variance estimate. In general, the goodness-of-fit test provided type I error rates consistently closer to the nominal rates than those from the Wald test, with a few exceptions in which empirical type I errors were too high (i.e. more than 2.5% from α = 0.05). Properties of maximum likelihood estimation were not assessed. In the case of continuous data, method of moments estimation and maximum likelihood estimation have been developed, evaluated, and compared for the concurrent assessment of intra-rater reliability and inter-rater agreement [7].
For point estimation, Amuah [7] found that method of moments and maximum likelihood estimates perform adequately, with negligible empirical bias and comparable, moderate spread. Estimates from both methods of inter-rater agreement and intra-rater

reliability are root mean square error consistent. For hypothesis testing, empirical type I errors were comparable and within nominal levels (α = 5%). The methods used for confidence interval construction were also comparable, with coverage within nominal levels (95%). It was recommended that any of the methods could be used, unless convergence problems were encountered. The corresponding methods for binary data need to be further evaluated, including a comparison of method of moments estimation to maximum likelihood estimation, as well as an expansion of the goodness-of-fit approach for the agreement parameter.

1.4 Example: Percentage of Diffusion-Perfusion Mismatch

One example of data where subjects are assessed by more than one rater twice comes from a study assessing the intra-rater reliability and inter-rater agreement of the percentage of diffusion-perfusion mismatch [7, 18, 55]. Magnetic resonance (MR) images for 13 patients with acute strokes were independently assessed twice by 6 raters. The raw data are available and presented in Amuah's thesis [7]. The mismatch data are percentages (continuous data), but can be dichotomized at cutoff points of 10% or 20%, as was done in the study's further analyses [18]. Data from 2 of the raters will be used, since the focus of this thesis is 2 raters rating n subjects twice. This study will be further described in Chapter 5. If the cutoff point of 10% is chosen, the continuous data obtained from the study [7, 18] for the 2 raters yield the binary (dichotomized) data presented in Table 1.4: subjects with mismatch percentages greater than 10% are classified as having mismatch (1), and those with a mismatch percentage of 10% or less are classified as having no mismatch (0). One could analyze the data separately at each time point or by each rater, depending on the question one is trying to answer. For example, analyzing the data at each time point allows one to examine inter-rater

[Table 1.4: Binary data for 13 subjects being rated by 2 raters twice]

agreement. However, one would typically have data from only one time point to use for determining agreement between raters. With only the data from time point 1, one would have the results presented in Table 1.5; with only the data from time point 2, the results in Table 1.6. Using equation 2.25 from Section 2.3.2, where the data are assumed to follow a common correlation model (CCM), the data from time point 1 give an agreement estimate of 1, indicating perfect agreement. The data from time point 2 give an agreement estimate of 0.567, which indicates moderate agreement between raters (see Table 2.4, which provides some guidelines on benchmarks for agreement values). The two agreement estimates are fairly different, which shows that this study could potentially benefit from more subjects being enrolled, or from 2 measurements per subject, in which case the Shoukri and Donner model would allow concurrent estimation of inter-rater agreement and intra-rater reliability.

                           Rater 2
                 Mismatch (1)   No mismatch (0)   Total
Rater 1
   Mismatch (1)        9               0             9
No mismatch (0)        0               4             4
Total                  9               4          n = 13

Table 1.5: Mismatch data at first time point

[Table 1.6: Mismatch data at second time point (rater 2 totals: 11 mismatch, 2 no mismatch; n = 13)]
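For intuition, the time-point-1 estimate of 1 can be reproduced with a generic chance-corrected agreement statistic for a 2x2 table. This sketch uses the Bloch-Kraemer closed form of the intraclass kappa, which is not necessarily identical to the CCM estimator in equation 2.25; the cell counts are those implied by the perfect agreement at time point 1 (9 mismatch, 4 no mismatch, no disagreements):

```python
def intraclass_kappa(a, b, c, d):
    """Intraclass kappa for a 2x2 agreement table (Bloch-Kraemer form),
    where a and d are the agreeing cells and b, c the disagreeing cells."""
    num = 4 * (a * d - b * c) - (b - c) ** 2
    den = (2 * a + b + c) * (2 * d + b + c)
    return num / den

# Table 1.5 (time point 1): both raters agree on all 13 subjects
print(intraclass_kappa(9, 0, 0, 4))  # 1.0: perfect agreement
```

With any off-diagonal disagreement the statistic drops below 1, reaching -1 when the raters disagree on every subject.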

1.5 Research Objectives

The objectives of this thesis are as follows:

1. In a 2x2 study (2 raters classifying n subjects into one of two categories), evaluate the methodology for the concurrent assessment of intra-rater reliability and inter-rater agreement in the 4 parameter case (assuming a common probability of a positive rating by the 2 raters) and its further reduction to the 3 parameter case (further assuming a common intra-rater reliability) by deriving: a) point estimates for intra-rater reliability and inter-rater agreement, b) hypothesis tests for the point estimates, and c) confidence intervals using maximum likelihood, and comparing these results to the method of moments approach.

2. Describe the beginning steps of the methodology required when estimates of intra-rater reliability and inter-rater agreement are on the boundary.

3. Extend the methodology for the concurrent assessment of intra-rater reliability and inter-rater agreement to the more general 5 parameter case (allowing a different probability of a positive rating for each rater).

4. Provide recommendations to the medical research community on the usage of these models when assessing intra-rater reliability and inter-rater agreement on binary data, without jeopardizing the statistical validity of the methods.

Chapter 2 will discuss summary measures used to assess reliability and agreement in the categorical case, with a brief discussion of the quantitative case; in addition, models used to assess reliability and agreement will be discussed. Chapter 3 will

discuss the 4 parameter Shoukri-Donner model and estimation from the model using method of moments estimation and maximum likelihood estimation, including derivation of point estimates, confidence intervals, and hypothesis tests. In addition, the methods used for the Monte Carlo simulation studies will be discussed, and results will be provided for the 4 parameter model. The 3 parameter model, which assumes a common intra-rater reliability, will also be discussed, and similar simulation study results will be provided. This chapter will also present special cases where the maximum likelihood procedure needs to be modified to obtain estimates within the range of possible values for agreement and reliability. Similarly, Chapter 4 will present the beginning steps and development of the 5 parameter Shoukri-Donner model. Chapter 5 will present an application of the methods to several data sets and, finally, Chapter 6 will provide a discussion of the major findings, recommendations, and future directions.

Chapter 2
Measuring Agreement and Reliability

The focus of this chapter will be the development of agreement and reliability measures for 2 raters rating n subjects each, with a brief introduction to 2 measures commonly used in the quantitative (or continuous) case.

2.1 Quantitative Measures

The two main measures of reliability and agreement in the case where the outcome is quantitative are the concordance correlation coefficient [38] and the intraclass correlation [13].

2.1.1 Concordance Correlation Coefficient

When agreement is defined as the difference of the ratings of different raters, the concordance correlation coefficient (CCC) \rho_C is used. This measure was introduced by Lin [38]. When there are two raters, it measures agreement by assessing the variation of the linear relationship from the 45 degree line through the origin (the concordance line) [53]. Let X_{ij} denote the rating for the ith subject by rater j (i = 1, ..., n; j = 1, 2) and assume that E(X_{ij}) = \mu_j and var(X_{ij}) = \sigma_j^2. Lin [38] defined the CCC as

    \rho_C = 1 - \frac{E[(X_{i1} - X_{i2})^2]}{\sigma_1^2 + \sigma_2^2 + (\mu_1 - \mu_2)^2} = \frac{2\sigma_{12}}{\sigma_1^2 + \sigma_2^2 + (\mu_1 - \mu_2)^2},    (2.1)

where E[(X_{i1} - X_{i2})^2] is the expected squared perpendicular deviation from the 45 degree line through the origin and \sigma_1^2 + \sigma_2^2 + (\mu_1 - \mu_2)^2 is the expected squared perpendicular deviation from the 45 degree line through the origin when X_{i1} and X_{i2} are uncorrelated. Lin [38] suggested a moment estimate of \rho_C given by

    \hat{\rho}_C = \frac{2S_{12}}{S_1^2 + S_2^2 + (\bar{x}_1 - \bar{x}_2)^2},    (2.2)

where \bar{x}_j = (1/n) \sum_{i=1}^{n} x_{ij}, S_j^2 = (1/n) \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2, and S_{12} = (1/n) \sum_{i=1}^{n} (x_{i1} - \bar{x}_1)(x_{i2} - \bar{x}_2). It is simple to use, and its estimate using the sample counterparts is consistent and asymptotically normal for bivariate normal data. The CCC ranges from -1 to 1: a CCC of -1 corresponds to perfect negative agreement, while a CCC of 1 corresponds to perfect positive agreement. McBride [41] proposed strength of agreement criteria for the absolute value of Lin's CCC, as in Table 2.1.

    CCC Estimate     Strength of Agreement
    < 0.90           Poor
    0.90 to 0.95     Moderate
    0.95 to 0.99     Substantial
    > 0.99           Almost perfect

    Table 2.1: Strength of agreement benchmarks for CCC

2.1.2 Intraclass Correlation Coefficient

The intraclass correlation \rho has emerged as a universal and widely accepted reliability index for continuous data [53]; it is the ratio of the covariance among repeated observations to the total observed variance. Table 2.2 summarizes the ways in which one can estimate the intraclass correlation (ICC) from a one-way or two-way analysis of variance [7, 53, 54]. The one-way random effects model is not usually used in the cases we are considering, as subjects are usually rated by the same set of raters. As with the CCC, the ICC ranges from -1 to 1.

    Model         One-way Random Effects       Two-way Mixed Effects         Two-way Random Effects
    Assumptions   No random rater effect;      Raters are fixed (only        Raters are a random sample
                  each subject is rated by     raters of interest); each     from a population of raters;
                  a different set of raters    subject is rated by each      each subject is rated by
                                               rater                         each rater
    Estimate      (MSS - MSR)/(MSS + MSR)      (MSS - MSE)/(MSS + MSE)       (MSS - MSE)/(MSS + MSE + 2(MSR - MSE)/n)

    MSS = between subject mean square (subject mean square)
    MSR = within subject mean square (rater mean square)
    MSE = mean square error

    Table 2.2: Estimate of the intraclass correlation for 2 raters each rating n subjects

The estimate of the intraclass kappa \hat{\kappa}_I can be obtained from the formula for the intraclass correlation using the one-way random effects model with the 0-1 (binary) data [53]. The intraclass kappa will be discussed further in a later section.

2.2 Categorical Measures

The development of measures of reliability and agreement for categorical data is well traced by Shoukri [53] and will be summarized below, with some additional measures also discussed. Consider Table 2.3, a basic 2x2 table similar to the 2x2 contingency table in Chapter 1 (Table 1.1) but with different notation. Here we consider the case where two raters classified n subjects into one of two categories: disease or no disease.

                            Rater 1
                    Disease    No disease    Total
    Rater 2
    Disease          n_11        n_10         n_1.
    No disease       n_01        n_00         n_0.
    Total            n_.1        n_.0         n

    Table 2.3: Basic 2x2 table

As mentioned in section 1.2, the observed proportion of agreement is p_0, where in the 2x2 case (using the notation from Table 2.3),

    p_0 = \frac{n_{00} + n_{11}}{n}.    (2.3)
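The moment estimator in equation (2.2) above is simple to compute directly. Below is a minimal sketch (not code from this thesis), using hypothetical ratings of six subjects by two raters; note the use of 1/n rather than 1/(n-1) in the variance and covariance terms, as in Lin [38]:

```python
def ccc(x1, x2):
    # Moment estimate of Lin's concordance correlation coefficient, eq. (2.2),
    # using 1/n (not 1/(n-1)) in S_j^2 and S_12, as in Lin [38].
    n = len(x1)
    m1 = sum(x1) / n
    m2 = sum(x2) / n
    s1 = sum((a - m1) ** 2 for a in x1) / n
    s2 = sum((b - m2) ** 2 for b in x2) / n
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2)) / n
    return 2.0 * s12 / (s1 + s2 + (m1 - m2) ** 2)

# Hypothetical ratings of 6 subjects by two raters
rater1 = [4.0, 5.5, 6.1, 7.2, 8.0, 9.3]
rater2 = [4.2, 5.4, 6.5, 7.0, 8.3, 9.1]
print(ccc(rater1, rater2))
```

Identical rating vectors give a CCC of exactly 1 (perfect positive agreement), and any mean shift between the raters pulls the estimate below the Pearson correlation, which is the sense in which the CCC penalizes departures from the 45 degree concordance line.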

It is the proportion of the total number of subjects on which the raters are in agreement, and it assumes no guessing by the raters.

2.2.1 Adjusted Measures

Variations similar to p_0 are presented as adjusted indices of agreement [53]. The Jacquard coefficient is the proportion of (1, 1) matches (agreement as "positive") in a set of comparisons that ignores (0, 0) matches (agreement as "negative") and is estimated as

    \hat{J} = \frac{n_{11}}{n_{11} + n_{10} + n_{01}}.    (2.4)

Another measure, the G coefficient [28], is estimated as

    \hat{G} = \frac{(n_{00} + n_{11}) - (n_{10} + n_{01})}{n},    (2.5)

which indicates perfect agreement when \hat{G} = 1, occurring if n_{10} = n_{01} = 0, and perfect disagreement when \hat{G} = -1, occurring if n_{00} = n_{11} = 0. Another adjusted measure of agreement is based on the concordance ratio and is estimated as

    \hat{C} = \frac{2n_{11}}{2n_{11} + n_{10} + n_{01}} = 1 - \frac{n_{10} + n_{01}}{2n_{11} + n_{10} + n_{01}},    (2.6)

and like \hat{J}, it ignores the (0, 0) cell and gives twice the weight to the (1, 1) cell [53]. The above measures include agreement which occurs from guessing. The intraclass correlation is one of the most widely known measures of agreement in a 2x2 study and is estimated by

    \hat{\rho} = \frac{4n_{11}n_{00} - (n_{10} + n_{01})^2}{(2n_{11} + n_{10} + n_{01})(2n_{00} + n_{10} + n_{01})}.    (2.7)

The intraclass correlation will be discussed further when the common correlation model is introduced.
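Equations (2.3) through (2.7) can all be computed directly from the four cell counts of Table 2.3. A minimal sketch, with hypothetical counts (not data from this thesis):

```python
def adjusted_measures(n11, n10, n01, n00):
    # Agreement indices for a 2x2 table, in the notation of Table 2.3.
    n = n11 + n10 + n01 + n00
    p0 = (n11 + n00) / n                        # observed agreement, eq. (2.3)
    J = n11 / (n11 + n10 + n01)                 # Jacquard coefficient, eq. (2.4)
    G = ((n00 + n11) - (n10 + n01)) / n         # G coefficient, eq. (2.5)
    C = 2 * n11 / (2 * n11 + n10 + n01)         # concordance ratio, eq. (2.6)
    rho = (4 * n11 * n00 - (n10 + n01) ** 2) / (
        (2 * n11 + n10 + n01) * (2 * n00 + n10 + n01)
    )                                           # intraclass correlation, eq. (2.7)
    return {"p0": p0, "J": J, "G": G, "C": C, "rho": rho}

# Hypothetical table: n11 = 40, n10 = 5, n01 = 10, n00 = 45
print(adjusted_measures(40, 5, 10, 45))
```

With perfect agreement (n_{10} = n_{01} = 0), every index above equals 1; the example counts give p_0 = 0.85 while \hat{J}, \hat{G}, \hat{C}, and \hat{\rho} each discount the raw agreement differently.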

2.2.2 Chance-Corrected Measures

The observed proportion of agreement and the above measures do not correct for agreement by chance. If raters were to randomly assign their ratings, they would sometimes agree by chance [61]. For example, by chance alone, one would expect better agreement on a two-category measurement scale than on a five-category measurement scale. To correct for this bias in favor of measurement instruments with a smaller number of categories, Bennett proposed an index of consistency S [11], which takes into account the number of categories k. It is estimated as

    \hat{S} = \frac{k}{k-1}\left(p_0 - \frac{1}{k}\right),

or in the 2x2 study (k = 2),

    \hat{S} = 2\left(p_0 - \frac{1}{2}\right).    (2.8)

As the number of categories k increases, \hat{S} increases for a fixed p_0 (as defined previously). However, this measure assumes that all categories are equally likely to be used by the raters; in many cases, there may be only two or three categories that the raters use more often, or the subjects may only fall into two or three different categories.

In 1955, Scott [49] proposed an index of inter-coder agreement \pi_s, which can be estimated as

    \hat{\pi}_s = \frac{p_0 - p_e}{1 - p_e},    (2.9)

where, in the case of a 2x2 study,

    p_e = \left(\frac{(n_{0.} + n_{.0})/2}{n}\right)^2 + \left(\frac{(n_{1.} + n_{.1})/2}{n}\right)^2.

Here, p_0 is the observed proportion of agreement between the two raters, as defined previously, and p_e is the proportion of agreement expected by chance. Thus, \pi_s is the ratio of the actual difference between observed and chance agreement to the maximum difference between observed and chance agreement. This measure corrects for the number of categories as well as the frequency with which each category is used, by fixing the marginals at the observed proportion for which each category was used. It assumes that the distribution of proportions over the categories is known and is taken to be equal for all raters.

Cohen [16] criticized the assumption that the distribution of proportions over the categories is equal for all raters (i.e., that the marginal distributions of the raters are equal). He proposed Cohen's kappa, which takes a similar form to Scott's \pi_s; the difference lies in how p_e is defined. Cohen does not assume that the marginal distributions are equal (i.e., n_{1.} \neq n_{.1} and n_{0.} \neq n_{.0}, or p_{1.} \neq p_{.1} and p_{0.} \neq p_{.0}, in the notation of Table 2.3). An estimate of Cohen's kappa [16] is

    \hat{\kappa} = \frac{p_0 - p_e}{1 - p_e},    (2.10)

where, in the case of a 2x2 study,

    p_e = \left(\frac{n_{0.}}{n}\right)\left(\frac{n_{.0}}{n}\right) + \left(\frac{n_{1.}}{n}\right)\left(\frac{n_{.1}}{n}\right).

Landis and Koch [34] provided benchmarks for the kappa statistic (Table 2.4). These benchmarks differ from those proposed by McBride [41] in the continuous case (Table 2.1).

    Kappa Estimate    Strength of Agreement
    < 0.00            Poor (less than chance agreement)
    0.00 to 0.20      Slight
    0.21 to 0.40      Fair
    0.41 to 0.60      Moderate
    0.61 to 0.80      Substantial
    0.81 to 1.00      Almost perfect

    Table 2.4: Strength of agreement benchmarks for kappa

Cohen's kappa treats all disagreements equally. Thus, Cohen developed the weighted kappa [17], motivated by studies in which some disagreements between ratings are of greater gravity than others. The weighted kappa is a chance-corrected proportion of weighted agreement. The weights 0 \leq w_{ij} \leq 1 (agreement weights for i = 1, ..., k; j = 1, ..., k) are assigned on rational or clinical grounds to the k^2 cells. Exact agreement is given maximal weight (w_{ii} = 1), all disagreements are given less than maximal weight (0 \leq w_{ij} < 1 for i \neq j), and the two raters are considered symmetrically (w_{ij} = w_{ji}). The observed weighted proportion of agreement is

    p_{ow} = \sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\,\frac{n_{ij}}{n},

and the expected weighted proportion of agreement is

    p_{ew} = \sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\left(\frac{n_{i.}}{n}\right)\left(\frac{n_{.j}}{n}\right).

Thus, an estimate of the overall weighted kappa is

    \hat{\kappa}_w = \frac{p_{ow} - p_{ew}}{1 - p_{ew}}.    (2.11)

The overall weighted kappa is equivalent to Cohen's kappa when w_{ii} = 1 for all i and w_{ij} = 0 for i \neq j.

Although Aickin's \alpha is model-based [2], it provides a summary measure of agreement. It is based on agreement for a cause and the assumption that only some units are subject to classification by chance. Part of the purpose of the constant predictive probability model is to separate the two sources, random agreement and agreement for a cause, both of which tend to be included in the proportion of agreement expected by chance p_e, even though agreement for a cause is not intended to be captured there. The model is a mixture of two populations, of which the first is difficult to classify (where observers will be more likely to guess) and the second is easy to classify [2]. In the first population the observers/ratings agree by chance, and in the second population they always yield concordant ratings. Note that a concordant rating occurs when both observers classify the subject into the same category. Aickin's \alpha is defined as the fraction of the entire population which consists of subjects that are classified identically for a cause, rather than by chance. It follows the pattern of the kappa-like measures described above:

    \alpha = \frac{\sum_{i,j} d(i,j)\,p(i,j) - \sum_{i,j} d(i,j)\,p_1(i)\,p_2(j)}{1 - \sum_{i,j} d(i,j)\,p_1(i)\,p_2(j)},    (2.12)

where p_1(i) and p_2(j) are not the overall marginal distributions but the marginal distributions of the subpopulation of hard-to-classify subjects. Also, d(i,j) = 1 if i = j and d(i,j) = 0 otherwise, and p(i,j) is the joint probability distribution governing the classification of a subject in category i by rater 1 and category j by rater 2. The term \sum_{i,j} d(i,j)\,p_1(i)\,p_2(j) is similarly defined (as in the kappa-like measures) under the assumption that the raters are independent, but in kappa and kappa-like measures it is generally defined in terms of certain marginal distributions that occur in a model in which both chance and causal agreement are present. The correction in Aickin's method is made with use of the marginal distributions of the subjects that are hard to classify. Aickin notes that the proportion of chance-corrected agreement is the same for each of the agreement cells [2], which is why he terms his approach the constant predictive probability model.

2.2.3 Other Measures

Several measures have been developed to attempt to deal with issues related to kappa and kappa-like measures. The main issues related to kappa are its dependence on prevalence, or on the marginal distribution. Taking a closer look at Cohen's kappa,

    \hat{\kappa} = \frac{p_0 - p_e}{1 - p_e},

for fixed p_0, kappa attains its highest value when p_e is as small as possible [53]. The kappa for identical values of p_0 = 0.85 can be more than twice as high in one instance (0.70 when p_e = 0.50) as in the other (0.32 when p_e = 0.78). The next few observations are noted by Feinstein and Cicchetti [15, 25, 53]. A low value of kappa despite a high value of p_0 will occur only when the marginal totals are highly symmetrically unbalanced; this occurs when n_{0.} is very different from n_{1.} or when n_{.0} is very different from n_{.1}. They also note that unbalanced marginal totals can produce higher values of kappa than more balanced totals when n_{1.} is much higher than n_{0.} while n_{.1} is much smaller than n_{.0}. Another example discussed by Shoukri [53] attributed a paradoxical difference between good overall agreement and weak chance-corrected agreement to a high prevalence of negative cases. A similar situation could occur if there is a high prevalence of positive cases. These examples illustrate how kappa is dependent on prevalence (of disease, for example), or on the marginal distribution. In Aickin's approach [2], the proportion of agreement attributed to chance is smaller than in Cohen's approach, and thus Aickin's \alpha is always larger than Cohen's kappa. Aickin's \alpha is less sensitive to unbalanced marginal distributions.

Another issue arises when the assumption of observer independence (used when calculating the proportion of agreement expected by chance, p_e) is not met. The proportion of agreement expected by chance is determined assuming the two observers are independent (multiplying the marginal proportion for observer 1 by the marginal proportion for observer 2). When this assumption is not met, the estimate of chance agreement is inaccurate and possibly inappropriate. We are assuming both observers are guessing for every subject, which is unlikely. In addition, the proportions with which each category is used are the observed marginal proportions and may be (indeed, likely will be) different in different studies. This may make it hard to compare estimates of measures of agreement across studies.
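The dependence of kappa on the marginal distributions can be checked numerically. The sketch below uses hypothetical cell counts (chosen so that p_0 = 0.85 in both tables, consistent with the two instances quoted above) and computes Scott's \pi (2.9) and Cohen's \kappa (2.10):

```python
def scott_cohen(n11, n10, n01, n00):
    # Scott's pi (eq. 2.9) and Cohen's kappa (eq. 2.10) for a 2x2 table
    # in the notation of Table 2.3 (rows = rater 2, columns = rater 1).
    n = n11 + n10 + n01 + n00
    p0 = (n11 + n00) / n
    p1_row, p1_col = (n11 + n10) / n, (n11 + n01) / n  # "disease" marginals
    p0_row, p0_col = 1 - p1_row, 1 - p1_col
    pe_scott = ((p1_row + p1_col) / 2) ** 2 + ((p0_row + p0_col) / 2) ** 2
    pe_cohen = p1_row * p1_col + p0_row * p0_col
    pi_s = (p0 - pe_scott) / (1 - pe_scott)
    kappa = (p0 - pe_cohen) / (1 - pe_cohen)
    return pi_s, kappa

# Balanced marginals: p0 = 0.85, pe = 0.50, so kappa = 0.70
print(scott_cohen(45, 10, 5, 40))
# Unbalanced marginals (high prevalence): p0 = 0.85, pe = 0.78, so kappa ~ 0.32
print(scott_cohen(80, 10, 5, 5))
```

Both tables have identical observed agreement, yet the second table's heavy concentration of "disease" ratings inflates p_e and more than halves kappa, which is exactly the prevalence dependence described above.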
In other words, the proportion of agreement expected by chance estimates the proportion of times observers would agree if they were guessing for every subject.
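That interpretation can be illustrated with a small Monte Carlo check: if two raters assign "disease" independently at their marginal rates (hypothetical rates below, not data from this thesis), the observed proportion of agreement converges to p_e.

```python
import random

random.seed(42)

# Hypothetical marginal rates of a "disease" rating for two independent raters
p1, p2 = 0.9, 0.85
pe = p1 * p2 + (1 - p1) * (1 - p2)  # chance-expected agreement (0.78 here)

n = 200_000
agree = sum(
    (random.random() < p1) == (random.random() < p2)  # both guess; do they match?
    for _ in range(n)
)
print(agree / n, pe)  # simulated agreement is close to pe
```

With these rates p_e is 0.78, so even raters who are guessing on every subject would show high raw agreement, which is why p_0 alone overstates the quality of a rating procedure.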


More information

Person-Time Data. Incidence. Cumulative Incidence: Example. Cumulative Incidence. Person-Time Data. Person-Time Data

Person-Time Data. Incidence. Cumulative Incidence: Example. Cumulative Incidence. Person-Time Data. Person-Time Data Person-Time Data CF Jeff Lin, MD., PhD. Incidence 1. Cumulative incidence (incidence proportion) 2. Incidence density (incidence rate) December 14, 2005 c Jeff Lin, MD., PhD. c Jeff Lin, MD., PhD. Person-Time

More information

One-sample categorical data: approximate inference

One-sample categorical data: approximate inference One-sample categorical data: approximate inference Patrick Breheny October 6 Patrick Breheny Biostatistical Methods I (BIOS 5710) 1/25 Introduction It is relatively easy to think about the distribution

More information

Applied Machine Learning Annalisa Marsico

Applied Machine Learning Annalisa Marsico Applied Machine Learning Annalisa Marsico OWL RNA Bionformatics group Max Planck Institute for Molecular Genetics Free University of Berlin 22 April, SoSe 2015 Goals Feature Selection rather than Feature

More information

Lecture 5: ANOVA and Correlation

Lecture 5: ANOVA and Correlation Lecture 5: ANOVA and Correlation Ani Manichaikul amanicha@jhsph.edu 23 April 2007 1 / 62 Comparing Multiple Groups Continous data: comparing means Analysis of variance Binary data: comparing proportions

More information

Categorical Data Analysis 1

Categorical Data Analysis 1 Categorical Data Analysis 1 STA 312: Fall 2012 1 See last slide for copyright information. 1 / 1 Variables and Cases There are n cases (people, rats, factories, wolf packs) in a data set. A variable is

More information

SAMPLE SIZE AND OPTIMAL DESIGNS FOR RELIABILITY STUDIES

SAMPLE SIZE AND OPTIMAL DESIGNS FOR RELIABILITY STUDIES STATISTICS IN MEDICINE, VOL. 17, 101 110 (1998) SAMPLE SIZE AND OPTIMAL DESIGNS FOR RELIABILITY STUDIES S. D. WALTER, * M. ELIASZIW AND A. DONNER Department of Clinical Epidemiology and Biostatistics,

More information

Fitting stratified proportional odds models by amalgamating conditional likelihoods

Fitting stratified proportional odds models by amalgamating conditional likelihoods STATISTICS IN MEDICINE Statist. Med. 2008; 27:4950 4971 Published online 10 July 2008 in Wiley InterScience (www.interscience.wiley.com).3325 Fitting stratified proportional odds models by amalgamating

More information

For more information about how to cite these materials visit

For more information about how to cite these materials visit Author(s): Kerby Shedden, Ph.D., 2010 License: Unless otherwise noted, this material is made available under the terms of the Creative Commons Attribution Share Alike 3.0 License: http://creativecommons.org/licenses/by-sa/3.0/

More information

The performance of estimation methods for generalized linear mixed models

The performance of estimation methods for generalized linear mixed models University of Wollongong Research Online University of Wollongong Thesis Collection 1954-2016 University of Wollongong Thesis Collections 2008 The performance of estimation methods for generalized linear

More information

MARGINAL HOMOGENEITY MODEL FOR ORDERED CATEGORIES WITH OPEN ENDS IN SQUARE CONTINGENCY TABLES

MARGINAL HOMOGENEITY MODEL FOR ORDERED CATEGORIES WITH OPEN ENDS IN SQUARE CONTINGENCY TABLES REVSTAT Statistical Journal Volume 13, Number 3, November 2015, 233 243 MARGINAL HOMOGENEITY MODEL FOR ORDERED CATEGORIES WITH OPEN ENDS IN SQUARE CONTINGENCY TABLES Authors: Serpil Aktas Department of

More information

8 Nominal and Ordinal Logistic Regression

8 Nominal and Ordinal Logistic Regression 8 Nominal and Ordinal Logistic Regression 8.1 Introduction If the response variable is categorical, with more then two categories, then there are two options for generalized linear models. One relies on

More information

STA6938-Logistic Regression Model

STA6938-Logistic Regression Model Dr. Ying Zhang STA6938-Logistic Regression Model Topic 2-Multiple Logistic Regression Model Outlines:. Model Fitting 2. Statistical Inference for Multiple Logistic Regression Model 3. Interpretation of

More information

Constrained Maximum Likelihood Estimation for Model Calibration Using Summary-level Information from External Big Data Sources

Constrained Maximum Likelihood Estimation for Model Calibration Using Summary-level Information from External Big Data Sources Constrained Maximum Likelihood Estimation for Model Calibration Using Summary-level Information from External Big Data Sources Yi-Hau Chen Institute of Statistical Science, Academia Sinica Joint with Nilanjan

More information

Evaluation. Andrea Passerini Machine Learning. Evaluation

Evaluation. Andrea Passerini Machine Learning. Evaluation Andrea Passerini passerini@disi.unitn.it Machine Learning Basic concepts requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain

More information

Marginal versus conditional effects: does it make a difference? Mireille Schnitzer, PhD Université de Montréal

Marginal versus conditional effects: does it make a difference? Mireille Schnitzer, PhD Université de Montréal Marginal versus conditional effects: does it make a difference? Mireille Schnitzer, PhD Université de Montréal Overview In observational and experimental studies, the goal may be to estimate the effect

More information

A simulation study for comparing testing statistics in response-adaptive randomization

A simulation study for comparing testing statistics in response-adaptive randomization RESEARCH ARTICLE Open Access A simulation study for comparing testing statistics in response-adaptive randomization Xuemin Gu 1, J Jack Lee 2* Abstract Background: Response-adaptive randomizations are

More information

Marginal Structural Cox Model for Survival Data with Treatment-Confounder Feedback

Marginal Structural Cox Model for Survival Data with Treatment-Confounder Feedback University of South Carolina Scholar Commons Theses and Dissertations 2017 Marginal Structural Cox Model for Survival Data with Treatment-Confounder Feedback Yanan Zhang University of South Carolina Follow

More information

Non-Inferiority Tests for the Ratio of Two Proportions in a Cluster- Randomized Design

Non-Inferiority Tests for the Ratio of Two Proportions in a Cluster- Randomized Design Chapter 236 Non-Inferiority Tests for the Ratio of Two Proportions in a Cluster- Randomized Design Introduction This module provides power analysis and sample size calculation for non-inferiority tests

More information

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages: Glossary The ISI glossary of statistical terms provides definitions in a number of different languages: http://isi.cbs.nl/glossary/index.htm Adjusted r 2 Adjusted R squared measures the proportion of the

More information

Causal Inference with General Treatment Regimes: Generalizing the Propensity Score

Causal Inference with General Treatment Regimes: Generalizing the Propensity Score Causal Inference with General Treatment Regimes: Generalizing the Propensity Score David van Dyk Department of Statistics, University of California, Irvine vandyk@stat.harvard.edu Joint work with Kosuke

More information

Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features. Yangxin Huang

Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features. Yangxin Huang Bayesian Inference on Joint Mixture Models for Survival-Longitudinal Data with Multiple Features Yangxin Huang Department of Epidemiology and Biostatistics, COPH, USF, Tampa, FL yhuang@health.usf.edu January

More information

STAT 705: Analysis of Contingency Tables

STAT 705: Analysis of Contingency Tables STAT 705: Analysis of Contingency Tables Timothy Hanson Department of Statistics, University of South Carolina Stat 705: Analysis of Contingency Tables 1 / 45 Outline of Part I: models and parameters Basic

More information

Cohen s s Kappa and Log-linear Models

Cohen s s Kappa and Log-linear Models Cohen s s Kappa and Log-linear Models HRP 261 03/03/03 10-11 11 am 1. Cohen s Kappa Actual agreement = sum of the proportions found on the diagonals. π ii Cohen: Compare the actual agreement with the chance

More information

An Overview of Methods in the Analysis of Dependent Ordered Categorical Data: Assumptions and Implications

An Overview of Methods in the Analysis of Dependent Ordered Categorical Data: Assumptions and Implications WORKING PAPER SERIES WORKING PAPER NO 7, 2008 Swedish Business School at Örebro An Overview of Methods in the Analysis of Dependent Ordered Categorical Data: Assumptions and Implications By Hans Högberg

More information

Interpret Standard Deviation. Outlier Rule. Describe the Distribution OR Compare the Distributions. Linear Transformations SOCS. Interpret a z score

Interpret Standard Deviation. Outlier Rule. Describe the Distribution OR Compare the Distributions. Linear Transformations SOCS. Interpret a z score Interpret Standard Deviation Outlier Rule Linear Transformations Describe the Distribution OR Compare the Distributions SOCS Using Normalcdf and Invnorm (Calculator Tips) Interpret a z score What is an

More information

Covariate Balancing Propensity Score for General Treatment Regimes

Covariate Balancing Propensity Score for General Treatment Regimes Covariate Balancing Propensity Score for General Treatment Regimes Kosuke Imai Princeton University October 14, 2014 Talk at the Department of Psychiatry, Columbia University Joint work with Christian

More information

BIOS 6649: Handout Exercise Solution

BIOS 6649: Handout Exercise Solution BIOS 6649: Handout Exercise Solution NOTE: I encourage you to work together, but the work you submit must be your own. Any plagiarism will result in loss of all marks. This assignment is based on weight-loss

More information

ANALYSIS OF CORRELATED DATA SAMPLING FROM CLUSTERS CLUSTER-RANDOMIZED TRIALS

ANALYSIS OF CORRELATED DATA SAMPLING FROM CLUSTERS CLUSTER-RANDOMIZED TRIALS ANALYSIS OF CORRELATED DATA SAMPLING FROM CLUSTERS CLUSTER-RANDOMIZED TRIALS Background Independent observations: Short review of well-known facts Comparison of two groups continuous response Control group:

More information

Section IX. Introduction to Logistic Regression for binary outcomes. Poisson regression

Section IX. Introduction to Logistic Regression for binary outcomes. Poisson regression Section IX Introduction to Logistic Regression for binary outcomes Poisson regression 0 Sec 9 - Logistic regression In linear regression, we studied models where Y is a continuous variable. What about

More information

Lecture 12: Effect modification, and confounding in logistic regression

Lecture 12: Effect modification, and confounding in logistic regression Lecture 12: Effect modification, and confounding in logistic regression Ani Manichaikul amanicha@jhsph.edu 4 May 2007 Today Categorical predictor create dummy variables just like for linear regression

More information

Correlation and simple linear regression S5

Correlation and simple linear regression S5 Basic medical statistics for clinical and eperimental research Correlation and simple linear regression S5 Katarzyna Jóźwiak k.jozwiak@nki.nl November 15, 2017 1/41 Introduction Eample: Brain size and

More information

UNIVERSITY OF TORONTO Faculty of Arts and Science

UNIVERSITY OF TORONTO Faculty of Arts and Science UNIVERSITY OF TORONTO Faculty of Arts and Science December 2013 Final Examination STA442H1F/2101HF Methods of Applied Statistics Jerry Brunner Duration - 3 hours Aids: Calculator Model(s): Any calculator

More information

Brandon C. Kelly (Harvard Smithsonian Center for Astrophysics)

Brandon C. Kelly (Harvard Smithsonian Center for Astrophysics) Brandon C. Kelly (Harvard Smithsonian Center for Astrophysics) Probability quantifies randomness and uncertainty How do I estimate the normalization and logarithmic slope of a X ray continuum, assuming

More information

Two Correlated Proportions Non- Inferiority, Superiority, and Equivalence Tests

Two Correlated Proportions Non- Inferiority, Superiority, and Equivalence Tests Chapter 59 Two Correlated Proportions on- Inferiority, Superiority, and Equivalence Tests Introduction This chapter documents three closely related procedures: non-inferiority tests, superiority (by a

More information

Machine Learning Linear Classification. Prof. Matteo Matteucci

Machine Learning Linear Classification. Prof. Matteo Matteucci Machine Learning Linear Classification Prof. Matteo Matteucci Recall from the first lecture 2 X R p Regression Y R Continuous Output X R p Y {Ω 0, Ω 1,, Ω K } Classification Discrete Output X R p Y (X)

More information

Estimation and sample size calculations for correlated binary error rates of biometric identification devices

Estimation and sample size calculations for correlated binary error rates of biometric identification devices Estimation and sample size calculations for correlated binary error rates of biometric identification devices Michael E. Schuckers,11 Valentine Hall, Department of Mathematics Saint Lawrence University,

More information

Evaluation requires to define performance measures to be optimized

Evaluation requires to define performance measures to be optimized Evaluation Basic concepts Evaluation requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain (generalization error) approximation

More information

Prerequisite: STATS 7 or STATS 8 or AP90 or (STATS 120A and STATS 120B and STATS 120C). AP90 with a minimum score of 3

Prerequisite: STATS 7 or STATS 8 or AP90 or (STATS 120A and STATS 120B and STATS 120C). AP90 with a minimum score of 3 University of California, Irvine 2017-2018 1 Statistics (STATS) Courses STATS 5. Seminar in Data Science. 1 Unit. An introduction to the field of Data Science; intended for entering freshman and transfers.

More information

Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics

Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics Wooldridge, Introductory Econometrics, 4th ed. Appendix C: Fundamentals of mathematical statistics A short review of the principles of mathematical statistics (or, what you should have learned in EC 151).

More information

Experimental designs for multiple responses with different models

Experimental designs for multiple responses with different models Graduate Theses and Dissertations Graduate College 2015 Experimental designs for multiple responses with different models Wilmina Mary Marget Iowa State University Follow this and additional works at:

More information

Part III: Unstructured Data. Lecture timetable. Analysis of data. Data Retrieval: III.1 Unstructured data and data retrieval

Part III: Unstructured Data. Lecture timetable. Analysis of data. Data Retrieval: III.1 Unstructured data and data retrieval Inf1-DA 2010 20 III: 28 / 89 Part III Unstructured Data Data Retrieval: III.1 Unstructured data and data retrieval Statistical Analysis of Data: III.2 Data scales and summary statistics III.3 Hypothesis

More information

HANDBOOK OF APPLICABLE MATHEMATICS

HANDBOOK OF APPLICABLE MATHEMATICS HANDBOOK OF APPLICABLE MATHEMATICS Chief Editor: Walter Ledermann Volume VI: Statistics PART A Edited by Emlyn Lloyd University of Lancaster A Wiley-Interscience Publication JOHN WILEY & SONS Chichester

More information

Performance Evaluation and Comparison

Performance Evaluation and Comparison Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Cross Validation and Resampling 3 Interval Estimation

More information

Biost 518 Applied Biostatistics II. Purpose of Statistics. First Stage of Scientific Investigation. Further Stages of Scientific Investigation

Biost 518 Applied Biostatistics II. Purpose of Statistics. First Stage of Scientific Investigation. Further Stages of Scientific Investigation Biost 58 Applied Biostatistics II Scott S. Emerson, M.D., Ph.D. Professor of Biostatistics University of Washington Lecture 5: Review Purpose of Statistics Statistics is about science (Science in the broadest

More information