DATA NORMALIZATION AND CLUSTERING FOR BIG AND SMALL DATA AND AN APPLICATION TO CLINICAL TRIALS


DATA NORMALIZATION AND CLUSTERING FOR BIG AND SMALL DATA AND AN APPLICATION TO CLINICAL TRIALS

BY YAYAN ZHANG

A dissertation submitted to the Graduate School New Brunswick, Rutgers, The State University of New Jersey, in partial fulfillment of the requirements for the degree of Doctor of Philosophy, Graduate Program in Statistics and Biostatistics.

Written under the direction of Javier Cabrera and approved by

New Brunswick, New Jersey
May, 2015

© 2015 YAYAN ZHANG
ALL RIGHTS RESERVED

ABSTRACT OF THE DISSERTATION

Data Normalization and Clustering for Big and Small Data and an Application to Clinical Trials

by YAYAN ZHANG

Dissertation Director: Javier Cabrera

The purpose of this thesis is to propose new methodology for data normalization and cluster prediction in order to help unravel the structure of a data set. Such data may come from many different areas, for example clinical responses, genomic multivariate data such as microarrays, educational test scores, and so on. In addition, and more specifically for clinical trials, this thesis proposes a new cohort-size adaptive design that adapts the cohort size as the trial progresses, saving time and cost while preserving the accuracy with which the target maximum tolerated dose is found. The new normalization method, called Fisher-Yates normalization, has the advantage of being computationally superior to standard quantile normalization, and it improves the power of subsequent statistical analyses. Once the data have been normalized, the observations are clustered by their pattern of response and cluster prediction is used to validate the findings. We propose a new method for cluster prediction that arises naturally from hierarchical clustering; our prediction method uses nonlinear boundaries between clusters.

The normalization method and the cluster prediction method help identify subgroups of patients with a positive treatment effect. For clinical trial studies, this thesis also proposes a new adaptive design that adapts the cohort size, saving time and cost in locating the target maximum tolerated dose.

Acknowledgements

I am truly grateful to have gone through my PhD in the Statistics department at Rutgers, and I take this opportunity to express my gratitude to everyone who supported me throughout these years.

First of all, I would like to express my deepest gratitude to my thesis advisor, Professor Javier Cabrera. He always had many brilliant ideas about the topics in this thesis, and without his invaluable guidance and constant support this dissertation would have been impossible. To me he is far more than a thesis advisor: he not only guided me in how to do research, but also shared his professional network and life experience with me. He brought me to the Cardiovascular Institute weekly meetings and introduced me to people in outside companies. All of these experiences improved my communication skills and prepared me for my future career. I am really grateful to have such a great advisor.

I thank Professor John E. Kolassa, the Director of the graduate program in our department. He is very kind and always devotes his effort to helping students, and I thank him for his guidance in understanding the graduate program requirements. He encouraged me to take the written qualifying exam before I came to Rutgers four years ago, which saved a year of my PhD study. With his help I avoided many obstacles, both in my study and in my life.

Special thanks go to my committee members, Professor Lee Dicker and Dr. Birol Emir, for their precious time and effort. I thank Dr. Birol Emir in particular for kindly providing access to the neuropathic pain data set used in the research projects.

My gratitude also goes to my summer internship supervisor at Pfizer, Dr. David Li. He spent a great deal of time working with me even after my internship ended; without his help, I could never have extended the intern project into the third part of my dissertation.

A special warm thanks goes to my beloved husband, Xin. I cannot imagine how I would have gotten through all the thistles and thorns in my life without his encouragement and support. Thank you for always staying strong and lifting me up, and for always being there, in good times and bad.

Finally, I would like to thank the Department of Statistics and Biostatistics for offering me the opportunity to receive my degree here and for its support. I thank Dr. Kezhen Liu, Dr. Lan Yi, Dr. Jie Liu and Dr. Ning Tang for their suggestions and encouragement on my thesis work and my job hunting. I thank my friends Chuan Liu and Die Sun for discussions of various course topics through all these years. I also thank Long Feng, Ting Yang, Xinyan Chen, Xialu Liu and other friends; I deeply appreciate the friendship that made this experience so memorable.

Dedication

To my father Muliang Zhang and mother Qiaoyun Wu.

Table of Contents

Abstract
Acknowledgements
Dedication
List of Tables
List of Figures

1. Introduction
   1.1 Background and Motivation
   1.2 Thesis Goals and Overview
   1.3 Data Normalization and Fisher-Yates Transformation
   1.4 Hierarchical Clustering Tree Prediction
   1.5 Dose and Cohort Size Adaptive Design

2. Data Normalization and Fisher-Yates Transformation
   2.1 Introduction and Motivation
       2.1.1 Background and Introduction
       2.1.2 Case Study
       2.1.3 Why Fisher-Yates Normalization
   2.2 Existing Normalization and Standardization Methodology
       2.2.1 Global or Linear Normalization
       2.2.2 Intensity-Dependent Normalization
             Smooth Function Normalization
             Stagewise Normalization
             Other Intensity-Dependent Normalization Methods
   2.3 Quantile Normalization and Fisher-Yates Transformation
       2.3.1 Quantile Normalization
       2.3.2 Fisher-Yates Normalization
       2.3.3 Properties of the Fisher-Yates Transformation
   2.4 Simulation Study for Two-Group Gene Comparative Experiments
       2.4.1 Notations
       2.4.2 Hypothesis Testing
       2.4.3 Simulation Study for Normal Distribution Data
       2.4.4 Simulation Study for Gamma Distribution Data
       2.4.5 Discussion
   2.5 Simulation Study for Scoring Scale Data
   2.6 Normalization Application on Neuropathic Pain Data
       2.6.1 Data Description
       2.6.2 Normalization Outcomes
       2.6.3 Skewness Reduction
   2.7 Normalization Application on Sialin Data
       2.7.1 Data Description
       2.7.2 Normalization Results
   2.8 Discussion and Conclusion

3. Hierarchical Clustering Tree Prediction: An Unsupervised Algorithm for Predicting New Data Based on the Established Clusters
   3.1 Introduction
       3.1.1 Motivation
       3.1.2 Two Simple Approaches to Prediction
             Least Squares
             Nearest-Neighbor Methods
   3.2 Hierarchical Cluster Tree Prediction
       3.2.1 Analysis on a Set of Dissimilarities and Methods
       3.2.2 Model Construction and Algorithm
             Single Linkage Prediction
             Complete Linkage Prediction
             Average Linkage Prediction
             Ward's Clustering
       3.2.3 Comments on Different Hierarchical Clustering Methods
   3.3 Simulation Set-Up and Results
   3.4 Real Data Example
       3.4.1 Data Description
       3.4.2 Clustering Prediction Result
   3.5 Conclusions

4. Dose and Cohort Size Adaptive Design
   4.1 Introduction and Background
       4.1.1 Case Study
   4.2 Existing Methods for Dose-Finding Study Design
       4.2.1 Continuous Reassessment Method
       4.2.2 Escalation and Group Designs
       4.2.3 Toxicity Probability Intervals Methods
   4.3 Dose and Size Adaptive Design
       4.3.1 Stopping Rules
       4.3.2 Escalation Tables
   4.4 Simulation Set-up and Results
       4.4.1 Pre-defined Parameters
       4.4.2 Model Construction and Variance Transformation
       4.4.3 Isotonic Regression
       4.4.4 Simulation Results
   4.5 Discussion

List of Tables

2.1 $H_0$: Normal($\mu^1_i = 0$, 1) vs $H_1$: Normal($\mu^2_i$, 1)
2.2 $H_0$: Gamma($\mu^1_i = 3$, 1) vs $H_1$: Gamma($\mu^2_i$, 1)
2.3 Probability Vectors for Data Generation
2.4 True Classification Numbers
2.5 Skewness Reduction Comparison
2.6 Significance Level α = 0.01
2.7 Significance Level α = 0.05
3.1 Prediction Table for Both Methods
3.2 Cluster Ratio Table
4.1 General Dose Escalation Table 1: Current dose has Escalation trend
4.2 General Dose Escalation Table 2: Current dose has Stay trend
4.3 General Dose Escalation Table 3: Current dose has De-escalation trend
4.4 Dose Escalation Table: d = D, current dose has Escalation trend
4.5 Dose Escalation Table: d = D, current dose has Stay trend
4.6 Simulation Results for D&S and TPI, Common Scenarios: p_T = 0.03, StartDose =
4.7 Simulation Results for D&S and TPI, Uncommon Scenarios: p_T = 0.03, StartDose =
4.8 Simulation Results for D&S and TPI, Common Scenarios: p_T = 0.03, StartDose =
4.9 Simulation Results for D&S and TPI, Uncommon Scenarios: p_T = 0.03, StartDose =
4.10 Simulation Results for D&S and TPI: p_T = 0.1, StartDose =
4.11 Description for Function or Package
4.12 Description for Function or Package

List of Figures

2.1 $H_0$: Normal($\mu^1_i = 0$, 1) vs $H_1$: Normal($\mu^2_i$, 1)
2.2 $H_0$: Gamma($\mu^1_i = 3$, 1) vs $H_1$: Gamma($\mu^2_i$, 1)
2.3 Probability Trend Barchart for Different Patterns
2.4 Boxplots of the raw symptoms grouped by patient and sorted by the patient median; 1 out of every 20 patients is plotted
2.5 Boxplots of the z-scores of the symptoms grouped by patient and sorted by the patient median
2.6 Boxplots of the quantile normalization of the symptoms grouped by patient and sorted by the patient median
2.7 Boxplots of the Fisher-Yates algorithm applied to the symptoms, grouped by patient and sorted by the patient median; this improves the skewness and gives a more satisfactory shape
2.8 Gene Expression Level in Log-scale for probes of RMA
3.1 Algorithm Illustration Graph
3.2 Algorithm Illustration Graph
3.3 Traditional and New Prediction Methods with Single Linkage Distance
3.4 Traditional and New Prediction Method with Complete Linkage Distance
3.5 Traditional and New Prediction Method with Average Linkage Distance
3.6 Traditional and New Prediction Method with Ward Linkage Distance
3.7 Simulation Data; red and green samples correspond to the two groups
3.8 Training Data; red and green samples correspond to the two groups
3.9 Clusters of Training Data; yellow and blue samples correspond to the two clusters
3.10 Prediction Result for Both Methods
3.11 NPSI cluster means by disease and individual pain dimension (CPSP = central post-stroke pain; DPN = painful diabetic peripheral neuropathy; HIV = painful HIV neuropathy; NPSI = Neuropathic Pain Symptom Inventory; PTNP = post-traumatic peripheral neuropathic pain)
3.12 Testing observations 1: testing observations that receive different prediction results under HCTP and the traditional method; the title of each sample indicates the predicted cluster under each method
3.13 Testing observations 2: testing observations that receive different prediction results under HCTP and the traditional method; the title of each sample indicates the predicted cluster under each method
4.1 3+3 Design, one of the most popular dose-escalation and de-escalation schemes
4.2 Scenarios when p_T =
4.3 Plot for Different Beta Prior Densities
4.4 Histograms of Log Transformation for Different Beta Prior Densities

Chapter 1
Introduction

1.1 Background and Motivation

One of the motivating examples for the statistical methodology presented in this thesis is the analysis of clinical trial data for the treatment of neuropathic pain. Neuropathic pain (NeP) is a complex pain state related to the central nervous system, arising from disorder or dysfunction of peripheral nerves. Neuropathic pain is usually caused by diseases affecting the somatosensory system and presents heterogeneous sensory deficits. The nerve fibers may be damaged and therefore send incorrect information to pain centers. Although there is often no obvious cause for neuropathic pain, there are several common ones: alcoholism, diabetes, HIV infection or AIDS, chemotherapy, and others. Generally, neuropathic pain symptoms may include shooting and burning pain, tingling and numbness. It is not a single disease but a complication arising from various underlying medical conditions, so research aimed at finding effective treatments for neuropathic pain remains a challenge. The Institute for Advanced Reconstruction at The Plastic Surgery Center indicates that more than 100 types of peripheral neuropathy have been identified, each with its own particular set of symptoms, pattern of development and prognosis. To diagnose neuropathic pain, the doctor will usually administer a few questionnaires that evaluate different pain symptom descriptors. The patient may be asked to keep a daily pain diary on a numeric rating scale and to record when

and how the pain arises. Blood tests and nerve tests are also sometimes necessary. Due to the limitations of current treatments, a number of neuropathic pain studies have failed to meet their primary efficacy targets, and many patients feel no relief or even occasionally get worse rather than better over time. A research study conducted by Freeman et al. (2014) aimed to describe phenotypes of neuropathic pain and to uncover patterns of sensory symptoms, if they exist. Freeman et al. (2014) noted that the management of neuropathic pain is challenging and that many patients obtain only partial or no pain relief from currently available pain medications. Four clinical studies were conducted by Freeman et al. (2014) to test the treatment effect of an undisclosed drug. The results of the four studies turned out to be negative, because no difference was found between treatment and placebo. This example shows that the reason the analysis did not work is that pain acts differently in different people, and a single treatment is unlikely to reduce pain in everybody. However, the studies also provide pain measures based on sensory scales and pain scales that can be used to find clusters of patients with similar pain patterns, some of which will respond to treatment. The analysis initially performed using z-score normalization was not able to detect clusters that were predictive of good response to treatment. This is because the skewness of the pain measurements obscures the clustering: z-score transformations are linear and do not alter skewness, so the underlying structure may be missed. This is just one example, but the problem is widespread in scale data, microarray and genomic data, educational testing data, imaging data and others. The purpose of this thesis is to propose new methodology for normalizing the data and for cluster prediction that helps us unravel the structure present in the data. In addition, and more specifically for clinical trials, this thesis proposes a new adaptive design method that can save time in performing the

study. Normalization methods are very important for data sets in which the measurement scale depends on the observation. For example, two patients fill out the same questionnaire, one giving high values and the other low values, yet their pain may be the same. Microarrays and image data are based on pictures or scans that are sensitive to the luminosity at the instant of the scan. The idea of normalization is to make the scales as similar as possible. In our pain clinical trial example we improve the data set by standardizing or normalizing the answers using Fisher-Yates normalization together with hierarchical clustering, which we hope will detect subgroups with a positive treatment response. For the adaptive design methodology, we propose a new cohort-adaptive method that can find the maximum tolerated dose more quickly while preserving accuracy.

1.2 Thesis Goals and Overview

The goal of this thesis is to develop new techniques to preprocess big and small data, as well as to present a new prediction algorithm and a new adaptive design that better detect subgroups with a positive treatment response and that reduce the number of cohorts and the time needed to locate the target maximum tolerated dose in clinical trial studies. The remainder of the thesis is structured as follows. In Chapter 2 we present a new data normalization method based on the Fisher-Yates transformation that removes, as much as possible, the effects of any systematic sources of variation; we also carry out extensive simulation analyses and apply the method to the neuropathic pain questionnaire data and the Sialin data. In Chapter 3 we study the incorporation of new data into an existing hierarchical clustering partition that was obtained earlier from older training data.

We derive a novel method, called the hierarchical clustering tree prediction method, which uses the existing hierarchical tree to insert a new observation into the appropriate cluster based on inter-point and inter-cluster distances. Finally, in Chapter 4 we demonstrate, for clinical trial studies, that our new approach, the Dose and Size adaptive design, can shorten the study and thus save cost while maintaining the accuracy with which the true maximum tolerated dose is found, which is a very promising result for clinical trial design.

1.3 Data Normalization and Fisher-Yates Transformation

Once an experiment has been run and the data collected, it is necessary to preprocess the data before formally analyzing it. The purpose of normalization is to remove, as much as possible, the effects of any systematic sources of variation. Data normalization is a very popular technique for removing technological noise in genomics, in test scoring, in questionnaire scoring in medicine and the social sciences, and elsewhere. Early researchers noticed that there were substantial differences in intensity measurements among questionnaires and microarrays that had been treated exactly alike. In this thesis we show that there are statistical challenges in using such questionnaire and microarray data when conducting standard analyses such as linear and nonlinear modeling, factor analysis and cluster analysis. The importance of preprocessing the predictors and responses before modeling and analysis has been emphasized in much of the literature. We show that the traditional transformation methods (centering, scaling, z-scores) are inadequate for this task, and we propose a new data normalization method based on the Fisher-Yates transformation. We carry out extensive simulation analyses and apply the method to the neuropathic pain questionnaire data and the Sialin data to show that the Fisher-Yates transformation is more successful at removing

noise and reducing skewness.

The idea of quantile normalization is to make the common scale of the normalized data as close as possible to the true scale of the data. Conceptually this seems like the proper thing to do, but the problem is that data analyzed using t-tests or F-tests then rely heavily on the assumption of normality. Suppose we have a data set $X = (X_1, \ldots, X_P) = \{x_{ij}\}$ ($i = 1, \ldots, I$; $j = 1, \ldots, P$), where $X_j$ stands for the $j$th column, $X_j = (x_{1j}, \ldots, x_{Ij})$. The columns of $X$ represent the observations, and the rows of $X$ are the variables: the questions for questionnaire data, or the genes. The main idea of quantile normalization is to first sort each subject vector and compute the coordinate-wise median of the sorted vectors, say $M$. Each $x_{ij}$ is then replaced by the entry of $M$ corresponding to its rank $r_{ij}$ within its column, so $Q_n(X) = \{M[r_{ij}]\}$, where $i = 1, \ldots, I$ and $j = 1, \ldots, P$. One concern with quantile normalization is that the median array may not look similar to any of the arrays and may have shorter tails than the other arrays in the data set. Another concern is that, when the number of variables (genes or questions) in the data set is not large, the median array is very variable and may not be adequate for normalization. Quantile normalization tries to put all the data on the same scale, a scale as close as possible to the true scale; however, it does not remove the skewness of the original data if they are very skewed. We propose that Fisher-Yates normalization brings the data onto the same scale and reduces its skewness at the same time. We also conclude that Fisher-Yates normalization handles skewness and outliers better than quantile normalization; as a result it increases the power to detect genes that are differentially expressed between arrays, and it yields better classification results, both in our simulation study and in the application to real data.

1.4 Hierarchical Clustering Tree Prediction

Cluster analysis, also known as data segmentation, is an unsupervised classification method that splits the data into clusters or segments so that objects within a cluster are as homogeneous as possible and objects in different clusters are as heterogeneous as possible. Clustering methods can be classified as hierarchical (nested) or partitional (un-nested). They all depend on a dissimilarity or similarity measure that quantifies the distance between two observations: how far apart, or how close, the observations are. Here we study the incorporation of new data into an existing hierarchical clustering partition that was obtained earlier from older training data. The issue is how to use the existing hierarchical clustering to predict the clusters for the new data. The standard prediction method is to assign a new observation to the closest cluster using the inter-cluster distance between that observation and the existing clusters. We derive a novel method, called the hierarchical clustering tree prediction method (HCTP), which uses the existing hierarchical tree to insert the new observation into the appropriate cluster. We analyzed a data set on the treatment effect of Lyrica on neuropathic pain patients from four randomized clinical trials (Freeman et al. (2013)). After the data on baseline symptoms of neuropathic pain were normalized by Fisher-Yates, we applied hierarchical clustering and identified three clusters. Once the clusters were established, data from a new clinical trial became available, and we wanted to assign the new patients from the recent trial to the clusters established from the previous four trials. The basic idea of the method is to include the newly observed data with the original data and perform hierarchical tree clustering until the new observation joins a cluster A; the new observation is then assigned to the cluster in the original configuration into which the points of cluster A fall. For

different inter-cluster distances this may be done in different ways. The new method depends on the inter-point distance and the inter-cluster distance that were used to generate the hierarchical tree. We study the most commonly used distance measures for hierarchical clustering, namely single linkage, complete linkage, average linkage and Ward's method, to compare our HCTP to the standard prediction method. In our simulation study the classification boundaries of HCTP differ from those of the traditional method, and the misclassification rate is reduced.

1.5 Dose and Cohort Size Adaptive Design

Early-phase clinical trials are first-in-human studies of a new treatment. The primary objective of a phase I oncology trial is to define the recommended phase II dose of a new drug, which means locating the maximum tolerated dose (MTD). The main outcome for most existing dose-finding designs is toxicity, and escalation decisions are guided by ethical considerations. It is very important to estimate the MTD as accurately as possible, since it will be investigated further for efficacy in the phase II study. The study begins at low doses and escalates gradually to higher doses because of the severity of most dose-limiting toxicities (DLTs). However, we also want dose escalation to be as quick as possible, since the lower doses are expected to be ineffective in most cases.

A rich literature has been published on dose-finding designs for phase I trials. The conventional 3+3 design, first introduced in the 1940s, is still the most widely used dose-escalation and de-escalation scheme. However, the 3+3 design has limitations: statistical simulations have shown that it identifies the MTD in as few as 30% of trials. Another very popular model-based method is the Continual Reassessment Method (CRM), which estimates the MTD from a one-parameter model and updates the estimate each time a cohort completes, either by Bayesian methods as given by O'Quigley et al. (1990), or by maximum likelihood methods as given by O'Quigley and Shen (1996).

Traditional adaptive methods adapt the dose up and down over time depending on the toxicity observed in the data. We propose a novel dose assignment method, called the dose and cohort size (D&S) adaptive design, which is based on a conjugate Beta prior and adapts the dose and the cohort size at the same time; it can therefore detect the true MTD with fewer cohorts while maintaining accuracy. For its dose escalation rules, D&S follows the same principles as 3+3, TPI, CRM and similar designs: escalate if the adverse event (AE) rate at the current dose is too low relative to the target, stay if it is around the target rate, and de-escalate if it is too high. In addition, we change the cohort size depending on whether the next dose is likely or unlikely to be the MTD: we do not change the cohort size if we are uncertain it is the MTD, we add more subjects if the dose is likely to be the dose with the targeted AE rate, and we add many more subjects if the dose is highly likely to be the dose with the targeted AE rate. Simulation results indicate that, with appropriate parameters, the D&S design performs better at estimating the target dose and at assigning subjects to the target dose. The new method may also appeal to physicians, since its implementation and computation are very simple. To implement it, we need to specify the target toxicity probability p_T, the number of doses D and, for the simulation study, the true toxicity probability of each dose. The main distinctions of the proposed method are that it uses the information from the current dose, the next lower dose and the next higher dose to decide the dose assignment action, and that it adapts the cohort size at the same time when specified criteria are satisfied.
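To make the general shape of such a rule concrete, below is a minimal Python sketch of a decision of this kind. It is not the D&S rule from Chapter 4: the interval half-widths (eps_lo, eps_hi), the posterior-probability cutoffs for enlarging the cohort, and the Beta(1, 1) prior are illustrative assumptions used only to show the mechanics.

```python
from scipy.stats import beta

def dose_and_size_step(n_tox, n_treated, p_T, eps_lo=0.05, eps_hi=0.05,
                       likely=0.60, highly_likely=0.80, base_cohort=3):
    """One illustrative D&S-style decision: a dose action plus the next cohort size.

    n_tox, n_treated : toxicities and subjects observed at the current dose.
    p_T              : target toxicity probability.
    """
    # Posterior for the toxicity rate under an assumed conjugate Beta(1, 1) prior.
    post = beta(1 + n_tox, 1 + n_treated - n_tox)

    # Dose action: compare the observed AE rate with a target interval around p_T.
    ae_rate = n_tox / n_treated
    if ae_rate < p_T - eps_lo:
        action = "escalate"        # AE rate too low relative to target
    elif ae_rate > p_T + eps_hi:
        action = "de-escalate"     # AE rate too high relative to target
    else:
        action = "stay"

    # Cohort-size action: posterior probability that the rate lies in the target interval.
    p_in_interval = post.cdf(p_T + eps_hi) - post.cdf(p_T - eps_lo)
    if p_in_interval > highly_likely:
        cohort = 3 * base_cohort   # highly likely near target: add many more subjects
    elif p_in_interval > likely:
        cohort = 2 * base_cohort   # likely near target: add more subjects
    else:
        cohort = base_cohort       # uncertain: keep the default cohort size

    return action, cohort

# Example: 1 toxicity in 6 subjects with a 0.10 target rate.
print(dose_and_size_step(n_tox=1, n_treated=6, p_T=0.10))
```

The point of the sketch is only the two-part output: a conventional escalation decision plus a cohort-size decision driven by how concentrated the posterior is around the target.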

Chapter 2
Data Normalization and Fisher-Yates Transformation

Data normalization is a very popular technique for removing technological noise in genomics, in test scoring, in questionnaire scoring in medicine and the social sciences, and elsewhere. Early researchers noticed that there are substantial differences in intensity measurements among questionnaires and microarrays that were treated exactly alike. There are statistical challenges in using these data when conducting standard analyses such as modeling or clustering, and the main issue is how to preprocess the data to construct a response score. We show that the traditional transformation methods (centering, scaling, z-scores) are inadequate for this task, and we propose a new data normalization method based on the Fisher-Yates transformation. We carry out extensive simulation analyses and apply the method to the neuropathic pain questionnaire data and the Sialin data to show that the Fisher-Yates transformation is more successful at removing noise and reducing skewness.

2.1 Introduction and Motivation

2.1.1 Background and Introduction

Once an experiment has been run and the data collected, it is necessary to preprocess the data before formally analyzing it. The purpose of normalization is to remove, as much as possible, the effects of any systematic sources of variation. Normalization can greatly enhance the quality of any downstream analysis.

I present here two basic examples of data that commonly need normalization as a preprocessing step.

(I) Questionnaire data example. Questionnaires are a research tool used in many areas; a questionnaire elicits the feelings, beliefs, experiences or attitudes of a sample of individuals. Although questionnaire data offer economy and uniformity of questions, the respondent's motivation is difficult to assess and sample bias may be present from the start of the study. Many questionnaires are based on scales such as Likert scales that take values in a fixed range (1-7 or 0-10), and it is common to have many such questions on the same questionnaire. The response score is very subject-dependent: for the same stimulus (for example, the same stabbing pain), some subjects will return generally high scores while others will return relatively low scores. This is very common behavior in the population and has been studied in many different research areas. Our concerns about questionnaire data are:

1. the distribution of the response scores may differ substantially from subject to subject in spread and shape;
2. boundary threshold effects at the low and high values make the distribution of the scores either left or right skewed;
3. especially with the introduction of online questionnaires and medical outcomes questionnaires recorded by health providers, the number of cases can grow very large and the data become big data.

(II) Microarray data example. Early microarray researchers reported substantial differences in intensity measurements among microarrays that were treated exactly alike. Microarray technology, though popular, is well known for various technical noise due

to the limitations of the technology. Although the magnitude of these effects has been reduced by improvements in the technology, differences still persist. The systematic effects introduced into intensity measurements by the complexities of the microarray experimental process can be substantial enough to dilute the effects that the experimenter is trying to detect. Sources of variability that cause systematic effects were summarized by Amaratunga and Cabrera (2004): the concentration and amount of DNA placed on the microarrays, arraying equipment such as spotting pins that wear out over time, mRNA preparation, reverse transcription bias, labeling efficiency, hybridization efficiency, lack of spatial homogeneity of the hybridization on the slide, scanner settings, saturation effects, background fluorescence, linearity of detection response, and ambient conditions. In order to make valid comparisons across microarrays, we need to remove the effects of such systematic variation and bring the data onto a common scale.

2.1.2 Case Study

Many data sets are used by Amaratunga and Cabrera (2004) in the book Exploration and Analysis of DNA Microarray and Other High-Dimensional Data: the Golub data, the Mouse5 data, the Khan data, the Sialin data, the Behavioral study data, the Spiked-In data, the APOAI study data, the Breast Cancer data, the Platinum Spike data set, and the Human Epidermal Squamous Carcinoma Cell Line A431 experiment data. Amaratunga and Cabrera (2004) generally applied quantile normalization to these data sets. Here we look in particular at the Sialin data as a microarray example with which to demonstrate our Fisher-Yates normalization method.

The Sialin data were collected from two different types of mice: mice with the Slc17A5 gene knocked out, and wild-type (normal) mice, where

Slc17A5 is the gene responsible for the production of Sialin. RNA samples were then collected from newborn and 18-day-old mice of these two types. The final profile has 496,111 probes corresponding to 45,101 genes, collected from the RNA samples using Affymetrix Mouse430-2 GeneChips. Other biological data can reach much higher dimensions, such as 30 million variables.

The questionnaire data used by Freeman et al. (2014) came from patients who were males or non-pregnant, non-lactating females, aged 18 years or older, with a diagnosis of one of the NeP syndromes: CPSP, PTNP, painful HIV neuropathy and painful DPN. The NPSI questionnaire evaluated 10 different pain symptom descriptors: superficial and deep spontaneous ongoing pain; brief pain attacks or paroxysmal pain; evoked pain (pain provoked or increased by brushing, pressure, or contact with cold on the painful area); and abnormal sensations in the painful area.

2.1.3 Why Fisher-Yates Normalization

There are statistical challenges in using such questionnaire and microarray data when conducting standard analyses such as linear and nonlinear modeling, factor analysis and cluster analysis. The importance of preprocessing the predictors and responses before modeling and analysis has been emphasized in much of the literature. The objective of questionnaire score normalization is to reduce, as much as possible, the differences in shape between the sets of scores belonging to different subjects. By doing so we improve the compatibility of the individual subject scales, so that the variables that come out of the questionnaire are more homogeneous and can be better used in further analysis. Traditionally, the questionnaire scores are replaced by z-scores obtained from each individual subject's data. This calibrates each subject to have mean zero and standard deviation one, but it does not affect the skewness or other shape measures of the data. However, we find that an application of Fisher-Yates scoring,

which we call FY normalization, can remedy these shortcomings. Quantile normalization is widely used to analyze microarray data, but we will also demonstrate in the following sections why we prefer Fisher-Yates normalization.

2.2 Existing Normalization and Standardization Methodology

Once we have collected the data, it is necessary to preprocess it before formally analyzing it. There are several issues we can address to enhance the quality of any downstream analyses:

1. transform the data onto a scale suitable for analysis;
2. remove the effects of systematic sources of variation;
3. identify discrepant observations and arrays.

The purpose of normalization is to remove the effects of any systematic sources of variation as much as possible through data processing. Systematic effects can dilute the effects that the experimenter is trying to detect. Some sources of variability can be controlled by the experimenter, but they cannot be eliminated completely. Early researchers noticed this problem and did a great deal of work to remove the effects of such systematic variation.

2.2.1 Global or Linear Normalization

Early methods used normalization by the sum, by the mean, by the median, or by Q3 (the third quartile). For example, for normalization by the sum, the sums for each individual in the questionnaire data are forced to be equal to one another: if the $k$ original sums are $X_{1+}, \ldots, X_{k+}$ and we divide the $i$th individual's scores by $X_{i+}$, then every individual's sum is forced to be 1.
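As a concrete illustration, here is a minimal sketch of these global normalizations applied to a small score matrix with one row per subject. It is not code from the thesis; the only choices being made are the target constants (row sums equal to 1, or a common mean or median).

```python
import numpy as np

def normalize_by_sum(X):
    """Divide each subject's (row's) scores by the row sum, so every row sums to 1."""
    return X / X.sum(axis=1, keepdims=True)

def normalize_by_mean(X):
    """Rescale each row so that all rows share the same arithmetic mean."""
    return X * (X.mean() / X.mean(axis=1, keepdims=True))

def normalize_by_median(X):
    """Rescale each row so that all row medians are equated."""
    return X * (np.median(X) / np.median(X, axis=1, keepdims=True))

X = np.array([[1., 2., 7.], [2., 4., 14.]])   # two subjects answering on very different scales
print(normalize_by_sum(X))                    # both rows now sum to 1
```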

Similarly, normalization by the mean forces the arithmetic means of the individuals to be equal, and normalization by the median equates the row medians. We call these examples global or linear normalization. For linear normalization we assume that the spot intensities for every pair of individual score vectors are linearly related without an intercept. We can then apply a normalization scheme that adjusts the intensity of every score by the same factor, reducing the systematic variation and making the data more comparable.

2.2.2 Intensity-Dependent Normalization

We can use global normalization if each pair of observations is linearly related without an intercept, but in most cases the relationship between spot intensities is nonlinear, and different factors are needed to adjust low-intensity and high-intensity measurements. We call this normalization scheme intensity-dependent normalization, since the normalizing factor is a function of the intensity level. We denote the nonlinear normalization function by $X \mapsto f(X)$. A good deal of earlier work has been done on intensity-dependent normalization, including Amaratunga and Cabrera (2001), Li and Wong (2001), and Schadt et al. (2001).

A baseline array needs to be specified for intensity-dependent normalization, for example the median mock array. If $X_{ij}$ represents the transformed spot intensity measurement for the $i$th individual ($i = 1, \ldots, I$) on the $j$th question ($j = 1, \ldots, P$), the median mock array has entries
$$ M_j = \mathrm{median}\{X_{1j}, \ldots, X_{Ij}\}, \qquad j = 1, \ldots, P. \quad (2.1) $$
There are several ways to perform intensity-dependent normalization.
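A small sketch of this setup follows: it computes the median mock array of equation (2.1) and uses it as the baseline for an intensity-dependent adjustment in the spirit of the smooth-function approach described in the next subsection. The low-degree polynomial standing in for the smoother, and the residual-plus-baseline adjustment, are illustrative assumptions rather than the exact procedure used in the thesis.

```python
import numpy as np

def median_mock_array(X):
    """Equation (2.1): the baseline array, the per-question median across individuals."""
    return np.median(X, axis=0)

def smooth_normalize(X, degree=2):
    """Fit a smooth curve g_i of each individual's intensities against the baseline M
    (a low-degree polynomial stands in for the smoother) and remove the individual trend."""
    M = median_mock_array(X)
    X_norm = np.empty_like(X, dtype=float)
    for i, row in enumerate(X):
        coeffs = np.polyfit(M, row, deg=degree)   # fit g_i: row ~ g_i(M)
        fitted = np.polyval(coeffs, M)
        X_norm[i] = row - fitted + M              # replace the individual trend by the baseline
    return X_norm

# Example: 5 individuals answering 8 questions on systematically different scales.
rng = np.random.default_rng(0)
base = rng.uniform(0, 10, size=8)
X = np.vstack([a * base + b + rng.normal(0, 0.3, size=8)
               for a, b in [(1, 0), (2, 1), (0.5, 3), (1.5, -1), (3, 2)]])
print(smooth_normalize(X).round(2))
```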

Smooth Function Normalization

For smooth function normalization, there is an inverse function $g_i = f_i^{-1}$ that is estimated by fitting the model
$$ X_{ij} = g_i(M_j) + \varepsilon_{ij}, \quad (2.2) $$
where $\varepsilon_{ij}$ is a random error term, and the normalized values for the $i$th individual are obtained from
$$ X^{*}_{ij} = f_i(X_{ij}). \quad (2.3) $$

Stagewise Normalization

Stagewise normalization is used when the data combine technical and biological replicates. Usually smooth function normalization is applied to the technical replicates, and quantile normalization is applied to the biological replicates.

Other Intensity-Dependent Normalization Methods

Quantile normalization and Fisher-Yates normalization are two other very popular intensity-dependent normalization methods; we discuss quantile normalization and propose the Fisher-Yates method in detail in the following sections.

2.3 Quantile Normalization and Fisher-Yates Transformation

The idea of quantile normalization is motivated by making the common scale of the normalized data as close as possible to the true scale of the data. Conceptually this seems like the proper thing to do, but the problem is that data analyzed with t-tests or F-tests then rely heavily on the assumption of normality. Quantile normalization and Fisher-Yates normalization are used mainly when the data are non-normal.

We expect to see some loss of power in the t-test with respect to the non-normal method. Also, the more skewed the data are, the less reliable is the tail of the observed data.

We have a data set $X = (X_1, \ldots, X_P) = \{x_{ij}\}$ ($i = 1, \ldots, I$; $j = 1, \ldots, P$), where $X_j$ stands for the $j$th column, $X_j = (x_{1j}, \ldots, x_{Ij})$. The columns of $X$ represent the observations and the rows of $X$ are the variables: the questions for questionnaire data, or the genes.

2.3.1 Quantile Normalization

Quantile normalization, introduced by Amaratunga and Cabrera (2001), aims to make the distributions of the transformed spot intensities as similar as possible, or at least as similar as possible to the distribution of the median mock array. Under quantile normalization the shape of the normalized data is the median shape of the original data; but the data may be skewed, and the median shape is then deformed in the tails. Amaratunga and Cabrera (2001) proposed the idea of standardization or normalization by quantiles for microarray data under the name quantile standardization; the name was changed to quantile normalization later by Irizarry et al. (2003). The differences between microarray data and questionnaire data are: (a) the measurements are continuous, (b) the shapes of the subject observations are more similar, and (c) the number of observations is usually much smaller than the number of predictors.

The algorithm for constructing the quantile normalization of the columns of a data matrix $X$ with $I$ variables (rows) and $P$ observations (columns) is as follows:

1. Construct the median subject. First sort each of the subject (column) vectors and compute the coordinate-wise median of the sorted vectors; call this vector $M$, of length $I$. Let $\tilde{X}$ represent the sorted data,
$$ \tilde{X} = \{\tilde{X}_1, \ldots, \tilde{X}_P\}, \quad (2.4) $$

with each sorted column written as
$$ \tilde{X}_j = (x_{(1)j}, \ldots, x_{(I)j}), \quad (2.5) $$
where $\tilde{X}_j$ is the ordered vector, $x_{(1)j} \le x_{(2)j} \le \cdots \le x_{(I)j}$. The median vector $M$ is then
$$ M[i] = \mathrm{median}\{x_{(i)1}, \ldots, x_{(i)P}\}, \qquad i = 1, \ldots, I. \quad (2.6) $$

2. Recall that $x_{ij}$ is the $i$th score (question) in the $j$th column (subject). Let $r_{ij}$ be the rank of the $i$th score within the $j$th column, $1 \le i \le I$ and $1 \le j \le P$. We then replace $x_{ij}$ by $M[r_{ij}]$, column by column, and the resulting array is the matrix of normalized scores:
$$ Q_n(X) = \{M[r_{ij}]\}, \qquad i = 1, \ldots, I; \; j = 1, \ldots, P. \quad (2.7) $$

2.3.2 Fisher-Yates Normalization

One concern with quantile normalization is that the median array may not look similar to any of the arrays and may have shorter tails than the other arrays in the data set. Another concern is that, when the number of variables (genes or questions) in the data set is not large, the median array is very variable and may not be adequate for normalization. We also know that quantile normalization tries to put all the data on the same scale, a scale as close as possible to the true scale; however, it does not remove the skewness of the original data if they are very skewed. In order to fix the skewness we need to do something else. Here we propose the Fisher-Yates rank transformation to normalize the data, which we call Fisher-Yates normalization. Fisher-Yates normalization brings the data onto the same scale and reduces the skewness of the data at the same time; accordingly, in our simulation study the power is improved due to

skewness reduction. Good properties of Fisher-Yates normalization are also listed and proved in the following section.

The algorithm for Fisher-Yates normalization is as follows: suppose $x_{ij}$ is the $i$th score (question) in the $j$th column (subject), and let $r_{ij}$ be the rank of the $i$th score within the $j$th column, $1 \le i \le I$ and $1 \le j \le P$. Then $x_{ij}$ is replaced by $\Phi^{-1}(r_{ij}/(I+1))$, and the resulting array is the matrix of Fisher-Yates normalized scores. Fisher and Yates (1928) proposed this transformation, replacing z-scores by scores based on ranks and assigning to each rank the corresponding quantile of the standard normal distribution:
$$ FY(x_{ij}) = \Phi^{-1}\!\left(\frac{r_{ij}}{I+1}\right), \quad (2.8) $$
so the Fisher-Yates normalization of $X$ can be written as
$$ FY(X) = \{FY(X_1), \ldots, FY(X_P)\}. \quad (2.9) $$

Theorem 1. Let $X^M$ be a random variable with distribution $F$, and suppose we draw observations $\{x^M_1, \ldots, x^M_I\} \sim F$. In reality we observe $x_{ij} = \psi_j(x^M_i + \epsilon_{ij})$, where $\psi_j$ is unobserved and strictly monotonic. Then $\{\psi_j, x^M_i\}$ are not identifiable.

Proof. Suppose $h$ is also strictly monotonic, and let $\hat{\psi}_j$ and $\hat{x}^M_i$ be estimators of $\psi_j$ and $x^M_i$ such that $x_{ij} = \hat{\psi}_j(\hat{x}^M_i)$. Then if $\tilde{\psi}_j = \hat{\psi}_j \circ h^{-1}$ and $\tilde{x}^M_i = h(\hat{x}^M_i)$, we have $x_{ij} = \tilde{\psi}_j(\tilde{x}^M_i) = \hat{\psi}_j(\hat{x}^M_i)$. This means that $\{\psi_j, x^M_i\}$ are not identifiable.

Quantile normalization resolves this non-identifiability by setting $\hat{X}^M_{(i)} = \mathrm{median}_j\{X_{(i)j}\}$, and Fisher-Yates normalization resolves it by setting $\hat{X}^M_{(i)} = \Phi^{-1}(r_{ij}/(I+1))$.
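A compact sketch of both algorithms follows, operating on a matrix with one column per subject as in the notation above. The use of average ranks for ties is one reasonable convention that the text does not fix, so it should be read as an assumption of this sketch.

```python
import numpy as np
from scipy.stats import norm, rankdata

def quantile_normalize(X):
    """Quantile normalization (2.7): replace each value by the entry of the
    median vector M (2.6) corresponding to its within-column rank."""
    X = np.asarray(X, dtype=float)
    order = np.argsort(X, axis=0)
    X_sorted = np.take_along_axis(X, order, axis=0)
    M = np.median(X_sorted, axis=1)                 # median vector of length I
    ranks = np.argsort(order, axis=0)               # 0-based within-column ranks
    return M[ranks]

def fisher_yates_normalize(X):
    """Fisher-Yates normalization (2.8): map within-column ranks to standard
    normal quantiles at r_ij / (I + 1)."""
    X = np.asarray(X, dtype=float)
    I = X.shape[0]
    ranks = np.apply_along_axis(rankdata, 0, X)     # average ranks for ties
    return norm.ppf(ranks / (I + 1))

# Tiny example: 5 questions (rows) for 3 subjects (columns) on different scales.
X = np.array([[1, 10, 3], [2, 20, 3], [3, 30, 4], [4, 40, 9], [9, 90, 10]], dtype=float)
print(fisher_yates_normalize(X).round(2))
```

Note that, as Property 1 below states, the two functions differ only in the reference values assigned to the ranks: the median vector for quantile normalization, standard normal quantiles for Fisher-Yates.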

2.3.3 Properties of the Fisher-Yates Transformation

Property 1. The Fisher-Yates transformation FY can be obtained from the quantile normalization algorithm $Q_n$ by replacing $M[r_{ij}]$ with $\Phi^{-1}(r_{ij}/(I+1))$.

Proof. In the quantile normalization $Q_n(X) = \{M[r_{ij}]\}$ we replace $M[r_{ij}]$ by the Fisher-Yates transform $\Phi^{-1}(r_{ij}/(I+1))$, obtaining
$$ \{\Phi^{-1}(r_{ij}/(I+1))\} = \{FY(X_1), \ldots, FY(X_P)\} = FY(X). $$

Property 2. If $\mathrm{Skew}(X) > 0$:
1. if $X$ has no ties, then $\mathrm{Skew}(FY(X)) = 0$;
2. if $X$ has ties but the proportion of ties goes to zero as $n \to \infty$, then $\lim_{n \to \infty} \mathrm{Skew}(FY(X)) = 0$.

Proof. If the random variable $X$ has $n$ observations, the estimator of the population skewness can be written as
$$ \mathrm{skew}(X) = \frac{\widehat{E}\,(X - \bar{X})^3}{s^3} = \frac{\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{X})^3}{\left[\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{X})^2\right]^{3/2}}, $$
where $\bar{X}$ is the sample mean and $s$ is the sample standard deviation. Hence
$$ \mathrm{skew}(FY(X)) = \frac{\widehat{E}\,\bigl(FY(X) - \overline{FY(X)}\bigr)^3}{s(FY(X))^3} = \frac{\frac{1}{n}\sum_{i=1}^{n}\left[\Phi^{-1}(r_i/(n+1)) - \mathrm{mean}(FY(X))\right]^3}{\left[\frac{1}{n-1}\sum_{i=1}^{n}\left[\Phi^{-1}(r_i/(n+1)) - \mathrm{mean}(FY(X))\right]^2\right]^{3/2}}.$$

1. If $X$ has no ties:
$$ \mathrm{mean}(FY(X)) = \frac{1}{n}\sum_{i=1}^{n}\Phi^{-1}\!\left(\frac{r_i}{n+1}\right) = \frac{1}{n}\left\{\left[\Phi^{-1}\!\left(\tfrac{1}{n+1}\right) + \Phi^{-1}\!\left(\tfrac{n}{n+1}\right)\right] + \left[\Phi^{-1}\!\left(\tfrac{2}{n+1}\right) + \Phi^{-1}\!\left(\tfrac{n-1}{n+1}\right)\right] + \cdots\right\} = \frac{1}{n}\{0 + 0 + \cdots\} = 0, $$
because $\Phi^{-1}(p) = -\Phi^{-1}(1-p)$, so the quantiles cancel in symmetric pairs. Then
$$ \mathrm{skew}(FY(X)) = \frac{\frac{1}{n}\sum_{i=1}^{n}\left[\Phi^{-1}(r_i/(n+1))\right]^3}{\mathrm{sd}(FY(X))^3} = \frac{\left\{\left[\Phi^{-1}\!\left(\tfrac{1}{n+1}\right)^3 + \Phi^{-1}\!\left(\tfrac{n}{n+1}\right)^3\right] + \left[\Phi^{-1}\!\left(\tfrac{2}{n+1}\right)^3 + \Phi^{-1}\!\left(\tfrac{n-1}{n+1}\right)^3\right] + \cdots\right\}}{n\,\mathrm{sd}(FY(X))^3} = \frac{\{0 + 0 + \cdots\}}{n\,\mathrm{sd}(FY(X))^3} = 0. $$

2. If $X$ has ties but the proportion of ties goes to zero as $n \to \infty$, that is $\lim_{n\to\infty} O(N_T)/n = 0$, where $N_T$ is the number of ties, then
$$ \mathrm{mean}(FY(X)) = \lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\Phi^{-1}\!\left(\frac{r_i}{n+1}\right) \le \lim_{n\to\infty}\frac{2N_T}{n} = \lim_{n\to\infty}\frac{O(N_T)}{n} = 0, $$
and
$$ \mathrm{skew}(FY(X)) = \lim_{n\to\infty}\frac{\frac{1}{n}\sum_{i=1}^{n}\left[\Phi^{-1}(r_i/(n+1))\right]^3}{\mathrm{sd}(FY(X))^3} \le \lim_{n\to\infty}\frac{2N_T \cdot 1^3}{n\,\mathrm{sd}(FY(X))^3} = \lim_{n\to\infty}\frac{O(N_T)}{n} = 0. $$

Property 3. If $X$ is continuous:

1. $\mathrm{Skew}(FY(X)) = 0$;
2. suppose that the median vector $M_P$ of quantile normalization has $\mathrm{Skew}(M_P) = k$, where $k \ge 0$; then $\mathrm{Skew}(FY(M_P)) \le k$.

Proof. 1. If $X$ is continuous, then $X$ has no ties, so $\mathrm{Skew}(FY(X)) = 0$. 2. If $X$ is continuous, then the median array $M_P$ of quantile normalization is also continuous, so $\mathrm{Skew}(FY(M_P)) = 0$ and thus $\mathrm{Skew}(FY(M_P)) \le k$.

Property 4. In the discrete case, if the vector $\{x_i\}$ ($i = 1, \ldots, p$) takes only two values $a_1$ and $a_2$, then the skewness of $\{h(x_i)\}$ is the same as the skewness of $\{x_i\}$, where $h$ is any monotonic transformation.

Proof. Any monotonic transformation $h: \{a_1, a_2\} \to \mathbb{R}$ can be represented by the linear transformation
$$ h(x) = h(a_1) + \frac{x - a_1}{a_2 - a_1}\,\bigl(h(a_2) - h(a_1)\bigr), $$
since
$$ h(a_1) = h(a_1) + 0 = h(a_1), \qquad h(a_2) = h(a_1) + \frac{a_2 - a_1}{a_2 - a_1}\bigl(h(a_2) - h(a_1)\bigr) = h(a_2). $$
Thus $h$ is a linear transformation when $\{x_i\}$ takes only two values, and skewness is invariant under linear transformations.
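A quick numerical check of Property 2, under the assumption of an essentially tie-free continuous sample, shows the effect directly: a strongly right-skewed sample has Fisher-Yates scores with sample skewness numerically equal to zero.

```python
import numpy as np
from scipy.stats import skew, norm, rankdata

rng = np.random.default_rng(3)
x = rng.gamma(shape=1.0, scale=2.0, size=10_000)   # strongly right-skewed, continuous
fy = norm.ppf(rankdata(x) / (len(x) + 1))          # Fisher-Yates scores

print(round(skew(x), 3), round(skew(fy), 6))       # original skewness vs. ~0 after FY
```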

Property 5. Fisher-Yates normalization is better for big data, I: Fisher-Yates normalization is computationally faster than quantile normalization because it does not require the computation of the median sample, or median array. It is therefore more useful for big data.

Property 6. Fisher-Yates normalization is better for big data, II: Suppose that we obtain new observations from the same experiment, or that new data are added. Quantile normalization requires re-computation of the median array, which changes with the new data; this may introduce changes in the analysis which are unsettling. Fisher-Yates normalization, however, does not change anything, so the analysis of the prior data remains untouched. It also saves additional computing time, which is good for big data.

2.4 Simulation Study for Two-Group Gene Comparative Experiments

The objective of many microarray experiments is to detect different gene expression levels across two or more conditions. For example, for people who have lung cancer, we would like to differentiate the gene expression levels of lung tissue with cancer cells from those of lung tissue with normal cells. Due to the complexity of microarray experiments, we consider the simplest and most common case: a comparative experiment that compares two groups.

2.4.1 Notations

Assume we have two phenotypic groups of microarrays (Group 1 and Group 2), with $n_1$ microarrays in Group 1 and $n_2$ microarrays in Group 2. In our simulation study, Group 1 represents the normal microarrays, which are simulated from a fixed distribution, while Group 2 is the group of interest, for example the disease group.

Let $x^k_{ij}$ represent the intensity measure of the $i$th gene in the $j$th microarray from the $k$th group, where $k = 1, 2$, $i = 1, \ldots, I$, and $j = 1, \ldots, n_k$. The normalized counterparts are written as $\tilde{x}^k_{ij}$. In addition, let $\mu^k_i$ and $\sigma^k_i$ be the mean and standard deviation of the $i$th gene in the $k$th group, and denote the corresponding quantities for the normalized intensities by $\tilde{\mu}^k_i$ and $\tilde{\sigma}^k_i$.

2.4.2 Hypothesis Testing

For microarray data, to test whether there is any difference between the two groups, we need to construct $I$ null hypotheses (the same $I$ genes in each group). In practice, gene differential expression levels and variances are unlikely to be constant. Due to the complexity of the microarray experiment, the variation depends very much on the measurement accuracy; however, there is a trade-off between variance and mean difference when our interest is statistical power. In our simulation study we therefore fix the gene expression variance and test the equality of gene expression means between the two groups:
$$ H_0: \mu^1_i = \mu^2_i \quad \text{vs} \quad H_1: \mu^1_i \ne \mu^2_i, $$
where $i = 1, \ldots, I$. Genes in the normal group (Group 1) are simulated from a fixed distribution with $\mu^1_i = \mu_0$, $i = 1, \ldots, I$. Genes in the group of interest (Group 2) are generated under three hypotheses:

- $G^2_0$: 70% of the genes, with $\mu^2_i = \mu^1_i = \mu_0$, $i \in [1, \ldots, I]$;
- $G^{2+}_1$: 15% of the genes, with $\mu^2_i > \mu_0$, $i \in [1, \ldots, I]$;
- $G^{2-}_1$: 15% of the genes, with $\mu^2_i < \mu_0$, $i \in [1, \ldots, I]$.

2.4.3 Simulation Study for Normal Distribution Data

In our simulation study each microarray has N = 10,000 genes or probes. This N is not as large as the number of features in most genomic data sets, which can reach 50,000 or even millions, but for simulation purposes we consider it a reasonable number. Without loss of generality, we simulate 10,000 observations from Normal($\mu^1_i = 0$, 1) for the genes in Group 1. For the three parts of Group 2 we simulate:

- $G^2_0$: $ng_0 = 7{,}000$, $x^2_{ij} \sim$ Normal(0, 1);
- $G^{2+}_1$: $ng^+_1 = 1{,}500$, $x^2_{ij} \sim$ Normal($\mu^{2+}_i$, 1);
- $G^{2-}_1$: $ng^-_1 = 1{,}500$, $x^2_{ij} \sim$ Normal($\mu^{2-}_i$, 1).

For both groups the variance is fixed at $\sigma^2 = 1$ and the α level is controlled at 5%. To simplify, but without loss of generality, we set $n_1 = n_2 = n$. Table 2.1 lists the power and Type I error for different mean pairs ($\mu^{2-}_i$, $\mu^{2+}_i$), and for each pair it lists the results for different group sizes n.

In Table 2.1 we compare the power of the two-sample t-test across data sets normalized using Fisher-Yates or quantile normalization against the benchmark, which in this case is the identity transformation. Note that the identity transformation is not an option for real data, because real data always need to be normalized. Table 2.1 shows that all the normalization methods produce a reasonable Type I error, although there is a small loss of Type I error for both quantile and Fisher-Yates normalization, and the power of Fisher-Yates and quantile normalization is almost the same as that of the true normalization. To make Table 2.1 more intuitive, Figure 2.1 shows the relationship between power and group size for each method. The red line is the power curve for method 1; when the sample size is small, the power of method 1 is always slightly higher than that of the other methods, while the power of all three methods goes to 1 as the sample size increases.
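The power and Type I error entries of a table such as Table 2.1 can be estimated with a Monte Carlo sketch of the following form. It is a minimal version under the settings just described; the normalization step is left pluggable (for example, the Fisher-Yates sketch from Section 2.3), and the seed and the choice to normalize each group's matrix column-wise are illustrative assumptions.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

def simulate_power(n, mu_plus, mu_minus, n_genes=10_000, alpha=0.05, normalize=None):
    """Monte Carlo estimate of power and Type I error for the two-group design:
    70% null genes, 15% up-shifted, 15% down-shifted, unit variance."""
    n0, n1 = int(0.7 * n_genes), int(0.15 * n_genes)
    means2 = np.concatenate([np.zeros(n0), np.full(n1, mu_plus), np.full(n1, mu_minus)])
    group1 = rng.normal(0.0, 1.0, size=(n_genes, n))            # rows = genes, columns = arrays
    group2 = rng.normal(means2[:, None], 1.0, size=(n_genes, n))
    if normalize is not None:                                    # e.g. a column-wise FY normalizer
        group1, group2 = normalize(group1), normalize(group2)
    _, pvals = ttest_ind(group1, group2, axis=1)
    null = means2 == 0.0
    type1 = np.mean(pvals[null] < alpha)                         # rejections among true nulls
    power = np.mean(pvals[~null] < alpha)                        # rejections among shifted genes
    return power, type1

print(simulate_power(n=5, mu_plus=1.0, mu_minus=-1.0))
```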

Table 2.1: $H_0$: Normal($\mu^1_i = 0$, 1) vs $H_1$: Normal($\mu^2_i$, 1). Power and Type I error of the true normalization, Fisher-Yates and quantile normalization, by group size n, for the mean pairs ($\mu^{2-}_i$, $\mu^{2+}_i$) = (−0.5, 0.5), (−1, 1), (−1.5, 1.5) and (−2, 2).

Figure 2.1: $H_0$: Normal($\mu^1_i = 0$, 1) vs $H_1$: Normal($\mu^2_i$, 1).

2.4.4 Simulation Study for Gamma Distribution Data

To summarize, in the normal case all the methods are similar and work reasonably well. But what if our data are generated from a Gamma distribution? Again we simulate 10,000 observations, now from Gamma($\mu^1_i = 3$, 1), for the genes in Group 1. For Group 2, similarly:

- $G^2_0$: $ng_0 = 7{,}000$, $x^2_{ij} \sim$ Gamma($\mu^1_i = 3$, 1);
- $G^{2+}_1$: $ng^+_1 = 1{,}500$, $x^2_{ij} \sim$ Gamma($\mu^{2+}_i$, 1);
- $G^{2-}_1$: $ng^-_1 = 1{,}500$, $x^2_{ij} \sim$ Gamma($\mu^{2-}_i$, 1).

Table 2.2: $H_0$: Gamma($\mu^1_i = 3$, 1) vs $H_1$: Gamma($\mu^2_i$, 1). Power and Type I error of the true normalization, Fisher-Yates and quantile normalization, by group size n, for the mean pairs ($\mu^{2-}_i$, $\mu^{2+}_i$) = (2, 4), (2, 5) and (2, 6).

Table 2.2 lists the power and Type I error for the different mean pairs ($\mu^{2-}_i$, $\mu^{2+}_i$), with the α level controlled at 5%, and for each pair displays the results for different group sizes n. It appears from Table 2.2 that when the data are not normally distributed, the Fisher-Yates method is better than quantile normalization. In this Gamma case, the power of Fisher-Yates is also higher than that of no normalization in most cases. Also, in Figure 2.2 the red curve (the curve for method 1) lies generally below the other two curves. So we can say that the identity transformation is optimal when the simulated data are i.i.d. normally distributed, but when

the data are not symmetric, Fisher-Yates normalization works better. There is also some loss of Type I error in quantile and Fisher-Yates normalization due to over-fitting.

Figure 2.2: $H_0$: Gamma($\mu^1_i = 3$, 1) vs $H_1$: Gamma($\mu^2_i$, 1).

2.4.5 Discussion

In this section we compared the Fisher-Yates transformation to quantile normalization and to the true transformation, which in this case is the identity. From our simulation examples we see that all methods are similar and work reasonably well when the data are generated from a normal distribution. Notice that in practice the true transformation is always unknown and is unlikely to be the identity; that is why we use Fisher-Yates and quantile normalization to preprocess the data (the true transformation here represents the best one can do). We also find that when the data are non-normal, Fisher-Yates is more successful at

reducing skewness.

2.5 Simulation Study for Scoring Scale Data

Fisher-Yates normalization handles skewness and outliers better than quantile normalization; as a result it increases the power to detect genes that are differentially expressed between arrays, and it also gives better classification results. Here we generate three different cluster patterns similar to the questionnaire scoring scale data that we use as the real data example: generally horizontal, oblique with positive slope, and oblique with negative slope. We first define six probability vectors that will be used to generate the data: high-top and low-top put more weight on the top (large) scores; high-middle and low-middle put more weight on the middle scores; high-bottom and low-bottom put more weight on the bottom (small) scores. The specific probability vectors we used are listed in Table 2.3 ($p_i$ is the probability of generating score $i$), and Figure 2.3 shows the probability trend for each pattern.

Table 2.3: Probability Vectors for Data Generation. Each row (High-top, Low-top, High-middle, Low-middle, High-bottom, Low-bottom) gives the probabilities $p_0, p_1, \ldots, p_{10}$ of the eleven scores on the 0-10 scale.

Observations with 10 columns are generated from the above six probability vectors, where each column corresponds to one of 10 questions. The observations are generated to resemble the patterns of the three clusters that we obtained from our pain scale data set.

Figure 2.3: Probability Trend Barchart for the Different Patterns.

After we generate these data we perform the following three steps (a minimal sketch of the procedure is given below):

1. first normalize the data separately with the z-score method, the quantile normalization method and the Fisher-Yates normalization method;
2. perform hierarchical clustering with the Ward distance, setting the number of clusters to three;
3. compare the cluster assignments obtained from the hierarchical clustering to the true clusters, and count the number of correct classifications.

The numbers of correct classifications for each method are listed in Table 2.4, which indicates that quantile normalization and Fisher-Yates normalization are better classifiers than the z-score and identity transformations for our simulated data.
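The sketch below shows the shape of this experiment for the Fisher-Yates branch. The three probability vectors are illustrative stand-ins, since the numeric entries of Table 2.3 are not reproduced here, and the majority-label matching used to count correct classifications is one simple convention among several.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import norm, rankdata, mode

rng = np.random.default_rng(1)

# Illustrative stand-ins for three of the probability vectors of Table 2.3
# (probabilities of the scores 0-10; the actual entries are in the table).
patterns = [
    np.array([1, 1, 1, 2, 2, 3, 5, 8, 12, 30, 35]) / 100,   # "high-top"-like
    np.array([35, 30, 12, 8, 5, 3, 2, 2, 1, 1, 1]) / 100,   # "low-bottom"-like
    np.array([2, 3, 6, 12, 18, 22, 18, 12, 4, 2, 1]) / 100, # "high-middle"-like
]

X = np.vstack([rng.choice(11, size=(100, 10), p=p) for p in patterns])
truth = np.repeat([1, 2, 3], 100)

def fy_rows(X):
    """Fisher-Yates normalization applied within each subject's 10 answers."""
    return np.apply_along_axis(lambda r: norm.ppf(rankdata(r) / (len(r) + 1)), 1, X.astype(float))

labels = fcluster(linkage(fy_rows(X), method="ward"), t=3, criterion="maxclust")

# Count correct classifications: match each found cluster to its majority true label.
correct = sum(np.sum(truth[labels == c] == mode(truth[labels == c], keepdims=False).mode)
              for c in np.unique(labels))
print(correct, "out of", len(truth))
```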

Table 2.4: True Classification Numbers

  Normalization Method          True Classification Number
  Identity Transformation       491
  Z-score Transformation        610
  Quantile Normalization        706
  Fisher-Yates Normalization

2.6 Normalization Application on Neuropathic Pain Data

2.6.1 Data Description

The Neuropathic Pain Symptoms Inventory (NPSI) questionnaire consists of 10 descriptors on a 0-10 pain scale, where 0 means no pain and 10 means the worst imaginable pain. The descriptors measure a range of symptoms: burning, pressure, squeezing, electric shocks, stabbing, evoked by brushing, evoked by pressure, evoked by cold stimuli, pins and needles, and tingling. Four clinical studies were conducted to test the treatment effect of an undisclosed drug. Three of the studies turned out to be negative, because they found no difference between treatment and placebo, and one was positive. This was followed by an attempt to find subgroups of the data, here the pain scale questionnaire data, by standardizing or normalizing the answers using z-scores, but that was also unsuccessful. Our objective here is to find out whether we can improve the data by applying our Fisher-Yates normalization, in the hope of finding subgroups with a positive treatment response. Clinicians and physicians believe that the treatment effect should work for large subgroups of the population that follow a specific pain pattern. Our hope is that by using FY normalization of this questionnaire data we will be able to find subpopulations that are very responsive to the treatment.

2.6.2 Normalization Outcomes

Figure 2.4 displays boxplots of the raw symptom descriptors grouped by patient and sorted by the mean NPSI; to make the figure readable we plot only 1 out of every 20 patients. We observe differences in shape and strong skewness at the boundaries. When we apply z-score normalization to these data (Figure 2.5), we see that the skewness is not removed. Figure 2.6 shows boxplots of the quantile normalized symptom descriptors grouped by patient and sorted by the mean NPSI. Quantile normalization does improve the skewness, but it has a problem with the median subject: because of the small number of predictors, the median subject has a small range (scores from 0 to 8). This effect also occurs with microarray data, but it is less pronounced there because the number of predictors is much larger. Figure 2.7 shows boxplots of the Fisher-Yates normalization grouped by patient and sorted by the patient median. Fisher-Yates improves the skewness and does not suffer from the median subject problem that affects quantile normalization.

2.6.3 Skewness Reduction

As shown in Property 3, the skewness of data after the Fisher-Yates transformation is always 0 if the random variable is continuous. In reality we may have data with many ties, as in this example, and therefore FY normalization does not always produce data with zero skewness; but we expect that overall FY reduces the skewness of the data more than the other methods do, and in our example data this is indeed the case. The skewness improvement achieved by quantile versus Fisher-Yates normalization is not obvious from the boxplots (Figure 2.4 to Figure 2.7), so we calculate the skewness for each method and summarize the comparison in Table 2.5.
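The comparison summarized in Table 2.5 below can be computed along the following lines. This is a minimal sketch: z, qn and fy stand for the z-score, quantile and Fisher-Yates normalized patient-by-question matrices, which are placeholders here rather than data shipped with the thesis.

```python
import numpy as np
from scipy.stats import skew

def per_patient_abs_skewness(scores_by_patient):
    """Absolute skewness of each patient's 10 NPSI answers under one normalization."""
    return np.abs(skew(scores_by_patient, axis=1))

def comparison_ratio(sk_a, sk_b):
    """Fraction of patients for whom |skewness| under method A exceeds method B."""
    return np.mean(sk_a > sk_b)

# Usage sketch, once z, qn and fy (patients x 10 arrays) are available:
# print(comparison_ratio(per_patient_abs_skewness(z), per_patient_abs_skewness(fy)))
# print(comparison_ratio(per_patient_abs_skewness(qn), per_patient_abs_skewness(fy)))
```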

Figure 2.4: Boxplots of the raw symptoms grouped by the patient and sorted by the patient median. 1 out of every 20 patients is plotted.

Table 2.5: Skewness Reduction Comparison, counting for how many patients Sk(Z) > Sk(FY), Sk(Qn) > Sk(FY), and Sk(Z) > Sk(Qn) hold (True) or not (False),

where Sk(Z) is the absolute skewness under the z-score transformation, Sk(Qn) is the absolute skewness under quantile normalization, and Sk(FY) is the absolute skewness under Fisher-Yates. The skewness comparison ratios between the three normalization methods are:

R(Sk(Z) > Sk(FY)) ≈ 75%
R(Sk(Qn) > Sk(FY)) ≈ 66%
R(Sk(Z) > Sk(Qn)) ≈ 67%

So the Fisher-Yates transformation is the most successful at reducing skewness: 75% of the time the skewness of Fisher-Yates is smaller than that of the z-score transformation, and 66% of the time it is smaller than that of quantile normalization.

Figure 2.5: Boxplots of the z-scores on symptoms grouped by the patient and sorted by the patient median.

2.7 Normalization Application on Sialin Data

Data Description

As described by Amaratunga and Cabrera (2004), the Sialin data are gene expressions collected from a group of mice whose Slc17A5 gene was knocked out, compared to the gene expressions of a group of normal mice. Slc17A5 is the gene responsible for Sialin production, which is involved in the development of the mice. In the experiment, RNA samples were derived for each group from newborn and 18-day-old mice. There are 24 observations in total, corresponding to 2 groups by 2 time points by 6 biological observations. The gene expressions were generated by hybridization of the observations using 24 Affymetrix Mouse430-2 gene chips. Each chip generated the gene expression profile of the sample, which contains 45,101 gene expressions.

Figure 2.6: Boxplots of the Quantile Normalization of the symptoms grouped by the patient and sorted by the patient median.

Normalization Results

We conducted quantile normalization and the Fisher-Yates transformation separately on the Sialin data and wanted to test the significance of the gene expression under each method. There are two groups of 6 observations each for the 18-day data. In order to perform t-tests to compare the two groups after normalization, we need to assume that the observations in each group are normally distributed. This is not likely to always be true, because quantile normalization approximately preserves the shape of the original distribution, which is unlikely to be normal. Our hypotheses are

H_0: \mu_{1i} = \mu_{2i} \quad \text{vs} \quad H_1: \mu_{1i} \neq \mu_{2i}

where i = 1, ..., I, \mu_{1i} is the mean of gene i in group 1, and \mu_{2i} is the mean of gene i in group 2. We calculate the p-value at significance level α for testing for differential expression of each gene between group 1 and group 2.

Figure 2.7: Boxplots of the Fisher-Yates Algorithm on symptoms grouped by the patient and sorted by the patient median. This improves the skewness and gives a more satisfactory shape.

The most basic statistical test for comparing two groups is the two-sample t-test:

T_e = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{1/n_1 + 1/n_2}}   (2.10)

where s_p^2 is the pooled estimate of variance,

s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}.   (2.11)

After normalization, under the assumptions that the populations are Gaussian and the variances are homoscedastic, the null distribution of T_e is a t-distribution with v = n_1 + n_2 - 2 degrees of freedom. The p-value for each gene is calculated as p_e = Prob(|T_e| > |T_{e,obs}|), where T_{e,obs} is the observed value of T_e. A gene is declared significant at level α if p_e < α. The significance table at significance level α = 0.01 is given in Table 2.6 and at significance level α = 0.05 in Table 2.7. The tables indicate that the Fisher-Yates transformation detects more significantly differentially expressed genes than quantile normalization.
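The per-gene test in (2.10) and (2.11) can be sketched as follows; the matrix names are illustrative and the code is a minimal example, not the analysis code used for the Sialin data.

```python
# Pooled two-sample t-test applied gene by gene to two genes-by-replicates
# matrices (here 6 arrays per group), returning two-sided p-values.
import numpy as np
from scipy.stats import t as tdist

def pooled_t_pvalues(X1, X2):
    n1, n2 = X1.shape[1], X2.shape[1]
    s2p = ((n1 - 1) * X1.var(1, ddof=1) +
           (n2 - 1) * X2.var(1, ddof=1)) / (n1 + n2 - 2)   # eq. (2.11)
    te = (X1.mean(1) - X2.mean(1)) / np.sqrt(s2p * (1 / n1 + 1 / n2))  # (2.10)
    df = n1 + n2 - 2
    return 2 * tdist.sf(np.abs(te), df)

# e.g. number of genes significant at alpha = 0.05:
# (pooled_t_pvalues(group1, group2) < 0.05).sum()
```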

Table 2.6: Significance Level α = 0.01 (genes cross-classified as significant or not significant under quantile normalization and under the Fisher-Yates transformation).

Table 2.7: Significance Level α = 0.05 (genes cross-classified as significant or not significant under quantile normalization and under the Fisher-Yates transformation).

We also present boxplots of the gene expression levels in log scale for probes of RMA18 under the raw data and the normalized data. From Figure 2.8 we can see that most of the log-scaled gene levels lie in the range 4 to 8, with the median array around 6. Z-score normalization rescales the data but the result is still very skewed. The Fisher-Yates transformation handles the skewness and the normalization result is quite good. In Figure 2.8 the quantile normalization result is the same as the raw data, because our data set had already been pre-processed with quantile normalization.

2.8 Discussion and Conclusion

The main reason to use quantile normalization is that the median chip is similar in shape to the individual chips, and it seems a reasonable idea to normalize to a function of shape that is close to the shape of the real data. But in the case of a small number of genes, or a small number of questions per subject, the median chip is not informative because it does not necessarily have a similar shape to the individual observations. Fisher-Yates normalization is a very simple algorithm which normalizes each subject with multiple questions. It is not uncommon for data to come measured on different scales.

Figure 2.8: Gene Expression Level in Log-scale for probes of RMA18.

Here we consider the case where multivariate observations of similar quantities are measured on scales that depend on the observation. For example, questionnaire data where the scale of the answers depends on the individual's perception, microarray data where each microarray has its own scale, or images where the luminosity depends on the light level of the picture. We compare the traditional z-score normalization with quantile normalization, which is the standard in genomics, and with Fisher-Yates normalization, which is our new proposal. We show that Fisher-Yates is more efficient when testing hypotheses following the normalization procedure, compared to the other two. For microarray normalization, quantile normalization is standard, but there may be situations where Fisher-Yates is a better alternative.

In our questionnaire data example it appears that the Fisher-Yates algorithm does a better job at normalizing the data. For future work, we could concentrate on applying Fisher-Yates normalization to imaging data and educational testing data, or we could explore applications to combined data sets from different sources.

Chapter 3
Hierarchical Clustering Tree Prediction: An Unsupervised Algorithm for Predicting New Data Based on Established Clusters

In the previous chapter we applied Fisher-Yates normalization to the clinical outcomes data, which generated a set of clusters, some of which had a positive response to treatment. In this chapter we validate our method by adding data from a new study that was obtained after performing the previous analysis. For this we develop a new method for incorporating new data into the existing hierarchical clustering partition that was obtained earlier using older training data. The issue is how to use the existing hierarchical clustering to predict the clusters for the new data. The standard prediction method is to assign a new observation to the closest cluster, using the inter-cluster distance between that observation and the existing clusters. Here we derive a novel method called the hierarchical clustering tree prediction method (HCTP), which uses the existing hierarchical tree to insert the new observation into the appropriate cluster. This new method depends on the distance and inter-cluster distance that were used to generate the hierarchical tree. We study the most commonly used distance measures for hierarchical clustering in order to compare our HCTP to the standard prediction method.

Introduction

Cluster analysis, also known as data segmentation, is an unsupervised classification method that splits the data into clusters or segments, so that objects within a cluster are as homogeneous as possible and objects in different clusters are as heterogeneous as possible. Clustering methods can be classified as hierarchical (nested) or partitional (unnested). They all depend on a dissimilarity or similarity measure that depicts how far apart, or how close, two observations are from each other. Given two observations, x_1 and x_2, there are many choices of the dissimilarity measure D(x_1, x_2); generally D obeys the following rules (Amaratunga and Cabrera (2004)):

1. D ≥ 0;
2. D = 0 if and only if x_1 = x_2;
3. D increases as x_1 and x_2 move further apart;
4. D(x_1, x_2) = D(x_2, x_1).

Some choices for D also satisfy either

1. the triangle inequality, D(x_1, x_2) ≤ D(x_1, x_3) + D(x_3, x_2); or
2. the ultra-metric inequality, D(x_1, x_2) ≤ max(D(x_1, x_3), D(x_2, x_3)).

The most widely used dissimilarity measure is the Euclidean distance, D_E:

D_E(x_1, x_2) = \sqrt{\sum_{j=1}^{p} (x_{1j} - x_{2j})^2}   (3.1)

In this document we apply hierarchical clustering based on the Euclidean distance D_E.
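As a small illustration, the pairwise Euclidean dissimilarity matrix based on (3.1), which the hierarchical clustering below relies on, can be computed as follows; the 5-by-3 matrix is an arbitrary example, not data from this thesis.

```python
# Pairwise Euclidean dissimilarities for a set of observations, with a quick
# check of symmetry and of D(x, x) = 0.
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.random.rand(5, 3)                       # 5 observations, 3 features
D = squareform(pdist(X, metric="euclidean"))
assert np.allclose(D, D.T) and np.all(np.diag(D) == 0)
```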

Motivation

We recently analyzed a data set about the treatment effect of Lyrica on neuropathic pain patients, with data from four randomized clinical trials (Freeman et al. (2014)). After the data on baseline symptoms of neuropathic pain was normalized by Fisher-Yates, we applied hierarchical clustering and identified three clusters. Once the clusters were established, data from a new clinical trial became available to us. We wanted to assign new patients from the recent trial to the clusters established from the previous four trials. Here we present a novel prediction method based on the usual hierarchical clustering methodology, which returns a dendrogram consisting of the original hierarchical clustering plus the insertion points of the new data, allowing us to directly observe and interpret our prediction results. The standard clustering prediction method is to use the inter-cluster distance between a new observation and the final cluster configuration of the training data, and to assign the new observation to the cluster with the smallest distance. In some cases supervised classification methods are used instead, where the response is the cluster number and the predictors are the cluster variables. Our method consists of inserting the new observation into the hierarchical tree without changing the tree structure. The basic idea is to include the new observation with the original data and to perform hierarchical tree clustering until the new observation joins a cluster A. Then we assign the new observation to cluster A′, which is the cluster in the original configuration that cluster A falls into. As with the clustering itself, different inter-cluster distances may result in different predictions.

Two Simple Approaches to Prediction

Before we go to HCTP, let us first look at two simple approaches to prediction. Suppose we have 200 random points in R^2 from an unknown distribution, and suppose they are clustered into two groups, G = {red, blue}. How do we predict the color of future points? There are two simple approaches to prediction (Hastie et al. (2009)): least squares and nearest neighbors.

Least Squares

For linear models, given a vector of inputs X^T = (X_1, ..., X_p), we predict a response variable Y via

\hat{Y} = \hat{\beta}_0 + \sum_{j=1}^{p} X_j \hat{\beta}_j = X^T \hat{\beta}   (3.2)

The least squares prediction method codes Y = 1 if G is red and Y = 0 if G is blue. Then the classification rule is Ĝ = Red if Ŷ > 0.5 and Ĝ = Blue if Ŷ < 0.5. In this case, the decision boundary is linear and given by (3.3):

\{x : x^T \hat{\beta} = 0.5\}   (3.3)

Nearest-Neighbor Methods

For k-nearest-neighbor (kNN) prediction, we predict the response variable Y via

\hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i   (3.4)

The classification rule is the same: Ĝ = Red if Ŷ > 0.5 and Ĝ = Blue if Ŷ < 0.5.
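A minimal sketch of these two classifiers, assuming a point matrix `X` and 0/1 color labels `y` (the names are illustrative, not from this thesis):

```python
# Linear regression on a 0/1 coding of the colors and a k-nearest-neighbour
# vote, each thresholded at 0.5 as described above.
import numpy as np

def lstsq_classify(X, y, Xnew):
    Xd = np.column_stack([np.ones(len(X)), X])           # add intercept
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    yhat = np.column_stack([np.ones(len(Xnew)), Xnew]) @ beta
    return (yhat > 0.5).astype(int)                       # 1 = red if yhat > 0.5

def knn_classify(X, y, Xnew, k=15):
    preds = []
    for x in Xnew:
        idx = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
        preds.append(int(y[idx].mean() > 0.5))            # majority of neighbours
    return np.array(preds)
```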

The results of these two methods show that kNN has far fewer misclassified training observations than the linear model (Hastie et al. (2009)), and that the training misclassification error is an increasing function of k; it is therefore minimized at k = 1.

3.2 Hierarchical Cluster Tree Prediction

Analysis on a Set of Dissimilarities and Methods

Hierarchical clustering is one of the most widely used data analysis tools. The idea is to build a tree configuration of the data that successively merges similar groups of points. Hierarchical clustering usually falls into two types: bottom-up or top-down. Bottom-up (also known as agglomerative) hierarchical clustering algorithms start with each unit in its own cluster, and the algorithm grows a hierarchical tree that groups the observations into clusters. At each step the closest pair of clusters is combined, and finally all the data falls into one super cluster. At each stage, when two clusters are agglomerated, the distances between the new cluster and all the remaining clusters are recalculated according to the particular clustering method being used (for the Ward method we use the Lance-Williams dissimilarity update). Top-down (also known as divisive) hierarchical clustering algorithms are initiated with all units together in one cluster. The cluster is then split into two clusters at the next step, and the process can be continued until each unit is alone in its own cluster. A serious disadvantage of top-down clustering is that, at the early stages, there is a huge number of ways of splitting the initial cluster (e.g., 2^{G-1} - 1 in the first stage, where G is the number of units). Agglomerative (bottom-up) clustering is more popular and simpler than divisive (top-down) clustering, but less accurate. There are some advantages of hierarchical clustering:

I) Partitions can be visualized using a tree structure (a dendrogram), which provides a useful summary of the data.
II) It does not need the number of clusters as input; this can be decided by visually analyzing the hierarchical tree.
III) It can give different partitions depending on the level of resolution we are looking at.

However, hierarchical clustering can be very slow, as it needs to make a lot of merge/split decisions. In the following sections we will look into different dissimilarity measures for hierarchical clustering.

Model Construction and Algorithm

We denote our training data by X_tr = {x_ij}, i = 1, 2, ..., N; j = 1, 2, ..., P, where N is the number of observations and P is the number of features. Likewise, let X_te be our testing data. The objective of supervised classification is to use the training data for training purposes, that is, to develop a classification rule; given a new sample x, the classification rule is used to predict, as accurately as possible, the true class of the new sample. The main idea of our HCTP is:

1. Perform hierarchical clustering on the training data;
2. Add a new observation to the training data and re-run the hierarchical clustering method up to the point when the new observation joins a cluster A. Follow cluster A in the original hierarchical tree, just adding the new observation, until we stop at a cluster A′ of the original configuration;
3. The new observation is then included in cluster A′;
4. Repeat the previous steps for all new observations.

The resulting hierarchical tree will include the old tree with the insertions of the new observations.
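The following Python sketch is one possible reading of the HCTP steps above, not the implementation used in this thesis. In particular, assigning the new point to the majority original cluster of the members it first joins is a simplification of step 2, and the fallback rule illustrated later in Figure 3.2 is omitted; the function and variable names are assumptions.

```python
# For each new observation, re-grow the agglomerative tree on the training
# data plus that one point, stop at the first merge that joins the new point
# to existing observations, and assign the point to the original cluster of
# those observations (majority vote).  O(n^2) work per new point; a sketch only.
import numpy as np
from scipy.cluster.hierarchy import linkage

def hctp_predict(X_train, train_clusters, X_new, method="ward"):
    X_train = np.asarray(X_train)
    train_clusters = np.asarray(train_clusters)
    n = len(X_train)
    preds = []
    for x in np.atleast_2d(X_new):
        Z = linkage(np.vstack([X_train, x]), method=method)
        members = {i: [i] for i in range(n + 1)}   # singletons; index n = new point
        for step, row in enumerate(Z):
            a, b = int(row[0]), int(row[1])
            merged = members.pop(a) + members.pop(b)
            members[n + 1 + step] = merged
            if n in merged:                        # new point has joined a cluster
                others = [i for i in merged if i != n]
                preds.append(int(np.bincount(train_clusters[others]).argmax()))
                break
    return np.array(preds)
```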

The idea is illustrated roughly in Figure 3.1 and Figure 3.2. We perform hierarchical clustering on the training data and cut the tree into four clusters, as shown in Figure 3.1. Now, if we have a new point, which cluster should we assign it to? With the traditional method, we calculate the distance between the new point and the four clusters and assign it to the closest cluster. With our HCTP, we perform hierarchical clustering again on the combined data set (training data plus the new point). For example, x_1 and x_2 join together as one cluster first, then x_3 joins them. Then our new point joins x_8; we stop and assign this new point to cluster 4 (the same cluster as x_8) with respect to the tree configuration of the training data.

Figure 3.1: Algorithm Illustration Graph 1.

If a new point joins a cluster below the red line stage (before growing to 4 clusters), we take the action illustrated in Figure 3.1. But what if this new point does not join any existing cluster by the time the four clusters have already been established, like new point 3 in Figure 3.2?

This may happen in practice, so we still need a strategy. In this situation, we assign the new point to the cluster with the minimum distance; that is, we use the same rule as the traditional method in this special case.

Figure 3.2: Algorithm Illustration Graph 2.

Single Linkage Prediction

Single linkage hierarchical clustering (also known as nearest neighbor clustering) is one of the oldest methods in cluster analysis and was suggested by researchers in Poland. The definition is: the distance between two clusters is the smallest dissimilarity measure between any two objects in different clusters.

Figure 3.3: Traditional and New Prediction Methods with Single Linkage Distance.

Mathematically, the linkage function D(X, Y) can be expressed as

D(X, Y) = \min_{x \in X, y \in Y} d(x, y)   (3.5)

where X and Y represent two clusters, and d(x, y) is the distance between two elements x ∈ X and y ∈ Y. The merge criterion is local: at each stage we only need to focus on the area where the two clusters are closest to each other, and we can ignore the overall clusters and other, more distant parts. There is a well-known drawback called the chaining phenomenon: two clusters may be forced together because some single elements are close to each other, even though other elements in each cluster may be very far from each other. This disadvantage motivated a number of other hierarchical and non-hierarchical methods (Lance and Williams 1957; Sneath 1957). In this case our new observation always clusters with the closest observation. We do not need to run the tree again, but just look for the cluster of the nearest neighbor of the new data point. That means there is no difference between our HCTP and the traditional method when the distance is single linkage. In Figure 3.3 there are two groups G = {red, blue}, and these two groups correspond to two well-separated clusters under single linkage, say red = cluster 1 and blue = cluster 2. The red and blue points are our training data, and the yellow and green dots are the prediction results. Both methods predict the yellow dots to cluster 1 and the green dots to cluster 2. From Figure 3.3, we can see that the boundary is the same for the two prediction methods.

Complete Linkage Prediction

In complete linkage hierarchical clustering (also known as farthest neighbor clustering), the distance between two clusters is the largest dissimilarity measure between any two objects in different clusters.

Figure 3.4: Traditional and New Prediction Method with Complete Linkage Distance.

Mathematically, we define the complete linkage function D(X, Y) as

D(X, Y) = \max_{x \in X, y \in Y} d(x, y)   (3.6)

Similar to the single linkage function, X and Y are two clusters, and d(x, y) is the distance between two elements x ∈ X and y ∈ Y. The complete linkage merge criterion is non-local: the whole structure of the clustering influences the final decisions. Complete linkage clustering also avoids the chaining phenomenon caused by single linkage. From Figure 3.4 we can see that, for complete linkage, our HCTP is different from the traditional method and seems better at classification. We use the same training set as for single linkage, G = {red, blue}, with red = cluster 1 and blue = cluster 2. In the first graph of Figure 3.4 there is one blue point that falls into the yellow-dot region, which is the prediction region of cluster 1, while in the second graph the boundary separates the two clusters more reasonably.

Average Linkage Prediction

In average linkage hierarchical clustering, the distance between two clusters A and B is the arithmetic mean of the dissimilarity measures between all pairs of members in different clusters:

D(A, B) = \frac{1}{n_1 n_2} \sum_{x \in A} \sum_{y \in B} d(x, y)   (3.7)

where n_1 is the number of members in A and n_2 is the number of members in B. From Figure 3.5 we can see that, for average linkage, our HCTP is also different from the traditional method. The training set is unchanged, G = {red, blue}, with red = cluster 1 and blue = cluster 2. The boundary for the traditional method is linear, while it is nonlinear for our method.
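The three inter-cluster distances (3.5) to (3.7) can be written directly from their definitions; a brief sketch, with clusters given as arrays of points:

```python
# Single, complete, and average linkage distances between two point sets.
import numpy as np
from scipy.spatial.distance import cdist

def single_link(A, B):
    return cdist(A, B).min()      # smallest pairwise distance, eq. (3.5)

def complete_link(A, B):
    return cdist(A, B).max()      # largest pairwise distance, eq. (3.6)

def average_link(A, B):
    return cdist(A, B).mean()     # mean over all pairs, eq. (3.7)
```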

Figure 3.5: Traditional and New Prediction Method with Average Linkage Distance.

Ward's Clustering

In Ward's clustering, the distance between two clusters is calculated as the sum of squares between clusters divided by the total sum of squares or, equivalently, the change in R^2 when a cluster is split into two clusters. Here R^2 is the coefficient of determination, the percentage of the variation that can be explained by the clustering. Our prediction method for Ward clustering may not assign a point to the same cluster as the traditional method. The traditional prediction method calculates the distance between the new point and all the cluster centers and assigns the point to the closest cluster. Our method instead recalculates the inter-cluster distances using the bottom-up clustering algorithm until the new point joins a cluster, say A, and finally assigns the new point to the original cluster of the first point in A, namely A′. The difference is shown in Figure 3.6: the boundary is linear for the traditional method and nonlinear for our HCTP. In this case, the results for Ward are the same as for average linkage.

Comments on Different Hierarchical Clustering Methods

All the hierarchical clustering methods mentioned above, although they have a lot of similarities, possess different properties, will generally cluster the data in quite different ways, and may even impose a structure of their own. Single linkage is set up to maximize the connectedness of a cluster and strongly prefers to find chain-like clusters; a sequence of close observations in different groups may cause an early merge under single linkage. Complete linkage has the opposite problem: it is set up to minimize the maximum within-cluster distance and tends to find compact clusters, but may overemphasize small differences between clusters.

Figure 3.6: Traditional and New Prediction Method with Ward Linkage Distance.

Complete linkage might also not merge close groups if outlier members are far apart.

3.3 Simulation Set Up and Results

We simulated data from two normal distributions constrained to certain regions of the space. The distributions used for the simulation study are G = {G_1, G_2}, where G_1 = {(x_{i1}, y_{i1})}, i = 1, ..., n_1, and G_2 = {(x_{i2}, y_{i2})}, i = 1, ..., n_2. Here (x_{i1}, y_{i1}) stands for the coordinates of the i-th observation in group 1, and (x_{i2}, y_{i2}) for the coordinates of the i-th observation in group 2.

Figure 3.7: Simulation Data. Red and Green samples correspond to the two groups.

Observations in G_1 satisfy the conditions

x_{i1} ≥ 0,  x_{i1}^2 + y_{i1}^2 < 2.25   (3.8)

or

x_{i1} < 0,  y_{i1} < 1.5   (3.9)

for each i ∈ {1, ..., n_1}; here we set n_1 = 1000. In contrast, the conditions for G_2 are

x_{i2} ≥ 0,  y_{i2} ≥ 1,  x_{i2}^2 + y_{i2}^2 > 2.25   (3.10)

for each i ∈ {1, ..., n_2}, with n_2 = 2000. The data generated above are plotted in Figure 3.7. We can see that the raw data is clearly divided into two groups, colored red and green.

Figure 3.8: Training Data. Red and Green samples correspond to the two groups.

There are 1000 observations in group 1 (red) and 2000 observations in group 2 (green) in Figure 3.7. We took the first 50 observations in each group as our training data, as in Figure 3.8, and used the rest as the validation set.
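A hedged sketch of this simulation set-up follows. The means and covariances of the underlying normal distributions are not stated above, so the values used here are assumptions, and the regions follow (3.8) to (3.10) as reconstructed; the code is illustrative, not the original simulation program.

```python
# Rejection sampling: draw bivariate normal points and keep those falling in
# the regions defining groups 1 and 2, then take the first 50 of each group
# as training data.
import numpy as np

rng = np.random.default_rng(0)

def sample_group(n, in_region, mean, cov):
    pts = []
    while len(pts) < n:
        x, y = rng.multivariate_normal(mean, cov)
        if in_region(x, y):
            pts.append((x, y))
    return np.array(pts)

in_g1 = lambda x, y: (x >= 0 and x**2 + y**2 < 2.25) or (x < 0 and y < 1.5)
in_g2 = lambda x, y: x >= 0 and y >= 1 and x**2 + y**2 > 2.25

G1 = sample_group(1000, in_g1, mean=[0.0, 0.0], cov=np.eye(2))  # assumed N(0, I)
G2 = sample_group(2000, in_g2, mean=[0.0, 2.0], cov=np.eye(2))  # assumed N((0,2), I)
train = np.vstack([G1[:50], G2[:50]])          # first 50 observations per group
```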

Hierarchical clustering with the Ward method clustered our training observations into two clusters (colored yellow and blue in Figure 3.9). We can see from Figure 3.9 that the yellow observations (cluster 1) and the red observations (group 1) match perfectly, and likewise the blue observations (cluster 2) and the green observations (group 2). So our training set is well clustered into two clusters under the Ward method.

Figure 3.9: Clusters of Training Data. Yellow and Blue samples correspond to the two clusters.

Both the traditional method and our HCTP method were applied to the simulated data to assess their performance. We ran 1000 replications and picked one as an illustration, shown in Figure 3.10. We can see that the classification boundaries are different: linear for the traditional method and nonlinear for our HCTP, where it follows the data density. Which one is better? We compute the misclassification ratio to measure the performance of both methods. The prediction scheme for each method assigns the simulated test observations to two clusters, C = {C_1, C_2}. Let G_1 = {g_{i1} : (x_{i1}, y_{i1})} and G_2 = {g_{i2} : (x_{i2}, y_{i2})} be the testing observations in group 1 and group 2. We use r_1 to denote our HCTP misclassification ratio and r_2 to denote the traditional prediction misclassification ratio.

Figure 3.10: Prediction Result for both Methods.

The formula for the misclassification ratio r_k (k = 1, 2) is

r_k = \frac{\#\{P_k(G_1) \cap C_2\} + \#\{P_k(G_2) \cap C_1\}}{n_1 + n_2}   (3.11)

where P_1 is our HCTP prediction, P_2 is the traditional prediction, n_1 is the number of test observations in group 1, and n_2 is the number of test observations in group 2. Over the 1000 replications, the average misclassification ratios satisfied r_1 < r_2, indicating that the misclassification error of our HCTP is smaller than that of the traditional method. In the next section we go further by applying each method to a real data example.

3.4 Real Data Example

Data Description

The NPSI questionnaire consists of 10 different pain symptom descriptors. The data came from 4 randomized, double-blind, placebo-controlled clinical studies of pregabalin ( mg/day) in patients with neuropathic pain (NeP) syndromes: central post-stroke pain, post-traumatic peripheral pain, painful HIV neuropathy, and painful diabetic peripheral neuropathy. Patients enrolled were males or non-pregnant, non-lactating females aged 18 or older with a diagnosis of a NeP syndrome: CPSP (219), PTNP (254), painful HIV neuropathy (302), and painful DPN (450). Patients with a specific NeP syndrome were enrolled in each study and were asked to record their daily pain score on an 11-point Numeric Rating Scale (NRS), where 0 = no pain and 10 = worst possible pain. The average of the NRS scores over the 7 days prior to randomization was used as the mean pain score at baseline. We use NPSI as our training data set, which has 1161 observations and 11 variables. The columns of NPSI are the 10 different pain symptom descriptors and one mean vector:

superficial and deep spontaneous ongoing pain (Questions 1 and 2); brief pain attacks or paroxysmal pain (Questions 5 and 6); evoked pain (pain provoked or increased by brushing, pressure, or contact with cold on the painful area; Questions 8 through 10); abnormal sensations in the painful area (dysesthesia/paresthesia; Questions 11 and 12); and duration of spontaneous ongoing pain assessment (Question 3). Our testing data is a new clinical trial NPSI data set, with 210 observations and the same 11 variables.

Clustering Prediction Result

The objective of this exercise is to try our new clustering prediction method on the fifth study, which was finalized after we had performed the analysis of the four previous studies. Once the new data has been normalized by Fisher-Yates, we predict the cluster of each new observation. Table 3.1 and Table 3.2 summarize the new prediction results compared to the traditional method. Figure 3.11 indicates that the patients' overall NPSI score means are: cluster 1 = 5.54, cluster 2 = 2.41, and cluster 3 = 4.29. Performing our HCTP and the traditional prediction method on the NPSI data gives the clustering prediction results captured in Table 3.1. We can see that our HCTP predicts more observations (102 observations) into cluster 1, while the traditional method predicts more into clusters 2 and 3. Also, the sum of the off-diagonal counts, 63, means that our HCTP and the traditional method assign 63 of the testing observations to different clusters. Figure 3.12 and Figure 3.13 show some testing observations that receive different prediction results under the two clustering prediction methods. For example, our HCTP predicts sample 1 to cluster 1, while the traditional clustering method predicts it to cluster 2, as shown in Figure 3.12. We summarize the cluster ratio results for the two methods in Table 3.2, which indicates that the predicted cluster ratios under our HCTP are closer to the ratios in the training data.

Figure 3.11: NPSI Cluster means by disease and individual pain dimension. CPSP = central post-stroke pain; DPN = painful diabetic peripheral neuropathy; HIV = painful HIV neuropathy; NPSI = Neuropathic Pain Symptom Inventory; PTNP = post-traumatic peripheral neuropathic pain.

3.5 Conclusions

In this chapter we propose a cluster prediction method for hierarchical clustering. The idea is to I) make predictions that follow the structure of the tree; and II) produce a tree which is the original tree with all the new observations incorporated, so that this new tree is more complete than the predictions alone because it also includes the level at which each prediction happens. We also show the results of using the new prediction method with a new clinical trial that was completed after the initial 4 trials were analyzed. We found that the cluster that had a significant treatment effect in the initial 4 studies also showed a significant treatment effect in the new trial.

Figure 3.12: Testing observations 1. Shows some testing observations which receive different prediction results under HCTP and the traditional method; the main title for each sample indicates the predicted cluster under each method.

Figure 3.13: Testing observations 2. Shows some testing observations which receive different prediction results under HCTP and the traditional method; the main title for each sample indicates the predicted cluster under each method.

Table 3.1: Prediction Table for Both Methods (cross-tabulation of the cluster assignments of the testing observations under the traditional method and under HCTP).

Table 3.2: Cluster Ratio Table (cluster proportions for the training set, for HCTP, and for the traditional method).

Chapter 4
Dose and Cohort Size Adaptive Design

The primary objective of a phase I clinical trial is to locate the maximum tolerated dose (MTD). The Food and Drug Administration published its guidance for industry in 2010, indicating that an adaptive design clinical study is defined as a study that includes a prospectively planned opportunity for modification of one or more specified aspects of the study design and hypotheses based on analysis of data from subjects in the study. Traditional adaptive methods adapt the dose up and down depending on the toxicity seen in the observed data. In this chapter we introduce a novel dose assignment method called the dose and cohort size adaptive design, which adapts the dose and the cohort size at the same time and is thus able to detect the true MTD with fewer cohorts while maintaining accuracy.

4.1 Introduction and Background

Early-phase clinical trials are first-in-human studies for new treatments. The primary objective of a phase I oncology trial is to define the recommended phase II dose of a new drug, aiming at locating the MTD: the dose whose dose-limiting toxicity (DLT) rate is closest to a predefined target toxicity rate η (0 < η < 1). The main outcome for most existing dose-finding designs is toxicity, and the escalation action is guided by ethical considerations. It is very important to estimate the MTD as accurately as possible, since it will be further investigated for efficacy in the Phase II study. The study begins at low doses and escalates to higher doses only gradually, due to the severity of most DLTs.

However, we also want the dose escalation to be as quick as possible, since the lower doses are expected to be ineffective in most cases. A rich literature has been published on dose-finding designs for Phase I trials. The conventional 3+3 design, first introduced in the 1940s, is still the most widely utilized dose-escalation and de-escalation scheme. However, there are some limitations when applying 3+3: statistical simulations have demonstrated that the 3+3 design identifies the MTD in as few as 30% of trials. Another very popular model-based method is the Continual Reassessment Method (CRM), which estimates the MTD based on a one-parameter model and updates the estimate every time a cohort completes, either by Bayesian methods as given by O'Quigley et al. (1990) or by maximum likelihood methods as given by O'Quigley and Shen (1996).

We developed a new approach called the Dose and Size (D&S) adaptive design. Here we introduce the algorithms of our approach and compare them to previous designs through a simulation study. The comparisons focus on the accuracy, safety, and benefits of the procedures. The results show general advantages of our new method, and also show savings in time and cost, which is the most exciting advantage of our approach.

Case Study

AAB003 STUDY. AAB003 from Pfizer is a backup compound of Bapineuzumab. Preclinical evidence suggested that AAB003 may have a reduced risk of VE (vasogenic edema) compared to Bapineuzumab. An AAB003 dose higher than the Bapineuzumab dose may result in good efficacy while maintaining the same or a lower VE rate (5%). So the First-in-Human (FIH) study of AAB003 in subjects with mild to moderate Alzheimer's disease was conducted, to assess the safety and tolerability of AAB003 at different dose levels (0.5, 1, 2, 4, 8 mg/kg). This FIH study was also compared to an alternative, more traditional design of a single ascending dose (SAD) study followed by a multiple ascending dose (MAD) study.

The objective of the trial was to establish, as efficiently as possible, whether AAB003 has a better safety profile than bapineuzumab (AAB001). The current safety profile of bapineuzumab includes vasogenic edema (VE) of the brain, a radiographic finding that has been reported among some subjects treated with Bapineuzumab. Final dose selection will depend on the full package of preclinical safety data.

DIABETES COMPOUND. A multiple ascending dose (MAD) study was conducted for a diabetes compound project. During the study, a particular adverse event (AE) emerged and raised concern. The team planned to narrow down the dose range with a parallel study with an AE target rate of 3%. For both projects, the AE rates are very small (3% and 5%). That means the sample size for each dose would have to be very large: on average, at least n = 33 (1/3%) and n = 20 (1/5%) subjects are needed to observe one AE. So a parallel study would be very large and would take a lot of time and cost. We are therefore highly motivated to develop a new design that saves time and money. Many adaptive models have been proposed in the literature, but none adapts the cohort size. Our D&S design focuses on a cohort size adaptive method, and we will show the advantages of this new model.

4.2 Existing Methods for Dose-finding Studies

Before introducing some well-established existing methods, let us first give the definition of the MTD. Rosenberger and Haines (2002) mention two different definitions of the MTD. It can be defined as the dose just below the lowest dose level with an unacceptable toxicity rate Γ_U, or it can be defined as the dose with toxicity probability equal to some acceptable toxicity level Γ, where Γ < Γ_U.
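As a quick numerical check of the sample-size remark earlier in this section (an illustration, not part of the original study calculations): with an AE rate p, the expected number of subjects needed to observe one AE is 1/p, and the probability of seeing at least one AE among n subjects is 1 - (1 - p)^n.

```python
# For p = 3% and 5%, n = 1/p subjects gives roughly a 64% chance of
# observing at least one adverse event.
for p in (0.03, 0.05):
    n = round(1 / p)
    print(p, n, 1 - (1 - p) ** n)
```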

Figure 4.1: 3+3 Design. One of the most popular dose-escalation and de-escalation schemes.

3+3 Design

The conventional 3+3 design was originally mentioned in the 1940s and re-promoted by Storer (1989). It is among the earliest dose-escalation and de-escalation schemes. The method is shown in Figure 4.1, which makes it easier to present. The basic idea of 3+3 is to treat 3 patients per dose level. If no dose-limiting toxicity (DLT) occurs, the dose is escalated for the next cohort of 3 patients. If 1 DLT occurs, 3 more patients are treated at the same level, with dose escalation only if no additional DLTs occur. If 2 or more DLTs occur, the prior dose level is defined as the MTD. The 3+3 design is very simple and very easy to implement; however, it has some limitations. First, statistical simulations show that 3+3 identifies the MTD in as few as 30% of trials. Furthermore, it may leave a very large proportion of patients being treated at subtherapeutic doses. The 3+3 design has also been widely criticized because its escalation decisions are based only on the most recently recruited patients. Prompted by this disadvantage, O'Quigley et al. (1990) developed a new model-based adaptive design, the CRM (continual reassessment method), which makes decisions based on posterior distributions and likelihoods formed from all accumulated data.
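To make the rules above concrete, here is a hedged simulation sketch of the standard 3+3 scheme; the dose-toxicity probabilities are illustrative assumptions, and de-escalation variants of 3+3 are not modeled.

```python
# Simulate the 3+3 rules: 0 DLTs -> escalate; 1 DLT -> expand by 3 and
# escalate only if no further DLT; >=2 DLTs -> declare the previous dose
# the MTD (-1 means no dose was tolerated).
import numpy as np

def three_plus_three(tox_probs, rng):
    d = 0
    while d < len(tox_probs):
        dlt = rng.binomial(3, tox_probs[d])
        if dlt == 0:
            d += 1                                  # escalate
        elif dlt == 1:
            if rng.binomial(3, tox_probs[d]) == 0:
                d += 1                              # escalate after clean expansion
            else:
                return d - 1                        # previous dose is the MTD
        else:
            return d - 1
    return len(tox_probs) - 1                       # highest dose reached

rng = np.random.default_rng(1)
probs = [0.02, 0.06, 0.12, 0.25, 0.40]              # assumed dose-toxicity curve
mtds = [three_plus_three(probs, rng) for _ in range(10000)]
print(np.bincount(np.array(mtds) + 1) / 10000)      # distribution of selected dose
```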


More information

Statistics 262: Intermediate Biostatistics Regression & Survival Analysis

Statistics 262: Intermediate Biostatistics Regression & Survival Analysis Statistics 262: Intermediate Biostatistics Regression & Survival Analysis Jonathan Taylor & Kristin Cobb Statistics 262: Intermediate Biostatistics p.1/?? Introduction This course is an applied course,

More information

CSE 546 Final Exam, Autumn 2013

CSE 546 Final Exam, Autumn 2013 CSE 546 Final Exam, Autumn 0. Personal info: Name: Student ID: E-mail address:. There should be 5 numbered pages in this exam (including this cover sheet).. You can use any material you brought: any book,

More information

Hypothesis Testing, Power, Sample Size and Confidence Intervals (Part 2)

Hypothesis Testing, Power, Sample Size and Confidence Intervals (Part 2) Hypothesis Testing, Power, Sample Size and Confidence Intervals (Part 2) B.H. Robbins Scholars Series June 23, 2010 1 / 29 Outline Z-test χ 2 -test Confidence Interval Sample size and power Relative effect

More information

EXAMINATION: QUANTITATIVE EMPIRICAL METHODS. Yale University. Department of Political Science

EXAMINATION: QUANTITATIVE EMPIRICAL METHODS. Yale University. Department of Political Science EXAMINATION: QUANTITATIVE EMPIRICAL METHODS Yale University Department of Political Science January 2014 You have seven hours (and fifteen minutes) to complete the exam. You can use the points assigned

More information

Volume vs. Diameter. Teacher Lab Discussion. Overview. Picture, Data Table, and Graph

Volume vs. Diameter. Teacher Lab Discussion. Overview. Picture, Data Table, and Graph 5 6 7 Middle olume Length/olume vs. Diameter, Investigation page 1 of olume vs. Diameter Teacher Lab Discussion Overview Figure 1 In this experiment we investigate the relationship between the diameter

More information

LECTURE 4 PRINCIPAL COMPONENTS ANALYSIS / EXPLORATORY FACTOR ANALYSIS

LECTURE 4 PRINCIPAL COMPONENTS ANALYSIS / EXPLORATORY FACTOR ANALYSIS LECTURE 4 PRINCIPAL COMPONENTS ANALYSIS / EXPLORATORY FACTOR ANALYSIS NOTES FROM PRE- LECTURE RECORDING ON PCA PCA and EFA have similar goals. They are substantially different in important ways. The goal

More information

Full versus incomplete cross-validation: measuring the impact of imperfect separation between training and test sets in prediction error estimation

Full versus incomplete cross-validation: measuring the impact of imperfect separation between training and test sets in prediction error estimation cross-validation: measuring the impact of imperfect separation between training and test sets in prediction error estimation IIM Joint work with Christoph Bernau, Caroline Truntzer, Thomas Stadler and

More information

Relationships Regression

Relationships Regression Relationships Regression BPS chapter 5 2006 W.H. Freeman and Company Objectives (BPS chapter 5) Regression Regression lines The least-squares regression line Using technology Facts about least-squares

More information

Uni- and Bivariate Power

Uni- and Bivariate Power Uni- and Bivariate Power Copyright 2002, 2014, J. Toby Mordkoff Note that the relationship between risk and power is unidirectional. Power depends on risk, but risk is completely independent of power.

More information

Academic Affairs Assessment of Student Learning Report for Academic Year

Academic Affairs Assessment of Student Learning Report for Academic Year Academic Affairs Assessment of Student Learning Report for Academic Year 2017-2018. Department/Program Chemistry Assessment Coordinator s Name: Micheal Fultz Assessment Coordinator s Email Address: mfultz@wvstateu.edu

More information

Bayesian Nonparametric Accelerated Failure Time Models for Analyzing Heterogeneous Treatment Effects

Bayesian Nonparametric Accelerated Failure Time Models for Analyzing Heterogeneous Treatment Effects Bayesian Nonparametric Accelerated Failure Time Models for Analyzing Heterogeneous Treatment Effects Nicholas C. Henderson Thomas A. Louis Gary Rosner Ravi Varadhan Johns Hopkins University September 28,

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

Basics of Experimental Design. Review of Statistics. Basic Study. Experimental Design. When an Experiment is Not Possible. Studying Relations

Basics of Experimental Design. Review of Statistics. Basic Study. Experimental Design. When an Experiment is Not Possible. Studying Relations Basics of Experimental Design Review of Statistics And Experimental Design Scientists study relation between variables In the context of experiments these variables are called independent and dependent

More information

Methods and Criteria for Model Selection. CS57300 Data Mining Fall Instructor: Bruno Ribeiro

Methods and Criteria for Model Selection. CS57300 Data Mining Fall Instructor: Bruno Ribeiro Methods and Criteria for Model Selection CS57300 Data Mining Fall 2016 Instructor: Bruno Ribeiro Goal } Introduce classifier evaluation criteria } Introduce Bias x Variance duality } Model Assessment }

More information

Sensitivity study of dose-finding methods

Sensitivity study of dose-finding methods to of dose-finding methods Sarah Zohar 1 John O Quigley 2 1. Inserm, UMRS 717,Biostatistic Department, Hôpital Saint-Louis, Paris, France 2. Inserm, Université Paris VI, Paris, France. NY 2009 1 / 21 to

More information

Causal Inference with Big Data Sets

Causal Inference with Big Data Sets Causal Inference with Big Data Sets Marcelo Coca Perraillon University of Colorado AMC November 2016 1 / 1 Outlone Outline Big data Causal inference in economics and statistics Regression discontinuity

More information

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages: Glossary The ISI glossary of statistical terms provides definitions in a number of different languages: http://isi.cbs.nl/glossary/index.htm Adjusted r 2 Adjusted R squared measures the proportion of the

More information

Machine Learning (CS 567) Lecture 2

Machine Learning (CS 567) Lecture 2 Machine Learning (CS 567) Lecture 2 Time: T-Th 5:00pm - 6:20pm Location: GFS118 Instructor: Sofus A. Macskassy (macskass@usc.edu) Office: SAL 216 Office hours: by appointment Teaching assistant: Cheol

More information

A Bayesian Nonparametric Model for Predicting Disease Status Using Longitudinal Profiles

A Bayesian Nonparametric Model for Predicting Disease Status Using Longitudinal Profiles A Bayesian Nonparametric Model for Predicting Disease Status Using Longitudinal Profiles Jeremy Gaskins Department of Bioinformatics & Biostatistics University of Louisville Joint work with Claudio Fuentes

More information

Machine Learning! in just a few minutes. Jan Peters Gerhard Neumann

Machine Learning! in just a few minutes. Jan Peters Gerhard Neumann Machine Learning! in just a few minutes Jan Peters Gerhard Neumann 1 Purpose of this Lecture Foundations of machine learning tools for robotics We focus on regression methods and general principles Often

More information

Originality in the Arts and Sciences: Lecture 2: Probability and Statistics

Originality in the Arts and Sciences: Lecture 2: Probability and Statistics Originality in the Arts and Sciences: Lecture 2: Probability and Statistics Let s face it. Statistics has a really bad reputation. Why? 1. It is boring. 2. It doesn t make a lot of sense. Actually, the

More information

MATH 1150 Chapter 2 Notation and Terminology

MATH 1150 Chapter 2 Notation and Terminology MATH 1150 Chapter 2 Notation and Terminology Categorical Data The following is a dataset for 30 randomly selected adults in the U.S., showing the values of two categorical variables: whether or not the

More information

Tackling Statistical Uncertainty in Method Validation

Tackling Statistical Uncertainty in Method Validation Tackling Statistical Uncertainty in Method Validation Steven Walfish President, Statistical Outsourcing Services steven@statisticaloutsourcingservices.com 301-325 325-31293129 About the Speaker Mr. Steven

More information

COMPOSITIONAL IDEAS IN THE BAYESIAN ANALYSIS OF CATEGORICAL DATA WITH APPLICATION TO DOSE FINDING CLINICAL TRIALS

COMPOSITIONAL IDEAS IN THE BAYESIAN ANALYSIS OF CATEGORICAL DATA WITH APPLICATION TO DOSE FINDING CLINICAL TRIALS COMPOSITIONAL IDEAS IN THE BAYESIAN ANALYSIS OF CATEGORICAL DATA WITH APPLICATION TO DOSE FINDING CLINICAL TRIALS M. Gasparini and J. Eisele 2 Politecnico di Torino, Torino, Italy; mauro.gasparini@polito.it

More information

( ) P A B : Probability of A given B. Probability that A happens

( ) P A B : Probability of A given B. Probability that A happens A B A or B One or the other or both occurs At least one of A or B occurs Probability Review A B A and B Both A and B occur ( ) P A B : Probability of A given B. Probability that A happens given that B

More information

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts Data Mining: Concepts and Techniques (3 rd ed.) Chapter 8 1 Chapter 8. Classification: Basic Concepts Classification: Basic Concepts Decision Tree Induction Bayes Classification Methods Rule-Based Classification

More information

MATH 10 INTRODUCTORY STATISTICS

MATH 10 INTRODUCTORY STATISTICS MATH 10 INTRODUCTORY STATISTICS Tommy Khoo Your friendly neighbourhood graduate student. Week 1 Chapter 1 Introduction What is Statistics? Why do you need to know Statistics? Technical lingo and concepts:

More information

Prediction of double gene knockout measurements

Prediction of double gene knockout measurements Prediction of double gene knockout measurements Sofia Kyriazopoulou-Panagiotopoulou sofiakp@stanford.edu December 12, 2008 Abstract One way to get an insight into the potential interaction between a pair

More information

Other likelihoods. Patrick Breheny. April 25. Multinomial regression Robust regression Cox regression

Other likelihoods. Patrick Breheny. April 25. Multinomial regression Robust regression Cox regression Other likelihoods Patrick Breheny April 25 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/29 Introduction In principle, the idea of penalized regression can be extended to any sort of regression

More information

Residuals in the Analysis of Longitudinal Data

Residuals in the Analysis of Longitudinal Data Residuals in the Analysis of Longitudinal Data Jemila Hamid, PhD (Joint work with WeiLiang Huang) Clinical Epidemiology and Biostatistics & Pathology and Molecular Medicine McMaster University Outline

More information

The Design of a Survival Study

The Design of a Survival Study The Design of a Survival Study The design of survival studies are usually based on the logrank test, and sometimes assumes the exponential distribution. As in standard designs, the power depends on The

More information

Classification 1: Linear regression of indicators, linear discriminant analysis

Classification 1: Linear regression of indicators, linear discriminant analysis Classification 1: Linear regression of indicators, linear discriminant analysis Ryan Tibshirani Data Mining: 36-462/36-662 April 2 2013 Optional reading: ISL 4.1, 4.2, 4.4, ESL 4.1 4.3 1 Classification

More information

Bioconductor Project Working Papers

Bioconductor Project Working Papers Bioconductor Project Working Papers Bioconductor Project Year 2004 Paper 6 Error models for microarray intensities Wolfgang Huber Anja von Heydebreck Martin Vingron Department of Molecular Genome Analysis,

More information

cdna Microarray Analysis

cdna Microarray Analysis cdna Microarray Analysis with BioConductor packages Nolwenn Le Meur Copyright 2007 Outline Data acquisition Pre-processing Quality assessment Pre-processing background correction normalization summarization

More information