Mixture distributions with application to microarray data analysis

Size: px

Start display at page:

Download "Mixture distributions with application to microarray data analysis"

Curtis Whitehead
6 years ago
Views:

University of South Florida Scholar Commons Graduate Theses and Dissertations Graduate School 2009 Mixture distributions with application to microarray data analysis O'Neil Lynch University of South

edu/etd Part of the American Studies Commons Scholar Commons Citation Lynch, O'Neil, "Mixture distributions with application to microarray data analysis" (2009). Graduate Theses and Dissertations.

1 University of South Florida Scholar Commons Graduate Theses and Dissertations Graduate School 2009 Mixture distributions with application to microarray data analysis O'Neil Lynch University of South Florida Follow this and additional works at: Part of the American Studies Commons Scholar Commons Citation Lynch, O'Neil, "Mixture distributions with application to microarray data analysis" (2009). Graduate Theses and Dissertations. This Dissertation is brought to you for free and open access by the Graduate School at Scholar Commons. It has been accepted for inclusion in Graduate Theses and Dissertations by an authorized administrator of Scholar Commons. For more information, please contact

2 Mixture Distributions with Application to Microarray Data Analysis by O Neil Lynch A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy Department of Mathematics and Statistics College of Arts and Sciences University of South Florida Co-Major Professor: Kandethody Ramachandran, Ph.D. Co-Major Professor: Wonkuk Kim, Ph.D. Chris Tsokos, Ph.D. Tapas Das, Ph.D. Date of Approval: September 4, 2008 Keywords: Likelihood ratio test; Modified likelihood, Penalized likelihood; Asymptotic chi-square distribution; Consistency c Copyright 2009, O Neil Lynch

3 Dedication To my wife Lonnette

4 Acknowledgements It is a great pleasure for me to express my profound respect, gratitude, and admiration for my major supervisor Dr. K. Ramachandran and co-major supervisor Dr. W. Kim. They have given unstintingly of their time, advise, and encouragement ever since they both undertook the task to make me further understand the fields of microarray data analysis and mixture distribution. I also would like to thank the other members of my committee namely Dr. C. Tsokos, and Dr. T. K. Das who all have given of their time and have always encouraged me to do my best. The staff at the Department of Mathematics and Statistics is deserving of my thanks for the help and advice they have given me over the years. I owe thanks to my family and a large number of friends for the encouragement and complete understanding in my quest to excel in the field of Statistics.

5 Table of Contents List of Figures List of Tables Abstract iv vi viii 1 Introduction 1 2 Microarray Data and Some Statistical Analysis DNA Microarray Experiments Genetic Background cdna Microarray Experiment Oligonucleotide Microarray Experiment Some Statistical Challenges With Analyzing Microarray Data Methods of Analyzing Microarray Data Cluster Analysis The T-Test Regression Model Significance Analysis of Microarrays (SAM) Mixture Model Method (MMM) A Mixture Model Approach Using P -Value Conclusion Finite Mixture Distribution Definition and Preliminary Examples of Mixture Distributions i

6 3.1.2 Mean and Variance of Mixtures Comparison of Two Groups: Iris Data Parameter Estimation Expectation Maximization Algorithm Robust Parameter Estimation Penalized Maximum Likelihood Estimator for Normal Mixture Models Estimating the Number of Components g Simulation Approach Modified Likelihood Ratio Test for Homogeneity in Finite Mixture Models Regularity Conditions Box-Cox transformation Conclusion The Penalized Modified Likelihood for Normal Mixture Model Penalized Modified Likelihood Parameter Estimation of Penalized Modified Likelihood Inverse Gamma Penalty Function for σ Inverse Chi-Square Penalty Function for σ Consistency and Asymptotic Normality Preliminary Results Main results Conclusion Penalized Modified Likelihood Approach to Microarray Data Analysis Penalized Modified Mixture Model (PMMM) PMMM Simulated Null Distribution Simulating Microarray Data Application of PMMM to Simulated Microarray Data Application of PMMM to the Rat data Conclusion ii

7 6 Modified P -Value Approach to Microarray Data Analysis The modified P -Value Approach Simulated Null Distribution of the Modified P -Value Approach Application of Modified P -Value to Simulated Microarray Data Application of Modified P -Value to simulated Prostate Data Application of Modified P -Value to the Prostate Data Conclusion Summary and Concluding Remarks 102 References 105 About the Author End Page iii

8 List of Figures 3.1 Histogram of Sepal length of the two species of flowers Q-Q plots of Sepal lengths for versicolor flowers Q-Q plots of Sepal lengths for verginica flowers Histogram and known mixture distribution Histogram and estimated mixture distribution Histogram with known and estimated mixture distribution Graphical representations of two component normal with equal variance Histogram, heteroscedastic and homoscedastic fit for simulated data from the mixture 0.5φ(y 4, 1) + 0.5φ(y 8, 1) Histogram, heteroscedastic and homoscedastic fit for simulated data from the mixture 0.5φ(y 4, 1) + 0.5φ(y 8, 2) Histograms of z, Z and fitted models for the simulated data The likelihood ratio statistic as a function of Z value for the simulated data The values of FDR from our method and SAM for the simulated data Histograms of z, Z and fitted models for the rat data The likelihood ratio curve for the rat data The values of FDR from our method and SAM for the rat data Various Beta Distributions Histogram of p-value and beta mixture distributions for 2000 simulated genes iv

9 6.3 Histogram of p-value and beta mixture distributions for simulated prostate data Histogram of p-value and beta mixture distributions for the prostate data v

10 List of Tables 3.1 Summary statistics of data Mean, variance and percentiles for the penalized modified likelihood, based on 1000 replicates for each sample for testing the hypothesis a 1-component against 2-components Mean, variance and percentiles for the penalized modified likelihood, based on 1000 replicates for each sample for testing the hypothesis 2-components against 3-components Hypothesis test for the number of components for the fitted normal mixture models of z and Z, for the simulated microarray data Values of TP, FP and FDR from PMMM for the simulated data Values of TP, FP and FDR from SAM for the simulated data Hypothesis test for the number of components for the fitted normal mixture models of z and Z, for the rat data Values of TP, FP and FDR from PMMM for the rat data Values of TP, FP and FDR from SAM for the rat data Mean, variance and percentiles for the likelihood ratio test for the modified p-value, based on 1000 replicates for each sample for testing the hypothesis a uniform against a uniform and one beta distribution vi

11 6.2 Mean, variance and percentiles for the likelihood ratio test for the modified p-value, based on 1000 replicates for each sample for testing the hypothesis a uniform against a uniform and two beta distributions Hypothesis test for the number of components for the fitted uniform-beta mixture models for the prostate data vii

12 Mixture Distributions with Application to Microarray Data Analysis O Neil Lee Lynch ABSTRACT The main goal in analyzing microarray data is to determine the genes that are differentially expressed across two types of tissue samples or samples obtained under two experimental conditions. In this dissertation we proposed two methods to determine differentially expressed genes. For the penalized normal mixture model (PMMM) to determine genes that are differentially expressed, we penalized both the variance and the mixing proportion parameters simultaneously. The variance parameter was penalized so that the log-likelihood will be bounded, while the mixing proportion parameter was penalized so that its estimates are not on the boundary of its parametric space. The null distribution of the likelihood ratio test statistic (LRTS) was simulated so that we could perform a hypothesis test for the number of components of the penalized normal mixture model. In addition to simulating the null distribution of the LRTS for the penalized normal mixture model, we showed that the maximum likelihood estimates were asymptotically normal, which is a first step that is necessary to prove the asymptotic null distribution of the LRTS. This result is a significant contribution to field of normal mixture model. The modified p-value approach for detecting differentially expressed genes was also discussed in this dissertation. The modified p-value approach was implemented so that a hypothesis test for the number of components can be conducted by using the modified likelihood ratio test. In the modified p-value approach we penalized the mixing proportion so that the estimates of the mixing proportion are not on the boundary of its viii

13 parametric space. The null distribution of the (LRTS) was simulated so that the number of components of the uniform beta mixture model can be determined. Finally, for both modified methods, the penalized normal mixture model and the modified p-value approach were applied to simulated and real data. ix

14 1 Introduction In recent years microarray technology has made it possible to simultaneously analyze thousands of genes. Although an enormous volume of data is being produced by microarray technologies (Schena et al., 1995; DeRisi et al., 1997; Hughes et al., 2001; Lockhart et al., 1996), one remaining challenge is how to analyze and interpret the large amounts of data. A major challenge is to detect genes with differentially expressed profiles under two different experimental conditions, which may refer to samples drawn from two types of tissues, tumors or cell lines, or at two time points during important biological processes. Many of the methods used for such analysis, including the method of identifying genes with fold changes are known to be unreliable because in such methods the statistical variability of the data is not properly addressed [8]. While various parametric methods and tests such as the two-sample t-test [12] and regression model have been applied for microarray data analysis, strong parametric assumptions made in these methods as well as strong dependency on large sample sets restrict the reliability of such techniques in microarray problems where only a small number of replications are available. The non parametric statistical methods, including the Empirical Bayes (EB) method [14], the significance analysis for microarray data (SAM [39]) and mixture model method (MMM) [27, 42, 25] have been applied to microarray data analysis. It is claimed and argued that the new extensions of the (MMM) are among the available methods producing biologically-meaningful results [27, 43]. In this dissertation we extended the mixture model method (MMM) by penalizing the mixing proportions and the component variances simultaneously. The mixing proportion was penalized so that a modified likelihood ratio test similar to that of Chen et al. (2001, 2004) for testing the number of components of the fitted normal mixture model can be implemented. The variance was penalized so that the log-likelihood is bounded resulting 1

15 in the existence of the MLE s. In a similar fashion the p-value approach (Allison et al. (2002)) for the detection of differentially expressed genes of microarray data was also modified. For the p-value approach only the mixing proportion was modified so that the MLE of the mixing proportion was not on the boundary of its parametric space. This modification was done so that a modified likelihood ratio test similarly to what was done by Chen et al. may be implemented so that we may test the hypothesis for the number of components. This dissertation is organized as follows. Chapter 2 describes in some detail the genetic background of DNA and two of the leading microarray experiments, cdna and Oligonucleotide. In Chapter 2 we also discussed some of the statistical challenges we have in analyzing microarray data and gives a description of some of the methods used to analyze microarray data. The methods that were discussed are (1) Cluster analysis (2) T-test (3) Regression analysis (4) Significant analysis of microarray (SAM) (5) Mixture model method (MMM) and (6) A p-value approach for detecting differentially expressed microarray data. In Chapter 3 we present the theory of finite mixture methods and discussed how the parameters can be estimated by (1) expectation maximization algorithm (EM) and (2) the robust parameter estimation - which is of interest if the data contains outliers. One of the challenges of finite mixture distributions is to determine the number of components therefore we discussed some techniques used to determine the number of components which are namely AIC, BIC, simulation and the modified likelihood ratio test. The boxcox transformation for distinquishing skewed distributions from commingled distributions was also presented in chapter 3. The penalized modified approach will be discussed in chapter 4. The estimators of the parameters of the penalized normal mixture model when both the variance and mixing proportion were simultaneously penalized was illustrated. The evaluation of the estimators for the two penalty functions for the variance, the inverse gamma and inverse chi-square distributions were addressed. The asymptotic property namely asymptotic normality of the normal mixture model was also proved in Chapter 4. Chapter 5 discussed the applications of the penalized/modified approach of the normal mixture model to detecting differentially expressed genes and illustrated its applications to simulated and 2

16 real data. The results of the penalized/modified normal mixture model approach were compared to that of SAM and was shown to out perform SAM. Chapter 6 discussed the modified p-value approach for detecting differentially expressed genes. Similar to the work done in Chapter 6 we applied our method to simulated and real data. The motivation for modifying the p-value approach of Allison et al. was that the MLE of the mixing proportion was on the boundary point of its parametric space, therefore we applied the technique of Chen et al., that is, we applied a penalty function for the mixing proportion so that the MLE of the mixing proportion will not be on the boundary points of its parametric space. The conclusions of this study were summarized in Chapter 7. 3

17 2 Microarray Data and Some Statistical Analysis 2.1 DNA Microarray Experiments Genetic Background The double-stranded molecules deoxyribonucleic acid (DNA) (Watson and Crick, 1953) contains all the genetic information of living organisms. Each strand or helix of DNA is a chain of nucleotides that consists of a sugar, a phosphate and a nitrogenous base molecule. The information in DNA is stored as a code made up of four chemical bases: Adenine (A), Guanine (G), Cytosine (C) and Thymine (T). These four bases are responsible for the DNA molecule having four distinct types of nucleotides. The bases are coupled in the following manner: A with T and C with G, by a hydrogen bond which is called complementary base pairing. The nucleotides are arranged in two long strands that form a spiral called a double helix. The double helix structure of DNA is similar to a ladder, with the base pairs forming the ladder s rungs and the sugar and phosphate molecules forming the vertical sidepieces of the ladder. In cells, genes consist of a long strand of DNA that contains a promoter, which controls the activity of a gene. Additionally, all living cells contain chromosomes, that are, large pieces of genes containing hundreds or thousands of genes, each of which specifies the composition and structure of a protein (or several related proteins). The workhorse molecules of the cell are protein polymers of amino acids which are responsible for cellular structure, producing energy and important biomolecules like DNA and proteins, and for reproducing the cell chromosomes. The cohort of chromosomes are almost the same in every cell in an organism, and contains the same repertoire of proteins. However, cells have remarkably distinct properties, such as the difference between human eye cells, hair cells, and liver cells; these distinctions are the result of differences in the abundance, 4

18 distribution, and state of the cell proteins. When a gene is active, the coding and non-coding sequence is copied in a process called transcription, producing messenger RNA (mrna) which is a copy of the gene s information. The mrna, a small and relatively unstable nucleic acid polymers, can then direct the synthesis of proteins through the genetic code. However, mrnas can also be used directly, for example as part of the ribosome. The resulting molecules from the gene expression, mrna or protein are known as gene products. There is therefore a logical connection between the state of a cell and the details of its protein and mrna composition. Whereas it remains difficult to measure the abundance of a cell s proteins, DNA microarray makes it possible to quickly and efficiently measure the relative representation of each mrna species in the total cellular mrna population, or in more familiar terms, to measure gene expression levels. There are several types of microarray systems including the cdna microarray (Schena et al., 1995; DeRisi et al., 1997: Hughes et al., 2001) and oligonucleotide array (Lockhart et al., 1996) cdna Microarray Experiment In this experiment, the cdna sequence corresponding to a set of genes pertinent to the biological question under investigation are obtained and printed onto a glass slide or substrate using a robotic arrayer. Second, the sample RNA is isolated, a critical step in the experiment in order to ensure that a sufficient amount of each cdna clone is printed on the array where each clone is amplified by a technique called polymerase chain reaction (PCR). In practice the printed amount of cdna is not the same, therefore the cdna on the array, which is a double-stranded probe, needs to be denatured and this is achieved by heating the array so that a target cdna can bind to it. In the third step the cdna is synthesized, a procedure that also involves labeling the isolated mrna from the biological samples. Usually in the most current cdna microarray experiments, cdnas from the experimental and reference samples are labeled with red-fluorescent dye, Cy5 and green-fluorescent dye, Cy3 respectively, mixed and hybridized on the slide. There are several different labeling methods including Primer 5

19 Tagging, Direct Incorporation Labeling and Amino-Modified Nucleotide. Nguyen et al. (2002), Wong et al. (2001) and Stears et al. (2000) discuss the advantages and disadvantages of these methods. Fourth, the labeled probe cdna is hybridized to target the cdna on the microarray. That is, if a particular gene is expressed in the target cell, where the cdnas corresponding to this gene are found in the target cdna pool, these cdnas will bind with the complementary cdna probes printed on a specific spot on the microarray. Hybridization refers to the binding ability of two complementary DNA strands by the base-pairing rule thus reforming the DNA double helix. Finally, the hybridization results are imaged and analyzed using a fluorescent microscope, the log(red/green) intensities of mrna hybridization at each site is measured. The result is tens of thousands of gene expressions, typically ranging from -4 to 4, which is a measure of the expression level of each gene in the experimental sample relative to the reference sample. Positive values indicate higher expression in the target versus the reference, and vice versa for negative values Oligonucleotide Microarray Experiment Another widely used microarray technology is high density oligonucleotide arrays known as Affymetrix (Lockhart et al., 1996). This method is based on the fact that each gene is represented by 14 to 20 features (Lipshutz et al., 1999). for example, Affymetrix array used 20 features. Each feature is a short sequence of nucleotides, an oligonucleotide, and it is a perfect match (PM) to a segment of the gene. Paired with the 20 PM oligonucleotides to the gene sequence are 20 other oligonucleotides having the same sequence corresponding to the 20 PMs except for a single mismatch (MM) at the central base of the nucleotide. When the gene is expressed in the cell sample, high intensity is expected for the PM feature and low intensity for the MM feature. Given the 20 PM and MM feature pairs for the gene, many methods have been proposed to quantify the expression level of the gene. For example, Affymetrix originally proposed the average difference x = avg{d k = (P M k MM k ), k = 1, 2,..., 20 = K} to quantify expression level of a gene in a particular array. The average is usually based only on the differences, d k, with 6

20 3 standard deviations from the mean of d (2),..., d (K 1), where d (k) is the k th smallest difference, but there are various other ways to filter the outliers, Efron et al. (2001) suggested x = avg{d k = log(p M k ) c log(mm k ), k = 1, 2,..., 20} for several different scale factors c. Naef et al. (2001) proposed to use only the PM features. In an attempt to obtain more sensitive measure of gene expression, Li and Wong (2001) proposed a model-based estimate of the expression level using the least square method. The method for sample labeling and image processing in the Affymetrix arrays are found in Lockart et al. (1996). Refer to The Chipping Forecast (Lander et al., 1999) for more details on cdna microarrays and oligonucleotide chips. 2.2 Some Statistical Challenges With Analyzing Microarray Data Microarray technologies allow scientists to monitor the mrna transcript levels of thousands of genes in a single experiment. However, the tremendous amount of data that is obtained from microarray studies presents challenges for data analysis. One challenge in the development of statistical methods for microarray data analysis is that sample sizes under two different experimental conditions are typically small. We can depict this situation by defining the data as follows: for each gene i, i = 1, 2,..., N, we have expression levels (Y i1,..., Y im ) from m microarrays under condition 1, and (Y i,m+1,..., Y i,m+n ) from n arrays under condition 2. Usually the total number of genes N is large (> 1000) whereas the number of replications, m and n are small (typically < 20). Since statisticians are primarily interested in genes that are differentially expressed across two different experimental conditions, which may refer to samples drawn from two types of tissues, tumors or cell lines, or at two time points during important biological processes, we need to make an adjustment for the type I error rate when doing simultaneous hypothesis tests. This adjustment is done by means of the Bonferroni method, to deal with multiple comparisons. If we use α as the significance level then the test or gene specific significance level for a two sided test is therefore α = α/2n. Investigators may need to have the answer for the following question Is the difference in expression level for a particular gene statistically significant? However, there are a number of equally important questions that need to be answered (Allison et al. (2001)): 7

21 1. Is there statistically significant evidence that any of the genes under study exhibit a difference in expression across the conditions? 2. What is the best estimate of the number of genes for which there is a true difference in gene expression? 3. What is the confidence interval around that particular estimate? 4. If we set some threshold for which we expect particular genes to be interesting and worthy of follow-up study, what proportion of those genes are likely to be genes for which there is a real difference in expression and what proportion are likely to be false leads? 5. What proportion of those genes not declared interesting are likely to be genes for which there is a real difference in expression (i.e., misses or false negatives)? In analyzing microarray data the assumptions made are (1) For each gene, the measurements of gene expression have a finite population mean and variance; (2) For each gene under study, there is a measure of expression level available for each sample and this measure has sufficient reliability and validity (i.e. the measurements of the expression levels are a true reflection of the true state of nature); (3) The most important assumption that is made is that gene expression levels across the two groups are independent - which implies that we may able to evaluate the likelihood function which will become important later in this dissertation. 2.3 Methods of Analyzing Microarray Data Cluster Analysis One method used in the analysis of microarray data is Cluster analysis. Cluster analysis groups genes or samples into clusters based on similar expression profiles and provides clues to the function or regulation of genes or similarity of samples via shared cluster membership [34, 35, 18]. Several clustering methods have been usefully applied to analyzing genome-wide expression data and can be classified largely into three categories. The three-based approach uses distance measures between genes such as correlation co- 8

22 efficients to group genes into a hierarchical tree [15]. The second category clusters genes so that within-cluster variation is minimized and between-cluster variation is maximized [34, 35]. The third category group genes into blocks, in which the correlation is maximized and between which the correlation is minimized [3]. The power of cluster analysis in the analysis of microarray data lies in discovering gene transcripts or samples that show similar expression profiles. However, identification of like groups is not necessarily the objective in a microarray study, because the interest is to discover genes that are differentially expressed between predefined sample groups, such as normal versus cancerous tissues. Data Let Y ik be the expression level of gene i in array k (i = 1,..., N; k = 1,..., m, m + 1,..., m + n). Suppose that the first m and the last n arrays are both obtained under two different conditions, that is Y i(1) = (Y i1,..., Y im ) and Y i(2) = (Y i,m+1,..., Y i,m+n ). Since we are interested to determine which genes are differentially express between Y i(1) and Y i(2), we let Y ik = a i + b i x k, (2.3.1) where 1 for 1 k m x k = 0 for m + 1 k m + n. Therefore the mean expression levels of gene i under the two conditions are a i + b i and a i respectively. Hence to determine the genes that are differentially expressed is equivalent to testing the hypothesis H 0 : b i = 0, there is no gene with altered expression H 1 : b i 0, otherwise (2.3.2) Using the data construction of equation (2.3.1) for Y ik we will now present the t-test and regression models used in microarray data analysis. 9

23 2.3.2 The T-Test There are several versions of the two-sample t-test, depending on whether the sample size (i.e m and n) is large and whether it is reasonable to assume that the gene expression levels have an equal variance under the two conditions. Both m and n are usually small, and there is evidence to support unequal variance (Thomas et. al. 2001), we will only discus the t-test with two independent small Normal samples with unequal variances. Let the sample means and variances of Y ik for gene i under the two conditions be Ȳ i(1) = m k=1 Y ik m, Ȳ i(2) = m+n k=m+1 Y ik n (2.3.3) and s 2 i(1) = s 2 i(2) = m k=1 (Y ik Ȳi(1)) 2, m 1 m+n k=m+1 (Y ik Ȳi(2)) 2. (2.3.4) n 1 The t-statistic is Z i = Ȳi(1) Ȳi(2) s 2 i(1) + s2 i(2) m n, (2.3.5) Under the normality assumption for Y ik, Z i degrees of freedom d j = This t-test was proposed by Welch (1947). ( s 2 ) 2 i(1) + s2 i(2) m n ( s ) 2 2 ( s ) 2 2 i(1) i(2) m n + m 1 n 1 freedom is similar to the idea of the Satterthwaite approximation. approximately has a t-distribution with Its method of calculating the degrees of Regression Model The regression model estimates the values of (a i, b i ) using the weighted least square method, and then estimates the variance of ˆb i using the robust or sandwich variance 10

24 estimator. V ar(ˆb i ) = ( s 2 i(1) m )( m 1 ) + m ( s 2 i(2) n )( n 1 and the estimate of ˆb i = Ȳi(1) Ȳi(2). Therefore the test statistics is n ), Z i = ˆbi V ar(ˆb i ) = s 2 i(1) m Ȳ i(1) Ȳi(2) m 1 m + s2 i(2) n. (2.3.6) n 1 n This test statistic compares well with that of the t-test. In the case of the t-test the test statistic is Z i = Ȳi(1) Ȳi(2) s 2 i(1) + s2 i(2) m n, (2.3.7) where Ȳi(1), Ȳi(2), s 2 i(1) and s2 i(2) are defined as in (2.3.3) and (2.3.4). Note that the two tests are the same as m, n, however in microarray data analysis both m, n are small, which makes the t-test better because of the unbiasedness of its variance estimator. Note that the strong parametric assumptions that needs to be made to use both the t-test and the regression approach is often times violated for microarray data analysis. Therefore, the Significance Analysis of Microarrays (SAM) is an important method developed for microarray data analysis that seeks to over theses strong parametric assumptions Significance Analysis of Microarrays (SAM) The significance analysis of microarrays (SAM) is one statistical technique for finding significant genes in a set of micoarray experiments. It was proposed by Tusher, Tibshirani and Chu [39]. This approach was based on analysis of random fluctuations in the data. However, even for a given level of expression, the fluctuations were gene specific. To account for gene-specific fluctuations, a statistic based on the ratio of change in gene expression to standard deviation in the data for that gene was defined. The relative difference d(i) in the gene expression is: d(i) = Ȳi(1) Ȳi(2) s(i) + s 0 (2.3.8) 11

25 where Ȳi(1) and Ȳi(2) are defined as the average expression levels of the i th gene from conditions 1 and 2, respectively. The gene-specific scatter s(i) is the standard deviation of repeated expression measurements: s(i) = 1/m + 1/n ( m (Y ik m + n 2 Ȳi(1)) 2 + k=1 m+n k=m+1 (Y ik Ȳi(1)) 2 ) where m and n are the numbers of measurements in conditions 1 and 2 respectively. (2.3.9) In order to compare values of d(i) across all genes, the distribution of d(i) should be independent of the level of gene expression. At low expression levels, variance in d(i) can be high because of small values of s(i). To ensure that the variance of d(i) is independent of the gene expression, a small positive constant s 0 (exchangeability factor) was added to the denominator of equation (2.3.8). The coefficient of variation of d(i) was computed as a function of s(i) in moving windows across the data. The value for s 0 was chosen to minimize the coefficient of variation. To minimize the effects of potential confounders between the conditions, the data was analyzed by taking B sets of permutations. For each permutation b the statistic d b i and the corresponding order statistics d b (1) d b (2)... d b (N) was computed. The expected relative difference, d b i = d b i, was defined as the average over the set of B permutations. B To identify potentially significant changes in expression levels, they used a scatter plot of the observed relative difference d(i) versus the expected relative difference d i. For a fixed threshold, starting at the origin, and moving up to the right find the first i = i 1 such that d i d i >. All genes pass i 1 are called significant positive. Similarly, start at the origin, move down to the left and find the first i = i 2 such that d i d i >. All genes pass i 2 are called significant negative. For each the upper cutoff point cut up ( ) was defined as the smallest d i among the significant positive genes, and similarly defining the lower cutoff point cut low ( ). To determine the number of falsely significant genes generated by SAM, the total number of falsely significant genes corresponding to each permutation was computed by counting the number of genes that exceeded the cutoffs cut up ( ) and cut low ( ). The estimated number of falsely significant genes was the median (or 90 th percentile) of the number of genes called significant from the B sets of permutations. Such genes are called 12

26 false positive (F P ). This information will then be used to estimate the false Discovery Rate (F DR) F DR = π 0 F P/T P (2.3.10) where π 0 is the true proportion of equivalent expressed (EE) genes in the data set and T P is the number of total (true) positives discovered from the test statistic, that is, T P is the total number of genes claimed to be differentially expressed (DE) Mixture Model Method (MMM) The mixture model method (MMM) was introduced to handle the problem when a small number of replications under two experimental conditions exist, which is exactly the case for the data in a microarray experiment. The main purpose of the MMM is to estimate the distribution of a t-type test statistic and its null statistic using finite normal mixture models, which results in the method being non-parametric. Additionally, the strong parametric assumption made when analyzing microarray when the traditional statistical test is applied is often violated, hence this make the MMM statistically safer because the assumption of normality is not made. Consider the situation where, for each gene i, i = 1, 2,..., N, we have expression levels Y i(1) = (Y i1,..., Y im ) from m microarrays under condition 1, and Y i(2) = (Y i,m+1,..., Y i,m+n ) from n arrays under condition 2. Here we need to assume that both m and n are even integers, this will become obvious later. The goal is to identify genes such that (Y i1,..., Y im ) and (Y i,m+1,..., Y i,m+n ) have different means. This appears to be a two sample comparison however, in microarray data, that has small m and n with a large N, renders the traditional statistical tests such as the t-test or rank-based nonparametric tests, ineffective. One alternative is to draw statistical inference based on the distributions of quantities related to (Y i1,..., Y im ) or (Y i,m+1,..., Y i,m+n ), for 1 i N, to take advantage of the large population size N. The model assumes a nonparametric approach for gene expression data: Y i(1) = µ i(1) + ε i(1) Y i(2) = µ i(2) + e i(2) 13

27 where µ i(1) and µ i(2) are the mean expression levels for gene i under the two conditions respectively, and ε i(1) and e i(2) are independent random errors with means and variances, such that E(ε i(1) ) = E(e i(2) ) = 0, V ar(ε i(1) ) = σ 2 i(1), V ar(e i(2) ) = σ 2 i(2), for any j = 1,..., m, m+1,..., m+n and i = 1,..., N. Note, we do not assume equality of variance of the gene expression levels, because the variance σi(c) 2 of gene expression level depends on the mean expression µ i(c). Also, we do not assume µ i(1) = µ i(2). The basis of the model is to compare two distributions of two similar statistics (after being suitably standardized) to infer whether some genes are differentially expressed. Let m and n be even such that p i (q i ) is a column vector containing random permutation of m/2 1 s and m/2-1 s (n/2 1 s and n/2-1 s). Let Y i(1) = (Y i1,..., Y im ) and Y i(2) = (Y i,m+1,..., Y i,m+n ) then assume that z i = Y i(1)p i /m + Y i(2) q i /n s 2i(1) /m + s2i(2) /n f 0, (2.3.11) which does not depend on µ i(1) and µ i(2) since its mean is 0. Furthermore, suppose that Z i = = m k=1 Y ik/m m+n k=m+1 Y ik/n s 2 i(1) /m + s2 i(2) /n Ȳ i(1) Ȳi(2) s 2i(1) /m + s2i(2) /n f 1. (2.3.12) The hypothesis is of the form H 0 : f 0 = f 1, there is no gene with altered expression H 1 : f 0 f 1, otherwise (2.3.13) and is valid only if the random errors are independent and their distribution is symmetric about 0. Since m, n > 1 then we can estimate s 2 i(1) and s2 i(2) using the sample variances s 2 i(1) = m k=1 (Y ik Ȳi(1)) 2 and s 2 m 1 i(2) = m+n k=m+1 (Y ik Ȳi(2)) 2 respectively. Note the data z n 1 i s and 14

28 Z i s are used to estimate f 0 and f 1 by normal mixture model respectively, which will be discussed in more details in chapter 3. To test the null hypothesis H 0 that Z is from f 0 (which is equivalent to testing the hypothesis (2.3.13)), we can construct a likelihood ratio test (LRT) based on the following statistic: LR(Z) = f 0(Z) f 1 (Z). (2.3.14) A large value of LR(Z) gives no evidence against H 0, whereas a too small value of LR(Z) leads to rejecting H 0. determine the rejection region. bisection method [29] to solve With the normal mixture model, it is possible to numerically For any given false positive rate α, we can use the α = LR(z)<s f 0 (z)dz to obtain the suitable cut off point s. Then the rejection region is RR(α) = {Z : LR(Z) < s}. We call the method of using the LRT in MMM as MMM-LRT. Similar to SAM (Tusher et al. 2001), we can estimate the numbers of false positive (F P ) and total (T P ) directly. In MMM-LRT, for any given s, we have: F P (s) = 1 B B b=1 n(i : LR(z (b) i ) < s), T P (s) = n(i : LR(Z i ) < s) where n(i) represents the number of genes. In estimating F P, one can also use median, rather than mean, F P over the permuted data. Based on the estimated F P and T P, one can also calculate the false discovery rate as F DR = F P/T P (Benjamini and Hochberg 1995; Storey 2001; Tusher et al. 2001) A Mixture Model Approach Using P -Value In is well known that the distribution of the p-values is uniformly distributed on the interval [0, 1], regardless of the statistical test used and the sample size. Therefore if investigators uses a valid statistical test to produce p-values for testing the null hypothesis 15

29 H 0 there is no difference between the two experiments for the i th gene, i = 1,..., N, then the distribution of the p-value can be used to determine the genes that were differentially expressed. The assumption of independence of the gene expression levels across genes was made under the null hypothesis. Additionally, under the alternative hypothesis, the distribution of p-values will tend to cluster closer to zero than to one, as opposed to be uniformly distributed under the null hypothesis. Then, the question Is there statistically significant evidence that any of the genes under study exhibit a difference in expression across the two experimental conditions? can be answered by conducting a test to determine if the observed p-values are significantly different from the uniform distribution. This is done by mixture model approach [2]. The mixture model is a g-component of beta distributions β(r j, s j ) for j = 1,..., g with the parameters r j and s j, where the beta distribution is defined as follows β(y r, s) = Γ(r + s)yr 1 (1 y) s 1. Γ(r)Γ(s) The reason for the choice of the beta distribution is because of its great flexibility in modeling any shaped distribution on the interval [0, 1]. Note that the uniform distribution is a special case of the beta distribution with r = s = 1. The likelihood for the collection of N, p-values from a model with g components is given as L g = [ N p 1 β(y i 1, 1) ] g p j β(y i r j, s j ), Therefore the log likelihood for the N p-values can be expressed as l g = j=2 [ N ln p 1 β(y i 1, 1) + ] g p j β(y i r j, s j ), where y i is the p-value for the i th test, p 1 is the probability that a randomly chosen test from the collection of tests is for a gene where there is no population difference in gene expression (i.e., a test of a true null hypothesis), and p j is the probability that a randomly chosen test from the collection of tests is for a gene where there is a population difference in gene expression (i.e., a test of a false null hypothesis). The above model now requires j=2 16

30 the calculation of the MLE of the parameters p j, r j and s j through iterative procedure subject to the constraints g p j = 1 and 0 p j 1 for all j = 1,..., g. The estimate of the number of genes for which there is a difference in gene expression is evaluated as N(1 ˆp 1 ), where ˆp 1 is the MLE of p 1. Let T be some threshold below which the results for particular genes are declared interesting and worthy of follow-up study, the proportion of those genes that are likely to be genes for which there is a real difference is P ( H 0,i y i T ) = 1 P (H 0,i y i T ) = 1 P (H 0,i y i T ), P (y i T ) where g T P (y i T ) = p 1 T + p j j=2 0 Γ(r j + s j )y rj 1 (1 y) s j 1 dy Γ(r j )Γ(s j ) and P ( H 0,i y i T ) = p 1 T. The estimated proportion of genes declared interesting that are likely to be false leads is simply P (H 0,i y i T ) = P (H 0,i y i T ). P (y i T ) Similarly the proportion of those genes not declared interesting that are likely to be genes for which there is a real difference is P ( H 0,i y i T ) = 1 P (H 0,i y i T ) = 1 P (H 0,i y i T ), P (y i T ) where g 1 Γ(r j + s j )y rj 1 (1 y) s j 1 P (y i T ) = p 1 (1 T ) + p j dy Γ(r j )Γ(s j ) j=2 T and P ( H 0,i y i T ) = p 1 (1 T ). 17

31 2.4 Conclusion This chapter discussed a few of the methods used to analyze microarray data. An introduction to cluster analysis was presented, but, cluster analysis was not an effective method to determine differentially express genes. Hence the need to make use of the more classical statistical methods such as the t-test and regression analysis. However, with strong parametric assumptions that will be necessary for microarray analysis, these methods has some limitations. Microarray data are many times consist of a few replications for case and control groups, although the number of genes are usually greater than The assumption that the genes are independent is one assumption that is typical in the analysis of microarray data. Note that in chapters 3 and 5 the development of the modified approaches use the independence assumption, therefore we are prepared to deal with the consequences of assuming the genes are independent. The Significance Analysis of Microarrays (SAM) and the Mixture Model Method (MMM) presented in this chapter uses a t-type statistics to determine the number of differentially expressed genes. However, the MMM has one advantage in that it is a non-parametric approach. The MMM determines the distributions under the null and alternative and then uses these distributions to determine the number of differentially expressed genes by means of a likelihood ratio test. The p-value approach of Allision relies on parametric assumptions that are made to determine the p-values. If the p-values are not valid then its distributions under the null hypothesis may not be uniform on the interval [0,1]. In discussing the modified p-value approach presented in chapter 6, we are aware that the t-test used to determine the p-values must be valid for the modified p-value approach to be valid. However, for this dissertation we assume all the assumptions are satisfied with respect to the modified p-value approach. In addition, to the method used to analyze microarray data, we presented the biological background that the reader needs so that he may fully understand the challenges statisticians have in the analysis of microarray data. 18

32 3 Finite Mixture Distribution In this chapter we will give a brief background on mixture distributions. Mixture models are vital in statistical practice and research because many problems in statistics have mixture structures. Furthermore they are useful in describing complex population with observed or unobserved heterogeneity. Some examples are that human heights may be modeled as a two-component mixture, one component for men and one for women. Substructures in galaxy may be modeled as contaminations of big initial galaxy; the evidence of substructures is important in modern galaxy formation theory (Sun, Morrison, Harding and Woodroofe 2002). There are also applications in actuarial science, biological science, econometrics, medicine, agriculture, zoology, population studies and microarray data analysis. K. Pearson (1894) was the first to study mixture of two normal distributions, where he modeled the mixing of different crab species. Mixture model has become popular because: (1) they provide a simple mechanism to incorporate extra variation and correlation in the model (2) they add model flexibility and (3) they are a natural approach for modeling data that arise in multiple stages or when populations are composed of sub populations. In addition the theory, applications, history and importance of mixture models have been discussed in journal articles, monographs and textbooks. Everitt and Hand (1981), Titterington, Smith and Makov (1985), Böhning (1999), and McLachlan and Peel (2000) provided models, statistical methods and references for finite mixtures problems. 19

33 3.1 Definition and Preliminary Definition A stochastic variable {Y i : 1 i n} with density function f(y i θ j ) follows a finite mixture distribution if Y i π 1 f i1 (y i θ 1 ) + π 2 f i2 (y i θ 2 ) π g f ig (y i θ g ) g = π j f ij (y i θ j ), (3.1.1) where f i1 (y i θ 1 ),..., f ig (y i θ g ) are g density functions and π 1,..., π g are called mixing proportions, satisfying the following properties 0 π j 1 and g π j = 1. The densities f ij (y) for j = 1,..., g may be continuous or discrete, or a combination of both. From Definition we observe that finite mixture distribution is the marginal distribution of a random variable which follows different distributions in different subpopulations of a general population. Therefore, if a population S is defined as S = {S 1, S 2,..., S g }, such that S j S k =, j k. Then the distribution in each sub-population is given to be In S 1 : Y S 1 f 1 (Y θ 1 ) In S 2 : Y S 2 f 2 (Y θ 2 )... In S g : Y S g f g (Y θ g ) Furthermore, let X represent the statistic in each sub-population i.e., X = x 1, if ins 1 ; X = x 2, if ins 2 ;...,... ; X = x 3, if ins 3. Then X follows a discrete distribution with support {x 1, x 2,..., x g } and corresponding probabilities (weights) {π 1, π 2,..., π g }, that is P (X = x j ) = π j for j = 1,..., g. 20

34 Therefore for the finite mixture Y i g π j f ij (y i θ j ), we have Y i (X = x j ) f ij (y i θ j ), j = 1, 2,..., g where X is denoted as follow, X x 1 x 2... x g π 1 π 2... π g Note the random variable X is called latent because, in most applications, it is not observed. We now present some examples of finite mixture distributions Examples of Mixture Distributions Example Normal with common variance, that is, Y g π j N(µ j, σ 2 ) where the parameters for this mixture are θ j = (µ j, σ 2 ) and π j for j = 1,..., g. Note that Y (X = µ j ) N(µ j, σ 2 ) where X µ 1 µ 2... µ g π 1 π 2... π g. Example Normal with common mean, that is, Y g π j N(µ, σj 2 ) 21

35 where the parameters for this mixture are θ j = (µ, σ 2 j ) and π j for j = 1,..., g. Note that Y (X = σ 2 j ) N(µ, σ 2 j ) where X σ2 1 σ σg 2 π 1 π 2... π g. Example Normal with general mean and variance, that is, Y g π j N(µ j, σj 2 ) where the parameters for this mixture are θ j = (µ j, σ 2 j ) and π j for j = 1,..., g. Note that Y (X 1 = µ j, X 2 = σ 2 j ) N(µ j, σ 2 j ) where X = (X 1, X 2 ) (µ, σ2 1) (µ, σ2) 2... (µ, σg) 2 π 1 π 2... π g Mean and Variance of Mixtures Let Y g π ijf ij (y i θ j ) be a random variable that has a mixture distribution. Using the latent variable definition above, the mean and variance have the following known basic probability results for any random variables. Proposition E(Y ) = E(E(Y X)) Proposition V ar(y ) = V ar(e(y X)) + E(V ar(y X)) This implies that the mean and variance of Examples (3.1.2), (3.1.3) and (3.1.4) are given as: For Example (3.1.2) we have E(Y ) = E(E(Y X)) = E(X) = g π j µ j 22

36 and = Example (3.1.3) results in V ar(y ) = V ar(e(y X)) + E(V ar(y X)) = V ar(x) + E(V ar(x)) g ( g ) 2 = π j µ 2 j π j µ j + E(σ 2 ) g ( g ) 2 π j µ 2 j π j µ j + σ 2. E(Y ) = E(E(Y X)) = E(µ) = µ and For Example (3.1.4) we have V ar(y ) = V ar(e(y X)) + E(V ar(y X)) = V ar(µ) + E(σj 2 ) g = π j σj 2. E(Y ) = E(E(Y X)) = E(X 1 ) = g π j µ j and V ar(y ) = V ar(e(y X)) + E(V ar(y X)) = V ar(x 1 ) + E(X 2 ) g ( g ) 2 = π j µ 2 j π j µ j + E(σ 2 j ) = g ( g ) 2 π j µ 2 j π j µ j + g π j σj 2. 23

Statistical Learning and Behrens Fisher Distribution Methods for Heteroscedastic Data in Microarray Analysis

University of South Florida Scholar Commons Graduate Theses and Dissertations Graduate School 3-29-2010 Statistical Learning and Behrens Fisher Distribution Methods for Heteroscedastic Data in Microarray