Mixture distributions with application to microarray data analysis

Size: px
Start display at page:

Download "Mixture distributions with application to microarray data analysis"

Transcription

1 University of South Florida Scholar Commons Graduate Theses and Dissertations Graduate School 2009 Mixture distributions with application to microarray data analysis O'Neil Lynch University of South Florida Follow this and additional works at: Part of the American Studies Commons Scholar Commons Citation Lynch, O'Neil, "Mixture distributions with application to microarray data analysis" (2009). Graduate Theses and Dissertations. This Dissertation is brought to you for free and open access by the Graduate School at Scholar Commons. It has been accepted for inclusion in Graduate Theses and Dissertations by an authorized administrator of Scholar Commons. For more information, please contact

2 Mixture Distributions with Application to Microarray Data Analysis by O Neil Lynch A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy Department of Mathematics and Statistics College of Arts and Sciences University of South Florida Co-Major Professor: Kandethody Ramachandran, Ph.D. Co-Major Professor: Wonkuk Kim, Ph.D. Chris Tsokos, Ph.D. Tapas Das, Ph.D. Date of Approval: September 4, 2008 Keywords: Likelihood ratio test; Modified likelihood, Penalized likelihood; Asymptotic chi-square distribution; Consistency c Copyright 2009, O Neil Lynch

3 Dedication To my wife Lonnette

4 Acknowledgements It is a great pleasure for me to express my profound respect, gratitude, and admiration for my major supervisor Dr. K. Ramachandran and co-major supervisor Dr. W. Kim. They have given unstintingly of their time, advise, and encouragement ever since they both undertook the task to make me further understand the fields of microarray data analysis and mixture distribution. I also would like to thank the other members of my committee namely Dr. C. Tsokos, and Dr. T. K. Das who all have given of their time and have always encouraged me to do my best. The staff at the Department of Mathematics and Statistics is deserving of my thanks for the help and advice they have given me over the years. I owe thanks to my family and a large number of friends for the encouragement and complete understanding in my quest to excel in the field of Statistics.

5 Table of Contents List of Figures List of Tables Abstract iv vi viii 1 Introduction 1 2 Microarray Data and Some Statistical Analysis DNA Microarray Experiments Genetic Background cdna Microarray Experiment Oligonucleotide Microarray Experiment Some Statistical Challenges With Analyzing Microarray Data Methods of Analyzing Microarray Data Cluster Analysis The T-Test Regression Model Significance Analysis of Microarrays (SAM) Mixture Model Method (MMM) A Mixture Model Approach Using P -Value Conclusion Finite Mixture Distribution Definition and Preliminary Examples of Mixture Distributions i

6 3.1.2 Mean and Variance of Mixtures Comparison of Two Groups: Iris Data Parameter Estimation Expectation Maximization Algorithm Robust Parameter Estimation Penalized Maximum Likelihood Estimator for Normal Mixture Models Estimating the Number of Components g Simulation Approach Modified Likelihood Ratio Test for Homogeneity in Finite Mixture Models Regularity Conditions Box-Cox transformation Conclusion The Penalized Modified Likelihood for Normal Mixture Model Penalized Modified Likelihood Parameter Estimation of Penalized Modified Likelihood Inverse Gamma Penalty Function for σ Inverse Chi-Square Penalty Function for σ Consistency and Asymptotic Normality Preliminary Results Main results Conclusion Penalized Modified Likelihood Approach to Microarray Data Analysis Penalized Modified Mixture Model (PMMM) PMMM Simulated Null Distribution Simulating Microarray Data Application of PMMM to Simulated Microarray Data Application of PMMM to the Rat data Conclusion ii

7 6 Modified P -Value Approach to Microarray Data Analysis The modified P -Value Approach Simulated Null Distribution of the Modified P -Value Approach Application of Modified P -Value to Simulated Microarray Data Application of Modified P -Value to simulated Prostate Data Application of Modified P -Value to the Prostate Data Conclusion Summary and Concluding Remarks 102 References 105 About the Author End Page iii

8 List of Figures 3.1 Histogram of Sepal length of the two species of flowers Q-Q plots of Sepal lengths for versicolor flowers Q-Q plots of Sepal lengths for verginica flowers Histogram and known mixture distribution Histogram and estimated mixture distribution Histogram with known and estimated mixture distribution Graphical representations of two component normal with equal variance Histogram, heteroscedastic and homoscedastic fit for simulated data from the mixture 0.5φ(y 4, 1) + 0.5φ(y 8, 1) Histogram, heteroscedastic and homoscedastic fit for simulated data from the mixture 0.5φ(y 4, 1) + 0.5φ(y 8, 2) Histograms of z, Z and fitted models for the simulated data The likelihood ratio statistic as a function of Z value for the simulated data The values of FDR from our method and SAM for the simulated data Histograms of z, Z and fitted models for the rat data The likelihood ratio curve for the rat data The values of FDR from our method and SAM for the rat data Various Beta Distributions Histogram of p-value and beta mixture distributions for 2000 simulated genes iv

9 6.3 Histogram of p-value and beta mixture distributions for simulated prostate data Histogram of p-value and beta mixture distributions for the prostate data v

10 List of Tables 3.1 Summary statistics of data Mean, variance and percentiles for the penalized modified likelihood, based on 1000 replicates for each sample for testing the hypothesis a 1-component against 2-components Mean, variance and percentiles for the penalized modified likelihood, based on 1000 replicates for each sample for testing the hypothesis 2-components against 3-components Hypothesis test for the number of components for the fitted normal mixture models of z and Z, for the simulated microarray data Values of TP, FP and FDR from PMMM for the simulated data Values of TP, FP and FDR from SAM for the simulated data Hypothesis test for the number of components for the fitted normal mixture models of z and Z, for the rat data Values of TP, FP and FDR from PMMM for the rat data Values of TP, FP and FDR from SAM for the rat data Mean, variance and percentiles for the likelihood ratio test for the modified p-value, based on 1000 replicates for each sample for testing the hypothesis a uniform against a uniform and one beta distribution vi

11 6.2 Mean, variance and percentiles for the likelihood ratio test for the modified p-value, based on 1000 replicates for each sample for testing the hypothesis a uniform against a uniform and two beta distributions Hypothesis test for the number of components for the fitted uniform-beta mixture models for the prostate data vii

12 Mixture Distributions with Application to Microarray Data Analysis O Neil Lee Lynch ABSTRACT The main goal in analyzing microarray data is to determine the genes that are differentially expressed across two types of tissue samples or samples obtained under two experimental conditions. In this dissertation we proposed two methods to determine differentially expressed genes. For the penalized normal mixture model (PMMM) to determine genes that are differentially expressed, we penalized both the variance and the mixing proportion parameters simultaneously. The variance parameter was penalized so that the log-likelihood will be bounded, while the mixing proportion parameter was penalized so that its estimates are not on the boundary of its parametric space. The null distribution of the likelihood ratio test statistic (LRTS) was simulated so that we could perform a hypothesis test for the number of components of the penalized normal mixture model. In addition to simulating the null distribution of the LRTS for the penalized normal mixture model, we showed that the maximum likelihood estimates were asymptotically normal, which is a first step that is necessary to prove the asymptotic null distribution of the LRTS. This result is a significant contribution to field of normal mixture model. The modified p-value approach for detecting differentially expressed genes was also discussed in this dissertation. The modified p-value approach was implemented so that a hypothesis test for the number of components can be conducted by using the modified likelihood ratio test. In the modified p-value approach we penalized the mixing proportion so that the estimates of the mixing proportion are not on the boundary of its viii

13 parametric space. The null distribution of the (LRTS) was simulated so that the number of components of the uniform beta mixture model can be determined. Finally, for both modified methods, the penalized normal mixture model and the modified p-value approach were applied to simulated and real data. ix

14 1 Introduction In recent years microarray technology has made it possible to simultaneously analyze thousands of genes. Although an enormous volume of data is being produced by microarray technologies (Schena et al., 1995; DeRisi et al., 1997; Hughes et al., 2001; Lockhart et al., 1996), one remaining challenge is how to analyze and interpret the large amounts of data. A major challenge is to detect genes with differentially expressed profiles under two different experimental conditions, which may refer to samples drawn from two types of tissues, tumors or cell lines, or at two time points during important biological processes. Many of the methods used for such analysis, including the method of identifying genes with fold changes are known to be unreliable because in such methods the statistical variability of the data is not properly addressed [8]. While various parametric methods and tests such as the two-sample t-test [12] and regression model have been applied for microarray data analysis, strong parametric assumptions made in these methods as well as strong dependency on large sample sets restrict the reliability of such techniques in microarray problems where only a small number of replications are available. The non parametric statistical methods, including the Empirical Bayes (EB) method [14], the significance analysis for microarray data (SAM [39]) and mixture model method (MMM) [27, 42, 25] have been applied to microarray data analysis. It is claimed and argued that the new extensions of the (MMM) are among the available methods producing biologically-meaningful results [27, 43]. In this dissertation we extended the mixture model method (MMM) by penalizing the mixing proportions and the component variances simultaneously. The mixing proportion was penalized so that a modified likelihood ratio test similar to that of Chen et al. (2001, 2004) for testing the number of components of the fitted normal mixture model can be implemented. The variance was penalized so that the log-likelihood is bounded resulting 1

15 in the existence of the MLE s. In a similar fashion the p-value approach (Allison et al. (2002)) for the detection of differentially expressed genes of microarray data was also modified. For the p-value approach only the mixing proportion was modified so that the MLE of the mixing proportion was not on the boundary of its parametric space. This modification was done so that a modified likelihood ratio test similarly to what was done by Chen et al. may be implemented so that we may test the hypothesis for the number of components. This dissertation is organized as follows. Chapter 2 describes in some detail the genetic background of DNA and two of the leading microarray experiments, cdna and Oligonucleotide. In Chapter 2 we also discussed some of the statistical challenges we have in analyzing microarray data and gives a description of some of the methods used to analyze microarray data. The methods that were discussed are (1) Cluster analysis (2) T-test (3) Regression analysis (4) Significant analysis of microarray (SAM) (5) Mixture model method (MMM) and (6) A p-value approach for detecting differentially expressed microarray data. In Chapter 3 we present the theory of finite mixture methods and discussed how the parameters can be estimated by (1) expectation maximization algorithm (EM) and (2) the robust parameter estimation - which is of interest if the data contains outliers. One of the challenges of finite mixture distributions is to determine the number of components therefore we discussed some techniques used to determine the number of components which are namely AIC, BIC, simulation and the modified likelihood ratio test. The boxcox transformation for distinquishing skewed distributions from commingled distributions was also presented in chapter 3. The penalized modified approach will be discussed in chapter 4. The estimators of the parameters of the penalized normal mixture model when both the variance and mixing proportion were simultaneously penalized was illustrated. The evaluation of the estimators for the two penalty functions for the variance, the inverse gamma and inverse chi-square distributions were addressed. The asymptotic property namely asymptotic normality of the normal mixture model was also proved in Chapter 4. Chapter 5 discussed the applications of the penalized/modified approach of the normal mixture model to detecting differentially expressed genes and illustrated its applications to simulated and 2

16 real data. The results of the penalized/modified normal mixture model approach were compared to that of SAM and was shown to out perform SAM. Chapter 6 discussed the modified p-value approach for detecting differentially expressed genes. Similar to the work done in Chapter 6 we applied our method to simulated and real data. The motivation for modifying the p-value approach of Allison et al. was that the MLE of the mixing proportion was on the boundary point of its parametric space, therefore we applied the technique of Chen et al., that is, we applied a penalty function for the mixing proportion so that the MLE of the mixing proportion will not be on the boundary points of its parametric space. The conclusions of this study were summarized in Chapter 7. 3

17 2 Microarray Data and Some Statistical Analysis 2.1 DNA Microarray Experiments Genetic Background The double-stranded molecules deoxyribonucleic acid (DNA) (Watson and Crick, 1953) contains all the genetic information of living organisms. Each strand or helix of DNA is a chain of nucleotides that consists of a sugar, a phosphate and a nitrogenous base molecule. The information in DNA is stored as a code made up of four chemical bases: Adenine (A), Guanine (G), Cytosine (C) and Thymine (T). These four bases are responsible for the DNA molecule having four distinct types of nucleotides. The bases are coupled in the following manner: A with T and C with G, by a hydrogen bond which is called complementary base pairing. The nucleotides are arranged in two long strands that form a spiral called a double helix. The double helix structure of DNA is similar to a ladder, with the base pairs forming the ladder s rungs and the sugar and phosphate molecules forming the vertical sidepieces of the ladder. In cells, genes consist of a long strand of DNA that contains a promoter, which controls the activity of a gene. Additionally, all living cells contain chromosomes, that are, large pieces of genes containing hundreds or thousands of genes, each of which specifies the composition and structure of a protein (or several related proteins). The workhorse molecules of the cell are protein polymers of amino acids which are responsible for cellular structure, producing energy and important biomolecules like DNA and proteins, and for reproducing the cell chromosomes. The cohort of chromosomes are almost the same in every cell in an organism, and contains the same repertoire of proteins. However, cells have remarkably distinct properties, such as the difference between human eye cells, hair cells, and liver cells; these distinctions are the result of differences in the abundance, 4

18 distribution, and state of the cell proteins. When a gene is active, the coding and non-coding sequence is copied in a process called transcription, producing messenger RNA (mrna) which is a copy of the gene s information. The mrna, a small and relatively unstable nucleic acid polymers, can then direct the synthesis of proteins through the genetic code. However, mrnas can also be used directly, for example as part of the ribosome. The resulting molecules from the gene expression, mrna or protein are known as gene products. There is therefore a logical connection between the state of a cell and the details of its protein and mrna composition. Whereas it remains difficult to measure the abundance of a cell s proteins, DNA microarray makes it possible to quickly and efficiently measure the relative representation of each mrna species in the total cellular mrna population, or in more familiar terms, to measure gene expression levels. There are several types of microarray systems including the cdna microarray (Schena et al., 1995; DeRisi et al., 1997: Hughes et al., 2001) and oligonucleotide array (Lockhart et al., 1996) cdna Microarray Experiment In this experiment, the cdna sequence corresponding to a set of genes pertinent to the biological question under investigation are obtained and printed onto a glass slide or substrate using a robotic arrayer. Second, the sample RNA is isolated, a critical step in the experiment in order to ensure that a sufficient amount of each cdna clone is printed on the array where each clone is amplified by a technique called polymerase chain reaction (PCR). In practice the printed amount of cdna is not the same, therefore the cdna on the array, which is a double-stranded probe, needs to be denatured and this is achieved by heating the array so that a target cdna can bind to it. In the third step the cdna is synthesized, a procedure that also involves labeling the isolated mrna from the biological samples. Usually in the most current cdna microarray experiments, cdnas from the experimental and reference samples are labeled with red-fluorescent dye, Cy5 and green-fluorescent dye, Cy3 respectively, mixed and hybridized on the slide. There are several different labeling methods including Primer 5

19 Tagging, Direct Incorporation Labeling and Amino-Modified Nucleotide. Nguyen et al. (2002), Wong et al. (2001) and Stears et al. (2000) discuss the advantages and disadvantages of these methods. Fourth, the labeled probe cdna is hybridized to target the cdna on the microarray. That is, if a particular gene is expressed in the target cell, where the cdnas corresponding to this gene are found in the target cdna pool, these cdnas will bind with the complementary cdna probes printed on a specific spot on the microarray. Hybridization refers to the binding ability of two complementary DNA strands by the base-pairing rule thus reforming the DNA double helix. Finally, the hybridization results are imaged and analyzed using a fluorescent microscope, the log(red/green) intensities of mrna hybridization at each site is measured. The result is tens of thousands of gene expressions, typically ranging from -4 to 4, which is a measure of the expression level of each gene in the experimental sample relative to the reference sample. Positive values indicate higher expression in the target versus the reference, and vice versa for negative values Oligonucleotide Microarray Experiment Another widely used microarray technology is high density oligonucleotide arrays known as Affymetrix (Lockhart et al., 1996). This method is based on the fact that each gene is represented by 14 to 20 features (Lipshutz et al., 1999). for example, Affymetrix array used 20 features. Each feature is a short sequence of nucleotides, an oligonucleotide, and it is a perfect match (PM) to a segment of the gene. Paired with the 20 PM oligonucleotides to the gene sequence are 20 other oligonucleotides having the same sequence corresponding to the 20 PMs except for a single mismatch (MM) at the central base of the nucleotide. When the gene is expressed in the cell sample, high intensity is expected for the PM feature and low intensity for the MM feature. Given the 20 PM and MM feature pairs for the gene, many methods have been proposed to quantify the expression level of the gene. For example, Affymetrix originally proposed the average difference x = avg{d k = (P M k MM k ), k = 1, 2,..., 20 = K} to quantify expression level of a gene in a particular array. The average is usually based only on the differences, d k, with 6

20 3 standard deviations from the mean of d (2),..., d (K 1), where d (k) is the k th smallest difference, but there are various other ways to filter the outliers, Efron et al. (2001) suggested x = avg{d k = log(p M k ) c log(mm k ), k = 1, 2,..., 20} for several different scale factors c. Naef et al. (2001) proposed to use only the PM features. In an attempt to obtain more sensitive measure of gene expression, Li and Wong (2001) proposed a model-based estimate of the expression level using the least square method. The method for sample labeling and image processing in the Affymetrix arrays are found in Lockart et al. (1996). Refer to The Chipping Forecast (Lander et al., 1999) for more details on cdna microarrays and oligonucleotide chips. 2.2 Some Statistical Challenges With Analyzing Microarray Data Microarray technologies allow scientists to monitor the mrna transcript levels of thousands of genes in a single experiment. However, the tremendous amount of data that is obtained from microarray studies presents challenges for data analysis. One challenge in the development of statistical methods for microarray data analysis is that sample sizes under two different experimental conditions are typically small. We can depict this situation by defining the data as follows: for each gene i, i = 1, 2,..., N, we have expression levels (Y i1,..., Y im ) from m microarrays under condition 1, and (Y i,m+1,..., Y i,m+n ) from n arrays under condition 2. Usually the total number of genes N is large (> 1000) whereas the number of replications, m and n are small (typically < 20). Since statisticians are primarily interested in genes that are differentially expressed across two different experimental conditions, which may refer to samples drawn from two types of tissues, tumors or cell lines, or at two time points during important biological processes, we need to make an adjustment for the type I error rate when doing simultaneous hypothesis tests. This adjustment is done by means of the Bonferroni method, to deal with multiple comparisons. If we use α as the significance level then the test or gene specific significance level for a two sided test is therefore α = α/2n. Investigators may need to have the answer for the following question Is the difference in expression level for a particular gene statistically significant? However, there are a number of equally important questions that need to be answered (Allison et al. (2001)): 7

21 1. Is there statistically significant evidence that any of the genes under study exhibit a difference in expression across the conditions? 2. What is the best estimate of the number of genes for which there is a true difference in gene expression? 3. What is the confidence interval around that particular estimate? 4. If we set some threshold for which we expect particular genes to be interesting and worthy of follow-up study, what proportion of those genes are likely to be genes for which there is a real difference in expression and what proportion are likely to be false leads? 5. What proportion of those genes not declared interesting are likely to be genes for which there is a real difference in expression (i.e., misses or false negatives)? In analyzing microarray data the assumptions made are (1) For each gene, the measurements of gene expression have a finite population mean and variance; (2) For each gene under study, there is a measure of expression level available for each sample and this measure has sufficient reliability and validity (i.e. the measurements of the expression levels are a true reflection of the true state of nature); (3) The most important assumption that is made is that gene expression levels across the two groups are independent - which implies that we may able to evaluate the likelihood function which will become important later in this dissertation. 2.3 Methods of Analyzing Microarray Data Cluster Analysis One method used in the analysis of microarray data is Cluster analysis. Cluster analysis groups genes or samples into clusters based on similar expression profiles and provides clues to the function or regulation of genes or similarity of samples via shared cluster membership [34, 35, 18]. Several clustering methods have been usefully applied to analyzing genome-wide expression data and can be classified largely into three categories. The three-based approach uses distance measures between genes such as correlation co- 8

22 efficients to group genes into a hierarchical tree [15]. The second category clusters genes so that within-cluster variation is minimized and between-cluster variation is maximized [34, 35]. The third category group genes into blocks, in which the correlation is maximized and between which the correlation is minimized [3]. The power of cluster analysis in the analysis of microarray data lies in discovering gene transcripts or samples that show similar expression profiles. However, identification of like groups is not necessarily the objective in a microarray study, because the interest is to discover genes that are differentially expressed between predefined sample groups, such as normal versus cancerous tissues. Data Let Y ik be the expression level of gene i in array k (i = 1,..., N; k = 1,..., m, m + 1,..., m + n). Suppose that the first m and the last n arrays are both obtained under two different conditions, that is Y i(1) = (Y i1,..., Y im ) and Y i(2) = (Y i,m+1,..., Y i,m+n ). Since we are interested to determine which genes are differentially express between Y i(1) and Y i(2), we let Y ik = a i + b i x k, (2.3.1) where 1 for 1 k m x k = 0 for m + 1 k m + n. Therefore the mean expression levels of gene i under the two conditions are a i + b i and a i respectively. Hence to determine the genes that are differentially expressed is equivalent to testing the hypothesis H 0 : b i = 0, there is no gene with altered expression H 1 : b i 0, otherwise (2.3.2) Using the data construction of equation (2.3.1) for Y ik we will now present the t-test and regression models used in microarray data analysis. 9

23 2.3.2 The T-Test There are several versions of the two-sample t-test, depending on whether the sample size (i.e m and n) is large and whether it is reasonable to assume that the gene expression levels have an equal variance under the two conditions. Both m and n are usually small, and there is evidence to support unequal variance (Thomas et. al. 2001), we will only discus the t-test with two independent small Normal samples with unequal variances. Let the sample means and variances of Y ik for gene i under the two conditions be Ȳ i(1) = m k=1 Y ik m, Ȳ i(2) = m+n k=m+1 Y ik n (2.3.3) and s 2 i(1) = s 2 i(2) = m k=1 (Y ik Ȳi(1)) 2, m 1 m+n k=m+1 (Y ik Ȳi(2)) 2. (2.3.4) n 1 The t-statistic is Z i = Ȳi(1) Ȳi(2) s 2 i(1) + s2 i(2) m n, (2.3.5) Under the normality assumption for Y ik, Z i degrees of freedom d j = This t-test was proposed by Welch (1947). ( s 2 ) 2 i(1) + s2 i(2) m n ( s ) 2 2 ( s ) 2 2 i(1) i(2) m n + m 1 n 1 freedom is similar to the idea of the Satterthwaite approximation. approximately has a t-distribution with Its method of calculating the degrees of Regression Model The regression model estimates the values of (a i, b i ) using the weighted least square method, and then estimates the variance of ˆb i using the robust or sandwich variance 10

24 estimator. V ar(ˆb i ) = ( s 2 i(1) m )( m 1 ) + m ( s 2 i(2) n )( n 1 and the estimate of ˆb i = Ȳi(1) Ȳi(2). Therefore the test statistics is n ), Z i = ˆbi V ar(ˆb i ) = s 2 i(1) m Ȳ i(1) Ȳi(2) m 1 m + s2 i(2) n. (2.3.6) n 1 n This test statistic compares well with that of the t-test. In the case of the t-test the test statistic is Z i = Ȳi(1) Ȳi(2) s 2 i(1) + s2 i(2) m n, (2.3.7) where Ȳi(1), Ȳi(2), s 2 i(1) and s2 i(2) are defined as in (2.3.3) and (2.3.4). Note that the two tests are the same as m, n, however in microarray data analysis both m, n are small, which makes the t-test better because of the unbiasedness of its variance estimator. Note that the strong parametric assumptions that needs to be made to use both the t-test and the regression approach is often times violated for microarray data analysis. Therefore, the Significance Analysis of Microarrays (SAM) is an important method developed for microarray data analysis that seeks to over theses strong parametric assumptions Significance Analysis of Microarrays (SAM) The significance analysis of microarrays (SAM) is one statistical technique for finding significant genes in a set of micoarray experiments. It was proposed by Tusher, Tibshirani and Chu [39]. This approach was based on analysis of random fluctuations in the data. However, even for a given level of expression, the fluctuations were gene specific. To account for gene-specific fluctuations, a statistic based on the ratio of change in gene expression to standard deviation in the data for that gene was defined. The relative difference d(i) in the gene expression is: d(i) = Ȳi(1) Ȳi(2) s(i) + s 0 (2.3.8) 11

25 where Ȳi(1) and Ȳi(2) are defined as the average expression levels of the i th gene from conditions 1 and 2, respectively. The gene-specific scatter s(i) is the standard deviation of repeated expression measurements: s(i) = 1/m + 1/n ( m (Y ik m + n 2 Ȳi(1)) 2 + k=1 m+n k=m+1 (Y ik Ȳi(1)) 2 ) where m and n are the numbers of measurements in conditions 1 and 2 respectively. (2.3.9) In order to compare values of d(i) across all genes, the distribution of d(i) should be independent of the level of gene expression. At low expression levels, variance in d(i) can be high because of small values of s(i). To ensure that the variance of d(i) is independent of the gene expression, a small positive constant s 0 (exchangeability factor) was added to the denominator of equation (2.3.8). The coefficient of variation of d(i) was computed as a function of s(i) in moving windows across the data. The value for s 0 was chosen to minimize the coefficient of variation. To minimize the effects of potential confounders between the conditions, the data was analyzed by taking B sets of permutations. For each permutation b the statistic d b i and the corresponding order statistics d b (1) d b (2)... d b (N) was computed. The expected relative difference, d b i = d b i, was defined as the average over the set of B permutations. B To identify potentially significant changes in expression levels, they used a scatter plot of the observed relative difference d(i) versus the expected relative difference d i. For a fixed threshold, starting at the origin, and moving up to the right find the first i = i 1 such that d i d i >. All genes pass i 1 are called significant positive. Similarly, start at the origin, move down to the left and find the first i = i 2 such that d i d i >. All genes pass i 2 are called significant negative. For each the upper cutoff point cut up ( ) was defined as the smallest d i among the significant positive genes, and similarly defining the lower cutoff point cut low ( ). To determine the number of falsely significant genes generated by SAM, the total number of falsely significant genes corresponding to each permutation was computed by counting the number of genes that exceeded the cutoffs cut up ( ) and cut low ( ). The estimated number of falsely significant genes was the median (or 90 th percentile) of the number of genes called significant from the B sets of permutations. Such genes are called 12

26 false positive (F P ). This information will then be used to estimate the false Discovery Rate (F DR) F DR = π 0 F P/T P (2.3.10) where π 0 is the true proportion of equivalent expressed (EE) genes in the data set and T P is the number of total (true) positives discovered from the test statistic, that is, T P is the total number of genes claimed to be differentially expressed (DE) Mixture Model Method (MMM) The mixture model method (MMM) was introduced to handle the problem when a small number of replications under two experimental conditions exist, which is exactly the case for the data in a microarray experiment. The main purpose of the MMM is to estimate the distribution of a t-type test statistic and its null statistic using finite normal mixture models, which results in the method being non-parametric. Additionally, the strong parametric assumption made when analyzing microarray when the traditional statistical test is applied is often violated, hence this make the MMM statistically safer because the assumption of normality is not made. Consider the situation where, for each gene i, i = 1, 2,..., N, we have expression levels Y i(1) = (Y i1,..., Y im ) from m microarrays under condition 1, and Y i(2) = (Y i,m+1,..., Y i,m+n ) from n arrays under condition 2. Here we need to assume that both m and n are even integers, this will become obvious later. The goal is to identify genes such that (Y i1,..., Y im ) and (Y i,m+1,..., Y i,m+n ) have different means. This appears to be a two sample comparison however, in microarray data, that has small m and n with a large N, renders the traditional statistical tests such as the t-test or rank-based nonparametric tests, ineffective. One alternative is to draw statistical inference based on the distributions of quantities related to (Y i1,..., Y im ) or (Y i,m+1,..., Y i,m+n ), for 1 i N, to take advantage of the large population size N. The model assumes a nonparametric approach for gene expression data: Y i(1) = µ i(1) + ε i(1) Y i(2) = µ i(2) + e i(2) 13

27 where µ i(1) and µ i(2) are the mean expression levels for gene i under the two conditions respectively, and ε i(1) and e i(2) are independent random errors with means and variances, such that E(ε i(1) ) = E(e i(2) ) = 0, V ar(ε i(1) ) = σ 2 i(1), V ar(e i(2) ) = σ 2 i(2), for any j = 1,..., m, m+1,..., m+n and i = 1,..., N. Note, we do not assume equality of variance of the gene expression levels, because the variance σi(c) 2 of gene expression level depends on the mean expression µ i(c). Also, we do not assume µ i(1) = µ i(2). The basis of the model is to compare two distributions of two similar statistics (after being suitably standardized) to infer whether some genes are differentially expressed. Let m and n be even such that p i (q i ) is a column vector containing random permutation of m/2 1 s and m/2-1 s (n/2 1 s and n/2-1 s). Let Y i(1) = (Y i1,..., Y im ) and Y i(2) = (Y i,m+1,..., Y i,m+n ) then assume that z i = Y i(1)p i /m + Y i(2) q i /n s 2i(1) /m + s2i(2) /n f 0, (2.3.11) which does not depend on µ i(1) and µ i(2) since its mean is 0. Furthermore, suppose that Z i = = m k=1 Y ik/m m+n k=m+1 Y ik/n s 2 i(1) /m + s2 i(2) /n Ȳ i(1) Ȳi(2) s 2i(1) /m + s2i(2) /n f 1. (2.3.12) The hypothesis is of the form H 0 : f 0 = f 1, there is no gene with altered expression H 1 : f 0 f 1, otherwise (2.3.13) and is valid only if the random errors are independent and their distribution is symmetric about 0. Since m, n > 1 then we can estimate s 2 i(1) and s2 i(2) using the sample variances s 2 i(1) = m k=1 (Y ik Ȳi(1)) 2 and s 2 m 1 i(2) = m+n k=m+1 (Y ik Ȳi(2)) 2 respectively. Note the data z n 1 i s and 14

28 Z i s are used to estimate f 0 and f 1 by normal mixture model respectively, which will be discussed in more details in chapter 3. To test the null hypothesis H 0 that Z is from f 0 (which is equivalent to testing the hypothesis (2.3.13)), we can construct a likelihood ratio test (LRT) based on the following statistic: LR(Z) = f 0(Z) f 1 (Z). (2.3.14) A large value of LR(Z) gives no evidence against H 0, whereas a too small value of LR(Z) leads to rejecting H 0. determine the rejection region. bisection method [29] to solve With the normal mixture model, it is possible to numerically For any given false positive rate α, we can use the α = LR(z)<s f 0 (z)dz to obtain the suitable cut off point s. Then the rejection region is RR(α) = {Z : LR(Z) < s}. We call the method of using the LRT in MMM as MMM-LRT. Similar to SAM (Tusher et al. 2001), we can estimate the numbers of false positive (F P ) and total (T P ) directly. In MMM-LRT, for any given s, we have: F P (s) = 1 B B b=1 n(i : LR(z (b) i ) < s), T P (s) = n(i : LR(Z i ) < s) where n(i) represents the number of genes. In estimating F P, one can also use median, rather than mean, F P over the permuted data. Based on the estimated F P and T P, one can also calculate the false discovery rate as F DR = F P/T P (Benjamini and Hochberg 1995; Storey 2001; Tusher et al. 2001) A Mixture Model Approach Using P -Value In is well known that the distribution of the p-values is uniformly distributed on the interval [0, 1], regardless of the statistical test used and the sample size. Therefore if investigators uses a valid statistical test to produce p-values for testing the null hypothesis 15

29 H 0 there is no difference between the two experiments for the i th gene, i = 1,..., N, then the distribution of the p-value can be used to determine the genes that were differentially expressed. The assumption of independence of the gene expression levels across genes was made under the null hypothesis. Additionally, under the alternative hypothesis, the distribution of p-values will tend to cluster closer to zero than to one, as opposed to be uniformly distributed under the null hypothesis. Then, the question Is there statistically significant evidence that any of the genes under study exhibit a difference in expression across the two experimental conditions? can be answered by conducting a test to determine if the observed p-values are significantly different from the uniform distribution. This is done by mixture model approach [2]. The mixture model is a g-component of beta distributions β(r j, s j ) for j = 1,..., g with the parameters r j and s j, where the beta distribution is defined as follows β(y r, s) = Γ(r + s)yr 1 (1 y) s 1. Γ(r)Γ(s) The reason for the choice of the beta distribution is because of its great flexibility in modeling any shaped distribution on the interval [0, 1]. Note that the uniform distribution is a special case of the beta distribution with r = s = 1. The likelihood for the collection of N, p-values from a model with g components is given as L g = [ N p 1 β(y i 1, 1) ] g p j β(y i r j, s j ), Therefore the log likelihood for the N p-values can be expressed as l g = j=2 [ N ln p 1 β(y i 1, 1) + ] g p j β(y i r j, s j ), where y i is the p-value for the i th test, p 1 is the probability that a randomly chosen test from the collection of tests is for a gene where there is no population difference in gene expression (i.e., a test of a true null hypothesis), and p j is the probability that a randomly chosen test from the collection of tests is for a gene where there is a population difference in gene expression (i.e., a test of a false null hypothesis). The above model now requires j=2 16

30 the calculation of the MLE of the parameters p j, r j and s j through iterative procedure subject to the constraints g p j = 1 and 0 p j 1 for all j = 1,..., g. The estimate of the number of genes for which there is a difference in gene expression is evaluated as N(1 ˆp 1 ), where ˆp 1 is the MLE of p 1. Let T be some threshold below which the results for particular genes are declared interesting and worthy of follow-up study, the proportion of those genes that are likely to be genes for which there is a real difference is P ( H 0,i y i T ) = 1 P (H 0,i y i T ) = 1 P (H 0,i y i T ), P (y i T ) where g T P (y i T ) = p 1 T + p j j=2 0 Γ(r j + s j )y rj 1 (1 y) s j 1 dy Γ(r j )Γ(s j ) and P ( H 0,i y i T ) = p 1 T. The estimated proportion of genes declared interesting that are likely to be false leads is simply P (H 0,i y i T ) = P (H 0,i y i T ). P (y i T ) Similarly the proportion of those genes not declared interesting that are likely to be genes for which there is a real difference is P ( H 0,i y i T ) = 1 P (H 0,i y i T ) = 1 P (H 0,i y i T ), P (y i T ) where g 1 Γ(r j + s j )y rj 1 (1 y) s j 1 P (y i T ) = p 1 (1 T ) + p j dy Γ(r j )Γ(s j ) j=2 T and P ( H 0,i y i T ) = p 1 (1 T ). 17

31 2.4 Conclusion This chapter discussed a few of the methods used to analyze microarray data. An introduction to cluster analysis was presented, but, cluster analysis was not an effective method to determine differentially express genes. Hence the need to make use of the more classical statistical methods such as the t-test and regression analysis. However, with strong parametric assumptions that will be necessary for microarray analysis, these methods has some limitations. Microarray data are many times consist of a few replications for case and control groups, although the number of genes are usually greater than The assumption that the genes are independent is one assumption that is typical in the analysis of microarray data. Note that in chapters 3 and 5 the development of the modified approaches use the independence assumption, therefore we are prepared to deal with the consequences of assuming the genes are independent. The Significance Analysis of Microarrays (SAM) and the Mixture Model Method (MMM) presented in this chapter uses a t-type statistics to determine the number of differentially expressed genes. However, the MMM has one advantage in that it is a non-parametric approach. The MMM determines the distributions under the null and alternative and then uses these distributions to determine the number of differentially expressed genes by means of a likelihood ratio test. The p-value approach of Allision relies on parametric assumptions that are made to determine the p-values. If the p-values are not valid then its distributions under the null hypothesis may not be uniform on the interval [0,1]. In discussing the modified p-value approach presented in chapter 6, we are aware that the t-test used to determine the p-values must be valid for the modified p-value approach to be valid. However, for this dissertation we assume all the assumptions are satisfied with respect to the modified p-value approach. In addition, to the method used to analyze microarray data, we presented the biological background that the reader needs so that he may fully understand the challenges statisticians have in the analysis of microarray data. 18

32 3 Finite Mixture Distribution In this chapter we will give a brief background on mixture distributions. Mixture models are vital in statistical practice and research because many problems in statistics have mixture structures. Furthermore they are useful in describing complex population with observed or unobserved heterogeneity. Some examples are that human heights may be modeled as a two-component mixture, one component for men and one for women. Substructures in galaxy may be modeled as contaminations of big initial galaxy; the evidence of substructures is important in modern galaxy formation theory (Sun, Morrison, Harding and Woodroofe 2002). There are also applications in actuarial science, biological science, econometrics, medicine, agriculture, zoology, population studies and microarray data analysis. K. Pearson (1894) was the first to study mixture of two normal distributions, where he modeled the mixing of different crab species. Mixture model has become popular because: (1) they provide a simple mechanism to incorporate extra variation and correlation in the model (2) they add model flexibility and (3) they are a natural approach for modeling data that arise in multiple stages or when populations are composed of sub populations. In addition the theory, applications, history and importance of mixture models have been discussed in journal articles, monographs and textbooks. Everitt and Hand (1981), Titterington, Smith and Makov (1985), Böhning (1999), and McLachlan and Peel (2000) provided models, statistical methods and references for finite mixtures problems. 19

33 3.1 Definition and Preliminary Definition A stochastic variable {Y i : 1 i n} with density function f(y i θ j ) follows a finite mixture distribution if Y i π 1 f i1 (y i θ 1 ) + π 2 f i2 (y i θ 2 ) π g f ig (y i θ g ) g = π j f ij (y i θ j ), (3.1.1) where f i1 (y i θ 1 ),..., f ig (y i θ g ) are g density functions and π 1,..., π g are called mixing proportions, satisfying the following properties 0 π j 1 and g π j = 1. The densities f ij (y) for j = 1,..., g may be continuous or discrete, or a combination of both. From Definition we observe that finite mixture distribution is the marginal distribution of a random variable which follows different distributions in different subpopulations of a general population. Therefore, if a population S is defined as S = {S 1, S 2,..., S g }, such that S j S k =, j k. Then the distribution in each sub-population is given to be In S 1 : Y S 1 f 1 (Y θ 1 ) In S 2 : Y S 2 f 2 (Y θ 2 )... In S g : Y S g f g (Y θ g ) Furthermore, let X represent the statistic in each sub-population i.e., X = x 1, if ins 1 ; X = x 2, if ins 2 ;...,... ; X = x 3, if ins 3. Then X follows a discrete distribution with support {x 1, x 2,..., x g } and corresponding probabilities (weights) {π 1, π 2,..., π g }, that is P (X = x j ) = π j for j = 1,..., g. 20

34 Therefore for the finite mixture Y i g π j f ij (y i θ j ), we have Y i (X = x j ) f ij (y i θ j ), j = 1, 2,..., g where X is denoted as follow, X x 1 x 2... x g π 1 π 2... π g Note the random variable X is called latent because, in most applications, it is not observed. We now present some examples of finite mixture distributions Examples of Mixture Distributions Example Normal with common variance, that is, Y g π j N(µ j, σ 2 ) where the parameters for this mixture are θ j = (µ j, σ 2 ) and π j for j = 1,..., g. Note that Y (X = µ j ) N(µ j, σ 2 ) where X µ 1 µ 2... µ g π 1 π 2... π g. Example Normal with common mean, that is, Y g π j N(µ, σj 2 ) 21

35 where the parameters for this mixture are θ j = (µ, σ 2 j ) and π j for j = 1,..., g. Note that Y (X = σ 2 j ) N(µ, σ 2 j ) where X σ2 1 σ σg 2 π 1 π 2... π g. Example Normal with general mean and variance, that is, Y g π j N(µ j, σj 2 ) where the parameters for this mixture are θ j = (µ j, σ 2 j ) and π j for j = 1,..., g. Note that Y (X 1 = µ j, X 2 = σ 2 j ) N(µ j, σ 2 j ) where X = (X 1, X 2 ) (µ, σ2 1) (µ, σ2) 2... (µ, σg) 2 π 1 π 2... π g Mean and Variance of Mixtures Let Y g π ijf ij (y i θ j ) be a random variable that has a mixture distribution. Using the latent variable definition above, the mean and variance have the following known basic probability results for any random variables. Proposition E(Y ) = E(E(Y X)) Proposition V ar(y ) = V ar(e(y X)) + E(V ar(y X)) This implies that the mean and variance of Examples (3.1.2), (3.1.3) and (3.1.4) are given as: For Example (3.1.2) we have E(Y ) = E(E(Y X)) = E(X) = g π j µ j 22

36 and = Example (3.1.3) results in V ar(y ) = V ar(e(y X)) + E(V ar(y X)) = V ar(x) + E(V ar(x)) g ( g ) 2 = π j µ 2 j π j µ j + E(σ 2 ) g ( g ) 2 π j µ 2 j π j µ j + σ 2. E(Y ) = E(E(Y X)) = E(µ) = µ and For Example (3.1.4) we have V ar(y ) = V ar(e(y X)) + E(V ar(y X)) = V ar(µ) + E(σj 2 ) g = π j σj 2. E(Y ) = E(E(Y X)) = E(X 1 ) = g π j µ j and V ar(y ) = V ar(e(y X)) + E(V ar(y X)) = V ar(x 1 ) + E(X 2 ) g ( g ) 2 = π j µ 2 j π j µ j + E(σ 2 j ) = g ( g ) 2 π j µ 2 j π j µ j + g π j σj 2. 23

Statistical Learning and Behrens Fisher Distribution Methods for Heteroscedastic Data in Microarray Analysis

Statistical Learning and Behrens Fisher Distribution Methods for Heteroscedastic Data in Microarray Analysis University of South Florida Scholar Commons Graduate Theses and Dissertations Graduate School 3-29-2010 Statistical Learning and Behrens Fisher Distribution Methods for Heteroscedastic Data in Microarray

More information

Single gene analysis of differential expression. Giorgio Valentini

Single gene analysis of differential expression. Giorgio Valentini Single gene analysis of differential expression Giorgio Valentini valenti@disi.unige.it Comparing two conditions Each condition may be represented by one or more RNA samples. Using cdna microarrays, samples

More information

Statistical Applications in Genetics and Molecular Biology

Statistical Applications in Genetics and Molecular Biology Statistical Applications in Genetics and Molecular Biology Volume 5, Issue 1 2006 Article 28 A Two-Step Multiple Comparison Procedure for a Large Number of Tests and Multiple Treatments Hongmei Jiang Rebecca

More information

Lesson 11. Functional Genomics I: Microarray Analysis

Lesson 11. Functional Genomics I: Microarray Analysis Lesson 11 Functional Genomics I: Microarray Analysis Transcription of DNA and translation of RNA vary with biological conditions 3 kinds of microarray platforms Spotted Array - 2 color - Pat Brown (Stanford)

More information

DETECTING DIFFERENTIALLY EXPRESSED GENES WHILE CONTROLLING THE FALSE DISCOVERY RATE FOR MICROARRAY DATA

DETECTING DIFFERENTIALLY EXPRESSED GENES WHILE CONTROLLING THE FALSE DISCOVERY RATE FOR MICROARRAY DATA University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln Dissertations and Theses in Statistics Statistics, Department of 2009 DETECTING DIFFERENTIALLY EXPRESSED GENES WHILE CONTROLLING

More information

Non-specific filtering and control of false positives

Non-specific filtering and control of false positives Non-specific filtering and control of false positives Richard Bourgon 16 June 2009 bourgon@ebi.ac.uk EBI is an outstation of the European Molecular Biology Laboratory Outline Multiple testing I: overview

More information

Biochip informatics-(i)

Biochip informatics-(i) Biochip informatics-(i) : biochip normalization & differential expression Ju Han Kim, M.D., Ph.D. SNUBI: SNUBiomedical Informatics http://www.snubi snubi.org/ Biochip Informatics - (I) Biochip basics Preprocessing

More information

Lesson Overview The Structure of DNA

Lesson Overview The Structure of DNA 12.2 THINK ABOUT IT The DNA molecule must somehow specify how to assemble proteins, which are needed to regulate the various functions of each cell. What kind of structure could serve this purpose without

More information

The miss rate for the analysis of gene expression data

The miss rate for the analysis of gene expression data Biostatistics (2005), 6, 1,pp. 111 117 doi: 10.1093/biostatistics/kxh021 The miss rate for the analysis of gene expression data JONATHAN TAYLOR Department of Statistics, Stanford University, Stanford,

More information

Computational Biology: Basics & Interesting Problems

Computational Biology: Basics & Interesting Problems Computational Biology: Basics & Interesting Problems Summary Sources of information Biological concepts: structure & terminology Sequencing Gene finding Protein structure prediction Sources of information

More information

Table of Outcomes. Table of Outcomes. Table of Outcomes. Table of Outcomes. Table of Outcomes. Table of Outcomes. T=number of type 2 errors

Table of Outcomes. Table of Outcomes. Table of Outcomes. Table of Outcomes. Table of Outcomes. Table of Outcomes. T=number of type 2 errors The Multiple Testing Problem Multiple Testing Methods for the Analysis of Microarray Data 3/9/2009 Copyright 2009 Dan Nettleton Suppose one test of interest has been conducted for each of m genes in a

More information

Biology I Fall Semester Exam Review 2014

Biology I Fall Semester Exam Review 2014 Biology I Fall Semester Exam Review 2014 Biomolecules and Enzymes (Chapter 2) 8 questions Macromolecules, Biomolecules, Organic Compunds Elements *From the Periodic Table of Elements Subunits Monomers,

More information

Number of questions TEK (Learning Target) Biomolecules & Enzymes

Number of questions TEK (Learning Target) Biomolecules & Enzymes Unit Biomolecules & Enzymes Number of questions TEK (Learning Target) on Exam 8 questions 9A I can compare and contrast the structure and function of biomolecules. 9C I know the role of enzymes and how

More information

DEGseq: an R package for identifying differentially expressed genes from RNA-seq data

DEGseq: an R package for identifying differentially expressed genes from RNA-seq data DEGseq: an R package for identifying differentially expressed genes from RNA-seq data Likun Wang Zhixing Feng i Wang iaowo Wang * and uegong Zhang * MOE Key Laboratory of Bioinformatics and Bioinformatics

More information

Statistical analysis of microarray data: a Bayesian approach

Statistical analysis of microarray data: a Bayesian approach Biostatistics (003), 4, 4,pp. 597 60 Printed in Great Britain Statistical analysis of microarray data: a Bayesian approach RAPHAEL GTTARD University of Washington, Department of Statistics, Box 3543, Seattle,

More information

Sugars, such as glucose or fructose are the basic building blocks of more complex carbohydrates. Which of the following

Sugars, such as glucose or fructose are the basic building blocks of more complex carbohydrates. Which of the following Name: Score: / Quiz 2 on Lectures 3 &4 Part 1 Sugars, such as glucose or fructose are the basic building blocks of more complex carbohydrates. Which of the following foods is not a significant source of

More information

Chapters 12&13 Notes: DNA, RNA & Protein Synthesis

Chapters 12&13 Notes: DNA, RNA & Protein Synthesis Chapters 12&13 Notes: DNA, RNA & Protein Synthesis Name Period Words to Know: nucleotides, DNA, complementary base pairing, replication, genes, proteins, mrna, rrna, trna, transcription, translation, codon,

More information

Chapter 3: Statistical methods for estimation and testing. Key reference: Statistical methods in bioinformatics by Ewens & Grant (2001).

Chapter 3: Statistical methods for estimation and testing. Key reference: Statistical methods in bioinformatics by Ewens & Grant (2001). Chapter 3: Statistical methods for estimation and testing Key reference: Statistical methods in bioinformatics by Ewens & Grant (2001). Chapter 3: Statistical methods for estimation and testing Key reference:

More information

UNIT 5. Protein Synthesis 11/22/16

UNIT 5. Protein Synthesis 11/22/16 UNIT 5 Protein Synthesis IV. Transcription (8.4) A. RNA carries DNA s instruction 1. Francis Crick defined the central dogma of molecular biology a. Replication copies DNA b. Transcription converts DNA

More information

Single gene analysis of differential expression

Single gene analysis of differential expression Single gene analysis of differential expression Giorgio Valentini DSI Dipartimento di Scienze dell Informazione Università degli Studi di Milano valentini@dsi.unimi.it Comparing two conditions Each condition

More information

Experimental designs for multiple responses with different models

Experimental designs for multiple responses with different models Graduate Theses and Dissertations Graduate College 2015 Experimental designs for multiple responses with different models Wilmina Mary Marget Iowa State University Follow this and additional works at:

More information

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008 MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, Networks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

Model Worksheet Teacher Key

Model Worksheet Teacher Key Introduction Despite the complexity of life on Earth, the most important large molecules found in all living things (biomolecules) can be classified into only four main categories: carbohydrates, lipids,

More information

Berg Tymoczko Stryer Biochemistry Sixth Edition Chapter 1:

Berg Tymoczko Stryer Biochemistry Sixth Edition Chapter 1: Berg Tymoczko Stryer Biochemistry Sixth Edition Chapter 1: Biochemistry: An Evolving Science Tips on note taking... Remember copies of my lectures are available on my webpage If you forget to print them

More information

False discovery rate and related concepts in multiple comparisons problems, with applications to microarray data

False discovery rate and related concepts in multiple comparisons problems, with applications to microarray data False discovery rate and related concepts in multiple comparisons problems, with applications to microarray data Ståle Nygård Trial Lecture Dec 19, 2008 1 / 35 Lecture outline Motivation for not using

More information

LIFE SCIENCE CHAPTER 5 & 6 FLASHCARDS

LIFE SCIENCE CHAPTER 5 & 6 FLASHCARDS LIFE SCIENCE CHAPTER 5 & 6 FLASHCARDS Why were ratios important in Mendel s work? A. They showed that heredity does not follow a set pattern. B. They showed that some traits are never passed on. C. They

More information

Linear Models and Empirical Bayes Methods for. Assessing Differential Expression in Microarray Experiments

Linear Models and Empirical Bayes Methods for. Assessing Differential Expression in Microarray Experiments Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments by Gordon K. Smyth (as interpreted by Aaron J. Baraff) STAT 572 Intro Talk April 10, 2014 Microarray

More information

Parametric Empirical Bayes Methods for Microarrays

Parametric Empirical Bayes Methods for Microarrays Parametric Empirical Bayes Methods for Microarrays Ming Yuan, Deepayan Sarkar, Michael Newton and Christina Kendziorski April 30, 2018 Contents 1 Introduction 1 2 General Model Structure: Two Conditions

More information

SPOTTED cdna MICROARRAYS

SPOTTED cdna MICROARRAYS SPOTTED cdna MICROARRAYS Spot size: 50um - 150um SPOTTED cdna MICROARRAYS Compare the genetic expression in two samples of cells PRINT cdna from one gene on each spot SAMPLES cdna labelled red/green e.g.

More information

Multiple Change-Point Detection and Analysis of Chromosome Copy Number Variations

Multiple Change-Point Detection and Analysis of Chromosome Copy Number Variations Multiple Change-Point Detection and Analysis of Chromosome Copy Number Variations Yale School of Public Health Joint work with Ning Hao, Yue S. Niu presented @Tsinghua University Outline 1 The Problem

More information

2. What was the Avery-MacLeod-McCarty experiment and why was it significant? 3. What was the Hershey-Chase experiment and why was it significant?

2. What was the Avery-MacLeod-McCarty experiment and why was it significant? 3. What was the Hershey-Chase experiment and why was it significant? Name Date Period AP Exam Review Part 6: Molecular Genetics I. DNA and RNA Basics A. History of finding out what DNA really is 1. What was Griffith s experiment and why was it significant? 1 2. What was

More information

Unsupervised machine learning

Unsupervised machine learning Chapter 9 Unsupervised machine learning Unsupervised machine learning (a.k.a. cluster analysis) is a set of methods to assign objects into clusters under a predefined distance measure when class labels

More information

Statistical testing. Samantha Kleinberg. October 20, 2009

Statistical testing. Samantha Kleinberg. October 20, 2009 October 20, 2009 Intro to significance testing Significance testing and bioinformatics Gene expression: Frequently have microarray data for some group of subjects with/without the disease. Want to find

More information

A Sequential Bayesian Approach with Applications to Circadian Rhythm Microarray Gene Expression Data

A Sequential Bayesian Approach with Applications to Circadian Rhythm Microarray Gene Expression Data A Sequential Bayesian Approach with Applications to Circadian Rhythm Microarray Gene Expression Data Faming Liang, Chuanhai Liu, and Naisyin Wang Texas A&M University Multiple Hypothesis Testing Introduction

More information

Optimal normalization of DNA-microarray data

Optimal normalization of DNA-microarray data Optimal normalization of DNA-microarray data Daniel Faller 1, HD Dr. J. Timmer 1, Dr. H. U. Voss 1, Prof. Dr. Honerkamp 1 and Dr. U. Hobohm 2 1 Freiburg Center for Data Analysis and Modeling 1 F. Hoffman-La

More information

BIOCHEMISTRY GUIDED NOTES - AP BIOLOGY-

BIOCHEMISTRY GUIDED NOTES - AP BIOLOGY- BIOCHEMISTRY GUIDED NOTES - AP BIOLOGY- ELEMENTS AND COMPOUNDS - anything that has mass and takes up space. - cannot be broken down to other substances. - substance containing two or more different elements

More information

Name: Date: Hour: Unit Four: Cell Cycle, Mitosis and Meiosis. Monomer Polymer Example Drawing Function in a cell DNA

Name: Date: Hour: Unit Four: Cell Cycle, Mitosis and Meiosis. Monomer Polymer Example Drawing Function in a cell DNA Unit Four: Cell Cycle, Mitosis and Meiosis I. Concept Review A. Why is carbon often called the building block of life? B. List the four major macromolecules. C. Complete the chart below. Monomer Polymer

More information

Notes Chapter 4 Cell Reproduction. That cell divided and becomes two, two become, four become eight, and so on.

Notes Chapter 4 Cell Reproduction. That cell divided and becomes two, two become, four become eight, and so on. Notes Chapter 4 Cell Reproduction 4.1 Cell Division and Mitosis Many organisms start as. That cell divided and becomes two, two become, four become eight, and so on. Many-celled organisms, including you,

More information

Probabilistic Inference for Multiple Testing

Probabilistic Inference for Multiple Testing This is the title page! This is the title page! Probabilistic Inference for Multiple Testing Chuanhai Liu and Jun Xie Department of Statistics, Purdue University, West Lafayette, IN 47907. E-mail: chuanhai,

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

Introduction to Molecular and Cell Biology

Introduction to Molecular and Cell Biology Introduction to Molecular and Cell Biology Molecular biology seeks to understand the physical and chemical basis of life. and helps us answer the following? What is the molecular basis of disease? What

More information

How To Use CORREP to Estimate Multivariate Correlation and Statistical Inference Procedures

How To Use CORREP to Estimate Multivariate Correlation and Statistical Inference Procedures How To Use CORREP to Estimate Multivariate Correlation and Statistical Inference Procedures Dongxiao Zhu June 13, 2018 1 Introduction OMICS data are increasingly available to biomedical researchers, and

More information

Multiple Testing. Hoang Tran. Department of Statistics, Florida State University

Multiple Testing. Hoang Tran. Department of Statistics, Florida State University Multiple Testing Hoang Tran Department of Statistics, Florida State University Large-Scale Testing Examples: Microarray data: testing differences in gene expression between two traits/conditions Microbiome

More information

STAT 263/363: Experimental Design Winter 2016/17. Lecture 1 January 9. Why perform Design of Experiments (DOE)? There are at least two reasons:

STAT 263/363: Experimental Design Winter 2016/17. Lecture 1 January 9. Why perform Design of Experiments (DOE)? There are at least two reasons: STAT 263/363: Experimental Design Winter 206/7 Lecture January 9 Lecturer: Minyong Lee Scribe: Zachary del Rosario. Design of Experiments Why perform Design of Experiments (DOE)? There are at least two

More information

Chapter 2: Chemistry. What does chemistry have to do with biology? Vocabulary BIO 105

Chapter 2: Chemistry. What does chemistry have to do with biology? Vocabulary BIO 105 Chapter 2: Chemistry What does chemistry have to do with biology? BIO 105 Vocabulary 1. Matter anything that takes up space and has mass Atoms are the smallest units of matter that can participate in chemical

More information

2012 Univ Aguilera Lecture. Introduction to Molecular and Cell Biology

2012 Univ Aguilera Lecture. Introduction to Molecular and Cell Biology 2012 Univ. 1301 Aguilera Lecture Introduction to Molecular and Cell Biology Molecular biology seeks to understand the physical and chemical basis of life. and helps us answer the following? What is the

More information

Honors Biology Fall Final Exam Study Guide

Honors Biology Fall Final Exam Study Guide Honors Biology Fall Final Exam Study Guide Helpful Information: Exam has 100 multiple choice questions. Be ready with pencils and a four-function calculator on the day of the test. Review ALL vocabulary,

More information

THINGS I NEED TO KNOW:

THINGS I NEED TO KNOW: THINGS I NEED TO KNOW: 1. Prokaryotic and Eukaryotic Cells Prokaryotic cells do not have a true nucleus. In eukaryotic cells, the DNA is surrounded by a membrane. Both types of cells have ribosomes. Some

More information

Notes Chapter 4 Cell Reproduction. That cell divided and becomes two, two become four, four become eight, and so on.

Notes Chapter 4 Cell Reproduction. That cell divided and becomes two, two become four, four become eight, and so on. 4.1 Cell Division and Mitosis Many organisms start as one cell. Notes Chapter 4 Cell Reproduction That cell divided and becomes two, two become four, four become eight, and so on. Many-celled organisms,

More information

Sample Size Estimation for Studies of High-Dimensional Data

Sample Size Estimation for Studies of High-Dimensional Data Sample Size Estimation for Studies of High-Dimensional Data James J. Chen, Ph.D. National Center for Toxicological Research Food and Drug Administration June 3, 2009 China Medical University Taichung,

More information

Inferential Statistical Analysis of Microarray Experiments 2007 Arizona Microarray Workshop

Inferential Statistical Analysis of Microarray Experiments 2007 Arizona Microarray Workshop Inferential Statistical Analysis of Microarray Experiments 007 Arizona Microarray Workshop μ!! Robert J Tempelman Department of Animal Science tempelma@msuedu HYPOTHESIS TESTING (as if there was only one

More information

Guided Notes Unit 4: Cellular Reproduction

Guided Notes Unit 4: Cellular Reproduction Name: Date: Block: Chapter 5: Cell Growth and Division I. Background Guided Notes Unit 4: Cellular Reproduction a. "Where a cell exists, there must have been a preexisting cell..." - Rudolf Virchow b.

More information

Generalized Linear Models (1/29/13)

Generalized Linear Models (1/29/13) STA613/CBB540: Statistical methods in computational biology Generalized Linear Models (1/29/13) Lecturer: Barbara Engelhardt Scribe: Yangxiaolu Cao When processing discrete data, two commonly used probability

More information

Practical Applications and Properties of the Exponentially. Modified Gaussian (EMG) Distribution. A Thesis. Submitted to the Faculty

Practical Applications and Properties of the Exponentially. Modified Gaussian (EMG) Distribution. A Thesis. Submitted to the Faculty Practical Applications and Properties of the Exponentially Modified Gaussian (EMG) Distribution A Thesis Submitted to the Faculty of Drexel University by Scott Haney in partial fulfillment of the requirements

More information

University of California, Berkeley

University of California, Berkeley University of California, Berkeley U.C. Berkeley Division of Biostatistics Working Paper Series Year 2004 Paper 147 Multiple Testing Methods For ChIP-Chip High Density Oligonucleotide Array Data Sunduz

More information

Statistical Methods for Analysis of Genetic Data

Statistical Methods for Analysis of Genetic Data Statistical Methods for Analysis of Genetic Data Christopher R. Cabanski A dissertation submitted to the faculty of the University of North Carolina at Chapel Hill in partial fulfillment of the requirements

More information

nucleus: DNA & chromosomes

nucleus: DNA & chromosomes nucleus: DNA & chromosomes chapter 5 nuclear organization nuclear structure nuclear envelope nucleoplasm nuclear matrix nucleolus nuclear envelope nucleolus nuclear matrix nucleoplasm nuclear pore nuclear

More information

Predicting Protein Functions and Domain Interactions from Protein Interactions

Predicting Protein Functions and Domain Interactions from Protein Interactions Predicting Protein Functions and Domain Interactions from Protein Interactions Fengzhu Sun, PhD Center for Computational and Experimental Genomics University of Southern California Outline High-throughput

More information

Post-Selection Inference

Post-Selection Inference Classical Inference start end start Post-Selection Inference selected end model data inference data selection model data inference Post-Selection Inference Todd Kuffner Washington University in St. Louis

More information

BME 5742 Biosystems Modeling and Control

BME 5742 Biosystems Modeling and Control BME 5742 Biosystems Modeling and Control Lecture 24 Unregulated Gene Expression Model Dr. Zvi Roth (FAU) 1 The genetic material inside a cell, encoded in its DNA, governs the response of a cell to various

More information

A Large-Sample Approach to Controlling the False Discovery Rate

A Large-Sample Approach to Controlling the False Discovery Rate A Large-Sample Approach to Controlling the False Discovery Rate Christopher R. Genovese Department of Statistics Carnegie Mellon University Larry Wasserman Department of Statistics Carnegie Mellon University

More information

DESIGNING EXPERIMENTS AND ANALYZING DATA A Model Comparison Perspective

DESIGNING EXPERIMENTS AND ANALYZING DATA A Model Comparison Perspective DESIGNING EXPERIMENTS AND ANALYZING DATA A Model Comparison Perspective Second Edition Scott E. Maxwell Uniuersity of Notre Dame Harold D. Delaney Uniuersity of New Mexico J,t{,.?; LAWRENCE ERLBAUM ASSOCIATES,

More information

Cell Growth and Division

Cell Growth and Division Cell Growth and Division Why do cells divide* Life and reproduction require cell division You require constant cell reproduction to live Mitosis: development (a) mitotic cell division (b) mitotic cell

More information

Inferences about Parameters of Trivariate Normal Distribution with Missing Data

Inferences about Parameters of Trivariate Normal Distribution with Missing Data Florida International University FIU Digital Commons FIU Electronic Theses and Dissertations University Graduate School 7-5-3 Inferences about Parameters of Trivariate Normal Distribution with Missing

More information

STAT 461/561- Assignments, Year 2015

STAT 461/561- Assignments, Year 2015 STAT 461/561- Assignments, Year 2015 This is the second set of assignment problems. When you hand in any problem, include the problem itself and its number. pdf are welcome. If so, use large fonts and

More information

Summary and discussion of: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing

Summary and discussion of: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing Summary and discussion of: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing Statistics Journal Club, 36-825 Beau Dabbs and Philipp Burckhardt 9-19-2014 1 Paper

More information

Central Limit Theorem ( 5.3)

Central Limit Theorem ( 5.3) Central Limit Theorem ( 5.3) Let X 1, X 2,... be a sequence of independent random variables, each having n mean µ and variance σ 2. Then the distribution of the partial sum S n = X i i=1 becomes approximately

More information

Controlling the False Discovery Rate: Understanding and Extending the Benjamini-Hochberg Method

Controlling the False Discovery Rate: Understanding and Extending the Benjamini-Hochberg Method Controlling the False Discovery Rate: Understanding and Extending the Benjamini-Hochberg Method Christopher R. Genovese Department of Statistics Carnegie Mellon University joint work with Larry Wasserman

More information

2015 FALL FINAL REVIEW

2015 FALL FINAL REVIEW 2015 FALL FINAL REVIEW Biomolecules & Enzymes Illustrate table and fill in parts missing 9A I can compare and contrast the structure and function of biomolecules. 9C I know the role of enzymes and how

More information

Expression arrays, normalization, and error models

Expression arrays, normalization, and error models 1 Epression arrays, normalization, and error models There are a number of different array technologies available for measuring mrna transcript levels in cell populations, from spotted cdna arrays to in

More information

Statistical Applications in Genetics and Molecular Biology

Statistical Applications in Genetics and Molecular Biology Statistical Applications in Genetics and Molecular Biology Volume 6, Issue 1 2007 Article 28 A Comparison of Methods to Control Type I Errors in Microarray Studies Jinsong Chen Mark J. van der Laan Martyn

More information

Comparison of the Empirical Bayes and the Significance Analysis of Microarrays

Comparison of the Empirical Bayes and the Significance Analysis of Microarrays Comparison of the Empirical Bayes and the Significance Analysis of Microarrays Holger Schwender, Andreas Krause, and Katja Ickstadt Abstract Microarrays enable to measure the expression levels of tens

More information

Lecture 21: October 19

Lecture 21: October 19 36-705: Intermediate Statistics Fall 2017 Lecturer: Siva Balakrishnan Lecture 21: October 19 21.1 Likelihood Ratio Test (LRT) To test composite versus composite hypotheses the general method is to use

More information

1/23/2012. Atoms. Atoms Atoms - Electron Shells. Chapter 2 Outline. Planetary Models of Elements Chemical Bonds

1/23/2012. Atoms. Atoms Atoms - Electron Shells. Chapter 2 Outline. Planetary Models of Elements Chemical Bonds Chapter 2 Outline Atoms Chemical Bonds Acids, Bases and the p Scale Organic Molecules Carbohydrates Lipids Proteins Nucleic Acids Are smallest units of the chemical elements Composed of protons, neutrons

More information

Low-Level Analysis of High- Density Oligonucleotide Microarray Data

Low-Level Analysis of High- Density Oligonucleotide Microarray Data Low-Level Analysis of High- Density Oligonucleotide Microarray Data Ben Bolstad http://www.stat.berkeley.edu/~bolstad Biostatistics, University of California, Berkeley UC Berkeley Feb 23, 2004 Outline

More information

More on Unsupervised Learning

More on Unsupervised Learning More on Unsupervised Learning Two types of problems are to find association rules for occurrences in common in observations (market basket analysis), and finding the groups of values of observational data

More information

Zhiguang Huo 1, Chi Song 2, George Tseng 3. July 30, 2018

Zhiguang Huo 1, Chi Song 2, George Tseng 3. July 30, 2018 Bayesian latent hierarchical model for transcriptomic meta-analysis to detect biomarkers with clustered meta-patterns of differential expression signals BayesMP Zhiguang Huo 1, Chi Song 2, George Tseng

More information

Mixture models for analysing transcriptome and ChIP-chip data

Mixture models for analysing transcriptome and ChIP-chip data Mixture models for analysing transcriptome and ChIP-chip data Marie-Laure Martin-Magniette French National Institute for agricultural research (INRA) Unit of Applied Mathematics and Informatics at AgroParisTech,

More information

False discovery rate procedures for high-dimensional data Kim, K.I.

False discovery rate procedures for high-dimensional data Kim, K.I. False discovery rate procedures for high-dimensional data Kim, K.I. DOI: 10.6100/IR637929 Published: 01/01/2008 Document Version Publisher s PDF, also known as Version of Record (includes final page, issue

More information

Exam: high-dimensional data analysis January 20, 2014

Exam: high-dimensional data analysis January 20, 2014 Exam: high-dimensional data analysis January 20, 204 Instructions: - Write clearly. Scribbles will not be deciphered. - Answer each main question not the subquestions on a separate piece of paper. - Finish

More information

1. Contains the sugar ribose instead of deoxyribose. 2. Single-stranded instead of double stranded. 3. Contains uracil in place of thymine.

1. Contains the sugar ribose instead of deoxyribose. 2. Single-stranded instead of double stranded. 3. Contains uracil in place of thymine. Protein Synthesis & Mutations RNA 1. Contains the sugar ribose instead of deoxyribose. 2. Single-stranded instead of double stranded. 3. Contains uracil in place of thymine. RNA Contains: 1. Adenine 2.

More information

Darwin's theory of evolution by natural selection. Week 3

Darwin's theory of evolution by natural selection. Week 3 Darwin's theory of evolution by natural selection Week 3 1 Announcements -HW 1 - will be on chapters 1 and 2 DUE: 2-27 2 Summary *Essay talk *Explaining evolutionary change *Getting to know natural selection

More information

Darwin's theory of natural selection, its rivals, and cells. Week 3 (finish ch 2 and start ch 3)

Darwin's theory of natural selection, its rivals, and cells. Week 3 (finish ch 2 and start ch 3) Darwin's theory of natural selection, its rivals, and cells Week 3 (finish ch 2 and start ch 3) 1 Historical context Discovery of the new world -new observations challenged long-held views -exposure to

More information

Large-Scale Hypothesis Testing

Large-Scale Hypothesis Testing Chapter 2 Large-Scale Hypothesis Testing Progress in statistics is usually at the mercy of our scientific colleagues, whose data is the nature from which we work. Agricultural experimentation in the early

More information

SHORT ANSWER. Write the word or phrase that best completes each statement or answers the question.

SHORT ANSWER. Write the word or phrase that best completes each statement or answers the question. Exam Name SHORT ANSWER. Write the word or phrase that best completes each statement or answers the question. Figure 2.1 Using Figure 2.1, match the following: 1) Lipid. 2) Functional protein. 3) Nucleotide.

More information

Objective 3.01 (DNA, RNA and Protein Synthesis)

Objective 3.01 (DNA, RNA and Protein Synthesis) Objective 3.01 (DNA, RNA and Protein Synthesis) DNA Structure o Discovered by Watson and Crick o Double-stranded o Shape is a double helix (twisted ladder) o Made of chains of nucleotides: o Has four types

More information

Chapter 17. From Gene to Protein. Biology Kevin Dees

Chapter 17. From Gene to Protein. Biology Kevin Dees Chapter 17 From Gene to Protein DNA The information molecule Sequences of bases is a code DNA organized in to chromosomes Chromosomes are organized into genes What do the genes actually say??? Reflecting

More information

CCHS 2015_2016 Biology Fall Semester Exam Review

CCHS 2015_2016 Biology Fall Semester Exam Review Biomolecule General Knowledge Macromolecule Monomer (building block) Function Energy Storage Structure 1. What type of biomolecule is hair, skin, and nails? 2. What is the polymer of a nucleotide? 3. Which

More information

Human Biology. The Chemistry of Living Things. Concepts and Current Issues. All Matter Consists of Elements Made of Atoms

Human Biology. The Chemistry of Living Things. Concepts and Current Issues. All Matter Consists of Elements Made of Atoms 2 The Chemistry of Living Things PowerPoint Lecture Slide Presentation Robert J. Sullivan, Marist College Michael D. Johnson Human Biology Concepts and Current Issues THIRD EDITION Copyright 2006 Pearson

More information

Modified Simes Critical Values Under Positive Dependence

Modified Simes Critical Values Under Positive Dependence Modified Simes Critical Values Under Positive Dependence Gengqian Cai, Sanat K. Sarkar Clinical Pharmacology Statistics & Programming, BDS, GlaxoSmithKline Statistics Department, Temple University, Philadelphia

More information

Stat 206: Estimation and testing for a mean vector,

Stat 206: Estimation and testing for a mean vector, Stat 206: Estimation and testing for a mean vector, Part II James Johndrow 2016-12-03 Comparing components of the mean vector In the last part, we talked about testing the hypothesis H 0 : µ 1 = µ 2 where

More information

Cellular Neuroanatomy I The Prototypical Neuron: Soma. Reading: BCP Chapter 2

Cellular Neuroanatomy I The Prototypical Neuron: Soma. Reading: BCP Chapter 2 Cellular Neuroanatomy I The Prototypical Neuron: Soma Reading: BCP Chapter 2 Functional Unit of the Nervous System The functional unit of the nervous system is the neuron. Neurons are cells specialized

More information

GS Analysis of Microarray Data

GS Analysis of Microarray Data GS01 0163 Analysis of Microarray Data Keith Baggerly and Kevin Coombes Section of Bioinformatics Department of Biostatistics and Applied Mathematics UT M. D. Anderson Cancer Center kabagg@mdanderson.org

More information

CCHS 2016_2017 Biology Fall Semester Exam Review

CCHS 2016_2017 Biology Fall Semester Exam Review CCHS 2016_2017 Biology Fall Semester Exam Review Biomolecule General Knowledge Macromolecule Monomer (building block) Function Structure 1. What type of biomolecule is hair, skin, and nails? Energy Storage

More information

What Mad Pursuit (1988, Ch.5) Francis Crick (1916 ) British molecular Biologist 12 BIOLOGY, CH 1

What Mad Pursuit (1988, Ch.5) Francis Crick (1916 ) British molecular Biologist 12 BIOLOGY, CH 1 1 Almost all aspects of life are engineered at the molecular level, and without understanding molecules we can only have a very sketchy understanding of life itself. What Mad Pursuit (1988, Ch.5) Francis

More information

O 3 O 4 O 5. q 3. q 4. Transition

O 3 O 4 O 5. q 3. q 4. Transition Hidden Markov Models Hidden Markov models (HMM) were developed in the early part of the 1970 s and at that time mostly applied in the area of computerized speech recognition. They are first described in

More information

Statistical analysis of genomic binding sites using high-throughput ChIP-seq data

Statistical analysis of genomic binding sites using high-throughput ChIP-seq data Statistical analysis of genomic binding sites using high-throughput ChIP-seq data Ibrahim Ali H Nafisah Department of Statistics University of Leeds Submitted in accordance with the requirments for the

More information

High-Throughput Sequencing Course. Introduction. Introduction. Multiple Testing. Biostatistics and Bioinformatics. Summer 2018

High-Throughput Sequencing Course. Introduction. Introduction. Multiple Testing. Biostatistics and Bioinformatics. Summer 2018 High-Throughput Sequencing Course Multiple Testing Biostatistics and Bioinformatics Summer 2018 Introduction You have previously considered the significance of a single gene Introduction You have previously

More information

SECOND PUBLIC EXAMINATION. Honour School of Physics Part C: 4 Year Course. Honour School of Physics and Philosophy Part C C7: BIOLOGICAL PHYSICS

SECOND PUBLIC EXAMINATION. Honour School of Physics Part C: 4 Year Course. Honour School of Physics and Philosophy Part C C7: BIOLOGICAL PHYSICS 2757 SECOND PUBLIC EXAMINATION Honour School of Physics Part C: 4 Year Course Honour School of Physics and Philosophy Part C C7: BIOLOGICAL PHYSICS TRINITY TERM 2013 Monday, 17 June, 2.30 pm 5.45 pm 15

More information

Lecture 5: LDA and Logistic Regression

Lecture 5: LDA and Logistic Regression Lecture 5: and Logistic Regression Hao Helen Zhang Hao Helen Zhang Lecture 5: and Logistic Regression 1 / 39 Outline Linear Classification Methods Two Popular Linear Models for Classification Linear Discriminant

More information