Practical Applications and Properties of the Exponentially Modified Gaussian (EMG) Distribution


Practical Applications and Properties of the Exponentially Modified Gaussian (EMG) Distribution

A Thesis Submitted to the Faculty of Drexel University by Scott Haney in partial fulfillment of the requirements for the degree of Doctor of Philosophy

March 3rd, 2011

Table of Contents

List of Tables
Abstract
1. Introduction
2. Background on Microarray Data Analysis
2.1 Gene Expression
2.2 Measuring Gene Expression
2.3 Affymetrix Microarrays
2.4 Experimental Errors and Data Preprocessing
3. Properties of the Exponentially Modified Gaussian (EMG) Distribution
3.1 Reparameterization of the EMG Distribution
3.2 EMG Quantile Bounds
3.3 Parameter Estimation
3.4 Shape Estimation
3.5 EMG Right Tail Approximation
4. Application of the EMG Distribution to Actual Affymetrix Microarray Perfect Match (PM) Probe Distributions
4.1 Comparing the Right Tail to a Shifted Exponential Distribution
Discrepancy in the Sample Quantile of the Sample Mean
Fitting the Right Tail of the Perfect Match (PM) Probe Data
Derivation of Functions That Decrease by a Common Ratio
Application of Functions That Decrease by a Common Ratio to the Right Tail
Practical Implementation of EMG Parameter Estimation
Method and Properties
Proof of Consistency
6. Practical Considerations and Alterations
Summary of Final Parameter Estimation Method
Currently Available Methods
Maximum Likelihood Estimation Method
The Silver Method
Method of Moments
Comparison of Methods on Synthetic Data
Conclusion
Appendices
A. Derivation of pdf and cdf
A.1 Derivation of the Probability Density Function and the Cumulative Distribution Function
Bibliography

List of Figures

2.1 The steps of gene expression that lead to a protein product (taken from [5])
2.2 Affymetrix Chip Design (taken from [13])
2.3 Step by step procedure of a typical Affymetrix microarray experiment (taken from [9])
2.4 Several sources of error for a microarray experiment (taken from [35])
3.1 Plots of EMG distributions for different values of k
4.1 Plots of the sample pdf histograms for the PM probe distributions from five Affymetrix microarrays along with a plot of an EMG distribution with k =
Plot of the right tail of the sample pdf histogram for the PM probe data from T01 tumor.cel fitted to a shifted version of $f(x) = (2/3)^{\log_2(x)}$

List of Tables

7.1 Synthetic data results for the new method
Synthetic data results for the method of moments
Synthetic data results for the Silver method

Abstract

Practical Applications and Properties of the Exponentially Modified Gaussian (EMG) Distribution
Scott Haney
Advisor: Moshe Kam, Ph.D.

The exponentially modified Gaussian (EMG) probability distribution is defined as the convolution of an exponential distribution and a Gaussian distribution, that is, as the distribution of the sum of independent exponential and Gaussian random variables. Using a reparameterized form of the EMG cumulative distribution function (cdf), several properties of the EMG distribution are derived. These properties are used to test whether the distribution of the perfect match (PM) probes from five Affymetrix microarrays follows an EMG distribution and to create a new parameter estimation method.

A commonly used method for preprocessing Affymetrix microarray data, known as the robust multi-array average (RMA), assumes that the distribution of the PM probes at least approximately follows an EMG distribution. Using the results derived in this thesis, it is found that the EMG distribution is not a good fit for these sample data, based on differences in the right tail of the sample distribution. A new distribution, whose right tail is very dissimilar to that of an EMG distribution, is derived and is shown to fit the right tail of the sample data more accurately.

From the properties of the EMG distribution derived in this thesis it is further shown that a new parameter estimation method can be created. This new method is compared against two other methods from the literature, namely the method of moments and the Silver method (2009). From a theoretical perspective, the new method has the advantage that it is proven to be consistent and to always return valid parameter estimates (for example, estimates that satisfy the constraint that the variance be positive). Neither the Silver method nor the method of moments has both of these qualities. All three methods were compared on the same synthetic data generated from EMG distributions, and it was found that the performance of each method depended on the shape of the EMG distribution. It was also found that the Silver method appears not to be consistent for EMG distributions that are too close to a Gaussian distribution.

1. Introduction

The EMG distribution is the convolution of a Gaussian distribution and an exponential distribution which are independent of each other. This distribution has found practical applications in a variety of scientific disciplines such as chromatography [17,0,3,9], cellular biology [14], radiotherapy [16], and microarray preprocessing [18,30]. Many of these practical applications focus on the problem of fitting data points to a function which is an EMG pdf multiplied by a scaling parameter. A large number of algorithms have been introduced in the literature to solve this problem [3, 11, 1, 36].

The focus of this thesis is to better understand the properties of the EMG distribution so that it can be determined whether or not the perfect match (PM) probe distributions from five Affymetrix microarrays approximately follow an EMG distribution. This is an important assumption made by a commonly used microarray preprocessing method known as the robust multi-array average (RMA) [18]. Several properties of the EMG distribution were derived and were used to show that the right tails of the sample probability density functions (pdfs) were much heavier than would be expected for an EMG distribution. By visual analysis of the sample pdf histograms it was determined that the right tails of the sample pdfs approximately reduced in height by one third whenever the value on the x-axis was doubled. This is a property that the right tail of an EMG pdf does not come close to having. A function with this property was derived and was found to be a reasonable approximation for the right tails of the sample pdfs. These results strongly challenge the assumption used by the RMA method that the PM probes approximately follow an EMG distribution.

Using the derived properties it is also possible to create a parameter estimation method that has some very desirable properties, such as consistency and always being able to return valid parameter estimates, where valid refers to parameter estimates that satisfy all of the constraints on the original parameters. Several parameter estimation methods already exist in the literature [, 30], and the new parameter estimation method is compared to two of these. The two methods selected were the method provided in [30] (referred to as the Silver method) and the method of moments. All three methods were compared on synthetic data generated from EMG distributions. The synthetic data trials distinguished between three scenarios:

1. The EMG distribution is close to being a shifted exponential distribution
2. The EMG distribution is close to being a Gaussian distribution
3. The EMG distribution is neither close to a shifted exponential distribution nor close to a Gaussian distribution

An EMG distribution is considered to be close to a shifted exponential distribution when a large fraction of the variance of the EMG distribution is due to the variance of the exponential component; it is considered to be close to a Gaussian distribution when a large fraction of the variance is due to the variance of the Gaussian component.

Both the Silver method and the method of moments were found to have distinct disadvantages compared to the new parameter estimation method. The method of moments failed to return valid parameter estimates at least 10 times out of 100 and at most 61 times out of 100 in the synthetic data trials. For these failed runs the method of moments returned at least one imaginary parameter estimate. The Silver method appears to converge to incorrect parameter estimates under the second scenario. The average parameter estimates for the Silver method, after applying it to 100 random samples of size 10,000 generated from a certain EMG distribution, were off by as much as 9 standard deviations.

With respect to accuracy, the results of the synthetic data trials showed that the performance of the parameter estimation methods varied across the three scenarios. In the first scenario, the accuracy of the Silver method was noticeably better in most cases than the accuracy of the new method and of the method of moments. In the second scenario, the accuracy of the method of moments and the accuracy of the new method were comparable, while in most cases the accuracy of the Silver method was noticeably lower. In the third scenario, the accuracy of the method of moments and the accuracy of the new method were again comparable, while in most cases the accuracy of the Silver method was noticeably lower.

The organization of this thesis is as follows:

1. Background necessary for understanding the application of the EMG distribution to Affymetrix microarray data is described.
2. Properties of the EMG distribution that will be used in improving the application of the EMG distribution in practice are derived.
3. The assumption that the PM probe data from Affymetrix microarrays approximately follow an EMG distribution is tested for data from five microarrays, and it is found that this assumption is unlikely to be true.
4. A new distribution is derived to fit the right tails of the PM probe distributions from the five microarrays. This new distribution is found to visually fit the sample data well and is not close to the right tail of an EMG distribution.
5. A new parameter estimation procedure is described and is proven to be consistent.
6. The new parameter estimation method is compared to two other parameter estimation methods from the literature and is found to have several important advantages over these two methods.
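The three synthetic-data scenarios described above can be illustrated with a short sketch. It draws EMG variates as the sum of a Gaussian and an independent exponential variate and reports the share of the total variance contributed by the exponential component. The function names and the `cutoff` value are illustrative assumptions; they are not the settings used in the thesis trials.

```python
import random


def sample_emg(n, mu, sigma, lam, seed=0):
    """Draw n EMG variates as Gaussian(mu, sigma) plus Exponential(rate=lam)."""
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) + rng.expovariate(lam) for _ in range(n)]


def exponential_variance_share(sigma, lam):
    """Fraction of the EMG variance sigma^2 + 1/lam^2 due to the exponential part."""
    exp_var = 1.0 / lam ** 2
    return exp_var / (sigma ** 2 + exp_var)


def scenario(sigma, lam, cutoff=0.99):
    """Label the scenario; cutoff is an illustrative choice, not taken from the thesis."""
    share = exponential_variance_share(sigma, lam)
    if share >= cutoff:
        return "close to shifted exponential"
    if share <= 1.0 - cutoff:
        return "close to Gaussian"
    return "neither"
```

For example, `scenario(1.0, 1.0)` falls in the third scenario because both components contribute equally to the variance.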

2. Background on Microarray Data Analysis

Within a single human being, different cell types can have exactly the same DNA yet be extraordinarily different. For example, skin cells and bone cells have the same DNA yet they are not very similar in form or function []. Although skin and bone cells have the same DNA, certain subsequences of the DNA (known as genes) affect the cellular environment in different ways. Perhaps the most commonly studied way by which a gene can affect a cell is the process of gene expression.

2.1 Gene Expression

Gene expression is a multi-step process by which a gene product is created from a gene. In humans the most common gene products are proteins, which are one or more long chains of amino acids that are folded together. For simplicity, gene expression is taken here to mean gene expression where the gene product is a protein, since proteins are thought to be the primary reason for biological changes within the cell. The steps of gene expression for protein products [] are:

1. DNA is transcribed into a complementary mRNA copy
2. Intron sequences are removed (or spliced) from the complementary mRNA copy
3. The spliced complementary mRNA sequence is translated into a chain of amino acids
4. Posttranslational modifications are made to the chain of amino acids and the final protein product is formed

These steps are shown pictorially in Figure 2.1.

Figure 2.1: The steps of gene expression that lead to a protein product (taken from [5])

Protein gene products are typically very complex and can affect the cell in different ways depending on a variety of factors. Two common factors that influence the effect of proteins are the concentration of other proteins in the cellular environment and the folded shape of the protein. Any change starting from gene expression and ending with the final structure, form, and environment of the protein product can affect the biology of the cell [].

2.2 Measuring Gene Expression

Obtaining a meaningful measure of gene expression is not straightforward. A single change in any step of the process can lead to different biological results. In practice, the first step of the process of gene expression, where the DNA is transcribed into a complementary mRNA copy, is the portion of the process that is measured. Measuring this step provides an estimate of the total amount of protein product that can be produced. This measurement, however, does not provide any estimate of how much of the protein is actually produced, nor does it give any idea of the final physical form of the protein in the cell. There are several important reasons for focusing on this portion of the process:

1. Methods for measuring the presence of mRNA molecules are well established
2. Since the human genome is approximately 99.9% identical across individuals, it is reasonable to assume that the same mRNA molecules are being tested for
3. It is possible to simultaneously measure the presence of a large number of mRNA molecules within the same sample

A number of testing devices are available for simultaneously measuring large numbers of mRNA molecules in a sample. One class of these testing devices, known as microarrays, is commonly used for this purpose in practice.

2.3 Affymetrix Microarrays

One of the most well known manufacturers of microarrays is Affymetrix [1]. Affymetrix microarrays are small chips that have their surfaces subdivided into a rectangular grid. Each rectangle in the grid contains a large number of 25 nucleotide long DNA probes, all having the same sequence. These DNA probes stand straight up on the surface of the chip, with the bottom end of the probe affixed to the surface and the top end free to move (Figure 2.2). This design allows mRNA molecules to chemically bind to the DNA probes on the surface of the microarray.


Figure 2.2: Affymetrix Chip Design (taken from [13])

Each subgrid contains either perfect match (PM) probes or mismatch (MM) probes. A PM probe is designed to be complementary to an expected subsequence of a specific mRNA. An MM probe is designed to match a PM probe sequence with the exception that the 13th nucleotide base is switched. Every subgrid of PM probes has a corresponding subgrid of MM probes. For every gene there are typically several PM and MM probe subgrids. The entire collection of these subgrids is termed a probeset.

Affymetrix microarrays measure mRNA levels by using basic principles of chemistry. Each DNA probe on the surface will prefer to be bound to other DNA that is exactly complementary. In general, the closer a subsequence of a DNA molecule is to being complementary to a probe sequence, the more likely it is to bind to the corresponding probe. By this principle, it is thought that if a targeted sequence is present in solution it will bind to its corresponding probe with high probability. Of course, other DNA sequences in solution that have a subsequence which is close to being complementary can also bind. It is thought that the MM probes can be used to provide an estimate of this erroneous binding, known as cross-hybridization.

An Affymetrix microarray experiment begins by extracting mRNA from a biological sample. The extracted mRNA then goes through a number of preparation steps in which it is labeled with a molecule that can be identified using a scanner, and the labeled mRNA is then applied to the surface of the microarray. After the chemical reactions have had some time to take place, the microarray is washed so that only the mRNA from the sample that is bound to probes should remain. Lastly, the microarray is put under a scanner and, for each rectangular subgrid, the intensity of the labeling molecule is measured. A pictorial example of this process is given in Figure 2.3.

Figure 2.3: Step by step procedure of a typical Affymetrix microarray experiment (taken from [9])


2.4 Experimental Errors and Data Preprocessing

Affymetrix microarrays are subject to technical, chemical, and human errors. An example of some of these errors can be seen in Figure 2.4. These errors have been extensively studied in the literature [33, 35, 37]; however, they still remain to be convincingly modeled in practical Affymetrix microarray experiments. An understanding of how these errors affect Affymetrix microarray data is essential for determining how reliable the data are, as well as for extracting a reasonable estimate of the level of gene expression in the sample.

Figure 2.4: Several sources of error for a microarray experiment (taken from [35])

Previous work towards estimating the mRNA concentration in the presence of error has met with some success. In one publication [1], a method was developed that was capable of detecting known mRNA levels in the presence of experimental error. At least two other authors determined differential equation models that took error into account and worked well on the data tested [5, 35]. For the type of microarray experiment described in the previous section, these techniques have not had a very significant impact in practice. For a different type of microarray experiment, a real-time microarray experiment [8], these techniques are much more effective and practical. Most microarray data sets that are currently available, however, are not from real-time microarray experiments.

In practice it is common to handle microarray errors by using techniques that are much simpler than the methods discussed in the previous paragraph. The typical first step in microarray data analysis is to preprocess the data in order to remove error. Some of the most commonly used microarray preprocessing techniques in practice are those provided directly by Affymetrix (PLIER and MAS 5.0) [34] and the robust multi-array average (RMA) [18]. As of the time of this writing, no single preprocessing method has been found to be generally preferable to the rest [19]. After the data are preprocessed it is usually assumed that the resulting data are error free. Data analysis techniques are then applied to the preprocessed data to find interesting results.

The preprocessing technique of primary interest in this thesis is RMA. At the present time the original RMA publication has been cited over 3,000 times. This technique makes the assumption that the distribution of PM probes from a microarray approximately follows an EMG distribution [18]. RMA uses this assumption to model observed values as signal (which follows an exponential distribution) plus noise (which follows a Gaussian distribution). The signal value is then estimated by solving for the expected value of the signal given the value of signal plus noise. If the assumption that the distribution of the PM probes follows an EMG distribution is incorrect, then the use of the EMG distribution in RMA to estimate the signal is questionable. This assumption is shown to be unlikely to be true based on the results of comparing the sample data to certain properties of the EMG distribution.
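As a worked illustration of the signal-plus-noise model described above, the conditional expectation of the signal can be written in closed form when the signal S is exponential with rate λ and the noise N is Gaussian with mean µ and standard deviation σ, independent of S. Completing the square in the joint density shows that S given S + N = o is a Gaussian with mean a = o − µ − λσ² and standard deviation σ, truncated to s > 0, so that E[S | S + N = o] = a + σφ(a/σ)/Φ(a/σ). The sketch below evaluates this expression and checks it against direct numerical integration. It is a minimal sketch under exactly these model assumptions and is not the implementation used in the RMA software; all function names are hypothetical.

```python
import math


def norm_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def norm_cdf(x):
    """Standard normal cdf expressed through erfc."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def expected_signal(o, mu, sigma, lam):
    """E[S | S + N = o] for S ~ Exp(rate=lam), N ~ Normal(mu, sigma^2), independent."""
    a = o - mu - lam * sigma ** 2  # mean of the Gaussian kernel before truncation at 0
    return a + sigma * norm_pdf(a / sigma) / norm_cdf(a / sigma)


def expected_signal_numeric(o, mu, sigma, lam, steps=20000):
    """Midpoint-rule evaluation of the same conditional expectation, as a check."""
    upper = max(o - mu, 0.0) + 12.0 * sigma  # posterior mass beyond this is negligible
    h = upper / steps
    num = den = 0.0
    for i in range(steps):
        s = (i + 0.5) * h
        w = math.exp(-lam * s - 0.5 * ((o - s - mu) / sigma) ** 2)
        num += s * w
        den += w
    return num / den
```

The closed form is numerically fragile when a/σ is very negative, because Φ(a/σ) underflows; a production implementation would need a more careful evaluation of the ratio.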

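3. Properties of the Exponentially Modified Gaussian (EMG) Distribution

Due to the wide reach of the EMG distribution in practical applications [14, 18, 30], a better understanding of the EMG distribution is worth pursuing. The cumulative distribution function (cdf) and the probability density function (pdf) of an EMG distribution are given below as $\mathrm{EMG}(c; \mu, \sigma, \lambda)$ and $\mathrm{emg}(c; \mu, \sigma, \lambda)$ respectively (see Appendix A for derivations):

$$\mathrm{EMG}(c; \mu, \sigma, \lambda) = \frac{1}{2}\left[1 - e^{\lambda\left(\frac{\lambda\sigma^2}{2} + \mu - c\right)} \operatorname{erfc}\left(\frac{1}{\sqrt{2}\,\sigma}\left(\lambda\sigma^2 + \mu - c\right)\right) + \operatorname{erf}\left(\frac{1}{\sqrt{2}\,\sigma}(c - \mu)\right)\right] \tag{3.1}$$

$$\mathrm{emg}(c; \mu, \sigma, \lambda) = \frac{\lambda}{2}\, e^{\lambda\left(\frac{\lambda\sigma^2}{2} + \mu - c\right)} \operatorname{erfc}\left(\frac{1}{\sqrt{2}\,\sigma}\left(\lambda\sigma^2 + \mu - c\right)\right) \tag{3.2}$$

where

$$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\, dt, \qquad \operatorname{erfc}(x) = \frac{2}{\sqrt{\pi}} \int_x^\infty e^{-t^2}\, dt = 1 - \operatorname{erf}(x).$$

In this chapter several properties of the EMG distribution are derived. The derivations predominantly rely on reparameterizing the input to the EMG cdf. These properties will later be used to challenge a current assumption that the PM probe data from Affymetrix microarrays approximately follow an EMG distribution [18], as well as to create a new parameter estimation method.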
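The cdf (3.1) and pdf (3.2) translate directly into code. The following is a minimal sketch using only the Python standard library; the function names are illustrative. Evaluating the exponential and erfc factors separately, as done here, can overflow or lose precision for arguments far in the left tail, so the sketch is intended for moderate inputs only.

```python
import math

SQRT2 = math.sqrt(2.0)


def emg_cdf(c, mu, sigma, lam):
    """EMG cdf, equation (3.1)."""
    expo = math.exp(lam * (lam * sigma ** 2 / 2.0 + mu - c))
    z = (lam * sigma ** 2 + mu - c) / (SQRT2 * sigma)
    return 0.5 * (1.0 - expo * math.erfc(z) + math.erf((c - mu) / (SQRT2 * sigma)))


def emg_pdf(c, mu, sigma, lam):
    """EMG pdf, equation (3.2)."""
    expo = math.exp(lam * (lam * sigma ** 2 / 2.0 + mu - c))
    z = (lam * sigma ** 2 + mu - c) / (SQRT2 * sigma)
    return 0.5 * lam * expo * math.erfc(z)
```

A quick consistency check is that a central finite difference of `emg_cdf` agrees with `emg_pdf` at the same point.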

3.1 Reparameterization of the EMG Distribution

A carefully selected reparameterization of the input to the EMG cdf can be used to show several useful properties of the EMG distribution. This result was discovered by analyzing the mode of the EMG pdf, which occurs where the derivative of the EMG pdf is equal to zero. This condition is given by

$$\lambda\sigma = \sqrt{\frac{2}{\pi}}\, \frac{e^{-p^2}}{\operatorname{erfc}(p)} \tag{3.3}$$

where

$$p = \frac{\lambda\sigma^2 + \mu - c}{\sqrt{2}\,\sigma}.$$

Solving for $c$ yields the mode of the EMG pdf. The equation for the mode can be simplified somewhat by replacing $c$ with a reparameterization $c_1$ which is given by

$$c_1 = \mu + \lambda\sigma^2 - \sqrt{2}\, D\sigma \tag{3.4}$$

where $D \in \mathbb{R}$. Replacing $c$ with $c_1$ in (3.3) gives $p = D$, so the equation becomes

$$\lambda\sigma = \sqrt{\frac{2}{\pi}}\, \frac{e^{-D^2}}{\operatorname{erfc}(D)}.$$

This equation shows that the mode of the EMG pdf can be written entirely in terms of $D$ and $\lambda\sigma$. The term $\lambda\sigma$ will be used often throughout the rest of this thesis, and from this point on it will be denoted by $k$.

This reparameterization can be slightly generalized and can be used to simplify the EMG cdf for some input values. The generalized reparameterization is denoted by $c_2$ and is given by

$$c_2 = \mu + C\lambda\sigma^2 + D\sigma$$

where $C, D \in \mathbb{R}$. Replacing $c$ with $c_2$ in (3.1), the EMG cdf reduces to

$$\mathrm{EMG}_{c_2}(C, D, k) = \frac{1}{2}\left[1 - e^{\frac{k^2}{2} - Ck^2 - Dk} \operatorname{erfc}\left(\frac{k - Ck - D}{\sqrt{2}}\right) + \operatorname{erf}\left(\frac{Ck + D}{\sqrt{2}}\right)\right] \tag{3.5}$$

where $k = \lambda\sigma$. Using this equation it is possible to calculate any quantile that can be represented in terms of $c_2$ once $k$ is known. Several important results that will be used in this thesis are explained in the following sections. These results rely heavily on the term $k$ and on (3.5). From the work in the following sections it will become evident that $k$ provides a significant amount of information about an EMG distribution.

3.2 EMG Quantile Bounds

Analysis of specific values of $c_2$ reveals that at least some of the quantiles must lie within certain bounds. This is accomplished by combining the constraint $k > 0$ (which holds because both $\sigma$ and $\lambda$ are greater than zero) with (3.5). Two such bounds are given in the following paragraphs, both as examples and for use later in this thesis.

Perhaps the simplest example of a quantile bound is when $C = D = 0$. Under these conditions $c_2 = \mu$ and (3.5) reduces to a function that depends only on $k$:

$$\mathrm{EMG}_{c_2}(0, 0, k) = \mathrm{EMG}_{\mu}(k) = \frac{1}{2}\left[1 - e^{\frac{k^2}{2}} \operatorname{erfc}\left(\frac{k}{\sqrt{2}}\right)\right] \tag{3.6}$$
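The reduction from (3.1) to (3.5) can be checked numerically. The sketch below implements both forms, with illustrative function names, and evaluates (3.6) for a few values of k so that the quantile level of µ can be inspected directly. It assumes the reparameterization c_2 = µ + Cλσ² + Dσ with k = λσ, as stated in the text.

```python
import math

SQRT2 = math.sqrt(2.0)


def emg_cdf(c, mu, sigma, lam):
    """EMG cdf in the original parameters, equation (3.1)."""
    expo = math.exp(lam * (lam * sigma ** 2 / 2.0 + mu - c))
    z = (lam * sigma ** 2 + mu - c) / (SQRT2 * sigma)
    return 0.5 * (1.0 - expo * math.erfc(z) + math.erf((c - mu) / (SQRT2 * sigma)))


def emg_cdf_c2(C, D, k):
    """Reparameterized EMG cdf, equation (3.5), at c_2 = mu + C*lam*sigma^2 + D*sigma."""
    expo = math.exp(k * k / 2.0 - C * k * k - D * k)
    return 0.5 * (1.0 - expo * math.erfc((k - C * k - D) / SQRT2)
                  + math.erf((C * k + D) / SQRT2))


def emg_mu_quantile(k):
    """Quantile level of mu as a function of k alone, equation (3.6)."""
    return emg_cdf_c2(0.0, 0.0, k)


if __name__ == "__main__":
    for k in (0.1, 0.5, 1.0, 2.0, 10.0):
        print(k, emg_mu_quantile(k))
```

Because (3.5) depends on µ, σ, and λ only through k, the same value is obtained for any parameter triple with the same product λσ.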

Taking a derivative shows that $\mathrm{EMG}_{\mu}(k)$ is monotonically increasing for $k \in (0, \infty)$, with limiting values $0$ as $k \to 0$ and $\frac{1}{2}$ as $k \to \infty$. Using this information it can be shown that

$$0 < \mathrm{EMG}_{\mu}(k) < \frac{1}{2} \tag{3.7}$$

for any EMG distribution.

It is also possible to determine a quantile bound on the mean $m = \mu + \lambda^{-1}$ of an EMG distribution. The reparameterization $c_2$ is equal to $m$ when $C = k^{-2}$ and $D = 0$. Under these conditions (3.5) reduces to a function that depends only on $k$:

$$\mathrm{EMG}_{m}(k) = \mathrm{EMG}_{c_2}(k^{-2}, 0, k) = \frac{1}{2}\left[1 - e^{\frac{k^2}{2} - 1} \operatorname{erfc}\left(\frac{k - k^{-1}}{\sqrt{2}}\right) + \operatorname{erf}\left(\frac{k^{-1}}{\sqrt{2}}\right)\right] \tag{3.8}$$

Analysis of the derivative shows that $\mathrm{EMG}_{m}(k)$ is monotonically decreasing for $k \in (0, \infty)$, with limiting values $1 - e^{-1}$ as $k \to 0$ and $\frac{1}{2}$ as $k \to \infty$. Using this information it can be shown that

$$\frac{1}{2} < \mathrm{EMG}_{m}(k) < 1 - e^{-1} \approx 0.632 \tag{3.9}$$

for any EMG distribution.

3.3 Parameter Estimation

It is possible to completely define an EMG distribution in terms of $k$ and two quantiles rather than in terms of the three parameters $\mu$, $\sigma$, and $\lambda$. Assuming that $k$ is known, one such procedure for determining the parameters is as follows:

1. Determine $\mu$ from the quantile given by the right hand side of $\mathrm{EMG}_{\mu}(k) = \frac{1}{2}\left[1 - e^{\frac{k^2}{2}} \operatorname{erfc}\left(\frac{k}{\sqrt{2}}\right)\right]$
2. Determine $m_l = \mu + \lambda^{-1}$ from the quantile given by the right hand side of $\mathrm{EMG}_{m}(k) = \frac{1}{2}\left[1 - e^{\frac{k^2}{2} - 1} \operatorname{erfc}\left(\frac{k - k^{-1}}{\sqrt{2}}\right) + \operatorname{erf}\left(\frac{k^{-1}}{\sqrt{2}}\right)\right]$
3. Determine $m_s = \mu + \sigma$ from the quantile given by the right hand side of $\mathrm{EMG}_{c_2}(0, 1, k) = \frac{1}{2}\left[1 - e^{\frac{k^2}{2} - k} \operatorname{erfc}\left(\frac{k - 1}{\sqrt{2}}\right) + \operatorname{erf}\left(\frac{1}{\sqrt{2}}\right)\right]$
4. Estimate $\lambda$ by subtracting the estimate of $\mu$ from $m_l$ and then taking the multiplicative inverse of the result
5. Estimate $\sigma$ by subtracting the estimate of $\mu$ from $m_s$

The ability to define the EMG distribution in terms of $k$ and two quantiles opens up the possibility of a new type of parameter estimation method for an EMG distribution. Given a sample from an EMG distribution, if $k$ can be estimated then it is possible to estimate the parameters of the EMG distribution. In practice, a simple way to estimate $k$ is to compute the sample quantile of the sample mean, that is, the fraction of observations that do not exceed the sample mean. This estimate can then be substituted for the left hand side of (3.8), and an estimate for $k$ can be obtained by solving this equation for $k$. As long as the sample quantile of the sample mean satisfies (3.9) it will be possible to solve for $k$.

3.4 Shape Estimation

The value of $k$ determines the overall shape of the EMG distribution. This can be seen by writing the variance of an EMG distribution in terms of $k$, which yields the following:

$$\operatorname{Var}(\mathrm{EMG}(c; \mu, \sigma, \lambda)) = \sigma^2 + \lambda^{-2} = \frac{k^2 + 1}{\lambda^2} \tag{3.10}$$

$$= \sigma^2 + \frac{\sigma^2}{k^2} \tag{3.11}$$

As $k$ in (3.10) approaches zero, the impact of the Gaussian component on the variance becomes negligible. As $k$ in (3.11) approaches infinity, the impact of the shifted exponential component on the variance becomes negligible. As the variance of a component becomes negligible, the EMG distribution will be close to the distribution of the other component. These observations indicate that for large values of $k$ the EMG distribution is close to a Gaussian distribution, and that for small values of $k$ the EMG distribution is close to a shifted exponential distribution.

In practice, it is likely that an EMG distribution which is very close to being either a Gaussian distribution or a shifted exponential distribution will be treated as a Gaussian distribution or a shifted exponential distribution respectively. For this reason, it seems reasonable to assume that EMG distributions which arise in practice are likely to have $k$ values that lie within a certain bounded interval. The variance relations discussed in the previous paragraph provide a way to obtain a rough estimate for this interval. By combining (3.10) and (3.11) it follows that

$$\frac{k^2 + 1}{\lambda^2} = \sigma^2 + \frac{\sigma^2}{k^2}.$$

Setting $k = 1$ results in $\sigma = \lambda^{-1}$, which implies that the variances of the two components are equal. It seems reasonable to assume that a component will become negligible when its variance is less than a certain percentage of the other. It further seems reasonable to assume that this percentage can be set to 1%. Since the ratio of the Gaussian variance to the exponential variance is $k^2$, this results in the following bounds on $k$:

$$k \in [0.1, 10].$$

Several plots of EMG distributions for different values of $k$ between 0.1 and 10 are given in Figure 3.1. In practice, it seems unlikely that values of $k$ outside of this interval will be encountered. If values outside this interval do occur, it will be very difficult to estimate the parameters of the EMG distribution. The reason is that the closer the EMG distribution becomes to either a Gaussian distribution or a shifted exponential distribution, the harder it will be to estimate the exact magnitude of the difference. In general, the slighter the modification to a distribution, the harder it will be to detect.

3.5 EMG Right Tail Approximation

It is possible to show that the EMG cdf is approximately the same as a shifted exponential cdf in the right tail of the distribution. The cdf of a shifted exponential distribution will be denoted by $\mathrm{SED}(c; \lambda, T)$ and is defined to be

$$\mathrm{SED}(c; \lambda, T) = 1 - e^{-\lambda(c - T)} \tag{3.12}$$

where $T$ is the shift parameter and $\lambda$ is the same shape parameter that is used in an exponential distribution. The desired approximation will be derived by considering the reparameterization

$$c_3 = \mu + D\sigma \tag{3.13}$$

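where $D \in \mathbb{R}$. Using this new reparameterization in place of $c$ in the EMG cdf it follows that

$$\mathrm{EMG}_{c_3}(D, k) = \frac{1}{2}\left[1 - e^{\frac{k^2}{2} - Dk} \operatorname{erfc}\left(\frac{k - D}{\sqrt{2}}\right) + \operatorname{erf}\left(\frac{D}{\sqrt{2}}\right)\right].$$

Figure 3.1: Plots of EMG distributions for different values of k. Panels: (a) k = 0.1, (b) k = 0.5, (c) k = 1.0, (d) k = 2.0.

To see what happens in the right tail, the limit as $c_3$ approaches infinity is considered. This limit cannot be determined immediately because the right hand side of $\mathrm{EMG}_{c_3}(D, k)$ does not directly include $c_3$. The right hand side is written in terms of $D$, so a relation between the limiting value of $c_3$ and $D$ allows the limit to be easily evaluated. From the constraint that $\mathrm{EMG}_{\mu}(k) \in (0, \frac{1}{2})$ (3.7) and the constraint that $\sigma > 0$, it is clear that if $c_3$ is greater than the median then $D > 0$. Thus as $c_3$ approaches infinity, $D$ approaches infinity. Using this information it can be seen that

$$\lim_{c_3 \to \infty} \mathrm{EMG}_{c_3}(D, k) = \lim_{c_3 \to \infty} \frac{1}{2}\left[1 - e^{\frac{k^2}{2} - Dk} \operatorname{erfc}\left(\frac{k - D}{\sqrt{2}}\right) + \operatorname{erf}\left(\frac{D}{\sqrt{2}}\right)\right]$$

$$= \lim_{D \to \infty} \frac{1}{2}\left[1 - e^{\frac{k^2}{2} - Dk} \operatorname{erfc}\left(\frac{k - D}{\sqrt{2}}\right) + \operatorname{erf}\left(\frac{D}{\sqrt{2}}\right)\right]$$

$$= \frac{1}{2}\left[2 - \lim_{D \to \infty} 2 e^{\frac{k^2}{2} - Dk}\right] = 1 - \lim_{D \to \infty} e^{-k\left(D - \frac{k}{2}\right)}$$

where the expression inside the last limit is, in the variable $D$, the cdf of a shifted exponential distribution with shift $T = \frac{k}{2}$ and shape parameter $\lambda = k$. Both the erf and the erfc terms approach their limits at a much faster rate than does a term of the form $e^{-kD}$, hence the cdf of the EMG distribution should be approaching the cdf of a shifted exponential distribution.

To show in a more quantitative manner that the right tail approximation is accurate, it is first noted that if $D \ge D_0 > 0$ then the following bounds hold:

$$1 > \operatorname{erf}\left(\frac{D}{\sqrt{2}}\right) \ge 1 - \alpha_1, \qquad 2 > \operatorname{erfc}\left(\frac{k - D}{\sqrt{2}}\right) \ge 2 - \alpha_2$$

where

$$\alpha_1 = \operatorname{erfc}\left(\frac{D_0}{\sqrt{2}}\right), \qquad \alpha_2 = \operatorname{erfc}\left(\frac{D_0 - k}{\sqrt{2}}\right).$$

From the fact that $k > 0$ it must be that $\alpha_1 < \alpha_2$. Using these constraints, bounds can be put on $\mathrm{EMG}_{c_3}(D, k)$. The lower bound is given by

$$\mathrm{EMG}_{c_3}(D, k) \ge \frac{1}{2}\left[1 - e^{\frac{k^2}{2} - Dk} \operatorname{erfc}\left(\frac{k - D}{\sqrt{2}}\right) + 1 - \alpha_1\right] = \frac{1}{2}\left[(2 - \alpha_1) - e^{\frac{k^2}{2} - Dk} \operatorname{erfc}\left(\frac{k - D}{\sqrt{2}}\right)\right]$$

$$> \frac{1}{2}\left[(2 - \alpha_1) - 2 e^{\frac{k^2}{2} - Dk}\right] = \left(1 - \frac{\alpha_1}{2}\right) - e^{\frac{k^2}{2} - Dk} = 1 - e^{\frac{k^2}{2} - Dk} - \frac{1}{2}\operatorname{erfc}\left(\frac{D_0}{\sqrt{2}}\right)$$

and the upper bound, which uses $\operatorname{erf}\left(\frac{D}{\sqrt{2}}\right) < 1$ together with the lower bound on the erfc term, is given by

$$\mathrm{EMG}_{c_3}(D, k) < \frac{1}{2}\left[1 - (2 - \alpha_2)\, e^{\frac{k^2}{2} - Dk} + 1\right] = \frac{1}{2}\left[2 - (2 - \alpha_2)\, e^{\frac{k^2}{2} - Dk}\right]$$

$$= 1 - e^{\frac{k^2}{2} - Dk} + \frac{1}{2}\operatorname{erfc}\left(\frac{D_0 - k}{\sqrt{2}}\right) e^{\frac{k^2}{2} - Dk} \le 1 - e^{\frac{k^2}{2} - Dk} + \frac{1}{2}\operatorname{erfc}\left(\frac{D_0 - k}{\sqrt{2}}\right) e^{\frac{k^2}{2} - D_0 k}$$

$$= 1 - e^{\frac{k^2}{2} - Dk} + \frac{1}{2}\operatorname{erfc}(w(k))\, e^{w^2(k) - \frac{D_0^2}{2}}$$

where $w(k) = \frac{D_0 - k}{\sqrt{2}}$. For these two bounds, the errors between $\mathrm{EMG}_{c_3}(D, k)$ and the bounds are given by

$$L_e = \frac{1}{2}\operatorname{erfc}\left(\frac{D_0}{\sqrt{2}}\right) \tag{3.14}$$

$$U_e = \frac{1}{2}\operatorname{erfc}(w(k))\, e^{w^2(k) - \frac{D_0^2}{2}} \tag{3.15}$$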

where $L_e$ is the error in the lower bound and $U_e$ is the error in the upper bound. Both error terms approach zero much more rapidly than does the term $e^{k^2/2 - Dk}$, so these approximations should be quite accurate as long as $D$ is large enough relative to $k$.

The error in approximation can also be characterized in terms of the percentage error. It is possible to show that the percentage error of the approximation monotonically decreases to zero for $D > 0$. The percentage error $PE$ of approximating the value of $EMG_{c_3}(D, k)$ at $D$ is given by

$$PE = \frac{EMG_{c_3}(D, k) - \left(1 - e^{k^2/2 - Dk}\right)}{EMG_{c_3}(D, k)} = 1 - \frac{1 - e^{k^2/2 - Dk}}{EMG_{c_3}(D, k)} \tag{3.16}$$

If the percentage error is monotonically decreasing to zero for $D > 0$, then the second term in $PE$, given by

$$P_r = \frac{1 - e^{k^2/2 - Dk}}{EMG_{c_3}(D, k)}$$

must be monotonically increasing to one for $D > 0$. The limit of $P_r$ as $D \to \infty$ is clearly one, since both the numerator and the denominator are valid cdfs. The derivative of $P_r$ with respect to $D$ can be shown to be positive, so $P_r$ is monotonically increasing with respect to $D$. The denominator of the derivative is always positive since it is squared, and the numerator of the derivative, which is a positive multiple of

$$k\,e^{k^2/2 - Dk}\left(\operatorname{erf}\!\left(\tfrac{D}{\sqrt{2}}\right) - \operatorname{erf}\!\left(\tfrac{D - k}{\sqrt{2}}\right)\right)$$

is positive for all $D > 0$.

As an example of the accuracy of this approximation, assume that $k = 1$ and $D = 2$. Under these circumstances it follows that

$$\frac{EMG_{c_3}(D, k) - \left(1 - e^{k^2/2 - Dk}\right)}{EMG_{c_3}(D, k)} \approx 0.016$$

which shows that the percentage error in the approximation is close to 1.6%. Because the percentage error is monotonically decreasing for $D > 0$, the percentage error in the approximation for $k = 1$ will be no more than approximately 1.6% for all $D \ge 2$.
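The worked example above can be checked numerically. The following is a minimal sketch, assuming that $k = \lambda\sigma$ and that $c_3 = \mu + D\sigma$, so that the EMG cdf at $c_3$ depends only on $D$ and $k$; under that assumption the erf/erfc form used in this chapter reproduces the 1.6% figure. The function names are illustrative and do not come from the thesis.

```python
import math

SQRT2 = math.sqrt(2.0)

def emg_cdf_at_c3(D, k):
    """EMG cdf at c_3 = mu + D*sigma, assuming k = lambda*sigma (an assumption)."""
    return 0.5 * (1.0
                  - math.exp(0.5 * k * k - D * k) * math.erfc((k - D) / SQRT2)
                  + math.erf(D / SQRT2))

def shifted_exponential_approx(D, k):
    """Right-tail approximation 1 - exp(k^2/2 - D*k)."""
    return 1.0 - math.exp(0.5 * k * k - D * k)

def percentage_error(D, k):
    """PE from (3.16): relative gap between the EMG cdf and its tail approximation."""
    exact = emg_cdf_at_c3(D, k)
    return (exact - shifted_exponential_approx(D, k)) / exact

if __name__ == "__main__":
    for D in (2.0, 3.0, 4.0):
        print(D, percentage_error(D, 1.0))
```

Evaluating `percentage_error` on an increasing grid of `D` values is a quick way to see the monotone decrease claimed in the text for a given `k`.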

4. Application of the EMG Distribution to Actual Affymetrix Microarray Perfect Match (PM) Probe Distributions

Data from five Affymetrix microarrays described in [31] were downloaded from [7]. The five data files selected were T01 tumor.cel through T05 tumor.cel. It is found that the sample distributions of the PM probes for these five microarrays are unlikely to follow an EMG distribution. First, it is shown that the right tail of the sample pdf is not what would be expected for an EMG distribution. Further, it is shown that the sample quantiles of the sample means for all five distributions are larger than would be expected for an EMG distribution.

4.1 Comparing the Right Tail to a Shifted Exponential Distribution

From the results in Section 3.5 it is clear that the EMG cdf should be well approximated by the cdf of a shifted exponential distribution in the right tail. To apply this approximation in practice it is necessary to know where the right tail begins. It will be shown that the start of the right tail can be reasonably approximated if an upper bound $k_{\max}$ on $k$ can be assumed. Once the right tail has been located, a slightly modified ratio of two sample quantiles will be compared to the ratio that would be expected if the distribution were a shifted exponential distribution. The results of this test will show that the right tails of the PM probe distributions from the five Affymetrix microarrays described at the beginning of this chapter are very different from what would be expected for an EMG distribution.

Locating the Beginning of the Right Tail

To estimate the beginning of the right tail there must first be an estimate for the upper bound on $k$, denoted by $k_{\max}$. From this estimate it is possible to determine upper bounds on $\sigma$, denoted by $\sigma_{\max}$, and on $\mu$, denoted by $\mu_{\max}$. Using the result from Section 3.5 that the percentage error in the right tail approximation is monotonically decreasing, it is possible to select a value for $D$ such that the percentage error is bounded. These three estimates are then used to calculate $c_3$ (3.13), which is the estimate for the beginning of the right tail.

A value for $k_{\max}$ can reasonably be chosen by visual inspection of the data, given the insights from Section 3.4. From viewing the sample pdf histograms of the five PM probe distributions (Figure 4.1), $k_{\max} = 1$ seems like a safe estimate. Using $k_{\max}$, $\sigma_{\max}$ can be obtained by rearranging (3.11) to obtain

$$\sigma_{\max} \approx \frac{s}{\sqrt{1 + k_{\max}^{-2}}}$$

where $s$ is the sample standard deviation. Substituting $k = k_{\max}$, $C = D = 0$ into (3.5) yields an estimate for $\mu_{\max}$. Lastly, a suitable value for $D$ must be chosen so that the percentage error between the actual EMG tail and the shifted exponential tail is small enough. In Section 3.5 it was shown that for $D > 2$ the percentage error in the approximation was no more than roughly 1.6%. Given that this error seems to be small enough, it is assumed that the right tail begins at $c_3 = \mu_{\max} + 2\sigma_{\max}$.

Testing the Right Tail

In order to test that the right tail is approximately a shifted exponential distribution it is necessary to use a test that will not be affected much by the error in the approximation. One such test is to slightly modify the ratio of two quantiles.

Figure 4.1: Plots of the sample pdf histograms for the PM probe distributions from five Affymetrix microarrays along with a plot of an EMG distribution with k = 1. Panels: (a) T01 tumor.cel, (b) T02 tumor.cel, (c) T03 tumor.cel, (d) T04 tumor.cel, (e) T05 tumor.cel, (f) EMG with k = 1.

Estimates of quantiles tend to be fairly robust, so this test should not be greatly affected by the approximation error between $EMG_{c_3}(D, k)$ and the cdf of some shifted exponential distribution. To derive a test for the ratio of quantiles from a shifted exponential distribution, such a test is first created for an exponential distribution and then extended in a natural way to the shifted exponential distribution. The cdf of an exponential distribution, denoted by $E(c; \lambda)$, is given by

$$E(c; \lambda) = 1 - e^{-\lambda c}$$

For an exponential distribution, the ratio of any two quantiles depends only on the quantile levels. To see this suppose that $E(x_1; \lambda) = q$ and $E(x_2; \lambda) = p$. Then it follows that

$$\frac{x_1}{x_2} = \frac{\ln(1 - q)}{\ln(1 - p)}$$

where the ratio of the quantiles is clearly independent of $\lambda$. For a shifted exponential distribution the only change that needs to be made is to shift the input by the value of the shift parameter $T$. The shifted ratio of its quantiles, denoted by $S_r$, is given by

$$S_r = \frac{x_1 - T}{x_2 - T} \tag{4.1}$$

$$\phantom{S_r} = \frac{\ln(1 - q)}{\ln(1 - p)} \tag{4.2}$$

The ratio test just derived for a shifted exponential distribution can be directly applied to the experimental data being studied despite two possible problems. The first possible problem is that the right tail of the distribution will not be a valid probability distribution (because the area under the right tail is not equal to one). Instead the right tail will be some constant multiple of a probability distribution that is close to being a shifted exponential distribution. Fortunately, these constants cancel when the ratio is taken, so the test is not affected. The second possible problem is the approximation error. It is important to show that the approximation error will not cause a large error in $S_r$. Bounds on the error in $S_r$ caused by error in the approximation will be derived. In the application of this test to the PM probe distributions of Affymetrix microarrays these bounds will be used to show that the error in $S_r$ caused by approximation error is not significant.

Showing that the approximation does not significantly affect $S_r$ requires some extra work because the shift parameter $T$ is present in the ratio test. It was shown in Section 3.5 that the error in approximation can be written in terms of percentage error and that the percentage error can be made as small as desired by moving far enough to the right. Shifting the actual quantile value along with its approximation changes the percentage error, so it is necessary to know how the percentage error changes in this case. The percentage error ($PE$ from (3.16)) and the shifted percentage error, denoted by $PE_s$, can be related as follows:

$$PE = PE_s\,\frac{EMG_{c_3}(D, k) - T}{EMG_{c_3}(D, k)}$$

From this equation it follows that if the ratio of the shifted quantile to the actual quantile is not too small, then $PE_s$ will be reasonably small whenever $PE$ is reasonably small. Applying this result back to the EMG cdf, it follows that the ratio of any two quantiles $x_1 = EMG(y_1; \mu, \sigma, \lambda)$ and $x_2 = EMG(y_2; \mu, \sigma, \lambda)$ that are far enough into the right tail is bounded by

$$\frac{1 - PE_s}{1 + PE_s} < \left(\frac{EMG(y_1; \mu, \sigma, \lambda) - T}{EMG(y_2; \mu, \sigma, \lambda) - T}\right)\left(\frac{SED(y_2) - T}{SED(y_1) - T}\right) < \frac{1 + PE_s}{1 - PE_s}$$

where $SED(y)$ denotes the corresponding value under the shifted exponential approximation. This shows that the quantile ratio assuming a shifted exponential distribution will be approximately the same as the quantile ratio assuming an EMG cdf as long as $PE_s$ is small enough.

This ratio test is now applied to the sample data from the five Affymetrix PM probe distributions using the quantiles $q = 0.50$ and $p =$ [...]. For all five sample pdfs it was assumed that $k_{\max} = 1$ (see Figure 4.1 for a visual comparison) and that the right tail could be assumed to start at $D = 2$ (see Section 4.1 for justification). Using these assumptions it was found that, even with the approximation error taken into account, varying the sample quantiles by as much as five standard deviations was not enough to match the ratio that would be expected. This result strongly suggests that the right tail does not follow a shifted exponential distribution, which casts doubt on the assumption that these data follow an EMG distribution.

4.2 Discrepancy in the Sample Quantile of the Sample Mean

For all five data sets the sample quantile of the sample mean was much larger than the $(1 - e^{-1})$th quantile. Since the quantile of the mean of an EMG distribution cannot be larger than the $(1 - e^{-1})$th quantile, it seems likely that the sample data are not EMG distributed. To investigate this possibility a hypothesis test is created to determine whether or not the quantile of the mean of each distribution is larger than the $(1 - e^{-1})$th quantile. To create the hypothesis test it is assumed that both the sample mean and the sample quantiles approximately follow a Gaussian distribution. Because the sample size was greater than 200,000 for all five sample distributions, these two assumptions seem reasonable in light of the central limit theorem. Given these assumptions, the paired t-test can be used to determine whether it is likely that the quantile of the mean is larger than the $(1 - e^{-1})$th quantile.
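The bound on the quantile of the mean can be illustrated numerically. The sketch below is a rough check, assuming $k = \lambda\sigma$ so that the mean $\mu + \lambda^{-1}$ sits at $D = 1/k$ standard deviations above $\mu$; it evaluates the EMG cdf at the mean for several values of $k$ and compares each value with $1 - e^{-1}$. The function name and the grid of $k$ values are illustrative choices, not taken from the thesis.

```python
import math

SQRT2 = math.sqrt(2.0)

def emg_quantile_of_mean(k):
    """EMG cdf evaluated at the mean, as a function of k = lambda*sigma (assumed)."""
    d = 1.0 / k  # the mean lies at mu + sigma/k
    return 0.5 * (1.0
                  - math.exp(0.5 * k * k - 1.0) * math.erfc((k - d) / SQRT2)
                  + math.erf(d / SQRT2))

if __name__ == "__main__":
    bound = 1.0 - math.exp(-1.0)
    for k in (0.05, 0.1, 0.5, 1.0, 2.0):
        q = emg_quantile_of_mean(k)
        print(f"k={k:5.2f}  quantile of mean={q:.5f}  below bound: {q < bound}")
```

Small values of `k` (distributions close to a shifted exponential) push the quantile of the mean toward the bound, which is why a sample quantile above $1 - e^{-1}$ is hard to reconcile with any EMG fit.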
After applying the paired t-test to all of the sample distributions it was found that the difference between the sample quantile of the sample mean and the $(1 - e^{-1})$th quantile was very large. For all five data sets, the difference between the sample quantile of the sample mean and the $(1 - e^{-1})$th sample quantile was no less than 90, while the standard deviations for both estimates were less than 1. Given these values, the null hypothesis that the quantile of the mean is less than the $(1 - e^{-1})$th quantile can easily be rejected at the $\alpha = 0.01$ level for all five sample distributions.

Since the mean of the sample data occurs at such a large quantile, it seems likely that the best EMG fit for the data would be a distribution that is close to being a shifted exponential distribution (small value of $k$). From Figure 4.1 it is clear that the sample pdfs are not very similar to a shifted exponential distribution. This result shows that it is unlikely that these sample distributions follow an EMG distribution.
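For reference, the quantile-ratio identity (4.2) from Section 4.1 can be verified with a few lines of code. This is a minimal sketch: the quantile levels, rates, and shifts below are arbitrary illustrative values (they are not the values used in the thesis), and the function names are hypothetical.

```python
import math

def shifted_exp_quantile(p, lam, shift):
    """Quantile x_p of a shifted exponential with cdf 1 - exp(-lam*(x - shift)), x >= shift."""
    return shift - math.log(1.0 - p) / lam

def expected_ratio(q, p):
    """Right-hand side of (4.2): ln(1 - q) / ln(1 - p)."""
    return math.log(1.0 - q) / math.log(1.0 - p)

def shifted_ratio(x1, x2, shift):
    """Left-hand side of (4.1): (x1 - T) / (x2 - T)."""
    return (x1 - shift) / (x2 - shift)

if __name__ == "__main__":
    q, p = 0.50, 0.90  # illustrative quantile levels only
    for lam in (0.2, 1.0, 5.0):
        for shift in (-3.0, 0.0, 40.0):
            x1 = shifted_exp_quantile(q, lam, shift)
            x2 = shifted_exp_quantile(p, lam, shift)
            print(lam, shift, shifted_ratio(x1, x2, shift), expected_ratio(q, p))
```

In an application to data, `x1` and `x2` would be replaced by sample quantiles taken from the right tail and `shift` by the estimated start of the tail; the comparison against `expected_ratio` is then the test statistic described in the text.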

5. Fitting the Right Tail of the Perfect Match (PM) Probe Data

Given that the sample data are unlikely to follow an EMG distribution, the next question is what distribution these data do follow. The previous chapter showed that the right tails of the sample distributions were very different from the right tail of an EMG distribution. From visual inspection (Figure 4.1) it appears that the problem is due to the right tails of the sample pdf histograms being much too heavy. In other words, the right tails of the sample pdf histograms do not go to zero as quickly as would be expected. After further visual examination, the right tails of the sample pdfs all seemed to share the property that doubling the input reduced the height of the sample pdf histogram to approximately one third of its value. Taking this observation as an assumption, the problem of determining an appropriate distribution for the right tail of the sample data becomes the problem of finding a function with this property. Such a function will be derived in the next section and will be shown to fit the right tails of the sample pdf histograms closely. The derivation of this function will then be generalized to a larger class of functions.

5.1 Derivation of Functions That Decrease by a Common Ratio

It is assumed that a function $f(x)$ such that

$$\frac{f(x)}{f(2x)} = 3 \tag{5.1}$$

may be an appropriate distribution for modeling the right tails of the sample pdf histograms. In order to determine the form of $f(x)$, several common functional forms were assumed for $f(x)$ and the algebra was checked to see if the final result was valid.

After several attempts it was found that by assuming $f(x) = g(x)^x$ it was possible to determine $f(x)$. By using the substitution $f(x) = g(x)^x$ it follows that

$$\frac{f(x)}{f(2x)} = \frac{g(x)^x}{g(2x)^{2x}} = 3 \tag{5.2}$$

It is possible to create a recurrence relation over some values of $g(x)$. This is accomplished using the following modified form of (5.2):

$$
\begin{aligned}
\frac{g(x)^x}{g(2x)^{2x}} &= 3 \\
x\log(g(x)) - 2x\log(g(2x)) &= \log(3) \\
\log\!\left(\frac{g(x)}{g(2x)^2}\right) &= \frac{\log(3)}{x} = \log\!\left(3^{x^{-1}}\right) \\
\frac{g(x)}{g(2x)^2} &= 3^{x^{-1}} \\
g(2x) &= \sqrt{g(x)\,3^{-x^{-1}}}
\end{aligned}
$$

If it is assumed that $g(1) = 1$ then the first six terms of the recurrence are as follows:

$$
\begin{aligned}
g(1) &= 1 \\
g(2) &= \sqrt{g(1)\,3^{-1}} = 3^{-1/2} \\
g(4) &= \sqrt{g(2)\,3^{-1/2}} = 3^{-1/2} \\
g(8) &= \sqrt{g(4)\,3^{-1/4}} = 3^{-3/8} \\
g(16) &= \sqrt{g(8)\,3^{-1/8}} = 3^{-4/16} \\
g(32) &= \sqrt{g(16)\,3^{-1/16}} = 3^{-5/32}
\end{aligned}
$$

The last three terms in this list show that the numerator of the exponent is the log base two of the input and the denominator of the exponent is the input. This suggests that the function $3^{-\log_2(x)/x}$ may work for $g(x)$, which suggests that $f(x) = g(x)^x = 3^{-\log_2(x)}$.

To verify that $f(x) = 3^{-\log_2(x)}$ has the desired property this function is tested in (5.1):

$$
\begin{aligned}
\frac{3^{-\log_2(x)}}{3^{-\log_2(2x)}} &= r \\
\log_2(x^{-1}) - \log_2((2x)^{-1}) &= \log_3(r) \\
\log_2(2) &= \log_3(r) \\
3 &= r
\end{aligned}
$$

The last line of the algebra shows that $f(x)$ has the desired property. The form of this function suggests that it is possible to generate functions such that $f(x)/f(\alpha x) = \beta$, where $\alpha, \beta > 1$, by using the function $\beta^{-\log_\alpha(x)}$. Working through the same steps that were performed for $f(x)$ in the previous paragraph, it follows that

$$
\begin{aligned}
\frac{\beta^{-\log_\alpha(x)}}{\beta^{-\log_\alpha(\alpha x)}} &= r \\
\log_\alpha(x^{-1}) - \log_\alpha((\alpha x)^{-1}) &= \log_\beta(r) \\
\log_\alpha(\alpha) &= \log_\beta(r) \\
\beta &= r
\end{aligned}
$$

This algebra shows that this class of functions has the expected property.
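The common-ratio property and the recurrence for $g$ can both be checked numerically. The following sketch assumes the reconstructed forms $f(x) = \beta^{-\log_\alpha(x)}$ and $g(2x) = \sqrt{g(x)\,3^{-1/x}}$ with $g(1) = 1$; the function names are illustrative.

```python
import math

def common_ratio_fn(alpha, beta):
    """Return f(x) = beta ** (-log_alpha(x)) for x > 0, so that f(x) / f(alpha*x) = beta."""
    def f(x):
        return beta ** (-math.log(x, alpha))
    return f

def g_recurrence(n_terms):
    """Terms (x, g(x)) for x = 1, 2, 4, ... from g(2x) = sqrt(g(x) * 3**(-1/x)), g(1) = 1."""
    x, g = 1.0, 1.0
    terms = [(x, g)]
    for _ in range(n_terms - 1):
        g = math.sqrt(g * 3.0 ** (-1.0 / x))  # value at 2x from the value at x
        x *= 2.0
        terms.append((x, g))
    return terms

if __name__ == "__main__":
    f = common_ratio_fn(2.0, 3.0)
    for x in (1.0, 2.5, 10.0, 400.0):
        print(x, f(x) / f(2.0 * x))
    for x, g in g_recurrence(6):
        # compare the recurrence with the closed form 3 ** (-log2(x) / x)
        print(x, g, 3.0 ** (-math.log2(x) / x))
```

Because $\beta^{-\log_\alpha(x)} = x^{-\log_\alpha(\beta)}$, these functions are power laws in $x$, which is consistent with the heavy right tails described at the start of this chapter.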

5.2 Application of Functions that Decrease by a Common Ratio to the Right Tail

Attempting to fit the right tails of the sample pdfs immediately yields encouraging results. Fitting a shifted version of the function $f(x) = 3^{-\log_2(x)}$ to the right tail of the sample pdf histogram from T01 tumor.cel shows that the shifted version of $f(x)$ and the right tail of the sample pdf histogram are very similar (Figure 5.1). It seems likely that the pdf for the sample data approaches a function that decreases by a common ratio.

Figure 5.1: Plot of the right tail of the sample pdf histogram for the PM probe data from T01 tumor.cel fitted to a shifted version of $f(x) = 3^{-\log_2(x)}$.

Comparing the right tail of an EMG pdf to $f(x)$ shows that these two functions are very different. Both functions are concave up and strictly decreasing; however, their rates of decrease are very different. By definition the ratio $f(x)/f(2x) = 3$ is constant with respect to $x$. For an EMG distribution this ratio is not constant with respect to $x$ and can be very different from 3 depending on the value of $x$. As an example, when $\mu = 0$, $\sigma = 1$, $\lambda = 1$, and $x = 10$ the ratio for the right tail of an EMG distribution is approximately 22,000 (close to $e^{10}$). In general, the right tail of an EMG pdf converges to zero much more quickly than does $f(x)$.
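The doubling ratio $f(x)/f(2x)$ for an EMG density can be evaluated directly for comparison. The sketch below assumes the standard EMG density (a Gaussian with mean $\mu$ and standard deviation $\sigma$ plus an independent exponential with rate $\lambda$); the function names are illustrative and the parameter values are those of the example in the text.

```python
import math

def emg_pdf(x, mu, sigma, lam):
    """Standard EMG density: Normal(mu, sigma) plus an independent Exponential(rate=lam)."""
    z = (x - mu) / sigma - lam * sigma
    normal_cdf = 0.5 * math.erfc(-z / math.sqrt(2.0))
    return lam * math.exp(lam * (mu - x) + 0.5 * (lam * sigma) ** 2) * normal_cdf

def doubling_ratio(pdf, x):
    """Ratio pdf(x) / pdf(2x), the quantity held constant by the common-ratio functions."""
    return pdf(x) / pdf(2.0 * x)

def power_law_tail(x):
    """f(x) = 3 ** (-log2(x)), whose doubling ratio is 3 for every x > 0."""
    return 3.0 ** (-math.log2(x))

if __name__ == "__main__":
    emg = lambda x: emg_pdf(x, mu=0.0, sigma=1.0, lam=1.0)
    for x in (2.0, 5.0, 10.0):
        print(x, doubling_ratio(emg, x), doubling_ratio(power_law_tail, x))
```

Far into the right tail the Gaussian cdf factor is essentially one, so the EMG doubling ratio grows like $e^{\lambda x}$, whereas the common-ratio function keeps the ratio fixed.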

6. Practical Implementation of EMG Parameter Estimation Method and Properties

In Section 3.3 it was shown that once the value of the variable $k$ is known, it is possible to estimate the parameters of an EMG distribution using two sample quantiles. Using (3.6) it is possible to estimate $k$ by replacing $EMG_m(k)$ with the sample quantile of the sample mean. Combining these two results constitutes a parameter estimation method, which is given by

1. Estimate $k$ with $k_e$, where $k_e$ is calculated by replacing the left-hand side of the following equation with the sample quantile of the sample mean and solving for $k_e$:

$$EMG_m(k_e) = \tfrac{1}{2}\left[1 - e^{k_e^2/2 - 1}\operatorname{erfc}\!\left(\tfrac{k_e - k_e^{-1}}{\sqrt{2}}\right) + \operatorname{erf}\!\left(\tfrac{1}{\sqrt{2}\,k_e}\right)\right]$$

2. Determine $\mu$ from the quantile determined by the right-hand side of

$$EMG_\mu(k_e) = \tfrac{1}{2}\left[1 - e^{k_e^2/2}\operatorname{erfc}\!\left(\tfrac{k_e}{\sqrt{2}}\right)\right]$$

3. Determine $m_l = \mu + \lambda^{-1}$ from the quantile determined by the right-hand side of

$$EMG_m(k_e) = \tfrac{1}{2}\left[1 - e^{k_e^2/2 - 1}\operatorname{erfc}\!\left(\tfrac{k_e - k_e^{-1}}{\sqrt{2}}\right) + \operatorname{erf}\!\left(\tfrac{1}{\sqrt{2}\,k_e}\right)\right]$$

4. Determine $m_s = \mu + \sigma$ from the quantile determined by the right-hand side of

$$EMG_c(0, 1, k_e) = \tfrac{1}{2}\left[1 - e^{k_e^2/2 - k_e}\operatorname{erfc}\!\left(\tfrac{k_e - 1}{\sqrt{2}}\right) + \operatorname{erf}\!\left(\tfrac{1}{\sqrt{2}}\right)\right]$$

5. Estimate $\lambda$ by subtracting the estimate of $\mu$ from $m_l$ and then taking the multiplicative inverse of the result.

6. Estimate $\sigma$ by subtracting the estimate of $\mu$ from $m_s$.

Although this method will work in theory, several modifications are needed to make it practical. It will be shown that several slight modifications to the parameter estimation procedure described above result in a practical implementation. The final implementation is consistent and always returns valid parameter estimates, where valid parameter estimates are those that satisfy all constraints on the original parameters (such as $\sigma > 0$). This new parameter estimation method is then compared to other parameter estimation methods for the EMG distribution from the literature. It is found that the new parameter estimation method has several advantages over other currently available methods.

6.1 Proof of Consistency

It is proved that the new parameter estimation method as introduced at the beginning of this chapter is consistent. This proof also applies to the final implementation, as the modifications made in no way affect consistency.

Theorem. The parameter estimation method introduced is consistent.

Proof. To prove the theorem it will first be proved that the estimate for $k$ is consistent. Applying the same techniques used to show that the estimate for $k$ is consistent, it can easily be shown that the consistency of the parameter estimates follows from the consistency of the estimate for $k$. To show that the estimate for $k$ is consistent it will be shown that the sample quantile of the sample mean is a consistent estimate for the quantile of the mean. Given the continuity of the EMG cdf it will then follow that the estimate for $k$ is consistent.
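The six-step procedure listed at the start of this chapter can be sketched in code as follows. This is only an illustration of the unmodified method under several assumptions that are not taken from the thesis: $k = \lambda\sigma$, a plain bisection search for $k_e$ on a fixed bracket (relying on $EMG_m(k)$ decreasing in $k$ on that bracket), a simple order-statistic definition of the sample quantile, and synthetic test data. All function names are hypothetical, and none of the practical safeguards discussed later in the chapter (for example against a non-positive $m_l - \mu$) are included.

```python
import math
import random
from bisect import bisect_right

SQRT2 = math.sqrt(2.0)

def emg_std_cdf(D, k):
    """EMG cdf at mu + D*sigma, written in terms of k = lambda*sigma (assumed)."""
    return 0.5 * (1.0
                  - math.exp(0.5 * k * k - D * k) * math.erfc((k - D) / SQRT2)
                  + math.erf(D / SQRT2))

def solve_k(q_mean, lo=0.05, hi=10.0, iters=80):
    """Step 1: bisection for k_e such that EMG_m(k_e) equals the given quantile level."""
    g = lambda k: emg_std_cdf(1.0 / k, k)   # quantile of the mean, D = 1/k
    q = min(max(q_mean, g(hi)), g(lo))      # clamp to the range reachable on the bracket
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) > q:                      # g decreases in k, so the root is to the right
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def sample_quantile(sorted_xs, p):
    """Order-statistic sample quantile (one simple convention among several)."""
    i = min(len(sorted_xs) - 1, max(0, int(p * len(sorted_xs))))
    return sorted_xs[i]

def estimate_emg(sample):
    """Steps 1 to 6 of the unmodified method; returns (mu, sigma, lam, k_e)."""
    xs = sorted(sample)
    n = len(xs)
    mean = sum(xs) / n
    q_mean = bisect_right(xs, mean) / n                      # sample quantile of the sample mean
    k_e = solve_k(q_mean)                                    # step 1
    mu = sample_quantile(xs, emg_std_cdf(0.0, k_e))          # step 2
    m_l = sample_quantile(xs, emg_std_cdf(1.0 / k_e, k_e))   # step 3: mu + 1/lambda
    m_s = sample_quantile(xs, emg_std_cdf(1.0, k_e))         # step 4: mu + sigma
    lam = 1.0 / (m_l - mu)                                   # step 5
    sigma = m_s - mu                                         # step 6
    return mu, sigma, lam, k_e

if __name__ == "__main__":
    random.seed(0)
    # Synthetic EMG data with mu = 0, sigma = 1, lambda = 1 (illustrative values).
    data = [random.gauss(0.0, 1.0) + random.expovariate(1.0) for _ in range(100_000)]
    print(estimate_emg(data))
```

Bisection was chosen here only because it needs no derivative and cannot leave the bracket; any one-dimensional root finder could be substituted in step 1.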
