JOURNAL OF COMPUTATIONAL BIOLOGY, Volume 18, Number 8, 2011, Pp. 941-954. © Mary Ann Liebert, Inc.

On the Inference of Dirichlet Mixture Priors for Protein Sequence Comparison

XUGANG YE, YI-KUO YU, and STEPHEN F. ALTSCHUL

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland

ABSTRACT

Dirichlet mixtures provide an elegant formalism for constructing and evaluating protein multiple sequence alignments. Their use requires the inference of Dirichlet mixture priors from curated sets of accurately aligned sequences. This article addresses two questions relevant to such inference: of how many components should a Dirichlet mixture consist, and how may a maximum-likelihood mixture be derived from a given data set. To apply the Minimum Description Length principle to the first question, we extend an analytic formula for the complexity of a Dirichlet model to Dirichlet mixtures by informal argument. We apply a Gibbs-sampling based approach to the second question. Using artificial data generated by a Dirichlet mixture, we demonstrate that our methods are able to approximate well the true theory, when it exists. We apply our methods as well to real data, and infer Dirichlet mixtures that describe the data better than does a mixture derived using previous approaches.

Key words: algorithms, combinatorics, linear programming, machine learning, statistics.

1. INTRODUCTION

Multiple protein sequence alignment is a central problem in computational molecular biology. A powerful and elegant formalism underpinning one approach to multiple sequence alignment is that of Dirichlet mixtures (Brown et al., 1993; Sjölander et al., 1996; Altschul et al., 2010). Intuitively, it is assumed that positions within a protein can be thought of as falling into a small number of classes. Each such class may be described by its frequency within proteins, by a set of typical amino acid probabilities for protein positions belonging to the class, and by how far the probabilities associated with a specific position tend to diverge from these typical values. A Dirichlet mixture (DM) captures these notions formally.

There is no plausible way to construct from first principles a DM appropriate for protein sequence comparison, and DMs for this purpose are therefore derived from curated sets of protein multiple alignments, generally by seeking a maximum-likelihood DM (Brown et al., 1993; Sjölander et al., 1996). An immediate problem that arises is how many classes or components such a DM should have, but no systematic guidance for answering this question has been published. As is generally the case in model selection, the more components and therefore parameters a DM is allowed to have, the better it will be able to describe a given set of data, but a DM with too many components will tend to overfit the data.

The Minimum Description Length (MDL) principle (Grünwald, 2007) provides a way to deal with this problem. It implies that one should seek to minimize the description length of the data given the maximum-likelihood M-component DM, plus a measure of the complexity of the set of all M-component DMs. To apply the principle, we require both a method for finding maximum-likelihood DMs, and a formula for the complexity of a Dirichlet mixture model.

This article is organized as follows. First, we review the essentials of multinomial, Dirichlet, and Dirichlet mixture distributions, and of MDL theory. Second, we present heuristic arguments for extending to Dirichlet mixtures an analytic formula for the complexity of a single Dirichlet distribution model. Third, we present a Gibbs-sampling algorithm for seeking a maximum-likelihood DM to describe a given set of multiple alignment data. Other algorithmic approaches to this problem have been developed previously (Brown et al., 1993; Sjölander et al., 1996). Fourth, we apply our complexity formula and optimization algorithm to artificial data generated by a known DM, to study whether our method allows us to recover the DM's number of components, and to estimate its parameters accurately. Finally, we apply our method to real data from protein multiple alignments, and compare our results to earlier ones.

2. REVIEW

2.1. Multinomial and Dirichlet distributions, and Dirichlet mixtures

Statistical approaches to protein sequence comparison analyze the probabilities p for the various amino acids to appear at particular protein positions. Bayesian approaches require the specification of prior probabilities over the space of all possible p, and for mathematical convenience these priors are almost always taken to be Dirichlet distributions or Dirichlet mixtures. In this section, we review briefly the relevant mathematical concepts.

A multinomial distribution over an alphabet of L letters is described by an L-component vector of positive probabilities p, with Σ_{j=1}^{L} p_j = 1. Because of the constraint, the space Ω of all possible multinomials is (L - 1)-dimensional. One may imagine the probabilities for the various amino acids appearing at a particular protein position as described by a particular multinomial.

Bayesian statistics require the specification of a prior probability density over the multinomial space Ω. Such a prior should capture as well as possible one's general knowledge concerning proteins. For example, multinomials in which all and only the hydrophobic residues have high probabilities may be favored, whereas multinomials in which both hydrophobic and charged residues have high probabilities may be disfavored. Although a prior distribution θ may take any form one wishes, for analytic and computational convenience it is best to require that θ be a Dirichlet distribution or a Dirichlet mixture.

A Dirichlet distribution is a probability density over Ω. A particular Dirichlet distribution may be specified by an L-dimensional parameter vector α, with all α_j positive, and it is convenient to define α_* ≡ Σ_{j=1}^{L} α_j. The density of this distribution at x is defined as

    f(\vec{x}) = Z \prod_{j=1}^{L} x_j^{\alpha_j - 1},    (1)

where the normalizing scalar Z = Γ(α_*) / Π_{j=1}^{L} Γ(α_j) ensures that integrating f over its domain Ω yields 1. The special case of all α_j = 1 corresponds to the uniform density.
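As a concrete illustration of eq. (1), the log density may be evaluated as follows (a minimal sketch in Python, using numpy and scipy's log-gamma function; the function name is ours and the code is purely illustrative):

    import numpy as np
    from scipy.special import gammaln

    def dirichlet_log_density(x, alpha):
        """Natural log of eq. (1): log f(x) = log Z + sum_j (alpha_j - 1) log x_j,
        with log Z = log Gamma(alpha_*) - sum_j log Gamma(alpha_j)."""
        x = np.asarray(x, dtype=float)
        alpha = np.asarray(alpha, dtype=float)
        log_Z = gammaln(alpha.sum()) - gammaln(alpha).sum()
        return log_Z + np.sum((alpha - 1.0) * np.log(x))

    # All alpha_j = 1 gives the uniform density; over the 3-letter simplex it equals 2,
    # so the log density is ln 2 = 0.693...
    print(dirichlet_log_density([0.2, 0.3, 0.5], [1.0, 1.0, 1.0]))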
One may show that the expected value of x is q = α/α_*, and it is sometimes convenient to specify a Dirichlet distribution using the alternative parametrization (q, α_*). Because only L - 1 of the components of q are independent, there are still L free parameters. The location parameters q specify the center of mass of the distribution, while the concentration parameter α_* specifies how tightly the probability density is concentrated near q. Large values of α_* correspond to densities very concentrated near q, whereas values of α_* near 0 correspond to densities concentrated near the boundaries of Ω.
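The effect of the concentration parameter is easy to see by sampling. The following sketch (ours; it assumes numpy) converts between the two parametrizations and draws samples at a small and a large value of α_*, showing how tightly they cluster around q:

    import numpy as np

    def to_alpha(q, alpha_star):
        """(q, alpha_*) -> alpha: the Dirichlet parameter vector is alpha_* times q."""
        return alpha_star * np.asarray(q, dtype=float)

    def to_q_alpha_star(alpha):
        """alpha -> (q, alpha_*): location q = alpha/alpha_*, concentration alpha_* = sum_j alpha_j."""
        alpha = np.asarray(alpha, dtype=float)
        alpha_star = alpha.sum()
        return alpha / alpha_star, alpha_star

    rng = np.random.default_rng(0)
    q = np.array([0.5, 0.3, 0.2])
    for alpha_star in (0.5, 50.0):
        samples = rng.dirichlet(to_alpha(q, alpha_star), size=5000)
        # The mean deviation from q shrinks as alpha_* grows; small alpha_* pushes
        # probability mass toward the boundaries of the simplex.
        print(alpha_star, np.abs(samples - q).mean())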

Different protein positions are under different physical constraints and selective pressures, but it may be imagined that most positions fall into one of several broad categories, each with its own amino acid preferences. This implies that multiple distinct regions of multinomial space should have high prior probabilities. A single Dirichlet distribution in general cannot model such a density, but a Dirichlet mixture (DM) can. A DM is a probability density over Ω that is a simple linear combination of a finite number M of Dirichlet distributions, each called a Dirichlet component. Formally, it is defined by M Dirichlet distributions with respective parameters α_i, and a set of M positive mixture parameters m that sum to 1. Because of this last constraint, a DM has ML + M - 1 free parameters. Defining α_{i*} ≡ Σ_{j=1}^{L} α_{i,j}, the center of mass of a DM is simply

    \sum_{i=1}^{M} m_i \frac{\vec{\alpha}_i}{\alpha_{i*}} = \sum_{i=1}^{M} m_i \vec{q}_i.

For Bayesian analysis, the signal advantage of using a Dirichlet prior θ over Ω is that after the observation of a single letter a, the posterior is also a Dirichlet distribution θ', with parameters identical to those of θ, except that α'_a = α_a + 1. Similarly, if the prior is a DM, the posterior is also a DM, with parameters that are easy to calculate (Brown et al., 1993; Sjölander et al., 1996; Altschul et al., 2010). The class of DMs is rich enough to model well a broad range of prior beliefs.

2.2. Issues in inferring Dirichlet mixture priors

Because there is no plausible way to infer a DM appropriate for protein sequence comparison from theory alone, DMs for this purpose are derived from curated sets of protein multiple alignments. Intuitively, by examining a gold standard set of multiple alignment data, constructed using structural or other considerations, one may develop general knowledge about which multinomials tend best to describe protein alignment columns. Formally, after selecting the number M of components comprising the DM, one seeks the maximum-likelihood DM, i.e., that which assigns the multiple alignment data the greatest probability (Brown et al., 1993; Sjölander et al., 1996). Neither step of this procedure is trivial, and the two may be seen as interrelated. This article studies ways in which each of these two steps may be accomplished.

In general, given a variety of related models with which to describe a set of data, the model with the greatest number of parameters will fit the data best. Too many parameters, however, can result in overfitting, i.e., modeling the noise within the data rather than the regularities, which can lead to poor predictions on new data. To avoid this problem, a useful criterion for model selection is the MDL principle. Below, we will describe in detail how the MDL principle may be applied to selecting the number of Dirichlet components.

Given a specific number M of components, the question of how to find a maximum-likelihood DM remains. Taking the likelihood of the data as an objective function, the central problem is that, over the space of M-component DMs, this objective function has many local maxima. This classic optimization problem has no known rigorous solution, but there are a variety of fruitful heuristic approaches. Among these is Gibbs sampling, whose application to DM optimization we describe below.

2.3. The minimum description length principle

The MDL principle (Grünwald, 2007) addresses the question of which among several models to choose for describing a set of data. It proposes that the best model is the one which minimizes the description length of the data given the model, plus the description length of the model. Formalizing these concepts is not trivial (Grünwald, 2007), and for brevity we will confine our review of the MDL approach to those elements relevant to the problem at hand. We take a theory θ to specify a probability distribution P_θ over the space of all possible sets of data.
The description length (in bits) of a set of data D given θ is then defined as DL(D|θ) = -log_2[P_θ(D)]. (Throughout this article, we assume logs to be base 2; when natural logarithms are needed, we use the notation ln.) A model ℳ is a parametrized set of theories, and the description length of D given ℳ is defined as DL(D|ℳ) = inf_{θ∈ℳ} DL(D|θ). For a nested set of models ℳ_1, ℳ_2, ..., such as all DMs with 1, 2, etc. components, it is evident that a more comprehensive model will never imply a longer description length for a set of data.

The most challenging aspect of MDL theory is the definition and calculation of the description length, also called the complexity, of a model. The complexity of a parametrized model may be thought of as the log of the effective number of independent theories it contains (Grünwald, 2007), and this number is dependent on the quantity of data that the model is asked to describe. In brief, assume our data consist of n independent samples drawn from a probability distribution parametrized by θ, where θ lies in the k-dimensional space Θ. Then, given certain reasonable assumptions, we have that, for large n, the complexity of ℳ is given by

    COMP(\mathcal{M}, n) = \frac{k}{2} \log n + A + o(1),    (2)

where A is a constant dependent on ℳ (Grünwald, 2007). The mathematics are too complex to compute A in eq. (2) for Dirichlet mixtures. However, as described by Yu and Altschul (2011), the simpler case of a single Dirichlet distribution is tractable. Heuristic arguments then allow us to derive a plausible formula for the complexity of Dirichlet mixtures from this simpler case, and that of the multinomial distribution. This formula is the tool we require to specify the optimal number of DM components for describing a set of multiple alignment data.

3. MODEL COMPLEXITY

3.1. The single Dirichlet model

A set of observed data is not drawn directly from a Dirichlet distribution, but is mediated through a multinomial. Specifically, to say that the data in a multiple alignment are described by a Dirichlet distribution is shorthand for saying that, for each column of the alignment, a multinomial p is sampled from the Dirichlet distribution, and the data in the column are then sampled according to p. The model D_L in question is the set of all Dirichlet distributions over the alphabet of L letters, applied to the description of multiple alignment data consisting of n independent columns, with each column containing c letters. This model is L-dimensional, and the constant A in eq. (2) depends on L and c. As described by Yu and Altschul (2011), the complexity of D_L is given by

    COMP(D_L, n, c) = \frac{L}{2} \log n + \frac{L-1}{2} \log\frac{c}{2} - \log\Gamma(L/2) - \frac{1}{2}\log(L-1) + \Delta_{L,c} + o(1),    (3)

where Δ_{L,c} is a small calculable constant which approaches 0 for large c. For proteins, Δ_{20,c} is always less than 0.3 bits, and its values for c ranging from 2 to 500 are given in Table 1. In practice, one's data frequently consist of n columns in which c varies by column. In this case, it is appropriate to extend eq. (3) by using c̄, the mean number of observations per column, in place of c (Yu and Altschul, 2011).

Table 1. Δ_{20,c} for the Single Dirichlet Model

(Columns: c and Δ_{20,c} in bits, for values of c from 2 to 500.)

Values are calculated as described in Yu and Altschul (2011).
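For later use, eq. (3) can be transcribed directly into code (a sketch, ours; it assumes the form of eq. (3) as given above, works in bits, and takes the small constant Δ_{L,c} as an argument, defaulting to 0):

    import math

    def comp_dirichlet(L, n, c, delta=0.0):
        """Complexity in bits of the single-Dirichlet model D_L, eq. (3); logs are base 2.
        delta stands for the small constant Delta_{L,c} tabulated in Table 1."""
        log2 = math.log(2.0)
        return ((L / 2.0) * math.log2(n)
                + ((L - 1) / 2.0) * math.log2(c / 2.0)
                - math.lgamma(L / 2.0) / log2
                - 0.5 * math.log2(L - 1)
                + delta)

    # For protein alphabets (L = 20), e.g. 2000 columns with 20 observations each:
    print(comp_dirichlet(20, 2000, 20))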

3.2. The Dirichlet mixture model

It is challenging to analyze the complexity of a single Dirichlet distribution (Yu and Altschul, 2011), and we see no feasible way to analyze rigorously the complexity of a Dirichlet mixture. However, it is possible to use informal arguments to plausibly approximate this complexity.

We consider DM_{M,L}, the Dirichlet mixture model with M components, on an alphabet of L letters. The data DM_{M,L} describes is the same as that considered above, namely a multiple alignment with n columns, and c observations in each column. A DM can be viewed as using all its components to describe a particular column of data, but with varying component probabilities, which reflect how well the various components respectively fit the data. Given a DM that describes a set of data well, for most columns the component probabilities are concentrated on a single component. In order to gain a handle on the complexity of a DM model, we therefore make the approximation that each column can be assigned to a single component. This allows us to separate the complexity of a DM model into the complexity of its mixture parameters and the complexity of each of its Dirichlet components. Our approximation necessarily counts some theories as distinct that might not be distinguishable by the data, and therefore should overestimate the complexity of a DM model.

To start, the M - 1 mixture parameters of DM_{M,L} can be viewed as describing a multinomial distribution over an alphabet of M letters, where each letter represents a particular Dirichlet component. Thus, by our approximation, the complexity of the mixture parameters can be seen as equivalent to that of a multinomial model on M letters (M_M), applied to n pieces of data. As described, for example, in the supplementary material of Altschul et al. (2009), for large n the complexity of such a multinomial model is given by

    COMP(M_M, n) = \frac{M-1}{2} \log\frac{n}{2} + \frac{1}{2}\log\pi - \log\Gamma(M/2) + o(1).    (4)

Second, we calculate the complexity of each of the M Dirichlet components using the theory reviewed above, but modified so that each component is assumed to describe only a subset of the columns. Because the logarithm is a concave function, the aggregate complexity of the Dirichlet components is maximized by assuming the columns are divided evenly among them. Finally, we note that the labels attached to the M Dirichlet components are purely arbitrary, and that a permutation of the labels results in an identical DM. To compensate for this overcounting of distinct theories, we need to subtract log(M!) from our assessment of the complexity of a DM model. Putting the various pieces together, we approximate this complexity with the formula

    COMP(DM_{M,L}, n, c) \approx COMP(M_M, n) + M \cdot COMP(D_L, n/M, c) - \log(M!).    (5)

Given a set of multiple alignment data, Dirichlet mixture models with increasing numbers of components will lead to shorter data description lengths. Eq. (5) and the MDL principle allow us to calculate at what point these decreasing description lengths no longer compensate sufficiently for the increasing complexity of the models.
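Eqs. (4) and (5) are likewise straightforward to compute (a sketch, ours, that reuses the comp_dirichlet function from the earlier sketch; all logarithms are base 2, as in the text):

    import math

    def comp_multinomial(M, n):
        """Complexity in bits of the multinomial model M_M on M letters, eq. (4)."""
        return (((M - 1) / 2.0) * math.log2(n / 2.0)
                + 0.5 * math.log2(math.pi)
                - math.lgamma(M / 2.0) / math.log(2.0))

    def comp_dm(M, L, n, c, delta=0.0):
        """Approximate complexity in bits of the M-component Dirichlet mixture model, eq. (5):
        mixture-parameter complexity plus M Dirichlet components (each seeing n/M columns),
        minus log2(M!) for the arbitrary labeling of the components."""
        return (comp_multinomial(M, n)
                + M * comp_dirichlet(L, n / M, c, delta)
                - math.log2(math.factorial(M)))

For model selection, comp_dm(M, 20, n, c) is added to the data description length DL(D | θ_M) of eq. (6) below.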
4. INFERRING MAXIMUM-LIKELIHOOD DIRICHLET MIXTURES

4.1. A Gibbs-sampling strategy

As important as calculating the complexity of a Dirichlet mixture model is finding the specific mixture θ contained by the model that minimizes the description length of a given set of data. Formally, assume the M-component DM θ has mixture parameters m and Dirichlet parameters α_i, which are alternatively expressed, as described above, by (q_i, α_{i*}). Assume further that the data D consist of n independent columns, with letter counts c_*^{(1)}, c_*^{(2)}, ..., c_*^{(n)}, and with letter count vectors c^{(1)}, c^{(2)}, ..., c^{(n)}. Then the description length of D given θ is

    DL(D|\theta) = -\sum_{k=1}^{n} \log \sum_{i=1}^{M} m_i \, p_i^{(k)},    (6)

where p_i^{(k)} is the probability of a specific column with count vector c^{(k)} given the ith Dirichlet component, and can be written as

    p_i^{(k)} = \frac{\Gamma(\alpha_{i*})}{\Gamma(\alpha_{i*} + c_*^{(k)})} \prod_{j=1}^{L} \frac{\Gamma(\alpha_{i,j} + c_j^{(k)})}{\Gamma(\alpha_{i,j})}    (7)

(Sjölander et al., 1996; Altschul et al., 2010).
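In practice, eqs. (6) and (7) are evaluated on the log scale with the log-gamma function, since the gamma ratios overflow for realistic counts. A vectorized sketch (ours; counts is an n × L array of column letter counts, mix the mixture parameters m, and alphas an M × L array of Dirichlet parameters):

    import numpy as np
    from scipy.special import gammaln, logsumexp

    def column_log_probs(counts, alphas):
        """Natural log of p_i^(k) of eq. (7) for every column k and component i; returns an (n, M) array."""
        counts = np.asarray(counts, dtype=float)          # shape (n, L)
        alphas = np.asarray(alphas, dtype=float)          # shape (M, L)
        a_star = alphas.sum(axis=1)                       # alpha_{i*}
        c_star = counts.sum(axis=1)                       # c_*^(k)
        term1 = gammaln(a_star)[None, :] - gammaln(a_star[None, :] + c_star[:, None])
        term2 = (gammaln(counts[:, None, :] + alphas[None, :, :]) -
                 gammaln(alphas)[None, :, :]).sum(axis=2)
        return term1 + term2

    def description_length(counts, mix, alphas):
        """DL(D | theta) of eq. (6), in bits."""
        logp = column_log_probs(counts, alphas)
        col_log_like = logsumexp(logp + np.log(mix)[None, :], axis=1)
        return -col_log_like.sum() / np.log(2.0)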

We seek the M-component DM θ that minimizes DL(D|θ), i.e., the maximum-likelihood DM. Unfortunately, this optimization problem is both high-dimensional and nonconcave, and its rigorous solution is therefore likely to be intractable. Nevertheless, heuristic optimization procedures are available, and an expectation-maximization (EM) approach has been described by Sjölander et al. (1996). We here propose an alternative optimization procedure that we will argue has certain advantages over the earlier one.

Our basic approach is to use Gibbs sampling (Geman and Geman, 1984; Lawrence et al., 1993) to reduce the hard problem of simultaneously optimizing the parameters of a Dirichlet mixture to the much simpler one of separately optimizing the parameters of its constituent Dirichlet components. Specifically, we proceed as follows:

a. Start with an M-component DM θ. Calculate DL_best := DL(D|θ) using eqs. (6) and (7), and let θ_best := θ.
b. Create M empty bins, and then for each column k:
   i. Use m and eq. (7) to calculate a likelihood m_i p_i^{(k)} for each constituent component of θ.
   ii. Normalize these likelihoods, and use them to randomly sample column k into one of the M bins.
c. For each bin i, corresponding to an individual Dirichlet component, calculate parameters for a new DM θ' as follows:
   i. Calculate a new mixture parameter as m'_i := n_i/n, where n_i is the number of columns that have been sampled into bin i.
   ii. Calculate new location parameters as q'_{i,j} := C_{i,j}/C_i, where C_{i,j} is the aggregate count of letter j among all columns assigned to bin i, and C_i is the aggregate count of all letters in all columns assigned to bin i.
   iii. Calculate a new concentration parameter α'_{i*} using the maximum-likelihood procedure described in the Appendix.
d. Calculate DL' := DL(D|θ'). If DL' < DL_best, let DL_best := DL' and θ_best := θ'.
e. If more than R iterations have passed since θ_best was changed, stop. Otherwise, let θ := θ', and return to step b.

A notable feature of this Gibbs-sampling algorithm is that, after the assignment of columns to individual bins, the mixture and location parameters m and q_i are trivial to estimate. For each component, the estimation of the concentration parameter α_{i*} reduces to the simple one-dimensional optimization of a smooth function, as described in the Appendix. This stands in contrast to the multi-dimensional optimization required by each step of the EM algorithm described by Sjölander et al. (1996).
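Steps a-e can be sketched compactly as follows (our illustration, not the authors' implementation; it reuses column_log_probs and description_length from the previous sketch, and fit_concentration stands for the one-dimensional maximum-likelihood routine of the Appendix, sketched there):

    import numpy as np

    def gibbs_fit_dm(counts, M, fit_concentration, R=3, pseudocount=1e-3, seed=0):
        """One run of the Gibbs-sampling search of Section 4.1 (a sketch).
        counts: (n, L) array of column letter counts; fit_concentration(cols, q) returns alpha_*."""
        rng = np.random.default_rng(seed)
        n, L = counts.shape
        mix = np.full(M, 1.0 / M)                       # step a: flat initial theta
        alphas = np.ones((M, L))
        best = (description_length(counts, mix, alphas), mix.copy(), alphas.copy())
        stale = 0
        while stale <= R:                               # step e: stop after R stale iterations
            # step b: sample each column into a bin with probability proportional to m_i p_i^(k)
            logw = column_log_probs(counts, alphas) + np.log(mix)[None, :]
            w = np.exp(logw - logw.max(axis=1, keepdims=True))
            w /= w.sum(axis=1, keepdims=True)
            bins = np.array([rng.choice(M, p=row) for row in w])
            # step c: re-estimate mixture, location, and concentration parameters per bin
            for i in range(M):
                cols = counts[bins == i]
                mix[i] = max(len(cols), 1) / n          # simplification: floor to avoid empty bins
                C_ij = cols.sum(axis=0) + pseudocount   # pseudocounts keep q strictly positive
                q_i = C_ij / C_ij.sum()
                a_star = fit_concentration(cols, q_i) if len(cols) >= 2 else alphas[i].sum()
                alphas[i] = a_star * q_i
            mix /= mix.sum()
            # step d: keep the best theta seen so far
            dl = description_length(counts, mix, alphas)
            if dl < best[0]:
                best, stale = (dl, mix.copy(), alphas.copy()), 0
            else:
                stale += 1
        return best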
4.2. Refinements

This basic optimization procedure can be refined in several ways. It may be iterated multiple times using different random number seeds, and the best result from the multiple runs retained. When a relatively small number of columns are sampled into a given bin, it is possible that one or more letters may be completely missing from the sample. For this case, it is useful to employ pseudocounts, or some positive lower bound on the q'_{i,j}, so that the Dirichlet parameters remain valid, and so that the component in question retains the ability to describe columns that contain the missing letters. Also, if fewer than two columns are sampled into a bin, it is impossible to calculate α'_{i*} in the final part of step c. It can thus be useful to impose a minimum number of columns per bin, which if not achieved causes the program to halt, or else to remove the bin completely and proceed henceforth with a smaller number of Dirichlet components. Finally, as we discuss below, various strategies may be used to choose a promising θ with which to initialize the procedure.

The Gibbs-sampling algorithm above can be modified to produce a non-stochastic descent procedure. In place of sampling each column into a random bin in step b.ii, each column can be assigned fractional membership in each bin, proportional to the calculated likelihoods. Step c may then be generalized to calculate the parameters of θ' from non-unitary column counts. The procedure terminates when DL_best improves by less than a small, set value ε. In our implementation, we use Gibbs sampling as a first stage, and once it terminates move into a second, descent stage. We have found that this second stage produces negligible improvement when the data consist of a large number of columns, but can yield noticeably improved results for small data sets. Omitting the initial Gibbs-sampling stage, however, is not advisable, as the descent stage alone is liable to get trapped by local optima.

4.3. Initialization strategies

Gibbs sampling and related stochastic approaches such as simulated annealing (Metropolis et al., 1953; Kirkpatrick et al., 1983) are widely used in attempts to find the global optimum of objective functions that have many local optima. They have the virtue of frequently being able to escape local optima that may trap deterministic procedures. By default, our Gibbs-sampling routine begins with an essentially flat θ, i.e., one in which all m_i = 1/M and all α_{i,j} = 1. Remarkably, as we will see below, excellent results can be obtained using even this completely agnostic strategy.

It is often useful to initiate a Gibbs-sampling routine at a point one believes may be near the global optimum. Especially when one is exploring DMs with varying numbers of components, such initialization strategies can be of significant utility. For example, one can use a good M-component DM θ_M to seek an optimal (M + 1)-component DM by initiating the algorithm above with the α_i of θ_M plus a flat component, with trivial adjustments to the mixture parameters m. Conversely, given an (M + 1)-component DM θ_{M+1}, one can partition one's data into M + 1 bins as above, and then determine which two bins, when combined, yield the M-component θ_M that best describes the data. θ_M can then be used to initiate the search for an optimal M-component DM.

Even when using stochastic approaches, a cohort of related columns can become stuck in one bin when it would better be assigned to another, because the pull of the cohort will prevent individual members from migrating. In this situation, moving back and forth between dimensions as described above can be useful. In brief, when seeking an optimal (M + 1)-component DM, it is possible that the misplaced cohort will break away to populate the new bin. Then, using the new θ_{M+1} to create a θ_M with which to seed a new search for an optimal M-component DM, it may become apparent that the cohort belongs better with a set of columns other than that from which it broke away.

5. RESULTS

5.1. Simulated data

The research group at UCSC that originally proposed the DM formalism for multiple sequence analysis currently provides a variety of multiple alignment data sets and Dirichlet mixtures derived therefrom at their website. In order to test both the MDL principle and our Gibbs-sampling algorithm on protein-like data for which the true solution is known, we used the 9-component DM byst comp from this website (here called θ̃) to generate artificial data. Specifically, we constructed four artificial data sets of 10,000 or more columns, each with a different average column size c̄, equal to 10, 20, 40, and 80. For a column k in a given data set, we first randomly selected c_*^{(k)} > 1, the number of observations in that column, from the Poisson distribution with mean c̄. We then used θ̃ to select a random multinomial distribution over 20 letters, and generated c_*^{(k)} independent observations from this multinomial.
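The generation procedure just described may be sketched as follows (ours; mix and alphas stand for the mixture weights and the M × 20 array of Dirichlet parameters of the generating DM θ̃):

    import numpy as np

    def sample_columns(mix, alphas, n_columns, c_bar, seed=0):
        """Draw artificial alignment columns from a Dirichlet mixture, as in Section 5.1:
        pick a component by its mixture weight, draw a multinomial p from that Dirichlet,
        draw a column size c > 1 from Poisson(c_bar), and then draw c letters from p.
        Returns an (n_columns, L) array of letter counts."""
        rng = np.random.default_rng(seed)
        L = alphas.shape[1]
        counts = np.zeros((n_columns, L), dtype=int)
        for k in range(n_columns):
            i = rng.choice(len(mix), p=mix)
            p = rng.dirichlet(alphas[i])
            c = 0
            while c <= 1:                      # the text requires c_*^(k) > 1
                c = rng.poisson(c_bar)
            counts[k] = rng.multinomial(c, p)
        return counts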
For the first n columns of a given data set, we used the Gibbs-sampling algorithm described above, with flat initial θs, to seek maximum-likelihood DMs θ_M from models with M = 1 to M = 15 components. (We let R = 3 in step e, and retained the best result from ten runs using different random number seeds.) We then calculated the complexity of each model (eq. (5)), and the description length of the data (eqs. (6) and (7)) given θ_M. Finally, using the MDL principle, we chose M̂ to be the number of components for which the sum of these numbers was minimized.

The calculation of M̂ for a specific example, with c̄ = 20 and n = 2000, is illustrated in Table 2. As can be seen, the description length of the data decreases as the number of components grows, but this improvement eventually is outweighed by increases in the complexity of the models. In this case, the MDL principle yields M̂ = 9, but this result prevails only barely over M = 8.

For each data set, with n ranging from 50 to 10,000 in increments of 50, we calculated M̂ using the procedure described above; the results are shown in Figure 1. When there are very few columns, the best model is a simple one, with only one or two Dirichlet components. However, as the number of columns grows, so, on average, does M̂, until it settles first into the range 9 ± 1, and eventually becomes nearly certain to equal 9, the number of components actually used to generate the data. For each data set, we indicate in Figure 1 the lowest value n_0 for which M̂ = 9 ± 1 for all tested n ≥ n_0. The larger the average number of observations per column, the earlier this convergence tends to occur.
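The selection of M̂ thus amounts to a small search over M (a sketch that reuses the comp_dm and gibbs_fit_dm functions from the earlier sketches; it is illustrative only):

    def choose_M(counts, M_max=15, L=20, **gibbs_kwargs):
        """Return (M_hat, mix, alphas) minimizing COMP(DM_{M,L}, n, c_bar) + DL(D | theta_M)."""
        n = len(counts)
        c_bar = counts.sum() / n
        best = None
        for M in range(1, M_max + 1):
            dl, mix, alphas = gibbs_fit_dm(counts, M, **gibbs_kwargs)
            total = comp_dm(M, L, n, c_bar) + dl
            if best is None or total < best[0]:
                best = (total, M, mix, alphas)
        return best[1], best[2], best[3]

    # e.g. choose_M(counts, fit_concentration=fit_concentration), where fit_concentration
    # is the concentration-parameter routine sketched in the Appendix.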

Table 2. Calculation of M̂ Using the MDL Principle

(Columns: M; free parameters; COMP(DM) in bits; DL(D|θ_M) in bits; and their sum COMP(DM) + DL(D|θ_M) in bits.)

Model complexities are estimated using eq. (5) for DM models with M components, applied to a data set with 2000 columns and a mean of 20 observations per column. Data description lengths are for a particular data set, generated as described in the text, using a putative maximum-likelihood DM θ_M, estimated using the Gibbs-sampling algorithm described in the text. The MDL principle yields M̂ = 9, at which the total description length COMP(DM_{M,20}, 2000, 20) + DL(D|θ_M) is minimized. All description lengths are rounded to the nearest bit.

FIG. 1. The estimated number of Dirichlet components as a function of data size. The 9-component DM θ̃ was used to generate four data sets, with an average of c̄ observations per column, with c̄ = 10, 20, 40, and 80. Using our Gibbs-sampling algorithm applied to the first n columns of each data set (with n divisible by 50), and the MDL principle, we calculated M̂, the number of components in the optimal Dirichlet mixture model. For each data set, a vertical line indicates the smallest n_0 for which M̂ = 9 ± 1 for all tested values of n ≥ n_0.

Although the correct number of Dirichlet components can be recovered to good precision with even a relatively small number of columns, the recovery of accurate values for the parameters of the generating DM requires substantially more data. In Table 3, we compare the parameters recovered by the example considered in Table 2 to those of θ̃, used to generate the data.

Table 3. Comparison of Recovered to Generating DM Parameters

(Columns: component index i; the generating parameters m̃_i and α̃_{i*}; the ratios m̂_i/m̃_i and α̂_{i*}/α̃_{i*} for the c̄ = 20 and c̄ = 80 data sets; and the divergences JS(q̃_i, q̂_i) for both data sets.)

Using the nine-component DM θ̃, artificial data sets with (n = 2000, c̄ = 20) and (n = 100,000, c̄ = 80) were generated. For each data set, a maximum-likelihood 9-component DM θ̂ was estimated. The parameters of each component i of θ̃ are compared to the corresponding parameters in the nearest component of θ̂. JS(q̃_i, q̂_i) is the Jensen-Shannon divergence of the location parameter vectors q̃_i and q̂_i, where

    JS(\vec{r}, \vec{s}) = \frac{1}{2} \sum_j \left[ r_j \ln\frac{2 r_j}{r_j + s_j} + s_j \ln\frac{2 s_j}{r_j + s_j} \right].

We also consider the parameters recovered using a much larger data set, with n = 100,000 and c̄ = 80, values which are comparable to those for real protein data sets used to derive DMs, such as that studied below. Given the larger data set, we are able to estimate both the mixture parameters m_i and the concentration parameters α_{i*} on average to within 0.6%, and in no case to err by more than 1.5%. The smaller data set yields much larger errors, erring in the estimation of these parameters by over 15% more than half the time. In general, the smaller α_{i,j}, the smaller the absolute but the greater the relative error in estimating its value, a fact related to the development in Ye et al. (2010). Thus, rather than recording data for individual α_{i,j}, Table 3 summarizes the accuracy with which each component's location parameters q̃_i are estimated, using the measure of Jensen-Shannon divergence. For the smaller data set, three of the components yield substantially larger divergences, whereas for the larger data set the divergence is small for all but one component.

For the artificial data studied in this section, we made no attempt to improve the result for one value of M by using the best result for a neighboring value to select an initial θ. Averaged over 10 runs on an Intel Xeon 2.4-GHz E7440 CPU, using different random number seeds, our program took 1.7 seconds per run to converge for the (n = 2000, c̄ = 20) data set with M = 9, and 84 seconds for the (n = 100,000, c̄ = 80) data set with M = 9. A 9-component DM model provides 188 free parameters, and optimizing a function that yields local optima over a space of this size provides a significant challenge. Nevertheless, Table 3 suggests that when the observations are in fact generated by a DM contained in this model, our Gibbs-sampling algorithm is able to converge on the true solution given sufficient data.
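For reference, the Jensen-Shannon divergence used in Table 3 (and again in Figure 3 below) is simple to compute (a sketch, ours; it assumes strictly positive probability vectors):

    import numpy as np

    def js_divergence(r, s):
        """JS(r, s) = (1/2) sum_j [ r_j ln(2 r_j/(r_j+s_j)) + s_j ln(2 s_j/(r_j+s_j)) ]."""
        r = np.asarray(r, dtype=float)
        s = np.asarray(s, dtype=float)
        m = r + s
        return 0.5 * (np.sum(r * np.log(2.0 * r / m)) + np.sum(s * np.log(2.0 * s / m)))

    # Identical location-parameter vectors give a divergence of 0.
    print(js_divergence([0.5, 0.3, 0.2], [0.5, 0.3, 0.2]))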
5.2. Real data

To analyze real data, we consider the diverse-1216-uw data set, from the website cited above, here called D_UCSC, containing n = 314,585 columns, with an average of c̄ = 75.99 amino acids per column, and the 20-component DM dist.20comp, here called θ_UCSC, that was previously derived from this data set. Letting θ_0 be a multinomial distribution fit to the amino acid frequencies of D_UCSC, a baseline description of the data has total description length DL(D_UCSC|θ_0) + COMP(M_20, 314,585 × 75.99), in which the complexity term contributes 206 bits. Using θ_UCSC instead to describe the data, the total description length is reduced to DL(D_UCSC|θ_UCSC) + COMP(DM_{20,20}, 314,585, 75.99), in which the complexity term contributes 3461 bits. The improvement with respect to the baseline multinomial model is represented by a triangle in Figure 2.

We applied the optimization methods described above to approximate maximum-likelihood θ_M for these data. In contrast to our artificial data, real data are not produced by a true underlying DM, even though they can perhaps be well modeled by one. This means that as the quantity of available data grows, the complexity of the best model is likely to grow as well, albeit slowly, never converging to a true number of components.

FIG. 2. Decrease in total description length as a function of the number of Dirichlet components. Given the data set D_UCSC (with n = 314,585 and c̄ = 75.99), we used our Gibbs-sampling algorithm to estimate the maximum-likelihood M-component DM θ_M, for M from 2 to 38. The total description length was calculated as DL(D_UCSC|θ_M) + COMP(DM_{M,20}, n, c̄), and compared to the total description length given the multinomial model on 20 letters, M_20. The decrease in description length per amino acid of the data is plotted, and reaches its maximum at M̂ = 35. The triangle represents the decrease in description length yielded by the 20-component θ_UCSC, which was derived from D_UCSC.

Because there is likely no true DM describing the data, the objective function tends to be much flatter over parameter space than it is for artificial data, making the search for a globally optimal DM correspondingly harder. Perhaps as a result, we found that for a given M, flat initial θs often were outperformed by initial θs derived from the best current solution for adjacent values of M. Thus, our optimization strategy consisted first of finding a set of provisional solutions θ_M for the range of M explored, and then of improving these solutions through further runs in which initial θs were derived from neighboring provisional θ_M.

We graph in Figure 2 the improvement in total description length our best θ_M yield with respect to the baseline, for M from 2 to 38. θ_35 produced the greatest improvement. (On D_UCSC, an average single run for M = 35 took 512 seconds until convergence.) θ_16 yielded an improvement essentially equivalent to that of the 20-component θ_UCSC, and θ_20 yielded a somewhat greater improvement. Using the fssp select cols data set from the website above, we achieved comparable results and improvements with respect to the corresponding DM fournierfssp.20comp (data not shown).

As is evident from Figure 2, until M = 35, increasing the number of components yields a steady but generally diminishing improvement in total description length. Programs that employ DMs generally run more slowly the larger the number of components, which provides an independent reason for preferring smaller M. There are no points in Figure 2 where the implicit slope changes abruptly, but examination suggests that for this data set, θ_10, θ_14, θ_23, and θ_29, which yield improvements that are, respectively, 99.0%, 99.3%, 99.7%, and 99.9% of the optimal, may provide good tradeoffs between simplicity and accurate description of the data.

6. DISCUSSION

Although for real data no optimal solution is known with certainty, it is instructive to examine the Dirichlet components recovered, and to compare θ_M for different values of M. In Table 4, we summarize the components of our θ_10 and θ_14, ordered by the magnitude of their mixture parameters m_i. Similarly to the results of Brown et al. (1993) and Sjölander et al. (1996), most components can be seen to correspond to natural classes of amino acids. For example, component 9 of θ_10 and component 13 of θ_14 both favor the aromatic amino acids. Other components favor hydrophobic, positively charged, and negatively charged residues, and the special amino acids glycine and proline. It is also worth noting two other types of components.
First, components 1 of θ_10 and θ_14 both have very low values of α_*, indicating high probability density near the boundaries of multinomial space; the columns these components describe therefore consist heavily of single amino acids. As M grows further, a component of this type may break into multiple components, each with large α_* and corresponding to a single amino acid, similar in this way to components 8 and 10 of θ_10. Second, components 4 of θ_10 and θ_14 both have fairly high values of α_*, but do not strongly favor or disfavor any amino acid. These components describe protein positions that are probably not under strong evolutionary pressure, and accept all amino acids at roughly their background frequencies.

Table 4. Components of Dirichlet Mixtures θ_10 and θ_14

(Columns: component index i; m_i; α_{i*}; and the amino acids whose log ratio log(q_{i,j}/p_j) is > 1, > 0.5, or < -1.)

The components of θ_10 and θ_14 are ordered by decreasing value of their mixture parameters m_i. For each component, the log ratio of each amino acid's location parameter to its background frequency is calculated, and the amino acids are listed in decreasing order of log ratio if this ratio is > 1, > 0.5, or < -1.

In Figure 3, we compare θ_10's to θ_14's location parameters. Qualitatively, it is evident that as M grows, many Dirichlet components retain their essential character, as represented by the corresponding components in the first eight rows and columns of the figure. Other Dirichlet components split apart, as seen in the last two rows, which correspond to components 3 and 2 of θ_10. Finally, components of an almost completely new character can be born, as seen in the last two columns, which correspond to components 9 and 6 of θ_14.

7. CONCLUSION

In this article, we have studied several questions relevant to the inference of a Dirichlet mixture model from a gold standard set of protein alignment data. We sought to apply the MDL principle to the question of how many components a Dirichlet mixture should have. This required an evaluation of the complexity of DM models, and we accordingly developed heuristic arguments for extending to Dirichlet mixtures an analytic formula for the complexity of a single Dirichlet model. A second element needed for the application of the MDL principle is a method for approximating maximum-likelihood DMs from a given set of data. Although this problem has been addressed previously, we have described a new Gibbs-sampling approach that reduces the high-dimensional optimization problem to several tractable one-dimensional optimization problems.

FIG. 3. Comparison of location parameters for the Dirichlet mixtures θ_10 and θ_14. The location parameters q_i for each component of θ_10 are compared to those for each component of θ_14 using the measure of Jensen-Shannon divergence, described in the legend to Table 3. The indices of the components are those given in Table 4, and are reordered to make the relationships among components easier to read.

To test the efficacy of our methods, we have applied them to artificial data generated by a known Dirichlet mixture, and shown that they are able both to recover the correct number of Dirichlet components, and to converge effectively on the true DM parameters. Finally, we have applied our methods to real data, where they are able to recover DMs that describe the data more concisely than do DMs constructed using an earlier approach. It is hoped that the methods presented here will aid in the construction of improved DMs for the comparison of multiple protein sequences.

8. APPENDIX: MAXIMUM-LIKELIHOOD ESTIMATION OF THE CONCENTRATION PARAMETER

Our Gibbs-sampling approach provides us with a way to estimate the values of the mixture parameters m, and thereby to reduce the problem of finding a maximum-likelihood (m.l.) Dirichlet mixture to that of finding a m.l. Dirichlet distribution. Because of this reduction, we simplify the notation in this section by dropping the component subscript i from all relevant parameters. Let c^{(k)} be the letter count vectors associated with the n columns assigned to the bin in question, and let c_*^{(k)} be the total letter count for column k. Applying eq. (7), the log-likelihood L of the data is then given by

    \mathcal{L} = \sum_{k=1}^{n} \ln\left[ \frac{\Gamma(\alpha_*)}{\Gamma(\alpha_* + c_*^{(k)})} \prod_{j=1}^{L} \frac{\Gamma(\alpha_* q_j + c_j^{(k)})}{\Gamma(\alpha_* q_j)} \right]
              = \sum_{k=1}^{n} \left\{ \ln\Gamma(\alpha_*) - \ln\Gamma(\alpha_* + c_*^{(k)}) + \sum_{j=1}^{L} \left[ \ln\Gamma(\alpha_* q_j + c_j^{(k)}) - \ln\Gamma(\alpha_* q_j) \right] \right\}.    (8)

If the columns associated with a bin are assigned non-unitary weights w_k, one may generalize this formula by simply including the factor w_k before ln in the first summation.

Although it is possible to find the α_* and q that maximize eq. (8) by Newton's method in L-dimensional space, the computation of the gradient and the Hessian matrix is time-consuming, especially when n is large. Thus, our second heuristic idea is to reduce the maximization problem to one dimension, using first-moment information from the c^{(k)}. In short, if p is sampled from a Dirichlet distribution with parameters α, p_j follows a beta distribution with parameters (α_j, α_* - α_j).

Furthermore, conditioning on c_*^{(k)} and p_j, c_j^{(k)} follows a binomial distribution with parameters (c_*^{(k)}, p_j). This implies we can estimate q_j by

    \hat{q}_j = \frac{\sum_{k=1}^{n} c_j^{(k)}}{\sum_{k=1}^{n} c_*^{(k)}},    (9)

and replace q_j in eq. (8) by q̂_j. This leaves the single parameter α_* still to estimate. To apply Newton's method, we need the first and second derivatives of L with respect to α_*, which can be written as

    \frac{d\mathcal{L}}{d\alpha_*} = \sum_{k=1}^{n} \left\{ \psi(\alpha_*) - \psi(\alpha_* + c_*^{(k)}) + \sum_{j=1}^{L} \hat{q}_j \left[ \psi(\alpha_* \hat{q}_j + c_j^{(k)}) - \psi(\alpha_* \hat{q}_j) \right] \right\},    (10)

    \frac{d^2\mathcal{L}}{d\alpha_*^2} = \sum_{k=1}^{n} \left\{ \psi'(\alpha_*) - \psi'(\alpha_* + c_*^{(k)}) + \sum_{j=1}^{L} \hat{q}_j^2 \left[ \psi'(\alpha_* \hat{q}_j + c_j^{(k)}) - \psi'(\alpha_* \hat{q}_j) \right] \right\},    (11)

where ψ and ψ' are the digamma and trigamma functions. Given an initial α_* near the m.l. α̃_* that optimizes L, it is then simple to estimate α̃_* to great precision in a small number of iterations using Newton's method. There are rapid algorithms for calculating ψ and ψ' (Bernardo, 1976; Schneider, 1978; Spouge, 1994). Note as well that because ψ and ψ' appear in eqs. (10) and (11) only as differences, they can be replaced by rational functions when all letter counts c_j^{(k)} are integral.

Several problems may arise with this approach. First, L may be maximized in the limit only at the boundaries of (0, ∞). One may show that L is optimal as α_* approaches 0 only if each column of the data consists of a single type of letter, although this letter may vary from column to column (proof omitted). Furthermore, one may show that L is optimal as α_* approaches ∞ only if the observed letter frequencies are identical from column to column (proof omitted). It is easy to detect and allow for these special cases, but when there are more than a very small number of columns associated with a bin they essentially never arise in practice. Second, it is possible that L has multiple local maxima over (0, ∞). We conjecture that this cannot be the case, but have not been able to prove it. As shown by representative examples in Figure 4, L is very simply behaved in all cases we have observed. Third, it is possible that d²L/dα_*² may be positive for an initial value of α_*, for example if α_* were chosen greater than 5.3 in panel B of Figure 4. It is trivial to detect such cases, and replace α_* by a smaller value for which d²L/dα_*² is negative.

FIG. 4. The log-likelihood L of the data as a function of α_*. Examples are shown for four bins generated using the data set D_UCSC described in the text, with M = 10 components. (A) A bin generated after the first round of Gibbs sampling, beginning from a flat initial θ; all 10 bins in this first round yield very similar results. (B) The bin corresponding to component 1 of Table 4, after convergence. (C) The bin corresponding to component 3 of Table 4, after convergence. (D) The bin corresponding to component 8 of Table 4, after convergence. Solid vertical lines indicate the maximum-likelihood values α̃_*. Dashed vertical lines indicate the α̂_* calculated using the method of moments, and used to initiate Newton's method.
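The one-dimensional Newton iteration just described can be sketched as follows (ours; it implements eqs. (9)-(11) with scipy's digamma and trigamma functions, takes a caller-supplied starting value for α_* rather than the method-of-moments estimate described next, and can be passed to the Gibbs-sampling sketch of Section 4.1):

    import numpy as np
    from scipy.special import digamma, polygamma

    def fit_concentration(counts, q_hat=None, a_star=1.0, tol=1e-8, max_iter=100):
        """Maximum-likelihood alpha_* for one bin, holding q fixed at the moment estimate
        of eq. (9) and applying Newton's method with the derivatives of eqs. (10)-(11).
        counts: (n, L) array of column letter counts for the bin; a_star: starting value."""
        counts = np.asarray(counts, dtype=float)
        c_star = counts.sum(axis=1)
        if q_hat is None:
            q_hat = counts.sum(axis=0) / c_star.sum()        # eq. (9)
        for _ in range(max_iter):
            aq = a_star * q_hat
            d1 = (digamma(a_star) - digamma(a_star + c_star)).sum() \
                + (q_hat * (digamma(aq + counts) - digamma(aq))).sum()              # eq. (10)
            d2 = (polygamma(1, a_star) - polygamma(1, a_star + c_star)).sum() \
                + (q_hat**2 * (polygamma(1, aq + counts) - polygamma(1, aq))).sum()  # eq. (11)
            if d2 >= 0:                      # fall back to a smaller alpha_* (see text)
                a_star *= 0.5
                continue
            step = d1 / d2
            a_star = max(a_star - step, 1e-8)
            if abs(step) < tol:
                break
        return a_star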

In practice, as illustrated in Figure 4, we use the method of moments, detailed below, to initialize Newton's method. This approach yields initial α_* that tend to be near to, but somewhat smaller than, the m.l. α̃_*.

Just as we use first-moment information for the c^{(k)} to reduce the dimension of our problem, so we can use second-moment information to obtain a starting point for Newton's method. Specifically, for an individual letter j, some algebra allows us to write a method-of-moments estimate of α_* as

    \hat{\alpha}_*^{(j)} = \frac{\hat{q}_j (1 - \hat{q}_j) \, \hat{E}(c_*^2)/\hat{E}^2(c_*) \; - \; v_j}{v_j \; - \; \hat{q}_j (1 - \hat{q}_j) \, \hat{E}(c_*)/\hat{E}^2(c_*)},    (12)

where v_j = [ \hat{Var}(c_j) - \hat{q}_j^2 \, \hat{Var}(c_*) ] / \hat{E}^2(c_*). We average the α̂_*^{(j)} to obtain an initial α̂_* = Σ_{j=1}^{L} q̂_j α̂_*^{(j)}.

ACKNOWLEDGMENTS

This work was supported by the Intramural Research Program of the National Library of Medicine at the National Institutes of Health.

DISCLOSURE STATEMENT

No competing financial interests exist.

REFERENCES

Altschul, S.F., Gertz, E.M., Agarwala, R., et al. 2009. PSI-BLAST pseudocounts and the minimum description length principle. Nucleic Acids Res. 37.
Altschul, S.F., Wootton, J.C., Zaslavsky, E., et al. 2010. The construction and use of log-odds substitution scores for multiple sequence alignment. PLoS Comput. Biol. 6.
Bernardo, J.M. 1976. Algorithm AS 103: psi (digamma) function. J. R. Stat. Soc. Ser. C Appl. Stat. 25.
Brown, M., Hughey, R., Krogh, A., et al. 1993. Using Dirichlet mixture priors to derive hidden Markov models for protein families. In Hunter, L., Searls, D., and Shavlik, J., eds. Proc. First Int. Conf. Intell. Sys. Mol. Biol. AAAI Press, Menlo Park, CA.
Geman, S., and Geman, D. 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Patt. Anal. Mach. Intell. 6.
Grünwald, P.D. 2007. The Minimum Description Length Principle. MIT Press, Cambridge, MA.
Kirkpatrick, S., Gelatt, C.D., and Vecchi, M.P. 1983. Optimization by simulated annealing. Science 220.
Lawrence, C.E., Altschul, S.F., Boguski, M.S., et al. 1993. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 262.
Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., et al. 1953. Equation of state calculations by fast computing machines. J. Chem. Phys. 21.
Schneider, B.E. 1978. Algorithm AS 121: trigamma function. J. R. Stat. Soc. Ser. C Appl. Stat. 27.
Sjölander, K., Karplus, K., Brown, M., et al. 1996. Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology. Comput. Appl. Biosci. 12.
Spouge, J.L. 1994. Computation of the gamma, digamma, and trigamma functions. SIAM J. Numer. Anal. 31.
Ye, X., Yu, Y.-K., and Altschul, S.F. 2010. Compositional adjustment of Dirichlet mixture priors. J. Comput. Biol. 17.
Yu, Y.-K., and Altschul, S.F. 2011. The complexity of the Dirichlet model for multiple alignment data. J. Comput. Biol. 18 (this issue).

Address correspondence to:
Dr. Stephen F. Altschul
National Center for Biotechnology Information
National Library of Medicine
National Institutes of Health
Bethesda, MD
altschul@ncbi.nlm.nih.gov


More information

CS1800: Sequences & Sums. Professor Kevin Gold

CS1800: Sequences & Sums. Professor Kevin Gold CS1800: Sequences & Sums Professor Kevin Gold Moving Toward Analysis of Algorithms Today s tools help in the analysis of algorithms. We ll cover tools for deciding what equation best fits a sequence of

More information

BLAST: Target frequencies and information content Dannie Durand

BLAST: Target frequencies and information content Dannie Durand Computational Genomics and Molecular Biology, Fall 2016 1 BLAST: Target frequencies and information content Dannie Durand BLAST has two components: a fast heuristic for searching for similar sequences

More information

Bayesian Nonparametrics for Speech and Signal Processing

Bayesian Nonparametrics for Speech and Signal Processing Bayesian Nonparametrics for Speech and Signal Processing Michael I. Jordan University of California, Berkeley June 28, 2011 Acknowledgments: Emily Fox, Erik Sudderth, Yee Whye Teh, and Romain Thibaux Computer

More information

Supporting Online Material for

Supporting Online Material for www.sciencemag.org/cgi/content/full/309/5742/1868/dc1 Supporting Online Material for Toward High-Resolution de Novo Structure Prediction for Small Proteins Philip Bradley, Kira M. S. Misura, David Baker*

More information

Learning MN Parameters with Approximation. Sargur Srihari

Learning MN Parameters with Approximation. Sargur Srihari Learning MN Parameters with Approximation Sargur srihari@cedar.buffalo.edu 1 Topics Iterative exact learning of MN parameters Difficulty with exact methods Approximate methods Approximate Inference Belief

More information

For most practical applications, computing the

For most practical applications, computing the Efficient Calculation of Exact Mass Isotopic Distributions Ross K. Snider Snider Technology, Inc., Bozeman, Montana, USA, and Department of Electrical and Computer Engineering, Montana State University,

More information

A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models

A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models Jeff A. Bilmes (bilmes@cs.berkeley.edu) International Computer Science Institute

More information

STA 4273H: Sta-s-cal Machine Learning

STA 4273H: Sta-s-cal Machine Learning STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our

More information

Supplementary text for the section Interactions conserved across species: can one select the conserved interactions?

Supplementary text for the section Interactions conserved across species: can one select the conserved interactions? 1 Supporting Information: What Evidence is There for the Homology of Protein-Protein Interactions? Anna C. F. Lewis, Nick S. Jones, Mason A. Porter, Charlotte M. Deane Supplementary text for the section

More information

SIMU L TED ATED ANNEA L NG ING

SIMU L TED ATED ANNEA L NG ING SIMULATED ANNEALING Fundamental Concept Motivation by an analogy to the statistical mechanics of annealing in solids. => to coerce a solid (i.e., in a poor, unordered state) into a low energy thermodynamic

More information

MODULE -4 BAYEIAN LEARNING

MODULE -4 BAYEIAN LEARNING MODULE -4 BAYEIAN LEARNING CONTENT Introduction Bayes theorem Bayes theorem and concept learning Maximum likelihood and Least Squared Error Hypothesis Maximum likelihood Hypotheses for predicting probabilities

More information

Background on Coherent Systems

Background on Coherent Systems 2 Background on Coherent Systems 2.1 Basic Ideas We will use the term system quite freely and regularly, even though it will remain an undefined term throughout this monograph. As we all have some experience

More information

Learning Tetris. 1 Tetris. February 3, 2009

Learning Tetris. 1 Tetris. February 3, 2009 Learning Tetris Matt Zucker Andrew Maas February 3, 2009 1 Tetris The Tetris game has been used as a benchmark for Machine Learning tasks because its large state space (over 2 200 cell configurations are

More information

A Rothschild-Stiglitz approach to Bayesian persuasion

A Rothschild-Stiglitz approach to Bayesian persuasion A Rothschild-Stiglitz approach to Bayesian persuasion Matthew Gentzkow and Emir Kamenica Stanford University and University of Chicago December 2015 Abstract Rothschild and Stiglitz (1970) represent random

More information

Markov chain Monte Carlo methods for visual tracking

Markov chain Monte Carlo methods for visual tracking Markov chain Monte Carlo methods for visual tracking Ray Luo rluo@cory.eecs.berkeley.edu Department of Electrical Engineering and Computer Sciences University of California, Berkeley Berkeley, CA 94720

More information

Computer Vision Group Prof. Daniel Cremers. 14. Clustering

Computer Vision Group Prof. Daniel Cremers. 14. Clustering Group Prof. Daniel Cremers 14. Clustering Motivation Supervised learning is good for interaction with humans, but labels from a supervisor are hard to obtain Clustering is unsupervised learning, i.e. it

More information

Investigation into the use of confidence indicators with calibration

Investigation into the use of confidence indicators with calibration WORKSHOP ON FRONTIERS IN BENCHMARKING TECHNIQUES AND THEIR APPLICATION TO OFFICIAL STATISTICS 7 8 APRIL 2005 Investigation into the use of confidence indicators with calibration Gerard Keogh and Dave Jennings

More information

Does Better Inference mean Better Learning?

Does Better Inference mean Better Learning? Does Better Inference mean Better Learning? Andrew E. Gelfand, Rina Dechter & Alexander Ihler Department of Computer Science University of California, Irvine {agelfand,dechter,ihler}@ics.uci.edu Abstract

More information

MEME - Motif discovery tool REFERENCE TRAINING SET COMMAND LINE SUMMARY

MEME - Motif discovery tool REFERENCE TRAINING SET COMMAND LINE SUMMARY Command line Training Set First Motif Summary of Motifs Termination Explanation MEME - Motif discovery tool MEME version 3.0 (Release date: 2002/04/02 00:11:59) For further information on how to interpret

More information

Exercise 5. Sequence Profiles & BLAST

Exercise 5. Sequence Profiles & BLAST Exercise 5 Sequence Profiles & BLAST 1 Substitution Matrix (BLOSUM62) Likelihood to substitute one amino acid with another Figure taken from https://en.wikipedia.org/wiki/blosum 2 Substitution Matrix (BLOSUM62)

More information

Computational statistics

Computational statistics Computational statistics Markov Chain Monte Carlo methods Thierry Denœux March 2017 Thierry Denœux Computational statistics March 2017 1 / 71 Contents of this chapter When a target density f can be evaluated

More information

Every binary word is, almost, a shuffle of twin subsequences a theorem of Axenovich, Person and Puzynina

Every binary word is, almost, a shuffle of twin subsequences a theorem of Axenovich, Person and Puzynina Every binary word is, almost, a shuffle of twin subsequences a theorem of Axenovich, Person and Puzynina Martin Klazar August 17, 2015 A twin in a word u = a 1 a 2... a n is a pair (u 1, u 2 ) of disjoint

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Recent Advances in Bayesian Inference Techniques

Recent Advances in Bayesian Inference Techniques Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian

More information

Notes on Complexity Theory Last updated: December, Lecture 2

Notes on Complexity Theory Last updated: December, Lecture 2 Notes on Complexity Theory Last updated: December, 2011 Jonathan Katz Lecture 2 1 Review The running time of a Turing machine M on input x is the number of steps M takes before it halts. Machine M is said

More information

THE N-VALUE GAME OVER Z AND R

THE N-VALUE GAME OVER Z AND R THE N-VALUE GAME OVER Z AND R YIDA GAO, MATT REDMOND, ZACH STEWARD Abstract. The n-value game is an easily described mathematical diversion with deep underpinnings in dynamical systems analysis. We examine

More information

Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures

Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures 17th Europ. Conf. on Machine Learning, Berlin, Germany, 2006. Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures Shipeng Yu 1,2, Kai Yu 2, Volker Tresp 2, and Hans-Peter

More information

BAYESIAN STRUCTURAL INFERENCE AND THE BAUM-WELCH ALGORITHM

BAYESIAN STRUCTURAL INFERENCE AND THE BAUM-WELCH ALGORITHM BAYESIAN STRUCTURAL INFERENCE AND THE BAUM-WELCH ALGORITHM DMITRY SHEMETOV UNIVERSITY OF CALIFORNIA AT DAVIS Department of Mathematics dshemetov@ucdavis.edu Abstract. We perform a preliminary comparative

More information

X X (2) X Pr(X = x θ) (3)

X X (2) X Pr(X = x θ) (3) Notes for 848 lecture 6: A ML basis for compatibility and parsimony Notation θ Θ (1) Θ is the space of all possible trees (and model parameters) θ is a point in the parameter space = a particular tree

More information

Bayesian Models in Machine Learning

Bayesian Models in Machine Learning Bayesian Models in Machine Learning Lukáš Burget Escuela de Ciencias Informáticas 2017 Buenos Aires, July 24-29 2017 Frequentist vs. Bayesian Frequentist point of view: Probability is the frequency of

More information

THE DYNAMICS OF SUCCESSIVE DIFFERENCES OVER Z AND R

THE DYNAMICS OF SUCCESSIVE DIFFERENCES OVER Z AND R THE DYNAMICS OF SUCCESSIVE DIFFERENCES OVER Z AND R YIDA GAO, MATT REDMOND, ZACH STEWARD Abstract. The n-value game is a dynamical system defined by a method of iterated differences. In this paper, we

More information

Computational Biology

Computational Biology Computational Biology Lecture 6 31 October 2004 1 Overview Scoring matrices (Thanks to Shannon McWeeney) BLAST algorithm Start sequence alignment 2 1 What is a homologous sequence? A homologous sequence,

More information

Reducing storage requirements for biological sequence comparison

Reducing storage requirements for biological sequence comparison Bioinformatics Advance Access published July 15, 2004 Bioinfor matics Oxford University Press 2004; all rights reserved. Reducing storage requirements for biological sequence comparison Michael Roberts,

More information

CHAPTER 2 Estimating Probabilities

CHAPTER 2 Estimating Probabilities CHAPTER 2 Estimating Probabilities Machine Learning Copyright c 2017. Tom M. Mitchell. All rights reserved. *DRAFT OF September 16, 2017* *PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR S PERMISSION* This is

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Due Thursday, September 19, in class What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

Development of Stochastic Artificial Neural Networks for Hydrological Prediction

Development of Stochastic Artificial Neural Networks for Hydrological Prediction Development of Stochastic Artificial Neural Networks for Hydrological Prediction G. B. Kingston, M. F. Lambert and H. R. Maier Centre for Applied Modelling in Water Engineering, School of Civil and Environmental

More information

STA 414/2104: Machine Learning

STA 414/2104: Machine Learning STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 9 Sequential Data So far

More information

Experimental Uncertainty (Error) and Data Analysis

Experimental Uncertainty (Error) and Data Analysis Experimental Uncertainty (Error) and Data Analysis Advance Study Assignment Please contact Dr. Reuven at yreuven@mhrd.org if you have any questions Read the Theory part of the experiment (pages 2-14) and

More information

Roger Levy Probabilistic Models in the Study of Language draft, October 2,

Roger Levy Probabilistic Models in the Study of Language draft, October 2, Roger Levy Probabilistic Models in the Study of Language draft, October 2, 2012 224 Chapter 10 Probabilistic Grammars 10.1 Outline HMMs PCFGs ptsgs and ptags Highlight: Zuidema et al., 2008, CogSci; Cohn

More information

bound on the likelihood through the use of a simpler variational approximating distribution. A lower bound is particularly useful since maximization o

bound on the likelihood through the use of a simpler variational approximating distribution. A lower bound is particularly useful since maximization o Category: Algorithms and Architectures. Address correspondence to rst author. Preferred Presentation: oral. Variational Belief Networks for Approximate Inference Wim Wiegerinck David Barber Stichting Neurale

More information

An Evolution Strategy for the Induction of Fuzzy Finite-state Automata

An Evolution Strategy for the Induction of Fuzzy Finite-state Automata Journal of Mathematics and Statistics 2 (2): 386-390, 2006 ISSN 1549-3644 Science Publications, 2006 An Evolution Strategy for the Induction of Fuzzy Finite-state Automata 1,2 Mozhiwen and 1 Wanmin 1 College

More information

Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood

Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Jonathan Gruhl March 18, 2010 1 Introduction Researchers commonly apply item response theory (IRT) models to binary and ordinal

More information

An Introduction to Sequence Similarity ( Homology ) Searching

An Introduction to Sequence Similarity ( Homology ) Searching An Introduction to Sequence Similarity ( Homology ) Searching Gary D. Stormo 1 UNIT 3.1 1 Washington University, School of Medicine, St. Louis, Missouri ABSTRACT Homologous sequences usually have the same,

More information

Ahlswede Khachatrian Theorems: Weighted, Infinite, and Hamming

Ahlswede Khachatrian Theorems: Weighted, Infinite, and Hamming Ahlswede Khachatrian Theorems: Weighted, Infinite, and Hamming Yuval Filmus April 4, 2017 Abstract The seminal complete intersection theorem of Ahlswede and Khachatrian gives the maximum cardinality of

More information

The Ising model and Markov chain Monte Carlo

The Ising model and Markov chain Monte Carlo The Ising model and Markov chain Monte Carlo Ramesh Sridharan These notes give a short description of the Ising model for images and an introduction to Metropolis-Hastings and Gibbs Markov Chain Monte

More information

Curve Fitting Re-visited, Bishop1.2.5

Curve Fitting Re-visited, Bishop1.2.5 Curve Fitting Re-visited, Bishop1.2.5 Maximum Likelihood Bishop 1.2.5 Model Likelihood differentiation p(t x, w, β) = Maximum Likelihood N N ( t n y(x n, w), β 1). (1.61) n=1 As we did in the case of the

More information

Variational Principal Components

Variational Principal Components Variational Principal Components Christopher M. Bishop Microsoft Research 7 J. J. Thomson Avenue, Cambridge, CB3 0FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop In Proceedings

More information

Chapter 20. Deep Generative Models

Chapter 20. Deep Generative Models Peng et al.: Deep Learning and Practice 1 Chapter 20 Deep Generative Models Peng et al.: Deep Learning and Practice 2 Generative Models Models that are able to Provide an estimate of the probability distribution

More information

34.1 Polynomial time. Abstract problems

34.1 Polynomial time. Abstract problems < Day Day Up > 34.1 Polynomial time We begin our study of NP-completeness by formalizing our notion of polynomial-time solvable problems. These problems are generally regarded as tractable, but for philosophical,

More information

Scribe Notes for Parameterized Complexity of Problems in Coalition Resource Games

Scribe Notes for Parameterized Complexity of Problems in Coalition Resource Games Scribe Notes for Parameterized Complexity of Problems in Coalition Resource Games Rajesh Chitnis, Tom Chan and Kanthi K Sarpatwar Department of Computer Science, University of Maryland, College Park email

More information

CSCE 478/878 Lecture 6: Bayesian Learning

CSCE 478/878 Lecture 6: Bayesian Learning Bayesian Methods Not all hypotheses are created equal (even if they are all consistent with the training data) Outline CSCE 478/878 Lecture 6: Bayesian Learning Stephen D. Scott (Adapted from Tom Mitchell

More information

Lecture 3: Introduction to Complexity Regularization

Lecture 3: Introduction to Complexity Regularization ECE90 Spring 2007 Statistical Learning Theory Instructor: R. Nowak Lecture 3: Introduction to Complexity Regularization We ended the previous lecture with a brief discussion of overfitting. Recall that,

More information

Final Exam, Machine Learning, Spring 2009

Final Exam, Machine Learning, Spring 2009 Name: Andrew ID: Final Exam, 10701 Machine Learning, Spring 2009 - The exam is open-book, open-notes, no electronics other than calculators. - The maximum possible score on this exam is 100. You have 3

More information

Physics 403. Segev BenZvi. Numerical Methods, Maximum Likelihood, and Least Squares. Department of Physics and Astronomy University of Rochester

Physics 403. Segev BenZvi. Numerical Methods, Maximum Likelihood, and Least Squares. Department of Physics and Astronomy University of Rochester Physics 403 Numerical Methods, Maximum Likelihood, and Least Squares Segev BenZvi Department of Physics and Astronomy University of Rochester Table of Contents 1 Review of Last Class Quadratic Approximation

More information

Efficient Cryptanalysis of Homophonic Substitution Ciphers

Efficient Cryptanalysis of Homophonic Substitution Ciphers Efficient Cryptanalysis of Homophonic Substitution Ciphers Amrapali Dhavare Richard M. Low Mark Stamp Abstract Substitution ciphers are among the earliest methods of encryption. Examples of classic substitution

More information

A Rothschild-Stiglitz approach to Bayesian persuasion

A Rothschild-Stiglitz approach to Bayesian persuasion A Rothschild-Stiglitz approach to Bayesian persuasion Matthew Gentzkow and Emir Kamenica Stanford University and University of Chicago January 2016 Consider a situation where one person, call him Sender,

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

CHAPTER-17. Decision Tree Induction

CHAPTER-17. Decision Tree Induction CHAPTER-17 Decision Tree Induction 17.1 Introduction 17.2 Attribute selection measure 17.3 Tree Pruning 17.4 Extracting Classification Rules from Decision Trees 17.5 Bayesian Classification 17.6 Bayes

More information

CS168: The Modern Algorithmic Toolbox Lecture #6: Regularization

CS168: The Modern Algorithmic Toolbox Lecture #6: Regularization CS168: The Modern Algorithmic Toolbox Lecture #6: Regularization Tim Roughgarden & Gregory Valiant April 18, 2018 1 The Context and Intuition behind Regularization Given a dataset, and some class of models

More information

Algorithmisches Lernen/Machine Learning

Algorithmisches Lernen/Machine Learning Algorithmisches Lernen/Machine Learning Part 1: Stefan Wermter Introduction Connectionist Learning (e.g. Neural Networks) Decision-Trees, Genetic Algorithms Part 2: Norman Hendrich Support-Vector Machines

More information

, Department of Computer and Communication Sciences, University of Michigan, Ann Arbor, Michigan

, Department of Computer and Communication Sciences, University of Michigan, Ann Arbor, Michigan SIAM J. COMPUT. Vol. 13, No. 4, November 1984 (C) 1984 Society for Industrial and Applied Mathematics 002 EQUIVALENCE RELATIONS, INVARIANTS, AND NORMAL FORMS* ANDREAS BLASS" AND YURI GUREVICH* Abstract.

More information

Another Walkthrough of Variational Bayes. Bevan Jones Machine Learning Reading Group Macquarie University

Another Walkthrough of Variational Bayes. Bevan Jones Machine Learning Reading Group Macquarie University Another Walkthrough of Variational Bayes Bevan Jones Machine Learning Reading Group Macquarie University 2 Variational Bayes? Bayes Bayes Theorem But the integral is intractable! Sampling Gibbs, Metropolis

More information

6 Markov Chain Monte Carlo (MCMC)

6 Markov Chain Monte Carlo (MCMC) 6 Markov Chain Monte Carlo (MCMC) The underlying idea in MCMC is to replace the iid samples of basic MC methods, with dependent samples from an ergodic Markov chain, whose limiting (stationary) distribution

More information

Approximating the Partition Function by Deleting and then Correcting for Model Edges (Extended Abstract)

Approximating the Partition Function by Deleting and then Correcting for Model Edges (Extended Abstract) Approximating the Partition Function by Deleting and then Correcting for Model Edges (Extended Abstract) Arthur Choi and Adnan Darwiche Computer Science Department University of California, Los Angeles

More information