JOURNAL OF COMPUTATIONAL BIOLOGY, Volume 18, Number 8, 2011, Pp. 941-954. © Mary Ann Liebert, Inc.

On the Inference of Dirichlet Mixture Priors for Protein Sequence Comparison

XUGANG YE, YI-KUO YU, and STEPHEN F. ALTSCHUL

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland

ABSTRACT

Dirichlet mixtures provide an elegant formalism for constructing and evaluating protein multiple sequence alignments. Their use requires the inference of Dirichlet mixture priors from curated sets of accurately aligned sequences. This article addresses two questions relevant to such inference: of how many components should a Dirichlet mixture consist, and how may a maximum-likelihood mixture be derived from a given data set. To apply the Minimum Description Length principle to the first question, we extend an analytic formula for the complexity of a Dirichlet model to Dirichlet mixtures by informal argument. We apply a Gibbs-sampling based approach to the second question. Using artificial data generated by a Dirichlet mixture, we demonstrate that our methods are able to approximate well the true theory, when it exists. We apply our methods as well to real data, and infer Dirichlet mixtures that describe the data better than does a mixture derived using previous approaches.

Key words: algorithms, combinatorics, linear programming, machine learning, statistics.

1. INTRODUCTION

Multiple protein sequence alignment is a central problem in computational molecular biology. A powerful and elegant formalism underpinning one approach to multiple sequence alignment is that of Dirichlet mixtures (Brown et al., 1993; Sjölander et al., 1996; Altschul et al., 2010). Intuitively, it is assumed that positions within a protein can be thought of as falling into a small number of classes. Each such class may be described by its frequency within proteins, by a set of typical amino acid probabilities for protein positions belonging to the class, and by how far the probabilities associated with a specific position tend to diverge from these typical values. A Dirichlet mixture (DM) captures these notions formally.

There is no plausible way to construct from first principles a DM appropriate for protein sequence comparison, and DMs for this purpose are therefore derived from curated sets of protein multiple alignments, generally by seeking a maximum-likelihood DM (Brown et al., 1993; Sjölander et al., 1996). An immediate problem that arises is how many classes or components such a DM should have, but no systematic guidance for answering this question has been published. As is generally the case in model selection, the more components and therefore parameters a DM is allowed to have, the better it will be able to describe a given set of data, but a DM with too many components will tend to overfit the data.

The Minimum Description Length (MDL) principle (Grünwald, 2007) provides a way to deal with this problem. It implies that one should seek to minimize the description length of the data given the maximum-likelihood M-component DM, plus a measure of the complexity of the set of all M-component DMs. To apply the principle, we require both a method for finding maximum-likelihood DMs, and a formula for the complexity of a Dirichlet mixture model.

This article is organized as follows. First, we review the essentials of multinomial, Dirichlet, and Dirichlet mixture distributions, and of MDL theory. Second, we present heuristic arguments for extending to Dirichlet mixtures an analytic formula for the complexity of a single Dirichlet distribution model. Third, we present a Gibbs-sampling algorithm for seeking a maximum-likelihood DM to describe a given set of multiple alignment data. Other algorithmic approaches to this problem have been developed previously (Brown et al., 1993; Sjölander et al., 1996). Fourth, we apply our complexity formula and optimization algorithm to artificial data generated by a known DM, to study whether our method allows us to recover the DM's number of components, and to estimate its parameters accurately. Finally, we apply our method to real data from protein multiple alignments, and compare our results to earlier ones.

2. REVIEW

2.1. Multinomial and Dirichlet distributions, and Dirichlet mixtures

Statistical approaches to protein sequence comparison analyze the probabilities p for the various amino acids to appear at particular protein positions. Bayesian approaches require the specification of prior probabilities over the space of all possible p, and for mathematical convenience these priors are almost always taken to be Dirichlet distributions or Dirichlet mixtures. In this section, we review briefly the relevant mathematical concepts.

A multinomial distribution over an alphabet of L letters is described by an L-component vector of positive probabilities p, with Σ_{j=1}^{L} p_j = 1. Because of the constraint, the space Ω of all possible multinomials is (L - 1)-dimensional. One may imagine the probabilities for the various amino acids appearing at a particular protein position as described by a particular multinomial.

Bayesian statistics require the specification of a prior probability density over the multinomial space Ω. Such a prior should capture as well as possible one's general knowledge concerning proteins. For example, multinomials in which all and only the hydrophobic residues have high probabilities may be favored, whereas multinomials in which both hydrophobic and charged residues have high probabilities may be disfavored. Although a prior distribution θ may take any form one wishes, for analytic and computational convenience it is best to require that θ be a Dirichlet distribution or a Dirichlet mixture.

A Dirichlet distribution is a probability density over Ω. A particular Dirichlet distribution may be specified by an L-dimensional parameter vector α, with all α_j positive, and it is convenient to define α_* ≡ Σ_{j=1}^{L} α_j. The density of this distribution at x is defined as

    f(\vec{x}) = Z \prod_{j=1}^{L} x_j^{\alpha_j - 1},    (1)

where the normalizing scalar Z = Γ(α_*) / Π_{j=1}^{L} Γ(α_j) ensures that integrating f over its domain Ω yields 1. The special case of all α_j = 1 corresponds to the uniform density.
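As a concrete illustration of eq. (1), the log density may be evaluated as follows (a minimal sketch in Python, using numpy and scipy's log-gamma function; the function name is ours and the code is purely illustrative):

    import numpy as np
    from scipy.special import gammaln

    def dirichlet_log_density(x, alpha):
        """Natural log of eq. (1): log f(x) = log Z + sum_j (alpha_j - 1) log x_j,
        with log Z = log Gamma(alpha_*) - sum_j log Gamma(alpha_j)."""
        x = np.asarray(x, dtype=float)
        alpha = np.asarray(alpha, dtype=float)
        log_Z = gammaln(alpha.sum()) - gammaln(alpha).sum()
        return log_Z + np.sum((alpha - 1.0) * np.log(x))

    # All alpha_j = 1 gives the uniform density; over the 3-letter simplex it equals 2,
    # so the log density is ln 2 = 0.693...
    print(dirichlet_log_density([0.2, 0.3, 0.5], [1.0, 1.0, 1.0]))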
One may show that the expected value of x is q = α/α_*, and it is sometimes convenient to specify a Dirichlet distribution using the alternative parametrization (q, α_*). Because only L - 1 of the components of q are independent, there are still L free parameters. The location parameters q specify the center of mass of the distribution, while the concentration parameter α_* specifies how tightly the probability density is concentrated near q. Large values of α_* correspond to densities very concentrated near q, whereas values of α_* near 0 correspond to densities concentrated near the boundaries of Ω.
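The effect of the concentration parameter is easy to see by sampling. The following sketch (ours; it assumes numpy) converts between the two parametrizations and draws samples at a small and a large value of α_*, showing how tightly they cluster around q:

    import numpy as np

    def to_alpha(q, alpha_star):
        """(q, alpha_*) -> alpha: the Dirichlet parameter vector is alpha_* times q."""
        return alpha_star * np.asarray(q, dtype=float)

    def to_q_alpha_star(alpha):
        """alpha -> (q, alpha_*): location q = alpha/alpha_*, concentration alpha_* = sum_j alpha_j."""
        alpha = np.asarray(alpha, dtype=float)
        alpha_star = alpha.sum()
        return alpha / alpha_star, alpha_star

    rng = np.random.default_rng(0)
    q = np.array([0.5, 0.3, 0.2])
    for alpha_star in (0.5, 50.0):
        samples = rng.dirichlet(to_alpha(q, alpha_star), size=5000)
        # The mean deviation from q shrinks as alpha_* grows; small alpha_* pushes
        # probability mass toward the boundaries of the simplex.
        print(alpha_star, np.abs(samples - q).mean())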

Different protein positions are under different physical constraints and selective pressures, but it may be imagined that most positions fall into one of several broad categories, each with its own amino acid preferences. This implies that multiple distinct regions of multinomial space should have high prior probabilities. A single Dirichlet distribution in general cannot model such a density, but a Dirichlet mixture (DM) can. A DM is a probability density over Ω that is a simple linear combination of a finite number M of Dirichlet distributions, each called a Dirichlet component. Formally, it is defined by M Dirichlet distributions with respective parameters α_i, and a set of M positive mixture parameters m that sum to 1. Because of this last constraint, a DM has ML + M - 1 free parameters. Defining α_{i*} ≡ Σ_{j=1}^{L} α_{i,j}, the center of mass of a DM is simply

    \sum_{i=1}^{M} m_i \frac{\vec{\alpha}_i}{\alpha_{i*}} = \sum_{i=1}^{M} m_i \vec{q}_i.

For Bayesian analysis, the signal advantage of using a Dirichlet prior θ over Ω is that after the observation of a single letter a, the posterior is also a Dirichlet distribution θ', with parameters identical to those of θ, except that α'_a = α_a + 1. Similarly, if the prior is a DM, the posterior is also a DM, with parameters that are easy to calculate (Brown et al., 1993; Sjölander et al., 1996; Altschul et al., 2010). The class of DMs is rich enough to model well a broad range of prior beliefs.

2.2. Issues in inferring Dirichlet mixture priors

Because there is no plausible way to infer a DM appropriate for protein sequence comparison from theory alone, DMs for this purpose are derived from curated sets of protein multiple alignments. Intuitively, by examining a gold standard set of multiple alignment data, constructed using structural or other considerations, one may develop general knowledge about which multinomials tend best to describe protein alignment columns. Formally, after selecting the number M of components comprising the DM, one seeks the maximum-likelihood DM, i.e., that which assigns the multiple alignment data the greatest probability (Brown et al., 1993; Sjölander et al., 1996). Neither step of this procedure is trivial, and the two may be seen as interrelated. This article studies ways in which each of these two steps may be accomplished.

In general, given a variety of related models with which to describe a set of data, the model with the greatest number of parameters will fit the data best. Too many parameters, however, can result in overfitting, i.e., modeling the noise within the data rather than the regularities, which can lead to poor predictions on new data. To avoid this problem, a useful criterion for model selection is the MDL principle. Below, we will describe in detail how the MDL principle may be applied to selecting the number of Dirichlet components.

Given a specific number M of components, the question of how to find a maximum-likelihood DM remains. Taking the likelihood of the data as an objective function, the central problem is that, over the space of M-component DMs, this objective function has many local maxima. This classic optimization problem has no known rigorous solution, but there are a variety of fruitful heuristic approaches. Among these is Gibbs sampling, whose application to DM optimization we describe below.

2.3. The minimum description length principle

The MDL principle (Grünwald, 2007) addresses the question of which among several models to choose for describing a set of data. It proposes that the best model is the one which minimizes the description length of the data given the model, plus the description length of the model. Formalizing these concepts is not trivial (Grünwald, 2007), and for brevity we will confine our review of the MDL approach to those elements relevant to the problem at hand. We take a theory θ to specify a probability distribution P_θ over the space of all possible sets of data.
The description length (in bits) of a set of data D given θ is then defined as DL(D|θ) = -log_2[P_θ(D)]. (Throughout this article, we assume logs to be base 2; when natural logarithms are needed, we use the notation ln.) A model ℳ is a parametrized set of theories, and the description length of D given ℳ is defined as DL(D|ℳ) = inf_{θ∈ℳ} DL(D|θ). For a nested set of models ℳ_1, ℳ_2, ..., such as all DMs with 1, 2, etc. components, it is evident that a more comprehensive model will never imply a longer description length for a set of data.

The most challenging aspect of MDL theory is the definition and calculation of the description length, also called the complexity, of a model. The complexity of a parametrized model may be thought of as the log of the effective number of independent theories it contains (Grünwald, 2007), and this number is dependent on the quantity of data that the model is asked to describe. In brief, assume our data consist of n independent samples drawn from a probability distribution parametrized by θ, where θ lies in the k-dimensional space Θ. Then, given certain reasonable assumptions, we have that, for large n, the complexity of ℳ is given by

    COMP(\mathcal{M}, n) = \frac{k}{2} \log n + A + o(1),    (2)

where A is a constant dependent on ℳ (Grünwald, 2007). The mathematics are too complex to compute A in eq. (2) for Dirichlet mixtures. However, as described by Yu and Altschul (2011), the simpler case of a single Dirichlet distribution is tractable. Heuristic arguments then allow us to derive a plausible formula for the complexity of Dirichlet mixtures from this simpler case, and that of the multinomial distribution. This formula is the tool we require to specify the optimal number of DM components for describing a set of multiple alignment data.

3. MODEL COMPLEXITY

3.1. The single Dirichlet model

A set of observed data is not drawn directly from a Dirichlet distribution, but is mediated through a multinomial. Specifically, to say that the data in a multiple alignment are described by a Dirichlet distribution is shorthand for saying that, for each column of the alignment, a multinomial p is sampled from the Dirichlet distribution, and the data in the column are then sampled according to p. The model D_L in question is the set of all Dirichlet distributions over the alphabet of L letters, applied to the description of multiple alignment data consisting of n independent columns, with each column containing c letters. This model is L-dimensional, and the constant A in eq. (2) depends on L and c. As described by Yu and Altschul (2011), the complexity of D_L is given by

    COMP(D_L, n, c) = \frac{L}{2} \log n + \frac{L-1}{2} \log\frac{c}{2} - \log\Gamma(L/2) - \frac{1}{2}\log(L-1) + \Delta_{L,c} + o(1),    (3)

where Δ_{L,c} is a small calculable constant which approaches 0 for large c. For proteins, Δ_{20,c} is always less than 0.3 bits, and its values for c ranging from 2 to 500 are given in Table 1. In practice, one's data frequently consist of n columns in which c varies by column. In this case, it is appropriate to extend eq. (3) by using c̄, the mean number of observations per column, in place of c (Yu and Altschul, 2011).

Table 1. Δ_{20,c} for the Single Dirichlet Model

(Columns: c and Δ_{20,c} in bits, for values of c from 2 to 500.)

Values are calculated as described in Yu and Altschul (2011).
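For later use, eq. (3) can be transcribed directly into code (a sketch, ours; it assumes the form of eq. (3) as given above, works in bits, and takes the small constant Δ_{L,c} as an argument, defaulting to 0):

    import math

    def comp_dirichlet(L, n, c, delta=0.0):
        """Complexity in bits of the single-Dirichlet model D_L, eq. (3); logs are base 2.
        delta stands for the small constant Delta_{L,c} tabulated in Table 1."""
        log2 = math.log(2.0)
        return ((L / 2.0) * math.log2(n)
                + ((L - 1) / 2.0) * math.log2(c / 2.0)
                - math.lgamma(L / 2.0) / log2
                - 0.5 * math.log2(L - 1)
                + delta)

    # For protein alphabets (L = 20), e.g. 2000 columns with 20 observations each:
    print(comp_dirichlet(20, 2000, 20))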

3.2. The Dirichlet mixture model

It is challenging to analyze the complexity of a single Dirichlet distribution (Yu and Altschul, 2011), and we see no feasible way to analyze rigorously the complexity of a Dirichlet mixture. However, it is possible to use informal arguments to plausibly approximate this complexity.

We consider DM_{M,L}, the Dirichlet mixture model with M components, on an alphabet of L letters. The data DM_{M,L} describes is the same as that considered above, namely a multiple alignment with n columns, and c observations in each column. A DM can be viewed as using all its components to describe a particular column of data, but with varying component probabilities, which reflect how well the various components respectively fit the data. Given a DM that describes a set of data well, for most columns the component probabilities are concentrated on a single component. In order to gain a handle on the complexity of a DM model, we therefore make the approximation that each column can be assigned to a single component. This allows us to separate the complexity of a DM model into the complexity of its mixture parameters and the complexity of each of its Dirichlet components. Our approximation necessarily counts some theories as distinct that might not be distinguishable by the data, and therefore should overestimate the complexity of a DM model.

To start, the M - 1 mixture parameters of DM_{M,L} can be viewed as describing a multinomial distribution over an alphabet of M letters, where each letter represents a particular Dirichlet component. Thus, by our approximation, the complexity of the mixture parameters can be seen as equivalent to that of a multinomial model on M letters (M_M), applied to n pieces of data. As described, for example, in the supplementary material of Altschul et al. (2009), for large n the complexity of such a multinomial model is given by

    COMP(M_M, n) = \frac{M-1}{2} \log\frac{n}{2} + \frac{1}{2}\log\pi - \log\Gamma(M/2) + o(1).    (4)

Second, we calculate the complexity of each of the M Dirichlet components using the theory reviewed above, but modified so that each component is assumed to describe only a subset of the columns. Because the logarithm is a concave function, the aggregate complexity of the Dirichlet components is maximized by assuming the columns are divided evenly among them. Finally, we note that the labels attached to the M Dirichlet components are purely arbitrary, and that a permutation of the labels results in an identical DM. To compensate for this overcounting of distinct theories, we need to subtract log(M!) from our assessment of the complexity of a DM model. Putting the various pieces together, we approximate this complexity with the formula

    COMP(DM_{M,L}, n, c) \approx COMP(M_M, n) + M \cdot COMP(D_L, n/M, c) - \log(M!).    (5)

Given a set of multiple alignment data, Dirichlet mixture models with increasing numbers of components will lead to shorter data description lengths. Eq. (5) and the MDL principle allow us to calculate at what point these decreasing description lengths no longer compensate sufficiently for the increasing complexity of the models.
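Eqs. (4) and (5) are likewise straightforward to compute (a sketch, ours, that reuses the comp_dirichlet function from the earlier sketch; all logarithms are base 2, as in the text):

    import math

    def comp_multinomial(M, n):
        """Complexity in bits of the multinomial model M_M on M letters, eq. (4)."""
        return (((M - 1) / 2.0) * math.log2(n / 2.0)
                + 0.5 * math.log2(math.pi)
                - math.lgamma(M / 2.0) / math.log(2.0))

    def comp_dm(M, L, n, c, delta=0.0):
        """Approximate complexity in bits of the M-component Dirichlet mixture model, eq. (5):
        mixture-parameter complexity plus M Dirichlet components (each seeing n/M columns),
        minus log2(M!) for the arbitrary labeling of the components."""
        return (comp_multinomial(M, n)
                + M * comp_dirichlet(L, n / M, c, delta)
                - math.log2(math.factorial(M)))

For model selection, comp_dm(M, 20, n, c) is added to the data description length DL(D | θ_M) of eq. (6) below.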
4. INFERRING MAXIMUM-LIKELIHOOD DIRICHLET MIXTURES

4.1. A Gibbs-sampling strategy

As important as calculating the complexity of a Dirichlet mixture model is finding the specific mixture θ contained by the model that minimizes the description length of a given set of data. Formally, assume the M-component DM θ has mixture parameters m and Dirichlet parameters α_i, which are alternatively expressed, as described above, by (q_i, α_{i*}). Assume further that the data D consist of n independent columns, with letter counts c_*^{(1)}, c_*^{(2)}, ..., c_*^{(n)}, and with letter count vectors c^{(1)}, c^{(2)}, ..., c^{(n)}. Then the description length of D given θ is

    DL(D|\theta) = -\sum_{k=1}^{n} \log \sum_{i=1}^{M} m_i \, p_i^{(k)},    (6)

where p_i^{(k)} is the probability of a specific column with count vector c^{(k)} given the ith Dirichlet component, and can be written as

    p_i^{(k)} = \frac{\Gamma(\alpha_{i*})}{\Gamma(\alpha_{i*} + c_*^{(k)})} \prod_{j=1}^{L} \frac{\Gamma(\alpha_{i,j} + c_j^{(k)})}{\Gamma(\alpha_{i,j})}    (7)

(Sjölander et al., 1996; Altschul et al., 2010).
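In practice, eqs. (6) and (7) are evaluated on the log scale with the log-gamma function, since the gamma ratios overflow for realistic counts. A vectorized sketch (ours; counts is an n × L array of column letter counts, mix the mixture parameters m, and alphas an M × L array of Dirichlet parameters):

    import numpy as np
    from scipy.special import gammaln, logsumexp

    def column_log_probs(counts, alphas):
        """Natural log of p_i^(k) of eq. (7) for every column k and component i; returns an (n, M) array."""
        counts = np.asarray(counts, dtype=float)          # shape (n, L)
        alphas = np.asarray(alphas, dtype=float)          # shape (M, L)
        a_star = alphas.sum(axis=1)                       # alpha_{i*}
        c_star = counts.sum(axis=1)                       # c_*^(k)
        term1 = gammaln(a_star)[None, :] - gammaln(a_star[None, :] + c_star[:, None])
        term2 = (gammaln(counts[:, None, :] + alphas[None, :, :]) -
                 gammaln(alphas)[None, :, :]).sum(axis=2)
        return term1 + term2

    def description_length(counts, mix, alphas):
        """DL(D | theta) of eq. (6), in bits."""
        logp = column_log_probs(counts, alphas)
        col_log_like = logsumexp(logp + np.log(mix)[None, :], axis=1)
        return -col_log_like.sum() / np.log(2.0)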

We seek the M-component DM θ that minimizes DL(D|θ), i.e., the maximum-likelihood DM. Unfortunately, this optimization problem is both high-dimensional and nonconcave, and its rigorous solution is therefore likely to be intractable. Nevertheless, heuristic optimization procedures are available, and an expectation-maximization (EM) approach has been described by Sjölander et al. (1996). We here propose an alternative optimization procedure that we will argue has certain advantages over the earlier one.

Our basic approach is to use Gibbs sampling (Geman and Geman, 1984; Lawrence et al., 1993) to reduce the hard problem of simultaneously optimizing the parameters of a Dirichlet mixture to the much simpler one of separately optimizing the parameters of its constituent Dirichlet components. Specifically, we proceed as follows:

a. Start with an M-component DM θ. Calculate DL_best := DL(D|θ) using eqs. (6) and (7), and let θ_best := θ.
b. Create M empty bins, and then for each column k:
   i. Use m and eq. (7) to calculate a likelihood m_i p_i^{(k)} for each constituent component of θ.
   ii. Normalize these likelihoods, and use them to randomly sample column k into one of the M bins.
c. For each bin i, corresponding to an individual Dirichlet component, calculate parameters for a new DM θ' as follows:
   i. Calculate a new mixture parameter as m'_i := n_i/n, where n_i is the number of columns that have been sampled into bin i.
   ii. Calculate new location parameters as q'_{i,j} := C_{i,j}/C_i, where C_{i,j} is the aggregate count of letter j among all columns assigned to bin i, and C_i is the aggregate count of all letters in all columns assigned to bin i.
   iii. Calculate a new concentration parameter α'_{i*} using the maximum-likelihood procedure described in the Appendix.
d. Calculate DL' := DL(D|θ'). If DL' < DL_best, let DL_best := DL' and θ_best := θ'.
e. If more than R iterations have passed since θ_best was changed, stop. Otherwise, let θ := θ', and return to step b.

A notable feature of this Gibbs-sampling algorithm is that, after the assignment of columns to individual bins, the mixture and location parameters m and q_i are trivial to estimate. For each component, the estimation of the concentration parameter α_{i*} reduces to the simple one-dimensional optimization of a smooth function, as described in the Appendix. This stands in contrast to the multi-dimensional optimization required by each step of the EM algorithm described by Sjölander et al. (1996).
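Steps a-e can be sketched compactly as follows (our illustration, not the authors' implementation; it reuses column_log_probs and description_length from the previous sketch, and fit_concentration stands for the one-dimensional maximum-likelihood routine of the Appendix, sketched there):

    import numpy as np

    def gibbs_fit_dm(counts, M, fit_concentration, R=3, pseudocount=1e-3, seed=0):
        """One run of the Gibbs-sampling search of Section 4.1 (a sketch).
        counts: (n, L) array of column letter counts; fit_concentration(cols, q) returns alpha_*."""
        rng = np.random.default_rng(seed)
        n, L = counts.shape
        mix = np.full(M, 1.0 / M)                       # step a: flat initial theta
        alphas = np.ones((M, L))
        best = (description_length(counts, mix, alphas), mix.copy(), alphas.copy())
        stale = 0
        while stale <= R:                               # step e: stop after R stale iterations
            # step b: sample each column into a bin with probability proportional to m_i p_i^(k)
            logw = column_log_probs(counts, alphas) + np.log(mix)[None, :]
            w = np.exp(logw - logw.max(axis=1, keepdims=True))
            w /= w.sum(axis=1, keepdims=True)
            bins = np.array([rng.choice(M, p=row) for row in w])
            # step c: re-estimate mixture, location, and concentration parameters per bin
            for i in range(M):
                cols = counts[bins == i]
                mix[i] = max(len(cols), 1) / n          # simplification: floor to avoid empty bins
                C_ij = cols.sum(axis=0) + pseudocount   # pseudocounts keep q strictly positive
                q_i = C_ij / C_ij.sum()
                a_star = fit_concentration(cols, q_i) if len(cols) >= 2 else alphas[i].sum()
                alphas[i] = a_star * q_i
            mix /= mix.sum()
            # step d: keep the best theta seen so far
            dl = description_length(counts, mix, alphas)
            if dl < best[0]:
                best, stale = (dl, mix.copy(), alphas.copy()), 0
            else:
                stale += 1
        return best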
4.2. Refinements

This basic optimization procedure can be refined in several ways. It may be iterated multiple times using different random number seeds, and the best result from the multiple runs retained. When a relatively small number of columns are sampled into a given bin, it is possible that one or more letters may be completely missing from the sample. For this case, it is useful to employ pseudocounts, or some positive lower bound on the q'_{i,j}, so that the Dirichlet parameters remain valid, and so that the component in question retains the ability to describe columns that contain the missing letters. Also, if fewer than two columns are sampled into a bin, it is impossible to calculate α'_{i*} in the final part of step c. It can thus be useful to impose a minimum number of columns per bin, which if not achieved causes the program to halt, or else to remove the bin completely and proceed henceforth with a smaller number of Dirichlet components. Finally, as we discuss below, various strategies may be used to choose a promising θ with which to initialize the procedure.

The Gibbs-sampling algorithm above can be modified to produce a non-stochastic descent procedure. In place of sampling each column into a random bin in step b.ii, each column can be assigned fractional membership in each bin, proportional to the calculated likelihoods. Step c may then be generalized to calculate the parameters of θ' from non-unitary column counts. The procedure terminates when DL_best improves by less than a small, set value ε. In our implementation, we use Gibbs sampling as a first stage, and once it terminates move into a second, descent stage. We have found that this second stage produces negligible improvement when the data consist of a large number of columns, but can yield noticeably improved results for small data sets. Omitting the initial Gibbs-sampling stage, however, is not advisable, as the descent stage alone is liable to get trapped by local optima.

4.3. Initialization strategies

Gibbs sampling and related stochastic approaches such as simulated annealing (Metropolis et al., 1953; Kirkpatrick et al., 1983) are widely used in attempts to find the global optimum of objective functions that have many local optima. They have the virtue of frequently being able to escape local optima that may trap deterministic procedures. By default, our Gibbs-sampling routine begins with an essentially flat θ, i.e., one in which all m_i = 1/M and all α_{i,j} = 1. Remarkably, as we will see below, excellent results can be obtained using even this completely agnostic strategy.

It is often useful to initiate a Gibbs-sampling routine at a point one believes may be near the global optimum. Especially when one is exploring DMs with varying numbers of components, such initialization strategies can be of significant utility. For example, one can use a good M-component DM θ_M to seek an optimal (M + 1)-component DM by initiating the algorithm above with the α_i of θ_M plus a flat component, with trivial adjustments to the mixture parameters m. Conversely, given an (M + 1)-component DM θ_{M+1}, one can partition one's data into M + 1 bins as above, and then determine which two bins, when combined, yield the M-component θ_M that best describes the data. θ_M can then be used to initiate the search for an optimal M-component DM.

Even when using stochastic approaches, a cohort of related columns can become stuck in one bin when it would better be assigned to another, because the pull of the cohort will prevent individual members from migrating. In this situation, moving back and forth between dimensions as described above can be useful. In brief, when seeking an optimal (M + 1)-component DM, it is possible that the misplaced cohort will break away to populate the new bin. Then, using the new θ_{M+1} to create a θ_M with which to seed a new search for an optimal M-component DM, it may become apparent that the cohort belongs better with a set of columns other than that from which it broke away.

5. RESULTS

5.1. Simulated data

The research group at UCSC that originally proposed the DM formalism for multiple sequence analysis currently provides a variety of multiple alignment data sets and Dirichlet mixtures derived therefrom at their website. In order to test both the MDL principle and our Gibbs-sampling algorithm on protein-like data for which the true solution is known, we used the 9-component DM byst comp from this website (here called θ̃) to generate artificial data. Specifically, we constructed four artificial data sets of 10,000 or more columns, each with a different average column size c̄, equal to 10, 20, 40, and 80. For a column k in a given data set, we first randomly selected c_*^{(k)} > 1, the number of observations in that column, from the Poisson distribution with mean c̄. We then used θ̃ to select a random multinomial distribution over 20 letters, and generated c_*^{(k)} independent observations from this multinomial.
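The generation procedure just described may be sketched as follows (ours; mix and alphas stand for the mixture weights and the M × 20 array of Dirichlet parameters of the generating DM θ̃):

    import numpy as np

    def sample_columns(mix, alphas, n_columns, c_bar, seed=0):
        """Draw artificial alignment columns from a Dirichlet mixture, as in Section 5.1:
        pick a component by its mixture weight, draw a multinomial p from that Dirichlet,
        draw a column size c > 1 from Poisson(c_bar), and then draw c letters from p.
        Returns an (n_columns, L) array of letter counts."""
        rng = np.random.default_rng(seed)
        L = alphas.shape[1]
        counts = np.zeros((n_columns, L), dtype=int)
        for k in range(n_columns):
            i = rng.choice(len(mix), p=mix)
            p = rng.dirichlet(alphas[i])
            c = 0
            while c <= 1:                      # the text requires c_*^(k) > 1
                c = rng.poisson(c_bar)
            counts[k] = rng.multinomial(c, p)
        return counts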
For the first n columns of a given data set, we used the Gibbs-sampling algorithm described above, with flat initial θs, to seek maximum-likelihood DMs θ_M from models with M = 1 to M = 15 components. (We let R = 3 in step e, and retained the best result from ten runs using different random number seeds.) We then calculated the complexity of each model (eq. (5)), and the description length of the data (eqs. (6) and (7)) given θ_M. Finally, using the MDL principle, we chose M̂ to be the number of components for which the sum of these numbers was minimized.

The calculation of M̂ for a specific example, with c̄ = 20 and n = 2000, is illustrated in Table 2. As can be seen, the description length of the data decreases as the number of components grows, but this improvement eventually is outweighed by increases in the complexity of the models. In this case, the MDL principle yields M̂ = 9, but this result prevails only barely over M = 8.

For each data set, with n ranging from 50 to 10,000 in increments of 50, we calculated M̂ using the procedure described above; the results are shown in Figure 1. When there are very few columns, the best model is a simple one, with only one or two Dirichlet components. However, as the number of columns grows, so, on average, does M̂, until it settles first into the range 9 ± 1, and eventually becomes nearly certain to equal 9, the number of components actually used to generate the data. For each data set, we indicate in Figure 1 the lowest value n_0 for which M̂ = 9 ± 1 for all tested n ≥ n_0. The larger the average number of observations per column, the earlier this convergence tends to occur.
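The selection of M̂ thus amounts to a small search over M (a sketch that reuses the comp_dm and gibbs_fit_dm functions from the earlier sketches; it is illustrative only):

    def choose_M(counts, M_max=15, L=20, **gibbs_kwargs):
        """Return (M_hat, mix, alphas) minimizing COMP(DM_{M,L}, n, c_bar) + DL(D | theta_M)."""
        n = len(counts)
        c_bar = counts.sum() / n
        best = None
        for M in range(1, M_max + 1):
            dl, mix, alphas = gibbs_fit_dm(counts, M, **gibbs_kwargs)
            total = comp_dm(M, L, n, c_bar) + dl
            if best is None or total < best[0]:
                best = (total, M, mix, alphas)
        return best[1], best[2], best[3]

    # e.g. choose_M(counts, fit_concentration=fit_concentration), where fit_concentration
    # is the concentration-parameter routine sketched in the Appendix.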

Table 2. Calculation of M̂ Using the MDL Principle

(Columns: M; free parameters; COMP(DM) in bits; DL(D|θ_M) in bits; and their sum COMP(DM) + DL(D|θ_M) in bits.)

Model complexities are estimated using eq. (5) for DM models with M components, applied to a data set with 2000 columns and a mean of 20 observations per column. Data description lengths are for a particular data set, generated as described in the text, using a putative maximum-likelihood DM θ_M, estimated using the Gibbs-sampling algorithm described in the text. The MDL principle yields M̂ = 9, at which the total description length COMP(DM_{M,20}, 2000, 20) + DL(D|θ_M) is minimized. All description lengths are rounded to the nearest bit.

FIG. 1. The estimated number of Dirichlet components as a function of data size. The 9-component DM θ̃ was used to generate four data sets, with an average of c̄ observations per column, with c̄ = 10, 20, 40, and 80. Using our Gibbs-sampling algorithm applied to the first n columns of each data set (with n divisible by 50), and the MDL principle, we calculated M̂, the number of components in the optimal Dirichlet mixture model. For each data set, a vertical line indicates the smallest n_0 for which M̂ = 9 ± 1 for all tested values of n ≥ n_0.

Although the correct number of Dirichlet components can be recovered to good precision with even a relatively small number of columns, the recovery of accurate values for the parameters of the generating DM requires substantially more data. In Table 3, we compare the parameters recovered by the example considered in Table 2 to those of θ̃, used to generate the data.

Table 3. Comparison of Recovered to Generating DM Parameters

(Columns: component index i; the generating parameters m̃_i and α̃_{i*}; the ratios m̂_i/m̃_i and α̂_{i*}/α̃_{i*} for the c̄ = 20 and c̄ = 80 data sets; and the divergences JS(q̃_i, q̂_i) for both data sets.)

Using the nine-component DM θ̃, artificial data sets with (n = 2000, c̄ = 20) and (n = 100,000, c̄ = 80) were generated. For each data set, a maximum-likelihood 9-component DM θ̂ was estimated. The parameters of each component i of θ̃ are compared to the corresponding parameters in the nearest component of θ̂. JS(q̃_i, q̂_i) is the Jensen-Shannon divergence of the location parameter vectors q̃_i and q̂_i, where

    JS(\vec{r}, \vec{s}) = \frac{1}{2} \sum_j \left[ r_j \ln\frac{2 r_j}{r_j + s_j} + s_j \ln\frac{2 s_j}{r_j + s_j} \right].

We also consider the parameters recovered using a much larger data set, with n = 100,000 and c̄ = 80, values which are comparable to those for real protein data sets used to derive DMs, such as that studied below. Given the larger data set, we are able to estimate both the mixture parameters m_i and the concentration parameters α_{i*} on average to within 0.6%, and in no case to err by more than 1.5%. The smaller data set yields much larger errors, erring in the estimation of these parameters by over 15% more than half the time. In general, the smaller α_{i,j}, the smaller the absolute but the greater the relative error in estimating its value, a fact related to the development in Ye et al. (2010). Thus, rather than recording data for individual α_{i,j}, Table 3 summarizes the accuracy with which each component's location parameters q̃_i are estimated, using the measure of Jensen-Shannon divergence. For the smaller data set, three of the components yield substantially larger divergences, whereas for the larger data set the divergence is small for all but one component.

For the artificial data studied in this section, we made no attempt to improve the result for one value of M by using the best result for a neighboring value to select an initial θ. Averaged over 10 runs on an Intel Xeon 2.4-GHz E7440 CPU, using different random number seeds, our program took 1.7 seconds per run to converge for the (n = 2000, c̄ = 20) data set with M = 9, and 84 seconds for the (n = 100,000, c̄ = 80) data set with M = 9. A 9-component DM model provides 188 free parameters, and optimizing a function that yields local optima over a space of this size provides a significant challenge. Nevertheless, Table 3 suggests that when the observations are in fact generated by a DM contained in this model, our Gibbs-sampling algorithm is able to converge on the true solution given sufficient data.
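For reference, the Jensen-Shannon divergence used in Table 3 (and again in Figure 3 below) is simple to compute (a sketch, ours; it assumes strictly positive probability vectors):

    import numpy as np

    def js_divergence(r, s):
        """JS(r, s) = (1/2) sum_j [ r_j ln(2 r_j/(r_j+s_j)) + s_j ln(2 s_j/(r_j+s_j)) ]."""
        r = np.asarray(r, dtype=float)
        s = np.asarray(s, dtype=float)
        m = r + s
        return 0.5 * (np.sum(r * np.log(2.0 * r / m)) + np.sum(s * np.log(2.0 * s / m)))

    # Identical location-parameter vectors give a divergence of 0.
    print(js_divergence([0.5, 0.3, 0.2], [0.5, 0.3, 0.2]))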
5.2. Real data

To analyze real data, we consider the diverse-1216-uw data set, from the website cited above, here called D_UCSC, containing n = 314,585 columns, with an average of c̄ = 75.99 amino acids per column, and the 20-component DM dist.20comp, here called θ_UCSC, that was previously derived from this data set. Letting θ_0 be a multinomial distribution fit to the amino acid frequencies of D_UCSC, a baseline description of the data has total description length DL(D_UCSC|θ_0) + COMP(M_20, 314,585 × 75.99), in which the complexity term contributes 206 bits. Using θ_UCSC instead to describe the data, the total description length is reduced to DL(D_UCSC|θ_UCSC) + COMP(DM_{20,20}, 314,585, 75.99), in which the complexity term contributes 3461 bits. The improvement with respect to the baseline multinomial model is represented by a triangle in Figure 2.

We applied the optimization methods described above to approximate maximum-likelihood θ_M for these data. In contrast to our artificial data, real data are not produced by a true underlying DM, even though they can perhaps be well modeled by one. This means that as the quantity of available data grows, the complexity of the best model is likely to grow as well, albeit slowly, never converging to a true number of components.

FIG. 2. Decrease in total description length as a function of the number of Dirichlet components. Given the data set D_UCSC (with n = 314,585 and c̄ = 75.99), we used our Gibbs-sampling algorithm to estimate the maximum-likelihood M-component DM θ_M, for M from 2 to 38. The total description length was calculated as DL(D_UCSC|θ_M) + COMP(DM_{M,20}, n, c̄), and compared to the total description length given the multinomial model on 20 letters, M_20. The decrease in description length per amino acid of the data is plotted, and reaches its maximum at M̂ = 35. The triangle represents the decrease in description length yielded by the 20-component θ_UCSC, which was derived from D_UCSC.

Because there is likely no true DM describing the data, the objective function tends to be much flatter over parameter space than it is for artificial data, making the search for a globally optimal DM correspondingly harder. Perhaps as a result, we found that for a given M, flat initial θs often were outperformed by initial θs derived from the best current solution for adjacent values of M. Thus, our optimization strategy consisted first of finding a set of provisional solutions θ_M for the range of M explored, and then of improving these solutions through further runs in which initial θs were derived from neighboring provisional θ_M.

We graph in Figure 2 the improvement in total description length our best θ_M yield with respect to the baseline, for M from 2 to 38. θ_35 produced the greatest improvement. (On D_UCSC, an average single run for M = 35 took 512 seconds until convergence.) θ_16 yielded an improvement essentially equivalent to that of the 20-component θ_UCSC, and θ_20 yielded a somewhat greater improvement. Using the fssp select cols data set from the website above, we achieved comparable results and improvements with respect to the corresponding DM fournierfssp.20comp (data not shown).

As is evident from Figure 2, until M = 35, increasing the number of components yields a steady but generally diminishing improvement in total description length. Programs that employ DMs generally run more slowly the larger the number of components, which provides an independent reason for preferring smaller M. There are no points in Figure 2 where the implicit slope changes abruptly, but examination suggests that for this data set, θ_10, θ_14, θ_23, and θ_29, which yield improvements that are, respectively, 99.0%, 99.3%, 99.7%, and 99.9% of the optimal, may provide good tradeoffs between simplicity and accurate description of the data.

6. DISCUSSION

Although for real data no optimal solution is known with certainty, it is instructive to examine the Dirichlet components recovered, and to compare θ_M for different values of M. In Table 4, we summarize the components of our θ_10 and θ_14, ordered by the magnitude of their mixture parameters m_i. Similarly to the results of Brown et al. (1993) and Sjölander et al. (1996), most components can be seen to correspond to natural classes of amino acids. For example, component 9 of θ_10 and component 13 of θ_14 both favor the aromatic amino acids. Other components favor hydrophobic, positively charged, and negatively charged residues, and the special amino acids glycine and proline. It is also worth noting two other types of components.
First, components 1 of θ_10 and θ_14 both have very low values of α_*, indicating high probability density near the boundaries of multinomial space; the columns these components describe therefore consist heavily of single amino acids. As M grows further, a component of this type may break into multiple components, each with large α_* and corresponding to a single amino acid, similar in this way to components 8 and 10 of θ_10. Second, components 4 of θ_10 and θ_14 both have fairly high values of α_*, but do not strongly favor or disfavor any amino acid. These components describe protein positions that are probably not under strong evolutionary pressure, and accept all amino acids at roughly their background frequencies.

Table 4. Components of Dirichlet Mixtures θ_10 and θ_14

(Columns: component index i; m_i; α_{i*}; and the amino acids whose log ratio log(q_{i,j}/p_j) is > 1, > 0.5, or < -1.)

The components of θ_10 and θ_14 are ordered by decreasing value of their mixture parameters m_i. For each component, the log ratio of each amino acid's location parameter to its background frequency is calculated, and the amino acids are listed in decreasing order of log ratio if this ratio is > 1, > 0.5, or < -1.

In Figure 3, we compare θ_10's to θ_14's location parameters. Qualitatively, it is evident that as M grows, many Dirichlet components retain their essential character, as represented by the corresponding components in the first eight rows and columns of the figure. Other Dirichlet components split apart, as seen in the last two rows, which correspond to components 3 and 2 of θ_10. Finally, components of an almost completely new character can be born, as seen in the last two columns, which correspond to components 9 and 6 of θ_14.

7. CONCLUSION

In this article, we have studied several questions relevant to the inference of a Dirichlet mixture model from a gold standard set of protein alignment data. We sought to apply the MDL principle to the question of how many components a Dirichlet mixture should have. This required an evaluation of the complexity of DM models, and we accordingly developed heuristic arguments for extending to Dirichlet mixtures an analytic formula for the complexity of a single Dirichlet model. A second element needed for the application of the MDL principle is a method for approximating maximum-likelihood DMs from a given set of data. Although this problem has been addressed previously, we have described a new Gibbs-sampling approach that reduces the high-dimensional optimization problem to several tractable one-dimensional optimization problems.

FIG. 3. Comparison of location parameters for the Dirichlet mixtures θ_10 and θ_14. The location parameters q_i for each component of θ_10 are compared to those for each component of θ_14 using the measure of Jensen-Shannon divergence, described in the legend to Table 3. The indices of the components are those given in Table 4, and are reordered to make the relationships among components easier to read.

To test the efficacy of our methods, we have applied them to artificial data generated by a known Dirichlet mixture, and shown that they are able both to recover the correct number of Dirichlet components, and to converge effectively on the true DM parameters. Finally, we have applied our methods to real data, where they are able to recover DMs that describe the data more concisely than do DMs constructed using an earlier approach. It is hoped that the methods presented here will aid in the construction of improved DMs for the comparison of multiple protein sequences.

8. APPENDIX: MAXIMUM-LIKELIHOOD ESTIMATION OF THE CONCENTRATION PARAMETER

Our Gibbs-sampling approach provides us with a way to estimate the values of the mixture parameters m, and thereby to reduce the problem of finding a maximum-likelihood (m.l.) Dirichlet mixture to that of finding a m.l. Dirichlet distribution. Because of this reduction, we simplify the notation in this section by dropping the component subscript i from all relevant parameters. Let c^{(k)} be the letter count vectors associated with the n columns assigned to the bin in question, and let c_*^{(k)} be the total letter count for column k. Applying eq. (7), the log-likelihood L of the data is then given by

    \mathcal{L} = \sum_{k=1}^{n} \ln\left[ \frac{\Gamma(\alpha_*)}{\Gamma(\alpha_* + c_*^{(k)})} \prod_{j=1}^{L} \frac{\Gamma(\alpha_* q_j + c_j^{(k)})}{\Gamma(\alpha_* q_j)} \right]
              = \sum_{k=1}^{n} \left\{ \ln\Gamma(\alpha_*) - \ln\Gamma(\alpha_* + c_*^{(k)}) + \sum_{j=1}^{L} \left[ \ln\Gamma(\alpha_* q_j + c_j^{(k)}) - \ln\Gamma(\alpha_* q_j) \right] \right\}.    (8)

If the columns associated with a bin are assigned non-unitary weights w_k, one may generalize this formula by simply including the factor w_k before ln in the first summation.

Although it is possible to find the α_* and q that maximize eq. (8) by Newton's method in L-dimensional space, the computation of the gradient and the Hessian matrix is time-consuming, especially when n is large. Thus, our second heuristic idea is to reduce the maximization problem to one dimension, using first-moment information from the c^{(k)}. In short, if p is sampled from a Dirichlet distribution with parameters α, p_j follows a beta distribution with parameters (α_j, α_* - α_j).

Furthermore, conditioning on c_*^{(k)} and p_j, c_j^{(k)} follows a binomial distribution with parameters (c_*^{(k)}, p_j). This implies we can estimate q_j by

    \hat{q}_j = \frac{\sum_{k=1}^{n} c_j^{(k)}}{\sum_{k=1}^{n} c_*^{(k)}},    (9)

and replace q_j in eq. (8) by q̂_j. This leaves the single parameter α_* still to estimate. To apply Newton's method, we need the first and second derivatives of L with respect to α_*, which can be written as

    \frac{d\mathcal{L}}{d\alpha_*} = \sum_{k=1}^{n} \left\{ \psi(\alpha_*) - \psi(\alpha_* + c_*^{(k)}) + \sum_{j=1}^{L} \hat{q}_j \left[ \psi(\alpha_* \hat{q}_j + c_j^{(k)}) - \psi(\alpha_* \hat{q}_j) \right] \right\},    (10)

    \frac{d^2\mathcal{L}}{d\alpha_*^2} = \sum_{k=1}^{n} \left\{ \psi'(\alpha_*) - \psi'(\alpha_* + c_*^{(k)}) + \sum_{j=1}^{L} \hat{q}_j^2 \left[ \psi'(\alpha_* \hat{q}_j + c_j^{(k)}) - \psi'(\alpha_* \hat{q}_j) \right] \right\},    (11)

where ψ and ψ' are the digamma and trigamma functions. Given an initial α_* near the m.l. α̃_* that optimizes L, it is then simple to estimate α̃_* to great precision in a small number of iterations using Newton's method. There are rapid algorithms for calculating ψ and ψ' (Bernardo, 1976; Schneider, 1978; Spouge, 1994). Note as well that because ψ and ψ' appear in eqs. (10) and (11) only as differences, they can be replaced by rational functions when all letter counts c_j^{(k)} are integral.

Several problems may arise with this approach. First, L may be maximized in the limit only at the boundaries of (0, ∞). One may show that L is optimal as α_* approaches 0 only if each column of the data consists of a single type of letter, although this letter may vary from column to column (proof omitted). Furthermore, one may show that L is optimal as α_* approaches ∞ only if the observed letter frequencies are identical from column to column (proof omitted). It is easy to detect and allow for these special cases, but when there are more than a very small number of columns associated with a bin they essentially never arise in practice. Second, it is possible that L has multiple local maxima over (0, ∞). We conjecture that this cannot be the case, but have not been able to prove it. As shown by representative examples in Figure 4, L is very simply behaved in all cases we have observed. Third, it is possible that d²L/dα_*² may be positive for an initial value of α_*, for example if α_* were chosen greater than 5.3 in panel B of Figure 4. It is trivial to detect such cases, and replace α_* by a smaller value for which d²L/dα_*² is negative.

FIG. 4. The log-likelihood L of the data as a function of α_*. Examples are shown for four bins generated using the data set D_UCSC described in the text, with M = 10 components. (A) A bin generated after the first round of Gibbs sampling, beginning from a flat initial θ; all 10 bins in this first round yield very similar results. (B) The bin corresponding to component 1 of Table 4, after convergence. (C) The bin corresponding to component 3 of Table 4, after convergence. (D) The bin corresponding to component 8 of Table 4, after convergence. Solid vertical lines indicate the maximum-likelihood values α̃_*. Dashed vertical lines indicate the α̂_* calculated using the method of moments, and used to initiate Newton's method.
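The one-dimensional Newton iteration just described can be sketched as follows (ours; it implements eqs. (9)-(11) with scipy's digamma and trigamma functions, takes a caller-supplied starting value for α_* rather than the method-of-moments estimate described next, and can be passed to the Gibbs-sampling sketch of Section 4.1):

    import numpy as np
    from scipy.special import digamma, polygamma

    def fit_concentration(counts, q_hat=None, a_star=1.0, tol=1e-8, max_iter=100):
        """Maximum-likelihood alpha_* for one bin, holding q fixed at the moment estimate
        of eq. (9) and applying Newton's method with the derivatives of eqs. (10)-(11).
        counts: (n, L) array of column letter counts for the bin; a_star: starting value."""
        counts = np.asarray(counts, dtype=float)
        c_star = counts.sum(axis=1)
        if q_hat is None:
            q_hat = counts.sum(axis=0) / c_star.sum()        # eq. (9)
        for _ in range(max_iter):
            aq = a_star * q_hat
            d1 = (digamma(a_star) - digamma(a_star + c_star)).sum() \
                + (q_hat * (digamma(aq + counts) - digamma(aq))).sum()              # eq. (10)
            d2 = (polygamma(1, a_star) - polygamma(1, a_star + c_star)).sum() \
                + (q_hat**2 * (polygamma(1, aq + counts) - polygamma(1, aq))).sum()  # eq. (11)
            if d2 >= 0:                      # fall back to a smaller alpha_* (see text)
                a_star *= 0.5
                continue
            step = d1 / d2
            a_star = max(a_star - step, 1e-8)
            if abs(step) < tol:
                break
        return a_star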

In practice, as illustrated in Figure 4, we use the method of moments, detailed below, to initialize Newton's method. This approach yields initial α_* that tend to be near to, but somewhat smaller than, the m.l. α̃_*.

Just as we use first-moment information for the c^{(k)} to reduce the dimension of our problem, so we can use second-moment information to obtain a starting point for Newton's method. Specifically, for an individual letter j, some algebra allows us to write a method-of-moments estimate of α_* as

    \hat{\alpha}_*^{(j)} = \frac{\hat{q}_j (1 - \hat{q}_j) \, \hat{E}(c_*^2)/\hat{E}^2(c_*) \; - \; v_j}{v_j \; - \; \hat{q}_j (1 - \hat{q}_j) \, \hat{E}(c_*)/\hat{E}^2(c_*)},    (12)

where v_j = [ \hat{Var}(c_j) - \hat{q}_j^2 \, \hat{Var}(c_*) ] / \hat{E}^2(c_*). We average the α̂_*^{(j)} to obtain an initial α̂_* = Σ_{j=1}^{L} q̂_j α̂_*^{(j)}.

ACKNOWLEDGMENTS

This work was supported by the Intramural Research Program of the National Library of Medicine at the National Institutes of Health.

DISCLOSURE STATEMENT

No competing financial interests exist.

REFERENCES

Altschul, S.F., Gertz, E.M., Agarwala, R., et al. 2009. PSI-BLAST pseudocounts and the minimum description length principle. Nucleic Acids Res. 37.
Altschul, S.F., Wootton, J.C., Zaslavsky, E., et al. 2010. The construction and use of log-odds substitution scores for multiple sequence alignment. PLoS Comput. Biol. 6.
Bernardo, J.M. 1976. Algorithm AS 103: psi (digamma) function. J. R. Stat. Soc. Ser. C Appl. Stat. 25.
Brown, M., Hughey, R., Krogh, A., et al. 1993. Using Dirichlet mixture priors to derive hidden Markov models for protein families. In Hunter, L., Searls, D., and Shavlik, J., eds. Proc. First Int. Conf. Intell. Sys. Mol. Biol. AAAI Press, Menlo Park, CA.
Geman, S., and Geman, D. 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Patt. Anal. Mach. Intell. 6.
Grünwald, P.D. 2007. The Minimum Description Length Principle. MIT Press, Cambridge, MA.
Kirkpatrick, S., Gelatt, C.D., and Vecchi, M.P. 1983. Optimization by simulated annealing. Science 220.
Lawrence, C.E., Altschul, S.F., Boguski, M.S., et al. 1993. Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. Science 262.
Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., et al. 1953. Equation of state calculations by fast computing machines. J. Chem. Phys. 21.
Schneider, B.E. 1978. Algorithm AS 121: trigamma function. J. R. Stat. Soc. Ser. C Appl. Stat. 27.
Sjölander, K., Karplus, K., Brown, M., et al. 1996. Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology. Comput. Appl. Biosci. 12.
Spouge, J.L. 1994. Computation of the gamma, digamma, and trigamma functions. SIAM J. Numer. Anal. 31.
Ye, X., Yu, Y.-K., and Altschul, S.F. 2010. Compositional adjustment of Dirichlet mixture priors. J. Comput. Biol. 17.
Yu, Y.-K., and Altschul, S.F. 2011. The complexity of the Dirichlet model for multiple alignment data. J. Comput. Biol. 18 (this issue).

Address correspondence to:
Dr. Stephen F. Altschul
National Center for Biotechnology Information
National Library of Medicine
National Institutes of Health
Bethesda, MD
altschul@ncbi.nlm.nih.gov


More information

CS1800: Sequences & Sums. Professor Kevin Gold

CS1800: Sequences & Sums. Professor Kevin Gold CS1800: Sequences & Sums Professor Kevin Gold Moving Toward Analysis of Algorithms Today s tools help in the analysis of algorithms. We ll cover tools for deciding what equation best fits a sequence of

More information

BLAST: Target frequencies and information content Dannie Durand

BLAST: Target frequencies and information content Dannie Durand Computational Genomics and Molecular Biology, Fall 2016 1 BLAST: Target frequencies and information content Dannie Durand BLAST has two components: a fast heuristic for searching for similar sequences

More information

Bayesian Nonparametrics for Speech and Signal Processing

Bayesian Nonparametrics for Speech and Signal Processing Bayesian Nonparametrics for Speech and Signal Processing Michael I. Jordan University of California, Berkeley June 28, 2011 Acknowledgments: Emily Fox, Erik Sudderth, Yee Whye Teh, and Romain Thibaux Computer

More information

Supporting Online Material for

Supporting Online Material for www.sciencemag.org/cgi/content/full/309/5742/1868/dc1 Supporting Online Material for Toward High-Resolution de Novo Structure Prediction for Small Proteins Philip Bradley, Kira M. S. Misura, David Baker*

More information

Learning MN Parameters with Approximation. Sargur Srihari

Learning MN Parameters with Approximation. Sargur Srihari Learning MN Parameters with Approximation Sargur srihari@cedar.buffalo.edu 1 Topics Iterative exact learning of MN parameters Difficulty with exact methods Approximate methods Approximate Inference Belief

More information

For most practical applications, computing the

For most practical applications, computing the Efficient Calculation of Exact Mass Isotopic Distributions Ross K. Snider Snider Technology, Inc., Bozeman, Montana, USA, and Department of Electrical and Computer Engineering, Montana State University,

More information

A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models

A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models Jeff A. Bilmes (bilmes@cs.berkeley.edu) International Computer Science Institute

More information

STA 4273H: Sta-s-cal Machine Learning

STA 4273H: Sta-s-cal Machine Learning STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our

More information

Supplementary text for the section Interactions conserved across species: can one select the conserved interactions?

Supplementary text for the section Interactions conserved across species: can one select the conserved interactions? 1 Supporting Information: What Evidence is There for the Homology of Protein-Protein Interactions? Anna C. F. Lewis, Nick S. Jones, Mason A. Porter, Charlotte M. Deane Supplementary text for the section

More information

SIMU L TED ATED ANNEA L NG ING

SIMU L TED ATED ANNEA L NG ING SIMULATED ANNEALING Fundamental Concept Motivation by an analogy to the statistical mechanics of annealing in solids. => to coerce a solid (i.e., in a poor, unordered state) into a low energy thermodynamic

More information

MODULE -4 BAYEIAN LEARNING

MODULE -4 BAYEIAN LEARNING MODULE -4 BAYEIAN LEARNING CONTENT Introduction Bayes theorem Bayes theorem and concept learning Maximum likelihood and Least Squared Error Hypothesis Maximum likelihood Hypotheses for predicting probabilities

More information

Background on Coherent Systems

Background on Coherent Systems 2 Background on Coherent Systems 2.1 Basic Ideas We will use the term system quite freely and regularly, even though it will remain an undefined term throughout this monograph. As we all have some experience

More information

Learning Tetris. 1 Tetris. February 3, 2009

Learning Tetris. 1 Tetris. February 3, 2009 Learning Tetris Matt Zucker Andrew Maas February 3, 2009 1 Tetris The Tetris game has been used as a benchmark for Machine Learning tasks because its large state space (over 2 200 cell configurations are

More information

A Rothschild-Stiglitz approach to Bayesian persuasion

A Rothschild-Stiglitz approach to Bayesian persuasion A Rothschild-Stiglitz approach to Bayesian persuasion Matthew Gentzkow and Emir Kamenica Stanford University and University of Chicago December 2015 Abstract Rothschild and Stiglitz (1970) represent random

More information

Markov chain Monte Carlo methods for visual tracking

Markov chain Monte Carlo methods for visual tracking Markov chain Monte Carlo methods for visual tracking Ray Luo rluo@cory.eecs.berkeley.edu Department of Electrical Engineering and Computer Sciences University of California, Berkeley Berkeley, CA 94720

More information

Computer Vision Group Prof. Daniel Cremers. 14. Clustering

Computer Vision Group Prof. Daniel Cremers. 14. Clustering Group Prof. Daniel Cremers 14. Clustering Motivation Supervised learning is good for interaction with humans, but labels from a supervisor are hard to obtain Clustering is unsupervised learning, i.e. it

More information

Investigation into the use of confidence indicators with calibration

Investigation into the use of confidence indicators with calibration WORKSHOP ON FRONTIERS IN BENCHMARKING TECHNIQUES AND THEIR APPLICATION TO OFFICIAL STATISTICS 7 8 APRIL 2005 Investigation into the use of confidence indicators with calibration Gerard Keogh and Dave Jennings

More information

Does Better Inference mean Better Learning?

Does Better Inference mean Better Learning? Does Better Inference mean Better Learning? Andrew E. Gelfand, Rina Dechter & Alexander Ihler Department of Computer Science University of California, Irvine {agelfand,dechter,ihler}@ics.uci.edu Abstract

More information

MEME - Motif discovery tool REFERENCE TRAINING SET COMMAND LINE SUMMARY

MEME - Motif discovery tool REFERENCE TRAINING SET COMMAND LINE SUMMARY Command line Training Set First Motif Summary of Motifs Termination Explanation MEME - Motif discovery tool MEME version 3.0 (Release date: 2002/04/02 00:11:59) For further information on how to interpret

More information

Exercise 5. Sequence Profiles & BLAST

Exercise 5. Sequence Profiles & BLAST Exercise 5 Sequence Profiles & BLAST 1 Substitution Matrix (BLOSUM62) Likelihood to substitute one amino acid with another Figure taken from https://en.wikipedia.org/wiki/blosum 2 Substitution Matrix (BLOSUM62)

More information

Computational statistics

Computational statistics Computational statistics Markov Chain Monte Carlo methods Thierry Denœux March 2017 Thierry Denœux Computational statistics March 2017 1 / 71 Contents of this chapter When a target density f can be evaluated

More information

Every binary word is, almost, a shuffle of twin subsequences a theorem of Axenovich, Person and Puzynina

Every binary word is, almost, a shuffle of twin subsequences a theorem of Axenovich, Person and Puzynina Every binary word is, almost, a shuffle of twin subsequences a theorem of Axenovich, Person and Puzynina Martin Klazar August 17, 2015 A twin in a word u = a 1 a 2... a n is a pair (u 1, u 2 ) of disjoint

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Recent Advances in Bayesian Inference Techniques

Recent Advances in Bayesian Inference Techniques Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian

More information

Notes on Complexity Theory Last updated: December, Lecture 2

Notes on Complexity Theory Last updated: December, Lecture 2 Notes on Complexity Theory Last updated: December, 2011 Jonathan Katz Lecture 2 1 Review The running time of a Turing machine M on input x is the number of steps M takes before it halts. Machine M is said

More information

THE N-VALUE GAME OVER Z AND R

THE N-VALUE GAME OVER Z AND R THE N-VALUE GAME OVER Z AND R YIDA GAO, MATT REDMOND, ZACH STEWARD Abstract. The n-value game is an easily described mathematical diversion with deep underpinnings in dynamical systems analysis. We examine

More information

Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures

Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures 17th Europ. Conf. on Machine Learning, Berlin, Germany, 2006. Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures Shipeng Yu 1,2, Kai Yu 2, Volker Tresp 2, and Hans-Peter

More information

BAYESIAN STRUCTURAL INFERENCE AND THE BAUM-WELCH ALGORITHM

BAYESIAN STRUCTURAL INFERENCE AND THE BAUM-WELCH ALGORITHM BAYESIAN STRUCTURAL INFERENCE AND THE BAUM-WELCH ALGORITHM DMITRY SHEMETOV UNIVERSITY OF CALIFORNIA AT DAVIS Department of Mathematics dshemetov@ucdavis.edu Abstract. We perform a preliminary comparative

More information

X X (2) X Pr(X = x θ) (3)

X X (2) X Pr(X = x θ) (3) Notes for 848 lecture 6: A ML basis for compatibility and parsimony Notation θ Θ (1) Θ is the space of all possible trees (and model parameters) θ is a point in the parameter space = a particular tree

More information

Bayesian Models in Machine Learning

Bayesian Models in Machine Learning Bayesian Models in Machine Learning Lukáš Burget Escuela de Ciencias Informáticas 2017 Buenos Aires, July 24-29 2017 Frequentist vs. Bayesian Frequentist point of view: Probability is the frequency of

More information

THE DYNAMICS OF SUCCESSIVE DIFFERENCES OVER Z AND R

THE DYNAMICS OF SUCCESSIVE DIFFERENCES OVER Z AND R THE DYNAMICS OF SUCCESSIVE DIFFERENCES OVER Z AND R YIDA GAO, MATT REDMOND, ZACH STEWARD Abstract. The n-value game is a dynamical system defined by a method of iterated differences. In this paper, we

More information

Computational Biology

Computational Biology Computational Biology Lecture 6 31 October 2004 1 Overview Scoring matrices (Thanks to Shannon McWeeney) BLAST algorithm Start sequence alignment 2 1 What is a homologous sequence? A homologous sequence,

More information

Reducing storage requirements for biological sequence comparison

Reducing storage requirements for biological sequence comparison Bioinformatics Advance Access published July 15, 2004 Bioinfor matics Oxford University Press 2004; all rights reserved. Reducing storage requirements for biological sequence comparison Michael Roberts,

More information

CHAPTER 2 Estimating Probabilities

CHAPTER 2 Estimating Probabilities CHAPTER 2 Estimating Probabilities Machine Learning Copyright c 2017. Tom M. Mitchell. All rights reserved. *DRAFT OF September 16, 2017* *PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR S PERMISSION* This is

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Due Thursday, September 19, in class What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

Development of Stochastic Artificial Neural Networks for Hydrological Prediction

Development of Stochastic Artificial Neural Networks for Hydrological Prediction Development of Stochastic Artificial Neural Networks for Hydrological Prediction G. B. Kingston, M. F. Lambert and H. R. Maier Centre for Applied Modelling in Water Engineering, School of Civil and Environmental

More information

STA 414/2104: Machine Learning

STA 414/2104: Machine Learning STA 414/2104: Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistics! rsalakhu@cs.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 9 Sequential Data So far

More information

Experimental Uncertainty (Error) and Data Analysis

Experimental Uncertainty (Error) and Data Analysis Experimental Uncertainty (Error) and Data Analysis Advance Study Assignment Please contact Dr. Reuven at yreuven@mhrd.org if you have any questions Read the Theory part of the experiment (pages 2-14) and

More information

Roger Levy Probabilistic Models in the Study of Language draft, October 2,

Roger Levy Probabilistic Models in the Study of Language draft, October 2, Roger Levy Probabilistic Models in the Study of Language draft, October 2, 2012 224 Chapter 10 Probabilistic Grammars 10.1 Outline HMMs PCFGs ptsgs and ptags Highlight: Zuidema et al., 2008, CogSci; Cohn

More information

bound on the likelihood through the use of a simpler variational approximating distribution. A lower bound is particularly useful since maximization o

bound on the likelihood through the use of a simpler variational approximating distribution. A lower bound is particularly useful since maximization o Category: Algorithms and Architectures. Address correspondence to rst author. Preferred Presentation: oral. Variational Belief Networks for Approximate Inference Wim Wiegerinck David Barber Stichting Neurale

More information

An Evolution Strategy for the Induction of Fuzzy Finite-state Automata

An Evolution Strategy for the Induction of Fuzzy Finite-state Automata Journal of Mathematics and Statistics 2 (2): 386-390, 2006 ISSN 1549-3644 Science Publications, 2006 An Evolution Strategy for the Induction of Fuzzy Finite-state Automata 1,2 Mozhiwen and 1 Wanmin 1 College

More information

Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood

Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Stat 542: Item Response Theory Modeling Using The Extended Rank Likelihood Jonathan Gruhl March 18, 2010 1 Introduction Researchers commonly apply item response theory (IRT) models to binary and ordinal

More information

An Introduction to Sequence Similarity ( Homology ) Searching

An Introduction to Sequence Similarity ( Homology ) Searching An Introduction to Sequence Similarity ( Homology ) Searching Gary D. Stormo 1 UNIT 3.1 1 Washington University, School of Medicine, St. Louis, Missouri ABSTRACT Homologous sequences usually have the same,

More information

Ahlswede Khachatrian Theorems: Weighted, Infinite, and Hamming

Ahlswede Khachatrian Theorems: Weighted, Infinite, and Hamming Ahlswede Khachatrian Theorems: Weighted, Infinite, and Hamming Yuval Filmus April 4, 2017 Abstract The seminal complete intersection theorem of Ahlswede and Khachatrian gives the maximum cardinality of

More information

The Ising model and Markov chain Monte Carlo

The Ising model and Markov chain Monte Carlo The Ising model and Markov chain Monte Carlo Ramesh Sridharan These notes give a short description of the Ising model for images and an introduction to Metropolis-Hastings and Gibbs Markov Chain Monte

More information

Curve Fitting Re-visited, Bishop1.2.5

Curve Fitting Re-visited, Bishop1.2.5 Curve Fitting Re-visited, Bishop1.2.5 Maximum Likelihood Bishop 1.2.5 Model Likelihood differentiation p(t x, w, β) = Maximum Likelihood N N ( t n y(x n, w), β 1). (1.61) n=1 As we did in the case of the

More information

Variational Principal Components

Variational Principal Components Variational Principal Components Christopher M. Bishop Microsoft Research 7 J. J. Thomson Avenue, Cambridge, CB3 0FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop In Proceedings

More information

Chapter 20. Deep Generative Models

Chapter 20. Deep Generative Models Peng et al.: Deep Learning and Practice 1 Chapter 20 Deep Generative Models Peng et al.: Deep Learning and Practice 2 Generative Models Models that are able to Provide an estimate of the probability distribution

More information

34.1 Polynomial time. Abstract problems

34.1 Polynomial time. Abstract problems < Day Day Up > 34.1 Polynomial time We begin our study of NP-completeness by formalizing our notion of polynomial-time solvable problems. These problems are generally regarded as tractable, but for philosophical,

More information

Scribe Notes for Parameterized Complexity of Problems in Coalition Resource Games

Scribe Notes for Parameterized Complexity of Problems in Coalition Resource Games Scribe Notes for Parameterized Complexity of Problems in Coalition Resource Games Rajesh Chitnis, Tom Chan and Kanthi K Sarpatwar Department of Computer Science, University of Maryland, College Park email

More information

CSCE 478/878 Lecture 6: Bayesian Learning

CSCE 478/878 Lecture 6: Bayesian Learning Bayesian Methods Not all hypotheses are created equal (even if they are all consistent with the training data) Outline CSCE 478/878 Lecture 6: Bayesian Learning Stephen D. Scott (Adapted from Tom Mitchell

More information

Lecture 3: Introduction to Complexity Regularization

Lecture 3: Introduction to Complexity Regularization ECE90 Spring 2007 Statistical Learning Theory Instructor: R. Nowak Lecture 3: Introduction to Complexity Regularization We ended the previous lecture with a brief discussion of overfitting. Recall that,

More information

Final Exam, Machine Learning, Spring 2009

Final Exam, Machine Learning, Spring 2009 Name: Andrew ID: Final Exam, 10701 Machine Learning, Spring 2009 - The exam is open-book, open-notes, no electronics other than calculators. - The maximum possible score on this exam is 100. You have 3

More information

Physics 403. Segev BenZvi. Numerical Methods, Maximum Likelihood, and Least Squares. Department of Physics and Astronomy University of Rochester

Physics 403. Segev BenZvi. Numerical Methods, Maximum Likelihood, and Least Squares. Department of Physics and Astronomy University of Rochester Physics 403 Numerical Methods, Maximum Likelihood, and Least Squares Segev BenZvi Department of Physics and Astronomy University of Rochester Table of Contents 1 Review of Last Class Quadratic Approximation

More information

Efficient Cryptanalysis of Homophonic Substitution Ciphers

Efficient Cryptanalysis of Homophonic Substitution Ciphers Efficient Cryptanalysis of Homophonic Substitution Ciphers Amrapali Dhavare Richard M. Low Mark Stamp Abstract Substitution ciphers are among the earliest methods of encryption. Examples of classic substitution

More information

A Rothschild-Stiglitz approach to Bayesian persuasion

A Rothschild-Stiglitz approach to Bayesian persuasion A Rothschild-Stiglitz approach to Bayesian persuasion Matthew Gentzkow and Emir Kamenica Stanford University and University of Chicago January 2016 Consider a situation where one person, call him Sender,

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

CHAPTER-17. Decision Tree Induction

CHAPTER-17. Decision Tree Induction CHAPTER-17 Decision Tree Induction 17.1 Introduction 17.2 Attribute selection measure 17.3 Tree Pruning 17.4 Extracting Classification Rules from Decision Trees 17.5 Bayesian Classification 17.6 Bayes

More information

CS168: The Modern Algorithmic Toolbox Lecture #6: Regularization

CS168: The Modern Algorithmic Toolbox Lecture #6: Regularization CS168: The Modern Algorithmic Toolbox Lecture #6: Regularization Tim Roughgarden & Gregory Valiant April 18, 2018 1 The Context and Intuition behind Regularization Given a dataset, and some class of models

More information

Algorithmisches Lernen/Machine Learning

Algorithmisches Lernen/Machine Learning Algorithmisches Lernen/Machine Learning Part 1: Stefan Wermter Introduction Connectionist Learning (e.g. Neural Networks) Decision-Trees, Genetic Algorithms Part 2: Norman Hendrich Support-Vector Machines

More information

, Department of Computer and Communication Sciences, University of Michigan, Ann Arbor, Michigan

, Department of Computer and Communication Sciences, University of Michigan, Ann Arbor, Michigan SIAM J. COMPUT. Vol. 13, No. 4, November 1984 (C) 1984 Society for Industrial and Applied Mathematics 002 EQUIVALENCE RELATIONS, INVARIANTS, AND NORMAL FORMS* ANDREAS BLASS" AND YURI GUREVICH* Abstract.

More information

Another Walkthrough of Variational Bayes. Bevan Jones Machine Learning Reading Group Macquarie University

Another Walkthrough of Variational Bayes. Bevan Jones Machine Learning Reading Group Macquarie University Another Walkthrough of Variational Bayes Bevan Jones Machine Learning Reading Group Macquarie University 2 Variational Bayes? Bayes Bayes Theorem But the integral is intractable! Sampling Gibbs, Metropolis

More information

6 Markov Chain Monte Carlo (MCMC)

6 Markov Chain Monte Carlo (MCMC) 6 Markov Chain Monte Carlo (MCMC) The underlying idea in MCMC is to replace the iid samples of basic MC methods, with dependent samples from an ergodic Markov chain, whose limiting (stationary) distribution

More information

Approximating the Partition Function by Deleting and then Correcting for Model Edges (Extended Abstract)

Approximating the Partition Function by Deleting and then Correcting for Model Edges (Extended Abstract) Approximating the Partition Function by Deleting and then Correcting for Model Edges (Extended Abstract) Arthur Choi and Adnan Darwiche Computer Science Department University of California, Los Angeles

More information