MACHINE LEARNING INTRODUCTION: STRING CLASSIFICATION

THOMAS MAILUND

Machine learning means different things to different people, and there is no generally agreed-upon core set of algorithms that must be learned. In this class we will therefore not focus so much on specific algorithms or machine learning models, but rather give an introduction to the overall approach to using machine learning in bioinformatics, as we see it.

To us, the core of machine learning boils down to three things: 1) building computer models to capture some desired structure of the data you are working on, 2) training such models on existing data to optimise them as well as we can, and 3) using them to make predictions on new data.

In these lecture notes we start with some toy examples illustrating these steps. Later you will see a concrete example of this when building a gene finder using a hidden Markov model. At the end of the class you will see algorithms that do not quite follow the framework in these notes, just to see that there are other approaches.

1. Classifying strings

To illustrate the three core tasks mentioned above, we use a toy example where we want to classify strings as coming from one class of strings rather than another. It is a very simple example, and probably not quite an approach we would actually take in a real application. It does, though, illustrate many of the core ideas you will see when you work on the hidden Markov model project later in the class.

The setup we imagine is this: we somehow get strings that are generated from one of two processes, and given a string we want to classify it according to which process it comes from. To do this, we have to 1) build a model that captures strings, 2) train this model to classify strings, and 3) use the model on new strings. Going through the example we'll switch 2) and 3), though; we need to know how to actually classify strings using the model before we can train the model to do it. Anyway, those are the tasks.

2. Modelling strings from different processes

By modelling we mean constructing an algorithm or some mathematics we can apply to our data. Think of it as constructing some function, $f$, that maps a data point, $x$, to some value $y = f(x)$. In the general case, both $x$ and $y$ can be vectors. A good model is a function $f$ that extracts the relevant features of the input, $x$, and gives us a $y$ we can use to make predictions about $x$; in this case that just means that $y$ should be something we can use to classify $x$. That's a bit abstract, but in our string classification problem it simply means that we want to construct a function that, given a string, gives us a classification.

2.1. Modelling, probabilities, and likelihoods.

In machine learning we are rarely so lucky that we can get perfect models, that is, models that classify correctly with 100% accuracy. So we cannot expect that $f$ will always give us a perfect $y$; at best we can hope for a good $f$. We need to quantify what "good" means, in order to know exactly how good a model we have, to compare two models to know which is better, and to optimise a model to be as good as we can make it.

Probability theory and statistics give us a very strong framework to measure how good a given model is, and general approaches we can use to train models. It is not the only approach to machine learning, but practically all classical machine learning models and algorithms can be framed in terms of probabilistic models and statistical inference, so as a basic framework it is very powerful.

For a probabilistic model of strings from two different classes, we can look at the joint probability of seeing a string $x \in \Sigma^*$ from class $C_i$: $\Pr(x, C_i)$. In section 3, Classifying strings, we will see how to classify strings from this, but for now let us just consider how to specify such a probability.

In general we will build models with parameters we can tweak to fit them to data, so rather than specifying a single probability $\Pr(x, C_i)$ we have a whole class of probabilities indexed by parameters $\theta$: $\Pr(x, C_i ; \theta)$, where $\theta$ can be continuous or discrete, a single value or an arbitrarily long vector of values, whatever we come up with for our model. Training our model will boil down to picking a good parameter point, $\hat\theta$, where we can then use the function

$$(x, C_i) \mapsto \Pr(x, C_i ; \hat\theta)$$

for classifying $x$. This function we call the probability of $(x, C_i)$ given parameters $\hat\theta$ (and implicitly given the assumed model). If we imagine keeping the data point fixed instead, at some point $(\hat x, \hat C_i)$, we have a function mapping parameters to values:

$$\theta \mapsto \Pr(\hat x, \hat C_i ; \theta).$$
This we call the likelihood of $\theta$ given the data $(\hat x, \hat C_i)$, and we sometimes write this $\mathrm{lhd}(\theta ; \hat x, \hat C_i)$ instead of $\Pr(\hat x, \hat C_i ; \theta)$. The only difference between the probability of the data given the parameters and the likelihood of the parameters given the data is which part we keep fixed and which we vary. We do require that $\Pr(x, C_i ; \theta)$ is a probability distribution over $(x, C_i)$, which means that the sum over all possible values of $x$ and $C_i$ (or the integral, if we had continuous variables) must be 1, while we do not require that summing (or integrating) over all possible values of $\theta$ should give 1.

2.2. Modelling strings from two classes.

How exactly to define a probability like $\Pr(x, C_i ; \theta)$ is often subjective and somewhat arbitrary, and there is rarely one right way of doing it. So it is somewhat like programming: there are many ways you can solve a problem and you can be more or less creative about it. There are some general strategies that are often useful, but it always depends on the application and there are no guarantees that these strategies will work. You just have to try and see how it goes.

One strategy that is often successful when we want to classify data is to look at the probability of a given data point conditional on the class, that is, the probability $\Pr(x \mid C_i ; \theta)$. We will specify the probability of a string $x$ in each of the two classes, $C_1$ and $C_2$, and then use the differences in these probabilities to decide which class $x$ most likely comes from.

Since we are unlikely to guess the true model in any real application of machine learning, constructing models all boils down to constructing something that is fast to compute and good enough for our purpose. As with any programming task, it makes sense to start simple.

For a string $x = x_1 x_2 \cdots x_k$ we need to define $\Pr(x_1, x_2, \ldots, x_k \mid C_i ; \theta)$. A simple model assumes first that the letters in $x$ are independent and second that the probability of seeing a given letter $a \in \Sigma$ is independent of the index in $x$. With these assumptions, the probability of the string, which is the joint probability of the letters in the string, becomes

$$\Pr(x_1 \mid C_i ; \theta) \, \Pr(x_2 \mid C_i ; \theta) \cdots \Pr(x_k \mid C_i ; \theta).$$

The parameters of such a model could specify the probability of seeing each letter in the alphabet, $\Pr(a \mid C_i ; \theta)$. Let $\theta = (p^{(1)}, p^{(2)})$ where $p^{(i)}$ is a vector indexed by the letters $a$ in our alphabet, with $\sum_{a \in \Sigma} p^{(i)}_a = 1$. We use $p^{(i)}$ as the distribution of letters in class $C_i$ and consider it a parameter we can fit to the data when we later train the model. To compute the probability of any given string, assuming it came from class $C_i$, you simply look up the entries $x[j]$, $j = 1, \ldots, k$, in $p^{(i)}$ and multiply them together:

$$\Pr(x = x_1 x_2 \cdots x_k \mid C_i ; \theta) = \prod_{j=1}^{k} p^{(i)}_{x[j]}.$$

Since $\sum_{a \in \Sigma} p^{(i)}_a = 1$ we have $|\Sigma| - 1$ free parameters for each of the two classes, and for each choice of parameters we get slightly different distributions over strings from the two classes. It is through the differences between $p^{(1)}$ and $p^{(2)}$ that we will be able to classify a string $x$.

Now, whether this is a good model for our application depends a lot on what the real data looks like. It might not capture important structure in the real data. For instance, the assumption that the letter probability is independent of the index in the string might be incorrect (and we will see a model where the distribution depends on the index in next week's lectures), or the letters might not be independent of each other (we will see an example of this when we work with hidden Markov models).
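The independent-letter model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a letter distribution is represented as a dictionary from letter to probability, and the names `simulate_string`, `string_probability`, `p1` and `p2`, as well as the numbers in the two distributions, are made up for the example.

```python
import math
import random

def simulate_string(p, k, rng=random):
    """Draw k letters independently from the letter distribution p (dict: letter -> probability)."""
    letters = list(p)
    weights = [p[a] for a in letters]
    return "".join(rng.choices(letters, weights=weights, k=k))

def string_probability(x, p):
    """Pr(x | C_i ; theta): the product of the letter probabilities p[x[j]]."""
    return math.prod(p[a] for a in x)

# Hypothetical letter distributions for the two classes (illustrative values only).
p1 = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}
p2 = {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}

x = simulate_string(p1, 10, random.Random(1))
print(x, string_probability(x, p1), string_probability(x, p2))
```

A string simulated from `p1` will usually, but not always, get a higher probability under `p1` than under `p2`; how often that happens depends on how different the two distributions are and on the string length.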
Deciding whether you have made a good model is often a question of simulating data under your constructed model and comparing it with real data to see if there are large differences. If there are, you should improve your model to fit the data better, but quite often simple models are good enough for our application.

Of course, even if this model is a completely accurate model of the real data it doesn't mean that we are going to be able to easily classify strings. If you are equally likely to see each letter in the alphabet whether a string comes from $C_1$ or $C_2$ then each class will give roughly the same probability to each string $x$ and this model will not be able to distinguish them.

Nevertheless, this is going to be our model for string classification.

2.3. Exercises.

Write a function in your preferred programming language that simulates strings of length $k$ given a vector of letter probabilities $p$, and another function that, given a string and a vector of letter probabilities, computes the probability of the string.

Use the simulator to simulate a string and compute the probability of that string both using the true $p$ you used when simulating and some different probability vector $p'$. You can measure how far $p'$ is from $p$ using the Kullback-Leibler divergence

$$D_{KL}(p \,\|\, p') = \sum_a p_a \log\left(\frac{p_a}{p'_a}\right)$$

(http://en.wikipedia.org/wiki/kullbackleibler_divergence). If you plot this distance between $p$ and $p'$ against the ratio between the probability of $x$ under the two models, $\Pr(x ; p) / \Pr(x ; p')$, what happens? [1] (You might want to try this with a number of different simulated strings, since choosing random strings gives different results each time.)

Intuitively, you would expect longer strings to contain more information about the process that generated them than shorter strings do. What happens with $\Pr(x ; p)$ and $\Pr(x ; p')$ when you simulate longer and longer strings? Try plotting $\Pr(x ; p') / \Pr(x ; p)$ against the length of $x$. Again, for each string length you might want to sample several strings to take stochastic variation into account.

If you simulate long strings you will probably quickly run into underflow problems. You avoid this if you compute the log-likelihood instead of the likelihood, i.e.

$$\log \Pr(x ; p) = \sum_{j=1}^{|x|} \log\left(p_{x[j]}\right)$$

instead of

$$\Pr(x ; p) = \prod_{j=1}^{|x|} p_{x[j]}$$

and if you do that you want to look at the difference $\log \Pr(x ; p') - \log \Pr(x ; p)$ instead of the ratio $\Pr(x ; p') / \Pr(x ; p)$.

If, instead of a single string $x$, you had a set of strings $D = \{x_1, x_2, \ldots, x_n\}$, how would you write the probability of the set $D$ coming from the distribution $p$? If you do the exercises above with a set of strings rather than a single string, what changes?

3. Classifying strings

Now, what we wanted to build was a model that, given a string $x$, would tell us if $x$ came from $C_1$ or $C_2$. From the model of $\Pr(x \mid C_i ; \theta)$ we developed above we therefore want to get a function $x \mapsto \Pr(C_i \mid x ; \theta)$ instead. If we know $\Pr(C_i \mid x ; \theta)$ we would classify $x$ as belonging to class $C_i$ if $\Pr(C_i \mid x ; \theta)$ is high enough. If we have two classes to choose from, this typically means that we would classify $x$ as coming from $C_1$ if $\Pr(C_1 \mid x ; \theta) > 0.5$ and classify it as coming from $C_2$ otherwise.
We don't always have to classify $x$, though, and sometimes we might have an application where we should only classify if we are relatively certain that we are right. This we also get from knowing $\Pr(C_i \mid x ; \theta)$ since we then simply require that the support for the chosen class is high enough, and refrain from classifying strings where there is not high enough probability for a single class.

[1] The probability of $x$ gets exponentially smaller as the length increases (do you see why?) so comparing different lengths can be difficult. The ratio here shows how probable $x$ is under one model relative to the other, and while both numerator and denominator shrink exponentially the fraction still tells you the relative support of one model compared to the other. This particular ratio is called the likelihood ratio since it is just another way of writing $\mathrm{lhd}(p ; x) / \mathrm{lhd}(p' ; x)$. If we think of $p$ and $p'$ as two different models, rather than two different parameter points, it is called the Bayes factor (http://en.wikipedia.org/wiki/bayes_factor).

We get the formula we want from Bayes' formula

$$\Pr(B \mid A) = \frac{\Pr(A \mid B) \Pr(B)}{\Pr(A)}$$

which for our example means

$$\Pr(C_i \mid x ; \theta) = \frac{\Pr(x \mid C_i ; \theta) \Pr(C_i ; \theta)}{\Pr(x ; \theta)}$$

which introduces two new probabilities: $\Pr(C_i ; \theta)$ and $\Pr(x ; \theta)$. We can compute $\Pr(x ; \theta)$ from the other two probabilities since

$$\Pr(x ; \theta) = \Pr(x \mid C_1 ; \theta) \Pr(C_1 ; \theta) + \Pr(x \mid C_2 ; \theta) \Pr(C_2 ; \theta)$$

assuming that there are only the two classes $C_1$ and $C_2$. [2] The other probability, $\Pr(C_i ; \theta)$, we have to specify.

The probability $\Pr(C_i ; \theta)$ is independent of $x$ and can be thought of as how likely it is that any given string would be chosen from that class in the first place. This can just be another parameter of our model, $\pi$, such that the set of parameters is now $\theta = (\pi, p^{(1)}, p^{(2)})$ where $p^{(i)}$, $i = 1, 2$, are the probabilities of the letters for the two classes as before, and $\Pr(C_1 ; \theta) = \pi$ and $\Pr(C_2 ; \theta) = 1 - \pi$.

The parameter $\pi$ is something we must set, either explicitly or by training it from data as we see in the next section. For now, let us just consider what the functions $\Pr(C_i ; \theta)$ and $\Pr(x \mid C_i ; \theta)$ tell us, and how they help us pick the right class for a string $x$.

The so-called prior probability, $\Pr(C_i ; \theta)$, is a probability that describes how likely we think it is that class $C_i$ produces a string to begin with. If we think that $C_1$ and $C_2$ are equally likely to produce strings it doesn't matter so much when going from $\Pr(x \mid C_i ; \theta)$ to $\Pr(C_i \mid x ; \theta)$, but if we expect, for example, only one in a hundred strings to come from $C_1$, we would need more evidence that a specific string, $x$, is likely to have come from $C_1$ if we want to classify it as such.
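As a rough sketch of how Bayes' formula turns the two letter distributions and the prior $\pi$ into a posterior class probability, the Python below computes $\Pr(C_1 \mid x ; \theta)$. Working with logarithms is an assumption made here to sidestep the underflow issue for long strings; the function names and the dictionary representation of a letter distribution are hypothetical, and the sketch assumes $0 < \pi < 1$ and strictly positive letter probabilities.

```python
import math

def log_prob(x, p):
    """log Pr(x | C_i): sum of the log letter probabilities."""
    return sum(math.log(p[a]) for a in x)

def posterior_c1(x, pi, p1, p2):
    """Pr(C_1 | x ; theta) from Bayes' formula, with theta = (pi, p1, p2)."""
    log_joint_1 = math.log(pi) + log_prob(x, p1)        # log Pr(x | C_1) Pr(C_1)
    log_joint_2 = math.log(1.0 - pi) + log_prob(x, p2)  # log Pr(x | C_2) Pr(C_2)
    # Subtract the larger term before exponentiating so at least one term is exp(0) = 1.
    shift = max(log_joint_1, log_joint_2)
    j1 = math.exp(log_joint_1 - shift)
    j2 = math.exp(log_joint_2 - shift)
    return j1 / (j1 + j2)  # the denominator plays the role of Pr(x ; theta)
```

With equal letter distributions in the two classes the string carries no information, and the function simply returns the prior $\pi$; the more the distributions differ, the more the string can move the posterior away from the prior.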
The probability of the string given the class, $\Pr(x \mid C_i ; \theta)$, on the other hand tells us how likely it is that class $C_i$ would produce the string $x$. For that reason we can call it the likelihood, although we have already used that term for the probability as a function of the parameters $\theta$. Still, you might sometimes see it called the likelihood, and in most ways it behaves like a likelihood, where as you recall $\mathrm{lhd}(\theta ; x)$ is just a way of saying $\Pr(x ; \theta)$. If you think of $C_i$ as a parameter of the model rather than a stochastic variable we condition on, you see the resemblance. [3]

If $\Pr(x \mid C_1 ; \theta) \gg \Pr(x \mid C_2 ; \theta)$, that is, $C_1$ is much more likely to produce the string $x$ than $C_2$ is, then observing $x$ weighs the odds towards $C_1$ rather than $C_2$. So even if we a priori thought that we would only see a string from $C_1$ one time in a hundred, if $\Pr(x \mid C_1 ; \theta)$ is a thousand times higher than $\Pr(x \mid C_2 ; \theta)$, then after observing $x$ it would still be more likely that it came from $C_1$ (the odds in favour of $C_1$ become $1000 \times \tfrac{1}{99} \approx 10$).

[2] This follows from how we calculate with probabilities. We can marginalise over some of the variables in a joint distribution, so $\Pr(A) = \sum_i \Pr(A, B_i)$, and by definition of conditional distributions $\Pr(A, B) = \Pr(A \mid B) \Pr(B)$.

[3] I have been careful to distinguish between conditional probabilities, $\Pr(A \mid B)$, and parameterised distributions, $\Pr(A ; \theta)$, but in all the arithmetic we do there really isn't much of a difference. A conditional probability is just a parameterised distribution, and the only difference between having a conditional distribution and a parameterised distribution is whether we think parameters can be thought of as stochastic and having a distribution or not. This philosophical distinction is the difference between Bayesian and Frequentist statistics.

It is by combining the prior probability of seeing the class $C_i$ with how likely it is to produce the string we observe that we get the posterior probability of $C_i$: $\Pr(C_i \mid x ; \theta)$.

We often write this intuition in the following form:

$$\frac{\Pr(C_1 \mid x ; \theta)}{\Pr(C_2 \mid x ; \theta)} = \frac{\Pr(x \mid C_1 ; \theta)}{\Pr(x \mid C_2 ; \theta)} \cdot \frac{\Pr(C_1 ; \theta)}{\Pr(C_2 ; \theta)}$$

and you can think of $\Pr(C_1 ; \theta) / \Pr(C_2 ; \theta)$ as the prior odds, that is, the odds of seeing something from $C_1$ rather than $C_2$ to begin with, and of $\Pr(C_1 \mid x ; \theta) / \Pr(C_2 \mid x ; \theta)$ as the posterior odds, that is, the odds that the $x$ you saw came from $C_1$ rather than $C_2$. If $C_1$ is unlikely to happen to begin with, the prior odds are small. However, if we then observe a string that $C_1$ is very likely to produce and $C_2$ is unlikely to produce, the odds change. The stronger the prior odds are against $C_1$, the more evidence we demand to see before we select $C_1$ over $C_2$.

Since

$$\frac{\Pr(C_1 \mid x ; \theta)}{\Pr(C_2 \mid x ; \theta)} > 1 \iff \Pr(C_1 \mid x ; \theta) > \Pr(C_2 \mid x ; \theta)$$

we would classify $x$ as coming from $C_1$ if the posterior odds are higher than 1 (or sufficiently higher than 1 if we want to avoid less certain cases) and classify it as coming from $C_2$ if the posterior odds are below 1. If the posterior odds are exactly 1 it is probably best not to make a decision.

3.1. Exercises.

Pick two letter distributions, $p$ and $p'$, and simulate $n$ strings of length $k$ from each. Classify a string $x$ as class $C_1$ if $\Pr(C_1 \mid x ; \theta) > 0.5$, computed with equal prior probabilities for the two classes (equivalently, if $\Pr(x \mid C_1 ; \theta) > \Pr(x \mid C_2 ; \theta)$), and as $C_2$ otherwise, and measure how well you do (how many strings you assign to the right class divided by the number of strings, $2n$). How well do you classify as a function of how far $p$ is from $p'$? How well do you classify as a function of the length of the strings?

Now simulate strings by first randomly choosing $p$ or $p'$, so that you choose $p$ with some probability $\pi$. Classify the strings both as above and by using their posterior odds (or posterior probabilities, whichever you prefer; it gives you the same result). Compare the accuracy of the classification when the prior probabilities / prior odds are taken into account versus when they are not. Plot the accuracy with both approaches as a function of $\pi$.

4. Training the string classifier

Finally we come to training the model, that is, how to set the parameters of the model $\theta = (\pi, p^{(1)}, p^{(2)})$. We of course want to choose the parameters in such a way that we maximise the probability of correctly classifying a new string that might show up. We don't know which strings we are likely to see, however, nor which classes they come from, and until we actually have a set of parameters we cannot even make educated guesses about it. Just saying that we want to optimise how well we can do in the future is thus not something we can tell a computer to do; we need some algorithm for setting the parameters, such that at least we are likely to get a good classifier for future data.

One can show as a general result that if you have the right model, then on average you cannot do better than using the true parameters. That is, the best average performance you can achieve on new data you get if you use the true parameters of the model. If we don't really have the right model all bets are off, and unfortunately this is almost always the case. It's a bit of a fatalistic thought, though, so we are going to assume that we have the right model (and if not we always go back to modelling to at least get one that is as close as possible), because if we have the right model we have some general approaches to estimating the true parameters.

If we do not have any data to work with, we can do nothing but guess at the parameters, but typically we can get a set of data $D = \{(x_1, t_1), (x_2, t_2), \ldots, (x_n, t_n)\}$ of data points $x_j$ and targets $t_j$; in our application strings $x_j \in \Sigma^*$ and associated classes $t_j \in \{C_1, C_2\}$. From this data we need to set the parameters.

There are three approaches that are frequently used, and many machine learning algorithms are just concrete algorithms for one of these general approaches. They are not always the best choice, but always a good choice, and unless you can show that an alternative can do better you should use one of these. The approaches are: (1) you maximise the likelihood, (2) you maximise the posterior, (3) or you make predictions using a posterior distribution; see equations (1), (2) and (3) below. The first is a Frequentist approach (http://en.wikipedia.org/wiki/frequentist_inference) while the second and third are Bayesian (http://en.wikipedia.org/wiki/Bayesian_inference).
We will only use the first two in this class but just mention the third in case you run into it in the future.

4.1. Maximum likelihood estimates.

Maximum likelihood estimation (http://en.wikipedia.org/wiki/maximum_likelihood), as the name suggests, is based on maximising the likelihood function $\mathrm{lhd}(\theta ; D) = \Pr(D ; \theta)$ with respect to the parameters $\theta$:

$$\hat\theta_{\mathrm{MLE}} = \operatorname{argmax}_\theta \Pr(D ; \theta) \tag{1}$$

This you do in whatever fashion you can, just as with maximising any other function. Sometimes you can do this analytically by setting the derivative to zero, $\frac{\partial}{\partial\theta} \mathrm{lhd}(\theta ; D) = 0$ (when $\theta$ is a vector you set the gradient to zero, $\nabla \mathrm{lhd} = 0$), or, more often, the derivative of the log likelihood, since that is often easier to differentiate. Sometimes there are constraints on what values $\theta$ can legally take, which complicates this slightly, and sometimes we simply cannot solve this analytically. So, not surprisingly, there are algorithms and heuristics in the literature for how to maximise this function for specific machine learning methods. The Baum-Welch algorithm you will see for hidden Markov models is one such algorithm and an instance of a general class of optimisation algorithms called Expectation-Maximisation or EM. EM is a numerical optimisation that is guaranteed to find a local but not necessarily a global maximum. Quite often you have to use heuristics and numerical algorithms to optimise the likelihood.
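To make the idea of numerically maximising a log likelihood concrete, here is a minimal sketch for the simplest parameter in our model, the class probability $\pi$. The coarse grid search is purely an illustrative assumption (it is not a method from these notes and would not scale to many parameters), and the function names and counts are made up. For this particular likelihood the maximum can also be found analytically as the fraction of observations from $C_1$, which gives something to check the numerical answer against.

```python
import math

def log_likelihood(pi, n1, n2):
    """Log likelihood of pi after seeing n1 strings from C_1 and n2 from C_2."""
    return n1 * math.log(pi) + n2 * math.log(1.0 - pi)

def grid_mle(n1, n2, steps=1000):
    """Pick the grid point in (0, 1) with the highest log likelihood."""
    grid = [i / steps for i in range(1, steps)]  # excludes 0 and 1, where the log is undefined
    return max(grid, key=lambda pi: log_likelihood(pi, n1, n2))

# Hypothetical counts: 30 strings from C_1 and 70 from C_2.
print(grid_mle(30, 70), 30 / (30 + 70))
```

The two printed numbers agree up to the resolution of the grid.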

Intuitively, maximising $\Pr(D ; \theta)$ is a sensible thing to do; you are picking the parameters that make the data you have observed most likely. There are also more theoretical properties of the maximum likelihood estimates that make them a good choice, not least that (under mild regularity conditions) they are guaranteed to converge towards the real parameters as the number of data points grows. They can often be biased, meaning that on average they slightly over- or under-estimate the true parameters, but this bias is guaranteed to get smaller and smaller as the number of data points grows. Still, in many algorithms you will see estimators that correct for the bias in the maximum likelihood estimator to get an unbiased estimate that still converges to the true value. We won't worry about this and just maximise likelihoods (and hope that we have enough data and are sufficiently close to the true value that we needn't worry about the bias).

4.2. Bayesian estimates.

For Bayesian estimation we treat the parameter not just as an unknown value but as a stochastic one with its own distribution. So rather than having a likelihood $\mathrm{lhd}(\theta ; D) = \Pr(D ; \theta)$ we have a conditional distribution $\mathrm{lhd}(\theta \mid D) = \Pr(D \mid \theta)$. From this we can get a posterior distribution over parameters, $\Pr(\theta \mid D)$, [4] and we can use this in two different ways in our classification model.

With a posterior distribution over parameters it makes more sense to choose the parameters with the maximal probability rather than the maximal likelihood, and we call this estimator the maximum a posteriori estimator:

$$\hat\theta_{\mathrm{MAP}} = \operatorname{argmax}_\theta \Pr(\theta \mid D) = \operatorname{argmax}_\theta \Pr(D \mid \theta) \Pr(\theta) \tag{2}$$

where in the last equality we use Bayes' rule and ignore dividing by $\Pr(D)$, since this is a constant when optimising with respect to $\theta$.

With a Bayesian approach, however, you do not need to maximise the posterior. You have a distribution of parameters, and by integrating over all possible parameters, weighted by their probability, you can make predictions as well.
So if you need to make predictions for a data point $x$, rather than using a single estimated parameter $\hat\theta$ (whether the maximum likelihood estimate or the maximum a posteriori estimate) and the probability $\Pr(x ; \hat\theta)$, you can get the probability of $x$ given all the previous data, $\Pr(x \mid D)$, using

$$\Pr(x \mid D) = \int \Pr(x \mid \theta) \Pr(\theta \mid D) \, d\theta = \frac{\int \Pr(D \mid \theta) \Pr(\theta) \Pr(x \mid \theta) \, d\theta}{\Pr(D)} \tag{3}$$

A main benefit of using Bayesian approaches is that we can alleviate some of the problems we have with the stochastic variation of estimates when we have very little data. The maximum likelihood estimates will converge to the true parameters, but when there is little data the estimate can be far from the truth just by random chance. Imagine flipping a coin to estimate the probability $p$ of seeing heads rather than tails. [5] If you flip a coin $n$ times and see $h$ heads, the maximum likelihood estimate for $p$ is $\hat p_{\mathrm{MLE}} = h/n$. As $n \to \infty$ the estimate will go to the true value, $\hat p_{\mathrm{MLE}} \to p$, but for small $n$ it can be quite far from the true value. For the extreme case $n = 1$ we have $\hat p = 0$ or $\hat p = 1$ regardless of the true value of $p$. Using a prior distribution over parameter values we can nudge the estimated parameters away from extremes, and if we have some idea about what likely values are going to look like we can capture this with the prior distribution. In the coin toss example we can use the prior to make it more likely that $p$ is around 0.5 if we a priori believe that the coin is unbiased.

[4] Strictly speaking, if $\theta$ is continuous then we have a density for it rather than a probability, but I am not going to bother making a distinction in these notes; when you see me use the notation $\Pr(x)$ for a continuous variable $x$ just substitute it with a density $f(x)$ if you prefer. Likewise, if I integrate $\int f(x) \, dx$ and $f(x)$ is discrete then read it as a sum over all values of $x$. If there are risks of confusion when doing this, I will point them out, but there very rarely are.

[5] Coin flipping is a classical example because it is simple, but it is physically very hard to get a biased coin, see http://www.stat.columbia.edu/~gelman/research/published/dicerev2.pdf.

Using the full posterior probability to make predictions, as in (3), rather than just the most likely value, as in (2), captures how certain you are in the parameter estimate. If you use the $\hat\theta_{\mathrm{MAP}}$ estimator for prediction, you predict with the same certainty regardless of how concentrated the posterior probability is around the maximum point $\hat\theta_{\mathrm{MAP}}$. Using (3) is better in the sense that it takes into account all the knowledge about the parameters that you have learned from the data. It is not always easy to get a computationally fast model where you can do it, though, so we won't see more of this in this class.

4.3. Conjugate priors.

Taking a Bayesian approach means that we have two new probabilities, $\Pr(\theta)$ and $\Pr(D)$. The latter is the probability of the data without conditioning on the parameters, something that might look odd since we have modelled the probability of data given parameters, but it comes from treating parameters as stochastic and marginalising:

$$\Pr(D) = \int \Pr(D, \theta) \, d\theta = \int \Pr(D \mid \theta) \Pr(\theta) \, d\theta$$

For maximising the posterior we don't need this probability, however, since it doesn't depend on the $\theta$ we maximise with respect to. For using the full posterior distribution (3) it is needed for normalisation, but in many cases we can avoid computing it explicitly through the integration, as we will see below.
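For the coin example, both the marginal $\Pr(D)$ and the prediction in (3) can be approximated by brute-force numerical integration. The sketch below assumes a uniform prior on $p$ (so $\Pr(p) = 1$ on $[0, 1]$) and uses a simple midpoint rule; both choices, and the function name, are assumptions made for illustration. With a uniform prior the exact answer is known to be $(h + 1)/(h + t + 2)$, which the numerical result can be compared against.

```python
def posterior_predictive_heads(h, t, steps=10_000):
    """Approximate Pr(next toss is heads | h heads, t tails) under a uniform prior on p."""
    dp = 1.0 / steps
    mids = [(i + 0.5) * dp for i in range(steps)]       # midpoints of the grid cells
    # Pr(D | p) Pr(p) for one particular sequence with h heads and t tails, Pr(p) = 1.
    weights = [p**h * (1.0 - p)**t for p in mids]
    evidence = sum(weights) * dp                        # approximates Pr(D)
    # Integrate Pr(heads | p) = p against the posterior Pr(p | D) = weights / evidence.
    return sum(p * w for p, w in zip(mids, weights)) * dp / evidence

# After a single toss that came up heads, the prediction is about 2/3, not the extreme 1.
print(posterior_predictive_heads(1, 0))
```

Note how a single observed head gives $\hat p_{\mathrm{MLE}} = 1$, while the prediction based on the full posterior stays well away from that extreme.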
The probability $\Pr(\theta)$ is something we have to provide and not something we can train from the data (unlike the likelihood, it is independent of the data, after all). So it becomes part of our modelling, and it is up to our intuition and inventiveness to come up with a good distribution.

A good choice is a so-called conjugate prior, which is a function that makes it especially easy to combine prior distributions and likelihoods into posterior distributions (http://en.wikipedia.org/wiki/conjugate_prior). The idea behind conjugate priors is that the prior is chosen such that both the prior and the posterior distribution are from the same parameterised family of functions $f(\cdot ; \xi)$, i.e. the difference between prior and posterior is the (meta-)parameter of this function: $\Pr(\theta) = f(\theta ; \xi_0)$ and $\Pr(\theta \mid D) = f(\theta ; \xi_D)$. To use conjugate priors, you then simply need a function that combines the prior meta-parameter and the data into the posterior meta-parameter: $g(\xi_0, D) = \xi_D$.

If we take coin-flipping again as an example, we can write the likelihood of the head probability $p$ from a single coin toss as

$$\mathrm{lhd}(p \mid h) = \Pr(h \mid p) = p^h (1 - p)^{1 - h}$$

where $h$ is 1 if we observe heads and 0 if we observe tails. A series of independent coin tosses will just be a product of these for different outcomes, so $n$ tosses with $h$ heads and $t = n - h$ tails has the likelihood

$$\mathrm{lhd}(p \mid h, t) = \binom{h + t}{h} p^h (1 - p)^t$$

where $\binom{h+t}{h}$ is the binomial coefficient, needed to normalise the function as a probability over the number of heads and tails.

For a conjugate prior we want a function $f(p ; \xi_0)$ such that

$$p^h (1 - p)^t \, f(p ; \xi_0) \propto f(p ; \xi_{h,t})$$

for some $\xi_{h,t}$. If we take something of the same form as the likelihood, with the meta-parameter similar to the heads and tails observations, we can define

$$f(p ; \xi_0) = C_{\alpha,\beta} \, p^\alpha (1 - p)^\beta$$

where $\xi_0 = (\alpha, \beta)$ and $C_{\alpha,\beta}$ is the normalisation constant that makes this a distribution over $p$. [6]

Combining this prior with the likelihood, ignoring for now normalising constants, we get

$$\mathrm{lhd}(p \mid h, t) \, f(p ; \alpha, \beta) \propto \left[p^h (1 - p)^t\right] \left[p^\alpha (1 - p)^\beta\right] = p^{h + \alpha} (1 - p)^{t + \beta} \propto f(p ; h + \alpha, t + \beta)$$

where the last step comes from how we define

$$f(p ; h + \alpha, t + \beta) = C_{h+\alpha,t+\beta} \, p^{h + \alpha} (1 - p)^{t + \beta} \propto p^{h + \alpha} (1 - p)^{t + \beta}.$$

Since this shows that the posterior is proportional to $f(p ; h + \alpha, t + \beta)$, and since $f(p ; h + \alpha, t + \beta)$ by definition integrates to 1 and the posterior does as well by the property of being a density, they must be equal.

So we move from prior to posterior by modifying the meta-parameters based on the observed data:

$$g(\xi_0, D) = g((\alpha, \beta), (h, t)) = (\alpha + h, \beta + t) = \xi_{h,t}.$$

The new meta-parameters can be combined with more data if we get more, and this way the distribution for $p$ can be updated each time we observe more data, using this procedure again and again.

The conjugate prior for a given likelihood of course depends on the form of the likelihood, but most standard probability distributions have a corresponding conjugate that you can look up if you need it. The conjugate prior for the coin toss above is called a Beta distribution, Beta($\alpha$, $\beta$) (with the exponents used here, $f(p ; \alpha, \beta)$ corresponds to Beta($\alpha + 1$, $\beta + 1$) in the standard parameterisation, whose density is proportional to $p^{a-1}(1-p)^{b-1}$), and a way of thinking about the meta-parameters $\alpha$ and $\beta$ is as pseudo-counts of heads and tails, respectively. Using the prior, we pretend that we have already observed $\alpha$ heads and $\beta$ tails.
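The update rule $g$ is simple enough to write down directly. The sketch below follows the parameterisation used in these notes, where the prior is proportional to $p^\alpha (1 - p)^\beta$, so the maximum of the density lies at $\alpha / (\alpha + \beta)$; the function names are hypothetical and the code assumes $\alpha + \beta > 0$.

```python
def update(meta, heads, tails):
    """g((alpha, beta), (h, t)) = (alpha + h, beta + t)."""
    alpha, beta = meta
    return (alpha + heads, beta + tails)

def map_estimate(meta):
    """Maximum of p^alpha (1 - p)^beta over p, assuming alpha + beta is positive."""
    alpha, beta = meta
    return alpha / (alpha + beta)

prior = (1, 1)                       # pretend we have already seen one head and one tail
step_by_step = update(update(prior, 2, 0), 1, 3)
all_at_once = update(prior, 3, 3)
print(step_by_step, all_at_once, map_estimate(all_at_once))
```

Updating with the data in two batches gives the same meta-parameters as updating with all of it at once, which is what makes the procedure convenient when data arrives over time.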
If we set $\alpha = \beta$ we imply that we believe that there is an equal chance of seeing heads and tails. The larger the numbers we use for $\alpha$ and $\beta$, the stronger we make the influence of the prior; if we pretend that we have seen 50 heads and 50 tails, observing 5 new tosses is not going to move our posterior distribution far away from 0.5.

As $h + t$ grows higher and higher compared to $\alpha + \beta$, the posterior probability is influenced more and more by the observed data rather than the prior, and the posterior will look more and more like just the likelihood. Consequently, the maximum a posteriori estimator will converge to the same point as the maximum likelihood estimator and thus shares the nice property that it converges to the true value. It is just potentially less sensitive to stochastic fluctuations due to little initial data.

[6] For $f(p ; \alpha, \beta)$ to be a density over $p$ it is necessary that $\int_0^1 f(p ; \alpha, \beta) \, dp = 1$, which means that $C_{\alpha,\beta} = 1 / \int_0^1 p^\alpha (1 - p)^\beta \, dp$.

4.4. Training the string classifier.

Going back to our string classification problem, we want to estimate the parameters $\pi$, the probability of seeing class $C_1$ rather than $C_2$, and the two letter distributions $p^{(1)}$ and $p^{(2)}$. Training data would be a set of strings paired with their class, $D = \{(s_1, c_1), (s_2, c_2), \ldots, (s_n, c_n)\}$.

If the training data is generated such that the probability of a string coming from $C_1$ is actually $\pi$, we can estimate $\pi$ the same way as we estimated the probability of seeing heads rather than tails with a coin toss. The maximum likelihood estimator of $\pi$ would be

$$\hat\pi_{\mathrm{MLE}} = \frac{n_{C_1}}{n}$$

where $n_{C_1}$ is the number of $c_i$ equal to $C_1$. With a Beta($\alpha$, $\beta$) prior for $\pi$ we could instead use the maximum a posteriori estimate

$$\hat\pi_{\mathrm{MAP}} = \frac{n_{C_1} + \alpha}{n + \alpha + \beta}$$

where again we see that $\alpha$ and $\beta$ work as pseudo counts for $C_1$ and $C_2$, respectively.

The strings paired with $C_1$ are independent of the strings paired with $C_2$, and $p^{(1)}$ only depends on the first set and $p^{(2)}$ only on the second. For convenience and without loss of generality we assume that $s_1, s_2, \ldots, s_m$ are the strings paired with $C_1$. The distribution $p^{(1)}$ is a multinomial distribution and the maximum likelihood estimator is

$$\hat p^{(1)}_a = \frac{n^{(1)}_a}{L^{(1)}}$$

for all $a \in \Sigma$, where $n^{(1)}_a$ is the number of times $a$ occurs in $s_1, \ldots, s_m$ and $L^{(1)} = \sum_{j=1}^{m} |s_j|$ is the total length of the strings from $C_1$. The distribution for the other class, $p^{(2)}$, is estimated in exactly the same way, but using the strings $s_{m+1}, \ldots, s_n$.

We can also add priors to multinomial distributions. The conjugate is called a Dirichlet distribution but it works just as the pseudo counts we have already seen.
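The counting estimators above translate almost directly into code. The following is a minimal sketch, assuming the training data is a list of (string, class) pairs with the classes encoded as the integers 1 and 2; that encoding, the function names, and the tiny data set are all made up for illustration.

```python
from collections import Counter

def estimate_pi(classes, alpha=0.0, beta=0.0):
    """(n_C1 + alpha) / (n + alpha + beta); alpha = beta = 0 gives the maximum likelihood estimate."""
    n_c1 = sum(1 for c in classes if c == 1)
    return (n_c1 + alpha) / (len(classes) + alpha + beta)

def estimate_letters(strings, alphabet):
    """Maximum likelihood letter distribution: letter counts divided by total string length."""
    counts = Counter("".join(strings))
    total = sum(counts[a] for a in alphabet)
    return {a: counts[a] / total for a in alphabet}

# Hypothetical training data.
data = [("AAC", 1), ("GGT", 2), ("ACA", 1), ("TTG", 2)]
pi_hat = estimate_pi([c for _, c in data])
p1_hat = estimate_letters([s for s, c in data if c == 1], "ACGT")
p2_hat = estimate_letters([s for s, c in data if c == 2], "ACGT")
print(pi_hat, p1_hat, p2_hat)
```

With this little data some letters get an estimated probability of exactly zero, so any new string containing such a letter would get probability zero under that class; this is the kind of extreme estimate that pseudo counts are meant to soften.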
If we have a pseudo count $\alpha^{(1)}_a$ for all $a \in \Sigma$ and let $\alpha^{(1)} = \sum_{a \in \Sigma} \alpha^{(1)}_a$, then the MAP estimator would be

$$\hat p^{(1)}_a = \frac{n^{(1)}_a + \alpha^{(1)}_a}{L^{(1)} + \alpha^{(1)}}.$$

4.5. Exercises.

Use your simulator from earlier to simulate data sets of strings paired with the class that produced them. Estimate $\pi$ from this data and plot the distance from your estimate, $\hat\pi$, to the simulated value, $\pi_{\mathrm{sim}}$, that is $|\hat\pi - \pi_{\mathrm{sim}}|$, as a function of the number of strings you have simulated. You want to simulate several times for each size of the data to see the stochastic variation from the simulations. Try this for several values of $\pi_{\mathrm{sim}}$. Try both the maximum likelihood estimator, $\hat\pi_{\mathrm{MLE}} = n_{C_1}/n$, and a maximum a posteriori estimator, $\hat\pi_{\mathrm{MAP}} = (n_{C_1} + \alpha)/(n + \alpha + \beta)$. Try different values of $\alpha$ and $\beta$ to see how this affects the estimation accuracy.

Simulate strings with a letter probability distribution $p_{\mathrm{sim}}$ and try to estimate this distribution. Try both the $\hat p_{\mathrm{MLE}}$ and $\hat p_{\mathrm{MAP}}$ estimation, with different pseudo counts for the maximum a posteriori estimator. Plot the Kullback-Leibler divergences between the simulated and estimated distribution as a function of the total string length simulated.

Finally, put it all together so you can simulate a set of strings, paired with classes, estimate all the parameters of this model, and then make predictions on the strings to test how well the prediction matches the simulated values.
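As a possible starting point for the last two exercises, the sketch below gives two small helpers: the Kullback-Leibler divergence between two letter distributions, and the pseudo-count estimator for a letter distribution. The dictionary representation and the function names are assumptions made for the example, and the divergence is only defined when the second distribution is positive wherever the first one is.

```python
import math
from collections import Counter

def kl_divergence(p, q):
    """D_KL(p || q) = sum over a of p[a] * log(p[a] / q[a]); terms with p[a] = 0 contribute 0."""
    return sum(p[a] * math.log(p[a] / q[a]) for a in p if p[a] > 0)

def map_letters(strings, pseudo):
    """(n_a + alpha_a) / (L + alpha), with pseudo a dict mapping each letter to its pseudo count."""
    counts = Counter("".join(strings))
    total = sum(counts[a] for a in pseudo) + sum(pseudo.values())
    return {a: (counts[a] + pseudo[a]) / total for a in pseudo}

# Hypothetical example: one short string and a pseudo count of 1 for every letter.
p_sim = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}
p_hat = map_letters(["AAC"], {"A": 1, "C": 1, "G": 1, "T": 1})
print(p_hat, kl_divergence(p_sim, p_hat))
```

Because the pseudo counts keep every estimated probability positive, the divergence from the simulated distribution to the estimate is always finite, which makes the plots in the exercise easier to produce than with the raw maximum likelihood estimates.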