MACHINE LEARNING INTRODUCTION: STRING CLASSIFICATION

THOMAS MAILUND

Machine learning means different things to different people, and there is no generally agreed-upon core set of algorithms that must be learned. In this class we will therefore not focus so much on specific algorithms or machine learning models, but rather give an introduction to the overall approach to using machine learning in bioinformatics, as we see it. To us, the core of machine learning boils down to three things: 1) building computer models to capture some desired structure of the data you are working on, 2) training such models on existing data to optimise them as well as we can, and 3) using them to make predictions on new data. In these lecture notes we start with some toy examples illustrating these steps. Later you will see a concrete example of this when building a gene finder using a hidden Markov model. At the end of the class you will see algorithms that do not quite follow the framework in these notes, just to see that there are other approaches.

1. Classifying strings

To illustrate the three core tasks mentioned above, we use a toy example where we want to classify strings as coming from one class of strings rather than another. It is a very simple example, and probably not quite an approach we would actually take in a real application, but it illustrates many of the core ideas you will see when you work on the hidden Markov model project later in the class. The setup we imagine is this: we somehow get strings that are generated by one of two processes, and given a string we want to classify it according to which process it comes from. To do this, we have to 1) build a model that captures strings, 2) train this model to classify strings, and 3) use the model on new strings. Going through the example we'll switch 2) and 3), though; we need to know how to actually classify strings using the model before we can train the model to do it. Anyway, those are the tasks.

2. Modelling strings from different processes

By modelling we mean constructing an algorithm or some mathematics we can apply to our data. Think of it as constructing some function, f, that maps a data point, x, to some value y = f(x). In the general case, both x and y can be vectors. A good model is a function f that extracts the relevant features of the input, x, and gives us a y we can use to make predictions about x; in this case that just means that y should be something we can use to classify x. That's a bit abstract, but in our string classification problem it simply means that we want to construct a function that, given a string, gives us a classification.

2.1. Modelling, probabilities, and likelihoods. In machine learning we are rarely so lucky that we can get perfect models, that is, models that classify correctly with 100% accuracy. So we cannot expect that f will always give us a perfect y; at best we can hope for a good f. We need to quantify what "good" means, in order to know exactly how good a model we have, to compare two models to know which is better, and to optimise a model to be as good as we can make it. Probability theory and statistics give us a very strong framework for measuring how good a given model is, and general approaches we can use to train models. It is not the only approach to machine learning, but practically all classical machine learning models and algorithms can be framed in terms of probabilistic models and statistical inference, so as a basic framework it is very powerful.

For a probabilistic model of strings from two different classes, we can look at the joint probability of seeing a string x ∈ Σ* from class C_i: Pr(x, C_i). In section 3, Classifying strings, we will see how to classify strings from this, but for now let us just consider how to specify such a probability. In general we will build models with parameters we can tweak to fit them to data, so rather than specifying a single Pr(x, C_i) we have a whole class of probabilities indexed by parameters θ: Pr(x, C_i ; θ), where θ can be continuous or discrete, a single value or an arbitrarily long vector of values, whatever we come up with for our model. Training our model will boil down to picking a good parameter point, θ̂, after which we can use the function (x, C_i) ↦ Pr(x, C_i ; θ̂) for classifying x. This function we call the probability of (x, C_i) given parameters θ̂ (and implicitly given the assumed model). If we instead imagine keeping the data point fixed, at some point (x̂, Ĉ_i), we have a function mapping parameters to values: θ ↦ Pr(x̂, Ĉ_i ; θ). This we call the likelihood of θ given the data (x̂, Ĉ_i), and we sometimes write it lhd(θ ; x̂, Ĉ_i) instead of Pr(x̂, Ĉ_i ; θ). The only difference between the probability of the data given the parameters and the likelihood of the parameters given the data is which part we keep fixed and which we vary. We do require that Pr(x, C_i ; θ) is a probability distribution over (x, C_i), which means that the sum over all possible values of x and C_i (or the integral, if we had continuous variables) must be 1, while we do not require that summing (or integrating) over all possible values of θ gives 1.
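
To make the distinction concrete, here is a small Python sketch (our own toy model, not from the notes; the two-letter alphabet, the equal class prior, and all names are assumptions) in which the same function Pr(x, C_i ; θ) is read either with θ fixed, as a probability of data, or with the data fixed, as a likelihood of θ.

    from itertools import product

    # Toy joint model over (x, C) with a two-letter alphabet {A, B}.
    # theta = (pA_1, pA_2): probability of letter 'A' in class 1 and class 2.
    # The two classes are assumed equally likely a priori (a modelling choice here).
    def joint_prob(x, c, theta):
        pA = theta[c - 1]                      # class-specific probability of 'A'
        p_letters = {"A": pA, "B": 1.0 - pA}
        prob_x_given_c = 1.0
        for letter in x:
            prob_x_given_c *= p_letters[letter]
        return 0.5 * prob_x_given_c            # Pr(x, C=c ; theta) = Pr(C=c) Pr(x | C=c)

    theta_hat = (0.7, 0.3)

    # Probability view: theta fixed, (x, C) varies -- sums to 1 over all strings of length 3.
    total = sum(joint_prob("".join(s), c, theta_hat)
                for s in product("AB", repeat=3) for c in (1, 2))
    print(total)  # ~1.0

    # Likelihood view: data fixed, theta varies -- need not sum or integrate to 1.
    x_hat, c_hat = "AAB", 1
    for pA in (0.2, 0.5, 0.8):
        print(pA, joint_prob(x_hat, c_hat, (pA, 0.5)))  # lhd(theta ; x_hat, c_hat)
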
2.2. Modelling strings from two classes. How exactly to define a probability like Pr(x, C_i ; θ) is often subjective and somewhat arbitrary, and there is rarely one right way of doing it. In that sense it is somewhat like programming: there are many ways you can solve a problem, and you can be more or less creative about it. There are some general strategies that are often useful, but it always depends on the application, and there are no guarantees that these strategies will work. You just have to try and see how it goes. One strategy that is often successful when we want to classify data is to look at the probability of a given data point conditional on the class, that is, the probability Pr(x | C_i ; θ). We will specify the probability of a string x in each of the two classes, C_1 and C_2, and then use the differences in these probabilities to decide which class x most likely comes from.

Since we are unlikely to guess the true model in any real application of machine learning, constructing models all boils down to constructing something that is fast to compute and good enough for our purpose. As with any programming task, it makes sense to start simple. For a string x = x_1 x_2 ⋯ x_k we need to define Pr(x_1, x_2, ..., x_k | C_i ; θ). A simple model assumes first that the letters in x are independent and second that the probability of seeing a given letter a ∈ Σ does not depend on the index in x. With these assumptions, the probability of the string, which is the joint probability of the letters in the string, becomes Pr(x_1 | C_i ; θ) Pr(x_2 | C_i ; θ) ⋯ Pr(x_k | C_i ; θ). The parameters of such a model could specify the probability of seeing each letter in the alphabet, Pr(a | C_i ; θ). Let θ = (p^(1), p^(2)), where p^(i) is a vector indexed by the letters a ∈ Σ with ∑_{a ∈ Σ} p^(i)_a = 1. We use p^(i) as the distribution of letters in class C_i and consider it a parameter we can fit to the data when we later train the model. To compute the probability of any given string, assuming it came from class C_i, you simply look up each letter x[j], j = 1, ..., k, in p^(i) and multiply the probabilities together:

    Pr(x = x_1 x_2 ⋯ x_k | C_i ; θ) = ∏_{j=1}^{k} p^(i)_{x[j]}.

Since ∑_{a ∈ Σ} p^(i)_a = 1, we have |Σ| − 1 free parameters from each of the two classes, and for each choice of parameters we get slightly different distributions over strings from the two classes. It is through the differences between p^(1) and p^(2) that we will be able to classify a string x.

Now, whether this is a good model for our application depends a lot on what the real data looks like. It might not capture important structure in the real data. For instance, the assumption that the letter probability is independent of the index in the string might be incorrect (we will see a model where the distribution depends on the index in next week's lectures), or the letters might not be independent of each other (we will see an example of this when we work with hidden Markov models). Deciding whether you have made a good model is often a question of comparing data you simulate under your constructed model with real data to see if there are large differences. If there are, you should improve your model to fit the data better, but quite often simple models are good enough for our application. Of course, even if this model is a completely accurate model of the real data, it doesn't mean that we are going to be able to easily classify strings. If each letter in the alphabet is equally likely whether a string comes from C_1 or C_2, then the two classes assign the same probability to every string x, and this model will not be able to distinguish them. Nevertheless, this is going to be our model for string classification.
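
The model is simple enough to implement directly. One way such functions could look (a Python sketch of our own, with an assumed DNA alphabet and made-up distributions; the exercises below ask you to write your own versions) is:

    import random

    def simulate_string(k, p):
        """Draw a string of length k where letter a has probability p[a]."""
        letters = list(p.keys())
        weights = [p[a] for a in letters]
        return "".join(random.choices(letters, weights=weights, k=k))

    def string_prob(x, p):
        """Pr(x | p) = product over positions j of p[x[j]] under the i.i.d. letter model."""
        prob = 1.0
        for a in x:
            prob *= p[a]
        return prob

    # Example: a class-1 letter distribution p1 and a class-2 distribution p2.
    p1 = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}
    p2 = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

    x = simulate_string(20, p1)
    print(x, string_prob(x, p1), string_prob(x, p2))

With distinct p1 and p2 the two probabilities will typically differ, and it is exactly this difference the classifier in section 3 exploits.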

2.3. Exercises. Write a function in your preferred programming language that simulates strings of length k given a vector of letter probabilities p, and another function that, given a string and a vector of letter probabilities, computes the probability of the string. Use the simulator to simulate a string and compute the probability of that string both using the true p you used when simulating and some different probability vector p′. You can measure how far p′ is from the true p using the Kullback-Leibler divergence

    D_KL(p ‖ p′) = ∑_{a ∈ Σ} p_a log(p_a / p′_a),

which is the expected per-letter value of log(Pr(x ; p)/Pr(x ; p′)) when x is generated from p. If you plot this divergence between p and p′ against the ratio between the probability of x under the two models, Pr(x ; p)/Pr(x ; p′), what happens? [1] (You might want to try this with a number of different simulated strings, since choosing random strings gives different results each time.)

[1] The probability of x gets exponentially smaller as the length increases (do you see why?), so comparing different lengths can be difficult. The ratio here shows how probable x is under one model relative to the other, and while both numerator and denominator shrink exponentially, the fraction still tells you the relative support for one model compared to the other. This particular ratio is called the likelihood ratio, since it is just another way of writing lhd(p ; x)/lhd(p′ ; x). If we think of p and p′ as two different models, rather than two different parameter points, it is called the Bayes factor.

Intuitively, you would expect longer strings to contain more information about the process that generated them than shorter strings do. What happens to Pr(x ; p) and Pr(x ; p′) when you simulate longer and longer strings? Try plotting Pr(x ; p′)/Pr(x ; p) against the length of x. Again, for each string length you might want to sample several strings to take stochastic variation into account.

If you simulate long strings you will probably quickly run into underflow problems. You avoid this if you compute the log-likelihood instead of the likelihood, i.e. instead of

    Pr(x ; p) = ∏_{j=1}^{|x|} p_{x[j]}

you compute

    log Pr(x ; p) = ∑_{j=1}^{|x|} log(p_{x[j]}),

and if you do that you want to look at the difference log Pr(x ; p′) − log Pr(x ; p) instead of the ratio Pr(x ; p′)/Pr(x ; p).

If, instead of having a single string x, we had a set of strings D = {x_1, x_2, ..., x_n}, how would you write the probability of the set D coming from the distribution p? If you do the exercises above with a set of strings rather than a single string, what changes?
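
A minimal sketch of the log-space computation (our own illustration; the distributions, string length, and function names are made up) might look like this:

    import math
    import random

    def log_string_prob(x, p):
        """log Pr(x | p) = sum over j of log p[x[j]]; avoids underflow for long strings."""
        return sum(math.log(p[a]) for a in x)

    def log_dataset_prob(strings, p):
        """Independent strings: log Pr(D | p) is the sum of the per-string log-probabilities."""
        return sum(log_string_prob(x, p) for x in strings)

    p  = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}      # "true" distribution (made up)
    pp = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # alternative distribution p'

    x = "".join(random.choices(list(p), weights=list(p.values()), k=5000))
    # The raw probability would underflow to 0.0 for a string this long,
    # but the log-probabilities and their difference are perfectly well behaved.
    print(log_string_prob(x, p) - log_string_prob(x, pp))  # log-likelihood ratio, > 0 on average
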

3. Classifying strings

Now, what we wanted to build was a model that, given a string x, would tell us whether x came from C_1 or C_2. From the model of Pr(x | C_i ; θ) we developed above, we therefore want to get a function x ↦ Pr(C_i | x ; θ) instead. If we know Pr(C_i | x ; θ), we can classify x as belonging to class C_i if Pr(C_i | x ; θ) is high enough. With two classes to choose from, this typically means that we would classify x as coming from C_1 if Pr(C_1 | x ; θ) > 0.5 and as coming from C_2 otherwise. We don't always have to classify x, though; sometimes we might have an application where we should only classify if we are relatively certain that we are right. This we also get from knowing Pr(C_i | x ; θ), since we can then simply require that the support for the chosen class is high enough, and refrain from classifying strings where no single class has a high enough probability.

We get the formula we want from Bayes' formula,

    Pr(B | A) = Pr(A | B) Pr(B) / Pr(A),

which for our example means

    Pr(C_i | x ; θ) = Pr(x | C_i ; θ) Pr(C_i ; θ) / Pr(x ; θ).

This introduces two new probabilities: Pr(C_i ; θ) and Pr(x ; θ). We can compute Pr(x ; θ) from the other two probabilities, since

    Pr(x ; θ) = Pr(x | C_1 ; θ) Pr(C_1 ; θ) + Pr(x | C_2 ; θ) Pr(C_2 ; θ),

assuming that there are only the two classes C_1 and C_2. [2]

[2] This follows from how we calculate with probabilities. We can marginalise over some of the variables in a joint distribution, so Pr(A) = ∑_i Pr(A, B_i), and by the definition of conditional distributions, Pr(A, B) = Pr(A | B) Pr(B).

The other probability, Pr(C_i ; θ), we have to specify. The probability Pr(C_i ; θ) is independent of x and can be thought of as how likely it is that any given string is drawn from that class in the first place. This can just be another parameter of our model, π, such that the set of parameters is now θ = (π, p^(1), p^(2)), where p^(i), i = 1, 2, are the letter probabilities for the two classes as before, and Pr(C_1 ; θ) = π and Pr(C_2 ; θ) = 1 − π. The parameter π is something we must set, either explicitly or by training on data, as we will see in the next section.
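
To make the classification rule concrete, here is a small Python sketch (our own, with made-up distributions and function names) that computes the posterior Pr(C_1 | x ; θ) with Bayes' formula, working in log space to sidestep the underflow issue from the exercises:

    import math

    def log_string_prob(x, p):
        return sum(math.log(p[a]) for a in x)

    def posterior_class1(x, pi, p1, p2):
        """Pr(C_1 | x ; theta) via Bayes' formula, computed in log space for stability."""
        log_joint1 = math.log(pi) + log_string_prob(x, p1)        # log Pr(x, C_1 ; theta)
        log_joint2 = math.log(1.0 - pi) + log_string_prob(x, p2)  # log Pr(x, C_2 ; theta)
        # Pr(C_1 | x) = Pr(x, C_1) / (Pr(x, C_1) + Pr(x, C_2)) = 1 / (1 + Pr(x, C_2)/Pr(x, C_1))
        return 1.0 / (1.0 + math.exp(log_joint2 - log_joint1))

    p1 = {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4}
    p2 = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

    post = posterior_class1("ATTATAAT", pi=0.5, p1=p1, p2=p2)
    print(post)                           # posterior probability that the string came from class 1
    print("C1" if post > 0.5 else "C2")   # the classification rule from the text
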

For now, let us just consider what the functions Pr(C_i ; θ) and Pr(x | C_i ; θ) tell us, and how they help us pick the right class for a string x. The so-called prior probability, Pr(C_i ; θ), describes how likely we think it is that class C_i produces a string to begin with. If we think that C_1 and C_2 are equally likely to produce strings, it doesn't matter so much when going from Pr(x | C_i ; θ) to Pr(C_i | x ; θ), but if we expect, for example, only one in a hundred strings to come from C_1, we would need more evidence that a specific string, x, is likely to have come from C_1 before classifying it as such. The probability of the string given the class, Pr(x | C_i ; θ), on the other hand, tells us how likely that class is to produce the string x. For that reason we can call it the likelihood, although we have already used that term for the probability as a function of the parameters θ. Still, you might sometimes see it called the likelihood, and in most ways it behaves like one; recall that lhd(θ ; x) is just another way of writing Pr(x ; θ), and if you think of C_i as a parameter of the model rather than a stochastic variable we condition on, you see the resemblance. [3] If Pr(x | C_1 ; θ) ≫ Pr(x | C_2 ; θ), that is, C_1 is much more likely to produce the string x than C_2 is, then observing x weighs the odds towards C_1 rather than C_2. So even if we a priori thought that we would only see a string from C_1 one time in a hundred, if Pr(x | C_1 ; θ) is a thousand times higher than Pr(x | C_2 ; θ), then after observing x it would still be more likely that it came from C_1.

[3] I have been careful to distinguish between conditional probabilities, Pr(A | B), and parameterised distributions, Pr(A ; θ), but in all the arithmetic we do there really isn't much of a difference. A conditional probability is just a parameterised distribution, and the only difference between having a conditional distribution and a parameterised distribution is whether we think of the parameters as stochastic variables with a distribution or not. This philosophical distinction is the difference between Bayesian and Frequentist statistics.

It is by combining the prior probability of seeing class C_i with how likely that class is to produce the string we observe that we get the posterior probability of C_i: Pr(C_i | x ; θ). We often write this intuition in the following form:

    Pr(C_1 | x ; θ) / Pr(C_2 | x ; θ) = [Pr(x | C_1 ; θ) / Pr(x | C_2 ; θ)] · [Pr(C_1 ; θ) / Pr(C_2 ; θ)].

You can think of Pr(C_1 ; θ) / Pr(C_2 ; θ) as the prior odds, that is, the odds of seeing something from C_1 rather than C_2 to begin with, and of Pr(C_1 | x ; θ) / Pr(C_2 | x ; θ) as the posterior odds, that is, the odds that the x you saw came from C_1 rather than C_2. If C_1 is unlikely to begin with, the prior odds are small. However, if we then observe a string that C_1 is very likely to produce and C_2 is unlikely to produce, the odds change. The stronger the prior odds are against C_1, the more evidence we demand before we select C_1 over C_2. Since

    Pr(C_1 | x ; θ) / Pr(C_2 | x ; θ) > 1   if and only if   Pr(C_1 | x ; θ) > Pr(C_2 | x ; θ),

we would classify x as coming from C_1 if the posterior odds are higher than 1 (or sufficiently higher than 1 if we want to avoid less certain cases) and classify it as coming from C_2 if the posterior odds are below 1. If the posterior odds are exactly 1, it is probably best not to make a decision.

3.1. Exercises. Pick two letter distributions, p and p′, and simulate n strings of length k from each. Classify a string x as class C_1 if Pr(x | C_1 ; θ) > Pr(x | C_2 ; θ) and as C_2 otherwise (this ignores the prior; with equal priors it is the same as requiring a posterior probability above 0.5), and measure how well you do (the number of strings you assign to the right class divided by the total number of strings, 2n). How well do you classify as a function of how far p′ is from p? How well do you classify as a function of the length of the strings? Now simulate strings by first randomly choosing p or p′, picking p with some probability π. Classify the strings both as above and by using their posterior odds (or posterior probabilities, whichever you prefer; it gives the same result). Compare the accuracy of the classification when the prior probabilities / prior odds are taken into account versus when they are not. Plot the accuracy of both approaches as a function of π.
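
One way to run the comparison in the exercise is sketched below (a hypothetical setup with made-up distributions, not the only way to do it): strings are simulated with a skewed class prior, and the same strings are classified once ignoring the prior and once using it.

    import math
    import random

    def simulate_string(k, p):
        return "".join(random.choices(list(p), weights=list(p.values()), k=k))

    def log_prob(x, p):
        return sum(math.log(p[a]) for a in x)

    def classify(x, p1, p2, pi):
        """Return 1 or 2 by comparing posterior (log) odds; pi is the prior Pr(C_1)."""
        log_odds = (log_prob(x, p1) - log_prob(x, p2)) + math.log(pi / (1.0 - pi))
        return 1 if log_odds > 0 else 2

    # Hypothetical setup: class 1 is AT-rich, class 2 is uniform, and class 1 is rare.
    p1 = {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35}
    p2 = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    pi_sim, k, n = 0.1, 50, 2000

    data = [(simulate_string(k, p1), 1) if random.random() < pi_sim
            else (simulate_string(k, p2), 2) for _ in range(n)]

    for pi_used in (0.5, pi_sim):  # ignore the prior vs. use the simulation prior
        acc = sum(classify(x, p1, p2, pi_used) == c for x, c in data) / n
        print(f"prior used = {pi_used:.2f}  accuracy = {acc:.3f}")
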

4. Training the string classifier

Finally we come to training the model, that is, how to set the parameters θ = (π, p^(1), p^(2)). We of course want to choose the parameters in such a way that we maximise the probability of correctly classifying a new string that might show up. We don't know which strings we are likely to see, however, nor which classes they come from, and until we actually have a set of parameters we cannot even make educated guesses about it. Just saying that we want to optimise how well we can do in the future is thus not something we can tell a computer to do; we need some algorithm for setting the parameters such that we are at least likely to get a good classifier for future data.

One can show, as a general result, that if you have the right model, then on average you cannot do better than using the true parameters. That is, the best average performance you can achieve on new data comes from using the true parameters of the model. If we don't have the right model, all bets are off, and unfortunately this is almost always the case. That is a bit of a fatalistic thought, though, so we are going to assume that we have the right model (and if not, we can always go back to modelling to at least get one that is as close as possible), because if we have the right model we have some general approaches to estimating the true parameters.

If we do not have any data to work with, we can do nothing but guess at the parameters, but typically we can get a set of data D = {(x_1, t_1), (x_2, t_2), ..., (x_n, t_n)} of data points x_j and targets t_j; in our application, strings x_j ∈ Σ* and associated classes t_j ∈ {C_1, C_2}. From this data we need to set the parameters. There are three approaches that are frequently used, and many machine learning algorithms are just concrete algorithms for one of these general approaches. They are not always the best choice, but they are always a good choice, and unless you can show that an alternative does better you should use one of these. The approaches are: (1) you maximise the likelihood, (2) you maximise the posterior, or (3) you make predictions using the posterior distribution; see equations (1), (2) and (3) below. The first is a frequentist approach, while the second and third are Bayesian (http://en.wikipedia.org/wiki/Bayesian_inference). We will only use the first two in this class but just mention the third in case you run into it in the future.

4.1. Maximum likelihood estimates. Maximum likelihood estimation (http://en.wikipedia.org/wiki/maximum_likelihood), as the name suggests, is based on maximising the likelihood function lhd(θ ; D) = Pr(D ; θ) with respect to the parameters θ:

    (1)    θ̂_MLE = argmax_θ Pr(D ; θ).

This you do in whatever fashion you can, just as with maximising any other function. Sometimes you can do it analytically by setting the derivative to zero, d/dθ lhd(θ ; D) = 0 (when θ is a vector you set the gradient to zero, ∇ lhd = 0), or more often by differentiating the log-likelihood, since it is usually easier to take the derivative of. Sometimes there are constraints on what values θ can legally take, which complicates this slightly, and sometimes we simply cannot solve the problem analytically. So, not surprisingly, there are algorithms and heuristics in the literature for maximising this function for specific machine learning methods; the Baum-Welch algorithm you will see for hidden Markov models is one such algorithm and an instance of a general class of optimisation algorithms called Expectation-Maximisation, or EM. EM is a numerical optimisation that is guaranteed to find a local, but not necessarily a global, maximum. Quite often you have to use heuristics and numerical algorithms to optimise the likelihood.

Intuitively, maximising Pr(D ; θ) is a sensible thing to do; you are picking the parameters that make the data you have observed most likely. There are also more theoretical properties of maximum likelihood estimates that make them a good choice, not least that they are guaranteed to converge towards the real parameters as the number of data points grows. They can often be biased, meaning that on average they slightly over- or under-estimate the true parameters, but this bias is guaranteed to get smaller and smaller as the number of data points grows. Still, in many algorithms you will see estimators that correct for the bias in the maximum likelihood estimator to get an unbiased estimate that still converges to the true value. We won't worry about this and will just maximise likelihoods (and hope that we have enough data, and are sufficiently close to the true value, that we needn't worry about the bias).
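
As a tiny illustration of equation (1), here is a sketch (our own toy example, using the coin-flip setting that appears in the next subsection) that maximises a log-likelihood both numerically on a grid and analytically, and shows that they agree:

    import math

    # Hypothetical data: n coin flips with h heads.
    h, n = 7, 20

    def log_lhd(p):
        """log lhd(p ; D) for h heads in n flips (the binomial coefficient is constant in p and omitted)."""
        return h * math.log(p) + (n - h) * math.log(1.0 - p)

    # Numerical maximisation over a simple grid, standing in for a generic optimiser.
    grid = [i / 1000 for i in range(1, 1000)]
    p_mle_numeric = max(grid, key=log_lhd)

    # Analytic solution from setting d/dp log lhd(p) = 0: p_hat = h / n.
    p_mle_analytic = h / n
    print(p_mle_numeric, p_mle_analytic)  # both ~0.35
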

4.2. Bayesian estimates. For Bayesian estimation we treat the parameter not just as an unknown value but as a stochastic variable with its own distribution. So rather than having a likelihood lhd(θ ; D) = Pr(D ; θ) we have a conditional distribution lhd(θ | D) = Pr(D | θ). From this we can get a posterior distribution over parameters, Pr(θ | D), [4] and we can use this in two different ways in our classification model. With a posterior distribution over parameters it makes more sense to choose the parameters with the maximal posterior probability rather than the maximal likelihood, and we call this estimator the maximum a posteriori estimator:

    (2)    θ̂_MAP = argmax_θ Pr(θ | D) = argmax_θ Pr(D | θ) Pr(θ),

where in the last equality we use Bayes' rule and ignore the division by Pr(D), since it is a constant when optimising with respect to θ.

[4] Strictly speaking, if θ is continuous then we have a density for it rather than a probability, but I am not going to bother making that distinction in these notes; when you see the notation Pr(x) for a continuous variable x, just substitute a density f(x) if you prefer. Likewise, if I integrate ∫ f(x) dx and x is discrete, read it as a sum over all values of x. If there is a risk of confusion when doing this, I will point it out, but there very rarely is.

With a Bayesian approach, however, you do not need to maximise the posterior. You have a distribution over parameters, and by integrating over all possible parameters, weighted by their probability, you can make predictions as well. So if you need to make predictions for a data point x, rather than using a single estimated parameter θ̂ (whether the maximum likelihood estimate or the maximum a posteriori estimate) and the probability Pr(x ; θ̂), you can get the probability of x given all the previous data, Pr(x | D), using

    (3)    Pr(x | D) = ∫ Pr(x | θ) Pr(θ | D) dθ = ∫ Pr(x | θ) Pr(D | θ) Pr(θ) dθ / Pr(D).

A main benefit of Bayesian approaches is that we can alleviate some of the problems we have with the stochastic variation of estimates when we have very little data. The maximum likelihood estimates will converge to the true parameters, but when there is little data the estimate can be far from the truth just by random chance. Imagine flipping a coin to estimate the probability p of seeing heads rather than tails. [5] If you flip a coin n times and see h heads, the maximum likelihood estimate for p is p̂_MLE = h/n. As n → ∞ the estimate will go to the true value, p̂_MLE → p, but for small n it can be quite far from the true value. For the extreme case n = 1 we have p̂ = 0 or p̂ = 1 regardless of the true value of p. Using a prior distribution over parameter values we can nudge the estimated parameters away from such extremes, and if we have some idea about what likely values look like, we can capture this with the prior distribution. For the coin toss example, we can use the prior to make it more likely that p is around 0.5 if we a priori believe that the coin is unbiased.

[5] Coin flipping is a classical example because it is simple, but it is physically very hard to get a biased coin.

Using the full posterior to make predictions, as in (3), rather than just the most likely value, as in (2), captures how certain you are about the parameter estimate. If you use the θ̂_MAP estimator for prediction, you predict with the same certainty regardless of how concentrated the posterior probability is around the maximum point θ̂_MAP. Using (3) is better in the sense that it takes into account all the knowledge about the parameters that you have learned from the data. It is not always easy to get a computationally fast model where you can do this, though, so we won't see more of it in this class.
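
A small sketch of this nudging effect (our own example; the pseudo-counts are made up and follow the convention used in the next subsection, where the prior is proportional to p^α (1 − p)^β):

    import math

    # One flip, heads: the extreme small-sample case mentioned in the text.
    h, n = 1, 1
    alpha = beta = 5  # made-up pseudo-counts encoding a prior belief that p is near 0.5

    def log_posterior(p):
        """log of p^h (1-p)^(n-h) * p^alpha (1-p)^beta, up to a normalising constant."""
        return (h + alpha) * math.log(p) + (n - h + beta) * math.log(1.0 - p)

    grid = [i / 1000 for i in range(1, 1000)]
    p_mle = h / n                      # 1.0 -- fully determined by a single flip
    p_map = max(grid, key=log_posterior)
    print(p_mle, p_map)                # the MAP estimate is pulled towards 0.5 by the prior
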

4.3. Conjugate priors. Taking a Bayesian approach means that we have two new probabilities, Pr(θ) and Pr(D). The latter is the probability of the data without conditioning on the parameters, something that might look odd since we have modelled the probability of data given parameters, but it comes from treating the parameters as stochastic and marginalising:

    Pr(D) = ∫ Pr(D, θ) dθ = ∫ Pr(D | θ) Pr(θ) dθ.

For maximising the posterior we don't need this probability, since it doesn't depend on the θ we maximise with respect to. For using the full posterior distribution (3) it is needed for normalisation, but in many cases we can avoid computing it explicitly through the integration, as we will see below.

The probability Pr(θ) is something we have to provide and not something we can train from the data (it is independent of the data, unlike the likelihood, after all). So it becomes part of our modelling, and it is up to our intuition and inventiveness to come up with a good distribution. A good choice is a so-called conjugate prior, which is a prior that makes it especially easy to combine prior distributions and likelihoods into posterior distributions (http://en.wikipedia.org/wiki/conjugate_prior). The idea behind conjugate priors is that the prior is chosen such that both the prior and the posterior distribution are from the same parameterised family of functions f(· ; ξ), i.e. the difference between prior and posterior is only the (meta-)parameter of this function: Pr(θ) = f(θ ; ξ_0) and Pr(θ | D) = f(θ ; ξ_D). To use conjugate priors, you then simply need a function that combines the prior meta-parameter and the data into the posterior meta-parameter: g(ξ_0, D) = ξ_D.
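
To see what the marginalisation means in practice, here is a rough numerical sketch (our own, with made-up data and a uniform prior) that approximates Pr(D) for the coin model by integrating Pr(D | p) Pr(p) over a grid; with a conjugate prior, as derived below, this kind of explicit integration is never needed:

    import math

    # A rough numerical check of Pr(D) = ∫ Pr(D | p) Pr(p) dp for coin flips,
    # using a made-up data set and a uniform prior Pr(p) = 1 on [0, 1].
    h, n = 3, 10

    def lhd(p):
        # Pr(D | p) for one particular sequence with h heads in n flips
        # (no binomial coefficient, since the data is a specific sequence).
        return p**h * (1.0 - p)**(n - h)

    steps = 100_000
    dx = 1.0 / steps
    marginal = sum(lhd((i + 0.5) * dx) * 1.0 * dx for i in range(steps))  # midpoint rule

    # For a uniform prior the integral has a known closed form: h! (n-h)! / (n+1)!
    exact = math.factorial(h) * math.factorial(n - h) / math.factorial(n + 1)
    print(marginal, exact)  # the two values agree closely
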

If we take coin flipping again as an example, we can write the likelihood of the head probability p from a single coin toss as lhd(p | h) = Pr(h | p) = p^h (1 − p)^(1−h), where h is 1 if we observe heads and 0 if we observe tails. A series of independent coin tosses is just a product of these for the different outcomes, so n tosses with h heads and t = n − h tails has the likelihood

    lhd(p | h, t) = (h+t choose h) p^h (1 − p)^t,

where (h+t choose h) is the binomial coefficient, needed to normalise the function as a probability over the number of heads and tails. For a conjugate prior we want a function f(p ; ξ_0) such that p^h (1 − p)^t · f(p ; ξ_0) ∝ f(p ; ξ_{h,t}) for some ξ_{h,t}. If we take something of the same form as the likelihood, with the meta-parameters playing a role similar to the head and tail counts, we can define

    f(p ; ξ_0) = C_{α,β} p^α (1 − p)^β,

where ξ_0 = (α, β) and C_{α,β} is the normalisation constant that makes this a distribution over p. [6] Combining this prior with the likelihood, ignoring for now the normalising constants, we get

    lhd(p | h, t) · f(p ; α, β) ∝ [p^h (1 − p)^t] · [p^α (1 − p)^β]
                                = p^(h+α) (1 − p)^(t+β)
                                ∝ f(p ; h + α, t + β),

where the last line comes from how we defined f: f(p ; h + α, t + β) = C_{h+α,t+β} p^(h+α) (1 − p)^(t+β) ∝ p^(h+α) (1 − p)^(t+β). Since this shows that the posterior is proportional to f(p ; h + α, t + β), and since f(p ; h + α, t + β) by definition integrates to 1 and the posterior does so as well by virtue of being a density, they must be equal. So we move from prior to posterior by modifying the meta-parameters based on the observed data: g(ξ_0, D) = g((α, β), (h, t)) = (α + h, β + t) = ξ_{h,t}. The new meta-parameters can be combined with more data if we get more, and in this way the distribution for p can be updated each time we observe more data, using this procedure again and again.

[6] For f(p ; α, β) to be a density over p it is necessary that ∫_0^1 f(p ; α, β) dp = 1, which means that C_{α,β} = 1 / ∫_0^1 p^α (1 − p)^β dp.

The conjugate prior for a given likelihood of course depends on the form of the likelihood, but most standard probability distributions have a corresponding conjugate that you can look up if you need it. The conjugate prior for the coin toss above is called a Beta distribution, Beta(α, β), and a way of thinking about the meta-parameters α and β is as pseudo-counts of heads and tails, respectively. Using the prior, we pretend that we have already observed α heads and β tails. If we set α = β, we say that we believe heads and tails are equally likely. The larger the numbers we use for α and β, the stronger the influence of the prior; if we pretend that we have already seen 50 heads and 50 tails, observing 5 new tosses is not going to move our posterior distribution far away from 0.5. As h + t grows large compared to α + β, the posterior is influenced more by the observed data than by the prior, and the posterior will look more and more like just the likelihood. Consequently, the maximum a posteriori estimator will converge to the same point as the maximum likelihood estimator and thus shares the nice property that it converges to the true value. It is just potentially less sensitive to stochastic fluctuations when there is little data to begin with.
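
The update rule is short enough to write down directly. A sketch (our own, using the pseudo-count convention above with hypothetical numbers):

    # Conjugate updating for the coin model, following the pseudo-count view above.
    # Meta-parameters (alpha, beta) are pseudo-counts of heads and tails (the notes'
    # convention, where the prior is proportional to p^alpha (1-p)^beta).

    def update(meta, data):
        """g(xi_0, D) = xi_D: add observed heads/tails to the pseudo-counts."""
        alpha, beta = meta
        heads, tails = data
        return alpha + heads, beta + tails

    def map_estimate(meta):
        """Mode of the (unnormalised) posterior p^alpha (1-p)^beta."""
        alpha, beta = meta
        return alpha / (alpha + beta)

    prior = (5, 5)                       # pretend we have already seen 5 heads and 5 tails
    posterior = update(prior, (1, 4))    # then observe 1 head and 4 tails
    print(posterior, map_estimate(posterior))   # (6, 9), MAP = 0.4

    # More data can be folded in batch by batch with the same update rule.
    posterior = update(posterior, (40, 10))
    print(posterior, map_estimate(posterior))
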

4.4. Training the string classifier. Going back to our string classification problem, we want to estimate the parameter π, the probability of seeing class C_1 rather than C_2, and the two letter distributions p^(1) and p^(2). The training data is a set of strings paired with their classes, D = {(s_1, c_1), (s_2, c_2), ..., (s_n, c_n)}. If the training data is generated such that the probability of a string coming from C_1 is actually π, we can estimate π the same way we estimated the probability of seeing heads rather than tails in a coin toss. The maximum likelihood estimator of π is

    π̂_MLE = n_{C_1} / n,

where n_{C_1} is the number of c_i equal to C_1. With a Beta(α, β) prior for π we could instead use the maximum a posteriori estimate

    π̂_MAP = (n_{C_1} + α) / (n + α + β),

where again α and β work as pseudo-counts for C_1 and C_2, respectively.

The strings paired with C_1 are independent of the strings paired with C_2, and p^(1) only depends on the first set and p^(2) only on the second. For convenience, and without loss of generality, we assume that s_1, s_2, ..., s_m are the strings paired with C_1. The distribution p^(1) is a multinomial distribution, and the maximum likelihood estimator is

    p̂^(1)_a = n^(1)_a / L^(1)

for all a ∈ Σ, where n^(1)_a is the number of times a occurs in s_1, ..., s_m and L^(1) = ∑_{j=1}^{m} |s_j| is the total length of the strings from C_1. The distribution for the other class, p^(2), is estimated exactly the same way, but using the strings s_{m+1}, ..., s_n. We can also add priors to multinomial distributions. The conjugate prior is called a Dirichlet distribution, but it works just like the pseudo-counts we have already seen. If we have a pseudo-count α_a for each a ∈ Σ and let α = ∑_{a ∈ Σ} α_a, then the MAP estimator is

    p̂^(1)_a = (n^(1)_a + α_a) / (L^(1) + α).
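
A sketch of these estimators in Python (our own illustration; the alphabet, pseudo-counts, and tiny training set are made up):

    from collections import Counter

    ALPHABET = "ACGT"  # assumed alphabet

    def train(data, alpha_pi=1, beta_pi=1, alpha_letter=1):
        """MAP estimates for (pi, p1, p2) from [(string, class)] pairs, with pseudo-counts.
        Setting all pseudo-counts to 0 gives the maximum likelihood estimates instead."""
        n = len(data)
        n_c1 = sum(1 for _, c in data if c == 1)
        pi = (n_c1 + alpha_pi) / (n + alpha_pi + beta_pi)

        def letter_dist(strings):
            counts = Counter()
            for s in strings:
                counts.update(s)
            total = sum(len(s) for s in strings) + alpha_letter * len(ALPHABET)
            return {a: (counts[a] + alpha_letter) / total for a in ALPHABET}

        p1 = letter_dist([s for s, c in data if c == 1])
        p2 = letter_dist([s for s, c in data if c == 2])
        return pi, p1, p2

    # Tiny made-up training set.
    data = [("ATTA", 1), ("TATT", 1), ("GCGC", 2), ("CCGG", 2), ("ACGT", 2)]
    pi, p1, p2 = train(data)
    print(pi)
    print(p1)
    print(p2)
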

4.5. Exercises. Use your simulator from earlier to simulate data sets of strings paired with the class that produced them. Estimate π from this data and plot the distance from your estimate, π̂, to the simulated value, π_sim, that is |π̂ − π_sim|, as a function of the number of strings you have simulated. You will want to simulate several times for each data set size to see the stochastic variation in the simulations. Try this for several values of π_sim. Try both the maximum likelihood estimator, π̂_MLE = n_{C_1}/n, and a maximum a posteriori estimator, π̂_MAP = (n_{C_1} + α)/(n + α + β). Try different values of α and β to see how this affects the estimation accuracy.

Simulate strings with a letter probability distribution p_sim and try to estimate this distribution. Try both the p̂_MLE and p̂_MAP estimators, with different pseudo-counts for the maximum a posteriori estimator. Plot the Kullback-Leibler divergence between the simulated and estimated distributions as a function of the total string length simulated.

Finally, put it all together so you can simulate a set of strings paired with classes, estimate all the parameters of the model, and then make predictions for the strings to test how well the predictions match the simulated values.
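
Putting the pieces together as the final exercise asks could look roughly like the following sketch (our own end-to-end example with made-up distributions and helper names): simulate a labelled data set, fit (π, p^(1), p^(2)) with pseudo-counts, and measure classification accuracy on freshly simulated strings.

    import math
    import random
    from collections import Counter

    ALPHABET = "ACGT"

    def simulate_string(k, p):
        return "".join(random.choices(ALPHABET, weights=[p[a] for a in ALPHABET], k=k))

    def simulate_dataset(n, k, pi, p1, p2):
        return [(simulate_string(k, p1), 1) if random.random() < pi
                else (simulate_string(k, p2), 2) for _ in range(n)]

    def train(data, pseudo=1):
        n_c1 = sum(1 for _, c in data if c == 1)
        pi = (n_c1 + pseudo) / (len(data) + 2 * pseudo)
        def letter_dist(strings):
            counts = Counter()
            for s in strings:
                counts.update(s)
            total = sum(len(s) for s in strings) + pseudo * len(ALPHABET)
            return {a: (counts[a] + pseudo) / total for a in ALPHABET}
        p1 = letter_dist([s for s, c in data if c == 1])
        p2 = letter_dist([s for s, c in data if c == 2])
        return pi, p1, p2

    def classify(x, pi, p1, p2):
        log_odds = math.log(pi / (1 - pi)) + sum(math.log(p1[a]) - math.log(p2[a]) for a in x)
        return 1 if log_odds > 0 else 2

    # Simulate, train on one data set, and evaluate on a fresh one.
    p1_sim = {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35}
    p2_sim = {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}
    pi_sim = 0.3

    train_data = simulate_dataset(500, 100, pi_sim, p1_sim, p2_sim)
    test_data = simulate_dataset(500, 100, pi_sim, p1_sim, p2_sim)

    pi_hat, p1_hat, p2_hat = train(train_data)
    accuracy = sum(classify(x, pi_hat, p1_hat, p2_hat) == c for x, c in test_data) / len(test_data)
    print(pi_hat, accuracy)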

CS 361: Probability & Statistics

CS 361: Probability & Statistics October 17, 2017 CS 361: Probability & Statistics Inference Maximum likelihood: drawbacks A couple of things might trip up max likelihood estimation: 1) Finding the maximum of some functions can be quite

More information

Conditional probabilities and graphical models

Conditional probabilities and graphical models Conditional probabilities and graphical models Thomas Mailund Bioinformatics Research Centre (BiRC), Aarhus University Probability theory allows us to describe uncertainty in the processes we model within

More information

CSC321 Lecture 18: Learning Probabilistic Models

CSC321 Lecture 18: Learning Probabilistic Models CSC321 Lecture 18: Learning Probabilistic Models Roger Grosse Roger Grosse CSC321 Lecture 18: Learning Probabilistic Models 1 / 25 Overview So far in this course: mainly supervised learning Language modeling

More information

Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Lior Wolf

Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Lior Wolf 1 Introduction to Machine Learning Maximum Likelihood and Bayesian Inference Lecturers: Eran Halperin, Lior Wolf 2014-15 We know that X ~ B(n,p), but we do not know p. We get a random sample from X, a

More information

COMP90051 Statistical Machine Learning

COMP90051 Statistical Machine Learning COMP90051 Statistical Machine Learning Semester 2, 2017 Lecturer: Trevor Cohn 2. Statistical Schools Adapted from slides by Ben Rubinstein Statistical Schools of Thought Remainder of lecture is to provide

More information

Introduction to Machine Learning. Lecture 2

Introduction to Machine Learning. Lecture 2 Introduction to Machine Learning Lecturer: Eran Halperin Lecture 2 Fall Semester Scribe: Yishay Mansour Some of the material was not presented in class (and is marked with a side line) and is given for

More information

Discrete Mathematics and Probability Theory Spring 2016 Rao and Walrand Note 14

Discrete Mathematics and Probability Theory Spring 2016 Rao and Walrand Note 14 CS 70 Discrete Mathematics and Probability Theory Spring 2016 Rao and Walrand Note 14 Introduction One of the key properties of coin flips is independence: if you flip a fair coin ten times and get ten

More information

CS 361: Probability & Statistics

CS 361: Probability & Statistics March 14, 2018 CS 361: Probability & Statistics Inference The prior From Bayes rule, we know that we can express our function of interest as Likelihood Prior Posterior The right hand side contains the

More information

1 Review of The Learning Setting

1 Review of The Learning Setting COS 5: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #8 Scribe: Changyan Wang February 28, 208 Review of The Learning Setting Last class, we moved beyond the PAC model: in the PAC model we

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Solutions Thursday, September 19 What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

Bayesian inference. Fredrik Ronquist and Peter Beerli. October 3, 2007

Bayesian inference. Fredrik Ronquist and Peter Beerli. October 3, 2007 Bayesian inference Fredrik Ronquist and Peter Beerli October 3, 2007 1 Introduction The last few decades has seen a growing interest in Bayesian inference, an alternative approach to statistical inference.

More information

Discrete Mathematics and Probability Theory Spring 2014 Anant Sahai Note 10

Discrete Mathematics and Probability Theory Spring 2014 Anant Sahai Note 10 EECS 70 Discrete Mathematics and Probability Theory Spring 2014 Anant Sahai Note 10 Introduction to Basic Discrete Probability In the last note we considered the probabilistic experiment where we flipped

More information

Discrete Mathematics and Probability Theory Fall 2015 Lecture 21

Discrete Mathematics and Probability Theory Fall 2015 Lecture 21 CS 70 Discrete Mathematics and Probability Theory Fall 205 Lecture 2 Inference In this note we revisit the problem of inference: Given some data or observations from the world, what can we infer about

More information

Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Yishay Mansour, Lior Wolf

Introduction to Machine Learning. Maximum Likelihood and Bayesian Inference. Lecturers: Eran Halperin, Yishay Mansour, Lior Wolf 1 Introduction to Machine Learning Maximum Likelihood and Bayesian Inference Lecturers: Eran Halperin, Yishay Mansour, Lior Wolf 2013-14 We know that X ~ B(n,p), but we do not know p. We get a random sample

More information

CS 446 Machine Learning Fall 2016 Nov 01, Bayesian Learning

CS 446 Machine Learning Fall 2016 Nov 01, Bayesian Learning CS 446 Machine Learning Fall 206 Nov 0, 206 Bayesian Learning Professor: Dan Roth Scribe: Ben Zhou, C. Cervantes Overview Bayesian Learning Naive Bayes Logistic Regression Bayesian Learning So far, we

More information

Introduction: MLE, MAP, Bayesian reasoning (28/8/13)

Introduction: MLE, MAP, Bayesian reasoning (28/8/13) STA561: Probabilistic machine learning Introduction: MLE, MAP, Bayesian reasoning (28/8/13) Lecturer: Barbara Engelhardt Scribes: K. Ulrich, J. Subramanian, N. Raval, J. O Hollaren 1 Classifiers In this

More information

Probability and Estimation. Alan Moses

Probability and Estimation. Alan Moses Probability and Estimation Alan Moses Random variables and probability A random variable is like a variable in algebra (e.g., y=e x ), but where at least part of the variability is taken to be stochastic.

More information

Bayesian Learning (II)

Bayesian Learning (II) Universität Potsdam Institut für Informatik Lehrstuhl Maschinelles Lernen Bayesian Learning (II) Niels Landwehr Overview Probabilities, expected values, variance Basic concepts of Bayesian learning MAP

More information

SAMPLE CHAPTER. Avi Pfeffer. FOREWORD BY Stuart Russell MANNING

SAMPLE CHAPTER. Avi Pfeffer. FOREWORD BY Stuart Russell MANNING SAMPLE CHAPTER Avi Pfeffer FOREWORD BY Stuart Russell MANNING Practical Probabilistic Programming by Avi Pfeffer Chapter 9 Copyright 2016 Manning Publications brief contents PART 1 INTRODUCING PROBABILISTIC

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Due Thursday, September 19, in class What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

Natural Language Processing Prof. Pushpak Bhattacharyya Department of Computer Science & Engineering, Indian Institute of Technology, Bombay

Natural Language Processing Prof. Pushpak Bhattacharyya Department of Computer Science & Engineering, Indian Institute of Technology, Bombay Natural Language Processing Prof. Pushpak Bhattacharyya Department of Computer Science & Engineering, Indian Institute of Technology, Bombay Lecture - 21 HMM, Forward and Backward Algorithms, Baum Welch

More information

O 3 O 4 O 5. q 3. q 4. Transition

O 3 O 4 O 5. q 3. q 4. Transition Hidden Markov Models Hidden Markov models (HMM) were developed in the early part of the 1970 s and at that time mostly applied in the area of computerized speech recognition. They are first described in

More information

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012

Parametric Models. Dr. Shuang LIANG. School of Software Engineering TongJi University Fall, 2012 Parametric Models Dr. Shuang LIANG School of Software Engineering TongJi University Fall, 2012 Today s Topics Maximum Likelihood Estimation Bayesian Density Estimation Today s Topics Maximum Likelihood

More information

Math 350: An exploration of HMMs through doodles.

Math 350: An exploration of HMMs through doodles. Math 350: An exploration of HMMs through doodles. Joshua Little (407673) 19 December 2012 1 Background 1.1 Hidden Markov models. Markov chains (MCs) work well for modelling discrete-time processes, or

More information

Review of Maximum Likelihood Estimators

Review of Maximum Likelihood Estimators Libby MacKinnon CSE 527 notes Lecture 7, October 7, 2007 MLE and EM Review of Maximum Likelihood Estimators MLE is one of many approaches to parameter estimation. The likelihood of independent observations

More information

Discrete Binary Distributions

Discrete Binary Distributions Discrete Binary Distributions Carl Edward Rasmussen November th, 26 Carl Edward Rasmussen Discrete Binary Distributions November th, 26 / 5 Key concepts Bernoulli: probabilities over binary variables Binomial:

More information

Probabilistic and Bayesian Machine Learning

Probabilistic and Bayesian Machine Learning Probabilistic and Bayesian Machine Learning Lecture 1: Introduction to Probabilistic Modelling Yee Whye Teh ywteh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit University College London Why a

More information

Estimation of reliability parameters from Experimental data (Parte 2) Prof. Enrico Zio

Estimation of reliability parameters from Experimental data (Parte 2) Prof. Enrico Zio Estimation of reliability parameters from Experimental data (Parte 2) This lecture Life test (t 1,t 2,...,t n ) Estimate θ of f T t θ For example: λ of f T (t)= λe - λt Classical approach (frequentist

More information

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides

Probabilistic modeling. The slides are closely adapted from Subhransu Maji s slides Probabilistic modeling The slides are closely adapted from Subhransu Maji s slides Overview So far the models and algorithms you have learned about are relatively disconnected Probabilistic modeling framework

More information

Lecture : Probabilistic Machine Learning

Lecture : Probabilistic Machine Learning Lecture : Probabilistic Machine Learning Riashat Islam Reasoning and Learning Lab McGill University September 11, 2018 ML : Many Methods with Many Links Modelling Views of Machine Learning Machine Learning

More information

Today. Statistical Learning. Coin Flip. Coin Flip. Experiment 1: Heads. Experiment 1: Heads. Which coin will I use? Which coin will I use?

Today. Statistical Learning. Coin Flip. Coin Flip. Experiment 1: Heads. Experiment 1: Heads. Which coin will I use? Which coin will I use? Today Statistical Learning Parameter Estimation: Maximum Likelihood (ML) Maximum A Posteriori (MAP) Bayesian Continuous case Learning Parameters for a Bayesian Network Naive Bayes Maximum Likelihood estimates

More information

Bayesian Models in Machine Learning

Bayesian Models in Machine Learning Bayesian Models in Machine Learning Lukáš Burget Escuela de Ciencias Informáticas 2017 Buenos Aires, July 24-29 2017 Frequentist vs. Bayesian Frequentist point of view: Probability is the frequency of

More information

Modeling Environment

Modeling Environment Topic Model Modeling Environment What does it mean to understand/ your environment? Ability to predict Two approaches to ing environment of words and text Latent Semantic Analysis (LSA) Topic Model LSA

More information

Bayesian RL Seminar. Chris Mansley September 9, 2008

Bayesian RL Seminar. Chris Mansley September 9, 2008 Bayesian RL Seminar Chris Mansley September 9, 2008 Bayes Basic Probability One of the basic principles of probability theory, the chain rule, will allow us to derive most of the background material in

More information

Bayesian Learning. Artificial Intelligence Programming. 15-0: Learning vs. Deduction

Bayesian Learning. Artificial Intelligence Programming. 15-0: Learning vs. Deduction 15-0: Learning vs. Deduction Artificial Intelligence Programming Bayesian Learning Chris Brooks Department of Computer Science University of San Francisco So far, we ve seen two types of reasoning: Deductive

More information

Bayesian Methods: Naïve Bayes

Bayesian Methods: Naïve Bayes Bayesian Methods: aïve Bayes icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Last Time Parameter learning Learning the parameter of a simple coin flipping model Prior

More information

Bayesian Inference. Introduction

Bayesian Inference. Introduction Bayesian Inference Introduction The frequentist approach to inference holds that probabilities are intrinsicially tied (unsurprisingly) to frequencies. This interpretation is actually quite natural. What,

More information

Introduction to Bayesian Statistics

Introduction to Bayesian Statistics Bayesian Parameter Estimation Introduction to Bayesian Statistics Harvey Thornburg Center for Computer Research in Music and Acoustics (CCRMA) Department of Music, Stanford University Stanford, California

More information

Machine Learning CMPT 726 Simon Fraser University. Binomial Parameter Estimation

Machine Learning CMPT 726 Simon Fraser University. Binomial Parameter Estimation Machine Learning CMPT 726 Simon Fraser University Binomial Parameter Estimation Outline Maximum Likelihood Estimation Smoothed Frequencies, Laplace Correction. Bayesian Approach. Conjugate Prior. Uniform

More information

P (E) = P (A 1 )P (A 2 )... P (A n ).

P (E) = P (A 1 )P (A 2 )... P (A n ). Lecture 9: Conditional probability II: breaking complex events into smaller events, methods to solve probability problems, Bayes rule, law of total probability, Bayes theorem Discrete Structures II (Summer

More information

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering

9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make

More information

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning

Lecture 3: More on regularization. Bayesian vs maximum likelihood learning Lecture 3: More on regularization. Bayesian vs maximum likelihood learning L2 and L1 regularization for linear estimators A Bayesian interpretation of regularization Bayesian vs maximum likelihood fitting

More information

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Lecture - 27 Multilayer Feedforward Neural networks with Sigmoidal

More information

Bayesian Methods. David S. Rosenberg. New York University. March 20, 2018

Bayesian Methods. David S. Rosenberg. New York University. March 20, 2018 Bayesian Methods David S. Rosenberg New York University March 20, 2018 David S. Rosenberg (New York University) DS-GA 1003 / CSCI-GA 2567 March 20, 2018 1 / 38 Contents 1 Classical Statistics 2 Bayesian

More information

Should all Machine Learning be Bayesian? Should all Bayesian models be non-parametric?

Should all Machine Learning be Bayesian? Should all Bayesian models be non-parametric? Should all Machine Learning be Bayesian? Should all Bayesian models be non-parametric? Zoubin Ghahramani Department of Engineering University of Cambridge, UK zoubin@eng.cam.ac.uk http://learning.eng.cam.ac.uk/zoubin/

More information

Ways to make neural networks generalize better

Ways to make neural networks generalize better Ways to make neural networks generalize better Seminar in Deep Learning University of Tartu 04 / 10 / 2014 Pihel Saatmann Topics Overview of ways to improve generalization Limiting the size of the weights

More information

Language as a Stochastic Process

Language as a Stochastic Process CS769 Spring 2010 Advanced Natural Language Processing Language as a Stochastic Process Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu 1 Basic Statistics for NLP Pick an arbitrary letter x at random from any

More information

1 Probabilities. 1.1 Basics 1 PROBABILITIES

1 Probabilities. 1.1 Basics 1 PROBABILITIES 1 PROBABILITIES 1 Probabilities Probability is a tricky word usually meaning the likelyhood of something occuring or how frequent something is. Obviously, if something happens frequently, then its probability

More information

Learning Sequence Motif Models Using Expectation Maximization (EM) and Gibbs Sampling

Learning Sequence Motif Models Using Expectation Maximization (EM) and Gibbs Sampling Learning Sequence Motif Models Using Expectation Maximization (EM) and Gibbs Sampling BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 009 Mark Craven craven@biostat.wisc.edu Sequence Motifs what is a sequence

More information

A primer on Bayesian statistics, with an application to mortality rate estimation

A primer on Bayesian statistics, with an application to mortality rate estimation A primer on Bayesian statistics, with an application to mortality rate estimation Peter off University of Washington Outline Subjective probability Practical aspects Application to mortality rate estimation

More information

Joint, Conditional, & Marginal Probabilities

Joint, Conditional, & Marginal Probabilities Joint, Conditional, & Marginal Probabilities The three axioms for probability don t discuss how to create probabilities for combined events such as P [A B] or for the likelihood of an event A given that

More information

DS-GA 1003: Machine Learning and Computational Statistics Homework 7: Bayesian Modeling

DS-GA 1003: Machine Learning and Computational Statistics Homework 7: Bayesian Modeling DS-GA 1003: Machine Learning and Computational Statistics Homework 7: Bayesian Modeling Due: Tuesday, May 10, 2016, at 6pm (Submit via NYU Classes) Instructions: Your answers to the questions below, including

More information

Bayesian Inference. STA 121: Regression Analysis Artin Armagan

Bayesian Inference. STA 121: Regression Analysis Artin Armagan Bayesian Inference STA 121: Regression Analysis Artin Armagan Bayes Rule...s! Reverend Thomas Bayes Posterior Prior p(θ y) = p(y θ)p(θ)/p(y) Likelihood - Sampling Distribution Normalizing Constant: p(y

More information

Machine Learning for Signal Processing Bayes Classification and Regression

Machine Learning for Signal Processing Bayes Classification and Regression Machine Learning for Signal Processing Bayes Classification and Regression Instructor: Bhiksha Raj 11755/18797 1 Recap: KNN A very effective and simple way of performing classification Simple model: For

More information

Lecture 18: Learning probabilistic models

Lecture 18: Learning probabilistic models Lecture 8: Learning probabilistic models Roger Grosse Overview In the first half of the course, we introduced backpropagation, a technique we used to train neural nets to minimize a variety of cost functions.

More information

Toss 1. Fig.1. 2 Heads 2 Tails Heads/Tails (H, H) (T, T) (H, T) Fig.2

Toss 1. Fig.1. 2 Heads 2 Tails Heads/Tails (H, H) (T, T) (H, T) Fig.2 1 Basic Probabilities The probabilities that we ll be learning about build from the set theory that we learned last class, only this time, the sets are specifically sets of events. What are events? Roughly,

More information

Generative Learning. INFO-4604, Applied Machine Learning University of Colorado Boulder. November 29, 2018 Prof. Michael Paul

Generative Learning. INFO-4604, Applied Machine Learning University of Colorado Boulder. November 29, 2018 Prof. Michael Paul Generative Learning INFO-4604, Applied Machine Learning University of Colorado Boulder November 29, 2018 Prof. Michael Paul Generative vs Discriminative The classification algorithms we have seen so far

More information

Computational Cognitive Science

Computational Cognitive Science Computational Cognitive Science Lecture 9: Bayesian Estimation Chris Lucas (Slides adapted from Frank Keller s) School of Informatics University of Edinburgh clucas2@inf.ed.ac.uk 17 October, 2017 1 / 28

More information

CPSC 340: Machine Learning and Data Mining

CPSC 340: Machine Learning and Data Mining CPSC 340: Machine Learning and Data Mining MLE and MAP Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart. 1 Admin Assignment 4: Due tonight. Assignment 5: Will be released

More information

Computational Cognitive Science

Computational Cognitive Science Computational Cognitive Science Lecture 8: Frank Keller School of Informatics University of Edinburgh keller@inf.ed.ac.uk Based on slides by Sharon Goldwater October 14, 2016 Frank Keller Computational

More information

Quadratic Equations Part I

Quadratic Equations Part I Quadratic Equations Part I Before proceeding with this section we should note that the topic of solving quadratic equations will be covered in two sections. This is done for the benefit of those viewing

More information

CS 361: Probability & Statistics

CS 361: Probability & Statistics February 19, 2018 CS 361: Probability & Statistics Random variables Markov s inequality This theorem says that for any random variable X and any value a, we have A random variable is unlikely to have an

More information

CHAPTER 2 Estimating Probabilities

CHAPTER 2 Estimating Probabilities CHAPTER 2 Estimating Probabilities Machine Learning Copyright c 2017. Tom M. Mitchell. All rights reserved. *DRAFT OF September 16, 2017* *PLEASE DO NOT DISTRIBUTE WITHOUT AUTHOR S PERMISSION* This is

More information

Bayesian Estimation An Informal Introduction

Bayesian Estimation An Informal Introduction Mary Parker, Bayesian Estimation An Informal Introduction page 1 of 8 Bayesian Estimation An Informal Introduction Example: I take a coin out of my pocket and I want to estimate the probability of heads

More information

Bayesian Analysis for Natural Language Processing Lecture 2

Bayesian Analysis for Natural Language Processing Lecture 2 Bayesian Analysis for Natural Language Processing Lecture 2 Shay Cohen February 4, 2013 Administrativia The class has a mailing list: coms-e6998-11@cs.columbia.edu Need two volunteers for leading a discussion

More information

Chapter 3: Maximum-Likelihood & Bayesian Parameter Estimation (part 1)

Chapter 3: Maximum-Likelihood & Bayesian Parameter Estimation (part 1) HW 1 due today Parameter Estimation Biometrics CSE 190 Lecture 7 Today s lecture was on the blackboard. These slides are an alternative presentation of the material. CSE190, Winter10 CSE190, Winter10 Chapter

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Empirical Bayes, Hierarchical Bayes Mark Schmidt University of British Columbia Winter 2017 Admin Assignment 5: Due April 10. Project description on Piazza. Final details coming

More information

The Exciting Guide To Probability Distributions Part 2. Jamie Frost v1.1

The Exciting Guide To Probability Distributions Part 2. Jamie Frost v1.1 The Exciting Guide To Probability Distributions Part 2 Jamie Frost v. Contents Part 2 A revisit of the multinomial distribution The Dirichlet Distribution The Beta Distribution Conjugate Priors The Gamma

More information

Some slides from Carlos Guestrin, Luke Zettlemoyer & K Gajos 2

Some slides from Carlos Guestrin, Luke Zettlemoyer & K Gajos 2 Logistics CSE 446: Point Estimation Winter 2012 PS2 out shortly Dan Weld Some slides from Carlos Guestrin, Luke Zettlemoyer & K Gajos 2 Last Time Random variables, distributions Marginal, joint & conditional

More information

Regression I: Mean Squared Error and Measuring Quality of Fit

Regression I: Mean Squared Error and Measuring Quality of Fit Regression I: Mean Squared Error and Measuring Quality of Fit -Applied Multivariate Analysis- Lecturer: Darren Homrighausen, PhD 1 The Setup Suppose there is a scientific problem we are interested in solving

More information

COS513 LECTURE 8 STATISTICAL CONCEPTS

COS513 LECTURE 8 STATISTICAL CONCEPTS COS513 LECTURE 8 STATISTICAL CONCEPTS NIKOLAI SLAVOV AND ANKUR PARIKH 1. MAKING MEANINGFUL STATEMENTS FROM JOINT PROBABILITY DISTRIBUTIONS. A graphical model (GM) represents a family of probability distributions

More information

Why should you care?? Intellectual curiosity. Gambling. Mathematically the same as the ESP decision problem we discussed in Week 4.

Why should you care?? Intellectual curiosity. Gambling. Mathematically the same as the ESP decision problem we discussed in Week 4. I. Probability basics (Sections 4.1 and 4.2) Flip a fair (probability of HEADS is 1/2) coin ten times. What is the probability of getting exactly 5 HEADS? What is the probability of getting exactly 10

More information

MS&E 226: Small Data

MS&E 226: Small Data MS&E 226: Small Data Lecture 12: Frequentist properties of estimators (v4) Ramesh Johari ramesh.johari@stanford.edu 1 / 39 Frequentist inference 2 / 39 Thinking like a frequentist Suppose that for some

More information

Week 3: Linear Regression

Week 3: Linear Regression Week 3: Linear Regression Instructor: Sergey Levine Recap In the previous lecture we saw how linear regression can solve the following problem: given a dataset D = {(x, y ),..., (x N, y N )}, learn to

More information

Machine Learning, Midterm Exam: Spring 2008 SOLUTIONS. Q Topic Max. Score Score. 1 Short answer questions 20.

Machine Learning, Midterm Exam: Spring 2008 SOLUTIONS. Q Topic Max. Score Score. 1 Short answer questions 20. 10-601 Machine Learning, Midterm Exam: Spring 2008 Please put your name on this cover sheet If you need more room to work out your answer to a question, use the back of the page and clearly mark on the

More information

Some Probability and Statistics

Some Probability and Statistics Some Probability and Statistics David M. Blei COS424 Princeton University February 13, 2012 Card problem There are three cards Red/Red Red/Black Black/Black I go through the following process. Close my

More information

Decision theory. 1 We may also consider randomized decision rules, where δ maps observed data D to a probability distribution over

Decision theory. 1 We may also consider randomized decision rules, where δ maps observed data D to a probability distribution over Point estimation Suppose we are interested in the value of a parameter θ, for example the unknown bias of a coin. We have already seen how one may use the Bayesian method to reason about θ; namely, we

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

Computational Perception. Bayesian Inference

Computational Perception. Bayesian Inference Computational Perception 15-485/785 January 24, 2008 Bayesian Inference The process of probabilistic inference 1. define model of problem 2. derive posterior distributions and estimators 3. estimate parameters

More information

an introduction to bayesian inference

an introduction to bayesian inference with an application to network analysis http://jakehofman.com january 13, 2010 motivation would like models that: provide predictive and explanatory power are complex enough to describe observed phenomena

More information

CS 188: Artificial Intelligence Spring Today

CS 188: Artificial Intelligence Spring Today CS 188: Artificial Intelligence Spring 2006 Lecture 9: Naïve Bayes 2/14/2006 Dan Klein UC Berkeley Many slides from either Stuart Russell or Andrew Moore Bayes rule Today Expectations and utilities Naïve

More information

Discrete Probability and State Estimation

Discrete Probability and State Estimation 6.01, Fall Semester, 2007 Lecture 12 Notes 1 MASSACHVSETTS INSTITVTE OF TECHNOLOGY Department of Electrical Engineering and Computer Science 6.01 Introduction to EECS I Fall Semester, 2007 Lecture 12 Notes

More information

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017

CPSC 340: Machine Learning and Data Mining. MLE and MAP Fall 2017 CPSC 340: Machine Learning and Data Mining MLE and MAP Fall 2017 Assignment 3: Admin 1 late day to hand in tonight, 2 late days for Wednesday. Assignment 4: Due Friday of next week. Last Time: Multi-Class

More information

CS 124 Math Review Section January 29, 2018

CS 124 Math Review Section January 29, 2018 CS 124 Math Review Section CS 124 is more math intensive than most of the introductory courses in the department. You re going to need to be able to do two things: 1. Perform some clever calculations to

More information

Quantitative Understanding in Biology 1.7 Bayesian Methods

Quantitative Understanding in Biology 1.7 Bayesian Methods Quantitative Understanding in Biology 1.7 Bayesian Methods Jason Banfelder October 25th, 2018 1 Introduction So far, most of the methods we ve looked at fall under the heading of classical, or frequentist

More information

Preliminary Statistics Lecture 2: Probability Theory (Outline) prelimsoas.webs.com

Preliminary Statistics Lecture 2: Probability Theory (Outline) prelimsoas.webs.com 1 School of Oriental and African Studies September 2015 Department of Economics Preliminary Statistics Lecture 2: Probability Theory (Outline) prelimsoas.webs.com Gujarati D. Basic Econometrics, Appendix

More information

1. When applied to an affected person, the test comes up positive in 90% of cases, and negative in 10% (these are called false negatives ).

1. When applied to an affected person, the test comes up positive in 90% of cases, and negative in 10% (these are called false negatives ). CS 70 Discrete Mathematics for CS Spring 2006 Vazirani Lecture 8 Conditional Probability A pharmaceutical company is marketing a new test for a certain medical condition. According to clinical trials,

More information

Computational Cognitive Science

Computational Cognitive Science Computational Cognitive Science Lecture 9: A Bayesian model of concept learning Chris Lucas School of Informatics University of Edinburgh October 16, 218 Reading Rules and Similarity in Concept Learning

More information

MAT Mathematics in Today's World

MAT Mathematics in Today's World MAT 1000 Mathematics in Today's World Last Time We discussed the four rules that govern probabilities: 1. Probabilities are numbers between 0 and 1 2. The probability an event does not occur is 1 minus

More information

Parametric Techniques Lecture 3

Parametric Techniques Lecture 3 Parametric Techniques Lecture 3 Jason Corso SUNY at Buffalo 22 January 2009 J. Corso (SUNY at Buffalo) Parametric Techniques Lecture 3 22 January 2009 1 / 39 Introduction In Lecture 2, we learned how to

More information

Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016

Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation. EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016 Introduction to Bayesian Statistics and Markov Chain Monte Carlo Estimation EPSY 905: Multivariate Analysis Spring 2016 Lecture #10: April 6, 2016 EPSY 905: Intro to Bayesian and MCMC Today s Class An

More information

Midterm sample questions

Midterm sample questions Midterm sample questions CS 585, Brendan O Connor and David Belanger October 12, 2014 1 Topics on the midterm Language concepts Translation issues: word order, multiword translations Human evaluation Parts

More information

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training

Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Maximum Likelihood, Logistic Regression, and Stochastic Gradient Training Charles Elkan elkan@cs.ucsd.edu January 17, 2013 1 Principle of maximum likelihood Consider a family of probability distributions

More information

Great Theoretical Ideas in Computer Science

Great Theoretical Ideas in Computer Science 15-251 Great Theoretical Ideas in Computer Science Probability Theory: Counting in Terms of Proportions Lecture 10 (September 27, 2007) Some Puzzles Teams A and B are equally good In any one game, each

More information

Introduction to Probabilistic Machine Learning

Introduction to Probabilistic Machine Learning Introduction to Probabilistic Machine Learning Piyush Rai Dept. of CSE, IIT Kanpur (Mini-course 1) Nov 03, 2015 Piyush Rai (IIT Kanpur) Introduction to Probabilistic Machine Learning 1 Machine Learning

More information

COMP 551 Applied Machine Learning Lecture 19: Bayesian Inference

COMP 551 Applied Machine Learning Lecture 19: Bayesian Inference COMP 551 Applied Machine Learning Lecture 19: Bayesian Inference Associate Instructor: (herke.vanhoof@mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/comp551 Unless otherwise noted, all material posted

More information

(1) Introduction to Bayesian statistics

(1) Introduction to Bayesian statistics Spring, 2018 A motivating example Student 1 will write down a number and then flip a coin If the flip is heads, they will honestly tell student 2 if the number is even or odd If the flip is tails, they

More information

Statistics: Learning models from data

Statistics: Learning models from data DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial

More information

STAT 499/962 Topics in Statistics Bayesian Inference and Decision Theory Jan 2018, Handout 01

STAT 499/962 Topics in Statistics Bayesian Inference and Decision Theory Jan 2018, Handout 01 STAT 499/962 Topics in Statistics Bayesian Inference and Decision Theory Jan 2018, Handout 01 Nasser Sadeghkhani a.sadeghkhani@queensu.ca There are two main schools to statistical inference: 1-frequentist

More information

Discrete Mathematics and Probability Theory Fall 2010 Tse/Wagner MT 2 Soln

Discrete Mathematics and Probability Theory Fall 2010 Tse/Wagner MT 2 Soln CS 70 Discrete Mathematics and Probability heory Fall 00 se/wagner M Soln Problem. [Rolling Dice] (5 points) You roll a fair die three times. Consider the following events: A first roll is a 3 B second

More information