Machine Learning Lecture Notes

Predrag Radivojac

January 25, 2015

1 Basic Principles of Parameter Estimation

In probabilistic modeling, we are typically presented with a set of observations, and the objective is to find a model, or function, $\hat{M}$ that shows good agreement with the data and respects certain additional requirements. We shall roughly categorize these requirements into three groups: (i) the ability to generalize well, (ii) the ability to incorporate prior knowledge and assumptions into modeling, and (iii) scalability. First, the model should be able to stand the test of time; that is, its performance on previously unseen data should not deteriorate once this new data is presented. Models with such performance are said to generalize well. Second, $\hat{M}$ must be able to incorporate information about the model space $\mathcal{M}$ from which it is selected, and the process of selecting a model should be able to accept training advice from an analyst. Finally, when large amounts of data are available, learning algorithms must be able to provide solutions in reasonable time given resources such as memory or CPU power. In summary, the choice of a model ultimately depends on the observations at hand, our experience with modeling real-life phenomena, and the ability of algorithms to find good solutions given limited resources.

An easy way to think about finding the best model is through learning the parameters of a distribution. Suppose we are given a set of observations $D = \{x_i\}_{i=1}^n$, where $x_i \in \mathbb{R}$, and have knowledge that $\mathcal{M}$ is the family of all univariate Gaussian distributions, e.g. $\mathcal{M} = \{\text{Gaussian}(\mu, \sigma^2)\}$, with $\mu \in \mathbb{R}$ and $\sigma^2 \in \mathbb{R}^+$. In this case, the problem of finding the best model (by which we mean function) can be seen as finding the best parameters $\mu$ and $\sigma$; i.e., the problem can be seen as parameter estimation. We call this process estimation because the typical assumption is that the data was generated by an unknown model from $\mathcal{M}$ whose parameters we are trying to recover from the data. We will formalize parameter estimation using probabilistic techniques and will subsequently find solutions through optimization, occasionally with constraints in the parameter space. The main assumption throughout this part will be that the set of observations $D$ was generated (or collected) independently and according to the same distribution $p_X(x)$. The statistical framework for model inference is shown in Figure 1.

[Figure 1 is a schematic: a data generator produces the data set $D = \{x_i\}_{i=1}^n$ (observations), experience supplies knowledge and assumptions in the form of the model space $\mathcal{M}$ and prior $p(M)$, and an optimization step, e.g. $M_{\text{MAP}} = \arg\max_{M\in\mathcal{M}}\{p(D|M)\,p(M)\}$ or $M_B = \mathrm{E}[M|D]$, produces the model. Model inference = observations + knowledge and assumptions + optimization.]

Figure 1: Statistical framework for model inference. The estimates of the parameters are made using a set of observations $D$ as well as experience in the form of the model space $\mathcal{M}$, prior distribution $p(M)$, or specific starting solutions in the optimization step.
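To make the parameter-estimation setup above concrete, here is a minimal sketch for illustration; the generating values of $\mu$ and $\sigma$ and the sample size are arbitrary assumptions, not taken from the notes.

```python
import numpy as np

# Simulate observations from an "unknown" univariate Gaussian model
rng = np.random.default_rng(0)
mu_true, sigma_true, n = 2.0, 1.5, 1000          # hypothetical generating values
D = rng.normal(mu_true, sigma_true, size=n)      # observations D = {x_1, ..., x_n}

# Recover the parameters from the data (maximum likelihood estimates)
mu_hat = D.mean()                                # estimate of mu: sample mean
sigma2_hat = ((D - mu_hat) ** 2).mean()          # estimate of sigma^2: sample variance (biased)

print(f"mu_hat = {mu_hat:.3f}, sigma2_hat = {sigma2_hat:.3f}")
```

With more observations the estimates concentrate around the generating values; the next section formalizes this estimation process.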

1.1 Maximum a posteriori and maximum likelihood estimation

The idea behind maximum a posteriori (MAP) estimation is to find the most probable model for the observed data. Given the data set $D$, we formalize the MAP solution as
$$M_{\text{MAP}} = \arg\max_{M\in\mathcal{M}}\{p(M|D)\},$$
where $p(M|D)$ is the posterior distribution of the model given the data. In discrete model spaces, $p(M|D)$ is a probability mass function and the MAP estimate is exactly the most probable model. Its counterpart in continuous spaces is the model with the largest value of the posterior density function. Note that we use the words model, which is a function, and its parameters, which are the coefficients of that function, somewhat interchangeably. However, we should keep in mind the difference, even if only for pedantic reasons.

To calculate the posterior distribution we start by applying Bayes' rule as
$$p(M|D) = \frac{p(D|M)\,p(M)}{p(D)}, \qquad (1)$$
where $p(D|M)$ is called the likelihood function, $p(M)$ is the prior distribution of the model, and $p(D)$ is the marginal distribution of the data. Notice that we use $D$ for the observed data set, but that we usually think of it as a realization of a multidimensional random variable $\mathcal{D}$ drawn according to some distribution $p(D)$. Using the formula of total probability, we can express $p(D)$ as
$$p(D) = \begin{cases} \sum_{M\in\mathcal{M}} p(D|M)\,p(M) & \mathcal{M}\text{ discrete} \\ \int_{\mathcal{M}} p(D|M)\,p(M)\,dM & \mathcal{M}\text{ continuous.} \end{cases}$$
Therefore, the posterior distribution can be fully described using the likelihood and the prior. The field of research and practice involving ways to determine this distribution and optimal models is referred to as inferential statistics; the posterior distribution is sometimes referred to as the inverse probability. Finding $M_{\text{MAP}}$ can be greatly simplified because $p(D)$ in the denominator does not affect the solution. We shall re-write Eq. (1) as
$$p(M|D) = \frac{p(D|M)\,p(M)}{p(D)} \propto p(D|M)\,p(M),$$
where $\propto$ is the proportionality symbol. Thus, we can find the MAP solution by solving the following optimization problem

$$M_{\text{MAP}} = \arg\max_{M\in\mathcal{M}}\{p(D|M)\,p(M)\}.$$
In some situations we may not have a reason to prefer one model over another and can think of $p(M)$ as a constant over the model space $\mathcal{M}$. Then, maximum a posteriori estimation reduces to the maximization of the likelihood function; i.e.,
$$M_{\text{ML}} = \arg\max_{M\in\mathcal{M}}\{p(D|M)\}.$$
We will refer to this solution as the maximum likelihood solution. Formally speaking, the assumption that $p(M)$ is constant is problematic because a uniform distribution cannot always be defined (say, over $\mathbb{R}$). Thus, it may be useful to think of the maximum likelihood approach as a separate technique, rather than a special case of MAP estimation, with a conceptual caveat. Observe that the MAP and ML approaches report solutions corresponding to the mode of the posterior distribution and of the likelihood function, respectively. We shall later contrast this estimation technique with the view of Bayesian statistics, in which the goal is to minimize the posterior risk. Such estimation typically results in calculating conditional expectations, which can be complex integration problems. From a different point of view, MAP and ML estimates are called point estimates, as opposed to estimates that report confidence intervals for particular groups of parameters.

Example 8. Suppose the data set $D = \{2, 5, 9, 5, 4, 8\}$ is an i.i.d. sample from a Poisson distribution with an unknown parameter $\lambda_t$. Find the maximum likelihood estimate of $\lambda_t$.

The probability mass function of a Poisson distribution is expressed as $p(x|\lambda) = \lambda^x e^{-\lambda}/x!$, with some parameter $\lambda\in\mathbb{R}^+$. We will estimate this parameter as
$$\lambda_{\text{ML}} = \arg\max_{\lambda\in(0,\infty)}\{p(D|\lambda)\}.$$
We can write the likelihood function as
$$p(D|\lambda) = p(\{x_i\}_{i=1}^n\,|\,\lambda) = \prod_{i=1}^n p(x_i|\lambda) = \frac{\lambda^{\sum_{i=1}^n x_i}\,e^{-n\lambda}}{\prod_{i=1}^n x_i!}.$$
To find the $\lambda$ that maximizes the likelihood, we will first take a logarithm (a monotonic function) to simplify the calculation, then find its first derivative with respect to $\lambda$, and finally equate it with zero to find the maximum. Specifically, we express the log-likelihood $\ell\ell(D,\lambda) = \ln p(D|\lambda)$ as

5 ll(d, )ln i and proceed with the first derivative as x i n ln (x i!) n X 0. i x i n By substituting n 6and values from D, wecancomputethesolutionas ML n 5.5, which is simply a sample mean. The second derivative of the lielihood function is always negative because must be positive; thus, the previous expression indeed maximizes the lielihood. We have ignored the anomalous cases when D contains all zeros. Example 9. Let D {2, 5, 9, 5, 4, 8} again be an i.i.d. sample from Poisson( t ),butnowwearealsogivenadditionalinformation. Supposethe prior nowledge about t can be expressed using a gamma distribution (x, ) with parameters 3and. Find the maximum a posteriori estimate of t. First, we write the probability density function of the gamma family as i x i e x (x, ) x (), where x>0, >0, and >0. () is the gamma function that generalizes the factorial function; when is an integer, we have () ( )!. The MAP estimate of the parameters can be found as MAP arg max {p(d )p( )}. 2(0,) As before, we can write the lielihood function as p(d ) i xi e n Q n i x i! and the prior distribution as 5

Example 9. Let $D = \{2, 5, 9, 5, 4, 8\}$ again be an i.i.d. sample from Poisson($\lambda_t$), but now we are also given additional information. Suppose the prior knowledge about $\lambda_t$ can be expressed using a gamma distribution with parameters $\alpha = 3$ and $\beta = 1$. Find the maximum a posteriori estimate of $\lambda_t$.

First, we write the probability density function of the gamma family as
$$\gamma(x\,|\,\alpha,\beta) = \frac{x^{\alpha-1}\,e^{-x/\beta}}{\beta^\alpha\,\Gamma(\alpha)},$$
where $x>0$, $\alpha>0$, and $\beta>0$. $\Gamma(\alpha)$ is the gamma function, which generalizes the factorial; when $\alpha$ is an integer, we have $\Gamma(\alpha) = (\alpha-1)!$. The MAP estimate of the parameter can be found as
$$\lambda_{\text{MAP}} = \arg\max_{\lambda\in(0,\infty)}\{p(D|\lambda)\,p(\lambda)\}.$$
As before, we can write the likelihood function as
$$p(D|\lambda) = \frac{\lambda^{\sum_{i=1}^n x_i}\,e^{-n\lambda}}{\prod_{i=1}^n x_i!}$$
and the prior distribution as
$$p(\lambda) = \frac{\lambda^{\alpha-1}\,e^{-\lambda/\beta}}{\beta^\alpha\,\Gamma(\alpha)}.$$
Now, we can maximize the logarithm of the posterior distribution $p(\lambda|D)$ using
$$\ln p(\lambda|D) \propto \ln p(D|\lambda) + \ln p(\lambda) = \left(\alpha - 1 + \sum_{i=1}^n x_i\right)\ln\lambda - \left(n + \frac{1}{\beta}\right)\lambda - \sum_{i=1}^n\ln x_i! - \alpha\ln\beta - \ln\Gamma(\alpha)$$
to obtain
$$\lambda_{\text{MAP}} = \frac{\alpha - 1 + \sum_{i=1}^n x_i}{n + \frac{1}{\beta}} = 5$$
after incorporating all the data.

A quick look at $\lambda_{\text{MAP}}$ and $\lambda_{\text{ML}}$ suggests that as $n$ grows, both the numerators and the denominators in the expressions above become increasingly similar. Unfortunately, equality between $\lambda_{\text{MAP}}$ and $\lambda_{\text{ML}}$ when $n\to\infty$ cannot be guaranteed because it could occur by (a distant) chance that $S_n = \sum_{i=1}^n x_i$ does not grow with $n$. One way to proceed is to calculate the expectation of the difference between $\lambda_{\text{MAP}}$ and $\lambda_{\text{ML}}$ and then investigate what happens when $n\to\infty$. Let us do this formally. We shall first note that both estimates can be considered random variables. This follows from the fact that $D$ is assumed to be an i.i.d. sample from a Poisson distribution with some true parameter $\lambda_t$. With this, we can write $\lambda_{\text{MAP}} = (\alpha - 1 + S_n)/(n + 1/\beta)$ and $\lambda_{\text{ML}} = S_n/n$, where $S_n = \sum_{i=1}^n X_i$ and $X_i \sim \text{Poisson}(\lambda_t)$. We shall now prove the following convergence (in mean) result
$$\lim_{n\to\infty}\mathrm{E}_D\!\left[\,\big|\lambda_{\text{MAP}} - \lambda_{\text{ML}}\big|\,\right] = 0,$$
where the expectation is performed with respect to the random variable $D$. Let us first express the absolute difference between the two estimates as
$$\big|\lambda_{\text{MAP}} - \lambda_{\text{ML}}\big| = \left|\frac{\alpha-1+S_n}{n+1/\beta} - \frac{S_n}{n}\right| = \left|\frac{\alpha-1}{n+1/\beta} - \frac{S_n}{\beta n(n+1/\beta)}\right| \le \frac{\alpha-1}{n+1/\beta} + \frac{S_n}{\beta n(n+1/\beta)} = \Delta,$$
where $\Delta$ is a random variable that bounds the absolute difference between $\lambda_{\text{MAP}}$ and $\lambda_{\text{ML}}$. We shall now express the expectation of $\Delta$ as

$$\mathrm{E}[\Delta] = \frac{\alpha-1}{n+1/\beta} + \frac{1}{\beta n(n+1/\beta)}\,\mathrm{E}\!\left[\sum_{i=1}^n X_i\right] = \frac{\alpha-1}{n+1/\beta} + \frac{\lambda_t}{\beta\,(n+1/\beta)}$$
and therefore calculate that
$$\lim_{n\to\infty}\mathrm{E}[\Delta] = 0.$$
This result shows that the MAP estimate approaches the ML solution for large data sets. In other words, large data diminishes the importance of prior knowledge. This is an important conclusion because it simplifies the mathematical apparatus necessary for practical inference.

1.1.1 The relationship with Kullback-Leibler divergence

We now investigate the relationship between maximum likelihood estimation and the Kullback-Leibler divergence. The Kullback-Leibler divergence between two probability distributions $p(x)$ and $q(x)$ is defined as
$$D_{\mathrm{KL}}(p\,\|\,q) = \int p(x)\log\frac{p(x)}{q(x)}\,dx.$$
In information theory, the Kullback-Leibler divergence has a natural interpretation as the inefficiency of signal compression when the code is constructed using a suboptimal distribution $q(x)$ instead of the correct (but unknown) distribution $p(x)$ according to which the data has been generated. More often than not, however, the Kullback-Leibler divergence is simply considered to be a measure of divergence between two probability distributions. Although this divergence is not a metric (it is not symmetric and does not satisfy the triangle inequality), it has important theoretical properties in that (i) it is always non-negative and (ii) it is equal to zero if and only if $p(x) = q(x)$.

Consider now the divergence between an estimated probability distribution $p(x|\theta)$ and the underlying (true) distribution $p(x|\theta_t)$ according to which the data set $D = \{x_i\}_{i=1}^n$ was generated. The Kullback-Leibler divergence between $p(x|\theta_t)$ and $p(x|\theta)$ is
$$D_{\mathrm{KL}}\big(p(x|\theta_t)\,\|\,p(x|\theta)\big) = \int p(x|\theta_t)\log\frac{p(x|\theta_t)}{p(x|\theta)}\,dx = -\int p(x|\theta_t)\log p(x|\theta)\,dx + \int p(x|\theta_t)\log p(x|\theta_t)\,dx.$$
The second term in the above equation is simply the (negative) entropy of the true distribution and is not influenced by our choice of the model. The first term, on the other hand, can be expressed as

$$-\int p(x|\theta_t)\log p(x|\theta)\,dx = -\,\mathrm{E}[\log p(x|\theta)].$$
Therefore, maximizing $\mathrm{E}[\log p(x|\theta)]$ minimizes the Kullback-Leibler divergence between $p(x|\theta)$ and $p(x|\theta_t)$. Using the strong law of large numbers, we know that
$$\frac{1}{n}\sum_{i=1}^n\log p(x_i|\theta)\ \xrightarrow{\ \text{a.s.}\ }\ \mathrm{E}[\log p(x|\theta)]$$
when $n\to\infty$. Thus, when the data set is sufficiently large, maximizing the likelihood function minimizes the Kullback-Leibler divergence and leads to the conclusion that $p(x|\theta_{\text{ML}}) = p(x|\theta_t)$, if the underlying assumptions are satisfied. Under reasonable conditions, we can infer from this that $\theta_{\text{ML}} = \theta_t$. This will hold for families of distributions in which a set of parameters uniquely determines the probability distribution; e.g., it will not generally hold for mixtures of distributions, but we will discuss this situation later. This result is only one of the many connections between statistics and information theory.

1.2 Parameter estimation for mixtures of distributions

We now investigate parameter estimation for mixture models, which is most commonly carried out using the expectation-maximization (EM) algorithm. As before, we are given a set of i.i.d. observations $D = \{x_i\}_{i=1}^n$, with the goal of estimating the parameters of the mixture distribution
$$p(x|\theta) = \sum_{j=1}^m w_j\,p(x|\theta_j).$$
In the equation above, we used $\theta = (w_1, w_2, \ldots, w_m, \theta_1, \theta_2, \ldots, \theta_m)$ to combine all parameters. Just to be more concrete, we shall assume that each $p(x|\theta_j)$ is an exponential distribution with parameter $\lambda_j$, i.e.
$$p(x|\lambda_j) = \lambda_j e^{-\lambda_j x},$$
where $\lambda_j > 0$. Finally, we shall assume that $m$ is given and will address simultaneous estimation of $\theta$ and $m$ later.

Let us attempt to find the maximum likelihood solution first. By plugging the formula for $p(x|\theta)$ into the likelihood function, we obtain
$$p(D|\theta) = \prod_{i=1}^n p(x_i|\theta) = \prod_{i=1}^n\left(\sum_{j=1}^m w_j\,p(x_i|\lambda_j)\right), \qquad (2)$$
which, unfortunately, is difficult to maximize using differential calculus (why?). Note that although the expanded product in $p(D|\theta)$ has $O(m^n)$ terms, the log-likelihood can be calculated in $O(mn)$ time.
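As an illustration of this $O(mn)$ computation, the following sketch evaluates the log-likelihood of an exponential mixture directly from an $n\times m$ table of weighted component densities; the example values of $x$, $w$, and the rates are arbitrary assumptions, not data from the notes.

```python
import numpy as np

def mixture_log_likelihood(x, w, lam):
    """Log-likelihood of an exponential mixture, computed in O(mn) time.

    x:   n non-negative observations
    w:   m mixing proportions (summing to one)
    lam: m rate parameters, with p(x | lam_j) = lam_j * exp(-lam_j * x)
    """
    # densities[i, j] = w_j * p(x_i | lam_j): an n-by-m table, no m^n expansion
    densities = w * lam * np.exp(-np.outer(x, lam))
    return np.sum(np.log(densities.sum(axis=1)))

# Arbitrary illustrative values
x = np.array([0.3, 1.2, 0.7, 4.5, 2.2])
w = np.array([0.6, 0.4])
lam = np.array([2.0, 0.5])
print(mixture_log_likelihood(x, w, lam))
```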

Before introducing the EM algorithm, let us for a moment present two hypothetical scenarios that will help us understand the algorithm. First, suppose that information is available as to which mixing component generated which data point. That is, suppose that $D = \{(x_i, y_i)\}_{i=1}^n$ is an i.i.d. sample from some distribution $p_{XY}(x,y)$, where $y\in\mathcal{Y} = \{1, 2, \ldots, m\}$ specifies the mixing component. How would the maximization be performed then? Let us write the likelihood function as
$$p(D|\theta) = \prod_{i=1}^n p(x_i, y_i|\theta) = \prod_{i=1}^n p(x_i|y_i,\theta)\,p(y_i|\theta) = \prod_{i=1}^n w_{y_i}\,p(x_i|\lambda_{y_i}), \qquad (3)$$
where $w_j = p_Y(j) = P(Y = j)$. The log-likelihood is
$$\log p(D|\theta) = \sum_{i=1}^n\left(\log w_{y_i} + \log p(x_i|\lambda_{y_i})\right) = \sum_{j=1}^m n_j\log w_j + \sum_{i=1}^n\log p(x_i|\lambda_{y_i}),$$
where $n_j$ is the number of data points in $D$ generated by the $j$-th mixing component. It is useful to observe here that when $y = (y_1, y_2, \ldots, y_n)$ is known, the internal summation operator in Eq. (2) disappears. More importantly, it follows that Eq. (3) can be maximized in a relatively straightforward manner. Let us show how.

To find $w = (w_1, w_2, \ldots, w_m)$ we need to solve a constrained optimization problem, which we will do using the method of Lagrange multipliers. We shall first form the Lagrangian function as
$$L(w,\alpha) = \sum_{j=1}^m n_j\log w_j + \alpha\left(\sum_{j=1}^m w_j - 1\right),$$
where $\alpha$ is the Lagrange multiplier. Then, setting $\frac{\partial}{\partial w_k}L(w,\alpha) = 0$ for every $k$ and $\frac{\partial}{\partial\alpha}L(w,\alpha) = 0$, we derive that $w_k = n_k/n$ and $\alpha = -n$. Thus,
$$w_k = \frac{1}{n}\sum_{i=1}^n I(y_i = k),$$
where $I(\cdot)$ is the indicator function. To find all $\lambda_k$, we recall that we assumed a

mixture of exponential distributions. Thus, we proceed by solving
$$\frac{\partial}{\partial\lambda_k}\sum_{i=1}^n\log p(x_i|\lambda_{y_i}) = 0$$
for each $k\in\mathcal{Y}$. We obtain
$$\lambda_k = \frac{\sum_{i=1}^n I(y_i = k)}{\sum_{i=1}^n I(y_i = k)\,x_i},$$
which is simply the inverse mean over those data points generated by the $k$-th mixture component. In summary, we observe that if the mixing-component designations $y$ are known, parameter estimation is greatly simplified. This was achieved by decoupling the estimation of the mixing proportions from the estimation of the parameters of the mixing distributions.

In the second hypothetical scenario, suppose that the parameters $\theta$ are known, and that we would like to estimate the best configuration of the mixture designations $y$ (one may be tempted to call them class labels). This task looks like clustering, in which cluster memberships need to be determined based on the known set of mixing distributions and mixing probabilities. To do this we can calculate the posterior distribution of $y$ as
$$p(y|D,\theta) = \prod_{i=1}^n p(y_i|x_i,\theta) = \prod_{i=1}^n\frac{w_{y_i}\,p(x_i|\lambda_{y_i})}{\sum_{j=1}^m w_j\,p(x_i|\lambda_j)} \qquad (4)$$
and subsequently find the best configuration out of $m^n$ possibilities. Obviously, because of the i.i.d. assumption, each element $y_i$ can be estimated separately and, thus, this estimation can be completed in $O(mn)$ time. The MAP estimate for $y_i$ can be found as
$$\hat{y}_i = \arg\max_{y_i\in\mathcal{Y}}\left\{\frac{w_{y_i}\,p(x_i|\lambda_{y_i})}{\sum_{j=1}^m w_j\,p(x_i|\lambda_j)}\right\}$$
for each $i\in\{1, 2, \ldots, n\}$.

In reality, neither the class labels $y$ nor the parameters $\theta$ are known. Fortunately, we have just seen that the optimization step is relatively straightforward if one of them is known. Therefore, the intuition behind the EM algorithm is to form an iterative procedure that assumes either $y$ or $\theta$ is known and calculates the other. For example, we can initially pick some value for $\theta$, say $\theta^{(0)}$, and then estimate $y$ by computing $p(y|D,\theta^{(0)})$ as in Eq. (4). We can refer to this estimate as $y^{(0)}$. Using $y^{(0)}$ we can now refine the estimate of $\theta$ to $\theta^{(1)}$ using Eq. (3). We can then iterate these two steps until convergence. In the case of a mixture of exponential distributions, we arrive at the following algorithm:

1. Initialize $\lambda_k^{(0)}$ and $w_k^{(0)}$ for all $k\in\mathcal{Y}$

2. Calculate $y_i^{(0)} = \arg\max_{k\in\mathcal{Y}}\left\{\dfrac{w_k^{(0)}\,p(x_i|\lambda_k^{(0)})}{\sum_{j=1}^m w_j^{(0)}\,p(x_i|\lambda_j^{(0)})}\right\}$ for all $i\in\{1, 2, \ldots, n\}$

3. Set $t = 0$

4. Repeat until convergence

   (a) $w_k^{(t+1)} = \dfrac{1}{n}\sum_{i=1}^n I(y_i^{(t)} = k)$ for all $k\in\mathcal{Y}$

   (b) $\lambda_k^{(t+1)} = \dfrac{\sum_{i=1}^n I(y_i^{(t)} = k)}{\sum_{i=1}^n I(y_i^{(t)} = k)\,x_i}$ for all $k\in\mathcal{Y}$

   (c) $y_i^{(t+1)} = \arg\max_{k\in\mathcal{Y}}\left\{\dfrac{w_k^{(t+1)}\,p(x_i|\lambda_k^{(t+1)})}{\sum_{j=1}^m w_j^{(t+1)}\,p(x_i|\lambda_j^{(t+1)})}\right\}$ for all $i\in\{1, 2, \ldots, n\}$

   (d) $t = t + 1$

5. Report $\lambda_k^{(t)}$ and $w_k^{(t)}$ for all $k\in\mathcal{Y}$

This procedure is not quite yet the EM algorithm; rather, it is a version of it referred to as the classification EM (CEM) algorithm. In the next section we introduce the EM algorithm.

1.2.1 The expectation-maximization algorithm

The previous procedure was designed to iteratively estimate the class memberships and the parameters of the distribution. In reality, it is not necessary to compute $y$; after all, we only need to estimate $\theta$. To accomplish this, at each step $t$ we can use $p(y|D,\theta^{(t)})$ to maximize the expected log-likelihood of both $D$ and $y$,
$$\mathrm{E}_Y\!\left[\log p(D,y|\theta)\,\big|\,\theta^{(t)}\right] = \sum_y\log p(D,y|\theta)\,p(y|D,\theta^{(t)}), \qquad (5)$$
which can be carried out by integrating the log-likelihood function of $D$ and $y$ over the posterior distribution for $y$, in which the current values of the parameters $\theta^{(t)}$ are assumed to be known. We can now formulate the expression for the parameters in step $t+1$ as
$$\theta^{(t+1)} = \arg\max_\theta\left\{\mathrm{E}\!\left[\log p(D,y|\theta)\,\big|\,\theta^{(t)}\right]\right\}. \qquad (6)$$
The formula above is all that is necessary to create the update rule for the EM algorithm. Note, however, that inside of it we always have to re-compute $\mathrm{E}_Y[\log p(D,y|\theta)\,|\,\theta^{(t)}]$ because the parameters $\theta^{(t)}$ have been updated in the previous step; we can then perform the maximization. Hence the name expectation-maximization, although it is perfectly valid to think of the EM algorithm as an iterative maximization of the expectation from Eq. (5), i.e., expectation maximization. We now proceed as follows:

12 E[log p(d, y ) (t) ] y y y log p(d, y )p(y D, (t) ) y n y n i y n i log p(x i,y i ) p(y l x l, (t) ) l log (w yi p(x i yi )) p(y l x l, (t) ). After several simplification steps, that we omit for space reasons, the expectation of the lielihood can be written as E[log p(d, y ) (t) ] i j l log (w j p(x i j )) p Yi (j x i, (t) ), from which we can see that w and { j } m j can be separately found. In the final two steps, we will first derive the update rule for the mixing probabilities and then by assuming the mixing distributions are exponential, derive the update rules for their parameters. To maximize E[log p(d, y ) (t) ] with respect to w, weobservethatthisis an instance of constrained optimization because it must hold that P m i w i. We will use the method of Lagrange multipliers; thus, for each 2Ywe need to solve 0 X log w j p Yi (j x i, (t) w j AA j i where is the Lagrange multiplier. It is relatively straightforward to show that w (t+) n j p Yi ( x i, (t) ). (7) i Similarly, to find the solution for the parameters of the mixture distributions, we obtain that (t+) i p Y i ( x i, (t) ) i x ip Yi ( x i, (t) ) (8) for 2Y. As previously shown, we have p Yi ( x i, (t) ) w (t) (t) ) p(x i P m j w(t) j p(x i (t) j ), (9) which can be computed and stored as an n m matrix. In summary, for the mixture of m exponential distributions, we summarize the EM algorithm by combining Eqs. (7-9) as follows: 2

Similar update rules can be obtained for different probability distributions; however, the corresponding derivatives have to be found separately for each family. It is important to observe and understand the difference between the CEM and the EM algorithms.

1.3 Bayesian estimation

Maximum a posteriori and maximum likelihood approaches report the solution that corresponds to the mode of the posterior distribution and of the likelihood function, respectively. This approach, however, does not consider the possibility of skewed distributions, multimodal distributions, or simply large regions with similar values of $p(M|D)$. Bayesian estimation addresses those concerns. The main idea in Bayesian statistics is the minimization of the posterior risk
$$R = \int_{\mathcal{M}}\ell(M,\hat{M})\,p(M|D)\,dM,$$
where $\hat{M}$ is our estimate and $\ell(M,\hat{M})$ is some loss function between two models. When $\ell(M,\hat{M}) = (M - \hat{M})^2$ (ignore the abuse of notation), we can minimize the posterior risk as
$$\frac{\partial R}{\partial\hat{M}} = 2\hat{M} - 2\int_{\mathcal{M}} M\,p(M|D)\,dM = 0,$$
from which it can be derived that the minimizer of the posterior risk is the posterior mean, i.e.
$$\hat{M}_B = \int_{\mathcal{M}} M\,p(M|D)\,dM = \mathrm{E}_M[M|D].$$

We shall refer to $\hat{M}_B$ as the Bayes estimator. It is important to mention that computing the posterior mean usually involves solving complex integrals. In some situations these integrals can be solved analytically; in others, numerical integration is necessary.

Example. Let $D = \{2, 5, 9, 5, 4, 8\}$ yet again be an i.i.d. sample from Poisson($\lambda_t$). Suppose the prior knowledge about $\lambda_t$ can be expressed using a gamma distribution with parameters $\alpha = 3$ and $\beta = 1$. Find the Bayesian estimate of $\lambda_t$.

We want to find $\mathrm{E}[\lambda|D]$. Let us first write the posterior distribution as
$$p(\lambda|D) = \frac{p(D|\lambda)\,p(\lambda)}{p(D)} = \frac{p(D|\lambda)\,p(\lambda)}{\int_0^\infty p(D|\lambda)\,p(\lambda)\,d\lambda},$$
where, as shown in the previous examples, we have
$$p(D|\lambda) = \frac{\lambda^{\sum_{i=1}^n x_i}\,e^{-n\lambda}}{\prod_{i=1}^n x_i!}\qquad\text{and}\qquad p(\lambda) = \frac{\lambda^{\alpha-1}\,e^{-\lambda/\beta}}{\beta^\alpha\,\Gamma(\alpha)}.$$
Before calculating $p(D)$, let us first note that
$$\int_0^\infty x^{\alpha-1}e^{-x/\beta}\,dx = \beta^\alpha\,\Gamma(\alpha).$$
Now, we can derive that
$$\int_0^\infty p(D|\lambda)\,p(\lambda)\,d\lambda = \frac{1}{\beta^\alpha\,\Gamma(\alpha)\prod_{i=1}^n x_i!}\int_0^\infty\lambda^{\alpha+\sum_i x_i-1}\,e^{-\lambda(n+1/\beta)}\,d\lambda$$
and subsequently that
$$p(D) = \frac{\Gamma\!\left(\alpha+\sum_{i=1}^n x_i\right)}{\beta^\alpha\,\Gamma(\alpha)\,\prod_{i=1}^n x_i!\,\left(n+\frac{1}{\beta}\right)^{\alpha+\sum_{i=1}^n x_i}}.$$

15 p(d )p( ) p( D) p(d) i xi e n Q n i x i! Finally, e () + i xi e (n+ / ) (n + ) ( + i x i) () Q n i x i!(n + ) i x i+ ( + i x i) i x i+. E[ D] ˆ 0 p( D)d + i x i n which is nearly the same solution as the MAP estimate found in Example 9. It is evident from the previous example that selection of the prior distribution has important implications on calculation of the posterior mean. We have not piced the gamma distribution by chance; that is, when the lielihood was multiplied by the prior, the resulting distribution remained in the same class of functions as the lielihood. We shall refer to such prior distributions as conjugate priors. Conjugate priors are also simplifying the mathematics; in fact, this is a major reason for their consideration. Interestingly, in addition to the Poisson distribution, the gamma distribution is a conjugate prior to the exponential distribution as well as the gamma distribution itself. 5
