ECE 275B Homework #2 Due Thursday 2/12/2015. MIDTERM is Scheduled for Thursday, February 19, 2015


Reading

Read and understand the Newton-Raphson and Method of Scores MLE procedures given in Kay, Example 7.11. Also read and understand the material in Section 7.8 of Kay. Read and understand the Expectation-Maximization (EM) Algorithm described in Moon and Stirling.

The following papers are highly recommended:

- Eliminating Multiple Root Problems in Estimation, C.G. Small, J. Wang, and Z. Yang, Statistical Science, Vol. 15, No. 4, Nov. 2000.
- Maximum Likelihood from Incomplete Data via the EM Algorithm, A.P. Dempster, N.M. Laird, and D.B. Rubin, Journal of the Royal Statistical Society, Series B (Methodological), Vol. 39, No. 1, 1977. Highly recommended.
- On the Convergence Properties of the EM Algorithm, C.F. Jeff Wu, The Annals of Statistics, Vol. 11, No. 1, Mar. 1983.
- On the Convergence of the EM Algorithm, R.A. Boyles, Journal of the Royal Statistical Society, Series B (Methodological), Vol. 45, No. 1, 1983.
- Mixture Densities, Maximum Likelihood and the EM Algorithm, R.A. Redner and H.F. Walker, SIAM Review, Vol. 26, No. 2, Apr. 1984. Highly recommended.
- Direct Calculation of the Information Matrix via the EM Algorithm, D. Oakes, Journal of the Royal Statistical Society, Series B (Statistical Methodology), Vol. 61, No. 2, 1999.
- The Variational Approximation for Bayesian Inference: Life After the EM Algorithm, D.G. Tzikas, A.C. Likas, and N.P. Galatsanos, IEEE Signal Processing Magazine, November 2008.

Comments on the Above Cited Papers

As mentioned in class, for highly nonlinear problems of the type considered by Kay and for parametric mixture distributions [1] there will generally be multiple roots to the likelihood equation, corresponding to multiple local maxima of the likelihood function. The question of how to deal with this problem, and in particular the question of how to track the zero of the score function having the desired asymptotic property of efficiency (and which generally does not correspond to the global maximum of the likelihood function), is a difficult one of keen interest. In this regard, you might want to read the paper Eliminating Multiple Root Problems in Estimation cited above.

[1] In both cases we are leaving the nice world of regular exponential family distributions.

Note from your readings in Kay that the Newton-Raphson procedure uses the Sample Information Matrix (SIM), while the Method of Scores uses the actual Fisher Information Matrix (FIM), which is the expected value of the SIM. As discussed in Kay, in the limit of an infinitely large number of samples it is the case that FIM ≈ SIM. Thus a perceived merit of the Newton-like numerical approaches is that an asymptotically valid estimate of the parameter error covariance matrix is provided by the availability of the SIM. [2]

[2] Recall that for regular statistical families the MLE asymptotically attains the C-R lower bound, which is the inverse of the FIM. The availability of the error covariance matrix, coupled with the fact that the MLE is asymptotically normal, allows one to easily construct confidence intervals (error bars) about the estimated parameter values.
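As a small, purely illustrative aside (not taken from Kay), the sketch below contrasts the two update rules on the classic Cauchy location-estimation problem, whose score equation can have multiple roots: Newton-Raphson divides the score by the sample (observed) information, while the Method of Scores divides by the Fisher information, which for a unit-scale Cauchy sample of size n is n/2. The model choice, data, seed, and function names here are my own assumptions.

```python
import numpy as np

# Cauchy(theta) location model: p(x; theta) = 1 / (pi * (1 + (x - theta)^2))
# score             s(theta) = sum_i 2 u_i / (1 + u_i^2),           u_i = x_i - theta
# sample info (SIM) J(theta) = sum_i 2 (1 - u_i^2) / (1 + u_i^2)^2
# Fisher info (FIM) I(theta) = n / 2

rng = np.random.default_rng(0)
x = rng.standard_cauchy(50) + 3.0          # 50 samples, true location 3

def score(theta):
    u = x - theta
    return np.sum(2 * u / (1 + u ** 2))

def sample_info(theta):
    u = x - theta
    return np.sum(2 * (1 - u ** 2) / (1 + u ** 2) ** 2)

def newton_raphson(theta, iters=25):
    # No step-size safeguarding; intended to be started from a good point such as the median.
    for _ in range(iters):
        theta = theta + score(theta) / sample_info(theta)   # SIM in the denominator
    return theta

def method_of_scores(theta, iters=25):
    fim = len(x) / 2.0                                      # FIM in the denominator
    for _ in range(iters):
        theta = theta + score(theta) / fim
    return theta

theta0 = np.median(x)                                       # consistent starting point
print("Newton-Raphson  :", newton_raphson(theta0))
print("Method of Scores:", method_of_scores(theta0))
```

Starting both iterations from the consistent sample median keeps them near the efficient root of the score equation, which is the tracking issue raised above; the reciprocal of sample_info evaluated at the final iterate also provides the asymptotic error-variance estimate mentioned in footnote 2.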

Important and useful background material on the history, applications, and convergence properties of the EM algorithm as a procedure for performing missing-data maximum likelihood estimation can be found in the classic paper by Dempster, Laird, and Rubin (DLR) cited above. The DLR paper discusses a variety of applications, including mixture-density estimation and the example given in Moon. In the DLR paper the complete data space is not a simple cartesian product of the actual data and hidden data spaces. The subsequent papers by Wu and Boyles, both cited above, discuss the lack of validity of the convergence proof given in the original DLR paper. Boyles gives an illuminating counterexample which contradicts the convergence claim made in DLR. Wu gives a rigorous and correct proof of convergence for the EM algorithm, and states sufficient conditions for convergence.

Also cited above is the excellent and highly recommended SIAM Review paper by Redner and Walker on applying the EM algorithm to the problem of mixture density estimation. Despite the existence of excellent books now available on the subject of mixture density estimation, this classic paper is still well worth reading. The use of mixture densities to model arbitrary density functions, and classification when class-labeled training data are not available, are two immensely important uses of the EM algorithm. The convergence conditions given by DLR and Wu are clearly important for mixture density estimation. While mixtures of regular exponential family distributions (REFDs) result in convergent behavior, this is not true when one or more of the component densities of a mixture are not from an REFD. For example, a mixture of a uniform distribution and a gaussian distribution will not result in a convergent EM algorithm, as you will find in one of the homework problems below.

One claimed potential drawback to the basic EM algorithm is the general lack of the availability of the SIM or FIM for use as a parameter error covariance matrix. [3] In fact, a careful reading of DLR shows that the SIM can be readily computed. This fact is discussed and made salient in the above-cited paper Direct Calculation of the Information Matrix via the EM Algorithm, which discusses the additional computations (over and above the basic EM algorithm computations) needed to compute an estimate of the information matrix.

Finally, the review paper by Tzikas et al. on the relationship between the EM algorithm and the variational approximation is a nice, short synopsis of material to be found in the textbook Pattern Recognition and Machine Learning, C. Bishop, Springer. The derivation of the EM algorithm given in this paper is slightly simpler than the one I give in lecture as (unlike DLR) they assume that the complete data space is the cartesian product of the actual data and hidden data spaces. [4]

[3] On the one hand, the ability to avoid the explicit computation of the FIM (usually required to be updated at every iteration step) typically provides a huge computational savings relative to the Newton algorithm. On the other hand, having this matrix available once the algorithm has finally converged to an estimate of the unknown parameters allows one to construct confidence intervals (parameter error bars) for these parameters.

[4] Bishop makes this same simplifying assumption, which allows one to use the standard definition of conditional probability and thereby avoid having to understand what it means to condition on a cut (or section). From our lecture discussions we know that conditioning on a cut yields a type of generalized conditional probability.

Homework Problems and Matlab Programming Exercises

1. Let the scalar random variables X_1, ..., X_n and Y_1, ..., Y_n be all mutually independent, where Y_i ~ Poisson(β τ_i) and X_i ~ Poisson(τ_i), i = 1, ..., n. The two data sets {X_i}_{i=1}^n and {Y_i}_{i=1}^n taken together comprise the complete data. The parameters β and τ_i, i = 1, ..., n, are deterministic and unknown. [5] This provides a simple model of the incidence of a disease Y_i seen in hospital i, where the underlying rate is a function of an overall (hospital-independent) effect β due to the intrinsic virulence of the disease and an additional hospital-specific factor τ_i. The availability of the measurements X_i allows one to directly estimate the hospital-specific factor τ_i. [6] (A simulation sketch of this data model is shown after this problem.)

   (a) Find the complete information maximum likelihood estimate (MLE) of the unknown parameters.

   (b) Assume that the observation for X_1 is missing. (Thus each hospital i, except for hospital 1, is able to provide a measurement of the quantity X_i.) Determine an EM algorithm for finding the actual information MLE of the unknown parameters.

[5] Recall that a Poisson(λ) distribution has the form
$$p_Z(z \mid \lambda) = \frac{e^{-\lambda}\,\lambda^z}{z!}, \qquad z = 0, 1, \ldots,$$
with mean E{z} = λ.

[6] For example, the parameter τ_i could depend upon the size and overall health of the population in the region served by hospital i, and X_i could be the waiting time in minutes before an entering patient is first examined by a trained medical professional. Or τ_i could depend on the quality of the health care provided by hospital i and X_i could be the number of nurses with advanced training. Et cetera.
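For concreteness only (this is not part of the problem), here is a minimal sketch of how the complete and incomplete data of Problem 1 could be simulated; the number of hospitals, seed, and parameter values are my own illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

n = 20                               # number of hospitals (assumed)
beta = 2.5                           # overall virulence effect (unknown in the problem)
tau = rng.uniform(1.0, 10.0, n)      # hospital-specific factors (unknown in the problem)

# Complete data: X_i ~ Poisson(tau_i), Y_i ~ Poisson(beta * tau_i), all mutually independent.
X = rng.poisson(tau)
Y = rng.poisson(beta * tau)

# Incomplete data for part (b): the observation X_1 is missing.
X_obs = X.astype(float)
X_obs[0] = np.nan                    # hospital 1 fails to report X_1

print("observed X:", X_obs)
print("observed Y:", Y)
```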

2. EFDs and the EM Algorithm for missing-data MLE. (NEW, no solution provided.) Consider the bivariate normal distribution for a single full-data vector z = (z_1, z_2)^T,
$$p_\theta(z) = \frac{1}{2\pi\sqrt{1-\theta^2}}\,\exp\!\left(-\,\frac{z_1^2 - 2\theta z_1 z_2 + z_2^2}{2\,(1-\theta^2)}\right) \quad\text{for}\quad \theta \in \Theta = \{\theta \mid -1 < \theta < 1\} \subset \mathbb{R}.$$
The scalar parameter θ is the correlation coefficient; it provides a measure of the degree of correlation between z_1 and z_2. [7] A useful property of the bivariate normal distribution is that the conditional distribution is also Gaussian,
$$p_\theta(z_i \mid z_j) = N\!\left(\theta z_j,\; 1-\theta^2\right), \qquad i \neq j.$$
It is assumed that n iid full-data samples are realized from the bivariate normal distribution,
$$Z = \{z_1, z_2, \ldots, z_n\} \quad\text{with}\quad z_j = \begin{pmatrix} z_{1,j} \\ z_{2,j} \end{pmatrix}.$$
Unfortunately, because of an unreliable sensor, sometimes data is not measured (and is therefore "missing"), and during this particular realization the sensor fails to detect the single data component value z_{2,1}. (I.e., the second component of the first data sample is missing.) Thus, the observed data (aka actual data or measurement data) which is available for statistical inference is
$$Y = \{z_{1,1}, z_2, \ldots, z_n\} = Z \setminus \{z_{2,1}\}.$$
(A simulation sketch of this setup appears after this problem.)

   (a) Place the single-sample full-data bivariate normal distribution shown above in minimal Exponential Family Distribution (EFD) form. Be sure to clearly define all of the quantities comprising the EFD.
      i. Prove that the natural statistics determined by your EFD are sufficient.
      ii. Prove that the natural statistics are minimal.
      iii. Explain why it is not immediately obvious whether or not the natural statistics are complete.
      iv. Show that your model is identifiable.

   (b) Determine the EFD form of the full-data distribution p_θ(Z) for the iid collection of full-data samples Z. Be sure to clearly define all of the quantities comprising the EFD.
      i. Relate the Z-natural statistics to the sample mean of the z_j-natural statistics.
      ii. Is the Z-EFD that you have determined minimal and identifiable? Explain.

   (c) Show that the full-data MLE θ̂_ml^full(Z) is found as the zero of a polynomial.
      i. Write the relevant polynomial in terms of the sample means of the z_j-natural statistics. (How to do this should be evident once you've related the Z-natural statistics to the sample mean of the z_j-natural statistics as requested above.)
      ii. You do not have to explicitly solve for the root, just explain why it can be expressed as a known closed-form solution of the full-data natural statistics (equivalently, of the sample means of the z_j-natural statistics). Henceforth you can assume that you know the function that relates the full-data natural statistics to the full-data MLE. Denote this function by the symbol F.

   (d) Develop an EM Algorithm to iteratively compute the observed-data MLE θ̂_ml^obs(Y).
      i. Clearly show the exact values (numerical and otherwise) of the quantities computed in the E-step of the EM Algorithm.
      ii. Clearly argue why, having already formally solved for the full-data MLE case, you can perform the M-step of the EM Algorithm.
      iii. Write the resulting EM Algorithm in a very few lines of pseudocode.

[7] This model is just the zero-mean multivariate Gaussian distribution, z ~ N(0, C), with covariance matrix
$$C = \begin{pmatrix} 1 & \theta \\ \theta & 1 \end{pmatrix}.$$
When θ² = 1, the bivariate case becomes degenerate (essentially reducing to the univariate case).
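Purely as an illustration of the data model in Problem 2 (not something you need to hand in), the following sketch draws n full-data samples with an assumed true correlation and then deletes the single entry z_{2,1}; the seed, sample size, and value of θ are my own choices.

```python
import numpy as np

rng = np.random.default_rng(2)

theta = 0.7                                   # assumed true correlation, -1 < theta < 1
C = np.array([[1.0, theta],
              [theta, 1.0]])                  # covariance matrix of footnote 7

n = 100
Z = rng.multivariate_normal(np.zeros(2), C, size=n)   # row j holds z_j = (z_{1,j}, z_{2,j})

# Observed data Y: the second component of the first sample, z_{2,1}, is missing.
Y = Z.copy()
Y[0, 1] = np.nan

# The missing entry can only be conditioned on z_{1,1}, using the stated property
# p_theta(z_2 | z_1) = N(theta * z_1, 1 - theta^2).
print("z_{1,1} =", Y[0, 0])
```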

3. Let A = [a_{ij}] be an arbitrary real m × n matrix with ij-th component a_{ij}. For f a differentiable scalar real-valued function of A, we define the matrix derivative and the matrix gradient of f(A) in a component-wise manner respectively as
$$\frac{\partial f(A)}{\partial A} \triangleq \left[\frac{\partial}{\partial a_{ji}} f(A)\right]; \qquad \nabla_A f(A) \triangleq \left(\frac{\partial f(A)}{\partial A}\right)^T = \left[\frac{\partial}{\partial a_{ij}} f(A)\right].$$
Note that this convention is consistent with the convention used to define the (covariant) vector derivative and (contravariant cartesian) gradient in ECE275A. For Σ square and invertible (but not necessarily symmetric), prove that
$$\frac{\partial}{\partial \Sigma}\,\log\det\Sigma = \Sigma^{-1} \qquad\text{and}\qquad \frac{\partial}{\partial \Sigma}\,\mathrm{tr}\!\left(\Sigma^{-1} W\right) = -\,\Sigma^{-1} W\,\Sigma^{-1}.$$
Note that the condition for the first identity to be well-defined is that det Σ > 0. Assuming only that det Σ ≠ 0, what is ∂ log det Σ / ∂Σ? (A numerical sanity check of these identities is sketched after Problem 4 below.)

4. Let y be a gaussian random vector with unknown mean µ and unknown covariance matrix Σ which is assumed to be full rank. Given a collection of N iid samples of y, use the result of the previous homework problem to find the maximum likelihood estimates of µ and Σ. [8] Why is just determining a stationary point of the log likelihood function generally sufficient to claim uniqueness and optimality of the solution?

[8] Do NOT use the gradient formulas for matrices constrained to be symmetric, as is done by Moon and Stirling on page 558. Although the symmetrically constrained derivative formulas are sometimes useful, they are not needed here because the stationary point of the likelihood just happens to turn out to be symmetric (a happy accident!) without the need to impose this constraint. This means that the (accidentally symmetric) solution is optimal in the space of all matrices, not just on the manifold of symmetric matrices. I have never seen the symmetry constraint applied when taking the derivative except in Moon and Stirling.
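Before proving the Problem 3 identities, it can be reassuring to check them numerically. The sketch below (my own illustrative code, not part of the assignment) builds the matrix derivative in the problem's convention, entry (i, j) equal to ∂f/∂σ_{ji}, by central finite differences and compares it with the claimed closed forms; the matrix sizes, seed, and tolerance are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def matrix_derivative(f, S, h=1e-6):
    """Numerical matrix derivative in the convention of Problem 3:
    entry (i, j) of the result approximates d f(S) / d s_{ji}."""
    D = np.zeros_like(S)
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            E = np.zeros_like(S)
            E[j, i] = h                                  # perturb the (j, i) entry
            D[i, j] = (f(S + E) - f(S - E)) / (2 * h)
    return D

# A square, invertible, non-symmetric Sigma with positive determinant, and an arbitrary W.
Sigma = 5.0 * np.eye(4) + 0.5 * rng.normal(size=(4, 4))
W = rng.normal(size=(4, 4))

d_logdet = matrix_derivative(lambda S: np.log(np.linalg.det(S)), Sigma)
d_trace = matrix_derivative(lambda S: np.trace(np.linalg.inv(S) @ W), Sigma)

Sinv = np.linalg.inv(Sigma)
print(np.allclose(d_logdet, Sinv, atol=1e-5))               # claimed:  Sigma^{-1}
print(np.allclose(d_trace, -Sinv @ W @ Sinv, atol=1e-5))    # claimed: -Sigma^{-1} W Sigma^{-1}
```

Both comparisons are stated for the derivative convention defined in Problem 3; with the gradient convention the checks would instead be against the transposes of the two closed forms.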

5. (i) In order to model the pdf of a scalar random variable y, completely set up the M-component Gaussian mixture density problem as a hidden-data problem and then derive the EM algorithm for identifying the unknown mixture parameters. Assume that the individual mixture parameters are all independent. (ii) Write a program (in Matlab, or your favorite programming language) to implement your algorithm as requested in the problems given below. (A minimal starting-point sketch is shown after Problem 8 below.)

6. Note that the algorithm derived in Problem 4 above requires no derivatives, Hessians, or step-size parameters in its implementation. Is this true for a complete implementation of the EM solution given by Kay as equations (7.58) and (7.59)? Why? Can one always avoid derivatives when using the EM algorithm?

7. On pages 290, 291, 293, and 294 (culminating in Example 9.3), Kay shows how to determine consistent estimates for the parameters of a zero-mean, unimodal gaussian mixture. What utility could these parameters have in determining MLEs of the parameters via numerical means?

8. Consider a two-gaussian mixture density p(y | Θ), with true mixture parameters α = α_1 = 0.6, ᾱ = α_2 = 1 − α = 0.4, means µ_1 = 0, µ_2 = 10, and variances σ_1² and σ_2². Thus Θ = {α, ᾱ, θ}, θ = {µ_1, µ_2, σ_1², σ_2²}. The computer experiments described below are to be performed for the two cases σ_1² = σ_2² = 1 (i.e., both component standard deviations equal to 1) and σ_1² = σ_2² = 25 (i.e., both component standard deviations equal to 5). Note that in the first case the means of the two component densities are located 10 standard deviations apart (probably a rarely seen situation in practice), whereas in the second case the means of the two component densities are located 2 standard deviations apart (probably a not-so-rare situation).

   (a) Plot the true pdf p(y | Θ) as a function of y. Generate and save 400 iid samples from the pdf. Show frequency histograms of the data for 25 samples, 50 samples, 100 samples, 200 samples, and 400 samples, plotted on the same plot as the pdf.

   (b) (i) Derive an analytic expression for the mean µ and variance σ² of y as a function of the mixture parameters, and evaluate this expression using the true parameter values. (ii) Compute the sample mean and sample variance of y using the 400 data samples and compare to the analytically determined true values. (iii) Show that the sample mean and sample variance are the ML estimates for the mean and variance of a (single) gaussian model for the data. Plot this estimated single-gaussian model against the true two-gaussian mixture density.

   (c) (i) Assume that the true density is unknown and estimate the parameters for a two-component gaussian mixture [9] using the EM algorithm starting from the following initial conditions:
$$\mu_1 = \mu_2 = \mu, \qquad \sigma_1^2 = \sigma_2^2 = \sigma^2, \qquad \alpha = \tfrac{1}{2}.$$
What happens and why? (ii) Perturb the initial conditions slightly to destroy their symmetry and describe the resulting behavior of the EM algorithm in words. (iii) Again using all 400 samples, estimate the parameters for a variety of initial conditions and try to trap the algorithm in a false local optimum. Describe the results. (iv) Once you have determined a good set of initial conditions, estimate the mixture parameters for the two-component model via the EM algorithm for 25, 50, 100, 200, and 400 samples. Show that the actual data log-likelihood function increases monotonically with each iteration of the EM algorithm. Plot your final estimated parameter values versus log_2(Sample Number).

   (d) Discuss the difference between the case where the component standard deviations both have the value 1 and the case where they both have the value 5.

[9] By visual inspection of the data frequency histograms, one can argue that whatever the unknown distribution is, it is smooth with bimodal symmetric humps and hence should be adequately modelled by a two-component gaussian mixture.
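For Problems 5(ii) and 8, referenced above, the following is one possible minimal starting-point sketch (written in Python rather than Matlab): it samples from the Problem 8 mixture in the unit-variance case and runs a textbook-style EM iteration for a univariate M-component Gaussian mixture. The initialization, seed, and iteration count are my own arbitrary choices, and the sketch omits the plotting, sample-size sweeps, and diagnostics that the problems actually require.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic data from the Problem 8 mixture, sigma_1^2 = sigma_2^2 = 1 case.
alpha_true = np.array([0.6, 0.4])
mu_true = np.array([0.0, 10.0])
var_true = np.array([1.0, 1.0])
N = 400
labels = rng.choice(2, size=N, p=alpha_true)
y = rng.normal(mu_true[labels], np.sqrt(var_true[labels]))

def normal_pdf(y, mu, var):
    return np.exp(-0.5 * (y - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def em_gmm(y, M, iters=200, seed=0):
    """Standard EM iteration for a univariate M-component Gaussian mixture."""
    r = np.random.default_rng(seed)
    alpha = np.full(M, 1.0 / M)
    mu = r.choice(y, M, replace=False)                  # crude initialization from the data
    var = np.full(M, np.var(y))
    loglik = []
    for _ in range(iters):
        # E-step: responsibilities x_hat[j, m] = P(component m | y_j)
        dens = alpha * normal_pdf(y[:, None], mu, var)  # shape (N, M)
        loglik.append(np.sum(np.log(dens.sum(axis=1))))
        x_hat = dens / dens.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted sample statistics
        Nm = x_hat.sum(axis=0)
        alpha = Nm / len(y)
        mu = (x_hat * y[:, None]).sum(axis=0) / Nm
        var = (x_hat * (y[:, None] - mu) ** 2).sum(axis=0) / Nm
    return alpha, mu, var, np.array(loglik)

alpha_hat, mu_hat, var_hat, ll = em_gmm(y, M=2)
print(alpha_hat, mu_hat, var_hat)
print("log-likelihood monotone increasing:", bool(np.all(np.diff(ll) >= -1e-8)))
```

With M swept over 1, 2, 3, 10, and 50, the same em_gmm routine is also the kind of building block needed for the gaussian-mixture fits requested in Problem 10(a) below.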

9. (i) For the two-component gaussian mixture case, derive the EM algorithm when the constraint σ_1² = σ_2² = σ² is known to hold. (I.e., incorporate this known constraint into the algorithm.) Compare the resulting algorithm to the one derived in Problem 4 above. (ii) Estimate the unknown parameters of a two-component gaussian mixture using the synthetic data generated in Problem 4 above using your modified algorithm and compare the results.

10. EM computer experiment when one of the components is non-regular. Consider a two-component mixture model where the first component is uniformly distributed between 0 and θ_1, U[0, θ_1], with mixture parameter α = 0.55, and the second component is normally distributed with mean µ and variance σ², θ_2 = (µ, σ²)^T, with mixture parameter ᾱ = 1 − α = 0.45. Assume that all of the parameters are independent of each other. Assume that 0 log 0 = 0.

   (a) i. Generate 1000 samples from the distribution described above for θ_1 = 1, µ = 3/2, and σ² = 1/16. Plot a histogram of the data and the density function on the same graph.

      ii. Now assume that you do not know how the samples were generated and learn the unknown distribution using 1, 2, 3, 10, and 50 component gaussian mixture models by training on the synthesized data. Plot the true mixture distribution and the 5 learned gaussian mixture models on the same graph. Recall that the MLE-based density estimate is asymptotically closest to the true density in the Kullback-Leibler divergence sense. Note that our learned models approximate an unknown non-regular mixture density by regular gaussian mixtures. Show that the actual data log-likelihood function increases monotonically with each iteration of the EM algorithm.

      iii. Repeat the above for the values µ = 1, 1/2, and 2, with all other parameters unchanged.

   (b) Suppose we now try to solve the non-regular two-component mixture problem described above by forming a uniform-plus-gaussian two-component mixture model and applying the EM algorithm in an attempt to learn the component parameters θ_1, µ, and σ² directly. Convince yourself that in setting up the EM algorithm the Gaussian component can be handled as has already been done above, so that the novelty, and potential difficulty, is in dealing with the non-regular uniform distribution component.

      i. In the M-step of the EM algorithm, show that if x̂_{l,1} ≠ 0 for some sample y_l > θ_1, then the sum
$$\sum_{j=1}^{N} \hat{x}_{j,1}\,\log p_1(y_j;\theta_1),$$
where p_1(y_j; θ_1) denotes the uniform distribution, is not bounded from below. What does this say about the admissible values of θ̂_1^+?

      ii. Show that maximizing $\sum_{j=1}^{N} \hat{x}_{j,1}\,\log p_1(y_j;\theta_1)$ with respect to θ_1 is equivalent to maximizing
$$\left(\frac{1}{\theta_1}\right)^{\hat{N}_1} \prod_j 1_{[0,\theta_1]}(y_j) \quad\text{with}\quad \theta_1 \geq y_j \ \text{for all}\ y_j > 0, \qquad (1)$$
where {y_j} are the data samples for which x̂_{j,1} ≠ 0, N̂_1 = Σ_j x̂_{j,1}, and 1_{[0,θ_1]}(y_j) is the indicator function that indicates whether or not y_j is in the closed interval [0, θ_1]. Maximize the expression (1) to obtain the estimate θ̂_1^+. (Hint: Compare to the full-information MLE solution you derived last quarter.) Explain why the estimate provided by the EM algorithm remains stuck at the value of the very first update θ̂_1^+, and never changes thereafter, and why the value one gets stuck at depends on the initialization value of θ̂_1. Verify these facts in simulation (see below for additional simulation requests). Try various initialization values for θ̂_1, including θ̂_1 > max_j y_j. (Note that only initialization values that have an actual data log-likelihood that is bounded from below need be considered.) Show that the actual data log-likelihood function monotonically increases with each iteration of the EM algorithm (i.e., that we have monotonic hill-climbing on the actual data log-likelihood function).

      iii. Let us heuristically attempt to fix the stickiness problem of the EM algorithm with an ad hoc stabilization of the algorithm. There is no guarantee that the procedure outlined below will work, as we are not conforming to the theory of the EM algorithm. Thus, monotonic hill-climbing on the actual data log-likelihood function is not guaranteed.

      First put a floor on the value of x̂_{j,1} for positive y_j:
$$\text{If } \hat{x}_{j,1} < \epsilon \text{ and } y_j > 0, \text{ set } \hat{x}_{j,1} = \epsilon \text{ and } \hat{x}_{j,2} = 1 - \epsilon.$$
Choose ε to be a very small value. In particular choose ε ≪ 1/θ̂_1, where θ̂_1 > max_j y_j is the initialization value of the estimate of θ_1.

      Next define two intermediate estimates, θ̃_1 and θ̄_1, of θ_1 as follows:
$$\tilde{\theta}_1 = \frac{2}{\hat{N}_1}\sum_{j=1}^{N}\hat{x}_{j,1}\,y_j \quad\text{with}\quad \hat{N}_1 = \sum_{j=1}^{N}\hat{x}_{j,1},$$
$$\bar{\theta}_1 = \max_j\ \hat{x}_{j,1}\,y_j\,e^{-\frac{(y_j - \tilde{\theta}_1)^2}{\tau}}.$$
In the complete-information case, which corresponds to setting x̂_{j,1} = x_{j,1} and N̂_1 = N_1, the estimate θ̃_1 corresponds to the BLUE (and method-of-moments) estimate of θ_1. We can refer to θ̃_1 as the pseudo-BLUE estimate. The estimate θ̄_1 modifies the EM update estimate by adding the factor x̂_{j,1}, which weights y_j according to the probability that it was generated by the uniform distribution, and an exponential factor which weights y_j according to how far it is from the pseudo-BLUE estimate of θ_1. We can refer to θ̄_1 as the weighted EM update of θ_1. The parameter τ determines how aggressively we penalize the deviation of y_j from θ̃_1.

      We construct our final Modified EM update of θ_1 as
$$\hat{\theta}_1^{+} = \beta\,\tilde{\theta}_1 + \bar{\beta}\,\bar{\theta}_1 \qquad\text{for } 0 \leq \beta \leq 1 \text{ and } \bar{\beta} = 1 - \beta.$$

For example, setting β = 1/2 yields
$$\hat{\theta}_1^{+} = \tfrac{1}{2}\left(\tilde{\theta}_1 + \bar{\theta}_1\right).$$
We see that there are three free parameters to be set when implementing the Modified EM update step, namely ε, τ, and β. Attempt to run the modified EM algorithm on 1000 samples of data generated for the cases θ_1 = 1, σ² = 1/16, and µ = 1/2, 1, 3/2, and 2. Plot the actual data log-likelihood function to see if it fails to increase monotonically with each iteration of the EM algorithm. If you can't get the algorithm to work, don't worry: I thought up this heuristic fix on the fly and it definitely is not motivated by any rigorous mathematics. [10] If you can get the procedure to work, describe its performance relative to the pure Gaussian mixture cases investigated above and show its performance for various choices of β, including β = 0, 0.5, and 1. (A minimal sketch of the Modified EM update step for θ_1 is shown below.)

[10] I dreamed up this fix a few years ago. So far only one student has actually got it to work, and work surprisingly well! Of course I'm always curious as to how many students can get it to work every time I teach this course.
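To make the three-parameter Modified EM update above concrete, here is a minimal sketch of that single update for θ_1, transcribing the flooring step, the pseudo-BLUE estimate, the weighted EM update, and the convex combination exactly as defined in Problem 10(b)(iii). The default values of τ and β, and the rule eps = 1e-3/θ̂_1, are my own arbitrary choices; treat this as an illustration of the formulas, not a tested or endorsed implementation.

```python
import numpy as np

def modified_em_update_theta1(y, x_hat1, theta1_init, eps=None, tau=0.05, beta=0.5):
    """One Modified EM update of theta_1 for the U[0, theta_1] mixture component.

    y           : array of data samples y_j
    x_hat1      : array of E-step responsibilities x_hat_{j,1} for the uniform component
    theta1_init : initialization value theta1_hat > max_j y_j (used only to pick the floor eps)
    """
    if eps is None:
        eps = 1e-3 / theta1_init          # choose eps << 1 / theta1_hat, per the problem statement

    # Floor the uniform-component responsibilities for positive samples.
    x1 = np.where((x_hat1 < eps) & (y > 0), eps, x_hat1)   # x_hat_{j,2} would become 1 - eps

    # Pseudo-BLUE estimate: theta_tilde = (2 / N_hat1) * sum_j x1_j * y_j.
    N_hat1 = x1.sum()
    theta_tilde = 2.0 * np.sum(x1 * y) / N_hat1

    # Weighted EM update: theta_bar = max_j x1_j * y_j * exp(-(y_j - theta_tilde)^2 / tau).
    theta_bar = np.max(x1 * y * np.exp(-((y - theta_tilde) ** 2) / tau))

    # Modified EM update: convex combination with 0 <= beta <= 1.
    return beta * theta_tilde + (1.0 - beta) * theta_bar
```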
