1 The EM algorithm

In this set of notes, we discuss the EM (Expectation-Maximization) algorithm, which is a common algorithm used in statistical estimation to try and find the MLE. It is often used in situations that are not exponential families, but are derived from exponential families. A common mechanism by which these likelihoods are derived is through missing data, i.e. we only observe some of the sufficient statistics of the family.

1.1 Mixture model

A canonical application of the EM algorithm is its use in fitting a mixture model, where we assume we observe an IID sample $(X_i)_{1 \leq i \leq n}$ from
$$
Y \sim \text{Multinomial}(1, \pi), \qquad \pi \in \mathbb{R}^L
$$
$$
X \mid Y = l \sim P_{\eta_l}
$$
with the simplest example of $P_\eta$ being the univariate normal model
$$
P_{\eta_l} = N(\mu_l, \sigma^2_l),
$$
keeping in mind that the parameters on the right are the mean-space parameters, not the natural parameters.

Exercise

1. Show that the joint distribution of $(X, Y)$ is an exponential family. What is its reference measure, its sufficient statistics? Write out the log-likelihood based on observing an IID sample $(X_i, Y_i)_{1 \leq i \leq n}$ for this model. Call this $\ell_c(\eta; X, Y)$ the complete likelihood.

2. What is the marginal density of $X$?

3. Write out the log-likelihood $\ell(\eta; X)$ based on observing an IID sample $(X_i)_{1 \leq i \leq n}$ from this model. What are its parameters?

In the mixture model, we only observe $X$, though the marginal distribution of $X$ is the same as if we had generated pairs $(X, Y)$ and marginalized over $Y$. In this problem, $Y$ is missing data, which we might call $M$, and $X$ is observed data, which we might call $O$. Formally, then, we partition our sufficient statistic into two sets: those observed, and those missing.
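To make the missing-data structure concrete, here is a small simulation sketch: draw the complete data $(X_i, Y_i)$ from a two-component normal mixture and then keep only $X$. The particular parameter values and variable names are illustrative, not part of the model specification above.

import numpy as np

rng = np.random.default_rng(0)
n, L = 800, 2
pi = np.array([0.25, 0.75])        # mixing proportions (illustrative)
mu = np.array([2.0, -1.0])         # component means
sigma = np.array([1.0, 0.8])       # component standard deviations

Y = rng.multinomial(1, pi, size=n)            # complete data: component indicators
labels = Y.argmax(axis=1)
X = rng.normal(mu[labels], sigma[labels])     # complete data: observations

# In the EM formulation we only get to see X (the observed data O);
# the indicators Y play the role of the missing data M.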
1.2 The EM algorithm

The EM algorithm usually has two steps, both of which are based on the following function
$$
Q(\eta; \tilde{\eta}) = E_{\tilde{\eta}}\left( \ell_c(\eta; O, M) \mid O \right).
$$
The basis of the EM algorithm is the following result:
$$
Q(\eta; \tilde{\eta}) \geq Q(\tilde{\eta}; \tilde{\eta}) \implies \ell(\eta; O) \geq \ell(\tilde{\eta}; O).
$$
Therefore, any sequence $(\eta^{(k)})_{k \geq 1}$ satisfying
$$
Q(\eta^{(k+1)}; \eta^{(k)}) \geq Q(\eta^{(k)}; \eta^{(k)})
$$
has $\ell(\eta^{(k)}; O)$ non-decreasing. An algorithm that produces such a sequence is called a GEM algorithm (generalized EM algorithm).

The proof of this is fairly straightforward after some initial sleight of hand. After this sleight of hand, we see that the main ingredient in the proof is the deviance of the conditional distribution of $M \mid O$. In the general case, this deviance is not expressed in terms of natural parameters, but the argument is the same.

Here is the proof. Write the joint distribution of $(O, M)$ (assuming it has a density with respect to $P_0$) as
$$
\frac{dP_\eta}{dP_0} = f_{\eta,(O,M)}(o, m) = f_{\eta,O}(o) \cdot f_{\eta, M \mid O}(m \mid o)
$$
where the $f$'s are densities with respect to $P_0$. Or,
$$
f_{\eta,O}(o) = \frac{f_{\eta,(O,M)}(o, m)}{f_{\eta, M \mid O}(m \mid o)}.
$$
Although the RHS seems to depend on $m$, the above equality shows that it is actually measurable with respect to $o$. We see that
$$
\ell(\eta; O) = \sum_{i=1}^n \log f_\eta(O_i) = \sum_{i=1}^n \left[ \log f_\eta(O_i, M_i) - \log f_\eta(M_i \mid O_i) \right],
$$
where we know that $f_\eta(m \mid o)$ is an exponential family for $O$ fixed. The right hand side is measurable with respect to $O$, so its conditional expectation with respect to $O$ leaves it unchanged. Therefore, for any $\tilde{\eta}$ we have the equality
$$
\begin{aligned}
\ell(\eta; O) &= \sum_{i=1}^n \log f_\eta(O_i) \\
&= \sum_{i=1}^n \left[ \log f_\eta(O_i, M_i) - \log f_\eta(M_i \mid O_i) \right] \\
&= \sum_{i=1}^n \left[ E_{\tilde{\eta}}\left( \log f_\eta(O_i, M_i) \mid O \right) - E_{\tilde{\eta}}\left( \log f_\eta(M_i \mid O_i) \mid O \right) \right] \\
&= E_{\tilde{\eta}}\left( \ell_c(\eta; O, M) \mid O \right) - \sum_{i=1}^n E_{\tilde{\eta}}\left( \log f_\eta(M_i \mid O_i) \mid O_i \right) \\
&= Q(\eta; \tilde{\eta}) - \sum_{i=1}^n E_{\tilde{\eta}}\left( \log f_\eta(M_i \mid O_i) \mid O_i \right).
\end{aligned}
$$
Now,
$$
\ell(\eta; O) - \ell(\tilde{\eta}; O) = Q(\eta; \tilde{\eta}) - Q(\tilde{\eta}; \tilde{\eta}) + \sum_{i=1}^n \left[ E_{\tilde{\eta}}\left( \log f_{\tilde{\eta}}(M_i \mid O_i) \mid O_i \right) - E_{\tilde{\eta}}\left( \log f_\eta(M_i \mid O_i) \mid O_i \right) \right].
$$
The term
$$
\sum_{i=1}^n \left[ E_{\tilde{\eta}}\left( \log f_{\tilde{\eta}}(M_i \mid O_i) \mid O_i \right) - E_{\tilde{\eta}}\left( \log f_\eta(M_i \mid O_i) \mid O_i \right) \right]
$$
is essentially half the deviance of the exponential family of conditional distributions for $M \mid O$ with sufficient statistic $M$. To see this, recall our general form of the conditional density of $T_1 \mid T_2 = s_2$ for an $\mathbb{R}^p$-valued sufficient statistic partitioned as $T_1 \in \mathbb{R}^k$, $T_2 \in \mathbb{R}^{p-k}$:
$$
\begin{aligned}
f_{T_1 \mid T_2 = s_2}(t_1) &= \frac{f_{T_1, T_2}(t_1, s_2)}{\int_{\mathbb{R}^k} f_{T_1, T_2}(s_1, s_2) \, ds_1} \\
&= \frac{e^{\eta_1^T t_1 + \eta_2^T s_2} \, m_0(t_1, s_2)}{\int_{\mathbb{R}^k} e^{\eta_1^T s_1 + \eta_2^T s_2} \, m_0(s_1, s_2) \, ds_1} \\
&= \frac{e^{\eta_1^T t_1} \, m_0(t_1, s_2)}{\int_{\mathbb{R}^k} e^{\eta_1^T s_1} \, m_0(s_1, s_2) \, ds_1}.
\end{aligned}
$$
Therefore, with $C$ a function independent of $\eta$,
$$
\begin{aligned}
\log f_\eta(M_i \mid O_i) &= \eta_M^T M_i - \log\left( \int_{\mathbb{R}^k} e^{\eta_M^T s} \, m_0(s, O_i) \, ds \right) + C(M_i, O_i) \\
&= \eta_M^T M_i - \Lambda(\eta_M, O_i) + C(M_i, O_i),
\end{aligned}
$$
where $\Lambda(\eta_M, O_i)$ is the appropriate CGF for this conditional distribution.
We see then that
$$
\log f_{\tilde{\eta}}(M_i \mid O_i) - \log f_\eta(M_i \mid O_i) = \Lambda(\eta_M, O_i) - \Lambda(\tilde{\eta}_M, O_i) - (\eta_M - \tilde{\eta}_M)^T M_i.
$$
Taking conditional expectation with respect to $O_i$ at $\tilde{\eta}$ yields
$$
\sum_{i=1}^n E_{\tilde{\eta}}\left( \log f_{\tilde{\eta}}(M_i \mid O_i) - \log f_\eta(M_i \mid O_i) \mid O_i \right) = \frac{1}{2} D(\tilde{\eta}; \eta \mid O) \geq 0.
$$

1.3 The two basic steps

The algorithm is often described as having two steps: the E step and the M step. Formally, the E step can be described as evaluating $Q(\eta; \tilde{\eta})$ with $\tilde{\eta}$ fixed. That is, fix $\tilde{\eta}$ and compute
$$
q_{\tilde{\eta}}(\eta) = E_{\tilde{\eta}}\left( \ell_c(\eta; O, M) \mid O \right)
$$
as a function of $\eta$. The M step is the maximization step and amounts to finding
$$
\hat{\eta}(\tilde{\eta}) = \mathop{\mathrm{argmax}}_\eta Q(\eta; \tilde{\eta}) = \mathop{\mathrm{argmax}}_\eta q_{\tilde{\eta}}(\eta).
$$

1.4 EM algorithm for exponential families

The EM algorithm for exponential families takes a particularly nice form when the MLE map is nice in the complete data problem. Expressed sequentially, it is the recursion
$$
\hat{\eta}^{(k+1)} = \mathop{\mathrm{argmax}}_\eta \left[ \eta^T E_{\hat{\eta}^{(k)}}\left( t(M, O) \mid O \right) - \Lambda(\eta) \right].
$$
In other words, we need to form the conditional expectation of all the sufficient statistics given the sufficient statistics we did observe. Following this, we just return the MLE as if we had observed those sufficient statistics. Another way to phrase this is
$$
\hat{\eta}^{(k+1)} = \dot{\Lambda}^{-1}\left( E_{\hat{\eta}^{(k)}}\left( t(M, O) \mid O \right) \right).
$$
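As a sketch of how this recursion might be organized in code, the whole algorithm alternates between imputing the complete-data sufficient statistics and applying the complete-data MLE map; the function names here are illustrative placeholders, not part of the notes.

def em_exponential_family(expected_suff_stat, complete_data_mle, eta, niter=50):
    """
    Generic EM recursion for an exponential family with missing data (a sketch).

    expected_suff_stat(eta) : the E step, returning E_eta[t(M, O) | O]
    complete_data_mle(tbar) : the M step, the complete-data MLE map
                              (the inverse of the gradient of the CGF) applied to tbar
    """
    for _ in range(niter):
        eta = complete_data_mle(expected_suff_stat(eta))
    return eta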
1.5 Mixture model example

In the mixture model, if we write $Y_i = (Y_{i1}, \ldots, Y_{iL})$, the sufficient statistics can be taken to be
$$
t(X, Y) = \left( \sum_{i=1}^n Y_{ij}, \; \sum_{i=1}^n Y_{ij} X_i, \; \sum_{i=1}^n Y_{ij} X_i^2 \right)_{1 \leq j \leq L},
$$
where only
$$
\left( \sum_{j=1}^L Y_{ij} X_i = X_i \right)_{1 \leq i \leq n}
$$
is observed.

1.5.1 Exercise

Use Bayes rule to show that, in our univariate normal mixture model,
$$
P_\eta(Y = l \mid X = x) = \frac{\pi_l \, \phi(x; \mu_l, \sigma^2_l)}{\sum_{j=1}^L \pi_j \, \phi(x; \mu_j, \sigma^2_j)},
$$
where $\phi(x; \mu, \sigma^2)$ is the univariate density of $N(\mu, \sigma^2)$.

If we set
$$
\hat{\gamma}_l(x, \eta) = P_\eta(Y = l \mid X = x),
$$
the above exercise shows that
$$
\begin{aligned}
E_\eta\left( Y_{il} X_i \mid X \right) &= \hat{\gamma}_l(X_i, \eta) X_i \\
E_\eta\left( Y_{il} X_i^2 \mid X \right) &= \hat{\gamma}_l(X_i, \eta) X_i^2 \\
E_\eta\left( Y_{il} \mid X \right) &= \hat{\gamma}_l(X_i, \eta).
\end{aligned}
$$
The usual MLE map (for the mean parameters) in this model can be expressed as
$$
\begin{aligned}
\hat{\pi}_l &= \sum_{i=1}^n Y_{il} / n \\
\hat{\mu}_l &= \frac{\sum_{i=1}^n Y_{il} X_i}{\sum_{i=1}^n Y_{il}} \\
\hat{\sigma}^2_l &= \frac{\sum_{i=1}^n Y_{il} (X_i - \hat{\mu}_l)^2}{\sum_{i=1}^n Y_{il}} = \frac{\sum_{i=1}^n Y_{il} X_i^2}{\sum_{i=1}^n Y_{il}} - \left( \frac{\sum_{i=1}^n Y_{il} X_i}{\sum_{i=1}^n Y_{il}} \right)^2.
\end{aligned}
$$
This leads to the following algorithm: given an initial set of parameters $\eta^{(0)}$, we repeat the following updates for $k \geq 0$:

- Form the responsibilities $\hat{\gamma}_l(X_i; \eta^{(k)})$, $1 \leq l \leq L$, $1 \leq i \leq n$.

- Compute
$$
\begin{aligned}
\hat{\pi}^{(k+1)}_l &= \sum_{i=1}^n \hat{\gamma}_l(X_i; \eta^{(k)}) / n \\
\hat{\mu}^{(k+1)}_l &= \frac{\sum_{i=1}^n \hat{\gamma}_l(X_i; \eta^{(k)}) X_i}{\sum_{i=1}^n \hat{\gamma}_l(X_i; \eta^{(k)})} \\
\hat{\sigma}^{2\,(k+1)}_l &= \frac{\sum_{i=1}^n \hat{\gamma}_l(X_i; \eta^{(k)}) X_i^2}{\sum_{i=1}^n \hat{\gamma}_l(X_i; \eta^{(k)})} - \left( \hat{\mu}^{(k+1)}_l \right)^2.
\end{aligned}
$$

- Repeat.
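Written in vectorized form for a general number of components $L$, one pass through these updates might look like the following sketch; the array names are illustrative, and the density matches $\phi$ above.

import numpy as np

def em_step(X, pi, mu, sigma):
    """One EM update for a univariate L-component normal mixture (a sketch)."""
    # E step: responsibilities gamma, shape (n, L)
    dens = np.exp(-(X[:, None] - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
    gamma = pi * dens
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M step: plug the imputed sufficient statistics into the complete-data MLE map
    nl = gamma.sum(axis=0)
    pi_new = nl / X.shape[0]
    mu_new = gamma.T @ X / nl
    sigma_new = np.sqrt(gamma.T @ X**2 / nl - mu_new**2)
    return pi_new, mu_new, sigma_new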
Let's test out our algorithm on some data from the mixture model.

import numpy as np
import matplotlib.pyplot as plt
# the %R / %%R cell magics below assume the rpy2 IPython extension is loaded

mu1, sigma1 = 2, 1
mu2, sigma2 = -1, 0.8
X1 = np.random.standard_normal(200)*sigma1 + mu1
X2 = np.random.standard_normal(600)*sigma2 + mu2
X = np.hstack([X1, X2])

%R -i X plot(density(X))

def phi(x, mu, sigma):
    """
    Normal density
    """
    return np.exp(-(x-mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

def responsibilities(x, params):
    """
    Compute the responsibilities, as well as the likelihood at the same time.
    """
    mu1, mu2, sigma1, sigma2, pi1, pi2 = params
    gamma1 = phi(x, mu1, sigma1) * pi1
    gamma2 = phi(x, mu2, sigma2) * pi2
    denom = gamma1 + gamma2
    gamma1 = gamma1 / denom
    gamma2 = gamma2 / denom
    return np.array([gamma1, gamma2]).T, np.log(denom).sum()

mu1, mu2, sigma1, sigma2, pi1, pi2 = 0, 1, 1, 4, 0.5, 0.5
gamma, likelihood = responsibilities(X, (mu1, mu2, sigma1, sigma2, pi1, pi2))

Here is our recursive estimation procedure, which is fairly straightforward here.

niter = 20
n = X.shape[0]
values = []
for _ in range(niter):
    gamma, likelihood = responsibilities(X, (mu1, mu2, sigma1, sigma2, pi1, pi2))
    pi1, pi2 = gamma.sum(0) / n
    mu1 = (gamma[:,0] * X).sum() / (pi1*n)
    mu2 = (gamma[:,1] * X).sum() / (pi2*n)
    sigma1_sq = (gamma[:,0] * X**2).sum() / (n*pi1) - mu1**2
    sigma2_sq = (gamma[:,1] * X**2).sum() / (n*pi2) - mu2**2
    sigma1 = np.sqrt(sigma1_sq)
    sigma2 = np.sqrt(sigma2_sq)
    values.append(likelihood)

We can track the value of the likelihood and, since we have an EM algorithm, the likelihood should be monotone across iterations.

plt.plot(values)
plt.gca().set_ylabel(r'$\ell^{(k)}$')
plt.gca().set_xlabel(r'Iteration $k$')
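Because the iterates above form an EM (hence GEM) sequence, the recorded log-likelihood values should be non-decreasing up to numerical error. A quick sanity check, as a sketch using the values list from the loop above:

assert np.all(np.diff(values) >= -1e-8), "log-likelihood decreased; not a valid (G)EM step"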
Let's plot our density estimate to see how well the mixture model was fit.

%%R -i pi1,pi2,sigma1,sigma2,mu1,mu2
X = sort(X)
plot(X, pi1*dnorm(X,mu1,sigma1)+pi2*dnorm(X,mu2,sigma2), col='red', lwd=2, type='l', ylab='Density')
lines(density(X))
1.5.2 Exercise

1. Refit the mixture model assuming the variance is the same within each class, i.e. $\sigma^2_l = \sigma^2$, independent of the class $l$. (A possible modification of the update is sketched below.)

2. Try fitting 3- and 4-component mixture models to the above data, which only has two components. What do you expect to see in the fitted density?
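For part 1, a minimal sketch of how the update inside the loop might change, assuming the pooled complete-data variance estimate $\hat{\sigma}^2 = \sum_l \sum_i \hat{\gamma}_l(X_i)(X_i - \hat{\mu}_l)^2 / n$; the variable names follow the code above.

# inside the EM loop, replace the two per-component variance updates with a
# single pooled estimate shared by both components
sigma_sq = ((gamma[:, 0] * (X - mu1)**2).sum() +
            (gamma[:, 1] * (X - mu2)**2).sum()) / n
sigma1 = sigma2 = np.sqrt(sigma_sq)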
1.6 Gaussian random effects model

Another application of the EM algorithm is to random or linear mixed effects models. One version of a linear mixed effects model is
$$
Y \mid X, Z \sim N\left( X\beta, \; \sigma^2 I + Z \Sigma Z^T \right),
$$
where $X$ is a fixed effects design matrix, $Z$ is a random effects design matrix, and $\Sigma$ is a covariance matrix that must be estimated along with $\sigma^2$. The covariance matrix $\Sigma$ might not be estimated in a completely unrestricted fashion. In the example below, the model is $\Sigma = \sigma^2_\alpha I$ for some constant $\sigma^2_\alpha$. This distribution is the same as the distribution of
$$
X\beta + Z\alpha + \epsilon \mid X, Z,
$$
where $\alpha \sim N(0, \Sigma)$ and $\epsilon \sim N(0, \sigma^2 I)$, independently given $X, Z$.

The simplest version of such a random effects model would be one in which observations are grouped by subject and each subject has a random intercept:
$$
Y_{ij} = X_i^T \beta + \alpha_i + \epsilon_{ij}, \qquad \epsilon_{ij} \sim N(0, \sigma^2), \quad \alpha_i \sim N(0, \sigma^2_\alpha), \qquad 1 \leq i \leq n, \; 1 \leq j \leq n_i,
$$
with the $\epsilon$'s and $\alpha$'s being independent. This corresponds to $Z$ being a design matrix of indicator variables for a factor that has $n$ levels, i.e. subject. Here, the matrix $\Sigma = \sigma^2_\alpha I_{n \times n}$.

Exercise

Define the complete data to be $(Y_{ij}, \alpha_i, X_i)_{1 \leq i \leq n,\, 1 \leq j \leq n_i}$ and assume you are only able to observe $(Y_{ij}, X_i)_{1 \leq i \leq n,\, 1 \leq j \leq n_i}$.

1. What are the sufficient statistics for the joint likelihood of the complete data (conditional on $X$)?

2. What is the conditional distribution of $\alpha_i \mid (Y_{ij}, X_i)_{1 \leq j \leq n_i}$?

3. Describe the EM algorithm to estimate $(\beta, \sigma^2, \sigma^2_\alpha)$.

4. How would you estimate the accuracy of $\hat{\sigma}^2_\alpha$?
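To make the random-intercept model concrete, here is a small simulation sketch; the dimensions, parameter values and variable names are illustrative and not part of the notes.

import numpy as np

rng = np.random.default_rng(1)
n, n_i, p = 50, 4, 3                  # subjects, observations per subject, fixed effects
beta = np.array([1.0, -0.5, 2.0])
sigma, sigma_alpha = 1.0, 0.7

X = rng.normal(size=(n, p))                            # one covariate vector per subject
alpha = rng.normal(scale=sigma_alpha, size=n)          # random intercepts: the missing data M
eps = rng.normal(scale=sigma, size=(n, n_i))
Y = (X @ beta)[:, None] + alpha[:, None] + eps         # observed responses Y_ij

# Only (Y, X) are observed; an EM algorithm for (beta, sigma^2, sigma_alpha^2)
# would impute the sufficient statistics involving alpha in its E step.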