Optimization


Background

Problem: given a function f(x) defined on X, find x* such that f(x*) ≥ f(x) for all x ∈ X. The value x* is called a maximizer of f and is written argmax_X f. In general, argmax_X f may not be unique. It will be unique if f is strictly concave, that is, if −f is strictly convex. A function g is strictly convex if

    g(λx + (1−λ)y) < λg(x) + (1−λ)g(y)    for all 0 < λ < 1 and all x ≠ y in X.

A twice continuously differentiable function is strictly convex if its Hessian matrix H_ij(x) = ∂²g(x)/∂x_i∂x_j is positive definite at every x. If X is discrete, the problem is known as combinatorial optimization.

Newton's Method

Suppose f has two continuous derivatives. Then every interior local maximum x of f satisfies f′(x) = 0. More generally, write the vector of partial derivatives of f with respect to x (the gradient of f) as

    ∇f(x) = (∂f/∂x_1, ..., ∂f/∂x_m)ᵀ.

An interior local maximum of f satisfies ∇f(x) = 0. The Hessian of f, denoted H_f, is an m × m matrix for each value of x, defined by (H_f)_ij = ∂²f/∂x_i∂x_j. For twice continuously differentiable f, H_f is symmetric at each x. If f is concave in a neighborhood of a local maximizer x*, then H_f(x*) is symmetric and negative semidefinite.

Suppose f has a continuous Hessian at x_0. Then we can approximate f quadratically in a neighborhood of x_0 using

    f(x) ≈ f(x_0) + ∇f(x_0)ᵀ(x − x_0) + (1/2)(x − x_0)ᵀ H_f(x_0)(x − x_0).

This leads to the following approximation to the gradient:

    ∇f(x) ≈ ∇f(x_0) + H_f(x_0)(x − x_0).

Setting the right-hand side to zero and solving for the stationary point ∇f(x) = 0 gives

    x = x_0 − H_f(x_0)⁻¹ ∇f(x_0).

This is the Newton-Raphson method for multidimensional optimization. We define a sequence of iterates starting at an arbitrary value x_0 and update using the rule

    x_{i+1} = x_i − H_f(x_i)⁻¹ ∇f(x_i).

When f is twice continuously differentiable and strictly concave, the Newton-Raphson iterates converge at a quadratic rate once they are sufficiently close to the maximizer; convergence from an arbitrary starting point is not guaranteed without safeguards such as step-size control.
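To make the update concrete, here is a minimal sketch of the iteration in Python. The quadratic objective, starting point, and tolerance are illustrative assumptions, not part of the notes; for a quadratic objective the method converges in a single step.

```python
import numpy as np

def newton_raphson(grad, hess, x0, tol=1e-8, max_iter=100):
    """Maximize f by iterating x <- x - H_f(x)^{-1} grad f(x)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(hess(x), grad(x))   # H_f(x)^{-1} grad f(x)
        x = x - step
        if np.linalg.norm(step) < tol:             # stop when the step is tiny
            break
    return x

# Illustrative concave objective: f(x) = -(x1 - 1)^2 - 2*(x2 + 3)^2
grad = lambda x: np.array([-2.0 * (x[0] - 1.0), -4.0 * (x[1] + 3.0)])
hess = lambda x: np.array([[-2.0, 0.0], [0.0, -4.0]])
print(newton_raphson(grad, hess, x0=[10.0, 10.0]))   # approximately [1, -3]
```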

Maximum Likelihood

Suppose we observe independent realizations Y_1, ..., Y_n, where the density of Y_i is f_i(· ; θ) and θ ∈ R^d is an unknown parameter. Then the joint density is

    p(Y_1, ..., Y_n) = ∏_{i=1}^n f_i(Y_i; θ),

and the log-likelihood is

    L(θ | Y_1, ..., Y_n) = ∑_{i=1}^n log f_i(Y_i; θ).

The maximum likelihood estimator (MLE) of θ is defined to be argmax_θ L(θ | Y_1, ..., Y_n).

In this setting the gradient ∇L(θ) is known as the score function and is often written s(θ). It has the form

    s(θ) = ∑_i ∇f_i(Y_i; θ)/f_i(Y_i; θ) = ∑_i s_i(θ).

The score function is a random vector with expected value 0:

    E_θ s(θ) = ∑_i ∫ (∇f_i(y_i; θ)/f_i(y_i; θ)) f_i(y_i; θ) dy_i = ∑_i ∫ ∇f_i(y_i; θ) dy_i = ∑_i ∇ ∫ f_i(y_i; θ) dy_i = 0,

since each f_i integrates to one and differentiation can be passed under the integral.

The Hessian of the log-likelihood function has the form

    H_L(θ) = ∑_{i=1}^n [ f_i(Y_i; θ) H_{f_i}(θ) − ∇f_i(Y_i; θ) ∇f_i(Y_i; θ)ᵀ ] / f_i(Y_i; θ)².

We can write

    E_θ [ f_i(Y_i; θ) H_{f_i}(θ) / f_i(Y_i; θ)² ] = ∫ H_{f_i}(θ) dy_i = ∇² ∫ f_i(y_i; θ) dy_i = 0.

Thus

    E_θ H_L(θ) = −∑_{i=1}^n ∫ s_i(θ) s_i(θ)ᵀ f_i(y_i; θ) dy_i = −∑_i cov_θ(s_i(θ)) = −cov_θ(s(θ)).

The quantity I(θ) = cov_θ(s(θ)) = −E_θ H_L(θ) is called the Fisher information. The Fisher information matrix is positive semidefinite, and −I(θ) can serve as a stand-in for the Hessian in the Newton-Raphson algorithm, giving the update

    θ_{i+1} = θ_i + I(θ_i)⁻¹ s(θ_i).

This is the Fisher scoring algorithm. It has essentially the same convergence properties as Newton-Raphson, but it is often easier to compute I than H_L. The Fisher information matrix also gives the inverse of the asymptotic variance of the MLE.
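These identities are easy to check by simulation. The sketch below uses Poisson(θ) samples, a model chosen here purely for illustration: for this model the score is ∑_i (Y_i/θ − 1) and the Fisher information is n/θ, so the simulated score should have mean near 0 and variance near n/θ.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, n_rep = 2.0, 50, 20000

# Score of a Poisson(theta) sample: s(theta) = sum_i (Y_i / theta - 1)
Y = rng.poisson(theta, size=(n_rep, n))
scores = (Y / theta - 1.0).sum(axis=1)

print("mean of score:", scores.mean())            # approximately 0
print("variance of score:", scores.var())         # approximately n / theta = 25
print("Fisher information n/theta:", n / theta)
```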

Example: Logistic regression. Suppose we observe independent pairs (Y_i, X_i), i = 1, ..., n, with Y_i ∈ {0, 1}, and the probability law is given by

    P(Y_i = 1 | X_i) = exp(βᵀX_i) / (1 + exp(βᵀX_i)),    P(Y_i = 0 | X_i) = 1 / (1 + exp(βᵀX_i)).

Our goal is to compute the MLE of β. The joint likelihood is

    P(Y_1, ..., Y_n | X_1, ..., X_n) = ∏_{i: Y_i = 1} exp(βᵀX_i)/(1 + exp(βᵀX_i)) · ∏_{i: Y_i = 0} 1/(1 + exp(βᵀX_i)).

The log-likelihood function is

    L(β) = βᵀ ∑_{i: Y_i = 1} X_i − ∑_i log(1 + exp(βᵀX_i)).

The score function is

    s(β) = ∑_{i: Y_i = 1} X_i − ∑_i [ exp(βᵀX_i)/(1 + exp(βᵀX_i)) ] X_i.

The Hessian of L is

    H_L(β) = −∑_i [ exp(βᵀX_i)/(1 + exp(βᵀX_i))² ] X_i X_iᵀ.

The score function is a linear combination of the X_i, and the information matrix is a linear combination of the X_i X_iᵀ. This is typical in exponential family regression models. The coefficients of X_i X_iᵀ in the expression for H_L(β) are negative. Thus H_L is negative definite as long as the X_i span R^m (the space in which β lies), and it follows that L is globally concave, so a local maximizer is unique and is also the global maximizer. The Y_i do not appear in H_L(β), so −H_L(β) = I(β): the observed and expected information coincide.
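Because the Hessian does not involve the Y_i, Newton-Raphson and Fisher scoring are the same algorithm for this model. The following sketch implements the update directly from the score and Hessian above; the simulated design matrix and coefficients are illustrative assumptions.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Newton-Raphson (= Fisher scoring here) for logistic regression.
    X is the n x m design matrix, y the vector of 0/1 responses."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))        # P(Y_i = 1 | X_i)
        score = X.T @ (y - p)                      # s(beta)
        hess = -(X.T * (p * (1.0 - p))) @ X        # H_L(beta), negative definite
        beta = beta - np.linalg.solve(hess, score)
    return beta

# Illustrative simulated data (not from the notes)
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
true_beta = np.array([-0.5, 1.5])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_beta)))
print(fit_logistic(X, y))    # should be close to [-0.5, 1.5]
```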

Example: Heteroscedastic regression. Suppose we are fitting a linear regression model with Gaussian errors in which E(Y | X) = α + βX and var(Y | X) = cX². We wish to estimate α, β, and c using maximum likelihood. Up to an additive constant, the log-likelihood is

    L(α, β, c) = −(1/2) ∑_i (Y_i − α − βX_i)²/(cX_i²) − n log(c)/2 − ∑_i log(X_i²)/2.

Writing r_i = Y_i − α − βX_i, the gradient is

    ∇L(α, β, c) = (1/2) ∑_i ( 2r_i/(cX_i²),  2r_i/(cX_i),  r_i²/(c²X_i²) − 1/c )ᵀ.

The Hessian matrix is

    H_L = −∑_i [ 1/(cX_i²)       1/(cX_i)        r_i/(c²X_i²)
                 1/(cX_i)        1/c             r_i/(c²X_i)
                 r_i/(c²X_i²)    r_i/(c²X_i)     r_i²/(c³X_i²) − 1/(2c²) ].

Since E r_i = 0 and E r_i² = cX_i², the expected Hessian is

    E H_L = −∑_i [ 1/(cX_i²)    1/(cX_i)    0
                   1/(cX_i)     1/c         0
                   0            0           1/(2c²) ],

so the Fisher information I(α, β, c) = −E H_L is block diagonal between the regression parameters (α, β) and the variance parameter c.
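Fisher scoring with this expected Hessian gives a short fitting routine. The sketch below is illustrative only: the simulated data, starting values, and iteration count are assumptions, and the X_i must be nonzero for the model to make sense.

```python
import numpy as np

def fit_hetero(X, Y, n_iter=50):
    """Fisher scoring for E(Y|X) = alpha + beta*X, var(Y|X) = c*X^2."""
    alpha, beta, c = 0.0, 0.0, float(np.var(Y))
    n = len(Y)
    for _ in range(n_iter):
        r = Y - alpha - beta * X                        # residuals
        grad = np.array([np.sum(r / (c * X**2)),
                         np.sum(r / (c * X)),
                         np.sum(r**2 / (2 * c**2 * X**2)) - n / (2 * c)])
        info = np.array([[np.sum(1 / (c * X**2)), np.sum(1 / (c * X)), 0.0],
                         [np.sum(1 / (c * X)),    n / c,               0.0],
                         [0.0,                    0.0,                 n / (2 * c**2)]])
        step = np.linalg.solve(info, grad)              # I(theta)^{-1} s(theta)
        alpha, beta, c = np.array([alpha, beta, c]) + step
    return alpha, beta, c

# Illustrative simulated data with alpha = 1, beta = 2, c = 0.5
rng = np.random.default_rng(2)
X = rng.uniform(1.0, 3.0, size=400)
Y = 1.0 + 2.0 * X + rng.normal(scale=np.sqrt(0.5) * X)
print(fit_hetero(X, Y))    # roughly (1, 2, 0.5)
```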

Methods using only one derivative

In many cases it is practical to compute the first derivative of L(θ), but not the Hessian or Fisher information matrix. In this case we still know that ∇L(θ) gives the steepest uphill direction from a given position θ. One strategy is therefore to always move in the direction of ∇L(θ): if we are currently at a point θ_i, the next iteration takes us to θ_{i+1} = θ_i + λ∇L(θ_i). We do not know in advance how far to move in the direction ∇L(θ_i) (i.e. the value of λ), so we use a one-dimensional line search (e.g. bisection) to find the best value. The resulting algorithm is known as the steepest ascent algorithm (a sketch in code follows this discussion):

1. Start at θ_0 and set i = 0.
2. Set G_i = ∇L(θ_i).
3. Use a one-dimensional line search to obtain λ_i = argmax_λ L(θ_i + λG_i).
4. Set θ_{i+1} = θ_i + λ_i G_i, increment i, and return to step 2.

There are many variants of this algorithm. The main issue is whether a complete line search is done in step 3, or whether we take a single uphill step, or a fixed number of uphill steps. The best strategy in a given situation depends on the relative expense of evaluating L(θ) and ∇L(θ). The steepest ascent algorithm never takes a downhill step, and it almost always converges to a local or global maximum in practice. Theoretically the method can converge to a saddle point, but such a solution is unstable and therefore difficult to reach in practice.

The trajectory followed by the steepest ascent algorithm can be characterized geometrically in terms of the level sets of the objective function. The gradient ∇f(x_0) of f, evaluated at x_0, is perpendicular to the level set of f through x_0. The extreme values of the restricted function g(λ) = f(x_0 + λ∇f(x_0)) are located at the points λ where ∇f(x_0) is tangent to the level curve of f through x_0 + λ∇f(x_0). Thus, over the course of one iteration, the angle between the search direction ∇f(x_0) and the level sets of f moves from π/2 (at x_0) to 0 (at the line-search optimum).

The steepest ascent method is globally convergent for concave functions, but tends to be very slow. It is not finitely convergent for any interesting class of functions. It is possible to construct optimization algorithms using only first derivatives that are finitely convergent for quadratic functions, notably the conjugate gradient method; these methods are generally preferred to steepest ascent.
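A minimal sketch of steepest ascent with an exact line search is given below. The line search is delegated to scipy.optimize.minimize_scalar, and the ill-conditioned quadratic objective, search bound, and iteration counts are illustrative assumptions; the slow zig-zag progress on this objective illustrates the point above.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def steepest_ascent(f, grad, x0, n_iter=200, tol=1e-8, max_step=10.0):
    """Steepest ascent with a one-dimensional line search over [0, max_step]."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # maximize lambda -> f(x + lambda * g) by minimizing its negative
        res = minimize_scalar(lambda lam: -f(x + lam * g),
                              bounds=(0.0, max_step), method="bounded")
        x = x + res.x * g
    return x

# Illustrative concave objective: f(x) = -x1^2 - 10 * x2^2
f = lambda x: -x[0]**2 - 10.0 * x[1]**2
grad = lambda x: np.array([-2.0 * x[0], -20.0 * x[1]])
print(steepest_ascent(f, grad, x0=[5.0, 1.0]))   # approaches [0, 0], slowly
```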

EM Algorithm

Suppose we observe Y according to a density f_θ(Y), where L(θ | Y) = log f_θ(Y) is a complicated function of θ and therefore hard to optimize. Suppose further that the nature of the problem suggests a random variable Z such that, if Z were observed, the augmented log-likelihood L(θ | Y, Z) = log f_θ(Y, Z) would be simpler, in the sense that the following two conditions hold:

1. We can compute the conditional expectation Q(θ | θ′) = E_θ′( log f_θ(Y, Z) | Y ) in closed form as a function of θ.
2. We can optimize Q(θ | θ′) as a function of θ for fixed θ′; that is, we can compute M(θ′) = argmax_θ Q(θ | θ′).

Under these conditions, the EM algorithm constructs a sequence of iterates according to the rule θ_{i+1} = M(θ_i). Under certain additional regularity conditions (C. F. J. Wu, Annals of Statistics, 1983), the θ_i converge to a stationary point of L(θ | Y). If L(θ | Y) is unimodal and has only one stationary point, the convergence is to the MLE. When convergence occurs, the rate is typically linear. Step 1 is called the E-step and step 2 is called the M-step.

The EM algorithm, like the steepest ascent and conjugate gradient algorithms but unlike Newton's method, is monotonic, in that the sequence of iterates θ_n satisfies L(θ_{n+1} | Y) ≥ L(θ_n | Y). To see this, begin by observing that

    L(θ | Y) = log f_θ(Y, Z) − log f_θ(Z | Y)
             = ∫ f_θ′(Z | Y) log f_θ(Y, Z) dZ − ∫ f_θ′(Z | Y) log f_θ(Z | Y) dZ
             = Q(θ | θ′) − ∫ f_θ′(Z | Y) log f_θ(Z | Y) dZ
             = Q(θ | θ′) − K(θ | θ′),

where the second line takes expectations of both sides with respect to f_θ′(Z | Y), leaving the left-hand side unchanged. Continuing, we have

    L(θ | Y) − L(θ′ | Y) = Q(θ | θ′) − Q(θ′ | θ′) + K(θ′ | θ′) − K(θ | θ′)
                         = Q(θ | θ′) − Q(θ′ | θ′) − ∫ log( f_θ(Z | Y)/f_θ′(Z | Y) ) f_θ′(Z | Y) dZ.

If we set θ = argmax_τ Q(τ | θ′), then Q(θ | θ′) − Q(θ′ | θ′) ≥ 0. By Jensen's inequality,

    ∫ log( f_θ(Z | Y)/f_θ′(Z | Y) ) f_θ′(Z | Y) dZ ≤ log ∫ ( f_θ(Z | Y)/f_θ′(Z | Y) ) f_θ′(Z | Y) dZ = log ∫ f_θ(Z | Y) dZ = 0.

Thus L(θ | Y) − L(θ′ | Y) ≥ 0.

More generally, for any value of θ,

    L(θ | Y) ≥ Q(θ | θ′) + L(θ′ | Y) − Q(θ′ | θ′),

with equality for θ = θ′. This leads to the characterization of the EM algorithm as a minorization, or optimization transfer, algorithm: optimization of the surrogate function Q(θ | θ′) always drives L(θ) uphill.
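Conditions 1 and 2 translate directly into code. The loop below is only a schematic sketch: e_step, m_step, and log_lik are hypothetical callables that a concrete model must supply (they are not defined in these notes), and the assertion simply monitors the monotonicity property just established.

```python
def em(e_step, m_step, log_lik, theta, data, max_iter=500, tol=1e-8):
    """Schematic EM loop: theta_{i+1} = M(theta_i).

    e_step(theta, data) returns whatever expected quantities define Q(. | theta),
    typically expected sufficient statistics; m_step(stats, data) returns
    argmax_theta Q(theta | theta_old); log_lik(theta, data) is the observed-data
    log-likelihood, used only to monitor progress."""
    old = log_lik(theta, data)
    for _ in range(max_iter):
        stats = e_step(theta, data)      # E-step
        theta = m_step(stats, data)      # M-step
        new = log_lik(theta, data)
        assert new >= old - 1e-8         # monotone ascent, up to rounding error
        if new - old < tol:
            break
        old = new
    return theta
```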

Example: Finite mixture models. Suppose π_1, ..., π_K are positive constants with ∑_j π_j = 1, and let f_1, ..., f_K denote density functions. Then ∑_j π_j f_j is also a density function. It is the density of an observation generated according to the following two-stage procedure: (i) generate κ ∈ {1, ..., K} with P(κ = k) = π_k; (ii) generate Y ~ f_κ. We observe Y but not κ.

Suppose we observe Y_1, ..., Y_n according to the mixture density, and we wish to estimate the parameters π = (π_1, ..., π_K) using maximum likelihood. The log-likelihood for the observed data is

    L(π | {Y_j}) = ∑_{j=1}^n log( ∑_{k=1}^K π_k f_k(Y_j) ).

Suppose we were to observe as complete data the pairs (Y_j, κ_j), where κ_j is the random variable defined in (i) above. Then the log-likelihood for the complete data is

    L(π | {Y_j, κ_j}) = ∑_{j=1}^n [ log(π_{κ_j}) + log(f_{κ_j}(Y_j)) ].

The E-step of the EM algorithm requires that we compute the expectation of L(π | {Y_j, κ_j}) given the Y_j for a fixed probability vector π′. This is more easily accomplished if we rewrite the complete-data log-likelihood as follows:

    L(π | {Y_j, κ_j}) = ∑_{j=1}^n ∑_{k=1}^K [ log(π_k) I(κ_j = k) + log(f_k(Y_j)) I(κ_j = k) ].

This is linear in the random variables I(κ_j = k), so the expectation is obtained by plugging in the conditional expectations E_π′( I(κ_j = k) | Y_j ) = P_π′( κ_j = k | Y_j ). Using Bayes' theorem,

    P_π′(κ_j = k | Y_j) ∝ P(Y_j | κ_j = k) P_π′(κ_j = k) = f_k(Y_j) π′_k.

Therefore the expectation is given by

    p_jk ≡ E_π′( I(κ_j = k) | Y_j ) = f_k(Y_j) π′_k / ∑_l f_l(Y_j) π′_l,

and the Q function takes the form

    Q(π | π′) = ∑_{j=1}^n ∑_{k=1}^K [ log(π_k) p_jk + log(f_k(Y_j)) p_jk ].

Only the first term involves π, so the M-step depends only on the first term. We update the probabilities using the rule π_k ← ∑_j p_jk / n.

Example: Gaussian mixture models. Continuing with the previous example, suppose that

    f_k(Y) ∝ exp( −(Y − μ_k)ᵀ Σ_k⁻¹ (Y − μ_k)/2 ) / |Σ_k|^{1/2}

is Gaussian, and we wish to estimate π_1, ..., π_K, μ_1, ..., μ_K, and Σ_1, ..., Σ_K. The complete-data log-likelihood becomes

    L(π, {μ_k}, {Σ_k} | {Y_j, κ_j}) = ∑_{j=1}^n ∑_{k=1}^K [ log(π_k) I(κ_j = k) − (1/2)( log|Σ_k| + (Y_j − μ_k)ᵀ Σ_k⁻¹ (Y_j − μ_k) ) I(κ_j = k) ].

To carry out the E-step, the posterior probabilities p_jk are computed as in the previous example. To carry out the M-step for the mean and variance parameters, use the weighted moment updates

    μ_k ← ∑_j p_jk Y_j / ∑_j p_jk,    Σ_k ← ∑_j p_jk (Y_j − μ_k)(Y_j − μ_k)ᵀ / ∑_j p_jk.

The probability parameters π_k are updated as in the previous example.
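The sketch below carries out these updates for a univariate mixture of K Gaussian components. The simulated data, number of components, and starting values are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def em_gauss_mixture(Y, pi, mu, sigma2, n_iter=200):
    """EM for a univariate K-component Gaussian mixture,
    following the p_jk / weighted-moment updates above."""
    for _ in range(n_iter):
        # E-step: p_jk = pi_k f_k(Y_j) / sum_l pi_l f_l(Y_j)
        dens = norm.pdf(Y[:, None], loc=mu, scale=np.sqrt(sigma2))   # n x K
        p = pi * dens
        p /= p.sum(axis=1, keepdims=True)
        # M-step: weighted moments
        nk = p.sum(axis=0)
        pi = nk / len(Y)
        mu = (p * Y[:, None]).sum(axis=0) / nk
        sigma2 = (p * (Y[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, sigma2

# Illustrative data: an equal mixture of N(0, 1) and N(4, 1)
rng = np.random.default_rng(3)
Y = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(4.0, 1.0, 500)])
print(em_gauss_mixture(Y, pi=np.array([0.5, 0.5]),
                       mu=np.array([-1.0, 1.0]), sigma2=np.array([1.0, 1.0])))
```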

Example: Contingency tables with collapsed cells. Suppose we observe an r × c contingency table with independent rows and columns. If X_ij denotes the number of observations in cell (i, j), then E X_ij = n p_i q_j, where n is the total number of observations, p_i is the probability of an observation falling in row i, and q_j is the probability of an observation falling in column j.

Now suppose that we are unable to distinguish between certain cells, so we can only observe their sum. To be concrete, consider a 2 × 3 table where we can only observe the following four values:

    Z_1 = X_11 + X_12,    Z_2 = X_21,    Z_3 = X_22 + X_13,    Z_4 = X_23.

The likelihood for the observable data is

    L({p_i}, {q_j} | {Z_j}) ∝ (p_1 q_1 + p_1 q_2)^{Z_1} (p_2 q_1)^{Z_2} (p_2 q_2 + p_1 q_3)^{Z_3} (p_2 q_3)^{Z_4}.

We could optimize this function using gradient methods, but it would require a fair amount of programming. On the other hand, consider the log-likelihood for the {X_ij} (some of which are unobservable), which up to a constant is

    L({p_i}, {q_j} | {X_ij}) = ∑_ij X_ij log(p_i q_j).

This is linear in the X_ij, so to compute the Q(·) function we need only compute the conditional means of these values given the Z_j and plug them in. Specifically, set

    X*_11 = E(X_11 | {Z_j}) = p_1 q_1 Z_1 / (p_1 q_1 + p_1 q_2)
    X*_12 = E(X_12 | {Z_j}) = p_1 q_2 Z_1 / (p_1 q_1 + p_1 q_2)
    X*_22 = E(X_22 | {Z_j}) = p_2 q_2 Z_3 / (p_2 q_2 + p_1 q_3)
    X*_13 = E(X_13 | {Z_j}) = p_1 q_3 Z_3 / (p_2 q_2 + p_1 q_3),

with the other X*_ij set to their directly observable values (X*_21 = Z_2, X*_23 = Z_4), and with the current parameter values used on the right-hand sides. Finally, we update the parameter estimates using

    p_i ← ∑_j X*_ij / n,    q_j ← ∑_i X*_ij / n.

This defines one iteration of the EM algorithm.
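The E-step and M-step above fit in a few lines of code. In the sketch below the four collapsed counts are illustrative values, and the uniform starting values for p and q are an assumption.

```python
import numpy as np

def em_collapsed_table(Z, n_iter=500):
    """EM for the 2 x 3 table with cells collapsed as
    Z1 = X11 + X12, Z2 = X21, Z3 = X22 + X13, Z4 = X23."""
    Z1, Z2, Z3, Z4 = Z
    N = sum(Z)
    p = np.array([0.5, 0.5])            # row probabilities p_1, p_2
    q = np.array([1/3, 1/3, 1/3])       # column probabilities q_1, q_2, q_3
    for _ in range(n_iter):
        # E-step: split the collapsed counts in proportion to the cell probabilities
        X11 = p[0]*q[0] * Z1 / (p[0]*q[0] + p[0]*q[1])
        X12 = p[0]*q[1] * Z1 / (p[0]*q[0] + p[0]*q[1])
        X22 = p[1]*q[1] * Z3 / (p[1]*q[1] + p[0]*q[2])
        X13 = p[0]*q[2] * Z3 / (p[1]*q[1] + p[0]*q[2])
        X = np.array([[X11, X12, X13],
                      [Z2,  X22, Z4]])
        # M-step: row and column proportions of the completed table
        p = X.sum(axis=1) / N
        q = X.sum(axis=0) / N
    return p, q

print(em_collapsed_table([30.0, 20.0, 40.0, 10.0]))
```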

Example: Missing data in a multivariate normal model. Suppose that we observe Y_j ∈ R^{d_j} with Y_j ~ N(μ_j, Σ_j). Suppose further that there exist injective functions σ_j : {1, ..., d_j} → {1, ..., d} such that μ_j(k) = μ(σ_j(k)) and Σ_j(k_1, k_2) = Σ(σ_j(k_1), σ_j(k_2)). In words, the components of each Y_j comprise a subset of the components of a common d-dimensional multivariate normal distribution. Suppose that we want to estimate θ = (μ, Σ) based on the following log-likelihood:

    L_θ({Y_j}) = ∑_j [ −(1/2) log|Σ_j| − (1/2)(Y_j − μ_j)ᵀ Σ_j⁻¹ (Y_j − μ_j) ].

Note that this might not be advisable if the σ_j are viewed as random variables that could be dependent on the Y_j. In general, L_θ({Y_j}) cannot be optimized in closed form. A conjugate gradient algorithm would be a good choice here, but the EM algorithm is considerably simpler to derive and program (at least for a statistician).

The complete data are random vectors Z_j ∈ R^d such that Z_j ~ N(μ, Σ) and Y_j(k) = Z_j(σ_j(k)). The complete-data log-likelihood is

    L_θ({Z_j}) = ∑_j [ −(1/2) log|Σ| − (1/2)(Z_j − μ)ᵀ Σ⁻¹ (Z_j − μ) ]
               = −(n/2) log|Σ| − (1/2) ∑_j tr( Σ⁻¹ (Z_j − μ)(Z_j − μ)ᵀ ).

To compute the Q(θ | θ′) function, we need the values η_j = E_{μ′,Σ′}(Z_j | Y_j) and C_j = E_{μ′,Σ′}(Z_j Z_jᵀ | Y_j). Once we have these values, the Q(θ | θ′) function (divided by n, which does not change the maximizer) can be expressed as

    Q(θ | θ′)/n = −(1/2) log|Σ| − (1/2) tr( Σ⁻¹ (1/n) ∑_j ( C_j − η_j μᵀ − μ η_jᵀ + μ μᵀ ) )
                = −(1/2) log|Σ| − (1/2) tr( Σ⁻¹ (μ − η̄)(μ − η̄)ᵀ + Σ⁻¹ ( C̄ − η̄ η̄ᵀ ) ),

where η̄ = ∑_j η_j/n and C̄ = ∑_j C_j/n. The maximum of Q(θ | θ′) occurs at

    μ = η̄,    Σ = C̄ − η̄ η̄ᵀ = ∑_j [ Var_{θ′}(Z_j | Y_j) + (η_j − η̄)(η_j − η̄)ᵀ ] / n.

The values η_j and C_j can be computed easily using the usual formulas for the conditional mean and variance of a multivariate normal random vector. Specifically, let Z* denote the missing components of Z_j (those not matched to Y_j). Then

    E(Z* | Y_j) = E Z* + cov(Z*, Y_j) cov(Y_j)⁻¹ (Y_j − E Y_j),
    cov(Z* | Y_j) = cov(Z*) − cov(Z*, Y_j) cov(Y_j)⁻¹ cov(Y_j, Z*),

where all of the quantities on the right-hand side are computed under θ′ = (μ′, Σ′).
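A compact sketch of this EM iteration follows. Missing components are encoded as np.nan (a common convention, assumed here), and the simulated data, missingness rate, and iteration count are illustrative assumptions.

```python
import numpy as np

def em_mvn_missing(Y, n_iter=100):
    """EM for N(mu, Sigma) with missing entries; Y is n x d with np.nan
    marking the missing components of each row."""
    n, d = Y.shape
    mu = np.nanmean(Y, axis=0)
    Sigma = np.diag(np.nanvar(Y, axis=0))
    for _ in range(n_iter):
        eta = np.empty((n, d))
        V = np.zeros((d, d))                       # sum of conditional covariances
        for j in range(n):
            obs = ~np.isnan(Y[j])
            mis = ~obs
            z = Y[j].copy()
            if mis.any():
                if obs.any():
                    # conditional mean/covariance of the missing given the observed
                    S_oo = Sigma[np.ix_(obs, obs)]
                    S_mo = Sigma[np.ix_(mis, obs)]
                    B = S_mo @ np.linalg.inv(S_oo)
                    z[mis] = mu[mis] + B @ (Y[j, obs] - mu[obs])
                    V[np.ix_(mis, mis)] += Sigma[np.ix_(mis, mis)] - B @ S_mo.T
                else:                              # nothing observed for this case
                    z[:] = mu
                    V += Sigma
            eta[j] = z
        mu = eta.mean(axis=0)                      # mu <- eta-bar
        Sigma = (eta - mu).T @ (eta - mu) / n + V / n
    return mu, Sigma

# Illustrative data with entries deleted at random
rng = np.random.default_rng(4)
true_Sigma = np.array([[1.0, 0.6], [0.6, 2.0]])
Y = rng.multivariate_normal([1.0, -1.0], true_Sigma, size=300)
Y[rng.random(Y.shape) < 0.2] = np.nan
print(em_mvn_missing(Y))
```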

General optimization transfer. The basic idea behind the EM algorithm can be extended to optimization problems where there is no data augmentation, or even to problems that have no likelihood. In fact, the reason the procedure works has only to do with convexity, and nothing whatsoever to do with probability. We retain the idea of the minorant Q(θ | θ′), shifted by Q(θ′ | θ′) so that Q(θ′ | θ′) = 0 and

    L(θ) ≥ Q(θ | θ′) + L(θ′),

so that optimizing Q(θ | θ′) with respect to θ drives L(θ) uphill. There are many ways of constructing such a Q(·). For example, if L(θ) is convex and differentiable, then

    L(θ) ≥ ∇L(θ′)ᵀ(θ − θ′) + L(θ′),

suggesting the minorant Q(θ | θ′) = ∇L(θ′)ᵀ(θ − θ′).

Simulated Annealing

Suppose we want to maximize a real-valued function f with domain S. Simulated annealing is an attractive optimization procedure if any of the following hold: (i) f has many local extreme values, (ii) S is discrete, (iii) f is not smooth, or the derivative of f is much more expensive to compute than the function value.

Suppose we have a family of proposal distributions on S indexed by X′ ∈ S, written R(X | X′). From a starting value X_0, define a sequence of points in S using the rules

    Z_n ~ R(· | X_{n−1}),
    P(X_n = Z_n) = min( exp( (f(Z_n) − f(X_{n−1}))/T_n ), 1 ),
    P(X_n = X_{n−1}) = 1 − P(X_n = Z_n).

Note that if f(Z_n) > f(X_{n−1}) then X_n = Z_n: uphill proposals are always accepted, but some downhill proposals are accepted as well.

As T_n → 0, downhill proposals are accepted less and less often. If T_n → 0 sufficiently slowly, then almost sure convergence to the global extremum occurs for a reasonable class of problems. Theoretically, the rate of T_n should have the form T_n ∝ 1/log(c + n). However, this rate is too slow to be useful for most practical problems, since for n of manageable size the estimate X_n will have substantial variability. There is some experimental evidence that the optimum can be identified for many problems using a heuristic cooling rule, such as decreasing T_n by 10% each time a fixed number of steps have been accepted.
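The sketch below implements the acceptance rule above for a generic proposal distribution. The discrete objective, the random-walk proposal, and the geometric cooling schedule (multiply the temperature by 0.9 after a fixed number of proposals, rather than the 1/log(c + n) schedule required by the theory) are all illustrative assumptions.

```python
import numpy as np

def simulated_annealing(f, propose, x0, T0=1.0, cool=0.9,
                        steps_per_temp=100, n_temps=60, seed=0):
    """Simulated annealing for maximizing f on S; propose(x, rng) draws from R(. | x)."""
    rng = np.random.default_rng(seed)
    x, best = x0, x0
    T = T0
    for _ in range(n_temps):
        for _ in range(steps_per_temp):
            z = propose(x, rng)
            # accept with probability min(exp((f(z) - f(x)) / T), 1)
            if np.log(rng.random()) < (f(z) - f(x)) / T:
                x = z
                if f(x) > f(best):
                    best = x
        T *= cool                      # heuristic geometric cooling
    return best

# Illustrative multimodal objective on the integers 0..99
f = lambda k: np.sin(0.3 * k) + 0.01 * k
propose = lambda k, rng: int(np.clip(k + rng.integers(-3, 4), 0, 99))
print(simulated_annealing(f, propose, x0=0))   # should find a point near k = 89
```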
