Optimization

Background

Problem: given a function $f(x)$ defined on $X$, find $x^*$ such that $f(x^*) \ge f(x)$ for all $x \in X$. The value $x^*$ is called a maximizer of $f$ and is written $\operatorname{argmax}_X f$.

In general, $\operatorname{argmax}_X f$ may not be unique. When a maximizer exists, it is unique if $X$ is convex and $f$ is strictly concave, that is, if $-f$ is strictly convex. A function $g$ is strictly convex if

$$g(\lambda x + (1-\lambda) y) < \lambda g(x) + (1-\lambda) g(y), \qquad 0 < \lambda < 1;\ x, y \in X,\ x \ne y.$$

A twice continuously differentiable function is strictly convex if the Hessian matrix $H_{ij}(x) = \partial^2 f(x)/\partial x_i \partial x_j$ is positive definite for all $x$. (The converse fails: $x^4$ is strictly convex, but its second derivative vanishes at 0.)

If $X$ is discrete the problem is known as combinatorial optimization.

Newton's Method

Suppose $f$ has two continuous derivatives. Then all interior local maxima $x^*$ of $f$ satisfy $f'(x^*) = 0$. More generally, write the vector of partial derivatives of $f$ with respect to $x$ (the gradient of $f$) as follows:

$$\nabla f(x) = (\partial f/\partial x_1, \ldots, \partial f/\partial x_m)^\top.$$

An interior local maximum $x^*$ of $f$ satisfies $\nabla f(x^*) = 0$.

The Hessian of $f$, denoted $H_f$, is an $m \times m$ matrix for each value of $x$, defined by $(H_f)_{ij} = \partial^2 f/\partial x_i \partial x_j$. For twice continuously differentiable functions $H_f$ is symmetric at each $x$. If $f$ is concave in a neighborhood of a local maximizer $x^*$, then $H_f(x^*)$ is negative semidefinite, so $-H_f(x^*)$ is symmetric positive semidefinite.

Suppose $f$ has a continuous Hessian map at $x_0$. Then we can approximate $f$ quadratically in a neighborhood of $x_0$ using

$$f(x) \approx f(x_0) + \nabla f(x_0)^\top (x - x_0) + \tfrac{1}{2} (x - x_0)^\top H_f(x_0) (x - x_0).$$
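This leads to the following approximation to the gradient:

$$\nabla f(x) \approx \nabla f(x_0) + H_f(x_0)(x - x_0).$$

We can solve this expression for the stationary point $\nabla f(x) = 0$, giving

$$x = x_0 - H_f(x_0)^{-1} \nabla f(x_0).$$

This is the Newton-Raphson method for multidimensional optimization. We define a sequence of iterates starting at an arbitrary value $x_0$, and update using the rule

$$x_{i+1} = x_i - H_f(x_i)^{-1} \nabla f(x_i).$$

The Newton-Raphson algorithm converges at a quadratic rate when it is started sufficiently close to a maximizer at which the Hessian is nonsingular and $f$ is smooth enough (for example, has a Lipschitz-continuous Hessian). Convergence from an arbitrary starting value is not guaranteed, even for concave $f$, unless the iteration is safeguarded, for example by step halving.

Maximum Likelihood

Suppose we observe independent realizations $Y_1, \ldots, Y_n$, where the density of $Y_i$ is $f_i(\cdot\,; \theta)$, and $\theta \in R^d$ is an unknown parameter. Then the joint density is given by

$$p(Y_1, \ldots, Y_n) = \prod_{i=1}^n f_i(Y_i; \theta),$$

and the log-likelihood is given by

$$L(\theta \mid Y_1, \ldots, Y_n) = \sum_{i=1}^n \log f_i(Y_i; \theta).$$

The maximum likelihood estimator (MLE) of $\theta$ is defined to be

$$\operatorname{argmax}_\theta L(\theta \mid Y_1, \ldots, Y_n).$$

In this setting, the gradient $\nabla L(\theta)$ is known as the score function, and is often written $s(\theta)$. It has the form

$$s(\theta) = \sum_i \nabla f_i(Y_i; \theta)/f_i(Y_i; \theta) = \sum_i s_i(\theta).$$

The score function is a random vector with expected value 0, provided differentiation and integration can be interchanged:

$$E_\theta s(\theta) = \sum_i \int \frac{\nabla f_i(Y_i; \theta)}{f_i(Y_i; \theta)} f_i(Y_i; \theta)\,dY_i = \sum_i \int \nabla f_i(Y_i; \theta)\,dY_i = \sum_i \nabla \int f_i(Y_i; \theta)\,dY_i = 0.$$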
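As a hedged illustration of the Newton-Raphson update applied to a log-likelihood, the sketch below iterates $\theta \leftarrow \theta - H_L(\theta)^{-1} s(\theta)$ for a two-group Poisson model with log link. The model, the data, and the names `newton_raphson`, `score` and `hessian` are assumptions made for this example and do not come from the notes.

```python
import numpy as np

def newton_raphson(score, hessian, theta0, tol=1e-10, max_iter=50):
    """Iterate theta <- theta - H(theta)^{-1} s(theta) until the step is tiny."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        # Solve H * step = s rather than forming the inverse explicitly.
        step = np.linalg.solve(hessian(theta), score(theta))
        theta = theta - step
        if np.linalg.norm(step) < tol:
            break
    return theta

# Hypothetical data: Y_i ~ Poisson(exp(x_i' theta)), with an intercept
# column and a 0/1 group indicator (5 observations in group 0, 3 in group 1).
y = np.array([2.0, 0.0, 3.0, 1.0, 4.0, 5.0, 7.0, 6.0])
X = np.column_stack([np.ones(8), np.r_[np.zeros(5), np.ones(3)]])

def score(theta):
    # Gradient of sum_i [y_i x_i' theta - exp(x_i' theta)].
    return X.T @ (y - np.exp(X @ theta))

def hessian(theta):
    # -X' diag(mu) X, negative definite because X has full column rank.
    mu = np.exp(X @ theta)
    return -(X * mu[:, None]).T @ X

theta_hat = newton_raphson(score, hessian, np.zeros(2))
```

Because this model has one free mean per group, the fitted means equal the group averages (2 and 6), so `theta_hat` should be close to $(\log 2, \log 3)$, and the score evaluated there should be numerically zero.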
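The Hessian of the log-likelihood function has the form

$$H_L(\theta) = \sum_{i=1}^n \frac{f_i(Y_i; \theta) H_{f_i}(\theta) - \nabla f_i(\theta) \nabla f_i(\theta)^\top}{f_i^2(Y_i; \theta)},$$

where $\nabla f_i(\theta)$ and $H_{f_i}(\theta)$ are the gradient and Hessian of $f_i(Y_i; \theta)$ with respect to $\theta$. We can write

$$E_\theta\left[f_i(Y_i; \theta) H_{f_i}(\theta)/f_i^2(Y_i; \theta)\right] = \int H_{f_i}(\theta)\,dY_i = \nabla^2 \int f_i(Y_i; \theta)\,dY_i = 0.$$

Thus,

$$E_\theta H_L(\theta) = -\sum_{i=1}^n \int s_i(\theta) s_i(\theta)^\top f_i(Y_i; \theta)\,dY_i = -\sum_i \operatorname{cov}_\theta(s_i(\theta), s_i(\theta)) = -\operatorname{cov}(s(\theta), s(\theta)).$$

This quantity, the Fisher information, is denoted $I(\theta)$. With this sign convention the Fisher information matrix is negative semidefinite. (Many texts instead define the Fisher information as $-E_\theta H_L(\theta) = \operatorname{cov}(s(\theta), s(\theta))$, which is positive semidefinite.) It can serve as a stand-in for the Hessian in the Newton-Raphson algorithm, giving the update

$$\theta_{i+1} = \theta_i - I(\theta_i)^{-1} s(\theta_i).$$

This is the Fisher scoring algorithm. It has essentially the same convergence properties as Newton-Raphson, but it is often easier to compute $I$ than $H_L$.

The Fisher information matrix also gives the inverse of the asymptotic variance of the MLE $\hat\theta$: under the sign convention used here, the asymptotic variance is $(-I(\theta))^{-1}$.

Example: Logistic regression.

Suppose we observe independent pairs $(Y_i, X_i)$, $i = 1, \ldots, n$, with $Y_i \in \{0, 1\}$, and the probability law is given by

$$P(Y_i = 1 \mid X_i) = \frac{\exp(\beta^\top X_i)}{1 + \exp(\beta^\top X_i)}, \qquad P(Y_i = 0 \mid X_i) = \frac{1}{1 + \exp(\beta^\top X_i)}.$$

Our goal is to compute the MLE of $\beta$. The joint likelihood is

$$P(Y_1, \ldots, Y_n \mid X_1, \ldots, X_n) = \prod_{i: Y_i = 1} \frac{\exp(\beta^\top X_i)}{1 + \exp(\beta^\top X_i)} \prod_{i: Y_i = 0} \frac{1}{1 + \exp(\beta^\top X_i)}.$$

The log-likelihood function is: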
$$L(\beta) = \beta^\top \sum_{i: Y_i = 1} X_i - \sum_i \log(1 + \exp(\beta^\top X_i)).$$

The score function is

$$s(\beta) = \sum_{i: Y_i = 1} X_i - \sum_i \frac{\exp(\beta^\top X_i)}{1 + \exp(\beta^\top X_i)} X_i.$$

The Hessian of $L$ is

$$H_L(\beta) = -\sum_i \frac{\exp(\beta^\top X_i)}{(1 + \exp(\beta^\top X_i))^2} X_i X_i^\top.$$

The score function is a linear combination of the $X_i$, and the information matrix is a linear combination of the $X_i X_i^\top$. This is typical in exponential family regression models.

The coefficients of $X_i X_i^\top$ in the expression for $H_L(\beta)$ are negative. Thus $H_L$ is negative definite as long as the $X_i$ span $R^m$, and it follows that $L$ is strictly concave, so any local maximizer is unique and is also the global maximizer.

The $Y_i$ do not appear in $H_L(\beta)$, so $H_L(\beta) = I(\beta)$.

Example: Heteroscedastic regression

Suppose we are fitting a linear regression model in which $E(Y \mid X) = \alpha + \beta X$ and $\operatorname{var}(Y \mid X) = cX^2$. We wish to estimate $\alpha$, $\beta$, and $c$ using maximum likelihood. For Gaussian errors the log-likelihood is, up to an additive constant,

$$L(\alpha, \beta, c) = -\frac{1}{2} \sum_i \frac{(Y_i - \alpha - \beta X_i)^2}{c X_i^2} - n \log(c)/2 - \sum_i \log(X_i^2)/2.$$

The gradient is

$$\nabla L(\alpha, \beta, c) = \frac{1}{2} \sum_i \begin{pmatrix} 2 r_i/(c X_i^2) \\ 2 r_i/(c X_i) \\ r_i^2/(c^2 X_i^2) - 1/c \end{pmatrix},$$

where $r_i = Y_i - \alpha - \beta X_i$. The Hessian matrix, obtained by differentiating the gradient entry by entry, is

$$H_L = -\sum_i \begin{pmatrix} 1/(c X_i^2) & 1/(c X_i) & r_i/(c^2 X_i^2) \\ 1/(c X_i) & 1/c & r_i/(c^2 X_i) \\ r_i/(c^2 X_i^2) & r_i/(c^2 X_i) & r_i^2/(c^3 X_i^2) - 1/(2c^2) \end{pmatrix}.$$

The expected Hessian is:
$$E\,H_L = -\sum_i \begin{pmatrix} 1/(c X_i^2) & 1/(c X_i) & 0 \\ 1/(c X_i) & 1/c & 0 \\ 0 & 0 & 1/(2c^2) \end{pmatrix}.$$

Methods using only one derivative

In many cases, it is practical to compute the first derivative of $L(\theta)$, but not the Hessian or Fisher information matrix. In this case, we know that $\nabla L(\theta)$ gives the steepest uphill direction from a given position $\theta$.

One strategy is to always move in the direction of $\nabla L(\theta)$. That is, if we are currently at a point $\theta_i$, the next iteration will take us to $\theta_{i+1} = \theta_i + \lambda \nabla L(\theta_i)$. We do not know how far to move in the direction $\nabla L(\theta)$ (i.e. the value of $\lambda$), so we can use a one-dimensional line search (e.g. bisection) to find the value of $\lambda$ that gives the greatest value of $L$ along that direction.

The resulting algorithm is known as the steepest ascent algorithm:

1. Start at $\theta_0$, set $i = 0$.
2. Set $G_i = \nabla L(\theta_i)$.
3. Use a one-dimensional line search to obtain $\lambda = \operatorname{argmax}_\lambda L(\theta_i + \lambda G_i)$.
4. Set $\theta_{i+1} = \theta_i + \lambda G_i$, increment $i$, and return to step 2.

There are many variants of this algorithm. The main issue is whether a complete line search is done in step 3, or whether we take a single uphill step, or a fixed number of uphill steps. The best strategy in a given situation depends on the relative expense of evaluating $L(\theta)$ and $\nabla L(\theta)$.

The steepest ascent algorithm never takes a downhill step.

The steepest ascent algorithm almost always converges to a local or global maximum in practice. Theoretically, the method can converge to a saddle point, but this is an unstable solution and hence difficult to reach in practice.

The trajectory followed by the steepest ascent algorithm can be characterized geometrically in terms of the level sets of the objective function. The gradient $\nabla f(x_0)$ of $f$, evaluated at $x_0$, is perpendicular to the level sets of $f$ at $x_0$. The extreme values of the restricted function $g(\lambda) = f(x_0 + \lambda \nabla f(x_0))$ are located at the points $\lambda$ such that $\nabla f(x_0)$ is tangent to the level curves of $f$ at $x_0 + \lambda \nabla f(x_0)$. Thus each iteration moves from $x_0$ in the direction $\nabla f(x_0)$ until the angle of $\nabla f(x_0)$ relative to the level sets has moved from $\pi/2$ to 0.
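The four steps of the steepest ascent algorithm can be sketched as follows. This is a minimal sketch that assumes the objective is strictly concave with a finite maximizer, so that bisection on the directional derivative finds the step length. The function names, the bracketing rule, and the quadratic test objective are assumptions made for illustration.

```python
import numpy as np

def line_search(grad, theta, g, n_expand=60, n_bisect=60):
    """Bisection on phi'(lam) = grad(theta + lam*g) . g for concave phi."""
    dphi = lambda lam: grad(theta + lam * g) @ g
    lo, hi = 0.0, 1.0
    for _ in range(n_expand):        # expand until the maximizer is bracketed
        if dphi(hi) <= 0:
            break
        hi *= 2.0
    for _ in range(n_bisect):        # phi' > 0 means we can still move uphill
        mid = 0.5 * (lo + hi)
        if dphi(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def steepest_ascent(grad, theta0, tol=1e-8, max_iter=1000):
    theta = np.asarray(theta0, dtype=float)   # step 1
    n_iter = 0
    while n_iter < max_iter:
        g = grad(theta)                       # step 2
        if np.linalg.norm(g) < tol:
            break
        lam = line_search(grad, theta, g)     # step 3
        theta = theta + lam * g               # step 4
        n_iter += 1
    return theta, n_iter

# Hypothetical concave quadratic L(theta) = b'theta - theta' A theta / 2.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad_L = lambda theta: b - A @ theta

theta_hat, n_used = steepest_ascent(grad_L, np.zeros(2))
```

For this quadratic the maximizer solves $A\theta = b$, which gives $(0.2, 0.4)$, so `theta_hat` can be checked against `np.linalg.solve(A, b)`. Replacing the full line search with a single fixed-length uphill step gives one of the cheaper variants mentioned above.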
The steepest ascent method is globally convergent for concave functions, but tends to be very slow. It is not finitely convergent for any interesting class of functions.

It is possible to construct optimization algorithms using only first derivatives that are finitely convergent for quadratic functions (conjugate gradient methods are the standard example). These methods are generally preferred to steepest ascent.

EM Algorithm

Suppose we observe $Y$ according to a density $f_\theta(Y)$, where $L(\theta \mid Y) = \log f_\theta(Y)$ is a complicated function of $\theta$, and therefore is hard to optimize. Suppose further that the nature of the problem suggests a random variable $Z$ such that if $Z$ were observed, then the augmented log-likelihood $L(\theta \mid Y, Z) = \log f_\theta(Y, Z)$ is simpler, in that the following two conditions hold:

1. We can compute the following conditional expectation as a function of $\theta$ in closed form: $Q(\theta \mid \theta') = E_{\theta'}(L(\theta \mid Y, Z) \mid Y)$.
2. We can optimize $Q(\theta \mid \theta')$ as a function of $\theta$ for fixed $\theta'$. That is, we can compute $M(\theta') = \operatorname{argmax}_\theta Q(\theta \mid \theta')$.

Under these conditions, the EM algorithm constructs a sequence of iterates according to the rule $\theta_{i+1} = M(\theta_i)$.

Under certain additional regularity conditions (J. Wu, Annals Stat. 1983), the $\theta_i$ converge to a stationary point of $L(\theta \mid Y)$. If $L(\theta \mid Y)$ is unimodal and has only one stationary point, then the convergence is to the MLE. When convergence occurs, the rate is typically linear.

Step 1 is called the E-step, and step 2 is called the M-step.

The EM algorithm, like the steepest ascent and conjugate gradient algorithms, but unlike Newton's method, is monotonic, in that the sequence $\theta_n$ of iterates satisfies $L(\theta_{n+1} \mid Y) \ge L(\theta_n \mid Y)$.

Begin by observing that, for any $\theta'$,

$$L(\theta \mid Y) = \log f_\theta(Y, Z) - \log f_\theta(Z \mid Y)$$
$$= \int f_{\theta'}(Z \mid Y) \log f_\theta(Y, Z)\,dZ - \int f_{\theta'}(Z \mid Y) \log f_\theta(Z \mid Y)\,dZ$$
$$= Q(\theta \mid \theta') - \int f_{\theta'}(Z \mid Y) \log f_\theta(Z \mid Y)\,dZ$$
$$= Q(\theta \mid \theta') - K(\theta \mid \theta').$$

Continuing, we have
$$L(\theta \mid Y) - L(\theta' \mid Y) = Q(\theta \mid \theta') - Q(\theta' \mid \theta') + K(\theta' \mid \theta') - K(\theta \mid \theta')$$
$$= Q(\theta \mid \theta') - Q(\theta' \mid \theta') - \int \log\left(\frac{f_\theta(Z \mid Y)}{f_{\theta'}(Z \mid Y)}\right) f_{\theta'}(Z \mid Y)\,dZ.$$

If we set $\theta = \operatorname{argmax}_\tau Q(\tau \mid \theta')$, then $Q(\theta \mid \theta') - Q(\theta' \mid \theta') \ge 0$. By Jensen's inequality,

$$\int \log\left(\frac{f_\theta(Z \mid Y)}{f_{\theta'}(Z \mid Y)}\right) f_{\theta'}(Z \mid Y)\,dZ \le \log \int \frac{f_\theta(Z \mid Y)}{f_{\theta'}(Z \mid Y)} f_{\theta'}(Z \mid Y)\,dZ = 0.$$

Thus $L(\theta \mid Y) - L(\theta' \mid Y) \ge 0$.

For any value of $\theta$,

$$L(\theta \mid Y) \ge Q(\theta \mid \theta') + L(\theta' \mid Y) - Q(\theta' \mid \theta'),$$

with equality for $\theta = \theta'$. This leads to the characterization of the EM algorithm as a majorization (for a maximization problem, strictly speaking a minorization) or optimization transfer algorithm. Optimization of the surrogate function $Q(\theta \mid \theta')$ always drives $L(\theta)$ uphill.

Example: Finite mixture models.

Suppose $\pi_1, \ldots, \pi_K$ are positive constants with $\sum_j \pi_j = 1$. Let $f_1, \ldots, f_K$ denote density functions. Then $\sum_j \pi_j f_j$ is also a density function. It is the density of an observation generated according to the following two-stage procedure: (i) generate $\kappa \in \{1, \ldots, K\}$ with $P(\kappa = k) = \pi_k$, (ii) generate $Y \sim f_\kappa$. We observe $Y$ but not $\kappa$.

Suppose we observe $Y_1, \ldots, Y_n$ according to the mixture density, and we wish to estimate the parameters $\pi = (\pi_1, \ldots, \pi_K)$ using maximum likelihood. The log-likelihood for the observed data is

$$L(\pi \mid \{Y_j\}) = \sum_{j=1}^n \log\left(\sum_{k=1}^K \pi_k f_k(Y_j)\right).$$

Suppose we were to observe as complete data the pair $(Y_j, \kappa_j)$, where $\kappa_j$ is the random variable defined in (i) above. Then the log-likelihood for the complete data is

$$L(\pi \mid \{Y_j, \kappa_j\}) = \sum_{j=1}^n \left[\log(\pi_{\kappa_j}) + \log(f_{\kappa_j}(Y_j))\right].$$

The E-step of the EM algorithm requires that we compute the expectation of $L(\pi \mid \{Y_j, \kappa_j\})$ given the $Y_j$ for a fixed probability vector $\pi'$. This can be more easily accomplished if we rewrite the complete-data log-likelihood as follows:
$$L(\pi \mid \{Y_j, \kappa_j\}) = \sum_{j=1}^n \sum_{k=1}^K \left[\log(\pi_k) I(\kappa_j = k) + \log(f_k(Y_j)) I(\kappa_j = k)\right].$$

This is linear in the random variables $I(\kappa_j = k)$, so the expectation is obtained by plugging in the values for the conditional expectations

$$E_{\pi'}(I(\kappa_j = k) \mid Y_j) = P_{\pi'}(\kappa_j = k \mid Y_j).$$

Using Bayes' theorem, we have

$$P_{\pi'}(\kappa_j = k \mid Y_j) \propto P(Y_j \mid \kappa_j = k) P_{\pi'}(\kappa_j = k) = f_k(Y_j) \pi'_k.$$

Therefore the expectation is given by

$$E_{\pi'}(I(\kappa_j = k) \mid Y_j) = f_k(Y_j) \pi'_k \Big/ \sum_l f_l(Y_j) \pi'_l \equiv p_{jk},$$

and the $Q$ function takes the form

$$Q(\pi \mid \pi') = \sum_{j=1}^n \sum_{k=1}^K \left[\log(\pi_k) p_{jk} + \log(f_k(Y_j)) p_{jk}\right].$$

Note that only the first term involves $\pi$, so the M-step depends only on the first term. We update the parameters using the rule

$$\pi_k \leftarrow \sum_j p_{jk}/n.$$

Example: Gaussian Mixture Models

Continuing with the previous example, suppose that

$$f_k(Y) \propto \exp\left(-(Y - \mu_k)^\top \Sigma_k^{-1} (Y - \mu_k)/2\right) / |\Sigma_k|^{1/2}$$

is Gaussian, and we wish to estimate $\pi_1, \ldots, \pi_K$, $\mu_1, \ldots, \mu_K$ and $\Sigma_1, \ldots, \Sigma_K$. Up to an additive constant, the complete-data log-likelihood becomes

$$L(\pi, \{\Sigma_k\}, \{\mu_k\} \mid \{Y_j, \kappa_j\}) = \sum_{j=1}^n \sum_{k=1}^K \log(\pi_k) I(\kappa_j = k) - \frac{1}{2} \sum_{j=1}^n \sum_{k=1}^K \left(\log|\Sigma_k| + (Y_j - \mu_k)^\top \Sigma_k^{-1} (Y_j - \mu_k)\right) I(\kappa_j = k).$$

To carry out the E-step, the posterior probabilities $p_{jk}$ can be computed as in the previous example. To carry out the M-step for the mean and variance parameters, use the weighted moment updates
$$\mu_k \leftarrow \sum_j p_{jk} Y_j \Big/ \sum_j p_{jk}, \qquad \Sigma_k \leftarrow \sum_j p_{jk} (Y_j - \mu_k)(Y_j - \mu_k)^\top \Big/ \sum_j p_{jk}.$$

The probability parameters $\pi_k$ are updated as in the previous example.

Example: Contingency tables with collapsed cells.

Suppose we observe a two-way contingency table with independent rows and columns. If $X_{ij}$ denotes the number of observations in cell $(i, j)$, then $E X_{ij} = n p_i q_j$, where $n$ is the total number of observations, $p_i$ is the probability of an observation falling in a cell in row $i$, and $q_j$ is the probability of an observation falling in a cell in column $j$.

Now suppose that we are unable to distinguish between certain cells, so we can only observe the sum. To be concrete, consider a $2 \times 3$ table where we can only observe the following 4 values:

$$Z_1 = X_{11} + X_{12}, \qquad Z_2 = X_{21}, \qquad Z_3 = X_{22} + X_{13}, \qquad Z_4 = X_{23}.$$

The likelihood for the observable data is

$$L(\{p_i\}, \{q_j\} \mid \{Z_j\}) \propto (p_1 q_1 + p_1 q_2)^{Z_1} (p_2 q_1)^{Z_2} (p_2 q_2 + p_1 q_3)^{Z_3} (p_2 q_3)^{Z_4}.$$

We could optimize this function using gradient methods, but it would require a fair amount of programming. On the other hand, consider the log-likelihood for the $\{X_{ij}\}$ (some of which are unobservable), which up to an additive constant is

$$L(\{p_i\}, \{q_j\} \mid \{X_{ij}\}) = \sum_{ij} X_{ij} \log(p_i q_j).$$

This is linear in the $X_{ij}$, so to compute the $Q(\cdot)$ function, we need only compute the conditional means of these values given the $Z_j$, and then plug them in. Specifically, with $p_i$ and $q_j$ evaluated at their current estimates, set

$$X^*_{11} = E(X_{11} \mid \{Z_j\}) = p_1 q_1 Z_1/(p_1 q_1 + p_1 q_2)$$
$$X^*_{12} = E(X_{12} \mid \{Z_j\}) = p_1 q_2 Z_1/(p_1 q_1 + p_1 q_2)$$
$$X^*_{22} = E(X_{22} \mid \{Z_j\}) = p_2 q_2 Z_3/(p_2 q_2 + p_1 q_3)$$
$$X^*_{13} = E(X_{13} \mid \{Z_j\}) = p_1 q_3 Z_3/(p_2 q_2 + p_1 q_3),$$
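with the other $X^*_{ij}$ set to their directly observable values. Finally, we can update the parameter estimates using

$$p_i \leftarrow \sum_j X^*_{ij}/n, \qquad q_j \leftarrow \sum_i X^*_{ij}/n.$$

This defines one iteration of the EM algorithm.

Example: Missing data in a multivariate normal model.

Suppose that we observe $Y_j \in R^{d_j}$ with $Y_j \sim N(\mu_j, \Sigma_j)$. Suppose further that there exist injective functions $\sigma_j : \{1, \ldots, d_j\} \to \{1, \ldots, d\}$ such that $\mu_j(k) = \mu(\sigma_j(k))$ and $\Sigma_j(k_1, k_2) = \Sigma(\sigma_j(k_1), \sigma_j(k_2))$. In words, the components of each $Y_j$ comprise a subset of the components of a common multivariate normal distribution.

Suppose that we want to estimate $\theta = (\mu, \Sigma)$ based on the following log-likelihood:

$$L_\theta(\{Y_j\}) = \sum_j \left[-\frac{1}{2} \log|\Sigma_j| - \frac{1}{2} (Y_j - \mu_j)^\top \Sigma_j^{-1} (Y_j - \mu_j)\right].$$

Note that this might not be advisable if the $\sigma_j$ are viewed as random variables that could be dependent on the $Y_j$.

In general, $L_\theta(\{Y_j\})$ cannot be optimized in closed form. A conjugate gradient algorithm would be a good choice here, but the EM algorithm is considerably simpler to derive and program (at least for a statistician).

The complete data are random vectors $Z_j \in R^d$ such that $Z_j \sim N(\mu, \Sigma)$ and $Y_j(k) = Z_j(\sigma_j(k))$. The complete-data log-likelihood is

$$L_\theta(\{Z_j\}) = \sum_j \left[-\frac{1}{2} \log|\Sigma| - \frac{1}{2} (Z_j - \mu)^\top \Sigma^{-1} (Z_j - \mu)\right] \propto -\frac{1}{2} \log|\Sigma| - \frac{1}{2n} \sum_j \operatorname{tr}\left(\Sigma^{-1} (Z_j - \mu)(Z_j - \mu)^\top\right),$$

where the second expression divides by $n$, which does not change the maximizer.

To compute the $Q(\theta \mid \theta')$ function, we need the values $\eta_j = E_{\mu', \Sigma'}(Z_j \mid Y_j)$ and $C_j = E_{\mu', \Sigma'}(Z_j Z_j^\top \mid Y_j)$. Once we have these values, the $Q(\theta \mid \theta')$ function (on the same $1/n$ scale) can be expressed as

$$Q(\theta \mid \theta') = -\frac{1}{2} \log|\Sigma| - \frac{1}{2} \operatorname{tr}\left(\Sigma^{-1} \frac{1}{n} \sum_j \left(C_j - \eta_j \mu^\top - \mu \eta_j^\top + \mu \mu^\top\right)\right)$$
$$= -\frac{1}{2} \log|\Sigma| - \frac{1}{2} \operatorname{tr}\left(\Sigma^{-1} (\mu - \bar\eta)(\mu - \bar\eta)^\top + \Sigma^{-1} (\bar C - \bar\eta \bar\eta^\top)\right),$$

where $\bar\eta = \sum_j \eta_j/n$ and $\bar C = \sum_j C_j/n$. The maximum of $Q(\theta \mid \theta')$ occurs where $\mu = \bar\eta$ and $\Sigma = \bar C - \bar\eta \bar\eta^\top$. Since $C_j = \operatorname{cov}(Z_j \mid Y_j) + \eta_j \eta_j^\top$, this covariance update can also be written as

$$\Sigma = \frac{1}{n} \sum_j \operatorname{cov}(Z_j \mid Y_j) + \frac{1}{n} \sum_j (\eta_j - \bar\eta)(\eta_j - \bar\eta)^\top.$$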
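One way to code this iteration is sketched below. It assumes the data are stored as an $n \times d$ array with `np.nan` marking the unobserved components, that every row has at least one observed component, and that $\eta_j$ and $C_j$ are obtained from the standard conditional mean and covariance of a multivariate normal. The function name `em_mvn_missing`, the starting values, and the fixed iteration count are choices made for this illustration.

```python
import numpy as np

def em_mvn_missing(Y, n_iter=100):
    """EM estimates of (mu, Sigma) from rows of Y with np.nan for missing entries."""
    n, d = Y.shape
    obs = ~np.isnan(Y)
    mu = np.nanmean(Y, axis=0)                 # starting values (an assumption)
    Sigma = np.diag(np.nanvar(Y, axis=0))
    for _ in range(n_iter):
        eta_sum = np.zeros(d)
        C_sum = np.zeros((d, d))
        for j in range(n):
            o, m = obs[j], ~obs[j]
            z = Y[j].copy()                    # becomes eta_j = E(Z_j | Y_j)
            V = np.zeros((d, d))               # cov(Z_j | Y_j), zero on observed part
            if m.any():
                S_oo = Sigma[np.ix_(o, o)]
                S_mo = Sigma[np.ix_(m, o)]
                K = np.linalg.solve(S_oo, S_mo.T).T      # S_mo S_oo^{-1}
                z[m] = mu[m] + K @ (Y[j, o] - mu[o])
                V[np.ix_(m, m)] = Sigma[np.ix_(m, m)] - K @ S_mo.T
            eta_sum += z
            C_sum += np.outer(z, z) + V        # C_j = eta_j eta_j' + cov(Z_j | Y_j)
        mu = eta_sum / n                       # M-step: mu = eta-bar
        Sigma = C_sum / n - np.outer(mu, mu)   # M-step: Sigma = C-bar - eta-bar eta-bar'
    return mu, Sigma
```

When no entries are missing, every $\eta_j$ equals $Y_j$ and every conditional covariance is zero, so the function returns the sample mean and the maximum likelihood (divide by $n$) covariance after one pass. In practice the loop would usually stop when the change in the estimates falls below a tolerance rather than after a fixed number of iterations.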
The values $\eta_j$ and $C_j$ can be easily computed using the usual formulas for the conditional mean and variance of a multivariate normal random vector. Specifically, let $Z^*$ denote the missing components of $Z$. Then

$$E(Z^* \mid Y) = E Z^* + \operatorname{cov}(Z^*, Y) \operatorname{cov}(Y)^{-1} (Y - EY)$$
$$\operatorname{cov}(Z^* \mid Y) = \operatorname{cov}(Z^*) - \operatorname{cov}(Z^*, Y) \operatorname{cov}(Y)^{-1} \operatorname{cov}(Y, Z^*),$$

where all of the values on the right hand side are computed with respect to $\theta' = (\mu', \Sigma')$.

General optimization transfer.

The basic idea behind the EM algorithm can be extended to optimization problems where there is no data augmentation, or even to problems that have no likelihood. In fact the reason that the procedure works has only to do with convexity, and nothing whatsoever to do with probability.

We retain the idea of the minorant $Q(\theta \mid \theta')$ and shift it by $Q(\theta' \mid \theta')$, so that $Q(\theta' \mid \theta') = 0$ and

$$L(\theta) \ge Q(\theta \mid \theta') + L(\theta'),$$

so that optimizing $Q(\theta \mid \theta')$ (with respect to $\theta$) drives $L(\theta)$ uphill.

There are many ways of constructing such a $Q(\cdot)$. For example, if $L(\theta)$ is convex and differentiable, then

$$L(\theta) \ge L'(\theta')(\theta - \theta') + L(\theta'),$$

suggesting $Q(\theta \mid \theta') = L'(\theta')(\theta - \theta')$.

Simulated Annealing

Suppose we want to maximize a real-valued function $f$ with domain $S$. Simulated annealing is an attractive optimization procedure if any of the following hold: (i) $f$ has many local extreme values, (ii) $S$ is discrete, (iii) $f$ is not smooth, or the derivative of $f$ is much more expensive to compute than the function value.

Suppose we have a family of proposal distributions on $S$ indexed by $X' \in S$: $R(X \mid X')$. From a starting value $X_0$, define a sequence of points in $S$ using the rules

$$Z_n \sim R(\cdot \mid X_{n-1})$$
$$P(X_n = Z_n) = \min\left(\exp((f(Z_n) - f(X_{n-1}))/T_n),\ 1\right)$$
$$P(X_n = X_{n-1}) = 1 - P(X_n = Z_n).$$

Note that if $f(Z_n) > f(X_{n-1})$ then $X_n = Z_n$: uphill proposals are always accepted, but some downhill proposals are accepted as well.
As $T_n \to 0$, downhill proposals are accepted less often. If $T_n \to 0$ sufficiently slowly, then almost sure convergence to the global extremum occurs for a reasonable class of problems.

Theoretically, the rate of $T_n$ should have the form $T_n \propto 1/\log(c + n)$. However, this rate is too slow to be useful for most practical problems, since for $n$ of manageable size the estimate $X_n$ will have substantial variability.

There is some experimental evidence that the optimal value can be identified for many problems using a heuristic rule, such as decreasing $T_n$ by 10% each time a fixed number of steps are accepted.
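The acceptance rule and the 10% cooling heuristic can be sketched as follows. This is a minimal sketch, not a tuned implementation: the objective, the proposal distribution on $S = \{0, \ldots, 199\}$, the starting temperature, and all names are hypothetical. Tracking the best state visited so far is an extra convenience added here; the notes themselves only describe the chain $X_n$.

```python
import math
import random

def simulated_annealing(f, propose, x0, t0=1.0, n_steps=5000,
                        accepts_per_cool=20, seed=0):
    """Maximize f; returns (best state visited, final state of the chain)."""
    rng = random.Random(seed)
    x, t = x0, t0
    best = x0
    accepted = 0
    for _ in range(n_steps):
        z = propose(x, rng)                  # Z_n ~ R(. | X_{n-1})
        delta = f(z) - f(x)
        # Accept with probability min(exp(delta / T_n), 1).
        if delta >= 0 or rng.random() < math.exp(delta / t):
            x = z
            accepted += 1
            if f(x) > f(best):
                best = x
            if accepted % accepts_per_cool == 0:
                t *= 0.9                     # cool by 10% per block of accepted steps
    return best, x

def f(x):
    # Hypothetical objective with many local maxima on {0, ..., 199}.
    return 5.0 * math.sin(x / 3.0) - abs(x - 120) / 10.0

def propose(x, rng):
    # Random move of up to 3 units, clipped to the domain (so the
    # proposal is not exactly symmetric at the boundary).
    return min(199, max(0, x + rng.choice([-3, -2, -1, 1, 2, 3])))

best, last = simulated_annealing(f, propose, x0=0)
```

Early on, when $T_n$ is large, the chain moves freely across the local maxima; as the temperature drops it behaves more and more like a pure uphill search. The number of accepted steps per cooling block and the starting temperature usually need tuning for each problem.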
Information in a Two-Stage Adaptive Optimal Design Department of Statistics, University of Missouri Designed Experiments: Recent Advances in Methods and Applications DEMA 2011 Isaac Newton Institute for
More informationExpectation Maximization (EM) Algorithm. Each has it s own probability of seeing H on any one flip. Let. p 1 = P ( H on Coin 1 )
Expectation Maximization (EM Algorithm Motivating Example: Have two coins: Coin 1 and Coin 2 Each has it s own probability of seeing H on any one flip. Let p 1 = P ( H on Coin 1 p 2 = P ( H on Coin 2 Select
More informationLecture 3 - Linear and Logistic Regression
3 - Linear and Logistic Regression-1 Machine Learning Course Lecture 3 - Linear and Logistic Regression Lecturer: Haim Permuter Scribe: Ziv Aharoni Throughout this lecture we talk about how to use regression
More informationσ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =
Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,
More informationEstimating the parameters of hidden binomial trials by the EM algorithm
Hacettepe Journal of Mathematics and Statistics Volume 43 (5) (2014), 885 890 Estimating the parameters of hidden binomial trials by the EM algorithm Degang Zhu Received 02 : 09 : 2013 : Accepted 02 :
More informationMachine Learning. Lecture 04: Logistic and Softmax Regression. Nevin L. Zhang
Machine Learning Lecture 04: Logistic and Softmax Regression Nevin L. Zhang lzhang@cse.ust.hk Department of Computer Science and Engineering The Hong Kong University of Science and Technology This set
More informationLogistic Regression. Seungjin Choi
Logistic Regression Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr http://mlg.postech.ac.kr/
More informationInformation in Data. Sufficiency, Ancillarity, Minimality, and Completeness
Information in Data Sufficiency, Ancillarity, Minimality, and Completeness Important properties of statistics that determine the usefulness of those statistics in statistical inference. These general properties
More informationStatistical Data Mining and Machine Learning Hilary Term 2016
Statistical Data Mining and Machine Learning Hilary Term 2016 Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/sdmml Naïve Bayes
More informationComputational statistics
Computational statistics EM algorithm Thierry Denœux February-March 2017 Thierry Denœux Computational statistics February-March 2017 1 / 72 EM Algorithm An iterative optimization strategy motivated by
More informationFinite Singular Multivariate Gaussian Mixture
21/06/2016 Plan 1 Basic definitions Singular Multivariate Normal Distribution 2 3 Plan Singular Multivariate Normal Distribution 1 Basic definitions Singular Multivariate Normal Distribution 2 3 Multivariate
More informationGaussian Processes 1. Schedule
1 Schedule 17 Jan: Gaussian processes (Jo Eidsvik) 24 Jan: Hands-on project on Gaussian processes (Team effort, work in groups) 31 Jan: Latent Gaussian models and INLA (Jo Eidsvik) 7 Feb: Hands-on project
More informationMaximum Likelihood Estimation
Maximum Likelihood Estimation Prof. C. F. Jeff Wu ISyE 8813 Section 1 Motivation What is parameter estimation? A modeler proposes a model M(θ) for explaining some observed phenomenon θ are the parameters
More informationSemiparametric Gaussian Copula Models: Progress and Problems
Semiparametric Gaussian Copula Models: Progress and Problems Jon A. Wellner University of Washington, Seattle European Meeting of Statisticians, Amsterdam July 6-10, 2015 EMS Meeting, Amsterdam Based on
More informationAM 205: lecture 18. Last time: optimization methods Today: conditions for optimality
AM 205: lecture 18 Last time: optimization methods Today: conditions for optimality Existence of Global Minimum For example: f (x, y) = x 2 + y 2 is coercive on R 2 (global min. at (0, 0)) f (x) = x 3
More informationCS229 Lecture notes. Andrew Ng
CS229 Lecture notes Andrew Ng Part X Factor analysis When we have data x (i) R n that comes from a mixture of several Gaussians, the EM algorithm can be applied to fit a mixture model. In this setting,
More informationIEOR E4570: Machine Learning for OR&FE Spring 2015 c 2015 by Martin Haugh. The EM Algorithm
IEOR E4570: Machine Learning for OR&FE Spring 205 c 205 by Martin Haugh The EM Algorithm The EM algorithm is used for obtaining maximum likelihood estimates of parameters when some of the data is missing.
More informationSupport Vector Machines
Support Vector Machines Le Song Machine Learning I CSE 6740, Fall 2013 Naïve Bayes classifier Still use Bayes decision rule for classification P y x = P x y P y P x But assume p x y = 1 is fully factorized
More informationLecture 16 Solving GLMs via IRWLS
Lecture 16 Solving GLMs via IRWLS 09 November 2015 Taylor B. Arnold Yale Statistics STAT 312/612 Notes problem set 5 posted; due next class problem set 6, November 18th Goals for today fixed PCA example
More informationFall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.
1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n
More informationNeural Network Training
Neural Network Training Sargur Srihari Topics in Network Training 0. Neural network parameters Probabilistic problem formulation Specifying the activation and error functions for Regression Binary classification
More informationSTAT Advanced Bayesian Inference
1 / 8 STAT 625 - Advanced Bayesian Inference Meng Li Department of Statistics March 5, 2018 Distributional approximations 2 / 8 Distributional approximations are useful for quick inferences, as starting
More informationChapter 4: Asymptotic Properties of the MLE (Part 2)
Chapter 4: Asymptotic Properties of the MLE (Part 2) Daniel O. Scharfstein 09/24/13 1 / 1 Example Let {(R i, X i ) : i = 1,..., n} be an i.i.d. sample of n random vectors (R, X ). Here R is a response
More informationSTAT 135 Lab 13 (Review) Linear Regression, Multivariate Random Variables, Prediction, Logistic Regression and the δ-method.
STAT 135 Lab 13 (Review) Linear Regression, Multivariate Random Variables, Prediction, Logistic Regression and the δ-method. Rebecca Barter May 5, 2015 Linear Regression Review Linear Regression Review
More informationNonparametric Bayesian Methods (Gaussian Processes)
[70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent
More informationChapter 4: Asymptotic Properties of the MLE
Chapter 4: Asymptotic Properties of the MLE Daniel O. Scharfstein 09/19/13 1 / 1 Maximum Likelihood Maximum likelihood is the most powerful tool for estimation. In this part of the course, we will consider
More informationGraphical Models for Collaborative Filtering
Graphical Models for Collaborative Filtering Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Sequence modeling HMM, Kalman Filter, etc.: Similarity: the same graphical model topology,
More informationStat 710: Mathematical Statistics Lecture 12
Stat 710: Mathematical Statistics Lecture 12 Jun Shao Department of Statistics University of Wisconsin Madison, WI 53706, USA Jun Shao (UW-Madison) Stat 710, Lecture 12 Feb 18, 2009 1 / 11 Lecture 12:
More informationParametric Techniques
Parametric Techniques Jason J. Corso SUNY at Buffalo J. Corso (SUNY at Buffalo) Parametric Techniques 1 / 39 Introduction When covering Bayesian Decision Theory, we assumed the full probabilistic structure
More informationUnsupervised Learning
2018 EE448, Big Data Mining, Lecture 7 Unsupervised Learning Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/ee448/index.html ML Problem Setting First build and
More information1. Fisher Information
1. Fisher Information Let f(x θ) be a density function with the property that log f(x θ) is differentiable in θ throughout the open p-dimensional parameter set Θ R p ; then the score statistic (or score
More informationBayesian Paradigm. Maximum A Posteriori Estimation
Bayesian Paradigm Maximum A Posteriori Estimation Simple acquisition model noise + degradation Constraint minimization or Equivalent formulation Constraint minimization Lagrangian (unconstraint minimization)
More informationHOMEWORK #4: LOGISTIC REGRESSION
HOMEWORK #4: LOGISTIC REGRESSION Probabilistic Learning: Theory and Algorithms CS 274A, Winter 2018 Due: Friday, February 23rd, 2018, 11:55 PM Submit code and report via EEE Dropbox You should submit a
More informationSpatial statistics, addition to Part I. Parameter estimation and kriging for Gaussian random fields
Spatial statistics, addition to Part I. Parameter estimation and kriging for Gaussian random fields 1 Introduction Jo Eidsvik Department of Mathematical Sciences, NTNU, Norway. (joeid@math.ntnu.no) February
More informationExpectation Maximization Algorithm
Expectation Maximization Algorithm Vibhav Gogate The University of Texas at Dallas Slides adapted from Carlos Guestrin, Dan Klein, Luke Zettlemoyer and Dan Weld The Evils of Hard Assignments? Clusters
More informationUnbiased Estimation. Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others.
Unbiased Estimation Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others. To compare ˆθ and θ, two estimators of θ: Say ˆθ is better than θ if it
More informationGauge Plots. Gauge Plots JAPANESE BEETLE DATA MAXIMUM LIKELIHOOD FOR SPATIALLY CORRELATED DISCRETE DATA JAPANESE BEETLE DATA
JAPANESE BEETLE DATA 6 MAXIMUM LIKELIHOOD FOR SPATIALLY CORRELATED DISCRETE DATA Gauge Plots TuscaroraLisa Central Madsen Fairways, 996 January 9, 7 Grubs Adult Activity Grub Counts 6 8 Organic Matter
More informationComputer Intensive Methods in Mathematical Statistics
Computer Intensive Methods in Mathematical Statistics Department of mathematics johawes@kth.se Lecture 16 Advanced topics in computational statistics 18 May 2017 Computer Intensive Methods (1) Plan of
More informationLatent Variable Models and EM Algorithm
SC4/SM8 Advanced Topics in Statistical Machine Learning Latent Variable Models and EM Algorithm Dino Sejdinovic Department of Statistics Oxford Slides and other materials available at: http://www.stats.ox.ac.uk/~sejdinov/atsml/
More informationPARAMETER CONVERGENCE FOR EM AND MM ALGORITHMS
Statistica Sinica 15(2005), 831-840 PARAMETER CONVERGENCE FOR EM AND MM ALGORITHMS Florin Vaida University of California at San Diego Abstract: It is well known that the likelihood sequence of the EM algorithm
More informationClassification: Logistic Regression from Data
Classification: Logistic Regression from Data Machine Learning: Alvin Grissom II University of Colorado Boulder Slides adapted from Emily Fox Machine Learning: Alvin Grissom II Boulder Classification:
More informationA Note on the Expectation-Maximization (EM) Algorithm
A Note on the Expectation-Maximization (EM) Algorithm ChengXiang Zhai Department of Computer Science University of Illinois at Urbana-Champaign March 11, 2007 1 Introduction The Expectation-Maximization
More information