1 EM Algorithm

2 Last lecture 1/35
General optimization problems: Newton-Raphson, Fisher scoring, quasi-Newton
Nonlinear regression models: Gauss-Newton
Generalized linear models: iteratively reweighted least squares

3 Expectation maximization (EM) algorithm 2/35
An iterative algorithm for maximizing the likelihood when the model contains unobserved latent variables.
It was initially invented by computer scientists for special circumstances, and was generalized by Arthur Dempster, Nan Laird, and Donald Rubin in a classic 1977 JRSSB paper, widely known as the DLR paper.
The algorithm iterates between an E-step (expectation) and an M-step (maximization).
E-step: form the expected log-likelihood, evaluated under the current estimate of the parameters.
M-step: compute the parameters maximizing the expected log-likelihood found in the E-step.

4 Motivating example of EM algorithm 3/35
Assume people's heights (in cm) follow normal distributions with different means for males and females: $N(\mu_1, \sigma_1^2)$ for males and $N(\mu_2, \sigma_2^2)$ for females.
We observe the heights of 5 people (but do not know their genders): 182, 163, 175, 185, 158. We want to estimate $\mu_1$, $\mu_2$, $\sigma_1$ and $\sigma_2$.
This is the typical two-component normal mixture model, i.e., the data come from a mixture of two normal distributions, and the goal is to estimate the model parameters.
We could, of course, form the likelihood function (a product of two-component mixture densities) and find its maximum by Newton-Raphson.

5 A sketch of an EM algorithm 4/35
Some notation: for person $i$, denote his/her height by $x_i$, and use $Z_i$ to indicate gender ($Z_i = 1$ for male). Let $\pi$ be the proportion of males in the population.
Start by choosing reasonable initial values. Then:
In the E-step, compute the probability of each person being male or female, given the current model parameters. We have (after some derivation)
$$\lambda_i^{(k)} \equiv E\big[Z_i \mid x_i, \mu_1^{(k)}, \mu_2^{(k)}, \sigma_1^{(k)}, \sigma_2^{(k)}, \pi^{(k)}\big] = \frac{\pi^{(k)} \phi(x_i; \mu_1^{(k)}, \sigma_1^{(k)})}{\pi^{(k)} \phi(x_i; \mu_1^{(k)}, \sigma_1^{(k)}) + (1-\pi^{(k)}) \phi(x_i; \mu_2^{(k)}, \sigma_2^{(k)})}.$$
In the M-step, update the group means, variances and the group proportion, using the probabilities from the E-step as weights. They are basically weighted averages and variances. For example,
$$\mu_1^{(k+1)} = \frac{\sum_i \lambda_i^{(k)} x_i}{\sum_i \lambda_i^{(k)}}, \qquad \mu_2^{(k+1)} = \frac{\sum_i (1-\lambda_i^{(k)}) x_i}{\sum_i (1-\lambda_i^{(k)})}, \qquad \pi^{(k+1)} = \sum_i \lambda_i^{(k)} / 5.$$

6 Example results 5/35
We choose $\mu_1 = 175$, $\mu_2 = 165$, $\sigma_1 = \sigma_2 = 10$ as initial values.
After the first iteration, the E-step gives each person's probability of being male, $\lambda_i$, and the M-step estimates (weighted averages and variances) are $\mu_1 = 176$, $\mu_2 = 167$, $\sigma_1 = 8.7$, $\sigma_2 = 9.2$.
At iteration 15 (converged), the probability of being male is essentially 0 or 1 for each person, and the parameter estimates are $\mu_1 = 179.6$, $\mu_2 = 161.5$, $\sigma_1 = 4.1$, $\sigma_2 = 3.5$, $\pi = 0.6$.
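
A minimal R sketch of one E-step and M-step on the five observed heights, using the initial values above. The slides do not state the initial value of $\pi$; 0.5 is assumed here, so the resulting numbers may differ slightly from those shown on this slide.

x   <- c(182, 163, 175, 185, 158)
mu1 <- 175; mu2 <- 165; sd1 <- sd2 <- 10; pi0 <- 0.5   # pi0 = 0.5 is an assumption
## E-step: lambda_i = P(male | x_i, current parameters)
lambda <- pi0 * dnorm(x, mu1, sd1) /
          (pi0 * dnorm(x, mu1, sd1) + (1 - pi0) * dnorm(x, mu2, sd2))
## M-step: weighted means, standard deviations and proportion
mu1 <- sum(lambda * x) / sum(lambda)
mu2 <- sum((1 - lambda) * x) / sum(1 - lambda)
sd1 <- sqrt(sum(lambda * (x - mu1)^2) / sum(lambda))
sd2 <- sqrt(sum((1 - lambda) * (x - mu2)^2) / sum(1 - lambda))
pi0 <- mean(lambda)
round(c(mu1 = mu1, mu2 = mu2, sd1 = sd1, sd2 = sd2, pi = pi0), 2)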

7 Another motivating example of EM algorithm 6/35
ABO blood groups:

Genotype   Genotype frequency   Phenotype
AA         $p_A^2$              A
AO         $2 p_A p_O$          A
BB         $p_B^2$              B
BO         $2 p_B p_O$          B
OO         $p_O^2$              O
AB         $2 p_A p_B$          AB

The genotype frequencies above assume Hardy-Weinberg equilibrium.
For a random sample of $n$ individuals, we observe their phenotype, but not their genotype. We wish to obtain the MLEs of the underlying allele frequencies $p_A$, $p_B$, and $p_O = 1 - p_A - p_B$.
The likelihood is (from the multinomial distribution):
$$L(p_A, p_B) = (p_A^2 + 2 p_A p_O)^{n_A} (p_B^2 + 2 p_B p_O)^{n_B} (p_O^2)^{n_O} (2 p_A p_B)^{n_{AB}}$$

8 Motivating example (allele counting algorithm) 7/35
Let $n_A$, $n_B$, $n_O$, $n_{AB}$ be the observed numbers of individuals with phenotypes A, B, O, AB, respectively. Let $n_{AA}$, $n_{AO}$, $n_{BB}$ and $n_{BO}$ be the unobserved numbers of individuals with genotypes AA, AO, BB and BO, respectively. They satisfy $n_{AA} + n_{AO} = n_A$ and $n_{BB} + n_{BO} = n_B$.
1. Start with initial estimates $p^{(0)} = (p_A^{(0)}, p_B^{(0)}, p_O^{(0)})$.
2. Calculate the expected $n_{AA}$ and $n_{BB}$, given the observed data and $p^{(k)}$:
$$n_{AA}^{(k+1)} = E(n_{AA} \mid n_A, p^{(k)}) = n_A \frac{(p_A^{(k)})^2}{(p_A^{(k)})^2 + 2 p_A^{(k)} p_O^{(k)}}, \qquad n_{BB}^{(k+1)} = \;?$$
3. Update $p^{(k+1)}$, imagining that $n_{AA}^{(k+1)}$, $n_{AO}^{(k+1)}$, $n_{BB}^{(k+1)}$ and $n_{BO}^{(k+1)}$ were actually observed:
$$p_A^{(k+1)} = \big(2 n_{AA}^{(k+1)} + n_{AO}^{(k+1)} + n_{AB}\big)/(2n), \qquad p_B^{(k+1)} = \;?$$
4. Repeat steps 2 and 3 until the estimates converge.
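
A minimal R sketch of the allele-counting iteration above. The function name abo_em and the example counts in the final comment are illustrative, not taken from the slides.

abo_em <- function(nA, nB, nO, nAB, pA = 1/3, pB = 1/3, maxiter = 1000, tol = 1e-8) {
  n  <- nA + nB + nO + nAB
  pO <- 1 - pA - pB
  for (k in 1:maxiter) {
    ## E-step: expected genotype counts given phenotype counts and current allele frequencies
    nAA <- nA * pA^2 / (pA^2 + 2 * pA * pO)
    nAO <- nA - nAA
    nBB <- nB * pB^2 / (pB^2 + 2 * pB * pO)
    nBO <- nB - nBB
    ## M-step: allele counting, treating the imputed genotype counts as observed
    pA.new <- (2 * nAA + nAO + nAB) / (2 * n)
    pB.new <- (2 * nBB + nBO + nAB) / (2 * n)
    if (max(abs(c(pA.new - pA, pB.new - pB))) < tol) break
    pA <- pA.new; pB <- pB.new; pO <- 1 - pA - pB
  }
  c(pA = pA, pB = pB, pO = 1 - pA - pB)
}
## e.g. abo_em(nA = 186, nB = 38, nO = 284, nAB = 13)   # hypothetical counts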

9 EM algorithm: Applications 8/35
The expectation-maximization algorithm (Dempster, Laird, and Rubin, 1977, JRSSB, 39:1-38) is a general iterative algorithm for parameter estimation by maximum likelihood (optimization problems).
It is useful when
- some of the random variables involved are not observed, i.e., are considered missing or incomplete;
- directly maximizing the target likelihood function is difficult, but one can introduce (missing) random variables so that maximizing the complete-data likelihood is simple.
Typical problems include:
- filling in missing data in a sample
- discovering the values of latent variables
- estimating parameters of HMMs
- estimating parameters of finite mixtures

10 Description of EM 9/35
Consider $(Y_{obs}, Y_{mis}) \sim f(y_{obs}, y_{mis} \mid \theta)$, where we observe $Y_{obs}$ but not $Y_{mis}$.
It can be difficult to find the MLE
$$\hat\theta = \arg\max_\theta g(Y_{obs} \mid \theta) = \arg\max_\theta \int f(Y_{obs}, y_{mis} \mid \theta)\, dy_{mis}.$$
But it could be easy to find $\hat\theta_C = \arg\max_\theta f(Y_{obs}, Y_{mis} \mid \theta)$, if we had observed $Y_{mis}$.
E-step: $h^{(k)}(\theta) \equiv E\{\log f(Y_{obs}, Y_{mis} \mid \theta) \mid Y_{obs}, \theta^{(k)}\}$
M-step: $\theta^{(k+1)} = \arg\max_\theta h^{(k)}(\theta)$
Nice properties (compared to Newton-Raphson):
1. simplicity of implementation
2. stable monotone convergence
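
To fix ideas, a minimal generic skeleton of this iteration in R (a sketch, not from the slides): `Estep(theta)` is assumed to return the surrogate $h^{(k)}(\cdot)$ as an R function, and `Mstep(h)` its maximizer; both are problem-specific placeholders.

em <- function(theta0, Estep, Mstep, maxiter = 1000, tol = 1e-8) {
  theta <- theta0
  for (k in 1:maxiter) {
    h <- Estep(theta)            # E-step: build h^(k)(.) = E[log f(Y_obs, Y_mis | .) | Y_obs, theta]
    theta.new <- Mstep(h)        # M-step: maximize the surrogate
    if (max(abs(theta.new - theta)) < tol) break
    theta <- theta.new
  }
  list(theta = theta, iterations = k)
}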

11 Justification of EM 10/35
The E-step creates a surrogate function (often called the "Q function"), which is the expected value of the log-likelihood function with respect to the conditional distribution of $Y_{mis}$ given $Y_{obs}$, under the current estimate of the parameters $\theta^{(k)}$. The M-step maximizes this surrogate function.

12 Ascent property of EM 11/35
Theorem: At each iteration of the EM algorithm,
$$\log g(Y_{obs} \mid \theta^{(k+1)}) \ge \log g(Y_{obs} \mid \theta^{(k)}),$$
and the equality holds if and only if $\theta^{(k+1)} = \theta^{(k)}$.
Proof: The definition of $\theta^{(k+1)}$ gives
$$E\{\log f(Y_{obs}, Y_{mis} \mid \theta^{(k+1)}) \mid Y_{obs}, \theta^{(k)}\} \ge E\{\log f(Y_{obs}, Y_{mis} \mid \theta^{(k)}) \mid Y_{obs}, \theta^{(k)}\},$$
which, writing $f(Y_{obs}, Y_{mis} \mid \theta) = c(Y_{mis} \mid Y_{obs}, \theta)\, g(Y_{obs} \mid \theta)$, can be expanded to
$$E\{\log c(Y_{mis} \mid Y_{obs}, \theta^{(k+1)}) \mid Y_{obs}, \theta^{(k)}\} + \log g(Y_{obs} \mid \theta^{(k+1)}) \ge E\{\log c(Y_{mis} \mid Y_{obs}, \theta^{(k)}) \mid Y_{obs}, \theta^{(k)}\} + \log g(Y_{obs} \mid \theta^{(k)}). \quad (1)$$
By the non-negativity of the Kullback-Leibler information, i.e., $\int \log\frac{p(x)}{q(x)}\, p(x)\, dx \ge 0$ for densities $p(x)$, $q(x)$, we have
$$\int \log\frac{c(y_{mis} \mid Y_{obs}, \theta^{(k)})}{c(y_{mis} \mid Y_{obs}, \theta^{(k+1)})}\, c(y_{mis} \mid Y_{obs}, \theta^{(k)})\, dy_{mis} = E\left[\log\frac{c(Y_{mis} \mid Y_{obs}, \theta^{(k)})}{c(Y_{mis} \mid Y_{obs}, \theta^{(k+1)})} \;\Big|\; Y_{obs}, \theta^{(k)}\right] \ge 0. \quad (2)$$

13 Ascent property of EM (continued) 12/35
Combining (1) and (2) yields $\log g(Y_{obs} \mid \theta^{(k+1)}) \ge \log g(Y_{obs} \mid \theta^{(k)})$, which proves the inequality part of the theorem.
If the equality holds, i.e.,
$$\log g(Y_{obs} \mid \theta^{(k+1)}) = \log g(Y_{obs} \mid \theta^{(k)}), \quad (3)$$
then by (1) and (2),
$$E\{\log c(Y_{mis} \mid Y_{obs}, \theta^{(k+1)}) \mid Y_{obs}, \theta^{(k)}\} = E\{\log c(Y_{mis} \mid Y_{obs}, \theta^{(k)}) \mid Y_{obs}, \theta^{(k)}\}.$$
The Kullback-Leibler information is zero if and only if
$$\log c(Y_{mis} \mid Y_{obs}, \theta^{(k+1)}) = \log c(Y_{mis} \mid Y_{obs}, \theta^{(k)}). \quad (4)$$
Combining (3) and (4), we have $\log f(Y \mid \theta^{(k+1)}) = \log f(Y \mid \theta^{(k)})$. The uniqueness of $\theta$ leads to $\theta^{(k+1)} = \theta^{(k)}$.

14 Example 1: Grouped Multinomial Data 13/35
Suppose $Y = (y_1, y_2, y_3, y_4)$ has a multinomial distribution with cell probabilities
$$\left(\tfrac{1}{2} + \tfrac{\theta}{4},\; \tfrac{1-\theta}{4},\; \tfrac{1-\theta}{4},\; \tfrac{\theta}{4}\right).$$
Then the likelihood for $Y$ is
$$L(\theta \mid Y) \propto \frac{(y_1 + y_2 + y_3 + y_4)!}{y_1!\, y_2!\, y_3!\, y_4!} \left(\tfrac{1}{2} + \tfrac{\theta}{4}\right)^{y_1} \left(\tfrac{1-\theta}{4}\right)^{y_2} \left(\tfrac{1-\theta}{4}\right)^{y_3} \left(\tfrac{\theta}{4}\right)^{y_4}.$$
If we use Newton-Raphson to maximize the log-likelihood $l(\theta \mid Y)$ directly, we need
$$\dot{l}(\theta \mid Y) = \frac{y_1/4}{1/2 + \theta/4} - \frac{y_2 + y_3}{1-\theta} + \frac{y_4}{\theta}, \qquad \ddot{l}(\theta \mid Y) = -\frac{y_1}{(2+\theta)^2} - \frac{y_2 + y_3}{(1-\theta)^2} - \frac{y_4}{\theta^2}.$$
The probability of the first cell is the trouble-maker! How can we avoid it?
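
For reference, a small R sketch of the Newton-Raphson update built from these derivatives, using the data $Y = (125, 18, 20, 34)$ introduced a few slides later:

Y <- c(125, 18, 20, 34)
theta <- 0.5
for (k in 1:20) {
  score <- (Y[1]/4) / (1/2 + theta/4) - (Y[2] + Y[3]) / (1 - theta) + Y[4] / theta
  hess  <- -Y[1] / (2 + theta)^2 - (Y[2] + Y[3]) / (1 - theta)^2 - Y[4] / theta^2
  theta <- theta - score / hess          # Newton-Raphson step
}
theta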

15 Example 1: Grouped Multinomial Data (continued) 14/35
Suppose $Y = (y_1, y_2, y_3, y_4)$ has a multinomial distribution with cell probabilities
$$\left(\tfrac{1}{2} + \tfrac{\theta}{4},\; \tfrac{1-\theta}{4},\; \tfrac{1-\theta}{4},\; \tfrac{\theta}{4}\right).$$
Define the complete data $X = (x_0, x_1, y_2, y_3, y_4)$ to have a multinomial distribution with cell probabilities
$$\left(\tfrac{1}{2},\; \tfrac{\theta}{4},\; \tfrac{1-\theta}{4},\; \tfrac{1-\theta}{4},\; \tfrac{\theta}{4}\right)$$
and to satisfy $x_0 + x_1 = y_1$.
Observed-data log-likelihood:
$$l(\theta \mid Y) \propto y_1 \log\left(\tfrac{1}{2} + \tfrac{\theta}{4}\right) + (y_2 + y_3) \log(1-\theta) + y_4 \log\theta$$
Complete-data log-likelihood:
$$l_C(\theta \mid X) \propto (x_1 + y_4) \log\theta + (y_2 + y_3) \log(1-\theta)$$

16 Example 1: Grouped Multinomial Data (continued) 15/35
E-step: evaluate
$$x_1^{(k+1)} = E(x_1 \mid Y, \theta^{(k)}) = y_1 \frac{\theta^{(k)}/4}{1/2 + \theta^{(k)}/4}$$
M-step: maximize the complete-data log-likelihood with $x_1$ replaced by $x_1^{(k+1)}$:
$$\theta^{(k+1)} = \frac{x_1^{(k+1)} + y_4}{x_1^{(k+1)} + y_4 + y_2 + y_3}$$

17 Example 1: Grouped Multinomial Data (continued) 16/35
We observe $Y = (125, 18, 20, 34)$ and start EM with $\theta^{(0)} = 0.5$.
For each iteration $k$, the iteration table tracks the parameter update $\theta^{(k)}$, the convergence to $\hat\theta$ measured by $\theta^{(k)} - \hat\theta$, and the convergence rate $(\theta^{(k)} - \hat\theta)/(\theta^{(k-1)} - \hat\theta)$, stopping once $\theta^{(k)}$ equals $\hat\theta$ to the displayed precision.
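
A minimal R sketch of this EM iteration; running it prints the sequence $\theta^{(1)}, \theta^{(2)}, \ldots$ tracked in the table.

Y <- c(125, 18, 20, 34)
theta <- 0.5
repeat {
  x1 <- Y[1] * (theta / 4) / (1/2 + theta / 4)            # E-step: E(x_1 | Y, theta)
  theta.new <- (x1 + Y[4]) / (x1 + Y[4] + Y[2] + Y[3])    # M-step
  cat(sprintf("theta = %.7f\n", theta.new))
  if (abs(theta.new - theta) < 1e-8) break
  theta <- theta.new
}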

18 Example 2: Normal mixtures 17/35
Consider a $J$-group normal mixture, where $x_1, \ldots, x_n \sim \sum_{j=1}^J p_j \phi(x \mid \mu_j, \sigma_j)$ and $\phi(\cdot \mid \mu, \sigma)$ is the normal density. This is the clustering/finite-mixture problem for which EM is typically used.
Define indicator variables for observation $i$: $(y_{i1}, y_{i2}, \ldots, y_{iJ})$ follows a multinomial distribution (with trial number 1) and cell probabilities $p = (p_1, p_2, \ldots, p_J)$. Clearly, $\sum_j y_{ij} = 1$.
Given $y_{ij} = 1$ and $y_{ij'} = 0$ for $j' \ne j$, we assume $x_i \sim N(\mu_j, \sigma_j^2)$. You can check that, marginally, $x_i \sim \sum_{j=1}^J p_j \phi(x_i \mid \mu_j, \sigma_j)$.
Here, $\{x_i\}_i$ is the observed data; $\{x_i, y_{i1}, \ldots, y_{iJ}\}_i$ is the complete data.
Observed-data log-likelihood:
$$l(\mu, \sigma, p \mid x) \propto \sum_i \log \sum_{j=1}^J p_j \phi(x_i \mid \mu_j, \sigma_j)$$
Complete-data log-likelihood:
$$l_C(\mu, \sigma, p \mid x, y) \propto \sum_i \sum_j y_{ij} \left\{ \log p_j + \log \phi(x_i \mid \mu_j, \sigma_j) \right\}$$

19 Example 2: Normal mixtures (continued) 18/35
Complete-data log-likelihood:
$$l_C(\mu, \sigma, p \mid x, y) \propto \sum_i \sum_j y_{ij} \left\{ \log p_j - (x_i - \mu_j)^2/(2\sigma_j^2) - \log\sigma_j \right\}$$
E-step: evaluate, for $i = 1, \ldots, n$ and $j = 1, \ldots, J$,
$$\omega_{ij}^{(k)} \equiv E(y_{ij} \mid x_i, \mu^{(k)}, \sigma^{(k)}, p^{(k)}) = \Pr(y_{ij} = 1 \mid x_i, \mu^{(k)}, \sigma^{(k)}, p^{(k)}) = \frac{p_j^{(k)} \phi(x_i \mid \mu_j^{(k)}, \sigma_j^{(k)})}{\sum_{j'} p_{j'}^{(k)} \phi(x_i \mid \mu_{j'}^{(k)}, \sigma_{j'}^{(k)})}$$

20 Example 2: Normal mixtures (continued) 19/35
M-step: maximize the complete-data log-likelihood with $y_{ij}$ replaced by $\omega_{ij}^{(k)}$:
$$p_j^{(k+1)} = n^{-1} \sum_i \omega_{ij}^{(k)}, \qquad \mu_j^{(k+1)} = \frac{\sum_i \omega_{ij}^{(k)} x_i}{\sum_i \omega_{ij}^{(k)}}, \qquad \sigma_j^{(k+1)} = \sqrt{\frac{\sum_i \omega_{ij}^{(k)} \big(x_i - \mu_j^{(k+1)}\big)^2}{\sum_i \omega_{ij}^{(k)}}}$$
Practice: when all groups share the same variance ($\sigma^2$), what is the M-step update for $\sigma^2$?

21 Example 2: Normal mixtures in R 20/35
### two-component EM
### p*N(0,1) + (1-p)*N(4,1)
EM_TwoMixtureNormal = function(p, mu1, mu2, sd1, sd2, X, maxiter=1000, tol=1e-5) {
  diff=1
  iter=0
  while (diff>tol & iter<maxiter) {
    ## E-step: compute omega, the posterior probability of group 1
    d1=dnorm(X, mean=mu1, sd=sd1)   # density in group 1
    d2=dnorm(X, mean=mu2, sd=sd2)   # density in group 2
    omega=d1*p/(d1*p+d2*(1-p))
    ## M-step: update p, mu and sd
    p.new=mean(omega)
    mu1.new=sum(X*omega) / sum(omega)
    mu2.new=sum(X*(1-omega)) / sum(1-omega)
    resid1=X-mu1
    resid2=X-mu2

22 Example 2: Normal mixtures in R (continued) 21/35
    sd1.new=sqrt(sum(resid1^2*omega) / sum(omega))
    sd2.new=sqrt(sum(resid2^2*(1-omega)) / sum(1-omega))
    ## calculate diff to check convergence
    diff=sqrt(sum((mu1.new-mu1)^2+(mu2.new-mu2)^2
                  +(sd1.new-sd1)^2+(sd2.new-sd2)^2))
    p=p.new; mu1=mu1.new; mu2=mu2.new;
    sd1=sd1.new; sd2=sd2.new;
    iter=iter+1;
    cat("Iter", iter, ": mu1=", mu1.new, ", mu2=", mu2.new,
        ", sd1=", sd1.new, ", sd2=", sd2.new,
        ", p=", p.new, ", diff=", diff, "\n")
  }
}

23 Example 2: Normal mixtures in R (continued) 22/35
> ## simulation
> p0=0.3
> n=5000
> X1=rnorm(n*p0)              # n*p0 individuals from N(0,1)
> X2=rnorm(n*(1-p0), mean=4)  # n*(1-p0) individuals from N(4,1)
> X=c(X1,X2)                  # observed data
> hist(X, 50)
[Histogram of X (frequency vs. X) omitted.]

24 Example 2: Normal mixtures in R (continued) 23/35
> ## initial values for EM
> p=0.5
> mu1=quantile(X, 0.1)
> mu2=quantile(X, 0.9)
> sd1=sd2=sd(X)
> c(p, mu1, mu2, sd1, sd2)
> EM_TwoMixtureNormal(p, mu1, mu2, sd1, sd2, X)
Iter 1: mu1=0.8697, mu2=4.0109, sd1=2.1342, sd2=1.5508, p=0.3916, diff=
Iter 2: mu1=0.9877, mu2=3.9000, sd1=1.8949, sd2=1.2262, p=0.3843, diff=
Iter 3: mu1=0.8353, mu2=4.0047, sd1=1.7812, sd2=1.0749, p=0.3862, diff=
Iter 4: mu1=0.7203, mu2=4.0716, sd1=1.6474, sd2=0.9899, p=0.3852, diff=
...
Iter 44: mu1=       , mu2=3.9515, sd1=0.9885, sd2=1.0316, p=0.2959, diff=1.9e-05
Iter 45: mu1=       , mu2=3.9515, sd1=0.9885, sd2=1.0316, p=0.2959, diff=1.4e-05
Iter 46: mu1=       , mu2=3.9515, sd1=0.9885, sd2=1.0316, p=0.2959, diff=1.1e-05
Iter 47: mu1=       , mu2=3.9515, sd1=0.9885, sd2=1.0316, p=0.2959, diff=8.7e-06

25 In-class practice: Poisson mixture 24/35
Using the same notation as in the normal mixture model, now assume the data come from a mixture of Poisson distributions. Consider $x_1, \ldots, x_n \sim \sum_{j=1}^J p_j \phi(x \mid \lambda_j)$, where $\phi(\cdot \mid \lambda)$ is the Poisson density.
Again use $y_{ij}$ to indicate group assignments: $(y_{i1}, y_{i2}, \ldots, y_{iJ})$ follows a multinomial distribution with cell probabilities $p = (p_1, p_2, \ldots, p_J)$.
Now the observed-data log-likelihood is
$$l(\lambda, p \mid x) \propto \sum_i \log \sum_{j=1}^J p_j \phi(x_i \mid \lambda_j)$$
and the complete-data log-likelihood is (dropping the $x_i!$ terms)
$$l_C(\lambda, p \mid x, y) \propto \sum_i \sum_j y_{ij} \left\{ \log p_j + x_i \log\lambda_j - \lambda_j \right\}.$$
Derive the EM iterations!
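
One possible R implementation of the resulting iterations for $J = 2$ (a sketch of the practice solution; the function name is illustrative). The M-step gives weighted proportions $p_j$ and weighted means $\lambda_j = \sum_i \omega_{ij} x_i / \sum_i \omega_{ij}$.

EM_TwoMixturePoisson <- function(p, lam1, lam2, X, maxiter = 1000, tol = 1e-8) {
  for (iter in 1:maxiter) {
    ## E-step: posterior probability of group 1 for each observation
    d1 <- dpois(X, lam1); d2 <- dpois(X, lam2)
    omega <- d1 * p / (d1 * p + d2 * (1 - p))
    ## M-step: weighted proportion and weighted means
    p.new    <- mean(omega)
    lam1.new <- sum(omega * X) / sum(omega)
    lam2.new <- sum((1 - omega) * X) / sum(1 - omega)
    diff <- max(abs(c(p.new - p, lam1.new - lam1, lam2.new - lam2)))
    p <- p.new; lam1 <- lam1.new; lam2 <- lam2.new
    if (diff < tol) break
  }
  c(p = p, lambda1 = lam1, lambda2 = lam2)
}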

26 Example 3: Mixed-effects model 25/35
Consider a longitudinal dataset of $i = 1, \ldots, N$ subjects, each with $n_i$ measurements of the outcome. The linear mixed-effects model is
$$Y_i = X_i\beta + Z_i b_i + \epsilon_i, \qquad b_i \sim N_q(0, D), \qquad \epsilon_i \sim N_{n_i}(0, \sigma^2 I_{n_i}), \qquad b_i, \epsilon_i \text{ independent.}$$
Observed-data log-likelihood:
$$l(\beta, D, \sigma^2 \mid Y_1, \ldots, Y_N) \propto \sum_i \left\{ -\tfrac{1}{2}(Y_i - X_i\beta)' \Sigma_i^{-1} (Y_i - X_i\beta) - \tfrac{1}{2}\log|\Sigma_i| \right\},$$
where $\Sigma_i = Z_i D Z_i' + \sigma^2 I_{n_i}$.
In fact, this likelihood can be maximized directly over $(\beta, D, \sigma^2)$ using Newton-Raphson or Fisher scoring. Given $(D, \sigma^2)$, and hence $\Sigma_i$, we obtain the $\beta$ that maximizes the likelihood by solving
$$\frac{\partial l(\beta, D, \sigma^2 \mid Y_1, \ldots, Y_N)}{\partial\beta} = \sum_i X_i' \Sigma_i^{-1} (Y_i - X_i\beta) = 0,$$
which implies
$$\beta = \left( \sum_{i=1}^N X_i' \Sigma_i^{-1} X_i \right)^{-1} \sum_{i=1}^N X_i' \Sigma_i^{-1} Y_i.$$

27 Example 3: Mixed-effects model (continued) 26/35
Complete-data log-likelihood: note the equivalence of $(\epsilon_i, b_i)$ and $(Y_i, b_i)$, and the fact that
$$\begin{pmatrix} b_i \\ \epsilon_i \end{pmatrix} \sim N\left\{ \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} D & 0 \\ 0 & \sigma^2 I_{n_i} \end{pmatrix} \right\},$$
so
$$l_C(\beta, D, \sigma^2 \mid \epsilon_1, \ldots, \epsilon_N, b_1, \ldots, b_N) \propto \sum_i \left\{ -\tfrac{1}{2} b_i' D^{-1} b_i - \tfrac{1}{2}\log|D| - \tfrac{1}{2\sigma^2}\epsilon_i'\epsilon_i - \tfrac{n_i}{2}\log\sigma^2 \right\}.$$
The parameters maximizing the complete-data log-likelihood are, each conditional on the other parameters,
$$D = N^{-1}\sum_{i=1}^N b_i b_i', \qquad \sigma^2 = \left(\sum_{i=1}^N n_i\right)^{-1} \sum_{i=1}^N \epsilon_i'\epsilon_i, \qquad \beta = \left(\sum_{i=1}^N X_i' X_i\right)^{-1} \sum_{i=1}^N X_i'(Y_i - Z_i b_i).$$

28 Example 3: Mixed-effects model (continued) 27/35
E-step: evaluate
$$E(b_i b_i' \mid Y_i, \beta^{(k)}, D^{(k)}, \sigma^{2(k)}), \qquad E(\epsilon_i \epsilon_i' \mid Y_i, \beta^{(k)}, D^{(k)}, \sigma^{2(k)}), \qquad E(b_i \mid Y_i, \beta^{(k)}, D^{(k)}, \sigma^{2(k)}).$$
We use the relationship $E(b_i b_i' \mid Y_i) = E(b_i \mid Y_i)\, E(b_i \mid Y_i)' + \mathrm{Var}(b_i \mid Y_i)$, so we need to calculate $E(b_i \mid Y_i)$ and $\mathrm{Var}(b_i \mid Y_i)$.
Recall the conditional distribution for multivariate normal variables:
$$\begin{pmatrix} Y_i \\ b_i \end{pmatrix} \sim N\left\{ \begin{pmatrix} X_i\beta \\ 0 \end{pmatrix}, \begin{pmatrix} Z_i D Z_i' + \sigma^2 I_{n_i} & Z_i D \\ D Z_i' & D \end{pmatrix} \right\}.$$
Let $\Sigma_i = Z_i D Z_i' + \sigma^2 I_{n_i}$. We know that
$$E(b_i \mid Y_i) = 0 + D Z_i' \Sigma_i^{-1} (Y_i - X_i\beta), \qquad \mathrm{Var}(b_i \mid Y_i) = D - D Z_i' \Sigma_i^{-1} Z_i D.$$

29 Example 3: Mixed-effects model (continued) 28/35
Similarly, for $\epsilon_i$ we use the relationship $E(\epsilon_i \epsilon_i' \mid Y_i) = E(\epsilon_i \mid Y_i)\, E(\epsilon_i \mid Y_i)' + \mathrm{Var}(\epsilon_i \mid Y_i)$, together with
$$\begin{pmatrix} Y_i \\ \epsilon_i \end{pmatrix} \sim N\left\{ \begin{pmatrix} X_i\beta \\ 0 \end{pmatrix}, \begin{pmatrix} Z_i D Z_i' + \sigma^2 I_{n_i} & \sigma^2 I_{n_i} \\ \sigma^2 I_{n_i} & \sigma^2 I_{n_i} \end{pmatrix} \right\}.$$
Let $\Sigma_i = Z_i D Z_i' + \sigma^2 I_{n_i}$. Then we have
$$E(\epsilon_i \mid Y_i) = 0 + \sigma^2 \Sigma_i^{-1} (Y_i - X_i\beta), \qquad \mathrm{Var}(\epsilon_i \mid Y_i) = \sigma^2 I_{n_i} - \sigma^4 \Sigma_i^{-1}.$$
M-step:
$$D^{(k+1)} = N^{-1} \sum_{i=1}^N E(b_i b_i' \mid Y_i, \beta^{(k)}, D^{(k)}, \sigma^{2(k)})$$
$$\sigma^{2(k+1)} = \left(\sum_{i=1}^N n_i\right)^{-1} \sum_{i=1}^N E(\epsilon_i'\epsilon_i \mid Y_i, \beta^{(k)}, D^{(k)}, \sigma^{2(k)})$$
$$\beta^{(k+1)} = \left(\sum_{i=1}^N X_i' X_i\right)^{-1} \sum_{i=1}^N X_i'\, E(Y_i - Z_i b_i \mid Y_i, \beta^{(k)}, D^{(k)}, \sigma^{2(k)})$$
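
An R sketch of these E- and M-steps. The inputs Ylist, Xlist, Zlist (lists of per-subject $Y_i$, $X_i$, $Z_i$) and the function name em_lmm are illustrative, not from the slides; note that $E(\epsilon_i'\epsilon_i \mid Y_i) = E(\epsilon_i \mid Y_i)'E(\epsilon_i \mid Y_i) + \mathrm{tr}\{\mathrm{Var}(\epsilon_i \mid Y_i)\}$.

em_lmm <- function(Ylist, Xlist, Zlist, beta, D, sigma2, maxiter = 200, tol = 1e-6) {
  N <- length(Ylist)
  for (k in 1:maxiter) {
    Dnew <- matrix(0, nrow(D), ncol(D)); ssres <- 0; ntot <- 0
    XtX <- 0; Xtr <- 0
    for (i in 1:N) {
      Yi <- Ylist[[i]]; Xi <- Xlist[[i]]; Zi <- Zlist[[i]]; ni <- length(Yi)
      Sigma.inv <- solve(Zi %*% D %*% t(Zi) + sigma2 * diag(ni))
      ri  <- Yi - Xi %*% beta
      bi  <- D %*% t(Zi) %*% Sigma.inv %*% ri                # E(b_i | Y_i)
      Vbi <- D - D %*% t(Zi) %*% Sigma.inv %*% Zi %*% D      # Var(b_i | Y_i)
      ei  <- sigma2 * Sigma.inv %*% ri                       # E(eps_i | Y_i)
      Dnew  <- Dnew + bi %*% t(bi) + Vbi                     # accumulate E(b_i b_i' | Y_i)
      ssres <- ssres + sum(ei^2) + sigma2 * ni - sigma2^2 * sum(diag(Sigma.inv))
      ntot  <- ntot + ni
      XtX <- XtX + t(Xi) %*% Xi
      Xtr <- Xtr + t(Xi) %*% (Yi - Zi %*% bi)
    }
    Dnew <- Dnew / N
    sigma2.new <- ssres / ntot
    beta.new   <- solve(XtX, Xtr)
    chg <- max(abs(beta.new - beta), abs(Dnew - D), abs(sigma2.new - sigma2))
    beta <- beta.new; D <- Dnew; sigma2 <- sigma2.new
    if (chg < tol) break
  }
  list(beta = beta, D = D, sigma2 = sigma2, iterations = k)
}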

30 Issues 29/35
1. Stopping rules
- $|l(\theta^{(k+1)}) - l(\theta^{(k)})| < \epsilon$ for $m$ consecutive steps, where $l(\theta)$ is the observed-data log-likelihood. This is bad! $l(\theta)$ may not change much even when $\theta$ does.
- $\|\theta^{(k+1)} - \theta^{(k)}\| < \epsilon$ for $m$ consecutive steps. This can run into problems when the components of $\theta$ are of very different magnitudes.
- $|\theta_j^{(k+1)} - \theta_j^{(k)}| < \epsilon_1 (|\theta_j^{(k)}| + \epsilon_2)$ for $j = 1, \ldots, p$ (see the sketch below). In practice, take $\epsilon_1 = 10^{-8}$ and $\epsilon_2 = 10\epsilon_1$ to $100\epsilon_1$.
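
A small R helper implementing the last (relative) criterion, as a sketch:

converged <- function(theta.new, theta, eps1 = 1e-8, eps2 = 100 * 1e-8) {
  ## TRUE when every component satisfies |theta.new[j] - theta[j]| < eps1 * (|theta[j]| + eps2)
  all(abs(theta.new - theta) < eps1 * (abs(theta) + eps2))
}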

31 Issues (continued) 30/35
2. Local vs. global maximum
- There may be multiple modes.
- EM may converge to a saddle point.
- Solution: multiple starting points.
3. Starting points
- Use information from the context.
- Use a crude method (such as the method of moments).
- Use an alternative model formulation.
4. Slow convergence
- EM can be painfully slow to converge near the maximum.
- Solution: switch to another optimization algorithm when you get near the maximum.
5. Standard errors
- Numerical approximation of the Hessian matrix.
- Louis (1982), Meng and Rubin (1991).

32 Numerical approximation of the Hessian matrix 31/35
Note: $l(\theta)$ = observed-data log-likelihood.
We estimate the gradient using
$$\{\nabla l(\theta)\}_i = \frac{\partial l(\theta)}{\partial\theta_i} \approx \frac{l(\theta + \delta_i e_i) - l(\theta - \delta_i e_i)}{2\delta_i},$$
where $e_i$ is a unit vector with 1 for the $i$th element and 0 otherwise. In calculating derivatives using this formula, I generally start with some medium-sized $\delta$ and then repeatedly halve it until the estimated derivative stabilizes.
We can estimate the Hessian by applying the above formula twice:
$$\{\nabla^2 l(\theta)\}_{ij} \approx \frac{l(\theta + \delta_i e_i + \delta_j e_j) - l(\theta + \delta_i e_i - \delta_j e_j) - l(\theta - \delta_i e_i + \delta_j e_j) + l(\theta - \delta_i e_i - \delta_j e_j)}{4\delta_i\delta_j}$$
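
A direct R translation (a sketch): a single common step size delta is used here, whereas the formulas allow separate $\delta_i$ and $\delta_j$; `loglik` stands for any user-supplied function returning the observed-data log-likelihood at a parameter vector.

num_grad <- function(loglik, theta, delta = 1e-4) {
  p <- length(theta); g <- numeric(p)
  for (i in 1:p) {
    e <- replace(numeric(p), i, delta)
    g[i] <- (loglik(theta + e) - loglik(theta - e)) / (2 * delta)   # central difference
  }
  g
}
num_hess <- function(loglik, theta, delta = 1e-4) {
  p <- length(theta); H <- matrix(0, p, p)
  for (i in 1:p) for (j in 1:p) {
    ei <- replace(numeric(p), i, delta)
    ej <- replace(numeric(p), j, delta)
    H[i, j] <- (loglik(theta + ei + ej) - loglik(theta + ei - ej) -
                loglik(theta - ei + ej) + loglik(theta - ei - ej)) / (4 * delta^2)
  }
  H
}
## Standard errors: sqrt(diag(solve(-num_hess(loglik, theta.hat))))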

33 Louis estimator (1982) 32/35
$$l_C(\theta \mid Y_{obs}, Y_{mis}) \equiv \log f(Y_{obs}, Y_{mis} \mid \theta), \qquad l_O(\theta \mid Y_{obs}) \equiv \log\left\{ \int f(Y_{obs}, y_{mis} \mid \theta)\, dy_{mis} \right\}$$
Let $\dot{l}_C(\theta \mid Y_{obs}, Y_{mis})$, $\dot{l}_O(\theta \mid Y_{obs})$ denote the gradients of $l_C$, $l_O$, and $\ddot{l}_C(\theta \mid Y_{obs}, Y_{mis})$, $\ddot{l}_O(\theta \mid Y_{obs})$ their second derivatives. We can prove that
$$\dot{l}_O(\theta \mid Y_{obs}) = E\{ \dot{l}_C(\theta \mid Y_{obs}, Y_{mis}) \mid Y_{obs} \} \quad (5)$$
$$\ddot{l}_O(\theta \mid Y_{obs}) = E\{ \ddot{l}_C(\theta \mid Y_{obs}, Y_{mis}) \mid Y_{obs} \} + E\{ [\dot{l}_C(\theta \mid Y_{obs}, Y_{mis})]^2 \mid Y_{obs} \} - [\dot{l}_O(\theta \mid Y_{obs})]^2 \quad (6)$$
MLE: $\hat\theta = \arg\max_\theta l_O(\theta \mid Y_{obs})$.
Louis variance estimator: $\{-\ddot{l}_O(\theta \mid Y_{obs})\}^{-1}$ evaluated at $\theta = \hat\theta$.
Note: all of the conditional expectations can be computed in the EM algorithm using only $\dot{l}_C$ and $\ddot{l}_C$, the first and second derivatives of the complete-data log-likelihood. The Louis estimator should be evaluated at the last step of EM.
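
A worked R sketch of the Louis estimator for Example 1 (grouped multinomial): there $l_C = (x_1 + y_4)\log\theta + (y_2 + y_3)\log(1-\theta)$ and $x_1 \mid Y, \theta \sim \mathrm{Binomial}\big(y_1, (\theta/4)/(1/2 + \theta/4)\big)$; at the MLE, $E\{\dot{l}_C \mid Y_{obs}\} = \dot{l}_O = 0$ by (5), so $E\{[\dot{l}_C]^2 \mid Y_{obs}\}$ reduces to $\mathrm{Var}(x_1 \mid Y)/\theta^2$.

Y <- c(125, 18, 20, 34)
theta <- 0.5
for (k in 1:100) {                                    # run EM to (near) convergence
  x1 <- Y[1] * (theta / 4) / (1/2 + theta / 4)        # E-step
  theta <- (x1 + Y[4]) / (x1 + Y[4] + Y[2] + Y[3])    # M-step
}
p1  <- (theta / 4) / (1/2 + theta / 4)
Ex1 <- Y[1] * p1                                      # E(x1 | Y, theta.hat)
Vx1 <- Y[1] * p1 * (1 - p1)                           # Var(x1 | Y, theta.hat)
info.complete <- (Ex1 + Y[4]) / theta^2 + (Y[2] + Y[3]) / (1 - theta)^2   # -E[ddot l_C | Y]
info.observed <- info.complete - Vx1 / theta^2        # Louis: subtract Var(dot l_C | Y)
c(theta.hat = theta, se = 1 / sqrt(info.observed))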

34 Proof of (5) 33/35
Proof: By the definition of $l_O(\theta \mid Y_{obs})$, and writing $f'$ for $\partial f/\partial\theta$,
$$\dot{l}_O(\theta \mid Y_{obs}) = \frac{\partial}{\partial\theta} \log\left\{ \int f(Y_{obs}, y_{mis} \mid \theta)\, dy_{mis} \right\} = \frac{\partial\left\{\int f(Y_{obs}, y_{mis} \mid \theta)\, dy_{mis}\right\}/\partial\theta}{\int f(Y_{obs}, y_{mis} \mid \theta)\, dy_{mis}} = \frac{\int f'(Y_{obs}, y_{mis} \mid \theta)\, dy_{mis}}{\int f(Y_{obs}, y_{mis} \mid \theta)\, dy_{mis}}. \quad (7)$$
Multiplying and dividing the integrand of the numerator by $f(Y_{obs}, y_{mis} \mid \theta)$ gives (5):
$$\dot{l}_O(\theta \mid Y_{obs}) = \frac{\int \frac{f'(Y_{obs}, y_{mis} \mid \theta)}{f(Y_{obs}, y_{mis} \mid \theta)}\, f(Y_{obs}, y_{mis} \mid \theta)\, dy_{mis}}{\int f(Y_{obs}, y_{mis} \mid \theta)\, dy_{mis}} = \frac{\int \frac{\partial \log f(Y_{obs}, y_{mis} \mid \theta)}{\partial\theta}\, f(Y_{obs}, y_{mis} \mid \theta)\, dy_{mis}}{\int f(Y_{obs}, y_{mis} \mid \theta)\, dy_{mis}} = E\{ \dot{l}_C(\theta \mid Y_{obs}, Y_{mis}) \mid Y_{obs} \}.$$

35 Proof of (6) 34/35
Proof: We take an additional derivative of $\dot{l}_O(\theta \mid Y_{obs})$ in expression (7) to obtain
$$\ddot{l}_O(\theta \mid Y_{obs}) = \frac{\int f''(Y_{obs}, y_{mis} \mid \theta)\, dy_{mis}}{\int f(Y_{obs}, y_{mis} \mid \theta)\, dy_{mis}} - \left[ \frac{\int f'(Y_{obs}, y_{mis} \mid \theta)\, dy_{mis}}{\int f(Y_{obs}, y_{mis} \mid \theta)\, dy_{mis}} \right]^2 = \frac{\int f''(Y_{obs}, y_{mis} \mid \theta)\, dy_{mis}}{\int f(Y_{obs}, y_{mis} \mid \theta)\, dy_{mis}} - \left[ \dot{l}_O(\theta \mid Y_{obs}) \right]^2.$$
To see how the first term breaks down, we take an additional derivative of
$$\int f'(Y_{obs}, y_{mis} \mid \theta)\, dy_{mis} = \int \frac{\partial \log f(Y_{obs}, y_{mis} \mid \theta)}{\partial\theta}\, f(Y_{obs}, y_{mis} \mid \theta)\, dy_{mis}$$
to obtain
$$\int f''(Y_{obs}, y_{mis} \mid \theta)\, dy_{mis} = \int \frac{\partial^2 \log f(Y_{obs}, y_{mis} \mid \theta)}{\partial\theta\,\partial\theta}\, f(Y_{obs}, y_{mis} \mid \theta)\, dy_{mis} + \int \left[ \frac{\partial \log f(Y_{obs}, y_{mis} \mid \theta)}{\partial\theta} \right]^2 f(Y_{obs}, y_{mis} \mid \theta)\, dy_{mis}.$$
Thus the first term equals
$$E\{ \ddot{l}_C(\theta \mid Y_{obs}, Y_{mis}) \mid Y_{obs} \} + E\{ [\dot{l}_C(\theta \mid Y_{obs}, Y_{mis})]^2 \mid Y_{obs} \},$$
which gives (6).

36 Convergence rate of EM 35/35
Let $I_C(\theta)$ and $I_O(\theta)$ denote the complete information and the observed information, respectively. One can show that, when EM converges, the linear convergence rate, defined as $(\theta^{(k+1)} - \hat\theta)/(\theta^{(k)} - \hat\theta)$, is approximately $1 - I_O(\hat\theta)/I_C(\hat\theta)$ (more on this later).
This means that when the fraction of missing information is small, EM converges quickly; otherwise, EM converges slowly.
