36-711 Fall 2003: Maximum Likelihood II
Brian Junker
November 18, 2003

Slide 1: Outline
- Newton's Method and Scoring for MLE's
- Aside on WLS/GLS
- Application to Exponential Families
- Application to Generalized Linear Models
- Application to Nonlinear Least Squares
- Application to Robust Regression

Slide 2: Newton's Method and Scoring for MLE's

When carrying out Newton-Raphson to maximize $\ell_n(\theta)$, the natural iterates are
$$\hat\theta_n^{(j+1)} = \hat\theta_n^{(j)} - \left[\nabla^2 \ell_n(\hat\theta_n^{(j)})\right]^{-1} \nabla \ell_n(\hat\theta_n^{(j)}).$$
Sometimes the expected (Fisher) information $I_n(\theta) = E[-\nabla^2 \ell_n(\theta)]$ has a simpler form than the observed information, in which case one may use
$$\hat\theta_n^{(j+1)} = \hat\theta_n^{(j)} + I_n(\hat\theta_n^{(j)})^{-1}\, \nabla \ell_n(\hat\theta_n^{(j)}).$$
Maximizing using this quasi-Newton method is called Fisher Scoring.
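As a concrete illustration, here is a minimal numpy sketch of both iterations (ours, not from the slides). The function names and the step-norm stopping rule are our own choices; `grad`, `hess`, and `info` stand for callables returning $\nabla\ell_n(\theta)$, $\nabla^2\ell_n(\theta)$, and $I_n(\theta)$ as arrays.

```python
import numpy as np

def newton_raphson(grad, hess, theta0, tol=1e-8, max_iter=100):
    """Newton-Raphson: theta <- theta - hess(theta)^{-1} grad(theta)."""
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    for _ in range(max_iter):
        step = np.linalg.solve(hess(theta), grad(theta))
        theta = theta - step
        if np.linalg.norm(step) < tol:
            break
    return theta

def fisher_scoring(grad, info, theta0, tol=1e-8, max_iter=100):
    """Fisher scoring: theta <- theta + I_n(theta)^{-1} grad(theta).
    Note the sign flip relative to Newton, since I_n = E[-hessian]."""
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    for _ in range(max_iter):
        step = np.linalg.solve(info(theta), grad(theta))
        theta = theta + step
        if np.linalg.norm(step) < tol:
            break
    return theta
```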
Slide 3: Aside on WLS/GLS

Suppose $y = X\beta + \epsilon$, $\epsilon \sim N(0, \Sigma)$. We can convert this to an ordinary least squares problem via
$$\Sigma^{-1/2} y = \Sigma^{-1/2} X \beta + \tilde\epsilon, \qquad \tilde\epsilon \sim N(0, I_{n\times n}),$$
whose solution is
$$\hat\beta = (X^T \Sigma^{-1} X)^{-1} X^T \Sigma^{-1} y.$$
- When $\Sigma$ is diagonal, this is called weighted least squares (WLS).
- When $\Sigma$ is general, this is called generalized least squares (GLS).
We will repeatedly apply this idea in the examples below; a numerical sketch follows Slide 4.

Slide 4: Application to Exponential Families

Let $y = (y_1, \ldots, y_n)^T$ denote iid data to be modelled with an exponential family model. Recall that an exponential family model has the form
$$f(y_i \mid \theta) = g(y_i)\, e^{\beta(\theta) + \gamma(\theta)^T k(y_i)},$$
where $\theta_{p\times 1}$ are the $p$ original parameters, $\gamma(\theta)_{r\times 1}$ are the $r$ natural parameters, and $k(y_i)_{r\times 1}$ are the $r$ sufficient statistics for $\gamma$. Then the likelihood for $y$ is
$$L_n(\theta) = \prod_{i=1}^n f(y_i \mid \theta) = \prod_{i=1}^n g(y_i)\; e^{n\beta(\theta) + \gamma(\theta)^T \sum_{i=1}^n k(y_i)} = G(y)\, e^{B(\theta) + \gamma(\theta)^T K(y)},$$
and the log-likelihood is
$$\ell_n(\theta) = \log G(y) + B(\theta) + \gamma(\theta)^T K(y).$$
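The GLS solution on Slide 3 can be computed by whitening followed by ordinary least squares. A minimal sketch (ours, not the slides'), using a Cholesky factor in place of the symmetric square root $\Sigma^{-1/2}$; any factor $L$ with $LL^T = \Sigma$ yields the same $\hat\beta$:

```python
import numpy as np

def gls(X, y, Sigma):
    """GLS estimate beta_hat = (X^T Sigma^{-1} X)^{-1} X^T Sigma^{-1} y,
    computed by whitening the model and running OLS."""
    # Whiten: with L L^T = Sigma, L^{-1} y = L^{-1} X beta + iid N(0,1) noise.
    L = np.linalg.cholesky(Sigma)
    y_w = np.linalg.solve(L, y)
    X_w = np.linalg.solve(L, X)
    beta_hat, *_ = np.linalg.lstsq(X_w, y_w, rcond=None)
    return beta_hat
```

When $\Sigma$ is diagonal (WLS), the whitening step reduces to dividing each row of $X$ and each $y_i$ by $\sigma_i$.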
Slide 5: Comparing Newton-Raphson with Fisher Scoring

In order to apply Newton-Raphson to find $\hat\theta_n$, we need to compute $\nabla \ell_n(\theta)$ and $\nabla^2 \ell_n(\theta)$. A simple form for $\nabla \ell_n(\theta)$ follows from
$$\nabla \ell_n(\theta) = \nabla B(\theta) + \nabla\gamma(\theta)\, K(y), \qquad 0 = E[\nabla \ell_n(\theta)] = \nabla B(\theta) + \nabla\gamma(\theta)\, \mu(\theta),$$
so that
$$\nabla \ell_n(\theta) = \nabla\gamma(\theta)\,[K(y) - \mu(\theta)],$$
where $\mu(\theta)_{r\times 1} = E_\theta[K(y)_{r\times 1}]$ and $\nabla\gamma(\theta) = [\partial \gamma_j / \partial \theta_i]_{p\times r} = J_\gamma(\theta)^T$.

From the first expression for $\nabla \ell_n(\theta)$ above it is also easy to see that
$$\nabla^2 \ell_n(\theta) = \begin{cases} \nabla^2 B(\theta), & \text{if } \exists\, A_{r\times p} \text{ s.t. } \gamma(\theta) = A\theta, \\ \text{(messy)}, & \text{otherwise.} \end{cases}$$
In the first case, we see that Newton-Raphson and Fisher Scoring are really the same thing:
$$I_n(\theta) = E[-\nabla^2 \ell_n(\theta)] = E[-\nabla^2 B(\theta)] = -\nabla^2 B(\theta).$$

Slide 6: Fisher Scoring when Newton is Ugly

Using the form $\nabla \ell_n(\theta) = \nabla\gamma(\theta)[K(y) - \mu(\theta)]$, we have
$$I_n(\theta) = E_\theta[-\nabla^2 \ell_n(\theta)] = \mathrm{Var}_\theta(\nabla \ell_n(\theta)) = \nabla\gamma(\theta)\, \Sigma(\theta)\, \nabla\gamma(\theta)^T = \nabla\mu(\theta)\, \Sigma(\theta)^{-1}\, \nabla\mu(\theta)^T,$$
where $\Sigma(\theta) = \mathrm{Var}_\theta(K(y))$ and the last equality follows from
$$\begin{aligned}
\nabla\mu(\theta)^T &= \nabla \int K(y)\, L_n(y \mid \theta)\, d\nu(y) = \int K(y)\, [\nabla L_n(y \mid \theta)]^T\, d\nu(y) \\
&= \int K(y)\, [\nabla \ell_n(\theta)]^T L_n(\theta)\, d\nu(y) = \int K(y)\, [K(y) - \mu(\theta)]^T \nabla\gamma(\theta)^T L_n(\theta)\, d\nu(y) \\
&= \int [K(y) - \mu(\theta)]\,[K(y) - \mu(\theta)]^T L_n(\theta)\, d\nu(y)\; \nabla\gamma(\theta)^T = \Sigma(\theta)\, \nabla\gamma(\theta)^T.
\end{aligned}$$
This shows that $I_n(\theta)$ can be expressed in terms of the first and second moments of $K(y)$, which may be simpler than working with $\nabla^2 \ell_n(\theta)$.
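A quick sanity check of these formulas (our example, not from the slides): for iid Poisson($\lambda$) data, $\gamma(\lambda) = \log\lambda$, $k(y_i) = y_i$, so $K(y) = \sum_i y_i$, $\mu(\lambda) = n\lambda$, $\Sigma(\lambda) = n\lambda$, and $\nabla\gamma(\lambda) = 1/\lambda$. Then
$$\nabla \ell_n(\lambda) = \frac{1}{\lambda}\left[K(y) - n\lambda\right], \qquad I_n(\lambda) = \nabla\gamma(\lambda)\, \Sigma(\lambda)\, \nabla\gamma(\lambda)^T = \frac{1}{\lambda}(n\lambda)\frac{1}{\lambda} = \frac{n}{\lambda},$$
which matches the direct computation from $\ell_n(\lambda) = -n\lambda + K(y)\log\lambda + \text{const}$.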
Slide 7: Some details

Fisher scoring, $\hat\theta^{(j+1)} = \hat\theta^{(j)} + I_n(\hat\theta^{(j)})^{-1} \nabla \ell_n(\hat\theta^{(j)})$, may be expressed as
$$\hat\theta^{(j+1)} = \hat\theta^{(j)} + \left\{\nabla\mu(\hat\theta^{(j)})\, \Sigma(\hat\theta^{(j)})^{-1}\, \nabla\mu(\hat\theta^{(j)})^T\right\}^{-1} \nabla\gamma(\hat\theta^{(j)})\,\left[K(y) - \mu(\hat\theta^{(j)})\right]$$
and, after applying our identity $\nabla\mu(\theta)^T = \Sigma(\theta)\, \nabla\gamma(\theta)^T$, we get
$$\hat\theta^{(j+1)} = \hat\theta^{(j)} + \left\{\nabla\mu(\hat\theta^{(j)})\, \Sigma(\hat\theta^{(j)})^{-1}\, \nabla\mu(\hat\theta^{(j)})^T\right\}^{-1} \nabla\mu(\hat\theta^{(j)})\, \Sigma(\hat\theta^{(j)})^{-1}\left[K(y) - \mu(\hat\theta^{(j)})\right],$$
which again uses just $K(y)$ and its first two moments $\mu(\theta)$ and $\Sigma(\theta)$. This suggests the following iteratively reweighted least squares (IRLS) algorithm:
- Compute the WLS/GLS solution $\hat\beta$ for $\tilde y = \tilde X \beta + \epsilon$, $\epsilon \sim N(0, \tilde\Sigma)$:
$$\hat\beta = (\tilde X^T \tilde\Sigma^{-1} \tilde X)^{-1} \tilde X^T \tilde\Sigma^{-1} \tilde y,$$
where $\tilde y = K(y) - \mu(\hat\theta^{(j)})$, $\tilde X = \nabla\mu(\hat\theta^{(j)})^T$, and $\tilde\Sigma = \Sigma(\hat\theta^{(j)})$;
- Let $\hat\theta^{(j+1)} = \hat\theta^{(j)} + \hat\beta$;
- Repeat until converged (a code sketch of this loop follows Slide 8 below).

Slide 8: Application to Generalized Linear Models (GLM's)

Examples:
- Loglinear (Multinomial and Poisson) models for tables of counts
- Poisson regression models
- Logistic and probit regression models
- Normal linear regression

The essential assumptions are:
$$L_n(\theta) = G(y)\, e^{B(\theta) + \gamma(\theta)^T y}, \qquad \mu(\theta) = E_\theta[Y] = \begin{pmatrix} q(x_1^T\theta) \\ q(x_2^T\theta) \\ \vdots \\ q(x_n^T\theta) \end{pmatrix} = q(X\theta),$$
where $X$ is a model matrix with rows $x_1^T, x_2^T, \ldots, x_n^T$, and $q^{-1}(\cdot)$ is called the link function for the model.
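Here is a minimal numpy sketch of the Slide 7 IRLS loop (our rendering, not from the slides). `K_y`, `mu`, `Sigma`, and `grad_mu` are assumed to be an array and callables supplying $K(y)$, $\mu(\theta)$, $\Sigma(\theta)$, and the $p \times r$ matrix $\nabla\mu(\theta)$:

```python
import numpy as np

def irls_expfam(K_y, mu, Sigma, grad_mu, theta0, tol=1e-8, max_iter=100):
    """Generic IRLS for an exponential family (the Slide 7 recipe)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        Xt = grad_mu(theta).T            # "design matrix"  X~ = grad mu(theta)^T  (r x p)
        St = Sigma(theta)                # "error covariance"  Sigma~ = Sigma(theta)
        yt = K_y - mu(theta)             # "response"  y~ = K(y) - mu(theta)
        # GLS step: beta = (X~^T S~^{-1} X~)^{-1} X~^T S~^{-1} y~
        A = Xt.T @ np.linalg.solve(St, Xt)
        b = Xt.T @ np.linalg.solve(St, yt)
        beta = np.linalg.solve(A, b)
        theta = theta + beta             # theta^{(j+1)} = theta^{(j)} + beta_hat
        if np.linalg.norm(beta) < tol:
            break
    return theta
```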
Slide 9: Example: Logistic Regression

We assume
$$y_i \mid x_i \sim \mathrm{Bin}\!\left(1,\; \frac{e^{x_i^T\theta}}{1 + e^{x_i^T\theta}}\right) \qquad (y_i \in \{0, 1\}).$$
Then
$$L_n(\theta) = \prod_{i=1}^n \frac{1}{1 + e^{x_i^T\theta}}\; \prod_{i=1}^n e^{x_i^T\theta\, y_i} = e^{B(\theta) + y^T X\theta}.$$
Since $\gamma(\theta) = X\theta$ is linear in $\theta$, Newton's method and Scoring will be the same. Also note that
$$\mu(\theta) = E[y] = \begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_n \end{pmatrix} = \begin{pmatrix} q(x_1^T\theta) \\ q(x_2^T\theta) \\ \vdots \\ q(x_n^T\theta) \end{pmatrix} = q(X\theta),$$
where
$$q(t) = \frac{e^t}{1 + e^t} \qquad \text{and so} \qquad q^{-1}(p) = \log\frac{p}{1-p} = \mathrm{logit}(p).$$

Slide 10: Fisher Scoring for GLM's

$$\nabla\mu(\theta)^T = \left[\frac{\partial\, q(x_i^T\theta)}{\partial \theta_j}\right] = \left[q'(x_i^T\theta)\, x_{ij}\right] = Q'(X\theta)\, X, \qquad Q'(X\theta) = \begin{pmatrix} q'(x_1^T\theta) & 0 & \cdots & 0 \\ 0 & q'(x_2^T\theta) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & q'(x_n^T\theta) \end{pmatrix},$$
so that (here $K(y) = y$ and $\Sigma(\theta) = D$)
$$\nabla \ell_n(\theta) = \nabla\gamma(\theta)\,[K(y) - \mu(\theta)] = \nabla\mu(\theta)\, \Sigma^{-1}(\theta)\, [y - \mu(\theta)] = [Q'(X\theta)\, X]^T D^{-1} [y - \mu(\theta)] = X^T [Q'(X\theta)]^T D^{-1} [y - \mu(\theta)],$$
where $D_{n\times n}$ is the diagonal matrix with diagonal elements $d_{ii} = \sigma_i^2 = \mathrm{Var}(y_i)$. Also,
$$I_n(\theta) = \nabla\mu(\theta)\, \Sigma^{-1}\, \nabla\mu(\theta)^T = [Q'(X\theta)\, X]^T D^{-1} [Q'(X\theta)\, X] = X^T [Q'(X\theta)]^T D^{-1} [Q'(X\theta)]\, X.$$
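For logistic regression, $Q'(X\theta) = D = \mathrm{diag}(p_i(1-p_i))$, so the score and information collapse to $X^T(y - p)$ and $X^T D X$. A minimal numpy sketch of Fisher scoring in this case (ours, not the slides'; the zero starting value and stopping rule are our choices):

```python
import numpy as np

def logistic_scoring(X, y, tol=1e-8, max_iter=50):
    """Fisher scoring (= Newton-Raphson here) for logistic regression.
    Iterates theta <- theta + (X^T W X)^{-1} X^T (y - p),
    where p = q(X theta) and W = diag(p_i (1 - p_i))."""
    theta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-X @ theta))   # mu(theta) = q(X theta)
        W = p * (1.0 - p)                      # Var(y_i), also q'(x_i^T theta)
        score = X.T @ (y - p)                  # grad l_n(theta)
        info = (X * W[:, None]).T @ X          # I_n(theta) = X^T W X
        step = np.linalg.solve(info, score)
        theta = theta + step
        if np.linalg.norm(step) < tol:
            break
    return theta
```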
Slide 11

This leads to an IRLS algorithm that operates almost directly on $X$ and $\mu(\theta) = q(X\theta)$ (a code sketch of this update follows Slide 12 below):
$$\hat\theta^{(j+1)} - \hat\theta^{(j)} = (\tilde X^T D^{-1} \tilde X)^{-1} \tilde X^T D^{-1} \tilde y,$$
where $\tilde y = y - \mu(\hat\theta^{(j)})$ and $\tilde X = Q'(X\hat\theta^{(j)})\, X$.

Example: Logistic Regression (cont'd)
$$\mu_i(\theta) = E[y_i] = q(x_i^T\theta) = \frac{e^{x_i^T\theta}}{1 + e^{x_i^T\theta}} = p_i, \qquad \sigma_i^2 = \mathrm{Var}(y_i) = p_i(1 - p_i),$$
$$q'(t) = \frac{d}{dt}\left[\frac{e^t}{1 + e^t}\right] = \frac{e^t}{(1 + e^t)^2} = e^t\,[1 - q(t)]^2 = q(t)\,[1 - q(t)].$$
What do $\nabla \ell_n(\theta)$ and $\nabla^2 \ell_n(\theta)$ look like in this case?

Slide 12: Application to Nonlinear Least Squares

Basic assumptions:
$$Y_i \overset{\text{indep}}{\sim} N\!\left(\mu_i(\phi),\; \sigma^2 / w_i\right), \qquad \mu_i(\phi) = q(x_i; \phi),$$
where the function $q(\cdot)$, the design matrix $X$ with rows $x_i^T$, and the weights $w_i$ are all known in advance. We wish to estimate $\theta = (\phi, \sigma^2)$.

Examples:
- Exponential model: $q(x; \alpha, \beta) = \alpha e^{\beta x}$; $\phi = (\alpha, \beta)$.
- Logistic model: $q(x; \alpha, \beta, \gamma) = \alpha / (1 + \gamma e^{-\beta x})$; $\phi = (\alpha, \beta, \gamma)$.
- Gompertz model: $q(x; \alpha, \beta, \gamma) = \alpha e^{-\gamma e^{-\beta x}}$; $\phi = (\alpha, \beta, \gamma)$.
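Here is a minimal numpy sketch of the Slide 11 GLM update for a general inverse link (ours, not the slides'); `q`, `qprime`, and `var` are assumed vectorized callables for $q$, $q'$, and $\mathrm{Var}(y_i)$ as a function of the mean:

```python
import numpy as np

def irls_glm(X, y, q, qprime, var, theta0, tol=1e-8, max_iter=50):
    """IRLS for a GLM with inverse link q (the Slide 11 update)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        eta = X @ theta
        mu = q(eta)                          # mu(theta) = q(X theta)
        Xt = qprime(eta)[:, None] * X        # X~ = Q'(X theta) X
        d = var(mu)                          # diagonal of D: Var(y_i)
        A = Xt.T @ (Xt / d[:, None])         # X~^T D^{-1} X~
        b = Xt.T @ ((y - mu) / d)            # X~^T D^{-1} (y - mu)
        step = np.linalg.solve(A, b)
        theta = theta + step
        if np.linalg.norm(step) < tol:
            break
    return theta
```

With `q = lambda t: 1/(1+np.exp(-t))`, `qprime = lambda t: q(t)*(1-q(t))`, and `var = lambda mu: mu*(1-mu)`, this reduces to the logistic-regression scoring sketch above.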
Slide 13: A Sketch of Fisher Scoring

Since $L_n(\theta)$ is a normal likelihood,
$$\ell_n(\theta) = -\frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n w_i\,[y_i - \mu_i(\phi)]^2 + \text{const},$$
and so
$$\nabla \ell_n(\theta) = \begin{pmatrix} \dfrac{1}{\sigma^2} \displaystyle\sum_{i=1}^n w_i\,[y_i - \mu_i(\phi)]\, \nabla\mu_i(\phi) \\[1.5ex] -\dfrac{n}{2\sigma^2} + \dfrac{1}{2\sigma^4} \displaystyle\sum_{i=1}^n w_i\,[y_i - \mu_i(\phi)]^2 \end{pmatrix}, \qquad I_n(\theta) = E[-\nabla^2 \ell_n(\theta)] = \begin{pmatrix} \dfrac{1}{\sigma^2} \displaystyle\sum_{i=1}^n w_i\, \nabla\mu_i(\phi)\, \nabla\mu_i(\phi)^T & 0 \\ 0 & \dfrac{n}{2\sigma^4} \end{pmatrix},$$
where the matrices have been partitioned into parts relevant to $\phi$ and to $\sigma^2$. Scoring yields
$$\hat\phi^{(j+1)} = \hat\phi^{(j)} + \left[\sum_{i=1}^n w_i\, \nabla\mu_i(\hat\phi^{(j)})\, \nabla\mu_i(\hat\phi^{(j)})^T\right]^{-1} \sum_{i=1}^n w_i\,[y_i - \mu_i(\hat\phi^{(j)})]\, \nabla\mu_i(\hat\phi^{(j)}),$$
$$\hat\sigma^{2\,(j+1)} = \hat\sigma^{2\,(j)} - \hat\sigma^{2\,(j)} + \frac{1}{n}\sum_{i=1}^n w_i\,[y_i - \mu_i(\hat\phi^{(j)})]^2 = \frac{1}{n}\sum_{i=1}^n w_i\,[y_i - \mu_i(\hat\phi^{(j)})]^2.$$
(A code sketch of this scoring loop follows Slide 14 below.)

Slide 14: Application to Robust Regression

Main idea:
$$E[y_i \mid x_i] = \mu_i(\phi) = x_i^T\phi, \qquad y_i \overset{\text{indep}}{\sim} p(y_i \mid x_i, \phi, \sigma).$$
Least Squares: minimize $S(\phi) = \sum_i \left(\dfrac{y_i - \mu_i(\phi)}{\sigma}\right)^2$.
Robust Regression: minimize $S(\phi) = \sum_i \rho\!\left(\dfrac{y_i - \mu_i(\phi)}{\sigma}\right)$.

Examples:
- $\rho(t) = t^2$
- $\rho(t) = |t|$
- $\rho(t) = \begin{cases} t^2/2, & |t| < k \\ k|t| - k^2/2, & |t| \ge k \end{cases}$ (Huber)
- $\rho(t) = \begin{cases} t^2, & |t| < k \\ k^2, & |t| \ge k \end{cases}$
- $\rho(t) = \log\cosh^2(t)$
- ...

Typically $\rho(t)$ is even (symmetric about 0), with a unique antimode (minimum) $\rho(0) = 0$.
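A minimal numpy sketch of the Slide 13 scoring loop (ours, not the slides'). `mu` and `grad_mu` are assumed callables returning the mean vector and the $n \times p$ matrix whose rows are $\nabla\mu_i(\phi)^T$; the exponential-model usage at the end is a hypothetical illustration:

```python
import numpy as np

def nls_scoring(x, y, w, mu, grad_mu, phi0, tol=1e-8, max_iter=100):
    """Fisher scoring (Gauss-Newton) for weighted nonlinear least squares."""
    phi = np.asarray(phi0, dtype=float)
    for _ in range(max_iter):
        r = y - mu(x, phi)                 # residuals y_i - mu_i(phi)
        G = grad_mu(x, phi)                # rows: grad mu_i(phi)^T
        A = (G * w[:, None]).T @ G         # sum_i w_i grad mu_i grad mu_i^T
        b = (G * w[:, None]).T @ r         # sum_i w_i r_i grad mu_i
        step = np.linalg.solve(A, b)
        phi = phi + step
        if np.linalg.norm(step) < tol:
            break
    # The sigma^2 update has the closed form (1/n) sum_i w_i r_i^2:
    sigma2 = np.mean(w * (y - mu(x, phi))**2)
    return phi, sigma2

# Hypothetical usage with the exponential model q(x; a, b) = a * exp(b x):
mu_exp = lambda x, phi: phi[0] * np.exp(phi[1] * x)
grad_exp = lambda x, phi: np.column_stack(
    [np.exp(phi[1] * x), phi[0] * x * np.exp(phi[1] * x)])
```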
Slide 15: Two common approaches to estimating robust regression models

Scoring Approach: Replace the model
$$y_i \mid x_i \sim \frac{c}{\sigma}\, e^{-\frac{1}{2}\left(\frac{y_i - \mu_i}{\sigma}\right)^2}, \qquad c = \frac{1}{\sqrt{2\pi}},$$
with the model
$$y_i \mid x_i \sim \frac{c}{\sigma}\, e^{-\rho\left(\frac{y_i - \mu_i}{\sigma}\right)}, \qquad c^{-1} = \int e^{-\rho(y)}\, dy,$$
and apply the Scoring idea. This yields an IRLS algorithm like the one for nonlinear least-squares.

Iterative weighting with influence function: Observe that if $S(\phi) = \sum_{i=1}^n \rho\left(\frac{y_i - \mu_i(\phi)}{\sigma}\right)$, then
$$\nabla S(\phi) = -\sum_i \rho'\!\left(\frac{y_i - \mu_i}{\sigma}\right) \frac{x_i}{\sigma} = -\frac{1}{\sigma}\sum_i w_i \left(\frac{y_i - \mu_i}{\sigma}\right) x_i = -\frac{1}{\sigma^2}\, X^T W\, (y - X\phi),$$

Slide 16

where $W$ is a diagonal matrix with diagonal elements
$$w_i = \rho'\!\left(\frac{y_i - \mu_i}{\sigma}\right) \Big/ \left(\frac{y_i - \mu_i}{\sigma}\right);$$
$w_i$ gives the value of the influence function $\rho'$ at each $i$, rescaled by the standardized residual. Setting $\nabla S(\phi) = 0$ gives the WLS normal equations with weights $W$, which leads to another IRLS-like procedure:
1. Compute $\hat\sigma$ as a robust estimate of the residual standard deviation $\sigma$: for example, $\hat\sigma = \frac{2}{3}\,\mathrm{IQR}(\text{resid})$, or take $\hat\sigma = \mathrm{med}\,|\text{resid} - \mathrm{med}(\text{resid})| \,/\, 0.6745$.
2. Use this $\hat\sigma$ to calculate $W$, and obtain $\phi^{(j+1)}$ using WLS.
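A minimal numpy sketch of the Slide 16 procedure (ours, not the slides'), assuming the Huber $\rho$; the tuning constant $k = 1.345$ is a common choice but not from the slides, and the scale estimate is the MAD-based one from step 1:

```python
import numpy as np

def huber_weights(r, k=1.345):
    """w_i = rho'(r_i)/r_i for the Huber rho: 1 on [-k, k], k/|r| outside."""
    a = np.abs(r)
    return np.where(a <= k, 1.0, k / np.maximum(a, 1e-12))

def robust_irls(X, y, k=1.345, tol=1e-8, max_iter=100):
    """IRLS for robust linear regression via the influence-function weights."""
    phi, *_ = np.linalg.lstsq(X, y, rcond=None)   # start from OLS
    for _ in range(max_iter):
        resid = y - X @ phi
        # Step 1: robust scale estimate, sigma_hat = MAD / 0.6745
        sigma = np.median(np.abs(resid - np.median(resid))) / 0.6745
        # Step 2: weights from the influence function, then WLS
        w = huber_weights(resid / sigma, k)
        A = (X * w[:, None]).T @ X
        b = (X * w[:, None]).T @ y
        phi_new = np.linalg.solve(A, b)
        if np.linalg.norm(phi_new - phi) < tol:
            phi = phi_new
            break
        phi = phi_new
    return phi
```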