36-711 Fall 2003: Maximum Likelihood II
Brian Junker
November 18, 2003

Slide 1
- Newton's Method and Scoring for MLE's
- Aside on WLS/GLS
- Application to Exponential Families
- Application to Generalized Linear Models
- Application to Nonlinear Least Squares
- Application to Robust Regression

Slide 2: Newton's Method and Scoring for MLE's
When carrying out Newton-Raphson to maximize $\ell_n(\theta)$, the natural iterates are
$$\hat\theta_n^{(j+1)} = \hat\theta_n^{(j)} - \big[\nabla^2 \ell_n(\hat\theta_n^{(j)})\big]^{-1}\, \nabla \ell_n(\hat\theta_n^{(j)}).$$
Sometimes the expected information $I_n(\theta) = E_\theta[-\nabla^2 \ell_n(\theta)]$ has a simpler form than the observed information, in which case one may use
$$\hat\theta_n^{(j+1)} = \hat\theta_n^{(j)} + I_n(\hat\theta_n^{(j)})^{-1}\, \nabla \ell_n(\hat\theta_n^{(j)}).$$
Maximizing using this quasi-Newton method is called Fisher Scoring.
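As a concrete illustration (not part of the original slides), here is a minimal generic scoring driver in Python/NumPy. The callables `score` and `info`, returning $\nabla\ell_n(\theta)$ and $I_n(\theta)$, the function name, and the stopping rule are all assumptions of this sketch; replacing `info` by the observed information $-\nabla^2\ell_n(\theta)$ gives plain Newton-Raphson.

```python
import numpy as np

def fisher_scoring(score, info, theta0, max_iter=50, tol=1e-8):
    """Generic Fisher-scoring iteration:
       theta <- theta + I_n(theta)^{-1} grad l_n(theta).
    `score(theta)` and `info(theta)` are user-supplied callables returning the
    score vector and the expected information matrix as NumPy arrays."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        step = np.linalg.solve(info(theta), score(theta))
        theta = theta + step
        if np.linalg.norm(step) < tol:   # stop when the update is negligible
            break
    return theta
```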

Slide 3: Aside on WLS/GLS
Suppose
$$y = X\beta + \epsilon, \qquad \epsilon \sim N(0, \Sigma).$$
We can convert this to an ordinary least squares problem via
$$\Sigma^{-1/2} y = \Sigma^{-1/2} X \beta + \Sigma^{-1/2}\epsilon, \qquad \Sigma^{-1/2}\epsilon \sim N(0, I_{n\times n}),$$
whose solution is
$$\hat\beta = (X^T \Sigma^{-1} X)^{-1} X^T \Sigma^{-1} y.$$
When $\Sigma$ is diagonal, this is called weighted least squares (WLS); when $\Sigma$ is a general covariance matrix, it is called generalized least squares (GLS). We will repeatedly apply this idea in the examples below.

Slide 4: Application to Exponential Families
Let $y = (y_1, \ldots, y_n)^T$ denote iid data to be modelled with an exponential family model. Recall that an exponential family model has the form
$$f(y_i \mid \theta) = g(y_i)\, e^{\beta(\theta) + \gamma(\theta)^T k(y_i)},$$
where $\theta_{p\times 1}$ are the $p$ original parameters, $\gamma(\theta)_{r\times 1}$ are the $r$ natural parameters, and $k(y_i)_{r\times 1}$ are the $r$ sufficient statistics for $\gamma$. Then the likelihood for $y$ is
$$L_n(\theta) = \prod_{i=1}^n f(y_i \mid \theta) = \Big[\prod_{i=1}^n g(y_i)\Big]\, e^{\,n\beta(\theta) + \gamma(\theta)^T \sum_{i=1}^n k(y_i)} = G(y)\, e^{\,B(\theta) + \gamma(\theta)^T K(y)},$$
and the log-likelihood is
$$\ell_n(\theta) = \log G(y) + B(\theta) + \gamma(\theta)^T K(y).$$
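To illustrate the whitening idea in code (not from the slides), here is a minimal NumPy sketch: it forms $\Sigma^{-1/2}$ from an eigendecomposition of a symmetric positive definite $\Sigma$ (an assumption of the sketch), solves the transformed OLS problem, and returns the same $\hat\beta$ as the closed form above. The function name and interface are just for this illustration.

```python
import numpy as np

def gls(X, y, Sigma):
    """Generalized least squares: beta_hat = (X' Sigma^{-1} X)^{-1} X' Sigma^{-1} y,
    computed by 'whitening' with Sigma^{-1/2} and then running ordinary LS."""
    # Symmetric inverse square root of Sigma via its eigendecomposition
    w, V = np.linalg.eigh(Sigma)
    Sig_inv_half = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    Xs, ys = Sig_inv_half @ X, Sig_inv_half @ y   # transformed OLS problem
    beta_hat, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    return beta_hat
```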

Slide 5: Comparing Newton-Raphson with Fisher Scoring
In order to apply Newton-Raphson to find $\hat\theta_n$, we need to compute $\nabla \ell_n(\theta)$ and $\nabla^2 \ell_n(\theta)$. A simple form for $\nabla \ell_n(\theta)$ follows from
$$\nabla \ell_n(\theta) = \nabla B(\theta) + \nabla\gamma(\theta)\, K(y), \qquad 0 = E[\nabla \ell_n(\theta)] = \nabla B(\theta) + \nabla\gamma(\theta)\,\mu(\theta),$$
so that
$$\nabla \ell_n(\theta) = \nabla\gamma(\theta)\,[K(y) - \mu(\theta)],$$
where $\mu(\theta)_{r\times 1} = E_\theta[K(y)_{r\times 1}]$ and $\nabla\gamma(\theta) = [\partial \gamma_j / \partial \theta_i]_{p\times r} = J_\gamma(\theta)^T$.
From the first expression for $\nabla \ell_n(\theta)$ above it is also easy to see that
$$\nabla^2 \ell_n(\theta) = \begin{cases} \nabla^2 B(\theta), & \text{if } \exists\, A_{r\times p} \text{ s.t. } \gamma(\theta) = A\theta, \\ \text{(messy)}, & \text{otherwise.} \end{cases}$$
In the first case, we see that Newton-Raphson and Fisher Scoring are really the same thing:
$$I_n(\theta) = E[-\nabla^2 \ell_n(\theta)] = -E[\nabla^2 B(\theta)] = -\nabla^2 B(\theta).$$

Slide 6: Fisher Scoring when Newton is Ugly
Using the form $\nabla \ell_n(\theta) = \nabla\gamma(\theta)[K(y) - \mu(\theta)]$, we have
$$I_n(\theta) = E_\theta[-\nabla^2 \ell_n(\theta)] = \mathrm{Var}_\theta\big(\nabla \ell_n(\theta)\big) = \nabla\gamma(\theta)\,\Sigma(\theta)\,\nabla\gamma(\theta)^T = \nabla\mu(\theta)\,\Sigma(\theta)^{-1}\,\nabla\mu(\theta)^T,$$
where $\Sigma(\theta) = \mathrm{Var}_\theta(K(y))$ and the last equality follows from differentiating $\mu(\theta) = \int K(y)\, L_n(y\mid\theta)\, d\nu(y)$ under the integral sign:
$$\begin{aligned}
\nabla\mu(\theta)^T &= \int K(y)\,[\nabla L_n(y \mid \theta)]^T\, d\nu(y) \\
&= \int K(y)\,[\nabla \ell_n(\theta)]^T L_n(y \mid \theta)\, d\nu(y) \\
&= \int K(y)\,[K(y) - \mu(\theta)]^T\, \nabla\gamma(\theta)^T\, L_n(y \mid \theta)\, d\nu(y) \\
&= \int [K(y) - \mu(\theta)][K(y) - \mu(\theta)]^T L_n(y \mid \theta)\, d\nu(y)\; \nabla\gamma(\theta)^T \\
&= \Sigma(\theta)\, \nabla\gamma(\theta)^T.
\end{aligned}$$
This shows that $I_n(\theta)$ can be expressed in terms of the first and second moments of $K(y)$, which may be simpler than working with $\nabla^2 \ell_n(\theta)$.
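A quick sanity check of these identities (not in the original slides): take $y_1, \ldots, y_n$ iid Poisson($\lambda$) with $\theta = \lambda$, so that
$$f(y_i \mid \lambda) = \frac{1}{y_i!}\, e^{-\lambda + y_i \log\lambda},$$
giving $\gamma(\lambda) = \log\lambda$, $k(y_i) = y_i$, $B(\lambda) = -n\lambda$, $K(y) = \sum_i y_i$, $\mu(\lambda) = n\lambda$, and $\Sigma(\lambda) = n\lambda$. Then
$$\nabla \ell_n(\lambda) = \nabla\gamma(\lambda)\,[K(y) - \mu(\lambda)] = \frac{1}{\lambda}\Big(\sum_i y_i - n\lambda\Big),
\qquad
I_n(\lambda) = \nabla\gamma(\lambda)\,\Sigma(\lambda)\,\nabla\gamma(\lambda)^T = \frac{1}{\lambda}\,(n\lambda)\,\frac{1}{\lambda} = \frac{n}{\lambda},$$
which agrees with $\nabla\mu(\lambda)\,\Sigma(\lambda)^{-1}\,\nabla\mu(\lambda)^T = n\cdot(n\lambda)^{-1}\cdot n = n/\lambda$, the familiar Poisson Fisher information.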

Slide 7: Some details
Fisher scoring, $\hat\theta^{(j+1)} = \hat\theta^{(j)} + I_n(\hat\theta^{(j)})^{-1} \nabla \ell_n(\hat\theta^{(j)})$, may be expressed as
$$\hat\theta^{(j+1)} = \hat\theta^{(j)} + \big\{\nabla\mu(\hat\theta^{(j)})\,\Sigma(\hat\theta^{(j)})^{-1}\,\nabla\mu(\hat\theta^{(j)})^T\big\}^{-1}\, \nabla\gamma(\hat\theta^{(j)})\,[K(y) - \mu(\hat\theta^{(j)})],$$
and, after applying our identity $\nabla\mu(\theta)^T = \Sigma(\theta)\,\nabla\gamma(\theta)^T$, we get
$$\hat\theta^{(j+1)} = \hat\theta^{(j)} + \big\{\nabla\mu(\hat\theta^{(j)})\,\Sigma(\hat\theta^{(j)})^{-1}\,\nabla\mu(\hat\theta^{(j)})^T\big\}^{-1}\, \nabla\mu(\hat\theta^{(j)})\,\Sigma(\hat\theta^{(j)})^{-1}\,[K(y) - \mu(\hat\theta^{(j)})],$$
which again uses just $K(y)$ and its first two moments $\mu(\theta)$ and $\Sigma(\theta)$.
This suggests the following iteratively reweighted least squares (IRLS) algorithm (a generic code sketch appears after Slide 8 below):
- Compute the WLS/GLS solution $\hat\beta$ for $\tilde y = \tilde X \beta + \tilde\epsilon$, $\tilde\epsilon \sim N(0, \tilde\Sigma)$:
  $$\hat\beta = (\tilde X^T \tilde\Sigma^{-1} \tilde X)^{-1} \tilde X^T \tilde\Sigma^{-1} \tilde y,$$
  where $\tilde y = K(y) - \mu(\hat\theta^{(j)})$, $\tilde X = \nabla\mu(\hat\theta^{(j)})^T$, and $\tilde\Sigma = \Sigma(\hat\theta^{(j)})$;
- Let $\hat\theta^{(j+1)} = \hat\theta^{(j)} + \hat\beta$;
- Repeat until converged.

Slide 8: Application to Generalized Linear Models (GLM's)
Examples:
- Loglinear (multinomial and Poisson) models for tables of counts
- Poisson regression models
- Logistic and probit regression models
- Normal linear regression
The essential assumptions are
$$L_n(\theta) = G(y)\, e^{\,B(\theta) + \gamma(\theta)^T y}, \qquad
\mu(\theta) = E_\theta[Y] = \begin{pmatrix} q(x_1^T\theta) \\ q(x_2^T\theta) \\ \vdots \\ q(x_n^T\theta) \end{pmatrix} = q(X\theta),$$
where $X$ is a model matrix with rows $x_1^T, x_2^T, \ldots, x_n^T$, and $q^{-1}(\cdot)$ is called the link function for the model.
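The following generic sketch (not from the original slides) encodes the Slide 7 IRLS recipe directly. The callables `mu`, `Sigma`, and `dmu_T`, standing in for $\mu(\theta)$, $\Sigma(\theta)$, and $\nabla\mu(\theta)^T$, plus the function name and stopping rule, are assumptions of this sketch; a real application would supply model-specific versions.

```python
import numpy as np

def exp_family_irls(K_y, mu, Sigma, dmu_T, theta0, max_iter=50, tol=1e-8):
    """Fisher scoring as IRLS for an exponential family (Slide 7):
    each step solves a WLS problem with working response K(y) - mu(theta),
    working design dmu(theta)^T, and weight matrix Sigma(theta)^{-1}."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        ytil = K_y - mu(theta)             # working response (r-vector)
        Xtil = dmu_T(theta)                # working design matrix (r x p)
        Sinv = np.linalg.inv(Sigma(theta))
        beta = np.linalg.solve(Xtil.T @ Sinv @ Xtil, Xtil.T @ Sinv @ ytil)
        theta = theta + beta               # theta^(j+1) = theta^(j) + beta_hat
        if np.linalg.norm(beta) < tol:
            break
    return theta
```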

Slide 9: Example: Logistic Regression
We assume
$$y_i \mid x_i \sim \mathrm{Bin}\!\left(1,\ \frac{e^{x_i^T\theta}}{1 + e^{x_i^T\theta}}\right), \qquad y_i \in \{0, 1\}.$$
Then
$$L_n(\theta) = \prod_{i=1}^n \frac{\big(e^{x_i^T\theta}\big)^{y_i}}{1 + e^{x_i^T\theta}}
= \Big[\prod_{i=1}^n \frac{1}{1 + e^{x_i^T\theta}}\Big]\, e^{\sum_{i=1}^n x_i^T\theta\, y_i}
= e^{\,B(\theta) + y^T X\theta}.$$
Since $\gamma(\theta) = X\theta$ is linear in $\theta$, Newton's method and Scoring will be the same. Also note that
$$\mu(\theta) = E[y] = \begin{pmatrix} p_1 \\ p_2 \\ \vdots \\ p_n \end{pmatrix}
= \begin{pmatrix} q(x_1^T\theta) \\ q(x_2^T\theta) \\ \vdots \\ q(x_n^T\theta) \end{pmatrix} = q(X\theta),
\qquad \text{where } q(t) = \frac{e^t}{1 + e^t} \text{ and so } q^{-1}(p) = \log\frac{p}{1-p} = \mathrm{logit}(p).$$

Slide 10: Fisher Scoring for GLM's
$$\nabla\mu(\theta)^T = \left[\frac{\partial q(x_i^T\theta)}{\partial \theta_j}\right] = \big[q'(x_i^T\theta)\, x_{ij}\big]
= \begin{pmatrix} q'(x_1^T\theta) & 0 & \cdots & 0 \\ 0 & q'(x_2^T\theta) & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & q'(x_n^T\theta) \end{pmatrix} X
= Q'(X\theta)\, X,$$
so that
$$\nabla \ell_n(\theta) = \nabla\gamma(\theta)\,[K(y) - \mu(\theta)] = \nabla\mu(\theta)\,\Sigma^{-1}(\theta)\,[y - \mu(\theta)]
= [Q'(X\theta)X]^T D^{-1}[y - \mu(\theta)] = X^T [Q'(X\theta)]^T D^{-1}[y - \mu(\theta)],$$
where $D_{n\times n}$ is the diagonal matrix with diagonal elements $d_{ii} = \sigma_i^2 = \mathrm{Var}(y_i)$. Also,
$$I_n(\theta) = \nabla\mu(\theta)\,\Sigma^{-1}\,\nabla\mu(\theta)^T = [Q'(X\theta)X]^T D^{-1}[Q'(X\theta)X] = X^T [Q'(X\theta)]^T D^{-1}[Q'(X\theta)]\, X.$$
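For concreteness (not part of the original slides), here is a minimal NumPy sketch of Fisher scoring for logistic regression built from the Slide 10 formulas; since $\gamma(\theta) = X\theta$ is linear, this is also exactly Newton-Raphson. For Bernoulli data $D = Q'(X\theta) = \mathrm{diag}(p_i(1-p_i))$, so the score and information simplify to $X^T(y - p)$ and $X^T D X$, which is what the code uses. The starting value and stopping rule are arbitrary choices.

```python
import numpy as np

def logistic_scoring(X, y, max_iter=25, tol=1e-10):
    """Fisher scoring (= Newton-Raphson here) for logistic regression:
      score = X' Q'(X theta) D^{-1} (y - mu),  info = X' Q'(X theta) D^{-1} Q'(X theta) X,
    which reduce to X'(y - p) and X' D X because D = Q'(X theta) = diag(p_i(1 - p_i))."""
    n, p = X.shape
    theta = np.zeros(p)
    for _ in range(max_iter):
        eta = X @ theta
        mu = 1.0 / (1.0 + np.exp(-eta))     # p_i = q(x_i' theta)
        d = mu * (1.0 - mu)                 # diagonal of D (and of Q'(X theta))
        score = X.T @ (y - mu)              # X' Q' D^{-1} (y - mu) with Q' = D
        info = X.T @ (d[:, None] * X)       # X' D X
        step = np.linalg.solve(info, score)
        theta = theta + step
        if np.linalg.norm(step) < tol:
            break
    return theta
```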

Slide 11
This leads to an IRLS algorithm that operates almost directly on $X$ and $\mu(\theta) = q(X\theta)$:
$$\hat\theta^{(j+1)} - \hat\theta^{(j)} = (\tilde X^T D^{-1} \tilde X)^{-1} \tilde X^T D^{-1} \tilde y,$$
where $\tilde y = y - \mu(\hat\theta^{(j)})$ and $\tilde X = Q'(X\hat\theta^{(j)})\, X$.

Example: Logistic Regression (cont'd)
$$\mu_i(\theta) = E[y_i] = q(x_i^T\theta) = \frac{e^{x_i^T\theta}}{1 + e^{x_i^T\theta}} \equiv p_i, \qquad
\sigma_i^2 = \mathrm{Var}(y_i) = p_i(1 - p_i),$$
$$q'(t) = \frac{d}{dt}\left[\frac{e^t}{1 + e^t}\right] = \frac{e^t}{(1 + e^t)^2} = q(t)\,[1 - q(t)].$$
What do $\nabla \ell_n(\theta)$ and $\nabla^2 \ell_n(\theta)$ look like in this case?

Slide 12: Application to Nonlinear Least Squares
Basic assumptions:
$$Y_i \stackrel{\text{indep}}{\sim} N\big(\mu_i(\phi),\ \sigma^2 / w_i\big), \qquad \mu_i(\phi) = q(x_i; \phi),$$
where the function $q(\cdot)$, the design matrix $X$ with rows $x_i^T$, and the weights $w_i$ are all known in advance. We wish to estimate $\theta = (\phi, \sigma^2)$.
Examples (sketched in code below):
- Exponential model: $q(x; \alpha, \beta) = \alpha e^{\beta x}$; $\phi = (\alpha, \beta)$.
- Logistic model: $q(x; \alpha, \beta, \gamma) = \alpha / (1 + \gamma e^{\beta x})$; $\phi = (\alpha, \beta, \gamma)$.
- Gompertz model: $q(x; \alpha, \beta, \gamma) = \alpha e^{\gamma e^{\beta x}}$; $\phi = (\alpha, \beta, \gamma)$.
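The three mean functions above can be encoded directly; this short sketch (not from the slides) just writes them as Python callables in the $q(x; \phi)$ form used by the scoring sketch after Slide 13. The function names are illustrative.

```python
import numpy as np

# Nonlinear mean functions mu_i(phi) = q(x_i; phi) from Slide 12.
def q_exponential(x, phi):        # phi = (alpha, beta)
    alpha, beta = phi
    return alpha * np.exp(beta * x)

def q_logistic(x, phi):           # phi = (alpha, beta, gamma)
    alpha, beta, gamma = phi
    return alpha / (1.0 + gamma * np.exp(beta * x))

def q_gompertz(x, phi):           # phi = (alpha, beta, gamma)
    alpha, beta, gamma = phi
    return alpha * np.exp(gamma * np.exp(beta * x))
```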

Slide 13: A Sketch of Fisher Scoring
Since $L_n(\theta)$ is a normal likelihood,
$$\ell_n(\theta) = -\frac{n}{2}\log\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^n w_i\,[y_i - \mu_i(\phi)]^2,$$
and so
$$\nabla \ell_n(\theta) = \begin{pmatrix} \dfrac{1}{\sigma^2}\displaystyle\sum_{i=1}^n w_i\,[y_i - \mu_i(\phi)]\,\nabla\mu_i(\phi) \\[2ex] -\dfrac{n}{2\sigma^2} + \dfrac{1}{2\sigma^4}\displaystyle\sum_{i=1}^n w_i\,[y_i - \mu_i(\phi)]^2 \end{pmatrix},
\qquad
I_n(\theta) = E[-\nabla^2 \ell_n(\theta)] = \begin{pmatrix} \dfrac{1}{\sigma^2}\displaystyle\sum_{i=1}^n w_i\,\nabla\mu_i(\phi)\,\nabla\mu_i(\phi)^T & 0 \\ 0 & \dfrac{n}{2\sigma^4} \end{pmatrix},$$
where the matrices have been partitioned into parts relevant to $\phi$ and to $\sigma^2$. Scoring yields
$$\hat\phi^{(j+1)} = \hat\phi^{(j)} + \Big[\sum_i w_i\,\nabla\mu_i(\hat\phi^{(j)})\,\nabla\mu_i(\hat\phi^{(j)})^T\Big]^{-1} \sum_i w_i\,[y_i - \mu_i(\hat\phi^{(j)})]\,\nabla\mu_i(\hat\phi^{(j)}),$$
$$\hat\sigma^{2\,(j+1)} = \hat\sigma^{2\,(j)} - \hat\sigma^{2\,(j)} + \frac{1}{n}\sum_i w_i\,[y_i - \mu_i(\hat\phi^{(j)})]^2 = \frac{1}{n}\sum_i w_i\,[y_i - \mu_i(\hat\phi^{(j)})]^2.$$

Slide 14: Application to Robust Regression
Main idea:
$$E[y_i \mid x_i] = \mu_i(\phi) = x_i^T\phi, \qquad y_i \stackrel{\text{indep}}{\sim} p(y_i \mid x_i, \phi, \sigma).$$
Least Squares: Minimize $\displaystyle S(\phi) = \sum_i \left(\frac{y_i - \mu_i(\phi)}{\sigma}\right)^2$.
Robust Regression: Minimize $\displaystyle S(\phi) = \sum_i \rho\!\left(\frac{y_i - \mu_i(\phi)}{\sigma}\right)$.
Examples:
- $\rho(t) = t^2$
- $\rho(t) = \begin{cases} t^2/2, & |t| < k \\ k|t| - k^2/2, & |t| \ge k \end{cases}$ (Huber)
- $\rho(t) = \begin{cases} t^2, & |t| < k \\ k^2, & |t| \ge k \end{cases}$
- $\rho(t) = \log\cosh^2(t)$
- ...
Typically $\rho(t)$ is even (symmetric about zero) and has a unique antimode (minimum) at $\rho(0) = 0$.
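Here is a minimal sketch of the Slide 13 updates (not part of the original slides), specialized to the exponential mean function $q(x; \alpha, \beta) = \alpha e^{\beta x}$ from Slide 12, for which $\nabla\mu_i(\phi) = (e^{\beta x_i},\ \alpha x_i e^{\beta x_i})^T$. The starting values, iteration limit, and tolerance are arbitrary choices of this sketch.

```python
import numpy as np

def nls_scoring(x, y, w, phi0, max_iter=100, tol=1e-8):
    """Fisher scoring for nonlinear least squares (Slide 13), with
    mu_i(phi) = alpha * exp(beta * x_i).
    phi update:     phi + [sum w_i dmu_i dmu_i']^{-1} sum w_i (y_i - mu_i) dmu_i
    sigma^2 update: (1/n) sum w_i (y_i - mu_i)^2"""
    phi = np.asarray(phi0, dtype=float)
    n = len(y)
    for _ in range(max_iter):
        alpha, beta = phi
        mu = alpha * np.exp(beta * x)
        # Rows of G are the gradients dmu_i(phi)' = (e^{beta x_i}, alpha x_i e^{beta x_i})
        G = np.column_stack([np.exp(beta * x), alpha * x * np.exp(beta * x)])
        resid = y - mu
        info = G.T @ (w[:, None] * G)        # sum_i w_i dmu_i dmu_i'
        score = G.T @ (w * resid)            # sum_i w_i (y_i - mu_i) dmu_i
        step = np.linalg.solve(info, score)
        phi = phi + step
        if np.linalg.norm(step) < tol:
            break
    alpha, beta = phi
    sigma2 = np.sum(w * (y - alpha * np.exp(beta * x)) ** 2) / n
    return phi, sigma2
```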

Slide 15: Two common approaches to estimating robust regression models
Scoring Approach: Replace the model
$$y_i \mid x_i \sim \frac{c}{\sigma}\, e^{-\frac{1}{2}\left(\frac{y_i - \mu_i}{\sigma}\right)^2}, \qquad c = \frac{1}{\sqrt{2\pi}},$$
with the model
$$y_i \mid x_i \sim \frac{c}{\sigma}\, e^{-\rho\left(\frac{y_i - \mu_i}{\sigma}\right)}, \qquad c^{-1} = \int e^{-\rho(y)}\, dy,$$
and apply the Scoring idea. This yields an IRLS algorithm like the one for nonlinear least squares.

Slide 16
Iterative weighting with influence function: Observe that if $S(\phi) = \sum_{i=1}^n \rho\!\left(\frac{y_i - \mu_i(\phi)}{\sigma}\right)$, then
$$\nabla S(\phi) = -\sum_i \rho'\!\left(\frac{y_i - \mu_i}{\sigma}\right)\frac{x_i}{\sigma}
= -\frac{1}{\sigma}\sum_i w_i \left(\frac{y_i - \mu_i}{\sigma}\right) x_i
= -\frac{1}{\sigma^2}\, X^T W (y - X\phi),$$
where $W$ is a diagonal matrix with diagonal elements
$$w_i = \rho'\!\left(\frac{y_i - \mu_i}{\sigma}\right) \Big/ \left(\frac{y_i - \mu_i}{\sigma}\right);$$
the $w_i$ are weights derived from the influence function $\psi = \rho'$ at each observation $i$.
This leads to another IRLS-like procedure (sketched in code below):
1. Compute $\hat\sigma$ as a robust estimate of the residual standard deviation $\sigma$ (for example, $\hat\sigma \approx 2\,\mathrm{IQR}(\mathrm{resid})/3$, or take $\hat\sigma = \mathrm{med}\,|\mathrm{resid} - \mathrm{med}(\mathrm{resid})| / 0.6745$).
2. Use this $\hat\sigma$ to calculate $W$, and obtain $\hat\phi^{(j+1)}$ using WLS.
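A minimal sketch of this weighting scheme (not from the original slides), using the Huber $\rho$ from Slide 14, for which $\psi(t) = \rho'(t) = t$ when $|t| < k$ and $k\,\mathrm{sign}(t)$ otherwise, so $w_i = \min(1, k/|t_i|)$. The MAD-based $\hat\sigma$ from step 1 is recomputed at each pass (a common variant), and the cutoff $k = 1.345$, starting value, and stopping rule are conventional but arbitrary choices here.

```python
import numpy as np

def huber_irls(X, y, k=1.345, max_iter=50, tol=1e-8):
    """IRLS for robust linear regression (Slide 16) with the Huber rho.
    Weights: w_i = psi(r_i / sigma_hat) / (r_i / sigma_hat), psi = rho'."""
    phi = np.linalg.lstsq(X, y, rcond=None)[0]        # start from ordinary LS
    for _ in range(max_iter):
        resid = y - X @ phi
        # Step 1: robust scale estimate, med|resid - med(resid)| / 0.6745
        sigma_hat = np.median(np.abs(resid - np.median(resid))) / 0.6745
        t = resid / sigma_hat
        w = np.minimum(1.0, k / np.maximum(np.abs(t), 1e-12))   # Huber psi(t)/t
        # Step 2: weighted least squares, phi = (X' W X)^{-1} X' W y
        XtW = X.T * w
        phi_new = np.linalg.solve(XtW @ X, XtW @ y)
        if np.linalg.norm(phi_new - phi) < tol:
            phi = phi_new
            break
        phi = phi_new
    return phi
```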