Weighted Least Squares I
Gopi Goswami (goswami@stat.harvard.edu), February 17, 2006

Weighted Least Squares I

- Data: for $i = 1, 2, \ldots, n$ we have (see [1, Bradley]) $Y_i \mid x_i \overset{\text{i.n.i.d.}}{\sim} f(y_i \mid \theta_i)$, where $\theta_i = E(Y_i \mid x_i)$.
- Covariates: $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})^T$; let $X_{n \times p}$ be the matrix of covariates with rows $x_i^T$.
- Parameter of interest: $\beta = (\beta_1, \beta_2, \ldots, \beta_p)^T$, $p < n$, with $\theta_i = E(Y_i \mid x_i) = \beta^T x_i$.
- $\mathrm{Var}(Y_i \mid x_i) = v_i(\phi)$ has a known form that does not depend on $\beta$; the $v_i(\phi)$ are not all equal, and $\phi$ is known.
- We want to estimate $\beta$. Ignoring the underlying density, one could use the Weighted Least Squares estimator:
  $$\hat\beta_{WLS} = \arg\min_\beta \sum_{i=1}^n v_i(\phi)^{-1}\,(Y_i - \beta^T x_i)^2.$$

WLS II

- One could also use the Maximum Likelihood Estimator:
  $$\hat\beta_{MLE} = \arg\max_\beta \log(L(\beta)) = \arg\max_\beta \sum_{i=1}^n \log f(Y_i \mid \beta^T x_i).$$
- For WLS we solve the following normal equations:
  $$\sum_{i=1}^n v_i(\phi)^{-1}\,(Y_i - \beta^T x_i)\,x_{ij} = 0, \quad j = 1, 2, \ldots, p. \tag{1}$$
- For MLE we solve the following system of equations:
  $$\sum_{i=1}^n \frac{\partial}{\partial\beta_j} \log f(Y_i \mid \beta^T x_i) = 0, \quad j = 1, 2, \ldots, p. \tag{2}$$
- For certain choices of $f(\cdot)$, $\hat\beta_{WLS} = \hat\beta_{MLE}$. What are those choices?

NEF of Distributions: I

- NEF stands for Natural Exponential Family.
- A NEF density looks like
  $$f(y \mid \theta) = h(y)\exp[P(\theta)\,y - Q(\theta)],$$
  where $\theta = E(Y)$ and the range of $Y$ does not depend on $\theta$.
- Consider $\int f(y \mid \theta)\,dy = 1$, i.e. $\int h(y)\exp[P(\theta)\,y - Q(\theta)]\,dy = 1$, and assume differentiation under the integral sign is possible.
- Apply $\frac{d}{d\theta}$ to both sides of the above to get $\theta = E(Y) = Q'(\theta)/P'(\theta)$ (why?).
- Apply $\frac{d^2}{d\theta^2}$ to both sides of the above to get $\mathrm{Var}(Y) = 1/P'(\theta)$ (why?). See the short derivation below.
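
Filling in the two "why?" steps, a short derivation under the slide's assumption that differentiation under the integral sign is allowed:
$$\begin{aligned}
0 &= \frac{d}{d\theta}\int h(y)\,e^{P(\theta)y - Q(\theta)}\,dy
   = \int \big(P'(\theta)\,y - Q'(\theta)\big)\,f(y \mid \theta)\,dy
   = P'(\theta)\,E(Y) - Q'(\theta)
   \;\Longrightarrow\; E(Y) = \frac{Q'(\theta)}{P'(\theta)} = \theta, \\
0 &= \frac{d}{d\theta}\int \big(P'(\theta)\,y - Q'(\theta)\big)\,f(y \mid \theta)\,dy
   = \int \Big[\big(P'(\theta)\,y - Q'(\theta)\big)^2 + P''(\theta)\,y - Q''(\theta)\Big] f(y \mid \theta)\,dy \\
  &= P'(\theta)^2\,\mathrm{Var}(Y) + P''(\theta)\,\theta - Q''(\theta)
   = P'(\theta)^2\,\mathrm{Var}(Y) - P'(\theta)
   \;\Longrightarrow\; \mathrm{Var}(Y) = \frac{1}{P'(\theta)},
\end{aligned}$$
where the last line uses $Q'(\theta) = \theta\,P'(\theta)$, so that $P'(\theta)Y - Q'(\theta) = P'(\theta)(Y - \theta)$ and $Q''(\theta) = P'(\theta) + \theta\,P''(\theta)$.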

WLS and MLE I

- If the $f(Y_i \mid \beta^T x_i)$ all come from a NEF, then $\hat\beta_{WLS} = \hat\beta_{MLE}$.
- Sketch of proof: since
  $$\log \prod_{i=1}^n f(Y_i \mid \beta^T x_i) = \sum_{i=1}^n \{\log(h(Y_i)) + P(\beta^T x_i)\,Y_i - Q(\beta^T x_i)\},$$
  we have, with $\theta_i = \beta^T x_i$,
  $$\begin{aligned}
  \frac{\partial}{\partial\beta_j}\sum_{i=1}^n \log f(Y_i \mid \beta^T x_i)
  &= \sum_{i=1}^n \{P'(\theta_i)\,x_{ij}\,Y_i - Q'(\theta_i)\,x_{ij}\} \\
  &= \sum_{i=1}^n P'(\theta_i)\left(Y_i - \frac{Q'(\theta_i)}{P'(\theta_i)}\right)x_{ij} \\
  &= \sum_{i=1}^n v_i(\phi)^{-1}\,(Y_i - E(Y_i \mid x_i))\,x_{ij} \\
  &= \sum_{i=1}^n v_i(\phi)^{-1}\,(Y_i - \beta^T x_i)\,x_{ij},
  \end{aligned}$$
  where the third equality uses the NEF facts $E(Y_i \mid x_i) = \theta_i = Q'(\theta_i)/P'(\theta_i)$ and $\mathrm{Var}(Y_i \mid x_i) = 1/P'(\theta_i) = v_i(\phi)$.

WLS and MLE II

- So equation (2) boils down to solving
  $$\sum_{i=1}^n v_i(\phi)^{-1}\,(Y_i - \beta^T x_i)\,x_{ij} = 0, \quad j = 1, 2, \ldots, p,$$
  which is exactly the same as equation (1). Q.E.D.
- Note that the solution to the above equations also satisfies (how?)
  $$(X^T W X)\,\hat\beta_{WLS} = X^T W Y \;\Longrightarrow\; \hat\beta_{WLS} = (X^T W X)^{-1} X^T W Y,$$
  where $W$ is diagonal with $(W)_{ii} = v_i(\phi)^{-1}$. A small computational sketch of this closed form follows below.
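
A minimal Python (NumPy) sketch of this closed form; the function name wls and the use of a linear solve instead of an explicit inverse are illustrative choices, not from the slides:

```python
import numpy as np

def wls(X, y, v):
    """Weighted least squares: solve (X^T W X) beta = X^T W y with W = diag(1/v).

    X : (n, p) covariate matrix, y : (n,) responses, v : (n,) known variances v_i(phi).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = 1.0 / np.asarray(v, dtype=float)    # diagonal of W
    XtW = X.T * w                           # X^T W without forming the n x n matrix
    return np.linalg.solve(XtW @ X, XtW @ y)
```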

Example I

- Heteroskedastic Least Squares: for $i = 1, 2, \ldots, n$ we have $Y_i \mid x_i \overset{\text{i.n.i.d.}}{\sim} \mathrm{Normal}_1(\theta_i, \sigma^2 k(x_i))$, for some known constant $\sigma^2$ and a known function $k(\cdot)$ with $k: \mathbb{R}^p \to (0, \infty)$, and $\theta_i = E(Y_i \mid x_i) = \beta^T x_i$.
- We want to estimate $\beta$, so we take diagonal $W$ with $(W)_{ii} = 1/(\sigma^2 k(x_i))$ and
  $$\hat\beta_{WLS} = (X^T W X)^{-1}(X^T W Y).$$
- Now $\hat\beta_{WLS} = \hat\beta_{MLE}$ because the Normal distribution comes from a NEF:
  $$\mathrm{Normal}_1(\theta, \sigma^2 k(x_i); y) = \underbrace{\frac{\exp\!\left[-\frac{y^2}{2\sigma^2 k(x_i)}\right]}{\sqrt{2\pi\sigma^2 k(x_i)}}}_{h(y)} \exp\!\Big[\underbrace{\tfrac{\theta}{\sigma^2 k(x_i)}}_{P(\theta)}\,y - \underbrace{\tfrac{\theta^2}{2\sigma^2 k(x_i)}}_{Q(\theta)}\Big].$$
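
A minimal simulation sketch of this example, reusing the hypothetical wls function above; the variance function k(x) = 1 + ||x||^2, the sample size, and the true beta are illustrative assumptions, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma2 = 200, 3, 0.5
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
k = 1.0 + np.sum(X**2, axis=1)              # assumed known variance function k(x_i)
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2 * k))

beta_wls = wls(X, y, v=sigma2 * k)          # (W)_ii = 1 / (sigma^2 k(x_i))
```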

Iteratively Reweighted Least Squares I

- Suppose that in the previous setting, for a known non-linear function $m(\cdot,\cdot)$ with first derivative, we have $\theta_i = m(\beta, x_i)$.
- We want to estimate $\beta$. Ignoring the underlying density, one uses the Iteratively Reweighted Least Squares estimator:
  $$\hat\beta_{IRLS} = \arg\min_\beta \sum_{i=1}^n v_i(\phi)^{-1}\,(Y_i - m(\beta, x_i))^2.$$
- One can show that under this set-up, as well, $\hat\beta_{IRLS} = \hat\beta_{MLE}$; the proof is very similar to the proof of $\hat\beta_{WLS} = \hat\beta_{MLE}$ given before, and is left as an assignment problem.

IRLS II

- Here we need to solve the following normal equations:
  $$\sum_{i=1}^n v_i(\phi)^{-1}\,(Y_i - m(\beta, x_i))\,\frac{\partial}{\partial\beta_j} m(\beta, x_i) = 0, \quad j = 1, 2, \ldots, p. \tag{3}$$
- The problem is that the normal equations (3) are not easily solved for $\beta$.
- One could use the Newton-Raphson (NR) algorithm; instead, we are going to use something different.

IRLS III

- A new iterative route: let the current update be $\hat\beta_{n-1}$.
- Linearize the problem using a Taylor expansion:
  $$m(\beta, x_i) \approx m(\hat\beta_{n-1}, x_i) + (\beta - \hat\beta_{n-1})^T\left[\nabla_\beta\, m(\hat\beta_{n-1}, x_i)\right].$$
- Now solve the simpler problem:
  $$\hat\beta_n = \arg\min_\beta \sum_{i=1}^n v_i(\phi)^{-1}\left\{Y_i - m(\hat\beta_{n-1}, x_i) + \hat\beta_{n-1}^T\left[\nabla_\beta\, m(\hat\beta_{n-1}, x_i)\right] - \beta^T\left[\nabla_\beta\, m(\hat\beta_{n-1}, x_i)\right]\right\}^2.$$

IRLS IV

- The simpler problem can be solved with the following normal equations:
  $$\sum_{i=1}^n v_i(\phi)^{-1}\left\{Y_i - m(\hat\beta_{n-1}, x_i) + \hat\beta_{n-1}^T\left[\nabla_\beta\, m(\hat\beta_{n-1}, x_i)\right] - \beta^T\left[\nabla_\beta\, m(\hat\beta_{n-1}, x_i)\right]\right\}\frac{\partial}{\partial\beta_j} m(\hat\beta_{n-1}, x_i) = 0, \quad j = 1, 2, \ldots, p. \tag{4}$$
- Now take:
  $$(\hat X_{n-1})_{ij} = \frac{\partial}{\partial\beta_j} m(\hat\beta_{n-1}, x_i), \qquad
  (\hat W_{n-1})_{ij} = \begin{cases} v_i(\phi)^{-1} & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}, \qquad
  (\hat Y_{n-1})_i = Y_i - m(\hat\beta_{n-1}, x_i).$$

IRLS V

- Equation (4) amounts to solving (why?):
  $$\hat X_{n-1}^T \hat W_{n-1} \hat Y_{n-1} = \big(\hat X_{n-1}^T \hat W_{n-1} \hat X_{n-1}\big)\big(\hat\beta_n - \hat\beta_{n-1}\big)$$
  $$\Longrightarrow\; \hat\beta_n - \hat\beta_{n-1} = \big(\hat X_{n-1}^T \hat W_{n-1} \hat X_{n-1}\big)^{-1}\hat X_{n-1}^T \hat W_{n-1} \hat Y_{n-1}$$
  $$\Longrightarrow\; \hat\beta_n = \hat\beta_{n-1} + \big(\hat X_{n-1}^T \hat W_{n-1} \hat X_{n-1}\big)^{-1}\hat X_{n-1}^T \hat W_{n-1} \hat Y_{n-1}. \tag{5}$$
- The second term above looks like the WLS solution from regressing $\hat Y_{n-1}$ on $\hat X_{n-1}$ with weights $\hat W_{n-1}$, and we iterate this procedure, hence the name.
- The IRLS algorithm: start with a properly chosen initial $\hat\beta_0$ and apply the above updating scheme (until convergence) to get $\hat\beta_0 \to \hat\beta_1 \to \hat\beta_2 \to \cdots \to \hat\beta_{IRLS}$. A small sketch of this loop is given below.
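
A minimal Python sketch of update (5); the function name irls, its interface (m_fn returning the vector of means m(beta, x_i), grad_m_fn returning the n x p matrix of partial derivatives of m with respect to beta), and the convergence rule are illustrative choices, not from the slides:

```python
import numpy as np

def irls(y, v, m_fn, grad_m_fn, beta0, max_iter=100, tol=1e-8):
    """Iterate update (5): beta_n = beta_{n-1} + (Xh^T W Xh)^{-1} Xh^T W Yh.

    m_fn(beta)      -> (n,) vector of means m(beta, x_i)
    grad_m_fn(beta) -> (n, p) matrix with entries d m(beta, x_i) / d beta_j
    v               -> (n,) known variances v_i(phi); W = diag(1/v)
    """
    beta = np.asarray(beta0, dtype=float)
    w = 1.0 / np.asarray(v, dtype=float)
    for _ in range(max_iter):
        Xh = grad_m_fn(beta)                 # (X_hat)_{ij} at beta_{n-1}
        Yh = y - m_fn(beta)                  # working residuals (Y_hat)_i
        XtW = Xh.T * w
        step = np.linalg.solve(XtW @ Xh, XtW @ Yh)
        beta = beta + step
        if np.linalg.norm(step) < tol:       # stop when updates become negligible
            break
    return beta
```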

IRLS VI

- Note from equation (5) that the update looks like an NR-type update; this is a so-called Newton-Raphson-like algorithm.
- IRLS may or may not converge depending on the starting values, much like NR.

Example I

- Heteroskedastic Non-linear Least Squares: for $i = 1, 2, \ldots, n$ we have $Y_i \mid x_i \overset{\text{i.n.i.d.}}{\sim} \mathrm{Normal}_1(\theta_i, \sigma^2 k(x_i))$, for some known constant $\sigma^2$ and a known function $k(\cdot)$ with $k: \mathbb{R}^p \to (0, \infty)$.
- $\theta_i = E(Y_i \mid x_i) = m(\beta, x_i)$, for a known non-linear function $m(\cdot,\cdot)$ with first derivative.
- We want to estimate $\beta$. Here, for computing $\hat\beta_{IRLS}$ ($= \hat\beta_{MLE}$, why?), we will need:
  $$(\hat X_{n-1})_{ij} = \frac{\partial}{\partial\beta_j} m(\hat\beta_{n-1}, x_i), \qquad
  (\hat W_{n-1})_{ij} = \begin{cases} 1/(\sigma^2 k(x_i)) & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}, \qquad
  (\hat Y_{n-1})_i = Y_i - m(\hat\beta_{n-1}, x_i).$$
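
A usage sketch under this set-up, reusing the hypothetical irls function above; the choices m(beta, x) = exp(beta^T x), k(x) = 1 + ||x||^2, and the simulated numbers are illustrative assumptions only:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma2 = 500, 2, 0.1
X = rng.normal(scale=0.3, size=(n, p))
beta_true = np.array([0.8, -0.4])
k = 1.0 + np.sum(X**2, axis=1)
y = np.exp(X @ beta_true) + rng.normal(scale=np.sqrt(sigma2 * k))

m_fn = lambda b: np.exp(X @ b)                      # m(beta, x_i)
grad_m_fn = lambda b: np.exp(X @ b)[:, None] * X    # dm/dbeta_j = exp(beta^T x_i) x_ij
beta_irls = irls(y, v=sigma2 * k, m_fn=m_fn, grad_m_fn=grad_m_fn, beta0=np.zeros(p))
```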

IRLS and Scoring I

- Consider the Generalized Linear Model (GLM) set-up (a quick recap):
  - Random component: the $f(Y_i \mid \theta_i)$ come from a NEF, with $\theta_i = E(Y_i \mid x_i)$.
  - Systematic component: $\eta_i = \beta^T x_i$, also called the linear predictor.
  - Link function: an invertible function $g(\cdot)$ with first derivative such that $\eta_i = g(\theta_i)$.
- Let $\mathrm{Var}(Y_i \mid x_i) = v_i(\beta, \phi)$, for some known parameter $\phi$.
- We want to estimate $\beta$, and we are going to use scoring to find the MLE $\hat\beta_{MLE}$.

IRLS and Scoring II

- The log-likelihood and its derivative (the score):
  $$\sum_{i=1}^n \log f(Y_i \mid \theta_i) = \sum_{i=1}^n \{\log(h(Y_i)) + P(\theta_i)\,Y_i - Q(\theta_i)\}, \tag{6}$$
  $$\frac{\partial}{\partial\beta_j}\sum_{i=1}^n \log f(Y_i \mid \theta_i)
  = \sum_{i=1}^n \frac{\partial}{\partial\beta_j}\{P(\theta_i)\,Y_i - Q(\theta_i)\}
  = \sum_{i=1}^n v_i(\beta, \phi)^{-1}\,(Y_i - E(Y_i \mid x_i))\,d_i\,x_{ij} =: u_j, \text{ say (why?)},$$
  where $d_i := \partial\theta_i/\partial\eta_i$ for each $i$, and $d_i$ and $u_j$ are both functions of $\beta$.
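
Filling in the "why?": a short chain-rule sketch using the NEF facts from the earlier slide, $E(Y_i \mid x_i) = \theta_i = Q'(\theta_i)/P'(\theta_i)$ and $\mathrm{Var}(Y_i \mid x_i) = 1/P'(\theta_i) = v_i(\beta, \phi)$:
$$\frac{\partial}{\partial\beta_j}\{P(\theta_i)\,Y_i - Q(\theta_i)\}
= \big(P'(\theta_i)\,Y_i - Q'(\theta_i)\big)\frac{\partial\theta_i}{\partial\beta_j}
= P'(\theta_i)\,(Y_i - \theta_i)\,\frac{\partial\theta_i}{\partial\eta_i}\,\frac{\partial\eta_i}{\partial\beta_j}
= v_i(\beta, \phi)^{-1}\,\big(Y_i - E(Y_i \mid x_i)\big)\,d_i\,x_{ij}.$$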

IRLS and Scoring III

- If $v(\cdot,\cdot)$ does not depend on $\beta$ (assume this from now on), then the information matrix entries simplify to
  $$I(\beta)_{kj} = E\left[-\frac{\partial u_j}{\partial\beta_k}\right] = \sum_{i=1}^n v_i(\phi)^{-1}\,d_i^2\,x_{ij}\,x_{ik} \quad \text{(why?)}.$$
- In case $v(\cdot,\cdot)$ does depend on $\beta$, one needs to carefully compute the information matrix entries on a case-by-case basis.
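
Filling in the "why?": a short sketch, assuming as on the slide that $v_i$ does not depend on $\beta$:
$$-\frac{\partial u_j}{\partial\beta_k}
= -\frac{\partial}{\partial\beta_k}\sum_{i=1}^n v_i(\phi)^{-1}\,(Y_i - \theta_i)\,d_i\,x_{ij}
= \sum_{i=1}^n v_i(\phi)^{-1}\Big[\frac{\partial\theta_i}{\partial\beta_k}\,d_i - (Y_i - \theta_i)\,\frac{\partial d_i}{\partial\beta_k}\Big]x_{ij},$$
and since $\partial\theta_i/\partial\beta_k = d_i\,x_{ik}$ while $E(Y_i - \theta_i \mid x_i) = 0$, taking expectations kills the second term and leaves
$$I(\beta)_{kj} = E\left[-\frac{\partial u_j}{\partial\beta_k}\right] = \sum_{i=1}^n v_i(\phi)^{-1}\,d_i^2\,x_{ij}\,x_{ik}.$$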

IRLS and Scoring IV

- Define:
  $$(X)_{ij} = x_{ij}, \qquad
  (\hat W_{n-1})_{ij} = \begin{cases} v_i(\phi)^{-1}\,d_i^2(\hat\beta_{n-1}) & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}, \qquad
  (\hat R_{n-1})_i = \big(Y_i - g^{-1}(\hat\beta_{n-1}^T x_i)\big)\big/ d_i(\hat\beta_{n-1}).$$
- So we have (why?):
  $$I(\hat\beta_{n-1}) = X^T \hat W_{n-1} X, \qquad
  u(\hat\beta_{n-1}) = \left[\sum_{i=1}^n v_i(\phi)^{-1}\big(Y_i - g^{-1}(\hat\beta_{n-1}^T x_i)\big)\,d_i(\hat\beta_{n-1})\,x_{ij}\right]_{j=1}^p = X^T \hat W_{n-1} \hat R_{n-1}.$$

IRLS and Scoring V

- Now the scoring update satisfies:
  $$\hat\beta_n = \hat\beta_{n-1} + \left[I(\hat\beta_{n-1})\right]^{-1} u(\hat\beta_{n-1})
  \;\Longrightarrow\; \hat\beta_n = \hat\beta_{n-1} + \big(X^T \hat W_{n-1} X\big)^{-1} X^T \hat W_{n-1} \hat R_{n-1}.$$
- So, the scoring updates for the MLE reduce to IRLS updates for NEF densities, as sketched below.
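
A minimal Python sketch of the scoring-as-IRLS update above; the function name glm_irls and its interface (ginv for the inverse link, dtheta_deta for d theta / d eta, var_fn for v_i) are illustrative choices, not from the slides:

```python
import numpy as np

def glm_irls(X, y, ginv, dtheta_deta, var_fn, beta0, max_iter=50, tol=1e-8):
    """Fisher scoring written as IRLS for a NEF GLM (a sketch of the update above).

    ginv(eta)        -> theta = g^{-1}(eta), the mean
    dtheta_deta(eta) -> d_i = d theta / d eta evaluated at eta
    var_fn(theta)    -> v_i, the variance of Y_i given x_i
    """
    X = np.asarray(X, dtype=float)
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        eta = X @ beta
        theta = ginv(eta)
        d = dtheta_deta(eta)
        w = d**2 / var_fn(theta)             # (W_hat)_ii = v_i^{-1} d_i^2
        r = (y - theta) / d                  # working residuals (R_hat)_i
        XtW = X.T * w
        step = np.linalg.solve(XtW @ X, XtW @ r)
        beta = beta + step
        if np.linalg.norm(step) < tol:
            break
    return beta
```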

Example I

- Logistic Regression: for $i = 1, 2, \ldots, n$ we have $Y_i \mid x_i \overset{\text{i.n.i.d.}}{\sim} \mathrm{Bernoulli}(\theta_i)$, with $\eta_i = \beta^T x_i$.
- Also, $\eta_i = g(\theta_i) = \log\left(\frac{\theta_i}{1 - \theta_i}\right)$, the well-known logit transform.
- Note that if we take $\eta_i = g(\theta_i) = \Phi^{-1}(\theta_i)$, the well-known probit transform, then we will have the probit regression model (here $\Phi^{-1}(\cdot)$ is the inverse cdf of the $\mathrm{Normal}_1(0, 1)$ distribution).
- What will the expressions for $\hat W_{n-1}$ and $\hat Y_{n-1}$ be in this case?
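
One standard answer to the closing question for the logit link: $d_i = \partial\theta_i/\partial\eta_i = \theta_i(1 - \theta_i) = v_i$, so $(\hat W_{n-1})_{ii} = \theta_i(1 - \theta_i)$ and the working residuals (playing the role of $\hat Y_{n-1}$, the scoring slides' $\hat R_{n-1}$) are $(Y_i - \theta_i)/(\theta_i(1 - \theta_i))$. The sketch below exercises this through the hypothetical glm_irls function above; the simulated data are illustrative only:

```python
import numpy as np

sigmoid = lambda eta: 1.0 / (1.0 + np.exp(-eta))        # g^{-1} for the logit link

rng = np.random.default_rng(2)
n, p = 300, 2
X = rng.normal(size=(n, p))
beta_true = np.array([1.5, -1.0])
y = rng.binomial(1, sigmoid(X @ beta_true)).astype(float)

beta_logit = glm_irls(
    X, y,
    ginv=sigmoid,
    dtheta_deta=lambda eta: sigmoid(eta) * (1.0 - sigmoid(eta)),   # d_i
    var_fn=lambda theta: theta * (1.0 - theta),                    # v_i
    beta0=np.zeros(p),
)
```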

References

[1] Edwin L. Bradley. The equivalence of maximum likelihood and weighted least squares estimates in the exponential family. Journal of the American Statistical Association, 68:199-200, 1973.

[2] A. Charnes, E. L. Frome, and P. L. Yu. The equivalence of generalized least squares and maximum likelihood estimates in the exponential family. Journal of the American Statistical Association, 71:169-171, 1976.

[3] P. J. Green. Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives (with discussion). Journal of the Royal Statistical Society, Series B (Methodological), 46:149-192, 1984.