Weighted Least Squares I

- for $i = 1, 2, \ldots, n$ we have (see Bradley [1]) data: $Y_i \mid x_i \overset{\text{i.n.i.d.}}{\sim} f(y_i \mid \theta_i)$, where $\theta_i = E(Y_i \mid x_i)$
- covariates: $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})^T$; let $X_{n \times p}$ be the matrix of covariates with rows $x_i^T$
- parameter of interest: $\beta = (\beta_1, \beta_2, \ldots, \beta_p)^T$, $p < n$, with $\theta_i = E(Y_i \mid x_i) = \beta^T x_i$
- $Var(Y_i \mid x_i) = v_i(\phi)$ has a known form which does not depend on $\beta$; the $v_i(\phi)$'s are not all the same and $\phi$ is known
- we want to estimate $\beta$
- ignoring the underlying density, one could use the Weighted Least Squares estimator:
$$\hat{\beta}_{WLS} = \arg\min_\beta \sum_{i=1}^n v_i(\phi)^{-1} \left( Y_i - \beta^T x_i \right)^2$$
WLS II

- one could also use the Maximum Likelihood Estimator:
$$\hat{\beta}_{MLE} = \arg\max_\beta \log(L(\beta)) = \arg\max_\beta \sum_{i=1}^n \log f(Y_i \mid \beta^T x_i)$$
- for WLS we solve the following normal equations:
$$\sum_{i=1}^n v_i(\phi)^{-1} \left( Y_i - \beta^T x_i \right) x_{ij} = 0, \quad j = 1, 2, \ldots, p \qquad (1)$$
- for MLE we solve the following system of equations:
$$\sum_{i=1}^n \frac{\partial}{\partial \beta_j} \log f(Y_i \mid \beta^T x_i) = 0, \quad j = 1, 2, \ldots, p \qquad (2)$$
- for certain choices of $f(\cdot)$, $\hat{\beta}_{WLS} = \hat{\beta}_{MLE}$; what are those?
NEF of Distributions: I

- NEF stands for Natural Exponential Family
- a NEF density looks like:
$$f(y \mid \theta) = h(y) \exp[P(\theta) y - Q(\theta)]$$
where $\theta = E(Y)$ and the range of $Y$ does not depend on $\theta$
- consider $\int f(y \mid \theta)\, dy = 1$, i.e. $\int h(y) \exp[P(\theta) y - Q(\theta)]\, dy = 1$, and assume differentiation under the integral sign is possible
- apply $\frac{d}{d\theta}$ to both sides of the above to get: $\theta = E(Y) = \frac{Q'(\theta)}{P'(\theta)}$ (why?)
- apply $\frac{d^2}{d\theta^2}$ to both sides of the above to get: $Var(Y) = \frac{1}{P'(\theta)}$ (why?)
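To make the two identities concrete, here is a quick check with the Poisson distribution (an illustrative example, not from the original slides), written in the NEF form above:

```latex
% Poisson(\theta):  f(y \mid \theta) = e^{-\theta}\,\theta^{y}/y!
%                 = \underbrace{\tfrac{1}{y!}}_{h(y)}
%                   \exp[\,\underbrace{(\log\theta)}_{P(\theta)}\,y
%                        - \underbrace{\theta}_{Q(\theta)}\,]
\[
  E(Y) = \frac{Q'(\theta)}{P'(\theta)} = \frac{1}{1/\theta} = \theta,
  \qquad
  Var(Y) = \frac{1}{P'(\theta)} = \frac{1}{1/\theta} = \theta,
\]
% both of which match the familiar Poisson mean and variance.
```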
WLS and MLE I

- if the $f(Y_i \mid \beta^T x_i)$'s all come from a NEF, then $\hat{\beta}_{WLS} = \hat{\beta}_{MLE}$
- sketch of proof:
$$
\begin{aligned}
\frac{\partial}{\partial \beta_j} \sum_{i=1}^n \log f(Y_i \mid \beta^T x_i)
&= \frac{\partial}{\partial \beta_j} \sum_{i=1}^n \{\log(h(Y_i)) + P(\beta^T x_i) Y_i - Q(\beta^T x_i)\} \\
&= \sum_{i=1}^n \{P'(\theta_i)\, x_{ij} Y_i - Q'(\theta_i)\, x_{ij}\} \\
&= \sum_{i=1}^n P'(\theta_i) \left( Y_i - \frac{Q'(\theta_i)}{P'(\theta_i)} \right) x_{ij} \\
&= \sum_{i=1}^n v_i(\phi)^{-1} (Y_i - E(Y_i \mid x_i))\, x_{ij} \quad \text{(since } v_i(\phi) = Var(Y_i \mid x_i) = 1/P'(\theta_i)\text{)} \\
&= \sum_{i=1}^n v_i(\phi)^{-1} \left( Y_i - \beta^T x_i \right) x_{ij}
\end{aligned}
$$
WLS and MLE II

- so equation (2) boils down to solving:
$$\sum_{i=1}^n v_i(\phi)^{-1} \left( Y_i - \beta^T x_i \right) x_{ij} = 0, \quad j = 1, 2, \ldots, p$$
- the above is exactly the same as equation (1), Q.E.D.
- note the solutions to the above equations also satisfy (how?):
$$\left( X^T W X \right) \hat{\beta}_{WLS} = X^T W Y \implies \hat{\beta}_{WLS} = \left( X^T W X \right)^{-1} X^T W Y$$
where $W$ is diagonal with $(W)_{ii} = v_i(\phi)^{-1}$
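To make the closed form concrete, here is a minimal numerical sketch (not from the slides); the data, the true coefficients, and the heteroskedastic variance are made-up illustrations:

```python
import numpy as np

def wls(X, Y, w):
    """Weighted least squares: solves (X^T W X) beta = X^T W Y,
    where W = diag(w) holds the weights w_i = v_i(phi)^{-1}."""
    XtW = X.T * w                      # scales column i of X.T by w[i], i.e. X^T W
    return np.linalg.solve(XtW @ X, XtW @ Y)

# Illustrative data (assumed, not from the slides):
rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
v = np.exp(X[:, 0])                    # a made-up heteroskedastic variance v_i(phi)
Y = X @ beta_true + rng.normal(scale=np.sqrt(v))

beta_wls = wls(X, Y, w=1.0 / v)
print(beta_wls)                        # should be close to beta_true
```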
Example I

- Heteroskedastic Least Squares: for $i = 1, 2, \ldots, n$ we have $Y_i \mid x_i \overset{\text{i.n.i.d.}}{\sim} Normal_1(\theta_i, \sigma^2 k(x_i))$, for some known constant $\sigma^2$ and a known function $k(\cdot)$ with $k : \mathbb{R}^p \to (0, \infty)$
- $\theta_i = E(Y_i \mid x_i) = \beta^T x_i$; we want to estimate $\beta$
- so we take diagonal $W$ such that $(W)_{ii} = 1/(\sigma^2 k(x_i))$ and
$$\hat{\beta}_{WLS} = \left( X^T W X \right)^{-1} X^T W Y$$
- now $\hat{\beta}_{WLS} = \hat{\beta}_{MLE}$, because the Normal distribution (with known variance) comes from a NEF:
$$Normal_1(\theta, \sigma^2 k(x_i); y) = \underbrace{\frac{1}{\sqrt{2\pi\sigma^2 k(x_i)}} \exp\left[-\frac{y^2}{2\sigma^2 k(x_i)}\right]}_{h(y)} \exp\Bigg[\underbrace{\frac{\theta}{\sigma^2 k(x_i)}}_{P(\theta)}\, y - \underbrace{\frac{\theta^2}{2\sigma^2 k(x_i)}}_{Q(\theta)}\Bigg]$$
Iteratively Reweighted Least Squares I

- suppose in the previous setting, for a known non-linear function $m(\cdot, \cdot)$ with first derivative (in $\beta$), we have: $\theta_i = m(\beta, x_i)$
- we want to estimate $\beta$
- ignoring the underlying density, one uses the Iteratively Reweighted Least Squares estimator:
$$\hat{\beta}_{IRLS} = \arg\min_\beta \sum_{i=1}^n v_i(\phi)^{-1} \left( Y_i - m(\beta, x_i) \right)^2$$
- one can show that under this set up, as well, $\hat{\beta}_{IRLS} = \hat{\beta}_{MLE}$
- the proof is very similar to the proof of $\hat{\beta}_{WLS} = \hat{\beta}_{MLE}$ which we did before; it is left as an assignment problem
IRLS II

- here we need to solve the following normal equations:
$$\sum_{i=1}^n v_i(\phi)^{-1} \left( Y_i - m(\beta, x_i) \right) \frac{\partial}{\partial \beta_j} m(\beta, x_i) = 0, \quad j = 1, 2, \ldots, p \qquad (3)$$
- the problem is that the normal equations (3) are not easily solved for $\beta$
- one could use the NR (Newton-Raphson) algorithm; instead we are going to use something different
IRLS III

- a new iterative route: let the current update be $\hat{\beta}_{n-1}$
- linearize the problem using a Taylor expansion:
$$m(\beta, x_i) \approx m(\hat{\beta}_{n-1}, x_i) + \left( \beta - \hat{\beta}_{n-1} \right)^T \left[ \nabla_\beta m(\hat{\beta}_{n-1}, x_i) \right]$$
- now solve the simpler problem:
$$\hat{\beta}_n = \arg\min_\beta \sum_{i=1}^n v_i(\phi)^{-1} \left\{ Y_i - m(\hat{\beta}_{n-1}, x_i) + \hat{\beta}_{n-1}^T \left[ \nabla_\beta m(\hat{\beta}_{n-1}, x_i) \right] - \beta^T \left[ \nabla_\beta m(\hat{\beta}_{n-1}, x_i) \right] \right\}^2$$
IRLS IV

- the simpler problem can be solved with the following normal equations:
$$\sum_{i=1}^n v_i(\phi)^{-1} \left\{ Y_i - m(\hat{\beta}_{n-1}, x_i) + \left( \hat{\beta}_{n-1} - \beta \right)^T \left[ \nabla_\beta m(\hat{\beta}_{n-1}, x_i) \right] \right\} \frac{\partial}{\partial \beta_j} m(\hat{\beta}_{n-1}, x_i) = 0, \quad j = 1, 2, \ldots, p \qquad (4)$$
- now take:
$$(\hat{X}_{n-1})_{ij} = \frac{\partial}{\partial \beta_j} m(\hat{\beta}_{n-1}, x_i), \qquad (\hat{W}_{n-1})_{ij} = \begin{cases} v_i(\phi)^{-1} & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}, \qquad (\hat{Y}_{n-1})_i = Y_i - m(\hat{\beta}_{n-1}, x_i)$$
IRLS V

- equation (4) amounts to solving (why?):
$$
\begin{aligned}
\hat{X}_{n-1}^T \hat{W}_{n-1} \hat{Y}_{n-1} &= \left( \hat{X}_{n-1}^T \hat{W}_{n-1} \hat{X}_{n-1} \right) \left( \hat{\beta}_n - \hat{\beta}_{n-1} \right) \\
\implies \hat{\beta}_n - \hat{\beta}_{n-1} &= \left( \hat{X}_{n-1}^T \hat{W}_{n-1} \hat{X}_{n-1} \right)^{-1} \hat{X}_{n-1}^T \hat{W}_{n-1} \hat{Y}_{n-1} \\
\implies \hat{\beta}_n &= \hat{\beta}_{n-1} + \left( \hat{X}_{n-1}^T \hat{W}_{n-1} \hat{X}_{n-1} \right)^{-1} \hat{X}_{n-1}^T \hat{W}_{n-1} \hat{Y}_{n-1} \qquad (5)
\end{aligned}
$$
- so the second term above looks like the WLS solution of regressing $\hat{Y}_{n-1}$ on $\hat{X}_{n-1}$ with weights $\hat{W}_{n-1}$, and we iterate this procedure; hence the name
- the IRLS algorithm: start with a properly chosen initial $\hat{\beta}_0$ and apply the above updating scheme (until convergence) to get $\hat{\beta}_0 \to \hat{\beta}_1 \to \hat{\beta}_2 \to \cdots \to \hat{\beta}_{IRLS}$; a code sketch follows below
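Here is a minimal sketch (not from the slides) of the update (5); the argument names, the convergence tolerance, and the iteration cap are illustrative assumptions:

```python
import numpy as np

def irls(Y, x, m, grad_m, w, beta0, tol=1e-8, max_iter=100):
    """IRLS via update (5): beta_n = beta_{n-1} + (X^T W X)^{-1} X^T W Yhat,
    where X is the Jacobian of m at beta_{n-1} and Yhat the current residuals.
    m(beta, x) -> (n,) fitted means; grad_m(beta, x) -> (n, p) Jacobian;
    w -> (n,) weights v_i(phi)^{-1}."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        Xhat = grad_m(beta, x)            # (Xhat_{n-1})_{ij} = dm(beta_{n-1}, x_i)/dbeta_j
        Yhat = Y - m(beta, x)             # (Yhat_{n-1})_i = Y_i - m(beta_{n-1}, x_i)
        XtW = Xhat.T * w                  # Xhat^T What, with What = diag(w)
        step = np.linalg.solve(XtW @ Xhat, XtW @ Yhat)
        beta = beta + step                # the update (5)
        if np.linalg.norm(step) < tol:    # stop when the step is negligible
            break
    return beta
```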
IRLS VI

- note from equation (5) that this looks like an NR-type update; it is a so-called "Newton-Raphson like" algorithm
- IRLS may or may not converge depending on the starting values, much like NR
Example I

- Heteroskedastic Non-linear Least Squares: for $i = 1, 2, \ldots, n$ we have $Y_i \mid x_i \overset{\text{i.n.i.d.}}{\sim} Normal_1(\theta_i, \sigma^2 k(x_i))$, for some known constant $\sigma^2$ and a known function $k(\cdot)$ with $k : \mathbb{R}^p \to (0, \infty)$
- $\theta_i = E(Y_i \mid x_i) = m(\beta, x_i)$, for a known non-linear function $m(\cdot, \cdot)$ with first derivative
- we want to estimate $\beta$
- here, for computing $\hat{\beta}_{IRLS}$ ($= \hat{\beta}_{MLE}$, why?), we will need:
$$(\hat{X}_{n-1})_{ij} = \frac{\partial}{\partial \beta_j} m(\hat{\beta}_{n-1}, x_i), \qquad (\hat{W}_{n-1})_{ij} = \begin{cases} 1/(\sigma^2 k(x_i)) & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}, \qquad (\hat{Y}_{n-1})_i = Y_i - m(\hat{\beta}_{n-1}, x_i)$$
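For instance, with the made-up mean function $m(\beta, x_i) = \exp(\beta^T x_i)$ and $k(x_i) \equiv 1$, the `irls` sketch given after IRLS V could be called as follows; all names and data here are illustrative assumptions, not from the slides:

```python
import numpy as np

# Hypothetical mean function m(beta, x_i) = exp(x_i^T beta) and its Jacobian:
m = lambda beta, X: np.exp(X @ beta)
grad_m = lambda beta, X: np.exp(X @ beta)[:, None] * X   # dm_i/dbeta_j = exp(x_i^T beta) x_ij

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
beta_true = np.array([0.5, -0.3])
sigma2 = 0.25                                            # known sigma^2; k(x_i) = 1 here
Y = m(beta_true, X) + rng.normal(scale=np.sqrt(sigma2), size=200)

# Constant weights 1/(sigma^2 k(x_i)); requires the irls() sketch defined earlier.
beta_irls = irls(Y, X, m, grad_m, w=np.full(200, 1.0 / sigma2), beta0=np.zeros(2))
print(beta_irls)                                         # should be close to beta_true
```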
IRLS and Scoring I

- consider the Generalized Linear Model (GLM) set up (a quick recap):
- random component: $f(y_i \mid \theta_i)$ comes from a NEF, with $\theta_i = E(Y_i \mid x_i)$
- systematic component: call $\eta_i = \beta^T x_i$, also called the linear predictor
- link function: an invertible function $g(\cdot)$, with first derivative, such that $\eta_i = g(\theta_i)$
- let $Var(Y_i \mid x_i) = v_i(\beta, \phi)$, for some known parameter $\phi$
- we want to estimate $\beta$; we are going to use scoring to find the MLE $\hat{\beta}_{MLE}$
IRLS and Scoring II

- the log likelihood and its derivative, or the score:
$$\sum_{i=1}^n \log(f(Y_i \mid \theta_i)) = \sum_{i=1}^n \{\log(h(Y_i)) + P(\theta_i) Y_i - Q(\theta_i)\} \qquad (6)$$
$$\frac{\partial}{\partial \beta_j} \sum_{i=1}^n \log(f(Y_i \mid \theta_i)) = \sum_{i=1}^n \frac{\partial}{\partial \beta_j} \{P(\theta_i) Y_i - Q(\theta_i)\} = \sum_{i=1}^n v_i(\beta, \phi)^{-1} (Y_i - E(Y_i \mid x_i))\, d_i\, x_{ij} = u_j, \text{ say (why?)}$$
- here $d_i := \frac{\partial \theta_i}{\partial \eta_i}$ for each $i$, and the $d_i$'s and $u_j$'s are both functions of $\beta$
IRLS and Scoring III

- if $v(\cdot, \cdot)$ does not depend on $\beta$ (assume this from now on), then the information matrix entries simplify to:
$$I(\beta)_{kj} = E\left[ -\frac{\partial u_j}{\partial \beta_k} \right] = \sum_{i=1}^n v_i(\phi)^{-1}\, d_i^2\, x_{ij} x_{ik} \quad \text{(why?)}$$
- in case $v(\cdot, \cdot)$ does depend on $\beta$, one needs to carefully compute the information matrix entries on a case-by-case basis
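One way to see the "(why?)" above (a sketch filling in a step the slides leave as an exercise): differentiate $u_j$ and take expectations.

```latex
% Since \partial\theta_i/\partial\beta_k = d_i x_{ik} and v_i does not depend on \beta,
\[
  \frac{\partial u_j}{\partial \beta_k}
  = \sum_{i=1}^n v_i(\phi)^{-1}
    \Big\{ -\,d_i x_{ik}\; d_i x_{ij}
           + (Y_i - \theta_i)\,\frac{\partial d_i}{\partial \beta_k}\, x_{ij} \Big\},
\]
% and the second term vanishes in expectation because E(Y_i - \theta_i \mid x_i) = 0, so
\[
  I(\beta)_{kj} = E\Big[-\frac{\partial u_j}{\partial \beta_k}\Big]
  = \sum_{i=1}^n v_i(\phi)^{-1}\, d_i^2\, x_{ij} x_{ik}.
\]
```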
IRLS and Scoring IV

- define:
$$(X)_{ij} = x_{ij}, \qquad (\hat{W}_{n-1})_{ij} = \begin{cases} v_i(\phi)^{-1}\, d_i^2(\hat{\beta}_{n-1}) & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}, \qquad (\hat{R}_{n-1})_i = \left( Y_i - g^{-1}(\hat{\beta}_{n-1}^T x_i) \right) \big/ d_i(\hat{\beta}_{n-1})$$
- so we have (why?):
$$I(\hat{\beta}_{n-1}) = X^T \hat{W}_{n-1} X$$
$$u(\hat{\beta}_{n-1})_j = \sum_{i=1}^n v_i(\phi)^{-1} \left( Y_i - g^{-1}(\hat{\beta}_{n-1}^T x_i) \right) d_i(\hat{\beta}_{n-1})\, x_{ij}, \quad \text{i.e.} \quad u(\hat{\beta}_{n-1}) = X^T \hat{W}_{n-1} \hat{R}_{n-1}$$
IRLS and Scoring V

- now the scoring update satisfies:
$$\hat{\beta}_n = \hat{\beta}_{n-1} + \left[ I(\hat{\beta}_{n-1}) \right]^{-1} u(\hat{\beta}_{n-1}) = \hat{\beta}_{n-1} + \left( X^T \hat{W}_{n-1} X \right)^{-1} X^T \hat{W}_{n-1} \hat{R}_{n-1}$$
- so, the scoring updates for the MLE reduce to IRLS updates for the NEF densities
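Here is a minimal sketch of the scoring update written as IRLS (assumptions: the inverse link, its derivative, and the variance function are passed in by the caller; all names are illustrative, not from the slides):

```python
import numpy as np

def glm_scoring(Y, X, g_inv, d, v, beta0, tol=1e-8, max_iter=50):
    """Fisher scoring as IRLS for a GLM with NEF random component.
    g_inv(eta) -> theta = g^{-1}(eta);  d(eta) -> dtheta/deta;
    v(theta)  -> Var(Y_i | x_i), assumed free of beta (phi absorbed)."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        eta = X @ beta
        theta = g_inv(eta)                   # fitted means g^{-1}(beta^T x_i)
        di = d(eta)                          # d_i = dtheta_i/deta_i
        w = di**2 / v(theta)                 # (What)_{ii} = v_i^{-1} d_i^2
        R = (Y - theta) / di                 # (Rhat)_i = (Y_i - theta_i)/d_i
        XtW = X.T * w                        # X^T What
        step = np.linalg.solve(XtW @ X, XtW @ R)
        beta = beta + step                   # scoring update from IRLS and Scoring V
        if np.linalg.norm(step) < tol:
            break
    return beta
```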
Example I

- Logistic Regression: for $i = 1, 2, \ldots, n$ we have $Y_i \mid x_i \overset{\text{i.n.i.d.}}{\sim} Bernoulli(\theta_i)$
- we have $\eta_i = \beta^T x_i$; also, $\eta_i = g(\theta_i) = \log\left( \frac{\theta_i}{1 - \theta_i} \right)$, the well known logit transform
- note if we take $\eta_i = g(\theta_i) = \Phi^{-1}(\theta_i)$, the well known probit transform, then we will have the probit regression model (here $\Phi^{-1}(\cdot)$ is the inverse cdf of the $Normal_1(0, 1)$ distribution)
- what will be the expressions for $\hat{W}_{n-1}$ and $\hat{R}_{n-1}$ in this case? (one worked-out possibility for the logit link is sketched below)
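For the logit link specifically, one possible answer (a sketch, not the slides' official solution): $\theta_i = g^{-1}(\eta_i) = 1/(1 + e^{-\eta_i})$, $d_i = \theta_i(1 - \theta_i)$, and $v_i = \theta_i(1 - \theta_i)$ with $\phi = 1$, so $(\hat{W}_{n-1})_{ii} = v_i^{-1} d_i^2 = \theta_i(1 - \theta_i)$ and $(\hat{R}_{n-1})_i = (Y_i - \theta_i)/(\theta_i(1 - \theta_i))$. Plugging these into the `glm_scoring` sketch above (with made-up data):

```python
import numpy as np

# Logit link: theta = 1/(1+e^{-eta}), d_i = theta_i (1 - theta_i),
# and v(theta) = theta (1 - theta) for Bernoulli (phi = 1).
g_inv = lambda eta: 1.0 / (1.0 + np.exp(-eta))
d = lambda eta: g_inv(eta) * (1.0 - g_inv(eta))
v = lambda theta: theta * (1.0 - theta)

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2))
beta_true = np.array([1.0, -1.5])
Y = rng.binomial(1, g_inv(X @ beta_true)).astype(float)

beta_hat = glm_scoring(Y, X, g_inv, d, v, beta0=np.zeros(2))
print(beta_hat)   # should be close to beta_true
```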
References

[1] Edwin L. Bradley. The equivalence of maximum likelihood and weighted least squares estimates in the exponential family. Journal of the American Statistical Association, 68:199-200, 1973.

[2] A. Charnes, E. L. Frome, and P. L. Yu. The equivalence of generalized least squares and maximum likelihood estimates in the exponential family. Journal of the American Statistical Association, 71:169-171, 1976.

[3] P. J. Green. Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives (with discussion). Journal of the Royal Statistical Society, Series B, Methodological, 46:149-192, 1984.