Weighting

We have seen that if $E(Y) = X\beta$ and $V(Y) = \sigma^2 G$, where $G$ is known, the model can be rewritten as a linear model. This is known as generalized least squares or, if $G$ is diagonal with $\mathrm{trace}(G) = n$, as weighted least squares. Now consider the case where $E(Y) = \mu = X\beta$ and $V(Y) = \sigma^2 G$, where $G$ is diagonal but unknown. We typically assume a parametric form relating $g_i$ to $\mu_i$ or $x_i$, where $g_i$ is the $i$th diagonal element of $G$. For example, we might assume that $g_i = x_i^\theta$, where $\theta$ is an unknown parameter.
We usually fit such models using iterated weighted least squares:

1. start with the ordinary least squares estimates;
2. use the residuals to estimate $\theta$ and hence the $g_i$;
3. use these estimates in a weighted least squares fit to re-estimate $\beta$;
4. iterate steps 2 and 3 until convergence, giving $\hat\beta_G$.

It can be shown that, asymptotically, $\hat\beta_G \sim N\big(\beta, \sigma^2 (X^\top G^{-1} X)^{-1}\big)$. Note, however, that this variance depends on $G$, which is a function of $\theta$, and not on $\hat G$, which is a function of $\hat\theta$. The quality of the asymptotic approximation therefore depends on how well we estimate $\theta$. In particular, standard errors are usually underestimated if a simple plug-in estimator is used.
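As a concrete illustration, here is a minimal NumPy sketch of the iteration, assuming a single positive covariate and the variance model $g_i = x_i^\theta$; the estimator of $\theta$ (a regression of $\log r_i^2$ on $\log x_i$) and all names are choices made for this example, not part of the notes.

```python
import numpy as np

def iterated_wls(y, x, n_iter=20, tol=1e-8):
    """Iterated WLS for E(Y_i) = b0 + b1*x_i with V(Y_i) = sigma^2 * x_i**theta."""
    X = np.column_stack([np.ones_like(x), x])      # design matrix with intercept
    beta = np.linalg.lstsq(X, y, rcond=None)[0]    # step 1: start from OLS
    theta = 0.0
    for _ in range(n_iter):
        r = y - X @ beta                           # step 2: residuals;
        # E(log r_i^2) is roughly constant + theta*log x_i, motivating this fit
        theta = np.polyfit(np.log(x), np.log(r**2 + 1e-12), 1)[0]
        w = x**(-theta)                            # weights 1/g_i
        Xtw = X.T * w
        beta_new = np.linalg.solve(Xtw @ X, Xtw @ y)  # step 3: WLS re-fit
        if np.max(np.abs(beta_new - beta)) < tol:  # step 4: iterate to convergence
            beta = beta_new
            break
        beta = beta_new
    return beta, theta
```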
Transformations

The famous Box-Cox family of transformations is given by
\[
Y^{(\lambda)} = \begin{cases} (Y^\lambda - 1)/\lambda & \text{if } \lambda \neq 0; \\ \log Y & \text{if } \lambda = 0. \end{cases}
\]
The assumed model is $Y_i^{(\lambda)} \sim N(x_i^\top\beta, \sigma^2)$, with the $Y_i$ independent. The main purpose of this transformation is to ensure symmetry (normality) in the distribution of $Y$. In a randomization framework, it ensures additivity of unit and treatment effects.
The best-fitting $\lambda$ is chosen by maximum likelihood and should stabilise the variance. The MLEs are easily obtained by an iterative maximisation over $\lambda$, finding the least squares estimates of $\beta$ at each $\lambda$, i.e. we minimise
\[
S(\lambda) = \sum_{i=1}^n \{Y_i^{(\lambda)} - x_i^\top\beta\}^2.
\]
It can be shown that, for a fixed $\lambda$, $n\log\{S(\lambda)/S(\hat\lambda)\}$ is asymptotically $\chi^2_1$. This means that we can conduct all the usual inference, but with one degree of freedom lost for estimating $\lambda$.
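A minimal sketch of this profile-likelihood search, assuming positive responses y and a design matrix X held as NumPy arrays; the grid of $\lambda$ values and the Jacobian term $(\lambda - 1)\sum_i \log Y_i$ in the profile log-likelihood are standard details not spelled out above.

```python
import numpy as np

def boxcox_transform(y, lam):
    """Box-Cox transform: (y**lam - 1)/lam, or log(y) when lam == 0."""
    return np.log(y) if lam == 0 else (y**lam - 1.0) / lam

def profile_loglik(lam, y, X):
    """Profile log-likelihood of lambda, with beta and sigma^2 profiled out."""
    z = boxcox_transform(y, lam)
    beta = np.linalg.lstsq(X, z, rcond=None)[0]
    rss = np.sum((z - X @ beta)**2)                # S(lambda)
    n = len(y)
    # -n/2 * log(S/n), plus the Jacobian of the transformation
    return -0.5 * n * np.log(rss / n) + (lam - 1.0) * np.sum(np.log(y))

def boxcox_mle(y, X, grid=np.linspace(-2, 2, 401)):
    """Grid search for the MLE of lambda; returns lambda-hat and the profile."""
    ll = np.array([profile_loglik(lam, y, X) for lam in grid])
    return grid[np.argmax(ll)], ll
```

An approximate 95% confidence set for $\lambda$ collects those $\lambda$ whose profile log-likelihood lies within $\chi^2_{1,0.95}/2 \approx 1.92$ of the maximum, matching the $\chi^2_1$ result above.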
Both weighting and transformation rely on the assumption that the variance increases (or decreases) with $\mu$ or with $x$. Whereas weighting assumes a symmetric distribution, transformation assumes that the distribution becomes more asymmetric the smaller the variance is. It is also possible to combine transformations and weighting (if the data are sufficiently rich) to correct simultaneously for heteroskedasticity and asymmetry. Other families of transformations are possible.
It is also possible to transform both sides, i.e.
\[
Y_i^{(\lambda)} \sim N\big((x_i^\top\beta)^{(\lambda)}, \sigma^2\big).
\]
This corrects for asymmetry while maintaining the functional relationship between $\mu$ and the explanatory variables. It makes most sense when the functional form of $\mu$ is known (and so is much more common with nonlinear functional forms).
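A sketch of a transform-both-sides fit by maximum likelihood, jointly over $(\lambda, \beta)$; the linear mean function, the starting values, and the use of Nelder-Mead are illustrative assumptions, and boxcox_transform is reused from the sketch above.

```python
import numpy as np
from scipy.optimize import minimize

def tbs_negloglik(params, y, X):
    """Negative profile log-likelihood for the transform-both-sides model."""
    lam, beta = params[0], params[1:]
    mu = X @ beta
    if np.any(mu <= 0):                        # the transform needs a positive mean
        return np.inf
    resid = boxcox_transform(y, lam) - boxcox_transform(mu, lam)
    n = len(y)
    # sigma^2 profiled out; (lam - 1)*sum(log y) is the Jacobian for Y
    return 0.5 * n * np.log(np.sum(resid**2) / n) - (lam - 1.0) * np.sum(np.log(y))

def tbs_fit(y, X, beta0, lam0=1.0):
    """Maximise the likelihood over (lambda, beta) by direct search."""
    start = np.concatenate([[lam0], np.asarray(beta0, dtype=float)])
    res = minimize(tbs_negloglik, start, args=(y, X), method="Nelder-Mead")
    return res.x[0], res.x[1:]
```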
Components of Variance

In many studies there are several nested levels of variation, e.g. schools and pupils. It is natural to consider each of these as contributing a variance component, e.g. for group $i$ and unit $j$ we have
\[
Y_{ij} = x_{ij}^\top\beta + \delta_i + \epsilon_{ij},
\]
where $\delta_i \sim N(0, \sigma_1^2)$, $\epsilon_{ij} \sim N(0, \sigma^2)$ and all random variables are independent. This can be rewritten as $Y \sim N_n(X\beta, V)$, where
\[
V = \begin{pmatrix} U & 0 & \cdots & 0 \\ 0 & U & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & U \end{pmatrix}
\qquad\text{and}\qquad
U = \begin{pmatrix} \sigma_1^2 + \sigma^2 & \sigma_1^2 & \cdots & \sigma_1^2 \\ \sigma_1^2 & \sigma_1^2 + \sigma^2 & & \vdots \\ \vdots & & \ddots & \sigma_1^2 \\ \sigma_1^2 & \cdots & \sigma_1^2 & \sigma_1^2 + \sigma^2 \end{pmatrix}.
\]
This generalizes in an obvious way to: unequal numbers of subgroups in each group; any number of variance components; negative variance components.
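To make the structure concrete, here is a short NumPy sketch that builds $U$ and the block-diagonal $V$ for $m$ groups of $k$ units and computes the generalized least squares estimate when the variance components are known; the function and argument names are invented for the example.

```python
import numpy as np
from scipy.linalg import block_diag

def gls_variance_components(y, X, m, k, sigma1_sq, sigma_sq):
    """GLS fit for m groups of k units each, with known variance components.

    Rows of y and X are assumed ordered group by group (k rows per group).
    """
    U = sigma_sq * np.eye(k) + sigma1_sq * np.ones((k, k))  # within-group covariance
    V = block_diag(*([U] * m))                              # n x n with n = m*k
    Vinv = np.linalg.inv(V)
    XtVinv = X.T @ Vinv
    beta = np.linalg.solve(XtVinv @ X, XtVinv @ y)          # GLS estimate
    cov_beta = np.linalg.inv(XtVinv @ X)                    # exact Var(beta-hat)
    return beta, cov_beta
```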
If the ratio $\sigma_1^2/\sigma^2$ were known, $\beta$ could be estimated by generalized least squares. In practice, we usually estimate the variance components and plug the estimates into a generalized least squares fit. This underestimates the standard errors of $\hat\beta$, but various corrections are available. The variance components can be estimated by maximum likelihood or by residual maximum likelihood (REML), or we can do a Bayesian analysis. REML applies an orthogonal transformation to $Y$, reducing it to dimension $n - p$, and maximises the likelihood of the transformed data. In the general linear model, it gives $S^2$ as the estimator of $\sigma^2$.
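In practice such models are fitted with mixed-model software; here is a sketch using statsmodels, where the data frame, formula, and column names are assumptions for the example.

```python
import pandas as pd
import statsmodels.formula.api as smf

# df assumed to hold columns: y (response), x (covariate), school (group label)
df = pd.read_csv("pupils.csv")                 # hypothetical data file

# Random intercept delta_i for each school; REML is the fitting criterion
model = smf.mixedlm("y ~ x", data=df, groups=df["school"])
fit = model.fit(reml=True)
print(fit.summary())                           # beta-hat and both variance components
```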
Generalized Linear Models

Linear models are useful for response variables taking values in $\mathbb{R}$. Generalized linear models (GLMs) are used for other responses. GLMs require the assumption that the responses come from a particular distribution in the exponential family, with mean $\mu$ depending on $x$ through a link function $g(\cdot)$:
\[
g(\mu_i) = x_i^\top\beta,
\]
and with a constant dispersion parameter. They are most commonly used with discrete data, the assumed distribution being Poisson or binomial, but occasionally the gamma distribution is used for continuous data on $\mathbb{R}^+$; this is an alternative to transforming $Y$.
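For instance, a Poisson log-linear model can be fitted with statsmodels as below; the array names and the added intercept column are assumptions for the example.

```python
import numpy as np
import statsmodels.api as sm

# y: counts; X: covariate matrix (both assumed given as NumPy arrays)
X_design = sm.add_constant(X)                            # prepend intercept column
glm = sm.GLM(y, X_design, family=sm.families.Poisson())  # canonical log link
res = glm.fit()                                          # fitted by IRLS
print(res.params, res.bse)                               # beta-hat, standard errors
```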
GLMs have a number of properties which make them simple to work with:

- Maximum likelihood estimates can be obtained using iteratively reweighted least squares (IRLS), which usually converges quickly (a sketch is given below).
- Canonical link functions (e.g. $\log\lambda$ for the Poisson, $\log\{\pi/(1-\pi)\}$ for the binomial) can be used, for which $X^\top Y$ is a sufficient statistic.
- Extensions include generalized linear mixed models and hierarchical generalized linear models, which allow for additional random effects.
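A minimal NumPy sketch of IRLS for the Poisson GLM with canonical log link; the starting value and the convergence rule are illustrative choices.

```python
import numpy as np

def irls_poisson(y, X, n_iter=25, tol=1e-10):
    """IRLS for a Poisson GLM with log link: log(mu) = X @ beta."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        w = mu                                 # IRLS weights: (dmu/deta)^2 / Var = mu
        z = eta + (y - mu) / mu                # working response
        Xtw = X.T * w
        beta_new = np.linalg.solve(Xtw @ X, Xtw @ z)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```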
Quasi-likelihood models retain the same mean-variance relationship as the distributions used in GLMs, but drop the distributional assumption and allow an extra dispersion parameter. For example, instead of assuming $Y \sim \text{Poisson}(\lambda)$, which implies $E(Y) = \lambda$ and $V(Y) = \lambda$, we assume only that $E(Y) = \lambda$ and $V(Y) = \sigma^2\lambda$, with the same link function but without assuming a specific distribution. This is a kind of semi-parametric model.
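In statsmodels, a quasi-Poisson analysis of this kind can be obtained by estimating the dispersion from the Pearson statistic; a sketch, reusing sm, y, and X_design from the GLM example above:

```python
# Same Poisson fit, but with the dispersion sigma^2 estimated from Pearson's X^2
quasi = sm.GLM(y, X_design, family=sm.families.Poisson()).fit(scale="X2")
print(quasi.scale)    # estimated dispersion sigma^2
print(quasi.bse)      # standard errors inflated by sqrt(scale)
```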
Proportional Hazards Models

For time-to-event data $T$, it is often appropriate to assume the model $T_i \sim \text{Weibull}(\alpha_i, \eta)$, where $\log\alpha_i = x_i^\top\beta$. This is not a GLM, since the Weibull distribution is not a member of the exponential family. The hazard function for this distribution is
\[
h(t) = \frac{f(t)}{1 - F(t)} = \eta\,\alpha_i^\eta\, t^{\eta-1} = \eta\, e^{\eta x_i^\top\beta}\, t^{\eta-1} = h_0(t)\, e^{x_i^\top\beta},
\]
where $h_0(t) = \eta t^{\eta-1}$ is the baseline hazard for an observation with $x = 0$ (the factor $\eta$ multiplying $\beta$ in the exponent being absorbed into the coefficients in the final expression).
The Weibull regression model therefore has the property of proportional hazards. Very often we use Cox's semi-parametric proportional hazards model, which assumes $h(t) = h_0(t)e^{x_i^\top\beta}$ but makes no distributional assumption about the baseline hazard. As usual, proportional hazards is just an assumption and can be relaxed. We can also include random effects (frailty) in these models.
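A sketch of a Cox proportional hazards fit using the lifelines package; the data frame and its column names are assumptions for the example.

```python
import pandas as pd
from lifelines import CoxPHFitter

# df assumed to hold: time (follow-up time), event (1 = failure observed,
# 0 = censored), and covariate columns such as age and treatment
df = pd.read_csv("survival.csv")               # hypothetical data file

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")
cph.print_summary()                            # beta-hat and hazard ratios exp(beta)
```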
Semi-Parametric Models

Semi-parametric regression models assume that $Y \sim N(f(x), \sigma^2)$, but avoid assuming a specific functional form for $f(x)$. Instead, data-driven smoothers are used to approximate the unknown function. Typical examples are smoothed local polynomials (splines), which try to follow the data while penalising roughness. These can be extended to other distributions using generalized additive models. Opinion: these models are most useful for nuisance effects, as they do not allow mechanistic understanding to be developed.
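A short sketch of such a roughness-penalised smoother, using scipy's smoothing spline; the data arrays and the smoothing level s are assumptions for the example.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# x, y assumed to be 1-D arrays with x strictly increasing
spline = UnivariateSpline(x, y, s=len(x))      # s trades fidelity against roughness
grid = np.linspace(x.min(), x.max(), 200)
f_hat = spline(grid)                           # estimate of f on a fine grid
```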
Nonlinear Models

The term nonlinear models is completely general, but is most often associated with models of the form $E(Y_i) = f(x_i; \theta)$ and $V(Y) = \sigma^2 I$. The parameters are usually estimated by nonlinear least squares (NLLS), i.e. we choose $\theta$ to minimise
\[
\sum_{i=1}^n \{Y_i - f(x_i; \theta)\}^2.
\]
Inference is usually carried out by assuming normality, in which case the NLLS estimates are equivalent to the MLEs, and invoking the asymptotic properties of MLEs.
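A sketch with scipy.optimize.curve_fit, using an exponential-decay mean function chosen purely for illustration:

```python
import numpy as np
from scipy.optimize import curve_fit

def f(x, a, b, c):
    """Illustrative nonlinear mean function f(x; theta) = a + b*exp(-c*x)."""
    return a + b * np.exp(-c * x)

# x, y assumed to be 1-D NumPy arrays; p0 gives starting values for theta
theta_hat, cov_theta = curve_fit(f, x, y, p0=(1.0, 1.0, 0.5))
se = np.sqrt(np.diag(cov_theta))               # asymptotic standard errors
```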
Nonlinear least squares is a fundamentally difficult numerical problem:

- Convergence to the global optimum is not guaranteed and often fails in practice.
- Partially linear algorithms can help when there are separable linear parameters, i.e. when the model can be written
\[
E(Y) = \sum_{j=1}^q \beta_j f_j(x; \theta).
\]
This reduces the numerical search by $q$ dimensions (see the sketch after this list).
- Asymptotic inferences can be very poor approximations.
- Bayesian methods avoid these computational problems, but might create others.
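A minimal sketch of the separable idea (variable projection) for a single nonlinear parameter: for each trial $\theta$ the linear coefficients $\beta$ are recovered by ordinary least squares, so the outer search is one-dimensional. The basis functions, bounds, and data names are assumptions for the example.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def rss_profiled(theta, x, y):
    """Residual sum of squares with beta profiled out, for the illustrative
    model E(Y) = beta_1 + beta_2 * exp(-theta * x)."""
    F = np.column_stack([np.ones_like(x), np.exp(-theta * x)])  # f_j(x; theta)
    beta = np.linalg.lstsq(F, y, rcond=None)[0]                 # inner linear fit
    return np.sum((y - F @ beta)**2)

# The outer search is over theta alone, not over (theta, beta_1, beta_2)
res = minimize_scalar(rss_profiled, bounds=(1e-3, 10.0), args=(x, y),
                      method="bounded")
theta_hat = res.x
```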