A Covariance Regression Model


Peter D. Hoff and Xiaoyue Niu
Departments of Statistics and Biostatistics, University of Washington, Seattle, WA
Web: www.stat.washington.edu/~hoff. This work was partially supported by NSF grant SES.

December 8, 2009

Abstract

Classical regression analysis relates the expectation of a response variable to a linear combination of explanatory variables. In this article, we propose a covariance regression model that parameterizes the covariance matrix of a multivariate response vector as a parsimonious quadratic function of explanatory variables. The approach can be seen as analogous to the mean regression model, and has a representation as a type of random effects model. Parameter estimation for covariance regression is straightforward using either an EM algorithm or a Gibbs sampling scheme. The proposed methodology provides a simple but flexible representation of heteroscedasticity across the levels of an explanatory variable, and can give better-calibrated prediction regions when compared to a homoscedastic model.

Some key words: heteroscedasticity, Markov chain Monte Carlo, multivariate, positive definite cone, random effects.

1 Introduction

Estimation of a conditional mean function µ_x = E[y|x] is a well studied data-analysis task for which there are a large number of statistical models and procedures. Less studied is the problem of estimating a covariance function Σ_x = Cov[y|x] across a range of values of an explanatory x-variable. In the univariate case, several procedures assume that the variance can be expressed as a function of the mean, i.e. σ_x^2 = g(µ_x) for some known function g (see, for example, Carroll et al. [1982]). In many such cases the data can be represented by a generalized linear model with an appropriate variance function, or perhaps the data can be transformed to a scale in which the variance is constant as a function of the mean [Box and Cox, 1964]. Other approaches separately parameterize the mean and variance, giving either a linear model for the standard deviation [Rutemiller and Bowers, 1968] or by forcing the variance to be non-negative via a link function [Smyth, 1989].

In situations where the explanatory variable x is continuous and the variance function is assumed to be smooth, Carroll [1982] and Müller and Stadtmüller [1987] propose and study kernel estimates of the variance function.

Less developed are methods for multivariate heteroscedasticity. One exception is in the context of multivariate time series, for which a variety of multivariate autoregressive conditionally heteroscedastic (ARCH) models have been developed [Engle and Kroner, 1995, Fong et al., 2006]. However, the applicability of such models is limited to situations where the heteroscedasticity is temporal in nature. In this article we develop a simple model for a covariance function {Σ_x : x ∈ X} for which the domain of the explanatory x-variable is the same as in mean regression, that is, the explanatory vector can contain continuous, discrete and categorical variables. Our model is based on an analogy with linear regression. As a function of x, the covariance regression function Σ_x is a curve within the cone of positive definite matrices. A geometric interpretation of this model is developed in Section 2, along with a representation as a random effects model. Section 3 discusses methods of parameter estimation, including an EM algorithm for obtaining maximum likelihood estimates, as well as a Gibbs sampler for Bayesian inference. Section 4 illustrates the model with a simple data analysis involving a bivariate response vector and a univariate continuous explanatory variable. Section 5 summarizes the article and suggests directions for further research.

2 A covariance regression model

2.1 Model definition and geometry

Let y ∈ R^p be a random multivariate response vector and let x ∈ R^q be a vector of explanatory variables. Our goal is to provide a parsimonious model and estimation method for Cov[y|x] = Σ_x, the conditional covariance matrix of y given x. We begin by analogy with linear regression. The simple linear regression model expresses the conditional mean µ_x = E[y|x] as a + Bx, an affine function of x. This model restricts the p-dimensional vector µ_x to a q-dimensional subspace of R^p. The set of p × p covariance matrices is the cone of positive semidefinite matrices. This cone is convex and thus closed under addition. The simplest version of our proposed covariance regression model expresses Σ_x as

    Σ_x = A + B x x^T B^T,    (1)

where A is a p × p positive-definite matrix and B is a p × q matrix. The resulting covariance function is positive definite for all x, and expresses the covariance as equal to a baseline covariance matrix A plus a rank-1, p × p positive semidefinite matrix that depends on x.
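To make the form of (1) concrete, the following R sketch evaluates Σ_x over a grid of x-values and verifies that each matrix is positive definite. The values of A and B are hypothetical and chosen only for illustration (p = 2, with an intercept so that q = 2).

```r
## Covariance regression function Sigma_x = A + B x x^T B^T (Equation 1); illustrative values only
A <- matrix(c(1.0, 0.3,
              0.3, 0.5), 2, 2)            # baseline covariance (positive definite)
B <- matrix(c(0.5, 0.2,
              0.4, 0.6), 2, 2)            # p x q coefficient matrix
sigma_x <- function(x, A, B) A + B %*% x %*% t(x) %*% t(B)

xgrid  <- lapply(seq(-2, 2, by = 0.5), function(s) c(1, s))   # explanatory vectors (1, x)
Sigmas <- lapply(xgrid, sigma_x, A = A, B = B)

## every Sigma_x is a valid covariance matrix: symmetric with strictly positive eigenvalues
all(sapply(Sigmas, function(S) all(eigen(S, symmetric = TRUE)$values > 0)))
```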

The model given by Equation 1 is in some sense a natural generalization of mean regression to a model for covariance matrices. A vector mean function lies in a vector (linear) space, and is expressed as a linear map from R^q to R^p. The covariance matrix function lies in the cone of positive definite matrices, where the natural group action is matrix multiplication on the left and right. The covariance regression model expresses the covariance function via such a map from the q × q cone to the p × p cone. Letting {b_1, ..., b_p} be the rows of B, the covariance regression model gives

    Var[y_j | x] = a_{j,j} + b_j^T x x^T b_j    (2)
    Cov[y_j, y_k | x] = a_{j,k} + b_j^T x x^T b_k.    (3)

The parameterization of the variance suggests that the model requires the variance of each element of y to be increasing in the absolute value of the elements of x, as the minimum variance is obtained when x = 0. This constraint can be alleviated by including an intercept term so that the first element of the explanatory vector is 1. For example, in the case of a single scalar explanatory variable x we write b_j = (b_{1,j}, b_{2,j})^T, giving

    Var[y_j | x] = a_{j,j} + (b_{1,j} + b_{2,j} x)^2
    Cov[y_j, y_k | x] = a_{j,k} + (b_{1,j} + b_{2,j} x)(b_{1,k} + b_{2,k} x).

For any given finite interval (c, d) ⊂ R there exist parameter values (b_{1,j}, b_{2,j}) so that the variance of y_j is either increasing or decreasing in x for x ∈ (c, d).

We now consider the geometry of the covariance regression model. For each x, the model expresses Σ_x as equal to a point A inside the positive-definite cone plus a rank-1 positive-semidefinite matrix B x x^T B^T. The latter matrix is a point on the boundary of the cone, so the range of Σ_x as a function of x can be seen as a submanifold of the boundary of the cone, but pushed into the cone by an amount A. Figure 1 represents this graphically for the simplest of cases, in which p = 2 and there is just a single scalar explanatory variable x. In this case, each covariance matrix can be expressed as a three-dimensional vector (σ_1^2, σ_2^2, σ_{1,2}) such that σ_1^2 ≥ 0, σ_2^2 ≥ 0 and |σ_{1,2}| ≤ σ_1 σ_2. The set of such points constitutes the positive semidefinite cone, whose boundary is shown by the outer surfaces in the two plots in Figure 1. The range of B x x^T B^T over all x and matrices B includes the set of all rank-1 positive semidefinite matrices, which lie on the boundary of the cone. Thus the possible range of A + B x x^T B^T for a given A is simply the boundary of the cone, translated by an amount A. Such a translated cone is shown from two perspectives in Figure 1. For a given A and B, the covariance regression model expresses Σ_x as a curve on this translated boundary. A few such curves for six different values of B are shown in black in Figure 1.

Figure 1: The positive-definite cone and a translation, from two perspectives. The outer surface is the boundary of the positive definite cone, and the inner cone is equal to the boundary plus a positive definite matrix A. Black curves on the inner cone represent covariance regression curves A + B x x^T B^T for different values of B.

2.2 Random effects representation

The covariance regression model also has an interpretation as a type of random effects model. Consider a model for observed data y_1, ..., y_n of the following form:

    y_i = µ_{x_i} + γ_i B x_i + ɛ_i    (4)
    E[ɛ_i] = 0,  Cov[ɛ_i] = A
    E[γ_i] = 0,  Var[γ_i] = 1,  E[γ_i ɛ_i] = 0.

The resulting covariance matrix for y_i given x_i is then

    E[(y_i − µ_{x_i})(y_i − µ_{x_i})^T] = E[γ_i^2 B x_i x_i^T B^T + γ_i (B x_i ɛ_i^T + ɛ_i x_i^T B^T) + ɛ_i ɛ_i^T]
                                        = B x_i x_i^T B^T + A = Σ_{x_i}.

The model given by Equation 4 can be thought of as a factor analysis model in which the latent factor for unit i is restricted to be a multiple of the unit's explanatory vector x_i.
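The random effects form (4) also gives a direct way to simulate data and to check the induced covariance numerically. The sketch below uses hypothetical A and B, a single fixed x, a zero mean function, and MASS::mvrnorm for the normal errors, and compares the sample covariance of the simulated responses to A + B x x^T B^T.

```r
set.seed(1)
A <- matrix(c(1.0, 0.3, 0.3, 0.5), 2, 2)    # Cov[eps], illustrative
B <- matrix(c(0.5, 0.2, 0.4, 0.6), 2, 2)    # p x q, illustrative
x <- c(1, 1.5)                              # (intercept, scalar predictor)

n     <- 50000
gamma <- rnorm(n)                                    # mean-zero, unit-variance random effects
eps   <- MASS::mvrnorm(n, mu = c(0, 0), Sigma = A)   # rows are eps_i^T
Y     <- outer(gamma, drop(B %*% x)) + eps           # y_i = gamma_i B x_i + eps_i  (mu_x = 0)

cov(Y)                                       # sample covariance of the simulated y_i ...
A + B %*% x %*% t(x) %*% t(B)                # ... is close to Sigma_x from Equation 1
```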

To see how this impacts the variance, let {b_1, ..., b_p} be the rows of B. The model in (4) can then be expressed as

    (y_{i,1} − µ_{x_i,1}, ..., y_{i,p} − µ_{x_i,p})^T = γ_i (b_1^T x_i, ..., b_p^T x_i)^T + (ɛ_{i,1}, ..., ɛ_{i,p})^T.    (5)

We can interpret γ_i as describing additional unit-level variability beyond that represented by ɛ_i. The vectors {b_1, ..., b_p} describe how this additional variability is manifested across the p different response variables.

Via the above random effects representation, the covariance regression model can be seen as similar in spirit to a random effects model for longitudinal data discussed in Scott and Handcock [2001]. In that article, the covariance among a set of repeated measurements y_i from a single individual i was modeled as y_i = µ_i + γ_i X_i β + ɛ_i, where X_i is an observed design matrix for the repeated measurements and γ_i is a mean-zero, unit-variance random effect. In the longitudinal data application in that article, X_i was constructed from a set of basis functions evaluated at the observed time points, and β represented unknown weights. This model induces a covariance matrix of X_i β β^T X_i^T + Cov[ɛ_i] among the observations common to an individual. For the problem we are considering in this article, where the explanatory variables are shared among all p observations of a given unit (i.e. the rows of X_i are identical and equal to x_i), the covariance matrix induced by Scott and Handcock's model reduces to (x_i^T β)^2 1 1^T + Cov[ɛ_i], which is much more restrictive than the model given by (4).

2.3 Higher rank models

The model given by Equation 1 restricts the difference between Σ_x and the baseline matrix A to be a rank-one matrix. This restriction can be lifted by extending the model to allow for higher-rank deviations. Consider the following extension of the random effects representation given by Equation 4:

    y = µ_x + γ B x + ψ C x + ɛ,    (6)

where γ and ψ are mean-zero, variance-one random variables, uncorrelated with each other and with ɛ. Under this model, the covariance of y is given by Σ_x = A + B x x^T B^T + C x x^T C^T. This model allows the deviation of Σ_x from the baseline A to be of rank 2. Additionally, we can interpret the second random effect ψ as allowing an additional, independent source of heteroscedasticity for the set of the p response variables.

For the rank-2 model, Equation 5 becomes

    (y_{i,1} − µ_{x_i,1}, ..., y_{i,p} − µ_{x_i,p})^T = γ_i (b_1^T x_i, ..., b_p^T x_i)^T + ψ_i (c_1^T x_i, ..., c_p^T x_i)^T + (ɛ_{i,1}, ..., ɛ_{i,p})^T.

Whereas the rank-1 model essentially requires that extreme residuals for one element of y co-occur with extreme residuals of the other elements, the rank-2 model provides more flexibility, allowing for heteroscedasticity across multiple elements of y without requiring extreme residuals for all or none of the elements. Further flexibility can be gained by adding additional random effects, allowing the difference between Σ_x and the baseline A to be of any desired rank.

2.4 Identifiability

We first consider identifiability for the rank-1 model and a single scalar explanatory variable x. Including an intercept term so that the explanatory vector is (1, x)^T, the model in (1) becomes

    Σ_x(A, B) = A + b_1 b_1^T + (b_1 b_2^T + b_2 b_1^T) x + b_2 b_2^T x^2.

Now suppose that (Ã, B̃) are such that Σ_x(A, B) = Σ_x(Ã, B̃) for all x ∈ R. Setting x = 0 indicates that A + b_1 b_1^T = Ã + b̃_1 b̃_1^T. Considering x = ±1 implies that b_2 b_2^T = b̃_2 b̃_2^T and thus that b_2 = ±b̃_2. If b_2 ≠ 0, we have b_1 b_2^T + b_2 b_1^T = b̃_1 b̃_2^T + b̃_2 b̃_1^T, which implies that B̃ = ±B and à = A. Thus these parameters are identifiable, at least given an adequate range of x-values.

For the rank-r model with r > 1, consider the random effects representation given by

    y_i − µ_{x_i} = ∑_{k=1}^r γ_{i,k} B^{(k)} x_i + ɛ_i.

Let B_1 = (b_1^{(1)}, ..., b_1^{(r)}) be the p × r matrix defined by the first columns of B^{(1)}, ..., B^{(r)}, and define {B_k : k = 1, ..., q} similarly. The model can then be expressed as

    y_i − µ_{x_i} = ∑_{k=1}^q x_{i,k} B_k γ_i + ɛ_i,

where γ_i = (γ_{i,1}, ..., γ_{i,r})^T. Now suppose that γ_i is allowed to have a covariance matrix Ψ not necessarily equal to the identity. The above representation shows that the model given by {B_1, ..., B_q, Ψ} is equivalent to the one given by {B_1 Ψ^{1/2}, ..., B_q Ψ^{1/2}, I}, and so without loss of generality it can be assumed that Ψ = I, i.e. the random effects are independent with unit variance. In this case, note that Cov[γ_i] = Cov[H γ_i] where H is any r × r orthonormal matrix. This implies that the covariance function Σ_x given by {B_1, ..., B_q, I} is equal to the one given by {B_1 H, ..., B_q H, I} for any orthonormal H, and so the parameters in the higher rank model are not completely identifiable. One possible identifiability constraint is to restrict B_1 = (b_1^{(1)}, ..., b_1^{(r)}), the matrix of first columns of B^{(1)}, ..., B^{(r)}, to have orthogonal columns.
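The rotation invariance described above is easy to verify numerically. In the sketch below, hypothetical matrices B^(1) and B^(2) define a rank-2 covariance function; replacing them by the linear combinations induced by an orthonormal H leaves Σ_x unchanged.

```r
set.seed(2)
p <- 3; q <- 2
B1 <- matrix(rnorm(p * q), p, q)    # B^(1), hypothetical
B2 <- matrix(rnorm(p * q), p, q)    # B^(2), hypothetical
A  <- diag(p)

sigma_x <- function(x, Blist)
  A + Reduce(`+`, lapply(Blist, function(Bk) Bk %*% x %*% t(x) %*% t(Bk)))

## an orthonormal 2 x 2 rotation H; the rotated components are Btilde^(k) = sum_l H[l, k] B^(l)
theta <- 0.7
H   <- matrix(c(cos(theta), -sin(theta), sin(theta), cos(theta)), 2, 2)
B1t <- H[1, 1] * B1 + H[2, 1] * B2
B2t <- H[1, 2] * B1 + H[2, 2] * B2

x <- c(1, -0.8)
max(abs(sigma_x(x, list(B1, B2)) - sigma_x(x, list(B1t, B2t))))   # essentially zero
```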

3 Parameter estimation

3.1 Likelihood-based inference

In this section we consider parameter estimation based on the n × p data matrix Y = (y_1, ..., y_n)^T observed under conditions X = (x_1, ..., x_n)^T. We assume normal models for all error terms:

    γ_1, ..., γ_n ~ independent normal(0, 1)    (7)
    ɛ_1, ..., ɛ_n ~ independent multivariate normal(0, A)
    y_i = µ_{x_i} + γ_i B x_i + ɛ_i.

For now, assume {µ_x : x ∈ X} are known and let E = (e_1, ..., e_n)^T be the n × p matrix of residuals. The log likelihood of the parameters based on E and X is

    l(A, B : E, X) = c − (1/2) ∑_i log |A + B x_i x_i^T B^T| − (1/2) ∑_i tr[(A + B x_i x_i^T B^T)^{-1} e_i e_i^T].    (8)

After some algebra, it can be shown that the maximum likelihood estimates of A and B satisfy the following equations:

    ∑_i Σ̂_{x_i}^{-1} = ∑_i Σ̂_{x_i}^{-1} e_i e_i^T Σ̂_{x_i}^{-1}
    ∑_i Σ̂_{x_i}^{-1} B̂ x_i x_i^T = ∑_i Σ̂_{x_i}^{-1} e_i e_i^T Σ̂_{x_i}^{-1} B̂ x_i x_i^T,

where Σ̂_x = Â + B̂ x x^T B̂^T. While not providing closed-form expressions for  and B̂, these equations indicate that the maximum likelihood estimate of the covariance function has an inverse Σ̂_{x_i}^{-1} that, loosely speaking, acts on average as a pseudo-inverse for e_i e_i^T.

While direct maximization of (8) is challenging, the random effects representation of the model allows for parameter estimation via simple iterative methods. In particular, maximum likelihood estimation via the EM algorithm is straightforward, as is Bayesian estimation using a Gibbs sampler to approximate the posterior distribution p(A, B | Y, X). Both of these methods rely on the conditional distribution of {γ_1, ..., γ_n} given {Y, X, A, B}. Straightforward calculations give {γ_i | Y, X, A, B} ~ normal(m_i, v_i), where

    v_i = (1 + x_i^T B^T A^{-1} B x_i)^{-1}
    m_i = v_i (y_i − µ_{x_i})^T A^{-1} B x_i.

Given the variety of modeling options for mean regression, we do not cover estimation of {µ_x : x ∈ X} in the next two sections. In what follows we assume {µ_x : x ∈ X} are known or fixed at some estimated values. We note that both the EM algorithm and the Gibbs sampling scheme presented below can be modified to accommodate simultaneous estimation of the mean function.
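These expressions translate directly into code. The sketch below evaluates the log likelihood in (8) up to its additive constant and computes the conditional moments m_i and v_i, taking the residuals e_i as the rows of a matrix E and the explanatory vectors x_i as the rows of X (all object names here are illustrative).

```r
## Log likelihood (8), up to an additive constant, for residual matrix E (n x p) and design X (n x q)
loglik_AB <- function(A, B, E, X) {
  ll <- 0
  for (i in seq_len(nrow(E))) {
    S  <- A + B %*% X[i, ] %*% t(X[i, ]) %*% t(B)     # Sigma_{x_i}
    ll <- ll - 0.5 * as.numeric(determinant(S, logarithm = TRUE)$modulus) -
               0.5 * sum(diag(solve(S, tcrossprod(E[i, ]))))
  }
  ll
}

## Conditional moments of gamma_i given the data and (A, B)
gamma_moments <- function(A, B, E, X) {
  Ainv <- solve(A)
  v <- 1 / (1 + rowSums((X %*% t(B) %*% Ainv %*% B) * X))   # v_i = (1 + x_i^T B^T A^{-1} B x_i)^{-1}
  m <- v * rowSums((E %*% Ainv %*% B) * X)                  # m_i = v_i e_i^T A^{-1} B x_i
  list(m = m, v = v)
}
```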

3.2 Estimation with the EM algorithm

Let e_i = y_i − µ_{x_i} and E = (e_1, ..., e_n)^T. The EM algorithm proceeds by iteratively maximizing the expected value of the complete data log-likelihood, l(A, B) = log p(E | A, B, X, γ), which is simply obtained from the multivariate normal density:

    l(A, B) = −(1/2) ( np log(2π) + n log |A| + ∑_{i=1}^n (e_i − γ_i B x_i)^T A^{-1} (e_i − γ_i B x_i) ).    (9)

Given current estimates (Â, B̂) of (A, B), one step of the EM algorithm proceeds as follows: First, m_i = E[γ_i | Â, B̂, e_i] and v_i = Var[γ_i | Â, B̂, e_i] are computed and plugged into the likelihood (9), giving

    −2 E[l(A, B) | Â, B̂] = np log(2π) + n log |A| + ∑_{i=1}^n E[(e_i − γ_i B x_i)^T A^{-1} (e_i − γ_i B x_i) | Â, B̂],

where

    E[(e_i − γ_i B x_i)^T A^{-1} (e_i − γ_i B x_i) | Â, B̂] = (e_i − m_i B x_i)^T A^{-1} (e_i − m_i B x_i) + v_i x_i^T B^T A^{-1} B x_i
                                                           = (e_i − m_i B x_i)^T A^{-1} (e_i − m_i B x_i) + (s_i x_i)^T B^T A^{-1} B (s_i x_i),

with s_i = v_i^{1/2}. Next, a 2n × q matrix X̃ is constructed, having ith row equal to m_i x_i and (n + i)th row equal to s_i x_i. Additionally, let Ẽ be the 2n × p matrix (E^T, 0 · E^T)^T, that is, E stacked above an n × p matrix of zeros. The expected value of the complete data log-likelihood can then be written as

    −2 E[l(A, B) | Â, B̂] − np log(2π) = n log |A| + tr( [Ẽ − X̃ B^T]^T [Ẽ − X̃ B^T] A^{-1} ),

which is essentially the likelihood for normal multivariate regression. The next step of the EM algorithm obtains the new values (Â, B̂) as the maximizers of this expected likelihood, which are given by

    B̂ = Ẽ^T X̃ (X̃^T X̃)^{-1}
    Â = (Ẽ − X̃ B̂^T)^T (Ẽ − X̃ B̂^T) / n.

This procedure is repeated until a desired convergence criterion has been met.
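A compact implementation of these updates might look as follows; this is a sketch of the rank-1 algorithm under the assumptions of this section (known mean function, residuals in E, explanatory vectors in X), not the packaged code referred to later in the article.

```r
## EM algorithm for the rank-1 covariance regression model (illustrative sketch)
covreg_em <- function(E, X, maxit = 500, tol = 1e-8) {
  n <- nrow(E); p <- ncol(E); q <- ncol(X)
  A <- cov(E)                                  # start from the homoscedastic estimate
  B <- matrix(0.01, p, q)
  for (it in seq_len(maxit)) {
    ## E-step: conditional moments of the gamma_i
    Ainv <- solve(A)
    v <- 1 / (1 + rowSums((X %*% t(B) %*% Ainv %*% B) * X))
    m <- v * rowSums((E %*% Ainv %*% B) * X)
    ## M-step: multivariate regression of Etilde on Xtilde
    Xt <- rbind(m * X, sqrt(v) * X)            # 2n x q matrix with rows m_i x_i and s_i x_i
    Et <- rbind(E, matrix(0, n, p))            # E stacked above an n x p matrix of zeros
    B_new <- t(Et) %*% Xt %*% solve(t(Xt) %*% Xt)
    A_new <- crossprod(Et - Xt %*% t(B_new)) / n
    converged <- max(abs(B_new - B), abs(A_new - A)) < tol
    A <- A_new; B <- B_new
    if (converged) break
  }
  list(A = A, B = B, iterations = it)
}
```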

3.3 Posterior approximation with the Gibbs sampler

A Bayesian analysis provides estimates and confidence intervals for arbitrary functions of the parameters, as well as a simple way of making predictive inference for future observations. Given a prior distribution p(A, B), inference is based on the joint posterior distribution,

    p(A, B | Y, X) ∝ p(A, B) × p(Y | X, A, B).

While this posterior distribution is not available in closed form, a Monte Carlo approximation to the joint posterior distribution of (A, B) is available via Gibbs sampling. Using the random effects representation of the model in Equation 7, the Gibbs sampler constructs a Markov chain in {A, B, γ_1, ..., γ_n} whose stationary distribution is equal to the joint posterior distribution of these quantities. Calculations are facilitated by the use of a semi-conjugate prior distribution for A and B, in which p(A) is an inverse-Wishart(A_0^{-1}, ν_0) distribution having expectation A_0/(ν_0 − p − 1), and p(B | A) is a matrix normal(B_0, A, V_0) distribution, such that E[B | A] = B_0, E[(B − B_0)(B − B_0)^T | A] = A tr(V_0) and E[(B − B_0)^T (B − B_0) | A] = V_0 tr(A).

The Gibbs sampler proceeds by iteratively sampling (A, B) and {γ_1, ..., γ_n} from their full conditional distributions. As with the EM algorithm, we consider inference given values of {µ_x : x ∈ X}, letting e_i = y_i − µ_{x_i} and E = (e_1, ..., e_n)^T. One iteration of the Gibbs sampler consists of the following steps:

1. Sample γ_i ~ normal(m_i, v_i) for each i ∈ {1, ..., n}, where v_i = (1 + x_i^T B^T A^{-1} B x_i)^{-1} and m_i = v_i e_i^T A^{-1} B x_i.

2. Sample (A, B) ~ p(A, B | E, X, γ_1, ..., γ_n) as follows:
   (a) sample A ~ inverse-Wishart(A_n^{-1}, ν_0 + n), and
   (b) sample B ~ matrix normal(B_n, A, [X_γ^T X_γ + V_0^{-1}]^{-1}),
   where X_γ = ΓX with Γ = diag(γ_1, ..., γ_n),
   B_n = (E^T X_γ + B_0 V_0^{-1})(X_γ^T X_γ + V_0^{-1})^{-1}, and
   A_n = A_0 + (E − X_γ B_n^T)^T (E − X_γ B_n^T) + (B_n − B_0) V_0^{-1} (B_n − B_0)^T.

In the absence of strong prior information, default values for the prior parameters {B_0, V_0, A_0, ν_0} can be based on other considerations. In normal regression, for example, Zellner [1986] suggests a g-prior which makes the Bayes procedure invariant to linear transformations of the design matrix X. An analogous result for covariance regression can be obtained by selecting B_0 = 0 and V_0 = g(X^T X)^{-1}, i.e. by relating the prior precision of B to the precision given by the observed design matrix. A typical choice for g is to set g = n so that, roughly speaking, the information in the prior distribution is equivalent to that contained in one observation. Such choices lead to what Kass and Wasserman [1995] call a unit-information prior distribution, which in some cases weakly centers the prior distribution around an estimate based on the data. For example, setting ν_0 = p + 2 and A_0 equal to the sample covariance matrix of E weakly centers the prior distribution of A around a homoscedastic sample estimate.
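A sketch of this sampler for the rank-1 model is given below, using the default unit-information prior choices just described; the inverse-Wishart draw uses stats::rWishart and the matrix normal draw uses Cholesky factors. Burn-in, thinning and the higher-rank extension are omitted.

```r
## Gibbs sampler for the rank-1 covariance regression model (illustrative sketch)
covreg_gibbs <- function(E, X, nscan = 5000) {
  n <- nrow(E); p <- ncol(E); q <- ncol(X)
  ## default priors: B0 = 0, V0 = n (X^T X)^{-1}, nu0 = p + 2, A0 = sample covariance of E
  B0 <- matrix(0, p, q); V0i <- crossprod(X) / n; nu0 <- p + 2; A0 <- cov(E)
  A <- A0; B <- matrix(0, p, q)
  keep <- vector("list", nscan)
  for (s in seq_len(nscan)) {
    ## 1. sample the random effects gamma_i from their full conditionals
    Ainv <- solve(A)
    v <- 1 / (1 + rowSums((X %*% t(B) %*% Ainv %*% B) * X))
    m <- v * rowSums((E %*% Ainv %*% B) * X)
    gamma <- rnorm(n, m, sqrt(v))
    ## 2. sample (A, B) given the gamma_i
    Xg <- gamma * X                                        # X_gamma = diag(gamma) X
    Vn <- solve(crossprod(Xg) + V0i)
    Bn <- (t(E) %*% Xg + B0 %*% V0i) %*% Vn
    An <- A0 + crossprod(E - Xg %*% t(Bn)) + (Bn - B0) %*% V0i %*% t(Bn - B0)
    A  <- solve(rWishart(1, nu0 + n, solve(An))[, , 1])    # inverse-Wishart(An^{-1}, nu0 + n)
    Z  <- matrix(rnorm(p * q), p, q)
    B  <- Bn + t(chol(A)) %*% Z %*% chol(Vn)               # matrix normal(Bn, A, Vn)
    keep[[s]] <- list(A = A, B = B)
  }
  keep
}
```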

3.4 Estimation for higher-rank models

Section 2.3 discussed the possibility of a more flexible covariance regression model, obtained by allowing the deviation between A and Σ_x to be of rank greater than one. The general form for a rank-r covariance regression model is given by

    y_i = µ_{x_i} + ∑_{k=1}^r γ_{i,k} B^{(k)} x_i + ɛ_i = µ_{x_i} + B (γ_i ⊗ x_i) + ɛ_i,

where B = (B^{(1)}, ..., B^{(r)}). Estimation for this model can proceed with a small modification of the Gibbs sampling algorithm given above, in which B^{(k)} and {γ_{i,k} : i = 1, ..., n} are updated for each k ∈ {1, ..., r} separately. Alternatively, the full conditional distributions of B and {γ_1, ..., γ_n} are available in closed form, and so the B- and γ-parameters for all ranks could be updated simultaneously. However, in our experience the calculation of these full conditional distributions is computationally costly: the full conditional distributions of the γ_i's involve separate matrix inversions for each i = 1, ..., n (or more precisely, for each unique value of the x_i's). Moreover, sampling the random effects associated with all ranks simultaneously greatly slows down the Markov chain without providing improved performance in terms of convergence or mixing.

An EM algorithm is also available for estimation of this general rank model. The main modification to the algorithm presented in Section 3.2 is that the conditional distribution of each γ_i is a multivariate normal distribution, which leads to a more complex E-step in the procedure, while the M-step is equivalent to a multivariate least-squares regression estimation as before. We note that, in our experience, convergence of the EM algorithm for ranks greater than 1 can be slow, presumably due to the identifiability issue described in Section 2.4. More details about these estimation algorithms for the general rank model are available from the companion computer code for this article, available at the first author's website.
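As a check on the stacked representation used above (reading B(γ_i ⊗ x_i) with B = (B^{(1)}, ..., B^{(r)}) concatenated column-wise and ⊗ the Kronecker product, a convention assumed here), the two forms of the rank-2 deviation can be compared numerically:

```r
## sum_k gamma_k B^(k) x  versus  B (gamma %x% x), with B = (B^(1), B^(2))
set.seed(3)
p <- 3; q <- 2
B1 <- matrix(rnorm(p * q), p, q); B2 <- matrix(rnorm(p * q), p, q)   # hypothetical B^(1), B^(2)
Bstack <- cbind(B1, B2)                    # p x (r q) matrix
x <- c(1, 0.4); gamma <- rnorm(2)

lhs <- gamma[1] * B1 %*% x + gamma[2] * B2 %*% x
rhs <- Bstack %*% kronecker(gamma, x)
max(abs(lhs - rhs))                        # essentially zero
```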

4 An example with a single continuous predictor

4.1 Heteroscedastic FEV and height data

To illustrate the use of the covariance regression model we analyze data on forced expiratory volume (FEV) in liters and height in inches of 654 Boston youths [Rosner, 2000]. One feature of these data is the general increase in the variance of these variables with age, as shown in Figure 2. As the mean responses for these two variables are also increasing with age, one possible modeling strategy is to apply a variance stabilizing transformation to the data. In general, such transformations presume a particular mean-variance relationship, and choosing an appropriate transformation can be prone to much subjectivity. As an alternative, a covariance regression model allows heteroscedasticity to be modeled separately from heterogeneity in the mean, and also allows for modeling on the original scale of the data.

Figure 2: FEV and height data as a function of age. The smooth lines are local polynomial fits.

4.2 Maximum likelihood estimation

Ages for the 654 subjects ranged from 3 to 19 years, although there were only two 3-year-olds and three 19-year-olds. As we will be using plug-in estimates of µ_x, we combine the data from children of ages 3 and 19 with those of the 4- and 18-year-olds, respectively, giving a sample size of at least 8 in each age category. To focus the example on the covariance regression model, we take as our data the bivariate residuals from two local polynomial regression fits (using loess in the R statistical computing environment), one for each of FEV and height. We then use the EM algorithm described in Section 3 to fit the following two covariance regression models:

    Model 1: a rank-1 model with x_i = (1, age_i^{1/2});
    Model 2: a rank-2 model with x_i = (1, age_i^{1/2}, age_i).

Note that including age^{1/2} as a regressor results in there being a linear component to the modeled relationship between age and the variances and covariance.
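The data preparation for this example can be sketched as follows. The data frame and column names below (fev_dat with columns age, fev and height) are placeholders, since the Rosner data are not bundled with this sketch; the fit uses the covreg_em sketch from Section 3.2, so only the rank-1 Model 1 is fit here.

```r
## bivariate residuals from local polynomial (loess) fits, and design matrices for Models 1 and 2
fev_dat$age_adj <- pmin(pmax(fev_dat$age, 4), 18)        # pool ages 3 and 19 with 4 and 18

e_fev    <- resid(loess(fev    ~ age_adj, data = fev_dat))
e_height <- resid(loess(height ~ age_adj, data = fev_dat))
E <- cbind(e_fev, e_height)

X1 <- cbind(1, sqrt(fev_dat$age_adj))                    # Model 1: x = (1, age^(1/2)), rank 1
X2 <- cbind(1, sqrt(fev_dat$age_adj), fev_dat$age_adj)   # Model 2: x = (1, age^(1/2), age), rank 2

fit1 <- covreg_em(E, X1)   # rank-1 fit; a rank-2 fit requires the general-rank algorithm of Section 3.4
```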

The maximized log likelihoods of the two models give them roughly the same value of the AIC. However, the increased flexibility of Model 2 over Model 1 is highlighted in Figure 3, which plots the fitted variances and covariance of FEV and height as a function of age, along with the sample variances and correlations for each age group. The plots suggest that the rank-2 model has sufficient flexibility to capture the observed trends in Σ_x as a function of age.

Figure 3: Sample variances and correlations as a function of age, along with covariance regression fits. The gray lines correspond to the rank-1 model with x = (1, age^{1/2}); the black lines correspond to the rank-2 model with x = (1, age^{1/2}, age).

4.3 Posterior predictive distributions

One potential application of the covariance regression model is to make predictive regions for multivariate observations. Erroneously assuming a covariance matrix to be constant in x could give a prediction region with correct coverage probability for an entire population, but incorrect for specific values of x, and incorrect for making generalizations to populations having a distribution of x-values that is different from that of the data. Predictive inference is straightforward to implement in the context of Bayesian estimation: the prior distributions and data generate a predictive distribution p(ỹ | x, Y, X) for each possible value of x, which can be approximated via the output from the Markov chain Monte Carlo algorithm described in Section 3.

Using Model 2 described above and the default prior distributions discussed in Section 3, 50,000 iterations of the Gibbs sampler were generated, the first 1,000 of which were discarded to allow for convergence to the stationary distribution. Parameter values were saved every 10th iteration thereafter, leaving 4,900 saved values with which to make Monte Carlo approximations. For each of the 4,900 generated values of {A, B}, we constructed Σ_x(A, B) for each age from 4 to 18, yielding 45 parameters for each value of {A, B}. Effective sample sizes (roughly, the equivalent number of independent Monte Carlo samples) for these 45 parameters were all above 1000, with the exception of σ_1^2 and σ_{1,2} for the 18-year-old age group, which had effective sample sizes of 988 and 713 respectively.
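Diagnostics of this kind can be computed from the saved draws. The sketch below uses the coda package together with the covreg_gibbs sketch from Section 3.3; since only the rank-1 sampler was sketched there, Model 1 is used for illustration rather than the rank-2 Model 2 on which the results above are based.

```r
library(coda)

draws <- covreg_gibbs(E, X1, nscan = 50000)              # E and X1 as in the previous sketch
draws <- draws[seq(1001, 50000, by = 10)]                # drop burn-in, keep every 10th: 4,900 draws

sigma_age <- function(par, age) {                        # Sigma_x at a given age, Model 1 design
  x <- c(1, sqrt(age))
  par$A + par$B %*% x %*% t(x) %*% t(par$B)
}

## effective sample sizes for Var(FEV), Var(height) and Cov(FEV, height) at age 18
S18 <- t(sapply(draws, function(par) { S <- sigma_age(par, 18); c(S[1, 1], S[2, 2], S[1, 2]) }))
effectiveSize(mcmc(S18))
```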

Figure 4: Observed data and 90% posterior predictive ellipsoids for each age (one panel per age group, 4 through 18, plotting height residual against FEV residual). The black ellipsoids correspond to the covariance regression model, and the gray to a model with constant variance.

Table 1: Observed-data coverage rates by age for the heteroscedastic predictive ellipse from the covariance regression model and the homoscedastic predictive ellipse from a constant covariance model (columns: age group, sample size, homoscedastic coverage, heteroscedastic coverage). The nominal (target) coverage rate for the ellipses is 90%.

For each age group x and each of the 4,900 values of Σ_x, a predictive sample ỹ was generated from the multivariate normal(0, Σ_x) distribution. A 90% predictive ellipse was then generated as the smallest ellipse containing 90% of the 4,900 posterior predictive ỹ-values for the given age group. These ellipses are displayed graphically in Figure 4, along with the data and an analogous predictive ellipse based on a homoscedastic (constant covariance) model. Averaged across observations from all age groups, both sets of ellipsoids contain 90.5% of the observed data, which is very close to the nominal coverage of 90%. However, as can be seen from Table 1, the homoscedastic ellipse overcovers the observed data for the younger age groups, and undercovers for the older groups. In contrast, the flexibility of the covariance regression model allows the confidence ellipsoids to change size and shape as a function of age, and thus is able to match the nominal coverage rate fairly closely across the different ages.
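A simple approximation to this predictive check is sketched below, building on the draws and the sigma_age helper from the previous sketches. It replaces the smallest-ellipse construction with a Mahalanobis-distance rule based on the predictive draws, so it is an approximation rather than the exact procedure behind Figure 4 and Table 1.

```r
## approximate 90% posterior predictive region and its observed-data coverage for one age group
pred_coverage <- function(draws, age, obs, level = 0.90) {
  ## one predictive draw of y-tilde per saved (A, B) value
  ytil <- t(sapply(draws, function(par)
    MASS::mvrnorm(1, mu = c(0, 0), Sigma = sigma_age(par, age))))
  ## region: points whose Mahalanobis distance is below the 'level' quantile of the draws
  ctr <- colMeans(ytil); S <- cov(ytil)
  cut <- quantile(mahalanobis(ytil, ctr, S), level)
  mean(mahalanobis(obs, ctr, S) <= cut)        # fraction of observed residual pairs inside
}

## e.g. coverage for the 10-year-olds, using the residual matrix E from the earlier sketch:
## pred_coverage(draws, age = 10, obs = E[fev_dat$age_adj == 10, ])
```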

5 Discussion

This article has presented a model for a covariance matrix Cov[y|x] = Σ_x as a function of an explanatory variable x. We have presented a geometric interpretation in terms of curves along the boundary of a translated positive definite cone, and have provided a random effects representation that facilitates parameter estimation. This covariance regression model goes beyond what can be provided by variance stabilizing transformations, which serve to reduce the relationship between the mean and the variance. Unlike models or methods which accommodate heteroscedasticity in the form of a mean-variance relationship, the covariance regression model allows the mean function µ_x to be parameterized separately from the variance function Σ_x.

Although the example in this article involved a single continuous predictor, the covariance regression model accommodates explanatory variables of all types, including categorical variables. This could be useful in the analysis of multivariate data sampled from a large number of groups, such as groups defined by the cross-classification of several categorical variables. For example, it may be desirable to estimate a separate covariance matrix for each combination of age group, education level, race and religion in a given population. The number of observations for each combination of explanatory variables may be quite small, making it impractical to estimate a separate covariance matrix for each group. A practical alternative would be to use a covariance regression model as a parsimonious representation of the heteroscedasticity across the groups.

Like mean regression, a challenge for covariance regression modeling is variable selection, i.e. the choice of an appropriate set of explanatory variables. One possibility is to use selection criteria such as AIC or BIC, although non-identifiability of some parameters in the higher-rank models requires a careful accounting of the number of parameters. Another possibility may be to use Bayesian procedures, either via Markov chain Monte Carlo approximations to Bayes factors, or by explicitly formulating a prior distribution that allows some coefficients to be zero with non-zero probability. Example code and an R package for the EM and Gibbs sampling algorithms are available at the first author's website.

References

G. E. P. Box and D. R. Cox. An analysis of transformations (with discussion). J. Roy. Statist. Soc. Ser. B, 26:211-252, 1964.

Raymond J. Carroll. Adapting for heteroscedasticity in linear models. Ann. Statist., 10(4):1224-1233, 1982.

Raymond J. Carroll, David Ruppert, and Robert N. Holt, Jr. Some aspects of estimation in heteroscedastic linear models. In Statistical Decision Theory and Related Topics III, Vol. 1 (West Lafayette, Ind., 1981). Academic Press, New York, 1982.

Robert F. Engle and Kenneth F. Kroner. Multivariate simultaneous generalized ARCH. Econometric Theory, 11(1):122-150, 1995.

P. W. Fong, W. K. Li, and Hong-Zhi An. A simple multivariate ARCH model specified by random coefficients. Comput. Statist. Data Anal., 51(3), 2006.

Robert E. Kass and Larry Wasserman. A reference Bayesian test for nested hypotheses and its relationship to the Schwarz criterion. J. Amer. Statist. Assoc., 90(431):928-934, 1995.

Hans-Georg Müller and Ulrich Stadtmüller. Estimation of heteroscedasticity in regression analysis. Ann. Statist., 15(2), 1987.

Bernard Rosner. Fundamentals of Biostatistics. Duxbury Press, 2000.

Herbert C. Rutemiller and David A. Bowers. Estimation in a heteroscedastic regression model. J. Amer. Statist. Assoc., 63, 1968.

M. A. Scott and M. S. Handcock. Covariance models for latent structure in longitudinal data. Sociological Methodology, 2001.

Gordon K. Smyth. Generalized linear models with varying dispersion. J. Roy. Statist. Soc. Ser. B, 51(1):47-60, 1989.

Arnold Zellner. On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In Bayesian Inference and Decision Techniques, volume 6 of Stud. Bayesian Econometrics Statist. North-Holland, Amsterdam, 1986.


More information

An Introduction to Mplus and Path Analysis

An Introduction to Mplus and Path Analysis An Introduction to Mplus and Path Analysis PSYC 943: Fundamentals of Multivariate Modeling Lecture 10: October 30, 2013 PSYC 943: Lecture 10 Today s Lecture Path analysis starting with multivariate regression

More information

STA 294: Stochastic Processes & Bayesian Nonparametrics

STA 294: Stochastic Processes & Bayesian Nonparametrics MARKOV CHAINS AND CONVERGENCE CONCEPTS Markov chains are among the simplest stochastic processes, just one step beyond iid sequences of random variables. Traditionally they ve been used in modelling a

More information

Long-Run Covariability

Long-Run Covariability Long-Run Covariability Ulrich K. Müller and Mark W. Watson Princeton University October 2016 Motivation Study the long-run covariability/relationship between economic variables great ratios, long-run Phillips

More information

Multilevel Statistical Models: 3 rd edition, 2003 Contents

Multilevel Statistical Models: 3 rd edition, 2003 Contents Multilevel Statistical Models: 3 rd edition, 2003 Contents Preface Acknowledgements Notation Two and three level models. A general classification notation and diagram Glossary Chapter 1 An introduction

More information

Bayesian Inference. Chapter 9. Linear models and regression

Bayesian Inference. Chapter 9. Linear models and regression Bayesian Inference Chapter 9. Linear models and regression M. Concepcion Ausin Universidad Carlos III de Madrid Master in Business Administration and Quantitative Methods Master in Mathematical Engineering

More information

PACKAGE LMest FOR LATENT MARKOV ANALYSIS

PACKAGE LMest FOR LATENT MARKOV ANALYSIS PACKAGE LMest FOR LATENT MARKOV ANALYSIS OF LONGITUDINAL CATEGORICAL DATA Francesco Bartolucci 1, Silvia Pandofi 1, and Fulvia Pennoni 2 1 Department of Economics, University of Perugia (e-mail: francesco.bartolucci@unipg.it,

More information

Lecture 2: Linear and Mixed Models

Lecture 2: Linear and Mixed Models Lecture 2: Linear and Mixed Models Bruce Walsh lecture notes Introduction to Mixed Models SISG, Seattle 18 20 July 2018 1 Quick Review of the Major Points The general linear model can be written as y =

More information

Regression, Ridge Regression, Lasso

Regression, Ridge Regression, Lasso Regression, Ridge Regression, Lasso Fabio G. Cozman - fgcozman@usp.br October 2, 2018 A general definition Regression studies the relationship between a response variable Y and covariates X 1,..., X n.

More information

FE670 Algorithmic Trading Strategies. Stevens Institute of Technology

FE670 Algorithmic Trading Strategies. Stevens Institute of Technology FE670 Algorithmic Trading Strategies Lecture 3. Factor Models and Their Estimation Steve Yang Stevens Institute of Technology 09/12/2012 Outline 1 The Notion of Factors 2 Factor Analysis via Maximum Likelihood

More information

Vector Auto-Regressive Models

Vector Auto-Regressive Models Vector Auto-Regressive Models Laurent Ferrara 1 1 University of Paris Nanterre M2 Oct. 2018 Overview of the presentation 1. Vector Auto-Regressions Definition Estimation Testing 2. Impulse responses functions

More information

Labor-Supply Shifts and Economic Fluctuations. Technical Appendix

Labor-Supply Shifts and Economic Fluctuations. Technical Appendix Labor-Supply Shifts and Economic Fluctuations Technical Appendix Yongsung Chang Department of Economics University of Pennsylvania Frank Schorfheide Department of Economics University of Pennsylvania January

More information

Key Algebraic Results in Linear Regression

Key Algebraic Results in Linear Regression Key Algebraic Results in Linear Regression James H. Steiger Department of Psychology and Human Development Vanderbilt University James H. Steiger (Vanderbilt University) 1 / 30 Key Algebraic Results in

More information

Modeling conditional distributions with mixture models: Theory and Inference

Modeling conditional distributions with mixture models: Theory and Inference Modeling conditional distributions with mixture models: Theory and Inference John Geweke University of Iowa, USA Journal of Applied Econometrics Invited Lecture Università di Venezia Italia June 2, 2005

More information

EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7

EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7 Introduction to Generalized Univariate Models: Models for Binary Outcomes EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #7 EPSY 905: Intro to Generalized In This Lecture A short review

More information

Recursive Deviance Information Criterion for the Hidden Markov Model

Recursive Deviance Information Criterion for the Hidden Markov Model International Journal of Statistics and Probability; Vol. 5, No. 1; 2016 ISSN 1927-7032 E-ISSN 1927-7040 Published by Canadian Center of Science and Education Recursive Deviance Information Criterion for

More information

An Extended BIC for Model Selection

An Extended BIC for Model Selection An Extended BIC for Model Selection at the JSM meeting 2007 - Salt Lake City Surajit Ray Boston University (Dept of Mathematics and Statistics) Joint work with James Berger, Duke University; Susie Bayarri,

More information

STAT 540: Data Analysis and Regression

STAT 540: Data Analysis and Regression STAT 540: Data Analysis and Regression Wen Zhou http://www.stat.colostate.edu/~riczw/ Email: riczw@stat.colostate.edu Department of Statistics Colorado State University Fall 205 W. Zhou (Colorado State

More information

Multivariate Survival Analysis

Multivariate Survival Analysis Multivariate Survival Analysis Previously we have assumed that either (X i, δ i ) or (X i, δ i, Z i ), i = 1,..., n, are i.i.d.. This may not always be the case. Multivariate survival data can arise in

More information