arxiv: v3 [stat.me] 26 Feb 2012


Data augmentation for non-Gaussian regression models using variance-mean mixtures

Nicholas G. Polson, Booth School of Business, University of Chicago
James G. Scott, University of Texas at Austin

February 28, 2012

Abstract

We use the theory of normal variance-mean mixtures to derive a data-augmentation scheme that unifies a wide class of statistical models under a single framework. This generalizes existing theory on normal variance mixtures for priors in regression and classification. It also allows variants of the expectation-maximization algorithm to be brought to bear on a much wider range of models than previously appreciated. We demonstrate the resulting gains in accuracy and stability on several examples, including sparse quantile regression and binary logistic regression.

Key words: Data augmentation; Hierarchical model; Sparsity; Variance-mean mixture of normals

1 Introduction

1.1 Regularized regression and classification

Many problems in regularized estimation involve an objective function of the form

$$Q(\beta) = \sum_{i=1}^{n} f(y_i, x_i^T \beta \mid \sigma) + \sum_{j=1}^{p} g(\beta_j \mid \tau). \qquad (1)$$

Here $y_i$ is a response, which may be continuous, binary, or multinomial; $x_i$ is a known $p$-vector of predictors; $\beta = (\beta_1, \ldots, \beta_p)$ is an unknown vector of coefficients; $f$ and $g$ are the negative log likelihood and penalty function, respectively; and $\sigma$ and $\tau$ are scale parameters, for now assumed fixed. Following the literature on penalized likelihood, where the log prior is interpreted as a penalty function, we may phrase the problem as one of minimizing $Q(\beta)$, or equivalently maximizing the unnormalized posterior density $\exp\{-Q(\beta)\}$.

In this paper, we unify many seemingly disparate problems of this form into a single class of normal variance-mean mixtures. There are two main practical results of this

unification. First, it allows us to exploit the probabilistic latent-variable structure of $\exp\{-Q(\beta)\}$. This leads to expectation-maximization algorithms (Dempster et al., 1977), and variants thereof, that avoid the need for analytic approximations, numerical derivatives, or other black-box optimization routines. The result is a unified, accurate, stable approach to estimation in many non-Gaussian problems, including quantile regression, logistic regression, support-vector machines, and robust regression. In the case of logistic regression, the gains in accuracy and stability hold even for maximum-likelihood estimation.

We emphasize that the crucial step here is probabilistic, rather than algorithmic, in nature. Upon recognizing a certain likelihood as a variance-mean mixture, the algorithm comes essentially for free by applying Theorem 1, the paper's main result. This theorem describes a simple relationship between the derivatives of $f$ and $g$ and the conditional sufficient statistics for the latent variables that arise in our expectation-maximization algorithm. The expected values of these conditional sufficient statistics can usually be calculated in closed form, even if the full conditional distribution of the latent variables is unknown or intractable. Theorem 3 provides the corresponding result for the posterior mean estimator, generalizing the results of Masreliez (1975) and Pericchi and Smith (1992), long recognized for their importance in robust Bayesian inference.

A second major advantage of our approach is its wide scope of potential use. For example, by representing both likelihoods and priors as mixtures, our method allows penalized-likelihood methods to be applied within hierarchical non-Gaussian models, where information is pooled across batches of related coefficients, with no essential modification of the approach.
Moreover, our data-augmentation scheme can be woven together seamlessly with other latent-variable methods, such as those for discrete mixtures and missing-data problems. It also suggests an approach for deriving more sophisticated Markov-chain sampling methods for full Bayesian inference.

1.2 Relationship with previous work

Our work is motivated by recent Bayesian research on sparsity-inducing priors in linear regression, where $f = \|y - X\beta\|^2$ is the negative Gaussian log likelihood, and $g$ corresponds to a normal variance-mixture prior (Andrews and Mallows, 1974). Examples of work in this area include the lasso (Tibshirani, 1996; Park and Casella, 2008; Hans, 2009); the bridge estimator (West, 1987; Knight and Fu, 2000; Huang et al., 2008; Polson and Scott, 2011b); the relevance vector machine of Tipping (2001); the normal/Jeffreys model of Figueiredo (2003) and Bae and Mallick (2004); the normal/exponential-gamma model of Griffin and Brown (2012); the normal/gamma and normal/inverse-Gaussian (Caron and Doucet, 2008; Griffin and Brown, 2010); the horseshoe prior of Carvalho et al. (2010); the hypergeometric inverted beta model of Polson and Scott (2012b); and the double-Pareto model of Armagan et al. (2012a). A related line of work concerns the use of iterative convex relaxation in nonconcave penalized-likelihood problems (e.g. Zou and Li, 2008). Polson and Scott (2011a) give a review of this extensive literature, and the connections between Bayesian shrinkage estimation and penalized likelihood.

We generalize this work by representing both the likelihood and prior as variance-mean mixtures of Gaussians. This data-augmentation approach relies upon the following decomposition:

$$p(\beta \mid \tau, \sigma, y) \propto e^{-Q(\beta)} \propto \exp\left\{ -\sum_{i=1}^{n} f(y_i, x_i^T \beta \mid \sigma) - \sum_{j=1}^{p} g(\beta_j \mid \tau) \right\} \propto \left\{ \prod_{i=1}^{n} p(z_i \mid \beta, \sigma) \right\} \left\{ \prod_{j=1}^{p} p(\beta_j \mid \tau) \right\} = p(z \mid \beta, \sigma)\, p(\beta \mid \tau),$$

where the working response $z_i$ is equal to $y_i - x_i^T \beta$ for regression, or $y_i x_i^T \beta$ for binary classification, with $y_i$ coded as $\pm 1$. The case of a multinomial response requires only a small modification, detailed in the appendix. Both $\sigma$ and $\tau$ are hyperparameters; they are typically estimated jointly with $\beta$, although they may also be specified by the user or chosen by cross-validation. In some cases, most notably in logistic regression, the likelihood is free of hyperparameters, in which case $\sigma$ does not appear in the model.

Table 1: Variance-mean mixture representations for many common loss functions. Recall that $z_i = y_i - x_i^T \beta$ for regression, or $z_i = y_i x_i^T \beta$ for binary classification.

Error/loss function        $f(z_i \mid \beta, \sigma)$     $\kappa_z$    $\mu_z$    $p(\omega_i)$
Squared-error              $z_i^2/\sigma^2$                $0$           $0$        $\omega_i = 1$
Absolute-error             $|z_i|/\sigma$                  $0$           $0$        Exponential
Check loss                 $|z_i| + (2q - 1) z_i$          $1 - 2q$      $0$        Generalized inverse Gaussian
Support vector machines    $\max(1 - z_i, 0)$              $1$           $1$        Generalized inverse Gaussian
Logistic                   $\log(1 + e^{-z_i})$            $1/2$         $0$        Polya

One thing we do not do in this paper is to study the formal statistical properties of the resulting estimators, such as consistency as $p$ and $n$ both diverge. For this we refer the reader to Griffin and Brown (2012) and Armagan et al. (2012b), who discuss this issue from both a Bayesian and classical perspective. Rather, we provide a representation theorem that makes these estimators both easier to compute and more widely applicable to hierarchical, non-Gaussian models.

2 Normal variance-mean mixtures

There are two key steps in our approach: first, we use variance-mean mixtures, rather than just variance mixtures; and second, we interweave two different mixture representations, one for the likelihood and one for the prior.
The introduction of latent variables $\{\omega_i\}$ and $\{\lambda_j\}$ in Equations (2) and (3), below, reduces $\exp\{-Q(\beta)\}$ to a Gaussian linear model

with heteroscedastic errors:

$$p(z_i \mid \beta, \sigma) = \int_0^\infty \phi(z_i \mid \mu_z + \kappa_z \omega_i^{-1}, \sigma^2 \omega_i^{-1}) \, dP(\omega_i) \qquad (2)$$

$$p(\beta_j \mid \tau) = \int_0^\infty \phi(\beta_j \mid \mu_\beta + \kappa_\beta \lambda_j^{-1}, \tau^2 \lambda_j^{-1}) \, dP(\lambda_j), \qquad (3)$$

where $\phi(a \mid m, v)$ is the normal density, evaluated at $a$, for mean $m$ and variance $v$. By marginalizing over these latent variables with respect to different fixed combinations of $(\mu_z, \kappa_z, \mu_\beta, \kappa_\beta)$ and the mixing measures $P(\lambda_j)$ and $P(\omega_i)$, it is possible to generate many commonly used objective functions that have not been widely recognized as Gaussian mixtures. Table 1 lists several common likelihoods in this class, along with the corresponding fixed choices for $(\kappa_z, \mu_z)$ and the mixing distribution $P(\omega_i)$. A discussion of priors and penalty functions that fall within this class can be found in Polson and Scott (2012a).

An important feature of our approach is that we avoid dealing directly with conditional distributions for these latent variables. To find the posterior mode, it is sufficient to use Theorem 1 to calculate moments of these distributions, exploiting known facts about Gaussian mixtures. These moments in turn depend only upon the derivatives of $f$ and $g$, along with the hyperparameters.

We focus on two choices of the mixing measure: the generalized inverse-Gaussian distribution; and the Polya distribution, which is essentially an infinite sum of exponential random variables (Barndorff-Nielsen et al., 1982). These two choices lead to the hyperbolic and Z distributions, respectively, for the resulting variance-mean mixture. The two key integral identities are

$$\frac{\alpha^2 - \kappa^2}{2\alpha} \, e^{-\alpha|\theta - \mu| + \kappa(\theta - \mu)} = \int_0^\infty \phi(\theta \mid \mu + \kappa v, v) \, p_G\left(v \mid 1, 0, \sqrt{\alpha^2 - \kappa^2}\right) dv$$

$$\frac{1}{B(\alpha, \kappa)} \, \frac{e^{\alpha(\theta - \mu)}}{(1 + e^{\theta - \mu})^{2(\alpha - \kappa)}} = \int_0^\infty \phi(\theta \mid \mu + \kappa v, v) \, p_P(v \mid \alpha, \alpha - 2\kappa) \, dv,$$

where $p_G$ and $p_P$ are the density functions of the generalized inverse-Gaussian and Polya distributions, respectively.
We use $\theta$ to denote a dummy argument that could involve either data or parameters, and $v$ to denote a latent variance; all other terms are hyperparameters specified by the user for the purposes of representing a particular density or function. These two expressions lead, by a simple application of the Fatou-Lebesgue theorem, to three further identities for the improper limiting cases of the two densities above:

$$a^{-1} \exp\{-2c^{-1} \max(a\theta, 0)\} = \int_0^\infty \phi(\theta \mid -av, cv) \, dv$$

$$c^{-1} \exp\{-2c^{-1} \rho_q(\theta)\} = \int_0^\infty \phi(\theta \mid -(2q - 1)v, cv) \, e^{-2q(1 - q)v} \, dv$$

$$(1 + \exp\{\theta - \mu\})^{-1} = \int_0^\infty \phi(\theta \mid \mu - (1/2)v, v) \, p_P(v \mid 0, 1) \, dv,$$

where $\rho_q(\theta) = \frac{1}{2}|\theta| + \left(q - \frac{1}{2}\right)\theta$ is the check-loss function (Koenker, 2005; Li et al., 2010).
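The first two limiting identities can be checked numerically. The sketch below is illustrative only and is not taken from the paper: the function names (`mixture_integral`, `svm_side`, `check_loss`), the substitution $v = u^2$, the truncation point, and the use of Simpson's rule are all assumptions made for the illustration. The check-loss comparison is restricted to $c = 1$, and the Polya identity is not covered.

```python
import math


def mixture_integral(theta, drift, c=1.0, tilt=0.0, upper=12.0, n=4000):
    """Approximate int_0^inf phi(theta | drift*v, c*v) * exp(-tilt*v) dv.

    Substituting v = u**2 cancels the v**(-1/2) factor of the normal density,
    so the transformed integrand is smooth and composite Simpson's rule on
    [0, upper] (n even) is adequate. The tail beyond v = upper**2 is ignored.
    """
    const = 2.0 / math.sqrt(2.0 * math.pi * c)

    def integrand(u):
        if u == 0.0:
            # Limit as u -> 0: zero unless theta is exactly 0.
            return const if theta == 0.0 else 0.0
        v = u * u
        resid = theta - drift * v
        return const * math.exp(-resid * resid / (2.0 * c * v) - tilt * v)

    step = upper / n
    total = integrand(0.0) + integrand(upper)
    for k in range(1, n):
        total += (4.0 if k % 2 else 2.0) * integrand(k * step)
    return total * step / 3.0


def svm_side(theta, a=1.0, c=1.0):
    """Closed form a^{-1} exp{-2 max(a*theta, 0) / c} from the first identity."""
    return math.exp(-2.0 * max(a * theta, 0.0) / c) / a


def check_loss(theta, q):
    """rho_q(theta) = |theta| / 2 + (q - 1/2) * theta."""
    return 0.5 * abs(theta) + (q - 0.5) * theta
```

For instance, `mixture_integral(0.7, drift=-1.0, c=2.0)` can be compared with `svm_side(0.7, 1.0, 2.0)`, and, for a quantile `q` with `c = 1`, `mixture_integral(theta, drift=1 - 2*q, tilt=2*q*(1 - q))` with `math.exp(-2 * check_loss(theta, q))`.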

The first of these identities leads to the likelihood for support-vector machines; the second, to quantile and lasso regression; and the third, to logistic and multinomial logistic regression. The function $p_P(v \mid 0, 1)$ is an improper density corresponding to a sum of exponential random variables. Thus by using either a generalized inverse-Gaussian or a Polya mixing distribution, one can generate objective functions corresponding to the lasso estimator, support-vector machines, the check-loss function, and the binary and multinomial logistic-regression models. The relevant distribution theory leading to these integral identities can be found in Appendix A.

Previous studies (e.g. Polson and Scott, 2011c; Gramacy and Polson, 2012) have presented similar results for specific models, including support-vector machines and the so-called powered-logit likelihood. But as far as we are aware, ours is the first characterization of the full class.

3 An expectation-maximization algorithm

3.1 Overview of approach

We now show that this conditionally Gaussian representation results in a simple expectation-maximization algorithm for any model in the proposed class, given our data-augmentation scheme. In the expectation step, one computes the expected value of the log posterior, given hyperparameters and the current estimate $\beta^{(g)}$ at step $g$ of the algorithm:

$$C(\beta \mid \beta^{(g)}) = \int \log p(\beta \mid \omega, \lambda, \tau, \sigma, z) \, p(\omega, \lambda \mid \beta^{(g)}, \tau, z) \, d\omega \, d\lambda.$$

Then in the maximization step, one maximizes the complete-data posterior as a function of $\beta$:

$$\beta^{(g+1)} = \arg\max_\beta C(\beta \mid \beta^{(g)}).$$

The advantages of the expectation-maximization algorithm are its lack of user-specified tuning constants, and its well-known theoretical properties: the sequence of estimated parameter values $\{\beta^{(1)}, \beta^{(2)}, \ldots\}$ monotonically increases the observed-data log posterior density, and will converge to a global maximum if this function is concave.

The complete-data log posterior can be represented in two different ways.
We appeal to both of these representations in deriving the expectation and maximization steps. Under a normal variance-mean mixture of the form in (2)-(3),

$$\log p(\beta \mid \omega, \lambda, \tau, \sigma, z) = c_0(\omega, \lambda, \tau, \sigma, z) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \omega_i \left( z_i - \mu_z - \kappa_z \omega_i^{-1} \right)^2 - \frac{1}{2\tau^2} \sum_{j=1}^{p} \lambda_j \left( \beta_j - \mu_\beta - \kappa_\beta \lambda_j^{-1} \right)^2 \qquad (4)$$

for some constant $c_0$, recalling that $z_i = y_i - x_i^T \beta$ for regression or $z_i = y_i x_i^T \beta$ for classification. Factorizing this further as a function of $\beta$ yields

$$\log p(\beta \mid \omega, \lambda, \tau, \sigma, z) = c_1(\omega, \lambda, \tau, \sigma, z) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \omega_i (z_i - \mu_z)^2 + \frac{\kappa_z}{\sigma^2} \sum_{i=1}^{n} (z_i - \mu_z) - \frac{1}{2\tau^2} \sum_{j=1}^{p} \lambda_j (\beta_j - \mu_\beta)^2 + \frac{\kappa_\beta}{\tau^2} \sum_{j=1}^{p} (\beta_j - \mu_\beta) \qquad (5)$$

for some constant $c_1$. We now explicitly derive the expectation and maximization steps, along with the necessary conditional sufficient statistics.

3.2 The expectation step

From (5), observe that the complete-data objective function is linear in both $\omega_i$ and $\lambda_j$. Therefore, in the expectation step, we calculate the complete-data log posterior by replacing $\lambda_j$ and $\omega_i$ with their conditional expectations $\hat\lambda_j^{(g)}$ and $\hat\omega_i^{(g)}$, given the data and the current $\beta^{(g)}$. The following theorem provides expressions for these conditional moments under any model where both the likelihood and prior can be represented by normal variance-mean mixtures.

Theorem 1. Suppose that the objective function $Q(\beta)$ in (1) can be represented by a hierarchical variance-mean Gaussian mixture, as in Equations (2) and (3). Then the conditional moments $\hat\lambda_j = E(\lambda_j \mid \beta, \tau, z)$ and $\hat\omega_i = E(\omega_i \mid \sigma, z_i)$ are given by the following expressions:

$$(\beta_j - \mu_\beta) \, \hat\lambda_j = \kappa_\beta + \tau^2 g'(\beta_j \mid \tau)$$

$$(z_i - \mu_z) \, \hat\omega_i = \kappa_z + \sigma^2 f'(z_i \mid \beta, \sigma),$$

where $f'$ and $g'$ are the derivatives of the negative log likelihood and negative log prior from (1), respectively.

The advantage of the theorem is that it characterizes the required moments purely in terms of the likelihood and penalty functions, which are pre-specified in most regularization problems: $f(z_i) = -\log p(z_i \mid \beta, \sigma)$ and $g(\beta_j) = -\log p(\beta_j \mid \tau)$.

One caveat is that when $\beta_j \approx \mu_\beta$, the conditional moment for $\lambda_j$ in the expectation step may be numerically infinite, and care must be taken. Indeed, infinite values for $\lambda_j$ will arise naturally under certain sparsity-inducing choices of $g$, such as the lasso, and indicate that the algorithm has converged to a sparse solution.
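As an illustration of how Theorem 1 turns derivatives into conditional moments, the following sketch solves the theorem's relation for the moment directly. It is a minimal example rather than the authors' code: the helper names are hypothetical, the lasso penalty is assumed to take the form $g(\beta_j \mid \tau) = |\beta_j|/\tau$, and no attempt is made to handle the numerically infinite case discussed in the caveat above.

```python
import math


def conditional_moment(x, deriv, center=0.0, kappa=0.0, scale_sq=1.0):
    """Solve (x - center) * moment = kappa + scale_sq * deriv(x) for the moment.

    With (x, center, kappa, scale_sq, deriv) = (z_i, mu_z, kappa_z, sigma^2, f')
    this gives omega_hat_i; with (beta_j, mu_beta, kappa_beta, tau^2, g') it
    gives lambda_hat_j. Raises ZeroDivisionError when x == center.
    """
    return (kappa + scale_sq * deriv(x)) / (x - center)


def logistic_fprime(z):
    """Derivative of the logistic loss f(z) = log(1 + exp(-z))."""
    return -1.0 / (1.0 + math.exp(z))


def check_fprime(q):
    """Derivative of the check loss f(z) = |z| + (2q - 1) z, valid for z != 0."""
    return lambda z: math.copysign(1.0, z) + (2.0 * q - 1.0)


def lasso_gprime(tau):
    """Derivative of the assumed lasso penalty g(beta | tau) = |beta| / tau."""
    return lambda b: math.copysign(1.0, b) / tau


# Logistic likelihood (kappa_z = 1/2, mu_z = 0): omega_hat = tanh(z / 2) / (2 z).
omega_logit = conditional_moment(2.0, logistic_fprime, kappa=0.5)

# Check loss (kappa_z = 1 - 2q, mu_z = 0): omega_hat = 1 / |z|.
omega_check = conditional_moment(-1.5, check_fprime(0.3), kappa=1.0 - 2.0 * 0.3)

# Lasso prior (kappa_beta = mu_beta = 0): lambda_hat = tau / |beta|, which
# grows without bound as beta approaches zero.
lambda_lasso = conditional_moment(0.2, lasso_gprime(0.5), scale_sq=0.5 ** 2)
```

The same function serves both the likelihood and the prior because the two lines of Theorem 1 share one algebraic form; only the derivative and the constants change.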
One way to handle numerically infinite values of $\hat\lambda_j$ is to start the algorithm from a value where $(\beta - \mu_\beta)$ has no zeros, and to remove $\beta_j$ from the model when it gets within a small numerical threshold of its mean (cf. Fan and Li, 2001). This conveys the added benefit of hastening the matrix computations in the maximization step. Although we have found this approach to work well in practice, it has the disadvantage that a variable cannot re-enter the model once it has been deleted. Therefore, if using a sparsity-inducing prior, we check for convergence

once a putative zero has been found, by proposing small perturbations in each component of $\beta$ to assess whether any variables should re-enter the model. An alternate approach involves the use of restricted least-squares; for details, see Section 3.2 of Polson and Scott (2011c).

Our method, of course, does not sidestep the problem of optimization over a combinatorially large space. In particular, there is no way to guarantee convergence to a global maximum if the penalty function is concave, in which case multiple restarts from different initial values will be necessary to check for the presence of local modes.

3.3 The maximization step

Returning to (4), the maximization step involves computing the posterior mode under a heteroscedastic Gaussian error model and a conditionally Gaussian prior for $\beta$, given the latent variables $\{\omega_i\}$ and $\{\lambda_j\}$. This posterior mode is recognizable as a generalized ridge estimator.

Theorem 2. Suppose that the objective function $Q(\beta)$ in (1) can be represented by variance-mean Gaussian mixtures, as in (2)-(3). Then given estimates $\{\hat\omega_i\}$ and $\{\hat\lambda_j\}$, we have the following expressions for the conditional maximum of $\beta$, where $\omega = (\omega_1, \ldots, \omega_n)$ and $\lambda = (\lambda_1, \ldots, \lambda_p)$ are vectors, and where $\Omega = \mathrm{diag}(\omega_1, \ldots, \omega_n)$ and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$.

1. In a regression problem,

$$\hat\beta = \left( \tau^{-2} \hat\Lambda + X^T \hat\Omega X \right)^{-1} (y^\star + b^\star), \qquad y^\star = X^T \left( \hat\Omega y - \mu_z \hat\omega - \kappa_z 1 \right), \qquad b^\star = \tau^{-2} (\mu_\beta \hat\lambda + \kappa_\beta 1).$$

2. In a binary classification problem where $y_i = \pm 1$ and $X_\star$ has rows $x_i^\star = y_i x_i$,

$$\hat\beta = \left( \tau^{-2} \hat\Lambda + X_\star^T \hat\Omega X_\star \right)^{-1} X_\star^T \left( \mu_z \hat\omega + \kappa_z 1 \right).$$

The maximization step can be easily extended to encompass a series of conditional maximization steps: first for the regression coefficients $\beta$, and then for hyperparameters such as $\sigma$ and $\tau$. These latter steps exploit standard results on variance components in linear models; we therefore omit the details, and refer the reader to, for example, Gelman (2006).
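The two updates in Theorem 2 are single linear solves, which the following NumPy sketch makes concrete. It is offered as an illustration under stated assumptions rather than as the authors' implementation: the function names and argument order are hypothetical, and, following the theorem as printed, $\sigma$ does not appear (equivalently, $\sigma = 1$ is assumed).

```python
import numpy as np


def m_step_regression(X, y, omega, lam, tau,
                      mu_z=0.0, kappa_z=0.0, mu_beta=0.0, kappa_beta=0.0):
    """Generalized ridge update for regression, where z = y - X beta."""
    # tau^{-2} Lambda + X^T Omega X, without forming Omega explicitly.
    precision = np.diag(lam) / tau**2 + X.T @ (omega[:, None] * X)
    # y_star = X^T (Omega y - mu_z omega - kappa_z 1).
    y_star = X.T @ (omega * y - mu_z * omega - kappa_z)
    # b_star = tau^{-2} (mu_beta lambda + kappa_beta 1).
    b_star = (mu_beta * lam + kappa_beta) / tau**2
    return np.linalg.solve(precision, y_star + b_star)


def m_step_classification(X_star, omega, lam, tau, mu_z=0.0, kappa_z=0.5):
    """Generalized ridge update for y_i = +/-1; X_star has rows y_i * x_i."""
    precision = np.diag(lam) / tau**2 + X_star.T @ (omega[:, None] * X_star)
    return np.linalg.solve(precision, X_star.T @ (mu_z * omega + kappa_z))
```

With all weights equal to one and all location and drift terms equal to zero, `m_step_regression` reduces to ordinary ridge regression with penalty matrix $\tau^{-2} I$. Using `np.linalg.solve` rather than an explicit inverse is a design choice of this sketch, not something the paper prescribes.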
4 Examples

4.1 Binary logistic regression

The simplicity and generality of our approach are best illustrated by an initial example involving a familiar likelihood. Suppose we wish to fit a penalized logistic regression, where

$$\hat\beta = \arg\min_{\beta \in R^p} \left[ \sum_{i=1}^{n} \log\{1 + \exp(-y_i x_i^T \beta)\} + \sum_{j=1}^{p} g(\beta_j \mid \tau) \right],$$

assuming that the outcomes $y_i$ are coded as $\pm 1$, and that $\tau$ is fixed. Many factors conspire to make this a difficult problem, but we focus particularly on the non-Gaussianity of the likelihood.

For the logistic likelihood, recall that $\omega_i^{-1}$ has a Polya distribution with $\alpha = 1$, $\kappa = 1/2$. Theorem 1 gives the relevant conditional moment as

$$\hat\omega_i = \frac{1}{z_i} \left( \frac{e^{z_i}}{1 + e^{z_i}} - \frac{1}{2} \right).$$

Therefore, if the log prior $g$ satisfies (3), then the following three updates, when iterated repeatedly, will generate a sequence of estimates that converges to a stationary point of $Q(\beta)$:

$$\beta^{(g+1)} = \left( \tau^{-2} \hat\Lambda^{(g)} + X_\star^T \hat\Omega^{(g)} X_\star \right)^{-1} \left( \frac{1}{2} X_\star^T 1 \right)$$

$$\hat\omega_i^{(g+1)} = \frac{1}{z_i^{(g+1)}} \left( \frac{e^{z_i^{(g+1)}}}{1 + e^{z_i^{(g+1)}}} - \frac{1}{2} \right)$$

$$\hat\lambda_j^{(g+1)} = \frac{\kappa_\beta + \tau^2 g'(\beta_j^{(g+1)} \mid \tau)}{\beta_j^{(g+1)} - \mu_\beta},$$

where $z_i^{(g)} = y_i x_i^T \beta^{(g)}$; $X_\star$ is the matrix having rows $x_i^\star = y_i x_i$; $1$ is a vector of ones; and where $\Omega = \mathrm{diag}(\omega_1, \ldots, \omega_n)$ and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$ are diagonal matrices. If the penalty function $g$ is convex on $(-\infty, \infty)$, this stationary point will be the global maximum (Zou and Li, 2008).

This sequence of steps resembles iteratively re-weighted least squares (Green, 1984) due to the presence of the diagonal weights matrix $\Omega$. But there are subtle and important differences, even in the unpenalized case where $\lambda_j \equiv 0$ and the solution is the maximum-likelihood estimator. In iteratively re-weighted least squares, for example, the analogous weight matrix $\Omega$ has diagonal entries $\omega_i = \mu_i(1 - \mu_i)$, where $\mu_i = 1/(1 + e^{-x_i^T \beta})$ is the estimated value of $\mathrm{pr}(y_i = 1)$ at each stage of the algorithm. These weights arise from a sequential approximation to the likelihood. In contrast, the weights $\omega_i$ in our second step above arise from an exact mixture representation of the likelihood.

Figure 1 shows a comparison of the weights for each algorithm, as a function of the linear predictor $x_i^T \beta$. The weights under iteratively re-weighted least squares decay to zero much more rapidly than in our algorithm.
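The updates above, together with the two weight functions being compared, can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code. It assumes a Gaussian prior $\beta_j \sim N(0, \tau^2)$, that is $g(\beta_j \mid \tau) = \beta_j^2/(2\tau^2)$ with $\mu_\beta = \kappa_\beta = 0$, for which Theorem 1 gives $\hat\lambda_j = 1$ at every iteration; the function names, starting value, and stopping rule are likewise assumptions.

```python
import numpy as np


def em_weight(z):
    """omega_hat(z) = {e^z / (1 + e^z) - 1/2} / z = tanh(z / 2) / (2 z)."""
    z = np.asarray(z, dtype=float)
    small = np.abs(z) < 1e-8
    safe = np.where(small, 1.0, z)          # avoid 0/0; the limit at 0 is 1/4
    return np.where(small, 0.25, np.tanh(safe / 2.0) / (2.0 * safe))


def irls_weight(eta):
    """mu (1 - mu) with mu = 1 / (1 + e^{-eta}), written as 1 / (4 cosh^2(eta / 2))."""
    return 0.25 / np.cosh(np.asarray(eta, dtype=float) / 2.0) ** 2


def em_logistic(X, y, tau=1.0, max_iter=500, tol=1e-10):
    """EM for logistic regression under the assumed N(0, tau^2) prior."""
    X_star = X * y[:, None]                 # rows y_i * x_i, with y_i in {-1, +1}
    p = X_star.shape[1]
    rhs = 0.5 * X_star.sum(axis=0)          # (1/2) X_star^T 1
    beta = np.zeros(p)
    for _ in range(max_iter):
        omega = em_weight(X_star @ beta)    # expectation step
        precision = np.eye(p) / tau**2 + X_star.T @ (omega[:, None] * X_star)
        new_beta = np.linalg.solve(precision, rhs)   # maximization step
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta
```

Both weight functions equal $1/4$ at zero, but `em_weight` decays like $1/(2|z|)$ while `irls_weight` decays exponentially, which is the contrast the comparison of weights describes. A different penalty would only change the diagonal added to `precision`, through $\hat\lambda_j$.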
This rapid decay of the iteratively re-weighted least-squares weights can lead to numerical difficulties when the successes and failures are nearly separable by a hyperplane in $R^p$.

To illustrate this phenomenon, we simulated a logistic-regression problem with 2 variables and 1 observations, where each coefficient $\beta_j$ and each design point $x_i$ were independent draws from a standard normal distribution. We then compared our expectation-maximization algorithm to iteratively re-weighted least squares for computing the unpenalized maximum-likelihood estimate. Each algorithm was run from two different starting values: $\beta^{(0)} = (1/10, \ldots, 1/10)^T$, and $\beta^{(0)} = (1/2, \ldots, 1/2)^T$.

As Figure 2 shows, iteratively re-weighted least squares is highly sensitive to the choice

of starting value. It converges when initialized at $\beta^{(0)} = (1/10, \ldots, 1/10)^T$, but not when initialized at $\beta^{(0)} = (1/2, \ldots, 1/2)^T$. Our data-augmentation algorithm is far more robust, finding the maximum easily in both cases. We emphasize that this is not a problem caused by perfect separability of the successes and failures, in which case there is no unique maximum-likelihood solution. As the bottom pane of Figure 2 shows, 18 of 1 observations had fitted success probabilities between 0.05 and 0.95. Rather, the problem is that the iterative approximation can be poor in regions where the likelihood surface is nearly flat.

Figure 1: The weights $\omega_i$ on the diagonal weight matrix $\Omega$ that arise in iteratively re-weighted least squares for logistic regression, versus those that arise in our data-augmentation approach, as a function of the linear predictor $x_i^T \beta$. (Vertical axis: weight function; horizontal axis: absolute value of linear predictor; curves: iteratively re-weighted least squares and expectation-maximization.)

Next, we investigated the performance of data augmentation versus iteratively re-weighted least squares when a penalty function is applied. For this case we simulated data with a nontrivial correlation structure in the design matrix. Specifically, we defined $\Sigma = BB^T + \Psi$, where $B$ is a 5 4 matrix of standard normal random entries, and $\Psi$ is a diagonal matrix with $\chi^2_1$ random entries. The rows of the design matrix $X$ were simulated from a multivariate normal distribution with mean zero and covariance matrix $\Sigma$; the coefficients $\beta_j$ were standard normal random draws; and the size of the data set was $p = 5$ and $n = 2$.

We first used a ridge-regression penalty, where $g(\beta_j \mid \tau) = (\beta_j/\tau)^2$, leading to a convex problem. We also used the generalized double-Pareto model proposed by Armagan et al. (2012a), where

$$p(\beta_j \mid \tau) \propto \left( 1 + \frac{|\beta_j|}{a\tau} \right)^{-(1+a)}.$$

Like the Laplace prior, this is non-differentiable at zero and is therefore sparsity-inducing, but has polynomial rather than exponential tail behavior. Armagan et al.
(2012b) show that this model leads to strong consistency of the posterior in regression models with a diverging number of parameters, which can be thought of as a Bayesian analogue of the oracle property. The generalized double-Pareto model has a conditionally Gaussian representation, making Theorem 1 applicable.
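To see what Theorem 1 gives for this prior, the following short derivation may help. It is a worked sketch rather than a result quoted from the paper, and it assumes that the generalized double-Pareto prior is represented in (3) as a pure variance mixture, so that $\mu_\beta = \kappa_\beta = 0$.

```latex
\begin{align*}
g(\beta_j \mid \tau) &= (1 + a) \log\left( 1 + \frac{|\beta_j|}{a\tau} \right)
  && \text{(negative log prior, up to a constant)} \\
g'(\beta_j \mid \tau) &= \frac{(1 + a)\,\mathrm{sign}(\beta_j)}{a\tau + |\beta_j|}
  && (\beta_j \neq 0) \\
\hat\lambda_j &= \frac{\tau^2 g'(\beta_j \mid \tau)}{\beta_j}
  = \frac{(1 + a)\,\tau^2}{|\beta_j| \, (a\tau + |\beta_j|)}
  && \text{(Theorem 1 with } \mu_\beta = \kappa_\beta = 0\text{)}
\end{align*}
```

Under these assumptions $\hat\lambda_j$ diverges as $\beta_j \to 0$, in line with the earlier caveat about sparsity-inducing penalties, and decays like $|\beta_j|^{-2}$ for large coefficients, so large signals are shrunk only weakly.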

Figure 2: Convergence of expectation-maximization versus iteratively re-weighted least squares from two different starting points for a simulated logistic-regression problem. The top four panes show the contours of the likelihood as a function of $(\beta_1, \beta_2)$, with all other coefficients held at their maximum-likelihood values. In the left-hand panes, the coefficients were all initialized at 0.1; in the right-hand panes, they were all initialized at 0.5. The black lines trace out the values of $(\beta_1, \beta_2)$ as the algorithm progresses; the grey dot is the true maximum. The bottom pane shows the fitted values $\mathrm{pr}(y_i = 1)$ at this maximum. (Top panes: expectation-maximization and iteratively re-weighted least squares at each starting value, with axes $\beta_1$ and $\beta_2$; bottom pane: fitted probability versus observation.)

Figure 3: The solution paths for $\beta$ for the double-Pareto and ridge penalties, as a function of the regularization parameter $\log(1/\tau)$, for a simulated logistic regression problem. The black lines show the solution for iteratively re-weighted least squares; the grey lines, for our expectation-maximization algorithm. Moving from right to left, the black lines stop where iteratively re-weighted least squares fails due to numerical instability. (Panels: solution path with double-Pareto penalty and solution path with ridge penalty; vertical axis: coefficients; horizontal axis: log of regularization parameter.)

We chose $a = 2$, and used each algorithm to compute a full solution path for $\beta$ as a function of the regularization parameter, here expressed as $\log(1/\tau)$ in keeping with the penalized-likelihood literature. We began by fitting the solution for $\tau = 10^{-3}$, which essentially constrained all coefficients to be zero, or very small. We then increased the value of $\tau$ along a discrete grid $\{\tau_1, \ldots, \tau_K = 1\}$, using the solution for $\tau_k$ as the starting value for the $\tau_{k+1}$ case.

As Figure 3 shows, iteratively re-weighted least squares fails when $\log(1/\tau)$ becomes too small, and the coefficient vector becomes larger in magnitude. This happens because the linear system that must be solved at each stage of the algorithm becomes numerically singular. It does so, moreover, at a point when 2 out of 2 observations still had fitted success probabilities between 0.05 and 0.95. In fact, under the double-Pareto prior, iteratively re-weighted least squares fails before all coefficients have even entered the model. No such pathology affects the expectation-maximization algorithm.

Two further comments are in order. First, while we have focused here on estimating $\beta$ using the posterior mode, the problems we have identified will also arise in any Bayesian treatment of logistic regression that is based upon an analytic approximation to the logistic likelihood (e.g. Gelman et al., 2008a).
We hope to extend these results to facilitate fully Bayesian inference.

Second, sparse logistic regression via penalized likelihood is a topic of great current interest (Genkin et al., 2007; Meier et al., 2008). This problem involves three logically distinct issues: how to handle the logistic likelihood; which penalty function to use; and how to fit the resulting model, whether by a block updating scheme or by coordinate descent. These issues interact in poorly understood ways. For example, it is widely known that

coordinate-by-coordinate algorithms, including Gibbs sampling, can fare poorly in highly multimodal situations. Likewise, it is known that nonconvex penalties lead to multimodal objective functions, but also, subject to certain regularity conditions, exhibit more favorable statistical properties for estimating sparse signals (Fan and Li, 2001; Carvalho et al., 2010). Moreover, coordinate descent is tractable only if the chosen penalty leads to a univariate thresholding function whose solution is analytically available (Mazumder et al., 2011). This is a fairly narrow class, and does not include most of the penalties mentioned in the introduction.

The question of how to handle the likelihood complicates matters still further. For example, the area of detail in Figure 3 shows that, for a double-Pareto penalty, the solution paths fit by iteratively re-weighted least squares differ in subtle but noticeable ways from those fit by data augmentation. By simply checking the maximized value of the objective function under both methods, we are able to confirm that iteratively re-weighted least squares is not converging to the true optimum. Yet we do not entirely understand why, and under what circumstances, the methods will differ, and how these differences should affect recommendations about what penalty function and algorithm should be used to fit logistic regression models. A full study of these issues is beyond the scope of the current paper, but is a subject of active inquiry.

4.2 Penalized quantile regression

We now describe a data analysis whose goal is to understand the predictors of the long-term growth rate of a country's gross domestic product. Specifically, we use quantile regression to describe how the conditional quantile of a country's annualized growth rate, $y_i$, depends upon various political and socioeconomic predictors. The data set comprises 161 observations on 87 countries over two periods. It comes originally from a 1994 discussion paper by R. Barro and J.
Lee, available from the National Bureau of Economic Research; a full description can be found in Koenker and Machado (1999).

We used our data-augmentation scheme to fit penalized quantile-regression models, comparing them to the corresponding maximum-likelihood estimates. To do so, we choose $p(\omega_i)$ to be a generalized inverse-Gaussian prior of unit scale, where $(\alpha, \kappa, \mu) = (1, 1 - 2q, 0)$. This leads to $-\log p(z_i) = |z_i| + (2q - 1)z_i$, the pseudo-likelihood which yields quantile regression for the $q$th quantile (Koenker, 2005; Li et al., 2010). Applying Theorem 1, we get $\hat\omega_i = |y_i - x_i^T \beta^{(g)}|^{-1}$ as the expected value of the conditional sufficient statistic needed in the expectation step of our algorithm.

To illustrate our method, we applied a bridge penalty, where $g(\beta_j \mid \tau) = |\beta_j/\tau|^a$. We chose $a = 1/2$, estimating $\tau$ and $\beta$ jointly via an expectation-conditional-maximization algorithm. An inverse-gamma prior with shape and rate parameters equal to 1 was assumed for $\nu = \tau^a$. This leads to the following two conditional-maximization steps, given

Figure 4: Estimated coefficients versus conditional quantile $q \in (0, 1)$ for the Barro GDP growth data. The top 12 panes show the maximum likelihood estimates for each coefficient as a function of $q$, while the bottom 12 show the corresponding estimates under a bridge penalty. The shaded grey areas show estimated error bars, extending one estimated standard error above and below the line. (Panels: Trade Growth, Investment/GDP, Political Instability, Black Market Premium, Consumption/GDP, Education/GDP, Human Capital, Life Expectancy, Male Higher Education, Male Secondary Education, Female Higher Education, Female Secondary Education.)

$\lambda_j$ and $\omega_i$:

$$(\hat\beta \mid \tau) = \left( \tau^{-2} \hat\Lambda + X^T \hat\Omega X \right)^{-1} X^T \left\{ \hat\Omega y - (1 - 2q) 1 \right\}$$

$$(\hat\nu \mid \beta) = \frac{1 + \sum_{j=1}^{p} |\beta_j|^a}{2 + p/a},$$

allowing $\tau = \nu^{1/a}$ to be obtained easily, and avoiding the need for cross-validation.

We used the same 12 predictors used by Koenker and Machado (1999), and compared our bridge-penalized quantile regressions to the maximum-likelihood estimates, obtainable using the R package for quantile regression (Koenker, 2011). The predictors were standardized to have zero mean and unit scale.

Figure 4 shows the results of this comparison, plotting the regression coefficients versus the conditional quantile across a discrete grid of choices for $q$. In each frame, the shaded grey area extends one standard error to either side of the estimate. Under the penalized-likelihood procedure, the standard error quoted here is actually the posterior standard deviation under the multivariate normal approximation at the mode: $\mathrm{var}(\beta_j \mid y) \approx \hat\tau^2/\hat\lambda_j$. Alas, this does not lead to sensible estimates of uncertainty for coefficients shrunk to zero, which is a known shortcoming of penalized likelihood. The direct comparison is with Figure 3 of Koenker and Machado (1999).

There are noticeable differences between the penalized and unpenalized fits. The bridge estimator shrinks four coefficients all the way to zero for every conditional quantile: female secondary education, female higher education, male higher education, and human capital. It shrinks four other coefficients to zero for a substantial fraction of values $q \in (0, 1)$: male secondary education, life expectancy, education spending, and political instability.

Also interesting is that the posterior standard deviations for the predictors that remain in the model are typically larger than the asymptotic standard errors for the corresponding maximum-likelihood estimates. This is counter-intuitive; one would expect that, by excluding needless predictors from the model, uncertainty about those that remain in the model would be reduced, rather than inflated.
We do not know whether this inflation reflects the deficiency of the asymptotic approximation needed to compute standard errors under the maximum-likelihood procedure; the failure of the normal approximation at the mode; or some other unappreciated aspect of the model.

5 Discussion

Our primary goal in this paper has been to show the relevance of the conditionally Gaussian representation of (2)-(3), together with Theorem 1, for fitting a wide class of regularized estimators within a unified variance-mean mixture framework. We have therefore focused only on the most basic implementation of the expectation-maximization algorithm.

There are many further variants on the basic algorithm, however, some of which can lead to dramatic advantages in speed and stability for certain problems. Key references include Meng and van Dyk (1997) and Gelman et al. (2008b). These variants include marginal data augmentation (Liu, 1995), parameter expansion (Liu and Wu, 1999), majorization-minimization (Hunter and Lange, 2000), the partially collapsed Gibbs sampler (van Dyk and Park, 2008), and other simulation-based alternatives described by van Dyk and Meng (2001). Many of these modifications require additional analytical work for particular choices of $g$ and $f$. One example here includes the work of Liu (2005) on the robit model. We have not explored these options here, and this remains a promising area for future research.

A second important fact is that, for many purposes, such as estimation of $\beta$ under squared-error loss, the relevant quantity of interest is the posterior mean, not the mode. Indeed, both Hans (2009) and Efron (2009) argue that, for predicting future observables, the posterior mean of $\beta$ is the better choice. The following theorem represents the posterior mean for $\beta$ in terms of the score function of the predictive distribution, generalizing the results of Brown (1971), Masreliez (1975), Pericchi and Smith (1992), and Carvalho et al. (2010). There are a number of possible versions of such a theorem. Here we consider a variance-mean mixture prior $p(\beta_j)$ with a general location likelihood $p(y - \beta)$, but clearly a similar result holds the other way around. We consider the case where $X$ is an orthogonal matrix, assumed without loss of generality to be the identity matrix, in which case we apply the theorem component by component.

Theorem 3. Let $p(|y_j - \beta_j|)$ be the likelihood for a location parameter $\beta_j$, symmetric in $y_j - \beta_j$, and let $p(\beta_j) = \int \phi(\beta_j; \mu_\beta + \kappa_\beta/\lambda_j, \tau^2/\lambda_j) \, p(\lambda_j^{-1}) \, d\lambda_j^{-1}$ be a normal variance-mean mixture prior. Define the following distributions:

$$m(y_j) = \int p(y_j - \beta_j) \, p(\beta_j) \, d\beta_j$$

$$p^\star(\lambda_j^{-1}) = \frac{\lambda_j \, p(\lambda_j^{-1})}{E(\lambda_j)}$$

$$p^\star(\beta_j) = \int \phi(\beta_j; \mu_\beta + \kappa_\beta/\lambda_j, \tau^2/\lambda_j) \, p^\star(\lambda_j^{-1}) \, d\lambda_j^{-1}$$

$$m^\star(y_j) = \int p(y_j - \beta_j) \, p^\star(\beta_j) \, d\beta_j.$$

Then

$$E(\beta_j \mid y_j) = \frac{\kappa_\beta}{\tau^2} + \left\{ \frac{\mu_\beta E(\lambda_j)}{\tau^2} \right\} \left\{ \frac{m^\star(y_j)}{m(y_j)} \right\} + \left\{ \frac{E(\lambda_j)}{\tau^2} \right\} \left\{ \frac{m^\star(y_j)}{m(y_j)} \right\} \left\{ \frac{\partial}{\partial y_j} \log m^\star(y_j) \right\}.$$

The generalization to nonorthogonal designs is straightforward, following the original Masreliez (1975) paper; see, for example, Griffin and Brown (2010), along with the discussion of the Tweedie formula by Efron (2011).
Computing the posterior mean will typically require sampling from the full joint posterior distribution over all parameters. Our data-augmentation approach can lead to Markov-chain Monte Carlo sampling schemes for just this purpose. The key step is the identification of the conditional distributions for $\lambda_j$ and $\omega_i$ under specific models; this remains an active area of research.
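As a small numerical illustration of the score-function representation of the posterior mean, the sketch below checks the classical Tweedie formula, $E(\beta \mid y) = y + \partial \log m(y)/\partial y$, which holds for a unit-variance Gaussian likelihood. This is the familiar special case discussed by Efron (2011), not Theorem 3 itself. The double-exponential prior, the integration grid, and all function names are assumptions chosen only for illustration.

```python
import math

# Assumed setup (not from the paper): y | beta ~ N(beta, 1), beta ~ Laplace(0, 1).
STEP = 0.005
GRID = [k * STEP for k in range(-6000, 6001)]  # covers [-30, 30], includes 0 exactly


def normal_pdf(x, mean):
    """Density of N(mean, 1) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)


def laplace_prior(beta):
    """Double-exponential prior density with unit scale."""
    return 0.5 * math.exp(-abs(beta))


def integrate(fn):
    """Trapezoid rule over GRID."""
    vals = [fn(b) for b in GRID]
    return STEP * (sum(vals) - 0.5 * (vals[0] + vals[-1]))


def marginal(y):
    """Predictive density m(y) = integral of p(y | beta) p(beta) d beta."""
    return integrate(lambda b: normal_pdf(y, b) * laplace_prior(b))


def posterior_mean_direct(y):
    """E(beta | y) computed directly as a ratio of integrals."""
    num = integrate(lambda b: b * normal_pdf(y, b) * laplace_prior(b))
    return num / marginal(y)


def posterior_mean_tweedie(y, h=1e-3):
    """E(beta | y) from the score of m(y), using a central difference."""
    score = (math.log(marginal(y + h)) - math.log(marginal(y - h))) / (2.0 * h)
    return y + score


if __name__ == "__main__":
    for y in (0.5, 2.0, -3.0):
        print(y, posterior_mean_direct(y), posterior_mean_tweedie(y))
```

The two routes agree up to quadrature and finite-difference error, which is the sense in which the posterior mean can be read off the predictive density without sampling in this simple orthogonal setting.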

Acknowledgements

The authors wish to thank two anonymous referees, in addition to the editor and associate editor, for their many helpful comments in improving the paper.

A Appendix: distributional results

A.1 Generalized hyperbolic distributions

In all of the following cases, we assume that $(\theta \mid v) \sim N(\mu + \kappa v, v)$, and that $v \sim p(v)$. Let $p(v)$ be a generalized inverse-Gaussian distribution $G(\psi, \gamma, \delta)$, following the notation of Barndorff-Nielsen and Shephard (2001). We consider the special case of this class where $\psi = 1$ and $\delta = 0$, in which case $p(\theta)$ is a hyperbolic distribution having density

$$p(\theta \mid \mu, \alpha, \kappa) = \left(\frac{\alpha^2 - \kappa^2}{2\alpha}\right) \exp\{-\alpha|\theta - \mu| + \kappa(\theta - \mu)\}.$$

When viewed as a pseudo-likelihood or pseudo-prior, the class of generalized hyperbolic distributions will generate many common objective functions. First, choosing $(\alpha, \kappa, \mu) = (1, 0, 0)$ leads to $-\log p(\beta_j) = |\beta_j|$, and thus $\ell_1$ regularization. Second, choosing $(\alpha, \kappa, \mu) = (1, 1 - 2q, 0)$ gives $-\log p(z_i) = |z_i| + (2q - 1)z_i$. This is the check-loss function, yielding quantile regression for the $q$th quantile. Third, choosing $(\alpha, \kappa, \mu) = (1, 1, 1)$ leads to the maximum operator:

$$-(1/2)\log p(z_i) = (1/2)|1 - z_i| + (1/2)(1 - z_i) = \max(1 - z_i, 0)$$

for $z_i = y_i x_i^T\beta$. This is the objective function for support vector machines (e.g. Mallick et al., 2005; Polson and Scott, 2011c), and corresponds to the limiting case of a generalized inverse-Gaussian prior.

A.2 Z distributions

Now let $p_P(v \mid \alpha, \alpha - 2\kappa)$ be a Polya distribution, which can be represented as an infinite convolution of exponentials, and leads to a Z-distributed marginal. The important result is the following:

$$p_Z(\theta \mid \mu, \alpha, \kappa) = \frac{1}{B(\alpha, \kappa)} \frac{(e^{\theta - \mu})^{\alpha}}{(1 + e^{\theta - \mu})^{2(\alpha - \kappa)}} = \int N(\mu + \kappa v, v)\, p_P(v \mid \alpha, \alpha - 2\kappa)\, dv.$$

See Barndorff-Nielsen et al. (1982). When viewed as a likelihood in $\theta$, the class of Polya/Z distributions results in the logistic and multinomial models.
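Returning to the hyperbolic case of A.1, the following sketch checks two things numerically: that the stated parameter choices reproduce the check loss (up to a factor of two) and the hinge loss, and that the hyperbolic density arises as a normal variance-mean mixture. The exponential mixing density with rate $(\alpha^2 - \kappa^2)/2$ used here is an assumption inferred from the normalizing constant of the density above, and the function names and quadrature settings are hypothetical.

```python
import math


def neg_log_kernel(z, alpha, kappa, mu):
    """-log of exp{-alpha|z - mu| + kappa (z - mu)}, dropping the constant."""
    d = z - mu
    return alpha * abs(d) - kappa * d


def check_loss(z, q):
    """Quantile-regression check loss rho_q(z) = z (q - 1{z < 0})."""
    return z * (q - (1.0 if z < 0 else 0.0))


def hinge_loss(z):
    """Support-vector-machine hinge loss max(1 - z, 0)."""
    return max(1.0 - z, 0.0)


def hyperbolic_density(theta, alpha, kappa, mu=0.0):
    """Closed-form hyperbolic density with its normalizing constant."""
    d = theta - mu
    const = (alpha ** 2 - kappa ** 2) / (2.0 * alpha)
    return const * math.exp(-alpha * abs(d) + kappa * d)


def mixture_density(theta, alpha, kappa, mu=0.0, upper=12.0, n=24000):
    """Integrate N(theta; mu + kappa v, v) against v ~ Exponential(rate c).

    Assumption: c = (alpha^2 - kappa^2) / 2. The substitution v = u^2 removes
    the v^(-1/2) singularity at the origin; a midpoint rule is applied in u.
    """
    c = 0.5 * (alpha ** 2 - kappa ** 2)
    d = theta - mu
    h = upper / n
    total = 0.0
    for k in range(n):
        v = ((k + 0.5) * h) ** 2
        total += math.exp(-((d - kappa * v) ** 2) / (2.0 * v) - c * v)
    return c * (2.0 / math.sqrt(2.0 * math.pi)) * total * h


if __name__ == "__main__":
    q = 0.25
    print(neg_log_kernel(-0.8, 1.0, 1.0 - 2.0 * q, 0.0), 2.0 * check_loss(-0.8, q))
    print(hyperbolic_density(0.7, 1.0, 0.5), mixture_density(0.7, 1.0, 0.5))
```

The hinge case $(\alpha, \kappa, \mu) = (1, 1, 1)$ is checked only at the level of the kernel, since the normalizing constant vanishes there and the mixture is improper.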

For logistic regression, choosing $(\alpha, \kappa, \mu) = (1, 1/2, 0)$ leads to

$$p(z_i) = \frac{e^{z_i}}{1 + e^{z_i}},$$

which is the likelihood for logistic regression with $z_i = y_i x_i^T\beta$. Much like the support-vector-machine representation, this corresponds to a limiting improper case of the Polya distribution, specifically $P(1, 0)$. The necessary mixture representation still holds, however, by applying the Fatou-Lebesgue theorem. The improper mixing measure $p_P(v \mid 1, 0)$ is an infinite sum of exponentials for which the integral on the right still converges to the logistic likelihood (Gramacy and Polson, 2012).

For the multinomial generalization of the logistic model, we require a slight modification. Suppose that $y_i \in \{1, \ldots, K\}$ is an unordered category indicator, and that $\beta_k = (\beta_{k1}, \ldots, \beta_{kp})^T$ is a separate block of $p$ regression coefficients for the $k$th category. Let

$$\eta_{ik} = \frac{\exp(x_i^T\beta_k - c_{ik})}{1 + \exp(x_i^T\beta_k - c_{ik})}, \quad \text{where} \quad c_{ik}(\beta_{-k}) = \log\Big\{\sum_{l \neq k} \exp(x_i^T\beta_l)\Big\}.$$

We follow Holmes and Held (2006) in writing the conditional likelihood for $\beta_k$ as

$$Q(\beta_k \mid \beta_{-k}, y) \propto \prod_{i=1}^n \prod_{l=1}^K \eta_{il}^{I(y_i = l)} = \prod_{i=1}^n \eta_{ik}^{I(y_i = k)} \{w_i(1 - \eta_{ik})\}^{I(y_i \neq k)} \propto \prod_{i=1}^n \frac{\{\exp(x_i^T\beta_k - c_{ik})\}^{I(y_i = k)}}{1 + \exp(x_i^T\beta_k - c_{ik})},$$

where $w_i$ is independent of $\beta_k$ and $I$ is the indicator function. Thus the conditional likelihood for the $k$th block of coefficients $\beta_k$, given all the other blocks of coefficients $\beta_{-k}$, can be written as a product of $n$ terms, the $i$th term having a Polya mixture representation with $\alpha_{ik} = I(y_i = k)$, $\kappa_{ik} = \alpha_{ik} - 1/2$, and $\mu_{ik} = c_{ik}(\beta_{-k})$. This allows regularized multinomial logistic models to be fit using essentially the same approach as in Section 4.1, where each block $\beta_k$ is updated using a conditional maximization step.

B Appendix: Proofs

B.1 Theorem 1

Since $\phi$ is a normal kernel,

$$\frac{\partial \phi(\beta_j \mid \mu_\beta + \kappa_\beta/\lambda_j, \tau^2/\lambda_j)}{\partial \beta_j} = -\left(\frac{\beta_j - \mu_\beta - \kappa_\beta/\lambda_j}{\tau^2/\lambda_j}\right) \phi(\beta_j \mid \mu_\beta + \kappa_\beta/\lambda_j, \tau^2/\lambda_j).$$

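We use this fact to differentiate

$$p(\beta_j \mid \tau) = \int \phi(\beta_j \mid \mu_\beta + \kappa_\beta/\lambda_j, \tau^2/\lambda_j)\, p(\lambda_j \mid \tau)\, d\lambda_j$$

under the integral sign to obtain

$$\frac{\partial}{\partial \beta_j} p(\beta_j \mid \tau) = \int \left\{\frac{\partial}{\partial \beta_j} \phi(\beta_j \mid \mu_\beta + \kappa_\beta/\lambda_j, \tau^2/\lambda_j)\right\} p(\lambda_j \mid \tau)\, d\lambda_j.$$

Dividing by $p(\beta_j \mid \tau)$ and using the above identity for the inner function, we get

$$\frac{1}{p(\beta_j \mid \tau)} \frac{\partial}{\partial \beta_j} p(\beta_j \mid \tau) = \frac{\kappa_\beta}{\tau^2} - \left(\frac{\beta_j - \mu_\beta}{\tau^2}\right) E(\lambda_j \mid \beta_j, \tau).$$

Hence we can in general find the needed moment using the expression

$$\frac{1}{p(\beta_j \mid \tau)} \frac{\partial}{\partial \beta_j} p(\beta_j \mid \tau) = \frac{\kappa_\beta}{\tau^2} - \left(\frac{\beta_j - \mu_\beta}{\tau^2}\right) E\left(\lambda_j \mid \beta^{(g)}, \tau, y\right).$$

Equivalently, in terms of the penalty function $-\log p(\beta_j \mid \tau)$, we have

$$(\beta_j - \mu_\beta)\, E(\lambda_j \mid \beta_j) = \kappa_\beta - \tau^2 \frac{\partial}{\partial \beta_j} \log p(\beta_j \mid \tau).$$

By a similar argument, we also have

$$(z_i - \mu_z)\, E(\omega_i \mid \beta, z_i, \sigma) = \kappa_z - \sigma^2 \frac{\partial}{\partial z_i} \log p(z_i \mid \beta, \sigma).$$

We obtain the desired result by using the identities

$$\frac{\partial}{\partial z_i} \log p(z_i \mid \beta, \sigma) = -f'(z_i \mid \beta, \sigma) \quad \text{and} \quad \frac{\partial}{\partial \beta_j} \log p(\beta_j \mid \tau) = -g'(\beta_j \mid \tau).$$

B.2 Theorem 2

We demonstrate the result for a regression problem, with the classification result following as a simple modification. Begin with Equation (4). Collecting terms, we can represent the log posterior, up to an additive constant not involving $\beta$, as a sum of quadratic forms in $\beta$:

$$\log p(\beta \mid \omega, \lambda, \tau, \sigma, z) = -\frac{1}{2}\left(\{y - \mu_z\mathbf{1} - \kappa_z\omega^{-1}\} - X\beta\right)^T \Omega \left(\{y - \mu_z\mathbf{1} - \kappa_z\omega^{-1}\} - X\beta\right) - \frac{1}{2\tau^2}\left(\beta - \mu_\beta\mathbf{1} - \kappa_\beta\lambda^{-1}\right)^T \Lambda \left(\beta - \mu_\beta\mathbf{1} - \kappa_\beta\lambda^{-1}\right),$$

where we recall that $\omega^{-1}$ is a column vector whose $i$th entry is $\omega_i^{-1}$, and similarly for $\lambda^{-1}$. This is the log posterior under a normal prior $\beta \sim N(\mu_\beta\mathbf{1} + \kappa_\beta\lambda^{-1}, \tau^2\Lambda^{-1})$ after having observed the working response $y - \mu_z\mathbf{1} - \kappa_z\omega^{-1}$. The identity $\Omega(\mu\mathbf{1} + \kappa\omega^{-1}) = \mu\omega + \kappa\mathbf{1}$ then gives the result.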
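For the regression case just treated, the moment identity of Theorem 1 and the weighted-least-squares form of Theorem 2 can be combined into a very small expectation-maximization loop. The sketch below assumes a scalar problem with design $X = 1$, a Gaussian likelihood with known variance, and the lasso penalty $g(\beta \mid \tau) = |\beta|/\tau$ with $\mu_\beta = \kappa_\beta = 0$, so that the identity gives $E(\lambda \mid \beta) = \tau/|\beta|$. The function names, the numerical floor, and the iteration count are hypothetical choices for illustration, not the authors' implementation.

```python
import math


def em_lasso_scalar(y, tau, sigma2=1.0, n_iter=500, floor=1e-12):
    """EM iterations for the mode of (y - beta)^2 / (2 sigma2) + |beta| / tau.

    E-step (Theorem 1 with mu = kappa = 0): lambda_hat = tau / |beta|.
    M-step (Theorem 2 with X = 1): ridge-type update with precision lambda_hat / tau^2.
    """
    beta = y  # start from the unpenalized estimate
    for _ in range(n_iter):
        # Guard against division by zero once beta has collapsed towards 0.
        lam = tau / max(abs(beta), floor)
        beta = (y / sigma2) / (lam / tau ** 2 + 1.0 / sigma2)
    return beta


def soft_threshold(y, t):
    """Closed-form lasso solution in the scalar case, for comparison."""
    return math.copysign(max(abs(y) - t, 0.0), y)


if __name__ == "__main__":
    tau, sigma2 = 0.5, 1.0
    for y in (3.0, 1.0, -3.0):
        print(y, em_lasso_scalar(y, tau, sigma2), soft_threshold(y, sigma2 / tau))
```

In this scalar setting the fixed point of the loop coincides with soft thresholding at $\sigma^2/\tau$, which gives a direct check that the E-step and M-step are paired correctly.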

For classification, on the other hand, let $X^\star$ be the matrix with rows $x_i^\star = y_i x_i$. The kernel of the conditionally normal likelihood then becomes

$$(X^\star\beta - \mu_z\mathbf{1} - \kappa_z\omega^{-1})^T\, \Omega\, (X^\star\beta - \mu_z\mathbf{1} - \kappa_z\omega^{-1}).$$

Hence it is as if we observe the $n$-dimensional working response $\mu_z\mathbf{1} + \kappa_z\omega^{-1}$ in a regression model having design matrix $X^\star$, and we proceed by a similar argument to arrive at the result.

B.3 Theorem 3

Our extension of Masreliez's theorem to variance-mean mixtures follows a similar path as Theorem 1. From before, since $\phi$ is a normal kernel,

$$\frac{\partial \phi(\beta_j \mid \mu_\beta + \kappa_\beta/\lambda_j, \tau^2/\lambda_j)}{\partial \beta_j} = -\frac{\beta_j - \mu_\beta - \kappa_\beta/\lambda_j}{\tau^2/\lambda_j}\, \phi(\beta_j \mid \mu_\beta + \kappa_\beta/\lambda_j, \tau^2/\lambda_j).$$

Differentiating under the integral sign and applying this result, we have that

$$\frac{1}{\tau^2} \lambda_j \beta_j\, \phi(\beta_j \mid \mu_\beta + \kappa_\beta/\lambda_j, \tau^2/\lambda_j) = \frac{\mu_\beta + \kappa_\beta/\lambda_j}{\tau^2/\lambda_j}\, \phi(\beta_j \mid \mu_\beta + \kappa_\beta/\lambda_j, \tau^2/\lambda_j) - \frac{\partial \phi(\beta_j \mid \mu_\beta + \kappa_\beta/\lambda_j, \tau^2/\lambda_j)}{\partial \beta_j}.$$

The rest of the argument follows the standard Masreliez approach.

References

D. Andrews and C. Mallows. Scale mixtures of normal distributions. Journal of the Royal Statistical Society, Series B, 36:99-102, 1974.
A. Armagan, D. Dunson, and J. Lee. Generalized double Pareto shrinkage. Technical report, Duke University Department of Statistical Science, 2012a.
A. Armagan, D. B. Dunson, J. Lee, and W. Bajwa. Posterior consistency in linear models under shrinkage priors. Technical report, Duke University Department of Statistical Science, 2012b.
K. Bae and B. Mallick. Gene selection using a two-level hierarchical Bayesian model. Bioinformatics, 20(18):3423-30, 2004.
O. E. Barndorff-Nielsen and N. Shephard. Non-Gaussian Ornstein-Uhlenbeck-based models and some of their uses in financial economics. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2), 2001.
O. E. Barndorff-Nielsen, J. Kent, and M. Sorensen. Normal variance-mean mixtures and z distributions. International Statistical Review, 50:145-59, 1982.
L. Brown. Admissible estimators, recurrent diffusions and insoluble boundary problems. The Annals of Mathematical Statistics, 42:855-903, 1971.
F. Caron and A. Doucet. Sparse Bayesian nonparametric regression. In Proceedings of the 25th International Conference on Machine Learning. Association for Computing Machinery, 2008.
C. M. Carvalho, N. G. Polson, and J. G. Scott. The horseshoe estimator for sparse signals. Biometrika, 97(2):465-80, 2010.
A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society (Series B), 39(1):1-38, 1977.

B. Efron. Empirical Bayes estimates for large-scale prediction problems. Journal of the American Statistical Association, 104(487):1015-28, 2009.
B. Efron. Tweedie's formula and selection bias. Journal of the American Statistical Association, 106(496), 2011.
J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348-60, 2001.
M. Figueiredo. Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(9):1150-9, 2003.
A. Gelman. Prior distributions for variance parameters in hierarchical models. Bayesian Anal., 1(3), 2006.
A. Gelman, A. Jakulin, M. Pittau, and Y. Su. A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 2(4):1360-83, 2008a.
A. Gelman, D. A. van Dyk, Z. Huang, and W. J. Boscardin. Using redundant parameterizations to fit hierarchical models. Journal of Computational and Graphical Statistics, 17(1):95-122, 2008b.
A. Genkin, D. Lewis, and D. Madigan. High-dimensional generalized linear models and the lasso. Technometrics, 49:291-304, 2007.
R. B. Gramacy and N. G. Polson. Simulation-based regularized logistic regression. Bayesian Analysis, 2012. arxiv.org/abs/ (to appear).
P. J. Green. Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives. Journal of the Royal Statistical Society (Series B), 46(2):149-92, 1984.
J. Griffin and P. Brown. Inference with normal-gamma prior distributions in regression problems. Bayesian Analysis, 5(1):171-88, 2010.
J. Griffin and P. Brown. Alternative prior distributions for variable selection with very many more variables than observations. Australian and New Zealand Journal of Statistics, 2012. (to appear).
C. M. Hans. Bayesian lasso regression. Biometrika, 96(4):835-45, 2009.
C. Holmes and L. Held. Bayesian auxiliary variable models for binary and multinomial regression. Bayesian Analysis, 1(1):145-68, 2006.
J. Huang, J. Horowitz, and S. Ma. Asymptotic properties of bridge estimators in sparse high-dimensional regression models. The Annals of Statistics, 36(2), 2008.
D. R. Hunter and K. Lange. Quantile regression via an MM algorithm. Journal of Computational and Graphical Statistics, 9(1):60-77, 2000.
K. Knight and W. Fu. Asymptotics for lasso-type estimators. The Annals of Statistics, 28(5), 2000.
R. Koenker. Quantile Regression. Cambridge University Press, New York, USA, 2005.
R. Koenker. quantreg: Quantile Regression, 2011. R package.
R. Koenker and J. Machado. Goodness of fit and related inference processes for quantile regression. Journal of the American Statistical Association, 94(448), 1999.
Q. Li, R. Xi, and N. Lin. Bayesian regularized quantile regression. Bayesian Analysis, 5(3):533-56, 2010.
C. Liu. Missing data imputation using the multivariate t distribution. Journal of Multivariate Analysis, 53(1):139-58, 1995.

C. Liu. Robit regression: a simple robust alternative to logistic and probit regression. In A. Gelman and X. L. Meng, editors, Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives: An Essential Journey with Donald Rubin's Statistical Family. John Wiley & Sons, 2005.
J. S. Liu and Y. N. Wu. Parameter expansion for data augmentation. Journal of the American Statistical Association, 94(448), 1999.
B. K. Mallick, D. Ghosh, and M. Ghosh. Bayesian classification of tumours by using gene expression data. Journal of the Royal Statistical Society (Series B), 67(2):219-34, 2005.
C. Masreliez. Approximate non-gaussian filtering with linear state and observation relations. IEEE Trans. Autom. Control, 20(1):107-10, 1975.
R. Mazumder, J. Friedman, and T. Hastie. Sparsenet: coordinate descent with non-convex penalties. Journal of the American Statistical Association, 106(495), 2011.
L. Meier, S. van de Geer, and P. Bühlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society (Series B), 70(1):53-71, 2008.
X. L. Meng and D. van Dyk. The EM algorithm: an old folk-song sung to a fast new tune. Journal of the Royal Statistical Society (Series B), 59(3):511-67, 1997.
T. Park and G. Casella. The Bayesian lasso. Journal of the American Statistical Association, 103(482):681-6, 2008.
L. R. Pericchi and A. Smith. Exact and approximate posterior moments for a normal location parameter. Journal of the Royal Statistical Society (Series B), 54(3):793-804, 1992.
N. G. Polson and J. G. Scott. Shrink globally, act locally: sparse Bayesian regularization and prediction. In J. Bernardo, M. Bayarri, J. O. Berger, A. Dawid, D. Heckerman, A. Smith, and M. West, editors, Proceedings of the 9th Valencia World Meeting on Bayesian Statistics. Oxford University Press, 2011a.
N. G. Polson and J. G. Scott. The Bayesian bridge. Technical report, University of Texas at Austin, 2011b.
N. G. Polson and J. G. Scott. Local shrinkage rules, Lévy processes, and regularized regression. Journal of the Royal Statistical Society (Series B), 2012a. (to appear).
N. G. Polson and J. G. Scott. Good, great, or lucky? Screening for firms with sustained superior performance using heavy-tailed priors. The Annals of Applied Statistics, 2012b. (to appear).
N. G. Polson and S. Scott. Data augmentation for support vector machines (with discussion). Bayesian Analysis, 6(1):1-24, 2011c.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society (Series B), 58(1):267-88, 1996.
M. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211-44, 2001.
D. van Dyk and X. L. Meng. Cross-fertilizing strategies for better EM mountain climbing and DA field exploration: A graphical guide book. Statistical Science, 25(4):429-49, 2010.
D. van Dyk and T. Park. Partially collapsed Gibbs samplers: theory and methods. Journal of American Statistical Association, 103(482):790-6, 2008.
M. West. On scale mixtures of normal distributions. Biometrika, 74(3):646-8, 1987.
H. Zou and R. Li. One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36(4):1509-33, 2008.


Biometrika (2013), 100, 2, pp. 459-471. doi: 10.1093/biomet/ass081. © 2013 Biometrika Trust. Advance Access publication 11 March 2013. Printed in Great Britain. Data augmentation for non-gaussian regression models using