arxiv: v3 [stat.me] 26 Feb 2012
Data augmentation for non-Gaussian regression models using variance-mean mixtures

Nicholas G. Polson
Booth School of Business, University of Chicago

James G. Scott
University of Texas at Austin

February 28, 2012

Abstract

We use the theory of normal variance-mean mixtures to derive a data-augmentation scheme that unifies a wide class of statistical models under a single framework. This generalizes existing theory on normal variance mixtures for priors in regression and classification. It also allows variants of the expectation-maximization algorithm to be brought to bear on a much wider range of models than previously appreciated. We demonstrate the resulting gains in accuracy and stability on several examples, including sparse quantile regression and binary logistic regression.

Key words: Data augmentation; Hierarchical model; Sparsity; Variance-mean mixture of normals

1 Introduction

1.1 Regularized regression and classification

Many problems in regularized estimation involve an objective function of the form

$$Q(\beta) = \sum_{i=1}^{n} f(y_i, x_i^T\beta \mid \sigma) + \sum_{j=1}^{p} g(\beta_j \mid \tau). \qquad (1)$$

Here $y_i$ is a response, which may be continuous, binary, or multinomial; $x_i$ is a known $p$-vector of predictors; $\beta = (\beta_1, \ldots, \beta_p)$ is an unknown vector of coefficients; $f$ and $g$ are the negative log likelihood and penalty function, respectively; and $\sigma$ and $\tau$ are scale parameters, for now assumed fixed. Following the literature on penalized likelihood, where the log prior is interpreted as a penalty function, we may phrase the problem as one of minimizing $Q(\beta)$, or equivalently maximizing the unnormalized posterior density $\exp\{-Q(\beta)\}$.

In this paper, we unify many seemingly disparate problems of this form into a single class of normal variance-mean mixtures. There are two main practical results of this
unification. First, it allows us to exploit the probabilistic latent-variable structure of $\exp\{-Q(\beta)\}$. This leads to expectation-maximization algorithms (Dempster et al., 1977), and variants thereof, that avoid the need for analytic approximations, numerical derivatives, or other black-box optimization routines. The result is a unified, accurate, stable approach to estimation in many non-Gaussian problems, including quantile regression, logistic regression, support-vector machines, and robust regression. In the case of logistic regression, the gains in accuracy and stability hold even for maximum-likelihood estimation.

We emphasize that the crucial step here is probabilistic, rather than algorithmic, in nature. Upon recognizing a certain likelihood as a variance-mean mixture, the algorithm comes essentially for free by applying Theorem 1, the paper's main result. This theorem describes a simple relationship between the derivatives of $f$ and $g$ and the conditional sufficient statistics for the latent variables that arise in our expectation-maximization algorithm. The expected values of these conditional sufficient statistics can usually be calculated in closed form, even if the full conditional distribution of the latent variables is unknown or intractable. Theorem 3 provides the corresponding result for the posterior mean estimator, generalizing the results of Masreliez (1975) and Pericchi and Smith (1992), long recognized for their importance in robust Bayesian inference.

A second major advantage of our approach is its wide scope of potential use. For example, by representing both likelihoods and priors as mixtures, our method allows penalized-likelihood methods to be applied within hierarchical non-Gaussian models, where information is pooled across batches of related coefficients, with no essential modification of the approach.
Moreover, our data-augmentation scheme can be woven together seamlessly with other latent-variable methods, such as those for discrete mixtures and missing-data problems. It also suggests an approach for deriving more sophisticated Markov-chain sampling methods for full Bayesian inference.

1.2 Relationship with previous work

Our work is motivated by recent Bayesian research on sparsity-inducing priors in linear regression, where $f = \|y - X\beta\|^2$ is the negative Gaussian log likelihood, and $g$ corresponds to a normal variance-mixture prior (Andrews and Mallows, 1974). Examples of work in this area include the lasso (Tibshirani, 1996; Park and Casella, 2008; Hans, 2009); the bridge estimator (West, 1987; Knight and Fu, 2000; Huang et al., 2008; Polson and Scott, 2011b); the relevance vector machine of Tipping (2001); the normal/Jeffreys model of Figueiredo (2003) and Bae and Mallick (2004); the normal/exponential-gamma model of Griffin and Brown (2012); the normal/gamma and normal/inverse-Gaussian (Caron and Doucet, 2008; Griffin and Brown, 2010); the horseshoe prior of Carvalho et al. (2010); the hypergeometric inverted beta model of Polson and Scott (2012b); and the double-Pareto model of Armagan et al. (2012a). A related line of work concerns the use of iterative convex relaxation in nonconcave penalized-likelihood problems (e.g. Zou and Li, 2008). Polson and Scott (2011a) give a review of this extensive literature, and the connections between Bayesian shrinkage estimation and penalized likelihood.

We generalize this work by representing both the likelihood and prior as variance-mean mixtures of Gaussians. This data-augmentation approach relies upon the following decomposition:

$$p(\beta \mid \tau, \sigma, y) \propto e^{-Q(\beta)} \propto \exp\left\{ -\sum_{i=1}^{n} f(y_i, x_i^T\beta \mid \sigma) - \sum_{j=1}^{p} g(\beta_j \mid \tau) \right\} \propto \left\{ \prod_{i=1}^{n} p(z_i \mid \beta, \sigma) \right\} \left\{ \prod_{j=1}^{p} p(\beta_j \mid \tau) \right\} = p(z \mid \beta, \sigma)\, p(\beta \mid \tau),$$

where the working response $z_i$ is equal to $y_i - x_i^T\beta$ for regression, or $y_i x_i^T\beta$ for binary classification, with $y_i$ coded as $\pm 1$. The case of a multinomial response requires only a small modification, detailed in the appendix. Both $\sigma$ and $\tau$ are hyperparameters; they are typically estimated jointly with $\beta$, although they may also be specified by the user or chosen by cross-validation. In some cases, most notably in logistic regression, the likelihood is free of hyperparameters, in which case $\sigma$ does not appear in the model.

Table 1: Variance-mean mixture representations for many common loss functions. Recall that $z_i = y_i - x_i^T\beta$ for regression, or $z_i = y_i x_i^T\beta$ for binary classification.

Error/loss function        $f(z_i \mid \beta, \sigma)$     $\kappa_z$    $\mu_z$    $p(\omega_i)$
Squared-error              $z_i^2/\sigma^2$                $0$           $0$        $\omega_i = 1$
Absolute-error             $|z_i/\sigma|$                  $0$           $0$        Exponential
Check loss                 $|z_i| + (2q-1)z_i$             $1-2q$        $0$        Generalized inverse Gaussian
Support vector machines    $\max(1 - z_i, 0)$              $1$           $1$        Generalized inverse Gaussian
Logistic                   $\log(1 + e^{-z_i})$            $1/2$         $0$        Polya

One thing we do not do in this paper is study the formal statistical properties of the resulting estimators, such as consistency as $p$ and $n$ both diverge. For this we refer the reader to Griffin and Brown (2012) and Armagan et al. (2012b), who discuss this issue from both a Bayesian and classical perspective. Rather, we provide a representation theorem that makes these estimators both easier to compute and more widely applicable to hierarchical, non-Gaussian models.

2 Normal variance-mean mixtures

There are two key steps in our approach: first, we use variance-mean mixtures, rather than just variance mixtures; and second, we interweave two different mixture representations, one for the likelihood and one for the prior.
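The working response and the loss functions of Table 1 can be written down directly. The following is a minimal sketch, not code from the paper: the function names are hypothetical, and $\sigma$ is fixed at 1 wherever it would otherwise appear.

```python
import math

def working_response(y, x, beta, classification=False):
    """z = y - x'beta for regression; z = y * x'beta for +/-1 classification."""
    eta = sum(xj * bj for xj, bj in zip(x, beta))
    return y * eta if classification else y - eta

# Loss functions f(z) from Table 1, with sigma = 1 (an assumption of this sketch).
def squared_error(z):
    return z * z

def absolute_error(z):
    return abs(z)

def check_loss(z, q):
    # |z| + (2q - 1) z, which is twice the usual check function rho_q(z)
    return abs(z) + (2.0 * q - 1.0) * z

def hinge(z):
    # support-vector-machine loss max(1 - z, 0)
    return max(1.0 - z, 0.0)

def logistic(z):
    # log(1 + exp(-z)), arranged so that exp() never overflows
    if z > 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

With $q = 1/2$ the check loss reduces to the absolute-error loss, which is one way to see why the lasso and median regression share the same mixing family.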
The introduction of latent variables $\{\omega_i\}$ and $\{\lambda_j\}$ in Equations (2) and (3), below, reduces $\exp\{-Q(\beta)\}$ to a Gaussian linear model with heteroscedastic errors:

$$p(z_i \mid \beta, \sigma) = \int_0^\infty \phi(z_i \mid \mu_z + \kappa_z\omega_i^{-1}, \sigma^2\omega_i^{-1})\, dP(\omega_i) \qquad (2)$$

$$p(\beta_j \mid \tau) = \int_0^\infty \phi(\beta_j \mid \mu_\beta + \kappa_\beta\lambda_j^{-1}, \tau^2\lambda_j^{-1})\, dP(\lambda_j), \qquad (3)$$

where $\phi(a \mid m, v)$ is the normal density, evaluated at $a$, for mean $m$ and variance $v$. By marginalizing over these latent variables with respect to different fixed combinations of $(\mu_z, \kappa_z, \mu_\beta, \kappa_\beta)$ and the mixing measures $P(\lambda_j)$ and $P(\omega_i)$, it is possible to generate many commonly used objective functions that have not been widely recognized as Gaussian mixtures. Table 1 lists several common likelihoods in this class, along with the corresponding fixed choices for $(\kappa_z, \mu_z)$ and the mixing distribution $P(\omega_i)$. A discussion of priors and penalty functions that fall within this class can be found in Polson and Scott (2012a).

An important feature of our approach is that we avoid dealing directly with conditional distributions for these latent variables. To find the posterior mode, it is sufficient to use Theorem 1 to calculate moments of these distributions, exploiting known facts about Gaussian mixtures. These moments in turn depend only upon the derivatives of $f$ and $g$, along with the hyperparameters.

We focus on two choices of the mixing measure: the generalized inverse-Gaussian distribution; and the Polya distribution, which is essentially an infinite sum of exponential random variables (Barndorff-Nielsen et al., 1982). These two choices lead to the hyperbolic and Z distributions, respectively, for the resulting variance-mean mixture. The two key integral identities are

$$\frac{\alpha^2 - \kappa^2}{2\alpha}\, e^{-\alpha|\theta - \mu| + \kappa(\theta - \mu)} = \int_0^\infty \phi(\theta \mid \mu + \kappa v, v)\, p_G\left(v \mid 1, 0, \sqrt{\alpha^2 - \kappa^2}\right) dv$$

$$\frac{1}{B(\alpha, \kappa)}\, \frac{e^{\alpha(\theta - \mu)}}{(1 + e^{\theta - \mu})^{2(\alpha - \kappa)}} = \int_0^\infty \phi(\theta \mid \mu + \kappa v, v)\, p_P(v \mid \alpha, \alpha - 2\kappa)\, dv,$$

where $p_G$ and $p_P$ are the density functions of the generalized inverse-Gaussian and Polya distributions, respectively.
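The first identity can be checked numerically. The sketch below is an illustration added here, not part of the paper. It assumes that $p_G(v \mid 1, 0, \gamma)$ reduces to an exponential density with rate $\gamma^2/2$, which holds under the usual Barndorff-Nielsen parameterization of the generalized inverse-Gaussian family; the function names and the quadrature settings are hypothetical choices.

```python
import math

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mixture_rhs(theta, mu, alpha, kappa, n=4000, t_lo=-30.0, t_hi=10.0):
    """Integrate phi(theta | mu + kappa*v, v) * Exponential(v; rate) over v > 0.

    Assumes the GIG(1, 0, gamma) mixing density is exponential with
    rate = gamma^2 / 2 = (alpha^2 - kappa^2) / 2.
    Uses the substitution v = exp(t) and the trapezoid rule on [t_lo, t_hi].
    """
    rate = 0.5 * (alpha ** 2 - kappa ** 2)
    h = (t_hi - t_lo) / n
    total = 0.0
    for k in range(n + 1):
        v = math.exp(t_lo + k * h)
        weight = 0.5 if k in (0, n) else 1.0
        density = normal_pdf(theta, mu + kappa * v, v) * rate * math.exp(-rate * v)
        total += weight * density * v  # extra factor v is the Jacobian dv = v dt
    return total * h

def closed_form_lhs(theta, mu, alpha, kappa):
    """Left-hand side: the asymmetric double-exponential density."""
    x = theta - mu
    return (alpha ** 2 - kappa ** 2) / (2.0 * alpha) * math.exp(-alpha * abs(x) + kappa * x)
```

Setting $\kappa = 0$ recovers the classical representation of the Laplace density as a scale mixture of normals with exponential mixing (Andrews and Mallows, 1974); a nonzero $\kappa$ tilts the two tails, which is what produces the check loss.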
We use $\theta$ to denote a dummy argument that could involve either data or parameters, and $v$ to denote a latent variance; all other terms are hyperparameters specified by the user for the purposes of representing a particular density or function.

These two expressions lead, by a simple application of the Fatou-Lebesgue theorem, to three further identities for the improper limiting cases of the two densities above:

$$a^{-1} \exp\left\{ -2c^{-1} \max(a\theta, 0) \right\} = \int_0^\infty \phi(\theta \mid -av, cv)\, dv$$

$$c^{-1} \exp\left\{ -2c^{-1} \rho_q(\theta) \right\} = \int_0^\infty \phi(\theta \mid -(2\tau - 1)v, cv)\, e^{-2\tau(1-\tau)v}\, dv$$

$$(1 + \exp\{\theta - \mu\})^{-1} = \int_0^\infty \phi(\theta \mid \mu - (1/2)v, v)\, p_P(v \mid 0, 1)\, dv,$$

where $\rho_q(\theta) = \frac{1}{2}|\theta| + \left(q - \frac{1}{2}\right)\theta$ is the check-loss function (Koenker, 2005; Li et al., 2010).
The first leads to the likelihood for support-vector machines; the second, to quantile and lasso regression; and the third, to logistic and multinomial logistic regression. The function $p_P(v \mid 0, 1)$ is an improper density corresponding to a sum of exponential random variables. Thus by using either a generalized inverse-Gaussian or a Polya mixing distribution, one can generate objective functions corresponding to the lasso estimator, support-vector machines, the check-loss function, and the binary and multinomial logistic-regression models. The relevant distribution theory leading to these integral identities can be found in Appendix A.

Previous studies (e.g. Polson and Scott, 2011c; Gramacy and Polson, 2012) have presented similar results for specific models, including support-vector machines and the so-called powered-logit likelihood. But as far as we are aware, ours is the first characterization of the full class.

3 An expectation-maximization algorithm

3.1 Overview of approach

We now show that this conditionally Gaussian representation results in a simple expectation-maximization algorithm for any model in the proposed class, given our data-augmentation scheme. In the expectation step, one computes the expected value of the log posterior, given hyperparameters and the current estimate $\beta^{(g)}$ at step $g$ of the algorithm:

$$C(\beta \mid \beta^{(g)}) = \int \log p(\beta \mid \omega, \lambda, \tau, \sigma, z)\, p(\omega, \lambda \mid \beta^{(g)}, \tau, z)\, d\omega\, d\lambda.$$

Then in the maximization step, one maximizes the complete-data posterior as a function of $\beta$:

$$\beta^{(g+1)} = \arg\max_{\beta}\, C(\beta \mid \beta^{(g)}).$$

The advantages of the expectation-maximization algorithm are its lack of user-specified tuning constants, and its well-known theoretical properties: the sequence of estimated parameter values $\{\beta^{(1)}, \beta^{(2)}, \ldots\}$ monotonically increases the observed-data log posterior density, and will converge to a global maximum if this function is concave.

The complete-data log posterior can be represented in two different ways.
We appeal to both of these representations in deriving the expectation and maximization steps. Under a normal variance-mean mixture of the form in (2)-(3),

$$\log p(\beta \mid \omega, \lambda, \tau, \sigma, z) = c_0(\omega, \lambda, \tau, \sigma, z) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \omega_i \left( z_i - \mu_z - \kappa_z\omega_i^{-1} \right)^2 - \frac{1}{2\tau^2} \sum_{j=1}^{p} \lambda_j \left( \beta_j - \mu_\beta - \kappa_\beta\lambda_j^{-1} \right)^2 \qquad (4)$$

for some constant $c_0$, recalling that $z_i = y_i - x_i^T\beta$ for regression or $z_i = y_i x_i^T\beta$ for classification. Expanding the squares and absorbing the terms free of $\beta$ into the constant yields

$$\log p(\beta \mid \omega, \lambda, \tau, \sigma, z) = c_1(\omega, \lambda, \tau, \sigma, z) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \omega_i (z_i - \mu_z)^2 + \frac{\kappa_z}{\sigma^2} \sum_{i=1}^{n} (z_i - \mu_z) - \frac{1}{2\tau^2} \sum_{j=1}^{p} \lambda_j (\beta_j - \mu_\beta)^2 + \frac{\kappa_\beta}{\tau^2} \sum_{j=1}^{p} (\beta_j - \mu_\beta) \qquad (5)$$

for some constant $c_1$. We now explicitly derive the expectation and maximization steps, along with the necessary conditional sufficient statistics.

3.2 The expectation step

From (5), observe that the complete-data objective function is linear in both $\omega_i$ and $\lambda_j$. Therefore, in the expectation step, we calculate the complete-data log posterior by replacing $\lambda_j$ and $\omega_i$ with their conditional expectations $\hat\lambda_j^{(g)}$ and $\hat\omega_i^{(g)}$, given the data and the current $\beta^{(g)}$. The following theorem provides expressions for these conditional moments under any model where both the likelihood and prior can be represented by normal variance-mean mixtures.

Theorem 1. Suppose that the objective function $Q(\beta)$ in (1) can be represented by a hierarchical variance-mean Gaussian mixture, as in Equations (2) and (3). Then the conditional moments $\hat\lambda_j = E(\lambda_j \mid \beta, \tau, z)$ and $\hat\omega_i = E(\omega_i \mid \sigma, z_i)$ are given by the following expressions:

$$(\beta_j - \mu_\beta)\,\hat\lambda_j = \kappa_\beta + \tau^2 g'(\beta_j \mid \tau)$$

$$(z_i - \mu_z)\,\hat\omega_i = \kappa_z + \sigma^2 f'(z_i \mid \beta, \sigma),$$

where $f'$ and $g'$ are the derivatives of the negative log likelihood and negative log prior from (1), respectively.

The advantage of the theorem is that it characterizes the required moments purely in terms of the likelihood and penalty functions, which are pre-specified in most regularization problems: $f(z_i) = -\log p(z_i \mid \beta, \sigma)$ and $g(\beta_j) = -\log p(\beta_j \mid \tau)$.

One caveat is that when $\beta_j - \mu_\beta \approx 0$, the conditional moment for $\lambda_j$ in the expectation step may be numerically infinite, and care must be taken. Indeed, infinite values for $\lambda_j$ will arise naturally under certain sparsity-inducing choices of $g$, such as the lasso, and indicate that the algorithm has converged to a sparse solution.
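As a worked instance of Theorem 1 (an illustration added here, assuming the lasso penalty $g(\beta_j \mid \tau) = |\beta_j|/\tau$ with $\mu_\beta = \kappa_\beta = 0$), the conditional moment follows in one line for $\beta_j \neq 0$:

$$g'(\beta_j \mid \tau) = \frac{\mathrm{sign}(\beta_j)}{\tau}, \qquad \beta_j\,\hat\lambda_j = \tau^2 g'(\beta_j \mid \tau) = \tau\,\mathrm{sign}(\beta_j) \quad\Longrightarrow\quad \hat\lambda_j = \frac{\tau}{|\beta_j|}.$$

The implied prior precision in the complete-data posterior is $\tau^{-2}\hat\lambda_j = 1/(\tau|\beta_j|)$, which diverges as $\beta_j \to 0$. This is the numerical infinity described in the caveat.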
One way to handle the resulting problem of numerical infinities is to start the algorithm from a value where $(\beta - \mu_\beta)$ has no zeros, and to remove $\beta_j$ from the model when it gets within a small numerical threshold of its mean (cf. Fan and Li, 2001). This conveys the added benefit of hastening the matrix computations in the maximization step. Although we have found this approach to work well in practice, it has the disadvantage that a variable cannot re-enter the model once it has been deleted. Therefore, if using a sparsity-inducing prior, we check for convergence
once a putative zero has been found, by proposing small perturbations in each component of $\beta$ to assess whether any variables should re-enter the model. An alternate approach involves the use of restricted least squares; for details, see Section 3.2 of Polson and Scott (2011c).

Our method, of course, does not sidestep the problem of optimization over a combinatorially large space. In particular, there is no way to guarantee convergence to a global maximum if the penalty function is concave, in which case multiple restarts from different initial values will be necessary to check for the presence of local modes.

3.3 The maximization step

Returning to (4), the maximization step involves computing the posterior mode under a heteroscedastic Gaussian error model and a conditionally Gaussian prior for $\beta$, given the latent variables $\{\omega_i\}$ and $\{\lambda_j\}$. This posterior mode is recognizable as a generalized ridge estimator.

Theorem 2. Suppose that the objective function $Q(\beta)$ in (1) can be represented by variance-mean Gaussian mixtures, as in (2)-(3). Then given estimates $\{\hat\omega_i\}$ and $\{\hat\lambda_j\}$, we have the following expressions for the conditional maximum of $\beta$, where $\omega = (\omega_1, \ldots, \omega_n)$ and $\lambda = (\lambda_1, \ldots, \lambda_p)$ are vectors, and where $\Omega = \mathrm{diag}(\omega_1, \ldots, \omega_n)$ and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$.

1. In a regression problem,

$$\hat\beta = \left( \tau^{-2}\hat\Lambda + X^T\hat\Omega X \right)^{-1} (y^\star + b^\star), \qquad y^\star = X^T\left( \hat\Omega y - \mu_z\hat\omega - \kappa_z 1 \right), \qquad b^\star = \tau^{-2}\left( \mu_\beta\hat\lambda + \kappa_\beta 1 \right).$$

2. In a binary classification problem where $y_i = \pm 1$ and $X_\star$ has rows $x_i^\star = y_i x_i$,

$$\hat\beta = \left( \tau^{-2}\hat\Lambda + X_\star^T\hat\Omega X_\star \right)^{-1} X_\star^T\left( \mu_z\hat\omega + \kappa_z 1 \right).$$

The maximization step can be easily extended to encompass a series of conditional maximization steps: first for the regression coefficients $\beta$, and then for hyperparameters such as $\sigma$ and $\tau$. These latter steps exploit standard results on variance components in linear models; we therefore omit the details, and refer the reader to, for example, Gelman (2006).
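The regression case of Theorem 2 can be recovered by differentiating (4). The following short derivation is added for illustration and assumes $\sigma = 1$, since $\sigma$ does not appear in the stated estimator. With $z = y - X\beta$, the gradient of the complete-data log posterior is

$$\nabla_\beta \log p(\beta \mid \omega, \lambda, \tau, z) = X^T\left\{ \hat\Omega (y - X\beta) - \mu_z\hat\omega - \kappa_z 1 \right\} - \tau^{-2}\left\{ \hat\Lambda\beta - \mu_\beta\hat\lambda - \kappa_\beta 1 \right\}.$$

Setting this to zero and collecting the terms in $\beta$ gives the normal equations

$$\left( \tau^{-2}\hat\Lambda + X^T\hat\Omega X \right)\beta = X^T\left( \hat\Omega y - \mu_z\hat\omega - \kappa_z 1 \right) + \tau^{-2}\left( \mu_\beta\hat\lambda + \kappa_\beta 1 \right) = y^\star + b^\star,$$

which is a ridge regression with observation weights $\hat\omega_i$ and coefficient-specific penalties $\tau^{-2}\hat\lambda_j$.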
4 Examples

4.1 Binary logistic regression

The simplicity and generality of our approach are best illustrated by an initial example involving a familiar likelihood. Suppose we wish to fit a penalized logistic regression, where

$$\hat\beta = \arg\min_{\beta \in \mathbb{R}^p} \left[ \sum_{i=1}^{n} \log\{1 + \exp(-y_i x_i^T\beta)\} + \sum_{j=1}^{p} g(\beta_j \mid \tau) \right],$$
assuming that the outcomes $y_i$ are coded as $\pm 1$, and that $\tau$ is fixed. Many factors conspire to make this a difficult problem, but we focus particularly on the non-Gaussianity of the likelihood.

For the logistic likelihood, recall that $\omega_i^{-1}$ has a Polya distribution with $\alpha = 1$, $\kappa = 1/2$. Theorem 1 gives the relevant conditional moment as

$$\hat\omega_i = \frac{1}{z_i}\left( \frac{e^{z_i}}{1 + e^{z_i}} - \frac{1}{2} \right).$$

Therefore, if the log prior $g$ satisfies (3), then the following three updates, when iterated repeatedly, will generate a sequence of estimates that converges to a stationary point of $Q(\beta)$:

$$\beta^{(g+1)} = \left( \tau^{-2}\hat\Lambda^{(g)} + X_\star^T\hat\Omega^{(g)} X_\star \right)^{-1} \left( \frac{1}{2} X_\star^T 1 \right)$$

$$\hat\omega_i^{(g+1)} = \frac{1}{z_i^{(g+1)}}\left( \frac{e^{z_i^{(g+1)}}}{1 + e^{z_i^{(g+1)}}} - \frac{1}{2} \right)$$

$$\hat\lambda_j^{(g+1)} = \frac{\kappa_\beta + \tau^2 g'(\beta_j^{(g+1)} \mid \tau)}{\beta_j^{(g+1)} - \mu_\beta},$$

where $z_i^{(g)} = y_i x_i^T\beta^{(g)}$; $X_\star$ is the matrix having rows $x_i^\star = y_i x_i$; $1$ is a vector of ones; and where $\Omega = \mathrm{diag}(\omega_1, \ldots, \omega_n)$ and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$ are diagonal matrices. If the penalty function $g$ is convex, this stationary point will be the global maximum (Zou and Li, 2008).

This sequence of steps resembles iteratively re-weighted least squares (Green, 1984) due to the presence of the diagonal weights matrix $\Omega$. But there are subtle and important differences, even in the unpenalized case where $\lambda_j \equiv 0$ and the solution is the maximum-likelihood estimator. In iteratively re-weighted least squares, for example, the analogous weight matrix $\Omega$ has diagonal entries $\omega_i = \mu_i(1 - \mu_i)$, where $\mu_i = 1/(1 + e^{-x_i^T\beta})$ is the estimated value of $\mathrm{pr}(y_i = 1)$ at each stage of the algorithm. These weights arise from a sequential approximation to the likelihood. In contrast, the weights $\omega_i$ in our second step above arise from an exact mixture representation of the likelihood.

Figure 1 shows a comparison of the weights for each algorithm, as a function of the linear predictor $x_i^T\beta$. The weights under iteratively re-weighted least squares decay to zero much more rapidly than in our algorithm.
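The updates and the two weight functions can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the authors' implementation. It assumes the ridge penalty $g(\beta_j \mid \tau) = (\beta_j/\tau)^2$ with $\mu_\beta = \kappa_\beta = 0$, for which Theorem 1 gives the constant $\hat\lambda_j = 2$; all function names and the small dense solver are hypothetical conveniences.

```python
import math

def em_weight(z):
    """omega_hat = (sigmoid(z) - 1/2) / z = tanh(z/2) / (2z); the limit at 0 is 1/4."""
    if abs(z) < 1e-8:
        return 0.25
    return math.tanh(0.5 * z) / (2.0 * z)

def irls_weight(eta):
    """mu * (1 - mu) with mu = sigmoid(eta), written to avoid overflow."""
    e = math.exp(-abs(eta))
    return e / (1.0 + e) ** 2

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = s / M[r][r]
    return x

def em_ridge_logistic(X, y, tau, iters=200):
    """EM for Q(beta) = sum_i log(1 + exp(-y_i x_i'beta)) + sum_j (beta_j/tau)^2."""
    n, p = len(X), len(X[0])
    Xs = [[y[i] * X[i][j] for j in range(p)] for i in range(n)]  # rows y_i * x_i
    lam = 2.0  # Theorem 1 for the ridge penalty assumed in this sketch
    beta = [0.0] * p
    for _ in range(iters):
        z = [sum(Xs[i][j] * beta[j] for j in range(p)) for i in range(n)]
        w = [em_weight(zi) for zi in z]  # expectation step
        A = [[sum(w[i] * Xs[i][j] * Xs[i][k] for i in range(n))
              + (lam / tau ** 2 if j == k else 0.0)
              for k in range(p)] for j in range(p)]
        b = [0.5 * sum(Xs[i][j] for i in range(n)) for j in range(p)]
        beta = solve(A, b)  # maximization step (generalized ridge)
    return beta

def gradient(X, y, tau, beta):
    """Gradient of Q(beta); it should vanish at the EM fixed point."""
    g = [2.0 * bj / tau ** 2 for bj in beta]
    for xi, yi in zip(X, y):
        z = yi * sum(a * b for a, b in zip(xi, beta))
        s = 1.0 / (1.0 + math.exp(z))
        for j in range(len(beta)):
            g[j] -= yi * xi[j] * s
    return g
```

Both weight functions equal $1/4$ at zero, but for a large linear predictor the data-augmentation weight decays like $1/(2|z|)$ while the re-weighted least-squares weight decays like $e^{-|z|}$, which is the contrast drawn in Figure 1.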
This can lead to numerical difficulties when the successes and failures are nearly separable by a hyperplane in $\mathbb{R}^p$.

To illustrate this phenomenon, we simulated a logistic-regression problem with 2 variables and 1 observations, where each coefficient $\beta_j$ and each design point $x_{ij}$ were independent draws from a standard normal distribution. We then compared our expectation-maximization algorithm to iteratively re-weighted least squares for computing the unpenalized maximum-likelihood estimate. Each algorithm was run from two different starting values: $\beta^{(0)} = (1/1, \ldots, 1/1)^T$, and $\beta^{(0)} = (1/2, \ldots, 1/2)^T$.

As Figure 2 shows, iteratively re-weighted least squares is highly sensitive to the choice
of starting value. It converges when initialized at $\beta^{(0)} = (1/1, \ldots, 1/1)^T$, but not when initialized at $\beta^{(0)} = (1/2, \ldots, 1/2)^T$. Our data-augmentation algorithm is far more robust, finding the maximum easily in both cases.

We emphasize that this is not a problem caused by perfect separability of the successes and failures, in which case there is no unique maximum-likelihood solution. As the bottom pane of Figure 2 shows, 18 of 1 observations had fitted success probabilities between 0.05 and 0.95. Rather, the problem is that the iterative approximation can be poor in regions where the likelihood surface is nearly flat.

Figure 1: The weights $\omega_i$ on the diagonal weight matrix $\Omega$ that arise in iteratively re-weighted least squares for logistic regression, versus those that arise in our data-augmentation approach, as a function of the linear predictor $x_i^T\beta$. (Vertical axis: weight function; horizontal axis: absolute value of linear predictor; curves: iteratively re-weighted least squares and expectation-maximization.)

Next, we investigated the performance of data augmentation versus iteratively re-weighted least squares when a penalty function is applied. For this case we simulated data with a nontrivial correlation structure in the design matrix. Specifically, we defined $\Sigma = BB^T + \Psi$, where $B$ is a $5 \times 4$ matrix of standard normal random entries, and $\Psi$ is a diagonal matrix with $\chi^2_1$ random entries. The rows of the design matrix $X$ were simulated from a multivariate normal distribution with mean zero and covariance matrix $\Sigma$; the coefficients $\beta_j$ were standard normal random draws; and the size of the data set was $p = 5$ and $n = 2$.

We first used a ridge-regression penalty, where $g(\beta_j \mid \tau) = (\beta_j/\tau)^2$, leading to a convex problem. We also used the generalized double-Pareto model proposed by Armagan et al. (2012a), where

$$p(\beta_j \mid \tau) \propto \left( 1 + \frac{|\beta_j|}{a\tau} \right)^{-(1+a)}.$$

Like the Laplace prior, this is non-differentiable at zero and is therefore sparsity-inducing, but has polynomial rather than exponential tail behavior. Armagan et al.
(2012b) show that this model leads to strong consistency of the posterior in regression models with a diverging number of parameters, which can be thought of as a Bayesian analogue of the oracle property. The generalized double-Pareto model has a conditionally Gaussian representation, making Theorem 1 applicable.
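To make the expectation step concrete for this prior, Theorem 1 can be applied directly. The calculation below is an illustration added here, and it assumes a pure variance mixture, that is $\mu_\beta = \kappa_\beta = 0$. With $g(\beta_j \mid \tau) = (1 + a)\log\{1 + |\beta_j|/(a\tau)\}$ and $\beta_j \neq 0$,

$$g'(\beta_j \mid \tau) = \frac{(1 + a)\,\mathrm{sign}(\beta_j)}{a\tau + |\beta_j|}, \qquad \hat\lambda_j = \frac{\tau^2 g'(\beta_j \mid \tau)}{\beta_j} = \frac{(1 + a)\,\tau^2}{|\beta_j|\,(a\tau + |\beta_j|)}.$$

As with the lasso, $\hat\lambda_j$ diverges as $\beta_j \to 0$. Unlike the lasso, it decays like $|\beta_j|^{-2}$ for large $|\beta_j|$, so large coefficients are shrunk less.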
Figure 2: Convergence of expectation-maximization versus iteratively re-weighted least squares from two different starting points for a simulated logistic-regression problem. The top four panes show the contours of the likelihood as a function of $(\beta_1, \beta_2)$, with all other coefficients held at their maximum-likelihood values. In the left-hand panes, the coefficients were all initialized at 1; in the right-hand panes, they were all initialized at 5. The black lines trace out the values of $(\beta_1, \beta_2)$ as the algorithm progresses; the grey dot is the true maximum. The bottom pane shows the fitted values $\mathrm{pr}(y_i = 1)$ at this maximum. (Top panes: one each for expectation-maximization and iteratively re-weighted least squares at each initialization, with axes $\beta_1$ and $\beta_2$; bottom pane: fitted probability versus observation.)
We chose $a = 2$, and used each algorithm to compute a full solution path for $\beta$ as a function of the regularization parameter, here expressed as $\log(1/\tau)$ in keeping with the penalized-likelihood literature. We began by fitting the solution for $\tau = 10^{-3}$, which essentially constrained all coefficients to be zero, or very small. We then increased the value of $\tau$ along a discrete grid $\{\tau_1, \ldots, \tau_K = 1\}$, using the solution for $\tau_k$ as the starting value for the $\tau_{k+1}$ case.

As Figure 3 shows, iteratively re-weighted least squares fails when $\log(1/\tau)$ becomes too small, and the coefficient vector becomes larger in magnitude. This happens because the linear system that must be solved at each stage of the algorithm becomes numerically singular. It does so, moreover, at a point when 2 out of 2 observations still had fitted success probabilities between 0.05 and 0.95. In fact, under the double-Pareto prior, iteratively re-weighted least squares fails before all coefficients have even entered the model. No such pathology affects the expectation-maximization algorithm.

Figure 3: The solution paths for $\beta$ for the double-Pareto and ridge penalties, as a function of the regularization parameter $\log(1/\tau)$, for a simulated logistic regression problem. The black lines show the solution for iteratively re-weighted least squares; the grey lines, for our expectation-maximization algorithm. Moving from right to left, the black lines stop where iteratively re-weighted least squares fails due to numerical instability. (Panels: solution path with double-Pareto penalty; solution path with ridge penalty. Vertical axes: coefficients; horizontal axes: log of regularization parameter.)

Two further comments are in order. First, while we have focused here on estimating $\beta$ using the posterior mode, the problems we have identified will also arise in any Bayesian treatment of logistic regression that is based upon an analytic approximation to the logistic likelihood (e.g. Gelman et al., 2008a).
We hope to extend these results to facilitate fully Bayesian inference.

Second, sparse logistic regression via penalized likelihood is a topic of great current interest (Genkin et al., 2007; Meier et al., 2008). This problem involves three logically distinct issues: how to handle the logistic likelihood; which penalty function to use; and how to fit the resulting model, whether by a block updating scheme or by coordinate descent. These issues interact in poorly understood ways. For example, it is widely known that
coordinate-by-coordinate algorithms, including Gibbs sampling, can fare poorly in highly multimodal situations. Likewise, it is known that nonconvex penalties lead to multimodal objective functions, but also, subject to certain regularity conditions, exhibit more favorable statistical properties for estimating sparse signals (Fan and Li, 2001; Carvalho et al., 2010). Moreover, coordinate descent is tractable only if the chosen penalty leads to a univariate thresholding function whose solution is analytically available (Mazumder et al., 2011). This is a fairly narrow class, and does not include most of the penalties mentioned in the introduction.

The question of how to handle the likelihood complicates matters still further. For example, the area of detail in Figure 3 shows that, for a double-Pareto penalty, the solution paths fit by iteratively re-weighted least squares differ in subtle but noticeable ways from those fit by data augmentation. By simply checking the maximized value of the objective function under both methods, we are able to confirm that iteratively re-weighted least squares is not converging to the true optimum. Yet we do not entirely understand why, and under what circumstances, the methods will differ, and how these differences should affect recommendations about what penalty function and algorithm should be used to fit logistic regression models. A full study of these issues is beyond the scope of the current paper, but is a subject of active inquiry.

4.2 Penalized quantile regression

We now describe a data analysis whose goal is to understand the predictors of the long-term growth rate of a country's gross domestic product. Specifically, we use quantile regression to describe how the conditional quantile of a country's annualized growth rate, $y_i$, depends upon various political and socioeconomic predictors. The data set comprises 161 observations on 87 countries over two periods. It comes originally from a 1994 discussion paper by R. Barro and J.
Lee, available from the National Bureau of Economic Research; a full description can be found in Koenker and Machado (1999).

We used our data-augmentation scheme to fit penalized quantile-regression models, comparing them to the corresponding maximum-likelihood estimates. To do so, we choose $p(\omega_i)$ to be a generalized inverse-Gaussian prior of unit scale, where $(\alpha, \kappa, \mu) = (1, 1 - 2q, 0)$. This leads to $-\log p(z_i) = |z_i| + (2q - 1)z_i$, the pseudo-likelihood which yields quantile regression for the $q$th quantile (Koenker, 2005; Li et al., 2010). Applying Theorem 1, we get $\hat\omega_i = |y_i - x_i^T\beta^{(g)}|^{-1}$ as the expected value of the conditional sufficient statistic needed in the expectation step of our algorithm.

To illustrate our method, we applied a bridge penalty, where $g(\beta_j \mid \tau) = |\beta_j/\tau|^a$. We chose $a = 1/2$, estimating $\tau$ and $\beta$ jointly via an expectation-conditional-maximization algorithm. An inverse-gamma prior with shape and rate parameters equal to 1 was assumed for $\nu = \tau^a$. This leads to the following two conditional-maximization steps, given $\lambda_j$ and $\omega_i$:

$$(\hat\beta \mid \tau) = \left( \tau^{-2}\hat\Lambda + X^T\hat\Omega X \right)^{-1} X^T\left\{ \hat\Omega y - (1 - 2q)\,1 \right\}$$

$$(\hat\nu \mid \beta) = \frac{1 + \sum_{j=1}^{p} |\beta_j|^a}{2 + p/a},$$

allowing $\tau = \nu^{1/a}$ to be obtained easily, and avoiding the need for cross-validation.

We used the same 12 predictors used by Koenker and Machado (1999), and compared our bridge-penalized quantile regressions to the maximum-likelihood estimates, obtainable using the R package for quantile regression (Koenker, 2011). The predictors were standardized to have zero mean and unit scale.

Figure 4 shows the results of this comparison, plotting the regression coefficients versus the conditional quantile across a discrete grid of choices for $q$. In each frame, the shaded grey area extends one standard error to either side of the estimate. Under the penalized-likelihood procedure, the standard error quoted here is actually the posterior standard deviation under the multivariate normal approximation at the mode: $\mathrm{var}(\beta_j \mid y) \approx \hat\tau^2/\hat\lambda_j$. Alas, this does not lead to sensible estimates of uncertainty for coefficients shrunk to zero, which is a known shortcoming of penalized likelihood. The direct comparison is with Figure 3 of Koenker and Machado (1999).

Figure 4: Estimated coefficients versus conditional quantile $q \in (0, 1)$ for the Barro GDP growth data. The top 12 panes show the maximum-likelihood estimates for each coefficient as a function of $q$, while the bottom 12 show the corresponding estimates under a bridge penalty. The shaded grey areas show estimated error bars, extending one estimated standard error above and below the line. (Panels: Trade Growth, Investment/GDP, Political Instability, Black Market Premium, Consumption/GDP, Education/GDP, Human Capital, Life Expectancy, Male Higher Education, Female Higher Education, Male Secondary Education, Female Secondary Education.)

There are noticeable differences between the penalized and unpenalized fits. The bridge estimator shrinks four coefficients all the way to zero for every conditional quantile: female secondary education, female higher education, male higher education, and human capital. It shrinks four other coefficients to zero for a substantial fraction of values $q \in (0, 1)$: male secondary education, life expectancy, education spending, and political instability.

Also interesting is that the posterior standard deviations for the predictors that remain in the model are typically larger than the asymptotic standard errors for the corresponding maximum-likelihood estimates. This is counter-intuitive; one would expect that, by excluding needless predictors from the model, uncertainty about those that remain in the model would be reduced, rather than inflated.
We do not know whether this reflects the deficiency of the asymptotic approximation needed to compute standard errors under the maximum-likelihood procedure; the failure of the normal approximation at the mode; or some other unappreciated aspect of the model.

5 Discussion

Our primary goal in this paper has been to show the relevance of the conditionally Gaussian representation of (2)-(3), together with Theorem 1, for fitting a wide class of regularized estimators within a unified variance-mean mixture framework. We have therefore focused only on the most basic implementation of the expectation-maximization algorithm.

There are many further variants on the basic algorithm, however, some of which can lead to dramatic advantages in speed and stability for certain problems. Key references include Meng and van Dyk (1997) and Gelman et al. (2008b). These variants include marginal data augmentation (Liu, 1995), parameter expansion (Liu and Wu, 1999), majorization-minimization (Hunter and Lange, 2000), the partially collapsed Gibbs sampler (van Dyk and Park, 2008), and other simulation-based alternatives described by van Dyk and Meng (2010). Many of these modifications require additional analytical work for particular choices of $g$ and $f$. One example is the work of Liu (2005) on the robit model. We have not explored these options here, and this remains a promising area for future research.

A second important fact is that, for many purposes, such as estimation of $\beta$ under squared-error loss, the relevant quantity of interest is the posterior mean, not the mode. Indeed, both Hans (2009) and Efron (2009) argue that, for predicting future observables, the posterior mean of $\beta$ is the better choice. The following theorem represents the posterior mean for $\beta$ in terms of the score function of the predictive distribution, generalizing the results of Brown (1971), Masreliez (1975), Pericchi and Smith (1992), and Carvalho et al. (2010). There are a number of possible versions of such a theorem. Here we consider a variance-mean mixture prior $p(\beta_j)$ with a general location likelihood $p(y - \beta)$, but clearly a similar result holds the other way around. We consider the case where $X$ is an orthogonal matrix, assumed without loss of generality to be the identity matrix, in which case we apply the theorem component by component.

Theorem 3. Let $p(|y_j - \beta_j|)$ be the likelihood for a location parameter $\beta_j$, symmetric in $y_j - \beta_j$, and let
$$p(\beta_j) = \int \phi(\beta_j; \mu_\beta + \kappa_\beta/\lambda_j, \tau^2/\lambda_j)\, p(\lambda_j^{-1})\, d\lambda_j^{-1}$$
be a normal variance-mean mixture prior. Define the following distributions:
$$m(y_j) = \int p(y_j - \beta_j)\, p(\beta_j)\, d\beta_j, \qquad p^\star(\lambda_j^{-1}) = \frac{\lambda_j\, p(\lambda_j^{-1})}{E(\lambda_j)},$$
$$p^\star(\beta_j) = \int \phi(\beta_j; \mu_\beta + \kappa_\beta/\lambda_j, \tau^2/\lambda_j)\, p^\star(\lambda_j^{-1})\, d\lambda_j^{-1}, \qquad m^\star(y_j) = \int p(y_j - \beta_j)\, p^\star(\beta_j)\, d\beta_j.$$
Then
$$E(\beta_j \mid y_j) = \frac{\kappa_\beta}{\tau^2} + \left\{\frac{\mu_\beta E(\lambda_j)}{\tau^2}\right\}\left\{\frac{m^\star(y_j)}{m(y_j)}\right\} + \left\{\frac{E(\lambda_j)}{\tau^2}\right\}\left\{\frac{m^\star(y_j)}{m(y_j)}\right\}\left\{\frac{\partial}{\partial y_j}\log m^\star(y_j)\right\}.$$

The generalization to nonorthogonal designs is straightforward, following the original Masreliez (1975) paper; see, for example, Griffin and Brown (2010), along with the discussion of the Tweedie formula by Efron (2011).
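The link between the posterior mean and the score of the predictive distribution can be checked numerically in the classical special case that Theorem 3 generalizes. The sketch below is not Theorem 3 itself: it assumes a Gaussian likelihood with unit variance and a Laplace prior with unit scale (both chosen here purely for illustration), and compares the posterior mean obtained by direct integration with the Tweedie form $E(\beta \mid y) = y + \frac{d}{dy}\log m(y)$. All function names are hypothetical.

```python
import math

def normal_pdf(x, mean=0.0, sd=1.0):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def laplace_pdf(b, scale=1.0):
    # Assumed prior for this illustration: double-exponential with unit scale.
    return math.exp(-abs(b) / scale) / (2.0 * scale)

def integrate(f, lo=-20.0, hi=20.0, n=20000):
    """Composite trapezoid rule on a fixed grid."""
    h = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi))
    for k in range(1, n):
        total += f(lo + k * h)
    return total * h

def marginal(y):
    """Predictive density m(y) = integral of p(y - beta) p(beta) d beta."""
    return integrate(lambda b: normal_pdf(y, mean=b) * laplace_pdf(b))

def posterior_mean_direct(y):
    """E(beta | y) by direct numerical integration."""
    num = integrate(lambda b: b * normal_pdf(y, mean=b) * laplace_pdf(b))
    return num / marginal(y)

def posterior_mean_tweedie(y, eps=1e-3):
    """E(beta | y) = y + d/dy log m(y), valid for unit-variance Gaussian noise."""
    score = (math.log(marginal(y + eps)) - math.log(marginal(y - eps))) / (2.0 * eps)
    return y + score

if __name__ == "__main__":
    for y in (0.5, 2.0):
        print(y, posterior_mean_direct(y), posterior_mean_tweedie(y))
```

The two routes should agree up to discretization error, which illustrates why only the predictive density and its derivative are needed, not the full posterior.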
Computing the posterior mean will typically require sampling from the full joint posterior distribution over all parameters. Our data-augmentation approach can lead to Markov-chain Monte Carlo sampling schemes for just this purpose. The key step is the identification of the conditional distributions for $\lambda_j$ and $\omega_i$ under specific models; this remains an active area of research.
Acknowledgements

The authors wish to thank two anonymous referees, in addition to the editor and associate editor, for their many helpful comments in improving the paper.

A Appendix: distributional results

A.1 Generalized hyperbolic distributions

In all of the following cases, we assume that $(\theta \mid v) \sim N(\mu + \kappa v, v)$, and that $v \sim p(v)$. Let $p(v)$ be a generalized inverse-Gaussian distribution $G(\psi, \gamma, \delta)$, following the notation of Barndorff-Nielsen and Shephard (2001). We consider the special case of this class where $\psi = 1$ and $\delta = 0$, in which case $p(\theta)$ is a hyperbolic distribution having density
$$p(\theta \mid \mu, \alpha, \kappa) = \left(\frac{\alpha^2 - \kappa^2}{2\alpha}\right) \exp\{-\alpha|\theta - \mu| + \kappa(\theta - \mu)\}.$$

When viewed as a pseudo-likelihood or pseudo-prior, the class of generalized hyperbolic distributions will generate many common objective functions. First, choosing $(\alpha, \kappa, \mu) = (1, 0, 0)$ leads to $-\log p(\beta_j) = |\beta_j|$, and thus $\ell_1$ regularization. Second, choosing $(\alpha, \kappa, \mu) = (1, 1-2q, 0)$ gives $-\log p(z_i) = |z_i| + (2q-1)z_i$. This is the check-loss function, yielding quantile regression for the $q$th quantile. Third, choosing $(\alpha, \kappa, \mu) = (1, 1, 1)$ leads to the maximum operator:
$$-(1/2)\log p(z_i) = (1/2)|1 - z_i| + (1/2)(1 - z_i) = \max(1 - z_i, 0)$$
for $z_i = y_i x_i^T\beta$. This is the objective function for support vector machines (e.g. Mallick et al., 2005; Polson and Scott, 2011c), and corresponds to the limiting case of a generalized inverse-Gaussian prior.

A.2 Z distributions

Now let $p_P(v \mid \alpha, \alpha - 2\kappa)$ be a Polya distribution, which can be represented as an infinite convolution of exponentials, and leads to a Z-distributed marginal. The important result is the following:
$$p_Z(\theta \mid \mu, \alpha, \kappa) = \frac{1}{B(\alpha, \kappa)} \frac{(e^{\theta - \mu})^{\alpha}}{(1 + e^{\theta - \mu})^{2(\alpha - \kappa)}} = \int_0^\infty N(\mu + \kappa v, v)\, p_P(v \mid \alpha, \alpha - 2\kappa)\, dv.$$
See Barndorff-Nielsen et al. (1982). When viewed as a likelihood in $\theta$, the class of Polya/Z distributions results in the logistic and multinomial models.
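The three special cases of the hyperbolic density listed in A.1 are easy to confirm numerically. The sketch below evaluates the negative log kernel $\alpha|\theta - \mu| - \kappa(\theta - \mu)$, dropping the normalizing constant, and compares it with the absolute-value, check, and hinge losses. The function names are hypothetical. One convention is assumed here: the check loss is taken as $\rho_q(z) = z\{q - I(z < 0)\}$, under which the expression $|z| + (2q-1)z$ equals $2\rho_q(z)$.

```python
def neg_log_kernel(theta, alpha, kappa, mu):
    """Negative log of exp{-alpha*|theta - mu| + kappa*(theta - mu)}."""
    return alpha * abs(theta - mu) - kappa * (theta - mu)

def check_loss(z, q):
    """Quantile-regression check loss rho_q(z) = z * (q - I(z < 0))."""
    return z * (q - (1.0 if z < 0 else 0.0))

def hinge_loss(z):
    """Support-vector-machine hinge loss max(1 - z, 0)."""
    return max(1.0 - z, 0.0)

if __name__ == "__main__":
    for z in (-2.0, -0.3, 0.0, 0.7, 1.5):
        print(
            z,
            neg_log_kernel(z, 1.0, 0.0, 0.0),               # absolute value
            neg_log_kernel(z, 1.0, 1.0 - 2 * 0.25, 0.0),    # twice the check loss, q = 0.25
            0.5 * neg_log_kernel(z, 1.0, 1.0, 1.0),         # hinge loss
        )
```

The factor of two in the quantile case only rescales the objective and does not change the minimizer.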
For logistic regression, choosing $(\alpha, \kappa, \mu) = (1, 1/2, 0)$ leads to
$$p(z_i) = \frac{e^{z_i}}{1 + e^{z_i}},$$
which is the likelihood for logistic regression with $z_i = y_i x_i^T\beta$. Much like the support-vector-machine representation, this corresponds to a limiting improper case of the Polya distribution, specifically $P(1, 0)$. The necessary mixture representation still holds, however, by applying the Fatou-Lebesgue theorem. The improper mixing measure $p_{1,0}(v)$ is an infinite sum of exponentials for which the integral on the right still converges to the logistic likelihood (Gramacy and Polson, 2012).

For the multinomial generalization of the logistic model, we require a slight modification. Suppose that $y_i \in \{1, \ldots, K\}$ is an unordered category indicator, and that $\beta_k = (\beta_{k1}, \ldots, \beta_{kp})^T$ is a separate block of $p$ regression coefficients for the $k$th category. Let
$$\eta_{ik} = \frac{\exp(x_i^T\beta_k - c_{ik})}{1 + \exp(x_i^T\beta_k - c_{ik})}, \qquad \text{where} \qquad c_{ik}(\beta_{-k}) = \log\left\{\sum_{l \neq k} \exp(x_i^T\beta_l)\right\}.$$
We follow Holmes and Held (2006) in writing the conditional likelihood for $\beta_k$ as
$$Q(\beta_k \mid \beta_{-k}, y) \propto \prod_{i=1}^n \prod_{l=1}^K \eta_{il}^{I(y_i = l)} \propto \prod_{i=1}^n \eta_{ik}^{I(y_i = k)} \{w_i(1 - \eta_{ik})\}^{I(y_i \neq k)} \propto \prod_{i=1}^n \frac{\{\exp(x_i^T\beta_k - c_{ik})\}^{I(y_i = k)}}{1 + \exp(x_i^T\beta_k - c_{ik})},$$
where $w_i$ is independent of $\beta_k$ and $I$ is the indicator function. Thus the conditional likelihood for the $k$th block of coefficients $\beta_k$, given all the other blocks of coefficients $\beta_{-k}$, can be written as a product of $n$ terms, the $i$th term having a Polya mixture representation with $\alpha_{ik} = I(y_i = k)$, $\kappa_{ik} = \alpha_{ik} - 1/2$, and $\mu_{ik} = c_{ik}(\beta_{-k})$. This allows regularized multinomial logistic models to be fit using essentially the same approach as in Section 4.1, where each block $\beta_k$ is updated using a conditional maximization step.

B Appendix: Proofs

B.1 Theorem 1

Since $\phi$ is a normal kernel,
$$\frac{\partial \phi(\beta_j \mid \mu_\beta + \kappa_\beta/\lambda_j, \tau^2/\lambda_j)}{\partial \beta_j} = -\left(\frac{\beta_j - \mu_\beta - \kappa_\beta/\lambda_j}{\tau^2/\lambda_j}\right) \phi(\beta_j \mid \mu_\beta + \kappa_\beta/\lambda_j, \tau^2/\lambda_j).$$
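We use this fact to differentiate
$$p(\beta_j \mid \tau) = \int \phi(\beta_j \mid \mu_\beta + \kappa_\beta/\lambda_j, \tau^2/\lambda_j)\, p(\lambda_j \mid \tau)\, d\lambda_j$$
under the integral sign to obtain
$$\frac{\partial}{\partial \beta_j} p(\beta_j \mid \tau) = \int \left\{\frac{\partial}{\partial \beta_j} \phi(\beta_j \mid \mu_\beta + \kappa_\beta/\lambda_j, \tau^2/\lambda_j)\right\} p(\lambda_j \mid \tau)\, d\lambda_j.$$
Dividing by $p(\beta_j \mid \tau)$ and using the above identity for the inner function, we get
$$\frac{1}{p(\beta_j \mid \tau)} \frac{\partial}{\partial \beta_j} p(\beta_j \mid \tau) = \frac{\kappa_\beta}{\tau^2} - \left(\frac{\beta_j - \mu_\beta}{\tau^2}\right) E(\lambda_j \mid \beta_j, \tau).$$
Hence we can in general find the needed moment using the expression
$$\left\{\frac{1}{p(\beta_j \mid \tau)}\right\} \frac{\partial}{\partial \beta_j} p(\beta_j \mid \tau) = \frac{\kappa_\beta}{\tau^2} - \left(\frac{\beta_j - \mu_\beta}{\tau^2}\right) E\left(\lambda_j \mid \beta_j^{(g)}, \tau, y\right).$$
Equivalently, in terms of the penalty function $-\log p(\beta_j \mid \tau)$, we have
$$(\beta_j - \mu_\beta)\, E(\lambda_j \mid \beta_j) = \kappa_\beta - \tau^2 \frac{\partial}{\partial \beta_j} \log p(\beta_j \mid \tau).$$
By a similar argument, we also have
$$(z_i - \mu_z)\, E(\omega_i \mid \beta, z_i, \sigma) = \kappa_z - \sigma^2 \frac{\partial}{\partial z_i} \log p(z_i \mid \beta, \sigma).$$
We obtain the desired result by using the identities
$$\frac{\partial}{\partial z_i} \log p(z_i \mid \beta, \sigma) = -f'(z_i \mid \beta, \sigma) \qquad \text{and} \qquad \frac{\partial}{\partial \beta_j} \log p(\beta_j \mid \tau) = -g'(\beta_j \mid \tau).$$

B.2 Theorem 2

We demonstrate the result for a regression problem, with the classification result following as a simple modification. Begin with Equation (4). Collecting terms, we can represent the log posterior, up to an additive constant not involving $\beta$, as a sum of quadratic forms in $\beta$:
$$\log p(\beta \mid \omega, \lambda, \tau, \sigma, z) = -\frac{1}{2}\left(\{y - \mu_z\mathbf{1} - \kappa_z\omega^{-1}\} - X\beta\right)^T \Omega \left(\{y - \mu_z\mathbf{1} - \kappa_z\omega^{-1}\} - X\beta\right) - \frac{1}{2\tau^2}\left(\beta - \mu_\beta\mathbf{1} - \kappa_\beta\lambda^{-1}\right)^T \Lambda \left(\beta - \mu_\beta\mathbf{1} - \kappa_\beta\lambda^{-1}\right),$$
where we recall that $\omega^{-1}$ is a column vector whose $i$th entry is $\omega_i^{-1}$, and similarly for $\lambda^{-1}$. This is the log posterior under a normal prior $\beta \sim N(\mu_\beta\mathbf{1} + \kappa_\beta\lambda^{-1}, \tau^2\Lambda^{-1})$ after having observed the working response $y - \mu_z\mathbf{1} - \kappa_z\omega^{-1}$. The identity $\Omega(\mu\mathbf{1} + \kappa\omega^{-1}) = \mu\omega + \kappa\mathbf{1}$ then gives the result.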
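The way the moment identity of B.1 feeds the weighted least-squares update of B.2 can be seen in a deliberately simplified scalar case. The setting below is an assumption made for illustration and is not an example from the paper: a single observation $y \sim N(\beta, \sigma^2)$, so that $\omega \equiv 1$, with lasso penalty $g(\beta \mid \tau) = |\beta|/\tau$ and $\mu_\beta = \kappa_\beta = 0$. The identity $(\beta - \mu_\beta)E(\lambda \mid \beta) = \kappa_\beta + \tau^2 g'(\beta)$ then gives $E(\lambda \mid \beta) = \tau/|\beta|$, and the conditional maximization step is the ridge-type update $\beta = y/(1 + \sigma^2\hat\lambda/\tau^2)$. The iteration should converge to the soft-thresholding rule that minimizes $(y - \beta)^2/(2\sigma^2) + |\beta|/\tau$. Function names are hypothetical.

```python
import math

def em_lasso_scalar(y, sigma=1.0, tau=1.0, n_iter=200):
    """EM iteration for the scalar lasso problem (illustrative sketch only)."""
    beta = y  # start at the unpenalized estimate
    for _ in range(n_iter):
        if beta == 0.0:
            break  # lambda-hat is infinite here, so beta stays at zero
        lam = tau / abs(beta)                         # E-step: E(lambda | beta)
        beta = y / (1.0 + sigma ** 2 * lam / tau ** 2)  # M-step: ridge-type update
    return beta

def soft_threshold(y, sigma=1.0, tau=1.0):
    """Closed-form minimizer of (y - b)^2 / (2 sigma^2) + |b| / tau."""
    threshold = sigma ** 2 / tau
    return math.copysign(max(abs(y) - threshold, 0.0), y)

if __name__ == "__main__":
    for y in (3.0, -3.0, 0.5):
        print(y, em_lasso_scalar(y), soft_threshold(y))
```

When $|y|$ is below the threshold, the iterates shrink geometrically toward zero without reaching it exactly in finite steps, which is one reason a numerical tolerance is needed when declaring a coefficient to be zero.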
For classification, on the other hand, let $X_\star$ be the matrix with rows $x_i^\star = y_i x_i$. The kernel of the conditionally normal likelihood then becomes
$$(X_\star\beta - \mu_z\mathbf{1} - \kappa_z\omega^{-1})^T\, \Omega\, (X_\star\beta - \mu_z\mathbf{1} - \kappa_z\omega^{-1}).$$
Hence it is as if we observe the $n$-dimensional working response $\mu_z\mathbf{1} + \kappa_z\omega^{-1}$ in a regression model having design matrix $X_\star$, and we proceed by a similar argument to arrive at the result.

B.3 Theorem 3

Our extension of Masreliez's theorem to variance-mean mixtures follows a similar path as Theorem 1. From before, since $\phi$ is a normal kernel,
$$\frac{\partial \phi(\beta_j \mid \mu_\beta + \kappa_\beta/\lambda_j, \tau^2/\lambda_j)}{\partial \beta_j} = -\left(\frac{\beta_j - \mu_\beta - \kappa_\beta/\lambda_j}{\tau^2/\lambda_j}\right) \phi(\beta_j \mid \mu_\beta + \kappa_\beta/\lambda_j, \tau^2/\lambda_j).$$
Differentiating under the integral sign and applying this result, we have that
$$\frac{1}{\tau^2}\lambda_j\beta_j\, \phi(\beta_j \mid \mu_\beta + \kappa_\beta/\lambda_j, \tau^2/\lambda_j) = \left(\frac{\mu_\beta + \kappa_\beta/\lambda_j}{\tau^2/\lambda_j}\right) \phi(\beta_j \mid \mu_\beta + \kappa_\beta/\lambda_j, \tau^2/\lambda_j) - \frac{\partial \phi(\beta_j \mid \mu_\beta + \kappa_\beta/\lambda_j, \tau^2/\lambda_j)}{\partial \beta_j}.$$
The rest of the argument follows the standard Masreliez approach.

References

D. Andrews and C. Mallows. Scale mixtures of normal distributions. Journal of the Royal Statistical Society, Series B, 36:99-102, 1974.
A. Armagan, D. Dunson, and J. Lee. Generalized double Pareto shrinkage. Technical report, Duke University Department of Statistical Science, 2012a.
A. Armagan, D. B. Dunson, J. Lee, and W. Bajwa. Posterior consistency in linear models under shrinkage priors. Technical report, Duke University Department of Statistical Science, 2012b.
K. Bae and B. Mallick. Gene selection using a two-level hierarchical Bayesian model. Bioinformatics, 20(18):3423-30, 2004.
O. E. Barndorff-Nielsen and N. Shephard. Non-Gaussian Ornstein-Uhlenbeck-based models and some of their uses in financial economics. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2), 2001.
O. E. Barndorff-Nielsen, J. Kent, and M. Sorensen. Normal variance-mean mixtures and z distributions. International Statistical Review, 50:145-59, 1982.
L. Brown. Admissible estimators, recurrent diffusions and insoluble boundary problems. The Annals of Mathematical Statistics, 42:855-903, 1971.
F. Caron and A. Doucet.
Sparse Bayesian nonparametric regression. In Proceedings of the 25th International Conference on Machine Learning. Association for Computing Machinery, 2008.
C. M. Carvalho, N. G. Polson, and J. G. Scott. The horseshoe estimator for sparse signals. Biometrika, 97(2):465-80, 2010.
A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society (Series B), 39(1):1-38, 1977.
B. Efron. Empirical Bayes estimates for large-scale prediction problems. Journal of the American Statistical Association, 104(487):1015-28, 2009.
B. Efron. Tweedie's formula and selection bias. Journal of the American Statistical Association, 106(496), 2011.
J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348-60, 2001.
M. Figueiredo. Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(9):1150-9, 2003.
A. Gelman. Prior distributions for variance parameters in hierarchical models. Bayesian Anal., 1(3), 2006.
A. Gelman, A. Jakulin, M. Pittau, and Y. Su. A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 2(4):1360-83, 2008a.
A. Gelman, D. A. van Dyk, Z. Huang, and W. J. Boscardin. Using redundant parameterizations to fit hierarchical models. Journal of Computational and Graphical Statistics, 17(1):95-122, 2008b.
A. Genkin, D. Lewis, and D. Madigan. High-dimensional generalized linear models and the lasso. Technometrics, 49:291-304, 2007.
R. B. Gramacy and N. G. Polson. Simulation-based regularized logistic regression. Bayesian Analysis, 2012. arxiv.org/abs/ (to appear).
P. J. Green. Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives. Journal of the Royal Statistical Society (Series B), 46(2):149-92, 1984.
J. Griffin and P. Brown. Inference with normal-gamma prior distributions in regression problems. Bayesian Analysis, 5(1):171-88, 2010.
J. Griffin and P. Brown. Alternative prior distributions for variable selection with very many more variables than observations. Australian and New Zealand Journal of Statistics, 2012. (to appear).
C. M. Hans. Bayesian lasso regression. Biometrika, 96(4):835-45, 2009.
C. Holmes and L. Held.
Bayesian auxiliary variable models for binary and multinomial regression. Bayesian Analysis, 1(1):145-68, 2006.
J. Huang, J. Horowitz, and S. Ma. Asymptotic properties of bridge estimators in sparse high-dimensional regression models. The Annals of Statistics, 36(2), 2008.
D. R. Hunter and K. Lange. Quantile regression via an MM algorithm. Journal of Computational and Graphical Statistics, 9(1):60-77, 2000.
K. Knight and W. Fu. Asymptotics for lasso-type estimators. The Annals of Statistics, 28(5), 2000.
R. Koenker. Quantile Regression. Cambridge University Press, New York, USA, 2005.
R. Koenker. quantreg: Quantile Regression, 2011. R package.
R. Koenker and J. Machado. Goodness of fit and related inference processes for quantile regression. Journal of the American Statistical Association, 94(448), 1999.
Q. Li, R. Xi, and N. Lin. Bayesian regularized quantile regression. Bayesian Analysis, 5(3):533-56, 2010.
C. Liu. Missing data imputation using the multivariate t distribution. Journal of Multivariate Analysis, 53(1):139-58, 1995.
C. Liu. Robit regression: a simple robust alternative to logistic and probit regression. In A. Gelman and X. L. Meng, editors, Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives: An Essential Journey with Donald Rubin's Statistical Family. John Wiley & Sons, 2005.
J. S. Liu and Y. N. Wu. Parameter expansion for data augmentation. Journal of the American Statistical Association, 94(448), 1999.
B. K. Mallick, D. Ghosh, and M. Ghosh. Bayesian classification of tumours by using gene expression data. Journal of the Royal Statistical Society (Series B), 67(2):219-34, 2005.
C. Masreliez. Approximate non-Gaussian filtering with linear state and observation relations. IEEE Trans. Autom. Control, 20(1):107-10, 1975.
R. Mazumder, J. Friedman, and T. Hastie. Sparsenet: coordinate descent with non-convex penalties. Journal of the American Statistical Association, 106(495), 2011.
L. Meier, S. van de Geer, and P. Bühlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society (Series B), 70(1):53-71, 2008.
X. L. Meng and D. van Dyk. The EM algorithm: an old folk-song sung to a fast new tune. Journal of the Royal Statistical Society (Series B), 59(3):511-67, 1997.
T. Park and G. Casella. The Bayesian lasso. Journal of the American Statistical Association, 103(482):681-6, 2008.
L. R. Pericchi and A. Smith. Exact and approximate posterior moments for a normal location parameter. Journal of the Royal Statistical Society (Series B), 54(3):793-804, 1992.
N. G. Polson and J. G. Scott. Shrink globally, act locally: sparse Bayesian regularization and prediction. In J. Bernardo, M. Bayarri, J. O. Berger, A. Dawid, D. Heckerman, A. Smith, and M. West, editors, Proceedings of the 9th Valencia World Meeting on Bayesian Statistics. Oxford University Press, 2011a.
N. G. Polson and J. G. Scott. The Bayesian bridge. Technical report, University of Texas at Austin, 2011b.
N. G. Polson and J. G. Scott. Local shrinkage rules, Lévy processes, and regularized regression.
Journal of the Royal Statistical Society (Series B), 2012a. (to appear).
N. G. Polson and J. G. Scott. Good, great, or lucky? Screening for firms with sustained superior performance using heavy-tailed priors. The Annals of Applied Statistics, 2012b. (to appear).
N. G. Polson and S. Scott. Data augmentation for support vector machines (with discussion). Bayesian Analysis, 6(1):1-24, 2011c.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society (Series B), 58(1):267-88, 1996.
M. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211-44, 2001.
D. van Dyk and X. L. Meng. Cross-fertilizing strategies for better EM mountain climbing and DA field exploration: A graphical guide book. Statistical Science, 25(4):429-49, 2010.
D. van Dyk and T. Park. Partially collapsed Gibbs samplers: theory and methods. Journal of American Statistical Association, 103(482):790-6, 2008.
M. West. On scale mixtures of normal distributions. Biometrika, 74(3):646-8, 1987.
H. Zou and R. Li. One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36(4):1509-33, 2008.
More informationMH I. Metropolis-Hastings (MH) algorithm is the most popular method of getting dependent samples from a probability distribution
MH I Metropolis-Hastings (MH) algorithm is the most popular method of getting dependent samples from a probability distribution a lot of Bayesian mehods rely on the use of MH algorithm and it s famous
More informationContents. Part I: Fundamentals of Bayesian Inference 1
Contents Preface xiii Part I: Fundamentals of Bayesian Inference 1 1 Probability and inference 3 1.1 The three steps of Bayesian data analysis 3 1.2 General notation for statistical inference 4 1.3 Bayesian
More informationProbabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms
Probabilistic Low-Rank Matrix Completion with Adaptive Spectral Regularization Algorithms François Caron Department of Statistics, Oxford STATLEARN 2014, Paris April 7, 2014 Joint work with Adrien Todeschini,
More informationEM Algorithm II. September 11, 2018
EM Algorithm II September 11, 2018 Review EM 1/27 (Y obs, Y mis ) f (y obs, y mis θ), we observe Y obs but not Y mis Complete-data log likelihood: l C (θ Y obs, Y mis ) = log { f (Y obs, Y mis θ) Observed-data
More informationProbabilistic machine learning group, Aalto University Bayesian theory and methods, approximative integration, model
Aki Vehtari, Aalto University, Finland Probabilistic machine learning group, Aalto University http://research.cs.aalto.fi/pml/ Bayesian theory and methods, approximative integration, model assessment and
More informationPrinciples of Bayesian Inference
Principles of Bayesian Inference Sudipto Banerjee University of Minnesota July 20th, 2008 1 Bayesian Principles Classical statistics: model parameters are fixed and unknown. A Bayesian thinks of parameters
More informationSparse Bayesian Logistic Regression with Hierarchical Prior and Variational Inference
Sparse Bayesian Logistic Regression with Hierarchical Prior and Variational Inference Shunsuke Horii Waseda University s.horii@aoni.waseda.jp Abstract In this paper, we present a hierarchical model which
More informationLecture 3. Linear Regression II Bastian Leibe RWTH Aachen
Advanced Machine Learning Lecture 3 Linear Regression II 02.11.2015 Bastian Leibe RWTH Aachen http://www.vision.rwth-aachen.de/ leibe@vision.rwth-aachen.de This Lecture: Advanced Machine Learning Regression
More informationStability and the elastic net
Stability and the elastic net Patrick Breheny March 28 Patrick Breheny High-Dimensional Data Analysis (BIOS 7600) 1/32 Introduction Elastic Net Our last several lectures have concentrated on methods for
More informationAn Algorithm for Bayesian Variable Selection in High-dimensional Generalized Linear Models
Proceedings 59th ISI World Statistics Congress, 25-30 August 2013, Hong Kong (Session CPS023) p.3938 An Algorithm for Bayesian Variable Selection in High-dimensional Generalized Linear Models Vitara Pungpapong
More informationKneib, Fahrmeir: Supplement to "Structured additive regression for categorical space-time data: A mixed model approach"
Kneib, Fahrmeir: Supplement to "Structured additive regression for categorical space-time data: A mixed model approach" Sonderforschungsbereich 386, Paper 43 (25) Online unter: http://epub.ub.uni-muenchen.de/
More informationA Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression
A Blockwise Descent Algorithm for Group-penalized Multiresponse and Multinomial Regression Noah Simon Jerome Friedman Trevor Hastie November 5, 013 Abstract In this paper we purpose a blockwise descent
More informationFactor Analysis (10/2/13)
STA561: Probabilistic machine learning Factor Analysis (10/2/13) Lecturer: Barbara Engelhardt Scribes: Li Zhu, Fan Li, Ni Guan Factor Analysis Factor analysis is related to the mixture models we have studied.
More informationCoordinate descent. Geoff Gordon & Ryan Tibshirani Optimization /
Coordinate descent Geoff Gordon & Ryan Tibshirani Optimization 10-725 / 36-725 1 Adding to the toolbox, with stats and ML in mind We ve seen several general and useful minimization tools First-order methods
More informationPerformance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project
Performance Comparison of K-Means and Expectation Maximization with Gaussian Mixture Models for Clustering EE6540 Final Project Devin Cornell & Sushruth Sastry May 2015 1 Abstract In this article, we explore
More informationIntroduction to Machine Learning Midterm, Tues April 8
Introduction to Machine Learning 10-701 Midterm, Tues April 8 [1 point] Name: Andrew ID: Instructions: You are allowed a (two-sided) sheet of notes. Exam ends at 2:45pm Take a deep breath and don t spend
More informationDiscrete Mathematics and Probability Theory Fall 2015 Lecture 21
CS 70 Discrete Mathematics and Probability Theory Fall 205 Lecture 2 Inference In this note we revisit the problem of inference: Given some data or observations from the world, what can we infer about
More informationThe Bayesian Approach to Multi-equation Econometric Model Estimation
Journal of Statistical and Econometric Methods, vol.3, no.1, 2014, 85-96 ISSN: 2241-0384 (print), 2241-0376 (online) Scienpress Ltd, 2014 The Bayesian Approach to Multi-equation Econometric Model Estimation
More informationMachine Learning Techniques for Computer Vision
Machine Learning Techniques for Computer Vision Part 2: Unsupervised Learning Microsoft Research Cambridge x 3 1 0.5 0.2 0 0.5 0.3 0 0.5 1 ECCV 2004, Prague x 2 x 1 Overview of Part 2 Mixture models EM
More informationBAGUS: Bayesian Regularization for Graphical Models with Unequal Shrinkage
BAGUS: Bayesian Regularization for Graphical Models with Unequal Shrinkage Lingrui Gan, Naveen N. Narisetty, Feng Liang Department of Statistics University of Illinois at Urbana-Champaign Problem Statement
More informationSparse Bayesian Nonparametric Regression
François Caron caronfr@cs.ubc.ca Arnaud Doucet arnaud@cs.ubc.ca Departments of Computer Science and Statistics, University of British Columbia, Vancouver, Canada Abstract One of the most common problems
More informationBayesian Modeling of Conditional Distributions
Bayesian Modeling of Conditional Distributions John Geweke University of Iowa Indiana University Department of Economics February 27, 2007 Outline Motivation Model description Methods of inference Earnings
More informationWeakly informative priors
Department of Statistics and Department of Political Science Columbia University 23 Apr 2014 Collaborators (in order of appearance): Gary King, Frederic Bois, Aleks Jakulin, Vince Dorie, Sophia Rabe-Hesketh,
More informationWEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION. Abstract
Journal of Data Science,17(1). P. 145-160,2019 DOI:10.6339/JDS.201901_17(1).0007 WEIGHTED QUANTILE REGRESSION THEORY AND ITS APPLICATION Wei Xiong *, Maozai Tian 2 1 School of Statistics, University of
More informationLearning discrete graphical models via generalized inverse covariance matrices
Learning discrete graphical models via generalized inverse covariance matrices Duzhe Wang, Yiming Lv, Yongjoon Kim, Young Lee Department of Statistics University of Wisconsin-Madison {dwang282, lv23, ykim676,
More informationSTA 216, GLM, Lecture 16. October 29, 2007
STA 216, GLM, Lecture 16 October 29, 2007 Efficient Posterior Computation in Factor Models Underlying Normal Models Generalized Latent Trait Models Formulation Genetic Epidemiology Illustration Structural
More informationarxiv: v1 [stat.me] 6 Jul 2017
Sparsity information and regularization in the horseshoe and other shrinkage priors arxiv:77.694v [stat.me] 6 Jul 7 Juho Piironen and Aki Vehtari Helsinki Institute for Information Technology, HIIT Department
More informationKazuhiko Kakamu Department of Economics Finance, Institute for Advanced Studies. Abstract
Bayesian Estimation of A Distance Functional Weight Matrix Model Kazuhiko Kakamu Department of Economics Finance, Institute for Advanced Studies Abstract This paper considers the distance functional weight
More informationDesign of Text Mining Experiments. Matt Taddy, University of Chicago Booth School of Business faculty.chicagobooth.edu/matt.
Design of Text Mining Experiments Matt Taddy, University of Chicago Booth School of Business faculty.chicagobooth.edu/matt.taddy/research Active Learning: a flavor of design of experiments Optimal : consider
More informationBayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units
Bayesian nonparametric estimation of finite population quantities in absence of design information on nonsampled units Sahar Z Zangeneh Robert W. Keener Roderick J.A. Little Abstract In Probability proportional
More informationBayesian Linear Regression
Bayesian Linear Regression Sudipto Banerjee 1 Biostatistics, School of Public Health, University of Minnesota, Minneapolis, Minnesota, U.S.A. September 15, 2010 1 Linear regression models: a Bayesian perspective
More informationLecture 2 Part 1 Optimization
Lecture 2 Part 1 Optimization (January 16, 2015) Mu Zhu University of Waterloo Need for Optimization E(y x), P(y x) want to go after them first, model some examples last week then, estimate didn t discuss
More informationCOMS 4771 Regression. Nakul Verma
COMS 4771 Regression Nakul Verma Last time Support Vector Machines Maximum Margin formulation Constrained Optimization Lagrange Duality Theory Convex Optimization SVM dual and Interpretation How get the
More informationLecture 13 : Variational Inference: Mean Field Approximation
10-708: Probabilistic Graphical Models 10-708, Spring 2017 Lecture 13 : Variational Inference: Mean Field Approximation Lecturer: Willie Neiswanger Scribes: Xupeng Tong, Minxing Liu 1 Problem Setup 1.1
More informationMCMC algorithms for fitting Bayesian models
MCMC algorithms for fitting Bayesian models p. 1/1 MCMC algorithms for fitting Bayesian models Sudipto Banerjee sudiptob@biostat.umn.edu University of Minnesota MCMC algorithms for fitting Bayesian models
More informationPENALIZING YOUR MODELS
PENALIZING YOUR MODELS AN OVERVIEW OF THE GENERALIZED REGRESSION PLATFORM Michael Crotty & Clay Barker Research Statisticians JMP Division, SAS Institute Copyr i g ht 2012, SAS Ins titut e Inc. All rights
More information