arxiv: v3 [stat.me] 26 Feb 2012


Data augmentation for non-Gaussian regression models using variance-mean mixtures

Nicholas G. Polson
Booth School of Business, University of Chicago

James G. Scott
University of Texas at Austin

February 28, 2012

Abstract

We use the theory of normal variance-mean mixtures to derive a data-augmentation scheme that unifies a wide class of statistical models under a single framework. This generalizes existing theory on normal variance mixtures for priors in regression and classification. It also allows variants of the expectation-maximization algorithm to be brought to bear on a much wider range of models than previously appreciated. We demonstrate the resulting gains in accuracy and stability on several examples, including sparse quantile regression and binary logistic regression.

Key words: Data augmentation; Hierarchical model; Sparsity; Variance-mean mixture of normals

1 Introduction

1.1 Regularized regression and classification

Many problems in regularized estimation involve an objective function of the form

$$Q(\beta) = \sum_{i=1}^n f(y_i, x_i^T\beta \mid \sigma) + \sum_{j=1}^p g(\beta_j \mid \tau). \qquad (1)$$

Here $y_i$ is a response, which may be continuous, binary, or multinomial; $x_i$ is a known p-vector of predictors; $\beta = (\beta_1, \ldots, \beta_p)$ is an unknown vector of coefficients; f and g are the negative log likelihood and penalty function, respectively; and $\sigma$ and $\tau$ are scale parameters, for now assumed fixed. Following the literature on penalized likelihood, where the log prior is interpreted as a penalty function, we may phrase the problem as one of minimizing $Q(\beta)$, or equivalently maximizing the unnormalized posterior density $\exp\{-Q(\beta)\}$.

In this paper, we unify many seemingly disparate problems of this form into a single class of normal variance-mean mixtures. There are two main practical results of this

unification. First, it allows us to exploit the probabilistic latent-variable structure of $\exp\{-Q(\beta)\}$. This leads to expectation-maximization algorithms (Dempster et al., 1977), and variants thereof, that avoid the need for analytic approximations, numerical derivatives, or other black-box optimization routines. The result is a unified, accurate, stable approach to estimation in many non-Gaussian problems, including quantile regression, logistic regression, support-vector machines, and robust regression. In the case of logistic regression, the gains in accuracy and stability hold even for maximum-likelihood estimation.

We emphasize that the crucial step here is probabilistic, rather than algorithmic, in nature. Upon recognizing a certain likelihood as a variance-mean mixture, the algorithm comes essentially for free by applying Theorem 1, the paper's main result. This theorem describes a simple relationship between the derivatives of f and g and the conditional sufficient statistics for the latent variables that arise in our expectation-maximization algorithm. The expected values of these conditional sufficient statistics can usually be calculated in closed form, even if the full conditional distribution of the latent variables is unknown or intractable. Theorem 3 provides the corresponding result for the posterior mean estimator, generalizing the results of Masreliez (1975) and Pericchi and Smith (1992), long recognized for their importance in robust Bayesian inference.

A second major advantage of our approach is its wide scope of potential use. For example, by representing both likelihoods and priors as mixtures, our method allows penalized-likelihood methods to be applied within hierarchical non-Gaussian models, where information is pooled across batches of related coefficients, with no essential modification of the approach.
Moreover, our data-augmentation scheme can be woven together seamlessly with other latent-variable methods, such as those for discrete mixtures and missing-data problems. It also suggests an approach for deriving more sophisticated Markov-chain sampling methods for full Bayesian inference.

1.2 Relationship with previous work

Our work is motivated by recent Bayesian research on sparsity-inducing priors in linear regression, where $f = \|y - X\beta\|^2$ is the negative Gaussian log likelihood, and g corresponds to a normal variance-mixture prior (Andrews and Mallows, 1974). Examples of work in this area include the lasso (Tibshirani, 1996; Park and Casella, 2008; Hans, 2009); the bridge estimator (West, 1987; Knight and Fu, 2000; Huang et al., 2008; Polson and Scott, 2011b); the relevance vector machine of Tipping (2001); the normal/Jeffreys model of Figueiredo (2003) and Bae and Mallick (2004); the normal/exponential-gamma model of Griffin and Brown (2012); the normal/gamma and normal/inverse-Gaussian models (Caron and Doucet, 2008; Griffin and Brown, 2010); the horseshoe prior of Carvalho et al. (2010); the hypergeometric inverted beta model of Polson and Scott (2012b); and the double-Pareto model of Armagan et al. (2012a). A related line of work concerns the use of iterative convex relaxation in nonconcave penalized-likelihood problems (e.g. Zou and Li, 2008). Polson and Scott (2011a) give a review of this extensive literature, and the connections between Bayesian shrinkage estimation and penalized likelihood.

We generalize this work by representing both the likelihood and prior as variance-mean mixtures of Gaussians. This data-augmentation approach relies upon the following decomposition:

$$p(\beta \mid \tau, \sigma, y) \propto e^{-Q(\beta)} \propto \exp\Big\{-\sum_{i=1}^n f(y_i, x_i^T\beta \mid \sigma) - \sum_{j=1}^p g(\beta_j \mid \tau)\Big\} \propto \Big\{\prod_{i=1}^n p(z_i \mid \beta, \sigma)\Big\}\Big\{\prod_{j=1}^p p(\beta_j \mid \tau)\Big\} = p(z \mid \beta, \sigma)\, p(\beta \mid \tau),$$

where the working response $z_i$ is equal to $y_i - x_i^T\beta$ for regression, or $y_i x_i^T\beta$ for binary classification, with $y_i$ coded as ±1. The case of a multinomial response requires only a small modification, detailed in the appendix. Both $\sigma$ and $\tau$ are hyperparameters; they are typically estimated jointly with $\beta$, although they may also be specified by the user or chosen by cross-validation. In some cases, most notably in logistic regression, the likelihood is free of hyperparameters, in which case $\sigma$ does not appear in the model.

One thing we do not do in this paper is to study the formal statistical properties of the resulting estimators, such as consistency as p and n both diverge. For this we refer the reader to Griffin and Brown (2012) and Armagan et al. (2012b), who discuss this issue from both a Bayesian and classical perspective. Rather, we provide a representation theorem that makes these estimators both easier to compute and more widely applicable to hierarchical, non-Gaussian models.

Table 1: Variance-mean mixture representations for many common loss functions. Recall that $z_i = y_i - x_i^T\beta$ for regression, or $z_i = y_i x_i^T\beta$ for binary classification.

Error/loss function        $f(z_i \mid \beta, \sigma)$    $\kappa_z$    $\mu_z$    $p(\omega_i)$
Squared-error              $z_i^2/\sigma^2$               0             0          $\omega_i = 1$
Absolute-error             $|z_i/\sigma|$                 0             0          Exponential
Check loss                 $|z_i| + (2q-1)z_i$            $1 - 2q$      0          Generalized inverse Gaussian
Support vector machines    $\max(1 - z_i, 0)$             1             1          Generalized inverse Gaussian
Logistic                   $\log(1 + e^{-z_i})$           1/2           0          Polya

2 Normal variance-mean mixtures

There are two key steps in our approach: first, we use variance-mean mixtures, rather than just variance mixtures; and second, we interweave two different mixture representations, one for the likelihood and one for the prior.
The introduction of latent variables $\{\omega_i\}$ and $\{\lambda_j\}$ in Equations (2) and (3), below, reduces $\exp\{-Q(\beta)\}$ to a Gaussian linear model with heteroscedastic errors:

$$p(z_i \mid \beta, \sigma) = \int_0^\infty \phi(z_i \mid \mu_z + \kappa_z\omega_i^{-1}, \sigma^2\omega_i^{-1})\, dP(\omega_i) \qquad (2)$$

$$p(\beta_j \mid \tau) = \int_0^\infty \phi(\beta_j \mid \mu_\beta + \kappa_\beta\lambda_j^{-1}, \tau^2\lambda_j^{-1})\, dP(\lambda_j), \qquad (3)$$

where $\phi(a \mid m, v)$ is the normal density, evaluated at a, for mean m and variance v. By marginalizing over these latent variables with respect to different fixed combinations of $(\mu_z, \kappa_z, \mu_\beta, \kappa_\beta)$ and the mixing measures $P(\lambda_j)$ and $P(\omega_i)$, it is possible to generate many commonly used objective functions that have not been widely recognized as Gaussian mixtures. Table 1 lists several common likelihoods in this class, along with the corresponding fixed choices for $(\kappa_z, \mu_z)$ and the mixing distribution $P(\omega_i)$. A discussion of priors and penalty functions that fall within this class can be found in Polson and Scott (2012a).

An important feature of our approach is that we avoid dealing directly with conditional distributions for these latent variables. To find the posterior mode, it is sufficient to use Theorem 1 to calculate moments of these distributions, exploiting known facts about Gaussian mixtures. These moments in turn depend only upon the derivatives of f and g, along with the hyperparameters.

We focus on two choices of the mixing measure: the generalized inverse-Gaussian distribution; and the Polya distribution, which is essentially an infinite sum of exponential random variables (Barndorff-Nielsen et al., 1982). These two choices lead to the hyperbolic and Z distributions, respectively, for the resulting variance-mean mixture. The two key integral identities are

$$\frac{\alpha^2 - \kappa^2}{2\alpha}\, e^{-\alpha|\theta - \mu| + \kappa(\theta - \mu)} = \int_0^\infty \phi(\theta \mid \mu + \kappa v, v)\, p_G\big(v \mid 1, 0, \sqrt{\alpha^2 - \kappa^2}\big)\, dv$$

$$\frac{1}{B(\alpha, \kappa)}\, \frac{e^{\alpha(\theta - \mu)}}{(1 + e^{\theta - \mu})^{2(\alpha - \kappa)}} = \int_0^\infty \phi(\theta \mid \mu + \kappa v, v)\, p_P(v \mid \alpha, \alpha - 2\kappa)\, dv,$$

where $p_G$ and $p_P$ are the density functions of the generalized inverse-Gaussian and Polya distributions, respectively.

We use $\theta$ to denote a dummy argument that could involve either data or parameters, and v to denote a latent variance; all other terms are hyperparameters specified by the user for the purposes of representing a particular density or function. These two expressions lead, by a simple application of the Fatou-Lebesgue theorem, to three further identities for the improper limiting cases of the two densities above:

$$a^{-1} \exp\{-2c^{-1}\max(a\theta, 0)\} = \int_0^\infty \phi(\theta \mid -av, cv)\, dv$$

$$c^{-1} \exp\{-2c^{-1}\rho_q(\theta)\} = \int_0^\infty \phi(\theta \mid -(2q - 1)v, cv)\, e^{-2q(1-q)v}\, dv$$

$$(1 + \exp\{\theta - \mu\})^{-1} = \int_0^\infty \phi(\theta \mid \mu - (1/2)v, v)\, p_P(v \mid 0, 1)\, dv,$$

where $\rho_q(\theta) = \frac{1}{2}|\theta| + (q - \frac{1}{2})\theta$ is the check-loss function (Koenker, 2005; Li et al., 2010).
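The first of the three limiting identities above can be checked numerically. The following is a minimal sketch, not part of the original derivation: it assumes a > 0 and c > 0, integrates over v on a log-spaced grid with a hand-written trapezoid rule, and uses hypothetical function names (`normal_pdf`, `svm_mixture_rhs`, `svm_closed_form`) chosen only for illustration.

```python
import numpy as np

def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return np.exp(-(x - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def svm_mixture_rhs(theta, a, c, n_grid=20001):
    """Right-hand side: integral of phi(theta | -a v, c v) over v in (0, inf).

    Uses the substitution v = exp(u), so that dv = v du, and a trapezoid
    rule on an evenly spaced grid in u. The grid limits are an assumption
    that suits moderate values of a, c and a nonzero theta.
    """
    u = np.linspace(-40.0, 12.0, n_grid)
    v = np.exp(u)
    integrand = normal_pdf(theta, -a * v, c * v) * v
    return float(np.sum((integrand[1:] + integrand[:-1]) * np.diff(u)) / 2.0)

def svm_closed_form(theta, a, c):
    """Left-hand side: a^{-1} exp{-2 c^{-1} max(a theta, 0)}."""
    return float(np.exp(-2.0 * max(a * theta, 0.0) / c) / a)

if __name__ == "__main__":
    for theta, a, c in [(0.7, 1.5, 2.0), (-0.7, 1.5, 2.0)]:
        print(theta, svm_mixture_rhs(theta, a, c), svm_closed_form(theta, a, c))
```

The log-spaced grid is a design choice: the integrand varies over many orders of magnitude in v, and an even grid in log v resolves both the region near zero and the exponential tail.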

Of the three limiting identities above, the first leads to the likelihood for support-vector machines; the second, to quantile and lasso regression; and the third, to logistic and multinomial logistic regression. The function $p_P(v \mid 0, 1)$ is an improper density corresponding to a sum of exponential random variables. Thus by using either a generalized inverse-Gaussian or a Polya mixing distribution, one can generate objective functions corresponding to the lasso estimator, support-vector machines, the check-loss function, and the binary and multinomial logistic-regression models. The relevant distribution theory leading to these integral identities can be found in Appendix A.

Previous studies (e.g. Polson and Scott, 2011c; Gramacy and Polson, 2012) have presented similar results for specific models, including support-vector machines and the so-called powered-logit likelihood. But as far as we are aware, ours is the first characterization of the full class.

3 An expectation-maximization algorithm

3.1 Overview of approach

We now show that this conditionally Gaussian representation results in a simple expectation-maximization algorithm for any model in the proposed class, given our data-augmentation scheme. In the expectation step, one computes the expected value of the log posterior, given hyperparameters and the current estimate $\beta^{(g)}$ at step g of the algorithm:

$$C(\beta \mid \beta^{(g)}) = \int \log p(\beta \mid \omega, \lambda, \tau, \sigma, z)\, p(\omega, \lambda \mid \beta^{(g)}, \tau, z)\, d\omega\, d\lambda.$$

Then in the maximization step, one maximizes the complete-data posterior as a function of $\beta$:

$$\beta^{(g+1)} = \arg\max_\beta C(\beta \mid \beta^{(g)}).$$

The advantages of the expectation-maximization algorithm are its lack of user-specified tuning constants, and its well-known theoretical properties: the sequence of estimated parameter values $\{\beta^{(1)}, \beta^{(2)}, \ldots\}$ monotonically increases the observed-data log posterior density, and will converge to a global maximum if this function is concave.

The complete-data log posterior can be represented in two different ways.
We appeal to both of these representations in deriving the expectation and maximization steps. Under a normal variance-mean mixture of the form in (2)-(3),

$$\log p(\beta \mid \omega, \lambda, \tau, \sigma, z) = c_0(\omega, \lambda, \tau, \sigma, z) - \frac{1}{2\sigma^2}\sum_{i=1}^n \omega_i\big(z_i - \mu_z - \kappa_z\omega_i^{-1}\big)^2 - \frac{1}{2\tau^2}\sum_{j=1}^p \lambda_j\big(\beta_j - \mu_\beta - \kappa_\beta\lambda_j^{-1}\big)^2 \qquad (4)$$

for some constant $c_0$, recalling that $z_i = y_i - x_i^T\beta$ for regression or $z_i = y_i x_i^T\beta$ for classification. Factorizing this further as a function of $\beta$ yields that

$$\log p(\beta \mid \omega, \lambda, \tau, \sigma, z) = c_1(\omega, \lambda, \tau, \sigma, z) - \frac{1}{2\sigma^2}\sum_{i=1}^n \omega_i(z_i - \mu_z)^2 + \frac{\kappa_z}{\sigma^2}\sum_{i=1}^n (z_i - \mu_z) - \frac{1}{2\tau^2}\sum_{j=1}^p \lambda_j(\beta_j - \mu_\beta)^2 + \frac{\kappa_\beta}{\tau^2}\sum_{j=1}^p (\beta_j - \mu_\beta) \qquad (5)$$

for some constant $c_1$. We now explicitly derive the expectation and maximization steps, along with the necessary conditional sufficient statistics.

3.2 The expectation step

From (5), observe that the complete-data objective function is linear in both $\omega_i$ and $\lambda_j$. Therefore, in the expectation step, we calculate the complete-data log posterior by replacing $\lambda_j$ and $\omega_i$ with their conditional expectations $\hat\lambda_j^{(g)}$ and $\hat\omega_i^{(g)}$, given the data and the current $\beta^{(g)}$. The following theorem provides expressions for these conditional moments under any model where both the likelihood and prior can be represented by normal variance-mean mixtures.

Theorem 1. Suppose that the objective function $Q(\beta)$ in (1) can be represented by a hierarchical variance-mean Gaussian mixture, as in Equations (2) and (3). Then the conditional moments $\hat\lambda_j = E(\lambda_j \mid \beta, \tau, z)$ and $\hat\omega_i = E(\omega_i \mid \sigma, z_i)$ are given by the following expressions:

$$(\beta_j - \mu_\beta)\,\hat\lambda_j = \kappa_\beta + \tau^2 g'(\beta_j \mid \tau)$$

$$(z_i - \mu_z)\,\hat\omega_i = \kappa_z + \sigma^2 f'(z_i \mid \beta, \sigma),$$

where $f'$ and $g'$ are the derivatives of the negative log likelihood and negative log prior from (1), respectively.

The advantage of the theorem is that it characterizes the required moments purely in terms of the likelihood and penalty functions, which are pre-specified in most regularization problems: $f(z_i) = -\log p(z_i \mid \beta, \sigma)$ and $g(\beta_j) = -\log p(\beta_j \mid \tau)$.

One caveat is that when $\beta_j - \mu_\beta \approx 0$, the conditional moment for $\lambda_j$ in the expectation step may be numerically infinite, and care must be taken. Indeed, infinite values for $\lambda_j$ will arise naturally under certain sparsity-inducing choices of g, such as the lasso, and indicate that the algorithm has converged to a sparse solution.
One way to handle the resulting problem of numerical infinities is to start the algorithm from a value where $(\beta_j - \mu_\beta)$ has no zeros, and to remove $\beta_j$ from the model when it gets within a small numerical threshold of its mean (cf. Fan and Li, 2001). This conveys the added benefit of hastening the matrix computations in the maximization step. Although we have found this approach to work well in practice, it has the disadvantage that a variable cannot re-enter the model once it has been deleted. Therefore, if using a sparsity-inducing prior, we check for convergence once a putative zero has been found, by proposing small perturbations in each component of $\beta$ to assess whether any variables should re-enter the model. An alternate approach involves the use of restricted least-squares; for details, see Section 3.2 of Polson and Scott (2011c).

Our method, of course, does not sidestep the problem of optimization over a combinatorially large space. In particular, there is no way to guarantee convergence to a global maximum if the penalty function is concave, in which case multiple restarts from different initial values will be necessary to check for the presence of local modes.

3.3 The maximization step

Returning to (4), the maximization step involves computing the posterior mode under a heteroscedastic Gaussian error model and a conditionally Gaussian prior for $\beta$, given the latent variables $\{\omega_i\}$ and $\{\lambda_j\}$. This posterior mode is recognizable as a generalized ridge estimator.

Theorem 2. Suppose that the objective function $Q(\beta)$ in (1) can be represented by variance-mean Gaussian mixtures, as in (2)-(3). Then given estimates $\{\hat\omega_i\}$ and $\{\hat\lambda_j\}$, we have the following expressions for the conditional maximum of $\beta$, where $\omega = (\omega_1, \ldots, \omega_n)$ and $\lambda = (\lambda_1, \ldots, \lambda_p)$ are vectors, and where $\Omega = \mathrm{diag}(\omega_1, \ldots, \omega_n)$ and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$.

1. In a regression problem,

$$\hat\beta = \big(\tau^{-2}\hat\Lambda + X^T\hat\Omega X\big)^{-1}(y^\star + b^\star), \qquad y^\star = X^T\big(\hat\Omega y - \mu_z\hat\omega - \kappa_z 1\big), \qquad b^\star = \tau^{-2}\big(\mu_\beta\hat\lambda + \kappa_\beta 1\big).$$

2. In a binary classification problem where $y_i = \pm 1$ and $X_\star$ has rows $x_i^\star = y_i x_i$,

$$\hat\beta = \big(\tau^{-2}\hat\Lambda + X_\star^T\hat\Omega X_\star\big)^{-1} X_\star^T\big(\mu_z\hat\omega + \kappa_z 1\big).$$

The maximization step can be easily extended to encompass a series of conditional maximization steps: first for the regression coefficients $\beta$, and then for hyperparameters such as $\sigma$ and $\tau$. These latter steps exploit standard results on variance components in linear models; we therefore omit the details, and refer the reader to, for example, Gelman (2006).
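The regression case of Theorem 2 reduces to a single linear solve. The sketch below is illustrative only: it assumes $\sigma = 1$, as the displayed formula does, and the function name `m_step_regression` and its argument names are hypothetical rather than taken from any published implementation.

```python
import numpy as np

def m_step_regression(X, y, omega, lam, tau, mu_z=0.0, kappa_z=0.0,
                      mu_beta=0.0, kappa_beta=0.0):
    """Generalized ridge update of Theorem 2, regression case (sigma = 1).

    omega : length-n vector of current estimates omega_hat_i
    lam   : length-p vector of current estimates lambda_hat_j
    """
    n, p = X.shape
    # y_star = X^T (Omega y - mu_z omega - kappa_z 1)
    y_star = X.T @ (omega * y - mu_z * omega - kappa_z * np.ones(n))
    # b_star = tau^{-2} (mu_beta lambda + kappa_beta 1)
    b_star = (mu_beta * lam + kappa_beta * np.ones(p)) / tau ** 2
    # tau^{-2} Lambda + X^T Omega X, built without forming the n x n matrix Omega
    A = np.diag(lam) / tau ** 2 + X.T @ (omega[:, None] * X)
    return np.linalg.solve(A, y_star + b_star)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(30, 4))
    y = rng.normal(size=30)
    # With unit weights and zero location/drift terms this is ordinary ridge regression.
    print(m_step_regression(X, y, np.ones(30), np.ones(4), tau=2.0))
```

Solving the linear system directly, rather than inverting the matrix, is the usual numerically safer choice; when all $\hat\omega_i = \hat\lambda_j = 1$ and the location and drift terms vanish, the update coincides with the ordinary ridge estimator.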
4 Examples

4.1 Binary logistic regression

The simplicity and generality of our approach are best illustrated by an initial example involving a familiar likelihood. Suppose we wish to fit a penalized logistic regression, where

$$\hat\beta = \arg\min_{\beta \in R^p} \Big[\sum_{i=1}^n \log\{1 + \exp(-y_i x_i^T\beta)\} + \sum_{j=1}^p g(\beta_j \mid \tau)\Big],$$

assuming that the outcomes $y_i$ are coded as ±1, and that $\tau$ is fixed. Many factors conspire to make this a difficult problem, but we focus particularly on the non-Gaussianity of the likelihood.

For the logistic likelihood, recall that $\omega_i^{-1}$ has a Polya distribution with $\alpha = 1$, $\kappa = 1/2$. Theorem 1 gives the relevant conditional moment as

$$\hat\omega_i = \frac{1}{z_i}\Big(\frac{e^{z_i}}{1 + e^{z_i}} - \frac{1}{2}\Big).$$

Therefore, if the log prior g satisfies (3), then the following three updates, when iterated repeatedly, will generate a sequence of estimates that converges to a stationary point of $Q(\beta)$:

$$\beta^{(g+1)} = \big(\tau^{-2}\hat\Lambda^{(g)} + X_\star^T\hat\Omega^{(g)}X_\star\big)^{-1}\Big(\frac{1}{2}X_\star^T 1\Big)$$

$$\hat\omega_i^{(g+1)} = \frac{1}{z_i^{(g+1)}}\Big(\frac{e^{z_i^{(g+1)}}}{1 + e^{z_i^{(g+1)}}} - \frac{1}{2}\Big)$$

$$\hat\lambda_j^{(g+1)} = \frac{\kappa_\beta + \tau^2 g'(\beta_j^{(g+1)} \mid \tau)}{\beta_j^{(g+1)} - \mu_\beta},$$

where $z_i^{(g)} = y_i x_i^T\beta^{(g)}$; $X_\star$ is the matrix having rows $x_i^\star = y_i x_i$; 1 is a vector of ones; and where $\Omega = \mathrm{diag}(\omega_1, \ldots, \omega_n)$ and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$ are diagonal matrices. If the penalty function g is convex, this stationary point will be the global maximum (Zou and Li, 2008).

This sequence of steps resembles iteratively re-weighted least squares (Green, 1984) due to the presence of the diagonal weights matrix $\Omega$. But there are subtle and important differences, even in the unpenalized case where $\lambda_j \equiv 0$ and the solution is the maximum-likelihood estimator. In iteratively re-weighted least squares, for example, the analogous weight matrix $\Omega$ has diagonal entries $\omega_i = \mu_i(1 - \mu_i)$, where $\mu_i = 1/(1 + e^{-x_i^T\beta})$ is the estimated value of $\mathrm{pr}(y_i = 1)$ at each stage of the algorithm. These weights arise from a sequential approximation to the likelihood. In contrast, the weights $\omega_i$ in our second step above arise from an exact mixture representation of the likelihood.

Figure 1 shows a comparison of the weights for each algorithm, as a function of the linear predictor $x_i^T\beta$. The weights under iteratively re-weighted least squares decay to zero much more rapidly than in our algorithm.
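The difference in decay rates between the two weight functions is easy to tabulate. This is a small illustrative sketch, with hypothetical function names `em_weight` and `irls_weight`; it uses the algebraic identity $e^z/(1 + e^z) - 1/2 = \tanh(z/2)/2$ to evaluate the data-augmentation weight stably, and the limiting value 1/4 at $z = 0$.

```python
import numpy as np

def em_weight(z):
    """omega_hat = (1/z) * (e^z / (1 + e^z) - 1/2) = tanh(z/2) / (2 z).

    The value at z = 0 is taken as the limit 1/4.
    """
    z = np.asarray(z, dtype=float)
    small = np.abs(z) < 1e-6
    safe_z = np.where(small, 1.0, z)   # avoid dividing by zero
    return np.where(small, 0.25, np.tanh(safe_z / 2.0) / (2.0 * safe_z))

def irls_weight(eta):
    """mu * (1 - mu), with mu = 1 / (1 + exp(-eta))."""
    mu = 1.0 / (1.0 + np.exp(-np.asarray(eta, dtype=float)))
    return mu * (1.0 - mu)

if __name__ == "__main__":
    for eta in [0.0, 2.0, 5.0, 10.0]:
        print(eta, float(em_weight(eta)), float(irls_weight(eta)))
```

Both weights equal 1/4 at zero, but the first decays only like $1/(2|z|)$ while the second decays exponentially in the linear predictor.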
This rapid decay can lead to numerical difficulties when the successes and failures are nearly separable by a hyperplane in $R^p$.

To illustrate this phenomenon, we simulated a logistic-regression problem with 2 variables and 1 observations, where each coefficient $\beta_j$ and each design point $x_{ij}$ were independent draws from a standard normal distribution. We then compared our expectation-maximization algorithm to iteratively re-weighted least squares for computing the unpenalized maximum-likelihood estimate. Each algorithm was run from two different starting values: $\beta^{(0)} = (1/1, \ldots, 1/1)^T$, and $\beta^{(0)} = (1/2, \ldots, 1/2)^T$.

As Figure 2 shows, iteratively re-weighted least squares is highly sensitive to the choice of starting value. It converges when initialized at $\beta^{(0)} = (1/1, \ldots, 1/1)^T$, but not when initialized at $\beta^{(0)} = (1/2, \ldots, 1/2)^T$. Our data-augmentation algorithm is far more robust, finding the maximum easily in both cases.

[Figure 1 plot: weight function against the absolute value of the linear predictor, with one curve for iteratively re-weighted least squares and one for expectation-maximization.]

Figure 1: The weights $\omega_i$ on the diagonal weight matrix $\Omega$ that arise in iteratively re-weighted least squares for logistic regression, versus those that arise in our data-augmentation approach, as a function of the linear predictor $x_i^T\beta$.

We emphasize that this is not a problem caused by perfect separability of the successes and failures, in which case there is no unique maximum-likelihood solution. As the bottom pane of Figure 2 shows, 18 of 1 observations had fitted success probabilities between 0.05 and 0.95. Rather, the problem is that the iterative approximation can be poor in regions where the likelihood surface is nearly flat.

Next, we investigated the performance of data augmentation versus iteratively re-weighted least squares when a penalty function is applied. For this case we simulated data with a nontrivial correlation structure in the design matrix. Specifically, we defined $\Sigma = BB^T + \Psi$, where B is a 5 × 4 matrix of standard normal random entries, and $\Psi$ is a diagonal matrix with $\chi^2_1$ random entries. The rows of the design matrix X were simulated from a multivariate normal distribution with mean zero and covariance matrix $\Sigma$; the coefficients $\beta_j$ were standard normal random draws; and the size of the data set was p = 5 and n = 2.

We first used a ridge-regression penalty, where $g(\beta_j \mid \tau) = (\beta_j/\tau)^2$, leading to a convex problem. We also used the generalized double-Pareto model proposed by Armagan et al. (2012a), where

$$p(\beta_j \mid \tau) \propto \Big(1 + \frac{|\beta_j|}{a\tau}\Big)^{-(1+a)}.$$

Like the Laplace prior, this is non-differentiable at zero and is therefore sparsity-inducing, but has polynomial rather than exponential tail behavior. Armagan et al. (2012b) show that this model leads to strong consistency of the posterior in regression models with a diverging number of parameters, which can be thought of as a Bayesian analogue of the oracle property. The generalized double-Pareto model has a conditionally Gaussian representation, making Theorem 1 applicable.
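To make the use of Theorem 1 concrete for these two penalties, the sketch below evaluates $\hat\lambda_j$ from the derivative $g'$. It is a minimal illustration under stated assumptions: the double-Pareto penalty is taken as $g(\beta_j \mid \tau) = (1 + a)\log\{1 + |\beta_j|/(a\tau)\}$ up to an additive constant, $\mu_\beta = \kappa_\beta = 0$ by default, and all function names are hypothetical.

```python
import math

def lambda_hat(beta_j, tau, g_prime, mu_beta=0.0, kappa_beta=0.0):
    """Theorem 1: (beta_j - mu_beta) * lambda_hat = kappa_beta + tau^2 * g'(beta_j | tau).

    Requires beta_j != mu_beta; the moment is infinite at that point.
    """
    return (kappa_beta + tau ** 2 * g_prime(beta_j, tau)) / (beta_j - mu_beta)

def ridge_g_prime(beta_j, tau):
    """Derivative of g = (beta_j / tau)^2."""
    return 2.0 * beta_j / tau ** 2

def double_pareto_g_prime(beta_j, tau, a=2.0):
    """Derivative of g = (1 + a) * log(1 + |beta_j| / (a * tau)), for beta_j != 0."""
    return (1.0 + a) * math.copysign(1.0, beta_j) / (a * tau + abs(beta_j))

if __name__ == "__main__":
    print(lambda_hat(0.3, 1.5, ridge_g_prime))           # constant in beta_j
    print(lambda_hat(0.5, 1.0, double_pareto_g_prime))   # grows as beta_j -> 0
    print(lambda_hat(1e-4, 1.0, double_pareto_g_prime))
```

Under the ridge penalty the moment is constant, so the update is an ordinary ridge step; under the double-Pareto penalty it diverges as $\beta_j \to 0$, which is the numerical-infinity behavior associated with sparse solutions.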

[Figure 2 plots: four contour panes with axes $\beta_1$ and $\beta_2$, titled "Expectation-Maximization" and "Iteratively Re-weighted Least Squares" for each of the two initializations, above one pane showing fitted probability against observation.]

Figure 2: Convergence of expectation-maximization versus iteratively re-weighted least squares from two different starting points for a simulated logistic-regression problem. The top four panes show the contours of the likelihood as a function of $(\beta_1, \beta_2)$, with all other coefficients held at their maximum-likelihood values. In the left-hand panes, the coefficients were all initialized at 1; in the right-hand panes, they were all initialized at 5. The black lines trace out the values of $(\beta_1, \beta_2)$ as the algorithm progresses; the grey dot is the true maximum. The bottom pane shows the fitted values $\mathrm{pr}(y_i = 1)$ at this maximum.
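The three updates of Section 4.1 can be sketched in a few lines for the ridge penalty $g(\beta_j \mid \tau) = (\beta_j/\tau)^2$, for which Theorem 1 gives the constant $\hat\lambda_j = 2$ when $\mu_\beta = \kappa_\beta = 0$. This is an illustrative sketch rather than the authors' code: the function names, the default starting value, the stopping rule and the simulated data in the usage example are all assumptions.

```python
import numpy as np

def em_weight(z):
    """omega_hat = tanh(z/2) / (2 z), with the limit 1/4 at z = 0."""
    z = np.asarray(z, dtype=float)
    small = np.abs(z) < 1e-6
    safe_z = np.where(small, 1.0, z)
    return np.where(small, 0.25, np.tanh(safe_z / 2.0) / (2.0 * safe_z))

def objective(beta, Xs, tau):
    """Q(beta) = sum_i log(1 + exp(-z_i)) + sum_j (beta_j / tau)^2."""
    z = Xs @ beta
    return float(np.sum(np.logaddexp(0.0, -z)) + np.sum((beta / tau) ** 2))

def em_logistic_ridge(X, y, tau=1.0, beta0=None, max_iter=5000, tol=1e-12):
    """EM for ridge-penalized logistic regression with y coded as +1 / -1."""
    Xs = y[:, None] * X                    # rows x_i^* = y_i x_i
    n, p = Xs.shape
    beta = np.full(p, 0.1) if beta0 is None else np.asarray(beta0, dtype=float)
    lam_hat = np.full(p, 2.0)              # Theorem 1 applied to the ridge penalty
    rhs = 0.5 * (Xs.T @ np.ones(n))        # kappa_z = 1/2 and mu_z = 0
    history = [objective(beta, Xs, tau)]
    for _ in range(max_iter):
        omega_hat = em_weight(Xs @ beta)                              # expectation step
        A = np.diag(lam_hat) / tau ** 2 + Xs.T @ (omega_hat[:, None] * Xs)
        beta_new = np.linalg.solve(A, rhs)                            # maximization step
        history.append(objective(beta_new, Xs, tau))
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta, history

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    beta_true = rng.normal(size=5)
    prob = 1.0 / (1.0 + np.exp(-X @ beta_true))
    y = np.where(rng.uniform(size=200) < prob, 1.0, -1.0)
    beta_hat, history = em_logistic_ridge(X, y)
    print(beta_hat, len(history))
```

The recorded objective values give a simple diagnostic: by the monotonicity property of expectation-maximization, $Q(\beta)$ should never increase from one iteration to the next.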

[Figure 3 plots: two panes, "Solution path with double-pareto penalty" and "Solution path with ridge penalty", each showing coefficients against the log of the regularization parameter.]

Figure 3: The solution paths for $\beta$ for the double-Pareto and ridge penalties, as a function of the regularization parameter $\log(1/\tau)$, for a simulated logistic regression problem. The black lines show the solution for iteratively re-weighted least squares; the grey lines, for our expectation-maximization algorithm. Moving from right to left, the black lines stop where iteratively re-weighted least squares fails due to numerical instability.

We chose a = 2, and used each algorithm to compute a full solution path for $\beta$ as a function of the regularization parameter, here expressed as $\log(1/\tau)$ in keeping with the penalized-likelihood literature. We began by fitting the solution for $\tau = 10^{-3}$, which essentially constrained all coefficients to be zero, or very small. We then increased the value of $\tau$ along a discrete grid $\{\tau_1, \ldots, \tau_K = 1\}$, using the solution for $\tau_k$ as the starting value for the $\tau_{k+1}$ case.

As Figure 3 shows, iteratively re-weighted least squares fails when $\log(1/\tau)$ becomes too small, and the coefficient vector becomes larger in magnitude. This happens because the linear system that must be solved at each stage of the algorithm becomes numerically singular. It does so, moreover, at a point when 2 out of 2 observations still had fitted success probabilities between 0.05 and 0.95. In fact, under the double-Pareto prior, iteratively re-weighted least squares fails before all coefficients have even entered the model. No such pathology affects the expectation-maximization algorithm.

Two further comments are in order. First, while we have focused here on estimating $\beta$ using the posterior mode, the problems we have identified will also arise in any Bayesian treatment of logistic regression that is based upon an analytic approximation to the logistic likelihood (e.g. Gelman et al., 2008a).
We hope to extend these results to facilitate fully Bayesian inference.

Second, sparse logistic regression via penalized likelihood is a topic of great current interest (Genkin et al., 2007; Meier et al., 2008). This problem involves three logically distinct issues: how to handle the logistic likelihood; which penalty function to use; and how to fit the resulting model, whether by a block updating scheme or by coordinate descent. These issues interact in poorly understood ways. For example, it is widely known that coordinate-by-coordinate algorithms, including Gibbs sampling, can fare poorly in highly multimodal situations. Likewise, it is known that nonconvex penalties lead to multimodal objective functions, but also, subject to certain regularity conditions, exhibit more favorable statistical properties for estimating sparse signals (Fan and Li, 2001; Carvalho et al., 2010). Moreover, coordinate descent is tractable only if the chosen penalty leads to a univariate thresholding function whose solution is analytically available (Mazumder et al., 2011). This is a fairly narrow class, and does not include most of the penalties mentioned in the introduction.

The question of how to handle the likelihood complicates matters still further. For example, the area of detail in Figure 3 shows that, for a double-Pareto penalty, the solution paths fit by iteratively re-weighted least squares differ in subtle but noticeable ways from those fit by data augmentation. By simply checking the maximized value of the objective function under both methods, we are able to confirm that iteratively re-weighted least squares is not converging to the true optimum. Yet we do not entirely understand why, and under what circumstances, the methods will differ, and how these differences should affect recommendations about what penalty function and algorithm should be used to fit logistic regression models. A full study of these issues is beyond the scope of the current paper, but is a subject of active inquiry.

4.2 Penalized quantile regression

We now describe a data analysis whose goal is to understand the predictors of the long-term growth rate of a country's gross domestic product. Specifically, we use quantile regression to describe how the conditional quantile of a country's annualized growth rate, $y_i$, depends upon various political and socioeconomic predictors. The data set comprises 161 observations on 87 countries over two periods. It comes originally from a 1994 discussion paper by R. Barro and J.
Lee, available from the National Bureau of Economic Research; a full description can be found in Koenker and Machado (1999).

We used our data-augmentation scheme to fit penalized quantile-regression models, comparing them to the corresponding maximum-likelihood estimates. To do so, we choose $p(\omega_i)$ to be a generalized inverse-Gaussian prior of unit scale, where $(\alpha, \kappa, \mu) = (1, 1 - 2q, 0)$. This leads to $-\log p(z_i) = |z_i| + (2q - 1)z_i$, the pseudo-likelihood which yields quantile regression for the qth quantile (Koenker, 2005; Li et al., 2010). Applying Theorem 1, we get $\hat\omega_i = |y_i - x_i^T\beta^{(g)}|^{-1}$ as the expected value of the conditional sufficient statistic needed in the expectation step of our algorithm.

To illustrate our method, we applied a bridge penalty, where $g(\beta_j \mid \tau) = |\beta_j/\tau|^a$. We chose a = 1/2, estimating $\tau$ and $\beta$ jointly via an expectation-conditional-maximization algorithm. An inverse-gamma prior with shape and rate parameters equal to 1 was assumed for $\nu = \tau^a$. This leads to the following two conditional-maximization steps, given $\lambda_j$ and $\omega_i$:

$$(\hat\beta \mid \tau) = \big(\tau^{-2}\hat\Lambda + X^T\hat\Omega X\big)^{-1} X^T\big\{\hat\Omega y - (1 - 2q)1\big\}$$

$$(\hat\nu \mid \beta) = \frac{1 + \sum_{j=1}^p |\beta_j|^a}{2 + p/a},$$

allowing $\tau = \nu^{1/a}$ to be obtained easily, and avoiding the need for cross validation.

[Figure 4 plots: two sets of 12 panes, one pane per predictor: Trade Growth, Investment/GDP, Political Instability, Black Market Premium, Consumption/GDP, Education/GDP, Human Capital, Life Expectancy, Male Higher Education, Male Secondary Education, Female Higher Education, Female Secondary Education.]

Figure 4: Estimated coefficients versus conditional quantile $q \in (0, 1)$ for the Barro GDP growth data. The top 12 panes show the maximum likelihood estimates for each coefficient as a function of q, while the bottom 12 show the corresponding estimates under a bridge penalty. The shaded grey areas show estimated error bars, extending one estimated standard error above and below the line.

We used the same 12 predictors used by Koenker and Machado (1999), and compared our bridge-penalized quantile regressions to the maximum-likelihood estimates, obtainable using the R package for quantile regression (Koenker, 2011). The predictors were standardized to have zero mean and unit scale.

Figure 4 shows the results of this comparison, plotting the regression coefficients versus the conditional quantile across a discrete grid of choices for q. In each frame, the shaded grey area extends one standard error to either side of the estimate. Under the penalized-likelihood procedure, the standard error quoted here is actually the posterior standard deviation under the multivariate normal approximation at the mode: $\mathrm{var}(\beta_j \mid y) \approx \hat\tau^2/\hat\lambda_j$. Alas, this does not lead to sensible estimates of uncertainty for coefficients shrunk to zero, which is a known shortcoming of penalized likelihood. The direct comparison is with Figure 3 of Koenker and Machado (1999).

There are noticeable differences between the penalized and unpenalized fits. The bridge estimator shrinks four coefficients all the way to zero for every conditional quantile: female secondary education, female higher education, male higher education, and human capital. It shrinks four other coefficients to zero for a substantial fraction of values $q \in (0, 1)$: male secondary education, life expectancy, education spending, and political instability.

Also interesting is that the posterior standard deviations for the predictors that remain in the model are typically larger than the asymptotic standard errors for the corresponding maximum-likelihood estimates. This is counter-intuitive; one would expect that, by excluding needless predictors from the model, uncertainty about those that remain in the model would be reduced, rather than inflated.
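One sweep of the expectation-conditional-maximization scheme for bridge-penalized quantile regression in Section 4.2 can be sketched as follows. The sketch makes several assumptions that are not in the text: residuals and coefficients are clamped away from zero by a small `eps` instead of deleting variables from the model, $\hat\lambda_j = a\tau^{2-a}|\beta_j|^{a-2}$ is obtained by applying Theorem 1 to $g(\beta_j \mid \tau) = |\beta_j/\tau|^a$ with $\mu_\beta = \kappa_\beta = 0$, and the function name `ecm_sweep` is hypothetical.

```python
import numpy as np

def ecm_sweep(X, y, beta, tau, q=0.5, a=0.5, eps=1e-8):
    """One expectation step followed by the two conditional-maximization steps.

    Returns the updated coefficient vector and the updated scale tau.
    """
    # Expectation step: omega_hat_i = 1 / |y_i - x_i^T beta|
    resid = np.maximum(np.abs(y - X @ beta), eps)
    omega_hat = 1.0 / resid
    # Expectation step: lambda_hat_j from Theorem 1 applied to the bridge penalty
    abs_beta = np.maximum(np.abs(beta), eps)
    lam_hat = a * tau ** (2.0 - a) * abs_beta ** (a - 2.0)
    # Conditional maximization over beta, given tau
    A = np.diag(lam_hat) / tau ** 2 + X.T @ (omega_hat[:, None] * X)
    beta_new = np.linalg.solve(A, X.T @ (omega_hat * y - (1.0 - 2.0 * q)))
    # Conditional maximization over nu = tau^a, given beta
    nu_new = (1.0 + np.sum(np.abs(beta_new) ** a)) / (2.0 + len(beta_new) / a)
    return beta_new, nu_new ** (1.0 / a)

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    X = rng.normal(size=(50, 3))
    y = X @ np.array([1.0, 0.0, -2.0]) + rng.normal(size=50)
    beta, tau = np.ones(3), 1.0
    for _ in range(50):
        beta, tau = ecm_sweep(X, y, beta, tau, q=0.25)
    print(beta, tau)
```

The clamping constant is a numerical convenience only; the text instead removes a coefficient from the model once it is within a small threshold of its mean.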
We do not know whether this reflects the deficiency of the asymptotic approximation needed to compute standard errors under the maximum-likelihood procedure; the failure of the normal approximation at the mode; or some other unappreciated aspect of the model.

5 Discussion

Our primary goal in this paper has been to show the relevance of the conditionally Gaussian representation of (2)-(3), together with Theorem 1, for fitting a wide class of regularized estimators within a unified variance-mean mixture framework. We have therefore focused only on the most basic implementation of the expectation-maximization algorithm.

There are many further variants on the basic algorithm, however, some of which can lead to dramatic advantages in speed and stability for certain problems. Key references include Meng and van Dyk (1997) and Gelman et al. (2008b). These variants include marginal data augmentation (Liu, 1995), parameter expansion (Liu and Wu, 1999), majorization-minimization (Hunter and Lange, 2000), the partially collapsed Gibbs sampler (van Dyk and Park, 2008), and other simulation-based alternatives described by van Dyk and Meng (2010). Many of these modifications require additional analytical work for particular choices of $g$ and $f$. One example here includes the work of Liu (2005) on the robit model. We have not explored these options here, and this remains a promising area for future research.

A second important fact is that, for many purposes, such as estimation of $\beta$ under squared-error loss, the relevant quantity of interest is the posterior mean, not the mode. Indeed, both Hans (2009) and Efron (2009) argue that, for predicting future observables, the posterior mean of $\beta$ is the better choice. The following theorem represents the posterior mean for $\beta$ in terms of the score function of the predictive distribution, generalizing the results of Brown (1971), Masreliez (1975), Pericchi and Smith (1992), and Carvalho et al. (2010). There are a number of possible versions of such a theorem. Here we consider a variance-mean mixture prior $p(\beta_j)$ with a general location likelihood $p(y - \beta)$, but clearly a similar result holds the other way around. We consider the case where $X$ is an orthogonal matrix, assumed without loss of generality to be the identity matrix, in which case we apply the theorem component by component.

Theorem 3. Let $p(|y_j - \beta_j|)$ be the likelihood for a location parameter $\beta_j$, symmetric in $y_j - \beta_j$, and let

$$p(\beta_j) = \int \phi(\beta_j;\, \mu_\beta + \kappa_\beta/\lambda_j,\, \tau^2/\lambda_j)\, p(\lambda_j^{-1})\, d\lambda_j^{-1}$$

be a normal variance-mean mixture prior. Define the following distributions:

$$m(y_j) = \int p(y_j - \beta_j)\, p(\beta_j)\, d\beta_j$$

$$p^*(\lambda_j^{-1}) = \frac{\lambda_j\, p(\lambda_j^{-1})}{E(\lambda_j)}$$

$$p^*(\beta_j) = \int \phi(\beta_j;\, \mu_\beta + \kappa_\beta/\lambda_j,\, \tau^2/\lambda_j)\, p^*(\lambda_j^{-1})\, d\lambda_j^{-1}$$

$$m^*(y_j) = \int p(y_j - \beta_j)\, p^*(\beta_j)\, d\beta_j.$$

Then

$$E(\beta_j \mid y_j) = -\frac{\kappa_\beta}{\tau^2} + \left\{\frac{\mu_\beta E(\lambda_j)}{\tau^2}\right\}\left\{\frac{m^*(y_j)}{m(y_j)}\right\} + \left\{\frac{E(\lambda_j)}{\tau^2}\right\}\left\{\frac{m^*(y_j)}{m(y_j)}\right\}\left\{\frac{\partial}{\partial y_j}\log m^*(y_j)\right\}.$$

The generalization to nonorthogonal designs is straightforward, following the original Masreliez (1975) paper; see, for example, Griffin and Brown (2010), along with the discussion of the Tweedie formula by Efron (2011).
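As a numerical illustration of the score-function idea behind results of this type, the following sketch checks the classical Gaussian-likelihood special case, the Tweedie formula $E(\beta \mid y) = y + \sigma^2\, \partial \log m(y)/\partial y$, rather than Theorem 3 itself. It assumes a double-exponential prior with scale $\tau$ (a normal variance mixture with $\mu_\beta = \kappa_\beta = 0$), a simple Riemann sum on a fixed grid, and a central difference for the derivative. The function names and grid settings are illustrative choices, not part of the paper.

```python
import math

STEP = 0.01
GRID = [-20.0 + STEP * k for k in range(4001)]  # quadrature nodes for beta


def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def laplace_pdf(beta, tau):
    # Double-exponential prior p(beta) = exp(-|beta| / tau) / (2 tau).
    return math.exp(-abs(beta) / tau) / (2.0 * tau)


def marginal(y, sigma, tau):
    # m(y) = integral of N(y; beta, sigma^2) p(beta) d(beta), by a Riemann sum.
    return STEP * sum(normal_pdf(y, b, sigma) * laplace_pdf(b, tau) for b in GRID)


def posterior_mean_direct(y, sigma, tau):
    # E(beta | y) computed directly as a ratio of two integrals.
    num = STEP * sum(b * normal_pdf(y, b, sigma) * laplace_pdf(b, tau) for b in GRID)
    return num / marginal(y, sigma, tau)


def posterior_mean_tweedie(y, sigma, tau, h=1e-3):
    # E(beta | y) = y + sigma^2 * d/dy log m(y); derivative by central difference.
    score = (math.log(marginal(y + h, sigma, tau))
             - math.log(marginal(y - h, sigma, tau))) / (2.0 * h)
    return y + sigma ** 2 * score


if __name__ == "__main__":
    for y in (0.5, 2.0, -3.0):
        print(y, posterior_mean_direct(y, 1.0, 1.0), posterior_mean_tweedie(y, 1.0, 1.0))
```

The two columns printed for each $y$ should agree up to the finite-difference error, which is the sense in which the posterior mean is determined by the score of the predictive density.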
Computing the posterior mean will typically require sampling from the full joint posterior distribution over all parameters. Our data-augmentation approach can lead to Markov-chain Monte Carlo sampling schemes for just this purpose. The key step is the identification of the conditional distributions for $\lambda_j$ and $\omega_i$ under specific models; this remains an active area of research.
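To make the idea of a data-augmentation sampler concrete, here is a minimal sketch for one case where the conditional distribution of the latent variable is known in closed form. It assumes a single observation $y \sim N(\beta, \sigma^2)$ with a double-exponential prior $p(\beta) \propto \exp(-|\beta|/\tau)$, written as $\beta \mid \lambda \sim N(0, \tau^2/\lambda)$. Under that assumption the standard result (as used for the Bayesian lasso of Park and Casella, 2008) is that $\lambda \mid \beta$ is inverse Gaussian with mean $\tau/|\beta|$ and shape 1. The inverse-Gaussian sampler uses the usual transformation-with-rejection method. All function names, the floor on $|\beta|$, and the run lengths are illustrative assumptions.

```python
import math
import random


def sample_inverse_gaussian(mean, shape, rng):
    # Transformation method: draw a chi-square(1) variate, take the smaller
    # root of the associated quadratic, then accept it or return mean^2 / x.
    nu = rng.gauss(0.0, 1.0)
    y = nu * nu
    x = (mean + mean * mean * y / (2.0 * shape)
         - (mean / (2.0 * shape)) * math.sqrt(4.0 * mean * shape * y + mean * mean * y * y))
    if rng.random() <= mean / (mean + x):
        return x
    return mean * mean / x


def gibbs_posterior_mean(y, sigma, tau, n_draws=20000, burn_in=1000, seed=1):
    rng = random.Random(seed)
    beta, total = y, 0.0
    for t in range(burn_in + n_draws):
        # lambda | beta ~ inverse Gaussian(mean = tau / |beta|, shape = 1).
        # The floor on |beta| is a numerical guard, not part of the model.
        lam = sample_inverse_gaussian(tau / max(abs(beta), 1e-6), 1.0, rng)
        # beta | lambda, y: normal likelihood N(y; beta, sigma^2) combined
        # with the conditionally normal prior N(beta; 0, tau^2 / lambda).
        precision = 1.0 / sigma ** 2 + lam / tau ** 2
        beta = rng.gauss((y / sigma ** 2) / precision, math.sqrt(1.0 / precision))
        if t >= burn_in:
            total += beta
    return total / n_draws


if __name__ == "__main__":
    print(gibbs_posterior_mean(2.0, 1.0, 1.0))
```

The Monte Carlo average approximates the posterior mean that the mode-finding algorithm does not deliver; for other choices of $f$ and $g$ the conditional for the latent variable would have to be worked out separately.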

Acknowledgements

The authors wish to thank two anonymous referees, in addition to the editor and associate editor, for their many helpful comments in improving the paper.

A Appendix: distributional results

A.1 Generalized hyperbolic distributions

In all of the following cases, we assume that $(\theta \mid v) \sim N(\mu + \kappa v, v)$, and that $v \sim p(v)$. Let $p(v)$ be a generalized inverse-Gaussian distribution $G(\psi, \gamma, \delta)$, following the notation of Barndorff-Nielsen and Shephard (2001). We consider the special case of this class where $\psi = 1$ and $\delta = 0$, in which case $p(\theta)$ is a hyperbolic distribution having density

$$p(\theta \mid \mu, \alpha, \kappa) = \left(\frac{\alpha^2 - \kappa^2}{2\alpha}\right) \exp\{-\alpha|\theta - \mu| + \kappa(\theta - \mu)\}.$$

When viewed as a pseudo-likelihood or pseudo-prior, the class of generalized hyperbolic distributions will generate many common objective functions. First, choosing $(\alpha, \kappa, \mu) = (1, 0, 0)$ leads to $-\log p(\beta_j) = |\beta_j|$, and thus $\ell_1$ regularization. Second, choosing $(\alpha, \kappa, \mu) = (1, 1 - 2q, 0)$ gives $-\log p(z_i) = |z_i| + (2q - 1)z_i$. This is the check-loss function, yielding quantile regression for the $q$th quantile. Third, choosing $(\alpha, \kappa, \mu) = (1, 1, 1)$ leads to the maximum operator:

$$-(1/2)\log p(z_i) = (1/2)|1 - z_i| + (1/2)(1 - z_i) = \max(1 - z_i, 0)$$

for $z_i = y_i x_i^T\beta$. This is the objective function for support vector machines (e.g. Mallick et al., 2005; Polson and Scott, 2011c), and corresponds to the limiting case of a generalized inverse-Gaussian prior.

A.2 Z distributions

Now let $p_P(v \mid \alpha, \alpha - 2\kappa)$ be a Polya distribution, which can be represented as an infinite convolution of exponentials, and leads to a Z-distributed marginal. The important result is the following:

$$p_Z(\theta \mid \mu, \alpha, \kappa) = \frac{1}{B(\alpha, \kappa)} \frac{(e^{\theta - \mu})^{\alpha}}{(1 + e^{\theta - \mu})^{2(\alpha - \kappa)}} = \int N(\mu + \kappa v, v)\, p_P(v \mid \alpha, \alpha - 2\kappa)\, dv.$$

See Barndorff-Nielsen et al. (1982). When viewed as a likelihood in $\theta$, the class of Polya/Z distributions results in the logistic and multinomial models.
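The special cases listed in A.1 can be checked numerically. The sketch below evaluates the negative log kernel $\alpha|\theta - \mu| - \kappa(\theta - \mu)$ and compares it with the check loss and the hinge loss. It also compares the hyperbolic density with a midpoint-rule approximation of the mixture integral, under the assumption that the mixing density with $\psi = 1$, $\delta = 0$ is exponential with rate $(\alpha^2 - \kappa^2)/2$. That reading of the notation, the check-loss convention $\rho_q(z) = z\{q - I(z < 0)\}$, and all function names are assumptions made for illustration.

```python
import math


def neg_log_kernel(theta, alpha, kappa, mu):
    # -log p(theta | mu, alpha, kappa), dropping the normalizing constant.
    return alpha * abs(theta - mu) - kappa * (theta - mu)


def check_loss(z, q):
    # Quantile-regression check loss rho_q(z) = z * (q - 1{z < 0}).
    return z * (q - (1.0 if z < 0 else 0.0))


def hinge_loss(z):
    return max(1.0 - z, 0.0)


def hyperbolic_density(theta, alpha, kappa, mu):
    const = (alpha ** 2 - kappa ** 2) / (2.0 * alpha)
    return const * math.exp(-neg_log_kernel(theta, alpha, kappa, mu))


def mixture_density(theta, alpha, kappa, mu, step=1e-3, upper=80.0):
    # Midpoint rule for the integral of N(theta; mu + kappa v, v) p(v) dv,
    # with p(v) exponential with rate (alpha^2 - kappa^2) / 2 (an assumption).
    rate = (alpha ** 2 - kappa ** 2) / 2.0
    total = 0.0
    for k in range(int(upper / step)):
        v = (k + 0.5) * step
        normal = (math.exp(-((theta - mu - kappa * v) ** 2) / (2.0 * v))
                  / math.sqrt(2.0 * math.pi * v))
        total += normal * rate * math.exp(-rate * v) * step
    return total


if __name__ == "__main__":
    q = 0.25
    for z in (-1.5, 0.3, 2.0):
        # (alpha, kappa, mu) = (1, 1 - 2q, 0): twice the check loss.
        print(neg_log_kernel(z, 1.0, 1.0 - 2.0 * q, 0.0), 2.0 * check_loss(z, q))
        # (alpha, kappa, mu) = (1, 1, 1): twice the hinge loss.
        print(neg_log_kernel(z, 1.0, 1.0, 1.0), 2.0 * hinge_loss(z))
    print(hyperbolic_density(0.7, 1.0, 0.5, 0.0), mixture_density(0.7, 1.0, 0.5, 0.0))
```

The support-vector-machine case $\alpha = \kappa$ makes the exponential rate zero, which is why it appears only as a limiting case and is excluded from the mixture check.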

For logistic regression, choosing $(\alpha, \kappa, \mu) = (1, 1/2, 0)$ leads to

$$p(z_i) = \frac{e^{z_i}}{1 + e^{z_i}},$$

which is the likelihood for logistic regression with $z_i = y_i x_i^T\beta$. Much like the support-vector-machine representation, this corresponds to a limiting improper case of the Polya distribution, specifically $P(1, 0)$. The necessary mixture representation still holds, however, by applying the Fatou-Lebesgue theorem. The improper mixing measure $p_{1,0}(v)$ is an infinite sum of exponentials for which the integral on the right still converges to the logistic likelihood (Gramacy and Polson, 2012).

For the multinomial generalization of the logistic model, we require a slight modification. Suppose that $y_i \in \{1, \ldots, K\}$ is an unordered category indicator, and that $\beta_k = (\beta_{k1}, \ldots, \beta_{kp})^T$ is a separate block of $p$ regression coefficients for the $k$th category. Let

$$\eta_{ik} = \exp(x_i^T\beta_k - c_{ik}) / \{1 + \exp(x_i^T\beta_k - c_{ik})\},$$

where

$$c_{ik}(\beta_{-k}) = \log\Big\{\sum_{l \neq k} \exp(x_i^T\beta_l)\Big\}.$$

We follow Holmes and Held (2006) in writing the conditional likelihood for $\beta_k$ as

$$Q(\beta_k \mid \beta_{-k}, y) \propto \prod_{i=1}^{n}\prod_{l=1}^{K} \eta_{il}^{I(y_i = l)} \propto \prod_{i=1}^{n} \eta_{ik}^{I(y_i = k)} \{w_i(1 - \eta_{ik})\}^{I(y_i \neq k)} \propto \prod_{i=1}^{n} \frac{\{\exp(x_i^T\beta_k - c_{ik})\}^{I(y_i = k)}}{1 + \exp(x_i^T\beta_k - c_{ik})},$$

where $w_i$ is independent of $\beta_k$ and $I$ is the indicator function. Thus the conditional likelihood for the $k$th block of coefficients $\beta_k$, given all the other blocks of coefficients $\beta_{-k}$, can be written as a product of $n$ terms, the $i$th term having a Polya mixture representation with $\alpha_{ik} = I(y_i = k)$, $\kappa_{ik} = \alpha_{ik} - 1/2$, and $\mu_{ik} = c_{ik}(\beta_{-k})$. This allows regularized multinomial logistic models to be fit using essentially the same approach as in Section 4.1, where each block $\beta_k$ is updated using a conditional maximization step.

B Appendix: Proofs

B.1 Theorem 1

Since $\phi$ is a normal kernel,

$$\frac{\partial}{\partial \beta_j}\phi(\beta_j \mid \mu_\beta + \kappa_\beta/\lambda_j,\, \tau^2/\lambda_j) = -\left(\frac{\beta_j - \mu_\beta - \kappa_\beta/\lambda_j}{\tau^2/\lambda_j}\right)\phi(\beta_j \mid \mu_\beta + \kappa_\beta/\lambda_j,\, \tau^2/\lambda_j).$$

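We use this fact to differentiate

$$p(\beta_j \mid \tau) = \int \phi(\beta_j \mid \mu_\beta + \kappa_\beta/\lambda_j,\, \tau^2/\lambda_j)\, p(\lambda_j \mid \tau)\, d\lambda_j$$

under the integral sign to obtain

$$\frac{\partial}{\partial \beta_j} p(\beta_j \mid \tau) = \int \left\{\frac{\partial}{\partial \beta_j}\phi(\beta_j \mid \mu_\beta + \kappa_\beta/\lambda_j,\, \tau^2/\lambda_j)\right\} p(\lambda_j \mid \tau)\, d\lambda_j.$$

Dividing by $p(\beta_j \mid \tau)$ and using the above identity for the inner function, we get

$$\frac{1}{p(\beta_j \mid \tau)}\frac{\partial}{\partial \beta_j} p(\beta_j \mid \tau) = \frac{\kappa_\beta}{\tau^2} - \left(\frac{\beta_j - \mu_\beta}{\tau^2}\right) E(\lambda_j \mid \beta_j, \tau).$$

Hence we can in general find the needed moment using the expression

$$\frac{1}{p(\beta_j \mid \tau)}\frac{\partial}{\partial \beta_j} p(\beta_j \mid \tau) = \frac{\kappa_\beta}{\tau^2} - \left(\frac{\beta_j - \mu_\beta}{\tau^2}\right) E\left(\lambda_j \mid \beta_j^{(g)}, \tau, y\right).$$

Equivalently, in terms of the penalty function $-\log p(\beta_j \mid \tau)$, we have

$$(\beta_j - \mu_\beta)\, E(\lambda_j \mid \beta_j) = \kappa_\beta - \tau^2 \frac{\partial}{\partial \beta_j}\log p(\beta_j \mid \tau).$$

By a similar argument, we also have

$$(z_i - \mu_z)\, E(\omega_i \mid \beta, z_i, \sigma) = \kappa_z - \sigma^2 \frac{\partial}{\partial z_i}\log p(z_i \mid \beta, \sigma).$$

We obtain the desired result by using the identities

$$\frac{\partial}{\partial z_i}\log p(z_i \mid \beta, \sigma) = -f'(z_i \mid \beta, \sigma) \quad \text{and} \quad \frac{\partial}{\partial \beta_j}\log p(\beta_j \mid \tau) = -g'(\beta_j \mid \tau).$$

B.2 Theorem 2

We demonstrate the result for a regression problem, with the classification result following as a simple modification. Begin with Equation (4). Collecting terms, we can represent the log posterior, up to an additive constant not involving $\beta$, as a sum of quadratic forms in $\beta$:

$$\log p(\beta \mid \omega, \lambda, \tau, \sigma, z) = -\frac{1}{2}\left(\{y - \mu_z\mathbf{1} - \kappa_z\omega^{-1}\} - X\beta\right)^T \Omega \left(\{y - \mu_z\mathbf{1} - \kappa_z\omega^{-1}\} - X\beta\right) - \frac{1}{2\tau^2}\left(\beta - \mu_\beta\mathbf{1} - \kappa_\beta\lambda^{-1}\right)^T \Lambda \left(\beta - \mu_\beta\mathbf{1} - \kappa_\beta\lambda^{-1}\right),$$

where we recall that $\omega^{-1}$ is a column vector whose $i$th entry is $\omega_i^{-1}$, and similarly for $\lambda^{-1}$. This is the log posterior under a normal prior $\beta \sim N(\mu_\beta\mathbf{1} + \kappa_\beta\lambda^{-1},\, \tau^2\Lambda^{-1})$ after having observed the working response $y - \mu_z\mathbf{1} - \kappa_z\omega^{-1}$. The identity $\Omega(\mu\mathbf{1} + \kappa\omega^{-1}) = \mu\omega + \kappa\mathbf{1}$ then gives the result.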
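The two moment identities and the quadratic-form representation above combine into a simple iteration. The sketch below is a scalar illustration under stated assumptions: one coefficient and no intercept, $\sigma = 1$, $\mu_z = 0$, $\kappa_z = 1 - 2q$ with $f(z) = |z| + (2q - 1)z$, and an $\ell_1$ penalty $g(\beta) = |\beta|/\tau$ with $\mu_\beta = \kappa_\beta = 0$ in place of the bridge penalty. With these choices the identities give $\hat{\omega}_i = 1/|z_i|$ and $\hat{\lambda} = \tau/|\beta|$, and the maximization step is the scalar version of $(\tau^{-2}\hat{\Lambda} + X^T\hat{\Omega}X)^{-1}X^T\{\hat{\Omega}y - (1 - 2q)\mathbf{1}\}$. The small constant `eps` guarding the divisions, the starting value, and the function names are illustrative additions.

```python
def objective(beta, x, y, q, tau):
    # Q(beta) = sum_i {|z_i| + (2q - 1) z_i} + |beta| / tau, z_i = y_i - x_i beta.
    fit = sum(abs(yi - xi * beta) + (2.0 * q - 1.0) * (yi - xi * beta)
              for xi, yi in zip(x, y))
    return fit + abs(beta) / tau


def em_quantile_l1(x, y, q, tau, n_iter=500, eps=1e-8, beta0=1.0):
    kappa_z = 1.0 - 2.0 * q
    beta = beta0
    for _ in range(n_iter):
        # E-step: conditional moments of the latent precisions.
        omega = [1.0 / max(abs(yi - xi * beta), eps) for xi, yi in zip(x, y)]
        lam = tau / max(abs(beta), eps)
        # M-step: weighted ridge solve with working response y - kappa_z / omega.
        precision = lam / tau ** 2 + sum(w * xi * xi for w, xi in zip(omega, x))
        rhs = sum(xi * (w * yi - kappa_z) for w, xi, yi in zip(omega, x, y))
        beta = rhs / precision
    return beta


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0, 5.0]
    y = [1.2, 1.9, 3.2, 3.8, 5.1]
    for q in (0.25, 0.5):
        beta_hat = em_quantile_l1(x, y, q, tau=1.0)
        print(q, beta_hat, objective(beta_hat, x, y, q, 1.0))
```

Each pass is a weighted least-squares update, so no derivatives of the objective are needed; with several predictors the scalar division would become a linear solve.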

For classification, on the other hand, let $X^*$ be the matrix with rows $x_i^* = y_i x_i$. The kernel of the conditionally normal likelihood then becomes

$$(X^*\beta - \mu_z\mathbf{1} - \kappa_z\omega^{-1})^T\, \Omega\, (X^*\beta - \mu_z\mathbf{1} - \kappa_z\omega^{-1}).$$

Hence it is as if we observe the $n$-dimensional working response $\mu_z\mathbf{1} + \kappa_z\omega^{-1}$ in a regression model having design matrix $X^*$, and we proceed by a similar argument to arrive at the result.

B.3 Theorem 3

Our extension of Masreliez's theorem to variance-mean mixtures follows a similar path as Theorem 1. From before, since $\phi$ is a normal kernel,

$$\frac{\partial}{\partial \beta_j}\phi(\beta_j \mid \mu_\beta + \kappa_\beta/\lambda_j,\, \tau^2/\lambda_j) = -\left(\frac{\beta_j - \mu_\beta - \kappa_\beta/\lambda_j}{\tau^2/\lambda_j}\right)\phi(\beta_j \mid \mu_\beta + \kappa_\beta/\lambda_j,\, \tau^2/\lambda_j).$$

Differentiating under the integral sign and applying this result, we have that

$$\frac{1}{\tau^2}\lambda_j\beta_j\, \phi(\beta_j \mid \mu_\beta + \kappa_\beta/\lambda_j,\, \tau^2/\lambda_j) = \left(\frac{\mu_\beta + \kappa_\beta/\lambda_j}{\tau^2/\lambda_j}\right)\phi(\beta_j \mid \mu_\beta + \kappa_\beta/\lambda_j,\, \tau^2/\lambda_j) - \frac{\partial}{\partial \beta_j}\phi(\beta_j \mid \mu_\beta + \kappa_\beta/\lambda_j,\, \tau^2/\lambda_j).$$

The rest of the argument follows the standard Masreliez approach.

References

D. Andrews and C. Mallows. Scale mixtures of normal distributions. Journal of the Royal Statistical Society, Series B, 36:99–102, 1974.
A. Armagan, D. Dunson, and J. Lee. Generalized double Pareto shrinkage. Technical report, Duke University Department of Statistical Science, 2012a.
A. Armagan, D. B. Dunson, J. Lee, and W. Bajwa. Posterior consistency in linear models under shrinkage priors. Technical report, Duke University Department of Statistical Science, 2012b.
K. Bae and B. Mallick. Gene selection using a two-level hierarchical Bayesian model. Bioinformatics, 20(18):3423–30, 2004.
O. E. Barndorff-Nielsen and N. Shephard. Non-Gaussian Ornstein-Uhlenbeck-based models and some of their uses in financial economics. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(2), 2001.
O. E. Barndorff-Nielsen, J. Kent, and M. Sorensen. Normal variance-mean mixtures and z distributions. International Statistical Review, 50:145–59, 1982.
L. Brown. Admissible estimators, recurrent diffusions and insoluble boundary problems. The Annals of Mathematical Statistics, 42:855–903, 1971.
F. Caron and A. Doucet. Sparse Bayesian nonparametric regression. In Proceedings of the 25th International Conference on Machine Learning. Association for Computing Machinery, 2008.
C. M. Carvalho, N. G. Polson, and J. G. Scott. The horseshoe estimator for sparse signals. Biometrika, 97(2):465–80, 2010.
A. Dempster, N. Laird, and D. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society (Series B), 39(1):1–38, 1977.

B. Efron. Empirical Bayes estimates for large-scale prediction problems. Journal of the American Statistical Association, 104(487):1015–28, 2009.
B. Efron. Tweedie's formula and selection bias. Journal of the American Statistical Association, 106(496), 2011.
J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–60, 2001.
M. Figueiredo. Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(9):1150–9, 2003.
A. Gelman. Prior distributions for variance parameters in hierarchical models. Bayesian Anal., 1(3), 2006.
A. Gelman, A. Jakulin, M. Pittau, and Y. Su. A weakly informative default prior distribution for logistic and other regression models. The Annals of Applied Statistics, 2(4):1360–83, 2008a.
A. Gelman, D. A. van Dyk, Z. Huang, and W. J. Boscardin. Using redundant parameterizations to fit hierarchical models. Journal of Computational and Graphical Statistics, 17(1):95–122, 2008b.
A. Genkin, D. Lewis, and D. Madigan. High-dimensional generalized linear models and the lasso. Technometrics, 49:291–304, 2007.
R. B. Gramacy and N. G. Polson. Simulation-based regularized logistic regression. Bayesian Analysis, 2012. arxiv.org/abs/ (to appear).
P. J. Green. Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives. Journal of the Royal Statistical Society (Series B), 46(2):149–92, 1984.
J. Griffin and P. Brown. Inference with normal-gamma prior distributions in regression problems. Bayesian Analysis, 5(1):171–88, 2010.
J. Griffin and P. Brown. Alternative prior distributions for variable selection with very many more variables than observations. Australian and New Zealand Journal of Statistics, 2012. (to appear).
C. M. Hans. Bayesian lasso regression. Biometrika, 96(4):835–45, 2009.
C. Holmes and L. Held. Bayesian auxiliary variable models for binary and multinomial regression. Bayesian Analysis, 1(1):145–68, 2006.
J. Huang, J. Horowitz, and S. Ma. Asymptotic properties of bridge estimators in sparse high-dimensional regression models. The Annals of Statistics, 36(2), 2008.
D. R. Hunter and K. Lange. Quantile regression via an MM algorithm. Journal of Computational and Graphical Statistics, 9(1):60–77, 2000.
K. Knight and W. Fu. Asymptotics for lasso-type estimators. The Annals of Statistics, 28(5), 2000.
R. Koenker. Quantile Regression. Cambridge University Press, New York, USA, 2005.
R. Koenker. quantreg: Quantile Regression, 2011. R package.
R. Koenker and J. Machado. Goodness of fit and related inference processes for quantile regression. Journal of the American Statistical Association, 94(448), 1999.
Q. Li, R. Xi, and N. Lin. Bayesian regularized quantile regression. Bayesian Analysis, 5(3):533–56, 2010.
C. Liu. Missing data imputation using the multivariate t distribution. Journal of Multivariate Analysis, 53(1):139–58, 1995.
C. Liu. Robit regression: a simple robust alternative to logistic and probit regression. In A. Gelman and X. L. Meng, editors, Applied Bayesian Modeling and Causal Inference from Incomplete-Data Perspectives: An Essential Journey with Donald Rubin's Statistical Family. John Wiley & Sons, 2005.
J. S. Liu and Y. N. Wu. Parameter expansion for data augmentation. Journal of the American Statistical Association, 94(448), 1999.
B. K. Mallick, D. Ghosh, and M. Ghosh. Bayesian classification of tumours by using gene expression data. Journal of the Royal Statistical Society (Series B), 67(2):219–34, 2005.
C. Masreliez. Approximate non-Gaussian filtering with linear state and observation relations. IEEE. Trans. Autom. Control, 20(1):107–10, 1975.
R. Mazumder, J. Friedman, and T. Hastie. Sparsenet: coordinate descent with non-convex penalties. Journal of the American Statistical Association, 106(495), 2011.
L. Meier, S. van de Geer, and P. Bühlmann. The group lasso for logistic regression. Journal of the Royal Statistical Society (Series B), 70(1):53–71, 2008.
X. L. Meng and D. van Dyk. The EM algorithm: an old folk-song sung to a fast new tune. Journal of the Royal Statistical Society (Series B), 59(3):511–67, 1997.
T. Park and G. Casella. The Bayesian lasso. Journal of the American Statistical Association, 103(482):681–6, 2008.
L. R. Pericchi and A. Smith. Exact and approximate posterior moments for a normal location parameter. Journal of the Royal Statistical Society (Series B), 54(3):793–804, 1992.
N. G. Polson and J. G. Scott. Shrink globally, act locally: sparse Bayesian regularization and prediction. In J. Bernardo, M. Bayarri, J. O. Berger, A. Dawid, D. Heckerman, A. Smith, and M. West, editors, Proceedings of the 9th Valencia World Meeting on Bayesian Statistics. Oxford University Press, 2011a.
N. G. Polson and J. G. Scott. The Bayesian bridge. Technical report, University of Texas at Austin, 2011b.
N. G. Polson and J. G. Scott. Local shrinkage rules, Lévy processes, and regularized regression. Journal of the Royal Statistical Society (Series B), 2012a. (to appear).
N. G. Polson and J. G. Scott. Good, great, or lucky? Screening for firms with sustained superior performance using heavy-tailed priors. The Annals of Applied Statistics, 2012b. (to appear).
N. G. Polson and S. Scott. Data augmentation for support vector machines (with discussion). Bayesian Analysis, 6(1):1–24, 2011c.
R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society (Series B), 58(1):267–88, 1996.
M. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211–44, 2001.
D. van Dyk and X. L. Meng. Cross-fertilizing strategies for better EM mountain climbing and DA field exploration: A graphical guide book. Statistical Science, 25(4):429–49, 2010.
D. van Dyk and T. Park. Partially collapsed Gibbs samplers: theory and methods. Journal of American Statistical Association, 103(482):790–6, 2008.
M. West. On scale mixtures of normal distributions. Biometrika, 74(3):646–8, 1987.
H. Zou and R. Li. One-step sparse estimates in nonconcave penalized likelihood models. Annals of Statistics, 36(4):1509–33, 2008.

Data augmentation for non-gaussian regression models using variance-mean mixtures. Biometrika (2013), 100, 2, pp. 459–471. doi: 10.1093/biomet/ass081. © 2013 Biometrika Trust. Advance Access publication 11 March 2013. Printed in Great Britain.

More information

Or How to select variables Using Bayesian LASSO

Or How to select variables Using Bayesian LASSO Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO On Bayesian Variable Selection

More information

Regularization with variance-mean mixtures

Regularization with variance-mean mixtures Regularization with variance-mean mixtures Nick Polson University of Chicago James Scott University of Texas at Austin Workshop on Sensing and Analysis of High-Dimensional Data Duke University July 2011

More information

Geometric ergodicity of the Bayesian lasso

Geometric ergodicity of the Bayesian lasso Geometric ergodicity of the Bayesian lasso Kshiti Khare and James P. Hobert Department of Statistics University of Florida June 3 Abstract Consider the standard linear model y = X +, where the components

More information

Generalized Elastic Net Regression

Generalized Elastic Net Regression Abstract Generalized Elastic Net Regression Geoffroy MOURET Jean-Jules BRAULT Vahid PARTOVINIA This work presents a variation of the elastic net penalization method. We propose applying a combined l 1

More information

Default Priors and Effcient Posterior Computation in Bayesian

Default Priors and Effcient Posterior Computation in Bayesian Default Priors and Effcient Posterior Computation in Bayesian Factor Analysis January 16, 2010 Presented by Eric Wang, Duke University Background and Motivation A Brief Review of Parameter Expansion Literature

More information

Bayesian linear regression

Bayesian linear regression Bayesian linear regression Linear regression is the basis of most statistical modeling. The model is Y i = X T i β + ε i, where Y i is the continuous response X i = (X i1,..., X ip ) T is the corresponding

More information

arxiv: v2 [stat.me] 27 Oct 2012

arxiv: v2 [stat.me] 27 Oct 2012 The Bayesian Bridge Nicholas G. Polson University of Chicago arxiv:119.2279v2 [stat.me] 27 Oct 212 James G. Scott Jesse Windle University of Texas at Austin First Version: July 211 This Version: October

More information

On the Half-Cauchy Prior for a Global Scale Parameter

On the Half-Cauchy Prior for a Global Scale Parameter Bayesian Analysis (2012) 7, Number 2, pp. 1 16 On the Half-Cauchy Prior for a Global Scale Parameter Nicholas G. Polson and James G. Scott Abstract. This paper argues that the half-cauchy distribution

More information

Motivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University

Motivation Scale Mixutres of Normals Finite Gaussian Mixtures Skew-Normal Models. Mixture Models. Econ 690. Purdue University Econ 690 Purdue University In virtually all of the previous lectures, our models have made use of normality assumptions. From a computational point of view, the reason for this assumption is clear: combined

More information

Bayesian Grouped Horseshoe Regression with Application to Additive Models

Bayesian Grouped Horseshoe Regression with Application to Additive Models Bayesian Grouped Horseshoe Regression with Application to Additive Models Zemei Xu, Daniel F. Schmidt, Enes Makalic, Guoqi Qian, and John L. Hopper Centre for Epidemiology and Biostatistics, Melbourne

More information

Sparse Linear Models (10/7/13)

Sparse Linear Models (10/7/13) STA56: Probabilistic machine learning Sparse Linear Models (0/7/) Lecturer: Barbara Engelhardt Scribes: Jiaji Huang, Xin Jiang, Albert Oh Sparsity Sparsity has been a hot topic in statistics and machine

More information

Bayesian shrinkage approach in variable selection for mixed

Bayesian shrinkage approach in variable selection for mixed Bayesian shrinkage approach in variable selection for mixed effects s GGI Statistics Conference, Florence, 2015 Bayesian Variable Selection June 22-26, 2015 Outline 1 Introduction 2 3 4 Outline Introduction

More information

Bayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence

Bayesian Inference in GLMs. Frequentists typically base inferences on MLEs, asymptotic confidence Bayesian Inference in GLMs Frequentists typically base inferences on MLEs, asymptotic confidence limits, and log-likelihood ratio tests Bayesians base inferences on the posterior distribution of the unknowns

More information

Statistical Inference

Statistical Inference Statistical Inference Liu Yang Florida State University October 27, 2016 Liu Yang, Libo Wang (Florida State University) Statistical Inference October 27, 2016 1 / 27 Outline The Bayesian Lasso Trevor Park

More information

Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics

Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics arxiv:1603.08163v1 [stat.ml] 7 Mar 016 Farouk S. Nathoo, Keelin Greenlaw,

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

Partial factor modeling: predictor-dependent shrinkage for linear regression

Partial factor modeling: predictor-dependent shrinkage for linear regression modeling: predictor-dependent shrinkage for linear Richard Hahn, Carlos Carvalho and Sayan Mukherjee JASA 2013 Review by Esther Salazar Duke University December, 2013 Factor framework The factor framework

More information

Lasso & Bayesian Lasso

Lasso & Bayesian Lasso Readings Chapter 15 Christensen Merlise Clyde October 6, 2015 Lasso Tibshirani (JRSS B 1996) proposed estimating coefficients through L 1 constrained least squares Least Absolute Shrinkage and Selection

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate

More information

Analysis Methods for Supersaturated Design: Some Comparisons

Analysis Methods for Supersaturated Design: Some Comparisons Journal of Data Science 1(2003), 249-260 Analysis Methods for Supersaturated Design: Some Comparisons Runze Li 1 and Dennis K. J. Lin 2 The Pennsylvania State University Abstract: Supersaturated designs

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 7 Approximate

More information

Chris Fraley and Daniel Percival. August 22, 2008, revised May 14, 2010

Chris Fraley and Daniel Percival. August 22, 2008, revised May 14, 2010 Model-Averaged l 1 Regularization using Markov Chain Monte Carlo Model Composition Technical Report No. 541 Department of Statistics, University of Washington Chris Fraley and Daniel Percival August 22,

More information

On the half-cauchy prior for a global scale parameter

On the half-cauchy prior for a global scale parameter On the half-cauchy prior for a global scale parameter Nicholas G. Polson University of Chicago arxiv:1104.4937v2 [stat.me] 25 Sep 2011 James G. Scott The University of Texas at Austin First draft: June

More information

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) =

σ(a) = a N (x; 0, 1 2 ) dx. σ(a) = Φ(a) = Until now we have always worked with likelihoods and prior distributions that were conjugate to each other, allowing the computation of the posterior distribution to be done in closed form. Unfortunately,

More information

Latent Variable Models for Binary Data. Suppose that for a given vector of explanatory variables x, the latent

Latent Variable Models for Binary Data. Suppose that for a given vector of explanatory variables x, the latent Latent Variable Models for Binary Data Suppose that for a given vector of explanatory variables x, the latent variable, U, has a continuous cumulative distribution function F (u; x) and that the binary

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

Slice Sampling with Adaptive Multivariate Steps: The Shrinking-Rank Method

Slice Sampling with Adaptive Multivariate Steps: The Shrinking-Rank Method Slice Sampling with Adaptive Multivariate Steps: The Shrinking-Rank Method Madeleine B. Thompson Radford M. Neal Abstract The shrinking rank method is a variation of slice sampling that is efficient at

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

The Polya-Gamma Gibbs Sampler for Bayesian. Logistic Regression is Uniformly Ergodic

The Polya-Gamma Gibbs Sampler for Bayesian. Logistic Regression is Uniformly Ergodic he Polya-Gamma Gibbs Sampler for Bayesian Logistic Regression is Uniformly Ergodic Hee Min Choi and James P. Hobert Department of Statistics University of Florida August 013 Abstract One of the most widely

More information

Bayes methods for categorical data. April 25, 2017

Bayes methods for categorical data. April 25, 2017 Bayes methods for categorical data April 25, 2017 Motivation for joint probability models Increasing interest in high-dimensional data in broad applications Focus may be on prediction, variable selection,

More information

SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING. Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu

SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING. Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu SOLVING NON-CONVEX LASSO TYPE PROBLEMS WITH DC PROGRAMMING Gilles Gasso, Alain Rakotomamonjy and Stéphane Canu LITIS - EA 48 - INSA/Universite de Rouen Avenue de l Université - 768 Saint-Etienne du Rouvray

More information

Biostatistics Advanced Methods in Biostatistics IV

Biostatistics Advanced Methods in Biostatistics IV Biostatistics 140.754 Advanced Methods in Biostatistics IV Jeffrey Leek Assistant Professor Department of Biostatistics jleek@jhsph.edu Lecture 12 1 / 36 Tip + Paper Tip: As a statistician the results

More information

Bayesian Grouped Horseshoe Regression with Application to Additive Models

Bayesian Grouped Horseshoe Regression with Application to Additive Models Bayesian Grouped Horseshoe Regression with Application to Additive Models Zemei Xu 1,2, Daniel F. Schmidt 1, Enes Makalic 1, Guoqi Qian 2, John L. Hopper 1 1 Centre for Epidemiology and Biostatistics,

More information

Choosing the Summary Statistics and the Acceptance Rate in Approximate Bayesian Computation

Choosing the Summary Statistics and the Acceptance Rate in Approximate Bayesian Computation Choosing the Summary Statistics and the Acceptance Rate in Approximate Bayesian Computation COMPSTAT 2010 Revised version; August 13, 2010 Michael G.B. Blum 1 Laboratoire TIMC-IMAG, CNRS, UJF Grenoble

More information

Bayesian Methods for Machine Learning

Bayesian Methods for Machine Learning Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),

More information

On the Conditional Distribution of the Multivariate t Distribution

On the Conditional Distribution of the Multivariate t Distribution On the Conditional Distribution of the Multivariate t Distribution arxiv:604.0056v [math.st] 2 Apr 206 Peng Ding Abstract As alternatives to the normal distributions, t distributions are widely applied

More information

Deep Poisson Factorization Machines: a factor analysis model for mapping behaviors in journalist ecosystem

Deep Poisson Factorization Machines: a factor analysis model for mapping behaviors in journalist ecosystem 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

Dynamic Generalized Linear Models
