Sparse Bayesian Logistic Regression with Hierarchical Prior and Variational Inference


Shunsuke Horii
Waseda University
s.horii@aoni.waseda.jp

Abstract

In this paper, we present a hierarchical model that uses the logistic regression function as the observation model and places hierarchical priors that promote sparsity on the estimated parameters. We also develop an inference algorithm based on the variational method. The effectiveness of the proposed algorithm is validated through experiments on both synthetic and real-world data.

1 Introduction

Learning from high-dimensional data is one of the emerging tasks in machine learning research. Many studies address this problem with optimization methods such as the LASSO [1]. The LASSO can be viewed as computing the maximum a posteriori (MAP) estimate of the parameter under a Laplace prior distribution. For some Bayesian decision-making problems, such as sequential experimental design, the full information of the posterior distribution is useful, and there are several approaches to obtaining an approximate posterior distribution for the sparse linear model. The relevance vector machine (RVM) assumes a Gaussian prior on the unknown parameters and finds maximum likelihood or MAP estimates of the scale parameters of the Gaussian [2]. Since some of the estimated scale parameters diverge to infinity, the resulting estimator of the unknown parameters is sparse. Another approach is the Majorize-Minimization (MM) approach, in which sparsity-inducing priors, such as the Laplace distribution or Student's t distribution, are placed on the unknown parameters and a Gaussian approximation of the posterior is computed [3]. Finally, there is a hierarchical modeling approach, which assumes a Gaussian prior on the unknown parameters and further places priors on the scale parameters. For the prior distributions of the scale parameters, the exponential distribution, the gamma distribution, or, more generally, the generalized inverse Gaussian distribution have been used [4][5][6]. Compared to the other approaches, the hierarchical modeling approach adds one more level to the hierarchy. Figueiredo proposed to treat the scale parameters as hidden variables and developed an approximation algorithm based on the EM algorithm [4]. Park and Casella derived a Gibbs sampler for the hierarchical sparse linear model [5]. In general, sampling-based methods such as Gibbs sampling take a long time to converge and are hard to apply to large-scale problems. As a deterministic approximation for the hierarchical sparse linear model, Babacan et al. proposed an algorithm based on the variational Bayes (VB) method [6]. While many works address the problem of finding an approximate posterior distribution for the sparse linear model, limited effort has been devoted to sparse nonlinear models, which are also important for applying these methods to classification problems. Among the works listed above, the RVM can also be applied to classification [2], and Figueiredo showed how to apply the EM-based method to the probit classification model [4]. Very recently, Serra et al. proposed a Bayesian logistic regression with sparsity-inducing priors [7]; the method of [7] is based on the MM approach.
In this paper, we propose a Bayesian logistic regression with a hierarchical model that induces sparsity in the unknown parameters. The proposed algorithm is based on the VB and EM algorithms and can be regarded as an extension of both the variational Bayesian logistic regression of [8] and the VB algorithm for the hierarchical sparse linear model [6].

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

2 Hierarchical Model for Sparse Logistic Regression

Let $y = (y_1, \ldots, y_n)^T \in \{0,1\}^n$ be the observed label vector, $X = [x_1, \ldots, x_n]^T \in \mathbb{R}^{n \times p}$ the design matrix, and $\beta = (\beta_1, \ldots, \beta_p)^T \in \mathbb{R}^p$ the unknown parameter vector. We consider the following logistic regression model:

$p(y \mid \beta) = \prod_{j=1}^{n} p(y_j \mid \beta) = \prod_{j=1}^{n} \sigma(x_j^T \beta)^{y_j} \bigl(1 - \sigma(x_j^T \beta)\bigr)^{1 - y_j}, \quad (1)$

$\sigma(x) = \frac{1}{1 + e^{-x}}. \quad (2)$

We assume an independent Gaussian prior for the unknown parameter,

$p(\beta \mid \tau) = \mathcal{N}(\beta \mid 0, S_\tau), \quad (3)$

where $\mathcal{N}(\cdot \mid \mu, \Sigma)$ denotes the multivariate Gaussian distribution with mean vector $\mu$ and covariance matrix $\Sigma$, $\tau = (\tau_1, \ldots, \tau_p)$, and $S_\tau = \mathrm{diag}(\tau_1, \ldots, \tau_p)$. Further, we assume a prior distribution $p(\tau)$ for $\tau$, which is called the mixing distribution. As in [6], in this paper we take the mixing distribution to be an independent generalized inverse Gaussian (GIG) distribution,

$p(\tau \mid a, b, \rho) = \prod_{i=1}^{p} p(\tau_i \mid a_i, b_i, \rho_i) = \prod_{i=1}^{p} \frac{(a_i / b_i)^{\rho_i / 2}}{2 K_{\rho_i}(\sqrt{a_i b_i})} \, \tau_i^{\rho_i - 1} \exp\!\left( -\frac{a_i \tau_i + b_i / \tau_i}{2} \right), \quad (4)$

where $a = (a_1, \ldots, a_p)$, $b = (b_1, \ldots, b_p)$, $\rho = (\rho_1, \ldots, \rho_p)$, and $K_{\rho_i}$ is the modified Bessel function of the second kind. As special cases, the GIG distribution coincides with the exponential distribution when $b_i = 0$, $\rho_i = 1$, and with the inverse gamma distribution when $a_i = 0$, $\rho_i < 0$. In summary, the following joint distribution is obtained:

$p(y, \beta, \tau, a, b, \rho) = p(y \mid \beta)\, p(\beta \mid \tau)\, p(\tau \mid a, b, \rho)\, p(a, b, \rho). \quad (5)$

For simplicity, hereafter we consider only the case $\rho_i = 1$, $b_i = 0$ (in this case $p(\tau_i \mid a_i)$ is the exponential distribution) and assume that the hyperprior $p(a)$ is an independent gamma distribution, that is,

$p(a) = \prod_{i=1}^{p} \mathrm{Gam}(a_i; k_a, \theta_a) = \prod_{i=1}^{p} \frac{\theta_a^{k_a}}{\Gamma(k_a)} a_i^{k_a - 1} \exp(-\theta_a a_i). \quad (6)$
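To make the generative process in (1)-(6) concrete for the case $\rho_i = 1$, $b_i = 0$, the following minimal NumPy sketch draws one sample of $(a, \tau, \beta, y)$ by ancestral sampling. The function name and the default hyperparameter values are illustrative choices, not taken from the paper.

```python
# Ancestral sampling sketch for the hierarchical model (1)-(6), rho_i = 1, b_i = 0:
#   a_i ~ Gam(k_a, rate theta_a),  tau_i ~ Exp(rate a_i / 2),
#   beta_i ~ N(0, tau_i),          y_j ~ Bernoulli(sigma(x_j^T beta)).
import numpy as np

def sample_hierarchical_logistic(X, k_a=1.0, theta_a=1.0, rng=None):
    """Draw (beta, tau, a, y) once from the generative model; X has shape (n, p)."""
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    a = rng.gamma(shape=k_a, scale=1.0 / theta_a, size=p)  # Gam(k_a, theta_a), theta_a is a rate
    tau = rng.exponential(scale=2.0 / a)                   # p(tau_i | a_i) = (a_i/2) exp(-a_i tau_i / 2)
    beta = rng.normal(0.0, np.sqrt(tau))                   # N(0, S_tau), S_tau = diag(tau)
    prob = 1.0 / (1.0 + np.exp(-X @ beta))                 # sigma(x_j^T beta)
    y = rng.binomial(1, prob)
    return beta, tau, a, y
```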

3 Variational Inference

Given the joint distribution (5), we want to compute the posterior distribution $p(\beta \mid y)$; however, this requires an intractable integral. In this paper, we give an approximation algorithm based on the VB method [9]. Let $\theta = (\beta, \tau, a)$. The VB method finds a distribution $q(\theta)$ that approximates $p(\theta \mid y)$. More specifically, the goal is to find the $q(\theta)$ that minimizes the Kullback-Leibler divergence $\mathrm{KL}(q(\theta) \,\|\, p(\theta \mid y))$:

$q^*(\theta) = \operatorname*{argmin}_{q(\theta)} \int q(\theta) \ln \frac{q(\theta)}{p(\theta \mid y)}\, d\theta = \operatorname*{argmin}_{q(\theta)} \int q(\theta) \ln \frac{q(\theta)}{p(\theta, y)}\, d\theta. \quad (7)$

However, it is difficult to minimize (7) over arbitrary probability distributions. First, we restrict the optimization to distributions $q(\theta)$ that factorize as

$q(\beta, \tau, a) = q(\beta)\, q(\tau)\, q(a). \quad (8)$

This is known as the mean-field approximation [9]. Under this factorization, we can minimize (7) by iteratively updating each component distribution $q(\theta_k)$, $\theta_k \in \theta$, as [9]

$\ln q^*(\theta_k) = \mathbb{E}_{q(\theta \setminus \theta_k)}[\ln p(y, \theta)] + \text{const.}, \quad (9)$

where $\theta \setminus \theta_k$ is the set $\theta$ with $\theta_k$ removed. However, it is still difficult to evaluate (9), since there is no analytic solution for $\ln q^*(\beta) = \mathbb{E}_{q(\theta \setminus \beta)}[\ln p(y, \theta)] + \text{const.}$ We therefore use the Majorize-Minimization (MM) approach that was used for Bayesian logistic regression in [8]. The idea is to introduce a variational parameter $\xi = (\xi_1, \ldots, \xi_n)$ and to find a function $h(\beta, \xi)$ that is a quadratic function of $\beta$ and a lower bound of $p(y \mid \beta)$; $h(\beta, \xi)$ is then substituted for $p(y \mid \beta)$. More specifically, it is known that the following inequality holds [8]:

$p(y \mid \beta) \ge h(\beta, \xi) = \prod_{j=1}^{n} \sigma(\xi_j) \exp\!\left( y_j x_j^T \beta - \frac{x_j^T \beta + \xi_j}{2} - \lambda(\xi_j)\bigl((x_j^T \beta)^2 - \xi_j^2\bigr) \right), \quad (10)$

where

$\lambda(\xi) = \frac{1}{2\xi}\left( \sigma(\xi) - \frac{1}{2} \right). \quad (11)$

In this paper, we use the following strategy to find an approximate posterior distribution. First, we fix the variational parameter $\xi$ and update $q(\theta_k)$, $\theta_k \in \theta$, as

$\ln q^*(\theta_k) = \mathbb{E}_{q(\theta \setminus \theta_k)}[\ln h(\beta, \xi)\, p(\theta)] + \text{const.} \quad (12)$

Then we use the EM algorithm to update $\xi$, according to

$\xi^* = \operatorname*{argmax}_{\xi} \int q(\beta) \ln h(\beta, \xi)\, d\beta. \quad (13)$

Since it turns out that $q(\beta)$ is a Gaussian, we can use the result of [8] for solving (13). The resulting procedure is summarized in Algorithm 1.

Algorithm 1 Sparse Bayesian Logistic Regression
Input: design matrix $X \in \mathbb{R}^{n \times p}$, label vector $y \in \{0,1\}^n$, hyperparameters $k_a > 0$, $\theta_a > 0$, initial variational parameter $\xi^{(0)} \in \mathbb{R}^n$.
Output: $\bar{\beta}$, $\Sigma$.
Set $t = 0$, initialize $\bar{a}_j^{(0)} = k_a / \theta_a$ for $j = 1, \ldots, p$, and $S^{(0)} = I_p$ (the $p$-dimensional identity matrix).
repeat
  if $p > n$ then
    $\Sigma^{(t+1)} = S^{(t)} - S^{(t)} X^T \bigl( \Lambda_{\xi^{(t)}}^{-1} + X S^{(t)} X^T \bigr)^{-1} X S^{(t)}$
  else
    $\Sigma^{(t+1)} = U^{(t)} \bigl( U^{(t)} X^T \Lambda_{\xi^{(t)}} X U^{(t)} + I_p \bigr)^{-1} U^{(t)}$
  end if
  where $\Lambda_{\xi^{(t)}} = 2\, \mathrm{diag}\bigl(\lambda(\xi_1^{(t)}), \ldots, \lambda(\xi_n^{(t)})\bigr)$ and $U^{(t)} = \mathrm{diag}\bigl(\sqrt{S^{(t)}_{1,1}}, \ldots, \sqrt{S^{(t)}_{p,p}}\bigr)$.
  $\bar{\beta}^{(t+1)} = \Sigma^{(t+1)} \sum_{j=1}^{n} \bigl( y_j - \tfrac{1}{2} \bigr) x_j$
  $\overline{1/\tau}_i^{(t+1)} = \sqrt{ \bar{a}_i^{(t)} \big/ \bigl( (\bar{\beta}_i^{(t+1)})^2 + \Sigma^{(t+1)}_{i,i} \bigr) }$ for $i = 1, \ldots, p$
  $S^{(t+1)} = \mathrm{diag}\bigl( 1 / \overline{1/\tau}_1^{(t+1)}, \ldots, 1 / \overline{1/\tau}_p^{(t+1)} \bigr)$
  $\bar{a}_i^{(t+1)} = (k_a + 1) \Big/ \Bigl( \theta_a + \tfrac{1}{2}\bigl( 1 / \overline{1/\tau}_i^{(t+1)} + 1 / \bar{a}_i^{(t)} \bigr) \Bigr)$ for $i = 1, \ldots, p$
  $\xi_j^{(t+1)} = \sqrt{ x_j^T \Sigma^{(t+1)} x_j + \bigl( x_j^T \bar{\beta}^{(t+1)} \bigr)^2 }$ for $j = 1, \ldots, n$
  $t = t + 1$
until converged

Using $\bar{\beta}$ and $\Sigma$, we can construct an approximate predictive distribution as follows [9]:

$p(y = 1 \mid x, X, y) \approx \sigma\!\left( \frac{x^T \bar{\beta}}{\sqrt{1 + \frac{\pi}{8}\, x^T \Sigma x}} \right). \quad (14)$
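The sketch below is a minimal NumPy rendering of this iteration, not the authors' implementation: the names `sparse_vb_logistic` and `lam` are ours, it uses a fixed iteration count instead of a convergence check, it forms the covariance by a direct inverse rather than the $p > n$ Woodbury variant of Algorithm 1, and the $\langle 1/\tau_i \rangle$ and $\langle a_i \rangle$ updates follow the exponential-mixing derivation above, so details may differ from the paper.

```python
import numpy as np

def lam(xi):
    """lambda(xi) = (sigma(xi) - 1/2) / (2 xi) as in Eq. (11), with the xi -> 0 limit 1/8."""
    xi = np.asarray(xi, dtype=float)
    out = np.full_like(xi, 0.125)
    nz = np.abs(xi) > 1e-8
    out[nz] = (1.0 / (1.0 + np.exp(-xi[nz])) - 0.5) / (2.0 * xi[nz])
    return out

def sparse_vb_logistic(X, y, k_a=1e-6, theta_a=1e-6, n_iter=100):
    """Mean-field VB under the Jaakkola-Jordan bound for the rho = 1, b = 0 model."""
    n, p = X.shape
    inv_tau = np.ones(p)                   # current <1/tau_i>
    a_bar = np.full(p, k_a / theta_a)      # current <a_i>
    xi = np.ones(n)                        # variational parameters xi_j
    for _ in range(n_iter):
        Lam = 2.0 * lam(xi)                                      # diagonal of Lambda_xi
        Sigma = np.linalg.inv(X.T @ (Lam[:, None] * X) + np.diag(inv_tau))
        mean = Sigma @ (X.T @ (y - 0.5))                         # mean of q(beta)
        e_beta2 = mean ** 2 + np.diag(Sigma)                     # <beta_i^2>
        inv_tau = np.sqrt(a_bar / e_beta2)                       # <1/tau_i> under GIG q(tau_i)
        e_tau = 1.0 / inv_tau + 1.0 / a_bar                      # <tau_i> under GIG q(tau_i)
        a_bar = (k_a + 1.0) / (theta_a + 0.5 * e_tau)            # q(a_i) = Gam(k_a + 1, theta_a + <tau_i>/2)
        xi = np.sqrt(np.einsum('ij,jk,ik->i', X, Sigma, X) + (X @ mean) ** 2)  # Eq. (13)
    return mean, Sigma
```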

Table 1: MSE and prediction accuracy for synthetic data.

            MSE               Accuracy
Proposed    0.589 ± 0.33      0.895 ± 0.0477
L1          0.3974 ± 0.939    0.7750 ± 0.0456
L2          0.439 ± 0.597     0.7 ± 0.04

Table 2: Prediction accuracy for real world data.

            a1a       w1a       covtype
Proposed    0.8400    0.9757    0.7436
L1          0.8400    0.970     0.74
L2          0.8386    0.9779    0.7398

Figure 1: Estimation results on synthetic data (true parameter and the estimates of the proposed method and the L1/L2 regularized logistic regressions).

Figure 2: Estimation results on a1a data (estimates of the proposed method and the L1 regularized logistic regression).

4 Experiments

In this section, we illustrate the performance of the proposed method on synthetic data and real-world data.

4.1 Experiments on Synthetic Data

We first illustrate the behavior of the proposed method using synthetic data. In this experiment, we fix a true sparse parameter $\beta \in \mathbb{R}^{100}$ whose first 90 components are zero and generate training samples $(X_{\text{train}}, y_{\text{train}})$ and test samples $(X_{\text{test}}, y_{\text{test}})$. The number of training samples is 00 and that of test samples is 000. The elements of $X_{\text{train}}$ and $X_{\text{test}}$ are generated from the standard normal distribution, and $y_{\text{train}}$, $y_{\text{test}}$ are generated according to the model (1) with the true parameter. We examined the mean squared error (MSE) of the estimators of $\beta$ and the prediction accuracy on the test data. For comparison, we used L1 regularized and L2 regularized logistic regression with regularization parameters tuned by cross validation. For the proposed algorithm, we set $k_a = \theta_a = 10^{-6}$, so that the prior is almost non-informative, and $\xi^{(0)} = \mathbf{1}$. Table 1 shows the average and standard deviation of the MSE and the prediction accuracy over 000 experiments. The proposed method shows better performance in terms of both the MSE and the prediction accuracy. Furthermore, Figure 1 plots the true parameter and the estimates for one training set: the estimate of the proposed method is sparser than the outputs of the regularized logistic regressions.

4.2 Experiments on Real World Data

We also evaluate the proposed method on real-world classification tasks. As in the synthetic-data experiments, we compare the prediction accuracy of the proposed method with that of L1 regularized and L2 regularized logistic regression. The regularization parameters of the regularized logistic regressions are tuned by cross validation, and $k_a = \theta_a = 10^{-6}$ and $\xi^{(0)} = \mathbf{1}$ are used for the proposed algorithm. All data sets are from the LIBSVM archive. Table 2 shows the prediction accuracy on these data sets: the proposed method is competitive with, or better than, the regularized logistic regressions. Furthermore, Figure 2, which shows the estimates of the proposed method and the L1 regularized logistic regression on the a1a data, indicates that the proposed method outputs a sparse solution even on real-world data. For the a1a data, the ratio of the number of estimated coefficients whose absolute value is greater than 0.1 to $p = 123$ was 0.76 for the L1 regularized logistic regression and 0.8 for the proposed method.
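A rough sketch of the synthetic-data comparison is given below. It is not the paper's experimental code: the sample sizes, the random seed, and the value 1 for the nonzero coefficients are illustrative assumptions, scikit-learn's `LogisticRegressionCV` stands in for the unspecified L1/L2 baselines, the VB estimate reuses the `sparse_vb_logistic` sketch from Section 3, and classification for the VB estimate uses the plug-in rule $x^T \bar{\beta} > 0$ rather than the predictive distribution (14).

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(0)
p, n_train, n_test = 100, 100, 1000                        # illustrative sizes, not the paper's
beta_true = np.concatenate([np.zeros(90), np.ones(10)])    # sparse truth; nonzero value assumed
X_train = rng.standard_normal((n_train, p))
X_test = rng.standard_normal((n_test, p))
y_train = rng.binomial(1, 1.0 / (1.0 + np.exp(-X_train @ beta_true)))
y_test = rng.binomial(1, 1.0 / (1.0 + np.exp(-X_test @ beta_true)))

# Proposed method: VB sketch from Section 3, evaluated by coefficient MSE and test accuracy.
mean, Sigma = sparse_vb_logistic(X_train, y_train)
results = {"Proposed": (np.mean((mean - beta_true) ** 2),
                        np.mean((X_test @ mean > 0) == y_test))}

# L1/L2 baselines with cross-validated regularization strength.
for name, penalty in [("L1", "l1"), ("L2", "l2")]:
    clf = LogisticRegressionCV(penalty=penalty, solver="liblinear",
                               fit_intercept=False, cv=5).fit(X_train, y_train)
    w = clf.coef_.ravel()
    results[name] = (np.mean((w - beta_true) ** 2), clf.score(X_test, y_test))

print(results)   # {method: (coefficient MSE, test accuracy)}
```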

References

[1] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), pages 267-288, 1996.
[2] Michael E. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1(Jun):211-244, 2001.
[3] Matthias W. Seeger and Hannes Nickisch. Large scale Bayesian inference and experimental design for sparse linear models. SIAM Journal on Imaging Sciences, 4(1):166-199, 2011.
[4] Mário A. T. Figueiredo. Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(9):1150-1159, 2003.
[5] Trevor Park and George Casella. The Bayesian lasso. Journal of the American Statistical Association, 103(482):681-686, 2008.
[6] S. Derin Babacan, Shinichi Nakajima, and Minh N. Do. Bayesian group-sparse modeling and variational inference. IEEE Transactions on Signal Processing, 62(11):2906-2921, 2014.
[7] Juan G. Serra, Pablo Ruiz, Rafael Molina, and Aggelos K. Katsaggelos. Bayesian logistic regression with sparse general representation prior for multispectral image classification. In 2016 IEEE International Conference on Image Processing (ICIP), pages 1893-1897. IEEE, 2016.
[8] T. Jaakkola and M. Jordan. A variational approach to Bayesian logistic regression models and their extensions. In Sixth International Workshop on Artificial Intelligence and Statistics, volume 82, page 4, 1997.
[9] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.