Variational Bayesian Logistic Regression

Variational Bayesian Logistic Regression. Sargur N. Srihari, University at Buffalo, State University of New York, USA

Topics in Linear Models for Classification: Overview
1. Discriminant Functions
2. Probabilistic Generative Models
3. Probabilistic Discriminative Models
4. The Laplace Approximation
5.1 Bayesian Logistic Regression
5.2 Variational Bayesian Logistic Regression

Topics in Variational Bayesian Logistic Regression
- Bayesian Logistic Regression Posterior
- K-L Divergence
- Variational Inference
- Stochastic Variational Inference

Variational Logistic Regression
- Variational: based on the calculus of variations, which studies how a functional changes as its input function is varied (the functional derivative)
- A functional takes a function as input and returns a value
- The resulting approximation is more flexible than the Laplace approximation because it has additional variational parameters
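To make the idea of a functional concrete, here is a standard example (not from the slides): the entropy of a distribution maps the whole function $q$ to a single number; the lower bound $\mathcal{L}(q)$ introduced later is likewise a functional of $q$.

$$H[q] = -\int q(w)\,\ln q(w)\,dw$$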

Example of Variational Method
[Figure: an original distribution together with its Laplace approximation and a variational approximation, shown both directly and as negative logarithms.]
- The variational distribution is Gaussian, optimized with respect to its mean and variance

Bayesian Logistic Regression Posterior
- Data: $D = \{(x^{(1)}, t_1), \ldots, (x^{(N)}, t_N)\}$, $N$ i.i.d. samples with $t_n \in \{0,1\}$
- Parameters: $w = \{w_1, \ldots, w_M\}$
- The probabilistic model specifies the joint distribution $p(D,w) = p(D|w)\,p(w)$, which is a product of a sigmoid likelihood and a Gaussian prior:
$$p(D|w) = \prod_{n=1}^{N} y_n^{t_n}\,\{1 - y_n\}^{1 - t_n}, \qquad y_n = p(C_1|x^{(n)}) = \sigma(w^T x^{(n)}), \qquad p(w) = \mathcal{N}(w\,|\,m_0, S_0)$$
- The goal is to approximate the posterior distribution
$$p(w|D) = \frac{p(D,w)}{p(D)} = \frac{p(D|w)\,p(w)}{p(D)}$$
  which is also a normalized product of the sigmoid likelihood and the Gaussian, with a Gaussian $q(w)$, so we can use it in prediction:
$$p(C_1|x) \simeq \int \sigma(w^T x)\, q(w)\, dw$$
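As a concrete illustration of this setup, here is a minimal sketch (not part of the original slides) that evaluates the unnormalized log posterior $\log p(D,w) = \log p(D|w) + \log p(w)$; the names X, t, m0, S0 are assumed inputs (N x M design matrix, labels in {0,1}, prior mean, prior covariance).

```python
# Minimal sketch: unnormalized log posterior for Bayesian logistic regression.
import numpy as np
from scipy.stats import multivariate_normal

def log_joint(w, X, t, m0, S0):
    a = X @ w                                   # activations w^T x^(n)
    log_y = -np.logaddexp(0.0, -a)              # log sigma(a), computed stably
    log_1my = -np.logaddexp(0.0, a)             # log (1 - sigma(a))
    log_lik = np.sum(t * log_y + (1 - t) * log_1my)
    log_prior = multivariate_normal.logpdf(w, mean=m0, cov=S0)
    return log_lik + log_prior                  # log p(D|w) + log p(w)
```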

Decomposition of Log Marginal Probability
$$\ln p(D) = \mathcal{L}(q) + KL(q\|p)$$
where
$$\mathcal{L}(q) = \int q(w)\,\ln\frac{p(D,w)}{q(w)}\,dw \qquad\text{and}\qquad KL(q\|p) = -\int q(w)\,\ln\frac{p(w|D)}{q(w)}\,dw$$
$\mathcal{L}(q)$ is a functional.
- $\ln p(D)$ is called the evidence; it is fixed and does not depend on $q$
- We want to minimize the K-L divergence over a family of distributions $q(w)$, which finds the $q(w)$ that best approximates $p(w|D)$
Some observations:
- $\mathcal{L}(q)$ is a lower bound on $\ln p(D)$
- Maximizing the lower bound $\mathcal{L}(q)$ with respect to the distribution $q(w)$ is equivalent to minimizing the KL divergence
- When the KL divergence vanishes, $q(w)$ equals the posterior $p(w|D)$
Plan:
- We seek the distribution $q(w)$ for which $\mathcal{L}(q)$ is largest
- We consider a restricted family for $q(w)$ and seek the member of this family for which the KL divergence is minimized
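The decomposition can be verified directly by adding the two terms (a short check, not on the slide):

$$\mathcal{L}(q) + KL(q\|p) = \int q(w)\,\ln\frac{p(D,w)}{q(w)}\,dw + \int q(w)\,\ln\frac{q(w)}{p(w|D)}\,dw = \int q(w)\,\ln\frac{p(D,w)}{p(w|D)}\,dw = \ln p(D)$$

since $p(D,w)/p(w|D) = p(D)$ does not depend on $w$ and $q(w)$ integrates to one.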

Variational Inference
- We want to form our Gaussian approximation $q(w;\mu,\Sigma) = \mathcal{N}(w;\mu,\Sigma)$ to the posterior $p(w|D)$ by minimizing the KL divergence
$$D_{KL}\big(q(w;\mu,\Sigma)\,\|\,p(w|D)\big) = \mathbb{E}_{\mathcal{N}(w;\mu,\Sigma)}\!\left[\log\frac{\mathcal{N}(w;\mu,\Sigma)}{p(w|D)}\right]$$
- We fit the variational parameters $\mu$ and $\Sigma$, not the original parameters $w$
- There is an interpretation that $\mu$ is an estimate of $w$, while $\Sigma$ indicates a credible range of where the parameters could plausibly be around this estimate

Criterion to Minimize
- As we can only evaluate the posterior up to a constant, we write
$$D_{KL}(q\|p) = \mathbb{E}_{\mathcal{N}(w;\mu,\Sigma)}\!\left[\log\frac{\mathcal{N}(w;\mu,\Sigma)}{p(w)\,p(D|w)}\right] + \log p(D) = J + \log p(D)$$
  and minimize $J$; here we have used $p(w|D) = p(w,D)/p(D)$
- Because $D_{KL} \geq 0$ we obtain a lower bound on the model likelihood: $\log p(D) \geq -J$, where $-J$ is called the evidence lower bound (ELBO)
- For the logistic regression model and prior,
$$J = \mathbb{E}_{\mathcal{N}(w;\mu,\Sigma)}\!\left[\log\mathcal{N}(w;\mu,\Sigma) - \log p(w) - \sum_{n=1}^{N}\log\sigma(w^T x^{(n)})\right]$$
  which we wish to minimize
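The first two terms of $J$ together form the KL divergence from $q(w)$ to the prior, which has a closed form for two Gaussians (a standard identity, stated here for reference; it anticipates the "computed analytically" remark on the next slide):

$$\mathbb{E}_{\mathcal{N}(w;\mu,\Sigma)}\big[\log\mathcal{N}(w;\mu,\Sigma) - \log p(w)\big] = KL\big(\mathcal{N}(\mu,\Sigma)\,\|\,\mathcal{N}(m_0,S_0)\big) = \tfrac{1}{2}\!\left[\operatorname{tr}\!\big(S_0^{-1}\Sigma\big) + (m_0-\mu)^T S_0^{-1}(m_0-\mu) - M + \ln\frac{|S_0|}{|\Sigma|}\right]$$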

Method of Optimization
$$J = \mathbb{E}_{\mathcal{N}(w;\mu,\Sigma)}\!\left[\log\mathcal{N}(w;\mu,\Sigma) - \log p(w) - \sum_{n=1}^{N}\log\sigma(w^T x^{(n)})\right]$$
- We could evaluate this expression numerically
- The first two expectations can be computed analytically, and each of the remaining N terms in the sum can be reduced to a one-dimensional integral
- We could similarly evaluate the derivatives with respect to $\mu$ and $\Sigma$, and fit the variational parameters with a gradient-based optimizer
- However, in the inner loop each function evaluation would require N numerical integrations, or a further approximation (since it is the expectation of a summation)
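As a sketch of the one-dimensional reduction (not from the slides): each term $\mathbb{E}_{\mathcal{N}(w;\mu,\Sigma)}[\log\sigma(w^T x^{(n)})]$ depends on $w$ only through $a = w^T x^{(n)}$, which is Gaussian with mean $\mu^T x^{(n)}$ and variance $x^{(n)T}\Sigma\, x^{(n)}$, so the M-dimensional expectation can be approximated by 1-D Gauss-Hermite quadrature; mu, Sigma, and x are assumed inputs.

```python
import numpy as np

def expected_log_sigmoid(mu, Sigma, x, n_points=50):
    m = x @ mu                       # mean of a = w^T x
    s = np.sqrt(x @ Sigma @ x)       # standard deviation of a
    # probabilists' Gauss-Hermite rule: int f(z) exp(-z^2/2) dz ~ sum w_i f(z_i)
    nodes, weights = np.polynomial.hermite_e.hermegauss(n_points)
    a = m + s * nodes
    log_sig = -np.logaddexp(0.0, -a)                 # log sigma(a)
    return np.sum(weights * log_sig) / np.sqrt(2 * np.pi)
```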

Stochastic Variational Inference
- We can avoid numerical integration with a simple Monte Carlo estimate of the expectation:
$$J \approx \frac{1}{S}\sum_{s=1}^{S}\left[\log\mathcal{N}(w^{(s)};\mu,\Sigma) - \log p(w^{(s)}) - \sum_{n=1}^{N}\log\sigma(w^{(s)T} x^{(n)})\right], \qquad w^{(s)} \sim \mathcal{N}(\mu,\Sigma)$$
- A really cheap estimate is to use one sample
- We can further form an unbiased estimate of the sum over data points by randomly picking one term:
$$J \approx \log\mathcal{N}(w^{(s)};\mu,\Sigma) - \log p(w^{(s)}) - N\log\sigma(w^{(s)T} x^{(n)}), \qquad n \sim \mathrm{Uniform}[1,\ldots,N],\ \ w^{(s)} \sim \mathcal{N}(\mu,\Sigma)$$
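A minimal sketch of the single-sample, single-datapoint estimate above (assumptions: X is the N x M design matrix with class labels folded into the inputs as in the slides' likelihood, and log_prior is a hypothetical callable returning log p(w)):

```python
import numpy as np
from scipy.stats import multivariate_normal

def estimate_J(mu, Sigma, X, log_prior, rng):
    """One-sample, one-datapoint Monte Carlo estimate of J."""
    N = X.shape[0]
    w = rng.multivariate_normal(mu, Sigma)          # w^(s) ~ N(mu, Sigma)
    n = rng.integers(N)                             # n ~ Uniform[1..N]
    log_q = multivariate_normal.logpdf(w, mean=mu, cov=Sigma)
    log_sig_n = -np.logaddexp(0.0, -(w @ X[n]))     # log sigma(w^(s)T x^(n))
    return log_q - log_prior(w) - N * log_sig_n
```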

Removing Dependence on µ and Σ
- The goal is to move the variational parameters $\mu$ and $\Sigma$, which express which weights are plausible, so that the cost $J$ gets smaller (on average)
- If the variational parameters change, the samples we would draw change, which complicates our reasoning
- We can remove the variational parameters from the random draws by writing down how a Gaussian random generator works:
$$w^{(s)} = \mu + \Sigma^{1/2}\nu, \qquad \nu \sim \mathcal{N}(0, I)$$
  where $\Sigma^{1/2}$ is a matrix square root, such as the Cholesky factor
- Then our cost function becomes
$$J \approx \log\mathcal{N}(\mu + \Sigma^{1/2}\nu;\,\mu,\Sigma) - \log p(\mu + \Sigma^{1/2}\nu) - N\log\sigma\big((\mu + \Sigma^{1/2}\nu)^T x^{(n)}\big), \qquad n \sim \mathrm{Uniform}[1,\ldots,N],\ \ \nu \sim \mathcal{N}(0, I)$$
  where all dependence on the variational parameters $\mu$ and $\Sigma$ is now in the estimator itself
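The same estimate written with the reparameterization, so that the random draw no longer depends on µ and Σ (a sketch; L is the lower-triangular Cholesky factor with Σ = LLᵀ, and log_prior is again a hypothetical callable):

```python
import numpy as np

def estimate_J_reparam(mu, L, X, log_prior, rng):
    """Reparameterized one-sample estimate of J; dependence on mu and L is explicit."""
    N, M = X.shape
    nu = rng.standard_normal(M)                     # nu ~ N(0, I)
    w = mu + L @ nu                                 # w = mu + Sigma^{1/2} nu
    n = rng.integers(N)                             # n ~ Uniform[1..N]
    # log N(w; mu, Sigma) evaluated at w = mu + L nu simplifies to:
    log_q = -0.5 * nu @ nu - np.sum(np.log(np.diag(L))) - 0.5 * M * np.log(2 * np.pi)
    log_sig_n = -np.logaddexp(0.0, -(w @ X[n]))     # log sigma(w^T x^(n))
    return log_q - log_prior(w) - N * log_sig_n
```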

Minimizing the Cost Function J
- We can minimize the cost function $J$ by stochastic gradient descent
- We draw a datapoint $n$ at random and some Gaussian white noise $\nu$, and evaluate $J$ and its derivatives with respect to $\mu$ and $\Sigma$
- We then make a small change to $\mu$ and $\Sigma$ to improve the current estimate of $J$, and repeat
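Putting the pieces together, a minimal stochastic-gradient sketch (assumptions, not from the slides: a zero-mean isotropic Gaussian prior N(0, s0²I), the slides' form of the likelihood with class labels folded into x^(n), gradients of the single-sample reparameterized estimate derived by hand, and illustrative, untuned learning rate and step count):

```python
import numpy as np
from scipy.special import expit

def fit_variational_logreg(X, s0=10.0, lr=1e-3, n_steps=20000, seed=0):
    """Stochastic variational inference for logistic regression by SGD on (mu, L)."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    mu = np.zeros(M)
    L = np.eye(M)                                   # Sigma = L L^T
    for _ in range(n_steps):
        nu = rng.standard_normal(M)                 # nu ~ N(0, I)
        w = mu + L @ nu                             # reparameterized sample
        n = rng.integers(N)                         # random datapoint
        a = w @ X[n]
        # gradient of [-log p(w) - N log sigma(w^T x^(n))] with respect to w
        grad_w = w / s0**2 - N * expit(-a) * X[n]
        # chain rule: dw/dmu = I; the log q term does not depend on mu
        grad_mu = grad_w
        # chain rule through w = mu + L nu, plus the -sum(log diag L) term of log q
        grad_L = np.tril(np.outer(grad_w, nu)) - np.diag(1.0 / np.diag(L))
        mu -= lr * grad_mu
        L -= lr * grad_L
        np.fill_diagonal(L, np.maximum(np.diag(L), 1e-6))   # keep L a valid factor
    return mu, L
```

Parameterizing the covariance through its Cholesky factor L keeps Σ = LLᵀ positive semi-definite throughout the optimization.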

Applicability to Neural Networks
- While some of the fine details are slightly complicated, none of them depend on the fact that we were considering logistic regression
- We could easily replace the logistic function $\sigma(\cdot)$ with another model likelihood, such as one from a neural network
- As long as we can differentiate the log-likelihood, we can apply stochastic variational inference

Example of Variational Bayes Logistic Regression
[Figure: a linearly separable data set, showing the predictive distribution with contours of $p(C_1|\phi)$ plotted from 0.01 to 0.99, and decision boundaries corresponding to five samples of the parameter vector $w$.]
- The large-margin solution of the SVM is qualitatively similar to the Bayesian solution