Bayesian Regression: Linear and Logistic Regression

When we want more than point estimates

Bayesian Regression: Linear and Logistic Regression. Nicole Beckage.

Ordinary least squares regression and lasso regression return only point estimates. But what if we want a full posterior over the parameters β (called w in the book) and over the estimated variance σ²? That is Bayesian linear regression.

Likelihood function: linear regression

First, assume σ² is known, so we are estimating p(w | D, σ²); in the more general case we estimate p(w, σ² | D). Recall that our linear regression model is defined as

    p(y | x, w, μ, σ²) = N(y | μ + wᵀx, σ²).

Here μ is an offset term: if the inputs are centered (or standardized) for each feature j, then our prior belief is that the mean of the output is equally likely to be positive or negative.
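To make the likelihood concrete, here is a minimal NumPy sketch (the simulated dataset, true weights, offset, and noise level are all made up for illustration) that centers the inputs and outputs and evaluates the Gaussian log-likelihood log p(D | w, σ²):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulated data: y = mu + w^T x + Gaussian noise (illustrative values)
N, D = 50, 2
true_w, true_mu, sigma = np.array([-0.3, 0.5]), 1.0, 0.2
X = rng.uniform(-1, 1, size=(N, D))
y = true_mu + X @ true_w + rng.normal(0, sigma, size=N)

# Center inputs and outputs so the offset mu drops out of the likelihood
Xc = X - X.mean(axis=0)
yc = y - y.mean()

def log_likelihood(w, Xc, yc, sigma):
    """Gaussian log-likelihood log p(D | w, sigma^2) for centered data."""
    resid = yc - Xc @ w
    return -0.5 * len(yc) * np.log(2 * np.pi * sigma**2) - 0.5 * resid @ resid / sigma**2

print(log_likelihood(true_w, Xc, yc, sigma))
```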

Likelihood function: linear regression (continued)

We want an (uninformative) prior that will not affect the mean, so we choose an improper prior on μ of the form p(μ) ∝ 1. This is for mathematical ease: we include μ and then integrate it out again. The resulting likelihood is

    p(D | w, σ²) ∝ N(y_c | Xw, σ² I),

where ȳ is the empirical mean of the output and y_c = y − ȳ1 indicates that the output vector y is centered. Now our likelihood function is free of μ.

Prior: linear regression

We now have a well-defined likelihood function. What do we choose for our prior? Conjugate priors are nice, and the conjugate prior of a normal is a normal, so let's define our prior (on the parameters) as

    p(w) = N(w | w_0, V_0).

Bayes' rule to compute the posterior

    p(w | D, σ²) = N(w | w_N, V_N),  with
    V_N = σ² (σ² V_0⁻¹ + XᵀX)⁻¹,
    w_N = V_N V_0⁻¹ w_0 + (1/σ²) V_N Xᵀ y_c.

Relationship to ridge

If we define our prior to be w_0 = 0 and V_0 = τ² I, this reduces to the ridge estimate with λ = σ²/τ². But in this case we would have the full posterior rather than just a point estimate.
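The posterior update and its ridge connection can be checked numerically. The sketch below (again with an illustrative simulated dataset; NumPy assumed, and σ and τ chosen arbitrarily) computes w_N and V_N for the prior N(0, τ²I) and verifies that the posterior mean matches the ridge solution with λ = σ²/τ²:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical centered data (values made up for illustration)
N, D, sigma, tau = 40, 3, 0.5, 1.0
X = rng.standard_normal((N, D))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, sigma, N)
Xc, yc = X - X.mean(axis=0), y - y.mean()

# Gaussian posterior p(w | D, sigma^2) = N(w | w_N, V_N) for the prior N(w | w0, V0)
w0, V0 = np.zeros(D), tau**2 * np.eye(D)
V0_inv = np.linalg.inv(V0)
V_N = sigma**2 * np.linalg.inv(sigma**2 * V0_inv + Xc.T @ Xc)
w_N = V_N @ V0_inv @ w0 + V_N @ Xc.T @ yc / sigma**2

# The ridge estimate with lambda = sigma^2 / tau^2 should equal the posterior mean
lam = sigma**2 / tau**2
w_ridge = np.linalg.solve(Xc.T @ Xc + lam * np.eye(D), Xc.T @ yc)
print(np.allclose(w_N, w_ridge))  # True
```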

Bayesian approach

Recall: posterior ∝ likelihood × prior.

A walkthrough: assume data is sampled from a model where w_0 = -0.3 and w_1 = 0.5. The possible linear fits include many different lines, with the single most probable weight vector under the prior at (0, 0). Initially we have no data and thus no likelihood; our prior is a multivariate normal centered at (0, 0).

We see one data point (the blue circle): our data space changes and our hypothesis space becomes more restricted. The dark red values in the band are the weights that would generate the observed data point in data space. Our multivariate normal is now skewed to reflect the likelihood.

With two data points the model is more sure about the valid hypotheses. The dark red values in the band are the weights that would generate both data points in data space. Our multivariate normal is now skewed and smaller to reflect the likelihood.
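The figures from this walkthrough do not survive transcription, but the updates they depict are easy to reproduce. Here is a minimal sketch (noise level, prior scale, and random seed are made up for illustration) that updates the Gaussian posterior over (w_0, w_1) one observation at a time, starting from the prior centered at (0, 0):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical setup mirroring the walkthrough: y = w0 + w1*x + noise,
# true weights (-0.3, 0.5), known noise level (illustrative values).
true_w, sigma, tau = np.array([-0.3, 0.5]), 0.2, 1.0

# Prior over (w0, w1): multivariate normal centered at (0, 0)
mean, prec = np.zeros(2), np.eye(2) / tau**2   # store the precision matrix

for n in range(1, 21):
    x = rng.uniform(-1, 1)
    phi = np.array([1.0, x])                   # design vector [1, x]
    y = true_w @ phi + rng.normal(0, sigma)

    # Conjugate update: yesterday's posterior is today's prior
    new_prec = prec + np.outer(phi, phi) / sigma**2
    mean = np.linalg.solve(new_prec, prec @ mean + phi * y / sigma**2)
    prec = new_prec
    if n in (1, 2, 20):
        print(n, mean.round(3))   # the mean should move toward (-0.3, 0.5)
```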

A walkthrough: assume data is sampled from a model where w_0 = -0.3 and w_1 = 0.5. After 20 data points we are more confident about the weights and the best-fitting line. The likelihood, reflecting the data, has high confidence that the true weights are -0.3 and 0.5, and our multivariate normal is now very close to the true value (the white cross).

Posterior predictive distribution

We can show that

    p(y | x, D, σ²) = N(y | w_Nᵀ x, σ_N²(x)),  where  σ_N²(x) = σ² + xᵀ V_N x.

So the variance depends on two terms: the variance of the observed noise, σ², and the variance in the parameters, V_N.

Posterior predictive

[Figures: predictive distributions under standard MLE (plugin) estimation vs. Bayesian estimation; here the MLE estimate coincides with the MAP estimate.] Notable feature: as we move further from the observed data, we increase our uncertainty.
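A small numerical illustration of the predictive variance formula (illustrative data and hyperparameters; NumPy assumed): the predictive standard deviation σ_N(x) grows as the query point moves away from the observed data.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical 1-D example with design vector [1, x] (illustrative values)
N, sigma, tau = 15, 0.3, 2.0
x_train = rng.uniform(-1, 1, N)
Phi = np.column_stack([np.ones(N), x_train])
y = -0.3 + 0.5 * x_train + rng.normal(0, sigma, N)

# Posterior over w with prior N(0, tau^2 I)
V_N = sigma**2 * np.linalg.inv(sigma**2 / tau**2 * np.eye(2) + Phi.T @ Phi)
w_N = V_N @ Phi.T @ y / sigma**2

# Posterior predictive: mean w_N^T x, variance sigma^2 + x^T V_N x
for x_star in (0.0, 1.0, 3.0):                 # moving away from the data
    phi = np.array([1.0, x_star])
    mean = w_N @ phi
    var = sigma**2 + phi @ V_N @ phi
    print(f"x*={x_star}: mean={mean:+.3f}, predictive std={np.sqrt(var):.3f}")
```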

Function generation with posterior predictives

[Figures: standard MLE function generation vs. Bayesian MAP function generation.]

What happens with unknown variance?

Unknown variance. We instead define the prior as

    p(w, σ²) = NIG(w, σ² | w_0, V_0, a_0, b_0),

where NIG is a normal-inverse-gamma distribution, whose pdf is

    NIG(w, σ² | w_0, V_0, a, b) = N(w | w_0, σ² V_0) · IG(σ² | a, b).

(Aside, on the similarly abbreviated normal-inverse Gaussian distribution: the normal distribution arises as a special case by setting β = 0, δ = σ²α, and letting α → ∞.)

The posterior is again normal-inverse-gamma, NIG(w, σ² | w_N, V_N, a_N, b_N), with

    V_N = (V_0⁻¹ + XᵀX)⁻¹,
    w_N = V_N (V_0⁻¹ w_0 + Xᵀ y),
    a_N = a_0 + N/2,
    b_N = b_0 + ½ (w_0ᵀ V_0⁻¹ w_0 + yᵀy − w_Nᵀ V_N⁻¹ w_N).

w_N and V_N are similar to the case where σ² is known. a_N is just an update of the counts. b_N combines the prior sum of squares b_0 and the empirical sum of squares yᵀy, plus a term due to the error in the prior on w.
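The NIG updates translate directly into code. The following sketch (with an illustrative dataset and weakly informative, made-up prior parameters w_0, V_0, a_0, b_0) computes the posterior parameters and reads off the posterior mean of σ²:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical data with unknown noise variance (illustrative values)
N, D = 60, 2
X = rng.standard_normal((N, D))
y = X @ np.array([1.5, -0.7]) + rng.normal(0, 0.4, N)

# NIG prior parameters (weakly informative; chosen for illustration)
w0, V0, a0, b0 = np.zeros(D), 10.0 * np.eye(D), 1e-2, 1e-2
V0_inv = np.linalg.inv(V0)

# Conjugate NIG posterior updates
V_N = np.linalg.inv(V0_inv + X.T @ X)
w_N = V_N @ (V0_inv @ w0 + X.T @ y)
a_N = a0 + N / 2                       # "update of the counts"
b_N = b0 + 0.5 * (w0 @ V0_inv @ w0 + y @ y - w_N @ np.linalg.inv(V_N) @ w_N)

print(w_N.round(3))                    # close to the true weights
print(b_N / (a_N - 1))                 # posterior mean of sigma^2, near 0.4^2
```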

Unknown variance (continued)

The posterior predictive distribution with unknown variance is a Student t-distribution: it looks like a normal distribution but with thicker tails.

What happens if the prior is unknown?

There is an empirical Bayes procedure for picking the hyperparameters: choose η = (α, λ) to maximize the marginal likelihood, where λ = 1/σ² is the precision of the observation noise and α is the precision of the prior p(w) = N(w | 0, α⁻¹ I). This is known as the evidence procedure.

Evidence procedure

An alternative to using cross-validation. [Figures: hyperparameter selection by 5-fold cross-validation vs. the evidence procedure.]
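As a sketch of the evidence procedure, the code below evaluates the standard closed-form log marginal likelihood of Bayesian linear regression over a grid of (α, λ) values and picks the best pair; the dataset is made up, and a crude grid search stands in for the usual iterative optimization:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical dataset (illustrative): true noise std 0.5, prior p(w) = N(0, alpha^-1 I)
N, D = 80, 3
X = rng.standard_normal((N, D))
y = X @ np.array([0.8, -1.2, 0.3]) + rng.normal(0, 0.5, N)

def log_evidence(alpha, lam, X, y):
    """Log marginal likelihood log p(y | X, alpha, lam) with prior precision alpha
    and noise precision lam = 1/sigma^2."""
    N, D = X.shape
    A = alpha * np.eye(D) + lam * X.T @ X
    m_N = lam * np.linalg.solve(A, X.T @ y)
    E = 0.5 * lam * np.sum((y - X @ m_N) ** 2) + 0.5 * alpha * m_N @ m_N
    _, logdetA = np.linalg.slogdet(A)
    return (0.5 * D * np.log(alpha) + 0.5 * N * np.log(lam) - E
            - 0.5 * logdetA - 0.5 * N * np.log(2 * np.pi))

# Crude grid search standing in for the evidence procedure's optimization
grid = np.logspace(-2, 2, 30)
best = max((log_evidence(a, l, X, y), a, l) for a in grid for l in grid)
print(f"best alpha={best[1]:.3f}, best lam={best[2]:.3f}")  # lam near 1/0.5**2 = 4
```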

Bayesian Logistic Regression

Bayesian approach. Recall logistic regression:

    p(y = 1 | x, w) = sigm(wᵀx) = 1 / (1 + e^(−wᵀx)).

Approximate inference

We may want to compute the full posterior over the parameters, p(w | D). This would allow us to associate confidence intervals with our predictions; applications include bandit problems. But there is no convenient conjugate prior with which to compute the posterior exactly. A simple approximation follows, known as the Laplace approximation (also called the saddle-point approximation). Other solutions include Markov chain Monte Carlo (MCMC), variational inference, and expectation propagation.
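As an illustration of one of the alternatives listed above, here is a minimal random-walk Metropolis (MCMC) sketch for the Bayesian logistic regression posterior; the data, prior variance, proposal scale, and chain length are all chosen ad hoc for illustration:

```python
import numpy as np
from scipy.special import expit   # the logistic sigmoid

rng = np.random.default_rng(6)

# Hypothetical binary data from a logistic model (illustrative values)
N, D = 100, 2
X = rng.standard_normal((N, D))
true_w = np.array([2.0, -1.0])
y = rng.binomial(1, expit(X @ true_w))

def log_joint(w):
    """log p(D, w): Bernoulli log-likelihood plus a Gaussian prior N(0, 10 I)."""
    logits = X @ w
    loglik = np.sum(y * logits - np.log1p(np.exp(logits)))
    return loglik - 0.5 * w @ w / 10.0

# Random-walk Metropolis over w
w, samples = np.zeros(D), []
for t in range(5000):
    prop = w + 0.3 * rng.standard_normal(D)
    if np.log(rng.uniform()) < log_joint(prop) - log_joint(w):
        w = prop
    samples.append(w)

print(np.mean(samples[1000:], axis=0).round(2))   # posterior mean roughly near (2, -1)
```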

Laplace approximation

A Gaussian approximation to the posterior.

Taylor series

Let θ ∈ R^D and

    p(θ | D) = (1/Z) e^(−E(θ)),

where E(θ) is an energy function. For this approximation we take E(θ) = −log p(θ, D), so that Z = p(D).

[Figure: sin x and its Taylor approximations, polynomials of degree 1, 3, 5, 7, 9, 11 and 13; as the degree of the Taylor polynomial rises, it approaches the correct function over a wider range.]

Applying a Taylor series: Laplace approximation

Expand around the mode θ* of the posterior. Recall that the mode is the maximum (the most probable value) of the posterior distribution; it is also the lowest-energy state under our definition E(θ) = −log p(θ, D):

    E(θ) ≈ E(θ*) + (θ − θ*)ᵀ g + ½ (θ − θ*)ᵀ H (θ − θ*),

where g = ∇E(θ)|_{θ*} is the gradient and H = ∇²E(θ)|_{θ*} is the Hessian of the energy function, both evaluated at the mode.
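A one-dimensional Laplace approximation can be checked end to end against a distribution whose normalizer is known. The sketch below (a Gamma-shaped target chosen purely for illustration) finds the mode numerically, approximates the second derivative by finite differences, and compares the Laplace estimate of log Z with the exact value:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln

# Illustrative target: an unnormalized Gamma(a, b) density, whose true
# normalizer Gamma(a) / b^a is known, so we can check the Laplace estimate.
a, b = 6.0, 2.0

def E(t):
    """Energy = -log of the unnormalized density t^(a-1) * exp(-b t)."""
    return -((a - 1) * np.log(t) - b * t)

# Mode of the distribution = minimum of the energy
t_star = minimize_scalar(E, bounds=(1e-6, 50.0), method="bounded").x

# Second derivative (the 1-D Hessian) of the energy at the mode, by finite differences
h = 1e-4
H = (E(t_star + h) - 2 * E(t_star) + E(t_star - h)) / h**2

# Laplace approximation to the normalizer: Z ~ exp(-E(theta*)) * sqrt(2*pi / H)
logZ_laplace = -E(t_star) + 0.5 * np.log(2 * np.pi / H)
logZ_true = gammaln(a) - a * np.log(b)
print(logZ_laplace, logZ_true)   # close but not identical
```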

Laplace approximation, continued

Since θ* is the mode, the gradient term is zero. Why? Because it is the maximum of the distribution. Hence

    p(θ | D) ≈ (1/Z) e^(−E(θ*)) exp[−½ (θ − θ*)ᵀ H (θ − θ*)],
    Z = p(D) ≈ e^(−E(θ*)) (2π)^(D/2) |H|^(−1/2).

This is known as the Laplace approximation to the marginal likelihood. So the Gaussian N(θ | θ*, H⁻¹) is an approximation to the posterior. As the sample size increases, the posterior starts to look more and more like a Gaussian, and thus this is commonly referred to as a Gaussian approximation.

Gaussian approximation

Applied to logistic regression, our posterior becomes

    p(w | D) ≈ N(w | ŵ, H⁻¹),  where
    ŵ = argmin_w E(w),
    E(w) = −(log p(D | w) + log p(w)),
    H = ∇²E(w)|_{w=ŵ}.

If we have linearly separable data, the MLE estimate is not well defined, since ||w|| can grow arbitrarily large; the prior p(w) keeps the MAP estimate ŵ finite.
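Putting the pieces together for logistic regression: the sketch below (illustrative data; prior covariance V_0 = 10 I chosen arbitrarily) finds the MAP estimate ŵ by minimizing the energy, forms the Hessian at the mode, and reports the Gaussian approximation N(ŵ, H⁻¹):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(7)

# Hypothetical binary data (illustrative), prior p(w) = N(0, V0) with V0 = 10 I
N, D = 200, 2
X = rng.standard_normal((N, D))
y = rng.binomial(1, expit(X @ np.array([2.0, -1.0])))
V0_inv = np.eye(D) / 10.0

def energy(w):
    """E(w) = -(log p(D|w) + log p(w)), up to a w-independent constant."""
    logits = X @ w
    return -np.sum(y * logits - np.log1p(np.exp(logits))) + 0.5 * w @ V0_inv @ w

# Mode of the posterior (the MAP estimate)
w_hat = minimize(energy, np.zeros(D)).x

# Hessian of the energy at the mode: sum_i mu_i (1 - mu_i) x_i x_i^T + V0^{-1}
mu = expit(X @ w_hat)
H = (X * (mu * (1 - mu))[:, None]).T @ X + V0_inv

# Laplace / Gaussian approximation: p(w | D) ~ N(w_hat, H^{-1})
cov = np.linalg.inv(H)
print(w_hat.round(2), np.sqrt(np.diag(cov)).round(2))
```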

Unnormalized vs. Laplace approximation

[Figure: the unnormalized posterior compared with its Gaussian (Laplace) approximation.]
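The comparison this figure shows can be reproduced on a toy problem. The sketch below uses a deliberately skewed one-dimensional posterior (8 successes in 10 Bernoulli trials on the log-odds scale, with a broad made-up prior) and overlays its Laplace approximation on a grid:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import expit

# Illustrative 1-D posterior: log-odds theta for 8 successes in 10 Bernoulli
# trials with a broad N(0, 100) prior, so the exact posterior is visibly skewed.
s, n, prior_var = 8, 10, 100.0

def E(t):
    """Energy = -(Bernoulli log-likelihood + Gaussian log-prior), up to a constant."""
    return -(s * t - n * np.log1p(np.exp(t))) + 0.5 * t**2 / prior_var

# Laplace approximation: Gaussian at the mode with variance 1 / E''(mode)
t_hat = minimize_scalar(E, bounds=(-10, 10), method="bounded").x
mu = expit(t_hat)
H = n * mu * (1 - mu) + 1.0 / prior_var

# Compare the numerically normalized posterior with the Laplace Gaussian on a grid
grid = np.linspace(-2.0, 6.0, 400)
post = np.exp(-E(grid))
post /= post.sum() * (grid[1] - grid[0])          # normalize by a Riemann sum
laplace = np.sqrt(H / (2 * np.pi)) * np.exp(-0.5 * H * (grid - t_hat) ** 2)
print(f"mode = {t_hat:.2f}, largest density gap = {np.max(np.abs(post - laplace)):.3f}")
```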