An Introduction to Bayesian Linear Regression


APPM 5720: Bayesian Computation, Fall 2018

A SIMPLE LINEAR MODEL

Suppose that we observe explanatory variables $x_1, x_2, \ldots, x_n$ and dependent variables $y_1, y_2, \ldots, y_n$. Assume they are related through the very simple linear model
$$y_i = \beta x_i + \varepsilon_i, \qquad i = 1, 2, \ldots, n,$$
with $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n$ being realizations of iid $N(0, \sigma^2)$ random variables.

A SIMPLE LINEAR MODEL

$$y_i = \beta x_i + \varepsilon_i, \qquad i = 1, 2, \ldots, n$$
The $x_i$ can either be constants or realizations of random variables. In the latter case, assume that they have joint pdf $f(\mathbf{x} \mid \theta)$, where $\theta$ is a parameter (or vector of parameters) that is unrelated to $\beta$ and $\sigma^2$. The likelihood for this model is
$$f(\mathbf{y}, \mathbf{x} \mid \beta, \sigma^2, \theta) = f(\mathbf{y} \mid \mathbf{x}, \beta, \sigma^2, \theta)\, f(\mathbf{x} \mid \beta, \sigma^2, \theta) = f(\mathbf{y} \mid \mathbf{x}, \beta, \sigma^2)\, f(\mathbf{x} \mid \theta).$$

A SIMPLE LINEAR MODEL

Assume that the $x_i$ are fixed. The likelihood for the model is then $f(\mathbf{y} \mid \mathbf{x}, \beta, \sigma^2)$. The goal is to estimate and make inferences about the parameters $\beta$ and $\sigma^2$.

Frequentist Approach: Ordinary Least Squares (OLS). Each $y_i$ is supposed to be $\beta$ times $x_i$ plus some residual noise. The noise, modeled by a normal distribution, is observed as $y_i - \beta x_i$. Take $\hat\beta$ to be the minimizer of the sum of squared errors
$$\sum_{i=1}^{n} (y_i - \beta x_i)^2.$$

A SIMPLE LINEAR MODEL

$$\hat\beta = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}$$
Now for the randomness. Consider $Y_i = \beta x_i + Z_i$, $i = 1, 2, \ldots, n$, for $Z_i \stackrel{iid}{\sim} N(0, \sigma^2)$. Then $Y_i \sim N(\beta x_i, \sigma^2)$ and
$$\hat\beta = \sum_{i=1}^{n} \left( \frac{x_i}{\sum_j x_j^2} \right) Y_i \sim N\!\left( \beta,\ \sigma^2 \Big/ \textstyle\sum_j x_j^2 \right).$$

A SIMPLE LINEAR MODEL

If we predict each $y_i$ to be $\hat y_i := \hat\beta x_i$, we can define the sum of squared errors to be
$$\mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat y_i)^2 = \sum_{i=1}^{n} (y_i - \hat\beta x_i)^2.$$
We can then estimate the noise variance $\sigma^2$ by the average sum of squared errors $\mathrm{SSE}/n$ or, better yet, we can adjust the denominator slightly to get the unbiased estimator
$$\hat\sigma^2 = \frac{\mathrm{SSE}}{n-1}.$$
This quantity is known as the mean squared error (MSE) and will also be denoted by $s^2$.
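To make these OLS formulas concrete, here is a minimal Python sketch on simulated data; the sample size, true slope, and noise level below are invented purely for illustration.

```python
import numpy as np

# Simulated data, purely for illustration (sample size, slope, and noise level are made up)
rng = np.random.default_rng(0)
n, beta_true, sigma = 50, 2.0, 1.5
x = rng.uniform(0.0, 10.0, size=n)
y = beta_true * x + rng.normal(0.0, sigma, size=n)

# OLS estimate for the no-intercept model y_i = beta * x_i + eps_i
beta_hat = np.sum(x * y) / np.sum(x ** 2)

# Sum of squared errors and the unbiased variance estimate s^2 = SSE / (n - 1)
sse = np.sum((y - beta_hat * x) ** 2)
s2 = sse / (n - 1)

print(f"beta_hat = {beta_hat:.3f}, s^2 = {s2:.3f}")
```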

THE BAYESIAN APPROACH:

$$Y_i = \beta x_i + Z_i, \qquad Z_i \stackrel{iid}{\sim} N(0, \sigma^2)$$
$$f(y_i \mid \beta, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left[ -\frac{1}{2\sigma^2} (y_i - \beta x_i)^2 \right]$$
$$f(\mathbf{y} \mid \beta, \sigma^2) = \left(2\pi\sigma^2\right)^{-n/2} \exp\!\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \beta x_i)^2 \right]$$
It will be convenient to write this in terms of the OLS estimators
$$\hat\beta = \frac{\sum_i x_i y_i}{\sum_i x_i^2}, \qquad s^2 = \frac{\sum_{i=1}^{n} (y_i - \hat\beta x_i)^2}{n-1}.$$

THE BAYESIAN APPROACH:

Then
$$\sum_{i=1}^{n} (y_i - \beta x_i)^2 = \nu s^2 + (\beta - \hat\beta)^2 \sum_{i=1}^{n} x_i^2,$$
where $\nu := n - 1$. It will also be convenient to work with the precision parameter $\tau := 1/\sigma^2$. Then
$$f(\mathbf{y} \mid \beta, \tau) = (2\pi)^{-n/2} \left\{ \tau^{1/2} \exp\!\left[ -\frac{\tau}{2} (\beta - \hat\beta)^2 \sum_{i=1}^{n} x_i^2 \right] \right\} \left\{ \tau^{\nu/2} \exp\!\left[ -\frac{\tau \nu s^2}{2} \right] \right\}.$$
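The decomposition above is easy to verify numerically. A small sketch, again with simulated data and an arbitrarily chosen value of $\beta$:

```python
import numpy as np

# Simulated data (values are illustrative only)
rng = np.random.default_rng(1)
n = 30
x = rng.normal(size=n)
y = 1.7 * x + rng.normal(size=n)

beta_hat = np.sum(x * y) / np.sum(x ** 2)
nu = n - 1
s2 = np.sum((y - beta_hat * x) ** 2) / nu

# Check both sides of the identity for an arbitrary value of beta
beta = 0.4
lhs = np.sum((y - beta * x) ** 2)
rhs = nu * s2 + (beta - beta_hat) ** 2 * np.sum(x ** 2)
print(np.isclose(lhs, rhs))  # True
```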

THE BAYESIAN APPROACH:

$$\tau^{1/2} \exp\!\left[ -\frac{\tau}{2} (\beta - \hat\beta)^2 \sum_{i=1}^{n} x_i^2 \right] \quad \text{looks normal as a function of } \beta$$
$$\tau^{\nu/2} \exp\!\left[ -\frac{\tau \nu s^2}{2} \right] \quad \text{looks gamma as a function of } \tau \text{ (inverse gamma as a function of } \sigma^2)$$
The natural conjugate prior for $(\beta, \sigma^2)$ will be a normal-inverse-gamma.

THE BAYESIAN APPROACH:

So many symbols... we will use underbars and overbars for prior and posterior hyperparameters, and also add a little more structure.

Priors:
$$\beta \mid \tau \sim N(\underline\beta, \underline{c} / \tau), \qquad \tau \sim \Gamma(\underline\nu / 2, \underline\nu\, \underline{s}^2 / 2)$$
We will write $(\beta, \tau) \sim NG(\underline\beta, \underline{c}, \underline\nu / 2, \underline\nu\, \underline{s}^2 / 2)$.

THE BAYESIAN APPROACH:

It is routine to show that the posterior is
$$(\beta, \tau) \mid \mathbf{y} \sim NG(\overline\beta, \overline{c}, \overline\nu / 2, \overline\nu\, \overline{s}^2 / 2)$$
where
$$\overline{c} = \left[ 1/\underline{c} + \sum x_i^2 \right]^{-1}, \qquad \overline\beta = \overline{c} \left( \underline{c}^{-1} \underline\beta + \hat\beta \sum x_i^2 \right),$$
$$\overline\nu = \underline\nu + n, \qquad \overline\nu\, \overline{s}^2 = \underline\nu\, \underline{s}^2 + \nu s^2 + \frac{(\hat\beta - \underline\beta)^2}{\underline{c} + \left( \sum x_i^2 \right)^{-1}}.$$
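A minimal Python sketch of this conjugate update follows; the function name ng_posterior and the example data and prior values are my own choices for illustration, not part of the original slides.

```python
import numpy as np

def ng_posterior(x, y, beta_prior, c_prior, nu_prior, s2_prior):
    """Posterior hyperparameters for the conjugate model
    beta | tau ~ N(beta_prior, c_prior / tau), tau ~ Gamma(nu_prior/2, nu_prior*s2_prior/2)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(y)
    sxx = np.sum(x ** 2)
    beta_hat = np.sum(x * y) / sxx                        # OLS estimate
    nu = n - 1
    s2 = np.sum((y - beta_hat * x) ** 2) / nu             # OLS s^2

    c_post = 1.0 / (1.0 / c_prior + sxx)
    beta_post = c_post * (beta_prior / c_prior + beta_hat * sxx)
    nu_post = nu_prior + n
    nu_s2_post = nu_prior * s2_prior + nu * s2 + (beta_hat - beta_prior) ** 2 / (c_prior + 1.0 / sxx)
    return beta_post, c_post, nu_post, nu_s2_post / nu_post   # last entry is s2_post

# Example with made-up data and a weakly informative prior
rng = np.random.default_rng(42)
x = rng.uniform(0.0, 5.0, size=40)
y = 1.2 * x + rng.normal(scale=0.8, size=40)
print(ng_posterior(x, y, beta_prior=0.0, c_prior=10.0, nu_prior=2.0, s2_prior=1.0))
```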

ESTIMATING β AND σ²:

The posterior Bayes estimator for $\beta$ is $E[\beta \mid \mathbf{y}]$. A measure of uncertainty of the estimator is given by the posterior variance $\mathrm{Var}[\beta \mid \mathbf{y}]$. We need to write down the $NG(\overline\beta, \overline{c}, \overline\nu / 2, \overline\nu\, \overline{s}^2 / 2)$ pdf for $(\beta, \tau) \mid \mathbf{y}$ and integrate out $\tau$. The result is that $\beta \mid \mathbf{y}$ has a generalized t-distribution. (This is not exactly the same as a non-central t.)

THE MULTIVARIATE t-DISTRIBUTION:

We say that a k-dimensional random vector $\mathbf{X}$ has a multivariate t-distribution with mean $\boldsymbol\mu$, variance-covariance matrix parameter $V$, and $\nu$ degrees of freedom if $\mathbf{X}$ has pdf
$$f(\mathbf{x} \mid \boldsymbol\mu, V, \nu) = \frac{\nu^{\nu/2}\, \Gamma\!\left( \frac{\nu + k}{2} \right)}{\pi^{k/2}\, \Gamma\!\left( \frac{\nu}{2} \right) |V|^{1/2}} \left[ (\mathbf{x} - \boldsymbol\mu)^t V^{-1} (\mathbf{x} - \boldsymbol\mu) + \nu \right]^{-\frac{\nu + k}{2}}.$$
We will write $\mathbf{X} \sim t(\boldsymbol\mu, V, \nu)$.

THE MULTIVARIATE t-DISTRIBUTION:

With $k = 1$, $\mu = 0$, and $V = 1$, we get the usual t-distribution.

Marginals: if $\mathbf{X} = \begin{pmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \end{pmatrix}$, then $\mathbf{X}_i \sim t(\boldsymbol\mu_i, V_i, \nu)$, where $\boldsymbol\mu_i$ and $V_i$ are the corresponding blocks of $\boldsymbol\mu$ and $V$. Conditionals such as $\mathbf{X}_1 \mid \mathbf{X}_2$ are also multivariate t.
$$E[\mathbf{X}] = \boldsymbol\mu \ \text{ if } \nu > 1, \qquad \mathrm{Var}[\mathbf{X}] = \frac{\nu}{\nu - 2}\, V \ \text{ if } \nu > 2.$$

BACK TO THE REGRESSION PROBLEM:

Can show that
$$\beta \mid \mathbf{y} \sim t(\overline\beta, \overline{c}\, \overline{s}^2, \overline\nu).$$
So, the PBE is $E[\beta \mid \mathbf{y}] = \overline\beta$ and the posterior variance is
$$\mathrm{Var}[\beta \mid \mathbf{y}] = \frac{\overline\nu}{\overline\nu - 2}\, \overline{c}\, \overline{s}^2.$$
Also can show that $\tau \mid \mathbf{y} \sim \Gamma(\overline\nu / 2, \overline\nu\, \overline{s}^2 / 2)$. So,
$$E[\tau \mid \mathbf{y}] = 1/\overline{s}^2, \qquad \mathrm{Var}[\tau \mid \mathbf{y}] = 2 / \left( \overline\nu\, (\overline{s}^2)^2 \right).$$
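Given posterior hyperparameters such as those returned by the ng_posterior sketch above, these posterior summaries and a credible interval can be computed directly from a shifted-and-scaled Student t and a gamma distribution; the numerical values below are illustrative placeholders, not results from the slides.

```python
import numpy as np
from scipy import stats

# Posterior hyperparameters, e.g. from the ng_posterior sketch above (numbers here are illustrative)
beta_post, c_post, nu_post, s2_post = 1.18, 0.004, 42.0, 0.66

# beta | y ~ t(beta_post, c_post * s2_post, nu_post): a shifted and scaled Student t
post_beta = stats.t(df=nu_post, loc=beta_post, scale=np.sqrt(c_post * s2_post))
print("posterior mean:", beta_post)
print("posterior var :", nu_post / (nu_post - 2) * c_post * s2_post)
print("95% credible interval:", post_beta.interval(0.95))

# tau | y ~ Gamma(nu_post / 2, rate = nu_post * s2_post / 2)
post_tau = stats.gamma(a=nu_post / 2, scale=2.0 / (nu_post * s2_post))
print("E[tau | y] =", post_tau.mean(), " (should equal 1 / s2_post =", 1.0 / s2_post, ")")
```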

RELATIONSHIP TO FREQUENTIST APPROACH:

The PBE of $\beta$ is
$$E[\beta \mid \mathbf{y}] = \overline\beta = \overline{c} \left( \underline{c}^{-1} \underline\beta + \hat\beta \sum x_i^2 \right).$$
It is a weighted average of the prior mean and the OLS estimator of $\beta$ from frequentist statistics.
$\underline{c}^{-1}$ reflects your confidence in the prior and should be chosen accordingly; $\sum x_i^2$ reflects the degree of confidence that the data have in the OLS estimator $\hat\beta$.

RELATIONSHIP TO FREQUENTIST APPROACH:

Recall also that
$$\overline\nu\, \overline{s}^2 = \underline\nu\, \underline{s}^2 + \nu s^2 + \frac{(\hat\beta - \underline\beta)^2}{\underline{c} + \left( \sum x_i^2 \right)^{-1}}$$
and
$$s^2 = \frac{\sum_i (y_i - \hat\beta x_i)^2}{n - 1} = \frac{\mathrm{SSE}}{n - 1} = \frac{\mathrm{SSE}}{\nu}.$$
So,
$$\underbrace{\overline\nu\, \overline{s}^2}_{\text{posterior SSE}} = \underbrace{\underline\nu\, \underline{s}^2}_{\text{prior SSE}} + \underbrace{\nu s^2}_{\text{SSE}} + \frac{(\hat\beta - \underline\beta)^2}{\underline{c} + \left( \sum x_i^2 \right)^{-1}}.$$
The final term reflects conflict between the prior and the data.

CHOOSING PRIOR HYPERPARAMETERS:

When choosing hyperparameters $\underline\beta$, $\underline{c}$, $\underline\nu$, and $\underline{s}^2$, it may be helpful to know that $\underline\beta$ is equivalent to the OLS estimate from an imaginary data set with $\underline\nu + 1$ observations, imaginary $\sum x_i^2$ equal to $\underline{c}^{-1}$, and imaginary $s^2$ given by $\underline{s}^2$. The imaginary data set might even be previous data!
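One way to operationalize this correspondence (a sketch, assuming the previous data follow the same no-intercept model; the helper name is hypothetical) is to compute the prior hyperparameters directly from the earlier data set:

```python
import numpy as np

def prior_from_previous_data(x_prev, y_prev):
    """Map a previous ("imaginary") data set to NG prior hyperparameters,
    using the correspondence described above."""
    x_prev, y_prev = np.asarray(x_prev, dtype=float), np.asarray(y_prev, dtype=float)
    sxx = np.sum(x_prev ** 2)
    beta_prior = np.sum(x_prev * y_prev) / sxx        # OLS estimate from the previous data
    c_prior = 1.0 / sxx                               # imaginary sum of x_i^2 equals 1 / c
    nu_prior = len(y_prev) - 1                        # nu + 1 observations
    s2_prior = np.sum((y_prev - beta_prior * x_prev) ** 2) / nu_prior
    return beta_prior, c_prior, nu_prior, s2_prior
```

The result can then be passed straight into a posterior update such as the ng_posterior sketch above.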

MODEL COMPARISON

Suppose you want to fit this overly simplistic linear model to describe the $y_i$ but are not sure whether you want to use the $x_i$ or a different set of explanatory variables. Consider the two models:
$$M_1: y_i = \beta_1 x_{1i} + \varepsilon_{1i}, \qquad M_2: y_i = \beta_2 x_{2i} + \varepsilon_{2i}.$$
Here, we assume $\varepsilon_{1i} \stackrel{iid}{\sim} N(0, \tau_1^{-1})$ and $\varepsilon_{2i} \stackrel{iid}{\sim} N(0, \tau_2^{-1})$ are independent.

MODEL COMPARISON

Priors for model j:
$$(\beta_j, \tau_j) \sim NG(\underline\beta_j, \underline{c}_j, \underline\nu_j / 2, \underline\nu_j\, \underline{s}_j^2 / 2)$$
Posteriors for model j:
$$(\beta_j, \tau_j) \mid \mathbf{y} \sim NG(\overline\beta_j, \overline{c}_j, \overline\nu_j / 2, \overline\nu_j\, \overline{s}_j^2 / 2)$$
The posterior odds ratio is
$$PO_{12} := \frac{P(M_1 \mid \mathbf{y})}{P(M_2 \mid \mathbf{y})} = \frac{f(\mathbf{y} \mid M_1)}{f(\mathbf{y} \mid M_2)} \cdot \frac{P(M_1)}{P(M_2)}.$$

MODEL COMPARISON

Can show that
$$f(\mathbf{y} \mid M_j) = a_j \left( \frac{\overline{c}_j}{\underline{c}_j} \right)^{1/2} \left( \overline\nu_j\, \overline{s}_j^2 \right)^{-\overline\nu_j / 2}$$
where
$$a_j = \frac{\Gamma(\overline\nu_j / 2)\, \left( \underline\nu_j\, \underline{s}_j^2 \right)^{\underline\nu_j / 2}}{\Gamma(\underline\nu_j / 2)\, \pi^{n/2}}.$$

MODEL COMPARISON

We can get the posterior model probabilities:
$$P(M_1 \mid \mathbf{y}) = \frac{PO_{12}}{1 + PO_{12}}, \qquad P(M_2 \mid \mathbf{y}) = \frac{1}{1 + PO_{12}},$$
where
$$PO_{12} = \frac{a_1 \left( \overline{c}_1 / \underline{c}_1 \right)^{1/2} \left( \overline\nu_1\, \overline{s}_1^2 \right)^{-\overline\nu_1 / 2}}{a_2 \left( \overline{c}_2 / \underline{c}_2 \right)^{1/2} \left( \overline\nu_2\, \overline{s}_2^2 \right)^{-\overline\nu_2 / 2}} \cdot \frac{P(M_1)}{P(M_2)}.$$
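In practice it is natural to evaluate log marginal likelihoods for numerical stability. The following sketch computes $\log f(\mathbf{y} \mid M_j)$ from the closed form above for the no-intercept model; the function name and the commented usage at the bottom are illustrative assumptions, not part of the slides.

```python
import numpy as np
from scipy.special import gammaln

def log_marginal_likelihood(x, y, beta_prior, c_prior, nu_prior, s2_prior):
    """log f(y | M_j) for the no-intercept conjugate NG regression model."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(y)
    sxx = np.sum(x ** 2)
    beta_hat = np.sum(x * y) / sxx
    nu = n - 1
    s2 = np.sum((y - beta_hat * x) ** 2) / nu
    # posterior hyperparameters (same update as on the earlier slide)
    c_post = 1.0 / (1.0 / c_prior + sxx)
    nu_post = nu_prior + n
    nu_s2_post = nu_prior * s2_prior + nu * s2 + (beta_hat - beta_prior) ** 2 / (c_prior + 1.0 / sxx)
    # log a_j plus the remaining factors of the closed-form marginal likelihood
    log_a = (gammaln(nu_post / 2) + (nu_prior / 2) * np.log(nu_prior * s2_prior)
             - gammaln(nu_prior / 2) - (n / 2) * np.log(np.pi))
    return log_a + 0.5 * (np.log(c_post) - np.log(c_prior)) - (nu_post / 2) * np.log(nu_s2_post)

# Posterior odds for two candidate regressors x1 and x2 (equal prior model probabilities assumed):
# log_po12 = log_marginal_likelihood(x1, y, b1, c1, v1, s1) - log_marginal_likelihood(x2, y, b2, c2, v2, s2)
# po12 = np.exp(log_po12); p_m1, p_m2 = po12 / (1 + po12), 1 / (1 + po12)
```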

MODEL COMPARISON

In the expression for $PO_{12}$ above, $\overline\nu_j\, \overline{s}_j^2$ contains the OLS SSE. A lower value indicates a better fit. So, the posterior odds ratio rewards models which fit the data better.

MODEL COMPARISON

$\overline\nu_j\, \overline{s}_j^2$ also contains a term like $(\hat\beta_j - \underline\beta_j)^2$. So, the posterior odds ratio supports greater coherence between prior information and data information!