Linear Models

A linear model is defined by the expression

    x = Fβ + ε,

where

- x = (x_1, x_2, ..., x_n)' is a vector of size n, usually known as the response vector.
- β = (β_1, β_2, ..., β_p)' is the parameter vector, of dimension p.
- F is an n × p matrix of known elements, with rows denoted by f_i, known as the design matrix.
- ε = (ε_1, ε_2, ..., ε_n)' is a vector of size n that contains the errors.

These errors are usually assumed iid with ε_i ~ N(0, σ²).

Least squares: find the value of β so that the sum of squares

    S = (x - Fβ)'(x - Fβ)

reaches its minimum.

MLE: find the value of β that maximizes the likelihood. Under normality, and assuming σ² known, the log-likelihood for β is given by

    l(β) = c - (n/2) log(σ²) - (1/(2σ²)) (x - Fβ)'(x - Fβ).

Also under normality, the MLE and the LSE of β coincide:

    b = (F'F)^{-1} F'x (= β̂).

Other concepts: the residual sum of squares is

    R = (x - Fb)'(x - Fb).

This sum of squares is associated with n - p degrees of freedom, since R/σ² ~ χ²_{n-p}. An unbiased estimate of the variance is s² = R/(n - p), which is not the same as the MLE of σ² (σ̂² = R/n). We also have the sum-of-squares factorization

    x'x = R + b'F'Fb,

i.e., the total sum of squares equals the residual sum of squares plus the regression sum of squares.
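As a quick numerical check of these formulas, here is a minimal sketch with simulated data (the data and dimensions are made up for illustration, not taken from the notes). It computes the LSE/MLE b, the residual sum of squares R, the two variance estimates, and verifies the factorization x'x = R + b'F'Fb:

    import numpy as np

    rng = np.random.default_rng(0)

    # Simulated data: n observations, p = 2 parameters (intercept and slope).
    n, p = 50, 2
    F = np.column_stack([np.ones(n), rng.normal(size=n)])   # design matrix, n x p
    beta_true = np.array([1.0, 2.0])
    sigma = 0.5
    x = F @ beta_true + rng.normal(scale=sigma, size=n)     # response vector

    # LSE / MLE: b = (F'F)^{-1} F'x (solve the normal equations rather than invert)
    b = np.linalg.solve(F.T @ F, F.T @ x)

    # Residual sum of squares and the two variance estimates
    R = (x - F @ b) @ (x - F @ b)
    s2 = R / (n - p)          # unbiased estimate of sigma^2
    sigma2_mle = R / n        # MLE of sigma^2

    # Sum-of-squares factorization: x'x = R + b'F'Fb
    assert np.isclose(x @ x, R + b @ (F.T @ F) @ b)
    print(b, s2, sigma2_mle)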

Bayesian Statistics

The main goal of Bayesian analysis is to incorporate prior information into statistical modeling. This leads to treating both observations and parameters as random variables. A Bayesian model is established in a hierarchical way. First we define a probability distribution for the observations given a specific value of the parameter (the likelihood),

    f(x_1, x_2, ..., x_n | θ).

We also specify a probability distribution for θ, known as the prior distribution p(θ), which reflects the current state of knowledge about θ.

After a likelihood and a prior are specified, Bayesians compute the posterior distribution, given by Bayes' theorem:

    p(θ | x_1, x_2, ..., x_n) ∝ f(x_1, x_2, ..., x_n | θ) p(θ).

The proportionality constant is the marginal distribution of the data,

    p(x_1, x_2, ..., x_n) = ∫ f(x_1, x_2, ..., x_n | θ) p(θ) dθ.

All inferences are based on the posterior distribution. If we wish to estimate θ, we could use the posterior expectation

    E(θ | x) = ∫ θ p(θ | x) dθ,

where x represents the data vector x = (x_1, x_2, ..., x_n).
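When these integrals are not available in closed form, a simple grid approximation shows the mechanics of normalizing the posterior and computing E(θ | x). The following sketch is my own illustration (Bernoulli data made up for the example, flat prior), not an example from the notes:

    import numpy as np

    # Grid approximation of the posterior for a single parameter theta in (0, 1).
    theta = np.linspace(0.001, 0.999, 1000)
    dtheta = theta[1] - theta[0]
    data = np.array([1, 0, 1, 1, 0, 1])        # hypothetical Bernoulli observations
    k, m = data.sum(), len(data)

    likelihood = theta**k * (1 - theta)**(m - k)   # f(x_1, ..., x_m | theta)
    prior = np.ones_like(theta)                    # flat prior, p(theta) proportional to 1

    unnormalized = likelihood * prior
    posterior = unnormalized / (unnormalized.sum() * dtheta)   # divide by p(x_1, ..., x_m)

    posterior_mean = (theta * posterior).sum() * dtheta        # E(theta | x)
    print(posterior_mean)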

If we want to predict a future value x_f, we use the predictive distribution of x_f given the data x,

    p(x_f | x) = ∫ p(x_f | θ, x) p(θ | x) dθ.

Usually the computations required by Bayesian statistics involve numerical evaluation of complicated integrals, except in specific cases known as conjugate models. Example of a conjugate model: Binomial data with a Beta prior (a small numerical sketch of this case follows below). Outside conjugate models it is usually hard to determine a prior distribution (doing so requires both scientific and probabilistic knowledge). To deal with this problem, some statisticians appeal to non-informative (objective) prior distributions.
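Here is the Binomial data / Beta prior case mentioned above, as a minimal sketch with made-up counts. The closed-form Beta posterior used here is the standard conjugacy result, not something derived in these notes:

    # Conjugate model: k | theta ~ Binomial(n_trials, theta), theta ~ Beta(a0, b0).
    # The posterior is Beta(a0 + k, b0 + n_trials - k): no numerical integration needed.
    a0, b0 = 1.0, 1.0          # Beta(1, 1) prior, i.e. the flat prior
    n_trials, k = 10, 7        # hypothetical data: 7 successes out of 10 trials

    a_post, b_post = a0 + k, b0 + n_trials - k
    posterior_mean = a_post / (a_post + b_post)     # E(theta | data)

    # Predictive probability that one future Bernoulli trial is a success;
    # in this model it equals the posterior mean of theta.
    p_next_success = a_post / (a_post + b_post)
    print(posterior_mean, p_next_success)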

Non-informative priors have the purpose of reflecting a lack of prior knowledge; they are a starting point for running the Bayes machinery. Non-informative priors are also known as reference priors. An intuitive choice for a non-informative prior is the uniform, also known as the flat prior,

    p(θ) ∝ 1.

Both Bayes and Laplace proposed this as a default non-informative prior. However, this prior distribution is not invariant under one-to-one transformations.
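As a small worked example of this lack of invariance (my illustration, not from the notes): if θ in (0, 1) has the flat prior p(θ) ∝ 1 and we reparametrize as φ = θ², the change-of-variables formula gives

    p(φ) = p(θ(φ)) |dθ/dφ| = (1/2) φ^{-1/2},    0 < φ < 1,

which is not flat in φ. A prior that claims ignorance about θ therefore expresses a definite preference on the scale of φ.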

A famous probabilist, Jeffreys, derived a non-informative prior that is invariant under such transformations. Jeffreys' rule:

    p(θ) ∝ I(θ)^{1/2},

where I(θ) denotes the expected (Fisher) information,

    I(θ) = -E_{X|θ}[ d² log f(x | θ) / dθ² ].

In the Binomial-Beta example, Jeffreys' prior is

    p(θ) ∝ θ^{-1/2} (1 - θ)^{-1/2}.

If we have a probability model with a location parameter µ and a scale parameter σ², Jeffreys' prior becomes

    p(µ, σ²) ∝ 1/σ².

For more information about Bayesian Statistics you may want to check Tim Hanson's course page: http://www.stat.unm.edu/~hanson/sta579/sta579.html
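To make the Binomial case concrete, here is a brief derivation added for illustration (it follows directly from the Jeffreys rule above). For a single Bernoulli(θ) observation, log f(x | θ) = x log θ + (1 - x) log(1 - θ), so

    d² log f(x | θ) / dθ² = -x/θ² - (1 - x)/(1 - θ)².

Taking expectations with E(x | θ) = θ gives

    I(θ) = θ/θ² + (1 - θ)/(1 - θ)² = 1/θ + 1/(1 - θ) = 1/(θ(1 - θ)),

and hence p(θ) ∝ I(θ)^{1/2} = θ^{-1/2}(1 - θ)^{-1/2}, the Beta(1/2, 1/2) density up to a normalizing constant. For n Binomial trials the information is n I(θ), which changes the prior only by a constant factor.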

Summary of Bayes results for the Linear Model

For the linear model, β and σ² are essentially location/scale parameters. The default non-informative prior for β and σ² is

    p(β, σ²) ∝ 1/σ².

With Bayes' theorem, the posterior distribution is given by

    p(β, σ² | x, F) ∝ f(x | β, σ²) (1/σ²).

Under this prior, the posterior distribution for (β, σ²) is a Normal/Inverse-Gamma distribution. Conditional on σ², the posterior for β is a p-dimensional Normal with mean b and covariance matrix σ²(F'F)^{-1}.

That is, β | σ², x ~ N(b, σ²(F'F)^{-1}). The marginal posterior distribution for σ² is an Inverse Gamma with shape parameter (n - p)/2 and scale parameter R/2,

    σ² | x ~ IG((n - p)/2, R/2).

The product of this p-dimensional Normal and the Inverse Gamma defines the Normal/Inverse-Gamma posterior. For the marginal posterior distribution of β we need

    p(β | x, F) = ∫ p(β, σ² | x, F) dσ².

After some algebraic manipulation, it can be shown that

    p(β | x, F) = c(n, p) |F'F|^{1/2} [1 + (β - b)'F'F(β - b)/((n - p)s²)]^{-n/2},

a p-dimensional Student-t density with n - p degrees of freedom. Roughly, for n large, p(β | x, F) ≈ N(b, s²(F'F)^{-1}).
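These two pieces make joint posterior simulation straightforward by composition sampling: draw σ² from its marginal posterior, then β from its conditional given that σ². The sketch below continues the simulated example from the earlier code (it assumes rng, F, x, b, R, n, and p are defined there) and uses SciPy's inverse-gamma distribution:

    import numpy as np
    from scipy import stats

    n_draws = 5000
    FtF_inv = np.linalg.inv(F.T @ F)

    # 1. Draw sigma^2 | x ~ IG((n - p)/2, R/2).
    sigma2_draws = stats.invgamma.rvs(a=(n - p) / 2, scale=R / 2,
                                      size=n_draws, random_state=rng)

    # 2. Draw beta | sigma^2, x ~ N(b, sigma^2 (F'F)^{-1}).
    beta_draws = np.array([rng.multivariate_normal(b, s2_draw * FtF_inv)
                           for s2_draw in sigma2_draws])

    # The posterior mean of beta should be close to b; for large n the spread
    # is approximately that of N(b, s^2 (F'F)^{-1}).
    print(beta_draws.mean(axis=0), b)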

The marginal density of x given F is

    p(x | F) = ∫∫ p(x | β, σ²) p(β, σ²) dβ dσ² = c |F'F|^{-1/2} R^{-(n-p)/2}.

Due to the sum-of-squares factorization, we can establish that

    p(x | F) ∝ |F'F|^{-1/2} (1 - b'F'Fb/(x'x))^{(p-n)/2}.

If we think of F as a parameter, p(x | F) is a likelihood that can be used to produce inferences about F, or about quantities that determine F (a marginal likelihood). Under orthogonality of the F matrix, the evaluation of p(x | F) becomes very easy: F orthogonal means that F'F = kI.
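As a final sketch, the formula above can be evaluated numerically, up to the constant c, and compared across candidate design matrices of the same dimension p (so that c(n, p) cancels). This again continues the simulated example from the earlier code (it assumes F, x, rng, and n are defined there); the competing design with a pure-noise second column is hypothetical:

    import numpy as np

    def log_px_given_F(F_mat, x_vec):
        # log p(x | F) up to an additive constant:
        #   -(1/2) log|F'F| - ((n - p)/2) log R
        n_obs, p_par = F_mat.shape
        b_hat = np.linalg.solve(F_mat.T @ F_mat, F_mat.T @ x_vec)
        R_ss = (x_vec - F_mat @ b_hat) @ (x_vec - F_mat @ b_hat)
        _, logdet = np.linalg.slogdet(F_mat.T @ F_mat)
        return -0.5 * logdet - 0.5 * (n_obs - p_par) * np.log(R_ss)

    # Compare the original design with an alternative of the same dimension p
    # whose slope column is replaced by pure noise.
    F_alt = F.copy()
    F_alt[:, 1] = rng.normal(size=n)
    print(log_px_given_F(F, x), log_px_given_F(F_alt, x))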