Bayesian methods in economics and finance


Linear regression: Bayesian model selection and sparsity priors

Linear regression

Model for the relationship between (several) independent variables x = (x_1, ..., x_{D-1}) and a dependent variable y:

    y = w_0 + \sum_{i=1}^{D-1} w_i x_i + \epsilon

Structure: linear relationship with parameters w
Noise: additive observation noise \epsilon \sim N(0, \sigma_\epsilon^2)

In convenient matrix notation, with w, x \in R^D and x_0 := 1:

    y = w^T x + \epsilon

Linear regression: data and likelihood

Example data {(x_n, t_n)}_{n=1}^N
Collect the data in matrices X = (x_1^T, ..., x_N^T)^T \in R^{N \times D} and t = (t_1, ..., t_N)^T \in R^{N \times 1}; X is also called the design matrix

Likelihood p(data | \theta):

    p(t | X, \theta) = \prod_{n=1}^N N(t_n | w^T x_n, \sigma_\epsilon^2)
                     = (2\pi\sigma_\epsilon^2)^{-N/2} \exp\left( -\frac{1}{2\sigma_\epsilon^2} \sum_{n=1}^N (t_n - w^T x_n)^2 \right)

with parameters \theta = (w, \sigma_\epsilon^2): weights w = (w_0, ..., w_{D-1})^T and noise variance \sigma_\epsilon^2
Note: the data are modeled as exchangeable
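
A minimal NumPy sketch (not from the slides; the synthetic data and parameter values are illustrative) of the design matrix and the Gaussian log-likelihood above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic example data: N points, D-1 covariates plus an intercept column x_0 = 1,
# so the design matrix X is N x D
N, D = 50, 3
true_w, sigma_eps = np.array([1.0, -2.0, 0.5]), 0.3
X = np.column_stack([np.ones(N), rng.normal(size=(N, D - 1))])
t = X @ true_w + sigma_eps * rng.normal(size=N)

def log_likelihood(w, sigma_eps, X, t):
    """Gaussian log-likelihood ln p(t | X, w, sigma_eps^2) of the linear model."""
    resid = t - X @ w
    return (-0.5 * len(t) * np.log(2 * np.pi * sigma_eps**2)
            - 0.5 * np.sum(resid**2) / sigma_eps**2)

print(log_likelihood(true_w, sigma_eps, X, t))
```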

Ordinary least squares

Maximum likelihood solution:

    \max_{w, \sigma_\epsilon^2} \ln p(t | X, w, \sigma_\epsilon^2) = \max_{w, \sigma_\epsilon^2} \left[ -\frac{N}{2} \ln \sigma_\epsilon^2 - \frac{1}{2\sigma_\epsilon^2} \sum_{n=1}^N (t_n - w^T x_n)^2 \right]

Find the optimal weight vector w^*:

    \frac{\partial}{\partial w_i} \ln p(t | X, w, \sigma_\epsilon^2) = \frac{1}{\sigma_\epsilon^2} \sum_{n=1}^N (t_n - w^T x_n) x_{in} \stackrel{!}{=} 0
    \Rightarrow w^T \sum_{n=1}^N x_n x_n^T = \sum_{n=1}^N t_n x_n^T
    \Rightarrow w^* = (X^T X)^{-1} X^T t

w^* minimizes the squared error \frac{1}{2} \sum_{n=1}^N (t_n - w^T x_n)^2
X^\dagger = (X^T X)^{-1} X^T is the pseudo-inverse of the design matrix X

Ordinary least squares (continued)

Find the optimal noise precision \tau_\epsilon = 1/\sigma_\epsilon^2:

    \frac{\partial}{\partial \tau_\epsilon} \ln p(t | X, w, \tau_\epsilon) = \frac{N}{2} \frac{1}{\tau_\epsilon} - \frac{1}{2} \sum_{n=1}^N (t_n - w^T x_n)^2 \stackrel{!}{=} 0
    \Rightarrow \frac{1}{\tau_\epsilon} = \frac{1}{N} \sum_{n=1}^N (t_n - w^{*T} x_n)^2
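
A short sketch of the ML estimates on synthetic data (illustrative values, not from the slides); a least-squares solver is used instead of forming (X^T X)^{-1} explicitly, which is numerically safer:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, D - 1))])   # design matrix
t = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=N)    # synthetic targets

# Maximum-likelihood weights: w* = (X^T X)^{-1} X^T t, via a stable solver
w_ml, *_ = np.linalg.lstsq(X, t, rcond=None)

# Maximum-likelihood noise variance: 1/tau_eps = mean squared residual
sigma2_ml = np.mean((t - X @ w_ml) ** 2)
print(w_ml, sigma2_ml)
```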

Bayesian linear regression

Assume that the noise precision \tau_\epsilon is known
The conjugate prior for the weights w is Gaussian: p(w) = N(w | 0, \Sigma_0)
The posterior is again Gaussian with
    covariance matrix \Sigma_N = (\Sigma_0^{-1} + \tau_\epsilon X^T X)^{-1}
    mean \mu_N = \tau_\epsilon \Sigma_N X^T t

Bayesian linear regression

The posterior is found by completing the square:

    p(w | X, t, \tau_\epsilon) \propto \exp\left(-\frac{\tau_\epsilon}{2} \sum_{n=1}^N (t_n - w^T x_n)^2\right) \exp\left(-\frac{1}{2} w^T \Sigma_0^{-1} w\right)
                               = \exp\left(-\frac{1}{2} \left[\tau_\epsilon (Xw - t)^T (Xw - t) + w^T \Sigma_0^{-1} w\right]\right)
                               \propto \exp\left(-\frac{1}{2} w^T \underbrace{(\tau_\epsilon X^T X + \Sigma_0^{-1})}_{\Sigma_N^{-1}} w + w^T \underbrace{\tau_\epsilon X^T t}_{\Sigma_N^{-1} \mu_N}\right)

Show demo...
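
A minimal sketch of the posterior update (synthetic data; the assumed noise precision and prior precision are illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, D - 1))])
t = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=N)

tau_eps = 1.0 / 0.3**2          # assumed known noise precision
Sigma0_inv = 1.0 * np.eye(D)    # prior precision Sigma_0^{-1} = tau_0 * I

# Posterior covariance Sigma_N = (Sigma_0^{-1} + tau_eps X^T X)^{-1}
# and posterior mean mu_N = tau_eps * Sigma_N X^T t
Sigma_N = np.linalg.inv(Sigma0_inv + tau_eps * X.T @ X)
mu_N = tau_eps * Sigma_N @ X.T @ t
print(mu_N)
```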

Posterior intervals

Consider the posterior p(\theta | D) of a single parameter \theta.
To quantify the uncertainty, define an interval [a, b] which contains 95% of the posterior probability:
    P(a < \theta < b | D) = 95%
Common choices are
 - symmetric intervals [\hat\theta - x, \hat\theta + x] around the posterior mean or mode \hat\theta
 - intervals with p(\theta = a | D) = p(\theta = b | D)
 - intervals with a, b being fixed quantiles
Also called credible intervals or Bayesian confidence intervals

Posterior intervals

Example: estimating the mean of a Gaussian N(\mu, \sigma^2) with an uninformative prior:

    P(\hat\mu - 1.96 \sigma/\sqrt{N} < \mu < \hat\mu + 1.96 \sigma/\sqrt{N} | D) = 95%

The notation is suggestive, but the interpretation is very different from a confidence interval:
 - Confidence interval: the parameter \mu is fixed and the random interval covers \mu with probability 95%; coverage is a repeated-sampling concept
 - Credible interval: the data, i.e. \hat\mu, are fixed and the uncertain parameter lies in the interval with probability 95%

Posterior intervals

A frequentist interval I(D) with coverage \alpha satisfies
    P(\theta \in I(D) | \theta) = \alpha   for all \theta
With a Bayesian prior p(\theta) it holds that
    \alpha = \int P(\theta \in I(D) | \theta) p(\theta) d\theta = \int\int 1\{\theta \in I(D)\} p(D, \theta) \, dD \, d\theta = \int P(\theta \in I(D) | D) p(D) \, dD
Thus, Bayesian credible intervals
 - have good coverage on average
 - but come without worst-case guarantees, i.e. coverage for surprising data can be low

Predictive distribution

Predict a new data point at x_new. The predictive distribution is

    p(t_new | x_new, X, t) = \int p(t_new | x_new, w) p(w | X, t) dw

Again a Gaussian distribution with
    mean \mu_N^T x_new
    variance \sigma_\epsilon^2 + x_new^T \Sigma_N x_new
i.e. noise variance + uncertainty about the parameters
Do not pick the best parameters, but take the uncertainty into account. Bayesian slogan: Don't optimize, integrate!
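
A sketch of the predictive mean and variance at a new input (synthetic data; the test point and hyperparameters are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, D - 1))])
t = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=N)

tau_eps = 1.0 / 0.3**2
Sigma_N = np.linalg.inv(np.eye(D) + tau_eps * X.T @ X)   # prior precision tau_0 = 1
mu_N = tau_eps * Sigma_N @ X.T @ t

# Predictive distribution at x_new: mean mu_N^T x_new,
# variance sigma_eps^2 + x_new^T Sigma_N x_new
x_new = np.array([1.0, 0.2, -0.5])
pred_mean = mu_N @ x_new
pred_var = 1.0 / tau_eps + x_new @ Sigma_N @ x_new
print(pred_mean, pred_var)
```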

Regularization

Compare the ML solution
    w^* = (X^T X)^{-1} X^T t,
which minimizes \frac{1}{2} \sum_{n=1}^N (t_n - w^T x_n)^2, with the posterior maximum (for \Sigma_0^{-1} = \tau_0 I)
    \mu_N = \tau_\epsilon (\tau_0 I + \tau_\epsilon X^T X)^{-1} X^T t,
which minimizes the regularized mean squared error
    \frac{1}{2} \sum_{n=1}^N (t_n - w^T x_n)^2 + \frac{\lambda}{2} w^T w,   where \lambda = \tau_0 / \tau_\epsilon
i.e. squared error + regularizer

Regularization

The quadratic regularizer controls model complexity by shrinking the weights towards zero (reduces variance)
Lasso uses an absolute-value regularizer instead: \frac{\lambda}{2} \sum_{i=1}^D |w_i|
 - favors sparse solutions, i.e. some weights are exactly zero
 - corresponds to a Laplace prior: p(w) \propto e^{-\sum_i |w_i|}
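
A sketch of the ridge / posterior-maximum solution with \lambda = \tau_0 / \tau_\epsilon (synthetic data, illustrative \lambda). The lasso penalty has no closed form and is usually minimized numerically, e.g. by coordinate descent:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, D - 1))])
t = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=N)

# Ridge / MAP solution: mu_N = (lambda * I + X^T X)^{-1} X^T t,
# equivalent to tau_eps (tau_0 I + tau_eps X^T X)^{-1} X^T t with lambda = tau_0 / tau_eps
lam = 0.1
w_ridge = np.linalg.solve(lam * np.eye(D) + X.T @ X, X.T @ t)
print(w_ridge)
```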

Regularization (slide from Bishop, "Regularized Least Squares"): figure illustrating that the lasso tends to generate sparser solutions than a quadratic regularizer.

Bias-variance decomposition

Consider the expected mean squared error:

    E[(y(x; w) - t)^2] = \int (y(x; w) - t)^2 p(x, t) \, dx \, dt
                       = E[(y(x; w) - h(x))^2] + E[(h(x) - t)^2] + 2 E[(w^T x - h(x))(h(x) - t)]

where the last term vanishes since h(x) = E[t | x]
Regression function y(x; w) = w^T x; the optimal choice is h(x) = E[t | x]
In practice \hat{w} is estimated from a data set D:

    E_D[(y(x; \hat{w}) - h(x))^2] = (E_D[y(x; \hat{w})] - h(x))^2                                             (bias^2)
                                  + E_D[(y(x; \hat{w}) - E_D[y(x; \hat{w})])^2]                               (variance)
                                  + 2 E_D[(y(x; \hat{w}) - E_D[y(x; \hat{w})])(E_D[y(x; \hat{w})] - h(x))]    (= 0)

Bias-variance decomposition (slides from Bishop, "The Bias-Variance Decomposition (5)-(7)"): figures showing fits to 25 data sets drawn from the sinusoidal, varying the degree of regularization \lambda.

Bias-variance trade-off (slide from Bishop): from these plots we note that an over-regularized model (large \lambda) will have high bias, while an under-regularized model (small \lambda) will have high variance.
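
A sketch in the spirit of the Bishop experiment referenced above (polynomial features, 25 synthetic data sets from a sinusoid, ridge fits; all settings are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def features(x, degree=9):
    """Polynomial design matrix with a bias column: 1, x, x^2, ..., x^degree."""
    return np.vander(x, degree + 1, increasing=True)

x_grid = np.linspace(0, 1, 100)
h = np.sin(2 * np.pi * x_grid)          # true regression function h(x)

def bias_variance(lam, n_sets=25, n_points=25, noise=0.3):
    preds = []
    for _ in range(n_sets):
        x = rng.uniform(0, 1, n_points)
        t = np.sin(2 * np.pi * x) + noise * rng.normal(size=n_points)
        Phi = features(x)
        w = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)
        preds.append(features(x_grid) @ w)
    preds = np.array(preds)
    bias2 = np.mean((preds.mean(axis=0) - h) ** 2)   # squared bias, averaged over x
    variance = np.mean(preds.var(axis=0))            # variance across data sets
    return bias2, variance

for lam in [1e-4, 1e-1, 10.0]:          # under- to over-regularized
    print(lam, bias_variance(lam))
```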

Bayesian model selection

The maximum likelihood max_\theta p(t | X, \theta) improves whenever the number of parameters increases. This leads to over-fitting, as the model picks up noise in the data.
To achieve low expected loss we need to control
    bias^2 (decreases with model complexity) + variance (increases with model complexity)
Regularization favors less flexible models by penalizing large weights
Marginal likelihood / evidence:
    p(t | X) = \int p(t | X, w) p(w) dw
 - the Bayesian answer to over-fitting
 - the probability of the data, accounting for parameter uncertainty

Bayesian model selection

Consider a set of models {M_i}, e.g. regression models with different numbers/subsets of covariates:
 - prior probability of each model p(M_i)
 - compute the posterior: p(M_i | D) \propto p(D | M_i) p(M_i)
Note that each model has parameters w_i and thus
    p(D | M_i) = \int p(D, w_i | M_i) dw_i = \int p(D | w_i, M_i) p(w_i | M_i) dw_i
The marginal likelihood (with respect to the parameters) is the likelihood (with respect to the model)!
Note: the marginal likelihood requires a proper prior

Bayesian model selection

Two models M_1, M_2 are then compared based on the posterior odds

    \frac{p(M_1 | D)}{p(M_2 | D)} = \underbrace{\frac{p(D | M_1)}{p(D | M_2)}}_{\text{Bayes factor}} \cdot \underbrace{\frac{p(M_1)}{p(M_2)}}_{\text{prior odds}}

Interpretation of evidence from Bayes factors as suggested by Jeffreys:

    Bayes factor      decibans    Evidence
    < 1               < 0         negative (supports M_2)
    1 to 10^{1/2}     0 to 5      barely worth mentioning
    10^{1/2} to 10    5 to 10     substantial
    10 to 10^{3/2}    10 to 15    strong
    10^{3/2} to 100   15 to 20    very strong
    > 100             > 20        decisive

Bayesian model selection

The marginal likelihood can be computed analytically:
 - Prior: p(w) = N(w | 0, \tau_0^{-1} I)
 - Likelihood: p(t | X, w) = N(t | Xw, \tau_\epsilon^{-1} I)
 - Evidence:

    \ln p(t | X) = \ln \frac{p(t | X, w) p(w)}{p(w | X, t)}   for any w; evaluating at w = \mu_N gives
                 = \frac{D}{2} \ln \tau_0 + \frac{N}{2} \ln \tau_\epsilon - \frac{N}{2} \ln 2\pi
                   - \frac{\tau_\epsilon}{2} (X\mu_N - t)^T (X\mu_N - t) - \frac{\tau_0}{2} \mu_N^T \mu_N - \frac{1}{2} \ln |A|

   where \mu_N = \tau_\epsilon A^{-1} X^T t and A = \tau_0 I + \tau_\epsilon X^T X = \Sigma_N^{-1}
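
A sketch of the analytic evidence above, applied in the spirit of the polynomial example on the following slides (synthetic sinusoidal data; \tau_0 and \tau_\epsilon are illustrative choices):

```python
import numpy as np

def log_evidence(X, t, tau0, tau_eps):
    """ln p(t | X) for prior N(w | 0, tau0^{-1} I) and noise precision tau_eps."""
    N, D = X.shape
    A = tau0 * np.eye(D) + tau_eps * X.T @ X          # A = Sigma_N^{-1}
    mu_N = tau_eps * np.linalg.solve(A, X.T @ t)
    resid = X @ mu_N - t
    _, logdetA = np.linalg.slogdet(A)
    return (0.5 * D * np.log(tau0) + 0.5 * N * np.log(tau_eps) - 0.5 * N * np.log(2 * np.pi)
            - 0.5 * tau_eps * resid @ resid - 0.5 * tau0 * mu_N @ mu_N - 0.5 * logdetA)

# Evidence vs. polynomial degree on synthetic sinusoidal data
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 30)
t = np.sin(2 * np.pi * x) + 0.2 * rng.normal(size=30)
for degree in [0, 1, 3, 9]:
    Phi = np.vander(x, degree + 1, increasing=True)
    print(degree, log_evidence(Phi, t, tau0=1.0, tau_eps=25.0))
```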

Bayesian model selection (slides from Bishop, "The Evidence Approximation"): figures showing polynomial fits of order 0, 1, 3, and 9 to sinusoidal data, and the model evidence as a function of the polynomial degree M.

Zellner's g-prior

An empirical prior with particularly easy-to-compute Bayes factors:
 - Jeffreys prior for the noise variance: p(\sigma_\epsilon^2) \propto 1/\sigma_\epsilon^2
 - Empirical g-prior for the regression weights: w \sim N(0, g \sigma_\epsilon^2 (X^T X)^{-1})
The Bayes factor between a model using the subset \gamma of regressors and the null model (intercept only) is

    \frac{p(D | M_\gamma)}{p(D | M_0)} = (1 + g)^{(N - p_\gamma - 1)/2} \, [1 + g(1 - R_\gamma^2)]^{-(N - 1)/2}

where R_\gamma^2 is the coefficient of determination and p_\gamma the number of regressors in \gamma
Note: any two models can be compared via the null model,
    \frac{p(D | M_\gamma)}{p(D | M_{\gamma'})} = \frac{p(D | M_\gamma) / p(D | M_0)}{p(D | M_{\gamma'}) / p(D | M_0)}
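
A sketch of the Bayes-factor formula above (the R^2 values, sample size, and g are illustrative numbers, not from the slides):

```python
def g_prior_bayes_factor(r2, n, p_gamma, g):
    """Bayes factor p(D | M_gamma) / p(D | M_0) under Zellner's g-prior.

    r2      : coefficient of determination of the model with regressor subset gamma
    n       : number of observations
    p_gamma : number of regressors in gamma (excluding the intercept)
    g       : g-prior scale
    """
    return (1 + g) ** ((n - p_gamma - 1) / 2) * (1 + g * (1 - r2)) ** (-(n - 1) / 2)

# Two candidate models, each compared against the null model
bf_1 = g_prior_bayes_factor(r2=0.60, n=100, p_gamma=3, g=100)
bf_2 = g_prior_bayes_factor(r2=0.55, n=100, p_gamma=1, g=100)
print(bf_1, bf_2, bf_1 / bf_2)   # last value: Bayes factor of model 1 vs. model 2
```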

Zellner's g-prior

How to choose g?
 - Bartlett paradox: the general problem of an uninformative prior compared against a point hypothesis,
     \frac{p(D | M_\gamma)}{p(D | M_0)} \to 0   as g \to \infty
 - Information paradox: consider overwhelming support for M_\gamma, e.g. R_\gamma^2 \to 1. The Bayes factor stays bounded at (1 + g)^{(N - p_\gamma - 1)/2}
Need to increase g with the number N of data points:
 - Empirical Bayes: g_{ML-II} = argmax_g p(D | M_\gamma)
 - Hyperprior for g:
    - Zellner-Siow prior: g \sim Inv-Gamma(1/2, N/2)
    - Hyper-g prior: g/(1+g) \sim Beta(1, \alpha/2 - 1)
All choices are asymptotically consistent for model selection and prediction

Evidence approximation

How to choose hyperparameters, e.g. \tau_0?
Bayesian ideal: assume a prior p(\tau_0) and integrate it out
 - often analytically intractable
 - introduces new hyperparameters... have to stop at some point
Evidence approximation: optimize the marginal likelihood p(D | \tau_0), i.e. ML-II
 - tends to work well in practice with little over-fitting
 - amounts to model selection from the family {M_{\tau_0}}_{\tau_0 > 0}
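
A sketch of ML-II for \tau_0 by simple grid search over the log evidence (synthetic data; the grid, noise precision, and data sizes are illustrative assumptions):

```python
import numpy as np

def log_evidence(X, t, tau0, tau_eps):
    # ln p(t | X) for prior precision tau0 and noise precision tau_eps (formula above)
    N, D = X.shape
    A = tau0 * np.eye(D) + tau_eps * X.T @ X
    mu_N = tau_eps * np.linalg.solve(A, X.T @ t)
    resid = X @ mu_N - t
    return (0.5 * D * np.log(tau0) + 0.5 * N * np.log(tau_eps) - 0.5 * N * np.log(2 * np.pi)
            - 0.5 * tau_eps * resid @ resid - 0.5 * tau0 * mu_N @ mu_N
            - 0.5 * np.linalg.slogdet(A)[1])

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(40), rng.normal(size=(40, 2))])
t = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=40)

# ML-II / evidence approximation: pick the tau_0 that maximizes the marginal likelihood
grid = np.logspace(-3, 3, 25)
tau0_ml2 = grid[np.argmax([log_evidence(X, t, tau0, tau_eps=1 / 0.3**2) for tau0 in grid])]
print(tau0_ml2)
```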

Sparsity priors

Alternative to model selection: fit a large model with many parameters, but use a prior that favors sparse models, e.g. one under which many regression weights are very small:
 - Hierarchical specification, i.e. w \sim N(0, diag(\sigma_1^2, ..., \sigma_D^2)) and 1/\sigma_i^2 \sim Gamma(\alpha, \beta)
   Evidence approximation for the hyperparameters \sigma_1, ..., \sigma_D is called Automatic Relevance Determination (ARD)
 - Laplace prior p(w_i) \propto e^{-|w_i|}: Bayesian lasso
 - Spike-and-slab prior:
     w_i = 0 with probability q,   w_i \sim N(0, \sigma_w^2) with probability 1 - q,   q \sim Uniform[0, 1]
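
A small sketch that draws weights from each of the three priors listed above to compare their shrinkage behavior (hyperparameter values are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000   # draws per prior, just to compare shrinkage profiles

# Hierarchical prior as used by ARD: w_i | sigma_i ~ N(0, sigma_i^2), 1/sigma_i^2 ~ Gamma(alpha, beta)
prec = rng.gamma(shape=1.0, scale=1.0, size=D)      # alpha = beta = 1 assumed
w_ard = rng.normal(0.0, 1.0 / np.sqrt(prec))

# Laplace prior (Bayesian lasso)
w_laplace = rng.laplace(0.0, 1.0, size=D)

# Spike-and-slab prior: w_i = 0 with prob. q, N(0, sigma_w^2) with prob. 1 - q, q ~ Uniform[0, 1]
q = rng.uniform()
spike = rng.uniform(size=D) < q
w_ss = np.where(spike, 0.0, rng.normal(0.0, 1.0, size=D))

# Fraction of (near-)zero weights under each prior
print(np.mean(np.abs(w_ard) < 0.01), np.mean(np.abs(w_laplace) < 0.01), np.mean(w_ss == 0.0))
```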

Sparsity priors

Modern sparsity prior: the horseshoe prior
    w_i | \sigma_i \sim N(0, \sigma_i^2 \sigma^2),   \sigma_i \sim C^+(0, 1)
with the standard half-Cauchy distribution C^+(0, 1)
Compared to the lasso:
 - small weights: stronger shrinkage
 - large weights: smaller bias
Figure from: C. M. Carvalho et al., Handling Sparsity via the Horseshoe, Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, 2009.
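
A minimal sketch comparing prior draws from the horseshoe and the Laplace prior, illustrating the heavier concentration near zero together with heavier tails (scales are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000

# Horseshoe prior: w_i | sigma_i ~ N(0, sigma_i^2 * sigma^2), sigma_i ~ C+(0, 1)
sigma = 1.0
sigma_i = np.abs(rng.standard_cauchy(size=D))        # half-Cauchy C+(0, 1)
w_horseshoe = rng.normal(0.0, sigma_i * sigma)

# Laplace draws for comparison: the horseshoe puts more mass near zero and in the tails
w_laplace = rng.laplace(0.0, 1.0, size=D)
print(np.mean(np.abs(w_horseshoe) < 0.01), np.mean(np.abs(w_laplace) < 0.01))   # mass near zero
print(np.quantile(np.abs(w_horseshoe), 0.99), np.quantile(np.abs(w_laplace), 0.99))  # tail size
```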

Cross-validation

Idea: select the model based on its generalization error, i.e. the best predictions on novel data
Cross-validation estimates the generalization error:
 - Partition the data set D into K parts D_1, ..., D_K
 - For each i = 1, ..., K:
     1. Train the model on D \ D_i
     2. Evaluate the model on D_i, e.g. using the predictive distribution
 - Predictive performance is estimated by the average prediction error across D_1, ..., D_K
Common choices in practice:
 - K = 10: good compromise between bias (due to a smaller training set) and variance (due to small test sets)
 - K = N: leave-one-out cross-validation (LOOCV); often efficient if a data point can easily be removed from the trained model
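
A minimal K-fold cross-validation sketch for the ridge estimator (synthetic data; K, \lambda grid, and data sizes are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 60, 3, 10
X = np.column_stack([np.ones(N), rng.normal(size=(N, D - 1))])
t = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=N)

def cv_error(X, t, lam, K):
    """K-fold cross-validation estimate of the squared prediction error for ridge."""
    folds = np.array_split(rng.permutation(len(t)), K)
    errs = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(t)), test_idx)
        Xtr, ttr = X[train_idx], t[train_idx]
        w = np.linalg.solve(lam * np.eye(X.shape[1]) + Xtr.T @ Xtr, Xtr.T @ ttr)
        errs.append(np.mean((t[test_idx] - X[test_idx] @ w) ** 2))
    return np.mean(errs)

for lam in [1e-3, 1e-1, 10.0]:
    print(lam, cv_error(X, t, lam, K))
```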

Model selection

Model selection summary: evidence vs. cross-validation

    Aspect           Evidence                        Cross-validation
    Philosophy       Bayesian                        Frequentist
    Prior            p(\theta) matters               Prior insensitive
    Data use         Model all data, p(D)            Predict part of the data, p(D_i | D \ D_i)
                                                     Often computationally demanding
    Evaluation       Probability                     Different error functions
    Asymptotics      Consistent                      Often inconsistent, e.g. LOOCV
    Interpretation   Data compression                Prediction

Related information criteria:
    BIC = -2 \log p(D | \theta_{ML}) + D \log N
    AIC = -2 \log p(D | \theta_{ML}) + 2D
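
A small sketch of the information criteria in the summary above, computed from a Gaussian ML fit (synthetic data; counting the noise variance as a free parameter is an assumption of this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 60, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, D - 1))])
t = X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=N)

# ML fit and maximized Gaussian log-likelihood
w_ml, *_ = np.linalg.lstsq(X, t, rcond=None)
sigma2_ml = np.mean((t - X @ w_ml) ** 2)
loglik = -0.5 * N * (np.log(2 * np.pi * sigma2_ml) + 1)

# Information criteria as in the summary table
n_params = D + 1                      # weights plus noise variance (assumed parameter count)
bic = -2 * loglik + n_params * np.log(N)
aic = -2 * loglik + 2 * n_params
print(bic, aic)
```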
