Efficient Bayesian Multivariate Surface Regression


Feng Li (joint with Mattias Villani)
Department of Statistics, Stockholm University
October 2011

Outline of the talk

1. Flexible regression models
2. The challenges
3. The multivariate surface model
4. Simulation study
5. Application to firm leverage data
6. Extensions

Flexible regression models: Introduction

Flexible models of the regression function E(y|x) have been an active research field for decades. Attention has shifted from kernel regression methods to spline-based models. Splines are regression models with flexible mean functions. Example: a simple spline regression with a single explanatory variable and truncated linear basis functions takes the form

y = \alpha_0 + \alpha_1 x + \beta_1 (x - \xi_1)_+ + \dots + \beta_q (x - \xi_q)_+ + \varepsilon,

where the (x - \xi_i)_+ = \max(x - \xi_i, 0) are called the basis functions, and the \xi_i are called knots (the locations of the basis functions).
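As a concrete illustration (a minimal sketch, not the authors' code; the knots are fixed here, whereas the talk treats them as free parameters), the truncated linear basis and a least-squares fit look like this:

```python
import numpy as np

def truncated_linear_basis(x, knots):
    """One column (x - xi)_+ = max(x - xi, 0) per knot xi."""
    return np.maximum(x[:, None] - knots[None, :], 0.0)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 200)
y = np.sin(x) + 0.1 * rng.standard_normal(200)

knots = np.linspace(1.0, 9.0, 5)                  # fixed knot locations xi_1..xi_5
X = np.column_stack([np.ones_like(x), x,          # intercept and linear term
                     truncated_linear_basis(x, knots)])
coef = np.linalg.lstsq(X, y, rcond=None)[0]       # ordinary least-squares fit
```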

Flexible regression models: Spline example

[Figure: a spline fit with a single covariate and thin-plate bases.]

Flexible regression models: Spline regression with multiple covariates

Additive spline model. Each knot \xi_j (a scalar) is connected with only one covariate:

y = \alpha_0 + \alpha_1 x_1 + \dots + \alpha_q x_q + \sum_{j_1=1}^{m_1} \beta_{j_1} f(x_1, \xi_{j_1}) + \dots + \sum_{j_q=1}^{m_q} \beta_{j_q} f(x_q, \xi_{j_q}) + \varepsilon.

Good and simple if you know a priori that there are no interactions in the data.

Surface spline model. Each knot \xi_j (a vector) is connected with more than one covariate:

y = \alpha_0 + \alpha_1 x_1 + \dots + \alpha_q x_q + \sum_{j=1}^{m} \beta_j g(x_1, \dots, x_q, \xi_j) + \varepsilon.

A popular choice of g(x_1, \dots, x_q, \xi_j) is the multi-dimensional thin-plate spline

g(x_1, \dots, x_q, \xi_j) = \|x - \xi_j\|^2 \ln \|x - \xi_j\|.

It can handle interactions, but the model complexity increases dramatically with the number of interactive knots.
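A sketch of the multi-dimensional thin-plate basis above (my own helper function, not the authors' code):

```python
import numpy as np

def thinplate_basis(X, knots):
    """g(x, xi) = ||x - xi||^2 * ln||x - xi|| for each observation x and knot xi.

    X: (n, q) covariate matrix; knots: (m, q) knot locations; returns (n, m)."""
    d = np.linalg.norm(X[:, None, :] - knots[None, :, :], axis=2)
    with np.errstate(divide="ignore", invalid="ignore"):
        g = d ** 2 * np.log(d)
    return np.where(d > 0.0, g, 0.0)   # g -> 0 as d -> 0, so set g = 0 at the knot
```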

The challenges

How many knots are needed? Too few knots lead to a bad approximation; too many knots yield overfitting.

Where should the knots be placed? Equal spacing works for the additive model, but is clearly inefficient for the surface model.

Common approaches to these two problems:
- place sufficiently many knots and use variable selection to pick out the useful ones: not truly flexible;
- use reversible jump MCMC to move among model spaces with different numbers of knots: very sensitive to the prior and not computationally efficient;
- cluster the covariates to select knots: does not use the information in the responses.

How to choose between the additive spline and the surface spline? There is no established approach.

The multivariate surface model: The model

The multivariate surface model consists of three components, linear, surface and additive:

Y = X_o B_o + X_s(\xi_s) B_s + X_a(\xi_a) B_a + E.

We treat the knots \xi_i as unknown parameters and let them move freely. A model with a minimal number of free knots outperforms a model with many fixed knots.

For notational convenience, we sometimes write the model in compact form

Y = XB + E,

where X = [X_o, X_s, X_a], B = [B_o', B_s', B_a']', and the rows of E are iid N_p(0, \Sigma).
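To make the compact form concrete, here is a toy simulation of Y = XB + E under assumed dimensions, reusing the two basis helpers sketched earlier (all names and sizes are illustrative, not from the talk):

```python
import numpy as np

n, q, p, m_s = 500, 2, 2, 5                       # toy sizes: obs, covariates, responses, surface knots
rng = np.random.default_rng(1)
C = rng.standard_normal((n, q))                   # covariates

X_o = np.column_stack([np.ones(n), C])            # linear component
xi_s = C[rng.choice(n, m_s, replace=False)]       # surface knots (free parameters in the talk)
X_s = thinplate_basis(C, xi_s)                    # surface component
xi_a = np.quantile(C[:, 0], [0.25, 0.5, 0.75])    # additive knots for the first covariate
X_a = truncated_linear_basis(C[:, 0], xi_a)       # additive component

X = np.hstack([X_o, X_s, X_a])                    # X = [X_o, X_s, X_a]
B = rng.standard_normal((X.shape[1], p))          # B = [B_o', B_s', B_a']'
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
E = rng.multivariate_normal(np.zeros(p), Sigma, size=n)   # rows of E ~ N_p(0, Sigma)
Y = X @ B + E
```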

The multivariate surface model: The prior

Conditional on the knots, the priors for B and \Sigma are set as

vec B_i \mid \Sigma, \lambda_i \sim N(\mu_i, \Lambda_i^{1/2} \Sigma \Lambda_i^{1/2} \otimes P_i^{-1}), \quad i \in \{o, s, a\},
\Sigma \sim IW(n_0 S_0, n_0),

where \Lambda_i = diag(\lambda_i) contains the shrinkage parameters, which counteract overfitting through the prior.

- If P_i = I, the prior prevents singularity problems, as in the ridge regression estimate.
- If P_i = X_i'X_i, the prior uses the covariate information; it also gives a compressed version of the least squares estimate when \lambda_i is large.
- The shrinkage parameters are estimated within the MCMC scheme.
- A small \lambda_i shrinks the variance of the conditional posterior for B_i.
- This is another approach to selecting important variables (knots) and components.

We allow the two types of priors (P_i = I and P_i = X_i'X_i) to be mixed across components in order to combine their advantages.
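Under my reading of the (reconstructed) prior formula, vec B_i has a normal prior with Kronecker-structured covariance; a hedged sketch of one prior draw:

```python
import numpy as np

def draw_coef_prior(mu, Sigma, lam, P, rng):
    """Draw vec(B_i) ~ N(mu, Lambda^{1/2} Sigma Lambda^{1/2} kron P^{-1}).

    Sigma: (p, p) error covariance; lam: length-p shrinkage vector, Lambda = diag(lam);
    P: (k, k), e.g. I or X_i'X_i; mu: length p*k prior mean (all shapes are my assumption)."""
    L = np.diag(np.sqrt(lam))                         # Lambda^{1/2}
    cov = np.kron(L @ Sigma @ L, np.linalg.inv(P))    # Kronecker prior covariance
    return rng.multivariate_normal(mu, cov)

# e.g. for a (k x p) coefficient block with the ridge-like choice P = I:
# vecB = draw_coef_prior(np.zeros(k * p), Sigma, lam, np.eye(k), rng)
# B_i = vecB.reshape(k, p, order="F")                 # undo vec (column-major)
```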

The multivariate surface model: The Bayesian posterior

The posterior distribution is conveniently decomposed as

p(B, \Sigma, \xi, \lambda \mid Y, X) = p(B \mid \Sigma, \xi, \lambda, Y, X)\, p(\Sigma \mid \xi, \lambda, Y, X)\, p(\xi, \lambda \mid Y, X).

By conjugacy, p(B \mid \Sigma, \xi, \lambda, Y, X) follows a multivariate normal distribution. When p = 1, p(\Sigma \mid \xi, \lambda, Y, X) follows an inverse Wishart distribution,

IW\Big( n_0 + n,\; n_0 S_0 + n\tilde S + \sum_{i \in \{o,s,a\}} \Lambda_i^{1/2} (\tilde B_i - \mu_i)' P_i (\tilde B_i - \mu_i) \Lambda_i^{1/2} \Big).

When p \ge 2, p(\Sigma \mid \xi, \lambda, Y, X) has no closed form, but the expression above is a very accurate approximation. The marginal posterior of \Sigma, \xi and \lambda is then

p(\Sigma, \xi, \lambda \mid Y, X) = c \cdot p(\xi, \lambda) \cdot |\Sigma_\beta|^{-1/2} |\Sigma|^{-(n_0+n+p+1)/2} \exp\Big\{ -\tfrac{1}{2} \big[ \mathrm{tr}\, \Sigma^{-1} (n_0 S_0 + n\tilde S) + (\tilde\beta - \mu)' \Sigma_\beta^{-1} (\tilde\beta - \mu) \big] \Big\}.

The MCMC algorithm: Metropolis-Hastings within Gibbs

The coefficients B are sampled directly from their normal conditional distribution. The covariance \Sigma, all knots \xi, and the shrinkages \lambda are updated jointly using Metropolis-Hastings within Gibbs.

The proposal density for \xi and \lambda is a multivariate t-density with \nu > 2 degrees of freedom,

\theta_p \mid \theta_c \sim t\left( \hat\theta,\; -\left( \frac{\partial^2 \ln p(\theta \mid Y)}{\partial\theta\, \partial\theta'} \right)^{-1} \Big|_{\theta = \hat\theta},\; \nu \right),

where \hat\theta is obtained by R (R \le 3) Newton iterations during the proposal step, using analytical gradients in matrix form. The proposal density for \Sigma is the inverse Wishart density on the previous slide.
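A generic sketch of this finite-step Newton t-proposal pattern (numerical derivatives here; the talk uses analytical gradients, and this is not the authors' implementation):

```python
import numpy as np
from scipy.optimize import approx_fprime
from scipy.stats import multivariate_t

def num_hess(f, x, eps=1e-4):
    """Finite-difference Hessian of a scalar function f at x."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        e = np.zeros(n); e[i] = eps
        H[:, i] = (approx_fprime(x + e, f, eps) - approx_fprime(x - e, f, eps)) / (2 * eps)
    return 0.5 * (H + H.T)                            # symmetrize

def newton_t_proposal(logpost, start, steps=3, eps=1e-5):
    """R <= 3 Newton iterations from the current draw, then centre/scale for the t proposal."""
    th = np.asarray(start, float).copy()
    for _ in range(steps):
        g = approx_fprime(th, logpost, eps)
        th = th - np.linalg.solve(num_hess(logpost, th), g)   # Newton step toward the mode
    return th, np.linalg.inv(-num_hess(logpost, th))          # -(Hessian)^{-1} as scale matrix

def mh_step(logpost, theta_c, rng, nu=5):
    m_f, S_f = newton_t_proposal(logpost, theta_c)
    q_f = multivariate_t(loc=m_f, shape=S_f, df=nu)
    theta_p = q_f.rvs(random_state=rng)
    m_r, S_r = newton_t_proposal(logpost, theta_p)            # reverse-move proposal
    q_r = multivariate_t(loc=m_r, shape=S_r, df=nu)
    log_ratio = (logpost(theta_p) + q_r.logpdf(theta_c)
                 - logpost(theta_c) - q_f.logpdf(theta_p))
    return theta_p if np.log(rng.uniform()) < log_ratio else theta_c
```

The scale matrix is only valid where the Hessian at \hat\theta is negative definite; a practical implementation would fall back to, say, a diagonal scale when it is not.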

The MCMC algorithm: Computational remarks

The MCMC implementation is straightforward, but there are many parameters to estimate: e.g., 10 covariates and a univariate response with 50 knots in both the additive and surface components require 10 × 50 + 1 × 50 + 110 + 1 = 661 parameters.

We allow the parameters to be updated in parallel mode for small datasets and in batched mode for large datasets. We derive the analytical gradients used in the R-step Newton iterations, which are very complicated, and implement them efficiently.

Simulation study

We randomly generate datasets with various degrees of nonlinearity (p = 1, 2; n = 200, 1000). Covariates: a mixture of multivariate normals with five components. Bases: five true bases built from the covariates.

For each dataset we fit the fixed knots model with 5, 10, 15, 20, 25 and 50 surface knots, and the free knots model with 5, 10 and 15 surface knots.

The results show that the free knots model outperforms the fixed knots model in the large majority of the datasets. This is particularly true when the data are strongly nonlinear.

[Figure: the norm of the predictive multivariate residuals (p = 2) against the distance between the test covariates (randomly selected in covariate space, hence out-of-sample) and the DGP sample mean. Bars above the zero line are residuals from the model with 15 fixed surface knots; bars below are from the model with 5 free surface knots. Panels show the 3rd best, median and 3rd worst datasets, ranked with respect to the fixed knots model (q_s = 15) and the free knots model (q_s = 5).]

Application to firm leverage data: The data

The response is modeled on the logit scale, log[Y/(1 - Y)]. Variables (4,405 observations):
- leverage (Y): total debt / (total debt + book value of equity);
- tang: tangible assets / book value of total assets;
- market2book: (book value of total assets - book value of equity + market value of equity) / book value of total assets;
- logsales: logarithm of sales;
- profit: (earnings before interest, taxes, depreciation, and amortization) / book value of total assets.

[Figure: pairwise scatter plots of Y, log[Y/(1 - Y)], tang, market2book, logsales and profit.]

Application to firm leverage data: Model comparison

Models are compared with the log predictive density score (LPDS), defined through D-fold cross-validation as

LPDS = \frac{1}{D} \sum_{d=1}^{D} \ln p(\tilde Y_d \mid Y_{-d}, X),

with D = 5 folds.

[Figures: LPDS against the number of knots. Left: models with only a surface or only an additive component, fixed vs. free knots, 5 to 40 knots. Right: models with both components, a surface component combined with 20, 40 or 80 fixed or free additive knots.]
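A hedged sketch of how such a D-fold LPDS could be computed, with a plug-in Gaussian predictive density standing in for the full posterior predictive used in the talk (Y is assumed (n, p) here):

```python
import numpy as np
from scipy.stats import multivariate_normal

def lpds(Y, X, D=5, seed=0):
    """LPDS = (1/D) * sum_d ln p(Y_d | Y_{-d}, X) over D cross-validation folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(Y)), D)
    scores = []
    for d in range(D):
        test = folds[d]
        train = np.concatenate([folds[j] for j in range(D) if j != d])
        B = np.linalg.lstsq(X[train], Y[train], rcond=None)[0]   # plug-in coefficient fit
        R = Y[train] - X[train] @ B
        Sigma = R.T @ R / len(train)                             # plug-in error covariance
        means = X[test] @ B
        scores.append(sum(multivariate_normal(m, Sigma).logpdf(y)
                          for y, m in zip(Y[test], means)))      # ln p of held-out fold
    return np.mean(scores)
```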

Posterior locations of knots

[Figure: posterior locations of the knots in the (market2book, profit) plane, together with the prior means of the surface knots and of the additive knots.]

Application to firm leverage data: Posterior mean surface (left) and standard deviation (right)

[Figure: posterior mean surface and posterior standard deviation of the fitted regression surface over market2book and profit.]

Extensions

- It is easy to generalize the model to the GLM framework.
- Variable selection is possible for the knots.
- A Dirichlet process prior can be plugged into the model when heteroscedasticity is a problem.