Efficient Bayesian Multivariate Surface Regression


Feng Li (joint with Mattias Villani)
Department of Statistics, Stockholm University
October 2011

Outline of the talk

1. Flexible regression models
2. The challenges
3. The multivariate surface model
4. Simulation study
5. Application to firm leverage data
6. Extensions

Flexible regression models: Introduction

Flexible models of the regression function E(y|x) have been an active research field for decades. Attention has shifted from kernel regression methods to spline-based models. Splines are regression models with flexible mean functions. Example: a simple spline regression with a single explanatory variable and truncated linear basis functions takes the form

y = \alpha_0 + \alpha_1 x + \beta_1 (x - \xi_1)_+ + \dots + \beta_q (x - \xi_q)_+ + \varepsilon,

where the (x - \xi_i)_+ = \max(x - \xi_i, 0) are called the basis functions, and the \xi_i are called knots (the locations of the basis functions).
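As a concrete illustration (a minimal sketch, not the authors' code; the knots are fixed here, whereas the talk treats them as free parameters), the truncated linear basis and a least-squares fit look like this:

```python
import numpy as np

def truncated_linear_basis(x, knots):
    """One column (x - xi)_+ = max(x - xi, 0) per knot xi."""
    return np.maximum(x[:, None] - knots[None, :], 0.0)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 200)
y = np.sin(x) + 0.1 * rng.standard_normal(200)

knots = np.linspace(1.0, 9.0, 5)                  # fixed knot locations xi_1..xi_5
X = np.column_stack([np.ones_like(x), x,          # intercept and linear term
                     truncated_linear_basis(x, knots)])
coef = np.linalg.lstsq(X, y, rcond=None)[0]       # ordinary least-squares fit
```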

Flexible regression models: Spline example

[Figure: a spline fit with a single covariate and thin-plate bases.]

Flexible regression models: Spline regression with multiple covariates

Additive spline model. Each knot \xi_j (a scalar) is connected with only one covariate:

y = \alpha_0 + \alpha_1 x_1 + \dots + \alpha_q x_q + \sum_{j_1=1}^{m_1} \beta_{j_1} f(x_1, \xi_{j_1}) + \dots + \sum_{j_q=1}^{m_q} \beta_{j_q} f(x_q, \xi_{j_q}) + \varepsilon.

Good and simple if you know a priori that there are no interactions in the data.

Surface spline model. Each knot \xi_j (a vector) is connected with more than one covariate:

y = \alpha_0 + \alpha_1 x_1 + \dots + \alpha_q x_q + \sum_{j=1}^{m} \beta_j g(x_1, \dots, x_q, \xi_j) + \varepsilon.

A popular choice of g(x_1, \dots, x_q, \xi_j) is the multi-dimensional thin-plate spline

g(x_1, \dots, x_q, \xi_j) = \|x - \xi_j\|^2 \ln \|x - \xi_j\|.

It can handle interactions, but the model complexity increases dramatically with the number of interactive knots.
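A sketch of the multi-dimensional thin-plate basis above (my own helper function, not the authors' code):

```python
import numpy as np

def thinplate_basis(X, knots):
    """g(x, xi) = ||x - xi||^2 * ln||x - xi|| for each observation x and knot xi.

    X: (n, q) covariate matrix; knots: (m, q) knot locations; returns (n, m)."""
    d = np.linalg.norm(X[:, None, :] - knots[None, :, :], axis=2)
    with np.errstate(divide="ignore", invalid="ignore"):
        g = d ** 2 * np.log(d)
    return np.where(d > 0.0, g, 0.0)   # g -> 0 as d -> 0, so set g = 0 at the knot
```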

The challenges

How many knots are needed? Too few knots lead to a bad approximation; too many knots yield overfitting.

Where should the knots be placed? Equal spacing works for the additive model, but is clearly inefficient for the surface model.

Common approaches to these two problems:
- place sufficiently many knots and use variable selection to pick out the useful ones: not truly flexible;
- use reversible jump MCMC to move among model spaces with different numbers of knots: very sensitive to the prior and not computationally efficient;
- cluster the covariates to select knots: does not use the information in the responses.

How to choose between the additive spline and the surface spline? There is no established approach.

The multivariate surface model: The model

The multivariate surface model consists of three components, linear, surface and additive:

Y = X_o B_o + X_s(\xi_s) B_s + X_a(\xi_a) B_a + E.

We treat the knots \xi_i as unknown parameters and let them move freely. A model with a minimal number of free knots outperforms a model with many fixed knots.

For notational convenience, we sometimes write the model in compact form

Y = XB + E,

where X = [X_o, X_s, X_a], B = [B_o', B_s', B_a']', and the rows of E are iid N_p(0, \Sigma).
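To make the compact form concrete, here is a toy simulation of Y = XB + E under assumed dimensions, reusing the two basis helpers sketched earlier (all names and sizes are illustrative, not from the talk):

```python
import numpy as np

n, q, p, m_s = 500, 2, 2, 5                       # toy sizes: obs, covariates, responses, surface knots
rng = np.random.default_rng(1)
C = rng.standard_normal((n, q))                   # covariates

X_o = np.column_stack([np.ones(n), C])            # linear component
xi_s = C[rng.choice(n, m_s, replace=False)]       # surface knots (free parameters in the talk)
X_s = thinplate_basis(C, xi_s)                    # surface component
xi_a = np.quantile(C[:, 0], [0.25, 0.5, 0.75])    # additive knots for the first covariate
X_a = truncated_linear_basis(C[:, 0], xi_a)       # additive component

X = np.hstack([X_o, X_s, X_a])                    # X = [X_o, X_s, X_a]
B = rng.standard_normal((X.shape[1], p))          # B = [B_o', B_s', B_a']'
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
E = rng.multivariate_normal(np.zeros(p), Sigma, size=n)   # rows of E ~ N_p(0, Sigma)
Y = X @ B + E
```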

The multivariate surface model: The prior

Conditional on the knots, the priors for B and \Sigma are set as

vec B_i \mid \Sigma, \lambda_i \sim N(\mu_i, \Lambda_i^{1/2} \Sigma \Lambda_i^{1/2} \otimes P_i^{-1}), \quad i \in \{o, s, a\},
\Sigma \sim IW(n_0 S_0, n_0),

where \Lambda_i = diag(\lambda_i) contains the shrinkage parameters, which counteract overfitting through the prior.

- If P_i = I, the prior prevents singularity problems, as in the ridge regression estimate.
- If P_i = X_i'X_i, the prior uses the covariate information; it also gives a compressed version of the least squares estimate when \lambda_i is large.
- The shrinkage parameters are estimated within the MCMC scheme.
- A small \lambda_i shrinks the variance of the conditional posterior for B_i.
- This is another approach to selecting important variables (knots) and components.

We allow the two types of priors (P_i = I and P_i = X_i'X_i) to be mixed across components in order to combine their advantages.
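Under my reading of the (reconstructed) prior formula, vec B_i has a normal prior with Kronecker-structured covariance; a hedged sketch of one prior draw:

```python
import numpy as np

def draw_coef_prior(mu, Sigma, lam, P, rng):
    """Draw vec(B_i) ~ N(mu, Lambda^{1/2} Sigma Lambda^{1/2} kron P^{-1}).

    Sigma: (p, p) error covariance; lam: length-p shrinkage vector, Lambda = diag(lam);
    P: (k, k), e.g. I or X_i'X_i; mu: length p*k prior mean (all shapes are my assumption)."""
    L = np.diag(np.sqrt(lam))                         # Lambda^{1/2}
    cov = np.kron(L @ Sigma @ L, np.linalg.inv(P))    # Kronecker prior covariance
    return rng.multivariate_normal(mu, cov)

# e.g. for a (k x p) coefficient block with the ridge-like choice P = I:
# vecB = draw_coef_prior(np.zeros(k * p), Sigma, lam, np.eye(k), rng)
# B_i = vecB.reshape(k, p, order="F")                 # undo vec (column-major)
```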

The multivariate surface model: The Bayesian posterior

The posterior distribution is conveniently decomposed as

p(B, \Sigma, \xi, \lambda \mid Y, X) = p(B \mid \Sigma, \xi, \lambda, Y, X)\, p(\Sigma \mid \xi, \lambda, Y, X)\, p(\xi, \lambda \mid Y, X).

By conjugacy, p(B \mid \Sigma, \xi, \lambda, Y, X) follows a multivariate normal distribution. When p = 1, p(\Sigma \mid \xi, \lambda, Y, X) follows an inverse Wishart distribution,

IW\Big( n_0 + n,\; n_0 S_0 + n\tilde S + \sum_{i \in \{o,s,a\}} \Lambda_i^{1/2} (\tilde B_i - \mu_i)' P_i (\tilde B_i - \mu_i) \Lambda_i^{1/2} \Big).

When p \ge 2, p(\Sigma \mid \xi, \lambda, Y, X) has no closed form, but the expression above is a very accurate approximation. The marginal posterior of \Sigma, \xi and \lambda is then

p(\Sigma, \xi, \lambda \mid Y, X) = c \cdot p(\xi, \lambda) \cdot |\Sigma_\beta|^{-1/2} |\Sigma|^{-(n_0+n+p+1)/2} \exp\Big\{ -\tfrac{1}{2} \big[ \mathrm{tr}\, \Sigma^{-1} (n_0 S_0 + n\tilde S) + (\tilde\beta - \mu)' \Sigma_\beta^{-1} (\tilde\beta - \mu) \big] \Big\}.

The MCMC algorithm: Metropolis-Hastings within Gibbs

The coefficients B are sampled directly from their normal conditional distribution. The covariance \Sigma, all knots \xi, and the shrinkages \lambda are updated jointly using Metropolis-Hastings within Gibbs.

The proposal density for \xi and \lambda is a multivariate t-density with \nu > 2 degrees of freedom,

\theta_p \mid \theta_c \sim t\left( \hat\theta,\; -\left( \frac{\partial^2 \ln p(\theta \mid Y)}{\partial\theta\, \partial\theta'} \right)^{-1} \Big|_{\theta = \hat\theta},\; \nu \right),

where \hat\theta is obtained by R (R \le 3) Newton iterations during the proposal step, using analytical gradients in matrix form. The proposal density for \Sigma is the inverse Wishart density on the previous slide.
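A generic sketch of this finite-step Newton t-proposal pattern (numerical derivatives here; the talk uses analytical gradients, and this is not the authors' implementation):

```python
import numpy as np
from scipy.optimize import approx_fprime
from scipy.stats import multivariate_t

def num_hess(f, x, eps=1e-4):
    """Finite-difference Hessian of a scalar function f at x."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        e = np.zeros(n); e[i] = eps
        H[:, i] = (approx_fprime(x + e, f, eps) - approx_fprime(x - e, f, eps)) / (2 * eps)
    return 0.5 * (H + H.T)                            # symmetrize

def newton_t_proposal(logpost, start, steps=3, eps=1e-5):
    """R <= 3 Newton iterations from the current draw, then centre/scale for the t proposal."""
    th = np.asarray(start, float).copy()
    for _ in range(steps):
        g = approx_fprime(th, logpost, eps)
        th = th - np.linalg.solve(num_hess(logpost, th), g)   # Newton step toward the mode
    return th, np.linalg.inv(-num_hess(logpost, th))          # -(Hessian)^{-1} as scale matrix

def mh_step(logpost, theta_c, rng, nu=5):
    m_f, S_f = newton_t_proposal(logpost, theta_c)
    q_f = multivariate_t(loc=m_f, shape=S_f, df=nu)
    theta_p = q_f.rvs(random_state=rng)
    m_r, S_r = newton_t_proposal(logpost, theta_p)            # reverse-move proposal
    q_r = multivariate_t(loc=m_r, shape=S_r, df=nu)
    log_ratio = (logpost(theta_p) + q_r.logpdf(theta_c)
                 - logpost(theta_c) - q_f.logpdf(theta_p))
    return theta_p if np.log(rng.uniform()) < log_ratio else theta_c
```

The scale matrix is only valid where the Hessian at \hat\theta is negative definite; a practical implementation would fall back to, say, a diagonal scale when it is not.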

The MCMC algorithm: Computational remarks

The MCMC implementation is straightforward, but there are many parameters to estimate: e.g., 10 covariates and a univariate response with 50 knots in both the additive and surface components require 10 × 50 + 1 × 50 + 110 + 1 = 661 parameters.

We allow the parameters to be updated in parallel mode for small datasets and in batched mode for large datasets. We derive the analytical gradients used in the R-step Newton iterations, which are very complicated, and implement them efficiently.

Simulation study

We randomly generate datasets with various degrees of nonlinearity (p = 1, 2; n = 200, 1000). Covariates: a mixture of multivariate normals with five components. Bases: five true bases built from the covariates.

For each dataset we fit the fixed knots model with 5, 10, 15, 20, 25 and 50 surface knots, and the free knots model with 5, 10 and 15 surface knots.

The results show that the free knots model outperforms the fixed knots model in the large majority of the datasets. This is particularly true when the data are strongly nonlinear.

[Figure: the norm of the predictive multivariate residuals (p = 2) against the distance between the test covariates (randomly selected in covariate space, hence out-of-sample) and the DGP sample mean. Bars above the zero line are residuals from the model with 15 fixed surface knots; bars below are from the model with 5 free surface knots. Panels show the 3rd best, median and 3rd worst datasets, ranked with respect to the fixed knots model (q_s = 15) and the free knots model (q_s = 5).]

Application to firm leverage data: The data

The response is modeled on the logit scale, log[Y/(1 - Y)]. Variables (4,405 observations):
- leverage (Y): total debt / (total debt + book value of equity);
- tang: tangible assets / book value of total assets;
- market2book: (book value of total assets - book value of equity + market value of equity) / book value of total assets;
- logsales: logarithm of sales;
- profit: (earnings before interest, taxes, depreciation, and amortization) / book value of total assets.

[Figure: pairwise scatter plots of Y, log[Y/(1 - Y)], tang, market2book, logsales and profit.]

Application to firm leverage data: Model comparison

Models are compared with the log predictive density score (LPDS), defined through D-fold cross-validation as

LPDS = \frac{1}{D} \sum_{d=1}^{D} \ln p(\tilde Y_d \mid Y_{-d}, X),

with D = 5 folds.

[Figures: LPDS against the number of knots. Left: models with only a surface or only an additive component, fixed vs. free knots, 5 to 40 knots. Right: models with both components, a surface component combined with 20, 40 or 80 fixed or free additive knots.]
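A hedged sketch of how such a D-fold LPDS could be computed, with a plug-in Gaussian predictive density standing in for the full posterior predictive used in the talk (Y is assumed (n, p) here):

```python
import numpy as np
from scipy.stats import multivariate_normal

def lpds(Y, X, D=5, seed=0):
    """LPDS = (1/D) * sum_d ln p(Y_d | Y_{-d}, X) over D cross-validation folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(Y)), D)
    scores = []
    for d in range(D):
        test = folds[d]
        train = np.concatenate([folds[j] for j in range(D) if j != d])
        B = np.linalg.lstsq(X[train], Y[train], rcond=None)[0]   # plug-in coefficient fit
        R = Y[train] - X[train] @ B
        Sigma = R.T @ R / len(train)                             # plug-in error covariance
        means = X[test] @ B
        scores.append(sum(multivariate_normal(m, Sigma).logpdf(y)
                          for y, m in zip(Y[test], means)))      # ln p of held-out fold
    return np.mean(scores)
```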

Posterior locations of knots

[Figure: posterior locations of the knots in the (market2book, profit) plane, together with the prior means of the surface knots and of the additive knots.]

Application to firm leverage data: Posterior mean surface (left) and standard deviation (right)

[Figure: posterior mean surface and posterior standard deviation of the fitted regression surface over market2book and profit.]

Extensions

- It is easy to generalize the model to the GLM framework.
- Variable selection is possible for the knots.
- A Dirichlet process prior can be plugged into the model when heteroscedasticity is a problem.