Bayesian Backfitting
Stanford University, June 2, 1998

Trevor Hastie, Stanford University
Rob Tibshirani, University of Toronto

Email: trevor@stat.stanford.edu
Ftp: stat.stanford.edu:pub/hastie
WWW: http://stat.stanford.edu/~trevor

These transparencies are available via ftp: ftp://stat.stanford.edu/pub/hastie/bayes.ps
In a Nutshell

$$y = f_1 + f_2 + \cdots + f_p + \varepsilon$$

Backfitting cycles around and replaces each current function estimate $f_j$ by
$$f_j \leftarrow S_j\Big(y - \sum_{k \neq j} f_k\Big),$$
where $S_j$ is a smoothing operator; Gibbs sampling cycles around and obtains a new realization of $f_j$ via
$$f_j \leftarrow S_j\Big(y - \sum_{k \neq j} f_k\Big) + \sigma S_j^{1/2} z,$$
where $z$ is a vector of $N(0,1)$ variates and $\sigma$ is the noise standard deviation.
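To make the parallel concrete, here is a minimal sketch of the two sweeps (not from the original talk), assuming each smoother is supplied as an explicit matrix `S[j]` together with a square root `S_half[j]`; later slides show how smoothing-spline structure avoids ever forming these N x N matrices.

```python
import numpy as np

def backfit_sweep(y, f, S):
    """One deterministic backfitting sweep: f_j <- S_j (y - sum_{k != j} f_k)."""
    for j in range(len(f)):
        partial_resid = y - sum(f[k] for k in range(len(f)) if k != j)
        f[j] = S[j] @ partial_resid
    return f

def gibbs_sweep(y, f, S, S_half, sigma, rng):
    """One stochastic sweep: the backfitting update plus sigma * S_j^{1/2} z."""
    for j in range(len(f)):
        partial_resid = y - sum(f[k] for k in range(len(f)) if k != j)
        z = rng.standard_normal(len(y))
        f[j] = S[j] @ partial_resid + sigma * (S_half[j] @ z)
    return f
```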
Example: Air Pollution

[Figure: scatterplots of four environmental variables in an air pollution study: Inversion Base Height, Daggot Pressure Gradient, Inversion Base Temperature, and Visibility (miles).]
Outline

- Smoothing splines: a Bayesian version
- Additive models and backfitting
- Bayesian backfitting
- Example: bone mineral density
- Priors, variance components and DF
- GAMs and Metropolis-Hastings
Smoothing Splines

$$y_i = f(x_i) + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2)$$
$$R(f, y) = \sum_i (y_i - f(x_i))^2 + \lambda \int [f''(x)]^2\, dx$$

- The solution is a natural cubic spline, with knots at the unique values of $x$.
- Solutions vary between linear fits ($\lambda = \infty$) and interpolating fits ($\lambda = 0$).
- The finite-dimensional criterion $(y - f)^T (y - f) + \lambda f^T K f$ is minimized by
$$\hat f = (I + \lambda K)^{-1} y = S(\lambda)\, y$$
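As a concrete illustration (not part of the talk), a minimal Python sketch of the finite-dimensional solution, using a squared second-difference penalty as a simple stand-in for the natural-spline penalty matrix $K$; the dense solve here is $O(N^3)$, whereas banded spline algorithms achieve $O(N)$.

```python
import numpy as np

def second_difference_penalty(n):
    """A squared second-difference penalty, standing in for the spline K."""
    D = np.diff(np.eye(n), n=2, axis=0)       # (n-2) x n difference operator
    return D.T @ D

def smooth(y, lam):
    """The penalized fit f_hat = (I + lam K)^{-1} y = S(lam) y."""
    n = len(y)
    return np.linalg.solve(np.eye(n) + lam * second_difference_penalty(n), y)

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)
y = np.sin(4 * np.pi * x) + 0.3 * rng.standard_normal(100)
f_hat = smooth(y, lam=1.0)
```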
Bayesian Smoothing Splines

$$y = f + \varepsilon$$

- $f \sim N(0, \tau^2 K^{-})$, where $K^{-}$ (a generalized inverse of the penalty matrix) is partially improper: linear functions have infinite variances.
- $\varepsilon \sim N(0, \sigma^2 I)$

$$-\log \pi(f \mid y) \propto (y - f)^T (y - f)/2\sigma^2 + f^T K f/2\tau^2$$

- $f \mid y \sim N(S(\lambda)\, y,\; S(\lambda)\, \sigma^2)$, where $\lambda = \sigma^2/\tau^2$
- A posterior realization: $f = S(\lambda)\, y + \sigma S(\lambda)^{1/2} z$, where $z$ is an $N$-vector of $N(0,1)$ variates.
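A direct sketch of a single posterior draw, assuming $K$ is available as a dense matrix. The eigendecomposition square root is an illustrative choice costing $O(N^3)$; the talk's smoothing-spline computations obtain the same draw in $O(N)$.

```python
import numpy as np

def posterior_draw(y, K, sigma2, tau2, rng):
    """One draw from f | y ~ N(S(lam) y, sigma^2 S(lam)), lam = sigma^2/tau^2."""
    n = len(y)
    lam = sigma2 / tau2
    S = np.linalg.inv(np.eye(n) + lam * K)    # the smoother matrix S(lam)
    w, V = np.linalg.eigh(S)                  # S is symmetric, so eigh applies
    S_half = V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T   # S^{1/2}
    z = rng.standard_normal(n)
    return S @ y + np.sqrt(sigma2) * (S_half @ z)
```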
Picture the Prior

$f \sim N(0, \tau^2 K^{-})$, or equivalently, $f = N\gamma$ with $\gamma \sim N(0, \tau^2 D)$.

[Figure: the leading eigenfunctions of the prior covariance (four panels, indexed by number), and the prior variance spectrum on a log scale against number.]
Backfitting Additive Models

$$y = f_1(x_1) + f_2(x_2) + \cdots + f_p(x_p) + \varepsilon$$

Estimating equations:
$$f_1(x_1) = S_1(y - f_2(x_2) - \cdots - f_p(x_p))$$
$$f_2(x_2) = S_2(y - f_1(x_1) - \cdots - f_p(x_p))$$
$$\vdots$$
$$f_p(x_p) = S_p(y - f_1(x_1) - f_2(x_2) - \cdots)$$

where the $S_j$ are:

- univariate regression smoothers such as smoothing splines, lowess, kernels
- linear regression operators yielding polynomial fits, piecewise polynomials, ...
- more complicated operators: surface smoothers for 2nd-order interactions, and random-effects shrinkage operators.

We use Gauss-Seidel or "backfitting" to solve these estimating equations, as in the sketch below.
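A minimal backfitting driver under the same explicit-matrix assumption as before, iterating Gauss-Seidel sweeps until the fitted functions stop changing; the tolerance and sweep limit are illustrative choices.

```python
import numpy as np

def backfit(y, S, tol=1e-8, max_sweeps=100):
    """Gauss-Seidel ("backfitting") solution of the estimating equations."""
    p = len(S)
    f = [np.zeros_like(y) for _ in range(p)]
    for _ in range(max_sweeps):
        delta = 0.0
        for j in range(p):
            partial_resid = y - sum(f[k] for k in range(p) if k != j)
            new_fj = S[j] @ partial_resid
            delta = max(delta, float(np.max(np.abs(new_fj - f[j]))))
            f[j] = new_fj
        if delta < tol:        # stop once a full sweep changes nothing
            break
    return f
```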
Justification?

Example: penalized least squares. Minimize
$$\sum_{i=1}^n \Big(y_i - \sum_{j=1}^p f_j(x_{ij})\Big)^2 + \sum_{j=1}^p \lambda_j \int (f_j''(t))^2\, dt$$
$$\Updownarrow$$
$$f_1 = S_1(\lambda_1)\Big(y - \sum_{j \neq 1} f_j\Big)$$
$$f_2 = S_2(\lambda_2)\Big(y - \sum_{j \neq 2} f_j\Big)$$
$$\vdots$$
$$f_p = S_p(\lambda_p)\Big(y - \sum_{j \neq p} f_j\Big)$$
where $S_j(\lambda_j)$ denotes a smoothing spline using variable $x_j$ and penalty coefficient $\lambda_j$. Each smoothing-spline fit is $O(N)$ computations, hence so is the entire fit.
Bayesian Additive Splines

$$f_j \sim N(0, \tau_j^2 K_j^{-})$$
$$f_j \mid y \sim N(G_j y,\; C_j \sigma^2)$$
where $G_j$ and $C_j$ are ugly and $O(N^3)$ computations. You don't believe me? See Wahba (1990).

- Backfitting computes $\hat f_j = G_j y$ efficiently in $O(N)$ computations, but not $C_j$.
- Current GAM software in Splus approximates $C_j$ by $C_j^0 + S_j$, where $C_j^0$ is the exact posterior covariance operator for the linear part of $f_j$, and $S_j$ is the exact posterior covariance operator for the nonlinear part of a univariate spline problem.
Gibbs Sampling Saves the Day!

$$\mathcal{L}(f_j \mid y, \{f_k,\, k \neq j\}) = \mathcal{L}\Big(f_j \,\Big|\, y - \sum_{k \neq j} f_k\Big) = N\Big(S_j\Big(y - \sum_{k \neq j} f_k\Big),\; S_j \sigma^2\Big)$$

By replacing the backfitting update $f_j \leftarrow S_j(y - \sum_{k \neq j} f_k)$ by the univariate Bayesian spline posterior sampler
$$f_j \leftarrow S_j\Big(y - \sum_{k \neq j} f_k\Big) + \sigma S_j^{1/2} z$$
we generate a Markov chain whose stationary distribution coincides with $\pi(f_1, f_2, \ldots, f_p \mid y)$.

Carter and Kohn (1994, Biometrika) and Wong and Kohn (1998, manuscript) propose a similar algorithm using a state-space approach.
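Putting the pieces together, a sketch of the full Gibbs chain, again assuming explicit smoother matrices and, for now, a fixed $\sigma$ (the variance updates appear on a later slide); the iteration and burn-in counts are illustrative defaults.

```python
import numpy as np

def bayesian_backfit(y, S, S_half, sigma, n_iter=2000, burn=500, seed=0):
    """Gibbs sampler for (f_1, ..., f_p | y): each sweep replaces f_j with a
    draw from its full conditional N(S_j r_j, sigma^2 S_j)."""
    rng = np.random.default_rng(seed)
    p, n = len(S), len(y)
    f = [np.zeros(n) for _ in range(p)]
    samples = []
    for it in range(n_iter):
        for j in range(p):
            partial_resid = y - sum(f[k] for k in range(p) if k != j)
            z = rng.standard_normal(n)
            f[j] = S[j] @ partial_resid + sigma * (S_half[j] @ z)
        if it >= burn:
            samples.append([fj.copy() for fj in f])
    return samples
```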
Other Operators

We can use the Bayesian backfitting algorithm with operators other than smoothing splines $S_j$:

- General nonparametric smoother: $S_j$ defines the smoothing operation, with implicit prior $f_j \sim N(0,\; (S_j^{-1} - I)^{-1}\sigma^2)$. The operator $S_j^{1/2}$ is found by a simple Taylor series expansion.
- Fixed linear effects: $S_j = X_j(X_j^T X_j)^{-1} X_j^T$. This results from the model $f_j = X_j\beta_j$ with $\beta_j \sim N(0, \tau^2 D)$, $D$ diagonal, and $\tau \to \infty$. Then $S_j^{1/2} = S_j$ and is easily applied. For the intercept term, for example, we simply obtain $\mu \sim N(\mathrm{ave}[y - \sum_1^p f_j],\; \sigma^2/n)$.
- Random linear effects: $S_j = X_j(X_j^T X_j + \sigma^2\Omega^{-1})^{-1} X_j^T$. This results from $f_j = X_j\beta_j$ with $\beta_j \sim N(0, \Omega)$.
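A small sketch of the fixed-effects case: because $S_j$ here is a symmetric projection (idempotent), $S_j^{1/2} = S_j$, so a posterior draw needs nothing beyond two least-squares projections. The function name and use of `lstsq` are illustrative choices, not the talk's implementation.

```python
import numpy as np

def fixed_effect_draw(partial_resid, X, sigma, rng):
    """Posterior draw for a fixed linear effect: S_j r + sigma * S_j z, where
    S_j = X (X^T X)^{-1} X^T is a projection, so S_j^{1/2} = S_j. With X a
    single column of ones this reduces to mu ~ N(ave[r], sigma^2 / n)."""
    project = lambda v: X @ np.linalg.lstsq(X, v, rcond=None)[0]
    z = rng.standard_normal(len(partial_resid))
    return project(partial_resid) + sigma * project(z)
```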
Example: Growth Curves

$$y_{ij} = f(t_{ij}) + x_i^T \beta_E + V_i + \varepsilon_{ij}$$

[Figure: spinal bone mineral density versus age with the fitted mean curve; ethnic-group effects (Asian, Black, Hispanic, White); and the random individual-level effects against age.]
Functionals of Posteriors

[Figure: posterior distributions of functionals; the derivative of the posterior mean curve against age.]

The location of the maximum derivative (the center of the growth spurt) is not too convincing. We now attempt a more realistic model.
Computing $\sigma S^{1/2} z$

Smoothing splines: writing $f(x) = \sum_{j=1}^M b_j(x)\gamma_j$ in terms of the natural spline basis, the posterior distribution for $f$ has the form
$$f \mid y \sim N(Sy,\; S\sigma^2) = N(B\hat\gamma,\; B\hat\Sigma B^T)$$
$$\hat\gamma = (B^T B + \lambda\Omega)^{-1} B^T y$$
$$\hat\Sigma = \sigma^2 (B^T B + \lambda\Omega)^{-1}$$
The Cholesky square root of the last expression is computed routinely in the smoothing-spline computations, and so is available for our purposes.

General smoothers:
$$S^{1/2} = S - \tfrac{1}{2}S(S - I) + \tfrac{3}{8}S(S - I)^2 - \tfrac{5}{16}S(S - I)^3 + \cdots$$
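A sketch of the truncated series for a general symmetric smoother, with an eigendecomposition check. The series is the binomial expansion of $S\,(I + (S - I))^{-1/2}$ and assumes the eigenvalues of $S$ lie in $[0, 1]$; truncating after a few terms is an approximation whose error is largest for eigenvalues near 0.

```python
import numpy as np

def sqrt_series(S, order=3):
    """S^{1/2} ~= S - 1/2 S(S-I) + 3/8 S(S-I)^2 - 5/16 S(S-I)^3: the binomial
    series of S (I + (S - I))^{-1/2}, truncated after `order` powers of S - I."""
    coefs = [1.0, -0.5, 3.0 / 8, -5.0 / 16]   # (1 + x)^{-1/2} coefficients
    D = S - np.eye(S.shape[0])
    term, total = np.eye(S.shape[0]), np.zeros_like(S)
    for c in coefs[: order + 1]:
        total += c * term
        term = term @ D
    return S @ total

# Check against the exact square root of a small symmetric smoother
w0 = np.linspace(0.05, 0.95, 6)               # eigenvalues in (0, 1)
Q, _ = np.linalg.qr(np.random.default_rng(0).standard_normal((6, 6)))
S = Q @ np.diag(w0) @ Q.T
exact = Q @ np.diag(np.sqrt(w0)) @ Q.T
print(np.max(np.abs(sqrt_series(S) - exact)))  # largest error at small w0
```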
$$y_{ij} = f(t_{ij} - \theta_i) + x_i^T \beta_E + V_i + \varepsilon_{ij}$$

[Figure: spinal bone mineral density versus age with the fitted principal curve; the mean age curve; and the random level shifts and random age shifts, each against girl average age.]
[Figure: age-shifted mean curves and level-shifted mean curves against age, and the shrunken level fits against adjusted age, by girl average age.]
[Figure: posterior distributions of the age shifts (theta) and level shifts (V) for eight girls (107, 93, 124, 37, 109, 57, 68, 36), with their spinal BMD trajectories against girl average age.]
Estimating $\sigma^2$, $\tau^2$, ...

- A fully hierarchical Bayesian approach puts priors on these hyperparameters, and generates them along with the other posterior realizations.
- Empirical Bayes procedures maximize the marginal likelihood of $y$ to estimate the hyperparameters. Very similar to REML (Restricted Maximum Likelihood Estimation). This highlights the formal equivalence of additive spline models and mixed-effects models:
$$Y = N_1\gamma_1 + N_2\gamma_2 + \cdots + N_p\gamma_p + \varepsilon$$
- Cross-validation, GCV and $C_p$ use prediction error to guide selection.
Priors for $\sigma^2$, $\tau^2$, ...

- $p(\tau_j^2) \propto 1/\tau_j^2$, or $(1/\tau_j^2)\exp(-\delta/\tau_j^2)$ with $\delta = $ 10e-10.
- $p(\sigma^2) \propto 1/\sigma^2$. (Wong and Kohn (1998), Carter and Kohn (1994))

These lead to inverse Gamma posteriors:
$$p(\tau_j^2 \mid y, \sigma^2, \{f_j\}_1^p) = p(\tau_j^2 \mid f_j) = IG\Big(n/2,\; \tfrac{1}{2}f_j^T K_j f_j + \delta\Big)$$
$$p(\sigma^2 \mid \text{rest}) = IG\Big(n/2,\; \tfrac{1}{2}\|e\|^2\Big), \qquad e = y - \sum_j f_j$$

These are generated within each cycle of the Gibbs algorithm, along with the functions. $O(N)$ computations.
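These conditionals are easy to sample. A sketch, assuming dense penalty matrices: an inverse-gamma draw is the reciprocal of a gamma draw, with the rate passed as `1/scale`.

```python
import numpy as np

def draw_variances(f, K, y, delta=1e-10, rng=None):
    """Draw tau_j^2 ~ IG(n/2, f_j' K_j f_j / 2 + delta) and
    sigma^2 ~ IG(n/2, ||e||^2 / 2) from their full conditionals."""
    rng = rng or np.random.default_rng()
    n = len(y)
    # If X ~ Gamma(shape, scale = 1/rate), then 1/X ~ InverseGamma(shape, rate)
    tau2 = [1.0 / rng.gamma(n / 2, 1.0 / (0.5 * fj @ Kj @ fj + delta))
            for fj, Kj in zip(f, K)]
    e = y - sum(f)
    sigma2 = 1.0 / rng.gamma(n / 2, 1.0 / (0.5 * e @ e))
    return tau2, sigma2
```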
Posterior for df

The effective degrees of freedom are defined as $\mathrm{df} = \mathrm{tr}\, S(\lambda)$, where $\lambda = \sigma^2/\tau^2$.

[Figure: traces of the degrees of freedom over the Gibbs iterations for Inversion Base Height, Daggot Pressure Gradient, Inversion Base Temperature, and Visibility (miles).]
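For monitoring, each Gibbs iteration can convert the current $(\sigma^2, \tau_j^2)$ into a df value. A dense-matrix sketch; spline algorithms would obtain the trace in $O(N)$.

```python
import numpy as np

def effective_df(K, sigma2, tau2):
    """df = tr S(lam) with lam = sigma^2 / tau^2 and S(lam) = (I + lam K)^{-1}."""
    lam = sigma2 / tau2
    n = K.shape[0]
    return np.trace(np.linalg.inv(np.eye(n) + lam * K))
```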
Prior for df?

Using the priors for $\tau^2$ and $\sigma^2$, we can induce a prior for df for any sequence of $x$ values.

[Figure: histogram of the induced prior on degrees of freedom.]

Actually, this figure is based on $\log_2\lambda \sim U[-25, 25]$. As $25 \to \infty$, we get point masses of $\tfrac{1}{2}$ at 2 and $N$.
Bayes vs Bootstrap

Suppose $\sigma^2$ is known, and we add residuals $r^* \sim N(0, I\sigma^2)$ to $\hat f$ and refit. For a single smoothing spline:

- Bootstrap: $f^* \sim N(S^2 y,\; S^2\sigma^2)$
- Bayes: $f \mid y \sim N(Sy,\; S\sigma^2)$

$S > S^2$, and the Bayes posterior intervals are wider than the bootstrap intervals; they include an average bias component.

For an additive spline model:

- Bootstrap: $f_j^* \sim N(A_j A y,\; A_j^2\sigma^2)$
- Bayes: $f_j \mid y \sim N(A_j y,\; (I - A_j)S_j(I - S_j)^{-1}\sigma^2)$

        Bayes   Bootstrap
  0.0   .41     .45
  0.5   .43     .47
  0.9   .51     .64
Generalized Additive Models

Suppose instead we have a GAM, such as an additive logistic regression model:
$$\mathrm{logit}\, P(Y = 1 \mid x) = \sum_j f_j(x_j)$$
where the $f_j$ can be functions, random effects or "fixed" effects, each with their (Gaussian) priors $N(0, \tau_j^2 K_j^{-})$ and hyperparameters $\tau_j$.

Similar to Zeger and Karim (1991, JASA), we propose a Metropolis-Hastings scheme for updating the functions:

- At the current state, approximate the likelihood by a Gaussian, thus creating a working response $z_i$ and weights $w_i$.
- Generate a new realization $f_j'$ to replace $f_j$ from this Gaussian approximation, which we denote by $q(f_j, f_j')$.
Move to $f_j'$ with probability
$$\min\left(\frac{\pi(f')\, q(f', f)}{\pi(f)\, q(f, f')},\; 1\right)$$
where $\pi(f)$ denotes the posterior.

- Again, all the computations can be performed in $O(N)$ operations per update for smoothing splines, random effects and fixed effects.
- This allows for estimation of mixed-effects GLMs, with both the usual random effects as well as nonparametric smoothers, in a seamless fashion. A sketch of the accept/reject step follows.
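As a runnable illustration of the accept/reject step (a sketch only): the talk's proposal $q$ comes from the Gaussian approximation to the likelihood, but here we substitute a symmetric random-walk proposal shaped by $S_j^{1/2}$, so the $q$ terms cancel and the ratio reduces to $\pi(f')/\pi(f)$. All names and the `step` tuning parameter are illustrative.

```python
import numpy as np

def log_posterior_fj(fj, eta_rest, y, K, tau2):
    """Bernoulli log-likelihood plus the Gaussian spline prior, up to constants."""
    eta = fj + eta_rest
    loglik = np.sum(y * eta - np.logaddexp(0.0, eta))
    logprior = -0.5 * (fj @ K @ fj) / tau2
    return loglik + logprior

def mh_update(fj, eta_rest, y, K, tau2, S_half, step, rng):
    """One Metropolis-Hastings update of f_j in an additive logistic model,
    with a symmetric random-walk proposal, accepted w.p. min(pi'/pi, 1)."""
    prop = fj + step * (S_half @ rng.standard_normal(len(fj)))
    log_ratio = (log_posterior_fj(prop, eta_rest, y, K, tau2)
                 - log_posterior_fj(fj, eta_rest, y, K, tau2))
    if np.log(rng.uniform()) < log_ratio:
        return prop        # accepted
    return fj              # rejected: keep the current function
```

The End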