Reduction of Model Complexity and the Treatment of Discrete Inputs in Computer Model Emulation


Curtis B. Storlie, Los Alamos National Laboratory. E-mail: storlie@lanl.gov

Outline: Reduction of Emulator Complexity; Variable Selection; Functional ANOVA; Emulation using Functional ANOVA and Variable Selection; Adaptive Component Selection and Smoothing Operator; Bayesian Smoothing Spline ANOVA Models; Discrete Inputs; Simulation Study; Example from the Yucca Mountain Analysis; Conclusions and Further Work

Motivating Example: Computational Model from the Yucca Mountain Certification. 150 input variables (several of which are discrete in nature) and dozens of time-dependent responses. Response variable (for this illustration) ESIC239C.10K: cumulative release of Ic (i.e., glass) colloid of 239Pu (Plutonium-239) out of the Engineered Barrier System into the Unsaturated Zone at 10,000 years. The model is very expensive to run; we have a Latin Hypercube sample of size n = 300 at which the model has been evaluated. How do we perform sensitivity/uncertainty analysis?

Computer Model Emulation. An emulator is a simpler model that mimics a larger physical model; evaluations of an emulator are much faster. Nonparametric regression: we have n observations from the model, $y_i = f(x_i) + \varepsilon_i$, $i = 1, \dots, n$, where $x_i = (x_{i,1}, \dots, x_{i,p})$ and f is the physical model. Usually only weak assumptions are made about f (e.g., f belongs to a class of smooth functions). Methods of estimation: orthogonal series/wavelets, kernel smoothing/local regression, penalization methods (smoothing splines, Gaussian processes), and machine learning/algorithmic approaches. With a limited number of model evaluations and a large number of inputs, we need to reduce emulator complexity.
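
To make the setup concrete, here is a minimal sketch (not a method proposed in this talk) of the generic emulation workflow: evaluate the simulator on a Latin Hypercube design and fit a fast surrogate. The toy simulator, the choice of p = 10 inputs (only two of which are active), and the use of scikit-learn's Gaussian process regressor are all illustrative assumptions.

```python
# Generic emulation workflow: run the expensive model on an LHS design,
# fit a cheap surrogate, then predict (with uncertainty) at new inputs.
import numpy as np
from scipy.stats import qmc
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def expensive_model(x):                      # hypothetical simulator; only 2 of 10 inputs matter
    return np.sin(2 * np.pi * x[:, 0]) + x[:, 1] ** 2

p, n = 10, 300                               # inputs and design size, as in the Yucca Mountain example
X = qmc.LatinHypercube(d=p, seed=1).random(n)
y = expensive_model(X)

emulator = GaussianProcessRegressor(
    kernel=RBF(length_scale=np.ones(p)) + WhiteKernel(),
    normalize_y=True,
).fit(X, y)

X_new = qmc.LatinHypercube(d=p, seed=2).random(1000)
y_hat, y_sd = emulator.predict(X_new, return_std=True)   # fast predictions + uncertainty
```

Once fitted, the emulator can be evaluated thousands of times at negligible cost, which is what makes the sensitivity/uncertainty analysis below feasible.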


Variable Selection in Regression Models. Focus for now on variable selection in the linear model $y = \beta_0 + \sum_{j=1}^{p} \beta_j x_j + \varepsilon$. Stepwise/best-subsets model fitting can produce unstable estimates. More recently: continuous shrinkage using an L1 penalty (LASSO), Tibshirani 1996, and Stochastic Search Variable Selection (SSVS), George & McCulloch 1993, 1997.

Shrinkage, aka Penalized Regression. Ridge regression: find the minimizing $\beta_j$'s of
$$\frac{1}{n}\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{i,j}\Big)^2 + \lambda \sum_{j=1}^{p}\beta_j^2.$$
Note: all of the x's must first be standardized. Improved MSE estimation via the bias-variance trade-off. Ridge regression is equivalent to minimizing
$$\frac{1}{n}\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{i,j}\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}\beta_j^2 < t^2$$
for some $t(\lambda)$.
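
For reference, this objective has a closed-form solution, $\hat\beta = (X'X + n\lambda I)^{-1}X'y$, for standardized predictors and a centered response. A small sketch on simulated data (the data-generating model is arbitrary):

```python
# Ridge estimator in closed form for the objective (1/n)*RSS + lambda*sum(beta^2):
# beta_hat = (X'X + n*lambda*I)^(-1) X'y, after standardizing X and centering y.
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 100, 5, 0.1
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)

Xs = (X - X.mean(axis=0)) / X.std(axis=0)    # standardize the x's (as the slide notes)
yc = y - y.mean()                            # centering absorbs beta_0
beta_ridge = np.linalg.solve(Xs.T @ Xs + n * lam * np.eye(p), Xs.T @ yc)
```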

Shrinkage, aka Penalized Regression. LASSO: find the minimizing $\beta_j$'s of
$$\frac{1}{n}\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{i,j}\Big)^2 + \lambda \sum_{j=1}^{p}\big(\beta_j^2\big)^{1/2} = \frac{1}{n}\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{i,j}\Big)^2 + \lambda \sum_{j=1}^{p}|\beta_j|.$$
This is equivalent to minimizing
$$\frac{1}{n}\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{i,j}\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}|\beta_j| < t$$
for some $t(\lambda)$.
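
The L1 penalty leads to the soft-thresholding update used in coordinate descent, which is what sets some coefficients exactly to zero. A minimal sketch, assuming a centered response and no intercept; an illustration of the mechanics, not a production solver.

```python
# Coordinate-descent LASSO for (1/n)*RSS + lambda*sum(|beta_j|):
# each coordinate update is a soft-thresholding of the partial residual.
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]     # partial residual (excludes x_j's term)
            z = X[:, j] @ r_j / n
            beta[j] = soft_threshold(z, lam / 2.0) / (X[:, j] @ X[:, j] / n)
    return beta
```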

Geometry of Ridge Regression and the LASSO

Stochastic Search Variable Selection (SSVS). Linear regression: $y = \beta_0 + \sum_{j=1}^{p}\beta_j x_j + \varepsilon$, where $\beta_j = \gamma_j \alpha_j$, $\gamma_j \sim \mathrm{Bern}(\pi_j)$, and $\alpha_j \sim N(0, \tau_j^2)$. The vector $(\gamma_1, \dots, \gamma_p)$ is the model, and is treated as an unknown random variable. The prior probability that $x_j$ is included in the model is $P(\beta_j \neq 0) = \pi_j$. Inference is based on the posterior probability that $x_j$ is included in the model, $P(\beta_j \neq 0 \mid y)$. It is common to take the best model to be the one that includes the variables with $P(\beta_j \neq 0 \mid y) > 0.5$.
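
For intuition, here is a minimal Gibbs-sampler sketch of the spike-and-slab idea behind SSVS, with $\sigma^2$, $\tau^2$, and $\pi$ held fixed for brevity; a full SSVS implementation (George & McCulloch 1993) places priors on these and differs in details.

```python
# Spike-and-slab Gibbs sketch: beta_j = gamma_j * alpha_j, gamma_j ~ Bern(pi),
# alpha_j ~ N(0, tau2).  Returns Monte Carlo estimates of P(beta_j != 0 | y).
import numpy as np

def ssvs_gibbs(X, y, sigma2=1.0, tau2=10.0, pi=0.5, n_iter=2000, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    incl = np.zeros(p)                                   # counts of gamma_j = 1
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]       # partial residual
            v_j = 1.0 / (X[:, j] @ X[:, j] / sigma2 + 1.0 / tau2)
            m_j = v_j * (X[:, j] @ r_j) / sigma2
            # posterior odds of gamma_j = 1 vs 0, with beta_j integrated out
            log_odds = (np.log(pi / (1.0 - pi))
                        + 0.5 * np.log(v_j / tau2)
                        + 0.5 * m_j ** 2 / v_j)
            if rng.random() < 1.0 / (1.0 + np.exp(-log_odds)):
                beta[j] = m_j + np.sqrt(v_j) * rng.normal()
                incl[j] += 1
            else:
                beta[j] = 0.0
    return incl / n_iter                                 # estimated inclusion probabilities
```

Inputs whose estimated inclusion probability exceeds 0.5 would be selected, matching the rule on the slide.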


Functional ANOVA Decomposition. Any function f(x) can be decomposed into main effects and interactions,
$$f(x) = \mu_0 + \sum_{j=1}^{p} f_j(x_j) + \sum_{j<k} f_{j,k}(x_j, x_k) + \cdots,$$
where $\mu_0$ is the mean, the $f_j$ are the main effects, the $f_{j,k}$ are the two-way interactions, and $(\cdots)$ are the higher-order interactions. The functional components $(f_j, f_{j,k}, \dots)$ are an orthogonal decomposition of the space, which implies the constraints $\int_0^1 f_j(x_j)\,dx_j = 0$ for all $j$ and $\int_0^1 f_{j,k}(x_j, x_k)\,dx_j = 0$ for all $j, k$, and similar relations for higher-order interactions. This ensures identifiability of the functional components.
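
A quick numerical illustration of the decomposition on a toy two-input function: the main effects are conditional averages minus the overall mean, and they satisfy the zero-integral (identifiability) constraint. The test function and grid are arbitrary choices.

```python
# Functional ANOVA of a toy function on [0,1]^2 via grid averages.
import numpy as np

f = lambda x1, x2: np.sin(2 * np.pi * x1) + x2 ** 2 + x1 * x2   # toy function
grid = np.linspace(0.0, 1.0, 400)
X1, X2 = np.meshgrid(grid, grid, indexing="ij")
F = f(X1, X2)

mu0 = F.mean()                              # overall mean
f1 = F.mean(axis=1) - mu0                   # main effect of x1
f2 = F.mean(axis=0) - mu0                   # main effect of x2
f12 = F - mu0 - f1[:, None] - f2[None, :]   # two-way interaction (remainder)

print(round(f1.mean(), 10), round(f2.mean(), 10))   # both ~0: zero-integral constraint
```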

Functional ANOVA Decomposition. A convenient way to treat the high-order interactions is to let
$$f(x) = \mu_0 + \sum_{j=1}^{p} f_j(x_j) + \sum_{j<k} f_{j,k}(x_j, x_k) + f_R(x),$$
where $f_R$ is a high-order-interaction (catch-all) remainder. In general we can say the function f(x) lies in some space $\mathcal{F}$,
$$\mathcal{F} = \{1\} \oplus \bigoplus_{j=1}^{q} \mathcal{F}_j, \qquad (1)$$
where $\{1\}, \mathcal{F}_1, \dots, \mathcal{F}_q$ is an orthogonal decomposition of the space. For the example above, we would have $f_1 \in \mathcal{F}_1, \dots, f_p \in \mathcal{F}_p, f_{1,2} \in \mathcal{F}_{p+1}, \dots$. Continuity assumptions on f, such as the number of continuous derivatives, can be built in through the choice of the $\mathcal{F}_j$.


The General Smoothing Spline. The L-spline estimate $\hat f$ is given by the minimizer over $f \in \mathcal{F}$ of
$$\frac{1}{n}\sum_{i=1}^{n}\big(y_i - f(x_i)\big)^2 + \lambda \sum_{j=1}^{q}\|P_j f\|_{\mathcal{F}}^2,$$
where $P_j f$ is the orthogonal projection of f onto $\mathcal{F}_j$, $j = 1, \dots, q$. For the additive model with each component function in $S^2 = \{g : g, g' \text{ absolutely continuous and } g'' \in L^2[0,1]\}$, $\hat f$ is given by the minimizer of
$$\frac{1}{n}\sum_{i=1}^{n}\big(y_i - f(x_i)\big)^2 + \lambda \sum_{j=1}^{p}\Big\{\big[f_j(1) - f_j(0)\big]^2 + \int_0^1 \big[f_j''(x_j)\big]^2\,dx_j\Big\}.$$
The solution can be obtained conveniently with tools from reproducing kernel Hilbert space theory (see Wahba 1990).
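
As a one-dimensional stand-in for the roughness-penalty idea (not the RKHS machinery referenced on the slide), scipy's cubic smoothing spline can be used; its smoothing parameter s plays a role analogous to $\lambda$. The data below are synthetic.

```python
# Cubic smoothing spline on noisy 1-D data; larger s => smoother fit.
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 100))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=100)

fit = UnivariateSpline(x, y, k=3, s=len(x) * 0.04)   # s roughly n * noise variance
x_grid = np.linspace(0.0, 1.0, 200)
y_hat = fit(x_grid)
```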

Adaptive COmponent Selection and Smoothing Operator (ACOSSO). LASSO is to ridge regression as ACOSSO is to the smoothing spline. Find the minimizer over $f \in \mathcal{F}$ of
$$\frac{1}{n}\sum_{i=1}^{n}\big(y_i - f(x_i)\big)^2 + \lambda \sum_{j=1}^{q} w_j \|P_j f\|_{\mathcal{F}}.$$
For the additive model with each component function in $S^2$, the minimization becomes
$$\frac{1}{n}\sum_{i=1}^{n}\big(y_i - f(x_i)\big)^2 + \lambda \sum_{j=1}^{p} w_j \Big\{\big[f_j(1) - f_j(0)\big]^2 + \int_0^1 \big[f_j''(x_j)\big]^2\,dx_j\Big\}^{1/2}.$$
This estimator sets some of the functional components ($f_j$'s) exactly equal to zero (i.e., $x_j$ is removed from the model). We want the $w_j$ to allow prominent functional components to enjoy the benefit of a smaller penalty, so we use a weight based on the $L_2$ norm of an initial estimate $\tilde f$:
$$w_j = \|\tilde f_j\|_{L_2}^{-\gamma} = \Big(\int_0^1 \tilde f_j(x_j)^2\,dx_j\Big)^{-\gamma/2}.$$
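
The adaptive-weighting idea can be illustrated in the simpler linear setting as an adaptive LASSO: each coefficient's penalty is weighted by an initial estimate raised to the power $-\gamma$, so strong effects are penalized less. This is only an analogy to the $w_j$ above, not the ACOSSO algorithm; the least-squares initial fit and scikit-learn's Lasso are assumptions for the sketch.

```python
# Adaptive LASSO sketch: weight the L1 penalty by |initial estimate|^(-gamma).
import numpy as np
from sklearn.linear_model import Lasso

def adaptive_lasso(X, y, lam, gamma=1.0):
    beta_init = np.linalg.lstsq(X, y, rcond=None)[0]       # initial (unpenalized) estimate
    w = (np.abs(beta_init) + 1e-8) ** (-gamma)             # big initial effect -> small weight
    fit = Lasso(alpha=lam).fit(X / w, y)                   # ordinary lasso on rescaled columns
    return fit.coef_ / w                                   # transform back to original scale
```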


Bayesian Smoothing Spline ANOVA (BSS-ANOVA). Assume
$$f(x) = \mu_0 + \sum_{j=1}^{p} f_j(x_j) + \sum_{j<k} f_{j,k}(x_j, x_k) + f_R(x).$$
Model the mean as $\mu_0 \sim N(0, \tau_0^2)$, and model $f_j \sim \mathrm{GP}(0, \tau_j^2 K_1)$, $f_{j,k} \sim \mathrm{GP}(0, \tau_{j,k}^2 K_2)$, and $f_R \sim \mathrm{GP}(0, \tau_R^2 K_R)$. The covariance functions $K_1, K_2, K_R$ are such that the functions $\mu_0, f_j, f_{j,k}, f_R$ obey the functional ANOVA constraints almost surely; they can also be chosen for the desired level of continuity. Lastly, apply SSVS to the variance parameters $\tau_j^2$, $\tau_{j,k}^2$, $j, k = 1, 2, \dots, p$, and $\tau_R^2$ to accomplish variable selection.
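
A sketch of how a covariance can be made to respect the zero-integral constraint: start from any base kernel and project out the constant direction ("center" the kernel), so that GP draws average to zero over the input grid. This uses a generic squared-exponential base kernel, not the specific $K_1$, $K_2$, $K_R$ of BSS-ANOVA.

```python
# Center a base covariance so GP sample paths have (grid-)mean zero.
import numpy as np

grid = np.linspace(0.0, 1.0, 200)
K = np.exp(-0.5 * (grid[:, None] - grid[None, :]) ** 2 / 0.1 ** 2)   # base SE kernel
H = np.eye(len(grid)) - 1.0 / len(grid)                              # centering matrix I - 11'/m
K0 = H @ K @ H                                                       # constrained covariance

rng = np.random.default_rng(0)
L = np.linalg.cholesky(K0 + 1e-8 * np.eye(len(grid)))                # small jitter for stability
draws = L @ rng.normal(size=(len(grid), 5))                          # 5 sample paths
print(draws.mean(axis=0))                                            # each ~ 0
```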


Treating Discrete Inputs. Discrete inputs can be thought of as having a graphical structure. Two examples where the j-th predictor $x_j \in \{0, 1, 2, 3, 4\}$ (example graphs shown on the slide).

Treating Discrete Inputs. Use the functional ANOVA framework to allow for these discrete predictors. The restriction implied on the discrete-input main-effect component is $\sum_c f_j(c) = 0$, and similarly for interactions. The norm (penalty) used is $f'Lf$, where $L = D - A$ is the graph Laplacian matrix ($A$ the adjacency matrix, $D$ the diagonal degree matrix). It can be shown that $f'Lf = \sum_{l<m} A_{l,m}\,[f(l) - f(m)]^2$, i.e., the penalty is the sum (weighted by the adjacency) of all of the squared distances between neighboring nodes. There is also a corresponding covariance function $K_1$ which enforces the ANOVA constraints for $f_j$ in the BSS-ANOVA framework as well: something like a harmonic expansion over the graph domain, with variance decreasing with frequency.
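
A quick numerical check of the Laplacian penalty on a discrete input with five levels; the chain graph (an ordinal input) is just one example adjacency structure.

```python
# Verify f'Lf = sum over pairs of A[l,m] * (f(l) - f(m))^2 for a 5-level input.
import numpy as np

A = np.zeros((5, 5))
for l in range(4):                            # chain graph: 0-1-2-3-4
    A[l, l + 1] = A[l + 1, l] = 1.0
D = np.diag(A.sum(axis=1))                    # degree matrix
L = D - A                                     # graph Laplacian

f = np.array([0.3, -0.1, 0.5, -0.4, -0.3])    # main-effect values (sum to zero)
quad = f @ L @ f
pairwise = sum(A[l, m] * (f[l] - f[m]) ** 2
               for l in range(5) for m in range(l + 1, 5))
print(np.isclose(quad, pairwise))             # True
```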


Simulation Study. $x_j \sim \mathrm{Unif}\{1, 2, 3, 4, 5, 6\}$ iid for $j = 1, \dots, 4$; $x_j \sim \mathrm{Unif}(0, 1)$ iid for $j = 5, \dots, 15$. $x_1, \dots, x_4$ are unordered qualitative factors. The test function used here is a function of only 3 inputs (2 of which are qualitative), so 12 of the 15 inputs are completely uninformative. Collect a sample of size n = 100 from $y_i = f(x_i) + \varepsilon_i$, where $\varepsilon_i \sim N(0, 1)$ iid, giving SNR $\approx$ 100:1 for the 2 test cases.

Test function

Simulation Results

Estimator    Pred MSE       Pred 99%        CDF ISE
ACOSSO       0.28 (0.03)     3.98 (0.60)    0.006 (0.000)
BSS-ANOVA    0.18 (0.01)     3.09 (0.46)    0.006 (0.000)
GP           1.09 (0.06)    18.26 (1.73)    0.010 (0.001)

Pred MSE: average over the 100 realizations of the mean squared error for prediction of new observations. Pred 99%: average over the 100 realizations of the 99th percentile of the squared error for prediction of a new observation. CDF ISE: average over the 100 realizations of the integrated squared error between the true CDF and the CDF estimated via the emulator.


Yucca Mountain Certification. Response variable (for this illustration) ESIC239C.10K: cumulative release of Ic (i.e., glass) colloid of 239Pu (Plutonium-239) out of the Engineered Barrier System into the Unsaturated Zone at 10,000 years. Predictor variables (that appear in the plots below): TH.INFIL: categorical variable describing different scenarios for infiltration and thermal conductivity in the region surrounding the drifts [...] high relative humidity (85%). CPUCOLWF: concentration of irreversibly attached plutonium on glass/waste-form colloids when colloids are stable (mol/L).

Yucca Mountain Certification: plots of ESIC239C.10K against the predictor variables listed above (figures not reproduced).

Yucca Mountain Certification. Below is a sensitivity analysis for ESIC239C.10K. Let $T_j$ denote the total variance index for the j-th input (i.e., $T_j$ is the proportion of the total variance of the output that can be attributed to the j-th input and its interactions). Meta-model: ACOSSO. Model summary: $R^2 = 0.960$, model df = 92.

Input       Estimated T_j   95% CI for T_j    p-value
CPUCOLWF    0.565           (0.473, 0.621)    < 0.01
TH.INFIL    0.424           (0.360, 0.518)    < 0.01
RHMUNO65    0.063           (0.052, 0.126)    < 0.01
FHHISSCS    0.053           (0.041, 0.106)    < 0.01
SEEPUNC     0.020           (0.000, 0.040)    0.10
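
For completeness, once a cheap emulator is available, total indices like $T_j$ can be estimated by simple Monte Carlo (e.g., Jansen's estimator, sketched below with a toy stand-in emulator and a uniform input distribution as assumptions). The functional ANOVA construction used in this talk yields the component functions directly, so this generic sketch is only an alternative route.

```python
# Monte Carlo estimate of total sensitivity indices T_j from an emulator.
import numpy as np

def total_indices(emulator, p, N=10000, seed=0):
    rng = np.random.default_rng(seed)
    A = rng.uniform(size=(N, p))
    B = rng.uniform(size=(N, p))
    fA = emulator(A)
    var_total = fA.var()
    T = np.empty(p)
    for j in range(p):
        ABj = A.copy()
        ABj[:, j] = B[:, j]                           # resample only input j
        T[j] = 0.5 * np.mean((fA - emulator(ABj)) ** 2) / var_total
    return T

g = lambda X: np.sin(2 * np.pi * X[:, 0]) + 0.5 * X[:, 1]   # toy stand-in emulator
print(total_indices(g, p=3))                                # third input inactive: T_3 ~ 0
```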

Conclusions and Further Work Functional ANOVA construction and variable selection can help to increase efficiency in function estimation. A general treatment of graphical inputs easily allows for ordinal and qualitative inputs as special cases. When using Functional ANOVA construction, the main effect and interaction functions are immediately available (i.e., no need to numerically integrate). Functional ANOVA construction also lends itself well to allowing for nonstationarity in function estimation. The overall function (which is potentially quite complex) is composed of fairly simple functions (i.e., main effects or 2-way interactions), so the extension is much easier than for a general function of p inputs.

References
1. Tibshirani, R. (1996), Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society: Series B.
2. George, E. & McCulloch, R. (1993), Variable selection via Gibbs sampling, Journal of the American Statistical Association.
3. Wahba, G. (1990), Spline Models for Observational Data, CBMS-NSF Regional Conference Series in Applied Mathematics.
4. Storlie, C., Bondell, H., Reich, B. & Zhang, H. (2009a), Surface estimation, variable selection, and the nonparametric oracle property, Statistica Sinica.
5. Reich, B., Storlie, C. & Bondell, H. (2009), Variable selection in Bayesian smoothing spline ANOVA models: Application to deterministic computer codes, Technometrics.
6. Smola, A. & Kondor, R. (2003), Kernels and regularization on graphs, in Learning Theory and Kernel Machines.