Reduction of Model Complexity and the Treatment of Discrete Inputs in Computer Model Emulation


Curtis B. Storlie, Los Alamos National Laboratory. E-mail: storlie@lanl.gov

Outline: Reduction of Emulator Complexity; Variable Selection; Functional ANOVA; Emulation using Functional ANOVA and Variable Selection; Adaptive Component Selection and Smoothing Operator; Bayesian Smoothing Spline ANOVA Models; Discrete Inputs; Simulation Study; Example from the Yucca Mountain Analysis; Conclusions and Further Work

Motivating Example: Computational Model from the Yucca Mountain Certification. 150 input variables (several of which are discrete in nature) and dozens of time-dependent responses. Response variable (for this illustration) ESIC239C.10K: cumulative release of Ic (i.e., glass) colloid of 239Pu (Plutonium-239) out of the Engineered Barrier System into the Unsaturated Zone at 10,000 years. The model is very expensive to run; we have a Latin Hypercube sample of size n = 300 at which the model has been evaluated. How do we perform sensitivity/uncertainty analysis?

Computer Model Emulation. An emulator is a simpler model that mimics a larger physical model; evaluations of an emulator are much faster. Nonparametric regression: we have n observations from the model, $y_i = f(x_i) + \varepsilon_i$, $i = 1, \dots, n$, where $x_i = (x_{i,1}, \dots, x_{i,p})$ and f is the physical model. Usually only weak assumptions are made about f (e.g., f belongs to a class of smooth functions). Methods of estimation: orthogonal series/wavelets, kernel smoothing/local regression, penalization methods (smoothing splines, Gaussian processes), and machine learning/algorithmic approaches. With a limited number of model evaluations and a large number of inputs, we need to reduce emulator complexity.
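
To make the setup concrete, here is a minimal sketch (not a method proposed in this talk) of the generic emulation workflow: evaluate the simulator on a Latin Hypercube design and fit a fast surrogate. The toy simulator, the choice of p = 10 inputs (only two of which are active), and the use of scikit-learn's Gaussian process regressor are all illustrative assumptions.

```python
# Generic emulation workflow: run the expensive model on an LHS design,
# fit a cheap surrogate, then predict (with uncertainty) at new inputs.
import numpy as np
from scipy.stats import qmc
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def expensive_model(x):                      # hypothetical simulator; only 2 of 10 inputs matter
    return np.sin(2 * np.pi * x[:, 0]) + x[:, 1] ** 2

p, n = 10, 300                               # inputs and design size, as in the Yucca Mountain example
X = qmc.LatinHypercube(d=p, seed=1).random(n)
y = expensive_model(X)

emulator = GaussianProcessRegressor(
    kernel=RBF(length_scale=np.ones(p)) + WhiteKernel(),
    normalize_y=True,
).fit(X, y)

X_new = qmc.LatinHypercube(d=p, seed=2).random(1000)
y_hat, y_sd = emulator.predict(X_new, return_std=True)   # fast predictions + uncertainty
```

Once fitted, the emulator can be evaluated thousands of times at negligible cost, which is what makes the sensitivity/uncertainty analysis below feasible.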


Variable Selection in Regression Models. Focus for now on variable selection in the linear model $y = \beta_0 + \sum_{j=1}^{p} \beta_j x_j + \varepsilon$. Stepwise/best-subsets model fitting can produce unstable estimates. More recently: continuous shrinkage using an L1 penalty (LASSO), Tibshirani 1996, and Stochastic Search Variable Selection (SSVS), George & McCulloch 1993, 1997.

Shrinkage, aka Penalized Regression. Ridge regression: find the minimizing $\beta_j$'s of
$$\frac{1}{n}\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{i,j}\Big)^2 + \lambda \sum_{j=1}^{p}\beta_j^2.$$
Note: all of the x's must first be standardized. Improved MSE estimation via the bias-variance trade-off. Ridge regression is equivalent to minimizing
$$\frac{1}{n}\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{i,j}\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}\beta_j^2 < t^2$$
for some $t(\lambda)$.
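
For reference, this objective has a closed-form solution, $\hat\beta = (X'X + n\lambda I)^{-1}X'y$, for standardized predictors and a centered response. A small sketch on simulated data (the data-generating model is arbitrary):

```python
# Ridge estimator in closed form for the objective (1/n)*RSS + lambda*sum(beta^2):
# beta_hat = (X'X + n*lambda*I)^(-1) X'y, after standardizing X and centering y.
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 100, 5, 0.1
X = rng.normal(size=(n, p))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(size=n)

Xs = (X - X.mean(axis=0)) / X.std(axis=0)    # standardize the x's (as the slide notes)
yc = y - y.mean()                            # centering absorbs beta_0
beta_ridge = np.linalg.solve(Xs.T @ Xs + n * lam * np.eye(p), Xs.T @ yc)
```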

Shrinkage, aka Penalized Regression. LASSO: find the minimizing $\beta_j$'s of
$$\frac{1}{n}\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{i,j}\Big)^2 + \lambda \sum_{j=1}^{p}\big(\beta_j^2\big)^{1/2} = \frac{1}{n}\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{i,j}\Big)^2 + \lambda \sum_{j=1}^{p}|\beta_j|.$$
This is equivalent to minimizing
$$\frac{1}{n}\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{i,j}\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p}|\beta_j| < t$$
for some $t(\lambda)$.
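
The L1 penalty leads to the soft-thresholding update used in coordinate descent, which is what sets some coefficients exactly to zero. A minimal sketch, assuming a centered response and no intercept; an illustration of the mechanics, not a production solver.

```python
# Coordinate-descent LASSO for (1/n)*RSS + lambda*sum(|beta_j|):
# each coordinate update is a soft-thresholding of the partial residual.
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]     # partial residual (excludes x_j's term)
            z = X[:, j] @ r_j / n
            beta[j] = soft_threshold(z, lam / 2.0) / (X[:, j] @ X[:, j] / n)
    return beta
```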

Geometry of Ridge Regression and the LASSO

Stochastic Search Variable Selection (SSVS). Linear regression: $y = \beta_0 + \sum_{j=1}^{p}\beta_j x_j + \varepsilon$, where $\beta_j = \gamma_j \alpha_j$, $\gamma_j \sim \mathrm{Bern}(\pi_j)$, and $\alpha_j \sim N(0, \tau_j^2)$. The vector $(\gamma_1, \dots, \gamma_p)$ is the model, and is treated as an unknown random variable. The prior probability that $x_j$ is included in the model is $P(\beta_j \neq 0) = \pi_j$. Inference is based on the posterior probability that $x_j$ is included in the model, $P(\beta_j \neq 0 \mid y)$. It is common to take the best model to be the one that includes the variables with $P(\beta_j \neq 0 \mid y) > 0.5$.
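
For intuition, here is a minimal Gibbs-sampler sketch of the spike-and-slab idea behind SSVS, with $\sigma^2$, $\tau^2$, and $\pi$ held fixed for brevity; a full SSVS implementation (George & McCulloch 1993) places priors on these and differs in details.

```python
# Spike-and-slab Gibbs sketch: beta_j = gamma_j * alpha_j, gamma_j ~ Bern(pi),
# alpha_j ~ N(0, tau2).  Returns Monte Carlo estimates of P(beta_j != 0 | y).
import numpy as np

def ssvs_gibbs(X, y, sigma2=1.0, tau2=10.0, pi=0.5, n_iter=2000, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    incl = np.zeros(p)                                   # counts of gamma_j = 1
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]       # partial residual
            v_j = 1.0 / (X[:, j] @ X[:, j] / sigma2 + 1.0 / tau2)
            m_j = v_j * (X[:, j] @ r_j) / sigma2
            # posterior odds of gamma_j = 1 vs 0, with beta_j integrated out
            log_odds = (np.log(pi / (1.0 - pi))
                        + 0.5 * np.log(v_j / tau2)
                        + 0.5 * m_j ** 2 / v_j)
            if rng.random() < 1.0 / (1.0 + np.exp(-log_odds)):
                beta[j] = m_j + np.sqrt(v_j) * rng.normal()
                incl[j] += 1
            else:
                beta[j] = 0.0
    return incl / n_iter                                 # estimated inclusion probabilities
```

Inputs whose estimated inclusion probability exceeds 0.5 would be selected, matching the rule on the slide.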


Functional ANOVA Decomposition. Any function f(x) can be decomposed into main effects and interactions,
$$f(x) = \mu_0 + \sum_{j=1}^{p} f_j(x_j) + \sum_{j<k} f_{j,k}(x_j, x_k) + \cdots,$$
where $\mu_0$ is the mean, the $f_j$ are the main effects, the $f_{j,k}$ are the two-way interactions, and $(\cdots)$ are the higher-order interactions. The functional components $(f_j, f_{j,k}, \dots)$ are an orthogonal decomposition of the space, which implies the constraints $\int_0^1 f_j(x_j)\,dx_j = 0$ for all $j$ and $\int_0^1 f_{j,k}(x_j, x_k)\,dx_j = 0$ for all $j, k$, and similar relations for higher-order interactions. This ensures identifiability of the functional components.
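
A quick numerical illustration of the decomposition on a toy two-input function: the main effects are conditional averages minus the overall mean, and they satisfy the zero-integral (identifiability) constraint. The test function and grid are arbitrary choices.

```python
# Functional ANOVA of a toy function on [0,1]^2 via grid averages.
import numpy as np

f = lambda x1, x2: np.sin(2 * np.pi * x1) + x2 ** 2 + x1 * x2   # toy function
grid = np.linspace(0.0, 1.0, 400)
X1, X2 = np.meshgrid(grid, grid, indexing="ij")
F = f(X1, X2)

mu0 = F.mean()                              # overall mean
f1 = F.mean(axis=1) - mu0                   # main effect of x1
f2 = F.mean(axis=0) - mu0                   # main effect of x2
f12 = F - mu0 - f1[:, None] - f2[None, :]   # two-way interaction (remainder)

print(round(f1.mean(), 10), round(f2.mean(), 10))   # both ~0: zero-integral constraint
```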

Functional ANOVA Decomposition. A convenient way to treat the high-order interactions is to let
$$f(x) = \mu_0 + \sum_{j=1}^{p} f_j(x_j) + \sum_{j<k} f_{j,k}(x_j, x_k) + f_R(x),$$
where $f_R$ is a high-order-interaction (catch-all) remainder. In general we can say the function f(x) lies in some space $\mathcal{F}$,
$$\mathcal{F} = \{1\} \oplus \bigoplus_{j=1}^{q} \mathcal{F}_j, \qquad (1)$$
where $\{1\}, \mathcal{F}_1, \dots, \mathcal{F}_q$ is an orthogonal decomposition of the space. For the example above, we would have $f_1 \in \mathcal{F}_1, \dots, f_p \in \mathcal{F}_p, f_{1,2} \in \mathcal{F}_{p+1}, \dots$. Continuity assumptions on f, such as the number of continuous derivatives, can be built in through the choice of the $\mathcal{F}_j$.


The General Smoothing Spline. The L-spline estimate $\hat f$ is given by the minimizer over $f \in \mathcal{F}$ of
$$\frac{1}{n}\sum_{i=1}^{n}\big(y_i - f(x_i)\big)^2 + \lambda \sum_{j=1}^{q}\|P_j f\|_{\mathcal{F}}^2,$$
where $P_j f$ is the orthogonal projection of f onto $\mathcal{F}_j$, $j = 1, \dots, q$. For the additive model with each component function in $S^2 = \{g : g, g' \text{ absolutely continuous and } g'' \in L^2[0,1]\}$, $\hat f$ is given by the minimizer of
$$\frac{1}{n}\sum_{i=1}^{n}\big(y_i - f(x_i)\big)^2 + \lambda \sum_{j=1}^{p}\Big\{\big[f_j(1) - f_j(0)\big]^2 + \int_0^1 \big[f_j''(x_j)\big]^2\,dx_j\Big\}.$$
The solution can be obtained conveniently with tools from reproducing kernel Hilbert space theory (see Wahba 1990).
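
As a one-dimensional stand-in for the roughness-penalty idea (not the RKHS machinery referenced on the slide), scipy's cubic smoothing spline can be used; its smoothing parameter s plays a role analogous to $\lambda$. The data below are synthetic.

```python
# Cubic smoothing spline on noisy 1-D data; larger s => smoother fit.
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 100))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=100)

fit = UnivariateSpline(x, y, k=3, s=len(x) * 0.04)   # s roughly n * noise variance
x_grid = np.linspace(0.0, 1.0, 200)
y_hat = fit(x_grid)
```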

Adaptive COmponent Selection and Smoothing Operator (ACOSSO). LASSO is to ridge regression as ACOSSO is to the smoothing spline. Find the minimizer over $f \in \mathcal{F}$ of
$$\frac{1}{n}\sum_{i=1}^{n}\big(y_i - f(x_i)\big)^2 + \lambda \sum_{j=1}^{q} w_j \|P_j f\|_{\mathcal{F}}.$$
For the additive model with each component function in $S^2$, the minimization becomes
$$\frac{1}{n}\sum_{i=1}^{n}\big(y_i - f(x_i)\big)^2 + \lambda \sum_{j=1}^{p} w_j \Big\{\big[f_j(1) - f_j(0)\big]^2 + \int_0^1 \big[f_j''(x_j)\big]^2\,dx_j\Big\}^{1/2}.$$
This estimator sets some of the functional components ($f_j$'s) exactly equal to zero (i.e., $x_j$ is removed from the model). We want the $w_j$ to allow prominent functional components to enjoy the benefit of a smaller penalty, so we use a weight based on the $L_2$ norm of an initial estimate $\tilde f$:
$$w_j = \|\tilde f_j\|_{L_2}^{-\gamma} = \Big(\int_0^1 \tilde f_j(x_j)^2\,dx_j\Big)^{-\gamma/2}.$$
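
The adaptive-weighting idea can be illustrated in the simpler linear setting as an adaptive LASSO: each coefficient's penalty is weighted by an initial estimate raised to the power $-\gamma$, so strong effects are penalized less. This is only an analogy to the $w_j$ above, not the ACOSSO algorithm; the least-squares initial fit and scikit-learn's Lasso are assumptions for the sketch.

```python
# Adaptive LASSO sketch: weight the L1 penalty by |initial estimate|^(-gamma).
import numpy as np
from sklearn.linear_model import Lasso

def adaptive_lasso(X, y, lam, gamma=1.0):
    beta_init = np.linalg.lstsq(X, y, rcond=None)[0]       # initial (unpenalized) estimate
    w = (np.abs(beta_init) + 1e-8) ** (-gamma)             # big initial effect -> small weight
    fit = Lasso(alpha=lam).fit(X / w, y)                   # ordinary lasso on rescaled columns
    return fit.coef_ / w                                   # transform back to original scale
```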


Bayesian Smoothing Spline ANOVA (BSS-ANOVA). Assume
$$f(x) = \mu_0 + \sum_{j=1}^{p} f_j(x_j) + \sum_{j<k} f_{j,k}(x_j, x_k) + f_R(x).$$
Model the mean as $\mu_0 \sim N(0, \tau_0^2)$, and model $f_j \sim \mathrm{GP}(0, \tau_j^2 K_1)$, $f_{j,k} \sim \mathrm{GP}(0, \tau_{j,k}^2 K_2)$, and $f_R \sim \mathrm{GP}(0, \tau_R^2 K_R)$. The covariance functions $K_1, K_2, K_R$ are such that the functions $\mu_0, f_j, f_{j,k}, f_R$ obey the functional ANOVA constraints almost surely; they can also be chosen for the desired level of continuity. Lastly, apply SSVS to the variance parameters $\tau_j^2$, $\tau_{j,k}^2$, $j, k = 1, 2, \dots, p$, and $\tau_R^2$ to accomplish variable selection.
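
A sketch of how a covariance can be made to respect the zero-integral constraint: start from any base kernel and project out the constant direction ("center" the kernel), so that GP draws average to zero over the input grid. This uses a generic squared-exponential base kernel, not the specific $K_1$, $K_2$, $K_R$ of BSS-ANOVA.

```python
# Center a base covariance so GP sample paths have (grid-)mean zero.
import numpy as np

grid = np.linspace(0.0, 1.0, 200)
K = np.exp(-0.5 * (grid[:, None] - grid[None, :]) ** 2 / 0.1 ** 2)   # base SE kernel
H = np.eye(len(grid)) - 1.0 / len(grid)                              # centering matrix I - 11'/m
K0 = H @ K @ H                                                       # constrained covariance

rng = np.random.default_rng(0)
L = np.linalg.cholesky(K0 + 1e-8 * np.eye(len(grid)))                # small jitter for stability
draws = L @ rng.normal(size=(len(grid), 5))                          # 5 sample paths
print(draws.mean(axis=0))                                            # each ~ 0
```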


Treating Discrete Inputs. Discrete inputs can be thought of as having a graphical structure. Two examples where the j-th predictor $x_j \in \{0, 1, 2, 3, 4\}$ (example graphs shown on the slide).

Treating Discrete Inputs. Use the functional ANOVA framework to allow for these discrete predictors. The restriction implied on the discrete-input main-effect component is $\sum_c f_j(c) = 0$, and similarly for interactions. The norm (penalty) used is $f'Lf$, where $L = D - A$ is the graph Laplacian matrix ($A$ the adjacency matrix, $D$ the diagonal degree matrix). It can be shown that $f'Lf = \sum_{l<m} A_{l,m}\,[f(l) - f(m)]^2$, i.e., the penalty is the sum (weighted by the adjacency) of all of the squared distances between neighboring nodes. There is also a corresponding covariance function $K_1$ which enforces the ANOVA constraints for $f_j$ in the BSS-ANOVA framework as well: something like a harmonic expansion over the graph domain, with variance decreasing with frequency.
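
A quick numerical check of the Laplacian penalty on a discrete input with five levels; the chain graph (an ordinal input) is just one example adjacency structure.

```python
# Verify f'Lf = sum over pairs of A[l,m] * (f(l) - f(m))^2 for a 5-level input.
import numpy as np

A = np.zeros((5, 5))
for l in range(4):                            # chain graph: 0-1-2-3-4
    A[l, l + 1] = A[l + 1, l] = 1.0
D = np.diag(A.sum(axis=1))                    # degree matrix
L = D - A                                     # graph Laplacian

f = np.array([0.3, -0.1, 0.5, -0.4, -0.3])    # main-effect values (sum to zero)
quad = f @ L @ f
pairwise = sum(A[l, m] * (f[l] - f[m]) ** 2
               for l in range(5) for m in range(l + 1, 5))
print(np.isclose(quad, pairwise))             # True
```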


Simulation Study. $x_j \sim \mathrm{Unif}\{1, 2, 3, 4, 5, 6\}$ iid for $j = 1, \dots, 4$; $x_j \sim \mathrm{Unif}(0, 1)$ iid for $j = 5, \dots, 15$. $x_1, \dots, x_4$ are unordered qualitative factors. The test function used here is a function of only 3 inputs (2 of which are qualitative), so 12 of the 15 inputs are completely uninformative. Collect a sample of size n = 100 from $y_i = f(x_i) + \varepsilon_i$, where $\varepsilon_i \sim N(0, 1)$ iid, giving SNR $\approx$ 100:1 for the 2 test cases.

Test function

Simulation Results

Estimator    Pred MSE       Pred 99%        CDF ISE
ACOSSO       0.28 (0.03)     3.98 (0.60)    0.006 (0.000)
BSS-ANOVA    0.18 (0.01)     3.09 (0.46)    0.006 (0.000)
GP           1.09 (0.06)    18.26 (1.73)    0.010 (0.001)

Pred MSE: average over the 100 realizations of the mean squared error for prediction of new observations. Pred 99%: average over the 100 realizations of the 99th percentile of the squared error for prediction of a new observation. CDF ISE: average over the 100 realizations of the integrated squared error between the true CDF and the CDF estimated via the emulator.


Yucca Mountain Certification. Response variable (for this illustration) ESIC239C.10K: cumulative release of Ic (i.e., glass) colloid of 239Pu (Plutonium-239) out of the Engineered Barrier System into the Unsaturated Zone at 10,000 years. Predictor variables (that appear in the plots below): TH.INFIL: categorical variable describing different scenarios for infiltration and thermal conductivity in the region surrounding the drifts [...] high relative humidity (85%). CPUCOLWF: concentration of irreversibly attached plutonium on glass/waste-form colloids when colloids are stable (mol/L).

Yucca Mountain Certification: plots of ESIC239C.10K against the predictor variables listed above (figures not reproduced).

Yucca Mountain Certification. Below is a sensitivity analysis for ESIC239C.10K. Let $T_j$ denote the total variance index for the j-th input (i.e., $T_j$ is the proportion of the total variance of the output that can be attributed to the j-th input and its interactions). Meta-model: ACOSSO. Model summary: $R^2 = 0.960$, model df = 92.

Input       Estimated T_j   95% CI for T_j    p-value
CPUCOLWF    0.565           (0.473, 0.621)    < 0.01
TH.INFIL    0.424           (0.360, 0.518)    < 0.01
RHMUNO65    0.063           (0.052, 0.126)    < 0.01
FHHISSCS    0.053           (0.041, 0.106)    < 0.01
SEEPUNC     0.020           (0.000, 0.040)    0.10
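
For completeness, once a cheap emulator is available, total indices like $T_j$ can be estimated by simple Monte Carlo (e.g., Jansen's estimator, sketched below with a toy stand-in emulator and a uniform input distribution as assumptions). The functional ANOVA construction used in this talk yields the component functions directly, so this generic sketch is only an alternative route.

```python
# Monte Carlo estimate of total sensitivity indices T_j from an emulator.
import numpy as np

def total_indices(emulator, p, N=10000, seed=0):
    rng = np.random.default_rng(seed)
    A = rng.uniform(size=(N, p))
    B = rng.uniform(size=(N, p))
    fA = emulator(A)
    var_total = fA.var()
    T = np.empty(p)
    for j in range(p):
        ABj = A.copy()
        ABj[:, j] = B[:, j]                           # resample only input j
        T[j] = 0.5 * np.mean((fA - emulator(ABj)) ** 2) / var_total
    return T

g = lambda X: np.sin(2 * np.pi * X[:, 0]) + 0.5 * X[:, 1]   # toy stand-in emulator
print(total_indices(g, p=3))                                # third input inactive: T_3 ~ 0
```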

Conclusions and Further Work Functional ANOVA construction and variable selection can help to increase efficiency in function estimation. A general treatment of graphical inputs easily allows for ordinal and qualitative inputs as special cases. When using Functional ANOVA construction, the main effect and interaction functions are immediately available (i.e., no need to numerically integrate). Functional ANOVA construction also lends itself well to allowing for nonstationarity in function estimation. The overall function (which is potentially quite complex) is composed of fairly simple functions (i.e., main effects or 2-way interactions), so the extension is much easier than for a general function of p inputs.

References
1. Tibshirani, R. (1996), Regression shrinkage and selection via the lasso, Journal of the Royal Statistical Society: Series B.
2. George, E. & McCulloch, R. (1993), Variable selection via Gibbs sampling, Journal of the American Statistical Association.
3. Wahba, G. (1990), Spline Models for Observational Data, CBMS-NSF Regional Conference Series in Applied Mathematics.
4. Storlie, C., Bondell, H., Reich, B. & Zhang, H. (2009a), Surface estimation, variable selection, and the nonparametric oracle property, Statistica Sinica.
5. Reich, B., Storlie, C. & Bondell, H. (2009), Variable selection in Bayesian smoothing spline ANOVA models: Application to deterministic computer codes, Technometrics.
6. Smola, A. & Kondor, R. (2003), Kernels and regularization on graphs, in Learning Theory and Kernel Machines.