Better Simulation Metamodeling: The Why, What and How of Stochastic Kriging


Better Simulation Metamodeling: The Why, What and How of Stochastic Kriging
Jeremy Staum
Collaborators: Bruce Ankenman, Barry Nelson, Evren Baysal, Ming Liu, Wei Xie
Supported by the NSF under Grant No. DMI-0900354

Outline
- overview of metamodeling
- metamodeling approaches: regression, kriging, stochastic kriging
- Why kriging? Why stochastic kriging?
- stochastic kriging: what and how
- practical advice for stochastic kriging

Metamodeling: Why?
- simulation model: input x, response surface y(x)
- with simulation effort n, the simulation output Y(x;n) estimates y(x); Y(x;n) converges to y(x) as n → ∞
- simulation metamodeling: a fast estimate of y(x) for any x
- "What would the simulation say if I ran it?"

Uses of Metamodeling
- trend modeling (global): Does y(x) = y(x₁,x₂) increase with x₁? Is y(x) more sensitive to x₁ or to x₂? y(x) ≈ β₀ + xβ globally
- optimization (local): Which way to move, and how far? quadratic model: first and second derivatives, y(x) ≈ β₀ + xβ + xᵀQx locally

Uses of Metamodeling
- exploration (global): What if? scenario = x
- multi-objective tradeoffs: throughput (x₁) vs. cycle time (y)

Uses of Metamodeling
- scenario analysis (global): the scenario (financial, military) can't be controlled; scenarios are simulation inputs
- probabilistic analysis: a distribution on scenarios

Needs of Metamodeling
- trend modeling: rough global trend
- optimization: rough local trend
- exploration / scenario analysis: globally accurate prediction, ŷ(x) ≈ y(x)
- ŷ(x) is almost as good an estimator of y(x) as the simulation output Y(x;n), but the metamodel is much faster than the model: simulation on demand

Overview of Approaches
No free lunch! Inference about y(x) without seeing Y(x;n) requires assumptions:
- about spatial variability in y(·)
- about the noise ε(x;n) = Y(x;n) − y(x)
Is y(·) a simple trend y(x) = b(x)β, or must we model deviation from the trend? Should we try to filter out the noise?

Regression
- Assume y(x) = b(x)β: the truth y can't deviate from the trend, and we aim only to estimate β.
- Assume ε(x) = Y(x) − y(x) is white noise, and aim to filter it out! Var[ε(x)] doesn't depend on x (remedies on a later slide).
- Global metamodeling: often we can't find an adequate trend model b(·).
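A minimal sketch of this setup in Python/numpy (my illustration, not the talk's code; the design points, basis, and noise level are all hypothetical choices):

    import numpy as np

    # Hypothetical noisy observations of a response surface at 30 design points.
    rng = np.random.default_rng(0)
    x = np.linspace(1.1, 2.0, 30)
    Y = 1.0 / (x * (x - 1.0)) + rng.normal(scale=0.5, size=x.size)

    # Regression assumes y(x) = b(x) beta exactly; here b(x) = (1, x, x^2).
    B = np.column_stack([np.ones_like(x), x, x**2])

    # Ordinary least squares: beta_hat minimizes ||Y - B beta||^2.
    beta_hat, *_ = np.linalg.lstsq(B, Y, rcond=None)

    def y_hat(x_new):
        """Regression metamodel: fast prediction at any x."""
        b = np.column_stack([np.ones_like(x_new), x_new, x_new**2])
        return b @ beta_hat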

Interpolation
- Assume y(·) has some smoothness: model a mean β₀ and the deviation y(·) − β₀, or a trend b(·)β and the deviation y(·) − b(·)β.
- Assume ε(x) = Y(x) − y(x) = 0: no filtering, so ŷ(x) = Y(x) if x is simulated.
- Stochastic simulation: need big simulation effort n so that Y(x;n) − y(x) ≈ 0; otherwise the interpolated ŷ(·) will look bumpy.

Smoothing
- Assume y(·) has some smoothness, just like interpolation.
- Assume ε(x) = Y(x) − y(x) is white noise, and aim to filter it out!
- Can handle ordinal & categorical data: regression, interpolation, or smoothing.

Example: M/M/1 Queue
- arrival rate 1, service rate x
- steady-state mean wait y(x) = 1/(x(x−1))
- initialize in steady state (no bias) and simulate a fixed # of customers
- the variance also explodes as x → 1
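Here is a sketch of that experiment (my code, not the talk's): Lindley's recursion for the waiting times, with the first wait drawn from the known M/M/1 steady-state distribution (zero with probability 1 − 1/x, else exponential with rate x − 1), so there is no initialization bias:

    import numpy as np

    def mm1_wait_mean(x, n=1000, seed=0):
        """Estimate the steady-state mean wait in queue for an M/M/1 queue
        with arrival rate 1 and service rate x > 1, averaging n customers."""
        rng = np.random.default_rng(seed)
        rho = 1.0 / x
        # Draw the first wait from the stationary distribution: no warm-up bias.
        w = rng.exponential(1.0 / (x - 1.0)) if rng.random() < rho else 0.0
        total = 0.0
        for _ in range(n):
            total += w
            s = rng.exponential(1.0 / x)   # service time, rate x
            a = rng.exponential(1.0)       # interarrival time, rate 1
            w = max(0.0, w + s - a)        # Lindley's recursion
        return total / n

    # True curve: y(x) = 1/(x(x-1)); e.g. y(1.5) = 4/3.
    print(mm1_wait_mean(1.5, n=100_000))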

Regression Remedies
- weighted least squares (WLS): the weight on Y(x;n) is n/Var[Y(x)]
- empirical WLS: estimate Var[Y(x)]
- generalized linear models: Y(x) = y(x) + ε(x), y(x) = LINK(b(x)β)
- perfect for M/M/1: y(x) = exp(β₀ + β₁ ln(x) + β₂ ln(x−1)) = 1/(x(x−1)) when β₀ = 0 and β₁ = β₂ = −1
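A sketch of empirical WLS in numpy (the helper name and interface are mine). A proper generalized linear model would be fit by iteratively reweighted least squares; the closing comment notes a crude log-transform shortcut instead:

    import numpy as np

    def ewls_fit(B, Y, var_hat):
        """Empirical weighted least squares: weight observation i by
        1 / Var_hat[Y(x_i; n_i)], solving min sum_i w_i (Y_i - b(x_i) beta)^2."""
        w = np.sqrt(1.0 / np.asarray(var_hat))
        beta, *_ = np.linalg.lstsq(B * w[:, None], Y * w, rcond=None)
        return beta

    # Crude stand-in for the M/M/1 GLM (requires Y > 0): regress ln Y on the
    # basis (1, ln x, ln(x-1)) and predict exp(b(x) beta).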

Experiment Design
- M/M/1: one-dimensional, evenly spaced; k design points x₁ = 1.1, …, x_k = 2
- constant simulation effort n at each: 30 replications = runs of n customers each
- Var[Y(x;n)] is huge for x = 1.1, small for x = 2
- two experiment designs: sparse & deep (k = 6, n = 1000 customers) and dense & shallow (k = 60, n = 100 customers)

Regression: Conclusions
- misspecification means poor predictions: the WHY of interpolation (including kriging)
- weighted least squares: dangerous! WLS assumes a well-specified model; it assigns huge residuals to high-variance observations and predicts very badly there
- Want a well-specified model? Good luck; it takes hard work.

Interpolation
- Deviations from the trend are meaningful. We can even omit trend modeling (use an overall mean).
- Prediction ŷ(x) at x after simulating at design points x₁, …, x_k: if x = xᵢ is simulated, ŷ(xᵢ) = Y(xᵢ); if not, ŷ(x) = w₁Y(x₁) + … + w_kY(x_k), where wᵢ is larger for xᵢ closer to x.

Kriging: Spatial Correlation
- deterministic simulation: ε(x) = Y(x) − y(x) = 0
- Y(·) is regarded as a random field: each Y(x) is a random variable (Bayesian)
- Corr[Y(x), Y(x′)] = r(x − x′), e.g. r(x − x′) = exp(−Σᵢ θᵢ|xᵢ − xᵢ′|) or r(x − x′) = exp(−Σᵢ θᵢ(xᵢ − xᵢ′)²)

Kriging Prediction
- prior: Y(·) is a Gaussian random field with E[Y(x)] = b(x)β (often just β₀) and Cov[Y(x), Y(x′)] = τ²r(x − x′)
- observe Y = (Y(x₁), …, Y(x_k))
- posterior mean: ŷ(x) = b(x)β + r(x)ᵀR⁻¹(Y − Bβ), where rᵢ(x) = r(x − xᵢ), Rᵢⱼ = r(xⱼ − xᵢ), Bᵢ = b(xᵢ)
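A compact sketch of this predictor with a constant trend β₀ (my illustration; in practice β₀, τ², and θ are estimated, e.g. by maximum likelihood, as the next slide discusses):

    import numpy as np

    def gauss_corr(X1, X2, theta):
        """Gaussian correlation r(x - x') = exp(-sum_i theta_i (x_i - x'_i)^2).
        X1: (m, d), X2: (k, d) -> (m, k) correlation matrix."""
        d2 = (X1[:, None, :] - X2[None, :, :]) ** 2
        return np.exp(-(d2 * theta).sum(axis=2))

    def krige(X, Y, theta, x_new):
        """Kriging posterior mean with a constant trend beta_0,
        estimated here by generalized least squares."""
        R = gauss_corr(X, X, theta)
        Rinv_Y = np.linalg.solve(R, Y)
        Rinv_1 = np.linalg.solve(R, np.ones(len(Y)))
        beta0 = Rinv_Y.sum() / Rinv_1.sum()
        r = gauss_corr(x_new, X, theta)          # r_i(x) for each new x
        return beta0 + r @ np.linalg.solve(R, Y - beta0)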

Choices in Kriging
- basis functions b(·) for the trend b(x)β; estimate the coefficients β by least squares: min Σᵢ (Y(xᵢ) − b(xᵢ)β)²
- spatial correlation function r(·; θ); estimate the coefficients θ by cross-validation or by maximizing the likelihood of Y(x₁), …, Y(x_k)
- axis transformation affects predictions: e.g. arrival & service rates vs. arrival rate & load

Pathologies of Kriging
- reversion to the trend
- fitting to noise
- the WHY of stochastic kriging

Kriging with Errors
- measurement error, or nugget effect: ε(x) = Y(x) − y(x) ≠ 0
- filter out noise that harms prediction; improve numerical stability
- intrinsic vs. extrinsic uncertainty: intrinsic is Var[ε(x)] from a physical experiment or noise from stochastic simulation; extrinsic is Var[Y(x)], representing our ignorance

How to Apply Kriging to Stochastic Simulation
- Kleijnen & van Beers: control noise and its effect on prediction
- Siem & den Hertog: reduce sensitivity to stochastic noise
- Yin, Ng & Ng: modified nugget effect
- Ankenman, Nelson & Staum: stochastic kriging

Stochastic Kriging
- Intrinsic uncertainty ε(x) = Y(x) − y(x) is independent of extrinsic uncertainty.
- If the simulation effort n at x is large, then ε(x;n) = Y(x;n) − y(x) is approximately normal, we can estimate its variance v(x)/n, and we can do empirical weighted least squares.

The What of SK
- The kriging prediction: ŷ(x) = b(x)β + r(x)ᵀR⁻¹(Y − Bβ), where rᵢ(x) = r(x − xᵢ), Rᵢⱼ = r(xⱼ − xᵢ), Bᵢ = b(xᵢ)
- The stochastic kriging prediction: ŷ(x) = b(x)β + r(x)ᵀ(R + C/τ²)⁻¹(Y − Bβ), where C is the intrinsic covariance matrix and τ² is the extrinsic variance
- Awareness of noise alters inference.
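Continuing the kriging sketch above (reusing numpy and gauss_corr), the only change is replacing R with R + C/τ²; here C is assumed diagonal, with entries given by the estimated intrinsic variances of the point estimates:

    def stoch_krige(X, Y, theta, tau2, intrinsic_var, x_new):
        """Stochastic kriging posterior mean with a constant trend.
        intrinsic_var: (k,) estimates of Var[Y(x_i; n_i)] = v(x_i)/n_i,
        assumed independent across design points (diagonal C)."""
        R = gauss_corr(X, X, theta)
        M = R + np.diag(np.asarray(intrinsic_var) / tau2)   # R + C / tau^2
        Minv_Y = np.linalg.solve(M, Y)
        Minv_1 = np.linalg.solve(M, np.ones(len(Y)))
        beta0 = Minv_Y.sum() / Minv_1.sum()
        r = gauss_corr(x_new, X, theta)
        return beta0 + r @ np.linalg.solve(M, Y - beta0)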

Behavior of SK
- If sampling is independent, C is diagonal: Cᵢᵢ = intrinsic noise in observing Y(xᵢ).
- The noise-to-signal ratio Cᵢᵢ/τ² governs smoothing: how far ŷ(xᵢ) is from Y(xᵢ).
- Extreme cases: as C → 0, SK → kriging; as τ² → 0, SK → EWLS regression, neglecting Y(xᵢ) if Cᵢᵢ ≫ τ²!

M/M/1 Queue Examples
- high intrinsic variance for low service rate x
- a simulated game: A tosses a coin, heads with prob. x; B tosses another, heads with prob. 1 − x; on HT, B pays A $1; on TH, A pays B $1
- y(x) = 2x − 1; the intrinsic variance v(x) is highest in the middle

SK: How-To Overview
- pre-simulation: choose axes and the correlation function r; choose design points and effort xᵢ, nᵢ
- simulate: for each i = 1, …, k, observe Y(xᵢ) and estimate the variance v(xᵢ)
- post-simulation: estimate the parameters β, θ, τ²; predict ŷ(x) for any x desired
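One plausible shape for the "simulate" step (a sketch; run_once and the interface are hypothetical names of mine): average nᵢ replications at each design point and keep the sample variance of that average as the intrinsic-variance estimate, which then feeds C in the predictor above:

    import numpy as np

    def simulate_design(run_once, X, n_reps, seed=0):
        """For each design point x_i, call run_once(x_i, rng) n_reps[i] times.
        Returns the replication means Y(x_i; n_i) and the estimated variances
        of those means, s_i^2 / n_i."""
        rng = np.random.default_rng(seed)
        Y_bar, var_of_mean = [], []
        for x, n in zip(X, n_reps):
            reps = np.array([run_once(x, rng) for _ in range(n)])
            Y_bar.append(reps.mean())
            var_of_mean.append(reps.var(ddof=1) / n)
        return np.array(Y_bar), np.array(var_of_mean)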

Multi-Product Cycle Time
- simulate a Jackson network with 3 nodes
- input x has 3 dimensions (did up to 8): the mix of 3 products (each ranging from 0% to 100%) and the load on the bottleneck node (from 0.6 to 0.8)
- 201 design points (low-discrepancy)
- reduce heteroscedasticity by planning run lengths via an approximation; the intrinsic standard error still varied over 10-fold

[Figure: cycle time (CT, from 5 to 10) vs. alpha1 (from 0 to 1) for the case α₂:α₃ = 1:6, x = 0.65, Gaussian correlation function; fitted curves and true values for products 1, 2, and 3.]

SK: Practical Advice
- Use the Gaussian correlation function r(x − x′) = exp(−Σᵢ θᵢ(xᵢ − xᵢ′)²): for smoothness of the random field, we need r(x − x′) → 1 slowly as x − x′ → 0.
- misspecification: spatial homogeneity; spatial transformation; trend modeling?

SK: Practical Advice
- Don't extrapolate! Beware the edge effect: bias due to one-sided smoothing near an edge.
- Placement of design points is important. Grids are not good in high dimension, and the goal is not the same as in regression. For uniformity: low-discrepancy sequences.
- Kriging can use lots of CPU time and memory for >1000 design points.

SK: Practical Advice
- Make the intrinsic variance Var[Y(xᵢ)]/nᵢ smaller than the variability of Y(x₁), …, Y(x_k), or of Y(x₁) − b(x₁)β, …, Y(x_k) − b(x_k)β.
- Can use a two-stage procedure to control intrinsic variance.
- More design points are better than very low variance at a few points: big k, but avoid excessive noise or cost.

SK: Posterior Variance
- The prediction is the posterior mean: ŷ(x) = b(x)β + r(x)ᵀ(R + C/τ²)⁻¹(Y − Bβ)
- Posterior variance: τ²(1 − r(x)ᵀ(R + C/τ²)⁻¹r(x))
- Don't trust the posterior variance! It is misleading due to misspecification of the GRF, but it can guide effort in multi-stage procedures.
- Kleijnen: bootstrapping & cross-validation
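The matching variance computation, continuing the sketch above (same assumed helpers and the same diagonal-C assumption):

    def stoch_krige_var(X, theta, tau2, intrinsic_var, x_new):
        """Posterior variance tau^2 * (1 - r(x)' (R + C/tau^2)^{-1} r(x))
        at each prediction point."""
        R = gauss_corr(X, X, theta)
        M = R + np.diag(np.asarray(intrinsic_var) / tau2)
        r = gauss_corr(x_new, X, theta)                     # (m, k)
        quad = np.einsum('ij,ij->i', r, np.linalg.solve(M, r.T).T)
        return tau2 * (1.0 - quad)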

SK: Amelioration
- Spatial inhomogeneity: a better-specified, data-driven spatial transformation; non-stationary covariance functions; partitioning: treed GRFs
- Computational cost: r(x − x′) = 0 if x is far from x′; sparse matrices; covariance tapering: set small correlations to 0

Conclusions
- Global metamodeling may be feasible: harnessing computer idle time.
- Consider stochastic kriging when regression isn't enough, the computational budget is limited, and there are fewer than 1000 design points (for now).
- www.stochastickriging.net: papers, MATLAB code