Data-Driven Bayesian Model Selection: Parameter Space Dimension Reduction using Automatic Relevance Determination Priors


Data-Driven Bayesian Model Selection: Parameter Space Dimension Reduction using Automatic Relevance Determination Priors

Mohammad Khalil (mkhalil@sandia.gov), Sandia National Laboratories, Livermore, CA
Workshop on Uncertainty Quantification and Data-Driven Modeling, Austin, Texas, March 23-24, 2017

Sandia National Laboratories is a multi-mission laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

Overview

[Slide outline: Bayesian model selection, automatic relevance determination (ARD) priors, and applications to nonlinear aeroelasticity, predictive surrogate modeling, and optimal model-error placement.]

Why Model Selection?

Model selection is the task of selecting a physical/statistical model from a set of candidate models, given data. When dealing with nontrivial physics under limited a priori understanding of the system, multiple plausible models can be envisioned to represent the system with reasonable accuracy.

- A complex model may overfit the data but yields higher model-prediction uncertainty.
- A simpler model may misfit the data but yields lower model-prediction uncertainty.
- An optimal model provides a balance between data fit and prediction uncertainty.

Common approaches:
- Cross-validation
- Akaike information criterion (AIC)
- Bayesian information criterion (BIC)
- (Bayesian) model evidence

Inverse Problems

Forward problem: given model parameters, use the forward model to predict (clean) observations.
Inverse problem: given noisy observations, infer the model parameters.

- Observations are inherently noisy, with an unknown (or weakly known) noise model.
- Data are sparse in space and time (insufficient resolution).
- The problem is typically ill-posed, i.e., there is no guarantee of solution existence or uniqueness.

Bayes' Theorem

The parameters $\phi$ are treated as a random vector. Using Bayes' rule, one can write

$p(\phi \mid d, M) = \frac{p(d \mid \phi, M)\, p(\phi \mid M)}{p(d \mid M)}$  (posterior = likelihood × prior / evidence)

- $p(\phi \mid M)$ is the prior pdf of $\phi$: it induces regularization.
- $p(d \mid \phi, M)$ is the likelihood pdf: it describes data misfit.
- $p(\phi \mid d, M)$ is the posterior pdf of $\phi$, the full Bayesian solution: not a single point estimate but a probability density; it completely characterizes the uncertainty in $\phi$ and is used in simulations for prediction under uncertainty.

For parameter inference alone, it is sufficient to consider
$p(\phi \mid d, M) \propto p(d \mid \phi, M)\, p(\phi \mid M)$

[Figure: prior, likelihood, and posterior pdfs.]
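To make the proportionality concrete, here is a minimal sketch (my illustration, not from the talk) that evaluates an unnormalized posterior on a grid for a one-parameter model; the linear forward model $y = \phi x + \epsilon$ and all numerical settings are assumptions made for the example.

```python
import numpy as np

# Assumed toy setup: forward model y = phi * x with Gaussian noise
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
phi_true, sigma = 2.0, 0.5
d = phi_true * x + rng.normal(0.0, sigma, x.size)

phi_grid = np.linspace(-2.0, 6.0, 401)

def log_likelihood(phi):
    # Gaussian noise model: describes data misfit
    resid = d[None, :] - phi[:, None] * x[None, :]
    return -0.5 * np.sum(resid**2, axis=1) / sigma**2

def log_prior(phi):
    # Zero-mean Gaussian prior N(0, 2^2): induces regularization
    return -0.5 * phi**2 / 2.0**2

# posterior ∝ likelihood × prior; normalize numerically on the grid
log_post = log_likelihood(phi_grid) + log_prior(phi_grid)
post = np.exp(log_post - log_post.max())
post /= np.trapz(post, phi_grid)
print("posterior mean of phi:", np.trapz(phi_grid * post, phi_grid))
```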

Stages of Bayesian Inference

Bayesian inverse modeling from real data is often an iterative process:
1. Select a model (parameters + priors).
2. Using the available data, perform model calibration: parameter inference.
3. Using the posterior parameter pdf, compute the model evidence: model selection.
4. Refine the model, or propose a new model, and repeat.

- Stage 1, parameter inference: I have a model and parameter priors; assume the model is accurate.
- Stage 2, model selection: I have more than one plausible model; compute the relative plausibility of the models given the data.
- Stage 3, model averaging: none of the models is clearly the best; obtain the posterior predictive density of the quantity of interest (QoI) averaged over the plausible models.

Model Evidence and Bayes Factor

When there are competing models, Bayesian model selection allows us to obtain their relative probabilities in light of the data and prior information. The best model is then the one that strikes an optimal balance between quality of fit and predictivity.

Model evidence: an integral of the likelihood over the prior, i.e., the marginalized (averaged) likelihood,
$p(d \mid M) = \int p(d \mid \phi, M)\, p(\phi \mid M)\, d\phi$

Model posterior (plausibility), obtained using Bayes' theorem:
$p(M \mid d) \propto p(d \mid M)\, p(M)$

Relative model posterior probabilities, obtained using the Bayes factor:
posterior odds = Bayes factor × prior odds, where the Bayes factor is the ratio of the model evidences.
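As a minimal illustration (mine, not the talk's method) of what the evidence integral means, the simplest estimator averages the likelihood over draws from the prior; it also exposes the Occam effect discussed on the next slide, since an unnecessarily diffuse prior "wastes" parameter space and lowers the evidence. The toy linear model below is an assumption for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy data: y = 2x with Gaussian noise of std 0.5
x = np.linspace(0.0, 1.0, 20)
d = 2.0 * x + rng.normal(0.0, 0.5, x.size)

def evidence_prior_mc(x, d, prior_std, n=50_000, sigma=0.5):
    """Prior Monte Carlo estimate: p(d|M) ≈ mean_i p(d | phi_i), phi_i ~ prior."""
    phis = rng.normal(0.0, prior_std, n)
    resid = d[None, :] - phis[:, None] * x[None, :]
    logls = (-0.5 * np.sum(resid**2, axis=1) / sigma**2
             - 0.5 * x.size * np.log(2.0 * np.pi * sigma**2))
    m = logls.max()  # log-sum-exp for numerical stability
    return m + np.log(np.mean(np.exp(logls - m)))

print(evidence_prior_mc(x, d, prior_std=2.0))    # reasonable prior
print(evidence_prior_mc(x, d, prior_std=200.0))  # diffuse prior: lower evidence
```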

Model Evidence and Occam's Razor

The Bayes model evidence balances quality of fit against unwarranted model complexity. It does so by penalizing wasted parameter space, thereby rewarding highly predictive models: an automatic Occam's razor effect. The parameter prior plays a decisive role, as it reflects the available parameter space under model M prior to assimilating the data.

[Figure: prior vs likelihood; a prior spread far beyond the likelihood penalizes complex models.]

Model Evidence: Nested Models

Nested models are often investigated in practice: a more complex model, $M_1$, with prior $p(\phi \mid M_1)$, reduces to a simpler nested model, $M_0$, for a certain value of the parameter, $\phi = \phi_0 = 0$.

Question: is the extra complexity of $M_1$ warranted by the data?

[Figure: prior vs likelihood for the nested parameter. Wasted parameter space favors the simpler model; mismatch between the nested prediction and the likelihood favors the more complex model.]
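The slide's defining equations did not survive transcription. A standard identity that matches this setup, offered here as a plausible reconstruction rather than the slide's own derivation, is the Savage-Dickey density ratio: when $M_0$ is nested in $M_1$ at $\phi = \phi_0$,

$B_{01} = \frac{p(d \mid M_0)}{p(d \mid M_1)} = \frac{p(\phi = \phi_0 \mid d, M_1)}{p(\phi = \phi_0 \mid M_1)}$

so posterior mass concentrating at $\phi_0$ relative to the prior favors the simpler model, while posterior mass moving away from $\phi_0$ favors the more complex one.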

Example: Occam's Razor at Work

Generate 6 noisy data points from the true model given by
$y_i = 1 + x_i^2 + \epsilon_i, \qquad \epsilon_i \sim \mathcal N(0, 1)$

Question: not knowing the true model, what is the best model? We propose polynomials of increasing order:

$M_0: \; y = a_0 + \epsilon$
$\;\;\vdots$
$M_5: \; y = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + a_4 x^4 + a_5 x^5 + \epsilon$

[Figure: prior and posterior fits for models $M_0$ through $M_5$, with the true model indicated, and the corresponding model evidences.]
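A minimal sketch (my illustration, with assumed prior and noise settings) of this kind of experiment: for a Gaussian linear-in-parameters model with prior $\phi \sim \mathcal N(0, \tau^2 I)$ and known noise variance, the evidence is a closed-form multivariate Gaussian density, so the polynomial orders can be scored directly.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)

# Data from the true model y = 1 + x^2 + eps, eps ~ N(0, 1)
n = 6
x = rng.uniform(-2.0, 2.0, n)
d = 1.0 + x**2 + rng.normal(0.0, 1.0, n)

sigma2, tau2 = 1.0, 10.0  # assumed noise and prior variances

for order in range(6):
    Phi = np.vander(x, order + 1, increasing=True)  # columns 1, x, ..., x^order
    # Marginal likelihood: d ~ N(0, sigma2*I + tau2 * Phi Phi^T)
    C = sigma2 * np.eye(n) + tau2 * Phi @ Phi.T
    log_ev = multivariate_normal(mean=np.zeros(n), cov=C).logpdf(d)
    print(f"order {order}: log evidence = {log_ev:.2f}")
```

The evidence typically peaks near the quadratic (true) model: higher orders fit no better while spreading prior mass over fits that the data rule out.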

Challenges with Bayesian Model Selection

Model evidence is extremely sensitive to the prior parameter pdfs.

Missing out on better candidate models: the number of possible models grows rapidly with the number of candidate terms in the physical/statistical model. For the previous example, the number of possible models built from terms up to and including order 5 is

$N_M = \sum_{k=1}^{6} \binom{6}{k} = \frac{6!}{1!\,5!} + \frac{6!}{2!\,4!} + \frac{6!}{3!\,3!} + \frac{6!}{4!\,2!} + \frac{6!}{5!\,1!} + \frac{6!}{6!\,0!} = 63$

For polynomials of maximum order 10, there are 1023 possible models!

Solution: automatic relevance determination (ARD) priors.

Challenges with Bayesian Model Selection: ARD Priors

A parameterized prior distribution known as the ARD prior is assigned to the unknown model parameters. The ARD prior pdf is a Gaussian with zero mean and unknown variance (Laplace priors, among others, could also be used). The hyper-parameters, $\alpha$, are estimated from the data by evidence maximization, also known as type-II maximum likelihood estimation.

Prior: $p(\phi \mid \alpha, M)$
Posterior: $p(\phi \mid d, \alpha, M)$
Type-II likelihood: $p(d \mid \alpha, M) = \int p(d \mid \phi, M)\, p(\phi \mid \alpha, M)\, d\phi$
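For concreteness (my addition, following the standard sparse Bayesian learning formulation rather than anything specific to these slides): for a linear-Gaussian model $d = \Phi\phi + \epsilon$, $\epsilon \sim \mathcal N(0, \sigma^2 I)$, with ARD prior $p(\phi \mid \alpha) = \mathcal N(0, A^{-1})$, $A = \mathrm{diag}(\alpha)$, the type-II likelihood is available in closed form:

$\log p(d \mid \alpha) = -\tfrac{1}{2}\left[\, n \log 2\pi + \log\lvert C \rvert + d^{\top} C^{-1} d \,\right], \qquad C = \sigma^2 I + \Phi A^{-1} \Phi^{\top}$

Maximizing over $\alpha$ drives the precisions of superfluous parameters toward infinity, effectively pruning them from the model.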

Example: ARD Priors

Revisiting the previous example with the true model given by
$y_i = 1 + x_i^2 + \epsilon_i, \qquad \epsilon_i \sim \mathcal N(0, 1)$

Question: what is the best model nested under the encompassing model
$y = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + a_4 x^4 + a_5 x^5 + \epsilon\,$?

[Figure: hyper-parameter and type-II likelihood (model evidence) traces versus optimizer iteration; convergence could be improved with a better optimizer.]
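This pruning behavior can be reproduced quickly with off-the-shelf tools. The sketch below is my illustration using scikit-learn's ARDRegression rather than the talk's own optimizer, with a larger sample size (an assumption) for a cleaner result; it fits the order-5 encompassing model and reports the per-coefficient ARD precisions.

```python
import numpy as np
from sklearn.linear_model import ARDRegression

rng = np.random.default_rng(3)

# True model y = 1 + x^2 + noise; encompassing polynomial model of order 5
x = rng.uniform(-2.0, 2.0, 200)
y = 1.0 + x**2 + rng.normal(0.0, 1.0, x.size)
Phi = np.vander(x, 6, increasing=True)  # columns 1, x, ..., x^5

ard = ARDRegression(fit_intercept=False)  # constant term is already in Phi
ard.fit(Phi, y)

# Superfluous terms acquire large ARD precision (lambda_) and near-zero weights
for k, (w, lam) in enumerate(zip(ard.coef_, ard.lambda_)):
    print(f"a_{k}: weight = {w:+.3f}, ARD precision = {lam:.1e}")
```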

Nonlinear Modeling in Aeroelasticity

Limit-cycle oscillation (LCO) is observed in wind-tunnel experiments for a 2-D rigid airfoil in the transitional Reynolds-number regime: pure pitch LCO due to nonlinear aerodynamic loads.

Objective: inverse modeling of nonlinear oscillations, with an aim to understand and quantify the contributions of unsteady and nonlinear aerodynamics.

[Figure: measured pitch time series (normalized time).]

Research Group/Resources

- Philippe Bisaillon, Ph.D. candidate, Carleton University
- Rimple Sandhu, Ph.D. candidate, Carleton University
- Dominique Poirel, Royal Military College (RMC) of Canada
- Abhijit Sarkar, Carleton University
- Chris Pettit, United States Naval Academy

[Figures: HPC lab at Carleton University; wind tunnel at RMC.]

Previous Work

Start with a candidate model set. The pitch structural equation is

$I_{EA}\,\ddot\theta + D\,\dot\theta + K\theta + K_3\theta^3 = D\,\mathrm{sign}(\dot\theta) + \tfrac{1}{2}\rho u^2 c^2 s\, C_M(\theta, \dot\theta, \ddot\theta)$

with candidate aerodynamic models ranging from

$M_1: \; C_M = e_1\theta + e_2\dot\theta + e_3\theta^3 + e_4\theta^2\dot\theta + \sigma\xi(\tau)$

to

$M_6: \; \frac{\ddot C_M}{B_1 B_2} + \frac{B_1 + B_2}{B_1 B_2}\,\dot C_M + C_M = e_1\theta + e_2\dot\theta + e_3\theta^3 + e_4\theta^2\dot\theta + e_5\theta^5 + \frac{2 c_6 c_7 + 0.5}{B_1 B_2}\,\dot\theta + \frac{c_6}{B_1 B_2}\,\ddot\theta + \sigma\xi(\tau)$

We observe the pitch degree of freedom (DOF), $d_k = \theta(t_k) + \epsilon_k$, and perform Bayesian model selection in the discrete model space.

Sandhu et al., JCP, 2016; Sandhu et al., CMAME, 2014; Khalil et al., JSV, 2013.

Use of ARD Priors

Start with an encompassing model:

$I_{EA}\,\ddot\theta + D\,\dot\theta + K\theta + K_3\theta^3 = D\,\mathrm{sign}(\dot\theta) + \tfrac{1}{2}\rho u^2 c^2 s\, C_M$

$\frac{\dot C_M}{B} + C_M = a_1\theta + a_2\dot\theta + a_3\theta^3 + a_4\theta^2\dot\theta + a_5\theta^5 + a_6\theta^4\dot\theta + \frac{c_6}{B}\,\ddot\theta + \sigma\xi(\tau)$

We would like to find the optimal model nested under this overly-prescribed encompassing model.

Hybrid Approach: ARD Priors vs Fixed Priors

We assign prior distributions by categorizing the parameters, based on prior knowledge of the aerodynamics, as required (fixed priors, $\phi_{\bar\alpha}$) or contentious (ARD priors, $\phi_\alpha$):

$\frac{\dot C_M}{B} + C_M = a_1\theta + a_2\dot\theta + a_3\theta^3 + a_4\theta^2\dot\theta + a_5\theta^5 + a_6\theta^4\dot\theta + \frac{c_6}{B}\,\ddot\theta + \sigma\xi(\tau)$

Hierarchical Bayes

Using a hierarchical Bayes approach, the posterior pdf $p(\alpha \mid d)$ of the hyper-parameter vector $\alpha$ is

$p(\alpha \mid d) \propto p(d \mid \alpha)\, p(\alpha)$

For a fixed hyper-prior $p(\alpha)$, this entails three nested tasks:

- Stochastic optimization: $\alpha_{MAP} = \arg\max_\alpha\, p(\alpha \mid d)$
- Evidence computation, for the model evidence as a function of the hyper-parameters: $p(d \mid \alpha) = \int p(d \mid \phi)\, p(\phi \mid \alpha)\, d\phi$
- State estimation, for the parameter likelihood: $p(d \mid \phi) = \prod_{k=1}^{n_d} \int p\!\left(d_k \mid u_{j(k)}, \phi\right) p\!\left(u_{j(k)} \mid d_{1:k-1}, \phi\right) du_{j(k)}$
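A minimal sketch of the outer two layers (my illustration; the state-estimation likelihood is replaced by a closed-form linear-Gaussian likelihood, and a single shared hyper-parameter is assumed for brevity): under a flat hyper-prior, $\alpha_{MAP}$ is found by maximizing the type-II likelihood.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import multivariate_normal

rng = np.random.default_rng(4)

# Toy linear-Gaussian stand-in for p(d | phi)
x = rng.uniform(-2.0, 2.0, 40)
d = 1.0 + x**2 + rng.normal(0.0, 1.0, x.size)
Phi = np.vander(x, 6, increasing=True)
sigma2 = 1.0

def neg_log_evidence(log_alpha):
    # p(d | alpha) for a single shared ARD precision alpha
    alpha = np.exp(log_alpha)
    C = sigma2 * np.eye(x.size) + (1.0 / alpha) * Phi @ Phi.T
    return -multivariate_normal(np.zeros(x.size), C).logpdf(d)

# Flat hyper-prior: alpha_MAP maximizes the type-II likelihood
res = minimize_scalar(neg_log_evidence, bounds=(-10.0, 10.0), method="bounded")
print("alpha_MAP =", np.exp(res.x))
```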

Numerical Techniques

- Evidence computation: Chib-Jeliazkov method, power posteriors, nested sampling, annealed importance sampling, harmonic mean estimator, adaptive Gauss-Hermite quadrature, and many others.
- MCMC sampler for the Chib-Jeliazkov method: Metropolis-Hastings, Gibbs, TMCMC, adaptive Metropolis, Delayed Rejection Adaptive Metropolis (DRAM), and many others.
- State estimation: Kalman filter, extended Kalman filter, unscented Kalman filter, ensemble Kalman filter, particle filter, and many others.

Results are in: R. Sandhu, C. Pettit, M. Khalil, D. Poirel, A. Sarkar, "Bayesian model selection using automatic relevance determination for nonlinear dynamical systems," Computer Methods in Applied Mechanics and Engineering, in press.
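For reference, here is a minimal random-walk Metropolis-Hastings sampler (a generic sketch, not the DRAM implementation used in the paper); this is the kind of posterior chain that the Chib-Jeliazkov evidence estimator post-processes.

```python
import numpy as np

def metropolis_hastings(log_post, phi0, n_steps=10_000, step=0.5, seed=0):
    """Random-walk Metropolis-Hastings targeting exp(log_post)."""
    rng = np.random.default_rng(seed)
    phi = np.asarray(phi0, dtype=float)
    lp = log_post(phi)
    chain = np.empty((n_steps, phi.size))
    for i in range(n_steps):
        prop = phi + step * rng.standard_normal(phi.size)  # symmetric proposal
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:  # accept w.p. min(1, ratio)
            phi, lp = prop, lp_prop
        chain[i] = phi
    return chain

# Example: sample a standard 2-D Gaussian "posterior"
chain = metropolis_hastings(lambda p: -0.5 * p @ p, phi0=np.zeros(2))
print(chain[5000:].mean(axis=0), chain[5000:].std(axis=0))
```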

Numerical Results: ARD Priors

[Figure: numerical results with ARD priors.]

Numerical Results: ARD vs Flat Priors

We compare selected marginal and joint pdfs for (a) ARD priors with optimal hyper-parameters and (b) flat priors. The ARD priors are able to remove superfluous parameters while having an insignificant effect on the posterior pdfs of the important parameters.

[Figure: marginal and joint posterior pdfs under ARD vs flat priors.]

Predictive Surrogate Modeling

Collaborators: Jina Lee, Maher Salloum (Sandia National Laboratories)

Objective: replace computationally expensive simulations of physical systems with response predictions constructed at the wavelet-coefficient level.

Procedure:
- Perform compressed sensing of the high-dimensional system response from full-order model simulations.
- Model the resulting low-dimensional wavelet coefficients using an autoregressive moving-average (ARMA) model with observation noise:

$x_t = \sum_{i=1}^{p} \varphi_i\, x_{t-i} + \sum_{j=1}^{q} \theta_j\, \epsilon_{t-j} + \epsilon_t, \qquad \epsilon_t \sim \mathcal N(0, 1)$
$y_t = x_t + \zeta_t, \qquad \zeta_t \sim \mathcal N(0, \gamma^2)$

- The parameter likelihood for $\varphi_i$, $\theta_j$, and $\gamma$ involves state estimation using the Kalman filter.
- Model selection, i.e., determining the model orders p and q, is performed using the Akaike information criterion (AIC), as sketched below.
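A minimal sketch of the AIC-based order-selection step (my illustration using statsmodels, which, like the talk's approach, evaluates the ARMA likelihood with a state-space Kalman filter; the synthetic series and the omission of the observation-noise term are assumptions for brevity).

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(5)

# Synthetic ARMA(2,1)-like series standing in for a wavelet-coefficient trace
eps = rng.standard_normal(500)
x = np.zeros(500)
for t in range(2, 500):
    x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + eps[t] + 0.4 * eps[t - 1]

# Choose (p, q) by minimizing AIC over a small grid of candidate orders
best = min(
    ((p, q) for p in range(4) for q in range(3)),
    key=lambda pq: ARIMA(x, order=(pq[0], 0, pq[1])).fit().aic,
)
print("AIC-selected (p, q):", best)
```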

Wavelet Coefficient Predictions

For illustration, we consider the transient response of the 2-D heat equation on a square domain with randomly chosen holes (for added heterogeneity). Compressed sensing is performed, and the 7 dominant wavelet coefficients are modeled.

[Figure: wavelet-coefficient predictions for the 2-D heat-equation example.]

Optimal Model-Error Placement

Collaborators: Layal Hakim, Guilhem Lacaze, Khachik Sargsyan, Habib Najm, Joe Oefelein (Sandia National Laboratories)

Objective: calibrate a simple chemical model against computations from a detailed kinetic model.

- A simple model with an embedded parameterization of model error using polynomial chaos expansions.
- Optimal placement of the model error is achieved via Bayesian model selection (Bayes factors).

[Figure: Bayes factors for candidate model-error placements.]

Summary

- Presented a framework for data-driven model selection using ARD prior pdfs.
- ARD priors enable the transformation of the model selection problem from the discrete model space to the continuous hyper-parameter space.
- They allow for parameter-space dimension reduction informed by noisy observations of the system.
- The ARD priors are able to remove superfluous parameters while having an insignificant effect on the posterior pdfs of the important parameters.

Applications:
- Nonlinear dynamical systems modeled using stochastic ordinary differential equations (ARD priors)
- Predictive surrogate modeling (AIC)
- Optimal model-error placement (Bayes factors)