Data-Driven Bayesian Model Selection: Parameter Space Dimension Reduction using Automatic Relevance Determination Priors


Data-Driven Bayesian Model Selection: Parameter Space Dimension Reduction using Automatic Relevance Determination Priors

Mohammad Khalil (mkhalil@sandia.gov), Sandia National Laboratories, Livermore, CA
Workshop on Uncertainty Quantification and Data-Driven Modeling, Austin, Texas, March 23-24, 2017

Sandia National Laboratories is a multi-mission laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

Overview

[Slide outline: Bayesian model selection, automatic relevance determination (ARD) priors, and applications to nonlinear aeroelasticity, predictive surrogate modeling, and optimal model-error placement.]

Why Model Selection?

Model selection is the task of selecting a physical/statistical model from a set of candidate models, given data. When dealing with nontrivial physics under limited a priori understanding of the system, multiple plausible models can be envisioned to represent the system with reasonable accuracy.

- A complex model may overfit the data but yields higher model-prediction uncertainty.
- A simpler model may misfit the data but yields lower model-prediction uncertainty.
- An optimal model provides a balance between data fit and prediction uncertainty.

Common approaches:
- Cross-validation
- Akaike information criterion (AIC)
- Bayesian information criterion (BIC)
- (Bayesian) model evidence

Inverse Problems

Forward problem: given model parameters, use the forward model to predict (clean) observations.
Inverse problem: given noisy observations, infer the model parameters.

- Observations are inherently noisy, with an unknown (or weakly known) noise model.
- Data are sparse in space and time (insufficient resolution).
- The problem is typically ill-posed, i.e., there is no guarantee of solution existence or uniqueness.

Bayes' Theorem

The parameters $\phi$ are treated as a random vector. Using Bayes' rule, one can write

$p(\phi \mid d, M) = \frac{p(d \mid \phi, M)\, p(\phi \mid M)}{p(d \mid M)}$  (posterior = likelihood × prior / evidence)

- $p(\phi \mid M)$ is the prior pdf of $\phi$: it induces regularization.
- $p(d \mid \phi, M)$ is the likelihood pdf: it describes data misfit.
- $p(\phi \mid d, M)$ is the posterior pdf of $\phi$, the full Bayesian solution: not a single point estimate but a probability density; it completely characterizes the uncertainty in $\phi$ and is used in simulations for prediction under uncertainty.

For parameter inference alone, it is sufficient to consider
$p(\phi \mid d, M) \propto p(d \mid \phi, M)\, p(\phi \mid M)$

[Figure: prior, likelihood, and posterior pdfs.]
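To make the proportionality concrete, here is a minimal sketch (my illustration, not from the talk) that evaluates an unnormalized posterior on a grid for a one-parameter model; the linear forward model $y = \phi x + \epsilon$ and all numerical settings are assumptions made for the example.

```python
import numpy as np

# Assumed toy setup: forward model y = phi * x with Gaussian noise
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
phi_true, sigma = 2.0, 0.5
d = phi_true * x + rng.normal(0.0, sigma, x.size)

phi_grid = np.linspace(-2.0, 6.0, 401)

def log_likelihood(phi):
    # Gaussian noise model: describes data misfit
    resid = d[None, :] - phi[:, None] * x[None, :]
    return -0.5 * np.sum(resid**2, axis=1) / sigma**2

def log_prior(phi):
    # Zero-mean Gaussian prior N(0, 2^2): induces regularization
    return -0.5 * phi**2 / 2.0**2

# posterior ∝ likelihood × prior; normalize numerically on the grid
log_post = log_likelihood(phi_grid) + log_prior(phi_grid)
post = np.exp(log_post - log_post.max())
post /= np.trapz(post, phi_grid)
print("posterior mean of phi:", np.trapz(phi_grid * post, phi_grid))
```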

Stages of Bayesian Inference

Bayesian inverse modeling from real data is often an iterative process:
1. Select a model (parameters + priors).
2. Using the available data, perform model calibration: parameter inference.
3. Using the posterior parameter pdf, compute the model evidence: model selection.
4. Refine the model, or propose a new model, and repeat.

- Stage 1, parameter inference: I have a model and parameter priors; assume the model is accurate.
- Stage 2, model selection: I have more than one plausible model; compute the relative plausibility of the models given the data.
- Stage 3, model averaging: none of the models is clearly the best; obtain the posterior predictive density of the quantity of interest (QoI) averaged over the plausible models.

Model Evidence and Bayes Factor

When there are competing models, Bayesian model selection allows us to obtain their relative probabilities in light of the data and prior information. The best model is then the one that strikes an optimal balance between quality of fit and predictivity.

Model evidence: an integral of the likelihood over the prior, i.e., the marginalized (averaged) likelihood,
$p(d \mid M) = \int p(d \mid \phi, M)\, p(\phi \mid M)\, d\phi$

Model posterior (plausibility), obtained using Bayes' theorem:
$p(M \mid d) \propto p(d \mid M)\, p(M)$

Relative model posterior probabilities, obtained using the Bayes factor:
posterior odds = Bayes factor × prior odds, where the Bayes factor is the ratio of the model evidences.
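As a minimal illustration (mine, not the talk's method) of what the evidence integral means, the simplest estimator averages the likelihood over draws from the prior; it also exposes the Occam effect discussed on the next slide, since an unnecessarily diffuse prior "wastes" parameter space and lowers the evidence. The toy linear model below is an assumption for the example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy data: y = 2x with Gaussian noise of std 0.5
x = np.linspace(0.0, 1.0, 20)
d = 2.0 * x + rng.normal(0.0, 0.5, x.size)

def evidence_prior_mc(x, d, prior_std, n=50_000, sigma=0.5):
    """Prior Monte Carlo estimate: p(d|M) ≈ mean_i p(d | phi_i), phi_i ~ prior."""
    phis = rng.normal(0.0, prior_std, n)
    resid = d[None, :] - phis[:, None] * x[None, :]
    logls = (-0.5 * np.sum(resid**2, axis=1) / sigma**2
             - 0.5 * x.size * np.log(2.0 * np.pi * sigma**2))
    m = logls.max()  # log-sum-exp for numerical stability
    return m + np.log(np.mean(np.exp(logls - m)))

print(evidence_prior_mc(x, d, prior_std=2.0))    # reasonable prior
print(evidence_prior_mc(x, d, prior_std=200.0))  # diffuse prior: lower evidence
```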

Model Evidence and Occam's Razor

The Bayes model evidence balances quality of fit against unwarranted model complexity. It does so by penalizing wasted parameter space, thereby rewarding highly predictive models: an automatic Occam's razor effect. The parameter prior plays a decisive role, as it reflects the available parameter space under model M prior to assimilating the data.

[Figure: prior vs likelihood; a prior spread far beyond the likelihood penalizes complex models.]

Model Evidence: Nested Models

Nested models are often investigated in practice: a more complex model, $M_1$, with prior $p(\phi \mid M_1)$, reduces to a simpler nested model, $M_0$, for a certain value of the parameter, $\phi = \phi_0 = 0$.

Question: is the extra complexity of $M_1$ warranted by the data?

[Figure: prior vs likelihood for the nested parameter. Wasted parameter space favors the simpler model; mismatch between the nested prediction and the likelihood favors the more complex model.]
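The slide's defining equations did not survive transcription. A standard identity that matches this setup, offered here as a plausible reconstruction rather than the slide's own derivation, is the Savage-Dickey density ratio: when $M_0$ is nested in $M_1$ at $\phi = \phi_0$,

$B_{01} = \frac{p(d \mid M_0)}{p(d \mid M_1)} = \frac{p(\phi = \phi_0 \mid d, M_1)}{p(\phi = \phi_0 \mid M_1)}$

so posterior mass concentrating at $\phi_0$ relative to the prior favors the simpler model, while posterior mass moving away from $\phi_0$ favors the more complex one.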

Example: Occam's Razor at Work

Generate 6 noisy data points from the true model given by
$y_i = 1 + x_i^2 + \epsilon_i, \qquad \epsilon_i \sim \mathcal N(0, 1)$

Question: not knowing the true model, what is the best model? We propose polynomials of increasing order:

$M_0: \; y = a_0 + \epsilon$
$\;\;\vdots$
$M_5: \; y = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + a_4 x^4 + a_5 x^5 + \epsilon$

[Figure: prior and posterior fits for models $M_0$ through $M_5$, with the true model indicated, and the corresponding model evidences.]
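A minimal sketch (my illustration, with assumed prior and noise settings) of this kind of experiment: for a Gaussian linear-in-parameters model with prior $\phi \sim \mathcal N(0, \tau^2 I)$ and known noise variance, the evidence is a closed-form multivariate Gaussian density, so the polynomial orders can be scored directly.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(2)

# Data from the true model y = 1 + x^2 + eps, eps ~ N(0, 1)
n = 6
x = rng.uniform(-2.0, 2.0, n)
d = 1.0 + x**2 + rng.normal(0.0, 1.0, n)

sigma2, tau2 = 1.0, 10.0  # assumed noise and prior variances

for order in range(6):
    Phi = np.vander(x, order + 1, increasing=True)  # columns 1, x, ..., x^order
    # Marginal likelihood: d ~ N(0, sigma2*I + tau2 * Phi Phi^T)
    C = sigma2 * np.eye(n) + tau2 * Phi @ Phi.T
    log_ev = multivariate_normal(mean=np.zeros(n), cov=C).logpdf(d)
    print(f"order {order}: log evidence = {log_ev:.2f}")
```

The evidence typically peaks near the quadratic (true) model: higher orders fit no better while spreading prior mass over fits that the data rule out.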

Challenges with Bayesian Model Selection

Model evidence is extremely sensitive to the prior parameter pdfs.

Missing out on better candidate models: the number of possible models grows rapidly with the number of candidate terms in the physical/statistical model. For the previous example, the number of possible models built from terms up to and including order 5 is

$N_M = \sum_{k=1}^{6} \binom{6}{k} = \frac{6!}{1!\,5!} + \frac{6!}{2!\,4!} + \frac{6!}{3!\,3!} + \frac{6!}{4!\,2!} + \frac{6!}{5!\,1!} + \frac{6!}{6!\,0!} = 63$

For polynomials of maximum order 10, there are 1023 possible models!

Solution: automatic relevance determination (ARD) priors.

Challenges with Bayesian Model Selection: ARD Priors

A parameterized prior distribution known as the ARD prior is assigned to the unknown model parameters. The ARD prior pdf is a Gaussian with zero mean and unknown variance (Laplace priors, among others, could also be used). The hyper-parameters, $\alpha$, are estimated from the data by evidence maximization, also known as type-II maximum likelihood estimation.

Prior: $p(\phi \mid \alpha, M)$
Posterior: $p(\phi \mid d, \alpha, M)$
Type-II likelihood: $p(d \mid \alpha, M) = \int p(d \mid \phi, M)\, p(\phi \mid \alpha, M)\, d\phi$
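For concreteness (my addition, following the standard sparse Bayesian learning formulation rather than anything specific to these slides): for a linear-Gaussian model $d = \Phi\phi + \epsilon$, $\epsilon \sim \mathcal N(0, \sigma^2 I)$, with ARD prior $p(\phi \mid \alpha) = \mathcal N(0, A^{-1})$, $A = \mathrm{diag}(\alpha)$, the type-II likelihood is available in closed form:

$\log p(d \mid \alpha) = -\tfrac{1}{2}\left[\, n \log 2\pi + \log\lvert C \rvert + d^{\top} C^{-1} d \,\right], \qquad C = \sigma^2 I + \Phi A^{-1} \Phi^{\top}$

Maximizing over $\alpha$ drives the precisions of superfluous parameters toward infinity, effectively pruning them from the model.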

Example: ARD Priors

Revisiting the previous example with the true model given by
$y_i = 1 + x_i^2 + \epsilon_i, \qquad \epsilon_i \sim \mathcal N(0, 1)$

Question: what is the best model nested under the encompassing model
$y = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + a_4 x^4 + a_5 x^5 + \epsilon\,$?

[Figure: hyper-parameter and type-II likelihood (model evidence) traces versus optimizer iteration; convergence could be improved with a better optimizer.]
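This pruning behavior can be reproduced quickly with off-the-shelf tools. The sketch below is my illustration using scikit-learn's ARDRegression rather than the talk's own optimizer, with a larger sample size (an assumption) for a cleaner result; it fits the order-5 encompassing model and reports the per-coefficient ARD precisions.

```python
import numpy as np
from sklearn.linear_model import ARDRegression

rng = np.random.default_rng(3)

# True model y = 1 + x^2 + noise; encompassing polynomial model of order 5
x = rng.uniform(-2.0, 2.0, 200)
y = 1.0 + x**2 + rng.normal(0.0, 1.0, x.size)
Phi = np.vander(x, 6, increasing=True)  # columns 1, x, ..., x^5

ard = ARDRegression(fit_intercept=False)  # constant term is already in Phi
ard.fit(Phi, y)

# Superfluous terms acquire large ARD precision (lambda_) and near-zero weights
for k, (w, lam) in enumerate(zip(ard.coef_, ard.lambda_)):
    print(f"a_{k}: weight = {w:+.3f}, ARD precision = {lam:.1e}")
```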

Nonlinear Modeling in Aeroelasticity

Limit-cycle oscillation (LCO) is observed in wind-tunnel experiments for a 2-D rigid airfoil in the transitional Reynolds-number regime: pure pitch LCO due to nonlinear aerodynamic loads.

Objective: inverse modeling of nonlinear oscillations, with an aim to understand and quantify the contributions of unsteady and nonlinear aerodynamics.

[Figure: measured pitch time series (normalized time).]

Research Group/Resources

- Philippe Bisaillon, Ph.D. candidate, Carleton University
- Rimple Sandhu, Ph.D. candidate, Carleton University
- Dominique Poirel, Royal Military College (RMC) of Canada
- Abhijit Sarkar, Carleton University
- Chris Pettit, United States Naval Academy

[Figures: HPC lab at Carleton University; wind tunnel at RMC.]

Previous Work

Start with a candidate model set. The pitch structural equation is

$I_{EA}\,\ddot\theta + D\,\dot\theta + K\theta + K_3\theta^3 = D\,\mathrm{sign}(\dot\theta) + \tfrac{1}{2}\rho u^2 c^2 s\, C_M(\theta, \dot\theta, \ddot\theta)$

with candidate aerodynamic models ranging from

$M_1: \; C_M = e_1\theta + e_2\dot\theta + e_3\theta^3 + e_4\theta^2\dot\theta + \sigma\xi(\tau)$

to

$M_6: \; \frac{\ddot C_M}{B_1 B_2} + \frac{B_1 + B_2}{B_1 B_2}\,\dot C_M + C_M = e_1\theta + e_2\dot\theta + e_3\theta^3 + e_4\theta^2\dot\theta + e_5\theta^5 + \frac{2 c_6 c_7 + 0.5}{B_1 B_2}\,\dot\theta + \frac{c_6}{B_1 B_2}\,\ddot\theta + \sigma\xi(\tau)$

We observe the pitch degree of freedom (DOF), $d_k = \theta(t_k) + \epsilon_k$, and perform Bayesian model selection in the discrete model space.

Sandhu et al., JCP, 2016; Sandhu et al., CMAME, 2014; Khalil et al., JSV, 2013.

Use of ARD Priors

Start with an encompassing model:

$I_{EA}\,\ddot\theta + D\,\dot\theta + K\theta + K_3\theta^3 = D\,\mathrm{sign}(\dot\theta) + \tfrac{1}{2}\rho u^2 c^2 s\, C_M$

$\frac{\dot C_M}{B} + C_M = a_1\theta + a_2\dot\theta + a_3\theta^3 + a_4\theta^2\dot\theta + a_5\theta^5 + a_6\theta^4\dot\theta + \frac{c_6}{B}\,\ddot\theta + \sigma\xi(\tau)$

We would like to find the optimal model nested under this overly-prescribed encompassing model.

Hybrid Approach: ARD Priors vs Fixed Priors

We assign prior distributions by categorizing the parameters, based on prior knowledge of the aerodynamics, as required (fixed priors, $\phi_{\bar\alpha}$) or contentious (ARD priors, $\phi_\alpha$):

$\frac{\dot C_M}{B} + C_M = a_1\theta + a_2\dot\theta + a_3\theta^3 + a_4\theta^2\dot\theta + a_5\theta^5 + a_6\theta^4\dot\theta + \frac{c_6}{B}\,\ddot\theta + \sigma\xi(\tau)$

Hierarchical Bayes

Using a hierarchical Bayes approach, the posterior pdf $p(\alpha \mid d)$ of the hyper-parameter vector $\alpha$ is

$p(\alpha \mid d) \propto p(d \mid \alpha)\, p(\alpha)$

For a fixed hyper-prior $p(\alpha)$, this entails three nested tasks:

- Stochastic optimization: $\alpha_{MAP} = \arg\max_\alpha\, p(\alpha \mid d)$
- Evidence computation, for the model evidence as a function of the hyper-parameters: $p(d \mid \alpha) = \int p(d \mid \phi)\, p(\phi \mid \alpha)\, d\phi$
- State estimation, for the parameter likelihood: $p(d \mid \phi) = \prod_{k=1}^{n_d} \int p\!\left(d_k \mid u_{j(k)}, \phi\right) p\!\left(u_{j(k)} \mid d_{1:k-1}, \phi\right) du_{j(k)}$
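A minimal sketch of the outer two layers (my illustration; the state-estimation likelihood is replaced by a closed-form linear-Gaussian likelihood, and a single shared hyper-parameter is assumed for brevity): under a flat hyper-prior, $\alpha_{MAP}$ is found by maximizing the type-II likelihood.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import multivariate_normal

rng = np.random.default_rng(4)

# Toy linear-Gaussian stand-in for p(d | phi)
x = rng.uniform(-2.0, 2.0, 40)
d = 1.0 + x**2 + rng.normal(0.0, 1.0, x.size)
Phi = np.vander(x, 6, increasing=True)
sigma2 = 1.0

def neg_log_evidence(log_alpha):
    # p(d | alpha) for a single shared ARD precision alpha
    alpha = np.exp(log_alpha)
    C = sigma2 * np.eye(x.size) + (1.0 / alpha) * Phi @ Phi.T
    return -multivariate_normal(np.zeros(x.size), C).logpdf(d)

# Flat hyper-prior: alpha_MAP maximizes the type-II likelihood
res = minimize_scalar(neg_log_evidence, bounds=(-10.0, 10.0), method="bounded")
print("alpha_MAP =", np.exp(res.x))
```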

Numerical Techniques

- Evidence computation: Chib-Jeliazkov method, power posteriors, nested sampling, annealed importance sampling, harmonic mean estimator, adaptive Gauss-Hermite quadrature, and many others.
- MCMC sampler for the Chib-Jeliazkov method: Metropolis-Hastings, Gibbs, TMCMC, adaptive Metropolis, Delayed Rejection Adaptive Metropolis (DRAM), and many others.
- State estimation: Kalman filter, extended Kalman filter, unscented Kalman filter, ensemble Kalman filter, particle filter, and many others.

Results are in: R. Sandhu, C. Pettit, M. Khalil, D. Poirel, A. Sarkar, "Bayesian model selection using automatic relevance determination for nonlinear dynamical systems," Computer Methods in Applied Mechanics and Engineering, in press.
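For reference, here is a minimal random-walk Metropolis-Hastings sampler (a generic sketch, not the DRAM implementation used in the paper); this is the kind of posterior chain that the Chib-Jeliazkov evidence estimator post-processes.

```python
import numpy as np

def metropolis_hastings(log_post, phi0, n_steps=10_000, step=0.5, seed=0):
    """Random-walk Metropolis-Hastings targeting exp(log_post)."""
    rng = np.random.default_rng(seed)
    phi = np.asarray(phi0, dtype=float)
    lp = log_post(phi)
    chain = np.empty((n_steps, phi.size))
    for i in range(n_steps):
        prop = phi + step * rng.standard_normal(phi.size)  # symmetric proposal
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:  # accept w.p. min(1, ratio)
            phi, lp = prop, lp_prop
        chain[i] = phi
    return chain

# Example: sample a standard 2-D Gaussian "posterior"
chain = metropolis_hastings(lambda p: -0.5 * p @ p, phi0=np.zeros(2))
print(chain[5000:].mean(axis=0), chain[5000:].std(axis=0))
```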

Numerical Results: ARD Priors

[Figure: numerical results with ARD priors.]

Numerical Results: ARD vs Flat Priors

We compare selected marginal and joint pdfs for (a) ARD priors with optimal hyper-parameters and (b) flat priors. The ARD priors are able to remove superfluous parameters while having an insignificant effect on the posterior pdfs of the important parameters.

[Figure: marginal and joint posterior pdfs under ARD vs flat priors.]

Predictive Surrogate Modeling

Collaborators: Jina Lee, Maher Salloum (Sandia National Laboratories)

Objective: replace computationally expensive simulations of physical systems with response predictions constructed at the wavelet-coefficient level.

Procedure:
- Perform compressed sensing of the high-dimensional system response from full-order model simulations.
- Model the resulting low-dimensional wavelet coefficients using an autoregressive moving-average (ARMA) model with observation noise:

$x_t = \sum_{i=1}^{p} \varphi_i\, x_{t-i} + \sum_{j=1}^{q} \theta_j\, \epsilon_{t-j} + \epsilon_t, \qquad \epsilon_t \sim \mathcal N(0, 1)$
$y_t = x_t + \zeta_t, \qquad \zeta_t \sim \mathcal N(0, \gamma^2)$

- The parameter likelihood for $\varphi_i$, $\theta_j$, and $\gamma$ involves state estimation using the Kalman filter.
- Model selection, i.e., determining the model orders p and q, is performed using the Akaike information criterion (AIC), as sketched below.
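A minimal sketch of the AIC-based order-selection step (my illustration using statsmodels, which, like the talk's approach, evaluates the ARMA likelihood with a state-space Kalman filter; the synthetic series and the omission of the observation-noise term are assumptions for brevity).

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(5)

# Synthetic ARMA(2,1)-like series standing in for a wavelet-coefficient trace
eps = rng.standard_normal(500)
x = np.zeros(500)
for t in range(2, 500):
    x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + eps[t] + 0.4 * eps[t - 1]

# Choose (p, q) by minimizing AIC over a small grid of candidate orders
best = min(
    ((p, q) for p in range(4) for q in range(3)),
    key=lambda pq: ARIMA(x, order=(pq[0], 0, pq[1])).fit().aic,
)
print("AIC-selected (p, q):", best)
```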

Wavelet Coefficient Predictions

For illustration, we consider the transient response of the 2-D heat equation on a square domain with randomly chosen holes (for added heterogeneity). Compressed sensing is performed, and the 7 dominant wavelet coefficients are modeled.

[Figure: wavelet-coefficient predictions for the 2-D heat-equation example.]

Optimal Model-Error Placement

Collaborators: Layal Hakim, Guilhem Lacaze, Khachik Sargsyan, Habib Najm, Joe Oefelein (Sandia National Laboratories)

Objective: calibrate a simple chemical model against computations from a detailed kinetic model.

- A simple model with an embedded parameterization of model error using polynomial chaos expansions.
- Optimal placement of the model error is achieved via Bayesian model selection (Bayes factors).

[Figure: Bayes factors for candidate model-error placements.]

Summary

- Presented a framework for data-driven model selection using ARD prior pdfs.
- ARD priors enable the transformation of the model selection problem from the discrete model space to the continuous hyper-parameter space.
- They allow for parameter-space dimension reduction informed by noisy observations of the system.
- The ARD priors are able to remove superfluous parameters while having an insignificant effect on the posterior pdfs of the important parameters.

Applications:
- Nonlinear dynamical systems modeled using stochastic ordinary differential equations (ARD priors)
- Predictive surrogate modeling (AIC)
- Optimal model-error placement (Bayes factors)