Bayesian rules of probability as principles of logic [Cox]

Notation: pr(x | I) is the probability (or pdf) of x being true given information I

1. Sum rule: if the set {x_i} is exhaustive and exclusive,
      ∑_i pr(x_i | I) = 1        ∫ dx pr(x | I) = 1
   This implies marginalization (integrating out or integrating in):
      pr(x | I) = ∑_j pr(x, y_j | I)        pr(x | I) = ∫ dy pr(x, y | I)

2. Product rule: expanding a joint probability of x and y,
      pr(x, y | I) = pr(x | y, I) pr(y | I) = pr(y | x, I) pr(x | I)
   If x and y are mutually independent, pr(x | y, I) = pr(x | I), then
      pr(x, y | I) = pr(x | I) pr(y | I)
   Rearranging the second equality yields Bayes' theorem:
      pr(x | y, I) = pr(y | x, I) pr(x | I) / pr(y | I)
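As a quick sanity check of these rules, here is a small numerical sketch (not from the slides) that verifies the sum, product, and Bayes rules on a made-up discrete joint distribution:

```python
import numpy as np

# Minimal numerical check of the sum, product, and Bayes rules on a
# small discrete joint distribution pr(x, y | I) (made-up numbers).
pr_xy = np.array([[0.10, 0.20],
                  [0.30, 0.40]])          # rows: x values, columns: y values

# Sum rule / marginalization: pr(x|I) = sum_j pr(x, y_j | I)
pr_x = pr_xy.sum(axis=1)
pr_y = pr_xy.sum(axis=0)
assert np.isclose(pr_xy.sum(), 1.0)       # exhaustive and exclusive

# Product rule: pr(x, y | I) = pr(x | y, I) pr(y | I)
pr_x_given_y = pr_xy / pr_y               # conditional pr(x | y, I)
assert np.allclose(pr_x_given_y * pr_y, pr_xy)

# Bayes' theorem: pr(x | y, I) = pr(y | x, I) pr(x | I) / pr(y | I)
pr_y_given_x = (pr_xy.T / pr_x).T
bayes = (pr_y_given_x.T * pr_x).T / pr_y
assert np.allclose(bayes, pr_x_given_y)
print("Sum, product, and Bayes rules verified on the toy table.")
```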

Applying Bayesian methods to LEC estimation

Definitions:
- a: the vector of LECs, i.e., the coefficients of an expansion (a_0, a_1, ...)
- D: measured data (e.g., cross sections)
- I: all background information (e.g., data errors, EFT details)

Bayes' theorem: how knowledge of a is updated:
      pr(a | D, I) = pr(D | a, I) pr(a | I) / pr(D | I)
       [posterior]    [likelihood]  [prior]    [evidence]

- Posterior: probability distribution for the LECs given the data
- Likelihood: probability of obtaining the data D given a set of LECs
- Prior: what we know about the LECs a priori
- Evidence: just a normalization factor here
  [Note: the evidence is important in model selection]

The posterior lets us find the most probable values of the parameters, or the probability that they fall in a specified range (a "credibility interval").

Limiting cases in applying Bayes' theorem

Suppose we are fitting a parameter H_0 to some data D given a model M_1 and some information I (e.g., about the data or the parameter). Bayes' theorem tells us how to find the posterior distribution of H_0:
      pr(H_0 | D, M_1, I) = pr(D | H_0, M_1, I) pr(H_0 | M_1, I) / pr(D | M_1, I)
[From P. Gregory, "Bayesian Logical Data Analysis for the Physical Sciences"]

Special cases:
(a) If the data are overwhelming, the prior has no effect on the posterior
(b) If the likelihood is unrestrictive, the posterior returns the prior
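A toy numerical illustration of the two limits (my own made-up one-parameter example, not from Gregory or the slides):

```python
import numpy as np

# Toy illustration of the two limiting cases on a grid (made-up numbers):
# a single parameter H0 with a Gaussian prior, measured with Gaussian noise.
h = np.linspace(-5.0, 5.0, 20001)                  # grid for H0
dh = h[1] - h[0]
prior = np.exp(-0.5 * h**2)                        # prior: mean 0, width 1
prior /= prior.sum() * dh

def posterior(data, noise):
    loglike = -0.5 * ((data[:, None] - h[None, :])**2).sum(axis=0) / noise**2
    post = prior * np.exp(loglike - loglike.max())
    return post / (post.sum() * dh)

def mean_std(post):
    m = (h * post).sum() * dh
    return m, np.sqrt(((h - m)**2 * post).sum() * dh)

rng = np.random.default_rng(0)

# (a) Overwhelming data: 200 precise measurements scattered around H0 = 2
post_a = posterior(2.0 + 0.2 * rng.standard_normal(200), noise=0.2)
print("overwhelming data  :", mean_std(post_a))    # mean ~ 2, tiny width: prior irrelevant

# (b) Unrestrictive likelihood: one measurement with a huge error bar
post_b = posterior(np.array([2.0]), noise=100.0)
print("weak likelihood    :", mean_std(post_b))    # mean ~ 0, width ~ 1: returns the prior
```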

Toy model for natural EFT [Schindler/Phillips, Ann. Phys. 324, 682 (2009)]

"Real world": g(x) = (1/2 + tan(πx/2))²
Model: g(x) ≈ 0.25 + 1.57 x + 2.47 x² + O(x³)  ⇒  a_true = {0.25, 1.57, 2.47, 1.29, ...}

Generate synthetic data D with 5% relative error:
      D: d_j = g_j (1 + 0.05 η_j), where g_j ≡ g(x_j),
      η_j is normally distributed random noise, and σ_j = 0.05 g_j

Pass 1: pr(a | D, I) ∝ pr(D | a, I) pr(a | I) with pr(a | I) = constant
      ⇒ pr(a | D, I) ∝ e^(−χ²/2),  where  χ² = ∑_{j=1}^{N} (1/σ_j²) [ d_j − ∑_{i=0}^{M} a_i x_j^i ]²

That is, if we assume no prior information about the LECs (uniform prior), the fitting procedure is the same as least squares!
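A minimal sketch of this Pass 1 procedure, assuming ten pseudo-data points on 0.03 < x < 0.32 and a particular random seed (both my own choices); the weighted least-squares solve below is exactly the χ² minimization described above:

```python
import numpy as np

# Minimal sketch of Pass 1 (uniform prior = ordinary weighted least squares).
# The number of pseudo-data points and the random seed are my own choices;
# the slides use pseudo-data on 0.03 < x < 0.32 with 5% relative errors.
rng = np.random.default_rng(1)

def g(x): return (0.5 + np.tan(np.pi * x / 2.0))**2      # the "real world"

x = np.linspace(0.03, 0.32, 10)
d = g(x) * (1.0 + 0.05 * rng.standard_normal(x.size))    # d_j = g_j (1 + 0.05 eta_j)
sigma = 0.05 * g(x)                                       # sigma_j = 0.05 g_j

for M in range(1, 6):
    # chi^2 minimization is a weighted linear least-squares problem in the a_i
    A = np.vander(x, M + 1, increasing=True) / sigma[:, None]
    a, *_ = np.linalg.lstsq(A, d / sigma, rcond=None)
    chi2 = np.sum((d / sigma - A @ a)**2)
    print(f"M={M}  chi2/dof={chi2 / (x.size - M - 1):5.2f}  a = {np.round(a[:3], 2)}")
```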

Toy model Pass 1: Uniform prior

Find the maximum of the posterior distribution; this is the same as fitting the coefficients with conventional χ² minimization. Pseudo-data: 0.03 < x < 0.32.

Pass 1 results:

   M      χ²/dof    a_0           a_1          a_2
   true             0.25          1.57         2.47
   1      2.24      0.203±0.01    2.55±0.11
   2      1.64      0.25±0.02     1.6±0.4      3.33±1.3
   3      1.85      0.27±0.04     0.95±1.1     8.16±8.1
   4      1.96      0.33±0.07     1.9±2.7      44.7±32.6
   5      1.39      0.57±0.3      14.8±6.9     276±117

- Results are highly unstable as the order M changes (e.g., see a_1)
- The errors become large and also unstable
- But χ²/dof is not bad! Check the plot...

Toy model Pass 1: Uniform prior

- Would we know the results were unstable if we didn't know the underlying model? Maybe some unusual structure at M = 3...
- Insufficient data ⇒ not high or low enough in x, or not enough points, or available data not precise enough (these issues are entangled!)
- We are determining parameters at finite order in x from data that has contributions from all orders

Toy model Pass 2: A prior for naturalness

Now add in our knowledge of the coefficients in the form of a prior:
      pr(a | M, R) = ∏_{i=0}^{M} (1 / (√(2π) R)) exp(−a_i² / (2R²))
R encodes the naturalness assumption, and M is the order of the expansion.

Same procedure: find the maximum of the posterior...

Results for R = 5: much more stable!

   M      a_0          a_1         a_2
   true   0.25         1.57        2.47
   2      0.25±0.02    1.63±0.4    3.2±1.3
   3      0.25±0.02    1.65±0.5    3±2.3
   4      0.25±0.02    1.64±0.5    3±2.4
   5      0.25±0.02    1.64±0.5    3±2.4

- What to choose for R? ⇒ marginalize over R (integrate).
- We used a Gaussian prior; where did this come from? ⇒ It is the maximum entropy distribution for ⟨∑_i a_i²⟩ = (M + 1)R²
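The maximum of this posterior is a ridge-penalized least-squares problem; a minimal sketch, repeating the Pass 1 data setup and assuming R = 5:

```python
import numpy as np

# Sketch of Pass 2: with the Gaussian naturalness prior, maximizing the
# posterior means minimizing chi^2 + sum_i a_i^2 / R^2 (twice the negative
# log posterior, up to constants). The data setup repeats the Pass 1 sketch.
rng = np.random.default_rng(1)
def g(x): return (0.5 + np.tan(np.pi * x / 2.0))**2
x = np.linspace(0.03, 0.32, 10)
d = g(x) * (1.0 + 0.05 * rng.standard_normal(x.size))
sigma = 0.05 * g(x)
R = 5.0

for M in range(2, 6):
    A = np.vander(x, M + 1, increasing=True) / sigma[:, None]
    # Augmented least squares: its solution is the MAP estimate under the
    # Gaussian prior of width R (normal equations A^T A + I/R^2).
    A_aug = np.vstack([A, np.eye(M + 1) / R])
    y_aug = np.concatenate([d / sigma, np.zeros(M + 1)])
    a_map, *_ = np.linalg.lstsq(A_aug, y_aug, rcond=None)
    print(f"M={M}  a = {np.round(a_map[:3], 2)}")   # should stay near (0.25, 1.57, 2.47)
```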

Aside: Maximum entropy to determine prior pdfs

Basic idea: the least biased pr(x) comes from maximizing the entropy
      S[pr(x)] = − ∫ dx pr(x) log[ pr(x) / m(x) ]
subject to constraints from the prior information; m(x) is an appropriate measure (often uniform).

One constraint is normalization, ∫ dx pr(x) = 1 ⇒ alone it leads to a uniform pr(x).

If the average variance is assumed to be ⟨∑_i a_i²⟩ = (M + 1)R² for fixed M and R ("ensemble naturalness"), maximize
      Q[pr(a | M, R)] = − ∫ da pr(a | M, R) log[ pr(a | M, R) / m(a) ]
                        + λ_0 [ 1 − ∫ da pr(a | M, R) ]
                        + λ_1 [ (M + 1)R² − ∫ da a² pr(a | M, R) ]

Then δQ / δpr(a | M, R) = 0 with m(a) = const. ⇒
      pr(a | M, R) = ∏_{i=0}^{M} (1 / (√(2π) R)) exp(−a_i² / (2R²))
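Filling in the intermediate variational step (a sketch of the standard calculation, not spelled out on the slide; here a² stands for ∑_i a_i²):

```latex
% Setting the functional derivative of Q to zero, with m(a) = const.
% (a^2 denotes \sum_i a_i^2):
\begin{align*}
  \frac{\delta Q}{\delta\,\mathrm{pr}(a \mid M,R)}
    &= -\log\frac{\mathrm{pr}(a \mid M,R)}{m(a)} - 1 - \lambda_0 - \lambda_1 a^2 = 0 \\
  \Longrightarrow\quad \mathrm{pr}(a \mid M,R)
    &\propto e^{-\lambda_1 a^2} = \prod_{i=0}^{M} e^{-\lambda_1 a_i^2}.
\end{align*}
% The multipliers are then fixed by the constraints: normalization supplies the
% prefactor, and <\sum_i a_i^2> = (M+1)R^2 gives \lambda_1 = 1/(2R^2), i.e.,
% each a_i is Gaussian with standard deviation R.
```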

Diagnostic tools 1: Triangle plots of posteriors from MCMC

Sample the posterior with an implementation of Markov chain Monte Carlo (MCMC) [note: MCMC is not actually needed for this example!]

[Figure panels: M = 3 (up to x³) with uniform prior; M = 3 (up to x³) with Gaussian prior, R = 5]

- With the uniform prior, the parameters play off each other
- With the naturalness prior there is much less correlation; note that a_2 and a_3 return the prior ⇒ no information from the data (but they are marginalized over)
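A sketch of producing such a triangle plot with the emcee and corner packages (one possible toolchain, not necessarily the one used for the slides); the data setup repeats the Pass 1 sketch, and the sampler settings are illustrative choices:

```python
import numpy as np
import emcee, corner
import matplotlib.pyplot as plt

# Sample the M = 3 posterior with the Gaussian naturalness prior (R = 5) and
# make a corner ("triangle") plot. As noted on the slide, MCMC is overkill
# for this linear problem; this just illustrates the workflow.
rng = np.random.default_rng(1)
def g(x): return (0.5 + np.tan(np.pi * x / 2.0))**2
x = np.linspace(0.03, 0.32, 10)
d = g(x) * (1.0 + 0.05 * rng.standard_normal(x.size))
sigma = 0.05 * g(x)

M, R = 3, 5.0
A = np.vander(x, M + 1, increasing=True)

def log_posterior(a):
    chi2 = np.sum(((d - A @ a) / sigma)**2)
    return -0.5 * chi2 - np.sum(a**2) / (2.0 * R**2)   # likelihood x Gaussian prior

ndim, nwalkers = M + 1, 32
p0 = 0.1 * rng.standard_normal((nwalkers, ndim))
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior)
sampler.run_mcmc(p0, 5000, progress=False)

samples = sampler.get_chain(discard=1000, flat=True)
corner.corner(samples, labels=[f"$a_{i}$" for i in range(ndim)],
              truths=[0.25, 1.57, 2.47, 1.29])
plt.show()
```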

Diagnostic tools 2: Variable x_max plots ⇒ change the fit range

Plot the a_i for M = 0, 1, 2, 3, 4, 5 as a function of the endpoint x_max of the fit data.
[Figure panels: uniform prior (left) vs. naturalness prior with R = 5 (right)]

For M = 0, g(x) = a_0 works only at the lowest x (otherwise the fit range is too large)
- Very small error (sharp posterior), but wrong!
- The prior is irrelevant given the a_0 values; we need to account for higher orders
- Bayesian solution: marginalize over higher M

For M = 1, g(x) = a_0 + a_1 x works only with the smallest x_max
- Errors (yellow band) come from sampling the posterior
- The prior is irrelevant given the a_i values; we need to account for higher orders
- Bayesian solution: marginalize over higher M

For M = 2, the entire fit range is usable
- Priors on a_1 and a_2 are important for the stability of a_1 with x_max
- For this problem, using higher M is the same as marginalization

For M = 3, the uniform-prior results are off the screen at lower x_max
- The prior gives the a_i stability with x_max ⇒ it accounts for higher orders not in the model
- For this problem, using higher M is the same as marginalization

For M = 4, the uniform prior has lost a_0 as well
- The prior gives the a_i stability with x_max
- For this problem, using higher M is the same as marginalization

For M = 5, the uniform prior has lost a_0 as well (the range is too large)
- The prior gives the a_i stability with x_max
- For this problem, using higher M is the same as marginalization

Diagnostic tools 3: How do you know what R to use?

Use the Gaussian naturalness prior, but let R vary over a large range.
[Figure: posteriors for the a_i as functions of R; each panel is annotated with the actual value]

- Error bands come from the posteriors (integrating over the other variables)
- Light dashed lines are the maximum likelihood (uniform prior) results
- Each a_i has a reasonable plateau from about R = 2 to 10 ⇒ marginalize!
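What "marginalize over R" can look like in practice, sketched under the assumption of a flat prior in log R on a broad grid (my own choice for illustration, not the prescription from the slides):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Sketch of marginalizing over the prior width R. For this linear model with
# Gaussian noise and prior a ~ N(0, R^2 I), both the evidence pr(D|R) and the
# conditional posterior mean of a are analytic. Data setup repeats the Pass 1
# sketch; M = 3, and the flat-in-log-R prior is an assumed choice.
rng = np.random.default_rng(1)
def g(x): return (0.5 + np.tan(np.pi * x / 2.0))**2
x = np.linspace(0.03, 0.32, 10)
d = g(x) * (1.0 + 0.05 * rng.standard_normal(x.size))
sigma = 0.05 * g(x)

M = 3
A = np.vander(x, M + 1, increasing=True)
noise_cov = np.diag(sigma**2)
R_grid = np.geomspace(0.5, 50.0, 200)               # log-spaced grid in R

log_evidence, post_means = [], []
for R in R_grid:
    cov_d = R**2 * (A @ A.T) + noise_cov            # evidence: D ~ N(0, cov_d)
    log_evidence.append(multivariate_normal(np.zeros(d.size), cov_d).logpdf(d))
    prec = A.T @ np.diag(1.0 / sigma**2) @ A + np.eye(M + 1) / R**2
    post_means.append(np.linalg.solve(prec, A.T @ (d / sigma**2)))

# Flat prior in log R on a log-spaced grid = equal prior weight per point;
# weight each R by its evidence and average the conditional posterior means.
w = np.exp(np.array(log_evidence) - max(log_evidence))
w /= w.sum()
print("R-marginalized a:", np.round((w[:, None] * np.array(post_means)).sum(axis=0), 2))
```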

Diagnostic tools 4: Error plots (à la Lepage)

- Plot the residuals (data minus predicted) from the truncated expansion
- The 5% relative data error is shown by bars on selected points
- Theory error dominates the data error for residuals above about 0.05
- The slope increases with the order ⇒ reflects the truncation ⇒ the EFT works!
- The different orders intersect at the breakdown scale
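A sketch of such an error plot for the toy model, built from the exact g(x) and the Taylor coefficients quoted on the slides rather than from fitted values:

```python
import numpy as np
import matplotlib.pyplot as plt

# Lepage-style error plot: relative residuals |g(x) - truncated expansion|/g(x)
# versus x, one curve per order, on a log-log scale. The coefficients through
# x^3 are the a_true quoted on the slides; the x range is my own choice.
a_true = [0.25, 1.57, 2.47, 1.29]

def g(x): return (0.5 + np.tan(np.pi * x / 2.0))**2

xs = np.geomspace(0.01, 0.6, 200)
for order in range(1, 4):
    truncated = sum(a_true[i] * xs**i for i in range(order + 1))
    plt.loglog(xs, np.abs(g(xs) - truncated) / g(xs), label=f"order x^{order}")

plt.axhline(0.05, ls="--", color="gray", label="5% data error")  # data-error scale
plt.xlabel("x"); plt.ylabel("relative residual"); plt.legend()
plt.show()
```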

How the Bayes way fixes issues in the model problem

- By marginalizing over higher-order terms, we are able to use all the data, without deciding where to cut it off; we find stability with respect to the expansion order and the amount of data
- The prior on naturalness suppresses overfitting by limiting how much different orders can play off each other
- Statistical and systematic uncertainties are naturally combined
- Diagnostic tools identify sensitivity to the prior, whether the EFT is working, the breakdown scale, theory vs. data error dominance, ...

[Figures, M = 3: a_0 and a_1 are determined by the data; a_2 is a combination; a_3 is undetermined]

Bayesian model selection

Determine the evidence for different models M_1 and M_2 via marginalization, i.e., by integrating over the possible sets of parameters a in the different models, with the same data D and information I.

The evidence ratio for two different models:
      pr(M_2 | D, I) / pr(M_1 | D, I) = [ pr(D | M_2, I) pr(M_2 | I) ] / [ pr(D | M_1, I) pr(M_1 | I) ]

The Bayes ratio implements Occam's razor:
      pr(D | M_2, I) / pr(D | M_1, I) = ∫ pr(D | a_2, M_2, I) pr(a_2 | M_2, I) da_2 / ∫ pr(D | a_1, M_1, I) pr(a_1 | M_1, I) da_1

Note these are integrations over all the parameters a_n; we are not comparing fits!
- If the model is too simple, the particular dataset is very unlikely
- If the model is too complex, the model has too much phase space
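For this linear model with Gaussian noise and a Gaussian prior on the coefficients, the evidence is available in closed form; a sketch, repeating the Pass 1 data setup and assuming the R = 5 prior:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Sketch of computing the evidence pr(D | M) for polynomial models of
# increasing order. With Gaussian noise and prior a ~ N(0, R^2 I), the data
# are marginally Gaussian, D ~ N(0, R^2 A_M A_M^T + diag(sigma^2)), so the
# evidence is a single multivariate-normal density. Data setup repeats the
# Pass 1 sketch; R = 5 is the assumed prior width.
rng = np.random.default_rng(1)
def g(x): return (0.5 + np.tan(np.pi * x / 2.0))**2
x = np.linspace(0.03, 0.32, 10)
d = g(x) * (1.0 + 0.05 * rng.standard_normal(x.size))
sigma = 0.05 * g(x)

R = 5.0
noise_cov = np.diag(sigma**2)
for M in range(0, 6):
    A = np.vander(x, M + 1, increasing=True)
    cov_d = R**2 * (A @ A.T) + noise_cov
    logZ = multivariate_normal(np.zeros(d.size), cov_d).logpdf(d)
    print(f"M={M}  log evidence = {logZ:8.2f}")
# The Bayes ratio between two orders M1, M2 is exp(logZ_M2 - logZ_M1).
```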

Bayesian model selection: polynomial fitting

[Figure: polynomial fits of degree 1 through 6, with the likelihood and the evidence plotted versus degree; adapted from Tom Minka, http://alumni.media.mit.edu/~tpminka/statlearn/demo/]

The likelihood considers only the single most probable curve, and it always increases with increasing degree. The evidence is a maximum at degree 3, the true degree!