Bayesian Inference in Astronomy & Astrophysics A Short Course

Bayesian Inference in Astronomy & Astrophysics: A Short Course
Tom Loredo, Dept. of Astronomy, Cornell University
p.1/37

Five Lectures
  Overview of Bayesian Inference
  From Gaussians to Periodograms
  Learning How To Count: Poisson Processes
  Miscellany: Frequentist Behavior, Experimental Design
  Why Try Bayesian Methods?
p.2/37

Overview of Bayesian Inference
  What to do
  What's different about it
  How to do it: tools for Bayesian calculation
p.3/37

What To Do: The Bayesian Recipe
Assess hypotheses by calculating their probabilities p(H_i | ...) conditional on known and/or presumed information, using the rules of probability theory.
But... what does p(H_i | ...) mean?
p.4/37

What is distributed in p(x)?
Frequentist: Probability describes randomness
  Venn, Boole, Fisher, Neyman, Pearson...
  x is a random variable if it takes different values throughout an infinite (imaginary?) ensemble of identical systems/experiments.
  p(x) describes how x is distributed throughout the ensemble.
  [Figure: a pdf P(x) sketched over x -- "x is distributed"; probability is frequency (the pdf is a histogram).]
p.5/37

Bayesian: Probability describes uncertainty
  Bernoulli, Laplace, Bayes, Gauss...
  p(x) describes how probability (plausibility) is distributed among the possible choices for x in the case at hand.
  Analog: a mass density, ρ(x)
  [Figure: a pdf P(x) sketched over x -- "p is distributed"; x has a single, uncertain value.]
  Relationships between probability and frequency were demonstrated mathematically (large-number theorems, Bayes's theorem).
p.6/37

Interpreting Abstract Probabilities
Symmetry/Invariance/Counting
  Resolve possibilities into equally plausible microstates using symmetries
  Count microstates in each possibility
Frequency from probability
  Bernoulli's laws of large numbers: in repeated trials, given P(success), predict
      N_success / N_total → P   as   N → ∞
p.7/37

Probability from frequency
  Bayes's "An Essay Towards Solving a Problem in the Doctrine of Chances"
  Bayes's theorem: probability from frequency!
p.8/37

Bayesian Probability: A Thermal Analogy

  Intuitive notion   Quantification     Calibration
  Hot, cold          Temperature, T     Cold as ice = 273 K; boiling hot = 373 K
  Uncertainty        Probability, P     Certainty = 0, 1; p = 1/36: as plausible as snake's eyes; p = 1/1024: as plausible as 10 heads
p.9/37

The Bayesian Recipe
Assess hypotheses by calculating their probabilities p(H_i | ...) conditional on known and/or presumed information, using the rules of probability theory.

Probability Theory Axioms ("grammar"):
  OR (sum rule):      P(H_1 + H_2 | I) = P(H_1 | I) + P(H_2 | I) − P(H_1, H_2 | I)
  AND (product rule): P(H_1, D | I) = P(H_1 | I) P(D | H_1, I) = P(D | I) P(H_1 | D, I)
p.10/37

Direct Probabilities ("vocabulary"):
  Certainty: If A is certainly true given B, P(A | B) = 1
  Falsity:   If A is certainly false given B, P(A | B) = 0
  Other rules exist for more complicated types of information; for example, invariance arguments, maximum (information) entropy, limit theorems (CLT; tying probabilities to frequencies), bold (or desperate!) presumption...
p.11/37

Three Important Theorems
Normalization: For exclusive, exhaustive H_i,
    Σ_i P(H_i) = 1
Bayes's Theorem:
    P(H_i | D, I) = P(H_i | I) P(D | H_i, I) / P(D | I)
    posterior ∝ prior × likelihood
p.12/37
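
As a concrete check of the normalization and Bayes's theorem statements above, here is a minimal Python sketch. The two-hypothesis setup and the numbers are my own illustration, not from the slides:

```python
import numpy as np

# Hypothetical two-hypothesis example (not from the slides):
# H1, H2 are exclusive and exhaustive, with priors P(H_i | I) and likelihoods P(D | H_i, I).
prior = np.array([0.5, 0.5])        # must sum to 1 (normalization)
likelihood = np.array([0.8, 0.3])   # P(D | H_i, I) for the observed data D

# Prior predictive ("average likelihood"): P(D | I) = sum_i P(H_i | I) P(D | H_i, I)
evidence = np.sum(prior * likelihood)

# Bayes's theorem: posterior = prior * likelihood / evidence
posterior = prior * likelihood / evidence

print("P(D | I)   =", evidence)          # 0.55
print("posteriors =", posterior)         # [0.727..., 0.272...]
print("normalized =", posterior.sum())   # 1.0
```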

Marginalization: Note that for exclusive, exhaustive {B_i},
    Σ_i P(A, B_i | I) = Σ_i P(B_i | A, I) P(A | I) = P(A | I)
                      = Σ_i P(B_i | I) P(A | B_i, I)
We can use {B_i} as a basis to get P(A | I).
Example: Take A = D, B_i = H_i; then
    P(D | I) = Σ_i P(D, H_i | I)
             = Σ_i P(H_i | I) P(D | H_i, I)
    prior predictive for D = average likelihood for H_i
p.13/37

Inference With Parametric Models
Parameter Estimation
  I = Model M with parameters θ (+ any add'l info)
  H_i = statements about θ; e.g. "θ ∈ [2.5, 3.5]", or "θ > 0"
  The probability for any such statement can be found using a probability density function (pdf) for θ:
      P(θ ∈ [θ, θ + dθ] | ...) = f(θ) dθ = p(θ | ...) dθ
p.14/37

Posterior probability density:
    p(θ | D, M) = p(θ | M) L(θ) / ∫ dθ p(θ | M) L(θ)
Summaries of posterior:
  Best-fit values: mode, posterior mean
  Uncertainties: credible regions (e.g., HPD regions)
  Marginal distributions:
    Interesting parameters ψ, nuisance parameters φ
    Marginal dist'n for ψ: p(ψ | D, M) = ∫ dφ p(ψ, φ | D, M)
    Generalizes "propagation of errors"
p.15/37
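
A minimal grid-based sketch of these summaries, using a toy problem of my own (Gaussian data with an interesting mean ψ and an unknown noise scale σ as the nuisance parameter, flat priors); none of this is from the slides. It normalizes the posterior on a grid, marginalizes over the nuisance parameter, and reads off the mode and posterior mean:

```python
import numpy as np

# Toy problem (illustrative only): data y_i = psi + noise, with the noise scale
# sigma unknown -- psi is interesting, sigma is a nuisance parameter.
rng = np.random.default_rng(1)
y = rng.normal(loc=1.0, scale=0.8, size=25)

psi = np.linspace(-1.0, 3.0, 401)        # interesting parameter
sigma = np.linspace(0.3, 3.0, 271)       # nuisance parameter (flat prior)
PSI, SIG = np.meshgrid(psi, sigma, indexing="ij")
dpsi, dsig = psi[1] - psi[0], sigma[1] - sigma[0]

# Log likelihood for independent Gaussian data
loglike = (-y.size * np.log(SIG)
           - 0.5 * np.sum((y[None, None, :] - PSI[..., None])**2, axis=-1) / SIG**2)

# Flat priors -> posterior proportional to likelihood; normalize on the grid
post = np.exp(loglike - loglike.max())
post /= post.sum() * dpsi * dsig

# Marginal distribution for psi: integrate the nuisance parameter out
marg_psi = post.sum(axis=1) * dsig

print("posterior mode of psi:", psi[np.argmax(marg_psi)])
print("posterior mean of psi:", np.sum(psi * marg_psi) * dpsi)
```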

Model Uncertainty: Model Comparison
  I = (M_1 + M_2 + ...)  -- specify a set of models.
  H_i = M_i              -- hypothesis chooses a model.
Posterior probability for a model:
    p(M_i | D, I) = p(M_i | I) p(D | M_i, I) / p(D | I)
                  ∝ p(M_i) L(M_i)
But L(M_i) = p(D | M_i) = ∫ dθ_i p(θ_i | M_i) p(D | θ_i, M_i).
    Likelihood for model = average likelihood for its parameters
    L(M_i) = ⟨L(θ_i)⟩
p.16/37
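
A hedged sketch of this "average likelihood" calculation, for a toy setup of my own (not from the slides): Gaussian data, with M_1 fixing the mean at zero and M_2 giving it a free mean under a flat prior. The two marginal likelihoods are compared by simple quadrature:

```python
import numpy as np

# Illustrative toy comparison: M1: y ~ N(0, 1);  M2: y ~ N(mu, 1), flat prior on mu over [-5, 5].
rng = np.random.default_rng(2)
y = rng.normal(loc=0.3, scale=1.0, size=30)

def loglike(mu):
    return -0.5 * np.sum((y - mu)**2) - 0.5 * y.size * np.log(2 * np.pi)

# L(M1): no free parameters
L_M1 = np.exp(loglike(0.0))

# L(M2) = integral of p(mu | M2) L(mu) dmu: the average likelihood over the prior
mu_grid = np.linspace(-5, 5, 2001)
dmu = mu_grid[1] - mu_grid[0]
prior = np.full(mu_grid.size, 1.0 / 10.0)            # flat prior of width 10
like = np.exp([loglike(m) for m in mu_grid])
L_M2 = np.sum(prior * like) * dmu

print("L(M1) =", L_M1)
print("L(M2) =", L_M2)
print("Bayes factor B12 =", L_M1 / L_M2)
```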

Model Uncertainty: Model Averaging
Models have a common subset of interesting parameters, ψ.
Each has a different set of nuisance parameters φ_i (or different prior info about them).
H_i = statements about ψ.
Calculate the posterior PDF for ψ:
    p(ψ | D, I) = Σ_i p(M_i | D, I) p(ψ | D, M_i)
                ∝ Σ_i L(M_i) ∫ dφ_i p(ψ, φ_i | D, M_i)
The model choice is itself a (discrete) nuisance parameter here.
p.17/37

An Automatic Occam's Razor
Predictive probabilities can favor simpler models:
    p(D | M_i) = ∫ dθ_i p(θ_i | M_i) L(θ_i)
[Figure: P(D | H) vs. D for a simple and a complicated model -- the simple model concentrates its predictive probability, the complicated one spreads it thin; at D_obs the simpler model can be favored.]
p.18/37

The Occam Factor:
[Figure: likelihood L(θ) of width δθ sitting inside a prior p(θ) of width Δθ.]
    p(D | M_i) = ∫ dθ_i p(θ_i | M_i) L(θ_i) ≈ p(θ̂_i | M_i) L(θ̂_i) δθ_i
               ≈ L(θ̂_i) δθ_i / Δθ_i
               = Maximum Likelihood × Occam Factor
Models with more parameters often make the data more probable for the best fit.
The Occam factor penalizes models for "wasted" volume of parameter space.
p.19/37
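
To see the δθ/Δθ penalty numerically, here is a small sketch under my own assumptions (the same hypothetical Gaussian-mean model as above, with a flat prior of width Δθ = 10; this is brute force, not the course's code). For a flat prior the evidence divided by the maximum likelihood is exactly the Occam factor:

```python
import numpy as np

# Hypothetical illustration of the Occam factor (not from the slides).
rng = np.random.default_rng(3)
y = rng.normal(loc=0.3, scale=1.0, size=30)

prior_width = 10.0                                   # Delta_theta
mu = np.linspace(-5.0, 5.0, 4001)
dmu = mu[1] - mu[0]
loglike = np.array([-0.5 * np.sum((y - m)**2) for m in mu])
like = np.exp(loglike - loglike.max())               # L(mu) / L(mu_hat)

evidence_over_Lmax = np.sum(like) * dmu / prior_width   # evidence / max likelihood

# Analytic width of the Gaussian likelihood in mu: sigma_mu = 1/sqrt(N), delta_theta ~ sqrt(2*pi)*sigma_mu
delta_theta = np.sqrt(2 * np.pi / y.size)

print("evidence / L(mu_hat)        :", evidence_over_Lmax)
print("Occam factor delta/Delta    :", delta_theta / prior_width)   # nearly equal
```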

What's the Difference?
Bayesian Inference (BI):
  Specify at least two competing hypotheses and priors
  Calculate their probabilities using probability theory
    Parameter estimation:
        p(θ | D, M) = p(θ | M) L(θ) / ∫ dθ p(θ | M) L(θ)
    Model comparison:
        O ∝ ∫ dθ_1 p(θ_1 | M_1) L(θ_1) / ∫ dθ_2 p(θ_2 | M_2) L(θ_2)
p.20/37

Frequentist Statistics (FS):
  Specify a null hypothesis H_0 such that rejecting it implies an interesting effect is present
  Specify a statistic S(D) that measures departure of the data from null expectations
  Calculate p(S | H_0) = ∫ dD p(D | H_0) δ[S − S(D)]  (e.g. by Monte Carlo simulation of data)
  Evaluate S(D_obs); decide whether to reject H_0 based on, e.g.,
      ∫_{S > S_obs} dS p(S | H_0)
p.21/37
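
A minimal sketch of the Monte Carlo route to p(S | H_0), with a toy setup of my own (not from the slides): H_0 says the data are zero-mean unit Gaussians, and S is the absolute sample mean:

```python
import numpy as np

# Hypothetical example: H0 = data are N(0, 1), N = 20; statistic S(D) = |sample mean|.
rng = np.random.default_rng(4)
N = 20

def S(data):
    return abs(data.mean())

D_obs = rng.normal(loc=0.4, scale=1.0, size=N)   # the "observed" data
S_obs = S(D_obs)

# Monte Carlo estimate of the tail probability P(S > S_obs | H0)
sims = rng.normal(size=(100_000, N))             # simulate many data sets under H0
S_sim = np.abs(sims.mean(axis=1))
p_value = np.mean(S_sim > S_obs)

print("S_obs   =", S_obs)
print("p-value =", p_value)
```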

Crucial Distinctions
The role of subjectivity:
  BI exchanges (implicit) subjectivity in the choice of null & statistic for (explicit) subjectivity in the specification of alternatives.
    Makes assumptions explicit
    Guides specification of further alternatives that generalize the analysis
    Automates identification of statistics:
      BI is a problem-solving approach
      FS is a solution-characterization approach
The types of mathematical calculations:
  BI requires integrals over hypothesis/parameter space
  FS requires integrals over sample/data space
p.22/37

Complexity of Statistical Integrals
Inference with independent data:
  Consider N data, D = {x_i}, and a model M with m parameters (m ≪ N).
  Suppose L(θ) = p(x_1 | θ) p(x_2 | θ) ··· p(x_N | θ).
Frequentist integrals:
    ∫ dx_1 p(x_1 | θ) ∫ dx_2 p(x_2 | θ) ··· ∫ dx_N p(x_N | θ) f(D)
  Seek integrals with properties independent of θ. Such rigorous frequentist integrals usually can't be found.
  Approximate (e.g., asymptotic) results are easy via Monte Carlo (due to independence).
p.23/37

Bayesian integrals:
    ∫ d^m θ g(θ) p(θ | M) L(θ)
  Such integrals are sometimes easy if analytic (especially in low dimensions).
  Asymptotic approximations require ingredients familiar from frequentist calculations.
  For large m (> 4 is often enough!) the integrals are often very challenging because of correlations (lack of independence) in parameter space.
p.24/37

How To Do It: Tools for Bayesian Calculation
  Asymptotic (large N) approximation: Laplace approximation
  Low-D models (m ≲ 10):
    Randomized Quadrature: quadrature + dithering
    Subregion-Adaptive Quadrature: ADAPT, DCUHRE, BAYESPACK
    Adaptive Monte Carlo: VEGAS, miser
  High-D models (m ~ 5 to 10^6): posterior sampling
    Rejection method
    Markov Chain Monte Carlo (MCMC)
p.25/37

Laplace Approximations
Suppose the posterior has a single dominant (interior) mode at θ̂, with m parameters:
    p(θ | M) L(θ) ≈ p(θ̂ | M) L(θ̂) exp[ −(1/2) (θ − θ̂)ᵀ I (θ − θ̂) ]
where
    I = −∂² ln[p(θ | M) L(θ)] / ∂θ ∂θ  evaluated at θ̂   (the information matrix)
p.26/37

Bayes Factors:
    ∫ dθ p(θ | M) L(θ) ≈ p(θ̂ | M) L(θ̂) (2π)^{m/2} |I|^{−1/2}
Marginals:
  Profile likelihood L_p(θ) ≡ max_φ L(θ, φ)
    p(θ | D, M) ∝ L_p(θ) |I(θ)|^{−1/2}
p.27/37
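
A hedged numerical sketch of the Laplace evidence formula above, for a one-parameter toy problem of my own (Gaussian data with unknown mean and a Gaussian prior); scipy's optimizer stands in for finding θ̂, and the information "matrix" is a scalar second derivative obtained by finite differences:

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Toy 1-D example (illustrative only): y ~ N(mu, 1), prior mu ~ N(0, 2^2).
rng = np.random.default_rng(5)
y = rng.normal(loc=0.7, scale=1.0, size=15)

def log_post_unnorm(mu):          # ln[ p(mu|M) L(mu) ]
    log_prior = -0.5 * (mu / 2.0)**2 - np.log(2.0 * np.sqrt(2 * np.pi))
    log_like = -0.5 * np.sum((y - mu)**2) - 0.5 * y.size * np.log(2 * np.pi)
    return log_prior + log_like

# Find the mode mu_hat
res = minimize_scalar(lambda mu: -log_post_unnorm(mu), bracket=(-3, 3))
mu_hat = res.x

# Information "matrix": I = -d^2/dmu^2 ln[p L] at the mode (finite differences)
h = 1e-4
I = -(log_post_unnorm(mu_hat + h) - 2 * log_post_unnorm(mu_hat)
      + log_post_unnorm(mu_hat - h)) / h**2

# Laplace: evidence ~ p(mu_hat|M) L(mu_hat) (2*pi)^{1/2} |I|^{-1/2}
log_evidence_laplace = log_post_unnorm(mu_hat) + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(I)

# Brute-force check by quadrature
mu_grid = np.linspace(-6, 6, 20001)
vals = np.exp([log_post_unnorm(m) for m in mu_grid])
log_evidence_grid = np.log(np.sum(vals) * (mu_grid[1] - mu_grid[0]))

print("Laplace ln evidence:", log_evidence_laplace)
print("grid    ln evidence:", log_evidence_grid)   # agree well (the toy is Gaussian)
```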

The Laplace approximation:
  Uses the same ingredients as common frequentist calculations
  Uses ratios → the approximation is often O(1/N)
  Using a unit info prior in an i.i.d. setting → the Schwarz criterion / Bayesian Information Criterion (BIC):
      ln B ≈ ln L(θ̂) − ln L(θ̂, φ̂) + (1/2)(m_2 − m_1) ln N
  A Bayesian counterpart to adjusting χ² for degrees of freedom, but it accounts for parameter-space volume.
p.28/37
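
A small worked sketch of the Schwarz/BIC formula, with invented numbers purely for illustration (a simple model with m_1 parameters versus an extended one with m_2):

```python
import numpy as np

# Invented numbers, only to exercise the formula above.
N = 100                 # number of data points
m1, m2 = 2, 4           # parameter counts of the simple and extended models
lnL1_max = -120.0       # max log likelihood of the simple model:   ln L(theta_hat)
lnL2_max = -117.5       # max log likelihood of the extended model: ln L(theta_hat, phi_hat)

lnB = lnL1_max - lnL2_max + 0.5 * (m2 - m1) * np.log(N)
print("ln B (simple vs extended) =", lnB)   # > 0: the better fit doesn't pay for 2 extra parameters
```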

Low-D (m ≲ 10): Quadrature & Monte Carlo
Quadrature/Cubature Rules:
    ∫ dθ f(θ) ≈ Σ_i w_i f(θ_i) + O(n^{−2}) or O(n^{−4})
  Smoothness → fast convergence in 1-D
  Curse of dimensionality: O(n^{−2/m}) or O(n^{−4/m}) in m dimensions
p.29/37

Monte Carlo Integration:
    ∫ dθ g(θ) p(θ) ≈ (1/n) Σ_{θ_i ∼ p(θ)} g(θ_i) + O(n^{−1/2})
  O(n^{−1}) with quasi-MC
  Ignores smoothness → poor performance in 1-D
  Avoids the curse: O(n^{−1/2}) regardless of dimension
  Practical problem: the multiplier is large (the variance of g) → hard if m > 6 (need a good importance sampler p)
p.30/37
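
A minimal sketch contrasting the two estimates for a 1-D integral whose answer is known (my own toy: the expectation of θ² under a standard normal, which is exactly 1); a simple grid sum stands in for quadrature:

```python
import numpy as np

# Toy check (not from the slides): integral of g(theta) p(theta) with p = N(0,1), g = theta^2.
rng = np.random.default_rng(6)

def g(theta):
    return theta**2

def p(theta):
    return np.exp(-0.5 * theta**2) / np.sqrt(2 * np.pi)

# Simple Riemann sum on a 1-D grid: fast convergence thanks to smoothness
theta = np.linspace(-8, 8, 2001)
dtheta = theta[1] - theta[0]
quad = np.sum(g(theta) * p(theta)) * dtheta

# Monte Carlo: average g over samples drawn from p; error shrinks as n^{-1/2}
samples = rng.standard_normal(100_000)
mc = g(samples).mean()
mc_err = g(samples).std(ddof=1) / np.sqrt(samples.size)

print("grid sum   :", quad)
print("Monte Carlo:", mc, "+/-", mc_err)   # both close to the exact answer, 1
```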

Randomized Quadrature:
  Quadrature rule + random dithering of the abscissas → get the benefits of both methods
  Most useful in settings resembling Gaussian quadrature
Subregion-Adaptive Quadrature/MC:
  Concentrate points where most of the probability lies, via recursion
  Adaptive quadrature: use a pair of lattice rules (for error estimation), subdivide regions with large error (ADAPT, DCUHRE, BAYESPACK by Genz et al.)
  Adaptive Monte Carlo: build the importance sampler on the fly (e.g., VEGAS, miser in Numerical Recipes)
p.31/37

Subregion-Adaptive Quadrature
Concentrate points where most of the probability lies, via recursion. Use a pair of lattice rules (for error estimation), subdivide regions with large error.
[Figure: ADAPT in action (galaxy polarizations).]
p.32/37

Posterior Sampling
General Approach: Draw samples of θ, φ from p(θ, φ | D, M); then:
  Integrals and moments are easily found via (1/n) Σ_i f(θ_i, φ_i)
  {θ_i} are samples from the marginal p(θ | D, M)
But how can we obtain {θ_i, φ_i}?
Rejection Method:
  [Figure: rejection sampling -- draw points uniformly under a comparison function that bounds P(θ); keep those falling under P(θ).]
  Hard to find an efficient comparison function if m > 6.
p.33/37
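
A hedged sketch of the rejection method in one dimension, with a toy target of my own (an unnormalized bimodal density, bounded by a flat comparison function); this is the generic algorithm, not code from the course:

```python
import numpy as np

# Toy 1-D rejection sampler (illustrative only).
rng = np.random.default_rng(7)

def target(theta):
    # Unnormalized bimodal "posterior"
    return np.exp(-0.5 * (theta - 1.5)**2) + 0.6 * np.exp(-0.5 * (theta + 1.5)**2 / 0.5**2)

lo, hi = -5.0, 5.0
M = 1.1                                   # bound: target(theta) <= M on [lo, hi]

n_prop = 200_000
theta_prop = rng.uniform(lo, hi, size=n_prop)    # draw from the flat comparison density
u = rng.uniform(0.0, M, size=n_prop)             # uniform height under the bound
samples = theta_prop[u < target(theta_prop)]     # keep points falling under the target curve

print("acceptance rate:", samples.size / n_prop)
print("sample mean    :", samples.mean())
```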

Markov Chain Monte Carlo (MCMC)
Let Λ(θ) = ln[ p(θ | M) p(D | θ, M) ]. Then
    p(θ | D, M) = e^{Λ(θ)} / Z,    Z = ∫ dθ e^{Λ(θ)}
Bayesian integration looks like problems addressed in computational statmech and Euclidean QFT.
Markov chain methods are standard: Metropolis; Metropolis-Hastings; molecular dynamics; hybrid Monte Carlo; simulated annealing; thermodynamic integration.
p.34/37

A Complicated Marginal Distribution
Nascent neutron star properties inferred from neutrino data from SN 1987A.
[Figure: two variables derived from the 9-dimensional posterior distribution.]
p.35/37

The MCMC Recipe:
Create a time series of samples θ_i from p(θ):
  Draw a candidate θ_{i+1} from a kernel T(θ_{i+1} | θ_i)
  Enforce detailed balance by accepting with probability α,
      α(θ_{i+1} | θ_i) = min[ 1, T(θ_i | θ_{i+1}) p(θ_{i+1}) / ( T(θ_{i+1} | θ_i) p(θ_i) ) ]
Choosing T to minimize "burn-in" and correlations is an art.
Coupled, parallel chains eliminate this for select problems ("exact sampling").
p.36/37
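
A minimal Metropolis-Hastings sketch of this recipe, under my own toy assumptions (a 2-D correlated Gaussian as the unnormalized target p(θ); a symmetric Gaussian kernel T, so the T-ratio in α cancels); this is a generic illustration, not the course's code:

```python
import numpy as np

# Toy Metropolis-Hastings sampler (illustrative only): 2-D correlated Gaussian target.
rng = np.random.default_rng(8)

cov = np.array([[1.0, 0.8],
                [0.8, 1.0]])
cov_inv = np.linalg.inv(cov)

def log_p(theta):
    # Unnormalized log target ("ln prior + ln likelihood"); Z is never needed.
    return -0.5 * theta @ cov_inv @ theta

n_steps, step = 50_000, 0.8
chain = np.empty((n_steps, 2))
theta = np.array([3.0, -3.0])             # deliberately poor starting point
log_p_theta = log_p(theta)
accepted = 0

for i in range(n_steps):
    # Symmetric proposal kernel T: the T-ratio in alpha cancels
    cand = theta + step * rng.standard_normal(2)
    log_p_cand = log_p(cand)
    # Accept with probability alpha = min(1, p(cand) / p(theta))
    if np.log(rng.uniform()) < log_p_cand - log_p_theta:
        theta, log_p_theta = cand, log_p_cand
        accepted += 1
    chain[i] = theta

burn = chain[5_000:]                      # discard burn-in
print("acceptance rate:", accepted / n_steps)
print("sample mean    :", burn.mean(axis=0))             # close to [0, 0]
print("sample cov     :\n", np.cov(burn, rowvar=False))  # close to cov
```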

Summary
Bayesian/frequentist differences:
  Probabilities for hypotheses vs. for data
  Problem solving vs. solution characterization
  Integrals: parameter space vs. sample space
Computational techniques for Bayesian inference:
  Large N: Laplace approximation
  Exact:
    Adaptive quadrature for low-D
    Posterior sampling for high-D
p.37/37